Introduction

There’s often a reason to encode an arbitrary sequence of bytes as a sequence of symbols from an alphabet smaller than 256 characters. ASCII-based alphabets are popular, but the precise choice of alphabet often differs, depending on the constraints of each particular situation.

Having chosen an alphabet with $b$ characters, it’s tempting to say “now we just encode it in base $b$ ”. But a number of implementation details still remain, and if they aren’t precisely specified, different implementations might not be interoperable.

For example:

Are the input and encoded data to be treated as big-endian or little-endian numbers?
Is the input data to be treated as a single large integer, as in variants of base58, or broken into chunks, as in Ascii85?
How are leading or trailing 0x00 bytes to be encoded?

This document provides a way to concisely specify a relatively efficient scheme for encoding arbitrary sequences of bytes into sequences of symbols from any alphabet of between two and 256 characters (inclusive). It’s compatible with the unpadded base16, base32, and base64 encodings, and, to a certain extent, with Ascii85 (though the y and z shorthands aren’t supported). Finally, a base 26 alphabetic encoding is suggested, which can assist with the efficient encoding of data in valid URIs stored in Aztec codes, which allow for compact encoding of long single-case alphabetic sequences.

Conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

$\left\lfloor\cdot\right\rfloor$ denotes the floor function, $\left\lceil\cdot\right\rceil$ denotes the ceiling function, and $\lg$ denotes the base-two logarithm.

Design choices

For compatibility with the most popular existing encodings, the encoding scheme defined in this document uses the big-endian convention, treating earlier bytes or symbols as more significant digits. Each byte of input is treated in the usual way as an unsigned integer (between 0 and 255); a string of bytes $A_0, A_1, \ldots, A_{k-1}$ is treated as a string of bits $a_0, a_1, \ldots, a_{8k-1}$ , where the bits of $A_\iota$ are $a_{8\iota}, a_{8\iota+1}, \ldots, a_{8\iota+7}$ , arranged in order from the most significant bit — $a_{8\iota}$ — to the least significant — $a_{8\iota+7}$ .

Unless the chosen base is a power of two, encodings that treat the input (say, $a_0, a_1, \ldots, a_{8k-1}$ ) as a single large integer — $\sum_{0 \leq i < 8k}{2^{8k-i-1} a_i}$ — can’t determine the first output symbol until the entire input has been read (or at least its length is known), and the time-complexity of computing the output is quadratic in the length of the input. Therefore, such encodings are undesirable in cases where large streams of data are to be encoded.

An alternative is possible in which the input is treated as a fraction — $\sum_{0 \leq i < 8k}{2^{-(i+1)} a_i}$ ; in fact, the base16, base32, and base64 encodings can be viewed as this type of encoding system. This type of encoding allows the output stream to begin before the entire input has been read, but again, if the chosen base is not a power of two, then the average time required to compute the next output symbol grows as the encoding algorithm processes more and more input data.

Therefore, in order to be suitable for as many purposes as possible, this document adopts the practice of dividing the input into chunks of specific numbers of bits. Each full chunk is encoded into a fixed number of output symbols, and any final partial chunk is encoded into a number of symbols that allows the decoder to reconstruct the exact length of the original input; in this way, this scheme ensures that there’s no loss of information regarding leading or trailing 0x00 bytes.

Parameters

Given integers $n$ and $b$ satisfying $n \geq 1$ and $2 \leq b \leq 256$ , and distinct symbols $c_0, c_1, \ldots, c_{b-1}$ , this document defines the $n$ -bit chunky base $b$ encoding using the alphabet $c_0, c_1, \ldots, c_{b-1}$ . Each full chunk of $n$ bits of the input is encoded in a full chunk of $m$ symbols, where $m = \left\lceil \frac{n}{\lg{b}} \right\rceil$ .

Typically, a user of this specification will first choose an alphabet that satisfies their constraints; the size of this alphabet determines the parameter $b$ . They MAY also specify some lenience regarding which strings of symbols will be accepted as valid encoded strings. For example, if the alphabet itself contains no whitespace characters, they might specify that whitespace characters can be arbitrarily inserted into encoded strings, and that decoders are to ignore them; or, if the alphabet doesn’t contain both the upper and lower case forms of any character, they might specify that the encoding is case insensitive, and that decoders are to treat upper and lower case characters identically.

The parameter $n$ can be any positive integer, but for a given value of $b$ , some values of $n$ will be more suitable than others. Certainly, $n$ SHOULD be at least $\lg{b}$ , otherwise part of the chosen alphabet will remain entirely unused.

The encoding will be most space-efficient when $\frac{n}{\lg{b}}$ is not much less than an integer. When $b$ isn’t a power of 2, the continued fraction expansion of $\lg{b}$ can be useful in identifying an appropriate value of $n$ . Every second convergent will be a rational number that slightly underestimates $\lg{b}$ , with each successive convergent being a closer approximation; the numerators of these rational underestimates (when written in their simplest forms) are good candidates for $n$ . In tension with this space-efficiency consideration is the fact that smaller values of $n$ allow for more time-efficient implementations.

Encoding

If the string of input bytes is empty, it is encoded to the empty string of output characters. Otherwise, to encode a string of input bytes $A_0, A_1, \ldots, A_{k-1}$ , the corresponding sequence of input bits $a_0, a_1, \ldots, a_{8k-1}$ (as defined above) is divided into chunks $a_{jn}, a_{jn+1}, \ldots, a_{jn+n-1}$ of $n$ bits each, where $j$ ranges over the integers satisfying $0 \leq j < J$ , and $J = \left\lceil \frac{8k}{n} \right\rceil - 1$ . There’s also a final (possibly partial) chunk containing the bits $a_{J n}, a_{J n + 1}, \ldots, a_{8k-1}$ .

With $m$ defined as above, the $j$ th input chunk is encoded into the output chunk $c_{d_{jm}} c_{d_{jm+1}} \ldots c_{d_{jm+m-1}}$ , where $d_{jm}, d_{jm+1}, \ldots, d_{jm+m-1}$ are the unique integers between $0$ and $b-1$ such that

\sum_{0 \leq i < m} b^{m-1 - i} d_{jm + i} = \sum_{0 \leq i < n} 2^{n-1 - i} a_{jn + i}

Let $n' = 8k - J n$ and $m' = m - \left\lfloor \frac{n - n'}{\lg{b}} \right\rfloor$ . The final input chunk is encoded to the final output chunk $c_{d_{J m}} c_{d_{J m + 1}} \ldots c_{d_{J m + m' - 1}}$ , where $d_{J m}, d_{J m + 1}, \ldots, d_{J m + m' - 1}$ are the unique integers between $0$ and $b-1$ such that

\sum_{0 \leq i < m'} b^{m' - 1 - i} d_{J m + i} = \left\lfloor \frac{ \sum_{0 \leq i < n'} 2^{n-1 - i} a_{J n + i} }{ b^{m-m'} } \right\rfloor

The final output of the encoding function is $c_{d_0} c_{d_1} \ldots c_{d_{J m + m' - 1}}$ .

Decoding

The empty string of characters is decoded to the empty string of bytes. Otherwise, to decode a string of characters $c_{d_0} c_{d_1} \ldots c_{d_{p-1}}$ , it’s divided into chunks $c_{d_{jm}} c_{d_{jm+1}} \ldots c_{d_{jm+m-1}}$ of $m$ characters each, where $j$ ranges over the integers satisfying $0 \leq j < J$ , and $J = \left\lceil \frac{p}{m} \right\rceil - 1$ . There’s also a final (possibly partial) chunk containing the characters $c_{d_{Jm}} c_{d_{Jm+1}} \ldots c_{d_{p-1}}$ .

The $j$ th encoded chunk is decoded back into the chunk of bits $a_{jn}, a_{jn+1}, \ldots, a_{jn+n-1}$ by finding the unique such bits that satisfy

\sum_{0 \leq i < m} b^{m-1 - i} d_{jm + i} = \sum_{0 \leq i < n} 2^{n-1 - i} a_{jn + i}

If the left-hand side of this equation is greater than or equal to $2^n$ , then the string of characters isn’t a valid encoding, and MUST be rejected by the decoder.

Let $m' = p - J m$ , let $k = \left\lfloor \frac{ (J + 1) n - \lceil (m - m') \lg{b} \rceil }{ 8 } \right\rfloor$ , and let $n' = 8 k - J n$ . The final chunk of characters is decoded back into the final chunk of bits by finding the bits $a_{Jn}, a_{Jn+1}, \ldots, a_{Jn + n' - 1}$ that satisfy

\sum_{0 \leq i < n'} 2^{n' - 1 - i} a_{Jn + i} = \left\lfloor \frac{ \sum_{0 \leq i < m'} b^{m-1 - i} d_{J m + i} + b^{m-m'} - 1 }{ 2^{n-n'} } \right\rfloor

Again, if this is impossible (because the right-hand side is greater than or equal to $2^{n'}$ ), the string of characters isn’t a valid encoding, and MUST be rejected by the decoder. Additionally, the decoder MUST ensure that $m - m' = \left\lfloor \frac{n - n'}{\lg b} \right\rfloor$ and

\sum_{0 \leq i < m'} b^{m-1 - i} d_{J m + i} - 1 < \sum_{0 \leq i < n'} 2^{n-1 - i} a_{J n + 1}

If either of these is false, the string of characters wasn’t a valid encoding, and MUST be rejected by the decoder.

The final output of the decoding function is $A_0, A_1, \ldots, A_{k-1}$ , where the decoded bits have been recomposed into bytes.

Specific instances

Some existing encodings can be seen as instances of chunky base $b$ encodings. For example:

Base16 is case-insensitive 4-bit chunky base 16 using the alphabet 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.
Unpadded base32 is case-insensitive 5-bit chunky base 32 using the alphabet A, B, …, Z, 2, 3, …, 7.
Unpadded base64 is 6-bit chunky base 64 using the alphabet A, B, …, Z, a, b, …, z, 0, 1, …, 9, +, /.
Ascii85 (without the y or z shorthands) is 32-bit chunky base 85 using the alphabet consisting of the ASCII characters from 33 (!) to 117 (u), inclusive.

This document defines a new encoding called “airtameg”: Airtameg is 14-bit chunky base 26 using the alphabet a, b, …, z. Users of airtameg MAY specify that whitespace characters are allowed; they MAY also specify the use of either an upper case variant (using the alphabet A, B, …, Z) or a case-insensitive variant. However, airtameg allows no such lenience unless it’s explicitly specified as being permitted in a particular context.

Implementation notes

Although the above uses non-integers to specify certain things, such as the encoding and decoding of final chunks, implementations need not use rational or floating-point representations internally for these purposes. Indeed, naive use of floating point arithmetic could lead to incorrect results due to rounding errors when $n$ is sufficiently large.

Implementations could, for example, pad the last input chunk with trailing unset bits, encode it as an ordinary chunk, and discard the appropriate number of trailing characters from the result. Decoders could likewise pad the last chunk with trailing $c_{b - 1}$ characters (not $c_0$ characters), decode it as normal, and discard the correct number of trailing bits, having performed the necessary checks.

Examples

The following table presents some examples of valid and invalid encodings. The focus is on airtameg, since examples for the other encodings are readily available elsewhere. In particular, where a row is marked as having an “invalid” encoding, this indicates that the encoding in the airtameg column is unequivocally invalid, but the encodings in other columns might be acceptable in some variants of those encoding schemes.

Comment	Airtameg	Base16	Base32	Base64	Ascii85
Empty string
Invalid encoding: a shorter encoding would encode the same (empty) data	`a`	`0`	`A`	`A`	`!`
Valid encoding of a byte with all bits unset	`aa`	`00`	`AA`	`AA`	`!!`
Invalid encoding: would encode the same as above, but isn’t minimal	`ab`		`AD`	`AM`	`!-`
Valid encoding of a byte with all bits set	`yd`	`FF`	`74`	`/w`	`rr`
Invalid encoding: would overflow when decoding	`yg`				`sA`
Valid encoding	`gematriaa`	`41301FF548`	`IEYB75KI`	`QTAf9Ug`	`5qjDR8,`

Security considerations

Chunky base $b$ encodings do not encrypt their input data or provide any other cryptographic features. Therefore, encoded data MUST be treated with as much care as the unencoded data it represents. If the unencoded data is meant to be secret, then the encoded form MUST be kept just as secret, for example, and if the data to be encoded or decoded is from an untrusted source, any necessary protections against flooding-style denial-of-service attacks MUST be applied before the data is sent to the encoder or decoder.

In situations where a unique encoding is needed (for example, if encoded objects are to be identified by their cryptographic hashes), a chunky base $b$ encoding will preserve the uniqueness of the form of the data, provided that no lenience is allowed, such as case insensitivity or the insertion of whitespace. However, if an object already has multiple possible forms before being encoded by a chunky base $b$ encoding, it will still have multiple encoded forms.

In situations where timing attacks are a threat worth considering, users of chunky base $b$ SHOULD ensure that their encoding and decoding algorithms don’t, by the length of time they take to execute, reveal any information about the data they are encoding or decoding. Implementations that provide resistance to timing attacks MUST make clear the limits on such resistance. For example, “This implementation ensures that data of the same length will always be encoded (or decoded) using the same number of instructions, but longer data will take longer to process”, or “This implementation allows the user to specify an “apparent length” of data that’s being encoded (or decoded); as long as the actual length of data doesn’t exceed the apparent length, encoding (or decoding) data using the same apparent length will always be done using the same number of instructions”.