Introduction

There’s often a reason to encode an arbitrary sequence of bytes as a sequence of symbols from an alphabet smaller than 256 characters. ASCII-based alphabets are popular, but the precise choice of alphabet often differs, depending on the constraints of each particular situation.

Having chosen an alphabet with $b$ characters, it’s tempting to say “now we just encode it in base $b$”. But a number of implementation details still remain, and if they aren’t precisely specified, different implementations might not be interoperable.

For example:

  • Are the input and encoded data to be treated as big-endian or little-endian numbers?
  • Is the input data to be treated as a single large integer, as in variants of base58, or broken into chunks, as in Ascii85?
  • How are leading or trailing 0x00 bytes to be encoded?

This document provides a way to concisely specify a relatively efficient scheme for encoding arbitrary sequences of bytes into sequences of symbols from any alphabet of between two and 256 characters (inclusive). It’s compatible with the unpadded base16, base32, and base64 encodings, and, to a certain extent, with Ascii85 (though the y and z shorthands aren’t supported). Finally, a base 26 alphabetic encoding is suggested, which can assist with the efficient encoding of data in valid URIs stored in Aztec codes, which allow for compact encoding of long single-case alphabetic sequences.

Conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

$\left\lfloor\cdot\right\rfloor$ denotes the floor function, $\left\lceil\cdot\right\rceil$ denotes the ceiling function, and $\lg$ denotes the base-two logarithm.

Design choices

For compatibility with the most popular existing encodings, the encoding scheme defined in this document uses the big-endian convention, treating earlier bytes or symbols as more significant digits. Each byte of input is treated in the usual way as an unsigned integer (between 0 and 255); a string of bytes $A_0, A_1, \ldots, A_{k-1}$ is treated as a string of bits $a_0, a_1, \ldots, a_{8k-1}$, where the bits of $A_\iota$ are $a_{8\iota}, a_{8\iota+1}, \ldots, a_{8\iota+7}$, arranged in order from the most significant bit, $a_{8\iota}$, to the least significant, $a_{8\iota+7}$.
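As a minimal illustration of this bit ordering, the following Python sketch expands a byte string into the bits $a_0, a_1, \ldots, a_{8k-1}$. The function name `bytes_to_bits` is hypothetical and is not defined by this document.

```python
def bytes_to_bits(data: bytes) -> list[int]:
    """Return the bits a_0 .. a_{8k-1} of data, taking the most
    significant bit of each byte first (big-endian convention)."""
    return [(byte >> (7 - pos)) & 1 for byte in data for pos in range(8)]
```

For instance, the single byte 0x41 yields the bits 0, 1, 0, 0, 0, 0, 0, 1.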

Unless the chosen base is a power of two, encodings that treat the input (say, $a_0, a_1, \ldots, a_{8k-1}$) as a single large integer, $\sum_{0 \leq i < 8k}{2^{8k-i-1} a_i}$, can’t determine the first output symbol until the entire input has been read (or at least its length is known), and the time-complexity of computing the output is quadratic in the length of the input. Therefore, such encodings are undesirable in cases where large streams of data are to be encoded.

An alternative is possible in which the input is treated as a fraction, $\sum_{0 \leq i < 8k}{2^{-(i+1)} a_i}$; in fact, the base16, base32, and base64 encodings can be viewed as this type of encoding system. This type of encoding allows the output stream to begin before the entire input has been read, but again, if the chosen base is not a power of two, then the average time required to compute the next output symbol grows as the encoding algorithm processes more and more input data.

Therefore, in order to be suitable for as many purposes as possible, this document adopts the practice of dividing the input into chunks of specific numbers of bits. Each full chunk is encoded into a fixed number of output symbols, and any final partial chunk is encoded into a number of symbols that allows the decoder to reconstruct the exact length of the original input; in this way, this scheme ensures that there’s no loss of information regarding leading or trailing 0x00 bytes.

Parameters

Given integers $n$ and $b$ satisfying $n \geq 1$ and $2 \leq b \leq 256$, and distinct symbols $c_0, c_1, \ldots, c_{b-1}$, this document defines the $n$-bit chunky base $b$ encoding using the alphabet $c_0, c_1, \ldots, c_{b-1}$. Each full chunk of $n$ bits of the input is encoded in a full chunk of $m$ symbols, where $m = \left\lceil \frac{n}{\lg{b}} \right\rceil$.
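The value of $m$ can be found without logarithms, since $\left\lceil \frac{n}{\lg{b}} \right\rceil$ is the smallest $m$ for which $b^m \geq 2^n$. The following Python sketch shows one way to do this; the function name `symbols_per_chunk` is an illustrative assumption, not part of this specification.

```python
def symbols_per_chunk(n: int, b: int) -> int:
    """Return m = ceil(n / lg b), computed with integers only as the
    smallest m such that b**m >= 2**n."""
    if n < 1 or not 2 <= b <= 256:
        raise ValueError("need n >= 1 and 2 <= b <= 256")
    m = 0
    power = 1                      # invariant: power == b**m
    while power < (1 << n):
        power *= b
        m += 1
    return m
```

With this sketch, 14-bit chunks in base 26 need 3 symbols, and 32-bit chunks in base 85 need 5.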

Typically, a user of this specification will first choose an alphabet that satisfies their constraints; the size of this alphabet determines the parameter $b$. They MAY also specify some lenience regarding which strings of symbols will be accepted as valid encoded strings. For example, if the alphabet itself contains no whitespace characters, they might specify that whitespace characters can be arbitrarily inserted into encoded strings, and that decoders are to ignore them; or, if the alphabet doesn’t contain both the upper and lower case forms of any character, they might specify that the encoding is case insensitive, and that decoders are to treat upper and lower case characters identically.
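Such lenience can be applied as a normalization step before strict decoding. The Python sketch below is only an illustration: the function name `normalize_lenient` and its keyword parameters are hypothetical, and folding to lower case assumes an alphabet made of lower-case letters.

```python
def normalize_lenient(text: str, *, ignore_whitespace: bool = True,
                      fold_to_lower: bool = True) -> str:
    """Prepare a leniently written encoded string for a strict decoder."""
    if ignore_whitespace:
        # str.split() with no arguments splits on any run of whitespace.
        text = "".join(text.split())
    if fold_to_lower:
        # Only appropriate when the alphabet has no upper-case symbols.
        text = text.lower()
    return text
```

A strict decoder would then be applied to the normalized string, and would still reject any remaining symbol outside the alphabet.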

The parameter $n$ can be any positive integer, but for a given value of $b$, some values of $n$ will be more suitable than others. Certainly, $n$ SHOULD be at least $\lg{b}$, otherwise part of the chosen alphabet will remain entirely unused.

The encoding will be most space-efficient when $\frac{n}{\lg{b}}$ is not much less than an integer. When $b$ isn’t a power of 2, the continued fraction expansion of $\lg{b}$ can be useful in identifying an appropriate value of $n$. Every second convergent will be a rational number that slightly underestimates $\lg{b}$, with each successive convergent being a closer approximation; the numerators of these rational underestimates (when written in their simplest forms) are good candidates for $n$. In tension with this space-efficiency consideration is the fact that smaller values of $n$ allow for more time-efficient implementations.
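The search for candidate values of $n$ can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `candidate_chunk_sizes` and its parameters are hypothetical, and the continued fraction terms are taken from a floating-point approximation of $\lg{b}$, which is only trustworthy for the first few terms. Each candidate is therefore confirmed exactly, using the fact that $\frac{h}{k} < \lg{b}$ holds precisely when $2^h < b^k$.

```python
from math import log2

def candidate_chunk_sizes(b: int, count: int = 3, max_terms: int = 12) -> list[int]:
    """Numerators of the first `count` convergents of lg b that
    underestimate lg b (these are candidate values for n)."""
    x = log2(b)                    # float approximation of lg b
    h_prev, h = 1, int(x)          # convergent numerators
    k_prev, k = 0, 1               # convergent denominators
    found = []
    for index in range(max_terms):
        # Even-indexed convergents underestimate; confirm with exact integers.
        if index % 2 == 0 and (1 << h) < b ** k:
            found.append(h)
            if len(found) == count:
                break
        frac = x - int(x)
        if frac == 0:              # lg b is an integer: b is a power of two
            break
        x = 1 / frac
        a = int(x)                 # next continued fraction term
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
    return found
```

For base 26 this sketch yields 4, 14, and 47 (from the convergents 4/1, 14/3, and 47/10), and for base 85 its first two results are 6 and 32.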

Encoding

If the string of input bytes is empty, it is encoded to the empty string of output characters. Otherwise, to encode a string of input bytes $A_0, A_1, \ldots, A_{k-1}$, the corresponding sequence of input bits $a_0, a_1, \ldots, a_{8k-1}$ (as defined above) is divided into chunks $a_{jn}, a_{jn+1}, \ldots, a_{jn+n-1}$ of $n$ bits each, where $j$ ranges over the integers satisfying $0 \leq j < J$, and $J = \left\lceil \frac{8k}{n} \right\rceil - 1$. There’s also a final (possibly partial) chunk containing the bits $a_{J n}, a_{J n + 1}, \ldots, a_{8k-1}$.

With $m$ defined as above, the $j$th input chunk is encoded into the output chunk $c_{d_{jm}} c_{d_{jm+1}} \ldots c_{d_{jm+m-1}}$, where $d_{jm}, d_{jm+1}, \ldots, d_{jm+m-1}$ are the unique integers between $0$ and $b-1$ such that

$$\sum_{0 \leq i < m} b^{m-1 - i} d_{jm + i} = \sum_{0 \leq i < n} 2^{n-1 - i} a_{jn + i}$$

Let $n' = 8k - J n$ and $m' = m - \left\lfloor \frac{n - n'}{\lg{b}} \right\rfloor$. The final input chunk is encoded to the final output chunk $c_{d_{J m}} c_{d_{J m + 1}} \ldots c_{d_{J m + m' - 1}}$, where $d_{J m}, d_{J m + 1}, \ldots, d_{J m + m' - 1}$ are the unique integers between $0$ and $b-1$ such that

$$\sum_{0 \leq i < m'} b^{m' - 1 - i} d_{J m + i} = \left\lfloor \frac{ \sum_{0 \leq i < n'} 2^{n-1 - i} a_{J n + i} }{ b^{m-m'} } \right\rfloor$$

The final output of the encoding function is $c_{d_0} c_{d_1} \ldots c_{d_{J m + m' - 1}}$.
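The encoding procedure above can be sketched in Python as follows. This is a minimal, non-streaming illustration rather than a reference implementation: the function name `chunky_encode` and its signature are assumptions, and for simplicity the sketch builds the whole bit string in memory. It uses integer arithmetic only, treating a full chunk as the case $n' = n$.

```python
def chunky_encode(data: bytes, n: int, alphabet: str) -> str:
    """Encode data with the n-bit chunky base b encoding, b = len(alphabet)."""
    b = len(alphabet)
    if n < 1 or not 2 <= b <= 256 or len(set(alphabet)) != b:
        raise ValueError("need n >= 1 and between 2 and 256 distinct symbols")
    # m = ceil(n / lg b), found as the smallest m with b**m >= 2**n.
    m = 0
    while b ** m < 2 ** n:
        m += 1
    bits = "".join(f"{byte:08b}" for byte in data)    # most significant bit first
    out = []
    for start in range(0, len(bits), n):
        chunk = bits[start:start + n]
        missing = n - len(chunk)                      # n - n'; zero for full chunks
        # floor((n - n') / lg b): the largest t with b**t <= 2**missing.
        drop = 0
        while b ** (drop + 1) <= 2 ** missing:
            drop += 1
        # Pad with trailing unset bits, then discard the last `drop` digits.
        value = int(chunk.ljust(n, "0"), 2) // b ** drop
        digits = []
        for _ in range(m - drop):                     # m' digits, least significant first
            value, d = divmod(value, b)
            digits.append(alphabet[d])
        out.append("".join(reversed(digits)))
    return "".join(out)
```

For example, `chunky_encode(bytes.fromhex("41301FF548"), 14, "abcdefghijklmnopqrstuvwxyz")` returns `"gematriaa"`, and a single byte with all bits set encodes to `"yd"`.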

Decoding

The empty string of characters is decoded to the empty string of bytes. Otherwise, to decode a string of characters $c_{d_0} c_{d_1} \ldots c_{d_{p-1}}$, it’s divided into chunks $c_{d_{jm}} c_{d_{jm+1}} \ldots c_{d_{jm+m-1}}$ of $m$ characters each, where $j$ ranges over the integers satisfying $0 \leq j < J$, and $J = \left\lceil \frac{p}{m} \right\rceil - 1$. There’s also a final (possibly partial) chunk containing the characters $c_{d_{Jm}} c_{d_{Jm+1}} \ldots c_{d_{p-1}}$.

The $j$th encoded chunk is decoded back into the chunk of bits $a_{jn}, a_{jn+1}, \ldots, a_{jn+n-1}$ by finding the unique such bits that satisfy

$$\sum_{0 \leq i < m} b^{m-1 - i} d_{jm + i} = \sum_{0 \leq i < n} 2^{n-1 - i} a_{jn + i}$$

If the left-hand side of this equation is greater than or equal to $2^n$, then the string of characters isn’t a valid encoding, and MUST be rejected by the decoder.

Let $m' = p - J m$, let $k = \left\lfloor \frac{ (J + 1) n - \lceil (m - m') \lg{b} \rceil }{ 8 } \right\rfloor$, and let $n' = 8 k - J n$. The final chunk of characters is decoded back into the final chunk of bits by finding the bits $a_{Jn}, a_{Jn+1}, \ldots, a_{Jn + n' - 1}$ that satisfy

$$\sum_{0 \leq i < n'} 2^{n' - 1 - i} a_{Jn + i} = \left\lfloor \frac{ \sum_{0 \leq i < m'} b^{m-1 - i} d_{J m + i} + b^{m-m'} - 1 }{ 2^{n-n'} } \right\rfloor$$

Again, if this is impossible (because the right-hand side is greater than or equal to $2^{n'}$), the string of characters isn’t a valid encoding, and MUST be rejected by the decoder. Additionally, the decoder MUST ensure that $m - m' = \left\lfloor \frac{n - n'}{\lg b} \right\rfloor$ and

$$\sum_{0 \leq i < m'} b^{m-1 - i} d_{J m + i} - 1 < \sum_{0 \leq i < n'} 2^{n-1 - i} a_{J n + i}$$

If either of these is false, the string of characters wasn’t a valid encoding, and MUST be rejected by the decoder.

The final output of the decoding function is $A_0, A_1, \ldots, A_{k-1}$, where the decoded bits have been recomposed into bytes.
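The decoding procedure above, including its validity checks, can be sketched in Python as follows. The function name `chunky_decode` and its signature are illustrative assumptions, and the sketch accumulates all decoded bits in one integer rather than streaming. It also assumes that a final chunk with $n' < 1$ is to be rejected; this is one reading of the examples table, in which a lone `a` is an invalid airtameg string.

```python
def chunky_decode(text: str, n: int, alphabet: str) -> bytes:
    """Decode an n-bit chunky base b string; raise ValueError if invalid."""
    b = len(alphabet)
    index = {symbol: d for d, symbol in enumerate(alphabet)}
    if n < 1 or not 2 <= b <= 256 or len(index) != b:
        raise ValueError("need n >= 1 and between 2 and 256 distinct symbols")
    if not text:
        return bytes()
    try:
        digits = [index[symbol] for symbol in text]
    except KeyError as exc:
        raise ValueError("symbol outside the alphabet") from exc
    m = 0
    while b ** m < 2 ** n:                        # m = ceil(n / lg b)
        m += 1
    p = len(digits)
    J = -(-p // m) - 1                            # ceil(p / m) - 1
    bits = 0                                      # all decoded bits so far
    for j in range(J):
        value = 0
        for d in digits[j * m:(j + 1) * m]:
            value = value * b + d
        if value >= 2 ** n:
            raise ValueError("chunk overflows n bits")
        bits = (bits << n) | value
    m_last = p - J * m                            # m'
    pad = b ** (m - m_last)                       # b**(m - m')
    # ceil((m - m') lg b) is the bit length of pad - 1.
    k = ((J + 1) * n - (pad - 1).bit_length()) // 8
    n_last = 8 * k - J * n                        # n'
    if n_last < 1:
        raise ValueError("final chunk carries no bits")   # assumption, see lead-in
    drop = 0                                      # floor((n - n') / lg b)
    while b ** (drop + 1) <= 2 ** (n - n_last):
        drop += 1
    if drop != m - m_last:
        raise ValueError("final chunk has the wrong length")
    scaled = 0
    for d in digits[J * m:]:
        scaled = scaled * b + d
    scaled *= pad                                 # sum of b**(m-1-i) * d over the chunk
    value = (scaled + pad - 1) >> (n - n_last)    # as if padded with c_{b-1}
    if value >= 2 ** n_last:
        raise ValueError("final chunk overflows")
    if scaled > value << (n - n_last):
        raise ValueError("final chunk is not minimal")
    bits = (bits << n_last) | value
    return bits.to_bytes(k, "big")
```

With the airtameg parameters, this sketch decodes `"gematriaa"` to the bytes 41 30 1F F5 48 and rejects `"a"`, `"ab"`, and `"yg"`.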

Specific instances

Some existing encodings can be seen as instances of chunky base $b$ encodings. For example:

  • Base16 is case-insensitive 4-bit chunky base 16 using the alphabet 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.
  • Unpadded base32 is case-insensitive 5-bit chunky base 32 using the alphabet A, B, …, Z, 2, 3, …, 7.
  • Unpadded base64 is 6-bit chunky base 64 using the alphabet A, B, …, Z, a, b, …, z, 0, 1, …, 9, +, /.
  • Ascii85 (without the y or z shorthands) is 32-bit chunky base 85 using the alphabet consisting of the ASCII characters from 33 (!) to 117 (u), inclusive.

This document defines a new encoding called “airtameg”: Airtameg is 14-bit chunky base 26 using the alphabet a, b, …, z. Users of airtameg MAY specify that whitespace characters are allowed; they MAY also specify the use of either an upper case variant (using the alphabet A, B, …, Z) or a case-insensitive variant. However, airtameg allows no such lenience unless it’s explicitly specified as being permitted in a particular context.

Implementation notes

Although the above uses non-integers to specify certain things, such as the encoding and decoding of final chunks, implementations need not use rational or floating-point representations internally for these purposes. Indeed, naive use of floating point arithmetic could lead to incorrect results due to rounding errors when nn is sufficiently large.
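As one possible illustration, the two non-integer expressions used for final chunks can be evaluated with exact integer arithmetic. The helper names `floor_div_lg` and `ceil_mul_lg` in this Python sketch are hypothetical and not part of this specification.

```python
def floor_div_lg(x: int, b: int) -> int:
    """floor(x / lg b) for x >= 0: the largest t with b**t <= 2**x."""
    t = 0
    while b ** (t + 1) <= (1 << x):
        t += 1
    return t

def ceil_mul_lg(t: int, b: int) -> int:
    """ceil(t * lg b) for t >= 0: the smallest s with 2**s >= b**t,
    which equals the bit length of b**t - 1."""
    return (b ** t - 1).bit_length()
```

Here `floor_div_lg(n - n_last, b)` plays the role of $\left\lfloor \frac{n - n'}{\lg{b}} \right\rfloor$, and `ceil_mul_lg(m - m_last, b)` plays the role of $\lceil (m - m') \lg{b} \rceil$, where `n_last` and `m_last` stand for $n'$ and $m'$.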

Implementations could, for example, pad the last input chunk with trailing unset bits, encode it as an ordinary chunk, and discard the appropriate number of trailing characters from the result. Decoders could likewise pad the last chunk with trailing $c_{b - 1}$ characters (not $c_0$ characters), decode it as normal, and discard the correct number of trailing bits, having performed the necessary checks.

Examples

The following table presents some examples of valid and invalid encodings. The focus is on airtameg, since examples for the other encodings are readily available elsewhere. In particular, where a row is marked as having an “invalid” encoding, this indicates that the encoding in the airtameg column is unequivocally invalid, but the encodings in other columns might be acceptable in some variants of those encoding schemes.

| Comment | Airtameg | Base16 | Base32 | Base64 | Ascii85 |
|---|---|---|---|---|---|
| Empty string | | | | | |
| Invalid encoding: a shorter encoding would encode the same (empty) data | a | 0 | A | A | ! |
| Valid encoding of a byte with all bits unset | aa | 00 | AA | AA | !! |
| Invalid encoding: would encode the same as above, but isn’t minimal | ab | | AD | AM | !- |
| Valid encoding of a byte with all bits set | yd | FF | 74 | /w | rr |
| Invalid encoding: would overflow when decoding | yg | | | | sA |
| Valid encoding | gematriaa | 41301FF548 | IEYB75KI | QTAf9Ug | 5qjDR8, |

Security considerations

Chunky base $b$ encodings do not encrypt their input data or provide any other cryptographic features. Therefore, encoded data MUST be treated with as much care as the unencoded data it represents. For example, if the unencoded data is meant to be secret, then the encoded form MUST be kept just as secret. Likewise, if the data to be encoded or decoded is from an untrusted source, any necessary protections against flooding-style denial-of-service attacks MUST be applied before the data is sent to the encoder or decoder.

In situations where a unique encoding is needed (for example, if encoded objects are to be identified by their cryptographic hashes), a chunky base $b$ encoding will preserve the uniqueness of the form of the data, provided that no lenience is allowed, such as case insensitivity or the insertion of whitespace. However, if an object already has multiple possible forms before being encoded by a chunky base $b$ encoding, it will still have multiple encoded forms.

In situations where timing attacks are a threat worth considering, users of chunky base $b$ SHOULD ensure that their encoding and decoding algorithms don’t, by the length of time they take to execute, reveal any information about the data they are encoding or decoding. Implementations that provide resistance to timing attacks MUST make clear the limits on such resistance. For example, “This implementation ensures that data of the same length will always be encoded (or decoded) using the same number of instructions, but longer data will take longer to process”, or “This implementation allows the user to specify an ‘apparent length’ of data that’s being encoded (or decoded); as long as the actual length of data doesn’t exceed the apparent length, encoding (or decoding) data using the same apparent length will always be done using the same number of instructions”.