There’s often a reason to encode an arbitrary sequence of bytes as a sequence of symbols from an alphabet smaller than 256 characters. ASCII-based alphabets are popular, but the precise choice of alphabet often differs, depending on the constraints of each particular situation.
Having chosen an alphabet with characters, it’s tempting to say “now we just encode it in base ”. But a number of implementation details still remain, and if they aren’t precisely specified, different implementations might not be interoperable.
- Are the input and encoded data to be treated as big-endian or little-endian numbers?
- Is the input data to be treated as a single large integer, as in variants of base58, or broken into chunks, as in Ascii85?
- How are leading or trailing
0x00bytes to be encoded?
This document provides a way to concisely specify
a relatively efficient scheme for
encoding arbitrary sequences of bytes
into sequences of symbols from
any alphabet of between two and 256 characters (inclusive).
It’s compatible with the unpadded
base16, base32, and base64 encodings,
and, to a certain extent, with Ascii85
z shorthands aren’t supported).
Finally, a base 26 alphabetic encoding is suggested,
which can assist with the efficient encoding of data
in valid URIs stored in Aztec codes,
which allow for compact encoding of
long single-case alphabetic sequences.
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
denotes the floor function, denotes the ceiling function, and denotes the base-two logarithm.
For compatibility with the most popular existing encodings, the encoding scheme defined in this document uses the big-endian convention, treating earlier bytes or symbols as more significant digits. Each byte of input is treated in the usual way as an unsigned integer (between 0 and 255); a string of bytes is treated as a string of bits , where the bits of are , arranged in order from the most significant bit — — to the least significant — .
Unless the chosen base is a power of two, encodings that treat the input (say, ) as a single large integer — — can’t determine the first output symbol until the entire input has been read (or at least its length is known), and the time-complexity of computing the output is quadratic in the length of the input. Therefore, such encodings are undesirable in cases where large streams of data are to be encoded.
An alternative is possible in which the input is treated as a fraction — ; in fact, the base16, base32, and base64 encodings can be viewed as this type of encoding system. This type of encoding allows the output stream to begin before the entire input has been read, but again, if the chosen base is not a power of two, then the average time required to compute the next output symbol grows as the encoding algorithm processes more and more input data.
Therefore, in order to be suitable for as many purposes as possible,
this document adopts the practice of
dividing the input into chunks of specific numbers of bits.
Each full chunk is encoded into a fixed number of output symbols,
and any final partial chunk is encoded into a number of symbols that
allows the decoder to reconstruct the exact length of the original input;
in this way, this scheme ensures that there’s
no loss of information regarding leading or trailing
Given integers and satisfying and , and distinct symbols , this document defines the -bit chunky base encoding using the alphabet . Each full chunk of bits of the input is encoded in a full chunk of symbols, where .
Typically, a user of this specification will first choose an alphabet that satisfies their constraints; the size of this alphabet determines the parameter . They MAY also specify some lenience regarding which strings of symbols will be accepted as valid encoded strings. For example, if the alphabet itself contains no whitespace characters, they might specify that whitespace characters can be arbitrarily inserted into encoded strings, and that decoders are to ignore them; or, if the alphabet doesn’t contain both the upper and lower case forms of any character, they might specify that the encoding is case insensitive, and that decoders are to treat upper and lower case characters identically.
The parameter can be any positive integer, but for a given value of , some values of will be more suitable than others. Certainly, SHOULD be at least , otherwise part of the chosen alphabet will remain entirely unused.
The encoding will be most space-efficient when is not much less than an integer. When isn’t a power of 2, the continued fraction expansion of can be useful in identifying an appropriate value of . Every second convergent will be a rational number that slightly underestimates , with each successive convergent being a closer approximation; the numerators of these rational underestimates (when written in their simplest forms) are good candidates for . In tension with this space-efficiency consideration is the fact that smaller values of allow for more time-efficient implementations.
If the string of input bytes is empty, it is encoded to the empty string of output characters. Otherwise, to encode a string of input bytes , the corresponding sequence of input bits (as defined above) is divided into chunks of bits each, where ranges over the integers satisfying , and . There’s also a final (possibly partial) chunk containing the bits .
With defined as above, the th input chunk is encoded into the output chunk , where are the unique integers between and such that
Let and . The final input chunk is encoded to the final output chunk , where are the unique integers between and such that
The final output of the encoding function is .
The empty string of characters is decoded to the empty string of bytes. Otherwise, to decode a string of characters , it’s divided into chunks of characters each, where ranges over the integers satisfying , and . There’s also a final (possibly partial) chunk containing the characters .
The th encoded chunk is decoded back into the chunk of bits by finding the unique such bits that satisfy
If the left-hand side of this equation is greater than or equal to , then the string of characters isn’t a valid encoding, and MUST be rejected by the decoder.
Let , let , and let . The final chunk of characters is decoded back into the final chunk of bits by finding the bits that satisfy
Again, if this is impossible (because the right-hand side is greater than or equal to ), the string of characters isn’t a valid encoding, and MUST be rejected by the decoder. Additionally, the decoder MUST ensure that and
If either of these is false, the string of characters wasn’t a valid encoding, and MUST be rejected by the decoder.
The final output of the decoding function is , where the decoded bits have been recomposed into bytes.
Some existing encodings can be seen as instances of chunky base encodings. For example:
- Base16 is case-insensitive 4-bit chunky base 16 using the alphabet
- Unpadded base32 is case-insensitive 5-bit chunky base 32 using the
- Unpadded base64 is 6-bit chunky base 64 using the alphabet
- Ascii85 (without the
zshorthands) is 32-bit chunky base 85 using the alphabet consisting of the ASCII characters from 33 (
!) to 117 (
This document defines a new encoding called “airtameg”: Airtameg is
14-bit chunky base 26 using the alphabet
z. Users of
airtameg MAY specify that whitespace characters are allowed; they MAY
also specify the use of either an upper case variant (using the
Z) or a case-insensitive variant. However,
airtameg allows no such lenience unless it’s explicitly specified as
being permitted in a particular context.
Although the above uses non-integers to specify certain things, such as the encoding and decoding of final chunks, implementations need not use rational or floating-point representations internally for these purposes. Indeed, naive use of floating point arithmetic could lead to incorrect results due to rounding errors when is sufficiently large.
Implementations could, for example, pad the last input chunk with trailing unset bits, encode it as an ordinary chunk, and discard the appropriate number of trailing characters from the result. Decoders could likewise pad the last chunk with trailing characters (not characters), decode it as normal, and discard the correct number of trailing bits, having performed the necessary checks.
The following table presents some examples of valid and invalid encodings. The focus is on airtameg, since examples for the other encodings are readily available elsewhere. In particular, where a row is marked as having an “invalid” encoding, this indicates that the encoding in the airtameg column is unequivocally invalid, but the encodings in other columns might be acceptable in some variants of those encoding schemes.
|Invalid encoding: a shorter encoding would encode the same (empty) data||
|Valid encoding of a byte with all bits unset||
|Invalid encoding: would encode the same as above, but isn’t minimal||
|Valid encoding of a byte with all bits set||
|Invalid encoding: would overflow when decoding||
Chunky base encodings do not encrypt their input data or provide any other cryptographic features. Therefore, encoded data MUST be treated with as much care as the unencoded data it represents. If the unencoded data is meant to be secret, then the encoded form MUST be kept just as secret, for example, and if the data to be encoded or decoded is from an untrusted source, any necessary protections against flooding-style denial-of-service attacks MUST be applied before the data is sent to the encoder or decoder.
In situations where a unique encoding is needed (for example, if encoded objects are to be identified by their cryptographic hashes), a chunky base encoding will preserve the uniqueness of the form of the data, provided that no lenience is allowed, such as case insensitivity or the insertion of whitespace. However, if an object already has multiple possible forms before being encoded by a chunky base encoding, it will still have multiple encoded forms.
In situations where timing attacks are a threat worth considering, users of chunky base SHOULD ensure that their encoding and decoding algorithms don’t, by the length of time they take to execute, reveal any information about the data they are encoding or decoding. Implementations that provide resistance to timing attacks MUST make clear the limits on such resistance. For example, “This implementation ensures that data of the same length will always be encoded (or decoded) using the same number of instructions, but longer data will take longer to process”, or “This implementation allows the user to specify an “apparent length” of data that’s being encoded (or decoded); as long as the actual length of data doesn’t exceed the apparent length, encoding (or decoding) data using the same apparent length will always be done using the same number of instructions”.