Introduction
A very popular way to encode binary data is Base64. The basis of this is an encoding table. As you might expect, there are 64 total characters that go into the tale. There are multiple implementations of base64 with slight differences. They are all the same except for the last two characters and line ending requirements. The first 62 characters are always “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789”. PEM and MIME encoding are the most common and use “+/” as the last two characters. PEM and MIME may use the same characters but they have different maximum line lengths. I’m going to implement PEM/MINE but I’m not going to implement new line support.
Properties
Size
In Base64 encoding, 3 binary bytes are represented as 4 characters. This gives us a 4:3 ratio, meaning there is 33% overhead for base64. To get the number of base64 characters needed for a given binary data blob, take the length of input and round up to the nearest multiple of 3. Then, divide by 3 to get the number of 3 byte blocks. Multiply by 4 to get the number of base64 characters.
The date for each set of 3 binary bytes is spread over 4 characters giving us 6 bits per character. 3 bytes total 24 bits (3*8=24). There 4 characters for the 24 bits (24/4=6). Think of it this way 3/4 (putting 3 bytes into 4) means the data is split 75% from each byte to span 4 bytes. Thus, 6 bits utilized per character.
Padding
Base64 works in blocks of 3 bytes. Implementations such as PEM require padding
using =
character. Padding keeps the correct number of characters when
there are not enough bytes in the sequence. Ending with a single =
means the
last binary sequence contains 2 bytes. Ending with ==
means it contains 1
byte.
Binary to base64
Since a little math is involved to determine the encoded size we’ll use this simple function to help us out.
size_t b64_encoded_size(size_t inlen)
{
size_t ret;
ret = inlen;
if (inlen % 3 != 0)
ret += 3 - (inlen % 3);
ret /= 3;
ret *= 4;
return ret;
}
It’s pretty self explanatory but if you didn’t read the paragraphs above that explain the math, I highly recommend you go back and do so.
const char b64chars[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
Now we need to move the mapping table. Unlike decoding this is an array which we’ll reference by index as we encode. This table is used to turn 3 byte sequences into characters.
char *b64_encode(const unsigned char *in, size_t len)
{
char *out;
size_t elen;
size_t i;
size_t j;
size_t v;
if (in == NULL || len == 0)
return NULL;
elen = b64_encoded_size(len);
out = malloc(elen+1);
out[elen] = '\0';
for (i=0, j=0; i<len; i+=3, j+=4) {
v = in[i];
v = i+1 < len ? v << 8 | in[i+1] : v << 8;
v = i+2 < len ? v << 8 | in[i+2] : v << 8;
out[j] = b64chars[(v >> 18) & 0x3F];
out[j+1] = b64chars[(v >> 12) & 0x3F];
if (i+1 < len) {
out[j+2] = b64chars[(v >> 6) & 0x3F];
} else {
out[j+2] = '=';
}
if (i+2 < len) {
out[j+3] = b64chars[v & 0x3F];
} else {
out[j+3] = '=';
}
}
return out;
}
The encode function works in blocks of 3 bytes of binary data to turn it into base64 ASCII characters.
v = in[i];
v = i+1 < len ? v << 8 | in[i+1] : v << 8;
v = i+2 < len ? v << 8 | in[i+2] : v << 8;
Here we push the 3 bytes into an int
. If there are less than 3 bytes just
shift so we can properly pull the value back out later.
out[j] = b64chars[(v >> 18) & 0x3F];
out[j+1] = b64chars[(v >> 12) & 0x3F];
Next we pull the base64 character out of the mapping. 0x3F is used because it is 6 bits where all are set (a mask). Since each output byte is split into a series of 6 bits we need to only map using the bits being put into a given output byte.
if (i+1 < len) {
out[j+2] = b64chars[(v >> 6) & 0x3F];
} else {
out[j+2] = '=';
}
if (i+2 < len) {
out[j+3] = b64chars[v & 0x3F];
} else {
out[j+3] = '=';
}
Finally, add the byte or an =
for padding if there are less than 3 bytes in
the last block. 1 and 2 fewer than 3 bytes gets 1 or 2 padding characters.
Base64 to binary
Much like encoding, we need to do some math to determine binary data size of some Base64 encoded data we might have.
size_t b64_decoded_size(const char *in)
{
size_t len;
size_t ret;
size_t i;
if (in == NULL)
return 0;
len = strlen(in);
ret = len / 4 * 3;
for (i=len; i-->0; ) {
if (in[i] == '=') {
ret--;
} else {
break;
}
}
return ret;
}
Really this is just the opposite of what we didn’t for encoding.
Next we need a decoding table! Wooo, magic numbers.
int b64invs[] = { 62, -1, -1, -1, 63, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, -1, -1, -1, -1, -1, -1, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51 };
This is our decoding table. This is based on a shift from +
which is the
lowest ASCII character (43) in the sequence. The table index is the value in
b64chars
and the table value is the index of the vale in b64chars
. -1 is
used as a placeholder for values that are not in the table.
I don’t know about you but I hate magic numbers someone on the internet says to use. I’m fine with using them as long as I can verify they’re correct.
void b64_generate_decode_table()
{
int inv[80];
size_t i;
memset(inv, -1, sizeof(inv));
for (i=0; i<sizeof(b64chars)-1; i++) {
inv[b64chars[i]-43] = i;
}
}
Not to fear, this is the magic calculation that generated the table.
int inv[80];
inv
is 80 elements because that is the total number of characters between
+
and z
. Not all are base64 characters which is why there will be -1 holes.
inv[b64chars[i]-43] = i;
We need to subtract 43 to shift +
to be the 0 index. Once we run this
function we’ll end up with the above encoding table. It’s a better idea to use
the precomputed table than generate it anew for every decode.
int b64_isvalidchar(char c)
{
if (c >= '0' && c <= '9')
return 1;
if (c >= 'A' && c <= 'Z')
return 1;
if (c >= 'a' && c <= 'z')
return 1;
if (c == '+' || c == '/' || c == '=')
return 1;
return 0;
}
We’ll need a verification function to check if characters are are valid for the
base64 character set. This should be expanded to handle newlines, deal with
line length requirements, ignore whitespace if necessary, and verify there are
two or less =
, and =
is only present at the end. Verification could also
happen in the main decode function instead of being split out. I find it easier
to split it out otherwise the decode function becomes more difficult to
understand.
int b64_decode(const char *in, unsigned char *out, size_t outlen)
{
size_t len;
size_t i;
size_t j;
int v;
if (in == NULL || out == NULL)
return 0;
len = strlen(in);
if (outlen < b64_decoded_size(in) || len % 4 != 0)
return 0;
for (i=0; i<len; i++) {
if (!b64_isvalidchar(in[i])) {
return 0;
}
}
for (i=0, j=0; i<len; i+=4, j+=3) {
v = b64invs[in[i]-43];
v = (v << 6) | b64invs[in[i+1]-43];
v = in[i+2]=='=' ? v << 6 : (v << 6) | b64invs[in[i+2]-43];
v = in[i+3]=='=' ? v << 6 : (v << 6) | b64invs[in[i+3]-43];
out[j] = (v >> 16) & 0xFF;
if (in[i+2] != '=')
out[j+1] = (v >> 8) & 0xFF;
if (in[i+3] != '=')
out[j+2] = v & 0xFF;
}
return 1;
}
Now that we have all our helpers we can take a look at the main decode function. It will decode the sets of 4 base64 characters into 3 bytes. This handles the last grouping of bytes not being a multiple of three.
v = b64invs[in[i]-43];
v = (v << 6) | b64invs[in[i+1]-43];
We’re going to push the first two characters decoded into an int
that will hold our 3 bytes. These always need to be pushed because there will
always be 2 characters out of the 4 that aren’t padding.
v = in[i+2]=='=' ? v << 6 : (v << 6) | b64invs[in[i+2]-43];
v = in[i+3]=='=' ? v << 6 : (v << 6) | b64invs[in[i+3]-43];
The last two characters can be padding, so we need to keep this in mind. We
always need to shift to put the preceding bytes into the correct position
within the int
. To make it easy on ourselves, we’ll always shift even when
it’s padding.
out[j] = (v >> 16) & 0xFF;
if (in[i+2] != '=')
out[j+1] = (v >> 8) & 0xFF;
if (in[i+3] != '=')
out[j+2] = v & 0xFF;
There can be 3 total bytes that can be pulled out of the int
. If there is
padding we don’t want to pull those out and add them to the buffer. Otherwise
the buffer would need to be a multiple of three and there is no reason for
this. If there is less data than a multiple of three we don’t need to mandate
the extra space.
Example
Now for a quick example to test with:
int main(int argc, char **argv)
{
const char *data = "ABC123Test Lets Try this' input and see What \"happens\"";
char *enc;
char *out;
size_t out_len;
printf("data: '%s'\n", data);
enc = b64_encode((const unsigned char *)data, strlen(data));
printf("encoded: '%s'\n", enc);
printf("dec size %s data size\n", b64_decoded_size(enc) == strlen(data) ? "==" : "!=");
/* +1 for the NULL terminator. */
out_len = b64_decoded_size(enc)+1;
out = malloc(out_len);
if (!b64_decode(enc, (unsigned char *)out, out_len)) {
printf("Decode Failure\n");
return 1;
}
out[out_len] = '\0';
printf("dec: '%s'\n", out);
printf("data %s dec\n", strcmp(data, out) == 0 ? "==" : "!=");
free(out);
return 0;
}
And the output:
data: 'ABC123Test Lets Try this' input and see What "happens"'
encoded: 'QUJDMTIzVGVzdCBMZXRzIFRyeSB0aGlzJyBpbnB1dCBhbmQgc2VlIFdoYXQgImhhcHBlbnMi'
dec size == data size
dec: 'ABC123Test Lets Try this' input and see What "happens"'
data == dec