Introduction

A very popular way to encode binary data is Base64. The basis of this is an encoding table. As you might expect, there are 64 total characters that go into the tale. There are multiple implementations of base64 with slight differences. They are all the same except for the last two characters and line ending requirements. The first 62 characters are always “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789”. PEM and MIME encoding are the most common and use “+/” as the last two characters. PEM and MIME may use the same characters but they have different maximum line lengths. I’m going to implement PEM/MINE but I’m not going to implement new line support.

Properties

Size

In Base64 encoding, 3 binary bytes are represented as 4 characters. This gives us a 4:3 ratio, meaning there is 33% overhead for base64. To get the number of base64 characters needed for a given binary data blob, take the length of input and round up to the nearest multiple of 3. Then, divide by 3 to get the number of 3 byte blocks. Multiply by 4 to get the number of base64 characters.

The date for each set of 3 binary bytes is spread over 4 characters giving us 6 bits per character. 3 bytes total 24 bits (3*8=24). There 4 characters for the 24 bits (24/4=6). Think of it this way 3/4 (putting 3 bytes into 4) means the data is split 75% from each byte to span 4 bytes. Thus, 6 bits utilized per character.

Padding

Base64 works in blocks of 3 bytes. Implementations such as PEM require padding using = character. Padding keeps the correct number of characters when there are not enough bytes in the sequence. Ending with a single = means the last binary sequence contains 2 bytes. Ending with == means it contains 1 byte.

Binary to base64

Since a little math is involved to determine the encoded size we’ll use this simple function to help us out.

size_t b64_encoded_size(size_t inlen)
{
	size_t ret;

	ret = inlen;
	if (inlen % 3 != 0)
		ret += 3 - (inlen % 3);
	ret /= 3;
	ret *= 4;

	return ret;
}

It’s pretty self explanatory but if you didn’t read the paragraphs above that explain the math, I highly recommend you go back and do so.

const char b64chars[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

Now we need to move the mapping table. Unlike decoding this is an array which we’ll reference by index as we encode. This table is used to turn 3 byte sequences into characters.

char *b64_encode(const unsigned char *in, size_t len)
{
	char   *out;
	size_t  elen;
	size_t  i;
	size_t  j;
	size_t  v;

	if (in == NULL || len == 0)
		return NULL;

	elen = b64_encoded_size(len);
	out  = malloc(elen+1);
	out[elen] = '\0';

	for (i=0, j=0; i<len; i+=3, j+=4) {
		v = in[i];
		v = i+1 < len ? v << 8 | in[i+1] : v << 8;
		v = i+2 < len ? v << 8 | in[i+2] : v << 8;

		out[j]   = b64chars[(v >> 18) & 0x3F];
		out[j+1] = b64chars[(v >> 12) & 0x3F];
		if (i+1 < len) {
			out[j+2] = b64chars[(v >> 6) & 0x3F];
		} else {
			out[j+2] = '=';
		}
		if (i+2 < len) {
			out[j+3] = b64chars[v & 0x3F];
		} else {
			out[j+3] = '=';
		}
	}

	return out;
}

The encode function works in blocks of 3 bytes of binary data to turn it into base64 ASCII characters.

v = in[i];
v = i+1 < len ? v << 8 | in[i+1] : v << 8;
v = i+2 < len ? v << 8 | in[i+2] : v << 8;

Here we push the 3 bytes into an int. If there are less than 3 bytes just shift so we can properly pull the value back out later.

out[j]   = b64chars[(v >> 18) & 0x3F];
out[j+1] = b64chars[(v >> 12) & 0x3F];

Next we pull the base64 character out of the mapping. 0x3F is used because it is 6 bits where all are set (a mask). Since each output byte is split into a series of 6 bits we need to only map using the bits being put into a given output byte.

if (i+1 < len) {
	out[j+2] = b64chars[(v >> 6) & 0x3F];
} else {
	out[j+2] = '=';
}
if (i+2 < len) {
	out[j+3] = b64chars[v & 0x3F];
} else {
	out[j+3] = '=';
}

Finally, add the byte or an = for padding if there are less than 3 bytes in the last block. 1 and 2 fewer than 3 bytes gets 1 or 2 padding characters.

Base64 to binary

Much like encoding, we need to do some math to determine binary data size of some Base64 encoded data we might have.

size_t b64_decoded_size(const char *in)
{
	size_t len;
	size_t ret;
	size_t i;

	if (in == NULL)
		return 0;

	len = strlen(in);
	ret = len / 4 * 3;

	for (i=len; i-->0; ) {
		if (in[i] == '=') {
			ret--;
		} else {
			break;
		}
	}

	return ret;
}

Really this is just the opposite of what we didn’t for encoding.

Next we need a decoding table! Wooo, magic numbers.

int b64invs[] = { 62, -1, -1, -1, 63, 52, 53, 54, 55, 56, 57, 58,
	59, 60, 61, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5,
	6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
	21, 22, 23, 24, 25, -1, -1, -1, -1, -1, -1, 26, 27, 28,
	29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
	43, 44, 45, 46, 47, 48, 49, 50, 51 };

This is our decoding table. This is based on a shift from + which is the lowest ASCII character (43) in the sequence. The table index is the value in b64chars and the table value is the index of the vale in b64chars. -1 is used as a placeholder for values that are not in the table.

I don’t know about you but I hate magic numbers someone on the internet says to use. I’m fine with using them as long as I can verify they’re correct.

void b64_generate_decode_table()
{
	int    inv[80];
	size_t i;

	memset(inv, -1, sizeof(inv));
	for (i=0; i<sizeof(b64chars)-1; i++) {
		inv[b64chars[i]-43] = i;
	}
}

Not to fear, this is the magic calculation that generated the table.

int inv[80];

inv is 80 elements because that is the total number of characters between + and z. Not all are base64 characters which is why there will be -1 holes.

inv[b64chars[i]-43] = i;

We need to subtract 43 to shift + to be the 0 index. Once we run this function we’ll end up with the above encoding table. It’s a better idea to use the precomputed table than generate it anew for every decode.

int b64_isvalidchar(char c)
{
	if (c >= '0' && c <= '9')
		return 1;
	if (c >= 'A' && c <= 'Z')
		return 1;
	if (c >= 'a' && c <= 'z')
		return 1;
	if (c == '+' || c == '/' || c == '=')
		return 1;
	return 0;
}

We’ll need a verification function to check if characters are are valid for the base64 character set. This should be expanded to handle newlines, deal with line length requirements, ignore whitespace if necessary, and verify there are two or less = , and = is only present at the end. Verification could also happen in the main decode function instead of being split out. I find it easier to split it out otherwise the decode function becomes more difficult to understand.

int b64_decode(const char *in, unsigned char *out, size_t outlen)
{
	size_t len;
	size_t i;
	size_t j;
	int    v;

	if (in == NULL || out == NULL)
		return 0;

	len = strlen(in);
	if (outlen < b64_decoded_size(in) || len % 4 != 0)
		return 0;

	for (i=0; i<len; i++) {
		if (!b64_isvalidchar(in[i])) {
			return 0;
		}
	}

	for (i=0, j=0; i<len; i+=4, j+=3) {
		v = b64invs[in[i]-43];
		v = (v << 6) | b64invs[in[i+1]-43];
		v = in[i+2]=='=' ? v << 6 : (v << 6) | b64invs[in[i+2]-43];
		v = in[i+3]=='=' ? v << 6 : (v << 6) | b64invs[in[i+3]-43];

		out[j] = (v >> 16) & 0xFF;
		if (in[i+2] != '=')
			out[j+1] = (v >> 8) & 0xFF;
		if (in[i+3] != '=')
			out[j+2] = v & 0xFF;
	}

	return 1;
}

Now that we have all our helpers we can take a look at the main decode function. It will decode the sets of 4 base64 characters into 3 bytes. This handles the last grouping of bytes not being a multiple of three.

v = b64invs[in[i]-43];
v = (v << 6) | b64invs[in[i+1]-43];

We’re going to push the first two characters decoded into an int that will hold our 3 bytes. These always need to be pushed because there will always be 2 characters out of the 4 that aren’t padding.

v = in[i+2]=='=' ? v << 6 : (v << 6) | b64invs[in[i+2]-43];
v = in[i+3]=='=' ? v << 6 : (v << 6) | b64invs[in[i+3]-43];

The last two characters can be padding, so we need to keep this in mind. We always need to shift to put the preceding bytes into the correct position within the int. To make it easy on ourselves, we’ll always shift even when it’s padding.

out[j] = (v >> 16) & 0xFF;
if (in[i+2] != '=')
	out[j+1] = (v >> 8) & 0xFF;
if (in[i+3] != '=')
	out[j+2] = v & 0xFF;

There can be 3 total bytes that can be pulled out of the int. If there is padding we don’t want to pull those out and add them to the buffer. Otherwise the buffer would need to be a multiple of three and there is no reason for this. If there is less data than a multiple of three we don’t need to mandate the extra space.

Example

Now for a quick example to test with:

int main(int argc, char **argv)
{
	const char *data = "ABC123Test Lets Try this' input and see What \"happens\"";
	char       *enc;
	char       *out;
	size_t      out_len;

	printf("data:    '%s'\n", data);

	enc = b64_encode((const unsigned char *)data, strlen(data));
	printf("encoded: '%s'\n", enc);

	printf("dec size %s data size\n", b64_decoded_size(enc) == strlen(data) ? "==" : "!=");

	/* +1 for the NULL terminator. */
	out_len = b64_decoded_size(enc)+1;
	out = malloc(out_len);

	if (!b64_decode(enc, (unsigned char *)out, out_len)) {
		printf("Decode Failure\n");
		return 1;
	}
	out[out_len] = '\0';

	printf("dec:     '%s'\n", out);
	printf("data %s dec\n", strcmp(data, out) == 0 ? "==" : "!=");
	free(out);

	return 0;
}

And the output:

data:    'ABC123Test Lets Try this' input and see What "happens"'
encoded: 'QUJDMTIzVGVzdCBMZXRzIFRyeSB0aGlzJyBpbnB1dCBhbmQgc2VlIFdoYXQgImhhcHBlbnMi'
dec size == data size
dec:     'ABC123Test Lets Try this' input and see What "happens"'
data == dec