Introduction

So, C doesn’t have a native hashtable object but that’s not a problem because we can use one one someone else wrote. Lack of a robust standard library is probably the biggest impoundments of working with C.

It’s a real shame C doesn’t natively support hashtables because they are so versatile. You should keep this saying in mind, “when in doubt hash it out”. It works for programming and for life in general.

The big libraries out there that provide hashtables are Apache’s APR and GLib. These are massive and provide a lot of other stuff (not necessarily a bad thing). The real issue I have with them is they’re more than just a library, they’re a way of coding. What I mean is they push you into their coding style and often their low level data types. For example GLib wants you to use their gpointer type which casts off the *, YUCK!

For the moment, I need a hashtable but I don’t want to use the rest of a large, existing library. It’s not difficult to write a light weight generic hashtable.

Design

Obviously we’re going to write our own hashtable implementation. The first thing to consider is, we don’t want it to be type specific, because it’s supposed to be generic. This leads us into “the talk”, not the one you’re parents had with you in elementary school, the one about void pointers GASP. TLDR; they’re powerful, allow typeless programming and if you know what you’re doing they’re perfectly okay to use.

Header

Let’s start by looking at the full public header.

htable.h

#ifndef __HTABLE_H__
#define __HTABLE_H__

#include <stdbool.h>

struct htable;
typedef struct htable htable_t;

struct htable_enum;
typedef struct htable_enum htable_enum_t;

typedef unsigned int (*htable_hash)(const void *in, unsigned int seed);
typedef void *(*htable_kcopy)(void *in);
typedef bool (*htable_keq)(const void *a, const void *b);
typedef void (*htable_kfree)(void *in);
typedef void *(*htable_vcopy)(void *in);
typedef void (*htable_vfree)(void *in);

typedef struct {
    htable_kcopy key_copy;
    htable_kfree key_free;
    htable_vcopy val_copy;
    htable_vfree val_free;
} htable_cbs;

htable_t *htable_create(htable_hash hfunc, htable_keq keq, htable_cbs *cbs);
void htable_destroy(htable_t *ht);

void htable_insert(htable_t *ht, void *key, void *val);
void htable_remove(htable_t *ht, void *key);

bool htable_get(htable_t *ht, void *key, void **val);
void *htable_get_direct(htable_t *ht, void *key);

htable_enum_t *htable_enum_create(htable_t *ht);
bool htable_enum_next(htable_enum_t *he, void **key, void **val);
void htable_enum_destroy(htable_enum_t *he);

#endif /* __HTABLE_H__ */

Wow! What in the world is going on here? Lets break this down into it’s three main parts.

Callbacks

typedef unsigned int (*htable_hash)(const void *in, unsigned int seed);
typedef void *(*htable_kcopy)(void *in);
typedef bool (*htable_keq)(const void *a, const void *b);
typedef void (*htable_kfree)(void *in);
typedef void *(*htable_vcopy)(void *in);
typedef void (*htable_vfree)(void *in);

typedef struct {
    htable_kcopy key_copy;
    htable_kfree key_free;
    htable_vcopy val_copy;
    htable_vfree val_free;
} htable_cbs;

First we have function pointer typedefs. The implementation being generic, still has to know about the type it’s working on. To accommodate this, we’ll use a series of callbacks for data manipulation.

We need a hash function and a key equality function at a minimum. Since these are required, we’ll make them parameters in htable_create. We aren’t putting them into callback struct because we’re going to make everything there optional.

I’ll talk about the hash callback seed parameter and what an optional callback means a bit later.

Functions

htable_t *htable_create(htable_hash hfunc, htable_keq keq, htable_cbs *cbs);
void htable_destroy(htable_t *ht);

void htable_insert(htable_t *ht, void *key, void *val);
void htable_remove(htable_t *ht, void *key);

bool htable_get(htable_t *ht, void *key, void **val);
void *htable_get_direct(htable_t *ht, void *key);

We’re going to support the basics, create, destroy, insert, remove and what not. This is going to be a fairly basic implementation but we can still extend it later when we need to. No reason to put in everything under the sun right now.

Enumeration

htable_enum_t *htable_enum_create(htable_t *ht);
bool htable_enum_next(htable_enum_t *he, void **key, void **val);
void htable_enum_destroy(htable_enum_t *he);

There is also a set of enumerate functions which will allow us to get the data out of the table even if we’re not referencing a specific key. Trust me, we want this.

Implementation

We’re going to go with a chaining implementation. Don’t worry, as we go we’ll be sure to mitigate chaining attacks.

htable.c

Includes

#include <limits.h>
#include <stdlib.h>
#include <time.h>

#include "htable.h"

Object Data

struct htable_bucket {
    void *key;
    void *val;
    struct htable_bucket *next;
};
typedef struct htable_bucket htable_bucket_t;

struct htable {
    htable_hash      hfunc;
    htable_keq       keq;
    htable_cbs       cbs;
    htable_bucket_t *buckets;
    size_t           num_buckets;
    size_t           num_used;
    unsigned int     seed;
};

struct htable_enum {
    htable_t        *ht;
    htable_bucket_t *cur;
    size_t           idx;
};

static const size_t BUCKET_START = 16;

The hashtable object is an array of buckets which will be expanded as needed. A bucket holds a key value pair and can point to a chain of buckets outside of the hashtable’s array. A bucket in the array is considered empty when the key is set to NULL.

When a key is hashed, it produces a number which will be reduced to an index in the array. This is where we either put or find a given key value pair.

Since hashes aren’t unique two inputs can hash (then reduce) to the same index in our bucket array. This is why a bucket can point to another bucket via a linked list. We’re building a chain of elements when there is a hash collision. Choosing a good hash algorithm will keep everything dispersed evenly in the array and produce few collisions so there shouldn’t be many chains and they should be very short.

Finally, we have the seed which I ignored earlier. Traversing a linked list has O(n) performance and attackers know this. An attacker can construct a series of inputs that will always hash to the same value. They’re trying to cause chains to grow out of control and cause a denial of service. This was found in PHP back in 2011 and could cause a web server to max out it’s CPU. Bad stuff.

We want our hashtable to prevent this situation from ever happening and that’s where the seed comes in. Basically, we generate a random seed (can be pseudo random) and use it as part of the hash function. The seed being dynamically generated each time we create a hashtable means a hash for a given key will always be different per hashtable. Now an attacker can’t really figure out what inputs will cause a collision.

Since we’re keeping everything in an array of buckets we’ll always start with 16 buckets and grow by a power of 2.

Default Callbacks

static void *htable_passthough_copy(void *v)
{
    return v;
}

static void htable_passthough_destroy(void *v)
{
    return;
}

Earlier I said we want to have some optional callbacks. So, here we go. If one isn’t set, we’ll just use a pass through function. Now’s a good time as any to talk about ownership.

The copy and destroy functions mean that the hashtable can copy the key and or value, taking ownership of the copy. Otherwise, if we’re doing pass though the caller keeps ownership and is responsible for keeping them around for the life of the hashtable.

Hashing

static size_t htable_bucket_idx(htable_t *ht, void *key)
{
    return ht->hfunc(key, ht->seed) % ht->num_buckets;
}

When we get a hash we need to reduce it to fit into the number of buckets we currently have. Calculate and mod!

Adding

static void htable_add_to_bucket(htable_t *ht, void *key, void *val, bool isrehash)
{
    htable_bucket_t *cur;
    htable_bucket_t *last;
    size_t           idx;

    idx = htable_bucket_idx(ht, key);
    if (ht->buckets[idx].key == NULL) {
        if (!isrehash) {
            key = ht->cbs.key_copy(key);
            if (val != NULL)
                val = ht->cbs.val_copy(val);
        }
        ht->buckets[idx].key = key;
        ht->buckets[idx].val = val;
        if (!isrehash)
            ht->num_used++;
    } else {
        last = ht->buckets+idx;
        cur  = ht->buckets+idx;
        do {
            if (ht->keq(key, cur->key)) {
                if (cur->val != NULL)
                    ht->cbs.val_free(cur->val);
                if (!isrehash && val != NULL)
                    val = ht->cbs.val_copy(val);
                cur->val = val;
                last = NULL;
                break;
            }
            last = cur;
            cur  = cur->next;
        } while (cur != NULL);

        if (last != NULL) {
            cur = calloc(1, sizeof(*cur->next));
            if (!isrehash) {
                key = ht->cbs.key_copy(key);
                if (val != NULL)
                    val = ht->cbs.val_copy(val);
            }
            cur->key = key;
            cur->val = val;
            last->next = cur;
            if (!isrehash)
                ht->num_used++;
        }
    }
}

First, check if the index in the bucket has something already. If not then we add the key and value to the bucket.

Second, if that index already has a key value pair, check if the key is a match and if so replace the value.

Third, this is the tricky part, we’ve determined the key isn’t in a bucket, but there is something already at that index. So, traverse the chain checking each node if we have the key already. If we do, replace the value and we’re done. If we don’t have the key in the chain, we’ll hit the end and create a new node to add to the chain.

Phew! I guess explaining it is simpler than the code looks.

Rehashing

static void htable_rehash(htable_t *ht)
{
    htable_bucket_t *buckets;
    htable_bucket_t *cur;
    htable_bucket_t *next;
    size_t           num_buckets;
    size_t           i;

    if (ht->num_used+1 < (size_t)(ht->num_buckets*0.75) || ht->num_buckets >= 1<<31)
        return;

    num_buckets = ht->num_buckets;
    buckets     = ht->buckets;
    ht->num_buckets <<= 1;
    ht->buckets = calloc(ht->num_buckets, sizeof(*buckets));

    for (i=0; i<num_buckets; i++) {
        if (buckets[i].key == NULL)
            continue;

        htable_add_to_bucket(ht, buckets[i].key, buckets[i].val, true);
        if (buckets[i].next != NULL) {
            cur = buckets[i].next;
            do {
                htable_add_to_bucket(ht, cur->key, cur->val, true);
                next = cur->next;
                free(cur);
                cur = next;
            } while (cur != NULL);
        }
    }

    free(buckets);
}

We want to keep from creating chains as much as possible so we periodically grow the bucket array. A good rule of thumb is to grow when it gets 75% full. Also, we don’t want to to grow larger than 1<<31 buckets otherwise we’ll run into problems. Having a hard limit isn’t a problem because we really don’t have a hard limit. We still have an infinite hashtable because when we fill all buckets, we can still chain new entries. Not ideal but when you’re at 2147483648+ buckets… chains are the least of your worries.

The trick here is we swap the bucket array into a temporary pointer. We then create a new one for the hashtable and iterate the old one. Pulling out each pair and adding it to the new array. The not really hard part is, running though each chain and destroying the chain nodes.

Since we’re transferring the data from one array to another we use the isrehash parameter for htable_add_to_bucket to prevent making any copies.

Finally, we clean up the old bucket array.

Create and Destroy

htable_t *htable_create(htable_hash hfunc, htable_keq keq, htable_cbs *cbs)
{
    htable_t *ht;

    if (hfunc == NULL || keq == NULL)
        return NULL;

    ht = calloc(1, sizeof(*ht));
    ht->hfunc = hfunc;
    ht->keq = keq;

    ht->cbs.key_copy = htable_passthough_copy;
    ht->cbs.key_free = htable_passthough_destroy;
    ht->cbs.val_copy = htable_passthough_copy;
    ht->cbs.val_free = htable_passthough_destroy;
    if (cbs != NULL) {
        if (cbs->key_copy != NULL) ht->cbs.key_copy = cbs->key_copy;
        if (cbs->key_free != NULL) ht->cbs.key_free = cbs->key_free;
        if (cbs->val_copy != NULL) ht->cbs.val_copy = cbs->val_copy;
        if (cbs->val_free != NULL) ht->cbs.val_free = cbs->val_free;
    }

    ht->num_buckets = BUCKET_START;
    ht->buckets     = calloc(BUCKET_START, sizeof(*ht->buckets));

    ht->seed = (unsigned int)time(NULL);
    ht->seed ^= ((unsigned int)htable_create << 16) | (unsigned int)ht;
    ht->seed ^= (unsigned int)&ht->seed;

    return ht;
}

Most of create is self explanatory but the seed not so much. We only need a seed that changes per hashtable and isn’t easily guessable. Time plus a few addresses is sufficient for this purpose. When creating the see we push the htable_create pointer to the left to ensure we’re filling the high bits with something.

void htable_destroy(htable_t *ht)
{
    htable_bucket_t *next;
    htable_bucket_t *cur;
    size_t           i;

    if (ht == NULL)
        return;

    for (i=0; i<ht->num_buckets; i++) {
        if (ht->buckets[i].key == NULL)
            continue;
        ht->cbs.key_free(ht->buckets[i].key);
        ht->cbs.val_free(ht->buckets[i].val);

        next = ht->buckets[i].next;
        while (next != NULL) {
            cur = next;
            ht->cbs.key_free(cur->key);
            ht->cbs.val_free(cur->val);
            next = cur->next;
            free(cur);
         }
    }

    free(ht->buckets);
    free(ht);
}

Destroy is similar to all the other functions that deal with chains. We need to iterate the chain at each bucket and destroy the nodes.

Insert and Remove

void htable_insert(htable_t *ht, void *key, void *val)
{
    void *ckey;
    void *cval;

    if (ht == NULL || key == NULL)
        return;

    htable_rehash(ht);
    htable_add_to_bucket(ht, key, val, false);
}

Insert is super simple because we already have internal functions that handle all this for us. This ends up being a wrapper for rehash and add.

void htable_remove(htable_t *ht, void *key)
{
    htable_bucket_t *cur;
    htable_bucket_t *last;
    size_t           idx;

    if (ht == NULL || key == NULL)
        return;

    idx = htable_bucket_idx(ht, key);
    if (ht->buckets[idx].key == NULL)
        return;

    if (ht->keq(ht->buckets[idx].key, key)) {
        ht->cbs.key_free(ht->buckets[idx].key);
        ht->cbs.val_free(ht->buckets[idx].val);
        ht->buckets[idx].key = NULL;

        cur = ht->buckets[idx].next; 
        if (cur != NULL) {
            ht->buckets[idx].key  = cur->key ;
            ht->buckets[idx].val  = cur->val;
            ht->buckets[idx].next = cur->next;
            free(cur);
        }
        ht->num_used--;
        return;
    }

    last = ht->buckets+idx;
    cur  = last->next;
    while (cur != NULL) {
        if (ht->keq(key, cur->key)) {
            last->next = cur->next;
            ht->cbs.key_free(cur->key);
            ht->cbs.val_free(cur->val);
            free(cur);
            ht->num_used--;
            break;
        }
        last = cur;
        cur  = cur->next;
    }
}

Yet another more complex than it seems function. Get the index using the hash. Check the bucket and chains. Remove the entry if it exists. Relink the chain if necessary. Simple right?

Get

bool htable_get(htable_t *ht, void *key, void **val)
{
    htable_bucket_t *cur;
    size_t           idx;

    if (ht == NULL || key == NULL)
        return false;

    idx = htable_bucket_idx(ht, key);
    if (ht->buckets[idx].key == NULL)
        return false;

    cur = ht->buckets+idx;
    while (cur != NULL) {
        if (ht->keq(key, cur->key)) {
            if (val != NULL) {
                *val = cur->val;
            }
            return true;
        }
        cur = cur->next;
    }

    return false;
}

void *htable_get_direct(htable_t *ht, void *key)
{
    void *val = NULL;
    htable_get(ht, key, &val);
    return val;
}

You’re probably wondering why there are two different get functions. htable_get_direct has more direct usage but a value can be NULL. With this function we have no idea if the value is NULL or if the key doesn’t exist. We’re allowing NULL values by the way because they’re really handy.

This is why we need htable_get. It tells us if the key exists and the value is NULL or if the key doesn’t exist at all. Just like everything else this gets the index, checks the bucket and chain. I hope you’re seeing a pattern.

We could have had a function specifically to check if a key exists but we can just pass NULL for the out parameter for this kind of check. I like having a simple API but if you really want a separate function just add a wrapper around htable_get.

Enumeration

htable_enum_t *htable_enum_create(htable_t *ht)
{
    htable_enum_t *he;

    if (ht == NULL)
        return NULL;

    he = calloc(1, sizeof(*he));
    he->ht = ht;

    return he;
}

bool htable_enum_next(htable_enum_t *he, void **key, void **val)
{
    void *mykey;
    void *myval;

    if (he == NULL || he->idx >= he->ht->num_buckets)
        return false;

    if (key == NULL)
        key = &mykey;
    if (val == NULL)
        val = &myval;

    if (he->cur == NULL) {
        while (he->idx < he->ht->num_buckets && he->ht->buckets[he->idx].key == NULL) {
            he->idx++;
        }
        if (he->idx >= he->ht->num_buckets)
            return false;
        he->cur = he->ht->buckets+he->idx;
        he->idx++;
    }

    *key = he->cur->key;
    *val = he->cur->val;
    he->cur = he->cur->next;

    return true;
}

For enumeration we are going to follow a very simple process. Go though each bucket in the bucket array and check if there is something there. If so return the bucket data, and move to the next bucket. However, if the bucket has a chain we store the pointer to the first node in that bucket’s chain.

The next time around we’ll see cur (the chain node) points to something and keep traversing the chain until we’re done. This might not happen at all if the bucket didn’t have a chain to begin with.

Once we’ve gone though that chain we’re already pointed to the next bucket and we start over from there.

void htable_enum_destroy(htable_enum_t *he)
{
    if (he == NULL)
        return;
    free(he);
}

Not really much here. Since the enumeration object just tracks locations within the hashtable there isn’t really much that needs to be cleaned up.

Next steps

Now we have a very flexible generic hashtable, but it still uses void pointers which isn’t always ideal. What we can, and will do, is create a set of type safe wrappers. This isn’t hard at all and will give us back compiler provided type safety. We’ll go into this in detail in the future.