December 20, 2012

Last Week's Coding Practice - Hash Tables

In an effort to keep myself in tip-top coding shape, I've started doing the weekly coding problems at http://codingforinterviews.com/. Each week, they set a topic, ask you a few questions and send you a simple problem to solve. They don't really check your answers or solution, but the main point is to get you thinking about important data structures, algorithms, etc. This is my second week in, and so far it seems worthwhile. Today I'd like to briefly mention last week's topic: Hash Tables, and document some of the things that I've learnt.

To me, hash tables have been the bread and butter of programming ever since learning about them. Constant-time lookups and insertion, linear storage overhead, complete freedom to choose the key and value types... what's there not to like, right? Unfortunately, while I was using hash tables practically everywhere, I neglected to learn about how they are actually implemented. Last week, that changed.

Back to School

First, I had to make some adjustments to my vocabulary. A hash table (or hashtable, hashmap, hash map, ...) is one way of implementing an associative array, which is a mapping a mapping between keys and values. Other implementations of associative arrays are possible: for example, you could do it using a linked list of (key, value) pairs, but it wouldn't be a particularly good implementation since insertion and retrieval would be O(n).

Next, a hash table consists of a hash function and a fixed-length array. The hash function takes a key, which can be of any data type, and returns an integer (the hash value). This hash function is deterministic (the same key always gets hashed to the same integer). Additionally, one of the most desirable properties of a hash function is that it returns unique hashes for unique keys -- different keys never get the same hash value. When two different keys hash to the same value, it is known as a collision. For good hash functions, collisions are rare in practice. The hash value is mod'ed (for lack of a better word) by the length of the fixed-length array, and the result is used to index into the array.

In a trivial case, each element (bin) of the array can simply hold the value that corresponds to a specific key. Due to collisions (and the mod operation), however, different keys can end up going into the same bin. While collisions are rare, they do occur, and a strategy to handle or reliably avoid them becomes necessary.

There are several ways to deal with a collision. I'll describe one of the simplest: each bin contains a linked list that holds (key, value) pairs. When inserting a (key, value) pair into the hash table you just append the pair to the linked list of that particular bin. When retrieving a value for a particular key, you traverse the entire list for the right bin until you hit the pair with the identical key. Appending a list and traversing it adds overhead to the hash table insertion and retrieval. However, this overhead is O(n) of the length of the linked list. Since collisions are rare, the linked lists will be fairly short, and this overhead will be minimal. Other collision handling strategies reduce this overhead, but are slightly more involved.

Lastly, while hash functions sound like magic, in essence they are quite simple: convert the key into a numeric representation of arbitrary length, and then reduce it to an integer. For numbers or numeric-like data types such as characters, you don't really need to do anything -- the hash is simply equal to the key. For compound data types (arrays, structs, strings), you can obtain the numeric representation by adding the individual elements. Addition alone doesn't evenly distribute the hash value over the entire domain of the hash function (integers), so add-and-multiply is often used instead.

Email me teh codez?

After learning all the stuff above, I was ready to have a try at implementing my own hash table. Here it is, in C, complete with Makefile and all: