Hash Table and Associative Arrays

One of the most amazing data structures in Computer Science, with the most profound consequences, is the hash table. This is also known as a hash map or in Python we call it a dictionary.

In this video, I’m going to go over the theory of the hash table, what it is, how it works, and the associated algorithms.

Definition of an Associative Array

In order to understand a hash table, we must first understand the associative array. Associative Arrays are also called maps or symbol tables or dictionaris.

In an associative array, the values of the array are key-value pairs. The key represents a lookup value or an index. The value is the value stored for that key.

Each key must be unique. That is, if I have the pair (1,2), then I am not allowed to have the pair (1,3) or anything else that starts with a 1.

Associative Array Methods

Associative arrays support some additional methods beyond the standard array methods.

  • Get by key. Given a key, lookup the pair matching the key and return the value.

  • Add by key. Given a key and a value, add a new element for that key and value. Note that there should be no element matching the key.

  • Replace by key. Given a key and a value, remove any element matching the key and add a new key-value pair with the key and value. Alternatively, change the value of the key-value pair matching the key.

  • Set by key. Either add or replace the value by the key.

  • Delete by key. Remove the element with the key.

The Associative Array data structure is an abstract data structure. There are several ways of implementing it, with the hash table being the most common. However, there are also search trees that can be used, which we won’t cover here. (These will be covered in more detail as I review the implementations of RDBMSs.)

In order to better understand the Hash Table, however, I will introduce a simple and naive approach to handling associative arrays, the associative list.

Associative List Definition

The associative list is an implementation of the associative array data type.

In an associative list, data is stored in the form of an array of pairs. It takes O(n) in memory to store n pairs.

As key-value pairs are added, new pairs are appended to the end of the array. This is an O(1) operation. (This assumes, of course, that there is no element with the same key already in the array.)

When a key-value pair is looked up by key, then the entire list is searched. This is an O(n) operation in time, but O(1) in memory.

When a key-value pair is replaced, then it requires the O(n) operation to find the element matching the key, with a simple O(1) replacement, but otherwise it is O(1) in memory.

If you want to set a key, which could either be an add or replace, then it is an O(n) operation but O(1) in memory.

Deleting a key-value pair is also an O(1) in time and memory.

Here is an implementation of this in Python:

def find_by_key(s, key):
    "Returns the (index, value) of the element matching the key."
    for i, (k, v) in enumerate(s):
        if k == key:
            return i, v
    else:
        raise IndexError("No such key")

def get(s, key):
    "Returns the value associated with the key, if any."
    return find_by_key(s, key)[1]

def set(s, key, value):
    "Sets the key to the value, adding or replacing as necessary."
    try:
        i, v = find_by_key(s, key)
    except IndexError:
        # Add
        s.append((key, value)
    else:
        # Replace
        s[i][1] = value

def delete(s, key):
    "Removes the key"
    i, v = find_by_key(s, key)
    del s[i]

Hash Table Definition

A hash table has two parts: An array and a hash function. The hash function is a function that returns an integer given a key.

The hash function is used to determine where the value should be stored. The result of the hash function is mod’d with the size of the array to get the index where the value should be stored.

Because of hash collisions, there is a good chance that the hash function mod n will give you the same index for two different keys. Because of this, the key is stored with the value to test that the value is really for that key or for some other key with the same hash mod n.

In order to reduce collisions, typically n is kept a little larger than the total number of items in the hash table.

In the case of a collision, there are two popular approaches.

  • One is to store a reference to an associative list for all the keys that share the same hash mod n. This is effective but sub-optimal for a few reasons:

    1. If there are a lot of collisions, then you might get long arrays for each value.

    2. At the very least, there is another lookup that needs to be performed, sometimes to a completely different area of memory, which reduces the speed of the algorithm in modern architectures.

  • Another method is to just store the key-value pair in the next empty spot in the array. This is called “Open Addressing”.

The second method is preferrable, particularly as long as you keep the actual number of key-value pairs below about 80% of the room available in the underlying array.

Since Python uses the second method, we’ll talk about that.

Hash Table Algorithms

  • Get by Key: Take the hash of the key, and mod it with length of the underlying array. Go to that index. If the key matches, then return the value. If it is blank, then the key doesn’t exist. If it is some other key, then look at the next and repeat the previous steps.

  • Set by Key: Take the hash of the key mod n. If there is an element there matching the key, replace the value. If it is some other key, then increment the index and look again. If it is empty, then add it in. If we need more room, then we allocate more following a similar pattern to what we do for arrays. We expand the underlying array by some factor of the array length. We will need to repopulate the new array, but that is n O(1) operations, so O(n)

  • Delete by key: Similar to Get by key, except you need to set the element to [None, None]

All of the above are O(1) in time and memory, provided collisions are rare. Note that it is quicker to remove or add elements to hash table than an array. Adding an element or removing it is the same as adding or removing elements from the end of the array.

Populating a hash table of n items is an O(n) operation because adding a single element is an O(1) operation amortized.

Here’s an example implementation to look at:

def find_by_key(s, key):
    i = hash(key) % len(s)

    while True:
        k, v = s[i]
        if k is None:
            raise IndexError("No such key")
        if k == key:
            return i, v
        i = i + 1

def get(s, key):
    return find_by_key(s, key)[1]

def add(s, key, value):
    # If we could keep track of how many items are in the hash table, we can
    # decide whether we need to expand it here.
    i = hash(key) % len(s)
    while True:
        k, v = s[i]
        if k is None:
            s[i] = [key, value]

        if k == key:
            s[i][1] = value
            break

def delete(s, key):
    i, v = find_by_key(s, key)
    s[i] = [None, None]

Using Hash Tables

Hash tables are used extensively, and not just for Python dictionaries.

  • Databases typically use hash tables to quickly look up rows by their values. The hash table has a key of the column value, and the value points to the row. In addition to indexes you might create, as part of a query with a join in it, the database will build up a temporary hash table.

  • Hash tables are ideal for caches. With their O(1) add/get/delete times, you can cache a very large number of items without worrying about a performance hit as you add more items.

  • Mathematical sets require that each item be unique in the set. Hash Tables are often used to implement sets.

  • Objects. In OO programming, hash tables are frequently used to store attributes of objects as well as the vtable used to figure out where a particular method exists as a function. We’ll talk more about how Python does OOP when we cover that topic.

Good Hash Function

The hash table relies on a “good” hash function. What makes a good hash function?

  • It MUST be consistent, returning the same value for a key.

  • It should be O(1) in behavior. IE, it should take the same time to calculate the hash of a long or short string. Many hash functions do not, however.

  • It should spread values out “randomly”. A good example is if you return the same value for each letter of the alphabet, then you’ll get a lot of collisios.

Great care is taken generating a hash function. There are a lot of different possibilities, and often you want some combination of algorithms for the best performance. The goal is to minimize the number of hash collisions with real-world data.

Further Study