Hash Tables

Goals

Understand the general concept of hash tables.
Learn how to use hash tables.
Implement a simple, general hash table.

Concepts

collision
collision resolution
hash table
hash value
separate chaining with linked lists
table
uniformly distributed

Library

Lesson

The data structures you have learned so far have been graphs: individual values are stored in nodes, and the various nodes have references to one or more other nodes, resulting in a web of “links” among the nodes. Graph data types are interested primarily in relationships at several levels. If those relationships have to do with comparisons, as you saw in a binary search tree (BST), then the resulting data structure can help with sorting and lookup.

Another group of data structures are called tables and they behave much like the tables you see in a spreadsheet or even in the examples in these lessons. A table provides a direct correspondence between an input and an output. A typical spreadsheet is a two-dimensional table; each cell value is located based upon two inputs: a row and a column. Java arrays in fact could be considered one-dimensional tables; they “map” a single input value (the array index) to some stored value (the element at that index).

Hash Table

Probably the most popular table data structure is a hash table. It can be thought of as an array, although instead of mapping array indexes it maps some hash value as an input. A hash value is some arbitrary value that is assigned to an object for faster lookup in the future.

To find out how this might work, pretend that you want to remember the favorite color of all your friends. You make a table like this:

Friend Name	Favorite Color
Jane Doe	red
John Smith	green
Richard Roe	blue
…	…

Imagine that you want to look up Richard Roe's favorite color. In a Java program without a hash table you would need to search all your friends by name.

Naive lookup of friend favorite color.

…

private static final String[] FRIEND_NAMES = new String[] {"Jane Doe", "John Smith", "Richard Roe"};
private static final String[] FAVORITE_COLORS = new String[] {"red", "green", "blue"};

/**
 * Looks up the favorite color of a friend.
 * @param friendName The name of the friend.
 * @return The favorite color of the friend, or <code>null</code> if the friend's name is not recognized.
 */
public static @Nullable String getFavoriteColor(@Nonnull final String friendName) {
  checkNotNull(friendName);
  for(int i = 0; i < FRIEND_NAMES.length; ++i) {
    if(FRIEND_NAMES[i].equals(friendName)) {
      return FAVORITE_COLORS[i];
    }
  }
  return null;
}

If you have lots of friends, this could take a long time. In fact, how long it takes in the worst case depends on how many friends you have. From the lesson on algorithms you will remember that this order of growth is linear time, or O(n) in big O notation.

Hash Value

But now imagine that you're clever and you decide to write down some arbitrary number for each friend. You decide to use the birth date of each friend in the form YYYYMMDD, and consider that to be an integer value. This number will be used as a hash value for each friend.

Friend Name	Birth Date	Hash Value	Favorite Color
Jane Doe	1980-01-02	`19800102`	red
John Smith	1975-11-05	`19751105`	green
Richard Roe	1972-04-23	`19720423`	blue
…	…	…	…

Now we can create one array big enough to store all the hash codes (until the year 3000, for example), and store the favorite color as elements in the array.

Friend favorite color lookup using birthday-based hash value.

…

private static final String[] FAVORITE_COLORS;

static {
  FAVORITE_COLORS = new String[99990000];
  FAVORITE_COLORS[19800102] = "red";
  FAVORITE_COLORS[19751105] = "green";
  FAVORITE_COLORS[19720423] = "blue";
}

/**
 * Looks up the favorite color of a friend.
 * @param birthDateHashCode An integer hash value created from the birth date of the friend in the form YYYYMMDD.
 * @return The favorite color of the friend, or <code>null</code> if the friend's hash value is not recognized.
 */
public static @Nullable String getFavoriteColor(final int birthDateHashCode) {
  return FAVORITE_COLORS[birthDateHashCode];
}

Look how simple the lookup method has become: a single line! And it is very quick; in fact, you could have one friend or one million friends, and lookup would still take the same amount of time, as array lookup occurs in constant time. In big O notation it is referred to as O(1), as the lookup time does not depend on the number of elements.

Hash Collision

Now our lookup performance is top-notch, although we are using quite a bit of memory. Everything runs along just fine, and you go on collecting friends' favorite colors until… you realize that two of your friends have the same birthday, resulting in the same hash value: 19770522. This is referred to as a collision, because hash values aren't necessarily unique for each item they identify. What do you do?

There are several approaches to collision resolution to deal with duplicate hash values. To give you an idea of how collision resolution works, we'll explore one of the simplest approaches that exist: separate chaining with linked lists. In this approach, instead of storing a single value at each index in the array, we'll store multiple values—and what easier way to do that than to use the linked list implementation you've already been working on?

Hash Value	Values
…	…
`19720423`
…	…
`19751105`
…	…
`19770522`
…	…
`19800102`
…	…

But what do we store in the nodes of the linked list? When there is a hash collision, the list will contain more than one friend with the same birth date hash value, so we'll need some way to tell the friends apart. We can create a value object that encapsulates the friend's name and the favorite color.

Entry class for storing a friend's name and associated color.

public class Entry {

  /** Constructor. */
  public Entry(@Nonnull String friendName, @Nonnull String favoriteColor) {…}

  public @Nonnull String getFriendName() {…}

  public @Nonnull String getFavoriteColor() {…}

}

Looking up a friend's favorite color now takes the additional step of looking through the linked list to find the matching friend's name in the cases where there were more than one friend with the same birthday. This will happen relatively infrequently, so it won't slow down our lookup algorithm very much.

Friend favorite color lookup using birthday-based hash value, with conflict resolution using linked lists of Entry instances.

…

private static final LinkedList[] FAVORITE_COLORS;

static {
  FAVORITE_COLORS = new LinkedList[99990000];
  FAVORITE_COLORS[19800102] = new LinkedListImpl(new Entry("Jane Doe", "red"));
  FAVORITE_COLORS[19770522] = new LinkedListImpl(new Entry("John Stiles", "gold"), new Entry("Richard Miles", "purple"));
  FAVORITE_COLORS[19751105] = new LinkedListImpl(new Entry("John Smith", "green"));
  FAVORITE_COLORS[19720423] = new LinkedListImpl(new Entry("Richard Roe", "blue"));
}

/**
 * Looks up the favorite color of a friend.
 * @param friendName The name of the friend.
 * @param birthDateHashCode An integer hash value created from the birth date of the friend in the form YYYYMMDD.
 * @return The favorite color of the friend, or <code>null</code> if the friend's hash value is not recognized.
 */
public static @Nullable String getFavoriteColor(@Nonnull final String friendName, final int birthDateHashCode) {
  final LinkedList entries = FAVORITE_COLORS[birthDateHashCode];
  if(entries == null) {  //if there are no entries for this hash value
    return null;
  }
  final int entryCount = entries.getLength();
  for(int i = 0; i < entryCount; i++) {
    final Entry entry = (Entry)entries.get(i);
    if(entry.getFriendName().equals(friendName)) {
      return entry.getFavoriteColor();
    }
  }
  return null;
}

Look closely at the array FAVORITE_COLORS. It is not a linked list. It is an array of linked lists! Each element of this array is a reference to a linked list. All the references start out as null, so if you want a new list, you must create a new one.

Now we can store the favorite colors of all our friends, because we are using the fast lookup of an array combined with a handy hash value for each of our friends: their birth date formatted as an integer value.

Java Hash Codes

But is a hash table only useful for entities that have birthdays? What if we want to quickly look up the Blue Book® price of a car from its model number? What if we want to find out the common name of an animal based upon its scientific name? We need some way to determine an integer hash value for any object, and the Object.hashCode() method does just that. Now you know why: this method provides a consistent way for producing integer hash values that are used by hash tables and other classes that need quick lookup.

In a future lesson you'll learn that any Java object that overrides the Object.equals(Object) must must also override the Object.hashCode() method. If an object's definition of equality changes by overriding Object.equals(Object), the Object.hashCode() method must be updated appropriately so that the object's hash value continues to represent the underlying equality decision.

We can therefore dispense with using a friend's birthday as a hash value. The java.lang.String class already overrides Object.hashCode(), so as long as our friends' names are unique we can use our friends' name as the lookup value and simply ask the strings to provide a hash code. There is a slight complication: we don't know ahead of time what a string's hash code is, or which two strings might have the same hash code. Therefore we'll have to create a separate method with logic for adding values to the linked list if values already exist for some hash code.

Friend favorite color lookup using hash code of friend name.

…

private static final LinkedList[] FAVORITE_COLORS = new LinkedList[Integer.MAX_VALUE];

/**
 * Puts a friend's favorite color in the hash table.
 * If a friend's name is already in the table, it is replaced with the new favorite color.
 */
private static void put(@Nonnull final String friendName, @Nonnull final String favoriteColor) {
  final int hashCode = Math.abs(friendName.hashCode());
  LinkedList entries = FAVORITE_COLORS[hashCode];
  if(entries == null) { //no entries yet
    entries = new LinkedListImpl();
    FAVORITE_COLORS[hashCode] = entries;  //store the list of entries for next time
  }
  //TODO go through the list, and if there is an entry with this friend's name already, replace it with a new entry
  entries.add(new Entry(friendName, favoriteColor));
}

static {
  put("Jane Doe", "red");
  put("John Stiles", "gold");
  put("Richard Miles", "purple");
  put("John Smith", "green";
  put("Richard Roe", "blue");
}

/*
 * Looks up the favorite color of a friend.
 * @param friendName The name of the friend.
 * @return The favorite color of the friend, or <code>null</code> if the friend's name is not recognized.
 */
public static @Nullable String getFavoriteColor(@Nonnull final String friendName) {
  final int hashCode = Math.abs(friendName.hashCode());
  final LinkedList entries = FAVORITE_COLORS[hashCode];
  if(entries == null) {  //if there are no entries for this hash code
    return null;
  }
  final int entryCount = entries.getLength()) {
  for(int i = 0; i < entryCount; i++) {
    final Entry entry = (Entry)entries.get(i);
    if(entry.getFriendName().equals(friendName)) {
      return entry.getFavoriteColor();
    }
  }
  return null;
}

Hash Buckets

Examples of the modulus operator.

`n`	`n / 3`	`n % 3`
`0`	`0`	`0`
`1`	`0`	`1`
`2`	`0`	`2`
`3`	`1`	`0`
`4`	`1`	`1`
`5`	`1`	`2`

We've now succeeded in creating a generalized hash table approach. There is one underlying issue that pops up often in algorithms: we have a very fast algorithm, but we are using a lot of memory to pull this off. We have just a few friends, but because we don't know the range of hash values that will be returned, we create an array large enough to hold as many friends as there are non-negative integers! As the largest int value in Java is 0x7FFFFFFF, this results in an array with 2,147,483,647 elements, all but a few of them containing null—and null takes up just as much space as any other reference. We need to find a way to cut down on the number of array positions (or “buckets”) we are using.

Let's say that we'd rather use 8 buckets instead. We could go to each of our friends one at a time and given them a number between 0 and 7, then start over at 0 for the next friend, and so on. It so happens that there is a Java operator that does just that: the modulus % operator. This operator is usually described as providing the remainder of a division problem. Take a look in the figure at what the results turn out to be for n / 3 and for n % 3.

So using n % x will give us numbers from 0 to x - 1. If we want to use 8 buckets, then we need merely to take our hash code % 8 to reduce the number of buckets.

Using the modulus operator to reduce the number of buckets.

…

private static int BUCKET_COUNT = 8;

private static final LinkedList[] FAVORITE_COLORS = new LinkedList[BUCKET_COUNT];

private static void put(@Nonnull final String friendName, @Nonnull final String favoriteColor) {
  …
  final int hashCodeBucketIndex = Math.abs(friendName.hashCode()) % BUCKET_COUNT;
  …
}

…

public static @Nullable String getFavoriteColor(@Nonnull final String friendName) {
  final int hashCodeBucketIndex = Math.abs(friendName.hashCode()) % BUCKET_COUNT;
}

Review

Summary

A hash table stores objects based upon their hash codes in “buckets”. Because multiple objects may have the same hash code, a hash table uses some hash conflict resolution such as maintaining a linked list of objects in each bucket. Because the number of hash codes may be much larger than the number of buckets, a hash table usually uses the modulus % operator to group hash codes into even fewer buckets. Thus the process looks like this:

key: This is the key that will look up an object in a hash table.
key.hashCode(): Retrieve the hash code for the key.
Math.abs(key.hashCode()): Make sure the hash code is not negative.
Math.abs(key.hashCode()) % bucketCount: Determine which bucket in which to store the entry.
buckets[Math.abs(key.hashCode()) % bucketCount]: Get the linked list of key-object entries one of the keys of which may match.

Gotchas

If a hash function is not uniformly distributed, it winds up lumping all keys in the same “bucket”, making the search have linear O(n) performance, no better than storing all the items in a single linked list (which is essentially what results).
If you use a custom type as a map key, forgetting to implement the Object.hashCode() or Object.equals(Object) method of a key value object will prevent hash table lookup from working correctly. You will learn about overriding these methods in a future lesson.

In the Real World

The decision between performance and memory usage is one that appears frequently in computer science.
Most of the time the various approaches to collision detection are only discussed in job interviews. The rest of the time, you'll use whatever method your library hash table implementation uses.

Think About It

Choose the correct key for hash table lookup for your objects, as hash tables can only store one object association with any particular key.

Self Evaluation

How do hash values relate to hash table input values?
Can a hash table have multiple entries with the same input value? What about the same attribute value?
What if you wanted to associate more than one output value with an input value? How could you accomplish that?

Task

`HashTableImpl`

In your datastruct project, create a HashTableImpl hash table class that implements the following interface:

public interface HashTable {

  /**
   * Stores a value in the hash table and associates it with the given key.
   * <p>If a value is already associated with the given key, it will be discarded
   * and replaced with the new value.</p>
   * @param key The input value.
   * @param value The output value to associate with the given key.
   */
  void put(@Nonnull Object key, @Nonnull Object value);

  /**
   * Retrieves the value associated with a given key.
   * @param key The input value.
   * @return The output value associated with the key,
   *     or <code>null</code> if the key is not present in the table.
   */
  @Nullable Object get(@Nonnull Object key);

  /**
   * Determines if a value is associated with the given key.
   * <p>Overriding this method is optional; an implementation may
   * have a more efficient way of checking to see if a value exists.</p>
   * @param key The input value.
   * @return <code>true</code> if there exists an output value associated with the key.
   */
  default boolean contains(@Nonnull Object key) {
    return get(key) != null;
  }

}

Create appropriate unit tests. For example:
- Test null keys.
- Test null values.
- Test putting duplicate keys with distinct values.
- Test putting duplicate keys with equal values.
- Test putting different keys with equal values.
- Create a test that stores one value using a key of Long.valueOf(0), and another value using a key of Long.valueOf(-1). Verify you can retrieve both values independently. These two keys are interesting, because they represent different numbers yet return the same hash value. See Long hashCode returns same number for different objects in java.

Publication Name Lookup

Implement the following method in Booker for looking up a publication by its name.

/**
 * Looks up a publication with the given name.
 * @param The name or title of the publication.
 * @return The publication with the given name, or <code>null</code> if no such publication is known.
 */
public static @Nullable Publication getPublicationByName(@Nonnull final String publicationName) {
  …
}

This task assumes that you have already refactored your Publication hierarchy so that all Publications have a consolidated getName() or similar method.
In addition to the normal storage you are using for your publications, create a static HashTable of all publications, using their names as keys.
- Do not create new publication instances.
- In your static initializer block, after populating the normal list of publications, iterate over them and store them appropriately in the hash table.
Use this hash table in your getPublicationByName(…) method.
Create appropriate unit tests.
Choose your favorite publication. At the end of Booker's output, look up your chosen publication by name using your getPublicationByName(…) method, and print that book separately.

References

Big-O Algorithm Complexity Cheat Sheet (Eric Rowell)

Acknowledgments

“BLUE BOOK®” is a trademark of Kelley Blue Book Co., Inc.