Hash Tables
Goals
- Understand the general concept of hash tables.
- Learn how to use hash tables.
- Implement a simple, general hash table.
Concepts
- collision
- collision resolution
- hash table
- hash value
- separate chaining with linked lists
- table
- uniformly distributed
Library
java.lang.Object.equals(Object)
java.lang.Object.hashCode()
java.lang.String
java.lang.Math.abs(int)
Lesson
The data structures you have learned so far have been graphs: individual values are stored in nodes, and the various nodes have references to one or more other nodes, resulting in a web of “links” among the nodes. Graph data types are interested primarily in relationships at several levels. If those relationships have to do with comparisons, as you saw in a binary search tree (BST), then the resulting data structure can help with sorting and lookup.
Another group of data structures are called tables and they behave much like the tables you see in a spreadsheet or even in the examples in these lessons. A table provides a direct correspondence between an input and an output. A typical spreadsheet is a two-dimensional table; each cell value is located based upon two inputs: a row and a column. Java arrays in fact could be considered one-dimensional tables; they “map” a single input value (the array index) to some stored value (the element at that index).
Hash Table
Probably the most popular table data structure is a hash table. It can be thought of as an array, although instead of mapping array indexes it maps some hash value as an input. A hash value is some arbitrary value that is assigned to an object for faster lookup in the future.
To find out how this might work, pretend that you want to remember the favorite color of all your friends. You make a table like this:
Friend Name | Favorite Color |
---|---|
Jane Doe | red |
John Smith | green |
Richard Roe | blue |
… | … |
Imagine that you want to look up Richard Roe's favorite color. In a Java program without a hash table you would need to search all your friends by name.
If you have lots of friends, this could take a long time. In fact, how long it takes in the worst case depends on how many friends you have. From the lesson on algorithms you will remember that this order of growth is linear time, or O(n) in big O notation.
Hash Value
But now imagine that you're clever and you decide to write down some arbitrary number for each friend. You decide to use the birth date of each friend in the form YYYYMMDD
, and consider that to be an integer value. This number will be used as a hash value for each friend.
Friend Name | Birth Date | Hash Value | Favorite Color |
---|---|---|---|
Jane Doe | 1980-01-02 | 19800102 | red |
John Smith | 1975-11-05 | 19751105 | green |
Richard Roe | 1972-04-23 | 19720423 | blue |
… | … | … | … |
Now we can create one array big enough to store all the hash codes (until the year 3000, for example), and store the favorite color as elements in the array.
Look how simple the lookup method has become: a single line! And it is very quick; in fact, you could have one friend or one million friends, and lookup would still take the same amount of time, as array lookup occurs in constant time. In big O notation it is referred to as O(1), as the lookup time does not depend on the number of elements.
Hash Collision
Now our lookup performance is top-notch, although we are using quite a bit of memory. Everything runs along just fine, and you go on collecting friends' favorite colors until… you realize that two of your friends have the same birthday, resulting in the same hash value: 19770522
. This is referred to as a collision, because hash values aren't necessarily unique for each item they identify. What do you do?
There are several approaches to collision resolution to deal with duplicate hash values. To give you an idea of how collision resolution works, we'll explore one of the simplest approaches that exist: separate chaining with linked lists. In this approach, instead of storing a single value at each index in the array, we'll store multiple values—and what easier way to do that than to use the linked list implementation you've already been working on?
Hash Value | Values |
---|---|
… | … |
19720423 | |
… | … |
19751105 | |
… | … |
19770522 | |
… | … |
19800102 | |
… | … |
But what do we store in the nodes of the linked list?
When there is a hash collision, the list will contain more than one friend with the same birth date hash value, so we'll need some way to tell the friends apart. We can create a value object that encapsulates the friend's name and the favorite color.
Looking up a friend's favorite color now takes the additional step of looking through the linked list to find the matching friend's name in the cases where there were more than one friend with the same birthday. This will happen relatively infrequently, so it won't slow down our lookup algorithm very much.
Now we can store the favorite colors of all our friends, because we are using the fast lookup of an array combined with a handy hash value for each of our friends: their birth date formatted as an integer value.
Java Hash Codes
But is a hash table only useful for entities that have birthdays? What if we want to quickly look up the Blue Book® price of a car from its model number? What if we want to find out the common name of an animal based upon its scientific name? We need some way to determine an integer hash value for any object, and the Object.hashCode()
method does just that. Now you know why: this method provides a consistent way for producing integer hash values that are used by hash tables and other classes that need quick lookup.
We can therefore dispense with using a friend's birthday as a hash value. The java.lang.String
class already overrides Object.hashCode()
, so as long as our friends' names are unique we can use our friends' name as the lookup value and simply ask the strings to provide a hash code. There is a slight complication: we don't know ahead of time what a string's hash code is, or which two strings might have the same hash code. Therefore we'll have to create a separate method with logic for adding values to the linked list if values already exist for some hash code.
Hash Buckets
n | n / 3 | n % 3 |
---|---|---|
0 | 0 | 0 |
1 | 0 | 1 |
2 | 0 | 2 |
3 | 1 | 0 |
4 | 1 | 1 |
5 | 1 | 2 |
We've now succeeded in creating a generalized hash table approach. There is one underlying issue that pops up often in algorithms: we have a very fast algorithm, but we are using a lot of memory to pull this off. We have just a few friends, but because we don't know the range of hash values that will be returned, we create an array large enough to hold as many friends as there are non-negative integers! As the largest int
value in Java is 0x7FFFFFFF
, this results in an array with 2,147,483,647 elements, all but a few of them containing null
—and null
takes up just as much space as any other reference. We need to find a way to cut down on the number of array positions (or “buckets”) we are using.
Let's say that we'd rather use 8
buckets instead. We could go to each of our friends one at a time and given them a number between 0
and 7
, then start over at 0
for the next friend, and so on. It so happens that there is a Java operator that does just that: the modulus %
operator. This operator is usually described as providing the remainder of a division problem. Take a look in the figure at what the results turn out to be for n / 3
and for n % 3
.
So using n % x
will give us numbers from 0 to x - 1
. If we want to use 8
buckets, then we need merely to take our hash code % 8
to reduce the number of buckets.
Review
Summary
A hash table stores objects based upon their hash codes in “buckets”. Because multiple objects may have the same hash code, a hash table uses some hash conflict resolution such as maintaining a linked list of objects in each bucket. Because the number of hash codes may be much larger than the number of buckets, a hash table usually uses the modulus %
operator to group hash codes into even fewer buckets. Thus the process looks like this:
key
- This is the key that will look up an object in a hash table.
key.hashCode()
- Retrieve the hash code for the key.
Math.abs(key.hashCode())
- Make sure the hash code is not negative.
Math.abs(key.hashCode()) % bucketCount
- Determine which bucket in which to store the entry.
buckets[Math.abs(key.hashCode()) % bucketCount]
- Get the linked list of key-object entries one of the keys of which may match.
Gotchas
- If a hash function is not uniformly distributed, it winds up lumping all keys in the same “bucket”, making the search have linear O(n) performance, no better than storing all the items in a single linked list (which is essentially what results).
- If you use a custom type as a map key, forgetting to implement the
Object.hashCode()
orObject.equals(Object)
method of a key value object will prevent hash table lookup from working correctly. You will learn about overriding these methods in a future lesson.
In the Real World
- The decision between performance and memory usage is one that appears frequently in computer science.
- Most of the time the various approaches to collision detection are only discussed in job interviews. The rest of the time, you'll use whatever method your library hash table implementation uses.
Think About It
- Choose the correct key for hash table lookup for your objects, as hash tables can only store one object association with any particular key.
Self Evaluation
- How do hash values relate to hash table input values?
- Can a hash table have multiple entries with the same input value? What about the same attribute value?
- What if you wanted to associate more than one output value with an input value? How could you accomplish that?
Task
HashTableImpl
In your datastruct
project, create a HashTableImpl
hash table class that implements the following interface:
- Create appropriate unit tests. For example:
- Test
null
keys. - Test
null
values. - Test putting duplicate keys with distinct values.
- Test putting duplicate keys with equal values.
- Test putting different keys with equal values.
- Create a test that stores one value using a key of
Long.valueOf(0)
, and another value using a key ofLong.valueOf(-1)
. Verify you can retrieve both values independently. These two keys are interesting, because they represent different numbers yet return the same hash value. See Long hashCode returns same number for different objects in java.
- Test
Publication Name Lookup
Implement the following method in Booker
for looking up a publication by its name.
- This task assumes that you have already refactored your
Publication
hierarchy so that allPublication
s have a consolidatedgetName()
or similar method. - In addition to the normal storage you are using for your publications, create a static
HashTable
of all publications, using their names as keys.- Do not create new publication instances.
- In your static initializer block, after populating the normal list of publications, iterate over them and store them appropriately in the hash table.
- Use this hash table in your
getPublicationByName(…)
method. - Create appropriate unit tests.
- Choose your favorite publication. At the end of Booker's output, look up your chosen publication by name using your
getPublicationByName(…)
method, and print that book separately.
See Also
References
Acknowledgments
- “BLUE BOOK®” is a trademark of Kelley Blue Book Co., Inc.