Hash tables
A hash table is a data structure that maps keys to values, using a hash function to correlate key values with the index (or bucket) where the value is stored in an underlying array . A hash table is an implementation of the map abstract data type.
Operation | Average | Worst |
---|---|---|
Access | $\htmlClass{text-green}{\Theta(1)}$ | $\htmlClass{text-yellow}{\mathcal{O}(n)}$ |
Insert | $\htmlClass{text-green}{\Theta(1)}$ | $\htmlClass{text-yellow}{\mathcal{O}(n)}$ |
Delete | $\htmlClass{text-green}{\Theta(1)}$ | $\htmlClass{text-yellow}{\mathcal{O}(n)}$ |
Hash functions
A hash function maps each key to an integer index of a bucket array.
Division hashing uses the modulo function by dividing the key $k$ by the number of buckets $b$, with the remainder serving as the bucket index: $i = k \mod b$.
There are many other hash functions in use, each with unique benefits and disadvantages.
Collision handling
A hash collision occurs when the hash function maps multiple keys to the same bucket index.
In a hash table with no collisions, access, insertion, and deletion can all be performed in $\mathcal{O}(1)$ time.
The load factor of a hash table is the ratio of stored items ($n$) to array buckets ($b$), $n \over {b}$. To minimize collisions, the load factor of a hash table should never exceed $1$ (an upper bound is usually set between $0.5$ and $0.75$). When the specified upper bound is reached, a rehash is performed, with the underlying array being resized and existing values distributed to their new corresponding buckets.
A hash function should also be chosen that evenly distributes keys to buckets, avoiding clustering values. Even with a load factor far below $1$, if the hash function distributes every key to the same bucket (the worst-case scenario), the performance of every operation will $\mathcal{O}(n)$.
Chaining
In chaining, collisions are handled by appending colliding key-value pairs to a linked list stored in the bucket. To access a value by its key, the hash function resolves the correct bucket, and the list is searched for the correct key-value pair.
In this example, the hash table, which implements a map, is bootstrapped from a simpler map implementation, an association list (the LinkedMap class):
1public abstract class HashTable<K,V> implements MapADT<K,V> {
2 protected static int DEFAULT_CAPACITY = 17;
3
4 protected int size = 0;
5
6 protected int hash(Object key) {
7 // ensure a positive hash value by removing the sign bit from Java's hash,
8 // so that later application of the modulo operator results in a value > 0
9 return key.hashCode() & 0x7fffffff;
10 }
11
12 public int size() { return size; }
13
14 public boolean isEmpty() { return size == 0; }
15}
1public class ChainingHashTable<K,V> extends HashTable<K,V> {
2 private LinkedMap<K,V>[] table;
3
4 public ChainingHashTable() { this(DEFAULT_CAPACITY); }
5
6 public ChainingHashTable(int capacity) {
7 table = (LinkedMap<K, V>[]) new LinkedMap[capacity];
8 for (int i = 0; i < capacity; i++)
9 table[i] = new LinkedMap<>();
10 }
11
12 public V get(K key) {
13 int index = hash(key) % table.length;
14 return table[index].get(key);
15 }
16
17 public V set(K key, V value) {
18 int index = hash(key) % table.length;
19 V result = table[index].set(key, value);
20 if (result == null) size++;
21 if (loadFactor() > 0.75) resize(table.length * 2);
22 return result;
23 }
24
25 public V remove(K key) {
26 int index = hash(key) % table.length;
27 V result = table[index].remove(key);
28 if (result != null) size--;
29 return result;
30 }
31
32 public void forEach(BiConsumer<K,V> consumer) {
33 for (int i = 0; i < size; i++)
34 table[i].forEach(consumer);
35 }
36
37 private LinkedMap<K,V>[] createTable(int capacity) {
38 var newTable = (LinkedMap<K, V>[]) new LinkedMap[capacity];
39 for (int i = 0; i < capacity; i++)
40 newTable[i] = new LinkedMap<>();
41 return newTable;
42 }
43
44 private float loadFactor() { return (float) size / table.length; }
45
46 private void resize(int capacity) {
47 ChainingHashTable<K,V> resized = new ChainingHashTable<>(capacity);
48 forEach(resized::set);
49 table = resized.table;
50 }
51}
Open addressing
In open addressing, each key-value pair is directly stored in a bucket, but when a collision occurs, a method is used to place the pair in another bucket such that it can be found later.
Linear probing
Linear probing is an open addressing method whereby a colliding key value pair is stored by iterating forwards in the array until an unoccupied bucket is found. When searching for a key, the buckets are scanned in the same order until either the matching key or an empty bucket (key doesn’t exist) is found.
1public class LinearProbingHashTable<K,V> extends HashTable<K,V> {
2 private static class Bucket<K,V> {
3 private final K key;
4 private V value;
5
6 public Bucket(K key, V value) {
7 this.key = key;
8 this.value = value;
9 }
10 }
11
12 private Bucket<K,V>[] table;
13 private final Bucket<K,V> REMOVED = new Bucket<>(null, null);
14
15 public LinearProbingHashTable() { this(DEFAULT_CAPACITY); }
16
17 public LinearProbingHashTable(int capacity) {
18 table = (Bucket<K,V>[]) new Bucket[capacity];
19 }
20
21 public V get(K key) {
22 int start = hash(key) % table.length;
23 for (int i = 0; i < table.length; i++) {
24 int j = (start + i) % table.length;
25 var bucket = table[j];
26 if (bucket != null && bucket != REMOVED && bucket.key.equals(key)) {
27 return bucket.value;
28 }
29 }
30 return null;
31 }
32
33 public V set(K key, V value) {
34 if (loadFactor() >= 0.5) resize(table.length * 2);
35
36 int start = hash(key) % table.length;
37 int first = -1;
38 for (int i = 0; i < table.length; i++) {
39 int j = (start + i) % table.length;
40 var bucket = table[j];
41 if (bucket == null) {
42 // we can assume the key doesn't exist, and insert immediately
43 table[j] = new Bucket<>(key, value);
44 size++;
45 return null;
46 }
47 if (bucket.equals(REMOVED)) {
48 // the key may exist, but save this index for later insertion if needed
49 if (first == -1) first = j;
50 }
51 if (bucket.key.equals(key)) {
52 // the key exists, replace the value
53 var result = bucket.value;
54 bucket.value = value;
55 return result;
56 }
57 }
58 table[first] = new Bucket<>(key, value);
59 size++;
60 return null;
61 }
62
63 public V remove(K key) {
64 int start = hash(key) % table.length;
65 for (int i = 0; i < table.length; i++) {
66 int j = (start + i) % table.length;
67 var bucket = table[j];
68 if (bucket != null && bucket != REMOVED && bucket.key.equals(key)) {
69 V result = bucket.value;
70 table[j] = REMOVED;
71 size--;
72 return result;
73 }
74 }
75 return null;
76 }
77
78 public void forEach(BiConsumer<K, V> consumer) {
79 for (Bucket<K,V> bucket : table)
80 if (bucket != null && bucket != REMOVED)
81 consumer.accept(bucket.key, bucket.value);
82 }
83
84 private float loadFactor() { return (float) size / table.length; }
85
86 private void resize(int capacity) {
87 LinearProbingHashTable<K,V> resized = new LinearProbingHashTable<>(capacity);
88 forEach(resized::set);
89 table = resized.table;
90 }
91}
Wikipedia has a description of additional open addressing schemes .