Hash Indexes

Concept. A hash index maps a key to its page via a hash function, supporting O(1) equality lookups (WHERE id = 42) but no range scans (WHERE id BETWEEN 10 AND 100).

Intuition. When you query "find Mickey's user record" (WHERE user_id = 1), a hash index returns the page in one IO. But for "find users with user_id between 100 and 200," the hash gives no help, because neighbouring keys land on random pages.

Prerequisites. Index Fundamentals introduces what (key, page#) pairs are and how pointers work.


How a Hash Index Finds a Row

A hash index stores the same (value, page#) pairs any index does, but the bucket for each pair is chosen by a hash function. The caption below traces two lookups: a clean unique-key case (Alice) and a duplicate-key fan-out case (Dan).

Hash index. Three green index pages on the right hold (value, page#) entries; five green data pages on the left hold the actual rows. Purple arrows trace two lookups: Alice resolves to data page 2; Dan fans out to three data pages (0, 1, 4) because duplicate keys land on multiple pages.

When a key has duplicates (Dan in the diagram), the index entry lists every page those rows live on, and the database reads each page in turn. That fan-out is the only part of the lookup whose cost grows, and it grows with the number of duplicates for that key, not with the table size.
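To make the fan-out concrete, here is a minimal Python sketch that mirrors the diagram's layout. The pages dict and the read_page function are illustrative stand-ins for real disk pages and page reads:

pages = {
    0: [("Bob", "row-b"), ("Dan", "row-d1")],
    1: [("Dan", "row-d2")],
    2: [("Alice", "row-a")],
    4: [("Dan", "row-d3")],
}
index = {"Alice": [2], "Bob": [0], "Dan": [0, 1, 4]}   # value -> page list

def read_page(page_num, key):
    # stand-in for one disk IO: scan the page for matching rows
    return [row for k, row in pages[page_num] if k == key]

def fetch_rows(key):
    rows = []
    for page_num in index.get(key, []):   # one IO per listed page
        rows.extend(read_page(page_num, key))
    return rows

print(fetch_rows("Alice"))   # ['row-a']: one page read
print(fetch_rows("Dan"))     # ['row-d1', 'row-d2', 'row-d3']: three reads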


Two Ways to Organize a Hash Index

Basic: one giant dictionary

index = {
  "Alice": 2,
  "Bob":   0,
  "Carol": 0,
  "Dan":   [0, 1, 4],
  # ... a billion more entries
}

page_num = index["Alice"]   # O(1)

Simple, and works fine until the dictionary stops fitting in RAM. One billion entries at 16 bytes each is 16 GB, at which point every lookup starts with a disk read.

Partitioned: split across N pages (default)

# split into N=16 smaller indexes, one dict per index page
N = 16
index_pages = [dict() for _ in range(N)]

def insert(key, page_num):
    which = hash(key) % N       # pick the index page
    index_pages[which][key] = page_num

def lookup(key):
    which = hash(key) % N       # same hash picks the same page
    return index_pages[which][key]

Each index page is now about 1 GB (16 GB ÷ 16), and a larger N shrinks the pages further: N = 256 gets them down to a comfortable 64 MB each. The extra hash step is essentially free, and the semantics are unchanged.
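A quick check that the partitioned version gives the same answers as the single big dictionary (keys and page numbers are illustrative):

insert("Alice", 2)
insert("Dan", 0)
print(lookup("Alice"))   # 2, same answer the one-big-dict version gives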


What Hash Indexes Are Good At, and What They Break On

The hash function is what makes this fast, and it's also what limits it. Hashing scatters neighbouring keys to random buckets, so any query that needs keys in order gets no help at all.

Works great

WHERE user_id = 42
WHERE order_id = 'ABC-123'
WHERE session_key = 'xyz789'

Any predicate that resolves to a single key value: equality, primary-key lookup, session ID, cache key.

Useless

WHERE age > 21
ORDER BY created_at
WHERE name LIKE 'A%'
WHERE price BETWEEN 10 AND 50

Any predicate that spans a range of keys, requires sorted order, or matches a prefix. Reach for a B+Tree instead.
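You can watch the scattering happen. This sketch hashes a run of consecutive ids; MD5 stands in for any well-mixed hash, and the bucket count of 16 is illustrative:

import hashlib

B = 16

def bucket(key):
    # map the key to a bucket via a well-mixed hash
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % B

for user_id in range(100, 105):
    print(user_id, bucket(user_id))

Consecutive ids land in unrelated buckets, so answering BETWEEN 100 AND 104 would mean probing nearly every bucket; the index prunes nothing.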


Algorithm

Build and lookup

Build(table, key_column):
    B = choose_bucket_count()            // often a prime
    index = array[B] of lists

    for row in table:
        key = row[key_column]
        bucket = hash(key) % B
        index[bucket].append((key, row_pointer))

    return index

Lookup(index, search_key):                // O(1) average
    bucket = hash(search_key) % length(index)    // same B used in Build
    for (key, pointer) in index[bucket]:
        if key == search_key:
            return pointer                 // found
    return null                            // not found
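The same algorithm in runnable Python, under two simplifying assumptions: the table is a list of dicts, and a row's position stands in for a real page pointer; B is fixed rather than chosen adaptively:

def build(table, key_column, B=1009):          # B fixed here; often a prime
    index = [[] for _ in range(B)]             # one chain per bucket
    for pointer, row in enumerate(table):      # position stands in for a pointer
        key = row[key_column]
        index[hash(key) % B].append((key, pointer))
    return index

def lookup(index, search_key):                 # O(1) average
    for key, pointer in index[hash(search_key) % len(index)]:
        if key == search_key:
            return pointer                     # found
    return None                                # not found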

Hash choice matters. A poor hash clusters keys into a few buckets and degrades lookup to O(N), so production databases use well-studied hashes like MurmurHash and xxHash.
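To see that degradation, compare bucket occupancy under a deliberately poor hash and Python's built-in one; the key set and the poor hash are contrived for illustration:

from collections import Counter

B = 16
keys = [f"user-{i}" for i in range(10_000)]

def poor_hash(key):
    return ord(key[0])          # only looks at the first character

poor = Counter(poor_hash(k) % B for k in keys)
good = Counter(hash(k) % B for k in keys)

print(max(poor.values()))   # 10000: every key collides into one bucket
print(max(good.values()))   # roughly 10000 / 16, about 625 per bucket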


Key Takeaways

  1. O(1) equality lookups. A small, constant number of IOs (read the index page, then the matching data pages) regardless of table size, which is hard to beat for point queries.

  2. No help for ranges or sorts. Neighbouring keys scatter to random buckets.

  3. Partitioning keeps huge indexes practical. The semantics stay the same, and each partition fits in RAM.


Next

B+Tree Indexes → Keeps keys sorted, so the same index handles both point lookups and range scans.