Index Fundamentals

Concept. An index is an auxiliary data structure that maps a key (e.g. user_id) to the page where matching rows live, turning a full-table scan into a one-IO lookup.

Intuition. Without an index, finding Mickey's listens means scanning every row in a billion-row Listens table. With an index on user_id, the database looks up "Mickey lives on page 47," reads page 47, and returns Mickey's 3 listens. Same answer, 100,000× fewer IOs.

This page defines what every index stores. The next four pages each show a different way of organizing those (key, page#) pairs: hash indexes for equality, B+Trees for ranges, SSTables for immutable writes, and LSM trees for write-heavy workloads.

160,000

Pages in the Songs table

IOs with an index

80,000×

Speedup for equality queries

What's Inside an Index?

Two kinds of pages live on disk. They're both 64 MB blocks of bytes, but they hold very different things.

Data pages hold the rows themselves

Each row is whatever size the row is. A song with lyrics is roughly 10 KB.
At 1 M songs, that works out to 10 GB, which is about 160,000 pages.

Index pages hold only pointers

Each entry is a (search_key, page#) pair of 8 to 20 bytes.
At 1 M entries, that's 20 MB, which fits on a single page.

A pointer is the Page 47 part of an entry. It's a page number on disk (or a memory address in RAM), and it's the compact coordinate that gets you from the index straight to the data. Every index type on the rest of this module (hash, B+Tree, LSM) is a different way of organizing those same (key, page#) pairs.

Why the Index Is So Much Smaller

Same data, two footprints. The Songs table is 10 GB spread over 160,000 pages; the artist index is 20 MB on a single page, 500 times smaller.

Figure: The Songs table on the left stores whole rows. The artist index on the right stores only (key, page#) pointers. The bottom bars show their sizes at the same horizontal scale.

A data page carries 10 KB rows. An index page carries 20-byte pointers. That is the 500× reduction: the same logical information stored in a different representation.

The payoff shows up on the query WHERE artist = 'Beatles'. Without an index, the database scans all 160,000 pages of Songs. With one, it reads a single index page to learn Beatles lives on page 47, then reads page 47. That's two IOs instead of 160,000, or 80,000× fewer reads for the same answer.

When to Add an Index, and When to Skip It

An index is a tradeoff: faster reads in exchange for extra storage and slower writes. It pays off on columns the query engine actually uses, and that narrow the result set down to a few rows.

Good candidates

Primary keys. Indexed automatically in most databases.
Foreign keys. Joins walk these constantly.
High-selectivity columns with many unique values, like email or user_id.
Frequently filtered columns that WHERE clauses target.

Poor candidates

Rarely queried columns. You pay the cost without getting the benefit.
Low-selectivity columns such as a boolean, where the index points at half the table.
Frequently updated columns. Every write rewrites the index too.
Small tables (under about 1,000 rows), where a scan is already fast.

Key Takeaways

Indexes store (key, page#) pointers, not data. That's why they're 500× smaller than the table they index.
Two IOs beats 160,000. The win grows with table size, so bigger tables make indexes more valuable, not less.
Not every column deserves one. Indexes are rewarding to query and painful to write, so pick them where reads dominate writes.

Hash Indexes → Constant-time lookup for equality queries. The simplest way to organize (key, page#) pairs.