Index Fundamentals
Concept. An index is an auxiliary data structure that maps a key (e.g. user_id) to the page where matching rows live, turning a full-table scan into a one-IO lookup.
Intuition. Without an index, finding Mickey's listens means scanning every row in a billion-row Listens table. With an index on user_id, the database looks up "Mickey lives on page 47," reads page 47, and returns Mickey's 3 listens. Same answer, 100,000× fewer IOs.
This page defines what every index stores. The next four pages each show a different way of organizing those (key, page#) pairs: hash indexes for equality, B+Trees for ranges, SSTables for immutable writes, and LSM trees for write-heavy workloads.
160,000
Pages in the Songs table
2
IOs with an index
80,000×
Speedup for equality queries
What's Inside an Index?
Two kinds of pages live on disk. They're both 64 MB blocks of bytes, but they hold very different things.
Data pages hold the rows themselves
- Each row is whatever size the row is. A song with lyrics is roughly 10 KB.
- At 1 M songs, that works out to 10 GB, which is about 160,000 pages.
Index pages hold only pointers
- Each entry is a
(search_key, page#)pair of 8 to 20 bytes. - At 1 M entries, that's 20 MB, which fits on a single page.
A pointer is the Page 47 part of an entry. It's a page number on disk (or a memory address in RAM), and it's the compact coordinate that gets you from the index straight to the data. Every index type on the rest of this module (hash, B+Tree, LSM) is a different way of organizing those same (key, page#) pairs.
Why the Index Is So Much Smaller
Figure: The Songs table on the left stores whole rows. The artist index on the right stores only (key, page#) pointers. The bottom bars show their sizes at the same horizontal scale.
A data page carries 10 KB rows. An index page carries 20-byte pointers. That is the 500× reduction: the same logical information stored in a different representation.
The payoff shows up on the query WHERE artist = 'Beatles'. Without an index, the database scans all 160,000 pages of Songs. With one, it reads a single index page to learn Beatles lives on page 47, then reads page 47. That's two IOs instead of 160,000, or 80,000× fewer reads for the same answer.
When to Add an Index, and When to Skip It
An index is a tradeoff: faster reads in exchange for extra storage and slower writes. It pays off on columns the query engine actually uses, and that narrow the result set down to a few rows.
Good candidates
- Primary keys. Indexed automatically in most databases.
- Foreign keys. Joins walk these constantly.
- High-selectivity columns with many unique values, like
emailoruser_id. - Frequently filtered columns that
WHEREclauses target.
Poor candidates
- Rarely queried columns. You pay the cost without getting the benefit.
- Low-selectivity columns such as a boolean, where the index points at half the table.
- Frequently updated columns. Every write rewrites the index too.
- Small tables (under about 1,000 rows), where a scan is already fast.
Key Takeaways
-
Indexes store
(key, page#)pointers, not data. That's why they're 500× smaller than the table they index. -
Two IOs beats 160,000. The win grows with table size, so bigger tables make indexes more valuable, not less.
-
Not every column deserves one. Indexes are rewarding to query and painful to write, so pick them where reads dominate writes.
Next
Hash Indexes → Constant-time lookup for equality queries. The simplest way to organize (key, page#) pairs.