B+Tree Indexes

Concept. A B+Tree index keeps keys sorted in a balanced tree, supporting O(log N) point lookups and range scans (WHERE rating BETWEEN 4.0 AND 5.0) by walking the leaf pages in order.

Intuition. When you query "find Listens with rating between 4.0 and 5.0," a B+Tree on rating walks down the tree to the first 4.0, then reads leaf pages in order until it passes 5.0. That is one descent followed by a scan, not a tree walk per row.

Prerequisites. Index Fundamentals introduces (key, page#) pairs and pointers.

What a B+Tree Stores

A B+Tree sits on top of the table's data pages. Three layers of index nodes (root, internal, leaf) route a lookup down to the right page, and the leaves are linked left-to-right so a range scan can stream across them without going back to the root.

B+Tree on Songs.release_date_since_1970. Five green data pages on the left hold rows. The tree on the right has a green root, three green internal nodes, and six green leaves. The purple trace shows the lookup path for key 12890: root → middle internal → leaf 11234–13788 → Page 3, where row 12: As It Was (12890) lives.

Figure

B+Tree on Songs.release_date_since_1970. Index built on top of the table's data pages.

Data pages (disk): the underlying rows of the Songs table. Each page holds a handful of tuples (Page 0 has 3 songs, Page 3 has 4, etc.). The index sits on top of these.
B+Tree index on data pages (disk): a tree of pages whose leaves point into the data pages above. Three levels:
- Root: 3 separator keys + 4 child pointers. Routes the lookup down to the right subtree.
- Internal nodes: same shape as root (keys + child pointers). Route the lookup further.
- Leaf nodes: hold (key, data-page pointer) pairs, sorted, linked left-to-right for sequential range scans.
Live example: lookup release_date = 12890 ("As It Was").
- Root: 12890 sits in 5000..14999, follow middle pointer.
- Internal: 12890 sits in 11234..13788, follow that pointer.
- Leaf 11234..13788: matches entry 12890 → Pg 3.
- Read Page 3, find row 12: As It Was (12890). 4 page reads total.
Fanout in this diagram = 4 (3 keys + 4 child pointers per node, kept small for clarity). Real databases use fanout in the thousands to millions, depending on page size (64 KB to 64 MB). With fanout ≈ 1,000, a billion-key table fits in just 3 tree levels: height = ⌈log_fanout(N)⌉.

Every layer is one IO. A billion-row table with fanout around 1,000 fits in three levels (root, internal, leaf). Add one more IO to read the data page itself, and a point lookup is roughly 4 IOs regardless of table size.

Walking a Range Query

The B+Tree finds the starting leaf in O(log N) IOs, then streams across leaves until it passes the end key. The trace below walks 12000 ≤ release_date ≤ 14000.

Walking the range 12000 to 14000. The root chooses the middle subtree, the middle internal node chooses leaf 3, the query reads leaves 3 and 4 in order, then peeks at leaf 5 whose first key 14456 exceeds 14000 and the scan stops. Data pages 3 and 4 are fetched. Total 7 IOs versus 160,000 for a full scan.

Figure

Range query 12000 ≤ release_date ≤ 14000 on the same B+Tree.

B+Tree nodes (disk): solid green nodes are touched by the query, dashed-gray nodes are skipped entirely.
Data pages (disk): only the pages with matching rows are fetched (Page 3 and Page 4).
Live trace (purple arrows + labels):
- Step 1, root. 12000 sits in 5000..14999, descend into the middle internal.
- Step 2, internal. 12000 sits in 11234..13788, descend into Leaf 3.
- Step 3, leaves 3 → 4. Walk the linked-leaf chain. Leaf 3 has 12890, 13456 (both in range). Leaf 4 has 13789, 13980 (both in range).
- Stop check, peek leaf 5. First key 14456 exceeds 14000, so stop.
- Fetch. Read Page 3 and Page 4 to get the matching rows.
Cost. 5 index IOs + 2 data IOs = 7 total. A full table scan would be roughly 160,000 IOs. About 23,000× faster.

The result is mostly sequential IO once the descent reaches the starting leaf, which is what makes range scans cheap on a B+Tree.

Why B+Trees Are Shallow: Fanout

Fanout is how many children an internal node has, and it's set by how many (key, pointer) pairs fit on one page. High fanout means a shallow tree, and a shallow tree means few IOs per lookup.

Fanout = ⌊page size / (key size + pointer size)⌋
Height = ⌈log_fanout(number of keys)⌉

A 4 KB page with 4-byte keys and 4-byte pointers gives a fanout of 512. An 8 KB page gives roughly 1,024. At fanout 1,000, a billion keys need only three levels. That is why B+Tree lookups feel constant-time even as tables grow.

How Page and Key Size Drive Tree Height

One trillion keys stored four different ways. Read down the Page-size column to see the effect of moving from a 64 KB to a 64 MB page.

Page size	Key size	Fanout	Height	IOs / lookup
64 KB	8 B	4,096	4	4
64 KB	1 KB	63	7	7
64 MB	8 B	4,194,304	2	2
64 MB	1 KB	65,027	3	3

A trillion-key index with 1 KB search keys goes from 7 IOs on a 64 KB page to just 3 IOs on a 64 MB page. Same tree structure, different arithmetic. The tradeoff is that larger pages use more memory per node, but they give you a far shallower tree. Real systems pick the page size that fits their workload: PostgreSQL ships with 8 KB, MySQL InnoDB with 16 KB, and analytical engines like BigQuery and Snowflake use MB-sized columnar blocks.

What B+Trees Handle Well, and What They Don't

Handles well

WHERE user_id = 42
WHERE rating BETWEEN 4.0 AND 5.0
WHERE name LIKE 'Ali%'
ORDER BY created_at

Equality, ranges, prefix matching, and sorted scans. Anything that benefits from keys being in order.

Doesn't help

WHERE LOWER(name) = 'alice'
WHERE name LIKE '%son'
WHERE age != 30

Functions on the indexed column (the index is on name, not LOWER(name)), suffix matches, and non-equality !=. The tree can't narrow the search space, so it falls back to a scan.

Key Takeaways

Fanout is the whole story. Fanout of 1,000 means log₁,₀₀₀(N) levels, which is three for a billion rows.
One descent, then a scan. Range queries walk the linked leaves, with no per-row tree traversal.
Match the index to the query shape. B+Trees shine on =, BETWEEN, LIKE 'prefix%', and ORDER BY. They can't help when you wrap the column in a function.

SSTables → The immutable sorted file format that write-heavy stores use instead of a B+Tree on disk.