B+Tree Indexes

Concept. A B+Tree index keeps keys sorted in a balanced tree, supporting O(log N) point lookups and range scans (WHERE rating BETWEEN 4.0 AND 5.0) by walking the leaf pages in order.

Intuition. When you query "find Listens with rating between 4.0 and 5.0," a B+Tree on rating walks down the tree to the first 4.0, then reads leaf pages in order until it passes 5.0. That is one descent followed by a scan, not a tree walk per row.

Prerequisites. Index Fundamentals introduces (key, page#) pairs and pointers.


What a B+Tree Stores

A B+Tree sits on top of the table's data pages. Three layers of index nodes (root, internal, leaf) route a lookup down to the right page, and the leaves are linked left-to-right so a range scan can stream across them without going back to the root.

B+Tree on Songs.release_date_since_1970. Five green data pages on the left hold rows. The tree on the right has a green root, three green internal nodes, and six green leaves. The purple trace shows the lookup path for key 12890: root → middle internal → leaf 11234–13788 → Page 3, where row 12: As It Was (12890) lives.

Every layer is one IO. A billion-row table with fanout around 1,000 fits in three levels (root, internal, leaf). Add one more IO to read the data page itself, and a point lookup is roughly 4 IOs regardless of table size.


Walking a Range Query

The B+Tree finds the starting leaf in O(log N) IOs, then streams across leaves until it passes the end key. The trace below walks 12000 ≤ release_date ≤ 14000.

Walking the range 12000 to 14000. The root chooses the middle subtree, the middle internal node chooses leaf 3, the query reads leaves 3 and 4 in order, then peeks at leaf 5 whose first key 14456 exceeds 14000 and the scan stops. Data pages 3 and 4 are fetched. Total 7 IOs versus 160,000 for a full scan.

The result is mostly sequential IO once the descent reaches the starting leaf, which is what makes range scans cheap on a B+Tree.


Why B+Trees Are Shallow: Fanout

Fanout is how many children an internal node has, and it's set by how many (key, pointer) pairs fit on one page. High fanout means a shallow tree, and a shallow tree means few IOs per lookup.

Fanout = ⌊page size / (key size + pointer size)⌋
Height = ⌈log_fanout(number of keys)⌉

A 4 KB page with 4-byte keys and 4-byte pointers gives a fanout of 512. An 8 KB page gives roughly 1,024. At fanout 1,000, a billion keys need only three levels. That is why B+Tree lookups feel constant-time even as tables grow.

How Page and Key Size Drive Tree Height

One trillion keys stored four different ways. Read down the Page-size column to see the effect of moving from a 64 KB to a 64 MB page.

Page size Key size Fanout Height IOs / lookup
64 KB 8 B 4,096 4 4
64 KB 1 KB 63 7 7
64 MB 8 B 4,194,304 2 2
64 MB 1 KB 65,027 3 3

A trillion-key index with 1 KB search keys goes from 7 IOs on a 64 KB page to just 3 IOs on a 64 MB page. Same tree structure, different arithmetic. The tradeoff is that larger pages use more memory per node, but they give you a far shallower tree. Real systems pick the page size that fits their workload: PostgreSQL ships with 8 KB, MySQL InnoDB with 16 KB, and analytical engines like BigQuery and Snowflake use MB-sized columnar blocks.


What B+Trees Handle Well, and What They Don't

Handles well

WHERE user_id = 42
WHERE rating BETWEEN 4.0 AND 5.0
WHERE name LIKE 'Ali%'
ORDER BY created_at

Equality, ranges, prefix matching, and sorted scans. Anything that benefits from keys being in order.

Doesn't help

WHERE LOWER(name) = 'alice'
WHERE name LIKE '%son'
WHERE age != 30

Functions on the indexed column (the index is on name, not LOWER(name)), suffix matches, and non-equality !=. The tree can't narrow the search space, so it falls back to a scan.


Key Takeaways

  1. Fanout is the whole story. Fanout of 1,000 means log₁,₀₀₀(N) levels, which is three for a billion rows.

  2. One descent, then a scan. Range queries walk the linked leaves, with no per-row tree traversal.

  3. Match the index to the query shape. B+Trees shine on =, BETWEEN, LIKE 'prefix%', and ORDER BY. They can't help when you wrap the column in a function.


Next

SSTables → The immutable sorted file format that write-heavy stores use instead of a B+Tree on disk.