LSM Tree Indexes
Concept. An LSM tree converts random writes into sequential disk writes by buffering inserts in a RAM MemTable, then flushing them as immutable SSTables that are merged in the background.
Intuition. When Spotify ingests a million listen events per second, a B+Tree can't keep up: every insert lands at a random position on disk. An LSM tree absorbs all writes into RAM first, then dumps them to disk as a single sequential file. Reads check RAM first, then the on-disk SSTables.
Prerequisites. SSTables introduces the immutable file format LSM trees are built from.
What Gets Stored
Every record is a (key, value, timestamp) triple. For a song-play counter, that looks like:
(S_102, +1, 13:42:05)
Song ID is the key, the increment is the value, and wall-clock time is the version. No row is ever updated in place. A new (S_102, +1, …) is appended for every play, and the "current state" of S_102 is whatever you get when you add all of its values together.
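The append-and-sum idea fits in a few lines. This is a minimal sketch (the `log` list and function names are illustrative, not a real storage engine):

```python
from time import time

# Append-only log of (key, value, timestamp) fragments.
log = []

def record_play(song_id):
    # Never update in place: append a new +1 fragment for every play.
    log.append((song_id, 1, time()))

def current_count(song_id):
    # "Current state" = the sum of all fragments for the key.
    return sum(value for key, value, _ts in log if key == song_id)

record_play("S_102")
record_play("S_102")
record_play("S_99")
print(current_count("S_102"))  # → 2
```

Note that the read does the work: the write path only ever appends.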
Writes: How Data Moves from RAM to Disk
Writes flow downward through three tiers: new records land in the in-RAM MemTable, a full MemTable is flushed to disk as an immutable SSTable, and background compaction merges SSTables into larger, older runs.
No write ever does a random seek. The MemTable is RAM, the flush is one sequential append, and compaction is a streaming merge. This is how an LSM tree sustains a million writes per second on commodity disks.
Reads: Merge on Read
The real state of a key isn't stored anywhere. It's computed by walking every tier that might hold the key and combining the results.
A naïve read would touch every SSTable. In practice each SSTable carries a bloom filter, so tiers that can't possibly contain the key are skipped without a disk read. Hot keys in recent SSTables answer fast, and cold keys may walk deep.
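A read with bloom-filter skipping might look like the sketch below. The `Bloom` class is a deliberately crude stand-in (a set of hash slots instead of a real bit array), and the tier layout is assumed for illustration:

```python
class Bloom:
    """Toy bloom filter: a set of hash slots stands in for the bit array."""
    def __init__(self, keys, slots=1024):
        self.slots = slots
        self.bits = {hash(k) % slots for k in keys}

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return hash(key) % self.slots in self.bits

def read(key, memtable, sstables):
    # Walk every tier that might hold the key, newest first.
    fragments = list(memtable.get(key, []))
    for bloom, run in sstables:
        if not bloom.might_contain(key):
            continue  # skip this tier without touching "disk"
        fragments.extend(run.get(key, []))
    return fragments  # caller applies the merge function (e.g. sum)
```

A false positive only costs one wasted lookup; a false negative would lose data, which is why bloom filters are the right shape for this job.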
How "Merge" Is Defined
Because LSM reads compute state from fragments, the "merge" step is really an application-specific reduction. Four patterns cover most systems:
| Pattern | Merge logic | Example |
|---|---|---|
| Counter | sum(v₁, v₂, …) | Song play counts. Sum the +1 fragments. |
| Latest-wins | pick max(timestamp) | User profile updates. Newest overwrites. |
| Statistical | sum(sums), sum(counts) | Rolling averages. Keep running totals. |
| TTL / retention | filter(time > now − window) | Session logs. Drop events older than 24 hours. |
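The four patterns in the table reduce to small functions over a list of fragments. A sketch, assuming fragments are `(value, timestamp)` pairs (with `(sum, count)` values for the statistical case):

```python
def merge_counter(frags):
    # Counter: sum the +1 fragments.
    return sum(value for value, _ts in frags)

def merge_latest_wins(frags):
    # Latest-wins: the newest timestamp overwrites everything older.
    return max(frags, key=lambda f: f[1])[0]

def merge_statistical(frags):
    # Statistical: fragments carry (sum, count); keep running totals.
    total = sum(s for (s, _c), _ts in frags)
    count = sum(c for (_s, c), _ts in frags)
    return total / count

def merge_ttl(frags, now, window):
    # TTL / retention: drop fragments older than the window.
    return [f for f in frags if f[1] > now - window]

plays = [(1, 100), (1, 200), (1, 300)]
print(merge_counter(plays))        # → 3
print(merge_ttl(plays, 350, 100))  # → [(1, 300)]
```

The storage engine stays the same across all four; only the reduction the application plugs in changes.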
Key Takeaways
- Sequential writes are the whole point. Buffer in RAM, flush as sorted files, never seek.
- State is computed, not stored. The "truth" is a merge across MemTable and SSTable tiers, and the merge function is the application's choice.
- Compaction is the tax. Cheap writes now are paid for by background merges that reclaim space and bound read cost.
- Production systems. Bigtable, Spanner, Cassandra, DynamoDB, and RocksDB all run on LSM trees underneath.
Next
You've finished the core indexing tour. The case studies that follow apply these four structures to real systems: Spotify search, Spotify's LSM deployment, Wrapped's yearly rollup, and a query disaster from a missing index.