LSM Tree Indexes

Concept. An LSM tree converts random writes into sequential disk writes by buffering inserts in an in-memory MemTable, then flushing them to disk as immutable SSTables that are merged in the background.

Intuition. When Spotify ingests 1 million listen events per second, a B+Tree can't keep up because every insert is a random disk write. LSM trees absorb all writes in RAM first, then dump each batch to disk as a single sequential file. Reads check RAM first, then the on-disk SSTables.

Prerequisites. SSTables introduces the immutable file format LSM trees are built from.


What Gets Stored

Every record is a (key, value, timestamp) triple. For a song-play counter, that looks like:

(S_102, +1, 13:42:05)

Song ID is the key, the increment is the value, and wall-clock time is the version. No row is ever updated in place. A new (S_102, +1, …) is appended for every play, and the "current state" of S_102 is the sum of all its values.
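
As a minimal sketch of the append-only idea in Python (the play_log list and helper names are hypothetical, not any engine's API):

    from datetime import datetime

    # Hypothetical append-only log of (key, value, timestamp) triples.
    play_log = []

    def record_play(song_id):
        # Never update in place; always append a fresh +1 fragment.
        play_log.append((song_id, 1, datetime.now()))

    def current_plays(song_id):
        # "Current state" is computed, not stored: sum every fragment for the key.
        return sum(value for key, value, _ in play_log if key == song_id)

    record_play("S_102")
    record_play("S_102")
    print(current_plays("S_102"))  # 2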


Writes: How Data Moves from RAM to Disk

Writes flow downward through three tiers. The caption below names each piece and walks through the live S_75 example.

LSM write path at 1:44 PM. Incoming plays accumulate in the red MemTable in RAM, flush every 15 minutes into green Level-0 SSTables (1:00, 1:15, and 1:30 .db) on disk, and merge hourly into a green compacted Level-1+ SSTable. Song S_75's purple play counts are spread across all three tiers.

No write ever does a random seek. The MemTable is RAM, the flush is one sequential append, and compaction is a streaming merge. This is how an LSM tree sustains a million writes per second on commodity disks.
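
The sketch below is one toy way to express that write path in Python, with a dict standing in for the MemTable and JSON files standing in for SSTables; the ToyLSM class, its threshold, and its file names are illustrative, not how any production engine works:

    import json

    class ToyLSM:
        # Buffer writes in RAM, then flush them as sorted, immutable files.

        def __init__(self, flush_threshold=3):
            self.memtable = {}            # in-RAM buffer: key -> accumulated value
            self.sstables = []            # paths of flushed, immutable SSTables
            self.flush_threshold = flush_threshold

        def write(self, key, value):
            # Every write lands in RAM; the hot path never seeks on disk.
            self.memtable[key] = self.memtable.get(key, 0) + value
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # One sequential append: dump the whole MemTable sorted by key.
            path = "sstable_%d.json" % len(self.sstables)
            with open(path, "w") as f:
                json.dump(dict(sorted(self.memtable.items())), f)
            self.sstables.append(path)
            self.memtable = {}

Compaction (not shown) would stream-merge several of these sorted files into one larger file, which is still a sequential read and a sequential write.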


Reads: Merge on Read

The real state of a key isn't stored anywhere. It's computed by walking every tier that might hold the key and combining the results.

Merge-on-read. A query for total plays of S_75 gets +2 from the red MemTable, +18 from the three green Level-0 SSTables, and +5,420 from the green Level-1+ SSTable, summed in purple to 5,440 total plays at 1:44 PM.

A naïve read would touch every SSTable. In practice each SSTable carries a Bloom filter, so tiers that cannot possibly contain the key are skipped without a disk read. Hot keys found in recent tiers are answered quickly; cold keys may have to walk deeper levels.
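
Here is a hedged sketch of that read path, reusing the S_75 numbers from the caption above; a plain Python set stands in for the Bloom filter, and all names are illustrative:

    def read_total(key, memtable, sstables):
        # Merge-on-read: sum every fragment of `key` across all tiers.
        # `sstables` is a list of (key_set, table) pairs; the key set stands in
        # for a Bloom filter (no false negatives, so a miss skips the table
        # without touching disk).
        total = memtable.get(key, 0)              # newest tier first
        for maybe_keys, table in sstables:
            if key not in maybe_keys:             # "Bloom filter" says no -> skip
                continue
            total += table.get(key, 0)
        return total

    memtable = {"S_75": 2}
    sstables = [
        ({"S_75", "S_102"}, {"S_75": 18, "S_102": 7}),   # recent Level-0 fragments
        ({"S_75"},          {"S_75": 5420}),             # compacted Level-1+ total
    ]
    print(read_total("S_75", memtable, sstables))        # 5440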


How "Merge" Is Defined

Because LSM reads compute state from fragments, the "merge" step is really an application-specific reduction. Four patterns cover most systems:

Pattern            Merge logic                      Example
Counter            sum(v₁, v₂, …)                   Song play counts. Sum the +1 fragments.
Latest-wins        pick max(timestamp)              User profile updates. Newest overwrites.
Statistical        sum(sums), sum(counts)           Rolling averages. Keep running totals.
TTL / retention    filter(time > now − window)      Session logs. Drop events older than 24 hours.
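
One way to make these patterns concrete is to write each merge as a small reduction over a key's (value, timestamp) fragments; the function names below are illustrative, not taken from any particular system:

    from datetime import datetime, timedelta

    # Each pattern is just a different reduction over one key's fragments.

    def counter_merge(fragments):
        # fragments: [(value, timestamp), ...] -> total count
        return sum(value for value, _ in fragments)

    def latest_wins_merge(fragments):
        # keep the value carried by the newest timestamp
        return max(fragments, key=lambda f: f[1])[0]

    def statistical_merge(fragments):
        # fragments carry (partial_sum, partial_count) values -> rolling average
        total = sum(s for (s, _), _ in fragments)
        count = sum(c for (_, c), _ in fragments)
        return total / count

    def ttl_merge(fragments, window=timedelta(hours=24)):
        # keep only fragments newer than the retention window
        cutoff = datetime.now() - window
        return [value for value, ts in fragments if ts > cutoff]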

Key Takeaways

  1. Sequential writes are the whole point. Buffer in RAM, flush as sorted files, never seek.

  2. State is computed, not stored. The "truth" is a merge across MemTable and SSTable tiers, and the merge function is the application's choice.

  3. Compaction is the tax. Cheap writes now are paid for by background merges that reclaim space and bound read cost.

  4. Production systems. Bigtable, Spanner, Cassandra, DynamoDB, and RocksDB all run on LSM trees underneath.


Next

You've finished the core indexing tour. The case studies that follow apply these four structures to real systems: Spotify search, Spotify's LSM deployment, Wrapped's yearly rollup, and a query disaster from a missing index.