LSM Tree Indexes

Concept. An LSM tree converts random writes into sequential disk writes by buffering inserts in an in-memory MemTable, then flushing them to disk as immutable SSTables that are merged in the background.

Intuition. When Spotify ingests 1 million listen events per second, a B+Tree can't keep up because every insert is a random disk write. LSM trees absorb all writes in RAM first, then dump each batch to disk as a single sequential file. Reads check RAM first, then the on-disk SSTables.

Prerequisites. SSTables introduces the immutable file format LSM trees are built from.


What Gets Stored

Every record is a (key, value, timestamp) triple. For a song-play counter, that looks like:

(S_102, +1, 13:42:05)

Song ID is the key, the increment is the value, and wall-clock time is the version. No row is ever updated in place. A new (S_102, +1, …) is appended for every play, and the "current state" of S_102 is the sum of all its values.
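
As a minimal sketch of the append-only idea in Python (the play_log list and helper names are hypothetical, not any engine's API):

    from datetime import datetime

    # Hypothetical append-only log of (key, value, timestamp) triples.
    play_log = []

    def record_play(song_id):
        # Never update in place; always append a fresh +1 fragment.
        play_log.append((song_id, 1, datetime.now()))

    def current_plays(song_id):
        # "Current state" is computed, not stored: sum every fragment for the key.
        return sum(value for key, value, _ in play_log if key == song_id)

    record_play("S_102")
    record_play("S_102")
    print(current_plays("S_102"))  # 2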


Writes: How Data Moves from RAM to Disk

Writes flow downward through three tiers. The caption below names each piece and walks through the live S_75 example.

LSM write path at 1:44 PM. Incoming plays accumulate in the red MemTable in RAM, flush every 15 minutes into green Level-0 SSTables (1:00, 1:15, and 1:30 .db) on disk, and merge hourly into a green compacted Level-1+ SSTable. Song S_75's purple play counts are spread across all three tiers.

No write ever does a random seek. The MemTable is RAM, the flush is one sequential append, and compaction is a streaming merge. This is how an LSM tree sustains a million writes per second on commodity disks.
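
The sketch below is one toy way to express that write path in Python, with a dict standing in for the MemTable and JSON files standing in for SSTables; the ToyLSM class, its threshold, and its file names are illustrative, not how any production engine works:

    import json

    class ToyLSM:
        # Buffer writes in RAM, then flush them as sorted, immutable files.

        def __init__(self, flush_threshold=3):
            self.memtable = {}            # in-RAM buffer: key -> accumulated value
            self.sstables = []            # paths of flushed, immutable SSTables
            self.flush_threshold = flush_threshold

        def write(self, key, value):
            # Every write lands in RAM; the hot path never seeks on disk.
            self.memtable[key] = self.memtable.get(key, 0) + value
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # One sequential append: dump the whole MemTable sorted by key.
            path = "sstable_%d.json" % len(self.sstables)
            with open(path, "w") as f:
                json.dump(dict(sorted(self.memtable.items())), f)
            self.sstables.append(path)
            self.memtable = {}

Compaction (not shown) would stream-merge several of these sorted files into one larger file, which is still a sequential read and a sequential write.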


Reads: Merge on Read

The real state of a key isn't stored anywhere. It's computed by walking every tier that might hold the key and combining the results.

Merge-on-read. A query for total plays of S_75 gets +2 from the red MemTable, +18 from the three green Level-0 SSTables, and +5,420 from the green Level-1+ SSTable, summed in purple to 5,440 total plays at 1:44 PM.

A naïve read would touch every SSTable. In practice each SSTable carries a Bloom filter, so tiers that cannot possibly contain the key are skipped without a disk read. Hot keys found in recent tiers are answered quickly; cold keys may have to walk deeper levels.
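
Here is a hedged sketch of that read path, reusing the S_75 numbers from the caption above; a plain Python set stands in for the Bloom filter, and all names are illustrative:

    def read_total(key, memtable, sstables):
        # Merge-on-read: sum every fragment of `key` across all tiers.
        # `sstables` is a list of (key_set, table) pairs; the key set stands in
        # for a Bloom filter (no false negatives, so a miss skips the table
        # without touching disk).
        total = memtable.get(key, 0)              # newest tier first
        for maybe_keys, table in sstables:
            if key not in maybe_keys:             # "Bloom filter" says no -> skip
                continue
            total += table.get(key, 0)
        return total

    memtable = {"S_75": 2}
    sstables = [
        ({"S_75", "S_102"}, {"S_75": 18, "S_102": 7}),   # recent Level-0 fragments
        ({"S_75"},          {"S_75": 5420}),             # compacted Level-1+ total
    ]
    print(read_total("S_75", memtable, sstables))        # 5440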


How "Merge" Is Defined

Because LSM reads compute state from fragments, the "merge" step is really an application-specific reduction. Four patterns cover most systems:

Pattern            Merge logic                      Example
Counter            sum(v₁, v₂, …)                   Song play counts. Sum the +1 fragments.
Latest-wins        pick max(timestamp)              User profile updates. Newest overwrites.
Statistical        sum(sums), sum(counts)           Rolling averages. Keep running totals.
TTL / retention    filter(time > now − window)      Session logs. Drop events older than 24 hours.
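
One way to make these patterns concrete is to write each merge as a small reduction over a key's (value, timestamp) fragments; the function names below are illustrative, not taken from any particular system:

    from datetime import datetime, timedelta

    # Each pattern is just a different reduction over one key's fragments.

    def counter_merge(fragments):
        # fragments: [(value, timestamp), ...] -> total count
        return sum(value for value, _ in fragments)

    def latest_wins_merge(fragments):
        # keep the value carried by the newest timestamp
        return max(fragments, key=lambda f: f[1])[0]

    def statistical_merge(fragments):
        # fragments carry (partial_sum, partial_count) values -> rolling average
        total = sum(s for (s, _), _ in fragments)
        count = sum(c for (_, c), _ in fragments)
        return total / count

    def ttl_merge(fragments, window=timedelta(hours=24)):
        # keep only fragments newer than the retention window
        cutoff = datetime.now() - window
        return [value for value, ts in fragments if ts > cutoff]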

Key Takeaways

  1. Sequential writes are the whole point. Buffer in RAM, flush as sorted files, never seek.

  2. State is computed, not stored. The "truth" is a merge across MemTable and SSTable tiers, and the merge function is the application's choice.

  3. Compaction is the tax. Cheap writes now are paid for by background merges that reclaim space and bound read cost.

  4. Production systems. Bigtable, Spanner, Cassandra, DynamoDB, and RocksDB all run on LSM trees underneath.


Next

You've finished the core indexing tour. The case studies that follow apply these four structures to real systems: Spotify search, Spotify's LSM deployment, Wrapped's yearly rollup, and a query disaster from a missing index.