Case Study 3.3: Tracking High-Volume Activity with LSM Trees
Concept. Spotify tracks billions of "like" and "play" events using LSM-tree-backed key-value stores (Cassandra, Bigtable), which sustain millions of writes per second by appending to an in-memory MemTable and flushing sequentially.
Intuition. When 700 million users all like and play songs continuously, a conventional B+Tree-indexed relational table cannot keep up with the write rate. Spotify routes the firehose into an LSM-backed cluster. Every event is buffered in RAM, then periodically flushed as an immutable SSTable. Reads aggregate across the MemTable and the SSTables.
The Bottleneck: Random Writes in B+Trees
In a B+Tree, each update requires a disk seek to find the right leaf node. Updating the play count for any of 100 million songs means descending the tree, loading the right page, and rewriting the value. Multiply that by millions of updates per second and the disk controller, not the CPU, becomes the limit.
LSM trees sidestep this by never doing a random write. Refer back to the LSM Tree Indexes page for the machinery (MemTable, flush, compaction). The rest of this page focuses on how Spotify actually uses that machinery for three distinct workloads.
Use Case 1: Counters (Play Counts, Likes)
Every "Like" and every "Play" is appended to the MemTable as a fragment: (song_id, +1, timestamp). To get the current total, the query sums matching fragments across the MemTable and every SSTable that might hold them. The merge function is simply SUM.
This is the pattern the core LSM page uses as its running example. Nothing special about Spotify's implementation beyond scale.
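To make the merge-on-read concrete, here is a minimal Python sketch of the counter pattern. It is an illustration of the idea, not Spotify's or Cassandra's actual code: a tiny in-memory MemTable, a flush threshold, and a read path whose merge function is SUM across every segment.

```python
# Minimal sketch of the counter pattern (illustrative only): increments are
# appended as fragments and summed at read time across MemTable and SSTables.
MEMTABLE_LIMIT = 4   # tiny threshold so the example flushes quickly

memtable = []        # newest fragments, held in RAM
sstables = []        # immutable, already-flushed fragment batches, oldest first

def record_play(song_id: str) -> None:
    """Append a +1 fragment; flush the MemTable sequentially when it fills up."""
    memtable.append((song_id, +1))
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(tuple(memtable))   # sequential, immutable flush
        memtable.clear()

def play_count(song_id: str) -> int:
    """Merge-on-read: SUM every matching fragment across all segments."""
    total = sum(delta for key, delta in memtable if key == song_id)
    for table in sstables:
        total += sum(delta for key, delta in table if key == song_id)
    return total

for _ in range(10):
    record_play("song:42")
print(play_count("song:42"))   # -> 10, even though fragments span three segments
```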
Use Case 2: Latest-Wins (Last-Played Position)
When you pause a 3-hour podcast, Spotify has to save your exact timestamp. With millions of users pausing and resuming constantly, random B+Tree updates would saturate the disk. Instead, Spotify appends (user_id, podcast_id, position) to the MemTable. When you resume, it scans newest to oldest (MemTable first, then SSTables in reverse chronological order), grabs the first version it finds, and ignores every older timestamp buried in older files.
The merge function here is pick-latest. Old versions aren't wrong, just superseded, and compaction eventually cleans them up.
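A minimal sketch of the pick-latest merge, assuming an illustrative (user_id, podcast_id) → position layout rather than Spotify's real schema:

```python
# Minimal sketch of latest-wins reads (illustrative only): the newest version
# shadows every older one, and compaction would eventually discard the rest.
memtable = {}        # newest versions, in RAM
sstables = []        # flushed, immutable segments, oldest first

def save_position(user_id: str, podcast_id: str, position_s: int) -> None:
    memtable[(user_id, podcast_id)] = position_s   # buffered write, RAM only

def flush() -> None:
    """Flush the MemTable as one immutable segment (a sequential write)."""
    sstables.append(dict(memtable))
    memtable.clear()

def resume_position(user_id: str, podcast_id: str):
    """Scan newest to oldest and return the first version found."""
    key = (user_id, podcast_id)
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):   # newest SSTable first
        if key in table:
            return table[key]          # older versions are simply ignored
    return None

save_position("alice", "pod:99", 5400)
flush()
save_position("alice", "pod:99", 7200)     # newer version lives in the MemTable
print(resume_position("alice", "pod:99"))  # -> 7200
```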
Use Case 3: Immutable Events (Friend Activity Feed)
Spotify's Friend Activity feed tracks what your friends are listening to right now. These events are never updated, only appended. Every play is a fast memory write to the log. The dashboard renders the feed by doing a merge-on-read across the most recent SSTable segments.
No write locks, no contention, no coordination. The same append-and-forget pattern that powers counters also powers timelines.
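The same skeleton carries over to the feed; only the merge function changes. The sketch below (illustrative names and layout, not Spotify's) gathers events from the MemTable and the newest segments and sorts them by timestamp:

```python
# Minimal sketch of the append-only feed (illustrative only): events are never
# updated, and the feed is assembled at read time from the newest segments.
import time

memtable = []        # in-flight events, in RAM
sstables = []        # flushed, immutable segments, oldest first

def log_play(user_id: str, song_id: str) -> None:
    memtable.append({"user": user_id, "song": song_id, "ts": time.time()})

def flush() -> None:
    sstables.append(tuple(memtable))
    memtable.clear()

def friend_feed(friends: set, recent_segments: int = 2) -> list:
    """Merge-on-read across the MemTable and the most recent segments."""
    events = list(memtable)
    for table in sstables[-recent_segments:]:
        events.extend(table)
    return sorted(
        (e for e in events if e["user"] in friends),
        key=lambda e: e["ts"],
        reverse=True,               # newest activity first
    )

log_play("bob", "song:7")
flush()
log_play("carol", "song:8")
print([e["song"] for e in friend_feed({"bob", "carol"})])  # -> ['song:8', 'song:7']
```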
Use Case 4: Deletes via Tombstones
SSTables are immutable, so you can't erase data from a file that's already on disk. When you unlike a song, Spotify writes a tombstone, a special marker recording that the key was deleted. On read, if the system encounters a tombstone before the original entry, it returns "not found" even though the older Like still physically exists in a deeper SSTable. During the next compaction, the system matches the tombstone against the old log entry and permanently drops both from the merged file.
The tombstone is the LSM equivalent of DELETE. It trades instant physical deletion for a delayed cleanup during compaction.
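The sketch below illustrates both halves of that trade: a read path in which a tombstone shadows an older Like, and a toy compaction pass that drops both. The key format and data structures are assumptions for illustration, not Cassandra's or Bigtable's on-disk format.

```python
# Minimal sketch of tombstone reads and compaction (illustrative only).
TOMBSTONE = object()   # special marker meaning "this key was deleted"

memtable = []                              # newest entries, in RAM
sstables = [[("like:alice:song:42", 1)]]   # an old Like already flushed to disk

def unlike(key: str) -> None:
    memtable.append((key, TOMBSTONE))      # append a tombstone, never erase in place

def is_liked(key: str) -> bool:
    """Newest entry wins: a tombstone hides older entries in deeper SSTables."""
    for segment in [memtable] + sstables[::-1]:   # newest segment first
        for k, v in reversed(segment):            # newest entry within a segment
            if k == key:
                return v is not TOMBSTONE
    return False

def compact() -> list:
    """Merge all segments; drop each tombstone and the entries it cancels."""
    latest = {}
    for segment in sstables + [memtable]:         # oldest to newest, newest wins
        for k, v in segment:
            latest[k] = v
    return [(k, v) for k, v in latest.items() if v is not TOMBSTONE]

unlike("like:alice:song:42")
print(is_liked("like:alice:song:42"))   # -> False, though the old Like still sits on disk
print(compact())                        # -> [] : tombstone and old Like both dropped
```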
Key Takeaways
- Append, don't overwrite. Treating the database as a log (append-only) instead of a ledger (overwriting) is what keeps write performance flat regardless of data size.
- The merge function is the API. SUM for counters, pick-latest for position tracking, append for feeds, and tombstones for deletes. All four run on the same underlying LSM storage.
- Compaction is where garbage is collected. Old versions and tombstones stay on disk until a background merge rewrites the file without them.
Next
Case Study 3.4: Spotify Wrapped → When the read side needs to serve 700 million personalized summaries in a single day, the answer is to pre-aggregate everything overnight.