Case Study 3.4: The Engineering of "Spotify Wrapped"
Concept. Spotify Wrapped pre-aggregates each user's full year of listening into a small per-user record overnight, so that the December 1 traffic spike reads from indexed rollups instead of scanning a year of raw plays.
Intuition. When 700 million users hit "show my Wrapped" on December 1, you cannot scan a year's worth of raw play events (roughly 365 billion rows) in real time. Instead, every night a batch job aggregates each user's listens into a single row keyed by user_id, indexed for instant lookup.
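To make the idea concrete, here is a minimal Python sketch of the batch phase, assuming a toy (user_id, track_id, listen_time) schema; the field names, record shapes, and values are illustrative, not Spotify's actual pipeline:

```python
from collections import Counter, defaultdict

# Raw play events: (user_id, track_id, listen_time). Values are made up.
plays = [
    (42, "track_a", "2024-07-12T21:04"),
    (42, "track_a", "2024-08-01T09:30"),
    (42, "track_b", "2024-08-02T10:00"),
    (7,  "track_c", "2024-09-15T18:45"),
]

def build_wrapped_rollups(plays, top_n=5):
    """Batch phase: collapse a year of plays into one small row per user."""
    per_user = defaultdict(Counter)
    for user_id, track_id, _listen_time in plays:
        per_user[user_id][track_id] += 1
    # One row per user_id: total plays plus top tracks, ready for O(1) lookup.
    return {
        user_id: {
            "total_plays": sum(counts.values()),
            "top_tracks": counts.most_common(top_n),
        }
        for user_id, counts in per_user.items()
    }

wrapped = build_wrapped_rollups(plays)
print(wrapped[42])  # {'total_plays': 3, 'top_tracks': [('track_a', 2), ('track_b', 1)]}
```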
Two Phases: Batch and Online
Spotify Wrapped runs in two phases. The batch phase runs overnight and aggregates a year of play events into per-user summaries. The online phase serves those summaries to 700 million clients when the feature launches. The two phases have completely different access patterns, and they need different indexes.
Phase 1: Offline Aggregation
Before December, Spotify walks the Listens table to tally each user's top tracks. Say you logged 1,000 listens this year. How quickly can the system find and process them?
Option A: Unclustered Storage
The table is stored chronologically, in the order songs were played. Even with a secondary index on user_id, the index only tells the engine where your rows live; it still has to fetch your 1,000 plays from 1,000 different pages scattered across disk. That's 1,000 random seeks per user, which is catastrophic at scale: calculating Wrapped for everyone would take weeks.
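A back-of-envelope sketch of why this is catastrophic, assuming roughly 10 ms per random seek on a spinning disk and a hypothetical fleet of 1,000 disks seeking in parallel (both numbers are assumptions for illustration):

```python
SEEK_TIME_S = 0.010        # ~10 ms per random seek on spinning disk (assumed)
PLAYS_PER_USER = 1_000
USERS = 700_000_000
DISKS = 1_000              # assumed fleet size for parallelism

seconds_per_user = PLAYS_PER_USER * SEEK_TIME_S        # 10 s of pure seeking
total_seconds = seconds_per_user * USERS / DISKS
print(f"{total_seconds / 86_400:.0f} days of seek time")  # ~81 days: weeks, even parallelized
```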
Option B: Clustered Storage
Spotify reorders the trillion-row Listens table by (user_id, listen_time), creating a clustered index on those columns. Clustering means the table is physically sorted on disk in this order.
Now your 1,000 plays sit together on the same 64 MB page (or a small handful of neighbouring pages). The engine does one seek and then a sequential scan: it reads your page, finds your data, and keeps reading, because every record that follows is guaranteed to be yours until the user_id changes. The same B+Tree index also serves date-specific lookups like "what did I play on July 12 at 9 PM," since the B+Tree gets you to the right spot and the clustering keeps nearby data nearby.
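A minimal sketch of the clustered lookup, using a sorted Python list with bisect as a stand-in for the B+Tree descent (rows are illustrative):

```python
import bisect

# Listens physically sorted by (user_id, listen_time): the clustering order.
# Rows are (user_id, listen_time, track_id).
listens = sorted([
    (7,  "2024-09-15T18:45", "track_c"),
    (42, "2024-07-12T21:04", "track_a"),
    (42, "2024-08-01T09:30", "track_a"),
    (42, "2024-08-02T10:00", "track_b"),
])

def plays_for_user(listens, user_id):
    """One B+Tree-style descent to the first matching row, then a sequential
    scan: every following row belongs to the same user until user_id changes,
    so the reads hit contiguous pages instead of random ones."""
    start = bisect.bisect_left(listens, (user_id,))
    end = bisect.bisect_left(listens, (user_id + 1,))
    return listens[start:end]

print(plays_for_user(listens, 42))  # three contiguous rows, one logical seek
```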
Phase 2: Online Delivery
Once summaries are computed, they're stored in a Wrapped table. Two different access patterns show up, and they each want a different index.
The Wrapped Card: Hash Index on user_id
When you tap your Wrapped card, the system fetches your precomputed summary. A hash index on user_id gives O(1) lookup, which is what you need when millions of users are tapping the card at the same second.
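A plain Python dict is a fair stand-in for the hash index here; the point is that lookup cost does not grow with table size (the table contents are made up for illustration):

```python
# A dict behaves like a hash index: user_id hashes straight to a bucket.
wrapped_table = {
    42: {"minutes": 41_250, "top_artist": "Taylor Swift"},
    7:  {"minutes": 12_003, "top_artist": "Radiohead"},
}

def get_wrapped_card(user_id):
    # O(1) expected: one hash, one bucket probe, no tree traversal.
    return wrapped_table.get(user_id)

print(get_wrapped_card(42))
```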
Recently Played: B+Tree on a Clustered Table
Some users drill deeper and want their full "Recently Played" history from the raw Listens table. A hash index can't handle a range scan like "show all songs I played in December." The B+Tree on the clustered Listens table can: it finds the start of December quickly, and because the data is clustered, the scroll across leaves reads contiguous pages.
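The same sorted-order trick sketches the range scan, again with bisect standing in for the B+Tree's descent and leaf walk (data illustrative):

```python
import bisect

# Clustered order as before: (user_id, listen_time, track_id).
listens = [
    (42, "2024-07-12T21:04", "track_a"),
    (42, "2024-11-30T23:59", "track_b"),
    (42, "2024-12-03T08:15", "track_c"),
    (42, "2024-12-24T19:00", "track_d"),
]

def plays_in_range(listens, user_id, start_time, end_time):
    """The range scan a hash index cannot do: seek once to
    (user_id, start_time), then walk forward through contiguous rows."""
    lo = bisect.bisect_left(listens, (user_id, start_time))
    hi = bisect.bisect_left(listens, (user_id, end_time))
    return listens[lo:hi]

# "Show all songs I played in December."
print(plays_in_range(listens, 42, "2024-12-01", "2025-01-01"))
```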
The Clustering Tradeoff: One Dimension at a Time
A table can only be physically clustered by one dimension at a time, because data cannot sit on disk in two different orders simultaneously. This forces a design choice.
- Clustering by (user_id, listen_time) makes Wrapped fast and personal-history deep-dives fast.
- Clustering by (artist_id, user_id) would make "Top 10 Fans of Taylor Swift" fast instead, but Wrapped becomes unclustered (and therefore slow).
At Spotify's scale, the common answer is to maintain multiple physical copies of the same table, each clustered differently to support a different feature. Storage is cheap; disk seeks at query time are not.
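A sketch of the multiple-copies pattern, with made-up rows: the same data materialized twice, each copy sorted for a different dominant query:

```python
# Each row: (artist_id, user_id, listen_time). Values are illustrative.
rows = [
    ("taylor_swift", 42, "2024-12-01"),
    ("radiohead",    7,  "2024-09-15"),
    ("taylor_swift", 7,  "2024-10-02"),
]

# Copy 1: clustered by (user_id, listen_time) -> fast Wrapped and history scans.
by_user = sorted((u, t, a) for (a, u, t) in rows)

# Copy 2: clustered by (artist_id, user_id) -> fast "top fans of an artist".
by_artist = sorted(rows)

# The tradeoff: every insert now lands in both copies (double the write cost),
# but each read pattern gets sequential IO instead of random seeks.
```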
Key Takeaways
- Pre-aggregate when the workload is predictable. The entire Wrapped rollout depends on doing the expensive work overnight, so the online path is a tiny indexed lookup.
- Clustering turns random IO into sequential IO. A user_id-clustered Listens table puts a user's 1,000 plays on one page instead of 1,000.
- Pick the index to match the access pattern. Hash indexes for point lookups, B+Trees for ranges, and cluster the table to match whichever access pattern dominates.
Next
Putting It All Together: The $550,000 Query → Four optimizations (indexes, columnar storage, compression, page size) stacked on a single query take it from $550K a year to under $400.