Spotify End-to-End: How Everything Connects

Concept. A single click on Spotify's play button triggers a coordinated cascade across multiple subsystems (auth, recommendation, search, playback). Each is backed by a different storage and indexing strategy chosen for its specific workload.

Intuition. When Mickey hits play, Postgres authenticates him, an LSM-backed events store records the click, a vector search returns recommended next songs, a B+Tree-indexed catalog fetches the audio file, and a CDN streams it. All within 2 seconds.

One Play Button, Every Algorithm

When you hit play on "Anti-Hero," that single click sets off a chain reaction through every algorithm we've covered in this course.

1 click · user action
3 systems · OLTP → ETL → OLAP
8 algorithms · from this course
2 seconds · to CEO dashboard

The Journey

Figure: one click triggers three systems. Act 1 is OLTP (LSM tree write), Act 2 is ETL (Spark hash partition, BigSort, broadcast join, columnar write), and Act 3 is OLAP (star-schema query with broadcast and columnar scan).


Act 1 · The Click (OLTP)

You tap play. In ~8 ms, this happens:

INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123456, 789, NOW());

The write lands in an in-memory MemTable and returns instantly; it is flushed to disk later as part of an SSTable. No random disk I/O touches the hot path, which is why Spotify can sustain 100K writes/sec per node.
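To make that concrete, here is a toy sketch of the idea (illustrative only; the class and names like ToyLSMStore and flush_threshold are invented for this example, not any real engine's API): writes hit an in-memory table and are later written out sequentially as a sorted, immutable segment.

import json

# Toy sketch of the LSM write path (not a real engine): writes go to an
# in-memory memtable and return immediately; when the memtable grows past a
# threshold it is flushed to disk as a sorted, immutable segment (in a real
# engine this flush happens in the background).
class ToyLSMStore:
    def __init__(self, flush_threshold=2):
        self.memtable = {}            # absorbs writes in memory: the ~8 ms hot path
        self.segments = []            # on-disk SSTable-like files, sorted by key
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value    # O(1) in-memory write, no disk I/O
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write of a sorted segment: no random disk I/O.
        segment = sorted(self.memtable.items())
        path = f"segment_{len(self.segments)}.json"
        with open(path, "w") as f:
            json.dump(segment, f)
        self.segments.append(path)
        self.memtable = {}

store = ToyLSMStore()
store.put((123456, 789), "2024-01-15T10:00:00Z")   # the listen event from the INSERT above
store.put((123456, 790), "2024-01-15T10:03:00Z")   # second write triggers a sequential flush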

→ Algorithm: LSM Tree (lsm-indexes.html)

8 ms response · 100K writes/sec · 8.6B events/day


Act 2 · The Transform (ETL)

At 2 AM, Spark kicks in: 8.6 billion events from yesterday need to be processed into the analytics warehouse.

from pyspark.sql.functions import broadcast

(listens
    # 1. Hash Partition  →  spread across 1000 nodes
    .repartition(1000, "user_id")
    # 2. BigSort         →  sort within each partition
    .sortWithinPartitions("timestamp")
    # 3. Broadcast Join  →  small dim tables to every node
    .join(broadcast(songs), "song_id")
    # 4. Write Columnar  →  compress 100 TB → 10 TB
    .write.parquet("s3://warehouse/"))

Each step is one of the algorithms from Module 3:

Repartition by user_id → Hash Partitioning (hash-partitioning.html)
Sort within partitions → BigSort (big-sort.html)
Join dim tables → Broadcast Join (distributed-query.html)
Write Parquet → Columnar storage (storage-layout.html)
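As a rough single-machine sketch of the first two steps (hash partition, then sort within each partition): the event rows below are made up, and Spark's real partitioner uses its own hash function (Murmur3); md5 here is only for illustration.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4   # 1000 on the real cluster; 4 keeps the demo readable

def partition_for(user_id):
    # Hash the partition key and take it modulo the partition count, so every
    # event for the same user lands in the same partition (and on the same node).
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

events = [
    {"user_id": 123456, "song_id": 789, "timestamp": "2024-01-15T10:03:00Z"},
    {"user_id": 987654, "song_id": 42,  "timestamp": "2024-01-15T10:01:00Z"},
    {"user_id": 123456, "song_id": 555, "timestamp": "2024-01-15T10:00:00Z"},
]

partitions = defaultdict(list)
for event in events:                                  # 1. hash partition
    partitions[partition_for(event["user_id"])].append(event)

for pid, rows in sorted(partitions.items()):          # 2. sort within each partition
    rows.sort(key=lambda r: r["timestamp"])
    print(pid, [r["song_id"] for r in rows])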

3 hours total · 1000-node Spark cluster · 100 TB processed per day


Act 3 · The Dashboard (OLAP)

The CEO opens the mobile app. The "Top Artists" dashboard renders in 2 seconds:

-- "Top Artists by Country This Month"
SELECT artist, country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id   -- broadcast (10 MB)
WHERE f.date >= '2024-01-01'
GROUP BY artist, country
ORDER BY plays DESC
LIMIT 100;

The 10 MB dim_artists table is broadcast to all 1000 nodes (no shuffle). The 1 PB fact_plays table gets a columnar scan: only artist_id, country, and date are read out of its 20 columns.
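A minimal in-memory sketch of those two tricks (the rows and artist names below are made up; only the column names come from the query above): the small dimension table is just a hash map available everywhere, and the fact table is stored column by column so only the three needed columns are touched.

from collections import Counter

# Broadcast side: small enough to live in memory on every node.
dim_artists = {1: "Taylor Swift", 2: "Bad Bunny"}

# Columnar fact table: one array per column; the other 17 columns exist
# on disk but are never read for this query.
fact_plays = {
    "artist_id": [1, 1, 2, 1],
    "country":   ["US", "US", "MX", "UK"],
    "date":      ["2024-01-05", "2024-01-07", "2024-01-09", "2024-01-11"],
}

plays = Counter()
for artist_id, country, date in zip(fact_plays["artist_id"],
                                    fact_plays["country"],
                                    fact_plays["date"]):
    if date >= "2024-01-01":                  # WHERE f.date >= '2024-01-01'
        artist = dim_artists[artist_id]       # hash-map lookup = the broadcast join
        plays[(artist, country)] += 1         # GROUP BY artist, country

for (artist, country), count in plays.most_common(100):   # ORDER BY plays DESC LIMIT 100
    print(artist, country, count)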

→ Algorithms: Broadcast Join + Columnar Scan (distributed-query.html, storage-layout.html)

2-second query · 3 of 20 columns scanned · 1 PB warehouse


The Big Picture

Act 1 · Click: OLTP · hot algorithm LSM tree write · latency budget < 10 ms
Act 2 · Transform: ETL batch · hot algorithms hash partition, BigSort, broadcast join, columnar write · latency budget 3 hours
Act 3 · Dashboard: OLAP · hot algorithms broadcast join, columnar scan · latency budget 2 seconds

The pattern: opposite optimizations at opposite ends of the pipeline. Act 1 needs blazing-fast writes (LSM, no random I/O). Act 3 needs blazing-fast scans (columnar, compressed). Act 2 (ETL) does the format conversion in between. You don't pick one; you pick the right tool for each stage.


Next: Module 3 Capstone takes the same "right tool for the job" mindset and applies it to seven open-ended Spotify-style design problems.