Case Study 3.2: How does Spotify support text search?

Concept. Spotify supports text search across 100 million tracks using an inverted index, a per-word lookup table that maps each search term to the list of song_ids containing it.

Intuition. Without an index, a search for "Yesterday" would read every row in a 100-million-song Songs table. An inverted index looks up "Yesterday" once and returns a short list of matching song_ids, so the database only reads those few rows.

Case Study 3.2 Reading Time: 6 mins

The Solution: An Inverted Index

An inverted index works like the index at the back of a textbook. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears. The figure walks through a live example, the query "Crazy Heart".

Figure

Inverted index over the Songs table. Two structures on disk, plus a runtime intersection.

Term dictionary (disk, B+Tree): every unique word in the table as a sorted key. Looking up a word is a normal B+Tree descent.
Postings lists (disk, packed arrays): each term points at the sorted song IDs that contain that word.
Sorted-list intersection (RAM): walk two sorted lists in lockstep, keep only IDs that appear in both. This is the list equivalent of a logical AND, and it is pure in-memory work with no disk seeks.
Live example, query "Crazy Heart":
- Look up "Crazy" in the term dictionary, fetch its postings list: [12, 45, 108, 145, 201, …].
- Look up "Heart" similarly: [8, 12, 88, 108, 180, …].
- Intersect the two sorted lists in RAM: {12, 108} appear in both.
- Read the 2 data pages for songs 12 and 108. About 10 IOs total.
Why it scales. A full-table scan would read all 100 million rows. The inverted index turns text search into a B+Tree descent per term plus a sorted-list intersection, so this query costs about 10 IOs instead of roughly 100 million row reads, on the order of 10,000,000× fewer.
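The live example above can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not Spotify's implementation: the song titles and the function names (`build_inverted_index`, `intersect`, `search`) are made up for the example, a plain dict stands in for the on-disk term dictionary, and tokenization is a simple lowercase split.

```python
from collections import defaultdict

def build_inverted_index(songs):
    """Map each lowercased title word to a sorted list of song IDs."""
    postings = defaultdict(set)
    for song_id, title in songs.items():
        for term in title.lower().split():
            postings[term].add(song_id)
    # Postings lists are kept sorted so they can be intersected in lockstep.
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(a, b):
    """Walk two sorted ID lists in lockstep; keep IDs present in both."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query):
    """Return IDs of songs whose titles contain every query term."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

# Hypothetical toy data; the IDs echo the figure, the titles are invented.
songs = {
    8: "Heart of Gold",
    12: "Crazy Heart",
    45: "Crazy in Love",
    108: "Crazy Little Heart",
    145: "Crazy",
}
index = build_inverted_index(songs)
print(index["crazy"])                # [12, 45, 108, 145]
print(index["heart"])                # [8, 12, 108]
print(search(index, "Crazy Heart"))  # [12, 108]
```

The intersection touches each list element at most once, so its cost grows with the combined length of the two postings lists rather than with the size of the Songs table.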

Why a B+Tree for the Term Dictionary

The caption above lays out the structure and the live example. One detail worth pulling out: the term dictionary is a B+Tree, not a hash index. Users often type prefixes ("Craz" before they finish typing "Crazy"), and because a B+Tree keeps its keys sorted, it supports both exact and prefix lookups. A hash index would scatter neighbouring terms across unrelated buckets, so it could answer exact matches but could not auto-complete.
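A minimal sketch of why sorted keys make prefix lookups cheap. It uses a sorted Python list with binary search as a stand-in for the sorted leaf level of a B+Tree; the helper name `prefix_lookup` and the term list are hypothetical. The idea is the same as in the tree: jump to the first key at or after the prefix, then read neighbours until the prefix stops matching.

```python
from bisect import bisect_left

def prefix_lookup(sorted_terms, prefix):
    """Return every term that starts with `prefix` from a sorted term list."""
    # Binary search to the first term >= prefix (the "descent").
    i = bisect_left(sorted_terms, prefix)
    matches = []
    # Matching terms are contiguous in sorted order, so scan until one fails.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

# Hypothetical term dictionary contents.
terms = sorted(["cradle", "crash", "crazier", "crazy", "cream", "heart", "heat"])
print(prefix_lookup(terms, "craz"))   # ['crazier', 'crazy']
print(prefix_lookup(terms, "crazy"))  # ['crazy']
```

A hash table over the same terms could answer the exact lookup for "crazy", but it has no notion of neighbouring keys, so the "craz" query would require checking every term.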

Beyond Keyword Matching

Inverted indexes get you the candidate songs fast. Real search engines then apply two more layers on top:

Query expansion. A search for "happy" also checks related terms like "joyful" and "cheerful", broadening recall without forcing the user to guess synonyms.
Ranking. Not all matches are equally relevant. Spotify scores each match on three signals: where the term appeared (title beats tag), overall popularity (a global hit beats a deep cut), and personal history (an artist you listen to often moves up).
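The two layers above can be sketched as follows. Everything specific here is an assumption made for illustration: the synonym table, the `Match` fields, the field weights, and the 0.5/0.3/0.2 blend are invented and say nothing about how Spotify actually weighs these signals. The sketch only shows the shape of the idea: expand the query terms, then sort candidates by a combined score.

```python
from dataclasses import dataclass

# Hypothetical synonym table used for query expansion.
SYNONYMS = {"happy": ["joyful", "cheerful"]}

def expand(term):
    """Return the term followed by any related terms to search as well."""
    return [term] + SYNONYMS.get(term, [])

@dataclass
class Match:
    song_id: int
    field: str                # where the term appeared: "title" or "tag"
    popularity: float         # 0..1, overall popularity (assumed scale)
    listener_affinity: float  # 0..1, how often this user plays the artist

# Assumed weights: a title hit counts for more than a tag hit.
FIELD_WEIGHT = {"title": 1.0, "tag": 0.4}

def score(m):
    """Blend the three signals into one number (weights are illustrative)."""
    return (0.5 * FIELD_WEIGHT[m.field]
            + 0.3 * m.popularity
            + 0.2 * m.listener_affinity)

matches = [
    Match(song_id=1, field="tag", popularity=0.9, listener_affinity=0.0),
    Match(song_id=2, field="title", popularity=0.2, listener_affinity=0.0),
    Match(song_id=3, field="title", popularity=0.2, listener_affinity=0.9),
]
ranked = sorted(matches, key=score, reverse=True)
print(expand("happy"))                # ['happy', 'joyful', 'cheerful']
print([m.song_id for m in ranked])    # [3, 2, 1]
```

With these made-up weights, the title match from a frequently played artist ranks first, and the popular tag-only match ranks last.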

Key Takeaways

Inverted indexes replace an O(N) table scan with an O(log N) term-dictionary lookup plus a read of the matching postings lists, moving the work from scanning rows to descending a compact, sorted term dictionary.
Intersection happens in RAM, not on disk. The expensive random IO only fires once you know which rows you actually need.
B+Trees are the term dictionary of choice because they support both exact and prefix lookups, which is what search boxes actually do.

Case Study 3.3: Tracking High-Volume Activity → When the workload flips from read-heavy search to write-heavy event ingestion, the index type has to flip with it.
