Case Study 3.2: How does Spotify support text search?
Concept. Spotify supports text search across 100 million tracks using an inverted index, a per-word lookup table that maps each search term to the list of song_ids containing it.
Intuition. Without an index, a search for "Yesterday" would read every row in a 100-million-song Songs table. An inverted index looks up "Yesterday" once and returns a short list of matching song_ids, so the database only reads those few rows.
The Solution: An Inverted Index
An inverted index looks like the index at the back of a textbook. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears. The figure walks the live "Crazy Heart" example.
Why a B+Tree for the Term Dictionary
The caption above lays out the structure and the live example. One detail worth pulling out: the term dictionary is a B+Tree, not a hash index. Users often type prefixes ("Craz" before they finish typing "Crazy"), and B+Trees support both exact and prefix lookups. A hash index would scatter neighbouring terms to random buckets and could not auto-complete.
Beyond Keyword Matching
Inverted indexes get you the candidate songs fast. Real search engines then apply two more layers on top:
-
Query expansion. A search for "happy" also checks related terms like "joyful" and "cheerful", broadening recall without forcing the user to guess synonyms.
-
Ranking. Not all matches are equally relevant. Spotify scores each match on three signals: where the term appeared (title beats tag), overall popularity (a global hit beats a deep cut), and personal history (an artist you listen to often moves up).
Key Takeaways
-
Inverted indexes turn O(N) scans into O(log N) lookups by moving the work from scanning rows to scanning a compact term dictionary.
-
Intersection happens in RAM, not on disk. The expensive random IO only fires once you know which rows you actually need.
-
B+Trees are the term dictionary of choice because they support both exact and prefix lookups, which is what search boxes actually do.
Next
Case Study 3.3: Tracking High-Volume Activity → When the workload flips from read-heavy search to write-heavy event ingestion, the index type has to flip with it.