Case Study 3.2: How does Spotify support text search?

Concept. Spotify supports text search across 100 million tracks using an inverted index, a per-word lookup table that maps each search term to the list of song_ids containing it.

Intuition. Without an index, a search for "Yesterday" would read every row in a 100-million-song Songs table. An inverted index looks up "Yesterday" once and returns a short list of matching song_ids, so the database only reads those few rows.

Case Study 3.2 Reading Time: 6 mins

The Solution: An Inverted Index

An inverted index looks like the index at the back of a textbook. Each term (word) is a key, and its value is a list of locations (track IDs) where that word appears. The figure walks the live "Crazy Heart" example.

Inverted index. A green B+Tree term dictionary on the left stores every unique word in the Songs table; Crazy and Heart are highlighted in purple. Each highlighted term points at its green postings list on the right, a packed array of song IDs. The red Bitwise-AND-intersection box runs in RAM, walking both lists in lockstep and keeping IDs that appear in both. Result: song IDs 12 and 108. About 10 IOs total versus 100 million for a full-table scan.

Why a B+Tree for the Term Dictionary

The caption above lays out the structure and the live example. One detail worth pulling out: the term dictionary is a B+Tree, not a hash index. Users often type prefixes ("Craz" before they finish typing "Crazy"), and B+Trees support both exact and prefix lookups. A hash index would scatter neighbouring terms to random buckets and could not auto-complete.

Beyond Keyword Matching

Inverted indexes get you the candidate songs fast. Real search engines then apply two more layers on top:

  • Query expansion. A search for "happy" also checks related terms like "joyful" and "cheerful", broadening recall without forcing the user to guess synonyms.

  • Ranking. Not all matches are equally relevant. Spotify scores each match on three signals: where the term appeared (title beats tag), overall popularity (a global hit beats a deep cut), and personal history (an artist you listen to often moves up).

Key Takeaways

  1. Inverted indexes turn O(N) scans into O(log N) lookups by moving the work from scanning rows to scanning a compact term dictionary.

  2. Intersection happens in RAM, not on disk. The expensive random IO only fires once you know which rows you actually need.

  3. B+Trees are the term dictionary of choice because they support both exact and prefix lookups, which is what search boxes actually do.


Next

Case Study 3.3: Tracking High-Volume Activity → When the workload flips from read-heavy search to write-heavy event ingestion, the index type has to flip with it.