Index Problem Solving: Design Patterns

Concept. Choosing an index is really about matching the query pattern to the index type. Hash for exact match, B+Tree for ranges and sorts, LSM for write-heavy workloads, and combinations for everything in between.

Intuition. The ten scenarios below start with "what are the queries this system has to answer?" and work backward to the index. Same database, same data, different access patterns produce different answers.

When a task says "optimize queries," "scale writes," or "support range scans," the right index isn't just a technical choice; it's a business decision about which access pattern matters most. See IO Cost & Trade-offs for the structural comparison the problems below assume.


Problem 1: User Lookup Service (1 M scale)

Problem. Handle 1 M user requests per second with email or username lookups. Speed is non-negotiable.

🔍 Show Solution
  • Concept. Hash index.

  • Design. Hash index on email and username for O(1) exact-match lookups.

  • Rough performance. 1 to 2 IOs per lookup; scales horizontally.

  • Why. Logins are exact matches (email = 'user@example.com'). A hash index beats a B+Tree's 3 to 4 IOs for this pattern.
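The idea can be sketched in memory with a Python dict standing in for the on-disk hash index. The user records and field names below are made up for illustration; a real engine hashes to disk buckets, but the access pattern is the same:

```python
# Sketch: exact-match lookup via hash structures (Python dicts as stand-ins).
users = [
    {"id": 1, "email": "alice@example.com", "username": "alice"},
    {"id": 2, "email": "bob@example.com", "username": "bob"},
]

# One hash index per lookup key, as in the design above.
by_email = {u["email"]: u for u in users}
by_username = {u["username"]: u for u in users}

def lookup(email=None, username=None):
    """O(1) average-case exact match; no range or prefix queries possible."""
    if email is not None:
        return by_email.get(email)
    return by_username.get(username)

print(lookup(email="alice@example.com")["id"])  # 1
```

Note what the structure cannot do: there is no way to ask for "all emails starting with a" without scanning everything, which is exactly the trade-off the next problems explore.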


Problem 2: Song Rating Lookup (100 M scale)

Problem. Spotify needs to answer "has User X already rated Song Y?" across 100 M ratings.

🔍 Show Solution
  • Concept. Composite B+Tree vs concatenated hash.

  • Design. Option A: B+Tree on (user_id, song_id) for flexibility. Option B: Hash on concat(user_id, '_', song_id) for pure speed.

  • Rough performance. B+Tree: 3 to 4 IOs, flexible. Hash: 1 to 2 IOs for exact match only.

  • Why. If you only need the exists check, hash wins. A bloom filter can also cheaply answer "definitely not rated" before you touch the main index.
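A bloom filter is small enough to sketch directly. This is a toy version (size and hash count are arbitrary; the `user_song` key format mirrors Option B above), but it shows the contract: "definitely not present" or "maybe present":

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: answers 'definitely not' or 'maybe' for membership."""
    def __init__(self, size=1 << 20, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, key):
        # Derive k independent bit positions from one cryptographic hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means certainly absent; True means "check the real index".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

ratings = BloomFilter()
ratings.add("42_1001")                       # user 42 rated song 1001
print(ratings.might_contain("42_1001"))      # True
print(ratings.might_contain("42_9999"))      # almost certainly False
```

Most "has User X rated Song Y?" checks are negative, so the filter answers them in memory without touching the main index at all.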


Problem 3: E-commerce Price Filtering (10 M scale)

Problem. Filter products by price ranges quickly.

🔍 Show Solution
  • Concept. B+Tree index.

  • Design. B+Tree on the price column, plus a compound index on (category, price) if category filtering is common.

  • Rough performance. 3 to 4 IOs to find the start, then a sequential scan across the range.

  • Why. Hash indexes can't handle BETWEEN or < / >. The B+Tree's sorted leaves are exactly what range queries and ORDER BY want.
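The B+Tree's leaf level can be modeled as a sorted list: binary search finds the start of the range (the 3 to 4 IO tree descent), then a sequential slice does the rest. Product data below is illustrative:

```python
from bisect import bisect_left, bisect_right

# Sketch: a B+Tree's sorted leaf level as a sorted list of (price, product_id).
leaves = sorted([(9.99, "p1"), (19.99, "p2"), (24.50, "p3"), (49.00, "p4")])
prices = [p for p, _ in leaves]  # the sort key column

def price_between(lo, hi):
    """price BETWEEN lo AND hi: binary-search to the start, then scan."""
    start = bisect_left(prices, lo)    # stands in for the tree descent
    end = bisect_right(prices, hi)     # everything between is contiguous
    return [pid for _, pid in leaves[start:end]]

print(price_between(10, 30))  # ['p2', 'p3']
```

A hash index scatters keys deliberately, so there is no equivalent of that contiguous slice; this is why the range query forces a B+Tree.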


Problem 4: Song Analytics Dashboard (50 M scale)

Problem. Query 50 M songs by various criteria. Sorting the data multiple ways at once isn't feasible.

🔍 Show Solution
  • Concept. Primary clustered index plus secondary indexes.

  • Design. Primary B+Tree clustered on song_id. Secondary B+Trees on release_date and on popularity_score DESC. Hash on genre.

  • Rough performance. Primary: 3 IOs. Secondary: 4 to 5 IOs. Genre lookup: 1 to 2 IOs via hash.

  • Why. The table is physically sorted by song_id. Secondary indexes point back to the primary key. You pay an extra IO hop for each secondary lookup, but that's cheaper than maintaining several physically clustered copies of the table.
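The extra IO hop can be made concrete with two dicts: the secondary index stores primary keys, not row locations, so every secondary lookup bounces through the clustered index. Row contents are illustrative:

```python
# Sketch: secondary index entries hold the primary key, not a row pointer,
# so a secondary lookup costs one extra hop through the clustered index.
clustered = {  # primary B+Tree: song_id -> full row (table sorted by song_id)
    1: {"song_id": 1, "title": "A", "release_date": "2020-01-01"},
    2: {"song_id": 2, "title": "B", "release_date": "2019-06-15"},
}
by_release_date = {  # secondary index: release_date -> song_id
    "2019-06-15": 2,
    "2020-01-01": 1,
}

def find_by_release_date(d):
    song_id = by_release_date[d]   # hop 1: descend the secondary index
    return clustered[song_id]      # hop 2: back through the primary key

print(find_by_release_date("2019-06-15")["title"])  # B
```

The payoff for the extra hop: when compaction or inserts move rows around, only the clustered index changes; secondary indexes stay valid because primary keys don't move.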


Problem 5: User Activity Tracking (10 M scale)

Problem. Track 10 M user sessions with mixed query patterns.

🔍 Show Solution
  • Concept. Mix index types to match each pattern.

  • Design. Hash on session_id. Composite B+Tree on (user_id, timestamp). B+Tree on last_active. Bloom filter for existence checks.

  • Rough performance. Session lookup: 1 to 2 IOs. User history: 3 to 4 IOs plus scan. Dormant detection: range scan. Existence check: answered in memory.

  • Why. Each query pattern gets the structure that fits it best. Hash for point lookups, B+Tree for ranges and sorting, bloom filter for cheap existence.


Problem 6: Similar Image Search (1 B+ scale)

Problem. Pinterest-style "find similar images" and detect exact duplicates efficiently.

🔍 Show Solution
  • Concept. Locality-Sensitive Hashing (LSH) plus a hash index. LSH is a specialized topic beyond the core index types and is covered only lightly here.

  • Design. LSH for similarity search. Hash index on image fingerprints for duplicate detection. Two-stage query: check the hash first for exact matches, then run LSH for near-matches.

  • Rough performance. Exact duplicate: O(1). Similar images: O(k), where k is the average bucket size in LSH.

  • Why. LSH handles high-dimensional data by reducing dimensionality while preserving approximate similarity. A plain hash index covers the exact-duplicate path.
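One standard LSH family, random-hyperplane hashing, fits in a few lines. Each hyperplane contributes one signature bit (which side of the plane the vector falls on), so similar vectors tend to share a bucket. The vectors and dimensions below are toy values, not real image embeddings:

```python
import random

# Sketch of random-hyperplane LSH: vectors become short bit signatures,
# and near-duplicates tend to land in the same bucket.
random.seed(0)
DIM, BITS = 8, 6
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def signature(vec):
    """One bit per hyperplane: the sign of the dot product."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

buckets = {}
def index(img_id, vec):
    buckets.setdefault(signature(vec), []).append(img_id)

def candidates(vec):
    # O(k): only the vectors sharing this signature are candidates.
    return buckets.get(signature(vec), [])

a = [1.0, 0.5, 0.2, 0.0, 0.9, 0.1, 0.3, 0.7]
index("img_a", a)
print(candidates(a))  # ['img_a']
```

Identical vectors always share a signature (hence the exact-duplicate guarantee via a plain hash); near-identical vectors usually do, which is the "approximate" in approximate similarity. Production systems hash into several independent tables to cut the miss rate.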


Problem 7: Instagram Feed (600 M scale)

Problem. Deliver Instagram feeds without writing a single celebrity post to 600 M timelines.

🔍 Show Solution
  • Concept. B+Tree plus hybrid fanout.

  • Design. Two tables: posts with a B+Tree on (user_id, timestamp), and timelines with a B+Tree on (user_id, timestamp) that is pre-populated for regular users at write time. Celebrity posts are not fanned out; they are fetched from posts on demand via JOIN at read time.

  • Rough performance. Celebrity: a JOIN using B+Tree indexes. Regular user: a single B+Tree lookup.

  • Why. Fanning out one celebrity post to 600 M timelines is wasteful. Instagram JOINs celebrity posts with your follow list at read time instead.
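The read-time merge can be sketched with `heapq.merge`: the pre-materialized timeline and the on-demand celebrity streams are each sorted newest-first, so they merge in one pass. Table contents and names below are invented for illustration, not Instagram's actual schema:

```python
import heapq

# Sketch of hybrid fanout: regular posts are pre-materialized per follower;
# celebrity posts are pulled at read time and merged by timestamp.
timelines = {  # pre-populated for regular authors: user_id -> [(ts, post)]
    "me": [(100, "friend: lunch pic"), (95, "friend: cat")],
}
celebrity_posts = {  # queried on demand, newest first
    "celeb1": [(102, "celeb1: tour dates"), (90, "celeb1: album")],
}
follows_celebs = {"me": ["celeb1"]}

def read_feed(user):
    streams = [timelines.get(user, [])]
    for celeb in follows_celebs.get(user, []):
        streams.append(celebrity_posts.get(celeb, []))
    # Each stream is already sorted newest-first; the merge preserves that.
    return [post for _, post in heapq.merge(*streams, key=lambda t: -t[0])]

print(read_feed("me"))
# ['celeb1: tour dates', 'friend: lunch pic', 'friend: cat', 'celeb1: album']
```

One celebrity post costs one row in posts instead of 600 M timeline writes; the price is a slightly more expensive read, which is the right trade when reads can afford a small JOIN.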


Problem 8: Real-time Event Stream (1 M writes/sec)

Problem. Spotify logs 1 M song plays per second.

🔍 Show Solution
  • Concept. LSM tree.

  • Design. LSM with size-tiered compaction. Time-based partitioning. Secondary LSM indexes on user_id for per-user queries.

  • Rough performance. Effectively 0 IOs per write (append to RAM). Reads cost roughly K × 2 IOs, where K is the number of tiers that might hold the key.

  • Why. Writes hit memory first. A B+Tree needs 3 to 4 random IOs per insert, which at 1 M writes/sec is roughly 4 M IOPS on disk; LSM's batched sequential flushes absorb the load instead. Recent data stays in memory for fast access.
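The write path is the whole trick, and it fits in a tiny class. This is a minimal sketch (no disk, no compaction, no WAL): writes land in a memtable, and a full memtable is flushed as one immutable sorted run, standing in for an SSTable:

```python
# Sketch of the LSM write path: writes go to an in-memory table and are
# flushed to an immutable sorted run ("SSTable") when it fills up.
class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []             # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value     # pure memory write: no disk IO yet
        if len(self.memtable) >= self.limit:
            # One sequential flush amortizes many individual writes.
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # recent data answers from memory
            return self.memtable[key]
        for table in self.sstables:    # otherwise search newest tier first
            if key in table:
                return table[key]
        return None

db = TinyLSM()
for i in range(7):
    db.put(f"play:{i}", f"song_{i % 3}")
print(db.get("play:0"), len(db.sstables))  # song_0 2
```

Note the read path: the more runs a key might sit in, the more places `get` must check, which is exactly the K-tier read cost above and the reason real engines compact runs and guard each with a bloom filter.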


Problem 9: IoT Time-Series Data (100 K sensors)

Problem. 100 K sensors sending readings, queried by sensor_id and time range.

🔍 Show Solution
  • Concept. LSM tree plus bloom filters.

  • Design. LSM with leveled compaction. Primary key (sensor_id, timestamp). Monthly partitions. Bloom filters per SSTable.

  • Rough performance. Effectively 0 write IOs. Efficient range scans with bloom-filter pruning on reads.

  • Why. Append-only sensor data fits LSM perfectly. Bloom filters skip SSTables that can't contain the requested sensor. Run-length encoding compresses repeated values well.


Problem 10: Driver Location Tracking (10 M scale)

Problem. Uber tracks driver locations every 5 seconds but only needs the latest location per driver.

🔍 Show Solution
  • Concept. LSM with a keep-latest merge function.

  • Design. LSM tree with a custom merge that keeps only the newest timestamp per driver_id. Writes append to memory, and compaction discards older locations.

  • Rough performance. Effectively 0 write IOs. Compaction reduces data roughly 100× by keeping only the latest location per driver.

  • Why. The LSM merge function implements latest-wins directly. A B+Tree would require a costly random update for every ping. Old locations are garbage-collected for free during compaction.
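The custom merge is just latest-wins applied during compaction. The sketch below merges two illustrative runs (coordinates and driver IDs are made up) and keeps only the newest timestamp per driver:

```python
# Sketch of a keep-latest merge: when runs are compacted, only the newest
# (timestamp, location) per driver_id survives; older pings are dropped.
run_old = {"d1": (100, (37.77, -122.41)), "d2": (101, (37.78, -122.40))}
run_new = {"d1": (205, (37.79, -122.42))}

def compact_latest(*runs):
    merged = {}
    for run in runs:
        for driver, (ts, loc) in run.items():
            if driver not in merged or ts > merged[driver][0]:
                merged[driver] = (ts, loc)   # latest-wins merge function
    return merged

compacted = compact_latest(run_old, run_new)
print(compacted["d1"])  # (205, (37.79, -122.42)) — the stale ping is gone
```

With a ping every 5 seconds, most of what compaction reads is stale, so discarding everything but the latest row per driver is where the roughly 100× reduction comes from.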


Key Takeaways

  1. Match the index to the access pattern, not the data. The same 100 M rows can warrant a hash, a B+Tree, or an LSM depending on the query.

  2. Composite indexes are one of the highest-leverage moves. A single B+Tree on (user_id, timestamp) serves personal history, most-recent lookups, and range filters all at once.

  3. Mix index types on the same table when patterns conflict. Hash for point lookups, B+Tree for ranges, bloom for existence. They aren't mutually exclusive.


Next

Indexing Quiz → Check yourself on the four index types and when to use each.