SSTables: Sorted String Tables
Concept. An SSTable is an immutable file of key-value pairs sorted by key, written once and never modified. It's the building block of write-optimized stores such as LSM trees.
Intuition. When Spotify needs to record 1 million listen events per second, writing each one to a sorted on-disk B+Tree is too slow, because every write lands on a random page and can trigger splits and rebalancing. Instead, the system buffers writes in RAM until it has a chunk, then flushes a sorted immutable file to disk with one sequential write. That file is an SSTable.
Anatomy of an SSTable
Each SSTable is a single file with three sections laid out in order: a Bloom filter for quick "key not present" checks, a sparse index mapping every Nth key to a byte offset, and the sorted data blocks themselves, followed by a small footer recording metadata such as the min key, max key, and size. (Recall Bloom filters from Module 2 for how they work and their false-positive behavior.)
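The sparse index is what keeps lookups cheap: instead of indexing every key, it records the offset of every Nth key, so a reader only has to scan at most one block. A minimal sketch (the keys and offsets here are made up for illustration):

```python
# Build a sparse index over sorted keys: one entry every N keys,
# mapping a key to the byte offset of the data block it starts.
N = 4
keys = ["apple", "banana", "cherry", "date", "fig", "grape", "kiwi", "lime"]
offsets = [0, 64, 130, 190, 250, 310, 370, 430]  # illustrative block offsets

sparse_index = {keys[i]: offsets[i] for i in range(0, len(keys), N)}
print(sparse_index)  # {'apple': 0, 'fig': 250}
```

A full index over millions of keys might not fit in RAM; a sparse one over every Nth key usually does, which is why reads are "mostly from cache."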
Multiple SSTables coexist on disk. A background compaction process eventually merges older files together to throw away obsolete versions.
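Because the input files are already sorted, compaction is a streaming merge rather than a re-sort. A minimal sketch, assuming each SSTable is a sorted list of `(key, seqno, value)` tuples where higher sequence numbers are newer and `None` marks a deletion (a tombstone):

```python
import heapq

def compact(sstables):
    # Merge on (key, -seqno) so the newest version of each key comes first.
    merged = heapq.merge(*sstables, key=lambda kv: (kv[0], -kv[1]))
    out, last_key = [], None
    for key, seqno, value in merged:
        if key == last_key:
            continue              # older version of a key we already handled
        last_key = key
        if value is not None:     # drop tombstones (deleted keys)
            out.append((key, value))
    return out

old = [("a", 1, "1"), ("b", 1, "x"), ("c", 1, "3")]
new = [("b", 2, None), ("c", 2, "30")]
print(compact([old, new]))  # [('a', '1'), ('c', '30')]
```

Note what happened: `b` was deleted in the newer file, so both its tombstone and its old value are gone; `c` kept only its newest version. This is how compaction reclaims space.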
How Databases Use SSTables
Three operations show up in any system that uses SSTables:
- Flush. Convert an in-memory buffer of pending writes into a sorted, immutable file on disk. One sequential write per flush.
- Read. Consult the Bloom filter, binary-search the sparse index, and read the target data block. Three steps, mostly from cache.
- Compact. K-way merge a set of older SSTables into one larger file, keeping the newest version of each key and dropping deleted entries.
Some databases use SSTables as a standalone storage format (notably older versions of HBase and many analytics stores). Others layer them into a larger structure, which is where LSM trees come in on the next page.
Algorithm
Flush, Read, Compact
Flush(memtable):
    sorted = sort(memtable)        // sort by key
    write(bloom_filter)            // for quick "not exists"
    write(sparse_index)            // every Nth key → offset
    write(sorted_data)             // the actual key-values
    write(footer)                  // min_key, max_key, size

Read(sstable, key):
    if not bloom_filter.contains(key):
        return None                // quick reject
    offset = binary_search(index, key)
    block = read_block(offset)
    return linear_search(block, key)

Compact(sstables):
    merged = k_way_merge(sstables) // keep newest version per key
    return write_new_sstable(merged)
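The Read path above can be modeled in a few lines of Python. This is a minimal in-memory sketch, not an on-disk format: the "blocks" are slices of a sorted list, and a plain set stands in for a real Bloom filter (a real filter would occasionally say yes for a missing key; a set never does):

```python
import bisect

class SSTable:
    def __init__(self, sorted_items, block_size=4):
        # Split the sorted key-value pairs into fixed-size "blocks".
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # Sparse index: the first key of every block.
        self.index = [block[0][0] for block in self.blocks]
        # Stand-in for the Bloom filter.
        self.filter = {k for k, _ in sorted_items}

    def read(self, key):
        if key not in self.filter:                    # quick reject
            return None
        i = bisect.bisect_right(self.index, key) - 1  # binary-search the index
        for k, v in self.blocks[i]:                   # linear scan of one block
            if k == key:
                return v
        return None

t = SSTable([("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)])
print(t.read("d"))  # 4
print(t.read("z"))  # None
```

The three-step shape is the same as the pseudocode: filter check, index search, one block scan. Only the block read would touch disk in a real engine.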
Where SSTables Run in Production
The same file format powers Google's Bigtable and Spanner, Amazon DynamoDB, Apache Cassandra, and RocksDB. RocksDB is the storage engine underneath Kafka Streams, CockroachDB, TiKV, and many other systems. If you've used a modern key-value store, you've almost certainly written an SSTable.
Key Takeaways
- Immutable plus sorted equals sequential writes. A flush is one big append, not thousands of random seeks.
- Bloom filter plus sparse index equals fast reads. Most lookups are answered from cached index entries, and the Bloom filter prevents wasted reads for missing keys.
- Compaction pays the cost later. Writes stay cheap, and the background merge reclaims space and cleans up old versions.
Next
LSM Tree Indexes → How a cluster of SSTables behaves as a single log-structured index, and how reads and writes flow through it.