SSTables: Sorted String Tables

Concept. An SSTable is an immutable file of key-value pairs sorted by key, written once and never modified. It's the building block for write-heavy stores like LSM trees.

Intuition. When Spotify needs to record 1 million listen events per second, writing each one into a sorted on-disk B+Tree is too slow: every write lands on some random page of the tree, so the workload degenerates into random disk I/O, with occasional node splits and rebalances on top. Instead, the system buffers writes in RAM until it has a sizeable chunk, then flushes one sorted, immutable file to disk. That file is an SSTable.
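
The buffer-then-flush idea can be sketched in a few lines of Python. The names (Memtable) and the JSON-lines layout are illustrative, not any real engine's on-disk format:

```python
import json, os, tempfile

class Memtable:
    """Buffers writes in RAM, then flushes them as one sorted, immutable file."""
    def __init__(self):
        self.entries = {}                  # key -> latest value

    def put(self, key, value):
        self.entries[key] = value          # cheap in-memory write

    def flush(self, path):
        # One big sequential append: keys written in sorted order,
        # and the file is never modified afterward.
        with open(path, "w") as f:
            for key in sorted(self.entries):
                f.write(json.dumps([key, self.entries[key]]) + "\n")
        self.entries.clear()

mt = Memtable()
for k, v in [("user_3", "play"), ("user_1", "skip"), ("user_2", "play")]:
    mt.put(k, v)
path = os.path.join(tempfile.mkdtemp(), "sstable_001.jsonl")
mt.flush(path)
# The file now holds user_1, user_2, user_3 in key order.
```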


Anatomy of an SSTable

SSTable structure (diagram): one file with three sections (bloom filter, index block, data blocks), all living on disk. Below it, three SSTable files coexist on disk, with the key user_100 appearing as v1, v2, and v3 across them; the newest version wins on read.

Each SSTable is a single file with three sections laid out in order: a bloom filter for quick "key does not exist" checks, a sparse index that maps every Nth key to its file offset, and the sorted data blocks holding the actual key-value pairs. (Recall Bloom filters from Module 2 for how they work and their false-positive behavior.)

Multiple SSTables coexist on disk. A background compaction process eventually merges older files together to throw away obsolete versions.
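
A minimal sketch of that merge, assuming each input is a sorted list of (key, value) pairs, the inputs are ordered oldest to newest, and None stands in for a deletion tombstone:

```python
import heapq

def compact(sstables):
    # Tag each entry with the file's age so that, among duplicates of a
    # key, the newest version sorts first: (key, -file_index, value).
    tagged = ([(k, -i, v) for k, v in t] for i, t in enumerate(sstables))
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            continue                      # older duplicate: drop it
        last_key = key
        if value is not None:             # drop deletion tombstones
            merged.append((key, value))
    return merged

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("b", 20), ("c", None)]            # b updated, c deleted
print(compact([old, new]))                # → [('a', 1), ('b', 20)]
```

heapq.merge streams the inputs rather than loading them all into memory, which mirrors how real compaction scans large files sequentially.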


How Databases Use SSTables

Three operations show up in any system that uses SSTables:

  • Flush. Convert an in-memory buffer of pending writes into a sorted, immutable file on disk. One sequential write per flush.

  • Read. Consult the bloom filter, binary-search the sparse index, and read the target data block. Three steps, mostly from cache.

  • Compact. K-way merge a set of older SSTables into one larger file, keeping the newest version of each key and dropping deleted entries.

Some databases use SSTables as a standalone storage format (notably older versions of HBase and many analytics stores). Others layer them into a larger structure, which is where LSM trees come in on the next page.


Algorithm

Flush, Read, Compact

Flush(memtable):
    sorted = sort(memtable)        // sort by key
    write(bloom_filter)             // for quick "not exists"
    write(sparse_index)             // every Nth key → offset
    write(sorted_data)              // the actual key-values
    write(footer)                   // min_key, max_key, size

Read(sstable, key):
    if not bloom_filter.contains(key):
        return None                 // quick reject
    offset = binary_search(index, key)  // block of largest indexed key ≤ key
    block  = read_block(offset)
    return linear_search(block, key)

Compact(sstables):
    merged = k_way_merge(sstables)  // keep newest version per key
    return write_new_sstable(merged)
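
The Read path above can be made concrete with a small runnable sketch. TinyBloom is a toy two-hash bloom filter, and the block layout is a simplified in-memory stand-in for the real on-disk format:

```python
import bisect, hashlib

class TinyBloom:
    """Toy bloom filter: one bit array, two hash functions."""
    def __init__(self, m=1024):
        self.m, self.bits = m, bytearray(m)
    def _hashes(self, key):
        d = hashlib.sha256(key.encode()).digest()
        yield int.from_bytes(d[:4], "big") % self.m
        yield int.from_bytes(d[4:8], "big") % self.m
    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1
    def contains(self, key):
        return all(self.bits[h] for h in self._hashes(key))

def build(pairs, block_size=2):
    pairs = sorted(pairs)
    bloom = TinyBloom()
    for k, _ in pairs:
        bloom.add(k)
    blocks = [pairs[i:i + block_size] for i in range(0, len(pairs), block_size)]
    index_keys = [b[0][0] for b in blocks]   # sparse index: first key per block
    return bloom, index_keys, blocks

def read(sstable, key):
    bloom, index_keys, blocks = sstable
    if not bloom.contains(key):
        return None                          # quick reject: definitely absent
    i = bisect.bisect_right(index_keys, key) - 1   # largest indexed key ≤ key
    if i < 0:
        return None
    for k, v in blocks[i]:                   # linear scan within one block
        if k == key:
            return v
    return None

table = build([("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)])
print(read(table, "d"))   # → 4
print(read(table, "zz"))  # → None
```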

Where SSTables Run in Production

The same file format powers Google's Bigtable and Spanner, Amazon DynamoDB, Apache Cassandra, and RocksDB. RocksDB is the storage engine underneath Kafka Streams, CockroachDB, TiKV, and many other systems. If you've used a modern key-value store, you've almost certainly written an SSTable.


Key Takeaways

  1. Immutable plus sorted equals sequential writes. A flush is one big append, not thousands of random seeks.

  2. Bloom filter plus sparse index equals fast reads. Most lookups are answered from cached index entries, and the bloom filter prevents wasted reads for missing keys.

  3. Compaction pays the cost later. Writes stay cheap, and the background merge reclaims space and cleans up old versions.
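
To see why takeaway 1 matters, a back-of-envelope comparison; the bandwidth, seek time, and entry size below are illustrative assumptions (roughly spinning-disk class), not measurements:

```python
# Illustrative assumptions: ~100 MB/s sequential write bandwidth,
# ~10 ms per random seek, 100-byte entries.
entries = 100_000
entry_bytes = 100
seek_ms = 10

seq_seconds = entries * entry_bytes / 100e6   # one big sequential append
random_seconds = entries * seek_ms / 1000     # one seek per entry

print(f"sequential flush: {seq_seconds:.1f} s")
print(f"random writes:    {random_seconds:.0f} s")
```

Under these assumptions the flush finishes in a fraction of a second, while per-entry random writes would take minutes; that gap is the entire motivation for the buffer-then-flush design.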


Next

LSM Tree Indexes → How a cluster of SSTables behaves as a single log-structured index, and how reads and writes flow through it.