Selectivity & Statistics: How Databases Estimate Data Flow

Concept. Selectivity is the fraction of rows a predicate keeps; the database's statistics (histograms, distinct-value counts) let the query planner estimate selectivity to choose the cheapest plan.

Intuition. WHERE user_id = 1 keeps a vanishingly small fraction of Listens (a handful of rows out of a billion); WHERE rating > 0 keeps ~50%. The optimizer needs to know which is which: the near-zero predicate should use an index, the 50% predicate should scan the table. Statistics tell it which.

Core Concepts: Selectivity & Cardinality

| | Selectivity | Cardinality |
| --- | --- | --- |
| Definition | Fraction of rows passing the filter (0.0 – 1.0) | Number of rows in the output |
| Examples | user_id = 123 → 0.0001 · genre = 'Rock' → 0.30 · rating IS NOT NULL → 0.95 | Output = Input × Selectivity, e.g. 1M × 0.01 = 10K rows |
| Combining | AND: S₁ × S₂ · OR: S₁ + S₂ − (S₁ × S₂) | FK join: max(\|R\|, \|S\|) · M:N join: \|R\| × \|S\| / V(key) |

Why it matters: bad estimates → 10–1000× slower plans. Selectivity drives algorithm choice (hash vs. nested loop), access method (index vs. scan), buffer sizing, and parallelism.
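
As a quick back-of-envelope check of the combining rules above, here is a tiny Python sketch; the table size and per-predicate selectivities are made-up numbers, not measurements:

```python
# Hypothetical numbers: a 1M-row table and two independent predicates.
rows = 1_000_000
s_genre = 0.30         # genre = 'Rock'
s_rating = 0.05        # rating > 4.5

s_and = s_genre * s_rating                        # AND: S1 * S2          -> 0.015
s_or = s_genre + s_rating - s_genre * s_rating    # OR: S1 + S2 - S1*S2   -> 0.335

print(int(rows * s_and))   # ~15,000 rows expected to pass both filters
print(int(rows * s_or))    # ~335,000 rows expected to pass either filter
```

Note that the AND rule assumes the predicates are independent; correlated columns are exactly where these estimates go wrong.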


Step 1: Gathering Statistics

The pipeline: sample / scan → build structures → store & use.

Sampling Techniques

| Technique | How it works | Pros | Cons | Use when |
| --- | --- | --- | --- | --- |
| Reservoir | Stream-friendly uniform sample of k rows: keep the first k, then replace with probability k/i | Unbiased; fixed memory | Requires a full scan | You need an exact uniform sample |
| Block sampling | ANALYZE TABLE SAMPLE 1000 PAGES: samples whole pages | Fast I/O; cache-friendly | Clustering bias | Data is well distributed |
| Progressive | Start with 100 samples; grow 10× until variance < threshold | Adaptive accuracy; cheap when easy | May need many passes | Unknown distribution |
```python
# Reservoir sampling (Algorithm R): the canonical streaming algorithm.
# Keeps a uniform k-row sample of a stream of unknown length in O(k) memory.
import random

def reservoir_sample(stream, k):
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)          # fill the reservoir with the first k rows
        else:
            j = random.randint(0, i)    # uniform slot in [0, i]
            if j < k:
                sample[j] = row         # row i survives with probability k/(i+1)
    return sample
```
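
For example, a sample like this can feed the column statistics built in the next step; the listens_stream iterable and the rating column are placeholders for illustration:

```python
# Hypothetical usage: derive simple column stats from a 10,000-row sample.
sample = reservoir_sample(listens_stream, k=10_000)

ratings = [row["rating"] for row in sample]
null_fraction = sum(r is None for r in ratings) / len(ratings)
distinct_count = len({r for r in ratings if r is not None})
```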

Step 2: Storing Statistics

After gathering, build compressed representations.

Table-level vs. column-level stats

| Scope | What's stored | Stored where |
| --- | --- | --- |
| Table | row_count, page_count, avg_row_len, last_analyzed, staleness | PostgreSQL: pg_class · MySQL: innodb_table_stats · BigQuery: INFORMATION_SCHEMA |
| Column | null_fraction, distinct_count, most-common values + frequencies, histogram bounds | PostgreSQL: pg_statistic |

Column-level stats enable range estimates, equality selectivity, NULL handling, and join cardinality.
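
To make the equality case concrete, here is a hedged Python sketch loosely modeled on how PostgreSQL's equality estimator uses the most-common-values list and the distinct count; the stats dictionary and its numbers are invented for illustration:

```python
# Invented column stats for genre: NULL fraction, distinct count,
# and the most-common values (MCVs) with their observed frequencies.
stats = {
    "null_fraction": 0.02,
    "distinct_count": 50,
    "mcv": {"Rock": 0.30, "Pop": 0.22, "Jazz": 0.08},
}

def eq_selectivity(stats, value):
    if value in stats["mcv"]:
        return stats["mcv"][value]           # exact frequency for a common value
    # Otherwise spread the leftover probability evenly over the remaining distinct values.
    leftover = 1.0 - stats["null_fraction"] - sum(stats["mcv"].values())
    remaining = stats["distinct_count"] - len(stats["mcv"])
    return leftover / remaining

print(eq_selectivity(stats, "Rock"))    # 0.30  (a most-common value)
print(eq_selectivity(stats, "Zydeco"))  # ~0.008 (a rare value)
```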

Key trade-off. We can't store every value, so we use compressed structures that trade space for accuracy. Histograms model distributions; sketches answer specific queries (membership, distinct count, frequency).

Histograms: Modeling Value Distribution

| Type | Idea | Best for | Example |
| --- | --- | --- | --- |
| Height-balanced (equi-depth) | Equal rows per bucket; bucket ranges vary | Skewed data | rating buckets [1.0–2.5], [2.5–3.5], [3.5–4.2], [4.2–5.0], each holding 25% of rows |
| Equi-width | Equal ranges per bucket; row counts vary | Uniform data | year buckets [1950–1975] 5%, [1975–2000] 15%, [2000–2025] 80% |
| Wavelet | Hierarchical multi-resolution compression | Complex / multi-dimensional patterns | Levels of increasing granularity |
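
As an illustration of how a histogram answers a range predicate, here is a small Python sketch using the height-balanced rating buckets from the table above; the linear interpolation inside a partially covered bucket is an assumption, not any particular engine's exact formula:

```python
# Height-balanced histogram: each bucket holds the same fraction of rows (25%).
bounds = [1.0, 2.5, 3.5, 4.2, 5.0]
bucket_fraction = 1.0 / (len(bounds) - 1)

def selectivity_greater_than(x):
    """Estimate the selectivity of `rating > x` from the histogram."""
    sel = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if x <= lo:
            sel += bucket_fraction                           # whole bucket passes
        elif x < hi:
            sel += bucket_fraction * (hi - x) / (hi - lo)    # partial bucket, linear interpolation
    return sel

print(selectivity_greater_than(4.2))  # 0.25: exactly the last bucket
print(selectivity_greater_than(3.0))  # 0.625: half of bucket 2 plus buckets 3 and 4
```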

Probabilistic sketches

| Sketch | Answers | Space | Error | Use case |
| --- | --- | --- | --- | --- |
| Bloom filter | "Might item x be in the set?" | ~10 bits/item for 1% FP | ~1% false positives, no false negatives | Skip partitions in a distributed query |
| HyperLogLog | COUNT(DISTINCT x) | ~1.5 KB total | ±1% | Unique visitors over billions of events |
| Count-Min Sketch | "How often did item x appear?" | ~10 KB for 0.1% error | Overestimates by at most εN, with probability 1 − δ | Top-K / heavy hitters |
```python
# Bloom filter (space O(m) bits)
m = 80_000                                                          # ~10 bits/item for 1% FP at 8K items
bits = bytearray(m)
hash_funcs = [lambda x, s=s: hash((s, x)) % m for s in range(7)]    # k ≈ 7 hash functions

def insert(x):
    for h in hash_funcs:
        bits[h(x)] = 1                          # set one bit per hash function

def contains(x):
    return all(bits[h(x)] for h in hash_funcs)  # all bits set -> "probably present"
```
```python
# HyperLogLog (space ~1.5 KB total)
p = 11                      # 2^11 = 2048 registers, a few bits each
m = 1 << p
buckets = [0] * m

def add(value):
    h = hash(value) & 0xFFFFFFFF                # treat as a 32-bit hash
    bucket = h & (m - 1)                        # low p bits choose the register
    rest = h >> p
    zeros = (32 - p) - rest.bit_length()        # leading zeros in the remaining bits
    buckets[bucket] = max(buckets[bucket], zeros + 1)
```
```python
# Count-Min Sketch (space O(w × d))
w, d = 2048, 5                                                      # width (counters per row) and depth (rows)
table = [[0] * w for _ in range(d)]
hash_funcs = [lambda x, s=s: hash((s, x)) % w for s in range(d)]

def increment(item):
    for i, h in enumerate(hash_funcs):
        table[i][h(item)] += 1                  # every row counts the item (plus any collisions)

def estimate(item):
    # Collisions only inflate counts, so the minimum across rows is the least-inflated estimate.
    return min(table[i][h(item)] for i, h in enumerate(hash_funcs))
```
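
For instance, continuing the Bloom-filter sketch above (run together with that block; the user_ids are made up), a partition can be probed before it is read:

```python
# Hypothetical usage: index which user_ids appear in a partition, then probe.
for uid in [42, 1001, 31337]:   # user_ids seen while writing the partition
    insert(uid)

print(contains(42))   # True: definitely inserted
print(contains(7))    # False, or rarely True (a false positive); never a false negative
```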

Step 3: Using Statistics for Selectivity Patterns

| Selectivity | Examples | Recommended access |
| --- | --- | --- |
| High (filter keeps < 1% of rows) | Primary-key lookup · unique constraint · rare value | Index scan |
| Medium (1 – 10%) | Common value · single-month date range · category filter | Bitmap scan |
| Low (> 10%) | IS NOT NULL · broad date range · popular category | Sequential scan |
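
The thresholds in the table can be read as a simple decision rule. The sketch below is a toy illustration, not any particular engine's planner; the 1% and 10% cutoffs are just the numbers assumed above:

```python
def choose_access_method(selectivity):
    """Toy rule of thumb mapping estimated selectivity to an access method."""
    if selectivity < 0.01:
        return "index scan"        # few rows: per-row index lookups are cheap
    if selectivity <= 0.10:
        return "bitmap scan"       # moderate: collect matching pages, then read them in order
    return "sequential scan"       # many rows: reading everything beats per-row lookups

print(choose_access_method(0.0001))  # index scan       (user_id = 123)
print(choose_access_method(0.30))    # sequential scan  (genre = 'Rock')
```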

Key Takeaways

The complete statistics pipeline runs in four stages:

  1. Gather. Sample or scan; balance cost vs. accuracy.

  2. Store. Build histograms and sketches; compress intelligently.

  3. Use. Estimate selectivity; choose the optimal plan.

  4. Update. Monitor staleness; re-analyze when needed.
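
As a sketch of stage 4, a staleness check in the spirit of PostgreSQL's autovacuum analyze trigger (threshold + scale factor × table size); the function name and defaults here are illustrative, not the actual trigger code:

```python
def needs_analyze(rows_modified_since_analyze, total_rows,
                  threshold=50, scale_factor=0.10):
    """Re-analyze once enough of the table has changed since the last ANALYZE."""
    return rows_modified_since_analyze > threshold + scale_factor * total_rows

print(needs_analyze(5_000, 1_000_000))    # False: only ~0.5% of rows changed
print(needs_analyze(150_000, 1_000_000))  # True: ~15% changed, stats are stale
```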

Without statistics, even the best optimizer is guessing. With good statistics, it picks plans that are 10–1000× faster than the alternatives.


Next: IO Cost Summary puts these selectivity estimates to work, comparing two query plans by hand using the cost formulas you've already seen for sort, hash, and join.