Selectivity & Statistics: How Databases Estimate Data Flow
Concept. Selectivity is the fraction of rows a predicate keeps; the database's statistics (histograms, distinct-value counts) let the query planner estimate selectivity to choose the cheapest plan.
Intuition. WHERE user_id = 1 keeps only a handful of a billion Listens (say, 3 rows — selectivity ≈ 0.0000003%); WHERE rating > 0 keeps ~50%. The optimizer needs to know which is which: the near-zero predicate should use an index, the 50% predicate should scan the table. Statistics tell it apart.
Core Concepts: Selectivity & Cardinality
| | Selectivity | Cardinality |
|---|---|---|
| Definition | Fraction of rows passing the filter (0.0 – 1.0) | Number of rows in the output |
| Examples | `user_id = 123` → 0.0001 · `genre = 'Rock'` → 0.30 · `rating IS NOT NULL` → 0.95 | Output = Input × Selectivity, e.g. 1M × 0.01 = 10K rows |
| Combining | AND: S₁ × S₂ · OR: S₁ + S₂ − (S₁ × S₂) | FK join: max(\|R\|, \|S\|) · M:N join: \|R\| × \|S\| / V(key) |
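These combining formulas assume independent predicates, which is the classic source of estimation error when columns are correlated. A minimal sketch of the rules in Python (function names and numbers are illustrative):

```python
def and_selectivity(s1, s2):
    return s1 * s2                    # assumes the predicates are independent

def or_selectivity(s1, s2):
    return s1 + s2 - s1 * s2          # inclusion-exclusion under independence

# genre = 'Rock' AND rating IS NOT NULL over 1M input rows
s = and_selectivity(0.30, 0.95)
print(int(1_000_000 * s))             # estimated output cardinality: 285000
```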
Why it matters: bad estimates → 10–1000× slower plans. Selectivity drives algorithm choice (hash vs. nested loop), access method (index vs. scan), buffer sizing, and parallelism.
Step 1: Gathering Statistics
The pipeline: sample / scan → build structures → store & use.
Sampling Techniques
| Technique | How it works | Pros | Cons | Use when |
|---|---|---|---|---|
| Reservoir | Stream-friendly uniform sample of k rows: keep the first k, then replace with probability k/i | Unbiased; fixed memory | Requires a full scan | You need an exact uniform sample |
| Block sampling | `ANALYZE TABLE SAMPLE 1000 PAGES`; samples whole pages | Fast I/O; cache-friendly | Clustering bias | Data is well distributed |
| Progressive | Start with 100 samples; grow 10× until variance < threshold | Adaptive accuracy; cheap when easy | May need many passes | Distribution is unknown |
```python
# Reservoir sampling (Algorithm R): a uniform k-row sample from a stream
import random

def reservoir_sample(stream, k):
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)        # fill the reservoir with the first k rows
        else:
            j = random.randint(0, i)  # inclusive bound: j uniform over [0, i]
            if j < k:                 # row i survives with probability k/(i+1)
                sample[j] = row
    return sample

# e.g. a uniform 100-row sample from a million-row stream
sample = reservoir_sample(range(1_000_000), k=100)
```
Step 2: Storing Statistics
After gathering, build compressed representations.
Table-level vs. column-level stats
| Scope | What's stored | Stored where |
|---|---|---|
| Table | row_count, page_count, avg_row_len, last_analyzed, staleness | PostgreSQL: `pg_class` · MySQL: `innodb_table_stats` · BigQuery: `INFORMATION_SCHEMA` |
| Column | null_fraction, distinct_count, most-common values + frequencies, histogram bounds | PostgreSQL: `pg_statistic` |
Column-level stats enable range estimates, equality selectivity, NULL handling, and join cardinality.
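As a sketch of how these stats combine, equality selectivity is typically estimated from the most-common-value (MCV) list plus distinct counts; PostgreSQL's estimator works essentially this way, though the function below and its numbers are illustrative:

```python
def eq_selectivity(value, mcv, mcf, n_distinct, null_frac):
    """Estimate the selectivity of `col = value` from column statistics."""
    if value in mcv:                   # value is in the most-common-value list
        return mcf[mcv.index(value)]
    # Spread the remaining non-MCV, non-NULL mass over the other distinct values
    remaining = 1.0 - sum(mcf) - null_frac
    return remaining / max(n_distinct - len(mcv), 1)

# 50 genres; 'Rock' covers 30% of rows, 'Pop' 20%, and 2% are NULL
print(eq_selectivity('Jazz', ['Rock', 'Pop'], [0.30, 0.20], 50, 0.02))  # 0.01
```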
Key trade-off. We can't store every value, so we use compressed structures that trade space for accuracy. Histograms model distributions; sketches answer specific queries (membership, distinct count, frequency).
Histograms: Modeling Value Distribution
| Type | Idea | Best for | Example |
|---|---|---|---|
| Height-balanced | Equal rows per bucket; bucket ranges vary | Skewed data | rating buckets [1.0–2.5] [2.5–3.5] [3.5–4.2] [4.2–5.0], each 25% |
| Equi-width | Equal ranges per bucket; row counts vary | Uniform data | year buckets [1950–1975] 5%, [1975–2000] 15%, [2000–2025] 80% |
| Wavelet | Hierarchical multi-resolution compression | Complex / multi-dimensional patterns | Levels of increasing granularity |
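Histograms turn range predicates into arithmetic. A minimal sketch for the height-balanced case, using the rating buckets from the table above and assuming linear interpolation within a bucket:

```python
def range_selectivity(bounds, low, high):
    """Selectivity of `low <= x <= high` from height-balanced histogram
    bucket boundaries (each bucket holds an equal share of the rows)."""
    n = len(bounds) - 1                          # number of buckets
    frac = 0.0
    for i in range(n):
        b_lo, b_hi = bounds[i], bounds[i + 1]
        overlap = max(0.0, min(high, b_hi) - max(low, b_lo))
        if b_hi > b_lo:
            frac += (overlap / (b_hi - b_lo)) / n
    return frac

# rating buckets [1.0-2.5] [2.5-3.5] [3.5-4.2] [4.2-5.0], each 25% of rows
print(range_selectivity([1.0, 2.5, 3.5, 4.2, 5.0], 4.0, 5.0))  # ≈ 0.32
```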
Probabilistic sketches
| Sketch | Answers | Space | Error | Use case |
|---|---|---|---|---|
| Bloom Filter | "Might item x be in the set?" | ~10 bits/item for 1% FP | 1% false positives | Skip partitions in distributed queries |
| HyperLogLog | COUNT(DISTINCT x) | ~1.5 KB total | ±1% | Unique visitors over billions of events |
| Count-Min Sketch | "How often did item x appear?" | ~10 KB for 0.1% error | ≤ εN overestimate with probability ≥ 1 − δ | Top-K / heavy hitters |
```python
# Bloom filter (space: m bits, k salted hash functions)
m = 2**20
bits = bytearray(m)
hash_funcs = [lambda x, s=s: hash((s, x)) % m for s in range(7)]

def insert(x):
    for h in hash_funcs:
        bits[h(x)] = 1

def contains(x):
    # No false negatives; small chance of a false positive
    return all(bits[h(x)] for h in hash_funcs)
```

```python
# HyperLogLog (space ~1.5 KB total: m = 2**p buckets, each holding a max rank)
p = 10
m = 1 << p
buckets = [0] * m

def add(value):
    h = hash(value) & (2**64 - 1)             # treat as an unsigned 64-bit hash
    bucket = h & (m - 1)                      # low p bits pick the bucket
    rest = h >> p
    rank = (64 - p) - rest.bit_length() + 1   # leading zeros of the rest, plus 1
    buckets[bucket] = max(buckets[bucket], rank)
```

```python
# Count-Min Sketch (space: w × d counters)
w, d = 2048, 4
table = [[0] * w for _ in range(d)]
hash_funcs = [lambda x, s=s: hash((s, x)) % w for s in range(d)]

def increment(item):
    for i, h in enumerate(hash_funcs):
        table[i][h(item)] += 1

def estimate(item):
    # Overestimates by at most εN, with probability at least 1 − δ
    return min(table[i][h(item)] for i, h in enumerate(hash_funcs))
```
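The space figures in the table above come from standard sizing formulas. As a sanity check on the Bloom filter row (~10 bits/item for 1% false positives):

```python
import math

def bloom_bits_per_item(fp_rate):
    # Optimal bits per item for a target false-positive rate: -ln(p) / (ln 2)^2
    return -math.log(fp_rate) / math.log(2) ** 2

print(round(bloom_bits_per_item(0.01), 1))  # 9.6 bits per item
```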
Step 3: Using Statistics for Selectivity Patterns
| Selectivity | Examples | Recommended access |
|---|---|---|
| High (< 1%) | Primary key lookup · unique constraint · rare value | Index |
| Medium (1 – 10%) | Common value · single-month date range · category filter | Bitmap scan |
| Low (> 10%) | `IS NOT NULL` · broad date range · popular category | Sequential scan |
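To see why the thresholds land roughly where they do, here is a toy cost comparison; the page-cost constants mirror PostgreSQL's defaults, but the model itself is a deliberate simplification of what real planners compute:

```python
def cheaper_access(selectivity, rows, pages,
                   seq_page_cost=1.0, random_page_cost=4.0):
    seq_cost = pages * seq_page_cost                    # read every page once
    index_cost = selectivity * rows * random_page_cost  # one random read per match (worst case)
    return "index" if index_cost < seq_cost else "seq scan"

# 1M rows on 10K pages: a 0.01% filter favors the index, a 20% filter the scan
print(cheaper_access(0.0001, 1_000_000, 10_000))  # index
print(cheaper_access(0.20,   1_000_000, 10_000))  # seq scan
```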
Key Takeaways
The complete statistics pipeline runs in four stages:
1. Gather. Sample or scan; balance cost vs. accuracy.
2. Store. Build histograms and sketches; compress intelligently.
3. Use. Estimate selectivity; choose the optimal plan.
4. Update. Monitor staleness; re-analyze when needed.
Without statistics, even the best optimizer is guessing. With good statistics, it picks plans that are 10–1000× faster than the alternatives.
Next: IO Cost Summary puts these selectivity estimates to work, comparing two query plans by hand using the cost formulas you've already seen for sort, hash, and join.