Selectivity & Statistics: How Databases Estimate Data Flow

Concept. Selectivity is the fraction of rows a predicate keeps; the database's statistics (histograms, distinct-value counts) let the query planner estimate selectivity to choose the cheapest plan.

Intuition. WHERE user_id = 1 keeps a vanishingly small fraction of Listens (a handful of rows out of a billion); WHERE rating > 0 keeps ~50%. The optimizer needs to know which is which: the near-zero predicate should use an index, the 50% predicate should scan the table. Statistics tell it which.

Core Concepts: Selectivity & Cardinality

| | Selectivity | Cardinality |
| --- | --- | --- |
| Definition | Fraction of rows passing the filter (0.0 – 1.0) | Number of rows in the output |
| Examples | user_id = 123 → 0.0001 · genre = 'Rock' → 0.30 · rating IS NOT NULL → 0.95 | Output = Input × Selectivity, e.g. 1M × 0.01 = 10K rows |
| Combining | AND: S₁ × S₂ · OR: S₁ + S₂ − (S₁ × S₂) | FK join: max(\|R\|, \|S\|) · M:N join: \|R\| × \|S\| / V(key) |

Why it matters: bad estimates → 10–1000× slower plans. Selectivity drives algorithm choice (hash vs. nested loop), access method (index vs. scan), buffer sizing, and parallelism.
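
As a quick back-of-envelope check of the combining rules above, here is a tiny Python sketch; the table size and per-predicate selectivities are made-up numbers, not measurements:

```python
# Hypothetical numbers: a 1M-row table and two independent predicates.
rows = 1_000_000
s_genre = 0.30         # genre = 'Rock'
s_rating = 0.05        # rating > 4.5

s_and = s_genre * s_rating                        # AND: S1 * S2          -> 0.015
s_or = s_genre + s_rating - s_genre * s_rating    # OR: S1 + S2 - S1*S2   -> 0.335

print(int(rows * s_and))   # ~15,000 rows expected to pass both filters
print(int(rows * s_or))    # ~335,000 rows expected to pass either filter
```

Note that the AND rule assumes the predicates are independent; correlated columns are exactly where these estimates go wrong.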


Step 1: Gathering Statistics

The pipeline: sample / scan → build structures → store & use.

Sampling Techniques

| Technique | How it works | Pros | Cons | Use when |
| --- | --- | --- | --- | --- |
| Reservoir | Stream-friendly uniform sample of k rows: keep the first k, then replace with probability k/i | Unbiased; fixed memory | Requires a full scan | You need an exact uniform sample |
| Block sampling | ANALYZE TABLE SAMPLE 1000 PAGES: samples whole pages | Fast I/O; cache-friendly | Clustering bias | Data is well distributed |
| Progressive | Start with 100 samples; grow 10× until variance < threshold | Adaptive accuracy; cheap when easy | May need many passes | Unknown distribution |
```python
# Reservoir sampling (Algorithm R): the canonical streaming algorithm.
# Keeps a uniform k-row sample of a stream of unknown length in O(k) memory.
import random

def reservoir_sample(stream, k):
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)          # fill the reservoir with the first k rows
        else:
            j = random.randint(0, i)    # uniform slot in [0, i]
            if j < k:
                sample[j] = row         # row i survives with probability k/(i+1)
    return sample
```
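
For example, a sample like this can feed the column statistics built in the next step; the listens_stream iterable and the rating column are placeholders for illustration:

```python
# Hypothetical usage: derive simple column stats from a 10,000-row sample.
sample = reservoir_sample(listens_stream, k=10_000)

ratings = [row["rating"] for row in sample]
null_fraction = sum(r is None for r in ratings) / len(ratings)
distinct_count = len({r for r in ratings if r is not None})
```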

Step 2: Storing Statistics

After gathering, build compressed representations.

Table-level vs. column-level stats

| Scope | What's stored | Stored where |
| --- | --- | --- |
| Table | row_count, page_count, avg_row_len, last_analyzed, staleness | PostgreSQL: pg_class · MySQL: innodb_table_stats · BigQuery: INFORMATION_SCHEMA |
| Column | null_fraction, distinct_count, most-common values + frequencies, histogram bounds | PostgreSQL: pg_statistic |

Column-level stats enable range estimates, equality selectivity, NULL handling, and join cardinality.
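
To make the equality case concrete, here is a hedged Python sketch loosely modeled on how PostgreSQL's equality estimator uses the most-common-values list and the distinct count; the stats dictionary and its numbers are invented for illustration:

```python
# Invented column stats for genre: NULL fraction, distinct count,
# and the most-common values (MCVs) with their observed frequencies.
stats = {
    "null_fraction": 0.02,
    "distinct_count": 50,
    "mcv": {"Rock": 0.30, "Pop": 0.22, "Jazz": 0.08},
}

def eq_selectivity(stats, value):
    if value in stats["mcv"]:
        return stats["mcv"][value]           # exact frequency for a common value
    # Otherwise spread the leftover probability evenly over the remaining distinct values.
    leftover = 1.0 - stats["null_fraction"] - sum(stats["mcv"].values())
    remaining = stats["distinct_count"] - len(stats["mcv"])
    return leftover / remaining

print(eq_selectivity(stats, "Rock"))    # 0.30  (a most-common value)
print(eq_selectivity(stats, "Zydeco"))  # ~0.008 (a rare value)
```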

Key trade-off. We can't store every value, so we use compressed structures that trade space for accuracy. Histograms model distributions; sketches answer specific queries (membership, distinct count, frequency).

Histograms: Modeling Value Distribution

| Type | Idea | Best for | Example |
| --- | --- | --- | --- |
| Height-balanced (equi-depth) | Equal rows per bucket; bucket ranges vary | Skewed data | rating buckets [1.0–2.5], [2.5–3.5], [3.5–4.2], [4.2–5.0], each holding 25% of rows |
| Equi-width | Equal ranges per bucket; row counts vary | Uniform data | year buckets [1950–1975] 5%, [1975–2000] 15%, [2000–2025] 80% |
| Wavelet | Hierarchical multi-resolution compression | Complex / multi-dimensional patterns | Levels of increasing granularity |
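
As an illustration of how a histogram answers a range predicate, here is a small Python sketch using the height-balanced rating buckets from the table above; the linear interpolation inside a partially covered bucket is an assumption, not any particular engine's exact formula:

```python
# Height-balanced histogram: each bucket holds the same fraction of rows (25%).
bounds = [1.0, 2.5, 3.5, 4.2, 5.0]
bucket_fraction = 1.0 / (len(bounds) - 1)

def selectivity_greater_than(x):
    """Estimate the selectivity of `rating > x` from the histogram."""
    sel = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if x <= lo:
            sel += bucket_fraction                           # whole bucket passes
        elif x < hi:
            sel += bucket_fraction * (hi - x) / (hi - lo)    # partial bucket, linear interpolation
    return sel

print(selectivity_greater_than(4.2))  # 0.25: exactly the last bucket
print(selectivity_greater_than(3.0))  # 0.625: half of bucket 2 plus buckets 3 and 4
```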

Probabilistic sketches

| Sketch | Answers | Space | Error | Use case |
| --- | --- | --- | --- | --- |
| Bloom filter | "Might item x be in the set?" | ~10 bits/item for 1% FP | ~1% false positives, no false negatives | Skip partitions in a distributed query |
| HyperLogLog | COUNT(DISTINCT x) | ~1.5 KB total | ±1% | Unique visitors over billions of events |
| Count-Min Sketch | "How often did item x appear?" | ~10 KB for 0.1% error | Overestimates by at most εN, with probability 1 − δ | Top-K / heavy hitters |
```python
# Bloom filter (space O(m) bits)
m = 80_000                                                          # ~10 bits/item for 1% FP at 8K items
bits = bytearray(m)
hash_funcs = [lambda x, s=s: hash((s, x)) % m for s in range(7)]    # k ≈ 7 hash functions

def insert(x):
    for h in hash_funcs:
        bits[h(x)] = 1                          # set one bit per hash function

def contains(x):
    return all(bits[h(x)] for h in hash_funcs)  # all bits set -> "probably present"
```
```python
# HyperLogLog (space ~1.5 KB total)
p = 11                      # 2^11 = 2048 registers, a few bits each
m = 1 << p
buckets = [0] * m

def add(value):
    h = hash(value) & 0xFFFFFFFF                # treat as a 32-bit hash
    bucket = h & (m - 1)                        # low p bits choose the register
    rest = h >> p
    zeros = (32 - p) - rest.bit_length()        # leading zeros in the remaining bits
    buckets[bucket] = max(buckets[bucket], zeros + 1)
```
```python
# Count-Min Sketch (space O(w × d))
w, d = 2048, 5                                                      # width (counters per row) and depth (rows)
table = [[0] * w for _ in range(d)]
hash_funcs = [lambda x, s=s: hash((s, x)) % w for s in range(d)]

def increment(item):
    for i, h in enumerate(hash_funcs):
        table[i][h(item)] += 1                  # every row counts the item (plus any collisions)

def estimate(item):
    # Collisions only inflate counts, so the minimum across rows is the least-inflated estimate.
    return min(table[i][h(item)] for i, h in enumerate(hash_funcs))
```
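
For instance, continuing the Bloom-filter sketch above (run together with that block; the user_ids are made up), a partition can be probed before it is read:

```python
# Hypothetical usage: index which user_ids appear in a partition, then probe.
for uid in [42, 1001, 31337]:   # user_ids seen while writing the partition
    insert(uid)

print(contains(42))   # True: definitely inserted
print(contains(7))    # False, or rarely True (a false positive); never a false negative
```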

Step 3: Using Statistics for Selectivity Patterns

| Selectivity | Examples | Recommended access |
| --- | --- | --- |
| High (filter keeps < 1% of rows) | Primary-key lookup · unique constraint · rare value | Index scan |
| Medium (1 – 10%) | Common value · single-month date range · category filter | Bitmap scan |
| Low (> 10%) | IS NOT NULL · broad date range · popular category | Sequential scan |
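
The thresholds in the table can be read as a simple decision rule. The sketch below is a toy illustration, not any particular engine's planner; the 1% and 10% cutoffs are just the numbers assumed above:

```python
def choose_access_method(selectivity):
    """Toy rule of thumb mapping estimated selectivity to an access method."""
    if selectivity < 0.01:
        return "index scan"        # few rows: per-row index lookups are cheap
    if selectivity <= 0.10:
        return "bitmap scan"       # moderate: collect matching pages, then read them in order
    return "sequential scan"       # many rows: reading everything beats per-row lookups

print(choose_access_method(0.0001))  # index scan       (user_id = 123)
print(choose_access_method(0.30))    # sequential scan  (genre = 'Rock')
```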

Key Takeaways

The complete statistics pipeline runs in four stages:

  1. Gather. Sample or scan; balance cost vs. accuracy.

  2. Store. Build histograms and sketches; compress intelligently.

  3. Use. Estimate selectivity; choose the optimal plan.

  4. Update. Monitor staleness; re-analyze when needed.
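
As a sketch of stage 4, a staleness check in the spirit of PostgreSQL's autovacuum analyze trigger (threshold + scale factor × table size); the function name and defaults here are illustrative, not the actual trigger code:

```python
def needs_analyze(rows_modified_since_analyze, total_rows,
                  threshold=50, scale_factor=0.10):
    """Re-analyze once enough of the table has changed since the last ANALYZE."""
    return rows_modified_since_analyze > threshold + scale_factor * total_rows

print(needs_analyze(5_000, 1_000_000))    # False: only ~0.5% of rows changed
print(needs_analyze(150_000, 1_000_000))  # True: ~15% changed, stats are stale
```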

Without statistics, even the best optimizer is guessing. With good statistics, it picks plans that are 10–1000× faster than the alternatives.


Next: IO Cost Summary puts these selectivity estimates to work, comparing two query plans by hand using the cost formulas you've already seen for sort, hash, and join.