Indexing Strategies & Hotspots
Design indexes for query performance while managing write overhead
TL;DR
Indexes speed up queries but slow down writes (every write must also maintain each index). Use single-column indexes for simple lookups and composite indexes for multi-column WHERE + ORDER BY patterns; avoid indexing low-cardinality or frequently updated columns. Monitor for hotspots: small sets of rows accessed repeatedly and intensively.
Learning Objectives
- Understand index structures (B-tree, hash)
- Design single and composite indexes
- Recognize when indexes help or hurt
- Identify and resolve hotspots
Index Types
B-Tree Index
- Range queries (>, <, BETWEEN)
- Sorting (ORDER BY)
- Prefix matching (LIKE 'abc%')
- General purpose
- Default in most databases
Hash Index
- Exact match only (=)
- Very fast equality
- No range queries
- No sorting benefit
- In-memory stores (Redis)
Practical Example
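A minimal sketch (assuming a hypothetical orders table) of how a composite B-tree index serves both the WHERE filter and the ORDER BY in a single pass:
-- Hypothetical query pattern: a customer's shipped orders, newest first
-- SELECT * FROM orders
--   WHERE customer_id = ? AND status = 'shipped'
--   ORDER BY created_at DESC;
-- Composite index: equality columns first, ORDER BY column last
CREATE INDEX idx_orders_customer_status_created
    ON orders (customer_id, status, created_at DESC);
-- For exact-match-only lookups, a hash index trades range support for
-- fast equality (PostgreSQL syntax; external_ref is an illustrative column)
CREATE INDEX idx_orders_lookup_hash ON orders USING HASH (external_ref);
A hash index suits a column that is only ever compared with equality; it cannot serve range scans or ORDER BY.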
Design Checklist
- Indexed columns match WHERE clause filters
- Composite index order matches query patterns
- ORDER BY column is last in composite index
- Cardinality is reasonable (avoid low-cardinality indexes)
- Frequently updated columns not heavily indexed
- Index size reasonable (not indexing large text)
- Covering indexes used for common queries
- Partial indexes for filtered queries (see the sketch after this checklist)
- Explain plans verify index usage
- Index fragmentation monitored and rebuilt
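The covering-index and partial-index items, as a minimal sketch (PostgreSQL syntax, hypothetical orders table):
-- Covering index: INCLUDE adds non-key columns so the query
-- SELECT customer_id, total FROM orders WHERE customer_id = ?
-- is answered from the index alone, with no table lookup
CREATE INDEX idx_orders_customer_cover
    ON orders (customer_id) INCLUDE (total);
-- Partial index: indexes only open orders, so the index stays small
-- and writes to closed orders never touch it
CREATE INDEX idx_orders_open
    ON orders (customer_id)
    WHERE status = 'open';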
Self-Check
- What's the difference between single and composite indexes?
- Why does index order matter in composite indexes?
- When would you use a partial vs full index?
- How do you detect if an index helps or hurts?
- What are hotspots and how do you identify them?
Understanding Hotspots
A hotspot is a small dataset accessed repeatedly and intensively, creating a bottleneck. Examples:
- Most recent orders (constantly queried and updated)
- Bestselling product (millions of reads/day)
- Trending user profile (high concurrent access)
- Global counter (thousands of increments/second)
Problems Hotspots Create
- Lock Contention: Multiple queries compete for locks on same rows
- Index Cache Inefficiency: Popular rows keep evicting others from cache
- I/O Bottleneck: Queries queue waiting for disk access
- Scalability Wall: Can't improve by adding more servers (all hit same row)
Hotspot Detection
-- PostgreSQL: Find the most frequently scanned indexes
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 10;
-- MySQL: Identify the most frequently executed query patterns
-- (via performance_schema; mysql.slow_log has no digest column)
SELECT DIGEST_TEXT, COUNT_STAR, AVG_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY COUNT_STAR DESC
LIMIT 10;
Hotspot Solutions
- Caching (Recommended)
- Denormalization
- Sharding
- Read Replicas
# Problem: Product #42 gets 100,000 queries/second
# Database can't handle it; CPU at 100%
# Solution: Cache in Redis
import json
import redis

redis_client = redis.Redis()

def get_product(product_id):
    # Check cache first
    cached = redis_client.get(f"product:{product_id}")
    if cached:
        return json.loads(cached)  # ~1ms response

    # Cache miss: query the database
    product = db.query("SELECT * FROM products WHERE id = ?", product_id)

    # Store in cache with a TTL (300 seconds = 5 minutes)
    redis_client.setex(f"product:{product_id}", 300, json.dumps(product))
    return product

# Result: 99% of queries hit Redis (~1ms)
# Only 1% hit the database (~100ms)
# Effective latency: 0.99 * 1ms + 0.01 * 100ms = 1.99ms
# Database load: ~1/100th of before
-- Problem: Computing order total requires joining Orders + OrderItems
-- On hot orders, this is slow
-- Solution 1: Store computed total in orders table
ALTER TABLE orders ADD COLUMN cached_total DECIMAL(10,2);
-- When order changes, update cached_total
UPDATE orders SET cached_total = (
SELECT SUM(price * quantity) FROM order_items
WHERE order_id = orders.id
) WHERE id = ?;
-- Query: No join needed
SELECT * FROM orders WHERE id = ?; -- Much faster
-- Trade-off: cached_total must be kept in sync
-- Solution: a trigger or application logic (the trigger below covers
-- INSERT; UPDATE and DELETE on order_items need matching triggers)
CREATE TRIGGER update_order_total
AFTER INSERT ON order_items
FOR EACH ROW
BEGIN
UPDATE orders SET cached_total = (
SELECT SUM(price * quantity) FROM order_items
WHERE order_id = NEW.order_id
) WHERE id = NEW.order_id;
END;
# Problem: User #1 is a celebrity with 1M followers
# All requests hit the same rows; the database can't keep up
# Solution: Shard by user_id

NUM_SHARDS = 10

def get_user_followers(user_id):
    # Shard key: user_id — each user's rows live on exactly one shard
    shard_id = user_id % NUM_SHARDS

    # Query the appropriate shard
    db = get_shard(shard_id)
    followers = db.query(
        "SELECT * FROM followers WHERE user_id = ?",
        user_id
    )
    return followers

# Result:
# User #1 is on shard 1; user #2 is on shard 2
# Different hot users land on different shards, spreading the load
# (a single extremely hot user still maps to one shard)
# Trade-off: Cross-shard queries are complex
# Problem: Hot product gets 50,000 reads/second
# Primary can handle at most 10,000/second
# Solution: Read replicas
import random

primary_db = connect("mysql-primary")
replicas = [
    connect("mysql-replica-1"),
    connect("mysql-replica-2"),
    connect("mysql-replica-3"),
]

def get_product(product_id):
    # Pick a replica at random so even a single hot product's reads
    # spread across all replicas (choosing by product_id would pin
    # a hot product to one replica)
    db = random.choice(replicas)
    product = db.query(
        "SELECT * FROM products WHERE id = ?",
        product_id
    )
    return product

# Writes still go to the primary
def update_product(product_id, data):
    primary_db.query("UPDATE products SET ... WHERE id = ?", product_id)

# Result:
# Reads distributed across 3 replicas (3x read capacity)
# Each replica handles 50,000 / 3 ≈ 16,667 reads/s
# Acceptable load per replica
Example Hotspot Case:
-- E-commerce hotspot: "Get my current cart"
-- 10,000 concurrent shoppers querying their cart
-- Each cart has 5 items; 50,000 rows accessed
-- Database sees 10,000 QPS, all on the same 50,000 rows
-- Without optimization: Database CPU at 100%, response times 500ms+
-- With caching: Cache hit rate 95%+, response times < 5ms
-- Implementation:
-- 1. Cache cart in Redis (TTL: 5 minutes)
-- 2. On checkout, invalidate cache
-- 3. Periodic sync: refresh cache from DB every minute
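A minimal sketch of that cart implementation (reusing the Redis pattern above; db, cart_items, and the checkout details are illustrative):
import json
import redis

redis_client = redis.Redis()
CART_TTL = 300  # 5 minutes

def get_cart(user_id):
    # Serve the hot read from cache when possible
    cached = redis_client.get(f"cart:{user_id}")
    if cached:
        return json.loads(cached)
    cart = db.query("SELECT * FROM cart_items WHERE user_id = ?", user_id)
    redis_client.setex(f"cart:{user_id}", CART_TTL, json.dumps(cart))
    return cart

def checkout(user_id):
    # ... create the order from the cart ...
    # Invalidate the cached cart so the next read sees fresh state
    redis_client.delete(f"cart:{user_id}")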
Index Maintenance and Cost
Indexes have hidden costs that accumulate:
-- Every INSERT, UPDATE, DELETE updates all indexes on the table
-- 100 insert/sec on table with 5 indexes = 500 index updates/sec
-- Index bloat and fragmentation accumulate over time
-- PostgreSQL: dead entries bloat indexes; REINDEX rebuilds them
REINDEX INDEX idx_users_email;
-- MySQL: Fragmentation increases disk I/O
OPTIMIZE TABLE users; -- Rebuilds table and indexes
-- Finding the indexes that do the most disk reads (PostgreSQL)
SELECT schemaname, relname, indexrelname, idx_blks_read, idx_blks_hit
FROM pg_statio_user_indexes
ORDER BY idx_blks_read DESC -- most disk reads first
LIMIT 10;
Index Selection Trade-offs
More indexes = Faster reads, slower writes
Fewer indexes = Slower reads, faster writes
Example:
- Table: users (1M rows)
- Read load: 10,000 QPS (high; reads matter)
- Write load: 10 QPS (low; writes don't matter)
- Decision: Create comprehensive indexes; optimize for reads
Alternative:
- Table: events (100M rows, append-only)
- Read load: 100 QPS (low)
- Write load: 10,000 QPS (very high; writes matter)
- Decision: Minimal indexes; optimize for writes
Query Optimization Without Indexes
Sometimes indexes aren't the answer:
-- Problem: Query is slow even with indexes
SELECT u.id, u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 10
ORDER BY order_count DESC;
-- Solution 1: Materialized View (pre-computed)
CREATE MATERIALIZED VIEW user_order_summary AS
SELECT u.id, u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
REFRESH MATERIALIZED VIEW user_order_summary; -- Run nightly
-- Query becomes: SELECT * FROM user_order_summary WHERE order_count > 10
-- Solution 2: Denormalization (cache in application)
-- Compute during background job, store in separate table
CREATE TABLE user_stats (
    user_id BIGINT PRIMARY KEY,
    name TEXT,
    order_count INT,
    last_order_date TIMESTAMP,
    total_spend DECIMAL(12,2)
);
-- Update via trigger or batch job
-- Solution 3: Archive old data (partition)
-- Store 2024 orders in archive table; query only 2025
SELECT u.id, u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders_2025 o ON u.id = o.user_id -- only recent data
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 10;
Hotspot Patterns Across Different Systems
Social Media: Celebrity User
Problem:
- Celebrity with 10M followers; all reading celebrity's posts
- Single row in followers table accessed 10M times/day
- Cache misses; database can't keep up
Solutions:
1. Caching (aggressive)
- Redis cache: celebrity posts (TTL: 1 minute)
- CDN for celebrity photos (geographic distribution)
- 99% of requests served from cache
2. Denormalization
- Pre-compute follower feed rankings
- Store top 100 posts in user_feed table
- Query: SELECT * FROM user_feed WHERE user_id = celebrity_id
3. Fanout at write time (see the sketch after this list)
- When celebrity posts, push to all followers' feeds asynchronously
- Trade: higher write latency, but reads extremely fast
4. Sharding (extreme)
- Shard followers by follower_id % 100
- Each shard has fraction of followers
- Celebrity distributed; no single bottleneck
- Trade: cross-shard queries expensive
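A minimal sketch of write-time fanout (a task queue is assumed; task_queue, get_follower_ids, and feed_store are illustrative names):
def publish_post(celebrity_id, post_body):
    # Persist the post once
    post_id = db.insert(
        "INSERT INTO posts (user_id, body) VALUES (?, ?)",
        celebrity_id, post_body
    )
    # Enqueue the fanout so the celebrity's request returns quickly
    task_queue.enqueue(fanout_post, celebrity_id, post_id)

def fanout_post(celebrity_id, post_id):
    # Background worker: push the post into each follower's feed,
    # keeping only the top 100 entries per feed (Redis list ops)
    for follower_id in get_follower_ids(celebrity_id):
        feed_store.lpush(f"feed:{follower_id}", post_id)
        feed_store.ltrim(f"feed:{follower_id}", 0, 99)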
E-Commerce: Bestselling Product
Problem:
- Product SKU-42 on sale; 1,000 page views/sec
- Inventory: 500 units
- 100 orders/sec (all for SKU-42)
- Single row updated constantly; lock contention
Solutions:
1. Read Replica for Reads
- Primary: handles all writes (inventory updates)
- Replicas (3x): handle all reads (page views)
- Distributed read load; primary writes
2. Eventual Consistency
- Accept stale inventory (30 seconds old)
- Cache inventory in Redis; 30s TTL
- Occasional oversell, but cache handles 99% of reads
- Inventory corrected asynchronously
3. Distributed Counter (see the sketch after this list)
- Instead of SELECT inventory WHERE sku = 42; UPDATE ...
- Use atomic counter: DECR inventory:sku-42
- Extremely fast (no lock contention)
- Eventually consistent with database
- Periodic sync: calculate actual inventory, correct counter
4. Async Processing
- Frontend shows cached inventory
- User clicks "Add to Cart" (doesn't decrement inventory yet)
- Background job processes cart asynchronously
- If inventory runs out, cart rejected with backorder option
- Separates UI (fast) from inventory (slow)
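A minimal sketch of the distributed counter (reusing the Redis client from earlier; key names and the sync job are illustrative):
def reserve_unit(sku):
    # Atomic decrement: no row lock, no contention
    remaining = redis_client.decr(f"inventory:{sku}")
    if remaining < 0:
        # Oversold: return the unit and reject the reservation
        redis_client.incr(f"inventory:{sku}")
        return False
    return True

def sync_inventory(sku):
    # Periodic job: recompute true inventory from the database
    # and correct the counter (eventual consistency)
    actual = db.query("SELECT quantity FROM inventory WHERE sku = ?", sku)
    redis_client.set(f"inventory:{sku}", actual)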
Time-Series: Recent Data
Problem:
- Metrics table: 10B rows/day
- All queries ask for "last 24 hours" or "last hour"
- Database scans only 0.1% of rows (recent data)
- But table so large, sequential scan slow
Solutions:
1. Time-Bucketing, i.e. Partition by Time (see the sketch after this list)
- Create one table per day: metrics_2025_02_15
- Queries: SELECT * FROM metrics_2025_02_15 WHERE timestamp > ...
- Only recent tables live in hot storage
- Old tables archived to cold storage
- Result: scans only relevant partition
2. Index on Timestamp + Metric
- CREATE INDEX idx_metrics_ts ON metrics(timestamp DESC)
- Query: SELECT * FROM metrics WHERE timestamp > now() - interval '1 hour'
- Index scan returns only recent rows
- Fast; avoids full table scan
3. Columnar Storage
- Time-series databases (InfluxDB, Prometheus) store by column
- Metrics table: [time, metric_name, server_id, value]
- Stored as columns: [all_times], [all_names], [all_ids], [all_values]
- Compression: repeated values compress well
- Query: fetch only needed columns; skip others
- Result: 10x faster on relevant data
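A minimal sketch of time-bucketing with PostgreSQL declarative partitioning (table and column names are illustrative):
-- Parent table partitioned by time range
CREATE TABLE metrics (
    ts          timestamptz NOT NULL,
    metric_name text        NOT NULL,
    server_id   int         NOT NULL,
    value       double precision
) PARTITION BY RANGE (ts);

-- One partition per day; a query on recent timestamps scans only
-- the matching partition (partition pruning)
CREATE TABLE metrics_2025_02_15 PARTITION OF metrics
    FOR VALUES FROM ('2025-02-15') TO ('2025-02-16');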
Next Steps
- Analyze queries — Use EXPLAIN (EXPLAIN ANALYZE in PostgreSQL) to verify index usage (example below)
- Identify hotspots — Monitor slow queries, identify repeated patterns
- Design indexes — Match WHERE and ORDER BY columns
- Test before deploy — Ensure indexes improve (not hurt) performance
- Monitor fragmentation — Rebuild indexes regularly
- Cache hot data — Redis for frequently accessed datasets
- Consider materialization — Pre-compute expensive queries
- Partition strategically — Time-bucketing for time-series data
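As a minimal check (idx_users_email is an illustrative index name):
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'a@example.com';
-- Look for "Index Scan using idx_users_email" in the plan output;
-- a "Seq Scan" on a large table means the index is missing,
-- unusable for this predicate, or skipped by the planner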
References
- PostgreSQL Index Documentation
- MySQL Index Optimization
- Use The Index, Luke!
- PostgreSQL EXPLAIN Documentation
- "Designing Data-Intensive Applications" by Martin Kleppmann (index chapter)
- "Learning SQL" by Alan Beaulieu (optimization chapter)