Indexing & Query Optimization

TL;DR

Indexes accelerate lookups by maintaining ordered access paths; good queries pair selective predicates with covering indexes to minimize disk I/O. Measure with query execution plans, avoid wildcard prefixes and unbounded scans, and watch for write amplification and hotspotting. The key: understand your query patterns first, then add indexes strategically.

Learning Objectives

After reading this article, you will be able to:

  • Read and interpret SQL execution plans to identify index usage and bottlenecks.
  • Design composite indexes aligned with common query access patterns.
  • Detect and remediate anti-patterns (N+1 queries, wildcard LIKE, full table scans).
  • Balance index benefit (faster reads) against cost (slower writes, storage).
  • Understand and mitigate hotspot and sharding issues in distributed systems.

Motivating Scenario

Your e-commerce platform's product search takes 5 seconds for 100 queries/second. The database is at saturation: every search does a full table scan of 10M products. After analyzing query patterns, you notice most searches filter by category and price range, then sort by rating. Adding a composite index (category, price, rating) reduces latency to 50ms. Write latency increases slightly (index overhead), but reads are now 100x faster. The cost: 200MB of additional index storage and a brief lock during index creation.

Core Concepts

What is an Index?

An index is a data structure (usually a B-tree) that maintains a sorted view of table columns, allowing the database to find rows quickly without scanning all rows.

Example: A table with 10M rows takes milliseconds to find rows by ID if an index exists, but seconds with a full scan.
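
A minimal illustration (the users table and the plan text are hypothetical; exact EXPLAIN output varies by engine):

EXPLAIN SELECT * FROM users WHERE id = 42;
-- With a primary-key index: "Index Scan using users_pkey" (a handful of page reads)

EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';
-- With no index on email: "Seq Scan on users" (every row is read and tested)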

Types of Indexes

  • Primary: unique identifier, fast lookups. Trade-off: enforced uniqueness, slower inserts.
  • B-tree: range queries, equality, sorting. The standard, general-purpose choice.
  • Hash: equality only, very fast. Trade-off: no range queries, no sorting.
  • Bitmap: low-cardinality columns (gender, status). Trade-off: fast for filtering, slow for updates.
  • Covering: includes all columns needed for a query. Trade-off: larger index, faster reads.
  • Partial: indexes only rows matching a WHERE condition (e.g., is_deleted = false). Trade-off: smaller index; inserts that don't match the predicate skip index maintenance.
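
Syntax sketches for the less common types (PostgreSQL syntax; the table and index names are illustrative):

-- Hash index: equality lookups only
CREATE INDEX idx_users_email_hash ON users USING HASH (email);

-- Partial index: only rows matching the predicate are indexed
CREATE INDEX idx_products_active ON products (category_id) WHERE is_deleted = false;

-- Covering index: INCLUDE stores extra columns in the leaf pages (PostgreSQL 11+)
CREATE INDEX idx_orders_covering ON orders (user_id) INCLUDE (total, created_at);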

Composite Index Design

The order of columns in a composite index matters:

Index: (category, price, rating)

Good query: SELECT * FROM products WHERE category='Electronics' AND price < 100 ORDER BY rating
→ Index is used: equality seek on category, then a range scan on price. Because price is a range, the index no longer guarantees rating order, so a (cheap) sort runs over the already-filtered rows.

Equivalent query: SELECT * FROM products WHERE price < 100 AND category='Electronics'
→ Same plan. The optimizer normalizes predicates, so the order of conditions in the WHERE clause does not matter; what matters is the order of columns in the index.

Bad query: SELECT * FROM products WHERE rating > 4.5
→ Index cannot be used: rating is not a leading column, and a B-tree can only be traversed starting from its leftmost columns.

Rule: ESL

  • Equality columns first (exact matches).
  • Sort columns second (ORDER BY).
  • Late columns last: range filters (<, >, BETWEEN) go at the end, because a range predicate prevents the index from being used for any column after it.
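
Applying the rule to a hypothetical support-ticket workload (the table, columns, and index name are illustrative):

-- Query: open tickets from the last 7 days, highest priority first
-- ESL: equality (status), sort (priority), late/range (created_at)
CREATE INDEX idx_tickets_esl ON tickets (status, priority DESC, created_at);

SELECT id, subject, priority
FROM tickets
WHERE status = 'open'
  AND created_at > CURRENT_TIMESTAMP - INTERVAL '7 days'
ORDER BY priority DESC
LIMIT 50;
-- With equality on status, the index already delivers rows in priority order;
-- the created_at range is applied as a filter during the scan.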

Cardinality and Selectivity

Cardinality is the number of distinct values in a column. Selectivity is the fraction of rows returned.

  • High cardinality (IDs, emails): good leading columns; an equality match filters out almost all rows.
  • Low cardinality (gender, status): poor leading columns; each value matches a large fraction of rows.

Example:

Column: user_id → cardinality = 10M, selectivity = 1 row / 10M (very selective)
Column: is_active → cardinality = 2, selectivity = 50% (not selective)

Composite: (user_id, is_active)
→ user_id filters to 1 row; is_active is redundant (user_id is already unique)

Composite: (is_active, user_id)
→ is_active narrows 10M rows only to ~5M before user_id applies; the low-cardinality leading column adds little, and the index is useless for queries that filter on user_id alone. Lead with the high-cardinality column.

Decision flow: add indexes where queries are selective and frequently run.
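
A quick way to check cardinality and a rough equality selectivity before designing the index (sessions and user_id are illustrative; the ::float cast is PostgreSQL syntax):

SELECT COUNT(DISTINCT user_id)                   AS cardinality,
       COUNT(*)::float / COUNT(DISTINCT user_id) AS avg_rows_per_value,
       1.0 / COUNT(DISTINCT user_id)             AS approx_selectivity  -- fraction matched by user_id = ?
FROM sessions;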

Practical Example: Index Design and Query Optimization

-- Product table with 10M rows
CREATE TABLE products (
id BIGINT PRIMARY KEY,
category_id INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
rating FLOAT,
name VARCHAR(255),
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Common query pattern: filter by category, price range, sort by rating
-- BEFORE: No index on (category_id, price)
EXPLAIN SELECT id, name, price, rating
FROM products
WHERE category_id = 5
AND price BETWEEN 10 AND 100
ORDER BY rating DESC
LIMIT 20;
-- Result: Full table scan, 10M rows examined, ~5 seconds

-- AFTER: Composite index
CREATE INDEX idx_category_price_rating ON products(category_id, price, rating DESC);

-- EXPLAIN again:
EXPLAIN SELECT id, name, price, rating
FROM products
WHERE category_id = 5
AND price BETWEEN 10 AND 100
ORDER BY rating DESC
LIMIT 20;
-- Result: Index seek, ~1000 rows examined, ~50ms
-- (the ORDER BY rating sort now runs over ~1000 pre-filtered rows instead of 10M)

-- Covering index: include 'name' so the database doesn't need a table lookup
-- (INCLUDE is PostgreSQL 11+ syntax; in MySQL, append the extra columns to the index key instead)
CREATE INDEX idx_category_price_rating_covering
ON products(category_id, price, rating DESC)
INCLUDE (name, id);

-- Now the query doesn't need a table lookup at all
EXPLAIN SELECT id, name, price, rating
FROM products
WHERE category_id = 5
AND price BETWEEN 10 AND 100
ORDER BY rating DESC
LIMIT 20;
-- Result: Index-only scan, ~1000 rows examined, ~20ms

-- Anti-pattern: Wildcard at start (cannot use index efficiently)
EXPLAIN SELECT * FROM products WHERE name LIKE '%laptop%';
-- Result: Full table scan, cannot use name index

-- Better: Wildcard at end (can use index)
EXPLAIN SELECT * FROM products WHERE name LIKE 'laptop%';
-- Result: Index range scan, faster
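
-- Contains-style search that a B-tree can't serve may warrant a full-text index
-- (a PostgreSQL-specific sketch beyond the original example: GIN + to_tsvector)
CREATE INDEX idx_products_name_fts ON products USING GIN (to_tsvector('english', name));
EXPLAIN SELECT * FROM products WHERE to_tsvector('english', name) @@ to_tsquery('laptop');
-- Result: bitmap scan on the GIN index instead of a full table scan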

-- Anti-pattern: N+1 queries (common in ORMs)
-- Pseudocode:
--   categories = db.query("SELECT * FROM categories")
--   for cat in categories:
--       products = db.query("SELECT * FROM products WHERE category_id = ?", cat.id)  # N queries
-- Total: 1 + N queries

-- Better: Join or batch query
SELECT c.id, c.name, COUNT(p.id) as product_count
FROM categories c
LEFT JOIN products p ON c.id = p.category_id
WHERE c.is_active = true
GROUP BY c.id
LIMIT 100;
-- Result: Single query, 100ms
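
-- Or batch-fetch the children in one round trip when a JOIN doesn't fit the caller
-- (the literal ids stand in for values collected from the first query):
SELECT * FROM products WHERE category_id IN (1, 2, 3);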

-- Hotspot avoidance: don't lead an index with a monotonically increasing key
-- BAD: every insert lands on the rightmost leaf page of the B-tree
CREATE INDEX idx_id_time ON events(id, event_time);

-- BETTER: scatter writes by partitioning on a hash of the key
-- (PostgreSQL: declare the parent table partitioned by an expression, then attach shards)
CREATE TABLE events (
    id BIGINT NOT NULL,
    event_time TIMESTAMP NOT NULL
) PARTITION BY LIST ((id % 2));

CREATE TABLE events_shard_0 PARTITION OF events FOR VALUES IN (0);
CREATE TABLE events_shard_1 PARTITION OF events FOR VALUES IN (1);
-- Inserts now route automatically by id % 2, spreading writes across partitions
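
One operational note on the "brief lock during index creation" from the motivating scenario: in PostgreSQL, a plain CREATE INDEX blocks writes for the duration of the build, and a non-blocking variant exists (slower to build, and it cannot run inside a transaction block):

-- Alternative to the plain CREATE INDEX above: build without blocking concurrent writes
CREATE INDEX CONCURRENTLY idx_category_price_rating
ON products(category_id, price, rating DESC);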

When to Use / When NOT to Use

Invest in Indexes
  1. OLTP systems with selective WHERE clauses (financial, e-commerce)
  2. Queries on columns with high cardinality (user IDs, timestamps)
  3. Queries that are slow (>100ms) and run frequently (>10qps)
  4. Sort/GROUP BY operations on large datasets
  5. Range queries (BETWEEN, <, >) on numeric/date columns
Defer Indexing
  1. OLAP/data warehouse: denormalization and materialized views often better
  2. Low-cardinality columns (gender, status) as leading index columns
  3. Rare queries: optimize when proven slow, not speculatively
  4. Tiny tables: full scans are fast enough
  5. Write-heavy workloads where index maintenance cost dominates

Patterns & Pitfalls

  • Measure, don't guess: run EXPLAIN (or EXPLAIN PLAN in Oracle) before and after adding an index. Look for "Index Seek"/"Index Scan" (good) vs. "Table Scan"/"Seq Scan" (bad), and measure query time with realistic data volumes.
  • N+1 queries: fetching parent rows, then looping to fetch children for each parent, is a classic mistake. Use JOINs, subqueries, or batch fetches: replace loop{ db.getOrders(user_id) } with SELECT * FROM orders WHERE user_id IN (...).
  • Missing covering indexes: include all columns a query needs in the index so the database never touches the base table. Especially valuable for SELECTs with aggregations or narrow projections.
  • Unused indexes: they slow down writes without helping reads. Periodically query the database's index-usage metadata and drop what isn't used (see the sketch after this list).
  • Hotspotting: inserting rows with incrementing primary keys (AUTO_INCREMENT) concentrates writes at the right edge of the B-tree. Use hashed or reversed keys, random UUIDs, or hash partitioning to distribute inserts.
  • Ignoring partial indexes: if 80% of queries filter on is_deleted = false, create the index ON (column) WHERE is_deleted = false; it stays small and cheap to maintain.
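
A sketch of the unused-index audit using PostgreSQL's pg_stat_user_indexes statistics view (other engines expose similar metadata, e.g. the sys schema in MySQL):

SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- never used since statistics were last reset
ORDER BY pg_relation_size(indexrelid) DESC;
-- Review before dropping: unique/PK indexes enforce constraints even if never scanned.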

Design Review Checklist

  • Have you profiled queries to understand which are slow?
  • Does EXPLAIN show index usage for top slow queries?
  • Are index columns ordered by ESL (Equality, Sort, Late filter)?
  • Do you have covering indexes for frequently run queries?
  • Are partial indexes used where many rows are filtered out?
  • Have you tested index creation time and write impact?
  • Are unused indexes monitored and removed periodically?
  • Have you benchmarked N+1 queries and replaced them with JOINs?
  • Is hotspotting addressed in your sharding/partitioning strategy?
  • Are query execution plans reviewed before deploying schema changes?

Self-Check

Before adding indexes:

  1. Measure first: Does EXPLAIN show a table scan? Is the query actually slow? Don't optimize speculatively.

  2. High cardinality: Is the leading index column high-cardinality (many distinct values)? If not, the index may not help much.

  3. Write cost: How often are writes compared to reads? If writes dominate, the index overhead may not be worth it.

  4. Storage: Can you afford the index size? Each index uses disk and memory for the working set.

Next Steps

  • Audit queries: Find top slow queries from your application logs.
  • Run EXPLAIN: Understand why queries are slow.
  • Design indexes: Use ESL rule and cardinality principles.
  • Test: Create indexes in staging; measure query time and write impact.
  • Deploy: Add indexes; monitor query latency improvement.
  • Maintain: Periodically find and drop unused indexes.

References

  1. Use The Index, Luke!
  2. PostgreSQL: Using EXPLAIN
  3. MySQL: Optimization and Indexes
  4. SQL Performance: Index Design