Data Replication, Sharding, and Partitioning

Scale data storage horizontally by distributing data across multiple databases and replicas.

TL;DR

Single databases don't scale indefinitely. Three techniques distribute data: replication (same data on multiple servers for availability and read scaling), sharding (split data by key across servers to distribute load), and partitioning (split data within a database for manageability). Use replication for high availability and read-heavy workloads. Use sharding when one database can't handle the write volume. Sharding introduces complexity: routing logic (how to find the right shard?), rebalancing (what happens when you add a new shard?), and cross-shard queries (expensive and hard). Most systems use all three: primary/secondary replication for availability, sharding by user ID for horizontal scaling, and logical partitions within each shard for manageability.

Learning Objectives

  • Understand replication strategies and consistency tradeoffs
  • Design sharding keys and partitioning schemes
  • Implement shard routing logic
  • Handle shard rebalancing and hotspots
  • Deal with cross-shard queries and transactions
  • Monitor and manage replicas at scale

Motivating Scenario

A service's database grows to terabytes. Queries slow down. Writes can't keep up. Vertical scaling (bigger servers) has limits. Replication helps with reads but not writes. Sharding distributes writes across multiple databases. But how do you route requests to the right shard? What if one shard gets too much traffic? How do you join data across shards?

Core Concepts

Replication Patterns

Leader-Follower: writes go to the leader and are replicated asynchronously to followers, which serve reads. Tradeoff: eventual consistency (followers lag behind the leader) in exchange for read scaling and high availability. Multi-Leader: writes can go to any leader and are synchronized between leaders. Tradeoff: more complexity, including conflict resolution when concurrent writes touch the same data.
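
A minimal sketch of the leader-follower pattern, assuming hypothetical Database-like objects with insert/query methods; reads that must see the caller's own writes can be pinned to the leader, while everything else spreads across followers:

import random

class ReplicatedStore:
    """Leader-follower replication: all writes go to one leader, reads to followers."""
    def __init__(self, leader, followers):
        self.leader = leader        # single write endpoint
        self.followers = followers  # asynchronously replicated read replicas

    def write(self, table, row):
        # Followers catch up asynchronously, so follower reads may briefly be stale.
        self.leader.insert(table, row)

    def read(self, table, conditions, require_fresh=False):
        # Pin to the leader when the caller needs read-your-writes consistency;
        # otherwise distribute reads across followers.
        db = self.leader if require_fresh else random.choice(self.followers)
        return db.query(f"SELECT * FROM {table} WHERE {conditions}")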

Sharding

Partition data by a key (user ID, order ID) across multiple databases so that each database handles a subset of the data. Writes spread across shards, improving throughput. Tradeoff: queries that span multiple shards are expensive, and transactions that span shards are hard.
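
For example, hashing the user ID to pick a shard keeps all of one user's rows on the same database, so per-user queries stay single-shard (a sketch assuming four shards):

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    # Every row for the same user lands on the same shard, so
    # "orders for user X" is a single-shard query, while
    # "all orders placed today" must fan out to every shard.
    return user_id % NUM_SHARDS

shard_for(42)    # -> 2
shard_for(1001)  # -> 1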

Partitioning

Divide large tables into smaller partitions (by date range, hash range) within a single database. Improves query performance (query only relevant partitions) and manageability (drop old partitions). Not true horizontal scaling but helpful for large tables.
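
As a sketch, PostgreSQL supports this declaratively with range partitioning; the example below issues the DDL through a hypothetical DB-API connection `conn`, and the table and partition names are illustrative:

# Assumes an open DB-API connection `conn` to a PostgreSQL database.
ddl = """
CREATE TABLE orders (
    order_id   bigint      NOT NULL,
    user_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
"""
with conn.cursor() as cur:
    cur.execute(ddl)  # queries filtered on created_at scan only matching partitions
    # Dropping an old partition is a fast metadata operation,
    # far cheaper than DELETE FROM orders WHERE created_at < ...
    cur.execute("DROP TABLE IF EXISTS orders_2023_12")
conn.commit()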

Hotspots and Rebalancing

Some shards get more traffic than others (hotspots). To relieve them, add new shards and rebalance data by gradually moving keys from old shards to new ones. Rebalancing is complex: it requires routing-logic updates and can impact availability.
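
Consistent hashing is a common way to limit that disruption: adding a shard remaps only the keys that fall into its arc of the ring, instead of reshuffling nearly every key the way hash(key) % num_shards does. A minimal sketch (virtual nodes omitted, shard names illustrative):

import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to shards so that adding a shard moves only ~1/N of the keys."""
    def __init__(self, shard_names):
        self._ring = sorted((self._hash(name), name) for name in shard_names)

    @staticmethod
    def _hash(value):
        # Stable hash (unlike Python's built-in hash(), which varies per process).
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

    def get_shard(self, key):
        # Walk clockwise around the ring to the first shard at or after the key's hash.
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
ring.get_shard("user:42")
# Adding "shard-3" later only claims the keys in its arc; existing keys
# elsewhere on the ring keep routing to the same shards as before.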

Practical Example

# ❌ POOR - Single database bottleneck
class OrderService:
    def __init__(self, db):
        self.db = db

    def create_order(self, user_id, items):
        order = Order(user_id=user_id, items=items)
        self.db.insert('orders', order)  # Single database bottleneck
        return order

# ✅ EXCELLENT - Sharded data with routing
import random

class ShardRouter:
    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = {i: Shard(i) for i in range(num_shards)}

    def get_shard(self, key):
        """Determine which shard owns this key"""
        # Note: Python's built-in hash() is randomized per process; use a stable
        # hash (e.g. hashlib, or consistent hashing) in production routing.
        shard_id = hash(key) % self.num_shards
        return self.shards[shard_id]

    def route(self, key, operation, *args):
        """Route operation to appropriate shard"""
        shard = self.get_shard(key)
        return getattr(shard, operation)(*args)

class Shard:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.leader = Database(f"leader-{shard_id}")
        self.followers = [Database(f"follower-{shard_id}-{i}") for i in range(2)]

    def write(self, table, data):
        """Write to leader"""
        self.leader.insert(table, data)
        # Asynchronously replicate to followers
        self._replicate_to_followers(table, data)

    def read(self, table, conditions):
        """Read from a follower for load distribution"""
        follower = random.choice(self.followers)
        return follower.query(f"SELECT * FROM {table} WHERE {conditions}")

    def _replicate_to_followers(self, table, data):
        for follower in self.followers:
            async_replicate(follower, table, data)

class OrderService:
    def __init__(self, router):
        self.router = router

    def create_order(self, user_id, items):
        order = Order(user_id=user_id, items=items)
        # Route to shard based on user_id
        self.router.route(user_id, 'write', 'orders', order)
        return order

    def get_user_orders(self, user_id):
        # Route to shard, read from follower
        return self.router.route(user_id, 'read', 'orders', f"user_id = {user_id}")

    def get_all_orders(self):
        # Cross-shard query: expensive and slow
        all_orders = []
        for shard_id in range(self.router.num_shards):
            shard = self.router.shards[shard_id]
            all_orders.extend(shard.read('orders', '1=1'))
        return all_orders
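
Wiring it together (a brief usage sketch; Database, Order, and async_replicate above are stand-ins for real client libraries and replication machinery):

router = ShardRouter(num_shards=4)
orders = OrderService(router)

orders.create_order(user_id=42, items=['sku-123'])
orders.get_user_orders(42)   # single-shard read, served by a follower
orders.get_all_orders()      # scatter-gather across every shard: avoid on hot paths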

When to Use / When Not to Use

When to Use Replication
  1. Read-heavy workloads (distribute reads across replicas)
  2. High availability requirements (survive server failures)
  3. Geographic distribution (replicas in different regions)
  4. Any production system handling important data
When to Use Sharding
  1. Write-heavy workloads exceeding single database capacity
  2. Data too large to fit on one server
  3. Geographic partitioning is natural (EU data in EU, US data in US)
  4. Systems where most queries access single shard

Patterns and Pitfalls

Design Review Checklist

  • Replication strategy (leader-follower, multi-leader) is clearly documented
  • Replica lag is measured and acceptable for your consistency requirements
  • Sharding key is chosen to distribute data and load evenly
  • Routing logic correctly maps keys to shards
  • Cross-shard queries are identified and marked as expensive operations
  • Shard rebalancing strategy exists for adding/removing shards
  • Monitoring and alerting cover replica lag, shard balance, and hot shards

Self-Check

  • What's the tradeoff between replication and consistency?
  • How do you choose a sharding key?
  • What challenges arise from cross-shard queries?

One Takeaway

Replication solves availability and read scaling. Sharding solves write scaling. Choose both based on your bottlenecks, but start simple and scale only when needed.

Next Steps

  • Implement replication with heartbeat monitoring and failover
  • Design sharding key selection for your domain
  • Build shard router with consistent hashing for rebalancing
  • Create tools to monitor replica lag and shard imbalance

References

  • Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly)
  • MySQL 8.0 Reference Manual: Replication
  • Designing Sharded Database Systems