
Leader Election and Consensus

Coordinate across distributed nodes to prevent conflicts and ensure consistency

TL;DR

When multiple nodes could take the same action (write to a database, schedule a job, update configuration), letting all of them act independently causes duplicates and corruption. Leader election ensures only one node acts. Consensus algorithms like Raft guarantee one leader at a time per term, automatic election of a new leader when the current one fails, and consistent state across all replicas. Essential for database replication, distributed locking, and scheduler coordination. Trade-off: decisions require a quorum (the cluster is unavailable for writes if a majority of nodes is down) and the machinery adds complexity. Split-brain prevention: if the cluster partitions into two groups, only the group holding a quorum can make decisions.

Learning Objectives

  • Understand the split-brain problem and why quorum consensus solves it
  • Learn Raft vs. Paxos: trade-offs between understandability and provability
  • Implement basic leader election and heartbeat mechanisms
  • Configure quorum size and election timeouts appropriately
  • Recognize when leader election is necessary vs. overkill

Motivating Scenario

A distributed database has 5 replicas. The primary (leader) accepts writes and replicates them to the followers. The primary fails. Without consensus, two nodes can each claim leadership:

  1. Node A thinks it's the leader and writes changes.
  2. Node B also thinks it's the leader and writes conflicting changes.

Clients reading from different replicas now see inconsistent data (effectively data corruption). With Raft consensus, the nodes vote on the next leader: either A or B wins, but never both, because a quorum prevents split-brain. If the network partitions the cluster into groups of 3 and 2, the 3-node group elects a new leader and continues serving writes; the 2-node group cannot reach a quorum and stops accepting writes (sacrificing availability for consistency).

Core Concepts

Leader Election and Quorum

The Problem (Split-Brain): Without coordination, multiple nodes might each think they're the leader and independently write to the system. This causes data inconsistency.

The Solution (Quorum Consensus): Majority voting ensures at most one leader at a time. In a 5-node cluster, you need 3 votes (quorum = (N/2) + 1) to become leader. If the cluster splits into groups of 3 and 2, only the 3-node group can elect a leader. The 2-node group cannot (no quorum) and stops accepting writes.
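
A minimal sketch of the quorum arithmetic; the 5-node cluster and 3/2 split below are illustrative numbers, not taken from any particular system:

def quorum_size(total_nodes):
    """Smallest strict majority: floor(N/2) + 1."""
    return total_nodes // 2 + 1

def can_elect_leader(reachable_nodes, total_nodes):
    """A partition can elect a leader only if it holds a quorum."""
    return reachable_nodes >= quorum_size(total_nodes)

# 5-node cluster split 3 / 2 by a network partition
print(quorum_size(5))          # 3
print(can_elect_leader(3, 5))  # True  -> this side can elect a leader and keep writing
print(can_elect_leader(2, 5))  # False -> this side stops accepting writes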

Practical Example

import time
import random
from enum import Enum


class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class RaftNode:
    def __init__(self, node_id, peers):
        """
        node_id: unique identifier
        peers: list of other node IDs (does not include self)
        """
        self.node_id = node_id
        self.peers = peers
        self.total_nodes = len(peers) + 1

        self.state = NodeState.FOLLOWER
        self.current_term = 0
        self.voted_for = None
        self.leader_id = None

        # Randomized timeout makes simultaneous candidates unlikely
        self.election_timeout = random.uniform(1.5, 3.0)
        self.last_heartbeat = time.time()

        # Votes collected during the current election
        self.votes_received = set()

    def request_vote(self, candidate_id, term):
        """Handle a vote request from a candidate (simplified: no log-freshness check)."""
        if term > self.current_term:
            # Newer term: forget any previous vote and known leader
            self.current_term = term
            self.voted_for = None
            self.leader_id = None

        if term < self.current_term:
            return False  # Reject: candidate's term is stale

        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True

        return False  # Already voted for someone else this term

    def start_election(self):
        """Become a candidate and vote for self (peers are polled by the cluster)."""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.votes_received = {self.node_id}  # Vote for self
        self.last_heartbeat = time.time()

        print(f"Node {self.node_id}: Starting election for term {self.current_term}")

    def check_election_result(self):
        """Check whether the election has been won."""
        quorum = (self.total_nodes // 2) + 1

        if len(self.votes_received) >= quorum:
            self.state = NodeState.LEADER
            self.leader_id = self.node_id
            print(f"Node {self.node_id}: Elected leader for term {self.current_term}")
            return True

        # If the election did not succeed, revert to follower after the timeout
        if time.time() - self.last_heartbeat > self.election_timeout:
            self.state = NodeState.FOLLOWER
            self.election_timeout = random.uniform(1.5, 3.0)
            return False

        return None  # Still waiting for votes

    def receive_heartbeat(self, leader_id, term):
        """Receive a heartbeat (AppendEntries) from the leader."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None

        if term >= self.current_term:
            self.state = NodeState.FOLLOWER
            self.leader_id = leader_id
            self.last_heartbeat = time.time()
            self.election_timeout = random.uniform(1.5, 3.0)
            return True

        return False  # Reject: sender's term is stale

    def send_heartbeat(self):
        """Leader announces a heartbeat to each peer (delivery is handled by the cluster)."""
        if self.state == NodeState.LEADER:
            for peer in self.peers:
                print(f"Node {self.node_id} (leader): Heartbeat to {peer}")

    def get_status(self):
        return {
            "node_id": self.node_id,
            "state": self.state.value,
            "term": self.current_term,
            "leader": self.leader_id,
            "voted_for": self.voted_for,
        }


# Example: 5-node cluster election
class RaftCluster:
    def __init__(self, node_ids):
        self.nodes = {}
        for node_id in node_ids:
            peers = [n for n in node_ids if n != node_id]
            self.nodes[node_id] = RaftNode(node_id, peers)

    def simulate_election(self, steps=10):
        """Simulate the election process."""
        for step in range(steps):
            print(f"\n--- Step {step} ---")

            for node_id, node in self.nodes.items():
                # Follower whose election timeout expired becomes a candidate
                if node.state == NodeState.FOLLOWER:
                    if time.time() - node.last_heartbeat > node.election_timeout:
                        node.start_election()
                        # Ask each peer node for its vote
                        for peer_id in node.peers:
                            peer = self.nodes[peer_id]
                            if peer.request_vote(node.node_id, node.current_term):
                                node.votes_received.add(peer_id)

                # Candidate checks whether it has reached a quorum
                if node.state == NodeState.CANDIDATE:
                    node.check_election_result()

                # Leader sends heartbeats, which suppress further elections
                if node.state == NodeState.LEADER:
                    node.send_heartbeat()
                    for peer_id in node.peers:
                        self.nodes[peer_id].receive_heartbeat(node.node_id, node.current_term)

            # Print cluster state
            for node_id, node in self.nodes.items():
                print(f"  {node_id}: {node.get_status()}")

            time.sleep(0.5)

            # Stop once exactly one leader has been elected
            leaders = [n for n in self.nodes.values()
                       if n.state == NodeState.LEADER]
            if len(leaders) == 1:
                print(f"\nElection complete: Node {leaders[0].node_id} is leader")
                break


# Run the simulation
cluster = RaftCluster(['A', 'B', 'C', 'D', 'E'])
cluster.simulate_election()
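
Because each node draws its election timeout at random from 1.5-3.0 seconds, one follower almost always times out before the others, collects votes from a quorum (3 of 5), and its heartbeats then reset every other node's timer. This randomization is the same mechanism real Raft uses to make split votes (two candidates starting elections at nearly the same time) unlikely.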

When to Use vs. When NOT to Use

Use Leader Election (Raft/Consensus)
  1. Database replication (one primary, multiple replicas)
  2. Distributed configuration management
  3. Scheduler coordination (exactly one node schedules jobs)
  4. Distributed locking (only one holder at a time; see the lease-based sketch after this list)
  5. Any system where split-brain prevention is required
Avoid Leader Election (Raft/Consensus)
  1. Stateless services where any node can serve any request
  2. Workloads whose operations are idempotent, so occasional duplicates are harmless
  3. Single-node deployments, or systems that already rely on an external coordinator
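
When the goal is single-holder semantics (a lock, a singleton scheduler) rather than replicating your own state machine, the usual pattern is to lease a key in an existing Raft-backed store; etcd leases, Consul sessions, and ZooKeeper ephemeral nodes all support this. The sketch below shows the shape of that pattern against a hypothetical kv client: acquire_lease, put_if_absent, and refresh are illustrative names, not a real library API.

import time

LEADER_KEY = "service/scheduler/leader"  # hypothetical key
LEASE_TTL = 10                           # seconds; the lease expires if not refreshed

def run_candidate(kv, node_id):
    """Compete for leadership via a leased key in a Raft-backed store (hypothetical client)."""
    lease = kv.acquire_lease(ttl=LEASE_TTL)                     # hypothetical API
    while True:
        if kv.put_if_absent(LEADER_KEY, node_id, lease=lease):  # hypothetical API
            print(f"{node_id}: acquired leadership")
            # Lead only while the lease can still be refreshed
            while kv.refresh(lease):                            # hypothetical API
                do_leader_work()
                time.sleep(LEASE_TTL / 3)
            print(f"{node_id}: lost the lease, stepping down")
        time.sleep(LEASE_TTL / 2)  # standby: retry after the current lease may have expired

def do_leader_work():
    pass  # e.g. schedule jobs, act as the single writer

The store's own quorum provides the split-brain protection: a leader cut off from the store's majority cannot refresh its lease, so it must stop acting as leader before another node can acquire the key.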

Patterns and Pitfalls

  • Quorum sizing: a 5-node cluster needs 3 votes; a 7-node cluster needs 4. At most one partition can hold a majority, so only the majority partition can ever elect a leader.
  • Heartbeats and timeouts: the leader sends heartbeats frequently (every 50-100ms is typical), while a follower waits a much longer randomized election timeout (1.5-3s in the example above) without hearing one before starting an election. The wide gap prevents unnecessary elections.
  • Cluster size: 3 nodes is the minimum for fault tolerance (tolerates 1 failure). A 2-node cluster cannot reach quorum if either node fails. Use odd sizes: 3, 5, 7.
  • Partitioned leader: a leader isolated from the majority cannot commit writes, and its clients get "not leader" errors; the majority side elects a new leader and continues. If no partition holds a quorum, the system is unavailable for writes until the partition heals.
  • Log replication: the leader replicates each write to a majority of followers before committing it, so any quorum contains every committed entry and the new leader continues from the last committed state. This prevents loss of acknowledged writes (see the sketch after this list).
  • Don't roll your own: Raft and Paxos are hard to implement correctly. Use existing systems (etcd, Consul, ZooKeeper) or battle-tested libraries; don't implement consensus from scratch unless you fully understand the algorithm.
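
A minimal sketch of that commit rule, tracking acknowledgements instead of doing real replication; the LeaderLog class and its bookkeeping are illustrative, not part of any library:

def quorum(total_nodes):
    return total_nodes // 2 + 1

class LeaderLog:
    """Track which nodes hold each entry; commit once a majority has it."""

    def __init__(self, total_nodes):
        self.total_nodes = total_nodes
        self.entries = []        # list of (entry, set of node IDs that store it)
        self.commit_index = -1   # index of the highest committed entry

    def append(self, entry, leader_id):
        # The leader stores the entry locally first, then replicates it
        self.entries.append((entry, {leader_id}))
        return len(self.entries) - 1

    def record_ack(self, index, follower_id):
        _entry, holders = self.entries[index]
        holders.add(follower_id)
        # Committed once a majority of the cluster (leader included) stores the entry
        if len(holders) >= quorum(self.total_nodes) and index > self.commit_index:
            self.commit_index = index

log = LeaderLog(total_nodes=5)
i = log.append("SET x=1", leader_id="A")
log.record_ack(i, "B")
print(log.commit_index)  # -1: only 2 of 5 nodes store the entry, not yet committed
log.record_ack(i, "C")
print(log.commit_index)  # 0: a quorum (3 of 5) stores it, so it is committed

Real Raft is stricter still: a leader only commits entries from its own term by counting replicas, and earlier-term entries become committed indirectly once a current-term entry is committed.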

Design Review Checklist

  • Quorum size correctly calculated: (N/2) + 1 where N = total nodes
  • Odd number of nodes (3, 5, 7) for proper majority calculation
  • Election timeout is longer than heartbeat interval (prevent thrashing)
  • Leader heartbeat interval configured (50-100ms typical)
  • Log replication ensures data durability across replicas
  • Split-brain prevention tested (partition scenarios)
  • Leader failure detected within election timeout
  • Followers reject writes (only leader accepts writes)
  • Monitoring for leader election frequency (high = instability)
  • Recovery process defined for network partitions
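
A small sanity-check sketch for the first few checklist items; the thresholds (odd cluster size, timeout well above the heartbeat interval) are assumptions chosen to match this article's numbers, not universal constants:

def validate_cluster_config(total_nodes, heartbeat_interval_ms, election_timeout_min_ms):
    """Return a list of configuration warnings; an empty list means the basics look sane."""
    warnings = []
    if total_nodes < 3:
        warnings.append("Fewer than 3 nodes cannot tolerate any failure")
    if total_nodes % 2 == 0:
        warnings.append("Even cluster size: prefer 3, 5, or 7 nodes for a clean majority")
    # Rule of thumb used here: election timeout at least 10x the heartbeat interval
    if election_timeout_min_ms < 10 * heartbeat_interval_ms:
        warnings.append("Election timeout should be much larger than the heartbeat interval")
    return warnings

print(validate_cluster_config(total_nodes=5, heartbeat_interval_ms=50,
                              election_timeout_min_ms=1500))   # []
print(validate_cluster_config(total_nodes=4, heartbeat_interval_ms=100,
                              election_timeout_min_ms=500))    # two warnings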

Self-Check

  • Why must a candidate collect votes from a quorum before acting as leader, rather than simply declaring itself leader?
  • What happens when the leader crashes?
  • How does split-brain occur without quorum?
  • What happens if network partitions the cluster 3-2?
  • When is a write considered committed?

Next Steps

  1. Distributed Locking: Use Raft-based systems (etcd, Consul) for locks and coordination
  2. Data Consistency: Read about eventual consistency vs. strong consistency trade-offs
  3. Observability: Monitor leader elections and term changes to detect instability
