Leader Election and Consensus
Coordinate across distributed nodes to prevent conflicts and ensure consistency
TL;DR
When multiple nodes could take the same action (write to a database, schedule a job, update configuration), all of them executing independently causes duplicates and corruption. Leader election ensures only one node acts. Consensus algorithms like Raft guarantee: at most one leader per term, automatic election of a new leader when the current one fails, and consistent state across all replicas. Essential for database replication, distributed locking, and scheduler coordination. Trade-off: every decision requires a quorum (the cluster is unavailable if a majority of nodes is down) and the machinery adds complexity. Split-brain prevention: if the cluster partitions into two groups, only the group holding a quorum can make decisions.
Learning Objectives
- Understand the split-brain problem and why quorum consensus solves it
- Learn Raft vs. Paxos: trade-offs between understandability and provability
- Implement basic leader election and heartbeat mechanisms
- Configure quorum size and election timeouts appropriately
- Recognize when leader election is necessary vs. overkill
Motivating Scenario
A distributed database has 5 replicas: one primary (leader) accepts writes and replicates them to the followers. The primary fails. Without consensus, both of the following can happen at once:
- Node A thinks it's the leader and writes changes.
- Node B thinks it's the leader and writes conflicting changes.
Clients reading from different replicas now see inconsistent data (data corruption). With Raft consensus, the nodes vote on the next leader: either A or B wins, but never both, because the quorum requirement prevents split-brain. If the network partitions the cluster (2 nodes isolated from 3), the 3-node group elects a new leader and continues; the 2-node group cannot reach quorum and stops accepting writes (availability is sacrificed for consistency).
Core Concepts
The Problem (Split-Brain): Without coordination, multiple nodes might each think they're the leader and independently write to the system. This causes data inconsistency.
The Solution (Quorum Consensus): Majority voting ensures at most one leader per term. In a 5-node cluster, a candidate needs 3 votes (quorum = ⌊N/2⌋ + 1) to become leader. If the cluster splits into groups of 3 and 2, only the 3-node group can elect a leader; the 2-node group lacks quorum and stops accepting writes.
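A minimal sketch of the quorum arithmetic in Python (the function names here are illustrative, not from any library):

def quorum_size(total_nodes: int) -> int:
    """Quorum is a strict majority: floor(N/2) + 1."""
    return (total_nodes // 2) + 1

def can_elect_leader(group_size: int, total_nodes: int) -> bool:
    """A partition can elect a leader only if it still holds a quorum."""
    return group_size >= quorum_size(total_nodes)

# 5-node cluster partitioned into groups of 3 and 2:
assert quorum_size(5) == 3
assert can_elect_leader(3, total_nodes=5)      # majority side keeps operating
assert not can_elect_leader(2, total_nodes=5)  # minority side stops accepting writes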
Practical Example
- Python
- Go
- Node.js
import time
import random
from enum import Enum
from collections import defaultdict
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
class RaftNode:
def __init__(self, node_id, peers):
"""
node_id: unique identifier
peers: list of other node IDs (doesn't include self)
"""
self.node_id = node_id
self.peers = peers
self.total_nodes = len(peers) + 1
self.state = NodeState.FOLLOWER
self.current_term = 0
self.voted_for = None
self.leader_id = None
self.election_timeout = random.uniform(1.5, 3.0)
self.last_heartbeat = time.time()
# Voting results for election
self.votes_received = set()
def request_vote(self, candidate_id, term):
"""Handle vote request from candidate"""
if term > self.current_term:
self.current_term = term
self.voted_for = None
self.leader_id = None
if term < self.current_term:
return False # Reject: candidate's term is old
if self.voted_for is None or self.voted_for == candidate_id:
self.voted_for = candidate_id
return True
return False # Already voted for someone else
def start_election(self):
"""Become candidate and request votes"""
self.state = NodeState.CANDIDATE
self.current_term += 1
self.voted_for = self.node_id
self.votes_received = {self.node_id} # Vote for self
self.last_heartbeat = time.time()
print(f"Node {self.node_id}: Starting election for term {self.current_term}")
        # Real Raft sends a RequestVote RPC to each peer; this simulation
        # stands in for peers that grant their votes.
for peer in self.peers:
if self.request_vote(self.node_id, self.current_term):
self.votes_received.add(peer)
def check_election_result(self):
"""Check if election won"""
quorum = (self.total_nodes // 2) + 1
if len(self.votes_received) >= quorum:
self.state = NodeState.LEADER
self.leader_id = self.node_id
print(f"Node {self.node_id}: Elected leader for term {self.current_term}")
return True
# If didn't win, revert to follower after timeout
if time.time() - self.last_heartbeat > self.election_timeout:
self.state = NodeState.FOLLOWER
self.election_timeout = random.uniform(1.5, 3.0)
return False
return None # Still waiting
def receive_heartbeat(self, leader_id, term):
"""Receive heartbeat (AppendEntries) from leader"""
if term > self.current_term:
self.current_term = term
self.voted_for = None
if term >= self.current_term:
self.state = NodeState.FOLLOWER
self.leader_id = leader_id
self.last_heartbeat = time.time()
self.election_timeout = random.uniform(1.5, 3.0)
return True
return False # Reject: leader's term is old
def send_heartbeat(self):
"""Leader sends heartbeat to all peers"""
if self.state == NodeState.LEADER:
for peer in self.peers:
print(f"Node {self.node_id} (leader): Heartbeat to {peer}")
def get_status(self):
return {
"node_id": self.node_id,
"state": self.state.value,
"term": self.current_term,
"leader": self.leader_id,
"voted_for": self.voted_for
}
# Example: 5-node cluster election
class RaftCluster:
def __init__(self, node_ids):
self.nodes = {}
for node_id in node_ids:
peers = [n for n in node_ids if n != node_id]
self.nodes[node_id] = RaftNode(node_id, peers)
def simulate_election(self, steps=10):
"""Simulate election process"""
for step in range(steps):
print(f"\n--- Step {step} ---")
for node_id, node in self.nodes.items():
# Check election timeout
if node.state == NodeState.FOLLOWER:
if time.time() - node.last_heartbeat > node.election_timeout:
node.start_election()
# Check if won election
if node.state == NodeState.CANDIDATE:
node.check_election_result()
# Send heartbeats if leader
if node.state == NodeState.LEADER:
node.send_heartbeat()
# Print state
for node_id, node in self.nodes.items():
print(f" {node_id}: {node.get_status()}")
time.sleep(0.5)
# Stop if leader elected
leaders = [n for n in self.nodes.values()
if n.state == NodeState.LEADER]
if len(leaders) == 1:
print(f"\nElection complete: Node {leaders[0].node_id} is leader")
break
# Run simulation
cluster = RaftCluster(['A', 'B', 'C', 'D', 'E'])
cluster.simulate_election()
package main
import (
"fmt"
"math/rand"
"time"
)
type NodeState string
const (
Follower NodeState = "follower"
Candidate NodeState = "candidate"
Leader NodeState = "leader"
)
type RaftNode struct {
nodeID string
peers []string
totalNodes int
state NodeState
currentTerm int64
votedFor *string
leaderID *string
electionTime time.Time
heartbeatTime time.Time
votesReceived map[string]bool
}
func NewRaftNode(nodeID string, peers []string) *RaftNode {
return &RaftNode{
nodeID: nodeID,
peers: peers,
totalNodes: len(peers) + 1,
state: Follower,
currentTerm: 0,
votedFor: nil,
leaderID: nil,
electionTime: time.Now(),
heartbeatTime: time.Now(),
votesReceived: make(map[string]bool),
}
}
func (rn *RaftNode) RequestVote(candidateID string, term int64) bool {
if term > rn.currentTerm {
rn.currentTerm = term
rn.votedFor = nil
}
if term < rn.currentTerm {
return false
}
if rn.votedFor == nil || *rn.votedFor == candidateID {
rn.votedFor = &candidateID
return true
}
return false
}
func (rn *RaftNode) StartElection() {
rn.state = Candidate
rn.currentTerm++
votedFor := rn.nodeID
rn.votedFor = &votedFor
rn.votesReceived = map[string]bool{rn.nodeID: true}
rn.electionTime = time.Now()
fmt.Printf("Node %s: Starting election for term %d\n", rn.nodeID, rn.currentTerm)
	// Real Raft sends a RequestVote RPC to each peer; this simulation
	// stands in for peers that grant their votes.
	for _, peer := range rn.peers {
if rn.RequestVote(rn.nodeID, rn.currentTerm) {
rn.votesReceived[peer] = true
}
}
}
func (rn *RaftNode) CheckElectionResult() bool {
quorum := (rn.totalNodes / 2) + 1
if len(rn.votesReceived) >= quorum {
rn.state = Leader
leaderID := rn.nodeID
rn.leaderID = &leaderID
fmt.Printf("Node %s: Elected leader for term %d\n", rn.nodeID, rn.currentTerm)
return true
}
return false
}
func (rn *RaftNode) GetStatus() map[string]interface{} {
var votedFor, leaderID string
if rn.votedFor != nil {
votedFor = *rn.votedFor
}
if rn.leaderID != nil {
leaderID = *rn.leaderID
}
return map[string]interface{}{
"node_id": rn.nodeID,
"state": rn.state,
"term": rn.currentTerm,
"leader": leaderID,
"voted_for": votedFor,
}
}
type RaftCluster struct {
nodes map[string]*RaftNode
}
func NewRaftCluster(nodeIDs []string) *RaftCluster {
cluster := &RaftCluster{
nodes: make(map[string]*RaftNode),
}
for _, nodeID := range nodeIDs {
var peers []string
for _, id := range nodeIDs {
if id != nodeID {
peers = append(peers, id)
}
}
cluster.nodes[nodeID] = NewRaftNode(nodeID, peers)
}
return cluster
}
func (rc *RaftCluster) PrintStates() {
for _, nodeID := range []string{"A", "B", "C", "D", "E"} {
if node, ok := rc.nodes[nodeID]; ok {
fmt.Printf(" %s: %v\n", nodeID, node.GetStatus())
}
}
}
func main() {
rand.Seed(time.Now().UnixNano())
cluster := NewRaftCluster([]string{"A", "B", "C", "D", "E"})
cluster.nodes["A"].StartElection()
time.Sleep(1 * time.Second)
cluster.nodes["A"].CheckElectionResult()
fmt.Println("\nCluster state:")
cluster.PrintStates()
}
const States = {
FOLLOWER: 'follower',
CANDIDATE: 'candidate',
LEADER: 'leader'
};
class RaftNode {
constructor(nodeId, peers) {
this.nodeId = nodeId;
this.peers = peers; // other node IDs
this.totalNodes = peers.length + 1;
this.state = States.FOLLOWER;
this.currentTerm = 0;
this.votedFor = null;
this.leaderId = null;
this.electionTimeout = Math.random() * 1500 + 1500; // 1.5-3s
this.lastHeartbeat = Date.now();
this.votesReceived = new Set();
}
requestVote(candidateId, term) {
if (term > this.currentTerm) {
this.currentTerm = term;
this.votedFor = null;
this.leaderId = null;
}
if (term < this.currentTerm) {
return false;
}
if (this.votedFor === null || this.votedFor === candidateId) {
this.votedFor = candidateId;
return true;
}
return false;
}
startElection() {
this.state = States.CANDIDATE;
this.currentTerm++;
this.votedFor = this.nodeId;
this.votesReceived = new Set([this.nodeId]);
this.lastHeartbeat = Date.now();
console.log(`Node ${this.nodeId}: Starting election for term ${this.currentTerm}`);
    // Real Raft sends a RequestVote RPC to each peer; here peers
    // simply grant their votes to keep the simulation small.
for (const peer of this.peers) {
if (this.requestVote(this.nodeId, this.currentTerm)) {
this.votesReceived.add(peer);
}
}
}
checkElectionResult() {
const quorum = Math.floor(this.totalNodes / 2) + 1;
if (this.votesReceived.size >= quorum) {
this.state = States.LEADER;
this.leaderId = this.nodeId;
console.log(`Node ${this.nodeId}: Elected leader for term ${this.currentTerm}`);
return true;
}
return false;
}
getStatus() {
return {
node_id: this.nodeId,
state: this.state,
term: this.currentTerm,
leader: this.leaderId,
voted_for: this.votedFor,
votes: this.votesReceived.size
};
}
}
class RaftCluster {
constructor(nodeIds) {
this.nodes = new Map();
for (const nodeId of nodeIds) {
const peers = nodeIds.filter(id => id !== nodeId);
this.nodes.set(nodeId, new RaftNode(nodeId, peers));
}
}
simulateElection() {
console.log('Starting election...\n');
// Node A initiates election
const nodeA = this.nodes.get('A');
nodeA.startElection();
// Check result
setTimeout(() => {
nodeA.checkElectionResult();
console.log('\nFinal state:');
for (const [nodeId, node] of this.nodes) {
console.log(` ${nodeId}: ${JSON.stringify(node.getStatus())}`);
}
}, 100);
}
}
// Run simulation
const cluster = new RaftCluster(['A', 'B', 'C', 'D', 'E']);
cluster.simulateElection();
When to Use vs. When NOT to Use
Use leader election when exactly one actor must be in charge:
- Database replication (one primary, multiple replicas)
- Distributed configuration management
- Scheduler coordination (one node schedules jobs)
- Distributed locking (only one holder at a time)
- Split-brain prevention required
Avoid it when the coordination cost outweighs the benefit:
- Stateless services where any replica can handle any request
- Idempotent operations, where duplicate execution is harmless
- Single-node deployments, or cases where an existing coordination service (etcd, ZooKeeper, Consul) or a database constraint can arbitrate for you
Patterns and Pitfalls
Common pitfalls: even-sized clusters (4 nodes tolerate no more failures than 3, yet cost more), election timeouts set too close to the heartbeat interval (constant re-elections under network jitter), and spreading a quorum across only two failure domains (losing the larger side destroys the quorum).
Design Review Checklist
- Quorum size correctly calculated: (N/2) + 1 where N = total nodes
- Odd number of nodes (3, 5, 7) for proper majority calculation
- Election timeout is comfortably longer than the heartbeat interval to prevent election thrashing (see the timing sketch after this checklist)
- Leader heartbeat interval configured (50-100ms typical)
- Log replication ensures durability (an entry is committed once a quorum of replicas has persisted it)
- Split-brain prevention tested (partition scenarios)
- Leader failure detected within election timeout
- Followers reject writes (only leader accepts writes)
- Monitoring for leader election frequency (high = instability)
- Recovery process defined for network partitions
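The timeout relationship above is worth checking mechanically. A minimal sketch in Python, assuming a rule-of-thumb ratio of roughly 10:1 between election timeout and heartbeat interval (the functions and threshold are illustrative, not from a specific library):

import random

def validate_timing(heartbeat_ms, election_min_ms, election_max_ms):
    """Reject configurations that invite election thrashing."""
    # A few lost heartbeats should not trigger a spurious election,
    # hence the ~10x cushion between heartbeat and election timeout.
    if election_min_ms < 10 * heartbeat_ms:
        raise ValueError("election timeout too close to heartbeat interval")
    if election_min_ms >= election_max_ms:
        raise ValueError("need a non-empty range to randomize timeouts")

def pick_election_timeout(election_min_ms, election_max_ms):
    """Randomize per node so two candidates rarely tie (split votes)."""
    return random.uniform(election_min_ms, election_max_ms)

validate_timing(heartbeat_ms=100, election_min_ms=1000, election_max_ms=2000)

The 100 ms / 1000 ms pairing mirrors etcd's default heartbeat interval and election timeout.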
Self-Check
- Why does electing a leader require a quorum (a strict majority) rather than a simple plurality of votes?
- What happens when the leader crashes?
- How does split-brain occur without quorum?
- What happens if network partitions the cluster 3-2?
- When is a write considered committed?
Next Steps
- Distributed Locking: Use Raft-based systems (etcd, Consul) for locks and coordination (see the sketch after this list)
- Data Consistency: Read about eventual consistency vs. strong consistency trade-offs
- Observability: Monitor leader elections and term changes to detect instability
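Rather than implementing Raft yourself, you can lean on a Raft-backed store for locking. A sketch under stated assumptions: the third-party python-etcd3 client, an etcd endpoint on localhost:2379, and a hypothetical run_scheduled_jobs() application function.

import etcd3  # third-party package: pip install etcd3

client = etcd3.client(host="localhost", port=2379)

# The lock is a leased key inside etcd; etcd's Raft quorum guarantees
# at most one holder, and the TTL releases the lock if the holder dies.
with client.lock("scheduler-leader", ttl=10):
    run_scheduled_jobs()  # hypothetical: only the lock holder runs this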
References
- Ongaro, D., & Ousterhout, J. (2014). In Search of an Understandable Consensus Algorithm (Raft) ↗️. USENIX ATC.
- Burrows, M. (2006). The Chubby Lock Service for Loosely-Coupled Distributed Systems ↗️. OSDI.
- etcd Documentation. etcd: Distributed Configuration and Service Discovery ↗️