Leader Election and Consensus
Coordinate across distributed nodes to prevent conflicts and ensure consistency
TL;DR
When multiple nodes could take the same action (write to a database, schedule a job, update configuration), all of them executing independently causes duplicates and corruption. Leader election ensures only one node acts. Consensus algorithms like Raft guarantee: at most one leader per term, automatic election of a new leader when the current one fails, and consistent state across all replicas. Essential for database replication, distributed locking, and scheduler coordination. Trade-off: every decision requires a quorum (the cluster is unavailable if a majority of nodes is down) and the machinery adds complexity. Split-brain prevention: if the cluster partitions into two groups, only the group holding a quorum can make decisions.
Learning Objectives
- Understand the split-brain problem and why quorum consensus solves it
- Learn Raft vs. Paxos: trade-offs between understandability and provability
- Implement basic leader election and heartbeat mechanisms
- Configure quorum size and election timeouts appropriately
- Recognize when leader election is necessary vs. overkill
Motivating Scenario
A distributed database has 5 replicas: one primary (leader) accepts writes and replicates them to the followers. The primary fails. Without consensus, both of the following can happen at once:
- Node A thinks it's the leader and writes changes.
- Node B thinks it's the leader and writes conflicting changes.
Clients reading from different replicas now see inconsistent data (data corruption). With Raft consensus, the nodes vote on the next leader: either A or B wins, but never both, because the quorum requirement prevents split-brain. If the network partitions the cluster (2 nodes isolated from 3), the 3-node group elects a new leader and continues; the 2-node group cannot reach quorum and stops accepting writes (availability is sacrificed for consistency).
Core Concepts
The Problem (Split-Brain): Without coordination, multiple nodes might each think they're the leader and independently write to the system. This causes data inconsistency.
The Solution (Quorum Consensus): Majority voting ensures at most one leader per term. In a 5-node cluster, a candidate needs 3 votes (quorum = ⌊N/2⌋ + 1) to become leader. If the cluster splits into groups of 3 and 2, only the 3-node group can elect a leader; the 2-node group lacks quorum and stops accepting writes.
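A minimal sketch of the quorum arithmetic in Python (the function names here are illustrative, not from any library):

def quorum_size(total_nodes: int) -> int:
    """Quorum is a strict majority: floor(N/2) + 1."""
    return (total_nodes // 2) + 1

def can_elect_leader(group_size: int, total_nodes: int) -> bool:
    """A partition can elect a leader only if it still holds a quorum."""
    return group_size >= quorum_size(total_nodes)

# 5-node cluster partitioned into groups of 3 and 2:
assert quorum_size(5) == 3
assert can_elect_leader(3, total_nodes=5)      # majority side keeps operating
assert not can_elect_leader(2, total_nodes=5)  # minority side stops accepting writes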
Practical Example
- Python
- Go
- Node.js
import time
import random
from enum import Enum
from collections import defaultdict
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
class RaftNode:
def __init__(self, node_id, peers):
"""
node_id: unique identifier
peers: list of other node IDs (doesn't include self)
"""
self.node_id = node_id
self.peers = peers
self.total_nodes = len(peers) + 1
self.state = NodeState.FOLLOWER
self.current_term = 0
self.voted_for = None
self.leader_id = None
self.election_timeout = random.uniform(1.5, 3.0)
self.last_heartbeat = time.time()
# Voting results for election
self.votes_received = set()
def request_vote(self, candidate_id, term):
"""Handle vote request from candidate"""
if term > self.current_term:
self.current_term = term
self.voted_for = None
self.leader_id = None
if term < self.current_term:
return False # Reject: candidate's term is old
if self.voted_for is None or self.voted_for == candidate_id:
self.voted_for = candidate_id
return True
return False # Already voted for someone else
def start_election(self):
"""Become candidate and request votes"""
self.state = NodeState.CANDIDATE
self.current_term += 1
self.voted_for = self.node_id
self.votes_received = {self.node_id} # Vote for self
self.last_heartbeat = time.time()
print(f"Node {self.node_id}: Starting election for term {self.current_term}")
        # Real Raft sends a RequestVote RPC to each peer; this simulation
        # stands in for peers that grant their votes.
for peer in self.peers:
if self.request_vote(self.node_id, self.current_term):
self.votes_received.add(peer)
def check_election_result(self):
"""Check if election won"""
quorum = (self.total_nodes // 2) + 1
if len(self.votes_received) >= quorum:
self.state = NodeState.LEADER
self.leader_id = self.node_id
print(f"Node {self.node_id}: Elected leader for term {self.current_term}")
return True
# If didn't win, revert to follower after timeout
if time.time() - self.last_heartbeat > self.election_timeout:
self.state = NodeState.FOLLOWER
self.election_timeout = random.uniform(1.5, 3.0)
return False
return None # Still waiting
def receive_heartbeat(self, leader_id, term):
"""Receive heartbeat (AppendEntries) from leader"""
if term > self.current_term:
self.current_term = term
self.voted_for = None
if term >= self.current_term:
self.state = NodeState.FOLLOWER
self.leader_id = leader_id
self.last_heartbeat = time.time()
self.election_timeout = random.uniform(1.5, 3.0)
return True
return False # Reject: leader's term is old
def send_heartbeat(self):
"""Leader sends heartbeat to all peers"""
if self.state == NodeState.LEADER:
for peer in self.peers:
print(f"Node {self.node_id} (leader): Heartbeat to {peer}")
def get_status(self):
return {
"node_id": self.node_id,
"state": self.state.value,
"term": self.current_term,
"leader": self.leader_id,
"voted_for": self.voted_for
}
# Example: 5-node cluster election
class RaftCluster:
def __init__(self, node_ids):
self.nodes = {}
for node_id in node_ids:
peers = [n for n in node_ids if n != node_id]
self.nodes[node_id] = RaftNode(node_id, peers)
def simulate_election(self, steps=10):
"""Simulate election process"""
for step in range(steps):
print(f"\n--- Step {step} ---")
for node_id, node in self.nodes.items():
# Check election timeout
if node.state == NodeState.FOLLOWER:
if time.time() - node.last_heartbeat > node.election_timeout:
node.start_election()
# Check if won election
if node.state == NodeState.CANDIDATE:
node.check_election_result()
# Send heartbeats if leader
if node.state == NodeState.LEADER:
node.send_heartbeat()
# Print state
for node_id, node in self.nodes.items():
print(f" {node_id}: {node.get_status()}")
time.sleep(0.5)
# Stop if leader elected
leaders = [n for n in self.nodes.values()
if n.state == NodeState.LEADER]
if len(leaders) == 1:
print(f"\nElection complete: Node {leaders[0].node_id} is leader")
break
# Run simulation
cluster = RaftCluster(['A', 'B', 'C', 'D', 'E'])
cluster.simulate_election()
package main
import (
"fmt"
"math/rand"
"time"
)
type NodeState string
const (
Follower NodeState = "follower"
Candidate NodeState = "candidate"
Leader NodeState = "leader"
)
type RaftNode struct {
nodeID string
peers []string
totalNodes int
state NodeState
currentTerm int64
votedFor *string
leaderID *string
electionTime time.Time
heartbeatTime time.Time
votesReceived map[string]bool
}
func NewRaftNode(nodeID string, peers []string) *RaftNode {
return &RaftNode{
nodeID: nodeID,
peers: peers,
totalNodes: len(peers) + 1,
state: Follower,
currentTerm: 0,
votedFor: nil,
leaderID: nil,
electionTime: time.Now(),
heartbeatTime: time.Now(),
votesReceived: make(map[string]bool),
}
}
func (rn *RaftNode) RequestVote(candidateID string, term int64) bool {
if term > rn.currentTerm {
rn.currentTerm = term
rn.votedFor = nil
}
if term < rn.currentTerm {
return false
}
if rn.votedFor == nil || *rn.votedFor == candidateID {
rn.votedFor = &candidateID
return true
}
return false
}
func (rn *RaftNode) StartElection() {
rn.state = Candidate
rn.currentTerm++
votedFor := rn.nodeID
rn.votedFor = &votedFor
rn.votesReceived = map[string]bool{rn.nodeID: true}
rn.electionTime = time.Now()
fmt.Printf("Node %s: Starting election for term %d\n", rn.nodeID, rn.currentTerm)
	// Real Raft sends a RequestVote RPC to each peer; this simulation
	// stands in for peers that grant their votes.
	for _, peer := range rn.peers {
if rn.RequestVote(rn.nodeID, rn.currentTerm) {
rn.votesReceived[peer] = true
}
}
}
func (rn *RaftNode) CheckElectionResult() bool {
quorum := (rn.totalNodes / 2) + 1
if len(rn.votesReceived) >= quorum {
rn.state = Leader
leaderID := rn.nodeID
rn.leaderID = &leaderID
fmt.Printf("Node %s: Elected leader for term %d\n", rn.nodeID, rn.currentTerm)
return true
}
return false
}
func (rn *RaftNode) GetStatus() map[string]interface{} {
var votedFor, leaderID string
if rn.votedFor != nil {
votedFor = *rn.votedFor
}
if rn.leaderID != nil {
leaderID = *rn.leaderID
}
return map[string]interface{}{
"node_id": rn.nodeID,
"state": rn.state,
"term": rn.currentTerm,
"leader": leaderID,
"voted_for": votedFor,
}
}
type RaftCluster struct {
nodes map[string]*RaftNode
}
func NewRaftCluster(nodeIDs []string) *RaftCluster {
cluster := &RaftCluster{
nodes: make(map[string]*RaftNode),
}
for _, nodeID := range nodeIDs {
var peers []string
for _, id := range nodeIDs {
if id != nodeID {
peers = append(peers, id)
}
}
cluster.nodes[nodeID] = NewRaftNode(nodeID, peers)
}
return cluster
}
func (rc *RaftCluster) PrintStates() {
for _, nodeID := range []string{"A", "B", "C", "D", "E"} {
if node, ok := rc.nodes[nodeID]; ok {
fmt.Printf(" %s: %v\n", nodeID, node.GetStatus())
}
}
}
func main() {
rand.Seed(time.Now().UnixNano())
cluster := NewRaftCluster([]string{"A", "B", "C", "D", "E"})
cluster.nodes["A"].StartElection()
time.Sleep(1 * time.Second)
cluster.nodes["A"].CheckElectionResult()
fmt.Println("\nCluster state:")
cluster.PrintStates()
}
const States = {
FOLLOWER: 'follower',
CANDIDATE: 'candidate',
LEADER: 'leader'
};
class RaftNode {
constructor(nodeId, peers) {
this.nodeId = nodeId;
this.peers = peers; // other node IDs
this.totalNodes = peers.length + 1;
this.state = States.FOLLOWER;
this.currentTerm = 0;
this.votedFor = null;
this.leaderId = null;
this.electionTimeout = Math.random() * 1500 + 1500; // 1.5-3s
this.lastHeartbeat = Date.now();
this.votesReceived = new Set();
}
requestVote(candidateId, term) {
if (term > this.currentTerm) {
this.currentTerm = term;
this.votedFor = null;
this.leaderId = null;
}
if (term < this.currentTerm) {
return false;
}
if (this.votedFor === null || this.votedFor === candidateId) {
this.votedFor = candidateId;
return true;
}
return false;
}
startElection() {
this.state = States.CANDIDATE;
this.currentTerm++;
this.votedFor = this.nodeId;
this.votesReceived = new Set([this.nodeId]);
this.lastHeartbeat = Date.now();
console.log(`Node ${this.nodeId}: Starting election for term ${this.currentTerm}`);
    // Real Raft sends a RequestVote RPC to each peer; here peers
    // simply grant their votes to keep the simulation small.
for (const peer of this.peers) {
if (this.requestVote(this.nodeId, this.currentTerm)) {
this.votesReceived.add(peer);
}
}
}
checkElectionResult() {
const quorum = Math.floor(this.totalNodes / 2) + 1;
if (this.votesReceived.size >= quorum) {
this.state = States.LEADER;
this.leaderId = this.nodeId;
console.log(`Node ${this.nodeId}: Elected leader for term ${this.currentTerm}`);
return true;
}
return false;
}
getStatus() {
return {
node_id: this.nodeId,
state: this.state,
term: this.currentTerm,
leader: this.leaderId,
voted_for: this.votedFor,
votes: this.votesReceived.size
};
}
}
class RaftCluster {
constructor(nodeIds) {
this.nodes = new Map();
for (const nodeId of nodeIds) {
const peers = nodeIds.filter(id => id !== nodeId);
this.nodes.set(nodeId, new RaftNode(nodeId, peers));
}
}
simulateElection() {
console.log('Starting election...\n');
// Node A initiates election
const nodeA = this.nodes.get('A');
nodeA.startElection();
// Check result
setTimeout(() => {
nodeA.checkElectionResult();
console.log('\nFinal state:');
for (const [nodeId, node] of this.nodes) {
console.log(` ${nodeId}: ${JSON.stringify(node.getStatus())}`);
}
}, 100);
}
}
// Run simulation
const cluster = new RaftCluster(['A', 'B', 'C', 'D', 'E']);
cluster.simulateElection();
When to Use vs. When NOT to Use
Use leader election when exactly one actor must be in charge:
- Database replication (one primary, multiple replicas)
- Distributed configuration management
- Scheduler coordination (one node schedules jobs)
- Distributed locking (only one holder at a time)
- Split-brain prevention required
Avoid it when the coordination cost outweighs the benefit:
- Stateless services where any replica can handle any request
- Idempotent operations, where duplicate execution is harmless
- Single-node deployments, or cases where an existing coordination service (etcd, ZooKeeper, Consul) or a database constraint can arbitrate for you
Patterns and Pitfalls
Common pitfalls: even-sized clusters (4 nodes tolerate no more failures than 3, yet cost more), election timeouts set too close to the heartbeat interval (constant re-elections under network jitter), and spreading a quorum across only two failure domains (losing the larger side destroys the quorum).
Design Review Checklist
- Quorum size correctly calculated: (N/2) + 1 where N = total nodes
- Odd number of nodes (3, 5, 7) for proper majority calculation
- Election timeout is comfortably longer than the heartbeat interval to prevent election thrashing (see the timing sketch after this checklist)
- Leader heartbeat interval configured (50-100ms typical)
- Log replication ensures durability (an entry is committed once a quorum of replicas has persisted it)
- Split-brain prevention tested (partition scenarios)
- Leader failure detected within election timeout
- Followers reject writes (only leader accepts writes)
- Monitoring for leader election frequency (high = instability)
- Recovery process defined for network partitions
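The timeout relationship above is worth checking mechanically. A minimal sketch in Python, assuming a rule-of-thumb ratio of roughly 10:1 between election timeout and heartbeat interval (the functions and threshold are illustrative, not from a specific library):

import random

def validate_timing(heartbeat_ms, election_min_ms, election_max_ms):
    """Reject configurations that invite election thrashing."""
    # A few lost heartbeats should not trigger a spurious election,
    # hence the ~10x cushion between heartbeat and election timeout.
    if election_min_ms < 10 * heartbeat_ms:
        raise ValueError("election timeout too close to heartbeat interval")
    if election_min_ms >= election_max_ms:
        raise ValueError("need a non-empty range to randomize timeouts")

def pick_election_timeout(election_min_ms, election_max_ms):
    """Randomize per node so two candidates rarely tie (split votes)."""
    return random.uniform(election_min_ms, election_max_ms)

validate_timing(heartbeat_ms=100, election_min_ms=1000, election_max_ms=2000)

The 100 ms / 1000 ms pairing mirrors etcd's default heartbeat interval and election timeout.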
Self-Check
- Why does electing a leader require a quorum (a strict majority) rather than a simple plurality of votes?
- What happens when the leader crashes?
- How does split-brain occur without quorum?
- What happens if network partitions the cluster 3-2?
- When is a write considered committed?
Next Steps
- Distributed Locking: Use Raft-based systems (etcd, Consul) for locks and coordination (see the sketch after this list)
- Data Consistency: Read about eventual consistency vs. strong consistency trade-offs
- Observability: Monitor leader elections and term changes to detect instability
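Rather than implementing Raft yourself, you can lean on a Raft-backed store for locking. A sketch under stated assumptions: the third-party python-etcd3 client, an etcd endpoint on localhost:2379, and a hypothetical run_scheduled_jobs() application function.

import etcd3  # third-party package: pip install etcd3

client = etcd3.client(host="localhost", port=2379)

# The lock is a leased key inside etcd; etcd's Raft quorum guarantees
# at most one holder, and the TTL releases the lock if the holder dies.
with client.lock("scheduler-leader", ttl=10):
    run_scheduled_jobs()  # hypothetical: only the lock holder runs this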
References
- Ongaro, D., & Ousterhout, J. (2014). In Search of an Understandable Consensus Algorithm (Raft) ↗️. USENIX ATC.
- Burrows, M. (2006). The Chubby Lock Service for Loosely-Coupled Distributed Systems ↗️. OSDI.
- etcd Documentation. etcd: Distributed Configuration and Service Discovery ↗️