Circuit Breaker

Prevent cascading failures by failing fast and recovering gradually

TL;DR

A circuit breaker prevents cascading failures by detecting when a dependency is failing and immediately stopping requests to it. It has three states: Closed (normal operation, counting failures), Open (dependency failing, requests rejected immediately), and Half-Open (testing recovery). When failures reach a threshold (e.g., 5 failures in 10 seconds), the circuit opens and rejects all requests without delay. After a timeout (e.g., 30 seconds), it enters the half-open state and allows a test request. If the test succeeds, the circuit closes; if it fails, the circuit returns to open. Callers stay responsive, and the backend gets room to recover without being hit by more traffic.

Learning Objectives

  • Understand the three-state circuit breaker pattern and state transitions
  • Implement failure detection thresholds and recovery timeouts
  • Design appropriate fallback behavior for open circuits
  • Choose which services and dependencies need circuit breakers
  • Configure thresholds based on dependency reliability and criticality

Motivating Scenario

A payment service depends on a fraud-detection API. The fraud API becomes overloaded (processing backlog). Each payment waits for a fraud response until the 60-second timeout, so all payment threads block, the frontend times out, and users can't complete purchases. With a circuit breaker: after 5 consecutive fraud API failures (5 seconds), the circuit opens. New calls to the fraud API fail immediately (fail-fast), and the payment service handles the downtime gracefully (no fraud check = manual review later). After 30 seconds, the circuit tests the fraud API. If it has recovered, the circuit closes; if not, it stays open. The frontend is responsive throughout.

Core Concepts

Circuit Breaker State Transitions

Closed State: Normal operation. Requests pass through. Failures are counted within a time window. If the failure count reaches the threshold, transition to Open.

Open State: Service is failing. Reject all new requests immediately with an error or fallback value, without waiting for the dependency's request timeout. The circuit stays open for a bounded recovery timeout, then moves to Half-Open.

Half-Open State: Service might be recovering. Allow a single test request through. If it succeeds, close the circuit and resume normal traffic. If it fails, return to Open.

The breaker's behavior is controlled by configurable parameters: failure count, window duration, timeout before half-open, and maximum number of test requests.
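The transitions described above can be written down as a small lookup table. This is a minimal sketch, not a full breaker: the `State`, `Event`, and `next_state` names are illustrative assumptions, and events that do not apply to the current state are assumed to leave it unchanged.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class Event(Enum):
    THRESHOLD_REACHED = "threshold_reached"  # too many failures in the window
    TIMEOUT_ELAPSED = "timeout_elapsed"      # recovery timeout has passed
    TEST_SUCCEEDED = "test_succeeded"        # trial request succeeded
    TEST_FAILED = "test_failed"              # trial request failed

# (current state, event) -> next state
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TEST_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TEST_FAILED): State.OPEN,
}

def next_state(state, event):
    """Return the state after `event`; inapplicable events keep the current state."""
    return TRANSITIONS.get((state, event), state)

state = State.CLOSED
for event in (Event.THRESHOLD_REACHED, Event.TIMEOUT_ELAPSED, Event.TEST_SUCCEEDED):
    state = next_state(state, event)
    print(event.value, "->", state.value)
```

Keeping the transitions in one table makes it easy to check that no path leads from Open straight back to Closed without passing through Half-Open.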

Practical Example

import time
import threading
from enum import Enum
from collections import deque

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold=5,
        success_threshold=1,
        timeout=30,
        window_size=10
    ):
        """
        failure_threshold: failures to trigger open
        success_threshold: successes to close from half-open
        timeout: seconds before open -> half-open
        window_size: seconds to count failures in
        """
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.window_size = window_size

        self.state = CircuitState.CLOSED
        self.failures = deque()  # timestamps of recent failures
        self.successes = 0
        self.last_failure_time = None
        self.opened_at = None
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""
        with self.lock:
            if self.state == CircuitState.OPEN:
                # time.monotonic() is unaffected by wall-clock adjustments
                if time.monotonic() - self.opened_at > self.timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.successes = 0
                else:
                    raise CircuitOpenError("Circuit breaker is open")

        # The protected call runs outside the lock so slow calls don't block
        # other callers. Note: this sketch does not cap the number of
        # concurrent trial calls while in HALF_OPEN.
        try:
            result = func(*args, **kwargs)
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            raise  # re-raise with the original traceback

    def _record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.successes += 1
                if self.successes >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failures.clear()
                    self.successes = 0
            elif self.state == CircuitState.CLOSED:
                # Clear old failures (outside window)
                now = time.monotonic()
                while self.failures and now - self.failures[0] > self.window_size:
                    self.failures.popleft()

    def _record_failure(self):
        with self.lock:
            now = time.monotonic()
            self.failures.append(now)
            self.last_failure_time = now

            # Clear old failures (outside window)
            while self.failures and now - self.failures[0] > self.window_size:
                self.failures.popleft()

            failure_count = len(self.failures)

            if self.state == CircuitState.CLOSED:
                if failure_count >= self.failure_threshold:
                    self.state = CircuitState.OPEN
                    self.opened_at = now

            elif self.state == CircuitState.HALF_OPEN:
                # Any failure in half-open returns to open
                self.state = CircuitState.OPEN
                self.opened_at = now
                self.successes = 0

    def get_state(self):
        with self.lock:
            return {
                "state": self.state.value,
                "failures": len(self.failures),
                "successes": self.successes if self.state == CircuitState.HALF_OPEN else None
            }

# Example usage
breaker = CircuitBreaker(failure_threshold=3, timeout=5)

def unreliable_api(fail=False):
    if fail:
        raise Exception("API failed")
    return "Success"

# Test with failures: requests 0-2 fail and trip the breaker; requests 3-9
# are then rejected without calling the API (the 5-second timeout has not elapsed).
for i in range(10):
    try:
        breaker.call(unreliable_api, fail=(i < 3))
        print(f"Request {i}: OK")
    except Exception as e:
        print(f"Request {i}: {e}, State: {breaker.get_state()['state']}")
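The example above lets every caller through once the breaker is half-open, while the Half-Open description calls for a single test request. One way to cap trial calls, sketched here under the assumption of a hypothetical `HalfOpenGate` helper that is not part of the class above, is a non-blocking semaphore:

```python
import threading

class HalfOpenGate:
    """Hands out at most `max_probes` trial permits at a time (hypothetical helper)."""

    def __init__(self, max_probes=1):
        self._permits = threading.BoundedSemaphore(max_probes)

    def try_acquire(self):
        # Non-blocking: returns False immediately when all permits are taken.
        return self._permits.acquire(blocking=False)

    def release(self):
        # Call once the trial request has finished, whether it succeeded or failed.
        self._permits.release()

gate = HalfOpenGate(max_probes=1)
first = gate.try_acquire()   # this caller performs the trial request
second = gate.try_acquire()  # this caller should be rejected as if the circuit were open
gate.release()               # trial request finished
third = gate.try_acquire()   # a permit is available again
print(first, second, third)
```

A breaker would call `try_acquire()` before forwarding a request in the half-open state and reject the request when it returns `False`.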

When to Use vs. When NOT to Use

Use Circuit Breakers
  1. External APIs with variable reliability
  2. Databases with potential overload/failover
  3. Systems where cascading failure is costly
  4. Protecting limited resources (thread pools, connections)
Avoid Circuit Breakers
  1. Services with known performance issues (fix the underlying problem rather than masking it with a breaker)

Patterns and Pitfalls

When the circuit opens, prefer a safe default over an error where one exists (empty list, cached value, degraded mode). Example: if recommendations are offline, show popular items instead of an error.
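Classify failures before counting them. A timeout (likely overload) counts toward opening the circuit, and so does a 5xx error. A 4xx error (bad request) does not count, because it reflects a problem with the request rather than with the dependency. Distinguish between transient failures (retry) and persistent failures (open the circuit).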
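The classification rule above can be expressed as a small predicate. This is a sketch under the stated policy only; the function name and its parameters are assumptions for illustration, and real clients may surface timeouts and status codes differently.

```python
def counts_as_failure(status_code=None, timed_out=False):
    """Return True if the outcome should count against the breaker.

    Assumed policy: timeouts and 5xx responses count; 4xx and other
    responses do not.
    """
    if timed_out:
        return True
    return status_code is not None and 500 <= status_code <= 599

print(counts_as_failure(timed_out=True))   # timeout: counts
print(counts_as_failure(status_code=503))  # server error: counts
print(counts_as_failure(status_code=404))  # client error: does not count
print(counts_as_failure(status_code=200))  # success: does not count
```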
If the circuit opens after only 1-2 failures, a flaky network causes unnecessary open states. For most services, a threshold of 3-5 failures in a 10-second window is a reasonable starting point.
If the circuit stays open for 5 minutes but the service recovers in 30 seconds, the breaker keeps rejecting traffic that would have succeeded. Keep the timeout at 30-60 seconds and test recovery eagerly.
Degrade gracefully per dependency. When the circuit to the auth service opens, use cached auth tokens (with expiration). When the circuit to the recommendation API opens, show previously recommended items.
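A fallback can be layered on top of a breaker with a thin wrapper. The sketch below is self-contained and assumes a hypothetical `CircuitOpenError` exception that the breaker raises when it rejects a call; `fetch_recommendations` and `popular_items` are stand-ins, not real APIs.

```python
class CircuitOpenError(Exception):
    """Hypothetical error raised by a breaker that is rejecting calls."""

def call_with_fallback(protected_call, fallback):
    """Run protected_call; if the breaker rejects it, return fallback() instead."""
    try:
        return protected_call()
    except CircuitOpenError:
        return fallback()

def fetch_recommendations():
    # Stand-in for a breaker-wrapped call while the circuit is open.
    raise CircuitOpenError("recommendations circuit is open")

def popular_items():
    # Safe default shown while recommendations are unavailable.
    return ["popular-1", "popular-2"]

items = call_with_fallback(fetch_recommendations, popular_items)
print(items)
```

Only the breaker's rejection triggers the fallback here; other exceptions still propagate, so genuine bugs in the caller are not hidden behind a default value.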
When circuit breakers run in multiple instances without shared state, each instance tracks failures independently. A recovering service might then receive test requests from 10 breakers that are all in HALF_OPEN at once. Use shared state (for example, Redis) or accept eventual consistency between instances.
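To illustrate the shared-state idea, the sketch below keeps the "opened at" timestamp in a store that all instances read. Everything here is an assumption for illustration: `InMemoryStore` is a stand-in for a networked store such as Redis, and `SharedOpenFlag` is a hypothetical helper that covers only the open/not-open decision, not failure counting.

```python
import time

class InMemoryStore:
    """Stand-in for a shared store; a real deployment needs a networked one."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

class SharedOpenFlag:
    """Lets several breaker instances agree on whether a circuit is open."""

    def __init__(self, store, name, timeout=30):
        self.store = store
        self.key = f"breaker:{name}:opened_at"
        self.timeout = timeout

    def trip(self, now=None):
        # Wall-clock time is used because monotonic clocks are not
        # comparable across processes or machines.
        self.store.set(self.key, time.time() if now is None else now)

    def is_open(self, now=None):
        opened_at = self.store.get(self.key)
        if opened_at is None:
            return False
        now = time.time() if now is None else now
        return now - opened_at < self.timeout

store = InMemoryStore()
instance_a = SharedOpenFlag(store, "fraud-api")
instance_b = SharedOpenFlag(store, "fraud-api")

instance_a.trip(now=100.0)             # instance A observes the failures
print(instance_b.is_open(now=110.0))   # True: B sees the open circuit
print(instance_b.is_open(now=140.0))   # False: the 30-second timeout has elapsed
```

A production version would also need atomic updates and a way to elect a single instance to send the half-open test request.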

Design Review Checklist

  • Circuit breaker wraps all calls to external services and unreliable dependencies
  • Failure threshold is configured based on dependency reliability (3-5 failures typical)
  • Failure window is appropriate (10-30 seconds for most systems)
  • Timeout to half-open is configured (30-60 seconds typical)
  • Success threshold for closing is chosen deliberately (a value > 1 reduces oscillation between Open and Closed)
  • Fallback behavior is defined for open state (error, cache, default value)
  • Different error types are handled differently (timeout vs. 5xx vs. 4xx)
  • Circuit state is monitored and alerted (transitions to OPEN, time in OPEN)
  • Multiple breaker instances coordinate (shared state if needed)
  • Tests verify fallback behavior when circuit is open

Self-Check

  • Can you draw the three states and transitions?
  • What's the difference between Closed and Half-Open?
  • How do you choose failure threshold and timeout values?
  • What should happen when circuit opens (fail or fallback)?
  • How do multiple service instances coordinate circuit state?

Next Steps

  1. Bulkhead Isolation: Read Bulkhead Isolation to prevent resource exhaustion
  2. Rate Limiting: Learn Rate Limiting and Throttling for proactive flow control
  3. Health Probes: Read Health Probes to detect service failures early
