Batching & Nagle-like Strategies
Reduce network and system overhead by batching operations intelligently.
TL;DR
Batching amortizes fixed per-operation costs (round-trips, protocol headers, transactions) across many items, trading a bounded increase in latency for much lower network and system overhead. The pattern is proven in production at scale, but it requires thoughtful implementation, continuous tuning, and rigorous monitoring to realize its benefits.
Learning Objectives
- Understand the problem this pattern solves
- Learn when and how to apply it correctly
- Recognize trade-offs and failure modes
- Implement monitoring to validate effectiveness
- Apply the pattern in your own systems
Motivating Scenario
Your API receives 1,000 write requests per second, and each request sends one record to the database. That's 1,000 individual database round-trips per second, burning CPU and network on per-request overhead. With batching, you accumulate 100 records in memory over 100ms, then send them to the database as a single batch. Now it's 10 round-trips per second, a 100x reduction. The trade-off: individual writes wait up to 100ms. For analytics or logs, 100ms is acceptable and dramatically reduces cost and load. For user-facing requests, 100ms is usually too much, so you batch only where the added latency is tolerable (background jobs, asynchronous operations).
Core Concepts
Batching Strategies
```
Naive approach: send immediately
  Request 1: Send immediately. Latency: 1ms. Overhead: 100% (full fixed cost per item)
  Request 2: Send immediately. Latency: 1ms. Overhead: 100%
  Request 3: Send immediately. Latency: 1ms. Overhead: 100%
  ...
  Total overhead: high (3 round-trips instead of 1)

Batching: accumulate, then send
  Requests 1-100: Accumulate for 100ms
  At 100ms: Send one batch. Latency: up to 100ms. Overhead: ~1% per item

Hybrid: send when the batch reaches its size limit OR the timeout expires
  Requests 1-50: Accumulate
  At 50ms: Batch still under 100 items, keep waiting
  Requests 51-100 arrive before the timeout: batch reaches 100 items, send immediately
  Latency: up to ~50ms for the earliest requests, near zero for the latest
```
Benefits of batching:
- Reduce round-trips (1000 → 10 for same throughput)
- Better hardware utilization (send 100 items with ~same overhead as 1)
- Reduced CPU and network load
- Lower cost (fewer connections, transactions)
Trade-offs:
- Increased latency (wait for batch)
- Complexity (buffer management, flushing)
- Tail latency issues (if batch doesn't fill, users wait)
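These trade-offs can be quantified with a back-of-the-envelope model: the fixed per-request cost is paid once per batch instead of once per item, and the worst-case added latency is the flush window. A minimal sketch, where the 5ms per-request cost is an illustrative assumption and the other numbers come from the motivating scenario:

```python
def batching_tradeoff(items_per_sec, batch_size, per_request_cost_ms, flush_window_ms):
    """Rough model: fixed per-request overhead is paid once per batch, not once per item."""
    requests_per_sec = items_per_sec / batch_size
    overhead_ms_per_item = per_request_cost_ms / batch_size
    worst_case_added_latency_ms = flush_window_ms  # first item may wait a full window
    return requests_per_sec, overhead_ms_per_item, worst_case_added_latency_ms

# Numbers from the motivating scenario: 1000 writes/s, batches of 100, 100ms window.
# The 5ms per-request cost is an assumed illustrative value, not a measurement.
print(batching_tradeoff(1000, 100, 5.0, 100))  # -> (10.0, 0.05, 100)
```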
Nagle's Algorithm (TCP Example)
TCP's Nagle algorithm delays small packets to avoid wasting bandwidth on tiny segments:
- Send immediately if the buffered data fills a full segment (MSS, typically ~1460 bytes on Ethernet)
- Otherwise, buffer until the previously sent data is acknowledged or a full segment accumulates
Without Nagle: each 1-byte write goes out as a 41-byte packet (40 bytes of TCP/IP headers plus 1 byte of data), enormous overhead.
With Nagle: small writes are buffered and sent together (efficient).
Nagle can be disabled for latency-sensitive applications via the TCP_NODELAY socket option, trading some bandwidth efficiency for lower latency.
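Disabling Nagle is a one-line socket option. A minimal sketch; the endpoint is a placeholder:

```python
import socket

# Connect and disable Nagle's algorithm so small writes go out immediately.
# ("example.com", 80) is a placeholder endpoint.
sock = socket.create_connection(("example.com", 80))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# With TCP_NODELAY set, each small write can leave as its own segment,
# trading bandwidth efficiency for lower per-write latency.
sock.sendall(b"x")
sock.close()
```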
When to Use
- High-frequency writes (logs, metrics, analytics)
- High fixed cost per operation (database round-trips)
- Acceptable latency increase (100ms-1s)
- Bulk operations possible
When NOT to Use
- User-facing requests (need low latency)
- Operations must be immediate (trading, real-time)
- Already batching at a lower level (SQL bulk inserts)
Batching Implementation Strategies
Time-Based Batching
```python
import threading
import time
from typing import Any, Callable, List, Optional

import requests


class TimedBatcher:
    """Batch operations; flush after a timeout or when the size limit is reached."""

    def __init__(self, batch_size: int = 100, timeout_ms: int = 100):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.batch: List[Any] = []
        self.lock = threading.Lock()
        self.flush_callback: Optional[Callable[[List[Any]], None]] = None
        self.timer: Optional[threading.Timer] = None

    def add(self, item: Any):
        """Add an item; flush immediately if the batch is full."""
        with self.lock:
            self.batch.append(item)
            # Start the timeout when the first item of a new batch arrives,
            # so the batch flushes after timeout_ms even if it never fills.
            if len(self.batch) == 1:
                self.timer = threading.Timer(self.timeout_ms / 1000, self.flush)
                self.timer.daemon = True
                self.timer.start()
            should_flush = len(self.batch) >= self.batch_size
        # Flush outside the lock: flush() acquires the same (non-reentrant) lock.
        if should_flush:
            self.flush()

    def flush(self):
        """Send the accumulated batch to its destination."""
        with self.lock:
            if not self.batch:
                return
            batch_to_send = self.batch
            self.batch = []
            if self.timer:
                self.timer.cancel()
                self.timer = None
        # Invoke the callback outside the lock so a slow send doesn't block add().
        if self.flush_callback:
            self.flush_callback(batch_to_send)


# Usage: log aggregation
def send_logs_to_server(logs: List[str]):
    """Send one batch of logs in a single HTTP request."""
    print(f"Sending {len(logs)} logs to server")
    requests.post('https://logs.example.com/batch', json={'logs': logs})


batcher = TimedBatcher(batch_size=100, timeout_ms=500)
batcher.flush_callback = send_logs_to_server

# Application logs constantly
for i in range(1000):
    batcher.add(f"Log message {i}")
    time.sleep(0.01)

# Flush whatever is left before shutdown.
batcher.flush()
```
Size-Based Batching (For Fixed Overhead)
```java
// Database batch inserts
List<Record> batch = new ArrayList<>();
for (Record record : records) {
    batch.add(record);
    // Flush when the batch reaches its size limit
    if (batch.size() >= 1000) {
        database.insertBatch(batch);
        batch.clear();
    }
}
// Flush the remainder at end of stream
if (!batch.isEmpty()) {
    database.insertBatch(batch);
}

// Comparison:
// Without batching: 10,000 inserts = 10,000 database round-trips
// With batching (size=1000): 10,000 inserts = 10 database round-trips (100x reduction)
// Cost: accumulation time (negligible for async operations)
```
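The same idea at the driver level, as a minimal Python sketch using sqlite3's `executemany`; the in-memory database and `events` table are illustrative assumptions:

```python
import sqlite3

BATCH_SIZE = 1000  # illustrative; tune to your driver and workload

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

records = [(i, f"payload-{i}") for i in range(10_000)]

# Insert in fixed-size chunks: one statement execution per chunk
# instead of one per record.
for start in range(0, len(records), BATCH_SIZE):
    chunk = records[start:start + BATCH_SIZE]
    conn.executemany("INSERT INTO events VALUES (?, ?)", chunk)
    conn.commit()

conn.close()
```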
Adaptive Batching (Size OR Time)
```javascript
class AdaptiveBatcher {
  constructor(maxSize, maxDelayMs, callback) {
    this.maxSize = maxSize;
    this.maxDelayMs = maxDelayMs;
    this.callback = callback;
    this.batch = [];
    this.timer = null;
  }

  add(item) {
    this.batch.push(item);

    // 1. Batch full: flush immediately.
    if (this.batch.length >= this.maxSize) {
      this.flush();
      return;
    }

    // 2. First item of a new batch: start the delay timer so the batch
    //    still flushes after maxDelayMs even if it never fills.
    if (this.batch.length === 1) {
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }

  flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.batch.length === 0) {
      return;
    }
    const toFlush = this.batch;
    this.batch = [];
    this.callback(toFlush);
  }
}

// Usage
const batcher = new AdaptiveBatcher(
  100,   // max 100 items
  100,   // or 100ms timeout
  (items) => {
    console.log(`Flushing ${items.length} items`);
    // Send to server
  }
);

// Add items as they arrive
for (let i = 0; i < 500; i++) {
  batcher.add(`item-${i}`);
}
```
Practical Example
- Core Patterns
- Configuration Example
- Monitoring
```yaml
# Batching & Nagle-like Strategies Patterns and Their Use

Circuit Breaker:
  Purpose: Prevent cascading failures by stopping requests to failing service
  When_Failing: Return fast with cached or degraded response
  When_Recovering: Gradually allow requests to verify recovery
  Metrics_to_Track: Failure rate, response time, circuit trips

Timeout & Retry:
  Purpose: Handle transient failures and slow responses
  Implementation: Set timeout, wait, retry with backoff
  Max_Retries: 3-5 depending on operation cost and urgency
  Backoff: Exponential (1s, 2s, 4s) to avoid overwhelming failing service

Bulkhead:
  Purpose: Isolate resources so one overload doesn't affect others
  Implementation: Separate thread pools, connection pools, queues
  Example: Checkout path has dedicated database connections
  Benefit: One slow query doesn't affect other traffic

Graceful Degradation:
  Purpose: Maintain partial service when components fail
  Example: Show cached data when personalization service is down
  Requires: Knowledge of what's essential vs. nice-to-have
  Success: Users barely notice the degradation

Load Shedding:
  Purpose: Shed less important work during overload
  Implementation: Reject low-priority requests when queue is full
  Alternative: Increase latency for all rather than reject some
  Trade-off: Some customers don't get served vs. all customers are slow

Reliability_Configuration:
  service_timeouts:
    payment_api: 5s
    recommendation_engine: 2s
    user_auth: 1s
  retry_policy:
    transient_errors: [408, 429, 503, 504]
    max_retries: 3
    backoff_multiplier: 2
    initial_delay: 100ms
  circuit_breaker:
    failure_threshold: 50%
    window: 10 requests
    open_timeout: 30s
  load_shedding:
    queue_threshold: 1000
    shed_non_essential: true
    reject_priority: low
```
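As one concrete illustration of the `retry_policy` block above, here is a minimal retry-with-exponential-backoff sketch; the `call_with_retry` function and its use of the `requests` library are assumptions for the example, not part of the configuration:

```python
import time
import requests

TRANSIENT_ERRORS = {408, 429, 503, 504}  # retry_policy.transient_errors
MAX_RETRIES = 3                          # retry_policy.max_retries
INITIAL_DELAY_S = 0.1                    # retry_policy.initial_delay (100ms)
BACKOFF_MULTIPLIER = 2                   # retry_policy.backoff_multiplier


def call_with_retry(url: str, timeout_s: float = 2.0):
    """Retry transient HTTP failures with exponential backoff: 100ms, 200ms, 400ms."""
    delay = INITIAL_DELAY_S
    for attempt in range(MAX_RETRIES + 1):
        try:
            response = requests.get(url, timeout=timeout_s)
            if response.status_code not in TRANSIENT_ERRORS:
                return response
        except requests.RequestException:
            pass  # network errors are treated as transient here
        if attempt < MAX_RETRIES:
            time.sleep(delay)
            delay *= BACKOFF_MULTIPLIER
    raise RuntimeError(f"{url} still failing after {MAX_RETRIES} retries")
```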
```yaml
Essential Metrics:
  Latency:
    - P50, P95, P99 response times
    - Alert if P99 > acceptable threshold
  Failure Rates:
    - Error rate percentage
    - Alert if >5% errors
  Pattern-Specific:
    - Circuit breaker trips (alert if >3 in 5min)
    - Retry count distribution
    - Load shed requests
    - Bulkhead resource utilization

Example Dashboard:
  - Real-time traffic flow with failures highlighted
  - Circuit breaker state (Open/Closed/Half-Open)
  - Retry success rates by service
  - Queue depths and shedding rates
```
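Applied to the batcher itself, tracking flush-latency percentiles takes only a few lines. A sketch assuming you record one duration per flush (the sample data and the 200ms alert threshold are illustrative; `statistics.quantiles` requires Python 3.8+):

```python
import statistics

# Durations (ms) of recent batch flushes; in practice these come from your
# batcher's flush callback, not a hard-coded list.
flush_latencies_ms = [12.0, 15.5, 11.2, 98.7, 14.1, 13.3, 250.4, 12.8, 16.0, 13.9]

# statistics.quantiles with n=100 returns the 99 percentile cut points.
cuts = statistics.quantiles(flush_latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"P50={p50:.1f}ms P95={p95:.1f}ms P99={p99:.1f}ms")

# Alert hook: compare P99 against your acceptable threshold (value is illustrative).
P99_THRESHOLD_MS = 200.0
if p99 > P99_THRESHOLD_MS:
    print("ALERT: flush P99 above threshold")
```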
Implementation Guide
- Identify the Problem: What specific failure mode are you protecting against?
- Choose the Right Pattern: Different problems need different solutions
- Implement Carefully: Half-implemented patterns are worse than nothing
- Configure Based on Data: Don't copy thresholds from blog posts
- Monitor Relentlessly: Validate the pattern actually solves your problem
- Tune Continuously: Thresholds need adjustment as load and systems change
Characteristics of Effective Implementation
✓ Clear objectives: Can state in one sentence what you're solving
✓ Proper monitoring: Can see whether the pattern is working
✓ Appropriate thresholds: Based on data from your system
✓ Graceful failure mode: When the pattern itself fails, the failure is explicit and acceptable
✓ Well-tested: Failure scenarios explicitly tested
✓ Documented: Future maintainers understand why it exists
Pitfalls to Avoid
❌ Blindly copying patterns: Thresholds from one system don't work for another
❌ Over-retrying: Making failing service worse by hammering it
❌ Forgetting timeouts: Retries without timeouts extend the pain
❌ Silent failures: If circuit breaker opens, someone needs to know
❌ No monitoring: Deploying patterns without metrics to validate
❌ Set and forget: Patterns need tuning as load and systems change
Related Patterns
- Bulkheads: Isolate different use cases so failures don't cascade
- Graceful Degradation: Degrade functionality when load is high
- Health Checks: Detect failures requiring retry or circuit breaker
- Observability: Metrics and logs showing whether pattern works
Checklist: Implementation Readiness
- Problem clearly identified and measured
- Pattern selected is appropriate for the problem
- Thresholds based on actual data from your system
- Failure mode is explicit and acceptable
- Monitoring and alerts configured before deployment
- Failure scenarios tested explicitly
- Team understands the pattern and trade-offs
- Documentation explains rationale and tuning
Self-Check
- Can you state in one sentence why you need this pattern? If not, you might not need it.
- Have you measured baseline before and after? If not, you don't know if it helps.
- Did you tune thresholds for your system? Or copy them from a blog post?
- Can someone on-call understand what triggers and what it does? If not, document better.
Takeaway
These patterns are powerful because they are proven in production, but that power comes with complexity. Implement only what you need, tune based on data, and monitor relentlessly. A well-implemented pattern you understand is worth far more than several half-understood patterns copied from examples.
Next Steps
- Identify the problem: What specific failure mode are you protecting against?
- Gather baseline data: Measure current behavior before implementing
- Implement carefully: Start simple, add complexity only if needed
- Monitor and measure: Validate the pattern actually helps
- Tune continuously: Adjust thresholds based on production experience