Load Testing in Pipelines

Validate capacity assumptions with load testing before production; detect regressions automatically.

TL;DR

Manual load testing is expensive and rare. Automated load tests in CI/CD run on every commit, measuring latency, throughput, and resource utilization. Establish baseline metrics on the main branch, then compare PR changes against it. If latency increases by more than 5% or error rate spikes above thresholds, fail the build. This prevents performance regressions from shipping. Load tests don't require production-scale resources—test at 10% scale if constrained. Focus on bottleneck endpoints (slowest, most CPU-intensive) and critical resources (database queries, cache operations). Load testing in staging catches problems before production deployment.

Learning Objectives

  • Design realistic load tests for CI/CD pipelines
  • Measure and establish performance baselines
  • Detect performance regressions automatically
  • Run cost-effective load tests with limited resources
  • Configure meaningful regression detection thresholds
  • Interpret load test results and act on findings

Motivating Scenario

Your product ships a new feature that handles user requests differently. The code looks efficient, passes functional tests, and deploys smoothly. Three days later, customer complaints arrive: the system feels slower. Investigation reveals a 12% latency increase on the critical user search endpoint. Rollback costs an hour of downtime and customer trust.

A load test in your CI/CD pipeline would have caught this before merge. The regression would have been visible and fixable during code review, where the fix takes minutes. This pattern repeats: developers make individually innocent changes, their aggregate effect degrades performance, and the first signal is customer impact.

Core Concepts

Load Testing Pipeline: From Baseline to Regression Detection

Key Metrics in Load Testing

Latency percentiles: Measure response time at p50, p95, and p99. The p99 often matters most: a single user interaction can trigger many requests, so even when 99% of requests are fast, one slow response in the batch degrades the whole experience.
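
To make the percentile definitions concrete, here is a minimal, self-contained illustration in plain JavaScript using the nearest-rank method; the sample data and function are invented for this example (tools like k6 report these statistics for you).

// Nearest-rank percentile over recorded response times, for illustration only.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b); // ascending copy
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Response times in milliseconds from a hypothetical test run.
const latenciesMs = [112, 98, 104, 230, 120, 101, 97, 1450, 110, 125];

console.log('p50:', percentile(latenciesMs, 50)); // typical request
console.log('p95:', percentile(latenciesMs, 95)); // where the slow tail begins
console.log('p99:', percentile(latenciesMs, 99)); // near worst-case experience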

Throughput: Requests per second the system handles before degradation. Measure stable throughput (after initial warmup) and peak throughput.

Error rate: Percentage of requests that fail under load. Should remain near 0% for acceptable load levels. If error rate rises under load, you've found the breaking point.

Resource utilization: CPU, memory, database connections. Identify bottlenecks—if CPU is 95% but memory is 30%, you're CPU-bound.

Load Testing Patterns

Constant load: Fixed request rate (1000 req/s for 5 minutes). Simple, reproducible.

Ramp-up load: Gradually increase request rate. Models gradual traffic growth, reveals when degradation begins.

Spike load: Sudden jump to high request rate. Models traffic spikes, reveals spike resilience.

For CI/CD, constant load is usually the best fit: it is predictable, fast, and produces results that are comparable from run to run.
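
If the test tool is k6 (as in the practical example that follows), a constant-rate profile can be expressed with the constant-arrival-rate executor. This is a minimal sketch; the rate, durations, VU counts, and target URL are placeholders, not recommendations.

// Hold a fixed arrival rate regardless of how fast the system responds.
// All numbers and the target URL are placeholder values.
import http from 'k6/http';

export const options = {
  scenarios: {
    constant_load: {
      executor: 'constant-arrival-rate',
      rate: 200,            // iterations started per timeUnit
      timeUnit: '1s',       // i.e. 200 requests per second
      duration: '5m',
      preAllocatedVUs: 50,  // VUs pre-allocated to sustain the rate
      maxVUs: 200,          // upper bound if responses slow down
    },
  },
};

export default function () {
  http.get('https://staging.example.com/search?q=test');
}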

Practical Example

import http from 'k6/http';
import { check, group } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('errors');
const searchLatency = new Trend('search_latency');

export const options = {
  stages: [
    { duration: '30s', target: 100 }, // Ramp-up to 100 VUs
    { duration: '2m', target: 100 },  // Stay at 100 VUs
    { duration: '30s', target: 0 },   // Ramp-down
  ],
  thresholds: {
    'http_req_duration': ['p(95)<500', 'p(99)<1000'], // 95% under 500ms, 99% under 1s
    'errors': ['rate<0.01'], // Error rate below 1%
  },
};

export default function () {
  group('Search Endpoint', () => {
    const response = http.get('https://api.example.com/search?q=test');
    const success = check(response, {
      'status 200': (r) => r.status === 200,
      'latency < 500ms': (r) => r.timings.duration < 500,
    });

    errorRate.add(!success);
    searchLatency.add(response.timings.duration);
  });

  group('Details Endpoint', () => {
    const response = http.get('https://api.example.com/details/123');
    check(response, {
      'status 200': (r) => r.status === 200,
    });
    errorRate.add(response.status !== 200);
  });
}
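
To feed automated regression detection, the run's aggregated metrics can be exported to a file. One way to do this in k6 is the handleSummary hook, sketched below; the output filename is an arbitrary choice for this example.

// Added at the end of the test script: write the end-of-test summary as JSON so
// a later pipeline step can compare it against the main-branch baseline.
export function handleSummary(data) {
  return {
    'load-test-summary.json': JSON.stringify(data, null, 2), // arbitrary filename
  };
}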

When to Use / When Not to Use

Use SLO-Based Load Testing
  1. Services with strict availability/latency SLOs
  2. Critical customer-facing endpoints
  3. High-traffic scenarios where performance matters
  4. Cost-sensitive systems (efficiency = revenue)
  5. Microservices architecture with interdependencies

Use Static Threshold Testing
  1. Internal tools with flexible SLOs
  2. Batch processing with no real-time requirements
  3. One-off tests (not recurring in CI/CD)
  4. Exploratory testing without established baselines
  5. Development environment testing

Patterns and Pitfalls

Prioritize bottleneck endpoints: Don't test every endpoint equally. Identify which endpoints are slowest, consume the most resources, or receive the most traffic, and load test those aggressively. A slow search endpoint matters more than a slow admin endpoint; bottleneck endpoints reveal where optimization effort yields the highest ROI.

Test below production scale when constrained: You don't need to test at production scale every time. A load test at 10% scale in a smaller staging environment runs faster, costs less, and still reveals regressions if your endpoints scale roughly linearly. For expensive resources, 10% scale is often sufficient.

Warm up before measuring: Cold starts skew results. JIT compilation, connection pooling, and cache population all take time, so always include a warm-up phase (2-3 minutes at lower load) before measuring. Discard warm-up results and measure only after the system reaches steady state; a minimal sketch follows.
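
In k6, one minimal way to keep warm-up traffic out of the metric used for comparison is sketched below; the warm-up length, metric name, VU count, duration, and URL are assumptions for this example.

// Record the comparison metric only after an explicit warm-up window.
// WARMUP_MS, the metric name, and the target URL are assumptions.
import http from 'k6/http';
import { Trend } from 'k6/metrics';

const measuredLatency = new Trend('measured_latency');
const WARMUP_MS = 2 * 60 * 1000; // 2-minute warm-up

export const options = { vus: 20, duration: '5m' };

export function setup() {
  // Runs once before the load phase; its return value is passed to every iteration.
  return { startedAt: Date.now() };
}

export default function (data) {
  const res = http.get('https://staging.example.com/search?q=test');
  if (Date.now() - data.startedAt > WARMUP_MS) {
    // Only post-warm-up samples feed the baseline comparison.
    measuredLatency.add(res.timings.duration);
  }
}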

Size staging for the test load: A test that pushes 1000 req/s requires staging infrastructure that can support 1000 req/s. Without proper sizing, the test environment itself becomes the bottleneck and results reflect staging limits rather than your code. Baseline your staging capacity first and design tests to stay within it.

Alert on relative change, not absolute values: Don't alert when latency hits 500ms; alert when latency increases by more than 10% from the baseline. Absolute thresholds are context-blind, while relative changes show degradation. A service might legitimately run at 800ms latency; a jump to 900ms is the signal. A sketch of such a baseline comparison follows.
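
A relative check can be a small pipeline step that compares the current run's exported summary with a stored main-branch baseline. The sketch below (plain Node.js) assumes the file names and metric layout produced by the handleSummary example earlier, and a 10% threshold; all of these are placeholders to adapt to your pipeline.

// compare-baseline.js -- fail the build if p95 latency regressed by more than 10%.
// File names, the metric path, and the threshold are assumptions for this sketch.
const fs = require('fs');

const baseline = JSON.parse(fs.readFileSync('baseline-summary.json', 'utf8'));
const current = JSON.parse(fs.readFileSync('load-test-summary.json', 'utf8'));

// k6's end-of-test summary exposes trend stats such as p(95) under metrics.<name>.values.
const p95 = (summary) => summary.metrics.http_req_duration.values['p(95)'];

const baselineP95 = p95(baseline);
const currentP95 = p95(current);
const increase = (currentP95 - baselineP95) / baselineP95;

console.log(
  `p95 baseline=${baselineP95.toFixed(1)}ms current=${currentP95.toFixed(1)}ms ` +
  `change=${(increase * 100).toFixed(1)}%`
);

if (increase > 0.1) {
  console.error('p95 latency regressed more than 10% from baseline; failing the build.');
  process.exit(1);
}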

Isolate external dependencies: If load tests depend on external APIs or services, they're flaky. Mock external dependencies or test against staging instances you control, so external flakiness cannot block your CI/CD pipeline; one way to keep the target configurable is sketched below.
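
With k6, a simple isolation technique is to read the target host from an environment variable, so CI can point the same script at a staging or mock deployment it controls; the variable name and default below are assumptions.

// The target host comes from the environment, so the pipeline decides what the
// test hits. BASE_URL is an assumed variable name; the default is a placeholder.
import http from 'k6/http';

const BASE_URL = __ENV.BASE_URL || 'https://staging.example.com';

export default function () {
  http.get(`${BASE_URL}/search?q=test`);
}

The variable can be supplied at run time, for example with k6 run -e BASE_URL=https://staging.internal.example script.js.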

Design Review Checklist

  • Do you run load tests on every pull request?
  • Is there a baseline performance measurement for the main branch?
  • Are regression thresholds defined (e.g., 5% latency increase)?
  • Do failed load tests block merge or just warn?
  • Are you testing at least the top 3 bottleneck endpoints?
  • Is warm-up time included in test setup?
  • Are staging resources sufficient for test load?
  • Do developers understand why a test failed?
  • Are results tracked over time (trends, improvements)?
  • Is load testing documentation updated after each incident?

Self-Check

  • What are your top 3 slowest endpoints? Are they load-tested?
  • What's your regression detection threshold? Why that number?
  • How long do load tests take? Is that acceptable for every PR?
  • Can a new engineer understand a failed load test result?
  • Do you correlate load test regressions with actual production impact?

Next Steps

  1. Identify bottleneck endpoints: Profile production traffic, find slowest endpoints by p99 latency and request volume
  2. Establish baseline: Run load tests on main branch daily for 1 week, compute stable averages
  3. Set regression thresholds: Choose thresholds (e.g., 5% latency, 1% throughput, 0.1% error rate) based on SLOs
  4. Integrate into CI/CD: Add load tests to pull request checks; fail builds if regressions detected
  5. Measure effectiveness: Track how many regressions are caught pre-deployment vs. discovered in production

References

  1. Newman, S. (2015). Building Microservices. O'Reilly Media.
  2. Gregg, B. (2013). Systems Performance: Enterprise and the Cloud. Prentice Hall.
  3. k6 Documentation. k6 Load Testing.