Flaky Tests and Non-Determinism

Tests that pass sometimes and fail other times destroy confidence in CI/CD. Fix the root cause.

TL;DR

Flaky tests pass sometimes, fail other times. Destroys trust in CI. Root causes: timing (async not awaited), shared state (tests affect each other), randomness (uncontrolled), external systems (network timeouts). Fix: isolate data (each test owns its data), determinism (seed RNG, mock time), async handling (wait/poll not sleep). Run tests repeatedly, in different orders, in parallel. If a test fails intermittently, fix it immediately—don't ignore it. Flaky tests are worse than no tests.

Root Causes

Problem: Tests depend on specific timing. Thread sleeps, async operations, race conditions.

# BAD: Depends on timing
def test_async_operation():
    result = async_operation()
    time.sleep(1)  # Hope the operation completes within 1 second
    assert result == expected
    # Sometimes the operation takes 1.1 seconds and the test fails

# GOOD: Wait for a condition
def test_async_operation():
    result = async_operation()
    wait_until(lambda: result_is_ready(), timeout=5)  # Poll until ready
    assert result == expected

# GOOD: Use async/await
async def test_async_operation():
    result = await async_operation()
    assert result == expected  # await finishes before the assertion runs
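
The wait_until helper used above is not a pytest built-in; a minimal sketch of one, assuming the condition is a zero-argument callable, looks like this:

import time

def wait_until(condition, timeout=5, interval=0.1):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"Condition not met within {timeout}s")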

Patterns & Pitfalls

Pitfalls:

  • Test fails randomly. The team ignores it ("it's just flaky, we all know"). Bugs hide in the noise.
  • All tests share one database. Test A creates data, test B sees it. Order-dependent tests; the parallel run fails (see the sketch below).
  • time.sleep(5) to wait for an async operation. Sometimes the operation takes 5.1 s. The test fails.
  • Tests must run in a specific order. Run them out of order and they fail. Can't parallelize.

Patterns:

  • Each test creates its own data and cleans up after itself. Order-independent. Can parallelize.
  • Same seed for the RNG, mocked time, mocked external calls. Same test, same result.
  • Tests properly await async operations. No guessing about timing.
  • Flaky test discovered? Stop and fix it now. Don't ignore it. A flaky test is broken trust.
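
A minimal sketch of the shared-state pitfall, with hypothetical test names: test_b_expects_user only passes if test_a_creates_user ran first.

users = []  # module-level state shared between tests

def test_a_creates_user():
    users.append("alice")
    assert len(users) == 1

def test_b_expects_user():
    assert users  # passes after test_a, fails when run alone or first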

Root Cause Fixes

Isolation

Each test creates its own data, with no shared state:

import sqlite3
import pytest

@pytest.fixture(autouse=True)
def isolate_database():
    """Isolate each test in its own transaction ('db' is the app's connection)."""
    db.begin()
    yield
    db.rollback()  # Undo all changes made by the test

# Or: an in-memory database, created fresh for each test
@pytest.fixture
def db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INT, name TEXT)")
    yield db
    db.close()  # Destroyed after the test
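
Another isolation tactic: give every test its own files and identifiers rather than sharing them. A minimal sketch using pytest's built-in tmp_path fixture and a per-test UUID (create_user is a hypothetical helper):

import uuid

def test_report_written_to_own_dir(tmp_path):
    report = tmp_path / "report.txt"   # unique temp directory per test
    report.write_text("ok")
    assert report.read_text() == "ok"

def test_user_with_unique_name():
    name = f"user-{uuid.uuid4()}"      # never collides with other tests' data
    user = create_user(name)           # hypothetical helper from the app
    assert user.name == name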

Determinism

No randomness, controlled time:

import random
from datetime import datetime
from unittest.mock import patch
from faker import Faker

# Seed RNG
random.seed(42)
Faker.seed(42)

# Mock time: patch datetime where the code under test imports it,
# not in the standard library ('myapp.orders' is a placeholder module name)
@patch('myapp.orders.datetime')
def test_with_mocked_time(mock_datetime):
    mock_datetime.now.return_value = datetime(2025, 2, 14, 10, 0, 0)
    # Test runs with a fixed time

# Mock random
@patch('random.choice')
def test_with_mocked_random(mock_choice):
    mock_choice.return_value = 'A'
    # Test gets deterministic "random" behavior
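
If the freezegun package is available, freezing time is simpler than patching datetime by hand; a minimal sketch, assuming the code under test calls datetime.now():

from datetime import datetime
from freezegun import freeze_time

@freeze_time("2025-02-14 10:00:00")
def test_with_frozen_time():
    assert datetime.now() == datetime(2025, 2, 14, 10, 0, 0)  # fixed for the whole test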

Async Handling

Properly await or poll:

# Good: async/await
async def test_async():
    result = await operation()  # Waits for the operation to finish
    assert result == expected

# Good: polling with a timeout
def test_async_with_poll():
    operation()  # Start the async work
    for _ in range(50):  # Poll up to 50 times (~5 seconds total)
        if check_result():
            return  # Success
        time.sleep(0.1)  # 100 ms between checks
    assert False, "Operation never completed"

# Good: event / condition variable
def test_async_with_event():
    event = threading.Event()

    def callback():
        event.set()

    operation(callback)
    assert event.wait(timeout=5)  # Wait up to 5 seconds
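
When an awaited operation can hang, bound it with a timeout so the test fails fast instead of stalling CI. A minimal sketch with the standard-library asyncio.wait_for (operation and expected are the same placeholders used above):

import asyncio

async def test_async_with_timeout():
    result = await asyncio.wait_for(operation(), timeout=5)  # TimeoutError after 5 s
    assert result == expected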

Design Review Checklist

  • Do any tests use hardcoded sleep() instead of wait/poll?
  • Are tests isolated (each has own data, no shared state)?
  • Do tests control randomness (seeded RNG)?
  • Do tests mock external systems (API, database)?
  • Are tests properly awaiting async operations?
  • Do tests mock or control time (not depend on current time)?
  • Can tests run in parallel without interference?
  • Can tests run in any order (not order-dependent)?
  • Are flaky tests fixed immediately (not ignored)?
  • Do tests run multiple times to detect flakiness?
  • Do fixtures clean up properly (even on failure)?
  • Are environment variables controlled, not assumed (see the sketch after this list)?
  • Are thread timing dependencies eliminated?
  • Are tests fast (< 1 second for unit tests)?
  • Are test failures reproducible (same failure every time)?
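
For the environment-variable item, pytest's built-in monkeypatch fixture pins values for one test and restores them afterward ("TZ" and "APP_ENV" are example variable names):

def test_uses_fixed_environment(monkeypatch):
    monkeypatch.setenv("TZ", "UTC")                   # don't inherit the CI machine's timezone
    monkeypatch.setenv("APP_ENV", "test")             # hypothetical application setting
    monkeypatch.delenv("HTTP_PROXY", raising=False)   # drop stray proxy config if present
    # ...exercise the code under test with a known environment;
    # variables are restored automatically when the test ends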

Detection & Fix Process

  1. Detect: Run tests 100 times. Any intermittent failures = flaky.
  2. Isolate: Run the flaky test alone. Does it still fail? If yes, the test itself is broken. If no, it's order-dependent (see the second sketch below).
  3. Investigate: Review test code for: sleep(), shared state, randomness, external calls, time assumptions.
  4. Fix: Apply pattern from above (isolation, determinism, async handling).
  5. Verify: Run test 100 times in random order. All pass? Done.
A minimal loop for step 1 (detect):

#!/bin/bash
# Detect a flaky test: run it 100 times, stop at the first failure

for i in {1..100}; do
    pytest test_file.py::test_flaky --tb=short
    if [ $? -ne 0 ]; then
        echo "FAILED on run $i"
        exit 1
    fi
done
echo "Test passed 100 times - reliable"

Self-Check

  1. Have you ever seen a test pass, then fail, then pass? If yes, you have a flaky test.
  2. Do you ignore intermittent test failures? If yes, fix them. Flaky = broken.
  3. Do your tests depend on order? If yes, they're fragile.
  4. Do tests use sleep()? If yes, replace with wait/poll.
  5. Can you parallelize your tests? If no, you have shared state.

Next Steps

  1. Audit test suite — Run tests 10 times. Detect any failures.
  2. Isolate failing tests — Which tests are flaky?
  3. Fix root cause — Use patterns above (isolation, determinism, async).
  4. Test multiple ways — Random order, parallel, repeated runs.
  5. Monitor in CI — Flag flaky tests, don't merge until fixed.
  6. Document — Add comment explaining fix.
