
Test Data, Fixtures, and Synthetic Data

Generate and manage test data efficiently and safely.

TL;DR

Test data is expensive to create and maintain. Use fixtures (static YAML/SQL files) for simple cases where the data is stable. Use factories (programmatic generation) when you need variation or complex scenarios. Use synthetic data generators (Faker, seeded random values) to create realistic but fake data. Seed the generators so the data is deterministic (same seed = same generated data) and tests are reproducible. Never use real production data in tests: it creates privacy risk and can violate regulations such as GDPR. Organize test data by feature, keep fixtures DRY, and keep them under version control.

Learning Objectives

After reading this article, you will understand:

  • Trade-offs between fixtures, factories, and synthetic data
  • How to design reproducible test data
  • Data privacy and compliance in testing
  • Strategies for managing test data at scale
  • Tools for test data generation
  • How to organize and maintain test data

Motivating Scenario

Your tests depend on hand-crafted test data. Adding a new test requires manually creating complex fixtures. Tests pass locally but fail in CI because the data changed. Another developer's tests interfere with yours because you share a test database. You spend 30% of your time managing test data instead of writing tests.

Better approach: Use factories to generate realistic data on-demand. Each test gets fresh, isolated data. Seed the random generator for reproducibility. No manual fixture maintenance.

Core Concepts

Test Data Strategies

Trade-offs between different test data approaches

Strategy    Setup Time   Maintenance   Flexibility   Privacy
Fixtures    Low          Medium        Low           Good
Factories   Medium       Low           High          Good
Synthetic   Medium       Medium        High          Excellent
Prod Copy   Low          High          N/A           Poor
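
To make the fixture and factory strategies compared above concrete, here is a minimal sketch that uses only the Python standard library. The names `User`, `USERS_FIXTURE`, `load_fixture_users`, and `make_user` are hypothetical and exist only for this illustration; a real project would more likely keep the fixture in a separate file and use a factory library.

```python
import itertools
import json
from dataclasses import dataclass


# Hypothetical model used only for this illustration.
@dataclass
class User:
    id: int
    email: str
    name: str


# Fixture: static data, normally kept in a version-controlled JSON/YAML file.
# It is inlined here as a string so the sketch is self-contained.
USERS_FIXTURE = """
[
  {"id": 1, "email": "alice@example.com", "name": "Alice"},
  {"id": 2, "email": "bob@example.com", "name": "Bob"}
]
"""


def load_fixture_users(raw: str = USERS_FIXTURE) -> list[User]:
    """Parse the static fixture; every caller sees the same records."""
    return [User(**row) for row in json.loads(raw)]


# Factory: data is built on demand, with defaults that a test can override.
_user_ids = itertools.count(1)


def make_user(**overrides) -> User:
    """Build a User with unique defaults; keyword arguments override fields."""
    n = next(_user_ids)
    fields = {"id": n, "email": f"user{n}@example.com", "name": f"User {n}"}
    fields.update(overrides)
    return User(**fields)
```

The fixture is cheap to set up but every new scenario means editing the file. The factory costs a little more up front, and a test then states only the fields it cares about, for example `make_user(name="Carol")`.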

Practical Example

from decimal import Decimal

import factory
from faker import Faker
from app.models import User, Order  # the application's own model classes

fake = Faker()


class UserFactory(factory.Factory):
    class Meta:
        model = User

    id = factory.Sequence(lambda n: n)  # unique, increasing id per instance
    email = factory.LazyFunction(fake.email)
    name = factory.LazyFunction(fake.name)
    created_at = factory.LazyFunction(fake.date_time)


class OrderFactory(factory.Factory):
    class Meta:
        model = Order

    id = factory.Sequence(lambda n: n)
    user = factory.SubFactory(UserFactory)  # builds a related User when none is passed
    total = factory.Faker('pydecimal', left_digits=4, right_digits=2)
    status = 'pending'


# Usage in tests
def test_order_total():
    # Override only the fields the test cares about
    user = UserFactory(name='Alice')
    order = OrderFactory(user=user, total=Decimal('99.99'))
    assert order.user.name == 'Alice'
    assert order.total == Decimal('99.99')


def test_bulk_orders():
    # Generate 100 orders
    orders = OrderFactory.create_batch(100)
    assert len(orders) == 100
    assert all(o.status == 'pending' for o in orders)
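
Seeding is what makes generated data reproducible. The sketch below shows the idea using only the standard library's `random.Random`; the function names `synthetic_user` and `synthetic_users` and the record fields are assumptions made for illustration, not part of any library.

```python
import random
import string


def synthetic_user(rng: random.Random) -> dict:
    """Build one fake user record from the supplied random generator."""
    handle = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "email": f"{handle}@example.com",  # reserved domain, never a real inbox
        "age": rng.randint(18, 90),
        "plan": rng.choice(["free", "pro", "enterprise"]),
    }


def synthetic_users(seed: int, count: int) -> list[dict]:
    """Generate `count` users; the same seed always yields the same list."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return [synthetic_user(rng) for _ in range(count)]


# Two runs with the same seed produce identical data.
assert synthetic_users(seed=42, count=5) == synthetic_users(seed=42, count=5)
```

When using the libraries from the example above, the corresponding hooks are `Faker.seed(...)` for Faker instances and `factory.random.reseed_random(...)` for values produced by factory_boy declarations such as `factory.Faker`. Logging the seed when a test fails makes it possible to replay the exact data that triggered the failure.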

When to Use / When Not to Use

Use Fixtures/Factories When:
  1. You need stable, reproducible test data
  2. Each test must run in isolation with its own fresh data
  3. Data setup is complex (relationships, dependencies)
  4. You want tests to be maintainable and readable
  5. Performance is important (generate only what's needed)
Avoid:
  1. Copying production data (privacy/compliance risk)
  2. Creating data by hand inside every test (hard to maintain)
  3. Sharing mutable test data across tests (tests interfere)
  4. Data generation that takes longer than the tests themselves
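
One way to avoid shared mutable test data is to give every test its own throwaway database. The following is a minimal sketch using an in-memory SQLite database from the standard library; the `users` schema and the helper names `fresh_db` and `count_users` are hypothetical, and a real suite would typically wrap this in its test framework's setup/teardown mechanism.

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical schema for illustration only.
SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"


@contextmanager
def fresh_db():
    """Yield a brand-new in-memory database and discard it afterwards."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        yield conn
    finally:
        conn.close()  # closing an in-memory database deletes all of its data


def count_users(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def test_signup_creates_one_user():
    with fresh_db() as conn:
        conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))
        assert count_users(conn) == 1


def test_new_database_starts_empty():
    # Passes in any order: rows inserted by other tests never leak in here.
    with fresh_db() as conn:
        assert count_users(conn) == 0
```

Because each test creates and destroys its own database, the tests can run in any order, or in parallel, without interfering, and no cleanup step can be forgotten.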

Patterns and Pitfalls

Test Data Best Practices and Anti-Patterns

Best practices:
  • Use factories: Generate data dynamically, reduce fixture maintenance.
  • Seed for reproducibility: Same seed = same data; deterministic tests.
  • Isolate test data: Each test gets fresh data; no interference.
  • Don't use production data: Fake it with synthetic data.
  • Version control fixtures: Fixtures in git; changes tracked.
  • Organize by feature: Test data structured logically.
  • DRY test data: Extract common data setup.
  • Document assumptions: If test data is special, document why.

Anti-patterns:
  • Hardcoded data in tests: IDs, emails scattered in code; fragile.
  • Shared mutable database: All tests hit same data; interference.
  • Real production data in tests: GDPR violation, privacy breach.
  • No test data organization: Finding/maintaining fixtures is hard.
  • Brittle fixtures: Test data tightly coupled to implementation.
  • Generate unnecessary data: Creating 1000 users when you need 5.
  • Ignoring privacy: Sensitive data exposed in test runs.

Design Review Checklist

  • Each test gets fresh, isolated data (no shared state)
  • Test data is generated programmatically (factories/synthetic)
  • Fixtures are organized by domain/feature
  • Random data seeded for reproducibility
  • No production data used in tests
  • Sensitive fields masked or synthetic
  • Test data generation is fast (< 100ms per test)
  • Fixtures version-controlled (in git)
  • Test data setup is DRY (no duplication)
  • Data relationships tested (foreign keys, constraints)
  • Documentation explains non-obvious test data
  • Data cleanup runs after tests (no orphaned data)
  • Synthetic data realistic (not fake-looking)
  • Test data doesn't expose implementation details
  • GDPR/privacy compliance verified
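
Some checklist items can be partly automated. As one hedged example, the sketch below flags test records whose email addresses do not use a domain reserved for testing and documentation (RFC 2606). The helper names `is_safe_test_email` and `unsafe_emails` are hypothetical, and the check covers only email domains; it does not prove that a data set is free of production data.

```python
# Domains and TLDs reserved for documentation and testing (RFC 2606);
# mail sent to them cannot reach a real person.
RESERVED_DOMAINS = {"example.com", "example.org", "example.net"}
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}


def is_safe_test_email(email: str) -> bool:
    """Return True if the address uses a reserved, non-routable domain."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return False  # not shaped like an email address at all
    domain = domain.lower()
    # Accept the reserved second-level domains and any of their subdomains.
    if any(domain == d or domain.endswith("." + d) for d in RESERVED_DOMAINS):
        return True
    # Otherwise accept only names under a reserved top-level domain.
    return domain.rsplit(".", 1)[-1] in RESERVED_TLDS


def unsafe_emails(records: list[dict]) -> list[str]:
    """Collect the emails in test records that might belong to real people."""
    return [r["email"] for r in records if not is_safe_test_email(r["email"])]
```

Running a check like `assert unsafe_emails(records) == []` over fixture files in CI is a cheap guard against real addresses slipping into version-controlled test data.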

Self-Check Questions

  • Q: Should I use production data in tests? A: No. It exposes real people's data and can violate privacy regulations such as GDPR. Use synthetic data instead.

  • Q: How do I make test data reproducible? A: Seed the random number generator. Same seed = same data.

  • Q: What's the difference between fixtures and factories? A: Fixtures are static files (fast, hard to maintain). Factories generate dynamically (flexible, easy to maintain).

  • Q: How do I avoid test interference from shared data? A: Use factories to generate fresh data per test. Clean up after each test.

  • Q: Should test data look realistic? A: Yes. Use Faker to generate realistic-looking fake data. Realistic data catches bugs that unrealistic data would hide.

Next Steps

  1. Audit test data — How is it currently managed?
  2. Choose strategy — Fixtures, factories, or synthetic?
  3. Implement factories — Reduce manual fixture maintenance
  4. Seed for reproducibility — Same seed = same tests
  5. Isolate tests — Fresh data per test
  6. Remove production data — Replace with synthetic
  7. Organize fixtures — By domain/feature
  8. Document assumptions — Why is test data special?

References

  1. Factory Boy (Python)
  2. Faker.js (JavaScript)
  3. Faker (Python)
  4. Factory Bot Wiki