Time-Series Databases

Purpose-built for metrics, events, and temporal data with superior compression

TL;DR

Time-series databases (TSDBs) such as InfluxDB and Prometheus are optimized for append-only, timestamped data. They offer superior compression (often 10x-100x better than general-purpose databases), downsampling for long-term storage, and query engines tuned for time-range access. Use them for metrics, logs, events, and monitoring. The trade-off: they are optimized for time-based queries only and are not suited for complex relational queries.

Learning Objectives

  • Understand time-series data characteristics and access patterns
  • Design retention and downsampling policies
  • Recognize compression strategies for temporal data
  • Choose appropriate cardinality for tags/labels

Motivating Scenario

Infrastructure monitoring: 10K servers × 100 metrics each at 60-second intervals = 1M active series, or roughly 16.7K points/second and about 526 billion points per year. Stored row-per-point in an RDBMS, that is tens of terabytes; a TSDB with time-series compression brings it down to around a terabyte, and downsampling old data cuts it to hundreds of GB.
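
A quick back-of-envelope check of those numbers, as a sketch; the bytes-per-point constants are illustrative assumptions, not measurements:

# Back-of-envelope capacity math for the monitoring scenario above
SERVERS = 10_000
METRICS_PER_SERVER = 100
INTERVAL_S = 60

series = SERVERS * METRICS_PER_SERVER                # 1,000,000 active series
points_per_s = series / INTERVAL_S                   # ~16,667 points/second
points_per_year = points_per_s * 365 * 24 * 3600     # ~5.26e11 points

# Assumed storage costs (illustrative only):
RDBMS_BYTES_PER_POINT = 120   # row plus index overhead, uncompressed
TSDB_BYTES_PER_POINT = 1.5    # Gorilla-style delta-of-delta + XOR encoding

print(f"points/sec:     {points_per_s:,.0f}")
print(f"points/year:    {points_per_year:,.0f}")
print(f"RDBMS estimate: {points_per_year * RDBMS_BYTES_PER_POINT / 1e12:,.1f} TB")
print(f"TSDB estimate:  {points_per_year * TSDB_BYTES_PER_POINT / 1e12:,.1f} TB")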

Core Concepts

Practical Example

# prometheus.yml - Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['localhost:3000']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: my-app
# Python client - Expose metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Counter: only increases
request_count = Counter(
    'requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Gauge: can go up or down
memory_usage = Gauge(
    'memory_usage_bytes',
    'Memory usage in bytes',
    ['process_name']
)

# Histogram: distribution of values
request_latency = Histogram(
    'request_latency_seconds',
    'Request latency',
    ['endpoint'],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0)
)

# Start metrics server on port 8000 (serves /metrics)
start_http_server(8000)

# Use metrics in application
def handle_request(method, endpoint):
    request_count.labels(method=method, endpoint=endpoint, status=200).inc()

    # time() records the duration of the with-block into the histogram
    with request_latency.labels(endpoint=endpoint).time():
        # Your code here
        time.sleep(0.5)

# Gauge example: set to current process RSS
import psutil
memory_usage.labels(process_name='api_server').set(psutil.Process().memory_info().rss)
# PromQL queries
# Raw CPU counter (per core and mode)
node_cpu_seconds_total

# Per-second CPU usage over the last 5 minutes
rate(node_cpu_seconds_total[5m])

# Average request latency by endpoint
sum by (endpoint) (rate(request_latency_seconds_sum[5m]))
  / sum by (endpoint) (rate(request_latency_seconds_count[5m]))

# 95th percentile latency
histogram_quantile(0.95, sum by (le, endpoint) (rate(request_latency_seconds_bucket[5m])))

# Requests per second
rate(requests_total[1m])

# Alert condition: less than ~1 GB of memory available
node_memory_MemAvailable_bytes < 1000000000

When to Use TSDB / When Not to Use

Use TSDB When
  1. Primary access is time-range queries
  2. High write volume (>10K points/sec)
  3. Long-term retention with downsampling
  4. Metrics, logs, events, monitoring
  5. Append-only data (rarely updated)
Use RDBMS When
  1. Complex multi-table queries
  2. Data frequently updated/deleted
  3. Random access by non-time keys
  4. Transactions across entities
  5. Complex reporting and BI

Patterns and Pitfalls

Design Review Checklist

  • Tag cardinality is bounded (<1M unique combinations; see the estimation sketch after this list)
  • High-cardinality fields stored as field values, not tags
  • Retention policies defined for different time ranges
  • Downsampling tasks scheduled for long-term data
  • Write batching configured (>1K points per batch)
  • Storage capacity planned for growth
  • Compression verified and monitored
  • Query patterns optimized (not full table scans)
  • Backup and recovery procedures documented
  • Alerting rules defined for anomalies
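
Series cardinality is roughly the product of the distinct values of each tag, which is why a single unbounded tag (user IDs, session IDs, request IDs) can blow through a million-series budget. A minimal estimation sketch, using hypothetical tag value counts:

# Estimate worst-case series cardinality as the product of distinct tag values
from math import prod

# Hypothetical tag value counts for one measurement
tag_value_counts = {
    "host": 10_000,   # bounded: one per server
    "region": 5,      # bounded: small fixed set
    "endpoint": 50,   # bounded: known routes
}

upper_bound = prod(tag_value_counts.values())
print(f"worst-case series: {upper_bound:,}")  # 2,500,000 -- already over a 1M budget

# Adding an unbounded tag such as user_id multiplies this by the number of
# users; store such values as field values instead of tags.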

Self-Check

  • What's tag cardinality and why does it matter?
  • How would you handle 1M metrics from 100K servers?
  • What's the difference between tags and fields in InfluxDB?
  • How do you design downsampling for 5-year retention?

Time-series databases are purpose-built for append-only temporal data with excellent compression and time-range query performance. Use them for metrics and monitoring; they're not suitable for complex relational or random-access queries.

Advanced TSDB Patterns

Efficient Data Ingestion at Scale

# Batch writes for efficiency (10x-100x better throughput)
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
import time
from datetime import datetime, timezone

class MetricsBuffer:
    def __init__(self, client, bucket, flush_interval=10, batch_size=1000):
        self.client = client
        self.bucket = bucket
        self.buffer = []
        self.flush_interval = flush_interval
        self.batch_size = batch_size
        self.last_flush = time.time()
        # Create the write API once, not on every flush
        self.write_api = client.write_api(write_options=SYNCHRONOUS)

    def add_metric(self, measurement, tags, fields):
        """Add a metric point to the buffer"""
        point = Point(measurement).time(datetime.now(timezone.utc), WritePrecision.NS)
        for k, v in tags.items():
            point.tag(k, v)
        for k, v in fields.items():
            point.field(k, v)

        self.buffer.append(point)

        # Auto-flush on batch size or interval
        if len(self.buffer) >= self.batch_size or \
                time.time() - self.last_flush > self.flush_interval:
            self.flush()

    def flush(self):
        """Write buffered points in one batch"""
        if not self.buffer:
            return

        self.write_api.write(self.bucket, records=self.buffer)
        self.buffer.clear()
        self.last_flush = time.time()

# Usage (connection settings are placeholders)
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
buffer = MetricsBuffer(client, bucket="metrics")
for i in range(100000):
    buffer.add_metric(
        "cpu_usage",
        tags={"host": f"server-{i % 100}", "region": "us-west"},
        fields={"value": 50 + i % 50}
    )
buffer.flush()  # flush any remainder before shutdown
# Auto-flushes every 10 seconds or 1000 points
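
Note that influxdb-client also ships a built-in batching mode (write_api(write_options=WriteOptions(batch_size=1000, flush_interval=10_000))) that performs the same buffering with retries included; the hand-rolled buffer above mainly illustrates the pattern.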

Downsampling for Long-Term Storage

# Downsampling strategy to manage storage growth
retention_and_downsampling:
  hot_data:
    period: "30 days"
    resolution: "1 minute"
    storage: "SSD (fast queries)"
    size: "~100 GB per metric"
    use_case: "Real-time dashboards, recent alerts"

  warm_data:
    period: "1 year"
    resolution: "1 hour"
    storage: "HDD (slower queries ok)"
    size: "~500 MB per metric"
    use_case: "Weekly trends, anomaly detection"
    downsampling_task: |
      from(bucket: "metrics")
        |> range(start: -30d)
        |> filter(fn: (r) => r._measurement == "cpu_usage")
        |> aggregateWindow(every: 1h, fn: mean, timeSrc: "_start")
        |> to(bucket: "metrics_1h")

  cold_data:
    period: "5 years"
    resolution: "1 day"
    storage: "Archive (rarely accessed)"
    size: "~50 MB per metric"
    use_case: "Compliance, long-term trends, investigations"
    downsampling_task: |
      from(bucket: "metrics_1h")
        |> range(start: -1y)
        |> aggregateWindow(every: 1d, fn: mean)
        |> to(bucket: "metrics_1d")
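
One design note: rolling up with mean alone flattens spikes, so a latency or error burst can vanish from the warm and cold tiers. A common refinement is to materialize max (and min or count) alongside mean in the downsampled buckets.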

Multi-Tenancy and Data Isolation

# Tagging strategy for multi-tenant TSDB
from influxdb_client import Point
from influxdb_client.client.write_api import SYNCHRONOUS

class QuotaExceededError(Exception):
    pass

class MultiTenantMetrics:
    def __init__(self, client):
        self.client = client
        self.write_api = client.write_api(write_options=SYNCHRONOUS)

    def record_metric(self, tenant_id, measurement, value, tags=None):
        """Record a metric with tenant isolation via tags"""
        tags = dict(tags or {})
        # Always include tenant_id to ensure isolation
        tags['tenant_id'] = tenant_id

        point = Point(measurement)
        for k, v in tags.items():
            point.tag(k, v)
        point.field("value", value)

        self.write_api.write("metrics", records=[point])

    def query_tenant_metrics(self, tenant_id, measurement, time_range):
        """Query only metrics belonging to a specific tenant"""
        query_api = self.client.query_api()

        flux_query = f'''
        from(bucket: "metrics")
          |> range(start: {time_range})
          |> filter(fn: (r) => r._measurement == "{measurement}")
          |> filter(fn: (r) => r.tenant_id == "{tenant_id}")
        '''

        return query_api.query(query=flux_query)

    def enforce_quota(self, tenant_id, metrics_per_second_limit):
        """Enforce per-tenant ingestion quotas"""
        # _get_ingestion_rate is left as a placeholder: e.g., a sliding-window
        # counter of recent writes per tenant
        current_rate = self._get_ingestion_rate(tenant_id)
        if current_rate > metrics_per_second_limit:
            raise QuotaExceededError(
                f"Tenant {tenant_id} exceeds quota: "
                f"{current_rate} > {metrics_per_second_limit}"
            )

Comparison: Prometheus vs InfluxDB vs TimescaleDB

Prometheus:
- Pull-based (scrapes endpoints)
- Good for: Kubernetes, modern cloud infrastructure
- Strength: Simple deployment, excellent alerting
- Weakness: Limited retention, single-node

InfluxDB:
- Push-based (clients send metrics)
- Good for: Applications emitting custom metrics
- Strength: High write throughput, downsampling built-in
- Weakness: More operational overhead

TimescaleDB:
- PostgreSQL extension
- Good for: Already using PostgreSQL
- Strength: SQL querying, full database features
- Weakness: Less specialized than Prometheus/InfluxDB

Choice depends on:
- Data source: Pull (Prometheus) vs Push (InfluxDB, TimescaleDB)
- Scale: Small (<100K points/sec) vs Large (>1M points/sec)
- Query complexity: Simple ranges (Prometheus) vs Complex (TimescaleDB)
- Operational burden: Minimal (Prometheus) vs Moderate (InfluxDB, TimescaleDB)

Next Steps

  • Explore Performance Optimization for TSDB queries
  • Learn Materialized Views for pre-aggregated metrics
  • Study Retention and Archival strategies
  • Dive into Monitoring and Observability patterns
  • Design Multi-Tenancy for SaaS applications
  • Implement Capacity Planning for metric growth

References

  • Prometheus Official Documentation (prometheus.io)
  • InfluxDB Documentation (influxdata.com)
  • Timescale PostgreSQL Extension (timescale.com)
  • "Designing Data-Intensive Applications" by Martin Kleppmann
  • Google SRE Book, "Monitoring Distributed Systems": The Four Golden Signals (sre.google)