Log Retention and Privacy

Manage log lifecycle responsibly: comply with regulations, protect sensitive data, and optimize retention periods.

TL;DR

Logs often contain sensitive data: passwords, API keys, PII, payment info. Never log secrets, and redact or hash PII before writing logs. Comply with regulations: GDPR gives users a right to erasure that extends to logs holding their personal data, which is hard to honor with immutable storage; HIPAA and PCI-DSS add their own retention, access, and encryption requirements. Define retention policies per log type: audit logs 2-7 years (compliance), error logs 1-2 years (incident investigation), INFO logs 90 days (debugging), DEBUG logs 7 days (development). Encrypt logs at rest and in transit, restrict who can read them, and audit log access. Delete or anonymize logs when retention expires. Use log sanitization filters to keep secrets out of logs. Privacy is not optional: it is a legal and ethical requirement.

Learning Objectives

  • Identify what data should never appear in logs
  • Implement log sanitization to prevent secrets
  • Design retention policies based on regulatory requirements
  • Manage log deletion and data subject access requests (DSARs)
  • Protect logs at rest and in transit
  • Audit who accesses logs and when

Motivating Scenario

A developer commits code that interpolates passwords into error messages. Now every production authentication error writes a password to the logs. The logs are retained for 2 years. A breach exposes 18 months of logs, compromising customer accounts. A customer then requests deletion under GDPR. The company searches for the customer's username across 100 TB of logs, but the immutable log archive makes deletion difficult. The incident costs $5M in breach remediation plus GDPR fines. A sanitization filter at log write-time would have prevented this.

Core Concepts

Sensitive Data Categories

Secrets: passwords, API keys, tokens, credentials. Never log intentionally.

PII: names, email, phone, addresses, IDs. Log sparingly, hash or redact.

Financial: credit cards, bank accounts, transaction amounts. Avoid logging.

Health: medical conditions, diagnoses, prescriptions. Strictly controlled.

Behavioral: user clicks, searches, browsing history. May be sensitive.
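
As a rough sketch of how these categories might translate into handling rules for structured log fields: the field names, the `FIELD_RULES` table, and the `sanitize_fields` helper below are illustrative assumptions, not part of any library.

```python
from typing import Any, Dict

# Hypothetical mapping from field name to handling rule:
#   "drop"   removes the field entirely
#   "redact" replaces the value with a placeholder
#   "mask"   keeps only the last 4 characters
FIELD_RULES: Dict[str, str] = {
    "password": "drop",      # secrets: never log
    "api_key": "drop",
    "token": "drop",
    "email": "redact",       # PII: redact (or hash)
    "phone": "redact",
    "card_number": "mask",   # financial: keep last 4 digits only
    "diagnosis": "drop",     # health: strictly controlled
}


def sanitize_fields(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `fields` with the rules above applied."""
    safe: Dict[str, Any] = {}
    for name, value in fields.items():
        rule = FIELD_RULES.get(name.lower())
        if rule == "drop":
            continue
        if rule == "redact":
            safe[name] = "[REDACTED]"
        elif rule == "mask":
            safe[name] = "****" + str(value)[-4:]
        else:
            safe[name] = value  # unlisted fields pass through unchanged
    return safe


event = {"user_id": 42, "email": "ada@example.com",
         "password": "hunter2", "card_number": "4111111111111111"}
print(sanitize_fields(event))
# {'user_id': 42, 'email': '[REDACTED]', 'card_number': '****1111'}
```

This sketch is a denylist: anything not listed passes through. An allowlist (log only fields known to be safe) is stricter and may be the better default for high-risk services.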

Regulatory Requirements

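GDPR (EU): Users can request erasure of their personal data, including data held in logs. You must respond without undue delay, normally within one month (extendable for complex requests). Retention must be minimal and justified.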

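HIPAA (US Health): Strict controls on logs that contain health data. The Security Rule requires audit controls and six-year retention of required documentation, a period many organizations also apply to audit logs. Access must be audited.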

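PCI-DSS (Payment Cards): Full card numbers must not appear in logs unless rendered unreadable, and card verification codes must never be stored. Audit logs must be retained for at least 12 months, with the most recent 3 months immediately available. Encryption required.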

SOC 2: Audit logs must be kept (typically 90 days+) for access auditing. Integrity must be protected.
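
One common way to make audit logs tamper-evident is a hash chain, where each entry stores a hash of the previous entry. The following is a minimal sketch of that idea; the `append_entry` and `verify_chain` helpers and the entry layout are assumptions for illustration, and none of the regulations above prescribe this particular mechanism.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    """Hash the previous entry's hash together with this event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(chain: List[Dict], event: Dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash,
                  "hash": _digest(prev_hash, event)})


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict] = []
append_entry(audit_log, {"actor": "alice", "action": "read", "target": "auth.log"})
append_entry(audit_log, {"actor": "bob", "action": "export", "target": "auth.log"})
print(verify_chain(audit_log))  # True

audit_log[0]["event"]["actor"] = "mallory"  # simulate tampering
print(verify_chain(audit_log))  # False
```

A chain like this only detects tampering if the latest hash is also stored somewhere the attacker cannot rewrite, such as a separate system or write-once storage.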

Retention Policies

Different log categories need different retention:

  • Audit logs: 2-7 years (compliance, legal)
  • Error logs: 1-2 years (debugging, incident investigation)
  • INFO logs: 90 days (operational debugging)
  • DEBUG logs: 7 days (development)
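
The periods above can be turned into a simple expiry check. This is a sketch only: the `RETENTION` table, the record layout, and the `is_expired` and `purge` helpers are hypothetical, and the lower bound of each range is used for audit and error logs.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Retention periods from the list above (lower bound where a range is given).
RETENTION: Dict[str, timedelta] = {
    "AUDIT": timedelta(days=365 * 2),
    "ERROR": timedelta(days=365),
    "INFO": timedelta(days=90),
    "DEBUG": timedelta(days=7),
}


def is_expired(category: str, written_at: datetime, now: datetime) -> bool:
    """True if a record of this category is older than its retention period."""
    return now - written_at > RETENTION[category]


def purge(records: List[Dict], now: datetime) -> List[Dict]:
    """Return only the records that are still within retention."""
    return [r for r in records
            if not is_expired(r["category"], r["written_at"], now)]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"category": "DEBUG", "written_at": now - timedelta(days=10), "msg": "cache miss"},
    {"category": "INFO", "written_at": now - timedelta(days=10), "msg": "job finished"},
    {"category": "ERROR", "written_at": now - timedelta(days=400), "msg": "db timeout"},
]
print([r["msg"] for r in purge(records, now)])  # ['job finished']
```

In practice the same rule is usually enforced by the storage layer (index lifecycle rules, bucket expiry policies) rather than by application code, but the decision logic is the same.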

Practical Example

# ❌ POOR - No sanitization, logs secrets
import logging

import requests

logger = logging.getLogger(__name__)

def authenticate(username, password):
    user = find_user(username)  # find_user: application helper defined elsewhere
    if user and user.password == password:  # plaintext password comparison
        logger.info(f"User {username} authenticated with password {password}")
        return user
    logger.error(f"Auth failed: {username}, password: {password}")
    return None

def call_external_api(api_key, user_id):
    logger.debug(f"Calling API with key: {api_key}")
    response = requests.get('https://api.example.com', headers={'X-API-Key': api_key})
    logger.info(f"API response: {response.json()}")  # response body may contain PII
    return response

# Result: logs contain passwords and API keys. With 2-year retention, a breach
# can expose up to 2 years of credentials.

# ✅ EXCELLENT - Sanitization, retention policy, DSAR support
import hashlib
import json
import logging
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

import requests

class SensitiveDataFilter(logging.Filter):
    """Remove or redact sensitive data from logs."""

    # Patterns for common sensitive data
    PATTERNS = {
        'password': r'password["\']?\s*[:=]\s*["\']?([^"\'\s,;]+)',
        'api_key': r'api[_-]?key["\']?\s*[:=]\s*["\']?([a-zA-Z0-9\-_]+)',
        'token': r'(token|bearer|jwt)["\']?\s*[:=]\s*["\']?([a-zA-Z0-9\-_.]+)',
        'credit_card': r'\b(\d{4}[\s\-]?){3}\d{4}\b',
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'email': r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b',
    }

    def filter(self, record):
        """Sanitize log record."""
        # getMessage() merges record.args into the message before matching
        msg = record.getMessage()

        # Remove passwords
        msg = re.sub(self.PATTERNS['password'], 'password=[REDACTED]', msg, flags=re.I)

        # Remove API keys
        msg = re.sub(self.PATTERNS['api_key'], 'api_key=[REDACTED]', msg, flags=re.I)

        # Remove tokens (keep the keyword, drop the value)
        msg = re.sub(self.PATTERNS['token'], r'\1=[REDACTED]', msg, flags=re.I)

        # Replace credit card with last 4 digits
        def redact_card(match):
            # group(0) is the whole number; group(1) would only be the
            # last repetition of the 4-digit group, not the full card
            digits = re.sub(r'\D', '', match.group(0))
            return f"****{digits[-4:]}"

        msg = re.sub(self.PATTERNS['credit_card'], redact_card, msg)

        # Remove SSN
        msg = re.sub(self.PATTERNS['ssn'], '[REDACTED_SSN]', msg)

        # Hash emails so entries stay queryable. This is pseudonymization,
        # not anonymization: an unsalted hash can be reversed by hashing
        # candidate addresses, so prefer a keyed hash (HMAC) in production.
        def hash_email(match):
            email = match.group(0).lower()
            hashed = hashlib.sha256(email.encode()).hexdigest()[:16]
            return f"user_{hashed}"

        msg = re.sub(self.PATTERNS['email'], hash_email, msg)

        # Store the sanitized text; args are cleared because they are
        # already merged into msg
        record.msg = msg
        record.args = ()
        return True

# Setup logger with sanitization
def setup_logger(name: str):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # Avoid attaching duplicate handlers if setup_logger is called twice
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.addFilter(SensitiveDataFilter())

        formatter = logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)

    return logger

logger = setup_logger('auth_service')

# Retention policy
@dataclass
class RetentionPolicy:
    """Define how long different log types are kept."""
    audit_logs: timedelta = timedelta(days=365 * 2)  # 2 years for compliance
    error_logs: timedelta = timedelta(days=365)      # 1 year for debugging
    info_logs: timedelta = timedelta(days=90)        # 90 days for ops
    debug_logs: timedelta = timedelta(days=7)        # 7 days for dev

def log_with_retention(logger, level, message, retention_days: int):
    """Log with retention metadata."""
    now = datetime.now()  # read the clock once so both timestamps agree
    log_entry = {
        'timestamp': now.isoformat(),
        'level': level,
        'message': message,
        'expires_at': (now + timedelta(days=retention_days)).isoformat(),
        'retention_days': retention_days,
    }

    # Resolve the level name to its numeric value. 'WARN' is accepted as an
    # alias because Logger.warn() is deprecated in favor of Logger.warning().
    level_name = 'WARNING' if level.upper() == 'WARN' else level.upper()
    logger.log(getattr(logging, level_name), log_entry)

def authenticate(username: str, password: str) -> Optional[dict]:
    """Authenticate without logging secrets."""
    user = find_user(username)  # application helper, defined elsewhere

    # hash_password: application helper, defined elsewhere
    if user and user.password_hash == hash_password(password):
        # Log only the outcome (no username, no password), retention 90 days
        log_with_retention(logger, 'INFO', 'Authentication successful', 90)
        return user

    log_with_retention(logger, 'WARNING', 'Authentication failed', 90)
    return None

def call_external_api(api_key: str, user_id: str):
    """Call API without logging the key."""
    # Log that we're calling the API, but not the actual key
    log_with_retention(logger, 'DEBUG', 'Calling external API', 7)

    response = requests.get('https://api.example.com',
                            headers={'X-API-Key': api_key})

    # Log the status code only, not the response body
    log_with_retention(logger, 'INFO', f'API call completed, status: {response.status_code}', 90)
    return response

# DSAR (Data Subject Access Request) support
def export_user_logs(user_id: str) -> str:
    """Export all logs for a user (GDPR compliance)."""
    # Collect all logs mentioning this user
    # query_logs: log-store helper, defined elsewhere
    logs = query_logs(f'user_id={user_id}')
    return json.dumps(logs, indent=2)

def delete_user_logs(user_id: str):
    """Delete all logs for a user (GDPR compliance)."""
    # This is hard with immutable log systems
    # Solution 1: Mark the user's logs as deleted and exclude them from all queries
    # Solution 2: Use a separate log index with retention labels, delete from the index
    # Solution 3: Encrypt logs with a per-user key; deleting the key makes
    #             them unreadable (crypto-shredding)
    mark_logs_for_deletion(user_id)  # log-store helper, defined elsewhere
    logger.info(f'Marked logs for user {user_id} for deletion')
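
Because the filter above replaces emails with hashes, a DSAR lookup has to hash the subject's email the same way before searching. The sketch below illustrates that lookup over JSON-lines entries. It swaps in a keyed HMAC rather than the plain SHA-256 used in the filter, and `PSEUDONYM_KEY`, `pseudonymize`, `find_subject_lines`, and the `user` field are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
from typing import Dict, List

# Assumption: a secret key kept outside the log store (e.g. a secrets manager).
PSEUDONYM_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash."""
    digest = hmac.new(PSEUDONYM_KEY, identifier.lower().encode("utf-8"),
                      hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


def find_subject_lines(log_lines: List[str], email: str) -> List[Dict]:
    """Return parsed entries whose `user` field matches the subject's pseudonym."""
    subject = pseudonymize(email)
    matches = []
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("user") == subject:
            matches.append(entry)
    return matches


lines = [
    json.dumps({"user": pseudonymize("ada@example.com"), "msg": "login ok"}),
    json.dumps({"user": pseudonymize("bob@example.com"), "msg": "login failed"}),
    json.dumps({"user": pseudonymize("Ada@Example.com"), "msg": "password changed"}),
]
for entry in find_subject_lines(lines, "ada@example.com"):
    print(entry["msg"])
# login ok
# password changed
```

Pseudonymized entries are still personal data as long as the key exists, so they remain in scope for export and erasure requests.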

Retention Strategy

By Log Type

Audit Logs (2+ years)

  • Access logs, authentication, authorization changes
  • Compliance requirement
  • Immutable storage recommended

Error Logs (1 year)

  • For incident investigation and RCA
  • Can be deleted after 1 year if no regulatory requirement

Info Logs (90 days)

  • Operational troubleshooting
  • Delete after debugging window

Debug Logs (7 days)

  • Development and active incident investigation
  • Short-lived

Deletion Strategy

Immutable logs: Mark as "deleted" (logical) and exclude from queries

  • Prevents re-analysis of deleted data
  • Maintains an audit trail that the deletion occurred
  • The data still physically exists; where true erasure is required, encrypt logs with a per-user key and destroy the key

Mutable logs: True delete

  • Remove from storage (GDPR erasure)
  • Verify deletion in backups
  • Document the deletion in the audit log
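
The logical-deletion approach for immutable logs could look roughly like the following. This is a sketch under stated assumptions: the `TombstonedLogView` class, the `subject` field, and the in-memory lists stand in for whatever query layer and archive a real system uses.

```python
from datetime import datetime, timezone
from typing import Dict, List, Set


class TombstonedLogView:
    """Query layer over an append-only archive that hides erased subjects."""

    def __init__(self, entries: List[Dict]):
        self._entries = entries              # archive itself is never modified
        self._tombstones: Set[str] = set()   # subjects that are logically deleted
        self.audit_trail: List[Dict] = []    # evidence that deletions occurred

    def erase_subject(self, subject_id: str, requested_by: str) -> None:
        """Hide a subject from all future queries and record the deletion."""
        self._tombstones.add(subject_id)
        self.audit_trail.append({
            "action": "logical_delete",
            "subject": subject_id,
            "requested_by": requested_by,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def query(self) -> List[Dict]:
        """Return only entries whose subject has not been tombstoned."""
        return [e for e in self._entries
                if e.get("subject") not in self._tombstones]


archive = [
    {"subject": "user_a1", "msg": "login ok"},
    {"subject": "user_b2", "msg": "login failed"},
]
view = TombstonedLogView(archive)
view.erase_subject("user_a1", requested_by="privacy-team")
print([e["msg"] for e in view.query()])  # ['login failed']
print(len(archive))                      # 2 (data still physically present)
```

The second print is the important caveat: the entry is hidden, not gone, so every read path must go through the filtered view for this to be effective.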

Design Review Checklist

  • Does your sanitization filter catch passwords, API keys, tokens, credit cards, SSN, PII?
  • Are secrets never intentionally logged?
  • Is PII hashed or redacted before logging?
  • Do you have a documented retention policy?
  • Can you delete a user's logs within one month of an erasure request (GDPR)?
  • Are audit logs protected with integrity checks?
  • Are logs encrypted at rest and in transit?
  • Is log access audited (who read what when)?
  • Do retention policies match regulatory requirements?

Self-Check

  1. Review your logs from the past week. What sensitive data appears? Design filters to prevent it.

  2. Write a retention policy for a healthcare system (HIPAA), an e-commerce site (PCI-DSS), and a SaaS (GDPR).

  3. How would you handle a GDPR deletion request for a user whose logs span 500 GB across immutable storage?

One Takeaway

Sensitive data in logs is a compliance and security liability. Use automatic sanitization filters to keep secrets out of logs, hash or redact PII, and implement retention policies based on regulations. Design for deletion: be able to fulfill GDPR erasure requests within one month. Privacy is not a feature; it is foundational to responsible logging.

References

  1. GDPR - General Data Protection Regulation. (2018). Retrieved from https://gdpr-info.eu/
  2. HIPAA Security Rule. (2023). Retrieved from https://www.hhs.gov/hipaa/for-professionals/security/
  3. PCI DSS - Payment Card Industry Data Security Standard v4.0. (2023). Retrieved from https://www.pcisecuritystandards.org/
  4. OWASP - Logging Cheat Sheet. (2024). Retrieved from https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html