Retention & Archival

Balance legal requirements, business needs, and storage costs

TL;DR

Define retention periods from legal requirements (financial 7yr, health 6yr) and business need, whichever is longer. Use tiered storage, priced per GB per month: hot (first 30d, $0.023/GB), warm (up to 1yr, $0.013/GB), cold (archive, $0.004/GB). Lifecycle policies automatically move aging data down the tiers (for example, S3 Standard → Standard-IA after 30d → Glacier after 90d → Deep Archive after 1yr). Delete data once retention expires and keep an audit trail that proves the deletion; WORM storage can keep retained data and deletion records tamper-proof. Archival cuts storage cost by 80%+ but adds retrieval latency and retrieval fees.

Learning Objectives

By the end of this article, you will understand:

  • Legal/regulatory retention requirements (SOX, HIPAA, GDPR)
  • Tiered storage strategy and cost-benefit analysis
  • Lifecycle policies for automatic transitions
  • Archival formats and compression
  • Deletion verification and compliance
  • Legal hold mechanisms for litigation data

Motivating Scenario

Your startup stores 50 GB/month of transaction data. After one year that is 600 GB, which at $0.023/GB-month (hot storage) costs about $13.80/month, and the bill keeps growing with every month of data you add. But you actively query only the most recent 30 days; older data sits idle. Moving data older than 90 days to cold storage ($0.004/GB-month) cuts the storage price of that data by about 83%. Finance, however, requires 7-year retention for tax purposes. Design a policy: hot for 30 days → warm until 1 year → cold for the remaining 6 years → delete after 7 years. Automate it with lifecycle rules, and prove deletion for compliance.

Core Concepts

Data Lifecycle: Hot to Cold to Deleted

Retention Period Determination

Key regulations:

Regulation              Domain                            Retention
SOX (Sarbanes-Oxley)    Financial records                 7 years
HIPAA                   Medical records                   6 years
GDPR                    Personal data (EU residents)      As long as necessary (varies)
CCPA                    Customer data (CA residents)      Customer request or 12 months
PCI-DSS                 Payment records                   1 year

Business needs may extend: customer churn analysis (2yr), fraud investigation (3yr), ML model training (all-time).

Minimum retention = MAX(legal requirement, business need)
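
As a minimal sketch of the MAX rule above, the snippet below looks up a legal and a business retention period per data type and returns the longer one. The legal values mirror the table; the data-type names and the business values are hypothetical and should be replaced with your own inventory.

```python
# Legal retention in years, mirroring the regulation table above.
LEGAL_YEARS = {"financial": 7, "medical": 6, "payment": 1}

# Hypothetical business needs in years (e.g. fraud investigation, churn analysis).
BUSINESS_YEARS = {"financial": 3, "payment": 2}


def minimum_retention_years(data_type: str) -> int:
    """Return MAX(legal requirement, business need); a missing entry counts as 0."""
    legal = LEGAL_YEARS.get(data_type, 0)
    business = BUSINESS_YEARS.get(data_type, 0)
    return max(legal, business)


for data_type in ("financial", "medical", "payment"):
    print(data_type, minimum_retention_years(data_type))
```

Here the payment data is kept for 2 years because the assumed business need exceeds the 1-year legal minimum, while financial data stays at the 7-year legal floor.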

Storage Tiers & Costs

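Tiered storage trades retrieval speed and retrieval fees for a lower storage price. The figures below are illustrative S3-style prices per GB per month; actual prices vary by provider and region:

Standard (Hot):      $0.023/GB-month   - Real-time queries (millisecond access)
Standard-IA (Warm):  $0.013/GB-month   - Infrequent access (millisecond access, per-GB retrieval fee)
Glacier (Cold):      $0.004/GB-month   - Archival (retrieval in minutes to hours)
Deep Archive:        $0.0016/GB-month  - Compliance (retrieval in hours, up to about 2 days)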

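Cost calculation: 50 GB/month → 600 GB stored after one year

  • All hot: 600 × $0.023 = $13.80/month
  • Tiered (latest 30 days hot, rest warm): (50 × $0.023) + (550 × $0.013) = $1.15 + $7.15 = $8.30/month ($99.60/year at that volume)
  • Savings: about 40%. Moving the 550 GB to Glacier ($0.004/GB) instead brings the bill to $3.35/month, about 76% below all-hot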

Retrieval costs also matter:

  • Glacier retrieval: $0.03/GB
  • Restoring 100GB = $3 (one-time)
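
The arithmetic above can be checked with a small calculator. This is a sketch only: the prices are the illustrative per-GB monthly figures from the tier list, the tier names (`hot`, `warm`, `cold`) are our own labels, and the 600 GB volume assumes one year of accumulated data at 50 GB/month.

```python
# Illustrative per-GB monthly storage prices from the tier list above.
PRICE_PER_GB_MONTH = {"hot": 0.023, "warm": 0.013, "cold": 0.004}

# One-time Glacier retrieval price per GB, as quoted above.
GLACIER_RETRIEVAL_PER_GB = 0.03


def monthly_storage_cost(gb_by_tier: dict) -> float:
    """Sum price x volume over every tier that holds data."""
    return sum(PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())


def retrieval_cost(gb: float, price_per_gb: float = GLACIER_RETRIEVAL_PER_GB) -> float:
    """One-time cost of restoring `gb` gigabytes from the archive tier."""
    return gb * price_per_gb


all_hot = monthly_storage_cost({"hot": 600})
tiered = monthly_storage_cost({"hot": 50, "warm": 550})

print(f"all hot: ${all_hot:.2f}/month")
print(f"tiered:  ${tiered:.2f}/month")
print(f"savings: {1 - tiered / all_hot:.0%}")
print(f"restore 100 GB: ${retrieval_cost(100):.2f}")
```

Changing the tier split or the prices in the dictionary shows how sensitive the bill is to each assumption, which is more useful than any single headline percentage.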

Lifecycle Policies

Lifecycle policies automate transitions:

Rule 1: Hot to Warm after 30 days
Rule 2: Warm to Cold after 90 days
Rule 3: Cold to Archive after 1 year
Rule 4: Delete after 7 years
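
To make the four rules concrete, here is a minimal sketch that maps an object's age to the tier it should occupy. It is a simplified model for reasoning about thresholds, not how any cloud provider evaluates lifecycle rules internally; the tier names and the `tier_for` helper are hypothetical.

```python
from datetime import date, timedelta

# (minimum age in days, tier), checked oldest first; thresholds follow Rules 1-4.
TIER_THRESHOLDS = [
    (7 * 365, "deleted"),
    (365, "archive"),
    (90, "cold"),
    (30, "warm"),
]


def tier_for(created: date, today: date) -> str:
    """Return the tier an object created on `created` belongs to on `today`."""
    age_days = (today - created).days
    for min_age, tier in TIER_THRESHOLDS:
        if age_days >= min_age:
            return tier
    return "hot"  # younger than 30 days


today = date(2025, 1, 1)
for age in (10, 45, 200, 800, 3000):
    print(age, "days ->", tier_for(today - timedelta(days=age), today))
```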

Deletion Verification

Prove data was actually deleted (not just marked):

  1. Audit trail: Log each deletion with timestamp, user, reason
  2. Witness verification: Third-party confirms deletion
  3. Certificate of destruction: Official document from storage provider
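  4. Cryptographic proof: Record a hash (e.g., SHA-256) of the data before deletion and store it in the audit trail, so the record identifies exactly what was destroyed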
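
Items 1 and 4 can be combined into a single audit entry written just before the data is removed. The sketch below is one possible shape for such an entry; the field names, the `deletion_record` helper, and the sample key and user are hypothetical. A hash identifies what was destroyed but does not by itself prove the bytes are gone, so it complements the provider certificate rather than replacing it.

```python
import hashlib
import json
from datetime import datetime, timezone


def deletion_record(object_key: str, content: bytes, user: str, reason: str) -> dict:
    """Build an audit entry; call this BEFORE the object is deleted."""
    return {
        "object_key": object_key,
        "sha256": hashlib.sha256(content).hexdigest(),  # fingerprint of the deleted data
        "size_bytes": len(content),
        "deleted_by": user,
        "reason": reason,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    }


record = deletion_record(
    "logs/2017/01/app.log",      # hypothetical object key
    b"example payload",          # stand-in for the object's bytes
    "retention-bot",             # hypothetical user
    "7-year retention expired",
)
print(json.dumps(record, indent=2))
```

Storing these entries in append-only or WORM storage keeps the audit trail itself from being altered after the fact.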

Practical Example

{
  "Rules": [
    {
      "Id": "archive-old-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}

Save the configuration as lifecycle.json, then deploy:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-data-bucket \
  --lifecycle-configuration file://lifecycle.json
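
Before deploying, it can help to sanity-check the configuration locally. The sketch below applies two checks of our own (transition days strictly increase, and expiration comes after the last transition); it is not the provider's full validation, and `check_lifecycle` is a hypothetical helper. The configuration is inlined so the example runs on its own; in practice you would load it from lifecycle.json.

```python
import json


def check_lifecycle(config: dict) -> list:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    for rule in config.get("Rules", []):
        rule_id = rule.get("Id", "<no id>")
        days = [t["Days"] for t in rule.get("Transitions", [])]
        if days != sorted(days) or len(set(days)) != len(days):
            problems.append(f"{rule_id}: transition days must strictly increase")
        expiration = rule.get("Expiration", {}).get("Days")
        if expiration is not None and days and expiration <= max(days):
            problems.append(f"{rule_id}: expiration must come after the last transition")
    return problems


# Same rule as the example configuration above, inlined for a runnable sketch.
EXAMPLE = json.loads("""
{"Rules": [{"Id": "archive-old-logs", "Status": "Enabled",
  "Filter": {"Prefix": "logs/"},
  "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"},
                  {"Days": 90, "StorageClass": "GLACIER"},
                  {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}],
  "Expiration": {"Days": 2555}}]}
""")

print(check_lifecycle(EXAMPLE))
```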

When to Use / When Not to Use

Keep Hot (Standard)

  1. Recent transactions (< 30 days)
  2. Active user sessions
  3. Real-time analytics data
  4. Current machine learning models
  5. High query frequency (daily+)

Archive (Cold)

  1. Historical data (> 1 year old)
  2. Backup data
  3. Compliance archives
  4. Infrequent queries (< weekly)
  5. Must retain but unlikely to access

Patterns & Pitfalls

Litigation pending? When a court issues a legal hold, do not delete the affected data even after its retention period expires. Mark the data with a hold status, exclude it from automatic deletion, and notify the retention manager. When the hold is lifted, resume normal deletion.
Production data is deleted but a backup still contains it. Backup retention often exceeds primary data retention. Coordinate the two: if you delete a user from production, delete (or anonymize) that user in backups too.
Archive retrieval costs add up. Restoring 100 GB at $0.03/GB costs $3; doing that weekly costs $156/year. For data accessed monthly or more often, consider Glacier or a warmer tier rather than Deep Archive.
Different data types have different requirements, for example logs (30d), transactions (1yr), medical (6yr). A common mistake is applying a single policy to everything. Implement granular policies per data type and purpose.
An auditor asks: where is the proof that you deleted the 2023 data? Provide a certificate from the cloud provider, the recorded cryptographic hash, and audit logs. Keep deletion records for 5+ years for verification.
Switching clouds? Archived data must be restored and re-uploaded (for example, AWS Glacier → Azure Blob), and provider-specific formats may need conversion. Plan ahead: budget for the retrieval cost of migration, accept vendor lock-in, or store data in an open format (Parquet, tar.gz) from the start.
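
The legal-hold pitfall above comes down to one extra condition in whatever job performs deletion. A minimal sketch, assuming each record carries a retention end date and a hold flag (the `Record` type and its field names are hypothetical):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    key: str
    retain_until: date        # last day the data must be kept
    legal_hold: bool = False  # set while litigation is pending


def is_deletable(record: Record, today: date) -> bool:
    """Deletable only when retention has expired AND no legal hold is active."""
    return today >= record.retain_until and not record.legal_hold


today = date(2025, 1, 1)
expired = Record("tx/2017.parquet", retain_until=date(2024, 12, 31))
held = Record("tx/2016.parquet", retain_until=date(2023, 12, 31), legal_hold=True)
active = Record("tx/2024.parquet", retain_until=date(2031, 12, 31))

for r in (expired, held, active):
    print(r.key, is_deletable(r, today))
```

When a hold is lifted, clearing the flag is enough for the record to be picked up by the next deletion run, which matches the "resume normal deletion" step described above.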

Design Review Checklist

  • Identified all data types and their regulatory retention requirements
  • Documented business reasons for extended retention (if applicable)
  • Determined minimum retention period: MAX(legal, business need)
  • Designed tiered storage strategy: hot → warm → cold → archive → delete
  • Set lifecycle policy thresholds: when does data transition to each tier?
  • Planned deletion: how will you verify data was actually deleted (not just hidden)?
  • Considered legal holds: who can place and release holds?
  • Coordinated retention with backup strategy
  • Calculated cost savings: all-hot vs tiered approach
  • Tested restore from cold/archive: verify retrieval works

Self-Check

  • What's your data's regulatory retention requirement? (Hint: check SOX, HIPAA, GDPR, CCPA)
  • How much would tiering save you? (Hint: calculate all-hot vs tiered cost for your volume)
  • When should you delete data? (Hint: after retention period expires, prove deletion)
  • What's a legal hold, and why is it important? (Hint: litigation, blocks normal deletion)
  • How do you verify data was actually deleted? (Hint: hash, provider certificate, audit logs)

Next Steps

  • Audit retention: discover what data you have, how long you keep it
  • Calculate costs: compare all-hot vs tiered storage for your volume
  • Design policy: set transitions (to warm at 30d, to cold at 90d, to archive at 1yr, delete at 7yr)
  • Implement lifecycle: deploy policies on S3, GCS, or other cloud storage
  • Test deletion: verify data actually deleted, generate certificate