Cost Efficiency and FinOps
Optimize cloud spend without sacrificing performance or reliability.
TL;DR
FinOps is "DevOps for cloud costs": treat infrastructure costs like code — version them, review them, optimize them. Right-size instances (don't run an m5.xlarge when a t3.small works). Use spot instances (~70% discount, but capacity can be reclaimed). Leverage reserved instances (20-40% discount for a 1-3 year commitment). Delete unused resources (forgotten test databases cost real money). Auto-scale down when load decreases. Monitor cost per feature and per user. Trade latency for cost: caching reduces database load; compression reduces egress. Architecture decisions made today determine tomorrow's cloud bill.
Learning Objectives
- Measure cost per service, feature, and user
- Right-size infrastructure (avoid over-provisioning)
- Use reserved instances, spot instances, and savings plans
- Optimize data transfer and storage
- Trade performance for cost (caching, compression, CDN)
- Implement cost alerts and budgets
Motivating Scenario
A startup pays $50k/month for cloud infrastructure. Investigation reveals hundreds of unused databases, oversized instances, no caching (10x more database queries than needed), and wasteful data transfers. Fixing these issues cuts spend from $50k to $8k/month without sacrificing performance.
Core Concepts
Cost Drivers
- Compute: CPU hours (biggest spend for most companies)
- Storage: GB/month (database, backups, object storage)
- Network: Data transfer (especially cross-region)
- Managed services: RDS, Elasticsearch, SQS (premium over DIY)
Cost Optimization Levers
| Lever | Impact | Example |
|---|---|---|
| Right-sizing | 30-40% | Switch from m5.2xlarge to t3.medium |
| Reserved Instances | 20-40% | 1-year RI commitment |
| Spot Instances | 70% | Use for batch jobs, non-critical workloads |
| Caching | 80% | Redis reduces database load 10x |
| Compression | 50% | Gzip reduces egress bandwidth |
| Auto-scaling | 40% | Scale down at night, scale up for peak hours |
| Deletion | 100% | Delete unused databases, old snapshots |
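A quick sanity check on how levers combine: each lever removes a fraction of whatever spend is left, so savings compound multiplicatively rather than adding up. A minimal sketch with hypothetical numbers:

```python
# Hypothetical illustration: levers multiply, they don't add.
# Each lever keeps a fraction of the cost it applies to.
def apply_levers(monthly_cost, reductions):
    """reductions: fractional savings applied in sequence."""
    for r in reductions:
        monthly_cost *= (1 - r)
    return monthly_cost

# Right-sizing (35%), then RIs on the remainder (30%), then
# caching-driven downsizing (20%):
# 50k * 0.65 * 0.70 * 0.80 = 18,200, not 50k * (1 - 0.85)
cost = apply_levers(50_000, [0.35, 0.30, 0.20])
print(round(cost))  # 18200
```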
Practical Example
```python
# FinOps: cost tracking and optimization with the AWS Cost Explorer API
import boto3
from datetime import datetime

class CostOptimizer:
    def __init__(self):
        self.ce = boto3.client('ce')    # Cost Explorer
        self.ec2 = boto3.client('ec2')

    def get_monthly_cost(self):
        """Get the current month's spend, grouped by service."""
        return self.ce.get_cost_and_usage(
            TimePeriod={
                'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
                'End': datetime.now().strftime('%Y-%m-%d'),
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
        )

    def find_oversized_instances(self):
        """Find EC2 instances running at low CPU."""
        instances = self.ec2.describe_instances()
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                # Cross-reference CloudWatch CPUUtilization here to
                # flag instances that could be downsized.
                ...

    def estimate_cost_with_reserved_instances(self):
        """Ask Cost Explorer for RI purchase recommendations."""
        return self.ce.get_reservation_purchase_recommendation(
            Service='Amazon Elastic Compute Cloud - Compute',
            LookbackPeriodInDays='THIRTY_DAYS',
            TermInYears='ONE_YEAR',
        )

    def set_cost_alerts(self, budget=10_000):
        """Alert when monthly spend exceeds the budget."""
        # Create an AWS Budget (or a CloudWatch billing alarm) here.
        ...

# Usage
optimizer = CostOptimizer()
print(f"Monthly cost: {optimizer.get_monthly_cost()}")
print(f"RI savings available: {optimizer.estimate_cost_with_reserved_instances()}")
```
Self-Check
- How do you calculate total cost of ownership (TCO)?
- What's the difference between reserved instances and savings plans?
- How do you identify over-provisioned instances?
- Why is data transfer expensive?
- How would you justify FinOps investment to executives?
Design Review Checklist
- Cost tracking enabled per service/feature/team?
- Unused resources identified and deleted?
- Instances right-sized (CloudWatch metrics used)?
- Reserved instances purchased for baseline?
- Spot instances used for non-critical workloads?
- Data transfer optimized (CDN, compression)?
- Auto-scaling configured (up and down)?
- Cost anomalies monitored and alerted?
- Monthly cost reviews scheduled?
- Chargeback model implemented?
- FinOps practices documented?
- Team trained on cost optimization?
Practical Cost Optimization Examples
Advanced Optimization Strategies
Cost-Per-Feature Analysis
Track costs not just by service, but by business feature:
```python
class CostAnalyzer:
    def cost_per_feature(self, feature_name: str, period: str = 'monthly'):
        """Calculate the cost of operating a specific feature."""
        # Map the feature to the services it uses (e.g. the catalog
        # feature uses RDS, ElastiCache, and Lambda).
        services = self.feature_to_services[feature_name]
        total_cost = 0
        for service in services:
            service_cost = self.cost_explorer.get_cost(service, period)
            # Allocate cost proportionally by usage share.
            allocated_cost = service_cost * self.allocation_factor(feature_name, service)
            total_cost += allocated_cost
        return total_cost

# Example output
# Feature: Product Search
#   Elasticsearch cluster:    $5,000/month
#   Lambda (search queries):  $2,000/month
#   Data transfer (results):    $800/month
#   Total:                    $7,800/month
#   Cost per query:           $0.0012
#
# Decision: if the feature generates $1M revenue, cost is 0.78% of
# revenue (acceptable). If it generates $50k revenue, cost is 15.6%
# of revenue (too high; deprecate?)
```
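The allocation logic above can be made concrete with a small runnable sketch; the service costs and usage shares below are hypothetical:

```python
# Hypothetical data: monthly service costs and each feature's
# fractional usage share of every shared service.
service_costs = {"elasticsearch": 5000, "lambda": 2000, "transfer": 800}
usage_share = {
    "search":   {"elasticsearch": 1.0, "lambda": 1.0, "transfer": 1.0},
    "checkout": {"elasticsearch": 0.0, "lambda": 0.5, "transfer": 0.1},
}

def cost_per_feature(feature):
    shares = usage_share[feature]
    return sum(cost * shares.get(svc, 0)
               for svc, cost in service_costs.items())

print(cost_per_feature("search"))  # 7800.0
```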
Optimization Techniques by Service
RDS (Databases)
- Downsize instance if CPU/memory < 20% utilized (m5.2xlarge → t3.medium saves 70%)
- Use read replicas for read-heavy workloads (vs scaling a single instance)
- Enable automated backups with appropriate retention (old backups = wasted cost)
- Use Aurora Serverless for spiky workloads (scales capacity automatically, billed per capacity-second)
EC2 (Compute)
- Reserved instances for predictable baseline (1-year RI = 40% discount)
- Spot instances for non-critical workloads (batch jobs, testing = 70% discount)
- Scheduled scaling (spin down non-prod at night = 50% savings)
- Right-sizing: use CloudWatch metrics to size instances accurately
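Scheduled scaling can be expressed as two recurring actions against an Auto Scaling group. A sketch using boto3's `put_scheduled_update_group_action` (group and action names are hypothetical; not verified against a live account):

```python
# Sketch: build scheduled-action payloads, then send them with
# boto3's autoscaling client. Recurrence is a UTC cron expression.
def scale_schedule(group):
    return [
        {   # scale non-prod to zero at 20:00 on weekdays
            "AutoScalingGroupName": group,
            "ScheduledActionName": "nightly-scale-down",
            "Recurrence": "0 20 * * 1-5",
            "MinSize": 0, "MaxSize": 0, "DesiredCapacity": 0,
        },
        {   # bring it back at 07:00
            "AutoScalingGroupName": group,
            "ScheduledActionName": "morning-scale-up",
            "Recurrence": "0 7 * * 1-5",
            "MinSize": 2, "MaxSize": 6, "DesiredCapacity": 2,
        },
    ]

# import boto3
# asg = boto3.client("autoscaling")
# for action in scale_schedule("staging-asg"):
#     asg.put_scheduled_update_group_action(**action)
```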
S3 (Storage)
- Use S3 Intelligent-Tiering (auto-moves to cheaper tiers based on access)
- Move old data to Glacier (95% cheaper than standard, 3-5 hour retrieval)
- Enable versioning only where needed
- Cleanup: delete unused buckets, old backups, obsolete data
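A lifecycle policy automates the tiering and cleanup bullets above. A sketch of the configuration payload (bucket name and day thresholds are illustrative):

```python
# Sketch of an S3 lifecycle policy: move objects to Intelligent-Tiering
# after 30 days and to Glacier after 180; delete after two years.
lifecycle = {
    "Rules": [{
        "ID": "tier-then-archive",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "Transitions": [
            {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
            {"Days": 180, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 730},
    }]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-backups", LifecycleConfiguration=lifecycle)
```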
Lambda (Serverless)
- Tune memory: more memory also means more CPU, so 128MB → 256MB can cut duration enough to lower total cost
- Reduce cold starts with provisioned concurrency (for critical functions)
- Use ARM-based Graviton2 (20% cheaper than x86)
- Set appropriate timeouts (don't run forever if failed)
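The memory/duration tradeoff can be checked with Lambda's pricing formula: you pay per request plus per GB-second. A sketch with approximate, region-dependent rates:

```python
# Illustrative Lambda cost model (rates approximate and region-dependent):
# cost = invocations * per-request price + GB-seconds * per-GB-second price
PER_REQUEST = 0.20 / 1_000_000   # ~$0.20 per million requests
PER_GB_SECOND = 0.0000166667     # approximate x86 rate

def lambda_cost(invocations, memory_mb, avg_duration_s):
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_s
    return invocations * PER_REQUEST + gb_seconds * PER_GB_SECOND

# Doubling memory can cut cost if duration drops enough:
slow = lambda_cost(10_000_000, 128, 1.0)   # 128 MB, 1.0 s each
fast = lambda_cost(10_000_000, 256, 0.4)   # 256 MB, 0.4 s each
print(f"${slow:.2f} vs ${fast:.2f} per 10M invocations")
```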
Data Transfer
- Minimize cross-region transfers (expensive: roughly $0.02/GB, while same-AZ traffic over private IPs is free)
- Use VPC endpoints to avoid NAT gateway costs
- Batch API requests to reduce calls
- Compress responses (gzip = 50-80% smaller)
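Compression savings are easy to verify locally; ratios depend heavily on how repetitive the payload is:

```python
import gzip
import json

# A repetitive JSON payload compresses very well; real-world API
# responses usually land in the 50-80% range.
payload = json.dumps([{"sku": i, "name": "widget", "price": 9.99}
                      for i in range(1000)]).encode()
compressed = gzip.compress(payload)
print(f"{len(payload)} -> {len(compressed)} bytes "
      f"({100 * (1 - len(compressed) / len(payload)):.0f}% smaller)")
```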
FinOps Tools and Dashboards
```python
class FinOpsDashboard:
    def daily_report(self):
        """Generate the daily spend report."""
        mtd = self.cost_explorer.month_to_date()
        return {
            'today': self.cost_explorer.today(),
            'month_to_date': mtd,
            'forecast_monthly': self.cost_explorer.forecast_month(),
            'budget': 50_000,
            'variance': mtd - (30 * self.avg_daily()),
            'top_services': self.top_services_by_cost(n=5),
            'anomalies': self.detect_anomalies(),
            'opportunities': self.cost_optimization_opportunities(),
        }

    def detect_anomalies(self):
        """Detect unusual spend patterns."""
        today = self.cost_explorer.today()
        baseline = self.baseline_daily_cost()
        if today > baseline * 1.2:  # 20% increase
            return f"Spend {today:.0f} exceeds baseline {baseline:.0f} by more than 20%"
        return None

    def cost_optimization_opportunities(self):
        """Identify quick wins."""
        opportunities = []
        # Underutilized instances: candidates for downsizing
        for instance in self.ec2.describe_instances():
            if instance.cpu_utilization < 10:
                opportunities.append({
                    'type': 'Downsize',
                    'resource': instance.id,
                    'savings': instance.monthly_cost * 0.7,
                })
        # Always-on databases with no reserved instance coverage
        for rds in self.rds.describe_instances():
            if rds.uptime_hours > 720:  # running ~24/7 for a month
                opportunities.append({
                    'type': 'Buy Reserved Instance',
                    'resource': rds.id,
                    'savings': rds.monthly_cost * 0.4,
                })
        return opportunities
```
Cost Governance
Budgets and Alerts
```yaml
AWS Budget:
  Monthly Limit: $50,000
  Alerts:
    - At 50%: Notify team leads
    - At 80%: Notify management
    - At 100%: Automatic cost controls (stop new resources)
```
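The budget above can be created programmatically. A sketch of the AWS Budgets payload (account ID and email addresses are placeholders; not verified against a live account):

```python
# Sketch: build the AWS Budgets payload, then send it with boto3.
budget = {
    "BudgetName": "monthly-cloud-spend",
    "BudgetLimit": {"Amount": "50000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
}
alerts = [
    {"Notification": {"NotificationType": "ACTUAL",
                      "ComparisonOperator": "GREATER_THAN",
                      "Threshold": pct, "ThresholdType": "PERCENTAGE"},
     "Subscribers": [{"SubscriptionType": "EMAIL", "Address": addr}]}
    for pct, addr in [(50, "team-leads@example.com"),
                      (80, "management@example.com")]
]

# import boto3
# boto3.client("budgets").create_budget(
#     AccountId="123456789012", Budget=budget,
#     NotificationsWithSubscribers=alerts)
```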
```yaml
Cost Allocation Tags:
  - Environment: production, staging, development
  - Team: platform, product, data
  - Cost Center: engineering, marketing
  - Project: catalog, checkout, analytics
```
Chargeback model — monthly billing to teams based on usage:
```yaml
Product Team:
  - RDS (catalog DB): $3,000 (allocated by query count)
  - Lambda (product APIs): $1,500 (allocated by invocations)
  - S3 (product images): $500 (allocated by storage GB)
  - Total: $5,000/month
# Charging teams for what they use incentivizes them to optimize
```
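The chargeback math is just proportional allocation of each shared bill. A runnable sketch with hypothetical query counts:

```python
# Split a shared service's bill across teams in proportion to a
# usage metric (query counts here; all numbers hypothetical).
def chargeback(service_cost, usage_by_team):
    total = sum(usage_by_team.values())
    return {team: service_cost * used / total
            for team, used in usage_by_team.items()}

rds_bill = chargeback(6000, {"product": 500_000,
                             "platform": 300_000,
                             "data": 200_000})
print(rds_bill)  # {'product': 3000.0, 'platform': 1800.0, 'data': 1200.0}
```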
Advanced Cost Case Studies
Case Study 1: Database Optimization
Before (Unoptimized):
- 3x m5.2xlarge RDS instances: $3,000/month
- CPU/memory utilization: 5-10%
- Unoptimized queries: 10,000 database queries/second to serve only 100 requests/second
Solution:
- Optimize slow queries (N+1 problems): Cut queries from 10,000 to 1,000 per second
- Downsize to 3x t3.medium: $300/month
- Add Redis cache: $200/month
- Query optimization engineering time: 40 hours
After:
- Monthly cost: $500 (vs $3,000)
- Savings: $2,500/month
- Payback period: ~1.2 months of savings covers the engineering time
Result: Faster queries, cheaper system
Case Study 2: Multi-Region to Single-Region + CDN
Before (Multi-Region Redundancy):
- Primary region (US): $20,000/month
- Backup region (EU): $20,000/month
- Data replication: $2,000/month
- Total: $42,000/month
Cost: $0.42 per user per month (100k users)
Solution:
- Single region (US): $20,000/month
- CloudFront CDN (global): $3,000/month
- Database replication removed: -$2,000/month
- Backup/disaster recovery: $1,000/month
- Total: $22,000/month
Result:
- Savings: $20,000/month (47% reduction)
- Latency: 200ms → 50ms (EU users)
- Tradeoff: Single region failure = full downtime (mitigated by backups)
Next Steps
- Get visibility — Enable AWS Cost Explorer, GCP Cost Management, or Azure Cost Management
- Identify quick wins — Unused resources, oversized instances, wasted data transfer (typically save 20-30%)
- Optimize baseline — Right-sizing, reserved instances, spot instances (save 40-50%)
- Optimize architecturally — Caching, compression, CDN (save 50-80% on certain services)
- Monitor continuously — Alerts for cost anomalies, monthly reviews, trend analysis
- Build culture — Developers accountable for cost, celebrate savings, educate on cost tradeoffs
- Measure ROI — Track savings from each optimization effort (40-hour project should save > $8k to be worthwhile)