
Cost Efficiency and FinOps

Optimize cloud spend without sacrificing performance or reliability.

TL;DR

FinOps is "DevOps for cloud costs." Treat infrastructure costs like code: version it, review it, optimize it. Right-size instances (don't run m5.xlarge when t3.small works). Use spot instances (70% discount, less guaranteed). Leverage reserved instances (20-40% discount for 1-3 year commitment). Delete unused resources (forgotten test databases cost real money). Auto-scale down when load decreases. Monitor cost per feature/user. Trade latency for cost: caching reduces database load; compression reduces egress. Architecture decisions made today determine tomorrow's cloud bill.

Learning Objectives

  • Measure cost per service, feature, and user
  • Right-size infrastructure (avoid over-provisioning)
  • Use reserved instances, spot instances, and savings plans
  • Optimize data transfer and storage
  • Trade performance for cost (caching, compression, CDN)
  • Implement cost alerts and budgets

Motivating Scenario

A startup pays $50k/month for cloud infrastructure. Investigation reveals: hundreds of unused databases, oversized instances, no caching (10x more database queries than needed), wasteful data transfers. Fixing these issues: $50k → $8k/month without sacrificing performance.

Core Concepts

Cost Drivers

  1. Compute: CPU hours (biggest spend for most companies)
  2. Storage: GB/month (database, backups, object storage)
  3. Network: Data transfer (especially cross-region)
  4. Managed services: RDS, Elasticsearch, SQS (premium over DIY)

Cost Optimization Levers

Lever                Impact   Example
Right-sizing         30-40%   Switch from m5.2xlarge to t3.medium
Reserved Instances   20-40%   1-year RI commitment
Spot Instances       70%      Use for batch jobs, non-critical workloads
Caching              80%      Redis reduces database load 10x
Compression          50%      Gzip reduces egress bandwidth
Auto-scaling         40%      Scale down at night, scale up for peak hours
Deletion             100%     Delete unused databases, old snapshots
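Note that these levers compound multiplicatively rather than additively: applying a 30% lever and then a 40% lever leaves 0.7 × 0.6 = 42% of the bill, not 30%. A quick sketch with hypothetical numbers:

```python
def remaining_cost(base_monthly, reductions):
    """Apply a list of fractional reductions multiplicatively.

    A 30% right-sizing cut followed by a 40% RI discount leaves
    0.7 * 0.6 = 42% of the original bill, not 30%.
    """
    cost = base_monthly
    for r in reductions:
        cost *= (1 - r)
    return cost

# Hypothetical $50k/month bill with three levers applied in sequence:
# right-sizing (30%), reserved instances (40%), caching (20%)
base = 50_000
cost = remaining_cost(base, [0.30, 0.40, 0.20])
print(f"${cost:,.0f}/month remaining")       # $16,800/month remaining
print(f"{(1 - cost / base):.0%} total reduction")  # 66% total reduction
```

This is why estimated savings from a list of levers should never simply be summed; past 100% the sum is obviously wrong, but it overstates savings well before that.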

Practical Example

# FinOps: Cost tracking and optimization
import boto3
from datetime import datetime, timedelta

class CostOptimizer:
    def __init__(self):
        self.ce = boto3.client('ce')          # Cost Explorer
        self.ec2 = boto3.client('ec2')
        self.cloudwatch = boto3.client('cloudwatch')

    def get_monthly_cost(self):
        """Get the current month's cloud spend, grouped by service"""
        response = self.ce.get_cost_and_usage(
            TimePeriod={
                'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
                'End': datetime.now().strftime('%Y-%m-%d'),
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
        )
        return response

    def find_oversized_instances(self, cpu_threshold=10.0):
        """Find running EC2 instances averaging below cpu_threshold% CPU"""
        candidates = []
        instances = self.ec2.describe_instances(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                stats = self.cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=[{'Name': 'InstanceId',
                                 'Value': instance['InstanceId']}],
                    StartTime=datetime.now() - timedelta(days=14),
                    EndTime=datetime.now(),
                    Period=86400,          # daily averages
                    Statistics=['Average'],
                )
                datapoints = stats['Datapoints']
                if datapoints:
                    avg_cpu = sum(d['Average'] for d in datapoints) / len(datapoints)
                    if avg_cpu < cpu_threshold:
                        candidates.append((instance['InstanceId'], avg_cpu))
        return candidates

    def estimate_cost_with_reserved_instances(self):
        """Get RI purchase recommendations from Cost Explorer"""
        response = self.ce.get_reservation_purchase_recommendation(
            Service='Amazon Elastic Compute Cloud - Compute',
            LookbackPeriodInDays='THIRTY_DAYS',
            TermInYears='ONE_YEAR'
        )
        return response

    def set_cost_alerts(self, budget=10000):
        """Alert when monthly spend exceeds the budget"""
        # Create a budget with notification thresholds via the
        # AWS Budgets API (boto3.client('budgets').create_budget)
        ...

# Usage
optimizer = CostOptimizer()
print(f"Monthly cost: {optimizer.get_monthly_cost()}")
print(f"RI savings available: {optimizer.estimate_cost_with_reserved_instances()}")

Self-Check

  • How do you calculate total cost of ownership (TCO)?
  • What's the difference between reserved instances and savings plans?
  • How do you identify over-provisioned instances?
  • Why is data transfer expensive?
  • How would you justify FinOps investment to executives?

Design Review Checklist

  • Cost tracking enabled per service/feature/team?
  • Unused resources identified and deleted?
  • Instances right-sized (CloudWatch metrics used)?
  • Reserved instances purchased for baseline?
  • Spot instances used for non-critical workloads?
  • Data transfer optimized (CDN, compression)?
  • Auto-scaling configured (up and down)?
  • Cost anomalies monitored and alerted?
  • Monthly cost reviews scheduled?
  • Chargeback model implemented?
  • FinOps practices documented?
  • Team trained on cost optimization?

Practical Cost Optimization Examples

Advanced Optimization Strategies

Cost-Per-Feature Analysis

Track costs not just by service, but by business feature:

class CostAnalyzer:
    def cost_per_feature(self, feature_name: str, period: str = 'monthly'):
        """Calculate the cost of operating a specific feature"""
        # Map feature to services (e.g. the Catalog feature uses RDS,
        # ElastiCache, Lambda)
        services = self.feature_to_services[feature_name]

        total_cost = 0
        for service in services:
            service_cost = self.cost_explorer.get_cost(service, period)
            # Allocate cost proportionally (by usage %)
            allocated_cost = service_cost * self.allocation_factor(feature_name, service)
            total_cost += allocated_cost

        return total_cost

# Example output
# Feature: Product Search
# OpenSearch (search cluster): $5,000/month
# Lambda (search queries): $2,000/month
# Data transfer (results): $800/month
# Total: $7,800/month
# Cost per query: $0.0012
#
# Decision: If feature generates $1M revenue, cost is 0.78% of revenue (acceptable)
# If feature generates $50k revenue, cost is 15.6% of revenue (too high, deprecate?)
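The `CostAnalyzer` sketch above leaves the service costs and allocation factors abstract. A self-contained version of the same calculation, using made-up numbers chosen to match the Product Search example output:

```python
# Hypothetical total monthly service costs (shared across all features)
SERVICE_COSTS = {"search_cluster": 8_000, "lambda": 5_000, "data_transfer": 2_000}

# Fraction of each service's usage attributable to the feature,
# e.g. measured by query count, invocations, or bytes transferred
ALLOCATION = {
    "product_search": {"search_cluster": 0.625, "lambda": 0.40, "data_transfer": 0.40},
}

def cost_per_feature(feature):
    """Allocate shared service costs to a business feature."""
    return sum(
        SERVICE_COSTS[svc] * share
        for svc, share in ALLOCATION[feature].items()
    )

total = cost_per_feature("product_search")
queries_per_month = 6_500_000  # hypothetical query volume
print(f"Product Search: ${total:,.0f}/month, ${total / queries_per_month:.4f}/query")
# Product Search: $7,800/month, $0.0012/query
```

The hard part in practice is measuring the allocation factors; proxy metrics (query counts, invocations, storage GB) are usually close enough for the deprecate-or-keep decision shown above.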

Optimization Techniques by Service

RDS (Databases)

  • Downsize instance if CPU/memory < 20% utilized (m5.2xlarge → t3.medium saves 70%)
  • Use read replicas for read-heavy workloads (vs scaling single instance)
  • Enable automated backups with appropriate retention (old backups = wasted cost)
  • Consider Aurora Serverless (MySQL/PostgreSQL compatible; scales capacity automatically, billed per second)

EC2 (Compute)

  • Reserved instances for predictable baseline (1-year RI = 40% discount)
  • Spot instances for non-critical workloads (batch jobs, testing = 70% discount)
  • Scheduled scaling (spin down non-prod at night = 50% savings)
  • Right-sizing: use CloudWatch metrics to size instances accurately
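The scheduled-scaling savings are easy to estimate up front. A sketch with hypothetical instance counts and an assumed hourly rate:

```python
def scheduled_scaling_savings(hourly_cost, instances, on_hours_per_week):
    """Monthly savings from running an environment only during work
    hours instead of 24/7 (using 730 hours as an average month)."""
    hours_per_month = 730
    always_on = hourly_cost * instances * hours_per_month
    scheduled = hourly_cost * instances * on_hours_per_week * 52 / 12
    return always_on - scheduled

# Hypothetical: 10 staging instances at $0.20/hour,
# running 12 hours a day, 5 days a week (60 hours/week)
savings = scheduled_scaling_savings(0.20, 10, 60)
print(f"${savings:,.0f}/month saved")  # $940/month saved
```

On AWS this schedule is typically implemented with scheduled actions on an Auto Scaling group; the arithmetic above tells you whether setting that up is worth the effort.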

S3 (Storage)

  • Use S3 Intelligent-Tiering (auto-moves to cheaper tiers based on access)
  • Move cold data to Glacier tiers (80-95% cheaper than Standard; retrieval takes minutes to hours depending on tier)
  • Enable versioning only where needed
  • Cleanup: delete unused buckets, old backups, obsolete data
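The tiering math, using illustrative per-GB monthly prices (actual prices vary by region and tier; check current AWS pricing):

```python
# Illustrative per-GB monthly prices, not authoritative
PRICE_PER_GB = {"standard": 0.023, "infrequent_access": 0.0125, "glacier": 0.0036}

def storage_cost(gb_by_tier):
    """Monthly storage cost given GB stored in each tier."""
    return sum(PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())

# 100 TB all in Standard vs. 90% of it moved to a Glacier tier
all_standard = storage_cost({"standard": 100_000})
tiered = storage_cost({"standard": 10_000, "glacier": 90_000})
print(f"${all_standard:,.0f} vs ${tiered:,.0f}/month")  # $2,300 vs $554/month
```

Intelligent-Tiering automates this movement for a small per-object monitoring fee, which is usually worth it when access patterns are unknown.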

Lambda (Serverless)

  • Tune memory: more memory also means proportionally more CPU, so 128MB → 256MB can cut duration enough to lower total cost
  • Reduce cold starts with provisioned concurrency (for critical functions)
  • Use ARM-based Graviton2 (20% cheaper than x86)
  • Set appropriate timeouts (don't run forever if failed)
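The memory/duration tradeoff falls out of Lambda's GB-second billing model. A sketch using an illustrative rate and a hypothetical workload:

```python
# Illustrative x86 compute rate per GB-second; check current Lambda pricing
GB_SECOND_RATE = 0.0000166667

def lambda_compute_cost(memory_mb, duration_ms, invocations):
    """Monthly Lambda compute cost (excludes the per-request charge)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000) * invocations
    return gb_seconds * GB_SECOND_RATE

# More memory can be cheaper if it cuts duration enough:
# hypothetical function at 10M invocations/month
small = lambda_compute_cost(128, 1000, 10_000_000)  # 128 MB, 1s
large = lambda_compute_cost(256, 400, 10_000_000)   # 256 MB, 0.4s
print(f"128MB/1000ms: ${small:,.2f}  256MB/400ms: ${large:,.2f}")
# 128MB/1000ms: $20.83  256MB/400ms: $16.67
```

Here doubling memory more than halves duration, so the larger configuration is both faster and cheaper; tools like AWS Lambda Power Tuning search this space empirically.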

Data Transfer

  • Minimize cross-region transfers (expensive: $0.02/GB vs free within region)
  • Use VPC endpoints to avoid NAT gateway costs
  • Batch API requests to reduce calls
  • Compress responses (gzip = 50-80% smaller)
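A back-of-envelope estimate of what compression saves on egress, with an illustrative $0.09/GB internet egress rate and an assumed gzip ratio:

```python
def egress_cost(gb, price_per_gb=0.09):
    """Internet egress cost at an illustrative per-GB rate."""
    return gb * price_per_gb

def compressed_savings(gb, compression_ratio=0.3):
    """Monthly savings if gzip shrinks responses to
    `compression_ratio` of their original size (70% reduction)."""
    return egress_cost(gb) - egress_cost(gb * compression_ratio)

# Hypothetical 50 TB/month of JSON responses; text compresses well
print(f"${compressed_savings(50_000):,.0f}/month saved")  # $3,150/month saved
```

The tradeoff is CPU time spent compressing, which for text payloads is almost always far cheaper than the bandwidth it avoids.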

FinOps Tools and Dashboards

class FinOpsDashboard:
    def daily_report(self):
        """Generate the daily spend report"""
        return {
            'today': self.cost_explorer.today(),
            'ytd': self.cost_explorer.ytd(),
            'forecast_monthly': self.cost_explorer.forecast_month(),
            'budget': 50000,
            'variance': self.cost_explorer.ytd() - (30 * self.avg_daily()),
            'top_services': self.top_services_by_cost(n=5),
            'anomalies': self.detect_anomalies(),
            'opportunities': self.cost_optimization_opportunities()
        }

    def detect_anomalies(self):
        """Detect unusual spend patterns"""
        today = self.cost_explorer.today()
        baseline = self.baseline_daily_cost()

        if today > baseline * 1.2:  # more than 20% above baseline
            return f"Spend {today:.0f} exceeds baseline {baseline:.0f} by over 20%"
        return None

    def cost_optimization_opportunities(self):
        """Identify quick wins"""
        opportunities = []

        # Underutilized instances: downsizing saves roughly 70%
        for instance in self.ec2.describe_instances():
            if instance.cpu_utilization < 10:
                opportunities.append({
                    'type': 'Downsize',
                    'resource': instance.id,
                    'savings': instance.monthly_cost * 0.7
                })

        # Always-on databases with no RI: reservations save roughly 40%
        for rds in self.rds.describe_instances():
            if rds.uptime_hours >= 30 * 24:  # running continuously for a month
                opportunities.append({
                    'type': 'Buy Reserved Instance',
                    'resource': rds.id,
                    'savings': rds.monthly_cost * 0.4
                })

        return opportunities

Cost Governance

Budgets and Alerts

AWS Budget:
  Monthly Limit: $50,000
  Alerts:
    - At 50%: Notify team leads
    - At 80%: Notify management
    - At 100%: Automatic cost controls (stop new resources)

Cost Allocation Tags:
  - Environment: production, staging, development
  - Team: platform, product, data
  - Cost Center: engineering, marketing
  - Project: catalog, checkout, analytics

Chargeback Model:
  # Monthly billing to teams based on usage
  Product Team:
    - RDS (catalog DB): $3,000 (allocated by query count)
    - Lambda (product APIs): $1,500 (allocated by invocations)
    - S3 (product images): $500 (allocated by storage GB)
    - Total: $5,000/month

  # Chargeback gives teams a direct incentive to optimize

Advanced Cost Case Studies

Case Study 1: Database Optimization

Before (Unoptimized):
- 3x m5.2xlarge RDS instances: $3,000/month
- CPU/memory utilization: 5-10%
- Unoptimized queries: 10,000 database queries/second to serve only 100 requests/second

Solution:
- Optimize slow queries (N+1 problems): Cut queries from 10,000 to 1,000 per second
- Downsize to 3x t3.medium: $300/month
- Add Redis cache: $200/month
- Query optimization engineering time: 40 hours

After:
- Monthly cost: $500 (vs $3,000)
- Savings: $2,500/month
- Payback period: ~1.2 months (one-time engineering cost recovered by savings)

Result: Faster queries, cheaper system
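The payback arithmetic behind that figure, assuming a hypothetical $75/hour fully loaded engineering rate:

```python
def payback_months(engineering_hours, hourly_rate, monthly_savings):
    """Months until a one-time engineering cost is recovered by savings."""
    return (engineering_hours * hourly_rate) / monthly_savings

# Case study figures: 40 hours of query optimization, $2,500/month saved,
# at an assumed $75/hour fully loaded rate
months = payback_months(40, 75, 2_500)
print(f"Payback in {months:.1f} months")  # Payback in 1.2 months
```

The same formula is the sanity check in Next Steps below: any optimization project should pay for its engineering time within a few months.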

Case Study 2: Multi-Region to Single-Region + CDN

Before (Multi-Region Redundancy):
- Primary region (US): $20,000/month
- Backup region (EU): $20,000/month
- Data replication: $2,000/month
- Total: $42,000/month

Cost: $0.42 per user per month (100k users)

Solution:
- Single region (US): $20,000/month
- CloudFront CDN (global): $3,000/month
- Database replication removed: -$2,000/month
- Backup/disaster recovery: $1,000/month
- Total: $22,000/month

Result:
- Savings: $20,000/month (47% reduction)
- Latency: 200ms → 50ms (EU users)
- Tradeoff: Single region failure = full downtime (mitigated by backups)

Next Steps

  1. Get visibility — Enable AWS Cost Explorer, GCP Cost Management, Azure Cost Management
  2. Identify quick wins — Unused resources, oversized instances, wasted data transfer (typically save 20-30%)
  3. Optimize baseline — Right-sizing, reserved instances, spot instances (save 40-50%)
  4. Optimize architecturally — Caching, compression, CDN (save 50-80% on certain services)
  5. Monitor continuously — Alerts for cost anomalies, monthly reviews, trend analysis
  6. Build culture — Developers accountable for cost, celebrate savings, educate on cost tradeoffs
  7. Measure ROI — Track savings from each optimization effort (40-hour project should save > $8k to be worthwhile)
