Cost Efficiency and FinOps
Optimize cloud spend without sacrificing performance or reliability.
TL;DR
FinOps is "DevOps for cloud costs": treat infrastructure costs like code — version them, review them, optimize them. Right-size instances (don't run an m5.xlarge when a t3.small works). Use spot instances (~70% discount, but capacity can be reclaimed). Leverage reserved instances (20-40% discount for a 1-3 year commitment). Delete unused resources (forgotten test databases cost real money). Auto-scale down when load decreases. Monitor cost per feature and per user. Trade latency for cost: caching reduces database load; compression reduces egress. Architecture decisions made today determine tomorrow's cloud bill.
Learning Objectives
- Measure cost per service, feature, and user
- Right-size infrastructure (avoid over-provisioning)
- Use reserved instances, spot instances, and savings plans
- Optimize data transfer and storage
- Trade performance for cost (caching, compression, CDN)
- Implement cost alerts and budgets
Motivating Scenario
A startup pays $50k/month for cloud infrastructure. Investigation reveals hundreds of unused databases, oversized instances, no caching (10x more database queries than needed), and wasteful data transfers. Fixing these issues cuts spend from $50k to $8k/month without sacrificing performance.
Core Concepts
Cost Drivers
- Compute: CPU hours (biggest spend for most companies)
- Storage: GB/month (database, backups, object storage)
- Network: Data transfer (especially cross-region)
- Managed services: RDS, Elasticsearch, SQS (premium over DIY)
Cost Optimization Levers
| Lever | Impact | Example |
|---|---|---|
| Right-sizing | 30-40% | Switch from m5.2xlarge to t3.medium |
| Reserved Instances | 20-40% | 1-year RI commitment |
| Spot Instances | 70% | Use for batch jobs, non-critical workloads |
| Caching | 80% | Redis reduces database load 10x |
| Compression | 50% | Gzip reduces egress bandwidth |
| Auto-scaling | 40% | Scale down at night, scale up for peak hours |
| Deletion | 100% | Delete unused databases, old snapshots |
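A quick sanity check on how levers combine: each lever removes a fraction of whatever spend is left, so savings compound multiplicatively rather than adding up. A minimal sketch with hypothetical numbers:

```python
# Hypothetical illustration: levers multiply, they don't add.
# Each lever keeps a fraction of the cost it applies to.
def apply_levers(monthly_cost, reductions):
    """reductions: fractional savings applied in sequence."""
    for r in reductions:
        monthly_cost *= (1 - r)
    return monthly_cost

# Right-sizing (35%), then RIs on the remainder (30%), then
# caching-driven downsizing (20%):
# 50k * 0.65 * 0.70 * 0.80 = 18,200, not 50k * (1 - 0.85)
cost = apply_levers(50_000, [0.35, 0.30, 0.20])
print(round(cost))  # 18200
```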
Practical Example
```python
# FinOps: cost tracking and optimization with the AWS Cost Explorer API
import boto3
from datetime import datetime

class CostOptimizer:
    def __init__(self):
        self.ce = boto3.client('ce')    # Cost Explorer
        self.ec2 = boto3.client('ec2')

    def get_monthly_cost(self):
        """Get the current month's spend, grouped by service."""
        return self.ce.get_cost_and_usage(
            TimePeriod={
                'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
                'End': datetime.now().strftime('%Y-%m-%d'),
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
        )

    def find_oversized_instances(self):
        """Find EC2 instances running at low CPU."""
        instances = self.ec2.describe_instances()
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                # Cross-reference CloudWatch CPUUtilization here to
                # flag instances that could be downsized.
                ...

    def estimate_cost_with_reserved_instances(self):
        """Ask Cost Explorer for RI purchase recommendations."""
        return self.ce.get_reservation_purchase_recommendation(
            Service='Amazon Elastic Compute Cloud - Compute',
            LookbackPeriodInDays='THIRTY_DAYS',
            TermInYears='ONE_YEAR',
        )

    def set_cost_alerts(self, budget=10_000):
        """Alert when monthly spend exceeds the budget."""
        # Create an AWS Budget (or a CloudWatch billing alarm) here.
        ...

# Usage
optimizer = CostOptimizer()
print(f"Monthly cost: {optimizer.get_monthly_cost()}")
print(f"RI savings available: {optimizer.estimate_cost_with_reserved_instances()}")
```
Self-Check
- How do you calculate total cost of ownership (TCO)?
- What's the difference between reserved instances and savings plans?
- How do you identify over-provisioned instances?
- Why is data transfer expensive?
- How would you justify FinOps investment to executives?
Design Review Checklist
- Cost tracking enabled per service/feature/team?
- Unused resources identified and deleted?
- Instances right-sized (CloudWatch metrics used)?
- Reserved instances purchased for baseline?
- Spot instances used for non-critical workloads?
- Data transfer optimized (CDN, compression)?
- Auto-scaling configured (up and down)?
- Cost anomalies monitored and alerted?
- Monthly cost reviews scheduled?
- Chargeback model implemented?
- FinOps practices documented?
- Team trained on cost optimization?
Practical Cost Optimization Examples
Advanced Optimization Strategies
Cost-Per-Feature Analysis
Track costs not just by service, but by business feature:
```python
class CostAnalyzer:
    def cost_per_feature(self, feature_name: str, period: str = 'monthly'):
        """Calculate the cost of operating a specific feature."""
        # Map the feature to the services it uses (e.g. the catalog
        # feature uses RDS, ElastiCache, and Lambda).
        services = self.feature_to_services[feature_name]
        total_cost = 0
        for service in services:
            service_cost = self.cost_explorer.get_cost(service, period)
            # Allocate cost proportionally by usage share.
            allocated_cost = service_cost * self.allocation_factor(feature_name, service)
            total_cost += allocated_cost
        return total_cost

# Example output
# Feature: Product Search
#   Elasticsearch cluster:    $5,000/month
#   Lambda (search queries):  $2,000/month
#   Data transfer (results):    $800/month
#   Total:                    $7,800/month
#   Cost per query:           $0.0012
#
# Decision: if the feature generates $1M revenue, cost is 0.78% of
# revenue (acceptable). If it generates $50k revenue, cost is 15.6%
# of revenue (too high; deprecate?)
```
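The allocation logic above can be made concrete with a small runnable sketch; the service costs and usage shares below are hypothetical:

```python
# Hypothetical data: monthly service costs and each feature's
# fractional usage share of every shared service.
service_costs = {"elasticsearch": 5000, "lambda": 2000, "transfer": 800}
usage_share = {
    "search":   {"elasticsearch": 1.0, "lambda": 1.0, "transfer": 1.0},
    "checkout": {"elasticsearch": 0.0, "lambda": 0.5, "transfer": 0.1},
}

def cost_per_feature(feature):
    shares = usage_share[feature]
    return sum(cost * shares.get(svc, 0)
               for svc, cost in service_costs.items())

print(cost_per_feature("search"))  # 7800.0
```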
Optimization Techniques by Service
RDS (Databases)
- Downsize instance if CPU/memory < 20% utilized (m5.2xlarge → t3.medium saves 70%)
- Use read replicas for read-heavy workloads (vs scaling a single instance)
- Enable automated backups with appropriate retention (old backups = wasted cost)
- Use Aurora Serverless for spiky workloads (scales capacity automatically, billed per capacity-second)
EC2 (Compute)
- Reserved instances for predictable baseline (1-year RI = 40% discount)
- Spot instances for non-critical workloads (batch jobs, testing = 70% discount)
- Scheduled scaling (spin down non-prod at night = 50% savings)
- Right-sizing: use CloudWatch metrics to size instances accurately
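Scheduled scaling can be expressed as two recurring actions against an Auto Scaling group. A sketch using boto3's `put_scheduled_update_group_action` (group and action names are hypothetical; not verified against a live account):

```python
# Sketch: build scheduled-action payloads, then send them with
# boto3's autoscaling client. Recurrence is a UTC cron expression.
def scale_schedule(group):
    return [
        {   # scale non-prod to zero at 20:00 on weekdays
            "AutoScalingGroupName": group,
            "ScheduledActionName": "nightly-scale-down",
            "Recurrence": "0 20 * * 1-5",
            "MinSize": 0, "MaxSize": 0, "DesiredCapacity": 0,
        },
        {   # bring it back at 07:00
            "AutoScalingGroupName": group,
            "ScheduledActionName": "morning-scale-up",
            "Recurrence": "0 7 * * 1-5",
            "MinSize": 2, "MaxSize": 6, "DesiredCapacity": 2,
        },
    ]

# import boto3
# asg = boto3.client("autoscaling")
# for action in scale_schedule("staging-asg"):
#     asg.put_scheduled_update_group_action(**action)
```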
S3 (Storage)
- Use S3 Intelligent-Tiering (auto-moves to cheaper tiers based on access)
- Move old data to Glacier (95% cheaper than standard, 3-5 hour retrieval)
- Enable versioning only where needed
- Cleanup: delete unused buckets, old backups, obsolete data
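A lifecycle policy automates the tiering and cleanup bullets above. A sketch of the configuration payload (bucket name and day thresholds are illustrative):

```python
# Sketch of an S3 lifecycle policy: move objects to Intelligent-Tiering
# after 30 days and to Glacier after 180; delete after two years.
lifecycle = {
    "Rules": [{
        "ID": "tier-then-archive",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "Transitions": [
            {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
            {"Days": 180, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 730},
    }]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-backups", LifecycleConfiguration=lifecycle)
```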
Lambda (Serverless)
- Tune memory: more memory also means more CPU, so 128MB → 256MB can cut duration enough to lower total cost
- Reduce cold starts with provisioned concurrency (for critical functions)
- Use ARM-based Graviton2 (20% cheaper than x86)
- Set appropriate timeouts (don't run forever if failed)
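The memory/duration tradeoff can be checked with Lambda's pricing formula: you pay per request plus per GB-second. A sketch with approximate, region-dependent rates:

```python
# Illustrative Lambda cost model (rates approximate and region-dependent):
# cost = invocations * per-request price + GB-seconds * per-GB-second price
PER_REQUEST = 0.20 / 1_000_000   # ~$0.20 per million requests
PER_GB_SECOND = 0.0000166667     # approximate x86 rate

def lambda_cost(invocations, memory_mb, avg_duration_s):
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_s
    return invocations * PER_REQUEST + gb_seconds * PER_GB_SECOND

# Doubling memory can cut cost if duration drops enough:
slow = lambda_cost(10_000_000, 128, 1.0)   # 128 MB, 1.0 s each
fast = lambda_cost(10_000_000, 256, 0.4)   # 256 MB, 0.4 s each
print(f"${slow:.2f} vs ${fast:.2f} per 10M invocations")
```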
Data Transfer
- Minimize cross-region transfers (expensive: roughly $0.02/GB, while same-AZ traffic over private IPs is free)
- Use VPC endpoints to avoid NAT gateway costs
- Batch API requests to reduce calls
- Compress responses (gzip = 50-80% smaller)
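Compression savings are easy to verify locally; ratios depend heavily on how repetitive the payload is:

```python
import gzip
import json

# A repetitive JSON payload compresses very well; real-world API
# responses usually land in the 50-80% range.
payload = json.dumps([{"sku": i, "name": "widget", "price": 9.99}
                      for i in range(1000)]).encode()
compressed = gzip.compress(payload)
print(f"{len(payload)} -> {len(compressed)} bytes "
      f"({100 * (1 - len(compressed) / len(payload)):.0f}% smaller)")
```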
FinOps Tools and Dashboards
```python
class FinOpsDashboard:
    def daily_report(self):
        """Generate the daily spend report."""
        mtd = self.cost_explorer.month_to_date()
        return {
            'today': self.cost_explorer.today(),
            'month_to_date': mtd,
            'forecast_monthly': self.cost_explorer.forecast_month(),
            'budget': 50_000,
            'variance': mtd - (30 * self.avg_daily()),
            'top_services': self.top_services_by_cost(n=5),
            'anomalies': self.detect_anomalies(),
            'opportunities': self.cost_optimization_opportunities(),
        }

    def detect_anomalies(self):
        """Detect unusual spend patterns."""
        today = self.cost_explorer.today()
        baseline = self.baseline_daily_cost()
        if today > baseline * 1.2:  # 20% increase
            return f"Spend {today:.0f} exceeds baseline {baseline:.0f} by more than 20%"
        return None

    def cost_optimization_opportunities(self):
        """Identify quick wins."""
        opportunities = []
        # Underutilized instances: candidates for downsizing
        for instance in self.ec2.describe_instances():
            if instance.cpu_utilization < 10:
                opportunities.append({
                    'type': 'Downsize',
                    'resource': instance.id,
                    'savings': instance.monthly_cost * 0.7,
                })
        # Always-on databases with no reserved instance coverage
        for rds in self.rds.describe_instances():
            if rds.uptime_hours > 720:  # running ~24/7 for a month
                opportunities.append({
                    'type': 'Buy Reserved Instance',
                    'resource': rds.id,
                    'savings': rds.monthly_cost * 0.4,
                })
        return opportunities
```
Cost Governance
Budgets and Alerts
```yaml
AWS Budget:
  Monthly Limit: $50,000
  Alerts:
    - At 50%: Notify team leads
    - At 80%: Notify management
    - At 100%: Automatic cost controls (stop new resources)
```
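The budget above can be created programmatically. A sketch of the AWS Budgets payload (account ID and email addresses are placeholders; not verified against a live account):

```python
# Sketch: build the AWS Budgets payload, then send it with boto3.
budget = {
    "BudgetName": "monthly-cloud-spend",
    "BudgetLimit": {"Amount": "50000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
}
alerts = [
    {"Notification": {"NotificationType": "ACTUAL",
                      "ComparisonOperator": "GREATER_THAN",
                      "Threshold": pct, "ThresholdType": "PERCENTAGE"},
     "Subscribers": [{"SubscriptionType": "EMAIL", "Address": addr}]}
    for pct, addr in [(50, "team-leads@example.com"),
                      (80, "management@example.com")]
]

# import boto3
# boto3.client("budgets").create_budget(
#     AccountId="123456789012", Budget=budget,
#     NotificationsWithSubscribers=alerts)
```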
```yaml
Cost Allocation Tags:
  - Environment: production, staging, development
  - Team: platform, product, data
  - Cost Center: engineering, marketing
  - Project: catalog, checkout, analytics
```
Chargeback model — monthly billing to teams based on usage:
```yaml
Product Team:
  - RDS (catalog DB): $3,000 (allocated by query count)
  - Lambda (product APIs): $1,500 (allocated by invocations)
  - S3 (product images): $500 (allocated by storage GB)
  - Total: $5,000/month
# Charging teams for what they use incentivizes them to optimize
```
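The chargeback math is just proportional allocation of each shared bill. A runnable sketch with hypothetical query counts:

```python
# Split a shared service's bill across teams in proportion to a
# usage metric (query counts here; all numbers hypothetical).
def chargeback(service_cost, usage_by_team):
    total = sum(usage_by_team.values())
    return {team: service_cost * used / total
            for team, used in usage_by_team.items()}

rds_bill = chargeback(6000, {"product": 500_000,
                             "platform": 300_000,
                             "data": 200_000})
print(rds_bill)  # {'product': 3000.0, 'platform': 1800.0, 'data': 1200.0}
```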
Advanced Cost Case Studies
Case Study 1: Database Optimization
Before (Unoptimized):
- 3x m5.2xlarge RDS instances: $3,000/month
- CPU/memory utilization: 5-10%
- Unoptimized queries: 10,000 database queries/second to serve only 100 requests/second
Solution:
- Optimize slow queries (N+1 problems): Cut queries from 10,000 to 1,000 per second
- Downsize to 3x t3.medium: $300/month
- Add Redis cache: $200/month
- Query optimization engineering time: 40 hours
After:
- Monthly cost: $500 (vs $3,000)
- Savings: $2,500/month
- Payback period: ~1.2 months of savings covers the engineering time
Result: Faster queries, cheaper system
Case Study 2: Multi-Region to Single-Region + CDN
Before (Multi-Region Redundancy):
- Primary region (US): $20,000/month
- Backup region (EU): $20,000/month
- Data replication: $2,000/month
- Total: $42,000/month
Cost: $0.42 per user per month (100k users)
Solution:
- Single region (US): $20,000/month
- CloudFront CDN (global): $3,000/month
- Database replication removed: -$2,000/month
- Backup/disaster recovery: $1,000/month
- Total: $22,000/month
Result:
- Savings: $20,000/month (47% reduction)
- Latency: 200ms → 50ms (EU users)
- Tradeoff: Single region failure = full downtime (mitigated by backups)
Next Steps
- Get visibility — Enable AWS Cost Explorer, GCP Cost Management, or Azure Cost Management
- Identify quick wins — Unused resources, oversized instances, wasted data transfer (typically save 20-30%)
- Optimize baseline — Right-sizing, reserved instances, spot instances (save 40-50%)
- Optimize architecturally — Caching, compression, CDN (save 50-80% on certain services)
- Monitor continuously — Alerts for cost anomalies, monthly reviews, trend analysis
- Build culture — Developers accountable for cost, celebrate savings, educate on cost tradeoffs
- Measure ROI — Track savings from each optimization effort (40-hour project should save > $8k to be worthwhile)