Virtual Machines

TL;DR

Virtual machines offer full OS-level isolation, broad software compatibility, and live migration capabilities. Use VMs for legacy applications, strict tenant isolation, mixed workloads (Windows + Linux), or when container escape risk is unacceptable. Trade longer boot times (seconds to minutes) and higher resource overhead (2-4 GB RAM per VM) for operational simplicity and strong security boundaries. VMs are not obsolete in cloud-native; they solve specific architectural problems.

Learning Objectives

  • Understand hypervisor architectures and isolation mechanisms (Type 1 vs Type 2).
  • Evaluate VM sizing, over-subscription, and noisy-neighbor risks.
  • Design VM placement and live migration strategies for availability.
  • Assess when VMs beat containers on security, compliance, or compatibility.
  • Optimize VM performance and resource utilization.

Motivating Scenario

Your organization hosts multi-tenant SaaS serving financial institutions, where strict isolation is non-negotiable (regulatory requirement). A shared container node compromise could leak tenant data. VMs—despite higher overhead—provide hardware-enforced isolation with separate kernels per tenant. With careful sizing, live migration policies, and resource monitoring, you maintain cost efficiency while meeting compliance (SOC 2, HIPAA, PCI-DSS) and security requirements.

Mental Model

VM architecture: hypervisor multiplexes physical resources; each VM has independent OS and kernel.

Core Concepts

Isolation: Each VM runs a complete OS kernel with separate memory, filesystem, and processes. The hypervisor enforces memory isolation via the MMU (memory management unit) and CPU protections (privilege rings). An escape within one VM cannot affect other VMs; the boundary is hardware-enforced.

Resource Over-Subscription: Hypervisors can over-subscribe vCPUs to physical cores and RAM to physical memory (backed by swap). CPU contention causes latency; RAM over-subscription triggers slow swapping. Typical over-subscription: 4:1 CPU (safe), 1.5:1 RAM (risky). Ratios vary by workload (CPU-bound vs I/O-bound).
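
A quick sanity check is to compute both ratios for a planned host. The sketch below uses hypothetical host and VM counts; substitute your own numbers.

# Hypothetical host: 16 physical cores, 256 GB RAM
PHYS_CORES=16
PHYS_RAM_GB=256

# Planned: 24 VMs x 2 vCPU and 8 GB RAM each (hypothetical)
ALLOC_VCPU=$((24 * 2))
ALLOC_RAM_GB=$((24 * 8))

echo "vCPU ratio: ${ALLOC_VCPU}:${PHYS_CORES}"    # 48:16 = 3:1, within the 4:1 guideline
echo "RAM ratio:  ${ALLOC_RAM_GB}:${PHYS_RAM_GB}" # 192:256, under 1:1, safe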

Live Migration: VMs can move between hosts without downtime. Memory pages are copied to the destination and CPU context is transferred. Enables maintenance, load rebalancing, and hardware refresh without disrupting tenants. Pre-copy or post-copy strategies; downtime is typically under 100 ms.

Snapshot & Restore: VMs can be paused, snapshotted (memory + disk state), and resumed. Supports disaster recovery, testing workflows, and development/production parity.

Hypervisor Types:

  • Type 1 (Bare-Metal): KVM, Hyper-V, ESXi. Runs directly on hardware. Better performance, security.
  • Type 2 (Hosted): VirtualBox, VMware Fusion. Runs on top of OS. Easier to set up, slower.
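
On a Linux host you can quickly verify that hardware virtualization is available and that the Type 1 KVM modules are loaded:

# CPU flags: vmx (Intel VT-x) or svm (AMD-V); a count > 0 means supported
egrep -c '(vmx|svm)' /proc/cpuinfo

# KVM kernel modules present?
lsmod | grep kvm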

Practical Examples

# Infrastructure as Code: Define VM instances programmatically

# Fetch latest Ubuntu AMI
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical

filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
}
}

# EC2 instance (VM) with configuration
resource "aws_instance" "app_vm" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.medium" # 2 vCPU, 4 GB RAM
availability_zone = "us-east-1a"
key_name = aws_key_pair.deployer.key_name

vpc_security_group_ids = [aws_security_group.app.id]
subnet_id = aws_subnet.primary.id
iam_instance_profile = aws_iam_instance_profile.app.name

# Root volume (OS)
root_block_device {
volume_type = "gp3"
volume_size = 50 # GB
delete_on_termination = true
}

# Additional data volume
ebs_block_device {
device_name = "/dev/sdf"
volume_type = "gp3"
volume_size = 200
delete_on_termination = false
}

# User data (bootstrap script)
user_data = base64encode(file("${path.module}/cloud-init.sh"))

tags = {
Name = "app-server-01"
Environment = "production"
Tenant = "tenant-a"
CostCenter = "engineering"
}

# Enable detailed monitoring
monitoring = true
}

# Security group (firewall rules)
resource "aws_security_group" "app" {
name = "app-sg"
vpc_id = aws_vpc.main.id

# Ingress: HTTPS from load balancer
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
security_groups = [aws_security_group.lb.id]
}

# Ingress: SSH from bastion only (for debugging)
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
security_groups = [aws_security_group.bastion.id]
}

# Egress: All outbound (default)
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}

# Output the instance IP for access
output "instance_ip" {
value = aws_instance.app_vm.public_ip
}
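
To roll this out, the standard Terraform workflow applies; review the plan before applying so VM and firewall changes are explicit:

terraform init
terraform plan -out=tfplan
terraform apply tfplan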

VM vs Container vs Serverless Decision Matrix

Virtual Machines
  1. Full OS isolation (strongest security boundary)
  2. Legacy app support (Windows, custom OS)
  3. Live migration (zero-downtime moves)
  4. Noisy-neighbor risks (resource contention)
  5. Higher resource overhead (2-4 GB RAM minimum)
  6. Boot time 30s-2m (slower than containers)
  7. Use: Strict isolation, multi-OS, stateful workloads
Containers
  1. Lightweight (process-level isolation)
  2. Fast startup (100ms-1s)
  3. Resource efficient (10-100 MB per container)
  4. Kernel-escape risk (single kernel per host)
  5. Better density (100s containers per node)
  6. Cloud-native (Kubernetes, orchestration)
  7. Use: Microservices, rapid scaling, modern apps
Serverless (Functions)
  1. No infrastructure management
  2. Auto-scaling (millisecond response)
  3. Pay-per-use billing (cost efficient)
  4. Short-lived executions (stateless)
  5. Vendor lock-in (AWS Lambda, Google Cloud Functions)
  6. Cold start latency (100ms-1s)
  7. Use: Event-driven, APIs, intermittent workloads

Real-World Production Architectures

Scenario 1: Multi-Tenant SaaS with Strict Isolation

Each tenant's workload runs in dedicated VMs, isolated at the hypervisor level. A compromised tenant VM cannot access other tenants.

Multi-tenant architecture using VMs for tenant isolation.

Scenario 2: Mixed Workloads (Linux + Windows)

Legacy Windows application cannot run in containers. Host on dedicated Windows VMs while running modern services in Linux containers on Kubernetes.

# Architecture overview:
# - Windows VMs: Legacy .NET Framework apps (4 instances for HA)
# - Linux containers: Microservices (Kubernetes cluster)
# - Both access shared database, cache

# Load balancer routes:
# /api/legacy -> Windows VMs (Windows Load Balancer)
# /api/modern -> Kubernetes services (Ingress)

Scenario 3: Live Migration for Zero-Downtime Maintenance

Hypervisor maintenance required. Live migrate VMs to other hosts.

# Hypervisor A: 8 VMs, 64 GB RAM in use
# Host maintenance window: 2 hours

# Live-migrate each VM to Hypervisor B (pre-copy)
virsh migrate --live vm-1 qemu+ssh://hypervisor-b/system
# Downtime < 100 ms; memory pages are pre-copied before the final switchover

# All 8 VMs moved; Hypervisor A maintenance begins
# Production workload uninterrupted

Common Patterns and Pitfalls

VM Sizing and Capacity Planning

# Example: E-commerce platform

Small VM (t3.medium - AWS):
vCPU: 2
RAM: 4 GB
Use: Stateless services, web frontends, light processing
Cost: ~$0.04/hour
Density: 8 per host (16 vCPU, 32 GB physical)

Medium VM (t3.large - AWS):
vCPU: 2
RAM: 8 GB
Use: Database, caching, moderate workloads
Cost: ~$0.08/hour
Density: 4 per host

Large VM (m5.2xlarge - AWS):
vCPU: 8
RAM: 32 GB
Use: High-traffic services, batch processing
Cost: ~$0.38/hour
Density: 2 per host (16 vCPU, 64 GB)

# Capacity planning:
# Physical Host: 16 vCPU, 256 GB RAM
# Expected workload: 64 small VMs (2 vCPU, 4 GB each)
# Over-subscription: 128:16 vCPU (8:1 - aggressive, OK for bursty)
# Over-subscription RAM: 256:256 (1:1 - no over-subscription, safe)
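
On a libvirt/KVM host you can verify what is actually allocated before adding more VMs; a minimal sketch (sums are read manually from the output):

# Show vCPU and max-memory allocation for every running VM
for d in $(virsh list --name); do
  echo "== $d =="
  virsh dominfo "$d" | egrep 'CPU\(s\)|Max memory'
done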

Decision Checklist

  • Strict isolation required (multi-tenant, compliance)?
  • Application incompatible with containers (legacy, custom OS)?
  • Need to run multiple OS types (Windows, Linux)?
  • Workload stateful or persistent (databases, caches)?
  • Hardware-enforced isolation is a compliance mandate?
  • Live migration needed for maintenance/scaling?
  • Team familiar with VM operations and lifecycle?
  • Budget accommodates higher resource overhead?
  • Workload not bursty (consistent resource usage)?
  • Storage performance not latency-critical?

Self-Check

  • Can you explain how a hypervisor enforces memory isolation between VMs?
  • When would you choose VM isolation over container security policies?
  • How do you plan VM sizing to avoid the noisy-neighbor problem?
  • What are the trade-offs between VM density and isolation strength?
  • How does live migration work, and what's the downtime?
  • Why might Windows licensing be a hidden cost in multi-tenant deployments?

One Takeaway

VMs are not obsolete in cloud-native architectures; they solve specific problems (strict isolation, legacy compatibility, mixed OS support) that containers cannot. Use them where isolation or compatibility demands warrant the resource overhead. Avoid them for services that scale frequently or are performance-sensitive.

Next Steps

  • Explore Containers for lighter-weight compute.
  • Study Orchestration to manage VMs and containers at scale.
  • Review Topologies for distributing VMs across regions.
  • Learn live migration best practices for your hypervisor (KVM, Hyper-V, ESXi).

Advanced Topics: VM Performance and Optimization

CPU Over-Subscription and Contention

Modern hypervisors allow over-subscribing CPU: scheduling VMs with more total vCPUs than the host has physical cores (e.g., 4 vCPUs on 2 cores). This works well for mixed I/O-bound workloads but creates latency issues when multiple VMs are CPU-hungry simultaneously.

Over-subscription ratios:

  • Conservative (1:1): No over-subscription, guaranteed performance. Expensive.
  • Balanced (2:1 to 4:1): Safe for mixed workloads. Typical cloud deployments.
  • Aggressive (8:1+): Only for bursty, predictable workloads. High latency variance.

Detecting contention:

# On the ESXi host: check CPU ready time (VMware)
esxtop → c (CPU view) → %RDY column
# If %RDY > 10%: contention, schedule a migration

# In an AWS guest: check CPU steal time
grep '^cpu ' /proc/stat # steal is the 8th value; high steal means noisy neighbors
vmstat 1                # "st" column reports steal as a percentage

Memory Over-Subscription and Swap

Hypervisors can reclaim guest memory via ballooning, compress it, or swap it to disk. When total guest demand exceeds physical RAM, performance drops dramatically:

Memory allocation strategies:

  • Reserve: Allocate upfront, waste memory, guarantee performance
  • Shared: Allow over-subscription, save memory, risk swap
  • Ballooning: Dynamically adjust, middle ground
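
With libvirt/KVM, ballooning can be driven manually; a sketch assuming a guest named vm-1 with the virtio balloon driver installed:

# Shrink the guest's memory target to 4 GB at runtime
virsh setmem vm-1 4G --live

# Confirm the new allocation
virsh dominfo vm-1 | grep 'Used memory'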

Avoiding swap:

# In the guest: monitor swap usage
free -h
vmstat 1 # watch swpd (swap in use) and si/so (active swapping)

# If swapping: either add RAM or reduce workload

Live Migration Deep Dive

Live migration enables moving VMs between hosts without downtime. How it works:

  1. Pre-copy phase: Copy memory pages to destination while source VM still runs
  2. Stop-copy phase: Pause VM, copy remaining dirty pages, CPU state
  3. Resume phase: Resume VM on destination (downtime ~50-100ms)
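
With libvirt/KVM, a pre-copy live migration with convergence safeguards might look like this (assumes shared storage and a reachable destination host):

# --auto-converge throttles a write-heavy guest so pre-copy can finish;
# --timeout suspends the guest if migration exceeds 300 seconds
virsh migrate --live --verbose --auto-converge --timeout 300 \
  vm-1 qemu+ssh://hypervisor-b/system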

Success factors:

  • Shared storage (NFS, SAN) so VM disk accessible from both hosts
  • Low network latency between hosts (< 5ms RTT)
  • Both hosts support same CPU features
  • VM downtime must be minimized (~100ms acceptable)

Failure scenarios:

  • Network partition: VM frozen in limbo (requires manual intervention)
  • Storage unavailable: Migration fails; the VM needs a restart
  • CPU mismatch: Destination CPU lacks features the source exposed (guest may crash)
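
The CPU-mismatch failure can be caught before migrating. One pre-flight check with libvirt, sketched under the assumption of SSH access between hosts: extract the guest's <cpu> definition and test it on the destination.

virsh dumpxml vm-1 | sed -n '/<cpu/,/<\/cpu>/p' > cpu.xml
scp cpu.xml hypervisor-b:
ssh hypervisor-b "virsh cpu-compare cpu.xml" # reports whether the CPU definition is compatible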

VM Sizing in the Real World

Small e-commerce shop:

Traffic: 1000 requests/day
Solution: 1 × t3.large (2 vCPU, 8GB RAM)
Cost: ~$60/month
Utilization: ~5% (lots of idle capacity, acceptable for small)

Mid-market SaaS:

Traffic: 10M requests/day, 50 concurrent users
Solution: 4 × m5.xlarge (4 vCPU, 16GB RAM each)
Cost: ~$600/month
Utilization: ~40-60% (good balance)
Load balancer distributes traffic
Auto-scaling: +2 VMs at 80% CPU, -2 at 30% (see the policy sketch below)

High-traffic platform:

Traffic: 1B requests/day, 10K concurrent users
Solution: 100 × r5.4xlarge (16 vCPU, 128GB RAM)
Cost: ~$100K/month
Utilization: 70-80% (high, efficient)
Distributed across 3 regions for redundancy
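
The step thresholds in the mid-market example map to AWS step-scaling policies; a simpler target-tracking variant is sketched below, assuming a hypothetical Auto Scaling group named saas-app-asg:

# Hold average CPU near 60% by adding/removing instances automatically
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name saas-app-asg \
  --policy-name cpu-target-60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 60.0
  }'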

Disaster Recovery: Backup and Restore

Snapshot strategy:

  • Daily snapshots: restore to any daily point in the last 30 days
  • Incremental snapshots: only changes stored (not full copy)
  • Test restores: verify snapshots actually work
  • Encryption: snapshots encrypted, secured
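
On AWS, the daily snapshot step might be scripted as follows (the volume ID is a placeholder; snapshots inherit the volume's encryption):

# Create a tagged snapshot of the data volume
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "daily-$(date +%F)" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=retention,Value=30d}]'

# Verify the snapshot completed before trusting it for restore
aws ec2 describe-snapshots --snapshot-ids <snapshot-id>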

RPO/RTO targets:

  • RPO (Recovery Point Objective): How much data loss acceptable? 1 hour of data?
  • RTO (Recovery Time Objective): How long to restore? 1 minute? 1 hour?
  • Higher RPO/RTO = cheaper solutions; lower = more expensive

Example: High-availability database:

  • RPO target: 5 minutes (max 5 min of lost data)
  • RTO target: 15 minutes (restored within 15 min)
  • Solution: Continuous replication to standby VM + automated failover