Cloud Service Models
| Model | You Manage | Provider Manages | Example |
|---|---|---|---|
| SaaS | Config only | Everything else | Salesforce, GitHub |
| PaaS | Apps & data | Runtime, OS, hardware | Heroku, App Engine |
| FaaS | Functions | Scaling, runtime, infra | Lambda, Cloud Functions |
| IaaS | Apps, data, runtime, OS | Virtualisation, hardware | EC2, GCE |
Guideline
Prefer higher-level services (PaaS over IaaS) when they fit your constraints, because they shift operational burden to the provider. Use IaaS only when you need fine-grained control, or when the higher-level service does not meet your compliance or performance requirements.
Deployment Strategies
| Strategy | Availability | Latency | Cost | Complexity |
|---|---|---|---|---|
| Single region | Medium | Low for local users | Low | Low |
| Active-Passive | High | Low near the active region (failover takes minutes) | Medium | Medium |
| Active-Active | Very high | Low globally | High | High |
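
The difference between the two multi-region strategies comes down to how traffic is routed. The following is a minimal sketch, not a real traffic manager: the `Region` type, its fields, and both routing functions are hypothetical names introduced for illustration, and the latency values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool
    latency_ms: float  # measured from the caller; illustrative values only


def route_active_passive(primary: Region, standby: Region) -> Region:
    """Send all traffic to the primary; use the standby only on failure."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy region available")


def route_active_active(regions: list[Region]) -> Region:
    """Send traffic to the lowest-latency region that is currently healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r.latency_ms)
```

In the active-passive case the standby serves nothing until the primary fails, which is why users far from the primary see no latency benefit. In the active-active case every region serves traffic all the time, which accounts for the higher cost and complexity in the table.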
Infrastructure as Code (IaC)
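Treat infrastructure the same way you treat application code: define it in version-controlled files and apply changes through tooling rather than by hand. Common tools: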
- Terraform / OpenTofu — declarative, stateful, multi-cloud
- Pulumi — infrastructure in general-purpose languages (TypeScript, Python, Go)
- CloudFormation / CDK — AWS-native
- Ansible — configuration management, procedural
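
To make "declarative, stateful" concrete, here is a toy sketch of the idea: compare the desired configuration with the recorded state and derive the actions needed to reconcile them. It is not how any of the tools above are implemented; the `plan` function and the resource dictionaries are hypothetical.

```python
def plan(desired: dict[str, dict], recorded: dict[str, dict]) -> dict[str, list[str]]:
    """Diff desired config against recorded state and return the actions needed."""
    create = sorted(name for name in desired if name not in recorded)
    delete = sorted(name for name in recorded if name not in desired)
    update = sorted(
        name for name in desired
        if name in recorded and desired[name] != recorded[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical resources: what the code declares vs. what was last applied.
desired = {
    "web": {"type": "vm", "size": "medium"},
    "db": {"type": "database", "size": "large"},
}
recorded = {
    "web": {"type": "vm", "size": "small"},
    "cache": {"type": "cache", "size": "small"},
}

print(plan(desired, recorded))
# → {'create': ['db'], 'update': ['web'], 'delete': ['cache']}
```

A procedural tool, by contrast, runs the steps you list in order; it does not need a stored state to work out what changed.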
IaC Best Practices
| Practice | Why |
|---|---|
| Store state remotely (S3, Terraform Cloud) | Prevents state loss and enables team collaboration |
| Review IaC changes in pull requests | Catches misconfigurations before deployment |
| Use modules | Avoids duplication and enforces standards |
| Tag all resources | Enables cost tracking by team, project, environment |
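
A tagging rule is easiest to enforce when it is checked automatically, for example in the same pipeline that reviews IaC changes. The sketch below assumes resources are available as plain dictionaries with a `name` and an optional `tags` mapping; that shape and the `missing_tags` helper are assumptions for illustration, not any tool's API.

```python
# Required keys follow the "team, project, environment" attribution above.
REQUIRED_TAGS = {"team", "project", "environment"}


def missing_tags(resources: list[dict]) -> dict[str, list[str]]:
    """Map each non-compliant resource name to the required tags it lacks."""
    report = {}
    for resource in resources:
        absent = REQUIRED_TAGS - set(resource.get("tags", {}))
        if absent:
            report[resource["name"]] = sorted(absent)
    return report
```

An empty result means every resource is attributable; a non-empty one can fail the build before untagged resources reach production.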
Cost Governance
Cloud costs spiral without governance. Establish:
| Practice | Impact | Effort |
|---|---|---|
| Budgets and alerts | Prevents bill shock | Low |
| Resource tagging | Enables cost attribution | Low |
| Right-sizing instances | Can reduce waste by 20-40% | Medium |
| Auto-scaling | Matches capacity to demand | Medium |
| Reserved / savings plans | 30-60% discount vs on-demand | Low |
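
Budget alerts are most useful when they fire on the projected month-end spend rather than on the amount already spent. The sketch below uses a simple linear projection; the function names and the 50/80/100% thresholds are assumptions chosen for illustration, and real billing tools may forecast differently.

```python
def forecast_month_end(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Linear projection: assume the average daily spend so far continues."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must fall within the month")
    return spend_to_date / day_of_month * days_in_month


def budget_alerts(forecast: float, budget: float, thresholds=(0.5, 0.8, 1.0)) -> list[float]:
    """Return every threshold (as a fraction of budget) the forecast has reached."""
    return [t for t in thresholds if forecast >= t * budget]
```

For example, 4,000 spent by day 10 of a 30-day month projects to 12,000, which crosses all three thresholds of a 10,000 budget while there is still time to react.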
Designing for Resilience

          ┌──────────┐
          │   Load   │
          │ Balancer │
          └────┬─────┘
               │
         ┌─────┼──────┐
         │     │      │
    ┌────▼──┐ ┌▼───┐ ┌▼────┐
    │ App   │ │App │ │ App │
    │ Inst.1│ │Ins.│ │Ins.3│
    └────┬──┘ └────┘ └─────┘
         │
    ┌────▼────┐
    │Circuit  │
    │Breaker  │──► Downstream Service
    └─────────┘

The goal is not to prevent all failures — it’s to limit blast radius and recover automatically:
- Load balancers — distribute traffic, health checks
- Auto-scaling groups — replace failed instances
- Circuit breakers — fail fast when downstream is down
- Retries with backoff — handle transient failures
- Bulkheads — isolate failure to one component
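
Two of the patterns above, retries with backoff and circuit breakers, can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not a production library: `retry_with_backoff`, `CircuitBreaker`, `CircuitOpenError`, and all default values are hypothetical, and the retry helper treats only `ConnectionError` and `TimeoutError` as transient.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry `call` on transient errors, doubling the delay cap each attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream marked unavailable")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The two patterns complement each other: retries absorb brief, transient errors, while the breaker stops callers from piling retries onto a downstream service that is clearly down.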
Remember
Resilience is not just about infrastructure. An architecture where a single database failure takes down the entire system is not resilient — regardless of how many app instances you have running.
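
A back-of-the-envelope calculation shows why. The sketch below assumes independent failures and uses made-up availability figures (99% per app instance, 99.5% for the database); the `parallel` and `series` helpers are hypothetical names for the standard redundancy and dependency-chain formulas.

```python
def parallel(availability: float, copies: int) -> float:
    """Availability of `copies` redundant replicas, assuming independent failures."""
    return 1 - (1 - availability) ** copies


def series(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


app_tier = parallel(0.99, copies=3)  # three redundant app instances (assumed 99% each)
single_db = 0.995                    # one database, no replica (assumed figure)

print(f"single DB:     {series(app_tier, single_db):.6f}")
print(f"replicated DB: {series(app_tier, parallel(single_db, copies=2)):.6f}")
```

Under these assumptions the system with a single database can never exceed the database's own 99.5%, however many app instances are added. Only adding redundancy at the database layer raises the ceiling.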