SRE

SLIs, SLOs, incident response, chaos engineering

  1. 00 8-Week Roadmap: Fullstack → SRE
     A solid two-month plan to convert a working fullstack engineer into a junior-SRE-ready operator. Daily breakdown, real labs, and a final capstone.
  2. 01 What SRE Actually Is
     Class SRE implements DevOps. The error-budget contract, toil cap, and the embedded engineer model that makes Google's reliability work.
  3. 02 SLIs, SLOs & Error Budgets
     Pick the right SLI, set an SLO that survives lawyer review, and burn the budget the way Google's CRE team does it.
  4. 03 Golden Signals, RED, and USE
     The three monitoring frameworks that actually matter, when to use each, and the Prometheus + Grafana stack that exposes them all.
  5. 04 Incident Response & On-Call
     ICS roles, severity classification, comms cadence, and the on-call rotation that doesn't burn engineers out.
  6. 05 Blameless Postmortems
     The full template Google, Etsy, and Stripe use, with action-item discipline that keeps the same incident from happening twice.
  7. 06 Capacity Planning & Load Testing
     Little's Law, Universal Scalability Law, headroom, and a real k6 + Locust load test you can run today.
  8. 07 Production Readiness Reviews
     The PRR checklist that heads off 80% of avoidable launch-day fires, with a real launch-gate Terraform module.
  9. 08 Chaos Engineering
     Hypothesis-driven failure injection from Netflix Simian Army to Chaos Mesh, with real experiments and a safety harness.
  10. 09 Disaster Recovery & Backups
      RTO, RPO, multi-region failover, and the restore drill that proves your backups exist.
  11. 10 Toil & Automation
      Measuring toil, the 50% cap, and the automation taxonomy from one-off scripts to self-healing operators.
  12. 11 Scaling & Distributed Systems: 8-Week Companion Roadmap
      Zero to designing, building, and operating scalable systems. 1–2 hrs/day, 8 weeks, 8 real projects, 5 case studies, mini-YouTube capstone.
  13. 12 Linux Performance Mastery
      From `top` to `perf`, `bpftrace`, and flame graphs. The senior-SRE toolkit for diagnosing latency, CPU, memory, and I/O at the kernel level, without restarting anything in production.
  14. 13 Network Engineering for SREs
      BGP, anycast, ECMP, CDN internals, packet capture, and TCP at scale. The networking layer where 'random' production weirdness actually lives.
  15. 14 Database Internals for SREs
      MVCC, replication lag, hot rows, query plans, B-tree vs LSM, connection pools at scale. The DB knowledge that separates 'I run Postgres' from 'I keep Postgres up under fire.'
  16. 15 Distributed Systems Theory for SREs
      CAP, PACELC, FLP, Raft, Paxos, gossip, vector clocks, CRDTs, fencing tokens. The theory that explains why your distributed system breaks the way it does.
  17. 16 Kubernetes at Scale
      1,000+ node clusters, multi-tenancy, RBAC, NetworkPolicy, OPA/Kyverno, GitOps, etcd tuning. The operating model when 'just run kubectl apply' is no longer a strategy.
  18. 17 Service Mesh Internals
      Envoy, Istio, Linkerd, sidecar vs ambient, mTLS, xDS, retries, circuit breakers, traffic shifting. What a mesh actually does and when it earns its complexity.
  19. 18 FinOps & Cost Engineering
      Unit economics, rightsizing, spot, savings plans, cost-aware SLOs. The senior SRE skill that turns 'the cloud bill is too high' into a tracked, owned, falling number.
  20. 19 Reliability Culture & SRE Org Design
      Staff+ SRE work, embedding, charters, blame-aware orgs, mentoring, sustainable on-call. The non-technical lever that makes or breaks every reliability program.
  21. 20 12-Month Mastery Roadmap: Junior SRE → Senior/Staff
      The year-long plan that picks up where the 8-week roadmap stops. Monthly milestones, real production projects, deep reading, and the artifacts that prove staff-level capability.