SRE
SLIs, SLOs, incident response, chaos engineering
- 00 8-Week Roadmap: Fullstack → SRE A solid two-month plan to convert a working fullstack engineer into a junior-SRE-ready operator. Daily breakdown, real labs, and a final capstone. beginner 18 min
- 01 What SRE Actually Is Class SRE implements DevOps: the error-budget contract, the toil cap, and the embedded-engineer model that makes Google's reliability work. beginner 14 min
- 02 SLIs, SLOs & Error Budgets Pick the right SLI, set an SLO that survives lawyer review, and burn the budget the way Google's CRE team does. beginner 18 min
- 03 Golden Signals, RED, and USE The three monitoring frameworks that actually matter, when to use each, and the Prometheus + Grafana stack that exposes them all. beginner 16 min
- 04 Incident Response & On-Call ICS roles, severity classification, comms cadence, and the on-call rotation that doesn't burn engineers out. intermediate 17 min
- 05 Blameless Postmortems The full template Google, Etsy, and Stripe use, with action-item discipline that prevents the same incident twice. intermediate 15 min
- 06 Capacity Planning & Load Testing Little's Law, Universal Scalability Law, headroom, and a real k6 + Locust load test you can run today. intermediate 18 min
- 07 Production Readiness Reviews The PRR checklist that stops 80% of avoidable launch fires, with a real launch-gate Terraform module. intermediate 16 min
- 08 Chaos Engineering Hypothesis-driven failure injection, from Netflix's Simian Army to Chaos Mesh, with real experiments and a safety harness. advanced 16 min
- 09 Disaster Recovery & Backups RTO, RPO, multi-region failover, and the restore drill that proves your backups exist. advanced 16 min
- 10 Toil & Automation Measuring toil, the 50% cap, and the automation taxonomy from one-off scripts to self-healing operators. advanced 15 min
- 11 Scaling & Distributed Systems: 8-Week Companion Roadmap Zero to designing, building, and operating scalable systems. 1–2 hrs/day, 8 weeks, 8 real projects, 5 case studies, mini-YouTube capstone. beginner 25 min
- 12 Linux Performance Mastery From `top` to `perf`, `bpftrace`, and flame graphs. The senior-SRE toolkit for diagnosing latency, CPU, memory, and I/O at the kernel level, without restarting anything in production. mastery 32 min
- 13 Network Engineering for SREs BGP, anycast, ECMP, CDN internals, packet capture, and TCP at scale. The networking layer where 'random' production weirdness actually lives. mastery 30 min
- 14 Database Internals for SREs MVCC, replication lag, hot rows, query plans, B-tree vs LSM, connection pools at scale. The DB knowledge that separates 'I run Postgres' from 'I keep Postgres up under fire.' mastery 30 min
- 15 Distributed Systems Theory for SREs CAP, PACELC, FLP, Raft, Paxos, gossip, vector clocks, CRDTs, fencing tokens. The theory that explains why your distributed system breaks the way it does. mastery 32 min
- 16 Kubernetes at Scale 1,000+ node clusters, multi-tenancy, RBAC, NetworkPolicy, OPA/Kyverno, GitOps, etcd tuning. The operating model when 'just run kubectl apply' is no longer a strategy. mastery 30 min
- 17 Service Mesh Internals Envoy, Istio, Linkerd, sidecar vs ambient, mTLS, xDS, retries, circuit breakers, traffic shifting. What a mesh actually does and when it earns its complexity. mastery 28 min
- 18 FinOps & Cost Engineering Unit economics, rightsizing, spot, savings plans, cost-aware SLOs. The senior SRE skill that turns 'the cloud bill is too high' into a tracked, owned, falling number. mastery 26 min
- 19 Reliability Culture & SRE Org Design Staff+ SRE work, embedding, charters, blame-aware orgs, mentoring, sustainable on-call. The non-technical lever that makes or breaks every reliability program. mastery 26 min
- 20 12-Month Mastery Roadmap: Junior SRE → Senior/Staff The year-long plan that picks up where the 8-week roadmap stops. Monthly milestones, real production projects, deep reading, and the artifacts that prove staff-level capability. mastery 30 min