SRE

SLIs, SLOs, incident response, chaos engineering

00 8-Week Roadmap: Fullstack → SRE A solid two-month plan to convert a working fullstack engineer into a junior-SRE-ready operator. Daily breakdown, real labs, and a final capstone. beginner 18 min
01 What SRE Actually Is "Class SRE implements DevOps." The error-budget contract, toil cap, and the embedded engineer model that makes Google's reliability work. beginner 14 min
02 SLIs, SLOs & Error Budgets Pick the right SLI, set an SLO that survives lawyer review, and burn the budget the way Google's CRE team does. beginner 18 min
03 Golden Signals, RED, and USE The three monitoring frameworks that actually matter, when to use each, and the Prometheus + Grafana stack that exposes them all. beginner 16 min
04 Incident Response & On-Call ICS roles, severity classification, comms cadence, and the on-call rotation that doesn't burn engineers out. intermediate 17 min
05 Blameless Postmortems The full template Google, Etsy, and Stripe use, with action-item discipline that keeps the same incident from happening twice. intermediate 15 min
06 Capacity Planning & Load Testing Little's Law, Universal Scalability Law, headroom, and a real k6 + Locust load test you can run today. intermediate 18 min
07 Production Readiness Reviews The PRR checklist that heads off 80% of preventable launch failures, with a real launch-gate Terraform module. intermediate 16 min
08 Chaos Engineering Hypothesis-driven failure injection from Netflix Simian Army to Chaos Mesh, with real experiments and a safety harness. advanced 16 min
09 Disaster Recovery & Backups RTO, RPO, multi-region failover, and the restore drill that proves your backups exist. advanced 16 min
10 Toil & Automation Measuring toil, the 50% cap, and the automation taxonomy from one-off scripts to self-healing operators. advanced 15 min
11 Scaling & Distributed Systems — 8-Week Companion Roadmap Zero to designing, building, and operating scalable systems. 1–2 hrs/day, 8 weeks, 8 real projects, 5 case studies, mini-YouTube capstone. beginner 25 min
12 Linux Performance Mastery From `top` to `perf`, `bpftrace`, and flame graphs. The senior-SRE toolkit for diagnosing latency, CPU, memory, and I/O at the kernel level — without restarting anything in production. mastery 32 min
13 Network Engineering for SREs BGP, anycast, ECMP, CDN internals, packet capture, and TCP at scale. The networking layer where 'random' production weirdness actually lives. mastery 30 min
14 Database Internals for SREs MVCC, replication lag, hot rows, query plans, B-tree vs LSM, connection pools at scale. The DB knowledge that separates 'I run Postgres' from 'I keep Postgres up under fire.' mastery 30 min
15 Distributed Systems Theory for SREs CAP, PACELC, FLP, Raft, Paxos, gossip, vector clocks, CRDTs, fencing tokens. The theory that explains why your distributed system breaks the way it does. mastery 32 min
16 Kubernetes at Scale 1,000+ node clusters, multi-tenancy, RBAC, NetworkPolicy, OPA/Kyverno, GitOps, etcd tuning. The operating model when 'just run kubectl apply' is no longer a strategy. mastery 30 min
17 Service Mesh Internals Envoy, Istio, Linkerd, sidecar vs ambient, mTLS, xDS, retries, circuit breakers, traffic shifting. What a mesh actually does and when it earns its complexity. mastery 28 min
18 FinOps & Cost Engineering Unit economics, rightsizing, spot, savings plans, cost-aware SLOs. The senior SRE skill that turns 'the cloud bill is too high' into a tracked, owned, falling number. mastery 26 min
19 Reliability Culture & SRE Org Design Staff+ SRE work, embedding, charters, blame-aware orgs, mentoring, sustainable on-call. The non-technical lever that makes or breaks every reliability program. mastery 26 min
20 12-Month Mastery Roadmap — Junior SRE → Senior/Staff The year-long plan that picks up where the 8-week roadmap stops. Monthly milestones, real production projects, deep reading, and the artifacts that prove staff-level capability. mastery 30 min