Production Readiness Reviews
The PRR checklist that prevents roughly 80% of avoidable launch-day fires, with a real launch-gate Terraform module.
Real-World Analogy
A building inspection before occupancy — someone independent checks the structure is safe before people move in.
Why PRRs exist
A Production Readiness Review (PRR) is a structured checkpoint before SRE accepts the pager for a service. It exists for one reason: fixing a missing dashboard before launch costs roughly a tenth of what it costs to fix during a 2am incident.
PRRs are the highest-leverage thing SRE does. One hour of review can prevent weeks of pager pain.
The launch criteria checklist
A PRR is a yes/no checklist. Ambiguity in any item means the launch is blocked until it’s resolved.
# PRR: [Service Name] — [Owner Team] — [Launch Date]
## 1. Architecture
- [ ] Architecture diagram exists and is current
- [ ] All dependencies documented (services, databases, external APIs)
- [ ] Failure mode for each dependency documented
- [ ] Single points of failure identified and accepted (or eliminated)
## 2. Reliability
- [ ] SLI defined and measurable in production
- [ ] SLO target agreed by product + SRE, written down
- [ ] Error budget policy signed off
- [ ] Capacity plan completed (Little's Law math + headroom)
- [ ] Load test passed at expected launch traffic and at 2x that level
## 3. Observability
- [ ] RED metrics for every endpoint
- [ ] USE metrics for every infrastructure component
- [ ] Structured logging with correlation IDs
- [ ] Distributed tracing for at least the critical path
- [ ] Dashboard exists, linked from the runbook
- [ ] Cardinality budget reviewed (no unbounded labels)
## 4. Alerting
- [ ] Symptom-based alerts on SLO burn rate
- [ ] Every alert has a linked runbook
- [ ] Alert threshold tested in staging (false-positive rate under 5%)
- [ ] Escalation path defined in PagerDuty
## 5. Deploy + Rollback
- [ ] CI runs unit + integration + smoke tests
- [ ] Deploy is automated (no manual steps)
- [ ] Canary stage with at least 5% traffic for 30 min
- [ ] Rollback is one command and tested
- [ ] Database migrations are forward + backward compatible
## 6. Security
- [ ] Secrets in vault, not in env or repo
- [ ] mTLS or equivalent for service-to-service auth
- [ ] AuthZ rules reviewed by security team
- [ ] Dependency vulnerability scan clean
- [ ] PII handling reviewed (if applicable)
## 7. Operations
- [ ] Runbook exists with at least: alert→diagnosis→mitigation
- [ ] On-call rotation populated and acknowledges service ownership
- [ ] Backup + restore tested in the last 90 days (if stateful)
- [ ] DR runbook exists (RTO + RPO documented and tested)
- [ ] Cost forecasted for steady-state and 10x scale
## 8. Launch
- [ ] Feature flag exists for emergency disable
- [ ] Status page entry created
- [ ] Customer support team trained on common questions
- [ ] Post-launch monitoring shift assigned (extra eyes for 48h)
This is a comprehensive list. Tailor it: most services don’t need every item, but the team must consciously skip an item, not forget it.
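The "consciously skip, not forget" rule can be made mechanical. The following is a minimal sketch, not part of any real tooling: the `ItemStatus`, `ChecklistItem`, and `evaluateGate` names are hypothetical. It treats an item as acceptable only if it passed or was waived with a stated reason, and blocks the launch on anything failed or unanswered.

```typescript
// Hypothetical model of a PRR checklist item; all names are illustrative.
type ItemStatus =
  | { kind: "pass" }
  | { kind: "fail"; note: string }
  | { kind: "waived"; reason: string; approvedBy: string }
  | { kind: "unanswered" };

interface ChecklistItem {
  section: string;
  text: string;
  status: ItemStatus;
}

interface GateResult {
  launchAllowed: boolean;
  blockers: string[];
}

// A launch is allowed only when every item is a pass or an explicit waiver.
// Unanswered items and waivers without a reason are blockers.
function evaluateGate(items: ChecklistItem[]): GateResult {
  const blockers: string[] = [];
  for (const item of items) {
    const label = `${item.section}: ${item.text}`;
    const status = item.status;
    switch (status.kind) {
      case "pass":
        break;
      case "waived":
        if (status.reason.trim() === "") {
          blockers.push(`${label} (waived without a reason)`);
        }
        break;
      case "fail":
        blockers.push(`${label} (failed: ${status.note})`);
        break;
      case "unanswered":
        blockers.push(`${label} (not answered)`);
        break;
    }
  }
  return { launchAllowed: blockers.length === 0, blockers };
}

const launchReview: ChecklistItem[] = [
  { section: "Reliability", text: "SLI defined and measurable in production", status: { kind: "pass" } },
  { section: "Deploy + Rollback", text: "Rollback is one command and tested", status: { kind: "fail", note: "rollback never exercised" } },
  { section: "Security", text: "PII handling reviewed", status: { kind: "waived", reason: "service stores no PII", approvedBy: "security" } },
  { section: "Operations", text: "Backup + restore tested in the last 90 days", status: { kind: "unanswered" } },
];

const gateResult = evaluateGate(launchReview);
console.log(gateResult.launchAllowed ? "LAUNCH OK" : "BLOCKED");
for (const blocker of gateResult.blockers) {
  console.log(` - ${blocker}`);
}
```

In this example the waived PII item does not block the launch, while the untested rollback and the unanswered backup item do.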
A worked PRR for a real service
Let’s run a PRR for a payment-service launch. Here is the conversation as it would actually happen:
SRE: "Walk me through the deploy."
DEV: "We git push to main, GitHub Actions builds, deploys to staging,
runs smoke tests, then rolls to production."
SRE: "What if smoke tests pass and prod is broken?"
DEV: "We... have not actually tested rollback. We'd revert the commit
and redeploy."
SRE: "How long does that take?"
DEV: "Maybe 8-10 minutes for the full pipeline."
SRE: "BLOCKED. Rollback must be one command and complete in under
2 minutes. Add `kubectl rollout undo` to your runbook,
test it Monday, then we re-review."
That conversation, repeated across the checklist, is the PRR. Every blocker is something the team would otherwise have discovered during a real outage, at much higher cost.
Codifying the checklist as policy
A checklist that humans run is a checklist humans skip. Bake it into deploy tooling.
# terraform/modules/service/main.tf
# A real internal module that enforces PRR baseline at infrastructure level
variable "service_name" { type = string }
variable "owner_team" { type = string }
variable "runbook_url" { type = string }
# Enforce SLO is realistic (a validation block replaces the old file("ERROR...") trick)
variable "slo_target" {
  type = number # e.g., 0.999
  validation {
    condition     = var.slo_target >= 0.99 && var.slo_target <= 0.99999
    error_message = "The slo_target value must be between 0.99 and 0.99999."
  }
}
# Assumption: the team's escalation policy and Grafana folder are named after owner_team
data "pagerduty_escalation_policy" "team" {
  name = var.owner_team
}
data "grafana_folder" "team" {
  title = var.owner_team
}
# Mandatory: PagerDuty service must exist
resource "pagerduty_service" "this" {
  name              = var.service_name
  escalation_policy = data.pagerduty_escalation_policy.team.id
  alert_creation    = "create_alerts_and_incidents"
}
# Mandatory: SLO recording rules
resource "kubernetes_manifest" "slo_rules" {
  manifest = yamldecode(templatefile("${path.module}/slo-rules.yaml.tpl", {
    service_name = var.service_name
    slo_target   = var.slo_target
  }))
}
# Mandatory: burn-rate alerts wired to PagerDuty
resource "kubernetes_manifest" "burn_alerts" {
  manifest = yamldecode(templatefile("${path.module}/burn-alerts.yaml.tpl", {
    service_name      = var.service_name
    slo_target        = var.slo_target
    pagerduty_service = pagerduty_service.this.id
    runbook_url       = var.runbook_url
  }))
}
# Mandatory: dashboard provisioned
resource "grafana_dashboard" "service" {
  config_json = templatefile("${path.module}/dashboard.json.tpl", {
    service_name = var.service_name
  })
  folder = data.grafana_folder.team.id
}
# Mandatory: runbook URL must respond 200
data "http" "runbook_check" {
  url = var.runbook_url
  lifecycle {
    postcondition {
      condition     = self.status_code == 200
      error_message = "runbook_url ${var.runbook_url} returned ${self.status_code}"
    }
  }
}
To launch a new service, a team must invoke this module. They cannot create a service in production without an SLO, an alert, a dashboard, and a reachable runbook. The PRR has moved from “checklist” to “failed `terraform plan`,” the infrastructure equivalent of a compile-time error.
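The `burn-alerts.yaml.tpl` template itself is not shown, so its contents are unknown. As a rough sketch of the arithmetic a burn-rate alert is usually built on, the example below assumes a 30-day SLO window and the commonly used pairing of "2% of budget in 1 hour" and "5% of budget in 6 hours". Those values, and every function name here, are assumptions rather than anything taken from the module.

```typescript
// Assumption: the SLO is measured over a 30-day rolling window.
const SLO_WINDOW_HOURS = 30 * 24;

// Fraction of requests allowed to fail, e.g. about 0.001 for a 0.999 target.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// A burn rate of 1.0 means the budget runs out exactly at the end of the window.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / errorBudget(sloTarget);
}

// Burn rate that would consume `budgetFraction` of the budget within `alertWindowHours`.
function burnRateThreshold(budgetFraction: number, alertWindowHours: number): number {
  return (budgetFraction * SLO_WINDOW_HOURS) / alertWindowHours;
}

const sloTarget = 0.999;
const fastBurnThreshold = burnRateThreshold(0.02, 1); // 2% of budget in 1 hour
const slowBurnThreshold = burnRateThreshold(0.05, 6); // 5% of budget in 6 hours
const currentBurn = burnRate(0.02, sloTarget); // 2% of requests currently failing

console.log(`fast-burn threshold: ${fastBurnThreshold.toFixed(1)}`);
console.log(`slow-burn threshold: ${slowBurnThreshold.toFixed(1)}`);
console.log(`current burn rate: ${currentBurn.toFixed(1)}, page: ${currentBurn >= fastBurnThreshold}`);
```

Deriving the thresholds from `slo_target` rather than hard-coding error percentages is what lets one module serve services with different SLOs.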
Make the right way the easy way. If your launch tooling enforces 80% of the PRR automatically, the human review can focus on the 20% that requires judgment (capacity, security, customer impact).
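Capacity is one of those judgment items. As a rough illustration of the "Little's Law math + headroom" line in the checklist, the sketch below computes in-flight requests as arrival rate times mean latency, then sizes the fleet with a headroom multiplier. The input numbers and the `CapacityInputs` shape are made up for the example.

```typescript
// Hypothetical inputs for a capacity plan; all values below are illustrative.
interface CapacityInputs {
  peakRequestsPerSecond: number;  // arrival rate (lambda) expected at launch
  meanLatencySeconds: number;     // mean time a request spends in the service (W)
  concurrencyPerInstance: number; // in-flight requests one instance handles safely
  headroomFactor: number;         // e.g. 2 to match a 2x load test
}

// Little's Law: L = lambda * W, the average number of requests in flight.
function inFlightRequests(c: CapacityInputs): number {
  return c.peakRequestsPerSecond * c.meanLatencySeconds;
}

// Instances needed to serve the in-flight load with headroom, rounded up.
function requiredInstances(c: CapacityInputs): number {
  return Math.ceil((inFlightRequests(c) * c.headroomFactor) / c.concurrencyPerInstance);
}

const launchPlan: CapacityInputs = {
  peakRequestsPerSecond: 400,
  meanLatencySeconds: 0.25,
  concurrencyPerInstance: 20,
  headroomFactor: 2,
};

console.log(`in flight at peak: ${inFlightRequests(launchPlan)}`);
console.log(`instances required: ${requiredInstances(launchPlan)}`);
```

The review conversation is then about whether the inputs are believable (is 0.25 s really the mean under load?), not about the arithmetic.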
The runbook (the most-skipped item)
Runbooks are the artifact teams most often get wrong. A good runbook follows a consistent structure:
# Runbook: payment-service / PaymentLatencyHigh
## Alert
PaymentLatencyHigh — p99 latency > 500ms for 5 minutes.
## Severity
SEV2 if duration < 30 min. SEV1 if revenue impact > $1k/min.
## Owner
Team: payments-platform
Slack: #payments-oncall
PagerDuty: payments-primary
## What this means for users
Checkout still works but feels slow. Cart abandonment may rise.
## Diagnostic steps
1. **Check the dashboard:** [direct deep link](https://grafana.example.com/d/abc/payment-service?from=now-1h)
2. **Identify which endpoint:** Look at the "p99 by endpoint" panel.
3. **Check recent deploys:** `kubectl rollout history deployment/payment-service -n payments`
4. **Check upstream dependencies:**
   - Stripe API status: https://status.stripe.com
   - Database CPU: [dashboard panel](https://grafana.example.com/d/db?from=now-1h)
   - Redis hit rate: [panel](https://grafana.example.com/d/redis?panel=4)
## Mitigations (try in order)
### 1. If recent deploy correlates with onset
```bash
kubectl rollout undo deployment/payment-service -n payments
```
Wait 60s. Confirm latency drops on dashboard.
### 2. If database is the bottleneck (CPU > 80%)
```bash
# Fail over to read replica for non-critical reads
kubectl patch configmap payment-config -n payments \
  --type merge \
  -p '{"data":{"DB_READ_FROM_REPLICA":"true"}}'
kubectl rollout restart deployment/payment-service -n payments
```
### 3. If Stripe is the bottleneck
Enable degraded mode (cash-on-delivery only):
```bash
kubectl patch configmap payment-config -n payments \
  --type merge \
  -p '{"data":{"DEGRADED_MODE":"true"}}'
```
## Escalation
- After 15 min unresolved → page payments-secondary
- After 30 min unresolved → page engineering manager
- If revenue > $5k/min impact → page VP Eng
## Last reviewed
2026-04-15 by @alice
The hallmarks: deep links (not "go look at Grafana"), copy-pasteable commands, ordered mitigations from least-risky to most-risky.
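Some of these hallmarks can be checked automatically. The sketch below is a hypothetical lint, not an existing tool: it checks a runbook's markdown for the section headings used in the example above, at least one URL, and at least one fenced command block. It cannot judge whether the links are deep enough or the mitigations well ordered, which still needs a human reviewer.

```typescript
// Section names mirror the example runbook; adjust to your own template.
const REQUIRED_SECTIONS: string[] = [
  "Alert",
  "Severity",
  "Owner",
  "What this means for users",
  "Diagnostic steps",
  "Mitigations",
  "Escalation",
  "Last reviewed",
];

// Built indirectly so this example can itself sit inside a fenced block.
const FENCE = "`".repeat(3);

function lintRunbook(markdown: string): string[] {
  const problems: string[] = [];
  const headings = markdown
    .split("\n")
    .filter((line) => line.startsWith("## "))
    .map((line) => line.slice(3).trim());

  for (const section of REQUIRED_SECTIONS) {
    // Prefix match so "Mitigations (try in order)" satisfies "Mitigations".
    if (!headings.some((heading) => heading.startsWith(section))) {
      problems.push(`missing section: ${section}`);
    }
  }
  if (!/https?:\/\/\S+/.test(markdown)) {
    problems.push("no links: add URLs to the exact dashboard or panel");
  }
  if (!markdown.includes(FENCE)) {
    problems.push("no fenced commands: add copy-pasteable code blocks");
  }
  return problems;
}

const draftRunbook = [
  "# Runbook: payment-service / PaymentLatencyHigh",
  "## Alert",
  "## Severity",
  "## Owner",
  "## Diagnostic steps",
  "1. Go look at Grafana",
  "## Mitigations (try in order)",
].join("\n");

for (const problem of lintRunbook(draftRunbook)) {
  console.log(problem);
}
```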
Post-launch monitoring shift
For SEV-critical launches, schedule extra eyes for the first 48 hours. This is the single highest-leverage tradition in launch ops.
```typescript
// Launch monitoring shift schedule
const launchShift = {
  service: "new-checkout-flow",
  launchTime: "2026-05-15 10:00 PT",
  tier1: {
    duration: "T+0 to T+4h",
    staff: ["author", "reviewer", "on-call SRE"],
    activity: "Active monitoring; 5-min dashboard checks",
  },
  tier2: {
    duration: "T+4h to T+24h",
    staff: ["on-call SRE"],
    activity: "Hourly dashboard checks; lower threshold pages",
  },
  tier3: {
    duration: "T+24h to T+48h",
    staff: ["on-call SRE"],
    activity: "Normal on-call; launch tag still active for prioritization",
  },
};
```
Annual recertification
PRRs are not one-time events. Services drift, so recertify every service annually with an automated check:
# A real CI job that checks PRR compliance
sre-prr-check --service payment-service
# Sample output:
# ✓ SLO defined and recording rules active
# ✓ Burn-rate alerts wired
# ✗ Dashboard returned 404 (was deleted in Grafana cleanup)
# ✓ Runbook URL responsive
# ✗ Last DR drill: 2024-11 (>365 days ago)
# ✗ Backup restore last tested: never
#
# Service payment-service: 3/6 PRR items compliant
# Status: NEEDS RECERTIFICATION
Failed recertification means SRE escalates to the engineering manager. This is not punitive; it is the back-pressure that prevents production from rotting silently.
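The internals of `sre-prr-check` are not shown here. As a hypothetical sketch of the staleness part of such a check, the example below compares each piece of evidence against a maximum age, using the 90-day backup window from the checklist and the 365-day DR window from the sample output. The `Evidence` type and `recertify` function are illustrative names only.

```typescript
// Hypothetical evidence record for one recertification item.
interface Evidence {
  item: string;
  lastVerified: Date | null; // null means never verified
  maxAgeDays: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function ageInDays(then: Date, now: Date): number {
  return Math.floor((now.getTime() - then.getTime()) / MS_PER_DAY);
}

// Any item that was never verified, or verified too long ago, fails.
function recertify(evidence: Evidence[], now: Date): { failures: string[]; status: string } {
  const failures: string[] = [];
  for (const e of evidence) {
    if (e.lastVerified === null) {
      failures.push(`${e.item}: never verified`);
    } else {
      const age = ageInDays(e.lastVerified, now);
      if (age > e.maxAgeDays) {
        failures.push(`${e.item}: ${age} days old (limit ${e.maxAgeDays})`);
      }
    }
  }
  const status = failures.length === 0 ? "CERTIFIED" : "NEEDS RECERTIFICATION";
  return { failures, status };
}

const checkedAt = new Date("2026-04-15T00:00:00Z");
const recertReport = recertify(
  [
    { item: "DR drill", lastVerified: new Date("2024-11-01T00:00:00Z"), maxAgeDays: 365 },
    { item: "Backup restore", lastVerified: null, maxAgeDays: 90 },
    { item: "Runbook review", lastVerified: new Date("2026-04-01T00:00:00Z"), maxAgeDays: 365 },
  ],
  checkedAt,
);

for (const failure of recertReport.failures) {
  console.log(`FAIL ${failure}`);
}
console.log(`Status: ${recertReport.status}`);
```

Passing the current time in as a parameter keeps the check deterministic, which makes it easy to test in CI.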
Stay current
- Google SRE Book — Launch Coordination Engineering — the original PRR checklist
- AWS Well-Architected Framework — pillars + review process
- Google Cloud Architecture Framework — counterpart to AWS WA
- 12-factor app — pre-PRR baseline for service hygiene
Key Takeaways
- PRRs catch most preventable outage causes at roughly a tenth of the cost of finding them during an incident
- The checklist must be checked, not memorized — write it down
- Bake the baseline into Terraform — the right way becomes the only way
- Runbooks need deep links and copy-pasteable commands, not vague pointers
- Recertify annually — services drift, drift causes outages, drift is preventable