Incident Response & On-Call

ICS roles, severity classification, comms cadence, and the on-call rotation that doesn't burn engineers out.

Real-World Analogy

A fire drill — the procedures exist so that when the real fire happens, nobody is improvising.

The five phases of an incident

Every incident, regardless of size, moves through the same five phases. Naming them out loud during the response keeps the team coordinated.

1. DETECT      Alert fires (or human notices)
2. TRIAGE      Assess severity, assemble responders, declare incident
3. MITIGATE    Stop the bleeding (rollback, failover, kill switch)
4. RESOLVE     Fix the root cause (often hours/days after mitigation)
5. LEARN       Postmortem, action items, share learnings

The biggest mistake junior responders make is conflating Mitigate and Resolve. Mitigate first, debug later. If a deploy is on fire, roll it back, then investigate the rolled-back code at leisure. Don’t try to fix forward in the middle of an outage.
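The phase ordering can be encoded so that tooling refuses to skip a step. The sketch below is illustrative only: `INCIDENT_PHASES`, `advancePhase`, the incident object shape, and the `INC-0000` id are hypothetical names, not part of any particular incident tool.

```javascript
// Hypothetical sketch: the five phases as an ordered list.
const INCIDENT_PHASES = ['DETECT', 'TRIAGE', 'MITIGATE', 'RESOLVE', 'LEARN'];

// Move an incident to the next phase and record when it happened.
// Phases only advance one step at a time, so RESOLVE is
// unreachable until MITIGATE has been entered.
function advancePhase(incident, atIso) {
  const index = INCIDENT_PHASES.indexOf(incident.phase);
  if (index === -1) {
    throw new Error(`Unknown phase: ${incident.phase}`);
  }
  if (index === INCIDENT_PHASES.length - 1) {
    return incident; // already in LEARN, nothing left to advance to
  }
  const phase = INCIDENT_PHASES[index + 1];
  return {
    ...incident,
    phase,
    history: [...incident.history, { phase, at: atIso }],
  };
}

// Usage: a rollback has just started, so the incident enters MITIGATE.
const triagedIncident = { id: 'INC-0000', phase: 'TRIAGE', history: [] };
const mitigatingIncident = advancePhase(triagedIncident, '2026-05-03T14:31:00Z');
console.log(mitigatingIncident.phase); // MITIGATE
```

Recording a timestamp at each transition also gives the postmortem a ready-made measure of time to mitigate versus time to resolve.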

Severity classification (a real one)

Severity must be defined in writing, before the incident, and posted in the on-call runbook. Here is a battle-tested grid you can adapt:

SEV1  Customer-visible outage, data loss, security breach,
      revenue impact >$1k/min, regulatory exposure
      → Page incident commander + on-call lead + comms
      → Stand up war room within 5 minutes
      → Status page update within 15 minutes
      → Executive update every 30 minutes

SEV2  Major degradation: significant feature broken, key
      customer impacted, error budget burning at >10x the sustainable rate
      → Page primary on-call
      → Internal Slack channel
      → Status page update if customer-visible

SEV3  Minor degradation: non-critical feature impaired,
      latency elevated but within SLO
      → Ticket-grade alert; no page
      → Investigate next business day

SEV4  Cosmetic, internal-only, or self-recovering
      → Log it; aggregate weekly

The grid prevents the most common dispute in incident response: “is this a SEV1 or a SEV2?” Decide it before the adrenaline hits.

Err toward higher severity at declaration time. It is cheap to downgrade a SEV1 to SEV2 ten minutes in. It is expensive to realize at hour 2 that you understaffed the response.
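One way to keep the grid from being re-argued mid-incident is to encode it. The following is a minimal sketch, not a definitive classifier: the signal names (`customerVisibleOutage`, `revenueLossPerMinUsd`, and so on) are hypothetical fields your alerting would need to supply, and it assumes the ">10x" threshold refers to error-budget burn rate, commonly computed as the observed error ratio divided by the budget `1 - SLO target`.

```javascript
// Burn rate: how fast the error budget is being consumed relative to
// the rate that would exactly exhaust it over the SLO window.
function errorBudgetBurnRate(errorRatio, sloTarget) {
  const budget = 1 - sloTarget;
  if (budget <= 0) throw new Error('sloTarget must be below 1');
  return errorRatio / budget;
}

// Hypothetical signal names; thresholds are copied from the grid.
// Checks run from most to least severe, so the highest match wins.
function classifySeverity(signals) {
  const s = { revenueLossPerMinUsd: 0, burnRate: 0, ...signals };
  if (s.customerVisibleOutage || s.dataLoss || s.securityBreach ||
      s.regulatoryExposure || s.revenueLossPerMinUsd > 1000) {
    return 'SEV1';
  }
  if (s.majorFeatureBroken || s.keyCustomerImpacted || s.burnRate > 10) {
    return 'SEV2';
  }
  if (s.minorFeatureImpaired || s.latencyElevatedWithinSlo) {
    return 'SEV3';
  }
  return 'SEV4';
}

// Usage: 2% of requests failing against a 99.9% SLO burns budget at roughly 20x.
const observedBurnRate = errorBudgetBurnRate(0.02, 0.999);
console.log(classifySeverity({ burnRate: observedBurnRate })); // SEV2
console.log(classifySeverity({ dataLoss: true }));             // SEV1
```

A function like this only proposes a starting severity. The person declaring the incident can still raise it, in line with erring high at declaration time.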

Incident Command System (ICS) roles

Borrowed from firefighting and FEMA, ICS gives every responder a clear lane.

INCIDENT COMMANDER (IC)
  Owns the response. Single decision-maker.
  Does NOT debug. Coordinates, decides, delegates.
  Can be a junior engineer — authority comes from the role.

OPERATIONS LEAD (OL)
  Drives technical mitigation. Runs the actual debugging.
  Reports findings to IC.

COMMUNICATIONS LEAD (CL)
  Owns external comms: status page, customer support, executives.
  Frees IC and OL to focus on the system.

SCRIBE
  Captures the timeline in real time. Every command run, every
  hypothesis, every decision. The scribe document becomes the
  postmortem skeleton.

SUBJECT MATTER EXPERTS (SMEs)
  Pulled in by IC as needed. Database SME, network SME, etc.
  They answer questions; they do not run the incident.

For a SEV1, fill all four named roles (IC, OL, CL, Scribe) and pull in SMEs as needed. For a SEV2, IC + OL + Scribe is enough. The IC must explicitly NOT touch the keyboard; their job is to think one level above the debugging.
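The staffing rule can be checked mechanically when an incident is declared. This is a rough sketch under stated assumptions: `REQUIRED_ROLES`, `findStaffingGaps`, and the assignment object are hypothetical names, and the "IC and OL must be different people" check is an inference from the rule that the IC does not debug.

```javascript
// Roles that must be filled at each severity, per the text above.
// SEV3 and SEV4 are not paged incidents, so they require none here.
const REQUIRED_ROLES = {
  SEV1: ['IC', 'OL', 'CL', 'SCRIBE'],
  SEV2: ['IC', 'OL', 'SCRIBE'],
};

// Returns unfilled roles plus any role conflicts for a declared incident.
function findStaffingGaps(severity, assignments) {
  const required = REQUIRED_ROLES[severity] ?? [];
  const missing = required.filter((role) => !assignments[role]);

  const conflicts = [];
  // The IC coordinates and the OL debugs; one person cannot do both.
  if (assignments.IC && assignments.IC === assignments.OL) {
    conflicts.push('IC and OL are the same person');
  }
  return { missing, conflicts };
}

// Usage: a SEV1 where one engineer tried to take both IC and OL.
const staffingGaps = findStaffingGaps('SEV1', {
  IC: 'alice',
  OL: 'alice',
  SCRIBE: 'dan',
});
console.log(staffingGaps.missing);   // ['CL']
console.log(staffingGaps.conflicts); // ['IC and OL are the same person']
```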

The on-call rotation that doesn’t break people

// Anti-patterns that destroy on-call teams:
const broken = {
	rotation: 'Same 3 people forever', // burnout in 6 months
	handoff: 'None — silent transition', // dropped context
	pageVolume: '10+ pages per shift', // sleep deprivation
	daytimeWork: 'Same as non-on-call week', // exhaustion
	comp: "None — 'it's part of the job'" // resentment
};

// What works:
const sustainable = {
	rotation: '8+ engineers, 1-week shifts',
	handoff: '30-min sync at start: open issues, recent deploys, watchlist',
	pageVolume: '<5 pages/week (else: fix the noise)',
	daytimeWork: 'On-call week is project-light; backlog/runbook focused',
	comp: 'Per-shift stipend OR comp time off after',
	followUp: 'Every page reviewed in weekly on-call retro'
};

If your team only has 4 engineers, you do not have an on-call rotation. You have a death march. Hire more, narrow the on-call scope, or use a paid follow-the-sun service.
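The arithmetic behind that warning is easy to check. The sketch below assumes an evenly shared rotation with back-to-back shifts; `rotationLoad` and `isPageVolumeHealthy` are hypothetical helper names.

```javascript
// Per-engineer load for an evenly shared rotation.
function rotationLoad(engineerCount, shiftWeeks = 1) {
  if (!Number.isInteger(engineerCount) || engineerCount < 1) {
    throw new Error('engineerCount must be a positive integer');
  }
  return {
    // Each engineer covers 1/N of the year regardless of shift length.
    weeksOnCallPerYear: 52 / engineerCount,
    // Everyone else takes a shift before it is your turn again.
    weeksOffBetweenShifts: (engineerCount - 1) * shiftWeeks,
  };
}

// Target from the list above: fewer than 5 pages per week.
function isPageVolumeHealthy(pagesThisWeek, weeklyLimit = 5) {
  return pagesThisWeek < weeklyLimit;
}

console.log(rotationLoad(4)); // 13 weeks on call per year, 3 weeks off between shifts
console.log(rotationLoad(8)); // 6.5 weeks on call per year, 7 weeks off between shifts
console.log(isPageVolumeHealthy(12)); // false
```

With four engineers, each person spends a quarter of the year on call and gets only three weeks between shifts, before accounting for vacation or sick leave.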

The handoff template

# On-call handoff: [outgoing engineer] → [incoming engineer]

Date: 2026-05-03
Time: 09:00 PT

## Open incidents

- INC-2247: Checkout p99 elevated since Friday. Mitigated by autoscaler bump.
  Root cause TBD. Next step: OL to review traces.

## Recent deploys (past 48h)

- payment-service v2.14.0 — small refactor, no incidents
- checkout v3.8.1 — autoscaler config change (related to INC-2247)

## Watchlist

- DB primary CPU trending up (60% → 75% over 7 days)
- Kafka consumer lag on order-events occasionally spikes; tolerable for now

## Known noise

- "S3 5xx burst" alert fires daily at 03:15 UTC during cost-report job
  → Suppressed in PagerDuty until INFRA-882 lands

## Anything you should know

- Big marketing push tomorrow 10am PT — expect 3-5x normal traffic

Status page communication

External comms is its own discipline. The template that works for almost any incident:

[INVESTIGATING]   We are investigating reports of [symptom]. We will
                  update within 30 minutes.

[IDENTIFIED]      We have identified the cause of [symptom] as
                  [neutral description, no blame, no jargon].
                  Mitigation is underway.

[MONITORING]      A fix has been applied to [symptom]. We are
                  monitoring to confirm full recovery.

[RESOLVED]        [Symptom] has been resolved. A postmortem will be
                  published within [N] business days.

Rules:

Update on a regular cadence even if there’s nothing new. “Still investigating, next update at 14:30” is better than 90 minutes of silence.
Plain language. “Some users may be unable to check out” beats “the order pipeline experienced a partial degradation.”
No internal jargon. The reader does not know what “the canary” is.
No blame, no speculation. Especially not on third parties — you’ll be wrong half the time.
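Two of these rules lend themselves to small helpers: promising a concrete next-update time, and catching internal terms before a draft is published. The sketch below is one possible shape, not a status page integration. `INTERNAL_JARGON` is a hypothetical list you would build from your own team vocabulary, and the 30-minute default is taken from the INVESTIGATING template.

```javascript
// Time of the next promised update, formatted as HH:MM UTC.
function nextUpdateTime(now, cadenceMinutes = 30) {
  const next = new Date(now.getTime() + cadenceMinutes * 60000);
  return `${next.toISOString().slice(11, 16)} UTC`;
}

// A holding update for when there is nothing new to report.
function holdingUpdate(now, cadenceMinutes = 30) {
  return `Still investigating, next update at ${nextUpdateTime(now, cadenceMinutes)}.`;
}

// Hypothetical banned-term list for external messages.
const INTERNAL_JARGON = ['canary', 'p99', 'connection pool'];

// Returns the internal terms found in a draft so the CL can rewrite them.
function findJargon(draft, terms = INTERNAL_JARGON) {
  const lower = draft.toLowerCase();
  return terms.filter((term) => lower.includes(term));
}

const lastUpdateSentAt = new Date('2026-05-03T14:00:00Z');
console.log(holdingUpdate(lastUpdateSentAt));
// Still investigating, next update at 14:30 UTC.
console.log(findJargon('Rolling back the canary now.')); // ['canary']
```

A substring match is deliberately crude. It will flag false positives, which is acceptable for a pre-publish review step where a human makes the final call.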

A complete incident channel template

Spin up a dedicated Slack/Teams channel for every SEV1/SEV2. Pin this template at the top:

# Incident: INC-2271 — Checkout returning 503s

**Status**: INVESTIGATING
**Severity**: SEV1
**Started**: 2026-05-03 14:22 UTC
**IC**: @alice
**OL**: @bob
**CL**: @carol
**Scribe**: @dan
**Status page**: status.example.com/incidents/abc123

## Current hypothesis

Database connection pool exhaustion in EU region.

## Mitigation in progress

1. Bumping pool size from 50 → 100 (in progress)
2. Diverting EU traffic to US (decided against — too much latency)

## Timeline

14:22 Page fired (CheckoutErrorBudgetFastBurn)
14:24 IC declared SEV1
14:28 OL identified DB connection saturation
14:31 Hypothesis posted, mitigation 1 started
14:35 Connection pool bump deployed to canary
...

Everything goes in this channel. Threads for side conversations. No DMs about the incident — the scribe needs to capture every decision.
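If the scribe keeps the timeline as structured entries rather than free text, the pinned message and the postmortem skeleton can both be rendered from the same data. A minimal sketch, assuming hypothetical helpers `addTimelineEntry` and `renderTimeline` and ISO 8601 timestamps in UTC:

```javascript
// Append a timestamped entry; returns a new array so earlier snapshots stay intact.
function addTimelineEntry(timeline, atIso, text) {
  const at = new Date(atIso);
  if (Number.isNaN(at.getTime())) {
    throw new Error(`Invalid timestamp: ${atIso}`);
  }
  return [...timeline, { at, text }];
}

// Render entries in chronological order as "HH:MM text" lines (UTC),
// even if they were recorded out of order.
function renderTimeline(timeline) {
  return [...timeline]
    .sort((a, b) => a.at - b.at)
    .map((entry) => `${entry.at.toISOString().slice(11, 16)} ${entry.text}`)
    .join('\n');
}

// Usage: the scribe logs the declaration first, then backfills the page.
let incidentTimeline = [];
incidentTimeline = addTimelineEntry(incidentTimeline, '2026-05-03T14:24:00Z', 'IC declared SEV1');
incidentTimeline = addTimelineEntry(incidentTimeline, '2026-05-03T14:22:00Z', 'Page fired (CheckoutErrorBudgetFastBurn)');
console.log(renderTimeline(incidentTimeline));
// 14:22 Page fired (CheckoutErrorBudgetFastBurn)
// 14:24 IC declared SEV1
```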

Drills (the part everyone skips)

You cannot expect people to perform incident response correctly under pressure if they have never practiced. Run quarterly drills:

// Quarterly incident drill template

const drill = {
	scenario: 'Database primary becomes unreachable from app tier',
	injection: 'Block port 5432 with iptables on db-primary',
	region: 'staging',
	observers: ['sre-lead', 'vp-eng'],
	participants: ['full on-call rotation'],

	successCriteria: [
		'IC declared within 5 minutes of first symptom',
		'Status page updated within 15 minutes (simulated)',
		'Mitigation (failover) executed within 20 minutes',
		'Scribe captured complete timeline'
	],

	postDrill: 'Retro within 24h; action items into Jira'
};

The first drill always exposes that the runbook has a typo, the failover script needs a flag that nobody knows, and the IC role rotation is unclear. That’s the point — better to discover it in staging on a Tuesday than in production at 3am.
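The timed success criteria can be scored from the scribe's timestamps after the drill. The sketch below is illustrative: `DRILL_LIMITS_MIN`, `scoreDrill`, and the event names are hypothetical, and it assumes every limit is measured from the first symptom (the drill template states this only for the IC declaration). The fourth criterion, a complete scribe timeline, is left to human judgment in the retro.

```javascript
// Time limits in minutes, copied from the drill's success criteria.
// Assumption: every limit is measured from the first symptom.
const DRILL_LIMITS_MIN = {
  icDeclared: 5,
  statusPageUpdated: 15,
  failoverExecuted: 20,
};

function minutesBetween(startIso, endIso) {
  return (new Date(endIso) - new Date(startIso)) / 60000;
}

// Compare recorded event times against each limit.
// An event that was never recorded counts as a failure.
function scoreDrill(firstSymptomAt, events) {
  return Object.entries(DRILL_LIMITS_MIN).map(([name, limitMin]) => {
    const at = events[name];
    const elapsedMin = at ? minutesBetween(firstSymptomAt, at) : null;
    const passed = elapsedMin !== null && elapsedMin <= limitMin;
    return { name, limitMin, elapsedMin, passed };
  });
}

// Usage with made-up drill timestamps.
const drillResults = scoreDrill('2026-05-05T10:00:00Z', {
  icDeclared: '2026-05-05T10:04:00Z',
  statusPageUpdated: '2026-05-05T10:18:00Z',
  failoverExecuted: '2026-05-05T10:19:00Z',
});
for (const r of drillResults) {
  console.log(`${r.name}: ${r.passed ? 'PASS' : 'FAIL'} (${r.elapsedMin} min, limit ${r.limitMin})`);
}
// icDeclared: PASS (4 min, limit 5)
// statusPageUpdated: FAIL (18 min, limit 15)
// failoverExecuted: PASS (19 min, limit 20)
```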

Stay current

PagerDuty Incident Response docs: the public playbook
Google SRE Book, "Managing Incidents" chapter: IC roles defined
Grafana OnCall: open-source rotation tool with a genuinely usable free tier
Increment magazine, On-Call issue: how mature orgs run pagers

Key Takeaways

Mitigate first, debug later — rollback before root cause
Severity grid is written before the incident, not argued during one
ICS gives every responder a clear lane — IC decides, OL debugs, CL communicates
8+ engineers, 1-week shifts, and compensated on-call are the floor for a sustainable rotation
Drill quarterly — the first time you exercise a runbook should not be in a real outage