Incident management: the humans are part of the system
An incident is the moment your observability tells you something is wrong and your on-call engineer has to decide what to do about it. Everything before that point — the metrics, the dashboards, the SLOs, the alert routing — is setup; everything after it is the discipline this post is about. The two halves are connected: observability without incident response produces alarm fatigue; incident response without observability produces heroic firefighting. The practice worth having is the combined one.
The SRE tradition, from Google’s SRE book onward, frames this as driving three numbers in the right direction: reducing time-to-detect (TTD), reducing time-to-mitigate (TTM), and stretching the time between recurrences of the same incident. Each is a different discipline with different levers. TTD is an observability problem. TTM is an incident-response problem — runbooks, on-call, escalation, clear roles. Recurrence is a postmortem problem — did you actually learn the right thing and fix the underlying cause, or just patch the proximate one?
This post is about the middle and the last of those: what incident response actually involves, how on-call works, the roles and vocabulary that let a distributed response coordinate, and why the postmortem is the part where the learning either happens or doesn’t.
What an incident is
A working definition: an incident is a customer-impacting condition — or a condition about to become customer-impacting — that needs a coordinated response to resolve. Production down. Error rate elevated. Latency spiking. Data loss in progress. A security breach underway.
Two things this definition excludes. First: a page that isn’t customer-impacting is not an incident. If the alert is from a dashboard threshold that no longer reflects anything users care about, the response is to fix the alert, not to run an incident. Second: an incident isn’t defined by the tool’s severity tag. A P3 that is about to become a P1 is an incident now; waiting for the tag to change wastes time.
The concept worth naming: severity. Most organizations use a three-or-four-level scale.
- Sev1 / P0. Full outage, significant customer impact, active revenue loss, security breach, data corruption. All hands. Wake people up. Status page updates. Executive visibility.
- Sev2 / P1. Partial outage, degraded service for some users, elevated error rates within an SLO burn window. On-call responds immediately. May or may not wake people up depending on time and persistence.
- Sev3 / P2. Minor degradation, non-critical feature broken, issues affecting a small fraction of users. On-call during business hours; tracked but not dropping other work.
- Sev4 / P3. Awareness tickets. Known issues. Planned work affecting users. Rarely a “real” incident.
The scale itself matters less than using it consistently. Teams that inflate severity (“everything is Sev1”) lose the signal. Teams that deflate (“it’s only Sev2” while revenue drips away) delay response. The severity should match the impact; the impact should drive the urgency of the response.
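Consistency is easier when the impact-to-severity mapping is written down once rather than re-argued during each incident. A minimal sketch — the thresholds here are illustrative assumptions, not a standard:

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1  # full outage, revenue loss, breach
    SEV2 = 2  # partial outage, SLO burn
    SEV3 = 3  # minor degradation
    SEV4 = 4  # awareness ticket

def classify(pct_users_affected: float, revenue_impacted: bool,
             security_breach: bool) -> Severity:
    """Map observed impact to a severity level.

    The thresholds are hypothetical; the point is that the mapping
    is a lookup during the incident, not a debate.
    """
    if security_breach or (revenue_impacted and pct_users_affected >= 0.5):
        return Severity.SEV1
    if revenue_impacted or pct_users_affected >= 0.1:
        return Severity.SEV2
    if pct_users_affected > 0:
        return Severity.SEV3
    return Severity.SEV4
```

Because the severity drives the urgency, disagreements about the thresholds belong in a calm review of this mapping, not in the first five minutes of a Sev1.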
On-call
On-call is the discipline of having a human available to respond to incidents within some agreed time window. The shape most teams converge on: a rotating schedule, managed in PagerDuty, Opsgenie, or incident.io, with primary and secondary tiers. The primary is paged first. If the primary doesn’t acknowledge within N minutes, the page escalates to the secondary. If neither acknowledges, it escalates to a manager or a wider group.
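The escalation path is simple enough to sketch: page the primary; on no acknowledgement within the timeout, walk down the chain. A toy version, where `acked(who)` stands in for "did this person acknowledge within N minutes?" (real schedulers like PagerDuty manage the timers and calendars for you):

```python
from typing import Callable, Optional

def escalate(chain: list[str], acked: Callable[[str], bool]) -> Optional[str]:
    """Walk the escalation chain until someone acknowledges.

    `chain` is ordered: primary, then secondary, then manager.
    Returns the responder, or None if the page fell all the way through.
    """
    for responder in chain:
        if acked(responder):
            return responder
    return None

# Example: the primary misses the page; the secondary picks it up.
chain = ["primary", "secondary", "manager"]
responder = escalate(chain, acked=lambda who: who == "secondary")
```

A `None` result is itself a reportable event — a page that nobody acknowledged is an incident-response failure regardless of what the underlying alert was.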
The design decisions that matter:
- Who is on-call. The team that owns the service is the right answer in principle; “you build it, you run it” is the SRE formulation. In practice, many organizations still have a central ops team or SRE group that handles some incidents. The tension is real: the owning team has the context but not always the operational experience; the central team has the experience but not always the code-level context.
- Rotation shape. Weekly rotations are common and a good default. Daily rotations sound fairer but hand off context too often. Monthly rotations exhaust the person on-call. Seven days of carrying a pager is about as much as a human does well; some teams split it into a weekday primary and a weekend primary.
- Coverage hours. 24/7 for services that need it; business hours plus “paged for Sev1” for services that don’t. Not every service needs a human awake at 3am; making everything 24/7 is how teams burn out.
- Compensation. On-call is work. Either pay for it (a stipend, overtime, time-in-lieu), or staff it in a way that respects people’s time (rarely-paged rotations, short shifts, post-incident recovery time). Unpaid, unrespected on-call is a retention problem waiting to happen.
The page itself — what triggers an alert that wakes someone up — is an observability question. The thing worth saying here: page on symptoms that indicate user pain, not on component-level health. “Checkout API error rate is above 5% for 5 minutes” belongs in a page. “CPU on one of the checkout API nodes is at 90%” almost never does; it’s a cause, not a symptom, and paging on causes produces false positives from normal operation.
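The "5% for 5 minutes" shape guards against paging on a transient blip: the condition must hold for the whole window. A sketch of that evaluation over per-minute error-rate samples (the threshold matches the example above; the data is made up):

```python
def should_page(error_rates: list[float], threshold: float = 0.05,
                window: int = 5) -> bool:
    """Page only if the error rate exceeded the threshold for the last
    `window` consecutive samples (one sample per minute here).
    A single bad minute does not page; sustained user pain does."""
    if len(error_rates) < window:
        return False
    return all(rate > threshold for rate in error_rates[-window:])

# A one-minute spike: no page.  Five sustained bad minutes: page.
spike = [0.01, 0.01, 0.09, 0.01, 0.01]
sustained = [0.01, 0.06, 0.07, 0.08, 0.06, 0.09]
```

In practice this lives in your alerting system (a Prometheus `for:` clause, a Datadog monitor window), not in application code; the sketch just shows the shape of the condition.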
The first five minutes
The moment an incident starts, the goal is not to fix it. The goal is to coordinate so that the fix can happen efficiently. The pattern that works, across teams and tools:
- Acknowledge the page. Someone is responding. The page doesn’t escalate. Other people know they’re not the primary.
- Declare an incident. Tools like incident.io, FireHydrant, Rootly, and PagerDuty’s incident workflows do this for you: a dedicated Slack channel, a doc for notes, a Zoom bridge if needed. The formal declaration exists so that the response has a container. Without it, the response scatters across DMs and never coheres.
- Assess and classify. Is this a Sev1 or a Sev3? What is the customer impact? The classification drives how much of the organization gets pulled in. Getting this wrong, in either direction, is the biggest early-response mistake.
- Assign roles. Even in a small response, name who is doing what. Usually: an Incident Commander (coordinates, doesn’t debug), a Technical Lead (leads the actual debugging), and a Communications Lead (updates the status page, stakeholders, customer support).
The role separation matters. A lead engineer who is both debugging and updating the status page and coordinating with support is doing three jobs badly. Named roles mean each person has one focus and the hand-offs are explicit.
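The "container" the declaration creates can be as small as a record naming the channel and the three roles; what matters is that it exists and that each role is one person. A hypothetical sketch — the field names are illustrative, not any particular tool's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Minimal incident record: one channel, one start time, named roles."""
    title: str
    severity: int
    commander: str   # coordinates, doesn't debug
    tech_lead: str   # leads the actual debugging
    comms_lead: str  # status page, stakeholders, customer support
    channel: str = ""
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        if not self.channel:
            self.channel = "#inc-" + self.title.lower().replace(" ", "-")
        # One person per role keeps the hand-offs explicit.
        if len({self.commander, self.tech_lead, self.comms_lead}) != 3:
            raise ValueError("each role needs its own owner")

inc = Incident("checkout errors", severity=1,
               commander="alice", tech_lead="bob", comms_lead="carol")
```

The `__post_init__` check encodes the point of the paragraph above: the record refuses to let one person hold all three jobs.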
The Incident Commander
The Incident Commander (IC) is not the senior engineer. The IC may or may not understand the system that’s on fire. The IC’s job is to run the response: who is doing what, what the current state is, what the next decision is, when the next update will happen. They coordinate; they do not debug.
Concretely, an IC:
- Maintains the current hypothesis (“we think it’s the database connection pool exhaustion; $ENG is investigating; next update in 10 minutes”).
- Decides when to escalate (“this is now Sev1, paging the whole team”).
- Decides when to page downstream teams whose systems might be involved.
- Keeps the Slack channel to a single thread of coordination, not debugging chatter.
- Calls the decisions the team is avoiding (“we’re going to roll back; $ENG please start the rollback in 60 seconds”).
- Decides when the incident is over.
Organizations that invest in IC training — running tabletop exercises, rotating the IC role across engineers, building a pool of ICs — respond faster and more calmly than organizations where whoever is senior becomes the IC by default. The ICS (Incident Command System) framework from the emergency-response and wildfire world is the original; the software-industry adaptations are all variations on it.
Mitigation before diagnosis
One principle that distinguishes good incident response from bad: mitigate before you diagnose.
A server is struggling. The instinct is to find out why. The better first move is to stop it struggling — roll back the recent deploy, fail over to the healthy region, drain the hot shard, shed load — and then investigate with the pressure off. Diagnosis with the fire still burning is slower and more error-prone than diagnosis with the fire out.
The mitigations worth having pre-baked:
- Rollback. The previous release is the most common mitigation and the most important to rehearse. A rollback that takes twenty minutes during an incident means twenty minutes of pain you could have skipped.
- Feature flag off. For features deployed dark and turned on progressively, the kill switch is the fastest mitigation: flip the flag, pain stops, investigation begins.
- Traffic shaping. Shedding load, rerouting to a healthy region, scaling up capacity, shifting traffic away from a broken shard.
- Stop the auto-actions. Disable the autoscaler, disable the deployer, freeze the state so nothing else happens while you investigate.
Not every incident can be mitigated before diagnosis. A novel data-corruption bug, a cascading dependency failure, an outage whose cause doesn’t match anything pre-baked — these require actual diagnosis to mitigate. But the default should be: try the obvious mitigations first. Most incidents are caused by a recent change; rolling back resolves most of them.
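"Pre-baked" can be taken literally: the mitigations live in a registry the on-call can trigger by name, rather than being improvised under pressure. A sketch with stubbed actions — the real entries would call your deploy system, flag service, and load balancer:

```python
# Hypothetical registry of pre-baked mitigations. Each entry is a
# no-argument action; the bodies here are stubs standing in for real
# deploy, feature-flag, and traffic tooling.
log: list[str] = []

MITIGATIONS = {
    "rollback": lambda: log.append("rolled back to previous release"),
    "flag-off": lambda: log.append("disabled feature flag"),
    "drain-shard": lambda: log.append("drained hot shard"),
    "freeze-automation": lambda: log.append("paused autoscaler and deployer"),
}

def mitigate(name: str) -> None:
    """Run a named mitigation; unknown names fail loudly, because an
    incident is the wrong time to discover a typo silently."""
    if name not in MITIGATIONS:
        raise KeyError(f"no pre-baked mitigation named {name!r}")
    MITIGATIONS[name]()

mitigate("rollback")  # the most common mitigation, tried first
```

The registry is also where rehearsal happens: each entry is something you can run in a game day, so the twenty-minute rollback gets found before the incident, not during it.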
Communication
A non-trivial fraction of the work of an incident is not fixing it — it is telling people what’s going on. The status page, internal stakeholders, customer support, in some cases executive leadership. Done badly, communication either under-informs (customers are in the dark, support is guessing) or over-informs (every small diagnosis update is broadcast, producing noise).
The pattern that works:
- Regular cadence. Updates at fixed intervals (every 15 or 30 minutes, depending on severity), even if the update is “still investigating, no new information.” Silence reads as chaos; scheduled updates read as control.
- Impact-first. The update opens with what users are experiencing (“checkout is failing for ~30% of users”) and what the current mitigation status is. Technical details come second, if at all.
- No premature “resolved.” The incident is over when the system is healthy and has been for some time, not when the immediate symptom stops. Declaring “resolved” three times in a row erodes trust.
- Separate internal from external. The Slack channel where engineers are debugging is not the status page. The status page is the customer-facing story, written carefully. The communications lead bridges between the two; the debugging channel does not leak to customers.
Tools like Statuspage (Atlassian), Incident.io’s status page, and FireHydrant’s status page integrate with the incident workflow so that the update happens in one place and propagates. This is a small investment that pays off during every Sev1.
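The impact-first, fixed-cadence shape is easy to template, which keeps updates consistent even at 3am. A hypothetical formatter:

```python
from datetime import datetime, timedelta, timezone

def status_update(impact: str, mitigation: str,
                  cadence_minutes: int = 15) -> str:
    """Impact-first update: what users are experiencing, then the
    mitigation status, then when the next update lands -- promised
    even if nothing will have changed by then."""
    next_update = datetime.now(timezone.utc) + timedelta(minutes=cadence_minutes)
    return (f"Impact: {impact}\n"
            f"Status: {mitigation}\n"
            f"Next update by {next_update:%H:%M} UTC.")

print(status_update("checkout is failing for ~30% of users",
                    "rollback in progress"))
```

The promised next-update time is the important field: it is what turns silence from "chaos" into "the next word arrives at 14:30."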
The postmortem
Once the incident is resolved, the real work starts.
A postmortem (also retrospective, learning review, incident review) is a structured analysis of what happened, why, and what to change. Done well, it is the single highest-leverage activity in reliability engineering — the thing that turns each incident into durable learning for the organization. Done badly, it is a blame session, or a checkbox, or a document nobody reads.
A useful postmortem document includes:
- Summary. Two paragraphs: what the user-visible impact was, what happened, how it was resolved, the headline number (duration, users affected, revenue impact).
- Timeline. Events in chronological order, with timestamps and named actors. “14:02 — alert fires on checkout latency. 14:03 — on-call acknowledges. 14:07 — IC declared. 14:12 — first hypothesis (connection pool exhaustion). 14:18 — rollback initiated…”
- Root cause. The underlying reason, not the proximate one. “The deploy contained a change that raised the database connection count; the connection pool was sized for the previous count; the pool exhausted under load.”
- Contributing factors. Things that made the incident worse or harder to detect. Missing alert. Slow rollback. Runbook out of date. Pager routing broken. Context missing in Slack.
- What went well. Actively look for this. Teams tend to write postmortems as catalogs of failure; there is always something that worked, and naming it teaches the organization what to preserve.
- Action items. Specific, assigned, with dates. Generic action items (“improve monitoring”) don’t ship; specific ones (“add alert on connection pool utilization above 80%, owned by $ENG, due $DATE”) do.
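The "specific, assigned, with dates" rule can be enforced at write time: refuse an action item that lacks an owner or is too vague to ship. A sketch (the four-word heuristic is an illustrative stand-in for whatever "specific enough" check your team agrees on):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ActionItem:
    """A postmortem action item that refuses to be generic:
    it must name an owner and carry a due date."""
    description: str
    owner: str
    due: date

    def __post_init__(self):
        if not self.owner.strip():
            raise ValueError("action item needs a named owner")
        if len(self.description.split()) < 4:
            # "improve monitoring" is two words; real items say
            # what, where, and by how much.
            raise ValueError("description too vague to ship")

item = ActionItem("add alert on connection pool utilization above 80%",
                  owner="eng-payments", due=date(2025, 7, 1))
```

The same shape maps directly onto a ticket in whatever tracker the feature work lives in, which is where the action items belong anyway.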
The blameless discipline
The central norm of modern postmortems — originating with John Allspaw’s writing and now standard across mature engineering organizations — is blamelessness. The idea: the person who pushed the button that caused the outage is not the cause. The system that let them push the button without catching the mistake is the cause.
This is not softness. It is a functional observation. If the deploy system lets a broken change reach prod, the problem is not that an engineer wrote a broken change — engineers write broken changes constantly. The problem is that the deploy system didn’t catch it. Blaming the engineer for the broken change misses the actionable part: improve the deploy system so the next broken change is also caught.
Blamelessness has two practical consequences:
- Engineers talk honestly about what happened. If mistakes are punished, people minimize them in postmortems, or skip the postmortem, or quietly stop reporting minor incidents. The organization loses visibility. Blameless retrospectives produce honest accounts, which produce better learning.
- The focus shifts to systemic fixes. “$ENG pushed a broken change” is not an action item. “Deploys of changes touching the connection pool require a review from $TEAM” is. The action items on a blameless postmortem are about systems and processes, not about individuals, because those are the things that are actually changeable.
Blamelessness does not mean “no accountability.” It means accountability is assigned to the system, not to the individual, and that the questions asked are “how do we prevent this?” not “whose fault was this?”
The culture is harder to establish than the mechanics. An organization where a senior engineer makes a mistake and gets publicly chastised has learned the opposite lesson about postmortems, regardless of what the postmortem template says. The discipline requires leadership to model it — especially when the mistake is expensive.
Near-misses
The incidents worth analyzing are not just the ones that happened. The near-misses — the ones that almost happened, where some guardrail caught the problem, where the rollback was fast enough that users didn’t notice, where the on-call found the issue before the page fired — are information too.
Teams that only postmortem actual customer-impact incidents under-learn. A near-miss is a successful defense-in-depth; it tells you which layer caught the problem and which layers might have failed if that one hadn’t. Some organizations (Etsy historically, many mature SRE cultures) run postmortems on significant near-misses as well. The ROI is good — the same structured analysis, none of the customer impact, often a clearer causal story because the chain was shorter.
Incident metrics
The metrics worth tracking, carefully:
- MTTD (mean time to detect). How long from incident start to the team noticing. Driven by observability quality.
- MTTA (mean time to acknowledge). How long from page to on-call acknowledging. Driven by on-call health.
- MTTM (mean time to mitigate). How long from detection to the user-visible impact stopping. Driven by runbook/rollback quality.
- MTTR (mean time to resolve — or to recover, depending on whose definition). How long until fully resolved, including root-cause fix where relevant.
- Incident frequency. How many Sev1/2s per quarter.
- Recurrence rate. How often the same root cause comes back.
Each of these can be gamed: too much focus on MTTR drives teams to declare “resolved” prematurely; too much focus on frequency drives teams to reclassify real incidents as tickets. The metrics are informative; they are not the goal. The goal is a system that fails less and fails better. The metrics are indicators of how that is going, not of how well the ritual is being performed.
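Given a timeline with the relevant timestamps, the per-incident numbers fall out by subtraction (the means are then just averages over incidents). A sketch using a timeline shaped like the postmortem example earlier — the "start" timestamp is an assumed addition, since detection is measured against when the problem actually began:

```python
from datetime import datetime

def minutes_between(a: str, b: str) -> float:
    """Elapsed minutes between two 'HH:MM' timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(b, fmt) - datetime.strptime(a, fmt)
    return delta.total_seconds() / 60

# One incident's timeline: start, detect, ack, mitigate, resolve.
incident = {"start": "13:58", "detect": "14:02", "ack": "14:03",
            "mitigate": "14:18", "resolve": "14:45"}

ttd = minutes_between(incident["start"], incident["detect"])     # detect
tta = minutes_between(incident["detect"], incident["ack"])       # acknowledge
ttm = minutes_between(incident["detect"], incident["mitigate"])  # mitigate
ttr = minutes_between(incident["start"], incident["resolve"])    # resolve
```

The subtraction is trivial; the hard part is the timeline discipline that produces honest timestamps in the first place.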
The patterns that don’t work
The postmortem nobody reads. A document written, filed, forgotten. No action items, or action items with no owners or no dates. Fix: treat action items as real work, tracked in the same system as feature work, with completion tracked. If the action items don’t close, the postmortem was a performance.
The action item pile-up. Every incident produces five action items; none get done. The pile grows. Over time, the postmortem becomes a way of generating work-that-won’t-happen rather than work that will. Fix: cap action items per incident (usually 2-3), prioritize ruthlessly, close things that are stale, and occasionally review the pile for recurring themes that signal a bigger problem.
Root cause as one thing. “The root cause was an engineer’s typo.” Almost no production incident has a single root cause; there is a chain, and the chain is where the learning lives. The technique popularized by the Learning from Incidents community, which grew out of Allspaw’s work: ask “how was this possible?” at each level of the chain. The engineer’s typo was possible because the review didn’t catch it; the review didn’t catch it because the change was too large to review carefully; the change was too large because the team deploys weekly; … each step surfaces something changeable.
The blame-in-blameless-clothing. The postmortem uses the word “blameless” and then spends three pages explaining why the specific engineer made the specific mistake. Fix: if you could remove every sentence about an individual and the document still made sense, you wrote a good one. If it wouldn’t, rewrite.
Incident theater. Elaborate incident-management processes for incidents that didn’t need them. Over time, the process gets avoided — people don’t “declare incidents” for medium things because the process is too heavy. Fix: make the lightweight declaration lightweight. A Slack command that creates a channel, assigns an IC, and starts a timer should take seconds. The heavier process kicks in when severity warrants.
The rule
Incident response is a practiced discipline, not a heroic one. The teams that do it well have: named roles, clear severity definitions, pre-baked mitigations, structured communication, and a blameless learning culture. The teams that do it badly have: the same senior engineer on every page, ad-hoc coordination in DMs, mitigations invented during each incident, postmortems written once and filed, action items that never close.
The practice composes with the rest. Observability gives you the signal; CI/CD gives you the rollback; IaC gives you the reproducible environments; platform engineering gives you the paved path that makes the common fixes safe. Incident management is the discipline that sits on top of those and uses them well when things go wrong.
The goal is not zero incidents. Zero incidents is usually a sign that you’re not detecting them, not that you don’t have them. The goal is incidents that are detected fast, mitigated fast, and learned from. A team that has ten incidents a quarter, each resolved in under thirty minutes with a postmortem that produces one durable improvement, is in better shape than a team that has two incidents a quarter, each dragged out for hours because nobody was trained to run one. The number of incidents is noise; the response quality is signal.