Building an Incident Management Process from Scratch

The first time production breaks badly, your team will improvise. Someone will notice. Someone will Slack about it. People will jump on a call. Maybe a fix ships. Maybe customers find out before you tell them. Eventually it ends.

Improvisation works for a while. It stops working around the time you have three engineers, paying customers, and an SLA. After that point, an incident management process is the difference between 90-minute recoveries and 6-hour ones.

The components you actually need

A workable incident process at startup scale has five pieces. Not 50.

1. Detection. Alerts go to a single channel that a human is responsible for at any given time. On-call rotation with PagerDuty, BetterStack, or Opsgenie. The on-call engineer is the first responder for every alert, no matter the severity.

2. Declaration. When the on-call engineer determines this is a real incident (vs a flaky alert), they declare it. A simple Slack command (/incident in your incident management tool) creates a dedicated channel, pings the right people, and starts the timeline.

3. Coordination. One person is the Incident Commander. Their job is to coordinate, not to debug. They run the bridge, decide on actions, assign owners, communicate to stakeholders. Often this is the on-call engineer for small incidents. For severe incidents, it should be a more senior person.

4. Communication. Customer communication on the status page within 15 minutes of declaring an incident, even if you do not know much yet. Internal updates in the incident channel every 30 minutes during active incidents. Customer follow-up email after resolution.

5. Postmortem. Within five business days of resolution. Blameless, focused on systems and decisions, with concrete action items and owners. Reviewed in a 30-minute meeting with the team.
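The communication cadence in step 4 is simple enough to track mechanically. Here is a minimal sketch of that clock, assuming you record when the incident was declared and when the last internal update went out. The class and field names (`CommsClock`, `status_page_posted`, and so on) are made up for illustration and do not come from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Intervals come from the text above; the names are illustrative assumptions.
STATUS_PAGE_DEADLINE = timedelta(minutes=15)
INTERNAL_UPDATE_INTERVAL = timedelta(minutes=30)


@dataclass
class CommsClock:
    """Tracks which communication duties are due for one incident."""
    declared_at: datetime
    last_internal_update: Optional[datetime] = None
    status_page_posted: bool = False

    def status_page_due(self) -> datetime:
        # First customer-facing update: 15 minutes after declaration.
        return self.declared_at + STATUS_PAGE_DEADLINE

    def next_internal_update_due(self) -> datetime:
        # Internal updates: every 30 minutes, counted from the last one
        # (or from declaration if none has been sent yet).
        last = self.last_internal_update or self.declared_at
        return last + INTERNAL_UPDATE_INTERVAL

    def overdue(self, now: datetime) -> List[str]:
        """Return the names of the duties that are past their deadline."""
        late = []
        if not self.status_page_posted and now > self.status_page_due():
            late.append("status_page")
        if now > self.next_internal_update_due():
            late.append("internal_update")
        return late
```

A bot or the Communicator could call `overdue()` once a minute and nudge the incident channel when the list is non-empty. How you wire that into Slack or a status page is up to your tooling.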

Severity levels that make sense

Three levels is enough at startup scale. More than that and you spend energy debating severity instead of fixing the problem.

  • P1: Customer-facing outage or data risk. Wake people up. All-hands until resolved. Customer comms required.
  • P2: Significant degradation, but customers can still mostly work. Page people during business hours only. Coordinated response. Customer comms required.
  • P3: Minor degradation, or an internal issue that has not reached customers yet but could. Business hours only. Single owner. Update the status page if customer-visible.
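Writing the levels down as data keeps the debate short during an incident. The sketch below is one possible encoding, not a standard: `Severity`, `ResponsePolicy`, `classify`, and `should_respond_now` are hypothetical names, and reading P2 as "page during business hours only" is an assumption about the rule above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class ResponsePolicy:
    page_after_hours: bool   # wake people up outside business hours?
    response: str            # shape of the response
    customer_comms: str      # when customer communication is expected


# One row per level, mirroring the three bullets above.
POLICIES = {
    Severity.P1: ResponsePolicy(True, "all-hands", "required"),
    Severity.P2: ResponsePolicy(False, "coordinated", "required"),
    Severity.P3: ResponsePolicy(False, "single owner", "if customer-visible"),
}


def classify(outage_or_data_risk: bool, significant_degradation: bool) -> Severity:
    """Pick the highest severity whose condition holds."""
    if outage_or_data_risk:
        return Severity.P1
    if significant_degradation:
        return Severity.P2
    return Severity.P3


def should_respond_now(severity: Severity, business_hours: bool) -> bool:
    """Only P1 interrupts people outside business hours (assumed reading)."""
    return business_hours or POLICIES[severity].page_after_hours
```

For example, `should_respond_now(Severity.P2, business_hours=False)` returns `False`, so a P2 raised at 3 a.m. waits for the morning, while a P1 does not.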

The roles, even on a small team

Even with only 4 engineers, separate the roles during an incident.

Incident Commander: coordinates, does not debug. The brain.

Subject Matter Expert: actually debugging the issue. The hands.

Communicator: writes the status page updates, emails customers, fields stakeholder questions. The voice.

Scribe: keeps the timeline. Notes every action, every observation, every decision. The memory.

For a small incident, one person can wear multiple hats. For a serious one, the roles split. The Commander never debugs because the cognitive load of running the bridge and writing code at the same time is what causes 90-minute incidents to stretch to 6 hours.
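The role rules can be checked when an incident is declared. This is a rough sketch under a few assumptions: roles are assigned by name in a plain dictionary, someone has already decided whether the incident is "serious", and the only hard rule enforced is the one the text insists on, that the Commander does not also debug. `Role` and `validate_roles` are illustrative names.

```python
from enum import Enum
from typing import Dict, List


class Role(Enum):
    COMMANDER = "Incident Commander"
    SME = "Subject Matter Expert"
    COMMUNICATOR = "Communicator"
    SCRIBE = "Scribe"


def validate_roles(assignments: Dict[Role, str], serious: bool) -> List[str]:
    """Return a list of problems with the current role assignments."""
    problems = []

    # Every role needs an owner, even if one person holds several.
    for role in Role:
        if role not in assignments:
            problems.append(f"{role.value} is unassigned")

    # On a serious incident the Commander coordinates and does not debug.
    if serious:
        commander = assignments.get(Role.COMMANDER)
        sme = assignments.get(Role.SME)
        if commander is not None and commander == sme:
            problems.append("Incident Commander must not also debug")

    return problems
```

On a small incident, `validate_roles({r: "ana" for r in Role}, serious=False)` returns an empty list. The same assignment with `serious=True` returns `["Incident Commander must not also debug"]`.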

What the postmortem must do

Blameless postmortems do not mean "no one is at fault." They mean "the goal is to learn, not to assign blame." Specific decisions and actions are discussed in detail. System failures are surfaced. Individuals are not named in the postmortem document.

Three sections that matter:

Timeline. Minute by minute, what happened. What was observed, what was done, what worked, what did not.

Contributing factors. Systems, processes, and decisions that contributed to the incident. Plural. Always plural; there is never a single cause.

Action items. Specific, owned, dated. Tracked to completion in the same place you track product work. The number of postmortems with action items that never close is a leading indicator of repeat incidents.
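Two of these rules are easy to automate: the five-business-day postmortem deadline mentioned earlier, and spotting action items that have slipped past their date. A minimal sketch follows, assuming a Monday-to-Friday work week and ignoring public holidays. `ActionItem`, `add_business_days`, and `overdue_items` are hypothetical names, not part of any tracker's API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


def add_business_days(start: date, days: int) -> date:
    """Step forward day by day, counting only Monday through Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


@dataclass
class ActionItem:
    description: str
    owner: str                        # "owned"
    due: date                         # "dated"
    closed_on: Optional[date] = None  # None while the item is still open


def overdue_items(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if i.closed_on is None and i.due < today]
```

For an incident resolved on Friday 2024-01-05, `add_business_days(date(2024, 1, 5), 5)` gives 2024-01-12, the following Friday. Running `overdue_items` weekly against your tracker's export is one cheap way to keep the count of never-closed items visible.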

The cultural piece

The hardest part of incident management is not the runbook. It is building a culture where incidents are normal and learning-focused rather than shameful and hidden. Teams that punish incidents have fewer reported incidents and more catastrophic ones. Teams that treat incidents as the cost of running a real system get better at running real systems.

Building an incident process?

We help startups stand up incident management, on-call rotations, and postmortem culture. Two-week engagement, working with your existing tools.

From Jacob Masse, founder of traztech.
