It is 2 AM. Your monitoring dashboard lights up. The application is down. Customers are tweeting about it. Your phone is ringing. What do you do?
If your answer is "figure it out in the moment," you are not alone. Most startups do not have an incident response plan until after their first major outage. But by then, the damage is done: lost revenue, angry customers, and a team that is burned out from an all-night firefight with no playbook.
Here is how to build an incident response plan in one afternoon.
Define severity levels
Incidents vary widely: a minor UI bug does not warrant the same response as a complete outage. Define three or four severity levels so your team knows how to respond to each:
- SEV 1 (Critical): Complete outage or data breach. All customers affected. Revenue impact. Requires immediate response from the entire engineering team.
- SEV 2 (Major): Significant feature degradation. Many customers affected. Requires response within 30 minutes from the on-call engineer plus one additional team member.
- SEV 3 (Minor): Partial degradation. Some customers affected, workarounds available. Requires response within 2 hours during business hours.
- SEV 4 (Low): Minor issue with minimal impact. Tracked as a bug and fixed in the normal development cycle.
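If you want alerting scripts or chat bots to apply these levels consistently, it can help to write them down as data. The following is a minimal sketch in Python, not a prescribed implementation: the names `Severity`, `ResponsePolicy`, `POLICIES`, and `response_deadline` are hypothetical, "immediate" is modeled as a zero-minute window, and the business-hours rule for SEV 3 is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # Major
    SEV3 = 3  # Minor
    SEV4 = 4  # Low


@dataclass(frozen=True)
class ResponsePolicy:
    summary: str
    respond_within: Optional[timedelta]  # None means no deadline: track as a bug
    response: str


# Values mirror the severity list above; "immediate" is assumed to mean zero delay.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "Complete outage or data breach", timedelta(0),
        "entire engineering team"),
    Severity.SEV2: ResponsePolicy(
        "Significant feature degradation", timedelta(minutes=30),
        "on-call engineer plus one additional team member"),
    Severity.SEV3: ResponsePolicy(
        "Partial degradation, workarounds available", timedelta(hours=2),
        "during business hours"),
    Severity.SEV4: ResponsePolicy(
        "Minor issue with minimal impact", None,
        "fixed in the normal development cycle"),
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a response should begin, or None if there is no deadline.

    Simplification: the business-hours rule for SEV 3 is not modeled here.
    """
    window = POLICIES[severity].respond_within
    return None if window is None else detected_at + window
```

Keeping the table in one place means that changing a response window is a one-line edit rather than a hunt through several scripts.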
Establish roles
During an incident, everyone needs to know their role. Define at least three:
Incident Commander (IC): Runs the response. Makes decisions about what to investigate, what to communicate, and when to escalate. The IC does not debug the problem. They coordinate the people who do.
Technical Lead: Leads the debugging effort. Investigates the root cause, identifies fixes, and implements them. This should be your strongest engineer for the affected system.
Communications Lead: Manages external and internal communications. Updates the status page, responds to customer inquiries, and keeps stakeholders informed. This is usually someone from customer success or the founding team.
Create a communication plan
When an incident happens, you need to communicate with three audiences:
The team: Use a dedicated Slack channel (#incident-YYYYMMDD) for real-time coordination. Keep all technical discussion in this channel. Post status updates every 15 minutes for SEV 1 and SEV 2 incidents.
Customers: Update your status page within 10 minutes of confirming an incident. Send email updates for SEV 1 incidents. Be honest about what is happening. Customers forgive outages. They do not forgive silence.
Stakeholders: Notify your CEO, board, and investors for SEV 1 incidents. They should hear it from you, not from Twitter.
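The timings in this plan are easy to forget at 2 AM, so one option is to compute them the moment an incident is confirmed. The sketch below is illustrative only: the function names and the dictionary keys are assumptions, severity is passed as a plain integer (1 for SEV 1, and so on), and the number of team updates to schedule in advance (`count`) is an arbitrary choice.

```python
from datetime import datetime, timedelta
from typing import Dict, List

STATUS_PAGE_WINDOW = timedelta(minutes=10)    # first status page update
TEAM_UPDATE_INTERVAL = timedelta(minutes=15)  # cadence for SEV 1 and SEV 2


def incident_channel_name(declared_at: datetime) -> str:
    """Build the channel name in the #incident-YYYYMMDD form."""
    return f"#incident-{declared_at:%Y%m%d}"


def team_update_times(confirmed_at: datetime, severity: int, count: int = 4) -> List[datetime]:
    """Times of the next `count` team updates; empty for SEV 3 and SEV 4."""
    if severity not in (1, 2):
        return []
    return [confirmed_at + TEAM_UPDATE_INTERVAL * n for n in range(1, count + 1)]


def communication_plan(confirmed_at: datetime, severity: int) -> Dict[str, object]:
    """Summarize who needs to hear what, and by when, for one incident."""
    return {
        "channel": incident_channel_name(confirmed_at),
        "status_page_by": confirmed_at + STATUS_PAGE_WINDOW,
        "team_updates": team_update_times(confirmed_at, severity),
        "email_customers": severity == 1,       # email updates are for SEV 1 only
        "notify_stakeholders": severity == 1,   # CEO, board, and investors
    }
```

Calling `communication_plan(datetime(2024, 3, 5, 2, 0), 1)` would give the Communications Lead a channel name, a status page deadline of 02:10, and team updates starting at 02:15.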
Build a response checklist
When an incident is declared, the IC follows this checklist:
- Confirm the incident and assign a severity level.
- Create the incident channel in Slack.
- Page the on-call engineer and assign the Technical Lead role.
- Assign a Communications Lead.
- Post the first status page update.
- Begin investigation. Check dashboards, logs, and recent deployments.
- If a recent deployment is the cause, roll it back immediately.
- Post status updates every 15 minutes.
- When resolved, confirm resolution and update the status page.
- Schedule a post-incident review within 48 hours.
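A checklist is more useful when someone records which steps are done and when. As a rough sketch, assuming you track this in a small script rather than a dedicated incident tool, the hypothetical `ChecklistRun` class below stores a timestamp per step and reports the next pending one. The timestamps double as a timeline for the post-incident review.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional

# Steps copied from the checklist above, in order.
CHECKLIST = (
    "Confirm the incident and assign a severity level",
    "Create the incident channel in Slack",
    "Page the on-call engineer and assign the Technical Lead role",
    "Assign a Communications Lead",
    "Post the first status page update",
    "Begin investigation",
    "If a recent deployment is the cause, roll it back",
    "Post status updates every 15 minutes",
    "Confirm resolution and update the status page",
    "Schedule a post-incident review within 48 hours",
)


@dataclass
class ChecklistRun:
    """Tracks completion times for one incident (illustrative, in-memory only)."""
    completed: Dict[int, datetime] = field(default_factory=dict)

    def complete(self, step: int, at: datetime) -> None:
        """Mark a step (0-based index into CHECKLIST) as done at the given time."""
        if not 0 <= step < len(CHECKLIST):
            raise IndexError(f"no checklist step {step}")
        self.completed[step] = at

    def next_step(self) -> Optional[str]:
        """Return the first step not yet completed, or None when all are done."""
        for index, text in enumerate(CHECKLIST):
            if index not in self.completed:
                return text
        return None
```

Steps may be completed out of order; `next_step` always returns the earliest one still open, so nothing is silently skipped.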
Set up on-call rotation
Someone needs to be reachable 24/7. Use PagerDuty, Opsgenie, or even a shared phone to route alerts. Rotate on-call weekly so no one person bears the whole burden. Compensate your engineers for on-call shifts, either with extra pay or with time off.
The on-call engineer should have access to all production systems, a laptop, and a stable internet connection at all times during their rotation. No exceptions.
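If you start with a shared phone rather than a paging service, the weekly rotation can be computed rather than maintained by hand. The following is a small sketch under stated assumptions: the function name `on_call_for`, the engineer names, and the rotation start date are all made up, and it ignores swaps, holidays, and time zones.

```python
from datetime import date
from typing import Sequence


def on_call_for(day: date, engineers: Sequence[str], rotation_start: date) -> str:
    """Return who is on call on `day`, rotating weekly from `rotation_start`."""
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Hypothetical team; the rotation starts on Monday, 1 January 2024.
team = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
first_week = on_call_for(date(2024, 1, 3), team, start)   # "ana"
second_week = on_call_for(date(2024, 1, 8), team, start)  # "ben"
```

A dedicated tool such as PagerDuty or Opsgenie handles overrides and escalation for you; a function like this is only a stopgap until you adopt one.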
Run post-incident reviews
Every SEV 1 and SEV 2 incident gets a post-incident review (often called a postmortem, though we prefer the less dramatic term). The review covers what happened, why it happened, what the impact was, and what you are going to do to prevent it from happening again.
The most important rule of a post-incident review: no blame. The goal is to improve systems, not to punish individuals. If someone made a mistake, ask what about the system allowed that mistake to have such a big impact. Then fix the system.
Document the review and share it with the entire company. This builds a culture of transparency and continuous improvement.
An afternoon spent building this plan will save you countless hours of chaos during your next incident. And there will be a next incident. The question is whether you will be ready for it.