It is 2:47 AM. PagerDuty is screaming. The database is at 99% disk usage. Your on-call engineer, who has been with the company for three months, has never seen this before. They start Googling. They try random things. They accidentally make it worse. By the time someone with the right knowledge wakes up, an hour has passed and customers have noticed.
A runbook would have made this a 10-minute fix.
What a runbook is (and is not)
A runbook is a step-by-step procedure for handling a specific operational scenario. It is not documentation of how the system works (that is architecture docs). It is not a list of metrics to watch (that is a monitoring dashboard). It is a specific set of actions for a specific situation, written clearly enough that someone who has never seen the problem before can follow it.
The template
Every runbook should follow this structure:
Title: [Alert name or scenario, e.g., "Database Disk Usage Above 90%"]
Severity: [Critical / Warning / Info]
Impact: [What happens if this is not resolved. e.g., "Database will become read-only when disk reaches 100%, causing all write operations to fail."]
Prerequisites: [Tools or access needed. e.g., "AWS Console access, psql client, SSH access to bastion host"]
Investigation steps:
- [Step 1 with exact command or procedure]
- [Step 2 with expected output]
- [Step 3 with decision point: if X, go to step 4a; if Y, go to step 4b]
Resolution steps:
- [Step 1 with exact command]
- [Step 2 with verification check]
- [Step 3 confirming resolution]
Escalation: [Who to contact if the runbook does not resolve the issue. Include name, phone number, and Slack handle.]
Post-resolution: [Any follow-up actions needed after the immediate fix. e.g., "Create a Jira ticket to investigate root cause of disk growth."]
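One way to keep every runbook in this shape is to store it as structured data and check it for missing sections. The snippet below is a minimal sketch of that idea, not a prescribed tool: the `Runbook` class, its field names, and the `missing_sections` helper are hypothetical names chosen to mirror the template above, and which sections count as required is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Severity levels taken from the template above.
ALLOWED_SEVERITIES = {"Critical", "Warning", "Info"}


@dataclass
class Runbook:
    """Hypothetical in-code mirror of the runbook template."""
    title: str
    severity: str
    impact: str
    prerequisites: List[str] = field(default_factory=list)
    investigation_steps: List[str] = field(default_factory=list)
    resolution_steps: List[str] = field(default_factory=list)
    escalation: str = ""
    post_resolution: str = ""


def missing_sections(runbook: Runbook) -> List[str]:
    """Return the names of sections that are empty or invalid.

    Assumption: prerequisites and post-resolution are optional,
    everything else is required.
    """
    problems = []
    if runbook.severity not in ALLOWED_SEVERITIES:
        problems.append("severity")
    for name in ("title", "impact", "investigation_steps",
                 "resolution_steps", "escalation"):
        if not getattr(runbook, name):
            problems.append(name)
    return problems


draft = Runbook(
    title="Database Disk Usage Above 90%",
    severity="Critical",
    impact="Database will become read-only when disk reaches 100%.",
    investigation_steps=["Check current disk usage on the database host."],
    resolution_steps=["Free or expand storage, then re-check usage."],
)
print(missing_sections(draft))  # prints ['escalation']
```

A check like this could run in CI against the runbook repository, so a runbook without an escalation contact is caught before the 2:47 AM page rather than during it.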
The 10 runbooks every startup needs
- Application is returning 5xx errors. How to check logs, identify the failing component, and restart or roll back.
- Database is slow or unresponsive. How to check active queries, kill long-running queries, and assess connection pool status.
- Disk usage is high. How to identify what is consuming disk, clean up safely, and expand storage if needed.
- Memory usage is high. How to identify memory leaks, restart affected services, and scale if needed.
- SSL/TLS certificate is expiring. How to renew the certificate and verify that the renewed one is being served.
- Deployment failed. How to roll back to the previous version.
- Third-party service is down. How to confirm that Stripe, SendGrid, AWS, or another dependency is having an outage, and what to do until it recovers.
- Customer reports data issue. How to safely investigate and fix data-related customer reports.
- DNS is not resolving. How to check DNS configuration and propagation.
- Security incident detected. Initial containment steps, who to notify, and how to preserve evidence.
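To make "exact command or procedure" concrete, here is a sketch of what the first investigation step of the disk usage runbook might look like as a script. It is illustrative only: the function names are hypothetical, it uses nothing beyond the Python standard library, and it is read-only, so it reports what is consuming space but deletes nothing.

```python
import os
import shutil
from typing import List, Tuple


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def dir_size_bytes(path: str) -> int:
    """Total size of the regular files under `path`, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is not readable; ignore it
    return total


def largest_children(path: str, top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the `top_n` largest direct children of `path` as (name, bytes)."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = dir_size_bytes(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((entry.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    print(f"Disk usage: {disk_usage_percent('/'):.1f}%")
    # "/var" is only an example starting point; use the path your runbook names.
    for name, size in largest_children("/var"):
        print(f"{size / 1_000_000:10.1f} MB  {name}")
```

The runbook would then state the expected output and the decision point, for example which directories are safe to clean up and when to expand storage instead.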
How to build the habit
The challenge with runbooks is not writing them. It is keeping them up to date. Here are two practices that help:
Write runbooks during postmortems. After every incident, one of the action items should be "write or update the runbook for this scenario." This ensures runbooks are created from real incidents, not theoretical scenarios.
Review runbooks during on-call handoff. When the on-call rotation changes, the outgoing engineer spends 5 minutes reviewing any runbooks that were used during their shift. If a step was wrong or unclear, they update it immediately.
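The handoff review is easier to keep up if the list of runbooks to look at is generated rather than remembered. The following is a minimal sketch that assumes incidents are recorded with a timestamp and the title of the runbook that was followed; the `Incident` type and the `runbooks_to_review` function are hypothetical and would need adapting to whatever your incident tracker actually exports.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Incident(NamedTuple):
    """Hypothetical incident record exported from an incident tracker."""
    opened_at: datetime
    runbook_title: str


def runbooks_to_review(
    incidents: Iterable[Incident],
    shift_start: datetime,
    shift_end: datetime,
) -> List[str]:
    """Return the distinct runbooks used during the shift, sorted by title."""
    used = {
        incident.runbook_title
        for incident in incidents
        if shift_start <= incident.opened_at < shift_end
    }
    return sorted(used)


incidents = [
    Incident(datetime(2024, 5, 6, 2, 47), "Database Disk Usage Above 90%"),
    Incident(datetime(2024, 5, 8, 14, 5), "Deployment failed"),
    Incident(datetime(2024, 5, 9, 3, 12), "Database Disk Usage Above 90%"),
    Incident(datetime(2024, 5, 14, 9, 0), "DNS is not resolving"),
]
shift_start = datetime(2024, 5, 6)
shift_end = datetime(2024, 5, 13)
for title in runbooks_to_review(incidents, shift_start, shift_end):
    print(title)
```

With the sample data above, the outgoing engineer gets two runbooks to review: the disk usage runbook, listed once even though it was used twice, and the deployment rollback runbook.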
Need help building operational runbooks?
traztech helps startups build comprehensive runbook libraries, set up alerting, and train teams on incident response. We make sure your team can handle anything production throws at them.