The standard way founders think about downtime is "lost revenue per hour." For a SaaS with $5M ARR, that is roughly $570 an hour ($5M spread over the 8,760 hours in a year). Annoying but absorbable. So why bother investing in reliability?
Because lost revenue is the smallest line on the downtime bill.
What downtime actually costs
Direct revenue loss. Real but usually small at startup scale. $500 to $5,000 per hour depending on ARR.
Customer trust. Churn after a major outage runs 2 to 5 percent above baseline for the following quarter, and is concentrated in your highest-value enterprise accounts. At $50K ACV, even a 1% churn bump is real money.
Support load. Every minute of customer-facing downtime generates 5 to 20 minutes of support work: answering tickets, writing status updates, sending follow-up communications, and handling credit requests. At those ratios a two-hour outage eats 10 to 40 hours of support team time, and 20 hours is typical.
Engineering response. A serious incident pulls 3 to 8 engineers off their planned work for the duration of the incident plus several days of postmortem and remediation. Even counted conservatively, that is a full engineer-week of capacity gone.
SLA credits. If you signed enterprise contracts with uptime guarantees (you almost certainly did), you owe credits proportional to time below SLA. These come straight out of next quarter's revenue.
Sales pipeline damage. Prospects in active evaluation will see the status page or hear about it from your existing customers. We have seen 6-figure deals stall after a single major outage in mid-cycle.
Recruiting damage. Senior engineering candidates Google "your-company outage" before accepting an offer. A pattern of incidents shows up in those results.
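The direct-revenue and support-load figures above are simple arithmetic, and it can help to run them against your own numbers. Here is a minimal sketch in Python. The function names are illustrative, and it assumes revenue accrues evenly across the year, which will understate the cost of an outage during peak hours.

```python
from __future__ import annotations

HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def hourly_revenue(arr_dollars: float) -> float:
    """Average revenue per hour, assuming revenue accrues evenly all year."""
    return arr_dollars / HOURS_PER_YEAR


def support_hours(
    outage_minutes: float, low_ratio: float = 5, high_ratio: float = 20
) -> tuple[float, float]:
    """Range of support hours for an outage.

    The default ratios are the 5 to 20 minutes of support work per minute
    of downtime quoted in the text.
    """
    return (outage_minutes * low_ratio / 60, outage_minutes * high_ratio / 60)


if __name__ == "__main__":
    print(f"${hourly_revenue(5_000_000):,.0f} per hour at $5M ARR")
    low, high = support_hours(120)
    print(f"{low:.0f} to {high:.0f} support hours for a two-hour outage")
```

Swapping in your own ARR and outage length gives a quick floor for the first and third cost lines; the remaining lines depend on your contracts and customer mix.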
The math that actually matters
For a $10M ARR SaaS, a two-hour total outage costs roughly:
- Direct revenue: $2,300
- Support team: 20 hours × $75 = $1,500
- Engineering time: 40 hours × $150 = $6,000
- SLA credits: $5,000 to $25,000
- Elevated churn (1% of the revenue base, counted for one quarter): $25,000
- Pipeline impact: $50,000+ (one stalled deal)
Total realistic cost: $90,000 to $110,000 for a single two-hour outage. The "lost revenue" line is less than 3% of that.
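To check the total, or to rerun it with your own estimates, the line items can be tallied in a few lines of Python. This is only a sketch: the dictionary keys and function names are made up for illustration, and the "$50,000+" pipeline figure is treated as a flat $50,000 floor.

```python
# Each line item from the estimate above as a (low, high) dollar range.
LINE_ITEMS = {
    "direct_revenue": (2_300, 2_300),
    "support_team": (20 * 75, 20 * 75),        # 20 hours at $75
    "engineering_time": (40 * 150, 40 * 150),  # 40 hours at $150
    "sla_credits": (5_000, 25_000),
    "elevated_churn": (25_000, 25_000),
    "pipeline_impact": (50_000, 50_000),       # "$50,000+", used as a floor
}


def total_cost(items):
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def direct_revenue_share(items):
    """Direct revenue as a fraction of the total, as a (smallest, largest) pair."""
    low_total, high_total = total_cost(items)
    revenue = items["direct_revenue"][0]
    return revenue / high_total, revenue / low_total


if __name__ == "__main__":
    low, high = total_cost(LINE_ITEMS)
    print(f"Total: ${low:,} to ${high:,}")
    smallest, largest = direct_revenue_share(LINE_ITEMS)
    print(f"Direct revenue share: {smallest:.1%} to {largest:.1%}")
```

The totals come to $89,800 and $109,800, which round to the range quoted above. Changing a single entry, such as the SLA credit range, shows how sensitive the total is to each assumption.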
What is actually worth investing in
Given the math, even modest reliability investments pay back fast. The high-leverage moves:
A real status page (Statuspage, BetterStack, or Instatus). Customers who can see status during an outage generate a third as many support tickets as customers who cannot.
Proper monitoring and alerting (Datadog, Grafana, or equivalent). You cannot start fixing an incident you have not detected, so mean time to detect feeds directly into mean time to resolve. Detecting an issue in 2 minutes vs 30 minutes is the difference between a paper cut and a 6-figure incident.
An incident runbook and on-call rotation. When the first thing your team does during an outage is figure out who to wake up, you have already lost 20 minutes. A documented rotation and runbook cut response time dramatically.
Postmortems with actual follow-up. Most teams do postmortems. Few of them assign action items with deadlines and follow up. The incidents you do not learn from will repeat.
What is not worth investing in (yet)
Multi-region failover before you have repeatedly hit single-region issues. Five nines (99.999%) of availability for a product where customers are fine with three nines (99.9%). A dedicated SRE team before you have 20+ engineers. These are real practices, but they are usually the wrong investment for early-stage startups. Solve the basics first.
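To see why the gap between three and five nines matters, it helps to translate each target into the downtime it allows per year. A rough sketch, assuming "N nines" means an availability of 1 minus 10 to the power of negative N, measured over a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines.

    Three nines is 99.9% (unavailable 0.1% of the time); five nines is
    99.999% (unavailable 0.001% of the time).
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:,.1f} minutes of downtime per year")
```

Three nines allows a little under nine hours of downtime a year, while five nines allows just over five minutes. Closing that gap is what multi-region failover and dedicated SRE teams are for, which is why they rarely pay off until customers actually demand it.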
Want a reliability audit?
We audit production infrastructure, identify the highest-leverage reliability investments, and help you implement them in priority order.
Book a review