SaaS Uptime SLAs: How to Set Them and How to Hit Them

Your customer asks for 99.99% uptime. Your engineer says "we run on AWS, that should be fine." Six months later, the math has not worked out. You owe SLA credits. Sales is unhappy. Engineering is exhausted.

Setting an uptime SLA is a commitment, not a marketing claim. Here is how to think about what to promise and how to actually deliver it.

What the numbers actually mean

SLA tiers and their downtime budgets per month (based on an average month of about 30.44 days):

  • 99.0%: 7 hours 18 minutes of allowed downtime per month.
  • 99.5%: 3 hours 39 minutes.
  • 99.9% (three nines): 43 minutes 49 seconds.
  • 99.95%: 21 minutes 54 seconds.
  • 99.99% (four nines): 4 minutes 22 seconds.
  • 99.999% (five nines): 26 seconds.

Each additional nine cuts the downtime budget by a factor of ten and is roughly 10x harder to achieve than the previous one. The architectural difference between "two nines" and "three nines" is significant. The difference between "three nines" and "four nines" is enormous. The difference between "four nines" and "five nines" is something only the largest infrastructure teams in the world reliably achieve.
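If you want to check these budgets for your own measurement window, the arithmetic is short. The following is a minimal sketch, assuming an average month of 365.25 / 12 days; the function name `downtime_budget` is made up for illustration.

```python
from datetime import timedelta

# Assumption: an "average" month of 365.25 / 12 = 30.4375 days.
AVG_MONTH = timedelta(days=365.25 / 12)


def downtime_budget(sla_percent: float, window: timedelta = AVG_MONTH) -> timedelta:
    """Return the downtime allowed within `window` at the given SLA percentage."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return window * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
        minutes = downtime_budget(sla).total_seconds() / 60
        print(f"{sla}%: {minutes:.2f} minutes per month")
```

Pass a different `window` (for example `timedelta(days=30)` for a trailing 30-day SLA) to see how the budget shifts with the measurement window written into your contract.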

What you can promise at your stage

Pre-product-market-fit (no enterprise customers): Do not set a public SLA. Run a status page. Aim for two nines internally.

Early enterprise (first 5-15 enterprise customers): 99.9% is achievable on managed cloud services with reasonable engineering investment. Standard tier for most B2B SaaS.

Scaling enterprise (50+ enterprise customers): 99.95% with serious investment in redundancy, deploy safety, and incident response. The jump from 99.9 to 99.95 typically requires multi-AZ deployment, blue/green deploys, and a real on-call practice.

Mission-critical enterprise (banking, healthcare, telecom): 99.99% requires multi-region active-active or active-passive architecture, automated failover, and a dedicated SRE function. This is a major investment, not a target you stumble into.

Promising more than you can deliver is worse than promising less. Customers respect honest SLAs. They penalize broken ones.

What "uptime" means in your SLA

Define this carefully in the contract. The phrase "99.9% uptime" can mean very different things.

Specifically, the SLA should define:

  • What counts as "down." Total unavailability? Degraded performance? Specific feature unavailable? Define it.
  • How measurement works. Customer-reported? Synthetic monitoring? Your monitoring? Whose tool wins in a dispute?
  • What is excluded. Scheduled maintenance (with notice), customer-caused issues, force majeure, third-party outages (AWS, Stripe, etc.).
  • Measurement window. Calendar month is standard. Trailing 30 days is sometimes used.
  • Credit structure. Typically a percentage of monthly fees credited back, with a cap (usually 25-50% of monthly fees). Credits do not extend the term.
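To make the credit structure concrete, here is a minimal sketch of how a tiered credit with a cap could be computed. Every number in it is hypothetical (the tier thresholds, the credit percentages, the 50% cap), and so is the rule that excluded minutes are simply subtracted from counted downtime. Your contract defines the real values and rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditTier:
    below_uptime: float    # tier applies when measured uptime is below this percentage
    credit_percent: float  # share of the monthly fee credited back


# Hypothetical schedule for illustration only.
TIERS = [
    CreditTier(below_uptime=99.9, credit_percent=10.0),
    CreditTier(below_uptime=99.0, credit_percent=25.0),
    CreditTier(below_uptime=95.0, credit_percent=50.0),
]
CREDIT_CAP_PERCENT = 50.0  # hypothetical cap on total credit


def measured_uptime(window_minutes: float, down_minutes: float,
                    excluded_minutes: float = 0.0) -> float:
    """Uptime percentage after removing excluded downtime (e.g. noticed maintenance)."""
    counted = max(down_minutes - excluded_minutes, 0.0)
    return 100.0 * (1.0 - counted / window_minutes)


def credit_percent(uptime: float) -> float:
    """Largest credit among the tiers the measured uptime falls under, capped."""
    owed = max((t.credit_percent for t in TIERS if uptime < t.below_uptime), default=0.0)
    return min(owed, CREDIT_CAP_PERCENT)


def credit_amount(monthly_fee: float, uptime: float) -> float:
    """Credit owed in the same currency as the monthly fee."""
    return monthly_fee * credit_percent(uptime) / 100.0
```

For example, 60 minutes of counted downtime in a 30-day month (43,200 minutes) gives roughly 99.86% uptime, which falls under the first hypothetical tier and yields a 10% credit.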

What it takes to actually hit 99.9%

Three nines is achievable on a single cloud region with the following:

  • Multi-AZ deployment for the application tier and the database.
  • Auto-scaling configured properly so traffic spikes do not cause outages.
  • Health checks at the load balancer that pull bad instances out of rotation fast.
  • Blue/green or canary deployment so bad deploys do not affect 100% of traffic.
  • Real monitoring with alerts that fire fast (under 2 minutes for customer-facing issues).
  • On-call rotation with documented runbooks.
  • Database backups with tested restore. Failover for the primary database.
  • Status page with public history.

None of these is exotic. All of them require deliberate work. Teams that wing it typically achieve 99% to 99.5%, which sounds close to three nines but allows five to ten times as much downtime.
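The health check item in the list above is worth a closer look, because the thresholds decide how long a bad instance keeps serving traffic. The sketch below models the usual consecutive-failure logic in plain Python. It is not any particular load balancer's API, and the class name, default thresholds, and the eviction-time estimate are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one instance; thresholds count consecutive check results."""
    unhealthy_threshold: int = 2  # consecutive failures before removal
    healthy_threshold: int = 3    # consecutive passes before re-admission
    in_rotation: bool = True
    _fails: int = 0
    _passes: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return whether the instance is in rotation."""
        if check_passed:
            self._passes += 1
            self._fails = 0
            if not self.in_rotation and self._passes >= self.healthy_threshold:
                self.in_rotation = True
        else:
            self._fails += 1
            self._passes = 0
            if self.in_rotation and self._fails >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation


def rough_eviction_seconds(interval_s: float, timeout_s: float,
                           unhealthy_threshold: int) -> float:
    """Rough upper bound on time to evict, assuming checks start every
    `interval_s` and each failing check takes at most `timeout_s` to fail."""
    return unhealthy_threshold * interval_s + timeout_s
```

With a 10-second interval, a 5-second timeout, and a threshold of 2, the estimate is about 25 seconds of bad traffic per failed instance. Against a 43-minute monthly budget that is affordable; with a 30-second interval and a threshold of 5 it is not nearly as comfortable.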

The trap that kills most SLAs

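Most missed SLAs are not from one big outage. They are from the cumulative effect of small ones: a 5-minute glitch every week is about 260 minutes per year, roughly five times the annual four-nines budget of about 53 minutes. Per month, that same glitch consumes about half of a three-nines budget before anything serious has gone wrong.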

Track partial degradation, not just total outages. Aggregate over your measurement window. Look at your actual numbers, not the optimistic ones in the status page summary.
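One way to aggregate is to weight each incident by how much of your traffic it affected and sum over the window. This is a minimal sketch, assuming impact-weighted minutes are an acceptable measure; your SLA may count degradation differently. The `Incident` type and the sample month are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    minutes: float  # duration of the incident
    impact: float   # fraction of requests or users affected, 0.0 to 1.0


def weighted_downtime(incidents: Iterable[Incident]) -> float:
    """Sum of incident minutes, each scaled by the fraction of traffic affected."""
    return sum(i.minutes * i.impact for i in incidents)


def availability(incidents: Iterable[Incident], window_minutes: float) -> float:
    """Availability percentage over the window, counting partial degradation."""
    return 100.0 * (1.0 - weighted_downtime(incidents) / window_minutes)


WINDOW_MINUTES = 30 * 24 * 60  # assumption: a 30-day measurement window

# Hypothetical month: four 5-minute full outages plus one 40-minute
# degradation that affected a quarter of requests.
month = [Incident(5, 1.0)] * 4 + [Incident(40, 0.25)]

if __name__ == "__main__":
    print(f"weighted downtime: {weighted_downtime(month):.1f} minutes")
    print(f"availability: {availability(month, WINDOW_MINUTES):.3f}%")
```

In this made-up month no single incident looks alarming, yet the weighted total is 30 minutes: inside a 99.9% budget for a 30-day window (43.2 minutes) but well over a 99.95% budget (21.6 minutes).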

What to do if you miss

If you miss your SLA, communicate proactively. Issue credits without making customers ask. The credit is rarely the largest cost; the trust damage is. Customers who feel respected after an incident often stay. Customers who feel ignored leave.

Need to hit a tighter SLA?

We help SaaS teams move from "best effort" reliability to specific uptime targets through architecture review, monitoring, and incident process work.

From Jacob Masse, founder of traztech.
