Your database just disappeared. Your cloud provider had a regional outage. Your engineer accidentally ran DROP TABLE in production. You need a disaster recovery plan, and "we will figure it out when it happens" is not one.
The good news: a practical DR plan for a 3-person engineering team does not need to be a 50-page document. It needs to answer five questions and include tested procedures for each.
Question 1: What are you protecting?
List every critical system and data store. For a typical SaaS startup, these are: the application database (PostgreSQL or MySQL), file storage (S3 buckets), application configuration (environment variables, secrets), source code (GitHub), and the infrastructure definition (Terraform state).
For each item, define two numbers:
- RPO (Recovery Point Objective): How much data can you afford to lose? If your RPO is 1 hour, you need backups at least every hour.
- RTO (Recovery Time Objective): How long can you be down? If your RTO is 4 hours, you need to be able to restore everything within 4 hours.
For most startups: RPO = 1 hour, RTO = 4 hours. These are reasonable targets that do not require expensive infrastructure.
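To make the two numbers concrete, here is a minimal sketch that checks a backup schedule and a measured restore time against the targets. The class and function names are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """RPO/RTO targets for one system (hypothetical helper type)."""
    system: str
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


def meets_rpo(target: RecoveryTarget, backup_interval: timedelta) -> bool:
    # Worst case, the failure happens just before the next backup,
    # so the data lost equals one full backup interval.
    return backup_interval <= target.rpo


def meets_rto(target: RecoveryTarget, measured_restore: timedelta) -> bool:
    # Compare a restore time measured in a drill against the RTO.
    return measured_restore <= target.rto


db = RecoveryTarget(
    "application database",
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
daily_dump_ok = meets_rpo(db, timedelta(hours=24))
five_minute_ok = meets_rpo(db, timedelta(minutes=5))
restore_ok = meets_rto(db, timedelta(minutes=60))
print(daily_dump_ok, five_minute_ok, restore_ok)
```

Under these targets, a once-a-day dump on its own would miss a 1-hour RPO, which is why a backup mechanism with a shorter interval is needed alongside it.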
Question 2: What can go wrong?
The most common disaster scenarios for startups:
- Database corruption or accidental deletion (most common, usually human error)
- Cloud provider regional outage (rare but impactful)
- Ransomware or security breach (increasingly common)
- Application bug that corrupts data (common during migrations)
- DNS or certificate issues that take down the entire application
Question 3: How are you backing up?
Set up automated backups for every critical data store:
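Database: Use RDS automated backups, which provide point-in-time recovery within the retention period you configure. Backup storage is free up to the size of your provisioned database storage; beyond that you pay per GB. Also set up a daily pg_dump to an S3 bucket in a DIFFERENT AWS account (cross-account backup). This protects against account compromise. Test a restore monthly.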
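The daily cross-account dump could be scripted roughly as follows. This is a sketch under stated assumptions: the key layout, the function names, and the use of a named AWS CLI profile for the backup account are all illustrative choices, and the script expects pg_dump and the AWS CLI to be installed.

```python
import subprocess
from datetime import datetime, timezone


def backup_key(db_name: str, now: datetime) -> str:
    # Date-partitioned key (assumed layout) so old dumps are easy to find.
    return f"pg_dump/{db_name}/{now:%Y/%m/%d}/{db_name}-{now:%Y%m%dT%H%M%SZ}.dump"


def pg_dump_command(db_url: str, out_path: str) -> list[str]:
    # --format=custom writes a compressed archive readable by pg_restore.
    return ["pg_dump", "--format=custom", f"--file={out_path}", f"--dbname={db_url}"]


def upload_command(local_path: str, bucket: str, key: str, profile: str) -> list[str]:
    # The profile is assumed to hold credentials for the separate backup account.
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}", "--profile", profile]


def run_backup(db_url: str, db_name: str, bucket: str, profile: str) -> str:
    """Dump the database and copy the archive to the backup account's bucket."""
    now = datetime.now(timezone.utc)
    local_path = f"/tmp/{db_name}.dump"
    key = backup_key(db_name, now)
    # check=True raises if either command fails, so a broken backup is not silent.
    subprocess.run(pg_dump_command(db_url, local_path), check=True)
    subprocess.run(upload_command(local_path, bucket, key, profile), check=True)
    return key
```

A scheduler such as cron could call run_backup once a day. Building the commands in separate functions keeps them easy to inspect without touching a real database.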
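File storage: Enable S3 versioning and cross-region replication for any bucket containing customer data. Replication requires versioning on both the source and destination buckets. The cost is usually modest: you pay for the replicated storage, the replication requests, and inter-region data transfer.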
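Source code: Git provides this inherently (every full clone carries the complete history), although clones do not include GitHub-only data such as issues and pull requests. Also ensure your Terraform state is backed up: use an S3 backend with versioning enabled on the state bucket.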
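Versioning on the customer-data buckets and on the Terraform state bucket can be verified with a small script. The sketch below assumes a boto3 S3 client is passed in; the function name and bucket names are hypothetical.

```python
def ensure_versioning(s3_client, bucket: str) -> bool:
    """Enable versioning on a bucket if it is not already enabled.

    Returns True if a change was made, False if versioning was already on.
    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # The response has no "Status" key for a bucket that was never versioned.
    status = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
    if status == "Enabled":
        return False
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return True


# Example usage (bucket names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   for name in ["customer-uploads", "terraform-state"]:
#       changed = ensure_versioning(s3, name)
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before pointing it at a real account.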
Secrets: Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) rather than environment variables. Secrets managers provide versioning, audit logging, and rotation capabilities.
Question 4: How do you recover?
Write step-by-step runbooks for each scenario. Keep them simple enough that any engineer on the team can execute them at 3 AM while half-asleep.
Database recovery runbook (example):
- Identify the point in time you need to recover to
- Open the RDS console, select the database, choose "Restore to point in time"
- Wait for the new instance to become available (often 15-30 minutes, longer for large databases)
- Verify the data by running a set of predefined queries
- Update the application configuration to point to the new database (the restored instance has its own endpoint)
- Restart the application
- Verify the application is functioning correctly
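- Delete the old database instance once recovery is confirmed (take a final snapshot first if you may need it for investigation)
Total recovery time: typically 30-60 minutes, depending on database size. Well within a 4-hour RTO.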
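The console steps for the restore can also be scripted, which removes some room for error at 3 AM. The following is a sketch, assuming a boto3 RDS client is passed in; the helper names and the naming scheme for the restored instance are illustrative.

```python
from datetime import datetime, timezone


def restore_request(source_id: str, restore_time: datetime, suffix: str = "restored") -> dict:
    """Build the parameters for a point-in-time restore to a NEW instance."""
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware")
    stamp = restore_time.astimezone(timezone.utc).strftime("%Y%m%d-%H%M")
    return {
        "SourceDBInstanceIdentifier": source_id,
        # Assumed naming scheme: the original name plus the restore point.
        "TargetDBInstanceIdentifier": f"{source_id}-{suffix}-{stamp}",
        "RestoreTime": restore_time,
    }


def restore_to_point_in_time(rds_client, source_id: str, restore_time: datetime) -> str:
    """Start the restore and block until the new instance is available.

    `rds_client` is assumed to be a boto3 RDS client, e.g. boto3.client("rds").
    """
    params = restore_request(source_id, restore_time)
    rds_client.restore_db_instance_to_point_in_time(**params)
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=params["TargetDBInstanceIdentifier"])
    return params["TargetDBInstanceIdentifier"]
```

The script covers only the restore and the wait. Verifying the data, repointing the application, and removing the old instance remain manual steps in the runbook.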
Question 5: How do you test it?
A DR plan that has never been tested is a hope, not a plan. Schedule quarterly DR tests. Pick one scenario each quarter and run through the recovery procedure. Track how long it takes and document any issues.
Start simple: restore your database from backup to a separate instance. Verify the data is correct. Time the entire process. If you can do this successfully, you have covered the most common disaster scenario.
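Timing a drill by hand is easy to forget, so a small harness can record it for you. This is a minimal sketch: the function name, the step names, and the result layout are assumptions, and each step is any callable you supply (for example, one that triggers a restore and one that runs your verification queries).

```python
import time
from datetime import timedelta


def run_drill(steps, rto: timedelta, clock=time.monotonic) -> dict:
    """Run (name, callable) steps in order, timing each one.

    Returns per-step durations, the total, and whether the total fits the RTO.
    `clock` is injectable so the harness can be exercised without waiting.
    """
    timings = []
    start = clock()
    for name, step in steps:
        step_start = clock()
        step()  # an exception here aborts the drill, which is itself a finding
        timings.append((name, timedelta(seconds=clock() - step_start)))
    total = timedelta(seconds=clock() - start)
    return {"timings": timings, "total": total, "within_rto": total <= rto}


# Example usage with placeholder steps:
#   report = run_drill(
#       [("restore database", do_restore), ("verify data", run_checks)],
#       rto=timedelta(hours=4),
#   )
#   print(report["total"], report["within_rto"])
```

Keeping the report from each quarter makes it easy to see whether recovery is getting faster or slower over time.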