DevOps

Zero-Downtime Database Migrations: A Practical Guide

Database migrations should not require maintenance windows. They almost never do, technically. They often do operationally, because the team has not built the muscle to run them safely. Here is the pattern that works.

The core principle

Every migration needs to be backward and forward compatible during the transition. Your application must work with both the old and new schema during the rollout window. If you violate this, you will eventually take an outage.

This is the principle behind the expand/contract pattern, sometimes called the "two-phase migration" or "parallel change." Make the additive change first, migrate data, switch the application over, then remove the old structure. Each step is independently safe.

Common migrations and how to do them safely

Adding a column. Easy case. Add the column nullable or with a default that does not require rewriting existing rows. In Postgres 11+, adding a column with a constant default is metadata-only. Code that does not know about the column ignores it; code that does know about it uses it. Single step, no risk.

Adding a NOT NULL column to a large table. Do not do this in one step. The sequence: add the column as nullable, deploy code that writes to it, backfill in batches with throttling, verify completeness, then add the NOT NULL constraint as NOT VALID, then VALIDATE in a second transaction. Three deploys, zero locks.

Renaming a column. The trap that bites everyone. The sequence: add the new column, deploy code that writes to both, backfill, deploy code that reads from the new column, deploy code that stops writing to the old column, drop the old column in a separate migration weeks later. Five deploys minimum. Anything faster takes downtime.

Changing a column type. Similar to a rename. Add a new column with the new type, dual-write, backfill, switch reads, stop writing old, drop old. The temptation to ALTER COLUMN TYPE in place is real and almost always wrong on a busy table.

Adding an index on a large table. Always CREATE INDEX CONCURRENTLY (Postgres) or use online DDL (MySQL/InnoDB). Non-concurrent index creation locks writes for the duration, which can be hours on a multi-billion row table.

Dropping a column. Deploy code that does not reference the column. Wait a full deploy cycle. Then drop the column. Dropping a column the application still references will take you down immediately.

The operational practices

Migrate during low-traffic windows. Not because the migration itself requires it, but because if something does go wrong, you want fewer affected users while you investigate. 2 AM in your largest market.

Use a migration tool with transactional safety. Flyway, Liquibase, Alembic, golang-migrate. Pick one. Use it consistently. Stop running raw SQL through psql from someone's laptop.

Have a rollback plan. Most schema changes are not trivially reversible. The plan for rollback is usually "fix forward with another migration" but you need to think it through before you ship, not after.

Watch the right metrics during the migration. Replication lag, lock wait time, error rate on the application. Have these on a dashboard you actively watch. Most failed migrations show distress in metrics 30 to 90 seconds before the application starts erroring.

Backfill in batches. Never UPDATE a million rows in a single transaction. Batch in chunks of 1,000 to 10,000, sleep between batches, and watch for replication lag. A backfill that takes 4 hours and does not impact production is infinitely better than a backfill that takes 20 minutes and locks the table.

The mindset that matters

The teams that run schema changes well treat each migration as a deploy in itself, with the same care, the same review, and the same rollback planning. The teams that have outages treat schema changes as one-off SQL commands.

If your migrations feel scary, that is information. The fix is not to do them less often. It is to make each one smaller and more routine until they are boring.

Migration coming up?

We help engineering teams plan and execute risky schema changes safely, especially when the existing team has not done it before.

Get a second pair of eyes

Not ready for a call? Same.

Get the playbook, not a sales pitch

If this was useful, Jacob sends a few short, practical notes on cutting cloud spend and scaling infra the right way. No fluff, unsubscribe in one click. Just reply if you want to talk; it reaches him directly.

From Jacob Masse, founder of traztech. No spam, unsubscribe in one click.

Need help with any of this?

We help startups build secure, scalable infrastructure. Book a free strategy call and let\'s talk about your stack.

Book a free consultation