Scaling PostgreSQL in Production: Lessons from Real Startups

PostgreSQL will take a SaaS startup further than most teams expect. Many companies run on a single primary Postgres instance well past $50M ARR. The teams that get into trouble usually do so by choosing a more complex database stack before they have to, or by waiting too long to do basic Postgres scaling work.

Here is the sequence that actually plays out, in roughly the order you will need each.

Stage 1: get the fundamentals right (year 1)

You can avoid most scaling problems by getting the basics right from the start.

Indexes on every column you query by. It sounds obvious, yet it is repeatedly missed. Run EXPLAIN on every query your application issues. An index costs some write overhead and disk space; a sequential scan on a million-row table in a hot path costs far more.
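One way to make the EXPLAIN habit routine is to check plans in code. The sketch below walks the output of EXPLAIN (FORMAT JSON) and reports sequential scans above a cost threshold. The function name, the threshold, and the sample plan are illustrative assumptions; the sample is hand-written to mimic the JSON layout and was not captured from a real database.

```python
import json

def find_seq_scans(explain_json, min_cost=1000.0):
    """Return (table, total_cost) for each Seq Scan node at or above min_cost.

    Costs are in the planner's arbitrary units, so the threshold is a
    judgment call, not a universal constant.
    """
    # EXPLAIN (FORMAT JSON) returns a list with one object holding the root "Plan".
    root = json.loads(explain_json)[0]["Plan"]
    hits = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.get("Node Type") == "Seq Scan" and node.get("Total Cost", 0.0) >= min_cost:
            hits.append((node.get("Relation Name"), node["Total Cost"]))
        # Child nodes, when present, live under the "Plans" key.
        stack.extend(node.get("Plans", []))
    return hits

# Hand-written sample plan (hypothetical tables and costs).
SAMPLE_PLAN = json.dumps([{
    "Plan": {
        "Node Type": "Hash Join",
        "Total Cost": 40000.5,
        "Plans": [
            {"Node Type": "Seq Scan", "Relation Name": "orders", "Total Cost": 35811.0},
            {"Node Type": "Hash", "Total Cost": 5.0, "Plans": [
                {"Node Type": "Seq Scan", "Relation Name": "countries", "Total Cost": 4.5},
            ]},
        ],
    }
}])

print(find_seq_scans(SAMPLE_PLAN))  # [('orders', 35811.0)]
```

The small scan on the hypothetical countries table is ignored on purpose: a sequential scan on a tiny table is often the right plan.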

Connection pooling. PgBouncer in front of Postgres, transaction-mode pooling, sized correctly for your workload. Without it, you will exhaust max_connections at completely reasonable traffic levels.
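The arithmetic behind that warning is simple enough to sketch. The example below assumes the stock Postgres settings (max_connections = 100, superuser_reserved_connections = 3); the function name and the instance counts are made up for illustration.

```python
def connection_headroom(clients, connections_per_client,
                        max_connections=100, reserved=3):
    """Connections left on the server; a negative result means exhaustion."""
    usable = max_connections - reserved
    return usable - clients * connections_per_client

# 20 app instances, each holding its own pool of 10 direct connections.
print(connection_headroom(20, 10))  # -103

# The same fleet behind one pooler that opens at most 40 server connections.
print(connection_headroom(1, 40))   # 57
```

Twenty instances with ten connections each is not an unusual deployment, and it already asks for roughly twice what a default server will accept.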

Query monitoring. pg_stat_statements enabled from day one. You cannot fix what you cannot measure. Datadog Database Monitoring or pganalyze are worth the cost.
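As a starting point before buying a tool, you can query pg_stat_statements directly. This is a minimal sketch that assumes PostgreSQL 13 or later (where the column is total_exec_time; older versions call it total_time) and a DB-API cursor that uses the %s parameter style, as psycopg does. The helper name top_queries is hypothetical.

```python
TOP_QUERIES_SQL = """
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT %s
"""

COLUMNS = ("query", "calls", "total_exec_time", "mean_exec_time")

def top_queries(cursor, limit=10):
    """Return the statements with the highest cumulative execution time.

    `cursor` is any DB-API cursor connected to a database where the
    pg_stat_statements extension is installed and enabled.
    """
    cursor.execute(TOP_QUERIES_SQL, (limit,))
    return [dict(zip(COLUMNS, row)) for row in cursor.fetchall()]
```

Sorting by total time rather than mean time surfaces the cheap query that runs a million times a day, which is often the one worth fixing first.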

Reasonable defaults. shared_buffers around 25% of RAM, effective_cache_size around 75%, work_mem sized based on your query patterns. The Postgres defaults are conservative for modern hardware.
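To make those percentages concrete, here is a small sketch that turns a RAM size into starting values. The function names are hypothetical, and the ops_per_query figure is a rough assumption for illustration: work_mem applies per sort or hash operation, so the worst case multiplies across connections.

```python
def suggest_memory_settings(ram_mb):
    """Starting points from the rules of thumb above, not tuned values."""
    return {
        "shared_buffers": f"{ram_mb // 4}MB",            # about 25% of RAM
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",  # about 75% of RAM
    }

def worst_case_work_mem_mb(work_mem_mb, connections, ops_per_query=2):
    """Upper bound if every connection runs ops_per_query sorts or hashes at once."""
    return work_mem_mb * connections * ops_per_query

print(suggest_memory_settings(64 * 1024))
# {'shared_buffers': '16384MB', 'effective_cache_size': '49152MB'}

print(worst_case_work_mem_mb(32, 100))  # 6400
```

The second number is why work_mem cannot be set from RAM alone: a modest 32MB setting can still claim several gigabytes under load.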

Most performance issues at year 1 are missing indexes or N+1 queries from the ORM. Fix these before doing anything more sophisticated.
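If the N+1 pattern is unfamiliar, the sketch below shows it next to the single-query alternative. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; that is an assumption for convenience, and the same shape applies to Postgres. The tables, the data, and the CountingConnection wrapper are all hypothetical.

```python
import sqlite3

class CountingConnection:
    """Wraps a connection and counts the statements sent through query()."""
    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    conn.executemany("INSERT INTO authors VALUES (?, ?)",
                     [(i, f"author{i}") for i in range(1, 6)])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                     [(i, (i % 5) + 1, f"post{i}") for i in range(1, 21)])
    return CountingConnection(conn)

def titles_n_plus_one(db):
    # One query for the authors, then one more per author: N+1 round trips.
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors ORDER BY id"):
        rows = db.query("SELECT title FROM posts WHERE author_id = ? ORDER BY id",
                        (author_id,))
        result[name] = [title for (title,) in rows]
    return result

def titles_joined(db):
    # A single JOIN returns the same data in one round trip.
    result = {}
    rows = db.query("SELECT a.name, p.title FROM authors a "
                    "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id")
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result

slow_db, fast_db = make_db(), make_db()
assert titles_n_plus_one(slow_db) == titles_joined(fast_db)
print(slow_db.count, fast_db.count)  # 6 1
```

With five authors the difference is six statements versus one; with five thousand rows on a list page, it is the difference between a fast page and a slow one.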

Stage 2: read replicas (year 1-2)

Once read load is meaningful, add one or two read replicas. Route appropriate queries to them: analytics, reporting, search, anything that does not need to be transactionally consistent with the primary.

The traps:

  • Replication lag is real. Reads from replicas can return slightly stale data.
  • Not every query can be safely moved to a replica. Mark queries explicitly.
  • Failover handling needs to be in the application or a connection layer (PgBouncer, RDS Proxy, etc.).
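The first two traps can be handled with a small routing rule, sketched below under stated assumptions: queries go to the primary unless explicitly marked replica-safe, and a user who has just written is pinned to the primary for a short window so they read their own writes. The class name, the two-second window, and the "primary"/"replica" labels are illustrative, not a library API.

```python
import time

class ReadRouter:
    """Chooses a target for each query; defaults to the primary."""

    def __init__(self, stick_seconds=2.0, clock=time.monotonic):
        self.stick_seconds = stick_seconds  # assumed upper bound on replication lag
        self.clock = clock
        self._last_write = {}               # user_id -> time of last write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target(self, user_id, replica_safe=False):
        # Queries must opt in to replica reads explicitly.
        if not replica_safe:
            return "primary"
        last = self._last_write.get(user_id)
        # Recent writers stay on the primary to avoid reading stale data.
        if last is not None and self.clock() - last < self.stick_seconds:
            return "primary"
        return "replica"

router = ReadRouter()
print(router.target("u1", replica_safe=True))  # replica
router.record_write("u1")
print(router.target("u1", replica_safe=True))  # primary
```

A fixed window is the simplest option. It only holds if replication lag stays below the window, so it is worth alerting on lag separately.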

Stage 3: vertical scaling (year 2)

Postgres scales surprisingly well vertically. A db.r6i.16xlarge on AWS RDS gives you 64 vCPUs and 512GB of RAM, which handles enormous workloads. The cost is real but lower than the engineering cost of horizontal scaling.

The right time to scale up vertically is when query optimization stops moving the needle and the bottleneck is genuinely CPU or memory. Do not scale up to avoid query optimization work; that just moves the problem.

Stage 4: partitioning (year 2-3)

For tables that grow large (events tables, time-series tables, audit logs), table partitioning by date is high-leverage. Queries that filter by date skip irrelevant partitions; old partitions can be dropped or archived cheaply.

Postgres native partitioning is good enough for most cases. pg_partman automates partition management. Do this before tables grow past ~100M rows; partitioning a huge existing table is painful.
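For a sense of what native partitioning involves, the sketch below generates the DDL for monthly range partitions. The events table, its columns, and the naming scheme are assumptions for illustration; pg_partman handles this for you in practice. Identifiers are interpolated directly, so only pass trusted table names.

```python
from datetime import date

# Hypothetical parent table, partitioned by its timestamp column.
PARENT_DDL = (
    "CREATE TABLE events ("
    "id bigint, created_at timestamptz NOT NULL, payload jsonb"
    ") PARTITION BY RANGE (created_at);"
)

def month_partition_ddl(parent, first_day):
    """DDL for the partition covering the month that starts on first_day."""
    if first_day.day != 1:
        raise ValueError("first_day must be the first day of a month")
    if first_day.month == 12:
        next_month = date(first_day.year + 1, 1, 1)
    else:
        next_month = date(first_day.year, first_day.month + 1, 1)
    name = f"{parent}_{first_day:%Y_%m}"
    # Range bounds are inclusive at FROM and exclusive at TO.
    return (f"CREATE TABLE {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{first_day.isoformat()}') "
            f"TO ('{next_month.isoformat()}');")

print(month_partition_ddl("events", date(2024, 12, 1)))
# CREATE TABLE events_2024_12 PARTITION OF events FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
```

Retiring a month of data then becomes a DROP TABLE on one partition instead of a large DELETE against the whole table.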

Stage 5: connection-level work (year 2-3)

At higher scale, connection management becomes a bottleneck. Options:

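PgBouncer transaction-mode pooling at higher pool sizes. Watch out for prepared-statement compatibility: in transaction mode a client is not pinned to one server connection, so statements prepared in one transaction may not exist on the connection that serves the next.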

PgBouncer plus RDS Proxy. Some teams use both for failover handling plus connection limiting.

Application-level connection management. Some frameworks need help to reuse connections properly.

Stage 6: extract by domain (year 3+)

Before splitting Postgres at the row level (sharding), most teams should split by domain. Move authentication, billing, analytics, or other distinct domains to separate databases. Each is independently scalable. None requires you to solve cross-shard transactions.

This works because most "scaling problems" are actually contention problems. Splitting domains relieves the contention without requiring distributed database expertise.

Stage 7: sharding (rarely, year 4+)

Sharding is the last resort. Citus, Vitess, application-level sharding. All of these introduce significant operational complexity. Cross-shard queries are slow or impossible. Schema changes are hard. Backups and restores get complicated.
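To see where the cross-shard pain comes from, consider a minimal sketch of application-level sharding by tenant. The hash-modulo scheme and the function names are assumptions chosen for brevity; Citus and Vitess use their own distribution mechanisms.

```python
import hashlib

def shard_for(tenant_id, shard_count):
    """Map a tenant to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def shards_touched(tenant_ids, shard_count):
    """Shards a query must visit to cover every tenant in tenant_ids."""
    return sorted({shard_for(t, shard_count) for t in tenant_ids})

# A single-tenant query goes to exactly one shard.
print(len(shards_touched([42], 4)))  # 1

# A report across many tenants has to fan out and merge results in the application.
print(shards_touched(range(100), 4))
```

Note also that with plain modulo hashing, changing shard_count remaps most tenants, which is one reason resharding is such an expensive operation.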

Many startups think they need sharding and actually need stage 6 (domain split). Some need stage 5 (connection management). Very few need stage 7. If you are considering sharding, get a second opinion first.

The mistakes that get teams stuck

Moving to a "more scalable" database before exhausting Postgres. DynamoDB, MongoDB, Cassandra each solve specific problems Postgres does not. They also create new problems Postgres did not have. Use them when you have a problem they specifically solve, not as preemptive scaling.

Avoiding hard-to-roll-back changes (indexes, partitioning) because they are scary. Concurrent index creation is safe. Native partitioning is well established. Doing them in a panic at year 4 is much worse than doing them deliberately at year 2.

Treating the database as someone else's problem. Engineers should know which queries are slow, which indexes exist, and what the locking patterns look like. Databases get into trouble fastest when no one on the application team understands them.

From Jacob Masse, founder of traztech.