Kubernetes in Production: Lessons from 50 Startups

Kubernetes has become the default answer to "how should we run our containers in production?" But after deploying and managing Kubernetes clusters for over 50 startups, we have learned that K8s is not always the right answer, and even when it is, the way most teams adopt it is wrong.

Here are the lessons we have learned the hard way.

Lesson 1: You probably do not need Kubernetes yet

If you have fewer than 10 microservices and fewer than 5 engineers, Kubernetes is almost certainly overkill. The operational overhead of running a cluster, managing upgrades, debugging networking issues, and maintaining Helm charts will consume more engineering time than it saves.

For teams at this scale, use a managed container service like AWS ECS with Fargate, Google Cloud Run, or Railway. You get the benefits of containerized deployments without the Kubernetes tax. When you outgrow these services (and you will know when that happens), migrate to Kubernetes.

Lesson 2: Use a managed Kubernetes service

If you do need Kubernetes, do not run it yourself. Use EKS, GKE, or AKS. Self-managed Kubernetes (kubeadm, kops, or bare metal) is only justified if you have a dedicated platform team of 3+ engineers and a specific reason why managed services do not work for you.

The control plane is the hardest part of Kubernetes to operate. Managed services handle etcd backups, API server availability, and version upgrades. This alone saves you hundreds of hours per year and eliminates an entire category of 2 AM incidents.

Lesson 3: Start with namespaces, not clusters

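Many teams create separate clusters for each environment (dev, staging, production). Three clusters mean three control planes to pay for and three sets of upgrades and add-ons to maintain, which roughly triples your operational overhead. Instead, start with a single cluster and use namespaces to isolate environments. Add network policies and RBAC to enforce boundaries, because a namespace on its own is only a naming scope and does not restrict traffic or access.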
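As a rough sketch of what namespace-level isolation can look like, the Python below builds a Namespace and a NetworkPolicy per environment and renders them as a JSON list that kubectl apply -f can read. The function names, the policy name, and the environment names are illustrative assumptions, not taken from any particular setup, and the policy only has an effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json


def namespace_isolation_manifests(env: str) -> list[dict]:
    """Build a Namespace plus a NetworkPolicy admitting only same-namespace ingress.

    The names used here are hypothetical examples.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": env, "labels": {"environment": env}},
    }
    policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace-only", "namespace": env},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A podSelector without a namespaceSelector matches only pods
            # in the policy's own namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, policy]


def render_manifest_list(envs: list[str]) -> str:
    """Render all manifests as one JSON document of kind List."""
    items = [m for env in envs for m in namespace_isolation_manifests(env)]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


if __name__ == "__main__":
    print(render_manifest_list(["dev", "staging", "production"]))
```

Generating the manifests from one function keeps the three environments identical by construction. RBAC bindings per namespace would follow the same pattern and are left out to keep the sketch short.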

Separate clusters make sense when you need hard isolation for compliance reasons or when your production workload is large enough to justify dedicated resources. For most startups, that threshold is around $50K per month in compute spend.

Lesson 4: Invest in observability from day one

Kubernetes abstracts away the underlying infrastructure, which means debugging becomes harder. When a pod is crash-looping, you need to know why. When latency spikes, you need to trace the request through multiple services. When a node runs out of memory, you need to know which pods were consuming it and which ones were evicted or OOM-killed as a result.

Install Prometheus and Grafana (or Datadog) on day one. Set up alerts for pod restart counts, node resource utilization, and failed deployments. Configure distributed tracing with Jaeger or OpenTelemetry. These tools are not nice-to-haves. They are requirements for running Kubernetes in production.
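To make the restart-count alert concrete, here is a minimal sketch of the logic such an alert encodes: flag any pod whose cumulative restart counter grew by more than a threshold inside a time window. In practice this would be an alert rule in your monitoring system rather than Python. The RestartSample type, the 15-minute window, and the threshold of 3 are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartSample:
    pod: str
    timestamp: float  # seconds since epoch
    restarts: int     # cumulative restart count reported for the pod


def pods_over_restart_threshold(
    samples: list[RestartSample],
    window_seconds: float = 900,  # assumed window: 15 minutes
    max_restarts: int = 3,        # assumed threshold
) -> list[str]:
    """Return pods whose restart count rose by more than max_restarts in the window."""
    if not samples:
        return []
    # The window ends at the most recent sample we have.
    cutoff = max(s.timestamp for s in samples) - window_seconds
    counts_by_pod: dict[str, list[int]] = {}
    for s in samples:
        if s.timestamp >= cutoff:
            counts_by_pod.setdefault(s.pod, []).append(s.restarts)
    # The counter is cumulative, so max minus min is the increase in the window.
    return sorted(
        pod
        for pod, counts in counts_by_pod.items()
        if max(counts) - min(counts) > max_restarts
    )
```

Alerting on the increase rather than the absolute count matters: a pod that restarted ten times last month and has been stable since should not page anyone.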

Lesson 5: GitOps is the right deployment model

Use ArgoCD or Flux to deploy to your cluster. Store your Kubernetes manifests in a git repository. Every change to your cluster state goes through a pull request, gets reviewed, and is applied automatically. This gives you an audit trail, easy rollbacks, and a single source of truth for your infrastructure.

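The alternative, running kubectl apply from a CI pipeline, works but lacks the reconciliation loop that GitOps provides. A GitOps controller continuously compares the cluster against the repository. If someone manually changes a resource in the cluster, the controller detects the drift and, when automated sync is configured to do so (self-heal in ArgoCD, for example), reverts it. This prevents configuration drift and reduces the "it works on staging but not production" problem.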
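The core of a reconciliation loop is a comparison between desired state (what is in git) and live state (what is in the cluster). The sketch below is a deliberately simplified model of that comparison, not how ArgoCD or Flux are implemented: resources are plain dicts keyed by a hypothetical name, and deleting resources that are missing from git (pruning) is assumed to be enabled.

```python
def reconcile(desired: dict[str, dict], live: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the (action, resource_name) pairs needed to make live match desired."""
    actions: list[tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            # Drift: the live resource differs from git, e.g. after a manual edit.
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            # Present in the cluster but not in git; removed only because
            # this sketch assumes pruning is turned on.
            actions.append(("delete", name))
    return sorted(actions)


if __name__ == "__main__":
    desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    live = {"web": {"replicas": 5}, "old-job": {"replicas": 1}}
    for action, name in reconcile(desired, live):
        print(action, name)
```

A one-shot kubectl apply from CI performs the create and update steps once, at deploy time. A controller runs this comparison repeatedly, which is what catches a manual edit made an hour after the deploy.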

Lesson 6: Right-size your resources

The most common mistake we see is over-provisioning. Teams set CPU requests to 1 core and memory requests to 1 GB for every pod, then wonder why they are paying for 20 nodes when they only need 5. The scheduler places pods based on requests, not actual usage, so inflated requests reserve capacity that nothing uses. Use tools like Goldilocks or the Kubernetes Vertical Pod Autoscaler (VPA) to analyze actual resource usage and set appropriate requests and limits.

On the flip side, under-provisioning has its own costs. Memory limits set too low cause OOM kills, and CPU limits set too low cause throttling. Both manifest as random latency spikes and pod restarts. Finding the right balance takes data, not guesswork.
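The arithmetic behind right-sizing can be sketched in a few lines. The example below derives a CPU request from observed usage (a high percentile plus headroom) and estimates how many nodes a given request implies. The 95th percentile, the 20% headroom, the 80 pods, and the 4000 millicores of allocatable CPU per node are all assumptions for illustration. The node estimate also ignores memory, system pods, and bin-packing effects, so treat it as a lower bound.

```python
import math


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of integers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_cpu_request(usage_millicores: list[int], pct: int = 95, headroom_pct: int = 20) -> int:
    """Suggest a CPU request: the pct-th percentile of observed usage plus headroom."""
    base = percentile(usage_millicores, pct)
    return math.ceil(base * (100 + headroom_pct) / 100)


def nodes_needed(pods: int, request_millicores: int, node_allocatable_millicores: int) -> int:
    """Lower bound on node count when CPU requests are the only constraint."""
    return math.ceil(pods * request_millicores / node_allocatable_millicores)


if __name__ == "__main__":
    # Hypothetical fleet: 80 pods on nodes with 4000m of allocatable CPU each.
    print(nodes_needed(80, 1000, 4000))  # every pod requests a full core
    print(nodes_needed(80, 250, 4000))   # requests sized to 250m
```

With these assumed numbers, the blanket 1-core request needs 20 nodes and the 250m request needs 5, which is the kind of gap described above.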

Lesson 7: Plan for failure

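Kubernetes is resilient by design, but only if you configure it correctly. Set pod disruption budgets so voluntary disruptions such as node drains and cluster upgrades do not evict all replicas of a service at once, and set maxUnavailable on the Deployment's rolling update strategy to get the same protection during rollouts. Configure readiness probes so Services only send traffic to pods that can handle it, and liveness probes so the kubelet restarts containers that have hung. Use pod anti-affinity to spread replicas across nodes. Test what happens when a node goes down.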
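The settings above could look roughly like the following, again built as Python dicts and rendered to JSON for kubectl apply -f. The app name, image, port, probe paths, and replica counts are hypothetical placeholders, and the probe timings are starting points rather than recommendations.

```python
import json


def resilient_workload(app: str, image: str, replicas: int = 3, port: int = 8080) -> list[dict]:
    """Build a Deployment and PodDisruptionBudget with basic failure-handling settings."""
    labels = {"app": app}
    container = {
        "name": app,
        "image": image,
        "ports": [{"containerPort": port}],
        # Readiness: the pod receives Service traffic only while this succeeds.
        "readinessProbe": {"httpGet": {"path": "/ready", "port": port}, "periodSeconds": 5},
        # Liveness: the kubelet restarts the container if this keeps failing.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
        },
    }
    # Prefer (not require) placing replicas of the same app on different nodes.
    anti_affinity = {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": labels},
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Limits how many replicas a rollout may take down at once.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 1, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"affinity": anti_affinity, "containers": [container]},
            },
        },
    }
    pdb = {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        # Evictions (e.g. node drains) must leave at least replicas - 1 pods running.
        "spec": {"minAvailable": max(1, replicas - 1), "selector": {"matchLabels": labels}},
    }
    return [deployment, pdb]


if __name__ == "__main__":
    manifests = resilient_workload("web", "example.com/web:1.0.0")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Preferred rather than required anti-affinity is a deliberate choice in this sketch: a required rule leaves pods unschedulable when there are fewer nodes than replicas.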

The startups that run Kubernetes successfully are the ones that assume things will break and plan accordingly. The ones that struggle are the ones that treat Kubernetes as magic that "just works."

If you are considering adopting Kubernetes or struggling with an existing deployment, let us help. We have seen every pattern and every anti-pattern, and we can save you months of trial and error.

From Jacob Masse, founder of traztech.