Kubernetes in Production: Lessons from 50 Startups

Kubernetes has become the default answer to "how should we run our containers in production?" But after deploying and managing Kubernetes clusters for over 50 startups, we have learned that Kubernetes is not always the right answer, and that even when it is, most teams adopt it the wrong way.

Here are the lessons we have learned the hard way.

Lesson 1: You probably do not need Kubernetes yet

If you have fewer than 10 microservices and fewer than 5 engineers, Kubernetes is almost certainly overkill. The operational overhead of running a cluster, managing upgrades, debugging networking issues, and maintaining Helm charts will consume more engineering time than it saves.

For teams at this scale, use a managed container service like AWS ECS with Fargate, Google Cloud Run, or Railway. You get the benefits of containerized deployments without the Kubernetes tax. When you outgrow these services (and you will know when that happens), migrate to Kubernetes.

Lesson 2: Use a managed Kubernetes service

If you do need Kubernetes, do not run it yourself. Use EKS, GKE, or AKS. Self-managed Kubernetes (kubeadm, kops, or bare metal) is only justified if you have a dedicated platform team of 3+ engineers and a specific reason why managed services do not work for you.

The control plane is the hardest part of Kubernetes to operate. Managed services handle etcd backups, API server availability, and version upgrades. This alone saves you hundreds of hours per year and eliminates an entire category of 2 AM incidents.

Lesson 3: Start with namespaces, not clusters

Many teams create a separate cluster for each environment (dev, staging, production). That means three control planes to pay for and three clusters to upgrade, monitor, and secure. Instead, start with a single cluster and use namespaces to separate environments. Namespaces alone do not block traffic or access between environments, so add network policies and RBAC to enforce the boundaries.
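As a rough sketch of what a namespace boundary can look like, the script below builds one NetworkPolicy per environment that admits ingress only from pods in the same namespace. The namespace names and the policy name are assumptions for illustration, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json


def same_namespace_only_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that admits ingress only from pods in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # The policy name is a hypothetical choice for this example.
        "metadata": {"name": "allow-same-namespace-only", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector in "from" matches only pods in the policy's own namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


# Assumed environment namespaces; adjust to match your cluster.
ENVIRONMENTS = ["dev", "staging", "production"]
policies = [same_namespace_only_policy(ns) for ns in ENVIRONMENTS]

if __name__ == "__main__":
    # Emit the manifests as JSON so they can be reviewed before being applied.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": policies}, indent=2))
```

The output is plain JSON, so it can be checked into git and reviewed like any other manifest. RBAC roles and bindings per namespace would still be needed to limit who can change resources in each environment.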

Separate clusters make sense when you need hard isolation for compliance reasons or when your production workload is large enough to justify dedicated resources. For most startups, that threshold is around $50K per month in compute spend.

Lesson 4: Invest in observability from day one

Kubernetes abstracts away the underlying infrastructure, which makes debugging harder. When a pod is crash-looping, you need to know why. When latency spikes, you need to trace the request through multiple services. When a node runs out of memory, you need to know which pods were consuming it and which ones were evicted.

Install Prometheus and Grafana (or Datadog) on day one. Set up alerts for pod restart counts, node resource utilization, and failed deployments. Configure distributed tracing with Jaeger or OpenTelemetry. These tools are not nice-to-haves. They are requirements for running Kubernetes in production.
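To make the restart-count alert concrete, here is a minimal sketch of the check such an alert performs: compare two snapshots of per-pod restart counts and flag pods whose count rose by more than a threshold. The pod names, counts, and the threshold of 3 are made up for illustration; in practice the monitoring stack evaluates a rule like this for you.

```python
def pods_restarting_too_often(earlier: dict, later: dict, max_restarts: int = 3) -> dict:
    """Return pods whose restart count rose by more than max_restarts between two snapshots."""
    flagged = {}
    for pod, count in later.items():
        # A pod missing from the earlier snapshot is treated as having started at zero.
        delta = count - earlier.get(pod, 0)
        if delta > max_restarts:
            flagged[pod] = delta
    return flagged


# Hypothetical restart counts taken some minutes apart.
earlier = {"api-7d9f": 2, "worker-5c1a": 0, "cron-91be": 1}
later = {"api-7d9f": 9, "worker-5c1a": 1, "cron-91be": 1}

alerts = pods_restarting_too_often(earlier, later)
print(alerts)  # {'api-7d9f': 7}
```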

Lesson 5: GitOps is the right deployment model

Use ArgoCD or Flux to deploy to your cluster. Store your Kubernetes manifests in a git repository. Every change to your cluster state goes through a pull request, gets reviewed, and is applied automatically. This gives you an audit trail, easy rollbacks, and a single source of truth for your infrastructure.

The alternative, running kubectl apply from a CI pipeline, works but lacks the reconciliation loop that GitOps provides. If someone manually changes a resource in the cluster, the GitOps controller detects the drift and can revert it automatically (in Argo CD, this requires enabling self-heal on automated sync). This prevents configuration drift and the "it works on staging but not production" problem.
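The reconciliation idea can be sketched in a few lines: compare the desired state from git with the live state in the cluster and work out what to correct. This is a simplified illustration, not how ArgoCD or Flux are implemented, and the resource names and fields are hypothetical.

```python
def reconcile(desired: dict, live: dict) -> dict:
    """Compare desired state (from git) with live state and return the corrections to apply."""
    actions = {}
    for name, spec in desired.items():
        if name not in live:
            actions[name] = "create"
        elif live[name] != spec:
            # Drift: the live resource no longer matches what is in git.
            actions[name] = "update"
    for name in live:
        if name not in desired:
            # Resources that exist in the cluster but not in git get pruned.
            actions[name] = "delete"
    return actions


# Hypothetical state: someone scaled "web" by hand and left a debug pod behind.
desired = {
    "web": {"replicas": 3, "image": "web:1.4.2"},
    "worker": {"replicas": 2, "image": "worker:2.0.0"},
}
live = {
    "web": {"replicas": 5, "image": "web:1.4.2"},
    "debug-shell": {"replicas": 1, "image": "busybox"},
}

plan = reconcile(desired, live)
print(plan)  # {'web': 'update', 'worker': 'create', 'debug-shell': 'delete'}
```

A CI pipeline that runs kubectl apply performs this comparison only when the pipeline runs. A GitOps controller repeats it continuously, which is why manual changes do not linger.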

Lesson 6: Right-size your resources

The most common mistake we see is over-provisioning. Teams set CPU requests to 1 core and memory requests to 1 GB for every pod, then wonder why they are paying for 20 nodes when they only need 5. Use tools like Goldilocks or the Vertical Pod Autoscaler (VPA) in recommendation mode to analyze actual resource usage and set appropriate requests and limits.
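The arithmetic behind the 20-versus-5 gap is worth spelling out. The sketch below estimates node count from summed requests, using assumed numbers: 70 pods, and nodes with 3.5 cores and 14 GiB allocatable. It ignores bin-packing fragmentation and system pods, so treat the result as a lower bound.

```python
def nodes_needed(pods: int, cpu_request_m: int, mem_request_mib: int,
                 node_cpu_m: int, node_mem_mib: int) -> int:
    """Estimate node count as the larger of the CPU-bound and memory-bound requirements."""
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    by_cpu = ceil_div(pods * cpu_request_m, node_cpu_m)
    by_mem = ceil_div(pods * mem_request_mib, node_mem_mib)
    return max(by_cpu, by_mem)


# Assumed node size: 3500 millicores and 14336 MiB allocatable per node.
NODE_CPU_M, NODE_MEM_MIB = 3500, 14336

# Blanket requests of 1 core and 1 GiB per pod.
over_provisioned = nodes_needed(70, 1000, 1024, NODE_CPU_M, NODE_MEM_MIB)
# Requests lowered to 250 millicores and 512 MiB after measuring usage.
right_sized = nodes_needed(70, 250, 512, NODE_CPU_M, NODE_MEM_MIB)

print(over_provisioned, right_sized)  # 20 5
```

The scheduler places pods by their requests, not by what they actually consume, so inflated requests translate directly into extra nodes.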

On the flip side, under-provisioning causes OOM kills and CPU throttling, which manifest as random latency spikes and pod restarts. Finding the right balance takes data, not guesswork.
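One way to turn usage data into a request is to take a high percentile of observed usage and add headroom. The sketch below uses the 90th percentile plus 20 percent, which is an assumed heuristic for illustration and not what Goldilocks or the VPA compute; the samples are made up.

```python
def recommend_request(samples: list, percentile: int = 90, headroom_pct: int = 20) -> int:
    """Recommend a request: the nearest-rank percentile of observed usage plus headroom."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile, computed with integer ceiling division.
    rank = -(-percentile * len(ordered) // 100)
    value = ordered[max(rank, 1) - 1]
    return value * (100 + headroom_pct) // 100


# Hypothetical CPU usage samples for one container, in millicores.
cpu_usage_m = [120, 135, 110, 180, 150, 140, 300, 130, 125, 145]

cpu_request_m = recommend_request(cpu_usage_m)
print(cpu_request_m)  # 216
```

Note that the single 300-millicore spike does not drive the request. Whether that is acceptable depends on the workload: for memory, where exceeding the limit means an OOM kill, a higher percentile or the observed maximum is usually the safer choice.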

Lesson 7: Plan for failure

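Kubernetes is resilient by design, but only if you configure it correctly. Set pod disruption budgets so node drains and cluster upgrades do not evict all replicas of a service at once, and set a rolling update strategy so deployments keep enough replicas available while they roll out. Configure readiness probes so Services send traffic only to pods that are ready to handle it, and liveness probes so the kubelet restarts containers that have hung. Use pod anti-affinity to spread replicas across nodes. Test what happens when a node goes down.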
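A pod disruption budget caps how many replicas voluntary evictions, such as a node drain, may remove at once. The sketch below builds a PodDisruptionBudget manifest and shows the arithmetic behind it. The app name "checkout" and the "app" label are hypothetical, and the helper functions are illustrative, not part of any Kubernetes client library.

```python
import json


def pod_disruption_budget(app: str, min_available: int) -> dict:
    """Build a PodDisruptionBudget that keeps at least min_available pods of an app running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            "minAvailable": min_available,
            # Assumes the app's pods carry an "app" label with this value.
            "selector": {"matchLabels": {"app": app}},
        },
    }


def allowed_disruptions(healthy_replicas: int, min_available: int) -> int:
    """How many pods may be evicted voluntarily right now without breaking the budget."""
    return max(0, healthy_replicas - min_available)


pdb = pod_disruption_budget("checkout", min_available=2)

if __name__ == "__main__":
    print(json.dumps(pdb, indent=2))
    print(allowed_disruptions(healthy_replicas=3, min_available=2))  # 1
    print(allowed_disruptions(healthy_replicas=2, min_available=2))  # 0
```

With three healthy replicas and minAvailable set to 2, a drain can evict one pod at a time. With only two replicas, the same budget blocks every eviction, which stalls node drains, so the budget and the replica count need to be chosen together.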

The startups that run Kubernetes successfully are the ones that assume things will break and plan accordingly. The ones that struggle are the ones that treat Kubernetes as magic that "just works."

If you are considering adopting Kubernetes or struggling with an existing deployment, let us help. We have seen every pattern and every anti-pattern, and we can save you months of trial and error.
