You do not need Datadog, a dedicated SRE team, or a six-figure observability budget to monitor your production systems. You need three things: application metrics, infrastructure metrics, and alerting that wakes someone up when something breaks. Here is how to set all three up in an afternoon.
Start with what matters
Before configuring any tool, decide what you need to monitor. For a typical SaaS application, the critical metrics are:
- Availability: Is the application responding to requests? (HTTP health check)
- Latency: How long do requests take? (p50, p95, p99 response times)
- Error rate: What percentage of requests are failing? (5xx error rate)
- Saturation: Are your resources running out? (CPU, memory, disk, database connections)
These four metrics, adapted from Google's "Four Golden Signals" framework (latency, traffic, errors, and saturation), cover the large majority of the problems you will encounter in production. Everything else is nice to have.
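As a rough illustration of how the latency and error-rate numbers are derived, the sketch below computes p50, p95, p99, and the 5xx error rate from a batch of request records. The `Request` type and the nearest-rank percentile method are assumptions made for this example; hosted monitoring tools compute these for you and may use a different percentile method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Return latency percentiles and the 5xx error rate for a batch of requests."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "error_rate_pct": 100 * errors / len(requests),
    }
```

Percentiles matter more than averages here: a healthy average can hide a slow tail that only shows up at p95 or p99.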
Option 1: Free tier stack (cost: $0)
If budget is a constraint, you can build solid monitoring with free tiers:
Uptime monitoring: Use UptimeRobot (free for 50 monitors). Configure HTTP checks for your main endpoints. Set check intervals to 5 minutes. This gives you availability monitoring with email and Slack alerts.
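To make the idea of an HTTP check concrete, here is a minimal sketch of what a single probe does, using only the Python standard library. The function name, the `opener` parameter (there so the check can be exercised without a network), and the rule that any 2xx or 3xx response counts as "up" are assumptions for illustration. A hosted service adds scheduling, multiple probe locations, and notifications on top of this.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, timeout=10.0, opener=urllib.request.urlopen):
    """Perform one HTTP GET and return (is_up, status_code, elapsed_seconds)."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        status = None      # DNS failure, refused connection, or timeout
    elapsed = time.monotonic() - started
    is_up = status is not None and 200 <= status < 400
    return is_up, status, elapsed
```

Pointing this at a dedicated health endpoint (for example a hypothetical `/health` route that also touches the database) tells you more than checking the home page alone.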
Application metrics: Use CloudWatch (included with AWS) or Cloud Monitoring (included with GCP). Send custom metrics from your application for response time and error rate. Both services include basic dashboarding.
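On AWS, sending custom metrics might look like the sketch below. It assumes the third-party `boto3` SDK is installed and AWS credentials are configured. The namespace `MyApp`, the metric names `ResponseTime` and `ServerErrors`, and the `Service` dimension are hypothetical names chosen for this example.

```python
def build_metric_data(service, duration_ms, status_code):
    """Build the MetricData payload for one handled request."""
    dimensions = [{"Name": "Service", "Value": service}]
    is_server_error = 500 <= status_code <= 599
    return [
        {
            "MetricName": "ResponseTime",
            "Dimensions": dimensions,
            "Value": float(duration_ms),
            "Unit": "Milliseconds",
        },
        {
            # Emit 0 or 1 per request so the Average statistic is the error fraction.
            "MetricName": "ServerErrors",
            "Dimensions": dimensions,
            "Value": 1.0 if is_server_error else 0.0,
            "Unit": "Count",
        },
    ]


def publish(metric_data, namespace="MyApp"):
    """Send the payload to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the payload builder works without the SDK

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )
```

Calling `publish` on every request adds latency and API cost, so in practice you would likely batch several data points into one call.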
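Infrastructure metrics: CloudWatch or Cloud Monitoring again. CPU and network metrics are collected automatically for cloud instances. Memory and disk usage are not visible from outside the instance, so install the CloudWatch agent (AWS) or the Ops Agent (GCP) to collect them. Set up alarms for CPU > 80%, memory > 85%, and disk > 90%.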
Log aggregation: CloudWatch Logs or Cloud Logging. Ship your application logs to a central location. Set up metric filters to alert on error spikes.
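A metric filter turns matching log lines into a number you can alarm on. The sketch below shows the idea locally: count ERROR lines per minute and flag the minutes that exceed a threshold. The log format (`<ISO-8601 timestamp> <LEVEL> <message>`) and both function names are assumptions for illustration. CloudWatch Logs and Cloud Logging use their own filter pattern syntax for this.

```python
from collections import Counter


def errors_per_minute(lines, level="ERROR"):
    """Count log lines at the given level, grouped by minute."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 2 or parts[1] != level:
            continue
        minute = parts[0][:16]  # "YYYY-MM-DDTHH:MM" prefix of the timestamp
        counts[minute] += 1
    return counts


def spikes(counts, threshold):
    """Return the minutes whose error count exceeds the threshold, in order."""
    return sorted(minute for minute, n in counts.items() if n > threshold)
```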
Option 2: Grafana Cloud stack (cost: $0-$50/month)
For a better experience, use Grafana Cloud's free tier, which includes 50GB of logs and 10K metrics series. You get:
- Grafana for dashboards
- Prometheus for metrics collection
- Loki for log aggregation
- Alerts with email, Slack, PagerDuty, and webhook integrations
Install the Grafana Agent (or its successor, Grafana Alloy) on your servers. It collects system metrics automatically. Add application instrumentation using Prometheus client libraries (available for every major language). Build a single dashboard with your four golden signals. Total setup time: 2-3 hours.
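To show what application instrumentation produces, the following sketch tracks a request counter and a latency histogram and renders them in the Prometheus text exposition format, which is what gets scraped from a `/metrics` endpoint. It is a standard-library illustration only. In a real service you would use the official Prometheus client library for your language, which handles this rendering, thread safety, and the HTTP endpoint. The class name, metric names, and bucket boundaries are assumptions for this example.

```python
import bisect


class RequestMetrics:
    BUCKETS = (0.1, 0.5, 1.0, 2.0)  # histogram upper bounds, in seconds

    def __init__(self):
        self.by_status = {}
        # One slot per bucket, plus a final slot for observations above the last bound.
        self.bucket_counts = [0] * (len(self.BUCKETS) + 1)
        self.duration_sum = 0.0

    def observe(self, status, seconds):
        """Record one finished request."""
        self.by_status[status] = self.by_status.get(status, 0) + 1
        # bisect_left places a value equal to a bound inside that bound's bucket.
        self.bucket_counts[bisect.bisect_left(self.BUCKETS, seconds)] += 1
        self.duration_sum += seconds

    def render(self):
        """Render all metrics in the Prometheus text exposition format."""
        lines = ["# TYPE http_requests_total counter"]
        for status in sorted(self.by_status):
            lines.append(
                f'http_requests_total{{status="{status}"}} {self.by_status[status]}'
            )
        lines.append("# TYPE http_request_duration_seconds histogram")
        cumulative = 0  # histogram buckets are cumulative
        bounds = self.BUCKETS + (float("inf"),)
        for bound, count in zip(bounds, self.bucket_counts):
            cumulative += count
            le = "+Inf" if bound == float("inf") else str(bound)
            lines.append(
                f'http_request_duration_seconds_bucket{{le="{le}"}} {cumulative}'
            )
        lines.append(f"http_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"http_request_duration_seconds_count {cumulative}")
        return "\n".join(lines) + "\n"
```

From these two metrics a dashboard can derive three of the four signals: request rate and error rate from the counter, and latency percentiles from the histogram buckets.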
Option 3: Datadog (cost: $100-$500/month for a small startup)
Datadog is the gold standard, but it gets expensive fast. If you can afford it, the developer experience is excellent: APM traces, log correlation, infrastructure maps, and synthetic monitoring all live in one platform. Start with the Infrastructure and Log Management products. Add APM later when you need request tracing.
Setting up alerting
Monitoring without alerting is a dashboard nobody looks at. Set up alerts for:
- Critical (wake someone up): Application down for > 5 minutes. Error rate > 10% for > 5 minutes. Database connections exhausted.
- Warning (Slack notification): CPU > 80% for > 15 minutes. Memory > 85% for > 15 minutes. Disk > 85%. Response time p95 > 2 seconds.
- Info (daily digest): Error rate > 1% (baseline monitoring). Deployment completed. SSL certificate expiring in < 30 days.
Route critical alerts to PagerDuty or OpsGenie ($9-$29/user/month). Route warnings to a dedicated Slack channel. Route info alerts to email or a daily summary.
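The severity tiers and routing above can be sketched as a small rule table. This is an illustration under stated assumptions, not how any particular tool implements it: each metric is sampled once per minute, a rule fires when every sample in its most recent window exceeds the threshold, and the metric names and route labels (`pager`, `slack`, `digest`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float
    for_minutes: int  # how long the breach must last; 0 means a single sample
    severity: str


RULES = [
    Rule("error_rate_pct", 10, 5, "critical"),
    Rule("cpu_pct", 80, 15, "warning"),
    Rule("memory_pct", 85, 15, "warning"),
    Rule("p95_seconds", 2, 0, "warning"),
]

ROUTES = {"critical": "pager", "warning": "slack", "info": "digest"}


def firing(rule, samples):
    """samples holds one reading per minute, oldest first."""
    window = max(1, rule.for_minutes)
    recent = samples[-window:]
    # Require a full window so a short history cannot trigger a sustained-breach rule.
    return len(recent) == window and all(v > rule.threshold for v in recent)


def route(rules, series):
    """Return (destination, metric) pairs for every rule currently firing."""
    return [
        (ROUTES[rule.severity], rule.metric)
        for rule in rules
        if firing(rule, series.get(rule.metric, []))
    ]
```

Requiring the breach to last for the whole window is what keeps a brief CPU spike from paging anyone, which helps with the alert fatigue problem.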
The biggest mistake in alerting is too many alerts. Alert fatigue is real. If your team gets more than 5 alerts per day, they will start ignoring all of them. Tune your thresholds until every alert represents a real problem that requires human attention.
Need help setting up monitoring?
traztech helps startups build monitoring and alerting systems that work. We set up dashboards, configure alerts, and train your team on incident response. No DevOps team required.