"Our production went down at 3 AM. We learned about it at 9 AM from an angry customer email."
Sound familiar? I have dozens of stories like this. In 2023, I consulted for a startup with $2M ARR — they had zero alerts. Monitoring = SSH to the server and tail -f logs. When the database crashed due to disk space, they lost 6 hours of operations and $15k in revenue.
Spoiler: You can set up proper monitoring in one evening. No Kubernetes, no DataDog at $3000/month, no dedicated DevOps team.
In this article — a minimalist stack for real projects: Prometheus for metrics, Grafana for visualization, Loki for logs. Everything runs on a single server, works for years, and saves your ass at 3 AM.
Why monitoring is not a luxury
Three real cases from my practice:
Case 1: Fintech startup (2024) Client base was growing, API was getting slower. The team knew "something was wrong" but didn't know what. Set up Prometheus in one evening — turns out one endpoint was making 300+ SQL queries per HTTP request. Fixed the N+1 problem, response time dropped from 2.5s to 120ms. The work paid for itself within a week.
Case 2: E-commerce on Django (2025) Production was "sluggish" every 2-3 days, no pattern. SSH + htop showed nothing. Added Grafana with memory metrics — found that a Celery worker was eating 16GB RAM and OOM-killer was murdering it. Culprit — memory leak in image processing. Fix took 2 hours, downtime stopped.
Case 3: B2B SaaS (2023)
Customers complained about "weird slowdowns". Metrics showed a latency spike every 6 hours. Loki helped find the pattern: backup script on the same server was running pg_dump without nice, CPU utilization hit 100%, app was choking. Solution — move backups to a separate machine.
90% of production problems are invisible without monitoring. You won't know about them until you get an angry email from a customer or lose money.
What is observability in simple terms
Observability — the ability to understand what's happening inside a system by looking at its external signals.
Three pillars of observability:
- Metrics — numerical indicators: CPU, memory, RPS, latency, error rate
- Logs — events: "user registered", "database error", "deployment started"
- Traces — request path through the system (not covered in this article, that's for microservices)
Metaphor: imagine a car.
- Metrics — speedometer, tachometer, engine temperature
- Logs — onboard computer with a record "engine overheated at 2:35 PM"
- Traces — dashcam showing the entire route
Without monitoring, you're driving with your eyes closed and learning about breakdowns when the car stops.
Stack choice: why Prometheus + Grafana + Loki
There are dozens of monitoring tools. I've tried Zabbix, Nagios, ELK Stack, Datadog, New Relic. For 90% of projects, the choice is obvious: Prometheus + Grafana + Loki.
Why this stack:
- ✅ Open source and free — no licenses, vendor lock-in, or $5k/month bills
- ✅ Production-grade — used by the largest companies (Google, Uber, GitLab)
- ✅ Lightweight — runs on a single server with 2GB RAM
- ✅ Integrations out of the box — exporters for everything: PostgreSQL, Redis, Nginx, Node.js
- ✅ Active development — updates every month, huge community
- ✅ Simplicity — set up in an hour, doesn't require a PhD in DevOps
Comparison with alternatives:
| Criterion | Prometheus Stack | ELK Stack | Datadog/New Relic | Zabbix |
|---|---|---|---|---|
| Cost | ✅ Free | ✅ Free | ❌ $100-5000/mo | ✅ Free |
| Setup simplicity | ✅ 1-2 hours | ⚠️ 4-8 hours | ✅ 30 minutes | ⚠️ 2-4 hours |
| Resources (RAM) | ✅ 1-2GB | ❌ 8-16GB | ☁️ SaaS | ⚠️ 2-4GB |
| Metrics | ✅ Excellent | ⚠️ Not focus | ✅ Excellent | ✅ Good |
| Logs | ✅ Loki | ✅ Excellent (ES) | ✅ Excellent | ⚠️ Basic |
| Alerts | ✅ Alertmanager | ⚠️ Complex | ✅ Excellent | ✅ Good |
| Dashboards | ✅ Grafana | ⚠️ Kibana | ✅ Beautiful | ⚠️ Built-in UI |
| Community & Ecosystem | ✅ Huge | ✅ Large | ⚠️ Vendor-lock | ⚠️ Outdated |
My choice:
- For startups and small projects (up to 10 servers) — Prometheus Stack
- For enterprise with compliance requirements — ELK Stack (audit and full-text log search usually demand Elasticsearch)
- For corporations with money — Datadog/New Relic (if budget exists and you're lazy)
- For legacy systems — Zabbix (if it's already there, don't touch it)
Stack architecture on a budget
Here's what we'll be running:
┌─────────────────────────────────────────────────┐
│ Your server (2-4GB RAM) │
├─────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────┐ │
│ │ Your app │ │ PostgreSQL │ │
│ │ (FastAPI/Django)│ │ / Redis │ │
│ └────┬────────────┘ └────┬─────────┘ │
│ │ │ │
│ │ metrics │ metrics │
│ │ + logs │ (exporter) │
│ ▼ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Prometheus │ │
│ │ (collects metrics every 15s) │ │
│ └───────────┬─────────────────────┘ │
│ │ │
│ │ query │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Grafana │ │
│ │ (visualization + alerts) │ │
│ └───────────┬─────────────────────┘ │
│ │ query │
│ ┌───────────▼─────────────────────┐ │
│ │ Loki │ │
│ │ (stores logs) │ │
│ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────┘
Components:
- Prometheus — time-series database for metrics. Polls your app and exporters via HTTP, collects metrics every 15-30 seconds.
- Grafana — web interface for dashboards and alerts. Connects to Prometheus and Loki, draws graphs.
- Loki — log aggregator. Your app sends logs, Loki indexes them, Grafana displays them.
- Exporters — applications that export metrics in Prometheus format:
- node_exporter — server metrics (CPU, RAM, disk, network)
- postgres_exporter — PostgreSQL metrics
- redis_exporter — Redis metrics
- nginx_exporter — Nginx metrics
Server requirements:
- Minimum: 2GB RAM, 2 CPU cores, 20GB disk
- Recommended: 4GB RAM, 2 CPU cores, 50GB disk
- OS: Ubuntu 22.04/24.04, Debian 12, or any Linux with Docker
I've run this stack even on a $5/month VPS (Hetzner CX21). For small projects, it's more than enough.
Practice: setting up the stack in an hour
Let's get to work. I assume you have a Ubuntu/Debian server with Docker.
Step 1: Create docker-compose.yml
Create a directory for monitoring:
mkdir -p /opt/monitoring
cd /opt/monitoring
Create docker-compose.yml:
services:
# Prometheus — collects metrics
prometheus:
image: prom/prometheus:v3.0.0
container_name: prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d" # Store metrics for 30 days
- "--web.enable-lifecycle" # API for hot-reload config
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
- prometheus_data:/prometheus
restart: unless-stopped
networks:
- monitoring
# Grafana — visualization
grafana:
image: grafana/grafana:11.4.0
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=your_secure_password # CHANGE THIS!
- GF_INSTALL_PLUGINS=grafana-piechart-panel
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
restart: unless-stopped
networks:
- monitoring
# Loki — logs
loki:
image: grafana/loki:3.3.2
container_name: loki
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- loki_data:/loki
restart: unless-stopped
networks:
- monitoring
# Node Exporter — server metrics
node_exporter:
image: prom/node-exporter:v1.8.2
container_name: node_exporter
command:
- "--path.rootfs=/host"
ports:
- "9100:9100"
volumes:
- /:/host:ro,rslave
restart: unless-stopped
networks:
- monitoring
volumes:
prometheus_data:
grafana_data:
loki_data:
networks:
monitoring:
driver: bridge
Step 2: Prometheus configuration
Create prometheus/prometheus.yml:
global:
scrape_interval: 15s # Collect metrics every 15 seconds
evaluation_interval: 15s # Check alert rules every 15 seconds
# Alerts (we'll create later)
rule_files:
- "/etc/prometheus/alerts.yml"
# Where to collect metrics from
scrape_configs:
# Prometheus itself
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Server metrics (CPU, RAM, Disk)
- job_name: "node"
static_configs:
- targets: ["node_exporter:9100"]
# Your application (FastAPI, Django, etc.)
# Uncomment and specify your app address
# - job_name: 'app'
# static_configs:
# - targets: ['app:8000']
# PostgreSQL (if using postgres_exporter)
# - job_name: 'postgres'
# static_configs:
# - targets: ['postgres_exporter:9187']
# Redis (if using redis_exporter)
# - job_name: 'redis'
#   static_configs:
#     - targets: ['redis_exporter:9121']
Create empty alerts file prometheus/alerts.yml (we'll fill it later):
groups:
- name: basic_alerts
interval: 30s
rules: []
Step 3: Launch the stack
docker compose up -d
Check that everything started:
docker compose ps
You should see 4 containers with status Up:
- prometheus
- grafana
- loki
- node_exporter
Check availability:
- Prometheus: http://your-server-ip:9090
- Grafana: http://your-server-ip:3000 (login: admin, password from docker-compose.yml)
- Loki: http://your-server-ip:3100/ready (should return ready)
Congratulations! Basic stack is running. Now let's configure dashboards and alerts.
Step 4: Grafana setup
- Open Grafana: http://your-server-ip:3000
- Log in (admin / your password from docker-compose)
- Add Data Source:
- Connections → Add data source → Prometheus
- URL: http://prometheus:9090
- Save & Test — should show "Data source is working"
- Repeat for Loki:
- Add data source → Loki
- URL: http://loki:3100
- Save & Test
Step 5: Import ready-made dashboard
No need to draw a dashboard from scratch — use a ready one.
- In Grafana: Dashboards → Import
- Enter dashboard ID: 1860 (Node Exporter Full)
- Click Load
- Select Prometheus data source
- Import
Voilà! You now have a dashboard with server metrics: CPU, RAM, Disk I/O, Network.
Other useful dashboards:
- PostgreSQL: ID 9628
- Redis: ID 11835
- Nginx: ID 12708
- Docker: ID 893
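These dashboards only show data if the matching exporter is running. As an example, here is a minimal postgres_exporter service you could add to the docker-compose.yml above; the image tag, credentials, and the postgres hostname are placeholders, so point DATA_SOURCE_NAME at your actual database:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0   # check for the current release
    container_name: postgres_exporter
    environment:
      # placeholder connection string — use a read-only monitoring user
      - DATA_SOURCE_NAME=postgresql://monitoring:password@postgres:5432/mydb?sslmode=disable
    ports:
      - "9187:9187"
    restart: unless-stopped
    networks:
      - monitoring
After that, uncomment the postgres job in prometheus/prometheus.yml and restart Prometheus; the dashboard with ID 9628 should start filling in. redis_exporter and nginx_exporter follow the same pattern.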
Adding metrics from your application
Now the most important part — metrics from your application.
Python (FastAPI / Django)
Install the library:
pip install prometheus-client
FastAPI:
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
app = FastAPI()
# Metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint']
)
@app.middleware("http")
async def prometheus_middleware(request, call_next):
method = request.method
endpoint = request.url.path
with REQUEST_DURATION.labels(method, endpoint).time():
response = await call_next(request)
REQUEST_COUNT.labels(method, endpoint, response.status_code).inc()
return response
# Endpoint for Prometheus
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
Now metrics are available at http://your-app:8000/metrics.
Django (with django-prometheus):
pip install django-prometheus
# settings.py
INSTALLED_APPS = [
'django_prometheus',
# ...
]
MIDDLEWARE = [
'django_prometheus.middleware.PrometheusBeforeMiddleware',
# ... other middleware
'django_prometheus.middleware.PrometheusAfterMiddleware',
]
# urls.py
urlpatterns = [
path('', include('django_prometheus.urls')),
# ...
]
Metrics at http://your-app:8000/metrics.
Node.js (Express)
npm install prom-client
const express = require("express");
const client = require("prom-client");
const app = express();
// Create registry
const register = new client.Registry();
// Collect default metrics (CPU, memory, event loop)
client.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new client.Histogram({
name: "http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"],
registers: [register],
});
// Middleware for metrics
app.use((req, res, next) => {
const start = Date.now();
res.on("finish", () => {
const duration = (Date.now() - start) / 1000;
httpRequestDuration
.labels(req.method, req.route?.path || req.path, res.statusCode)
.observe(duration);
});
next();
});
// Endpoint for Prometheus
app.get("/metrics", async (req, res) => {
res.set("Content-Type", register.contentType);
res.end(await register.metrics());
});
app.listen(3000);
Add application to Prometheus
Edit prometheus/prometheus.yml:
scrape_configs:
# ... existing jobs
- job_name: "myapp"
static_configs:
- targets: ["host.docker.internal:8000"] # Your applicationIf the app is in Docker Compose, use service name:
- job_name: "myapp"
static_configs:
- targets: ["myapp:8000"]Restart Prometheus:
docker compose restart prometheus
Check in Prometheus UI: Status → Targets — your app should be in UP status.
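You can also check from the command line via the Prometheus HTTP API (standard endpoints, default ports assumed):
# list scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
# instant query: `up` should be 1 for every job
curl -s 'http://localhost:9090/api/v1/query?query=up'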
Setting up logs with Loki
Loki collects logs from your application.
Option 1: Promtail (recommended)
Promtail — agent for collecting logs and sending to Loki.
Add to docker-compose.yml:
promtail:
image: grafana/promtail:3.3.2
container_name: promtail
volumes:
- /var/log:/var/log:ro # System logs
- ./promtail/config.yml:/etc/promtail/config.yml
- ./logs:/app/logs:ro # Your app logs
command: -config.file=/etc/promtail/config.yml
restart: unless-stopped
networks:
- monitoring
Create promtail/config.yml:
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
# Your application logs
- job_name: app
static_configs:
- targets:
- localhost
labels:
job: app
__path__: /app/logs/*.log
# System logs (optional)
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: syslog
__path__: /var/log/syslog
Restart the stack:
docker compose up -d
Option 2: Logging directly from the application
Python (python-logging-loki):
pip install python-logging-loki
import logging
from logging_loki import LokiHandler
logger = logging.getLogger("my-app")
logger.setLevel(logging.INFO)
loki_handler = LokiHandler(
url="http://loki:3100/loki/api/v1/push",
tags={"application": "my-app", "environment": "production"},
version="1",
)
logger.addHandler(loki_handler)
logger.info("Application started")
logger.error("Something went wrong", extra={"user_id": 123})
Viewing logs in Grafana
- Open Grafana
- Explore → select Loki data source
- Query: {job="app"}
- Click Run query
You'll see your application logs in real-time.
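One note: the | json parser used in the queries below only works if your application writes log lines as JSON. A minimal sketch with the python-json-logger package (an extra dependency, not part of the stack above):
# pip install python-json-logger
import logging
from pythonjsonlogger import jsonlogger

handler = logging.FileHandler("/app/logs/app.log")   # a file Promtail can tail
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("my-app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# extra fields become JSON keys, so LogQL can filter on user_id
logger.info("payment processed", extra={"user_id": 123, "endpoint": "/api/pay"})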
Useful Loki queries (LogQL):
# All app logs
{job="app"}
# Errors only
{job="app"} |= "ERROR"
# Logs for a specific user (requires JSON-formatted log lines)
{job="app"} | json | user_id="123"
# Error rate for last 5 minutes
rate({job="app"} |= "ERROR" [5m])
Alerts: so you don't sleep through production failure
Alerts are the most important part of monitoring. Let's set up alerting in 3 steps.
Step 1: Add alert rules
Edit prometheus/alerts.yml:
groups:
- name: critical_alerts
interval: 30s
rules:
# Server unreachable
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.job }} has been down for more than 1 minute."
# CPU above 80%
- alert: HighCPU
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for 5 minutes (current: {{ $value }}%)"
# RAM above 90%
- alert: HighMemory
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% (current: {{ $value }}%)"
# Disk above 85%
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} is {{ $value }}% full"
# High error rate (>5% requests with errors)
- alert: HighErrorRate
expr: sum by (job) (rate(app_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(app_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate in {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
# Slow requests (p95 latency > 1s)
- alert: SlowRequests
expr: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Slow requests in {{ $labels.job }}"
description: "95th percentile latency is {{ $value }}s"
Reload Prometheus config:
curl -X POST http://localhost:9090/-/reload
Check alerts: http://your-server-ip:9090/alerts
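It's also worth validating the rules file before each reload; promtool ships inside the Prometheus image:
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml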
Step 2: Alertmanager setup (notifications)
Alertmanager sends notifications to Telegram, Slack, email, etc.
Add to docker-compose.yml:
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/config.yml:/etc/alertmanager/config.yml
- alertmanager_data:/alertmanager
command:
- "--config.file=/etc/alertmanager/config.yml"
restart: unless-stopped
networks:
- monitoring
volumes:
# ... existing volumes
alertmanager_data:
Edit prometheus/prometheus.yml:
# Add at the beginning of the file
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]Create alertmanager/config.yml:
global:
resolve_timeout: 5m
route:
group_by: ["alertname", "cluster"]
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: "telegram"
receivers:
# Telegram (recommended)
- name: "telegram"
telegram_configs:
- bot_token: "YOUR_BOT_TOKEN" # Get from @BotFather
chat_id: YOUR_CHAT_ID # Your chat_id
parse_mode: "HTML"
message: |
<b>{{ .Status | toUpper }}</b>
{{ range .Alerts }}
<b>Alert:</b> {{ .Labels.alertname }}
<b>Severity:</b> {{ .Labels.severity }}
<b>Summary:</b> {{ .Annotations.summary }}
<b>Description:</b> {{ .Annotations.description }}
{{ end }}
# Slack (alternative)
# - name: 'slack'
# slack_configs:
# - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
# channel: '#alerts'
# title: 'Alert: {{ .GroupLabels.alertname }}'
# text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
# Email (alternative)
# - name: 'email'
# email_configs:
# - to: 'your-email@example.com'
# from: 'alerts@yourapp.com'
# smarthost: 'smtp.gmail.com:587'
# auth_username: 'your-email@gmail.com'
# auth_password: 'your-app-password'
Restart the stack:
docker compose up -d
How to get Telegram bot token and chat_id:
- Create a bot: message @BotFather → /newbot → follow the instructions → get the bot_token
- Get the chat_id: send the bot /start, then open https://api.telegram.org/bot<bot_token>/getUpdates → find "chat":{"id":123456789}
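Before generating real load, you can check that the Telegram route works at all by pushing a synthetic alert straight into Alertmanager's v2 API (the labels here are arbitrary test values):
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "Test notification from curl"}}]'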
Step 3: Test alerts
Let's create artificial load for testing:
# Load CPU
yes > /dev/null &
yes > /dev/null &
yes > /dev/null &
yes > /dev/null &
# After 5 minutes HighCPU alert should trigger
# Check: http://your-server-ip:9090/alerts
# Stop the load:
killall yes
You should receive a notification in Telegram within 5-6 minutes.
Alerts are working! Now you'll know about problems before customers do.
Dashboards for real life
Ready-made dashboards are good, but for production you need custom ones.
"Application Health" dashboard
Create a new dashboard in Grafana:
Panels:
- RPS (Requests Per Second)
  - Query: rate(app_requests_total[1m])
  - Visualization: Graph
- Error Rate (%)
  - Query: (sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m]))) * 100
  - Visualization: Graph
  - Threshold: warning at 1%, critical at 5%
- Latency (p50, p95, p99)
  - Queries:
    histogram_quantile(0.50, rate(app_request_duration_seconds_bucket[5m]))  # p50
    histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m]))  # p95
    histogram_quantile(0.99, rate(app_request_duration_seconds_bucket[5m]))  # p99
  - Visualization: Graph
- Active Users (if you have this metric)
  - Query: active_users_gauge
  - Visualization: Stat
- Top 5 Slowest Endpoints
  - Query: topk(5, histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])))
  - Visualization: Table
"Infrastructure" dashboard
- CPU Usage
  - Query: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Memory Usage
  - Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Disk I/O
  - Queries: rate(node_disk_read_bytes_total[5m]) and rate(node_disk_written_bytes_total[5m])
- Network Traffic
  - Queries: rate(node_network_receive_bytes_total[5m]) and rate(node_network_transmit_bytes_total[5m])
Troubleshooting: how to investigate issues
You got an alert at 3 AM. What to do?
Scenario 1: HighCPU alert
- Grafana → Infrastructure dashboard → look at the CPU graph
- Correlate: check RPS over the same window — if CPU grew along with a traffic spike, you likely just need to scale; if traffic is flat, suspect the code
- Logs in Loki: {job="app"} | json | line_format "{{.endpoint}} {{.duration}}" — look for slow endpoints
- Prometheus: topk(5, rate(app_request_duration_seconds_sum[5m])) — top slow requests
- Fix: optimize the code or scale
Scenario 2: HighMemory alert
- Grafana → Infrastructure → Memory Usage
- Prometheus: process_resident_memory_bytes — check app memory consumption
- Logs: {job="app"} |= "OutOfMemory" or |= "MemoryError"
- Hypothesis: memory leak? Check the code
- Temporary fix: docker compose restart app
- Long-term fix: profile the code (memory_profiler, py-spy)
Scenario 3: HighErrorRate alert
- Grafana → Application Health → Error Rate graph
- Prometheus: rate(app_requests_total{status=~"5.."}[5m]) — which endpoints?
- Loki: {job="app"} |~ "ERROR|Exception" — read the stack traces
- Root cause: database unavailable? API down? Timeout?
- Fix: depends on the cause
Scenario 4: SlowRequests alert
- Prometheus: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])), broken down by endpoint
- Loki: find the specific slow requests with their parameters
- Database: check pg_stat_statements — maybe a slow query? (query sketch below)
- Fix: add indexes, cache, optimize
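The pg_stat_statements check, roughly (requires the extension to be enabled via shared_preload_libraries and CREATE EXTENSION; column names are for PostgreSQL 13+, and the user/database are placeholders):
psql -U postgres -d mydb -c "
  SELECT calls,
         round(mean_exec_time::numeric, 1) AS mean_ms,
         left(query, 80) AS query
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10;"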
Pro tip: create a runbook for each alert, a short document with investigation steps and typical fixes. It will save you hours at 3 AM.
Best practices: what I learned over the years
1. Retention Policy
By default, Prometheus stores metrics for 15 days. That's not enough. Set 30-90 days:
command:
- "--storage.tsdb.retention.time=90d"
For Loki:
# loki-config.yaml
limits_config:
retention_period: 30d
Note: for Loki to actually delete old data, retention also has to be enabled on the compactor (retention_enabled: true); check the retention docs for your Loki version.
2. Don't monitor everything
Metrics cost memory and disk. Monitor only what helps make decisions:
- ✅ Monitor: RPS, error rate, latency, CPU, RAM, disk
- ❌ Don't monitor: clicks on every button (that's for analytics, not observability)
3. Alert hygiene
- Group alerts: don't send 50 notifications per minute, group to 1 every 5 minutes
- Severity levels: critical → phone call, warning → Telegram, info → logs only (routing sketch below)
- Mute during deploys: silence alerts while deploying, otherwise you'll get false positives
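A sketch of severity-based routing in alertmanager/config.yml; the "oncall" receiver is a placeholder for whatever paging integration you use (PagerDuty, a phone-call webhook, etc.):
route:
  receiver: "telegram"            # default: everything goes to Telegram
  routes:
    - matchers:
        - severity="critical"
      receiver: "oncall"          # placeholder receiver for paging
      repeat_interval: 1h         # nag more often while it's still firing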
4. Backup configs
Store configs in Git:
git init
git add docker-compose.yml prometheus/ grafana/ alertmanager/
git commit -m "Initial monitoring setup"
git remote add origin git@github.com:yourname/monitoring-config.git
git push
5. Security
- Don't expose ports publicly: bind container ports to 127.0.0.1 in docker-compose.yml and put an Nginx reverse proxy with auth in front (sketch below)
- Change Grafana default password
- Restrict Prometheus access: it can show sensitive data
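A rough Nginx sketch for the first point, assuming the containers listen only on localhost (e.g. "127.0.0.1:3000:3000") and you create the password file with htpasswd; the domain is a placeholder:
server {
    listen 80;                         # add TLS (e.g. via certbot) when ready
    server_name grafana.example.com;   # placeholder domain

    auth_basic           "Monitoring";
    auth_basic_user_file /etc/nginx/.htpasswd;   # htpasswd -c /etc/nginx/.htpasswd admin

    location / {
        proxy_pass http://127.0.0.1:3000;        # Grafana (it also has its own login);
        proxy_set_header Host $host;             # repeat a server block like this for
        proxy_set_header X-Real-IP $remote_addr; # Prometheus (9090) and Alertmanager (9093)
    }
}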
Advanced features
Remote Write (long-term storage)
Prometheus stores metrics locally. For long-term storage (years), use remote_write to ship metrics to one of these backends (config sketch after the list):
- Thanos — open source, S3-backed storage
- Cortex — multi-tenant Prometheus
- Grafana Cloud — managed (free up to 10k series)
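On the Prometheus side this is just a remote_write block in prometheus.yml; the URL and credentials below are placeholders for whatever push endpoint your chosen backend gives you:
remote_write:
  - url: "https://your-remote-storage/api/v1/write"   # placeholder push endpoint
    basic_auth:
      username: "your_username"
      password: "your_api_key"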
ServiceMonitor (for Kubernetes)
If you have Kubernetes, use Prometheus Operator + ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: myapp
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
interval: 30s
Distributed Tracing (Tempo)
For microservices add Grafana Tempo:
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo.yaml:/etc/tempo.yaml
ports:
- "3200:3200" # Tempo UI
- "4317:4317" # OTLP gRPC
Integrate with OpenTelemetry SDK in your app.
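A minimal Python sketch of that integration, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and that Tempo's OTLP receiver is enabled in tempo.yaml:
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# send traces to Tempo's OTLP gRPC port from the compose snippet above
provider = TracerProvider(resource=Resource.create({"service.name": "my-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout"):
    pass  # your business logic here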
Total cost of ownership
Real numbers from my practice:
Self-hosted stack cost (per month):
- VPS 4GB RAM (Hetzner CX31): $7
- Storage 50GB (if you need more): $5
- Setup time: 4-8 hours (one-time)
- Maintenance time: 1-2 hours per month
Total: $12-15/month + 2 hours of time
SaaS alternatives cost:
- Datadog: $100-500/month (depends on volume)
- New Relic: $99-749/month
- Grafana Cloud: $0-299/month (free tier up to 10k series)
ROI: 3-month savings: $300-1500. Annual: $1200-6000.
But the main savings is the revenue you don't lose to downtime. One hour of downtime can cost $1000-10000 depending on the project.
Monitoring pays for itself with the first incident it prevents. For me, that happened in the first week.
Conclusions
What we did:
- Set up Prometheus + Grafana + Loki in an hour
- Configured app and infrastructure metrics
- Created alerts with Telegram notifications
- Built dashboards for monitoring and troubleshooting
- Learned to investigate issues
What's next:
- Add metrics from all critical components: database, cache, queues, external APIs
- Configure alerts for your SLA: if you have 99.9% uptime, downtime > 43 minutes/month is critical
- Create a runbook for each alert: document "what to do if..."
- Train the team: everyone should be able to read metrics and logs
- Automate: add monitoring to CI/CD so metrics appear automatically
Main lesson:
Monitoring is not optional, it's a necessity. The sooner you set it up, the more nerves, money, and customers you'll save.
Don't wait for production to crash at 3 AM. Set up monitoring right now.
P.S. Have questions about setting up monitoring for your project? Write in the comments or email me — I'll help you figure it out.



