
Monitoring Stack 2025: Prometheus + Grafana + Loki on a Budget

Константин Потапов
18 min

How to set up production-grade monitoring in one evening without a DevOps team. Metrics, logs, alerts, and dashboards for projects on a single server or small cluster.


"Our production went down at 3 AM. We learned about it at 9 AM from an angry customer email."

Sound familiar? I have dozens of stories like this. In 2023, I consulted for a startup with $2M ARR — they had zero alerts. Monitoring = SSH to the server and tail -f the logs. When the database went down because the disk filled up, they lost 6 hours of operations and $15k in revenue.

Spoiler: You can set up proper monitoring in one evening. No Kubernetes, no DataDog at $3000/month, no dedicated DevOps team.

In this article — a minimalist stack for real projects: Prometheus for metrics, Grafana for visualization, Loki for logs. Everything runs on a single server, works for years, and saves your ass at 3 AM.

Why monitoring is not a luxury

Three real cases from my practice:

Case 1: Fintech startup (2024) Client base was growing, API was getting slower. The team knew "something was wrong" but didn't know what. Set up Prometheus in one evening — turns out one endpoint was making 300+ SQL queries per HTTP request. Fixed the N+1 problem, response time dropped from 2.5s to 120ms. The fix paid for itself within a week.

Case 2: E-commerce on Django (2025) Production was "sluggish" every 2-3 days, no pattern. SSH + htop showed nothing. Added Grafana with memory metrics — found that a Celery worker was eating 16GB RAM and OOM-killer was murdering it. Culprit — memory leak in image processing. Fix took 2 hours, downtime stopped.

Case 3: B2B SaaS (2023) Customers complained about "weird slowdowns". Metrics showed a latency spike every 6 hours. Loki helped find the pattern: backup script on the same server was running pg_dump without nice, CPU utilization hit 100%, app was choking. Solution — move backups to a separate machine.

90% of production problems are invisible without monitoring. You won't know about them until you get an angry email from a customer or lose money.

What is observability in simple terms

Observability — the ability to understand what's happening inside a system by looking at its external signals.

Three pillars of observability:

  1. Metrics — numerical indicators: CPU, memory, RPS, latency, error rate
  2. Logs — events: "user registered", "database error", "deployment started"
  3. Traces — request path through the system (not covered in this article, that's for microservices)

Metaphor: imagine a car.

  • Metrics — speedometer, tachometer, engine temperature
  • Logs — onboard computer with a record "engine overheated at 2:35 PM"
  • Traces — dashcam showing the entire route

Without monitoring, you're driving with your eyes closed and learning about breakdowns when the car stops.

Stack choice: why Prometheus + Grafana + Loki

There are dozens of monitoring tools. I've tried Zabbix, Nagios, ELK Stack, Datadog, New Relic. For 90% of projects, the choice is obvious: Prometheus + Grafana + Loki.

Why this stack:

  • Open source and free — no licenses, vendor lock-in, or $5k/month bills
  • Production-grade — used by the largest companies (Google, Uber, GitLab)
  • Lightweight — runs on a single server with 2GB RAM
  • Integrations out of the box — exporters for everything: PostgreSQL, Redis, Nginx, Node.js
  • Active development — updates every month, huge community
  • Simplicity — set up in an hour, doesn't require a PhD in DevOps

Comparison with alternatives:

| Criterion | Prometheus Stack | ELK Stack | Datadog/New Relic | Zabbix |
|---|---|---|---|---|
| Cost | ✅ Free | ✅ Free | ❌ $100-5000/mo | ✅ Free |
| Setup simplicity | ✅ 1-2 hours | ⚠️ 4-8 hours | ✅ 30 minutes | ⚠️ 2-4 hours |
| Resources (RAM) | ✅ 1-2GB | ❌ 8-16GB | ☁️ SaaS | ⚠️ 2-4GB |
| Metrics | ✅ Excellent | ⚠️ Not the focus | ✅ Excellent | ✅ Good |
| Logs | ✅ Loki | ✅ Excellent (ES) | ✅ Excellent | ⚠️ Basic |
| Alerts | ✅ Alertmanager | ⚠️ Complex | ✅ Excellent | ✅ Good |
| Dashboards | ✅ Grafana | ⚠️ Kibana | ✅ Beautiful | ⚠️ Built-in UI |
| Community & Ecosystem | ✅ Huge | ✅ Large | ⚠️ Vendor lock-in | ⚠️ Outdated |

My choice:

  • For startups and small projects (up to 10 servers) — Prometheus Stack
  • For enterprise with compliance — ELK Stack (requires Elasticsearch for audit)
  • For corporations with money — Datadog/New Relic (if budget exists and you're lazy)
  • For legacy systems — Zabbix (if it's already there, don't touch it)

Stack architecture on a budget

Here's what we'll be running:

┌─────────────────────────────────────────────────┐
│  Your server (2-4GB RAM)                        │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌─────────────────┐  ┌──────────────┐          │
│  │ Your app        │  │ PostgreSQL   │          │
│  │ (FastAPI/Django)│  │ / Redis      │          │
│  └────┬────────────┘  └────┬─────────┘          │
│       │                    │                    │
│       │ metrics            │ metrics            │
│       │ + logs             │ (exporter)         │
│       ▼                    ▼                    │
│  ┌─────────────────────────────────┐            │
│  │ Prometheus                      │            │
│  │ (collects metrics every 15s)    │            │
│  └───────────┬─────────────────────┘            │
│              │                                  │
│              │ query                            │
│              ▼                                  │
│  ┌─────────────────────────────────┐            │
│  │ Grafana                         │            │
│  │ (visualization + alerts)        │            │
│  └───────────┬─────────────────────┘            │
│              │ query                            │
│  ┌───────────▼─────────────────────┐            │
│  │ Loki                            │            │
│  │ (stores logs)                   │            │
│  └─────────────────────────────────┘            │
│                                                 │
└─────────────────────────────────────────────────┘

Components:

  1. Prometheus — time-series database for metrics. Polls your app and exporters via HTTP, collects metrics every 15-30 seconds.
  2. Grafana — web interface for dashboards and alerts. Connects to Prometheus and Loki, draws graphs.
  3. Loki — log aggregator. Your app (or Promtail) ships logs to it, Loki indexes the labels and stores the raw lines, Grafana displays them.
  4. Exporters — applications that export metrics in Prometheus format:
    • node_exporter — server metrics (CPU, RAM, disk, network)
    • postgres_exporter — PostgreSQL metrics
    • redis_exporter — Redis metrics
    • nginx_exporter — Nginx metrics

Server requirements:

  • Minimum: 2GB RAM, 2 CPU cores, 20GB disk
  • Recommended: 4GB RAM, 2 CPU cores, 50GB disk
  • OS: Ubuntu 22.04/24.04, Debian 12, or any Linux with Docker

I've run this stack even on a $5/month VPS (Hetzner CX21). For small projects, it's more than enough.

Practice: setting up the stack in an hour

Let's get to work. I assume you have an Ubuntu/Debian server with Docker.

Step 1: Create docker-compose.yml

Create a directory for monitoring:

mkdir -p /opt/monitoring
cd /opt/monitoring

Create docker-compose.yml:

services:
  # Prometheus — collects metrics
  prometheus:
    image: prom/prometheus:v3.0.0
    container_name: prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d" # Store metrics for 30 days
      - "--web.enable-lifecycle" # API for hot-reload config
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    restart: unless-stopped
    networks:
      - monitoring
 
  # Grafana — visualization
  grafana:
    image: grafana/grafana:11.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password # CHANGE THIS!
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped
    networks:
      - monitoring
 
  # Loki — logs
  loki:
    image: grafana/loki:3.3.2
    container_name: loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki
    restart: unless-stopped
    networks:
      - monitoring
 
  # Node Exporter — server metrics
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    command:
      - "--path.rootfs=/host"
    ports:
      - "9100:9100"
    volumes:
      - /:/host:ro,rslave
    restart: unless-stopped
    networks:
      - monitoring
 
volumes:
  prometheus_data:
  grafana_data:
  loki_data:
 
networks:
  monitoring:
    driver: bridge

Step 2: Prometheus configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s # Collect metrics every 15 seconds
  evaluation_interval: 15s # Check alert rules every 15 seconds
 
# Alerts (we'll create later)
rule_files:
  - "/etc/prometheus/alerts.yml"
 
# Where to collect metrics from
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
 
  # Server metrics (CPU, RAM, Disk)
  - job_name: "node"
    static_configs:
      - targets: ["node_exporter:9100"]
 
  # Your application (FastAPI, Django, etc.)
  # Uncomment and specify your app address
  # - job_name: 'app'
  #   static_configs:
  #     - targets: ['app:8000']
 
  # PostgreSQL (if using postgres_exporter)
  # - job_name: 'postgres'
  #   static_configs:
  #     - targets: ['postgres_exporter:9187']
 
  # Redis (if using redis_exporter)
  # - job_name: 'redis'
  #   static_configs:
  #     - targets: ['redis_exporter:9121']

Create an empty alerts file, prometheus/alerts.yml (we'll fill it in later):

groups:
  - name: basic_alerts
    interval: 30s
    rules: []

Step 3: Launch the stack

docker compose up -d

Check that everything started:

docker compose ps

You should see 4 containers with status Up:

  • prometheus
  • grafana
  • loki
  • node_exporter

Check availability (with the default port mappings above):

  • Prometheus: http://your-server-ip:9090
  • Grafana: http://your-server-ip:3000
  • Loki: http://your-server-ip:3100/ready (should return "ready")
  • Node Exporter: http://your-server-ip:9100/metrics

Congratulations! The basic stack is running. Now let's configure dashboards and alerts.

Step 4: Grafana setup

  1. Open Grafana: http://your-server-ip:3000
  2. Log in (admin / your password from docker-compose)
  3. Add Data Source:
    • Connections → Add data source → Prometheus
    • URL: http://prometheus:9090
    • Save & Test — should show "Data source is working"
  4. Repeat for Loki:
    • Add data source → Loki
    • URL: http://loki:3100
    • Save & Test

(The ./grafana/provisioning mount from docker-compose.yml is optional: data sources added through the UI are stored in the grafana_data volume. If you prefer configuration as code, you can drop provisioning YAML files into that directory instead.)

Step 5: Import ready-made dashboard

No need to draw a dashboard from scratch — use a ready one.

  1. In Grafana: Dashboards → Import
  2. Enter dashboard ID: 1860 (Node Exporter Full)
  3. Click Load
  4. Select Prometheus data source
  5. Import

Voilà! You now have a dashboard with server metrics: CPU, RAM, Disk I/O, Network.

Other useful dashboards:

  • PostgreSQL: ID 9628
  • Redis: ID 11835
  • Nginx: ID 12708
  • Docker: ID 893

Adding metrics from your application

Now the most important part — metrics from your application.

Python (FastAPI / Django)

Install the library:

pip install prometheus-client

FastAPI:

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
 
app = FastAPI()
 
# Metrics
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
 
REQUEST_DURATION = Histogram(
    'app_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
 
@app.middleware("http")
async def prometheus_middleware(request, call_next):
    method = request.method
    endpoint = request.url.path
 
    with REQUEST_DURATION.labels(method, endpoint).time():
        response = await call_next(request)
 
    REQUEST_COUNT.labels(method, endpoint, response.status_code).inc()
    return response
 
# Endpoint for Prometheus
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Now metrics are available at http://your-app:8000/metrics.
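
The Application Health dashboard later in this article references an active_users_gauge metric. Request counters and histograms come "for free" from the middleware above; business-level metrics like that you define yourself with a Gauge. A minimal sketch (the metric name and the login/logout hooks are illustrative):

from prometheus_client import Gauge

# Hypothetical business metric; the name matches the dashboard example later on
ACTIVE_USERS = Gauge("active_users_gauge", "Number of currently active user sessions")

# Update it wherever your session logic lives:
def on_login():
    ACTIVE_USERS.inc()  # one more active session

def on_logout():
    ACTIVE_USERS.dec()  # one fewer active session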

Django (with django-prometheus):

pip install django-prometheus

# settings.py
INSTALLED_APPS = [
    'django_prometheus',
    # ...
]
 
MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # ... other middleware
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]
 
# urls.py
from django.urls import include, path

urlpatterns = [
    path('', include('django_prometheus.urls')),
    # ...
]

Metrics at http://your-app:8000/metrics.

Node.js (Express)

npm install prom-client

const express = require("express");
const client = require("prom-client");
 
const app = express();
 
// Create registry
const register = new client.Registry();
 
// Collect default metrics (CPU, memory, event loop)
client.collectDefaultMetrics({ register });
 
// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});
 
// Middleware for metrics
app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});
 
// Endpoint for Prometheus
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
 
app.listen(3000);

Add application to Prometheus

Edit prometheus/prometheus.yml:

scrape_configs:
  # ... existing jobs
 
  - job_name: "myapp"
    static_configs:
      - targets: ["host.docker.internal:8000"] # Your application

Note: on Linux, host.docker.internal only resolves inside the container if you add extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service. If the app runs in the same Docker Compose project, just use the service name instead:

- job_name: "myapp"
  static_configs:
    - targets: ["myapp:8000"]

Restart Prometheus:

docker compose restart prometheus

Check in Prometheus UI: Status → Targets — your app should be in UP status.
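
If you'd rather verify this from a script (for example in CI), Prometheus exposes the same information over its HTTP API. A small sketch in Python, assuming the requests package and that port 9090 is reachable from where you run it:

import requests

# Ask Prometheus which targets it is currently scraping
resp = requests.get("http://your-server-ip:9090/api/v1/targets", timeout=5)
resp.raise_for_status()

for target in resp.json()["data"]["activeTargets"]:
    # "health" is "up", "down", or "unknown"
    print(target["labels"]["job"], target["scrapeUrl"], target["health"])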

Setting up logs with Loki

Loki collects logs from your application.

Option 1: Promtail — an agent that tails log files and ships them to Loki.

Add to docker-compose.yml:

promtail:
  image: grafana/promtail:3.3.2
  container_name: promtail
  volumes:
    - /var/log:/var/log:ro # System logs
    - ./promtail/config.yml:/etc/promtail/config.yml
    - ./logs:/app/logs:ro # Your app logs
  command: -config.file=/etc/promtail/config.yml
  restart: unless-stopped
  networks:
    - monitoring

Create promtail/config.yml:

server:
  http_listen_port: 9080
  grpc_listen_port: 0
 
positions:
  filename: /tmp/positions.yaml
 
clients:
  - url: http://loki:3100/loki/api/v1/push
 
scrape_configs:
  # Your application logs
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /app/logs/*.log
 
  # System logs (optional)
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog

Restart the stack:

docker compose up -d

Option 2: Logging directly from the application

Python (python-logging-loki):

pip install python-logging-loki

import logging
from logging_loki import LokiHandler
 
logger = logging.getLogger("my-app")
logger.setLevel(logging.INFO)
 
loki_handler = LokiHandler(
    url="http://loki:3100/loki/api/v1/push",
    tags={"application": "my-app", "environment": "production"},
    version="1",
)
 
logger.addHandler(loki_handler)
 
logger.info("Application started")
logger.error("Something went wrong", extra={"user_id": 123})
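
The LogQL examples below use the | json parser, which expects each log line to be JSON. The handler above sends whatever your formatter produces, so if you want fields like user_id to be queryable, you can attach a JSON formatter to it. A minimal sketch with no extra dependencies (the field names are illustrative):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so LogQL's `| json` can parse it."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={...} end up as attributes on the record
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)

loki_handler.setFormatter(JsonFormatter())
logger.error("Something went wrong", extra={"user_id": 123})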

Viewing logs in Grafana

  1. Open Grafana
  2. Explore → select Loki data source
  3. Query: {job="app"}
  4. Click Run query

You'll see your application logs in real-time.

Useful Loki queries (LogQL):

# All app logs
{job="app"}
 
# Errors only
{job="app"} |= "ERROR"
 
# Logs for specific user
{job="app"} | json | user_id="123"
 
# Error rate for last 5 minutes
rate({job="app"} |= "ERROR" [5m])

Alerts: so you don't sleep through production failure

Alerts are the most important part of monitoring. Let's set up alerting in 3 steps.

Step 1: Add alert rules

Edit prometheus/alerts.yml:

groups:
  - name: critical_alerts
    interval: 30s
    rules:
      # Server unreachable
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute."
 
      # CPU above 80%
      - alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes (current: {{ $value }}%)"
 
      # RAM above 90%
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current: {{ $value }}%)"
 
      # Disk above 85%
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} is {{ $value }}% full"
 
      # High error rate (>5% requests with errors)
      - alert: HighErrorRate
        expr: sum by (job) (rate(app_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(app_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
 
      # Slow requests (p95 latency > 1s)
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests in {{ $labels.job }}"
          description: "95th percentile latency is {{ $value }}s"

Reload Prometheus config:

curl -X POST http://localhost:9090/-/reload

Check alerts: http://your-server-ip:9090/alerts
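
You can also check from a script that the rules were actually loaded after the reload. A quick sketch against the Prometheus HTTP API (assumes the requests package):

import requests

# List the alert rules Prometheus has loaded
resp = requests.get("http://your-server-ip:9090/api/v1/rules", timeout=5)
resp.raise_for_status()

for group in resp.json()["data"]["groups"]:
    for rule in group["rules"]:
        # For alerting rules, "state" is "inactive", "pending", or "firing"
        print(group["name"], rule["name"], rule.get("state"))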

Step 2: Alertmanager setup (notifications)

Alertmanager sends notifications to Telegram, Slack, email, etc.

Add to docker-compose.yml:

alertmanager:
  image: prom/alertmanager:v0.27.0
  container_name: alertmanager
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager/config.yml:/etc/alertmanager/config.yml
    - alertmanager_data:/alertmanager
  command:
    - "--config.file=/etc/alertmanager/config.yml"
  restart: unless-stopped
  networks:
    - monitoring
 
volumes:
  # ... existing volumes
  alertmanager_data:

Edit prometheus/prometheus.yml:

# Add at the beginning of the file
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Create alertmanager/config.yml:

global:
  resolve_timeout: 5m
 
route:
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "telegram"
 
receivers:
  # Telegram (recommended)
  - name: "telegram"
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN" # Get from @BotFather
        chat_id: YOUR_CHAT_ID # Your chat_id
        parse_mode: "HTML"
        message: |
          <b>{{ .Status | toUpper }}</b>
          {{ range .Alerts }}
          <b>Alert:</b> {{ .Labels.alertname }}
          <b>Severity:</b> {{ .Labels.severity }}
          <b>Summary:</b> {{ .Annotations.summary }}
          <b>Description:</b> {{ .Annotations.description }}
          {{ end }}
 
  # Slack (alternative)
  # - name: 'slack'
  #   slack_configs:
  #     - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
  #       channel: '#alerts'
  #       title: 'Alert: {{ .GroupLabels.alertname }}'
  #       text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
 
  # Email (alternative)
  # - name: 'email'
  #   email_configs:
  #     - to: 'your-email@example.com'
  #       from: 'alerts@yourapp.com'
  #       smarthost: 'smtp.gmail.com:587'
  #       auth_username: 'your-email@gmail.com'
  #       auth_password: 'your-app-password'

Restart the stack:

docker compose up -d

How to get Telegram bot token and chat_id:

  1. Create bot: message @BotFather → /newbot → follow instructions → get bot_token
  2. Get chat_id: message the bot /start, then open https://api.telegram.org/bot<bot_token>/getUpdates → find "chat":{"id":123456789}
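
Before wiring this into Alertmanager, it's worth confirming the token and chat_id actually deliver a message. A quick sketch using the Bot API directly (fill in your own values; assumes the requests package):

import requests

BOT_TOKEN = "YOUR_BOT_TOKEN"  # from @BotFather
CHAT_ID = "YOUR_CHAT_ID"      # from getUpdates

resp = requests.post(
    f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
    json={"chat_id": CHAT_ID, "text": "Test message from the monitoring stack"},
    timeout=10,
)
print(resp.status_code, resp.json())  # expect 200 and {"ok": true, ...}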

Step 3: Test alerts

Let's create artificial load for testing:

# Load CPU
yes > /dev/null &
yes > /dev/null &
yes > /dev/null &
yes > /dev/null &
 
# After 5 minutes HighCPU alert should trigger
# Check: http://your-server-ip:9090/alerts
 
# Stop the load:
killall yes

You should receive a notification in Telegram within 5-6 minutes.

Alerts are working! Now you'll know about problems before customers do.

Dashboards for real life

Ready-made dashboards are good, but for production you need custom ones.

"Application Health" dashboard

Create a new dashboard in Grafana:

Panels:

  1. RPS (Requests Per Second)

    • Query: sum(rate(app_requests_total[1m]))
    • Visualization: Graph
  2. Error Rate (%)

    • Query: (sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m]))) * 100
    • Visualization: Graph
    • Threshold: warning at 1%, critical at 5%
  3. Latency (p50, p95, p99)

    • Query:
      histogram_quantile(0.50, rate(app_request_duration_seconds_bucket[5m]))  # p50
      histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m]))  # p95
      histogram_quantile(0.99, rate(app_request_duration_seconds_bucket[5m]))  # p99
    • Visualization: Graph
  4. Active Users (if you have this metric)

    • Query: active_users_gauge
    • Visualization: Stat
  5. Top 5 Slowest Endpoints

    • Query: topk(5, histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])))
    • Visualization: Table

"Infrastructure" dashboard

  1. CPU Usage

    • Query: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  2. Memory Usage

    • Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
  3. Disk I/O

    • Query: rate(node_disk_read_bytes_total[5m]) and rate(node_disk_written_bytes_total[5m])
  4. Network Traffic

    • Query: rate(node_network_receive_bytes_total[5m]) and rate(node_network_transmit_bytes_total[5m])

Troubleshooting: how to investigate issues

You got an alert at 3 AM. What to do?

Scenario 1: HighCPU alert

  1. Grafana → Infrastructure dashboard → look at CPU graph
  2. Correlate: check RPS at the same time — if CPU grew without a matching traffic spike, the code is to blame; if traffic spiked too, you probably just need to scale
  3. Logs in Loki: {job="app"} | json | line_format "{{.endpoint}} {{.duration}}" — look for slow endpoints
  4. Prometheus: topk(5, rate(app_request_duration_seconds_sum[5m])) — top slow requests
  5. Fix: optimize code or scale

Scenario 2: HighMemory alert

  1. Grafana → Infrastructure → Memory Usage
  2. Prometheus: process_resident_memory_bytes — check app memory consumption
  3. Logs: {job="app"} |~ "OutOfMemory|MemoryError"
  4. Hypothesis: memory leak? Check the code
  5. Temporary fix: docker compose restart app
  6. Long-term fix: profile the code (memory_profiler, py-spy)

Scenario 3: HighErrorRate alert

  1. Grafana → Application Health → Error Rate graph
  2. Prometheus: rate(app_requests_total{status=~"5.."}[5m]) — which endpoints?
  3. Loki: {job="app"} |~ "ERROR|Exception" — read stack traces
  4. Root cause: database unavailable? API down? Timeout?
  5. Fix: depends on the cause

Scenario 4: SlowRequests alert

  1. Prometheus: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) — the result is already broken down by endpoint
  2. Loki: find specific slow requests with parameters
  3. Database: check pg_stat_statements — maybe slow query?
  4. Fix: add indexes, cache, optimize

Pro tip: Create a runbook for each alert. A document with investigation steps and typical solutions. Will save hours at 3 AM.

Best practices: what I learned over the years

1. Retention Policy

By default, Prometheus stores metrics for 15 days. That's not enough. Set 30-90 days:

command:
  - "--storage.tsdb.retention.time=90d"

For Loki (retention is enforced by the compactor, so its compactor section also needs retention_enabled: true):

# loki-config.yaml
limits_config:
  retention_period: 30d

2. Don't monitor everything

Metrics cost memory and disk. Monitor only what helps make decisions:

  • Monitor: RPS, error rate, latency, CPU, RAM, disk
  • Don't monitor: clicks on every button (that's for analytics, not observability)

3. Alert hygiene

  • Group alerts: don't send 50 notifications per minute, group to 1 every 5 minutes
  • Severity levels: critical → phone call, warning → Telegram, info → logs only
  • Mute when deploying: disable alerts during deployment, otherwise you'll get false positives (see the sketch after this list)
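
For the "mute when deploying" point, Alertmanager has a silences API you can call from a deploy script. A sketch assuming the default v2 API on port 9093 (the matcher and the 15-minute window are illustrative):

from datetime import datetime, timedelta, timezone

import requests

now = datetime.now(timezone.utc)
silence = {
    # Mute all warning-level alerts for the next 15 minutes
    "matchers": [{"name": "severity", "value": "warning", "isRegex": False}],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=15)).isoformat(),
    "createdBy": "deploy-script",
    "comment": "Deployment in progress",
}

resp = requests.post("http://your-server-ip:9093/api/v2/silences", json=silence, timeout=5)
print(resp.json())  # returns {"silenceID": "..."} on success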

4. Backup configs

Store configs in Git:

git init
git add docker-compose.yml prometheus/ grafana/ alertmanager/
git commit -m "Initial monitoring setup"
git remote add origin git@github.com:yourname/monitoring-config.git
git push

5. Security

  • Don't expose ports publicly: use Nginx reverse proxy with auth
  • Change Grafana default password
  • Restrict Prometheus access: it can show sensitive data

Advanced features

Remote Write (long-term storage)

Prometheus stores metrics locally. For long-term storage (years) use remote write to:

  • Thanos — open source, S3-backed storage
  • Cortex — multi-tenant Prometheus
  • Grafana Cloud — managed (free up to 10k series)

ServiceMonitor (for Kubernetes)

If you have Kubernetes, use Prometheus Operator + ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s

Distributed Tracing (Tempo)

For microservices add Grafana Tempo:

tempo:
  image: grafana/tempo:latest
  command: ["-config.file=/etc/tempo.yaml"]
  volumes:
    - ./tempo.yaml:/etc/tempo.yaml
  ports:
    - "3200:3200" # Tempo UI
    - "4317:4317" # OTLP gRPC

Instrument your app with the OpenTelemetry SDK and point the exporter at Tempo.
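
A minimal Python sketch of that integration, exporting spans over OTLP gRPC to the Tempo port above (assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages, and the tempo service name from the compose snippet):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to Tempo's OTLP gRPC port (4317 in the compose snippet above)
provider = TracerProvider(resource=Resource.create({"service.name": "myapp"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order"):
    ...  # your business logic; nested spans show up as a single trace in Grafana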

Total cost of ownership

Real numbers from my practice:

Self-hosted stack cost (per month):

  • VPS 4GB RAM (Hetzner CX31): $7
  • Storage 50GB (if you need more): $5
  • Setup time: 4-8 hours (one-time)
  • Maintenance time: 1-2 hours per month

Total: $12-15/month + 2 hours of time

SaaS alternatives cost:

  • Datadog: $100-500/month (depends on volume)
  • New Relic: $99-749/month
  • Grafana Cloud: $0-299/month (free tier up to 10k series)

ROI: savings of $300-1500 per quarter, $1200-6000 per year compared to SaaS.

But the main savings is the revenue you don't lose to downtime. One hour of downtime can cost $1000-10000 depending on the project.

Monitoring pays for itself with the first incident it prevents. For me, that happened in the first week.

Conclusions

What we did:

  • Set up Prometheus + Grafana + Loki in an hour
  • Configured app and infrastructure metrics
  • Created alerts with Telegram notifications
  • Built dashboards for monitoring and troubleshooting
  • Learned to investigate issues

What's next:

  1. Add metrics from all critical components: database, cache, queues, external APIs
  2. Configure alerts for your SLA: if you have 99.9% uptime, downtime > 43 minutes/month is critical
  3. Create a runbook for each alert: document "what to do if..."
  4. Train the team: everyone should be able to read metrics and logs
  5. Automate: add monitoring to CI/CD so metrics appear automatically

Main lesson:

Monitoring is not optional, it's a necessity. The sooner you set it up, the more nerves, money, and customers you'll save.

Don't wait for production to crash at 3 AM. Set up monitoring right now.


P.S. Have questions about setting up monitoring for your project? Write in the comments or email me — I'll help you figure it out.