
Monitoring Stack 2025: Prometheus + Grafana + Loki on a Budget

Константин Потапов
18 min

How to set up production-grade monitoring in one evening without a DevOps team. Metrics, logs, alerts, and dashboards for projects on a single server or small cluster.


"Our production went down at 3 AM. We learned about it at 9 AM from an angry customer email."

Sound familiar? I have dozens of stories like this. In 2023, I consulted for a startup with $2M ARR — they had zero alerts. Monitoring = SSH to the server and tail -f the logs. When the database went down because the disk filled up, they lost 6 hours of operations and $15k in revenue.

Spoiler: You can set up proper monitoring in one evening. No Kubernetes, no DataDog at $3000/month, no dedicated DevOps team.

In this article — a minimalist stack for real projects: Prometheus for metrics, Grafana for visualization, Loki for logs. Everything runs on a single server, works for years, and saves your ass at 3 AM.

Why monitoring is not a luxury

Three real cases from my practice:

Case 1: Fintech startup (2024) Client base was growing, API was getting slower. The team knew "something was wrong" but didn't know what. Set up Prometheus in one evening — turns out one endpoint was making 300+ SQL queries per HTTP request. Fixed the N+1 problem, response time dropped from 2.5s to 120ms. The fix paid for itself within a week.

Case 2: E-commerce on Django (2025) Production was "sluggish" every 2-3 days, no pattern. SSH + htop showed nothing. Added Grafana with memory metrics — found that a Celery worker was eating 16GB RAM and OOM-killer was murdering it. Culprit — memory leak in image processing. Fix took 2 hours, downtime stopped.

Case 3: B2B SaaS (2023) Customers complained about "weird slowdowns". Metrics showed a latency spike every 6 hours. Loki helped find the pattern: backup script on the same server was running pg_dump without nice, CPU utilization hit 100%, app was choking. Solution — move backups to a separate machine.

90% of production problems are invisible without monitoring. You won't know about them until you get an angry email from a customer or lose money.

What is observability in simple terms

Observability — the ability to understand what's happening inside a system by looking at its external signals.

Three pillars of observability:

  1. Metrics — numerical indicators: CPU, memory, RPS, latency, error rate
  2. Logs — events: "user registered", "database error", "deployment started"
  3. Traces — request path through the system (not covered in this article, that's for microservices)

Metaphor: imagine a car.

  • Metrics — speedometer, tachometer, engine temperature
  • Logs — onboard computer with a record "engine overheated at 2:35 PM"
  • Traces — dashcam showing the entire route

Without monitoring, you're driving with your eyes closed and learning about breakdowns when the car stops.

Stack choice: why Prometheus + Grafana + Loki

There are dozens of monitoring tools. I've tried Zabbix, Nagios, ELK Stack, Datadog, New Relic. For 90% of projects, the choice is obvious: Prometheus + Grafana + Loki.

Why this stack:

  • Open source and free — no licenses, vendor lock-in, or $5k/month bills
  • Production-grade — used by the largest companies (Google, Uber, GitLab)
  • Lightweight — runs on a single server with 2GB RAM
  • Integrations out of the box — exporters for everything: PostgreSQL, Redis, Nginx, Node.js
  • Active development — updates every month, huge community
  • Simplicity — set up in an hour, doesn't require a PhD in DevOps

Comparison with alternatives:

| Criterion | Prometheus Stack | ELK Stack | Datadog/New Relic | Zabbix |
|---|---|---|---|---|
| Cost | ✅ Free | ✅ Free | ❌ $100-5000/mo | ✅ Free |
| Setup simplicity | ✅ 1-2 hours | ⚠️ 4-8 hours | ✅ 30 minutes | ⚠️ 2-4 hours |
| Resources (RAM) | ✅ 1-2GB | ❌ 8-16GB | ☁️ SaaS | ⚠️ 2-4GB |
| Metrics | ✅ Excellent | ⚠️ Not the focus | ✅ Excellent | ✅ Good |
| Logs | ✅ Loki | ✅ Excellent (ES) | ✅ Excellent | ⚠️ Basic |
| Alerts | ✅ Alertmanager | ⚠️ Complex | ✅ Excellent | ✅ Good |
| Dashboards | ✅ Grafana | ⚠️ Kibana | ✅ Beautiful | ⚠️ Built-in UI |
| Community & Ecosystem | ✅ Huge | ✅ Large | ⚠️ Vendor lock-in | ⚠️ Outdated |

My choice:

  • For startups and small projects (up to 10 servers) — Prometheus Stack
  • For enterprise with compliance — ELK Stack (requires Elasticsearch for audit)
  • For corporations with money — Datadog/New Relic (if budget exists and you're lazy)
  • For legacy systems — Zabbix (if it's already there, don't touch it)

Stack architecture on a budget

Here's what we'll be running:

┌─────────────────────────────────────────────────┐
│  Your server (2-4GB RAM)                        │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌─────────────────┐  ┌──────────────┐          │
│  │ Your app        │  │ PostgreSQL   │          │
│  │ (FastAPI/Django)│  │ / Redis      │          │
│  └────┬────────────┘  └────┬─────────┘          │
│       │                    │                    │
│       │ metrics            │ metrics            │
│       │ + logs             │ (exporter)         │
│       ▼                    ▼                    │
│  ┌─────────────────────────────────┐            │
│  │ Prometheus                      │            │
│  │ (collects metrics every 15s)    │            │
│  └───────────┬─────────────────────┘            │
│              │                                  │
│              │ query                            │
│              ▼                                  │
│  ┌─────────────────────────────────┐            │
│  │ Grafana                         │            │
│  │ (visualization + alerts)        │            │
│  └───────────┬─────────────────────┘            │
│              │ query                            │
│  ┌───────────▼─────────────────────┐            │
│  │ Loki                            │            │
│  │ (stores logs)                   │            │
│  └─────────────────────────────────┘            │
│                                                 │
└─────────────────────────────────────────────────┘

Components:

  1. Prometheus — time-series database for metrics. Polls your app and exporters via HTTP, collects metrics every 15-30 seconds.
  2. Grafana — web interface for dashboards and alerts. Connects to Prometheus and Loki, draws graphs.
  3. Loki — log aggregator. Your app (or Promtail) ships logs to it, Loki indexes the labels and stores the raw lines, Grafana displays them.
  4. Exporters — applications that export metrics in Prometheus format:
    • node_exporter — server metrics (CPU, RAM, disk, network)
    • postgres_exporter — PostgreSQL metrics
    • redis_exporter — Redis metrics
    • nginx_exporter — Nginx metrics

Server requirements:

  • Minimum: 2GB RAM, 2 CPU cores, 20GB disk
  • Recommended: 4GB RAM, 2 CPU cores, 50GB disk
  • OS: Ubuntu 22.04/24.04, Debian 12, or any Linux with Docker

I've run this stack even on a $5/month VPS (Hetzner CX21). For small projects, it's more than enough.

Practice: setting up the stack in an hour

Let's get to work. I assume you have an Ubuntu/Debian server with Docker.

Step 1: Create docker-compose.yml

Create a directory for monitoring:

mkdir -p /opt/monitoring
cd /opt/monitoring

Create docker-compose.yml:

services:
  # Prometheus — collects metrics
  prometheus:
    image: prom/prometheus:v3.0.0
    container_name: prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d" # Store metrics for 30 days
      - "--web.enable-lifecycle" # API for hot-reload config
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    restart: unless-stopped
    networks:
      - monitoring
 
  # Grafana — visualization
  grafana:
    image: grafana/grafana:11.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password # CHANGE THIS!
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped
    networks:
      - monitoring
 
  # Loki — logs
  loki:
    image: grafana/loki:3.3.2
    container_name: loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki
    restart: unless-stopped
    networks:
      - monitoring
 
  # Node Exporter — server metrics
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    command:
      - "--path.rootfs=/host"
    ports:
      - "9100:9100"
    volumes:
      - /:/host:ro,rslave
    restart: unless-stopped
    networks:
      - monitoring
 
volumes:
  prometheus_data:
  grafana_data:
  loki_data:
 
networks:
  monitoring:
    driver: bridge

Step 2: Prometheus configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s # Collect metrics every 15 seconds
  evaluation_interval: 15s # Check alert rules every 15 seconds
 
# Alerts (we'll create later)
rule_files:
  - "/etc/prometheus/alerts.yml"
 
# Where to collect metrics from
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
 
  # Server metrics (CPU, RAM, Disk)
  - job_name: "node"
    static_configs:
      - targets: ["node_exporter:9100"]
 
  # Your application (FastAPI, Django, etc.)
  # Uncomment and specify your app address
  # - job_name: 'app'
  #   static_configs:
  #     - targets: ['app:8000']
 
  # PostgreSQL (if using postgres_exporter)
  # - job_name: 'postgres'
  #   static_configs:
  #     - targets: ['postgres_exporter:9187']
 
  # Redis (if using redis_exporter)
  # - job_name: 'redis'
  #   static_configs:
  #     - targets: ['redis_exporter:9121']

Create an empty alerts file, prometheus/alerts.yml (we'll fill it in later):

groups:
  - name: basic_alerts
    interval: 30s
    rules: []

Step 3: Launch the stack

docker compose up -d

Check that everything started:

docker compose ps

You should see 4 containers with status Up:

  • prometheus
  • grafana
  • loki
  • node_exporter

Check availability (with the default port mappings above):

  • Prometheus: http://your-server-ip:9090
  • Grafana: http://your-server-ip:3000
  • Loki: http://your-server-ip:3100/ready (should return "ready")
  • Node Exporter: http://your-server-ip:9100/metrics

Congratulations! The basic stack is running. Now let's configure dashboards and alerts.

Step 4: Grafana setup

  1. Open Grafana: http://your-server-ip:3000
  2. Log in (admin / your password from docker-compose)
  3. Add Data Source:
    • Connections → Add data source → Prometheus
    • URL: http://prometheus:9090
    • Save & Test — should show "Data source is working"
  4. Repeat for Loki:
    • Add data source → Loki
    • URL: http://loki:3100
    • Save & Test

(The ./grafana/provisioning mount from docker-compose.yml is optional: data sources added through the UI are stored in the grafana_data volume. If you prefer configuration as code, you can drop provisioning YAML files into that directory instead.)

Step 5: Import ready-made dashboard

No need to draw a dashboard from scratch — use a ready one.

  1. In Grafana: Dashboards → Import
  2. Enter dashboard ID: 1860 (Node Exporter Full)
  3. Click Load
  4. Select Prometheus data source
  5. Import

Voilà! You now have a dashboard with server metrics: CPU, RAM, Disk I/O, Network.

Other useful dashboards:

  • PostgreSQL: ID 9628
  • Redis: ID 11835
  • Nginx: ID 12708
  • Docker: ID 893

Adding metrics from your application

Now the most important part — metrics from your application.

Python (FastAPI / Django)

Install the library:

pip install prometheus-client

FastAPI:

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
 
app = FastAPI()
 
# Metrics
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
 
REQUEST_DURATION = Histogram(
    'app_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
 
@app.middleware("http")
async def prometheus_middleware(request, call_next):
    method = request.method
    endpoint = request.url.path
 
    with REQUEST_DURATION.labels(method, endpoint).time():
        response = await call_next(request)
 
    REQUEST_COUNT.labels(method, endpoint, response.status_code).inc()
    return response
 
# Endpoint for Prometheus
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Now metrics are available at http://your-app:8000/metrics.
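
The Application Health dashboard later in this article references an active_users_gauge metric. Request counters and histograms come "for free" from the middleware above; business-level metrics like that you define yourself with a Gauge. A minimal sketch (the metric name and the login/logout hooks are illustrative):

from prometheus_client import Gauge

# Hypothetical business metric; the name matches the dashboard example later on
ACTIVE_USERS = Gauge("active_users_gauge", "Number of currently active user sessions")

# Update it wherever your session logic lives:
def on_login():
    ACTIVE_USERS.inc()  # one more active session

def on_logout():
    ACTIVE_USERS.dec()  # one fewer active session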

Django (with django-prometheus):

pip install django-prometheus

# settings.py
INSTALLED_APPS = [
    'django_prometheus',
    # ...
]
 
MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # ... other middleware
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]
 
# urls.py
from django.urls import include, path

urlpatterns = [
    path('', include('django_prometheus.urls')),
    # ...
]

Metrics at http://your-app:8000/metrics.

Node.js (Express)

npm install prom-client

const express = require("express");
const client = require("prom-client");
 
const app = express();
 
// Create registry
const register = new client.Registry();
 
// Collect default metrics (CPU, memory, event loop)
client.collectDefaultMetrics({ register });
 
// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});
 
// Middleware for metrics
app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});
 
// Endpoint for Prometheus
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
 
app.listen(3000);

Add application to Prometheus

Edit prometheus/prometheus.yml:

scrape_configs:
  # ... existing jobs
 
  - job_name: "myapp"
    static_configs:
      - targets: ["host.docker.internal:8000"] # Your application

Note: on Linux, host.docker.internal only resolves inside the container if you add extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service. If the app runs in the same Docker Compose project, just use the service name instead:

- job_name: "myapp"
  static_configs:
    - targets: ["myapp:8000"]

Restart Prometheus:

docker compose restart prometheus

Check in Prometheus UI: Status → Targets — your app should be in UP status.
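
If you'd rather verify this from a script (for example in CI), Prometheus exposes the same information over its HTTP API. A small sketch in Python, assuming the requests package and that port 9090 is reachable from where you run it:

import requests

# Ask Prometheus which targets it is currently scraping
resp = requests.get("http://your-server-ip:9090/api/v1/targets", timeout=5)
resp.raise_for_status()

for target in resp.json()["data"]["activeTargets"]:
    # "health" is "up", "down", or "unknown"
    print(target["labels"]["job"], target["scrapeUrl"], target["health"])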

Setting up logs with Loki

Loki collects logs from your application.

Option 1: Promtail — an agent that tails log files and ships them to Loki.

Add to docker-compose.yml:

promtail:
  image: grafana/promtail:3.3.2
  container_name: promtail
  volumes:
    - /var/log:/var/log:ro # System logs
    - ./promtail/config.yml:/etc/promtail/config.yml
    - ./logs:/app/logs:ro # Your app logs
  command: -config.file=/etc/promtail/config.yml
  restart: unless-stopped
  networks:
    - monitoring

Create promtail/config.yml:

server:
  http_listen_port: 9080
  grpc_listen_port: 0
 
positions:
  filename: /tmp/positions.yaml
 
clients:
  - url: http://loki:3100/loki/api/v1/push
 
scrape_configs:
  # Your application logs
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /app/logs/*.log
 
  # System logs (optional)
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog

Restart the stack:

docker compose up -d

Option 2: Logging directly from the application

Python (python-logging-loki):

pip install python-logging-loki

import logging
from logging_loki import LokiHandler
 
logger = logging.getLogger("my-app")
logger.setLevel(logging.INFO)
 
loki_handler = LokiHandler(
    url="http://loki:3100/loki/api/v1/push",
    tags={"application": "my-app", "environment": "production"},
    version="1",
)
 
logger.addHandler(loki_handler)
 
logger.info("Application started")
logger.error("Something went wrong", extra={"user_id": 123})
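
The LogQL examples below use the | json parser, which expects each log line to be JSON. The handler above sends whatever your formatter produces, so if you want fields like user_id to be queryable, you can attach a JSON formatter to it. A minimal sketch with no extra dependencies (the field names are illustrative):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so LogQL's `| json` can parse it."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={...} end up as attributes on the record
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)

loki_handler.setFormatter(JsonFormatter())
logger.error("Something went wrong", extra={"user_id": 123})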

Viewing logs in Grafana

  1. Open Grafana
  2. Explore → select Loki data source
  3. Query: {job="app"}
  4. Click Run query

You'll see your application logs in real-time.

Useful Loki queries (LogQL):

# All app logs
{job="app"}
 
# Errors only
{job="app"} |= "ERROR"
 
# Logs for specific user
{job="app"} | json | user_id="123"
 
# Error rate for last 5 minutes
rate({job="app"} |= "ERROR" [5m])

Alerts: so you don't sleep through production failure

Alerts are the most important part of monitoring. Let's set up alerting in 3 steps.

Step 1: Add alert rules

Edit prometheus/alerts.yml:

groups:
  - name: critical_alerts
    interval: 30s
    rules:
      # Server unreachable
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute."
 
      # CPU above 80%
      - alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes (current: {{ $value }}%)"
 
      # RAM above 90%
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current: {{ $value }}%)"
 
      # Disk above 85%
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} is {{ $value }}% full"
 
      # High error rate (>5% requests with errors)
      - alert: HighErrorRate
        expr: sum by (job) (rate(app_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(app_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
 
      # Slow requests (p95 latency > 1s)
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests in {{ $labels.job }}"
          description: "95th percentile latency is {{ $value }}s"

Reload Prometheus config:

curl -X POST http://localhost:9090/-/reload

Check alerts: http://your-server-ip:9090/alerts
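
You can also check from a script that the rules were actually loaded after the reload. A quick sketch against the Prometheus HTTP API (assumes the requests package):

import requests

# List the alert rules Prometheus has loaded
resp = requests.get("http://your-server-ip:9090/api/v1/rules", timeout=5)
resp.raise_for_status()

for group in resp.json()["data"]["groups"]:
    for rule in group["rules"]:
        # For alerting rules, "state" is "inactive", "pending", or "firing"
        print(group["name"], rule["name"], rule.get("state"))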

Step 2: Alertmanager setup (notifications)

Alertmanager sends notifications to Telegram, Slack, email, etc.

Add to docker-compose.yml:

alertmanager:
  image: prom/alertmanager:v0.27.0
  container_name: alertmanager
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager/config.yml:/etc/alertmanager/config.yml
    - alertmanager_data:/alertmanager
  command:
    - "--config.file=/etc/alertmanager/config.yml"
  restart: unless-stopped
  networks:
    - monitoring
 
volumes:
  # ... existing volumes
  alertmanager_data:

Edit prometheus/prometheus.yml:

# Add at the beginning of the file
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Create alertmanager/config.yml:

global:
  resolve_timeout: 5m
 
route:
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "telegram"
 
receivers:
  # Telegram (recommended)
  - name: "telegram"
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN" # Get from @BotFather
        chat_id: YOUR_CHAT_ID # Your chat_id
        parse_mode: "HTML"
        message: |
          <b>{{ .Status | toUpper }}</b>
          {{ range .Alerts }}
          <b>Alert:</b> {{ .Labels.alertname }}
          <b>Severity:</b> {{ .Labels.severity }}
          <b>Summary:</b> {{ .Annotations.summary }}
          <b>Description:</b> {{ .Annotations.description }}
          {{ end }}
 
  # Slack (alternative)
  # - name: 'slack'
  #   slack_configs:
  #     - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
  #       channel: '#alerts'
  #       title: 'Alert: {{ .GroupLabels.alertname }}'
  #       text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
 
  # Email (alternative)
  # - name: 'email'
  #   email_configs:
  #     - to: 'your-email@example.com'
  #       from: 'alerts@yourapp.com'
  #       smarthost: 'smtp.gmail.com:587'
  #       auth_username: 'your-email@gmail.com'
  #       auth_password: 'your-app-password'

Restart the stack:

docker compose up -d

How to get Telegram bot token and chat_id:

  1. Create bot: message @BotFather → /newbot → follow instructions → get bot_token
  2. Get chat_id: message the bot /start, then open https://api.telegram.org/bot<bot_token>/getUpdates → find "chat":{"id":123456789}
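
Before wiring this into Alertmanager, it's worth confirming the token and chat_id actually deliver a message. A quick sketch using the Bot API directly (fill in your own values; assumes the requests package):

import requests

BOT_TOKEN = "YOUR_BOT_TOKEN"  # from @BotFather
CHAT_ID = "YOUR_CHAT_ID"      # from getUpdates

resp = requests.post(
    f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
    json={"chat_id": CHAT_ID, "text": "Test message from the monitoring stack"},
    timeout=10,
)
print(resp.status_code, resp.json())  # expect 200 and {"ok": true, ...}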

Step 3: Test alerts

Let's create artificial load for testing:

# Load CPU
yes > /dev/null &
yes > /dev/null &
yes > /dev/null &
yes > /dev/null &
 
# After 5 minutes HighCPU alert should trigger
# Check: http://your-server-ip:9090/alerts
 
# Stop the load:
killall yes

You should receive a notification in Telegram within 5-6 minutes.

Alerts are working! Now you'll know about problems before customers do.

Dashboards for real life

Ready-made dashboards are good, but for production you need custom ones.

"Application Health" dashboard

Create a new dashboard in Grafana:

Panels:

  1. RPS (Requests Per Second)

    • Query: sum(rate(app_requests_total[1m]))
    • Visualization: Graph
  2. Error Rate (%)

    • Query: (sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m]))) * 100
    • Visualization: Graph
    • Threshold: warning at 1%, critical at 5%
  3. Latency (p50, p95, p99)

    • Query:
      histogram_quantile(0.50, rate(app_request_duration_seconds_bucket[5m]))  # p50
      histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m]))  # p95
      histogram_quantile(0.99, rate(app_request_duration_seconds_bucket[5m]))  # p99
    • Visualization: Graph
  4. Active Users (if you have this metric)

    • Query: active_users_gauge
    • Visualization: Stat
  5. Top 5 Slowest Endpoints

    • Query: topk(5, histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])))
    • Visualization: Table

"Infrastructure" dashboard

  1. CPU Usage

    • Query: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  2. Memory Usage

    • Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
  3. Disk I/O

    • Query: rate(node_disk_read_bytes_total[5m]) and rate(node_disk_written_bytes_total[5m])
  4. Network Traffic

    • Query: rate(node_network_receive_bytes_total[5m]) and rate(node_network_transmit_bytes_total[5m])

Troubleshooting: how to investigate issues

You got an alert at 3 AM. What to do?

Scenario 1: HighCPU alert

  1. Grafana → Infrastructure dashboard → look at CPU graph
  2. Correlate: check RPS at the same time — if CPU grew without a matching traffic spike, the code is to blame; if traffic spiked too, you probably just need to scale
  3. Logs in Loki: {job="app"} | json | line_format "{{.endpoint}} {{.duration}}" — look for slow endpoints
  4. Prometheus: topk(5, rate(app_request_duration_seconds_sum[5m])) — top slow requests
  5. Fix: optimize code or scale

Scenario 2: HighMemory alert

  1. Grafana → Infrastructure → Memory Usage
  2. Prometheus: process_resident_memory_bytes — check app memory consumption
  3. Logs: {job="app"} |~ "OutOfMemory|MemoryError"
  4. Hypothesis: memory leak? Check the code
  5. Temporary fix: docker compose restart app
  6. Long-term fix: profile the code (memory_profiler, py-spy)

Scenario 3: HighErrorRate alert

  1. Grafana → Application Health → Error Rate graph
  2. Prometheus: rate(app_requests_total{status=~"5.."}[5m]) — which endpoints?
  3. Loki: {job="app"} |~ "ERROR|Exception" — read stack traces
  4. Root cause: database unavailable? API down? Timeout?
  5. Fix: depends on the cause

Scenario 4: SlowRequests alert

  1. Prometheus: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) — the result is already broken down by endpoint
  2. Loki: find specific slow requests with parameters
  3. Database: check pg_stat_statements — maybe slow query?
  4. Fix: add indexes, cache, optimize

Pro tip: Create a runbook for each alert. A document with investigation steps and typical solutions. Will save hours at 3 AM.

Best practices: what I learned over the years

1. Retention Policy

By default, Prometheus stores metrics for 15 days. That's not enough. Set 30-90 days:

command:
  - "--storage.tsdb.retention.time=90d"

For Loki (retention is enforced by the compactor, so its compactor section also needs retention_enabled: true):

# loki-config.yaml
limits_config:
  retention_period: 30d

2. Don't monitor everything

Metrics cost memory and disk. Monitor only what helps make decisions:

  • Monitor: RPS, error rate, latency, CPU, RAM, disk
  • Don't monitor: clicks on every button (that's for analytics, not observability)

3. Alert hygiene

  • Group alerts: don't send 50 notifications per minute, group to 1 every 5 minutes
  • Severity levels: critical → phone call, warning → Telegram, info → logs only
  • Mute when deploying: disable alerts during deployment, otherwise you'll get false positives (see the sketch after this list)
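
For the "mute when deploying" point, Alertmanager has a silences API you can call from a deploy script. A sketch assuming the default v2 API on port 9093 (the matcher and the 15-minute window are illustrative):

from datetime import datetime, timedelta, timezone

import requests

now = datetime.now(timezone.utc)
silence = {
    # Mute all warning-level alerts for the next 15 minutes
    "matchers": [{"name": "severity", "value": "warning", "isRegex": False}],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=15)).isoformat(),
    "createdBy": "deploy-script",
    "comment": "Deployment in progress",
}

resp = requests.post("http://your-server-ip:9093/api/v2/silences", json=silence, timeout=5)
print(resp.json())  # returns {"silenceID": "..."} on success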

4. Backup configs

Store configs in Git:

git init
git add docker-compose.yml prometheus/ grafana/ alertmanager/
git commit -m "Initial monitoring setup"
git remote add origin git@github.com:yourname/monitoring-config.git
git push

5. Security

  • Don't expose ports publicly: use Nginx reverse proxy with auth
  • Change Grafana default password
  • Restrict Prometheus access: it can show sensitive data

Advanced features

Remote Write (long-term storage)

Prometheus stores metrics locally. For long-term storage (years) use remote write to:

  • Thanos — open source, S3-backed storage
  • Cortex — multi-tenant Prometheus
  • Grafana Cloud — managed (free up to 10k series)

ServiceMonitor (for Kubernetes)

If you have Kubernetes, use Prometheus Operator + ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s

Distributed Tracing (Tempo)

For microservices add Grafana Tempo:

tempo:
  image: grafana/tempo:latest
  command: ["-config.file=/etc/tempo.yaml"]
  volumes:
    - ./tempo.yaml:/etc/tempo.yaml
  ports:
    - "3200:3200" # Tempo UI
    - "4317:4317" # OTLP gRPC

Instrument your app with the OpenTelemetry SDK and point the exporter at Tempo.
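
A minimal Python sketch of that integration, exporting spans over OTLP gRPC to the Tempo port above (assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages, and the tempo service name from the compose snippet):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to Tempo's OTLP gRPC port (4317 in the compose snippet above)
provider = TracerProvider(resource=Resource.create({"service.name": "myapp"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order"):
    ...  # your business logic; nested spans show up as a single trace in Grafana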

Total cost of ownership

Real numbers from my practice:

Self-hosted stack cost (per month):

  • VPS 4GB RAM (Hetzner CX31): $7
  • Storage 50GB (if you need more): $5
  • Setup time: 4-8 hours (one-time)
  • Maintenance time: 1-2 hours per month

Total: $12-15/month + 2 hours of time

SaaS alternatives cost:

  • Datadog: $100-500/month (depends on volume)
  • New Relic: $99-749/month
  • Grafana Cloud: $0-299/month (free tier up to 10k series)

ROI: savings of $300-1500 per quarter, $1200-6000 per year compared to SaaS.

But the main savings is the revenue you don't lose to downtime. One hour of downtime can cost $1000-10000 depending on the project.

Monitoring pays for itself with the first incident it prevents. For me, that happened in the first week.

Conclusions

What we did:

  • Set up Prometheus + Grafana + Loki in an hour
  • Configured app and infrastructure metrics
  • Created alerts with Telegram notifications
  • Built dashboards for monitoring and troubleshooting
  • Learned to investigate issues

What's next:

  1. Add metrics from all critical components: database, cache, queues, external APIs
  2. Configure alerts for your SLA: if you have 99.9% uptime, downtime > 43 minutes/month is critical
  3. Create a runbook for each alert: document "what to do if..."
  4. Train the team: everyone should be able to read metrics and logs
  5. Automate: add monitoring to CI/CD so metrics appear automatically

Main lesson:

Monitoring is not optional, it's a necessity. The sooner you set it up, the more nerves, money, and customers you'll save.

Don't wait for production to crash at 3 AM. Set up monitoring right now.


P.S. Have questions about setting up monitoring for your project? Write in the comments or email me — I'll help you figure it out.