Корреляция трейсов, логов и метрик

Проблема: изолированные источники данных

Реальный инцидент:

14:32 — Prometheus alert: p95 latency /checkout endpoint > 2s
14:35 — Открываем Grafana → видим spike на графике
14:37 — Ищем trace в Jaeger → находим медленный trace (trace_id: a1b2c3d4)
14:40 — Хотим посмотреть логи → открываем Kibana/Loki
14:42 — Проблема: как найти логи ИМЕННО этого запроса среди миллионов?
14:50 — Grep по timestamp ±10 секунд → находим 10 000 строк логов
15:20 — Вручную ищем нужный лог среди тысяч строк

30+ минут потрачено на корреляцию данных.

Решение: Unified Observability — связь traces, logs, metrics через trace_id.

Alert → Click → Trace → Click → Logs конкретного запроса

3 клика, 30 секунд.

Three Pillars of Observability

1. Metrics (Prometheus, Grafana)

Что показывают: Агрегированные числа (req/sec, latency, error rate).

http_requests_total{service="api", status="200"} 150000
http_request_duration_seconds{quantile="0.95"} 0.8

Плюсы:

Низкий overhead
Долгое хранение (годы)
Быстрые запросы

Минусы:

Нет деталей отдельного запроса
"Что пошло не так?" → неясно

2. Logs (Loki, Elasticsearch, Kibana)

Что показывают: Детальные события с контекстом.

2024-01-15 14:32:15 ERROR User purchase failed: insufficient balance
  user_id=123 amount=99.99 balance=50.00

Плюсы:

Полный контекст ошибки
Stack traces, переменные
Human-readable

Минусы:

Высокий объём (ТБ/день)
Дорогое хранение
Медленный поиск без индексов

3. Traces (Jaeger, Tempo, Zipkin)

Что показывают: Путь запроса через систему.

POST /checkout [850ms]
├─ Auth Service [120ms]
├─ User Service [700ms] ← медленно!
└─ Payment Service [30ms]

Плюсы:

Визуализация bottlenecks
Связь между сервисами
Latency breakdown

Минусы:

Нет деталей внутри span (переменные, условия)
Overhead при 100% sampling

Unified Observability: связываем всё

Цель: От alert в Prometheus → через trace в Jaeger → к логам в Loki за 3 клика.

Ключ: trace_id и span_id во всех трёх системах.

┌─────────────────────────────────────────────────┐
│ Prometheus Alert: p95 latency > 2s             │
│ Click "View Traces" →                          │
├─────────────────────────────────────────────────┤
│ Jaeger: Trace a1b2c3d4 [3.5s]                  │
│   ├─ User Service span: slow DB query          │
│   Click "View Logs" →                          │
├─────────────────────────────────────────────────┤
│ Loki: Logs filtered by trace_id=a1b2c3d4       │
│   ERROR: Database timeout on query SELECT...   │
│   Stack trace: ...                             │
└─────────────────────────────────────────────────┘

Добавление trace_id в логи

Structured Logging с trace_id

До (плохо):

console.log("User purchase failed, amount: 99.99");

Проблема: Невозможно найти лог конкретного запроса.

После (хорошо):

const { trace } = require("@opentelemetry/api");
 
const span = trace.getActiveSpan();
const spanContext = span.spanContext();
 
logger.info("User purchase failed", {
  trace_id: spanContext.traceId,
  span_id: spanContext.spanId,
  user_id: 123,
  amount: 99.99,
});

Результат в Loki:

{
  "timestamp": "2024-01-15T14:32:15Z",
  "level": "info",
  "message": "User purchase failed",
  "trace_id": "a1b2c3d4e5f6789012345678abcdef12",
  "span_id": "1234567890abcdef",
  "user_id": 123,
  "amount": 99.99
}

Реализация: Node.js + Winston + OpenTelemetry

Шаг 1: Установка зависимостей

npm install winston @opentelemetry/api

Шаг 2: Настройка Winston logger

const winston = require("winston");
const { trace } = require("@opentelemetry/api");
 
// Custom format: добавляем trace_id и span_id
const traceFormat = winston.format((info) => {
  const span = trace.getActiveSpan();
  if (span) {
    const spanContext = span.spanContext();
    info.trace_id = spanContext.traceId;
    info.span_id = spanContext.spanId;
    info.trace_flags = spanContext.traceFlags;
  }
  return info;
});
 
const logger = winston.createLogger({
  level: "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    traceFormat(), // ← добавляем trace context
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: "app.log" }),
  ],
});
 
module.exports = logger;

Шаг 3: Использование в коде

const logger = require("./logger");
const { trace } = require("@opentelemetry/api");
 
app.post("/users/:id/purchase", async (req, res) => {
  return tracer.startActiveSpan("processPurchase", async (span) => {
    try {
      const userId = parseInt(req.params.id);
      const { amount } = req.body;
 
      logger.info("Processing purchase", { user_id: userId, amount });
 
      // Business logic
      const user = await getUserById(userId);
 
      if (user.balance < amount) {
        logger.warn("Insufficient balance", {
          user_id: userId,
          balance: user.balance,
          required: amount,
        });
 
        span.setStatus({ code: SpanStatusCode.ERROR });
        return res.status(400).json({ error: "Insufficient balance" });
      }
 
      await createPurchase(userId, amount);
      logger.info("Purchase completed", { user_id: userId, amount });
 
      res.json({ success: true });
    } catch (error) {
      logger.error("Purchase failed", {
        error: error.message,
        stack: error.stack,
      });
 
      span.recordException(error);
      res.status(500).json({ error: "Internal error" });
    } finally {
      span.end();
    }
  });
});

Вывод логов:

{
  "timestamp": "2024-01-15T14:32:15.123Z",
  "level": "info",
  "message": "Processing purchase",
  "trace_id": "a1b2c3d4e5f6789012345678abcdef12",
  "span_id": "1234567890abcdef",
  "user_id": 123,
  "amount": 99.99
}
{
  "timestamp": "2024-01-15T14:32:15.456Z",
  "level": "warn",
  "message": "Insufficient balance",
  "trace_id": "a1b2c3d4e5f6789012345678abcdef12",
  "span_id": "1234567890abcdef",
  "user_id": 123,
  "balance": 50.0,
  "required": 99.99
}

Python: Structlog + OpenTelemetry

import structlog
from opentelemetry import trace
 
# Configure structlog с trace_id processor
def add_trace_context(logger, method_name, event_dict):
    span = trace.get_current_span()
    if span:
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, '032x')
        event_dict["span_id"] = format(ctx.span_id, '016x')
    return event_dict
 
structlog.configure(
    processors=[
        add_trace_context,  # ← добавляем trace context
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)
 
logger = structlog.get_logger()
 
# Использование
@app.post("/users/{user_id}/purchase")
async def purchase(user_id: int, amount: float):
    with tracer.start_as_current_span("processPurchase") as span:
        logger.info("processing_purchase", user_id=user_id, amount=amount)
 
        user = await get_user(user_id)
 
        if user.balance < amount:
            logger.warning("insufficient_balance",
                          user_id=user_id,
                          balance=user.balance,
                          required=amount)
            raise HTTPException(status_code=400, detail="Insufficient balance")
 
        await create_purchase(user_id, amount)
        logger.info("purchase_completed", user_id=user_id, amount=amount)

Go: Zap + OpenTelemetry

import (
    "go.uber.org/zap"
    "go.opentelemetry.io/otel/trace"
)
 
// Helper: добавляет trace context в logger
func withTraceContext(ctx context.Context, logger *zap.Logger) *zap.Logger {
    span := trace.SpanFromContext(ctx)
    if span.SpanContext().IsValid() {
        sc := span.SpanContext()
        return logger.With(
            zap.String("trace_id", sc.TraceID().String()),
            zap.String("span_id", sc.SpanID().String()),
        )
    }
    return logger
}
 
// Использование
func purchaseHandler(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "processPurchase")
    defer span.End()
 
    logger := withTraceContext(ctx, zap.L())
 
    userID := getUserID(r)
    amount := getAmount(r)
 
    logger.Info("processing purchase",
        zap.Int("user_id", userID),
        zap.Float64("amount", amount),
    )
 
    // Business logic
    user, err := getUser(ctx, userID)
    if err != nil {
        logger.Error("failed to get user", zap.Error(err))
        return
    }
 
    if user.Balance < amount {
        logger.Warn("insufficient balance",
            zap.Int("user_id", userID),
            zap.Float64("balance", user.Balance),
            zap.Float64("required", amount),
        )
        http.Error(w, "Insufficient balance", 400)
        return
    }
 
    logger.Info("purchase completed",
        zap.Int("user_id", userID),
        zap.Float64("amount", amount),
    )
}

Grafana: от Trace к Logs за один клик

Архитектура

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Prometheus  │     │    Tempo     │     │     Loki     │
│  (Metrics)   │     │   (Traces)   │     │    (Logs)    │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                     ┌──────▼───────┐
                     │   Grafana    │
                     │  (Dashboard) │
                     └──────────────┘

Шаг 1: Docker Compose с Grafana Stack

version: "3.8"
 
services:
  # Grafana Tempo (Traces)
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200" # Tempo UI
      - "4317:4317" # OTLP gRPC
 
  # Grafana Loki (Logs)
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/local-config.yaml
    ports:
      - "3100:3100"
 
  # Promtail (Log shipper)
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yaml
    command: -config.file=/etc/promtail/config.yaml
 
  # Prometheus (Metrics)
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
 
  # Grafana (UI)
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml

Шаг 2: Настройка Grafana Data Sources

Файл grafana-datasources.yaml:

apiVersion: 1
 
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki # ← связь Tempo → Loki
        tags: ["trace_id"] # ← поле для фильтрации
        mappedTags:
          - key: trace_id
            value: trace_id
      tracesToMetrics:
        datasourceUid: prometheus # ← связь Tempo → Prometheus
      serviceMap:
        datasourceUid: prometheus
 
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo # ← связь Loki → Tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"
 
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - datasourceUid: tempo # ← связь Prometheus → Tempo
          name: trace_id

Шаг 3: Использование в Grafana

1. Открываем Trace в Tempo:

Explore → Tempo
Search by trace_id: a1b2c3d4e5f6789012345678abcdef12
Видим waterfall с spans

2. Клик "Logs for this span":

Автоматически открывается Loki
Фильтр: {trace_id="a1b2c3d4e5f6789012345678abcdef12"}
Видим все логи этого запроса

3. От логов обратно к Trace:

В логах видим trace_id
Клик на trace_id → открывается Tempo с полным trace

RED Metrics из Traces

RED: Rate, Errors, Duration — ключевые метрики для каждого сервиса.

Проблема: Дублирование — отдельно инструментируем traces и metrics.

Решение: Генерируем метрики из traces автоматически.

Tempo + Prometheus Integration

tempo.yaml:

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write # ← экспорт в Prometheus
 
  traces_storage:
    path: /tmp/tempo/generator/traces
 
  # Генерируем RED метрики
  processor:
    span_metrics:
      dimensions:
        - service.name
        - http.method
        - http.status_code

Результат — автоматические метрики:

# Rate: requests per second
traces_spanmetrics_calls_total{service="api-gateway", http_method="POST", http_status_code="200"} 150
 
# Errors: error count
traces_spanmetrics_calls_total{service="api-gateway", http_status_code="500"} 5
 
# Duration: latency histogram
traces_spanmetrics_duration_seconds_bucket{service="api-gateway", le="0.1"} 100
traces_spanmetrics_duration_seconds_bucket{service="api-gateway", le="0.5"} 140
traces_spanmetrics_duration_seconds_bucket{service="api-gateway", le="1.0"} 150

Используем в Prometheus queries:

# Request rate
rate(traces_spanmetrics_calls_total[5m])
 
# Error rate
rate(traces_spanmetrics_calls_total{http_status_code=~"5.."}[5m])
/ rate(traces_spanmetrics_calls_total[5m])
 
# P95 latency
histogram_quantile(0.95, traces_spanmetrics_duration_seconds_bucket)

Exemplars: от метрики к trace

Exemplar — это пример конкретного trace, который попал в метрику.

Пример:

# Метрика: P95 latency = 2.5s
histogram_quantile(0.95, http_request_duration_seconds_bucket)
 
# Exemplar: trace_id конкретного медленного запроса
http_request_duration_seconds_bucket{le="3.0"} 145 # exemplar: trace_id=a1b2c3d4

В Grafana:

Смотрим график latency
Видим spike на 14:32
Клик на точку → видим exemplar trace_id
Клик на trace_id → открывается Tempo с деталями

Настройка Exemplars в Prometheus

Требуется Prometheus 2.26+

Включение exemplars в metrics:

const { MeterProvider } = require("@opentelemetry/sdk-metrics");
const { PrometheusExporter } = require("@opentelemetry/exporter-prometheus");
 
const exporter = new PrometheusExporter({
  exemplarsEnabled: true, // ← включаем exemplars
});
 
const meterProvider = new MeterProvider({
  readers: [exporter],
});

Практическое задание

Задача: Настройте полную интеграцию Traces → Logs → Metrics.

Требования:

Добавьте trace_id в логи (Winston/Structlog/Zap)
Настройте Docker Compose с Tempo, Loki, Prometheus, Grafana
Создайте запрос с ошибкой
В Grafana: найдите trace → перейдите к логам → найдите stack trace

Ожидаемый flow:

1. Tempo: Search trace by user_id
2. Найден trace с ошибкой (status=ERROR)
3. Click "Logs for this span"
4. Loki: автоматически открылись логи с trace_id
5. Видим ERROR log с полным stack trace

Best Practices

1. Всегда используйте structured logging (JSON)

// ❌ Плохо
console.log("User 123 purchase failed");
 
// ✅ Хорошо
logger.error("purchase_failed", {
  user_id: 123,
  trace_id: "a1b2c3d4",
  error: "insufficient_balance",
});

2. Добавляйте trace_id автоматически (middleware/processor)

3. Используйте consistent field names

trace_id (не traceId, не TraceId)
span_id (не spanId)
user_id (не userId, не user.id)

4. Логируйте критичные события на каждом span

span.addEvent("Database query started");
logger.debug("db_query_start", { query, trace_id });
 
// ... query execution
 
span.addEvent("Database query completed");
logger.info("db_query_complete", { duration_ms, rows, trace_id });

5. Храните логи и traces с одинаковым retention

Logs: 7 days
Traces: 7 days
 
(Не будет ситуации, когда trace есть, а логов уже нет)

Следующий урок

В следующем уроке мы погрузимся в OpenTelemetry Collector — центральный компонент для обработки, фильтрации и маршрутизации телеметрии в production.

Теперь вы можете за 3 клика перейти от alert к root cause! Unified Observability в действии.