OpenTelemetry Collector: Architecture and Configuration

Problem: sending traces directly from the application

Before the Collector:

┌─────────────┐
│ Application │ ──Jaeger──> Jaeger (port 14268)
│             │ ──Zipkin──> Zipkin (port 9411)
│             │ ──OTLP────> Tempo (port 4317)
└─────────────┘

Problems:

Tight coupling: the application knows about Jaeger, Zipkin, and Tempo
No centralized processing: batching and filtering are reimplemented in every application
Hard migration: moving from Jaeger to Tempo requires code changes
No tail-based sampling: sampling decisions can only be made on the client side
Overhead on the application: export calls to every backend compete for application threads and resources

Solution: OpenTelemetry Collector

┌─────────────┐                ┌──────────────────┐
│ Application │ ──OTLP──>      │ OTel Collector   │
│             │                │                  │
│  (any       │                │ • Batching       │ ──> Jaeger
│   language) │                │ • Filtering      │ ──> Tempo
│             │                │ • Tail-sampling  │ ──> Zipkin
└─────────────┘                │ • Enrichment     │ ──> Datadog
                               └──────────────────┘

Advantages:

The application sends a single format (OTLP)
Centralized processing
Easy migration between backends
Tail-based sampling (lesson 09)
Work is offloaded from the application
OpenTelemetry Collector architecture

┌─────────────────────────────────────────────────────────┐
│               OpenTelemetry Collector                   │
│                                                         │
│  ┌──────────┐   ┌────────────┐   ┌──────────┐         │
│  │Receivers │──>│ Processors │──>│Exporters │         │
│  └──────────┘   └────────────┘   └──────────┘         │
│       │              │                  │              │
│   OTLP, Jaeger   Batch, Filter    Jaeger, Tempo       │
│   Zipkin, etc    Attributes, etc   Zipkin, etc        │
└─────────────────────────────────────────────────────────┘

1. Receivers (incoming data)

Purpose: accept telemetry from applications.

Popular receivers:

otlp: OpenTelemetry Protocol (gRPC/HTTP)
jaeger: Jaeger formats (gRPC and Thrift)
zipkin: Zipkin JSON/Protobuf
prometheus: Prometheus metrics (scraped from targets)

Example configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
 
  zipkin:
    endpoint: 0.0.0.0:9411

2. Processors (processing)

Purpose: modify, filter, and enrich data.

batch: batching spans

Why: reduce the number of export requests by sending spans in groups instead of one at a time.

processors:
  batch:
    timeout: 10s # Send whatever has accumulated after at most 10s
    send_batch_size: 1000 # Or as soon as 1000 spans have accumulated

Result:

Before: 1000 export requests (one span each) → Jaeger
After:  1 export request (1000 spans) → Jaeger
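
The two flush triggers of the batch processor (size and timeout) can be sketched as a small model. This is an illustration under simplifying assumptions, not Collector code: `Batcher`, `add`, and `tick` are hypothetical names, and time is passed in explicitly so the behaviour is deterministic.

```typescript
// Simplified model of the batch processor's two flush triggers (size and timeout).
// Illustrative only: Batcher, add, and tick are hypothetical names, not Collector APIs.
type Span = { traceId: string; name: string };

class Batcher {
  readonly flushed: Span[][] = []; // every batch that would have been exported
  private buffer: Span[] = [];
  private firstItemAt: number | null = null;
  private readonly sendBatchSize: number;
  private readonly timeoutMs: number;

  constructor(sendBatchSize: number, timeoutMs: number) {
    this.sendBatchSize = sendBatchSize;
    this.timeoutMs = timeoutMs;
  }

  // Add one span; flush as soon as the size threshold is reached.
  add(span: Span, nowMs: number): void {
    if (this.buffer.length === 0) this.firstItemAt = nowMs;
    this.buffer.push(span);
    if (this.buffer.length >= this.sendBatchSize) this.flush();
  }

  // Called periodically; flushes a partial batch once the timeout has elapsed.
  tick(nowMs: number): void {
    if (this.firstItemAt !== null && nowMs - this.firstItemAt >= this.timeoutMs) {
      this.flush();
    }
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    this.flushed.push(this.buffer);
    this.buffer = [];
    this.firstItemAt = null;
  }
}

// 2500 spans arrive at t=0 with send_batch_size=1000 and timeout=10s.
const batcher = new Batcher(1000, 10000);
for (let i = 0; i < 2500; i++) {
  batcher.add({ traceId: "trace-" + i, name: "GET /orders" }, 0);
}
batcher.tick(10000); // the timeout flushes the remaining 500 spans
console.log(batcher.flushed.map((batch) => batch.length)); // batch sizes: 1000, 1000, 500
```

The size trigger bounds how large a request gets, while the timeout bounds how long a span can wait when traffic is low.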

filter: filtering spans

Why: avoid exporting health checks and other internal requests.

processors:
  filter:
    traces:
      span: # A span matching ANY of these conditions is dropped
        - attributes["http.route"] == "/health"
        - attributes["http.route"] == "/metrics"
        - attributes["http.route"] == "/readiness"

Result: health checks never reach Jaeger, which saves storage.
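
The drop semantics can be sketched in a few lines: a span is removed when any condition matches, and spans without the attribute pass through. This is a simplified model, not the processor's implementation; `routeEquals` and `filterSpans` are hypothetical helper names.

```typescript
// Simplified model of the filter processor: a span is dropped when ANY condition matches.
// routeEquals and filterSpans are hypothetical helpers, not Collector APIs.
type Span = { name: string; attributes: Record<string, string> };
type Condition = (span: Span) => boolean;

const routeEquals = (route: string): Condition =>
  (span) => span.attributes["http.route"] === route;

const dropConditions: Condition[] = [
  routeEquals("/health"),
  routeEquals("/metrics"),
  routeEquals("/readiness"),
];

function filterSpans(spans: Span[], conditions: Condition[]): Span[] {
  return spans.filter((span) => !conditions.some((matches) => matches(span)));
}

const kept = filterSpans(
  [
    { name: "GET /health", attributes: { "http.route": "/health" } },
    { name: "POST /orders", attributes: { "http.route": "/orders" } },
    { name: "internal job", attributes: {} }, // no http.route: nothing matches, so it is kept
  ],
  dropConditions,
);
console.log(kept.map((s) => s.name)); // "POST /orders" and "internal job" remain
```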

attributes: adding and changing attributes

Why: enrichment (add region, environment, version) and removal of sensitive fields.

processors:
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: insert # Add only if the key is absent
      - key: cloud.region
        value: us-east-1
        action: upsert # Add, or overwrite an existing value
      - key: sensitive_data # Remove PII
        action: delete

Result:

// Before
{
  "trace_id": "a1b2c3d4",
  "attributes": {
    "http.method": "POST"
  }
}

// After
{
  "trace_id": "a1b2c3d4",
  "attributes": {
    "http.method": "POST",
    "deployment.environment": "production",
    "cloud.region": "us-east-1"
  }
}
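
The difference between insert, upsert, and delete is easiest to see on a span that already carries some of the keys. The sketch below models only these three actions; `applyActions` is a hypothetical name, and the initial `cloud.region` and `sensitive_data` values are made up for the illustration.

```typescript
// Simplified model of three attribute actions; applyActions is a hypothetical helper.
type Attributes = Record<string, string>;
type Action =
  | { key: string; action: "insert" | "upsert"; value: string }
  | { key: string; action: "delete" };

function applyActions(attrs: Attributes, actions: Action[]): Attributes {
  const result: Attributes = { ...attrs };
  for (const a of actions) {
    switch (a.action) {
      case "insert": // add only when the key is absent
        if (!(a.key in result)) result[a.key] = a.value;
        break;
      case "upsert": // add, or overwrite an existing value
        result[a.key] = a.value;
        break;
      case "delete": // remove the key if present
        delete result[a.key];
        break;
    }
  }
  return result;
}

const enriched = applyActions(
  { "http.method": "POST", "cloud.region": "eu-west-1", sensitive_data: "secret" },
  [
    { key: "deployment.environment", value: "production", action: "insert" },
    { key: "cloud.region", value: "us-east-1", action: "upsert" },
    { key: "sensitive_data", action: "delete" },
  ],
);
console.log(enriched);
```

Here `cloud.region` is overwritten because of upsert, whereas an insert on an existing key would have left the old value in place.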

resource: adding resource attributes

Why: add metadata about the service itself (Kubernetes pod, host).

processors:
  resource:
    attributes:
      - key: service.namespace
        value: ecommerce
        action: upsert
      - key: k8s.cluster.name
        value: prod-cluster-1
        action: insert

tail_sampling: advanced sampling

Why: sample based on the complete trace (errors, latency).

processors:
  tail_sampling:
    decision_wait: 10s # Wait 10s from the first span of a trace before deciding
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-1-percent
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

Details in lesson 09.
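
A rough model of how such policies combine: the trace is kept as soon as any policy matches. The sketch below is a simplification under stated assumptions, not the processor's code. `shouldSample` and `bucketOf` are hypothetical names, the hash is a stand-in for the processor's own trace-ID hashing, and treating a duration equal to the threshold as "slow" is a modelling choice.

```typescript
// Simplified model of tail-sampling policy evaluation over one complete trace.
// shouldSample and bucketOf are hypothetical names, not Collector APIs.
type Span = { traceId: string; startMs: number; endMs: number; status: "OK" | "ERROR" | "UNSET" };

type Policy =
  | { name: string; type: "status_code"; statusCodes: Span["status"][] }
  | { name: string; type: "latency"; thresholdMs: number }
  | { name: string; type: "probabilistic"; samplingPercentage: number };

// Deterministic stand-in for a hash-based probabilistic decision: maps a trace ID to 0..99.
function bucketOf(traceId: string): number {
  let h = 0;
  for (let i = 0; i < traceId.length; i++) {
    h = (h * 31 + traceId.charCodeAt(i)) >>> 0;
  }
  return h % 100;
}

// Returns the name of the first matching policy, or null when the trace is dropped.
function shouldSample(trace: Span[], policies: Policy[]): string | null {
  if (trace.length === 0) return null;
  // Trace duration: earliest span start to latest span end.
  const start = Math.min(...trace.map((s) => s.startMs));
  const end = Math.max(...trace.map((s) => s.endMs));
  for (const p of policies) {
    switch (p.type) {
      case "status_code": {
        const codes = p.statusCodes;
        if (trace.some((s) => codes.indexOf(s.status) !== -1)) return p.name;
        break;
      }
      case "latency":
        if (end - start >= p.thresholdMs) return p.name;
        break;
      case "probabilistic":
        if (bucketOf(trace[0].traceId) < p.samplingPercentage) return p.name;
        break;
    }
  }
  return null;
}

const policies: Policy[] = [
  { name: "errors", type: "status_code", statusCodes: ["ERROR"] },
  { name: "slow", type: "latency", thresholdMs: 1000 },
  { name: "sample-1-percent", type: "probabilistic", samplingPercentage: 1 },
];
const failed: Span[] = [
  { traceId: "a1", startMs: 0, endMs: 40, status: "OK" },
  { traceId: "a1", startMs: 5, endMs: 30, status: "ERROR" },
];
console.log(shouldSample(failed, policies)); // "errors"
```

The decision needs every span of the trace, which is why the processor buffers spans for `decision_wait` before evaluating the policies.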

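3. Exporters (sending data)

Purpose: send data to backends.

Popular exporters:

exporters:
  # Jaeger: recent Collector releases no longer ship the dedicated jaeger exporter.
  # Jaeger accepts OTLP natively, so an otlp exporter is used instead.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Zipkin
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans

  # Debug (console output for debugging; replaces the older logging exporter)
  debug:
    verbosity: detailed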

Full configuration: a practical example

Scenario: production setup with filtering, batching, and enrichment.

File otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # 1. Filter out health checks (first, so dropped spans cost nothing downstream)
  filter:
    traces:
      span:
        - attributes["http.route"] == "/health"
        - attributes["http.route"] == "/metrics"

  # 2. Add environment metadata
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: insert
      - key: k8s.cluster.name
        value: prod-cluster-1
        action: upsert

  # 3. Remove sensitive data
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete

  # 4. Batching for throughput (last, so it batches the final data)
  batch:
    timeout: 10s
    send_batch_size: 1000

exporters:
  # Primary: Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Secondary: Jaeger (for backward compatibility), reached over OTLP
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Debugging (replaces the older logging exporter)
  debug:
    verbosity: basic

extensions:
  # Serves the health endpoint on port 13133
  health_check:
    endpoint: 0.0.0.0:13133

# Pipeline: wire everything together
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter, resource, attributes, batch]
      exporters: [otlp/tempo, otlp/jaeger, debug]

  # Monitoring of the Collector itself
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
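
How a pipeline ties the pieces together can be modelled as function composition: processors run in the order they are listed, and every exporter in the pipeline receives the processed output. The sketch is a deliberately small model with hypothetical names (`runPipeline`, `dropHealth`, `addEnvironment`), and it puts the environment on span attributes rather than on the resource to stay short.

```typescript
// Simplified model of a traces pipeline: ordered processors, then fan-out to all exporters.
// All names here are hypothetical; this is not Collector code.
type Span = { name: string; attributes: Record<string, string> };
type Processor = (spans: Span[]) => Span[];
type Exporter = { name: string; received: Span[] };

function runPipeline(input: Span[], processors: Processor[], exporters: Exporter[]): void {
  // Processors run strictly in the order they are listed.
  const output = processors.reduce((spans, step) => step(spans), input);
  // Every exporter receives the full processed output.
  for (const exporter of exporters) exporter.received.push(...output);
}

const dropHealth: Processor = (spans) =>
  spans.filter((s) => s.attributes["http.route"] !== "/health");

// "insert" semantics: an existing value wins over the default.
const addEnvironment: Processor = (spans) =>
  spans.map((s) => ({
    ...s,
    attributes: { "deployment.environment": "production", ...s.attributes },
  }));

const tempo: Exporter = { name: "otlp/tempo", received: [] };
const jaeger: Exporter = { name: "otlp/jaeger", received: [] };

runPipeline(
  [
    { name: "GET /health", attributes: { "http.route": "/health" } },
    { name: "POST /orders", attributes: { "http.route": "/orders" } },
  ],
  [dropHealth, addEnvironment],
  [tempo, jaeger],
);
console.log(tempo.received.length, jaeger.received.length); // 1 1
```

Filtering first means later steps never spend work on spans that will be discarded anyway.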

Docker Compose with the Collector

version: "3.8"
 
services:
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "8888:8888" # Prometheus metrics (Collector monitoring)
      - "13133:13133" # Health check
 
  # Tempo (backend)
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"
 
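  # Jaeger (legacy)
  jaeger:
    image: jaegertracing/all-in-one:1.53
    environment:
      - COLLECTOR_OTLP_ENABLED=true # Accept OTLP on 4317/4318 inside the Compose network
    ports:
      - "16686:16686" # Jaeger UI
      - "14250:14250"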

Start:

docker-compose up -d

Health check (requires the health_check extension to be enabled in the Collector config):

curl http://localhost:13133
# The response contains "status":"Server available"

Configuring the application: sending to the Collector

Before (sending directly to Jaeger):

import { JaegerExporter } from "@opentelemetry/exporter-jaeger";

const exporter = new JaegerExporter({
  endpoint: "http://jaeger:14268/api/traces", // Straight to Jaeger (Thrift over HTTP)
});

After (sending to the Collector):

import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const exporter = new OTLPTraceExporter({
  url: "http://otel-collector:4318/v1/traces", // Through the Collector
});

Advantages:

The backend can be changed without touching the application
Centralized processing (batching, filtering)
Work is offloaded from the application
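
Keeping the endpoint out of the code makes a backend switch a pure deployment change. The OTLP exporter specification defines standard environment variables for this, and the OpenTelemetry SDK exporters read them on their own when no URL is passed. The sketch below only illustrates the resolution rule for OTLP over HTTP; `resolveTracesUrl` is a hypothetical helper, not an SDK function.

```typescript
// Sketch of the OTLP/HTTP endpoint resolution rule from the OTLP exporter specification.
// resolveTracesUrl is a hypothetical helper; real SDK exporters apply this rule internally.
type Env = Record<string, string | undefined>;

function resolveTracesUrl(env: Env): string {
  // A signal-specific endpoint is used exactly as given.
  const perSignal = env["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"];
  if (perSignal) return perSignal;
  // Otherwise the base endpoint gets the signal path appended.
  const base = env["OTEL_EXPORTER_OTLP_ENDPOINT"] || "http://localhost:4318";
  return base.replace(/\/+$/, "") + "/v1/traces";
}

console.log(resolveTracesUrl({ OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318" }));
// http://otel-collector:4318/v1/traces
```

In a Node.js service the argument would typically be `process.env`, so pointing the application at a different Collector only requires changing the variable in the deployment manifest.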

Deployment Patterns

Pattern 1: Agent (Sidecar)

Architecture:

┌─────────────────────────────────────┐
│           Kubernetes Pod            │
│                                     │
│  ┌────────────┐  ┌───────────────┐ │
│  │Application │─>│ OTel Collector│ │
│  │ Container  │  │   (sidecar)   │ │
│  └────────────┘  └────────┬──────┘ │
└──────────────────────────│──────────┘
                           │
                           ▼
                   ┌───────────────┐
                   │ Jaeger/Tempo  │
                   └───────────────┘

Pros:

Low latency (localhost)
Failure isolation (one Collector crashing does not affect the others)
Automatic scaling (scales together with the application)

Cons:

High overhead (one Collector per pod)
Duplicated resources
Centralized configuration is harder

When to use: low-latency requirements, strict isolation.

Pattern 2: Gateway (Centralized)

Architecture:

┌──────────┐  ┌──────────┐  ┌──────────┐
│  App 1   │  │  App 2   │  │  App 3   │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     └─────────────┴─────────────┘
                   │
                   ▼
         ┌───────────────────┐
         │  OTel Collector   │
         │    (Gateway)      │
         │  3 replicas       │
         └─────────┬─────────┘
                   │
                   ▼
           ┌───────────────┐
           │ Jaeger/Tempo  │
           └───────────────┘

Pros:

Centralized configuration
Fewer resources (one shared Collector pool for all applications)
Easier monitoring and debugging

Cons:

Single point of failure (needs an HA setup)
Extra network hop (additional latency)
Can become a bottleneck under high load

When to use: cost optimization, centralized control.

Pattern 3: Hybrid (Agent + Gateway)

Architecture:

┌──────────────────────────────────────────────┐
│              Kubernetes Pods                 │
│                                              │
│  ┌─────┐   ┌─────────────┐                  │
│  │ App │──>│ OTel Agent  │                  │
│  └─────┘   │  (sidecar)  │                  │
│            │  • Batching │                  │
│            └──────┬──────┘                  │
└───────────────────│──────────────────────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │   OTel Gateway       │
         │  • Tail-sampling     │
         │  • Enrichment        │
         │  • Routing           │
         └──────────┬───────────┘
                    │
         ┌──────────┴──────────┐
         ▼                     ▼
    ┌────────┐          ┌──────────┐
    │ Tempo  │          │  Jaeger  │
    └────────┘          └──────────┘

Pros:

Best of both worlds
Agent: batching, filtering (lightweight processing)
Gateway: tail-sampling, routing (heavy processing)

Cons:

More complex architecture
More components to monitor

When to use: large-scale production (tail-sampling, multiple backends).

Kubernetes Deployment

Agent (DaemonSet: one Collector per node, a lighter alternative to one sidecar per pod):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/conf/otel-agent-config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
          ports:
            - containerPort: 4317 # OTLP gRPC
              hostPort: 4317 # Lets pods on the same node reach the agent via the node IP
            - containerPort: 4318 # OTLP HTTP
              hostPort: 4318
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
            requests:
              memory: 256Mi
              cpu: 200m
      volumes:
        - name: config
          configMap:
            name: otel-agent-config

Gateway (Deployment):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
spec:
  replicas: 3 # HA setup
  selector:
    matchLabels:
      app: otel-gateway
  template:
    metadata:
      labels:
        app: otel-gateway
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/conf/otel-gateway-config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
          ports:
            - containerPort: 4317
            - containerPort: 8888 # Metrics
          resources:
            limits:
              memory: 2Gi
              cpu: 1000m
---
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway
spec:
  selector:
    app: otel-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: metrics
      port: 8888
      targetPort: 8888
  type: ClusterIP

Monitoring the OpenTelemetry Collector

Problem: the Collector is a critical component, so it needs monitoring of its own.

Solution: the Collector exposes its own metrics in Prometheus format.

Collector metrics

# otel-collector-config.yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888 # Prometheus endpoint

Key metrics (exact names and suffixes such as _total vary between Collector versions):

# Accepted spans
otelcol_receiver_accepted_spans{receiver="otlp"} 150000

# Sent spans
otelcol_exporter_sent_spans{exporter="jaeger"} 148000

# Dropped spans (a problem!)
otelcol_processor_dropped_spans{processor="batch"} 2000

# Batch size distribution (histogram)
otelcol_processor_batch_batch_send_size_bucket
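
A quick way to read the accepted and sent counters together is the share of spans that never left the Collector. The sketch below uses the sample values from the listing above; `lossRatio` is a hypothetical helper. Spans removed on purpose by filtering or sampling also show up in this gap, so the ratio is a starting point for investigation rather than a drop count.

```typescript
// Share of accepted spans that were not exported; lossRatio is a hypothetical helper.
type Counters = { accepted: number; sent: number };

function lossRatio(c: Counters): number {
  if (c.accepted === 0) return 0; // nothing received yet, nothing lost
  return (c.accepted - c.sent) / c.accepted;
}

const ratio = lossRatio({ accepted: 150000, sent: 148000 });
console.log((ratio * 100).toFixed(2) + "%"); // 1.33%
```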

Prometheus scrape config:

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8888"]

Grafana Dashboard:

Import ID: 15983 (OpenTelemetry Collector Dashboard)

Troubleshooting Collector

Problem 1: spans do not reach the backend

Debugging:

Check the health endpoint:

curl http://otel-collector:13133

Enable the debug exporter (the successor of the logging exporter):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      exporters: [debug, otlp/jaeger] # Keep your existing backend exporter in the list

Check the metrics:

# Spans dropped?
rate(otelcol_processor_dropped_spans[5m])
 
# Exporter errors?
rate(otelcol_exporter_send_failed_spans[5m])

Problem 2: high memory usage

Cause: the batch size is too large, or the tail_sampling buffer holds too many traces.

Solution:

processors:
  batch:
    send_batch_size: 500 # Reduce (was 5000)
    timeout: 5s # Send more often

  tail_sampling:
    decision_wait: 5s # Reduce (was 30s)
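
Why a shorter decision_wait helps can be seen from a back-of-the-envelope estimate: the tail-sampling buffer holds roughly everything that arrives during the wait window. The throughput (10,000 spans/s) and average span size (1 KiB) below are assumptions chosen for illustration, and the formula ignores per-trace bookkeeping overhead.

```typescript
// Rough estimate of tail-sampling buffer size; all input numbers are illustrative assumptions.
function tailSamplingBufferBytes(
  spansPerSecond: number,
  decisionWaitSeconds: number,
  avgSpanBytes: number,
): number {
  return spansPerSecond * decisionWaitSeconds * avgSpanBytes;
}

const MIB = 1024 * 1024;
const with30s = tailSamplingBufferBytes(10000, 30, 1024);
const with5s = tailSamplingBufferBytes(10000, 5, 1024);
console.log(Math.round(with30s / MIB), Math.round(with5s / MIB)); // 293 49
```

Under these assumptions, cutting the wait from 30s to 5s shrinks the buffer about sixfold, at the cost of deciding before slow traces have finished.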

Problem 3: high latency

Cause: with a long batch timeout, spans wait in the batch processor before they are exported, so they appear in the backend late.

Solution: reduce the batch timeout:

processors:
  batch:
    timeout: 1s # Send more often (was 10s)

Best Practices

1. Always use the batch processor

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000

Savings: typically 10-100x fewer export requests, depending on traffic and batch size.

2. Filter out unimportant spans at the Collector

processors:
  filter:
    traces:
      span:
        - attributes["http.route"] == "/health"

Storage savings: roughly 20-30% of traces in services where health checks are frequent; measure it for your own traffic.

3. Use the resource processor for metadata

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: ${env:ENVIRONMENT} # From an environment variable
        action: insert

4. Monitor the Collector's metrics

Alert rules:

# Too many dropped spans
- alert: CollectorDroppingSpans
  expr: rate(otelcol_processor_dropped_spans[5m]) > 100
  annotations:
    summary: Collector dropping spans
 
# Exporter failing
- alert: CollectorExporterFailing
  expr: rate(otelcol_exporter_send_failed_spans[5m]) > 10

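5. HA setup for the Gateway

replicas: 3 # At least 3 for HA

Load balancing: a Kubernetes Service spreads new connections across the replicas. OTLP/gRPC connections are long-lived, so a single client can stay pinned to one replica. Tail-based sampling additionally needs every span of a trace to reach the same replica, which is what the loadbalancing exporter in collector-contrib (routing by trace ID) is for.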
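
When tail-based sampling runs on several gateway replicas, all spans of one trace have to land on the same replica, otherwise no replica sees the complete trace. The idea of trace-aware routing can be sketched as hashing the trace ID onto the replica list. This is a simplified illustration with hypothetical names (`hashTraceId`, `pickReplica`, the replica addresses). The loadbalancing exporter in collector-contrib uses consistent hashing rather than a plain modulo, so fewer traces move when the replica set changes.

```typescript
// Simplified trace-aware routing: the same trace ID always maps to the same replica.
// hashTraceId, pickReplica, and the replica addresses are hypothetical.
function hashTraceId(traceId: string): number {
  let h = 0;
  for (let i = 0; i < traceId.length; i++) {
    h = (h * 31 + traceId.charCodeAt(i)) >>> 0; // keep the value in unsigned 32-bit range
  }
  return h;
}

function pickReplica(traceId: string, replicas: string[]): string {
  return replicas[hashTraceId(traceId) % replicas.length];
}

const replicas = ["otel-gateway-0:4317", "otel-gateway-1:4317", "otel-gateway-2:4317"];
const first = pickReplica("a1b2c3d4", replicas);
const second = pickReplica("a1b2c3d4", replicas);
console.log(first === second); // true: both spans of the trace go to the same replica
```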

Practical exercise

Task: configure an OTel Collector with filtering and enrichment.

Requirements:

Accept OTLP from the application
Filter out the /health and /metrics endpoints
Add deployment.environment=production
Export to Tempo and Jaeger at the same time
Enable the debug exporter (formerly logging) for debugging

Expected result:

Application → Collector (filter, enrich) → Tempo + Jaeger

Next lesson

In the next lesson we will look at Tail-based Sampling: advanced sampling based on the complete trace (errors, latency, custom rules).

You can now manage tracing centrally in production with the OpenTelemetry Collector!