Production Debugging: от алерта до root cause за 5 минут

Цель урока

Научиться отлаживать реальные production проблемы используя distributed tracing. В этом уроке - 6 сценариев из реального опыта с пошаговым анализом.

Вы научитесь:

Распознавать patterns проблем в Jaeger UI
Находить root cause за минуты вместо часов
Использовать trace spans для debugging
Коррелировать traces с logs и metrics

Scenario 1: Retry Storm Detection

🔥 Алерт

[CRITICAL] P99 latency /api/checkout: 100ms → 5000ms (+4900%)
[WARNING] Error rate: 0.1% → 15% (+14900%)
[WARNING] Request rate: 1000 req/s → 5000 req/s (+400%)
 
Duration: Last 5 minutes

📊 Что видим в Monitoring

Request rate spike - первый признак retry storm!

🔍 Trace Analysis в Jaeger

Открываем Jaeger:

Service: api-gateway
Operation: POST /api/checkout
Lookback: Last 15 minutes
Min Duration: 4000ms

Что видим:

Pattern: ОДИН span → МНОГО одинаковых child spans!

🎯 Root Cause

Click на один из error spans

Проверяем Tags:

http.status_code: 503
error: true
error.type: ServiceUnavailableError
error.message: "Payment service is down"

Проверяем Logs/Events в span:

event: retry.attempt
retry.count: 5
retry.delay: 0ms  ← ⚠️ NO BACKOFF!

Проблема: Payment service временно недоступен → API Gateway retry БЕЗ backoff → amplification (1 request = 5 retries!)

✅ Fix

// ❌ BEFORE
const response = await fetch(paymentUrl, {
  retry: {
    count: 5,
    delay: 0, // ← Проблема!
  },
});
 
// ✅ AFTER
const response = await fetch(paymentUrl, {
  retry: {
    count: 3, // Меньше retries
    delay: 1000, // 1s initial delay
    backoff: "exponential", // 1s, 2s, 4s
    maxDelay: 10000, // Max 10s
  },
});

✅ Verification в Jaeger

После fix:

Результат:

✅ Max 3 retries вместо 5
✅ Exponential backoff снижает load на payment service
✅ Request rate вернулся к норме: 1000 req/s

Scenario 2: N+1 Query Problem

🔥 Алерт

[WARNING] P99 latency /api/users/profile: 50ms → 800ms
[INFO] Database CPU: 30% → 85%

🔍 Trace Analysis

Jaeger search:

Operation: GET /api/users/profile
Tags: user.id=42

Trace visualization:

Pattern: МНОГО одинаковых DB queries подряд!

🎯 Root Cause

Click на spans: видим 50 SELECT queries для likes вместо одного!

-- ❌ N+1 queries (50 queries total)
SELECT * FROM posts WHERE user_id = 42;  -- Returns 50 posts
 
-- Then for EACH post:
SELECT * FROM likes WHERE post_id = 1;   -- 5ms
SELECT * FROM likes WHERE post_id = 2;   -- 5ms
...
SELECT * FROM likes WHERE post_id = 50;  -- 5ms
 
-- Total: 1 + 50 = 51 queries! ❌

✅ Fix

// ❌ BEFORE - N+1 queries
const posts = await db.query("SELECT * FROM posts WHERE user_id = ?", [userId]);
 
for (const post of posts) {
  post.likes = await db.query("SELECT * FROM likes WHERE post_id = ?", [
    post.id,
  ]); // ← N queries!
}
 
// ✅ AFTER - Single query with JOIN
const posts = await db.query(
  `SELECT posts.*,
          COUNT(likes.id) as likes_count
   FROM posts
   LEFT JOIN likes ON likes.post_id = posts.id
   WHERE posts.user_id = ?
   GROUP BY posts.id`,
  [userId]
);

✅ Verification

Trace после fix:

Результат: 850ms → 55ms (~94% improvement!) 🎉

Scenario 3: Slow External API

🔥 Алерт

[WARNING] P99 latency /api/send-email: 200ms → 8000ms
[INFO] Timeout errors increasing

🔍 Trace Analysis

🎯 Root Cause

Span tags показывают:

span: POST /v3/mail/send
http.url: https://api.sendgrid.com/v3/mail/send
http.status_code: 200
duration: 7850ms  ← ⚠️ Slow external API!

Проблема: SendGrid API медленный (их проблема, не наша!), но мы ждем синхронно.

✅ Fix: Async Processing

// ❌ BEFORE - Synchronous
app.post("/api/send-email", async (req, res) => {
  await sendGridClient.send(email); // Ждем 8s!
  res.json({ success: true });
});
 
// ✅ AFTER - Async with Queue
app.post("/api/send-email", async (req, res) => {
  // Публикуем в очередь
  await queue.publish("email.send", email); // ~5ms
 
  res.json({ success: true, queued: true });
  // User получает ответ сразу!
});
 
// Background worker обрабатывает очередь
worker.process("email.send", async (job) => {
  await sendGridClient.send(job.data); // Может быть медленным, но не блокирует API
});

✅ Verification

Trace после fix:

Результат: API latency 8000ms → 15ms! 🚀

Scenario 4: Circuit Breaker Failure

🔥 Алерт

[CRITICAL] Error rate /api/orders: 0% → 98%
[CRITICAL] Downstream service "inventory-service" unavailable

🔍 Trace Pattern

Pattern: Каждый request ждет 30s timeout!

🎯 Root Cause

Span tags:
  error.type: ETIMEDOUT
  http.status_code: null
  retry.count: 0
  circuit_breaker.enabled: false  ← ⚠️ Проблема!

Проблема: Нет circuit breaker → каждый request ждет full timeout (30s)!

✅ Fix: Add Circuit Breaker

const CircuitBreaker = require("opossum");
 
// ❌ BEFORE
async function getInventory(productId) {
  return await fetch(`http://inventory-service/products/${productId}`);
  // Если сервис down → ждем 30s timeout каждый раз
}
 
// ✅ AFTER
const inventoryBreaker = new CircuitBreaker(
  async (productId) => {
    return await fetch(`http://inventory-service/products/${productId}`, {
      timeout: 2000, // 2s timeout
    });
  },
  {
    timeout: 2000,
    errorThresholdPercentage: 50, // Open при 50% errors
    resetTimeout: 30000, // Retry после 30s
  }
);
 
async function getInventory(productId) {
  try {
    return await inventoryBreaker.fire(productId);
  } catch (err) {
    if (err.code === "EOPENBREAKER") {
      // Circuit open - fail fast!
      throw new Error("Inventory service unavailable");
    }
    throw err;
  }
}

✅ Verification

Trace с circuit breaker:

Span tags:

circuit_breaker.state: OPEN
circuit_breaker.failures: 15
error.message: "Inventory service unavailable (circuit open)"

Результат:

✅ Latency: 30s → 50ms (fail fast!)
✅ Сервис остается responsive
✅ Auto-recovery после 30s (circuit half-open)

Scenario 5: Cascading Failure

🔥 Алерт

[CRITICAL] Multiple services failing simultaneously:
  - auth-service: 95% error rate
  - user-service: 90% error rate
  - order-service: 85% error rate
  - payment-service: 50% error rate

🔍 Service Dependency Graph

Pattern: Множество сервисов падают → ищем ОБЩУЮ dependency!

🎯 Root Cause via Traces

Открываем traces для КАЖДОГО failing service:

Auth Service trace:

span: getCachedUser
error.type: ECONNREFUSED
error.message: "Connection refused: redis:6379"

User Service trace:

span: cacheUserProfile
error.type: ECONNREFUSED
error.message: "Connection refused: redis:6379"

Order Service trace:

span: getCachedCart
error.type: ECONNREFUSED
error.message: "Connection refused: redis:6379"

Common pattern: ALL failing на Redis!

✅ Fix

# Проверка Redis
kubectl get pods | grep redis
# redis-xxx  0/1  CrashLoopBackOff
 
# Check logs
kubectl logs redis-xxx
# Error: MISCONF Redis is configured to save RDB snapshots,
# but it is currently not able to persist on disk
 
# Причина: Disk full!
df -h
# /dev/sda1  100%  ← Problem!
 
# Fix: Очистить место
rm /var/log/old-logs/*
kubectl delete pod redis-xxx  # Restart

✅ Verification

После fix Redis - все сервисы восстановились:

Lesson: Traces показали ОБЩУЮ точку отказа! ✨

Scenario 6: Memory Leak Detection

🔥 Алерт

[WARNING] Latency increasing linearly over time:
  - Hour 1: P99 = 100ms
  - Hour 2: P99 = 200ms
  - Hour 3: P99 = 400ms
  - Hour 4: P99 = 800ms (current)

🔍 Correlation: Traces + Metrics

Pattern: Latency растет ≈ Memory растет → Memory leak!

🎯 Root Cause

Profiling + Traces:

// Span attributes показывают leak source:
span.setAttribute("cache.size", cacheMap.size);
 
// В Jaeger:
Hour 1: cache.size = 1000
Hour 2: cache.size = 5000
Hour 3: cache.size = 15000
Hour 4: cache.size = 50000  ← ⚠️ Растет!

Код:

// ❌ BEFORE - Memory leak
const cache = new Map(); // Unbounded cache!
 
app.get("/users/:id", async (req, res) => {
  let user = cache.get(req.params.id);
 
  if (!user) {
    user = await db.getUser(req.params.id);
    cache.set(req.params.id, user); // ← Никогда не удаляется!
  }
 
  res.json(user);
});

✅ Fix: LRU Cache with TTL

const LRU = require("lru-cache");
 
// ✅ AFTER - Bounded cache
const cache = new LRU({
  max: 1000, // Max 1000 entries
  ttl: 1000 * 60 * 5, // 5 min TTL
});
 
app.get("/users/:id", async (req, res) => {
  let user = cache.get(req.params.id);
 
  if (!user) {
    user = await db.getUser(req.params.id);
    cache.set(req.params.id, user);
  }
 
  // Track cache metrics in spans
  const span = trace.getActiveSpan();
  span.setAttribute("cache.size", cache.size);
  span.setAttribute("cache.max", cache.max);
 
  res.json(user);
});

✅ Verification

Metrics после fix:

Hour 5: process_resident_memory_bytes = 250MB ✅
Hour 6: process_resident_memory_bytes = 260MB ✅
Hour 7: process_resident_memory_bytes = 255MB ✅
 
cache.size = 1000 (stable) ✅

Best Practices для Production Debugging

1. Добавляйте Context в Spans

span.setAttribute("user.id", userId);
span.setAttribute("tenant.id", tenantId);
span.setAttribute("cart.items_count", cart.items.length);
span.setAttribute("cache.hit", cacheHit);
span.setAttribute("retry.attempt", retryCount);

Почему: Легче фильтровать и находить проблемные traces!

2. Используйте Span Events для Milestones

span.addEvent("cache.miss");
span.addEvent("retry.started", { attempt: 2 });
span.addEvent("circuit_breaker.opened");

3. Коррелируйте Traces с Logs

const traceId = span.spanContext().traceId;
 
logger.error("Payment failed", {
  traceId, // ⭐ Критично для корреляции!
  userId,
  error: err.message,
});

В Grafana Loki:

{service="payment-service"} |= "traceId=abc123"

→ Клик на trace ID → Opens Jaeger!

4. Мониторьте Key Metrics в Spans

span.setAttribute("db.query_time_ms", queryTime);
span.setAttribute("cache.hit_rate", hitRate);
span.setAttribute("queue.lag_seconds", queueLag);

5. Используйте Jaeger Advanced Queries

# Найти все slow traces
duration>1s
 
# Найти traces с errors для specific user
error=true user.id=42
 
# Найти traces с high retry count
retry.count>3
 
# Комбинации
duration>500ms error=true http.status_code=503

Практическое задание

Вам дан production trace с проблемой. Найдите root cause!

Scenario: API endpoint /api/dashboard медленный (P99 = 3s)

Trace structure:

GET /api/dashboard [3.2s]
├─ getUser [150ms]
│  └─ SELECT users [15ms]
├─ getUserPosts [2.8s]
│  ├─ SELECT posts [20ms]
│  └─ enrichPosts [2.7s]
│     ├─ getComments (post_id=1) [50ms]
│     ├─ getComments (post_id=2) [50ms]
│     ├─ ... [50ms × 48 more]
└─ getUserStats [200ms]

Вопросы:

Какая проблема? (N+1 query / Retry storm / Slow API / Memory leak?)
Где bottleneck?
Как бы вы исправили?

Ответ: N+1 query problem! 50 getComments calls вместо одного batch query. Fix: JOIN или IN query для всех post_ids сразу.

Что дальше

Поздравляем! Вы прошли весь курс по distributed tracing - от основ до production debugging! 🎉

Теперь вы можете:

✅ Инструментировать приложения (HTTP, gRPC, Kafka, RabbitMQ)
✅ Настраивать infrastructure (Jaeger, OTel Collector, Istio)
✅ Отлаживать production проблемы через traces
✅ Использовать best practices для observability

Вы теперь эксперт по distributed tracing! Применяйте эти знания для улучшения reliability и observability ваших систем.

🎉 Congratulations on completing the course!