"We added real-time notifications to the system. Now checkout makes 7 synchronous HTTP calls, and if even one service slows down — the entire checkout fails. Users complain, CEO demands heads."
This is a typical situation at growing startups (similar stories regularly appear in engineering blogs from Stripe, Airbnb, and fintech companies): a classic synchronous architecture with a beautiful facade turns into a house of cards, where one slow service brings down the entire system.
A typical result after moving to an event-driven approach: P99 checkout latency drops from 2.3 seconds to 180 ms, availability rises from 99.5% to 99.95%, and the team stops getting paged at 3 AM (based on migration experiences described by Netflix, Uber, and other companies).
This article is a practical guide to Event-Driven Architecture without evangelism or magical thinking: typical production cases, Python code, metrics, and the honest pitfalls that regularly come up in the industry.
The Truth About Event-Driven Architecture: It's Not About Kafka
Let's start with an uncomfortable truth: Event-Driven Architecture is not a technology. It's an approach to system design.
Imagine two communication models:
Request-Response (synchronous model) — like a phone call:
You: "Hello, what's the weather?"
Friend: "Wait... (searching for info) ...+15°C"
You: (waiting all this time with phone to your ear)
Event-Driven (asynchronous model) — like a news feed:
You: (publish event "Want to know weather")
Weather service: (sees event and publishes "+15°C" to feed)
You: (will see the answer when you're ready)
The key difference: in the first case you wait; in the second, you keep going about your business.
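In code, the same contrast looks roughly like this (a conceptual sketch with hypothetical weather_service and bus objects, not a real API):

# Request-Response: the caller blocks until the answer comes back
async def ask_weather_sync(weather_service):
    return await weather_service.get_current()  # you wait right here

# Event-Driven: publish and move on; a handler reacts whenever the answer arrives
async def ask_weather_async(bus):
    await bus.publish("weather.requested", {"city": "Berlin"})
    # no waiting: the reply shows up later as its own event

async def on_weather_updated(event):
    print(f"Current temperature: {event['temperature']}")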
Typical Scenario: How Synchronous Calls Kill Checkout
An e-commerce startup decides to "improve UX" and adds new features during order creation:
What should happen at checkout:
- Check user in Auth Service (50ms)
- Validate products in Catalog Service (80ms)
- Check promo code in Discount Service (120ms)
- Reserve inventory in Inventory Service (200ms)
- Create order in Order Service (100ms)
- Charge payment in Payment Service (300ms)
- Send email via Notification Service (150ms)
Synchronous implementation:
# ❌ Potential trap: dependency chain
from fastapi import FastAPI, HTTPException
import httpx

app = FastAPI()

@app.post("/checkout")
async def checkout(order: CheckoutRequest):
    async with httpx.AsyncClient(timeout=5.0) as client:
        # 1. Check user
        user = await client.get(f"http://auth-service/users/{order.user_id}")
        if user.status_code != 200:
            raise HTTPException(401, "User not found")

        # 2. Validate products
        catalog = await client.post(
            "http://catalog-service/validate",
            json={"items": order.items}
        )
        if catalog.status_code != 200:
            raise HTTPException(400, "Invalid items")

        # 3. Check promo code
        discount = await client.post(
            "http://discount-service/validate",
            json={"code": order.promo_code}
        )

        # 4. Reserve inventory
        inventory = await client.post(
            "http://inventory-service/reserve",
            json={"items": order.items}
        )
        if inventory.status_code != 200:
            raise HTTPException(400, "Out of stock")

        # 5. Create order
        order_response = await client.post(
            "http://order-service/orders",
            json=order.dict()
        )

        # 6. Charge payment
        payment = await client.post(
            "http://payment-service/charge",
            json={"amount": order.total, "user_id": order.user_id}
        )
        if payment.status_code != 200:
            # Oh no! Order created but payment failed!
            # Need rollback... but how?
            raise HTTPException(500, "Payment failed")

        # 7. Send email
        await client.post(
            "http://notification-service/send",
            json={"email": user.json()["email"], "order_id": order_response.json()["id"]}
        )

    return {"order_id": order_response.json()["id"]}

Common problems with this approach:
Problem #1: Latency accumulates
Best case: 50 + 80 + 120 + 200 + 100 + 300 + 150 = 1000ms
Worst case (P99): 200 + 300 + 500 + 800 + 400 + 1200 + 600 = 4000ms (4 seconds!)
Problem #2: Cascading failures
Notification Service down (DNS timeout 30s)
→ Checkout waits 30 seconds
→ Users see spinning loader
→ Users hit F5
→ Duplicate orders created
→ Chaos
Problem #3: Partial rollback impossible
Scenario:
1. Order created ✅
2. Inventory reserved ✅
3. Payment Service failed ❌
Result: Product reserved but not paid.
Solution: ??? (manual cleanup in admin panel)
Typical metrics of such a system under load:
- Checkout P99 latency: 3.8+ seconds
- Success rate: 92% (8% fail due to timeouts)
- Duplicate orders: ~5% (users hit F5 due to slow loading)
- Unhappy customers: growing exponentially
Synchronous HTTP calls between microservices are like a chain of weak links. One link breaks → entire chain fails. Latency accumulates, failures cascade, rollback becomes a nightmare.
Event-Driven Approach: Architectural Solution
Key idea: Split checkout into two phases:
Phase 1 (synchronous): Critical business logic
- Data validation
- Inventory reservation (must succeed now)
- Create order in PENDING status
Phase 2 (asynchronous): Everything else via events
- Payment processing
- Email notifications
- Analytics
- Recommendation updates
New architecture:
# ✅ Event-driven approach (using Kafka as the broker)
from datetime import datetime
import json

import httpx
from fastapi import FastAPI, HTTPException
from kafka import KafkaProducer

app = FastAPI()

# Kafka producer (could be Redis Streams, RabbitMQ, etc.)
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

@app.post("/checkout")
async def checkout(order: CheckoutRequest):
    # SYNCHRONOUS (critical for UX)
    # 1. Quick validation
    validate_order_data(order)  # 5ms, in-memory

    # 2. Inventory reservation (the only external call!)
    async with httpx.AsyncClient(timeout=2.0) as client:
        inventory = await client.post(
            "http://inventory-service/reserve",
            json={"items": order.items}
        )
        if inventory.status_code != 200:
            raise HTTPException(400, "Out of stock")

    # 3. Create order in PENDING status
    order_id = await db.create_order(order, status="PENDING")

    # ASYNCHRONOUS (via events)
    # Publish the event; the rest is not our problem
    producer.send('orders.created', {
        'order_id': order_id,
        'user_id': order.user_id,
        'items': order.items,
        'total': order.total,
        'timestamp': datetime.utcnow().isoformat()
    })

    # Return the result instantly!
    return {
        "order_id": order_id,
        "status": "pending",
        "message": "Order created, payment processing"
    }

Consumers (separate services listen to events):
# Payment Service (listens to "orders.created" events)
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    'orders.created',
    bootstrap_servers='kafka:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='payment-service'
)
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

for message in consumer:
    order = message.value
    try:
        # Charge payment
        payment = charge_payment(order['user_id'], order['total'])

        # Publish success event
        producer.send('payments.completed', {
            'order_id': order['order_id'],
            'user_id': order['user_id'],  # carried along so downstream consumers don't need a lookup
            'payment_id': payment.id,
            'status': 'success'
        })
    except PaymentError as e:
        # Publish failure event
        producer.send('payments.failed', {
            'order_id': order['order_id'],
            'reason': str(e)
        })

# Notification Service (listens to "payments.completed" events)
consumer = KafkaConsumer(
    'payments.completed',
    bootstrap_servers='kafka:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='notification-service'
)

for message in consumer:
    payment = message.value
    # Send email (even if it fails, checkout is unaffected)
    send_email(
        user_id=payment['user_id'],
        template='order_confirmation',
        data={'order_id': payment['order_id']}
    )

Typical results after implementation (based on industry migration experiences):
Infrastructure savings: In typical cases, instance count decreases by 20-40% (due to lower load and better scaling). This is ~$500-1000/month for medium-sized projects.
Main point: UX improves (fast checkout), incident count decreases, team gets fewer night alerts.
When to Use Event-Driven Architecture
Event-Driven is not a silver bullet. There are cases when synchronous approach is simpler and better.
Choose EDA if you answered "YES" to 3+ questions:
1. Asynchronous operations present?
Examples:
- Sending email/SMS
- Report generation
- Image/video processing
- ML inference
- Analytics
2. Need fault tolerance?
Scenario: Email service is down
❌ Synchronous: Checkout fails for everyone
✅ Event-driven: Checkout works, emails sent later
3. Microservice architecture?
5+ services → high risk of cascading failures
Event-driven reduces coupling between services
4. High load?
> 1000 req/s → synchronous calls slow down
Event bus acts as buffer, smooths load spikes
5. Need audit trail / Event Sourcing?
Requirement: restore system state at any point in time
Event log = complete history of all changes
6. Different processing speeds?
Fast: API endpoint responds in 50ms
Slow: Email generation takes 2 seconds
Event-driven allows not waiting for slow operations
DON'T use EDA if:
❌ Simple CRUD system
Todo-app, blog, simple admin panel
→ Synchronous REST API is simpler and clearer
❌ Need immediate consistency
Bank transaction: money must be debited NOW
→ Eventual consistency doesn't fit
❌ Team not ready
Debugging event-driven systems is harder
Need:
- Distributed tracing
- Event monitoring
- Understanding of eventual consistency
❌ One monolith, 3 developers
Event bus overhead not justified
→ Start with modular monolith
Industry best practice: if an operation must complete BEFORE the user gets a response, use a synchronous call. Everything else is a candidate for events. This approach is used at Netflix, Uber, Spotify, and other high-scale systems.
Core Event-Driven System Patterns
Event-Driven is not just "throw event into Kafka". There are several patterns with different trade-offs.
Pattern #1: Event Notification (simple notifications)
Idea: Event contains minimal data, only ID and event type.
# Publisher
producer.send('user.registered', {
    'event_id': 'evt_123',
    'user_id': 'usr_456',
    'timestamp': '2025-12-26T10:00:00Z'
})

# Consumer (must fetch the data itself; @consumer.on is an illustrative decorator)
@consumer.on('user.registered')
async def on_user_registered(event):
    # Make an HTTP request for the data
    user = await user_service.get_user(event['user_id'])
    await send_welcome_email(user.email)

Pros:
- ✅ Small message size
- ✅ No data duplication
Cons:
- ❌ Consumer depends on source service (coupling)
- ❌ Additional HTTP request (latency)
When to use: Data changes frequently and freshness is critical.
Pattern #2: Event-Carried State Transfer (events with data)
Idea: Event contains ALL necessary data for processing.
# Publisher
producer.send('user.registered', {
    'event_id': 'evt_123',
    'user_id': 'usr_456',
    'email': 'user@example.com',
    'name': 'John Doe',
    'created_at': '2025-12-26T10:00:00Z',
    'subscription_plan': 'premium'
})

# Consumer (no HTTP requests!)
@consumer.on('user.registered')
async def on_user_registered(event):
    # Everything needed is already in the event
    await send_welcome_email(
        email=event['email'],
        name=event['name']
    )

Pros:
- ✅ Consumer independent from source service
- ✅ No additional HTTP requests
- ✅ Can process event even if source service is down
Cons:
- ❌ Larger message size (more traffic)
- ❌ Data duplication (risk of inconsistency)
When to use: Consumers must work independently, data changes rarely.
Recommended approach for most cases: Event-Carried State Transfer fits roughly 80% of scenarios. It reduces coupling and simplifies debugging, since all the data is visible in the event. Similar conventions appear in the architectures of Amazon, Zalando, and Shopify.
Pattern #3: Event Sourcing (events as source of truth)
Idea: System state = sum of all events. Don't store current state in DB, only store events.
# Traditional approach (CRUD)
class Order:
    id: int
    status: str  # "pending" | "paid" | "shipped" | "completed"
    total: float

# When changed, overwrite the row in the DB
await db.execute("UPDATE orders SET status = 'paid' WHERE id = 123")

# Event Sourcing approach
# Events (append-only log)
events = [
    {'type': 'OrderCreated', 'order_id': 123, 'total': 1500, 'timestamp': '2025-12-26T10:00:00Z'},
    {'type': 'PaymentReceived', 'order_id': 123, 'amount': 1500, 'timestamp': '2025-12-26T10:05:00Z'},
    {'type': 'OrderShipped', 'order_id': 123, 'tracking': 'TR123', 'timestamp': '2025-12-26T12:00:00Z'},
]

# State = replay all events
def get_order_state(order_id):
    events = event_store.get_events(order_id)
    state = {}
    for event in events:
        if event['type'] == 'OrderCreated':
            state['status'] = 'pending'
            state['total'] = event['total']
        elif event['type'] == 'PaymentReceived':
            state['status'] = 'paid'
        elif event['type'] == 'OrderShipped':
            state['status'] = 'shipped'
    return state

Pros:
- ✅ Complete audit trail (can restore state at any point in time)
- ✅ Debugging easier (all changes visible)
- ✅ Time travel (rollback to past state)
Cons:
- ❌ Complexity (need event store, snapshots, projections)
- ❌ Performance (replaying long event streams can be slow; usually mitigated with snapshots, see the sketch below)
- ❌ Can't delete data (GDPR problem)
When to use: Financial systems, audit critical, complex business logic.
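The replay cost noted in the cons above is usually tamed with snapshots: persist the computed state every N events and replay only the tail. A minimal sketch, assuming a hypothetical snapshot_store and an event_store that can return events starting from a given sequence number:

SNAPSHOT_EVERY = 100  # how often to persist a snapshot (tunable)

def get_order_state_snapshotted(order_id):
    # Start from the latest snapshot instead of the very first event
    snapshot = snapshot_store.get_latest(order_id)            # hypothetical API
    state = snapshot['state'] if snapshot else {}
    from_seq = snapshot['seq'] + 1 if snapshot else 0

    # Replay only the events recorded after the snapshot
    events = event_store.get_events(order_id, from_seq=from_seq)  # hypothetical API
    for event in events:
        state = apply_event(state, event)  # same if/elif logic as get_order_state above

    # Persist a fresh snapshot once the tail grows too long
    if len(events) >= SNAPSHOT_EVERY:
        snapshot_store.save(order_id, seq=from_seq + len(events) - 1, state=state)
    return state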
Pattern #4: CQRS (Command Query Responsibility Segregation)
Idea: Separate write (Command) and read (Query) into different models.
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ├─ Command (write) ────► Command Handler ────► Event Store
       │                                                   │
       └─ Query (read) ───────► Read Model ◄───────────────┘
                                (materialized view)
Example:
# Command side (write)
class CreateOrderCommand:
    user_id: int
    items: List[Item]

@app.post("/orders")
async def create_order(cmd: CreateOrderCommand):
    # Validation
    # Event creation (OrderCreatedEvent generates its own order_id)
    events = [
        OrderCreatedEvent(user_id=cmd.user_id, items=cmd.items),
        InventoryReservedEvent(items=cmd.items)
    ]
    # Save to the event store
    await event_store.append(events)
    return {"order_id": events[0].order_id}

# Query side (read)
@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    # Read from the materialized view (fast!)
    return await read_db.get_order(order_id)

# Projector (updates the read model from events)
@consumer.on('OrderCreatedEvent')
async def project_order(event):
    await read_db.insert_order({
        'id': event.order_id,
        'user_id': event.user_id,
        'status': 'pending',
        'created_at': event.timestamp
    })

Pros:
- ✅ Read and write optimized independently
- ✅ Complex queries don't affect write performance
- ✅ Can scale read and write separately
Cons:
- ❌ Complexity (two data models)
- ❌ Eventual consistency (read model may lag)
When to use: Read-heavy systems, complex analytical queries.
Technologies: Kafka vs RabbitMQ vs Redis Streams
Choosing an event bus is a critical decision. Let's compare the top three.
Apache Kafka: Tank for high loads
When to use:
- Throughput > 100k messages/sec
- Need persistence (event replay)
- Event sourcing / stream processing
- Large team (someone to maintain)
Pros:
- ✅ High throughput (millions msg/sec)
- ✅ Persistence (events stored days/months)
- ✅ Replay (can re-read old events)
- ✅ Partitioning (horizontal scaling)
Cons:
- ❌ Complex setup (ZooKeeper, brokers, replication)
- ❌ Overkill for small systems
- ❌ Higher latency than RabbitMQ (optimized for throughput)
What is ZooKeeper? It's a distributed coordination system that Kafka historically used for storing metadata: which brokers are alive, which topics exist, which replica is the partition leader. Think of ZooKeeper as the Kafka cluster's "address book"; it answers questions like "where is the leader of partition X?" or "which consumers are in group Y?".
Good news: starting with Kafka 3.3 (2022), KRaft mode (Kafka Raft) is production-ready and removes the need for ZooKeeper, because Kafka manages its own metadata. This simplifies deployment and reduces the number of moving parts; in Kafka 4.0, ZooKeeper support was removed entirely. For production in 2025, KRaft is the recommended mode.
Example (Python + aiokafka):
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer
import json

# Producer (run inside an async function / event loop)
producer = AIOKafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode()
)
await producer.start()
await producer.send('orders.created', {'order_id': 123})

# Consumer
consumer = AIOKafkaConsumer(
    'orders.created',
    bootstrap_servers='localhost:9092',
    group_id='payment-service',
    value_deserializer=lambda m: json.loads(m.decode())
)
await consumer.start()
async for msg in consumer:
    print(f"Order: {msg.value}")

Docker Compose (KRaft mode, recommended in 2025):
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      # KRaft mode (without ZooKeeper)
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka:9093"
      KAFKA_LISTENERS: "PLAINTEXT://kafka:9092,CONTROLLER://kafka:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://localhost:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      # Storage formatting (mandatory for KRaft)
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"

Legacy variant with ZooKeeper (if compatibility is needed):
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

RabbitMQ: Flexible and simple
When to use:
- Need flexible routing (topic exchange, fanout, headers)
- Priority queues
- Small/medium loads (< 50k msg/sec)
- Quick start (easier to set up than Kafka)
Pros:
- ✅ Setup simplicity
- ✅ Flexible routing (exchange types)
- ✅ Monitoring UI (RabbitMQ Management)
- ✅ Low latency
Cons:
- ❌ Lower throughput than Kafka
- ❌ No persistence by default (can enable, but slower)
- ❌ Replay harder
Example (Python + aio-pika):
import aio_pika
import json

# Publisher (run inside an async function)
connection = await aio_pika.connect_robust("amqp://localhost/")
channel = await connection.channel()

# Declare a topic exchange so routing keys like "orders.*" can be used
exchange = await channel.declare_exchange(
    "events", aio_pika.ExchangeType.TOPIC, durable=True
)
await exchange.publish(
    aio_pika.Message(
        body=json.dumps({'order_id': 123}).encode(),
        delivery_mode=aio_pika.DeliveryMode.PERSISTENT
    ),
    routing_key='orders.created'
)

# Consumer
queue = await channel.declare_queue('payment-service-queue', durable=True)
await queue.bind(exchange='events', routing_key='orders.created')

async with queue.iterator() as queue_iter:
    async for message in queue_iter:
        async with message.process():
            data = json.loads(message.body.decode())
            print(f"Order: {data}")

Redis Streams: Lightweight for simple cases
When to use:
- Already using Redis
- Need low latency
- Simple use cases (pub/sub, task queue)
- Small/medium loads
Pros:
- ✅ Minimal overhead (Redis already there)
- ✅ Very low latency
- ✅ Simplicity (familiar Redis API)
- ✅ Consumer groups (like in Kafka)
Cons:
- ❌ In-memory (dataset limited by RAM)
- ❌ Weaker durability guarantees than Kafka (RDB/AOF persistence only)
- ❌ Not for high loads
Example (Python + redis-py):
import redis.asyncio as redis
import json

r = redis.from_url("redis://localhost")  # from_url is synchronous, no await needed

# Producer
await r.xadd(
    'orders:created',
    {'data': json.dumps({'order_id': 123})}
)

# Consumer
consumer_group = 'payment-service'
# Create the consumer group once (raises an error if it already exists)
await r.xgroup_create('orders:created', consumer_group, mkstream=True)

while True:
    messages = await r.xreadgroup(
        consumer_group,
        'worker-1',
        {'orders:created': '>'},
        count=10,
        block=5000  # wait up to 5s for new messages instead of busy-looping
    )
    for stream, msgs in messages:
        for msg_id, data in msgs:
            order = json.loads(data[b'data'])
            print(f"Order: {order}")
            await r.xack('orders:created', consumer_group, msg_id)

Comparison Table
| Feature | Kafka | RabbitMQ | Redis Streams |
|---|---|---|---|
| Throughput | 1M+ msg/s | 50k msg/s | 100k msg/s |
| Latency (P99) | 10-50ms | 5-10ms | 1-5ms |
| Persistence | Yes (days/months) | Optional | In-memory (loss risk) |
| Replay | Yes | Difficult | Yes (limited) |
| Complexity | High | Medium | Low |
| Scaling | Horizontal | Vertical/Clustering | Vertical |
| Use case | Event sourcing, high-load | Task queues, routing | Simple pub/sub, caching |
Recommendations based on typical use-cases:
- Kafka: E-commerce, fintech, event sourcing (Netflix, LinkedIn, Uber)
- RabbitMQ: Task queues, background jobs, webhooks (GitHub, Mozilla, Reddit)
- Redis Streams: Real-time notifications, simple pub/sub (Stack Overflow, Slack)
Event-Driven System Pitfalls
Event-Driven solves some problems and creates others. Here are pitfalls that regularly occur during implementation (based on public post-mortems and technical blogs).
Pitfall #1: Eventual Consistency (and how to live with it)
Problem: Time passes between event publication and processing.
# User registers
await db.create_user(user_id=123, email='user@example.com')
producer.send('user.registered', {'user_id': 123})

# ...10ms later...
# Another service tries to get user
user = await user_service.get_user(123)  # May return None!

Solution #1: Read Your Own Writes
# After creation return data immediately
@app.post("/users")
async def create_user(user: UserCreate):
    user_entity = await db.create_user(user)
    producer.send('user.registered', user_entity.dict())
    # Return created user
    return user_entity  # Don't wait for event processing!

Solution #2: Polling / WebSocket for UI
# Frontend polling
POST /users → { "user_id": 123, "status": "pending" }

# Frontend polls
GET /users/123 → { "user_id": 123, "status": "pending" }
...
GET /users/123 → { "user_id": 123, "status": "active" }  # Ready!
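On the backend, Solution #2 only requires a cheap status endpoint that reads the current state. A minimal FastAPI sketch (the db.get_user helper is assumed):

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/users/{user_id}")
async def get_user_status(user_id: int):
    # Cheap read of the current state; the frontend polls this endpoint
    # until consumers have flipped the status from "pending" to "active"
    user = await db.get_user(user_id)  # assumed helper
    if user is None:
        raise HTTPException(404, "User not found")
    return {"user_id": user.id, "status": user.status}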
Solution #3: Optimistic UI

// Frontend assumes success
async onSubmit() {
    // Show immediately (optimistically)
    users.add({ id: 123, email: 'user@example.com', status: 'pending' })
    // Send to backend
    await api.createUser(...)
    // WebSocket/SSE updates the status when processing completes
}

Eventual consistency is a conscious architectural choice. Users are accustomed to asynchronicity: email doesn't arrive instantly, and bank transfers can take a day. The key is showing the right UI (pending states, progress indicators). This approach is used by all major platforms: Amazon (orders), Twitter (likes), Instagram (notifications).
Pitfall #2: Event Duplication (Idempotency mandatory)
Problem: Event can be processed twice (network retry, rebalancing).
# ❌ NOT idempotent
@consumer.on('payment.received')
async def on_payment(event):
    # If event processed 2 times → credit bonus twice!
    await db.execute(
        "UPDATE users SET bonus_balance = bonus_balance + 100 WHERE id = ?",
        event['user_id']
    )

Solution: Idempotency key
# ✅ Idempotent
@consumer.on('payment.received')
async def on_payment(event):
    # Check if we already processed this event
    processed = await db.fetchval(
        "SELECT 1 FROM processed_events WHERE event_id = ?",
        event['event_id']
    )
    if processed:
        return  # Skip duplicate

    # Process in a transaction
    async with db.transaction():
        await db.execute(
            "UPDATE users SET bonus_balance = bonus_balance + 100 WHERE id = ?",
            event['user_id']
        )
        # processed_events.event_id should have a UNIQUE constraint,
        # so a concurrent duplicate fails the transaction instead of double-crediting
        await db.execute(
            "INSERT INTO processed_events (event_id, processed_at) VALUES (?, ?)",
            event['event_id'], datetime.utcnow()
        )

Alternative: Upsert operations
# PostgreSQL UPSERT (idempotent by design)
await db.execute("""
    INSERT INTO user_bonuses (user_id, payment_id, amount)
    VALUES (?, ?, 100)
    ON CONFLICT (payment_id) DO NOTHING
""", event['user_id'], event['payment_id'])

Pitfall #3: Event Ordering (Order not guaranteed)
Problem: Events can arrive in wrong order.
# Events published
producer.send('order.created', {'order_id': 123, 'status': 'pending'})
producer.send('order.paid', {'order_id': 123, 'status': 'paid'})
# Consumer may receive in reverse order!
1. order.paid (status = paid)
2. order.created (status = pending)
# Result: Order in wrong state

Solution #1: Partition key (Kafka)
# Kafka guarantees order within a partition
# Send all events for one order to the same partition
await producer.send(
    'orders',
    value={'event': 'order.created', 'order_id': 123},
    key=str(123).encode()  # Partition key = order_id
)

Solution #2: Sequence number / version
# Add a sequence number to each event
events = [
    {'order_id': 123, 'event': 'created', 'seq': 1},
    {'order_id': 123, 'event': 'paid', 'seq': 2},
    {'order_id': 123, 'event': 'shipped', 'seq': 3},
]

# Consumer checks the sequence
@consumer.on('order.*')
async def on_order_event(event):
    current_seq = await db.fetchval("SELECT seq FROM orders WHERE id = ?", event['order_id'])
    if current_seq is not None and event['seq'] <= current_seq:
        return  # Old event, ignore

    # Apply event
    await apply_event(event)

Pitfall #4: Dead Letter Queue (when event can't be processed)
Problem: Consumer fails when processing event.
@consumer.on('payment.received')
async def on_payment(event):
    # External API down
    await stripe.refund(event['payment_id'])  # Timeout!

Without DLQ: Event will retry infinitely, blocking queue.
Solution: Dead Letter Queue + retry with backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@consumer.on('payment.received')
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60)
)
async def on_payment(event):
    try:
        await stripe.refund(event['payment_id'])
    except Exception as e:
        # After 5 attempts → send to DLQ
        if should_send_to_dlq(e):
            await producer.send('dlq.payments', event)
        raise

Kafka DLQ config:
from aiokafka import AIOKafkaConsumer

consumer = AIOKafkaConsumer(
    'payments.received',
    enable_auto_commit=False,  # Manual commit
    max_poll_records=10
)

async for msg in consumer:
    try:
        await process_payment(msg.value)
        await consumer.commit()  # Success
    except Exception as e:
        # Send to DLQ
        await producer.send('dlq.payments', msg.value)
        await consumer.commit()  # Commit to not retry
        logger.error(f"Failed to process {msg.value}: {e}")

Monitoring Event-Driven Systems
In synchronous systems debugging is simple: look at the stack trace. In event-driven systems, you need dedicated tooling.
Metrics for event bus
Mandatory metrics:
from prometheus_client import Counter, Histogram, Gauge

# Event count
events_published = Counter(
    'events_published_total',
    'Total events published',
    ['topic', 'event_type']
)

# Processing latency
event_processing_duration = Histogram(
    'event_processing_duration_seconds',
    'Event processing duration',
    ['consumer', 'topic']
)

# Consumer lag (Kafka specific; see the lag-reporting sketch below)
consumer_lag = Gauge(
    'kafka_consumer_lag',
    'Consumer lag in messages',
    ['group', 'topic', 'partition']
)

# Dead letter queue size
dlq_size = Gauge(
    'dlq_messages_total',
    'Messages in DLQ',
    ['topic']
)

Usage:
@app.post("/orders")
async def create_order(order: OrderCreate):
    start_time = time.time()

    # Create order
    order_id = await db.create_order(order)

    # Publish event
    producer.send('orders.created', {'order_id': order_id})
    events_published.labels(topic='orders.created', event_type='OrderCreated').inc()

    # Latency
    event_processing_duration.labels(
        consumer='order-service',
        topic='orders.created'
    ).observe(time.time() - start_time)

    return {"order_id": order_id}
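The consumer_lag gauge declared above still has to be filled from the consumer side. A minimal sketch with aiokafka, meant to be called periodically next to the consuming loop (the label values follow the payment-service examples above):

from aiokafka import AIOKafkaConsumer

async def report_consumer_lag(consumer: AIOKafkaConsumer):
    # Lag per partition = end of the log (highwater mark) - our current position
    for tp in consumer.assignment():
        highwater = consumer.highwater(tp)       # last known end offset
        position = await consumer.position(tp)   # next offset this group will read
        if highwater is not None:
            consumer_lag.labels(
                group='payment-service',
                topic=tp.topic,
                partition=tp.partition
            ).set(highwater - position)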
Grafana dashboard for EDA

# Throughput (events/sec)
rate(events_published_total[1m])

# Consumer lag (warning if > 1000)
kafka_consumer_lag > 1000

# P99 latency
histogram_quantile(0.99, rate(event_processing_duration_seconds_bucket[5m]))

# DLQ growth rate (alert if growing)
delta(dlq_messages_total[5m]) > 10
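Metrics answer "how many and how slow"; to follow a single order across the producer and all of its consumers you also need distributed tracing (it appears again in the checklist below). A minimal sketch of propagating an OpenTelemetry trace context through Kafka message headers, assuming the OpenTelemetry SDK is already configured and an aiokafka producer/consumer as in the examples above:

from opentelemetry import trace, propagate

tracer = trace.get_tracer("checkout")

# Producer side: inject the current trace context into the message headers
async def publish_with_trace(producer, order_id: int):
    with tracer.start_as_current_span("publish orders.created"):
        carrier = {}
        propagate.inject(carrier)  # writes traceparent/tracestate into the dict
        await producer.send(
            'orders.created',
            value={'order_id': order_id},
            headers=[(k, v.encode()) for k, v in carrier.items()]
        )

# Consumer side: extract the context so the processing span joins the same trace
def consumer_span(msg):
    carrier = {k: v.decode() for k, v in (msg.headers or [])}
    ctx = propagate.extract(carrier)
    return tracer.start_as_current_span("process orders.created", context=ctx)

# Usage in a consumer loop:
#   async for msg in consumer:
#       with consumer_span(msg):
#           await process_payment(msg.value)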
Checklist: Are You Ready for Event-Driven?

Infrastructure:
- Event bus chosen and configured (Kafka/RabbitMQ/Redis)
- Monitoring and alerts on consumer lag
- Dead Letter Queue configured
- Distributed tracing (OpenTelemetry) implemented
Architecture:
- Bounded contexts defined (which events to publish)
- Idempotency strategy designed
- Eventual consistency acceptable for business
- Error handling strategy defined (retry, DLQ)
Team:
- Developers understand eventual consistency
- Experience debugging distributed systems
- Ready for monitoring complexity
Conclusions
Event-Driven Architecture is a powerful tool, but not a panacea.
When to use:
- Microservice architecture with 3+ services
- Asynchronous operations (email, reports, ML)
- Need fault tolerance
- High load (buffering via event bus)
When NOT to use:
- Simple CRUD monolith
- Team not ready for eventual consistency
- Need immediate consistency (finance)
Key patterns:
- Event Notification: minimal data in event
- Event-Carried State Transfer: all data in event (recommended for most cases)
- Event Sourcing: events = source of truth
- CQRS: read/write model separation
Technology choice:
- Kafka: high loads, event sourcing, persistence
- RabbitMQ: flexible routing, task queues, medium loads
- Redis Streams: simplicity, low latency, small loads
Must-have:
- Idempotency (protection from duplicates)
- DLQ (error handling)
- Lag and latency monitoring
- Distributed tracing
Main lesson: Start simple (Event Notification + RabbitMQ). Increase complexity as you grow. Don't implement Kafka and Event Sourcing "just in case".
Useful materials for further study:
Company blogs with EDA:
- Netflix Tech Blog — streaming platform architecture
- Uber Engineering — event-driven systems in real-time
- Stripe Blog — idempotency and reliability
- AWS Architecture Blog — event system patterns
Books:
- "Building Event-Driven Microservices" by Adam Bellemare (O'Reilly, 2020)
- "Designing Data-Intensive Applications" by Martin Kleppmann
Related articles on site:
- Monolith → Microservices — how to split systems
- Kafka + FastAPI — practical Kafka guide
- Monitoring Stack — observability for distributed systems
Need help with your architecture? I do architecture reviews and consultations on adopting an event-driven approach. Write me an email and let's discuss your case.
Subscribe to updates on Telegram, where I write about architecture, Python, and modern development practices. No fluff, only applicable knowledge.

