"We added real-time notifications to the system. Now checkout makes 7 synchronous HTTP calls, and if even one service slows down — the entire checkout fails. Users complain, CEO demands heads."
This is a typical situation at growing startups (similar stories regularly appear in engineering blogs from Stripe, Airbnb, and fintech companies): a classic synchronous architecture with a beautiful facade turns into a house of cards, where one slow service brings down the entire system.
A typical result after moving to an event-driven approach: P99 checkout latency drops from 2.3 seconds to 180 ms, availability rises from 99.5% to 99.95%, and the team stops getting paged at 3 AM (based on migration experiences described by Netflix, Uber, and other companies).
This article is a practical guide to Event-Driven Architecture without evangelism or magical thinking: typical production cases, Python code, metrics, and the honest pitfalls that regularly come up in the industry.
The Truth About Event-Driven Architecture: It's Not About Kafka
Let's start with an uncomfortable truth: Event-Driven Architecture is not a technology. It's an approach to system design.
Imagine two communication models:
Request-Response (synchronous model) — like a phone call:
You: "Hello, what's the weather?"
Friend: "Wait... (searching for info) ...+15°C"
You: (waiting all this time with phone to your ear)
Event-Driven (asynchronous model) — like a news feed:
You: (publish event "Want to know weather")
Weather service: (sees event and publishes "+15°C" to feed)
You: (will see the answer when you're ready)
The key difference: in the first case you wait; in the second, you keep going about your business.
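In code, the same contrast looks roughly like this (a conceptual sketch with hypothetical weather_service and bus objects, not a real API):

# Request-Response: the caller blocks until the answer comes back
async def ask_weather_sync(weather_service):
    return await weather_service.get_current()  # you wait right here

# Event-Driven: publish and move on; a handler reacts whenever the answer arrives
async def ask_weather_async(bus):
    await bus.publish("weather.requested", {"city": "Berlin"})
    # no waiting: the reply shows up later as its own event

async def on_weather_updated(event):
    print(f"Current temperature: {event['temperature']}")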
Typical Scenario: How Synchronous Calls Kill Checkout
An e-commerce startup decides to "improve UX" and adds new features during order creation:
What should happen at checkout:
- Check user in Auth Service (50ms)
- Validate products in Catalog Service (80ms)
- Check promo code in Discount Service (120ms)
- Reserve inventory in Inventory Service (200ms)
- Create order in Order Service (100ms)
- Charge payment in Payment Service (300ms)
- Send email via Notification Service (150ms)
Synchronous implementation:
# ❌ Potential trap: dependency chain
from fastapi import FastAPI, HTTPException
import httpx

app = FastAPI()

@app.post("/checkout")
async def checkout(order: CheckoutRequest):
    async with httpx.AsyncClient(timeout=5.0) as client:
        # 1. Check user
        user = await client.get(f"http://auth-service/users/{order.user_id}")
        if user.status_code != 200:
            raise HTTPException(401, "User not found")

        # 2. Validate products
        catalog = await client.post(
            "http://catalog-service/validate",
            json={"items": order.items}
        )
        if catalog.status_code != 200:
            raise HTTPException(400, "Invalid items")

        # 3. Check promo code
        discount = await client.post(
            "http://discount-service/validate",
            json={"code": order.promo_code}
        )

        # 4. Reserve inventory
        inventory = await client.post(
            "http://inventory-service/reserve",
            json={"items": order.items}
        )
        if inventory.status_code != 200:
            raise HTTPException(400, "Out of stock")

        # 5. Create order
        order_response = await client.post(
            "http://order-service/orders",
            json=order.dict()
        )

        # 6. Charge payment
        payment = await client.post(
            "http://payment-service/charge",
            json={"amount": order.total, "user_id": order.user_id}
        )
        if payment.status_code != 200:
            # Oh no! Order created but payment failed!
            # Need rollback... but how?
            raise HTTPException(500, "Payment failed")

        # 7. Send email
        await client.post(
            "http://notification-service/send",
            json={"email": user.json()["email"], "order_id": order_response.json()["id"]}
        )

    return {"order_id": order_response.json()["id"]}

Common problems with this approach:
Problem #1: Latency accumulates
Best case: 50 + 80 + 120 + 200 + 100 + 300 + 150 = 1000ms
Worst case (P99): 200 + 300 + 500 + 800 + 400 + 1200 + 600 = 4000ms (4 seconds!)
Problem #2: Cascading failures
Notification Service down (DNS timeout 30s)
→ Checkout waits 30 seconds
→ Users see spinning loader
→ Users hit F5
→ Duplicate orders created
→ Chaos
Problem #3: Partial rollback impossible
Scenario:
1. Order created ✅
2. Inventory reserved ✅
3. Payment Service failed ❌
Result: Product reserved but not paid.
Solution: ??? (manual cleanup in admin panel)
Typical metrics of such a system under load:
- Checkout P99 latency: 3.8+ seconds
- Success rate: 92% (8% fail due to timeouts)
- Duplicate orders: ~5% (users hit F5 due to slow loading)
- Unhappy customers: growing exponentially
Synchronous HTTP calls between microservices are like a chain of weak links. One link breaks → entire chain fails. Latency accumulates, failures cascade, rollback becomes a nightmare.
Event-Driven Approach: Architectural Solution
Key idea: Split checkout into two phases:
Phase 1 (synchronous): Critical business logic
- Data validation
- Inventory reservation (must succeed now)
- Create order in PENDING status
Phase 2 (asynchronous): Everything else via events
- Payment processing
- Email notifications
- Analytics
- Recommendation updates
New architecture:
# ✅ Event-driven approach (using Kafka as the broker)
from datetime import datetime
import json

import httpx
from fastapi import FastAPI, HTTPException
from kafka import KafkaProducer

app = FastAPI()

# Kafka producer (could be Redis Streams, RabbitMQ, etc.)
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

@app.post("/checkout")
async def checkout(order: CheckoutRequest):
    # SYNCHRONOUS (critical for UX)
    # 1. Quick validation
    validate_order_data(order)  # 5ms, in-memory

    # 2. Inventory reservation (the only external call!)
    async with httpx.AsyncClient(timeout=2.0) as client:
        inventory = await client.post(
            "http://inventory-service/reserve",
            json={"items": order.items}
        )
        if inventory.status_code != 200:
            raise HTTPException(400, "Out of stock")

    # 3. Create order in PENDING status
    order_id = await db.create_order(order, status="PENDING")

    # ASYNCHRONOUS (via events)
    # Publish the event; the rest is not our problem
    producer.send('orders.created', {
        'order_id': order_id,
        'user_id': order.user_id,
        'items': order.items,
        'total': order.total,
        'timestamp': datetime.utcnow().isoformat()
    })

    # Return the result instantly!
    return {
        "order_id": order_id,
        "status": "pending",
        "message": "Order created, payment processing"
    }

Consumers (separate services listen to events):
# Payment Service (listens to "orders.created" events)
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    'orders.created',
    bootstrap_servers='kafka:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='payment-service'
)
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

for message in consumer:
    order = message.value
    try:
        # Charge payment
        payment = charge_payment(order['user_id'], order['total'])

        # Publish success event
        producer.send('payments.completed', {
            'order_id': order['order_id'],
            'user_id': order['user_id'],  # carried along so downstream consumers don't need a lookup
            'payment_id': payment.id,
            'status': 'success'
        })
    except PaymentError as e:
        # Publish failure event
        producer.send('payments.failed', {
            'order_id': order['order_id'],
            'reason': str(e)
        })

# Notification Service (listens to "payments.completed" events)
consumer = KafkaConsumer(
    'payments.completed',
    bootstrap_servers='kafka:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='notification-service'
)

for message in consumer:
    payment = message.value
    # Send email (even if it fails, checkout is unaffected)
    send_email(
        user_id=payment['user_id'],
        template='order_confirmation',
        data={'order_id': payment['order_id']}
    )

Typical results after implementation (based on industry migration experiences):
Infrastructure savings: In typical cases, instance count decreases by 20-40% (due to lower load and better scaling). This is ~$500-1000/month for medium-sized projects.
Main point: UX improves (fast checkout), incident count decreases, team gets fewer night alerts.
When to Use Event-Driven Architecture
Event-Driven is not a silver bullet. There are cases when synchronous approach is simpler and better.
Choose EDA if you answered "YES" to 3+ questions:
1. Asynchronous operations present?
Examples:
- Sending email/SMS
- Report generation
- Image/video processing
- ML inference
- Analytics
2. Need fault tolerance?
Scenario: Email service is down
❌ Synchronous: Checkout fails for everyone
✅ Event-driven: Checkout works, emails sent later
3. Microservice architecture?
5+ services → high risk of cascading failures
Event-driven reduces coupling between services
4. High load?
> 1000 req/s → synchronous calls slow down
Event bus acts as buffer, smooths load spikes
5. Need audit trail / Event Sourcing?
Requirement: restore system state at any point in time
Event log = complete history of all changes
6. Different processing speeds?
Fast: API endpoint responds in 50ms
Slow: Email generation takes 2 seconds
Event-driven allows not waiting for slow operations
DON'T use EDA if:
❌ Simple CRUD system
Todo-app, blog, simple admin panel
→ Synchronous REST API is simpler and clearer
❌ Need immediate consistency
Bank transaction: money must be debited NOW
→ Eventual consistency doesn't fit
❌ Team not ready
Debugging event-driven systems is harder
Need:
- Distributed tracing
- Event monitoring
- Understanding of eventual consistency
❌ One monolith, 3 developers
Event bus overhead not justified
→ Start with modular monolith
Industry best practice: if an operation must complete BEFORE the user gets a response, use a synchronous call. Everything else is a candidate for events. This approach is used at Netflix, Uber, Spotify, and other high-scale systems.
Core Event-Driven System Patterns
Event-Driven is not just "throw event into Kafka". There are several patterns with different trade-offs.
Pattern #1: Event Notification (simple notifications)
Idea: Event contains minimal data, only ID and event type.
# Publisher
producer.send('user.registered', {
    'event_id': 'evt_123',
    'user_id': 'usr_456',
    'timestamp': '2025-12-26T10:00:00Z'
})

# Consumer (must fetch the data itself; @consumer.on is an illustrative decorator)
@consumer.on('user.registered')
async def on_user_registered(event):
    # Make an HTTP request for the data
    user = await user_service.get_user(event['user_id'])
    await send_welcome_email(user.email)

Pros:
- ✅ Small message size
- ✅ No data duplication
Cons:
- ❌ Consumer depends on source service (coupling)
- ❌ Additional HTTP request (latency)
When to use: Data changes frequently and freshness is critical.
Pattern #2: Event-Carried State Transfer (events with data)
Idea: Event contains ALL necessary data for processing.
# Publisher
producer.send('user.registered', {
    'event_id': 'evt_123',
    'user_id': 'usr_456',
    'email': 'user@example.com',
    'name': 'John Doe',
    'created_at': '2025-12-26T10:00:00Z',
    'subscription_plan': 'premium'
})

# Consumer (no HTTP requests!)
@consumer.on('user.registered')
async def on_user_registered(event):
    # Everything needed is already in the event
    await send_welcome_email(
        email=event['email'],
        name=event['name']
    )

Pros:
- ✅ Consumer independent from source service
- ✅ No additional HTTP requests
- ✅ Can process event even if source service is down
Cons:
- ❌ Larger message size (more traffic)
- ❌ Data duplication (risk of inconsistency)
When to use: Consumers must work independently, data changes rarely.
Recommended approach for most cases: Event-Carried State Transfer fits roughly 80% of scenarios. It reduces coupling and simplifies debugging, since all the data is visible in the event. Similar conventions appear in the architectures of Amazon, Zalando, and Shopify.
Pattern #3: Event Sourcing (events as source of truth)
Idea: System state = sum of all events. Don't store current state in DB, only store events.
# Traditional approach (CRUD)
class Order:
    id: int
    status: str  # "pending" | "paid" | "shipped" | "completed"
    total: float

# When changed, overwrite the row in the DB
await db.execute("UPDATE orders SET status = 'paid' WHERE id = 123")

# Event Sourcing approach
# Events (append-only log)
events = [
    {'type': 'OrderCreated', 'order_id': 123, 'total': 1500, 'timestamp': '2025-12-26T10:00:00Z'},
    {'type': 'PaymentReceived', 'order_id': 123, 'amount': 1500, 'timestamp': '2025-12-26T10:05:00Z'},
    {'type': 'OrderShipped', 'order_id': 123, 'tracking': 'TR123', 'timestamp': '2025-12-26T12:00:00Z'},
]

# State = replay all events
def get_order_state(order_id):
    events = event_store.get_events(order_id)
    state = {}
    for event in events:
        if event['type'] == 'OrderCreated':
            state['status'] = 'pending'
            state['total'] = event['total']
        elif event['type'] == 'PaymentReceived':
            state['status'] = 'paid'
        elif event['type'] == 'OrderShipped':
            state['status'] = 'shipped'
    return state

Pros:
- ✅ Complete audit trail (can restore state at any point in time)
- ✅ Debugging easier (all changes visible)
- ✅ Time travel (rollback to past state)
Cons:
- ❌ Complexity (need event store, snapshots, projections)
- ❌ Performance (replaying long event streams can be slow; usually mitigated with snapshots, see the sketch below)
- ❌ Can't delete data (GDPR problem)
When to use: Financial systems, audit critical, complex business logic.
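The replay cost noted in the cons above is usually tamed with snapshots: persist the computed state every N events and replay only the tail. A minimal sketch, assuming a hypothetical snapshot_store and an event_store that can return events starting from a given sequence number:

SNAPSHOT_EVERY = 100  # how often to persist a snapshot (tunable)

def get_order_state_snapshotted(order_id):
    # Start from the latest snapshot instead of the very first event
    snapshot = snapshot_store.get_latest(order_id)            # hypothetical API
    state = snapshot['state'] if snapshot else {}
    from_seq = snapshot['seq'] + 1 if snapshot else 0

    # Replay only the events recorded after the snapshot
    events = event_store.get_events(order_id, from_seq=from_seq)  # hypothetical API
    for event in events:
        state = apply_event(state, event)  # same if/elif logic as get_order_state above

    # Persist a fresh snapshot once the tail grows too long
    if len(events) >= SNAPSHOT_EVERY:
        snapshot_store.save(order_id, seq=from_seq + len(events) - 1, state=state)
    return state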
Pattern #4: CQRS (Command Query Responsibility Segregation)
Idea: Separate write (Command) and read (Query) into different models.
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ├─ Command (write) ────► Command Handler ────► Event Store
       │                                                   │
       └─ Query (read) ───────► Read Model ◄───────────────┘
                                (materialized view)
Example:
# Command side (write)
class CreateOrderCommand:
    user_id: int
    items: List[Item]

@app.post("/orders")
async def create_order(cmd: CreateOrderCommand):
    # Validation
    # Event creation (OrderCreatedEvent generates its own order_id)
    events = [
        OrderCreatedEvent(user_id=cmd.user_id, items=cmd.items),
        InventoryReservedEvent(items=cmd.items)
    ]
    # Save to the event store
    await event_store.append(events)
    return {"order_id": events[0].order_id}

# Query side (read)
@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    # Read from the materialized view (fast!)
    return await read_db.get_order(order_id)

# Projector (updates the read model from events)
@consumer.on('OrderCreatedEvent')
async def project_order(event):
    await read_db.insert_order({
        'id': event.order_id,
        'user_id': event.user_id,
        'status': 'pending',
        'created_at': event.timestamp
    })

Pros:
- ✅ Read and write optimized independently
- ✅ Complex queries don't affect write performance
- ✅ Can scale read and write separately
Cons:
- ❌ Complexity (two data models)
- ❌ Eventual consistency (read model may lag)
When to use: Read-heavy systems, complex analytical queries.
Technologies: Kafka vs RabbitMQ vs Redis Streams
Choosing an event bus is a critical decision. Let's compare the top three.
Apache Kafka: Tank for high loads
When to use:
- Throughput > 100k messages/sec
- Need persistence (event replay)
- Event sourcing / stream processing
- Large team (someone to maintain)
Pros:
- ✅ High throughput (millions msg/sec)
- ✅ Persistence (events stored days/months)
- ✅ Replay (can re-read old events)
- ✅ Partitioning (horizontal scaling)
Cons:
- ❌ Complex setup (ZooKeeper, brokers, replication)
- ❌ Overkill for small systems
- ❌ Higher latency than RabbitMQ (optimized for throughput)
What is ZooKeeper? It's a distributed coordination system that Kafka historically used for storing metadata: which brokers are alive, which topics exist, which replica is the partition leader. Think of ZooKeeper as the Kafka cluster's "address book"; it answers questions like "where is the leader of partition X?" or "which consumers are in group Y?".
Good news: starting with Kafka 3.3 (2022), KRaft mode (Kafka Raft) is production-ready and removes the need for ZooKeeper, because Kafka manages its own metadata. This simplifies deployment and reduces the number of moving parts; in Kafka 4.0, ZooKeeper support was removed entirely. For production in 2025, KRaft is the recommended mode.
Example (Python + aiokafka):
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer
import json

# Producer (run inside an async function / event loop)
producer = AIOKafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode()
)
await producer.start()
await producer.send('orders.created', {'order_id': 123})

# Consumer
consumer = AIOKafkaConsumer(
    'orders.created',
    bootstrap_servers='localhost:9092',
    group_id='payment-service',
    value_deserializer=lambda m: json.loads(m.decode())
)
await consumer.start()
async for msg in consumer:
    print(f"Order: {msg.value}")

Docker Compose (KRaft mode, recommended in 2025):
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      # KRaft mode (without ZooKeeper)
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka:9093"
      KAFKA_LISTENERS: "PLAINTEXT://kafka:9092,CONTROLLER://kafka:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://localhost:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      # Storage formatting (mandatory for KRaft)
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"

Legacy variant with ZooKeeper (if compatibility is needed):
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

RabbitMQ: Flexible and simple
When to use:
- Need flexible routing (topic exchange, fanout, headers)
- Priority queues
- Small/medium loads (< 50k msg/sec)
- Quick start (easier to set up than Kafka)
Pros:
- ✅ Setup simplicity
- ✅ Flexible routing (exchange types)
- ✅ Monitoring UI (RabbitMQ Management)
- ✅ Low latency
Cons:
- ❌ Lower throughput than Kafka
- ❌ No persistence by default (can enable, but slower)
- ❌ Replay harder
Example (Python + aio-pika):
import aio_pika
import json

# Publisher (run inside an async function)
connection = await aio_pika.connect_robust("amqp://localhost/")
channel = await connection.channel()

# Declare a topic exchange so routing keys like "orders.*" can be used
exchange = await channel.declare_exchange(
    "events", aio_pika.ExchangeType.TOPIC, durable=True
)
await exchange.publish(
    aio_pika.Message(
        body=json.dumps({'order_id': 123}).encode(),
        delivery_mode=aio_pika.DeliveryMode.PERSISTENT
    ),
    routing_key='orders.created'
)

# Consumer
queue = await channel.declare_queue('payment-service-queue', durable=True)
await queue.bind(exchange='events', routing_key='orders.created')

async with queue.iterator() as queue_iter:
    async for message in queue_iter:
        async with message.process():
            data = json.loads(message.body.decode())
            print(f"Order: {data}")

Redis Streams: Lightweight for simple cases
When to use:
- Already using Redis
- Need low latency
- Simple use cases (pub/sub, task queue)
- Small/medium loads
Pros:
- ✅ Minimal overhead (Redis already there)
- ✅ Very low latency
- ✅ Simplicity (familiar Redis API)
- ✅ Consumer groups (like in Kafka)
Cons:
- ❌ In-memory (dataset limited by RAM)
- ❌ Weaker durability guarantees than Kafka (RDB/AOF persistence only)
- ❌ Not for high loads
Example (Python + redis-py):
import redis.asyncio as redis
import json

r = redis.from_url("redis://localhost")  # from_url is synchronous, no await needed

# Producer
await r.xadd(
    'orders:created',
    {'data': json.dumps({'order_id': 123})}
)

# Consumer
consumer_group = 'payment-service'
# Create the consumer group once (raises an error if it already exists)
await r.xgroup_create('orders:created', consumer_group, mkstream=True)

while True:
    messages = await r.xreadgroup(
        consumer_group,
        'worker-1',
        {'orders:created': '>'},
        count=10,
        block=5000  # wait up to 5s for new messages instead of busy-looping
    )
    for stream, msgs in messages:
        for msg_id, data in msgs:
            order = json.loads(data[b'data'])
            print(f"Order: {order}")
            await r.xack('orders:created', consumer_group, msg_id)

Comparison Table
| Feature | Kafka | RabbitMQ | Redis Streams |
|---|---|---|---|
| Throughput | 1M+ msg/s | 50k msg/s | 100k msg/s |
| Latency (P99) | 10-50ms | 5-10ms | 1-5ms |
| Persistence | Yes (days/months) | Optional | In-memory (loss risk) |
| Replay | Yes | Difficult | Yes (limited) |
| Complexity | High | Medium | Low |
| Scaling | Horizontal | Vertical/Clustering | Vertical |
| Use case | Event sourcing, high-load | Task queues, routing | Simple pub/sub, caching |
Recommendations based on typical use-cases:
- Kafka: E-commerce, fintech, event sourcing (Netflix, LinkedIn, Uber)
- RabbitMQ: Task queues, background jobs, webhooks (GitHub, Mozilla, Reddit)
- Redis Streams: Real-time notifications, simple pub/sub (Stack Overflow, Slack)
Event-Driven System Pitfalls
Event-Driven solves some problems and creates others. Here are pitfalls that regularly occur during implementation (based on public post-mortems and technical blogs).
Pitfall #1: Eventual Consistency (and how to live with it)
Problem: Time passes between event publication and processing.
# User registers
await db.create_user(user_id=123, email='user@example.com')
producer.send('user.registered', {'user_id': 123})

# ...10ms later...
# Another service tries to get user
user = await user_service.get_user(123)  # May return None!

Solution #1: Read Your Own Writes
# After creation return data immediately
@app.post("/users")
async def create_user(user: UserCreate):
    user_entity = await db.create_user(user)
    producer.send('user.registered', user_entity.dict())
    # Return created user
    return user_entity  # Don't wait for event processing!

Solution #2: Polling / WebSocket for UI
# Frontend polling
POST /users → { "user_id": 123, "status": "pending" }

# Frontend polls
GET /users/123 → { "user_id": 123, "status": "pending" }
...
GET /users/123 → { "user_id": 123, "status": "active" }  # Ready!
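On the backend, Solution #2 only requires a cheap status endpoint that reads the current state. A minimal FastAPI sketch (the db.get_user helper is assumed):

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/users/{user_id}")
async def get_user_status(user_id: int):
    # Cheap read of the current state; the frontend polls this endpoint
    # until consumers have flipped the status from "pending" to "active"
    user = await db.get_user(user_id)  # assumed helper
    if user is None:
        raise HTTPException(404, "User not found")
    return {"user_id": user.id, "status": user.status}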
Solution #3: Optimistic UI

// Frontend assumes success
async onSubmit() {
    // Show immediately (optimistically)
    users.add({ id: 123, email: 'user@example.com', status: 'pending' })
    // Send to backend
    await api.createUser(...)
    // WebSocket/SSE updates the status when processing completes
}

Eventual consistency is a conscious architectural choice. Users are accustomed to asynchronicity: email doesn't arrive instantly, and bank transfers can take a day. The key is showing the right UI (pending states, progress indicators). This approach is used by all major platforms: Amazon (orders), Twitter (likes), Instagram (notifications).
Pitfall #2: Event Duplication (Idempotency mandatory)
Problem: Event can be processed twice (network retry, rebalancing).
# ❌ NOT idempotent
@consumer.on('payment.received')
async def on_payment(event):
    # If event processed 2 times → credit bonus twice!
    await db.execute(
        "UPDATE users SET bonus_balance = bonus_balance + 100 WHERE id = ?",
        event['user_id']
    )

Solution: Idempotency key
# ✅ Idempotent
@consumer.on('payment.received')
async def on_payment(event):
    # Check if we already processed this event
    processed = await db.fetchval(
        "SELECT 1 FROM processed_events WHERE event_id = ?",
        event['event_id']
    )
    if processed:
        return  # Skip duplicate

    # Process in a transaction
    async with db.transaction():
        await db.execute(
            "UPDATE users SET bonus_balance = bonus_balance + 100 WHERE id = ?",
            event['user_id']
        )
        # processed_events.event_id should have a UNIQUE constraint,
        # so a concurrent duplicate fails the transaction instead of double-crediting
        await db.execute(
            "INSERT INTO processed_events (event_id, processed_at) VALUES (?, ?)",
            event['event_id'], datetime.utcnow()
        )

Alternative: Upsert operations
# PostgreSQL UPSERT (idempotent by design)
await db.execute("""
    INSERT INTO user_bonuses (user_id, payment_id, amount)
    VALUES (?, ?, 100)
    ON CONFLICT (payment_id) DO NOTHING
""", event['user_id'], event['payment_id'])

Pitfall #3: Event Ordering (Order not guaranteed)
Problem: Events can arrive in wrong order.
# Events published
producer.send('order.created', {'order_id': 123, 'status': 'pending'})
producer.send('order.paid', {'order_id': 123, 'status': 'paid'})
# Consumer may receive in reverse order!
1. order.paid (status = paid)
2. order.created (status = pending)
# Result: Order in wrong state

Solution #1: Partition key (Kafka)
# Kafka guarantees order within a partition
# Send all events for one order to the same partition
await producer.send(
    'orders',
    value={'event': 'order.created', 'order_id': 123},
    key=str(123).encode()  # Partition key = order_id
)

Solution #2: Sequence number / version
# Add a sequence number to each event
events = [
    {'order_id': 123, 'event': 'created', 'seq': 1},
    {'order_id': 123, 'event': 'paid', 'seq': 2},
    {'order_id': 123, 'event': 'shipped', 'seq': 3},
]

# Consumer checks the sequence
@consumer.on('order.*')
async def on_order_event(event):
    current_seq = await db.fetchval("SELECT seq FROM orders WHERE id = ?", event['order_id'])
    if current_seq is not None and event['seq'] <= current_seq:
        return  # Old event, ignore

    # Apply event
    await apply_event(event)

Pitfall #4: Dead Letter Queue (when event can't be processed)
Problem: Consumer fails when processing event.
@consumer.on('payment.received')
async def on_payment(event):
    # External API down
    await stripe.refund(event['payment_id'])  # Timeout!

Without DLQ: Event will retry infinitely, blocking queue.
Solution: Dead Letter Queue + retry with backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@consumer.on('payment.received')
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60)
)
async def on_payment(event):
    try:
        await stripe.refund(event['payment_id'])
    except Exception as e:
        # After 5 attempts → send to DLQ
        if should_send_to_dlq(e):
            await producer.send('dlq.payments', event)
        raise

Kafka DLQ config:
from aiokafka import AIOKafkaConsumer

consumer = AIOKafkaConsumer(
    'payments.received',
    enable_auto_commit=False,  # Manual commit
    max_poll_records=10
)

async for msg in consumer:
    try:
        await process_payment(msg.value)
        await consumer.commit()  # Success
    except Exception as e:
        # Send to DLQ
        await producer.send('dlq.payments', msg.value)
        await consumer.commit()  # Commit to not retry
        logger.error(f"Failed to process {msg.value}: {e}")

Monitoring Event-Driven Systems
In synchronous systems debugging is simple: look at the stack trace. In event-driven systems, you need dedicated tooling.
Metrics for event bus
Mandatory metrics:
from prometheus_client import Counter, Histogram, Gauge

# Event count
events_published = Counter(
    'events_published_total',
    'Total events published',
    ['topic', 'event_type']
)

# Processing latency
event_processing_duration = Histogram(
    'event_processing_duration_seconds',
    'Event processing duration',
    ['consumer', 'topic']
)

# Consumer lag (Kafka specific; see the lag-reporting sketch below)
consumer_lag = Gauge(
    'kafka_consumer_lag',
    'Consumer lag in messages',
    ['group', 'topic', 'partition']
)

# Dead letter queue size
dlq_size = Gauge(
    'dlq_messages_total',
    'Messages in DLQ',
    ['topic']
)

Usage:
@app.post("/orders")
async def create_order(order: OrderCreate):
    start_time = time.time()

    # Create order
    order_id = await db.create_order(order)

    # Publish event
    producer.send('orders.created', {'order_id': order_id})
    events_published.labels(topic='orders.created', event_type='OrderCreated').inc()

    # Latency
    event_processing_duration.labels(
        consumer='order-service',
        topic='orders.created'
    ).observe(time.time() - start_time)

    return {"order_id": order_id}
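The consumer_lag gauge declared above still has to be filled from the consumer side. A minimal sketch with aiokafka, meant to be called periodically next to the consuming loop (the label values follow the payment-service examples above):

from aiokafka import AIOKafkaConsumer

async def report_consumer_lag(consumer: AIOKafkaConsumer):
    # Lag per partition = end of the log (highwater mark) - our current position
    for tp in consumer.assignment():
        highwater = consumer.highwater(tp)       # last known end offset
        position = await consumer.position(tp)   # next offset this group will read
        if highwater is not None:
            consumer_lag.labels(
                group='payment-service',
                topic=tp.topic,
                partition=tp.partition
            ).set(highwater - position)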
Grafana dashboard for EDA

# Throughput (events/sec)
rate(events_published_total[1m])

# Consumer lag (warning if > 1000)
kafka_consumer_lag > 1000

# P99 latency
histogram_quantile(0.99, rate(event_processing_duration_seconds_bucket[5m]))

# DLQ growth rate (alert if growing)
delta(dlq_messages_total[5m]) > 10
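Metrics answer "how many and how slow"; to follow a single order across the producer and all of its consumers you also need distributed tracing (it appears again in the checklist below). A minimal sketch of propagating an OpenTelemetry trace context through Kafka message headers, assuming the OpenTelemetry SDK is already configured and an aiokafka producer/consumer as in the examples above:

from opentelemetry import trace, propagate

tracer = trace.get_tracer("checkout")

# Producer side: inject the current trace context into the message headers
async def publish_with_trace(producer, order_id: int):
    with tracer.start_as_current_span("publish orders.created"):
        carrier = {}
        propagate.inject(carrier)  # writes traceparent/tracestate into the dict
        await producer.send(
            'orders.created',
            value={'order_id': order_id},
            headers=[(k, v.encode()) for k, v in carrier.items()]
        )

# Consumer side: extract the context so the processing span joins the same trace
def consumer_span(msg):
    carrier = {k: v.decode() for k, v in (msg.headers or [])}
    ctx = propagate.extract(carrier)
    return tracer.start_as_current_span("process orders.created", context=ctx)

# Usage in a consumer loop:
#   async for msg in consumer:
#       with consumer_span(msg):
#           await process_payment(msg.value)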
Checklist: Are You Ready for Event-Driven?

Infrastructure:
- Event bus chosen and configured (Kafka/RabbitMQ/Redis)
- Monitoring and alerts on consumer lag
- Dead Letter Queue configured
- Distributed tracing (OpenTelemetry) implemented
Architecture:
- Bounded contexts defined (which events to publish)
- Idempotency strategy designed
- Eventual consistency acceptable for business
- Error handling strategy defined (retry, DLQ)
Team:
- Developers understand eventual consistency
- Experience debugging distributed systems
- Ready for monitoring complexity
Conclusions
Event-Driven Architecture is a powerful tool, but not a panacea.
When to use:
- Microservice architecture with 3+ services
- Asynchronous operations (email, reports, ML)
- Need fault tolerance
- High load (buffering via event bus)
When NOT to use:
- Simple CRUD monolith
- Team not ready for eventual consistency
- Need immediate consistency (finance)
Key patterns:
- Event Notification: minimal data in event
- Event-Carried State Transfer: all data in event (recommended for most cases)
- Event Sourcing: events = source of truth
- CQRS: read/write model separation
Technology choice:
- Kafka: high loads, event sourcing, persistence
- RabbitMQ: flexible routing, task queues, medium loads
- Redis Streams: simplicity, low latency, small loads
Must-have:
- Idempotency (protection from duplicates)
- DLQ (error handling)
- Lag and latency monitoring
- Distributed tracing
Main lesson: Start simple (Event Notification + RabbitMQ). Increase complexity as you grow. Don't implement Kafka and Event Sourcing "just in case".
Useful materials for further study:
Company blogs with EDA:
- Netflix Tech Blog — streaming platform architecture
- Uber Engineering — event-driven systems in real-time
- Stripe Blog — idempotency and reliability
- AWS Architecture Blog — event system patterns
Books:
- "Building Event-Driven Microservices" by Adam Bellemare (O'Reilly, 2020)
- "Designing Data-Intensive Applications" by Martin Kleppmann
Related articles on site:
- Monolith → Microservices — how to split systems
- Kafka + FastAPI — practical Kafka guide
- Monitoring Stack — observability for distributed systems
Need help with your architecture? I do architecture reviews and consultations on adopting an event-driven approach. Write me an email and let's discuss your case.
Subscribe to updates on Telegram, where I write about architecture, Python, and modern development practices. No fluff, only applicable knowledge.

