"We split the monolith into 47 microservices. Now deployment takes 4 hours instead of 20 minutes, the team hates me, and the CEO asks when 'this experiment' will end."
A real quote from a CTO who reached out to me in early 2024. Eight months of migration, a $200k budget, and now they wanted to go back to the monolith.
Spoiler: We didn't go back. We fixed the architecture in 6 weeks, killed 32 out of 47 services, and got a system that works better than the monolith. But the cost of mistakes was high.
In this article — honest talk about migrating from monolith to microservices. No evangelism, no blind faith in "Netflix does it this way, so should we." Just practice, pitfalls, and real numbers.
The Truth About Microservices They Don't Tell at Conferences
Let's start with an uncomfortable truth: microservices aren't an evolution of monoliths. They're a different class of problems.
A monolith is like living in a studio apartment. Cramped, everything at hand, cleaning takes an hour.
Microservices are like managing an apartment building. Each apartment is independent, but now you have problems with heating, electricity, plumbing, and tenants complaining about each other.
Real Story: "We Became Like Google!"
A Django startup: 12-person team, 50k DAU, a stable monolith. At a conference, the CTO heard a talk about microservices and came back inspired.
The Plan:
- Split monolith into 15 microservices
- Implement Kubernetes
- Service mesh (Istio)
- Event-driven architecture (Kafka)
- "Become like Netflix"
Reality After 6 Months:
- Deployment grew from 15 minutes to 2 hours
- New features take 3 weeks instead of 1 (need to coordinate 5 teams)
- Debugging is a nightmare (request goes through 7 services, unclear where it fails)
- Infrastructure costs tripled
- 3 senior developers quit
Outcome: After 9 months they returned to a modular monolith. Lost $400k and their best people.
Microservices aren't a silver bullet. They're trading one set of problems (monolith complexity) for another (distributed system complexity). Make sure you're ready to pay this price.
When It's Really Time to Split: No-BS Checklist
After 15 migrations (in both directions), I came up with a simple rule: you need microservices when the pain of the monolith costs more than the pain of microservices.
Signs That the Monolith Is Choking You
1. Deploy Bottleneck
Symptom: Deployment takes > 30 minutes and blocks the entire team
Example: 5 teams queue for deployment on Friday evening
Cost: Developer time + risk of conflicts + slow time-to-market
Real Case: SaaS platform, 80 developers. Monolith deployment — 45 minutes. Each team could release once a week. Merge conflicts — weekly.
After splitting into 8 services: Each team deploys independently, 10-15 times a day. Time-to-market dropped from a week to a day.
2. Scaling Hell
Symptom: One endpoint eats 90% of resources, but you have to scale the entire monolith
Example: Mobile API generates PDF reports (CPU-intensive)
The other 50 endpoints are simple, but they scale together with the PDF generator
Cost: $5k/month on servers instead of $1k
Math:
Monolith: 10 instances × 8GB RAM × $100/month = $1000
(8 instances needed only for PDF generator)
Microservices: API (2 instances × 2GB × $50) + PDF Service (8 × 8GB × $100) = $900
Savings: $100/month (or $1200/year)
3. Team Collision
Symptom: 3+ teams work in one repository and interfere with each other
Example: Team A changes auth, Team B breaks orders, Team C debugs all Friday
Cost: Merge conflicts + slow PR reviews + stress
Sign: If you do daily sync meetings between teams to "not step on toes" — you need boundaries.
4. Technology Lock-in
Symptom: Want to try a new language/framework, but it requires rewriting the entire monolith
Example: Monolith on Django 2.2, want async FastAPI for WebSocket
Migrating entire monolith = 6 months
Cost: Missed opportunities + technical debt
Microservices Readiness Checklist
Answer honestly:
- Do you have 3+ teams working in the monolith?
- Does deployment take > 30 minutes?
- Do different parts of the system have different scaling requirements?
- Do you have a dedicated DevOps/Platform team?
- Are you ready to implement distributed tracing and centralized logging?
- Can you hire senior developers without trouble (microservices demand them)?
- Does the business understand that migration will take 6-12 months?
If you answered "YES" to 5+ questions — microservices make sense.
If less than 5 — try a modular monolith first.
Modular Monolith as an Intermediate Step
What it is: A monolith divided into independent modules with clear boundaries.
monolith/
├── modules/
│ ├── auth/ # Separate module
│ │ ├── api/
│ │ ├── models/
│ │ └── services/
│ ├── orders/ # Separate module
│ │ ├── api/
│ │ ├── models/
│ │ └── services/
│ └── payments/ # Separate module
│ └── ...
└── shared/ # Shared code
Rules:
- Modules communicate only through public APIs (no direct imports)
- Each module can be extracted into a service in a week
- Shared code is minimal and stable
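A minimal sketch of the first rule in Python (module and function names are illustrative): each module re-exports its public surface from its package root, and everyone else imports only from that root.

# modules/orders/__init__.py: the orders module's public API (illustrative names)
from .services import create_order, get_order

__all__ = ["create_order", "get_order"]

# modules/payments/services.py: a consumer of that API
# ✅ allowed: go through the public API of the orders module
from modules.orders import create_order

# ❌ forbidden: reaching into another module's internals
# from modules.orders.models import Order

A tool like import-linter can turn this convention into a CI check, so a forbidden import fails the build instead of quietly creating coupling.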
Advantages:
- ✅ Deployment still simple (one service)
- ✅ Debugging simple (one process)
- ✅ Teams work independently (different modules)
- ✅ Ready for migration (boundaries already exist)
80% of companies I worked with solved their problems with a modular monolith. Only 20% needed microservices. Don't overestimate the complexity of your problems.
Strangler Fig Pattern: How to Migrate Without Downtime
Strangler Fig — a pattern of gradual migration. The new system grows around the old one, gradually replacing its parts. Like a fig tree wraps around an old tree and eventually replaces it.
Why Not Big Bang Rewrite
Big Bang Rewrite — when you stop development and rewrite everything from scratch.
Problems:
- 📉 6-12 months without new features → business loses money
- 🐛 You'll forget edge cases from the old system → bugs in production
- 😰 Team burns out → resignations
- 💸 Risk of project failure → millions wasted
Known Failures:
- Netscape (1998) — rewrote browser from scratch, lost the market
- Knight Capital (2012) — new system launched with a bug, lost $440M in 45 minutes
Strangler Fig in Practice
Idea: The new system sits alongside the old one. Gradually switch traffic from monolith to microservices. When the monolith is empty — turn it off.
Migration Architecture
┌─────────────────────────┐
│ API Gateway / Proxy │
│ (Nginx/Envoy) │
└───────────┬─────────────┘
│
┌───────────────┴───────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌──────────────────┐
│ Monolith (Django)│ │ Microservices │
│ │ │ │
│ /api/orders ✅ │ │ Auth Service ✅ │
│ /api/users ✅ │ │ Orders Service │
│ /api/auth ❌ │◄─────────│ (in progress) │
│ │ reads │ │
└────────┬───────────┘ data └─────────┬────────┘
│ │
└────────────┬────────────────────┘
▼
┌───────────────┐
│ PostgreSQL │
│ (shared DB) │
└───────────────┘
Stages:
- Proxy in front — all traffic goes through API Gateway
- First service — extract the simplest/most isolated module
- Switch traffic — change routing in proxy (without code changes)
- Monitor — watch metrics, latency, errors
- Roll back if issues — switch back in 5 seconds
- Repeat for next module
Example: Extracting Auth Service
Step 1: Duplicate Functionality
# New Auth Service (FastAPI)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sqlalchemy import select

from database import async_session  # same connection settings as the monolith
from models import User             # maps to the monolith's users table
from security import verify_password, create_jwt_token  # the service's own helpers (illustrative module path)

app = FastAPI()

class LoginRequest(BaseModel):
    email: str
    password: str

@app.post("/api/auth/login")
async def login(credentials: LoginRequest):
    async with async_session() as db:
        # Read from the SHARED DB (the same one the monolith uses)
        result = await db.execute(
            select(User).where(User.email == credentials.email)
        )
        user = result.scalar_one_or_none()
        if not user or not verify_password(credentials.password, user.password):
            raise HTTPException(401, "Invalid credentials")
        token = create_jwt_token(user.id)
        return {"access_token": token}

Step 2: Configure Routing (Nginx)
upstream auth_service {
server auth-service:8000;
}
upstream monolith {
server django-app:8000;
}
# Canary: send ~10% of traffic to the new service, keyed by request id
split_clients "${request_id}" $auth_backend {
    10%     auth_service;
    *       monolith;
}

server {
    listen 80;

    # /api/auth/ goes to the canary backend chosen above
    location /api/auth/ {
        proxy_pass http://$auth_backend;
    }

    # Everything else to monolith
    location / {
        proxy_pass http://monolith;
    }
}

Step 3: Gradual Traffic Increase
Day 1-3: 10% traffic → Auth Service
Day 4-7: 50% traffic → Auth Service
Day 8-10: 100% traffic → Auth Service
Step 4: Monitor Metrics
# Add metrics to Auth Service
from prometheus_client import Counter, Histogram
auth_requests = Counter('auth_requests_total', 'Total auth requests', ['status'])
auth_latency = Histogram('auth_request_duration_seconds', 'Auth latency')
@app.post("/api/auth/login")
@auth_latency.time()
async def login(credentials: LoginRequest):
try:
# ... logic
auth_requests.labels(status='success').inc()
return response
except Exception as e:
auth_requests.labels(status='error').inc()
        raise

Step 5: Compare with Monolith
# Grafana Dashboard
# Monolith vs microservice latency
histogram_quantile(0.95,
rate(django_http_request_duration_seconds_bucket{endpoint="/api/auth/login"}[5m])
)
vs
histogram_quantile(0.95,
rate(auth_request_duration_seconds_bucket[5m])
)
# Error rate
rate(django_http_errors_total{endpoint="/api/auth/login"}[5m])
vs
rate(auth_requests_total{status="error"}[5m])

Step 6: Remove Code from Monolith
After 2 weeks of stable Auth Service at 100% traffic:
# Django monolith - delete auth views
# git rm apps/auth/views.py
# git rm apps/auth/serializers.py
# git commit -m "Remove auth - migrated to auth-service"

Strangler Fig Pitfalls
Pitfall #1: Shared Database
During migration, you have a shared DB. This creates coupling.
# ❌ Bad: Auth Service changes schema
ALTER TABLE users ADD COLUMN last_login_ip VARCHAR(15);
# Monolith crashes: Unknown column 'last_login_ip'

Solution: Database View Pattern
-- Auth Service works through VIEW
CREATE VIEW auth_users AS
SELECT id, email, password_hash, created_at
FROM users;
GRANT SELECT ON auth_users TO auth_service;
-- Monolith continues working with table
-- Auth Service works with view
-- Schema migration doesn't break monolith

Pitfall #2: Transactions Between Services
# ❌ This WON'T work
def create_order(user_id, items):
with transaction.atomic():
# Call Auth Service
user = auth_service.get_user(user_id) # HTTP request
# Create order in monolith
order = Order.objects.create(user=user)
        # If error here → rollback won't affect Auth Service!

Solution: Saga Pattern or eventual consistency
# ✅ Event-driven approach
def create_order(user_id, items):
# 1. Create order in PENDING status
order = Order.objects.create(user_id=user_id, status='PENDING')
# 2. Publish event to queue
event_bus.publish('order.created', {
'order_id': order.id,
'user_id': user_id
})
# 3. Auth Service listens to events and updates stats
    # 4. If something goes wrong → compensating transaction

Distributed transactions don't work in microservices. Accept eventual consistency as reality. Use Saga, event sourcing, or live with data duplication.
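The compensating side is worth sketching. Assuming the same RabbitMQ broker as the order events (queue name, URL, and the cancel_order helper are illustrative, not the project's actual code): if payment later fails, the Orders side consumes the failure event and cancels the PENDING order instead of expecting a cross-service rollback.

# Compensating transaction sketch with aio_pika (illustrative names)
import asyncio
import json

import aio_pika


async def cancel_order(order_id: int) -> None:
    """Illustrative helper: set the order's status to CANCELLED in the Orders DB."""
    ...


async def consume_payment_failures() -> None:
    connection = await aio_pika.connect_robust("amqp://rabbitmq/")
    channel = await connection.channel()
    queue = await channel.declare_queue("payments.failed", durable=True)

    async with queue.iterator() as messages:
        async for message in messages:
            async with message.process(requeue=True):  # ack on success, requeue on error
                event = json.loads(message.body)
                await cancel_order(event["order_id"])  # the compensating action


if __name__ == "__main__":
    asyncio.run(consume_payment_failures())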
Distributed Tracing from Day One
The main pain of microservices: "Where did the request fail if it went through 7 services?"
In monolith: look at stack trace, see the entire call chain.
In microservices: look at logs of 7 services, try to find request by timestamp. Good luck.
OpenTelemetry + Jaeger: Must-Have from First Service
OpenTelemetry — standard for distributed tracing. Jaeger — UI for viewing traces.
Tracing Architecture
Request ID: 7f3a9c12-4e8d-4f2a-a1b3-8d7e9f2c1a4b
API Gateway (span: 250ms)
↓
Auth Service (span: 45ms)
├─ DB query (span: 12ms)
└─ Redis cache (span: 3ms)
↓
Order Service (span: 180ms)
├─ DB query (span: 50ms)
├─ HTTP → Payment Service (span: 120ms)
│ ├─ DB query (span: 15ms)
│ └─ HTTP → Stripe API (span: 95ms) ← Culprit!
└─ Kafka publish (span: 8ms)
One look at Jaeger UI — and you see Stripe is slowing down.
OpenTelemetry Setup in Python
Installation:
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-instrumentation-fastapi \
opentelemetry-exporter-jaeger

Code (FastAPI):
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Export to Jaeger
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
app = FastAPI()
# Automatic instrumentation
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
RedisInstrumentor().instrument()
# Custom spans
@app.post("/orders")
async def create_order(order: OrderCreate):
with tracer.start_as_current_span("create_order"):
# Span automatically includes:
# - request_id
# - http.method, http.url, http.status_code
# - duration
with tracer.start_as_current_span("validate_user"):
user = await auth_service.get_user(order.user_id)
with tracer.start_as_current_span("process_payment"):
payment = await payment_service.charge(order.total)
with tracer.start_as_current_span("save_to_db"):
result = await db.execute(insert(Order).values(**order.dict()))
return {"order_id": result.inserted_primary_key[0]}Cross-service trace ID propagation:
import httpx
from opentelemetry.propagate import inject
async def call_payment_service(amount: float):
headers = {}
# Inject trace context into HTTP headers
inject(headers)
async with httpx.AsyncClient() as client:
response = await client.post(
"http://payment-service/charge",
json={"amount": amount},
headers=headers # trace_id is passed forward
)
        return response.json()

Receiving side (Payment Service):
from fastapi import Request  # needed to read the raw headers
from opentelemetry.propagate import extract

@app.post("/charge")
async def charge(request: Request, data: ChargeRequest):
    # Extract trace context from headers
    context = extract(request.headers)
    # Span automatically links to parent
    with tracer.start_as_current_span("charge_payment", context=context):
        # ... payment logic
        pass

Docker Compose for Jaeger
services:
jaeger:
image: jaegertracing/all-in-one:1.52
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # Jaeger UI
- "6831:6831/udp" # Receive traces
- "4317:4317" # OTLP gRPC
networks:
- monitoring
auth-service:
build: ./auth-service
environment:
- JAEGER_AGENT_HOST=jaeger
- JAEGER_AGENT_PORT=6831
networks:
- monitoring
order-service:
build: ./order-service
environment:
- JAEGER_AGENT_HOST=jaeger
- JAEGER_AGENT_PORT=6831
networks:
- monitoring
networks:
monitoring:
    driver: bridge

What Distributed Tracing Shows
Scenario 1: Slow Request
Request to /api/orders took 2.5 seconds. Why?
Jaeger shows:
┌─ API Gateway: 2500ms
│ ├─ Auth Service: 50ms ✅
│ └─ Order Service: 2400ms ⚠️
│ ├─ DB query: 2300ms ❌ ← Problem here!
│ └─ Kafka publish: 80ms ✅
Solution: Add index on orders.user_id
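The fix itself is a one-line migration. A sketch, assuming the project manages its schema with Alembic (revision id and index name are illustrative):

# Alembic migration: add the missing index on orders.user_id
from alembic import op

revision = "add_orders_user_id_idx"  # illustrative revision id
down_revision = None                 # point this at the previous revision in a real project


def upgrade() -> None:
    op.create_index("ix_orders_user_id", "orders", ["user_id"])


def downgrade() -> None:
    op.drop_index("ix_orders_user_id", table_name="orders")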
Scenario 2: Cascading Failures
Order Service returns 500. What happened?
Jaeger shows:
┌─ Order Service: 500 Internal Server Error
│ └─ Payment Service: timeout after 30s ❌
│ └─ Stripe API: no response ❌
Solution: Stripe is down. Add circuit breaker.
Scenario 3: N+1 Problem in Distributed System
Request to /api/orders?user_id=123 is slow
Jaeger shows:
┌─ Order Service: 3200ms
│ ├─ DB query (orders): 50ms ✅
│ ├─ HTTP → Product Service: 150ms (×20 times!) ❌
│ │ └─ DB query: 5ms ×20
│
Solution: Batch requests to Product Service
Distributed tracing saves hours of debugging. Implement it BEFORE you launch the second microservice. Retrospective implementation is painful.
What to Do with Shared Code: 4 Strategies
The most painful question of microservices: "We have shared code (models, utilities, validation). What to do with it?"
Strategy 1: Shared Library
Idea: Common code in a separate package. Each service includes it as a dependency.
shared-lib/
├── models/
│ ├── user.py
│ └── order.py
├── utils/
│ ├── validators.py
│ └── formatters.py
└── setup.py
# Publish to private PyPI or Artifactory
Usage:
# requirements.txt for each service
company-shared-lib==1.2.3
# In code
from company_shared.models import User
from company_shared.utils import validate_email

Pros:
- ✅ DRY (Don't Repeat Yourself)
- ✅ Versioning (can rollback)
- ✅ Single place for changes
Cons:
- ❌ Coupling between services (all depend on one lib)
- ❌ Updating lib = deploying all services
- ❌ Breaking changes = nightmare
Pitfall:
# Someone updated shared-lib from 1.2.3 to 2.0.0
# Breaking change: User.full_name → User.get_full_name()
# Auth Service updated → works ✅
# Order Service didn't update → crashes ❌

Solution: Semantic versioning + deprecation warnings.
# shared-lib 1.3.0 (transitional release)
import warnings

class User:
    def full_name(self):
        warnings.warn("Use get_full_name() instead", DeprecationWarning)
        return self.get_full_name()

    def get_full_name(self):
        return f"{self.first_name} {self.last_name}"

# shared-lib 2.0.0 (breaking release)
# Remove full_name(), keep only get_full_name()

When to use:
- Stable code (changes rarely)
- Utilities, constants, formatters
- Pydantic/Protobuf schemas for API contracts
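For that last case, the shared library carries nothing but the contract models. A sketch (package and field names are illustrative):

# company_shared/contracts/orders.py: a versioned API/event contract (illustrative)
from datetime import datetime
from decimal import Decimal

from pydantic import BaseModel, Field


class OrderCreatedEvent(BaseModel):
    """Published by Orders Service, consumed by other services."""
    order_id: int
    user_id: int
    total: Decimal = Field(ge=0)
    created_at: datetime

Both producer and consumers pin a version of this package, so a contract change becomes an explicit dependency bump instead of silent drift.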
Strategy 2: Code Generation
Idea: Store schemas in one place (OpenAPI, Protobuf), generate clients for each language.
api-schemas/
├── openapi/
│ ├── auth.yaml
│ └── orders.yaml
└── generate.sh
# generate.sh
openapi-generator generate \
-i openapi/auth.yaml \
-g python \
-o clients/python/auth
openapi-generator generate \
-i openapi/auth.yaml \
-g typescript-fetch \
-o clients/typescript/auth
Result:
# Auth Service (FastAPI)
# API automatically generates OpenAPI schema
# Order Service uses generated client
from auth_client import AuthApi, Configuration
config = Configuration(host="http://auth-service")
api = AuthApi(config)
user = api.get_user(user_id=123)
print(user.email)  # Type-safe!

Pros:
- ✅ Type safety (IDE suggests methods)
- ✅ Automatic validation
- ✅ Support for different languages (Python, Go, TypeScript)
- ✅ API schema = source of truth
Cons:
- ❌ Additional step in CI/CD
- ❌ Generated code sometimes awkward
- ❌ Need to synchronize schemas
When to use:
- Polyglot microservices (Python + Go + Node.js)
- Strict API contracts
- External API for partners
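The schema that feeds openapi-generator doesn't have to be written by hand: FastAPI already produces it. A small export script (file paths and the app import are illustrative; openapi-generator accepts JSON as well as YAML):

# Dump the FastAPI-generated OpenAPI schema for openapi-generator
import json

from auth_service.main import app  # the Auth Service FastAPI instance

with open("openapi/auth.json", "w") as f:
    json.dump(app.openapi(), f, indent=2)

Running this in CI keeps the published schema from drifting away from the code.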
Strategy 3: Duplication
Idea: Copy code to each service. Yes, seriously.
auth-service/
└── utils/
└── validators.py # Copy
order-service/
└── utils/
└── validators.py # Same copy
payment-service/
└── utils/
└── validators.py # Same copy
Pros:
- ✅ Complete service independence
- ✅ No coupling
- ✅ Can change without risk of breaking other services
Cons:
- ❌ Violates DRY
- ❌ Bug needs fixing in 10 places
- ❌ Version divergence
When to use:
- Simple code (10-50 lines of utilities)
- Code that rarely changes
- When coupling is more expensive than duplication
"Duplication is far cheaper than the wrong abstraction" — Sandi Metz. Sometimes copying 20 lines of code to 5 services is easier than maintaining a shared library.
Strategy 4: Service as Source of Truth
Idea: No shared code. Services communicate only through APIs.
# ❌ Bad: Order Service imports User from shared-lib
from shared.models import User
def create_order(user_id):
user = User.objects.get(id=user_id) # Direct DB access!
order = Order.create(user=user)
# ✅ Good: Order Service calls Auth Service API
async def create_order(user_id):
user = await auth_service_client.get_user(user_id) # HTTP call
    order = await Order.create(user_id=user_id, user_email=user.email)

Pros:
- ✅ No coupling
- ✅ Each service owns its data
- ✅ Easy to change implementation inside service
Cons:
- ❌ Network latency
- ❌ Need fallback if service unavailable
- ❌ Eventual consistency
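The fallback point deserves an example. A minimal sketch (URL and the in-process cache are illustrative): if the owning service is down, degrade to the last known value instead of failing the whole request.

# Graceful degradation when Auth Service is unavailable
import httpx

_user_cache: dict[int, dict] = {}  # last known values, keyed by user id


async def get_user_with_fallback(user_id: int) -> dict | None:
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            response = await client.get(f"http://auth-service/api/users/{user_id}")
            response.raise_for_status()
            user = response.json()
            _user_cache[user_id] = user  # refresh the fallback value
            return user
    except httpx.HTTPError:
        # Stale data (or None) beats a 500 on most read paths
        return _user_cache.get(user_id)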
When to use:
- High isolation is critical
- Services in different languages
- Different teams own services
What to Choose: Decision Tree
Shared code:
├─ Stable, changes rarely?
│ ├─ Yes → Shared Library
│ └─ No → Duplication or Service API
├─ Need type safety?
│ └─ Yes → Code Generation
├─ Polyglot microservices?
│ └─ Yes → Code Generation or Service API
└─ Want independence at any cost?
└─ Yes → Service API
My Choice in 2025:
- Pydantic models for API contracts → Shared Library (versioning)
- Utilities (formatters, validators) → Duplication
- Business logic → Service API (each service owns its domain logic)
Real-World Case: E-commerce Migration with Metrics
Company: E-commerce platform (B2C)
Before migration: Django monolith, 120k lines of code, 35 developers
Problem: 45-minute deployment, 5 teams interfering with each other, expensive scaling
Initial State
Architecture:
Django Monolith (10 instances × 8GB RAM)
├── Auth (5% CPU)
├── Catalog (15% CPU)
├── Orders (20% CPU)
├── Payments (10% CPU)
├── Recommendations (40% CPU) ← ML model, CPU-intensive
└── Admin Panel (10% CPU)
PostgreSQL (master + 2 read replicas)
Redis (cache + sessions)
Celery (background tasks)
Metrics (before):
| Metric | Value |
|---|---|
| Deployment | 45 minutes |
| Deploys/week | 2-3 times |
| Instances | 10 × c5.2xlarge ($340/month) |
| Infrastructure | $3400/month |
| P95 latency | 350ms |
| Uptime | 99.5% (3.6 hours downtime/month) |
| Time to market | 2-3 weeks |
Pain:
- Recommendations (ML) required 40% CPU, but had to scale entire monolith
- 5 teams working in one repository → merge conflicts
- Deployment blocked everyone → queue on Friday evening
- Changes in Auth broke Orders (coupling)
Migration Plan (6 Months)
Stage 1: Preparation (month 1)
- ✅ Set up OpenTelemetry + Jaeger
- ✅ Implemented feature flags (LaunchDarkly)
- ✅ Set up API Gateway (Kong)
- ✅ Created shared library for models
- ✅ Split teams by domains
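The feature flags from this list are what let the application itself fall back to the old code path while the gateway canary runs. A sketch with a hypothetical flag client (the team used LaunchDarkly, but any provider fits the same shape; module names are illustrative):

# Flag-gated switch between the in-process path and the new service
import httpx

from app.flags import flags                      # hypothetical feature-flag client
from app.recommendations import recommend_local  # the old in-process implementation


async def get_recommendations(user_id: int):
    if flags.is_enabled("use-recommendations-service", user_id=user_id):
        async with httpx.AsyncClient(timeout=1.0) as client:
            resp = await client.get(
                f"http://recommendations-service/recommendations/{user_id}"
            )
            resp.raise_for_status()
            return resp.json()["products"]
    # Flag off (or kill switch pulled): use the monolith's code path
    return recommend_local(user_id)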
Stage 2: First Service - Recommendations (month 2)
Why first: Isolated, CPU-intensive, not critical for business.
# Recommendations Service (FastAPI + ML model)
from fastapi import FastAPI
from ml_model import RecommendationModel
app = FastAPI()
model = RecommendationModel.load()
@app.get("/recommendations/{user_id}")
async def get_recommendations(user_id: int, limit: int = 10):
# ML inference
products = await model.predict(user_id, limit=limit)
return {"products": products}Result:
Savings: $600/month (GPU instances cheaper for ML than inflating monolith)
Stage 3: Auth Service (month 3)
Why second: Critical but simple. Clear boundaries.
# Auth Service (FastAPI + JWT)
from fastapi import FastAPI, Depends, HTTPException
from fastapi_jwt_auth import AuthJWT
app = FastAPI()
@app.post("/auth/login")
async def login(credentials: LoginRequest):
user = await authenticate(credentials)
access_token = create_access_token(user.id)
return {"access_token": access_token}
@app.get("/auth/me")
async def get_current_user(Authorize: AuthJWT = Depends()):
Authorize.jwt_required()
user_id = Authorize.get_jwt_subject()
user = await get_user_by_id(user_id)
    return user

Result:
- Auth deployment: 8 minutes (was 45)
- Auth team can release 5-10 times a day
- Monolith became lighter (removed 15k lines)
Stage 4: Orders Service (month 4)
Challenges: Transactions with Payments, events for other services.
# Orders Service (FastAPI + Event-driven)
from fastapi import FastAPI
import json
import aio_pika  # RabbitMQ client
app = FastAPI()
@app.post("/orders")
async def create_order(order: OrderCreate):
# 1. Create order
order_entity = await db.create_order(order)
# 2. Publish event
connection = await aio_pika.connect_robust("amqp://rabbitmq/")
channel = await connection.channel()
await channel.default_exchange.publish(
aio_pika.Message(
body=json.dumps({
"order_id": order_entity.id,
"user_id": order.user_id,
"total": order.total
}).encode()
),
routing_key="orders.created"
)
    return order_entity

Stage 5: Payments Service (month 5)
Integration: Stripe, PayPal, internal wallet.
# Payments Service
@app.post("/payments/charge")
async def charge(payment: PaymentRequest):
# Listen to "orders.created" events
# Charge money
# Publish "payments.completed"
    pass

Stage 6: Catalog Service (month 6)
Feature: Read-heavy (80% GET requests).
# Catalog Service with aggressive caching
from fastapi import FastAPI
from fastapi_cache import FastAPICache
from fastapi_cache.decorator import cache
from fastapi_cache.backends.redis import RedisBackend
from redis import asyncio as aioredis  # redis-py asyncio client

app = FastAPI()

@app.on_event("startup")
async def startup():
    redis = aioredis.from_url("redis://localhost")
    FastAPICache.init(RedisBackend(redis), prefix="catalog:")

@app.get("/products/{product_id}")
@cache(expire=3600)  # 1 hour
async def get_product(product_id: int):
    return await db.get_product(product_id)

Final State
Architecture (after):
API Gateway (Kong)
├── Auth Service (2 instances)
├── Catalog Service (3 instances + Redis)
├── Orders Service (4 instances)
├── Payments Service (2 instances)
├── Recommendations Service (4 GPU instances)
└── Django Monolith (Admin Panel only, 2 instances)
Event Bus (RabbitMQ)
Distributed Tracing (Jaeger)
Centralized Logging (Loki)
Metrics (after):
| Metric | Before | After | Change |
|---|---|---|---|
| Deployment | 45 min | 5-12 min | -73% |
| Deploys/week | 2-3 | 40-50 | +1500% |
| Infrastructure | $3400/month | $2100/month | -38% |
| P95 latency | 350ms | 180ms | -49% |
| Uptime | 99.5% | 99.9% | +0.4% |
| Time to market | 2-3 weeks | 3-5 days | -70% |
Savings: $1300/month × 12 = $15,600/year on infrastructure alone.
Productivity gain: Teams release 20x more frequently → features to market 5x faster → ROI priceless.
Pitfalls We Hit
Pitfall #1: Forgot About N+1 in Microservices
# ❌ Bad: N+1 HTTP requests
async def get_order_details(order_id):
order = await db.get_order(order_id)
# Make HTTP request for each item!
for item in order.items:
product = await catalog_service.get_product(item.product_id)
item.product_name = product.name
    # 10 items = 10 HTTP requests × 50ms = 500ms latency!

Solution: Batch API
# ✅ Good: one request for all products
async def get_order_details(order_id):
order = await db.get_order(order_id)
product_ids = [item.product_id for item in order.items]
products = await catalog_service.get_products_batch(product_ids) # One request!
products_map = {p.id: p for p in products}
for item in order.items:
        item.product_name = products_map[item.product_id].name

Pitfall #2: Distributed Monolith
After 3 months we had 15 services, but they ALL called each other synchronously. This is a distributed monolith, not microservices.
Solution: Event-driven architecture. Services communicate through events, not HTTP.
Pitfall #3: No Circuit Breaker
Payments Service crashed → Orders Service waited 30s timeout on each request → entire site went down.
Solution: Circuit Breaker Pattern.
import httpx
from circuitbreaker import circuit
@circuit(failure_threshold=5, recovery_timeout=60)
async def call_payment_service(amount):
async with httpx.AsyncClient() as client:
response = await client.post("http://payment-service/charge", ...)
return response.json()
# After 5 failures → circuit open → fast fail (don't wait 30s)

Anti-patterns and How to Avoid Them
Anti-pattern #1: Microservice Per Table
Bad:
User Service (users table)
Order Service (orders table)
Product Service (products table)
Cart Service (cart_items table)
...
Why bad: Creating an order = 5 HTTP requests between services. Transactions impossible.
Right: Services by business domains.
Auth & Users Service (all auth logic)
Catalog Service (products, categories, search)
Order Management Service (orders, cart, checkout)
Anti-pattern #2: Shared Database
Bad: All services write to one PostgreSQL.
Why bad:
- Coupling at DB schema level
- One service changes schema → another crashes
- Scaling problematic
Right: Database per service (or at least schema per service).
-- Auth Service
CREATE SCHEMA auth;
CREATE TABLE auth.users (...);
-- Order Service
CREATE SCHEMA orders;
CREATE TABLE orders.orders (...);

Anti-pattern #3: No API Gateway
Bad: Frontend calls 10 microservices directly.
Why bad:
- CORS on each service
- Auth on each service
- Frontend knows internal topology
- Can't change routing without changing frontend
Right: API Gateway (Kong, Nginx, Envoy, AWS API Gateway).
# Kong routing
/api/auth/* → Auth Service
/api/products/* → Catalog Service
/api/orders/* → Orders Service

Anti-pattern #4: Distributed Monolith
Signs:
- All services synchronously call each other
- Can't deploy one service without others
- Changing API of one service → changes in all others
Right: Loose coupling through events.
Checklist Before Starting Migration
Infrastructure:
- API Gateway configured
- Service mesh (optional but desirable)
- Distributed tracing (OpenTelemetry + Jaeger)
- Centralized logging (Loki/ELK)
- Metrics & Monitoring (Prometheus + Grafana)
- CI/CD for each service
- Container registry (Docker Hub, ECR, GCR)
Architecture:
- Defined service boundaries (DDD, bounded contexts)
- Planned strategy for shared code
- Chose pattern for eventual consistency (Saga, events)
- Designed API contracts (OpenAPI, gRPC)
- Planned data migration strategy
Team:
- Everyone understands why we're migrating
- Have dedicated DevOps/Platform team
- Developers understand distributed systems
- Have owners for each service
Business:
- Business understands migration will take 6-12 months
- Have budget for additional infrastructure
- Ready for temporary slowdown in features
Conclusions
Microservices aren't about technology. They're about people, processes, and business goals.
When NOT to use microservices:
- Team < 10 people
- Startup searching for product-market fit
- No problems with deploy frequency
- Monolith handles the load
When to use:
- 3+ teams working in monolith
- Deploy bottleneck (> 30 minutes)
- Different parts of system require different scaling
- Have DevOps/Platform team
Strangler Fig Pattern:
- Migration without downtime
- Gradual traffic switching
- Can rollback in seconds
- Minimize risks
Distributed Tracing:
- Must-have from first microservice
- Saves hours of debugging
- OpenTelemetry + Jaeger — standard
Shared Code:
- Stable code → Shared Library
- API contracts → Code Generation
- Utilities → Duplication (sometimes it's OK)
- Business logic → Service API
Main lesson: Don't do microservices because it's trendy. Do it because the monolith became your pain. And do it gradually, measuring every step.
Need help with architecture? I conduct architecture reviews and help teams make the right decisions. Drop me an email and let's discuss your case.
Useful materials:
- Monitoring Stack 2025 — how to monitor microservices
- Load Balancers — load distribution
- Technical Debt Metrics — how to measure architectural debt
