"We split the monolith into 47 microservices. Now deployment takes 4 hours instead of 20 minutes, the team hates me, and the CEO asks when 'this experiment' will end."
A real quote from a CTO who reached out to me in early 2024. Eight months of migration, a $200k budget, and now they wanted to go back to the monolith.
Spoiler: We didn't go back. We fixed the architecture in 6 weeks, killed 32 out of 47 services, and got a system that works better than the monolith. But the cost of mistakes was high.
In this article — honest talk about migrating from monolith to microservices. No evangelism, no blind faith in "Netflix does it this way, so should we." Just practice, pitfalls, and real numbers.
The Truth About Microservices They Don't Tell at Conferences
Let's start with an uncomfortable truth: microservices aren't an evolution of monoliths. They're a different class of problems.
A monolith is like living in a studio apartment. Cramped, everything at hand, cleaning takes an hour.
Microservices are like managing an apartment building. Each apartment is independent, but now you have problems with heating, electricity, plumbing, and tenants complaining about each other.
Real Story: "We Became Like Google!"
A Django startup: 12-person team, 50k DAU, a stable monolith. At a conference, the CTO heard a talk about microservices and came back inspired.
The Plan:
- Split monolith into 15 microservices
- Implement Kubernetes
- Service mesh (Istio)
- Event-driven architecture (Kafka)
- "Become like Netflix"
Reality After 6 Months:
- Deployment grew from 15 minutes to 2 hours
- New features take 3 weeks instead of 1 (need to coordinate 5 teams)
- Debugging is a nightmare (request goes through 7 services, unclear where it fails)
- Infrastructure costs tripled
- 3 senior developers quit
Outcome: After 9 months they returned to a modular monolith. Lost $400k and their best people.
Microservices aren't a silver bullet. They're trading one set of problems (monolith complexity) for another (distributed system complexity). Make sure you're ready to pay this price.
When It's Really Time to Split: No-BS Checklist
After 15 migrations (in both directions), I came up with a simple rule: you need microservices when the pain of the monolith costs more than the pain of microservices.
Signs That the Monolith Is Choking You
1. Deploy Bottleneck
Symptom: Deployment takes > 30 minutes and blocks the entire team
Example: 5 teams queue for deployment on Friday evening
Cost: Developer time + risk of conflicts + slow time-to-market
Real Case: SaaS platform, 80 developers. Monolith deployment — 45 minutes. Each team could release once a week. Merge conflicts — weekly.
After splitting into 8 services: Each team deploys independently, 10-15 times a day. Time-to-market dropped from a week to a day.
2. Scaling Hell
Symptom: One endpoint eats 90% of resources, but you have to scale the entire monolith
Example: Mobile API generates PDF reports (CPU-intensive)
The other 50 endpoints are simple, but they scale together with the PDF generator
Cost: $5k/month on servers instead of $1k
Math:
Monolith: 10 instances × 8GB RAM × $100/month = $1000
(8 instances needed only for PDF generator)
Microservices: API (2 instances × 2GB × $50) + PDF Service (8 × 8GB × $100) = $900
Savings: $100/month (or $1200/year)
3. Team Collision
Symptom: 3+ teams work in one repository and interfere with each other
Example: Team A changes auth, Team B breaks orders, Team C debugs all Friday
Cost: Merge conflicts + slow PR reviews + stress
Sign: If you do daily sync meetings between teams to "not step on toes" — you need boundaries.
4. Technology Lock-in
Symptom: Want to try a new language/framework, but it requires rewriting the entire monolith
Example: Monolith on Django 2.2, want async FastAPI for WebSocket
Migrating entire monolith = 6 months
Cost: Missed opportunities + technical debt
Microservices Readiness Checklist
Answer honestly:
- Do you have 3+ teams working in the monolith?
- Does deployment take > 30 minutes?
- Do different parts of the system have different scaling requirements?
- Do you have a dedicated DevOps/Platform team?
- Are you ready to implement distributed tracing and centralized logging?
- Can you hire senior developers without trouble (microservices demand them)?
- Does the business understand that migration will take 6-12 months?
If you answered "YES" to 5+ questions — microservices make sense.
If less than 5 — try a modular monolith first.
Modular Monolith as an Intermediate Step
What it is: A monolith divided into independent modules with clear boundaries.
monolith/
├── modules/
│ ├── auth/ # Separate module
│ │ ├── api/
│ │ ├── models/
│ │ └── services/
│ ├── orders/ # Separate module
│ │ ├── api/
│ │ ├── models/
│ │ └── services/
│ └── payments/ # Separate module
│ └── ...
└── shared/ # Shared code
Rules:
- Modules communicate only through public APIs (no direct imports)
- Each module can be extracted into a service in a week
- Shared code is minimal and stable
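A minimal sketch of the first rule in Python (module and function names are illustrative): each module re-exports its public surface from its package root, and everyone else imports only from that root.

# modules/orders/__init__.py: the orders module's public API (illustrative names)
from .services import create_order, get_order

__all__ = ["create_order", "get_order"]

# modules/payments/services.py: a consumer of that API
# ✅ allowed: go through the public API of the orders module
from modules.orders import create_order

# ❌ forbidden: reaching into another module's internals
# from modules.orders.models import Order

A tool like import-linter can turn this convention into a CI check, so a forbidden import fails the build instead of quietly creating coupling.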
Advantages:
- ✅ Deployment still simple (one service)
- ✅ Debugging simple (one process)
- ✅ Teams work independently (different modules)
- ✅ Ready for migration (boundaries already exist)
80% of companies I worked with solved their problems with a modular monolith. Only 20% needed microservices. Don't overestimate the complexity of your problems.
Strangler Fig Pattern: How to Migrate Without Downtime
Strangler Fig — a pattern of gradual migration. The new system grows around the old one, gradually replacing its parts. Like a fig tree wraps around an old tree and eventually replaces it.
Why Not Big Bang Rewrite
Big Bang Rewrite — when you stop development and rewrite everything from scratch.
Problems:
- 📉 6-12 months without new features → business loses money
- 🐛 You'll forget edge cases from the old system → bugs in production
- 😰 Team burns out → resignations
- 💸 Risk of project failure → millions wasted
Known Failures:
- Netscape (1998) — rewrote browser from scratch, lost the market
- Knight Capital (2012) — new system launched with a bug, lost $440M in 45 minutes
Strangler Fig in Practice
Idea: The new system sits alongside the old one. Gradually switch traffic from monolith to microservices. When the monolith is empty — turn it off.
Migration Architecture
┌─────────────────────────┐
│ API Gateway / Proxy │
│ (Nginx/Envoy) │
└───────────┬─────────────┘
│
┌───────────────┴───────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌──────────────────┐
│ Monolith (Django)│ │ Microservices │
│ │ │ │
│ /api/orders ✅ │ │ Auth Service ✅ │
│ /api/users ✅ │ │ Orders Service │
│ /api/auth ❌ │◄─────────│ (in progress) │
│ │ reads │ │
└────────┬───────────┘ data └─────────┬────────┘
│ │
└────────────┬────────────────────┘
▼
┌───────────────┐
│ PostgreSQL │
│ (shared DB) │
└───────────────┘
Stages:
- Proxy in front — all traffic goes through API Gateway
- First service — extract the simplest/most isolated module
- Switch traffic — change routing in proxy (without code changes)
- Monitor — watch metrics, latency, errors
- Roll back if issues — switch back in 5 seconds
- Repeat for next module
Example: Extracting Auth Service
Step 1: Duplicate Functionality
# New Auth Service (FastAPI)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sqlalchemy import select

from database import async_session  # same connection settings as the monolith
from models import User             # maps to the monolith's users table
from security import verify_password, create_jwt_token  # the service's own helpers (illustrative module path)

app = FastAPI()

class LoginRequest(BaseModel):
    email: str
    password: str

@app.post("/api/auth/login")
async def login(credentials: LoginRequest):
    async with async_session() as db:
        # Read from the SHARED DB (the same one the monolith uses)
        result = await db.execute(
            select(User).where(User.email == credentials.email)
        )
        user = result.scalar_one_or_none()
        if not user or not verify_password(credentials.password, user.password):
            raise HTTPException(401, "Invalid credentials")
        token = create_jwt_token(user.id)
        return {"access_token": token}

Step 2: Configure Routing (Nginx)
upstream auth_service {
server auth-service:8000;
}
upstream monolith {
server django-app:8000;
}
# Canary: send ~10% of traffic to the new service, keyed by request id
split_clients "${request_id}" $auth_backend {
    10%     auth_service;
    *       monolith;
}

server {
    listen 80;

    # /api/auth/ goes to the canary backend chosen above
    location /api/auth/ {
        proxy_pass http://$auth_backend;
    }

    # Everything else to monolith
    location / {
        proxy_pass http://monolith;
    }
}

Step 3: Gradual Traffic Increase
Day 1-3: 10% traffic → Auth Service
Day 4-7: 50% traffic → Auth Service
Day 8-10: 100% traffic → Auth Service
Step 4: Monitor Metrics
# Add metrics to Auth Service
from prometheus_client import Counter, Histogram
auth_requests = Counter('auth_requests_total', 'Total auth requests', ['status'])
auth_latency = Histogram('auth_request_duration_seconds', 'Auth latency')
@app.post("/api/auth/login")
@auth_latency.time()
async def login(credentials: LoginRequest):
try:
# ... logic
auth_requests.labels(status='success').inc()
return response
except Exception as e:
auth_requests.labels(status='error').inc()
        raise

Step 5: Compare with Monolith
# Grafana Dashboard
# Monolith vs microservice latency
histogram_quantile(0.95,
rate(django_http_request_duration_seconds_bucket{endpoint="/api/auth/login"}[5m])
)
vs
histogram_quantile(0.95,
rate(auth_request_duration_seconds_bucket[5m])
)
# Error rate
rate(django_http_errors_total{endpoint="/api/auth/login"}[5m])
vs
rate(auth_requests_total{status="error"}[5m])

Step 6: Remove Code from Monolith
After 2 weeks of stable Auth Service at 100% traffic:
# Django monolith - delete auth views
# git rm apps/auth/views.py
# git rm apps/auth/serializers.py
# git commit -m "Remove auth - migrated to auth-service"

Strangler Fig Pitfalls
Pitfall #1: Shared Database
During migration, you have a shared DB. This creates coupling.
# ❌ Bad: Auth Service changes schema
ALTER TABLE users ADD COLUMN last_login_ip VARCHAR(15);
# Monolith crashes: Unknown column 'last_login_ip'

Solution: Database View Pattern
-- Auth Service works through VIEW
CREATE VIEW auth_users AS
SELECT id, email, password_hash, created_at
FROM users;
GRANT SELECT ON auth_users TO auth_service;
-- Monolith continues working with table
-- Auth Service works with view
-- Schema migration doesn't break monolith

Pitfall #2: Transactions Between Services
# ❌ This WON'T work
def create_order(user_id, items):
with transaction.atomic():
# Call Auth Service
user = auth_service.get_user(user_id) # HTTP request
# Create order in monolith
order = Order.objects.create(user=user)
        # If error here → rollback won't affect Auth Service!

Solution: Saga Pattern or eventual consistency
# ✅ Event-driven approach
def create_order(user_id, items):
# 1. Create order in PENDING status
order = Order.objects.create(user_id=user_id, status='PENDING')
# 2. Publish event to queue
event_bus.publish('order.created', {
'order_id': order.id,
'user_id': user_id
})
# 3. Auth Service listens to events and updates stats
    # 4. If something goes wrong → compensating transaction

Distributed transactions don't work in microservices. Accept eventual consistency as reality. Use Saga, event sourcing, or live with data duplication.
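The compensating side is worth sketching. Assuming the same RabbitMQ broker as the order events (queue name, URL, and the cancel_order helper are illustrative, not the project's actual code): if payment later fails, the Orders side consumes the failure event and cancels the PENDING order instead of expecting a cross-service rollback.

# Compensating transaction sketch with aio_pika (illustrative names)
import asyncio
import json

import aio_pika


async def cancel_order(order_id: int) -> None:
    """Illustrative helper: set the order's status to CANCELLED in the Orders DB."""
    ...


async def consume_payment_failures() -> None:
    connection = await aio_pika.connect_robust("amqp://rabbitmq/")
    channel = await connection.channel()
    queue = await channel.declare_queue("payments.failed", durable=True)

    async with queue.iterator() as messages:
        async for message in messages:
            async with message.process(requeue=True):  # ack on success, requeue on error
                event = json.loads(message.body)
                await cancel_order(event["order_id"])  # the compensating action


if __name__ == "__main__":
    asyncio.run(consume_payment_failures())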
Distributed Tracing from Day One
The main pain of microservices: "Where did the request fail if it went through 7 services?"
In monolith: look at stack trace, see the entire call chain.
In microservices: look at logs of 7 services, try to find request by timestamp. Good luck.
OpenTelemetry + Jaeger: Must-Have from First Service
OpenTelemetry — standard for distributed tracing. Jaeger — UI for viewing traces.
Tracing Architecture
Request ID: 7f3a9c12-4e8d-4f2a-a1b3-8d7e9f2c1a4b
API Gateway (span: 250ms)
↓
Auth Service (span: 45ms)
├─ DB query (span: 12ms)
└─ Redis cache (span: 3ms)
↓
Order Service (span: 180ms)
├─ DB query (span: 50ms)
├─ HTTP → Payment Service (span: 120ms)
│ ├─ DB query (span: 15ms)
│ └─ HTTP → Stripe API (span: 95ms) ← Culprit!
└─ Kafka publish (span: 8ms)
One look at Jaeger UI — and you see Stripe is slowing down.
OpenTelemetry Setup in Python
Installation:
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-instrumentation-fastapi \
opentelemetry-exporter-jaeger

Code (FastAPI):
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Export to Jaeger
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
app = FastAPI()
# Automatic instrumentation
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
RedisInstrumentor().instrument()
# Custom spans
@app.post("/orders")
async def create_order(order: OrderCreate):
with tracer.start_as_current_span("create_order"):
# Span automatically includes:
# - request_id
# - http.method, http.url, http.status_code
# - duration
with tracer.start_as_current_span("validate_user"):
user = await auth_service.get_user(order.user_id)
with tracer.start_as_current_span("process_payment"):
payment = await payment_service.charge(order.total)
with tracer.start_as_current_span("save_to_db"):
result = await db.execute(insert(Order).values(**order.dict()))
return {"order_id": result.inserted_primary_key[0]}Cross-service trace ID propagation:
import httpx
from opentelemetry.propagate import inject
async def call_payment_service(amount: float):
headers = {}
# Inject trace context into HTTP headers
inject(headers)
async with httpx.AsyncClient() as client:
response = await client.post(
"http://payment-service/charge",
json={"amount": amount},
headers=headers # trace_id is passed forward
)
        return response.json()

Receiving side (Payment Service):
from fastapi import Request  # needed to read the raw headers
from opentelemetry.propagate import extract

@app.post("/charge")
async def charge(request: Request, data: ChargeRequest):
    # Extract trace context from headers
    context = extract(request.headers)
    # Span automatically links to parent
    with tracer.start_as_current_span("charge_payment", context=context):
        # ... payment logic
        pass

Docker Compose for Jaeger
services:
jaeger:
image: jaegertracing/all-in-one:1.52
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # Jaeger UI
- "6831:6831/udp" # Receive traces
- "4317:4317" # OTLP gRPC
networks:
- monitoring
auth-service:
build: ./auth-service
environment:
- JAEGER_AGENT_HOST=jaeger
- JAEGER_AGENT_PORT=6831
networks:
- monitoring
order-service:
build: ./order-service
environment:
- JAEGER_AGENT_HOST=jaeger
- JAEGER_AGENT_PORT=6831
networks:
- monitoring
networks:
monitoring:
    driver: bridge

What Distributed Tracing Shows
Scenario 1: Slow Request
Request to /api/orders took 2.5 seconds. Why?
Jaeger shows:
┌─ API Gateway: 2500ms
│ ├─ Auth Service: 50ms ✅
│ └─ Order Service: 2400ms ⚠️
│ ├─ DB query: 2300ms ❌ ← Problem here!
│ └─ Kafka publish: 80ms ✅
Solution: Add index on orders.user_id
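The fix itself is a one-line migration. A sketch, assuming the project manages its schema with Alembic (revision id and index name are illustrative):

# Alembic migration: add the missing index on orders.user_id
from alembic import op

revision = "add_orders_user_id_idx"  # illustrative revision id
down_revision = None                 # point this at the previous revision in a real project


def upgrade() -> None:
    op.create_index("ix_orders_user_id", "orders", ["user_id"])


def downgrade() -> None:
    op.drop_index("ix_orders_user_id", table_name="orders")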
Scenario 2: Cascading Failures
Order Service returns 500. What happened?
Jaeger shows:
┌─ Order Service: 500 Internal Server Error
│ └─ Payment Service: timeout after 30s ❌
│ └─ Stripe API: no response ❌
Solution: Stripe is down. Add circuit breaker.
Scenario 3: N+1 Problem in Distributed System
Request to /api/orders?user_id=123 is slow
Jaeger shows:
┌─ Order Service: 3200ms
│ ├─ DB query (orders): 50ms ✅
│ ├─ HTTP → Product Service: 150ms (×20 times!) ❌
│ │ └─ DB query: 5ms ×20
│
Solution: Batch requests to Product Service
Distributed tracing saves hours of debugging. Implement it BEFORE you launch the second microservice. Retrospective implementation is painful.
What to Do with Shared Code: 4 Strategies
The most painful question of microservices: "We have shared code (models, utilities, validation). What to do with it?"
Strategy 1: Shared Library
Idea: Common code in a separate package. Each service includes it as a dependency.
shared-lib/
├── models/
│ ├── user.py
│ └── order.py
├── utils/
│ ├── validators.py
│ └── formatters.py
└── setup.py
# Publish to private PyPI or Artifactory
Usage:
# requirements.txt for each service
company-shared-lib==1.2.3
# In code
from company_shared.models import User
from company_shared.utils import validate_email

Pros:
- ✅ DRY (Don't Repeat Yourself)
- ✅ Versioning (can rollback)
- ✅ Single place for changes
Cons:
- ❌ Coupling between services (all depend on one lib)
- ❌ Updating lib = deploying all services
- ❌ Breaking changes = nightmare
Pitfall:
# Someone updated shared-lib from 1.2.3 to 2.0.0
# Breaking change: User.full_name → User.get_full_name()
# Auth Service updated → works ✅
# Order Service didn't update → crashes ❌

Solution: Semantic versioning + deprecation warnings.
# shared-lib 1.3.0 (transitional release)
import warnings

class User:
    def full_name(self):
        warnings.warn("Use get_full_name() instead", DeprecationWarning)
        return self.get_full_name()

    def get_full_name(self):
        return f"{self.first_name} {self.last_name}"

# shared-lib 2.0.0 (breaking release)
# Remove full_name(), keep only get_full_name()

When to use:
- Stable code (changes rarely)
- Utilities, constants, formatters
- Pydantic/Protobuf schemas for API contracts
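For that last case, the shared library carries nothing but the contract models. A sketch (package and field names are illustrative):

# company_shared/contracts/orders.py: a versioned API/event contract (illustrative)
from datetime import datetime
from decimal import Decimal

from pydantic import BaseModel, Field


class OrderCreatedEvent(BaseModel):
    """Published by Orders Service, consumed by other services."""
    order_id: int
    user_id: int
    total: Decimal = Field(ge=0)
    created_at: datetime

Both producer and consumers pin a version of this package, so a contract change becomes an explicit dependency bump instead of silent drift.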
Strategy 2: Code Generation
Idea: Store schemas in one place (OpenAPI, Protobuf), generate clients for each language.
api-schemas/
├── openapi/
│ ├── auth.yaml
│ └── orders.yaml
└── generate.sh
# generate.sh
openapi-generator generate \
-i openapi/auth.yaml \
-g python \
-o clients/python/auth
openapi-generator generate \
-i openapi/auth.yaml \
-g typescript-fetch \
-o clients/typescript/auth
Result:
# Auth Service (FastAPI)
# API automatically generates OpenAPI schema
# Order Service uses generated client
from auth_client import AuthApi, Configuration
config = Configuration(host="http://auth-service")
api = AuthApi(config)
user = api.get_user(user_id=123)
print(user.email)  # Type-safe!

Pros:
- ✅ Type safety (IDE suggests methods)
- ✅ Automatic validation
- ✅ Support for different languages (Python, Go, TypeScript)
- ✅ API schema = source of truth
Cons:
- ❌ Additional step in CI/CD
- ❌ Generated code sometimes awkward
- ❌ Need to synchronize schemas
When to use:
- Polyglot microservices (Python + Go + Node.js)
- Strict API contracts
- External API for partners
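The schema that feeds openapi-generator doesn't have to be written by hand: FastAPI already produces it. A small export script (file paths and the app import are illustrative; openapi-generator accepts JSON as well as YAML):

# Dump the FastAPI-generated OpenAPI schema for openapi-generator
import json

from auth_service.main import app  # the Auth Service FastAPI instance

with open("openapi/auth.json", "w") as f:
    json.dump(app.openapi(), f, indent=2)

Running this in CI keeps the published schema from drifting away from the code.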
Strategy 3: Duplication
Idea: Copy code to each service. Yes, seriously.
auth-service/
└── utils/
└── validators.py # Copy
order-service/
└── utils/
└── validators.py # Same copy
payment-service/
└── utils/
└── validators.py # Same copy
Pros:
- ✅ Complete service independence
- ✅ No coupling
- ✅ Can change without risk of breaking other services
Cons:
- ❌ Violates DRY
- ❌ Bug needs fixing in 10 places
- ❌ Version divergence
When to use:
- Simple code (10-50 lines of utilities)
- Code that rarely changes
- When coupling is more expensive than duplication
"Duplication is far cheaper than the wrong abstraction" — Sandi Metz. Sometimes copying 20 lines of code to 5 services is easier than maintaining a shared library.
Strategy 4: Service as Source of Truth
Idea: No shared code. Services communicate only through APIs.
# ❌ Bad: Order Service imports User from shared-lib
from shared.models import User
def create_order(user_id):
user = User.objects.get(id=user_id) # Direct DB access!
order = Order.create(user=user)
# ✅ Good: Order Service calls Auth Service API
async def create_order(user_id):
user = await auth_service_client.get_user(user_id) # HTTP call
    order = await Order.create(user_id=user_id, user_email=user.email)

Pros:
- ✅ No coupling
- ✅ Each service owns its data
- ✅ Easy to change implementation inside service
Cons:
- ❌ Network latency
- ❌ Need fallback if service unavailable
- ❌ Eventual consistency
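The fallback point deserves an example. A minimal sketch (URL and the in-process cache are illustrative): if the owning service is down, degrade to the last known value instead of failing the whole request.

# Graceful degradation when Auth Service is unavailable
import httpx

_user_cache: dict[int, dict] = {}  # last known values, keyed by user id


async def get_user_with_fallback(user_id: int) -> dict | None:
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            response = await client.get(f"http://auth-service/api/users/{user_id}")
            response.raise_for_status()
            user = response.json()
            _user_cache[user_id] = user  # refresh the fallback value
            return user
    except httpx.HTTPError:
        # Stale data (or None) beats a 500 on most read paths
        return _user_cache.get(user_id)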
When to use:
- High isolation is critical
- Services in different languages
- Different teams own services
What to Choose: Decision Tree
Shared code:
├─ Stable, changes rarely?
│ ├─ Yes → Shared Library
│ └─ No → Duplication or Service API
├─ Need type safety?
│ └─ Yes → Code Generation
├─ Polyglot microservices?
│ └─ Yes → Code Generation or Service API
└─ Want independence at any cost?
└─ Yes → Service API
My Choice in 2025:
- Pydantic models for API contracts → Shared Library (versioning)
- Utilities (formatters, validators) → Duplication
- Business logic → Service API (each service owns its domain logic)
Real-World Case: E-commerce Migration with Metrics
Company: E-commerce platform (B2C)
Before migration: Django monolith, 120k lines of code, 35 developers
Problem: 45-minute deployment, 5 teams interfering with each other, expensive scaling
Initial State
Architecture:
Django Monolith (10 instances × 8GB RAM)
├── Auth (5% CPU)
├── Catalog (15% CPU)
├── Orders (20% CPU)
├── Payments (10% CPU)
├── Recommendations (40% CPU) ← ML model, CPU-intensive
└── Admin Panel (10% CPU)
PostgreSQL (master + 2 read replicas)
Redis (cache + sessions)
Celery (background tasks)
Metrics (before):
| Metric | Value |
|---|---|
| Deployment | 45 minutes |
| Deploys/week | 2-3 times |
| Instances | 10 × c5.2xlarge ($340/month) |
| Infrastructure | $3400/month |
| P95 latency | 350ms |
| Uptime | 99.5% (3.6 hours downtime/month) |
| Time to market | 2-3 weeks |
Pain:
- Recommendations (ML) required 40% CPU, but had to scale entire monolith
- 5 teams working in one repository → merge conflicts
- Deployment blocked everyone → queue on Friday evening
- Changes in Auth broke Orders (coupling)
Migration Plan (6 Months)
Stage 1: Preparation (month 1)
- ✅ Set up OpenTelemetry + Jaeger
- ✅ Implemented feature flags (LaunchDarkly)
- ✅ Set up API Gateway (Kong)
- ✅ Created shared library for models
- ✅ Split teams by domains
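The feature flags from this list are what let the application itself fall back to the old code path while the gateway canary runs. A sketch with a hypothetical flag client (the team used LaunchDarkly, but any provider fits the same shape; module names are illustrative):

# Flag-gated switch between the in-process path and the new service
import httpx

from app.flags import flags                      # hypothetical feature-flag client
from app.recommendations import recommend_local  # the old in-process implementation


async def get_recommendations(user_id: int):
    if flags.is_enabled("use-recommendations-service", user_id=user_id):
        async with httpx.AsyncClient(timeout=1.0) as client:
            resp = await client.get(
                f"http://recommendations-service/recommendations/{user_id}"
            )
            resp.raise_for_status()
            return resp.json()["products"]
    # Flag off (or kill switch pulled): use the monolith's code path
    return recommend_local(user_id)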
Stage 2: First Service - Recommendations (month 2)
Why first: Isolated, CPU-intensive, not critical for business.
# Recommendations Service (FastAPI + ML model)
from fastapi import FastAPI
from ml_model import RecommendationModel
app = FastAPI()
model = RecommendationModel.load()
@app.get("/recommendations/{user_id}")
async def get_recommendations(user_id: int, limit: int = 10):
# ML inference
products = await model.predict(user_id, limit=limit)
return {"products": products}Result:
Savings: $600/month (GPU instances cheaper for ML than inflating monolith)
Stage 3: Auth Service (month 3)
Why second: Critical but simple. Clear boundaries.
# Auth Service (FastAPI + JWT)
from fastapi import FastAPI, Depends, HTTPException
from fastapi_jwt_auth import AuthJWT
app = FastAPI()
@app.post("/auth/login")
async def login(credentials: LoginRequest):
user = await authenticate(credentials)
access_token = create_access_token(user.id)
return {"access_token": access_token}
@app.get("/auth/me")
async def get_current_user(Authorize: AuthJWT = Depends()):
Authorize.jwt_required()
user_id = Authorize.get_jwt_subject()
user = await get_user_by_id(user_id)
    return user

Result:
- Auth deployment: 8 minutes (was 45)
- Auth team can release 5-10 times a day
- Monolith became lighter (removed 15k lines)
Stage 4: Orders Service (month 4)
Challenges: Transactions with Payments, events for other services.
# Orders Service (FastAPI + Event-driven)
from fastapi import FastAPI
import json
import aio_pika  # RabbitMQ client
app = FastAPI()
@app.post("/orders")
async def create_order(order: OrderCreate):
# 1. Create order
order_entity = await db.create_order(order)
# 2. Publish event
connection = await aio_pika.connect_robust("amqp://rabbitmq/")
channel = await connection.channel()
await channel.default_exchange.publish(
aio_pika.Message(
body=json.dumps({
"order_id": order_entity.id,
"user_id": order.user_id,
"total": order.total
}).encode()
),
routing_key="orders.created"
)
    return order_entity

Stage 5: Payments Service (month 5)
Integration: Stripe, PayPal, internal wallet.
# Payments Service
@app.post("/payments/charge")
async def charge(payment: PaymentRequest):
# Listen to "orders.created" events
# Charge money
# Publish "payments.completed"
    pass

Stage 6: Catalog Service (month 6)
Feature: Read-heavy (80% GET requests).
# Catalog Service with aggressive caching
from fastapi import FastAPI
from fastapi_cache import FastAPICache
from fastapi_cache.decorator import cache
from fastapi_cache.backends.redis import RedisBackend
from redis import asyncio as aioredis  # redis-py asyncio client

app = FastAPI()

@app.on_event("startup")
async def startup():
    redis = aioredis.from_url("redis://localhost")
    FastAPICache.init(RedisBackend(redis), prefix="catalog:")

@app.get("/products/{product_id}")
@cache(expire=3600)  # 1 hour
async def get_product(product_id: int):
    return await db.get_product(product_id)

Final State
Architecture (after):
API Gateway (Kong)
├── Auth Service (2 instances)
├── Catalog Service (3 instances + Redis)
├── Orders Service (4 instances)
├── Payments Service (2 instances)
├── Recommendations Service (4 GPU instances)
└── Django Monolith (Admin Panel only, 2 instances)
Event Bus (RabbitMQ)
Distributed Tracing (Jaeger)
Centralized Logging (Loki)
Metrics (after):
| Metric | Before | After | Change |
|---|---|---|---|
| Deployment | 45 min | 5-12 min | -73% |
| Deploys/week | 2-3 | 40-50 | +1500% |
| Infrastructure | $3400/month | $2100/month | -38% |
| P95 latency | 350ms | 180ms | -49% |
| Uptime | 99.5% | 99.9% | +0.4% |
| Time to market | 2-3 weeks | 3-5 days | -70% |
Savings: $1300/month × 12 = $15,600/year on infrastructure alone.
Productivity gain: Teams release 20x more frequently → features to market 5x faster → ROI priceless.
Pitfalls We Hit
Pitfall #1: Forgot About N+1 in Microservices
# ❌ Bad: N+1 HTTP requests
async def get_order_details(order_id):
order = await db.get_order(order_id)
# Make HTTP request for each item!
for item in order.items:
product = await catalog_service.get_product(item.product_id)
item.product_name = product.name
    # 10 items = 10 HTTP requests × 50ms = 500ms latency!

Solution: Batch API
# ✅ Good: one request for all products
async def get_order_details(order_id):
order = await db.get_order(order_id)
product_ids = [item.product_id for item in order.items]
products = await catalog_service.get_products_batch(product_ids) # One request!
products_map = {p.id: p for p in products}
for item in order.items:
        item.product_name = products_map[item.product_id].name

Pitfall #2: Distributed Monolith
After 3 months we had 15 services, but they ALL called each other synchronously. This is a distributed monolith, not microservices.
Solution: Event-driven architecture. Services communicate through events, not HTTP.
Pitfall #3: No Circuit Breaker
Payments Service crashed → Orders Service waited 30s timeout on each request → entire site went down.
Solution: Circuit Breaker Pattern.
import httpx
from circuitbreaker import circuit
@circuit(failure_threshold=5, recovery_timeout=60)
async def call_payment_service(amount):
async with httpx.AsyncClient() as client:
response = await client.post("http://payment-service/charge", ...)
return response.json()
# After 5 failures → circuit open → fast fail (don't wait 30s)

Anti-patterns and How to Avoid Them
Anti-pattern #1: Microservice Per Table
Bad:
User Service (users table)
Order Service (orders table)
Product Service (products table)
Cart Service (cart_items table)
...
Why bad: Creating an order = 5 HTTP requests between services. Transactions impossible.
Right: Services by business domains.
Auth & Users Service (all auth logic)
Catalog Service (products, categories, search)
Order Management Service (orders, cart, checkout)
Anti-pattern #2: Shared Database
Bad: All services write to one PostgreSQL.
Why bad:
- Coupling at DB schema level
- One service changes schema → another crashes
- Scaling problematic
Right: Database per service (or at least schema per service).
-- Auth Service
CREATE SCHEMA auth;
CREATE TABLE auth.users (...);
-- Order Service
CREATE SCHEMA orders;
CREATE TABLE orders.orders (...);

Anti-pattern #3: No API Gateway
Bad: Frontend calls 10 microservices directly.
Why bad:
- CORS on each service
- Auth on each service
- Frontend knows internal topology
- Can't change routing without changing frontend
Right: API Gateway (Kong, Nginx, Envoy, AWS API Gateway).
# Kong routing
/api/auth/* → Auth Service
/api/products/* → Catalog Service
/api/orders/* → Orders Service

Anti-pattern #4: Distributed Monolith
Signs:
- All services synchronously call each other
- Can't deploy one service without others
- Changing API of one service → changes in all others
Right: Loose coupling through events.
Checklist Before Starting Migration
Infrastructure:
- API Gateway configured
- Service mesh (optional but desirable)
- Distributed tracing (OpenTelemetry + Jaeger)
- Centralized logging (Loki/ELK)
- Metrics & Monitoring (Prometheus + Grafana)
- CI/CD for each service
- Container registry (Docker Hub, ECR, GCR)
Architecture:
- Defined service boundaries (DDD, bounded contexts)
- Planned strategy for shared code
- Chose pattern for eventual consistency (Saga, events)
- Designed API contracts (OpenAPI, gRPC)
- Planned data migration strategy
Team:
- Everyone understands why we're migrating
- Have dedicated DevOps/Platform team
- Developers understand distributed systems
- Have owners for each service
Business:
- Business understands migration will take 6-12 months
- Have budget for additional infrastructure
- Ready for temporary slowdown in features
Conclusions
Microservices aren't about technology. They're about people, processes, and business goals.
When NOT to use microservices:
- Team < 10 people
- Startup searching for product-market fit
- No problems with deploy frequency
- Monolith handles the load
When to use:
- 3+ teams working in monolith
- Deploy bottleneck (> 30 minutes)
- Different parts of system require different scaling
- Have DevOps/Platform team
Strangler Fig Pattern:
- Migration without downtime
- Gradual traffic switching
- Can rollback in seconds
- Minimize risks
Distributed Tracing:
- Must-have from first microservice
- Saves hours of debugging
- OpenTelemetry + Jaeger — standard
Shared Code:
- Stable code → Shared Library
- API contracts → Code Generation
- Utilities → Duplication (sometimes it's OK)
- Business logic → Service API
Main lesson: Don't do microservices because it's trendy. Do it because the monolith became your pain. And do it gradually, measuring every step.
Need help with architecture? I conduct architecture reviews and help teams make the right decisions. Drop me an email and let's discuss your case.
Useful materials:
- Monitoring Stack 2025 — how to monitor microservices
- Load Balancers — load distribution
- Technical Debt Metrics — how to measure architectural debt
