When Problems Come With Traffic
While you're reading this sentence, your code might be losing money, not because of bugs, but because of unanticipated load. A colleague had a case: $15k/day in ad spend, production down for half a day, roughly $7.5k paid to send traffic to a site that wasn't answering. Since then I treat load testing as a financial audit of the architecture: we find the point where the system starts eating revenue and calculate exactly how much that costs.
Remember how marketplaces and retail went down on Black Friday or Singles' Day? Or how government portals couldn't handle the rush during COVID QR code and vaccination registration launches? Their mistake wasn't that they didn't test — it was that they tested the wrong things the wrong way.
A 1-second delay at 1000 RPS equals 1000 seconds of cumulative user waiting time for every second of real time. That waiting converts directly into abandoned carts and negative reviews. Load testing is how you find out in advance at what RPS, and because of which component, you start losing money, and exactly how much.
90% of performance issues are only visible under load. A local curl gives you 50ms, but at 1000 RPS you get seconds of waiting and 500 errors.
What Is Load Testing in Simple Terms
I keep two questions in mind when running load tests:
- Performance testing: How fast is the system at target load? This is about p95, latency, throughput.
- Resilience testing: How and when does the system break, how does it degrade and recover, and what does it cost? This is about limits and failure points in dollars.
Load testing is always about resilience and money. We find where the system cracks and calculate what it will cost the business.
Test types (all translated to money language):
- Load testing — "How many users can we serve before we start losing money?"
- Stress testing — "At what load, and in which component, does the system give out, and what does that failure cost?"
- Spike testing — "If we get a traffic surge from HackerNews/top blogger, at what RPS do we fall and how much do we lose?"
- Soak testing — "What's the cost of degradation or memory leaks after an hour of peak traffic?"
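To make the trickier profiles concrete, here is a minimal sketch of spike and soak in k6 terms (the tool used throughout this article); the durations, VU targets, and endpoint are illustrative assumptions, not recommendations:
```javascript
import http from "k6/http";

// k6 reads one `export const options` per script, so spike and soak live in
// separate scripts. All durations and VU targets here are illustrative.

// spike.js: sudden surge and recovery
export const options = {
  stages: [
    { duration: "2m", target: 100 },   // normal traffic
    { duration: "30s", target: 1000 }, // the surge hits
    { duration: "3m", target: 1000 },  // hold the surge
    { duration: "2m", target: 100 },   // back to normal
  ],
};

export default function () {
  // Placeholder endpoint; point this at a money-critical flow.
  http.get("https://api.example.com/search");
}

// soak.js would instead hold a moderate target (e.g. 200 VU) for 1-2 hours
// to surface memory leaks and slow degradation.
```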
Tools: What I Use and Why
Tool choice is secondary. The main thing is to answer business questions, not draw pretty graphs. We use k6 because its scripts are code in the repository that lives in CI/CD and is cheap to maintain. You can take Gatling or JMeter — what matters more is that test results answer the question "how much money do we lose with the current architecture?"
Tool selection principles:
- Test as code: lives next to product code and gets reviewed.
- CI/CD: one step to run, no manual clicks.
- Business metrics: simple thresholds, custom metrics, and money linkage.
- Transparency: reports are understood by developers, product managers, and management.
| Principle / Tool | k6 | Apache JMeter | Gatling |
|---|---|---|---|
| Test as code | ✅ JS/TS | ❌ XML/GUI | ✅ Scala DSL |
| CI/CD integration | ✅ Out of the box | ⚠️ Cumbersome | ✅ Excellent |
| Business metrics | ✅ Thresholds, custom metrics | ⚠️ Available but harder | ✅ Rich reports |
| Best for | DevOps, startups | QA, legacy projects | JVM teams, high loads |
We chose k6 because it fully covers these principles. Any other tool should meet the same criteria.
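As a sketch of the "business metrics" principle: in k6, a custom metric can count lost revenue directly during the run. The metric name, endpoint, and order value below are assumptions for illustration:
```javascript
import http from "k6/http";
import { check } from "k6";
import { Counter } from "k6/metrics";

// Hypothetical custom metric: dollars of revenue put at risk by failed checkouts.
const lostRevenueUSD = new Counter("lost_revenue_usd");
const AVG_ORDER_VALUE = 20; // assumption taken from the business analysis

export const options = {
  thresholds: {
    // Fail the run if the simulated loss exceeds a budgeted amount.
    lost_revenue_usd: ["count<1000"],
  },
};

export default function () {
  const res = http.post(
    "https://api.example.com/checkout",
    JSON.stringify({ cart: "demo" }),
    { headers: { "Content-Type": "application/json" } }
  );
  const ok = check(res, { "checkout succeeded": (r) => r.status === 200 });
  if (!ok) {
    // Every failed checkout is counted as a potentially lost order.
    lostRevenueUSD.add(AVG_ORDER_VALUE);
  }
}
```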
k6 — My Main Stack
I use k6 as default: quick start, transparent metrics, easy to automate.
Why k6:
- Scripts in JavaScript/TypeScript — familiar syntax, fast development
- Built-in metrics: response time, RPS, error rate
- Integration with Grafana Cloud or local stack
- CLI-first: easy in CI/CD
- Open source and free
Why k6 Is Not Just a Tool, But a Paradigm Shift
JMeter and its successors were born for separate QA teams and GUI configs. k6 is a tool of the DevOps era, where developer and engineer are the same person: JavaScript instead of XML, code instead of clicks, CI/CD instead of manual runs. Choosing k6 is a decision in favor of making load tests part of the codebase, not artifacts on the side.
k6 forces you to think of tests as code: code can be reviewed, tested, and maintained. This changes team culture.
I keep technical examples and scripts in the repository: it's code that we review and run in CI. The methodology matters here, not syntax.
Alternatives (When and Why)
If k6 doesn't work, I choose a tool for the team and context:
Apache JMeter:
- GUI for visual configuration (if team isn't JS-savvy)
- Mature plugin ecosystem
- Cons: heavier, XML configs, harder in CI
Gatling:
- Scala DSL (for JVM teams)
- Beautiful reports out of the box
- Good for complex scenarios with state
Locust:
- Python scripts
- Distributed load
- Good if stack is already Python
Artillery:
- YAML configs (simpler for simple cases)
- Built-in WebSocket, Socket.io support
- Cons: less flexible for complex scenarios
k6 covers 90% of cases: familiar JS, convenient CLI, Docker and CI runs, free metrics export to Grafana Cloud.
The Steel Algorithm (From Scratch)
- Business analysis: Calculate the cost of 1% errors and 1 second latency for key scenarios.
- Tool: Choose stack (k6/Gatling/JMeter) based on "test as code", CI/CD, metrics, transparency principles.
- Scenarios: Describe 2-3 money-critical flows, document "where it will break" hypotheses.
- Run: Load profile with gradual ramp-up, SLA thresholds, CI/CD execution.
- Investigation: Correlate test and infrastructure metrics, find root cause.
- Report: Hypothesis → data → action → verification, linked to money.
Methodology: Step 0 and 5 Steps to Meaningful Results
Running load for pretty graphs is a waste of time. Each step has a business question that we answer with numbers and actions.
Example thread: Taxi service, scenario "search → order → payment". We know from analytics that when p95 > 3s on car search, conversion drops by 15%. At peak load of 1000 RPS, that's 150 lost requests per second. With average order value $20 and 10% conversion to order, that's ~$300 in lost revenue per minute. Load testing showed the geocoder cracks at 800 RPS — meaning potential loss of ~$2400 per peak hour until we fix the geocoder.
0. Formulate Hypotheses
Before running the test, the team answers in writing: where will the system break first when load increases by N times? Examples: "DB will hit IOPS at 300 RPS", "cache will run out at 500 VU", "external payment gateway will start timing out". We then verify these hypotheses with the test.
1. Define Scenarios
Choose 2-3 critical user flows, ignore the rest in the first run. Business question: which 3 scenarios generate 80% of revenue or create peak load?
- Auth + view dashboard
- Search product + add to cart + checkout
- Create entry + list entries
Example scenario:
```javascript
import http from "k6/http";
import { group, check } from "k6";

export default function () {
  group("User Journey: Login → Dashboard → Logout", () => {
    // 1. Login (JSON body; this example assumes the API expects application/json)
    const loginRes = http.post(
      "https://api.example.com/auth/login",
      JSON.stringify({ email: "test@example.com", password: "password123" }),
      { headers: { "Content-Type": "application/json" } }
    );
    check(loginRes, { "login success": (r) => r.status === 200 });
    const token = loginRes.json("token");

    // 2. Get dashboard
    const headers = { Authorization: `Bearer ${token}` };
    const dashRes = http.get("https://api.example.com/dashboard", { headers });
    check(dashRes, { "dashboard loaded": (r) => r.status === 200 });

    // 3. Logout
    http.post("https://api.example.com/auth/logout", null, { headers });
  });
}
```
2. Configure Load Profile
Ramp up load in stages: turning on 10000 VU at once gives false failures. Business question: at what traffic growth do metrics start costing us money?
```javascript
export const options = {
  stages: [
    { duration: "1m", target: 50 },  // Warmup
    { duration: "3m", target: 50 },  // Baseline
    { duration: "2m", target: 150 }, // Growth (peak hours)
    { duration: "5m", target: 150 }, // Peak load
    { duration: "2m", target: 300 }, // Stress test
    { duration: "3m", target: 300 }, // Hold stress
    { duration: "2m", target: 0 },   // Gradual ramp-down
  ],
};
```
3. Set Thresholds
Thresholds turn a run into an SLA check, not just numbers for numbers' sake. Business question: what p95 latency is the profitability threshold for the key scenario?
```javascript
export const options = {
  thresholds: {
    // Latency
    http_req_duration: [
      "p(50)<200",  // Median < 200ms
      "p(95)<500",  // 95th percentile < 500ms
      "p(99)<1000", // 99th percentile < 1s
    ],
    // Availability
    http_req_failed: ["rate<0.01"], // < 1% errors
    // Throughput
    http_reqs: ["rate>100"], // Minimum 100 RPS
    // Custom checks
    "checks{type:auth}": ["rate>0.99"], // 99% successful logins
  },
};
```
4. Run Test
Run in CI/CD so the same scenario runs under identical conditions. Locally only for debugging.
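One detail worth knowing for CI runs: k6 thresholds can abort a clearly failing run early, so the pipeline doesn't spend twenty minutes confirming the obvious. A minimal sketch, reusing the SLA values from the thresholds example above:
```javascript
export const options = {
  thresholds: {
    // Same SLA as above, but stop the run early if it is clearly failing.
    // delayAbortEval gives the system a minute of warm-up before judging.
    http_req_duration: [{ threshold: "p(95)<500", abortOnFail: true, delayAbortEval: "1m" }],
    http_req_failed: [{ threshold: "rate<0.01", abortOnFail: true, delayAbortEval: "1m" }],
  },
};
```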
5. Analyze Results
First, look at run metrics. Business question: how much money do we lose at current metrics and where exactly?
- Response time percentiles (p50, p95, p99)
- Error rate (by codes: 4xx vs 5xx)
- Throughput (RPS)
- Virtual Users (how many sustained)
In parallel, check infrastructure:
- CPU/Memory on servers (via Prometheus/Grafana)
- Database queries per second, slow queries
- Cache hit rate
- Network I/O
If p95 < 500ms but p99 = 5s — you have long tails. Look for slow DB queries or external API timeouts.
Key Metrics for Decisions
Mandatory visualization: latency vs throughput graph (flat line ideally, spike in reality) and the "latency — error rate — throughput" triangle, plus resource saturation graphs (CPU/IO/network) on one Grafana screen.
Latency (p95/p99):
- p95 — your SLA for the main user mass.
- p99 — your conscience: if p95 = 200ms but p99 = 2000ms, you have "core and tail". The tail is a systemic issue: slow DB queries, locks, GC.
Error Rate:
- 1% errors at 1000 RPS — that's 10 failed requests per second.
- Over 10 minutes of testing, that's 6000 errors — that's an incident, not statistical noise.
Throughput (RPS):
- Useless by itself; value is in correlation with latency.
- Look at "Latency vs Throughput" graph: line should be gentle until the limit.
- If latency spikes as RPS grows — you found the throughput ceiling of the bottleneck.
Resource Saturation:
- CPU, disk, network, DB connections. RPS at 90-100% CPU or IOPS signals infrastructure will crack, not the application.
- Correlate latency spikes with resource usage peaks.
Apdex (Application Performance Index):
- Compressed satisfaction metric for business.
- Configure S/T/F for your product, otherwise the number is meaningless.
- Formula: (Satisfied + Tolerating/2) / Total. For example, Satisfied < 200ms, Tolerating up to 800ms, Frustrated — anything slower. For a trading platform, Satisfied is closer to 50ms; for a CMS, closer to 500ms.
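A back-of-the-envelope example of the formula, with invented counts just to show the arithmetic:
```javascript
// Apdex = (satisfied + tolerating / 2) / total
// The counts below are invented purely to illustrate the calculation.
function apdex(satisfied, tolerating, total) {
  return (satisfied + tolerating / 2) / total;
}

// Say a run produced 10,000 requests: 8,500 under 200ms (satisfied),
// 1,200 between 200ms and 800ms (tolerating), 300 slower (frustrated).
const score = apdex(8500, 1200, 10000);
console.log(score.toFixed(2)); // 0.91: looks "good", but the 300 frustrated users are your p99 tail
```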
Infrastructure (Correlation)
- CPU Usage — > 80% sustained → bottleneck
- Memory — grows linearly → memory leak
- Database connections — pool exhausted → tune
- Cache hit rate — low → poor caching strategy
How I Find Root Cause (Brief)
- High p99 + CPU/IO spikes on DB → check slow query log, EXPLAIN ANALYZE, pt-query-digest.
- 5xx errors without CPU growth → check DB connection limits, app pools, external API timeouts.
- Throughput stuck, low CPU → look for global locks, network issues, or rate limiting.
- RPS grows, latency jumps, cache hit drops → catch cache misses and reconsider TTL/strategy.
Report: How to Turn Numbers Into Actions
This is the heart of the approach: the report immediately turns into a backlog of actions and hypotheses.
Report Structure (My Template)
I use this template so the report immediately becomes an action backlog:
```markdown
# Load Test: API v2.0
## Goal
Verify API readiness for 3x load increase (from 500 to 1500 RPS).
## Scenario
- User Journey: Login → Get Dashboard → Logout
- Load profile: 50 → 150 → 300 VU
- Duration: 18 minutes
## Results
### ✅ Passed Thresholds
- p95 latency: 420ms (threshold: < 500ms)
- Error rate: 0.3% (threshold: < 1%)
- Throughput: 1200 RPS (expected: 1000 RPS)
### ❌ Issues
- p99 latency: 3.2s (threshold: < 1s)
- CPU on DB: 92% at peak
- Slow queries: `/users` endpoint → N+1 queries
## Bottlenecks (by priority)
1. **Database N+1 queries** — `/users` makes 50+ DB queries
   - **Action:** add eager loading
   - **ETA:** 2 days
   - **Effect:** p99 latency → < 800ms
2. **CPU on DB** — reaches 92% at 300 VU
   - **Action:** upgrade instance (4 → 8 vCPU)
   - **ETA:** 1 day
   - **Effect:** headroom to 500 VU
3. **Cache hit rate** — only 65% for `/dashboard`
   - **Action:** increase cache TTL from 5min to 15min
   - **ETA:** 1 day
   - **Effect:** reduce DB load by 20%
## Recommendations
- Optimize N+1 queries **critical before release**
- Upgrade DB instance **recommended**
- Tune cache **can be postponed**
## Hypothesis and Verification (mandatory for each issue)
- **Problem:** p99 latency: 3.2s on `/users`
- **Hypothesis:** DB logs show N+1 queries on this endpoint
- **Action:** add eager loading
- **Verification:** rerun same test and confirm p99 < 800ms
## Graphs
[Attach Grafana dashboard screenshot]
## Next Steps
- [ ] Fix N+1 queries
- [ ] Rerun test after optimization
- [ ] Stress test to find limit (500+ VU)
```
The report should answer "What to do?", not "What happened?". Numbers without actions are useless.
Common Pitfalls
0. Testing the Wrong System
Problem: Running tests in staging that's weaker than production, with synthetic data and mocked external APIs. Results are pure noise and false security.
Solution: The test environment must match production in hardware and software. Data must be representative. All external dependencies should either be deployed in the test environment or be the prod versions (with their owners' consent).
Anti-Pattern Report
Typical "report" that answers no business questions:
- Conducted load testing
- Reached 5000 RPS
- p95 = 2.3s
- Recommendation: "optimize DB"
Why useless:
- Unclear whether we can operate at 5000 RPS at all (with p95 = 2.3s, almost certainly not).
- Unclear what exactly to optimize in DB.
- Unclear if optimizations are worth the cost.
What we need instead: at what RPS we stay within the p95 SLA, how much money we lose above that threshold, and which component breaks first.
1. Testing on Different Hardware
Problem: Staging is weaker than production, numbers mean nothing.
Solution: Run against production during off-peak hours, or match staging resources to prod.
2. Rate Limiter Chokes, Not Load
Problem: Hitting limits (Cloudflare, Nginx), not code.
Solution: Disable or raise limits for runner IP.
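If the limits can't simply be raised, another option is to tag test traffic with a marker header so the proxy or WAF can whitelist it; the header name here is an assumption to agree on with whoever owns the edge:
```javascript
import http from "k6/http";

export default function () {
  // Hypothetical marker header so Cloudflare/Nginx rules can bypass rate limits
  // for load-test traffic only.
  const params = { headers: { "X-Load-Test": "k6" } };
  http.get("https://api.example.com/search", params);
}
```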
3. No Warmup
Problem: First requests are slow due to cold cache and JIT.
Solution: Add warm-up stage before main test:
```javascript
export const options = {
  stages: [
    { duration: "1m", target: 10 },  // Warm-up
    { duration: "5m", target: 100 }, // Main test
  ],
};
```
4. No Infrastructure Metrics
Problem: k6 is green but server still crashes.
Solution: Watch CPU, memory, disk, network alongside k6 metrics.
5. One Huge Run
Problem: 2-hour test, unclear when it broke.
Solution: Split into micro-tests:
- Baseline: 50 VU, 5 min
- Peak load: 150 VU, 10 min
- Stress: 300 VU, 5 min
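One way to codify these three runs is as separate, labeled k6 scenarios in a single script, so the results show exactly at which stage metrics degraded. A sketch; executors, timings, and the endpoint are assumptions:
```javascript
import http from "k6/http";

// Three labeled micro-tests in one script instead of one opaque two-hour run.
export const options = {
  scenarios: {
    baseline: {
      executor: "constant-vus",
      vus: 50,
      duration: "5m",
    },
    peak: {
      executor: "constant-vus",
      vus: 150,
      duration: "10m",
      startTime: "5m", // starts right after baseline finishes
    },
    stress: {
      executor: "constant-vus",
      vus: 300,
      duration: "5m",
      startTime: "15m",
    },
  },
};

export default function () {
  http.get("https://api.example.com/dashboard");
}
```
Each scenario is tagged separately in the output, so baseline, peak, and stress can be compared side by side in the same report.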
CI/CD Integration
GitLab CI Example
```yaml
load-test:
  stage: test
  image: grafana/k6:latest
  script:
    - k6 run --out json=results.json tests/load/api.js
  artifacts:
    # k6's JSON output is not JUnit format, so store it as a plain artifact
    paths:
      - results.json
  only:
    - main
  when: manual # Run manually before release
```
GitHub Actions Example
```yaml
name: Load Test
on:
  workflow_dispatch: # Manual trigger
jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run k6 test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/api.js
          cloud: true
          token: ${{ secrets.K6_CLOUD_TOKEN }}
```
Don't run load tests on every commit — it's expensive and slow. Run before releases or on schedule (nightly).
Business Insights
Load testing is about money and reputation, not just graphs.
ROI of Load Testing
Simple economics:
- Test cost: 1-2 engineer-days (~$500-1000)
- Production failure cost: hours of downtime × revenue loss (~$10k-100k+)
- ROI: 10x-100x
Incident cost formula:
(Average order × Conversion × Affected traffic share × Outage duration) + (Support cost × Ticket count) + (Reputation cost × Loss coefficient)
Load testing lets you plug in real numbers: "at 300 RPS and 10% checkout errors, we lose X dollars per minute".
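A minimal sketch of plugging numbers into that statement; every constant is an assumption to be replaced with your own analytics:
```javascript
// Rough per-minute loss estimate for the "300 RPS, 10% checkout errors" example.
// All constants are assumptions; swap in real analytics data.
const rps = 300;             // requests per second hitting checkout
const errorRate = 0.10;      // share of those requests failing under load
const conversion = 0.10;     // share of successful requests that become orders
const avgOrderValueUSD = 20; // average order value

const failedPerSecond = rps * errorRate;                       // 30 failed requests/s
const lostOrdersPerMinute = failedPerSecond * conversion * 60; // ~180 orders/min
const lostRevenuePerMinute = lostOrdersPerMinute * avgOrderValueUSD;

console.log(`~$${lostRevenuePerMinute} lost per minute of degradation`); // ~$3600
```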
When Load Testing Is Critical
- E-commerce — sales, Black Friday
- Streaming — new content launches
- Fintech — payouts, mass transactions
- SaaS — viral growth, press mentions
Questions the CEO Asks
- What error percentage are we willing to tolerate and how much is that in absolute daily requests?
- At what p95 does conversion drop and how much money is that per hour?
- Is it cheaper to scale architecture or budget 0.5% losses in the business plan?
Conclusion
Load testing isn't a pre-release checklist item, but a resilience testing practice that directly impacts money. It's a way to turn an unpredictable incident into a planned optimization task. In each run, you answer the question: how much money do we lose with the current architecture and what needs to be done to stop losing it?
P.S. Your first serious load test will almost certainly reveal 2-3 bottlenecks you didn't know about. Better you find out than your users.
Starting Monday, take one step:
- Choose one business-critical scenario;
- Calculate how much money you lose at 1% errors or p95 > 1s;
- Run a test that shows at what load you hit these thresholds.
First results in 2 days, understanding of architectural decision costs in 2 weeks.
See also:
- Proxmox: Home Data Center — where to run tests
- Self-host Supabase — how I tested DB migration under load

