When Problems Come With Traffic
While you're reading this sentence, your code might be losing money, not because of bugs, but because of unanticipated load. A colleague had a case: $15k/day in ad spend, production down for half a day, roughly $7.5k paid to send traffic to a site that wasn't answering. Since then I treat load testing as a financial audit of the architecture: we find the point where the system starts eating revenue and calculate exactly how much that costs.
Remember how marketplaces and retail went down on Black Friday or Singles' Day? Or how government portals couldn't handle the rush during COVID QR code and vaccination registration launches? Their mistake wasn't that they didn't test — it was that they tested the wrong things the wrong way.
A 1-second delay at 1000 RPS equals 1000 seconds of cumulative user waiting time for every second of real time. That waiting converts directly into abandoned carts and negative reviews. Load testing is how you find out in advance at what RPS, and because of which component, you start losing money, and exactly how much.
90% of performance issues are only visible under load. A local curl gives you 50ms, but at 1000 RPS you get seconds of waiting and 500 errors.
What Is Load Testing in Simple Terms
I keep two questions in mind when running load tests:
- Performance testing: How fast is the system at target load? This is about p95, latency, throughput.
- Resilience testing: How and when does the system break, how does it degrade and recover, and what does it cost? This is about limits and failure points in dollars.
Load testing is always about resilience and money. We find where the system cracks and calculate what it will cost the business.
Test types (all translated to money language):
- Load testing — "How many users can we serve before we start losing money?"
- Stress testing — "At what load, and in which component, does the system give out, and what does that failure cost?"
- Spike testing — "If we get a traffic surge from HackerNews/top blogger, at what RPS do we fall and how much do we lose?"
- Soak testing — "What's the cost of degradation or memory leaks after an hour of peak traffic?"
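To make the trickier profiles concrete, here is a minimal sketch of spike and soak in k6 terms (the tool used throughout this article); the durations, VU targets, and endpoint are illustrative assumptions, not recommendations:
```javascript
import http from "k6/http";

// k6 reads one `export const options` per script, so spike and soak live in
// separate scripts. All durations and VU targets here are illustrative.

// spike.js: sudden surge and recovery
export const options = {
  stages: [
    { duration: "2m", target: 100 },   // normal traffic
    { duration: "30s", target: 1000 }, // the surge hits
    { duration: "3m", target: 1000 },  // hold the surge
    { duration: "2m", target: 100 },   // back to normal
  ],
};

export default function () {
  // Placeholder endpoint; point this at a money-critical flow.
  http.get("https://api.example.com/search");
}

// soak.js would instead hold a moderate target (e.g. 200 VU) for 1-2 hours
// to surface memory leaks and slow degradation.
```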
Tools: What I Use and Why
Tool choice is secondary. The main thing is to answer business questions, not draw pretty graphs. We use k6 because its scripts are code in the repository that lives in CI/CD and is cheap to maintain. You can take Gatling or JMeter — what matters more is that test results answer the question "how much money do we lose with the current architecture?"
Tool selection principles:
- Test as code: lives next to product code and gets reviewed.
- CI/CD: one step to run, no manual clicks.
- Business metrics: simple thresholds, custom metrics, and money linkage.
- Transparency: reports are understood by developers, product managers, and management.
| Principle / Tool | k6 | Apache JMeter | Gatling |
|---|---|---|---|
| Test as code | ✅ JS/TS | ❌ XML/GUI | ✅ Scala DSL |
| CI/CD integration | ✅ Out of the box | ⚠️ Cumbersome | ✅ Excellent |
| Business metrics | ✅ Thresholds, custom metrics | ⚠️ Available but harder | ✅ Rich reports |
| Best for | DevOps, startups | QA, legacy projects | JVM teams, high loads |
We chose k6 because it fully covers these principles. Any other tool should meet the same criteria.
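As a sketch of the "business metrics" principle: in k6, a custom metric can count lost revenue directly during the run. The metric name, endpoint, and order value below are assumptions for illustration:
```javascript
import http from "k6/http";
import { check } from "k6";
import { Counter } from "k6/metrics";

// Hypothetical custom metric: dollars of revenue put at risk by failed checkouts.
const lostRevenueUSD = new Counter("lost_revenue_usd");
const AVG_ORDER_VALUE = 20; // assumption taken from the business analysis

export const options = {
  thresholds: {
    // Fail the run if the simulated loss exceeds a budgeted amount.
    lost_revenue_usd: ["count<1000"],
  },
};

export default function () {
  const res = http.post(
    "https://api.example.com/checkout",
    JSON.stringify({ cart: "demo" }),
    { headers: { "Content-Type": "application/json" } }
  );
  const ok = check(res, { "checkout succeeded": (r) => r.status === 200 });
  if (!ok) {
    // Every failed checkout is counted as a potentially lost order.
    lostRevenueUSD.add(AVG_ORDER_VALUE);
  }
}
```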
k6 — My Main Stack
I use k6 as default: quick start, transparent metrics, easy to automate.
Why k6:
- Scripts in JavaScript/TypeScript — familiar syntax, fast development
- Built-in metrics: response time, RPS, error rate
- Integration with Grafana Cloud or local stack
- CLI-first: easy in CI/CD
- Open source and free
Why k6 Is Not Just a Tool, But a Paradigm Shift
JMeter and its successors were born for separate QA teams and GUI configs. k6 is a tool of the DevOps era, where developer and engineer are the same person: JavaScript instead of XML, code instead of clicks, CI/CD instead of manual runs. Choosing k6 is a decision in favor of making load tests part of the codebase, not artifacts on the side.
k6 forces you to think of tests as code: code can be reviewed, tested, and maintained. This changes team culture.
I keep technical examples and scripts in the repository: it's code that we review and run in CI. The methodology matters here, not syntax.
Alternatives (When and Why)
If k6 doesn't work, I choose a tool for the team and context:
Apache JMeter:
- GUI for visual configuration (if team isn't JS-savvy)
- Mature plugin ecosystem
- Cons: heavier, XML configs, harder in CI
Gatling:
- Scala DSL (for JVM teams)
- Beautiful reports out of the box
- Good for complex scenarios with state
Locust:
- Python scripts
- Distributed load
- Good if stack is already Python
Artillery:
- YAML configs (simpler for simple cases)
- Built-in WebSocket, Socket.io support
- Cons: less flexible for complex scenarios
k6 covers 90% of cases: familiar JS, convenient CLI, Docker and CI runs, free metrics export to Grafana Cloud.
The Steel Algorithm (From Scratch)
- Business analysis: Calculate the cost of 1% errors and 1 second latency for key scenarios.
- Tool: Choose stack (k6/Gatling/JMeter) based on "test as code", CI/CD, metrics, transparency principles.
- Scenarios: Describe 2-3 money-critical flows, document "where it will break" hypotheses.
- Run: Load profile with gradual ramp-up, SLA thresholds, CI/CD execution.
- Investigation: Correlate test and infrastructure metrics, find root cause.
- Report: Hypothesis → data → action → verification, linked to money.
Methodology: Step 0 and 5 Steps to Meaningful Results
Running load for pretty graphs is a waste of time. Each step has a business question that we answer with numbers and actions.
Example thread: Taxi service, scenario "search → order → payment". We know from analytics that when p95 > 3s on car search, conversion drops by 15%. At peak load of 1000 RPS, that's 150 lost requests per second. With average order value $20 and 10% conversion to order, that's ~$300 in lost revenue per minute. Load testing showed the geocoder cracks at 800 RPS — meaning potential loss of ~$2400 per peak hour until we fix the geocoder.
0. Formulate Hypotheses
Before running the test, the team answers in writing: where will the system break first when load increases by N times? Examples: "DB will hit IOPS at 300 RPS", "cache will run out at 500 VU", "external payment gateway will start timing out". We then verify these hypotheses with the test.
1. Define Scenarios
Choose 2-3 critical user flows, ignore the rest in the first run. Business question: which 3 scenarios generate 80% of revenue or create peak load?
- Auth + view dashboard
- Search product + add to cart + checkout
- Create entry + list entries
Example scenario:
```javascript
import http from "k6/http";
import { group, check } from "k6";

export default function () {
  group("User Journey: Login → Dashboard → Logout", () => {
    // 1. Login (JSON body; this example assumes the API expects application/json)
    const loginRes = http.post(
      "https://api.example.com/auth/login",
      JSON.stringify({ email: "test@example.com", password: "password123" }),
      { headers: { "Content-Type": "application/json" } }
    );
    check(loginRes, { "login success": (r) => r.status === 200 });
    const token = loginRes.json("token");

    // 2. Get dashboard
    const headers = { Authorization: `Bearer ${token}` };
    const dashRes = http.get("https://api.example.com/dashboard", { headers });
    check(dashRes, { "dashboard loaded": (r) => r.status === 200 });

    // 3. Logout
    http.post("https://api.example.com/auth/logout", null, { headers });
  });
}
```
2. Configure Load Profile
Ramp up load in stages: turning on 10000 VU at once gives false failures. Business question: at what traffic growth do metrics start costing us money?
```javascript
export const options = {
  stages: [
    { duration: "1m", target: 50 },  // Warmup
    { duration: "3m", target: 50 },  // Baseline
    { duration: "2m", target: 150 }, // Growth (peak hours)
    { duration: "5m", target: 150 }, // Peak load
    { duration: "2m", target: 300 }, // Stress test
    { duration: "3m", target: 300 }, // Hold stress
    { duration: "2m", target: 0 },   // Gradual ramp-down
  ],
};
```
3. Set Thresholds
Thresholds turn a run into an SLA check, not just numbers for numbers' sake. Business question: what p95 latency is the profitability threshold for the key scenario?
```javascript
export const options = {
  thresholds: {
    // Latency
    http_req_duration: [
      "p(50)<200",  // Median < 200ms
      "p(95)<500",  // 95th percentile < 500ms
      "p(99)<1000", // 99th percentile < 1s
    ],
    // Availability
    http_req_failed: ["rate<0.01"], // < 1% errors
    // Throughput
    http_reqs: ["rate>100"], // Minimum 100 RPS
    // Custom checks
    "checks{type:auth}": ["rate>0.99"], // 99% successful logins
  },
};
```
4. Run Test
Run in CI/CD so the same scenario runs under identical conditions. Locally only for debugging.
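One detail worth knowing for CI runs: k6 thresholds can abort a clearly failing run early, so the pipeline doesn't spend twenty minutes confirming the obvious. A minimal sketch, reusing the SLA values from the thresholds example above:
```javascript
export const options = {
  thresholds: {
    // Same SLA as above, but stop the run early if it is clearly failing.
    // delayAbortEval gives the system a minute of warm-up before judging.
    http_req_duration: [{ threshold: "p(95)<500", abortOnFail: true, delayAbortEval: "1m" }],
    http_req_failed: [{ threshold: "rate<0.01", abortOnFail: true, delayAbortEval: "1m" }],
  },
};
```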
5. Analyze Results
First, look at run metrics. Business question: how much money do we lose at current metrics and where exactly?
- Response time percentiles (p50, p95, p99)
- Error rate (by codes: 4xx vs 5xx)
- Throughput (RPS)
- Virtual Users (how many sustained)
In parallel, check infrastructure:
- CPU/Memory on servers (via Prometheus/Grafana)
- Database queries per second, slow queries
- Cache hit rate
- Network I/O
If p95 < 500ms but p99 = 5s — you have long tails. Look for slow DB queries or external API timeouts.
Key Metrics for Decisions
Mandatory visualization: latency vs throughput graph (flat line ideally, spike in reality) and the "latency — error rate — throughput" triangle, plus resource saturation graphs (CPU/IO/network) on one Grafana screen.
Latency (p95/p99):
- p95 — your SLA for the main user mass.
- p99 — your conscience: if p95 = 200ms but p99 = 2000ms, you have "core and tail". The tail is a systemic issue: slow DB queries, locks, GC.
Error Rate:
- 1% errors at 1000 RPS — that's 10 failed requests per second.
- Over 10 minutes of testing, that's 6000 errors — that's an incident, not statistical noise.
Throughput (RPS):
- Useless by itself; value is in correlation with latency.
- Look at "Latency vs Throughput" graph: line should be gentle until the limit.
- If latency spikes as RPS grows — you found the throughput ceiling of the bottleneck.
Resource Saturation:
- CPU, disk, network, DB connections. RPS at 90-100% CPU or IOPS signals infrastructure will crack, not the application.
- Correlate latency spikes with resource usage peaks.
Apdex (Application Performance Index):
- Compressed satisfaction metric for business.
- Configure S/T/F for your product, otherwise the number is meaningless.
- Formula: (Satisfied + Tolerating/2) / Total. For example, Satisfied < 200ms, Tolerating up to 800ms, Frustrated — anything slower. For a trading platform, Satisfied is closer to 50ms; for a CMS, closer to 500ms.
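A back-of-the-envelope example of the formula, with invented counts just to show the arithmetic:
```javascript
// Apdex = (satisfied + tolerating / 2) / total
// The counts below are invented purely to illustrate the calculation.
function apdex(satisfied, tolerating, total) {
  return (satisfied + tolerating / 2) / total;
}

// Say a run produced 10,000 requests: 8,500 under 200ms (satisfied),
// 1,200 between 200ms and 800ms (tolerating), 300 slower (frustrated).
const score = apdex(8500, 1200, 10000);
console.log(score.toFixed(2)); // 0.91: looks "good", but the 300 frustrated users are your p99 tail
```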
Infrastructure (Correlation)
- CPU Usage — > 80% sustained → bottleneck
- Memory — grows linearly → memory leak
- Database connections — pool exhausted → tune
- Cache hit rate — low → poor caching strategy
How I Find Root Cause (Brief)
- High p99 + CPU/IO spikes on DB → check slow query log, EXPLAIN ANALYZE, pt-query-digest.
- 5xx errors without CPU growth → check DB connection limits, app pools, external API timeouts.
- Throughput stuck, low CPU → look for global locks, network issues, or rate limiting.
- RPS grows, latency jumps, cache hit drops → catch cache misses and reconsider TTL/strategy.
Report: How to Turn Numbers Into Actions
This is the heart of the approach: the report immediately turns into a backlog of actions and hypotheses.
Report Structure (My Template)
I use this template so the report immediately becomes an action backlog:
```markdown
# Load Test: API v2.0
## Goal
Verify API readiness for 3x load increase (from 500 to 1500 RPS).
## Scenario
- User Journey: Login → Get Dashboard → Logout
- Load profile: 50 → 150 → 300 VU
- Duration: 18 minutes
## Results
### ✅ Passed Thresholds
- p95 latency: 420ms (threshold: < 500ms)
- Error rate: 0.3% (threshold: < 1%)
- Throughput: 1200 RPS (expected: 1000 RPS)
### ❌ Issues
- p99 latency: 3.2s (threshold: < 1s)
- CPU on DB: 92% at peak
- Slow queries: `/users` endpoint → N+1 queries
## Bottlenecks (by priority)
1. **Database N+1 queries** — `/users` makes 50+ DB queries
   - **Action:** add eager loading
   - **ETA:** 2 days
   - **Effect:** p99 latency → < 800ms
2. **CPU on DB** — reaches 92% at 300 VU
   - **Action:** upgrade instance (4 → 8 vCPU)
   - **ETA:** 1 day
   - **Effect:** headroom to 500 VU
3. **Cache hit rate** — only 65% for `/dashboard`
   - **Action:** increase cache TTL from 5min to 15min
   - **ETA:** 1 day
   - **Effect:** reduce DB load by 20%
## Recommendations
- Optimize N+1 queries **critical before release**
- Upgrade DB instance **recommended**
- Tune cache **can be postponed**
## Hypothesis and Verification (mandatory for each issue)
- **Problem:** p99 latency: 3.2s on `/users`
- **Hypothesis:** DB logs show N+1 queries on this endpoint
- **Action:** add eager loading
- **Verification:** rerun same test and confirm p99 < 800ms
## Graphs
[Attach Grafana dashboard screenshot]
## Next Steps
- [ ] Fix N+1 queries
- [ ] Rerun test after optimization
- [ ] Stress test to find limit (500+ VU)
```
The report should answer "What to do?", not "What happened?". Numbers without actions are useless.
Common Pitfalls
0. Testing the Wrong System
Problem: Running tests in staging that's weaker than production, with synthetic data and mocked external APIs. Results are pure noise and false security.
Solution: The test environment must match production in hardware and software. Data must be representative. All external dependencies should either be deployed in the test environment or be the prod versions (with their owners' consent).
Anti-Pattern Report
Typical "report" that answers no business questions:
- Conducted load testing
- Reached 5000 RPS
- p95 = 2.3s
- Recommendation: "optimize DB"
Why useless:
- Unclear whether we can operate at 5000 RPS at all (with p95 = 2.3s, almost certainly not).
- Unclear what exactly to optimize in DB.
- Unclear if optimizations are worth the cost.
What we need instead: at what RPS we stay within the p95 SLA, how much money we lose above that threshold, and which component breaks first.
1. Testing on Different Hardware
Problem: Staging is weaker than production, numbers mean nothing.
Solution: Run against production during off-peak hours, or match staging resources to prod.
2. Rate Limiter Chokes, Not Load
Problem: Hitting limits (Cloudflare, Nginx), not code.
Solution: Disable or raise limits for runner IP.
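If the limits can't simply be raised, another option is to tag test traffic with a marker header so the proxy or WAF can whitelist it; the header name here is an assumption to agree on with whoever owns the edge:
```javascript
import http from "k6/http";

export default function () {
  // Hypothetical marker header so Cloudflare/Nginx rules can bypass rate limits
  // for load-test traffic only.
  const params = { headers: { "X-Load-Test": "k6" } };
  http.get("https://api.example.com/search", params);
}
```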
3. No Warmup
Problem: First requests are slow due to cold cache and JIT.
Solution: Add warm-up stage before main test:
```javascript
export const options = {
  stages: [
    { duration: "1m", target: 10 },  // Warm-up
    { duration: "5m", target: 100 }, // Main test
  ],
};
```
4. No Infrastructure Metrics
Problem: k6 is green but server still crashes.
Solution: Watch CPU, memory, disk, network alongside k6 metrics.
5. One Huge Run
Problem: 2-hour test, unclear when it broke.
Solution: Split into micro-tests:
- Baseline: 50 VU, 5 min
- Peak load: 150 VU, 10 min
- Stress: 300 VU, 5 min
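One way to codify these three runs is as separate, labeled k6 scenarios in a single script, so the results show exactly at which stage metrics degraded. A sketch; executors, timings, and the endpoint are assumptions:
```javascript
import http from "k6/http";

// Three labeled micro-tests in one script instead of one opaque two-hour run.
export const options = {
  scenarios: {
    baseline: {
      executor: "constant-vus",
      vus: 50,
      duration: "5m",
    },
    peak: {
      executor: "constant-vus",
      vus: 150,
      duration: "10m",
      startTime: "5m", // starts right after baseline finishes
    },
    stress: {
      executor: "constant-vus",
      vus: 300,
      duration: "5m",
      startTime: "15m",
    },
  },
};

export default function () {
  http.get("https://api.example.com/dashboard");
}
```
Each scenario is tagged separately in the output, so baseline, peak, and stress can be compared side by side in the same report.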
CI/CD Integration
GitLab CI Example
```yaml
load-test:
  stage: test
  image: grafana/k6:latest
  script:
    - k6 run --out json=results.json tests/load/api.js
  artifacts:
    # k6's JSON output is not JUnit format, so store it as a plain artifact
    paths:
      - results.json
  only:
    - main
  when: manual # Run manually before release
```
GitHub Actions Example
```yaml
name: Load Test
on:
  workflow_dispatch: # Manual trigger
jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run k6 test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/api.js
          cloud: true
          token: ${{ secrets.K6_CLOUD_TOKEN }}
```
Don't run load tests on every commit — it's expensive and slow. Run before releases or on schedule (nightly).
Business Insights
Load testing is about money and reputation, not just graphs.
ROI of Load Testing
Simple economics:
- Test cost: 1-2 engineer-days (~$500-1000)
- Production failure cost: hours of downtime × revenue loss (~$10k-100k+)
- ROI: 10x-100x
Incident cost formula:
(Average order × Conversion × Affected traffic share × Outage duration) + (Support cost × Ticket count) + (Reputation cost × Loss coefficient)
Load testing lets you plug in real numbers: "at 300 RPS and 10% checkout errors, we lose X dollars per minute".
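A minimal sketch of plugging numbers into that statement; every constant is an assumption to be replaced with your own analytics:
```javascript
// Rough per-minute loss estimate for the "300 RPS, 10% checkout errors" example.
// All constants are assumptions; swap in real analytics data.
const rps = 300;             // requests per second hitting checkout
const errorRate = 0.10;      // share of those requests failing under load
const conversion = 0.10;     // share of successful requests that become orders
const avgOrderValueUSD = 20; // average order value

const failedPerSecond = rps * errorRate;                       // 30 failed requests/s
const lostOrdersPerMinute = failedPerSecond * conversion * 60; // ~180 orders/min
const lostRevenuePerMinute = lostOrdersPerMinute * avgOrderValueUSD;

console.log(`~$${lostRevenuePerMinute} lost per minute of degradation`); // ~$3600
```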
When Load Testing Is Critical
- E-commerce — sales, Black Friday
- Streaming — new content launches
- Fintech — payouts, mass transactions
- SaaS — viral growth, press mentions
Questions the CEO Asks
- What error percentage are we willing to tolerate and how much is that in absolute daily requests?
- At what p95 does conversion drop and how much money is that per hour?
- Is it cheaper to scale architecture or budget 0.5% losses in the business plan?
Conclusion
Load testing isn't a pre-release checklist item, but a resilience testing practice that directly impacts money. It's a way to turn an unpredictable incident into a planned optimization task. In each run, you answer the question: how much money do we lose with the current architecture and what needs to be done to stop losing it?
P.S. Your first serious load test will almost certainly reveal 2-3 bottlenecks you didn't know about. Better you find out than your users.
Starting Monday, take one step:
- Choose one business-critical scenario;
- Calculate how much money you lose at 1% errors or p95 > 1s;
- Run a test that shows at what load you hit these thresholds.
First results in 2 days, understanding of architectural decision costs in 2 weeks.
See also:
- Proxmox: Home Data Center — where to run tests
- Self-host Supabase — how I tested DB migration under load

