3:47 AM. Saturday. Production has been down for 4 hours. A team of 6 sits on Zoom. Eyes bloodshot. Customers rage in chat. CFO calculates losses: $87,000 and climbing.
9:00 AM. Monday. Meeting. Red-faced CTO: "WHO caused this error?!"
Senior stays quiet. Mid-level stares at the floor. Junior trembles (it was his code). Everyone silent. A week later — the exact same incident. And again, no one knows anything.
I've seen this scene many times across different projects. And it always ends the same: the team learns nothing, incidents repeat, people burn out.
Alternative: Google, Netflix, Amazon. Their production crashes too. But they crash less often, recover faster, and don't repeat mistakes. The secret? Postmortem culture.
I've worked with incidents for 15 years. Been on teams where postmortems turned into witch hunts. And teams where they became the best growth tool. Let's break down how to build a culture where mistakes make the team stronger, not destroy it.
What is a postmortem (and why 90% of companies do it wrong)
A definition that works
A postmortem (also called a post-mortem, incident review, or retrospective) is a structured incident analysis aimed at:
- Understanding causes (not who's guilty, but why)
- Preventing recurrence (action items)
- Improving the system (processes, monitoring, automation)
- Teaching the team (shared knowledge)
This is NOT:
- ❌ Interrogation ("who's to blame?")
- ❌ Formality ("fill in the template to tick a box")
- ❌ Punishment ("fine for the error")
- ❌ Accusation ("you broke production")
This IS:
- ✅ Learning (how to avoid in the future)
- ✅ Systems thinking (why the system allowed this)
- ✅ Growth (how to become better)
- ✅ Transparency (everyone knows what happened)
Main postmortem principle: we don't look for who's guilty. We look for systemic causes that allowed a person to make a mistake. People don't break systems. Systems allow people to break them.
Why 90% of postmortems don't work
Typical scenario:
- Incident → everyone panics
- Fix → quickly closed
- Postmortem → someone wrote a document in 30 minutes
- Document goes to Confluence → no one reads it
- A month later → exact same incident
- Everyone surprised → "but we wrote a postmortem!"
What went wrong:
- Postmortem written for compliance, not action
- No concrete action items
- No owners for execution
- No follow-up (no one checks if completed)
- Blame culture (people afraid to admit mistakes)
Google SRE statistics:
- 60% of incidents repeat if you don't do postmortems
- 30% of incidents repeat if postmortem is formal
- 5% of incidents repeat with postmortems with action items and follow-up
Blame-Free Culture: foundation of effective postmortems
What "blame-free" means
Bad approach (blame culture):
CTO: "Who deployed the bug to production?"
Developer: "I did..."
CTO: "Why weren't there tests?!"
Developer: "Didn't have time, deadline..."
CTO: "That's your responsibility! Next time, there's a fine."
Result:
- Developer demoralized
- Team afraid to admit mistakes
- Incidents hidden or silenced
- No one learns
Correct approach (blame-free culture):
Tech Lead: "What allowed the bug to reach production?"
Developer: "I didn't have time for tests due to deadline."
Tech Lead: "Why didn't the deadline account for testing time?"
Product: "We didn't factor that into estimates."
Tech Lead: "Action items:
1. Change estimation process: +20% time for tests
2. Implement automatic CI checks
3. Make staging mandatory before production"
Result:
- Systemic improvements
- Team openly discusses issues
- Incidents don't repeat
- Everyone learns
Golden rule of blame-free: "We don't ask WHO, we ask WHY the system allowed this to happen." If a person could break production with one action — the problem is the system, not the person.
How to implement blame-free culture
1. Start with top management
If CTO yells "who's to blame?" — there will be no culture.
Rules for leaders:
- Never ask "who's to blame?"
- Always ask "what in the system allowed this?"
- Publicly praise honest postmortems
- Admit your own mistakes
Example from Amazon:
Jeff Bezos implemented the rule: "A mistake is data for improvement, not a reason for punishment." Amazon has an internal "Just Do It Award" — for teams that conducted the best postmortems after failures.
2. Postmortem for every serious incident
Severity criteria (when postmortem is needed):
- Production downtime > 30 minutes
- Data loss (any amount)
- Security incident
- Financial losses > $1,000
- Customer complaints (> 10 tickets)
- SLA violation
Even if an incident is minor but keeps recurring, do a postmortem.
3. Mandatory rule: "Thanks for honesty"
Practice from Etsy:
When a developer admits a mistake, leader's first reaction: "Thank you for being honest. Let's figure out how to prevent this."
NOT: "How could you?!"
Result: at Etsy, hidden incidents decreased by 70%.
4. Postmortems for successful projects too
Not just for failures. Postmortem after successful release:
- What worked well?
- What can be improved?
- Which risks did we take that paid off?
This shows: postmortem is not punishment, but a learning tool.
Structure of an effective postmortem
Postmortem template (tested on 100+ incidents)
# Postmortem: [Incident Name]
**Incident Date:** 2025-12-15
**Start Time:** 22:47 UTC
**Recovery Time:** 03:15 UTC
**Duration:** 4 hours 28 minutes
**Severity:** Critical (P0)
**Postmortem Author:** [Name]
**Participants:** [List]
---
## Executive Summary
**What happened:** [1-2 sentences, understandable even to CEO]
Example: "Production database crashed due to disk space exhaustion.
Service was unavailable to all users for 4 hours 28 minutes."
**Impact:**
- Downtime: 4h 28min
- Affected users: 100,000
- Financial losses: ~$87,000
- Reputation damage: 247 negative reviews
**Root Cause:**
[One sentence]
Example: "Logs weren't rotated, disk filled in 3 days."
**Resolution:**
[What was done to restore]
Example: "Manually cleaned old logs, restarted DB."
**Preventing recurrence:**
[Key action items]
Example:
1. Enable automatic log rotation
2. Set up alerts for disk usage > 80%
3. Add runbook for this scenario
---
## Timeline
| Time (UTC) | Event | Action |
| ---------- | -------------------------------------- | --------------------------- |
| 22:47 | Monitoring showed DB connection errors | - |
| 22:52 | On-call engineer received alert | Started investigation |
| 23:15 | Determined: disk 100% full | Attempted to free space |
| 23:45 | Cleaned temp files — didn't help | Escalation to DBA |
| 00:30 | DBA identified: app logs issue | Started log cleanup |
| 01:15 | Freed 50GB, restarted DB | Service partially available |
| 02:00 | Full replication recovery | Monitoring stable |
| 03:15 | Incident declared closed | Post-incident monitoring |
---
## Root Cause Analysis
### What happened (technical details)
[Detailed technical description]
Example:
Application writes logs to /var/log/app/application.log
Logrotate configuration was disabled after server migration 3 months ago
Logs accumulated at ~20GB/day
300GB disk filled in 15 days
PostgreSQL couldn't create WAL files → crash
### 5 Whys method
1. **Why did production crash?**
- PostgreSQL couldn't write WAL file (Write-Ahead Log)
2. **Why couldn't PostgreSQL write WAL file?**
- No disk space (100% full)
3. **Why did disk fill up?**
- Application logs weren't rotated and accumulated
4. **Why weren't logs rotated?**
- Logrotate configuration was lost during server migration
5. **Why wasn't lost configuration detected?**
- No automatic check of critical configs after migration
- No disk usage monitoring
**Root Cause:**
Absence of automatic configuration verification after server migrations.
### Contributing Factors
- No disk usage alerts
- Insufficient migration process testing
- No runbook for this scenario
- DBA not included in on-call rotation immediately
---
## What Went Well
- ✅ On-call engineer responded in 5 minutes (SLA: 15 minutes)
- ✅ Escalation to DBA was timely
- ✅ Team worked synchronously in Slack
- ✅ Customer communication via status page was clear
- ✅ Rollback plan was ready (though not needed)
---
## What Went Wrong
- ❌ Monitoring didn't cover disk space
- ❌ No automatic config check after migration
- ❌ Logs not rotated for 3 months — no one noticed
- ❌ DBA engaged after 1 hour (should be immediate)
- ❌ No runbook for "disk full" scenario
---
## Action Items
### Prevent (prevent recurrence)
| # | Action Item | Owner | Deadline | Status |
| --- | -------------------------------------- | ------- | ---------- | -------------- |
| 1 | Enable logrotate on all servers | DevOps | 2025-12-20 | ✅ Done |
| 2 | Set up alert: disk usage > 80% | SRE | 2025-12-21 | ✅ Done |
| 3 | Automatic config check after migration | DevOps | 2025-12-30 | 🔄 In Progress |
| 4 | Add DBA to primary on-call rotation | Manager | 2025-12-22 | ✅ Done |
### Detect (improve detection)
| # | Action Item | Owner | Deadline | Status |
| --- | ------------------------------------------- | --------- | ---------- | -------------- |
| 5 | Dashboard for disk space monitoring | SRE | 2025-12-25 | 🔄 In Progress |
| 6 | Weekly monitoring log review | Tech Lead | Ongoing | ✅ Done |
| 7 | Alert for anomalous log growth (> 10GB/day) | SRE | 2025-12-28 | 📅 Planned |
### Mitigate (speed up recovery)
| # | Action Item | Owner | Deadline | Status |
| --- | ------------------------------------ | ------ | ---------- | ---------- |
| 8 | Runbook: "Disk Full Recovery" | SRE | 2025-12-23 | ✅ Done |
| 9 | Automatic script for old log cleanup | DevOps | 2026-01-10 | 📅 Planned |
| 10 | Simulate "disk full" on staging | SRE | 2026-01-15 | 📅 Planned |
### Learn (team education)
| # | Action Item | Owner | Deadline | Status |
| --- | ------------------------------------------ | --------- | ---------- | -------------- |
| 11 | Workshop: "Disk Management Best Practices" | DBA | 2025-12-27 | 🔄 In Progress |
| 12 | Update onboarding: add monitoring section | Tech Lead | 2026-01-05 | 📅 Planned |
---
## Lessons Learned
1. **Monitoring must cover basic infrastructure metrics** (CPU, RAM, Disk, Network)
2. **Critical configurations must be under version control** (Infrastructure as Code)
3. **Runbooks save hours during incidents** (our runbook would've saved 2 hours)
4. **Test migrations on staging with full verification suite**
5. **On-call rotation should include experts for each component**
---
## References
- [Incident Slack Thread](https://company.slack.com/archives/incidents/p1734307620)
- [Monitoring Dashboard](https://grafana.company.com/d/incident-2025-12-15)
- [Database Logs](https://logs.company.com/query?incident=2025-12-15)
- [Status Page Updates](https://status.company.com/incidents/2025-12-15)
---
## Sign-off
**Reviewed by:**
- [ ] Tech Lead
- [ ] SRE Lead
- [ ] DevOps Lead
- [ ] Engineering Manager
**Approved by:**
- [ ] CTO
**Post-Mortem Meeting:**
- Date: 2025-12-18
- Attendees: [List]
- Recording: [Link]

About the Executive Summary: this is the most important part. The CEO and business stakeholders read only this. Write it briefly, without technical jargon, focusing on impact and resolution.
Postmortem process
Phase 1: Data collection (right after incident)
Immediately after service restoration:
- Create postmortem document (while memory is fresh)
- Collect timeline from logs, Slack, monitoring
- Save all artifacts:
- Logs (before deletion)
- Dashboard screenshots
- Slack threads
- Git commits related to incident
Tools:

```bash
# Export logs for the incident period
kubectl logs deployment/api --since=6h > incident-logs.txt

# Grafana dashboard screenshots
# (manually or via the Grafana API)

# Export the Slack thread
# (use Slack Export or screenshots)
```

Phase 2: Meeting preparation (24-48 hours after incident)
Who writes the postmortem:
- Best option: person who led incident response
- Alternative: Tech Lead or SRE who participated
- NOT: person who's "at fault" (creates bias)
What to prepare:
- Postmortem draft using template above
- Timeline with data from logs
- List of participants for meeting
- Discussion questions
Phase 3: Meeting (postmortem meeting)
Duration: 60-90 minutes
Participants:
- Everyone who participated in incident
- Tech Lead / Engineering Manager
- Product representative (for business impact context)
- SRE / DevOps (for infrastructure questions)
- Optional: CEO/CTO (for critical incidents)
Agenda:
- 0-10 min: Executive Summary (what happened, impact)
- 10-30 min: Timeline walkthrough (event chronology)
- 30-50 min: Root Cause Analysis (5 Whys, diagrams)
- 50-70 min: Action Items brainstorming (what to do)
- 70-80 min: Action item prioritization
- 80-90 min: Assign owners and deadlines
Meeting rules:
Most important rule: moderator must stop blame. As soon as someone says "this is Bob's fault" → stop → rephrase: "what in the system allowed this?"
Phrases to block:
- ❌ "This is your fault"
- ❌ "You should have checked"
- ❌ "How could you allow this?"
Phrases to encourage:
- ✅ "What in the review process allowed this to slip through?"
- ✅ "Why didn't we have automatic verification?"
- ✅ "How can we improve the system?"
Phase 4: Document finalization (within a week)
Who:
- Postmortem author updates document based on meeting
- Adds all action items with owners and deadlines
- Sends for review to all participants
Review:
- Tech Lead
- Engineering Manager
- SRE Lead
- CTO (for critical incidents)
Publication:
- Internal wiki (Confluence, Notion, GitHub)
- Email to entire engineering team
- For critical: email to entire company
Optional (for advanced teams):
- Public postmortem (like Google, AWS)
- Presentation at All-Hands meeting
- Company blog post
Phase 5: Follow-up (critical!)
Without follow-up, postmortem is useless.
Process:
- Weekly action item review (in a team sync or in Jira)
- Status tracking:
- ✅ Done
- 🔄 In Progress
- 📅 Planned
- 🚫 Blocked (with reason)
- Escalation: if action item blocked > 2 weeks → escalate to manager
Metrics:
- % completed action items: should be > 80% after a month
- Action item closure time: average — 2-3 weeks
- Incident recurrence: 0 (if all action items completed)
Statistics: 70% of action items from postmortems aren't completed without follow-up process. This is the main cause of recurring incidents.
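For the weekly review, open action items can be pulled automatically instead of collected by hand. A minimal sketch, assuming Jira Cloud's REST API, a "postmortem" label on every action item, jq installed, and JIRA_BASE_URL / JIRA_USER / JIRA_TOKEN set in the environment:

```bash
#!/usr/bin/env bash
# Weekly follow-up sketch: list open postmortem action items from Jira.
set -euo pipefail

JQL='labels = postmortem AND statusCategory != Done ORDER BY duedate ASC'

curl -s -u "${JIRA_USER}:${JIRA_TOKEN}" \
  --get "${JIRA_BASE_URL}/rest/api/2/search" \
  --data-urlencode "jql=${JQL}" \
  --data-urlencode "fields=summary,assignee,duedate,status" |
  jq -r '.issues[]
         | [.key,
            .fields.status.name,
            (.fields.assignee.displayName // "unassigned"),
            (.fields.duedate // "no due date"),
            .fields.summary]
         | @tsv'
```

Post the output to the team channel before the weekly sync so the review starts from data, not from memory.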
Best practices from Google SRE, Netflix, Amazon
1. Blameless Post-Mortem (Google)
Rule: Even if a person clearly made a mistake, we ask "why did the system allow this?"
Example:
A developer accidentally deleted the production database with the command:

```bash
rm -rf /data/postgres
```
Bad postmortem:
"Developer executed dangerous command. Need to train team."
Good postmortem:
"System allowed rm -rf execution on production server.
Action items:
1. Restrict SSH access: production only for SRE
2. Mandatory confirmation for dangerous commands (rm, drop, truncate)
3. Automatic backups every 6 hours
4. Immutable infrastructure: server deletion through Infrastructure as Code"
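Action item 2 can be prototyped with a simple wrapper. This is a sketch only, with assumed hostname prefix and wrapper name; real protection comes from restricted access and immutable infrastructure, not from aliases:

```bash
#!/usr/bin/env bash
# Hypothetical "safe-rm" wrapper: demand explicit confirmation before rm on production hosts.
set -eo pipefail

host=$(hostname)

# Assumption: production hosts are named prod-*
if [[ "${host}" == prod-* ]]; then
  echo "WARNING: about to run 'rm $*' on PRODUCTION host ${host}."
  read -r -p "Type the hostname to confirm: " answer
  if [[ "${answer}" != "${host}" ]]; then
    echo "Confirmation failed, aborting." >&2
    exit 1
  fi
fi

exec /bin/rm "$@"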
2. Chaos Engineering (Netflix)
After postmortem: simulate the incident on staging/production.
Why:
- Verify action items actually work
- Train team on this scenario
- Find new issues before they become incidents
Example:
# Simulate "disk full"
# On staging server
dd if=/dev/zero of=/var/log/fill bs=1M count=10000
# Verify:
# 1. Did alert trigger?
# 2. Did on-call notification arrive?
# 3. Did automatic cleanup work?
# 4. Is there a runbook? Is it clear?3. Public Post-Mortems (AWS, GitHub, GitLab)
Publish postmortems for customers.
Why:
- Transparency
- Customer trust
- Business accountability
- PR (good postmortems attract customers)
Public postmortem structure:
- Brief description (what, when, how long)
- Impact (how many users)
- Root cause (simplified, no technical details)
- What we're doing to prevent
Don't include:
- People's names
- Internal processes
- Technical details (that could be used for attacks)
4. Incident Severity Levels (Amazon)
Incident classification:
| Severity | Criteria | Example | Postmortem | Deadline |
|---|---|---|---|---|
| P0 (Critical) | Production down, all users | Complete service outage | Mandatory | 24 hours |
| P1 (High) | Partial downtime, > 50% users | Database unavailable for some requests | Mandatory | 3 days |
| P2 (Medium) | Performance degradation | Slow queries, timeouts | Recommended | 1 week |
| P3 (Low) | Minor issues | UI bug, doesn't affect functionality | Optional | - |
Why it matters:
- Effort prioritization
- Clear when postmortem is needed
- Escalation process
5. Incident Commander Role (Netflix, PagerDuty)
Assign Incident Commander for every P0/P1 incident.
Incident Commander role:
- Coordinates team actions
- Makes decisions (rollback, escalation)
- Handles business/customer communication
- Documents timeline
The Incident Commander is NOT:
- Necessarily the most senior person
- Necessarily the one doing the fixing
- A job title: it's a role, not a position
Example:
22:47 - Incident
22:50 - Incident Commander assigned: Alice (SRE)
22:52 - Alice creates Slack channel #incident-2025-12-15
22:55 - Alice assigns roles:
- Bob (Backend) - investigation
- Charlie (DBA) - database recovery
- David (DevOps) - infrastructure check
23:00 - Alice updates status page: "Investigating"
23:30 - Alice escalates to CTO (incident > 30 min)
What to do with postmortem results
1. Action Items Tracking (most important)
Without tracking action items, postmortem is useless.
Tools:
- Jira/Linear: create tasks with label "postmortem"
- Notion/Confluence: table with statuses
- GitHub Issues: for open-source projects
Example Jira workflow:
[POSTMORTEM-123] Enable logrotate on all servers
Priority: Critical
Assignee: DevOps Team
Labels: postmortem, incident-2025-12-15
Due Date: 2025-12-20
Acceptance Criteria:
- [ ] Logrotate configured on prod-01...prod-10
- [ ] Config added to Ansible playbook
- [ ] Verified on staging
- [ ] Documentation updated
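If action items live in Jira, creating them can be scripted straight from the postmortem. A minimal sketch, assuming Jira Cloud's REST API, a hypothetical OPS project key, and JIRA_BASE_URL / JIRA_USER / JIRA_TOKEN in the environment:

```bash
#!/usr/bin/env bash
# Create a postmortem action item in Jira (sketch; adjust project key, auth, and fields).
set -eo pipefail

curl -s -u "${JIRA_USER}:${JIRA_TOKEN}" \
  -X POST "${JIRA_BASE_URL}/rest/api/2/issue" \
  -H 'Content-Type: application/json' \
  -d '{
    "fields": {
      "project":   { "key": "OPS" },
      "issuetype": { "name": "Task" },
      "summary":   "Enable logrotate on all servers",
      "labels":    ["postmortem", "incident-2025-12-15"],
      "duedate":   "2025-12-20"
    }
  }'
```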
Review process:
- Weekly sync: review all open postmortem tasks
- Monthly report: % completed action items
- Escalation: if task doesn't move > 2 weeks
2. Incident Database (knowledge base)
Create centralized storage of all postmortems.
Structure:
/postmortems
/2025
/12-december
/2025-12-15-disk-full.md
/2025-12-10-api-timeout.md
/2024
/tags
/database
/performance
/security
Metadata for each postmortem:

```yaml
---
incident_id: INC-2025-12-15
date: 2025-12-15
severity: P0
duration: 4h 28min
root_cause: Disk full
tags: [database, infrastructure, monitoring]
affected_services: [api, web, mobile]
financial_impact: $87,000
---
```

Why:
- Search similar incidents
- Trend analysis (what breaks most often)
- Onboarding new team members
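With this metadata in place, similar incidents can be found with plain grep. A quick sketch assuming the /postmortems layout and front matter shown above:

```bash
# Find postmortems tagged "database"
grep -rl --include='*.md' 'tags:.*database' postmortems/

# Count root causes across all postmortems to spot trends
grep -rh --include='*.md' '^root_cause:' postmortems/ | sort | uniq -c | sort -rn
```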
3. Trend Analysis
Every quarter: analyze all postmortems.
Questions:
- Which incidents repeat?
- Which systems/services break most often?
- Which root causes occur regularly?
- How many action items completed?
Metrics:
Q4 2025 Incident Trends
Total incidents: 23
- P0: 3 (13%)
- P1: 8 (35%)
- P2: 12 (52%)
Top Root Causes:
1. Configuration errors: 8 (35%)
2. Insufficient monitoring: 6 (26%)
3. Deployment issues: 5 (22%)
4. External dependencies: 4 (17%)
Most Affected Services:
1. API Gateway: 9 incidents
2. Database: 6 incidents
3. Auth Service: 5 incidents
Action Items:
- Total created: 87
- Completed: 65 (75%)
- In Progress: 15 (17%)
- Blocked: 7 (8%)
Conclusions:
- 35% incidents due to configuration → need Infrastructure as Code
- 26% due to poor monitoring → expand coverage
- API Gateway crashes often → priority candidate for refactoring
4. Learning Sessions (team education)
Once per quarter: workshop on postmortems.
Format:
- 1 hour
- Presentation of top-3 most interesting incidents
- Lessons discussion
- Q&A
Example agenda:
Q4 2025 Incident Learning Session
1. Incident: "Disk Full on Production DB"
- Presenter: Alice (SRE)
- Duration: 20 min
- Key learnings: Monitoring, Automation, Runbooks
2. Incident: "API Timeout due to N+1 Query"
- Presenter: Bob (Backend)
- Duration: 15 min
- Key learnings: Performance testing, Query optimization
3. Incident: "Security: Exposed S3 Bucket"
- Presenter: Charlie (Security)
- Duration: 15 min
- Key learnings: IAM policies, Access control
4. Q&A: 10 min
Result:
- Team learns from others' mistakes
- Culture of openness (not ashamed to make mistakes)
- Cross-team knowledge sharing
5. Runbooks (recovery procedures)
Create runbook for each typical incident.
Example runbook: "Disk Full Recovery"
# Runbook: Disk Full Recovery
## Symptoms
- Alert: "Disk usage > 95%"
- Database errors: "No space left on device"
- Application crashes
## Immediate Actions
1. Check current disk usage:

```bash
df -h
du -sh /var/log/* | sort -h
```

2. Identify large files:

```bash
find / -type f -size +1G 2>/dev/null
```

3. Quick cleanup (if safe):

```bash
# Clean old logs (> 7 days)
find /var/log -name "*.log" -mtime +7 -delete
# Clean temp files
rm -rf /tmp/*
```

## Investigation
- Check logrotate status:

```bash
systemctl status logrotate
```

- Check application log settings
- Review recent changes (deployments, configs)

## Resolution
1. Enable logrotate:

```bash
systemctl enable logrotate
systemctl start logrotate
```

2. Configure log retention (7 days) in `/etc/logrotate.d/application`
3. Restart affected services

## Verification
- Disk usage < 80%
- Application responsive
- Monitoring shows stable metrics

## Post-Incident
- Create a postmortem if one doesn't already exist
- Update this runbook if needed
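The log-retention step might look like the following. A minimal sketch assuming the application logs live in /var/log/app/; adjust paths and retention to your setup:

```bash
# Sketch of /etc/logrotate.d/application (assumed log path: /var/log/app/)
sudo tee /etc/logrotate.d/application > /dev/null <<'EOF'
/var/log/app/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF

# Dry run to verify the configuration before relying on it
sudo logrotate --debug /etc/logrotate.d/application
```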
**Where to store runbooks:**
- Confluence / Notion
- GitHub repository
- PagerDuty (integration)
**Important:** runbooks must be **accessible during an incident** (not stored only on the production system that just crashed).
---
## Common postmortem mistakes
### Mistake #1: Formal approach
**Problem:** Postmortem written for compliance, no one reads it.
**Signs:**
- Document in Confluence with 0 views
- Action items without owners
- No follow-up
- Template phrases ("need to be more careful")
**Solution:**
- Mandatory meeting with team
- Review all action items on weekly sync
- Publish results company-wide
### Mistake #2: Looking for who's guilty
**Problem:** Focus on "who's to blame" not "why it happened".
**Signs:**
- Questions "who did this?", "why didn't you check?"
- Interrogation atmosphere at meeting
- People afraid to admit mistakes
**Solution:**
- Train management on blame-free culture
- Moderator at meeting blocks accusations
- Focus on systemic improvements
### Mistake #3: Too many action items
**Problem:** 30+ action items → nothing done.
**Signs:**
- Action items snowball
- No prioritization
- Everything overdue
**Solution:**
- **Rule:** maximum 5-7 action items per postmortem
- Prioritize by impact
- Rest → backlog as "nice to have"
**Pareto rule for postmortems:** 20% of action items give 80% of the improvements. Focus on the critical ones; the rest can be deferred.
### Mistake #4: No effectiveness metrics
**Problem:** Don't measure if postmortems work.
**Signs:**
- Don't know how many incidents repeat
- Don't know how many action items closed
- No visibility for management
**Solution:**
Measure:
- **MTTR (Mean Time To Recovery):** average recovery time
- **Incident Frequency:** incidents per month
- **Repeat Rate:** % of recurring incidents
- **Action Item Completion Rate:** % of completed action items
### Mistake #5: Postmortem a month later
**Problem:** Writing postmortem a month after incident.
**Signs:**
- Details forgotten
- Timeline inaccurate
- Logs deleted
**Solution:**
- **Deadline:** postmortem within 24-48 hours for P0/P1
- Save artifacts immediately (logs, screenshots)
- Draft timeline during incident
---
## Postmortem effectiveness metrics
### 1. **MTTR (Mean Time To Recovery)**
**What it measures:** Average time from incident start to full recovery.
**Formula:**
MTTR = Σ(Recovery time) / Number of incidents
**Example:**
Q4 2025:
- Incident 1: 4h 28min
- Incident 2: 1h 15min
- Incident 3: 35min
MTTR = (268 + 75 + 35) / 3 = 126 minutes (2h 6min)
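The same calculation from the shell, if you keep per-incident recovery times in minutes:

```bash
# MTTR from a list of recovery times (minutes), one value per line
printf '268\n75\n35\n' | awk '{ sum += $1; n++ } END { printf "MTTR: %.0f minutes\n", sum / n }'
# Output: MTTR: 126 minutes
```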
**Goal:** MTTR should decrease over time.
**Why:**
- Better runbooks → faster recovery
- Better monitoring → earlier detection
- Team experience → more efficient actions
### 2. **Incident Frequency**
**What it measures:** Number of incidents per period.
**Metric:**
Incidents per Month = Total Incidents / Months
**Trend:**
Q3 2025: 12 incidents/month
Q4 2025: 8 incidents/month ✅ (33% improvement)
**Goal:** Decrease over time (if postmortems work).
### 3. **Repeat Rate (% recurring incidents)**
**What it measures:** How many incidents repeat.
**Formula:**
Repeat Rate = (Recurring incidents / Total incidents) × 100%
**Example:**
Q4 2025:
- Total incidents: 23
- Repeat incidents: 3 (disk full x2, API timeout x1)
- Repeat Rate: 3/23 = 13%
**Benchmarks:**
- **< 10%:** Excellent (postmortems work)
- **10-20%:** Good
- **20-40%:** Poor (action items not executed)
- **> 40%:** Critical (postmortem culture doesn't work)
### 4. **Action Item Completion Rate**
**What it measures:** % of completed action items from postmortems.
**Formula:**
Completion Rate = (Completed / Total created) × 100%
**Example:**
Q4 2025:
- Total action items created: 87
- Completed: 65
- In Progress: 15
- Blocked: 7
Completion Rate: 65/87 = 75%
**Benchmarks:**
- **> 80%:** Excellent
- **60-80%:** Good
- **40-60%:** Poor
- **< 40%:** Critical (no follow-up process)
### 5. **Time to Postmortem**
**What it measures:** How long from incident to postmortem publication.
**Formula:**
Time to Postmortem = Publication date - Incident date
**Benchmarks:**
- **< 24 hours:** Excellent
- **24-48 hours:** Good
- **3-7 days:** Acceptable
- **> 1 week:** Poor (details forgotten)
---
## Tools for postmortems
### 1. **Incident Management platforms**
#### PagerDuty
**Features:**
- Automatic alerts
- On-call rotation
- Incident timeline (automatic)
- Integration with Slack, Jira
- Postmortem templates
**Price:** from $21/user/month
**Pros:**
- All-in-one solution
- Great monitoring integration
**Cons:**
- Expensive for small teams
#### Opsgenie (Atlassian)
**Features:**
- Similar to PagerDuty
- Deep Jira integration
**Price:** from $9/user/month
### 2. **Documentation and templates**
#### Confluence / Notion
**Pros:**
- Familiar tool
- Postmortem templates
- Search and tags
**Template for Notion:**
```markdown
# Incident Post-Mortem Template
## Incident Details
- **ID:**
- **Date:**
- **Severity:**
- **Duration:**
## Executive Summary
[Brief description]
## Timeline
| Time | Event | Action |
|------|-------|--------|
| | | |
## Root Cause
[5 Whys analysis]
## Action Items
- [ ] Item 1 (@owner, deadline)
- [ ] Item 2 (@owner, deadline)
```
GitHub / GitLab
For tech teams:
- Postmortems as markdown files in repository
- Pull requests for review
- Issues for action items
Structure:
/postmortems
/2025
/Q4
incident-2025-12-15-disk-full.md
3. Monitoring and logging
Grafana / Prometheus
For timeline:
- Export graphs for incident period
- Annotations on graphs (deploy, incidents)
Example query:

```
# CPU usage during the incident window (cpu_usage is a placeholder metric name)
rate(cpu_usage[5m])
```

Scope the query to the incident period via Grafana's time picker, or via the Prometheus HTTP API, e.g. `/api/v1/query_range?query=rate(cpu_usage[5m])&start=1734307620&end=1734323700&step=60`.

ELK Stack / Datadog
For logs:
- Filter by incident timestamp
- Export logs for postmortem
- Error visualization
4. Incident Response Tools
Incident.io
Specialized incident management tool.
Features:
- Automatic Slack channel creation for incident
- Timeline automatically from Slack
- Postmortem templates
- Action item follow-up
Price: from $1200/month
Suitable for: companies with > 50 engineers
FireHydrant
Similar to Incident.io.
Features:
- Incident Commander rotation
- Runbook management
- Retrospective facilitation
Case studies: real postmortems
Case #1: AWS S3 Outage (2017)
What happened:
- An engineer ran a command intended to take a small number of servers offline
- A typo in the command removed far more S3 capacity in the US-East-1 region than intended
- Downtime: 4 hours
Impact:
- Thousands of sites unavailable
- Financial losses: millions of dollars
- Reputation damage
Root Cause:
- Command allowed deleting too many servers at once
- No protection from human error
Action Items:
- Changed tooling: can't delete more than certain number of servers
- Added confirmation for critical operations
- Improved recovery process (now faster)
Public postmortem: https://aws.amazon.com/message/41926/
Lesson: Even AWS makes mistakes. Important not to hide, but to learn.
Case #2: GitLab Database Deletion (2017)
What happened:
- Admin deleted production database instead of staging
- Backups didn't work (discovered during recovery)
- Lost 6 hours of data
Impact:
- Downtime: 18 hours
- Data loss: ~300GB of data deleted; roughly 6 hours of production data permanently lost
Root Cause:
- No access separation
- Backups not tested
- No recovery procedure
Action Items:
- Implemented regular backup testing process
- Separated production access
- Automated backup verification
Public postmortem: https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
Lesson: GitLab openly published postmortem → customers appreciated honesty → trust increased.
Case #3: Knight Capital ($440M in 45 minutes)
What happened:
- Trading software bug
- Lost $440 million in 45 minutes
- The company nearly collapsed and was acquired shortly after
Root Cause:
- Production deploy without testing
- Old code reactivated due to configuration error
- No automatic stop-loss
Lessons:
- Production-like environment testing mandatory
- Automatic safety checks critical
- Incident response plan must exist
Lesson: No postmortem was ever conducted; the company didn't survive as an independent business. A reminder to everyone: without safeguards and a postmortem culture, a single incident can kill a business.
Implementing postmortem culture: step-by-step plan
Step 1: Get management buy-in (1 week)
Why: Without top management support, culture won't take root.
Actions:
1. Presentation for CTO/CEO:
   - Statistics: Google, Netflix reduce incidents by 60% with postmortems
   - ROI: fewer incidents = fewer losses
   - Public postmortems = customer trust
2. Pilot: choose one recent incident and do a full postmortem
   - Show the results
   - Show the action items
   - A month later, show that the incident didn't repeat
3. Get approval:
   - Budget for tools (if needed)
   - Team time for postmortems (it's part of the work)
Step 2: Create template and process (1 week)
Actions:
1. Adapt the template from this article for your company
2. Create a wiki page with:
   - The postmortem template
   - A process guide
   - FAQ
3. Assign a Postmortem Owner:
   - The person who oversees the process
   - Usually the SRE Lead or an Engineering Manager
Step 3: Train team (2 weeks)
Actions:
1. Workshop: "Blame-Free Postmortem Culture"
   - 1-2 hours
   - Explain the principles
   - Review an example postmortem
   - Conduct a mock postmortem meeting
2. Documentation:
   - Guide: "How to write a postmortem"
   - Guide: "How to conduct a postmortem meeting"
   - Examples of good postmortems
3. Assign a mentor:
   - Someone experienced helps with the first 3 postmortems
Step 4: First postmortems (1 month)
Actions:
1. For the first 3 incidents:
   - Detailed postmortems
   - Team meetings
   - An example for others
2. Collect feedback:
   - What's difficult?
   - What's unclear?
   - How can the process be improved?
3. Iterate on the template and process
Step 5: Automation and scaling (3 months)
Actions:
1. Implement tools:
   - PagerDuty / Opsgenie for incident management
   - Jira for action items
   - Confluence for postmortems
2. Automate:
   - Automatic postmortem creation from an incident
   - Timeline built from logs and monitoring
   - Follow-up reminders (see the sketch below)
3. Metrics:
   - Dashboard with MTTR, Incident Frequency, Repeat Rate
   - Monthly report for management
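Follow-up reminders don't need a dedicated tool to start with. A minimal sketch using a Slack incoming webhook (the SLACK_WEBHOOK_URL variable and the Jira filter text are assumptions for illustration), run weekly from cron or CI:

```bash
#!/usr/bin/env bash
# Post a weekly reminder to the team channel via a Slack incoming webhook.
set -eo pipefail

curl -s -X POST "${SLACK_WEBHOOK_URL}" \
  -H 'Content-Type: application/json' \
  -d '{"text": "Weekly reminder: review open postmortem action items (Jira filter: labels = postmortem AND statusCategory != Done)"}'
```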
Step 6: Cultural embedding (6-12 months)
Actions:
- Regular Learning Sessions (once per quarter)
- Public Postmortems (for customers)
- Awards for best postmortems
- Include postmortems in onboarding for new employees
Success indicator: after 6 months team initiates postmortems themselves, not waiting for manager's request. This means culture has taken root.
Checklist: Effective postmortem
✅ Before writing postmortem:
- Service fully restored
- All artifacts collected (logs, screenshots, Slack threads)
- Postmortem author identified (not the one "at fault")
- Meeting date scheduled (24-48 hours after incident)
✅ Postmortem must have:
- Executive Summary (CEO-understandable)
- Timeline (accurate, with timestamps)
- Root Cause Analysis (5 Whys or similar)
- What went well
- What went wrong
- Action Items (concrete, with owners and deadlines)
- Lessons Learned
✅ Meeting (postmortem meeting):
- All incident participants attend
- Moderator blocks accusations
- Discussion focuses on systemic causes
- Action items prioritized
- Owners and deadlines assigned
✅ After postmortem:
- Document published in Wiki
- Email to entire engineering team
- Action items added to Jira
- Weekly action items review
- Follow-up after month: everything completed?
✅ Culture:
- Blameless postmortems
- Public postmortems for customers (optional)
- Learning sessions once per quarter
- Metrics: MTTR, Frequency, Repeat Rate
- Awards for best postmortems
Main takeaway
Postmortem isn't an autopsy. It's a vaccine.
Incidents will always happen. No system is perfect. Difference between mature and immature teams:
- Immature: repeats same mistakes, blames people, hides problems
- Mature: learns from mistakes, improves system, openly shares knowledge
Postmortems turn incidents from catastrophe into growth opportunity.
Key principles:
- Blame-free: not "who's to blame" but "why system allowed"
- Action items: concrete tasks with owners and deadlines
- Follow-up: without execution postmortem is useless
- Transparency: openness within team and to customers
- Learning: postmortems as learning tool
Remember:
"A team that doesn't learn from mistakes is doomed to repeat them. A team that does quality postmortems turns every incident into a step of growth."
What to do right now
If you don't have postmortem culture yet:
- Take last incident (or next one)
- Use template from this article
- Conduct team meeting (60 minutes)
- Create 3-5 action items with owners
- Review in a week: completed?
If you already have postmortems:
- Check last 3 postmortems:
- Concrete action items?
- Owners assigned?
- Executed?
- Measure metrics:
- MTTR
- Repeat Rate
- Action Item Completion Rate
- If > 20% of incidents repeat → you have a follow-up problem
If you're Tech Lead / Manager:
- Implement rule: every P0/P1 incident = mandatory postmortem
- Train team on blame-free approach
- Weekly action items review from postmortems
- Quarterly presentation: top incidents and lessons
If you're a developer:
- Suggest postmortems to manager
- Use template from article for next incident
- Be an example: honestly admit mistakes, share lessons
Useful resource: Download free postmortem template in Notion/Confluence format from my website. Adapt for your team and start using today.
Useful resources
Books:
- "Site Reliability Engineering" (Google) — classic, whole chapter on postmortems
- "The DevOps Handbook" — culture and processes
- "Accelerate" — DevOps team effectiveness metrics
Tools:
- PagerDuty — incident management
- Opsgenie — incident management
- Incident.io — specialized tool
- Postmortem Templates (GitHub) — template collection
Share your experience
I collect postmortem examples and best practices. If you have an interesting case — please share:
- Toughest incident in your career?
- How did postmortems help (or not help) your team?
- What metrics do you use to measure effectiveness?
Write in comments or Telegram. Let's discuss culture, share experience.
Need help implementing postmortem culture? Contact me — I'll conduct workshop for your team, help set up processes, choose tools. First consultation free.
Enjoyed the article? Share with a colleague who says "why do we need these postmortems" or "we don't have time for documentation". It might save their project from recurring incidents.
Subscribe to updates on Telegram — I write about DevOps, SRE, incident management and development culture. Only practice, no fluff.



