Incident Response
Authored — not generated. Update directly in
docs-portal/_authored/07-runbooks/incident-response.md.
Severity ladder
| Severity | Definition | Notification |
|---|---|---|
| SEV-1 | Customer-impacting outage, data loss risk, security breach | Page on-call + SRE lead immediately |
| SEV-2 | Degraded for some tenants, no data loss | Page on-call |
| SEV-3 | Internal degradation, customer-invisible | Slack channel |
| SEV-4 | Cosmetic | Ticket |
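
As a rough illustration, the ladder above could be encoded as a lookup so that tooling and humans agree on who gets notified. This is a minimal sketch, not an existing integration: the `classify` function, its boolean inputs, and the target labels such as `page:on-call` are all hypothetical names chosen for this example, and treating everything else as SEV-4 is an assumption.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


# Notification targets per severity, taken from the table above.
# The label strings are hypothetical placeholders, not real integrations.
NOTIFY = {
    Severity.SEV1: ["page:on-call", "page:sre-lead"],
    Severity.SEV2: ["page:on-call"],
    Severity.SEV3: ["slack:channel"],
    Severity.SEV4: ["ticket"],
}


def classify(customer_outage=False, data_loss_risk=False, security_breach=False,
             some_tenants_degraded=False, internal_degradation=False):
    """Return the highest severity whose definition matches the inputs."""
    if customer_outage or data_loss_risk or security_breach:
        return Severity.SEV1
    if some_tenants_degraded:
        return Severity.SEV2
    if internal_degradation:
        return Severity.SEV3
    # Assumption: anything that matches no other row is treated as cosmetic.
    return Severity.SEV4


if __name__ == "__main__":
    sev = classify(some_tenants_degraded=True)
    print(sev.value, NOTIFY[sev])
```

Checking the most severe conditions first means an incident that matches several rows is always routed at the highest applicable level.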
First five minutes
- Open the SLO board and the relevant service's runbook tab.
- Capture the `traceparent` of a failing request; it threads through every log/metric.
- Confirm tenant scope: single-tenant impact vs platform-wide.
- Decide rollback vs forward-fix. Default is rollback if a deploy in the last 30 min is suspect.
- Open the incident channel and pin the single source of truth doc.
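
Two of the steps above lend themselves to a short sketch: pulling the trace ID out of a `traceparent` value, and checking whether rollback is the default. The example assumes the header follows the W3C Trace Context version-00 layout (`version-traceid-parentid-flags`). The function names and the `deploy_is_suspect` flag are hypothetical, and the sample header is an illustrative value, not a real request.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout (W3C Trace Context, version 00):
# 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

# Rollback window from the runbook: deploys in the last 30 minutes.
ROLLBACK_WINDOW = timedelta(minutes=30)


def trace_id_from_traceparent(header):
    """Return the 32-character trace ID, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # Version ff and all-zero IDs are invalid under the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id


def rollback_is_default(last_deploy_at, deploy_is_suspect, now):
    """True when a suspect deploy landed within the last 30 minutes.

    When this returns False the runbook leaves the rollback vs
    forward-fix decision to the responder.
    """
    if last_deploy_at is None or not deploy_is_suspect:
        return False
    age = now - last_deploy_at
    return timedelta(0) <= age <= ROLLBACK_WINDOW


if __name__ == "__main__":
    sample = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(trace_id_from_traceparent(sample))
    now = datetime.now(timezone.utc)
    print(rollback_is_default(now - timedelta(minutes=10), True, now))
```

The extracted trace ID is the value to search for in logs and metrics. Passing `now` explicitly keeps the rollback check easy to test.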
Communication
- Internal updates every 15 min until mitigated.
- Customer-facing status only after the on-call lead's explicit OK.
- Post-incident write-up within 5 business days, blameless format.
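
The two deadlines above can be computed mechanically, for example by a reminder bot. This is a sketch under stated assumptions: the function names are hypothetical, business days are taken to be Monday through Friday with holidays ignored, and the five-day count is assumed to start on the day after the incident date.

```python
from datetime import date, datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=15)   # internal update cadence
WRITEUP_BUSINESS_DAYS = 5                 # post-incident write-up deadline


def next_update_due(last_update_at):
    """Time by which the next internal update should be posted."""
    return last_update_at + UPDATE_INTERVAL


def writeup_due(incident_date, business_days=WRITEUP_BUSINESS_DAYS):
    """Date the write-up is due, skipping Saturdays and Sundays.

    Assumption: counting starts the day after incident_date and
    public holidays are not considered.
    """
    due = incident_date
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return due


if __name__ == "__main__":
    print(next_update_due(datetime(2024, 1, 1, 12, 0)))
    print(writeup_due(date(2024, 1, 5)))
```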