Incident Response

Authored — not generated. Update directly in docs-portal/_authored/07-runbooks/incident-response.md.

Severity ladder

Severity | Definition                                                  | Page
SEV-1    | Customer-impacting outage, data loss risk, security breach  | Page on-call + SRE lead immediately
SEV-2    | Degraded for some tenants, no data loss                     | Page on-call
SEV-3    | Internal degradation, customer-invisible                    | Slack channel
SEV-4    | Cosmetic                                                    | Ticket
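
The ladder above can be encoded as a lookup so that tooling and humans agree on who gets notified. The following is a minimal sketch, not the portal's actual alerting config: the `Severity` enum, the `classify` helper, its boolean inputs, and the routing labels such as `page:on-call` are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


# Notification targets taken from the ladder. The label strings are
# placeholders (assumptions), not real routing keys.
NOTIFY = {
    Severity.SEV1: ["page:on-call", "page:sre-lead"],
    Severity.SEV2: ["page:on-call"],
    Severity.SEV3: ["slack:channel"],
    Severity.SEV4: ["ticket"],
}


def classify(*, customer_outage=False, data_loss_risk=False,
             security_breach=False, some_tenants_degraded=False,
             internal_degradation=False):
    """Return the highest severity whose definition matches the flags."""
    if customer_outage or data_loss_risk or security_breach:
        return Severity.SEV1
    if some_tenants_degraded:
        return Severity.SEV2
    if internal_degradation:
        return Severity.SEV3
    # Nothing above matched: treat as cosmetic.
    return Severity.SEV4


if __name__ == "__main__":
    sev = classify(some_tenants_degraded=True)
    print(sev.value, NOTIFY[sev])
```

Checking the SEV-1 conditions first means a mixed incident (for example, partial degradation plus a data loss risk) is never under-classified.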

First five minutes

  1. Open the SLO board and the relevant service's runbook tab.
  2. Capture the traceparent of a failing request; it threads through every log and metric.
  3. Confirm tenant scope: single-tenant impact vs platform-wide.
  4. Decide between rollback and forward-fix. Default to rollback if a deploy in the last 30 minutes is suspect.
  5. Open the incident channel and pin the single source of truth doc.
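
Steps 2 and 4 lend themselves to small helpers. The sketch below assumes the traceparent follows the W3C Trace Context header layout (`version-traceid-parentid-flags`, lowercase hex) and handles only that four-field form. The names `parse_traceparent`, `TraceParent`, and `has_recent_deploy` are hypothetical, and the list of deploy timestamps is assumed to come from whatever deploy log is available.

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent value; return None if it is malformed."""
    m = _TRACEPARENT.fullmatch(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


def has_recent_deploy(deploy_times, now, window=timedelta(minutes=30)):
    """True if any deploy landed within `window` before `now`.

    Whether that deploy is actually suspect remains a human call.
    """
    return any(timedelta(0) <= now - t <= window for t in deploy_times)


if __name__ == "__main__":
    tp = parse_traceparent(
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    )
    print(tp.trace_id if tp else "invalid traceparent")
    now = datetime(2024, 1, 3, 12, 0)
    print(has_recent_deploy([datetime(2024, 1, 3, 11, 40)], now))
```

The `trace_id` field is the value to paste into log and metric searches; the parent ID changes at each hop, so it is less useful for correlation.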

Communication

  • Internal updates every 15 min until mitigated.
  • Customer-facing status only after the on-call lead's explicit OK.
  • Post-incident write-up within 5 business days, blameless format.
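
The two timers above can be computed rather than tracked by hand. This is a minimal sketch under stated assumptions: business days are taken to be Monday through Friday with no holiday calendar, counting starts on the day after the incident, and the function names `next_update_due` and `writeup_deadline` are hypothetical.

```python
from datetime import date, datetime, timedelta


def next_update_due(last_update_at, interval=timedelta(minutes=15)):
    """Time by which the next internal update should be posted."""
    return last_update_at + interval


def writeup_deadline(incident_day, business_days=5):
    """Date that falls `business_days` working days after `incident_day`.

    Assumption: weekends are skipped, holidays are not considered.
    """
    day = incident_day
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return day


if __name__ == "__main__":
    print(next_update_due(datetime(2024, 1, 3, 10, 0)))
    # Incident on Wednesday 2024-01-03; the deadline skips one weekend.
    print(writeup_deadline(date(2024, 1, 3)))
```

If the team counts the incident day itself as day one, or observes regional holidays, the loop in `writeup_deadline` would need adjusting.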