Auth Service — Failure Modes
Status: populated
Owner: SRE
Last updated: 2026-04-18

| # | Failure | Impact | Mitigation |
|---|---|---|---|
| 1 | PG down | All logins + token refresh fail | HA primary, read replicas, HPA scales down dependents |
| 2 | Redis down | JWKS cache miss; api-key lookup hotter | Fail-open with DB fallback |
| 3 | Vault down at startup | Pod fails readiness | Rolling restart blocked; alert; use cached signing key on already-running pods |
| 4 | Firebase outage | Firebase login fails; password login still works | Users fall back to password; status page note |
| 5 | JWKS rotation bug (no overlap) | Mass 401 until cache expires | Runbook: force-publish old key; rotation always schedules 10m overlap |
| 6 | API key lookup endpoint slow | Kong 5xx backpressure | Redis cache + HPA; Kong timeout 200ms |
| 7 | Clock skew → JWT expired prematurely | Intermittent 401 | NTP; nbf grace 30s |
| 8 | Leaked signing key | Token impersonation | Immediate emergency rotation; revoke all refresh tokens; force re-login |
| 9 | Password database leak | Credential stuffing | argon2id + rate limit at Kong + breach monitoring |
| 10 | Lockout storm (credential stuffing) | Legit users locked out | IP-based lockout only; email lockout more conservative; CAPTCHA after N failures |
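
To make failures 5 and 7 concrete, the sketch below models which key ids the JWKS must publish around a rotation and classifies why a verifier would answer 401. It is an added example, not the auth service's own code: `SigningKey`, `published_kids`, `classify`, and the sample key ids, timestamps and token lifetime are constructed for illustration; only the 10m overlap and the 30s nbf grace come from the table.

```python
from dataclasses import dataclass

OVERLAP_S = 10 * 60   # from the table: rotation schedules a 10m overlap
NBF_GRACE_S = 30      # from the table: nbf grace 30s


@dataclass(frozen=True)
class SigningKey:
    kid: str
    active_from: int  # epoch seconds at which this key starts signing


def published_kids(keys, now, overlap=OVERLAP_S):
    """Key ids the JWKS must contain at `now`: the current signer plus
    any superseded key that is still inside its overlap window."""
    ordered = sorted(keys, key=lambda k: k.active_from)
    kids = set()
    for i, key in enumerate(ordered):
        if key.active_from > now:
            continue  # assumption: keys are not pre-published
        successor = ordered[i + 1] if i + 1 < len(ordered) else None
        if successor is None or now < successor.active_from + overlap:
            kids.add(key.kid)
    return kids


def classify(kid, claims, jwks_kids, now, nbf_grace=NBF_GRACE_S):
    """Return 'ok' or the reason a verifier would answer 401."""
    if kid not in jwks_kids:
        return "unknown_kid"      # failure 5: key withdrawn too early
    if "nbf" in claims and now + nbf_grace < claims["nbf"]:
        return "not_yet_valid"    # failure 7: verifier clock runs behind
    if now >= claims["exp"]:
        return "expired"
    return "ok"


# Constructed sample: rotation at t=1000; the token was signed by the old
# key 10 seconds earlier with a 5 minute lifetime.
KEYS = [SigningKey("k-old", 0), SigningKey("k-new", 1000)]
TOKEN = {"nbf": 990, "exp": 1290}
```
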