:::info Source
Sourced from services/identity-service/FAILURE_MODES.md in the documentation repo.
:::
1. Known Failure Scenarios
1.1 KMS Unavailable
- Symptom: JWT signing fails; all `/auth/login` and `/auth/refresh` requests fail with `502 upstream.unavailable`.
- Blast radius: Platform-wide (every authenticated API call fails once existing JWTs expire).
- Mitigation:
  - Multi-region KMS with automatic failover.
  - In-memory signing key cache on each pod (5-min TTL), which tolerates brief KMS blips.
  - Short-term: extend JWT TTL from 15 min to 60 min during a KMS outage via feature flag.
- Recovery: once KMS is back, pods re-fetch `kid` material; no user action required.
- Runbook: `runbooks/identity/kms-outage.md`
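
The per-pod signing key cache from the mitigation list can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `SigningKeyCache` and the `fetch_key` callable (a stand-in for the real KMS client) are hypothetical names, and only the 5-min TTL comes from the text above.

```python
import time
from typing import Callable, Optional


class SigningKeyCache:
    """Hypothetical in-memory cache for signing key material with a fixed TTL."""

    def __init__(
        self,
        fetch_key: Callable[[], bytes],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_key = fetch_key  # assumption: wraps the real KMS call
        self._ttl = ttl_seconds      # 5-min TTL from the mitigation list
        self._clock = clock          # injectable so expiry can be tested
        self._key: Optional[bytes] = None
        self._expires_at = 0.0

    def get(self) -> bytes:
        now = self._clock()
        if self._key is not None and now < self._expires_at:
            return self._key         # cache hit: no KMS round trip
        key = self._fetch_key()      # raises if KMS is unavailable
        self._key, self._expires_at = key, now + self._ttl
        return key
```

While an entry is fresh, signing does not touch KMS at all, which is why an outage shorter than the remaining TTL goes unnoticed by callers.
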
1.2 Postgres Primary Failure
- Symptom: Writes fail; reads from replica still work (degraded auth — no registration, no session creation).
- Mitigation: Patroni-managed HA; automatic failover < 30s. Read replicas serve JWKS + session validation during failover.
- Recovery: automated promotion; replay WAL on rebuild.
- Runbook: `runbooks/identity/postgres-failover.md`
1.3 IdP Timeout (SAML / OIDC)
- Symptom: SSO callbacks fail (`502 upstream.timeout`).
- Mitigation:
  - Circuit breaker per IdP (trips after 5 failures in 30s; half-open after 60s).
  - Fallback: display "SSO temporarily unavailable — use local password" on login page (tenant policy permitting).
  - Retry: exponential backoff with jitter, max 3 attempts, total 5s budget.
- Recovery: circuit auto-closes once IdP responds OK.
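
To make the breaker thresholds above concrete, here is a minimal sketch of a sliding-window circuit breaker. The class and method names are hypothetical; only the numbers (5 failures in 30s, half-open after 60s) come from this section, and a production breaker would also need thread safety and a limit on concurrent probes.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Hypothetical per-IdP breaker: trips on N failures within a time window."""

    def __init__(
        self,
        max_failures: int = 5,
        window_s: float = 30.0,
        open_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._open_s:
            return "half-open"  # allow a probe request through
        return "open"

    def allow_request(self) -> bool:
        return self.state() != "open"

    def record_success(self) -> None:
        # A successful call (including a half-open probe) closes the circuit.
        self._failures.clear()
        self._opened_at = None

    def record_failure(self) -> None:
        now = self._clock()
        if self.state() == "half-open":
            self._opened_at = now  # probe failed: re-open for another open_s
            return
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```
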
1.4 Session Store (Redis) Failure
- Symptom: Session lookups fall back to Postgres (3× latency); rate-limit counters lost.
- Mitigation:
  - Redis Sentinel with 3 replicas; automatic failover.
  - Postgres fallback for session validation (degraded mode).
  - Rate-limit counters reset on Redis recovery, which temporarily lifts limits (accepted risk).
- Recovery: Redis replica promoted; counters rebuild from API traffic.
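
The degraded-mode fallback could look roughly like the sketch below. Everything here is illustrative: `validate_session` and the two lookup callables are hypothetical stand-ins for the real Redis and Postgres clients, and `ConnectionError` is assumed to be how a Redis outage surfaces.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[dict]]


def validate_session(
    session_id: str, redis_lookup: Lookup, postgres_lookup: Lookup
) -> Tuple[Optional[dict], str]:
    """Return the session and the store that served it."""
    try:
        return redis_lookup(session_id), "redis"
    except ConnectionError:
        # Degraded mode: slower, but Postgres remains the source of truth.
        return postgres_lookup(session_id), "postgres"
```

Returning the serving store alongside the session makes it easy to emit a metric for how much traffic is running in degraded mode.
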
1.5 Device Binding Race Condition
- Symptom: Two concurrent device-binding requests for the same `(userId, fingerprint)` create duplicate Device rows.
- Mitigation:
  - Unique constraint `UNIQUE (user_id, fingerprint)` on the `devices` table.
  - Upsert with `ON CONFLICT (user_id, fingerprint) DO UPDATE SET public_key=..., last_seen_at=now()`.
  - Client dedup: clients send an `Idempotency-Key` on the binding request.
- Recovery: duplicate detection cleanup job runs nightly.
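
The constraint-plus-upsert pattern can be tried out in isolation. The sketch below uses an in-memory SQLite database purely so that it runs without a server; the production store is Postgres, where the `ON CONFLICT` clause has the same shape. The column types and the `bind_device` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE devices (
        user_id      TEXT NOT NULL,
        fingerprint  TEXT NOT NULL,
        public_key   TEXT NOT NULL,
        last_seen_at TEXT NOT NULL,
        UNIQUE (user_id, fingerprint)
    )
    """
)


def bind_device(conn: sqlite3.Connection, user_id: str, fingerprint: str, public_key: str) -> None:
    """Insert a device row, or refresh the existing one for the same pair."""
    conn.execute(
        """
        INSERT INTO devices (user_id, fingerprint, public_key, last_seen_at)
        VALUES (?, ?, ?, datetime('now'))
        ON CONFLICT (user_id, fingerprint)
        DO UPDATE SET public_key = excluded.public_key,
                      last_seen_at = datetime('now')
        """,
        (user_id, fingerprint, public_key),
    )
    conn.commit()
```

Because the conflict is resolved inside a single statement, two racing requests end up updating one row instead of inserting two.
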
1.6 Credential Stuffing Burst
- Symptom: Spike in `/auth/login` failures; legitimate users locked out.
- Mitigation:
  - Per-email lockout: 5 failures in 15 min → 15-min lockout (exponential on repeated lockouts).
  - Per-IP rate limit: 10/min.
  - Anomaly classifier escalates MFA challenge on atypical patterns.
  - Edge WAF blocks known credential-stuffing signatures.
- Recovery: lockouts expire automatically; support can lift manually.
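
The per-email lockout rule can be sketched as below. The window and base duration come from the list above; the doubling factor for repeated lockouts is an assumption (the text only says "exponential"), and `LockoutTracker` is a hypothetical in-memory stand-in for whatever store the service actually uses.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, Dict

WINDOW_S = 15 * 60        # failures are counted over 15 minutes
MAX_FAILURES = 5          # 5 failures trigger a lockout
BASE_LOCKOUT_S = 15 * 60  # first lockout lasts 15 minutes


class LockoutTracker:
    def __init__(self) -> None:
        self._failures: DefaultDict[str, Deque[float]] = defaultdict(deque)
        self._lockout_count: DefaultDict[str, int] = defaultdict(int)
        self._locked_until: Dict[str, float] = {}

    def is_locked(self, email: str, now: float) -> bool:
        return now < self._locked_until.get(email, 0.0)

    def record_failure(self, email: str, now: float) -> int:
        """Record a failed login; return the lockout duration in seconds (0 if none)."""
        recent = self._failures[email]
        recent.append(now)
        while recent and now - recent[0] > WINDOW_S:
            recent.popleft()
        if len(recent) < MAX_FAILURES:
            return 0
        # Assumed growth: the duration doubles with each repeated lockout.
        duration = BASE_LOCKOUT_S * 2 ** self._lockout_count[email]
        self._lockout_count[email] += 1
        self._locked_until[email] = now + duration
        recent.clear()
        return duration
```
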
1.7 Refresh Token Reuse Detected
- Symptom: Same refresh token used twice (classic session hijack indicator).
- Mitigation:
  - Family revoke: all sessions descended from the compromised refresh token are revoked.
  - `identity.session.revoked.v1` emitted with `reason: 'rotation_reuse'`.
  - User forced to re-authenticate; optional notification email.
- Recovery: user logs in fresh; audit entry retained.
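
Reuse detection with family revocation can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `RefreshTokenStore` and its methods are hypothetical, tokens are tracked in plain Python collections rather than a database, and event emission is omitted. Only the `rotation_reuse` reason string comes from the text above.

```python
import secrets
from typing import Dict, Optional, Set


class RefreshTokenStore:
    """Hypothetical store: every rotated token stays linked to its family."""

    def __init__(self) -> None:
        self._family_of: Dict[str, str] = {}
        self._used: Set[str] = set()
        self._revoked_families: Set[str] = set()

    def issue(self, family_id: Optional[str] = None) -> str:
        token = secrets.token_urlsafe(32)
        self._family_of[token] = family_id or secrets.token_urlsafe(8)
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one in the same family."""
        family = self._family_of.get(token)
        if family is None or family in self._revoked_families:
            raise PermissionError("invalid_or_revoked")
        if token in self._used:
            # Second use of the same token: revoke the whole family.
            self._revoked_families.add(family)
            raise PermissionError("rotation_reuse")
        self._used.add(token)
        return self.issue(family)
```

After a reuse is detected, even the newest token in the family stops working, which forces both the legitimate user and any attacker back through a full login.
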
1.8 SAML IdP Signing Certificate Rotation
- Symptom: The IdP rotates its signing cert; our cached metadata is stale; all SAML responses fail verification.
- Mitigation:
  - Metadata refresh daily (or on `<ds:KeyInfo>` mismatch).
  - Graceful error: "Your IdP metadata has changed — contact admin" (generic; no cert leak).
- Recovery: tenant admin refreshes the metadata URL, or the scheduled auto-pull picks up the new cert.
1.9 JWKS Cache Stampede
- Symptom: Every consumer service fetches JWKS simultaneously on rotation.
- Mitigation:
  - CDN cache with `stale-while-revalidate` prevents origin storm.
  - Jittered TTL in consumer libraries (5 min ± 30s).
  - Origin supports > 10k rps burst.
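
The jittered TTL is a one-liner in practice. A minimal sketch, assuming a uniform distribution over the ±30s range (the text does not say which distribution the consumer libraries use); `jittered_ttl` is a hypothetical helper name.

```python
import random


def jittered_ttl(base_s: float = 300.0, jitter_s: float = 30.0) -> float:
    """Return a cache TTL of base_s plus or minus up to jitter_s seconds."""
    return base_s + random.uniform(-jitter_s, jitter_s)
```

Each consumer draws its own TTL, so cache expiries spread across a 60-second band instead of landing on the same instant after a rotation.
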
1.10 Outbox Backlog (NATS Unavailable)
- Symptom: Events accumulate in the `outbox` table; downstream services see stale state.
- Mitigation:
  - Outbox Relay retries with exponential backoff; buffer up to 7 days.
  - Alert when the backlog exceeds 5000 rows or the oldest event is more than 5 min old.
  - DLQ for permanently unpublishable events.
- Recovery: NATS reconnect → backlog drains; ordering preserved by `occurred_at`.
2. Retry / Backoff Rules
| Operation | Max attempts | Backoff | Total budget |
|---|---|---|---|
| KMS sign | 3 | 50ms, 200ms, 500ms | 1s |
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| IdP HTTP | 3 | 200ms, 1s, 3s | 5s |
| Outbox publish | infinite | exp with jitter, cap 5 min | — |
| Webhook out (password reset email) | 5 | 1s, 5s, 30s, 2m, 10m | 15 min |
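
One way to read a row of this table is as a fixed delay schedule plus an overall deadline. The sketch below takes that reading as an assumption: each delay is waited after the corresponding failed attempt, the delay after the final attempt is skipped, and retries stop early if the next wait would exceed the budget. `retry_with_schedule` is a hypothetical helper, not an existing library function.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def retry_with_schedule(
    op: Callable[[], T],
    delays_s: Sequence[float],
    budget_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Try op once per entry in delays_s, staying within budget_s overall."""
    if not delays_s:
        raise ValueError("delays_s must contain at least one entry")
    start = clock()
    last_error: Optional[Exception] = None
    for attempt, delay in enumerate(delays_s, start=1):
        try:
            return op()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
        out_of_attempts = attempt == len(delays_s)
        over_budget = (clock() - start) + delay > budget_s
        if out_of_attempts or over_budget:
            break
        sleep(delay)
    assert last_error is not None
    raise last_error
```

For the KMS sign row this would be called as `retry_with_schedule(sign, [0.05, 0.2, 0.5], 1.0)`, where `sign` is whatever callable performs the signing request.
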
3. Circuit Breakers
| Target | Trip threshold | Reset |
|---|---|---|
| KMS | 10 failures / 30s | half-open after 60s |
| External IdP | 5 failures / 30s | half-open after 60s |
| Email provider (reset flow) | 20 failures / 60s | half-open after 2min |
| Redis | 20 failures / 10s | half-open after 30s |
4. Fallback Paths
| Primary | Fallback |
|---|---|
| KMS-signed JWT | Cached signing key (5-min TTL) |
| Redis session lookup | Postgres session lookup (slower but correct) |
| SSO SAML | Local password (if tenant allows) |
| Adaptive MFA risk score | Default to "challenge" (higher friction, safer) |
| Real-time device binding | Queue binding request; sync on next session |
5. Chaos Engineering
Monthly game-day scenarios:
- Kill 50% of `identity-api` pods mid-login flow.
- Simulate KMS 10s latency spike.
- Partition Redis from API pods.
- Drop 10% of NATS messages for 2 min (test outbox recovery).
- Invalidate a `kid` mid-request (test JWKS refresh).