Failure Modes

:::info Source
Sourced from services/identity-service/FAILURE_MODES.md in the documentation repo.
:::

1. Known Failure Scenarios

1.1 KMS Unavailable

Symptom: JWT signing fails; all /auth/login and /auth/refresh requests fail with 502 upstream.unavailable.
Blast radius: Platform-wide (every authenticated API call fails once existing JWTs expire).
Mitigation:
- Multi-region KMS with automatic failover.
- In-memory signing key cache on each pod (5-min TTL), which tolerates a KMS blip shorter than the TTL.
- Short-term: extend JWT TTL to 60 min (from 15 min) during KMS outage via feature flag.
Recovery: once KMS is back, pods re-fetch kid material; no user action required.
Runbook: runbooks/identity/kms-outage.md
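
The per-pod key cache described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: the `SigningKeyCache` class, the `fetch` callable (standing in for a KMS call), and the injectable `clock` are all hypothetical names. Only the 5-minute TTL comes from the mitigation list.

```python
import time
from typing import Callable, Optional, Tuple


class SigningKeyCache:
    """Holds signing key material in memory for a fixed TTL (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], bytes],          # assumed wrapper around the KMS call
        ttl_seconds: float = 300.0,          # 5-min TTL from the mitigation list
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entry: Optional[Tuple[bytes, float]] = None  # (key, fetched_at)

    def get(self) -> bytes:
        now = self._clock()
        if self._entry is not None and now - self._entry[1] < self._ttl:
            # Fresh entry: KMS is not contacted, so a short outage goes unnoticed.
            return self._entry[0]
        # Entry missing or expired: this call fails if KMS is still unavailable.
        key = self._fetch()
        self._entry = (key, now)
        return key
```

Under this sketch, an outage shorter than the remaining TTL is invisible to callers, while a longer one surfaces as an exception from `fetch` on the first call after expiry.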

1.2 Postgres Primary Failure

Symptom: Writes fail; reads from replica still work (degraded auth — no registration, no session creation).
Mitigation: Patroni-managed HA; automatic failover < 30s. Read replicas serve JWKS + session validation during failover.
Recovery: automated promotion; replay WAL on rebuild.
Runbook: runbooks/identity/postgres-failover.md

1.3 IdP Timeout (SAML / OIDC)

Symptom: SSO callbacks fail (502 upstream.timeout).
Mitigation:
- Circuit breaker per IdP (trip after 5 failures in 30s; half-open after 60s).
- Fallback: display "SSO temporarily unavailable — use local password" on login page (tenant policy permitting).
- Retry: exponential backoff with jitter, max 3 attempts, total 5s budget.
Recovery: circuit auto-closes once IdP responds OK.
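
As an illustration of the per-IdP breaker, the sketch below trips after 5 failures within a rolling 30s window and becomes half-open 60s after tripping. The class and method names are hypothetical. It also assumes that a failed half-open probe re-opens the circuit for another full 60s, which the source does not specify, and it does not limit the number of concurrent probes.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Rolling-window circuit breaker (illustrative sketch, not the service's code)."""

    def __init__(
        self,
        max_failures: int = 5,       # trip threshold
        window_s: float = 30.0,      # failure counting window
        open_s: float = 60.0,        # time until half-open
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.open_s:
            return "half-open"
        return "open"

    def allow(self) -> bool:
        # Requests pass when closed or half-open (probe); they are rejected when open.
        return self.state() != "open"

    def record_success(self) -> None:
        # A successful probe closes the circuit; successes while closed change nothing.
        if self._opened_at is not None:
            self._opened_at = None
            self._failures.clear()

    def record_failure(self) -> None:
        now = self._clock()
        if self.state() == "half-open":
            # Assumption: a failed probe restarts the open period.
            self._opened_at = now
            return
        self._failures.append(now)
        # Drop failures that fell out of the rolling window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```

One breaker instance per IdP keeps a single misbehaving provider from affecting tenants that use a different one.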

1.4 Session Store (Redis) Failure

Symptom: Session lookups fall back to Postgres (3× latency); rate-limit counters lost.
Mitigation:
- Redis Sentinel with 3 replicas; automatic failover.
- Postgres fallback for session validation (degraded mode).
- Rate-limit counters reset on Redis recovery, which temporarily lifts limits (accepted risk).
Recovery: Redis replica promoted; counters rebuild from API traffic.

1.5 Device Binding Race Condition

Symptom: Two concurrent device-binding requests for same (userId, fingerprint) create duplicate Device rows.
Mitigation:
- Unique constraint UNIQUE (user_id, fingerprint) on devices table.
- Upsert with ON CONFLICT (user_id, fingerprint) DO UPDATE SET public_key=..., last_seen_at=now().
- Client dedup: clients use Idempotency-Key on binding request.
Recovery: duplicate detection cleanup job runs nightly.
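
The constraint-plus-upsert mitigation can be tried out locally. The sketch below uses SQLite as a stand-in for Postgres (an assumption made only so the example runs without a server), so `CURRENT_TIMESTAMP` replaces `now()`. The column types and the `bind_device` helper are hypothetical; only the unique key and the updated columns come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE devices (
        id           INTEGER PRIMARY KEY,
        user_id      TEXT NOT NULL,
        fingerprint  TEXT NOT NULL,
        public_key   TEXT NOT NULL,
        last_seen_at TEXT NOT NULL,
        UNIQUE (user_id, fingerprint)
    )
    """
)

UPSERT = """
    INSERT INTO devices (user_id, fingerprint, public_key, last_seen_at)
    VALUES (?, ?, ?, CURRENT_TIMESTAMP)
    ON CONFLICT (user_id, fingerprint)
    DO UPDATE SET public_key   = excluded.public_key,
                  last_seen_at = CURRENT_TIMESTAMP
"""


def bind_device(user_id: str, fingerprint: str, public_key: str) -> None:
    """Insert the device, or refresh the existing row for the same key pair."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (user_id, fingerprint, public_key))


# Two binding requests for the same (user_id, fingerprint) leave a single row.
bind_device("u1", "fp-abc", "pk-old")
bind_device("u1", "fp-abc", "pk-new")
rows = conn.execute(
    "SELECT user_id, fingerprint, public_key FROM devices"
).fetchall()
```

With the constraint in place, the second request updates the existing row instead of inserting a duplicate, so the nightly cleanup job should only matter for rows created before the constraint existed.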

1.6 Credential Stuffing Burst

Symptom: Spike in /auth/login failures; legitimate users locked out.
Mitigation:
- Per-email lockout: 5 failures in 15 min → 15-min lockout (exponential on repeated lockouts).
- Per-IP rate limit: 10/min.
- Anomaly classifier escalates MFA challenge on atypical patterns.
- Edge WAF blocks known credential-stuffing signatures.
Recovery: lockouts expire automatically; support can lift manually.
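
A rough sketch of the per-email lockout rule follows. The thresholds (5 failures in 15 min, 15-min base lockout) come from the list above. The doubling of the lockout duration on each repeated lockout is an assumption, since the source only says "exponential", and the `LoginLockout` class is a hypothetical in-memory stand-in for whatever store the service uses.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

WINDOW_S = 15 * 60        # failures are counted over 15 minutes
MAX_FAILURES = 5          # failures that trigger a lockout
BASE_LOCKOUT_S = 15 * 60  # first lockout lasts 15 minutes


class LoginLockout:
    """In-memory per-email lockout tracker (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)
        self._lockout_count: Dict[str, int] = defaultdict(int)
        self._locked_until: Dict[str, float] = {}

    def is_locked(self, email: str) -> bool:
        until = self._locked_until.get(email)
        return until is not None and self._clock() < until

    def record_failure(self, email: str) -> int:
        """Record a failed login; return the lockout duration in seconds (0 if none)."""
        now = self._clock()
        window = self._failures[email]
        window.append(now)
        while window and now - window[0] > WINDOW_S:
            window.popleft()
        if len(window) < MAX_FAILURES:
            return 0
        # Assumption: each repeated lockout doubles the previous duration.
        duration = BASE_LOCKOUT_S * 2 ** self._lockout_count[email]
        self._lockout_count[email] += 1
        self._locked_until[email] = now + duration
        window.clear()
        return duration
```

Keying on the email rather than the IP is what makes legitimate users vulnerable to lockout during a stuffing burst, which is why the per-IP limit and WAF rules are listed alongside it.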

1.7 Refresh Token Reuse Detected

Symptom: Same refresh token used twice (classic session hijack indicator).
Mitigation:
- Family revoke: all sessions descended from the compromised refresh token are revoked.
- identity.session.revoked.v1 emitted with reason: 'rotation_reuse'.
- User forced to re-authenticate; optional notification email.
Recovery: user logs in fresh; audit entry retained.
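
The family-revoke behaviour can be illustrated with a small in-memory model. Everything here apart from the event name `identity.session.revoked.v1` and the reason `rotation_reuse` is hypothetical: the `RefreshTokenStore` class, the `family_id` field, and the event payload shape are assumptions made for the sketch.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TokenRecord:
    family_id: str
    used: bool = False
    revoked: bool = False


class ReuseDetected(Exception):
    """Raised when an already-rotated refresh token is presented again."""


class RefreshTokenStore:
    def __init__(self) -> None:
        self._tokens: Dict[str, TokenRecord] = {}
        self.events: List[dict] = []  # stand-in for the event outbox

    def issue(self, family_id: Optional[str] = None) -> str:
        token = secrets.token_urlsafe(16)
        self._tokens[token] = TokenRecord(family_id or secrets.token_urlsafe(8))
        return token

    def rotate(self, token: str) -> str:
        record = self._tokens.get(token)
        if record is None or record.revoked:
            raise PermissionError("invalid refresh token")
        if record.used:
            # Second use of the same token: revoke every token in the family.
            self._revoke_family(record.family_id)
            raise ReuseDetected(record.family_id)
        record.used = True
        return self.issue(record.family_id)  # successor stays in the same family

    def _revoke_family(self, family_id: str) -> None:
        for record in self._tokens.values():
            if record.family_id == family_id:
                record.revoked = True
        self.events.append(
            {
                "type": "identity.session.revoked.v1",
                "reason": "rotation_reuse",
                "family_id": family_id,  # assumed payload field
            }
        )
```

Revoking the whole family matters because the server cannot tell whether the attacker or the legitimate client presented the token first; after reuse, neither the old token nor its successors remain valid.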

1.8 SAML Metadata Drift

Symptom: IdP rotates its signing cert; our cached metadata is stale; all SAML responses fail verification.
Mitigation:
- Metadata refresh daily (or on <ds:KeyInfo> mismatch).
- Graceful error: "Your IdP metadata has changed — contact admin" (generic — no cert leak).
Recovery: tenant admin refreshes the metadata URL, or the scheduled auto-pull picks up the new cert.

1.9 JWKS Cache Stampede

Symptom: Every consumer service fetches JWKS simultaneously on rotation.
Mitigation:
- CDN cache with stale-while-revalidate prevents origin storm.
- Jittered TTL in consumer libraries (5 min ± 30s).
- Origin supports > 10k rps burst.
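
The jittered TTL can be as small as the sketch below, assuming a uniform distribution over 5 min ± 30s (the source does not name the distribution). The function name and the injectable `rng` parameter are hypothetical.

```python
import random


def jittered_ttl(base_s: float = 300.0, jitter_s: float = 30.0, rng=random) -> float:
    """Return a cache TTL in seconds, spread uniformly over base_s +/- jitter_s."""
    return base_s + rng.uniform(-jitter_s, jitter_s)
```

Because each consumer draws its own TTL, cache expiries spread across a 60s band instead of lining up on the same instant after a rotation.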

1.10 Outbox Backlog (NATS Unavailable)

Symptom: Events accumulate in outbox table; downstream services see stale state.
Mitigation:
- Outbox Relay retries with exponential backoff; buffer up to 7 days.
- Alert when the backlog exceeds 5000 rows or the oldest unpublished event is older than 5 min.
- DLQ for permanently unpublishable events.
Recovery: NATS reconnect → backlog drains; ordering preserved by occurred_at.
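
The backlog alert condition could be evaluated along these lines. The function name, its parameters, and the reason strings are hypothetical; the thresholds are the ones listed above, and using `occurred_at` of the oldest row as the age reference is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def backlog_alert(
    row_count: int,
    oldest_occurred_at: Optional[datetime],  # None when the outbox is empty
    now: datetime,
    max_rows: int = 5000,
    max_age: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return the reasons the outbox backlog should alert (empty list if healthy)."""
    reasons: List[str] = []
    if row_count > max_rows:
        reasons.append(f"backlog has {row_count} rows (limit {max_rows})")
    if oldest_occurred_at is not None and now - oldest_occurred_at > max_age:
        reasons.append(f"oldest event is {now - oldest_occurred_at} old (limit {max_age})")
    return reasons


# Example: a small but old backlog still alerts on age.
_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
example = backlog_alert(120, _now - timedelta(minutes=9), _now)
```

Checking age as well as row count catches a low-traffic period in which NATS is down but few events have accumulated.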

2. Retry / Backoff Rules

Operation	Max attempts	Backoff	Total budget
KMS sign	3	50ms, 200ms, 500ms	1s
Postgres write	3	10ms, 50ms, 200ms	300ms
IdP HTTP	3	200ms, 1s, 3s	5s
Outbox publish	infinite	exp with jitter, cap 5 min	—
Webhook out (password reset email)	5	1s, 5s, 30s, 2m, 10m	15 min
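
A bounded retry helper driven by the table's values might look like the sketch below. It assumes the listed backoff values are the waits after successive failed attempts and that no retry starts if its wait would exceed the total budget; the table does not spell out either point. `call_with_retry` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_retry(
    op: Callable[[], T],
    delays: Sequence[float],       # e.g. [0.05, 0.2, 0.5] for "KMS sign"
    max_attempts: int,             # e.g. 3
    budget_s: float,               # e.g. 1.0
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call op, retrying on any exception within the attempt and time limits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            delay = delays[min(attempt - 1, len(delays) - 1)]
            if clock() - start + delay > budget_s:
                raise  # waiting would exceed the total budget
            sleep(delay)
    raise AssertionError("unreachable")
```

For example, `call_with_retry(sign, [0.05, 0.2, 0.5], 3, 1.0)` would correspond to the KMS sign row, where `sign` is whatever callable performs the signing request.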

3. Circuit Breakers

Target	Trip threshold	Reset
KMS	10 failures / 30s	half-open after 60s
External IdP	5 failures / 30s	half-open after 60s
Email provider (reset flow)	20 failures / 60s	half-open after 2min
Redis	20 failures / 10s	half-open after 30s

4. Fallback Paths

Primary	Fallback
KMS-signed JWT	Cached signing key (5-min TTL)
Redis session lookup	Postgres session lookup (slower but correct)
SSO SAML	Local password (if tenant allows)
Adaptive MFA risk score	Default to "challenge" (higher friction, safer)
Real-time device binding	Queue binding request; sync on next session
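
The Redis-to-Postgres row can be sketched as a simple try-then-fallback wrapper. The two lookup callables are hypothetical, and treating only `ConnectionError` as a trigger for the fallback is an assumption; the source does not say which errors qualify or how a Redis cache miss is handled.

```python
from typing import Callable, Optional, Tuple

Session = dict  # placeholder type for a session record


def lookup_session(
    session_id: str,
    redis_get: Callable[[str], Optional[Session]],     # assumed Redis accessor
    postgres_get: Callable[[str], Optional[Session]],  # assumed Postgres accessor
) -> Tuple[Optional[Session], str]:
    """Return (session, source), where source names the store that answered."""
    try:
        return redis_get(session_id), "redis"
    except ConnectionError:
        # Degraded mode: slower, but Postgres remains the source of truth.
        return postgres_get(session_id), "postgres"
```

Returning the source alongside the session makes it straightforward to emit a metric for how often the degraded path is taken.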

5. Chaos Engineering

Monthly game-day scenarios:

- Kill 50% of identity-api pods mid-login flow.
- Simulate KMS 10s latency spike.
- Partition Redis from API pods.
- Drop 10% of NATS messages for 2 min (test outbox recovery).
- Invalidate a kid mid-request (test JWKS refresh).
