
Failure Modes

:::info Source
Sourced from services/identity-service/FAILURE_MODES.md in the documentation repo.
:::

1. Known Failure Scenarios

1.1 KMS Unavailable

  • Symptom: JWT signing fails; all /auth/login and /auth/refresh requests fail with 502 upstream.unavailable.
  • Blast radius: Platform-wide (every authenticated API call fails once existing JWTs expire).
  • Mitigation:
    • Multi-region KMS with automatic failover.
    • In-memory signing key cache on each pod (5-min TTL); tolerates brief KMS blips.
    • Short-term: extend JWT TTL to 60 min (from 15 min) during KMS outage via feature flag.
  • Recovery: once KMS is back, pods re-fetch kid material; no user action required.
  • Runbook: runbooks/identity/kms-outage.md
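
The per-pod cache in the mitigation list can be sketched as follows. This is a minimal illustration, not the service's actual code: `SigningKeyCache`, `CachedKey`, and the `fetch` callable (standing in for a KMS call) are hypothetical names. Only the 5-minute TTL comes from the list above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CachedKey:
    kid: str
    material: bytes
    fetched_at: float


class SigningKeyCache:
    """Per-pod in-memory cache for signing key material (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, bytes]],  # assumed wrapper around the KMS call
        ttl_seconds: float = 300.0,              # 5-min TTL from the mitigation list
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entry: Optional[CachedKey] = None

    def get(self) -> CachedKey:
        now = self._clock()
        # Serve from memory while the entry is fresh; KMS is not contacted.
        if self._entry is not None and now - self._entry.fetched_at < self._ttl:
            return self._entry
        # Entry missing or expired: go to KMS. An exception here propagates.
        kid, material = self._fetch()
        self._entry = CachedKey(kid, material, now)
        return self._entry
```

While the entry is fresh, `get()` never touches KMS, so a short outage goes unnoticed. Once the TTL lapses, a fetch error propagates to the caller, which is where the retry and circuit-breaker rules later in this page would apply.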

1.2 Postgres Primary Failure

  • Symptom: Writes fail; reads from replica still work (degraded auth — no registration, no session creation).
  • Mitigation: Patroni-managed HA; automatic failover < 30s. Read replicas serve JWKS + session validation during failover.
  • Recovery: automated promotion; replay WAL on rebuild.
  • Runbook: runbooks/identity/postgres-failover.md

1.3 IdP Timeout (SAML / OIDC)

  • Symptom: SSO callbacks fail (502 upstream.timeout).
  • Mitigation:
    • Circuit breaker per IdP (trip after 5 failures in 30s; half-open after 60s).
    • Fallback: display "SSO temporarily unavailable — use local password" on login page (tenant policy permitting).
    • Retry: exponential backoff with jitter, max 3 attempts, total 5s budget.
  • Recovery: circuit auto-closes once IdP responds OK.

1.4 Session Store (Redis) Failure

  • Symptom: Session lookups fall back to Postgres (3× latency); rate-limit counters lost.
  • Mitigation:
    • Redis Sentinel with 3 replicas; automatic failover.
    • Postgres fallback for session validation (degraded mode).
    • Rate-limit counters reset on Redis recovery, which temporarily lifts limits (accepted risk).
  • Recovery: Redis replica promoted; counters rebuild from API traffic.

1.5 Device Binding Race Condition

  • Symptom: Two concurrent device-binding requests for the same (userId, fingerprint) create duplicate Device rows.
  • Mitigation:
    • Unique constraint UNIQUE (user_id, fingerprint) on devices table.
    • Upsert with ON CONFLICT (user_id, fingerprint) DO UPDATE SET public_key=..., last_seen_at=now().
    • Client dedup: clients use Idempotency-Key on binding request.
  • Recovery: a nightly cleanup job detects and removes duplicates.
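
The constraint-plus-upsert mitigation can be tried out locally. The sketch below uses Python's bundled SQLite as a stand-in for Postgres (both accept `ON CONFLICT ... DO UPDATE` and the `excluded` row). The column types, `bind_device`, and `demo` are assumptions made for illustration; the real schema is not shown in this page.

```python
import sqlite3

# Assumed minimal schema; only the UNIQUE constraint comes from the text above.
SCHEMA = """
CREATE TABLE devices (
    id           INTEGER PRIMARY KEY,
    user_id      TEXT NOT NULL,
    fingerprint  TEXT NOT NULL,
    public_key   TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    UNIQUE (user_id, fingerprint)
)
"""

UPSERT = """
INSERT INTO devices (user_id, fingerprint, public_key, last_seen_at)
VALUES (?, ?, ?, datetime('now'))
ON CONFLICT (user_id, fingerprint)
DO UPDATE SET public_key   = excluded.public_key,
              last_seen_at = excluded.last_seen_at
"""


def bind_device(conn: sqlite3.Connection, user_id: str,
                fingerprint: str, public_key: str) -> None:
    """Insert the binding, or refresh it if the pair already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (user_id, fingerprint, public_key))


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    bind_device(conn, "u1", "fp1", "key-a")
    bind_device(conn, "u1", "fp1", "key-b")  # same pair: updates, no duplicate
    return conn.execute(
        "SELECT user_id, fingerprint, public_key FROM devices"
    ).fetchall()
```

Calling `demo()` leaves a single row for `("u1", "fp1")` carrying the second public key, which is the behaviour the unique constraint is meant to guarantee under concurrent requests.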

1.6 Credential Stuffing Burst

  • Symptom: Spike in /auth/login failures; legitimate users locked out.
  • Mitigation:
    • Per-email lockout: 5 failures in 15 min → 15-min lockout (exponential on repeated lockouts).
    • Per-IP rate limit: 10/min.
    • Anomaly classifier escalates MFA challenge on atypical patterns.
    • Edge WAF blocks known credential-stuffing signatures.
  • Recovery: lockouts expire automatically; support can lift manually.
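
A rough sketch of the per-email lockout rule follows. The window, threshold, and base lockout come from the list above; the doubling factor for repeated lockouts is an assumption (the text only says "exponential"), and `EmailLockout` is a hypothetical name. The sketch never resets `lockout_count`, which a real implementation would need a policy for.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque

WINDOW_S = 15 * 60        # failures are counted over 15 min
MAX_FAILURES = 5          # 5 failures trigger a lockout
BASE_LOCKOUT_S = 15 * 60  # first lockout lasts 15 min


@dataclass
class EmailLockout:
    """Lockout state for one email address (hypothetical helper)."""
    failures: Deque[float] = field(default_factory=deque)
    locked_until: float = 0.0
    lockout_count: int = 0

    def is_locked(self, now: float) -> bool:
        return now < self.locked_until

    def record_failure(self, now: float) -> None:
        # Drop failures that have aged out of the 15-min window.
        while self.failures and now - self.failures[0] >= WINDOW_S:
            self.failures.popleft()
        self.failures.append(now)
        if len(self.failures) >= MAX_FAILURES:
            self.lockout_count += 1
            # Assumed growth: 15 min, 30 min, 60 min, ...
            duration = BASE_LOCKOUT_S * 2 ** (self.lockout_count - 1)
            self.locked_until = now + duration
            self.failures.clear()
```

Timestamps are passed in as plain seconds so the rule can be exercised without waiting on a real clock.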

1.7 Refresh Token Reuse Detected

  • Symptom: Same refresh token used twice (classic session hijack indicator).
  • Mitigation:
    • Family revoke: all sessions descended from the compromised refresh token are revoked.
    • identity.session.revoked.v1 emitted with reason: 'rotation_reuse'.
    • User forced to re-authenticate; optional notification email.
  • Recovery: user logs in fresh; audit entry retained.
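
Reuse detection with family revocation can be illustrated with an in-memory store. This is a sketch under assumptions: `RefreshTokenStore` and its fields are hypothetical, and the event dictionary's shape is invented for the example. Only the event name `identity.session.revoked.v1` and the reason `rotation_reuse` come from the list above.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class RefreshTokenStore:
    """In-memory stand-in for the session store (hypothetical)."""
    family_of: Dict[str, str] = field(default_factory=dict)  # token -> family id
    used: Set[str] = field(default_factory=set)
    revoked_families: Set[str] = field(default_factory=set)
    events: List[dict] = field(default_factory=list)

    def issue(self, family_id: Optional[str] = None) -> str:
        """Issue a token, starting a new family unless one is given."""
        token = secrets.token_urlsafe(32)
        self.family_of[token] = family_id or secrets.token_hex(8)
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a refresh token for its successor; None means re-authenticate."""
        family = self.family_of.get(token)
        if family is None or family in self.revoked_families:
            return None
        if token in self.used:
            # Second use of the same token: revoke every descendant session.
            self.revoked_families.add(family)
            self.events.append({
                "type": "identity.session.revoked.v1",
                "reason": "rotation_reuse",
                "family": family,  # assumed payload field
            })
            return None
        self.used.add(token)
        return self.issue(family)
```

After a reuse, even the newest token in the family stops working, so both the attacker and the legitimate user are pushed back to a fresh login.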

1.8 SAML Metadata Drift

  • Symptom: IdP rotates its signing cert; our cached metadata goes stale; all SAML responses fail verification.
  • Mitigation:
    • Metadata refresh daily (or on <ds:KeyInfo> mismatch).
    • Graceful error: "Your IdP metadata has changed — contact admin" (generic — no cert leak).
  • Recovery: tenant admin refreshes metadata URL; or auto-pull on schedule.

1.9 JWKS Cache Stampede

  • Symptom: Every consumer service fetches JWKS simultaneously on rotation.
  • Mitigation:
    • CDN cache with stale-while-revalidate prevents origin storm.
    • Jittered TTL in consumer libraries (5 min ± 30s).
    • Origin supports > 10k rps burst.
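
The jittered TTL is simple enough to show in a few lines. The function name and the uniform distribution are assumptions; the list above only specifies 5 min ± 30s.

```python
import random
from typing import Optional


def jittered_ttl(base_s: float = 300.0, jitter_s: float = 30.0,
                 rng: Optional[random.Random] = None) -> float:
    """Return a cache TTL drawn uniformly from [base - jitter, base + jitter]."""
    rng = rng if rng is not None else random.Random()
    return base_s + rng.uniform(-jitter_s, jitter_s)
```

Because each consumer draws its own TTL, refreshes after a rotation spread over roughly a minute instead of landing on the origin at the same instant.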

1.10 Outbox Backlog (NATS Unavailable)

  • Symptom: Events accumulate in outbox table; downstream services see stale state.
  • Mitigation:
    • Outbox Relay retries with exponential backoff; buffer up to 7 days.
    • Alert when the backlog exceeds 5000 rows or the oldest row is older than 5 min.
    • DLQ for permanently unpublishable events.
  • Recovery: NATS reconnect → backlog drains; ordering preserved by occurred_at.
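
The alert condition and the ordered drain can be sketched as below. `OutboxRow`, `backlog_alert`, and `drain` are hypothetical names. The sketch assumes row age is measured from `occurred_at`, that ties are broken by `id`, and that a publish failure surfaces as `ConnectionError`; none of these are stated in the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List

MAX_ROWS = 5000                 # backlog size threshold
MAX_AGE = timedelta(minutes=5)  # age threshold for the oldest row


@dataclass
class OutboxRow:
    id: int
    occurred_at: datetime
    payload: str


def backlog_alert(rows: List[OutboxRow], now: datetime) -> bool:
    """True when the backlog is too large or its oldest row is too old."""
    if len(rows) > MAX_ROWS:
        return True
    return any(now - row.occurred_at > MAX_AGE for row in rows)


def drain(rows: List[OutboxRow],
          publish: Callable[[OutboxRow], None]) -> List[OutboxRow]:
    """Publish in occurred_at order; return the rows still pending.

    Draining stops at the first failure so later events never overtake
    an earlier one.
    """
    pending = sorted(rows, key=lambda r: (r.occurred_at, r.id))
    for index, row in enumerate(pending):
        try:
            publish(row)
        except ConnectionError:
            return pending[index:]
    return []
```

Stopping at the first failed publish, rather than skipping it, is what keeps the `occurred_at` ordering intact when the broker drops out mid-drain.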

2. Retry / Backoff Rules

| Operation | Max attempts | Backoff | Total budget |
| --- | --- | --- | --- |
| KMS sign | 3 | 50ms, 200ms, 500ms | 1s |
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| IdP HTTP | 3 | 200ms, 1s, 3s | 5s |
| Outbox publish | infinite | exp with jitter, cap 5 min | |
| Webhook out (password reset email) | 5 | 1s, 5s, 30s, 2m, 10m | 15 min |
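
A bounded retry loop driven by these rules might look like the sketch below. `retry` is a hypothetical helper. The rules do not say whether the listed delays are waits between attempts or something else; this sketch treats them as waits after each failed attempt, so with 3 attempts only the first two delays are ever used.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def retry(op: Callable[[], T], max_attempts: int, delays_s: Sequence[float],
          budget_s: float,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> T:
    """Call op until it succeeds, attempts run out, or the budget would be exceeded."""
    if max_attempts < 1 or not delays_s:
        raise ValueError("need at least one attempt and one delay")
    start = clock()
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as exc:  # a real caller would narrow this
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no attempts left
        delay = delays_s[min(attempt, len(delays_s) - 1)]
        if clock() - start + delay > budget_s:
            break  # waiting would blow the total budget
        sleep(delay)
    assert last_error is not None
    raise last_error
```

For the KMS sign row this would be called as `retry(sign, 3, [0.05, 0.2, 0.5], 1.0)`, where `sign` is whatever callable wraps the signing request.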

3. Circuit Breakers

| Target | Trip threshold | Reset |
| --- | --- | --- |
| KMS | 10 failures / 30s | half-open after 60s |
| External IdP | 5 failures / 30s | half-open after 60s |
| Email provider (reset flow) | 20 failures / 60s | half-open after 2min |
| Redis | 20 failures / 10s | half-open after 30s |
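
The trip and reset behaviour can be sketched as a small state machine. `CircuitBreaker` is a hypothetical class, not the service's implementation. It assumes the breaker trips when the failure count within the window reaches the threshold, and it lets every call through while half-open, where a production breaker would usually admit a single probe.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Sliding-window breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures: int, window_s: float, reset_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._reset_s = reset_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_s:
            return "half-open"
        return "open"

    def allow(self) -> bool:
        """Callers check this before contacting the target."""
        return self.state() != "open"

    def record_success(self) -> None:
        self._failures.clear()
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        now = self._clock()
        if self.state() == "half-open":
            self._opened_at = now  # probe failed: reopen for another reset_s
            return
        self._failures.append(now)
        # Keep only failures inside the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
            self._failures.clear()
```

The External IdP row corresponds to `CircuitBreaker(max_failures=5, window_s=30, reset_s=60)`; the other rows differ only in their three numbers.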

4. Fallback Paths

| Primary | Fallback |
| --- | --- |
| KMS-signed JWT | Cached signing key (5-min TTL) |
| Redis session lookup | Postgres session lookup (slower but correct) |
| SSO SAML | Local password (if tenant allows) |
| Adaptive MFA risk score | Default to "challenge" (higher friction, safer) |
| Real-time device binding | Queue binding request; sync on next session |
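
Two of these rows reduce to the same pattern: try the primary and fall back on a connectivity error. The sketch below is illustrative only. `with_fallback` and `mfa_decision` are hypothetical names, and the 0.5 risk threshold and the "allow" outcome are assumptions not found in this page.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Return primary(); if it raises one of `errors`, return fallback()."""
    try:
        return primary()
    except errors:
        return fallback()


def mfa_decision(risk_score: Callable[[], float], threshold: float = 0.5) -> str:
    """Fail safe: when the risk scorer is unreachable, challenge the user."""
    def scored() -> str:
        return "challenge" if risk_score() >= threshold else "allow"
    return with_fallback(scored, lambda: "challenge")
```

A Redis-to-Postgres session lookup would follow the same shape, with the Redis read as `primary` and the Postgres query as `fallback`.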

5. Chaos Engineering

Monthly game-day scenarios:

  • Kill 50% of identity-api pods mid-login flow.
  • Simulate KMS 10s latency spike.
  • Partition Redis from API pods.
  • Drop 10% of NATS messages for 2 min (test outbox recovery).
  • Invalidate a kid mid-request (test JWKS refresh).