iam-service — Failure Modes
A catalog of every known failure mode for iam-service: what breaks, what users experience, how we detect it, and how we mitigate it. Platform availability is bounded by iam-service availability, so treat every entry seriously.
1. Failure Catalog
1.1 Cloud KMS — signing key unavailable
| Aspect | Detail |
|---|---|
| Trigger | KMS regional outage; IAM revocation; misconfigured kid. |
| User impact | New logins / refreshes / SSO callbacks fail (cannot mint JWT). Existing access tokens still valid until TTL (≤ 15 min). |
| Detection | iam_kms_op_duration_seconds errors; IamKMSUnavailable alert; canary login fails. |
| Mitigation | Open circuit breaker; serve 503 problem+json{type:"…/kms_unavailable"}; surface "auth degraded" status banner; failover to standby KMS region (M2). |
| Recovery | Manual: confirm KMS health → close breaker → flush JWKS cache. |
| Runbook | runbooks/iam/kms-outage.md |
| RTO | < 30 min (M0); < 5 min (M2). |
| Data loss | None (no writes lost — denied at edge). |
1.2 Cloud SQL — primary failover
| Aspect | Detail |
|---|---|
| Trigger | Zone failure; planned maintenance; manual failover. |
| User impact | 30–60 s elevated 5xx; refresh tokens may transiently fail. |
| Detection | iam_db_pool_active drops to 0; readiness probe fails. |
| Mitigation | Cloud Run health probes drain instance; reconnect with exponential backoff; iam-worker pauses outbox publishing during cutover. |
| Recovery | Auto via Cloud SQL HA; verify with smoke test post-failover. |
| Runbook | runbooks/iam/cloudsql-failover.md |
| RTO | < 60 s. |
| Data loss | 0 (synchronous regional replication). |
1.3 Memorystore (Redis) — partition / down
| Aspect | Detail |
|---|---|
| Trigger | Network partition; node failure; eviction. |
| User impact | Slower auth (DB-backed lookups); rate limits less precise; magic links fail. |
| Detection | iam_redis_op_duration_seconds error spike; readiness degrades. |
| Mitigation | Fall back to DB for session lookup; rate-limit falls back to per-pod in-memory token bucket; magic-link writes return 503 (rare path); circuit breaker on hot reads. |
| Recovery | Redis recovery; cache warms organically. |
| Runbook | runbooks/iam/redis-down.md |
| RTO | < 5 min. |
| Data loss | Cache only (rebuilds); magic-link writes during outage are rejected — user retries. |
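
The per-pod in-memory token bucket used as the rate-limit fallback could look roughly like the sketch below. It is illustrative only: the class name, the capacity and refill values, and the injectable clock are assumptions, not the service's actual implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-process token bucket. State is local to one pod (not shared)."""

    def __init__(
        self,
        capacity: float,            # maximum burst size (hypothetical value)
        refill_per_second: float,   # sustained rate (hypothetical value)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because each pod keeps its own bucket, the effective global limit grows with the number of pods, which is one reason rate limits are less precise while Redis is unavailable.
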
1.4 OIDC IdP timeout / down
| Aspect | Detail |
|---|---|
| Trigger | IdP outage; certificate expiry; metadata changed. |
| User impact | SSO login fails for affected tenant only. |
| Detection | iam_sso_callbacks_total{result="error"}; tenant-scoped alert. |
| Mitigation | Fall back to password / magic-link if tenant policy permits; surface specific error code MELMASTOON.IAM.SSO.PROVIDER_UNAVAILABLE; retry with backoff. |
| Recovery | IdP recovery. |
| Runbook | runbooks/iam/sso-outage.md |
| RTO | Tenant-dependent. |
| Data loss | None. |
1.5 SAML metadata drift
| Aspect | Detail |
|---|---|
| Trigger | IdP rotated signing cert without notice. |
| User impact | All SAML logins for tenant fail with assertion-invalid. |
| Detection | iam_sso_callbacks_total{provider="saml",result="signature_invalid"} spike. |
| Mitigation | Refresh metadata from IdP URL (if MetadataURL configured) every 6 h + on first failure. |
| Recovery | Auto-refresh succeeds, OR manual upload by tenant admin. |
| Runbook | runbooks/iam/saml-drift.md |
| RTO | Auto: ≤ 6 h; manual: minutes. |
| Data loss | None. |
1.6 Breach-list (HIBP) provider down
| Aspect | Detail |
|---|---|
| Trigger | HIBP outage; rate limit hit; API key invalid. |
| User impact | Registration / password change cannot verify breach status; the check fails open and allows the user through (config-controlled). |
| Detection | IamBreachListProviderDown alert. |
| Mitigation | Fail-open with audit event iam.breach_list.skipped; set pwn_audit_due flag for backfill. |
| Recovery | Backfill job re-checks affected passwords on next login. |
| Runbook | runbooks/iam/hibp-down.md |
| RTO | n/a (degraded mode permitted). |
| Data loss | None. |
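
The fail-open behavior can be sketched as below. The audit event name `iam.breach_list.skipped` and the `pwn_audit_due` flag come from the table above; the function, the result type, and the `fail_open` parameter name are hypothetical and only illustrate the decision flow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BreachCheckResult:
    allowed: bool                 # may the password be accepted?
    pwn_audit_due: bool           # should the backfill job re-check later?
    audit_events: List[str] = field(default_factory=list)


def check_breach(is_breached: Callable[[], bool], fail_open: bool = True) -> BreachCheckResult:
    """Run the provider lookup; on provider failure, fail open or closed per config."""
    try:
        breached = is_breached()
    except Exception:
        if fail_open:
            # Provider unavailable: allow the user, but record the skipped check.
            return BreachCheckResult(True, True, ["iam.breach_list.skipped"])
        return BreachCheckResult(False, False)
    return BreachCheckResult(allowed=not breached, pwn_audit_due=False)
```
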
1.7 Email delivery (notification-service) down
| Aspect | Detail |
|---|---|
| Trigger | Notification-service outage; SMTP provider issue. |
| User impact | Magic links, password reset, MFA backup messages delayed. |
| Detection | Notification SLO breach; iam emits but dispatch lags. |
| Mitigation | Dispatch is async via events; iam returns 202 to user; notification queues and retries. |
| Recovery | Notification recovers; queue drains. |
| Runbook | runbooks/iam/email-delayed.md |
| RTO | n/a. |
| Data loss | None. |
1.8 MFA TOTP drift
| Aspect | Detail |
|---|---|
| Trigger | Server clock drift; user device clock drift. |
| User impact | TOTP rejected even when user types correct code. |
| Detection | iam_mfa_challenges_total{factor="totp",result="invalid"} rate > 2 %; IamMFATotpDrift alert. |
| Mitigation | Accept ±1 step (30 s) by default; widen to ±2 with explicit ops approval; verify NTP healthy. |
| Recovery | Restore NTP; user resyncs device. |
| Runbook | runbooks/iam/totp-drift.md |
| RTO | Minutes. |
| Data loss | None. |
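
To make the ±1 step tolerance concrete, here is a minimal TOTP verifier following RFC 6238 with HMAC-SHA1 and a 30 s step. The function names and the choice of SHA-1 with 6 digits are assumptions for illustration; the service's real factor configuration may differ.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP using HMAC-SHA1 and RFC 4226 dynamic truncation."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, unix_time: int,
                window: int = 1, step: int = 30, digits: int = 6) -> bool:
    """Accept codes from `window` steps before or after the server's current step."""
    for drift in range(-window, window + 1):
        candidate = totp(secret, unix_time + drift * step, step, digits)
        if hmac.compare_digest(candidate, submitted):
            return True
    return False
```

Widening to ±2 corresponds to `window=2`, which accepts five consecutive codes instead of three, so it trades some guessing resistance for tolerance.
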
1.9 Device certificate mass expiry
| Aspect | Detail |
|---|---|
| Trigger | Tenant CA rotation without overlap; bulk-issued certs all expire near-simultaneously. |
| User impact | Offline desktops cannot refresh tokens; user must come online + re-auth. |
| Detection | iam_device_offline_certs_expiring_soon gauge spike; user complaints. |
| Mitigation | Always rotate CA with overlap; pre-emptive renewal triggered T-24h on Electron side; mass-renew batch tool for ops. |
| Recovery | Mass-renew job; user-facing email + in-app notification. |
| Runbook | runbooks/iam/device-cert-expiry.md |
| RTO | Hours. |
| Data loss | None (sessions revalidate online). |
1.10 Refresh-token theft
| Aspect | Detail |
|---|---|
| Trigger | XSS, malware, credential leak. |
| User impact | Brief unauthorized access until family revoke; legitimate user is also logged out (recovery flow). |
| Detection | iam.session.rotation_reuse_detected; IamRotationReuseSpike. |
| Mitigation | Family revoke on reuse; emit melmastoon.iam.session.revoked.v1{reason='rotation_reuse'}; force re-auth + adaptive MFA on next login; security alert. |
| Recovery | User re-auths with MFA. |
| Runbook | runbooks/iam/token-theft.md |
| RTO | Immediate (revoke is sync). |
| Data loss | None. |
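
Reuse detection with family revoke can be illustrated with the in-memory sketch below. All names (`RefreshStore`, `rotate`, the error strings) are hypothetical; a real store would persist token hashes rather than raw tokens and would emit the revocation event described above.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class RefreshRecord:
    family_id: str
    used: bool = False


class RefreshStore:
    """In-memory stand-in for the session store (illustration only)."""

    def __init__(self) -> None:
        self.tokens: Dict[str, RefreshRecord] = {}
        self.revoked_families: Set[str] = set()

    def issue(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self.tokens[token] = RefreshRecord(family_id)
        return token

    def rotate(self, presented: str) -> str:
        """Exchange a refresh token for a new one in the same family."""
        record = self.tokens.get(presented)
        if record is None or record.family_id in self.revoked_families:
            raise PermissionError("invalid_or_revoked")
        if record.used:
            # An already-rotated token was presented again: assume theft
            # and revoke every token descended from the same login.
            self.revoked_families.add(record.family_id)
            raise PermissionError("rotation_reuse")
        record.used = True
        return self.issue(record.family_id)
```

After a reuse is detected, the newest token in the family is rejected as well, which is why the legitimate user is logged out along with the attacker.
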
1.11 Credential stuffing attack
| Aspect | Detail |
|---|---|
| Trigger | Botnet hammering /auth/login. |
| User impact | Increased lockouts for genuine users sharing passwords; latency spikes. |
| Detection | iam_login_failures_total{reason="invalid_credentials"} spike; WAF challenge ratio. |
| Mitigation | Cloud Armor rate limit + bot challenges; per-account + per-IP throttle; force adaptive MFA promotion; CAPTCHA on bursts; auto-lock IP /24 on threshold. |
| Recovery | Attacker volume drops; lockouts auto-expire (15 min default). |
| Runbook | runbooks/iam/credential-stuffing.md |
| RTO | Continuous. |
| Data loss | None. |
1.12 JWKS cache stampede
| Aspect | Detail |
|---|---|
| Trigger | CDN cache flush + key rotation simultaneously. |
| User impact | Brief CDN miss spike → iam-jwks instances saturate. |
| Detection | iam-jwks request rate spike + p95 elevated. |
| Mitigation | Cache-Control: public, max-age=300, stale-while-revalidate=3600; consumer libs add jitter to refetch; min-replicas tuned for spike capacity. |
| Recovery | Cache repopulates; minutes. |
| Runbook | runbooks/iam/jwks-stampede.md |
| RTO | < 5 min. |
| Data loss | None. |
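
Consumer-side jitter on JWKS refetch might be implemented along these lines. Only the 300 s max-age comes from the header above; the function name and the 20 % jitter fraction are assumptions.

```python
import random
from typing import Callable


def next_refetch_delay(
    max_age_s: float = 300.0,
    jitter_fraction: float = 0.2,          # hypothetical default
    rng: Callable[[], float] = random.random,
) -> float:
    """Return a delay in (max_age * (1 - jitter), max_age] seconds.

    Refetching slightly early, at a random point, spreads consumers out so
    they do not all hit iam-jwks at the same instant after a cache flush.
    """
    return max_age_s * (1.0 - jitter_fraction * rng())
```
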
1.13 Outbox backlog
| Aspect | Detail |
|---|---|
| Trigger | Pub/Sub regional issue; worker scaling delay. |
| User impact | Downstream services (audit, gdpr, notification) receive delayed events. |
| Detection | iam_outbox_depth > 100; iam_outbox_lag_seconds > 5; IamOutboxStalled alert. |
| Mitigation | Worker auto-scales on depth; circuit breaker on Pub/Sub errors; backpressure (do not block API writes — outbox is decoupled). |
| Recovery | Backlog drains as Pub/Sub recovers. |
| Runbook | runbooks/iam/outbox-stalled.md |
| RTO | < 30 min typical. |
| Data loss | None (outbox is durable). |
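
The reason the outbox loses no events and does not block API writes is that the event row is committed in the same transaction as the domain write, and publishing happens later. The sketch below shows that pattern with an in-memory SQLite database; the table layout, topic name, and function names are hypothetical and are not the service's schema.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published_at TEXT
        );
        """
    )
    return db


def create_user(db: sqlite3.Connection, user_id: str, email: str) -> None:
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("example.user.created", json.dumps({"id": user_id})),
        )


def outbox_depth(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM outbox WHERE published_at IS NULL").fetchone()[0]


def drain_outbox(db: sqlite3.Connection, publish, batch: int = 100) -> int:
    """Publish pending rows in order; stop at the first failure and retry later."""
    rows = db.execute(
        "SELECT seq, topic, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY seq LIMIT ?",
        (batch,),
    ).fetchall()
    sent = 0
    for seq, topic, payload in rows:
        try:
            publish(topic, payload)
        except Exception:
            break  # row stays unpublished; the next run picks it up again
        with db:
            db.execute("UPDATE outbox SET published_at = datetime('now') WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

A row is marked published only after `publish` returns, so a crash between the two steps causes a duplicate delivery rather than a lost event; consumers therefore need to be idempotent.
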
1.14 DLQ events
| Aspect | Detail |
|---|---|
| Trigger | Persistent consumer error; schema incompatibility. |
| User impact | Subset of event-driven flows (e.g. tenant.created provisioning) stalled for affected events. |
| Detection | iam_dlq_depth > 0; IamDLQNonEmpty alert. |
| Mitigation | Circuit breaker on consumer; alert on-call; manual triage; replay or discard via pubsub-replay tool with audit. |
| Recovery | Manual triage. |
| Runbook | runbooks/iam/dlq-triage.md |
| RTO | Variable. |
| Data loss | None (DLQ retained 14 d). |
1.15 Magic-link replay
| Aspect | Detail |
|---|---|
| Trigger | Attacker intercepts email link. |
| User impact | If first to consume → unauthorized access; legitimate user gets MAGIC_LINK_USED. |
| Detection | iam.magic_link.replay_attempt audit event when same hash seen post-consume. |
| Mitigation | Single-use; 10-min TTL; optional bind to issuing IP/UA; recommend MFA for staff; user notification "your link was used from {IP}". |
| Recovery | User requests new link; suspicious-activity alert. |
| Runbook | runbooks/iam/magic-link-replay.md |
| RTO | n/a. |
| Data loss | None. |
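
Single-use consumption with a 10-minute TTL can be sketched as follows. `MAGIC_LINK_USED` is the code named above; the other result codes, the class name, and the in-memory dictionary are hypothetical stand-ins for the real store.

```python
import hashlib
import secrets
import time
from typing import Callable, Dict


class MagicLinkStore:
    """Stores only a hash of each token, so a leaked store cannot be replayed."""

    def __init__(self, ttl_s: float = 600.0, clock: Callable[[], float] = time.time) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._links: Dict[str, dict] = {}

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self) -> str:
        token = secrets.token_urlsafe(32)
        self._links[self._hash(token)] = {
            "expires_at": self.clock() + self.ttl_s,
            "consumed": False,
        }
        return token

    def consume(self, token: str) -> str:
        entry = self._links.get(self._hash(token))
        if entry is None:
            return "MAGIC_LINK_INVALID"      # hypothetical code
        if entry["consumed"]:
            # Same hash seen after consumption: this is where a replay
            # audit event would be emitted.
            return "MAGIC_LINK_USED"
        if self.clock() > entry["expires_at"]:
            return "MAGIC_LINK_EXPIRED"      # hypothetical code
        entry["consumed"] = True
        return "OK"
```
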
1.16 Concurrent MFA enrollment
| Aspect | Detail |
|---|---|
| Trigger | User enrolls TOTP from two browsers simultaneously. |
| User impact | One enrollment wins; the other's pending verification fails. |
| Detection | n/a (handled in domain). |
| Mitigation | Optimistic concurrency on MFAFactor; loser receives 409 conflict. |
| Recovery | User retries. |
| Runbook | n/a. |
| RTO | n/a. |
| Data loss | None. |
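
Optimistic concurrency on the factor record amounts to a compare-and-set on a version number. The sketch below is an assumption about how that might look; `FactorRepo`, `VersionConflict`, and the field names are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the row changed since it was read; an API layer would map this to 409."""


class FactorRepo:
    """In-memory stand-in for a single pending MFA factor row."""

    def __init__(self) -> None:
        self._row = {"status": "pending", "version": 1}

    def read(self) -> dict:
        return dict(self._row)

    def activate(self, expected_version: int) -> dict:
        # Equivalent in spirit to:
        #   UPDATE ... SET status = 'active', version = version + 1
        #   WHERE id = ? AND version = ?
        # and treating "0 rows updated" as a conflict.
        if self._row["version"] != expected_version:
            raise VersionConflict("factor was modified concurrently")
        self._row = {"status": "active", "version": expected_version + 1}
        return dict(self._row)
```

Both browsers read version 1; the first `activate(1)` succeeds and bumps the version to 2, so the second `activate(1)` raises the conflict.
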
1.17 Account-lockout DoS
| Aspect | Detail |
|---|---|
| Trigger | Attacker repeatedly submits wrong passwords for a known account to lock out the legitimate user. |
| User impact | Legitimate user blocked. |
| Detection | Lockout rate spike per email. |
| Mitigation | IP-scoped lockout when IP reputation is unknown / bad; magic-link self-recovery; admin unlock; tenant policy (auto-unlock after 15 min). |
| Recovery | TTL or admin lift. |
| Runbook | runbooks/iam/lockout-dos.md |
| RTO | ≤ 15 min default. |
| Data loss | None. |
1.18 API-key leak
| Aspect | Detail |
|---|---|
| Trigger | Key committed to repo; CI logs; misconfigured customer system. |
| User impact | Attacker calls APIs as the key holder until revoked. |
| Detection | Secret scanning (GitGuardian / GitHub); anomaly detection on API-key usage geography / rate. |
| Mitigation | Auto-revoke on detection; alert key owner; surface in-app banner; key rotation tool. |
| Recovery | Owner issues new key; updates integrations. |
| Runbook | runbooks/iam/apikey-leak.md |
| RTO | Minutes from detection. |
| Data loss | None. |
1.19 GDPR erasure stuck
| Aspect | Detail |
|---|---|
| Trigger | Saga participant down; transient DB error. |
| User impact | Erasure SLA (30 d) at risk. |
| Detection | gdpr-service SLO; iam-service emits …erasure_failed.v1. |
| Mitigation | Retry with backoff; on persistent failure escalate; idempotent design allows safe retry. |
| Recovery | Retry succeeds. |
| Runbook | runbooks/iam/gdpr-stuck.md |
| RTO | Days (within SLA). |
| Data loss | None. |
1.20 Tenant CA compromise
| Aspect | Detail |
|---|---|
| Trigger | Internal misuse / breach. |
| User impact | All offline desktop bindings for tenant must be re-issued. |
| Detection | Audit of CA usage; abnormal cert issuance. |
| Mitigation | Tenant CA private key in HSM (KMS, non-extractable); least-privilege IAM; audit-logged signing. |
| Recovery | Rotate CA; re-issue all certs; force online re-auth. |
| Runbook | runbooks/iam/tenant-ca-compromise.md |
| RTO | Hours. |
| Data loss | None (re-bind required). |
1.21 Clock skew
| Aspect | Detail |
|---|---|
| Trigger | Pod NTP out of sync. |
| User impact | JWT nbf/exp rejected by consumers; TOTP rejected. |
| Detection | Synthetic NTP probe; failure spike. |
| Mitigation | Use Cloud Run managed NTP; periodic check; reject deploy if drift > 1 s. |
| Recovery | Pod restart. |
| Runbook | runbooks/iam/clock-skew.md |
| RTO | Minutes. |
| Data loss | None. |
2. Retry / Backoff Defaults
| Operation | Strategy |
|---|---|
| KMS sign | 3 retries, exponential 100 ms / 300 ms / 900 ms, jittered |
| Postgres read | 2 retries on transient errors (SQLSTATE 40001 serialization_failure, 40P01 deadlock_detected) |
| Pub/Sub publish | 5 retries, exponential to 30 s, then DLQ |
| HIBP API | 1 retry then fail-open |
| OIDC IdP | 2 retries with 200 ms / 800 ms backoff |
| Notification dispatch | event re-delivered by Pub/Sub (consumer side) |
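
The KMS row (100 ms / 300 ms / 900 ms, jittered) corresponds to a base of 100 ms multiplied by 3 on each retry. A minimal sketch of that schedule is shown below; the ±20 % jitter range, the function names, and the choice of `TimeoutError` as the retryable error are assumptions.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def backoff_delays(
    base_s: float = 0.1,
    factor: float = 3.0,
    retries: int = 3,
    jitter: float = 0.2,                      # hypothetical: +/- 20 %
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield one jittered delay per retry: ~0.1 s, ~0.3 s, ~0.9 s by default."""
    for attempt in range(retries):
        nominal = base_s * factor ** attempt
        yield nominal * (1.0 + jitter * (2.0 * rng() - 1.0))


def call_with_retries(
    op: Callable[[], object],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError,),
):
    """Try `op` once, then once more after each delay; the last error propagates."""
    for delay in delays:
        try:
            return op()
        except retryable:
            sleep(delay)
    return op()  # final attempt after the last delay
```

With the defaults this makes at most four attempts (one initial call plus three retries). Non-retryable errors propagate immediately.
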
3. Circuit Breakers
| Dependency | Threshold | Half-open | Open behavior |
|---|---|---|---|
| Cloud KMS | 5 errors / 10 s | 1 probe / 5 s | Reject login + return 503 |
| Cloud SQL | 10 errors / 10 s | 1 probe / 5 s | Reject mutations; reads from cache where possible |
| Memorystore | 10 errors / 10 s | 1 probe / 5 s | DB fallback |
| HIBP | 3 errors / 30 s | 1 probe / 60 s | Fail-open |
| OIDC IdP | 5 errors / 30 s per provider | 1 probe / 60 s | Suggest password fallback |
| AI orchestrator | 3 errors / 30 s | 1 probe / 60 s | Rules-only fallback |
| Notification | n/a (async) | n/a | events queue |
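
The threshold and half-open columns can be read as a small state machine: count errors in a sliding window, open when the count reaches the threshold, and while open let one probe through per interval. The sketch below uses the Cloud KMS row (5 errors / 10 s, 1 probe / 5 s) as defaults; the class and its method names are hypothetical, not the service's breaker.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_errors: int = 5,
        window_s: float = 10.0,
        probe_interval_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_errors = max_errors
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self.clock = clock
        self._errors: Deque[float] = deque()   # timestamps of recent failures
        self._opened_at: Optional[float] = None
        self._last_probe_at = 0.0

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def allow_request(self) -> bool:
        """Closed: always allow. Open: allow one probe per probe interval."""
        if not self.is_open:
            return True
        now = self.clock()
        if now - self._last_probe_at >= self.probe_interval_s:
            self._last_probe_at = now
            return True
        return False

    def record_success(self) -> None:
        """A successful probe closes the breaker and resets the error window."""
        if self.is_open:
            self._opened_at = None
            self._errors.clear()

    def record_failure(self) -> None:
        now = self.clock()
        self._errors.append(now)
        while now - self._errors[0] > self.window_s:
            self._errors.popleft()             # drop failures outside the window
        if not self.is_open and len(self._errors) >= self.max_errors:
            self._opened_at = now
            self._last_probe_at = now
```

When `allow_request()` returns `False`, the caller applies the "Open behavior" column, for example returning 503 for KMS or falling back to the database for Memorystore.
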
4. Fallback Decision Matrix
| Scenario | Fallback |
|---|---|
| AI risk classifier down | Static rules, weighted heuristics |
| HIBP down | Skip breach check, audit-mark, backfill |
| Memorystore down | DB-backed lookups |
| Notification down | Event queues, retry |
| OIDC IdP down | Password / magic-link if tenant policy allows |
| SAML metadata stale | Auto-refresh; if still failing, surface clear error |
| Cloud KMS down | No fallback for sign — fail loud |
5. Known Anti-Patterns (do NOT do)
- ❌ Sign JWTs in process if KMS down ("graceful degrade") — would expose root key.
- ❌ Mint long-lived access tokens to "compensate" for refresh failure — breaks auditability.
- ❌ Auto-merge two User rows on email collision — causes silent account takeover.
- ❌ Cache MFA challenge results across requests — enables replay.
- ❌ Allow platform_admin to bypass MFA — single-factor on the most powerful role is unacceptable.
6. Chaos Engineering Scenarios
Run weekly in staging via Toxiproxy + custom scripts. See TESTING_STRATEGY §10.
| Scenario | Pass condition |
|---|---|
| KMS adds 2 s latency | Login p99 < 2.5 s; no cascading failure |
| Postgres failover | Recovery < 60 s; no data loss |
| Pub/Sub publish error rate 50 % | Outbox retries; no event loss |
| Memorystore down | Auth still works; metrics show fallback |
| OIDC IdP returns 500 | Tenant SSO degrades cleanly; alert fires |
| Notification down 5 min | Magic links eventually deliver; user sees "may take a few minutes" hint |
| AI orchestrator down | Adaptive MFA falls back to rules; counter increments |
| Cloud Run zone failure | Auto-recover; canary login passes |
7. Failure-mode → Runbook Cross-Reference
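See OBSERVABILITY §8 (Runbook Index). Every failure mode listed here has a runbook entry, except where the Runbook row is marked n/a (currently only 1.16, which is handled in the domain layer); a missing runbook blocks readiness (SERVICE_READINESS).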