iam-service — Failure Modes

Catalog · DEPLOYMENT_TOPOLOGY · OBSERVABILITY · SERVICE_RISK_REGISTER

A catalog of every known failure mode for iam-service: what breaks, what users experience, how we detect it, and how we mitigate it. Platform availability is bounded by iam-service availability, so treat every entry seriously.

1. Failure Catalog

1.1 Cloud KMS — signing key unavailable

Aspect	Detail
Trigger	KMS regional outage; IAM revocation; misconfigured `kid`.
User impact	New logins / refreshes / SSO callbacks fail (cannot mint JWT). Existing access tokens remain valid until their TTL expires (≤ 15 min).
Detection	`iam_kms_op_duration_seconds` errors; `IamKMSUnavailable` alert; canary login fails.
Mitigation	Open circuit breaker; serve `503 problem+json{type:"…/kms_unavailable"}`; surface "auth degraded" status banner; failover to standby KMS region (M2).
Recovery	Manual: confirm KMS health → close breaker → flush JWKS cache.
Runbook	`runbooks/iam/kms-outage.md`
RTO	< 30 min (M0); < 5 min (M2).
Data loss	None (no writes lost — denied at edge).

1.2 Cloud SQL — primary failover

Aspect	Detail
Trigger	Zone failure; planned maintenance; manual failover.
User impact	30–60 s elevated 5xx; refresh tokens may transiently fail.
Detection	`iam_db_pool_active` drops to 0; readiness probe fails.
Mitigation	Cloud Run health probes drain instance; reconnect with exponential backoff; `iam-worker` pauses outbox publishing during cutover.
Recovery	Auto via Cloud SQL HA; verify with smoke test post-failover.
Runbook	`runbooks/iam/cloudsql-failover.md`
RTO	< 60 s.
Data loss	0 (synchronous regional replication).

1.3 Memorystore (Redis) — partition / down

Aspect	Detail
Trigger	Network partition; node failure; eviction.
User impact	Slower auth (DB-backed lookups); rate limits less precise; magic links fail.
Detection	`iam_redis_op_duration_seconds` error spike; readiness degrades.
Mitigation	Fall back to DB for session lookup; rate-limit falls back to per-pod in-memory token bucket; magic-link writes return 503 (rare path); circuit breaker on hot reads.
Recovery	Redis recovery; cache warms organically.
Runbook	`runbooks/iam/redis-down.md`
RTO	< 5 min.
Data loss	Cache only (rebuilds); magic-link writes during outage are rejected — user retries.
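
The per-pod rate-limit fallback named in the Mitigation row could look roughly like the sketch below. It is a minimal in-memory token bucket, not the service's actual limiter; the class name, parameters, and injected clock are assumptions made for illustration. Because each pod keeps its own bucket, the effective global limit scales with the number of pods, which is why rate limits become less precise while Redis is down.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical per-pod limiter used only while the shared Redis limiter is unavailable."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s          # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock              # injectable for tests
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would keep one bucket per key (for example per IP or per account) in a bounded map and reject the request when `allow()` returns False.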

1.4 OIDC IdP timeout / down

Aspect	Detail
Trigger	IdP outage; certificate expiry; metadata changed.
User impact	SSO login fails for affected tenant only.
Detection	`iam_sso_callbacks_total{result="error"}`; tenant-scoped alert.
Mitigation	Fall back to password / magic-link if tenant policy permits; surface specific error code `MELMASTOON.IAM.SSO.PROVIDER_UNAVAILABLE`; retry with backoff.
Recovery	IdP recovery.
Runbook	`runbooks/iam/sso-outage.md`
RTO	Tenant-dependent.
Data loss	None.

1.5 SAML metadata drift

Aspect	Detail
Trigger	IdP rotated signing cert without notice.
User impact	All SAML logins for tenant fail with assertion-invalid.
Detection	`iam_sso_callbacks_total{provider="saml",result="signature_invalid"}` spike.
Mitigation	Refresh metadata from IdP URL (if MetadataURL configured) every 6 h + on first failure.
Recovery	Auto-refresh succeeds, OR manual upload by tenant admin.
Runbook	`runbooks/iam/saml-drift.md`
RTO	Auto: ≤ 6 h; manual: minutes.
Data loss	None.

1.6 Breach-list (HIBP) provider down

Aspect	Detail
Trigger	HIBP outage; rate limit hit; API key invalid.
User impact	Registration / password change cannot verify breach status; the check fails open and the user is allowed through (config-controlled).
Detection	`IamBreachListProviderDown` alert.
Mitigation	Fail-open with audit event `iam.breach_list.skipped`; set `pwn_audit_due` flag for backfill.
Recovery	Backfill job re-checks affected passwords on next login.
Runbook	`runbooks/iam/hibp-down.md`
RTO	n/a (degraded mode permitted).
Data loss	None.
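
The fail-open path can be sketched as follows. This is an illustration under assumptions, not the service's code: `password_allowed`, `ProviderUnavailable`, and the dict/list stand-ins for the user record and audit sink are hypothetical. The audit event name and the `pwn_audit_due` flag come from the table above, and the two attempts mirror the "1 retry then fail-open" default in section 2.

```python
from typing import Callable, Dict, List


class ProviderUnavailable(Exception):
    """Hypothetical error raised when the breach-list provider cannot answer."""


def password_allowed(
    is_breached: Callable[[], bool],
    user_flags: Dict[str, bool],
    audit_log: List[str],
    fail_open: bool = True,
    attempts: int = 2,
) -> bool:
    """Return True if the password may be used; fail open when the provider is down."""
    for _ in range(attempts):
        try:
            return not is_breached()
        except ProviderUnavailable:
            continue  # one retry, then fall through to the degraded path
    if not fail_open:
        raise ProviderUnavailable("breach-list provider down")
    # Degraded mode: allow the password, but leave a trail for the backfill job.
    audit_log.append("iam.breach_list.skipped")
    user_flags["pwn_audit_due"] = True
    return True
```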

1.7 Email delivery (notification-service) down

Aspect	Detail
Trigger	Notification-service outage; SMTP provider issue.
User impact	Magic links, password reset, MFA backup messages delayed.
Detection	Notification SLO breach; iam-service emits events but dispatch lags.
Mitigation	Dispatch is async via events; iam-service returns 202 to the user; notification-service queues and retries.
Recovery	Notification recovers; queue drains.
Runbook	`runbooks/iam/email-delayed.md`
RTO	n/a.
Data loss	None.

1.8 MFA TOTP drift

Aspect	Detail
Trigger	Server clock drift; user device clock drift.
User impact	TOTP rejected even when user types correct code.
Detection	`iam_mfa_challenges_total{factor="totp",result="invalid"}` rate > 2 %; `IamMFATotpDrift` alert.
Mitigation	Accept ±1 step (30 s) by default; widen to ±2 with explicit ops approval; verify NTP healthy.
Recovery	Restore NTP; user resyncs device.
Runbook	`runbooks/iam/totp-drift.md`
RTO	Minutes.
Data loss	None.
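
To make the ±1 step window concrete, here is a minimal sketch of TOTP verification following RFC 6238. It assumes HMAC-SHA1 and 6-digit codes, which the table does not state; the function names are illustrative only.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP (HMAC-SHA1) for the time step containing unix_time."""
    counter = struct.pack(">Q", unix_time // step)
    mac = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation, RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, now: int,
                drift_steps: int = 1, step: int = 30) -> bool:
    """Accept codes from the current step and up to drift_steps steps either side."""
    for delta in range(-drift_steps, drift_steps + 1):
        t = now + delta * step
        if t < 0:
            continue
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(totp(secret, t, step), submitted):
            return True
    return False
```

Widening to ±2 corresponds to `drift_steps=2`. Each extra step enlarges the set of codes accepted at any moment, which is one reason the wider window needs explicit approval.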

1.9 Device certificate mass expiry

Aspect	Detail
Trigger	Tenant CA rotation without overlap; bulk-issued certs all expire near-simultaneously.
User impact	Offline desktops cannot refresh tokens; user must come online + re-auth.
Detection	`iam_device_offline_certs_expiring_soon` gauge spike; user complaints.
Mitigation	Always rotate CA with overlap; pre-emptive renewal triggered T-24h on Electron side; mass-renew batch tool for ops.
Recovery	Mass-renew job; user-facing email + in-app notification.
Runbook	`runbooks/iam/device-cert-expiry.md`
RTO	Hours.
Data loss	None (sessions revalidate online).

1.10 Refresh-token theft

Aspect	Detail
Trigger	XSS, malware, credential leak.
User impact	Brief unauthorized access until family revoke; legitimate user is also logged out (recovery flow).
Detection	`iam.session.rotation_reuse_detected`; `IamRotationReuseSpike`.
Mitigation	Family revoke on reuse; emit `melmastoon.iam.session.revoked.v1{reason='rotation_reuse'}`; force re-auth + adaptive MFA on next login; security alert.
Recovery	User re-auths with MFA.
Runbook	`runbooks/iam/token-theft.md`
RTO	Immediate (revoke is sync).
Data loss	None.
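
Family revoke on reuse can be illustrated with a small in-memory model. The names `RefreshTokenStore` and `ReuseDetected` are hypothetical, and a real store would persist token hashes rather than raw tokens. The sketch shows only the core rule: presenting an already-rotated token revokes every token descended from the same login.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Set


class ReuseDetected(Exception):
    """Raised when an already-rotated refresh token is presented again."""


@dataclass
class RefreshTokenStore:
    active: Dict[str, str] = field(default_factory=dict)    # token -> family id
    rotated: Dict[str, str] = field(default_factory=dict)   # spent token -> family id
    revoked_families: Set[str] = field(default_factory=set)

    def issue(self) -> str:
        """Start a new token family at login."""
        token = secrets.token_urlsafe(32)
        self.active[token] = secrets.token_hex(8)
        return token

    def rotate(self, token: str) -> str:
        """Exchange a valid token for a new one; detect replay of spent tokens."""
        if token in self.rotated:
            family = self.rotated[token]
            self._revoke_family(family)  # both attacker and owner lose access
            raise ReuseDetected(family)
        family = self.active.pop(token, None)
        if family is None or family in self.revoked_families:
            raise PermissionError("unknown or revoked refresh token")
        self.rotated[token] = family
        new_token = secrets.token_urlsafe(32)
        self.active[new_token] = family
        return new_token

    def _revoke_family(self, family: str) -> None:
        self.revoked_families.add(family)
        for tok in [t for t, f in self.active.items() if f == family]:
            del self.active[tok]
```

In the service, the `ReuseDetected` branch is where the `rotation_reuse` revocation event and the security alert would be emitted.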

1.11 Credential stuffing attack

Aspect	Detail
Trigger	Botnet hammering `/auth/login`.
User impact	Increased lockouts for genuine users sharing passwords; latency spikes.
Detection	`iam_login_failures_total{reason="invalid_credentials"}` spike; WAF challenge ratio.
Mitigation	Cloud Armor rate limit + bot challenges; per-account + per-IP throttle; force adaptive MFA promotion; CAPTCHA on bursts; auto-lock IP /24 on threshold.
Recovery	Attacker volume drops; lockouts auto-expire (15 min default).
Runbook	`runbooks/iam/credential-stuffing.md`
RTO	Continuous.
Data loss	None.

1.12 JWKS cache stampede

Aspect	Detail
Trigger	CDN cache flush + key rotation simultaneously.
User impact	Brief CDN miss spike → `iam-jwks` instances saturate.
Detection	`iam-jwks` request rate spike + p95 elevated.
Mitigation	`Cache-Control: public, max-age=300, stale-while-revalidate=3600`; consumer libs add jitter to refetch; min-replicas tuned for spike capacity.
Recovery	Cache repopulates; minutes.
Runbook	`runbooks/iam/jwks-stampede.md`
RTO	< 5 min.
Data loss	None.
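
The consumer-side jitter mentioned in the Mitigation row might be as simple as the sketch below. The function name and the 10 % jitter fraction are assumptions; only the 300 s `max-age` comes from the header above. Each consumer refetches at a random point in the last part of the freshness window, so a fleet of consumers does not refetch at the same instant.

```python
import random
from typing import Callable


def next_refetch_delay(
    max_age_s: float = 300.0,
    jitter_fraction: float = 0.1,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before refetching JWKS, spread over the final jitter_fraction of max-age."""
    # rng() is in [0, 1), so the result lies in (max_age * (1 - jitter_fraction), max_age].
    return max_age_s * (1.0 - jitter_fraction * rng())
```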

1.13 Outbox backlog

Aspect	Detail
Trigger	Pub/Sub regional issue; worker scaling delay.
User impact	Downstream services (audit, gdpr, notification) receive delayed events.
Detection	`iam_outbox_depth` > 100; `iam_outbox_lag_seconds` > 5; `IamOutboxStalled` alert.
Mitigation	Worker auto-scales on depth; circuit breaker on Pub/Sub errors; backpressure (do not block API writes — outbox is decoupled).
Recovery	Backlog drains as Pub/Sub recovers.
Runbook	`runbooks/iam/outbox-stalled.md`
RTO	< 30 min typical.
Data loss	None (outbox is durable).

1.14 DLQ events

Aspect	Detail
Trigger	Persistent consumer error; schema incompatibility.
User impact	Subset of event-driven flows (e.g. tenant.created provisioning) stalled for affected events.
Detection	`iam_dlq_depth > 0`; `IamDLQNonEmpty` alert.
Mitigation	Circuit breaker on consumer; alert on-call; manual triage; replay or discard via `pubsub-replay` tool with audit.
Recovery	Manual triage.
Runbook	`runbooks/iam/dlq-triage.md`
RTO	Variable.
Data loss	None (DLQ retained 14 d).

1.15 Magic-link replay

Aspect	Detail
Trigger	Attacker intercepts email link.
User impact	If first to consume → unauthorized access; legitimate user gets `MAGIC_LINK_USED`.
Detection	`iam.magic_link.replay_attempt` audit event when same hash seen post-consume.
Mitigation	Single-use; 10-min TTL; optional bind to issuing IP/UA; recommend MFA for staff; user notification "your link was used from {IP}".
Recovery	User requests new link; suspicious-activity alert.
Runbook	`runbooks/iam/magic-link-replay.md`
RTO	n/a.
Data loss	None.
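
A minimal sketch of the single-use and TTL rules, assuming tokens are stored as SHA-256 hashes (the table only says a hash is compared). `MagicLinkStore` and the `OK`, `INVALID`, and `EXPIRED` results are illustrative names; `MAGIC_LINK_USED` is the code from the table.

```python
import hashlib
import secrets
import time
from typing import Callable, Dict, Tuple

MAGIC_LINK_TTL_S = 600  # 10-minute TTL


class MagicLinkStore:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        # token hash -> (expires_at, consumed)
        self._links: Dict[str, Tuple[float, bool]] = {}

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self) -> str:
        """Create a token; only its hash is kept server-side."""
        token = secrets.token_urlsafe(32)
        self._links[self._hash(token)] = (self.clock() + MAGIC_LINK_TTL_S, False)
        return token

    def consume(self, token: str) -> str:
        key = self._hash(token)
        entry = self._links.get(key)
        if entry is None:
            return "INVALID"
        expires_at, consumed = entry
        if consumed:
            return "MAGIC_LINK_USED"  # replay: same hash seen after consume
        if self.clock() >= expires_at:
            return "EXPIRED"
        self._links[key] = (expires_at, True)
        return "OK"
```

The `MAGIC_LINK_USED` branch is where the replay-attempt audit event would be recorded. A shared store would also need the check-and-mark step to be atomic, which this single-process sketch does not address.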

1.16 Concurrent MFA enrollment

Aspect	Detail
Trigger	User enrolls TOTP from two browsers simultaneously.
User impact	One enrollment wins; the other's pending verification fails.
Detection	n/a (handled in domain).
Mitigation	Optimistic concurrency on `MFAFactor`; loser receives `409 conflict`.
Recovery	User retries.
Runbook	n/a.
RTO	n/a.
Data loss	None.
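
Optimistic concurrency on `MFAFactor` can be sketched with a version counter. The fields, `FactorRepo`, and `ConflictError` are hypothetical; against a SQL database the same check is usually an `UPDATE ... WHERE version = ?` that affects zero rows for the loser.

```python
from dataclasses import dataclass, replace
from typing import Dict


class ConflictError(Exception):
    """Maps to HTTP 409 at the API layer."""
    status_code = 409


@dataclass(frozen=True)
class MFAFactor:
    user_id: str
    totp_secret: str
    version: int = 0


class FactorRepo:
    def __init__(self) -> None:
        self._rows: Dict[str, MFAFactor] = {}

    def create(self, user_id: str) -> MFAFactor:
        factor = MFAFactor(user_id=user_id, totp_secret="")
        self._rows[user_id] = factor
        return factor

    def load(self, user_id: str) -> MFAFactor:
        return self._rows[user_id]

    def save(self, updated: MFAFactor) -> MFAFactor:
        """Write only if nobody else has saved since `updated` was loaded."""
        current = self._rows[updated.user_id]
        if current.version != updated.version:
            raise ConflictError("MFAFactor was modified concurrently")
        stored = replace(updated, version=updated.version + 1)
        self._rows[updated.user_id] = stored
        return stored
```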

1.17 Account-lockout DoS

Aspect	Detail
Trigger	Attacker repeatedly submits wrong passwords for a known account to lock out the user.
User impact	Legitimate user blocked.
Detection	Lockout rate spike per email.
Mitigation	IP-scoped lockout when IP reputation is unknown / bad; magic-link self-recovery; admin unlock; tenant policy (auto-unlock after 15 min).
Recovery	TTL or admin lift.
Runbook	`runbooks/iam/lockout-dos.md`
RTO	≤ 15 min default.
Data loss	None.

1.18 API-key leak

Aspect	Detail
Trigger	Key committed to repo; CI logs; misconfigured customer system.
User impact	Attacker calls APIs as the key holder until revoked.
Detection	Secret scanning (GitGuardian / GitHub); anomaly detection on API-key usage geography / rate.
Mitigation	Auto-revoke on detection; alert key owner; surface in-app banner; key rotation tool.
Recovery	Owner issues new key; updates integrations.
Runbook	`runbooks/iam/apikey-leak.md`
RTO	Minutes from detection.
Data loss	None.

1.19 GDPR erasure stuck

Aspect	Detail
Trigger	Saga participant down; transient DB error.
User impact	Erasure SLA (30 d) at risk.
Detection	`gdpr-service` SLO; iam-service emits `…erasure_failed.v1`.
Mitigation	Retry with backoff; on persistent failure escalate; idempotent design allows safe retry.
Recovery	Retry succeeds.
Runbook	`runbooks/iam/gdpr-stuck.md`
RTO	Days (within SLA).
Data loss	None.

1.20 Tenant CA compromise

Aspect	Detail
Trigger	Internal misuse / breach.
User impact	All offline desktop bindings for tenant must be re-issued.
Detection	Audit of CA usage; abnormal cert issuance.
Mitigation	Tenant CA private key in HSM (KMS, non-extractable); least-privilege IAM; audit-logged signing.
Recovery	Rotate CA; re-issue all certs; force online re-auth.
Runbook	`runbooks/iam/tenant-ca-compromise.md`
RTO	Hours.
Data loss	None (re-bind required).

1.21 Clock skew

Aspect	Detail
Trigger	Pod NTP out of sync.
User impact	JWT `nbf`/`exp` rejected by consumers; TOTP rejected.
Detection	Synthetic NTP probe; spike in JWT `nbf`/`exp` and TOTP validation failures.
Mitigation	Use Cloud Run managed NTP; periodic check; reject deploy if drift > 1 s.
Recovery	Pod restart.
Runbook	`runbooks/iam/clock-skew.md`
RTO	Minutes.
Data loss	None.

2. Retry / Backoff Defaults

Operation	Strategy
KMS sign	3 retries, exponential 100 ms / 300 ms / 900 ms, jittered
Postgres read	2 retries on transient (40001, 40P01)
Pub/Sub publish	5 retries, exponential to 30 s, then DLQ
HIBP API	1 retry then fail-open
OIDC IdP	2 retries with 200 ms / 800 ms backoff
Notification dispatch	event re-delivered by Pub/Sub (consumer side)
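
As an illustration of the KMS sign row, the sketch below retries three times with 100 ms / 300 ms / 900 ms delays. The table says only "jittered"; the ±20 % multiplicative jitter, the `TransientError` marker, and the function name are assumptions.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def retry_with_backoff(
    op: Callable[[], T],
    delays: Sequence[float] = (0.1, 0.3, 0.9),
    jitter: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op once, then retry once per entry in delays on TransientError."""
    for delay in delays:
        try:
            return op()
        except TransientError:
            # Spread retries by +/- jitter so callers do not retry in lockstep.
            sleep(delay * random.uniform(1 - jitter, 1 + jitter))
    return op()  # final attempt; any error propagates to the caller
```

Only errors known to be transient should be wrapped this way. Retrying a permanent failure, such as a revoked IAM permission, only delays the circuit breaker from opening.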

3. Circuit Breakers

Dependency	Threshold	Half-open	Open behavior
Cloud KMS	5 errors / 10 s	1 probe / 5 s	Reject login + return 503
Cloud SQL	10 errors / 10 s	1 probe / 5 s	Reject mutations; reads from cache where possible
Memorystore	10 errors / 10 s	1 probe / 5 s	DB fallback
HIBP	3 errors / 30 s	1 probe / 60 s	Fail-open
OIDC IdP	5 errors / 30 s per provider	1 probe / 60 s	Suggest password fallback
AI orchestrator	3 errors / 30 s	1 probe / 60 s	Rules-only fallback
Notification	n/a (async)	n/a	events queue
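
The three columns map onto a small state machine. The sketch below is one possible reading, with hypothetical names: the breaker opens after `max_errors` failures inside a sliding window, permits one probe per `probe_interval_s` while open, and closes again when a probe succeeds.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(self, max_errors: int, window_s: float, probe_interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_errors = max_errors
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self.clock = clock
        self.errors: Deque[float] = deque()       # timestamps of recent failures
        self.opened_at: Optional[float] = None    # None means closed
        self.last_probe_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the call may proceed: breaker closed, or a half-open probe is due."""
        if self.opened_at is None:
            return True
        now = self.clock()
        last = self.last_probe_at if self.last_probe_at is not None else self.opened_at
        if now - last >= self.probe_interval_s:
            self.last_probe_at = now  # at most one probe per interval
            return True
        return False

    def record_success(self) -> None:
        if self.opened_at is not None:  # a successful probe closes the breaker
            self.errors.clear()
            self.opened_at = None
            self.last_probe_at = None

    def record_failure(self) -> None:
        if self.opened_at is not None:
            return  # failed probe: stay open until the next probe slot
        now = self.clock()
        self.errors.append(now)
        while self.errors and now - self.errors[0] > self.window_s:
            self.errors.popleft()
        if len(self.errors) >= self.max_errors:
            self.opened_at = now
```

For the Cloud KMS row this would be `CircuitBreaker(5, 10.0, 5.0)`. When `allow()` returns False, the caller applies the "Open behavior" column, for example returning 503 for login.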

4. Fallback Decision Matrix

Scenario	Fallback
AI risk classifier down	Static rules, weighted heuristics
HIBP down	Skip breach check, audit-mark, backfill
Memorystore down	DB-backed lookups
Notification down	Event queues, retry
OIDC IdP down	Password / magic-link if tenant policy allows
SAML metadata stale	Auto-refresh; if still failing, surface clear error
Cloud KMS down	No fallback for sign — fail loud

5. Known Anti-Patterns (do NOT do)

❌ Sign JWTs in process if KMS down ("graceful degrade") — would expose root key.
❌ Mint long-lived access tokens to "compensate" for refresh failure — breaks auditability.
❌ Auto-merge two User rows on email collision — causes silent account takeover.
❌ Cache MFA challenge results across requests — enables replay.
❌ Allow platform_admin to bypass MFA — single-factor on the most powerful role is unacceptable.

6. Chaos Engineering Scenarios

Run weekly in staging via Toxiproxy + custom scripts. See TESTING_STRATEGY §10.

Scenario	Pass condition
KMS adds 2 s latency	Login p99 < 2.5 s; no cascading failure
Postgres failover	Recovery < 60 s; no data loss
Pub/Sub publish error rate 50 %	Outbox retries; no event loss
Memorystore down	Auth still works; metrics show fallback
OIDC IdP returns 500	Tenant SSO degrades cleanly; alert fires
Notification down 5 min	Magic links eventually deliver; user sees "may take a few minutes" hint
AI orchestrator down	Adaptive MFA falls back to rules; counter increments
Cloud Run zone failure	Auto-recover; canary login passes

7. Failure-mode → Runbook Cross-Reference

See OBSERVABILITY §8 (Runbook Index). Every failure mode listed here has a runbook entry, except where the Runbook row is n/a (1.16); a missing runbook blocks readiness (SERVICE_READINESS).
