iam-service — Failure Modes

Catalog · DEPLOYMENT_TOPOLOGY · OBSERVABILITY · SERVICE_RISK_REGISTER

A catalog of every known failure mode for iam-service: what breaks, what users experience, how we detect it, and how we mitigate it. Platform availability is bounded by iam-service availability, so treat every entry seriously.

1. Failure Catalog

1.1 Cloud KMS — signing key unavailable

| Aspect | Detail |
| --- | --- |
| Trigger | KMS regional outage; IAM revocation; misconfigured kid. |
| User impact | New logins / refreshes / SSO callbacks fail (cannot mint JWT). Existing access tokens remain valid until TTL (≤ 15 min). |
| Detection | iam_kms_op_duration_seconds errors; IamKMSUnavailable alert; canary login fails. |
| Mitigation | Open circuit breaker; serve 503 problem+json{type:"…/kms_unavailable"}; surface "auth degraded" status banner; fail over to standby KMS region (M2). |
| Recovery | Manual: confirm KMS health → close breaker → flush JWKS cache. |
| Runbook | runbooks/iam/kms-outage.md |
| RTO | < 30 min (M0); < 5 min (M2). |
| Data loss | None (no writes lost; requests are denied at the edge). |

1.2 Cloud SQL — primary failover

| Aspect | Detail |
| --- | --- |
| Trigger | Zone failure; planned maintenance; manual failover. |
| User impact | 30–60 s elevated 5xx; refresh tokens may transiently fail. |
| Detection | iam_db_pool_active drops to 0; readiness probe fails. |
| Mitigation | Cloud Run health probes drain instance; reconnect with exponential backoff; iam-worker pauses outbox publishing during cutover. |
| Recovery | Auto via Cloud SQL HA; verify with smoke test post-failover. |
| Runbook | runbooks/iam/cloudsql-failover.md |
| RTO | < 60 s. |
| Data loss | 0 (synchronous regional replication). |

1.3 Memorystore (Redis) — partition / down

| Aspect | Detail |
| --- | --- |
| Trigger | Network partition; node failure; eviction. |
| User impact | Slower auth (DB-backed lookups); rate limits less precise; magic links fail. |
| Detection | iam_redis_op_duration_seconds error spike; readiness degrades. |
| Mitigation | Fall back to DB for session lookup; rate limiting falls back to per-pod in-memory token bucket; magic-link writes return 503 (rare path); circuit breaker on hot reads. |
| Recovery | Redis recovery; cache warms organically. |
| Runbook | runbooks/iam/redis-down.md |
| RTO | < 5 min. |
| Data loss | Cache only (rebuilds); magic-link writes during the outage are rejected and the user retries. |
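
The per-pod rate-limit fallback named in the mitigation row could look roughly like the sketch below. It is a minimal in-memory token bucket, not the service's actual limiter; the class name, the `rate_per_s` and `burst` parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class TokenBucket:
    """In-memory token bucket; one instance per pod (hypothetical helper)."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_s      # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.clock = clock          # injectable so tests can control time
        self.tokens = burst         # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because each pod keeps its own bucket, the effective global limit scales with the number of pods, which is one reason rate limits are "less precise" while Redis is down.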

1.4 OIDC IdP timeout / down

| Aspect | Detail |
| --- | --- |
| Trigger | IdP outage; certificate expiry; metadata changed. |
| User impact | SSO login fails for affected tenant only. |
| Detection | iam_sso_callbacks_total{result="error"}; tenant-scoped alert. |
| Mitigation | Fall back to password / magic-link if tenant policy permits; surface specific error code MELMASTOON.IAM.SSO.PROVIDER_UNAVAILABLE; retry with backoff. |
| Recovery | IdP recovery. |
| Runbook | runbooks/iam/sso-outage.md |
| RTO | Tenant-dependent. |
| Data loss | None. |

1.5 SAML metadata drift

| Aspect | Detail |
| --- | --- |
| Trigger | IdP rotated signing cert without notice. |
| User impact | All SAML logins for tenant fail with assertion-invalid. |
| Detection | iam_sso_callbacks_total{provider="saml",result="signature_invalid"} spike. |
| Mitigation | Refresh metadata from IdP URL (if MetadataURL configured) every 6 h and on first failure. |
| Recovery | Auto-refresh succeeds, or manual upload by tenant admin. |
| Runbook | runbooks/iam/saml-drift.md |
| RTO | Auto: ≤ 6 h; manual: minutes. |
| Data loss | None. |

1.6 Breach-list (HIBP) provider down

| Aspect | Detail |
| --- | --- |
| Trigger | HIBP outage; rate limit hit; API key invalid. |
| User impact | Registration / password change cannot verify breach status; the check fails open and the user is allowed through (config-controlled). |
| Detection | IamBreachListProviderDown alert. |
| Mitigation | Fail open with audit event iam.breach_list.skipped; set pwn_audit_due flag for backfill. |
| Recovery | Backfill job re-checks affected passwords on next login. |
| Runbook | runbooks/iam/hibp-down.md |
| RTO | n/a (degraded mode permitted). |
| Data loss | None. |
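
As an illustration of the fail-open path, the sketch below wraps a breach lookup so that a provider error records the audit event and marks the account for backfill. Only the event name `iam.breach_list.skipped` and the `pwn_audit_due` flag come from the table; `BreachProviderError`, `BreachDecision`, `check_breach`, and the `lookup` / `audit` callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


class BreachProviderError(Exception):
    """Hypothetical error raised when the breach-list provider cannot answer."""


@dataclass
class BreachDecision:
    allowed: bool          # may the password be accepted?
    pwn_audit_due: bool    # must the backfill job re-check it later?


def check_breach(
    password_hash: str,
    lookup: Callable[[str], bool],     # returns True if the hash is breached
    audit: Callable[[str], None],      # stand-in for the audit event emitter
    fail_open: bool = True,            # the "config-controlled" switch
) -> BreachDecision:
    try:
        breached = lookup(password_hash)
    except BreachProviderError:
        if not fail_open:
            raise
        # Provider is down: let the user through, but leave a trail.
        audit("iam.breach_list.skipped")
        return BreachDecision(allowed=True, pwn_audit_due=True)
    return BreachDecision(allowed=not breached, pwn_audit_due=False)
```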

1.7 Email delivery (notification-service) down

| Aspect | Detail |
| --- | --- |
| Trigger | Notification-service outage; SMTP provider issue. |
| User impact | Magic links, password reset, MFA backup messages delayed. |
| Detection | Notification SLO breach; iam emits but dispatch lags. |
| Mitigation | Dispatch is async via events; iam returns 202 to user; notification queues and retries. |
| Recovery | Notification recovers; queue drains. |
| Runbook | runbooks/iam/email-delayed.md |
| RTO | n/a. |
| Data loss | None. |

1.8 MFA TOTP drift

| Aspect | Detail |
| --- | --- |
| Trigger | Server clock drift; user device clock drift. |
| User impact | TOTP rejected even when user types correct code. |
| Detection | iam_mfa_challenges_total{factor="totp",result="invalid"} rate > 2 %; IamMFATotpDrift alert. |
| Mitigation | Accept ±1 step (30 s) by default; widen to ±2 with explicit ops approval; verify NTP healthy. |
| Recovery | Restore NTP; user resyncs device. |
| Runbook | runbooks/iam/totp-drift.md |
| RTO | Minutes. |
| Data loss | None. |
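
To make the ±1 step tolerance concrete, here is a minimal sketch of windowed TOTP verification following RFC 6238 (HMAC-SHA1, 30 s steps, 6 digits). The function names and the `window` parameter are illustrative assumptions, and the service's real verifier may differ (for example in digest algorithm or replay tracking).

```python
import hashlib
import hmac
import struct

STEP_SECONDS = 30  # step size from the mitigation row


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP value for one counter (HMAC-SHA1, dynamic truncation)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, now: float, window: int = 1) -> bool:
    """Accept a code from the current step or up to `window` steps either side."""
    current = int(now // STEP_SECONDS)
    for delta in range(-window, window + 1):
        counter = current + delta
        if counter >= 0 and hmac.compare_digest(hotp(secret, counter), submitted):
            return True
    return False
```

With `window=1` a code stays acceptable for roughly 90 s in total; widening to `window=2` extends that to about 150 s, which is why the table gates it behind ops approval.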

1.9 Device certificate mass expiry

| Aspect | Detail |
| --- | --- |
| Trigger | Tenant CA rotation without overlap; bulk-issued certs all expire near-simultaneously. |
| User impact | Offline desktops cannot refresh tokens; user must come online + re-auth. |
| Detection | iam_device_offline_certs_expiring_soon gauge spike; user complaints. |
| Mitigation | Always rotate CA with overlap; pre-emptive renewal triggered T-24h on Electron side; mass-renew batch tool for ops. |
| Recovery | Mass-renew job; user-facing email + in-app notification. |
| Runbook | runbooks/iam/device-cert-expiry.md |
| RTO | Hours. |
| Data loss | None (sessions revalidate online). |

1.10 Refresh-token theft

| Aspect | Detail |
| --- | --- |
| Trigger | XSS, malware, credential leak. |
| User impact | Brief unauthorized access until family revoke; legitimate user is also logged out (recovery flow). |
| Detection | iam.session.rotation_reuse_detected; IamRotationReuseSpike. |
| Mitigation | Family revoke on reuse; emit melmastoon.iam.session.revoked.v1{reason='rotation_reuse'}; force re-auth + adaptive MFA on next login; security alert. |
| Recovery | User re-auths with MFA. |
| Runbook | runbooks/iam/token-theft.md |
| RTO | Immediate (revoke is sync). |
| Data loss | None. |
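
The "family revoke on reuse" rule can be sketched as follows. This is an in-memory illustration under simplifying assumptions (raw tokens as dictionary keys, no persistence, no event emission); the class and exception names are hypothetical, and a production store would typically keep only token hashes.

```python
import secrets
from dataclasses import dataclass


class ReuseDetected(Exception):
    """An already-rotated refresh token was presented again."""


class FamilyRevoked(Exception):
    """The whole token family is revoked; the user must re-authenticate."""


@dataclass
class Family:
    current: str           # the only token allowed to rotate next
    revoked: bool = False


class RefreshTokens:
    def __init__(self):
        self.families = {}   # family id -> Family
        self.family_of = {}  # token -> family id (kept after rotation)

    def start_family(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self.families[family_id] = Family(current=token)
        self.family_of[token] = family_id
        return token

    def rotate(self, token: str) -> str:
        family_id = self.family_of[token]  # KeyError for unknown tokens
        family = self.families[family_id]
        if family.revoked:
            raise FamilyRevoked()
        if token != family.current:
            # A superseded token came back: assume theft, revoke everything.
            family.revoked = True
            raise ReuseDetected()
        new_token = secrets.token_urlsafe(32)
        family.current = new_token
        self.family_of[new_token] = family_id
        return new_token
```

Note that the legitimate holder of the newest token is locked out as well once reuse is detected, which matches the "legitimate user is also logged out" impact in the table.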

1.11 Credential stuffing attack

| Aspect | Detail |
| --- | --- |
| Trigger | Botnet hammering /auth/login. |
| User impact | Increased lockouts for genuine users sharing passwords; latency spikes. |
| Detection | iam_login_failures_total{reason="invalid_credentials"} spike; WAF challenge ratio. |
| Mitigation | Cloud Armor rate limit + bot challenges; per-account + per-IP throttle; force adaptive MFA promotion; CAPTCHA on bursts; auto-lock IP /24 on threshold. |
| Recovery | Attacker volume drops; lockouts auto-expire (15 min default). |
| Runbook | runbooks/iam/credential-stuffing.md |
| RTO | Continuous. |
| Data loss | None. |

1.12 JWKS cache stampede

| Aspect | Detail |
| --- | --- |
| Trigger | CDN cache flush + key rotation simultaneously. |
| User impact | Brief CDN miss spike → iam-jwks instances saturate. |
| Detection | iam-jwks request rate spike + p95 elevated. |
| Mitigation | Cache-Control: public, max-age=300, stale-while-revalidate=3600; consumer libs add jitter to refetch; min-replicas tuned for spike capacity. |
| Recovery | Cache repopulates; minutes. |
| Runbook | runbooks/iam/jwks-stampede.md |
| RTO | < 5 min. |
| Data loss | None. |
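
One way a consumer library might add jitter to its JWKS refetch is to refresh at a random point shortly before `max-age` elapses, so that many consumers do not all refetch in the same instant. The function below is a sketch only; its name, the `spread` fraction, and the default of 20 % are assumptions rather than the behaviour of any particular client library.

```python
import random


def next_refetch_in(max_age_s: float = 300.0, spread: float = 0.2, rng=random.random) -> float:
    """Seconds to wait before refetching the JWKS.

    Picks a random point in the last `spread` fraction of the max-age
    window, i.e. between max_age_s * (1 - spread) and max_age_s.
    """
    return max_age_s * (1.0 - spread * rng())
```

With the defaults, each consumer refetches somewhere between 240 s and 300 s after its last successful fetch.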

1.13 Outbox backlog

| Aspect | Detail |
| --- | --- |
| Trigger | Pub/Sub regional issue; worker scaling delay. |
| User impact | Downstream services (audit, gdpr, notification) receive delayed events. |
| Detection | iam_outbox_depth > 100; iam_outbox_lag_seconds > 5; IamOutboxStalled alert. |
| Mitigation | Worker auto-scales on depth; circuit breaker on Pub/Sub errors; backpressure without blocking API writes (the outbox is decoupled). |
| Recovery | Backlog drains as Pub/Sub recovers. |
| Runbook | runbooks/iam/outbox-stalled.md |
| RTO | < 30 min typical. |
| Data loss | None (outbox is durable). |

1.14 DLQ events

| Aspect | Detail |
| --- | --- |
| Trigger | Persistent consumer error; schema incompatibility. |
| User impact | Subset of event-driven flows (e.g. tenant.created provisioning) stalled for affected events. |
| Detection | iam_dlq_depth > 0; IamDLQNonEmpty alert. |
| Mitigation | Circuit breaker on consumer; alert on-call; manual triage; replay or discard via pubsub-replay tool with audit. |
| Recovery | Manual triage. |
| Runbook | runbooks/iam/dlq-triage.md |
| RTO | Variable. |
| Data loss | None (DLQ retained 14 d). |

1.15 Magic-link replay

| Aspect | Detail |
| --- | --- |
| Trigger | Attacker intercepts email link. |
| User impact | If the attacker consumes the link first: unauthorized access; legitimate user gets MAGIC_LINK_USED. |
| Detection | iam.magic_link.replay_attempt audit event when same hash seen post-consume. |
| Mitigation | Single-use; 10-min TTL; optional bind to issuing IP/UA; recommend MFA for staff; user notification "your link was used from {IP}". |
| Recovery | User requests new link; suspicious-activity alert. |
| Runbook | runbooks/iam/magic-link-replay.md |
| RTO | n/a. |
| Data loss | None. |
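
The single-use and TTL rules can be illustrated with a small in-memory sketch. Only MAGIC_LINK_USED, the 10-minute TTL, and the iam.magic_link.replay_attempt event come from the table; the class, the other two result codes, and the in-process storage are hypothetical stand-ins (the catalog indicates the real writes go to Redis).

```python
import hashlib
import secrets
import time

TTL_SECONDS = 600  # 10-minute TTL from the mitigation row


def _digest(token: str) -> str:
    # Store only a hash so a leaked store does not leak usable links.
    return hashlib.sha256(token.encode()).hexdigest()


class MagicLinks:
    def __init__(self, clock=time.time):
        self.clock = clock
        self.pending = {}      # token hash -> expiry timestamp
        self.consumed = set()  # hashes already redeemed, kept to spot replays
        self.audit = []        # stand-in for the audit event stream

    def issue(self) -> str:
        token = secrets.token_urlsafe(32)
        self.pending[_digest(token)] = self.clock() + TTL_SECONDS
        return token

    def consume(self, token: str) -> str:
        h = _digest(token)
        if h in self.consumed:
            self.audit.append("iam.magic_link.replay_attempt")
            return "MAGIC_LINK_USED"
        expiry = self.pending.pop(h, None)
        if expiry is None:
            return "MAGIC_LINK_INVALID"   # hypothetical code
        if self.clock() > expiry:
            return "MAGIC_LINK_EXPIRED"   # hypothetical code
        self.consumed.add(h)
        return "OK"
```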

1.16 Concurrent MFA enrollment

| Aspect | Detail |
| --- | --- |
| Trigger | User enrolls TOTP from two browsers simultaneously. |
| User impact | One enrollment wins; the other's pending verification fails. |
| Detection | n/a (handled in domain). |
| Mitigation | Optimistic concurrency on MFAFactor; loser receives 409 conflict. |
| Recovery | User retries. |
| Runbook | n/a. |
| RTO | n/a. |
| Data loss | None. |
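
Optimistic concurrency here usually means a version check at write time. The sketch below shows the idea with an in-memory store; the `MFAFactorStore` and `Conflict` names and the version-counter scheme are assumptions, and in a SQL-backed implementation the same check is commonly expressed as an UPDATE guarded by the expected version.

```python
class Conflict(Exception):
    """Raised when the row changed since it was read; maps to HTTP 409."""
    status = 409


class MFAFactorStore:
    def __init__(self):
        self._rows = {}  # user_id -> (version, state)

    def load(self, user_id: str):
        """Return (version, state); unknown users start at version 0."""
        return self._rows.get(user_id, (0, "none"))

    def save(self, user_id: str, expected_version: int, state: str) -> int:
        """Write only if nobody else wrote since `expected_version` was read."""
        version, _ = self.load(user_id)
        if version != expected_version:
            raise Conflict(f"MFAFactor for {user_id} changed concurrently")
        self._rows[user_id] = (version + 1, state)
        return version + 1
```

Two browsers that both read version 0 will race on `save`; the first succeeds and bumps the version, and the second raises `Conflict`.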

1.17 Account-lockout DoS

| Aspect | Detail |
| --- | --- |
| Trigger | Attacker repeatedly submits wrong passwords for a known account to lock the user out. |
| User impact | Legitimate user blocked. |
| Detection | Lockout rate spike per email. |
| Mitigation | IP-scoped lockout when IP reputation is unknown / bad; magic-link self-recovery; admin unlock; tenant policy (auto-unlock after 15 min). |
| Recovery | TTL or admin lift. |
| Runbook | runbooks/iam/lockout-dos.md |
| RTO | ≤ 15 min default. |
| Data loss | None. |

1.18 API-key leak

| Aspect | Detail |
| --- | --- |
| Trigger | Key committed to repo; CI logs; misconfigured customer system. |
| User impact | Attacker calls APIs as the key holder until revoked. |
| Detection | Secret scanning (GitGuardian / GitHub); anomaly detection on API-key usage geography / rate. |
| Mitigation | Auto-revoke on detection; alert key owner; surface in-app banner; key rotation tool. |
| Recovery | Owner issues new key; updates integrations. |
| Runbook | runbooks/iam/apikey-leak.md |
| RTO | Minutes from detection. |
| Data loss | None. |

1.19 GDPR erasure stuck

| Aspect | Detail |
| --- | --- |
| Trigger | Saga participant down; transient DB error. |
| User impact | Erasure SLA (30 d) at risk. |
| Detection | gdpr-service SLO; iam-service emits …erasure_failed.v1. |
| Mitigation | Retry with backoff; on persistent failure escalate; idempotent design allows safe retry. |
| Recovery | Retry succeeds. |
| Runbook | runbooks/iam/gdpr-stuck.md |
| RTO | Days (within SLA). |
| Data loss | None. |

1.20 Tenant CA compromise

| Aspect | Detail |
| --- | --- |
| Trigger | Internal misuse / breach. |
| User impact | All offline desktop bindings for tenant must be re-issued. |
| Detection | Audit of CA usage; abnormal cert issuance. |
| Mitigation | Tenant CA private key in HSM (KMS, non-extractable); least-privilege IAM; audit-logged signing. |
| Recovery | Rotate CA; re-issue all certs; force online re-auth. |
| Runbook | runbooks/iam/tenant-ca-compromise.md |
| RTO | Hours. |
| Data loss | None (re-bind required). |

1.21 Clock skew

| Aspect | Detail |
| --- | --- |
| Trigger | Pod NTP out of sync. |
| User impact | JWT nbf/exp rejected by consumers; TOTP rejected. |
| Detection | Synthetic NTP probe; failure spike. |
| Mitigation | Use Cloud Run managed NTP; periodic check; reject deploy if drift > 1 s. |
| Recovery | Pod restart. |
| Runbook | runbooks/iam/clock-skew.md |
| RTO | Minutes. |
| Data loss | None. |

2. Retry / Backoff Defaults

| Operation | Strategy |
| --- | --- |
| KMS sign | 3 retries, exponential 100 ms / 300 ms / 900 ms, jittered |
| Postgres read | 2 retries on transient errors (SQLSTATE 40001, 40P01) |
| Pub/Sub publish | 5 retries, exponential to 30 s, then DLQ |
| HIBP API | 1 retry, then fail open |
| OIDC IdP | 2 retries with 200 ms / 800 ms backoff |
| Notification dispatch | Event re-delivered by Pub/Sub (consumer side) |
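
The KMS row (100 ms / 300 ms / 900 ms, jittered) corresponds to a base of 100 ms and a factor of 3. A minimal sketch of such a retry helper follows; the function names, the choice to jitter within the upper half of each delay, and the `is_transient` predicate are assumptions made for illustration, not the service's actual retry policy code.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, factor: float, retries: int,
                   cap: float = float("inf"), rng=random.random) -> Iterator[float]:
    """Yield one delay per retry, jittered uniformly within [0.5 * d, d],
    where d = min(cap, base * factor ** attempt)."""
    for attempt in range(retries):
        ceiling = min(cap, base * factor ** attempt)
        yield ceiling * (0.5 + 0.5 * rng())


def call_with_retry(op: Callable[[], T],
                    is_transient: Callable[[Exception], bool],
                    delays: Iterable[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `op`, retrying transient failures until the delays run out."""
    remaining = iter(delays)
    while True:
        try:
            return op()
        except Exception as exc:
            if not is_transient(exc):
                raise
            delay = next(remaining, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

For example, `call_with_retry(sign, is_timeout, backoff_delays(0.1, 3, 3))` would attempt `sign` up to four times (one initial call plus three retries), where `sign` and `is_timeout` are placeholders for the caller's own functions.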

3. Circuit Breakers

| Dependency | Threshold | Half-open | Open behavior |
| --- | --- | --- | --- |
| Cloud KMS | 5 errors / 10 s | 1 probe / 5 s | Reject login + return 503 |
| Cloud SQL | 10 errors / 10 s | 1 probe / 5 s | Reject mutations; reads from cache where possible |
| Memorystore | 10 errors / 10 s | 1 probe / 5 s | DB fallback |
| HIBP | 3 errors / 30 s | 1 probe / 60 s | Fail-open |
| OIDC IdP | 5 errors / 30 s per provider | 1 probe / 60 s | Suggest password fallback |
| AI orchestrator | 3 errors / 30 s | 1 probe / 60 s | Rules-only fallback |
| Notification | n/a (async) | n/a | Events queue |
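
A breaker with these parameters can be sketched as below. The sketch reads "5 errors / 10 s" as five failures inside a sliding 10 s window and "1 probe / 5 s" as admitting one trial call every 5 s while open; both readings, and the class and method names, are assumptions rather than the service's documented semantics.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, max_errors: int, window_s: float,
                 probe_interval_s: float, clock=time.monotonic):
        self.max_errors = max_errors
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self.clock = clock
        self.errors = deque()     # timestamps of recent failures
        self.opened = False
        self.next_probe_at = 0.0

    def allow(self) -> bool:
        """True if the caller may attempt the dependency call right now."""
        if not self.opened:
            return True
        now = self.clock()
        if now >= self.next_probe_at:
            # Half-open: admit a single probe, then wait another interval.
            self.next_probe_at = now + self.probe_interval_s
            return True
        return False

    def record_success(self) -> None:
        """A successful probe closes the breaker and forgets old errors."""
        if self.opened:
            self.opened = False
            self.errors.clear()

    def record_failure(self) -> None:
        now = self.clock()
        self.errors.append(now)
        # Drop failures that fell out of the sliding window.
        while self.errors and now - self.errors[0] > self.window_s:
            self.errors.popleft()
        if self.opened or len(self.errors) >= self.max_errors:
            self.opened = True
            self.next_probe_at = now + self.probe_interval_s
```

For the Cloud KMS row this would be constructed as `CircuitBreaker(5, 10.0, 5.0)`; when `allow()` returns False the caller applies the "Open behavior" column, for example rejecting the login with a 503.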

4. Fallback Decision Matrix

| Scenario | Fallback |
| --- | --- |
| AI risk classifier down | Static rules, weighted heuristics |
| HIBP down | Skip breach check, audit-mark, backfill |
| Memorystore down | DB-backed lookups |
| Notification down | Event queues, retry |
| OIDC IdP down | Password / magic-link if tenant policy allows |
| SAML metadata stale | Auto-refresh; if still failing, surface clear error |
| Cloud KMS down | No fallback for signing; fail loudly |

5. Known Anti-Patterns (do NOT do)

  • ❌ Sign JWTs in process if KMS is down ("graceful degrade"): this would expose the root key.
  • ❌ Mint long-lived access tokens to "compensate" for refresh failure: this breaks auditability.
  • ❌ Auto-merge two User rows on email collision: this causes silent account takeover.
  • ❌ Cache MFA challenge results across requests: this enables replay.
  • ❌ Allow platform_admin to bypass MFA: single-factor auth on the most powerful role is unacceptable.

6. Chaos Engineering Scenarios

Run weekly in staging via Toxiproxy + custom scripts. See TESTING_STRATEGY §10.

| Scenario | Pass condition |
| --- | --- |
| KMS adds 2 s latency | Login p99 < 2.5 s; no cascading failure |
| Postgres failover | Recovery < 60 s; no data loss |
| Pub/Sub publish error rate 50 % | Outbox retries; no event loss |
| Memorystore down | Auth still works; metrics show fallback |
| OIDC IdP returns 500 | Tenant SSO degrades cleanly; alert fires |
| Notification down 5 min | Magic links eventually deliver; user sees "may take a few minutes" hint |
| AI orchestrator down | Adaptive MFA falls back to rules; counter increments |
| Cloud Run zone failure | Auto-recover; canary login passes |

7. Failure-mode → Runbook Cross-Reference

See OBSERVABILITY §8 (Runbook Index). Every failure mode listed here has a runbook entry; missing a runbook blocks readiness (SERVICE_READINESS).