Identity Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · OBSERVABILITY · SERVICE_RISK_REGISTER

1. Failure catalog

| ID | Component | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|---|
| FM-IDENT-01 | PostgreSQL (primary) | Primary unavailable | All logins fail; sessions cannot be issued or revoked | Health probe fails; identity_db_errors_total spikes | Automatic failover to streaming replica (RTO ≤ 5 min); PgBouncer retries; alert on-call |
| FM-IDENT-02 | PostgreSQL (replica) | Replica lag / unavailable | Read-heavy endpoints degrade (effective license cache miss) | Replica lag metric > 30 s | Fall back to primary for reads; alert if lag > 60 s |
| FM-IDENT-03 | Redis | Redis cluster unavailable | Session revocation cache misses; rate-limit counters lost | Redis connection errors; cache hit rate drops to 0 | Fail-open for reads (serve stale cache); fall back to DB for revocation; alert SRE |
| FM-IDENT-04 | NATS JetStream | Outbox relay cannot publish | Domain events not delivered; downstream services miss user.registered etc. | Outbox table rows accumulate; outbox_unpublished_age_s alert | Transactional outbox retries; alert at 60 s unpublished age; manual replay |
| FM-IDENT-05 | AWS KMS | KMS unreachable | JWT signing fails; new logins cannot complete | KMS error rate metric; healthcheck fails | Cache signed JWTs in memory (15 min TTL); alert immediately; circuit breaker opens after 5 failures |
| FM-IDENT-06 | Keycloak | Broker unavailable | OIDC/SAML federated logins fail; in-house logins unaffected | identity_federation_errors_total; IDENT_FEDERATION_UNAVAILABLE 503 | Circuit breaker per provider (half-open retry 30 s); fallback error page with retry guidance |
| FM-IDENT-07 | Session - refresh replay | Stolen refresh token reused | Legitimate user logged out; security incident fired | IDENT_REFRESH_REPLAY 401 logged; identity_security_incidents_total | Issue SessionRevoked; notify user via communication-service; require re-authentication |
| FM-IDENT-08 | Argon2id - slow hashing | Login latency spike on large batch | Degraded login UX (> 300 ms p99) | p99 login latency alert | Rate-limit login endpoints; horizontal scale; backpressure queue |
| FM-IDENT-09 | JWKS rotation | Cached JWKS stale across all services | 401 errors on all services when old key expires before consumers refresh | identity_jwks_rotation_mismatch_total | 90-day rotation with 7-day overlap; downstream services cache JWKS with max-age=3600; publish identity.jwks.rotated.v1 event |
| FM-IDENT-10 | License resolver | Hierarchy ancestor walk fails | Effective license resolution returns empty set; UI module gates fail | identity_license_resolver_errors_total; 5xx from /licensing/nodes/:id/effective | Return last-cached result (5 min TTL); log warning; alert if error persists > 2 min |
| FM-IDENT-11 | Tenant suspension race | Sessions not revoked within JWT TTL (15 min) | Suspended tenant users briefly retain access | identity.user.suspended.v1 consumed; Redis session revocation set | Proactive revocation on suspension event; Redis TTL ensures maximum 15-min window; alert if event lag > 30 s |
| FM-IDENT-12 | OOM / crash loop | Pod restart | Brief traffic disruption (load balanced away) | Kubernetes restart count metric; CrashLoopBackOff alert | PDB ensures minAvailable=2; graceful shutdown drains in-flight requests; node limit prevents OOM |
| FM-IDENT-13 | External IdP misconfiguration | JIT provisioning creates duplicate users | Data integrity issue; user cannot link accounts | IDENT_EXT_IDENTITY_MISMATCH 409; duplicate detection | Idempotent JIT logic keyed on (issuer, subject); admin alert on mismatch |
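
The circuit-breaker mitigations for FM-IDENT-05 and FM-IDENT-06 can be illustrated with a minimal sketch. It assumes a single in-process breaker that opens after 5 consecutive failures and allows one probe call after 30 s; both numbers come from the catalog, but combining them in one breaker is an assumption. The class and method names (CircuitBreaker, CircuitOpenError, call) are hypothetical and do not describe the service's actual code.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical per-dependency breaker (one instance per provider)."""

    def __init__(
        self,
        failure_threshold: int = 5,      # "opens after 5 failures" (FM-IDENT-05)
        retry_after_s: float = 30.0,     # "half-open retry 30 s" (FM-IDENT-06)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.retry_after_s = retry_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.retry_after_s:
            return "half-open"  # one probe call is allowed through
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open probe, or reaching the threshold, (re)opens the circuit.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each dependency call, for example `breaker.call(sign_jwt, claims)` where `sign_jwt` is a placeholder, and map CircuitOpenError to the user-facing 503 path. The injectable clock exists only to make the timing testable.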

2. Dependency failure impact matrix

| Dependency | Degraded mode | Logins affected | Remediation |
|---|---|---|---|
| PostgreSQL unavailable | Full outage | All logins fail | DB failover; on-call |
| Redis unavailable | Partial degradation | Rate limits loose; session revocation delayed | Fail-open; alert |
| KMS unavailable | Full outage (new logins) | New sessions cannot be signed | KMS HA; 5-replica circuit breaker |
| NATS unavailable | Events queued | No immediate user impact | Outbox relay; alert |
| Keycloak unavailable | Federated logins only | In-house users unaffected | Circuit breaker; user-facing error |
| tenant-service unavailable | Access context degraded | Login succeeds; /me/access-context returns partial data | Cache last-known; log warning |
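
The "cache last-known; log warning" remediation (also used for the license resolver in FM-IDENT-10, with a 5 min TTL) might look roughly like the sketch below. It assumes a simple in-process cache keyed by an arbitrary hashable key. LastKnownCache, its fetch callback, and the logger name are hypothetical illustrations, not the service's real interfaces.

```python
import logging
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

T = TypeVar("T")
log = logging.getLogger("identity.fallback")  # hypothetical logger name


class LastKnownCache(Generic[T]):
    """Serve the last successful result when the upstream call fails."""

    def __init__(
        self,
        fetch: Callable[[Hashable], T],
        ttl_s: float = 300.0,  # 5 min TTL, as stated for FM-IDENT-10
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, T]] = {}

    def get(self, key: Hashable) -> T:
        try:
            value = self._fetch(key)
        except Exception as exc:
            entry = self._entries.get(key)
            # Fall back only while the cached value is still within its TTL.
            if entry is not None and self._clock() - entry[0] <= self._ttl_s:
                log.warning("serving last-known value for %r after error: %s", key, exc)
                return entry[1]
            raise
        # Record the fresh value with the time it was fetched.
        self._entries[key] = (self._clock(), value)
        return value
```

Once the TTL has elapsed the original error propagates, so callers are not handed arbitrarily stale data; that matches the bounded-staleness intent of the 5 min TTL, though the exact expiry behavior here is an assumption.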

3. Runbooks

| Runbook | Trigger |
|---|---|
| runbooks/identity/db-failover.md | FM-IDENT-01: PostgreSQL primary down |
| runbooks/identity/kms-outage.md | FM-IDENT-05: KMS unreachable |
| runbooks/identity/federation-outage.md | FM-IDENT-06: Keycloak / IdP circuit open |
| runbooks/identity/refresh-token-replay.md | FM-IDENT-07: Security incident |
| runbooks/identity/jwks-rotation.md | FM-IDENT-09: JWKS rotation procedure |
| runbooks/identity/license-resolver-degraded.md | FM-IDENT-10: License resolution errors |
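
As background for the refresh-token-replay runbook (FM-IDENT-07), the sketch below shows one common way replay is detected: refresh tokens are single-use, and presenting an already-used token revokes the whole session. It is an in-memory illustration under that assumption. SessionStore, RefreshReplayError, and the method names are hypothetical; only the IDENT_REFRESH_REPLAY code comes from the catalog.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Set


class RefreshReplayError(Exception):
    """A previously used refresh token was presented again."""

    code = "IDENT_REFRESH_REPLAY"  # error code named in the failure catalog


@dataclass
class SessionStore:
    """Hypothetical in-memory store; a real service would persist this state."""

    _token_session: Dict[str, str] = field(default_factory=dict)  # token -> session id
    _used: Set[str] = field(default_factory=set)
    _revoked: Set[str] = field(default_factory=set)

    def issue(self, session_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._token_session[token] = session_id
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; each token works once."""
        session_id = self._token_session.get(token)
        if session_id is None or session_id in self._revoked:
            raise PermissionError("unknown or revoked refresh token")
        if token in self._used:
            # Reuse means the token leaked: revoke the session for everyone.
            self._revoked.add(session_id)
            raise RefreshReplayError(session_id)
        self._used.add(token)
        return self.issue(session_id)

    def is_revoked(self, session_id: str) -> bool:
        return session_id in self._revoked
```

Revoking the whole session on reuse is why the legitimate user is logged out as well: the store cannot tell which party holds the stolen copy, so both must re-authenticate.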