| FM-IDENT-01 | PostgreSQL (primary) | Primary unavailable | All logins fail; sessions cannot be issued or revoked | Health probe fails; identity_db_errors_total spikes | Automatic failover to streaming replica (RTO ≤ 5 min); PgBouncer retries; alert on-call |
| FM-IDENT-02 | PostgreSQL (replica) | Replica lag / unavailable | Read-heavy endpoints degrade (effective license cache miss) | Replica lag metric > 30 s | Fall back to primary for reads; alert if lag > 60 s |
| FM-IDENT-03 | Redis | Redis cluster unavailable | Session revocation cache misses; rate-limit counters lost | Redis connection errors; cache hit rate drops to 0 | Fail open for reads (serve stale cached data where available); fall back to DB for revocation checks; alert SRE |
| FM-IDENT-04 | NATS JetStream | Outbox relay cannot publish | Domain events delayed (retained in the outbox, not lost); downstream services do not receive user.registered and other events until the relay recovers | Outbox table rows accumulate; outbox_unpublished_age_s alert | Relay retries unpublished outbox rows; alert when unpublished age exceeds 60 s; manual replay if needed |
| FM-IDENT-05 | AWS KMS | KMS unreachable | JWT signing fails; new logins cannot complete | KMS error rate metric; healthcheck fails | Cache signed JWTs in memory (15 min TTL); alert immediately; circuit breaker opens after 5 failures |
| FM-IDENT-06 | Keycloak | Broker unavailable | OIDC/SAML federated logins fail; in-house logins unaffected | identity_federation_errors_total; IDENT_FEDERATION_UNAVAILABLE 503 | Circuit breaker per provider (half-open trial after 30 s); fallback error page with retry guidance |
| FM-IDENT-07 | Session (refresh-token replay) | Stolen refresh token reused | Legitimate user logged out; security incident raised | IDENT_REFRESH_REPLAY 401 logged; identity_security_incidents_total | Revoke the session and emit SessionRevoked; notify user via communication-service; require re-authentication |
| FM-IDENT-08 | Argon2id (slow hashing) | Login latency spikes under a large burst of logins | Degraded login UX (p99 > 300 ms) | p99 login latency alert | Rate-limit login endpoints; scale horizontally; queue hashing work with backpressure |
| FM-IDENT-09 | JWKS rotation | Consumers hold a stale cached JWKS | 401 errors across services if tokens are signed with the new key before consumers refresh their cached JWKS | identity_jwks_rotation_mismatch_total | 90-day rotation with 7-day overlap; downstream services cache JWKS with max-age=3600 (1 h); publish identity.jwks.rotated.v1 event |
| FM-IDENT-10 | License resolver | Hierarchy ancestor walk fails | Effective license resolution returns empty set; UI module gates fail | identity_license_resolver_errors_total; 5xx from /licensing/nodes/:id/effective | Return last-cached result (5 min TTL); log warning; alert if error persists > 2 min |
| FM-IDENT-11 | Tenant suspension race | Sessions not revoked within JWT TTL (15 min) | Suspended tenant users briefly retain access | identity.user.suspended.v1 consumed; Redis session revocation set | Proactive revocation on suspension event; Redis TTL ensures maximum 15-min window; alert if event lag > 30 s |
| FM-IDENT-12 | OOM / crash loop | Pod restart | Brief disruption while traffic is routed to healthy pods | Kubernetes restart count metric; CrashLoopBackOff alert | PDB (minAvailable=2) preserves capacity during voluntary disruptions such as node drains; graceful shutdown drains in-flight requests; container memory limits confine an OOM kill to the affected pod |
| FM-IDENT-13 | External IdP misconfiguration | JIT provisioning creates duplicate users | Data integrity issue; user cannot link accounts | IDENT_EXT_IDENTITY_MISMATCH 409; duplicate detection | Idempotent JIT logic keyed on (issuer, subject); admin alert on mismatch |
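
FM-IDENT-05 and FM-IDENT-06 both rely on a circuit breaker: it opens after 5 failures, and a half-open trial is allowed after 30 s. The sketch below is a minimal single-threaded illustration under stated assumptions, not the service's actual implementation:

- "5 failures" is read as 5 *consecutive* failures.
- A single trial call is let through once the timeout elapses.
- `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative closed/open/half-open breaker; not a specific library's API."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0  # consecutive failures while closed (assumption)
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure(state)
            raise
        # A success while closed or half-open resets the breaker.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self, state: str) -> None:
        if state == "half-open":
            self._opened_at = self._clock()  # trial failed: reopen, restart timer
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

For the per-provider breakers in FM-IDENT-06, one instance per federation provider (for example, a dict keyed by provider alias) keeps one failing IdP from blocking the others. Concurrent use would additionally need a lock around the state transitions.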
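
FM-IDENT-07 detects a stolen refresh token when it is reused. A common way to get that signal is single-use refresh tokens with rotation: presenting a token that has already been rotated out revokes the whole session.

The sketch below assumes that scheme. `RefreshTokenStore`, `RefreshReplayError`, and the `events` list (a stand-in for the event bus carrying SessionRevoked) are hypothetical. A real store would persist token hashes rather than raw tokens.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Tuple


class RefreshReplayError(Exception):
    """Corresponds to the IDENT_REFRESH_REPLAY (401) outcome in FM-IDENT-07."""


@dataclass
class Session:
    session_id: str
    current_token: str
    revoked: bool = False


class RefreshTokenStore:
    """Hypothetical in-memory store illustrating rotation with reuse detection."""

    def __init__(self) -> None:
        self._by_token: Dict[str, Session] = {}  # every token ever issued -> session
        self.events: List[Tuple[str, str]] = []  # stand-in for the event bus

    def start_session(self, session_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._by_token[token] = Session(session_id, token)
        return token

    def rotate(self, presented: str) -> str:
        """Exchange a refresh token for a new one; each token is single-use."""
        session = self._by_token.get(presented)
        if session is None or session.revoked:
            raise PermissionError("unknown or revoked refresh token")
        if presented != session.current_token:
            # An already-rotated token came back: assume theft, kill the session.
            session.revoked = True
            self.events.append(("SessionRevoked", session.session_id))
            raise RefreshReplayError(session.session_id)
        new_token = secrets.token_urlsafe(32)
        session.current_token = new_token
        self._by_token[new_token] = session
        return new_token
```

Because revocation applies to the whole session, the legitimate holder's current token also stops working. That is the "legitimate user logged out" impact listed in the table.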
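
FM-IDENT-13 keys JIT provisioning on (issuer, subject). A unique constraint on that pair makes provisioning idempotent even when two first logins race.

The sketch below illustrates this with the standard-library `sqlite3` module and an in-memory schema. The table names, columns, and `jit_provision` are hypothetical. On the PostgreSQL primary, the same effect is usually achieved with a unique index plus `INSERT ... ON CONFLICT (issuer, subject) DO NOTHING`.

```python
import sqlite3
import uuid
from typing import Optional

SCHEMA = """
CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE external_identities (
    issuer  TEXT NOT NULL,
    subject TEXT NOT NULL,
    user_id TEXT NOT NULL REFERENCES users(id),
    PRIMARY KEY (issuer, subject)  -- one local user per external identity
);
"""


def _lookup(conn: sqlite3.Connection, issuer: str, subject: str) -> Optional[str]:
    row = conn.execute(
        "SELECT user_id FROM external_identities WHERE issuer = ? AND subject = ?",
        (issuer, subject),
    ).fetchone()
    return row[0] if row else None


def jit_provision(conn: sqlite3.Connection, issuer: str, subject: str, email: str) -> str:
    """Return the user linked to (issuer, subject), creating it at most once."""
    existing = _lookup(conn, issuer, subject)
    if existing is not None:
        return existing
    user_id = str(uuid.uuid4())
    try:
        with conn:  # user and link commit together or not at all
            conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
            conn.execute(
                "INSERT INTO external_identities (issuer, subject, user_id) VALUES (?, ?, ?)",
                (issuer, subject, user_id),
            )
        return user_id
    except sqlite3.IntegrityError:
        # A concurrent login linked this identity first; our transaction rolled
        # back (no orphan user row), so return the winner's user id.
        winner = _lookup(conn, issuer, subject)
        if winner is None:
            raise
        return winner
```

Keying on the pair rather than on `subject` alone matters because different IdPs can issue the same subject value.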
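
FM-IDENT-10 returns the last cached result (5 min TTL) when the hierarchy walk fails, and alerts if errors persist beyond 2 min. The sketch below assumes three things:

- The resolver always tries a live lookup first and falls back to a last-known-good entry younger than the TTL.
- If no such entry exists, it re-raises, so callers see a 5xx instead of an empty license set.
- A single failure timestamp drives the alert check.

`CachedLicenseResolver` and `should_alert` are hypothetical names.

```python
import logging
import time
from typing import Callable, Dict, FrozenSet, Optional, Tuple

log = logging.getLogger("license_resolver")


class CachedLicenseResolver:
    """Hypothetical wrapper: live resolve first, last-known-good on failure."""

    def __init__(
        self,
        resolve: Callable[[str], FrozenSet[str]],
        ttl_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, FrozenSet[str]]] = {}
        self._failing_since: Optional[float] = None

    def effective_licenses(self, node_id: str) -> FrozenSet[str]:
        now = self._clock()
        try:
            result = self._resolve(node_id)
        except Exception:
            if self._failing_since is None:
                self._failing_since = now
            cached = self._cache.get(node_id)
            if cached is not None and now - cached[0] <= self._ttl_s:
                log.warning("resolver failed for %s; serving cached result", node_id)
                return cached[1]
            raise  # nothing fresh enough to serve: surface the error
        self._failing_since = None
        self._cache[node_id] = (now, result)
        return result

    def should_alert(self, threshold_s: float = 120.0) -> bool:
        """True once errors have persisted for at least threshold_s."""
        return (
            self._failing_since is not None
            and self._clock() - self._failing_since >= threshold_s
        )
```

Raising once the cache entry has expired favors a visible error over silently gating UI modules on an empty set. That trade-off is worth confirming against the intended fail-open or fail-closed behavior.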
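
FM-IDENT-04 depends on the transactional outbox pattern: the state change and its event row commit in one transaction, and a separate relay publishes pending rows. The sketch below uses `sqlite3` in memory under these assumptions:

- `publish` is a stand-in for a JetStream publish call; no NATS client is used.
- The subject string `user.registered` and all table and function names are hypothetical.
- Time is passed in explicitly so the age calculation is easy to follow.

```python
import json
import sqlite3
import uuid
from typing import Callable

SCHEMA = """
CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE outbox (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    subject      TEXT NOT NULL,
    payload      TEXT NOT NULL,
    created_at   REAL NOT NULL,
    published_at REAL            -- NULL until the relay has published the row
);
"""


def register_user(conn: sqlite3.Connection, email: str, now: float) -> str:
    user_id = str(uuid.uuid4())
    with conn:  # the user row and its event commit atomically
        conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        conn.execute(
            "INSERT INTO outbox (subject, payload, created_at) VALUES (?, ?, ?)",
            ("user.registered", json.dumps({"user_id": user_id}), now),
        )
    return user_id


def relay_once(
    conn: sqlite3.Connection, publish: Callable[[str, bytes], None], now: float
) -> int:
    """Publish pending rows in id order; stop at the first failure."""
    rows = conn.execute(
        "SELECT id, subject, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, subject, payload in rows:
        try:
            publish(subject, payload.encode())
        except Exception:
            break  # broker down: leave this row and later ones for the next pass
        with conn:
            conn.execute("UPDATE outbox SET published_at = ? WHERE id = ?", (now, row_id))
        sent += 1
    return sent


def oldest_unpublished_age_s(conn: sqlite3.Connection, now: float) -> float:
    """Value a gauge like outbox_unpublished_age_s could report."""
    (oldest,) = conn.execute(
        "SELECT MIN(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    return 0.0 if oldest is None else now - oldest
```

If the process dies after `publish` succeeds but before the `UPDATE` commits, the row is published again on the next pass. Delivery is therefore at-least-once, and consumers should deduplicate, for example on an event id.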