Tenant Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: OBSERVABILITY · SERVICE_RISK_REGISTER

1. Failure catalog

| ID | Component | Failure | User impact | Detection | Mitigation |
| --- | --- | --- | --- | --- | --- |
| FM-TEN-01 | PostgreSQL | Primary unavailable | All lifecycle and membership operations fail | Health probe; DB error spike | Automatic failover; pgBouncer retries; alert on-call |
| FM-TEN-02 | Activation saga: facility-service unreachable | Root hierarchy node not created | Tenant activation fails; stays PENDING | Step failure logged; saga exhausted alert | Bounded retry (3x); idempotent create-or-return; alert SRE |
| FM-TEN-03 | Activation saga: identity-service unreachable | Admin user not seeded | Tenant activation fails; stays PENDING | Step failure logged | Bounded retry; idempotent; alert SRE |
| FM-TEN-04 | Activation saga: licensing seed fails | Always-on licenses not seeded | Tenant active but no modules accessible | Step failure logged | Bounded retry; fallback: run seed on next cron cycle |
| FM-TEN-05 | NATS JetStream outbox | Events not published | Downstream services miss lifecycle events | Outbox row age > 60 s alert | Outbox relay retries; manual replay |
| FM-TEN-06 | evaluate(): DB query timeout | Authorization denied / latency spike | Request blocked or returns 503 | p95 latency alert | Query timeout 500 ms; return deny on timeout; alert |
| FM-TEN-07 | Redis cache eviction | Hierarchy tree rebuild on every request | Latency spike on tree queries | Cache hit rate drop alert | Increase Redis memory; warm cache on startup |
| FM-TEN-08 | Subscription expiry cron crash | Expired subscriptions not detected | Tenants retain access beyond contract end | Cron absent from metrics | Kubernetes CronJob restart policy; alert on job failure |
| FM-TEN-09 | identity.user.suspended.v1 not consumed | Suspended user retains memberships | User may appear active in tenant context | Inbox lag alert | NATS at-least-once; inbox deduplication; manual replay |
| FM-TEN-10 | RLS bypass leak | Cross-tenant data exposure | Catastrophic data breach | RLS test failure in CI; isolation audit | tenant_rls_bypass role restricted to background workers only; audit quarterly |
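
The "bounded retry (3x); idempotent create-or-return" mitigation listed for FM-TEN-02 and FM-TEN-03 can be illustrated with a minimal sketch. Everything named here (`run_step`, `StepUnavailable`, `SagaExhausted`, `FakeFacilityService`, the backoff delays) is a hypothetical stand-in for illustration, not the service's actual code.

```python
import time
from typing import Callable, Dict


class StepUnavailable(Exception):
    """Hypothetical error: the downstream service could not be reached."""


class SagaExhausted(Exception):
    """Hypothetical error: all attempts failed; an alert would fire here."""


def run_step(step: Callable[[], str], max_attempts: int = 3,
             base_delay_s: float = 0.0) -> str:
    """Call `step` at most `max_attempts` times with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except StepUnavailable as exc:
            if attempt == max_attempts:
                # Retries are bounded: give up and surface the failure.
                raise SagaExhausted(f"step failed after {attempt} attempts") from exc
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class FakeFacilityService:
    """In-memory stand-in that fails the first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.roots: Dict[str, str] = {}

    def create_or_return_root(self, tenant_id: str) -> str:
        if self.failures > 0:
            self.failures -= 1
            raise StepUnavailable("facility-service unreachable")
        # Idempotent: a repeated call returns the existing node, so a retry
        # after a lost response cannot create a second root.
        return self.roots.setdefault(tenant_id, f"root-{tenant_id}")


svc = FakeFacilityService(failures=2)
node = run_step(lambda: svc.create_or_return_root("t1"))
```

Because the create call is idempotent, retrying is safe even when an earlier attempt succeeded on the remote side but its response was lost. Without that property, a bounded retry could leave duplicate root nodes behind.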

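The "return deny on timeout" mitigation for FM-TEN-06 is a fail-closed policy: if the authorization query does not answer within the budget, the caller gets a deny instead of waiting. The sketch below assumes a 500 ms budget enforced in the application with a thread pool. The `evaluate` signature and the `query` callable are hypothetical, and a real deployment would more likely enforce the limit in the database driver or with a server-side statement timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared pool for authorization queries (size chosen arbitrarily for the sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def evaluate(query: Callable[[], bool], timeout_s: float = 0.5) -> bool:
    """Return the query's decision, or False (deny) if it exceeds the budget."""
    future = _pool.submit(query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Fail closed: a slow database must never turn into an implicit allow.
        # cancel() only helps if the task has not started; a running query
        # still occupies its worker until it finishes.
        future.cancel()
        return False


def slow_allow() -> bool:
    time.sleep(0.2)  # simulates a query that overruns the budget
    return True


fast_decision = evaluate(lambda: True)
slow_decision = evaluate(slow_allow, timeout_s=0.05)
```

Here `fast_decision` is `True` and `slow_decision` is `False`, even though the slow query would eventually have allowed the request.
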
2. Dependency failure impact

| Dependency | Degraded mode | Mitigation |
| --- | --- | --- |
| PostgreSQL | Full outage | DB failover; on-call |
| identity-service | Activation fails; access-context degraded | Bounded retry; partial cache served |
| facility-service | Activation fails | Bounded retry; PENDING state |
| NATS | Events queued in outbox | Outbox relay; alert |
| Redis | Evaluate + tree queries slower | Fall back to DB read; alert |
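
The Redis row describes a degraded mode where reads still succeed, only slower. A minimal cache-aside sketch of "fall back to DB read" follows. `FakeCache`, `CacheUnavailable`, `load_tree`, and the key format are hypothetical names for illustration, and the in-memory cache stands in for a real Redis client.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


class FakeCache:
    """In-memory stand-in for a Redis client; `up=False` simulates an outage."""

    def __init__(self, up: bool = True) -> None:
        self.up = up
        self.data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.up:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.up:
            raise CacheUnavailable(key)
        self.data[key] = value


def load_tree(tenant_id: str, cache: FakeCache,
              read_from_db: Callable[[str], str]) -> str:
    """Serve the hierarchy tree from cache, falling back to the database."""
    key = f"tree:{tenant_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Degraded mode: skip the cache entirely and read from the database.
        return read_from_db(tenant_id)
    tree = read_from_db(tenant_id)
    try:
        cache.set(key, tree)  # best effort; a failed write must not fail the read
    except CacheUnavailable:
        pass
    return tree


db_reads = []


def read_from_db(tenant_id: str) -> str:
    db_reads.append(tenant_id)
    return f"tree-of-{tenant_id}"


healthy = FakeCache(up=True)
first = load_tree("t1", healthy, read_from_db)   # cache miss, one DB read
second = load_tree("t1", healthy, read_from_db)  # cache hit, no DB read
outage = load_tree("t1", FakeCache(up=False), read_from_db)  # DB read
```

With the cache down, every request reaches the database, which is why the table pairs the fallback with an alert: the read path stays correct, but the extra database load is the signal to act on.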

3. Runbooks

| Runbook | Trigger |
| --- | --- |
| runbooks/tenant/activation-saga-failure.md | FM-TEN-02, FM-TEN-03, FM-TEN-04 |
| runbooks/tenant/db-failover.md | FM-TEN-01 |
| runbooks/tenant/outbox-replay.md | FM-TEN-05 |
| runbooks/tenant/subscription-expiry-cron.md | FM-TEN-08 |