
tenant-service — FAILURE_MODES

Companion: APPLICATION_LOGIC · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER

Catalog of known failure scenarios, their blast radius, mitigations, and runbook hooks. Every entry is paired with a metric and an alert in OBSERVABILITY.


1. Catalog

1.1 PDP (/authz/check) unavailable

| Property | Value |
| --- | --- |
| Symptom | Callers (gateway, services) see 5xx responses or timeouts on POST /authz/check |
| Blast radius | Whole platform: every tenant-scoped action is blocked. Fail-closed by design. |
| Detection | AuthzCheckErrorRate alert; synthetic monitor failures |
| Mitigation | Cloud Run auto-recovery; warm cache fallback in callers (10 s TTL) absorbs short blips; circuit breaker in gateway returns MELMASTOON.AUTH.PDP_UNAVAILABLE |
| Runbook | runbooks/tenant-service/pdp-unavailable.md |
| Owner | Platform on-call |
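
The caller-side behaviour in the Mitigation row can be sketched as below. This is an illustration under assumptions, not the gateway's actual code: `AuthzCaller`, `Decision`, and the injected `callPdp` and `now` functions are hypothetical names, and the sketch assumes the "warm cache" means reusing a decision that was fetched less than 10 s ago.

```typescript
// Sketch only: all names here are hypothetical, not tenant-service APIs.
type Decision = { allow: boolean; code?: string };
type CacheEntry = { decision: Decision; storedAtMs: number };

const WARM_CACHE_TTL_MS = 10_000; // the 10 s TTL from the Mitigation row

class AuthzCaller {
  private cache = new Map<string, CacheEntry>();
  private callPdp: (key: string) => Decision; // throws when the PDP is unavailable
  private now: () => number;                  // injected clock, for testability

  constructor(callPdp: (key: string) => Decision, now: () => number) {
    this.callPdp = callPdp;
    this.now = now;
  }

  check(key: string): Decision {
    try {
      const decision = this.callPdp(key);
      this.cache.set(key, { decision, storedAtMs: this.now() });
      return decision;
    } catch {
      const hit = this.cache.get(key);
      if (hit && this.now() - hit.storedAtMs <= WARM_CACHE_TTL_MS) {
        return hit.decision; // short blip: reuse the recent answer
      }
      // Fail closed: without a fresh answer the request is denied.
      return { allow: false, code: "MELMASTOON.AUTH.PDP_UNAVAILABLE" };
    }
  }
}
```

Once the cached entry is older than the TTL, the caller denies rather than guessing, which matches the fail-closed stance in the Blast radius row.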

1.2 Cloud SQL primary failure

| Property | Value |
| --- | --- |
| Symptom | All writes return 5xx; reads degrade to the replica |
| Blast radius | Tenants in the affected region only |
| Detection | db_pool_connections_waiting spike; OutboxBacklog follows |
| Mitigation | Cloud SQL HA failover (~30–60 s); read traffic continues via replica; PgBouncer reconnects |
| Runbook | runbooks/tenant-service/db-failover.md |

1.3 Outbox poller stuck

| Property | Value |
| --- | --- |
| Symptom | outbox_pending_total grows monotonically; downstream caches go stale |
| Blast radius | Eventually all consumers; for example, tenant.config_updated stops propagating, so pricing-service uses stale defaults |
| Detection | OutboxBacklog alert (> 1000 for 10 min) pages |
| Mitigation | Restart poller pod; investigate Pub/Sub publish errors; replay manually with pnpm outbox:replay |
| Runbook | runbooks/tenant-service/outbox-backlog.md |

1.4 Tenant deletion cascade — partial ack

| Property | Value |
| --- | --- |
| Symptom | Some downstream services emit deletion_acked.v1, others time out |
| Blast radius | Tenant remains in closed status but downstream data is not fully purged; compliance risk |
| Detection | SagaTimeout{saga=CloseTenant} alert |
| Mitigation | Saga does not auto-complete on partial acks; it flips to awaiting_intervention; on-call inspects per-service status and may replay tenant.deleted.v1 to the laggard service; if irrecoverable, on-call opens an incident with the owners of the missing service |
| Compensation | None; closure is terminal. The unacked data is recorded in the incident and remediated manually per service |
| Runbook | runbooks/tenant-service/cascade-partial-ack.md |
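
The "no auto-complete on partial acks" rule can be sketched as a pure function. The function name, the `SagaView` shape, and the service list passed in are illustrative assumptions; only the status names `awaiting_intervention` and the ack semantics come from the table above.

```typescript
// Sketch only: evaluateCloseTenantSaga and SagaView are hypothetical names.
type SagaStatus = "waiting" | "completed" | "awaiting_intervention";

interface SagaView {
  status: SagaStatus;
  laggards: string[]; // services that have not emitted deletion_acked.v1
}

function evaluateCloseTenantSaga(
  expectedServices: string[],
  ackedServices: Set<string>,
  timedOut: boolean,
): SagaView {
  const laggards = expectedServices.filter((s) => !ackedServices.has(s));
  // The saga completes only when every expected service has acked.
  if (laggards.length === 0) return { status: "completed", laggards };
  // A timeout with missing acks never completes the saga; it escalates instead.
  return { status: timedOut ? "awaiting_intervention" : "waiting", laggards };
}
```

The `laggards` list is what on-call would use to decide which service should receive a replay of tenant.deleted.v1.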

1.5 Invitation email not delivered

| Property | Value |
| --- | --- |
| Symptom | Invitee never receives the email; notification-service reports a bounce or queue stall |
| Blast radius | Single invitation; tenant onboarding delayed |
| Detection | notification-service event notification.send.failed.v1 correlated to the invitation; in-app banner for the inviter |
| Mitigation | POST /invitations/{id}/resend (does not extend TTL); operator can revoke and re-invite to extend it |
| Edge | If notification-service is down, the original invitation.sent.v1 event remains in Pub/Sub and will be processed when notification-service recovers |
| Runbook | runbooks/tenant-service/invite-delivery.md |

1.6 Downstream service ignores tenant.suspended.v1

| Property | Value |
| --- | --- |
| Symptom | Suspended tenant continues accepting writes via the laggard service |
| Blast radius | Compliance risk; revenue accrual on a non-paying tenant |
| Detection | Synthetic monitor: post a write to each service as a known-suspended tenant; expect MELMASTOON.TENANT.SUSPENDED. Failure pages security |
| Mitigation | Gateway adds belt-and-braces enforcement at the edge: every request is tagged with tenant.status from the cached membership snapshot, and requests from suspended tenants are rejected at the gateway regardless of downstream behavior |
| Compensation | If a write did land, billing-service flags it for review; finance reverses charges |
| Runbook | runbooks/tenant-service/suspension-leak.md |

1.7 Race: concurrent TenantConfig edits

| Property | Value |
| --- | --- |
| Symptom | Two operators submit conflicting PATCHes; one returns 412 STALE_VERSION |
| Blast radius | One user sees a save error |
| Detection | None required; expected behavior |
| Mitigation | Optimistic concurrency via If-Match; UI prompts re-fetch and re-apply patch |
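
A minimal sketch of the If-Match check follows. It assumes the version used for If-Match is a monotonically increasing integer on the aggregate; the source does not say how the version is encoded, and `applyConfigPatch` and the `TenantConfig` shape are hypothetical.

```typescript
// Sketch only: the integer version and these type names are assumptions.
interface TenantConfig {
  version: number;
  values: Record<string, string>;
}

type PatchResult =
  | { status: 200; config: TenantConfig }
  | { status: 412; code: "STALE_VERSION" };

function applyConfigPatch(
  stored: TenantConfig,
  ifMatchVersion: number,
  patch: Record<string, string>,
): PatchResult {
  // The caller must prove it saw the latest version before overwriting it.
  if (ifMatchVersion !== stored.version) {
    return { status: 412, code: "STALE_VERSION" };
  }
  return {
    status: 200,
    config: { version: stored.version + 1, values: { ...stored.values, ...patch } },
  };
}
```

In a real store the comparison and the write would need to be one atomic step (for example a conditional UPDATE on the version column), otherwise two writers could both pass the check.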

1.8 Race: invitation token replay

| Property | Value |
| --- | --- |
| Symptom | Two simultaneous accept attempts of the same token |
| Blast radius | None (one wins, one gets 409) |
| Detection | invitation_failed_total{reason=already_accepted} |
| Mitigation | Single UPDATE … SET status='accepted' WHERE status='pending' affects 1 row only for the winner; loser sees MELMASTOON.TENANT.INVITATION_REUSED |
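
The winner/loser outcome can be illustrated with an in-memory stand-in for the conditional UPDATE. Everything here is a sketch: the `invitations` map replaces the real table, and `markAccepted` and `acceptInvitation` are hypothetical names. In the database, atomicity comes from the single UPDATE statement; in this sketch it comes from JavaScript's single-threaded execution.

```typescript
// Sketch only: an in-memory model of the conditional UPDATE, not the real repository.
type InvitationStatus = "pending" | "accepted";

const invitations = new Map<string, { status: InvitationStatus }>();

// Mirrors the conditional UPDATE: flips pending to accepted and reports rows affected.
function markAccepted(token: string): number {
  const row = invitations.get(token);
  if (!row || row.status !== "pending") return 0;
  row.status = "accepted";
  return 1;
}

// Unknown tokens also fall into the 409 branch here; a real handler may treat them differently.
function acceptInvitation(token: string): { status: 200 } | { status: 409; code: string } {
  return markAccepted(token) === 1
    ? { status: 200 }
    : { status: 409, code: "MELMASTOON.TENANT.INVITATION_REUSED" };
}
```

The handler never reads the status first and writes second; it relies only on the affected-row count, which is what makes the race harmless.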

1.9 Last-owner removal attempt

| Property | Value |
| --- | --- |
| Symptom | Owner attempts to remove the only other owner, then themselves |
| Blast radius | Would lock the tenant out of all administration |
| Detection | last_owner_block_total |
| Mitigation | OwnerProtectionService.assertNotLastOwner rejects with MELMASTOON.TENANT.LAST_OWNER_REMOVAL; UI surfaces "promote another member to owner first" |
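
One way such a guard could look is sketched below. The `Member` shape, the error class, and the standalone function signature are assumptions for illustration; only the method name `assertNotLastOwner`, the error code, and the UI message come from the table.

```typescript
// Sketch only: Member and LastOwnerRemovalError are hypothetical shapes.
interface Member {
  userId: string;
  role: string;
  status: "active" | "removed";
}

class LastOwnerRemovalError extends Error {
  readonly code = "MELMASTOON.TENANT.LAST_OWNER_REMOVAL";
}

function assertNotLastOwner(members: Member[], userIdToRemove: string): void {
  const activeOwners = members.filter(
    (m) => m.role === "tenant.owner" && m.status === "active",
  );
  const removingOwner = activeOwners.some((m) => m.userId === userIdToRemove);
  // Removing an owner is fine unless that owner is the only active one left.
  if (removingOwner && activeOwners.length === 1) {
    throw new LastOwnerRemovalError("promote another member to owner first");
  }
}
```

In the symptom described above, the first removal passes because two owners exist; the second is rejected because the remaining owner is now the last one.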

1.10 Role escalation attempt

| Property | Value |
| --- | --- |
| Symptom | A tenant.gm attempts to assign tenant.owner |
| Blast radius | None (rejected) |
| Detection | role_escalation_block_total; security alert on spike |
| Mitigation | RoleEscalationGuard enforces the rule in the domain layer; PDP rejects pre-execution; security is alerted on > 10 blocks/min for one actor |
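
The guard and the per-actor spike threshold can be sketched together. The rank table is an assumption (the source only establishes that tenant.gm cannot assign tenant.owner; tenant.staff is a made-up role for illustration), as is the rule that an actor may assign roles up to its own rank. `canAssign` and `EscalationBlockTracker` are hypothetical names.

```typescript
// Sketch only: the rank ordering and tenant.staff are assumptions.
const ROLE_RANK: Record<string, number> = {
  "tenant.staff": 1,
  "tenant.gm": 2,
  "tenant.owner": 3,
};

function canAssign(actorRole: string, targetRole: string): boolean {
  const actor = ROLE_RANK[actorRole] ?? 0;
  // Unknown target roles are treated as unassignable.
  const target = ROLE_RANK[targetRole] ?? Number.POSITIVE_INFINITY;
  return target <= actor;
}

class EscalationBlockTracker {
  private blocksByActor = new Map<string, number[]>();

  // Records one blocked attempt; returns true when the actor exceeds 10 blocks in the last 60 s.
  record(actorId: string, nowMs: number): boolean {
    const recent = (this.blocksByActor.get(actorId) ?? []).filter(
      (t) => nowMs - t < 60_000,
    );
    recent.push(nowMs);
    this.blocksByActor.set(actorId, recent);
    return recent.length > 10;
  }
}
```

A caller would invoke `record` each time `canAssign` returns false and raise the security alert when it returns true.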

1.11 iam-service user deletion lost (orphaned membership)

| Property | Value |
| --- | --- |
| Symptom | A user is deleted in iam-service but the event was not delivered |
| Blast radius | Membership remains active; potential ghost access |
| Detection | OrphanedMembershipSweep daily job cross-checks memberships against iam-service; mismatches are written to audit_events and reported |
| Mitigation | Sweep flips the membership to removed; emits membership.removed.v1; iam-service reconciles |
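
The core of the sweep is a set difference between active memberships and users that still exist in iam-service. The sketch below assumes the set of existing user IDs has already been fetched; the `Membership` and event payload shapes and the function name are hypothetical.

```typescript
// Sketch only: payload fields and type names are illustrative assumptions.
interface Membership {
  id: string;
  userId: string;
  status: "active" | "removed";
}

interface MembershipRemovedEvent {
  type: "membership.removed.v1";
  membershipId: string;
  userId: string;
}

function sweepOrphanedMemberships(
  memberships: Membership[],
  existingUserIds: Set<string>,
): MembershipRemovedEvent[] {
  const events: MembershipRemovedEvent[] = [];
  for (const m of memberships) {
    // An active membership whose user no longer exists is an orphan.
    if (m.status === "active" && !existingUserIds.has(m.userId)) {
      m.status = "removed";
      events.push({ type: "membership.removed.v1", membershipId: m.id, userId: m.userId });
    }
  }
  return events;
}
```

Running the sweep twice is harmless: memberships already flipped to removed are skipped on the second pass.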

1.12 Memorystore eviction storm

| Property | Value |
| --- | --- |
| Symptom | Cache hit rate drops; DB load spikes; latency rises |
| Blast radius | All read endpoints in the affected region |
| Detection | tenant.config p95 breach; cache hit gauge falls below 80 % |
| Mitigation | Cloud Run autoscale absorbs DB load; capacity bump; review TTLs |

1.13 Property move saga stuck

| Property | Value |
| --- | --- |
| Symptom | A property is in paused_for_move; downstream services have not all acked |
| Blast radius | One property; reservations cannot be created until move completes or is rolled back |
| Detection | SagaTimeout{saga=MoveProperty} |
| Mitigation | On-call inspects each service's ack; can either drive completion manually or invoke compensating move_aborted.v1 to unpause and revert |

1.14 AI orchestrator unavailable

| Property | Value |
| --- | --- |
| Symptom | Invite classifier and bulk-removal review return errors |
| Blast radius | None to functional flow (advisory only) |
| Detection | AIClientFailure alert |
| Mitigation | Circuit breaker opens; results are recorded as unavailable; manual approval flows continue to work |

1.15 Bad migration

| Property | Value |
| --- | --- |
| Symptom | Deployment stage rejected; readyz red |
| Blast radius | Deploy aborts; no traffic shift |
| Detection | Cloud Deploy gates; readiness probe |
| Mitigation | Auto-rollback to previous revision; Flyway down-migration if reversible; otherwise restore from PITR |

1.16 Pub/Sub outage

| Property | Value |
| --- | --- |
| Symptom | Outbox events not publishing; consumer subscriptions silent |
| Blast radius | Eventual consistency lags across all consumers |
| Detection | OutboxLagP95 breach; Pub/Sub publish errors |
| Mitigation | Outbox absorbs writes; on recovery, poller drains in order; consumers are idempotent via inbox |
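
The "idempotent via inbox" property is what makes the post-recovery drain safe when events are redelivered. A minimal in-memory sketch, assuming each event carries a unique `eventId` (the field name and the `Inbox` class are hypothetical):

```typescript
// Sketch only: an in-memory inbox; a real one would be a database table.
interface Envelope {
  eventId: string;
  type: string;
}

class Inbox {
  private seen = new Set<string>();

  // Returns true if the handler ran, false if the event was a duplicate.
  handleOnce(event: Envelope, handler: (e: Envelope) => void): boolean {
    if (this.seen.has(event.eventId)) return false;
    handler(event);
    // Recorded only after the handler succeeds, so a thrown error allows redelivery.
    this.seen.add(event.eventId);
    return true;
  }
}
```

With a database-backed inbox, the inbox row and the handler's side effects would typically be committed in the same transaction so that the two cannot diverge.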

1.17 Cross-tenant data leak (Sev-1)

| Property | Value |
| --- | --- |
| Symptom | A tenant sees another tenant's data |
| Blast radius | Catastrophic: privacy breach |
| Detection | TenantIsolationFailure alert (zero-tolerance); customer report |
| Mitigation | Immediate prod freeze; revert to last known-good revision; full forensic investigation; customer notification per SLA; root cause and structural fix before unfreeze |
| Runbook | runbooks/tenant-service/tenant-isolation-violation.md (Sev-1) |

1.18 Suspended tenant — read access for owner

| Property | Value |
| --- | --- |
| Symptom | Owner of a suspended tenant cannot reach the billing portal to pay |
| Blast radius | Tenant cannot self-serve unsuspension |
| Detection | Customer support tickets |
| Mitigation | Reads are allowed for tenant.owner on the billing routes only; writes are rejected. The exception list is documented in gateway config |
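
Combined with the gateway enforcement in 1.6, the exception could be sketched as a single admission function. The `/billing/` prefix, the `GatewayRequest` shape, and the treatment of GET and HEAD as the only reads are assumptions for illustration; the actual exception list is the one in gateway config.

```typescript
// Sketch only: route prefix and request shape are hypothetical.
interface GatewayRequest {
  method: string;
  path: string;
  role: string;
  tenantStatus: "active" | "suspended";
}

// Hypothetical exception list; the real one lives in gateway config.
const SUSPENDED_READ_EXCEPTIONS = ["/billing/"];

function admit(req: GatewayRequest): { allowed: true } | { allowed: false; code: string } {
  if (req.tenantStatus !== "suspended") return { allowed: true };
  const isRead = req.method === "GET" || req.method === "HEAD";
  const isBillingRoute = SUSPENDED_READ_EXCEPTIONS.some((p) => req.path.startsWith(p));
  // Only owners may read billing routes while the tenant is suspended.
  if (isRead && isBillingRoute && req.role === "tenant.owner") return { allowed: true };
  return { allowed: false, code: "MELMASTOON.TENANT.SUSPENDED" };
}
```

Keeping the exception as an explicit allow-list means any route not named there stays blocked by default for suspended tenants.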

2. Retry & Backoff Rules

| Caller | Strategy |
| --- | --- |
| Outbox poller | Exponential backoff 1 s → 30 s; max 5 attempts before DLQ |
| Inbox consumer | Pub/Sub native redelivery; max 5; then DLQ |
| Saga ack waiter | 3 retries over 7 days; then awaiting_intervention |
| IdentityClient.preRegister | 2 retries with idempotency key; then surface 503 SERVICE_DEGRADED |
| NotificationClient.sendInviteEmail | 1 retry; failure does not roll back the invitation (event already enqueued) |
| AIClient | No retry on classify; circuit-breaker on review |
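
The outbox poller row can be made concrete with a small schedule function. The doubling factor and the absence of jitter are assumptions; the table only fixes the 1 s floor, the 30 s ceiling, and the 5-attempt budget. `nextStep` is a hypothetical name.

```typescript
// Sketch only: doubling without jitter is an assumption, not the documented policy.
const BASE_DELAY_MS = 1_000; // 1 s floor from the table
const MAX_DELAY_MS = 30_000; // 30 s ceiling from the table
const MAX_ATTEMPTS = 5;      // after this many failures the event goes to the DLQ

type NextStep = { action: "retry"; delayMs: number } | { action: "dead_letter" };

// failedAttempts is how many publish attempts have failed so far (1 after the first failure).
function nextStep(failedAttempts: number): NextStep {
  if (failedAttempts >= MAX_ATTEMPTS) return { action: "dead_letter" };
  const delayMs = Math.min(BASE_DELAY_MS * 2 ** (failedAttempts - 1), MAX_DELAY_MS);
  return { action: "retry", delayMs };
}
```

Under plain doubling the waits are 1 s, 2 s, 4 s, and 8 s, so the 30 s ceiling is never reached within five attempts; it would only take effect with a steeper growth factor or a larger attempt budget.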

3. Circuit Breakers

Per outbound dependency:

  • iam-service: opens after 5 errors within 60 s and stays open for 30 s; then half-open with one probe.
  • notification-service: opens after 10 errors within 60 s and stays open for 60 s.
  • ai-orchestrator-service: opens after 3 errors within 60 s and stays open for 60 s; fail-open (advisory).
  • billing-service: synchronous calls disallowed (all communication via events).
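
The thresholds above map onto a small state machine. The sketch below is an assumption-laden illustration, not the service's breaker: the `CircuitBreaker` class and `BreakerConfig` shape are hypothetical, time is passed in explicitly to keep it deterministic, and the single-probe half-open step (stated above only for iam-service) is applied to every dependency.

```typescript
// Sketch only: a hand-rolled breaker for illustration; names are hypothetical.
interface BreakerConfig {
  errorThreshold: number;
  windowMs: number;
  openMs: number;
}

// Thresholds copied from the list above.
const BREAKERS: Record<string, BreakerConfig> = {
  "iam-service": { errorThreshold: 5, windowMs: 60_000, openMs: 30_000 },
  "notification-service": { errorThreshold: 10, windowMs: 60_000, openMs: 60_000 },
  "ai-orchestrator-service": { errorThreshold: 3, windowMs: 60_000, openMs: 60_000 },
};

class CircuitBreaker {
  private errors: number[] = [];
  private openedAtMs: number | null = null;
  private probeInFlight = false;
  private cfg: BreakerConfig;

  constructor(cfg: BreakerConfig) {
    this.cfg = cfg;
  }

  // Returns true if a call may go out at nowMs.
  allow(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;                   // closed
    if (nowMs - this.openedAtMs < this.cfg.openMs) return false; // open
    if (this.probeInFlight) return false;                        // half-open, probe already out
    this.probeInFlight = true;                                   // half-open: admit one probe
    return true;
  }

  onSuccess(): void {
    // Only a successful probe closes the breaker; successes while closed change nothing.
    if (this.probeInFlight) {
      this.probeInFlight = false;
      this.openedAtMs = null;
      this.errors = [];
    }
  }

  onFailure(nowMs: number): void {
    if (this.probeInFlight) {
      // Failed probe: reopen for another openMs.
      this.probeInFlight = false;
      this.openedAtMs = nowMs;
      return;
    }
    this.errors = this.errors.filter((t) => nowMs - t < this.cfg.windowMs);
    this.errors.push(nowMs);
    if (this.errors.length >= this.cfg.errorThreshold) this.openedAtMs = nowMs;
  }
}
```

What the caller does when `allow` returns false differs per dependency: for ai-orchestrator-service the advisory result is skipped (fail-open), while other callers surface an error.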

4. Fallback Paths

| Path | Fallback |
| --- | --- |
| iam-service.findUserByEmail down on invite-accept | Pre-register inline; create membership pending iam.user.registered.v1 |
| Cache miss on tenant.config | Read from primary; populate cache; metric increments |
| AI advisory unavailable | Treat as allow; flag in audit row |
| Sync push attempted offline | Reject with MELMASTOON.SYNC.ONLINE_REQUIRED; surface in Activity Center |
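
The cache-miss row describes a read-through pattern, sketched below with an in-memory map standing in for Memorystore. The class name, the synchronous loader, and the `misses` counter (a stand-in for the real metric) are assumptions; TTL handling is omitted for brevity.

```typescript
// Sketch only: in-memory stand-in for the cache; no TTL or eviction.
class ReadThroughCache<V> {
  private entries = new Map<string, V>();
  private loadFromPrimary: (key: string) => V;
  misses = 0; // stand-in for the cache-miss metric

  constructor(loadFromPrimary: (key: string) => V) {
    this.loadFromPrimary = loadFromPrimary;
  }

  get(key: string): V {
    if (this.entries.has(key)) return this.entries.get(key) as V;
    // Miss: count it, read from the primary, then populate the cache.
    this.misses += 1;
    const value = this.loadFromPrimary(key);
    this.entries.set(key, value);
    return value;
  }
}
```

During an eviction storm (1.12) every read takes the miss branch, which is why DB load rises together with the miss counter.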

5. What We Will Not Do

  • Auto-restore from backup without a human decision.
  • Auto-resolve a saga timeout by guessing missing acks.
  • Bypass RLS in production "to debug a customer issue".
  • Downgrade the two-tenant simulator from a hard CI gate.
  • Allow last_writer_wins on any tenant aggregate.