tenant-service — FAILURE_MODES
Companion: APPLICATION_LOGIC · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER
Catalog of known failure scenarios, their blast radius, mitigations, and runbook hooks. Every entry is paired with a metric and an alert in OBSERVABILITY.
1. Catalog
1.1 PDP (/authz/check) unavailable
| Property | Value |
|---|---|
| Symptom | Callers (gateway, services) see 5xx or timeouts on POST /authz/check |
| Blast radius | Whole platform — every tenant-scoped action is blocked. Fail-closed by design. |
| Detection | AuthzCheckErrorRate alert; synthetic monitor failures |
| Mitigation | Cloud Run auto-recovery; warm cache fallback in callers (10 s TTL) absorbs short blips; circuit breaker in gateway returns MELMASTOON.AUTH.PDP_UNAVAILABLE |
| Runbook | runbooks/tenant-service/pdp-unavailable.md |
| Owner | Platform on-call |
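
The caller-side fallback in the mitigation row can be sketched as a pure decision function: on a PDP failure, a cached decision younger than the 10 s TTL is served, and anything else is denied. This is a minimal sketch, not the gateway's actual code; `Decision`, `CacheEntry`, and `decideOnPdpFailure` are hypothetical names, and treating an entry aged exactly 10 s as expired is an assumption.

```typescript
// Hypothetical shapes; the real PDP response schema is not shown in this document.
type Decision = { allow: boolean; code?: string };
type CacheEntry = { decision: Decision; storedAtMs: number };

const WARM_CACHE_TTL_MS = 10_000; // 10 s TTL from the mitigation row

// Called only after POST /authz/check has failed (5xx or timeout).
function decideOnPdpFailure(cached: CacheEntry | undefined, nowMs: number): Decision {
  if (cached !== undefined && nowMs - cached.storedAtMs < WARM_CACHE_TTL_MS) {
    // Short blip: the warm cache absorbs it.
    return cached.decision;
  }
  // Fail closed: without a fresh cached decision the action is blocked.
  return { allow: false, code: "MELMASTOON.AUTH.PDP_UNAVAILABLE" };
}
```

Passing the clock in as `nowMs` keeps the function deterministic, which makes the TTL boundary easy to test.
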
1.2 Cloud SQL primary failure
| Property | Value |
|---|---|
| Symptom | All writes 5xx; reads degrade to replica |
| Blast radius | Only tenants in the affected region |
| Detection | db_pool_connections_waiting spike; OutboxBacklog follows |
| Mitigation | Cloud SQL HA failover (~30–60 s); read traffic continues via replica; PgBouncer reconnects |
| Runbook | runbooks/tenant-service/db-failover.md |
1.3 Outbox poller stuck
| Property | Value |
|---|---|
| Symptom | outbox_pending_total grows monotonically; downstream caches go stale |
| Blast radius | Eventually all consumers; tenant.config_updated not propagating ⇒ pricing-service uses stale defaults |
| Detection | OutboxBacklog (> 1000 for 10 min) pages |
| Mitigation | Restart poller pod; investigate Pub/Sub publish errors; manual replay tool pnpm outbox:replay |
| Runbook | runbooks/tenant-service/outbox-backlog.md |
1.4 Tenant deletion cascade — partial ack
| Property | Value |
|---|---|
| Symptom | Some downstream services emit deletion_acked.v1, others time out |
| Blast radius | Tenant remains in closed status but downstream data not fully purged; compliance risk |
| Detection | SagaTimeout{saga=CloseTenant} alert |
| Mitigation | Saga does not auto-complete on partial acks; it flips to awaiting_intervention. On-call inspects per-service status and may replay tenant.deleted.v1 to the laggard service; if irrecoverable, on-call opens an incident with the missing service |
| Compensation | None — closure is terminal. The unacked data is recorded in the incident; manual remediation per service |
| Runbook | runbooks/tenant-service/cascade-partial-ack.md |
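
The "no auto-complete on partial acks" rule can be illustrated with a small state resolver. This is a sketch under assumptions: only `awaiting_intervention` comes from the catalog, while the `in_progress` and `completed` state names, the function name, and the service names in the usage note are hypothetical.

```typescript
type SagaState = "in_progress" | "completed" | "awaiting_intervention";

// Resolves the CloseTenant saga state from the acks seen so far.
function resolveCloseTenantSaga(
  expectedServices: string[],
  ackedServices: string[],
  timedOut: boolean,
): { state: SagaState; missing: string[] } {
  const missing = expectedServices.filter((svc) => !ackedServices.includes(svc));
  if (missing.length === 0) {
    return { state: "completed", missing };
  }
  // Partial acks never complete the saga; a timeout hands it to a human.
  return { state: timedOut ? "awaiting_intervention" : "in_progress", missing };
}
```

For example, with expected services `["billing", "pricing"]` and a single ack from `"billing"`, a timeout yields `awaiting_intervention` with `missing` equal to `["pricing"]`, which is the per-service list on-call would inspect.
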
1.5 Invitation email not delivered
| Property | Value |
|---|---|
| Symptom | Invitee never receives email; notification-service reports bounce or queue stall |
| Blast radius | Single invitation; tenant onboarding delayed |
| Detection | notification-service event notification.send.failed.v1 correlated to invitation; in-app banner for inviter |
| Mitigation | POST /invitations/{id}/resend (does not extend TTL); operator can revoke + re-invite to extend |
| Edge | If notification-service is down, the original invitation.sent.v1 event remains in Pub/Sub and will be processed when notification-service recovers |
| Runbook | runbooks/tenant-service/invite-delivery.md |
1.6 Downstream service ignores tenant.suspended.v1
| Property | Value |
|---|---|
| Symptom | Suspended tenant continues accepting writes via the laggard service |
| Blast radius | Compliance risk; revenue accrual on a non-paying tenant |
| Detection | Synthetic monitor: post a write to each service as a known-suspended tenant; expect MELMASTOON.TENANT.SUSPENDED. Failure pages security |
| Mitigation | Gateway enforces a belt-and-braces check at the edge: every request is tagged with tenant.status from the cached membership snapshot; suspended tenants are rejected at the gateway regardless of downstream behavior |
| Compensation | If a write did land, billing-service flags for review; finance reverses charges |
| Runbook | runbooks/tenant-service/suspension-leak.md |
1.7 Race: concurrent TenantConfig edits
| Property | Value |
|---|---|
| Symptom | Two operators submit conflicting PATCHes; one returns 412 STALE_VERSION |
| Blast radius | One user sees a save error |
| Detection | None required; expected behavior |
| Mitigation | Optimistic concurrency via If-Match; UI prompts re-fetch and re-apply patch |
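
A minimal sketch of the If-Match check follows. It assumes the ETag carries an integer row version, which the catalog does not state; `TenantConfigRow`, `PatchResult`, and `applyConfigPatch` are hypothetical names.

```typescript
type TenantConfigRow = { version: number; settings: Record<string, unknown> };
type PatchResult =
  | { httpStatus: 200; row: TenantConfigRow }
  | { httpStatus: 412; code: "STALE_VERSION" };

// Applies a PATCH only if the caller's If-Match version is still current.
function applyConfigPatch(
  current: TenantConfigRow,
  ifMatchVersion: number,
  patch: Record<string, unknown>,
): PatchResult {
  if (ifMatchVersion !== current.version) {
    return { httpStatus: 412, code: "STALE_VERSION" };
  }
  return {
    httpStatus: 200,
    row: { version: current.version + 1, settings: { ...current.settings, ...patch } },
  };
}
```

When two operators both read version 3, the first PATCH succeeds and bumps the row to version 4; the second still sends version 3 and receives 412, which is the cue for the UI to re-fetch and re-apply.
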
1.8 Race: invitation token replay
| Property | Value |
|---|---|
| Symptom | Two simultaneous accept attempts of the same token |
| Blast radius | None (one wins, one gets 409) |
| Detection | invitation_failed_total{reason=already_accepted} |
| Mitigation | Single UPDATE … SET status='accepted' WHERE status='pending' returns 1 row only for the winner; loser sees MELMASTOON.TENANT.INVITATION_REUSED |
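
The winner/loser outcome follows from the affected-row count of the conditional UPDATE. The sketch below models that with an in-memory stand-in: `acceptIfPending` plays the role of the UPDATE and returns the number of rows it changed. In production the atomicity comes from the database, not from this code; the type and function names are hypothetical.

```typescript
type Invitation = { id: string; inviteStatus: "pending" | "accepted" };

// Stand-in for: UPDATE ... SET status='accepted' WHERE status='pending'
// Returns the number of affected rows (1 for the winner, 0 for everyone else).
function acceptIfPending(inv: Invitation): number {
  if (inv.inviteStatus !== "pending") return 0;
  inv.inviteStatus = "accepted";
  return 1;
}

function acceptInvitation(inv: Invitation): { httpStatus: number; code?: string } {
  return acceptIfPending(inv) === 1
    ? { httpStatus: 200 }
    : { httpStatus: 409, code: "MELMASTOON.TENANT.INVITATION_REUSED" };
}
```
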
1.9 Last-owner removal attempt
| Property | Value |
|---|---|
| Symptom | Owner attempts to remove the only other owner, then themselves |
| Blast radius | Would lock the tenant out of all administration |
| Detection | last_owner_block_total |
| Mitigation | OwnerProtectionService.assertNotLastOwner rejects with MELMASTOON.TENANT.LAST_OWNER_REMOVAL; UI surfaces "promote another member to owner first" |
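
What a guard like `assertNotLastOwner` might look like is sketched below. The actual signature of `OwnerProtectionService.assertNotLastOwner` is not shown in this document, so the parameters, the `Member` shape, and the error class are assumptions.

```typescript
type Member = { userId: string; role: string; active: boolean };

class LastOwnerRemovalError extends Error {
  readonly code = "MELMASTOON.TENANT.LAST_OWNER_REMOVAL";
}

// Throws if removing `removingUserId` would leave the tenant with no active owner.
function assertNotLastOwner(members: Member[], removingUserId: string): void {
  const owners = members.filter((m) => m.active && m.role === "tenant.owner");
  const targetIsOwner = owners.some((m) => m.userId === removingUserId);
  if (targetIsOwner && owners.length === 1) {
    throw new LastOwnerRemovalError("promote another member to owner first");
  }
}
```

In the scenario from the symptom row, removing the other owner passes because two owners exist at that point; the follow-up self-removal is the call that throws.
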
1.10 Role escalation attempt
| Property | Value |
|---|---|
| Symptom | A tenant.gm attempts to assign tenant.owner |
| Blast radius | None (rejected) |
| Detection | role_escalation_block_total; security alert if spike |
| Mitigation | RoleEscalationGuard enforces in domain; PDP rejects pre-execution; security alerted on > 10/min for one actor |
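
The "> 10/min for one actor" threshold can be evaluated with a per-actor sliding window. This is an illustrative sketch only; in practice the alert would more likely be computed from role_escalation_block_total in the metrics backend, and `EscalationBlockTracker` is a hypothetical name.

```typescript
const ESCALATION_WINDOW_MS = 60_000; // one minute
const ESCALATION_THRESHOLD = 10; // alert when strictly more than 10 blocks

class EscalationBlockTracker {
  private blocksByActor = new Map<string, number[]>();

  // Records one blocked attempt and returns true when the actor has
  // exceeded the threshold within the trailing window.
  record(actorId: string, nowMs: number): boolean {
    const recent = (this.blocksByActor.get(actorId) ?? []).filter(
      (t) => nowMs - t < ESCALATION_WINDOW_MS,
    );
    recent.push(nowMs);
    this.blocksByActor.set(actorId, recent);
    return recent.length > ESCALATION_THRESHOLD;
  }
}
```
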
1.11 iam-service user deletion lost (orphaned membership)
| Property | Value |
|---|---|
| Symptom | A user is deleted in iam-service but the event was not delivered |
| Blast radius | Membership remains active; potential ghost access |
| Detection | OrphanedMembershipSweep daily job cross-checks against iam; mismatches written to audit_events and reported |
| Mitigation | Sweep flips the orphaned membership to removed; emits membership.removed.v1; iam-service reconciles |
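
The cross-check performed by the daily sweep amounts to a set difference between active memberships and the users iam-service still knows about. A minimal sketch, assuming the sweep can obtain the set of live user IDs; the `Membership` shape, the event payload, and `sweepOrphans` are hypothetical.

```typescript
type Membership = { userId: string; tenantId: string; memberStatus: "active" | "removed" };
type RemovedEvent = { type: "membership.removed.v1"; userId: string; tenantId: string };

// Flips orphaned memberships to removed and returns the events to emit.
function sweepOrphans(memberships: Membership[], liveIamUserIds: Set<string>): RemovedEvent[] {
  const orphans = memberships.filter(
    (m) => m.memberStatus === "active" && !liveIamUserIds.has(m.userId),
  );
  return orphans.map((m) => {
    m.memberStatus = "removed";
    const evt: RemovedEvent = { type: "membership.removed.v1", userId: m.userId, tenantId: m.tenantId };
    return evt;
  });
}
```

Because already-removed memberships are filtered out, running the sweep twice emits no duplicate events.
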
1.12 Memorystore eviction storm
| Property | Value |
|---|---|
| Symptom | Cache hit rate drops; DB load spikes; latency rises |
| Blast radius | All read endpoints in the affected region |
| Detection | tenant.config p95 breach; cache hit gauge falls below 80 % |
| Mitigation | Cloud Run autoscale absorbs DB load; capacity bump; review TTLs |
1.13 Property move saga stuck
| Property | Value |
|---|---|
| Symptom | A property is in paused_for_move; downstream services have not all acked |
| Blast radius | One property; reservations cannot be created until move completes or is rolled back |
| Detection | SagaTimeout{saga=MoveProperty} |
| Mitigation | On-call inspects each service's ack; can either drive completion manually or invoke compensating move_aborted.v1 to unpause and revert |
1.14 AI orchestrator unavailable
| Property | Value |
|---|---|
| Symptom | Invite classifier and bulk-removal review return error |
| Blast radius | None to functional flow (advisory only) |
| Detection | AIClientFailure alert |
| Mitigation | Circuit breaker opens; results recorded as unavailable; manual approval flows continue to work |
1.15 Bad migration
| Property | Value |
|---|---|
| Symptom | Deployment Stage rejected; readyz red |
| Blast radius | Deploy aborts; no traffic shift |
| Detection | Cloud Deploy gates; readiness probe |
| Mitigation | Auto-rollback to previous revision; Flyway down-migration if reversible; otherwise restore PITR |
1.16 Pub/Sub outage
| Property | Value |
|---|---|
| Symptom | Outbox events not publishing; consumer subs silent |
| Blast radius | Eventual consistency lags across all consumers |
| Detection | OutboxLagP95 breach; Pub/Sub publish errors |
| Mitigation | Outbox absorbs writes; on recovery, poller drains in order; consumers idempotent via inbox |
1.17 Cross-tenant data leak (Sev-1)
| Property | Value |
|---|---|
| Symptom | A tenant sees another tenant's data |
| Blast radius | Catastrophic — privacy breach |
| Detection | TenantIsolationFailure alert (zero-tolerance); customer report |
| Mitigation | Immediate prod freeze; revert to last known-good revision; full forensic investigation; customer notification per SLA; root cause and structural fix before unfreeze |
| Runbook | runbooks/tenant-service/tenant-isolation-violation.md (Sev-1) |
1.18 Suspended tenant — read access for owner
| Property | Value |
|---|---|
| Symptom | Owner of a suspended tenant cannot reach billing portal to pay |
| Blast radius | Tenant cannot self-serve unsuspension |
| Detection | Customer support tickets |
| Mitigation | Reads allowed for tenant.owner to the billing routes only; writes rejected. Documented exception list in gateway config |
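
The edge check from 1.6 and the owner exception from 1.18 combine into one gateway rule, sketched below. The route prefix `/billing/` and the choice of GET and HEAD as "reads" are assumptions; the real exception list lives in gateway config and is not reproduced here.

```typescript
type GatewayRequest = { method: string; path: string; role: string; tenantStatus: string };
type GatewayDecision = { allow: true } | { allow: false; code: string };

const BILLING_ROUTE_PREFIXES = ["/billing/"]; // hypothetical exception list
const READ_METHODS = ["GET", "HEAD"]; // assumed definition of a read

function decideAtGateway(req: GatewayRequest): GatewayDecision {
  if (req.tenantStatus !== "suspended") {
    return { allow: true };
  }
  const isRead = READ_METHODS.includes(req.method);
  const isBillingRoute = BILLING_ROUTE_PREFIXES.some((prefix) => req.path.startsWith(prefix));
  if (isRead && isBillingRoute && req.role === "tenant.owner") {
    return { allow: true }; // documented exception: owner may read billing routes
  }
  return { allow: false, code: "MELMASTOON.TENANT.SUSPENDED" };
}
```
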
2. Retry & Backoff Rules
| Caller | Strategy |
|---|---|
| Outbox poller | Exponential backoff 1 s → 30 s; max 5 attempts before DLQ |
| Inbox consumer | Pub/Sub native redelivery; max 5; then DLQ |
| Saga ack waiter | 3 retries over 7 days; then awaiting_intervention |
| IdentityClient.preRegister | 2 retries with idempotency key; then surface 503 SERVICE_DEGRADED |
| NotificationClient.sendInviteEmail | 1 retry; failure does not roll back the invitation (event already enqueued) |
| AIClient | No retry on classify; circuit-breaker on review |
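
The outbox poller row can be made concrete with a small delay calculator. The table gives the 1 s floor, the 30 s cap, and the 5-attempt limit, but not the growth factor, so doubling is an assumption here (and jitter is omitted); `nextOutboxDelayMs` is a hypothetical name.

```typescript
const OUTBOX_BASE_DELAY_MS = 1_000; // 1 s
const OUTBOX_MAX_DELAY_MS = 30_000; // 30 s cap
const OUTBOX_MAX_ATTEMPTS = 5;

// Returns the delay before the next retry after `failedAttempt` (1-based),
// or null when the attempt budget is spent and the event goes to the DLQ.
function nextOutboxDelayMs(failedAttempt: number, factor = 2): number | null {
  if (failedAttempt >= OUTBOX_MAX_ATTEMPTS) return null;
  const uncapped = OUTBOX_BASE_DELAY_MS * factor ** (failedAttempt - 1);
  return Math.min(uncapped, OUTBOX_MAX_DELAY_MS);
}
```

With a factor of 2 the four delays are 1 s, 2 s, 4 s, and 8 s, so the 30 s cap only comes into play with a steeper factor.
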
3. Circuit Breakers
Per outbound dependency:
- iam-service: open after 5 errors / 60 s for 30 s; half-open with one probe.
- notification-service: open after 10 errors / 60 s for 60 s.
- ai-orchestrator-service: open after 3 errors / 60 s for 60 s; fail-open (advisory).
- billing-service: synchronous calls disallowed (all communication via events).
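
The thresholds above map onto a conventional three-state breaker. The sketch below is one way to implement the iam-service rule (5 errors in 60 s, open for 30 s, one half-open probe); it is not the service's actual breaker, and the class and method names are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private errorTimes: number[] = [];
  private openedAtMs: number | null = null;
  private probeInFlight = false;

  constructor(
    private maxErrors: number,
    private windowMs: number,
    private openMs: number,
  ) {}

  state(nowMs: number): BreakerState {
    if (this.openedAtMs === null) return "closed";
    return nowMs - this.openedAtMs >= this.openMs ? "half_open" : "open";
  }

  // Returns true if the outbound call may proceed.
  allowRequest(nowMs: number): boolean {
    const current = this.state(nowMs);
    if (current === "closed") return true;
    if (current === "open" || this.probeInFlight) return false;
    this.probeInFlight = true; // exactly one probe while half-open
    return true;
  }

  recordSuccess(): void {
    if (!this.probeInFlight) return;
    // Successful probe closes the breaker and clears the error window.
    this.errorTimes = [];
    this.openedAtMs = null;
    this.probeInFlight = false;
  }

  recordFailure(nowMs: number): void {
    if (this.probeInFlight) {
      // Failed probe: stay open for another full interval.
      this.openedAtMs = nowMs;
      this.probeInFlight = false;
      return;
    }
    if (this.openedAtMs !== null) return; // already open; ignore stragglers
    this.errorTimes = this.errorTimes.filter((t) => nowMs - t < this.windowMs);
    this.errorTimes.push(nowMs);
    if (this.errorTimes.length >= this.maxErrors) this.openedAtMs = nowMs;
  }
}
```

Under these assumptions the iam-service breaker would be constructed as `new CircuitBreaker(5, 60_000, 30_000)`. The fail-open behavior of ai-orchestrator-service belongs to the caller: when `allowRequest` returns false, the caller records the advisory result as unavailable and continues.
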
4. Fallback Paths
| Path | Fallback |
|---|---|
| iam-service.findUserByEmail down on invite-accept | Pre-register inline; create membership pending iam.user.registered.v1 |
| Cache miss on tenant.config | Read from primary; populate cache; metric increments |
| AI advisory unavailable | Treat as allow; flag in audit row |
| Sync push attempted offline | Reject with MELMASTOON.SYNC.ONLINE_REQUIRED; surface in Activity Center |
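
The tenant.config cache-miss row describes a read-through pattern, sketched below. The version here is synchronous and keeps a plain counter in place of the real metric; the actual service would presumably use an async Memorystore client with TTLs, none of which is modeled. `TenantConfigCache` is a hypothetical name.

```typescript
class TenantConfigCache<T> {
  private entries = new Map<string, T>();
  missCount = 0; // stands in for the cache-miss metric

  constructor(private readFromPrimary: (tenantId: string) => T) {}

  get(tenantId: string): T {
    const cached = this.entries.get(tenantId);
    if (cached !== undefined) return cached;
    // Miss: count it, read from the primary, then populate the cache.
    this.missCount += 1;
    const loaded = this.readFromPrimary(tenantId);
    this.entries.set(tenantId, loaded);
    return loaded;
  }
}
```
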
5. What We Will Not Do
- Auto-restore from backup without a human decision.
- Auto-resolve a saga timeout by guessing missing acks.
- Bypass RLS in production "to debug a customer issue".
- Lower the two-tenant simulator from a hard CI gate.
- Allow last_writer_wins on any tenant aggregate.