:::info Source
Sourced from services/tenant-service/FAILURE_MODES.md in the documentation repo.
:::
1. Known Failure Scenarios
1.1 Authorization Decision Engine Unavailable
- Symptom: POST /api/v1/authz/check fails; every cross-service ABAC check fails.
- Blast radius: Platform-wide (services cannot make authz decisions).
- Mitigation:
  - Policy engine runs in-process in every consumer service (policies shipped as signed bundles, refreshed every 60s). /authz/check is a convenience endpoint for admin UIs only, not on hot paths.
  - Policy bundle is cached for 1 hour on every pod.
- Runbook: runbooks/tenant/authz-engine.md
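
The caching behaviour described in the mitigation could look roughly like the sketch below. Only the 60s refresh interval and the 1 hour cache lifetime come from this document; the `BundleCache` class, the injected `fetch` callable, and the choice to fail closed once the cached bundle is older than 1 hour are assumptions made for illustration.

```python
import time
from typing import Callable, Optional

REFRESH_INTERVAL_S = 60    # bundles are refreshed every 60s (from this document)
MAX_CACHE_AGE_S = 3600     # bundle is cached for 1 hour per pod (from this document)


class BundleCache:
    """Hypothetical per-pod policy bundle cache (illustrative only)."""

    def __init__(self, fetch: Callable[[], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._clock = clock
        self._bundle: Optional[bytes] = None
        self._fetched_at = 0.0                        # last successful fetch
        self._last_attempt: Optional[float] = None    # last fetch attempt

    def get(self) -> bytes:
        now = self._clock()
        due = (self._last_attempt is None
               or now - self._last_attempt >= REFRESH_INTERVAL_S)
        if due:
            self._last_attempt = now
            try:
                self._bundle = self._fetch()
                self._fetched_at = now
            except Exception:
                pass  # refresh failed: keep serving the cached bundle
        # Assumption: once the cached copy exceeds its lifetime, fail closed.
        if self._bundle is None or now - self._fetched_at > MAX_CACHE_AGE_S:
            raise RuntimeError("no usable policy bundle")
        return self._bundle
```

With this shape, an outage of the bundle source shorter than the cache lifetime is invisible to callers of `get()`.
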
1.2 Dynamic Group Evaluation Lag
- Symptom: User is added to a dynamic group via SCIM/HR sync, but assignment-service does not see the membership for hours.
- Mitigation:
  - Full re-evaluation every 15 min per tenant.
  - Incremental re-eval on membership events.
- SLA: dynamic group membership change visible within 5 min.
- Recovery: manual re-eval trigger via admin API.
1.3 RBAC Policy Drift Between Services
- Symptom: tenant-service says role X has permission, but consumer service denies.
- Mitigation:
  - Signed policy bundle; version pinned per release.
  - Consumers refuse to apply bundles with invalid signature or older version.
- Verification: policy-consistency test runs hourly across all services.
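
The consumer-side acceptance rule (reject an invalid signature, reject an older version) can be sketched as follows. This document does not specify the signing scheme or the version format, so the HMAC-SHA256 signature, the integer `version`, and the `Bundle`, `sign`, and `should_apply` names are all stand-in assumptions.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    version: int       # assumption: monotonically increasing integer
    payload: bytes
    signature: bytes


def sign(key: bytes, version: int, payload: bytes) -> bytes:
    # HMAC stands in for the real signing scheme, which is not described here.
    message = version.to_bytes(8, "big") + payload
    return hmac.new(key, message, hashlib.sha256).digest()


def should_apply(candidate: Bundle, current_version: int, key: bytes) -> bool:
    expected = sign(key, candidate.version, candidate.payload)
    if not hmac.compare_digest(expected, candidate.signature):
        return False   # invalid signature
    if candidate.version < current_version:
        return False   # older than the bundle already applied
    return True
```

Covering the version in the signed message means a valid payload cannot be replayed under a different version number.
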
1.4 Data-Residency Migration Failure
- Symptom: Migration saga hangs; tenant writes frozen; data partial across regions.
- Mitigation:
  - Saga compensations: unfreeze source; discard target.
  - Checksums verified at every step.
  - Rollback runbook tested in staging within last 7 days.
- Recovery: manual compensation by SRE + CTO sign-off.
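
As an illustration of the compensation pattern, the sketch below runs saga steps in order and, on failure, runs compensations in reverse. The `run_saga` helper and the step tuple layout are hypothetical. It also assumes compensations are idempotent and registers each one before its action runs, so that a half-applied step (for example a partial copy to the target) is cleaned up as well.

```python
import hashlib
from typing import Callable, List, Tuple

# (name, action, compensation); all names here are illustrative
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate started steps in reverse."""
    log: List[str] = []
    started: List[Step] = []
    for step in steps:
        name, action, _ = step
        started.append(step)   # register first so a half-applied step is undone
        try:
            action()
            log.append(f"ok:{name}")
        except Exception:
            log.append(f"failed:{name}")
            for comp_name, _, compensate in reversed(started):
                compensate()
                log.append(f"compensated:{comp_name}")
            break
    return log
```

A copy step would raise when `checksum(target) != checksum(source)`, which triggers "discard target" followed by "unfreeze source", matching the compensations listed above.
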
1.5 Cross-Tenant Reference via Shared Role
- Symptom: System role (tenantId: null) grants cross-tenant access due to predicate bug.
- Mitigation:
  - Domain invariant: every ABAC predicate must reference ctx.tenant_id.
  - Policy linter rejects predicates without tenant scope.
  - Two-tenant isolation test on every policy change.
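
A linter check of this kind might be sketched as below. The predicate language is not described in this document, so the sketch assumes predicates are Python-style boolean expressions and uses the standard `ast` module; `references_tenant_scope` and `lint` are hypothetical names.

```python
import ast
from typing import Dict, List


def references_tenant_scope(predicate: str) -> bool:
    """Return True if the expression reads ctx.tenant_id somewhere."""
    tree = ast.parse(predicate, mode="eval")
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and node.attr == "tenant_id"
            and isinstance(node.value, ast.Name)
            and node.value.id == "ctx"
        ):
            return True
    return False


def lint(predicates: Dict[str, str]) -> List[str]:
    """Return the names of predicates that lack tenant scope."""
    return [name for name, expr in predicates.items()
            if not references_tenant_scope(expr)]
```

A syntactic check only proves that `ctx.tenant_id` is mentioned, not that it is compared correctly, which is why the two-tenant isolation test remains a separate safeguard.
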
1.6 Membership Re-Activation Race
- Symptom: Deleted user re-invited; old sessions/enrollments incorrectly re-activated.
- Mitigation:
  - Membership ID is a ULID, not the (tenantId, userId) pair; a re-invite creates a new ID.
  - Old enrollments remain revoked.
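
The effect of keying memberships by a fresh ID can be shown with a small in-memory sketch. The `Memberships` class and its fields are hypothetical. `new_ulid` follows the public ULID layout (48-bit millisecond timestamp plus 80 random bits, Crockford base32) and is not necessarily the generator the service uses.

```python
import os
import time
from typing import Dict, Optional

CROCKFORD32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(now_ms: Optional[int] = None) -> str:
    """48-bit ms timestamp + 80 random bits, encoded as 26 base32 chars."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


class Memberships:
    """Hypothetical store keyed by membership ID, not (tenantId, userId)."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}       # membership_id -> record
        self.enrollments: Dict[str, str] = {}    # enrollment key -> membership_id

    def invite(self, tenant_id: str, user_id: str) -> str:
        membership_id = new_ulid()   # always a fresh ID, even on re-invite
        self.records[membership_id] = {
            "tenant_id": tenant_id, "user_id": user_id, "active": True,
        }
        return membership_id

    def remove(self, membership_id: str) -> None:
        self.records[membership_id]["active"] = False

    def enrollment_is_live(self, enrollment: str) -> bool:
        return self.records[self.enrollments[enrollment]]["active"]
```

Because old enrollments point at the old membership ID, re-inviting the same user cannot bring them back to life.
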
2. Retry / Backoff Rules
| Operation | Max attempts | Backoff | Budget |
|---|---|---|---|
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| Policy bundle fetch | 5 | 100ms exp | 5s |
| Dynamic group re-eval | 3 | 1s, 5s, 15s | 30s |
| Data-residency step | manual | — | saga timeout 60min |
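
A retry helper that honours both a fixed backoff schedule and an overall budget could be sketched like this. The `retry` function is hypothetical. It assumes "Max attempts" counts total tries, so delays are only slept between tries, and that a retry is abandoned when the next wait would exceed the budget.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry(op: Callable[[], T], max_attempts: int, delays_s: Sequence[float],
          budget_s: float,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise                      # attempts exhausted
            # Reuse the last delay if the schedule is shorter than the attempts.
            delay = delays_s[min(attempt - 1, len(delays_s) - 1)]
            if clock() - start + delay > budget_s:
                raise                      # next wait would exceed the budget
            sleep(delay)
    raise AssertionError("unreachable")
```

For the Postgres write row this would be called as `retry(write, 3, [0.010, 0.050, 0.200], 0.300)`, where `write` is whatever callable performs the write.
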
3. Circuit Breakers
| Target | Trip | Reset |
|---|---|---|
| Postgres primary | 10 fail / 30s | 60s |
| Policy bundle signer (KMS) | 5 fail / 30s | 60s |
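
The trip and reset columns can be read as a sliding-window breaker, sketched below. The `CircuitBreaker` class is hypothetical. It also assumes the breaker simply closes again after the reset period; this document does not say whether a half-open probe state is used.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Opens after max_failures within window_s; closes again after reset_s."""

    def __init__(self, max_failures: int, window_s: float, reset_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._reset_s = reset_s
        self._clock = clock
        self._failures: Deque[float] = deque()   # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._reset_s:
            # Reset period elapsed: close and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```

The Postgres primary row would correspond to `CircuitBreaker(10, 30, 60)` and the KMS signer row to `CircuitBreaker(5, 30, 60)`.
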
4. Fallback Paths
| Primary | Fallback |
|---|---|
| Live /authz/check | In-process policy engine with cached bundle |
| Dynamic group real-time | Last successful evaluation (stale but safe) |
| Data-residency saga | Read-only mode at source until manual resolution |