
Failure Modes

:::info Source
Sourced from services/tenant-service/FAILURE_MODES.md in the documentation repo.
:::

1. Known Failure Scenarios

1.1 Authorization Decision Engine Unavailable

  • Symptom: POST /api/v1/authz/check fails; every cross-service ABAC check fails.
  • Blast radius: Platform-wide (services cannot make authz decisions).
  • Mitigation:
    • Policy engine runs in-process in every consumer service (policies shipped as signed bundles, refreshed every 60s).
    • /authz/check is a convenience endpoint for admin UIs only, not on hot paths.
    • Policy bundle is cached for 1 hour on every pod.
  • Runbook: runbooks/tenant/authz-engine.md
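
The mitigation above can be sketched as a small per-pod bundle cache. This is a minimal illustration, not the service's actual implementation: `Bundle`, `BundleCache`, and the `fetch` callable are hypothetical names, and it assumes that "cached for 1 hour" means the last good bundle stays usable for up to an hour when refreshes fail, after which the pod fails closed.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Bundle:
    version: int
    policies: dict


class BundleCache:
    """Serves the last good bundle while it is younger than max_age seconds."""

    def __init__(self, fetch: Callable[[], Bundle],
                 refresh_every: float = 60.0,   # refresh interval from the doc
                 max_age: float = 3600.0,       # assumed meaning of "cached for 1 hour"
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._refresh_every = refresh_every
        self._max_age = max_age
        self._clock = clock
        self._bundle: Optional[Bundle] = None
        self._fetched_at = 0.0                      # time of last successful fetch
        self._attempted_at: Optional[float] = None  # time of last fetch attempt

    def get(self) -> Bundle:
        now = self._clock()
        due = (self._attempted_at is None
               or now - self._attempted_at >= self._refresh_every)
        if due:
            self._attempted_at = now
            try:
                self._bundle = self._fetch()
                self._fetched_at = now
            except Exception:
                pass  # refresh failed: keep serving the cached bundle
        if self._bundle is None or now - self._fetched_at > self._max_age:
            # Assumption: with no fresh-enough bundle, the pod fails closed.
            raise RuntimeError("no usable policy bundle")
        return self._bundle
```

A consumer would call `cache.get()` before each local policy evaluation, so an outage of the bundle source only matters once the cached copy exceeds `max_age`.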

1.2 Dynamic Group Evaluation Lag

  • Symptom: A user is added to a dynamic group via SCIM/HR sync, but assignment-service does not see the new membership for hours.
  • Mitigation:
    • Full re-evaluation every 15 min per tenant.
    • Incremental re-eval on membership events.
    • SLA: dynamic group membership change visible within 5 min.
  • Recovery: manual re-eval trigger via admin API.

1.3 RBAC Policy Drift Between Services

  • Symptom: tenant-service reports that role X has a permission, but a consumer service denies the request.
  • Mitigation:
    • Signed policy bundle; version pinned per release.
    • Consumers refuse to apply a bundle that has an invalid signature or an older version than the one already applied.
  • Verification: policy-consistency test runs hourly across all services.
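
The acceptance rule for bundles can be sketched as follows. The names `sign_bundle`, `verify_bundle`, and `BundleRejected` are hypothetical, and HMAC-SHA256 with a shared key is only a stand-in so the example is self-contained; the source does not say which signature scheme the KMS-backed signer uses.

```python
import hashlib
import hmac


class BundleRejected(Exception):
    """Raised when a consumer must refuse a policy bundle."""


def _message(payload: bytes, version: int) -> bytes:
    # Bind the version into the signed message so it cannot be altered separately.
    return version.to_bytes(8, "big") + payload


def sign_bundle(payload: bytes, version: int, key: bytes) -> bytes:
    # Stand-in signer (assumption): HMAC-SHA256 over version + payload.
    return hmac.new(key, _message(payload, version), hashlib.sha256).digest()


def verify_bundle(payload: bytes, signature: bytes, version: int,
                  key: bytes, current_version: int) -> None:
    expected = sign_bundle(payload, version, key)
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, signature):
        raise BundleRejected("invalid signature")
    if version < current_version:
        raise BundleRejected(
            f"bundle version {version} is older than applied version {current_version}")
```

Re-applying the currently applied version is allowed in this sketch; only strictly older versions are refused.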

1.4 Data-Residency Migration Failure

  • Symptom: Migration saga hangs; tenant writes frozen; data partial across regions.
  • Mitigation:
    • Saga compensations: unfreeze source; discard target.
    • Checksums verified at every step.
    • Rollback runbook tested in staging within the last 7 days.
  • Recovery: manual compensation by SRE + CTO sign-off.
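
A minimal sketch of the saga shape described above: freeze the source, copy step by step with a checksum check after each step, and run the two compensations on any failure. `run_migration`, `write_target`, and the in-memory `target` list are illustrative assumptions; the real saga steps are not described in the source.

```python
import hashlib
from typing import Callable, List


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_migration(chunks: List[bytes],
                  write_target: Callable[[bytes], bytes],
                  target: List[bytes],
                  log: List[str]) -> bool:
    """Copy chunks to the target region; compensate on the first failed step."""
    log.append("freeze source writes")
    try:
        for step, chunk in enumerate(chunks):
            stored = write_target(chunk)  # returns the bytes read back from the target
            if checksum(stored) != checksum(chunk):
                raise ValueError(f"checksum mismatch at step {step}")
            target.append(stored)
    except Exception as exc:
        log.append(f"step failed: {exc}")
        # Compensations, in the order listed in the mitigation.
        log.append("unfreeze source")
        target.clear()
        log.append("discard target")
        return False
    log.append("migration complete")
    return True
```

Because the compensations run in the same control flow as the failure, a partially written target never outlives a failed step in this sketch. A hung saga, as in the symptom above, still needs the manual recovery path.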

1.5 Cross-Tenant Reference via Shared Role

  • Symptom: System role (tenantId: null) grants cross-tenant access due to predicate bug.
  • Mitigation:
    • Domain invariant: every ABAC predicate must reference ctx.tenant_id.
    • Policy linter rejects predicates without tenant scope.
    • Two-tenant isolation test on every policy change.
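
A linter check for the tenant-scope invariant might look like the sketch below. It assumes, purely for illustration, that predicates are written in a Python-compatible expression syntax; the actual predicate language is not specified in the source, and `references_tenant_scope` and `lint` are hypothetical names.

```python
import ast
from typing import Dict, List


def references_tenant_scope(predicate: str) -> bool:
    """True if the expression contains an attribute access ctx.tenant_id."""
    tree = ast.parse(predicate, mode="eval")
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and node.attr == "tenant_id"
                and isinstance(node.value, ast.Name)
                and node.value.id == "ctx"):
            return True
    return False


def lint(predicates: Dict[str, str]) -> List[str]:
    """Return the names of predicates that lack a tenant scope."""
    return [name for name, expr in predicates.items()
            if not references_tenant_scope(expr)]
```

Walking the syntax tree rather than searching the text means a string literal such as `'ctx.tenant_id'` does not satisfy the check. A reference alone does not prove the predicate is correct, which is why the two-tenant isolation test remains necessary.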

1.6 Membership Re-Activation Race

  • Symptom: Deleted user re-invited; old sessions/enrollments incorrectly re-activated.
  • Mitigation:
    • Membership ID is a ULID, not the (tenantId, userId) pair; a re-invite creates a new ID.
    • Old enrollments remain revoked.
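
The effect of keying memberships by ULID can be illustrated with a small in-memory sketch. `new_ulid` follows the public ULID layout (48-bit millisecond timestamp plus 80 random bits, Crockford base32); `Membership` and `MembershipStore` are hypothetical names, not the service's data model.

```python
import os
import time
from dataclasses import dataclass
from typing import Dict, Optional

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(now_ms: Optional[int] = None) -> str:
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    # 128 bits encoded as 26 base32 characters, most significant first.
    return "".join(CROCKFORD[(value >> shift) & 0x1F]
                   for shift in range(125, -1, -5))


@dataclass
class Membership:
    id: str
    tenant_id: str
    user_id: str
    revoked: bool = False


class MembershipStore:
    def __init__(self) -> None:
        self._by_id: Dict[str, Membership] = {}

    def invite(self, tenant_id: str, user_id: str) -> Membership:
        # Every invite gets a fresh ID, even for a previously deleted user.
        membership = Membership(new_ulid(), tenant_id, user_id)
        self._by_id[membership.id] = membership
        return membership

    def revoke(self, membership_id: str) -> None:
        self._by_id[membership_id].revoked = True

    def is_active(self, membership_id: str) -> bool:
        membership = self._by_id.get(membership_id)
        return membership is not None and not membership.revoked
```

Sessions and enrollments that reference the old membership ID keep pointing at a revoked record, so a re-invite cannot revive them.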

2. Retry / Backoff Rules

| Operation | Max attempts | Backoff | Budget |
| --- | --- | --- | --- |
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| Policy bundle fetch | 5 | 100ms exp | 5s |
| Dynamic group re-eval | 3 | 1s, 5s, 15s | 30s |
| Data-residency step | manual | | saga timeout 60min |
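
One way to apply a row of this table is a retry helper that takes an attempt limit, a delay schedule, and a total time budget. This is a sketch under assumptions: `retry` is a hypothetical helper, it sleeps only between attempts, and the source does not specify how the listed delays map onto attempts or whether the budget includes time spent in the operation itself (here it does).

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry(op: Callable[[], T],
          max_attempts: int,
          delays: Sequence[float],
          budget: float,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> T:
    """Call op until it succeeds, attempts run out, or the budget would be exceeded."""
    if max_attempts < 1 or not delays:
        raise ValueError("need at least one attempt and one delay")
    start = clock()
    last_error: Exception = RuntimeError("unreachable")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as exc:
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no attempts left
        delay = delays[min(attempt, len(delays) - 1)]
        if clock() - start + delay > budget:
            break  # the next wait would exceed the time budget
        sleep(delay)
    raise last_error
```

For the Postgres write row this would be called as `retry(write, 3, [0.010, 0.050, 0.200], 0.300)`, where `write` is whatever callable performs the write.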

3. Circuit Breakers

| Target | Trip | Reset |
| --- | --- | --- |
| Postgres primary | 10 fail / 30s | 60s |
| Policy bundle signer (KMS) | 5 fail / 30s | 60s |
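
The trip and reset columns can be read as a sliding-window breaker: open after N failures within the window, then allow calls again once the reset period has passed. The sketch below is an assumption-laden illustration (`CircuitBreaker` and `CircuitOpen` are hypothetical names, and there is no half-open probing state, which a production breaker may have).

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int, window: float, reset_after: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_failures = max_failures
        self._window = window
        self._reset_after = reset_after
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def call(self, op: Callable[[], T]) -> T:
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self._reset_after:
                raise CircuitOpen("target is being skipped")
            # Reset period elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures.clear()
        try:
            return op()
        except Exception:
            self._record_failure(now)
            raise

    def _record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```

The KMS signer row would correspond to `CircuitBreaker(max_failures=5, window=30.0, reset_after=60.0)`, and the Postgres primary row to `max_failures=10` with the same window and reset.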

4. Fallback Paths

| Primary | Fallback |
| --- | --- |
| Live /authz/check | In-process policy engine with cached bundle |
| Dynamic group real-time | Last successful evaluation (stale but safe) |
| Data-residency saga | Read-only mode at source until manual resolution |
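
As an illustration of the "last successful evaluation" fallback, the sketch below wraps a live evaluator and returns the previous result, flagged as stale, when the live call fails. `GroupEvaluator` and `evaluate_live` are hypothetical names; how the real service stores or ages out previous evaluations is not described in the source.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


class GroupEvaluator:
    """Returns live membership when possible, else the last successful result."""

    def __init__(self, evaluate_live: Callable[[str], Iterable[str]]):
        self._evaluate_live = evaluate_live
        self._last_good: Dict[str, FrozenSet[str]] = {}

    def members(self, group_id: str) -> Tuple[FrozenSet[str], bool]:
        """Return (members, is_stale)."""
        try:
            result = frozenset(self._evaluate_live(group_id))
        except Exception:
            if group_id not in self._last_good:
                raise  # nothing to fall back to
            return self._last_good[group_id], True
        self._last_good[group_id] = result
        return result, False
```

Returning the staleness flag lets callers decide whether a stale answer is acceptable for the operation at hand.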