Failure Modes

:::info Source
Sourced from services/assignment-service/FAILURE_MODES.md in the documentation repo.
:::

Companion: 14 Risks & Tradeoffs · OBSERVABILITY


1. Failure Taxonomy

Organised by originating boundary:

| Boundary | Examples |
| --- | --- |
| Ingress (HTTP) | invalid payload, auth failure, rate limit |
| Domain | invariant violation, state conflict |
| Persistence | DB down, deadlock, migration in progress |
| Eventing | NATS unreachable, DLQ accumulating, consumer lag |
| Scheduling | cron drift, leader-election flap, materializer hang |
| AI | Gateway timeout, model validation fail, cost cap hit |
| Tenant / Shared | dynamic group empty, catalog version missing |
| Security | RLS bypass attempt, tenant mismatch |

2. Failure Matrix

| # | Failure | Blast radius | Detection | Auto-mitigation | Manual runbook |
| --- | --- | --- | --- | --- | --- |
| F01 | Invalid RRULE at create | single request | 400 response | reject | |
| F02 | Target OU not found at activation | single assignment | 400; state stays draft | reject | tenant-sync check |
| F03 | Catalog version not found (pin policy) | single assignment | 404-class domain error | reject | validate catalog projection |
| F04 | Postgres primary down | whole service | health probe fails | readyz → false; traffic drained; Aurora failover | rb/postgres-failover.md |
| F05 | Postgres deadlock (saga) | single tx | DB error + retry | outbox dispatcher retries with backoff (max 5) | if > 1% sustained: rb/db-contention.md |
| F06 | Migration runs long | writes blocked | probe degraded | block rollout; pin old replicas | rb/long-migration.md |
| F07 | NATS unreachable | event publish | outbox backlog grows | publisher backoff; readyz false after 30 s | rb/nats-failover.md |
| F08 | Consumer lag > 60 s | downstream sagas | consumer-lag metric | HPA scales worker on lag | rb/consumer-lag.md |
| F09 | DLQ > 0 | saga correctness | alert P1 | none; humans inspect | rb/dlq-triage.md |
| F10 | Materializer hang | no new windows | cron-liveness gauge | leader re-election; watchdog kill after 15 min | rb/materializer-hang.md |
| F11 | Materializer duplicates (defensive) | noisy events | unique index catches | INSERT ... ON CONFLICT DO NOTHING | |
| F12 | Overdue sweep skipped | windows not overdue on time | sweeper-liveness gauge | restart; alert | rb/sweeper.md |
| F13 | Escalation storm (policy misconfig) | notification overload | rate-limit metric | circuit break at 500/min/tenant; warn admin | rb/escalation-storm.md |
| F14 | Reminder duplicates | learner annoyance | reminder-log unique index | dedupe | |
| F15 | Dynamic group evaporates | empty targets | group_not_found | pause affected active assignments; notify admin | rb/group-evaporation.md |
| F16 | Tenant membership event out of order | stale windows | version check | later event wins; delta reconciled | |
| F17 | AI Gateway 5xx on suggest | suggest endpoint | HTTP 503 returned | admin UI degrades to manual | |
| F18 | AI returns invalid JSON | suggest endpoint | schema validator | 502; regression sample logged | rb/ai-regression.md |
| F19 | AI cost cap exceeded | suggest endpoint | Gateway 402 | 402 surface-level; alert | rb/ai-cost.md |
| F20 | Prompt-injection detected | suggest endpoint | Gateway 422 | reject + log | rb/ai-injection.md |
| F21 | Clock skew > 2 s | RRULE / overdue | OS NTP alert | none; skew monitoring | rb/clock.md |
| F22 | Tenant id mismatch (JWT vs header) | auth | 403 | reject; alert P1 | rb/tenant-leak.md |
| F23 | RLS disabled on table (migration bug) | correctness | semgrep + post-deploy test | rollback | rb/rls-bypass.md |
| F24 | Outbox table bloat | storage | outbox_backlog gauge | prune job | rb/outbox-bloat.md |
| F25 | Idempotency key collision | caller retry behaviour | response from cache | return prior result | |
| F26 | GDPR erasure on non-existent user | request | 404 no-op | idempotent | |
| F27 | GDPR erasure during active window | open work | state→closed with reason | publish event | see [SECURITY_MODEL §7] |
| F28 | RRULE materializes more than 10M windows | DB pressure | pre-check at activate | reject with explicit error; require override | rb/materializer-explosion.md |
| F29 | Leader-election flap | scheduler noise | multiple starts | lease + term-guard | rb/leader-flap.md |
| F30 | OpenTelemetry collector down | observability | no spans | still serves traffic | fix collector; not P1 |

3. Graceful Degradation Paths

When NATS is down:

  • HTTP writes succeed (outbox buffers).
  • Reads unaffected.
  • Outbox dispatcher retries.
  • No new windows are materialized until NATS is restored (we don't want to diverge from downstream consumers).

When AI Gateway is down:

  • Only the /suggest endpoint returns 503.
  • All other endpoints fully functional.
  • Feature flag FEATURE_AI_SUGGEST=off short-circuits the call.

When a Postgres replica is lagging:

  • Reads fall back to primary (explicit routing).
  • The compliance report may be served stale, flagged with a Warning header such as Warning: staleness 12s.
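
The replica-lag routing above can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not the service's actual routing code: the 5-second lag threshold, the `routeRead` and `ReadRoute` names, and the choice to keep compliance reports on the replica (rather than moving them to the primary) are all hypothetical. Only the fallback to primary and the `Warning: staleness …` header shape come from the text.

```typescript
// Hypothetical sketch of explicit read routing under replica lag.
interface ReadRoute {
  target: "primary" | "replica";
  headers: Record<string, string>;
}

// Assumption: the lag threshold is not given in the text; 5 s is illustrative.
const MAX_REPLICA_LAG_SECONDS = 5;

function routeRead(lagSeconds: number, isComplianceReport: boolean): ReadRoute {
  // Replica is fresh enough: serve from it with no extra headers.
  if (lagSeconds <= MAX_REPLICA_LAG_SECONDS) {
    return { target: "replica", headers: {} };
  }
  // Assumption: compliance reports tolerate staleness and are flagged instead.
  if (isComplianceReport) {
    return {
      target: "replica",
      headers: { Warning: "staleness " + Math.round(lagSeconds) + "s" },
    };
  }
  // Every other read falls back to the primary.
  return { target: "primary", headers: {} };
}
```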

When Redis is down:

  • Target resolver falls back to direct-query path (slower).
  • Idempotency layer degrades to DB-backed fallback (slower).
  • Rate limiter fails open for non-critical routes, closed for mutating routes.
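
The rate-limiter behaviour during a Redis outage amounts to a per-route policy decision. The sketch below assumes that "mutating" routes can be identified by HTTP method; that mapping and the `decideWhenLimiterDown` name are assumptions for illustration, not taken from the service.

```typescript
// Hypothetical sketch: decision taken when the Redis-backed rate limiter
// cannot be reached. Not the service's actual code.
type LimiterDecision = "allow" | "reject";

// Assumption: routes using these HTTP methods count as mutating.
const MUTATING_METHODS: string[] = ["POST", "PUT", "PATCH", "DELETE"];

function decideWhenLimiterDown(method: string): LimiterDecision {
  const isMutating = MUTATING_METHODS.indexOf(method.toUpperCase()) !== -1;
  // Fail closed for writes, fail open for everything else.
  return isMutating ? "reject" : "allow";
}
```

Failing closed on writes trades availability for protection of downstream state, while reads stay available at the cost of temporarily unenforced limits.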

4. Invariant Violation Handling

Invariants we enforce in code:

  • Duplicate window insert → unique-violation caught → log, continue.
  • Transition attempted out of a terminal state → domain error, log + metric, discard event.
  • Cross-tenant write attempt → TenantMismatchError → P1 alert, request rejected.

We never silently "fix" bad data. We alert and surface.
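
As an illustration of "alert and surface", a cross-tenant guard might look like the sketch below. Only the `TenantMismatchError` name comes from this document; its fields, the `assertSameTenant` helper, and the message format are assumptions. The guard rejects the request by throwing and makes no attempt to rewrite or "correct" the tenant id.

```typescript
// Hypothetical shape for the error named in the list above.
class TenantMismatchError extends Error {
  readonly tokenTenantId: string;
  readonly requestTenantId: string;

  constructor(tokenTenantId: string, requestTenantId: string) {
    super("tenant mismatch: token=" + tokenTenantId + " request=" + requestTenantId);
    this.name = "TenantMismatchError";
    this.tokenTenantId = tokenTenantId;
    this.requestTenantId = requestTenantId;
  }
}

// Throws instead of repairing: the caller is expected to reject the
// request and raise the P1 alert.
function assertSameTenant(tokenTenantId: string, requestTenantId: string): void {
  if (tokenTenantId !== requestTenantId) {
    throw new TenantMismatchError(tokenTenantId, requestTenantId);
  }
}
```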

5. Retry Policy

| Boundary | Retry | Backoff |
| --- | --- | --- |
| Outbox publisher | 5 | exp 1s→16s |
| NATS consumer | 5 | exp 2s→64s, then DLQ |
| AI Gateway client | 2 | 500ms |
| DB query (transient) | 2 | 50ms |
| HTTP outbound (internal) | 3 | jittered exp |

Retries always carry the original traceparent + a retry-count log attribute.
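
To make the exponential rows concrete, the sketch below computes a capped doubling schedule. The doubling factor is an assumption (the document only gives the end points), and `backoffScheduleMs` is a hypothetical helper name.

```typescript
// Delays in milliseconds for capped exponential backoff:
// base, 2*base, 4*base, ... never exceeding capMs.
// Assumption: the growth factor is 2; the document does not state it.
function backoffScheduleMs(baseMs: number, retries: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(baseMs * Math.pow(2, attempt), capMs));
  }
  return delays;
}

// Outbox publisher row: 5 retries, 1s→16s.
const outboxDelays = backoffScheduleMs(1000, 5, 16000);
// → [1000, 2000, 4000, 8000, 16000]
```

With plain doubling, five retries starting at 1s end exactly at 16s, which matches the outbox publisher row. The same formula starting at 2s reaches 32s on the fifth retry, so the NATS consumer's 2s→64s range presumably uses a different factor or counts attempts differently; this sketch does not try to resolve that.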

6. Backpressure

  • The outbox publisher is limited to 500 msgs/s per pod; beyond that it pauses fetching.
  • API writes: if the outbox backlog exceeds 50k, the API returns 503 with Retry-After to protect downstream.
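
The write-side check can be sketched as an admission function evaluated before a mutating request is accepted. The 50k threshold and the 503 with Retry-After come from the text; the `admitWrite` name, the result shape, and the default Retry-After of 5 seconds are assumptions.

```typescript
// Threshold from the text: shed writes once the outbox backlog exceeds 50k.
const OUTBOX_BACKLOG_LIMIT = 50000;

type WriteAdmission =
  | { accepted: true }
  | { accepted: false; status: 503; headers: { "Retry-After": string } };

// Assumption: the Retry-After value is not given in the text; 5 s is illustrative.
function admitWrite(outboxBacklog: number, retryAfterSeconds: number = 5): WriteAdmission {
  if (outboxBacklog > OUTBOX_BACKLOG_LIMIT) {
    return {
      accepted: false,
      status: 503,
      headers: { "Retry-After": String(retryAfterSeconds) },
    };
  }
  return { accepted: true };
}
```

A handler would call `admitWrite(currentBacklog)` first and short-circuit with the returned status and headers when `accepted` is false.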

7. Poison Messages

Any event that fails all retries lands in the DLQ subject. Operators can:

  • Re-publish after code fix.
  • Discard after investigation (audit-logged).
  • Correlate via traceId into SigNoz.
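
The hand-off from retrying to dead-lettering can be sketched as follows. The retry budget of 5 is taken from the NATS consumer row of the retry table; the `dlq.` subject prefix, the `FailedEvent` shape, and the `onHandlerFailure` name are hypothetical. The traceId travels with the dead-lettered record so operators can correlate it later.

```typescript
// Retry budget from the "NATS consumer" row of the retry table.
const CONSUMER_MAX_RETRIES = 5;

interface FailedEvent {
  subject: string;      // subject the event originally arrived on
  traceId: string;      // kept for correlation in tracing
  retriesSoFar: number; // retries already attempted for this event
}

type Disposition =
  | { action: "retry"; retryNumber: number }
  | { action: "dead-letter"; dlqSubject: string; traceId: string };

function onHandlerFailure(event: FailedEvent): Disposition {
  if (event.retriesSoFar < CONSUMER_MAX_RETRIES) {
    return { action: "retry", retryNumber: event.retriesSoFar + 1 };
  }
  // Assumption: the DLQ subject naming scheme is illustrative only.
  return {
    action: "dead-letter",
    dlqSubject: "dlq." + event.subject,
    traceId: event.traceId,
  };
}
```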

8. Cascading Failure Prevention

  • Circuit breakers on outbound calls (opossum): AI Gateway, Tenant, Catalog.
  • Bulkheads: the AI worker is a separate deployment; its failure cannot saturate the API pool.
  • Rate limits: per-tenant, per-endpoint.
  • Timeouts everywhere (no unbounded awaits).
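
For readers unfamiliar with the circuit-breaker pattern, the state machine behind it can be sketched in a few lines. This is a hand-rolled illustration, not opossum's API: the `SimpleBreaker` class, its consecutive-failure threshold, and the injected clock are assumptions made so the example is self-contained.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal consecutive-failure circuit breaker (illustrative only).
class SimpleBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // True if a call may be attempted right now.
  canCall(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      // Cool-down elapsed: allow trial calls until an outcome is recorded.
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial call, or too many consecutive failures, opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

A caller checks `canCall()` before each outbound request, fails fast when it returns false, and reports the outcome with `recordSuccess()` or `recordFailure()`.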

9. Data Drift

If we detect a window whose assignmentId no longer exists (should be impossible thanks to FK), we log orphan.window and garbage-collect. Alert if > 10/hour.

10. Freeze-point Violations

Any PR touching the ComplianceWindow state machine or the RRULE engine after M3 must:

  • Link an approved RFC.
  • Pass an additional CODEOWNER review from core-platform.

An automated gate rejects merges that change these files without an rfc: label.
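
The gate's core rule can be sketched as a pure function over the changed files and PR labels. The path prefixes below are hypothetical placeholders (the document does not list the frozen files), as is the `gateAllowsMerge` name; only the `rfc:` label prefix comes from the text. The CODEOWNER review is a separate requirement and is not modelled here.

```typescript
// Assumption: illustrative path prefixes; the real frozen paths are not
// given in this document.
const FROZEN_PATH_PREFIXES: string[] = [
  "src/domain/compliance-window/",
  "src/domain/rrule/",
];

function gateAllowsMerge(changedFiles: string[], labels: string[]): boolean {
  const touchesFrozen = changedFiles.some((file) =>
    FROZEN_PATH_PREFIXES.some((prefix) => file.indexOf(prefix) === 0)
  );
  if (!touchesFrozen) {
    return true; // the gate only applies to frozen files
  }
  // Frozen files changed: require at least one label starting with "rfc:".
  return labels.some((label) => label.indexOf("rfc:") === 0);
}
```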

11. Known Trade-offs

  • Horizon cap (90d) means very far-future occurrences aren't visible in reports. This is acceptable because we materialize on a rolling basis.
  • LWW (last-write-wins) draft sync may lose admin edits across devices; the CRDT alternative was judged over-engineered for a draft-only use case.
  • RLS-based tenancy adds small query overhead (~3-5%) in exchange for catastrophic-bug protection.