Failure Modes
:::info Source
Sourced from services/assignment-service/FAILURE_MODES.md in the documentation repo.
:::
Companion: 14 Risks & Tradeoffs · OBSERVABILITY
1. Failure Taxonomy
Organised by originating boundary:
| Boundary | Examples |
|---|---|
| Ingress (HTTP) | invalid payload, auth failure, rate limit |
| Domain | invariant violation, state conflict |
| Persistence | DB down, deadlock, migration in progress |
| Eventing | NATS unreachable, DLQ accumulating, consumer lag |
| Scheduling | cron drift, leader-election flap, materializer hang |
| AI | Gateway timeout, model validation fail, cost cap hit |
| Tenant / Shared | dynamic group empty, catalog version missing |
| Security | RLS bypass attempt, tenant mismatch |
2. Failure Matrix
| # | Failure | Blast radius | Detection | Auto-mitigation | Manual runbook |
|---|---|---|---|---|---|
| F01 | Invalid RRULE at create | single request | 400 response | reject | — |
| F02 | Target OU not found at activation | single assignment | 400; state stays draft | reject | tenant-sync check |
| F03 | Catalog version not found (pin policy) | single assignment | 404-class domain error | reject | validate catalog projection |
| F04 | Postgres primary down | whole service | health probe fails | readyz → false; traffic drained; Aurora failover | rb/postgres-failover.md |
| F05 | Postgres deadlock (saga) | single tx | DB error + retry | outbox dispatcher retries with backoff (max 5) | if > 1% sustained: rb/db-contention.md |
| F06 | Migration runs long | writes blocked | probe degraded | block rollout; pin old replicas | rb/long-migration.md |
| F07 | NATS unreachable | event publish | outbox backlog grows | publisher backoff; readyz false after 30 s | rb/nats-failover.md |
| F08 | Consumer lag > 60 s | downstream sagas | consumer-lag metric | HPA scales worker on lag | rb/consumer-lag.md |
| F09 | DLQ > 0 | saga correctness | alert P1 | none; humans inspect | rb/dlq-triage.md |
| F10 | Materializer hang | no new windows | cron-liveness gauge | leader re-election; watchdog kill after 15 min | rb/materializer-hang.md |
| F11 | Materializer duplicates (defensive) | noisy events | unique index catches | INSERT ... ON CONFLICT DO NOTHING | — |
| F12 | Overdue sweep skipped | windows not overdue on time | sweeper-liveness gauge | restart; alert | rb/sweeper.md |
| F13 | Escalation storm (policy misconfig) | notification overload | rate-limit metric | circuit break at 500/min/tenant; warn admin | rb/escalation-storm.md |
| F14 | Reminder duplicates | learner annoyance | reminder-log unique index | dedupe | — |
| F15 | Dynamic group evaporates | empty targets | group_not_found | pause affected active assignments; notify admin | rb/group-evaporation.md |
| F16 | Tenant membership event out of order | stale windows | version check | later event wins; delta reconciled | — |
| F17 | AI Gateway 5xx on suggest | suggest endpoint | HTTP 503 returned | admin UI degrades to manual | — |
| F18 | AI returns invalid JSON | suggest endpoint | schema validator | 502; regression sample logged | rb/ai-regression.md |
| F19 | AI cost cap exceeded | suggest endpoint | Gateway 402 | 402 surface-level; alert | rb/ai-cost.md |
| F20 | Prompt-injection detected | suggest endpoint | Gateway 422 | reject + log | rb/ai-injection.md |
| F21 | Clock skew > 2 s | RRULE / overdue | OS NTP alert | none; skew monitoring | rb/clock.md |
| F22 | Tenant id mismatch (JWT vs header) | auth | 403 | reject; alert P1 | rb/tenant-leak.md |
| F23 | RLS disabled on table (migration bug) | correctness | semgrep + post-deploy test | rollback | rb/rls-bypass.md |
| F24 | Outbox table bloat | storage | outbox_backlog gauge | prune job | rb/outbox-bloat.md |
| F25 | Idempotency key collision | caller retry behaviour | response from cache | return prior result | — |
| F26 | GDPR erasure on non-existent user | request | 404 no-op | idempotent | — |
| F27 | GDPR erasure during active window | open work | state→closed with reason | publish event | see [SECURITY_MODEL §7] |
| F28 | RRULE materializes more than 10M windows | DB pressure | pre-check at activate | reject with explicit error; require override | rb/materializer-explosion.md |
| F29 | Leader-election flap | scheduler noise | multiple starts | lease + term-guard | rb/leader-flap.md |
| F30 | OpenTelemetry collector down | observability | no spans | still serves traffic | fix collector; not P1 |
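For F16, the "version check; later event wins" rule can be sketched as below. This is a minimal illustration, assuming each membership event carries a per-(user, group) `version` that only increases; the type and function names are hypothetical, and the in-memory record stands in for whatever projection the service stores.

```typescript
// Hypothetical event shape; field names are not taken from the service code.
interface MembershipEvent {
  userId: string;
  groupId: string;
  version: number; // assumed to increase monotonically per (user, group)
  member: boolean;
}

// In-memory stand-in for the stored membership projection.
type MembershipProjection = Record<string, MembershipEvent>;

// Applies the event only if it is newer than what we already hold.
// Returns true when applied, false when the event was stale and ignored.
function applyMembershipEvent(
  projection: MembershipProjection,
  event: MembershipEvent,
): boolean {
  const key = event.userId + ":" + event.groupId;
  const current = projection[key];
  if (current !== undefined && current.version >= event.version) {
    return false; // out-of-order or duplicate delivery: keep the later state
  }
  projection[key] = event;
  return true;
}
```

A `true` result is the point at which the delta reconciliation mentioned in the matrix would run; a `false` result needs no further work.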
3. Graceful Degradation Paths
When NATS is down:
- HTTP writes succeed (outbox buffers).
- Reads unaffected.
- Outbox dispatcher retries.
- No new windows are materialised until NATS is restored, so that we do not diverge from downstream consumers.
When AI Gateway is down:
- Only the `/suggest` endpoint returns 503.
- All other endpoints fully functional.
- Feature flag `FEATURE_AI_SUGGEST=off` short-circuits the call.
When Postgres replica is lagging:
- Reads fall back to primary (explicit routing).
- Compliance report may be served stale with a `Warning: staleness 12s` header.
When Redis is down:
- Target resolver falls back to direct-query path (slower).
- Idempotency layer degrades to DB-backed fallback (slower).
- Rate limiter fails open for non-critical routes, closed for mutating routes.
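The fail-open / fail-closed split for the rate limiter could look roughly like this. It is a sketch under two assumptions that the text does not confirm: that "non-critical" routes are the read-only HTTP methods, and that the Redis check surfaces an outage by throwing. `redisCheck` is a hypothetical hook, kept synchronous here for brevity.

```typescript
type RateLimitDecision = "allow" | "reject";

// Assumption: read-only methods are the "non-critical" routes.
const READ_ONLY_METHODS = ["GET", "HEAD", "OPTIONS"];

// redisCheck is a hypothetical hook: it returns the limiter verdict,
// or throws when Redis is unreachable.
function rateLimitDecision(
  method: string,
  redisCheck: () => RateLimitDecision,
): RateLimitDecision {
  try {
    return redisCheck();
  } catch {
    // Redis is down: fail open for reads, fail closed for mutating routes.
    const isReadOnly = READ_ONLY_METHODS.indexOf(method.toUpperCase()) >= 0;
    return isReadOnly ? "allow" : "reject";
  }
}
```

Failing closed on writes trades availability for protection: a mutating request that cannot be rate-checked is refused rather than let through unmetered.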
4. Invariant Violation Handling
Invariants we enforce in code:
- Duplicate window insert → unique-violation caught → log, continue.
- Terminal state transition attempted → domain error, log+metric, discard event.
- Cross-tenant write attempt → `TenantMismatchError` → P1 alert, request rejected.
We never silently "fix" bad data. We alert and surface.
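As an illustration of the cross-tenant guard, the check can be a plain comparison that throws instead of correcting anything. Only the name `TenantMismatchError` comes from this document; the fields and the `assertSameTenant` helper are assumptions made for the sketch.

```typescript
// Error raised on a cross-tenant write attempt. Fields are illustrative.
class TenantMismatchError extends Error {
  readonly callerTenantId: string;
  readonly targetTenantId: string;

  constructor(callerTenantId: string, targetTenantId: string) {
    super("tenant mismatch: caller " + callerTenantId + ", target " + targetTenantId);
    this.name = "TenantMismatchError";
    this.callerTenantId = callerTenantId;
    this.targetTenantId = targetTenantId;
  }
}

// Hypothetical guard: rejects the write; it never rewrites the tenant id.
function assertSameTenant(callerTenantId: string, targetTenantId: string): void {
  if (callerTenantId !== targetTenantId) {
    throw new TenantMismatchError(callerTenantId, targetTenantId);
  }
}
```

The caller is expected to let the error propagate to the layer that rejects the request and raises the P1 alert.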
5. Retry Policy
| Boundary | Retry | Backoff |
|---|---|---|
| Outbox publisher | 5 | exp 1s→16s |
| NATS consumer | 5 | exp 2s→64s, then DLQ |
| AI Gateway client | 2 | 500ms |
| DB query (transient) | 2 | 50ms |
| HTTP outbound (internal) | 3 | jittered exp |
Retries always carry the original traceparent + a retry-count log attribute.
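To make the backoff column concrete, the following sketch computes the delay before each retry from a retry count, a base delay, and a multiplier. The `RetryPolicy` shape and `backoffSchedule` are illustrative names, and the doubling factor is an assumption consistent with the "exp 1s→16s" outbox row.

```typescript
// Illustrative policy shape; not taken from the service code.
interface RetryPolicy {
  retries: number;     // maximum number of retry attempts
  baseDelayMs: number; // delay before the first retry
  factor: number;      // multiplier applied per attempt (1 = fixed delay)
}

// Returns the delay before each retry, in milliseconds.
function backoffSchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < policy.retries; attempt++) {
    delays.push(policy.baseDelayMs * Math.pow(policy.factor, attempt));
  }
  return delays;
}

// Outbox publisher row: 5 retries, exponential from 1 s.
const outboxDelays = backoffSchedule({ retries: 5, baseDelayMs: 1000, factor: 2 });
console.log(outboxDelays); // [1000, 2000, 4000, 8000, 16000]

// AI Gateway client row: 2 retries at a fixed 500 ms.
const aiGatewayDelays = backoffSchedule({ retries: 2, baseDelayMs: 500, factor: 1 });
console.log(aiGatewayDelays); // [500, 500]
```

Jitter, as used for internal HTTP calls, is left out so the schedule stays deterministic.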
6. Backpressure
- Outbox publisher limited to 500 msgs/s per pod; beyond that it pauses fetching.
- API writes: if outbox backlog > 50k, API returns 503 with `Retry-After` to protect downstream.
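The API-side admission check can be sketched as a pure function over the current backlog. The 50k threshold is from the text above; the 30-second `Retry-After` default and the `admitWrite` name are assumptions, since the source does not state them.

```typescript
const OUTBOX_BACKLOG_LIMIT = 50000; // "> 50k" from the backpressure rule

type WriteAdmission =
  | { admitted: true }
  | { admitted: false; status: 503; headers: { "Retry-After": string } };

// retryAfterSeconds is an assumed default; the source gives no value.
function admitWrite(
  outboxBacklog: number,
  retryAfterSeconds: number = 30,
): WriteAdmission {
  if (outboxBacklog > OUTBOX_BACKLOG_LIMIT) {
    return {
      admitted: false,
      status: 503,
      headers: { "Retry-After": String(retryAfterSeconds) },
    };
  }
  return { admitted: true };
}
```

A write handler would call this before opening its transaction and return the 503 unchanged when `admitted` is false.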
7. Poison Messages
Any event that fails all retries lands in the DLQ subject. Operators can:
- Re-publish after code fix.
- Discard after investigation (audit-logged).
- Correlate via `traceId` into SigNoz.
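The hand-off from retrying to dead-lettering can be sketched as a small routing decision. The retry limit of 5 matches the NATS consumer row of the retry table; the subject name `assignment.dlq` and the `routeFailure` function are hypothetical.

```typescript
type FailureRoute =
  | { action: "retry"; retryNumber: number }
  | { action: "dead-letter"; subject: string };

const CONSUMER_MAX_RETRIES = 5;       // NATS consumer row of the retry table
const DLQ_SUBJECT = "assignment.dlq"; // hypothetical subject name

// retriesSoFar: how many retries have already been attempted for this message.
function routeFailure(retriesSoFar: number): FailureRoute {
  if (retriesSoFar >= CONSUMER_MAX_RETRIES) {
    // All retries exhausted: park the message for operator triage.
    return { action: "dead-letter", subject: DLQ_SUBJECT };
  }
  return { action: "retry", retryNumber: retriesSoFar + 1 };
}
```

Keeping the decision separate from the handler makes the "failing all retries" condition easy to unit-test without a broker.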
8. Cascading Failure Prevention
- Circuit breakers on outbound calls (opossum): AI Gateway, Tenant, Catalog.
- Bulkheads: AI worker is a separate deployment; its failure cannot saturate API pool.
- Rate limits: per-tenant, per-endpoint.
- Timeouts everywhere (no unbounded awaits).
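To show what a circuit breaker does on an outbound call, here is a hand-rolled state machine. It is not opossum's API and not the service's code; the threshold, reset timeout, injectable clock, and synchronous `run` signature are all assumptions chosen to keep the sketch small and deterministic.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class SimpleBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  current(): BreakerState {
    return this.state;
  }

  // Runs `call` unless the breaker is open; failures yield `fallback`.
  run<T>(call: () => T, fallback: T): T {
    if (this.current() === "open") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return fallback; // still cooling down: do not touch the dependency
      }
      this.state = "half-open"; // allow a single trial call
    }
    const wasTrial = this.current() === "half-open";
    try {
      const result = call();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (wasTrial || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback;
    }
  }
}
```

While the breaker is open, callers get the fallback immediately, which is what keeps a failing AI Gateway, Tenant, or Catalog dependency from tying up request capacity.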
9. Data Drift
If we detect a window whose assignmentId no longer exists (should be impossible thanks to FK), we log orphan.window and garbage-collect. Alert if > 10/hour.
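The "alert if > 10/hour" rule can be approximated with a sliding one-hour window over detection timestamps. The sliding-window interpretation and the `OrphanMonitor` class are assumptions; the source only gives the threshold.

```typescript
const ORPHAN_ALERT_THRESHOLD = 10;        // "> 10/hour" from the rule above
const ORPHAN_WINDOW_MS = 60 * 60 * 1000;  // assumed sliding one-hour window

class OrphanMonitor {
  private timestamps: number[] = [];

  // Records one orphan.window detection at nowMs and reports
  // whether the hourly count now exceeds the alert threshold.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - ORPHAN_WINDOW_MS;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > ORPHAN_ALERT_THRESHOLD;
  }
}
```

In practice the same signal would more likely come from a rate query over the `orphan.window` log or metric; the class only illustrates the arithmetic.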
10. Freeze-point Violations
Any PR touching ComplianceWindow state machine or RRULE engine after M3 must:
- Link an approved RFC.
- Pass an additional CODEOWNER review from core-platform.
An automated gate rejects merges that change these files without an rfc: label.
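The merge gate can be pictured as a check over a pull request's changed files and labels. The path prefixes below are hypothetical (this document names the components, not their file locations), and treating any label that starts with `rfc:` as sufficient is an assumption about how the label is formatted.

```typescript
// Hypothetical locations of the frozen components.
const FROZEN_PREFIXES = ["src/domain/compliance-window/", "src/domain/rrule/"];

interface PullRequestInfo {
  changedFiles: string[];
  labels: string[];
}

interface GateResult {
  allowed: boolean;
  frozenFilesTouched: string[];
}

function freezeGate(pr: PullRequestInfo): GateResult {
  const frozenFilesTouched = pr.changedFiles.filter((file) =>
    FROZEN_PREFIXES.some((prefix) => file.indexOf(prefix) === 0),
  );
  // Assumption: an RFC label is any label beginning with "rfc:".
  const hasRfcLabel = pr.labels.some((label) => label.indexOf("rfc:") === 0);
  return {
    allowed: frozenFilesTouched.length === 0 || hasRfcLabel,
    frozenFilesTouched,
  };
}
```

The CODEOWNER review from core-platform is a separate requirement and is not modelled here.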
11. Known Trade-offs
- Horizon cap (90d) means very far-future occurrences aren't visible in reports. Acceptable — we materialize on a rolling basis.
- Last-write-wins (LWW) draft sync may lose admin edits across devices. The CRDT alternative was judged over-engineered for a draft-only use case.
- RLS-based tenancy adds small query overhead (~3-5%) in exchange for catastrophic-bug protection.