Compliance Layer — Failure Modes
Status: populated | Last updated: 2026-04-18
1. Operating Principle: Fail-Closed
The Compliance Layer operates in fail-closed mode at all times. This is a non-negotiable architectural invariant:
No message may be dispatched to a carrier without an explicit ALLOW or FLAG verdict from the Compliance Layer.
If compliance evaluation cannot complete for any reason, the message:
- Is not ACKed by the sms-orchestrator NATS consumer
- Is redelivered by JetStream after the ack wait
- After 3 redeliveries (configurable), moves to sms.outbound.deadletter with reason compliance_unavailable
- The tenant is notified via the web portal that the message could not be processed
- A platform alert fires for SRE intervention
Because the ingestion HTTP endpoint has already returned 202 and processing is asynchronous, delayed processing during compliance outages does not break the tenant's API contract. The delay is reflected as QUEUED or DEAD_LETTER in the tenant's web portal.
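The fail-closed rule above can be read as a small decision function. The following is a minimal sketch, not the orchestrator's actual code: the function name `next_step`, the returned step labels, and the choice to ACK a HOLD or BLOCK verdict without dispatching are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "ALLOW"
    FLAG = "FLAG"
    HOLD = "HOLD"
    BLOCK = "BLOCK"


MAX_REDELIVERIES = 3  # configurable, per the list above


def next_step(verdict: Optional[Verdict], redeliveries: int) -> str:
    """Decide what the consumer does with one message.

    verdict is None when compliance evaluation could not complete.
    redeliveries is how many times JetStream has already redelivered it.
    The returned labels are hypothetical names for this sketch.
    """
    if verdict in (Verdict.ALLOW, Verdict.FLAG):
        return "dispatch_and_ack"  # the only path that reaches a carrier
    if verdict is not None:
        # Assumption: HOLD and BLOCK are completed evaluations, so the
        # message is ACKed but never dispatched.
        return "ack_without_dispatch"
    if redeliveries >= MAX_REDELIVERIES:
        return "dead_letter:compliance_unavailable"
    return "no_ack"  # JetStream redelivers after the ack wait
```

Note that no branch dispatches a message when `verdict` is `None`; an incomplete evaluation can only lead to redelivery or the dead-letter subject.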
2. Failure Mode Summary
| # | Failure | Probability | Impact | Mitigation Summary |
|---|---|---|---|---|
| FM-01 | compliance-engine unreachable | Low | High | NATS redelivery; retries; SRE alert |
| FM-02 | PostgreSQL unavailable | Low | High | Redis stale cache; NATS redelivery |
| FM-03 | Redis unavailable | Low | Medium | DB-direct evaluation; no caching |
| FM-04 | Local LLM unavailable | Medium | Medium–High | Per-rule fallbackAction (HOLD by default) |
| FM-05 | Hold queue overflow | Low | Medium | Auto-expiry + bulk-action + tier escalation |
| FM-06 | Rule evaluation timeout (budget exceeded) | Low | Medium | Fail-closed on HOLD-eligible rules |
| FM-07 | Composite rule cycle / infinite recursion | Very Low | High | Cycle detection at save + runtime depth limit |
| FM-08 | NATS DLR consumer lag | Low | Low | minSampleSize guard; horizontal scaling |
| FM-09 | Scoring worker failure | Low | Low | Automatic retry on next cycle |
| FM-10 | NATS event publish failure (async side-effects) | Low | Low | In-process retry; DB state remains consistent |
3. Detailed Failure Modes
FM-01 — compliance-engine unreachable
Scenario: All compliance-engine pods crash; gRPC calls from sms-orchestrator time out.
Detection: compliance_unavailable_retry_total counter rising on sms-orchestrator; gRPC deadline exceeded errors in logs.
Impact: All outbound SMS processing pauses. Messages accumulate in the NATS sms.outbound.request stream.
Recovery:
- NATS JetStream holds the messages (default retention 7 days).
- sms-orchestrator NATS consumer retries each message up to 3 times (30 s ack wait between retries).
- After 3 redeliveries → sms.outbound.deadletter with reason compliance_unavailable.
- Dead-lettered messages trigger a tenant notification: "Message processing delayed — retry available."
- A reconciliation job can re-inject dead-lettered messages once compliance-engine recovers.
Mitigation:
- Kubernetes HPA scales compliance-engine pods; PodDisruptionBudget ensures minAvailable = 2 during rolling updates.
- gRPC client has a 1 s deadline per call (leaves headroom for NATS ack wait of 30 s).
- Alert ComplianceLayerDown fires when any pod is unready > 2 min.
FM-02 — PostgreSQL unavailable
Scenario: PostgreSQL connection pool exhausted or DB unreachable.
Detection: compliance_evaluation_errors_total{error_type="db_unavailable"}; /ready probe returns 503.
Impact:
- Rule set cache miss cannot be served from DB. Stale Redis cache serves for up to 300 s.
- Beyond cache expiry: compliance-engine returns INTERNAL; NATS consumer retries.
- Hold queue inserts and audit writes are deferred.
Recovery:
- Rule set cache TTL 300 s provides operational buffer during transient outages.
- Pending writes (audit, hold, evaluation log) are queued in-process (bounded to 10,000); replayed on DB recovery.
- After 30 s outage with cache expired: sms-orchestrator begins retry cycle (FM-01 behaviour).
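The bounded in-process queue for pending writes could look roughly like the sketch below. The class name `PendingWriteBuffer`, the dict-shaped write records, and the behaviour when the buffer is full (reject the write so the caller can fail the evaluation) are assumptions; the source only states the 10,000 bound and replay on recovery.

```python
from collections import deque
from typing import Callable, Deque, Dict

MAX_PENDING_WRITES = 10_000  # bound stated above


class PendingWriteBuffer:
    """Holds audit, hold, and evaluation-log writes while the DB is down."""

    def __init__(self, limit: int = MAX_PENDING_WRITES) -> None:
        self._limit = limit
        self._pending: Deque[Dict] = deque()

    def defer(self, write: Dict) -> bool:
        """Queue a write; return False when the buffer is full."""
        if len(self._pending) >= self._limit:
            # Assumption: the caller treats this as an evaluation failure,
            # so the message is not ACKed and NATS redelivers it.
            return False
        self._pending.append(write)
        return True

    def replay(self, apply: Callable[[Dict], None]) -> int:
        """Apply queued writes oldest first; stop at the first DB failure."""
        replayed = 0
        while self._pending:
            try:
                apply(self._pending[0])
            except ConnectionError:
                break  # DB still unavailable; keep the write for later
            self._pending.popleft()
            replayed += 1
        return replayed

    def __len__(self) -> int:
        return len(self._pending)
```

A write is removed only after `apply` succeeds, so a failure in the middle of a replay leaves the remaining writes queued in their original order.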
FM-03 — Redis unavailable
Scenario: Redis becomes unreachable.
Detection: compliance_rule_cache_misses_total spike; connection errors in logs.
Impact:
- Rule set cache misses force DB queries every evaluation; latency increases but stays within SLA.
- Evaluation result dedup cache unavailable — duplicates re-evaluated.
- RATE_VOLUME rules (sliding window counters) cannot evaluate — fall back to HOLD (fail-closed).
Recovery:
- Evaluation continues; latency impact acceptable in async pipeline.
- RATE_VOLUME rules emit a compliance.evaluation.degraded finding (HOLD action); the message is held for review rather than passed through.
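The RATE_VOLUME fallback described above can be sketched as follows. The function name, the `read_count` callable standing in for the Redis sliding-window lookup, and the use of `ConnectionError` to signal that Redis is unreachable are all assumptions for illustration.

```python
from typing import Callable, Dict


def evaluate_rate_volume(
    read_count: Callable[[], int],
    limit: int,
    action_on_exceed: str,
) -> Dict[str, str]:
    """Evaluate one RATE_VOLUME rule, failing closed if the counter is unreadable.

    read_count is a hypothetical stand-in for the Redis sliding-window read.
    action_on_exceed is whatever action the rule is configured with.
    """
    try:
        count = read_count()
    except ConnectionError:
        # Redis unavailable: hold for review instead of passing through.
        return {"action": "HOLD", "finding": "compliance.evaluation.degraded"}
    if count > limit:
        return {"action": action_on_exceed, "finding": "rate_volume_exceeded"}
    return {"action": "ALLOW", "finding": ""}
```

The finding label `rate_volume_exceeded` is invented for the sketch; only `compliance.evaluation.degraded` comes from the text above.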
FM-04 — Local LLM unavailable
Scenario: Local LLM service returns errors or times out.
Detection: compliance_ai_fallback_total counter > 0; LocalLLMUnavailable alert.
Impact: AI_CLASSIFICATION rules cannot evaluate. Keyword, regex, and other deterministic rules still function.
Recovery:
- Each AI rule has a fallbackAction. Production configuration recommendation:
  - High-sensitivity categories (TERRORISM, PHISHING, FINANCIAL_FRAUD) → fallbackAction: HOLD (fail-closed)
  - Enhancement categories (SPAM enhancement, HATE_SPEECH nuance) → fallbackAction: HOLD (still fail-closed; avoid SKIP in regulated contexts)
- Circuit breaker on LLM client prevents cascade (5 failures / 30 s → open for 60 s).
- Secondary LLM provider (external API) tried once before fallback applies.
- Alert LocalLLMUnavailable fires for on-call; optional auto-failover to external LLM.
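The circuit breaker thresholds above (5 failures within 30 s open the circuit for 60 s) can be illustrated with a small sliding-window breaker. This is a sketch under assumptions: the class and method names are hypothetical, time is passed in explicitly (a caller would typically pass `time.monotonic()`), and no half-open state is modelled.

```python
from collections import deque
from typing import Deque


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 5,
        window_s: float = 30.0,
        open_s: float = 60.0,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._open_until = 0.0

    def allow(self, now: float) -> bool:
        """True when a call to the LLM may be attempted."""
        return now >= self._open_until

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._open_until = now + self.open_s
            self._failures.clear()
```

While `allow` returns `False`, an AI rule would skip the LLM call and apply its fallbackAction directly, which is what keeps a failing LLM from adding latency to every evaluation.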
FM-05 — Hold queue overflow
Scenario: Mass-spam event or misconfigured rule causes thousands of holds in a short period.
Detection: compliance_hold_queue_pending_total > 500 (HIGH alert); > 2000 (CRITICAL alert).
Recovery:
- Auto-expiry: PENDING holds past auto_expires_at move to AUTO_EXPIRED (every 5 min cron).
- Bulk-reject: Platform admin invokes POST /compliance/hold-queue/bulk-review filtered by tenant or rule.
- Tier escalation: Tenant generating >1,000 holds in 1 h is auto-escalated to RESTRICTED, halving their rate limits.
- Rule adjustment: Compliance admin can temporarily disable an overly sensitive rule.
Auto-expired messages are terminal — the tenant is notified in the web portal that the message expired without review.
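The auto-expiry step in the recovery list can be sketched as a sweep that the 5-minute cron would run. The `Hold` dataclass and the function name are assumptions for illustration; only the PENDING and AUTO_EXPIRED statuses and the auto_expires_at field come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass
class Hold:
    hold_id: str
    status: str  # e.g. "PENDING" or "AUTO_EXPIRED"
    auto_expires_at: datetime


def expire_overdue(holds: Iterable[Hold], now: datetime) -> List[str]:
    """Move PENDING holds past auto_expires_at to AUTO_EXPIRED.

    Returns the ids that were expired so the caller can notify tenants.
    """
    expired = []
    for hold in holds:
        if hold.status == "PENDING" and hold.auto_expires_at < now:
            hold.status = "AUTO_EXPIRED"
            expired.append(hold.hold_id)
    return expired


# Example call, using timezone-aware timestamps:
sweep_time = datetime(2026, 4, 18, 12, 0, tzinfo=timezone.utc)
expire_overdue([], sweep_time)
```

Holds in any status other than PENDING are left untouched, which keeps the sweep safe to re-run.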
FM-06 — Rule evaluation budget exceeded
Scenario: Single evaluation exceeds the 450 ms internal budget.
Detection: compliance_evaluation_budget_exceeded_total counter.
Recovery:
- Remaining slow-path rules are skipped.
- Fail-closed applies for HOLD-eligible rules: if a skipped AI rule's fallbackAction is HOLD, the final verdict becomes HOLD.
- Skipped FLAG rules generate a finding with evidence: "skipped_budget_exceeded".
- Budget violations are investigated; they typically indicate LLM latency or rule complexity issues.
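The handling of skipped rules can be sketched as a post-processing step over whatever verdict the fast-path rules produced. The `SkippedRule` dataclass and the function name are hypothetical, and the guard that leaves an existing BLOCK verdict untouched is an assumption the source does not state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SkippedRule:
    rule_id: str
    action: str  # the rule's configured action, e.g. "FLAG"
    fallback_action: Optional[str] = None  # set on AI rules, e.g. "HOLD"


def apply_budget_skips(
    verdict: str, skipped: List[SkippedRule]
) -> Tuple[str, List[Dict[str, str]]]:
    """Fold rules skipped after the 450 ms budget into the final verdict."""
    findings: List[Dict[str, str]] = []
    for rule in skipped:
        if rule.fallback_action == "HOLD":
            # Fail-closed. Assumption: a BLOCK already reached is not downgraded.
            if verdict != "BLOCK":
                verdict = "HOLD"
        elif rule.action == "FLAG":
            findings.append(
                {"rule_id": rule.rule_id, "evidence": "skipped_budget_exceeded"}
            )
    return verdict, findings
```

For example, an ALLOW verdict with one skipped AI rule (fallbackAction HOLD) and one skipped FLAG rule becomes HOLD with a single `skipped_budget_exceeded` finding.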
FM-07 — Composite rule cycle
Scenario: A COMPOSITE rule references another rule that references back, directly or transitively.
Prevention (save time):
- DFS cycle check on rule creation/update; 422 rejection if cycle found.
Runtime safeguard:
- visited: Set<ruleId> during traversal; encountering a visited ID aborts the branch with a HOLD finding (fail-closed).
- Depth limit of 5 provides additional backstop.
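Both safeguards for FM-07 can be sketched over a plain reference graph. This is an illustration under assumptions: rules are reduced to a mapping from rule ID to referenced rule IDs, leaf rules evaluate to ALLOW as a placeholder, the root counts as depth 1, and the runtime set is scoped to the current path so that a rule shared by two branches is not mistaken for a cycle.

```python
from typing import Dict, List, Optional, Set

MAX_DEPTH = 5
Graph = Dict[str, List[str]]  # rule id -> ids of the rules it references


def has_cycle(graph: Graph) -> bool:
    """Save-time check: depth-first search tracking the current path."""
    done: Set[str] = set()

    def visit(rule_id: str, path: Set[str]) -> bool:
        if rule_id in path:
            return True  # reached a rule that is already on this path
        if rule_id in done:
            return False  # fully explored earlier, known cycle-free
        path.add(rule_id)
        found = any(visit(child, path) for child in graph.get(rule_id, []))
        path.discard(rule_id)
        done.add(rule_id)
        return found

    return any(visit(rule_id, set()) for rule_id in graph)


def evaluate(
    graph: Graph, rule_id: str, path: Optional[Set[str]] = None, depth: int = 1
) -> str:
    """Runtime safeguard: a revisited id or excessive depth yields HOLD."""
    path = set() if path is None else path
    if rule_id in path or depth > MAX_DEPTH:
        return "HOLD"  # fail-closed
    path.add(rule_id)
    try:
        results = [
            evaluate(graph, child, path, depth + 1)
            for child in graph.get(rule_id, [])
        ]
    finally:
        path.discard(rule_id)
    return "HOLD" if "HOLD" in results else "ALLOW"
```

A save-time caller would map `has_cycle(...) == True` to the 422 rejection; the runtime check exists for rule sets that bypassed or predate that validation.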
FM-08 — NATS DLR consumer lag
Scenario: compliance-engine-dlr consumer falls behind on sms.dlr.inbound.
Detection: nats_consumer_num_pending > 10000.
Impact: DLR stats stale; DLR_ABUSE rules evaluate against older numbers.
Recovery:
- minSampleSize guard prevents false positives from partial data.
- Consumer is horizontally scalable.
- Lag resolves automatically as consumers catch up.
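The minSampleSize guard can be sketched as below. The source does not say which statistic DLR_ABUSE rules compute, so the failure-rate calculation, the function name, and the parameter names are assumptions for illustration only.

```python
from typing import Optional


def dlr_failure_rate(
    failed: int, total: int, min_sample_size: int
) -> Optional[float]:
    """Return the DLR failure rate, or None when there is too little data.

    A None result means the DLR_ABUSE rule should not fire on this window.
    """
    if total == 0 or total < min_sample_size:
        return None
    return failed / total
```

With a lagging consumer the `total` for a recent window is small, so the guard suppresses the rule rather than letting a handful of early receipts look like a high failure rate.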
FM-09 — Scoring worker failure
Scenario: 15-min scoring cron job fails.
Detection: compliance_scoring_cycle_duration_seconds stops updating; compliance_tenant_score gauge becomes stale.
Impact: Tenant tiers are stale (up to 15 min lag). No message flow impact — Redis risk state cache serves prior values.
Recovery:
- Cron retries automatically next tick.
- Three consecutive failures → HIGH alert, on-call investigation.
FM-10 — NATS event publish failure (async side-effects)
Scenario: NATS publish for compliance.message.held, compliance.audit.v1, etc. fails after the gRPC response has been sent.
Detection: compliance.nats.publish.error log events.
Impact: Downstream consumers (notification-service, analytics-service) may miss events. Tenant may not see immediate web portal notification, though the underlying message state in sms_messages is correct.
Recovery:
- PostgreSQL state (audit_log, hold_queue, evaluation_log) is written before NATS publish — DB state is authoritative.
- Publish is retried up to 3 times with 100 ms backoff before being logged as dropped.
- A reconciliation job re-publishes missed events by scanning recent audit_log rows against NATS stream offsets.
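The in-process publish retry can be sketched as follows. The function name, the `publish` callable standing in for the NATS client call, the use of `ConnectionError`, and the constant (non-exponential) backoff are assumptions; "retried up to 3 times" is read here as one initial attempt plus three retries.

```python
import time
from typing import Callable


def publish_with_retry(
    publish: Callable[[], None],
    retries: int = 3,
    backoff_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to publish; retry on connection errors with a fixed backoff.

    Returns False when every attempt failed. The caller then logs the
    event as dropped; the DB rows written earlier remain authoritative.
    """
    for attempt in range(retries + 1):
        try:
            publish()
            return True
        except ConnectionError:
            if attempt < retries:
                sleep(backoff_s)
    return False
```

Because the database write happens before this call, a `False` result loses only the notification, and the reconciliation job can later re-publish it from audit_log.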
4. Graceful Degradation Summary
Full operation:

    Rule load (Redis cache) → Rule eval → AI (local LLM, cached) → Verdict → Routing

Redis unavailable:

    Rule load (DB direct) → Rule eval → AI (uncached, slower) → Verdict → Routing
    [RATE_VOLUME rules degrade to HOLD (fail-closed)]

DB unavailable (Redis cache warm):

    Rule load (Redis stale) → Rule eval → AI → Verdict → Routing
    [Hold queue writes deferred; NATS consumer retries via redelivery]

Local LLM unavailable:

    Rule load → Rule eval → AI rules use fallbackAction (HOLD) → Verdict
    [Messages held for manual review rather than passed through]

compliance-engine unavailable:

    [sms-orchestrator NATS consumer does not ACK]
    → NATS redelivers (3 attempts, 30 s ack wait)
    → Eventually: sms.outbound.deadletter
    → Tenant sees DEAD_LETTER status in web portal
5. Failure Mode ↔ Tenant Experience Matrix
Because the API is asynchronous, failures present differently to the tenant:
| Failure | Tenant's web portal view |
|---|---|
| compliance-engine brief outage (< 2 min) | Message remains QUEUED briefly, then proceeds |
| compliance-engine extended outage | Message → DEAD_LETTER with reason "Compliance system temporarily unavailable. Retry available." |
| BLOCK verdict | Message → BLOCKED with reason and rule citation; appeal link (if enabled) |
| HOLD verdict | Message → ON_HOLD with reason; "Under review — typically resolved within 4 hours" |
| Hold auto-expired | Message → AUTO_EXPIRED; "Review window elapsed. Please resubmit if still relevant." |
| Hold released by admin | Message → ROUTING → eventually DELIVERED; notification of release |
| Hold rejected by admin | Message → BLOCKED with review notes |