
Compliance Layer — Failure Modes

Status: populated | Last updated: 2026-04-18

1. Operating Principle: Fail-Closed

The Compliance Layer operates in fail-closed mode at all times. This is a non-negotiable architectural invariant:

No message may be dispatched to a carrier without an explicit ALLOW or FLAG verdict from the Compliance Layer.

If compliance evaluation cannot complete for any reason, the following sequence applies:

  1. The sms-orchestrator NATS consumer does not ACK the message.
  2. JetStream redelivers the message after the ack wait.
  3. After 3 redeliveries (configurable), the message moves to sms.outbound.deadletter with reason compliance_unavailable.
  4. The tenant is notified via the web portal that the message could not be processed.
  5. A platform alert fires for SRE intervention.

Because the ingestion HTTP endpoint has already returned 202 and processing is asynchronous, delayed processing during a compliance outage does not break the tenant's API contract. The delay is reflected as QUEUED or DEAD_LETTER status in the tenant's web portal.
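
The fail-closed sequence above can be sketched as a pure decision function. This is a minimal illustration, not the sms-orchestrator implementation: `handle_delivery`, `Delivery`, `redelivery_count`, and the returned action names are hypothetical, and ACKing explicit non-dispatch verdicts such as BLOCK or HOLD is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

MAX_REDELIVERIES = 3              # configurable, per the sequence above
DISPATCHABLE = {"ALLOW", "FLAG"}  # the only verdicts that permit dispatch


@dataclass
class Delivery:
    message_id: str
    redelivery_count: int  # 0 on the first delivery (hypothetical field)


def handle_delivery(delivery: Delivery, evaluate: Callable[[str], str]) -> str:
    """Return the consumer action for one delivery.

    "dispatch_and_ack": an explicit ALLOW or FLAG verdict was obtained
    "ack_no_dispatch":  an explicit verdict that forbids dispatch (assumed ACKed)
    "no_ack":           evaluation failed; JetStream redelivers after the ack wait
    "dead_letter":      redeliveries exhausted; reason compliance_unavailable
    """
    try:
        verdict = evaluate(delivery.message_id)
    except Exception:
        # Fail-closed: an evaluation error never results in a dispatch.
        if delivery.redelivery_count >= MAX_REDELIVERIES:
            return "dead_letter"
        return "no_ack"
    return "dispatch_and_ack" if verdict in DISPATCHABLE else "ack_no_dispatch"
```

Keeping the decision separate from the NATS client makes the invariant easy to unit test: no code path reaches a dispatch without a verdict in `DISPATCHABLE`.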


2. Failure Mode Summary

| # | Failure | Probability | Impact | Mitigation Summary |
|---|---|---|---|---|
| FM-01 | compliance-engine unreachable | Low | High | NATS redelivery; retries; SRE alert |
| FM-02 | PostgreSQL unavailable | Low | High | Redis stale cache; NATS redelivery |
| FM-03 | Redis unavailable | Low | Medium | DB-direct evaluation; no caching |
| FM-04 | Local LLM unavailable | Medium | Medium–High | Per-rule fallbackAction (HOLD by default) |
| FM-05 | Hold queue overflow | Low | Medium | Auto-expiry + bulk-action + tier escalation |
| FM-06 | Rule evaluation timeout (budget exceeded) | Low | Medium | Fail-closed on HOLD-eligible rules |
| FM-07 | Composite rule cycle / infinite recursion | Very Low | High | Cycle detection at save + runtime depth limit |
| FM-08 | NATS DLR consumer lag | Low | Low | minSampleSize guard; horizontal scaling |
| FM-09 | Scoring worker failure | Low | Low | Automatic retry on next cycle |
| FM-10 | NATS event publish failure (async side-effects) | Low | Low | In-process retry; DB state remains consistent |

3. Detailed Failure Modes

FM-01 — compliance-engine unreachable

Scenario: All compliance-engine pods crash; gRPC calls from sms-orchestrator time out.

Detection: compliance_unavailable_retry_total counter rising on sms-orchestrator; gRPC deadline exceeded errors in logs.

Impact: All outbound SMS processing pauses. Messages accumulate in the NATS sms.outbound.request stream.

Recovery:

  • NATS JetStream holds the messages (default retention 7 days).
  • sms-orchestrator NATS consumer retries each message up to 3 times (30 s ack wait between retries).
  • After 3 redeliveries → sms.outbound.deadletter with reason compliance_unavailable.
  • Dead-lettered messages trigger a tenant notification: "Message processing delayed — retry available."
  • A reconciliation job can re-inject dead-lettered messages once compliance-engine recovers.

Mitigation:

  • Kubernetes HPA scales compliance-engine pods; a PodDisruptionBudget keeps minAvailable = 2 during voluntary disruptions such as node drains.
  • The gRPC client uses a 1 s deadline per call, which fits well inside the 30 s NATS ack wait.
  • Alert ComplianceLayerDown fires when any pod is unready for more than 2 min.

FM-02 — PostgreSQL unavailable

Scenario: PostgreSQL connection pool exhausted or DB unreachable.

Detection: compliance_evaluation_errors_total{error_type="db_unavailable"}; /ready probe returns 503.

Impact:

  • Rule set cache misses cannot be served from the DB. The stale Redis cache continues to serve rule sets for up to 300 s.
  • After the cache expires, compliance-engine returns INTERNAL and the NATS consumer retries.
  • Hold queue inserts and audit writes are deferred.

Recovery:

  • Rule set cache TTL 300 s provides operational buffer during transient outages.
  • Pending writes (audit, hold, evaluation log) are queued in-process (bounded to 10,000); replayed on DB recovery.
  • After 30 s outage with cache expired: sms-orchestrator begins retry cycle (FM-01 behaviour).
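
The bounded in-process queue for pending writes could look like the sketch below. It is an illustration under assumptions: `PendingWriteBuffer`, `defer`, and `replay` are hypothetical names, and the behaviour when the 10,000-entry bound is reached (refuse the write so the caller can fail the evaluation) is an assumed policy that this document does not specify.

```python
from collections import deque
from typing import Callable, Deque, Dict

MAX_PENDING_WRITES = 10_000  # bound stated in the recovery notes above


class PendingWriteBuffer:
    """FIFO buffer for audit, hold, and evaluation-log writes during a DB outage."""

    def __init__(self, capacity: int = MAX_PENDING_WRITES) -> None:
        self._capacity = capacity
        self._writes: Deque[Dict] = deque()

    def defer(self, write: Dict) -> bool:
        """Queue a write. Returns False when the buffer is full (assumed
        policy: the caller fails the evaluation instead of dropping data)."""
        if len(self._writes) >= self._capacity:
            return False
        self._writes.append(write)
        return True

    def replay(self, apply: Callable[[Dict], None]) -> int:
        """Apply deferred writes oldest-first and return how many succeeded.
        Stops at the first failure so ordering is kept and nothing is lost."""
        applied = 0
        while self._writes:
            try:
                apply(self._writes[0])
            except Exception:
                break  # DB still unhealthy; keep the remaining writes queued
            self._writes.popleft()
            applied += 1
        return applied

    def __len__(self) -> int:
        return len(self._writes)
```

Peeking at the head before removing it means a write is only discarded after `apply` has succeeded, at the cost of a possible duplicate if the process dies between the two steps.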

FM-03 — Redis unavailable

Scenario: Redis becomes unreachable.

Detection: compliance_rule_cache_misses_total spike; connection errors in logs.

Impact:

  • Rule set cache misses force DB queries every evaluation; latency increases but stays within SLA.
  • Evaluation result dedup cache unavailable — duplicates re-evaluated.
  • RATE_VOLUME rules (sliding window counters) cannot evaluate — fall back to HOLD (fail-closed).

Recovery:

  • Evaluation continues; latency impact acceptable in async pipeline.
  • RATE_VOLUME rules emit compliance.evaluation.degraded finding (HOLD action) — message held for review rather than passed through.
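
A minimal sketch of the fail-closed path for RATE_VOLUME rules follows. The function `evaluate_rate_volume`, the `read_window_count` callable, the `Finding` type, and the `rate_volume.limit_exceeded` code are hypothetical; only the `compliance.evaluation.degraded` finding and its HOLD action come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    action: str  # e.g. "HOLD"
    code: str    # finding identifier


def evaluate_rate_volume(
    tenant_id: str,
    limit: int,
    read_window_count: Callable[[str], int],
) -> Optional[Finding]:
    """Return a Finding, or None when the tenant is within its limit.

    read_window_count is a hypothetical callable that reads the tenant's
    sliding-window counter from Redis and raises ConnectionError when
    Redis cannot be reached.
    """
    try:
        count = read_window_count(tenant_id)
    except ConnectionError:
        # Fail-closed: without the counter the rule cannot show the message
        # is within limits, so it is held rather than passed through.
        return Finding(action="HOLD", code="compliance.evaluation.degraded")
    if count > limit:
        # Hypothetical normal-path finding; the real action is rule-specific.
        return Finding(action="HOLD", code="rate_volume.limit_exceeded")
    return None
```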

FM-04 — Local LLM unavailable

Scenario: Local LLM service returns errors or times out.

Detection: compliance_ai_fallback_total counter > 0; LocalLLMUnavailable alert.

Impact: AI_CLASSIFICATION rules cannot evaluate. Keyword, regex, and other deterministic rules still function.

Recovery:

  • Each AI rule has fallbackAction. Production configuration recommendation:
    • High-sensitivity categories (TERRORISM, PHISHING, FINANCIAL_FRAUD) → fallbackAction: HOLD (fail-closed)
    • Enhancement categories (SPAM enhancement, HATE_SPEECH nuance) → fallbackAction: HOLD (still fail-closed; avoid SKIP in regulated contexts)
  • Circuit breaker on LLM client prevents cascade (5 failures / 30 s → open for 60 s).
  • Secondary LLM provider (external API) tried once before fallback applies.
  • Alert LocalLLMUnavailable fires for on-call; optional auto-failover to external LLM.
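
The circuit breaker thresholds above (5 failures within 30 s, then open for 60 s) can be illustrated with a small sliding-window breaker. This is a sketch, not the LLM client's actual breaker: the class and method names are hypothetical, and it omits a half-open probe state for brevity.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 5,
        window_s: float = 30.0,
        open_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True when a call to the LLM may be attempted."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._open_s:
            # Open period elapsed: close and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```

While `allow_request()` returns False, the caller would skip the LLM call and apply the rule's fallbackAction directly, which keeps a slow or failing LLM from consuming the evaluation budget.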

FM-05 — Hold queue overflow

Scenario: Mass-spam event or misconfigured rule causes thousands of holds in a short period.

Detection: compliance_hold_queue_pending_total > 500 (HIGH alert); > 2000 (CRITICAL alert).

Recovery:

  • Auto-expiry: PENDING holds past auto_expires_at move to AUTO_EXPIRED (every 5 min cron).
  • Bulk-reject: Platform admin invokes POST /compliance/hold-queue/bulk-review filtered by tenant or rule.
  • Tier escalation: Tenant generating >1,000 holds in 1 h is auto-escalated to RESTRICTED, halving their rate limits.
  • Rule adjustment: Compliance admin can temporarily disable an overly sensitive rule.

Auto-expired messages are terminal — the tenant is notified in the web portal that the message expired without review.
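
The auto-expiry sweep can be sketched as a function that the 5-minute cron would call. `Hold` and `expire_overdue_holds` are hypothetical names; a real job would run an equivalent UPDATE against the hold queue table rather than iterate in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class Hold:
    hold_id: str
    status: str                # "PENDING", "AUTO_EXPIRED", ...
    auto_expires_at: datetime


def expire_overdue_holds(holds: Iterable[Hold], now: datetime) -> List[Hold]:
    """Mark PENDING holds at or past auto_expires_at as AUTO_EXPIRED.

    Returns the holds that changed so the caller can notify their tenants.
    `now` and auto_expires_at must both be timezone-aware or both naive.
    """
    expired: List[Hold] = []
    for hold in holds:
        if hold.status == "PENDING" and hold.auto_expires_at <= now:
            hold.status = "AUTO_EXPIRED"  # terminal state
            expired.append(hold)
    return expired
```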


FM-06 — Rule evaluation budget exceeded

Scenario: Single evaluation exceeds the 450 ms internal budget.

Detection: compliance_evaluation_budget_exceeded_total counter.

Recovery:

  • Remaining slow-path rules are skipped.
  • Fail-closed applies for HOLD-eligible rules: if a skipped AI rule's fallbackAction is HOLD, the final verdict becomes HOLD.
  • Skipped FLAG rules generate a finding with evidence: "skipped_budget_exceeded".
  • Budget violations are investigated; they typically indicate LLM latency or rule complexity issues.
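
The budget handling above can be sketched as follows. The names `Rule`, `evaluate_with_budget`, and `stricter` are hypothetical, the rules are simplified to a single ordered list, and the verdict precedence (ALLOW < FLAG < HOLD < BLOCK) is an assumption that this document does not state.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

BUDGET_S = 0.450  # internal evaluation budget from the scenario above
# Assumed verdict precedence: the stricter verdict wins.
SEVERITY = {"ALLOW": 0, "FLAG": 1, "HOLD": 2, "BLOCK": 3}


@dataclass
class Rule:
    rule_id: str
    action: str                            # verdict applied when the rule matches
    matches: Callable[[str], bool]
    fallback_action: Optional[str] = None  # fallbackAction, set on AI rules


def stricter(a: str, b: str) -> str:
    return a if SEVERITY[a] >= SEVERITY[b] else b


def evaluate_with_budget(
    rules: List[Rule],
    text: str,
    clock: Callable[[], float] = time.monotonic,
    budget_s: float = BUDGET_S,
) -> Tuple[str, List[Dict[str, str]]]:
    started = clock()
    verdict = "ALLOW"
    findings: List[Dict[str, str]] = []
    for rule in rules:
        if clock() - started > budget_s:
            # Budget spent: skip the rule, but never fail open.
            skipped = {"rule_id": rule.rule_id, "evidence": "skipped_budget_exceeded"}
            if rule.fallback_action == "HOLD":
                verdict = stricter(verdict, "HOLD")
                findings.append(skipped)
            elif rule.action == "FLAG":
                findings.append(skipped)
            continue
        if rule.matches(text):
            verdict = stricter(verdict, rule.action)
            findings.append({"rule_id": rule.rule_id, "evidence": "matched"})
    return verdict, findings
```

Passing the clock in as a parameter lets a test simulate a slow rule without sleeping.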

FM-07 — Composite rule cycle

Scenario: A COMPOSITE rule references another rule that references back, directly or transitively.

Prevention (save time):

  • DFS cycle check on rule creation/update; 422 rejection if cycle found.

Runtime safeguard:

  • A visited: Set<ruleId> is maintained during traversal; encountering an already-visited ID aborts the branch with a HOLD finding (fail-closed).
  • A depth limit of 5 provides an additional backstop.
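
Both safeguards can be sketched over a plain mapping from rule ID to referenced rule IDs. `has_cycle` and `guarded_walk` are hypothetical names. The runtime guard below tracks only the current path, so a rule shared by two branches is not mistaken for a cycle; whether the real visited set is path-scoped or global is not stated here, and counting the root as depth 1 is likewise an assumption.

```python
from typing import Dict, List, Optional, Set

MAX_DEPTH = 5  # runtime depth limit stated above


def has_cycle(refs: Dict[str, List[str]]) -> bool:
    """Save-time check: DFS with three-colour marking over rule references."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[str, int] = {}

    def visit(rule_id: str) -> bool:
        colour[rule_id] = GREY  # on the current DFS path
        for child in refs.get(rule_id, []):
            state = colour.get(child, WHITE)
            if state == GREY:
                return True  # back edge: child is an ancestor
            if state == WHITE and visit(child):
                return True
        colour[rule_id] = BLACK  # fully explored, no cycle through it
        return False

    return any(colour.get(r, WHITE) == WHITE and visit(r) for r in refs)


def guarded_walk(
    rule_id: str,
    refs: Dict[str, List[str]],
    path: Optional[Set[str]] = None,
    depth: int = 1,
) -> str:
    """Runtime safeguard: "HOLD" on a revisit or beyond MAX_DEPTH, else "OK"."""
    path = set() if path is None else path
    if rule_id in path or depth > MAX_DEPTH:
        return "HOLD"  # fail-closed
    path.add(rule_id)
    try:
        for child in refs.get(rule_id, []):
            if guarded_walk(child, refs, path, depth + 1) == "HOLD":
                return "HOLD"
        return "OK"
    finally:
        path.discard(rule_id)
```

At save time, a True result from `has_cycle` would map to the 422 rejection; `guarded_walk` covers rule sets that bypassed that check.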

FM-08 — NATS DLR consumer lag

Scenario: compliance-engine-dlr consumer falls behind on sms.dlr.inbound.

Detection: nats_consumer_num_pending > 10000.

Impact: DLR stats stale; DLR_ABUSE rules evaluate against older numbers.

Recovery:

  • minSampleSize guard prevents false positives from partial data.
  • Consumer is horizontally scalable.
  • Lag resolves automatically as consumers catch up.
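
The minSampleSize guard can be illustrated as below. The shape of a DLR_ABUSE rule is not described in this document, so the failure-rate threshold, the function names, and the parameters are all assumptions made for the sketch.

```python
from typing import Optional


def dlr_failure_rate(delivered: int, failed: int, min_sample_size: int) -> Optional[float]:
    """Return the failure rate, or None when too few DLRs have been counted."""
    total = delivered + failed
    if total < min_sample_size:
        return None  # partial data: not enough evidence to judge
    return failed / total


def dlr_abuse_triggered(
    delivered: int, failed: int, min_sample_size: int, threshold: float
) -> bool:
    """Trigger only when the sample is large enough and the rate is too high."""
    rate = dlr_failure_rate(delivered, failed, min_sample_size)
    return rate is not None and rate > threshold
```

With a lagging consumer the counts are low rather than wrong, so a small sample yields no finding instead of a false positive.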

FM-09 — Scoring worker failure

Scenario: 15-min scoring cron job fails.

Detection: compliance_scoring_cycle_duration_seconds stops updating; compliance_tenant_score gauge becomes stale.

Impact: Tenant tiers are stale (up to 15 min lag). No message flow impact — Redis risk state cache serves prior values.

Recovery:

  • The cron job retries automatically on the next tick.
  • Three consecutive failures → HIGH alert, on-call investigation.

FM-10 — NATS event publish failure (async side-effects)

Scenario: NATS publish for compliance.message.held, compliance.audit.v1, etc. fails after the gRPC response has been sent.

Detection: compliance.nats.publish.error log events.

Impact: Downstream consumers (notification-service, analytics-service) may miss events. Tenant may not see immediate web portal notification, though the underlying message state in sms_messages is correct.

Recovery:

  • PostgreSQL state (audit_log, hold_queue, evaluation_log) is written before NATS publish — DB state is authoritative.
  • Publish is retried up to 3 times with 100 ms backoff before being logged as dropped.
  • A reconciliation job re-publishes missed events by scanning recent audit_log rows against NATS stream offsets.
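
The publish retry can be sketched as a small wrapper. `publish_with_retry` and the `publish` callable are hypothetical stand-ins for the NATS client call; reading "retried up to 3 times" as one initial attempt plus three retries, with a fixed 100 ms backoff, is an assumption.

```python
import time
from typing import Callable


def publish_with_retry(
    publish: Callable[[str, bytes], None],
    subject: str,
    payload: bytes,
    retries: int = 3,
    backoff_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Attempt once, then retry up to `retries` times with a fixed backoff.

    Returns False when every attempt failed, at which point the caller logs
    the event as dropped. DB state was written before this call, so a
    failure here loses only the notification, not the record.
    """
    for attempt in range(retries + 1):
        try:
            publish(subject, payload)
            return True
        except Exception:
            if attempt < retries:
                sleep(backoff_s)
    return False
```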

4. Graceful Degradation Summary

Full operation:
Rule load (Redis cache) → Rule eval → AI (local LLM, cached) → Verdict → Routing

Redis unavailable:
Rule load (DB direct) → Rule eval → AI (uncached, slower) → Verdict → Routing
[RATE_VOLUME rules degrade to HOLD — fail-closed]

DB unavailable (Redis cache warm):
Rule load (Redis stale) → Rule eval → AI → Verdict → Routing
[Hold queue writes deferred; NATS consumer retries via redelivery]

Local LLM unavailable:
Rule load → Rule eval → AI rules use fallbackAction (HOLD) → Verdict
[Messages held for manual review rather than passed through]

compliance-engine unavailable:
[sms-orchestrator NATS consumer does not ACK]
→ NATS redelivers (3 attempts, 30 s ack wait)
→ Eventually: sms.outbound.deadletter
→ Tenant sees DEAD_LETTER status in web portal

5. Failure Mode ↔ Tenant Experience Matrix

Because the API is asynchronous, failures present differently to the tenant:

| Failure | Tenant's web portal view |
|---|---|
| compliance-engine brief outage (< 2 min) | Message remains QUEUED briefly, then proceeds |
| compliance-engine extended outage | Message → DEAD_LETTER with reason "Compliance system temporarily unavailable. Retry available." |
| BLOCK verdict | Message → BLOCKED with reason and rule citation; appeal link (if enabled) |
| HOLD verdict | Message → ON_HOLD with reason; "Under review — typically resolved within 4 hours" |
| Hold auto-expired | Message → AUTO_EXPIRED; "Review window elapsed. Please resubmit if still relevant." |
| Hold released by admin | Message → ROUTING → eventually DELIVERED; notification of release |
| Hold rejected by admin | Message → BLOCKED with review notes |