Compliance Layer (compliance-engine) — Service Overview

Status: populated
Owner: Platform Engineering / Trust & Safety
Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION

1. Purpose — An Architectural Layer, Not Just a Service

The Compliance Layer is a first-class tier in the SMS gateway architecture — conceptually equivalent to the "ingestion layer" (Kong + sms-orchestrator HTTP), the "routing layer" (routing-engine), and the "transport layer" (smpp-connector).

The layer is implemented by the compliance-engine microservice plus integration points in:

sms-orchestrator (NATS consumer — invokes the layer for every queued message)
admin-dashboard (rule / hold queue / tenant score management UI)
notification-service (delivers hold/block notifications to tenants via the web portal)

Every outbound SMS must pass through the Compliance Layer before routing or transmission. Messages that violate rules are either blocked or held for manual review: a blocked message never reaches a carrier, and a held message reaches one only if an admin releases it. Tenants are notified of holds and blocks asynchronously through the web portal, because the ingestion API has already returned 202 by the time evaluation runs.

The Compliance Layer provides five distinct capabilities:

Capability	Description
Async compliance evaluation	Every queued message is evaluated against the tenant's rule sets before routing
Rule authoring & management	Platform admins define, version, deploy, and retire custom compliance rules
Hold queue & manual review	Held messages are queued for admin review; released or permanently rejected
Tenant scoring & risk tiering	Continuous scoring of every tenant; risk tiers drive automated enforcement
Audit, reporting & evidence	Immutable audit log feeds compliance reports for internal governance and regulators

2. Position in the Message Pipeline (Async Flow)

                         Tenant / Client App
                                │
                                ▼  HTTP POST /v1/sms/send
                         ┌───────────────┐
                         │ Kong Gateway  │  (JWT, API-key, rate limits)
                         └───────┬───────┘
                                 ▼
                      ┌─────────────────────┐
                      │  sms-orchestrator   │
                      │   HTTP Handler      │
                      │                     │
                      │  [1] Zod validate   │
                      │  [2] E.164 + seg    │
                      │  [3] Idempotency    │
                      │  [4] INSERT QUEUED  │
                      │  [5] Publish NATS   │
                      │  [6] Return 202 ◀───┼─── Tenant receives {messageId, status:"QUEUED"}
                      └──────────┬──────────┘             (does NOT wait for compliance)
                                 │
                                 ▼  NATS: sms.outbound.request
                      ┌─────────────────────┐
                      │  sms-orchestrator   │
                      │   NATS Consumer     │
                      │                     │
                      │  Update → EVALUATING│
                      │                     │
                      │   ┌────────────────┐│
                      │   │  COMPLIANCE    │◄──── gRPC: EvaluateCompliance
                      │   │    LAYER       │      (compliance-engine)
                      │   └────────────────┘
                      │          │
                      │   ┌──────┴──────┐
                      │   ▼             ▼
                      │  ALLOW/FLAG   BLOCK/HOLD
                      │   │             │
                      │   ▼             ▼
                      │  routing      Update →
                      │  engine       BLOCKED / ON_HOLD
                      │                 │
                      └─────────────────┤
                                        ▼
                              ┌─────────────────┐
                              │ notification-   │ ──► Web Portal
                              │   service       │     (tenant sees
                              └─────────────────┘      hold/block)

Why async?

The HTTP handler returns 202 Accepted within ~50 ms of receiving a request. Tenants never block on compliance evaluation.
Compliance evaluation runs in the NATS consumer pipeline, so its latency adds to platform processing time rather than to the latency a tenant perceives on the API call.
This enables more thorough evaluation (local LLM classification, cross-rule composite checks, DB lookups) without degrading the API experience.
Fail-closed is operationally viable: if compliance is unavailable, the message waits in the queue and retries — no unverified message reaches a carrier, ever.
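
The fail-closed behaviour described above can be sketched as a pure decision function in the NATS consumer. This is a minimal illustration, not the service's actual code: the names `decideNextStep`, `NextStep`, and `Outcome`, and the default of five attempts, are assumptions. The `failed_compliance_unavailable` reason is the one recorded under Key Design Decisions.

```typescript
// Verdicts returned by EvaluateCompliance, as drawn in the pipeline diagram.
type Verdict = "ALLOW" | "FLAG" | "BLOCK" | "HOLD";

// Hypothetical: "UNAVAILABLE" stands for any failed or timed-out gRPC call.
type Outcome = Verdict | "UNAVAILABLE";

// Hypothetical description of what the consumer does next.
type NextStep =
  | { kind: "route" } // hand off to routing-engine
  | { kind: "stop"; status: "BLOCKED" | "ON_HOLD" }
  | { kind: "retry"; nextAttempt: number } // message stays in the queue
  | { kind: "deadLetter"; reason: "failed_compliance_unavailable" };

// attempt is 1-based; maxAttempts = 5 is an assumed value, not from the spec.
function decideNextStep(
  outcome: Outcome,
  attempt: number,
  maxAttempts: number = 5,
): NextStep {
  switch (outcome) {
    case "ALLOW":
    case "FLAG":
      // Only an explicit ALLOW or FLAG verdict lets a message proceed.
      return { kind: "route" };
    case "BLOCK":
      return { kind: "stop", status: "BLOCKED" };
    case "HOLD":
      return { kind: "stop", status: "ON_HOLD" };
    case "UNAVAILABLE":
      // Fail closed: never route without a verdict. Retry, then dead-letter.
      return attempt >= maxAttempts
        ? { kind: "deadLetter", reason: "failed_compliance_unavailable" }
        : { kind: "retry", nextAttempt: attempt + 1 };
  }
}
```

No branch maps an unavailable compliance engine to `route`, which is the property that fail-closed requires.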

Message state transitions driven by the Compliance Layer

QUEUED → EVALUATING → ALLOWED → ROUTING → ROUTED → SENT → DELIVERED
                   │
                   ├── BLOCKED (terminal; tenant notified)
                   │
                   ├── ON_HOLD ──► REVIEWED_RELEASED → ROUTING → ...
                   │                  │
                   │                  ▼
                   │           REVIEWED_REJECTED (terminal; tenant notified)
                   │
                   └── AUTO_EXPIRED (terminal; tenant notified)
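
The transitions above could be encoded as a lookup table so that illegal state changes are rejected in one place. This is a sketch under stated assumptions: `TRANSITIONS`, `canTransition`, and `isTerminal` are hypothetical names, and the table assumes that REVIEWED_REJECTED and AUTO_EXPIRED are both reached from ON_HOLD, since expiry is described for the hold queue.

```typescript
type MessageState =
  | "QUEUED" | "EVALUATING" | "ALLOWED" | "BLOCKED" | "ON_HOLD"
  | "REVIEWED_RELEASED" | "REVIEWED_REJECTED" | "AUTO_EXPIRED"
  | "ROUTING" | "ROUTED" | "SENT" | "DELIVERED";

// Allowed next states for each state; an empty list marks a terminal state.
const TRANSITIONS: Record<MessageState, readonly MessageState[]> = {
  QUEUED: ["EVALUATING"],
  EVALUATING: ["ALLOWED", "BLOCKED", "ON_HOLD"],
  ALLOWED: ["ROUTING"],
  // Assumption: rejection and expiry both originate from the hold queue.
  ON_HOLD: ["REVIEWED_RELEASED", "REVIEWED_REJECTED", "AUTO_EXPIRED"],
  REVIEWED_RELEASED: ["ROUTING"],
  ROUTING: ["ROUTED"],
  ROUTED: ["SENT"],
  SENT: ["DELIVERED"],
  BLOCKED: [],
  REVIEWED_REJECTED: [],
  AUTO_EXPIRED: [],
  DELIVERED: [],
};

// True when the state machine permits moving from `from` to `to`.
function canTransition(from: MessageState, to: MessageState): boolean {
  return TRANSITIONS[from].includes(to);
}

// True when no further transition is possible.
function isTerminal(state: MessageState): boolean {
  return TRANSITIONS[state].length === 0;
}
```

For example, `canTransition("ON_HOLD", "REVIEWED_RELEASED")` is true, while `canTransition("BLOCKED", "ROUTING")` is false because BLOCKED is terminal.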

3. Bounded Context

Dimension	Value
Domain	Trust & Safety / Regulatory Compliance
Owner squad	Platform Engineering / Trust & Safety
Deployment unit	Kubernetes `Deployment` — `compliance-engine`
Communication style	Inbound: gRPC (from sms-orchestrator NATS consumer) · HTTP REST (admin CRUD) · NATS (DLR consumer). Outbound: HTTPS / gRPC (local LLM) · NATS (compliance lifecycle events)
Storage	PostgreSQL schema `compliance` · Redis cache
Failure mode	Fail-closed (always) — no message may be dispatched without an explicit ALLOW/FLAG verdict

4. Responsibilities

#	Responsibility
R1	Accept `EvaluateCompliance` gRPC calls from `sms-orchestrator`'s NATS consumer and return a verdict with P95 latency ≤ 500 ms
R2	Evaluate each message against the tenant's assigned rule sets and platform-level rules
R3	Support 10 rule types: KEYWORD, REGEX, SENDER_ID, RECIPIENT, RATE_VOLUME, GEO_RESTRICTION, TEMPORAL, DLR_ABUSE, AI_CLASSIFICATION, COMPOSITE
R4	Operate an allowlist-first evaluation model so trusted senders bypass restriction rules
R5	Place HOLD-verdict messages into the hold queue with full payload preservation
R6	Expose a REST API for platform admins to manage rules, rule sets, blocklists, and the hold queue
R7	Maintain a continuous compliance score (0–100) and risk tier for every tenant
R8	Produce an immutable audit log for every evaluation, rule change, and hold-queue decision
R9	Consume `sms.dlr.inbound` NATS events to maintain per-tenant DLR statistics for abuse rules
R10	Publish compliance lifecycle events to NATS; `notification-service` consumes to alert tenants via the web portal
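
To illustrate R4, the following sketch shows one way an allowlist-first evaluation order could work: ALLOW rules are checked first and short-circuit, and only then are restriction rules considered. The `Rule` and `Message` shapes, the function name, the "most severe match wins" ordering, and the default ALLOW when nothing matches are all assumptions for illustration; the actual rule model is defined in DOMAIN_MODEL.

```typescript
type Action = "ALLOW" | "FLAG" | "HOLD" | "BLOCK";

// Hypothetical, simplified shapes.
interface Message { senderId: string; body: string; }
interface Rule { id: string; action: Action; matches: (msg: Message) => boolean; }
interface Evaluation { verdict: Action; ruleId: string | null; }

// Assumed ordering: when several restriction rules match, the most severe wins.
const SEVERITY: Record<Action, number> = { ALLOW: 0, FLAG: 1, HOLD: 2, BLOCK: 3 };

function evaluateAllowlistFirst(msg: Message, rules: readonly Rule[]): Evaluation {
  // Step 1: an explicit ALLOW rule bypasses every restriction rule.
  const allow = rules.find((r) => r.action === "ALLOW" && r.matches(msg));
  if (allow) return { verdict: "ALLOW", ruleId: allow.id };

  // Step 2: otherwise pick the most severe matching restriction rule.
  let worst: Rule | null = null;
  for (const rule of rules) {
    if (rule.action === "ALLOW" || !rule.matches(msg)) continue;
    if (worst === null || SEVERITY[rule.action] > SEVERITY[worst.action]) {
      worst = rule;
    }
  }

  // Step 3 (assumption): no matching rule yields an explicit ALLOW verdict.
  return worst
    ? { verdict: worst.action, ruleId: worst.id }
    : { verdict: "ALLOW", ruleId: null };
}

// Usage with two made-up rules.
const exampleRules: Rule[] = [
  { id: "allow-otp", action: "ALLOW", matches: (m) => m.senderId === "BANK-OTP" },
  { id: "block-casino", action: "BLOCK", matches: (m) => m.body.includes("casino") },
];
const trusted = evaluateAllowlistFirst({ senderId: "BANK-OTP", body: "casino code 1234" }, exampleRules);
const untrusted = evaluateAllowlistFirst({ senderId: "PROMO", body: "casino bonus" }, exampleRules);
```

Here `trusted` resolves to ALLOW through `allow-otp` even though the body would match the keyword rule, while `untrusted` resolves to BLOCK through `block-casino`.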

5. Non-Responsibilities

Does not return verdicts to tenants via the ingestion API — tenants see state via the web portal, fed by notification-service consuming compliance events
Does not transmit SMS — handled by smpp-connector
Does not enforce billing or segment quotas — handled by billing-service
Does not manage API keys or JWT authentication — handled by auth-service
Does not own the carrier routing decision — handled by routing-engine

6. Upstream / Downstream Dependencies

Direction	Service	Protocol	Purpose
Inbound caller	`sms-orchestrator` NATS consumer	gRPC (mTLS)	Per-message compliance evaluation
Inbound admin	`admin-dashboard`	HTTP REST (mTLS)	Rule / hold queue / tenant management
Inbound event	`dlr-processor`	NATS JetStream `sms.dlr.inbound`	DLR statistics for DLR_ABUSE rules
Outbound read/write	PostgreSQL `compliance` schema	TCP (pg driver)	Rules, hold queue, scores, audit log
Outbound cache	Redis	TCP	Rule set cache, evaluation result cache, score cache
Outbound (optional)	Local LLM service (primary) / LLM API (fallback)	HTTPS / gRPC	AI_CLASSIFICATION rule evaluation
Outbound events	NATS JetStream	TCP	Compliance lifecycle events → notification-service, analytics-service

7. High-Level Flow

8. Key Design Decisions

Decision	Rationale
Compliance is an architectural layer, not a bolt-on	Every message, from every tenant, for every rule set, passes through this layer — it is tier-defining, not feature-level
Asynchronous evaluation in NATS consumer pipeline	Tenants receive 202 immediately; compliance runs in platform processing. Enables richer evaluation without impacting API latency
Fail-closed is the only mode	Non-compliance must never result in a dispatched message. On compliance unavailability, messages retry in-queue; after exhausted retries they move to DLQ with `failed_compliance_unavailable` reason
Tenants notified via web portal, not API response	Holds/blocks surface in the tenant dashboard with full context (which rule, why, appeal path) — superior UX to a sync HTTP error
Allowlist-first evaluation	Trusted senders (OTP, alerts, verified templates) bypass restriction rules via explicit ALLOW rules
Local LLM as primary AI provider	Data residency, cost, no DPA overhead with third parties, acceptable latency in async flow
CEL-inspired condition expressions	Human-readable, auditable, sandboxed — no arbitrary code execution
AI classification result cached 24 h by body hash	Identical SMS bodies appear millions of times; caching eliminates redundant inferences
Evaluation log is append-only, partitioned monthly	Compliance evidence must be tamper-evident and query-efficient for auditors
Hold queue with 24 h auto-expiry	Ensures held messages do not accumulate indefinitely
Tenant score recomputed every 15 minutes	Balances freshness with compute cost; score is not per-message
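
The 24 h classification cache keyed by body hash could look roughly like the sketch below. An in-process `Map` stands in for Redis so the example is self-contained. The class name, the `ai:cls:` key prefix, the `Classification` shape, and the choice of SHA-256 are assumptions rather than details taken from the service.

```typescript
import { createHash } from "node:crypto";

// Hypothetical result shape; the real one is described in AI_INTEGRATION.
interface Classification { label: string; confidence: number; }

const TTL_MS = 24 * 60 * 60 * 1000; // 24 h, per the design decision above

class ClassificationCache {
  private readonly entries = new Map<string, { value: Classification; expiresAt: number }>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Identical bodies always map to the same key, whichever tenant sent them.
  static keyFor(body: string): string {
    return "ai:cls:" + createHash("sha256").update(body, "utf8").digest("hex");
  }

  get(body: string): Classification | undefined {
    const key = ClassificationCache.keyFor(body);
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: force a fresh inference
      return undefined;
    }
    return hit.value;
  }

  set(body: string, value: Classification): void {
    const key = ClassificationCache.keyFor(body);
    this.entries.set(key, { value, expiresAt: this.now() + TTL_MS });
  }
}
```

With Redis, the same idea would typically be expressed as a `SET` with an expiry on the hashed key, so that expiration is handled by the store instead of by the reader.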
