Compliance Layer (compliance-engine) — Service Overview
Status: populated
Owner: Platform Engineering / Trust & Safety
Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION
1. Purpose — An Architectural Layer, Not Just a Service
The Compliance Layer is a first-class tier in the SMS gateway architecture — conceptually equivalent to the "ingestion layer" (Kong + sms-orchestrator HTTP), the "routing layer" (routing-engine), and the "transport layer" (smpp-connector).
The layer is implemented by the compliance-engine microservice plus integration points in:
- sms-orchestrator (NATS consumer; invokes the layer for every queued message)
- admin-dashboard (rule / hold queue / tenant score management UI)
- notification-service (delivers hold/block notifications to tenants via the web portal)
Every outbound SMS must pass through the Compliance Layer before routing or transmission. Messages that violate rules are either blocked or held for manual review: a blocked message never reaches a carrier, and a held message reaches one only if a reviewer releases it. Tenants are notified of holds and blocks asynchronously through the web portal; the API response does not wait for the compliance verdict.
The Compliance Layer provides five distinct capabilities:
| Capability | Description |
|---|---|
| Async compliance evaluation | Every queued message is evaluated against the tenant's rule sets before routing |
| Rule authoring & management | Platform admins define, version, deploy, and retire custom compliance rules |
| Hold queue & manual review | Held messages are queued for admin review; released or permanently rejected |
| Tenant scoring & risk tiering | Continuous scoring of every tenant; risk tiers drive automated enforcement |
| Audit, reporting & evidence | Immutable audit log feeds compliance reports for internal governance and regulators |
2. Position in the Message Pipeline (Async Flow)
Tenant / Client App
        │
        ▼  HTTP POST /v1/sms/send
┌───────────────┐
│ Kong Gateway  │  (JWT, API-key, rate limits)
└───────┬───────┘
        ▼
┌─────────────────────┐
│ sms-orchestrator    │
│ HTTP Handler        │
│                     │
│ [1] Zod validate    │
│ [2] E.164 + seg     │
│ [3] Idempotency     │
│ [4] INSERT QUEUED   │
│ [5] Publish NATS    │
│ [6] Return 202 ◀────┼─── Tenant receives {messageId, status:"QUEUED"}
└──────────┬──────────┘    (does NOT wait for compliance)
           │
           ▼  NATS: sms.outbound.request
┌─────────────────────┐
│ sms-orchestrator    │
│ NATS Consumer       │
│                     │
│ Update → EVALUATING │
│                     │
│  ┌────────────────┐ │
│  │ COMPLIANCE     │◄──── gRPC: EvaluateCompliance
│  │ LAYER          │      (compliance-engine)
│  └───────┬────────┘
│          │
│   ┌──────┴──────┐
│   ▼             ▼
│ ALLOW/FLAG  BLOCK/HOLD
│   │             │
│   ▼             ▼
│ routing     Update →
│ engine      BLOCKED / ON_HOLD
│                 │
└─────────────────┤
                  ▼
         ┌─────────────────┐
         │ notification-   │ ──► Web Portal
         │ service         │     (tenant sees
         └─────────────────┘      hold/block)
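The handler steps [1] to [6] in the diagram can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `IngestionHandler`, `SendRequest`, the in-memory idempotency map, and the `published` array are hypothetical stand-ins for the database and the NATS publisher, validation is hand-rolled rather than Zod-based, and segment counting is omitted.

```typescript
// Hypothetical request/response shapes; the real contract lives in API_CONTRACTS.
type SendRequest = { tenantId: string; to: string; body: string; idempotencyKey: string };
type Accepted = { httpStatus: 202; messageId: string; status: "QUEUED" };
type Rejected = { httpStatus: 400; error: string };

// E.164: "+" followed by at most 15 digits, with no leading zero.
const E164 = /^\+[1-9]\d{1,14}$/;

class IngestionHandler {
  // In-memory stand-ins (assumptions) for the idempotency store and the NATS publisher.
  private readonly byIdempotencyKey = new Map<string, string>();
  private nextId = 1;
  readonly published: Array<{ subject: string; messageId: string }> = [];

  handle(req: SendRequest): Accepted | Rejected {
    // [1] + [2] Validate the payload and the destination number.
    if (req.body.length === 0) return { httpStatus: 400, error: "empty body" };
    if (!E164.test(req.to)) return { httpStatus: 400, error: "recipient is not E.164" };

    // [3] Idempotency: a repeated key returns the original messageId without re-queuing.
    const dedupeKey = req.tenantId + ":" + req.idempotencyKey;
    const existing = this.byIdempotencyKey.get(dedupeKey);
    if (existing !== undefined) return { httpStatus: 202, messageId: existing, status: "QUEUED" };

    // [4] Record the message as QUEUED, then [5] publish it for the NATS consumer.
    const messageId = "msg-" + this.nextId++;
    this.byIdempotencyKey.set(dedupeKey, messageId);
    this.published.push({ subject: "sms.outbound.request", messageId });

    // [6] Return 202 immediately; compliance has not run yet.
    return { httpStatus: 202, messageId, status: "QUEUED" };
  }
}
```

The point of the sketch is the ordering: nothing in `handle` calls the compliance engine, so the response time is independent of evaluation cost.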
Why async?
- The HTTP handler returns 202 Accepted within ~50 ms of receiving a request. Tenants never block on compliance evaluation.
- Compliance evaluation runs in the NATS consumer pipeline, so its latency counts as platform processing time rather than latency the tenant perceives.
- This enables more thorough evaluation (local LLM classification, cross-rule composite checks, DB lookups) without degrading the API experience.
- Fail-closed is operationally viable: if compliance is unavailable, the message waits in the queue and retries — no unverified message reaches a carrier, ever.
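The fail-closed behaviour described above might look like the sketch below. It is an illustration under assumptions: `evaluate` stands in for the `EvaluateCompliance` gRPC call and is assumed to throw when the engine is unreachable, the retry loop models NATS redelivery synchronously, and `maxAttempts` is a hypothetical setting. The dead-letter reason string is the one given under Key Design Decisions.

```typescript
type Verdict = "ALLOW" | "FLAG" | "BLOCK" | "HOLD";

type Outcome =
  | { kind: "verdict"; verdict: Verdict; attempts: number }
  | { kind: "dead_lettered"; reason: "failed_compliance_unavailable"; attempts: number };

// Retries the evaluation; an unavailable engine never turns into an implicit ALLOW.
function evaluateFailClosed(evaluate: () => Verdict, maxAttempts: number): Outcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { kind: "verdict", verdict: evaluate(), attempts: attempt };
    } catch {
      // Engine unavailable: leave the message undelivered and try again.
    }
  }
  // Retries exhausted: the message goes to the DLQ instead of to a carrier.
  return { kind: "dead_lettered", reason: "failed_compliance_unavailable", attempts: maxAttempts };
}

// Only an explicit ALLOW or FLAG verdict lets a message proceed to routing.
function mayDispatch(outcome: Outcome): boolean {
  return outcome.kind === "verdict" && (outcome.verdict === "ALLOW" || outcome.verdict === "FLAG");
}
```

Because `mayDispatch` requires a positive verdict, every error path (timeouts, exhausted retries, unexpected outcomes) defaults to not sending.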
Message state transitions driven by the Compliance Layer
QUEUED → EVALUATING → ALLOWED → ROUTING → ROUTED → SENT → DELIVERED
             │
             ├── BLOCKED (terminal; tenant notified)
             │
             ├── ON_HOLD ──► REVIEWED_RELEASED → ROUTING → ...
             │      │
             │      ▼
             │   REVIEWED_REJECTED (terminal; tenant notified)
             │
             └── AUTO_EXPIRED (terminal; tenant notified)
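One way to enforce these transitions in code is a lookup table of allowed next states. The sketch below is illustrative only. It assumes that AUTO_EXPIRED is reached from ON_HOLD (consistent with the 24 h hold-queue expiry listed under Key Design Decisions) and that DELIVERED is terminal; states elided by the "..." in the diagram, such as delivery failures, are not modelled. The function names are hypothetical.

```typescript
type MessageState =
  | "QUEUED" | "EVALUATING" | "ALLOWED" | "ROUTING" | "ROUTED" | "SENT" | "DELIVERED"
  | "BLOCKED" | "ON_HOLD" | "REVIEWED_RELEASED" | "REVIEWED_REJECTED" | "AUTO_EXPIRED";

// Allowed next states per state; an empty list marks a terminal state.
const TRANSITIONS: Record<MessageState, MessageState[]> = {
  QUEUED: ["EVALUATING"],
  EVALUATING: ["ALLOWED", "BLOCKED", "ON_HOLD"],
  ALLOWED: ["ROUTING"],
  ROUTING: ["ROUTED"],
  ROUTED: ["SENT"],
  SENT: ["DELIVERED"],
  DELIVERED: [],
  BLOCKED: [],
  // Assumption: expiry applies to held messages only.
  ON_HOLD: ["REVIEWED_RELEASED", "REVIEWED_REJECTED", "AUTO_EXPIRED"],
  REVIEWED_RELEASED: ["ROUTING"],
  REVIEWED_REJECTED: [],
  AUTO_EXPIRED: [],
};

function canTransition(from: MessageState, to: MessageState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

function isTerminal(state: MessageState): boolean {
  return TRANSITIONS[state].length === 0;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: MessageState, to: MessageState): MessageState {
  if (!canTransition(from, to)) {
    throw new Error("illegal transition: " + from + " -> " + to);
  }
  return to;
}
```

With this table there is no path from BLOCKED, REVIEWED_REJECTED, or AUTO_EXPIRED to ROUTING, which mirrors the rule that such messages never reach a carrier.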
3. Bounded Context
| Dimension | Value |
|---|---|
| Domain | Trust & Safety / Regulatory Compliance |
| Owner squad | Platform Engineering / Trust & Safety |
| Deployment unit | Kubernetes Deployment — compliance-engine |
| Communication style | Inbound: gRPC (from sms-orchestrator NATS consumer) · HTTP REST (admin CRUD) · NATS (DLR consumer). Outbound: HTTPS / gRPC (LLM classification) · NATS (lifecycle events) |
| Storage | PostgreSQL schema compliance · Redis cache |
| Failure mode | Fail-closed (always) — no message may be dispatched without an explicit ALLOW/FLAG verdict |
4. Responsibilities
| # | Responsibility |
|---|---|
| R1 | Accept EvaluateCompliance gRPC calls from sms-orchestrator's NATS consumer and return a verdict with P95 latency ≤ 500 ms |
| R2 | Evaluate each message against the tenant's assigned rule sets and platform-level rules |
| R3 | Support 10 rule types: KEYWORD, REGEX, SENDER_ID, RECIPIENT, RATE_VOLUME, GEO_RESTRICTION, TEMPORAL, DLR_ABUSE, AI_CLASSIFICATION, COMPOSITE |
| R4 | Operate an allowlist-first evaluation model so trusted senders bypass restriction rules |
| R5 | Place HOLD-verdict messages into the hold queue with full payload preservation |
| R6 | Expose a REST API for platform admins to manage rules, rule sets, blocklists, and the hold queue |
| R7 | Maintain a continuous compliance score (0–100) and risk tier for every tenant |
| R8 | Produce an immutable audit log for every evaluation, rule change, and hold-queue decision |
| R9 | Consume sms.dlr.inbound NATS events to maintain per-tenant DLR statistics for abuse rules |
| R10 | Publish compliance lifecycle events to NATS; notification-service consumes to alert tenants via the web portal |
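Responsibilities R2 and R4 can be illustrated with a two-pass evaluator. This is a sketch under stated assumptions, not the engine's algorithm: the `Rule` shape, the rule ids, and the sender id `OTP-BANK` are hypothetical; the "most severe match wins" precedence (BLOCK over HOLD over FLAG) and the default of ALLOW when nothing matches are assumptions that the source does not confirm.

```typescript
type RuleVerdict = "ALLOW" | "FLAG" | "HOLD" | "BLOCK";

interface Message { senderId: string; body: string }
interface Rule { id: string; action: RuleVerdict; matches: (m: Message) => boolean }

// Assumed precedence among restriction rules: higher number wins.
const SEVERITY: Record<RuleVerdict, number> = { ALLOW: 0, FLAG: 1, HOLD: 2, BLOCK: 3 };

function evaluateRules(
  message: Message,
  rules: Rule[],
): { verdict: RuleVerdict; matchedRuleIds: string[] } {
  // Pass 1 (allowlist-first): an explicit ALLOW rule bypasses all restriction rules.
  for (const rule of rules) {
    if (rule.action === "ALLOW" && rule.matches(message)) {
      return { verdict: "ALLOW", matchedRuleIds: [rule.id] };
    }
  }
  // Pass 2: collect every matching restriction rule and keep the most severe action.
  let verdict: RuleVerdict = "ALLOW";
  const matchedRuleIds: string[] = [];
  for (const rule of rules) {
    if (rule.action !== "ALLOW" && rule.matches(message)) {
      matchedRuleIds.push(rule.id);
      if (SEVERITY[rule.action] > SEVERITY[verdict]) verdict = rule.action;
    }
  }
  return { verdict, matchedRuleIds };
}

// Hypothetical rules: one SENDER_ID allow rule and two KEYWORD-style restriction rules.
const exampleRules: Rule[] = [
  { id: "allow-otp-sender", action: "ALLOW", matches: (m) => m.senderId === "OTP-BANK" },
  { id: "kw-casino", action: "BLOCK", matches: (m) => m.body.toLowerCase().indexOf("casino") !== -1 },
  { id: "kw-free", action: "FLAG", matches: (m) => /\bfree\b/i.test(m.body) },
];
```

Recording every matched rule id, not only the winning one, keeps the evidence needed for the audit log described in R8.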
5. Non-Responsibilities
- Does not return verdicts to tenants via the ingestion API; tenants see state via the web portal, fed by notification-service consuming compliance events
- Does not transmit SMS; handled by smpp-connector
- Does not enforce billing or segment quotas; handled by billing-service
- Does not manage API keys or JWT authentication; handled by auth-service
- Does not own the carrier routing decision; handled by routing-engine
6. Upstream / Downstream Dependencies
| Direction | Service | Protocol | Purpose |
|---|---|---|---|
| Inbound caller | sms-orchestrator NATS consumer | gRPC (mTLS) | Per-message compliance evaluation |
| Inbound admin | admin-dashboard | HTTP REST (mTLS) | Rule / hold queue / tenant management |
| Inbound event | dlr-processor | NATS JetStream sms.dlr.inbound | DLR statistics for DLR_ABUSE rules |
| Outbound read/write | PostgreSQL compliance schema | TCP (pg driver) | Rules, hold queue, scores, audit log |
| Outbound cache | Redis | TCP | Rule set cache, evaluation result cache, score cache |
| Outbound (optional) | Local LLM service (primary) / LLM API (fallback) | HTTPS / gRPC | AI_CLASSIFICATION rule evaluation |
| Outbound events | NATS JetStream | TCP | Compliance lifecycle events → notification-service, analytics-service |
7. High-Level Flow
8. Key Design Decisions
| Decision | Rationale |
|---|---|
| Compliance is an architectural layer, not a bolt-on | Every message, from every tenant, for every rule set, passes through this layer — it is tier-defining, not feature-level |
| Asynchronous evaluation in NATS consumer pipeline | Tenants receive 202 immediately; compliance runs in platform processing. Enables richer evaluation without impacting API latency |
| Fail-closed is the only mode | Non-compliance must never result in a dispatched message. On compliance unavailability, messages retry in-queue; after exhausted retries they move to DLQ with failed_compliance_unavailable reason |
| Tenants notified via web portal, not API response | Holds/blocks surface in the tenant dashboard with full context (which rule, why, appeal path), which serves tenants better than a synchronous HTTP error |
| Allowlist-first evaluation | Trusted senders (OTP, alerts, verified templates) bypass restriction rules via explicit ALLOW rules |
| Local LLM as primary AI provider | Data residency, cost, no DPA overhead with third parties, acceptable latency in async flow |
| CEL-inspired condition expressions | Human-readable, auditable, sandboxed — no arbitrary code execution |
| AI classification result cached 24 h by body hash | Identical SMS bodies appear millions of times; caching eliminates redundant inferences |
| Evaluation log is append-only, partitioned monthly | Compliance evidence must be tamper-evident and query-efficient for auditors |
| Hold queue with 24 h auto-expiry | Ensures held messages do not accumulate indefinitely |
| Tenant score recomputed every 15 minutes | Balances freshness with compute cost; score is not per-message |
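The 24 h classification cache keyed by body hash could be sketched as below. This is an assumption-laden illustration: `ClassificationCache`, the `Classification` shape, and the injected `classify` and `now` functions are hypothetical, the in-memory `Map` stands in for Redis, and the FNV-1a-style hash is used only to keep the example dependency-free. A production cache would more plausibly use a cryptographic hash such as SHA-256, since a hash collision here would reuse the wrong classification.

```typescript
type Classification = { label: string; confidence: number };

const TTL_MS = 24 * 60 * 60 * 1000; // 24 h, matching the design decision above

// FNV-1a-style 32-bit hash over UTF-16 code units (placeholder, not collision-safe).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

class ClassificationCache {
  private readonly entries = new Map<string, { value: Classification; expiresAt: number }>();
  private readonly classify: (body: string) => Classification;
  private readonly now: () => number;
  inferenceCount = 0; // how many times the (expensive) classifier actually ran

  constructor(classify: (body: string) => Classification, now: () => number) {
    this.classify = classify;
    this.now = now;
  }

  get(body: string): Classification {
    const key = fnv1a(body);
    const hit = this.entries.get(key);
    // Serve from cache while the entry is younger than the TTL.
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value;

    // Miss or expired: run the classifier once and store the result.
    this.inferenceCount++;
    const value = this.classify(body);
    this.entries.set(key, { value, expiresAt: this.now() + TTL_MS });
    return value;
  }
}
```

Injecting the clock through `now` keeps expiry testable without waiting; identical bodies within the TTL cost a single inference regardless of how many messages carry them.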