Compliance Layer (compliance-engine) — Service Overview

Status: populated
Owner: Platform Engineering / Trust & Safety
Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION

1. Purpose — An Architectural Layer, Not Just a Service

The Compliance Layer is a first-class tier in the SMS gateway architecture — conceptually equivalent to the "ingestion layer" (Kong + sms-orchestrator HTTP), the "routing layer" (routing-engine), and the "transport layer" (smpp-connector).

The layer is implemented by the compliance-engine microservice plus integration points in:

  • sms-orchestrator (NATS consumer — invokes the layer for every queued message)
  • admin-dashboard (rule / hold queue / tenant score management UI)
  • notification-service (delivers hold/block notifications to tenants via the web portal)

Every outbound SMS must pass through the Compliance Layer before routing or transmission. Messages that violate rules are either blocked or held for manual review: a blocked message never reaches a carrier, and a held message reaches one only if a reviewer releases it. Tenants are notified of holds and blocks asynchronously through the web portal; the API response does not carry the compliance outcome.

The Compliance Layer provides five distinct capabilities:

| Capability | Description |
| --- | --- |
| Async compliance evaluation | Every queued message is evaluated against the tenant's rule sets before routing |
| Rule authoring & management | Platform admins define, version, deploy, and retire custom compliance rules |
| Hold queue & manual review | Held messages are queued for admin review; released or permanently rejected |
| Tenant scoring & risk tiering | Continuous scoring of every tenant; risk tiers drive automated enforcement |
| Audit, reporting & evidence | Immutable audit log feeds compliance reports for internal governance and regulators |

2. Position in the Message Pipeline (Async Flow)

Tenant / Client App
        │
        ▼  HTTP POST /v1/sms/send
┌───────────────┐
│ Kong Gateway  │  (JWT, API-key, rate limits)
└───────┬───────┘
        │
        ▼
┌─────────────────────┐
│ sms-orchestrator    │
│ HTTP Handler        │
│                     │
│ [1] Zod validate    │
│ [2] E.164 + seg     │
│ [3] Idempotency     │
│ [4] INSERT QUEUED   │
│ [5] Publish NATS    │
│ [6] Return 202 ◀────┼─── Tenant receives {messageId, status:"QUEUED"}
└──────────┬──────────┘     (does NOT wait for compliance)
           │
           ▼  NATS: sms.outbound.request
┌───────────────────────────┐
│ sms-orchestrator          │
│ NATS Consumer             │
│                           │
│ Update → EVALUATING       │
│                           │
│ ┌────────────────┐        │
│ │ COMPLIANCE     │◄───────┼── gRPC: EvaluateCompliance
│ │ LAYER          │        │   (compliance-engine)
│ └───────┬────────┘        │
│         │                 │
│  ┌──────┴──────┐          │
│  ▼             ▼          │
│ ALLOW/FLAG  BLOCK/HOLD    │
│  │             │          │
│  ▼             ▼          │
│ routing     Update →      │
│ engine      BLOCKED /     │
│             ON_HOLD       │
│                │          │
└────────────────┼──────────┘
                 │
                 ▼
        ┌─────────────────┐
        │ notification-   │ ──► Web Portal
        │ service         │     (tenant sees
        └─────────────────┘      hold/block)
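
The branch inside the NATS consumer can be sketched as a lookup from verdict to outcome. This is a minimal TypeScript sketch, not the service's actual code: the names `Outcome`, `OUTCOMES`, and `applyVerdict` are hypothetical, and it assumes that FLAG is routed like ALLOW (as the diagram suggests) and that both land in the ALLOWED status.

```typescript
// Verdict and status labels come from the diagram; everything else is illustrative.
type Verdict = "ALLOW" | "FLAG" | "BLOCK" | "HOLD";
type MessageStatus = "ALLOWED" | "BLOCKED" | "ON_HOLD";

interface Outcome {
  status: MessageStatus;      // status written to the message row
  forwardToRouting: boolean;  // hand the message to routing-engine?
  notifyTenant: boolean;      // publish an event for notification-service?
}

// Assumption: FLAG proceeds to routing exactly like ALLOW.
const OUTCOMES: Record<Verdict, Outcome> = {
  ALLOW: { status: "ALLOWED", forwardToRouting: true, notifyTenant: false },
  FLAG: { status: "ALLOWED", forwardToRouting: true, notifyTenant: false },
  BLOCK: { status: "BLOCKED", forwardToRouting: false, notifyTenant: true },
  HOLD: { status: "ON_HOLD", forwardToRouting: false, notifyTenant: true },
};

function applyVerdict(verdict: Verdict): Outcome {
  return OUTCOMES[verdict];
}
```

Keeping the mapping in a single table means a new verdict cannot be added to the `Verdict` type without the compiler demanding an explicit outcome for it.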

Why async?

  • The HTTP handler returns 202 Accepted within ~50 ms of receiving a request. Tenants never block on compliance evaluation.
  • Compliance evaluation runs in the NATS consumer pipeline, so its latency counts toward platform processing time rather than user-perceived API latency.
  • This enables more thorough evaluation (local LLM classification, cross-rule composite checks, DB lookups) without degrading the API experience.
  • Fail-closed is operationally viable: if compliance is unavailable, the message waits in the queue and retries — no unverified message reaches a carrier, ever.
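
The fail-closed behaviour in the last bullet can be sketched as follows. This is a hedged illustration rather than the real consumer: `evaluateFailClosed`, `Disposition`, and the `maxAttempts` parameter are hypothetical, attempts are counted from 1, and the evaluation call is synchronous only to keep the sketch short (the real call is gRPC). The dead-letter reason string is the one named in section 8.

```typescript
type Verdict = "ALLOW" | "FLAG" | "BLOCK" | "HOLD";

// What the consumer should do with the message after one evaluation attempt.
type Disposition =
  | { kind: "verdict"; verdict: Verdict }
  | { kind: "retry"; attempt: number }
  | { kind: "dead-letter"; reason: string };

function evaluateFailClosed(
  evaluate: () => Verdict, // stand-in for the gRPC call; throws when unavailable
  attempt: number,         // 1-based attempt counter (assumed)
  maxAttempts: number,     // retry budget (assumed; not specified in this document)
): Disposition {
  try {
    return { kind: "verdict", verdict: evaluate() };
  } catch {
    // Fail-closed: an error never becomes an implicit ALLOW.
    if (attempt < maxAttempts) {
      return { kind: "retry", attempt: attempt + 1 };
    }
    return { kind: "dead-letter", reason: "failed_compliance_unavailable" };
  }
}
```

Note that no branch of the error path produces a verdict, so a message can only leave the queue toward routing with an explicit result from the compliance engine.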

Message state transitions driven by the Compliance Layer

QUEUED → EVALUATING → ALLOWED → ROUTING → ROUTED → SENT → DELIVERED
             │
             ├── BLOCKED (terminal; tenant notified)
             │
             ├── ON_HOLD ──► REVIEWED_RELEASED → ROUTING → ...
             │      │
             │      ▼
             │   REVIEWED_REJECTED (terminal; tenant notified)
             │
             └── AUTO_EXPIRED (terminal; tenant notified)
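
The transitions above can be encoded as a table that rejects illegal state changes. This is a sketch under stated assumptions: `TRANSITIONS`, `canTransition`, and `isTerminal` are hypothetical names; AUTO_EXPIRED is assumed to be reached from ON_HOLD (consistent with the hold-queue auto-expiry in section 8); and DELIVERED is treated as terminal because the diagram ends there.

```typescript
type State =
  | "QUEUED" | "EVALUATING" | "ALLOWED" | "ROUTING" | "ROUTED" | "SENT" | "DELIVERED"
  | "BLOCKED" | "ON_HOLD" | "REVIEWED_RELEASED" | "REVIEWED_REJECTED" | "AUTO_EXPIRED";

// Allowed next states for each state; an empty list marks a terminal state.
const TRANSITIONS: Record<State, readonly State[]> = {
  QUEUED: ["EVALUATING"],
  EVALUATING: ["ALLOWED", "BLOCKED", "ON_HOLD"],
  ALLOWED: ["ROUTING"],
  ROUTING: ["ROUTED"],
  ROUTED: ["SENT"],
  SENT: ["DELIVERED"],
  DELIVERED: [],
  BLOCKED: [],
  // Assumption: expiry applies to held messages only.
  ON_HOLD: ["REVIEWED_RELEASED", "REVIEWED_REJECTED", "AUTO_EXPIRED"],
  REVIEWED_RELEASED: ["ROUTING"],
  REVIEWED_REJECTED: [],
  AUTO_EXPIRED: [],
};

function canTransition(from: State, to: State): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

function isTerminal(state: State): boolean {
  return TRANSITIONS[state].length === 0;
}
```

For example, `canTransition("ON_HOLD", "REVIEWED_RELEASED")` holds, while any transition out of BLOCKED is rejected.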

3. Bounded Context

| Dimension | Value |
| --- | --- |
| Domain | Trust & Safety / Regulatory Compliance |
| Owner squad | Platform Engineering / Trust & Safety |
| Deployment unit | Kubernetes Deployment compliance-engine |
| Communication style | Inbound: gRPC (from sms-orchestrator NATS consumer) · HTTP REST (admin CRUD) · NATS (DLR consumer). Outbound: HTTPS (local LLM) |
| Storage | PostgreSQL schema compliance · Redis cache |
| Failure mode | Fail-closed (always): no message may be dispatched without an explicit ALLOW/FLAG verdict |

4. Responsibilities

| # | Responsibility |
| --- | --- |
| R1 | Accept EvaluateCompliance gRPC calls from sms-orchestrator's NATS consumer and return a verdict with P95 latency ≤ 500 ms |
| R2 | Evaluate each message against the tenant's assigned rule sets and platform-level rules |
| R3 | Support 10 rule types: KEYWORD, REGEX, SENDER_ID, RECIPIENT, RATE_VOLUME, GEO_RESTRICTION, TEMPORAL, DLR_ABUSE, AI_CLASSIFICATION, COMPOSITE |
| R4 | Operate an allowlist-first evaluation model so trusted senders bypass restriction rules |
| R5 | Place HOLD-verdict messages into the hold queue with full payload preservation |
| R6 | Expose a REST API for platform admins to manage rules, rule sets, blocklists, and the hold queue |
| R7 | Maintain a continuous compliance score (0–100) and risk tier for every tenant |
| R8 | Produce an immutable audit log for every evaluation, rule change, and hold-queue decision |
| R9 | Consume sms.dlr.inbound NATS events to maintain per-tenant DLR statistics for abuse rules |
| R10 | Publish compliance lifecycle events to NATS; notification-service consumes to alert tenants via the web portal |
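
The allowlist-first model in R4 can be illustrated with a two-pass evaluator. This is a simplified sketch, not the engine's real rule model: the `Rule` shape, `matches`, `evaluate`, and the sample sender ID are hypothetical; only two of the ten rule types are shown; and both the severity order (BLOCK over HOLD over FLAG) and the default ALLOW when nothing matches are assumptions.

```typescript
type Verdict = "ALLOW" | "FLAG" | "HOLD" | "BLOCK";

interface Message { senderId: string; body: string; }

interface Rule {
  id: string;
  type: "KEYWORD" | "SENDER_ID"; // two of the ten rule types, for brevity
  action: Verdict;
  value: string;
}

interface Result { verdict: Verdict; ruleId: string | null; }

function matches(rule: Rule, msg: Message): boolean {
  if (rule.type === "SENDER_ID") return msg.senderId === rule.value;
  // KEYWORD: case-insensitive substring match
  return msg.body.toLowerCase().indexOf(rule.value.toLowerCase()) !== -1;
}

// Assumed precedence among restriction verdicts.
const SEVERITY: Record<Verdict, number> = { ALLOW: 0, FLAG: 1, HOLD: 2, BLOCK: 3 };

function evaluate(rules: readonly Rule[], msg: Message): Result {
  // Pass 1: any matching ALLOW rule short-circuits the restriction rules.
  for (const rule of rules) {
    if (rule.action === "ALLOW" && matches(rule, msg)) {
      return { verdict: "ALLOW", ruleId: rule.id };
    }
  }
  // Pass 2: the most severe matching restriction wins.
  let result: Result = { verdict: "ALLOW", ruleId: null };
  for (const rule of rules) {
    if (rule.action !== "ALLOW" && matches(rule, msg)
        && SEVERITY[rule.action] > SEVERITY[result.verdict]) {
      result = { verdict: rule.action, ruleId: rule.id };
    }
  }
  return result;
}
```

With a KEYWORD rule blocking "casino" and a SENDER_ID rule allowing a trusted OTP sender, a message from that sender containing "casino" is allowed, while the same body from any other sender is blocked.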

5. Non-Responsibilities

  • Does not return verdicts to tenants via the ingestion API — tenants see state via the web portal, fed by notification-service consuming compliance events
  • Does not transmit SMS — handled by smpp-connector
  • Does not enforce billing or segment quotas — handled by billing-service
  • Does not manage API keys or JWT authentication — handled by auth-service
  • Does not own the carrier routing decision — handled by routing-engine

6. Upstream / Downstream Dependencies

| Direction | Service | Protocol | Purpose |
| --- | --- | --- | --- |
| Inbound caller | sms-orchestrator NATS consumer | gRPC (mTLS) | Per-message compliance evaluation |
| Inbound admin | admin-dashboard | HTTP REST (mTLS) | Rule / hold queue / tenant management |
| Inbound event | dlr-processor | NATS JetStream sms.dlr.inbound | DLR statistics for DLR_ABUSE rules |
| Outbound read/write | PostgreSQL compliance schema | TCP (pg driver) | Rules, hold queue, scores, audit log |
| Outbound cache | Redis | TCP | Rule set cache, evaluation result cache, score cache |
| Outbound (optional) | Local LLM service (primary) / LLM API (fallback) | HTTPS / gRPC | AI_CLASSIFICATION rule evaluation |
| Outbound events | NATS JetStream | TCP | Compliance lifecycle events → notification-service, analytics-service |

7. High-Level Flow


8. Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Compliance is an architectural layer, not a bolt-on | Every message, from every tenant, for every rule set, passes through this layer; it is tier-defining, not feature-level |
| Asynchronous evaluation in NATS consumer pipeline | Tenants receive 202 immediately; compliance runs in platform processing. Enables richer evaluation without impacting API latency |
| Fail-closed is the only mode | Non-compliance must never result in a dispatched message. On compliance unavailability, messages retry in-queue; after exhausted retries they move to DLQ with failed_compliance_unavailable reason |
| Tenants notified via web portal, not API response | Holds/blocks surface in the tenant dashboard with full context (which rule, why, appeal path), which is a better experience than a sync HTTP error |
| Allowlist-first evaluation | Trusted senders (OTP, alerts, verified templates) bypass restriction rules via explicit ALLOW rules |
| Local LLM as primary AI provider | Data residency, cost, no DPA overhead with third parties, acceptable latency in async flow |
| CEL-inspired condition expressions | Human-readable, auditable, sandboxed; no arbitrary code execution |
| AI classification result cached 24 h by body hash | Identical SMS bodies appear millions of times; caching eliminates redundant inferences |
| Evaluation log is append-only, partitioned monthly | Compliance evidence must be tamper-evident and query-efficient for auditors |
| Hold queue with 24 h auto-expiry | Ensures held messages do not accumulate indefinitely |
| Tenant score recomputed every 15 minutes | Balances freshness with compute cost; score is not per-message |
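
The "cached 24 h by body hash" decision can be sketched as a small TTL cache. This is an in-memory illustration only (the document places caches in Redis): `ClassificationCache`, `bodyHash`, and the `classify` callback are hypothetical names, the clock is passed in explicitly to keep the sketch deterministic, and the 32-bit FNV-1a-style hash is a stand-in for whatever hash the service actually uses. A hash this short can collide, so a real deployment would presumably use a cryptographic hash.

```typescript
const TTL_MS = 24 * 60 * 60 * 1000; // 24 h, per the design decision above

// Stand-in hash (FNV-1a style over UTF-16 code units); an assumption, not the service's hash.
function bodyHash(body: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < body.length; i++) {
    h ^= body.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

class ClassificationCache {
  private entries = new Map<string, { label: string; expiresAt: number }>();
  private classify: (body: string) => string;
  private ttlMs: number;

  constructor(classify: (body: string) => string, ttlMs: number = TTL_MS) {
    this.classify = classify; // stand-in for the LLM inference call
    this.ttlMs = ttlMs;
  }

  // Returns the cached label if it is still fresh; otherwise classifies and stores.
  get(body: string, nowMs: number): string {
    const key = bodyHash(body);
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > nowMs) return hit.label;
    const label = this.classify(body);
    this.entries.set(key, { label, expiresAt: nowMs + this.ttlMs });
    return label;
  }
}
```

Two lookups of the same body within the TTL trigger a single inference; a lookup after the entry expires triggers a fresh one.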