Compliance Layer — Migration Plan

Status: populated | Last updated: 2026-04-18

1. Overview

The Compliance Layer is a new architectural tier being added to the SMS gateway platform. Because the ingestion API is asynchronous — tenants receive 202 immediately and learn of holds/blocks via the web portal — deployment does not risk breaking the tenant-facing API contract. Integration occurs in the sms-orchestrator NATS consumer pipeline.

The Compliance Layer is always fail-closed: no message is dispatched without an explicit verdict. This shapes the rollout — we introduce the layer first, then enforce rule-by-rule, rather than flip a global "enforce or don't enforce" switch.

2. Migration Phases

Phase 0 — Pre-Deployment Readiness (Week 0)

Task	Owner	Status
Provision PostgreSQL `compliance` schema in shared DB	Platform DBA	☐
Provision Redis logical DB 3 for compliance-engine	Platform SRE	☐
Create NATS stream `COMPLIANCE_EVENTS`	Platform SRE	☐
Provision GPU nodes in Kubernetes cluster	Platform SRE	☐
Deploy local LLM service (vLLM + Llama-3.1-8B-AWQ or equivalent)	Platform Engineering	☐
Validate local LLM throughput and latency under load	Platform Engineering	☐
Create Kubernetes namespace resources (ConfigMap, Secrets, PDB)	Platform SRE	☐
Curate initial keyword lists (gambling, fraud, phishing, terrorism — multiple languages)	Trust & Safety	☐
Seed default rule set (observation-only initially — action=FLAG on everything)	Trust & Safety	☐

Phase 1 — Layer Deployment, Observation Mode (Week 1)

Deploy the Compliance Layer with the default rule set configured as observation-only — every rule uses action=FLAG, so no message is ever blocked or held. The verdict is evaluated and logged; the sms-orchestrator NATS consumer receives the verdict and proceeds to routing regardless.

Goal: Validate infrastructure, measure real-world latency and verdict distribution, detect any integration issues — without affecting a single message.

Changes:

Deploy compliance-engine (3 replicas) + local LLM (2 GPU pods).
Update sms-orchestrator NATS consumer to call EvaluateCompliance on every message.
Consumer respects ALLOW / FLAG / BLOCK / HOLD verdicts, but all active rules are FLAG so only ALLOW/FLAG are ever produced.
All flagged evaluations land in evaluation_log for Trust & Safety to analyse.
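
Why a FLAG-only rule set can only yield ALLOW or FLAG can be sketched as follows. This is a minimal illustration, not the compliance-engine implementation: the severity ordering and the `combine` function are assumptions, since this plan does not define how multiple matching rules are merged into one verdict.

```python
from typing import Iterable

# Hypothetical precedence: a higher number wins when several rules match.
SEVERITY = {"ALLOW": 0, "FLAG": 1, "HOLD": 2, "BLOCK": 3}


def combine(matched_actions: Iterable[str]) -> str:
    """Return the most severe action among matched rules, or ALLOW if none matched."""
    return max(matched_actions, key=SEVERITY.__getitem__, default="ALLOW")


# Observation mode: every active rule carries action=FLAG.
observation_verdicts = {combine(["FLAG"] * n) for n in range(4)}
```

Under these assumptions `observation_verdicts` contains only ALLOW and FLAG, so the consumer's BLOCK and HOLD branches are deployed but never exercised during Phase 1.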

Observation window: 7 days of production traffic.

Exit criteria:

P95 evaluation latency ≤ 500 ms over 7 consecutive days
Evaluation error rate < 0.1%
No sms-orchestrator NATS consumer lag growth
Trust & Safety team has identified candidate rules for activation (based on FLAG analysis)
Local LLM cache hit rate ≥ 90%
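
The three numeric exit criteria can be checked mechanically. The sketch below is illustrative only: `DayMetrics` and its fields are hypothetical, the percentile uses the nearest-rank method, and it assumes each criterion is evaluated per day. Consumer lag and the Trust & Safety sign-off are not covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DayMetrics:
    latencies_ms: List[float]  # evaluation latencies observed that day
    evaluations: int           # total evaluations
    errors: int                # failed evaluations
    cache_hits: int            # local LLM cache hits
    cache_lookups: int         # local LLM cache lookups


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def day_passes(day: DayMetrics) -> bool:
    if not day.latencies_ms or day.evaluations == 0 or day.cache_lookups == 0:
        return False  # a day without data is treated as a failed day
    return (
        p95(day.latencies_ms) <= 500                       # P95 latency <= 500 ms
        and day.errors * 1000 < day.evaluations            # error rate < 0.1%
        and day.cache_hits * 10 >= day.cache_lookups * 9   # cache hit rate >= 90%
    )


def phase1_exit_ready(days: List[DayMetrics]) -> bool:
    """True when the seven most recent days all pass."""
    return len(days) >= 7 and all(day_passes(d) for d in days[-7:])
```

Integer comparisons are used for the two rates so that boundary values (exactly 0.1% or exactly 90%) are not affected by floating-point rounding.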

Phase 2 — Enable First BLOCK Rules (Week 2)

Activate a small, conservative set of BLOCK rules that target high-confidence violations only.

Recommended initial BLOCK rules:

KEYWORD rule: known-spam phrases (very conservative wordlist)
SENDER_ID rule: explicitly blacklisted senders (internal list)
COUNTRY-level GEO_RESTRICTION (if applicable to deployment)

Per-rule rollout procedure:

Author rule with isActive=false.
Review with Trust & Safety lead.
Enable rule via POST /compliance/rules/:id/enable.
Monitor BLOCK rate for 24 h.
Audit 10 random BLOCK verdicts — verify each is a true positive.
Keep the rule if audited accuracy > 99% (with a 10-verdict sample, this means all 10 must be true positives); otherwise disable and adjust.
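
The audit step reduces to simple arithmetic, sketched below. The function names are hypothetical, and the reviewer's true-positive judgements are passed in as booleans.

```python
import random
from typing import List, Optional, Sequence


def sample_for_audit(block_ids: Sequence[int], k: int = 10,
                     seed: Optional[int] = None) -> List[int]:
    """Pick up to k BLOCK verdicts at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(block_ids), min(k, len(block_ids)))


def audit_accuracy(is_true_positive: Sequence[bool]) -> float:
    """Fraction of audited BLOCK verdicts judged to be true positives."""
    if not is_true_positive:
        raise ValueError("audit sample is empty")
    return sum(is_true_positive) / len(is_true_positive)


def keep_rule(is_true_positive: Sequence[bool], threshold: float = 0.99) -> bool:
    """Keep the rule only if audited accuracy is strictly above the threshold."""
    return audit_accuracy(is_true_positive) > threshold
```

With a sample of 10, one false positive gives 9/10 = 90%, which is below the 99% threshold, so a rule survives the audit only when every sampled verdict is a true positive.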

Tenant experience: Affected tenants see BLOCKED messages in their web portal with rule citation and appeal path.

Phase 3 — Enable HOLD Rules + Hold Queue Review Operations (Week 3)

Activate HOLD rules so Trust & Safety reviewers can manually adjudicate ambiguous cases.

Prerequisites:

Compliance reviewer team staffed (≥ 2 reviewers during business hours)
Review SLA defined (target: 95% of holds reviewed within 4 business hours)
Reviewer training delivered

Recommended initial HOLD rules:

RATE_VOLUME: sudden burst per sender_id (e.g., > 5,000 messages in 5 min from a single sender)
DLR_ABUSE: account with a 24-hour delivery failure rate > 20%
KEYWORD: ambiguous terms that need context review
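
The two threshold-based rules can be sketched as follows, using the example thresholds above. This is an assumption-laden illustration: `BurstDetector` and `dlr_abuse` are hypothetical names, one detector instance is assumed per sender_id, and a sliding window is only one of several ways to count a burst.

```python
from collections import deque
from typing import Deque


class BurstDetector:
    """Sliding-window message counter for a single sender_id."""

    def __init__(self, limit: int = 5000, window_s: float = 300.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._sent: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record one message at time `now` (seconds); True means HOLD."""
        self._sent.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._sent and self._sent[0] <= now - self.window_s:
            self._sent.popleft()
        return len(self._sent) > self.limit


def dlr_abuse(failed: int, total: int, max_failure_rate: float = 0.20) -> bool:
    """True when the 24-hour delivery failure rate exceeds the threshold."""
    if total == 0:
        return False  # no delivery receipts yet, nothing to judge
    return failed / total > max_failure_rate
```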

Operational cadence:

Daily review queue triage standup
Weekly false-positive / false-negative audit
Monthly rule tuning based on review outcomes

Phase 4 — Enable AI_CLASSIFICATION Rules (Week 4–5)

Roll out AI-powered content classification category by category.

Sequence:

Enable PHISHING category with action=HOLD, fallbackAction=HOLD (fail-closed).
Monitor for 5 business days — cache hit rate, LLM latency, hold review quality.
If stable: enable MALWARE_LINK category.
Progressive rollout: FINANCIAL_FRAUD → TERRORISM → SPAM → others.
Optionally promote highest-confidence rules from HOLD to BLOCK once precision ≥ 99.5% verified.
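
How `action`, `fallbackAction`, and the promotion threshold interact can be sketched as below. Everything here is hypothetical apart from those names and the 99.5% figure: `AiRule`, `min_confidence`, and the `classify` callable stand in for whatever the compliance-engine and local LLM actually expose.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AiRule:
    category: str
    action: str = "HOLD"
    fallback_action: str = "HOLD"  # applied when the classifier fails (fail-closed)
    min_confidence: float = 0.8    # hypothetical match threshold


def evaluate(rule: AiRule, text: str,
             classify: Callable[[str, str], float]) -> Optional[str]:
    """Return the rule's action on a match, its fallback on failure, else None."""
    try:
        score = classify(text, rule.category)
    except Exception:
        return rule.fallback_action
    return rule.action if score >= rule.min_confidence else None


def may_promote_to_block(true_pos: int, false_pos: int,
                         min_precision: float = 0.995) -> bool:
    """Precision = TP / (TP + FP), measured on reviewed holds for this rule."""
    reviewed = true_pos + false_pos
    if reviewed == 0:
        return False
    return true_pos / reviewed >= min_precision
```

Precision here is computed from hold-review outcomes, which is one reason the plan starts each category at HOLD: reviewers' decisions supply the labels needed before any promotion to BLOCK.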

Local LLM operations:

Daily GPU utilisation check
Weekly accuracy audit against labelled dataset
Quarterly model re-evaluation (consider upgrade to newer base model)

Phase 5 — Tenant Scoring Goes Live (Week 6)

The scoring worker has been running since Phase 1 (gathering metrics), but with tier enforcement disabled. In this phase, automated tier actions are activated:

RESTRICTED tier: bulk sends (>500/hr) auto-held; rate limits halved
SUSPENDED tier: all messages auto-held pending admin review
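
A minimal sketch of the two tier actions follows. The tier names and thresholds come from the list above; the function name, the way hourly volume is measured, and the treatment of every other tier as unrestricted are assumptions.

```python
from typing import Tuple


def tier_decision(tier: str, sends_last_hour: int,
                  base_rate_limit: int) -> Tuple[bool, int]:
    """Return (auto_hold, effective_rate_limit) for one outbound message."""
    if tier == "SUSPENDED":
        # Everything is held pending admin review; the rate limit is moot.
        return True, base_rate_limit
    if tier == "RESTRICTED":
        # Bulk sends (> 500/hr) are held and the rate limit is halved.
        return sends_last_hour > 500, base_rate_limit // 2
    # Any other tier is assumed to carry no automated action.
    return False, base_rate_limit
```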

Safety measures:

Tier change notifications sent to tenant admins 48 h before first automated action affects their traffic
Manual tier override capability verified
Appeal workflow documented and linked from web portal notifications

Phase 6 — Steady State & Continuous Improvement (Week 7+)

Weekly compliance report review by Trust & Safety
Monthly rule tuning based on false-positive / false-negative analysis
Quarterly regulatory review
Quarterly LLM model evaluation
Annual compliance framework audit

3. Database Migration Steps

All DDL managed through Prisma migrations, applied in order:

20260401000000_create_compliance_schema — schema + enum types
20260401100000_create_rule_tables — rule_sets, rules, rule_versions, tenant_rule_set_assignments
20260401200000_create_hold_queue — hold_queue + RLS policies
20260401300000_create_evaluation_log — partitioned evaluation_log
20260402000000_create_scoring_tables — tenant_compliance_scores, score_history
20260402100000_create_blocklists_keywords — blocklists, blocklist_entries, keyword_lists, keyword_entries
20260402200000_create_dlr_stats — dlr_stats table
20260402300000_create_audit_log — audit_log + immutability rules
20260403000000_seed_default_rule_set — default rule set (all rules FLAG-only for Phase 1)

Each migration tested in staging with rollback SQL prepared.
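
Because the migration names carry a timestamp prefix, the intended order can be verified with a small check such as the one below, for example in CI. The helper names are hypothetical; the migration names are copied from the list above.

```python
from typing import List

MIGRATIONS = [
    "20260401000000_create_compliance_schema",
    "20260401100000_create_rule_tables",
    "20260401200000_create_hold_queue",
    "20260401300000_create_evaluation_log",
    "20260402000000_create_scoring_tables",
    "20260402100000_create_blocklists_keywords",
    "20260402200000_create_dlr_stats",
    "20260402300000_create_audit_log",
    "20260403000000_seed_default_rule_set",
]


def timestamp_of(name: str) -> str:
    """Return the 14-digit prefix of a migration name."""
    return name.split("_", 1)[0]


def is_in_apply_order(names: List[str]) -> bool:
    """True when prefixes are well-formed, unique, and strictly ascending."""
    stamps = [timestamp_of(n) for n in names]
    well_formed = all(len(s) == 14 and s.isdigit() for s in stamps)
    return well_formed and len(set(stamps)) == len(stamps) and stamps == sorted(stamps)
```

The check also confirms that the seed migration comes last, after every table it populates has been created.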

4. Existing Service Changes

sms-orchestrator

Change	Complexity	Risk
Add `ComplianceClient` gRPC stub in NATS consumer	Low	Low
Integrate compliance call in NATS consumer pipeline (between dequeue and routing)	Medium	Low
New message states: `EVALUATING`, `ON_HOLD`, `BLOCKED`, `AUTO_EXPIRED`	Medium	Low
Fail-closed handling: do-not-ack on compliance error → NATS redelivery	Medium	Medium
Handle `sms.outbound.retry` with `skipCompliance: true` flag (for release flow)	Low	Low
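
The fail-closed and release-flow rows above can be sketched together. This is not the sms-orchestrator code: `handle`, the `ROUTED` state, and the `evaluate` callable (standing in for the `ComplianceClient` call) are hypothetical, and the sketch assumes a message flagged `skipCompliance: true` has already been adjudicated in the hold queue.

```python
from typing import Callable, Mapping, Tuple

# Message states named in this plan, plus a hypothetical ROUTED state.
ROUTED, EVALUATING, ON_HOLD, BLOCKED = "ROUTED", "EVALUATING", "ON_HOLD", "BLOCKED"


def handle(msg: Mapping[str, object],
           evaluate: Callable[[Mapping[str, object]], str]) -> Tuple[str, bool]:
    """Return (message_state, ack); ack=False leaves the message for redelivery."""
    if msg.get("skipCompliance") is True:
        # Release flow: do not re-evaluate a message a reviewer has released.
        return ROUTED, True
    try:
        verdict = evaluate(msg)
    except Exception:
        # Fail-closed: on a compliance error, neither ack nor dispatch.
        return EVALUATING, False
    if verdict in ("ALLOW", "FLAG"):
        return ROUTED, True
    if verdict == "HOLD":
        return ON_HOLD, True
    if verdict == "BLOCK":
        return BLOCKED, True
    # An unrecognised verdict is treated like an error, never as permission.
    return EVALUATING, False
```

The Medium risk rating on the fail-closed row is visible here: if the compliance-engine stays unavailable, every message takes the no-ack branch and consumer lag grows until it recovers.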

admin-dashboard

Change	Complexity	Risk
Compliance management UI (rules, rule sets, blocklists, keyword lists)	High	Low
Hold queue review UI with priority sorting and bulk actions	Medium	Low
Tenant compliance score dashboard + ranking view	Medium	Low
Tier override interface	Low	Low
Audit log query interface	Low	Low

customer-portal (tenant web portal)

Change	Complexity	Risk
Display message states including `BLOCKED`, `ON_HOLD`, `AUTO_EXPIRED`	Medium	Low
Show rule citation and rationale on blocked/held messages	Medium	Low
Tenant compliance score visibility	Low	Low
Appeal / dispute submission flow (future)	Medium	Low

notification-service

Change	Complexity	Risk
Subscribe to `compliance.message.held`, `compliance.message.blocked`, `compliance.message.rejected`, `compliance.message.released`, `compliance.message.expired`	Low	Low
Push notifications to tenant web portal	Medium	Low
Tier change notifications to tenant admins	Low	Low

billing-service

Change	Complexity	Risk
Skip billing for `BLOCKED` messages	Low	Low
Defer billing for `ON_HOLD` until release	Medium	Low
Skip billing for `AUTO_EXPIRED`	Low	Low
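
The three billing rules amount to a small lookup, sketched below. The state names come from this plan; the function name and the returned labels are hypothetical, and states not listed here are assumed to keep the billing-service's existing behaviour.

```python
def billing_action(state: str) -> str:
    """Map a compliance-related message state to a billing decision."""
    if state in ("BLOCKED", "AUTO_EXPIRED"):
        return "skip"       # never dispatched, so never billed
    if state == "ON_HOLD":
        return "defer"      # bill only if and when the hold is released
    return "unchanged"      # existing billing rules apply
```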

dlr-processor

No changes required. Already publishes sms.dlr.inbound.

5. Rollback Plan

Because the Compliance Layer is fail-closed, rollback cannot simply disable the layer — doing so would block all traffic. Rollback is performed by deactivating rules, not by disabling the layer.

Phase	Rollback Action	Rollback Time
Phase 2 (BLOCK rules)	Disable specific rule via `POST /compliance/rules/:id/disable`	< 1 min per rule
Phase 3 (HOLD rules)	Same as above	< 1 min per rule
Phase 4 (AI rules)	Disable AI rule or change `fallbackAction` to HOLD only (fail-closed)	< 1 min
Phase 5 (tenant scoring enforcement)	Disable tier actions via feature flag	< 5 min

Full architectural rollback — removing the compliance-engine call from sms-orchestrator — requires a code change in sms-orchestrator and is not supported as an operational rollback. The Compliance Layer is structural to the platform.

6. Data Migration

None required. This is a net-new layer with no existing data to migrate. Historical messages in sms_messages are not retroactively evaluated.

7. Timeline Summary

Week	Phase	Milestone
0	Pre-deployment readiness	Infrastructure + local LLM deployed
1	Phase 1 — Observation mode	Layer live; all rules FLAG-only; metrics gathered
2	Phase 2 — BLOCK rules	First deterministic BLOCK rules active
3	Phase 3 — HOLD rules	Review operations running
4–5	Phase 4 — AI rules	Content classification active category-by-category
6	Phase 5 — Scoring enforcement	Tier actions active
7+	Steady state	Ongoing tuning and reporting
