
Compliance Layer — Migration Plan

Status: populated | Last updated: 2026-04-18

1. Overview

The Compliance Layer is a new architectural tier being added to the SMS gateway platform. Because the ingestion API is asynchronous — tenants receive 202 immediately and learn of holds/blocks via the web portal — deployment does not risk breaking the tenant-facing API contract. Integration occurs in the sms-orchestrator NATS consumer pipeline.

The Compliance Layer is always fail-closed: no message is dispatched without an explicit verdict. This shapes the rollout — we introduce the layer first, then enforce rule-by-rule, rather than flip a global "enforce or don't enforce" switch.
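A minimal sketch of what fail-closed means for the consumer. The `evaluate` callable stands in for the EvaluateCompliance call, and the outcome strings are hypothetical names chosen for illustration. Only an explicit verdict leads to an action; an error or an unrecognised verdict leaves the message unacknowledged so that it is redelivered.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "ALLOW"
    FLAG = "FLAG"
    HOLD = "HOLD"
    BLOCK = "BLOCK"


def handle_message(message: dict, evaluate: Callable[[dict], str]) -> str:
    """Return the consumer outcome for one message (outcome names are illustrative)."""
    try:
        # Verdict(...) raises ValueError when the value is not a known verdict.
        verdict = Verdict(evaluate(message))
    except Exception:
        # Fail-closed: no explicit verdict, so do not dispatch and do not ack.
        return "no_ack_redeliver"
    if verdict in (Verdict.ALLOW, Verdict.FLAG):
        return "route"  # FLAG is logged but does not stop delivery
    if verdict is Verdict.HOLD:
        return "hold"
    return "block"
```

Treating every exception the same way is a deliberate simplification here: whatever the cause, the absence of a verdict never results in dispatch.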


2. Migration Phases

Phase 0 — Pre-Deployment Readiness (Week 0)

Task | Owner | Status
Provision PostgreSQL compliance schema in shared DB | Platform DBA |
Provision Redis logical DB 3 for compliance-engine | Platform SRE |
Create NATS streams: COMPLIANCE_EVENTS | Platform SRE |
Provision GPU nodes in Kubernetes cluster | Platform SRE |
Deploy local LLM service (vLLM + Llama-3.1-8B-AWQ or equivalent) | Platform Engineering |
Validate local LLM throughput and latency under load | Platform Engineering |
Create Kubernetes namespace resources (ConfigMap, Secrets, PDB) | Platform SRE |
Curate initial keyword lists (gambling, fraud, phishing, terrorism; multiple languages) | Trust & Safety |
Seed default rule set (observation-only initially: action=FLAG on everything) | Trust & Safety |

Phase 1 — Layer Deployment, Observation Mode (Week 1)

Deploy the Compliance Layer with the default rule set configured as observation-only — every rule uses action=FLAG, so no message is ever blocked or held. The verdict is evaluated and logged; the sms-orchestrator NATS consumer receives the verdict and proceeds to routing regardless.

Goal: Validate infrastructure, measure real-world latency and verdict distribution, detect any integration issues — without affecting a single message.

Changes:

  1. Deploy compliance-engine (3 replicas) + local LLM (2 GPU pods).
  2. Update sms-orchestrator NATS consumer to call EvaluateCompliance on every message.
  3. Consumer respects ALLOW / FLAG / BLOCK / HOLD verdicts, but all active rules are FLAG so only ALLOW/FLAG are ever produced.
  4. All flagged evaluations land in evaluation_log for Trust & Safety to analyse.

Observation window: 7 days of production traffic.

Exit criteria:

  • P95 evaluation latency ≤ 500 ms over 7 consecutive days
  • Evaluation error rate < 0.1%
  • No sms-orchestrator NATS consumer lag growth
  • Trust & Safety team has identified candidate rules for activation (based on FLAG analysis)
  • Local LLM cache hit rate ≥ 90%
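
The measurable exit criteria above could be checked along these lines. This is a sketch under stated assumptions: `DayStats` and its fields are hypothetical, "over 7 consecutive days" is read as each day's P95 staying within 500 ms, and the error rate and cache hit rate are aggregated over the whole window. Consumer lag and the Trust & Safety criterion are not covered.

```python
from dataclasses import dataclass


@dataclass
class DayStats:
    latencies_ms: list[float]  # per-evaluation latency samples for one day
    evaluations: int
    errors: int
    cache_hits: int
    cache_lookups: int


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic for the rank)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def phase1_exit_ok(days: list[DayStats]) -> bool:
    """True when latency, error-rate and cache criteria hold for the last 7 days."""
    if len(days) < 7:
        return False
    window = days[-7:]
    latency_ok = all(d.latencies_ms and p95(d.latencies_ms) <= 500 for d in window)
    evaluations = sum(d.evaluations for d in window)
    errors = sum(d.errors for d in window)
    lookups = sum(d.cache_lookups for d in window)
    hits = sum(d.cache_hits for d in window)
    error_ok = evaluations > 0 and errors / evaluations < 0.001
    cache_ok = lookups > 0 and hits / lookups >= 0.90
    return latency_ok and error_ok and cache_ok
```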

Phase 2 — Enable First BLOCK Rules (Week 2)

Activate a small, conservative set of BLOCK rules that target high-confidence violations only.

Recommended initial BLOCK rules:

  • KEYWORD rule: known-spam phrases (very conservative wordlist)
  • SENDER_ID rule: explicitly blacklisted senders (internal list)
  • COUNTRY-level GEO_RESTRICTION (if applicable to deployment)

Per-rule rollout procedure:

  1. Author rule with isActive=false.
  2. Review with Trust & Safety lead.
  3. Enable rule via POST /compliance/rules/:id/enable.
  4. Monitor BLOCK rate for 24 h.
  5. Audit 10 random BLOCK verdicts — verify each is a true positive.
  6. Keep if accuracy > 99% (with a 10-verdict sample, this means all 10 are true positives); disable and adjust otherwise.
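
Steps 5 and 6 of the procedure can be sketched as follows. The function name, the `is_true_positive` callable (standing in for the reviewer's judgement) and the returned dictionary shape are assumptions made for illustration.

```python
import random


def audit_block_rule(block_verdict_ids, is_true_positive, sample_size=10, seed=None):
    """Sample BLOCK verdicts at random and decide whether the rule stays enabled."""
    ids = list(block_verdict_ids)
    rng = random.Random(seed)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    if not sample:
        # No BLOCK verdicts yet: nothing to audit, so no decision is made.
        return {"sampled": 0, "accuracy": None, "keep": None}
    correct = sum(1 for verdict_id in sample if is_true_positive(verdict_id))
    accuracy = correct / len(sample)
    return {"sampled": len(sample), "accuracy": accuracy, "keep": accuracy > 0.99}
```

Raising `sample_size` gives a finer-grained accuracy estimate at the cost of more reviewer time.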

Tenant experience: Affected tenants see BLOCKED messages in their web portal with rule citation and appeal path.


Phase 3 — Enable HOLD Rules + Hold Queue Review Operations (Week 3)

Activate HOLD rules so Trust & Safety reviewers can manually adjudicate ambiguous cases.

Prerequisites:

  • Compliance reviewer team staffed (≥ 2 reviewers during business hours)
  • Review SLA defined (target: 95% of holds reviewed within 4 business hours)
  • Reviewer training delivered

Recommended initial HOLD rules:

  • RATE_VOLUME: sudden burst per sender_id (e.g., > 5,000 messages/5 min from single sender)
  • DLR_ABUSE: account whose delivery failure rate over the last 24 h exceeds 20%
  • KEYWORD: ambiguous terms that need context review
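
The two threshold-based rules above can be illustrated with a sliding-window counter and a simple ratio check. This is a single-process sketch with hypothetical class and function names; how counters are shared across compliance-engine replicas is out of scope here.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Sliding-window message counter per sender_id (RATE_VOLUME sketch)."""

    def __init__(self, limit: int = 5000, window_s: float = 300.0):
        self.limit = limit
        self.window_s = window_s
        self._events = defaultdict(deque)  # sender_id -> message timestamps

    def record(self, sender_id: str, now: float) -> bool:
        """Record one message; return True if the sender now exceeds the limit."""
        timestamps = self._events[sender_id]
        timestamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window_s:
            timestamps.popleft()
        return len(timestamps) > self.limit


def dlr_abuse(failed_24h: int, total_24h: int) -> bool:
    """True when more than 20% of the last 24 h of messages failed (DLR_ABUSE sketch)."""
    return total_24h > 0 and failed_24h * 5 > total_24h
```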

Operational cadence:

  • Daily review queue triage standup
  • Weekly false-positive / false-negative audit
  • Monthly rule tuning based on review outcomes

Phase 4 — Enable AI_CLASSIFICATION Rules (Week 4–5)

Roll out AI-powered content classification category by category.

Sequence:

  1. Enable PHISHING category with action=HOLD, fallbackAction=HOLD (fail-closed).
  2. Monitor for 5 business days — cache hit rate, LLM latency, hold review quality.
  3. If stable: enable MALWARE_LINK category.
  4. Progressive rollout: FINANCIAL_FRAUD → TERRORISM → SPAM → others.
  5. Optionally promote highest-confidence rules from HOLD to BLOCK once precision ≥ 99.5% verified.
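
A sketch of how an AI_CLASSIFICATION rule with a fail-closed fallbackAction might behave, plus the promotion check from step 5. The `classify` callable, its score-dictionary return shape and the `threshold` parameter are assumptions for illustration; the source does not specify the classifier interface.

```python
from typing import Callable


def ai_rule_verdict(text: str,
                    classify: Callable[[str], dict],
                    category: str = "PHISHING",
                    action: str = "HOLD",
                    fallback_action: str = "HOLD",
                    threshold: float = 0.5) -> str:
    """Verdict of one AI_CLASSIFICATION rule for one message."""
    try:
        scores = classify(text)  # assumed shape: {"PHISHING": 0.93, ...}
        score = float(scores[category])
    except Exception:
        # Classifier unavailable or output malformed: apply the fallback action.
        return fallback_action
    return action if score >= threshold else "ALLOW"


def can_promote_to_block(true_positives: int, false_positives: int) -> bool:
    """True when verified precision is at least 99.5% (integer arithmetic)."""
    total = true_positives + false_positives
    return total > 0 and true_positives * 1000 >= 995 * total
```

With fallbackAction=HOLD, a classifier outage sends messages to the review queue rather than letting them through unclassified.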

Local LLM operations:

  • Daily GPU utilisation check
  • Weekly accuracy audit against labelled dataset
  • Quarterly model re-evaluation (consider upgrade to newer base model)

Phase 5 — Tenant Scoring Goes Live (Week 6)

The scoring worker has been running since Phase 1 to gather metrics, but tier enforcement has remained disabled. In this phase, automated tier actions are activated:

  • RESTRICTED tier: bulk sends (>500/hr) auto-held; rate limits halved
  • SUSPENDED tier: all messages auto-held pending admin review
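
The tier actions above could be expressed as a small decision function. This is a sketch: the function name, the return shape and the `enforcement_enabled` flag name are hypothetical, and leaving the rate limit unchanged for SUSPENDED tenants is an assumption, since the source only states that their messages are held.

```python
def apply_tier(tier: str, sends_last_hour: int, base_rate_limit: int,
               enforcement_enabled: bool = True) -> dict:
    """Return the enforcement decision for a tenant's current traffic."""
    no_action = {"auto_hold": False, "rate_limit": base_rate_limit}
    if not enforcement_enabled:
        return no_action  # scoring still runs, but no automated tier action
    if tier == "SUSPENDED":
        # All messages are held pending admin review.
        return {"auto_hold": True, "rate_limit": base_rate_limit}
    if tier == "RESTRICTED":
        # Bulk sends above 500/hr are held; rate limits are halved.
        return {"auto_hold": sends_last_hour > 500,
                "rate_limit": base_rate_limit // 2}
    return no_action
```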

Safety measures:

  • Tier change notifications sent to tenant admins 48 h before first automated action affects their traffic
  • Manual tier override capability verified
  • Appeal workflow documented and linked from web portal notifications

Phase 6 — Steady State & Continuous Improvement (Week 7+)

  • Weekly compliance report review by Trust & Safety
  • Monthly rule tuning based on false-positive / false-negative analysis
  • Quarterly regulatory review
  • Quarterly LLM model evaluation
  • Annual compliance framework audit

3. Database Migration Steps

All DDL is managed through Prisma migrations, applied in the following order:

  1. 20260401000000_create_compliance_schema — schema + enum types
  2. 20260401100000_create_rule_tables — rule_sets, rules, rule_versions, tenant_rule_set_assignments
  3. 20260401200000_create_hold_queue — hold_queue + RLS policies
  4. 20260401300000_create_evaluation_log — partitioned evaluation_log
  5. 20260402000000_create_scoring_tables — tenant_compliance_scores, score_history
  6. 20260402100000_create_blocklists_keywords — blocklists, blocklist_entries, keyword_lists, keyword_entries
  7. 20260402200000_create_dlr_stats — dlr_stats table
  8. 20260402300000_create_audit_log — audit_log + immutability rules
  9. 20260403000000_seed_default_rule_set — default rule set (all rules FLAG-only for Phase 1)

Each migration is tested in staging, with rollback SQL prepared.
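
A small pre-flight check can confirm that the listed migration names sort into the intended order by their 14-digit prefix and that the seed runs last. This is a sketch that assumes migrations are applied in ascending prefix order; the function names are hypothetical.

```python
MIGRATIONS = [
    "20260401000000_create_compliance_schema",
    "20260401100000_create_rule_tables",
    "20260401200000_create_hold_queue",
    "20260401300000_create_evaluation_log",
    "20260402000000_create_scoring_tables",
    "20260402100000_create_blocklists_keywords",
    "20260402200000_create_dlr_stats",
    "20260402300000_create_audit_log",
    "20260403000000_seed_default_rule_set",
]


def prefix(name: str) -> int:
    """Extract the numeric prefix that determines ordering."""
    stamp, _, _ = name.partition("_")
    if len(stamp) != 14 or not stamp.isdigit():
        raise ValueError(f"unexpected migration name: {name}")
    return int(stamp)


def in_apply_order(names: list[str]) -> bool:
    """True when prefixes are strictly increasing and the seed migration is last."""
    stamps = [prefix(n) for n in names]
    increasing = all(a < b for a, b in zip(stamps, stamps[1:]))
    return increasing and names[-1].endswith("_seed_default_rule_set")
```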


4. Existing Service Changes

sms-orchestrator

Change | Complexity | Risk
Add ComplianceClient gRPC stub in NATS consumer | Low | Low
Integrate compliance call in NATS consumer pipeline (between dequeue and routing) | Medium | Low
New message states: EVALUATING, ON_HOLD, BLOCKED, AUTO_EXPIRED | Medium | Low
Fail-closed handling: do-not-ack on compliance error → NATS redelivery | Medium | Medium
Handle sms.outbound.retry with skipCompliance: true flag (for release flow) | Low | Low

admin-dashboard

Change | Complexity | Risk
Compliance management UI (rules, rule sets, blocklists, keyword lists) | High | Low
Hold queue review UI with priority sorting and bulk actions | Medium | Low
Tenant compliance score dashboard + ranking view | Medium | Low
Tier override interface | Low | Low
Audit log query interface | Low | Low

customer-portal (tenant web portal)

Change | Complexity | Risk
Display message states including BLOCKED, ON_HOLD, AUTO_EXPIRED | Medium | Low
Show rule citation and rationale on blocked/held messages | Medium | Low
Tenant compliance score visibility | Low | Low
Appeal / dispute submission flow (future) | Medium | Low

notification-service

Change | Complexity | Risk
Subscribe to compliance.message.held, compliance.message.blocked, compliance.message.rejected, compliance.message.released, compliance.message.expired | Low | Low
Push notifications to tenant web portal | Medium | Low
Tier change notifications to tenant admins | Low | Low

billing-service

Change | Complexity | Risk
Skip billing for BLOCKED messages | Low | Low
Defer billing for ON_HOLD until release | Medium | Low
Skip billing for AUTO_EXPIRED | Low | Low
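
The billing-service changes amount to a mapping from message state to billing action, sketched below. The function name and the action strings are hypothetical; any state not listed in the table is assumed to follow the existing billing flow unchanged.

```python
def billing_action(state: str) -> str:
    """Map a message state to a billing action (action names are illustrative)."""
    if state in ("BLOCKED", "AUTO_EXPIRED"):
        return "skip"   # the message is never billed
    if state == "ON_HOLD":
        return "defer"  # billed only if and when the message is released
    return "existing_flow"  # assumption: other states keep current behaviour
```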

dlr-processor

  • No changes required. Already publishes sms.dlr.inbound.

5. Rollback Plan

Because the Compliance Layer is fail-closed, rollback cannot simply disable the layer — doing so would block all traffic. Rollback is performed by deactivating rules, not by disabling the layer.

Phase | Rollback Action | Rollback Time
Phase 2 (BLOCK rules) | Disable specific rule via POST /rules/:id/disable | < 1 min per rule
Phase 3 (HOLD rules) | Same as above | < 1 min per rule
Phase 4 (AI rules) | Disable AI rule or change fallbackAction to HOLD only (fail-closed) | < 1 min
Phase 5 (tenant scoring enforcement) | Disable tier actions via feature flag | < 5 min

Full architectural rollback — removing the compliance-engine call from sms-orchestrator — requires a code change in sms-orchestrator and is not supported as an operational rollback. The Compliance Layer is structural to the platform.


6. Data Migration

None required. This is a net-new layer with no existing data to migrate. Historical messages in sms_messages are not retroactively evaluated.


7. Timeline Summary

Week | Phase | Milestone
0 | Pre-deployment readiness | Infrastructure + local LLM deployed
1 | Phase 1: Observation mode | Layer live; all rules FLAG-only; metrics gathered
2 | Phase 2: BLOCK rules | First deterministic BLOCK rules active
3 | Phase 3: HOLD rules | Review operations running
4–5 | Phase 4: AI rules | Content classification active category-by-category
6 | Phase 5: Scoring enforcement | Tier actions active
7+ | Steady state | Ongoing tuning and reporting