Compliance Layer — Migration Plan
Status: populated | Last updated: 2026-04-18
1. Overview
The Compliance Layer is a new architectural tier being added to the SMS gateway platform. Because the ingestion API is asynchronous — tenants receive 202 immediately and learn of holds/blocks via the web portal — deployment does not risk breaking the tenant-facing API contract. Integration occurs in the sms-orchestrator NATS consumer pipeline.
The Compliance Layer is always fail-closed: no message is dispatched without an explicit verdict. This shapes the rollout — we introduce the layer first, then enforce rule-by-rule, rather than flip a global "enforce or don't enforce" switch.
2. Migration Phases
Phase 0 — Pre-Deployment Readiness (Week 0)
| Task | Owner | Status |
|---|---|---|
| Provision PostgreSQL compliance schema in shared DB | Platform DBA | ☐ |
| Provision Redis logical DB 3 for compliance-engine | Platform SRE | ☐ |
| Create NATS streams: COMPLIANCE_EVENTS | Platform SRE | ☐ |
| Provision GPU nodes in Kubernetes cluster | Platform SRE | ☐ |
| Deploy local LLM service (vLLM + Llama-3.1-8B-AWQ or equivalent) | Platform Engineering | ☐ |
| Validate local LLM throughput and latency under load | Platform Engineering | ☐ |
| Create Kubernetes namespace resources (ConfigMap, Secrets, PDB) | Platform SRE | ☐ |
| Curate initial keyword lists (gambling, fraud, phishing, terrorism — multiple languages) | Trust & Safety | ☐ |
| Seed default rule set (observation-only initially — action=FLAG on everything) | Trust & Safety | ☐ |
Phase 1 — Layer Deployment, Observation Mode (Week 1)
Deploy the Compliance Layer with the default rule set configured as observation-only — every rule uses action=FLAG, so no message is ever blocked or held. The verdict is evaluated and logged; the sms-orchestrator NATS consumer receives the verdict and proceeds to routing regardless.
Goal: Validate infrastructure, measure real-world latency and verdict distribution, detect any integration issues — without affecting a single message.
Changes:
- Deploy `compliance-engine` (3 replicas) + local LLM (2 GPU pods).
- Update `sms-orchestrator` NATS consumer to call `EvaluateCompliance` on every message.
- Consumer respects `ALLOW`/`FLAG`/`BLOCK`/`HOLD` verdicts, but all active rules are `FLAG`, so only ALLOW/FLAG are ever produced.
- All flagged evaluations land in `evaluation_log` for Trust & Safety to analyse.
Observation window: 7 days of production traffic.
Exit criteria:
- P95 evaluation latency ≤ 500 ms over 7 consecutive days
- Evaluation error rate < 0.1%
- No sms-orchestrator NATS consumer lag growth
- Trust & Safety team has identified candidate rules for activation (based on FLAG analysis)
- Local LLM cache hit rate ≥ 90%
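
The measurable exit criteria above can be checked mechanically. The following is a minimal sketch, assuming daily metrics are exported in a shape like the hypothetical `DailyMetrics` record below, and assuming the P95 target must hold on each of the 7 days individually (the plan does not say whether the percentile is per-day or pooled). Consumer lag and the Trust & Safety sign-off are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class DailyMetrics:
    """One day of Phase 1 observation data (hypothetical shape)."""
    latencies_ms: list   # per-evaluation latency samples, in milliseconds
    evaluations: int     # total evaluations attempted
    errors: int          # evaluations that failed
    cache_hits: int      # LLM lookups served from cache
    cache_lookups: int   # total LLM cache lookups


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def phase1_exit_ready(days):
    """True only if the measurable criteria hold over the last 7 days."""
    if len(days) < 7:
        return False
    window = days[-7:]
    # A day with no latency samples cannot demonstrate the target, so it fails.
    latency_ok = all(d.latencies_ms and p95(d.latencies_ms) <= 500 for d in window)
    evals = sum(d.evaluations for d in window)
    errors = sum(d.errors for d in window)
    error_ok = evals > 0 and errors / evals < 0.001      # < 0.1%
    lookups = sum(d.cache_lookups for d in window)
    hits = sum(d.cache_hits for d in window)
    cache_ok = lookups > 0 and hits / lookups >= 0.90    # >= 90%
    return bool(latency_ok and error_ok and cache_ok)
```

Pooling the error rate and cache hit rate over the window, rather than requiring them per day, is a design choice of this sketch; a stricter per-day check would be a small change.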
Phase 2 — Enable First BLOCK Rules (Week 2)
Activate a small, conservative set of BLOCK rules that target high-confidence violations only.
Recommended initial BLOCK rules:
- KEYWORD rule: known-spam phrases (very conservative wordlist)
- SENDER_ID rule: explicitly blacklisted senders (internal list)
- COUNTRY-level GEO_RESTRICTION (if applicable to deployment)
Per-rule rollout procedure:
- Author rule with `isActive=false`.
- Review with Trust & Safety lead.
- Enable rule via `POST /compliance/rules/:id/enable`.
- Monitor BLOCK rate for 24 h.
- Audit 10 random BLOCK verdicts and verify that each is a true positive.
- Keep if accuracy > 99% (with a 10-verdict sample, this means zero false positives); disable and adjust otherwise.
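
The audit and keep/disable steps of the procedure above can be sketched as follows. This is illustrative only: the function names, the verdict-ID representation, and the choice to treat an empty audit as "do not keep" are assumptions, not part of the plan.

```python
import random
from typing import Optional


def audit_sample(block_verdict_ids, sample_size: int = 10,
                 seed: Optional[int] = None):
    """Pick BLOCK verdicts for manual audit, sampling without replacement."""
    rng = random.Random(seed)
    k = min(sample_size, len(block_verdict_ids))
    return rng.sample(list(block_verdict_ids), k)


def keep_rule(true_positive_flags, threshold: float = 0.99) -> bool:
    """Keep the rule only if audited accuracy strictly exceeds the threshold."""
    if not true_positive_flags:
        # Assumption: with nothing audited there is no evidence to keep the rule.
        return False
    accuracy = sum(1 for ok in true_positive_flags if ok) / len(true_positive_flags)
    return accuracy > threshold
```

With 10 audited verdicts, a single false positive gives 90% accuracy, so the rule is disabled and adjusted.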
Tenant experience: Affected tenants see BLOCKED messages in their web portal with rule citation and appeal path.
Phase 3 — Enable HOLD Rules + Hold Queue Review Operations (Week 3)
Activate HOLD rules so Trust & Safety reviewers can manually adjudicate ambiguous cases.
Prerequisites:
- Compliance reviewer team staffed (≥ 2 reviewers during business hours)
- Review SLA defined (target: 95% of holds reviewed within 4 business hours)
- Reviewer training delivered
Recommended initial HOLD rules:
- RATE_VOLUME: sudden burst per sender_id (e.g., > 5,000 messages/5 min from single sender)
- DLR_ABUSE: accounts with a 24-hour delivery failure rate above 20%
- KEYWORD: ambiguous terms that need context review
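
As an illustration of the RATE_VOLUME example above (more than 5,000 messages in 5 minutes from a single sender), here is a minimal in-memory sliding-window sketch. The `BurstDetector` class is hypothetical; a production implementation would more likely keep its counters in a shared store such as the Redis instance provisioned in Phase 0, which this sketch does not attempt.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Per-sender sliding-window counter (illustrative, single-process only)."""

    def __init__(self, limit: int = 5000, window_s: float = 300.0):
        self.limit = limit          # messages allowed inside one window
        self.window_s = window_s    # window length in seconds
        self._events = defaultdict(deque)  # sender_id -> message timestamps

    def record(self, sender_id: str, now_s: float) -> bool:
        """Record one message; return True if the sender exceeds the limit."""
        q = self._events[sender_id]
        q.append(now_s)
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now_s - self.window_s:
            q.popleft()
        return len(q) > self.limit
```

A `True` result would map to a HOLD verdict for that message, leaving the final decision to a reviewer.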
Operational cadence:
- Daily review queue triage standup
- Weekly false-positive / false-negative audit
- Monthly rule tuning based on review outcomes
Phase 4 — Enable AI_CLASSIFICATION Rules (Week 4–5)
Roll out AI-powered content classification category by category.
Sequence:
- Enable PHISHING category with `action=HOLD`, `fallbackAction=HOLD` (fail-closed).
- Monitor for 5 business days: cache hit rate, LLM latency, hold review quality.
- If stable: enable MALWARE_LINK category.
- Progressive rollout: FINANCIAL_FRAUD → TERRORISM → SPAM → others.
- Optionally promote highest-confidence rules from HOLD to BLOCK once precision ≥ 99.5% verified.
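
The role of `fallbackAction` in the sequence above can be sketched as follows. This is a hedged illustration, not the engine's actual logic: the `classify` callable (returning a category-to-confidence mapping), the `threshold` parameter, and its default value are all assumptions introduced for the example.

```python
from typing import Callable


def evaluate_ai_rule(text: str,
                     classify: Callable[[str], dict],
                     category: str = "PHISHING",
                     action: str = "HOLD",
                     fallback_action: str = "HOLD",
                     threshold: float = 0.8) -> str:
    """Return the rule's verdict, falling back when the classifier is unusable."""
    try:
        scores = classify(text)
    except Exception:
        # LLM timeout or crash: fail closed via the configured fallback.
        return fallback_action
    if not isinstance(scores, dict):
        # Malformed classifier output is treated like a failure.
        return fallback_action
    if scores.get(category, 0.0) >= threshold:
        return action
    return "ALLOW"
```

Because `fallback_action` is HOLD, an unavailable or misbehaving LLM sends messages to the review queue instead of letting them through unevaluated.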
Local LLM operations:
- Daily GPU utilisation check
- Weekly accuracy audit against labelled dataset
- Quarterly model re-evaluation (consider upgrade to newer base model)
Phase 5 — Tenant Scoring Goes Live (Week 6)
The scoring worker has been running since Phase 1, gathering metrics with tier enforcement disabled. In this phase, automated tier actions are activated:
- RESTRICTED tier: bulk sends (>500/hr) auto-held; rate limits halved
- SUSPENDED tier: all messages auto-held pending admin review
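
The two tier actions above can be expressed as a small decision function. This is a sketch under stated assumptions: `TierDecision`, `apply_tier`, the per-second rate-limit unit, and the `enforcement_enabled` switch (standing in for the feature flag mentioned in the rollback plan) are hypothetical names. Any tier other than RESTRICTED or SUSPENDED is treated as unaffected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierDecision:
    hold: bool               # True if the message should be auto-held
    rate_limit_per_s: float  # effective rate limit after tier adjustment


def apply_tier(tier: str, hourly_volume: int, base_rate_limit_per_s: float,
               enforcement_enabled: bool = True) -> TierDecision:
    """Map a tenant's tier and current hourly volume to an enforcement decision."""
    if not enforcement_enabled:
        # Scoring still runs, but no automated action is taken.
        return TierDecision(False, base_rate_limit_per_s)
    if tier == "SUSPENDED":
        # All messages are held pending admin review.
        return TierDecision(True, base_rate_limit_per_s)
    if tier == "RESTRICTED":
        # Bulk sends (> 500/hr) are held; rate limits are halved.
        return TierDecision(hourly_volume > 500, base_rate_limit_per_s / 2)
    return TierDecision(False, base_rate_limit_per_s)
```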
Safety measures:
- Tier change notifications sent to tenant admins 48 h before first automated action affects their traffic
- Manual tier override capability verified
- Appeal workflow documented and linked from web portal notifications
Phase 6 — Steady State & Continuous Improvement (Week 7+)
- Weekly compliance report review by Trust & Safety
- Monthly rule tuning based on false-positive / false-negative analysis
- Quarterly regulatory review
- Quarterly LLM model evaluation
- Annual compliance framework audit
3. Database Migration Steps
All DDL managed through Prisma migrations, applied in order:
- `20260401000000_create_compliance_schema`: schema + enum types
- `20260401100000_create_rule_tables`: rule_sets, rules, rule_versions, tenant_rule_set_assignments
- `20260401200000_create_hold_queue`: hold_queue + RLS policies
- `20260401300000_create_evaluation_log`: partitioned evaluation_log
- `20260402000000_create_scoring_tables`: tenant_compliance_scores, score_history
- `20260402100000_create_blocklists_keywords`: blocklists, blocklist_entries, keyword_lists, keyword_entries
- `20260402200000_create_dlr_stats`: dlr_stats table
- `20260402300000_create_audit_log`: audit_log + immutability rules
- `20260403000000_seed_default_rule_set`: default rule set (all rules FLAG-only for Phase 1)
Each migration tested in staging with rollback SQL prepared.
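
Since the migrations must be applied in order, a pre-merge check can confirm that the directory names sort in the intended sequence. The sketch below assumes migrations are applied in lexicographic order of directory name and compares the 14-digit prefixes as plain digit strings. The helper names are hypothetical. Note that the prefixes are treated as sequence numbers, not parsed as timestamps, because a prefix such as `20260401300000` is not a valid clock time.

```python
MIGRATIONS = [
    "20260401000000_create_compliance_schema",
    "20260401100000_create_rule_tables",
    "20260401200000_create_hold_queue",
    "20260401300000_create_evaluation_log",
    "20260402000000_create_scoring_tables",
    "20260402100000_create_blocklists_keywords",
    "20260402200000_create_dlr_stats",
    "20260402300000_create_audit_log",
    "20260403000000_seed_default_rule_set",
]


def migration_prefix(name: str) -> str:
    """Return the 14-digit ordering prefix of a migration directory name."""
    prefix = name.split("_", 1)[0]
    if len(prefix) != 14 or not prefix.isdigit():
        raise ValueError("unexpected migration name: %r" % name)
    return prefix


def is_strictly_ordered(names) -> bool:
    """True if each migration's prefix is strictly greater than the previous one."""
    prefixes = [migration_prefix(n) for n in names]
    # Equal-length digit strings compare the same way as the numbers they spell.
    return all(a < b for a, b in zip(prefixes, prefixes[1:]))
```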
4. Existing Service Changes
sms-orchestrator
| Change | Complexity | Risk |
|---|---|---|
| Add ComplianceClient gRPC stub in NATS consumer | Low | Low |
| Integrate compliance call in NATS consumer pipeline (between dequeue and routing) | Medium | Low |
| New message states: EVALUATING, ON_HOLD, BLOCKED, AUTO_EXPIRED | Medium | Low |
| Fail-closed handling: do-not-ack on compliance error → NATS redelivery | Medium | Medium |
| Handle sms.outbound.retry with skipCompliance: true flag (for release flow) | Low | Low |
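
The fail-closed handling and the `skipCompliance` release path in the table above can be sketched as one decision function. This is an illustration under assumptions, not the orchestrator's code: the message dictionary shape, the `evaluate` callable standing in for the gRPC call, and the `ROUTING` state name are hypothetical. The other state names come from the table.

```python
from enum import Enum
from typing import Callable, Tuple


class Ack(Enum):
    ACK = "ack"  # message handled; remove from the stream
    NAK = "nak"  # do not acknowledge, so NATS redelivers it later


def handle_message(msg: dict, evaluate: Callable[[dict], str]) -> Tuple[Ack, str]:
    """Return (ack decision, next message state) for one dequeued message."""
    # Released messages re-enter on the retry subject and skip re-evaluation.
    if msg.get("subject") == "sms.outbound.retry" and msg.get("skipCompliance") is True:
        return Ack.ACK, "ROUTING"
    try:
        verdict = evaluate(msg)
    except Exception:
        # Compliance error: fail closed, never dispatch without a verdict.
        return Ack.NAK, "EVALUATING"
    if verdict in ("ALLOW", "FLAG"):
        return Ack.ACK, "ROUTING"
    if verdict == "HOLD":
        return Ack.ACK, "ON_HOLD"
    if verdict == "BLOCK":
        return Ack.ACK, "BLOCKED"
    # Unknown verdict: treat like an error and wait for redelivery.
    return Ack.NAK, "EVALUATING"
```

Honouring `skipCompliance` only on the retry subject is a design choice of this sketch, intended to keep the flag from acting as a bypass on ordinary outbound traffic.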
admin-dashboard
| Change | Complexity | Risk |
|---|---|---|
| Compliance management UI (rules, rule sets, blocklists, keyword lists) | High | Low |
| Hold queue review UI with priority sorting and bulk actions | Medium | Low |
| Tenant compliance score dashboard + ranking view | Medium | Low |
| Tier override interface | Low | Low |
| Audit log query interface | Low | Low |
customer-portal (tenant web portal)
| Change | Complexity | Risk |
|---|---|---|
| Display message states including BLOCKED, ON_HOLD, AUTO_EXPIRED | Medium | Low |
| Show rule citation and rationale on blocked/held messages | Medium | Low |
| Tenant compliance score visibility | Low | Low |
| Appeal / dispute submission flow (future) | Medium | Low |
notification-service
| Change | Complexity | Risk |
|---|---|---|
| Subscribe to compliance.message.held, compliance.message.blocked, compliance.message.rejected, compliance.message.released, compliance.message.expired | Low | Low |
| Push notifications to tenant web portal | Medium | Low |
| Tier change notifications to tenant admins | Low | Low |
billing-service
| Change | Complexity | Risk |
|---|---|---|
| Skip billing for BLOCKED messages | Low | Low |
| Defer billing for ON_HOLD until release | Medium | Low |
| Skip billing for AUTO_EXPIRED | Low | Low |
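
The three billing rules above amount to a lookup on message state. A minimal sketch follows. The action labels (`SKIP`, `DEFER`, `UNCHANGED`) and the function name are hypothetical, and any state not listed in the table is left to the existing billing logic.

```python
# State -> billing action for the new compliance states (labels are illustrative).
COMPLIANCE_BILLING_ACTIONS = {
    "BLOCKED": "SKIP",       # never dispatched, never billed
    "AUTO_EXPIRED": "SKIP",  # hold expired without release, never billed
    "ON_HOLD": "DEFER",      # bill only if and when the message is released
}


def billing_action(state: str) -> str:
    """Return the billing action for a message state."""
    # States outside the compliance table keep their current billing behaviour.
    return COMPLIANCE_BILLING_ACTIONS.get(state, "UNCHANGED")
```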
dlr-processor
- No changes required. Already publishes `sms.dlr.inbound`.
5. Rollback Plan
Because the Compliance Layer is fail-closed, rollback cannot simply disable the layer — doing so would block all traffic. Rollback is performed by deactivating rules, not by disabling the layer.
| Phase | Rollback Action | Rollback Time |
|---|---|---|
| Phase 2 (BLOCK rules) | Disable specific rule via POST /compliance/rules/:id/disable | < 1 min per rule |
| Phase 3 (HOLD rules) | Same as above | < 1 min per rule |
| Phase 4 (AI rules) | Disable AI rule or change fallbackAction to HOLD only (fail-closed) | < 1 min |
| Phase 5 (tenant scoring enforcement) | Disable tier actions via feature flag | < 5 min |
Full architectural rollback — removing the compliance-engine call from sms-orchestrator — requires a code change in sms-orchestrator and is not supported as an operational rollback. The Compliance Layer is structural to the platform.
6. Data Migration
None required. This is a net-new layer with no existing data to migrate. Historical messages in sms_messages are not retroactively evaluated.
7. Timeline Summary
| Week | Phase | Milestone |
|---|---|---|
| 0 | Pre-deployment readiness | Infrastructure + local LLM deployed |
| 1 | Phase 1 — Observation mode | Layer live; all rules FLAG-only; metrics gathered |
| 2 | Phase 2 — BLOCK rules | First deterministic BLOCK rules active |
| 3 | Phase 3 — HOLD rules | Review operations running |
| 4–5 | Phase 4 — AI rules | Content classification active category-by-category |
| 6 | Phase 5 — Scoring enforcement | Tier actions active |
| 7+ | Steady state | Ongoing tuning and reporting |