sms-firewall-service — Migration Plan

Version: 1.0
Status: Draft
Owner: Trust and Safety + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, SERVICE_READINESS.md

sms-firewall-service is greenfield. Today, the platform has no national perimeter firewall — only the in-service checks inside compliance-engine and routing-engine. The migration introduces the perimeter tier and gradually moves gating responsibility to it.


1. What Is Migrating

| Input | Source | Volume (estimate) | Notes |
| --- | --- | --- | --- |
| Baseline platform traffic signals | sms-orchestrator + smpp-connector submit/DLR logs (30 d retrospective) | Billions of events | For ML training + rule-tuning |
| Current in-compliance-engine blocklists | compliance-engine.blocklist_entries | ~50 k entries | Imported as seed for firewall blocklist |
| National blocklist federation partners | MNOs (AWCC, Roshan, Etisalat, MTN-AF, Salaam) + ATRA | TBD; agreements required | Phase 0 handshake |
| Initial rule set | Authored by T&S Lead + Security | ~20 rules v1 | Covers AIT, SIM-box, consent, sender-ID verification, geo, restricted-pattern |
| ML model v1 (AIT / SIM-box / OTP-harvest) | Trained on 30 d retrospective | 3 models | Each benchmarked on held-out test |
| National blocklist seed | Derived from platform-wide suspend events | ~10 k MSISDNs | Ghasi-seeded pending ATRA adoption |

2. Migration Phases

Phase 0 — Pre-migration (Weeks -6 to 0)

| Step | Owner | Output |
| --- | --- | --- |
| 30-d traffic inventory for ML training corpus | SRE + ML Ops | Training dataset |
| v1 rule set authored and Legal-signed | T&S Lead + Security | 20 rules in firewall.rules |
| ML models v1 trained + validated (per-MNO fairness audit) | ML Ops | 3 models in Triton registry |
| Federation partner engagement (MNOs + ATRA) | Regulator Liaison | MoUs signed with ≥ 3 sources |
| Import compliance.blocklist_entries → firewall.blocklist_entries | Platform Eng | Seed blocklist populated |
| Service deployed to staging with design-partner tenants | SRE | Staging green |

Phase 1 — Observation (14 days)

| Step | Owner | Output |
| --- | --- | --- |
| FilterInbound + EvaluateTransit return ALLOW for all traffic | Service | Zero enforcement |
| Audit rows captured with the would-have-been verdict | Service | Shadow-verdict distribution |
| Daily report: shadow-block rate per tenant; ML recall/precision; false-positive projection | T&S | Daily dashboard |
| Red-team corpus test (homoglyph, encoded, recursive) runs daily | Security | Bypass report |

Exit criteria. FP projection < 1%; red-team corpus shows 0 bypasses; regulator sign-off on v1 rules.
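
The Phase 1 exit check can be sketched as below. This is a minimal illustration, not the service's reporting code: the `ShadowRow` fields are hypothetical, and it assumes the "FP projection" is the share of all sampled traffic that received a shadow BLOCK but was labelled legitimate by a reviewer. If the projection is defined differently (for example, as a share of shadow blocks only), the denominator changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRow:
    """One observation-mode audit row (field names are hypothetical)."""
    tenant_id: str
    shadow_verdict: str        # "ALLOW" or "BLOCK": what enforcement would have done
    labelled_legitimate: bool  # reviewer label for the message


def fp_projection(rows: list[ShadowRow]) -> float:
    """Assumed definition: legitimate messages with a shadow BLOCK / all sampled rows."""
    if not rows:
        return 0.0
    false_positives = sum(
        1 for r in rows if r.shadow_verdict == "BLOCK" and r.labelled_legitimate
    )
    return false_positives / len(rows)


def phase1_exit_ok(rows: list[ShadowRow], red_team_bypasses: int,
                   regulator_signed_off: bool) -> bool:
    """All three exit criteria from the plan must hold at once."""
    return (
        fp_projection(rows) < 0.01
        and red_team_bypasses == 0
        and regulator_signed_off
    )
```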

Phase 2 — Enforcement: Transit MT only (7 days)

| Step | Owner | Output |
| --- | --- | --- |
| EvaluateTransit enforces (routing-engine honours BLOCK) | Service | Outbound gating live |
| Inbound FilterInbound still shadow | Service | One direction at a time |
| Per-tenant block-rate dashboard + escalation tracker | Frontend | Daily triage |
| Tenant notification if their block-rate > 1% | Notification-service | Automatic alert |
| First signed federation export to ATRA | Regulator Liaison | Daily artefact |

Exit criteria. < 10 escalations per day; OTP P1 submit-to-DLR SLA unchanged; no regulator complaints.
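
The per-tenant notification rule in this phase (block rate above 1%) reduces to a small aggregation. The sketch below assumes verdicts arrive as `(tenant_id, verdict)` pairs, which is an illustrative shape rather than the audit schema, and it applies no minimum-volume guard, which a real alert would probably want for low-traffic tenants.

```python
from collections import Counter


def tenants_to_notify(verdicts, threshold=0.01):
    """Return {tenant_id: block_rate} for tenants whose block rate exceeds the threshold.

    `verdicts` is an iterable of (tenant_id, verdict) pairs; the pair shape is hypothetical.
    """
    total = Counter()
    blocked = Counter()
    for tenant_id, verdict in verdicts:
        total[tenant_id] += 1
        if verdict == "BLOCK":
            blocked[tenant_id] += 1
    rates = {tenant: blocked[tenant] / total[tenant] for tenant in total}
    return {tenant: rate for tenant, rate in rates.items() if rate > threshold}
```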

Phase 3 — Full Enforcement (ongoing)

| Step | Owner | Output |
| --- | --- | --- |
| FilterInbound enforces on MO direction | Service | Full perimeter |
| National blocklist federation live (MNO + regulator sources) | Service | Daily sync |
| Citizen complaint intake via regulator-portal | Product + Legal | Public trust |
| ML drift monitoring + quarterly retraining | ML Ops | Steady-state ops |

Rollback via feature flags:

  • FIREWALL_ENFORCEMENT_TRANSIT = on|off
  • FIREWALL_ENFORCEMENT_INBOUND = on|off
  • FIREWALL_ML_ENABLED = true|false (falls back to rule-based)
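
One way to read the two enforcement flags is as a gate between the computed verdict and the verdict the caller acts on. The sketch below is an assumption about how that gate might look, not the service's code: the `Flags` type and the `"transit"` / `"inbound"` direction strings are hypothetical, and it covers only the two enforcement flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    enforcement_transit: bool = False  # FIREWALL_ENFORCEMENT_TRANSIT
    enforcement_inbound: bool = False  # FIREWALL_ENFORCEMENT_INBOUND


def effective_verdict(computed: str, direction: str, flags: Flags) -> str:
    """Return the verdict to act on; `computed` is still audited as the shadow verdict."""
    if direction == "transit":
        enforcing = flags.enforcement_transit
    elif direction == "inbound":
        enforcing = flags.enforcement_inbound
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    # With enforcement off, the direction stays in shadow mode and everything is allowed.
    return computed if enforcing else "ALLOW"
```

Under this reading, switching a flag off never changes what is computed or audited, only what is enforced, which is what makes the flag a safe rollback lever.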

3. Blocklist Seed Strategy

3.1 From existing compliance-engine blocklists

compliance.blocklist_entries already contains ~50 k entries used by the current SENDER_ID / RECIPIENT rules. These are imported verbatim into firewall.blocklist_entries with source=compliance-engine-seed and a 90-day TTL pending re-classification.
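
A minimal sketch of the seed import, assuming each source row exposes a `value` and a `kind`; those field names and the `FirewallBlocklistEntry` type are illustrative, and only the `source` tag and the 90-day TTL come from this plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SEED_SOURCE = "compliance-engine-seed"
SEED_TTL = timedelta(days=90)  # entries expire unless re-classified


@dataclass(frozen=True)
class FirewallBlocklistEntry:
    value: str        # hypothetical column: the blocked sender ID or recipient
    kind: str         # hypothetical column: e.g. "SENDER_ID" or "RECIPIENT"
    source: str
    expires_at: datetime


def import_seed(rows, imported_at: datetime) -> list[FirewallBlocklistEntry]:
    """Copy rows verbatim, tagging each with the seed source and a 90-day expiry."""
    return [
        FirewallBlocklistEntry(
            value=row["value"],
            kind=row["kind"],
            source=SEED_SOURCE,
            expires_at=imported_at + SEED_TTL,
        )
        for row in rows
    ]
```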

3.2 From platform suspend events

The last 6 months of sender.id.suspended.v1 and compliance.message.blocked.v1 events yield candidate entries. Duplicates are merged, and the suspend events seed the MSISDN blocklist.
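
The duplicate merge could look like the sketch below. It assumes each event carries an `msisdn` and the `subject` it arrived on; both field names are hypothetical, as is the normalisation rule (strip spaces, dashes, and a leading `+`).

```python
def normalise_msisdn(raw: str) -> str:
    """Strip spaces, dashes and a leading '+' so equal numbers compare equal."""
    return raw.replace(" ", "").replace("-", "").lstrip("+")


def merge_candidates(events):
    """Collapse events for the same MSISDN into one candidate entry.

    Each event is a dict with hypothetical keys 'msisdn' and 'subject'.
    """
    merged = {}
    for event in events:
        key = normalise_msisdn(event["msisdn"])
        entry = merged.setdefault(key, {"msisdn": key, "subjects": set(), "hits": 0})
        entry["subjects"].add(event["subject"])
        entry["hits"] += 1
    return merged
```

Keeping the hit count and the contributing subjects on each merged entry preserves some evidence for later re-classification.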

3.3 From fraud-intel-service (ongoing)

Once both services are live, fraud.detected.* feeds new entries continuously. The migration formalises this feed as a first-class federation source that is internal to the platform.

3.4 From MNO federation (Phase 3)

Each MNO publishes its own known-bad MSISDNs and grey-route originators. Ghasi imports them on a daily cadence.


4. ML Model Rollout

4.1 v1 model characteristics

| Model | Training window | Test set | Target precision | Target recall |
| --- | --- | --- | --- | --- |
| AIT detector | 30 d traffic + labelled campaigns | 10 k labelled | ≥ 0.92 | ≥ 0.80 |
| SIM-box detector | 30 d inbound MO + labelled boxes | 5 k labelled | ≥ 0.88 | ≥ 0.75 |
| OTP-harvest detector | 30 d OTP delivery stats | 3 k labelled | ≥ 0.90 | ≥ 0.70 |
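
The targets above translate directly into a benchmark gate on the held-out test set. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the model keys in `TARGETS` are illustrative names, not registry identifiers.

```python
# (target precision, target recall) per model, taken from the table above.
TARGETS = {
    "ait": (0.92, 0.80),
    "sim_box": (0.88, 0.75),
    "otp_harvest": (0.90, 0.70),
}


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from confusion counts, guarding empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_targets(model: str, tp: int, fp: int, fn: int) -> bool:
    """True only if both the precision and the recall target are met."""
    precision, recall = precision_recall(tp, fp, fn)
    target_precision, target_recall = TARGETS[model]
    return precision >= target_precision and recall >= target_recall
```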

4.2 Serving infrastructure

  • Triton Inference Server, 3 replicas, GPU-backed (ADR-0004 §6 "np-data").
  • Circuit breaker; fallback to rule-based on outage.
  • Detection mode (ML or FALLBACK) captured per audit row.
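
The circuit breaker and fallback described above can be sketched as follows. This is a generic consecutive-failure breaker under assumed thresholds, not the service's implementation: `ml_score` and `rule_score` are hypothetical callables standing in for the Triton call and the rule engine.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call once the cool-down passes."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def detect(message, ml_score, rule_score, breaker):
    """Return (score, detection_mode); the mode is what the audit row would record."""
    if breaker.allow():
        try:
            score = ml_score(message)
            breaker.record_success()
            return score, "ML"
        except Exception:
            breaker.record_failure()
    return rule_score(message), "FALLBACK"
```

While the breaker is open, the ML backend is not called at all, so an outage costs one rule evaluation per message rather than one timeout per message.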

4.3 Retraining cadence

  • Quarterly (scheduled).
  • On-demand when drift alert fires (F1 drop > 5%).
  • Human-labelled corpus refreshed weekly via T&S review queue.
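
The drift trigger can be expressed as a comparison of F1 scores, where F1 = 2PR / (P + R). The plan does not say whether "F1 drop > 5%" is relative or absolute; the sketch below assumes a relative drop against a stored baseline, and the function names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def drift_alert(baseline_f1: float, current_f1: float,
                max_relative_drop: float = 0.05) -> bool:
    """Assumed rule: fire when F1 has fallen by more than 5% of its baseline value."""
    if baseline_f1 <= 0:
        return False
    return (baseline_f1 - current_f1) / baseline_f1 > max_relative_drop
```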

5. Federation Onboarding

| Partner | Direction | Schema | Cadence | Auth |
| --- | --- | --- | --- | --- |
| AWCC | Both (import + export) | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| Roshan | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| Etisalat AF | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| MTN AF | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| Salaam | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| ATRA | Import only (national reg) + Export (our list to ATRA) | ATRA schema (adapter) | Daily | ATRA-specific cert |

Onboarding sequence per partner (Weeks -4 to -2):

  1. MoU signed.
  2. Partner receives Ghasi public cert; sends theirs.
  3. Schema mapping reviewed; adapter implemented if needed.
  4. Dry-run file exchanged; both sides confirm parse.
  5. T-7d production-dry-run; ATRA / partner ACK.
  6. T-0 (Phase 2 start) live federation.
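
Step 4 (both sides confirm parse) can be exercised with a small JSON Lines checker like the sketch below. The Ghasi canonical schema is not given in this plan, so `REQUIRED_FIELDS` is purely a placeholder; only the one-JSON-object-per-line framing is taken from the table above.

```python
import json

# Placeholder field names; the real canonical schema is not specified in this plan.
REQUIRED_FIELDS = ("msisdn", "reason", "reported_at")


def dry_run_parse(lines):
    """Parse JSON Lines input and return (records, errors) with 1-based line numbers."""
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "not a JSON object"))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            errors.append((number, "missing fields: " + ", ".join(missing)))
            continue
        records.append(record)
    return records, errors
```

An empty `errors` list on both sides would be the natural "parse confirmed" signal for the dry run.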

6. Downstream Consumer Migration

| Consumer | Change | Timing |
| --- | --- | --- |
| routing-engine | Consult EvaluateTransit before selecting operator; honour verdicts | Phase 2 |
| channel-router-service | Consult FilterInbound on MO ingest | Phase 3 |
| compliance-engine | Consume firewall.audit.v1 for tenant scoring signal | Phase 2 |
| fraud-intel-service | Consume firewall.alert.v1 as ML feature | Phase 2 |
| sender-id-registry-service | Consume firewall.alert.v1 to adjust reputation | Phase 3 |
| regulator-portal-service | SIEM stream of firewall.audit.v1 | Phase 3 |
| admin-dashboard | Firewall admin workbench + NOC panel | Phase 0 |

7. Rollback Plan

7.1 During Phase 1

  • No rollback needed (observe only).

7.2 During Phase 2

  • FIREWALL_ENFORCEMENT_TRANSIT = off.
  • routing-engine falls back to previous behaviour.

7.3 During Phase 3

  • FIREWALL_ENFORCEMENT_INBOUND = off reverts to Phase 2.
  • FIREWALL_ENFORCEMENT_TRANSIT = off (with FIREWALL_ENFORCEMENT_INBOUND also off) reverts to Phase 1.
  • FIREWALL_ML_ENABLED = false falls back to rule-based only.
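
The rollback steps above imply a mapping from flag state to effective phase, sketched below. The labels are illustrative, and treating "inbound on, transit off" as an unplanned state is an inference from the phase order rather than something this plan states.

```python
def effective_phase(transit_on: bool, inbound_on: bool) -> str:
    """Map the two enforcement flags to the phase the platform is effectively in."""
    if transit_on and inbound_on:
        return "phase-3 (full enforcement)"
    if transit_on:
        return "phase-2 (transit MT only)"
    if not inbound_on:
        return "phase-1 (observation)"
    # Inbound enforcement without transit enforcement matches no phase in the plan.
    return "unplanned (inbound only)"
```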

7.4 Catastrophic

  • Fall back to last-known-good Postgres backup.
  • Replay firewall.* NATS events from JetStream (7 d retention).
  • Tenant impact: possible < 1 h state gap.
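
Restore-then-replay only closes the gap if the backup is no older than the JetStream retention window; events published before that window are no longer available to replay. A minimal check, with hypothetical names, might be:

```python
from datetime import datetime, timedelta, timezone

JETSTREAM_RETENTION = timedelta(days=7)  # retention stated in this plan


def replay_covers_gap(backup_taken_at: datetime, now: datetime,
                      retention: timedelta = JETSTREAM_RETENTION) -> bool:
    """True if every event published since the backup is still inside the retention window."""
    return now - backup_taken_at <= retention
```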

8. Success Metrics for Migration

| Metric | Target | Measurement |
| --- | --- | --- |
| Phase-1 shadow-verdict FP projection | < 1% | Daily report |
| Phase-2 tenant escalations per day | < 10 | Zendesk |
| Phase-2 OTP P1 SLA maintained | P95 ≤ 3 s | Platform dashboard |
| Federation daily sync success | ≥ 99% | Service metric |
| Adversarial corpus bypasses | 0 | CI gate |
| ML precision (per model) | ≥ 0.92 / 0.88 / 0.90 | Held-out test |
| Migration phase transition | ±5 d of plan | Project tracker |

9. Dependencies

  • ATRA + MNO federation MoUs (Regulator Liaison).
  • HSM provisioned for federation signing (ADR-0004 §11).
  • GPU node pool for ML serving (ADR-0004 §6 "np-data").
  • Service mesh + SPIRE SVIDs for mTLS (ADR-0004 §12).
  • fraud-intel-service live and emitting fraud.detected.* (Wave 1 dependency).
  • consent-ledger-service live with CheckConsent (Wave 1 dependency).
  • sender-id-registry-service live with Verify (Wave 1 dependency).

A blocked dependency blocks the phase in which it first appears.