sms-firewall-service — Migration Plan
Version: 1.0
Status: Draft
Owner: Trust and Safety + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, SERVICE_READINESS.md
sms-firewall-service is greenfield. Today, the platform has no national perimeter firewall — only the in-service checks inside compliance-engine and routing-engine. The migration introduces the perimeter tier and gradually moves gating responsibility to it.
1. What Is Migrating
| Input | Source | Volume (estimate) | Notes |
|---|---|---|---|
| Baseline platform traffic signals | sms-orchestrator + smpp-connector submit/DLR logs (30 d retrospective) | Billions of events | For ML training + rule-tuning |
| Current in-compliance-engine blocklists | compliance-engine.blocklist_entries | ~50 k entries | Imported as seed for firewall blocklist |
| National blocklist federation partners | MNOs (AWCC, Roshan, Etisalat, MTN-AF, Salaam) + ATRA | TBD; agreements required | Phase 0 handshake |
| Initial rule set | Authored by T&S Lead + Security | ~20 rules v1 | Covers AIT, SIM-box, consent, sender-ID verification, geo, restricted-pattern |
| ML model v1 (AIT / SIM-box / OTP-harvest) | Trained on 30 d retrospective | 3 models | Each benchmarked on held-out test |
| National blocklist seed | Derived from platform-wide suspend events | ~10 k MSISDNs | Ghasi-seeded pending ATRA adoption |
2. Migration Phases
Phase 0 — Pre-migration (Weeks -6 to 0)
| Step | Owner | Output |
|---|---|---|
| 30-d traffic inventory for ML training corpus | SRE + ML Ops | Training dataset |
| v1 rule set authored and Legal-signed | T&S Lead + Security | 20 rules in firewall.rules |
| ML models v1 trained + validated (per-MNO fairness audit) | ML Ops | 3 models in Triton registry |
| Federation partner engagement (MNOs + ATRA) | Regulator Liaison | MoUs signed with ≥ 3 sources |
| Import compliance.blocklist_entries → firewall.blocklist_entries | Platform Eng | Seed blocklist populated |
| Service deployed to staging with design-partner tenants | SRE | Staging green |
Phase 1 — Observation (14 days)
| Step | Owner | Output |
|---|---|---|
| FilterInbound + EvaluateTransit return ALLOW for all traffic | Service | Zero enforcement |
| Audit rows captured with the would-have-been verdict | Service | Shadow-verdict distribution |
| Daily report: shadow-block rate per tenant; ML recall/precision; false-positive projection | T&S | Daily dashboard |
| Red-team corpus test (homoglyph, encoded, recursive) runs daily | Security | Bypass report |
Exit criteria. FP projection < 1%; red-team corpus shows 0 bypasses; regulator sign-off on v1 rules.
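The observation behaviour above can be sketched as follows: every call returns ALLOW, while the audit row keeps the verdict that would have been issued. This is a minimal sketch under stated assumptions; `AuditRow`, `shadow_evaluate`, and `shadow_block_rate` are hypothetical names, and the real audit schema is not defined in this plan.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRow:
    tenant_id: str
    shadow_verdict: str    # what rules/ML would have decided (hypothetical field)
    returned_verdict: str  # what the caller actually received


def shadow_evaluate(tenant_id: str, would_be_verdict: str) -> AuditRow:
    """Observation mode: always return ALLOW, but record the would-be verdict."""
    return AuditRow(tenant_id, would_be_verdict, "ALLOW")


def shadow_block_rate(rows: list[AuditRow]) -> dict[str, float]:
    """Per-tenant share of messages that would have been blocked."""
    total = Counter(r.tenant_id for r in rows)
    blocked = Counter(r.tenant_id for r in rows if r.shadow_verdict == "BLOCK")
    return {tenant: blocked[tenant] / n for tenant, n in total.items()}


rows = [
    shadow_evaluate("tenant-a", "ALLOW"),
    shadow_evaluate("tenant-a", "BLOCK"),
    shadow_evaluate("tenant-b", "ALLOW"),
]
rates = shadow_block_rate(rows)
```

Keeping both verdicts on the same row lets the daily report compute the shadow-block distribution without a second data source.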
Phase 2 — Enforcement: Transit MT only (7 days)
| Step | Owner | Output |
|---|---|---|
| EvaluateTransit enforces (routing-engine honours BLOCK) | Service | Outbound gating live |
| Inbound FilterInbound still shadow | Service | One direction at a time |
| Per-tenant block-rate dashboard + escalation tracker | Frontend | Daily triage |
| Tenant notification if their block-rate > 1% | Notification-service | Automatic alert |
| First signed federation export to ATRA | Regulator Liaison | Daily artefact |
Exit criteria. < 10 escalations per day; OTP P1 submit-to-DLR SLA unchanged; no regulator complaints.
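The tenant-notification row in the Phase 2 table amounts to a threshold check on each tenant's block-rate. A minimal sketch, assuming the daily counts arrive as a mapping of tenant to (blocked, total); the function name and input shape are illustrative, not part of the service API.

```python
BLOCK_RATE_ALERT_PERCENT = 1  # "block-rate > 1%" from the Phase 2 table


def tenants_to_notify(counts: dict[str, tuple[int, int]]) -> list[str]:
    """Return tenants whose block-rate exceeds the alert threshold.

    counts maps tenant id -> (blocked, total). Integer arithmetic avoids
    float rounding at the exact 1% boundary.
    """
    flagged = []
    for tenant, (blocked, total) in sorted(counts.items()):
        if total > 0 and blocked * 100 > total * BLOCK_RATE_ALERT_PERCENT:
            flagged.append(tenant)
    return flagged
```

Tenants with no traffic are skipped rather than treated as 0% or 100%, which seems the least surprising choice for an automatic alert.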
Phase 3 — Full Enforcement (ongoing)
| Step | Owner | Output |
|---|---|---|
| FilterInbound enforces on MO direction | Service | Full perimeter |
| National blocklist federation live (MNO + regulator sources) | Service | Daily sync |
| Citizen complaint intake via regulator-portal | Product + Legal | Public trust |
| ML drift monitoring + quarterly retraining | ML Ops | Steady-state ops |
Rollback via feature flags:
- FIREWALL_ENFORCEMENT_TRANSIT = on|off
- FIREWALL_ENFORCEMENT_INBOUND = on|off
- FIREWALL_ML_ENABLED = true|false (falls back to rule-based)
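One way to read the enforcement flags: a direction whose flag is off stays in shadow, so the caller sees ALLOW regardless of the computed verdict. The sketch below assumes this interpretation; the `Flags` type and the direction labels "TRANSIT" and "INBOUND" are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    enforcement_transit: bool = False  # FIREWALL_ENFORCEMENT_TRANSIT
    enforcement_inbound: bool = False  # FIREWALL_ENFORCEMENT_INBOUND


def effective_verdict(flags: Flags, direction: str, computed: str) -> str:
    """Return the verdict the caller sees; shadow directions always get ALLOW."""
    if direction == "TRANSIT":
        enforcing = flags.enforcement_transit
    elif direction == "INBOUND":
        enforcing = flags.enforcement_inbound
    else:
        raise ValueError(f"unknown direction: {direction}")
    return computed if enforcing else "ALLOW"
```

Under this reading, Phase 1 is both flags off, Phase 2 is transit on, and Phase 3 is both on, so rollback is a flag flip rather than a redeploy.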
3. Blocklist Seed Strategy
3.1 From existing compliance-engine blocklists
compliance.blocklist_entries already contains ~50 k entries used by the current SENDER_ID / RECIPIENT rules. These are imported verbatim into firewall.blocklist_entries with source=compliance-engine-seed and a 90-day TTL pending re-classification.
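The verbatim import with a source tag and a 90-day TTL could look roughly like this. The `kind`, `value`, and `expires_at` field names are assumptions for illustration; the actual column layout of either table is not given here.

```python
from datetime import datetime, timedelta, timezone

SEED_SOURCE = "compliance-engine-seed"
SEED_TTL = timedelta(days=90)


def to_firewall_entry(entry: dict, imported_at: datetime) -> dict:
    """Copy a compliance entry unchanged, adding the seed source and an expiry."""
    return {
        **entry,
        "source": SEED_SOURCE,
        "expires_at": imported_at + SEED_TTL,  # hypothetical field name
    }


example = to_firewall_entry(
    {"kind": "SENDER_ID", "value": "EXAMPLE-SENDER"},
    datetime(2026, 4, 21, tzinfo=timezone.utc),
)
```

Entries that are re-classified before the TTL lapses would presumably have their expiry cleared or extended; that step is outside this sketch.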
3.2 From platform suspend events
The last 6 months of sender.id.suspended.v1 and compliance.message.blocked.v1 events yield candidate entries; duplicates are merged. Suspend events seed the MSISDN blocklist.
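Merging duplicates can be sketched as grouping events by the number they refer to. This assumes each event payload exposes `msisdn` and `type` fields, which is a guess at the event shape rather than the documented schema.

```python
def merge_candidates(events: list[dict]) -> dict[str, dict]:
    """Collapse events for the same MSISDN into a single candidate entry."""
    merged: dict[str, dict] = {}
    for ev in events:
        key = ev["msisdn"]  # hypothetical payload field
        cand = merged.setdefault(
            key, {"msisdn": key, "event_types": set(), "hits": 0}
        )
        cand["event_types"].add(ev["type"])
        cand["hits"] += 1
    return merged
```

Keeping the hit count and the set of contributing event types gives reviewers some evidence per candidate when the seed list is later re-classified.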
3.3 From fraud-intel-service (ongoing)
Once both services are live, fraud.detected.* feeds new entries continuously. The migration formalises this feed as a first-class federation-source internal to the platform.
3.4 From MNO federation (Phase 3)
Each MNO publishes its own known-bad MSISDNs and grey-route originators. Ghasi imports them on a daily cadence.
4. ML Model Rollout
4.1 v1 model characteristics
| Model | Training window | Test set | Target precision | Target recall |
|---|---|---|---|---|
| AIT detector | 30 d traffic + labelled campaigns | 10 k labelled | ≥ 0.92 | ≥ 0.80 |
| SIM-box detector | 30 d inbound MO + labelled boxes | 5 k labelled | ≥ 0.88 | ≥ 0.75 |
| OTP-harvest detector | 30 d OTP delivery stats | 3 k labelled | ≥ 0.90 | ≥ 0.70 |
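Checking a model against the table's targets reduces to computing precision and recall from held-out confusion counts. A minimal sketch; the model keys and function names are illustrative.

```python
# (target precision, target recall) per model, taken from the table above.
TARGETS = {
    "ait": (0.92, 0.80),
    "sim_box": (0.88, 0.75),
    "otp_harvest": (0.90, 0.70),
}


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_targets(model: str, tp: int, fp: int, fn: int) -> bool:
    """True only when both precision and recall reach the model's targets."""
    precision, recall = precision_recall(tp, fp, fn)
    min_precision, min_recall = TARGETS[model]
    return precision >= min_precision and recall >= min_recall
```

For instance, an AIT detector with 850 true positives, 50 false positives, and 150 false negatives has precision of about 0.944 and recall of 0.85, so it passes.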
4.2 Serving infrastructure
- Triton Inference Server, 3 replicas, GPU-backed (ADR-0004 §6 "np-data").
- Circuit breaker; fallback to rule-based on outage.
- Detection mode (ML or FALLBACK) captured per audit row.
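The circuit breaker and fallback described above might be structured like this. It is a sketch under assumptions: the failure threshold, cooldown, and class name are invented for illustration, and `ml_fn` / `rule_fn` stand in for the Triton call and the rule engine.

```python
import time
from typing import Callable, Optional


class MLCircuitBreaker:
    """Open after `max_failures` consecutive ML errors; retry after `cooldown_s`."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker so the next call probes ML.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def score(self, msg, ml_fn, rule_fn):
        """Return (verdict, detection_mode), where mode is "ML" or "FALLBACK"."""
        if not self._is_open():
            try:
                verdict = ml_fn(msg)
                self.failures = 0
                return verdict, "ML"
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = self.clock()
        return rule_fn(msg), "FALLBACK"
```

Returning the mode alongside the verdict makes it straightforward to write it into the audit row on every call.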
4.3 Retraining cadence
- Quarterly (scheduled).
- On-demand when drift alert fires (F1 drop > 5%).
- Human-labelled corpus refreshed weekly via T&S review queue.
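The drift trigger can be sketched as a comparison of current F1 against a baseline. The plan does not say whether "F1 drop > 5%" is relative or absolute; the sketch below assumes a relative drop, and the function names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def drift_alert(baseline_f1: float, current_f1: float,
                max_drop: float = 0.05) -> bool:
    """True when F1 has fallen by more than max_drop relative to the baseline.

    Assumption: the 5% threshold is relative to the baseline F1.
    """
    if baseline_f1 <= 0:
        return False
    return (baseline_f1 - current_f1) / baseline_f1 > max_drop
```

With the AIT targets as a baseline (precision 0.92, recall 0.80, F1 about 0.856), a slide to precision 0.85 and recall 0.72 is a relative drop of roughly 9% and would fire the alert.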
5. Federation Onboarding
| Partner | Direction | Schema | Cadence | Auth |
|---|---|---|---|---|
| AWCC | Both (import + export) | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| Roshan | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| Etisalat AF | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| MTN AF | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| Salaam | Both | Ghasi canonical JSON Lines | Daily | mTLS + HSM-signed cert |
| ATRA | Both: import (national reg) + export (our list to ATRA) | ATRA schema (adapter) | Daily | ATRA-specific cert |
Onboarding sequence per partner (Weeks -4 to -2):
- MoU signed.
- Partner receives Ghasi public cert; sends theirs.
- Schema mapping reviewed; adapter implemented if needed.
- Dry-run file exchanged; both sides confirm parse.
- T-7d production-dry-run; ATRA / partner ACK.
- T-0 (Phase 2 start) live federation.
6. Downstream Consumer Migration
| Consumer | Change | Timing |
|---|---|---|
| routing-engine | Consult EvaluateTransit before selecting operator; honour verdicts | Phase 2 |
| channel-router-service | Consult FilterInbound on MO ingest | Phase 3 |
| compliance-engine | Consume firewall.audit.v1 for tenant scoring signal | Phase 2 |
| fraud-intel-service | Consume firewall.alert.v1 as ML feature | Phase 2 |
| sender-id-registry-service | Consume firewall.alert.v1 to adjust reputation | Phase 3 |
| regulator-portal-service | SIEM stream of firewall.audit.v1 | Phase 3 |
| admin-dashboard | Firewall admin workbench + NOC panel | Phase 0 |
7. Rollback Plan
7.1 During Phase 1
- No rollback needed (observe only).
7.2 During Phase 2
- FIREWALL_ENFORCEMENT_TRANSIT = off. routing-engine falls back to previous behaviour.
7.3 During Phase 3
- FIREWALL_ENFORCEMENT_INBOUND = off reverts to Phase 2.
- FIREWALL_ENFORCEMENT_TRANSIT = off reverts to Phase 1.
- FIREWALL_ML_ENABLED = false falls back to rule-based only.
7.4 Catastrophic
- Fall back to last-known-good Postgres backup.
- Replay firewall.* NATS events from JetStream (7 d retention).
- Tenant impact: possible < 1 h state gap.
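Because replay depends on the 7-day retention window, it is worth checking whether the chosen backup is recent enough for the stream to cover the whole gap. A rough sketch of that check; the function name is hypothetical and no NATS client calls are involved.

```python
from datetime import datetime, timedelta, timezone

JETSTREAM_RETENTION = timedelta(days=7)  # retention stated in the plan


def replay_window(backup_at: datetime, now: datetime) -> tuple[datetime, bool]:
    """Return (replay_start, fully_covered).

    Replay starts at the backup time when retention still reaches it;
    otherwise at the oldest retained event, leaving an unrecoverable gap.
    """
    oldest_retained = now - JETSTREAM_RETENTION
    fully_covered = backup_at >= oldest_retained
    return max(backup_at, oldest_retained), fully_covered
```

A backup older than the retention window still restores, but events between the backup and the oldest retained message cannot be replayed.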
8. Success Metrics for Migration
| Metric | Target | Measurement |
|---|---|---|
| Phase-1 shadow-verdict FP projection | < 1% | Daily report |
| Phase-2 tenant escalations per day | < 10 | Zendesk |
| Phase-2 OTP P1 SLA maintained | P95 ≤ 3 s | Platform dashboard |
| Federation daily sync success | ≥ 99% | Service metric |
| Adversarial corpus bypasses | 0 | CI gate |
| ML precision (AIT / SIM-box / OTP-harvest) | ≥ 0.92 / 0.88 / 0.90 | Held-out test |
| Migration phase transition vs plan | Within ±5 d | Project tracker |
9. Dependencies
- ATRA + MNO federation MoUs (Regulator Liaison).
- HSM provisioned for federation signing (ADR-0004 §11).
- GPU node pool for ML serving (ADR-0004 §6 "np-data").
- Service mesh + SPIRE SVIDs for mTLS (ADR-0004 §12).
- fraud-intel-service live and emitting fraud.detected.* (Wave 1 dependency).
- consent-ledger-service live with CheckConsent (Wave 1 dependency).
- sender-id-registry-service live with Verify (Wave 1 dependency).
A blocked dependency blocks the phase in which it first appears.