sms-firewall-service — Migration Plan

Version: 1.0
Status: Draft
Owner: Trust and Safety + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, SERVICE_READINESS.md

sms-firewall-service is greenfield. Today, the platform has no national perimeter firewall — only the in-service checks inside compliance-engine and routing-engine. The migration introduces the perimeter tier and gradually moves gating responsibility to it.

1. What Is Migrating

Input	Source	Volume (estimate)	Notes
Baseline platform traffic signals	`sms-orchestrator` + `smpp-connector` submit/DLR logs (30 d retrospective)	Billions of events	For ML training + rule-tuning
Current in-compliance-engine blocklists	`compliance.blocklist_entries`	~50 k entries	Imported as seed for firewall blocklist
National blocklist federation partners	MNOs (AWCC, Roshan, Etisalat, MTN-AF, Salaam) + ATRA	TBD; agreements required	Phase 0 handshake
Initial rule set	Authored by T&S Lead + Security	~20 rules v1	Covers AIT, SIM-box, consent, sender-ID verification, geo, restricted-pattern
ML model v1 (AIT / SIM-box / OTP-harvest)	Trained on 30 d retrospective	3 models	Each benchmarked on a held-out test set
National blocklist seed	Derived from platform-wide suspend events	~10 k MSISDNs	Ghasi-seeded pending ATRA adoption

2. Migration Phases

Phase 0 — Pre-migration (Weeks -6 to 0)

Step	Owner	Output
30-d traffic inventory for ML training corpus	SRE + ML Ops	Training dataset
v1 rule set authored and Legal-signed	T&S Lead + Security	20 rules in `firewall.rules`
ML models v1 trained + validated (per-MNO fairness audit)	ML Ops	3 models in Triton registry
Federation partner engagement (MNOs + ATRA)	Regulator Liaison	MoUs signed with ≥ 3 sources
Import `compliance.blocklist_entries` → `firewall.blocklist_entries`	Platform Eng	Seed blocklist populated
Service deployed to staging with design-partner tenants	SRE	Staging green

Phase 1 — Observation (14 days)

Step	Owner	Output
`FilterInbound` + `EvaluateTransit` return `ALLOW` for all traffic	Service	Zero enforcement
Audit rows captured with the would-have-been verdict	Service	Shadow-verdict distribution
Daily report: shadow-block rate per tenant; ML recall/precision; false-positive projection	T&S	Daily dashboard
Red-team corpus test (homoglyph, encoded, recursive) runs daily	Security	Bypass report

Exit criteria. FP projection < 1%; red-team corpus shows 0 bypasses; regulator sign-off on v1 rules.
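The observation behaviour in the table above can be sketched roughly as follows. This is only an illustration under stated assumptions: `AuditRow`, `filter_with_shadow`, `evaluate_rules`, and the in-memory audit list are hypothetical names, not the service's actual API. Only the ALLOW/BLOCK verdicts and the idea of recording the would-have-been verdict come from the plan.

```python
from dataclasses import dataclass
from typing import Callable, List

ALLOW = "ALLOW"
BLOCK = "BLOCK"


@dataclass
class AuditRow:
    message_id: str
    shadow_verdict: str    # what the firewall would have decided
    returned_verdict: str  # what the caller actually received


def filter_with_shadow(message_id: str, text: str,
                       evaluate: Callable[[str], str],
                       audit_log: List[AuditRow],
                       enforce: bool = False) -> str:
    """Evaluate a message, always audit the real verdict, enforce only if asked."""
    shadow = evaluate(text)
    returned = shadow if enforce else ALLOW
    audit_log.append(AuditRow(message_id, shadow, returned))
    return returned


def evaluate_rules(text: str) -> str:
    # Hypothetical placeholder rule, for illustration only.
    return BLOCK if "free-prize" in text else ALLOW
```

With `enforce=False` every caller sees ALLOW, while the audit rows still carry the verdict distribution needed for the daily shadow report.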

Phase 2 — Enforcement: Transit MT only (7 days)

Step	Owner	Output
`EvaluateTransit` enforces (routing-engine honours BLOCK)	Service	Outbound gating live
Inbound `FilterInbound` still shadow	Service	One direction at a time
Per-tenant block-rate dashboard + escalation tracker	Frontend	Daily triage
Tenant notification if their block-rate > 1%	Notification-service	Automatic alert
First signed federation export to ATRA	Regulator Liaison	Daily artefact

Exit criteria. < 10 escalations per day; OTP P1 submit-to-DLR SLA unchanged; no regulator complaints.
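The tenant notification row in the Phase 2 table (block-rate > 1%) implies a per-tenant ratio of blocked to total verdicts. A minimal sketch of that calculation, assuming verdicts are available as `(tenant_id, verdict)` pairs; the function names and the input shape are hypothetical, and only the 1% threshold comes from the plan.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

BLOCK_RATE_ALERT_THRESHOLD = 0.01  # 1%, from the Phase 2 table


def block_rates(verdicts: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the per-tenant block rate from (tenant_id, verdict) pairs."""
    total: Counter = Counter()
    blocked: Counter = Counter()
    for tenant_id, verdict in verdicts:
        total[tenant_id] += 1
        if verdict == "BLOCK":
            blocked[tenant_id] += 1
    return {tenant: blocked[tenant] / total[tenant] for tenant in total}


def tenants_to_notify(verdicts: Iterable[Tuple[str, str]]) -> List[str]:
    """Tenants whose block rate is strictly above the alert threshold."""
    rates = block_rates(verdicts)
    return sorted(t for t, rate in rates.items() if rate > BLOCK_RATE_ALERT_THRESHOLD)
```

The comparison is strict, so a tenant sitting at exactly 1% is not alerted; whether the boundary should be inclusive is a policy choice the plan does not spell out.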

Phase 3 — Full Enforcement (ongoing)

Step	Owner	Output
`FilterInbound` enforces on MO direction	Service	Full perimeter
National blocklist federation live (MNO + regulator sources)	Service	Daily sync
Citizen complaint intake via regulator-portal	Product + Legal	Public trust
ML drift monitoring + quarterly retraining	ML Ops	Steady-state ops

Rollback via feature flags:

FIREWALL_ENFORCEMENT_TRANSIT = on|off
FIREWALL_ENFORCEMENT_INBOUND = on|off
FIREWALL_ML_ENABLED = true|false (falls back to rule-based)

3. Blocklist Seed Strategy

3.1 From existing compliance-engine blocklists

compliance.blocklist_entries already contains ~50 k entries used by the current SENDER_ID / RECIPIENT rules. These are imported verbatim into firewall.blocklist_entries with source=compliance-engine-seed and a 90-day TTL pending re-classification.
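As a rough sketch of the verbatim import with its source tag and 90-day TTL: the entry shape below (`entry_type`, `value`, `expires_at`) is an assumption made for illustration, since the plan does not list the columns of either table. Only the source label and the TTL are taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SEED_SOURCE = "compliance-engine-seed"  # from the plan
SEED_TTL = timedelta(days=90)           # from the plan


@dataclass(frozen=True)
class FirewallBlocklistEntry:
    entry_type: str      # hypothetical column, e.g. "SENDER_ID" or "RECIPIENT"
    value: str           # hypothetical column holding the blocked value
    source: str
    expires_at: datetime


def to_seed_entry(entry_type: str, value: str,
                  imported_at: datetime) -> FirewallBlocklistEntry:
    """Copy a compliance entry unchanged, tagging its source and 90-day expiry."""
    return FirewallBlocklistEntry(entry_type, value, SEED_SOURCE,
                                  imported_at + SEED_TTL)
```

For example, an entry imported on 2026-04-21 (UTC) would expire on 2026-07-20 unless re-classified first.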

3.2 From platform suspend events

The last 6 months of sender.id.suspended.v1 and compliance.message.blocked.v1 events yield candidate entries. Duplicates are merged, and the resulting suspend events seed the MSISDN blocklist.
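One way the duplicate merge could look, assuming each event can be reduced to an `(msisdn, event_type)` pair. That payload shape and the normalisation rule (dropping spaces and a leading `+`) are assumptions, not something the plan specifies.

```python
from typing import Dict, Iterable, List, Tuple


def normalise_msisdn(raw: str) -> str:
    """Hypothetical normalisation: remove spaces and a leading '+'."""
    return raw.replace(" ", "").lstrip("+")


def merge_candidates(events: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collapse events into one candidate per MSISDN.

    The distinct event types seen for each MSISDN are kept, in first-seen
    order, as evidence for later review.
    """
    merged: Dict[str, List[str]] = {}
    for raw_msisdn, event_type in events:
        evidence = merged.setdefault(normalise_msisdn(raw_msisdn), [])
        if event_type not in evidence:
            evidence.append(event_type)
    return merged
```

Keeping the contributing event types alongside each merged entry makes it easier to explain later why a number was seeded.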

3.3 From fraud-intel-service (ongoing)

Once both services are live, fraud.detected.* feeds new entries continuously. The migration formalises this feed as a first-class federation-source internal to the platform.

3.4 From MNO federation (Phase 3)

Each MNO publishes its own list of known-bad MSISDNs and grey-route originators. Ghasi imports these lists daily.

4. ML Model Rollout

4.1 v1 model characteristics

Model	Training window	Test set	Target precision	Target recall
AIT detector	30 d traffic + labelled campaigns	10 k labelled	≥ 0.92	≥ 0.80
SIM-box detector	30 d inbound MO + labelled boxes	5 k labelled	≥ 0.88	≥ 0.75
OTP-harvest detector	30 d OTP delivery stats	3 k labelled	≥ 0.90	≥ 0.70
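The targets in the table can be checked against a held-out test set with the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below is illustrative only: the model keys and function names are hypothetical, while the thresholds are copied from the table.

```python
from typing import Dict, Tuple

# (minimum precision, minimum recall) per model, from the table above.
TARGETS: Dict[str, Tuple[float, float]] = {
    "ait": (0.92, 0.80),
    "sim_box": (0.88, 0.75),
    "otp_harvest": (0.90, 0.70),
}


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_targets(model: str, tp: int, fp: int, fn: int) -> bool:
    """True when both precision and recall reach the model's targets."""
    precision, recall = precision_recall(tp, fp, fn)
    min_precision, min_recall = TARGETS[model]
    return precision >= min_precision and recall >= min_recall
```

A model that clears precision but misses recall (or the reverse) fails the gate, since the table states both as lower bounds.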

4.2 Serving infrastructure

Triton Inference Server, 3 replicas, GPU-backed (ADR-0004 §6 "np-data").
Circuit breaker; fallback to rule-based on outage.
Detection mode (ML or FALLBACK) captured per audit row.
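The circuit breaker and fallback described above might be sketched as below. This is a simplified illustration, not the service's design: the class name, the consecutive-failure threshold, and the callables are hypothetical, and recovery (half-open probing) is left out for brevity. The ML and FALLBACK detection-mode labels are the ones named in the plan.

```python
from typing import Any, Callable, Tuple


class MLCircuitBreaker:
    """Opens after N consecutive ML failures; scoring then uses rules only."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold  # hypothetical default
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def score(self, features: Any,
              ml_call: Callable[[Any], str],
              rule_call: Callable[[Any], str]) -> Tuple[str, str]:
        """Return (verdict, detection_mode) for the audit row."""
        if not self.is_open:
            try:
                verdict = ml_call(features)
                self.consecutive_failures = 0
                return verdict, "ML"
            except Exception:
                # Any inference error counts towards opening the breaker.
                self.consecutive_failures += 1
        return rule_call(features), "FALLBACK"
```

Returning the detection mode together with the verdict keeps the audit row accurate even when a single request falls back before the breaker opens.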

4.3 Retraining cadence

Quarterly (scheduled).
On-demand when drift alert fires (F1 drop > 5%).
Human-labelled corpus refreshed weekly via T&S review queue.
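A small sketch of the drift trigger, using F1 = 2PR / (P + R). The plan does not say whether "F1 drop > 5%" is relative to the baseline or an absolute difference; the code below assumes a relative drop, and the function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def drift_alert(baseline_f1: float, current_f1: float,
                max_relative_drop: float = 0.05) -> bool:
    """True when F1 has fallen by more than 5% relative to the baseline.

    Assumption: the threshold is relative, not an absolute difference.
    """
    if baseline_f1 <= 0:
        return False
    return (baseline_f1 - current_f1) / baseline_f1 > max_relative_drop
```

Under this reading, a baseline F1 of 0.85 falling to 0.80 (about a 5.9% relative drop) would trigger retraining, while a fall to 0.82 would not.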

5. Federation Onboarding

Partner	Direction	Schema	Cadence	Auth
AWCC	Both (import + export)	Ghasi canonical JSON Lines	Daily	mTLS + HSM-signed cert
Roshan	Both	Ghasi canonical JSON Lines	Daily	mTLS + HSM-signed cert
Etisalat AF	Both	Ghasi canonical JSON Lines	Daily	mTLS + HSM-signed cert
MTN AF	Both	Ghasi canonical JSON Lines	Daily	mTLS + HSM-signed cert
Salaam	Both	Ghasi canonical JSON Lines	Daily	mTLS + HSM-signed cert
ATRA	Both: import (national reg) + export (our list to ATRA)	ATRA schema (adapter)	Daily	ATRA-specific cert

Onboarding sequence per partner (Weeks -4 to -2):

MoU signed.
Partner receives Ghasi public cert; sends theirs.
Schema mapping reviewed; adapter implemented if needed.
Dry-run file exchanged; both sides confirm parse.
T-7d production-dry-run; ATRA / partner ACK.
T-0 (Phase 2 start) live federation.
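The dry-run step in the sequence above ("both sides confirm parse") could be supported by a small JSON Lines validator such as the sketch below. The Ghasi canonical schema is not defined in this plan, so the required fields `msisdn` and `reason` are placeholders chosen purely for illustration.

```python
import json
from typing import Iterable, List, Tuple

# Hypothetical required fields; the real canonical schema is not given here.
REQUIRED_FIELDS = ("msisdn", "reason")


def dry_run_parse(lines: Iterable[str]) -> Tuple[int, List[Tuple[int, str]]]:
    """Parse a JSON Lines feed; return (valid_count, [(line_no, error), ...])."""
    valid = 0
    errors: List[Tuple[int, str]] = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_no, "record is not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            errors.append((line_no, "missing fields: " + ", ".join(missing)))
        else:
            valid += 1
    return valid, errors
```

Reporting line numbers with each error gives the partner something concrete to fix before the T-7d production dry run.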

6. Downstream Consumer Migration

Consumer	Change	Timing
`routing-engine`	Consult `EvaluateTransit` before selecting operator; honour verdicts	Phase 2
`channel-router-service`	Consult `FilterInbound` on MO ingest	Phase 3
`compliance-engine`	Consume `firewall.audit.v1` for tenant scoring signal	Phase 2
`fraud-intel-service`	Consume `firewall.alert.v1` as ML feature	Phase 2
`sender-id-registry-service`	Consume `firewall.alert.v1` to adjust reputation	Phase 3
`regulator-portal-service`	SIEM stream of `firewall.audit.v1`	Phase 3
`admin-dashboard`	Firewall admin workbench + NOC panel	Phase 0

7. Rollback Plan

7.1 During Phase 1

No rollback needed (observe only).

7.2 During Phase 2

FIREWALL_ENFORCEMENT_TRANSIT = off.
routing-engine falls back to previous behaviour.

7.3 During Phase 3

FIREWALL_ENFORCEMENT_INBOUND = off reverts to Phase 2.
FIREWALL_ENFORCEMENT_TRANSIT = off reverts to Phase 1.
FIREWALL_ML_ENABLED = false falls back to rule-based only.
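The relationship between the three flags and the phase they effectively restore can be summarised in a short sketch. The function names are hypothetical. The plan does not describe the combination where inbound enforcement is on while transit enforcement is off, so the sketch reports that state as unspecified rather than guessing.

```python
def effective_phase(transit_on: bool, inbound_on: bool) -> str:
    """Map the two enforcement flags to the phase behaviour they produce."""
    if transit_on and inbound_on:
        return "Phase 3 (full enforcement)"
    if transit_on:
        return "Phase 2 (transit MT only)"
    if not inbound_on:
        return "Phase 1 (observation)"
    return "unspecified (inbound enforced without transit)"


def detection_mode(ml_enabled: bool) -> str:
    """FIREWALL_ML_ENABLED = false means rule-based detection only."""
    return "ML" if ml_enabled else "FALLBACK"
```

For instance, `effective_phase(True, False)` describes the state reached from Phase 3 after setting FIREWALL_ENFORCEMENT_INBOUND = off.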

7.4 Catastrophic

Fall back to last-known-good Postgres backup.
Replay firewall.* NATS events from JetStream (7 d retention).
Tenant impact: a state gap of < 1 h is possible.
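Because JetStream retains events for 7 d, replay can only close the gap if the restored backup is no older than that. The sketch below illustrates the check and the selection of events to replay. It works on plain in-memory `(timestamp, subject)` tuples and does not use any NATS client API; all names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

JETSTREAM_RETENTION = timedelta(days=7)  # from the plan

Event = Tuple[datetime, str]  # (published_at, subject), hypothetical shape


def replay_covers_gap(backup_taken_at: datetime, now: datetime) -> bool:
    """True when every event since the backup is still within retention."""
    return now - backup_taken_at <= JETSTREAM_RETENTION


def events_to_replay(events: Iterable[Event],
                     backup_taken_at: datetime) -> List[Event]:
    """Events published after the backup, in timestamp order."""
    return sorted(e for e in events if e[0] > backup_taken_at)
```

If `replay_covers_gap` is false, some state changes cannot be recovered from the stream and would need another source.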

8. Success Metrics for Migration

Metric	Target	Measurement
Phase-1 shadow-verdict FP projection	< 1%	Daily report
Phase-2 tenant escalations per day	< 10	Zendesk
Phase-2 OTP P1 SLA maintained	P95 ≤ 3 s	Platform dashboard
Federation daily sync success	≥ 99%	Service metric
Adversarial corpus bypasses	0	CI gate
ML precision (AIT / SIM-box / OTP-harvest)	≥ 0.92 / 0.88 / 0.90	Held-out test set
Migration phase transition timing	Within ±5 d of plan	Project tracker

9. Dependencies

ATRA + MNO federation MoUs (Regulator Liaison).
HSM provisioned for federation signing (ADR-0004 §11).
GPU node pool for ML serving (ADR-0004 §6 "np-data").
Service mesh + SPIRE SVIDs for mTLS (ADR-0004 §12).
fraud-intel-service live and emitting fraud.detected.* (Wave 1 dependency).
consent-ledger-service live with CheckConsent (Wave 1 dependency).
sender-id-registry-service live with Verify (Wave 1 dependency).

If any dependency is blocked, the phase in which it first appears is blocked as well.
