fraud-intel-service — Migration Plan

Version: 1.0
Status: Draft
Owner: Trust and Safety + ML Ops + Platform Engineering
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, SERVICE_READINESS.md, AI_INTEGRATION.md

fraud-intel-service is greenfield. Today, the platform has fraud signals scattered across logs, Zendesk tickets, and manual T&S review. The migration centralises fraud detection into an ML-backed service and formalises its role as the signal producer for firewall, sender-id-registry, and compliance.

1. What Is Migrating

Input	Source	Volume (estimate)	Notes
30-d platform-wide DLR + MO stream	`sms-orchestrator`, `smpp-connector` retrospective logs	Billions of events	ML training corpus + feature bootstrap
90-d compliance-engine audit log	`compliance.audit`	~10 M rows	Labels for supervised training
T&S historical correction log	Zendesk + internal spreadsheets	~5 k labelled incidents	Seed labelled corpus
External MISP feeds (if agreed)	Partner platforms	TBD	Feed partner MoUs
Trained models v1 (AIT, SIM-box, OTP-harvest)	Output of pre-launch training pipeline	3 artifacts	S3 model-bucket
Feature store seed (per-MSISDN aggregates)	Computed from 30-d retrospective	100 M features	Redis + Postgres

2. Migration Phases

Phase 0 — Pre-migration (Weeks -8 to 0)

This pre-phase is longer than for other services because it includes ML training and the fairness review.

Step	Owner	Output
Training dataset built from 30-d platform traffic	ML Ops + Data Eng	S3 dataset; documented in lineage
T&S + Security label ~5 k historical incidents	T&S	Labelled corpus
v1 models trained for AIT, SIM-box, OTP-harvest	ML Ops	Artifacts + evaluation reports
Fairness audit per model (per-MNO disparate recall ≤ 15%)	T&S + ML Ops	Audit sign-off
Model cards published	ML Ops	Published in model registry
Adversarial corpus (500+ per category) built	Security + T&S	CI test corpus
Feed partner engagement (optional)	Regulator Liaison	MoUs with any partner platforms
Service deployed to staging with design-partner data	SRE	Staging green
Downstream consumer mocks deployed (firewall, sender-id, compliance)	Platform Eng	Integration tests possible

Phase 1 — Signals emitted, not enforced (14 days)

Step	Owner	Output
`Score` serves live predictions; `fraud.detected.*` events published	Service	Signals flowing
Downstream consumers log but do NOT act on signals	Firewall / Sender-ID / Compliance	Observation only
Daily dashboard: signal volume per category, per-MNO, per-tenant; FP-rate projection from T&S spot-checks	T&S	Daily report
Adversarial corpus re-run on live models	Security	Weekly report
Feedback API open to T&S	T&S	Correction log growing

Exit criteria. Per-category FP projection < 2%; adversarial-corpus bypass rate < 2%; T&S confirms confidence in signal quality.

Phase 2 — Enforcement: sender-id reputation only (7 days)

Step	Owner	Output
Sender-id-registry consumes `fraud.detected.*` and adjusts reputation	Service + Sender-ID Registry	Reputation feedback loop live
Auto-suspension thresholds active in sender-id-registry (score < 30)	Sender-ID Registry	Self-regulation begins
Firewall continues to observe (not enforce) fraud signals	Firewall	Gradual rollout
Daily dashboard tracks auto-suspensions and reversal rate	T&S	Trust built

Exit criteria. Auto-suspension reversal rate < 10%; no cluster of unexpected suspensions.
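
The Phase 2 threshold and exit criteria can be sketched as a few small checks. This is only an illustration: the function names are hypothetical, the reversal rate is assumed to be reversals divided by auto-suspensions over the observation window, and the "no cluster of unexpected suspensions" judgment is passed in as a flag because the plan does not define how a cluster is detected.

```python
AUTO_SUSPEND_BELOW = 30        # from the plan: auto-suspend when score < 30
MAX_REVERSAL_RATE = 0.10       # from the plan: reversal rate must be < 10%


def should_auto_suspend(reputation_score: float) -> bool:
    """True when a sender-id reputation score falls below the threshold."""
    return reputation_score < AUTO_SUSPEND_BELOW


def reversal_rate(suspensions: int, reversals: int) -> float:
    """Share of auto-suspensions later reversed (assumed definition)."""
    if suspensions == 0:
        return 0.0
    return reversals / suspensions


def phase2_exit_ok(suspensions: int, reversals: int,
                   unexpected_cluster: bool) -> bool:
    """Both exit criteria must hold; cluster detection is a human input here."""
    return (reversal_rate(suspensions, reversals) < MAX_REVERSAL_RATE
            and not unexpected_cluster)
```

For example, 12 reversals out of 200 auto-suspensions gives a 6% reversal rate, which satisfies the first criterion; 20 out of 200 is exactly 10% and does not.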

Phase 3 — Full Enforcement (ongoing)

Step	Owner	Output
Firewall consumes fraud signals → blocklist updates via auto-rule creation	Firewall + Service	Detection → enforcement loop
Compliance-engine tenant scoring consumes fraud signals	Compliance	Tenant risk informs policy
NOC dashboards live; incident mode wired	NOC + Service	Incident response uplifted
External MISP sharing (if agreed)	Service + Regulator Liaison	Platform reputation
Quarterly retraining cadence active	ML Ops	Ongoing quality

Rollback via feature flags:

FRAUD_SIGNAL_EMISSION_ENABLED = false (kill-switch for all downstream impact).
FRAUD_ML_ENABLED = false (rule-based signals only).
FRAUD_FEEDBACK_API_ENABLED = false (disable correction API; training continues on auto-labels only).
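
One way to read the three flags is as independent switches over publishing, scoring mode, and label sources. The sketch below is a hypothetical model of that reading, not the service's configuration code; the `FraudFlags` and `plan_actions` names, and the assumption that rules run alongside ML when ML is enabled, are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FraudFlags:
    signal_emission_enabled: bool = True   # FRAUD_SIGNAL_EMISSION_ENABLED
    ml_enabled: bool = True                # FRAUD_ML_ENABLED
    feedback_api_enabled: bool = True      # FRAUD_FEEDBACK_API_ENABLED


def plan_actions(flags: FraudFlags) -> dict:
    """Translate the flag state into the behaviour described in the plan."""
    return {
        # Kill-switch: no events published means no downstream impact.
        "publish_events": flags.signal_emission_enabled,
        # Assumption: rules run alongside ML when ML is on.
        "scoring_mode": "ml+rules" if flags.ml_enabled else "rules-only",
        # With the correction API off, training uses auto-labels only.
        "label_sources": (["auto", "feedback"]
                          if flags.feedback_api_enabled else ["auto"]),
    }
```

Keeping the flags independent means, for instance, that ML can be disabled while rule-based signals keep flowing to consumers.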

3. Data-Acquisition Bootstrap

3.1 Training corpus

Signal class	Source	Labelling strategy
AIT	30-d `sms.dlr.inbound` + compliance-blocks + T&S reviews	Auto-label from compliance blocks + manual T&S spot-check; 10 k positives targeted
SIM-box	30-d `sms.mo.inbound` + grey-route flags + T&S reviews	Auto-label via pattern match + T&S; 5 k positives
OTP-harvest	30-d OTP traffic + sender-ID complaint log	Auto-label from low-conversion + high-retry patterns + T&S; 3 k positives

3.2 Negative corpus

Balanced negative corpus sampled from known-legitimate traffic (design-partner tenants, pre-vetted senders).

3.3 PII handling

All training data pre-processed: MSISDN hashed; content redacted; only features retained. Training pipelines run on internal infrastructure — no data leaves the platform boundary.
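
A minimal sketch of the pre-processing step follows. The plan only says the MSISDN is hashed; the use of a keyed HMAC-SHA256, the digit-only normalisation, and the event field names (`msisdn`, `content`, `features`) are assumptions made for illustration.

```python
import hashlib
import hmac


def hash_msisdn(msisdn: str, key: bytes) -> str:
    """Keyed hash of an MSISDN (assumed scheme: HMAC-SHA256 over digits only)."""
    digits = "".join(ch for ch in msisdn if ch in "0123456789")
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()


def to_training_row(event: dict, key: bytes) -> dict:
    """Keep only the hashed identifier and features; drop raw MSISDN and content."""
    return {
        "msisdn_hash": hash_msisdn(event["msisdn"], key),
        "features": dict(event["features"]),
    }
```

A keyed hash is used here rather than a bare digest because the MSISDN space is small enough to enumerate; whether that matches the actual pipeline is not stated in the plan.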

4. Model v1 Go-Live Checklist

Before Phase 1 begins, each model must pass:

Gate	Target	Owner
Held-out test precision	AIT ≥ 0.92 / SIM ≥ 0.88 / OTP ≥ 0.90	ML Ops
Held-out test recall	AIT ≥ 0.80 / SIM ≥ 0.75 / OTP ≥ 0.70	ML Ops
Per-MNO disparate recall	≤ 15%	T&S
Adversarial corpus bypass rate	< 2%	Security
Inference P99 latency on Triton	≤ 50 ms per model	SRE
Model card published with above metrics	Done	ML Ops
Fairness + DPIA sign-off	Signed by Legal + Security	Legal
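
The numeric gates in the checklist can be evaluated mechanically. The sketch below is hypothetical: the function and constant names are illustrative, and "per-MNO disparate recall" is assumed to mean the gap between the best and worst per-MNO recall, which the plan does not define.

```python
THRESHOLDS = {
    "AIT": {"precision": 0.92, "recall": 0.80},
    "SIM": {"precision": 0.88, "recall": 0.75},
    "OTP": {"precision": 0.90, "recall": 0.70},
}
MAX_RECALL_GAP = 0.15     # per-MNO disparate recall <= 15% (assumed: max - min)
MAX_BYPASS_RATE = 0.02    # adversarial corpus bypass rate < 2%
MAX_P99_MS = 50.0         # inference P99 latency <= 50 ms


def failed_gates(model: str, precision: float, recall: float,
                 recall_by_mno: dict, bypass_rate: float,
                 p99_ms: float) -> list:
    """Return the names of the numeric gates this model fails."""
    target = THRESHOLDS[model]
    failures = []
    if precision < target["precision"]:
        failures.append("precision")
    if recall < target["recall"]:
        failures.append("recall")
    gap = max(recall_by_mno.values()) - min(recall_by_mno.values())
    if gap > MAX_RECALL_GAP:
        failures.append("per_mno_recall")
    if bypass_rate >= MAX_BYPASS_RATE:
        failures.append("adversarial_bypass")
    if p99_ms > MAX_P99_MS:
        failures.append("p99_latency")
    return failures
```

An empty list means the numeric gates pass; the model card and Fairness + DPIA sign-off remain manual gates outside this check.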

5. Downstream Consumer Migration

Consumer	Change	Timing
`sender-id-registry-service`	Consume `fraud.detected.*`; adjust reputation; auto-suspend	Phase 2
`sms-firewall-service`	Consume `fraud.detected.*`; create temporary blocklist entries	Phase 3
`compliance-engine`	Consume `fraud.detected.*` and `firewall.audit.v1` for tenant scoring	Phase 3
`admin-dashboard`	NOC fraud-signal stream (EP-ADMDASH-09); drill-down views	Phase 1
`regulator-portal-service`	SIEM stream `fraud.detected.*`	Phase 3
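
For the three enforcing consumers, the phase at which each starts acting on `fraud.detected.*` can be captured in a small lookup. This is a sketch under stated assumptions: the mode names (`off`, `observe`, `act`) are hypothetical, and compliance-engine is assumed to stay in observation during Phase 2, which the plan implies but does not state.

```python
# Phase from which each consumer acts on signals, per the table above.
ACTS_FROM_PHASE = {
    "sender-id-registry-service": 2,
    "sms-firewall-service": 3,
    "compliance-engine": 3,
}


def consumer_mode(consumer: str, current_phase: int) -> str:
    """Return 'off', 'observe', or 'act' for a consumer in a given phase."""
    acts_from = ACTS_FROM_PHASE[consumer]   # KeyError for unknown consumers
    if current_phase < 1:
        return "off"        # Phase 0: no live signals are emitted yet
    if current_phase >= acts_from:
        return "act"
    return "observe"        # log signals, take no action
```

In Phase 2, for example, `consumer_mode("sms-firewall-service", 2)` yields `observe` while the sender-id registry is already acting.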

6. Rollback Plan

6.1 During Phase 1

FRAUD_SIGNAL_EMISSION_ENABLED = false stops event publishing; service still ingests signals.

6.2 During Phase 2

As in Phase 1, plus the sender-id-registry flag SID_FRAUD_CONSUMPTION = false, which stops reputation updates.

6.3 During Phase 3

Full chain rollback: all downstream consumers disable their fraud-intel integration.
Triton and the training pipeline continue to run; models continue to be refreshed so that they are current when enforcement is re-enabled.

6.4 Catastrophic (model bad)

Roll back the model via the registry to the prior version (< 5 min).
If the rollback is insufficient, disable ML (FRAUD_ML_ENABLED = false) and operate on rule-based signals only.

6.5 Feedback poisoning detected

Stop feedback API (FRAUD_FEEDBACK_API_ENABLED = false).
Roll back the model to a version predating the poisoning (the registry is immutable, so a prior version is always available).
Investigate insider / compromised account.
Retrain on clean corpus.
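
Choosing the rollback target in step two can be sketched as picking the newest registry version whose training data predates the poisoning. The `ModelVersion` record and its `data_cutoff` field are hypothetical; the plan does not describe the registry's metadata schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ModelVersion:
    version: str
    data_cutoff: datetime   # assumed: latest timestamp in the training data


def last_clean_version(versions: List[ModelVersion],
                       poisoning_started: datetime) -> Optional[ModelVersion]:
    """Newest version trained only on data from before the poisoning began."""
    clean = [v for v in versions if v.data_cutoff < poisoning_started]
    return max(clean, key=lambda v: v.data_cutoff, default=None)
```

If no version qualifies the function returns `None`, in which case the fallback in 6.4 (rule-based signals only) would apply until a retrained model is available.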

7. Success Metrics for Migration

Metric	Target	Measurement
Phase 1 signal volume matches retrospective forecast	±20%	Daily
Phase 2 auto-suspension reversal rate	< 10%	Sender-ID manual-review sample
Phase 3 firewall temporary-rule lifetime	95% resolve within 24 h (expire or become permanent)	Firewall metric
End-to-end detection-to-enforcement latency	≤ 5 min P95	Cross-service trace
Model retraining SLA	Quarterly + on-drift	ML Ops cadence
Model drift F1 variance	< 5% vs. baseline	Weekly drift job
Cost per 1 000 Score calls	$TBD budget	Finance
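
Two of these targets reduce to simple arithmetic, sketched below. The helper names are hypothetical, the ±20% band is assumed to be relative to the forecast, and P95 is computed with the nearest-rank method, which is one of several common percentile definitions.

```python
def within_forecast(observed: float, forecast: float,
                    tolerance: float = 0.20) -> bool:
    """True when observed volume is within +/- tolerance of the forecast."""
    if forecast <= 0:
        raise ValueError("forecast must be positive")
    return abs(observed - forecast) / forecast <= tolerance


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_target_met(latencies_s: list, limit_s: float = 300.0) -> bool:
    """Detection-to-enforcement latency target: P95 <= 5 min."""
    return p95(latencies_s) <= limit_s
```

With a forecast of 1 000 signals per day, an observed 1 100 is inside the band and 1 300 is outside it.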

8. Dependencies

Training corpus infrastructure (Airflow + Spark / DuckDB on training data).
Triton Inference Server with GPU capacity (ADR-0004 §6 "np-data").
Model registry (S3 bucket with object-lock + versioning).
Consumer services (firewall, sender-id-registry, compliance-engine) ready to consume signals.
regulator-portal-service SIEM stream (Phase 3).

9. Model Lineage & Reproducibility

Every deployed model must have:

Artifact checksum + immutable S3 location.
Training dataset snapshot reference.
Training code commit hash.
Training environment image hash.
Model card with evaluation numbers.
Fairness audit report.
Deployment audit-log entry.

This lineage is a regulator-facing artefact and is part of the annual compliance attestation bundle (per EP-REG-03).
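
The seven required items map naturally onto a single record that can be checked for completeness before deployment. The sketch below is illustrative: the `ModelLineage` field names and the choice of SHA-256 for the artifact checksum are assumptions, not the registry's actual schema.

```python
import hashlib
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelLineage:
    artifact_sha256: str       # artifact checksum
    artifact_s3_uri: str       # immutable S3 location
    dataset_snapshot: str      # training dataset snapshot reference
    code_commit: str           # training code commit hash
    image_digest: str          # training environment image hash
    model_card_ref: str        # model card with evaluation numbers
    fairness_audit_ref: str    # fairness audit report
    deployment_audit_id: str   # deployment audit-log entry


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes (assumed checksum algorithm)."""
    return hashlib.sha256(data).hexdigest()


def missing_fields(lineage: ModelLineage) -> list:
    """Names of lineage fields that are empty or whitespace-only."""
    return [f.name for f in fields(lineage)
            if not getattr(lineage, f.name).strip()]
```

A deployment gate could refuse any model for which `missing_fields` returns a non-empty list.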
