fraud-intel-service — Migration Plan
Version: 1.0
Status: Draft
Owner: Trust and Safety + ML Ops + Platform Engineering
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, SERVICE_READINESS.md, AI_INTEGRATION.md
fraud-intel-service is greenfield. Today, the platform has fraud signals scattered across logs, Zendesk tickets, and manual T&S review. The migration centralises fraud detection into an ML-backed service and formalises its role as the signal producer for firewall, sender-id-registry, and compliance.
1. What Is Migrating
| Input | Source | Volume (estimate) | Notes |
|---|---|---|---|
| 30-d platform-wide DLR + MO stream | sms-orchestrator, smpp-connector retrospective logs | Billions of events | ML training corpus + feature bootstrap |
| 90-d compliance-engine audit log | compliance.audit | ~10 M rows | Labels for supervised training |
| T&S historical correction log | Zendesk + internal spreadsheets | ~5 k labelled incidents | Seed labelled corpus |
| External MISP feeds (if agreed) | Partner platforms | TBD | Feed partner MoUs |
| Trained models v1 (AIT, SIM-box, OTP-harvest) | Output of pre-launch training pipeline | 3 artifacts | S3 model-bucket |
| Feature store seed (per-MSISDN aggregates) | Computed from 30-d retrospective | 100 M features | Redis + Postgres |
2. Migration Phases
Phase 0 — Pre-migration (Weeks -8 to 0)
This pre-phase is longer than for other services because it includes ML training and the fairness review.
| Step | Owner | Output |
|---|---|---|
| Training dataset built from 30-d platform traffic | ML Ops + Data Eng | S3 dataset; documented in lineage |
| T&S + Security label ~5 k historical incidents | T&S | Labelled corpus |
| v1 models trained for AIT, SIM-box, OTP-harvest | ML Ops | Artifacts + evaluation reports |
| Fairness audit per model (per-MNO disparate recall ≤ 15%) | T&S + ML Ops | Audit sign-off |
| Model cards published | ML Ops | Published in model registry |
| Adversarial corpus (500+ per category) built | Security + T&S | CI test corpus |
| Feed partner engagement (optional) | Regulator Liaison | MoUs with any partner platforms |
| Service deployed to staging with design-partner data | SRE | Staging green |
| Downstream consumer mocks deployed (firewall, sender-id, compliance) | Platform Eng | Integration tests possible |
Phase 1 — Signals emitted, not enforced (14 days)
| Step | Owner | Output |
|---|---|---|
| Score serves live predictions; fraud.detected.* events published | Service | Signals flowing |
| Downstream consumers log but do NOT act on signals | Firewall / Sender-ID / Compliance | Observation only |
| Daily dashboard: signal volume per category, per-MNO, per-tenant; FP-rate projection from T&S spot-checks | T&S | Daily report |
| Adversarial corpus re-run on live models | Security | Weekly report |
| Feedback API open to T&S | T&S | Correction log growing |
Exit criteria. Per-category FP projection < 2%; adversarial-corpus bypass rate < 2%; T&S confident in signal quality.
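The numeric parts of the Phase 1 exit criteria could be evaluated mechanically from the daily T&S spot-checks and the weekly adversarial re-run. The following is a minimal sketch, not the service's actual tooling: `CategoryObservation`, `phase1_exit_failures`, and their field names are hypothetical, and applying the bypass limit per category is an assumption. T&S confidence is a human judgement, so it is passed in as a flag.

```python
from dataclasses import dataclass

# Limits taken from the Phase 1 exit criteria (both are strict "< 2%").
FP_RATE_LIMIT = 0.02
BYPASS_RATE_LIMIT = 0.02


@dataclass(frozen=True)
class CategoryObservation:
    """Hypothetical per-category tallies for one observation window."""
    category: str
    spot_checked: int          # signals reviewed by T&S
    false_positives: int       # of those, judged to be false positives
    adversarial_samples: int   # adversarial corpus size for this category
    adversarial_bypasses: int  # samples the live model failed to flag


def _rate(hits: int, total: int) -> float:
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


def phase1_exit_failures(observations, ts_confident: bool) -> list[str]:
    """Return human-readable reasons why Phase 1 cannot exit (empty = pass)."""
    failures = []
    for obs in observations:
        fp = _rate(obs.false_positives, obs.spot_checked)
        bypass = _rate(obs.adversarial_bypasses, obs.adversarial_samples)
        if fp >= FP_RATE_LIMIT:
            failures.append(f"{obs.category}: FP projection {fp:.2%}")
        if bypass >= BYPASS_RATE_LIMIT:
            failures.append(f"{obs.category}: adversarial bypass {bypass:.2%}")
    if not ts_confident:
        failures.append("T&S confidence not recorded")
    return failures
```

An empty list means all three criteria hold; any entry names the category and the measured rate that blocked the exit.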
Phase 2 — Enforcement: sender-id reputation only (7 days)
| Step | Owner | Output |
|---|---|---|
| Sender-id-registry consumes fraud.detected.* and adjusts reputation | Service + Sender-ID Registry | Reputation feedback loop live |
| Auto-suspension thresholds active in sender-id-registry (score < 30) | Sender-ID Registry | Self-regulation begins |
| Firewall continues to observe (not enforce) fraud signals | Firewall | Gradual rollout |
| Daily dashboard tracks auto-suspensions and reversal rate | T&S | Trust built |
Exit criteria. Auto-suspension reversal rate < 10%; no cluster of unexpected suspensions.
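As an illustration of the Phase 2 numbers, the sketch below applies the auto-suspension threshold (score < 30) and computes the reversal rate against the 10% limit. Function names are hypothetical, and the reversal rate is assumed to be reversed suspensions divided by all auto-suspensions in the reviewed sample. The "no cluster of unexpected suspensions" criterion needs human review and is not modelled here.

```python
# Thresholds from the Phase 2 plan.
AUTO_SUSPEND_SCORE_THRESHOLD = 30   # suspend when reputation score < 30
REVERSAL_RATE_LIMIT = 0.10          # exit requires reversal rate < 10%


def should_auto_suspend(reputation_score: float) -> bool:
    """True when a sender ID's reputation score falls below the threshold."""
    return reputation_score < AUTO_SUSPEND_SCORE_THRESHOLD


def reversal_rate(auto_suspensions: int, reversals: int) -> float:
    """Share of auto-suspensions later reversed on manual review."""
    if auto_suspensions <= 0:
        raise ValueError("no auto-suspensions in the sample")
    if not 0 <= reversals <= auto_suspensions:
        raise ValueError("reversals must be between 0 and auto_suspensions")
    return reversals / auto_suspensions


def phase2_reversal_ok(auto_suspensions: int, reversals: int) -> bool:
    """Numeric half of the Phase 2 exit criteria."""
    return reversal_rate(auto_suspensions, reversals) < REVERSAL_RATE_LIMIT
```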
Phase 3 — Full Enforcement (ongoing)
| Step | Owner | Output |
|---|---|---|
| Firewall consumes fraud signals → blocklist updates via auto-rule creation | Firewall + Service | Detection → enforcement loop |
| Compliance-engine tenant scoring consumes fraud signals | Compliance | Tenant risk informs policy |
| NOC dashboards live; incident mode wired | NOC + Service | Incident response uplifted |
| External MISP sharing (if agreed) | Service + Regulator Liaison | Platform reputation |
| Quarterly retraining cadence active | ML Ops | Ongoing quality |
Rollback via feature flags:
- FRAUD_SIGNAL_EMISSION_ENABLED = false (kill-switch for all downstream impact).
- FRAUD_ML_ENABLED = false (rule-based signals only).
- FRAUD_FEEDBACK_API_ENABLED = false (disable correction API; training continues on auto-labels only).
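One way the three rollback flags might combine at runtime is sketched below. This assumes the flags are read from environment variables and default to enabled when unset; the plan does not say where flags live, so `FraudFlags` and `signal_sources` are purely illustrative names.

```python
import os
from dataclasses import dataclass

_TRUE_VALUES = {"1", "true", "yes", "on"}


def _flag(env, name: str, default: bool = True) -> bool:
    """Parse a boolean flag; unset flags fall back to the default (assumption)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class FraudFlags:
    signal_emission: bool  # FRAUD_SIGNAL_EMISSION_ENABLED
    ml: bool               # FRAUD_ML_ENABLED
    feedback_api: bool     # FRAUD_FEEDBACK_API_ENABLED

    @classmethod
    def from_env(cls, env=None) -> "FraudFlags":
        env = os.environ if env is None else env
        return cls(
            signal_emission=_flag(env, "FRAUD_SIGNAL_EMISSION_ENABLED"),
            ml=_flag(env, "FRAUD_ML_ENABLED"),
            feedback_api=_flag(env, "FRAUD_FEEDBACK_API_ENABLED"),
        )


def signal_sources(flags: FraudFlags) -> list[str]:
    """Which detectors may publish events under the current flags."""
    if not flags.signal_emission:
        return []  # kill-switch: nothing reaches downstream consumers
    return ["rules", "ml"] if flags.ml else ["rules"]
```

For example, `signal_sources(FraudFlags.from_env({"FRAUD_ML_ENABLED": "false"}))` yields only the rule-based source, while disabling emission yields no sources regardless of the ML flag.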
3. Data-Acquisition Bootstrap
3.1 Training corpus
| Signal class | Source | Labelling strategy |
|---|---|---|
| AIT | 30-d sms.dlr.inbound + compliance-blocks + T&S reviews | Auto-label from compliance blocks + manual T&S spot-check; 10 k positives targeted |
| SIM-box | 30-d sms.mo.inbound + grey-route flags + T&S reviews | Auto-label via pattern match + T&S; 5 k positives |
| OTP-harvest | 30-d OTP traffic + sender-ID complaint log | Auto-label from low-conversion + high-retry patterns + T&S; 3 k positives |
3.2 Negative corpus
Balanced negative corpus sampled from known-legitimate traffic (design-partner tenants, pre-vetted senders).
3.3 PII handling
All training data pre-processed: MSISDN hashed; content redacted; only features retained. Training pipelines run on internal infrastructure — no data leaves the platform boundary.
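The pre-processing step could look roughly like the sketch below. The plan states only that MSISDNs are hashed and content is redacted; the use of a keyed HMAC-SHA256 (rather than a plain hash), the digit normalisation, and the `FEATURE_FIELDS` allow-list are assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical allow-list: only these fields survive into the training row.
FEATURE_FIELDS = ("dlr_status", "retry_count", "mno")


def hash_msisdn(msisdn: str, key: bytes) -> str:
    """Keyed hash of an MSISDN so raw numbers never enter the corpus."""
    digits = "".join(ch for ch in msisdn if ch in "0123456789")
    if not digits:
        raise ValueError("MSISDN contains no digits")
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()


def to_training_row(event: dict, key: bytes) -> dict:
    """Keep the hashed MSISDN plus allow-listed features; drop everything else.

    Message content is never copied, which implements redaction by omission.
    """
    row = {"msisdn_hash": hash_msisdn(event["msisdn"], key)}
    for field in FEATURE_FIELDS:
        row[field] = event.get(field)
    return row
```

Using an allow-list rather than a deny-list means a new upstream field cannot leak into training data by default. A keyed hash is chosen here because the MSISDN space is small enough that unkeyed hashes can be reversed by enumeration.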
4. Model v1 Go-Live Checklist
Before Phase 1 begins, each model must pass:
| Gate | Target | Owner |
|---|---|---|
| Held-out test precision | AIT ≥ 0.92 / SIM ≥ 0.88 / OTP ≥ 0.90 | ML Ops |
| Held-out test recall | AIT ≥ 0.80 / SIM ≥ 0.75 / OTP ≥ 0.70 | ML Ops |
| Per-MNO disparate recall | ≤ 15% | T&S |
| Adversarial corpus bypass rate | < 2% | Security |
| Inference P99 latency on Triton | ≤ 50 ms per model | SRE |
| Model card published with above metrics | Done | ML Ops |
| Fairness + DPIA sign-off | Legal + Security sign | Legal |
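The checklist above can be encoded as a single gate function, sketched here with the thresholds from the table. `ModelEvaluation` and `failed_gates` are hypothetical names, and "per-MNO disparate recall" is assumed to mean the gap between the best and worst per-MNO recall; the fairness audit may define it differently.

```python
from dataclasses import dataclass

# Targets copied from the go-live checklist.
PRECISION_MIN = {"AIT": 0.92, "SIM-box": 0.88, "OTP-harvest": 0.90}
RECALL_MIN = {"AIT": 0.80, "SIM-box": 0.75, "OTP-harvest": 0.70}
MAX_RECALL_DISPARITY = 0.15   # inclusive limit
MAX_BYPASS_RATE = 0.02        # strict limit (< 2%)
MAX_P99_LATENCY_MS = 50.0     # inclusive limit


@dataclass(frozen=True)
class ModelEvaluation:
    model: str                       # "AIT", "SIM-box" or "OTP-harvest"
    precision: float
    recall: float
    recall_by_mno: dict[str, float]
    bypass_rate: float
    p99_latency_ms: float
    model_card_published: bool
    fairness_dpia_signed: bool


def recall_disparity(recall_by_mno: dict[str, float]) -> float:
    """Assumed definition: best per-MNO recall minus worst per-MNO recall."""
    values = list(recall_by_mno.values())
    return max(values) - min(values)


def failed_gates(ev: ModelEvaluation) -> list[str]:
    """Names of the gates this model fails (empty list = cleared for Phase 1)."""
    checks = {
        "precision": ev.precision >= PRECISION_MIN[ev.model],
        "recall": ev.recall >= RECALL_MIN[ev.model],
        "per-MNO disparate recall": recall_disparity(ev.recall_by_mno) <= MAX_RECALL_DISPARITY,
        "adversarial bypass": ev.bypass_rate < MAX_BYPASS_RATE,
        "P99 latency": ev.p99_latency_ms <= MAX_P99_LATENCY_MS,
        "model card": ev.model_card_published,
        "fairness + DPIA sign-off": ev.fairness_dpia_signed,
    }
    return [name for name, passed in checks.items() if not passed]
```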
5. Downstream Consumer Migration
| Consumer | Change | Timing |
|---|---|---|
| sender-id-registry-service | Consume fraud.detected.*; adjust reputation; auto-suspend | Phase 2 |
| sms-firewall-service | Consume fraud.detected.*; create temporary blocklist entries | Phase 3 |
| compliance-engine | Consume fraud.detected.* and firewall.audit.v1 for tenant scoring | Phase 3 |
| admin-dashboard | NOC fraud-signal stream (EP-ADMDASH-09); drill-down views | Phase 1 |
| regulator-portal-service | SIEM stream fraud.detected.* | Phase 3 |
6. Rollback Plan
6.1 During Phase 1
- FRAUD_SIGNAL_EMISSION_ENABLED = false stops event publishing; the service still ingests signals.
6.2 During Phase 2
- As above, plus the sender-id-registry flag SID_FRAUD_CONSUMPTION = false, which stops reputation updates.
6.3 During Phase 3
- Full chain rollback: all downstream consumers disable their fraud-intel integration.
- Triton and the training pipeline continue to run, so models stay refreshed and ready for when enforcement is re-enabled.
6.4 Catastrophic (model bad)
- Roll back model via registry to prior version (< 5 min).
- If rollback is insufficient, disable ML (FRAUD_ML_ENABLED = false) and operate on rule-based signals only.
6.5 Feedback poisoning detected
- Stop feedback API (FRAUD_FEEDBACK_API_ENABLED = false).
- Roll back model to version predating poisoning (registry immutable; always possible).
- Investigate insider / compromised account.
- Retrain on clean corpus.
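Choosing the rollback target after feedback poisoning could be sketched as follows. It assumes, hypothetically, that each registry entry records the timestamp of the latest feedback included in its training data (`feedback_cutoff`); the plan does not specify the registry schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    version: str
    feedback_cutoff: datetime  # latest feedback record used in training (assumed field)


def last_clean_version(
    versions: list[ModelVersion], poisoning_start: datetime
) -> Optional[ModelVersion]:
    """Newest version whose training data ends before the poisoning began."""
    clean = [v for v in versions if v.feedback_cutoff < poisoning_start]
    if not clean:
        return None  # no safe version: fall back to rule-based signals
    return max(clean, key=lambda v: v.feedback_cutoff)


registry = [
    ModelVersion("ait-v1", datetime(2026, 1, 5, tzinfo=timezone.utc)),
    ModelVersion("ait-v2", datetime(2026, 4, 2, tzinfo=timezone.utc)),
    ModelVersion("ait-v3", datetime(2026, 7, 1, tzinfo=timezone.utc)),
]
target = last_clean_version(registry, datetime(2026, 5, 10, tzinfo=timezone.utc))
```

With these illustrative dates, `target` is `ait-v2`: `ait-v3` was trained on feedback gathered after the poisoning began. A `None` result corresponds to the FRAUD_ML_ENABLED = false fallback in 6.4.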
7. Success Metrics for Migration
| Metric | Target | Measurement |
|---|---|---|
| Phase 1 signal volume matches retrospective forecast | ±20% | Daily |
| Phase 2 auto-suspension reversal rate | < 10% | Sender-ID manual-review sample |
| Phase 3 firewall temporary-rule lifetime | 95% resolve within 24 h (expire or permanent) | Firewall metric |
| End-to-end detection-to-enforcement latency | ≤ 5 min P95 | Cross-service trace |
| Model retraining SLA | Quarterly + on-drift | ML Ops cadence |
| Model drift F1 variance | < 5% vs. baseline | Weekly drift job |
| Cost per 1 000 Score calls | $TBD budget | Finance |
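Two of the metrics above reduce to simple arithmetic, sketched below under stated assumptions: the ±20% forecast band is taken relative to the forecast value, and "F1 variance < 5% vs. baseline" is interpreted as the relative change in F1 against the baseline model. Function names are illustrative only.

```python
def within_forecast(observed: float, forecast: float, tolerance: float = 0.20) -> bool:
    """Phase 1 volume check: observed is within ±tolerance of the forecast."""
    if forecast <= 0:
        raise ValueError("forecast must be positive")
    return abs(observed - forecast) / forecast <= tolerance


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_drift(baseline_f1: float, current_f1: float) -> float:
    """Relative F1 change versus baseline (assumed reading of 'F1 variance')."""
    if baseline_f1 <= 0:
        raise ValueError("baseline F1 must be positive")
    return abs(current_f1 - baseline_f1) / baseline_f1


def drift_ok(baseline_f1: float, current_f1: float, limit: float = 0.05) -> bool:
    """Weekly drift job gate: drift must stay strictly below the limit."""
    return f1_drift(baseline_f1, current_f1) < limit
```

If the drift job instead measures absolute F1 points, the division by `baseline_f1` would be dropped; the target in the table does not settle which reading is intended.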
8. Dependencies
- Training corpus infrastructure (Airflow + Spark / DuckDB on training data).
- Triton Inference Server with GPU capacity (ADR-0004 §6 "np-data").
- Model registry (S3 bucket with object-lock + versioning).
- Consumer services (firewall, sender-id-registry, compliance-engine) ready to consume signals.
- regulator-portal-service SIEM stream (Phase 3).
9. Model Lineage & Reproducibility
Every deployed model must have:
- Artifact checksum + immutable S3 location.
- Training dataset snapshot reference.
- Training code commit hash.
- Training environment image hash.
- Model card with evaluation numbers.
- Fairness audit report.
- Deployment audit-log entry.
This lineage is a regulator-facing artefact and is part of the annual compliance attestation bundle (per EP-REG-03).
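The lineage requirements above could be captured as one record per deployed model, with a completeness check run before deployment. This is a minimal sketch: the `ModelLineage` field names are hypothetical, and SHA-256 is assumed for the artifact checksum since the plan does not name an algorithm.

```python
import hashlib
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelLineage:
    """One field per required lineage item (names are illustrative)."""
    artifact_sha256: str
    artifact_s3_uri: str
    dataset_snapshot_ref: str
    training_commit: str
    training_image_digest: str
    model_card_ref: str
    fairness_audit_ref: str
    deployment_audit_entry: str


def sha256_of(data: bytes) -> str:
    """Checksum of the model artifact bytes (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def missing_lineage_fields(record: ModelLineage) -> list[str]:
    """Names of lineage items that are blank; empty list means complete."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]
```

A deployment pipeline might refuse to promote any model for which `missing_lineage_fields` returns a non-empty list, so that the attestation bundle never contains a partial record.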