fraud-intel-service — Migration Plan

Version: 1.0
Status: Draft
Owner: Trust and Safety + ML Ops + Platform Engineering
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, SERVICE_READINESS.md, AI_INTEGRATION.md

fraud-intel-service is greenfield. Today, the platform has fraud signals scattered across logs, Zendesk tickets, and manual T&S review. The migration centralises fraud detection into an ML-backed service and formalises its role as the signal producer for firewall, sender-id-registry, and compliance.


1. What Is Migrating

| Input | Source | Volume (estimate) | Notes |
|---|---|---|---|
| 30-d platform-wide DLR + MO stream | sms-orchestrator, smpp-connector retrospective logs | Billions of events | ML training corpus + feature bootstrap |
| 90-d compliance-engine audit log | compliance.audit | ~10 M rows | Labels for supervised training |
| T&S historical correction log | Zendesk + internal spreadsheets | ~5 k labelled incidents | Seed labelled corpus |
| External MISP feeds (if agreed) | Partner platforms | TBD | Feed partner MoUs |
| Trained models v1 (AIT, SIM-box, OTP-harvest) | Output of pre-launch training pipeline | 3 artifacts | S3 model-bucket |
| Feature store seed (per-MSISDN aggregates) | Computed from 30-d retrospective | 100 M features | Redis + Postgres |

2. Migration Phases

Phase 0 — Pre-migration (Weeks -8 to 0)

This pre-phase is longer than for other services because it includes ML training and the fairness review.

| Step | Owner | Output |
|---|---|---|
| Training dataset built from 30-d platform traffic | ML Ops + Data Eng | S3 dataset; documented in lineage |
| T&S + Security label ~5 k historical incidents | T&S | Labelled corpus |
| v1 models trained for AIT, SIM-box, OTP-harvest | ML Ops | Artifacts + evaluation reports |
| Fairness audit per model (per-MNO disparate recall ≤ 15%) | T&S + ML Ops | Audit sign-off |
| Model cards published | ML Ops | Published in model registry |
| Adversarial corpus (500+ per category) built | Security + T&S | CI test corpus |
| Feed partner engagement (optional) | Regulator Liaison | MoUs with any partner platforms |
| Service deployed to staging with design-partner data | SRE | Staging green |
| Downstream consumer mocks deployed (firewall, sender-id, compliance) | Platform Eng | Integration tests possible |
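
The fairness audit step above (per-MNO disparate recall ≤ 15%) could be checked roughly as follows. This is a minimal sketch: the plan does not define how disparity is measured, so the code assumes it is the absolute gap between the best and worst per-MNO recall. The function names, record layout, and MNO labels are hypothetical.

```python
from collections import defaultdict

def recall_by_mno(records):
    """records: iterable of (mno, is_fraud, predicted_fraud) tuples."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for mno, is_fraud, predicted in records:
        if not is_fraud:
            continue  # recall only looks at genuinely fraudulent events
        if predicted:
            tp[mno] += 1
        else:
            fn[mno] += 1
    return {m: tp[m] / (tp[m] + fn[m]) for m in set(tp) | set(fn)}

def disparate_recall(recalls):
    """Assumed definition: gap between best and worst per-MNO recall."""
    return max(recalls.values()) - min(recalls.values())

# Hypothetical held-out results for two MNOs.
records = (
    [("mno_a", True, True)] * 9 + [("mno_a", True, False)] * 1
    + [("mno_b", True, True)] * 8 + [("mno_b", True, False)] * 2
    + [("mno_b", False, True)] * 3  # false positives do not affect recall
)
recalls = recall_by_mno(records)
gap = disparate_recall(recalls)
passes_fairness_gate = gap <= 0.15
```

If the audit instead uses a relative measure (for example worst recall divided by best recall), only `disparate_recall` would need to change.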

Phase 1 — Signals emitted, not enforced (14 days)

| Step | Owner | Output |
|---|---|---|
| Score serves live predictions; fraud.detected.* events published | Service | Signals flowing |
| Downstream consumers log but do NOT act on signals | Firewall / Sender-ID / Compliance | Observation only |
| Daily dashboard: signal volume per category, per-MNO, per-tenant; FP-rate projection from T&S spot-checks | T&S | Daily report |
| Adversarial corpus re-run on live models | Security | Weekly report |
| Feedback API open to T&S | T&S | Correction log growing |

Exit criteria. Per-category FP projection < 2%; adversarial corpus < 2% bypass; T&S confidence in signal quality.
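
The two numeric exit criteria above can be evaluated mechanically; the third (T&S confidence) is a human judgement and is not modelled here. The sketch below assumes spot-check and adversarial results arrive as simple per-category counts; the function name and input shapes are hypothetical.

```python
FP_LIMIT = 0.02      # per-category FP projection must be < 2%
BYPASS_LIMIT = 0.02  # adversarial corpus bypass must be < 2%

def phase1_exit_failures(spot_checks, adversarial):
    """spot_checks: {category: (false_positives, signals_reviewed)}
    adversarial: {category: (bypasses, samples_run)}
    Returns a list of failed criteria; an empty list means both numeric gates pass."""
    failures = []
    for category, (fp, reviewed) in spot_checks.items():
        # No reviewed signals means there is no evidence, so treat it as a failure.
        if reviewed == 0 or fp / reviewed >= FP_LIMIT:
            failures.append(f"{category}: FP projection")
    for category, (bypassed, total) in adversarial.items():
        if total == 0 or bypassed / total >= BYPASS_LIMIT:
            failures.append(f"{category}: adversarial bypass")
    return failures

# Hypothetical counts for illustration only.
failures = phase1_exit_failures(
    spot_checks={"AIT": (3, 400), "SIM-box": (9, 300), "OTP-harvest": (2, 250)},
    adversarial={"AIT": (5, 500), "SIM-box": (4, 500), "OTP-harvest": (15, 600)},
)
```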

Phase 2 — Enforcement: sender-id reputation only (7 days)

| Step | Owner | Output |
|---|---|---|
| Sender-id-registry consumes fraud.detected.* and adjusts reputation | Service + Sender-ID Registry | Reputation feedback loop live |
| Auto-suspension thresholds active in sender-id-registry (score < 30) | Sender-ID Registry | Self-regulation begins |
| Firewall continues to observe (not enforce) fraud signals | Firewall | Gradual rollout |
| Daily dashboard tracks auto-suspensions and reversal rate | T&S | Trust built |

Exit criteria. Auto-suspension reversal rate < 10%; no cluster of unexpected suspensions.

Phase 3 — Full Enforcement (ongoing)

| Step | Owner | Output |
|---|---|---|
| Firewall consumes fraud signals → blocklist updates via auto-rule creation | Firewall + Service | Detection → enforcement loop |
| Compliance-engine tenant scoring consumes fraud signals | Compliance | Tenant risk informs policy |
| NOC dashboards live; incident mode wired | NOC + Service | Incident response uplifted |
| External MISP sharing (if agreed) | Service + Regulator Liaison | Platform reputation |
| Quarterly retraining cadence active | ML Ops | Ongoing quality |

Rollback via feature flags:

  • FRAUD_SIGNAL_EMISSION_ENABLED = false (kill-switch for all downstream impact).
  • FRAUD_ML_ENABLED = false (rule-based signals only).
  • FRAUD_FEEDBACK_API_ENABLED = false (disable correction API; training continues on auto-labels only).
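
One way the first two flags might gate the scoring path is sketched below. Only the flag names come from this plan; the `FraudFlags` class, the handler functions, and the event shape are hypothetical. The sketch also assumes that ML scoring replaces rule-based scoring when enabled, which the plan does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FraudFlags:
    signal_emission_enabled: bool = True  # FRAUD_SIGNAL_EMISSION_ENABLED
    ml_enabled: bool = True               # FRAUD_ML_ENABLED
    feedback_api_enabled: bool = True     # FRAUD_FEEDBACK_API_ENABLED (not exercised here)

def score_event(event, flags, rule_score, ml_score):
    """Return (score, source); fall back to rules when ML is disabled."""
    if flags.ml_enabled:
        return ml_score(event), "ml"
    return rule_score(event), "rules"

def handle_event(event, flags, rule_score, ml_score, publish):
    """Score every event, but publish downstream only if emission is enabled."""
    score, source = score_event(event, flags, rule_score, ml_score)
    if flags.signal_emission_enabled:
        publish({"event_id": event["id"], "score": score, "source": source})
    return score, source

# Usage with stand-in scorers: ML disabled, emission still on.
published = []
result = handle_event(
    {"id": "evt-1"},
    FraudFlags(ml_enabled=False),
    rule_score=lambda e: 10,
    ml_score=lambda e: 90,
    publish=published.append,
)
```

Scoring before the emission check means the service keeps working internally while the kill-switch is off, matching the Phase 1 rollback note that the service still ingests signals.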

3. Data-Acquisition Bootstrap

3.1 Training corpus

| Signal class | Source | Labelling strategy |
|---|---|---|
| AIT | 30-d sms.dlr.inbound + compliance-blocks + T&S reviews | Auto-label from compliance blocks + manual T&S spot-check; 10 k positives targeted |
| SIM-box | 30-d sms.mo.inbound + grey-route flags + T&S reviews | Auto-label via pattern match + T&S; 5 k positives |
| OTP-harvest | 30-d OTP traffic + sender-ID complaint log | Auto-label from low-conversion + high-retry patterns + T&S; 3 k positives |

3.2 Negative corpus

Balanced negative corpus sampled from known-legitimate traffic (design-partner tenants, pre-vetted senders).
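
A minimal sketch of the balanced sampling described above, assuming "balanced" means a 1:1 ratio of negatives to positives (the plan does not give the ratio). The function name is hypothetical; the fixed seed is there so the sample can be reproduced from the dataset snapshot.

```python
import random

def balanced_negatives(positives, legit_pool, seed=0):
    """Draw as many negatives from known-legitimate traffic as there are positives."""
    if len(legit_pool) < len(positives):
        raise ValueError("not enough known-legitimate samples to balance the corpus")
    rng = random.Random(seed)  # seeded generator keeps the draw reproducible
    return rng.sample(legit_pool, len(positives))

# Hypothetical identifiers standing in for feature rows.
positives = [f"fraud-{i}" for i in range(5)]
legit_pool = [f"legit-{i}" for i in range(100)]
negatives = balanced_negatives(positives, legit_pool, seed=42)
```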

3.3 PII handling

All training data pre-processed: MSISDN hashed; content redacted; only features retained. Training pipelines run on internal infrastructure — no data leaves the platform boundary.
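
The pre-processing step might look like the sketch below. The plan says only that MSISDNs are hashed; a keyed hash (HMAC-SHA256) is assumed here because an unkeyed hash over the small phone-number space can be reversed by enumeration. The field names, the retained features, and the key handling are hypothetical.

```python
import hashlib
import hmac

def pseudonymise_msisdn(msisdn: str, key: bytes) -> str:
    """Keyed hash of the digits only, so formatting differences do not change the result."""
    digits = "".join(ch for ch in msisdn if ch in "0123456789")
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()

def to_training_row(raw: dict, key: bytes) -> dict:
    """Keep derived features only; the raw MSISDN and message content are dropped."""
    return {
        "msisdn_hash": pseudonymise_msisdn(raw["msisdn"], key),
        "content_length": len(raw["content"]),
        "retry_count": raw["retry_count"],
    }

# Illustrative key and record; a real key would come from a secret store.
key = b"example-key-not-for-production"
row = to_training_row(
    {"msisdn": "+44 7700 900123", "content": "Your code is 123456", "retry_count": 2},
    key,
)
```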


4. Model v1 Go-Live Checklist

Before Phase 1 begins, each model must pass:

| Gate | Target | Owner |
|---|---|---|
| Held-out test precision | AIT ≥ 0.92 / SIM ≥ 0.88 / OTP ≥ 0.90 | ML Ops |
| Held-out test recall | AIT ≥ 0.80 / SIM ≥ 0.75 / OTP ≥ 0.70 | ML Ops |
| Per-MNO disparate recall | ≤ 15% | T&S |
| Adversarial corpus bypass rate | < 2% | Security |
| Inference P99 latency on Triton | ≤ 50 ms per model | SRE |
| Model card published with above metrics | Done | ML Ops |
| Fairness + DPIA sign-off | Legal + Security sign | Legal |
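
The checklist above can be expressed as a single gate function, for instance in CI before a model is promoted. The thresholds below are taken from the table; the function name and the shape of the metrics dictionary are assumptions.

```python
PRECISION_MIN = {"AIT": 0.92, "SIM": 0.88, "OTP": 0.90}
RECALL_MIN = {"AIT": 0.80, "SIM": 0.75, "OTP": 0.70}

def failed_gates(model, m):
    """Return the names of the go-live gates that the metrics `m` do not meet."""
    checks = {
        "precision": m["precision"] >= PRECISION_MIN[model],
        "recall": m["recall"] >= RECALL_MIN[model],
        "disparate_recall": m["disparate_recall"] <= 0.15,
        "bypass_rate": m["bypass_rate"] < 0.02,        # strict inequality per the table
        "p99_latency_ms": m["p99_latency_ms"] <= 50,
        "model_card_published": m["model_card_published"],
        "fairness_dpia_signed_off": m["fairness_dpia_signed_off"],
    }
    return [name for name, ok in checks.items() if not ok]

# Hypothetical evaluation numbers for the SIM-box model.
sim_failures = failed_gates("SIM", {
    "precision": 0.89,
    "recall": 0.74,
    "disparate_recall": 0.12,
    "bypass_rate": 0.015,
    "p99_latency_ms": 42,
    "model_card_published": True,
    "fairness_dpia_signed_off": True,
})
```

In this example only the recall gate fails (0.74 against a minimum of 0.75), so Phase 1 would not start for that model.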

5. Downstream Consumer Migration

| Consumer | Change | Timing |
|---|---|---|
| sender-id-registry-service | Consume fraud.detected.*; adjust reputation; auto-suspend | Phase 2 |
| sms-firewall-service | Consume fraud.detected.*; create temporary blocklist entries | Phase 3 |
| compliance-engine | Consume fraud.detected.* and firewall.audit.v1 for tenant scoring | Phase 3 |
| admin-dashboard | NOC fraud-signal stream (EP-ADMDASH-09); drill-down views | Phase 1 |
| regulator-portal-service | SIEM stream fraud.detected.* | Phase 3 |
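
The timing column above amounts to a per-consumer activation phase. A small lookup like the following sketch could drive rollout configuration. The phase numbers come from the table; the function name and the "live" / "pending" labels are hypothetical. For firewall, sender-id-registry, and compliance, "pending" corresponds to the log-only behaviour described in Phases 1 and 2.

```python
# Phase at which each consumer's change goes live, per the table above.
LIVE_FROM_PHASE = {
    "sender-id-registry-service": 2,
    "sms-firewall-service": 3,
    "compliance-engine": 3,
    "admin-dashboard": 1,
    "regulator-portal-service": 3,
}

def consumer_mode(consumer, current_phase):
    """Return 'live' once the consumer's phase has been reached, else 'pending'."""
    return "live" if current_phase >= LIVE_FROM_PHASE[consumer] else "pending"

# Snapshot of all consumers during Phase 2.
phase2_modes = {name: consumer_mode(name, 2) for name in LIVE_FROM_PHASE}
```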

6. Rollback Plan

6.1 During Phase 1

  • FRAUD_SIGNAL_EMISSION_ENABLED = false stops event publishing; service still ingests signals.

6.2 During Phase 2

  • Above, plus sender-id-registry flag SID_FRAUD_CONSUMPTION = false stops reputation updates.

6.3 During Phase 3

  • Full chain rollback: all downstream consumers disable their fraud-intel integration.
  • Triton and the training pipeline continue to run, so models stay refreshed and ready for when enforcement is re-enabled.

6.4 Catastrophic (model bad)

  • Roll back model via registry to prior version (< 5 min).
  • If rollback insufficient, disable ML (FRAUD_ML_ENABLED = false) and operate on rule-based only.

6.5 Feedback poisoning detected

  • Stop feedback API (FRAUD_FEEDBACK_API_ENABLED = false).
  • Roll back model to version predating poisoning (registry immutable; always possible).
  • Investigate insider / compromised account.
  • Retrain on clean corpus.
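
Choosing the rollback target for the poisoning case can be sketched as below. The registry is modelled as a plain list of (version, feedback cut-off) pairs, which is an assumption; the plan states only that the registry is immutable and versioned. `rollback_target` is a hypothetical helper.

```python
from datetime import datetime

def rollback_target(versions, poisoned_from):
    """versions: list of (version_id, feedback_cutoff) pairs.
    Returns the newest version whose feedback data ends before the poisoning began,
    or None if every version is affected."""
    clean = [v for v in versions if v[1] < poisoned_from]
    if not clean:
        return None
    return max(clean, key=lambda v: v[1])[0]

# Hypothetical registry contents.
versions = [
    ("ait-v1.0", datetime(2026, 1, 10)),
    ("ait-v1.1", datetime(2026, 4, 12)),
    ("ait-v1.2", datetime(2026, 7, 15)),
]
target = rollback_target(versions, poisoned_from=datetime(2026, 6, 1))
```

If the function returns `None`, no clean version exists and the fallback in 6.4 (disable ML and run rule-based only) would apply until retraining completes.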

7. Success Metrics for Migration

| Metric | Target | Measurement |
|---|---|---|
| Phase 1 signal volume matches retrospective forecast | ±20% | Daily |
| Phase 2 auto-suspension reversal rate | < 10% | Sender-ID manual-review sample |
| Phase 3 firewall temporary-rule lifetime | 95% resolve within 24 h (expire or permanent) | Firewall metric |
| End-to-end detection-to-enforcement latency | ≤ 5 min P95 | Cross-service trace |
| Model retraining SLA | Quarterly + on-drift | ML Ops cadence |
| Model drift F1 variance | < 5% vs. baseline | Weekly drift job |
| Cost per 1 000 Score calls | $TBD budget | Finance |
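
Two of the targets above reduce to short calculations, sketched here. The P95 uses the nearest-rank method, which is an assumption since the plan does not say how the percentile is computed; both function names are hypothetical.

```python
def p95(values):
    """Nearest-rank 95th percentile; integer maths avoids floating-point rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]

def within_forecast(actual, forecast, tolerance=0.20):
    """True if the actual signal volume is within ±20% of the retrospective forecast."""
    return abs(actual - forecast) <= tolerance * forecast

# Hypothetical detection-to-enforcement latencies in seconds (1 s to 100 s).
latencies_s = list(range(1, 101))
latency_ok = p95(latencies_s) <= 5 * 60   # target: ≤ 5 min P95
volume_ok = within_forecast(actual=1100, forecast=1000)
```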

8. Dependencies

  • Training corpus infrastructure (Airflow + Spark / DuckDB on training data).
  • Triton Inference Server with GPU capacity (ADR-0004 §6 "np-data").
  • Model registry (S3 bucket with object-lock + versioning).
  • Consumer services (firewall, sender-id-registry, compliance-engine) ready to consume signals.
  • regulator-portal-service SIEM stream (Phase 3).

9. Model Lineage & Reproducibility

Every deployed model must have:

  1. Artifact checksum + immutable S3 location.
  2. Training dataset snapshot reference.
  3. Training code commit hash.
  4. Training environment image hash.
  5. Model card with evaluation numbers.
  6. Fairness audit report.
  7. Deployment audit-log entry.

This lineage is a regulator-facing artefact and is part of the annual compliance attestation bundle (per EP-REG-03).
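
The seven lineage items could be captured in one immutable record and validated before deployment, as in the sketch below. The field names, URIs, and identifiers are hypothetical placeholders; only the list of required items comes from this section.

```python
import hashlib
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ModelLineage:
    artifact_sha256: str       # 1. artifact checksum
    artifact_s3_uri: str       # 1. immutable S3 location
    dataset_snapshot: str      # 2. training dataset snapshot reference
    code_commit: str           # 3. training code commit hash
    env_image_digest: str      # 4. training environment image hash
    model_card_uri: str        # 5. model card with evaluation numbers
    fairness_report_uri: str   # 6. fairness audit report
    deployment_audit_id: str   # 7. deployment audit-log entry

def missing_fields(record):
    """Names of lineage fields that are empty; deployment should be blocked if any exist."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Placeholder values for illustration only.
record = ModelLineage(
    artifact_sha256=sha256_of(b"model-bytes"),
    artifact_s3_uri="s3://example-model-bucket/ait/v1/model.bin",
    dataset_snapshot="snapshot-2026-03-01",
    code_commit="0123abc",
    env_image_digest="sha256:feedface",
    model_card_uri="s3://example-model-bucket/ait/v1/card.md",
    fairness_report_uri="",  # audit not yet attached
    deployment_audit_id="audit-0001",
)
gaps = missing_fields(record)
```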