
Fraud Intelligence (fraud-intel-service) — Service Overview

Version: 1.0
Status: Draft
Owner: Trust & Safety
Last Updated: 2026-04-20
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION
Related ADR: ADR-0004 National-Backbone Resilience §3


1. Purpose — Telecom Fraud Detection at National Scale

The Fraud Intelligence Service is the central detection brain for telecom-grade fraud on the Ghasi SMS backbone. It ingests evidence from every leg of the message lifecycle (orchestrator submits, firewall verdicts, SMPP DLRs, MO PDUs, billing CDRs) and emits high-confidence detections that drive automated enforcement (firewall rule promotion, tenant scoring, sender-ID suspension) and regulator-grade fraud feed exports.

Whereas sms-firewall-service enforces rules at the perimeter and compliance-engine enforces policy on outbound tenant SMS, fraud-intel-service is inferential and analytical. It operates on cross-message graphs and time-series patterns to detect what no single rule can express:

  1. AIT — Artificially Inflated Traffic generated to harvest SMS termination revenue (typically OTPs)
  2. SIM-box — voice-converted SMS terminating via grey-route hardware bypassing legitimate interconnect
  3. OTP harvesting — campaigns triggering OTPs against accounts the attacker controls in order to capture per-message termination revenue
  4. OTP grinding — automated OTP guessing against a single victim (security threat to the recipient)
  5. Grey-route arbitrage — long-running peer aggregator schemes routing traffic via untaxed paths
  6. Sender-ID spoofing networks — patterns of sender-ID misuse across multiple aggregators

The service is asynchronous (NATS consumer + ML batch pipeline + offline graph queries), with one synchronous gRPC entry point (Score) used by compliance-engine for tenant-level scoring during evaluation. It does not sit in the data-plane critical path.


2. Position in the Platform — The Detection Plane

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ sms-firewall     │    │ sms-orchestrator │    │ dlr-processor    │
│ firewall.audit   │    │ sms.events.*     │    │ sms.dlr.inbound  │
└────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘
         │                       │                       │
         │ NATS                  │ NATS                  │ NATS
         ▼                       ▼                       ▼
╔══════════════════════════════════════════════════════════════════╗
║                       fraud-intel-service                        ║
║                                                                  ║
║  ┌─────────────────────────┐    ┌────────────────────────────┐   ║
║  │ Stream Ingestion        │───▶│ Feature Store (ClickHouse) │   ║
║  │ (NATS consumer pool)    │    │ + Redis hot features       │   ║
║  └─────────────────────────┘    └─────────────┬──────────────┘   ║
║                                               │                  ║
║                                               ▼                  ║
║  ┌─────────────────────────────────────────────────────────┐     ║
║  │ Detection Pipelines (per fraud class)                   │     ║
║  │ · AIT: graph + ML (XGBoost on cross-tenant features)    │     ║
║  │ · SIM-box: temporal + HLR-mismatch + ASN heuristic      │     ║
║  │ · OTP harvesting: OTP-keyword + recipient cohort        │     ║
║  │ · OTP grinding: per-MSISDN OTP-attempt bursts           │     ║
║  │ · Grey-route: long-window peer-MNO routing entropy      │     ║
║  └─────────────────────────────┬───────────────────────────┘     ║
║                                │                                 ║
║                                ▼                                 ║
║                fraud.detected.v1 (NATS, signed)                  ║
╚══════════════════════════════════════════════════════════════════╝
                                 │
         ┌───────────────────────┼──────────────────────────┐
         ▼                       ▼                          ▼
┌──────────────────┐    ┌──────────────────┐    ┌────────────────────────┐
│ sms-firewall     │    │ compliance-engine│    │ regulator-portal +     │
│ promotes to BLOCK│    │ tenant scoring   │    │ MISP fraud-feed export │
└──────────────────┘    └──────────────────┘    └────────────────────────┘
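
As a minimal sketch of the "per-MSISDN OTP-attempt bursts" signal named in the diagram, the Python below keeps a sliding window of attempt timestamps per MSISDN and flags a burst when the count exceeds a limit. The class name, window length, attempt limit, and sample number are illustrative assumptions; the features and thresholds of the actual pipeline are not specified in this overview.

```python
from collections import defaultdict, deque


class OtpBurstDetector:
    """Flags an MSISDN when OTP attempts inside a sliding window exceed a limit."""

    def __init__(self, window_seconds: float = 60.0, max_attempts: int = 5) -> None:
        # Both defaults are illustrative assumptions, not documented thresholds.
        self.window_seconds = window_seconds
        self.max_attempts = max_attempts
        self._attempts = defaultdict(deque)  # msisdn -> attempt timestamps

    def observe(self, msisdn: str, ts: float) -> bool:
        """Record one OTP attempt at time ts; return True if the limit is exceeded."""
        window = self._attempts[msisdn]
        window.append(ts)
        # Evict attempts that have fallen out of the sliding window.
        while window and window[0] <= ts - self.window_seconds:
            window.popleft()
        return len(window) > self.max_attempts


detector = OtpBurstDetector(window_seconds=60.0, max_attempts=5)
victim = "+93700000001"  # made-up number for the example
flags = [detector.observe(victim, float(t)) for t in range(0, 12, 2)]
print(flags)  # only the sixth attempt inside the window is flagged
```

Timestamps are passed in explicitly rather than read from the clock, so the same logic can be replayed over backfilled data.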

3. Bounded Context

| Dimension | Value |
| --- | --- |
| Domain | Trust & Safety / Fraud Intelligence |
| Owner squad | Trust & Safety (Data Science track) |
| Deployment unit | Kubernetes: fraud-intel-service (control-plane) + fraud-intel-worker-pool (batch / ML) |
| Communication style | Inbound: NATS JetStream consumers (primary) · gRPC Score (synchronous, low-volume) · HTTP REST (admin + MISP feed) |
| Storage | ClickHouse (feature store + detection log) · Postgres (model catalog, detection cases, MISP entities) · Redis (hot features, model output cache) · MinIO (model artifacts, training datasets, signed feed exports) |
| Failure mode | Fail-soft for detection (a missed window is acceptable and can be backfilled later); fail-closed for Score gRPC (compliance-engine treats unscored tenants as PROBATION) |
| Region affinity | kbl primary; mzr runs as warm standby; ClickHouse is regionally sharded with cross-region read replicas |
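
The fail-closed behaviour for Score can be sketched from the caller's side. This is an assumption-laden illustration, not compliance-engine's code: `tier_for_tenant`, the `Tier` enum, the error set, and the 0.60 cut-off are all hypothetical, and a plain dict stands in for the gRPC call. The only behaviour taken from this document is that an unscored tenant resolves to PROBATION.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    NORMAL = "NORMAL"
    PROBATION = "PROBATION"


# Errors treated as "no score available"; the set is an illustrative assumption.
UNSCORED_ERRORS = (TimeoutError, ConnectionError, KeyError)


def tier_for_tenant(
    tenant_id: str,
    score_lookup: Callable[[str], float],
    probation_at: float = 0.60,  # hypothetical cut-off, not from this document
) -> Tier:
    """Map a tenant to a tier, falling back to PROBATION when unscored."""
    try:
        score = score_lookup(tenant_id)
    except UNSCORED_ERRORS:
        # Fail-closed: a tenant without a score is never treated as NORMAL.
        return Tier.PROBATION
    return Tier.PROBATION if score >= probation_at else Tier.NORMAL


cached_scores = {"tenant-a": 0.12}  # stand-in for the pre-computed score cache
print(tier_for_tenant("tenant-a", cached_scores.__getitem__))  # Tier.NORMAL
print(tier_for_tenant("tenant-b", cached_scores.__getitem__))  # Tier.PROBATION
```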

4. Responsibilities

| # | Responsibility |
| --- | --- |
| R1 | Consume firewall.audit.v1, sms.events.status.v1, sms.dlr.inbound.v1, cdr.generated.v1 and project them into the ClickHouse feature store with P95 ingestion lag ≤ 30 s |
| R2 | Run the AIT detection pipeline (graph + XGBoost) on rolling 5-minute windows; mean-time-to-detect ≤ 15 min for a new AIT campaign |
| R3 | Run the SIM-box detection pipeline on inbound MO patterns (per-ASN, per-MNO, per-MSISDN-range); emit fraud.detected.simbox.v1 with confidence ≥ 0.8 |
| R4 | Run OTP-harvesting and OTP-grinding detection on tenant outbound traffic; emit per-tenant fraud cases visible in admin-dashboard |
| R5 | Run grey-route arbitrage detection on peer aggregator behaviour over rolling 24 h windows |
| R6 | Maintain a model catalog (fraud.models table) with model ID, version, training-set hash, deployment date, and performance metrics; every detection event carries aiProvenance |
| R7 | Expose synchronous gRPC `Score(tenantId |
| R8 | Export a MISP-compatible fraud feed (JSON Lines + STIX 2.1) signed with an HSM key for cross-operator and regulator consumption |
| R9 | Import MISP feeds from peer MNOs and from ATRA's regulator portal; merge into the feature store with source attribution and decay |
| R10 | Provide an admin REST API for the fraud analyst case-management workflow: cases, case_evidence, case_decisions (confirm fraud / dismiss / refine model) |
| R11 | Publish per-tenant fraud scores hourly to compliance-engine via fraud.tenant_score.updated.v1 for use in compliance-engine's tenant tier calculation |
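
To make the R1 target concrete, the sketch below checks a P95 ingestion-lag objective from paired timestamps (event emitted, row stored). It uses the nearest-rank percentile definition, which is one common choice and an assumption here; the function names and sample values are illustrative only.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one lag sample")
    ordered = sorted(values)
    # 1-based nearest rank = ceil(0.95 * n), computed in integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def ingestion_lag_ok(
    emitted_ts: list[float],
    stored_ts: list[float],
    slo_seconds: float = 30.0,  # the R1 target
) -> bool:
    """True if the P95 of (stored - emitted) is within the objective."""
    lags = [stored - emitted for emitted, stored in zip(emitted_ts, stored_ts)]
    return p95(lags) <= slo_seconds


emitted = [0.0] * 20
one_slow = [5.0] * 19 + [120.0]   # a single outlier stays above the 95th percentile
two_slow = [5.0] * 18 + [120.0] * 2
print(ingestion_lag_ok(emitted, one_slow))  # True
print(ingestion_lag_ok(emitted, two_slow))  # False
```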

5. Non-Responsibilities

  • Does not enforce verdicts on individual messages — that is sms-firewall-service (perimeter) and compliance-engine (outbound)
  • Does not terminate SMPP binds or quarantine peers — sms-firewall-service does that based on fraud.detected.* events
  • Does not train or serve general-purpose LLMs — the local LLM service handles that
  • Does not generate billing CDRs — cdr-mediation-service does
  • Does not manage tenant onboarding decisions — compliance-engine consumes our scores but owns tenant tier policy
  • Does not persist subscriber consent — consent-ledger-service

6. Upstream / Downstream Dependencies

| Direction | Service | Protocol | Purpose |
| --- | --- | --- | --- |
| Inbound event | sms-firewall-service | NATS JetStream firewall.audit.v1 | Verdict evidence for AIT / SIM-box detection |
| Inbound event | sms-orchestrator | NATS JetStream sms.events.status.v1 | Outbound SMS lifecycle events |
| Inbound event | dlr-processor | NATS JetStream sms.dlr.inbound.v1 | Delivery receipts for AIT detection (DLR success-rate per terminating MNO) |
| Inbound event | cdr-mediation-service | NATS JetStream cdr.generated.v1 | Per-message CDR for grey-route arbitrage detection |
| Inbound event | consent-ledger-service | NATS JetStream consent.revoked.v1 | OTP-harvesting heuristic input (revocation cohorts) |
| Inbound caller | compliance-engine | gRPC Score(scope, id) | Per-tenant / sender-ID / MSISDN fraud score |
| Inbound caller | routing-engine | gRPC Score(senderId) (optional) | Routing decision input for high-risk senders |
| Inbound admin | admin-dashboard | HTTP REST (mTLS, JWT role tns-fraud-analyst) | Case management |
| Inbound MISP | regulator-portal-service | HTTP REST POST /v1/internal/fraud/feed/import | Regulator MISP feed import |
| Outbound read/write | ClickHouse fraud_features schema | TCP | Feature store, detection log |
| Outbound read/write | PostgreSQL fraud schema | TCP | Model catalog, cases, MISP entities |
| Outbound read/write | Redis | TCP | Hot feature cache, model output cache |
| Outbound read/write | MinIO | S3 | Model artifacts, signed MISP exports |
| Outbound events | NATS JetStream | TCP | fraud.detected.*, fraud.tenant_score.updated.v1, fraud.feed.updated.v1 |
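
This document writes `fraud.detected.*` as shorthand for the family of detection subjects. Under NATS wildcard rules, `*` matches exactly one token and `>` matches one or more trailing tokens, so the shorthand taken literally as a subscription filter would not match a four-token subject such as fraud.detected.simbox.v1. The pure-Python matcher below is a sketch of those rules for illustration; it is not part of any NATS client library.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is exactly one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


subject = "fraud.detected.simbox.v1"
print(subject_matches("fraud.detected.*", subject))     # False: one token too few
print(subject_matches("fraud.detected.*.v1", subject))  # True
print(subject_matches("fraud.detected.>", subject))     # True
```

A consumer that wants every detection class and version would therefore filter on something like `fraud.detected.>`, while `fraud.detected.*.v1` pins the schema version.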

7. High-Level Flow — AIT Detection


8. High-Level Flow — Synchronous Score gRPC


9. Runtime Topology Summary

| Aspect | Value |
| --- | --- |
| Process model | Two deployment groups: (a) fraud-intel-service (NestJS, exposes gRPC + REST), (b) fraud-intel-worker (Python, ML batch pipelines, runs every 5 min via CronJob + KEDA-scaled stream-processor) |
| Replicas | fraud-intel-service: minReplicas=3 in kbl, 2 in mzr; fraud-intel-worker: scales 0–20 by KEDA on NATS lag |
| Node pool | fraud-intel-service on np-ctrl; fraud-intel-worker on np-identity (GPU optional for future deep models) |
| Startup | Service: load active model catalog → warm Redis caches; Worker: pull model artifacts from MinIO → register with control-plane |
| Hot reload | New model version → admin REST POST /v1/admin/fraud/models/{id}/promote → workers swap model atomically on next batch boundary |
| Shutdown | Drain NATS consumers (max 30 s) → flush in-memory feature buffers to ClickHouse → exit |
| Region affinity | kbl primary; mzr warm standby; cross-region ClickHouse replication for queries; model registry mirrored |
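
One way to read "swap model atomically on next batch boundary" is a staged-then-pinned handoff: a promotion only stages the new version, and each batch pins whichever model is active when it starts. The sketch below illustrates that reading; `ModelSlot`, `Model`, and the lambda predictors are hypothetical names, and the real worker may coordinate the swap differently.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Model:
    version: str
    predict: Callable[[list[float]], float]


class ModelSlot:
    """Holds the active model and at most one staged replacement."""

    def __init__(self, active: Model) -> None:
        self._lock = threading.Lock()
        self._active = active
        self._pending: Optional[Model] = None

    def promote(self, model: Model) -> None:
        """Stage a new version; a batch already in flight is unaffected."""
        with self._lock:
            self._pending = model

    def begin_batch(self) -> Model:
        """Call at each batch boundary: apply any staged version, then pin it."""
        with self._lock:
            if self._pending is not None:
                self._active, self._pending = self._pending, None
            return self._active


slot = ModelSlot(Model("v1", lambda features: 0.1))
batch_model = slot.begin_batch()                # this batch is pinned to v1
slot.promote(Model("v2", lambda features: 0.9))
print(batch_model.version)                      # v1: promotion did not touch the running batch
print(slot.begin_batch().version)               # v2: picked up at the next boundary
```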

10. Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Asynchronous detection plane; Score gRPC is the only synchronous path | Detection latency is allowed to be minutes; synchronous use is limited to read-only score lookups against a pre-computed cache |
| ClickHouse for feature store, not Postgres | Telemetry-class volumes (10 M events/h target) require columnar OLAP; Postgres is reserved for case management and model metadata |
| XGBoost + graph features, not deep learning (initially) | Telecom fraud patterns are well-suited to gradient-boosted trees; explainability via SHAP values matters for regulator defensibility |
| Detections are events, not in-line decisions | Every detection is a NATS event consumable by multiple downstream services; this avoids hard coupling and lets policy evolve independently |
| MISP-compatible feed format | Industry-standard threat-intel exchange; lets us interoperate with peer MNOs and the regulator without a bespoke schema |
| Detections carry aiProvenance (model ID, version, training-set hash, SHAP top-3 features) | Regulator-grade explainability; cases can be defended in dispute resolution |
| Fail-soft on detection, fail-closed on Score | A missed window can be backfilled; an unscored tenant cannot be allowed through compliance without a probation handle |
| Per-fraud-class pipeline, not a single mega-model | Each pipeline can be retrained, A/B-tested, and rolled back independently; reduces blast radius |
| Confidence thresholds 0.85 / 0.60 / 0.40: above 0.85 auto-enforce; 0.60–0.85 case opened (HITL); below 0.60 logged-only | Calibrated to observed false-positive cost (subscriber impact) vs false-negative cost (revenue + reputation) |
| Models are versioned, signed, and immutable in MinIO | Supply-chain integrity for ML artifacts; prevents drift between training and inference replicas |
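
The confidence bands in the table can be written as a small routing function. This sketch uses only the 0.85 and 0.60 boundaries, because the table does not say what the 0.40 threshold controls; the action names are illustrative labels, not event or API names from this service.

```python
AUTO_ENFORCE_ABOVE = 0.85
OPEN_CASE_FROM = 0.60


def action_for(confidence: float) -> str:
    """Route a detection by confidence band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence > AUTO_ENFORCE_ABOVE:
        return "AUTO_ENFORCE"   # above 0.85
    if confidence >= OPEN_CASE_FROM:
        return "OPEN_CASE"      # 0.60 to 0.85, human in the loop
    return "LOG_ONLY"           # below 0.60


for c in (0.91, 0.85, 0.60, 0.59):
    print(c, action_for(c))
```

Treating exactly 0.85 as a case rather than an automatic enforcement follows the "above 0.85" wording; if the intended boundary is inclusive, the first comparison would need `>=`.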

11. Cross-Service Citations

| Related epic | Owner service | Why it matters here |
| --- | --- | --- |
| EP-FW-02 Transit MT Firewall | sms-firewall-service | Consumes our fraud.detected.greyroute.v1 and fraud.detected.simbox.v1 to promote to BLOCK rules |
| EP-FW-03 Federation | sms-firewall-service | Our MISP feed supplies federated blocklist entries with source = 'FRAUD_INTEL' |
| EP-CE-* Compliance scoring | compliance-engine | Calls our Score gRPC for per-tenant fraud score; consumes fraud.tenant_score.updated.v1 |
| EP-SID-* Sender-ID lifecycle | sender-id-registry-service | Consumes fraud.detected.senderid_abuse.v1 to suspend abused IDs |
| EP-DLR-* DLR processing | dlr-processor | Our primary AIT signal: per-MNO DLR success rate anomalies |
| EP-CONS-02 STOP-keyword | consent-ledger-service | Mass revocations are an OTP-harvesting signal |
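
The "per-MNO DLR success rate anomalies" signal can be illustrated with a simple baseline comparison. This is a sketch under stated assumptions: the production AIT pipeline is described as graph + XGBoost, not a fixed rule, and the deviation measure, the 0.2 tolerance, the MNO labels, and the sample data here are all invented for the example.

```python
from collections import Counter


def dlr_success_rates(dlrs: list[tuple[str, bool]]) -> dict[str, float]:
    """Success rate per terminating MNO from (mno, delivered) receipts."""
    total: Counter = Counter()
    delivered: Counter = Counter()
    for mno, ok in dlrs:
        total[mno] += 1
        delivered[mno] += int(ok)
    return {mno: delivered[mno] / total[mno] for mno in total}


def anomalous_mnos(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.2,  # assumed absolute deviation, not a documented value
) -> list[str]:
    """MNOs whose current rate deviates from the baseline by more than the tolerance."""
    return sorted(
        mno
        for mno, rate in current.items()
        if mno in baseline and abs(rate - baseline[mno]) > tolerance
    )


window = (
    [("mno-a", True)] * 9 + [("mno-a", False)]
    + [("mno-b", True)] * 5 + [("mno-b", False)] * 5
)
rates = dlr_success_rates(window)
print(anomalous_mnos(rates, {"mno-a": 0.92, "mno-b": 0.90}))  # ['mno-b']
```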

12. Open Questions

| ID | Question | Owner | Target |
| --- | --- | --- | --- |
| OQ-FRAUD-01 | Should the MISP export include sender-ID-class indicators or only MSISDN-class to align with ATRA's published format? | Regulator Liaison | 2026-05-30 |
| OQ-FRAUD-02 | Are we permitted to share tenant-attributed AIT cases with peer MNOs, or only de-identified MSISDN-level evidence? | Legal | 2026-05-15 |
| OQ-FRAUD-03 | How do we handle cross-region model drift: train per-region models or a unified model? | Data Science Lead | 2026-06-15 |
| OQ-FRAUD-04 | What is the expected GPU budget if we migrate the OTP-harvesting classifier to a transformer model in 2027? | SRE + DS | 2026-Q4 |