Fraud Intelligence (fraud-intel-service) — Service Overview

Version: 1.0
Status: Draft
Owner: Trust & Safety
Last Updated: 2026-04-20
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION
Related ADR: ADR-0004 National-Backbone Resilience §3

1. Purpose — Telecom Fraud Detection at National Scale

The Fraud Intelligence Service is the central detection brain for telecom-grade fraud on the Ghasi SMS backbone. It ingests evidence from every leg of the message lifecycle (orchestrator submits, firewall verdicts, SMPP DLRs, MO PDUs, billing CDRs) and emits high-confidence detections that drive automated enforcement (firewall rule promotion, tenant scoring, sender-ID suspension) and regulator-grade fraud feed exports.

Whereas sms-firewall-service enforces rules at the perimeter and compliance-engine enforces policy on outbound tenant SMS, fraud-intel-service is inferential and analytical. It operates on cross-message graphs and time-series patterns to detect what no single rule can express:

AIT — Artificially Inflated Traffic generated to harvest SMS termination revenue (typically OTPs)
SIM-box — voice-converted SMS terminating via grey-route hardware bypassing legitimate interconnect
OTP harvesting — campaigns triggering OTPs against accounts the attacker controls in order to capture per-message termination revenue
OTP grinding — automated OTP guessing against a single victim (security threat to the recipient)
Grey-route arbitrage — long-running peer aggregator schemes routing traffic via untaxed paths
Sender-ID spoofing networks — patterns of sender-ID misuse across multiple aggregators
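
As an illustration of one class in the list above, OTP grinding can be pictured as a per-recipient sliding-window counter. This is a minimal sketch and not the service's actual pipeline: the class name `BurstDetector`, the 60-second window, and the threshold of 5 attempts are all illustrative assumptions.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flags an MSISDN when too many OTP sends land inside a sliding window.

    Hypothetical helper: the window and threshold values are assumptions.
    """

    def __init__(self, window_s: float = 60.0, max_attempts: int = 5) -> None:
        self.window_s = window_s
        self.max_attempts = max_attempts
        self._seen = defaultdict(deque)  # msisdn -> recent send timestamps

    def observe(self, msisdn: str, ts: float) -> bool:
        """Record one OTP send at time ts (seconds); return True on a burst."""
        recent = self._seen[msisdn]
        recent.append(ts)
        # Drop timestamps that have fallen out of the window.
        while recent and ts - recent[0] > self.window_s:
            recent.popleft()
        return len(recent) > self.max_attempts


detector = BurstDetector(window_s=60.0, max_attempts=5)
# Seven OTP sends to the same (made-up) recipient, ten seconds apart.
flags = [detector.observe("+93700000001", t) for t in range(0, 70, 10)]
```

With these assumed settings the sixth and seventh sends are flagged, while a different recipient seen once is not.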

The service is asynchronous (NATS consumer + ML batch pipeline + offline graph queries), with one synchronous gRPC entry point (Score) used by compliance-engine for tenant-level scoring during evaluation. It does not sit in the data-plane critical path.

2. Position in the Platform — The Detection Plane

       ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
       │  sms-firewall    │    │  sms-orchestrator│    │  dlr-processor   │
       │  firewall.audit  │    │  sms.events.*    │    │  sms.dlr.inbound │
       └────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘
                │                       │                       │
                │  NATS                 │  NATS                 │  NATS
                ▼                       ▼                       ▼
       ╔══════════════════════════════════════════════════════════════════╗
       ║                  fraud-intel-service                              ║
       ║                                                                   ║
       ║   ┌─────────────────────────┐    ┌──────────────────────────┐    ║
       ║   │ Stream Ingestion        │───▶│ Feature Store (ClickHouse)│    ║
       ║   │ (NATS consumer pool)    │    │  + Redis hot features    │    ║
       ║   └─────────────────────────┘    └─────────┬────────────────┘    ║
       ║                                            │                      ║
       ║                                            ▼                      ║
       ║   ┌─────────────────────────────────────────────────────────┐    ║
       ║   │ Detection Pipelines (per fraud class)                   │    ║
       ║   │  · AIT: graph + ML (XGBoost on cross-tenant features)   │    ║
       ║   │  · SIM-box: temporal + HLR-mismatch + ASN heuristic     │    ║
       ║   │  · OTP harvesting: OTP-keyword + recipient cohort       │    ║
       ║   │  · OTP grinding: per-MSISDN OTP-attempt bursts          │    ║
       ║   │  · Grey-route: long-window peer-MNO routing entropy     │    ║
       ║   └─────────────────────────┬───────────────────────────────┘    ║
       ║                             │                                     ║
       ║                             ▼                                     ║
       ║              fraud.detected.v1 (NATS, signed)                     ║
       ╚════════════════════════════════════════════════════════════════════╝
                                     │
            ┌────────────────────────┼────────────────────────┐
            ▼                        ▼                        ▼
   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────────┐
   │ sms-firewall     │   │ compliance-engine│   │ regulator-portal +   │
   │ promotes to BLOCK│   │ tenant scoring   │   │ MISP fraud-feed export│
   └──────────────────┘   └──────────────────┘   └──────────────────────┘
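
The grey-route pipeline in the diagram is labelled "long-window peer-MNO routing entropy". One plausible reading is Shannon entropy over the distribution of routes that a peer's traffic takes within the window. The sketch below assumes that reading; the function name `routing_entropy` and the route labels are hypothetical, and this document does not define the actual feature.

```python
import math
from collections import Counter


def routing_entropy(routes: list[str]) -> float:
    """Shannon entropy, in bits, of the observed route distribution."""
    if not routes:
        return 0.0
    total = len(routes)
    counts = Counter(routes)
    # Sum of -p * log2(p) over each distinct route.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


single = routing_entropy(["mno-a"] * 8)
even = routing_entropy(["mno-a", "mno-b", "mno-c", "mno-d"] * 2)
```

Traffic that always takes one route has zero entropy, and traffic spread evenly over four routes has 2 bits. A detector built on this idea would look for peers whose entropy shifts over the long window, but the thresholds for that are outside the scope of this sketch.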

3. Bounded Context

Dimension	Value
Domain	Trust & Safety / Fraud Intelligence
Owner squad	Trust & Safety (Data Science track)
Deployment unit	Kubernetes — `fraud-intel-service` (control-plane) + `fraud-intel-worker-pool` (batch / ML)
Communication style	Inbound: NATS JetStream consumers (primary) · gRPC `Score` (synchronous, low-volume) · HTTP REST (admin + MISP feed)
Storage	ClickHouse (feature store + detection log) · Postgres (model catalog, detection cases, MISP entities) · Redis (hot features, model output cache) · MinIO (model artifacts, training datasets, signed feed exports)
Failure mode	Fail-soft for detection (a missed window is acceptable; backfill later); fail-closed for `Score` gRPC (compliance-engine treats unscored tenants as PROBATION)
Region affinity	kbl primary; mzr runs as warm standby; ClickHouse is regionally sharded with cross-region read replicas
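
The fail-closed row above can be sketched from the caller's side. This is an illustration only: `score_with_fallback`, `ScoreResult`, and the `fake_lookup` stand-in are hypothetical names, and the real contract lives in API_CONTRACTS.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ScoreResult:
    tenant_id: str
    score: Optional[float]  # None when no score could be obtained
    status: str             # "SCORED" or "PROBATION"


def score_with_fallback(tenant_id: str,
                        lookup: Callable[[str], Optional[float]]) -> ScoreResult:
    """Fail closed: any error or missing score yields PROBATION."""
    try:
        score = lookup(tenant_id)
    except Exception:
        score = None
    if score is None:
        return ScoreResult(tenant_id, None, "PROBATION")
    return ScoreResult(tenant_id, score, "SCORED")


# Stand-in for the real gRPC client call (hypothetical).
_cache = {"tenant-a": 0.12}


def fake_lookup(tenant_id: str) -> Optional[float]:
    if tenant_id == "tenant-down":
        raise TimeoutError("score service unreachable")
    return _cache.get(tenant_id)
```

Both an unknown tenant and an unreachable service end up as PROBATION, so no tenant passes compliance evaluation without either a score or a probation handle.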

4. Responsibilities

#	Responsibility
R1	Consume `firewall.audit.v1`, `sms.events.status.v1`, `sms.dlr.inbound.v1`, `cdr.generated.v1` and project into the ClickHouse feature store with P95 ingestion lag ≤ 30 s
R2	Run AIT detection pipeline (graph + XGBoost) on rolling 5-minute windows; mean-time-to-detect ≤ 15 min for a new AIT campaign
R3	Run SIM-box detection pipeline on inbound MO patterns (per-ASN, per-MNO, per-MSISDN-range); emit `fraud.detected.simbox.v1` with confidence ≥ 0.8
R4	Run OTP-harvesting and OTP-grinding detection on tenant outbound traffic; emit per-tenant fraud cases visible in `admin-dashboard`
R5	Run grey-route arbitrage detection on peer aggregator behaviour over rolling 24 h windows
R6	Maintain a model catalog (`fraud.models` table) with model ID, version, training-set hash, deployment date, performance metrics; every detection event carries `aiProvenance`
R7	Expose synchronous gRPC `Score(tenantId
R8	Export a MISP-compatible fraud feed (JSON Lines + STIX 2.1) signed with HSM key for cross-operator and regulator consumption
R9	Import MISP feeds from peer MNOs and from ATRA's regulator portal; merge into the feature store with source attribution and decay
R10	Provide an admin REST API for fraud analyst case-management workflow: `cases`, `case_evidence`, `case_decisions` (confirm fraud / dismiss / refine model)
R11	Publish per-tenant fraud scores hourly via `fraud.tenant_score.updated.v1` for use in `compliance-engine`'s tenant tier calculation
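
R6 requires every detection event to carry `aiProvenance`. The following sketch shows how such a record might be assembled. The field names (`modelId`, `modelVersion`, `trainingSetHash`, `shapTop3`), the hashing of a canonical JSON encoding, and the sample values are assumptions for illustration; EVENT_SCHEMAS is the authoritative definition.

```python
import hashlib
import json


def training_set_hash(rows: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the training rows."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def top_features(shap_values: dict[str, float], k: int = 3) -> list[str]:
    """Names of the k features with the largest absolute SHAP contribution."""
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    return ranked[:k]


def build_provenance(model_id: str, version: str, rows: list[dict],
                     shap_values: dict[str, float]) -> dict:
    """Assemble a provenance record (hypothetical field names)."""
    return {
        "modelId": model_id,
        "modelVersion": version,
        "trainingSetHash": training_set_hash(rows),
        "shapTop3": top_features(shap_values, 3),
    }


provenance = build_provenance(
    "ait-xgb", "1.0.0",
    rows=[{"dlr_rate": 0.2, "label": 1}, {"dlr_rate": 0.9, "label": 0}],
    shap_values={"dlr_rate": -0.41, "otp_ratio": 0.35, "asn_count": 0.02, "hour": -0.10},
)
```

Ranking by absolute value keeps strongly negative contributions in the top three, which matters when a feature argues against the detection.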

5. Non-Responsibilities

Does not enforce verdicts on individual messages — that is sms-firewall-service (perimeter) and compliance-engine (outbound)
Does not terminate SMPP binds or quarantine peers — sms-firewall-service does that based on fraud.detected.* events
Does not train or serve general-purpose LLMs — the local LLM service handles that
Does not generate billing CDRs — cdr-mediation-service does
Does not manage tenant onboarding decisions — compliance-engine consumes our scores but owns tenant tier policy
Does not persist subscriber consent — consent-ledger-service

6. Upstream / Downstream Dependencies

Direction	Service	Protocol	Purpose
Inbound event	`sms-firewall-service`	NATS JetStream `firewall.audit.v1`	Verdict evidence for AIT / SIM-box detection
Inbound event	`sms-orchestrator`	NATS JetStream `sms.events.status.v1`	Outbound SMS lifecycle events
Inbound event	`dlr-processor`	NATS JetStream `sms.dlr.inbound.v1`	Delivery receipts for AIT detection (DLR success-rate per terminating MNO)
Inbound event	`cdr-mediation-service`	NATS JetStream `cdr.generated.v1`	Per-message CDR for grey-route arbitrage detection
Inbound event	`consent-ledger-service`	NATS JetStream `consent.revoked.v1`	OTP-harvesting heuristic input (revocation cohorts)
Inbound caller	`compliance-engine`	gRPC `Score(scope, id)`	Per-tenant / sender-ID / MSISDN fraud score
Inbound caller	`routing-engine`	gRPC `Score(senderId)` (optional)	Routing decision input for high-risk senders
Inbound admin	`admin-dashboard`	HTTP REST (mTLS, JWT role `tns-fraud-analyst`)	Case management
Inbound MISP	`regulator-portal-service`	HTTP REST `POST /v1/internal/fraud/feed/import`	Regulator MISP feed import
Outbound read/write	ClickHouse `fraud_features` schema	TCP	Feature store, detection log
Outbound read/write	PostgreSQL `fraud` schema	TCP	Model catalog, cases, MISP entities
Outbound read/write	Redis	TCP	Hot feature cache, model output cache
Outbound read/write	MinIO	S3	Model artifacts, signed MISP exports
Outbound events	NATS JetStream	TCP	`fraud.detected.*`, `fraud.tenant_score.updated.v1`, `fraud.feed.updated.v1`
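
The table above names DLR success rate per terminating MNO as an AIT input, and R2 works on 5-minute windows. A minimal sketch of that aggregation follows. It uses fixed (tumbling) 5-minute buckets for simplicity rather than a true rolling window, and the event tuple layout and function name are assumptions.

```python
from collections import defaultdict

WINDOW_S = 300  # 5-minute buckets


def dlr_success_rates(events: list[tuple[int, str, bool]]) -> dict[tuple[int, str], float]:
    """Map (window_start, mno) to delivered / total for that bucket.

    Each event is (epoch_seconds, terminating_mno, delivered).
    """
    totals = defaultdict(int)
    delivered = defaultdict(int)
    for ts, mno, ok in events:
        key = (ts - ts % WINDOW_S, mno)  # align timestamp to its bucket start
        totals[key] += 1
        if ok:
            delivered[key] += 1
    return {key: delivered[key] / totals[key] for key in totals}


events = [
    (0, "mno-a", True), (120, "mno-a", True), (299, "mno-a", False),
    (10, "mno-b", False),
    (300, "mno-a", True),
]
rates = dlr_success_rates(events)
```

A downstream step could then compare each bucket's rate against that MNO's baseline to surface anomalies.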

7. High-Level Flow — AIT Detection

8. High-Level Flow — Synchronous Score gRPC

9. Runtime Topology Summary

Aspect	Value
Process model	Two deployment groups: (a) `fraud-intel-service` (NestJS, exposes gRPC + REST), (b) `fraud-intel-worker` (Python, ML batch pipelines, runs every 5 min via CronJob + KEDA-scaled stream-processor)
Replicas	`fraud-intel-service`: minReplicas=3 in kbl, 2 in mzr; `fraud-intel-worker`: scales 0–20 by KEDA on NATS lag
Node pool	`fraud-intel-service` on `np-ctrl`; `fraud-intel-worker` on `np-identity` (GPU optional for future deep models)
Startup	Service: load active model catalog → warm Redis caches; Worker: pull model artifacts from MinIO → register with control-plane
Hot reload	New model version → admin REST `POST /v1/admin/fraud/models/{id}/promote` → workers swap model atomically on next batch boundary
Shutdown	Drain NATS consumers (max 30 s) → flush in-memory feature buffers to ClickHouse → exit
Region affinity	kbl primary; mzr warm standby; cross-region ClickHouse replication for queries; model registry mirrored
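
The hot-reload row says workers swap models atomically on the next batch boundary. One way to get that behaviour is to stage the promoted model and switch only when a new batch starts. This is a sketch under that assumption; `ModelHolder` and its methods are hypothetical and not the worker's real interface.

```python
import threading
from typing import Callable, Iterable

Model = Callable[[float], float]


class ModelHolder:
    """Holds the active model; promotions take effect at the next batch."""

    def __init__(self, version: str, model: Model) -> None:
        self._lock = threading.Lock()
        self._active = (version, model)
        self._pending = None

    def promote(self, version: str, model: Model) -> None:
        """Stage a new model; it is not used until the next batch starts."""
        with self._lock:
            self._pending = (version, model)

    def run_batch(self, batch: Iterable[float]) -> tuple[str, list[float]]:
        # Swap once, at the boundary, then score the whole batch with one model.
        with self._lock:
            if self._pending is not None:
                self._active, self._pending = self._pending, None
            version, model = self._active
        return version, [model(x) for x in batch]


holder = ModelHolder("v1", lambda x: x * 0.5)
first = holder.run_batch([1.0, 2.0])
holder.promote("v2", lambda x: x * 0.25)
second = holder.run_batch([1.0, 2.0])
```

Because the model reference is read once per batch, no batch is ever scored by a mix of two versions, and the version returned alongside the scores can be recorded in the detection's provenance.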

10. Key Design Decisions

Decision	Rationale
Asynchronous detection plane; `Score` gRPC is the only synchronous entry point	Detection latency is allowed to be minutes; synchronous use is limited to read-only score lookups against a pre-computed cache
ClickHouse for feature store, not Postgres	Telemetry-class volumes (10 M events/h target) require columnar OLAP; Postgres is reserved for case management and model metadata
XGBoost + graph features, not deep learning (initially)	Telecom fraud patterns are well-suited to gradient-boosted trees; explainability via SHAP values matters for regulator defensibility
Detections are events, not in-line decisions	Every detection is a NATS event consumable by multiple downstream services; this avoids hard coupling and lets policy evolve independently
MISP-compatible feed format	Industry-standard threat-intel exchange; lets us interoperate with peer MNOs and regulator without bespoke schema
Detections carry `aiProvenance` (model ID, version, training-set hash, SHAP top-3 features)	Regulator-grade explainability; cases can be defended in dispute resolution
Fail-soft on detection, fail-closed on `Score`	A missed window can be backfilled; an unscored tenant cannot be allowed through compliance without a probation handle
Per-fraud-class pipeline, not a single mega-model	Each pipeline can be retrained, A/B-tested, and rolled back independently; reduces blast radius
Confidence thresholds 0.85 / 0.60 / 0.40: above 0.85 auto-enforce; 0.60–0.85 case opened for human-in-the-loop (HITL) review; below 0.60 logged only	Calibrated to observed false-positive cost (subscriber impact) vs false-negative cost (revenue + reputation)
Models are versioned, signed, and immutable in MinIO	Supply-chain integrity for ML artifacts; prevents drift between training and inference replicas
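
The confidence-threshold decision in the table above maps naturally onto a small triage function. The sketch below is illustrative: the names `Action` and `triage` are hypothetical, the treatment of the exact boundary values (0.85 falls in the case band, 0.60 opens a case) is an assumption, and the 0.40 value is left out because the table states no action for it.

```python
from enum import Enum


class Action(Enum):
    AUTO_ENFORCE = "auto_enforce"
    OPEN_CASE = "open_case"  # human-in-the-loop review
    LOG_ONLY = "log_only"


AUTO_ENFORCE_ABOVE = 0.85
OPEN_CASE_MIN = 0.60


def triage(confidence: float) -> Action:
    """Route a detection by confidence (boundary handling is an assumption)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence > AUTO_ENFORCE_ABOVE:
        return Action.AUTO_ENFORCE
    if confidence >= OPEN_CASE_MIN:
        return Action.OPEN_CASE
    return Action.LOG_ONLY
```

For example, `triage(0.91)` yields `Action.AUTO_ENFORCE`, `triage(0.70)` yields `Action.OPEN_CASE`, and `triage(0.30)` yields `Action.LOG_ONLY`.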

11. Cross-Service Citations

Related epic	Owner service	Why it matters here
`EP-FW-02` Transit MT Firewall	`sms-firewall-service`	Consumes our `fraud.detected.greyroute.v1` and `fraud.detected.simbox.v1` to promote to BLOCK rules
`EP-FW-03` Federation	`sms-firewall-service`	Our MISP feed supplies federated blocklist entries with `source = 'FRAUD_INTEL'`
`EP-CE-*` Compliance scoring	`compliance-engine`	Calls our `Score` gRPC for per-tenant fraud score; consumes `fraud.tenant_score.updated.v1`
`EP-SID-*` Sender-ID lifecycle	`sender-id-registry-service`	Consumes `fraud.detected.senderid_abuse.v1` to suspend abused IDs
`EP-DLR-*` DLR processing	`dlr-processor`	Our primary AIT signal: per-MNO DLR success rate anomalies
`EP-CONS-02` STOP-keyword	`consent-ledger-service`	Mass revocations are an OTP-harvesting signal

12. Open Questions

ID	Question	Owner	Target
OQ-FRAUD-01	Should the MISP export include sender-ID-class indicators or only MSISDN-class to align with ATRA's published format?	Regulator Liaison	2026-05-30
OQ-FRAUD-02	Are we permitted to share tenant-attributed AIT cases with peer MNOs, or only de-identified MSISDN-level evidence?	Legal	2026-05-15
OQ-FRAUD-03	How do we handle cross-region model drift — train per-region or unified model?	Data Science Lead	2026-06-15
OQ-FRAUD-04	What is the expected GPU budget if we migrate the OTP-harvesting classifier to a transformer model in 2027?	SRE + DS	2026-Q4
