Fraud Intelligence (fraud-intel-service) — Service Overview
Version: 1.0
Status: Draft
Owner: Trust & Safety
Last Updated: 2026-04-20
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION
Related ADR: ADR-0004 National-Backbone Resilience §3
1. Purpose — Telecom Fraud Detection at National Scale
The Fraud Intelligence Service is the central detection brain for telecom-grade fraud on the Ghasi SMS backbone. It ingests evidence from every leg of the message lifecycle (orchestrator submits, firewall verdicts, SMPP DLRs, MO PDUs, billing CDRs) and emits high-confidence detections that drive automated enforcement (firewall rule promotion, tenant scoring, sender-ID suspension) and regulator-grade fraud feed exports.
Whereas sms-firewall-service enforces rules at the perimeter and compliance-engine enforces policy on outbound tenant SMS, fraud-intel-service is inferential and analytical. It operates on cross-message graphs and time-series patterns to detect what no single rule can express:
- AIT — Artificially Inflated Traffic generated to harvest SMS termination revenue (typically OTPs)
- SIM-box — voice-converted SMS terminating via grey-route hardware bypassing legitimate interconnect
- OTP harvesting — campaigns triggering OTPs against accounts the attacker controls in order to capture per-message termination revenue
- OTP grinding — automated OTP guessing against a single victim (security threat to the recipient)
- Grey-route arbitrage — long-running peer aggregator schemes routing traffic via untaxed paths
- Sender-ID spoofing networks — patterns of sender-ID misuse across multiple aggregators
The service is asynchronous (NATS consumer + ML batch pipeline + offline graph queries), with one synchronous gRPC entry point (Score) used by compliance-engine for tenant-level scoring during evaluation. It does not sit in the data-plane critical path.
2. Position in the Platform — The Detection Plane
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ sms-firewall │ │ sms-orchestrator│ │ dlr-processor │
│ firewall.audit │ │ sms.events.* │ │ sms.dlr.inbound │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
│ NATS │ NATS │ NATS
▼ ▼ ▼
╔══════════════════════════════════════════════════════════════════╗
║ fraud-intel-service ║
║ ║
║ ┌─────────────────────────┐ ┌──────────────────────────┐ ║
║ │ Stream Ingestion │───▶│ Feature Store (ClickHouse)│ ║
║ │ (NATS consumer pool) │ │ + Redis hot features │ ║
║ └─────────────────────────┘ └─────────┬────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌─────────────────────────────────────────────────────────┐ ║
║ │ Detection Pipelines (per fraud class) │ ║
║ │ · AIT: graph + ML (XGBoost on cross-tenant features) │ ║
║ │ · SIM-box: temporal + HLR-mismatch + ASN heuristic │ ║
║ │ · OTP harvesting: OTP-keyword + recipient cohort │ ║
║ │ · OTP grinding: per-MSISDN OTP-attempt bursts │ ║
║ │ · Grey-route: long-window peer-MNO routing entropy │ ║
║ └─────────────────────────┬───────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ fraud.detected.v1 (NATS, signed) ║
╚════════════════════════════════════════════════════════════════════╝
│
┌────────────────────────┼────────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────────┐
│ sms-firewall │ │ compliance-engine│ │ regulator-portal + │
│ promotes to BLOCK│ │ tenant scoring │ │ MISP fraud-feed export│
└──────────────────┘ └──────────────────┘ └──────────────────────┘
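
The diagram names "routing entropy" as the grey-route signal without defining it. The following is a minimal sketch, assuming the feature is the Shannon entropy of the distribution of routes observed for one peer over a window; the function name `routing_entropy` and the route labels are hypothetical and not taken from the service.

```python
import math
from collections import Counter
from typing import Iterable


def routing_entropy(route_ids: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical route distribution.

    Assumption: one route label per observed message in the window.
    0.0 means all traffic used a single route; higher values mean
    traffic is spread across more routes.
    """
    counts = Counter(route_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(
        -(c / total) * math.log2(c / total) for c in counts.values()
    )


# Hypothetical windows for one peer aggregator.
stable = ["mno-a"] * 8
spread = ["mno-a", "mno-b", "mno-c", "mno-d"] * 2
print(routing_entropy(stable), routing_entropy(spread))
```

A detector built on this would compare the value against a per-peer baseline over the long window; that baseline logic is outside this sketch.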
3. Bounded Context
| Dimension | Value |
|---|---|
| Domain | Trust & Safety / Fraud Intelligence |
| Owner squad | Trust & Safety (Data Science track) |
| Deployment unit | Kubernetes — fraud-intel-service (control-plane) + fraud-intel-worker-pool (batch / ML) |
| Communication style | Inbound: NATS JetStream consumers (primary) · gRPC Score (synchronous, low-volume) · HTTP REST (admin + MISP feed) |
| Storage | ClickHouse (feature store + detection log) · Postgres (model catalog, detection cases, MISP entities) · Redis (hot features, model output cache) · MinIO (model artifacts, training datasets, signed feed exports) |
| Failure mode | Fail-soft for detection (a missed window is acceptable; backfill later); fail-closed for Score gRPC (compliance-engine treats unscored tenants as PROBATION) |
| Region affinity | kbl primary; mzr runs as warm standby; ClickHouse is regionally sharded with cross-region read replicas |
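
The fail-closed behaviour in the Failure mode row can be sketched from the caller's side. This is an illustration only, assuming the caller wraps the Score call in a function that returns `None` or raises when no score is available; `resolve_tenant_status` and `fetch_score` are hypothetical names, not part of the actual gRPC contract.

```python
from typing import Callable, Optional, Tuple


def resolve_tenant_status(
    tenant_id: str,
    fetch_score: Callable[[str], Optional[float]],
) -> Tuple[str, Optional[float]]:
    """Return ("SCORED", score), or ("PROBATION", None) when unscored.

    Fail-closed: a missing score, a timeout, or any other error from
    the Score call is treated the same way, so an unscored tenant is
    never passed through as if it were clean.
    """
    try:
        score = fetch_score(tenant_id)
    except Exception:
        # Assumption: every Score failure maps to "unscored".
        score = None
    if score is None:
        return ("PROBATION", None)
    return ("SCORED", score)


def unavailable(_tenant_id: str) -> Optional[float]:
    raise TimeoutError("Score call timed out")


print(resolve_tenant_status("tenant-1", lambda _t: 0.12))
print(resolve_tenant_status("tenant-2", unavailable))
```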
4. Responsibilities
| # | Responsibility |
|---|---|
| R1 | Consume firewall.audit.v1, sms.events.status.v1, sms.dlr.inbound.v1, cdr.generated.v1 and project them into the ClickHouse feature store with P95 ingestion lag ≤ 30 s |
| R2 | Run AIT detection pipeline (graph + XGBoost) on rolling 5-minute windows; mean-time-to-detect ≤ 15 min for a new AIT campaign |
| R3 | Run SIM-box detection pipeline on inbound MO patterns (per-ASN, per-MNO, per-MSISDN-range); emit fraud.detected.simbox.v1 with confidence ≥ 0.8 |
| R4 | Run OTP-harvesting and OTP-grinding detection on tenant outbound traffic; emit per-tenant fraud cases visible in admin-dashboard |
| R5 | Run grey-route arbitrage detection on peer aggregator behaviour over rolling 24 h windows |
| R6 | Maintain a model catalog (fraud.models table) with model ID, version, training-set hash, deployment date, performance metrics; every detection event carries aiProvenance |
| R7 | Expose synchronous gRPC `Score(tenantId |
| R8 | Export a MISP-compatible fraud feed (JSON Lines + STIX 2.1) signed with HSM key for cross-operator and regulator consumption |
| R9 | Import MISP feeds from peer MNOs and from ATRA's regulator portal; merge into the feature store with source attribution and decay |
| R10 | Provide an admin REST API for fraud analyst case-management workflow: cases, case_evidence, case_decisions (confirm fraud / dismiss / refine model) |
| R11 | Publish per-tenant fraud scores hourly to compliance-engine via fraud.tenant_score.updated.v1 for use in compliance-engine's tenant tier calculation |
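
R9 mentions merging imported indicators "with source attribution and decay" but does not specify the decay function. The sketch below assumes exponential decay with a fixed half-life and a merge rule that keeps the strongest decayed weight per indicator; the `Indicator` fields, the 168 h half-life, and the merge rule are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

HALF_LIFE_HOURS = 168.0  # assumption: weight halves every 7 days


@dataclass(frozen=True)
class Indicator:
    value: str          # e.g. an MSISDN or sender ID
    source: str         # attribution, e.g. a peer MNO or the regulator
    base_weight: float  # weight assigned at import time
    age_hours: float    # time since the source reported it


def decayed_weight(ind: Indicator,
                   half_life_hours: float = HALF_LIFE_HOURS) -> float:
    """Exponentially decay the import-time weight by indicator age."""
    return ind.base_weight * 0.5 ** (ind.age_hours / half_life_hours)


def merge_indicators(
    indicators: Iterable[Indicator],
) -> Dict[str, Tuple[float, str]]:
    """Per indicator value, keep the strongest decayed weight and its source."""
    merged: Dict[str, Tuple[float, str]] = {}
    for ind in indicators:
        weight = decayed_weight(ind)
        current = merged.get(ind.value)
        if current is None or weight > current[0]:
            merged[ind.value] = (weight, ind.source)
    return merged


# Hypothetical feed entries for the same MSISDN from two sources.
feed = [
    Indicator("+93700000001", "peer-mno", 1.0, 336.0),
    Indicator("+93700000001", "regulator", 0.6, 0.0),
]
print(merge_indicators(feed))
```

Keeping the source next to the weight preserves attribution after the merge, so an analyst can see which feed currently drives an indicator.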
5. Non-Responsibilities
- Does not enforce verdicts on individual messages — that is sms-firewall-service (perimeter) and compliance-engine (outbound)
- Does not terminate SMPP binds or quarantine peers — sms-firewall-service does that based on fraud.detected.* events
- Does not train or serve general-purpose LLMs — the local LLM service handles that
- Does not generate billing CDRs — cdr-mediation-service does
- Does not manage tenant onboarding decisions — compliance-engine consumes our scores but owns tenant tier policy
- Does not persist subscriber consent — consent-ledger-service
6. Upstream / Downstream Dependencies
| Direction | Service | Protocol | Purpose |
|---|---|---|---|
| Inbound event | sms-firewall-service | NATS JetStream firewall.audit.v1 | Verdict evidence for AIT / SIM-box detection |
| Inbound event | sms-orchestrator | NATS JetStream sms.events.status.v1 | Outbound SMS lifecycle events |
| Inbound event | dlr-processor | NATS JetStream sms.dlr.inbound.v1 | Delivery receipts for AIT detection (DLR success-rate per terminating MNO) |
| Inbound event | cdr-mediation-service | NATS JetStream cdr.generated.v1 | Per-message CDR for grey-route arbitrage detection |
| Inbound event | consent-ledger-service | NATS JetStream consent.revoked.v1 | OTP-harvesting heuristic input (revocation cohorts) |
| Inbound caller | compliance-engine | gRPC Score(scope, id) | Per-tenant / sender-ID / MSISDN fraud score |
| Inbound caller | routing-engine | gRPC Score(senderId) (optional) | Routing decision input for high-risk senders |
| Inbound admin | admin-dashboard | HTTP REST (mTLS, JWT role tns-fraud-analyst) | Case management |
| Inbound MISP | regulator-portal-service | HTTP REST POST /v1/internal/fraud/feed/import | Regulator MISP feed import |
| Outbound read/write | ClickHouse fraud_features schema | TCP | Feature store, detection log |
| Outbound read/write | PostgreSQL fraud schema | TCP | Model catalog, cases, MISP entities |
| Outbound read/write | Redis | TCP | Hot feature cache, model output cache |
| Outbound read/write | MinIO | S3 | Model artifacts, signed MISP exports |
| Outbound events | NATS JetStream | TCP | fraud.detected.*, fraud.tenant_score.updated.v1, fraud.feed.updated.v1 |
7. High-Level Flow — AIT Detection
8. High-Level Flow — Synchronous Score gRPC
9. Runtime Topology Summary
| Aspect | Value |
|---|---|
| Process model | Two deployment groups: (a) fraud-intel-service (NestJS, exposes gRPC + REST), (b) fraud-intel-worker (Python, ML batch pipelines, runs every 5 min via CronJob + KEDA-scaled stream-processor) |
| Replicas | fraud-intel-service: minReplicas=3 in kbl, 2 in mzr; fraud-intel-worker: scales 0–20 by KEDA on NATS lag |
| Node pool | fraud-intel-service on np-ctrl; fraud-intel-worker on np-identity (GPU optional for future deep models) |
| Startup | Service: load active model catalog → warm Redis caches; Worker: pull model artifacts from MinIO → register with control-plane |
| Hot reload | New model version → admin REST POST /v1/admin/fraud/models/{id}/promote → workers swap model atomically on next batch boundary |
| Shutdown | Drain NATS consumers (max 30 s) → flush in-memory feature buffers to ClickHouse → exit |
| Region affinity | kbl primary; mzr warm standby; cross-region ClickHouse replication for queries; model registry mirrored |
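
The Hot reload row says workers swap models atomically on the next batch boundary. One way to get that behaviour is sketched below, assuming a worker holds the active model in memory and a promotion only stages the new version until the next batch starts; `ModelHolder` and its methods are hypothetical names, and the "model" is reduced to a plain callable.

```python
import threading
from typing import Callable, List, Optional, Tuple

Model = Tuple[str, str, Callable[[dict], float]]  # (model_id, version, predict)


class ModelHolder:
    """Holds the active model; a promoted model takes effect only at a batch boundary."""

    def __init__(self, active: Model) -> None:
        self._lock = threading.Lock()
        self._active: Model = active
        self._pending: Optional[Model] = None

    def promote(self, model: Model) -> None:
        # Stage the new version; batches already running keep their model.
        with self._lock:
            self._pending = model

    def begin_batch(self) -> Model:
        # Swap in the staged model, if any, then pin it for this batch.
        with self._lock:
            if self._pending is not None:
                self._active = self._pending
                self._pending = None
            return self._active


def score_batch(holder: ModelHolder, rows: List[dict]) -> List[Tuple[str, float]]:
    """Score every row of one batch with a single, pinned model version."""
    _model_id, version, predict = holder.begin_batch()
    return [(version, predict(row)) for row in rows]


holder = ModelHolder(("ait", "v1", lambda row: 0.1))
first = score_batch(holder, [{}, {}])
holder.promote(("ait", "v2", lambda row: 0.9))
second = score_batch(holder, [{}])
print(first, second)
```

Pinning the model per batch means every detection in a batch carries the same model version, which keeps provenance consistent within the batch.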
10. Key Design Decisions
| Decision | Rationale |
|---|---|
| Asynchronous detection plane, synchronous-only Score gRPC | Detection latency is allowed to be minutes; synchronous use is read-only score lookups against pre-computed cache |
| ClickHouse for feature store, not Postgres | Telemetry-class volumes (10 M events/h target) require columnar OLAP; Postgres is reserved for case management and model metadata |
| XGBoost + graph features, not deep learning (initially) | Telecom fraud patterns are well-suited to gradient-boosted trees; explainability via SHAP values matters for regulator defensibility |
| Detections are events, not in-line decisions | Every detection is a NATS event consumable by multiple downstream services; this avoids hard coupling and lets policy evolve independently |
| MISP-compatible feed format | Industry-standard threat-intel exchange; lets us interoperate with peer MNOs and the regulator without a bespoke schema |
| Detections carry aiProvenance (model ID, version, training-set hash, SHAP top-3 features) | Regulator-grade explainability; cases can be defended in dispute resolution |
| Fail-soft on detection, fail-closed on Score | A missed window can be backfilled; an unscored tenant cannot be allowed through compliance without a probation handle |
| Per-fraud-class pipeline, not a single mega-model | Each pipeline can be retrained, A/B-tested, and rolled back independently; reduces blast radius |
| Confidence thresholds 0.85 / 0.60 / 0.40: above 0.85 auto-enforce; 0.60–0.85 case opened (HITL); below 0.60 logged-only | Calibrated to observed false-positive cost (subscriber impact) vs false-negative cost (revenue + reputation) |
| Models are versioned, signed, and immutable in MinIO | Supply-chain integrity for ML artifacts; prevents drift between training and inference replicas |
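
The threshold and provenance decisions above can be read together as a small routing rule. The sketch below is an assumption-laden illustration: it implements only the two bands the table spells out (0.85 and 0.60), treats exactly 0.85 as belonging to the HITL band, and leaves out the 0.40 value because the table does not say what action it maps to. The field names on `AiProvenance` and the action labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

AUTO_ENFORCE_THRESHOLD = 0.85  # above this: auto-enforce
CASE_THRESHOLD = 0.60          # from here up to 0.85: open a case (HITL)


@dataclass(frozen=True)
class AiProvenance:
    # Fields mirror the list in the table; exact names are assumptions.
    model_id: str
    model_version: str
    training_set_hash: str
    shap_top3: Tuple[str, str, str]


def route_detection(confidence: float) -> str:
    """Map a detection confidence to an action band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    if confidence > AUTO_ENFORCE_THRESHOLD:
        return "AUTO_ENFORCE"
    if confidence >= CASE_THRESHOLD:
        return "OPEN_CASE"
    return "LOG_ONLY"


# Hypothetical provenance attached to a detection.
provenance = AiProvenance(
    model_id="ait-xgb",
    model_version="1.4.0",
    training_set_hash="sha256:<elided>",
    shap_top3=("dlr_success_rate", "recipient_cohort_size", "otp_ratio"),
)
for c in (0.92, 0.85, 0.60, 0.35):
    print(c, route_detection(c), provenance.model_version)
```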
11. Cross-Service Citations
| Related epic | Owner service | Why it matters here |
|---|---|---|
| EP-FW-02 Transit MT Firewall | sms-firewall-service | Consumes our fraud.detected.greyroute.v1 and fraud.detected.simbox.v1 to promote to BLOCK rules |
| EP-FW-03 Federation | sms-firewall-service | Our MISP feed supplies federated blocklist entries with source = 'FRAUD_INTEL' |
| EP-CE-* Compliance scoring | compliance-engine | Calls our Score gRPC for per-tenant fraud score; consumes fraud.tenant_score.updated.v1 |
| EP-SID-* Sender-ID lifecycle | sender-id-registry-service | Consumes fraud.detected.senderid_abuse.v1 to suspend abused IDs |
| EP-DLR-* DLR processing | dlr-processor | Our primary AIT signal: per-MNO DLR success rate anomalies |
| EP-CONS-02 STOP-keyword | consent-ledger-service | Mass revocations are an OTP-harvesting signal |
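
The per-MNO DLR success rate cited above as the primary AIT signal can be illustrated with a small aggregation. This is a simplified sketch: it uses fixed (tumbling) 5-minute buckets rather than rolling windows, and the event tuple shape `(epoch_seconds, mno, delivered)` is an assumption, not the sms.dlr.inbound.v1 schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 300  # 5-minute buckets

DlrEvent = Tuple[int, str, bool]  # (epoch_seconds, terminating MNO, delivered)


def dlr_success_rates(
    events: Iterable[DlrEvent],
) -> Dict[Tuple[int, str], float]:
    """Return {(window_start, mno): delivered / total} per 5-minute bucket."""
    totals: Dict[Tuple[int, str], List[int]] = defaultdict(lambda: [0, 0])
    for ts, mno, delivered in events:
        window_start = ts - ts % WINDOW_SECONDS
        bucket = totals[(window_start, mno)]
        bucket[0] += 1 if delivered else 0  # delivered count
        bucket[1] += 1                      # total count
    return {key: d / t for key, (d, t) in totals.items()}


# Hypothetical receipts for one terminating MNO.
events = [
    (0, "mno-a", True),
    (10, "mno-a", False),
    (299, "mno-a", True),
    (300, "mno-a", False),
]
print(dlr_success_rates(events))
```

An anomaly detector would compare each bucket against the MNO's historical rate; the comparison itself is not shown here.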
12. Open Questions
| ID | Question | Owner | Target |
|---|---|---|---|
| OQ-FRAUD-01 | Should the MISP export include sender-ID-class indicators or only MSISDN-class to align with ATRA's published format? | Regulator Liaison | 2026-05-30 |
| OQ-FRAUD-02 | Are we permitted to share tenant-attributed AIT cases with peer MNOs, or only de-identified MSISDN-level evidence? | Legal | 2026-05-15 |
| OQ-FRAUD-03 | How do we handle cross-region model drift — train per-region or unified model? | Data Science Lead | 2026-06-15 |
| OQ-FRAUD-04 | What is the expected GPU budget if we migrate the OTP-harvesting classifier to a transformer model in 2027? | SRE + DS | 2026-Q4 |