Fraud Intelligence Service — Application Logic
Version: 1.0 | Status: Draft | Owner: Trust and Safety | Last Updated: 2026-04-21 | Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION
This document specifies the use cases that compose the bounded context. The service has one synchronous hot path (Score gRPC, sub-100 ms) and eight asynchronous pipelines (ingestion, feature engineering, four ML detection pipelines, MISP feed I/O, model training and promotion, scoring recompute). Latency budgets and fail-modes are stated per use case.
1. Use Cases
UC-01: Score (gRPC hot path)
Trigger: compliance-engine evaluates an outbound message and calls FraudIntelService.Score(scope, id). Also called by routing-engine for high-risk sender decisions.
Input: ScoreRequest { scope: TENANT|SENDER_ID|MSISDN, id, traceId }
Output: ScoreResponse { score (0..1), tier (SAFE|WATCH|RISKY|HIGH_RISK|PROBATION), contributingFactors[], modelId, modelVersion, computedAt, traceId }
SLA: P50 ≤ 10 ms (cache hit), P95 ≤ 50 ms, P99 ≤ 100 ms.
Steps:
- Input validation. Reject with INVALID_ARGUMENT if scope is unknown or id is malformed (E.164 for MSISDN; UUID for TENANT; alpha-numeric ≤ 11 chars for SENDER_ID).
- mTLS allowlist check. Caller's SPIFFE ID must be in [spiffe://ghasi/compliance-engine, spiffe://ghasi/routing-engine, spiffe://ghasi/sender-id-registry]. Otherwise PERMISSION_DENIED.
- L1 Redis lookup. GET fraud:score:{scope}:{id} with TTL 15 min. If hit → return immediately. P99 ≤ 10 ms.
- L2 Postgres materialised view lookup. SELECT * FROM fraud.entity_scores WHERE scope = $1 AND subject_id = $2. If found → populate Redis (SETEX 900) and return. P95 ≤ 30 ms.
- Cold path (no recent computation). If the entity has had no signals in the last 30 days → return tier = PROBATION, score = 0.5, contributingFactors = []. Otherwise trigger an async refresh (publish to fraud:score:refresh:queue) and return the most recent value with staleSeconds populated.
- Emit Score access metric (fraud_score_grpc_total{scope, tier, cache}) and Prometheus histogram fraud_score_grpc_duration_seconds.
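The lookup order in the steps above can be sketched with in-memory stand-ins. This is a minimal illustration, not the service's implementation: `TieredScoreReader`, `Score`, and the dictionaries that play the role of Redis and `fraud.entity_scores` are hypothetical names, and the async-refresh branch of the cold path is left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Optional

L1_TTL_SECONDS = 900  # 15 min, matching SETEX 900 in the steps above


@dataclass
class Score:
    score: float
    tier: str
    computed_at: float
    stale_seconds: Optional[int] = None


class TieredScoreReader:
    """In-memory stand-in for the L1 (Redis) then L2 (Postgres) lookup order."""

    def __init__(self, l2_rows, clock=time.time):
        self.l1 = {}        # key -> (expires_at, Score); plays the role of Redis
        self.l2 = l2_rows   # key -> Score; plays the role of fraud.entity_scores
        self.clock = clock

    def read(self, scope, subject_id):
        key = f"fraud:score:{scope}:{subject_id}"
        now = self.clock()
        hit = self.l1.get(key)
        if hit is not None and hit[0] > now:
            return hit[1], "l1"
        row = self.l2.get(key)
        if row is not None:
            # Equivalent of SETEX 900: cache the row for the next caller.
            self.l1[key] = (now + L1_TTL_SECONDS, row)
            return row, "l2"
        # Cold path: nothing stored, so return the neutral default.
        return Score(score=0.5, tier="PROBATION", computed_at=now), "cold"
```

The second element of the returned tuple corresponds to the `cache` label on `fraud_score_grpc_total`; the label values `l1`, `l2`, and `cold` are illustrative.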
Error codes:
| gRPC status | Condition | Caller behaviour |
|---|---|---|
| OK | Score returned | Use as input |
| INVALID_ARGUMENT | Bad input | Caller logs and skips fraud factor |
| PERMISSION_DENIED | Caller SPIFFE not allowlisted | Caller logs SOC alert |
| UNAVAILABLE, DEADLINE_EXCEEDED | Service down or slow | Caller treats as PROBATION (fail-closed-with-default) |
| INTERNAL | Internal error | Same as UNAVAILABLE |
Why fail-closed-with-default, not strict fail-closed: A fraud-intel outage must not cascade into a compliance freeze. Fraud is informational; the conservative default PROBATION lets compliance proceed with a neutral assumption until fraud-intel recovers.
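On the caller side, the error-code table might translate into something like the sketch below. `resolve_fraud_factor` and `FraudFactor` are hypothetical names, statuses are passed as plain strings rather than gRPC enum values, and the fallback score of 0.5 is an assumption borrowed from the cold-path default (the table only specifies the PROBATION tier).

```python
from dataclasses import dataclass
from typing import Optional

# Statuses for which the error table asks the caller to assume PROBATION.
FALLBACK_STATUSES = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}


@dataclass(frozen=True)
class FraudFactor:
    score: float
    tier: str
    degraded: bool  # True when the value is a default, not a real score


def resolve_fraud_factor(status, score=None, tier=None) -> Optional[FraudFactor]:
    """Map a Score call outcome to the fraud input used by the caller."""
    if status == "OK":
        return FraudFactor(score, tier, degraded=False)
    if status in FALLBACK_STATUSES:
        # Fail-closed-with-default: proceed with the neutral PROBATION tier.
        return FraudFactor(0.5, "PROBATION", degraded=True)
    # INVALID_ARGUMENT or PERMISSION_DENIED: no fraud factor is produced;
    # logging and SOC alerting are left to the caller.
    return None
```

Keeping a `degraded` flag lets the caller distinguish a genuine PROBATION score from an outage default when it logs the decision.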
UC-02: BulkScore (gRPC batch)
Trigger: compliance-engine retroactive scan or NOC dashboard refresh.
Input: BulkScoreRequest { entries: [{scope, id}], traceId } — max 1,000 entries per call.
Output: Stream of ScoreResponse per entry (server-streaming gRPC).
SLA: Throughput ≥ 5,000 scores/sec per pod via Redis pipelining.
Steps:
- Validate batch size ≤ 1,000 (else RESOURCE_EXHAUSTED).
- Pipelined MGET against fraud:score:{scope}:{id} for all entries.
- For misses: batch-SELECT against fraud.entity_scores (1 round-trip).
- For still-missing: emit PROBATION and queue refresh.
- Stream responses in input order; backpressure via gRPC flow control.
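A rough sketch of the batch resolution order, using dictionaries in place of Redis and Postgres. `bulk_score` is a hypothetical helper; the `ValueError` stands in for the RESOURCE_EXHAUSTED status, and the 0.5 default score is taken from the UC-01 cold path.

```python
MAX_BATCH = 1000


def bulk_score(entries, cache, table):
    """Resolve (scope, id) pairs in input order: cache, then table, then PROBATION."""
    if len(entries) > MAX_BATCH:
        raise ValueError("RESOURCE_EXHAUSTED: batch exceeds 1,000 entries")

    keys = [f"fraud:score:{scope}:{subject_id}" for scope, subject_id in entries]
    cached = [cache.get(k) for k in keys]  # stands in for one pipelined MGET
    missing = {k for k, v in zip(keys, cached) if v is None}
    from_table = {k: table[k] for k in missing if k in table}  # one batched SELECT

    results, refresh_queue = [], []
    for key, value in zip(keys, cached):
        if value is None:
            value = from_table.get(key)
        if value is None:
            value = {"tier": "PROBATION", "score": 0.5}
            refresh_queue.append(key)  # would be published for async refresh
        results.append(value)
    return results, refresh_queue
```

Building the result list by iterating over `keys` is what preserves input order regardless of which layer answered each entry.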
UC-03: AitGraphAnalysis (5-min batch pipeline)
Trigger: Cron */5 * * * * on fraud-intel-worker. Distributed Redis lock fraud:lock:ait:5m.
Steps:
- Window selection. Compute windowEnd = floor(now, 5m), windowStart = windowEnd - 5m.
- Feature engineering.

      INSERT INTO fraud_features.ait_window_features
      SELECT windowStart, tenant_id, mno_id AS dst_mno, sender_id,
             count(*) AS submit_count,
             sumIf(1, dlr_status = 'DELIVERED') AS dlr_delivered_count,
             sumIf(1, dlr_status = 'FAILED') AS dlr_failed_count,
             dlr_delivered_count / nullIf(submit_count, 0) AS dlr_success_rate,
             uniqExact(dst_msisdn) AS unique_dst_msisdns,
             avg(segments) AS mean_segments_per_msg,
             entropy(substring(dst_msisdn, 1, 5)) AS entropy_of_dst_prefix,
             uniqExact(sender_id) AS unique_sender_ids,
             sum(payload_hash_repeat_count) / nullIf(submit_count, 0) AS repeated_body_ratio,
             uniqExact(peer_asn) AS peer_asn_diversity
      FROM fraud_features.events
      WHERE event_ts >= windowStart AND event_ts < windowEnd
      GROUP BY tenant_id, mno_id, sender_id;

- Cohort feature join. Left-join the latest fraud_features.ait_cohorts.cohort_anomaly_score produced by UC-04.
- Imputation. Apply fraud.feature_imputation_policy to nulls (e.g. unique_sender_ids → 1, entropy_of_dst_prefix → mean of last 24h).
- Inference. Load active model artifact (cached on disk + RAM); run XGBoost batch predict via Triton (gRPC ModelInfer). Persist results to fraud_features.ait_predictions with shap_top3 (computed via TreeSHAP).
- Emit detections.
  - score >= 0.85 → publish fraud.detected.ait.v1 via outbox + insert fraud.detections row.
  - 0.6 ≤ score < 0.85 → insert fraud.cases (status PENDING_REVIEW) + publish fraud.case.opened.v1.
  - score < 0.6 → log only (ClickHouse retention 30d).
- Metric emission. fraud_pipeline_run_duration_seconds{pipeline="ait"}, fraud_pipeline_predictions_total{pipeline, bucket}.
Budget: 90 s for 1,000 active tenants × 5 MNOs.
Failure modes:
- Lock contention → second pod backs off; pipeline run skipped (next tick covers it).
- Triton unreachable → fall back to cached rule-based pattern matcher (FraudPattern rows for AIT category) for this run; fraud.alert.model.unavailable.v1 emitted.
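The window selection and score routing in UC-03 could look roughly like this. `select_window` and `route_prediction` are hypothetical helper names, the returned labels are illustrative, and `now` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)


def select_window(now):
    """Return (windowStart, windowEnd) with windowEnd = floor(now, 5m)."""
    epoch = int(now.timestamp())
    step = int(WINDOW.total_seconds())
    window_end = datetime.fromtimestamp(epoch - epoch % step, tz=timezone.utc)
    return window_end - WINDOW, window_end


def route_prediction(score):
    """Map a model score to the action bucket described in 'Emit detections'."""
    if score >= 0.85:
        return "DETECTION"  # publish fraud.detected.ait.v1 + fraud.detections row
    if score >= 0.6:
        return "CASE"       # fraud.cases row (PENDING_REVIEW) + fraud.case.opened.v1
    return "LOG_ONLY"
```

Flooring on epoch seconds keeps every pod aligned to the same window boundaries, so a skipped run after lock contention is covered cleanly by the next tick.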
UC-04: AitCohortJob (1-h rolling)
Trigger: Cron 0 * * * * (hourly).
Steps:
- Cohort hash computation.

      INSERT INTO fraud_features.ait_cohorts (cohort_hash, dst_msisdn_count, contributing_tenants,
                                              contributing_sender_ids, first_seen_ts, last_seen_ts)
      WITH per_tenant AS (
          SELECT tenant_id, sender_id, dst_msisdn,
                 min(event_ts) AS first_ts, max(event_ts) AS last_ts
          FROM fraud_features.events
          WHERE event_ts >= now() - INTERVAL 1 HOUR
            AND source_stream = 'SMS_STATUS'
          GROUP BY tenant_id, sender_id, dst_msisdn
      )
      SELECT cityHash64(arraySort(groupArray(dst_msisdn))) AS cohort_hash,
             uniqExact(dst_msisdn) AS dst_msisdn_count,
             uniqExact(tenant_id) AS contributing_tenants,
             uniqExact(sender_id) AS contributing_sender_ids,
             min(first_ts) AS first_seen_ts,
             max(last_ts) AS last_seen_ts
      FROM per_tenant
      GROUP BY cohort_hash
      HAVING dst_msisdn_count >= 5;

- Anomaly score. GraphSAGE inference over the bipartite (tenant, msisdn) graph; output cohort_anomaly_score ∈ [0,1] per cohort.
- Threshold check. dst_msisdn_count >= 100 AND contributing_tenants >= 5 AND cohort_anomaly_score >= 0.85 → publish fraud.detected.ait_ring.v1.
- Allowlist check. Cohorts present in fraud.cohort_exclusions (e.g. national emergency CBC fallback) are skipped with enforcementStatus = SUPPRESSED.
Budget: < 5 min for 1 M unique recipients in window. Worker requests CPU 4, memory 16 Gi.
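The threshold and allowlist checks in UC-04 can be expressed as one small decision function. This is a sketch under assumptions: `Cohort` and `evaluate_cohort` are hypothetical names, the returned labels are illustrative, and `exclusions` is a plain set of cohort hashes standing in for fraud.cohort_exclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    cohort_hash: int
    dst_msisdn_count: int
    contributing_tenants: int
    cohort_anomaly_score: float


def evaluate_cohort(cohort, exclusions):
    """Apply the threshold check first, then the allowlist check."""
    meets_threshold = (
        cohort.dst_msisdn_count >= 100
        and cohort.contributing_tenants >= 5
        and cohort.cohort_anomaly_score >= 0.85
    )
    if not meets_threshold:
        return "NO_ACTION"
    if cohort.cohort_hash in exclusions:
        return "SUPPRESSED"  # enforcementStatus = SUPPRESSED, nothing published
    return "PUBLISH_AIT_RING"  # fraud.detected.ait_ring.v1
```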
UC-05: SimboxPatternDetector (30-min batch)
Trigger: Cron */30 * * * *.
Steps:
- Feature aggregation per /28 MSISDN block (16-MSISDN blocks):
  - msisdn_range_density: fraction of /28 block that submitted MO in window.
  - body_template_hash_concentration: top-1 template-hash share of all MO bodies.
  - hlr_mismatch_rate: fraction where claimed MNO ≠ HLR-resolved MNO.
  - imsi_unique_count: distinct IMSI across the block.
  - mno_bind_concentration: share of MO routed via single peer-ASN.
- Threshold predicate (rule-based fast-path). density > 0.6 AND template_concentration > 0.4 AND hlr_mismatch > 0.3 → flag.
- Isolation Forest scoring (ML). Score the same feature vector with the active iForest model; ensemble with rule-based predicate.
- Detection emission. Threshold + ensemble agree → fraud.detected.simbox.v1 with evidence: { msisdnBlock, features, sampleEventIds[] }.
- Network detection. If a 24h cluster of detections shares a peer-ASN → emit fraud.detected.simbox_network.v1 (handled by 24h cron in UC-06).
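A minimal sketch of the rule-based predicate and the agreement check in UC-05. It assumes that "agree" means both the rule and the Isolation Forest flag the block; the function names are hypothetical and the iForest verdict is passed in as a boolean rather than computed here.

```python
BLOCK_SIZE = 16  # MSISDNs per /28 block, as stated in the steps above


def range_density(active_msisdns_in_block):
    """Fraction of the 16-MSISDN block that submitted MO in the window."""
    return active_msisdns_in_block / BLOCK_SIZE


def rule_flag(features):
    """Rule-based fast-path predicate over the aggregated block features."""
    return (
        features["msisdn_range_density"] > 0.6
        and features["body_template_hash_concentration"] > 0.4
        and features["hlr_mismatch_rate"] > 0.3
    )


def should_emit_simbox(features, iforest_is_anomaly):
    # Assumption: a detection is emitted only when both signals flag the block.
    return rule_flag(features) and iforest_is_anomaly
```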
UC-06: GreyRouteDetector (24-h rolling, hourly)
Trigger: Cron 15 * * * *.
Steps:
- Per-peer features over rolling 24h: total_mt, mt_to_peered_mno, mt_to_non_peered_mno, mt_to_non_peered_ratio, hlr_mismatch_rate, dlr_success_rate_anomaly.
- Predicate: mt_to_non_peered_ratio > 0.3 AND total_mt > 1000 over 24h.
- Score with calibrated model. Confidence ≥ 0.85 → emit fraud.detected.greyroute.v1 with auto-quarantine recommendation. 0.6 ≤ confidence < 0.85 → open case.
UC-07: OtpHarvestDetector (6-h rolling join)
Trigger: Cron */30 * * * * over 6h window.
Steps:
- Identify outbound OTP-class messages (events.is_otp_likely = TRUE set by UC-12 OTP-keyword classifier).
- Join with consent.revoked.v1 events from same recipient cohort within ±1h.
- Compute (tenantId, dstMsisdnCohortHash) → otpCount, revocationCount, revocationRate.
- revocationRate > 0.05 AND cohort_size >= 100 → publish fraud.detected.otp_harvesting.v1 with suggestedAction = SUSPEND_SENDER_ID. Confidence ≥ 0.85 → open case for NOC; only NOC may execute the suspension via fraud.case.action_dispatched.v1.
- Bank/Gov sender-IDs (per sender-id-registry-service lookup) → downgrade to flag only with audit-log entry.
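The rate computation and the Bank/Gov downgrade in UC-07 might be sketched as follows. The definition of revocationRate as revocations divided by OTP messages is an assumption (the steps above name the field but do not define it), and the function names and returned labels are hypothetical.

```python
def revocation_rate(otp_count, revocation_count):
    # Assumption: rate = revocations / OTP messages for one (tenant, cohort) pair.
    return revocation_count / otp_count if otp_count else 0.0


def harvest_outcome(otp_count, revocation_count, cohort_size, protected_sender):
    """Decide what UC-07 does for one (tenantId, dstMsisdnCohortHash) pair."""
    rate = revocation_rate(otp_count, revocation_count)
    if not (rate > 0.05 and cohort_size >= 100):
        return "NO_ACTION"
    if protected_sender:
        # Bank/Gov sender-ID: flag only, with an audit-log entry.
        return "FLAG_ONLY"
    # fraud.detected.otp_harvesting.v1 with suggestedAction = SUSPEND_SENDER_ID;
    # the suspension itself is still executed only by NOC.
    return "PUBLISH_SUSPEND_SUGGESTION"
```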
UC-08: OtpGrindingStreamingDetector (real-time)
Trigger: NATS JetStream consumer on sms.events.status.v1 filter messageType=OTP. Streaming aggregator over Redis sorted sets.
Steps:
- For each outbound OTP-class message: ZADD fraud:otp:dst:{msisdnHash}:60s {ts} {messageId}; trim with ZREMRANGEBYSCORE for entries older than 60s.
- ZCOUNT returns rolling-60s OTP count to that MSISDN.
- If count > 10 → publish fraud.detected.otp_grinding.v1 with { dstMsisdn (hashed), srcTenants[], srcSenderIds[], windowStart, windowEnd } and SETEX fraud:throttle:dst:{msisdnHash} 21600 "1" (6h throttle handle).
- compliance-engine consumes the event and applies its own per-MSISDN OTP throttle (1/60s for 6h); fraud-intel-service does not enforce, it observes and recommends.
Latency: Detection-to-emission within 5 seconds of the threshold-crossing message.
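The rolling-60s counter can be illustrated without Redis by keeping one member-to-timestamp map per destination, which mirrors the sorted-set semantics (unique member, numeric score). `OtpWindowCounter` is a hypothetical in-memory stand-in, not the streaming aggregator itself.

```python
WINDOW_SECONDS = 60
THRESHOLD = 10


class OtpWindowCounter:
    """In-memory analogue of the ZADD / ZREMRANGEBYSCORE / ZCOUNT sequence."""

    def __init__(self):
        self._sets = {}  # msisdn_hash -> {message_id: ts}

    def record(self, msisdn_hash, ts, message_id):
        members = self._sets.setdefault(msisdn_hash, {})
        members[message_id] = ts  # ZADD: member with score ts
        cutoff = ts - WINDOW_SECONDS
        expired = [m for m, t in members.items() if t < cutoff]
        for m in expired:         # ZREMRANGEBYSCORE: drop entries older than 60s
            del members[m]
        count = len(members)      # ZCOUNT over what remains in the window
        return count, count > THRESHOLD
```

A caller would publish fraud.detected.otp_grinding.v1 when the second return value is true. Using the message timestamp rather than wall-clock time for trimming is a choice made for this sketch so that it stays deterministic.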
UC-09: FeedExport (daily MISP/STIX export)
Trigger: Cron 0 4 * * * Asia/Kabul (UTC+04:30 → 23:30 UTC prior day).
Steps:
- Select FraudCase rows with status = 'CONFIRMED' AND confirmed_at > now() - 24h.
- Build a MISP 2.4 event with attributes per case category:
  - MSISDN → phone-number attribute
  - MSISDN_BLOCK → custom msisdn-block attribute (per ATRA convention)
  - SENDER_ID → text with comment category=sender-id
  - PEER_ASN → AS attribute
  - TEMPLATE_HASH → sha256 attribute with comment category=template
- Build a STIX 2.1 bundle with indicator SDOs whose pattern follows [mobile:phone-number = '+937…'] syntax.
- Sign both files with platform HSM via PKCS#11 (pkcs11-tool --sign); produce .misp.json.sig and .stix.json.sig.
- Upload to MinIO s3://fraud-feed-out/{yyyymmdd}.{misp|stix}.json.sig.
- Mirror to regulator SFTP within 5 minutes (sftp://atra-mirror.gov.af/incoming/).
- Emit fraud.feed.exported.v1 with SHA-256, signature, and 24h-presigned URLs.
- Zero-diff days → emit fraud.feed.heartbeat.v1 (absence-of-evidence signal so consumers can detect a silent failure).
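The category-to-attribute mapping in the MISP step lends itself to a lookup table. The sketch below builds only a subset of an attribute (type, value, optional comment); `to_misp_attribute` is a hypothetical helper and the output is not a complete MISP attribute object.

```python
# Case category -> (MISP attribute type, optional comment), per the list above.
MISP_ATTRIBUTE_TYPES = {
    "MSISDN": ("phone-number", None),
    "MSISDN_BLOCK": ("msisdn-block", None),  # custom type, per ATRA convention
    "SENDER_ID": ("text", "category=sender-id"),
    "PEER_ASN": ("AS", None),
    "TEMPLATE_HASH": ("sha256", "category=template"),
}


def to_misp_attribute(category, value):
    """Build the type/value/comment part of a MISP attribute for one indicator."""
    attr_type, comment = MISP_ATTRIBUTE_TYPES[category]  # KeyError on unknown category
    attribute = {"type": attr_type, "value": value}
    if comment is not None:
        attribute["comment"] = comment
    return attribute
```

Letting an unknown category raise `KeyError` is deliberate in this sketch: an unmapped category would otherwise be exported silently with the wrong type.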
UC-10: FeedImport (push from regulator/peer)
Trigger: POST /v1/internal/fraud/feed/import from regulator-portal-service or peer-MNO over mTLS.
Steps:
- Verify body signature against FraudFeed.publicKeyRef from Vault. Failure → fraud.alert.feed.signature.invalid.v1 (PagerDuty CRITICAL).
- Parse MISP 2.4 or STIX 2.1.
- UPSERT into fraud.feed_indicators keyed by (feedId, sourceUuid) (idempotent).
- For new MSISDN indicators → invalidate fraud:imported:msisdn:* Redis bloom-filter; rebuild on next read.
- Emit fraud.feed.imported.v1 with counts: { added, updated, expired }.
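Idempotency of the UPSERT step follows from the (feedId, sourceUuid) key, as the sketch below illustrates with a dictionary in place of fraud.feed_indicators. `upsert_indicators` is a hypothetical helper, and the `expired` count from the emitted event is omitted because the steps above do not describe how expiry is determined.

```python
def upsert_indicators(store, feed_id, indicators):
    """Upsert parsed indicators keyed by (feedId, sourceUuid); return change counts."""
    counts = {"added": 0, "updated": 0}
    for indicator in indicators:
        key = (feed_id, indicator["sourceUuid"])
        existing = store.get(key)
        if existing is None:
            counts["added"] += 1
        elif existing != indicator:
            counts["updated"] += 1
        # Re-importing an identical indicator changes nothing and counts nothing.
        store[key] = indicator
    return counts
```

Replaying the same feed body therefore yields zero added and zero updated rows, which is the property a retried POST relies on.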
UC-11: ModelTrainingPipeline (Airflow DAG)
Trigger: Nightly 02:00 Asia/Kabul + manual via POST /v1/admin/fraud/training-runs.
Pipeline stages:
- Snapshot training set. Pull last 90d of fraud.signals ∪ fraud.detections ∪ fraud.case_decisions (HITL labels). Compute trainingSetHash.
- Feature engineering. Reuse the same feature transformers as the inference path (single source of truth in fraud-features Python package).
- Train. XGBoost / GraphSAGE / iForest depending on category.
- Evaluate. Hold-out test set (20%). Compute AUC, F1, precision, recall, FPR-at-threshold, Brier score (calibration), per-tenant fairness metrics.
- Adversarial audit. Run a fixed corpus of known-evasion attempts (paraphrased OTP-pumping templates, MSISDN-block sweep simulations). Recall on this corpus must be ≥ 0.80.
- Model card. Generate YAML model card per Hugging Face / Model Card Toolkit format. Persist to MinIO.
- Register. POST /v1/admin/fraud/models with artifact URI, hashes, evaluation metrics. Status REGISTERED.
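Two of the quantities named in the evaluation and adversarial-audit stages are simple enough to show directly. The sketch assumes every sample in the adversarial corpus is a true evasion attempt, so recall reduces to the detected fraction; the function names are hypothetical and not part of the fraud-features package.

```python
def brier_score(probabilities, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


def adversarial_recall(detected_flags):
    """Recall on a corpus where every sample is a positive (known evasion)."""
    return sum(1 for detected in detected_flags if detected) / len(detected_flags)


def passes_adversarial_audit(detected_flags, minimum=0.80):
    """Gate from the adversarial audit stage: recall must be at least 0.80."""
    return adversarial_recall(detected_flags) >= minimum
```

A lower Brier score indicates better calibration: a model that always predicts 0.5 on a balanced set scores 0.25, and a perfect one scores 0.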
UC-12: ModelShadowAndPromote
Trigger: REST POST /v1/admin/fraud/models/{id}/shadow then …/promote.
Shadow steps:
- Worker loads shadow model alongside active.
- Each feature window is scored by both models; shadow predictions persist to fraud_features.shadow_predictions. No fraud.detected.* events are emitted by the shadow model.
- Shadow runs ≥ 24h.
Promote steps:
- Verify shadow has run ≥ 24h (else 412 SHADOW_EVAL_INSUFFICIENT).
- Compare shadow vs active over the shadow window: AUC delta ≥ 0; Brier ≤ active × 1.05; per-tenant fairness Δ ≤ 0.10. Else 412.
- Verify artifact SHA-256 matches registered hash. Else 422 MODEL_ARTIFACT_INTEGRITY_FAIL.
- Atomic UPDATE: previous active → RETIRED, new → ACTIVE.
- Emit fraud.model.promoted.v1.
- Workers hot-reload on next batch boundary (no cold restart).
Rollback:
POST /v1/admin/fraud/models/{id}/rollback → atomic re-activation of previous version. < 60 seconds. Emits fraud.model.rolled_back.v1.
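The promote-time checks can be read as an ordered gate. The sketch below follows the order of the promote steps; `EvalSummary`, `GateResult`, and `promotion_gate` are hypothetical names, and the error code for a failed metric comparison is left empty because the steps give only the 412 status for that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalSummary:
    auc: float
    brier: float


@dataclass(frozen=True)
class GateResult:
    ok: bool
    http_status: Optional[int] = None
    code: Optional[str] = None


def promotion_gate(shadow_hours, shadow, active, fairness_delta,
                   artifact_sha256, registered_sha256):
    """Run the promote checks in order and stop at the first failure."""
    if shadow_hours < 24:
        return GateResult(False, 412, "SHADOW_EVAL_INSUFFICIENT")
    metrics_ok = (
        shadow.auc - active.auc >= 0
        and shadow.brier <= active.brier * 1.05
        and fairness_delta <= 0.10
    )
    if not metrics_ok:
        return GateResult(False, 412, None)  # no error name given in the steps
    if artifact_sha256 != registered_sha256:
        return GateResult(False, 422, "MODEL_ARTIFACT_INTEGRITY_FAIL")
    return GateResult(True)
```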
UC-13: AlertEmission (downstream consumer hooks)
When fraud.detected.* events are published, the following automatic consumers act (see EVENT_SCHEMAS §4):
| Detection | Consumer | Action |
|---|---|---|
| fraud.detected.simbox.v1 (conf ≥ 0.85) | sms-firewall-service | Add MSISDN block to firewall.peer_quarantine 7d TTL |
| fraud.detected.ait_ring.v1 | sms-firewall-service | Add cohort cluster MSISDNs to dynamic blocklist |
| fraud.detected.otp_harvesting.v1 (conf ≥ 0.85) | sender-id-registry-service (after NOC dispatch) | Suspend sender-ID |
| fraud.detected.otp_grinding.v1 | compliance-engine | Per-MSISDN OTP throttle (1/60s × 6h) |
| fraud.detected.greyroute.v1 (conf ≥ 0.85) | sms-firewall-service | Quarantine peer-ASN binding |
| fraud.tenant_score.updated.v1 | compliance-engine | Update compliance.tenant_tiers input |
UC-14: FeedbackLoop (HITL → retrain)
Trigger: POST /v1/admin/fraud/cases/{caseId}/decide from a Trust & Safety analyst.
Steps:
- Validate decision (CONFIRM_FRAUD / DISMISS / REFINE_FEATURES); reason ≥ 20 chars.
- Enforce separation-of-duties (opened_by != decided_by).
- Persist FeedbackDecision row.
- Update fraud.cases.status accordingly.
- Emit fraud.case.decided.v1.
- If executeAction = true AND decision = CONFIRM_FRAUD:
  - Look up the case's suggestedAction.
  - Dispatch the action via NATS event (firewall.blocklist.add.v1, sender-id.suspend.v1, etc.).
  - Audit-log every dispatch with caller's JWT subject and trace ID.
- The decision is read by the next nightly ModelTrainingPipeline run as a labelled training row.
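The validation and dispatch conditions in UC-14 could be checked along these lines. `validate_decision` and `should_dispatch` are hypothetical helpers, and the error strings are illustrative rather than API error codes.

```python
VALID_DECISIONS = {"CONFIRM_FRAUD", "DISMISS", "REFINE_FEATURES"}
MIN_REASON_CHARS = 20


def validate_decision(decision, reason, opened_by, decided_by):
    """Return a list of validation errors; an empty list means the decision is accepted."""
    errors = []
    if decision not in VALID_DECISIONS:
        errors.append("unknown decision")
    if len(reason) < MIN_REASON_CHARS:
        errors.append("reason must be at least 20 characters")
    if opened_by == decided_by:
        errors.append("separation of duties: opener cannot decide the case")
    return errors


def should_dispatch(decision, execute_action):
    """The suggested action is dispatched only for an explicit, confirmed fraud decision."""
    return execute_action and decision == "CONFIRM_FRAUD"
```

Collecting all errors instead of failing on the first one gives the analyst a complete picture in a single response.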
UC-15: TenantScoreRecompute (hourly cron)
Trigger: Cron 0 * * * *. Distributed lock fraud:lock:scoring:hourly.
Steps:
- For each tenant with activity in last 30d:
  - Apply formula in DOMAIN_MODEL §5.
  - Read previous tier from fraud.entity_scores.
  - If tier changed → emit fraud.tenant_score.updated.v1 with { previousTier, newTier, score, contributingFactors[], modelVersion }.
  - UPSERT fraud.entity_scores.
  - Refresh Redis fraud:score:tenant:{tenantId} SETEX 900.
- Emit fraud_score_recompute_duration_seconds and fraud_score_recompute_failed_total.
2. Pipeline Latency Budgets
| Pipeline | Window | Budget | SLA |
|---|---|---|---|
| Score gRPC | per-call | n/a | P95 ≤ 50 ms |
| Stream ingestion → ClickHouse | per-event | 30 s | P95 ≤ 30 s |
| AIT 5-min pipeline | per-window | 90 s | < 5 min wall |
| AIT cohort job | hourly | 5 min | < 10 min wall |
| SIM-box detector | 30-min | 60 s | < 2 min wall |
| Grey-route detector | hourly | 90 s | < 5 min wall |
| OTP-harvest detector | 30-min | 60 s | < 2 min wall |
| OTP-grinding streaming | real-time | 5 s | detection-to-event |
| Hourly tenant-score recompute | hourly | 5 min | < 10 min wall |
| Nightly model training | nightly | 6 h | < 12 h |
3. Hot-Path Caching Strategy
The Score gRPC is the only synchronous surface and is heavily cached:
| Layer | Key | TTL | Hit target |
|---|---|---|---|
| Redis L1 | fraud:score:{scope}:{id} | 15 min | ≥ 95% steady-state |
| Postgres L2 (mat. view) | fraud.entity_scores | refreshed hourly | residual misses |
| Cold path (PROBATION) | n/a | n/a | < 1% |
Cache invalidation is primarily time-based (15-min TTL), with one event-driven exception: on a fraud.tenant_score.updated.v1 event we additionally DEL fraud:score:tenant:{tenantId} so that the next Score call reads the fresh value.