Skip to main content

Fraud Intelligence Service — Application Logic

Version: 1.0 Status: Draft Owner: Trust and Safety Last Updated: 2026-04-21 Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION

This document specifies the use cases that compose the bounded context. The service has one synchronous hot path (Score gRPC, sub-100 ms) and eight asynchronous pipelines (ingestion, feature engineering, four ML detection pipelines, MISP feed I/O, model training and promotion, scoring recompute). Latency budgets and fail-modes are stated per use case.


1. Use Cases

UC-01: Score (gRPC hot path)

Trigger: compliance-engine evaluates an outbound message and calls FraudIntelService.Score(scope, id). Also called by routing-engine for high-risk sender decisions.

Input: ScoreRequest { scope: TENANT|SENDER_ID|MSISDN, id, traceId }

Output: ScoreResponse { score (0..1), tier (SAFE|WATCH|RISKY|HIGH_RISK|PROBATION), contributingFactors[], modelId, modelVersion, computedAt, traceId }

SLA: P50 ≤ 10 ms (cache hit), P95 ≤ 50 ms, P99 ≤ 100 ms.

Steps:

  1. Input validation. Reject with INVALID_ARGUMENT if scope unknown, id malformed (E.164 for MSISDN; UUID for TENANT; alpha-numeric ≤ 11 chars for SENDER_ID).
  2. mTLS allowlist check. Caller's SPIFFE ID must be in [spiffe://ghasi/compliance-engine, spiffe://ghasi/routing-engine, spiffe://ghasi/sender-id-registry]. Otherwise PERMISSION_DENIED.
  3. L1 Redis lookup. GET fraud:score:{scope}:{id} with TTL 15 min. If hit → return immediately. P99 ≤ 10 ms.
  4. L2 Postgres materialised view lookup. SELECT * FROM fraud.entity_scores WHERE scope = $1 AND subject_id = $2. If found → populate Redis SETEX 900 and return. P95 ≤ 30 ms.
  5. Cold path (no recent computation). If the entity has had no signals in last 30 days → return tier = PROBATION, score = 0.5, contributingFactors = []. Otherwise trigger an async refresh (publish to fraud:score:refresh:queue) and return the most recent value with staleSeconds populated.
  6. Emit Score access metric (fraud_score_grpc_total{scope, tier, cache}) and Prometheus histogram fraud_score_grpc_duration_seconds.

Error codes:

gRPC statusConditionCaller behaviour
OKScore returnedUse as input
INVALID_ARGUMENTBad inputCaller logs and skips fraud factor
PERMISSION_DENIEDCaller SPIFFE not allowlistedCaller logs SOC alert
UNAVAILABLE, DEADLINE_EXCEEDEDService down or slowCaller treats as PROBATION (fail-closed-with-default)
INTERNALInternal errorSame as UNAVAILABLE

Why fail-closed-with-default, not strict fail-closed: A fraud-intel outage must not cascade into a compliance freeze. Fraud is informational; the conservative default PROBATION lets compliance proceed with neutral assumption until fraud-intel recovers.


UC-02: BulkScore (gRPC batch)

Trigger: compliance-engine retroactive scan or NOC dashboard refresh.

Input: BulkScoreRequest { entries: [{scope, id}], traceId } — max 1,000 entries per call.

Output: Stream of ScoreResponse per entry (server-streaming gRPC).

SLA: Throughput ≥ 5,000 scores/sec per pod via Redis pipelining.

Steps:

  1. Validate batch size ≤ 1,000 (else RESOURCE_EXHAUSTED).
  2. Pipelined MGET against fraud:score:{scope}:{id} for all entries.
  3. For misses: batch-SELECT against fraud.entity_scores (1 round-trip).
  4. For still-missing: emit PROBATION and queue refresh.
  5. Stream responses in input order; backpressure via gRPC flow control.

UC-03: AitGraphAnalysis (5-min batch pipeline)

Trigger: Cron */5 * * * * on fraud-intel-worker. Distributed Redis lock fraud:lock:ait:5m.

Steps:

  1. Window selection. Compute windowEnd = floor(now, 5m), windowStart = windowEnd - 5m.
  2. Feature engineering.
    INSERT INTO fraud_features.ait_window_features
    SELECT windowStart, tenant_id, mno_id AS dst_mno, sender_id,
    count(*) AS submit_count,
    sumIf(1, dlr_status = 'DELIVERED') AS dlr_delivered_count,
    sumIf(1, dlr_status = 'FAILED') AS dlr_failed_count,
    dlr_delivered_count / nullIf(submit_count, 0) AS dlr_success_rate,
    uniqExact(dst_msisdn) AS unique_dst_msisdns,
    avg(segments) AS mean_segments_per_msg,
    entropy(substring(dst_msisdn, 1, 5)) AS entropy_of_dst_prefix,
    uniqExact(sender_id) AS unique_sender_ids,
    sum(payload_hash_repeat_count) /
    nullIf(submit_count, 0) AS repeated_body_ratio,
    uniqExact(peer_asn) AS peer_asn_diversity
    FROM fraud_features.events
    WHERE event_ts >= windowStart AND event_ts < windowEnd
    GROUP BY tenant_id, mno_id, sender_id;
  3. Cohort feature join. Left-join the latest fraud_features.ait_cohorts.cohort_anomaly_score produced by UC-04.
  4. Imputation. Apply fraud.feature_imputation_policy to nulls (e.g. unique_sender_ids → 1, entropy_of_dst_prefix → mean of last 24h).
  5. Inference. Load active model artifact (cached on disk + RAM); run XGBoost batch predict via Triton (gRPC ModelInfer). Persist results to fraud_features.ait_predictions with shap_top3 (computed via TreeSHAP).
  6. Emit detections.
    • score >= 0.85 → publish fraud.detected.ait.v1 via outbox + insert fraud.detections row.
    • 0.6 ≤ score < 0.85 → insert fraud.cases (status PENDING_REVIEW) + publish fraud.case.opened.v1.
    • score < 0.6 → log only (ClickHouse retention 30d).
  7. Metric emission. fraud_pipeline_run_duration_seconds{pipeline="ait"}, fraud_pipeline_predictions_total{pipeline, bucket}.

Budget: 90 s for 1,000 active tenants × 5 MNOs.

Failure modes:

  • Lock contention → second pod backs off; pipeline run skipped (next tick covers it).
  • Triton unreachable → fall back to cached rule-based pattern matcher (FraudPattern rows for AIT category) for this run; fraud.alert.model.unavailable.v1 emitted.

UC-04: AitCohortJob (1-h rolling)

Trigger: Cron 0 * * * * (hourly).

Steps:

  1. Cohort hash computation.
    INSERT INTO fraud_features.ait_cohorts (cohort_hash, dst_msisdn_count, contributing_tenants,
    contributing_sender_ids, first_seen_ts, last_seen_ts)
    WITH per_tenant AS (
    SELECT tenant_id, sender_id, dst_msisdn, min(event_ts) AS first_ts, max(event_ts) AS last_ts
    FROM fraud_features.events
    WHERE event_ts >= now() - INTERVAL 1 HOUR
    AND source_stream = 'SMS_STATUS'
    GROUP BY tenant_id, sender_id, dst_msisdn
    )
    SELECT
    cityHash64(arraySort(groupArray(dst_msisdn))) AS cohort_hash,
    uniqExact(dst_msisdn) AS dst_msisdn_count,
    uniqExact(tenant_id) AS contributing_tenants,
    uniqExact(sender_id) AS contributing_sender_ids,
    min(first_ts) AS first_seen_ts,
    max(last_ts) AS last_seen_ts
    FROM per_tenant
    GROUP BY cohort_hash
    HAVING dst_msisdn_count >= 5;
  2. Anomaly score. GraphSAGE inference over the bipartite (tenant, msisdn) graph; output cohort_anomaly_score ∈ [0,1] per cohort.
  3. Threshold check. dst_msisdn_count >= 100 AND contributing_tenants >= 5 AND cohort_anomaly_score >= 0.85 → publish fraud.detected.ait_ring.v1.
  4. Allowlist check. Cohorts present in fraud.cohort_exclusions (e.g. national emergency CBC fallback) are skipped with enforcementStatus = SUPPRESSED.

Budget: < 5 min for 1 M unique recipients in window. Worker requests CPU 4, memory 16 Gi.


UC-05: SimboxPatternDetector (30-min batch)

Trigger: Cron */30 * * * *.

Steps:

  1. Feature aggregation per /28 MSISDN block (16-MSISDN blocks):
    • msisdn_range_density: fraction of /28 block that submitted MO in window.
    • body_template_hash_concentration: top-1 template-hash share of all MO bodies.
    • hlr_mismatch_rate: fraction where claimed MNO ≠ HLR-resolved MNO.
    • imsi_unique_count: distinct IMSI across the block.
    • mno_bind_concentration: share of MO routed via single peer-ASN.
  2. Threshold predicate (rule-based fast-path). density > 0.6 AND template_concentration > 0.4 AND hlr_mismatch > 0.3 → flag.
  3. Isolation Forest scoring (ML). Score the same feature vector with the active iForest model; ensemble with rule-based predicate.
  4. Detection emission. Threshold + ensemble agree → fraud.detected.simbox.v1 with evidence: { msisdnBlock, features, sampleEventIds[] }.
  5. Network detection. If a 24h cluster of detections shares a peer-ASN → emit fraud.detected.simbox_network.v1 (handled by 24h cron in UC-06).

UC-06: GreyRouteDetector (24-h rolling, hourly)

Trigger: Cron 15 * * * *.

Steps:

  1. Per-peer features over rolling 24h: total_mt, mt_to_peered_mno, mt_to_non_peered_mno, mt_to_non_peered_ratio, hlr_mismatch_rate, dlr_success_rate_anomaly.
  2. Predicate: mt_to_non_peered_ratio > 0.3 AND total_mt > 1000 over 24h.
  3. Score with calibrated model. Confidence ≥ 0.85 → emit fraud.detected.greyroute.v1 with auto-quarantine recommendation. 0.6 ≤ confidence < 0.85 → open case.

UC-07: OtpHarvestDetector (6-h rolling join)

Trigger: Cron */30 * * * * over 6h window.

Steps:

  1. Identify outbound OTP-class messages (events.is_otp_likely = TRUE set by UC-12 OTP-keyword classifier).
  2. Join with consent.revoked.v1 events from same recipient cohort within ±1h.
  3. Compute (tenantId, dstMsisdnCohortHash) → otpCount, revocationCount, revocationRate.
  4. revocationRate > 0.05 AND cohort_size >= 100 → publish fraud.detected.otp_harvesting.v1 with suggestedAction = SUSPEND_SENDER_ID. Confidence ≥ 0.85 → open case for NOC; only NOC may execute the suspension via fraud.case.action_dispatched.v1.
  5. Bank/Gov sender-IDs (per sender-id-registry-service lookup) → downgrade to flag only with audit-log entry.

UC-08: OtpGrindingStreamingDetector (real-time)

Trigger: NATS JetStream consumer on sms.events.status.v1 filter messageType=OTP. Streaming aggregator over Redis sorted sets.

Steps:

  1. For each outbound OTP-class message: ZADD fraud:otp:dst:{msisdnHash}:60s {ts} {messageId}; trim with ZREMRANGEBYSCORE for entries older than 60s.
  2. ZCOUNT returns rolling-60s OTP count to that MSISDN.
  3. If count > 10 → publish fraud.detected.otp_grinding.v1 with { dstMsisdn (hashed), srcTenants[], srcSenderIds[], windowStart, windowEnd } and SETEX fraud:throttle:dst:{msisdnHash} 21600 "1" (6h throttle handle).
  4. compliance-engine consumes the event and applies its own per-MSISDN OTP throttle (1/60s for 6h) — fraud-intel-service does not enforce; it observes and recommends.

Latency: Detection-to-emission within 5 seconds of the threshold-crossing message.


UC-09: FeedExport (daily MISP/STIX export)

Trigger: Cron 0 4 * * * Asia/Kabul (UTC+04:30 → 23:30 UTC prior day).

Steps:

  1. Select FraudCase rows with status = 'CONFIRMED' AND confirmed_at > now() - 24h.
  2. Build a MISP 2.4 event with attributes per case category:
    • MSISDNphone-number attribute
    • MSISDN_BLOCK → custom msisdn-block attribute (per ATRA convention)
    • SENDER_IDtext with comment category=sender-id
    • PEER_ASNAS attribute
    • TEMPLATE_HASHsha256 attribute with comment category=template
  3. Build a STIX 2.1 bundle with indicator SDOs whose pattern follows [mobile:phone-number = '+937…'] syntax.
  4. Sign both files with platform HSM via PKCS#11 (pkcs11-tool --sign); produce .misp.json.sig and .stix.json.sig.
  5. Upload to MinIO s3://fraud-feed-out/{yyyymmdd}.{misp|stix}.json.sig.
  6. Mirror to regulator SFTP within 5 minutes (sftp://atra-mirror.gov.af/incoming/).
  7. Emit fraud.feed.exported.v1 with SHA-256, signature, and 24h-presigned URLs.
  8. Zero-diff days → emit fraud.feed.heartbeat.v1 (absence-of-evidence signal so consumers can detect a silent failure).

UC-10: FeedImport (push from regulator/peer)

Trigger: POST /v1/internal/fraud/feed/import from regulator-portal-service or peer-MNO over mTLS.

Steps:

  1. Verify body signature against FraudFeed.publicKeyRef from Vault. Failure → fraud.alert.feed.signature.invalid.v1 (PagerDuty CRITICAL).
  2. Parse MISP 2.4 or STIX 2.1.
  3. UPSERT into fraud.feed_indicators keyed by (feedId, sourceUuid) — idempotent.
  4. For new MSISDN indicators → invalidate fraud:imported:msisdn:* Redis bloom-filter; rebuild on next read.
  5. Emit fraud.feed.imported.v1 with counts: { added, updated, expired }.

UC-11: ModelTrainingPipeline (Airflow DAG)

Trigger: Nightly 02:00 Asia/Kabul + manual via POST /v1/admin/fraud/training-runs.

Pipeline stages:

  1. Snapshot training set. Pull last 90d of fraud.signalsfraud.detectionsfraud.case_decisions (HITL labels). Compute trainingSetHash.
  2. Feature engineering. Reuse the same feature transformers as the inference path (single source of truth in fraud-features Python package).
  3. Train. XGBoost / GraphSAGE / iForest depending on category.
  4. Evaluate. Hold-out test set (20%). Compute AUC, F1, precision, recall, FPR-at-threshold, Brier score (calibration), per-tenant fairness metrics.
  5. Adversarial audit. Run a fixed corpus of known-evasion attempts (paraphrased OTP-pumping templates, MSISDN-block sweep simulations). Recall on this corpus must be ≥ 0.80.
  6. Model card. Generate YAML model card per Hugging Face / Model Card Toolkit format. Persist to MinIO.
  7. Register. POST /v1/admin/fraud/models with artifact URI, hashes, evaluation metrics. Status REGISTERED.

UC-12: ModelShadowAndPromote

Trigger: REST POST /v1/admin/fraud/models/{id}/shadow then …/promote.

Shadow steps:

  1. Worker loads shadow model alongside active.
  2. Each feature window is scored by both models; shadow predictions persist to fraud_features.shadow_predictions. No fraud.detected.* events are emitted by the shadow model.
  3. Shadow runs ≥ 24h.

Promote steps:

  1. Verify shadow has run ≥ 24h (else 412 SHADOW_EVAL_INSUFFICIENT).
  2. Compare shadow vs active over the shadow window: AUC delta ≥ 0; Brier ≤ active × 1.05; per-tenant fairness Δ ≤ 0.10. Else 412.
  3. Verify artifact SHA-256 matches registered hash. Else 422 MODEL_ARTIFACT_INTEGRITY_FAIL.
  4. Atomic UPDATE: previous active → RETIRED, new → ACTIVE.
  5. Emit fraud.model.promoted.v1.
  6. Workers hot-reload on next batch boundary (no cold restart).

Rollback:

  • POST /v1/admin/fraud/models/{id}/rollback → atomic re-activation of previous version. < 60 seconds. Emits fraud.model.rolled_back.v1.

UC-13: AlertEmission (downstream consumer hooks)

When fraud.detected.* events are published, the following automatic consumers act (see EVENT_SCHEMAS §4):

DetectionConsumerAction
fraud.detected.simbox.v1 (conf ≥ 0.85)sms-firewall-serviceAdd MSISDN block to firewall.peer_quarantine 7d TTL
fraud.detected.ait_ring.v1sms-firewall-serviceAdd cohort cluster MSISDNs to dynamic blocklist
fraud.detected.otp_harvesting.v1 (conf ≥ 0.85)sender-id-registry-service (after NOC dispatch)Suspend sender-ID
fraud.detected.otp_grinding.v1compliance-enginePer-MSISDN OTP throttle (1/60s × 6h)
fraud.detected.greyroute.v1 (conf ≥ 0.85)sms-firewall-serviceQuarantine peer-ASN binding
fraud.tenant_score.updated.v1compliance-engineUpdate compliance.tenant_tiers input

UC-14: FeedbackLoop (HITL → retrain)

Trigger: POST /v1/admin/fraud/cases/{caseId}/decide from a Trust & Safety analyst.

Steps:

  1. Validate decision (CONFIRM_FRAUD / DISMISS / REFINE_FEATURES); reason ≥ 20 chars.
  2. Enforce separation-of-duties (opened_by != decided_by).
  3. Persist FeedbackDecision row.
  4. Update fraud.cases.status accordingly.
  5. Emit fraud.case.decided.v1.
  6. If executeAction = true AND decision = CONFIRM_FRAUD:
    • Look up the case's suggestedAction.
    • Dispatch the action via NATS event (firewall.blocklist.add.v1, sender-id.suspend.v1, etc.).
    • Audit-log every dispatch with caller's JWT subject and trace ID.
  7. The decision is read by the next nightly ModelTrainingPipeline run as a labelled training row.

UC-15: TenantScoreRecompute (hourly cron)

Trigger: Cron 0 * * * *. Distributed lock fraud:lock:scoring:hourly.

Steps:

  1. For each tenant with activity in last 30d:
    • Apply formula in DOMAIN_MODEL §5.
    • Read previous tier from fraud.entity_scores.
    • If tier changed → emit fraud.tenant_score.updated.v1 with { previousTier, newTier, score, contributingFactors[], modelVersion }.
    • UPSERT fraud.entity_scores.
    • Refresh Redis fraud:score:tenant:{tenantId} SETEX 900.
  2. Emit fraud_score_recompute_duration_seconds and fraud_score_recompute_failed_total.

2. Pipeline Latency Budgets

PipelineWindowBudgetSLA
Score gRPCper-calln/aP95 ≤ 50 ms
Stream ingestion → ClickHouseper-event30 sP95 ≤ 30 s
AIT 5-min pipelineper-window90 s< 5 min wall
AIT cohort jobhourly5 min< 10 min wall
SIM-box detector30-min60 s< 2 min wall
Grey-route detectorhourly90 s< 5 min wall
OTP-harvest detector30-min60 s< 2 min wall
OTP-grinding streamingreal-time5 sdetection-to-event
Hourly tenant-score recomputehourly5 min< 10 min wall
Nightly model trainingnightly6 h< 12 h

3. Hot-Path Caching Strategy

The Score gRPC is the only synchronous surface and is heavily cached:

LayerKeyTTLHit target
Redis L1fraud:score:{scope}:{id}15 min≥ 95% steady-state
Postgres L2 (mat. view)fraud.entity_scoresrefreshed hourlyresidual misses
Cold path (PROBATION)n/an/a< 1%

Cache invalidation is time-based only (15-min TTL). On a fraud.tenant_score.updated.v1 event we additionally DEL fraud:score:tenant:{tenantId} to force the next Score call to read fresh.