Fraud Intelligence Service — Application Logic

Version: 1.0 Status: Draft Owner: Trust and Safety Last Updated: 2026-04-21 Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · AI_INTEGRATION

This document specifies the use cases that compose the bounded context. The service has one synchronous hot path (Score gRPC, sub-100 ms) and eight asynchronous pipelines (ingestion, feature engineering, four ML detection pipelines, MISP feed I/O, model training and promotion, scoring recompute). Latency budgets and fail-modes are stated per use case.

1. Use Cases

UC-01: Score (gRPC hot path)

Trigger: compliance-engine evaluates an outbound message and calls FraudIntelService.Score(scope, id). Also called by routing-engine for high-risk sender decisions.

Input: ScoreRequest { scope: TENANT|SENDER_ID|MSISDN, id, traceId }

Output: ScoreResponse { score (0..1), tier (SAFE|WATCH|RISKY|HIGH_RISK|PROBATION), contributingFactors[], modelId, modelVersion, computedAt, traceId }

SLA: P50 ≤ 10 ms (cache hit), P95 ≤ 50 ms, P99 ≤ 100 ms.

Steps:

Input validation. Reject with INVALID_ARGUMENT if scope unknown, id malformed (E.164 for MSISDN; UUID for TENANT; alpha-numeric ≤ 11 chars for SENDER_ID).
mTLS allowlist check. Caller's SPIFFE ID must be in [spiffe://ghasi/compliance-engine, spiffe://ghasi/routing-engine, spiffe://ghasi/sender-id-registry]. Otherwise PERMISSION_DENIED.
L1 Redis lookup. GET fraud:score:{scope}:{id} with TTL 15 min. If hit → return immediately. P99 ≤ 10 ms.
L2 Postgres materialised view lookup. SELECT * FROM fraud.entity_scores WHERE scope = $1 AND subject_id = $2. If found → populate Redis SETEX 900 and return. P95 ≤ 30 ms.
Cold path (no recent computation). If the entity has had no signals in last 30 days → return tier = PROBATION, score = 0.5, contributingFactors = []. Otherwise trigger an async refresh (publish to fraud:score:refresh:queue) and return the most recent value with staleSeconds populated.
Emit Score access metric (fraud_score_grpc_total{scope, tier, cache}) and Prometheus histogram fraud_score_grpc_duration_seconds.

Error codes:

gRPC status	Condition	Caller behaviour
`OK`	Score returned	Use as input
`INVALID_ARGUMENT`	Bad input	Caller logs and skips fraud factor
`PERMISSION_DENIED`	Caller SPIFFE not allowlisted	Caller logs SOC alert
`UNAVAILABLE`, `DEADLINE_EXCEEDED`	Service down or slow	Caller treats as `PROBATION` (fail-closed-with-default)
`INTERNAL`	Internal error	Same as UNAVAILABLE

Why fail-closed-with-default, not strict fail-closed: A fraud-intel outage must not cascade into a compliance freeze. Fraud is informational; the conservative default PROBATION lets compliance proceed with neutral assumption until fraud-intel recovers.

UC-02: BulkScore (gRPC batch)

Trigger: compliance-engine retroactive scan or NOC dashboard refresh.

Input: BulkScoreRequest { entries: [{scope, id}], traceId } — max 1,000 entries per call.

Output: Stream of ScoreResponse per entry (server-streaming gRPC).

SLA: Throughput ≥ 5,000 scores/sec per pod via Redis pipelining.

Steps:

Validate batch size ≤ 1,000 (else RESOURCE_EXHAUSTED).
Pipelined MGET against fraud:score:{scope}:{id} for all entries.
For misses: batch-SELECT against fraud.entity_scores (1 round-trip).
For still-missing: emit PROBATION and queue refresh.
Stream responses in input order; backpressure via gRPC flow control.

UC-03: AitGraphAnalysis (5-min batch pipeline)

Trigger: Cron */5 * * * * on fraud-intel-worker. Distributed Redis lock fraud:lock:ait:5m.

Steps:

Window selection. Compute windowEnd = floor(now, 5m), windowStart = windowEnd - 5m.

Feature engineering.

INSERT INTO fraud_features.ait_window_features
SELECT  windowStart, tenant_id, mno_id AS dst_mno, sender_id,
        count(*)                                              AS submit_count,
        sumIf(1, dlr_status = 'DELIVERED')                    AS dlr_delivered_count,
        sumIf(1, dlr_status = 'FAILED')                       AS dlr_failed_count,
        dlr_delivered_count / nullIf(submit_count, 0)         AS dlr_success_rate,
        uniqExact(dst_msisdn)                                 AS unique_dst_msisdns,
        avg(segments)                                         AS mean_segments_per_msg,
        entropy(substring(dst_msisdn, 1, 5))                  AS entropy_of_dst_prefix,
        uniqExact(sender_id)                                  AS unique_sender_ids,
        sum(payload_hash_repeat_count) /
          nullIf(submit_count, 0)                             AS repeated_body_ratio,
        uniqExact(peer_asn)                                   AS peer_asn_diversity
FROM fraud_features.events
WHERE event_ts >= windowStart AND event_ts < windowEnd
GROUP BY tenant_id, mno_id, sender_id;

Cohort feature join. Left-join the latest fraud_features.ait_cohorts.cohort_anomaly_score produced by UC-04.
Imputation. Apply fraud.feature_imputation_policy to nulls (e.g. unique_sender_ids → 1, entropy_of_dst_prefix → mean of last 24h).
Inference. Load active model artifact (cached on disk + RAM); run XGBoost batch predict via Triton (gRPC ModelInfer). Persist results to fraud_features.ait_predictions with shap_top3 (computed via TreeSHAP).
Emit detections.
- score >= 0.85 → publish fraud.detected.ait.v1 via outbox + insert fraud.detections row.
- 0.6 ≤ score < 0.85 → insert fraud.cases (status PENDING_REVIEW) + publish fraud.case.opened.v1.
- score < 0.6 → log only (ClickHouse retention 30d).
Metric emission. fraud_pipeline_run_duration_seconds{pipeline="ait"}, fraud_pipeline_predictions_total{pipeline, bucket}.

Budget: 90 s for 1,000 active tenants × 5 MNOs.

Failure modes:

Lock contention → second pod backs off; pipeline run skipped (next tick covers it).
Triton unreachable → fall back to cached rule-based pattern matcher (FraudPattern rows for AIT category) for this run; fraud.alert.model.unavailable.v1 emitted.

UC-04: AitCohortJob (1-h rolling)

Trigger: Cron 0 * * * * (hourly).

Steps:

Cohort hash computation.

INSERT INTO fraud_features.ait_cohorts (cohort_hash, dst_msisdn_count, contributing_tenants,
                                         contributing_sender_ids, first_seen_ts, last_seen_ts)
WITH per_tenant AS (
  SELECT tenant_id, sender_id, dst_msisdn, min(event_ts) AS first_ts, max(event_ts) AS last_ts
  FROM fraud_features.events
  WHERE event_ts >= now() - INTERVAL 1 HOUR
    AND source_stream = 'SMS_STATUS'
  GROUP BY tenant_id, sender_id, dst_msisdn
)
SELECT
  cityHash64(arraySort(groupArray(dst_msisdn)))    AS cohort_hash,
  uniqExact(dst_msisdn)                             AS dst_msisdn_count,
  uniqExact(tenant_id)                              AS contributing_tenants,
  uniqExact(sender_id)                              AS contributing_sender_ids,
  min(first_ts)                                     AS first_seen_ts,
  max(last_ts)                                      AS last_seen_ts
FROM per_tenant
GROUP BY cohort_hash
HAVING dst_msisdn_count >= 5;

Anomaly score. GraphSAGE inference over the bipartite (tenant, msisdn) graph; output cohort_anomaly_score ∈ [0,1] per cohort.
Threshold check. dst_msisdn_count >= 100 AND contributing_tenants >= 5 AND cohort_anomaly_score >= 0.85 → publish fraud.detected.ait_ring.v1.
Allowlist check. Cohorts present in fraud.cohort_exclusions (e.g. national emergency CBC fallback) are skipped with enforcementStatus = SUPPRESSED.

Budget: < 5 min for 1 M unique recipients in window. Worker requests CPU 4, memory 16 Gi.

UC-05: SimboxPatternDetector (30-min batch)

Trigger: Cron */30 * * * *.

Steps:

Feature aggregation per /28 MSISDN block (16-MSISDN blocks):
- msisdn_range_density: fraction of /28 block that submitted MO in window.
- body_template_hash_concentration: top-1 template-hash share of all MO bodies.
- hlr_mismatch_rate: fraction where claimed MNO ≠ HLR-resolved MNO.
- imsi_unique_count: distinct IMSI across the block.
- mno_bind_concentration: share of MO routed via single peer-ASN.
Threshold predicate (rule-based fast-path). density > 0.6 AND template_concentration > 0.4 AND hlr_mismatch > 0.3 → flag.
Isolation Forest scoring (ML). Score the same feature vector with the active iForest model; ensemble with rule-based predicate.
Detection emission. Threshold + ensemble agree → fraud.detected.simbox.v1 with evidence: { msisdnBlock, features, sampleEventIds[] }.
Network detection. If a 24h cluster of detections shares a peer-ASN → emit fraud.detected.simbox_network.v1 (handled by 24h cron in UC-06).

UC-06: GreyRouteDetector (24-h rolling, hourly)

Trigger: Cron 15 * * * *.

Steps:

Per-peer features over rolling 24h: total_mt, mt_to_peered_mno, mt_to_non_peered_mno, mt_to_non_peered_ratio, hlr_mismatch_rate, dlr_success_rate_anomaly.
Predicate: mt_to_non_peered_ratio > 0.3 AND total_mt > 1000 over 24h.
Score with calibrated model. Confidence ≥ 0.85 → emit fraud.detected.greyroute.v1 with auto-quarantine recommendation. 0.6 ≤ confidence < 0.85 → open case.

UC-07: OtpHarvestDetector (6-h rolling join)

Trigger: Cron */30 * * * * over 6h window.

Steps:

Identify outbound OTP-class messages (events.is_otp_likely = TRUE set by UC-12 OTP-keyword classifier).
Join with consent.revoked.v1 events from same recipient cohort within ±1h.
Compute (tenantId, dstMsisdnCohortHash) → otpCount, revocationCount, revocationRate.
revocationRate > 0.05 AND cohort_size >= 100 → publish fraud.detected.otp_harvesting.v1 with suggestedAction = SUSPEND_SENDER_ID. Confidence ≥ 0.85 → open case for NOC; only NOC may execute the suspension via fraud.case.action_dispatched.v1.
Bank/Gov sender-IDs (per sender-id-registry-service lookup) → downgrade to flag only with audit-log entry.

UC-08: OtpGrindingStreamingDetector (real-time)

Trigger: NATS JetStream consumer on sms.events.status.v1 filter messageType=OTP. Streaming aggregator over Redis sorted sets.

Steps:

For each outbound OTP-class message: ZADD fraud:otp:dst:{msisdnHash}:60s {ts} {messageId}; trim with ZREMRANGEBYSCORE for entries older than 60s.
ZCOUNT returns rolling-60s OTP count to that MSISDN.
If count > 10 → publish fraud.detected.otp_grinding.v1 with { dstMsisdn (hashed), srcTenants[], srcSenderIds[], windowStart, windowEnd } and SETEX fraud:throttle:dst:{msisdnHash} 21600 "1" (6h throttle handle).
compliance-engine consumes the event and applies its own per-MSISDN OTP throttle (1/60s for 6h) — fraud-intel-service does not enforce; it observes and recommends.

Latency: Detection-to-emission within 5 seconds of the threshold-crossing message.

UC-09: FeedExport (daily MISP/STIX export)

Trigger: Cron 0 4 * * * Asia/Kabul (UTC+04:30 → 23:30 UTC prior day).

Steps:

Select FraudCase rows with status = 'CONFIRMED' AND confirmed_at > now() - 24h.
Build a MISP 2.4 event with attributes per case category:
- MSISDN → phone-number attribute
- MSISDN_BLOCK → custom msisdn-block attribute (per ATRA convention)
- SENDER_ID → text with comment category=sender-id
- PEER_ASN → AS attribute
- TEMPLATE_HASH → sha256 attribute with comment category=template
Build a STIX 2.1 bundle with indicator SDOs whose pattern follows [mobile:phone-number = '+937…'] syntax.
Sign both files with platform HSM via PKCS#11 (pkcs11-tool --sign); produce .misp.json.sig and .stix.json.sig.
Upload to MinIO s3://fraud-feed-out/{yyyymmdd}.{misp|stix}.json.sig.
Mirror to regulator SFTP within 5 minutes (sftp://atra-mirror.gov.af/incoming/).
Emit fraud.feed.exported.v1 with SHA-256, signature, and 24h-presigned URLs.
Zero-diff days → emit fraud.feed.heartbeat.v1 (absence-of-evidence signal so consumers can detect a silent failure).

UC-10: FeedImport (push from regulator/peer)

Trigger: POST /v1/internal/fraud/feed/import from regulator-portal-service or peer-MNO over mTLS.

Steps:

Verify body signature against FraudFeed.publicKeyRef from Vault. Failure → fraud.alert.feed.signature.invalid.v1 (PagerDuty CRITICAL).
Parse MISP 2.4 or STIX 2.1.
UPSERT into fraud.feed_indicators keyed by (feedId, sourceUuid) — idempotent.
For new MSISDN indicators → invalidate fraud:imported:msisdn:* Redis bloom-filter; rebuild on next read.
Emit fraud.feed.imported.v1 with counts: { added, updated, expired }.

UC-11: ModelTrainingPipeline (Airflow DAG)

Trigger: Nightly 02:00 Asia/Kabul + manual via POST /v1/admin/fraud/training-runs.

Pipeline stages:

Snapshot training set. Pull last 90d of fraud.signals ∪ fraud.detections ∪ fraud.case_decisions (HITL labels). Compute trainingSetHash.
Feature engineering. Reuse the same feature transformers as the inference path (single source of truth in fraud-features Python package).
Train. XGBoost / GraphSAGE / iForest depending on category.
Evaluate. Hold-out test set (20%). Compute AUC, F1, precision, recall, FPR-at-threshold, Brier score (calibration), per-tenant fairness metrics.
Adversarial audit. Run a fixed corpus of known-evasion attempts (paraphrased OTP-pumping templates, MSISDN-block sweep simulations). Recall on this corpus must be ≥ 0.80.
Model card. Generate YAML model card per Hugging Face / Model Card Toolkit format. Persist to MinIO.
Register. POST /v1/admin/fraud/models with artifact URI, hashes, evaluation metrics. Status REGISTERED.

UC-12: ModelShadowAndPromote

Trigger: REST POST /v1/admin/fraud/models/{id}/shadow then …/promote.

Shadow steps:

Worker loads shadow model alongside active.
Each feature window is scored by both models; shadow predictions persist to fraud_features.shadow_predictions. No fraud.detected.* events are emitted by the shadow model.
Shadow runs ≥ 24h.

Promote steps:

Verify shadow has run ≥ 24h (else 412 SHADOW_EVAL_INSUFFICIENT).
Compare shadow vs active over the shadow window: AUC delta ≥ 0; Brier ≤ active × 1.05; per-tenant fairness Δ ≤ 0.10. Else 412.
Verify artifact SHA-256 matches registered hash. Else 422 MODEL_ARTIFACT_INTEGRITY_FAIL.
Atomic UPDATE: previous active → RETIRED, new → ACTIVE.
Emit fraud.model.promoted.v1.
Workers hot-reload on next batch boundary (no cold restart).

Rollback:

POST /v1/admin/fraud/models/{id}/rollback → atomic re-activation of previous version. < 60 seconds. Emits fraud.model.rolled_back.v1.

UC-13: AlertEmission (downstream consumer hooks)

When fraud.detected.* events are published, the following automatic consumers act (see EVENT_SCHEMAS §4):

Detection	Consumer	Action
`fraud.detected.simbox.v1` (conf ≥ 0.85)	`sms-firewall-service`	Add MSISDN block to `firewall.peer_quarantine` 7d TTL
`fraud.detected.ait_ring.v1`	`sms-firewall-service`	Add cohort cluster MSISDNs to dynamic blocklist
`fraud.detected.otp_harvesting.v1` (conf ≥ 0.85)	`sender-id-registry-service` (after NOC dispatch)	Suspend sender-ID
`fraud.detected.otp_grinding.v1`	`compliance-engine`	Per-MSISDN OTP throttle (1/60s × 6h)
`fraud.detected.greyroute.v1` (conf ≥ 0.85)	`sms-firewall-service`	Quarantine peer-ASN binding
`fraud.tenant_score.updated.v1`	`compliance-engine`	Update `compliance.tenant_tiers` input

UC-14: FeedbackLoop (HITL → retrain)

Trigger: POST /v1/admin/fraud/cases/{caseId}/decide from a Trust & Safety analyst.

Steps:

Validate decision (CONFIRM_FRAUD / DISMISS / REFINE_FEATURES); reason ≥ 20 chars.
Enforce separation-of-duties (opened_by != decided_by).
Persist FeedbackDecision row.
Update fraud.cases.status accordingly.
Emit fraud.case.decided.v1.
If executeAction = true AND decision = CONFIRM_FRAUD:
- Look up the case's suggestedAction.
- Dispatch the action via NATS event (firewall.blocklist.add.v1, sender-id.suspend.v1, etc.).
- Audit-log every dispatch with caller's JWT subject and trace ID.
The decision is read by the next nightly ModelTrainingPipeline run as a labelled training row.

UC-15: TenantScoreRecompute (hourly cron)

Trigger: Cron 0 * * * *. Distributed lock fraud:lock:scoring:hourly.

Steps:

For each tenant with activity in last 30d:
- Apply formula in DOMAIN_MODEL §5.
- Read previous tier from fraud.entity_scores.
- If tier changed → emit fraud.tenant_score.updated.v1 with { previousTier, newTier, score, contributingFactors[], modelVersion }.
- UPSERT fraud.entity_scores.
- Refresh Redis fraud:score:tenant:{tenantId} SETEX 900.
Emit fraud_score_recompute_duration_seconds and fraud_score_recompute_failed_total.

2. Pipeline Latency Budgets

Pipeline	Window	Budget	SLA
Score gRPC	per-call	n/a	P95 ≤ 50 ms
Stream ingestion → ClickHouse	per-event	30 s	P95 ≤ 30 s
AIT 5-min pipeline	per-window	90 s	< 5 min wall
AIT cohort job	hourly	5 min	< 10 min wall
SIM-box detector	30-min	60 s	< 2 min wall
Grey-route detector	hourly	90 s	< 5 min wall
OTP-harvest detector	30-min	60 s	< 2 min wall
OTP-grinding streaming	real-time	5 s	detection-to-event
Hourly tenant-score recompute	hourly	5 min	< 10 min wall
Nightly model training	nightly	6 h	< 12 h

3. Hot-Path Caching Strategy

The Score gRPC is the only synchronous surface and is heavily cached:

Layer	Key	TTL	Hit target
Redis L1	`fraud:score:{scope}:{id}`	15 min	≥ 95% steady-state
Postgres L2 (mat. view)	`fraud.entity_scores`	refreshed hourly	residual misses
Cold path (PROBATION)	n/a	n/a	< 1%

Cache invalidation is time-based only (15-min TTL). On a fraud.tenant_score.updated.v1 event we additionally DEL fraud:score:tenant:{tenantId} to force the next Score call to read fresh.

1. Use Cases​

UC-01: Score (gRPC hot path)​

UC-02: BulkScore (gRPC batch)​

UC-03: AitGraphAnalysis (5-min batch pipeline)​

UC-04: AitCohortJob (1-h rolling)​

UC-05: SimboxPatternDetector (30-min batch)​

UC-06: GreyRouteDetector (24-h rolling, hourly)​

UC-07: OtpHarvestDetector (6-h rolling join)​

UC-08: OtpGrindingStreamingDetector (real-time)​

UC-09: FeedExport (daily MISP/STIX export)​

UC-10: FeedImport (push from regulator/peer)​

UC-11: ModelTrainingPipeline (Airflow DAG)​

UC-12: ModelShadowAndPromote​

UC-13: AlertEmission (downstream consumer hooks)​

UC-14: FeedbackLoop (HITL → retrain)​

UC-15: TenantScoreRecompute (hourly cron)​

2. Pipeline Latency Budgets​

3. Hot-Path Caching Strategy​

1. Use Cases

UC-01: Score (gRPC hot path)

UC-02: BulkScore (gRPC batch)

UC-03: AitGraphAnalysis (5-min batch pipeline)

UC-04: AitCohortJob (1-h rolling)

UC-05: SimboxPatternDetector (30-min batch)

UC-06: GreyRouteDetector (24-h rolling, hourly)

UC-07: OtpHarvestDetector (6-h rolling join)

UC-08: OtpGrindingStreamingDetector (real-time)

UC-09: FeedExport (daily MISP/STIX export)

UC-10: FeedImport (push from regulator/peer)

UC-11: ModelTrainingPipeline (Airflow DAG)

UC-12: ModelShadowAndPromote

UC-13: AlertEmission (downstream consumer hooks)

UC-14: FeedbackLoop (HITL → retrain)

UC-15: TenantScoreRecompute (hourly cron)

2. Pipeline Latency Budgets

3. Hot-Path Caching Strategy