Fraud Intelligence Service — Observability

Version: 1.0
Status: Draft
Owner: Trust and Safety + Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · FAILURE_MODES · SERVICE_READINESS


1. Service Level Objectives (SLOs)

Bound to platform NFR catalog (EP-PLAT-NB-09):

| SLI | SLO | Window | Error budget |
|---|---|---|---|
| Score gRPC P95 latency | ≤ 50 ms | 30 d rolling | 0.1% (43 min) |
| Score gRPC availability | ≥ 99.5% | 30 d rolling | 3.6 h |
| Stream ingestion lag (NATS publish → ClickHouse insert) P95 | ≤ 30 s | 30 d rolling | 1% > 30 s |
| AIT pipeline freshness (last successful run) | ≤ 15 min | 30 d rolling | 0.5% > 15 min |
| Detection-to-event latency (high-confidence) | ≤ 5 min | 30 d rolling | 1% > 5 min |
| OTP-grinding detection-to-event latency | ≤ 5 s | 24 h rolling | 0.1% > 5 s |
| MISP feed export daily success | 100% | quarterly | 0 misses |
| Model precision (rolling 7 d on confirmed cases) | ≥ 0.92 (AIT), ≥ 0.92 (SIM-box) | 7 d | n/a (binary alarm) |
| HITL case backlog age (oldest PENDING) | ≤ 7 d | 7 d rolling | 5% > 7 d |
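
The time-based budgets for the two Score gRPC rows follow from the window length and the allowed failure fraction. A minimal sketch of that arithmetic, assuming a plain fraction-of-window model; the helper name `error_budget` is illustrative and not part of the service:

```python
from datetime import timedelta


def error_budget(window: timedelta, allowed_fraction: float) -> timedelta:
    """Time a window may spend out of SLO for a given allowed failure fraction."""
    if not 0.0 <= allowed_fraction <= 1.0:
        raise ValueError("allowed_fraction must be between 0 and 1")
    return window * allowed_fraction


window_30d = timedelta(days=30)

# Score gRPC P95 latency: 0.1% of the 30-day window.
latency_budget = error_budget(window_30d, 0.001)

# Score gRPC availability: a 99.5% target leaves 0.5% of the window.
availability_budget = error_budget(window_30d, 1 - 0.995)

latency_minutes = round(latency_budget.total_seconds() / 60, 1)
availability_hours = round(availability_budget.total_seconds() / 3600, 2)
print(latency_minutes, "min")     # 43.2 min
print(availability_hours, "h")    # 3.6 h
```

The results match the 43 min and 3.6 h figures in the table.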

2. Prometheus Metrics

All metrics are exposed at GET /metrics on port 3014 (REST) and port 9091 (worker pods). Prometheus scrapes every 15 s and retains data for 90 d.

2.1 Score gRPC (hot path)

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_score_grpc_total | Counter | scope, tier, cache (hit/miss/cold) | Score gRPC call count |
| fraud_score_grpc_duration_seconds | Histogram | scope, cache | End-to-end gRPC latency |
| fraud_score_grpc_errors_total | Counter | code (UNAVAILABLE/INVALID/PERMISSION) | Error count |
| fraud_score_cache_hit_ratio | Gauge | | Rolling 5-min Redis L1 hit ratio |

Histogram buckets for fraud_score_grpc_duration_seconds, in seconds: [0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0].
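
Because the 50 ms SLO threshold coincides with a bucket boundary, a P95 estimate from these buckets is most accurate around the value that matters. The sketch below approximates how a quantile can be estimated from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank, which is the approach Prometheus's histogram_quantile takes. The function name and the sample counts are invented for illustration.

```python
import math

# Upper bounds (seconds) listed above, plus the implicit +Inf bucket.
BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0, math.inf]


def estimate_quantile(q: float, cumulative_counts: list) -> float:
    """Estimate quantile q from cumulative bucket counts by linear interpolation."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(BUCKETS):
        raise ValueError("one cumulative count per bucket is required")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, upper in enumerate(BUCKETS):
        if cumulative_counts[i] >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return BUCKETS[-2]
            lower = BUCKETS[i - 1] if i > 0 else 0.0
            below = cumulative_counts[i - 1] if i > 0 else 0.0
            in_bucket = cumulative_counts[i] - below
            if in_bucket == 0:
                return upper
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / in_bucket
    return BUCKETS[-2]


# Hypothetical cumulative counts for 1000 calls (illustrative only).
sample_counts = [100, 400, 700, 900, 960, 985, 995, 999, 1000, 1000, 1000]
p95 = estimate_quantile(0.95, sample_counts)
print(f"estimated P95: {p95 * 1000:.1f} ms")
```

With these sample counts the 950th observation lands in the (0.025, 0.05] bucket, so the estimate stays below the 50 ms threshold.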

2.2 Stream ingestion

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_ingestion_lag_seconds | Gauge | source_stream | NATS publish → ClickHouse insert lag |
| fraud_ingestion_rows_total | Counter | source_stream | Successfully ingested rows |
| fraud_ingestion_dlq_total | Counter | source_stream, reject_reason | DLQ-routed rows |
| fraud_ingestion_wal_buffer_bytes | Gauge | | On-disk WAL buffer size during ClickHouse outages |

2.3 Pipeline metrics (AIT, SIM-box, OTP-harvest, grey-route, cohort, scoring)

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_pipeline_run_duration_seconds | Histogram | pipeline | End-to-end pipeline run duration |
| fraud_pipeline_last_success_timestamp | Gauge | pipeline | Unix timestamp of last successful run (freshness) |
| fraud_pipeline_failed_total | Counter | pipeline, error_type | Pipeline failures |
| fraud_pipeline_predictions_total | Counter | pipeline, bucket (high/medium/low) | Predictions emitted by confidence bucket |
| fraud_pipeline_lock_contention_total | Counter | pipeline | Distributed-lock skip count |

2.4 Detection emission

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_detections_emitted_total | Counter | category, subject_scope, confidence_tier | Detection event emission |
| fraud_cases_opened_total | Counter | category, suggested_action | Case openings |
| fraud_cases_decided_total | Counter | decision, category | Case decisions |
| fraud_cases_pending_total | Gauge | category | Current PENDING count |
| fraud_cases_age_seconds | Histogram | | Age at decision time |
| fraud_cases_auto_stale_total | Counter | | Stale-closed cases |

2.5 Model serving (Triton)

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_model_inference_duration_seconds | Histogram | model_id, model_version | Triton inference latency |
| fraud_model_inference_errors_total | Counter | model_id, error_type | Inference errors |
| fraud_model_load_total | Counter | model_id, version | Model load events |
| fraud_model_active_version | Gauge | model_id, version | Per-model active version (1.0 for active) |
| fraud_model_artifact_sha_mismatch_total | Counter | model_id | Tamper events |

2.6 Model accuracy & drift

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_model_precision_rolling_7d | Gauge | model_id, category | Precision on confirmed-vs-emitted last 7 d |
| fraud_model_recall_rolling_7d | Gauge | model_id, category | Recall on confirmed-vs-missed last 7 d |
| fraud_model_brier_rolling_7d | Gauge | model_id | Calibration |
| fraud_model_psi_per_feature | Gauge | model_id, feature | Population Stability Index per input feature |
| fraud_model_prediction_drift_wasserstein | Gauge | model_id | Wasserstein distance to 7-d baseline |
| fraud_model_fairness_delta | Gauge | model_id, cohort | Per-cohort AUC delta from population AUC |
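
For readers unfamiliar with the Population Stability Index behind fraud_model_psi_per_feature, the sketch below shows the standard formula, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), over per-bin proportions. The binning, the epsilon clamp for empty bins, and the function names are assumptions made for illustration; the 0.25 and 0.50 thresholds are the ones that appear in the drift log event in §3.6 and the alert in §5.

```python
import math


def population_stability_index(expected: list, actual: list, epsilon: float = 1e-6) -> float:
    """PSI between two binned distributions given as per-bin proportions."""
    if len(expected) != len(actual) or not expected:
        raise ValueError("expected and actual need the same, non-zero number of bins")
    psi = 0.0
    for e, a in zip(expected, actual):
        # Clamp empty bins so the logarithm stays defined (an assumption of this sketch).
        e = max(e, epsilon)
        a = max(a, epsilon)
        psi += (a - e) * math.log(a / e)
    return psi


def drift_level(psi: float, medium: float = 0.25, high: float = 0.50) -> str:
    """Classify a PSI value against the medium and high thresholds."""
    if psi > high:
        return "high"
    if psi > medium:
        return "medium"
    return "low"


# Illustrative proportions over four bins: training baseline vs. recent traffic.
baseline = [0.25, 0.25, 0.25, 0.25]
recent = [0.10, 0.20, 0.30, 0.40]
psi_value = population_stability_index(baseline, recent)
print(round(psi_value, 4), drift_level(psi_value))
```

A PSI of 0.31, as in the example drift event in §3.6, would classify as "medium" under these thresholds.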

2.7 Feed (MISP / STIX)

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_feed_export_runs_total | Counter | feed_id, status | Daily export runs |
| fraud_feed_export_indicator_count | Gauge | feed_id | Indicators in last export |
| fraud_feed_export_signature_duration_seconds | Histogram | | HSM signing duration |
| fraud_feed_import_runs_total | Counter | feed_id, signature_valid | Import runs |
| fraud_feed_import_indicator_added_total | Counter | feed_id, type | Indicators added per import |
| fraud_feed_sync_lag_seconds | Gauge | feed_id, direction | Last sync age |

2.8 Tenant scoring

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_tenant_score | Gauge | tenant_id, tier | Per-tenant fraud score |
| fraud_tenant_score_tier_distribution | Gauge | tier | Tenants per tier |
| fraud_tenant_score_recompute_duration_seconds | Histogram | | Hourly recompute duration |
| fraud_score_recompute_failed_total | Counter | | Recompute failures |

2.9 OTP-grinding streaming

| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_otp_grinding_detections_total | Counter | | Detections emitted |
| fraud_otp_grinding_throttle_active_total | Gauge | | Active throttle handles |
| fraud_otp_grinding_window_size_avg | Gauge | | Average messages in 60 s window for tracked MSISDNs |
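
The window-size gauge implies a per-MSISDN count over a trailing 60 s window. The following is a minimal in-memory sketch of such a counter, not the service's actual streaming implementation; the class name and the choice of a deque per key are assumptions. Keys should be hashed MSISDNs, in line with the PII rules in §3.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over a trailing time window (60 s by default)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def _evict(self, events: deque, now: float) -> None:
        # Drop timestamps that have fallen out of the trailing window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()

    def record(self, key: str, timestamp: float) -> int:
        """Record one event and return the key's count within the window."""
        events = self._events[key]
        events.append(timestamp)
        self._evict(events, timestamp)
        return len(events)

    def average_window_size(self, now: float) -> float:
        """Average in-window count across keys that still have events."""
        sizes = []
        for key in list(self._events):
            events = self._events[key]
            self._evict(events, now)
            if events:
                sizes.append(len(events))
            else:
                del self._events[key]  # stop tracking idle keys
        return sum(sizes) / len(sizes) if sizes else 0.0


counter = SlidingWindowCounter()
for ts in (0.0, 10.0, 59.0, 61.0):
    h1_count = counter.record("hash_a", ts)
h2_count = counter.record("hash_b", 61.0)
avg_size = counter.average_window_size(now=61.0)
print(h1_count, h2_count, avg_size)
```

A detection rule would compare the count returned by `record` against a threshold; the threshold value is not stated in this document and is left out here.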

3. Structured Log Events

All log output is structured JSON: Pino for the NestJS service and structlog for the Python workers. The log level is controlled by the LOG_LEVEL environment variable.

3.1 Score evaluation

{
  "level": "info",
  "time": "2026-04-21T10:00:00.123Z",
  "event": "fraud.score.served",
  "scope": "TENANT",
  "subjectId": "tnt_acme",
  "score": 0.21,
  "tier": "WATCH",
  "cache": "hit",
  "latencyMs": 4,
  "callerSpiffe": "spiffe://ghasi/compliance-engine",
  "traceId": "abc123",
  "spanId": "def456"
}

3.2 Detection emission

{
  "level": "warn",
  "event": "fraud.detection.emitted",
  "detectionId": "fd_01H...",
  "category": "AIT",
  "subjectScope": "TENANT",
  "subjectId": "tnt_xyz",
  "score": 0.94,
  "modelId": "ml_ait_xgboost",
  "modelVersion": "2.1.4",
  "shapTop3": [
    { "feature": "dlr_success_rate", "contribution": -0.42 },
    { "feature": "cohort_anomaly_score", "contribution": 0.31 },
    { "feature": "tenant_age_days", "contribution": 0.24 }
  ],
  "windowStart": "2026-04-21T10:00:00Z",
  "windowEnd": "2026-04-21T10:05:00Z",
  "traceId": "..."
}

3.3 Case decision

{
  "level": "info",
  "event": "fraud.case.decided",
  "caseId": "fc_01H...",
  "decision": "CONFIRM_FRAUD",
  "decidedBy": "user_jane",
  "ageSeconds": 8400,
  "actionExecuted": true,
  "actionDispatched": "sender_id.suspend.v1"
}

3.4 Pipeline run

{
  "level": "info",
  "event": "fraud.pipeline.completed",
  "pipeline": "ait",
  "windowStart": "2026-04-21T10:00:00Z",
  "windowEnd": "2026-04-21T10:05:00Z",
  "rowsProcessed": 4127,
  "predictionsHigh": 3,
  "predictionsMedium": 12,
  "predictionsLow": 421,
  "durationMs": 71200,
  "modelVersion": "ait-xgboost-2.1.4"
}

3.5 Model promotion

{
  "level": "warn",
  "event": "fraud.model.promoted",
  "modelId": "ml_ait_xgboost",
  "previousVersion": "2.1.3",
  "newVersion": "2.1.4",
  "promotedBy": "user_ds_alice",
  "approvedBy": "user_compl_bob",
  "shadowDurationHours": 26,
  "shadowAuc": 0.937,
  "activeAuc": 0.932
}

3.6 Drift / anomaly

{
  "level": "warn",
  "event": "fraud.model.drift.detected",
  "modelId": "ml_ait_xgboost",
  "feature": "dlr_success_rate",
  "psi": 0.31,
  "thresholdMedium": 0.25,
  "thresholdHigh": 0.50
}

3.7 Feed signature failure

{
  "level": "error",
  "event": "fraud.alert.feed.signature.invalid",
  "feedId": "ff_regulator_atra",
  "expectedKeyId": "atra-prod-2026-q2",
  "observedKeyId": "atra-prod-2025-q4",
  "reason": "KEY_REVOKED"
}

PII rules: never log raw dst_msisdn or src_msisdn values; log the masked form (+CCNNN***) or msisdnHash instead. Message body content is never stored anywhere, including logs.
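
A minimal sketch of how a Python worker might apply these rules before logging. Two points are assumptions, since the document does not specify them: the masked form keeps the first five digits (a two-digit country code plus three digits), and msisdnHash is derived with HMAC-SHA-256 under a secret key. Both function names are illustrative.

```python
import hashlib
import hmac

_ASCII_DIGITS = "0123456789"


def _digits_only(msisdn: str) -> str:
    """Strip '+', spaces, and punctuation, keeping ASCII digits only."""
    return "".join(ch for ch in msisdn if ch in _ASCII_DIGITS)


def mask_msisdn(msisdn: str, visible_digits: int = 5) -> str:
    """Return the +CCNNN*** form: leading digits kept, the rest replaced by ***."""
    digits = _digits_only(msisdn)
    if len(digits) <= visible_digits:
        # Too short to mask meaningfully, so reveal nothing.
        return "+***"
    return f"+{digits[:visible_digits]}***"


def msisdn_hash(msisdn: str, key: bytes) -> str:
    """Keyed digest so equal numbers correlate across events without exposing them."""
    return hmac.new(key, _digits_only(msisdn).encode("ascii"), hashlib.sha256).hexdigest()


masked = mask_msisdn("+93 70 123 4567")
hashed = msisdn_hash("+93 70 123 4567", key=b"example-key-not-for-production")
print(masked)  # +93701***
```

A keyed hash is used rather than a bare SHA-256 because the MSISDN space is small enough to enumerate; whether the service does the same is not stated here.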


4. OpenTelemetry Tracing

Parent spans:

  • fraud-intel-service.Score (gRPC)
  • fraud-intel-service.IngestSignal (NATS consumer)
  • fraud-intel-service.AitPipelineRun (5-min cron)
  • fraud-intel-service.OtpGrindingStream (per-event)

Child spans (Score):

| Span | Operation | Attributes |
|---|---|---|
| fraud.score.cache.l1 | Redis GET | cache.hit |
| fraud.score.cache.l2 | Postgres SELECT | rows_returned |
| fraud.score.refresh.queue | Redis LPUSH | queue_depth |

Child spans (AIT pipeline):

| Span | Operation | Attributes |
|---|---|---|
| fraud.ait.feature.engineering | ClickHouse INSERT…SELECT | rows_in, rows_out |
| fraud.ait.cohort.join | ClickHouse JOIN | cohorts_joined |
| fraud.ait.inference | Triton ModelInfer | model_id, model_version, batch_size |
| fraud.ait.shap | TreeSHAP compute | n_features |
| fraud.ait.outbox.emit | Postgres INSERT + NATS publish | event_count |

Trace context is propagated via W3C Trace Context: the traceparent header on gRPC calls and the Nats-Trace-Parent header on NATS messages.
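
In practice the OpenTelemetry SDK handles propagation, but the wire format is simple enough to show directly. The sketch below builds and parses a version-00 traceparent value (`version-traceid-spanid-flags`, all lowercase hex) and copies it into a NATS header map under the Nats-Trace-Parent name used above. The function names and the plain-dict header representation are assumptions for illustration, and the parser is deliberately strict: it accepts only the four-field form.

```python
import re
import secrets
from typing import Optional

_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the parsed fields, or None when the value is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def inject_into_nats_headers(headers: dict, incoming_traceparent: str) -> dict:
    """Continue the incoming trace with a fresh span ID on an outgoing NATS message."""
    parsed = parse_traceparent(incoming_traceparent)
    if parsed is None:
        return headers  # leave headers untouched rather than emit a bad value
    child_span_id = secrets.token_hex(8)  # 16 hex characters
    headers["Nats-Trace-Parent"] = build_traceparent(
        parsed["trace_id"], child_span_id, parsed["sampled"]
    )
    return headers


incoming = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
nats_headers = inject_into_nats_headers({}, incoming)
print(nats_headers["Nats-Trace-Parent"])
```

The trace ID is preserved across the hop while the span ID changes, which is what lets the backend stitch the gRPC and NATS spans into one trace.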

Exporter: OTLP gRPC to platform OTel collector → SigNoz backend.


5. Alerting Rules

Prometheus alert YAML (excerpt):

groups:
  - name: fraud-intel
    interval: 30s
    rules:

      # Score gRPC SLO
      - alert: FraudScoreP95High
        expr: histogram_quantile(0.95, sum(rate(fraud_score_grpc_duration_seconds_bucket[5m])) by (le)) > 0.05
        for: 5m
        labels: { severity: warning, service: fraud-intel-service }
        annotations:
          summary: "Score gRPC P95 > 50 ms"
          runbook: "runbooks/fraud-intel/score-p95-high.md"

      - alert: FraudScoreUnavailable
        expr: rate(fraud_score_grpc_errors_total{code="UNAVAILABLE"}[2m]) > 0.5
        for: 2m
        labels: { severity: high }
        annotations: { runbook: "runbooks/fraud-intel/score-unavailable.md" }

      # Pipeline freshness
      - alert: FraudAitPipelineStale
        expr: time() - fraud_pipeline_last_success_timestamp{pipeline="ait"} > 900
        for: 5m
        labels: { severity: high }
        annotations:
          summary: "AIT pipeline last success > 15 min ago"
          runbook: "runbooks/fraud-intel/pipeline-stale.md"

      - alert: FraudIngestionLagHigh
        expr: fraud_ingestion_lag_seconds > 60
        for: 5m
        labels: { severity: warning }
        annotations: { runbook: "runbooks/fraud-intel/ingestion-lag.md" }

      # Model accuracy / drift
      - alert: FraudModelDriftHigh
        expr: max(fraud_model_psi_per_feature) by (model_id) > 0.50
        for: 1h
        labels: { severity: high }
        annotations:
          summary: "Model {{ $labels.model_id }} feature drift PSI > 0.50"
          runbook: "runbooks/fraud-intel/model-drift.md"

      - alert: FraudModelPrecisionLow
        expr: fraud_model_precision_rolling_7d{category="AIT"} < 0.92
        for: 4h
        labels: { severity: high }
        annotations: { runbook: "runbooks/fraud-intel/precision-degraded.md" }

      - alert: FraudModelArtifactTamper
        expr: increase(fraud_model_artifact_sha_mismatch_total[5m]) > 0
        for: 0m
        labels: { severity: critical, page: pagerduty }
        annotations:
          summary: "Model artifact SHA-256 mismatch — supply-chain attack suspected"
          runbook: "runbooks/fraud-intel/artifact-tamper.md"

      # Feed sync
      - alert: FraudFeedSyncStale
        expr: fraud_feed_sync_lag_seconds{direction="EXPORT"} > 86400 * 1.1
        for: 30m
        labels: { severity: medium }
        annotations: { runbook: "runbooks/fraud-intel/feed-export-stale.md" }

      - alert: FraudFeedSignatureInvalid
        expr: increase(fraud_feed_import_runs_total{signature_valid="false"}[5m]) > 0
        for: 0m
        labels: { severity: critical, page: pagerduty }
        annotations:
          summary: "MISP feed import signature failure — possible compromised peer key"
          runbook: "runbooks/fraud-intel/feed-signature-invalid.md"

      # Detection volume
      - alert: FraudScoreSpike
        expr: rate(fraud_detections_emitted_total{category="AIT"}[5m]) > 5 * avg_over_time(rate(fraud_detections_emitted_total{category="AIT"}[5m])[1d:5m])
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "AIT detection rate 5× baseline — possible coordinated campaign or model false-positive spike"
          runbook: "runbooks/fraud-intel/detection-spike.md"

      # HITL backlog
      - alert: FraudCaseBacklogHigh
        expr: fraud_cases_pending_total > 200
        for: 30m
        labels: { severity: warning }
        annotations: { runbook: "runbooks/fraud-intel/case-backlog.md" }

      # Fires when more than 10 cases were stale-closed in the past hour
      # (increase() counts events over the range; rate() would be per second).
      - alert: FraudCaseAutoStaleSpike
        expr: increase(fraud_cases_auto_stale_total[1h]) > 10
        for: 1h
        labels: { severity: warning }
        annotations: { runbook: "runbooks/fraud-intel/case-auto-stale.md" }
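
The FraudScoreSpike expression compares the current 5-minute detection rate with five times the one-day average of that same rate. The comparison itself can be sketched outside Prometheus as follows; the function name and the sample values are illustrative, and the sketch ignores the `for: 10m` hold period that the alert rule adds on top.

```python
from statistics import fmean


def is_detection_spike(current_rate: float, baseline_rates: list, factor: float = 5.0) -> bool:
    """True when the current rate exceeds factor times the mean of the baseline samples."""
    if not baseline_rates:
        # No history yet: nothing to compare against, so do not flag.
        return False
    return current_rate > factor * fmean(baseline_rates)


# Hypothetical per-second rates sampled over the previous day.
baseline = [2.0, 2.0, 2.0]
print(is_detection_spike(10.5, baseline))  # True: above 5 x 2.0
print(is_detection_spike(10.0, baseline))  # False: the comparison is strict
```

Because the baseline is an average of the same series, a slow sustained rise raises the baseline with it; only abrupt changes relative to the past day trigger the alert.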

6. Grafana Dashboards

Dashboard JSON sources live in dashboards/fraud-intel-service.json. There are five canonical dashboards:

6.1 Score gRPC (hot path)

| Panel | Query | Visualisation |
|---|---|---|
| RPS | rate(fraud_score_grpc_total[5m]) by tier | Stacked area |
| P50/P95/P99 latency | histogram_quantile(0.5/0.95/0.99, …) | Multi-line |
| Cache hit ratio | fraud_score_cache_hit_ratio | Gauge + time series |
| Tier distribution | fraud_score_grpc_total by tier (last 24h) | Pie |
| Error breakdown | fraud_score_grpc_errors_total by code | Bar |

6.2 Detection pipelines

| Panel | Query | Visualisation |
|---|---|---|
| Pipeline freshness | time() - fraud_pipeline_last_success_timestamp per pipeline | Multi-stat |
| Pipeline duration | fraud_pipeline_run_duration_seconds per pipeline | Multi-line |
| Detection rate (24h) | rate(fraud_detections_emitted_total[5m]) by category, confidence_tier | Stacked area |
| Cases opened vs decided | dual line | Time series |
| Case backlog by category | fraud_cases_pending_total by category | Stacked bar |

6.3 Model serving

| Panel | Query | Visualisation |
|---|---|---|
| Triton inference latency P95 | per model | Multi-line |
| Active model versions | fraud_model_active_version | Table |
| Model load events | fraud_model_load_total | Time series |
| Artifact tamper count | fraud_model_artifact_sha_mismatch_total | Stat (alarm-coloured) |

6.4 Model accuracy & drift

| Panel | Query | Visualisation |
|---|---|---|
| Precision/Recall rolling 7d | per category | dual gauge + line |
| PSI per feature heatmap | fraud_model_psi_per_feature | Heatmap |
| Wasserstein drift per model | fraud_model_prediction_drift_wasserstein | Time series |
| Per-cohort fairness delta | fraud_model_fairness_delta by cohort | Bar (with thresholds) |

6.5 Feed (MISP / STIX)

| Panel | Query | Visualisation |
|---|---|---|
| Last export age | fraud_feed_sync_lag_seconds{direction="EXPORT"} | Stat |
| Indicator counts per feed | fraud_feed_export_indicator_count by feed_id | Bar |
| Import success rate | success / total | Gauge |
| Signature failure timeline | fraud_feed_import_runs_total{signature_valid="false"} | Annotated time series |

7. Runbook References

Every alert in §5 links to a runbook under runbooks/fraud-intel/:

  • score-p95-high.md — investigate Redis L1 hit ratio, Triton latency, network
  • score-unavailable.md — pod readiness, Postgres availability, mTLS cert validity
  • pipeline-stale.md — Airflow DAG status, distributed-lock contention, ClickHouse availability
  • ingestion-lag.md — NATS consumer lag, ClickHouse insert latency, WAL buffer
  • model-drift.md — feature population shift, training freshness, retrain trigger
  • precision-degraded.md — recent confirmed/dismissed analysis, label noise check, model rollback decision
  • artifact-tamper.md — CRITICAL: stop the affected pipeline, isolate the artifact, page Security
  • feed-export-stale.md — HSM availability, MinIO availability, SFTP destination health
  • feed-signature-invalid.md — peer-key rotation status, KMS revocation list, contact peer SOC
  • detection-spike.md — distinguish real campaign from false-positive spike via case sampling
  • case-backlog.md — analyst capacity check, bulk-action consideration
  • case-auto-stale.md — analyst headcount review, queue prioritisation