Fraud Intelligence Service — Observability
Version: 1.0
Status: Draft
Owner: Trust and Safety + Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · FAILURE_MODES · SERVICE_READINESS
1. Service Level Objectives (SLOs)
Bound to platform NFR catalog (EP-PLAT-NB-09):
| SLI | SLO | Window | Error budget |
|---|---|---|---|
| Score gRPC P95 latency | ≤ 50 ms | 30 d rolling | 0.1% (43 min) |
| Score gRPC availability | ≥ 99.5% | 30 d rolling | 3.6 h |
| Stream ingestion lag (NATS publish → ClickHouse insert) P95 | ≤ 30 s | 30 d rolling | 1% > 30s |
| AIT pipeline freshness (last successful run) | ≤ 15 min | 30 d rolling | 0.5% > 15min |
| Detection-to-event latency (high-confidence) | ≤ 5 min | 30 d rolling | 1% > 5min |
| OTP-grinding detection-to-event latency | ≤ 5 s | 24 h rolling | 0.1% > 5s |
| MISP feed export daily success | 100% | quarterly | 0 misses |
| Model precision (rolling 7d on confirmed cases) | ≥ 0.92 (AIT), ≥ 0.92 (SIM-box) | 7 d | n/a (binary alarm) |
| HITL case backlog age (oldest PENDING) | ≤ 7 d | 7 d rolling | 5% > 7d |
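The time-based error budgets in the table follow directly from the budget fraction and the window length. A minimal sketch of that arithmetic (the function name `error_budget` is illustrative and not part of the service):

```python
from datetime import timedelta


def error_budget(budget_fraction: float, window: timedelta) -> timedelta:
    """Return how much of the window may violate the SLO."""
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be between 0 and 1")
    return window * budget_fraction


window = timedelta(days=30)

# Score gRPC availability: SLO 99.5% leaves a 0.5% budget.
availability_budget = error_budget(0.005, window)

# Score gRPC P95 latency: 0.1% budget.
latency_budget = error_budget(0.001, window)

print(availability_budget.total_seconds() / 3600)  # budget in hours
print(latency_budget.total_seconds() / 60)         # budget in minutes
```

With a 30-day window this gives 3.6 h for availability and 43.2 min for latency, matching the figures in the table (the latter rounded to 43 min).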
2. Prometheus Metrics
All metrics are exposed at GET /metrics on port 3014 (REST) and port 9091 (worker pods). The Prometheus scrape interval is 15 s; retention is 90 d.
2.1 Score gRPC (hot path)
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_score_grpc_total | Counter | scope, tier, cache (hit/miss/cold) | Score gRPC call count |
| fraud_score_grpc_duration_seconds | Histogram | scope, cache | End-to-end gRPC latency |
| fraud_score_grpc_errors_total | Counter | code (UNAVAILABLE/INVALID/PERMISSION) | Error count |
| fraud_score_cache_hit_ratio | Gauge | — | Rolling 5-min Redis L1 hit ratio |
Histogram buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0].
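The bucket list includes 0.05 as a boundary, which lines up with the 50 ms SLO threshold. Prometheus estimates quantiles by linear interpolation inside the bucket that contains the target rank, so the estimate is only as precise as the bucket width around the threshold. The sketch below reimplements that interpolation in plain Python for illustration; the sample counts are invented, and this is not the service's code.

```python
import math

BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0]


def histogram_quantile(q, upper_bounds, cumulative_counts):
    """Estimate quantile q from cumulative bucket counts.

    upper_bounds excludes +Inf; cumulative_counts carries one extra
    trailing entry for the +Inf bucket (the total observation count).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per bound plus +Inf")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if index == len(upper_bounds):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return upper_bounds[-1]
    lower = upper_bounds[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    return lower + (upper_bounds[index] - lower) * (rank - below) / in_bucket


# Invented example: 1000 calls, cumulative count per bucket, then +Inf.
counts = [0, 400, 700, 900, 940, 980, 1000, 1000, 1000, 1000, 1000]
p95 = histogram_quantile(0.95, BUCKETS, counts)
print(p95 > 0.05)  # True means the P95 estimate exceeds the 50 ms threshold
```

In this example the 950th observation lies in the (0.05, 0.075] bucket, so the estimate is interpolated between 50 ms and 75 ms.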
2.2 Stream ingestion
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_ingestion_lag_seconds | Gauge | source_stream | NATS publish → ClickHouse insert lag |
| fraud_ingestion_rows_total | Counter | source_stream | Successfully ingested rows |
| fraud_ingestion_dlq_total | Counter | source_stream, reject_reason | DLQ-routed rows |
| fraud_ingestion_wal_buffer_bytes | Gauge | — | On-disk WAL buffer size during ClickHouse outages |
2.3 Pipeline metrics (AIT, SIM-box, OTP-harvest, grey-route, cohort, scoring)
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_pipeline_run_duration_seconds | Histogram | pipeline | End-to-end pipeline run duration |
| fraud_pipeline_last_success_timestamp | Gauge | pipeline | Unix ts of last successful run (freshness) |
| fraud_pipeline_failed_total | Counter | pipeline, error_type | Pipeline failures |
| fraud_pipeline_predictions_total | Counter | pipeline, bucket (high/medium/low) | Predictions emitted by confidence bucket |
| fraud_pipeline_lock_contention_total | Counter | pipeline | Distributed-lock skip count |
2.4 Detection emission
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_detections_emitted_total | Counter | category, subject_scope, confidence_tier | Detection event emission |
| fraud_cases_opened_total | Counter | category, suggested_action | Case openings |
| fraud_cases_decided_total | Counter | decision, category | Case decisions |
| fraud_cases_pending_total | Gauge | category | Current PENDING count |
| fraud_cases_age_seconds | Histogram | — | Age at decision time |
| fraud_cases_auto_stale_total | Counter | — | Stale-closed cases |
2.5 Model serving (Triton)
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_model_inference_duration_seconds | Histogram | model_id, model_version | Triton inference latency |
| fraud_model_inference_errors_total | Counter | model_id, error_type | Inference errors |
| fraud_model_load_total | Counter | model_id, version | Model load events |
| fraud_model_active_version | Gauge | model_id, version | Per-model active version (1.0 for active) |
| fraud_model_artifact_sha_mismatch_total | Counter | model_id | Tamper events |
2.6 Model accuracy & drift
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_model_precision_rolling_7d | Gauge | model_id, category | Precision on confirmed-vs-emitted last 7 d |
| fraud_model_recall_rolling_7d | Gauge | model_id, category | Recall on confirmed-vs-missed last 7 d |
| fraud_model_brier_rolling_7d | Gauge | model_id | Calibration |
| fraud_model_psi_per_feature | Gauge | model_id, feature | Population Stability Index per input feature |
| fraud_model_prediction_drift_wasserstein | Gauge | model_id | Wasserstein distance to 7-d baseline |
| fraud_model_fairness_delta | Gauge | model_id, cohort | Per-cohort AUC delta from population AUC |
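For readers unfamiliar with the Population Stability Index, the standard definition over matching histogram bins is PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and current bin fractions. A minimal sketch follows; the binning, the `eps` floor for empty bins, and the function names are assumptions made for illustration, while the 0.25 and 0.50 thresholds are the ones that appear in the drift log event in §3.6.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total == 0 or actual_total == 0:
        raise ValueError("histograms must not be empty")
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each fraction so an empty bin does not produce log(0).
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score


def drift_level(value, medium=0.25, high=0.50):
    """Classify a PSI value against the medium and high thresholds."""
    if value > high:
        return "high"
    if value > medium:
        return "medium"
    return "low"


baseline = [25, 25, 25, 25]   # invented training-time bin counts
current = [10, 20, 30, 40]    # invented serving-time bin counts
print(psi(baseline, current), drift_level(0.31))
```

Identical distributions score 0, and the value grows as mass moves between bins. A PSI of 0.31, as in the §3.6 example, falls between the two thresholds.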
2.7 Feed (MISP / STIX)
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_feed_export_runs_total | Counter | feed_id, status | Daily export runs |
| fraud_feed_export_indicator_count | Gauge | feed_id | Indicators in last export |
| fraud_feed_export_signature_duration_seconds | Histogram | — | HSM signing duration |
| fraud_feed_import_runs_total | Counter | feed_id, signature_valid | Import runs |
| fraud_feed_import_indicator_added_total | Counter | feed_id, type | Indicators added per import |
| fraud_feed_sync_lag_seconds | Gauge | feed_id, direction | Last sync age |
2.8 Tenant scoring
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_tenant_score | Gauge | tenant_id, tier | Per-tenant fraud score |
| fraud_tenant_score_tier_distribution | Gauge | tier | Tenants per tier |
| fraud_tenant_score_recompute_duration_seconds | Histogram | — | Hourly recompute duration |
| fraud_score_recompute_failed_total | Counter | — | Recompute failures |
2.9 OTP-grinding streaming
| Metric | Type | Labels | Description |
|---|---|---|---|
| fraud_otp_grinding_detections_total | Counter | — | Detections emitted |
| fraud_otp_grinding_throttle_active_total | Gauge | — | Active throttle handles |
| fraud_otp_grinding_window_size_avg | Gauge | — | Avg messages in 60s window for tracked MSISDNs |
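One way a value like fraud_otp_grinding_window_size_avg could be derived is a per-key sliding window of event timestamps. The sketch below is an assumption about the approach, not the streaming job's actual implementation: the class name is hypothetical, keys are assumed to be msisdnHash values, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Track per-key event timestamps inside a fixed time window."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def _evict(self, timestamps, now):
        # Drop timestamps that are a full window or more in the past.
        cutoff = now - self.window_seconds
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()

    def record(self, key, ts):
        """Add one event and return the key's current window size."""
        timestamps = self._events[key]
        timestamps.append(ts)
        self._evict(timestamps, ts)
        return len(timestamps)

    def average_window_size(self, now):
        """Mean window size across keys that still have events."""
        sizes = []
        for key in list(self._events):
            timestamps = self._events[key]
            self._evict(timestamps, now)
            if timestamps:
                sizes.append(len(timestamps))
            else:
                del self._events[key]  # stop tracking idle keys
        return sum(sizes) / len(sizes) if sizes else 0.0


counter = SlidingWindowCounter(window_seconds=60.0)
for ts in (0.0, 10.0, 20.0):
    counter.record("hash_a", ts)
size_a = counter.record("hash_a", 65.0)  # the event at t=0 has expired
size_b = counter.record("hash_b", 65.0)
print(size_a, size_b, counter.average_window_size(65.0))
```

Evicting idle keys keeps memory bounded by the number of MSISDNs active in the last window rather than by every MSISDN ever seen.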
3. Structured Log Events
All log output is valid JSON: Pino format for the NestJS service and structlog JSON for the Python workers. The log level is controlled by the LOG_LEVEL environment variable.
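To illustrate the one-JSON-object-per-line shape with `level`, `time`, and `event` keys, here is a minimal sketch using only the Python standard library. It stands in for structlog, which the workers are stated to use, so the formatter class, the logger name, and the `fields` convention are all assumptions made for the example.

```python
import io
import json
import logging
import os
from datetime import datetime, timezone

# Map stdlib level names to the short names used in this section's examples.
_LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "time": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "event": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream, level: str = os.environ.get("LOG_LEVEL", "info")):
    logger = logging.getLogger("fraud-intel-sketch")  # hypothetical name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level.upper())
    return logger


buffer = io.StringIO()
log = build_logger(buffer, level="info")
log.info("fraud.score.served",
         extra={"fields": {"scope": "TENANT", "cache": "hit", "latencyMs": 4}})
log.debug("fraud.score.debug")  # suppressed at info level
line = json.loads(buffer.getvalue())
print(line["event"], line["level"])
```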
3.1 Score evaluation
{
  "level": "info",
  "time": "2026-04-21T10:00:00.123Z",
  "event": "fraud.score.served",
  "scope": "TENANT",
  "subjectId": "tnt_acme",
  "score": 0.21,
  "tier": "WATCH",
  "cache": "hit",
  "latencyMs": 4,
  "callerSpiffe": "spiffe://ghasi/compliance-engine",
  "traceId": "abc123",
  "spanId": "def456"
}
3.2 Detection emission
{
  "level": "warn",
  "event": "fraud.detection.emitted",
  "detectionId": "fd_01H...",
  "category": "AIT",
  "subjectScope": "TENANT",
  "subjectId": "tnt_xyz",
  "score": 0.94,
  "modelId": "ml_ait_xgboost",
  "modelVersion": "2.1.4",
  "shapTop3": [
    { "feature": "dlr_success_rate", "contribution": -0.42 },
    { "feature": "cohort_anomaly_score", "contribution": 0.31 },
    { "feature": "tenant_age_days", "contribution": 0.24 }
  ],
  "windowStart": "2026-04-21T10:00:00Z",
  "windowEnd": "2026-04-21T10:05:00Z",
  "traceId": "..."
}
3.3 Case decision
{
  "level": "info",
  "event": "fraud.case.decided",
  "caseId": "fc_01H...",
  "decision": "CONFIRM_FRAUD",
  "decidedBy": "user_jane",
  "ageSeconds": 8400,
  "actionExecuted": true,
  "actionDispatched": "sender_id.suspend.v1"
}
3.4 Pipeline run
{
  "level": "info",
  "event": "fraud.pipeline.completed",
  "pipeline": "ait",
  "windowStart": "2026-04-21T10:00:00Z",
  "windowEnd": "2026-04-21T10:05:00Z",
  "rowsProcessed": 4127,
  "predictionsHigh": 3,
  "predictionsMedium": 12,
  "predictionsLow": 421,
  "durationMs": 71200,
  "modelVersion": "ait-xgboost-2.1.4"
}
3.5 Model promotion
{
  "level": "warn",
  "event": "fraud.model.promoted",
  "modelId": "ml_ait_xgboost",
  "previousVersion": "2.1.3",
  "newVersion": "2.1.4",
  "promotedBy": "user_ds_alice",
  "approvedBy": "user_compl_bob",
  "shadowDurationHours": 26,
  "shadowAuc": 0.937,
  "activeAuc": 0.932
}
3.6 Drift / anomaly
{
  "level": "warn",
  "event": "fraud.model.drift.detected",
  "modelId": "ml_ait_xgboost",
  "feature": "dlr_success_rate",
  "psi": 0.31,
  "thresholdMedium": 0.25,
  "thresholdHigh": 0.50
}
3.7 Feed signature failure
{
  "level": "error",
  "event": "fraud.alert.feed.signature.invalid",
  "feedId": "ff_regulator_atra",
  "expectedKeyId": "atra-prod-2026-q2",
  "observedKeyId": "atra-prod-2025-q4",
  "reason": "KEY_REVOKED"
}
PII rules: never log raw dst_msisdn / src_msisdn values. Log either the masked form (+CCNNN***) or msisdnHash. Message body content is never stored anywhere, including logs.
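A minimal sketch of the two permitted representations follows. The number of leading digits kept (five, read here as country code plus three national digits) and the use of a keyed HMAC-SHA-256 for msisdnHash are assumptions for illustration; the source does not specify the hash construction or how variable-length country codes are handled.

```python
import hashlib
import hmac

_ASCII_DIGITS = "0123456789"


def _digits(msisdn: str) -> str:
    """Strip '+', spaces, and punctuation, keeping ASCII digits only."""
    return "".join(ch for ch in msisdn if ch in _ASCII_DIGITS)


def mask_msisdn(msisdn: str, keep_digits: int = 5) -> str:
    """Return the +CCNNN*** form: leading digits kept, the rest hidden."""
    return "+" + _digits(msisdn)[:keep_digits] + "***"


def msisdn_hash(msisdn: str, key: bytes) -> str:
    """Keyed hash so equal numbers correlate without exposing the number."""
    return hmac.new(key, _digits(msisdn).encode("ascii"), hashlib.sha256).hexdigest()


raw = "+93 70 123 4567"          # invented example number
key = b"example-only-key"        # hypothetical; a real key would come from a secret store
masked = mask_msisdn(raw)
hashed = msisdn_hash(raw, key)
print(masked, hashed[:12])
```

A keyed hash is chosen over a plain SHA-256 in this sketch because the MSISDN space is small enough to enumerate, which would make an unkeyed digest reversible in practice.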
4. OpenTelemetry Tracing
Parent spans:
- fraud-intel-service.Score (gRPC)
- fraud-intel-service.IngestSignal (NATS consumer)
- fraud-intel-service.AitPipelineRun (5-min cron)
- fraud-intel-service.OtpGrindingStream (per-event)
Child spans (Score):
| Span | Operation | Attributes |
|---|---|---|
| fraud.score.cache.l1 | Redis GET | cache.hit |
| fraud.score.cache.l2 | Postgres SELECT | rows_returned |
| fraud.score.refresh.queue | Redis LPUSH | queue_depth |
Child spans (AIT pipeline):
| Span | Operation | Attributes |
|---|---|---|
| fraud.ait.feature.engineering | ClickHouse INSERT…SELECT | rows_in, rows_out |
| fraud.ait.cohort.join | ClickHouse JOIN | cohorts_joined |
| fraud.ait.inference | Triton ModelInfer | model_id, model_version, batch_size |
| fraud.ait.shap | TreeSHAP compute | n_features |
| fraud.ait.outbox.emit | Postgres INSERT + NATS publish | event_count |
Trace context is propagated via W3C Trace Context (traceparent header on gRPC, Nats-Trace-Parent header on NATS messages).
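The traceparent value defined by W3C Trace Context has four dash-separated lowercase hex fields: version, a 16-byte trace ID, an 8-byte parent span ID, and a flags byte whose lowest bit marks the trace as sampled. In practice the OpenTelemetry SDK handles this; the sketch below only illustrates the format, with hypothetical helper names, and parses the four-field form only.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent_header: str) -> str:
    """Build the header for a child span: same trace ID, new span ID."""
    ctx = parse_traceparent(parent_header)
    if ctx is None:
        raise ValueError("invalid traceparent header")
    span_id = secrets.token_hex(8)  # 8 random bytes as 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
print(parsed["trace_id"], outgoing)
```

The trace ID stays constant across the gRPC call and the NATS message, which is what lets the Score, ingestion, and pipeline spans be joined into one trace.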
Exporter: OTLP gRPC to platform OTel collector → SigNoz backend.
5. Alerting Rules
Prometheus alert YAML (excerpt):
groups:
  - name: fraud-intel
    interval: 30s
    rules:
      # Score gRPC SLO
      - alert: FraudScoreP95High
        expr: histogram_quantile(0.95, sum(rate(fraud_score_grpc_duration_seconds_bucket[5m])) by (le)) > 0.05
        for: 5m
        labels: { severity: warning, service: fraud-intel-service }
        annotations:
          summary: "Score gRPC P95 > 50 ms"
          runbook: "runbooks/fraud-intel/score-p95-high.md"
      - alert: FraudScoreUnavailable
        expr: rate(fraud_score_grpc_errors_total{code="UNAVAILABLE"}[2m]) > 0.5
        for: 2m
        labels: { severity: high }
        annotations: { runbook: "runbooks/fraud-intel/score-unavailable.md" }
      # Pipeline freshness
      - alert: FraudAitPipelineStale
        expr: time() - fraud_pipeline_last_success_timestamp{pipeline="ait"} > 900
        for: 5m
        labels: { severity: high }
        annotations:
          summary: "AIT pipeline last success > 15 min ago"
          runbook: "runbooks/fraud-intel/pipeline-stale.md"
      - alert: FraudIngestionLagHigh
        expr: fraud_ingestion_lag_seconds > 60
        for: 5m
        labels: { severity: warning }
        annotations: { runbook: "runbooks/fraud-intel/ingestion-lag.md" }
      # Model accuracy / drift
      - alert: FraudModelDriftHigh
        expr: max(fraud_model_psi_per_feature) by (model_id) > 0.50
        for: 1h
        labels: { severity: high }
        annotations:
          summary: "Model {{ $labels.model_id }} feature drift PSI > 0.50"
          runbook: "runbooks/fraud-intel/model-drift.md"
      - alert: FraudModelPrecisionLow
        expr: fraud_model_precision_rolling_7d{category="AIT"} < 0.92
        for: 4h
        labels: { severity: high }
        annotations: { runbook: "runbooks/fraud-intel/precision-degraded.md" }
      - alert: FraudModelArtifactTamper
        expr: increase(fraud_model_artifact_sha_mismatch_total[5m]) > 0
        for: 0m
        labels: { severity: critical, page: pagerduty }
        annotations:
          summary: "Model artifact SHA-256 mismatch: supply-chain attack suspected"
          runbook: "runbooks/fraud-intel/artifact-tamper.md"
      # Feed sync
      - alert: FraudFeedSyncStale
        expr: fraud_feed_sync_lag_seconds{direction="EXPORT"} > 86400 * 1.1
        for: 30m
        labels: { severity: medium }
        annotations: { runbook: "runbooks/fraud-intel/feed-export-stale.md" }
      - alert: FraudFeedSignatureInvalid
        expr: increase(fraud_feed_import_runs_total{signature_valid="false"}[5m]) > 0
        for: 0m
        labels: { severity: critical, page: pagerduty }
        annotations:
          summary: "MISP feed import signature failure: possible compromised peer key"
          runbook: "runbooks/fraud-intel/feed-signature-invalid.md"
      # Detection volume
      - alert: FraudScoreSpike
        expr: rate(fraud_detections_emitted_total{category="AIT"}[5m]) > 5 * avg_over_time(rate(fraud_detections_emitted_total{category="AIT"}[5m])[1d:5m])
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "AIT detection rate 5× baseline: possible coordinated campaign or model false-positive spike"
          runbook: "runbooks/fraud-intel/detection-spike.md"
      # HITL backlog
      - alert: FraudCaseBacklogHigh
        expr: fraud_cases_pending_total > 200
        for: 30m
        labels: { severity: warning }
        annotations: { runbook: "runbooks/fraud-intel/case-backlog.md" }
      - alert: FraudCaseAutoStaleSpike
        expr: rate(fraud_cases_auto_stale_total[1h]) > 10
        for: 1h
        labels: { severity: warning }
        annotations: { runbook: "runbooks/fraud-intel/case-auto-stale.md" }
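The detection-volume rule compares the current 5-minute AIT detection rate against five times the mean of the same rate sampled every 5 minutes over the previous day (on the order of 288 samples). The same comparison in plain Python, as a sketch with invented numbers and a hypothetical function name:

```python
def is_detection_spike(current_rate, baseline_rates, multiplier=5.0):
    """True when the current rate exceeds multiplier times the mean baseline."""
    if not baseline_rates:
        # No baseline samples: nothing to compare against, so do not fire.
        return False
    baseline = sum(baseline_rates) / len(baseline_rates)
    return current_rate > multiplier * baseline


# Invented baseline: a steady 0.01 detections/s sampled every 5 min for a day.
day_of_samples = [0.01] * 288

print(is_detection_spike(0.06, day_of_samples))  # above 5x the baseline
print(is_detection_spike(0.04, day_of_samples))  # below 5x the baseline
```

Two properties of this form are worth keeping in mind when tuning it: a baseline of zero makes any non-zero rate count as a spike, and a long-running spike raises its own baseline because the one-day average includes the most recent samples.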
6. Grafana Dashboards
Dashboard JSON sources in dashboards/fraud-intel-service.json. Five canonical dashboards:
6.1 Score gRPC (hot path)
| Panel | Query | Visualisation |
|---|---|---|
| RPS | rate(fraud_score_grpc_total[5m]) by tier | Stacked area |
| P50/P95/P99 latency | histogram_quantile(0.5/0.95/0.99, …) | Multi-line |
| Cache hit ratio | fraud_score_cache_hit_ratio | Gauge + time series |
| Tier distribution | fraud_score_grpc_total by tier (last 24h) | Pie |
| Error breakdown | fraud_score_grpc_errors_total by code | Bar |
6.2 Detection pipelines
| Panel | Query | Visualisation |
|---|---|---|
| Pipeline freshness | time() - fraud_pipeline_last_success_timestamp per pipeline | Multi-stat |
| Pipeline duration | fraud_pipeline_run_duration_seconds per pipeline | Multi-line |
| Detection rate (24h) | rate(fraud_detections_emitted_total[5m]) by category, confidence_tier | Stacked area |
| Cases opened vs decided | fraud_cases_opened_total vs fraud_cases_decided_total | Time series (dual line) |
| Case backlog by category | fraud_cases_pending_total by category | Stacked bar |
6.3 Model serving
| Panel | Query | Visualisation |
|---|---|---|
| Triton inference latency P95 | histogram_quantile(0.95, …) on fraud_model_inference_duration_seconds by model_id | Multi-line |
| Active model versions | fraud_model_active_version | Table |
| Model load events | fraud_model_load_total | Time series |
| Artifact tamper count | fraud_model_artifact_sha_mismatch_total | Stat (alarm-coloured) |
6.4 Model accuracy & drift
| Panel | Query | Visualisation |
|---|---|---|
| Precision/Recall rolling 7d | fraud_model_precision_rolling_7d, fraud_model_recall_rolling_7d by category | Dual gauge + line |
| PSI per feature heatmap | fraud_model_psi_per_feature | Heatmap |
| Wasserstein drift per model | fraud_model_prediction_drift_wasserstein | Time series |
| Per-cohort fairness delta | fraud_model_fairness_delta by cohort | Bar (with thresholds) |
6.5 Feed (MISP / STIX)
| Panel | Query | Visualisation |
|---|---|---|
| Last export age | fraud_feed_sync_lag_seconds{direction="EXPORT"} | Stat |
| Indicator counts per feed | fraud_feed_export_indicator_count by feed_id | Bar |
| Import success rate | success / total | Gauge |
| Signature failure timeline | fraud_feed_import_runs_total{signature_valid="false"} | Annotated time series |
7. Runbook References
Every alert in §5 links to a runbook under runbooks/fraud-intel/:
- score-p95-high.md: investigate Redis L1 hit ratio, Triton latency, network
- score-unavailable.md: pod readiness, Postgres availability, mTLS cert validity
- pipeline-stale.md: Airflow DAG status, distributed-lock contention, ClickHouse availability
- ingestion-lag.md: NATS consumer lag, ClickHouse insert latency, WAL buffer
- model-drift.md: feature population shift, training freshness, retrain trigger
- precision-degraded.md: recent confirmed/dismissed analysis, label noise check, model rollback decision
- artifact-tamper.md: CRITICAL; stop the affected pipeline, isolate the artifact, page Security
- feed-export-stale.md: HSM availability, MinIO availability, SFTP destination health
- feed-signature-invalid.md: peer-key rotation status, KMS revocation list, contact peer SOC
- detection-spike.md: distinguish real campaign from false-positive spike via case sampling
- case-backlog.md: analyst capacity check, bulk-action consideration
- case-auto-stale.md: analyst headcount review, queue prioritisation