Fraud Intelligence Service — Observability

Version: 1.0
Status: Draft
Owner: Trust and Safety + Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · FAILURE_MODES · SERVICE_READINESS

1. Service Level Objectives (SLOs)

These SLOs are bound to the platform NFR catalog (EP-PLAT-NB-09):

SLI	SLO	Window	Error budget
`Score` gRPC P95 latency	≤ 50 ms	30 d rolling	0.1% (43 min)
`Score` gRPC availability	≥ 99.5%	30 d rolling	3.6 h
Stream ingestion lag (NATS publish → ClickHouse insert) P95	≤ 30 s	30 d rolling	1% > 30s
AIT pipeline freshness (last successful run)	≤ 15 min	30 d rolling	0.5% > 15min
Detection-to-event latency (high-confidence)	≤ 5 min	30 d rolling	1% > 5min
OTP-grinding detection-to-event latency	≤ 5 s	24 h rolling	0.1% > 5s
MISP feed export daily success	100%	quarterly	0 misses
Model precision (rolling 7d on confirmed cases)	≥ 0.92 (AIT), ≥ 0.92 (SIM-box)	7 d	n/a (binary alarm)
HITL case backlog age (oldest PENDING)	≤ 7 d	7 d rolling	5% > 7d
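The time-based error budgets in the table follow from the SLO target and the window length. A minimal sketch of that arithmetic is shown here; the helper name `error_budget` is illustrative and not part of the service.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the time within `window` that may violate a ratio SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


window = timedelta(days=30)
availability_budget = error_budget(0.995, window)  # Score gRPC availability >= 99.5%
latency_budget = error_budget(0.999, window)       # 0.1% budget on the P95 latency SLI

print(f"availability: {availability_budget.total_seconds() / 3600:.1f} h")
print(f"latency:      {latency_budget.total_seconds() / 60:.1f} min")
```

For a 30 d window this gives 3.6 h for the availability SLO and 43.2 min for the 0.1% latency budget, which the table rounds to 43 min.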

2. Prometheus Metrics

All metrics are exposed at GET /metrics on port 3014 (REST) and port 9091 (worker pods). Prometheus scrapes every 15 s and retains data for 90 d.

2.1 Score gRPC (hot path)

Metric	Type	Labels	Description
`fraud_score_grpc_total`	Counter	`scope`, `tier`, `cache` (hit/miss/cold)	Score gRPC call count
`fraud_score_grpc_duration_seconds`	Histogram	`scope`, `cache`	End-to-end gRPC latency
`fraud_score_grpc_errors_total`	Counter	`code` (UNAVAILABLE/INVALID/PERMISSION)	Error count
`fraud_score_cache_hit_ratio`	Gauge	—	Rolling 5-min Redis L1 hit ratio

Histogram buckets (seconds): [0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0].
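Because 0.05 s is itself a bucket boundary, the share of calls at or under the 50 ms threshold can be read directly from the cumulative bucket counts. P95 itself is estimated by linear interpolation inside the bucket that contains the target rank. The sketch below approximates that estimate; the sample counts are invented for illustration, and `estimate_quantile` is a simplified stand-in rather than Prometheus's exact implementation.

```python
import math

BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate a quantile from cumulative bucket counts.

    `cumulative_counts` holds one entry per upper bound plus a final +Inf entry.
    """
    bounds = list(upper_bounds) + [math.inf]
    if len(bounds) != len(cumulative_counts):
        raise ValueError("expected one count per bucket plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if math.isinf(bounds[index]):
        return bounds[index - 1]  # fall back to the highest finite bound
    lower = bounds[index - 1] if index > 0 else 0.0
    previous = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - previous
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[index] - lower) * (rank - previous) / in_bucket


# Hypothetical cumulative counts for 1000 Score calls (last entry is +Inf).
counts = [100, 400, 700, 900, 960, 985, 995, 999, 1000, 1000, 1000]
p95 = estimate_quantile(0.95, BUCKETS, counts)
within_slo = counts[BUCKETS.index(0.05)] / counts[-1]
print(f"P95 ~ {p95 * 1000:.1f} ms, share <= 50 ms: {within_slo:.1%}")
```

The interpolated value is only as precise as the bucket it falls in, which is one reason to keep a boundary exactly at the SLO threshold.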

2.2 Stream ingestion

Metric	Type	Labels	Description
`fraud_ingestion_lag_seconds`	Gauge	`source_stream`	NATS publish → ClickHouse insert lag
`fraud_ingestion_rows_total`	Counter	`source_stream`	Successfully ingested rows
`fraud_ingestion_dlq_total`	Counter	`source_stream`, `reject_reason`	DLQ-routed rows
`fraud_ingestion_wal_buffer_bytes`	Gauge	—	On-disk WAL buffer size during ClickHouse outages

2.3 Pipeline metrics (AIT, SIM-box, OTP-harvest, grey-route, cohort, scoring)

Metric	Type	Labels	Description
`fraud_pipeline_run_duration_seconds`	Histogram	`pipeline`	End-to-end pipeline run duration
`fraud_pipeline_last_success_timestamp`	Gauge	`pipeline`	Unix ts of last successful run (freshness)
`fraud_pipeline_failed_total`	Counter	`pipeline`, `error_type`	Pipeline failures
`fraud_pipeline_predictions_total`	Counter	`pipeline`, `bucket` (high/medium/low)	Predictions emitted by confidence bucket
`fraud_pipeline_lock_contention_total`	Counter	`pipeline`	Distributed-lock skip count

2.4 Detection emission

Metric	Type	Labels	Description
`fraud_detections_emitted_total`	Counter	`category`, `subject_scope`, `confidence_tier`	Detection event emission
`fraud_cases_opened_total`	Counter	`category`, `suggested_action`	Case openings
`fraud_cases_decided_total`	Counter	`decision`, `category`	Case decisions
`fraud_cases_pending_total`	Gauge	`category`	Current PENDING count
`fraud_cases_age_seconds`	Histogram	—	Age at decision time
`fraud_cases_auto_stale_total`	Counter	—	Stale-closed cases

2.5 Model serving (Triton)

Metric	Type	Labels	Description
`fraud_model_inference_duration_seconds`	Histogram	`model_id`, `model_version`	Triton inference latency
`fraud_model_inference_errors_total`	Counter	`model_id`, `error_type`	Inference errors
`fraud_model_load_total`	Counter	`model_id`, `version`	Model load events
`fraud_model_active_version`	Gauge	`model_id`, `version`	Per-model active version (1.0 for active)
`fraud_model_artifact_sha_mismatch_total`	Counter	`model_id`	Tamper events

2.6 Model accuracy & drift

Metric	Type	Labels	Description
`fraud_model_precision_rolling_7d`	Gauge	`model_id`, `category`	Precision on confirmed-vs-emitted last 7 d
`fraud_model_recall_rolling_7d`	Gauge	`model_id`, `category`	Recall on confirmed-vs-missed last 7 d
`fraud_model_brier_rolling_7d`	Gauge	`model_id`	Calibration
`fraud_model_psi_per_feature`	Gauge	`model_id`, `feature`	Population Stability Index per input feature
`fraud_model_prediction_drift_wasserstein`	Gauge	`model_id`	Wasserstein distance to 7-d baseline
`fraud_model_fairness_delta`	Gauge	`model_id`, `cohort`	Per-cohort AUC delta from population AUC
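The Population Stability Index is commonly defined over matching bins as the sum of (actual − expected) × ln(actual / expected), where both terms are bin fractions. The sketch below assumes that definition. This document does not specify how the service bins each feature, so the bin layout, the `eps` floor for empty bins, and the function name are all assumptions.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a baseline and a current histogram over identical bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total == 0 or actual_total == 0:
        raise ValueError("histograms must not be empty")
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each fraction so an empty bin does not produce log(0).
        e = max(expected / expected_total, eps)
        a = max(actual / actual_total, eps)
        psi += (a - e) * math.log(a / e)
    return psi


baseline = [250, 250, 250, 250]  # hypothetical training-time bin counts
current = [100, 200, 300, 400]   # hypothetical counts from the last window
print(round(population_stability_index(baseline, current), 3))
```

Identical distributions give a PSI of 0, and the value grows as mass moves between bins.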

2.7 Feed (MISP / STIX)

Metric	Type	Labels	Description
`fraud_feed_export_runs_total`	Counter	`feed_id`, `status`	Daily export runs
`fraud_feed_export_indicator_count`	Gauge	`feed_id`	Indicators in last export
`fraud_feed_export_signature_duration_seconds`	Histogram	—	HSM signing duration
`fraud_feed_import_runs_total`	Counter	`feed_id`, `signature_valid`	Import runs
`fraud_feed_import_indicator_added_total`	Counter	`feed_id`, `type`	Indicators added per import
`fraud_feed_sync_lag_seconds`	Gauge	`feed_id`, `direction`	Last sync age

2.8 Tenant scoring

Metric	Type	Labels	Description
`fraud_tenant_score`	Gauge	`tenant_id`, `tier`	Per-tenant fraud score
`fraud_tenant_score_tier_distribution`	Gauge	`tier`	Tenants per tier
`fraud_tenant_score_recompute_duration_seconds`	Histogram	—	Hourly recompute duration
`fraud_score_recompute_failed_total`	Counter	—	Recompute failures

2.9 OTP-grinding streaming

Metric	Type	Labels	Description
`fraud_otp_grinding_detections_total`	Counter	—	Detections emitted
`fraud_otp_grinding_throttle_active_total`	Gauge	—	Active throttle handles
`fraud_otp_grinding_window_size_avg`	Gauge	—	Avg messages in 60s window for tracked MSISDNs
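One way to picture the 60 s window behind `fraud_otp_grinding_window_size_avg` is a per-subject queue of event timestamps that evicts entries as they age out. The sketch below is illustrative only: the class name, the hashed MSISDN key, and the assumption that timestamps arrive in non-decreasing order are not taken from the service.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count events per key over a trailing time window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def _evict(self, queue: deque, now: float) -> None:
        cutoff = now - self.window_seconds
        while queue and queue[0] <= cutoff:
            queue.popleft()

    def record(self, key: str, timestamp: float) -> int:
        """Record one event and return the key's current window size."""
        queue = self._events[key]
        queue.append(timestamp)
        self._evict(queue, timestamp)
        return len(queue)

    def average_window_size(self, now: float) -> float:
        """Average window size across keys that still have events in the window."""
        sizes = []
        for key in list(self._events):
            queue = self._events[key]
            self._evict(queue, now)
            if queue:
                sizes.append(len(queue))
            else:
                del self._events[key]  # stop tracking idle keys
        return sum(sizes) / len(sizes) if sizes else 0.0


counter = SlidingWindowCounter(window_seconds=60.0)
for ts in (0.0, 10.0, 59.0, 61.0):
    size = counter.record("hash_a", ts)
counter.record("hash_b", 61.0)
print(size, counter.average_window_size(now=61.0))
```

Keying on a hash rather than the raw number keeps the window state consistent with the PII rules in section 3.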

3. Structured Log Events

All log output is valid JSON: Pino format for the NestJS service and structlog JSON for the Python workers. The log level is controlled by the LOG_LEVEL environment variable.

3.1 Score evaluation

{
  "level": "info",
  "time": "2026-04-21T10:00:00.123Z",
  "event": "fraud.score.served",
  "scope": "TENANT",
  "subjectId": "tnt_acme",
  "score": 0.21,
  "tier": "WATCH",
  "cache": "hit",
  "latencyMs": 4,
  "callerSpiffe": "spiffe://ghasi/compliance-engine",
  "traceId": "abc123",
  "spanId": "def456"
}

3.2 Detection emission

{
  "level": "warn",
  "event": "fraud.detection.emitted",
  "detectionId": "fd_01H...",
  "category": "AIT",
  "subjectScope": "TENANT",
  "subjectId": "tnt_xyz",
  "score": 0.94,
  "modelId": "ml_ait_xgboost",
  "modelVersion": "2.1.4",
  "shapTop3": [
    { "feature": "dlr_success_rate", "contribution": -0.42 },
    { "feature": "cohort_anomaly_score", "contribution": 0.31 },
    { "feature": "tenant_age_days", "contribution": 0.24 }
  ],
  "windowStart": "2026-04-21T10:00:00Z",
  "windowEnd":   "2026-04-21T10:05:00Z",
  "traceId": "..."
}

3.3 Case decision

{
  "level": "info",
  "event": "fraud.case.decided",
  "caseId": "fc_01H...",
  "decision": "CONFIRM_FRAUD",
  "decidedBy": "user_jane",
  "ageSeconds": 8400,
  "actionExecuted": true,
  "actionDispatched": "sender_id.suspend.v1"
}

3.4 Pipeline run

{
  "level": "info",
  "event": "fraud.pipeline.completed",
  "pipeline": "ait",
  "windowStart": "2026-04-21T10:00:00Z",
  "windowEnd":   "2026-04-21T10:05:00Z",
  "rowsProcessed": 4127,
  "predictionsHigh": 3,
  "predictionsMedium": 12,
  "predictionsLow": 421,
  "durationMs": 71200,
  "modelVersion": "ait-xgboost-2.1.4"
}
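As a rough illustration of how a Python worker could assemble this event, the sketch below builds the same field set and serialises it with the standard library. It deliberately uses `json` instead of structlog to stay self-contained, and the helper names `iso_utc` and `pipeline_completed_event` are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def iso_utc(ts: datetime) -> str:
    """Format a timezone-aware datetime as second-precision UTC with a Z suffix."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def pipeline_completed_event(pipeline, window_start, window_end,
                             rows_processed, predictions, duration_ms,
                             model_version):
    """Build the fraud.pipeline.completed payload with the fields shown above."""
    return {
        "level": "info",
        "event": "fraud.pipeline.completed",
        "pipeline": pipeline,
        "windowStart": iso_utc(window_start),
        "windowEnd": iso_utc(window_end),
        "rowsProcessed": rows_processed,
        "predictionsHigh": predictions.get("high", 0),
        "predictionsMedium": predictions.get("medium", 0),
        "predictionsLow": predictions.get("low", 0),
        "durationMs": duration_ms,
        "modelVersion": model_version,
    }


start = datetime(2026, 4, 21, 10, 0, tzinfo=timezone.utc)
line = json.dumps(pipeline_completed_event(
    "ait", start, start + timedelta(minutes=5), 4127,
    {"high": 3, "medium": 12, "low": 421}, 71200, "ait-xgboost-2.1.4"))
print(line)
```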

3.5 Model promotion

{
  "level": "warn",
  "event": "fraud.model.promoted",
  "modelId": "ml_ait_xgboost",
  "previousVersion": "2.1.3",
  "newVersion": "2.1.4",
  "promotedBy": "user_ds_alice",
  "approvedBy": "user_compl_bob",
  "shadowDurationHours": 26,
  "shadowAuc": 0.937,
  "activeAuc": 0.932
}

3.6 Drift / anomaly

{
  "level": "warn",
  "event": "fraud.model.drift.detected",
  "modelId": "ml_ait_xgboost",
  "feature": "dlr_success_rate",
  "psi": 0.31,
  "thresholdMedium": 0.25,
  "thresholdHigh": 0.50
}

3.7 Feed signature failure

{
  "level": "error",
  "event": "fraud.alert.feed.signature.invalid",
  "feedId": "ff_regulator_atra",
  "expectedKeyId": "atra-prod-2026-q2",
  "observedKeyId": "atra-prod-2025-q4",
  "reason": "KEY_REVOKED"
}

PII rules: never log raw dst_msisdn / src_msisdn; log the masked form +CCNNN*** or msisdnHash instead. Message body content is never stored anywhere, including logs.
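A minimal sketch of the masking rule follows. It assumes that +CCNNN*** means "keep the first five digits and replace the rest with a fixed `***`", and that msisdnHash is a keyed SHA-256 digest. Neither detail is specified in this document, so treat the function names, the `visible_digits` parameter, and the HMAC choice as assumptions.

```python
import hashlib
import hmac

_ASCII_DIGITS = "0123456789"


def _digits_only(msisdn: str) -> str:
    """Strip '+', spaces and punctuation, keeping ASCII digits only."""
    return "".join(ch for ch in msisdn if ch in _ASCII_DIGITS)


def mask_msisdn(msisdn: str, visible_digits: int = 5) -> str:
    """Return a log-safe form such as +93701*** (fixed-width mask)."""
    digits = _digits_only(msisdn)
    if len(digits) <= visible_digits:
        return "+***"  # too short to reveal any prefix safely
    return "+" + digits[:visible_digits] + "***"


def msisdn_hash(msisdn: str, key: bytes) -> str:
    """Keyed digest so equal numbers correlate without exposing the number."""
    digits = _digits_only(msisdn)
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()


print(mask_msisdn("+93 70 123 4567"))
print(msisdn_hash("+93701234567", key=b"example-only-key")[:16])
```

A fixed-width mask avoids leaking the number length, and a keyed digest is harder to reverse by enumerating the number space than a plain hash.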

4. OpenTelemetry Tracing

Parent spans:

fraud-intel-service.Score (gRPC)
fraud-intel-service.IngestSignal (NATS consumer)
fraud-intel-service.AitPipelineRun (5-min cron)
fraud-intel-service.OtpGrindingStream (per-event)

Child spans (Score):

Span	Operation	Attributes
`fraud.score.cache.l1`	Redis GET	`cache.hit`
`fraud.score.cache.l2`	Postgres SELECT	`rows_returned`
`fraud.score.refresh.queue`	Redis LPUSH	`queue_depth`

Child spans (AIT pipeline):

Span	Operation	Attributes
`fraud.ait.feature.engineering`	ClickHouse INSERT…SELECT	`rows_in`, `rows_out`
`fraud.ait.cohort.join`	ClickHouse JOIN	`cohorts_joined`
`fraud.ait.inference`	Triton ModelInfer	`model_id`, `model_version`, `batch_size`
`fraud.ait.shap`	TreeSHAP compute	`n_features`
`fraud.ait.outbox.emit`	Postgres INSERT + NATS publish	`event_count`

Trace context is propagated via W3C Trace Context: the traceparent header on gRPC and the Nats-Trace-Parent header on NATS messages.
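A version-00 traceparent value has four dash-separated lowercase hex fields: version, 32-character trace ID, 16-character parent span ID, and 2-character flags. In practice an OpenTelemetry propagator handles this. The sketch below only illustrates the value that would travel in either header, and its helper names are hypothetical.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(value: str):
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and sampling flag, but mint a new span ID."""
    context = parse_traceparent(parent)
    if context is None:
        return new_traceparent()  # restart the trace on a bad header
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
nats_headers = {"Nats-Trace-Parent": child_traceparent(incoming)}
print(nats_headers)
```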

Exporter: OTLP gRPC to platform OTel collector → SigNoz backend.

5. Alerting Rules

Prometheus alert YAML (excerpt):

groups:
- name: fraud-intel
  interval: 30s
  rules:

  # Score gRPC SLO
  - alert: FraudScoreP95High
    expr: histogram_quantile(0.95, sum(rate(fraud_score_grpc_duration_seconds_bucket[5m])) by (le)) > 0.05
    for: 5m
    labels: { severity: warning, service: fraud-intel-service }
    annotations:
      summary: "Score gRPC P95 > 50 ms"
      runbook: "runbooks/fraud-intel/score-p95-high.md"

  - alert: FraudScoreUnavailable
    expr: rate(fraud_score_grpc_errors_total{code="UNAVAILABLE"}[2m]) > 0.5
    for: 2m
    labels: { severity: high }
    annotations: { runbook: "runbooks/fraud-intel/score-unavailable.md" }

  # Pipeline freshness
  - alert: FraudAitPipelineStale
    expr: time() - fraud_pipeline_last_success_timestamp{pipeline="ait"} > 900
    for: 5m
    labels: { severity: high }
    annotations:
      summary: "AIT pipeline last success > 15 min ago"
      runbook: "runbooks/fraud-intel/pipeline-stale.md"

  - alert: FraudIngestionLagHigh
    expr: fraud_ingestion_lag_seconds > 60
    for: 5m
    labels: { severity: warning }
    annotations: { runbook: "runbooks/fraud-intel/ingestion-lag.md" }

  # Model accuracy / drift
  - alert: FraudModelDriftHigh
    expr: max(fraud_model_psi_per_feature) by (model_id) > 0.50
    for: 1h
    labels: { severity: high }
    annotations:
      summary: "Model {{ $labels.model_id }} feature drift PSI > 0.50"
      runbook: "runbooks/fraud-intel/model-drift.md"

  - alert: FraudModelPrecisionLow
    expr: fraud_model_precision_rolling_7d{category="AIT"} < 0.92
    for: 4h
    labels: { severity: high }
    annotations: { runbook: "runbooks/fraud-intel/precision-degraded.md" }

  - alert: FraudModelArtifactTamper
    expr: increase(fraud_model_artifact_sha_mismatch_total[5m]) > 0
    for: 0m
    labels: { severity: critical, page: pagerduty }
    annotations:
      summary: "Model artifact SHA-256 mismatch — supply-chain attack suspected"
      runbook: "runbooks/fraud-intel/artifact-tamper.md"

  # Feed sync
  - alert: FraudFeedSyncStale
    expr: fraud_feed_sync_lag_seconds{direction="EXPORT"} > 86400 * 1.1
    for: 30m
    labels: { severity: medium }
    annotations: { runbook: "runbooks/fraud-intel/feed-export-stale.md" }

  - alert: FraudFeedSignatureInvalid
    expr: increase(fraud_feed_import_runs_total{signature_valid="false"}[5m]) > 0
    for: 0m
    labels: { severity: critical, page: pagerduty }
    annotations:
      summary: "MISP feed import signature failure — possible compromised peer key"
      runbook: "runbooks/fraud-intel/feed-signature-invalid.md"

  # Detection volume
  - alert: FraudScoreSpike
    expr: rate(fraud_detections_emitted_total{category="AIT"}[5m]) > 5 * avg_over_time(rate(fraud_detections_emitted_total{category="AIT"}[5m])[1d:5m])
    for: 10m
    labels: { severity: warning }
    annotations:
      summary: "AIT detection rate 5× baseline — possible coordinated campaign or model false-positive spike"
      runbook: "runbooks/fraud-intel/detection-spike.md"

  # HITL backlog
  - alert: FraudCaseBacklogHigh
    expr: fraud_cases_pending_total > 200
    for: 30m
    labels: { severity: warning }
    annotations: { runbook: "runbooks/fraud-intel/case-backlog.md" }

  - alert: FraudCaseAutoStaleSpike
    expr: increase(fraud_cases_auto_stale_total[1h]) > 10
    for: 1h
    labels: { severity: warning }
    annotations: { runbook: "runbooks/fraud-intel/case-auto-stale.md" }
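The FraudScoreSpike expression compares the current 5-minute detection rate with five times the average of the same rate sampled every 5 minutes over the previous day. A minimal sketch of that comparison, using invented sample values and a hypothetical function name:

```python
from statistics import fmean


def is_detection_spike(current_rate, baseline_rates, factor=5.0):
    """True when the current rate exceeds `factor` times the baseline mean."""
    if not baseline_rates:
        return False  # no baseline yet, so nothing to compare against
    return current_rate > factor * fmean(baseline_rates)


# 288 five-minute samples cover one day; the values are illustrative only.
baseline = [0.01] * 288
print(is_detection_spike(0.06, baseline))
print(is_detection_spike(0.04, baseline))
```

In the PromQL form the one-day subquery also covers the most recent samples, so a sustained spike gradually raises its own baseline and the alert can clear without the rate dropping.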

6. Grafana Dashboards

Dashboard JSON sources live in dashboards/fraud-intel-service.json. There are five canonical dashboards:

6.1 Score gRPC (hot path)

Panel	Query	Visualisation
RPS	`rate(fraud_score_grpc_total[5m])` by `tier`	Stacked area
P50/P95/P99 latency	`histogram_quantile(0.5/0.95/0.99, …)`	Multi-line
Cache hit ratio	`fraud_score_cache_hit_ratio`	Gauge + time series
Tier distribution	`fraud_score_grpc_total` by `tier` (last 24h)	Pie
Error breakdown	`fraud_score_grpc_errors_total` by `code`	Bar

6.2 Detection pipelines

Panel	Query	Visualisation
Pipeline freshness	`time() - fraud_pipeline_last_success_timestamp` per pipeline	Multi-stat
Pipeline duration	`fraud_pipeline_run_duration_seconds` per pipeline	Multi-line
Detection rate (24h)	`rate(fraud_detections_emitted_total[5m])` by `category, confidence_tier`	Stacked area
Cases opened vs decided	`fraud_cases_opened_total` vs `fraud_cases_decided_total` (dual line)	Time series
Case backlog by category	`fraud_cases_pending_total` by `category`	Stacked bar

6.3 Model serving

Panel	Query	Visualisation
Triton inference latency P95	`fraud_model_inference_duration_seconds` P95 by `model_id`	Multi-line
Active model versions	`fraud_model_active_version`	Table
Model load events	`fraud_model_load_total`	Time series
Artifact tamper count	`fraud_model_artifact_sha_mismatch_total`	Stat (alarm-coloured)

6.4 Model accuracy & drift

Panel	Query	Visualisation
Precision/Recall rolling 7d per category	`fraud_model_precision_rolling_7d`, `fraud_model_recall_rolling_7d` by `category`	Dual gauge + line
PSI per feature heatmap	`fraud_model_psi_per_feature`	Heatmap
Wasserstein drift per model	`fraud_model_prediction_drift_wasserstein`	Time series
Per-cohort fairness delta	`fraud_model_fairness_delta` by `cohort`	Bar (with thresholds)

6.5 Feed (MISP / STIX)

Panel	Query	Visualisation
Last export age	`fraud_feed_sync_lag_seconds{direction="EXPORT"}`	Stat
Indicator counts per feed	`fraud_feed_export_indicator_count` by `feed_id`	Bar
Import success rate	success / total	Gauge
Signature failure timeline	`fraud_feed_import_runs_total{signature_valid="false"}`	Annotated time series

7. Runbook References

Every alert in §5 links to a runbook under runbooks/fraud-intel/:

score-p95-high.md — investigate Redis L1 hit ratio, Triton latency, network
score-unavailable.md — pod readiness, Postgres availability, mTLS cert validity
pipeline-stale.md — Airflow DAG status, distributed-lock contention, ClickHouse availability
ingestion-lag.md — NATS consumer lag, ClickHouse insert latency, WAL buffer
model-drift.md — feature population shift, training freshness, retrain trigger
precision-degraded.md — recent confirmed/dismissed analysis, label noise check, model rollback decision
artifact-tamper.md — CRITICAL: stop the affected pipeline, isolate the artifact, page Security
feed-export-stale.md — HSM availability, MinIO availability, SFTP destination health
feed-signature-invalid.md — peer-key rotation status, KMS revocation list, contact peer SOC
detection-spike.md — distinguish real campaign from false-positive spike via case sampling
case-backlog.md — analyst capacity check, bulk-action consideration
case-auto-stale.md — analyst headcount review, queue prioritisation
