
fraud-intel-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust and Safety + ML Ops + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md

This document catalogs how fraud-intel-service fails and the designed response. Unlike sms-firewall-service, this service is fail-open by design — it is informational, not a blocker. Its failures degrade detection quality but do not stop SMS delivery directly. Downstream consumers (firewall, sender-id-registry, compliance-engine) decide how to treat stale fraud signals.


1. Operating Principle: Fail-Open (informational)

Fraud-intel is a signal producer. If unavailable:

  • Firewall falls back to rule-based patterns.
  • Sender-id-registry reputation scores freeze at last-known.
  • Compliance-engine uses last-known tenant risk-tier.

This is explicit: fraud-intel must not block legitimate traffic when it's down. If the platform loses fraud detection, the right response is heightened vigilance (more human review, stricter rule-based fallback), not outage.
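
A minimal sketch of what fail-open can look like on the consumer side. The class and callable names (`FailOpenClient`, `fetch`, `rule_based`) are hypothetical and not taken from any of the consuming services; the sketch blends the two fallbacks listed above (last-known value, then rule-based) purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    score: float
    source: str  # "fraud-intel", "last-known", or "rule-based"


class FailOpenClient:
    """Wraps a fraud-intel lookup so that a failure never blocks the caller."""

    def __init__(self, fetch: Callable[[str], float], rule_based: Callable[[str], float]):
        self._fetch = fetch            # hypothetical call into fraud-intel
        self._rule_based = rule_based  # hypothetical local rule-based scorer
        self._last_known: Dict[str, float] = {}

    def score(self, key: str) -> Verdict:
        try:
            value = self._fetch(key)
        except Exception:
            # Fail open: prefer the frozen last-known value, else rule-based.
            if key in self._last_known:
                return Verdict(self._last_known[key], "last-known")
            return Verdict(self._rule_based(key), "rule-based")
        self._last_known[key] = value
        return Verdict(value, "fraud-intel")
```

The `source` field lets the caller record in its audit trail which path produced the verdict.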


2. Failure Mode Summary

| # | Name | Class | Detection time | Impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | ML model (Triton) unavailable | Dependency | < 10 s | Fall back to rule-based; reduced recall | runbooks/fraud-ml-out.md |
| FM-02 | Score gRPC latency spike (> 100 ms P95) | Performance | 5 min | Callers time out; fallback engages | runbooks/fraud-score-latency.md |
| FM-03 | Postgres unavailable (signal write-path) | Infra | < 30 s | Signal ingest halted; NATS queues up | runbooks/fraud-postgres-out.md |
| FM-04 | NATS signal-ingest lag > 5 min | Infra | 5 min | ML features stale; late fraud detection | runbooks/fraud-nats-lag.md |
| FM-05 | MISP feed source unreachable | Dependency | 30 min | No external IOC updates | runbooks/fraud-feed-sync-stale.md |
| FM-06 | ML model drift (F1 < baseline - 5%) | ML quality | 1 d | Silent detection degradation | runbooks/fraud-model-drift.md |
| FM-07 | Feedback-loop poisoning (bad labels) | Security | hours to days | Model learns wrong patterns | runbooks/fraud-label-poisoning.md |
| FM-08 | Training job fails | Ops | 15 min | Model not refreshed this cycle | runbooks/fraud-training-fail.md |
| FM-09 | Model registry corruption | Ops | < 5 min | Deployment fails; old model continues | runbooks/fraud-registry-corrupt.md |
| FM-10 | Feature store unavailable | Dependency | < 10 s | Cold-start features recomputed from raw → latency up | runbooks/fraud-feature-store-out.md |
| FM-11 | Signal storm (adversarial flood) | Adversarial | 2 min | Noise in dataset; possible bias toward storm pattern | runbooks/fraud-signal-storm.md |
| FM-12 | Redis unavailable (score cache) | Infra | < 10 s | Every Score call hits Postgres; latency up | runbooks/fraud-redis-out.md |

3. Detailed Failure Modes

FM-01 — Triton ML serving unavailable

Scenario. Triton Inference Server pod crash, GPU fault, or OOM.

Impact. Score returns a rule-based verdict only. Detection recall drops by roughly 20–30% for AIT, 25% for SIM-box, and 15% for OTP-harvest.

Detection. fraud_ml_inference_errors_total; circuit breaker opens after 3 consecutive errors. /health/ready on the fraud-intel pod marks ML as degraded.

Mitigation.

  1. Circuit breaker moves to half-open after 30 s and probes Triton for model availability before closing.
  2. Rule-based fallback inline — no service outage.
  3. Triton deployed in HA (3 replicas across AZs).
  4. Alert FraudMlUnavailable fires; ML Ops triages within 30 min.

Recovery. Triton recovers → circuit closes → ML inference resumes.
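
The breaker behaviour described above (open after 3 consecutive errors, half-open after 30 s, rule-based fallback inline) can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the class name, the `primary` / `fallback` callables, and the injectable `clock` are all hypothetical.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, half_open_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.half_open_after_s = half_open_after_s
        self._clock = clock              # injectable so the sketch is testable
        self._consecutive_failures = 0
        self._opened_at = None           # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.half_open_after_s:
            return "half-open"
        return "open"

    def call(self, primary, fallback):
        """Run primary (ML inference) unless open; always answer via fallback on failure."""
        if self.state == "open":
            return fallback()
        try:
            result = primary()
        except Exception:
            self._consecutive_failures += 1
            # A failed half-open probe, or the Nth consecutive error, (re)opens the circuit.
            if self.state == "half-open" or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

Because the fallback is returned from the same call, the caller never sees an error while Triton is down; it only sees a rule-based verdict.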


FM-02 — Score latency spike

Scenario. Model inference, DB query, or feature computation regression pushes P95 from 50 ms to > 200 ms.

Impact. Synchronous callers (firewall, compliance) time out against their 100 ms budget, and their fallback engages.

Detection. Histogram fraud_score_seconds P95 > 100 ms for 5 min.

Mitigation.

  1. Budget enforcement inside Score: if inference exceeds 80 ms, return fast-path rule-based.
  2. Horizontal scale-out on RPS.
  3. Automatic canary rollback on P95 > 150 ms for 10 min.
  4. Alert FraudScoreLatencyHigh.

Recovery. Rollback or scale resolves; post-mortem within 48 h.
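
One way to express the 80 ms inference budget from mitigation 1 is a timeout around the inference call with the rule-based path as the fallback. The sketch below uses Python's `asyncio.wait_for`; the function and parameter names (`score_with_budget`, `ml_score`, `rule_score`) are hypothetical, and the real Score handler may enforce its budget differently.

```python
import asyncio

INFERENCE_BUDGET_S = 0.080  # 80 ms, from the mitigation list above


async def score_with_budget(ml_score, rule_score, features, budget_s=INFERENCE_BUDGET_S):
    """Return (score, path). Falls back to the rule-based path if inference overruns."""
    try:
        value = await asyncio.wait_for(ml_score(features), timeout=budget_s)
        return value, "ml"
    except asyncio.TimeoutError:
        # Inference exceeded its budget: answer from the fast rule-based path instead.
        return rule_score(features), "rule-based"
```

Keeping the internal budget (80 ms) below the callers' 100 ms budget leaves room for the fallback to respond before the caller itself times out.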


FM-03 — Postgres unavailable

Scenario. Primary Postgres unreachable.

Impact. signals and detections writes fail; NATS consumer retries. Score reads from Redis cache where possible.

Detection. Connection error metric; alert FraudDbUnavailable within 30 s.

Mitigation.

  1. Postgres HA with synchronous replica; auto-failover ≤ 30 s.
  2. NATS consumer stops ACKing signals — they stay in stream (up to 7 d retention).
  3. Reads fall back to Redis (feature store cached values).
  4. Score downgrades gracefully to rule-based only (no DB-sourced features).

Recovery. DB recovery → consumer resumes → backlog drains. No data loss.
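
The "no data loss" property rests on acknowledging a signal only after its write succeeds. A minimal sketch of that ordering, using a stand-in message class instead of a real NATS client; `FakeMessage`, `handle_signal`, and `write_to_db` are hypothetical names.

```python
class FakeMessage:
    """Stand-in for a broker message that exposes an ack() call."""

    def __init__(self, data):
        self.data = data
        self.acked = False

    def ack(self):
        self.acked = True


def handle_signal(msg, write_to_db):
    """Persist first, ACK second. Returns True only if the signal was stored."""
    try:
        write_to_db(msg.data)
    except ConnectionError:
        # Postgres is unreachable: do not ACK, so the message stays in the
        # stream and is redelivered once the database is back.
        return False
    msg.ack()
    return True
```

Redelivery means the same signal can arrive more than once, so the write itself needs to tolerate duplicates.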


FM-04 — NATS consumer lag > 5 min

Scenario. Signal ingest (sms.dlr.inbound, sms.mo.inbound, compliance.audit.v1, firewall.audit.v1) lags.

Impact. ML features stale; new fraud patterns detected late.

Detection. Lag metric per consumer; alert at 5 min.

Mitigation.

  1. Multiple consumer replicas (queue-group scaling).
  2. Auto-scale on lag (KEDA).
  3. Downstream notified via fraud.signal.degraded.v1 if lag > 30 min.

Recovery. Auto-scaling and an ebb in upstream volume resolve the backlog; lag clears.
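
The two lag thresholds above (alert at 5 min, downstream notification at 30 min) can be sketched as a small decision function. The function name and the action strings are hypothetical; only the thresholds and the subject `fraud.signal.degraded.v1` come from this section.

```python
from typing import List

ALERT_LAG_S = 5 * 60       # alert threshold
DEGRADED_LAG_S = 30 * 60   # downstream notification threshold


def lag_actions(lag_seconds: float) -> List[str]:
    """Map a consumer's current lag to the actions this section describes."""
    actions = []
    if lag_seconds > ALERT_LAG_S:
        actions.append("alert")
    if lag_seconds > DEGRADED_LAG_S:
        actions.append("publish:fraud.signal.degraded.v1")
    return actions
```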


FM-05 — MISP feed source unreachable

Scenario. External MISP server (cross-platform fraud-intel sharing) unreachable.

Impact. No new IOCs imported; last-known IOCs still active.

Detection. Feed-sync metric fraud_feed_last_sync_age_seconds > 30 min; alert FraudFeedSyncStale.

Mitigation.

  1. Last-known-good IOC cache continues to be used.
  2. Exponential-backoff retry (max 1 h between attempts).
  3. Multi-source MISP integration — other sources still update.

Recovery. Automatic.
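
A minimal sketch of the capped exponential backoff in mitigation 2. The 1 h cap comes from this section; the 30 s base delay is an assumption, and jitter (commonly added in practice) is left out to keep the example deterministic.

```python
MAX_DELAY_S = 3600.0  # 1 h cap between attempts


def backoff_delay(attempt: int, base_s: float = 30.0, cap_s: float = MAX_DELAY_S) -> float:
    """Delay before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap_s, base_s * (2 ** attempt))
```

With these values the delay doubles from 30 s and reaches the cap on the eighth attempt (`attempt=7`).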


FM-06 — Model drift

Scenario. Attacker tactics evolve; ML model trained months ago silently loses recall.

Impact. Fraud detection rate degrades without obvious service incident.

Detection. Continuous accuracy monitoring against held-out test + weekly freshly-labelled corpus. Alert if F1 drops > 5% from baseline.

Mitigation.

  1. Weekly drift monitoring job.
  2. Quarterly retraining cadence + on-demand retraining on drift alert.
  3. Parallel-model A/B: new candidate shadows production before switchover.
  4. Per-MNO, per-time-of-day recall tracking for fine-grained alerting.

Recovery. Retrain + redeploy (typically 7–14 d).
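
The drift check can be sketched as an F1 comparison against the baseline. This assumes the 5% threshold is an absolute drop of 0.05 in F1, which is how the summary table's "F1 < baseline - 5%" reads; if the intended threshold is relative, the comparison changes. Function names are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def drift_detected(baseline_f1: float, current_f1: float, max_drop: float = 0.05) -> bool:
    """True when F1 has fallen more than max_drop (absolute) below the baseline."""
    return (baseline_f1 - current_f1) > max_drop
```

Running the same check per MNO and per time-of-day bucket gives the fine-grained tracking listed in mitigation 4.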


FM-07 — Feedback-loop poisoning

Scenario. Attacker uses feedback API (T&S correction endpoint) to label fraudulent traffic as legitimate, poisoning the training dataset.

Impact. Future models trained on corrupt labels.

Detection. Anomaly detection on feedback volume and ratios; divergence from automated-detection baselines.

Mitigation.

  1. Feedback API role-restricted to T&S staff with auditable identity.
  2. Feedback labels carry lower weight in training than automatically generated labels.
  3. Training pipeline rejects a single account / IP contributing > 5% of labels in a week.
  4. Weekly human review of label-distribution trends.

Recovery. Rollback training set; retrain on clean corpus; investigate the insider / compromised account.
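
Mitigation 3 (reject any single account or IP contributing more than 5% of a week's labels) can be sketched as a share check over the weekly label set. The input shape, a list of `(contributor_id, label)` pairs, and the function name are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def over_contributors(labels: Iterable[Tuple[str, str]], max_share: float = 0.05) -> Set[str]:
    """Return contributors whose share of this week's labels exceeds max_share."""
    counts = Counter(contributor for contributor, _ in labels)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {c for c, n in counts.items() if n / total > max_share}
```

The training pipeline would then exclude, or route to human review, every label from the returned contributors.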


FM-08 — Training job fails

Scenario. Airflow DAG failure (data dependency missing, OOM, infrastructure fault).

Impact. No model refresh this cycle.

Detection. Airflow DAG alert; ML Ops paged.

Mitigation.

  1. Previous model continues in production.
  2. Training job is idempotent; re-run within 24 h.
  3. Manual run path documented.

Recovery. Re-run; model published to registry.


FM-09 — Model registry corruption

Scenario. Model artifact in registry corrupt or missing (checksum mismatch).

Impact. Deployment of the new version fails; old model stays.

Detection. Pre-deploy checksum verify; deploy aborts.

Mitigation.

  1. Immutable model registry (S3 versioning + object-lock).
  2. Checksums verified at upload and at deploy.
  3. Cross-region replicated model bucket.
  4. Rollback to prior version always possible.

Recovery. Upload fresh artifact; redeploy.
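
A minimal sketch of the pre-deploy checksum verification. The section does not name the hash algorithm, so SHA-256 is an assumption here, as are the function names; the artifact is read as a binary stream so large model files are hashed in chunks.

```python
import hashlib
import hmac
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a binary stream, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(stream: BinaryIO, expected_hex: str) -> bool:
    """True only if the artifact's digest matches the one recorded at upload."""
    return hmac.compare_digest(sha256_of(stream), expected_hex.lower())
```

A deploy step would call `verify_artifact(open(path, "rb"), recorded_digest)` and abort on `False`, leaving the old model in place.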


FM-10 — Feature store unavailable

Scenario. Feature store (Redis + Postgres hybrid) partially unavailable.

Impact. Cold-start feature recomputation from raw signals → Score latency up 2–5×.

Detection. Feature-store connection errors.

Mitigation.

  1. In-process LRU of 10 000 most-frequent feature vectors.
  2. Degraded-mode Score runs with reduced feature set.
  3. Alert fires.

Recovery. Feature store recovery.
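
The in-process cache from mitigation 1 can be sketched with an ordered dictionary. Note that an LRU keeps the most recently used entries, which only approximates "most frequent"; the class name and the key/vector types are hypothetical.

```python
from collections import OrderedDict
from typing import List, Optional


class FeatureLRU:
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._items = OrderedDict()  # key -> feature vector, oldest first

    def get(self, key: str) -> Optional[List[float]]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, vector: List[float]) -> None:
        self._items[key] = vector
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```

On a cache miss during a feature-store outage, Score would fall through to the reduced feature set described in mitigation 2.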


FM-11 — Signal storm

Scenario. Adversarial traffic floods the platform with synthetic signals aiming to bias the model.

Impact. Training corpus polluted; model learns the attack pattern as "normal".

Detection. Anomaly on signal-volume metric.

Mitigation.

  1. Per-source / per-tenant rate-limit on signal ingest.
  2. Training pipeline applies outlier removal (> 3σ).
  3. Human review of high-volume sources weekly.

Recovery. Drop poisoned signals from training set; retrain.
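
The 3σ outlier removal in mitigation 2 can be sketched as below. The section does not say which quantity the rule applies to; this example assumes per-source signal volume, and the function name is hypothetical.

```python
import statistics
from typing import Dict


def drop_volume_outliers(volumes: Dict[str, int], max_sigma: float = 3.0) -> Dict[str, int]:
    """Keep sources whose volume lies within max_sigma standard deviations of the mean."""
    values = list(volumes.values())
    if len(values) < 2:
        return dict(volumes)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return dict(volumes)
    return {src: v for src, v in volumes.items() if abs(v - mean) <= max_sigma * std}
```

Two caveats apply to this sketch. With n sources, no single value can sit more than sqrt(n - 1) population standard deviations from the mean, so the rule cannot fire with 10 or fewer sources. A very large storm also inflates the mean and deviation it is measured against, which is one reason the per-source rate limit in mitigation 1 sits in front of it.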


FM-12 — Redis unavailable (score cache)

Scenario. Redis unreachable.

Impact. Score calls hit Postgres + feature-store fully; latency up.

Detection. Redis connection errors.

Mitigation.

  1. Redis HA.
  2. In-process LRU.
  3. The degraded path still works; latency stays within the call budget, which has 2–3× headroom.

Recovery. Automatic.


4. Graceful Degradation Summary

| Failure | Fallback | Effect on callers |
|---|---|---|
| ML unavailable | Rule-based scoring | Reduced recall; recorded in audit |
| Postgres out | Read-only from cache; writes queue on NATS | Signals delayed |
| Feature store out | In-process LRU + reduced feature set | Latency up 2–5× |
| Redis out | Postgres direct | Latency up |
| NATS lag | Stale features | Late detection |
| Feed source out | Last-known IOCs | Missing new IOCs |
| Model drift | Rollback to prior version + retrain | Short-term recall dip |

5. Failure ↔ Consumer Experience Matrix

| FM | Firewall | Sender-ID Registry | Compliance-Engine | NOC / T&S |
|---|---|---|---|---|
| FM-01 ML out | Uses rule-based fallback (flag in audit) | Reputation freezes | Uses last tenant risk-tier | Alert; reduced detection |
| FM-02 Latency | Budget cap + fallback | Latency up briefly | Budget cap + fallback | Alert |
| FM-03 Postgres out | Cache serves; some signals delayed | Stale reputation | Stale scoring | Alert |
| FM-04 NATS lag | Late signals in audit | Late reputation updates | Late scoring input | Alert |
| FM-05 Feed stale | Missing IOCs; minor | Minor | Minor | Alert |
| FM-06 Drift | Silent recall loss | Silent reputation skew | Silent scoring skew | Drift alert |
| FM-07 Poisoning | Silent recall loss (worse over time) | Silent skew | Silent skew | Manual detection |
| FM-11 Signal storm | Adversarial bias emerging | Possible | Possible | Alert |

6. Open Points

| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact SLA on feedback-loop audit (e.g., weekly vs. monthly) | T&S |
| FM-OPEN-02 | Model-card publication cadence (quarterly?) | ML Ops |
| FM-OPEN-03 | MISP reciprocal-sharing terms with external parties | Regulator Liaison + Legal |