fraud-intel-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust and Safety + ML Ops + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md

This document catalogs how fraud-intel-service fails and the designed response. Unlike sms-firewall-service, this service is fail-open by design — it is informational, not a blocker. Its failures degrade detection quality but do not stop SMS delivery directly. Downstream consumers (firewall, sender-id-registry, compliance-engine) decide how to treat stale fraud signals.

1. Operating Principle: Fail-Open (informational)

Fraud-intel is a signal producer. If it is unavailable:

Firewall falls back to rule-based patterns.
Sender-id-registry reputation scores freeze at last-known.
Compliance-engine uses last-known tenant risk-tier.

This is explicit: fraud-intel must not block legitimate traffic when it is down. If the platform loses fraud detection, the right response is heightened vigilance (more human review, stricter rule-based fallback), not an outage.

2. Failure Mode Summary

#	Name	Class	Detection	Impact	Runbook
FM-01	ML model (Triton) unavailable	Dependency	< 10 s	Fall back to rule-based; reduced recall	`runbooks/fraud-ml-out.md`
FM-02	Score gRPC latency spike (> 100 ms P95)	Performance	5 min	Callers time out; fallback engages	`runbooks/fraud-score-latency.md`
FM-03	Postgres unavailable (signal write-path)	Infra	< 30 s	Signal ingest halted; NATS queues up	`runbooks/fraud-postgres-out.md`
FM-04	NATS signal-ingest lag > 5 min	Infra	5 min	ML features stale; late fraud detection	`runbooks/fraud-nats-lag.md`
FM-05	MISP feed source unreachable	Dependency	30 min	No external IOC updates	`runbooks/fraud-feed-sync-stale.md`
FM-06	ML model drift (F1 < baseline - 5%)	ML quality	1 d	Silent detection degradation	`runbooks/fraud-model-drift.md`
FM-07	Feedback-loop poisoning (bad labels)	Security	hours-days	Model learns wrong patterns	`runbooks/fraud-label-poisoning.md`
FM-08	Training job fails	Ops	15 min	Model not refreshed this cycle	`runbooks/fraud-training-fail.md`
FM-09	Model registry corruption	Ops	< 5 min	Deployment fails; old model continues	`runbooks/fraud-registry-corrupt.md`
FM-10	Feature store unavailable	Dependency	< 10 s	Cold-start features recomputed from raw → latency up	`runbooks/fraud-feature-store-out.md`
FM-11	Signal storm (adversarial flood)	Adversarial	2 min	Noise in dataset; possible bias toward storm pattern	`runbooks/fraud-signal-storm.md`
FM-12	Redis unavailable (score cache)	Infra	< 10 s	Every Score call hits Postgres; latency up	`runbooks/fraud-redis-out.md`

3. Detailed Failure Modes

FM-01 — Triton ML serving unavailable

Scenario. Triton Inference Server pod crash, GPU fault, or OOM.

Impact. Score returns rule-based verdict only. Detection recall drops (AIT ~20–30%, SIM-box ~25%, OTP-harvest ~15% lower).

Detection. fraud_ml_inference_errors_total; circuit breaker opens after 3 consecutive errors. /health/ready on the fraud-intel pod marks ML as degraded.

Mitigation.

Circuit breaker moves to half-open after 30 s and probes Triton; it closes once the model is available again.
Rule-based fallback inline — no service outage.
Triton deployed in HA (3 replicas across AZs).
Alert FraudMlUnavailable fires; ML Ops triages within 30 min.
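
The breaker behaviour described above can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the class name, the `ml_score` / `rule_score` callables, and the injectable clock are hypothetical, and it assumes that a failed half-open probe restarts the 30 s cool-down.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after a cool-down."""

    def __init__(self, failure_threshold=3, half_open_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.half_open_after_s = half_open_after_s
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.half_open_after_s:
            return "half-open"
        return "open"

    def call(self, ml_score, rule_score, message):
        """Try ML when allowed; when open or on error, return the rule-based verdict."""
        if self.state() == "open":
            return rule_score(message)
        try:
            result = ml_score(message)
        except Exception:
            self.consecutive_failures += 1
            if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
                # Assumption: a failed half-open probe restarts the cool-down.
                self.opened_at = self.clock()
            return rule_score(message)
        # A successful call closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Because the rule-based verdict is returned inline on every failure path, callers never see an error from the breaker itself, which matches the fail-open principle in section 1.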

Recovery. Triton recovers → circuit closes → ML inference resumes.

FM-02 — Score latency spike

Scenario. A regression in model inference, a DB query, or feature computation pushes P95 from 50 ms to > 200 ms.

Impact. Synchronous callers (firewall, compliance) time out against their 100 ms budget → fallback engages.

Detection. Histogram fraud_score_seconds P95 > 100 ms for 5 min.

Mitigation.

Budget enforcement inside Score: if inference exceeds 80 ms, return fast-path rule-based.
Horizontal scale-out on RPS.
Automatic canary rollback on P95 > 150 ms for 10 min.
Alert FraudScoreLatencyHigh.
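
A minimal sketch of the 80 ms budget enforcement, assuming inference runs on a worker thread and that `ml_score` and `rule_score` are hypothetical callables standing in for the real scoring paths. The function name, the pool size, and the returned source tag are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for inference calls (the size is an arbitrary assumption).
_POOL = ThreadPoolExecutor(max_workers=4)


def score_with_budget(ml_score, rule_score, message, budget_s=0.080):
    """Return (score, source). Falls back to rules if ML exceeds the budget."""
    future = _POOL.submit(ml_score, message)
    try:
        return future.result(timeout=budget_s), "ml"
    except FutureTimeout:
        # Best effort only: cancel() cannot interrupt a call that is already running.
        future.cancel()
        return rule_score(message), "rule-fast-path"
```

Note that a timed-out inference keeps occupying a worker until it finishes, so a real implementation would also need a deadline on the inference request itself (for example a gRPC deadline) to avoid exhausting the pool during a sustained latency spike.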

Recovery. Rollback or scale resolves; post-mortem within 48 h.

FM-03 — Postgres unavailable

Scenario. Primary Postgres unreachable.

Impact. signals and detections writes fail; NATS consumer retries. Score reads from Redis cache where possible.

Detection. Connection error metric; alert FraudDbUnavailable within 30 s.

Mitigation.

Postgres HA with synchronous replica; auto-failover ≤ 30 s.
NATS consumer stops ACKing signals — they stay in stream (up to 7 d retention).
Reads fall back to Redis (feature store cached values).
Score downgrades gracefully to rule-based only (no DB-sourced features).
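
The "stop ACKing while the database is down" behaviour can be sketched as an acknowledge-after-write handler. This is an illustration under assumptions: `msg` is any object with a `data` attribute and an `ack()` method, `store` is a hypothetical write interface, and `StoreUnavailable` is an invented exception type. None of these names come from the service or from a NATS client library.

```python
class StoreUnavailable(Exception):
    """Raised by the (hypothetical) signal store when Postgres is unreachable."""


def handle_signal(msg, store):
    """ACK only after the signal is durably written.

    Returns True if the message was written and ACKed, False if it was left
    unACKed so the stream can redeliver it after the database recovers.
    """
    try:
        store.write(msg.data)
    except StoreUnavailable:
        return False  # no ACK: the message stays in the stream
    msg.ack()
    return True
```

Since redelivery can hand the same signal to the handler more than once, the write should be idempotent (for example keyed on a signal ID) for the "no data loss" recovery to also mean "no duplicates".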

Recovery. DB recovery → consumer resumes → backlog drains. No data loss.

FM-04 — NATS consumer lag > 5 min

Scenario. Signal ingest (sms.dlr.inbound, sms.mo.inbound, compliance.audit.v1, firewall.audit.v1) lags.

Impact. ML features stale; new fraud patterns detected late.

Detection. Lag metric per consumer; alert at 5 min.

Mitigation.

Multiple consumer replicas (queue-group scaling).
Auto-scale on lag (KEDA).
Downstream notified via fraud.signal.degraded.v1 if lag > 30 min.

Recovery. Auto-scaling catches up or upstream volume ebbs; lag clears.

FM-05 — MISP feed source unreachable

Scenario. External MISP server (cross-platform fraud-intel sharing) unreachable.

Impact. No new IOCs imported; last-known IOCs still active.

Detection. Feed-sync metric fraud_feed_last_sync_age_seconds > 30 min; alert FraudFeedSyncStale.

Mitigation.

Last-known-good IOC cache continues to be used.
Exponential-backoff retry (max 1 h between attempts).
Multi-source MISP integration — other sources still update.
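
The capped retry schedule can be sketched as a generator. Only the 1 h ceiling comes from this document; the 30 s starting delay, the doubling factor, and the optional jitter are assumptions chosen for illustration.

```python
import random


def backoff_delays(base_s=30.0, cap_s=3600.0, factor=2.0, jitter=0.0, rng=random.random):
    """Yield retry delays that grow geometrically and never exceed cap_s.

    jitter is a fraction of the current delay added at random (0.0 disables it).
    """
    delay = base_s
    while True:
        spread = delay * jitter * rng()
        yield min(cap_s, delay + spread)
        delay = min(cap_s, delay * factor)
```

With the default arguments the schedule runs 30 s, 60 s, 120 s and so on, reaching the 3600 s cap on the eighth attempt and staying there. A non-zero `jitter` helps keep several feed pollers from retrying against the same MISP server in lockstep.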

Recovery. Automatic.

FM-06 — Model drift

Scenario. Attacker tactics evolve; ML model trained months ago silently loses recall.

Impact. Fraud detection rate degrades without obvious service incident.

Detection. Continuous accuracy monitoring against a held-out test set and a weekly freshly labelled corpus. Alert if F1 drops > 5% from baseline.

Mitigation.

Weekly drift monitoring job.
Quarterly retraining cadence + on-demand retraining on drift alert.
Parallel-model A/B: new candidate shadows production before switchover.
Per-MNO, per-time-of-day recall tracking for fine-grained alerting.
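
A minimal sketch of the drift check. It assumes the 5% threshold is an absolute drop in F1 (five percentage points), which is how the summary table's "F1 < baseline - 5%" reads; if the intended threshold is relative to baseline, the comparison would change. Function names are illustrative.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def drift_detected(baseline_f1, current_f1, max_drop=0.05):
    """True when F1 has fallen more than max_drop below baseline.

    Assumption: max_drop is an absolute difference, not a relative one.
    """
    return (baseline_f1 - current_f1) > max_drop
```

The same check can be run per MNO or per time-of-day bucket by computing `f1_score` on each slice of the freshly labelled corpus, which gives the fine-grained alerting listed above.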

Recovery. Retrain + redeploy (typically 7–14 d).

FM-07 — Feedback-loop poisoning

Scenario. Attacker uses feedback API (T&S correction endpoint) to label fraudulent traffic as legitimate, poisoning the training dataset.

Impact. Future models trained on corrupt labels.

Detection. Anomaly detection on feedback volume and ratios; divergence from automated-detection baselines.

Mitigation.

Feedback API role-restricted to T&S staff with auditable identity.
Feedback labels carry lower training weight than automatically generated labels.
Training pipeline rejects labels from any single account or IP that contributes > 5% of labels in a week.
Weekly human review of label-distribution trends.
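
The per-contributor cap can be sketched as a filter over one week of feedback labels. This is an assumption-laden illustration: the input shape (a list of `(contributor_id, label)` pairs), the function name, and the choice to drop all labels from an over-quota contributor (rather than only the excess) are not specified in this document.

```python
from collections import Counter


def filter_dominant_contributors(labels, max_share=0.05):
    """Drop every label from contributors whose share of the week exceeds max_share.

    labels: list of (contributor_id, label) pairs, where contributor_id is an
    account or IP. Returns (kept_labels, rejected_contributor_ids).
    """
    counts = Counter(contributor for contributor, _ in labels)
    total = len(labels)
    rejected = {c for c, n in counts.items() if total and n / total > max_share}
    kept = [item for item in labels if item[0] not in rejected]
    return kept, rejected
```

A share-based cap only behaves sensibly on a large corpus: with fewer than 20 contributors every one of them exceeds 5%, so a real pipeline would likely combine this with a minimum corpus size or an absolute count threshold.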

Recovery. Rollback training set; retrain on clean corpus; investigate the insider / compromised account.

FM-08 — Training job fails

Scenario. Airflow DAG failure (data dependency missing, OOM, infrastructure fault).

Impact. No model refresh this cycle.

Detection. Airflow DAG alert; ML Ops paged.

Mitigation.

Previous model continues in production.
Training job is idempotent; re-run within 24 h.
Manual run path documented.

Recovery. Re-run; model published to registry.

FM-09 — Model registry corruption

Scenario. Model artifact in registry corrupt or missing (checksum mismatch).

Impact. Deployment of the new version fails; old model stays.

Detection. Pre-deploy checksum verify; deploy aborts.

Mitigation.

Immutable model registry (S3 versioning + object-lock).
Checksums verified at upload and at deploy.
Cross-region replicated model bucket.
Rollback to prior version always possible.
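
The pre-deploy checksum gate can be sketched as follows. The document does not name the hash algorithm; SHA-256 is assumed here, and the function names and the idea of passing the expected digest as a hex string are illustrative.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large model artifacts do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """True only if the artifact's digest matches the one recorded at upload."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```

A deploy step would call `verify_artifact` on the downloaded artifact and abort on `False`, leaving the currently serving model untouched.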

Recovery. Upload fresh artifact; redeploy.

FM-10 — Feature store unavailable

Scenario. Feature store (Redis + Postgres hybrid) partially unavailable.

Impact. Cold-start feature recomputation from raw signals → Score latency up 2–5×.

Detection. Feature-store connection errors.

Mitigation.

In-process LRU of 10 000 most-frequent feature vectors.
Degraded-mode Score runs with reduced feature set.
Alert fires.
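
A minimal sketch of the in-process cache, assuming feature vectors are keyed by some entity identifier. The class and method names are hypothetical. Note that a strict LRU evicts by recency, which only approximates the "most-frequent" set described above; a frequency-aware policy would need hit counters.

```python
from collections import OrderedDict


class FeatureLRU:
    """Bounded in-process cache that evicts the least recently used entry."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached vector, or None on a miss."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, vector):
        self._items[key] = vector
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the oldest entry
```

On a miss during a feature-store outage, the caller would fall through to the reduced feature set rather than block on recomputation.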

Recovery. The normal Score path resumes once the feature store recovers.

FM-11 — Signal storm

Scenario. Adversarial traffic floods the platform with synthetic signals aiming to bias the model.

Impact. Training corpus polluted; model learns the attack pattern as "normal".

Detection. Anomaly on signal-volume metric.

Mitigation.

Per-source / per-tenant rate-limit on signal ingest.
Training pipeline applies outlier removal (> 3σ).
Human review of high-volume sources weekly.
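
The > 3σ outlier removal can be sketched as below. The document does not say which quantity the threshold applies to; this sketch assumes per-source signal volume over a training window, and the function name and input shape are illustrative.

```python
import statistics


def remove_outliers(volumes, max_sigma=3.0):
    """Keep sources whose volume lies within max_sigma standard deviations of the mean.

    volumes: dict mapping source id to signal count (assumed input shape).
    """
    values = list(volumes.values())
    if len(values) < 2:
        return dict(volumes)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return dict(volumes)  # all sources identical: nothing to remove
    return {s: v for s, v in volumes.items() if abs(v - mean) <= max_sigma * sd}
```

One caveat of a plain mean-and-σ rule is that a large storm inflates both statistics and can partly mask itself, and a single outlier cannot exceed 3σ at all when there are ten or fewer sources. A robust variant based on the median and median absolute deviation is a common alternative.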

Recovery. Drop poisoned signals from training set; retrain.

FM-12 — Redis unavailable (score cache)

Scenario. Redis unreachable.

Impact. Score calls hit Postgres + feature-store fully; latency up.

Detection. Redis connection errors.

Mitigation.

Redis HA.
In-process LRU.
Degraded path works: latency rises but stays within the call budget, which has 2–3× headroom.

Recovery. Automatic.

4. Graceful Degradation Summary

Failure	Fallback	Effect on callers
ML unavailable	Rule-based scoring	Reduced recall; recorded in audit
Postgres out	Read-only from cache; writes queue on NATS	Signals delayed
Feature store out	In-process LRU + reduced feature set	Latency up 2–5×
Redis out	Postgres direct	Latency up
NATS lag	Stale features	Late detection
Feed source out	Last-known IOCs	Missing new IOCs
Model drift	Rollback to prior version + retrain	Short-term recall dip

5. Failure ↔ Consumer Experience Matrix

FM	Firewall	Sender-ID Registry	Compliance-Engine	NOC / T&S
FM-01 ML out	Uses rule-based fallback (flag in audit)	Reputation freezes	Uses last tenant risk-tier	Alert; reduced detection
FM-02 Latency	Budget cap + fallback	Latency up briefly	Budget cap + fallback	Alert
FM-03 Postgres out	Cache serves; some signals delayed	Stale reputation	Stale scoring	Alert
FM-04 NATS lag	Late signals in audit	Late reputation updates	Late scoring input	Alert
FM-05 Feed stale	Missing IOCs; minor	Minor	Minor	Alert
FM-06 Drift	Silent recall loss	Silent reputation skew	Silent scoring skew	Drift alert
FM-07 Poisoning	Silent recall loss (worse over time)	Silent skew	Silent skew	Manual detection
FM-11 Signal storm	Adversarial bias emerging	Possible	Possible	Alert

6. Open Points

ID	Question	Owner
FM-OPEN-01	Exact SLA on feedback-loop audit (e.g., weekly vs. monthly)	T&S
FM-OPEN-02	Model-card publication cadence (quarterly?)	ML Ops
FM-OPEN-03	MISP reciprocal-sharing terms with external parties	Regulator Liaison + Legal
