fraud-intel-service — Failure Modes
Version: 1.0
Status: Draft
Owner: Trust and Safety + ML Ops + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md
This document catalogs how fraud-intel-service fails and the designed response. Unlike sms-firewall-service, this service is fail-open by design — it is informational, not a blocker. Its failures degrade detection quality but do not stop SMS delivery directly. Downstream consumers (firewall, sender-id-registry, compliance-engine) decide how to treat stale fraud signals.
1. Operating Principle: Fail-Open (informational)
Fraud-intel is a signal producer. If unavailable:
- Firewall falls back to rule-based patterns.
- Sender-id-registry reputation scores freeze at last-known.
- Compliance-engine uses last-known tenant risk-tier.
This is explicit: fraud-intel must not block legitimate traffic when it is down. If the platform loses fraud detection, the right response is heightened vigilance (more human review, stricter rule-based fallback), not an outage.
2. Failure Mode Summary
| # | Name | Class | Detection | Impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | ML model (Triton) unavailable | Dependency | < 10 s | Fall back to rule-based; reduced recall | runbooks/fraud-ml-out.md |
| FM-02 | Score gRPC latency spike (> 100 ms P95) | Performance | 5 min | Callers time out; fallback engages | runbooks/fraud-score-latency.md |
| FM-03 | Postgres unavailable (signal write-path) | Infra | < 30 s | Signal ingest halted; NATS queues up | runbooks/fraud-postgres-out.md |
| FM-04 | NATS signal-ingest lag > 5 min | Infra | 5 min | ML features stale; late fraud detection | runbooks/fraud-nats-lag.md |
| FM-05 | MISP feed source unreachable | Dependency | 30 min | No external IOC updates | runbooks/fraud-feed-sync-stale.md |
| FM-06 | ML model drift (F1 < baseline - 5%) | ML quality | 1 d | Silent detection degradation | runbooks/fraud-model-drift.md |
| FM-07 | Feedback-loop poisoning (bad labels) | Security | hours-days | Model learns wrong patterns | runbooks/fraud-label-poisoning.md |
| FM-08 | Training job fails | Ops | 15 min | Model not refreshed this cycle | runbooks/fraud-training-fail.md |
| FM-09 | Model registry corruption | Ops | < 5 min | Deployment fails; old model continues | runbooks/fraud-registry-corrupt.md |
| FM-10 | Feature store unavailable | Dependency | < 10 s | Cold-start features recomputed from raw → latency up | runbooks/fraud-feature-store-out.md |
| FM-11 | Signal storm (adversarial flood) | Adversarial | 2 min | Noise in dataset; possible bias toward storm pattern | runbooks/fraud-signal-storm.md |
| FM-12 | Redis unavailable (score cache) | Infra | < 10 s | Every Score call hits Postgres; latency up | runbooks/fraud-redis-out.md |
3. Detailed Failure Modes
FM-01 — Triton ML serving unavailable
Scenario. Triton Inference Server pod crash, GPU fault, or OOM.
Impact. Score returns rule-based verdict only. Detection recall drops (AIT ~20–30%, SIM-box ~25%, OTP-harvest ~15% lower).
Detection. fraud_ml_inference_errors_total; circuit breaker opens after 3 consecutive errors. /health/ready on the fraud-intel pod marks ML as degraded.
Mitigation.
- Circuit breaker with a 30 s half-open interval; probes retry until the model is available again.
- Rule-based fallback inline — no service outage.
- Triton deployed in HA (3 replicas across AZs).
- Alert FraudMlUnavailable fires; ML Ops triages within 30 min.
Recovery. Triton recovers → circuit closes → ML inference resumes.
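The breaker behaviour described for FM-01 (open after 3 consecutive errors, probe again after 30 s, rule-based fallback inline) can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the class name `CircuitBreaker`, the `score` helper, and the `ml_score` / `rule_score` callables are hypothetical, and the injectable `clock` exists only to make the sketch testable.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors.

    Once `half_open_after` seconds have elapsed, requests are let through
    again as probes; a failed probe re-opens the circuit.
    """

    def __init__(self, max_failures=3, half_open_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.half_open_after = half_open_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a probe once the wait has elapsed.
        return self.clock() - self.opened_at >= self.half_open_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)start the open interval


def score(features, breaker, ml_score, rule_score):
    """Return (score, source); fall back to rules if the circuit is open or inference fails."""
    if breaker.allow_request():
        try:
            result = ml_score(features)
            breaker.record_success()
            return result, "ml"
        except Exception:
            breaker.record_failure()
    return rule_score(features), "rules"
```

Returning the source alongside the score is one way to let callers record the fallback in their audit trail.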
FM-02 — Score latency spike
Scenario. Model inference, DB query, or feature computation regression pushes P95 from 50 ms to > 200 ms.
Impact. Synchronous callers (firewall, compliance) time out on 100 ms budget → fallback engages.
Detection. Histogram fraud_score_seconds P95 > 100 ms for 5 min.
Mitigation.
- Budget enforcement inside Score: if inference exceeds 80 ms, return fast-path rule-based.
- Horizontal scale-out on RPS.
- Automatic canary rollback on P95 > 150 ms for 10 min.
- Alert FraudScoreLatencyHigh fires.
Recovery. Rollback or scale resolves; post-mortem within 48 h.
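The 80 ms budget enforcement inside Score can be illustrated with a timeout around the inference call. This is a sketch under assumptions: `score_with_budget`, `ml_score`, and `rule_score` are hypothetical names, and a thread pool is used here only for illustration (the real service may enforce the budget through gRPC deadlines or another mechanism).

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool for inference calls; the size is an arbitrary choice for this sketch.
_POOL = ThreadPoolExecutor(max_workers=4)


def score_with_budget(features, ml_score, rule_score, budget_s=0.080):
    """Return (score, source); use the rule-based fast path if inference exceeds the budget."""
    future = _POOL.submit(ml_score, features)
    try:
        return future.result(timeout=budget_s), "ml"
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return rule_score(features), "rules"
```

Because the 80 ms inner budget is below the callers' 100 ms timeout, the caller still receives a verdict instead of timing out.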
FM-03 — Postgres unavailable
Scenario. Primary Postgres unreachable.
Impact. Writes to signals and detections fail; the NATS consumer retries. Score reads from the Redis cache where possible.
Detection. Connection error metric; alert FraudDbUnavailable within 30 s.
Mitigation.
- Postgres HA with synchronous replica; auto-failover ≤ 30 s.
- NATS consumer stops ACKing signals — they stay in stream (up to 7 d retention).
- Reads fall back to Redis (feature store cached values).
- Score downgrades gracefully to rule-based only (no DB-sourced features).
Recovery. DB recovery → consumer resumes → backlog drains. No data loss.
FM-04 — NATS consumer lag > 5 min
Scenario. Signal ingest (sms.dlr.inbound, sms.mo.inbound, compliance.audit.v1, firewall.audit.v1) lags.
Impact. ML features stale; new fraud patterns detected late.
Detection. Lag metric per consumer; alert at 5 min.
Mitigation.
- Multiple consumer replicas (queue-group scaling).
- Auto-scale on lag (KEDA).
- Downstream notified via fraud.signal.degraded.v1 if lag > 30 min.
Recovery. Auto-scaling and an ebb in upstream traffic resolve the backlog; lag clears.
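The two lag thresholds for FM-04 (alert at 5 min, downstream notification at 30 min) can be expressed as a small decision function. The function name and the action strings are hypothetical; only the thresholds and the subject name fraud.signal.degraded.v1 come from this document.

```python
ALERT_AFTER_S = 5 * 60      # alert threshold from the detection rule
DEGRADED_AFTER_S = 30 * 60  # downstream notification threshold


def lag_actions(lag_seconds):
    """Return the actions to take for a given consumer lag, in escalation order."""
    actions = []
    if lag_seconds > ALERT_AFTER_S:
        actions.append("alert")
    if lag_seconds > DEGRADED_AFTER_S:
        actions.append("publish:fraud.signal.degraded.v1")
    return actions
```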
FM-05 — MISP feed source unreachable
Scenario. External MISP server (cross-platform fraud-intel sharing) unreachable.
Impact. No new IOCs imported; last-known IOCs still active.
Detection. Feed-sync metric fraud_feed_last_sync_age_seconds > 30 min; alert FraudFeedSyncStale.
Mitigation.
- Last-known-good IOC cache continues to be used.
- Exponential-backoff retry (max 1 h between attempts).
- Multi-source MISP integration — other sources still update.
Recovery. Automatic.
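The retry schedule for FM-05 can be sketched as capped exponential backoff. Only the 1 h cap comes from this document; the 60 s base delay, the doubling factor, the optional jitter, and the function name are assumptions made for illustration.

```python
import random


def backoff_delays(base_s=60.0, cap_s=3600.0, attempts=8, jitter=0.0, rng=random.random):
    """Return the wait (in seconds) before each retry, doubling up to `cap_s`."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap_s, base_s * 2 ** attempt)
        # Optional multiplicative jitter spreads retries from multiple replicas.
        delay *= 1.0 + jitter * rng()
        delays.append(min(cap_s, delay))
    return delays
```

With the assumed defaults and no jitter, the delays grow from 1 min and settle at the 1 h cap from the seventh attempt onward.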
FM-06 — Model drift
Scenario. Attacker tactics evolve; ML model trained months ago silently loses recall.
Impact. Fraud detection rate degrades without obvious service incident.
Detection. Continuous accuracy monitoring against a held-out test set and a weekly freshly labelled corpus. Alert if F1 drops > 5% from baseline.
Mitigation.
- Weekly drift monitoring job.
- Quarterly retraining cadence + on-demand retraining on drift alert.
- Parallel-model A/B: new candidate shadows production before switchover.
- Per-MNO, per-time-of-day recall tracking for fine-grained alerting.
Recovery. Retrain + redeploy (typically 7–14 d).
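The drift alert condition can be sketched as below. The document does not say whether the 5% drop is relative to the baseline or in absolute percentage points; this sketch assumes a relative drop, and the function names are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def drift_detected(baseline_f1, current_f1, max_relative_drop=0.05):
    """True if current F1 fell more than `max_relative_drop` below the baseline (relative drop assumed)."""
    if baseline_f1 <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline_f1 - current_f1) / baseline_f1 > max_relative_drop
```

For example, a baseline F1 of 0.90 and a current F1 of 0.84 is a relative drop of about 6.7%, which would trigger the alert under this assumption.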
FM-07 — Feedback-loop poisoning
Scenario. Attacker uses feedback API (T&S correction endpoint) to label fraudulent traffic as legitimate, poisoning the training dataset.
Impact. Future models trained on corrupt labels.
Detection. Anomaly detection on feedback volume and ratios; divergence from automated-detection baselines.
Mitigation.
- Feedback API role-restricted to T&S staff with auditable identity.
- Feedback weight in training is lower than automatic-labelling.
- Training pipeline rejects a single account / IP contributing > 5% of labels in a week.
- Weekly human review of label-distribution trends.
Recovery. Rollback training set; retrain on clean corpus; investigate the insider / compromised account.
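The 5% per-contributor cap in the training pipeline can be illustrated as a share check over one week of labels. This is a sketch: the function name and the `(contributor_id, label)` input shape are assumptions, and the document does not specify whether the pipeline drops only the offending labels or rejects the whole batch, so the sketch only identifies the contributors.

```python
from collections import Counter


def over_quota_contributors(labels, max_share=0.05):
    """Return contributors (account or IP) supplying more than `max_share` of the labels.

    `labels` is an iterable of (contributor_id, label) pairs covering one week.
    """
    counts = Counter(contributor for contributor, _ in labels)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {c for c, n in counts.items() if n / total > max_share}
```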
FM-08 — Training job fails
Scenario. Airflow DAG failure (data dependency missing, OOM, infrastructure fault).
Impact. No model refresh this cycle.
Detection. Airflow DAG alert; ML Ops paged.
Mitigation.
- Previous model continues in production.
- Training job is idempotent; re-run within 24 h.
- Manual run path documented.
Recovery. Re-run; model published to registry.
FM-09 — Model registry corruption
Scenario. Model artifact in registry corrupt or missing (checksum mismatch).
Impact. Deployment of the new version fails; old model stays.
Detection. Pre-deploy checksum verify; deploy aborts.
Mitigation.
- Immutable model registry (S3 versioning + object-lock).
- Checksums verified at upload and at deploy.
- Cross-region replicated model bucket.
- Rollback to prior version always possible.
Recovery. Upload fresh artifact; redeploy.
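The pre-deploy checksum verification can be sketched as follows. The document says only "checksum"; SHA-256 is an assumption here, as are the function names.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large model artifacts are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """True if the artifact on disk matches the checksum recorded at upload time."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```

A deploy step would abort when `verify_artifact` returns False, leaving the current model in place.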
FM-10 — Feature store unavailable
Scenario. Feature store (Redis + Postgres hybrid) partially unavailable.
Impact. Cold-start feature recomputation from raw signals → Score latency up 2–5×.
Detection. Feature-store connection errors.
Mitigation.
- In-process LRU of 10 000 most-frequent feature vectors.
- Degraded-mode Score runs with reduced feature set.
- Alert fires.
Recovery. Feature store recovery.
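The in-process cache of feature vectors can be sketched with an ordered dictionary. The class name and method names are hypothetical. Note that a plain LRU evicts by recency of access, which only approximates the "most-frequent" set mentioned in the mitigation list.

```python
from collections import OrderedDict


class FeatureLRU:
    """Bounded in-process cache; evicts the least recently used feature vector."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None  # miss: caller falls back to the reduced feature set
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, vector):
        self._items[key] = vector
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
```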
FM-11 — Signal storm
Scenario. Adversarial traffic floods the platform with synthetic signals aiming to bias the model.
Impact. Training corpus polluted; model learns the attack pattern as "normal".
Detection. Anomaly on signal-volume metric.
Mitigation.
- Per-source / per-tenant rate-limit on signal ingest.
- Training pipeline applies outlier removal (> 3σ).
- Human review of high-volume sources weekly.
Recovery. Drop poisoned signals from training set; retrain.
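The 3σ outlier removal in the training pipeline can be illustrated on per-source signal volumes. Which quantity the pipeline actually filters on is not stated in this document, so the per-source volume, the function name, and the input shape are assumptions.

```python
import statistics


def drop_outliers(volumes, sigmas=3.0):
    """Keep sources whose signal count is within `sigmas` standard deviations of the mean.

    `volumes` maps source_id -> signal count for the training window.
    """
    values = list(volumes.values())
    if len(values) < 2:
        return dict(volumes)
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return dict(volumes)  # all sources identical; nothing to remove
    return {s: v for s, v in volumes.items() if abs(v - mean) <= sigmas * stdev}
```

A mean-and-standard-deviation filter is itself skewed by large outliers and cannot flag anything in very small samples, so a robust statistic (median and MAD, for example) may be a better fit in practice.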
FM-12 — Redis unavailable (score cache)
Scenario. Redis unreachable.
Impact. Score calls hit Postgres + feature-store fully; latency up.
Detection. Connection errors.
Mitigation.
- Redis HA.
- In-process LRU.
- Degraded path works; latency stays within the call budget, which has 2–3× headroom.
Recovery. Automatic.
4. Graceful Degradation Summary
| Failure | Fallback | Effect on callers |
|---|---|---|
| ML unavailable | Rule-based scoring | Reduced recall; recorded in audit |
| Postgres out | Read-only from cache; writes queue on NATS | Signals delayed |
| Feature store out | In-process LRU + reduced feature set | Latency up 2–5× |
| Redis out | Postgres direct | Latency up |
| NATS lag | Stale features | Late detection |
| Feed source out | Last-known IOCs | Missing new IOCs |
| Model drift | Rollback to prior version + retrain | Short-term recall dip |
5. Failure ↔ Consumer Experience Matrix
| FM | Firewall | Sender-ID Registry | Compliance-Engine | NOC / T&S |
|---|---|---|---|---|
| FM-01 ML out | Uses rule-based fallback (flag in audit) | Reputation freezes | Uses last tenant risk-tier | Alert; reduced detection |
| FM-02 Latency | Budget cap + fallback | Latency up briefly | Budget cap + fallback | Alert |
| FM-03 Postgres out | Cache serves; some signals delayed | Stale reputation | Stale scoring | Alert |
| FM-04 NATS lag | Late signals in audit | Late reputation updates | Late scoring input | Alert |
| FM-05 Feed stale | Missing IOCs; minor | Minor | Minor | Alert |
| FM-06 Drift | Silent recall loss | Silent reputation skew | Silent scoring skew | Drift alert |
| FM-07 Poisoning | Silent recall loss (worse over time) | Silent skew | Silent skew | Manual detection |
| FM-11 Signal storm | Adversarial bias emerging | Possible | Possible | Alert |
6. Open Points
| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact SLA on feedback-loop audit (e.g., weekly vs. monthly) | T&S |
| FM-OPEN-02 | Model-card publication cadence (quarterly?) | ML Ops |
| FM-OPEN-03 | MISP reciprocal-sharing terms with external parties | Regulator Liaison + Legal |