sms-firewall-service — Failure Modes
Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md
This document catalogs how sms-firewall-service fails, what the tenant / citizen / regulator experiences, how the platform detects each failure, and the designed mitigation path. The service is consulted synchronously by routing-engine on every outbound message and by channel-router-service on inbound MO; any fault is therefore national in impact.
1. Operating Principle: Fail-Closed by Default
sms-firewall-service is a national-perimeter firewall. Its fail-closed posture is the inverse of the usual "availability first" service posture: if we cannot evaluate a message, we do not let it through. Specific exceptions are enumerated in §4 (Graceful Degradation).
2. Failure Mode Summary
| # | Name | Class | Detection Time | User-visible effect | Runbook |
|---|---|---|---|---|---|
| FM-01 | Postgres unavailable | Infra | < 30 s | Firewall FilterInbound/EvaluateTransit returns 503; routing-engine fails the submit; tenant sees 503 with code FIREWALL_UNAVAILABLE | runbooks/firewall-postgres-out.md |
| FM-02 | Redis unavailable | Infra | < 10 s | Verdict cache miss → all requests hit Postgres (degraded latency, ~30–50 ms P95 vs. 5 ms) | runbooks/firewall-redis-out.md |
| FM-03 | NATS consumer lag > 60 s | Infra | 60 s | Fraud-intel and consent-ledger signals delayed; blocklist updates stale | runbooks/firewall-nats-lag.md |
| FM-04 | National-blocklist federation source unreachable | Dependency | 30 min | Blocklist stale; new national-blocklist entries not applied | runbooks/firewall-federation-stale.md |
| FM-05 | Blocklist hash-chain break (tamper or bug) | Correctness | 24 h | Audit defensibility lost for affected period | runbooks/firewall-chain-break.md |
| FM-06 | ML model (AIT / SIM-box) unavailable | Dependency | < 10 s | Fall back to rule-based pattern matcher (degraded detection rate) | runbooks/firewall-ml-out.md |
| FM-07 | Fingerprint storm (DDoS of distinct JA3) | Adversarial | 2 min | Cache thrashing; increased latency; tarpit triggers | runbooks/firewall-fingerprint-storm.md |
| FM-08 | Legitimate-sender false positive spike | Correctness | minutes | Tenant escalations; potential routing failures for genuine traffic | runbooks/firewall-false-positive-spike.md |
| FM-09 | SIM-box detector emits high-volume false positives | ML | 1 h | Real traffic blocked as SIM-box originated | runbooks/firewall-simbox-fp.md |
| FM-10 | EvaluateTransit P95 latency spike | Performance | 5 min | Submit-to-DLR SLA at risk; SLO burn | runbooks/firewall-latency-spike.md |
| FM-11 | Region-partition on blocklist replication | Infra | 1 min | Regions diverge on blocklist state | runbooks/firewall-region-split.md |
| FM-12 | HSM unavailable → signing of federation export fails | Dependency | < 30 s | Outbound federation paused; inbound continues | runbooks/firewall-hsm-out.md |
| FM-13 | Rule-engine recursion (composite-rule cycle) | Code | < 1 s | Single evaluation times out; fail-closed for that message | runbooks/firewall-rule-cycle.md |
3. Detailed Failure Modes
FM-01 — Postgres unavailable
Scenario. Primary Postgres (or its synchronous replica) is unreachable; pooled connections time out.
Impact. Every outbound SMS without a still-valid cached verdict is blocked at the firewall (fail-closed). Inbound MO filtering stalls; MO messages queue in NATS.
Detection. /health/ready returns 503 within 2 s; while the readiness check fails, Kubernetes removes the pod from the Service endpoints; alert FirewallDbUnavailable fires within 30 s (metric firewall_db_connection_errors_total).
Mitigation.
- Postgres HA with synchronous replica; automatic fail-over within 30 s.
- Multi-region fail-over (manual-gated) ≤ 15 min (quarterly drilled).
- Redis-cached verdicts continue to be served until their TTL (300 s) expires, buying a window of up to 5 minutes for traffic whose verdicts are already cached.
- Emergency bypass via feature flag FIREWALL_EMERGENCY_BYPASS=true, which allows P0/P1 lanes to bypass the firewall with a conspicuous audit trail (CISO + CTO dual-approval, time-boxed ≤ 1 h).
Recovery. Once Postgres is reachable, pods auto-recover. Backlog from NATS drains; audit gaps (if bypass engaged) flagged in daily report.
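The fail-closed rule and the emergency bypass conditions above can be sketched as a single decision function. This is a minimal illustration, not the service's implementation: the names `BypassGrant`, `bypass_active`, `decide`, and the verdict strings are hypothetical, and the sketch assumes the feature flag is represented by a grant object that carries its approvers and validity window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_BYPASS_WINDOW = timedelta(hours=1)   # time-box from the mitigation list
BYPASS_LANES = {"P0", "P1"}              # only these lanes may bypass
REQUIRED_APPROVERS = {"CISO", "CTO"}     # dual approval


@dataclass(frozen=True)
class BypassGrant:
    """Hypothetical record of an engaged FIREWALL_EMERGENCY_BYPASS."""
    approvers: frozenset
    granted_at: datetime
    expires_at: datetime


def bypass_active(grant: Optional[BypassGrant], now: datetime) -> bool:
    """A grant counts only if dual-approved, at most 1 h long, and current."""
    if grant is None:
        return False
    if not REQUIRED_APPROVERS <= grant.approvers:
        return False
    if grant.expires_at - grant.granted_at > MAX_BYPASS_WINDOW:
        return False
    return grant.granted_at <= now < grant.expires_at


def decide(verdict: Optional[str], lane: str,
           grant: Optional[BypassGrant], now: datetime) -> str:
    """verdict is None when the firewall could not evaluate the message."""
    if verdict is not None:
        return verdict                     # normal path: evaluation succeeded
    if lane in BYPASS_LANES and bypass_active(grant, now):
        return "ALLOW_BYPASS_AUDITED"      # must be written to the audit log
    return "BLOCK_FAIL_CLOSED"             # default: cannot evaluate, do not pass


start = datetime(2026, 4, 21, 12, 0, tzinfo=timezone.utc)
grant = BypassGrant(frozenset({"CISO", "CTO"}), start, start + timedelta(minutes=45))
print(decide(None, "P0", grant, start + timedelta(minutes=10)))
print(decide(None, "P2", grant, start + timedelta(minutes=10)))
```

Keeping the default branch last means any condition the sketch does not recognise falls through to a block, which matches the fail-closed posture in §1.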
FM-02 — Redis unavailable
Scenario. Redis cluster reachable but degraded (e.g., node failure) or unreachable.
Impact. Hot-path verdict cache misses → all requests hit Postgres. EvaluateTransit P95 drifts from 5 ms → 30–50 ms. SLO burn begins within minutes if load is high.
Detection. Cache-miss rate metric firewall_cache_miss_ratio spikes; Redis connection errors rise; alert FirewallCachePostgresFallback fires at > 20% miss rate sustained for 5 min.
Mitigation.
- Redis HA cluster with 3 primary + 3 replica; automatic fail-over.
- Degraded-mode Postgres fallback continues serving correctly, just slower.
- In-process LRU of 1 000 most-frequent verdicts (seconds-level TTL) masks Redis flaps.
- If sustained > 30 min, HPA scales up to compensate.
Recovery. Redis recovery repopulates cache naturally; latency returns to normal within 5 min.
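The in-process cache that masks Redis flaps could look roughly like the following. It is a sketch under assumptions: the class name `TtlLruCache`, the 5-second TTL, and the injectable clock are illustrative, and a plain LRU is used here as an approximation of "most-frequent" retention.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Small LRU cache whose entries expire after a short TTL."""

    def __init__(self, max_entries=1000, ttl_seconds=5.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()   # key -> (verdict, stored_at)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        verdict, stored_at = item
        if self._clock() - stored_at >= self._ttl:
            del self._items[key]      # expired: caller falls back to Redis/Postgres
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return verdict

    def put(self, key, verdict):
        self._items[key] = (verdict, self._clock())
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)   # evict least recently used


cache = TtlLruCache()
cache.put("sender-123", "ALLOW")
print(cache.get("sender-123"))
```

The short TTL bounds how long a stale verdict can survive in a pod, which matters because blocklist updates do not invalidate per-pod memory in this sketch.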
FM-03 — NATS consumer lag > 60 s
Scenario. Consumer on fraud.detected.*, consent.revoked.v1, or sender.id.suspended.v1 falls behind.
Impact. New fraud signals, consent revocations, and sender-ID suspensions are delayed, so the firewall keeps operating on stale state. Risk: a just-suspended sender continues to pass the firewall for as long as the consumer lags.
Detection. NATS lag metric per consumer; alert at 60 s.
Mitigation.
- Durable consumers with explicit ACK; DLQ for poison messages.
- Multiple consumer instances (queue-group of N pods) — parallel processing.
- Back-pressure signal: if lag > 300 s, emit firewall.signal.degraded.v1 so downstream can compensate.
Recovery. Consumer catches up within minutes once load returns to normal. Lag metric guides HPA.
FM-04 — Federation source unreachable
Scenario. National-blocklist federation partner (MNO or regulator) rejects the sync poll or times out.
Impact. Blocklist stale — new national-blocklist entries not applied to local firewall.
Detection. Federation sync metric firewall_federation_last_sync_age_seconds > 30 min; alert FirewallFederationStale.
Mitigation.
- Previous blocklist still enforced (last-known-good).
- Circuit breaker on federation client: 3 consecutive failures → back off exponentially to 1 h max.
- Manual upload fallback via admin-dashboard for urgent entries.
- Federation is multi-source (if one MNO is down, others still update).
Recovery. Automatic retry; next successful sync merges the diff.
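The back-off behaviour of the federation client can be illustrated with a small delay schedule. The source states only the trip point (3 consecutive failures) and the 1 h ceiling; the 60-second base interval, the doubling factor, and the function name `next_poll_delay` are assumptions made for the sketch.

```python
def next_poll_delay(consecutive_failures, base_seconds=60.0,
                    trip_after=3, cap_seconds=3600.0):
    """Seconds to wait before the next federation sync attempt.

    Below the trip point the normal interval applies; from the third
    consecutive failure onward the delay doubles, capped at 1 h.
    """
    if consecutive_failures < trip_after:
        return base_seconds
    # Clamp the exponent so very long outages cannot overflow a float.
    exponent = min(consecutive_failures - trip_after + 1, 32)
    return min(cap_seconds, base_seconds * 2 ** exponent)


for failures in range(0, 10):
    print(failures, next_poll_delay(failures))
```

A successful sync would reset the failure counter to zero, returning the client to the base interval.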
FM-05 — Blocklist hash-chain break
Scenario. Daily hash-chain verifier detects that a row's record_hash does not match sha256(payload || prev_hash).
Impact. Audit log regulator-defensibility lost for affected period.
Detection. Daily verifier cron; alert FirewallChainBroken (Critical).
Mitigation.
- Immediate investigation of the candidate causes: log read-write anomaly, schema tampering, or verifier bug.
- Quarantine affected partition; subsequent rows continue from a new chain origin.
- If root cause is tamper, escalate to CISO + Legal.
- If root cause is verifier bug, recompute chain after bugfix; regulator notified if audit was already submitted with corrupt chain.
Recovery. Depends on root cause. Audit row volume typically ~10 k/h, so partition quarantine has minor operational cost but major audit implication.
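The check performed by the daily verifier can be sketched as below. Only the relation record_hash = sha256(payload || prev_hash) comes from this document; the byte-level encoding (raw payload bytes followed by the raw 32-byte previous digest), the all-zero origin value, and the function names are assumptions.

```python
import hashlib
from typing import Iterable, Optional, Tuple

CHAIN_ORIGIN = b"\x00" * 32   # assumed prev_hash for the first row of a chain


def record_hash(payload: bytes, prev_hash: bytes) -> bytes:
    """sha256(payload || prev_hash), with || as byte concatenation."""
    return hashlib.sha256(payload + prev_hash).digest()


def first_broken_index(rows: Iterable[Tuple[bytes, bytes]],
                       origin: bytes = CHAIN_ORIGIN) -> Optional[int]:
    """Return the index of the first row whose stored hash does not match.

    rows yields (payload, stored_record_hash) in chain order.
    Returns None when the whole chain verifies.
    """
    prev = origin
    for index, (payload, stored) in enumerate(rows):
        if record_hash(payload, prev) != stored:
            return index
        prev = stored
    return None


# Build a three-row chain, then tamper with the middle payload.
rows, prev = [], CHAIN_ORIGIN
for payload in (b"add:entry-1", b"add:entry-2", b"remove:entry-1"):
    prev = record_hash(payload, prev)
    rows.append((payload, prev))
print(first_broken_index(rows))
rows[1] = (b"add:entry-9", rows[1][1])
print(first_broken_index(rows))
```

Reporting the first mismatching index gives the quarantine step a concrete boundary: rows before it still verify, and a new chain origin can start after the affected partition.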
FM-06 — ML model unavailable
Scenario. AIT-detection or SIM-box-detection ML model serving (Triton / TorchServe) returns 503.
Impact. ML-assisted detection offline; rule-based matchers (which are lower-recall) continue.
Detection. firewall_ml_inference_errors_total; circuit breaker opens after 3 consecutive errors; fallback engages immediately.
Mitigation.
- Fallback to rule-based patterns (still detects common AIT/SIM-box signatures).
- Rule-based verdict marked with detectionMode: "FALLBACK" in audit.
- Alert FirewallMlUnavailable fires.
- Model-serving is HA (3 replicas); single-pod failure auto-recovers.
Recovery. Model back online → circuit closes after 30 s of healthy responses.
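The open and close conditions of the ML circuit breaker can be sketched as a small state machine. The thresholds (3 consecutive errors to open, 30 s of healthy responses to close) and the "FALLBACK" marker are taken from this section; the class name, the "ML" marker for normal operation, and the idea that health probes keep reporting successes while the circuit is open are assumptions.

```python
import time


class MlCircuitBreaker:
    """Opens after N consecutive errors; closes after a healthy streak."""

    def __init__(self, failure_threshold=3, healthy_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._healthy_seconds = healthy_seconds
        self._clock = clock
        self._consecutive_errors = 0
        self._open = False
        self._healthy_since = None   # start of the current healthy streak

    @property
    def detection_mode(self):
        """Value to record in the audit row for this verdict."""
        return "FALLBACK" if self._open else "ML"

    def record_error(self):
        self._consecutive_errors += 1
        self._healthy_since = None           # any error restarts the streak
        if self._consecutive_errors >= self._threshold:
            self._open = True                # rule-based fallback engages

    def record_success(self):
        self._consecutive_errors = 0
        if not self._open:
            return
        now = self._clock()
        if self._healthy_since is None:
            self._healthy_since = now
        elif now - self._healthy_since >= self._healthy_seconds:
            self._open = False               # 30 s of healthy responses
            self._healthy_since = None


breaker = MlCircuitBreaker()
for _ in range(3):
    breaker.record_error()
print(breaker.detection_mode)
```

Requiring an unbroken healthy streak before closing avoids flapping between ML and rule-based verdicts while the model server is still restarting.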
FM-07 — Fingerprint storm (adversarial)
Scenario. Attacker rotates JA3 fingerprints at high rate to exhaust cache / rate-limit buckets.
Impact. Cache thrashing; legitimate traffic evicted; possible latency spike.
Detection. firewall_distinct_ja3_per_minute > 10 000; alert FirewallFingerprintStorm.
Mitigation.
- Cloudflare + Kong edge defence (per EP-KONG-06) absorbs most of the storm before it reaches the firewall.
- In-service: cache eviction by LFU; top-100 fingerprints pinned.
- Automatic tarpit deployment: new fingerprints with elevated error rate get slow-response.
- Escalation: if unresolved in 15 min, enable tighter Kong filter.
Recovery. Storm ebbs naturally or is blocked at edge; service recovers in minutes.
FM-08 — Legitimate-sender false-positive spike
Scenario. A rule update or AIT model update causes > 1% of legitimate traffic to be BLOCKED.
Impact. Tenant escalations; OTP/transactional delivery failures; regulator exposure.
Detection. Tenant-reported complaints; metric firewall_block_rate anomaly; SLO breach.
Mitigation.
- Shadow-mode rule deployment (per EP-CE-19 pattern) is required for all rule updates.
- Automatic rollback if BLOCK rate > baseline + 50% sustained 10 min.
- Per-tenant whitelist for known design-partner tenants during sensitive windows.
Recovery. Rollback the offending rule; incident report within 24 h.
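The automatic rollback condition can be expressed as a check over recent block-rate samples. This sketch makes two assumptions that the text does not settle: "baseline + 50%" is read as a relative uplift (1.5 times the baseline), and the metric is sampled once per minute so that 10 samples cover the 10-minute window. The function name `should_rollback` is hypothetical.

```python
def should_rollback(block_rates, baseline, window=10, uplift=0.5):
    """True when every one of the last `window` samples exceeds the threshold.

    block_rates: per-minute BLOCK rates as fractions (0.03 means 3%),
                 oldest first.
    baseline:    expected BLOCK rate before the rule or model update.
    """
    if len(block_rates) < window:
        return False                      # not enough data to call it sustained
    threshold = baseline * (1.0 + uplift)
    return all(rate > threshold for rate in block_rates[-window:])


# Baseline 2% gives a 3% threshold; ten minutes at 4% triggers rollback.
print(should_rollback([0.04] * 10, baseline=0.02))
print(should_rollback([0.04] * 9 + [0.025], baseline=0.02))
```

If the intended meaning is an absolute uplift of 50 percentage points, only the threshold line changes.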
FM-09 — SIM-box detector false positives
Scenario. ML model flags legitimate high-volume MNO gateway traffic as SIM-box.
Impact. Real traffic blocked.
Detection. Tenant escalation; fraud.detected.simbox volume anomaly; manual review shows FP.
Mitigation.
- Human-in-the-loop triage for SIM-box detections before auto-block (except highest-confidence tier).
- Per-tenant exemption workflow for verified bulk senders.
- Model re-training with updated negative examples.
Recovery. Exemption applied immediately; model retrained within 7 d.
FM-10 — EvaluateTransit P95 latency spike
Scenario. Code path regression or DB slow query pushes P95 from 5 ms to > 30 ms.
Impact. Submit-to-DLR SLA (OTP 3 s target) at risk.
Detection. Prometheus histogram firewall_evaluate_transit_seconds P95 > 20 ms for 5 min.
Mitigation.
- Automatic rollback on canary P95 > 15 ms for 5 min.
- Horizontal scale-out via HPA on RPS.
- Query-plan regression detection in staging under load.
Recovery. Rollback; post-mortem within 48 h.
FM-11 — Region partition on blocklist replication
Scenario. Kabul↔Mazar logical replication partition prevents blocklist updates flowing cross-region.
Impact. Regions diverge; an entry added in Kabul is not visible in Mazar for up to the duration of the partition.
Detection. firewall_region_divergence_row_count metric > 0 for > 5 min.
Mitigation.
- Replication monitoring + automatic re-sync.
- Regions operate independently (each enforces own local blocklist).
- Drift reconciliation hourly; alert if > 100 rows divergent for 1 h.
Recovery. Partition heals → replication catches up → reconciliation cron merges.
FM-12 — HSM unavailable
Scenario. HSM outage blocks signing of daily federation export.
Impact. Outbound federation paused; inbound firewall operations continue unaffected (audit hash-chain uses software SHA-256, not HSM).
Detection. Sign operation fails; export job queues; alert FirewallHsmUnavailable.
Mitigation.
- HSM HA (ADR-0004 §11) with regional quorum.
- Export job retries on HSM recovery.
- Manual export signing via Security-team-held backup key (dual-control).
Recovery. HSM recovery → export job drains queue automatically.
FM-13 — Rule-engine recursion
Scenario. A composite-rule reference loop is created via manual rule authoring.
Impact. Single evaluation hits recursion limit and times out; fail-closed for that message.
Detection. firewall_rule_recursion_errors_total counter increment; alert at > 10/min.
Mitigation.
- Save-time cycle detection (inherited from compliance-engine composite-rule pattern).
- Runtime recursion depth cap (5) with visited-set tracking.
- Per-message evaluation budget (50 ms P99); exceeded → fail-closed + alert.
Recovery. Admin removes the offending rule; affected period has a reviewable audit log.
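The two guards above, save-time cycle detection and the runtime depth cap with visited-set tracking, can be sketched together. The rule graph is modelled here as a dict from a composite rule name to the rules it references; that representation, the function names, and the `RuleEvaluationError` type are assumptions, while the depth cap of 5 comes from the mitigation list.

```python
from typing import Dict, List, Optional

MAX_DEPTH = 5   # runtime recursion depth cap


class RuleEvaluationError(Exception):
    """Raised at runtime; the caller treats it as fail-closed."""


def find_cycle(rules: Dict[str, List[str]]) -> Optional[List[str]]:
    """Save-time check: depth-first search, returns one cycle path or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNSEEN for name in rules}
    path: List[str] = []

    def visit(name: str) -> Optional[List[str]]:
        state[name] = IN_PROGRESS
        path.append(name)
        for child in rules.get(name, []):
            child_state = state.get(child, UNSEEN)
            if child_state == IN_PROGRESS:        # back edge: cycle found
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[name] = DONE
        return None

    for name in rules:
        if state[name] == UNSEEN:
            found = visit(name)
            if found:
                return found
    return None


def expand(rule: str, rules: Dict[str, List[str]],
           depth: int = 0, visiting: frozenset = frozenset()) -> List[str]:
    """Runtime guard: resolve a composite rule to its leaf rules."""
    if depth > MAX_DEPTH or rule in visiting:
        raise RuleEvaluationError(f"recursion guard tripped at {rule!r}")
    children = rules.get(rule, [])
    if not children:
        return [rule]
    leaves: List[str] = []
    for child in children:
        leaves.extend(expand(child, rules, depth + 1, visiting | {rule}))
    return leaves


print(find_cycle({"a": ["b"], "b": ["a"]}))
print(expand("a", {"a": ["b", "c"], "b": ["c"], "c": []}))
```

The runtime guard stays in place even with save-time detection because a cycle can still reach production through paths that skip the save-time check, such as a direct database edit.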
4. Graceful Degradation Summary
| Failure domain | Fail-closed action | Optional bypass |
|---|---|---|
| Postgres | Block all (default) | FIREWALL_EMERGENCY_BYPASS=true with CISO+CTO dual-approval, P0/P1 only, time-boxed |
| Redis | Degrade to Postgres | N/A (automatic) |
| NATS | Operate on stale signals | N/A |
| Federation source | Use last-known-good | Manual upload for urgent entries |
| ML model | Rule-based fallback | N/A |
| HSM | Inbound continues; outbound paused | Manual signing via backup key |
5. Failure ↔ Tenant / Regulator Experience Matrix
| FM | Tenant sees | Regulator sees | Citizen sees |
|---|---|---|---|
| FM-01 Postgres out | 503 FIREWALL_UNAVAILABLE | Audit gap if emergency bypass engaged | Delayed SMS |
| FM-02 Redis out | None (latency slightly up) | None | None |
| FM-03 NATS lag | None (possibly stale verdicts) | None | None |
| FM-04 Federation stale | None | Stale national blocklist (regulator may escalate if > 6 h) | None |
| FM-05 Chain break | None | Audit integrity claim compromised — major | None |
| FM-06 ML out | Possibly reduced detection rate | None (logged) | None |
| FM-07 Fingerprint storm | Possible latency blip | None | None |
| FM-08 FP spike | OTP/transactional failures; tenant escalation | Complaint may arrive | Reduced SMS delivery |
| FM-09 SIM-box FP | Bulk-sender escalation | None | None |
| FM-10 Latency spike | Submit-to-DLR SLA breach risk | None | Possible delay |
| FM-11 Region split | Inconsistent behaviour between regions | None short-term; audit export might differ long-term | None |
| FM-12 HSM out | None | Missed daily federation export (regulator notified) | None |
| FM-13 Rule cycle | Single-message fail-closed | None | None |
6. Open Points
| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact SLO target for FirewallEvaluateLatencyHigh (e.g., P95 ≤ 5 ms or ≤ 10 ms) | SRE |
| FM-OPEN-02 | Emergency bypass governance — CISO+CTO dual-approval or CEO sign-off for extended bypass | Legal + Leadership |
| FM-OPEN-03 | Regional independence on blocklist state — documented as acceptable or eventually-consistent-only | Platform Arch |