sms-firewall-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md

This document catalogs how sms-firewall-service fails, what the tenant / citizen / regulator experiences, how the platform detects each failure, and the designed mitigation path. The service is consulted synchronously by routing-engine on every outbound message and by channel-router-service on inbound MO; any fault is therefore national in impact.

1. Operating Principle: Fail-Closed by Default

sms-firewall-service is a national-perimeter firewall. Its fail-closed posture is the inverse of the usual "availability first" service posture: if we cannot evaluate a message, we do not let it through. Specific exceptions are enumerated in §4 (Graceful Degradation).
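The fail-closed posture can be pictured as a thin wrapper around the evaluator. This is a minimal sketch, not the service's code: `Verdict`, `evaluate_fail_closed`, and `broken_evaluator` are hypothetical names introduced only for illustration.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"


def evaluate_fail_closed(evaluator: Callable[[str], Verdict], message: str) -> Verdict:
    """Return the evaluator's verdict, or BLOCK if evaluation cannot complete."""
    try:
        return evaluator(message)
    except Exception:
        # Any failure to evaluate (timeout, dependency outage, bug) blocks the message.
        return Verdict.BLOCK


def broken_evaluator(message: str) -> Verdict:
    # Stand-in for an evaluator whose backing store is unreachable.
    raise ConnectionError("policy store unreachable")
```

An availability-first service would return ALLOW in the `except` branch; here the default is inverted on purpose.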

2. Failure Mode Summary

#	Name	Class	Detection Time	User-visible effect	Runbook
FM-01	Postgres unavailable	Infra	< 30 s	Firewall `FilterInbound`/`EvaluateTransit` returns 503; routing-engine fails the submit; tenant sees 503 with code `FIREWALL_UNAVAILABLE`	`runbooks/firewall-postgres-out.md`
FM-02	Redis unavailable	Infra	< 10 s	Verdict cache miss → all requests hit Postgres (degraded latency, 30–50 ms P95 vs. 5 ms)	`runbooks/firewall-redis-out.md`
FM-03	NATS consumer lag > 60 s	Infra	60 s	Fraud-intel and consent-ledger signals delayed; blocklist updates stale	`runbooks/firewall-nats-lag.md`
FM-04	National-blocklist federation source unreachable	Dependency	30 min	Blocklist stale; new national-blocklist entries not applied	`runbooks/firewall-federation-stale.md`
FM-05	Blocklist hash-chain break (tamper or bug)	Correctness	24 h	Audit defensibility lost for affected period	`runbooks/firewall-chain-break.md`
FM-06	ML model (AIT / SIM-box) unavailable	Dependency	< 10 s	Fall back to rule-based pattern matcher (degraded detection rate)	`runbooks/firewall-ml-out.md`
FM-07	Fingerprint storm (DDoS of distinct JA3)	Adversarial	2 min	Cache thrashing; increased latency; tarpit triggers	`runbooks/firewall-fingerprint-storm.md`
FM-08	Legitimate-sender false positive spike	Correctness	minutes	Tenant escalations; potential routing failures for genuine traffic	`runbooks/firewall-false-positive-spike.md`
FM-09	SIM-box detector emits high-volume false positives	ML	1 h	Real traffic blocked as SIM-box originated	`runbooks/firewall-simbox-fp.md`
FM-10	`EvaluateTransit` P95 latency spike	Performance	5 min	Submit-to-DLR SLA at risk; SLO burn	`runbooks/firewall-latency-spike.md`
FM-11	Region-partition on blocklist replication	Infra	1 min	Regions diverge on blocklist state	`runbooks/firewall-region-split.md`
FM-12	HSM unavailable → signing of federation export fails	Dependency	< 30 s	Outbound federation paused; inbound continues	`runbooks/firewall-hsm-out.md`
FM-13	Rule-engine recursion (composite-rule cycle)	Code	< 1 s	Single evaluation times out; fail-closed for that message	`runbooks/firewall-rule-cycle.md`

3. Detailed Failure Modes

FM-01 — Postgres unavailable

Scenario. Primary Postgres (or its synchronous replica) is unreachable; pooled connections time out.

Impact. Every outbound SMS is blocked at the firewall (fail-closed). Inbound MO firewall stalls; MO messages queue in NATS.

Detection. /health/ready returns 503 within 2 s; the failing readiness check (not the HPA) causes Kubernetes to remove the pod from Service endpoints; alert FirewallDbUnavailable fires within 30 s (metric firewall_db_connection_errors_total).

Mitigation.

Postgres HA with synchronous replica; automatic fail-over within 30 s.
Multi-region fail-over (manual-gated) ≤ 15 min (quarterly drilled).
Redis-cached verdicts continue to be served until their TTL (300 s) expires, buying a window of up to 5 minutes for traffic that hits the cache.
Emergency bypass via feature flag FIREWALL_EMERGENCY_BYPASS=true — allows P0/P1 lanes to bypass firewall with conspicuous audit (CISO + CTO dual-approval, time-boxed ≤ 1 h).
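The bypass conditions in the last item (flag on, CISO + CTO approval, P0/P1 lanes only, window of at most 1 h) can be expressed as a single predicate. This is a sketch under assumptions: `BypassGrant`, `bypass_allowed`, and the role and lane strings are hypothetical, and the real approval record is not described in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_BYPASS_WINDOW = timedelta(hours=1)   # time box stated in the list above
BYPASS_LANES = {"P0", "P1"}              # only these lanes may bypass
REQUIRED_APPROVERS = {"CISO", "CTO"}     # dual approval


@dataclass(frozen=True)
class BypassGrant:
    approvers: frozenset   # roles that signed off (hypothetical representation)
    granted_at: datetime   # when the bypass was approved (UTC)


def bypass_allowed(flag_enabled: bool, grant: BypassGrant, lane: str, now: datetime) -> bool:
    """True only if the flag is on, both approvers signed, the lane qualifies, and the window is open."""
    if not flag_enabled:
        return False
    if not REQUIRED_APPROVERS <= grant.approvers:
        return False
    if lane not in BYPASS_LANES:
        return False
    return grant.granted_at <= now < grant.granted_at + MAX_BYPASS_WINDOW
```

Every `True` result would also need to be written to the audit log; that part is omitted here.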

Recovery. Once Postgres is reachable, pods recover automatically. The NATS backlog drains; audit gaps (if the bypass was engaged) are flagged in the daily report.

FM-02 — Redis unavailable

Scenario. Redis cluster reachable but degraded (e.g., node failure) or unreachable.

Impact. Hot-path verdict cache misses → all requests hit Postgres. EvaluateTransit P95 drifts from 5 ms → 30–50 ms. SLO burn begins within minutes if load is high.

Detection. Cache-miss ratio metric firewall_cache_miss_ratio spikes and Redis connection errors rise; alert FirewallCachePostgresFallback fires when the miss rate stays above 20% for 5 min.

Mitigation.

Redis HA cluster with 3 primary + 3 replica; automatic fail-over.
Degraded-mode Postgres fallback continues serving correctly, just slower.
In-process LRU of 1 000 most-frequent verdicts (seconds-level TTL) masks Redis flaps.
If sustained > 30 min, HPA scales up to compensate.
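The in-process LRU mentioned above can be sketched with an ordered dictionary. This is illustrative only: the class name `TtlLruCache`, the string verdicts, and the 5 s default TTL are assumptions (the text only says "seconds-level TTL").

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class TtlLruCache:
    """Small LRU cache whose entries also expire after ttl_seconds."""

    def __init__(self, capacity: int = 1000, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, verdict), oldest first

    def get(self, key: str) -> Optional[str]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, verdict = item
        if self._clock() >= expires_at:
            # Expired: drop it so the caller falls through to Redis or Postgres.
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return verdict

    def put(self, key: str, verdict: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, verdict)
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

A short TTL keeps the window for serving a stale verdict small while still absorbing brief Redis flaps.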

Recovery. Redis recovery repopulates cache naturally; latency returns to normal within 5 min.

FM-03 — NATS consumer lag > 60 s

Scenario. Consumer on fraud.detected.*, consent.revoked.v1, or sender.id.suspended.v1 falls behind.

Impact. New fraud signals, consent revocations, and sender-ID suspensions are delayed; the firewall keeps operating on stale state. Risk: a just-suspended sender continues to pass the firewall for as long as the consumer lag lasts.

Detection. NATS lag metric per consumer; alert at 60 s.

Mitigation.

Durable consumers with explicit ACK; DLQ for poison messages.
Multiple consumer instances (queue-group of N pods) — parallel processing.
Back-pressure signal: if lag > 300 s, emit firewall.signal.degraded.v1 so downstream can compensate.
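The two lag thresholds (alert at 60 s, degraded signal above 300 s) can be combined in one small decision function. This is a sketch; `lag_actions` and the action strings are hypothetical, and publishing to NATS is left out.

```python
ALERT_LAG_S = 60       # alert threshold from the Detection paragraph
DEGRADED_LAG_S = 300   # back-pressure threshold from the list above
DEGRADED_SUBJECT = "firewall.signal.degraded.v1"


def lag_actions(lag_seconds: float) -> list:
    """Return the actions a lag monitor would take for one consumer."""
    actions = []
    if lag_seconds >= ALERT_LAG_S:
        actions.append("alert")
    if lag_seconds > DEGRADED_LAG_S:
        # Tell downstream services that verdicts may rest on stale signals.
        actions.append("publish:" + DEGRADED_SUBJECT)
    return actions
```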

Recovery. Consumer catches up within minutes once load returns to normal. Lag metric guides HPA.

FM-04 — Federation source unreachable

Scenario. National-blocklist federation partner (MNO or regulator) rejects the sync poll or times out.

Impact. Blocklist stale — new national-blocklist entries not applied to local firewall.

Detection. Federation sync metric firewall_federation_last_sync_age_seconds > 30 min; alert FirewallFederationStale.

Mitigation.

Previous blocklist still enforced (last-known-good).
Circuit breaker on federation client: 3 consecutive failures → back off exponentially to 1 h max.
Manual upload fallback via admin-dashboard for urgent entries.
Federation is multi-source (if one MNO is down, others still update).
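The back-off rule for the federation client (3 consecutive failures, then exponential back-off capped at 1 h) can be sketched as a delay function. The 60 s base poll interval and the doubling factor are assumptions; the document gives only the failure threshold and the cap.

```python
BASE_INTERVAL_S = 60      # assumed normal poll interval (not stated in this document)
FAILURE_THRESHOLD = 3     # consecutive failures before backing off
MAX_BACKOFF_S = 3600      # 1 h cap


def next_poll_delay(consecutive_failures: int) -> int:
    """Seconds to wait before the next federation sync attempt."""
    if consecutive_failures < FAILURE_THRESHOLD:
        return BASE_INTERVAL_S
    # Double the delay for each failure at or beyond the threshold, capped at 1 h.
    exponent = consecutive_failures - FAILURE_THRESHOLD + 1
    return min(BASE_INTERVAL_S * 2 ** exponent, MAX_BACKOFF_S)
```

With these assumed values the delay runs 60, 60, 60, 120, 240, ... and reaches the 1 h cap after the ninth consecutive failure. Because federation is multi-source, each source would keep its own failure counter.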

Recovery. Automatic retry; next successful sync merges the diff.

FM-05 — Blocklist hash-chain break

Scenario. Daily hash-chain verifier detects that a row's record_hash does not match sha256(payload || prev_hash).
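The check in the scenario, record_hash = sha256(payload || prev_hash), can be sketched as follows. Treating `||` as raw byte concatenation and using an all-zero origin hash are assumptions; the function names are hypothetical.

```python
import hashlib
from typing import List, Optional, Tuple

Row = Tuple[bytes, bytes]  # (payload, stored record_hash)
ZERO_ORIGIN = b"\x00" * 32  # assumed chain origin


def record_hash(payload: bytes, prev_hash: bytes) -> bytes:
    """sha256(payload || prev_hash), with || read as byte concatenation."""
    return hashlib.sha256(payload + prev_hash).digest()


def build_chain(payloads: List[bytes], origin_hash: bytes = ZERO_ORIGIN) -> List[Row]:
    """Build a valid chain; used here only to produce example data."""
    rows, prev = [], origin_hash
    for payload in payloads:
        digest = record_hash(payload, prev)
        rows.append((payload, digest))
        prev = digest
    return rows


def first_broken_index(rows: List[Row], origin_hash: bytes = ZERO_ORIGIN) -> Optional[int]:
    """Return the index of the first row whose stored hash does not verify, else None."""
    prev = origin_hash
    for index, (payload, stored) in enumerate(rows):
        if record_hash(payload, prev) != stored:
            return index
        prev = stored
    return None
```

The verifier stops at the first mismatch because every later row depends on the broken one; that index marks where a quarantined partition and a new chain origin would begin.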

Impact. The audit log can no longer be defended to the regulator for the affected period.

Detection. Daily verifier cron; alert FirewallChainBroken (Critical).

Mitigation.

Immediate investigation of the likely causes: log read-write anomaly, schema tampering, or verifier bug.
Quarantine affected partition; subsequent rows continue from a new chain origin.
If root cause is tamper, escalate to CISO + Legal.
If root cause is verifier bug, recompute chain after bugfix; regulator notified if audit was already submitted with corrupt chain.

Recovery. Depends on root cause. Audit volume is typically ~10 k rows/h, so quarantining a partition has a minor operational cost but major audit implications.

FM-06 — ML model unavailable

Scenario. AIT-detection or SIM-box-detection ML model serving (Triton / TorchServe) returns 503.

Impact. ML-assisted detection offline; rule-based matchers (which are lower-recall) continue.

Detection. firewall_ml_inference_errors_total; circuit breaker opens after 3 consecutive errors; fallback engages immediately.

Mitigation.

Fallback to rule-based patterns (still detects common AIT/SIM-box signatures).
Rule-based verdict marked with detectionMode: "FALLBACK" in audit.
Alert FirewallMlUnavailable fires.
Model-serving is HA (3 replicas); single-pod failure auto-recovers.
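The breaker behaviour described in this section (open after 3 consecutive errors, close after 30 s of healthy responses, mark fallback verdicts in audit) can be sketched as below. The class and function names are hypothetical, and the "ML" mode label is an assumption; only "FALLBACK" appears in this document. The sketch also assumes the model keeps being probed while the breaker is open.

```python
import time
from typing import Callable, Dict, Optional


class MlCircuitBreaker:
    def __init__(self, open_after: int = 3, close_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._open_after = open_after
        self._close_after_s = close_after_s
        self._clock = clock
        self._consecutive_errors = 0
        self._healthy_since: Optional[float] = None  # start of healthy streak while open
        self.is_open = False

    def record_error(self) -> None:
        self._consecutive_errors += 1
        self._healthy_since = None  # any error resets the healthy streak
        if self._consecutive_errors >= self._open_after:
            self.is_open = True

    def record_success(self) -> None:
        self._consecutive_errors = 0
        if not self.is_open:
            return
        if self._healthy_since is None:
            self._healthy_since = self._clock()
        elif self._clock() - self._healthy_since >= self._close_after_s:
            self.is_open = False
            self._healthy_since = None


def verdict_with_mode(breaker: MlCircuitBreaker, ml_verdict: str, rule_verdict: str) -> Dict[str, str]:
    """Use the rule-based verdict while the breaker is open and label it for audit."""
    if breaker.is_open:
        return {"verdict": rule_verdict, "detectionMode": "FALLBACK"}
    return {"verdict": ml_verdict, "detectionMode": "ML"}
```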

Recovery. Model back online → circuit closes after 30 s of healthy responses.

FM-07 — Fingerprint storm (adversarial)

Scenario. Attacker rotates JA3 fingerprints at high rate to exhaust cache / rate-limit buckets.

Impact. Cache thrashing; legitimate traffic evicted; possible latency spike.

Detection. firewall_distinct_ja3_per_minute > 10 000; alert FirewallFingerprintStorm.

Mitigation.

Cloudflare + Kong edge defence (per EP-KONG-06) absorbs most of the storm before it reaches the firewall.
In-service: cache eviction by LFU; top-100 fingerprints pinned.
Automatic tarpit deployment: new fingerprints with an elevated error rate receive deliberately slowed responses.
Escalation: if unresolved in 15 min, enable tighter Kong filter.
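The in-service measure (LFU eviction with pinned fingerprints) can be illustrated with a small cache. This is a sketch only: `PinnedLfuCache` is a hypothetical name, and the linear scan for the eviction victim is chosen for brevity, not for production throughput.

```python
from typing import Dict, Iterable, Optional


class PinnedLfuCache:
    """Least-frequently-used cache that never evicts pinned keys."""

    def __init__(self, capacity: int, pinned: Iterable[str] = ()) -> None:
        self._capacity = capacity
        self._pinned = set(pinned)
        self._values: Dict[str, str] = {}
        self._hits: Dict[str, int] = {}

    def get(self, key: str) -> Optional[str]:
        if key in self._values:
            self._hits[key] += 1
            return self._values[key]
        return None

    def put(self, key: str, value: str) -> None:
        if key not in self._values and len(self._values) >= self._capacity:
            candidates = [k for k in self._values if k not in self._pinned]
            if not candidates:
                return  # everything is pinned; drop the new entry instead
            victim = min(candidates, key=lambda k: self._hits[k])
            del self._values[victim]
            del self._hits[victim]
        self._values[key] = value
        self._hits.setdefault(key, 0)
```

During a storm of one-off fingerprints, new keys arrive with zero hits and evict each other, while pinned and frequently hit entries stay resident.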

Recovery. Storm ebbs naturally or is blocked at edge; service recovers in minutes.

FM-08 — Legitimate-sender false-positive spike

Scenario. A rule update or AIT model update causes > 1% of legitimate traffic to be BLOCKED.

Impact. Tenant escalations; OTP/transactional delivery failures; regulator exposure.

Detection. Tenant-reported complaints; metric firewall_block_rate anomaly; SLO breach.

Mitigation.

Shadow-mode rule deployment (per EP-CE-19 pattern) is required for all rule updates.
Automatic rollback if BLOCK rate > baseline + 50% sustained 10 min.
Per-tenant whitelist for known design-partner tenants during sensitive windows.
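The automatic-rollback condition can be sketched as a check over per-minute BLOCK-rate samples. "Baseline + 50%" is read here as 1.5 times the baseline; that reading, the one-sample-per-minute cadence, and the function name are assumptions.

```python
from typing import Sequence


def should_roll_back(block_rates: Sequence[float], baseline: float,
                     sustain_samples: int = 10, relative_margin: float = 0.5) -> bool:
    """block_rates holds one sample per minute, oldest first.

    Returns True when each of the last `sustain_samples` samples exceeds
    baseline * (1 + relative_margin).
    """
    if len(block_rates) < sustain_samples:
        return False  # not enough history to call the breach sustained
    threshold = baseline * (1.0 + relative_margin)
    return all(rate > threshold for rate in block_rates[-sustain_samples:])
```

Requiring every sample in the window to breach the threshold avoids rolling back on a single noisy minute.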

Recovery. Rollback the offending rule; incident report within 24 h.

FM-09 — SIM-box detector false positives

Scenario. ML model flags legitimate high-volume MNO gateway traffic as SIM-box.

Impact. Real traffic blocked.

Detection. Tenant escalation; fraud.detected.simbox volume anomaly; manual review confirms false positives.

Mitigation.

Human-in-the-loop triage for SIM-box detections before auto-block (except highest-confidence tier).
Per-tenant exemption workflow for verified bulk senders.
Model re-training with updated negative examples.

Recovery. Exemption applied immediately; model retrained within 7 d.

FM-10 — `EvaluateTransit` P95 latency spike

Scenario. Code path regression or DB slow query pushes P95 from 5 ms to > 30 ms.

Impact. Submit-to-DLR SLA (OTP 3 s target) at risk.

Detection. Prometheus histogram firewall_evaluate_transit_seconds P95 > 20 ms for 5 min.
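For readers unfamiliar with how a P95 is derived from a histogram, the sketch below estimates a quantile from cumulative buckets by linear interpolation inside the bucket that contains the target rank, which is the general approach Prometheus-style histograms use. The bucket bounds and counts are made-up example data and the function name is hypothetical.

```python
from typing import List, Tuple


def quantile_from_buckets(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate quantile q from (upper_bound_seconds, cumulative_count) pairs.

    Buckets must be sorted by upper bound; the lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return upper
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return buckets[-1][0]


# Example data: 90 requests at or under 5 ms, 96 at or under 10 ms, 100 at or under 20 ms.
example_buckets = [(0.005, 90), (0.010, 96), (0.020, 100)]
p95 = quantile_from_buckets(0.95, example_buckets)
breaches_threshold = p95 > 0.020  # the 20 ms detection threshold from the text
```

The estimate is only as fine as the bucket layout, so bucket boundaries near the 15 ms and 20 ms thresholds matter for alert accuracy.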

Mitigation.

Automatic rollback on canary P95 > 15 ms for 5 min.
Horizontal scale-out via HPA on RPS.
Query-plan regression detection in staging under load.

Recovery. Rollback; post-mortem within 48 h.

FM-11 — Region partition on blocklist replication

Scenario. Kabul↔Mazar logical replication partition prevents blocklist updates flowing cross-region.

Impact. Regions diverge; an entry added in Kabul is not seen in Mazar for the duration of the partition.

Detection. firewall_region_divergence_row_count metric > 0 for > 5 min.

Mitigation.

Replication monitoring + automatic re-sync.
Regions operate independently (each enforces own local blocklist).
Drift reconciliation hourly; alert if > 100 rows divergent for 1 h.
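The divergence figure used for detection and for the hourly drift alert can be thought of as the size of a symmetric difference between the two regions' blocklists. This is a sketch: representing entries as string identifiers and the function names are assumptions.

```python
from typing import Set

DIVERGENCE_ALERT_ROWS = 100  # alert threshold from the list above


def divergent_entries(kabul: Set[str], mazar: Set[str]) -> Set[str]:
    """Entries present in exactly one of the two regions."""
    return kabul ^ mazar


def exceeds_alert_threshold(kabul: Set[str], mazar: Set[str]) -> bool:
    """True when more rows diverge than the hourly reconciliation tolerates."""
    return len(divergent_entries(kabul, mazar)) > DIVERGENCE_ALERT_ROWS
```

Whether the alert fires also depends on the divergence persisting for 1 h, which would be handled by the alerting rule rather than by this function.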

Recovery. Partition heals → replication catches up → reconciliation cron merges.

FM-12 — HSM unavailable

Scenario. HSM outage blocks signing of daily federation export.

Impact. Outbound federation paused; inbound firewall operations continue unaffected (audit hash-chain uses software SHA-256, not HSM).

Detection. Sign operation fails; export job queues; alert FirewallHsmUnavailable.

Mitigation.

HSM HA (ADR-0004 §11) with regional quorum.
Export job retries on HSM recovery.
Manual export signing via Security-team-held backup key (dual-control).

Recovery. HSM recovery → export job drains queue automatically.

FM-13 — Rule-engine recursion

Scenario. A composite-rule reference loop is created via manual rule authoring.

Impact. Single evaluation hits recursion limit and times out; fail-closed for that message.

Detection. firewall_rule_recursion_errors_total counter increment; alert at > 10/min.

Mitigation.

Save-time cycle detection (inherited from compliance-engine composite-rule pattern).
Runtime recursion depth cap (5) with visited-set tracking.
Per-message evaluation budget (50 ms P99); exceeded → fail-closed + alert.
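The first two mitigations can be sketched together: a save-time cycle check over the rule reference graph, and a runtime walk guarded by a depth cap and a set of rules on the current path. The graph representation and all names are hypothetical; counting the top-level rule as depth 1 is one possible reading of "depth cap (5)".

```python
from typing import Dict, List, Optional, Set

RuleGraph = Dict[str, List[str]]  # rule id -> ids of the rules it references
MAX_DEPTH = 5


class RuleEvaluationError(Exception):
    """Raised when evaluation must stop; the caller treats this as fail-closed."""


def has_cycle(graph: RuleGraph) -> bool:
    """Save-time check: depth-first search that looks for a back edge."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {rule: UNSEEN for rule in graph}

    def visit(rule: str) -> bool:
        state[rule] = IN_PROGRESS
        for ref in graph.get(rule, []):
            ref_state = state.get(ref, UNSEEN)
            if ref_state == IN_PROGRESS:
                return True  # reference back into the current path
            if ref_state == UNSEEN and visit(ref):
                return True
        state[rule] = DONE
        return False

    return any(state[rule] == UNSEEN and visit(rule) for rule in list(graph))


def evaluate_refs(rule: str, graph: RuleGraph, depth: int = 1,
                  path: Optional[Set[str]] = None) -> int:
    """Runtime walk; returns the number of rule evaluations performed."""
    path = set() if path is None else path
    if depth > MAX_DEPTH:
        raise RuleEvaluationError("depth cap exceeded at " + rule)
    if rule in path:
        raise RuleEvaluationError("cycle detected at " + rule)
    path.add(rule)
    try:
        return 1 + sum(evaluate_refs(ref, graph, depth + 1, path)
                       for ref in graph.get(rule, []))
    finally:
        path.discard(rule)  # a rule shared by two branches is not a cycle
```

The runtime guard is kept even with the save-time check in place, since rules can reach the store by paths that skip validation.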

Recovery. Admin removes the offending rule; affected period has a reviewable audit log.

4. Graceful Degradation Summary

Failure domain	Fail-closed action	Optional bypass
Postgres	Block all (default)	`FIREWALL_EMERGENCY_BYPASS=true` with CISO+CTO dual-approval, P0/P1 only, time-boxed
Redis	Degrade to Postgres	N/A (automatic)
NATS	Operate on stale signals	N/A
Federation source	Use last-known-good	Manual upload for urgent entries
ML model	Rule-based fallback	N/A
HSM	Inbound continues; outbound paused	Manual signing via backup key

5. Failure ↔ Tenant / Regulator / Citizen Experience Matrix

FM	Tenant sees	Regulator sees	Citizen sees
FM-01 Postgres out	503 `FIREWALL_UNAVAILABLE`	Audit gap if emergency bypass engaged	Delayed SMS
FM-02 Redis out	None (latency slightly up)	None	None
FM-03 NATS lag	None (possibly stale verdicts)	None	None
FM-04 Federation stale	None	Stale national blocklist (regulator may escalate if > 6 h)	None
FM-05 Chain break	None	Audit integrity claim compromised — major	None
FM-06 ML out	Possibly reduced detection rate	None (logged)	None
FM-07 Fingerprint storm	Possible latency blip	None	None
FM-08 FP spike	OTP/transactional failures; tenant escalation	Complaint may arrive	Reduced SMS delivery
FM-09 SIM-box FP	Bulk-sender escalation	None	None
FM-10 Latency spike	Submit-to-DLR SLA breach risk	None	Possible delay
FM-11 Region split	Inconsistent behaviour between regions	None short-term; audit export might differ long-term	None
FM-12 HSM out	None	Missed daily federation export (regulator notified)	None
FM-13 Rule cycle	Single-message fail-closed	None	None

6. Open Points

ID	Question	Owner
FM-OPEN-01	Exact SLO target for FirewallEvaluateLatencyHigh (e.g., P95 ≤ 5 ms or ≤ 10 ms)	SRE
FM-OPEN-02	Emergency bypass governance — CISO+CTO dual-approval or CEO sign-off for extended bypass	Legal + Leadership
FM-OPEN-03	Regional independence on blocklist state — documented as acceptable or eventually-consistent-only	Platform Arch
