sms-firewall-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md

This document catalogs how sms-firewall-service fails, what the tenant, citizen, and regulator experience, how the platform detects each failure, and the designed mitigation path. The service is consulted synchronously by routing-engine on every outbound message and by channel-router-service on every inbound mobile-originated (MO) message, so any fault is national in impact.


1. Operating Principle: Fail-Closed by Default

sms-firewall-service is a national-perimeter firewall. Its fail-closed posture is the inverse of the usual "availability first" service posture: if we cannot evaluate a message, we do not let it through. Specific exceptions are enumerated in §4 (Graceful Degradation).
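The fail-closed rule can be illustrated with a minimal Python sketch. The names `Verdict`, `evaluate_fail_closed`, and `broken_evaluator` are hypothetical and are not taken from the service's code; the only point is that an evaluation error maps to a block, never to an allow.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"


def evaluate_fail_closed(evaluate: Callable[[str], Verdict], message: str) -> Verdict:
    """Return the evaluator's verdict, or BLOCK if evaluation cannot complete."""
    try:
        return evaluate(message)
    except Exception:
        # Fail-closed: an evaluation error is treated as a block, never as an allow.
        return Verdict.BLOCK


def broken_evaluator(message: str) -> Verdict:
    """Hypothetical evaluator that simulates an unreachable dependency."""
    raise ConnectionError("policy store unreachable")
```

With this wrapper, `evaluate_fail_closed(broken_evaluator, "hello")` yields `Verdict.BLOCK`, while a healthy evaluator's verdict is passed through unchanged.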


2. Failure Mode Summary

| # | Name | Class | Detection Time | User-visible effect | Runbook |
|---|---|---|---|---|---|
| FM-01 | Postgres unavailable | Infra | < 30 s | Firewall FilterInbound/EvaluateTransit returns 503; routing-engine fails the submit; tenant sees 503 with code FIREWALL_UNAVAILABLE | runbooks/firewall-postgres-out.md |
| FM-02 | Redis unavailable | Infra | < 10 s | Verdict cache miss → all requests hit Postgres (degraded latency, ~30 ms P95 vs. 5 ms) | runbooks/firewall-redis-out.md |
| FM-03 | NATS consumer lag > 60 s | Infra | 60 s | Fraud-intel and consent-ledger signals delayed; blocklist updates stale | runbooks/firewall-nats-lag.md |
| FM-04 | National-blocklist federation source unreachable | Dependency | 30 min | Blocklist stale; new national-blocklist entries not applied | runbooks/firewall-federation-stale.md |
| FM-05 | Blocklist hash-chain break (tamper or bug) | Correctness | 24 h | Audit defensibility lost for affected period | runbooks/firewall-chain-break.md |
| FM-06 | ML model (AIT / SIM-box) unavailable | Dependency | < 10 s | Fall back to rule-based pattern matcher (degraded detection rate) | runbooks/firewall-ml-out.md |
| FM-07 | Fingerprint storm (DDoS of distinct JA3) | Adversarial | 2 min | Cache thrashing; increased latency; tarpit triggers | runbooks/firewall-fingerprint-storm.md |
| FM-08 | Legitimate-sender false positive spike | Correctness | minutes | Tenant escalations; potential routing failures for genuine traffic | runbooks/firewall-false-positive-spike.md |
| FM-09 | SIM-box detector emits high-volume false positives | ML | 1 h | Real traffic blocked as SIM-box originated | runbooks/firewall-simbox-fp.md |
| FM-10 | EvaluateTransit P95 latency spike | Performance | 5 min | Submit-to-DLR SLA at risk; SLO burn | runbooks/firewall-latency-spike.md |
| FM-11 | Region-partition on blocklist replication | Infra | 1 min | Regions diverge on blocklist state | runbooks/firewall-region-split.md |
| FM-12 | HSM unavailable → signing of federation export fails | Dependency | < 30 s | Outbound federation paused; inbound continues | runbooks/firewall-hsm-out.md |
| FM-13 | Rule-engine recursion (composite-rule cycle) | Code | < 1 s | Single evaluation times out; fail-closed for that message | runbooks/firewall-rule-cycle.md |

3. Detailed Failure Modes

FM-01 — Postgres unavailable

Scenario. Primary Postgres (or its synchronous replica) is unreachable; pooled connections time out.

Impact. Every outbound SMS is blocked at the firewall (fail-closed). Inbound MO firewall stalls; MO messages queue in NATS.

Detection. /health/ready returns 503 within 2 s; Kubernetes removes the not-ready pod from Service endpoints; alert FirewallDbUnavailable fires within 30 s (metric firewall_db_connection_errors_total).

Mitigation.

  1. Postgres HA with synchronous replica; automatic fail-over within 30 s.
  2. Multi-region fail-over (manual-gated) ≤ 15 min (quarterly drilled).
  3. Redis-cached verdicts continue to serve up to TTL (300 s), buying a 5-minute window.
  4. Emergency bypass via feature flag FIREWALL_EMERGENCY_BYPASS=true — allows P0/P1 lanes to bypass firewall with conspicuous audit (CISO + CTO dual-approval, time-boxed ≤ 1 h).
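The conditions attached to the emergency bypass (dual approval, P0/P1 lanes only, time-boxed to at most 1 h) can be sketched as a single guard function. This is an illustrative Python sketch under stated assumptions: `BypassGrant`, `bypass_allowed`, and the way approvals are recorded are hypothetical, and the real flag handling may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional

MAX_BYPASS_WINDOW = timedelta(hours=1)          # time-box from the mitigation list
BYPASS_LANES = frozenset({"P0", "P1"})          # only these lanes may bypass
REQUIRED_APPROVERS = frozenset({"CISO", "CTO"})  # dual approval


@dataclass(frozen=True)
class BypassGrant:
    """Hypothetical record of an approved emergency bypass."""
    approvers: FrozenSet[str]
    granted_at: datetime
    duration: timedelta


def bypass_allowed(grant: Optional[BypassGrant], lane: str, now: datetime) -> bool:
    """Return True only if every bypass condition holds; otherwise stay fail-closed."""
    if grant is None:
        return False
    if not REQUIRED_APPROVERS <= grant.approvers:
        return False  # both approvers must have signed off
    if grant.duration > MAX_BYPASS_WINDOW:
        return False  # a grant longer than the time-box is invalid as a whole
    if lane not in BYPASS_LANES:
        return False
    return grant.granted_at <= now < grant.granted_at + grant.duration
```

Every rejected branch returns `False`, so a malformed or expired grant leaves the firewall in its default fail-closed posture.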

Recovery. Once Postgres is reachable, pods recover automatically. The NATS backlog drains, and any audit gaps (if the bypass was engaged) are flagged in the daily report.


FM-02 — Redis unavailable

Scenario. Redis cluster reachable but degraded (e.g., node failure) or unreachable.

Impact. Hot-path verdict cache misses → all requests hit Postgres. EvaluateTransit P95 drifts from 5 ms → 30–50 ms. SLO burn begins within minutes if load is high.

Detection. Cache-miss ratio metric firewall_cache_miss_ratio spikes; Redis connection errors appear; alert FirewallCachePostgresFallback fires at > 20% miss rate sustained for 5 min.

Mitigation.

  1. Redis HA cluster with 3 primary + 3 replica; automatic fail-over.
  2. Degraded-mode Postgres fallback continues serving correctly, just slower.
  3. In-process LRU of 1 000 most-frequent verdicts (seconds-level TTL) masks Redis flaps.
  4. If sustained > 30 min, HPA scales up to compensate.
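The in-process cache from item 3 can be sketched as a small LRU with a short TTL. This is a minimal Python sketch, not the service's implementation: the class name `VerdictLRU`, the 5-second default TTL, and the injectable `clock` parameter are assumptions made for illustration (the source only says "seconds-level TTL"). A plain LRU keeps recently used keys, which only approximates "most-frequent".

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class VerdictLRU:
    """Bounded in-process cache: least-recently-used eviction plus per-entry TTL."""

    def __init__(self, capacity: int = 1000, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, verdict)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, verdict = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return verdict

    def put(self, key: str, verdict: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, verdict)
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

Keeping the TTL at a few seconds bounds how long a stale verdict can be served while Redis is flapping, at the cost of more Postgres reads.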

Recovery. Redis recovery repopulates cache naturally; latency returns to normal within 5 min.


FM-03 — NATS consumer lag > 60 s

Scenario. Consumer on fraud.detected.*, consent.revoked.v1, or sender.id.suspended.v1 falls behind.

Impact. New fraud signals, consent revocations, and sender-ID suspensions are delayed, so the firewall keeps operating on stale state. Risk: a just-suspended sender continues to pass the firewall for as long as the consumer lag lasts.

Detection. NATS lag metric per consumer; alert at 60 s.

Mitigation.

  1. Durable consumers with explicit ACK; DLQ for poison messages.
  2. Multiple consumer instances (queue-group of N pods) — parallel processing.
  3. Back-pressure signal: if lag > 300 s, emit firewall.signal.degraded.v1 so downstream can compensate.
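The two lag thresholds (alert at 60 s, degraded signal above 300 s) can be expressed as a small decision function. This Python sketch is illustrative only: `lag_actions` and the `"alert"` / `"publish:..."` action strings are hypothetical, and a real consumer would publish to NATS rather than return strings.

```python
from typing import List

ALERT_LAG_SECONDS = 60       # alert threshold from the Detection paragraph
DEGRADED_LAG_SECONDS = 300   # back-pressure threshold from item 3
DEGRADED_SUBJECT = "firewall.signal.degraded.v1"


def lag_actions(lag_seconds: float) -> List[str]:
    """Map a consumer's current lag to the actions the thresholds call for."""
    actions = []
    if lag_seconds > ALERT_LAG_SECONDS:
        actions.append("alert")
    if lag_seconds > DEGRADED_LAG_SECONDS:
        # Downstream services can compensate once they see this subject.
        actions.append(f"publish:{DEGRADED_SUBJECT}")
    return actions
```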

Recovery. Consumer catches up within minutes once load returns to normal. Lag metric guides HPA.


FM-04 — Federation source unreachable

Scenario. National-blocklist federation partner (MNO or regulator) rejects the sync poll or times out.

Impact. Blocklist stale — new national-blocklist entries not applied to local firewall.

Detection. Federation sync metric firewall_federation_last_sync_age_seconds exceeds 1 800 s (30 min); alert FirewallFederationStale fires.

Mitigation.

  1. Previous blocklist still enforced (last-known-good).
  2. Circuit breaker on federation client: 3 consecutive failures → back off exponentially to 1 h max.
  3. Manual upload fallback via admin-dashboard for urgent entries.
  4. Federation is multi-source (if one MNO is down, others still update).
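The back-off schedule in item 2 can be sketched as a pure function of the consecutive-failure count. This Python sketch makes two assumptions the source does not state: a normal poll interval of 300 s, and a back-off that doubles that interval on each failure once the breaker trips. Only the trip threshold (3 failures) and the 1 h cap come from the text above.

```python
FAILURE_THRESHOLD = 3        # consecutive failures before backing off
MAX_BACKOFF_SECONDS = 3600   # 1 h cap from the mitigation list


def next_poll_delay(consecutive_failures: int, normal_interval: float = 300.0) -> float:
    """Seconds to wait before the next federation sync poll."""
    if consecutive_failures < FAILURE_THRESHOLD:
        return normal_interval
    # Double the interval per failure past the threshold; bound the exponent
    # so very long outages cannot overflow the arithmetic.
    exponent = min(consecutive_failures - FAILURE_THRESHOLD + 1, 20)
    return min(normal_interval * 2 ** exponent, MAX_BACKOFF_SECONDS)
```

Under these assumptions the delays run 300 s, 300 s, 300 s, then 600 s, 1 200 s, 2 400 s, and 3 600 s from the sixth consecutive failure onward.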

Recovery. Automatic retry; next successful sync merges the diff.


FM-05 — Blocklist hash-chain break

Scenario. Daily hash-chain verifier detects that a row's record_hash does not match sha256(payload || prev_hash).
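The verifier's check can be sketched in a few lines of Python using the formula above, `record_hash = sha256(payload || prev_hash)`. The all-zero `GENESIS_HASH`, the `(payload, stored_hash)` row layout, and the function names are assumptions for illustration; the real chain origin and column types may differ.

```python
import hashlib
from typing import Iterable, Optional, Tuple

GENESIS_HASH = b"\x00" * 32  # assumed chain origin; the real origin may differ


def record_hash(payload: bytes, prev_hash: bytes) -> bytes:
    """sha256(payload || prev_hash), as described in the scenario."""
    return hashlib.sha256(payload + prev_hash).digest()


def first_broken_index(rows: Iterable[Tuple[bytes, bytes]],
                       origin: bytes = GENESIS_HASH) -> Optional[int]:
    """Return the index of the first row whose stored hash fails, or None."""
    prev = origin
    for index, (payload, stored_hash) in enumerate(rows):
        if record_hash(payload, prev) != stored_hash:
            return index
        prev = stored_hash
    return None
```

Because each hash covers the previous one, changing a single payload invalidates that row and every later row, which is why the mitigation restarts the chain from a new origin after quarantine.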

Impact. The audit log is no longer regulator-defensible for the affected period.

Detection. Daily verifier cron; alert FirewallChainBroken (Critical).

Mitigation.

  1. Immediate investigation — log read-write anomaly, schema tampering, or verifier bug.
  2. Quarantine affected partition; subsequent rows continue from a new chain origin.
  3. If root cause is tamper, escalate to CISO + Legal.
  4. If root cause is verifier bug, recompute chain after bugfix; regulator notified if audit was already submitted with corrupt chain.

Recovery. Depends on root cause. Audit row volume typically ~10 k/h, so partition quarantine has minor operational cost but major audit implication.


FM-06 — ML model unavailable

Scenario. AIT-detection or SIM-box-detection ML model serving (Triton / TorchServe) returns 503.

Impact. ML-assisted detection offline; rule-based matchers (which are lower-recall) continue.

Detection. firewall_ml_inference_errors_total; circuit breaker opens after 3 consecutive errors; fallback engages immediately.

Mitigation.

  1. Fallback to rule-based patterns (still detects common AIT/SIM-box signatures).
  2. Rule-based verdict marked with detectionMode: "FALLBACK" in audit.
  3. Alert FirewallMlUnavailable fires.
  4. Model-serving is HA (3 replicas); single-pod failure auto-recovers.
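The fallback behaviour (immediate rule-based fallback on error, breaker opens after 3 consecutive errors, closes after 30 s of healthy responses) can be sketched as follows. This is a simplified Python sketch: `MlCircuitBreaker`, the `"ML"` mode label, and probing the model on every request while the breaker is open are assumptions. Only the `"FALLBACK"` label and the two thresholds come from the text above.

```python
import time
from typing import Callable, Optional, Tuple

ERROR_THRESHOLD = 3       # consecutive errors before the breaker opens
RECOVERY_SECONDS = 30.0   # healthy period required before it closes


class MlCircuitBreaker:
    def __init__(self, ml_model: Callable[[str], str],
                 rule_matcher: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ml = ml_model
        self._rules = rule_matcher
        self._clock = clock
        self._consecutive_errors = 0
        self._healthy_since: Optional[float] = None
        self.is_open = False

    def evaluate(self, message: str) -> Tuple[str, str]:
        """Return (verdict, detection_mode) for one message."""
        try:
            ml_verdict = self._ml(message)
        except Exception:
            self._consecutive_errors += 1
            self._healthy_since = None
            if self._consecutive_errors >= ERROR_THRESHOLD:
                self.is_open = True
            return self._rules(message), "FALLBACK"  # fallback engages immediately
        self._consecutive_errors = 0
        if not self.is_open:
            return ml_verdict, "ML"
        # Breaker is open: the ML call above served only as a health probe.
        now = self._clock()
        if self._healthy_since is None:
            self._healthy_since = now
        if now - self._healthy_since >= RECOVERY_SECONDS:
            self.is_open = False
            self._healthy_since = None
            return ml_verdict, "ML"
        return self._rules(message), "FALLBACK"
```

The returned mode string is what an audit record could carry in its `detectionMode` field, so reviewers can tell which verdicts were produced without the model.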

Recovery. Model back online → circuit closes after 30 s of healthy responses.


FM-07 — Fingerprint storm (adversarial)

Scenario. Attacker rotates JA3 fingerprints at high rate to exhaust cache / rate-limit buckets.

Impact. Cache thrashing; legitimate traffic evicted; possible latency spike.

Detection. firewall_distinct_ja3_per_minute > 10 000; alert FirewallFingerprintStorm.

Mitigation.

  1. Cloudflare + Kong edge defence (per EP-KONG-06) absorbs most before reaching firewall.
  2. In-service: cache eviction by LFU; top-100 fingerprints pinned.
  3. Automatic tarpit deployment: new fingerprints with elevated error rate get slow-response.
  4. Escalation: if unresolved in 15 min, enable tighter Kong filter.

Recovery. Storm ebbs naturally or is blocked at edge; service recovers in minutes.


FM-08 — Legitimate-sender false-positive spike

Scenario. A rule update or AIT model update causes > 1% of legitimate traffic to be BLOCKED.

Impact. Tenant escalations; OTP/transactional delivery failures; regulator exposure.

Detection. Tenant-reported complaints; metric firewall_block_rate anomaly; SLO breach.

Mitigation.

  1. Shadow-mode rule deployment (per EP-CE-19 pattern) is required for all rule updates.
  2. Automatic rollback if BLOCK rate > baseline + 50% sustained 10 min.
  3. Per-tenant whitelist for known design-partner tenants during sensitive windows.
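The automatic-rollback condition in item 2 can be sketched as a sustained-threshold check. This Python sketch reads "baseline + 50%" as 1.5 times the baseline block rate, which is an assumption: the text could also mean 50 percentage points. The function name and the `(timestamp_seconds, block_rate)` sample format are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def should_roll_back(samples: Sequence[Tuple[float, float]], baseline: float,
                     window_seconds: float = 600.0, margin: float = 0.5) -> bool:
    """True if the block rate has stayed above threshold for the whole window.

    samples: (timestamp_seconds, block_rate) pairs in ascending time order.
    """
    if not samples:
        return False
    threshold = baseline * (1.0 + margin)  # assumed relative interpretation
    breach_start: Optional[float] = None
    for timestamp, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # any dip below threshold resets the timer
    latest = samples[-1][0]
    return breach_start is not None and latest - breach_start >= window_seconds
```

Resetting the timer on any dip keeps a brief spike from triggering a rollback, which matches the "sustained 10 min" wording.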

Recovery. Rollback the offending rule; incident report within 24 h.


FM-09 — SIM-box detector false positives

Scenario. ML model flags legitimate high-volume MNO gateway traffic as SIM-box.

Impact. Real traffic blocked.

Detection. Tenant escalation; fraud.detected.simbox volume anomaly; manual review confirms a false positive.

Mitigation.

  1. Human-in-the-loop triage for SIM-box detections before auto-block (except highest-confidence tier).
  2. Per-tenant exemption workflow for verified bulk senders.
  3. Model re-training with updated negative examples.

Recovery. Exemption applied immediately; model retrained within 7 d.


FM-10 — EvaluateTransit P95 latency spike

Scenario. Code path regression or DB slow query pushes P95 from 5 ms to > 30 ms.

Impact. Submit-to-DLR SLA (OTP 3 s target) at risk.

Detection. Prometheus histogram firewall_evaluate_transit_seconds P95 > 20 ms for 5 min.

Mitigation.

  1. Automatic rollback on canary P95 > 15 ms for 5 min.
  2. Horizontal scale-out via HPA on RPS.
  3. Query-plan regression detection in staging under load.

Recovery. Rollback; post-mortem within 48 h.


FM-11 — Region partition on blocklist replication

Scenario. Kabul↔Mazar logical replication partition prevents blocklist updates flowing cross-region.

Impact. Regions diverge; an entry added in Kabul is not visible in Mazar until the partition heals.

Detection. firewall_region_divergence_row_count metric > 0 for > 5 min.

Mitigation.

  1. Replication monitoring + automatic re-sync.
  2. Regions operate independently (each enforces own local blocklist).
  3. Drift reconciliation hourly; alert if > 100 rows divergent for 1 h.

Recovery. Partition heals → replication catches up → reconciliation cron merges.


FM-12 — HSM unavailable

Scenario. HSM outage blocks signing of daily federation export.

Impact. Outbound federation paused; inbound firewall operations continue unaffected (audit hash-chain uses software SHA-256, not HSM).

Detection. Sign operation fails; export job queues; alert FirewallHsmUnavailable.

Mitigation.

  1. HSM HA (ADR-0004 §11) with regional quorum.
  2. Export job retries on HSM recovery.
  3. Manual export signing via Security-team-held backup key (dual-control).

Recovery. HSM recovery → export job drains queue automatically.


FM-13 — Rule-engine recursion

Scenario. A composite-rule reference loop is created via manual rule authoring.

Impact. Single evaluation hits recursion limit and times out; fail-closed for that message.

Detection. firewall_rule_recursion_errors_total counter increment; alert at > 10/min.

Mitigation.

  1. Save-time cycle detection (inherited from compliance-engine composite-rule pattern).
  2. Runtime recursion depth cap (5) with visited-set tracking.
  3. Per-message evaluation budget (50 ms P99); exceeded → fail-closed + alert.
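Items 1 and 2 can be sketched together: a save-time cycle check over the composite-rule graph, and a runtime expansion guarded by a depth cap and a visited set. This Python sketch assumes rules are stored as a mapping from rule name to the names it references. That representation, and the names `has_cycle`, `expand`, and `RuleRecursionError`, are hypothetical.

```python
from typing import Dict, List, Optional, Set

MAX_DEPTH = 5  # runtime recursion depth cap from item 2


class RuleRecursionError(Exception):
    """Raised at runtime; the caller is expected to fail closed."""


def has_cycle(rules: Dict[str, List[str]]) -> bool:
    """Save-time check: depth-first search with three-colour marking."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {name: WHITE for name in rules}

    def visit(name: str) -> bool:
        colour[name] = GREY  # on the current DFS path
        for child in rules.get(name, ()):
            state = colour.get(child, WHITE)
            if state == GREY:
                return True  # back edge: reference loop
            if state == WHITE and visit(child):
                return True
        colour[name] = BLACK  # fully explored
        return False

    return any(colour[name] == WHITE and visit(name) for name in rules)


def expand(rules: Dict[str, List[str]], name: str, depth: int = 0,
           visiting: Optional[Set[str]] = None) -> List[str]:
    """Runtime expansion to leaf rules, guarded by depth cap and visited set."""
    if visiting is None:
        visiting = set()
    if depth > MAX_DEPTH or name in visiting:
        raise RuleRecursionError(name)
    children = rules.get(name, [])
    if not children:
        return [name]
    visiting.add(name)
    leaves: List[str] = []
    for child in children:
        leaves.extend(expand(rules, child, depth + 1, visiting))
    visiting.discard(name)
    return leaves
```

The runtime guard is deliberately redundant with the save-time check: it still catches a loop that reaches production through a path that skipped validation.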

Recovery. Admin removes the offending rule; affected period has a reviewable audit log.


4. Graceful Degradation Summary

| Failure domain | Fail-closed action | Optional bypass |
|---|---|---|
| Postgres | Block all (default) | FIREWALL_EMERGENCY_BYPASS=true with CISO+CTO dual-approval, P0/P1 only, time-boxed |
| Redis | Degrade to Postgres | N/A (automatic) |
| NATS | Operate on stale signals | N/A |
| Federation source | Use last-known-good | Manual upload for urgent entries |
| ML model | Rule-based fallback | N/A |
| HSM | Inbound continues; outbound paused | Manual signing via backup key |

5. Failure ↔ Tenant / Regulator Experience Matrix

| FM | Tenant sees | Regulator sees | Citizen sees |
|---|---|---|---|
| FM-01 Postgres out | 503 FIREWALL_UNAVAILABLE | Audit gap if emergency bypass engaged | Delayed SMS |
| FM-02 Redis out | None (latency slightly up) | None | None |
| FM-03 NATS lag | None (possibly stale verdicts) | None | None |
| FM-04 Federation stale | None | Stale national blocklist (regulator may escalate if > 6 h) | None |
| FM-05 Chain break | None | Audit integrity claim compromised (major) | None |
| FM-06 ML out | Possibly reduced detection rate | None (logged) | None |
| FM-07 Fingerprint storm | Possible latency blip | None | None |
| FM-08 FP spike | OTP/transactional failures; tenant escalation | Complaint may arrive | Reduced SMS delivery |
| FM-09 SIM-box FP | Bulk-sender escalation | None | None |
| FM-10 Latency spike | Submit-to-DLR SLA breach risk | None | Possible delay |
| FM-11 Region split | Inconsistent behaviour between regions | None short-term; audit export might differ long-term | None |
| FM-12 HSM out | None | Missed daily federation export (regulator notified) | None |
| FM-13 Rule cycle | Single-message fail-closed | None | None |

6. Open Points

| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact SLO target for FirewallEvaluateLatencyHigh (e.g., P95 ≤ 5 ms or ≤ 10 ms) | SRE |
| FM-OPEN-02 | Emergency bypass governance: CISO+CTO dual-approval or CEO sign-off for extended bypass | Legal + Leadership |
| FM-OPEN-03 | Regional independence on blocklist state: documented as acceptable or eventually-consistent-only | Platform Arch |