
sms-firewall-service — Service Readiness

Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, FAILURE_MODES.md

Readiness criteria for taking sms-firewall-service from development to production. Because the service is the national perimeter firewall consulted on every outbound and inbound message, the bar is elevated: fail-closed correctness, sub-5 ms EvaluateTransit P95, a hash-chained audit log, and resistance to adversarial input.


1. Code Readiness

| Criterion | Status | Notes |
|---|---|---|
| gRPC SmsFirewallService.v1 (FilterInbound, EvaluateTransit, GetVerdict) | | |
| REST admin: rules CRUD, blocklist CRUD, federation trigger, audit query, public stats | | |
| Rule engine: MSISDN/sender-ID/content/rate/geo/ML-based evaluators | | |
| ML model integration: AIT, SIM-box, OTP-harvest detectors via Triton / TorchServe | | |
| Rule-based fallback when ML unavailable | | |
| Blocklist federation sync (MNO + regulator sources) with last-known-good fallback | | |
| Audit hash-chain (prev_hash, record_hash) with daily verifier | | |
| Redis verdict cache with fingerprint key; TTL 300 s | | |
| NATS consumers: fraud.detected.*, consent.revoked.v1, sender.id.suspended.v1 | | |
| Fail-closed on Postgres unavailable | | |
| Emergency-bypass feature flag with dual-approval audit | | |
| Fingerprint-storm detection + tarpit engagement | | |
| Idempotency-Key on admin writes | | |
| mTLS gRPC + SPIRE SVID enforcement | | |
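
The audit hash-chain criterion above can be illustrated with a minimal sketch. Only the field names prev_hash and record_hash come from this document; the `payload` field, the all-zero genesis value, and the hashing recipe (SHA-256 over prev_hash, a newline, and the payload) are assumptions made for illustration, not the service's actual schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape: only prev_hash and record_hash are named in the
// readiness criteria; `payload` stands in for the serialized audit event.
interface AuditRecord {
  payload: string;
  prev_hash: string;
  record_hash: string;
}

// Assumed genesis value used as prev_hash of the first record.
const GENESIS_HASH = "0".repeat(64);

// Assumption: record_hash = SHA-256(prev_hash + "\n" + payload), hex-encoded.
function hashRecord(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update("\n").update(payload).digest("hex");
}

// Appends a record whose prev_hash links to the current tail of the chain.
function appendRecord(chain: readonly AuditRecord[], payload: string): AuditRecord[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.record_hash : GENESIS_HASH;
  return [...chain, { payload, prev_hash: prevHash, record_hash: hashRecord(prevHash, payload) }];
}

// Verifier: returns the index of the first broken record, or -1 if intact.
function verifyChain(chain: readonly AuditRecord[]): number {
  let expectedPrev = GENESIS_HASH;
  let index = 0;
  for (const rec of chain) {
    if (rec.prev_hash !== expectedPrev) return index;
    if (rec.record_hash !== hashRecord(rec.prev_hash, rec.payload)) return index;
    expectedPrev = rec.record_hash;
    index += 1;
  }
  return -1;
}
```

A daily verifier along these lines would walk the day's records in insertion order and raise FirewallChainBroken on any result other than -1.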

2. Testing Readiness

| Criterion | Target | Status |
|---|---|---|
| Unit coverage | ≥ 90% line (domain), ≥ 80% branch | |
| Unit tests per rule evaluator | ≥ 15 per type | |
| Unit tests for MSISDN normalisation + homoglyph defence | ≥ 50 | |
| Unit tests for audit hash-chain | ≥ 15 | |
| Property-based tests (fast-check): chain monotonicity, verdict determinism | ≥ 10 properties | |
| Integration: gRPC EvaluateTransit P95 ≤ 5 ms @ 5 000 RPS | Passed | |
| Integration: federation sync with mock MNO/regulator source | Passed | |
| Integration: ML model fallback to rule-based on serving outage | Passed | |
| Integration: RLS cross-tenant blocked | Passed | |
| Contract: routing-engine EvaluateTransit integration | Passed | |
| Contract: channel-router FilterInbound integration | Passed | |
| Contract: compliance-engine policy input from firewall verdict | Passed | |
| Chaos: Postgres unavailable → fail-closed verified | Passed | |
| Chaos: Redis unavailable → PG fallback, latency degrades gracefully | Passed | |
| Chaos: ML model unreachable → rule-based fallback engages | Passed | |
| Chaos: NATS lag > 60 s → alert + stale-signal handling | Passed | |
| Security: adversarial-input corpus test (homoglyphs, encoded payloads, recursion) | Passed | |
| Security: audit UPDATE/DELETE rejected | Passed | |
| Security: emergency-bypass dual-approval enforced | Passed | |
| Load test: 10 000 RPS sustained for 1 h, P99 ≤ 20 ms | Passed | |
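
The three dependency chaos cases above (Redis, Postgres, ML) describe a degradation order that can be sketched roughly as follows. The `Dependencies` interface, the function names, and the exact evaluation order are hypothetical. The document states only the outcomes: Redis loss falls back to Postgres, Postgres loss fails closed, and ML loss falls back to rules.

```typescript
type Verdict = "ALLOW" | "BLOCK";
type DecisionPath = "cache" | "rules" | "ml" | "fail-closed";

// Hypothetical dependency surface; each call throws when its backend is down.
interface Dependencies {
  cacheLookup(fingerprint: string): Verdict | null; // Redis verdict cache
  rulesVerdict(fingerprint: string): Verdict;       // Postgres-backed rule engine
  mlVerdict(fingerprint: string): Verdict;          // model-serving detectors
}

function evaluate(fingerprint: string, deps: Dependencies): { verdict: Verdict; path: DecisionPath } {
  // Redis unavailable: treat as a cache miss and continue to the Postgres path.
  try {
    const cached = deps.cacheLookup(fingerprint);
    if (cached !== null) return { verdict: cached, path: "cache" };
  } catch {
    // fall through to the rule engine
  }

  // Postgres unavailable: fail closed, never default to ALLOW.
  let ruleResult: Verdict;
  try {
    ruleResult = deps.rulesVerdict(fingerprint);
  } catch {
    return { verdict: "BLOCK", path: "fail-closed" };
  }
  if (ruleResult === "BLOCK") return { verdict: "BLOCK", path: "rules" };

  // ML unavailable: keep the rule-based verdict.
  try {
    return { verdict: deps.mlVerdict(fingerprint), path: "ml" };
  } catch {
    return { verdict: ruleResult, path: "rules" };
  }
}
```

Returning the decision path alongside the verdict is a design choice in this sketch. It lets a chaos test assert not just the verdict but which fallback produced it.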

3. Observability Readiness

| Criterion | Status |
|---|---|
| All Prometheus metrics emitting (OBSERVABILITY.md §1) | |
| Grafana dashboard sms-firewall-service.json deployed | |
| All alerts configured with runbook links | |
| Structured JSON logs (Pino) with MSISDN hash-masking | |
| OTel trace propagation end-to-end verified | |
| Loki parsing rules validated | |
| SIEM forwarding of firewall.audit.v1 verified | |
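
As an illustration of the MSISDN hash-masking criterion, the sketch below replaces phone-number fields with a keyed hash before a log entry is serialized. The field names, the use of HMAC-SHA-256, and the 16-character truncation are all assumptions. The document does not specify the masking scheme, and the wiring into Pino is not shown.

```typescript
import { createHmac } from "node:crypto";

// Assumption: a keyed hash (HMAC) is used so that masked values cannot be
// reversed by enumerating the comparatively small MSISDN number space.
function maskMsisdn(msisdn: string, key: string): string {
  const digest = createHmac("sha256", key).update(msisdn).digest("hex");
  // Truncation length is illustrative only.
  return "msisdn:" + digest.slice(0, 16);
}

// Hypothetical field names that may carry an MSISDN in a log entry.
const DEFAULT_FIELDS: readonly string[] = ["msisdn", "sender", "recipient"];

// Returns a shallow copy of the entry with MSISDN-bearing fields masked.
function maskLogFields(
  entry: Record<string, unknown>,
  key: string,
  fields: readonly string[] = DEFAULT_FIELDS,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...entry };
  for (const field of fields) {
    const value = out[field];
    if (typeof value === "string") out[field] = maskMsisdn(value, key);
  }
  return out;
}
```

Because the mask is deterministic for a given key, the same subscriber can still be correlated across log lines without the raw number appearing in Loki or the SIEM.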

Alerts Configured

  • FirewallEvaluateLatencyHigh (P95 > 15 ms for 5 min)
  • FirewallEvaluateErrorRateHigh (> 0.1% 5xx)
  • FirewallDbUnavailable
  • FirewallCachePostgresFallback (> 20% miss rate sustained 5 min)
  • FirewallFederationStale (> 30 min since last successful sync)
  • FirewallChainBroken (verifier detected break — Critical)
  • FirewallMlUnavailable
  • FirewallFingerprintStorm
  • FirewallBlockRateAnomaly (block rate > baseline + 50% sustained 10 min)
  • FirewallRegionDivergence (> 100 divergent rows for 1 h)
  • FirewallEmergencyBypassEngaged (bypass active — Critical audit)
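
The "sustained" thresholds in this list can be read as "every sample in the window breaches the limit". A minimal sketch for FirewallBlockRateAnomaly follows, assuming one block-rate sample per minute and reading "baseline + 50%" as 1.5 times the baseline. Both are interpretations, not confirmed definitions, and the real alert would presumably be expressed as a Prometheus rule rather than application code.

```typescript
// Returns true when every one of the most recent `windowMinutes` samples
// exceeds baseline * factor. Assumes one sample per minute, oldest first.
function isBlockRateAnomalous(
  blockRates: readonly number[],
  baseline: number,
  windowMinutes: number = 10,
  factor: number = 1.5, // assumed meaning of "baseline + 50%"
): boolean {
  if (blockRates.length < windowMinutes) return false; // not enough history yet
  const recent = blockRates.slice(-windowMinutes);
  return recent.every((rate) => rate > baseline * factor);
}
```

Requiring every sample in the window to breach, rather than the window average, keeps a single spike from firing the alert.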

4. Security Readiness

| Criterion | Status |
|---|---|
| mTLS on gRPC; SPIRE SVIDs rotated hourly (ADR-0004 §12) | |
| NetworkPolicy restricting ingress to routing-engine, channel-router, compliance-engine | |
| Kong JWT on REST admin endpoints | |
| Blocklist federation auth via HSM-signed mutual certs | |
| Audit UPDATE/DELETE rejected at Postgres trigger | |
| Emergency-bypass requires CISO+CTO approval + 1 h time-box | |
| Adversarial-input test corpus run in CI | |
| Pen test against admin API + federation endpoints | |
| Security team sign-off | |

5. Operational Readiness

| Criterion | Status |
|---|---|
| K8s Deployment (5–15 replicas) reviewed; HPA on gRPC RPS | |
| PDB minAvailable: 3 per region (perimeter criticality) | |
| Rolling update tested: zero dropped gRPC calls under 5 000 RPS | |
| Graceful shutdown: 15 s SIGTERM drain | |
| Postgres conn pool (pgbouncer transaction mode) | |
| Redis conn pool sized (100 min / 300 max per pod) | |
| ML model-serving (Triton) pool sized for peak RPS | |
| Blocklist federation runbook drafted | |
| Emergency-bypass runbook drafted | |
| Chain-break incident runbook drafted | |
| False-positive-spike runbook drafted | |
| On-call rotation assigned (T&S primary; SRE secondary) | |

6. Documentation Readiness

Required: all 16 SERVICE_TEMPLATE docs at "Complete" status, a runbook for each alert listed in §3, and an admin handbook covering rule authoring and the reviewer workflow.


7. Compliance / Regulatory Readiness

| Criterion | Status |
|---|---|
| ATRA engagement on national-blocklist federation schema | |
| Initial rule set reviewed by Legal | |
| AIT / SIM-box model fairness review (no MNO bias) | |
| Cross-MNO blocklist-sharing agreement drafted | |
| Citizen complaint surface (via regulator-portal) integrated | |
| SIEM forwarding of firewall audit events approved | |

8. Go/No-Go Criteria Summary

Production deployment is GO when:

  • All §1 Code Readiness complete.
  • Coverage targets met.
  • Load test at 1.5× expected peak RPS (target 7 500) sustains P99 ≤ 25 ms.
  • Adversarial-input corpus passes (0 bypasses).
  • 14-day shadow mode completed with false-positive projection < 1%.
  • Chaos drills (Postgres, Redis, ML, NATS, federation) all show correct degradation.
  • Legal + Security + Regulator Liaison sign-offs.
  • Rollback plan validated in staging.
  • SIEM forwarding verified in staging.
  • CISO sign-off on emergency-bypass governance.

9. Post-Launch Review

Within 30 days:

  • Block rate vs. forecast; tune rules whose block rate is an outlier.
  • False-positive rate audit (target < 1% of BLOCKs).
  • False-negative review via fraud-intel feedback.
  • Latency SLO attainment (target 99.9% of 5-min windows ≤ 5 ms P95).
  • Federation-sync reliability (target 99.5% successful daily syncs).
  • Audit-chain integrity (0 breaks).
  • Emergency-bypass usage review (target 0 engagements).
  • Cache hit ratio (target ≥ 97%).
  • Cost analysis: NATS bandwidth, ML inference costs, federation bandwidth.
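
The latency SLO item above (99.9% of 5-min windows at or below 5 ms P95) can be computed along these lines. This is a sketch only: the nearest-rank percentile method and the input shape (one array of per-request latencies per 5-minute window) are assumptions, and in practice the figure would more likely be derived from Prometheus histograms.

```typescript
// Nearest-rank P95: the smallest sample such that at least 95% of samples
// are less than or equal to it. Integer arithmetic avoids float rounding.
function p95(samplesMs: readonly number[]): number {
  if (samplesMs.length === 0) throw new Error("p95 requires at least one sample");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Fraction of windows whose P95 is at or below the threshold.
// Each inner array holds the latencies observed in one 5-minute window.
function sloAttainment(windows: readonly (readonly number[])[], thresholdMs: number = 5): number {
  if (windows.length === 0) throw new Error("sloAttainment requires at least one window");
  let compliant = 0;
  for (const w of windows) {
    if (p95(w) <= thresholdMs) compliant += 1;
  }
  return compliant / windows.length;
}
```

Over a 30-day review period there are 8 640 five-minute windows, so a 99.9% target tolerates at most 8 non-compliant windows.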

10. Phased Rollout

| Phase | Duration | Behaviour | Exit criteria |
|---|---|---|---|
| P1 - Observation | 14 d | Returns ALLOW for all; verdicts logged only | FP projection < 1% |
| P2 - Enforcement: transit-MT only | 7 d | EvaluateTransit enforces; inbound-MO still observe | < 10 escalations/day |
| P3 - Full Enforcement | Ongoing | Both directions enforce; ML models active | Steady state |

Rollback flags: FIREWALL_ENFORCEMENT_TRANSIT, FIREWALL_ENFORCEMENT_INBOUND.
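
One way the two rollback flags could map onto the phases is sketched below. It assumes the flags are booleans, that `false` means observe-only (return ALLOW but log the evaluated verdict), and that each flag governs one direction. The document names the flags but does not define their semantics, so the types and function here are hypothetical.

```typescript
type FirewallVerdict = "ALLOW" | "BLOCK";
type Direction = "transit" | "inbound";

// Flag names come from the rollout plan; boolean semantics are assumed.
interface EnforcementFlags {
  FIREWALL_ENFORCEMENT_TRANSIT: boolean;
  FIREWALL_ENFORCEMENT_INBOUND: boolean;
}

interface Outcome {
  returned: FirewallVerdict; // what the caller receives
  logged: FirewallVerdict;   // what the engine actually decided
  enforced: boolean;
}

function applyRolloutFlags(
  evaluated: FirewallVerdict,
  direction: Direction,
  flags: EnforcementFlags,
): Outcome {
  const enforced =
    direction === "transit"
      ? flags.FIREWALL_ENFORCEMENT_TRANSIT
      : flags.FIREWALL_ENFORCEMENT_INBOUND;
  // Observe-only: always return ALLOW, but keep the real verdict for logging.
  return { returned: enforced ? evaluated : "ALLOW", logged: evaluated, enforced };
}
```

Under these assumptions P1 corresponds to both flags off, P2 to only FIREWALL_ENFORCEMENT_TRANSIT on, and P3 to both on. Rolling back a phase is then a matter of clearing the relevant flag.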