sms-firewall-service — Service Readiness

Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, FAILURE_MODES.md

Readiness criteria for taking sms-firewall-service from development to production. Because the service is the national perimeter firewall, consulted on every inbound and outbound message, the bar is elevated: fail-closed correctness, EvaluateTransit P95 ≤ 5 ms, a hash-chained audit log, and resistance to adversarial input.
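
To make "fail-closed" concrete, the following is a minimal sketch and not the service's actual code. The names `evaluateFailClosed` and `Outcome`, and the two-value `Verdict` type, are assumptions for illustration. The idea is that any error raised while evaluating (for example, Postgres being unavailable) yields `BLOCK` rather than letting the message through.

```typescript
// Hypothetical verdict type; the real service may define more values.
type Verdict = "ALLOW" | "BLOCK";

interface Outcome {
  verdict: Verdict;
  reason: string;
}

// Runs the supplied evaluation and converts any failure into a BLOCK.
// `evaluate` stands in for the rule engine plus its datastore lookups.
async function evaluateFailClosed(evaluate: () => Promise<Verdict>): Promise<Outcome> {
  try {
    const verdict = await evaluate();
    return { verdict, reason: "evaluated" };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    // Fail closed: an evaluation error must never turn into ALLOW.
    return { verdict: "BLOCK", reason: `fail-closed: ${detail}` };
  }
}
```

Returning a reason alongside the verdict keeps fail-closed blocks distinguishable from rule-driven blocks in the audit trail, which helps when reviewing false positives after an outage.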

1. Code Readiness

Criterion	Status	Notes
gRPC `SmsFirewallService.v1` (FilterInbound, EvaluateTransit, GetVerdict)	☐
REST admin: rules CRUD, blocklist CRUD, federation trigger, audit query, public stats	☐
Rule engine: MSISDN/sender-ID/content/rate/geo/ML-based evaluators	☐
ML model integration: AIT, SIM-box, OTP-harvest detectors via Triton / TorchServe	☐
Rule-based fallback when ML unavailable	☐
Blocklist federation sync (MNO + regulator sources) with last-known-good fallback	☐
Audit hash-chain (`prev_hash`, `record_hash`) with daily verifier	☐
Redis verdict cache with fingerprint key; TTL 300 s	☐
NATS consumers: `fraud.detected.*`, `consent.revoked.v1`, `sender.id.suspended.v1`	☐
Fail-closed on Postgres unavailable	☐
Emergency-bypass feature flag with dual-approval audit	☐
Fingerprint-storm detection + tarpit engagement	☐
Idempotency-Key on admin writes	☐
mTLS gRPC + SPIRE SVID enforcement	☐
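
The audit hash-chain criterion above can be illustrated with a small sketch. Only the field names `prev_hash` and `record_hash` come from the checklist. The record shape, the SHA-256 choice, the all-zero genesis hash, and the function names are assumptions made for this example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; only prev_hash and record_hash are from the checklist.
interface AuditRecord {
  seq: number;
  payload: string; // assumed: canonical serialisation of the audited event
  prev_hash: string;
  record_hash: string;
}

// Assumed sentinel used as prev_hash of the first record.
const GENESIS_HASH = "0".repeat(64);

// JSON.stringify of an array gives an unambiguous encoding of the three fields.
function computeRecordHash(seq: number, payload: string, prevHash: string): string {
  return createHash("sha256").update(JSON.stringify([seq, payload, prevHash])).digest("hex");
}

function appendRecord(chain: AuditRecord[], payload: string): AuditRecord {
  const prev = chain[chain.length - 1];
  const seq = prev ? prev.seq + 1 : 0;
  const prev_hash = prev ? prev.record_hash : GENESIS_HASH;
  const record: AuditRecord = {
    seq,
    payload,
    prev_hash,
    record_hash: computeRecordHash(seq, payload, prev_hash),
  };
  chain.push(record);
  return record;
}

// Returns the index of the first record that breaks the chain, or -1 if intact.
function verifyChain(chain: readonly AuditRecord[]): number {
  let expectedPrev = GENESIS_HASH;
  let index = 0;
  for (const r of chain) {
    const linked = r.seq === index && r.prev_hash === expectedPrev;
    const selfConsistent = r.record_hash === computeRecordHash(r.seq, r.payload, r.prev_hash);
    if (!linked || !selfConsistent) return index;
    expectedPrev = r.record_hash;
    index++;
  }
  return -1;
}
```

A daily verifier along these lines would walk the stored records in sequence order and raise the chain-break incident if `verifyChain` returns anything other than -1.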

2. Testing Readiness

Criterion	Target	Status
Unit coverage	≥ 90% line (domain), ≥ 80% branch	☐
Unit tests per rule evaluator	≥ 15 per type	☐
Unit tests for MSISDN normalisation + homoglyph defence	≥ 50	☐
Unit tests for audit hash-chain	≥ 15	☐
Property-based tests (fast-check): chain monotonicity, verdict determinism	≥ 10 properties	☐
Integration: gRPC `EvaluateTransit` P95 ≤ 5 ms @ 5 000 RPS	Passed	☐
Integration: federation sync with mock MNO/regulator source	Passed	☐
Integration: ML model fallback to rule-based on serving outage	Passed	☐
Integration: RLS cross-tenant blocked	Passed	☐
Contract: routing-engine EvaluateTransit integration	Passed	☐
Contract: channel-router FilterInbound integration	Passed	☐
Contract: compliance-engine policy input from firewall verdict	Passed	☐
Chaos: Postgres unavailable → fail-closed verified	Passed	☐
Chaos: Redis unavailable → PG fallback, latency degrades gracefully	Passed	☐
Chaos: ML model unreachable → rule-based fallback engages	Passed	☐
Chaos: NATS lag > 60 s → alert + stale-signal handling	Passed	☐
Security: adversarial-input corpus test (homoglyphs, encoded payloads, recursion)	Passed	☐
Security: audit UPDATE/DELETE rejected	Passed	☐
Security: emergency-bypass dual-approval enforced	Passed	☐
Load test: 10 000 RPS sustained for 1 h, P99 ≤ 20 ms	Passed	☐
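
As an illustration of what the MSISDN normalisation and homoglyph-defence tests might exercise, here is a hedged sketch. The function name `normaliseMsisdn`, the accepted separators, the `00` prefix handling, the 8 to 15 digit bound, and the reject-instead-of-repair policy are all assumptions and not the service's specification.

```typescript
// Returns the MSISDN as "+" followed by 8-15 ASCII digits, or null when the
// input should be rejected. Rejecting instead of guessing is an assumed policy.
function normaliseMsisdn(raw: string): string | null {
  // NFKC folds compatibility look-alikes (e.g. fullwidth digits) into ASCII.
  const folded = raw.normalize("NFKC");
  // Strip common visual separators only: whitespace, hyphen, parentheses, dot.
  const compact = folded.replace(/[\s\-().]/g, "");
  // Treat the international "00" prefix as "+" (assumption).
  const withPlus = compact.startsWith("00") ? `+${compact.slice(2)}` : compact;
  // Anything that is still not plain ASCII digits is rejected, which covers
  // digits from other scripts and zero-width characters that NFKC keeps.
  return /^\+[1-9][0-9]{7,14}$/.test(withPlus) ? withPlus : null;
}
```

Test cases for such a function would typically pair each accepted spelling with at least one look-alike that must be rejected, so that a regression in either direction is caught.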

3. Observability Readiness

Criterion	Status
All Prometheus metrics emitting (OBSERVABILITY.md §1)	☐
Grafana dashboard `sms-firewall-service.json` deployed	☐
All alerts configured with runbook links	☐
Structured JSON logs (Pino) with MSISDN hash-masking	☐
OTel trace propagation end-to-end verified	☐
Loki parsing rules validated	☐
SIEM forwarding of `firewall.audit.v1` verified	☐
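
The MSISDN hash-masking criterion could look roughly like the sketch below. This assumes a keyed HMAC-SHA256 truncated to 16 hex characters; the helper names, the `msisdn_hash` field name, and the truncation length are hypothetical. Wiring it into Pino is deliberately left out.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical helper: replaces an MSISDN with a keyed, truncated digest so
// log lines stay correlatable without exposing the number itself.
function maskMsisdn(msisdn: string, key: string): string {
  const digest = createHmac("sha256", key).update(msisdn).digest("hex");
  return `msisdn:${digest.slice(0, 16)}`;
}

// Builds the fields that would be handed to the logger; the raw MSISDN is
// never copied into the result.
function toLogFields(
  event: { msisdn: string; verdict: string },
  key: string,
): Record<string, string> {
  return { msisdn_hash: maskMsisdn(event.msisdn, key), verdict: event.verdict };
}
```

Using a keyed digest instead of a plain hash matters here because the MSISDN space is small enough to enumerate; without a secret key the mask could be reversed by brute force.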

Alerts Configured

4. Security Readiness

Criterion	Status
mTLS on gRPC; SPIRE SVIDs rotated hourly (ADR-0004 §12)	☐
NetworkPolicy restricting ingress to routing-engine, channel-router, compliance-engine	☐
Kong JWT on REST admin endpoints	☐
Blocklist federation auth via HSM-signed mutual certs	☐
Audit UPDATE/DELETE rejected at Postgres trigger	☐
Emergency-bypass requires CISO+CTO approval + 1 h time-box	☐
Adversarial-input test corpus run in CI	☐
Pen test against admin API + federation endpoints	☐
Security team sign-off	☐
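
The emergency-bypass rule (CISO and CTO approval plus a 1 h time-box) can be sketched as a pure check. The `Approval` shape, the requirement that the two approvers be distinct people, and the choice to start the window at the later of the two approvals are assumptions for illustration.

```typescript
type ApproverRole = "CISO" | "CTO";

// Hypothetical approval record; timestamps are milliseconds since the epoch.
interface Approval {
  role: ApproverRole;
  approver: string;
  approvedAt: number;
}

const BYPASS_WINDOW_MS = 60 * 60 * 1000; // the 1 h time-box from the checklist

// Bypass is active only when both roles have approved, the approvers are
// distinct, and less than one hour has passed since the later approval.
function isBypassActive(approvals: readonly Approval[], now: number): boolean {
  const ciso = approvals.find((a) => a.role === "CISO");
  const cto = approvals.find((a) => a.role === "CTO");
  if (!ciso || !cto || ciso.approver === cto.approver) return false;
  const start = Math.max(ciso.approvedAt, cto.approvedAt);
  return now >= start && now - start < BYPASS_WINDOW_MS;
}
```

Keeping the check a pure function of the approval records and the current time makes the "dual-approval enforced" security test straightforward to write, since no clock or database needs to be mocked.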

5. Operational Readiness

Criterion	Status
K8s Deployment (5–15 replicas) reviewed; HPA on gRPC RPS	☐
PDB `minAvailable: 3` per region (perimeter criticality)	☐
Rolling update tested: zero dropped gRPC calls under 5 000 RPS	☐
Graceful shutdown: 15 s SIGTERM drain	☐
Postgres conn pool (pgbouncer transaction mode)	☐
Redis conn pool sized (100 min / 300 max per pod)	☐
ML model-serving (Triton) pool sized for peak RPS	☐
Blocklist federation runbook drafted	☐
Emergency-bypass runbook drafted	☐
Chain-break incident runbook drafted	☐
False-positive-spike runbook drafted	☐
On-call rotation assigned (T&S primary; SRE secondary)	☐

6. Documentation Readiness

All 16 SERVICE_TEMPLATE docs must be at "Complete" status, with a runbook for each alert (§3 list) and an Admin handbook covering rule authoring and the reviewer workflow.

7. Compliance / Regulatory Readiness

Criterion	Status
ATRA engagement on national-blocklist federation schema	☐
Initial rule set reviewed by Legal	☐
AIT / SIM-box model fairness review (no MNO bias)	☐
Cross-MNO blocklist-sharing agreement drafted	☐
Citizen complaint surface (via regulator-portal) integrated	☐
SIEM forwarding of firewall audit events approved	☐

8. Go/No-Go Criteria Summary

Production deployment is GO when:

9. Post-Launch Review

Within 30 days:

Block rate vs. forecast; tune rules where the rate is an outlier.
False-positive rate audit (target < 1% of BLOCKs).
False-negative review via fraud-intel feedback.
Latency SLO attainment (target 99.9% of 5-min windows ≤ 5 ms P95).
Federation-sync reliability (target 99.5% successful daily syncs).
Audit-chain integrity (0 breaks).
Emergency-bypass usage review (target 0 engagements).
Cache hit ratio (target ≥ 97%).
Cost analysis: NATS bandwidth, ML inference costs, federation bandwidth.
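
The latency SLO target above (99.9% of 5-min windows with P95 ≤ 5 ms) can be checked with a short calculation. The sketch below assumes latency samples are already grouped per 5-minute window, uses a nearest-rank percentile, and skips windows with no traffic; all three are assumptions, as are the function names.

```typescript
// Nearest-rank percentile of one window's latency samples (milliseconds).
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("window has no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// Fraction of measured windows whose P95 is within the threshold.
function sloAttainment(windows: readonly (readonly number[])[], thresholdMs: number): number {
  const measured = windows.filter((w) => w.length > 0);
  if (measured.length === 0) throw new Error("no measured windows");
  const good = measured.filter((w) => percentile(w, 95) <= thresholdMs).length;
  return good / measured.length;
}

// Whole windows that may breach before the target is missed. The small
// epsilon guards against floating-point error in (1 - target).
function allowedBreaches(totalWindows: number, target: number): number {
  return Math.floor(totalWindows * (1 - target) + 1e-9);
}
```

For scale: a 30-day review period contains 30 × 288 = 8 640 five-minute windows, so `allowedBreaches(8640, 0.999)` gives 8 windows of error budget.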

10. Phased Rollout

Phase	Duration	Behaviour	Exit criteria
P1 — Observation	14 d	Returns `ALLOW` for all; verdicts logged only	FP projection < 1%
P2 — Enforcement: transit-MT only	7 d	`EvaluateTransit` enforces; inbound-MO still observe	< 10 escalations/day
P3 — Full Enforcement	Ongoing	Both directions enforce; ML models active	Steady state

Rollback flags: FIREWALL_ENFORCEMENT_TRANSIT, FIREWALL_ENFORCEMENT_INBOUND.
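
One way to read the phases and rollback flags together is sketched below. The flag names come from the line above. The mapping of each flag to a direction, the two-value verdict type, and the `applyRollout` helper are assumptions for illustration.

```typescript
type Verdict = "ALLOW" | "BLOCK"; // assumed; the real verdict set may be larger
type Direction = "transit" | "inbound";

interface EnforcementFlags {
  FIREWALL_ENFORCEMENT_TRANSIT: boolean;
  FIREWALL_ENFORCEMENT_INBOUND: boolean;
}

interface Decision {
  returned: Verdict;  // what the caller receives
  evaluated: Verdict; // what the rules decided; always logged
  enforced: boolean;
}

// With a direction's flag off, the evaluated verdict is recorded but the
// caller always gets ALLOW (observation behaviour). With the flag on, the
// evaluated verdict is returned as-is.
function applyRollout(direction: Direction, evaluated: Verdict, flags: EnforcementFlags): Decision {
  const enforced =
    direction === "transit"
      ? flags.FIREWALL_ENFORCEMENT_TRANSIT
      : flags.FIREWALL_ENFORCEMENT_INBOUND;
  return { returned: enforced ? evaluated : "ALLOW", evaluated, enforced };
}
```

Under this reading, P1 corresponds to both flags off, P2 to only the transit flag on, and P3 to both on; rolling back a phase is then a matter of switching the relevant flag off without redeploying.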
