sms-firewall-service — Service Readiness
Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, FAILURE_MODES.md
This document lists the readiness criteria for taking sms-firewall-service from development to production. Because the service is the national perimeter firewall consulted on every outbound and inbound message, the bar is elevated: fail-closed correctness, EvaluateTransit P95 ≤ 5 ms, a hash-chained audit log, and resistance to adversarial input.
1. Code Readiness
| Criterion | Status | Notes |
|---|---|---|
| gRPC SmsFirewallService.v1 (FilterInbound, EvaluateTransit, GetVerdict) | ☐ | |
| REST admin: rules CRUD, blocklist CRUD, federation trigger, audit query, public stats | ☐ | |
| Rule engine: MSISDN/sender-ID/content/rate/geo/ML-based evaluators | ☐ | |
| ML model integration: AIT, SIM-box, OTP-harvest detectors via Triton / TorchServe | ☐ | |
| Rule-based fallback when ML unavailable | ☐ | |
| Blocklist federation sync (MNO + regulator sources) with last-known-good fallback | ☐ | |
| Audit hash-chain (prev_hash, record_hash) with daily verifier | ☐ | |
| Redis verdict cache with fingerprint key; TTL 300 s | ☐ | |
| NATS consumers: fraud.detected.*, consent.revoked.v1, sender.id.suspended.v1 | ☐ | |
| Fail-closed on Postgres unavailable | ☐ | |
| Emergency-bypass feature flag with dual-approval audit | ☐ | |
| Fingerprint-storm detection + tarpit engagement | ☐ | |
| Idempotency-Key on admin writes | ☐ | |
| mTLS gRPC + SPIRE SVID enforcement | ☐ | |
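
The audit hash-chain criterion above can be illustrated with a minimal sketch. Only the `prev_hash` and `record_hash` field names come from the checklist; the record shape, the SHA-256 choice, the all-zero genesis value, and the function names are assumptions made for illustration, not the service's actual schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; only prev_hash and record_hash are named in the checklist.
interface AuditRecord {
  seq: number;
  payload: string; // canonical serialisation of the audited event (assumed)
  prev_hash: string;
  record_hash: string;
}

// Assumed sentinel used as prev_hash of the first record.
const GENESIS_HASH = "0".repeat(64);

// Hash covers the sequence number, the previous hash, and the payload.
function hashRecord(seq: number, payload: string, prevHash: string): string {
  return createHash("sha256")
    .update(`${seq}\n${prevHash}\n${payload}`)
    .digest("hex");
}

// Appends a record whose prev_hash links to the current tail of the chain.
function appendRecord(chain: AuditRecord[], payload: string): AuditRecord {
  const prev = chain[chain.length - 1];
  const seq = prev ? prev.seq + 1 : 0;
  const prev_hash = prev ? prev.record_hash : GENESIS_HASH;
  const record: AuditRecord = {
    seq,
    payload,
    prev_hash,
    record_hash: hashRecord(seq, payload, prev_hash),
  };
  chain.push(record);
  return record;
}

// Returns the index of the first record that breaks the chain, or -1 if intact.
function verifyChain(chain: AuditRecord[]): number {
  let expectedPrev = GENESIS_HASH;
  return chain.findIndex((r) => {
    const ok =
      r.prev_hash === expectedPrev &&
      r.record_hash === hashRecord(r.seq, r.payload, r.prev_hash);
    expectedPrev = r.record_hash;
    return !ok;
  });
}
```

A daily verifier could run something like `verifyChain` over the day's records and raise the chain-break incident if the result is not -1. In production the chain would live in Postgres rather than in an in-memory array.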
2. Testing Readiness
| Criterion | Target | Status |
|---|---|---|
| Unit coverage | ≥ 90% line (domain), ≥ 80% branch | ☐ |
| Unit tests per rule evaluator | ≥ 15 per type | ☐ |
| Unit tests for MSISDN normalisation + homoglyph defence | ≥ 50 | ☐ |
| Unit tests for audit hash-chain | ≥ 15 | ☐ |
| Property-based tests (fast-check): chain monotonicity, verdict determinism | ≥ 10 properties | ☐ |
| Integration: gRPC EvaluateTransit P95 ≤ 5 ms @ 5 000 RPS | Passed | ☐ |
| Integration: federation sync with mock MNO/regulator source | Passed | ☐ |
| Integration: ML model fallback to rule-based on serving outage | Passed | ☐ |
| Integration: RLS cross-tenant blocked | Passed | ☐ |
| Contract: routing-engine EvaluateTransit integration | Passed | ☐ |
| Contract: channel-router FilterInbound integration | Passed | ☐ |
| Contract: compliance-engine policy input from firewall verdict | Passed | ☐ |
| Chaos: Postgres unavailable → fail-closed verified | Passed | ☐ |
| Chaos: Redis unavailable → PG fallback, latency degrades gracefully | Passed | ☐ |
| Chaos: ML model unreachable → rule-based fallback engages | Passed | ☐ |
| Chaos: NATS lag > 60 s → alert + stale-signal handling | Passed | ☐ |
| Security: adversarial-input corpus test (homoglyphs, encoded payloads, recursion) | Passed | ☐ |
| Security: audit UPDATE/DELETE rejected | Passed | ☐ |
| Security: emergency-bypass dual-approval enforced | Passed | ☐ |
| Load test: 10 000 RPS sustained for 1 h, P99 ≤ 20 ms | Passed | ☐ |
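
The two chaos behaviours above (fail-closed when Postgres is unavailable, rule-based fallback when the ML model is unreachable) might be exercised against logic shaped roughly like the sketch below. Everything here is hypothetical: the `Dependencies` interface, the reason strings, the score threshold, and the rules-before-model ordering are assumptions, and "fail-closed" is interpreted as returning BLOCK.

```typescript
type Verdict = "ALLOW" | "BLOCK";

interface EvalResult {
  verdict: Verdict;
  reason: string;
}

// Hypothetical injected dependencies so that outages can be simulated in tests.
interface Dependencies {
  // Rule evaluation backed by Postgres (assumed); rejects when the database is down.
  evaluateRules: (message: string) => Promise<Verdict>;
  // Fraud score in [0, 1] from model serving (assumed); rejects on outage.
  scoreWithModel: (message: string) => Promise<number>;
}

// Assumed threshold, for illustration only.
const ML_BLOCK_THRESHOLD = 0.9;

async function evaluate(message: string, deps: Dependencies): Promise<EvalResult> {
  let ruleVerdict: Verdict;
  try {
    ruleVerdict = await deps.evaluateRules(message);
  } catch {
    // Fail closed: without rules the message cannot be cleared, so it is blocked.
    return { verdict: "BLOCK", reason: "fail_closed_rules_unavailable" };
  }
  if (ruleVerdict === "BLOCK") {
    return { verdict: "BLOCK", reason: "rule_match" };
  }
  try {
    const score = await deps.scoreWithModel(message);
    return score >= ML_BLOCK_THRESHOLD
      ? { verdict: "BLOCK", reason: "ml_score" }
      : { verdict: "ALLOW", reason: "clean" };
  } catch {
    // Model outage: the rule-based verdict stands on its own.
    return { verdict: ruleVerdict, reason: "ml_unavailable_rule_fallback" };
  }
}
```

Injecting the dependencies keeps the chaos scenarios reproducible in unit tests: a stub that rejects stands in for the outage, and the asserted reason string shows which path was taken.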
3. Observability Readiness
| Criterion | Status |
|---|---|
| All Prometheus metrics emitting (OBSERVABILITY.md §1) | ☐ |
| Grafana dashboard sms-firewall-service.json deployed | ☐ |
| All alerts configured with runbook links | ☐ |
| Structured JSON logs (Pino) with MSISDN hash-masking | ☐ |
| OTel trace propagation end-to-end verified | ☐ |
| Loki parsing rules validated | ☐ |
| SIEM forwarding of firewall.audit.v1 verified | ☐ |
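
As an illustration of the MSISDN hash-masking criterion above, the sketch below replaces MSISDN-bearing fields with a keyed hash before a log entry is written. The field names, the HMAC-SHA256 choice, the 16-character truncation, and the `msisdn:` prefix are all assumptions. Wiring this into Pino is not shown, and the service's real masking scheme may differ.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical names of log fields that may carry an MSISDN.
const MSISDN_FIELDS = new Set(["msisdn", "sender", "recipient"]);

// Keyed hash so that masked values are stable per number but not reversible
// by anyone who lacks the key. Truncation length is an arbitrary choice here.
function maskMsisdn(msisdn: string, key: string): string {
  const digest = createHmac("sha256", key).update(msisdn).digest("hex");
  return `msisdn:${digest.slice(0, 16)}`;
}

// Returns a copy of the log entry with MSISDN-bearing string fields masked.
function maskLogFields(
  entry: Record<string, unknown>,
  key: string,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [field, value] of Object.entries(entry)) {
    out[field] =
      MSISDN_FIELDS.has(field) && typeof value === "string"
        ? maskMsisdn(value, key)
        : value;
  }
  return out;
}
```

A keyed hash rather than a plain digest is used because the MSISDN space is small enough to enumerate; with a secret key, the same number still correlates across log lines without being recoverable from the logs alone.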
4. Security Readiness
| Criterion | Status |
|---|---|
| mTLS on gRPC; SPIRE SVIDs rotated hourly (ADR-0004 §12) | ☐ |
| NetworkPolicy restricting ingress to routing-engine, channel-router, compliance-engine | ☐ |
| Kong JWT on REST admin endpoints | ☐ |
| Blocklist federation auth via HSM-signed mutual certs | ☐ |
| Audit UPDATE/DELETE rejected at Postgres trigger | ☐ |
| Emergency-bypass requires CISO+CTO approval + 1 h time-box | ☐ |
| Adversarial-input test corpus run in CI | ☐ |
| Pen test against admin API + federation endpoints | ☐ |
| Security team sign-off | ☐ |
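
The emergency-bypass criterion above (CISO+CTO approval plus a 1 h time-box) can be expressed as a small predicate. This is a sketch under stated assumptions: the `BypassRequest` shape, the requirement that the two approvers be distinct identities, and the function name are hypothetical and not taken from the service.

```typescript
type ApproverRole = "CISO" | "CTO";

// Hypothetical shapes for illustration.
interface Approval {
  role: ApproverRole;
  approverId: string;
}

interface BypassRequest {
  approvals: Approval[];
  activatedAtMs: number; // epoch milliseconds when the bypass was switched on
}

// The 1 h time-box from the checklist.
const BYPASS_TIME_BOX_MS = 60 * 60 * 1000;

// True only while both roles have approved and the time-box has not expired.
function isBypassActive(req: BypassRequest, nowMs: number): boolean {
  const ciso = req.approvals.find((a) => a.role === "CISO");
  const cto = req.approvals.find((a) => a.role === "CTO");
  if (!ciso || !cto) return false;
  // Assumption: one person holding both roles does not count as dual approval.
  if (ciso.approverId === cto.approverId) return false;
  const elapsedMs = nowMs - req.activatedAtMs;
  return elapsedMs >= 0 && elapsedMs < BYPASS_TIME_BOX_MS;
}
```

Passing `nowMs` in explicitly, instead of reading the clock inside the function, makes the expiry boundary straightforward to test.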
5. Operational Readiness
| Criterion | Status |
|---|---|
| K8s Deployment (5–15 replicas) reviewed; HPA on gRPC RPS | ☐ |
| PDB minAvailable: 3 per region (perimeter criticality) | ☐ |
| Rolling update tested: zero dropped gRPC calls under 5 000 RPS | ☐ |
| Graceful shutdown: 15 s SIGTERM drain | ☐ |
| Postgres conn pool (pgbouncer transaction mode) | ☐ |
| Redis conn pool sized (100 min / 300 max per pod) | ☐ |
| ML model-serving (Triton) pool sized for peak RPS | ☐ |
| Blocklist federation runbook drafted | ☐ |
| Emergency-bypass runbook drafted | ☐ |
| Chain-break incident runbook drafted | ☐ |
| False-positive-spike runbook drafted | ☐ |
| On-call rotation assigned (T&S primary; SRE secondary) | ☐ |
6. Documentation Readiness
All 16 SERVICE_TEMPLATE docs are at "Complete" status. In addition, every alert listed in §3 has a runbook, and an admin handbook covers rule authoring and the reviewer workflow.
7. Compliance / Regulatory Readiness
| Criterion | Status |
|---|---|
| ATRA engagement on national-blocklist federation schema | ☐ |
| Initial rule set reviewed by Legal | ☐ |
| AIT / SIM-box model fairness review (no MNO bias) | ☐ |
| Cross-MNO blocklist-sharing agreement drafted | ☐ |
| Citizen complaint surface (via regulator-portal) integrated | ☐ |
| SIEM forwarding of firewall audit events approved | ☐ |
8. Go/No-Go Criteria Summary
Production deployment is GO when:
9. Post-Launch Review
Within 30 days:
10. Phased Rollout
| Phase | Duration | Behaviour | Exit criteria |
|---|---|---|---|
| P1 — Observation | 14 d | Returns ALLOW for all; verdicts logged only | FP projection < 1% |
| P2 — Enforcement: transit-MT only | 7 d | EvaluateTransit enforces; inbound-MO still observe | < 10 escalations/day |
| P3 — Full Enforcement | Ongoing | Both directions enforce; ML models active | Steady state |
Rollback flags: FIREWALL_ENFORCEMENT_TRANSIT, FIREWALL_ENFORCEMENT_INBOUND.
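
The phase behaviour in the table above could be driven by the two rollback flags along these lines. The sketch assumes the flags are booleans, that FIREWALL_ENFORCEMENT_TRANSIT governs EvaluateTransit and FIREWALL_ENFORCEMENT_INBOUND governs FilterInbound, and that a disabled flag means observe-only (return ALLOW, keep the computed verdict for logging). The types and function name are hypothetical.

```typescript
type Direction = "TRANSIT" | "INBOUND";
type Verdict = "ALLOW" | "BLOCK";

// Flag names come from the rollout section; treating them as booleans is an assumption.
interface RolloutFlags {
  FIREWALL_ENFORCEMENT_TRANSIT: boolean;
  FIREWALL_ENFORCEMENT_INBOUND: boolean;
}

interface PhasedVerdict {
  returned: Verdict; // what the caller receives
  computed: Verdict; // what the firewall decided; logged in every phase
  enforced: boolean;
}

// In observe mode the computed verdict is kept for logging but ALLOW is returned.
function applyRolloutPhase(
  direction: Direction,
  computed: Verdict,
  flags: RolloutFlags,
): PhasedVerdict {
  const enforced =
    direction === "TRANSIT"
      ? flags.FIREWALL_ENFORCEMENT_TRANSIT
      : flags.FIREWALL_ENFORCEMENT_INBOUND;
  return { returned: enforced ? computed : "ALLOW", computed, enforced };
}
```

Under these assumptions, P1 corresponds to both flags off, P2 to only the transit flag on, and P3 to both on. Rolling back a phase is then a flag flip rather than a redeploy.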