cbc-bridge-service — Service Readiness
Version: 1.0 Status: Draft Owner: Government / Emergency + SRE Last Updated: 2026-04-21 References: SERVICE_OVERVIEW.md, _report.md, FAILURE_MODES.md, TESTING_STRATEGY.md
Readiness criteria for production deployment. Because this service connects to MNO RAN cell-broadcast interfaces for civil emergency alerts, go-live requires government, regulator, and MNO partner sign-off in addition to engineering readiness.
1. Code Readiness
| Criterion | Status | Notes |
|---|---|---|
gRPC CbcBridgeService.v1 (BroadcastEmergency, GetBroadcastStatus, CancelBroadcast, ScheduleDrill, VerifyAuthorisedCaller) | ☐ | |
| REST admin (broadcast list, audit query, caller registry CRUD, cell-DB refresh, drill schedule) | ☐ | |
| CBS PDU encoder (severity → Message Identifier; per-language DCS; multi-page) | ☐ | Conformance per 3GPP TS 23.041 |
Per-MNO adapter implementations: Standard3gppCbeAdapter, EricssonProprietaryCbeAdapter, HuaweiProprietaryCbeAdapter | ☐ | One per MNO vendor stack |
| PKI signature verification via HSM/PKCS#11 (no in-process key fallback) | ☐ | Fail-loud on HSM outage |
| Authorised-caller registry + cert-subject mapping | ☐ | Dual-control for registry changes |
Hash-chained audit (prev_hash, record_hash) with daily verifier | ☐ | RFC 8785 canonicalisation |
| Cancellation dual-control within 60 s grace window | ☐ | Atomic state transition |
| Monthly drill scheduler with test-range Message Identifier | ☐ | 4370..4379 test slot |
| Cell-database refresh cron (weekly, per MNO) | ☐ | |
| Replay-attack defence: nonce + timestamp window | ☐ | 5-min window |
| mTLS gRPC + SPIRE SVID | ☐ | |
| Idempotency on admin writes | ☐ | |
| NATS event emission (outbox + DB atomic) | ☐ | |
| Per-MNO egress IP pool (NetworkPolicy) | ☐ | MNO-whitelisted IPs |
2. Testing Readiness
| Criterion | Target | Status |
|---|---|---|
| Unit coverage | ≥ 90% line (domain), ≥ 80% branch | ☐ |
| CBS PDU encoder tests (all severity × language combinations) | ≥ 40 | ☐ |
| Language conformance tests (en/fa/ps/ar round-trip) | 100 samples per language | ☐ |
| Hash-chain integrity tests (append, verify, tamper-detect) | ≥ 15 | ☐ |
| State-machine transition tests | ≥ 30 | ☐ |
| Authorisation-gate tests (allowedSeverities × allowedRegions × expiry) | ≥ 40 | ☐ |
| Property-based tests (fast-check) | ≥ 10 properties × 500 runs | ☐ |
| Integration: PKI happy + CRL + OCSP + expired + tampered + replay | Passed | ☐ |
| Integration: All 3 adapter types against mock CBEs | Passed | ☐ |
| Integration: drill fires on schedule + test Message Identifier | Passed | ☐ |
| Integration: cancel dual-control within window | Passed | ☐ |
| Integration: hash-chain verifier runs + detects tamper | Passed | ☐ |
| Contract test with regulator-portal-service (BroadcastEmergency provider) | Passed | ☐ |
| E2E: happy P0 + partial success + all-MNO fail + drill + cancel + audit verify | Passed | ☐ |
| Adversarial-input corpus test (PKI bypass) | 0 bypasses | ☐ |
| Chaos: HSM out → fail-closed verified | Passed | ☐ |
| Chaos: 1 MNO CBE out → PARTIAL verdict | Passed | ☐ |
| Chaos: all MNO CBE out → FAILED + runbook executed | Passed | ☐ |
| Chaos: Postgres out → in-flight complete via Redis | Passed | ☐ |
| Chaos: region partition → region-local operation | Passed | ☐ |
| Security: audit UPDATE/DELETE rejected | Passed | ☐ |
| Security: authorised-caller escalation blocked | Passed | ☐ |
| Security: replay-attack rejected | Passed | ☐ |
| Load: 100 concurrent cancellations without state-race | Passed | ☐ |
| Load: 500 bad-cert/s for 5 min does not block legitimate | Passed | ☐ |
3. Observability Readiness
| Criterion | Status |
|---|---|
| All Prometheus metrics emitting (OBSERVABILITY.md §1) | ☐ |
Grafana dashboard cbc-bridge-service.json deployed (NOC + Regulator + Engineering rows) | ☐ |
| All alerts configured in Alertmanager with runbook links | ☐ |
| Structured JSON logs (Pino); caller-org logged; no subscriber PII | ☐ |
| OTel trace propagation end-to-end verified | ☐ |
| Loki parsing rules validated | ☐ |
SIEM forwarding of cbc.audit.v1 via regulator-portal verified | ☐ |
Alerts Configured
-
CbcBroadcastDispatchFailureCritical(≥ 25% dispatch failure per MNO 2 min) -
CbcBroadcastAllMnoFailed(FAILED verdict) — CEO-paging -
CbcPkiVerifyFailureSpike(> 5 failures/min 5 min) -
CbcHsmUnavailable(circuit open or up=0) -
CbcAuditChainBroken— Critical, CISO-paging -
CbcDrillOverdue(> 7 d past cadence) -
CbcCellDatabaseStale(> 14 d) -
CbcAuthorisedCallerCertExpiringSoon(< 14 d) -
CbcBroadcastAcceptLatencyHigh(P99 > 1 s) -
CbcPartialDispatchRateHigh(PARTIAL > 10% for 30 min)
4. Security Readiness
| Criterion | Status |
|---|---|
| mTLS on gRPC + SPIRE SVIDs hourly rotation | ☐ |
| National-PKI trust chain configured in HSM trust anchors | ☐ |
| NetworkPolicy restricts ingress to regulator-portal + government-PKI callers | ☐ |
| NetworkPolicy egress limited to per-MNO CBE CIDRs (deny everything else) | ☐ |
| Audit UPDATE/DELETE rejected by Postgres trigger | ☐ |
| PKI-bypass adversarial corpus passes (0 bypasses) | ☐ |
| Replay-attack corpus passes | ☐ |
| Authorised-caller registry immutable history; dual-control on edits | ☐ |
| HSM HA + regional quorum (ADR-0004 §11) | ☐ |
| Pen test against gRPC + REST admin | ☐ |
| CISO sign-off | ☐ |
5. Operational Readiness
| Criterion | Status |
|---|---|
| K8s Deployment (3 replicas min across 3 AZs) reviewed | ☐ |
| HPA scaled on broadcast-rate metric with conservative scale-down | ☐ |
PDB minAvailable: 2 | ☐ |
| Rolling update: zero-drop verified under steady 100 broadcast/h | ☐ |
| Graceful shutdown: 20 s SIGTERM drain for in-flight dispatches | ☐ |
| Postgres conn pool (pgbouncer) | ☐ |
| Redis conn pool sized | ☐ |
| HSM pool (4 sessions/pod) | ☐ |
| Multi-region topology (kbl primary + mzr standby + dxb DR mirror) verified | ☐ |
| Per-MNO egress IP pool whitelisted with MNOs (exchanged during Phase 0) | ☐ |
| Runbooks drafted for every alert | ☐ |
| On-call rotation: Government / Emergency primary; SRE secondary; CISO executive | ☐ |
| CEO + Board Secretary contact channel established | ☐ |
| Out-of-band communication bridge (phone + Slack) documented | ☐ |
6. Documentation Readiness
All 16 SERVICE_TEMPLATE docs at "Complete". Plus:
- Runbooks per alert (§3 list)
- Government-caller integration guide (gRPC client + SDK)
- MNO onboarding playbook (per MNO: adapter, egress, test-broadcast SLA)
- Drill playbook (monthly cadence + after-action report template)
- Regulator engagement document (audit-log format, SIEM stream contents)
7. Compliance / Regulatory Readiness
| Criterion | Status |
|---|---|
| ATRA MoU for CBC service | ☐ |
| National-PKI CA trust chain ratified by Legal + CISO | ☐ |
| Government MoU with NDMA / Police / Civil Defence (authorised callers) | ☐ |
| MNO MoU per operator (CBE protocol, staging endpoints, production endpoints, incident contacts) | ☐ |
| 3GPP TS 23.041 conformance declaration | ☐ |
| Audit retention policy (13 m hot + 7 y cold) configured | ☐ |
| Drill cadence agreed with NDMA + ATRA (monthly) | ☐ |
| Emergency-broadcast legal authority documented (Government + Legal) | ☐ |
| SIEM forwarding approved by regulator-portal team | ☐ |
8. Go/No-Go Criteria Summary
Production deployment is GO when:
- All §1 Code Readiness complete.
- Coverage targets met.
- Load tests pass (at expected emergency-burst scale).
- PKI-bypass corpus: 0 bypasses.
- 14-day shadow mode in staging completed — drill-only with ATRA observation.
- Chaos drill (HSM, 1 MNO, all MNO, Postgres, region partition) all recover per runbook.
- Government + Legal + Security + Regulator Liaison + MNO partners sign-offs.
- On-call rotation assigned and drilled once.
- Out-of-band communication bridge tested.
- Rollback procedure validated in staging.
9. Post-Launch Review
Within 30 days:
- Monthly drill cadence held (100% of scheduled drills executed).
- PKI verification success rate for legitimate callers (target: 100%).
- Per-MNO CBE availability (target: 99.5% per MNO).
- Audit chain integrity (0 breaks).
- Partial-dispatch rate (target: < 1% of broadcasts go PARTIAL).
- No unauthorised broadcasts (target: 0).
- Government-client feedback gathered.
- ATRA acknowledgement of audit visibility.
Within 90 days:
- Real emergency broadcast exercise (tabletop or live drill with NDMA).
- Cost analysis: HSM usage, MNO CBE link costs, on-call hours.
- Cell-database accuracy (target: > 95% of national area covered per MNO).
10. Phased Rollout
| Phase | Duration | Behaviour | Exit criteria |
|---|---|---|---|
| P0 — Pre-engagement | 3 months | MNO / ATRA / NDMA engagement; adapter development; test-PKI obtained | MoUs signed; adapters pass against MNO staging |
| P1 — Drill-only | 60 d | Service live; monthly drills only; no real emergency broadcasts | 2 consecutive drills 100% delivered |
| P2 — Non-life-critical emergencies | 30 d | Enable P1/P2 severity for public-health / security advisories | ATRA satisfaction; no unauthorised broadcasts |
| P3 — Full emergency | Ongoing | Enable P0 for life-critical (earthquake, civil defence) | N/A — steady state |
Rollback at any phase via feature flag CBC_ACCEPT_SEVERITY_MAX = NONE|P2|P1|P0.