cbc-bridge-service — Failure Modes
Version: 1.0 Status: Draft Owner: Government / Emergency + SRE Last Updated: 2026-04-21 References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, SECURITY_MODEL.md
Catalog of how cbc-bridge-service fails, how the failure surfaces to government callers / regulators / citizens, and the designed mitigation path. Because broadcasts are rare and critical, failure-mode planning is biased toward loud-and-visible failure rather than silent degradation.
1. Operating Principle: Loud Failure, Fail-Closed on Authentication, Fail-Open on Downstream
- Authentication/authorisation failure → fail-closed, audit-logged, alert fires. Under no circumstances does a broadcast bypass PKI verification.
- MNO CBE outage → fail-open at aggregate level: PARTIAL verdict is acceptable; the broadcast reaches as many MNOs as possible. A single MNO outage does NOT prevent the broadcast.
- HSM outage → fail-closed on new submissions; existing in-flight broadcasts continue to completion via cached state.
- Audit / chain integrity failure → continue operating, but raise an immediate Critical alert to the Government/Emergency team, CISO, and Board Secretary.
2. Failure Mode Summary
| # | Name | Class | Detection | Impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | HSM unavailable | Dependency | < 30 s | New broadcasts rejected (fail-closed); existing complete | runbooks/cbc/hsm-unavailable.md |
| FM-02 | All MNO CBE endpoints unreachable | Infra | 30 s | Broadcast FAILED; manual fallback via MNO NOCs | runbooks/cbc/all-mno-failed.md |
| FM-03 | Partial MNO CBE success | Infra | Per-dispatch | PARTIAL status; per-MNO breakdown in response | runbooks/cbc/partial-dispatch.md |
| FM-04 | PKI signature verification fails | Security | < 1 s | UNAUTHENTICATED; rate-limit triggers on spike | runbooks/cbc/pki-verify-spike.md |
| FM-05 | CBS PDU encoding error per language | Code | < 1 s | That language skipped; others dispatched | runbooks/cbc/encoding-error.md |
| FM-06 | Cell-ID resolution fails | Dependency | < 5 s | Request rejected with detail (INVALID_ARGUMENT) | runbooks/cbc/cell-resolution.md |
| FM-07 | Monthly drill cadence missed | Process | 1 h past due | Regulator escalation; scheduler bug investigation | runbooks/cbc/drill-overdue.md |
| FM-08 | Audit chain break detected | Correctness | 24 h (daily verifier) | Regulator-defensibility lost for affected period | runbooks/cbc/chain-broken.md |
| FM-09 | MNO CBE adapter protocol change | Dependency | Varies | Adapter circuit-breaks; fallback to alternate adapter if configured | runbooks/cbc/adapter-proto-change.md |
| FM-10 | Replay attack (signature reuse with stale timestamp) | Adversarial | < 1 s | Rejected REPLAY_NONCE; rate-limit engages | runbooks/cbc/replay-attack.md |
| FM-11 | Concurrent cancellation race | Code | < 5 s | Atomic state transition ensures only one CANCELLED | runbooks/cbc/cancel-race.md |
| FM-12 | Cell-database refresh from MNO fails | Dependency | 1 w past due | Cell-DB stale; polygon targeting degraded | runbooks/cbc/cell-db-stale.md |
| FM-13 | Postgres unavailable | Infra | < 30 s | Fail-closed; new requests rejected; existing state held in Redis cache | runbooks/cbc/pg-out.md |
| FM-14 | Region partition (kbl ↔ mzr) | Infra | < 1 min | Region-local operation continues; audit mirror delayed | runbooks/cbc/region-split.md |
| FM-15 | Government-client clock drift causes signature timestamp-window miss | Environmental | < 1 s | Rejected; caller support ticket | runbooks/cbc/timestamp-drift.md |
3. Detailed Failure Modes
FM-01 — HSM unavailable
Scenario. HSM cluster unreachable; PKCS#11 operations fail.
Impact. Every new BroadcastEmergency rejected with 503 HSM_UNAVAILABLE (fail-closed — PKI verification impossible without HSM). Existing in-flight broadcasts continue through the dispatch / ack phase using cached state (verification already happened before they entered the queue).
Detection. cbc_hsm_operation_total{result="FAILURE"} spike; circuit breaker opens after 3 consecutive errors; CbcHsmUnavailable alert fires within 2 min.
Mitigation.
- HSM HA (ADR-0004 §11) with regional quorum; automatic fail-over ≤ 30 s.
- During fail-over window, in-flight broadcasts unaffected.
- Manual fallback: Security team can engage backup HSM key with dual-control within 5 min.
- Government callers notified via out-of-band channel if outage > 5 min (phone + Slack bridge).
- Any out-of-band / emergency manual dispatch (rare) is audit-logged separately with CISO + CTO + Government Liaison sign-off.
Recovery. HSM recovery → new submissions accepted. Audit row for the outage window captures scope.
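The breaker behaviour described under Detection (open after 3 consecutive errors, then reject new submissions) could look roughly like the sketch below. The class and method names are hypothetical, and half-open probing on recovery is deliberately left out; only the threshold of 3 comes from this document.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureBreaker:
    """Opens after `threshold` consecutive failures; a success resets the count."""

    threshold: int = 3  # from the Detection text: 3 consecutive errors
    consecutive_failures: int = 0
    is_open: bool = False

    def record(self, success: bool) -> None:
        """Feed the outcome of one HSM operation into the breaker."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.is_open = True

    def allow_new_submission(self) -> bool:
        # Fail-closed: while the breaker is open, new submissions are rejected.
        return not self.is_open

    def reset(self) -> None:
        """Hypothetical recovery hook, called once the HSM is confirmed healthy."""
        self.consecutive_failures = 0
        self.is_open = False
```

Note that interleaved successes keep the breaker closed; only an unbroken run of failures trips it.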
FM-02 — All MNO CBE endpoints unreachable
Scenario. Connectivity to all MNO CBE endpoints down (platform-side network issue or national infrastructure event).
Impact. Broadcasts CANNOT reach any MNO. Returns FAILED verdict to caller within dispatch-timeout window (30 s per-MNO; parallel).
Detection. cbc_mno_dispatch_failed_total sum across MNOs == dispatch count; adapter circuit breakers all open; CbcBroadcastAllMnoFailed Critical alert (CEO-paging).
Mitigation.
- Dedicated per-MNO egress IP pool (DEPLOYMENT_TOPOLOGY §2.5): the narrowly scoped network policy helps distinguish platform-side network issues from MNO-side ones.
- Manual fallback runbook: phone-bridge to each MNO NOC to perform manual cell-broadcast from MNO side using pre-agreed authenticated backup procedure.
- Escalation: CEO + Board Secretary + Government Liaison bridge within 5 min.
- Rollback: no data corruption — request reusable on recovery.
Recovery. Network restoration → adapter circuits close → new broadcasts dispatch normally. Government client may resubmit.
FM-03 — Partial MNO CBE success
Scenario. Some MNOs ACK, others fail or time out.
Impact. Final status PARTIAL with per-MNO breakdown. Per ADR-0004 this is an expected operational mode — broadcast reaches as many MNOs as possible.
Detection. Per-dispatch observation; cbc_broadcast_final_status_total{status="PARTIAL"} increments.
Mitigation.
- Caller receives per-MNO detail — they can decide whether to retry failed MNOs manually.
- Human-in-the-loop: if PARTIAL rate exceeds 10% over an hour → CbcPartialDispatchRateHigh alert.
- Per-MNO failure reason captured in audit row.
Recovery. Operator-initiated resubmit targets only the failed MNOs (preserves idempotency via correlation ID).
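As an illustration of how the aggregate verdict in FM-02 and FM-03 relates to per-MNO results, here is a minimal sketch. PARTIAL and FAILED are the names used in this document; the SUCCESS name, the function names, and the MNO identifiers are assumptions made for the example.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    SUCCESS = "SUCCESS"  # assumed name; this document only names PARTIAL and FAILED
    PARTIAL = "PARTIAL"
    FAILED = "FAILED"


def aggregate_verdict(per_mno_acked: Dict[str, bool]) -> Verdict:
    """Fold per-MNO ACK results into one broadcast-level verdict."""
    if not per_mno_acked:
        raise ValueError("no MNO dispatch results")
    acked = sum(per_mno_acked.values())
    if acked == len(per_mno_acked):
        return Verdict.SUCCESS
    if acked == 0:
        return Verdict.FAILED  # FM-02: no MNO reached
    return Verdict.PARTIAL  # FM-03: at least one MNO reached


def failed_mnos(per_mno_acked: Dict[str, bool]) -> List[str]:
    """MNOs an operator-initiated resubmit would target."""
    return sorted(mno for mno, ok in per_mno_acked.items() if not ok)
```

For example, `aggregate_verdict({"mno-a": True, "mno-b": False})` yields `Verdict.PARTIAL`, and `failed_mnos` on the same input lists only `mno-b`.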
FM-04 — PKI signature verification fails
Scenario. Request arrives with invalid / revoked / tampered signature.
Impact. Rejected UNAUTHENTICATED; audit row written; caller receives specific error reason.
Detection. cbc_pki_signature_verified_total{result="FAILURE"} per reason; CbcPkiVerifyFailureSpike alert at > 5 failures/min.
Mitigation.
- First failure: log + audit; return to caller.
- Rate-limit: if same cert-subject fails 3× in 5 min, tarpit for 1 h + alert security team (probing detection).
- On CRL / OCSP failure, caller receives CERT_REVOKED and is expected to renew out-of-band.
Recovery. Caller resubmits with fixed credentials.
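The probing-detection rule above (3 failures from the same cert-subject within 5 min leads to a 1 h tarpit) can be sketched as a sliding-window counter. This is an in-memory illustration only; the class name is hypothetical and the actual storage backend is not specified in this document.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW_S = 5 * 60      # 5 min observation window
MAX_FAILURES = 3       # failures within the window that trigger the tarpit
TARPIT_S = 60 * 60     # 1 h tarpit


class PkiFailureTarpit:
    def __init__(self) -> None:
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)
        self._tarpit_until: Dict[str, float] = {}

    def record_failure(self, subject: str, now: float) -> bool:
        """Record one verification failure; return True if subject is now tarpitted."""
        window = self._failures[subject]
        window.append(now)
        # Drop failures that have aged out of the 5 min window.
        while window and now - window[0] > WINDOW_S:
            window.popleft()
        if len(window) >= MAX_FAILURES:
            self._tarpit_until[subject] = now + TARPIT_S
            window.clear()
        return self.is_tarpitted(subject, now)

    def is_tarpitted(self, subject: str, now: float) -> bool:
        return now < self._tarpit_until.get(subject, 0.0)
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would supply a monotonic or wall-clock reading.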
FM-05 — CBS encoding error per language
Scenario. Body for one language (e.g., ps) contains a char the encoder cannot handle (e.g., invalid Unicode combining).
Impact. That language variant is skipped; other languages dispatched. Caller notified via perLanguageStatus in response.
Mitigation.
- Strict input validation at accept-time (per SERVICE_OVERVIEW) catches most cases before the request is accepted.
- Fallback: if P0 severity and only en succeeds encoding, broadcast proceeds with warning log + alert.
- Per-language status in final cbc.broadcast.acked.v1 event.
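The skip-one-language behaviour can be illustrated with a small sketch. The encoder here is a stand-in (UCS-2 big-endian, rejecting lone surrogates and anything outside the Basic Multilingual Plane), not the service's actual CBS PDU encoder, and the status strings are assumptions; only the idea of a per-language status map comes from the text above.

```python
from typing import Dict, Tuple


def encode_ucs2(text: str) -> bytes:
    """Stand-in encoder: UCS-2 big-endian, BMP characters only."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the Basic Multilingual Plane")
    # str.encode raises UnicodeEncodeError (a ValueError subclass) on lone surrogates.
    return text.encode("utf-16-be")


def encode_variants(bodies: Dict[str, str]) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Encode each language body independently; one failure does not block the rest."""
    encoded: Dict[str, bytes] = {}
    status: Dict[str, str] = {}
    for lang, body in bodies.items():
        try:
            encoded[lang] = encode_ucs2(body)
            status[lang] = "ENCODED"   # assumed status value
        except ValueError:
            status[lang] = "SKIPPED"   # assumed status value
    return encoded, status
```

With bodies `{"en": "Flood warning", "ps": "bad \ud800 body"}`, only the `en` variant is encoded and the status map marks `ps` as skipped.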
FM-06 — Cell-ID resolution fails
Scenario. Polygon target resolves to cell IDs that aren't in cbc.mno_cell_database (cell DB stale or polygon out of coverage).
Impact. Request rejected at accept-time with INVALID_ARGUMENT + detail (missing cells count).
Mitigation.
- Cell-DB refresh weekly per MNO (DEPLOYMENT_TOPOLOGY §4).
- Fallback: caller can re-target using named region rather than polygon.
- CbcCellDatabaseStale alert at 14 d.
FM-07 — Monthly drill cadence missed
Scenario. Drill scheduler pod fails; monthly drill not fired.
Impact. Regulator escalation; service-readiness gate breach.
Detection. cbc_drill_overdue_seconds > 604800 (1 week past due); CbcDrillOverdue alert.
Mitigation.
- Scheduler is its own Deployment with health probes; crash triggers immediate PagerDuty.
- Manual drill trigger via POST /v1/admin/cbc/drill/now (admin-role only) runs drill catch-up.
- After-action report notes the missed cadence + recovery.
FM-08 — Audit chain break detected
Scenario. Daily verifier finds a row where record_hash ≠ sha256(canonical(payload) || prev_hash).
Impact. Regulator-defensibility of that specific period compromised.
Detection. Verifier run; cbc_audit_chain_verifier_status == 1; Critical alert.
Mitigation.
- Immediate investigation: possible causes are (a) canonicalisation bug, (b) concurrent-write race, (c) malicious tampering.
- Partition the affected chain region; subsequent rows start a new chain from a marked genesis.
- Regulator notified within 24 h if audit was already submitted with corrupt chain.
- Post-mortem + code / DB root-cause analysis within 72 h.
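The verifier check in the Scenario, record_hash = sha256(canonical(payload) || prev_hash), can be sketched as follows. The canonical form (sorted-key compact JSON), the all-zero genesis value, and the row layout are assumptions for illustration; the real canonicalisation is not specified in this document.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

GENESIS_PREV_HASH = b"\x00" * 32  # assumed genesis marker


def canonical(payload: Dict[str, Any]) -> bytes:
    # Assumed canonical form: JSON, sorted keys, no insignificant whitespace.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_hash(payload: Dict[str, Any], prev_hash: bytes) -> bytes:
    return hashlib.sha256(canonical(payload) + prev_hash).digest()


def append_row(rows: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Append a row whose hash links to the previous row's hash."""
    prev = rows[-1]["record_hash"] if rows else GENESIS_PREV_HASH
    rows.append({"payload": payload, "record_hash": record_hash(payload, prev)})


def first_broken_index(rows: List[Dict[str, Any]]) -> Optional[int]:
    """Return the index of the first row whose stored hash does not verify."""
    prev = GENESIS_PREV_HASH
    for i, row in enumerate(rows):
        if row["record_hash"] != record_hash(row["payload"], prev):
            return i
        prev = row["record_hash"]
    return None
```

Because each check uses the stored hash of the previous row, a single tampered payload is reported at its own index rather than invalidating every later row, which fits the "partition the affected chain region" mitigation.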
FM-09 — MNO CBE adapter protocol change
Scenario. MNO upgrades their CBE (e.g., Ericsson protocol version bump); existing adapter rejects new response format.
Impact. Dispatches to that MNO start failing with CBE_REJECT; circuit eventually opens; broadcasts become PARTIAL until adapter updated.
Mitigation.
- Adapter abstraction (per SERVICE_OVERVIEW §6.2) — new adapter can be deployed without redeploying main service.
- Alternate-adapter fallback: if EricssonProprietaryCbeAdapter fails, try Standard3gppCbeAdapter (some MNO CBEs support both).
- Vendor-contact runbook: call MNO NOC within 1 h to confirm protocol change + request documentation.
- Adapter change deployed within 48 h SLA.
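The alternate-adapter fallback could be sketched as an ordered list of adapters tried in turn. The adapter names are taken from the mitigation list; the callable interface, the CbeReject exception, and the "ACK" return value are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


class CbeReject(Exception):
    """Hypothetical error raised when a CBE rejects or mis-parses a dispatch."""


Adapter = Tuple[str, Callable[[bytes], str]]


def dispatch_with_fallback(adapters: Sequence[Adapter], pdu: bytes) -> Tuple[str, str]:
    """Try each configured adapter in order; return (adapter name, result)."""
    errors: List[str] = []
    for name, send in adapters:
        try:
            return name, send(pdu)
        except CbeReject as exc:
            errors.append(f"{name}: {exc}")
    # Every adapter failed: surface all reasons so the audit row can capture them.
    raise CbeReject("; ".join(errors))


def ericsson_send(pdu: bytes) -> str:
    raise CbeReject("unknown response format")  # simulates the protocol bump


def standard_3gpp_send(pdu: bytes) -> str:
    return "ACK"
```

Calling `dispatch_with_fallback` with the Ericsson adapter first and the 3GPP adapter second returns `("Standard3gppCbeAdapter", "ACK")` in this simulation.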
FM-10 — Replay attack
Scenario. Attacker captures a valid signed broadcast request and replays it.
Impact. Rejected REPLAY_NONCE; audit row written; rate-limit engages.
Mitigation.
- Signature window: request timestamp must be within 5 min of server time; older → rejected.
- Nonce deduplication: per-cert nonce cache in Redis (TTL 10 min); re-seen nonce → rejected.
- Multiple replay attempts → automated security incident (CISO paged).
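The two replay checks above (5 min signature window, per-cert nonce cache with a 10 min TTL) can be sketched together. An in-memory dict stands in for the Redis cache, and the class and method names are hypothetical.

```python
from typing import Dict, Tuple

SIGNATURE_WINDOW_S = 5 * 60   # request timestamp must be within 5 min of server time
NONCE_TTL_S = 10 * 60         # per-cert nonce cache TTL


class ReplayGuard:
    def __init__(self) -> None:
        # (cert id, nonce) -> expiry time; stand-in for the Redis nonce cache.
        self._seen: Dict[Tuple[str, str], float] = {}

    def accept(self, cert_id: str, nonce: str, request_ts: float, now: float) -> bool:
        """Return True if the request passes both the window and nonce checks."""
        if abs(now - request_ts) > SIGNATURE_WINDOW_S:
            return False  # outside the signature window
        key = (cert_id, nonce)
        expiry = self._seen.get(key)
        if expiry is not None and now < expiry:
            return False  # nonce already seen for this cert
        self._seen[key] = now + NONCE_TTL_S
        return True
```

The 10 min TTL is twice the window, which appears deliberate: a request stamped up to 5 min ahead of server time stays inside the window for at most 10 min after first acceptance, so its nonce is still cached for as long as a replay could pass the timestamp check.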
FM-11 — Concurrent cancellation race
Scenario. Two approvers click cancel near-simultaneously; or the original initiator and a second approver both attempt cancel at the same instant.
Impact. Potential duplicate CANCELLED transition.
Mitigation.
- State transition uses UPDATE ... WHERE state = 'ACCEPTED' (Postgres row-level lock via FOR UPDATE).
- Single winner; loser receives INVALID_STATE.
- Audit records only the winning transition.
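The single-winner transition can be demonstrated with a conditional UPDATE and a row-count check. The sketch uses an in-memory SQLite database purely so it runs anywhere; the table and column names are hypothetical, and the production path is Postgres as described above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one broadcast in state ACCEPTED."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE broadcasts (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute("INSERT INTO broadcasts VALUES ('b-1', 'ACCEPTED')")
    conn.commit()
    return conn


def cancel_broadcast(conn: sqlite3.Connection, broadcast_id: str) -> str:
    """Attempt the ACCEPTED -> CANCELLED transition; only one caller can win."""
    cur = conn.execute(
        "UPDATE broadcasts SET state = 'CANCELLED' "
        "WHERE id = ? AND state = 'ACCEPTED'",
        (broadcast_id,),
    )
    conn.commit()
    # rowcount is 1 only for the caller whose UPDATE actually matched the row.
    return "CANCELLED" if cur.rowcount == 1 else "INVALID_STATE"
```

The design point is that the state check and the write happen in one statement, so the second canceller matches zero rows and can be told INVALID_STATE without any application-level locking.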
FM-12 — Cell-DB refresh fails
Scenario. MNO cell-database export endpoint unreachable or malformed on weekly refresh.
Impact. Polygon targeting degraded for that MNO; named-region targeting unaffected.
Mitigation.
- Retry 3× with exponential backoff.
- Manual-upload fallback: platform admin can upload cell-DB CSV via admin-dashboard.
- CbcCellDatabaseStale alert at 14 d.
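A minimal sketch of the "retry 3× with exponential backoff" step follows. It assumes "3×" means three retries after the initial attempt and a 1 s base delay that doubles each time; neither detail is stated in this document, and the function name is hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    retries: int = 3,            # assumed: 3 retries after the first attempt
    base_delay_s: float = 1.0,   # assumed base delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure wait base_delay_s * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: let the caller alert / fall back
            sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the delay schedule (1 s, 2 s, 4 s with the defaults) can be checked without actually waiting; once retries are exhausted the exception propagates, at which point the manual-upload fallback applies.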
FM-13 — Postgres unavailable
Scenario. cbc schema unavailable.
Impact. New broadcasts rejected with 503; existing in-flight broadcasts complete via Redis-cached state (dispatch records in Redis until ACK).
Mitigation.
- Postgres HA with synchronous replica; automatic fail-over ≤ 30 s.
- Multi-region fail-over (manual-gated) ≤ 15 min.
- In-flight broadcasts use Redis cache during outage (best-effort audit to be reconciled on recovery).
Recovery. DB recovery → new submissions accepted; any Redis-cached dispatch records reconciled into Postgres.
FM-14 — Region partition
Scenario. Kabul ↔ Mazar network partition.
Impact. Each region operates independently using region-local Postgres; cbc.audit.v1 stream replication is delayed. Geo-routed government callers continue to work.
Mitigation.
- Region-local operation is the design (ADR-0004 §5).
- No cross-region in-flight state to split-brain.
- Audit reconciliation on partition heal.
- Alert CbcRegionPartition.
FM-15 — Government-client clock drift
Scenario. Caller machine has NTP drift > 5 min; signature timestamp falls outside acceptance window.
Impact. Rejected REPLAY_NONCE even though caller intent was legitimate.
Mitigation.
- Caller SDK should use server-time reflection endpoint (GET /v1/cbc/time) to anchor timestamps.
- Error message includes server time + window bounds so caller can self-diagnose.
- Caller-support runbook: instruct client to fix NTP.
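Anchoring timestamps to server time, as suggested for the caller SDK, amounts to computing a clock offset once and applying it to later local readings. The sketch below assumes the time endpoint yields epoch seconds and ignores network round-trip delay; the function names are hypothetical and no HTTP call is made.

```python
from typing import Tuple

SIGNATURE_WINDOW_S = 5 * 60  # acceptance window from the Scenario above


def clock_offset_s(server_time_s: float, local_time_s: float) -> float:
    """Offset to add to local clock readings to approximate server time."""
    return server_time_s - local_time_s


def anchored_timestamp(local_now_s: float, offset_s: float) -> float:
    """Timestamp to sign, corrected for local clock drift."""
    return local_now_s + offset_s


def window_bounds(server_now_s: float) -> Tuple[float, float]:
    """Earliest and latest request timestamps the server would accept."""
    return server_now_s - SIGNATURE_WINDOW_S, server_now_s + SIGNATURE_WINDOW_S
```

For a caller whose clock runs 10 min behind, the offset is +600 s, so a later local reading is shifted back inside the window; `window_bounds` mirrors the bounds the error message is expected to report.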
4. Graceful Degradation Summary
| Failure domain | Behaviour | Caller impact |
|---|---|---|
| Authentication | Fail-closed | Clear error reason; must fix + resubmit |
| HSM | Fail-closed (new); fail-open (in-flight) | Temporary outage → resubmit on recovery |
| Single MNO CBE | Continue; PARTIAL verdict | Detailed per-MNO breakdown |
| All MNO CBE | Fail (FAILED verdict) | Manual fallback runbook |
| Postgres | Fail-closed (new); fail-open (in-flight via Redis) | Temporary outage |
| NATS | Queue; no data loss | Slight audit delay |
| HSM + Postgres both out | Complete fail | Government emergency manual procedure |
5. Failure ↔ Experience Matrix
| FM | Government caller | Regulator (ATRA) | Citizen | NOC |
|---|---|---|---|---|
| FM-01 HSM out | Service temporarily unavailable | Missing broadcasts if prolonged | No broadcast received | Critical alert |
| FM-02 All MNO failed | FAILED; manual fallback expected | Aware via regulator-portal-service | No broadcast | CEO paged |
| FM-03 Partial | PARTIAL with breakdown | Aware | Some regions receive, others not | Alert |
| FM-04 PKI fail | Specific error; must fix cert | Audit visible | No impact | Probing-detection alert |
| FM-05 Encoding | Per-language status | Aware | Missing one language | Alert |
| FM-06 Cell resolution | Re-target required | — | — | — |
| FM-07 Drill overdue | — | ATRA notified (failure to meet cadence) | — | Alert |
| FM-08 Chain break | — | Audit integrity claim reduced | — | Critical |
| FM-09 Adapter protocol change | PARTIAL for affected MNO | Aware | Missing that MNO | Alert |
| FM-10 Replay | Rejected; security incident if repeated | — | — | Probing-detection alert |
| FM-11 Cancel race | One succeeds; other gets INVALID_STATE | — | — | — |
| FM-12 Cell-DB stale | Polygon targeting may fail | — | — | Warning alert |
| FM-13 Postgres out | Temporary; in-flight completes | — | No new broadcasts | Critical |
| FM-14 Region partition | Region-local still works | — | Reaches own region | Alert |
| FM-15 Clock drift | Rejected; caller-support ticket | — | — | Caller metric |
6. Open Points
| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact SLA for manual-dispatch fallback (FM-02) agreed with MNO NOCs | Regulator Liaison + MNO Partnerships |
| FM-OPEN-02 | Government-client SDK distribution channel (for NTP + timestamp helpers) | DevRel |
| FM-OPEN-03 | Out-of-band communication bridge (phone + Slack + radio?) for nationwide HSM outage | Government / Emergency + CISO |
| FM-OPEN-04 | Signature-window tolerance: is 5 min adequate, or should it be tighter? | Security |