cbc-bridge-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Government / Emergency + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, SECURITY_MODEL.md

Catalog of how cbc-bridge-service fails, how the failure surfaces to government callers / regulators / citizens, and the designed mitigation path. Because broadcasts are rare and critical, failure-mode planning is biased toward loud-and-visible failure rather than silent degradation.

1. Operating Principle: Loud Failure, Fail-Closed on Authentication, Fail-Open on Downstream

Authentication/authorisation failure → fail-closed, audit-logged, alert fires. Under no circumstances does a broadcast bypass PKI verification.
MNO CBE outage → fail-open at aggregate level: PARTIAL verdict is acceptable; the broadcast reaches as many MNOs as possible. A single MNO outage does NOT prevent the broadcast.
HSM outage → fail-closed on new submissions; existing in-flight broadcasts continue to completion via cached state.
Audit / chain integrity failure → continue operating, but raise an immediate Critical alert to the Government/Emergency team, CISO, and Board Secretary.

2. Failure Mode Summary

#	Name	Class	Detection	Impact	Runbook
FM-01	HSM unavailable	Dependency	< 30 s	New broadcasts rejected (fail-closed); existing complete	`runbooks/cbc/hsm-unavailable.md`
FM-02	All MNO CBE endpoints unreachable	Infra	30 s	Broadcast FAILED; manual fallback via MNO NOCs	`runbooks/cbc/all-mno-failed.md`
FM-03	Partial MNO CBE success	Infra	Per-dispatch	PARTIAL status; per-MNO breakdown in response	`runbooks/cbc/partial-dispatch.md`
FM-04	PKI signature verification fails	Security	< 1 s	UNAUTHENTICATED; rate-limit triggers on spike	`runbooks/cbc/pki-verify-spike.md`
FM-05	CBS PDU encoding error per language	Code	< 1 s	That language skipped; others dispatched	`runbooks/cbc/encoding-error.md`
FM-06	Cell-ID resolution fails	Dependency	< 5 s	Request rejected with detail (INVALID_ARGUMENT)	`runbooks/cbc/cell-resolution.md`
FM-07	Monthly drill cadence missed	Process	1 h past due	Regulator escalation; scheduler bug investigation	`runbooks/cbc/drill-overdue.md`
FM-08	Audit chain break detected	Correctness	24 h (daily verifier)	Regulator-defensibility lost for affected period	`runbooks/cbc/chain-broken.md`
FM-09	MNO CBE adapter protocol change	Dependency	Varies	Adapter circuit-breaks; fallback to alternate adapter if configured	`runbooks/cbc/adapter-proto-change.md`
FM-10	Replay attack (signature reuse with stale timestamp)	Adversarial	< 1 s	Rejected `REPLAY_NONCE`; rate-limit engages	`runbooks/cbc/replay-attack.md`
FM-11	Concurrent cancellation race	Code	< 5 s	Atomic state transition ensures only one CANCELLED	`runbooks/cbc/cancel-race.md`
FM-12	Cell-database refresh from MNO fails	Dependency	1 w past due	Cell-DB stale; polygon targeting degraded	`runbooks/cbc/cell-db-stale.md`
FM-13	Postgres unavailable	Infra	< 30 s	Fail-closed; new requests rejected; existing state held in Redis cache	`runbooks/cbc/pg-out.md`
FM-14	Region partition (kbl ↔ mzr)	Infra	< 1 min	Region-local operation continues; audit mirror delayed	`runbooks/cbc/region-split.md`
FM-15	Government-client clock drift causes signature timestamp-window miss	Environmental	< 1 s	Rejected; caller support ticket	`runbooks/cbc/timestamp-drift.md`

3. Detailed Failure Modes

FM-01 — HSM unavailable

Scenario. HSM cluster unreachable; PKCS#11 operations fail.

Impact. Every new BroadcastEmergency rejected with 503 HSM_UNAVAILABLE (fail-closed — PKI verification impossible without HSM). Existing in-flight broadcasts continue through the dispatch / ack phase using cached state (verification already happened before they entered the queue).

Detection. cbc_hsm_operation_total{result="FAILURE"} spike; circuit breaker opens after 3 consecutive errors; CbcHsmUnavailable alert fires within 2 min.
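The breaker behaviour described above can be sketched as follows. This is a minimal illustration, not the service's implementation: the class and method names are hypothetical, only the threshold of 3 consecutive errors comes from the text, and half-open probing after fail-over is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureBreaker:
    """Opens after `threshold` consecutive HSM errors (hypothetical helper)."""

    threshold: int = 3   # from the Detection text: 3 consecutive errors
    failures: int = 0
    is_open: bool = False

    def record(self, ok: bool) -> None:
        # A success resets the consecutive-failure count but does not
        # close an already-open breaker; closing is an explicit reset().
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.is_open = True

    def reset(self) -> None:
        # Called once the HSM is confirmed healthy again.
        self.failures = 0
        self.is_open = False

    def allow_new_submission(self) -> bool:
        # Fail-closed: while open, new broadcasts are rejected.
        return not self.is_open
```

Counting only consecutive errors keeps a single transient PKCS#11 failure from rejecting new broadcasts, while a sustained outage trips the breaker quickly.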

Mitigation.

HSM HA (ADR-0004 §11) with regional quorum; automatic fail-over ≤ 30 s.
During fail-over window, in-flight broadcasts unaffected.
Manual fallback: Security team can engage backup HSM key with dual-control within 5 min.
Government callers notified via out-of-band channel if outage > 5 min (phone + Slack bridge).
Any out-of-band / emergency manual dispatch (rare) is audit-logged separately with CISO + CTO + Government Liaison sign-off.

Recovery. HSM recovery → new submissions accepted. Audit row for the outage window captures scope.

FM-02 — All MNO CBE endpoints unreachable

Scenario. Connectivity to all MNO CBE endpoints down (platform-side network issue or national infrastructure event).

Impact. Broadcasts CANNOT reach any MNO. Returns FAILED verdict to caller within dispatch-timeout window (30 s per-MNO; parallel).

Detection. cbc_mno_dispatch_failed_total sum across MNOs == dispatch count; adapter circuit breakers all open; CbcBroadcastAllMnoFailed Critical alert (CEO-paging).

Mitigation.

Dedicated per-MNO egress IP pool (DEPLOYMENT_TOPOLOGY §2.5): the narrowly scoped network policy helps distinguish platform-side network issues from MNO-side ones.
Manual fallback runbook: phone-bridge to each MNO NOC to perform manual cell-broadcast from MNO side using pre-agreed authenticated backup procedure.
Escalation: CEO + Board Secretary + Government Liaison bridge within 5 min.
Rollback: no data corruption — request reusable on recovery.

Recovery. Network restoration → adapter circuits close → new broadcasts dispatch normally. Government client may resubmit.

FM-03 — Partial MNO CBE success

Scenario. Some MNOs ACK, others fail or time out.

Impact. Final status PARTIAL with per-MNO breakdown. Per ADR-0004 this is an expected operational mode — broadcast reaches as many MNOs as possible.

Detection. Per-dispatch observation; cbc_broadcast_final_status_total{status="PARTIAL"} increments.

Mitigation.

Caller receives per-MNO detail — they can decide whether to retry failed MNOs manually.
Human-in-the-loop: if PARTIAL rate exceeds 10% over an hour → CbcPartialDispatchRateHigh alert.
Per-MNO failure reason captured in audit row.

Recovery. Operator-initiated resubmit targets only the failed MNOs (preserves idempotency via correlation ID).
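As an illustration of how the per-MNO results roll up into a final status, the sketch below derives the verdict and the list of MNOs to target on resubmit. PARTIAL and FAILED are the statuses named in this document; the all-success name `ACKED`, the function names, and the MNO labels are assumptions made for the example.

```python
from enum import Enum


class Verdict(str, Enum):
    ACKED = "ACKED"      # assumed name for the all-success status
    PARTIAL = "PARTIAL"
    FAILED = "FAILED"


def aggregate_verdict(per_mno_ok: dict[str, bool]) -> Verdict:
    """Fail-open at aggregate level: one MNO failing never blocks the rest."""
    if not per_mno_ok:
        raise ValueError("no MNO dispatches to aggregate")
    successes = sum(per_mno_ok.values())
    if successes == len(per_mno_ok):
        return Verdict.ACKED
    if successes == 0:
        return Verdict.FAILED
    return Verdict.PARTIAL


def failed_mnos(per_mno_ok: dict[str, bool]) -> list[str]:
    """MNOs an operator-initiated resubmit would target."""
    return sorted(name for name, ok in per_mno_ok.items() if not ok)


# Hypothetical MNO labels, for illustration only.
results = {"mno-a": True, "mno-b": False, "mno-c": True}
verdict = aggregate_verdict(results)
retry_targets = failed_mnos(results)
```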

FM-04 — PKI signature verification fails

Scenario. Request arrives with invalid / revoked / tampered signature.

Impact. Rejected UNAUTHENTICATED; audit row written; caller receives specific error reason.

Detection. cbc_pki_signature_verified_total{result="FAILURE"} per reason; CbcPkiVerifyFailureSpike alert at > 5 failures/min.

Mitigation.

First failure: log + audit; return to caller.
Rate-limit: if same cert-subject fails 3× in 5 min, tarpit for 1 h + alert security team (probing detection).
When the CRL / OCSP check reports the certificate as revoked, the caller receives CERT_REVOKED and is expected to renew out-of-band.
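The probing-detection rule (3 failures from the same cert-subject in 5 min, then a 1 h tarpit) could look roughly like this. It is an in-memory sketch under stated assumptions: the class name is hypothetical, timestamps are passed in explicitly so the logic is testable, and a real deployment would presumably keep this state in a shared store.

```python
from collections import defaultdict, deque

FAILURE_LIMIT = 3        # failures per cert-subject, from the text
WINDOW_S = 5 * 60        # 5 min observation window
TARPIT_S = 60 * 60       # 1 h tarpit


class VerifyFailureTracker:
    """Sliding-window failure counter per certificate subject (hypothetical)."""

    def __init__(self) -> None:
        self._failures = defaultdict(deque)   # subject -> failure timestamps
        self._tarpit_until = {}               # subject -> tarpit expiry

    def record_failure(self, subject: str, now: float) -> bool:
        """Record a failed verification; return True if now tarpitted."""
        window = self._failures[subject]
        window.append(now)
        # Drop failures that have aged out of the 5 min window.
        while window and now - window[0] > WINDOW_S:
            window.popleft()
        if len(window) >= FAILURE_LIMIT:
            self._tarpit_until[subject] = now + TARPIT_S
            window.clear()
        return self.is_tarpitted(subject, now)

    def is_tarpitted(self, subject: str, now: float) -> bool:
        return now < self._tarpit_until.get(subject, 0.0)
```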

Recovery. Caller resubmits with fixed credentials.

FM-05 — CBS encoding error per language

Scenario. The body for one language (e.g., ps) contains a character the encoder cannot handle (e.g., an invalid Unicode combining sequence).

Impact. That language variant is skipped; other languages dispatched. Caller notified via perLanguageStatus in response.

Mitigation.

Input validation at accept-time catches most cases (strict validation per SERVICE_OVERVIEW).
Fallback: if severity is P0 and only the en variant encodes successfully, the broadcast proceeds with a warning log + alert.
Per-language status in final cbc.broadcast.acked.v1 event.
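The skip-one-language behaviour can be sketched as below. The assumptions are significant: the example treats the payload as UCS-2 text (one of the codings cell broadcast allows), uses two simple rejection rules standing in for the real validator, ignores page-size limits, and invents the status strings and function names. The actual shape of perLanguageStatus is not specified in this document.

```python
import unicodedata


def encode_ucs2(text: str) -> bytes:
    """Encode a body as UCS-2; raise ValueError if it cannot be represented."""
    if not text:
        raise ValueError("empty body")
    if unicodedata.combining(text[0]):
        raise ValueError("body starts with a combining mark")
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the Basic Multilingual Plane")
    # Lone surrogates raise UnicodeEncodeError, a subclass of ValueError.
    return text.encode("utf-16-be")


def encode_per_language(bodies: dict[str, str]):
    """Encode every language independently; one failure never blocks the rest."""
    encoded, status = {}, {}
    for lang, text in bodies.items():
        try:
            encoded[lang] = encode_ucs2(text)
            status[lang] = "ENCODED"
        except ValueError as exc:
            status[lang] = f"SKIPPED: {exc}"
    return encoded, status


# The 'ps' body here is deliberately malformed (leading combining accent).
encoded, status = encode_per_language({"en": "Flood warning", "ps": "\u0301bad"})
```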

FM-06 — Cell-ID resolution fails

Scenario. Polygon target resolves to cell IDs that aren't in cbc.mno_cell_database (cell DB stale or polygon out of coverage).

Impact. Request rejected at accept-time with INVALID_ARGUMENT + detail (missing cells count).

Mitigation.

Cell-DB refresh weekly per MNO (DEPLOYMENT_TOPOLOGY §4).
Fallback: caller can re-target using named region rather than polygon.
CbcCellDatabaseStale alert at 14 d.

FM-07 — Monthly drill cadence missed

Scenario. Drill scheduler pod fails; monthly drill not fired.

Impact. Regulator escalation; service-readiness gate breach.

Detection. cbc_drill_overdue_seconds > 604800 (1 week past due); CbcDrillOverdue alert.

Mitigation.

Scheduler is its own Deployment with health probes; crash triggers immediate PagerDuty.
Manual drill trigger via POST /v1/admin/cbc/drill/now (admin-role only) runs drill catch-up.
After-action report notes the missed cadence + recovery.

FM-08 — Audit chain break detected

Scenario. Daily verifier finds a row where record_hash ≠ sha256(canonical(payload) || prev_hash).
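A verifier for the chain rule above might be sketched as follows. The hash formula is the one stated in the scenario; the canonical form (sorted-key compact JSON), the all-zero genesis value, and the row layout are assumptions made for the example.

```python
import hashlib
import json


def canonical(payload: dict) -> bytes:
    # Assumption: canonical form is sorted-key, compact JSON in UTF-8.
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def record_hash(payload: dict, prev_hash: bytes) -> bytes:
    # record_hash = sha256(canonical(payload) || prev_hash)
    return hashlib.sha256(canonical(payload) + prev_hash).digest()


def first_break(rows, genesis: bytes = b"\x00" * 32):
    """Return the index of the first row whose stored hash mismatches, else None."""
    prev = genesis
    for i, row in enumerate(rows):
        if row["record_hash"] != record_hash(row["payload"], prev):
            return i
        prev = row["record_hash"]
    return None


def build_chain(payloads, genesis: bytes = b"\x00" * 32):
    """Helper for the example: produce rows with correctly linked hashes."""
    rows, prev = [], genesis
    for payload in payloads:
        digest = record_hash(payload, prev)
        rows.append({"payload": payload, "record_hash": digest})
        prev = digest
    return rows
```

Because each hash covers the previous one, tampering with a row is reported at that row's index; later rows still link to the stored hash, so the verifier pinpoints where the affected region begins.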

Impact. Regulator-defensibility of that specific period compromised.

Detection. Verifier run; cbc_audit_chain_verifier_status == 1; Critical alert.

Mitigation.

Immediate investigation: possible causes are (a) canonicalisation bug, (b) concurrent-write race, (c) malicious tampering.
Partition the affected chain region; subsequent rows start a new chain from a marked genesis.
Regulator notified within 24 h if audit was already submitted with corrupt chain.
Post-mortem + code / DB root-cause analysis within 72 h.

FM-09 — MNO CBE adapter protocol change

Scenario. MNO upgrades their CBE (e.g., Ericsson protocol version bump); existing adapter rejects new response format.

Impact. Dispatches to that MNO start failing with CBE_REJECT; circuit eventually opens; broadcasts become PARTIAL until adapter updated.

Mitigation.

Adapter abstraction (per SERVICE_OVERVIEW §6.2) — new adapter can be deployed without redeploying main service.
Alternate-adapter fallback: if EricssonProprietaryCbeAdapter fails, try Standard3gppCbeAdapter (some MNO CBEs support both).
Vendor-contact runbook: call MNO NOC within 1 h to confirm protocol change + request documentation.
Adapter change deployed within 48 h SLA.
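The alternate-adapter fallback can be illustrated with a small ordered-attempt loop. The adapter names are the ones mentioned above; the exception type, the callable signature, and the stub adapters are hypothetical stand-ins for the real adapter interface in SERVICE_OVERVIEW §6.2.

```python
from typing import Callable, Sequence


class CbeReject(Exception):
    """Hypothetical error raised when a CBE rejects a dispatch."""


def dispatch_with_fallback(
    adapters: Sequence[tuple[str, Callable[[bytes], str]]], pdu: bytes
) -> tuple[str, str]:
    """Try adapters in order; return (adapter name, ack) from the first success."""
    errors = []
    for name, send in adapters:
        try:
            return name, send(pdu)
        except CbeReject as exc:
            errors.append(f"{name}: {exc}")
    # Every configured adapter rejected: surface all reasons for the audit row.
    raise CbeReject("; ".join(errors))


def ericsson_stub(pdu: bytes) -> str:
    raise CbeReject("unexpected response format")


def standard_stub(pdu: bytes) -> str:
    return "ack"


used, ack = dispatch_with_fallback(
    [
        ("EricssonProprietaryCbeAdapter", ericsson_stub),
        ("Standard3gppCbeAdapter", standard_stub),
    ],
    b"\x01\x02",
)
```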

FM-10 — Replay attack

Scenario. Attacker captures a valid signed broadcast request and replays it.

Impact. Rejected REPLAY_NONCE; audit row written; rate-limit engages.

Mitigation.

Signature window: request timestamp must be within 5 min of server time; older → rejected.
Nonce deduplication: per-cert nonce cache in Redis (TTL 10 min); re-seen nonce → rejected.
Multiple replay attempts → automated security incident (CISO paged).
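The two checks above (timestamp window, then per-cert nonce deduplication) might combine as in this sketch. A plain dict stands in for the Redis nonce cache, the class name is hypothetical, and both rejection paths return REPLAY_NONCE because that is the code this document associates with window misses as well as re-seen nonces.

```python
WINDOW_S = 5 * 60       # signature timestamp window
NONCE_TTL_S = 10 * 60   # nonce cache TTL


class ReplayGuard:
    """In-memory stand-in for the per-cert nonce cache (hypothetical)."""

    def __init__(self) -> None:
        self._seen = {}   # (cert_subject, nonce) -> expiry timestamp

    def check(self, subject: str, nonce: str, request_ts: float, now: float) -> str:
        # 1. Timestamp must be within the window of server time.
        if abs(now - request_ts) > WINDOW_S:
            return "REPLAY_NONCE"
        # 2. Evict expired entries (what a Redis TTL would do for us).
        self._seen = {k: exp for k, exp in self._seen.items() if exp >= now}
        key = (subject, nonce)
        if key in self._seen:
            return "REPLAY_NONCE"
        self._seen[key] = now + NONCE_TTL_S
        return "OK"
```

With a ±5 min window, a request is acceptable for at most 10 min of server time, so a 10 min TTL means a nonce does not leave the cache while its request could still pass the timestamp check.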

FM-11 — Concurrent cancellation race

Scenario. Two approvers click cancel near-simultaneously; or the original initiator and a second approver both attempt cancel at the same instant.

Impact. Without a guard, both requests could record a CANCELLED transition.

Mitigation.

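State transition uses UPDATE ... WHERE state = 'ACCEPTED'. The UPDATE itself takes the Postgres row-level lock (the same lock SELECT ... FOR UPDATE would take), so the second writer re-checks the predicate after the first commits and matches zero rows.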
Single winner; loser receives INVALID_STATE.
Audit records only the winning transition.
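The single-winner transition can be demonstrated with a conditional UPDATE and a row-count check. This sketch uses the standard-library sqlite3 module as a stand-in for Postgres so it runs anywhere; the table and column names are hypothetical, and the two calls run sequentially rather than truly concurrently.

```python
import sqlite3


def try_cancel(conn: sqlite3.Connection, broadcast_id: str, actor: str) -> str:
    """Attempt ACCEPTED -> CANCELLED; only the first caller matches a row."""
    cur = conn.execute(
        "UPDATE broadcast SET state = 'CANCELLED', cancelled_by = ? "
        "WHERE id = ? AND state = 'ACCEPTED'",
        (actor, broadcast_id),
    )
    conn.commit()
    # rowcount == 1 means this caller performed the transition.
    return "CANCELLED" if cur.rowcount == 1 else "INVALID_STATE"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE broadcast (id TEXT PRIMARY KEY, state TEXT NOT NULL, cancelled_by TEXT)"
)
conn.execute("INSERT INTO broadcast VALUES ('b-1', 'ACCEPTED', NULL)")
conn.commit()

first = try_cancel(conn, "b-1", "approver-a")
second = try_cancel(conn, "b-1", "approver-b")
winner = conn.execute("SELECT cancelled_by FROM broadcast WHERE id = 'b-1'").fetchone()[0]
```

Putting the state check in the WHERE clause makes the check and the write one statement, so there is no window between reading the state and changing it.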

FM-12 — Cell-DB refresh fails

Scenario. MNO cell-database export endpoint unreachable or malformed on weekly refresh.

Impact. Polygon targeting degraded for that MNO; named-region targeting unaffected.

Mitigation.

Retry 3× with exponential backoff.
Manual-upload fallback: platform admin can upload cell-DB CSV via admin-dashboard.
CbcCellDatabaseStale alert at 14 d.
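The retry policy could be sketched as follows. It reads "retry 3×" as one initial attempt plus three retries, which is an interpretation; the 1 s base delay, the function name, and the injectable `sleep` parameter (used to keep the example testable) are likewise assumptions.

```python
import time


def refresh_with_retry(fetch, retries: int = 3, base_delay_s: float = 1.0, sleep=time.sleep):
    """Call fetch(); on failure retry up to `retries` times, doubling the delay."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:   # unreachable endpoint or malformed export
            last_error = exc
            if attempt < retries:
                # Delays grow as base, 2*base, 4*base, ...
                sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("cell-DB refresh failed after retries") from last_error
```

Once the retries are exhausted the caller is expected to fall back to the manual-upload path and let the staleness alert track how long the data has been out of date.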

FM-13 — Postgres unavailable

Scenario. cbc schema unavailable.

Impact. New broadcasts rejected with 503; existing in-flight broadcasts complete via Redis-cached state (dispatch records in Redis until ACK).

Mitigation.

Postgres HA with synchronous replica; automatic fail-over ≤ 30 s.
Multi-region fail-over (manual-gated) ≤ 15 min.
In-flight broadcasts use Redis cache during outage (best-effort audit to be reconciled on recovery).

Recovery. DB recovery → new submissions accepted; any Redis-cached dispatch records reconciled into Postgres.

FM-14 — Region partition

Scenario. Kabul ↔ Mazar network partition.

Impact. Each region operates independently using region-local Postgres; cbc.audit.v1 stream replication is delayed. Geo-routed government callers continue to work.

Mitigation.

Region-local operation is the design (ADR-0004 §5).
No cross-region in-flight state to split-brain.
Audit reconciliation on partition heal.
Alert CbcRegionPartition.

FM-15 — Government-client clock drift

Scenario. Caller machine has NTP drift > 5 min; signature timestamp falls outside acceptance window.

Impact. Rejected REPLAY_NONCE even though caller intent was legitimate.

Mitigation.

Caller SDK should use server-time reflection endpoint (GET /v1/cbc/time) to anchor timestamps.
Error message includes server time + window bounds so caller can self-diagnose.
Caller-support runbook: instruct client to fix NTP.
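On the caller side, anchoring timestamps to server time might look like the sketch below. The response format of GET /v1/cbc/time is not specified here, so the functions take an already-parsed server time in seconds; the function names and the symmetric-delay assumption are illustrative only.

```python
def clock_offset_s(server_time_s: float, local_send_s: float, local_recv_s: float) -> float:
    """Estimate server-minus-local clock offset, assuming symmetric network delay."""
    midpoint = (local_send_s + local_recv_s) / 2
    return server_time_s - midpoint


def signing_timestamp(local_now_s: float, offset_s: float) -> float:
    """Timestamp to place in the signed request, corrected for local drift."""
    return local_now_s + offset_s


def window_bounds(server_now_s: float, window_s: float = 300.0) -> tuple[float, float]:
    """Acceptance window a server could echo back in its error message."""
    return server_now_s - window_s, server_now_s + window_s


# A caller whose clock is roughly 10 min behind the server.
offset = clock_offset_s(server_time_s=1000.0, local_send_s=390.0, local_recv_s=392.0)
corrected = signing_timestamp(local_now_s=400.0, offset_s=offset)
low, high = window_bounds(1009.0)
```

Applying the offset corrects only the signed timestamp; fixing NTP on the caller machine remains the real remedy.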

4. Graceful Degradation Summary

Failure domain	Behaviour	Caller impact
Authentication	Fail-closed	Clear error reason; must fix + resubmit
HSM	Fail-closed (new); fail-open (in-flight)	Temporary outage → resubmit on recovery
Single MNO CBE	Continue; PARTIAL verdict	Detailed per-MNO breakdown
All MNO CBE	Fail (FAILED verdict)	Manual fallback runbook
Postgres	Fail-closed (new); fail-open (in-flight via Redis)	Temporary outage
NATS	Queue; no data loss	Slight audit delay
HSM + Postgres both out	Complete fail	Government emergency manual procedure

5. Failure ↔ Experience Matrix

FM	Government caller	Regulator (ATRA)	Citizen	NOC
FM-01 HSM out	Service temporarily unavailable	Missing broadcasts if prolonged	No broadcast received	Critical alert
FM-02 All MNO failed	FAILED; manual fallback expected	Aware via `regulator-portal-service`	No broadcast	CEO paged
FM-03 Partial	PARTIAL with breakdown	Aware	Some regions receive, others not	Alert
FM-04 PKI fail	Specific error; must fix cert	Audit visible	No impact	Probing-detection alert
FM-05 Encoding	Per-language status	Aware	Missing one language	Alert
FM-06 Cell resolution	Re-target required	—	—	—
FM-07 Drill overdue	—	ATRA notified (failure to meet cadence)	—	Alert
FM-08 Chain break	—	Audit integrity claim reduced	—	Critical
FM-09 Adapter protocol change	PARTIAL for affected MNO	Aware	Missing that MNO	Alert
FM-10 Replay	Rejected; security incident if repeated	—	—	Probing-detection alert
FM-11 Cancel race	One succeeds; other gets INVALID_STATE	—	—	—
FM-12 Cell-DB stale	Polygon targeting may fail	—	—	Warning alert
FM-13 Postgres out	Temporary; in-flight completes	—	No new broadcasts	Critical
FM-14 Region partition	Region-local still works	—	Reaches own region	Alert
FM-15 Clock drift	Rejected; caller-support ticket	—	—	Caller metric

6. Open Points

ID	Question	Owner
FM-OPEN-01	Exact SLA for manual-dispatch fallback (FM-02) agreed with MNO NOCs	Regulator Liaison + MNO Partnerships
FM-OPEN-02	Government-client SDK distribution channel (for NTP + timestamp helpers)	DevRel
FM-OPEN-03	Out-of-band communication bridge (phone + Slack + radio?) for nationwide HSM outage	Government / Emergency + CISO
FM-OPEN-04	Signature-window tolerance: is 5 min adequate, or should it be tighter?	Security
