
cbc-bridge-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Government / Emergency + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, SECURITY_MODEL.md

Catalog of how cbc-bridge-service fails, how the failure surfaces to government callers / regulators / citizens, and the designed mitigation path. Because broadcasts are rare and critical, failure-mode planning is biased toward loud-and-visible failure rather than silent degradation.


1. Operating Principle: Loud Failure, Fail-Closed on Authentication, Fail-Open on Downstream

  • Authentication/authorisation failure → fail-closed, audit-logged, alert fires. Under no circumstances does a broadcast bypass PKI verification.
  • MNO CBE outage → fail-open at aggregate level: PARTIAL verdict is acceptable; the broadcast reaches as many MNOs as possible. A single MNO outage does NOT prevent the broadcast.
  • HSM outage → fail-closed on new submissions; existing in-flight broadcasts continue to completion via cached state.
  • Audit / chain integrity failure → continue operating but immediate Critical alert to Government/Emergency team, CISO, and Board Secretary.

2. Failure Mode Summary

| # | Name | Class | Detection | Impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | HSM unavailable | Dependency | < 30 s | New broadcasts rejected (fail-closed); existing complete | runbooks/cbc/hsm-unavailable.md |
| FM-02 | All MNO CBE endpoints unreachable | Infra | 30 s | Broadcast FAILED; manual fallback via MNO NOCs | runbooks/cbc/all-mno-failed.md |
| FM-03 | Partial MNO CBE success | Infra | Per-dispatch | PARTIAL status; per-MNO breakdown in response | runbooks/cbc/partial-dispatch.md |
| FM-04 | PKI signature verification fails | Security | < 1 s | UNAUTHENTICATED; rate-limit triggers on spike | runbooks/cbc/pki-verify-spike.md |
| FM-05 | CBS PDU encoding error per language | Code | < 1 s | That language skipped; others dispatched | runbooks/cbc/encoding-error.md |
| FM-06 | Cell-ID resolution fails | Dependency | < 5 s | Request rejected with detail (INVALID_ARGUMENT) | runbooks/cbc/cell-resolution.md |
| FM-07 | Monthly drill cadence missed | Process | 1 h past due | Regulator escalation; scheduler bug investigation | runbooks/cbc/drill-overdue.md |
| FM-08 | Audit chain break detected | Correctness | 24 h (daily verifier) | Regulator-defensibility lost for affected period | runbooks/cbc/chain-broken.md |
| FM-09 | MNO CBE adapter protocol change | Dependency | Varies | Adapter circuit-breaks; fallback to alternate adapter if configured | runbooks/cbc/adapter-proto-change.md |
| FM-10 | Replay attack (signature reuse with stale timestamp) | Adversarial | < 1 s | Rejected REPLAY_NONCE; rate-limit engages | runbooks/cbc/replay-attack.md |
| FM-11 | Concurrent cancellation race | Code | < 5 s | Atomic state transition ensures only one CANCELLED | runbooks/cbc/cancel-race.md |
| FM-12 | Cell-database refresh from MNO fails | Dependency | 1 w past due | Cell-DB stale; polygon targeting degraded | runbooks/cbc/cell-db-stale.md |
| FM-13 | Postgres unavailable | Infra | < 30 s | Fail-closed; new requests rejected; existing state held in Redis cache | runbooks/cbc/pg-out.md |
| FM-14 | Region partition (kbl ↔ mzr) | Infra | < 1 min | Region-local operation continues; audit mirror delayed | runbooks/cbc/region-split.md |
| FM-15 | Government-client clock drift causes signature timestamp-window miss | Environmental | < 1 s | Rejected; caller support ticket | runbooks/cbc/timestamp-drift.md |

3. Detailed Failure Modes

FM-01 — HSM unavailable

Scenario. HSM cluster unreachable; PKCS#11 operations fail.

Impact. Every new BroadcastEmergency rejected with 503 HSM_UNAVAILABLE (fail-closed — PKI verification impossible without HSM). Existing in-flight broadcasts continue through the dispatch / ack phase using cached state (verification already happened before they entered the queue).

Detection. cbc_hsm_operation_total{result="FAILURE"} spike; circuit breaker opens after 3 consecutive errors; CbcHsmUnavailable alert fires within 2 min.

Mitigation.

  1. HSM HA (ADR-0004 §11) with regional quorum; automatic fail-over ≤ 30 s.
  2. During fail-over window, in-flight broadcasts unaffected.
  3. Manual fallback: Security team can engage backup HSM key with dual-control within 5 min.
  4. Government callers notified via out-of-band channel if outage > 5 min (phone + Slack bridge).
  5. Any out-of-band / emergency manual dispatch (rare) is audit-logged separately with CISO + CTO + Government Liaison sign-off.

Recovery. HSM recovery → new submissions accepted. Audit row for the outage window captures scope.
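
The breaker behaviour described under Detection (open after 3 consecutive errors, then fail-closed) can be sketched as below. This is a minimal illustration, not the service's implementation: the class name, the `HsmUnavailable` exception and the `reset()` hook are hypothetical, and a production breaker would normally add a half-open probe state.

```python
from dataclasses import dataclass


class HsmUnavailable(Exception):
    """Hypothetical error raised while the breaker is open (503 HSM_UNAVAILABLE)."""


@dataclass
class ConsecutiveFailureBreaker:
    threshold: int = 3              # FM-01: opens after 3 consecutive errors
    consecutive_failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def call(self, operation):
        """Run an HSM operation, failing closed once the breaker is open."""
        if self.is_open:
            raise HsmUnavailable("HSM_UNAVAILABLE")
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success resets the streak, so only consecutive errors count.
        self.consecutive_failures = 0
        return result

    def reset(self) -> None:
        """Close the breaker again, e.g. after HSM fail-over completes."""
        self.consecutive_failures = 0
```

Only new submissions would pass through `call()`; in-flight broadcasts were verified before entering the queue and do not touch the breaker.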


FM-02 — All MNO CBE endpoints unreachable

Scenario. Connectivity to all MNO CBE endpoints down (platform-side network issue or national infrastructure event).

Impact. Broadcasts CANNOT reach any MNO. The service returns a FAILED verdict to the caller within the dispatch-timeout window (30 s per MNO; MNOs are dispatched in parallel, so the overall wait is bounded by a single timeout).

Detection. cbc_mno_dispatch_failed_total sum across MNOs == dispatch count; adapter circuit breakers all open; CbcBroadcastAllMnoFailed Critical alert (CEO-paging).

Mitigation.

  1. Dedicated per-MNO egress IP pool (DEPLOYMENT_TOPOLOGY §2.5) with a narrowly scoped network policy, which helps distinguish platform-side network issues from MNO-side ones.
  2. Manual fallback runbook: phone-bridge to each MNO NOC to perform manual cell-broadcast from MNO side using pre-agreed authenticated backup procedure.
  3. Escalation: CEO + Board Secretary + Government Liaison bridge within 5 min.
  4. Rollback: no data corruption — request reusable on recovery.

Recovery. Network restoration → adapter circuits close → new broadcasts dispatch normally. Government client may resubmit.
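
Parallel dispatch with a per-MNO timeout, and the resulting aggregate verdict, might look like the sketch below. The adapter interface (a zero-argument coroutine per MNO), the per-MNO result strings and the `SUCCESS` verdict name are assumptions for illustration; only FAILED, PARTIAL and the 30 s timeout come from this document.

```python
import asyncio
from typing import Awaitable, Callable, Dict

DISPATCH_TIMEOUT_S = 30.0  # per-MNO dispatch timeout from FM-02


async def dispatch_all(
    adapters: Dict[str, Callable[[], Awaitable[None]]],
    timeout_s: float = DISPATCH_TIMEOUT_S,
) -> dict:
    """Dispatch to every MNO in parallel and aggregate the verdict."""

    async def one(name, send):
        try:
            await asyncio.wait_for(send(), timeout=timeout_s)
            return name, "ACKED"
        except asyncio.TimeoutError:
            return name, "TIMEOUT"
        except Exception as exc:  # one MNO failing must not stop the others
            return name, f"ERROR: {exc}"

    pairs = await asyncio.gather(*(one(n, s) for n, s in adapters.items()))
    per_mno = dict(pairs)
    acked = sum(1 for result in per_mno.values() if result == "ACKED")
    if acked == 0:
        verdict = "FAILED"      # FM-02: no MNO reached
    elif acked == len(per_mno):
        verdict = "SUCCESS"     # assumed name for the all-ACK case
    else:
        verdict = "PARTIAL"     # FM-03: some MNOs reached
    return {"verdict": verdict, "per_mno": per_mno}
```

Because each adapter call is wrapped individually, a slow or failing MNO only affects its own entry in `per_mno`.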


FM-03 — Partial MNO CBE success

Scenario. Some MNOs ACK, others fail or time out.

Impact. Final status PARTIAL with per-MNO breakdown. Per ADR-0004 this is an expected operational mode — broadcast reaches as many MNOs as possible.

Detection. Per-dispatch observation; cbc_broadcast_final_status_total{status="PARTIAL"} increments.

Mitigation.

  1. Caller receives per-MNO detail — they can decide whether to retry failed MNOs manually.
  2. Human-in-the-loop: if PARTIAL rate exceeds 10% over an hour → CbcPartialDispatchRateHigh alert.
  3. Per-MNO failure reason captured in audit row.

Recovery. Operator-initiated resubmit targets only the failed MNOs (preserves idempotency via correlation ID).


FM-04 — PKI signature verification fails

Scenario. Request arrives with invalid / revoked / tampered signature.

Impact. Rejected UNAUTHENTICATED; audit row written; caller receives specific error reason.

Detection. cbc_pki_signature_verified_total{result="FAILURE"} per reason; CbcPkiVerifyFailureSpike alert at > 5 failures/min.

Mitigation.

  1. First failure: log + audit; return to caller.
  2. Rate-limit: if same cert-subject fails 3× in 5 min, tarpit for 1 h + alert security team (probing detection).
  3. On CRL / OCSP failure, caller receives CERT_REVOKED and is expected to renew out-of-band.

Recovery. Caller resubmits with fixed credentials.
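
The probing-detection rule in mitigation step 2 (3 failures from the same cert-subject within 5 min leads to a 1 h tarpit) can be sketched with a sliding window. The class and method names are hypothetical, and the in-memory dictionaries stand in for whatever shared store the service actually uses.

```python
from collections import defaultdict, deque

FAILURE_LIMIT = 3            # failures per cert-subject
FAILURE_WINDOW_S = 5 * 60    # within 5 min
TARPIT_S = 60 * 60           # tarpit for 1 h


class VerifyFailureTracker:
    def __init__(self):
        self._failures = defaultdict(deque)  # cert subject -> failure times
        self._tarpit_until = {}              # cert subject -> epoch seconds

    def is_tarpitted(self, subject: str, now: float) -> bool:
        return now < self._tarpit_until.get(subject, 0.0)

    def record_failure(self, subject: str, now: float) -> bool:
        """Record one failure; return True if it triggers the tarpit."""
        window = self._failures[subject]
        window.append(now)
        # Drop failures that fell out of the 5 min window.
        while window and now - window[0] > FAILURE_WINDOW_S:
            window.popleft()
        if len(window) >= FAILURE_LIMIT:
            self._tarpit_until[subject] = now + TARPIT_S
            window.clear()
            return True  # caller would also alert the security team here
        return False
```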


FM-05 — CBS encoding error per language

Scenario. Body for one language (e.g., ps) contains a character the encoder cannot handle (e.g., an invalid Unicode combining sequence).

Impact. That language variant is skipped; other languages dispatched. Caller notified via perLanguageStatus in response.

Mitigation.

  1. Strict input validation at accept-time (per SERVICE_OVERVIEW) catches most cases before the broadcast is accepted.
  2. Fallback: if severity is P0 and only en encodes successfully, the broadcast proceeds with a warning log + alert.
  3. Per-language status in final cbc.broadcast.acked.v1 event.
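
The skip-one-language behaviour can be illustrated as follows. This sketch assumes the bodies are encoded as UCS-2 for illustration; the real encoder, its failure conditions and the exact shape of perLanguageStatus are not specified in this document, so the function names and status strings here are hypothetical.

```python
from typing import Dict, Tuple


def encode_ucs2(body: str) -> bytes:
    """Encode as big-endian UCS-2, rejecting text UCS-2 cannot represent."""
    for ch in body:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside the UCS-2 range")
    # Raises UnicodeEncodeError for lone surrogates.
    return body.encode("utf-16-be")


def encode_per_language(bodies: Dict[str, str]) -> Tuple[dict, dict]:
    """Encode each language independently; a failure skips only that language."""
    encoded, status = {}, {}
    for lang, body in bodies.items():
        try:
            encoded[lang] = encode_ucs2(body)
            status[lang] = "ENCODED"
        except ValueError as exc:  # UnicodeEncodeError is a ValueError subclass
            status[lang] = f"SKIPPED: {exc}"
    return encoded, status
```

The caller would dispatch whatever is in `encoded` and return `status` as the per-language breakdown.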

FM-06 — Cell-ID resolution fails

Scenario. Polygon target resolves to cell IDs that aren't in cbc.mno_cell_database (cell DB stale or polygon out of coverage).

Impact. Request rejected at accept-time with INVALID_ARGUMENT + detail (missing cells count).

Mitigation.

  1. Cell-DB refresh weekly per MNO (DEPLOYMENT_TOPOLOGY §4).
  2. Fallback: caller can re-target using named region rather than polygon.
  3. CbcCellDatabaseStale alert at 14 d.
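
The accept-time check reduces to a set difference between the cells a polygon resolves to and the cells known in cbc.mno_cell_database. A minimal sketch, with a hypothetical function name and error shape:

```python
from typing import Iterable, Optional


def check_cell_coverage(resolved: Iterable[str], known: Iterable[str]) -> Optional[dict]:
    """Return an INVALID_ARGUMENT error if any resolved cell is unknown, else None."""
    missing = set(resolved) - set(known)
    if not missing:
        return None
    return {"code": "INVALID_ARGUMENT", "detail": {"missing_cells": len(missing)}}
```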

FM-07 — Monthly drill cadence missed

Scenario. Drill scheduler pod fails; monthly drill not fired.

Impact. Regulator escalation; service-readiness gate breach.

Detection. cbc_drill_overdue_seconds > 604800 (1 week past due); CbcDrillOverdue alert.

Mitigation.

  1. Scheduler is its own Deployment with health probes; crash triggers immediate PagerDuty.
  2. Manual drill trigger via POST /v1/admin/cbc/drill/now (admin-role only) runs drill catch-up.
  3. After-action report notes the missed cadence + recovery.

FM-08 — Audit chain break detected

Scenario. Daily verifier finds a row where record_hash ≠ sha256(canonical(payload) || prev_hash).

Impact. Regulator-defensibility of that specific period compromised.

Detection. Verifier run; cbc_audit_chain_verifier_status == 1; Critical alert.

Mitigation.

  1. Immediate investigation: possible causes are (a) canonicalisation bug, (b) concurrent-write race, (c) malicious tampering.
  2. Partition the affected chain region; subsequent rows start a new chain from a marked genesis.
  3. Regulator notified within 24 h if audit was already submitted with corrupt chain.
  4. Post-mortem + code / DB root-cause analysis within 72 h.
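
The verifier's check, record_hash = sha256(canonical(payload) || prev_hash), can be sketched as below. The canonical form (JSON with sorted keys and compact separators), the all-zero genesis value and the row layout are assumptions; the actual canonicalisation is not defined in this document, and a mismatch there is exactly cause (a) above.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed genesis prev_hash


def canonical(payload: dict) -> bytes:
    # Assumed canonical form: sorted keys, compact separators, UTF-8.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def record_hash(payload: dict, prev_hash: bytes) -> bytes:
    return hashlib.sha256(canonical(payload) + prev_hash).digest()


def first_break(rows, genesis: bytes = GENESIS):
    """Return the index of the first row whose stored hash mismatches, or None."""
    prev = genesis
    for index, row in enumerate(rows):
        if row["record_hash"] != record_hash(row["payload"], prev):
            return index
        prev = row["record_hash"]
    return None
```

Returning the first mismatching index gives the start of the chain region that mitigation step 2 would partition off.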

FM-09 — MNO CBE adapter protocol change

Scenario. MNO upgrades their CBE (e.g., Ericsson protocol version bump); existing adapter rejects new response format.

Impact. Dispatches to that MNO start failing with CBE_REJECT; circuit eventually opens; broadcasts become PARTIAL until adapter updated.

Mitigation.

  1. Adapter abstraction (per SERVICE_OVERVIEW §6.2) — new adapter can be deployed without redeploying main service.
  2. Alternate-adapter fallback: if EricssonProprietaryCbeAdapter fails, try Standard3gppCbeAdapter (some MNO CBEs support both).
  3. Vendor-contact runbook: call MNO NOC within 1 h to confirm protocol change + request documentation.
  4. Adapter change deployed within 48 h SLA.

FM-10 — Replay attack

Scenario. Attacker captures a valid signed broadcast request and replays it.

Impact. Rejected REPLAY_NONCE; audit row written; rate-limit engages.

Mitigation.

  1. Signature window: request timestamp must be within 5 min of server time; older → rejected.
  2. Nonce deduplication: per-cert nonce cache in Redis (TTL 10 min); re-seen nonce → rejected.
  3. Multiple replay attempts → automated security incident (CISO paged).
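
The two checks above (timestamp window, then per-cert nonce deduplication) can be sketched as follows. An in-memory dictionary stands in for the Redis nonce cache, and the class name and return values are hypothetical; a Redis-backed version would more likely rely on an atomic set-if-absent with expiry.

```python
TIMESTAMP_WINDOW_S = 5 * 60   # request timestamp must be within 5 min
NONCE_TTL_S = 10 * 60         # nonce cache TTL


class ReplayGuard:
    def __init__(self):
        self._seen = {}  # (cert_subject, nonce) -> expiry in epoch seconds

    def check(self, cert_subject: str, nonce: str, request_ts: float, now: float) -> str:
        # 1. Signature window: reject timestamps too far from server time.
        if abs(now - request_ts) > TIMESTAMP_WINDOW_S:
            return "REPLAY_NONCE"
        # 2. Nonce deduplication, scoped per certificate.
        self._seen = {k: exp for k, exp in self._seen.items() if exp >= now}
        key = (cert_subject, nonce)
        if key in self._seen:
            return "REPLAY_NONCE"
        self._seen[key] = now + NONCE_TTL_S
        return "OK"
```

The 10 min TTL covers the full acceptance span of a request (up to 5 min either side of server time), so a nonce cannot expire from the cache while its timestamp would still be accepted.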

FM-11 — Concurrent cancellation race

Scenario. Two approvers click cancel near-simultaneously; or the original initiator and a second approver both attempt cancel at the same instant.

Impact. Potential duplicate CANCELLED transition.

Mitigation.

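  1. State transition uses a conditional UPDATE ... WHERE state = 'ACCEPTED'. The UPDATE takes a Postgres row-level lock on the matched row, so concurrent cancels serialise and the second one re-checks the WHERE clause after the first commits.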
  2. Single winner; loser receives INVALID_STATE.
  3. Audit records only the winning transition.
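
The single-winner pattern can be demonstrated with a conditional UPDATE and a row-count check. The sketch below uses SQLite from the Python standard library purely so it runs standalone; the table and column names are hypothetical, and the production path is Postgres as described above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with one ACCEPTED broadcast (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE broadcast (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute("INSERT INTO broadcast VALUES ('b-1', 'ACCEPTED')")
    conn.commit()
    return conn


def cancel(conn: sqlite3.Connection, broadcast_id: str) -> str:
    """Attempt the ACCEPTED -> CANCELLED transition; only one caller can win."""
    cur = conn.execute(
        "UPDATE broadcast SET state = 'CANCELLED' WHERE id = ? AND state = 'ACCEPTED'",
        (broadcast_id,),
    )
    conn.commit()
    # rowcount is 1 only for the caller whose UPDATE matched the ACCEPTED row.
    return "CANCELLED" if cur.rowcount == 1 else "INVALID_STATE"
```

The audit row would be written only on the `CANCELLED` branch, which is how the losing attempt stays out of the chain.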

FM-12 — Cell-DB refresh fails

Scenario. The MNO cell-database export endpoint is unreachable, or returns a malformed export, on the weekly refresh.

Impact. Polygon targeting degraded for that MNO; named-region targeting unaffected.

Mitigation.

  1. Retry 3× with exponential backoff.
  2. Manual-upload fallback: platform admin can upload cell-DB CSV via admin-dashboard.
  3. CbcCellDatabaseStale alert at 14 d.
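
Step 1 can be sketched as below, reading "retry 3×" as one initial attempt plus up to three retries. The base delay of 1 s and the doubling factor are assumptions; this document does not state the backoff parameters. The `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time


def fetch_with_backoff(fetch, retries: int = 3, base_delay_s: float = 1.0, sleep=time.sleep):
    """Call fetch(); on failure retry up to `retries` times with doubling delays."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                sleep(base_delay_s * (2 ** attempt))  # 1 s, 2 s, 4 s by default
    # All attempts failed: surface the last error so the stale-DB path takes over.
    raise last_exc
```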

FM-13 — Postgres unavailable

Scenario. cbc schema unavailable.

Impact. New broadcasts rejected with 503; existing in-flight broadcasts complete via Redis-cached state (dispatch records in Redis until ACK).

Mitigation.

  1. Postgres HA with synchronous replica; automatic fail-over ≤ 30 s.
  2. Multi-region fail-over (manual-gated) ≤ 15 min.
  3. In-flight broadcasts use Redis cache during outage (best-effort audit to be reconciled on recovery).

Recovery. DB recovery → new submissions accepted; any Redis-cached dispatch records reconciled into Postgres.


FM-14 — Region partition

Scenario. Kabul ↔ Mazar network partition.

Impact. Each region operates independently using region-local Postgres; cbc.audit.v1 stream replication is delayed. Geo-routed government callers continue to work against their local region.

Mitigation.

  1. Region-local operation is the design (ADR-0004 §5).
  2. No cross-region in-flight state to split-brain.
  3. Audit reconciliation on partition heal.
  4. Alert CbcRegionPartition.

FM-15 — Government-client clock drift

Scenario. Caller machine has NTP drift > 5 min; signature timestamp falls outside acceptance window.

Impact. Rejected REPLAY_NONCE even though caller intent was legitimate.

Mitigation.

  1. Caller SDK should use server-time reflection endpoint (GET /v1/cbc/time) to anchor timestamps.
  2. Error message includes server time + window bounds so caller can self-diagnose.
  3. Caller-support runbook: instruct client to fix NTP.
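
A caller SDK could anchor its timestamps to server time roughly as follows. The response shape of GET /v1/cbc/time is not specified here, so the sketch takes the server time as a plain number of epoch seconds; the function names and the error fields are hypothetical.

```python
WINDOW_S = 5 * 60  # acceptance window from FM-10 / FM-15


def clock_offset(server_time_s: float, local_send_s: float, local_recv_s: float) -> float:
    """Estimate (server - local) offset using the midpoint of the round trip."""
    return server_time_s - (local_send_s + local_recv_s) / 2.0


def anchored_timestamp(local_now_s: float, offset_s: float) -> float:
    """Timestamp to sign: local clock corrected by the measured offset."""
    return local_now_s + offset_s


def drift_error(request_ts_s: float, server_now_s: float):
    """Server side: return a self-diagnosable error if outside the window."""
    skew = request_ts_s - server_now_s
    if abs(skew) <= WINDOW_S:
        return None
    return {
        "code": "REPLAY_NONCE",
        "server_time": server_now_s,
        "window_start": server_now_s - WINDOW_S,
        "window_end": server_now_s + WINDOW_S,
        "skew_seconds": skew,
    }
```

With the offset applied, a caller whose local clock is 10 min slow still produces timestamps inside the window, which removes the need for an immediate NTP fix during an emergency.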

4. Graceful Degradation Summary

| Failure domain | Behaviour | Caller impact |
|---|---|---|
| Authentication | Fail-closed | Clear error reason; must fix + resubmit |
| HSM | Fail-closed (new); fail-open (in-flight) | Temporary outage → resubmit on recovery |
| Single MNO CBE | Continue; PARTIAL verdict | Detailed per-MNO breakdown |
| All MNO CBE | Fail (FAILED verdict) | Manual fallback runbook |
| Postgres | Fail-closed (new); fail-open (in-flight via Redis) | Temporary outage |
| NATS | Queue; no data loss | Slight audit delay |
| HSM + Postgres both out | Complete fail | Government emergency manual procedure |

5. Failure ↔ Experience Matrix

| FM | Government caller | Regulator (ATRA) | Citizen | NOC |
|---|---|---|---|---|
| FM-01 HSM out | Service temporarily unavailable | Missing broadcasts if prolonged | No broadcast received | Critical alert |
| FM-02 All MNO failed | FAILED; manual fallback expected | Aware via regulator-portal-service | No broadcast | CEO paged |
| FM-03 Partial | PARTIAL with breakdown | Aware | Some regions receive, others not | Alert |
| FM-04 PKI fail | Specific error; must fix cert | Audit visible | No impact | Probing-detection alert |
| FM-05 Encoding | Per-language status | Aware | Missing one language | Alert |
| FM-06 Cell resolution | Re-target required | | | |
| FM-07 Drill overdue | | ATRA notified (failure to meet cadence) | | Alert |
| FM-08 Chain break | | Audit integrity claim reduced | | Critical |
| FM-09 Adapter protocol change | PARTIAL for affected MNO | Aware | Missing that MNO | Alert |
| FM-10 Replay | Rejected; security incident if repeated | | | Probing-detection alert |
| FM-11 Cancel race | One succeeds; other gets INVALID_STATE | | | |
| FM-12 Cell-DB stale | Polygon targeting may fail | | | Warning alert |
| FM-13 Postgres out | Temporary; in-flight completes | | No new broadcasts | Critical |
| FM-14 Region partition | Region-local still works | | Reaches own region | Alert |
| FM-15 Clock drift | Rejected; caller-support ticket | | | Caller metric |

6. Open Points

| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact SLA for manual-dispatch fallback (FM-02) agreed with MNO NOCs | Regulator Liaison + MNO Partnerships |
| FM-OPEN-02 | Government-client SDK distribution channel (for NTP + timestamp helpers) | DevRel |
| FM-OPEN-03 | Out-of-band communication bridge (phone + Slack + radio?) for nationwide HSM outage | Government / Emergency + CISO |
| FM-OPEN-04 | Signature-window tolerance: is 5 min adequate, or should it be tighter? | Security |