cbc-bridge-service — Observability
Version: 1.0
Status: Draft
Owner: Government / Emergency + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, docs/architecture/ADR-0004-national-backbone-resilience.md, services/compliance-engine/OBSERVABILITY.md
Because cbc-bridge-service handles civil emergency alerts with a government-PKI-authenticated request boundary and per-MNO dispatch fan-out, its observability posture is designed for two distinct audiences: (1) the 24×7 NOC that must confirm a broadcast landed, and (2) the regulator/Legal stakeholders who need immutable audit of who initiated what.
1. Prometheus Metrics
All metrics carry the standard labels service="cbc-bridge-service", region, namespace, pod. Additional labels are listed per metric.
1.1 Broadcast-pipeline metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_broadcast_requested_total | Counter | severity (P0/P1/P2), caller_org, region | Incoming broadcast submissions |
| cbc_broadcast_accepted_total | Counter | severity, caller_org | After PKI + authorised-caller check |
| cbc_broadcast_rejected_total | Counter | reason (UNAUTHENTICATED, CALLER_NOT_REGISTERED, INVALID_ARGUMENT, UNAUTHORIZED_SEVERITY, UNAUTHORIZED_REGION) | Broadcast rejected |
| cbc_broadcast_accept_seconds | Histogram | severity | Time from request receipt to 202 response; buckets [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5] |
| cbc_broadcast_dispatch_seconds | Histogram | mno, severity | Time from accept → MNO CBE dispatch; buckets [0.1, 0.5, 1, 2, 5, 10, 30, 60] |
| cbc_broadcast_ack_seconds | Histogram | mno, severity | Time from dispatch → MNO ACK; buckets [0.5, 1, 2, 5, 10, 30, 60, 120] |
| cbc_broadcast_final_status_total | Counter | status (DELIVERED/PARTIAL/FAILED/CANCELLED), severity | Final verdict distribution |
| cbc_broadcast_cancelled_total | Counter | reason (DUAL_CONTROL, RECALLED, EXPIRED) | Cancellations |
1.2 Per-MNO dispatch metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_mno_dispatch_total | Counter | mno, adapter (standard3gpp/ericsson/huawei) | Dispatches attempted |
| cbc_mno_dispatch_success_total | Counter | mno, adapter | ACKED |
| cbc_mno_dispatch_failed_total | Counter | mno, adapter, reason (TIMEOUT, CBE_REJECT, CIRCUIT_OPEN, NETWORK_ERR) | Failures |
| cbc_mno_circuit_state | Gauge | mno, adapter | 0=closed, 1=half-open, 2=open |
| cbc_mno_adapter_available | Gauge | mno, adapter | 0 or 1 |
1.3 PKI + authorised-caller metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_pki_signature_verified_total | Counter | result (SUCCESS/FAILURE), failure_reason | Signature outcomes |
| cbc_pki_crl_check_total | Counter | result, ca_subject | CRL/OCSP |
| cbc_pki_cert_expiring_days | Gauge | caller_org, cert_subject | Days until caller cert expires |
| cbc_authorised_callers_total | Gauge | status (ACTIVE/REVOKED) | Registry size |
| cbc_authorised_caller_mismatch_total | Counter | reason (SUBJECT_UNKNOWN, SEVERITY_DENIED, REGION_DENIED) | Caller authorised-but-out-of-scope |
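The three reason values on cbc_authorised_caller_mismatch_total imply a scope check against the authorised-caller registry. The sketch below shows one way such a check could produce those label values. The `CallerGrant` shape, the `checkCallerScope` function, the example subject, and the order of the checks are all assumptions for illustration, not the service's actual registry model.

```typescript
type Severity = "P0" | "P1" | "P2";
type MismatchReason = "SUBJECT_UNKNOWN" | "SEVERITY_DENIED" | "REGION_DENIED";

// Hypothetical registry entry: what one certificate subject may broadcast.
interface CallerGrant {
  certSubject: string;
  severities: Severity[];
  regions: string[];
}

// Returns null when the request is in scope, otherwise the value to record
// in the `reason` label of cbc_authorised_caller_mismatch_total.
// ASSUMPTION: subject is checked first, then severity, then region.
function checkCallerScope(
  registry: CallerGrant[],
  certSubject: string,
  severity: Severity,
  region: string,
): MismatchReason | null {
  const grant = registry.find((g) => g.certSubject === certSubject);
  if (!grant) return "SUBJECT_UNKNOWN";
  if (!grant.severities.includes(severity)) return "SEVERITY_DENIED";
  if (!grant.regions.includes(region)) return "REGION_DENIED";
  return null;
}

// Illustrative registry with a single, made-up caller.
const exampleRegistry: CallerGrant[] = [
  { certSubject: "CN=example-agency", severities: ["P1", "P2"], regions: ["north"] },
];
```

A caller holding a valid certificate but requesting P0 in this example registry would be counted under SEVERITY_DENIED rather than rejected as unauthenticated.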
1.4 Audit + drill metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_audit_chain_verifier_status | Gauge | verifier_run_id | 0=OK, 1=break detected |
| cbc_audit_chain_last_verified_seconds | Gauge | (none) | Age since last successful verification |
| cbc_audit_rows_written_total | Counter | event_type | Append-only writes |
| cbc_drill_scheduled_total | Counter | kind (monthly/ad-hoc) | Drills scheduled |
| cbc_drill_completed_total | Counter | status (SUCCESS/PARTIAL/FAILED) | Drills completed |
| cbc_drill_overdue_seconds | Gauge | (none) | Seconds past next-expected drill window |
| cbc_cell_database_last_refresh_seconds | Gauge | mno | Age of cell-tower database per MNO |
1.5 HSM + downstream-service metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_hsm_operation_total | Counter | op (verify/sign), result | HSM calls |
| cbc_hsm_operation_seconds | Histogram | op | HSM latency; buckets [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] |
| cbc_nats_publish_total | Counter | subject, result | NATS publish outcomes |
| cbc_nats_publish_lag_seconds | Histogram | subject | NATS ack latency |
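Histogram buckets are chosen to bracket the latency expected at each pipeline stage at national-backbone scale: 10 ms to 5 s for accept, 100 ms to 60 s for dispatch, 500 ms to 120 s for MNO ACK, and 1 ms to 1 s for HSM operations.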
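To make the bucket layout concrete, the sketch below shows how cumulative `le` buckets count observations, using the cbc_broadcast_accept_seconds bounds from section 1.1. It does not use a Prometheus client library. The helper name `cumulativeBucketCounts` and the sample latencies are hypothetical.

```typescript
// Bucket upper bounds (seconds) for cbc_broadcast_accept_seconds, from section 1.1.
const ACCEPT_BUCKETS: number[] = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5];

// Hypothetical helper: returns the cumulative count per `le` bound plus +Inf,
// mirroring Prometheus histogram semantics (an observation counts in every
// bucket whose bound is >= the observed value).
function cumulativeBucketCounts(
  bounds: number[],
  observations: number[],
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of bounds) {
    counts.set(String(le), observations.filter((v) => v <= le).length);
  }
  counts.set("+Inf", observations.length);
  return counts;
}

// Illustrative input: three accepts taking 8 ms, 300 ms and 7 s.
const acceptCounts = cumulativeBucketCounts(ACCEPT_BUCKETS, [0.008, 0.3, 7]);
```

An observation above the largest bound is counted only in +Inf, so the top bound needs to sit above the slowest latency that still has to be distinguished on a dashboard or in a quantile estimate.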
2. Structured Log Events
All logs are emitted as Pino JSON with the standard fields (ts, traceId, spanId, service, region, level). Service-specific fields are listed per event class:
2.1 Broadcast lifecycle events
- cbc.broadcast.requested: { broadcastId, callerOrg, certSubject, severity, geoTarget.kind, languages[], pkiSigHash }
- cbc.broadcast.accepted: as above + acceptedAt, expectedDispatchBy
- cbc.broadcast.dispatched: { broadcastId, mno, adapter, cbsMessageIdentifier, pduCount, latencyMs }
- cbc.broadcast.acked: { broadcastId, mno, perMno: { status, latencyMs, error? }[] }
- cbc.broadcast.final: { broadcastId, finalStatus, perMnoBreakdown }
2.2 Security events
- cbc.pki.verified: { callerOrg, certSubject, certFingerprintSha256, verifyDurationMs }
- cbc.pki.failed: { callerOrg?, certSubject?, reason (BAD_SIG/CERT_EXPIRED/CRL_REVOKED/OCSP_REJECT/REPLAY_NONCE), sourceIp }
- cbc.authorisation.denied: { callerOrg, requested (severity, region), granted: false, reason }
- cbc.cancel.initiated: { broadcastId, cancelInitiator, cancelApprover, timeDelta }
2.3 Operational events
- cbc.mno.adapter.health: { mno, adapter, state (HEALTHY/DEGRADED/UNAVAILABLE), lastOkAt }
- cbc.drill.scheduled: { drillId, scheduledFor }
- cbc.drill.completed: { drillId, status, reachedMnos, unreachedMnos, reportS3Uri }
- cbc.cellDb.refreshed: { mno, rowCount, newRowsDelta, refreshDurationMs }
2.4 Audit events
- cbc.audit.appended: { recordId, prevHash, recordHash, eventType, payloadSha256 }
- cbc.audit.chain_verified: { from, to, ok, firstBreakAt? }
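The prevHash and recordHash fields describe a hash chain. As a minimal sketch of how append and verification could fit together, the code below links records with SHA-256 and returns an object shaped like the cbc.audit.chain_verified event. The hash input layout, the "GENESIS" seed, the numeric record IDs, and the function names are assumptions; this document does not specify the real scheme.

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  recordId: number;
  prevHash: string;
  recordHash: string;
  eventType: string;
  payloadSha256: string;
}

const sha256Hex = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// ASSUMPTION: recordHash = SHA-256(prevHash | eventType | payloadSha256).
function hashRecord(prevHash: string, eventType: string, payloadSha256: string): string {
  return sha256Hex(`${prevHash}|${eventType}|${payloadSha256}`);
}

// Appends a record whose prevHash is the previous record's recordHash.
function appendRecord(chain: AuditRecord[], eventType: string, payload: string): AuditRecord {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.recordHash : "GENESIS"; // hypothetical seed value
  const payloadSha256 = sha256Hex(payload);
  const record: AuditRecord = {
    recordId: chain.length + 1,
    prevHash,
    recordHash: hashRecord(prevHash, eventType, payloadSha256),
    eventType,
    payloadSha256,
  };
  chain.push(record);
  return record;
}

// Walks the chain and reports the first record whose link or hash is wrong.
function verifyChain(
  chain: AuditRecord[],
): { from: number; to: number; ok: boolean; firstBreakAt?: number } {
  let prevHash = "GENESIS";
  for (const r of chain) {
    const expected = hashRecord(r.prevHash, r.eventType, r.payloadSha256);
    if (r.prevHash !== prevHash || r.recordHash !== expected) {
      return { from: 1, to: chain.length, ok: false, firstBreakAt: r.recordId };
    }
    prevHash = r.recordHash;
  }
  return { from: 1, to: chain.length, ok: true };
}
```

Under these assumptions, altering any stored field of an earlier record changes its expected hash, so verification reports that record as the first break.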
PII policy
The caller-organisation identity is logged (government accountability is the point of this service). No subscriber MSISDN appears in any log — cell broadcast is targeted by area, not by individual. Cell IDs may appear.
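One way to enforce this policy mechanically is a guard that scans a log payload for forbidden field names before it is emitted. The sketch below is an assumption about how such a guard might look. The deny-list contents and the `findForbiddenFields` name are hypothetical, and the check covers key names only, not values.

```typescript
// Hypothetical deny-list of field names that would carry a subscriber number.
const FORBIDDEN_KEYS = new Set<string>(["msisdn"]);

// Returns the dotted paths of any forbidden keys found anywhere in a payload.
// Array elements appear in the path by index, e.g. "perMno.0.msisdn".
function findForbiddenFields(value: unknown, path = ""): string[] {
  if (value === null || typeof value !== "object") return [];
  const hits: string[] = [];
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    const childPath = path ? `${path}.${key}` : key;
    if (FORBIDDEN_KEYS.has(key.toLowerCase())) hits.push(childPath);
    hits.push(...findForbiddenFields(child, childPath));
  }
  return hits;
}
```

A caller could refuse to log, or drop the offending fields, whenever the returned list is non-empty. Cell IDs and caller-organisation fields pass through untouched.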
3. OpenTelemetry Tracing
The OpenTelemetry SDK (Node) is initialised before NestFactory, with auto-instrumentation for gRPC, HTTP, Postgres, and NATS. Explicit manual spans cover:
- cbc.broadcast.accept: incoming request → persisted row
- cbc.pki.verify: HSM round-trip
- cbc.cbs.encode: language × severity → CBS PDU
- cbc.mno.dispatch: per-MNO fan-out (span per MNO)
- cbc.mno.ack.wait: per-MNO ack-wait
- cbc.audit.append: chain append
- cbc.cbs.cancel: cancellation path
W3C TraceContext propagation from regulator-portal-service and government clients. Span attributes include broadcast.id, caller.org (but not cert secret material), mno, severity.
Sampling: head-based 100% for errors; 100% for severity=P0; 10% for severity=P2; 1% for non-broadcast background jobs.
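The sampling tiers can be sketched as a rate lookup followed by a deterministic decision on the trace ID. The code below is illustrative only and does not use the OpenTelemetry sampler API. The `SamplingInput` shape and function names are hypothetical, and the 100% rate for P1 and for unclassified work is an assumption, since neither is listed above. The `isError` flag stands in for however the service learns that a request failed, because an error is generally known only after work has started.

```typescript
type BroadcastSeverity = "P0" | "P1" | "P2";

// Hypothetical input to the sampling decision.
interface SamplingInput {
  traceId: string;               // 32 hex characters, as in W3C TraceContext
  severity?: BroadcastSeverity;  // absent for non-broadcast work
  isError: boolean;
  isBackgroundJob: boolean;
}

// Rates from the sampling policy above.
function sampleRate(input: SamplingInput): number {
  if (input.isError) return 1;
  // ASSUMPTION: P1 is not listed in the policy; this sketch keeps all of it.
  if (input.severity === "P0" || input.severity === "P1") return 1;
  if (input.severity === "P2") return 0.1;
  if (input.isBackgroundJob) return 0.01;
  return 1; // ASSUMPTION: unclassified work is kept
}

// Deterministic ratio decision on the low 32 bits of the trace ID, so any
// component evaluating the same trace ID and rate reaches the same verdict.
function shouldSample(input: SamplingInput): boolean {
  const low32 = parseInt(input.traceId.slice(-8), 16);
  return low32 < sampleRate(input) * 0x100000000;
}
```

Deriving the decision from the trace ID rather than from a random draw keeps a trace either fully sampled or fully dropped across the per-MNO fan-out spans.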
4. Alerting Rules
Alertmanager routes alerts to PagerDuty, with team=government-emergency as primary and team=sre as secondary.
groups:
  - name: cbc-bridge.rules
    rules:
      - alert: CbcBroadcastDispatchFailureCritical
        expr: sum(rate(cbc_mno_dispatch_failed_total{reason!="CIRCUIT_OPEN"}[5m])) by (mno) / sum(rate(cbc_mno_dispatch_total[5m])) by (mno) > 0.25
        for: 2m
        labels: { severity: critical, team: government-emergency }
        annotations:
          summary: "CBC dispatch failing for {{ $labels.mno }}"
          runbook: https://runbooks.ghasi.io/cbc/mno-dispatch-failure
      - alert: CbcBroadcastAllMnoFailed
        expr: sum(increase(cbc_broadcast_final_status_total{status="FAILED"}[10m])) > 0
        for: 1m
        labels: { severity: critical, team: government-emergency, page: ceo }
      - alert: CbcPkiVerifyFailureSpike
        expr: sum(rate(cbc_pki_signature_verified_total{result="FAILURE"}[5m])) > 0.05
        for: 5m
        labels: { severity: high, team: security }
      - alert: CbcHsmUnavailable
        expr: up{job="cbc-bridge",subsystem="hsm"} == 0 or rate(cbc_hsm_operation_total{result="FAILURE"}[5m]) > 0.1
        for: 2m
        labels: { severity: critical, team: sre }
      - alert: CbcAuditChainBroken
        expr: cbc_audit_chain_verifier_status == 1
        for: 0m
        labels: { severity: critical, team: government-emergency, page: ciso }
      - alert: CbcDrillOverdue
        expr: cbc_drill_overdue_seconds > 604800  # 7 days
        for: 1h
        labels: { severity: high, team: government-emergency }
      - alert: CbcCellDatabaseStale
        expr: cbc_cell_database_last_refresh_seconds > 1209600  # 14 days
        for: 1h
        labels: { severity: medium, team: government-emergency }
      - alert: CbcAuthorisedCallerCertExpiringSoon
        expr: cbc_pki_cert_expiring_days < 14
        for: 1h
        labels: { severity: medium, team: government-emergency }
      - alert: CbcBroadcastAcceptLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(cbc_broadcast_accept_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels: { severity: medium, team: sre }
      - alert: CbcPartialDispatchRateHigh
        expr: sum(rate(cbc_broadcast_final_status_total{status="PARTIAL"}[1h])) / sum(rate(cbc_broadcast_final_status_total[1h])) > 0.1
        for: 30m
        labels: { severity: high, team: government-emergency }
Every alert has a linked runbook in runbooks/cbc/.
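To illustrate how the CbcBroadcastDispatchFailureCritical expression behaves, the sketch below applies the same ratio to per-MNO counter increases over one window. Because both rates use the same window, the ratio of increases equals the ratio of rates. The `DispatchWindow` shape, the function name, and the MNO names in the usage note are hypothetical, and the `for: 2m` hold period is not modelled.

```typescript
// Hypothetical snapshot of counter increases for one MNO over one window.
interface DispatchWindow {
  mno: string;
  dispatched: number;                      // increase of cbc_mno_dispatch_total
  failedByReason: Record<string, number>;  // increase of cbc_mno_dispatch_failed_total per reason
}

// Mirrors the alert expression: failures other than CIRCUIT_OPEN divided by
// all dispatches, per MNO, compared against the 0.25 threshold.
function dispatchFailureBreaches(windows: DispatchWindow[], threshold = 0.25): string[] {
  const breaching: string[] = [];
  for (const w of windows) {
    if (w.dispatched === 0) continue; // no traffic in the window, so no ratio to evaluate
    let failed = 0;
    for (const [reason, n] of Object.entries(w.failedByReason)) {
      if (reason !== "CIRCUIT_OPEN") failed += n;
    }
    if (failed / w.dispatched > threshold) breaching.push(w.mno);
  }
  return breaching;
}
```

For example, an MNO with 100 dispatches, 20 TIMEOUT failures and 50 CIRCUIT_OPEN failures has a ratio of 0.2 and does not breach, since failures caused by an already-open circuit are excluded from the numerator.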
5. Grafana Dashboard Panels
Dashboard cbc-bridge-service.json provides three rows targeting different audiences:
5.1 NOC row (always-on SPoG)
- Tile: current broadcasts in-flight + their state
- Tile: per-MNO circuit state (red/amber/green)
- Tile: last drill status + next drill ETA
- Panel: broadcast final-status distribution (24h)
- Panel: per-MNO dispatch latency P50/P95/P99 heatmap
- Panel: HSM latency + error rate
5.2 Regulator / Legal row
- Panel: audit-chain verifier status + last-verified-age
- Panel: authorised-caller cert expiry (top-10 nearest)
- Panel: PKI verification success/fail per caller-org (24h)
- Panel: drill cadence vs. schedule
- Panel: signed-file generation pipeline status (PDF + package delivery)
- Panel: per-region broadcast distribution (regulatory geo audit)
5.3 Engineering row
- Panel: NATS publish lag by subject
- Panel: adapter availability per MNO/adapter
- Panel: cell-database age per MNO
- Panel: Postgres/Redis/HSM pool utilisation
- Panel: K8s HPA activity (replicas over time)
Dashboard links to the NOC dashboard (EP-ADMDASH-09) and regulator workbench (EP-ADMDASH-10).
6. Runbook Index
| Alert | Runbook |
|---|---|
| CbcBroadcastDispatchFailureCritical | runbooks/cbc/mno-dispatch-failure.md |
| CbcBroadcastAllMnoFailed | runbooks/cbc/all-mno-failed.md (CEO-paging incident) |
| CbcPkiVerifyFailureSpike | runbooks/cbc/pki-verify-spike.md (probing detection) |
| CbcHsmUnavailable | runbooks/cbc/hsm-unavailable.md |
| CbcAuditChainBroken | runbooks/cbc/chain-broken.md |
| CbcDrillOverdue | runbooks/cbc/drill-overdue.md |
| CbcCellDatabaseStale | runbooks/cbc/cell-db-stale.md |
| CbcAuthorisedCallerCertExpiringSoon | runbooks/cbc/caller-cert-renewal.md |
| CbcBroadcastAcceptLatencyHigh | runbooks/cbc/accept-latency.md |
| CbcPartialDispatchRateHigh | runbooks/cbc/partial-rate.md |
Each runbook has: detection signal, hypotheses, immediate mitigations, escalation tree (incl. CEO + Board Secretary if emergency-broadcast-level impact), post-incident review template.
7. SLIs / SLOs
Bound to the platform NFR catalog (EP-PLAT-NB-09). Concrete SLOs:
| SLI | SLO target | Error-budget window |
|---|---|---|
| BroadcastEmergency accept latency (P99) | ≤ 500 ms | 30 d |
| Broadcast.dispatched.v1 emit latency (P95) from accept | ≤ 15 s | 30 d |
| Any-MNO ACK latency (P95) from dispatch | ≤ 30 s | 30 d |
| Broadcast final-status DELIVERED or PARTIAL rate | ≥ 99.9% | 90 d |
| PKI verification success rate (genuine callers) | ≥ 99.99% | 30 d |
| HSM availability | ≥ 99.95% | 30 d |
| Audit chain integrity | 100% (no breaks) | continuous |
| Monthly drill completion | 100% | annual |
Error-budget burn alerts fire at 5% and 25% of monthly budget consumed.
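As a worked illustration of the 5% and 25% thresholds, the sketch below computes the fraction of an error budget consumed for an event-based SLI and reports which thresholds have been crossed. Treating the budget as a count of allowed bad events is an assumption, as are the function names; the document does not state how budget consumption is measured.

```typescript
// Fraction of the error budget consumed: observed bad events divided by the
// number of bad events the SLO target allows for the same total.
// ASSUMPTION: the SLI is event-based (good events / total events).
function budgetConsumed(total: number, bad: number, sloTarget: number): number {
  const allowedBad = total * (1 - sloTarget);
  if (allowedBad === 0) return bad > 0 ? Infinity : 0; // e.g. a 100% target
  return bad / allowedBad;
}

// Returns the burn thresholds that the consumed fraction has reached.
function burnThresholdsCrossed(
  consumed: number,
  thresholds: number[] = [0.05, 0.25],
): number[] {
  return thresholds.filter((t) => consumed >= t);
}
```

With a 99.9% target and 10,000 broadcasts in the window, the budget is about 10 failed broadcasts, so a single failure consumes roughly 10% of it and crosses the 5% threshold, while three failures cross both.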
8. Log Retention
| Stream | Hot (Loki) | Cold (S3) |
|---|---|---|
| Broadcast + PKI + audit events | 14 d | 13 m + 7 y object-lock |
| Operational + debug | 14 d | 30 d |
Cold-tier queries run against ClickHouse (per analytics-service EP-ANLYT-02).