cbc-bridge-service — Observability

Version: 1.0
Status: Draft
Owner: Government / Emergency + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, docs/architecture/ADR-0004-national-backbone-resilience.md, services/compliance-engine/OBSERVABILITY.md

Because cbc-bridge-service handles civil emergency alerts with a government-PKI-authenticated request boundary and per-MNO dispatch fan-out, its observability posture is designed for two distinct audiences: (1) the 24×7 NOC that must confirm a broadcast landed, and (2) the regulator/Legal stakeholders who need immutable audit of who initiated what.


1. Prometheus Metrics

All metrics carry the standard labels service="cbc-bridge-service", region, namespace, pod. Additional labels are listed per metric.

1.1 Broadcast-pipeline metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_broadcast_requested_total | Counter | severity (P0/P1/P2), caller_org, region | Incoming broadcast submissions |
| cbc_broadcast_accepted_total | Counter | severity, caller_org | After PKI + authorised-caller check |
| cbc_broadcast_rejected_total | Counter | reason (UNAUTHENTICATED, CALLER_NOT_REGISTERED, INVALID_ARGUMENT, UNAUTHORIZED_SEVERITY, UNAUTHORIZED_REGION) | Broadcast rejected |
| cbc_broadcast_accept_seconds | Histogram | severity | Time from request receipt to 202 response; buckets [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5] |
| cbc_broadcast_dispatch_seconds | Histogram | mno, severity | Time from accept → MNO CBE dispatch; buckets [0.1, 0.5, 1, 2, 5, 10, 30, 60] |
| cbc_broadcast_ack_seconds | Histogram | mno, severity | Time from dispatch → MNO ACK; buckets [0.5, 1, 2, 5, 10, 30, 60, 120] |
| cbc_broadcast_final_status_total | Counter | status (DELIVERED/PARTIAL/FAILED/CANCELLED), severity | Final verdict distribution |
| cbc_broadcast_cancelled_total | Counter | reason (DUAL_CONTROL, RECALLED, EXPIRED) | Cancellations |

1.2 Per-MNO dispatch metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_mno_dispatch_total | Counter | mno, adapter (standard3gpp/ericsson/huawei) | Dispatches attempted |
| cbc_mno_dispatch_success_total | Counter | mno, adapter | ACKED |
| cbc_mno_dispatch_failed_total | Counter | mno, adapter, reason (TIMEOUT, CBE_REJECT, CIRCUIT_OPEN, NETWORK_ERR) | Failures |
| cbc_mno_circuit_state | Gauge | mno, adapter | 0=closed, 1=half-open, 2=open |
| cbc_mno_adapter_available | Gauge | mno, adapter | 0 or 1 |
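
The numeric encoding of cbc_mno_circuit_state is easiest to keep consistent if it lives in one place. The following is a minimal sketch only: the type and helper names are hypothetical, and the colour mapping for the NOC tile is an assumption (the document states only that the tile is red/amber/green).

```typescript
// Hypothetical helper names; only the 0/1/2 encoding comes from the table above.
type CircuitState = "closed" | "half-open" | "open";

const CIRCUIT_STATE_GAUGE: Record<CircuitState, number> = {
  closed: 0,
  "half-open": 1,
  open: 2,
};

// Value to set on cbc_mno_circuit_state for a given breaker state.
function circuitStateGaugeValue(state: CircuitState): number {
  return CIRCUIT_STATE_GAUGE[state];
}

// ASSUMPTION: closed/half-open/open map to green/amber/red on the NOC tile.
function circuitTileColour(gaugeValue: number): "green" | "amber" | "red" | "unknown" {
  switch (gaugeValue) {
    case 0:
      return "green";
    case 1:
      return "amber";
    case 2:
      return "red";
    default:
      return "unknown"; // any other value indicates an encoding bug
  }
}
```

Because the gauge is ordered by severity, a query such as max by (mno) over the gauge would surface the worst adapter state per MNO.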

1.3 PKI + authorised-caller metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_pki_signature_verified_total | Counter | result (SUCCESS/FAILURE), failure_reason | Signature outcomes |
| cbc_pki_crl_check_total | Counter | result, ca_subject | CRL/OCSP |
| cbc_pki_cert_expiring_days | Gauge | caller_org, cert_subject | Days until caller cert expires |
| cbc_authorised_callers_total | Gauge | status (ACTIVE/REVOKED) | Registry size |
| cbc_authorised_caller_mismatch_total | Counter | reason (SUBJECT_UNKNOWN, SEVERITY_DENIED, REGION_DENIED) | Caller authorised-but-out-of-scope |
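
To illustrate how the three mismatch reasons could arise, here is a hedged sketch of a scope check against an authorised-caller registry. The registry shape, field names, function name and the order of the checks are all assumptions made for illustration; only the reason values come from the table above.

```typescript
// Hypothetical shapes: the real registry schema is not described in this document.
type Severity = "P0" | "P1" | "P2";
type MismatchReason = "SUBJECT_UNKNOWN" | "SEVERITY_DENIED" | "REGION_DENIED";

interface AuthorisedCaller {
  certSubject: string;
  callerOrg: string;
  allowedSeverities: Severity[];
  allowedRegions: string[];
}

// Returns null when the request is in scope, otherwise the `reason` label value
// that cbc_authorised_caller_mismatch_total would be incremented with.
function checkCallerScope(
  registry: Map<string, AuthorisedCaller>,
  certSubject: string,
  severity: Severity,
  region: string,
): MismatchReason | null {
  const caller = registry.get(certSubject);
  if (caller === undefined) return "SUBJECT_UNKNOWN";
  if (!caller.allowedSeverities.includes(severity)) return "SEVERITY_DENIED";
  if (!caller.allowedRegions.includes(region)) return "REGION_DENIED";
  return null;
}

// Example registry with a single made-up caller.
const exampleRegistry = new Map<string, AuthorisedCaller>([
  [
    "CN=example-agency",
    {
      certSubject: "CN=example-agency",
      callerOrg: "example-agency",
      allowedSeverities: ["P1", "P2"],
      allowedRegions: ["north"],
    },
  ],
]);
```

With this example registry, a P0 request from CN=example-agency yields SEVERITY_DENIED, while a P1 request for region north is in scope and yields null.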

1.4 Audit + drill metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_audit_chain_verifier_status | Gauge | verifier_run_id | 0=OK, 1=break detected |
| cbc_audit_chain_last_verified_seconds | Gauge | (none) | Age since last successful verification |
| cbc_audit_rows_written_total | Counter | event_type | Append-only writes |
| cbc_drill_scheduled_total | Counter | kind (monthly/ad-hoc) | Drills scheduled |
| cbc_drill_completed_total | Counter | status (SUCCESS/PARTIAL/FAILED) | Drills completed |
| cbc_drill_overdue_seconds | Gauge | (none) | Seconds past next-expected drill window |
| cbc_cell_database_last_refresh_seconds | Gauge | mno | Age of cell-tower database per MNO |

1.5 HSM + downstream-service metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cbc_hsm_operation_total | Counter | op (verify/sign), result | HSM calls |
| cbc_hsm_operation_seconds | Histogram | op | HSM latency; buckets [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] |
| cbc_nats_publish_total | Counter | subject, result | NATS publish outcomes |
| cbc_nats_publish_lag_seconds | Histogram | subject | NATS ack latency |

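Histogram bucket boundaries are in seconds and are chosen per stage to match the expected latency profile at national-backbone scale: the HSM histogram tops out at 1 s, the accept histogram at 5 s, the dispatch histogram at 60 s, and the ACK histogram at 120 s.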
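
Prometheus histogram buckets are cumulative: each observation increments every bucket whose upper bound (le) is at or above the observed value, plus the implicit +Inf bucket. The sketch below reproduces that counting for the cbc_broadcast_accept_seconds boundaries without any metrics library; the function name and the sample observations are made up for illustration.

```typescript
// Bucket upper bounds (seconds) for cbc_broadcast_accept_seconds, from section 1.1.
const ACCEPT_BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5];

// Returns cumulative counts per `le` bound, followed by the implicit +Inf bucket,
// mirroring how Prometheus exposes histogram buckets.
function cumulativeBucketCounts(bounds: number[], observations: number[]): number[] {
  const counts = new Array<number>(bounds.length + 1).fill(0);
  for (const value of observations) {
    bounds.forEach((le, i) => {
      if (value <= le) counts[i] += 1;
    });
    counts[bounds.length] += 1; // +Inf counts every observation
  }
  return counts;
}

// Hypothetical accept latencies in seconds.
const exampleCounts = cumulativeBucketCounts(ACCEPT_BUCKETS, [0.03, 0.2, 0.7, 6]);
console.log(exampleCounts); // [0, 1, 1, 2, 2, 3, 3, 3, 4]

// Share of requests at or under 0.5 s: the le="0.5" bucket divided by +Inf.
const within500ms = exampleCounts[4] / exampleCounts[8];
console.log(within500ms); // 0.5
```

Because 0.5 is itself a bucket boundary, the share of requests accepted within 500 ms can be read directly from the buckets rather than interpolated.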


2. Structured Log Events

All logs are emitted as Pino JSON with the standard fields (ts, traceId, spanId, service, region, level). Service-specific fields are listed per event class:

2.1 Broadcast lifecycle events

  • cbc.broadcast.requested: { broadcastId, callerOrg, certSubject, severity, geoTarget.kind, languages[], pkiSigHash }
  • cbc.broadcast.accepted: as above + acceptedAt, expectedDispatchBy
  • cbc.broadcast.dispatched: { broadcastId, mno, adapter, cbsMessageIdentifier, pduCount, latencyMs }
  • cbc.broadcast.acked: { broadcastId, mno, perMno: { status, latencyMs, error? }[] }
  • cbc.broadcast.final: { broadcastId, finalStatus, perMnoBreakdown }
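
As an illustration of how the cbc.broadcast.final payload might be assembled from per-MNO results, consider the sketch below. It rests on assumptions not stated in this document: that DELIVERED means every MNO acknowledged, FAILED means none did, and PARTIAL covers everything in between; that per-MNO status is either ACKED or FAILED; and that the event name travels in a field called event. CANCELLED is set on the cancellation path and is not derived here.

```typescript
// ASSUMED per-MNO status values and derivation rule; see the lead-in.
type PerMnoStatus = "ACKED" | "FAILED";
type FinalStatus = "DELIVERED" | "PARTIAL" | "FAILED";

interface PerMnoResult {
  mno: string;
  status: PerMnoStatus;
  latencyMs: number;
  error?: string;
}

// All acked -> DELIVERED, none acked -> FAILED, otherwise PARTIAL.
function deriveFinalStatus(results: PerMnoResult[]): FinalStatus {
  const acked = results.filter((r) => r.status === "ACKED").length;
  if (results.length > 0 && acked === results.length) return "DELIVERED";
  if (acked === 0) return "FAILED";
  return "PARTIAL";
}

// Builds the service-specific part of the log line; the standard fields
// (ts, traceId, spanId, ...) would be added by the logger.
function buildFinalEvent(broadcastId: string, results: PerMnoResult[]) {
  return {
    event: "cbc.broadcast.final", // hypothetical field name
    broadcastId,
    finalStatus: deriveFinalStatus(results),
    perMnoBreakdown: results,
  };
}

// Hypothetical MNO names and latencies.
const exampleFinalEvent = buildFinalEvent("b-123", [
  { mno: "mno-a", status: "ACKED", latencyMs: 1200 },
  { mno: "mno-b", status: "FAILED", latencyMs: 30000, error: "TIMEOUT" },
]);
console.log(JSON.stringify(exampleFinalEvent));
```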

2.2 Security events

  • cbc.pki.verified: { callerOrg, certSubject, certFingerprintSha256, verifyDurationMs }
  • cbc.pki.failed: { callerOrg?, certSubject?, reason (BAD_SIG/CERT_EXPIRED/CRL_REVOKED/OCSP_REJECT/REPLAY_NONCE), sourceIp }
  • cbc.authorisation.denied: { callerOrg, requested (severity, region), granted: false, reason }
  • cbc.cancel.initiated: { broadcastId, cancelInitiator, cancelApprover, timeDelta }

2.3 Operational events

  • cbc.mno.adapter.health: { mno, adapter, state (HEALTHY/DEGRADED/UNAVAILABLE), lastOkAt }
  • cbc.drill.scheduled: { drillId, scheduledFor }
  • cbc.drill.completed: { drillId, status, reachedMnos, unreachedMnos, reportS3Uri }
  • cbc.cellDb.refreshed: { mno, rowCount, newRowsDelta, refreshDurationMs }

2.4 Audit events

  • cbc.audit.appended: { recordId, prevHash, recordHash, eventType, payloadSha256 }
  • cbc.audit.chain_verified: { from, to, ok, firstBreakAt? }
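
The prevHash and recordHash fields suggest a hash-chained, append-only log. The sketch below shows one way such a chain could be appended to and verified, using Node's built-in crypto module. What recordHash actually covers, the all-zero genesis value, and sequential record IDs are assumptions for illustration, not the service's implementation.

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  recordId: number;
  prevHash: string;
  recordHash: string;
  eventType: string;
  payloadSha256: string;
}

// ASSUMPTION: the first record links to an all-zero hash.
const GENESIS_HASH = "0".repeat(64);

function sha256Hex(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// ASSUMPTION: recordHash covers prevHash, eventType and payloadSha256.
function computeRecordHash(prevHash: string, eventType: string, payloadSha256: string): string {
  return sha256Hex(`${prevHash}|${eventType}|${payloadSha256}`);
}

function appendRecord(chain: AuditRecord[], eventType: string, payload: string): AuditRecord {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].recordHash : GENESIS_HASH;
  const payloadSha256 = sha256Hex(payload);
  const record: AuditRecord = {
    recordId: chain.length + 1,
    prevHash,
    recordHash: computeRecordHash(prevHash, eventType, payloadSha256),
    eventType,
    payloadSha256,
  };
  chain.push(record);
  return record;
}

// Result mirrors the { ok, firstBreakAt? } fields of cbc.audit.chain_verified.
function verifyChain(chain: AuditRecord[]): { ok: boolean; firstBreakAt?: number } {
  let expectedPrev = GENESIS_HASH;
  for (const record of chain) {
    const expectedHash = computeRecordHash(record.prevHash, record.eventType, record.payloadSha256);
    if (record.prevHash !== expectedPrev || record.recordHash !== expectedHash) {
      return { ok: false, firstBreakAt: record.recordId };
    }
    expectedPrev = record.recordHash;
  }
  return { ok: true };
}
```

In this sketch, altering any stored field of a record changes its recomputed hash, so verification reports the first affected recordId. A verifier like this would set cbc_audit_chain_verifier_status to 1 when ok is false.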

PII policy

The caller-organisation identity is logged (government accountability is the point of this service). No subscriber MSISDN appears in any log — cell broadcast is targeted by area, not by individual. Cell IDs may appear.


3. OpenTelemetry Tracing

The OpenTelemetry SDK (Node) is initialised before NestFactory. Auto-instrumentation covers gRPC, HTTP, Postgres, and NATS. Explicit manual spans are created for:

  • cbc.broadcast.accept — incoming request → persisted row
  • cbc.pki.verify — HSM round-trip
  • cbc.cbs.encode — language × severity → CBS PDU
  • cbc.mno.dispatch — per-MNO fan-out (span per MNO)
  • cbc.mno.ack.wait — per-MNO ack-wait
  • cbc.audit.append — chain append
  • cbc.cbs.cancel — cancellation path

W3C TraceContext is propagated from regulator-portal-service and government clients. Span attributes include broadcast.id, caller.org, mno, and severity; certificate secret material is never attached.
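
For readers unfamiliar with the W3C traceparent header that carries this context, the following sketch parses the version 00 layout (version, trace-id, parent-id, flags). In practice the OpenTelemetry propagator does this; the function and type names here are hypothetical, and the sketch deliberately rejects any version other than 00.

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid version-00 traceparent.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version !== "00") return null; // sketch handles version 00 only
  // All-zero trace-id or parent-id values are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}

// Example header taken from the W3C Trace Context specification.
const exampleTraceParent = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
console.log(exampleTraceParent?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```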

Sampling: head-based 100% for errors; 100% for severity=P0; 10% for severity=P2; 1% for non-broadcast background jobs.
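
The sampling rates above can be expressed as a ratio lookup followed by a deterministic decision on the trace ID. This is a sketch under several assumptions: the rate for severity=P1 is not given in this document and is set to 100% here purely as a placeholder; the error flag is assumed to be known when the decision is made; and the threshold comparison on the first 32 bits of the trace ID is an illustrative technique, not necessarily what the configured OpenTelemetry sampler does.

```typescript
type Severity = "P0" | "P1" | "P2";

interface SamplingContext {
  isError: boolean;
  severity?: Severity; // undefined for non-broadcast background jobs
}

function samplingRatio(ctx: SamplingContext): number {
  if (ctx.isError) return 1.0;
  switch (ctx.severity) {
    case "P0":
      return 1.0;
    case "P1":
      return 1.0; // ASSUMPTION: the P1 rate is not specified in this document
    case "P2":
      return 0.1;
    default:
      return 0.01; // non-broadcast background jobs
  }
}

// Deterministic decision: every service that sees the same trace ID and
// ratio reaches the same answer, so traces are kept or dropped as a whole.
function shouldSample(traceId: string, ratio: number): boolean {
  const first32Bits = parseInt(traceId.slice(0, 8), 16);
  return first32Bits < ratio * 0x100000000;
}

const keepP2 = shouldSample(
  "00000000f2577b34da6a3ce929d0e0e4",
  samplingRatio({ isError: false, severity: "P2" }),
);
console.log(keepP2); // true: the 32-bit prefix 0 is below the 10% threshold
```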


4. Alerting Rules

Alertmanager routes alerts to PagerDuty, with team=government-emergency as the primary responder and team=sre as secondary.

groups:
  - name: cbc-bridge.rules
    rules:
      - alert: CbcBroadcastDispatchFailureCritical
        expr: sum(rate(cbc_mno_dispatch_failed_total{reason!="CIRCUIT_OPEN"}[5m])) by (mno) / sum(rate(cbc_mno_dispatch_total[5m])) by (mno) > 0.25
        for: 2m
        labels: { severity: critical, team: government-emergency }
        annotations:
          summary: "CBC dispatch failing for {{ $labels.mno }}"
          runbook: https://runbooks.ghasi.io/cbc/mno-dispatch-failure

      - alert: CbcBroadcastAllMnoFailed
        # increase() turns the 10m range into an instant vector that sum() can aggregate
        expr: sum(increase(cbc_broadcast_final_status_total{status="FAILED"}[10m])) > 0
        for: 1m
        labels: { severity: critical, team: government-emergency, page: ceo }

      - alert: CbcPkiVerifyFailureSpike
        expr: sum(rate(cbc_pki_signature_verified_total{result="FAILURE"}[5m])) > 0.05
        for: 5m
        labels: { severity: high, team: security }

      - alert: CbcHsmUnavailable
        expr: up{job="cbc-bridge",subsystem="hsm"} == 0 or rate(cbc_hsm_operation_total{result="FAILURE"}[5m]) > 0.1
        for: 2m
        labels: { severity: critical, team: sre }

      - alert: CbcAuditChainBroken
        expr: cbc_audit_chain_verifier_status == 1
        for: 0m
        labels: { severity: critical, team: government-emergency, page: ciso }

      - alert: CbcDrillOverdue
        expr: cbc_drill_overdue_seconds > 604800
        for: 1h
        labels: { severity: high, team: government-emergency }

      - alert: CbcCellDatabaseStale
        expr: cbc_cell_database_last_refresh_seconds > 1209600
        for: 1h
        labels: { severity: medium, team: government-emergency }

      - alert: CbcAuthorisedCallerCertExpiringSoon
        expr: cbc_pki_cert_expiring_days < 14
        for: 1h
        labels: { severity: medium, team: government-emergency }

      - alert: CbcBroadcastAcceptLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(cbc_broadcast_accept_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels: { severity: medium, team: sre }

      - alert: CbcPartialDispatchRateHigh
        expr: sum(rate(cbc_broadcast_final_status_total{status="PARTIAL"}[1h])) / sum(rate(cbc_broadcast_final_status_total[1h])) > 0.1
        for: 30m
        labels: { severity: high, team: government-emergency }

Every alert has a linked runbook in runbooks/cbc/.
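
To make the CbcBroadcastDispatchFailureCritical condition concrete, the sketch below evaluates the same ratio outside Prometheus: failures other than CIRCUIT_OPEN divided by all dispatch attempts for one MNO over a window, compared with 0.25. The type and function names and the sample counts are hypothetical, and the for: 2m hold period is not modelled.

```typescript
type FailureReason = "TIMEOUT" | "CBE_REJECT" | "CIRCUIT_OPEN" | "NETWORK_ERR";

interface MnoWindowCounts {
  mno: string;
  dispatched: number; // increase of cbc_mno_dispatch_total over the window
  failedByReason: Partial<Record<FailureReason, number>>;
}

// CIRCUIT_OPEN is excluded, matching the reason!="CIRCUIT_OPEN" matcher.
const COUNTED_REASONS: FailureReason[] = ["TIMEOUT", "CBE_REJECT", "NETWORK_ERR"];

function dispatchFailureRatio(counts: MnoWindowCounts): number {
  // PromQL would produce NaN for 0/0, which never exceeds the threshold;
  // the sketch returns 0 for the same effect.
  if (counts.dispatched === 0) return 0;
  const failed = COUNTED_REASONS.reduce(
    (sum, reason) => sum + (counts.failedByReason[reason] ?? 0),
    0,
  );
  return failed / counts.dispatched;
}

function breachesDispatchThreshold(counts: MnoWindowCounts, threshold = 0.25): boolean {
  return dispatchFailureRatio(counts) > threshold;
}

// Hypothetical 5-minute window for one MNO.
const exampleWindow: MnoWindowCounts = {
  mno: "mno-a",
  dispatched: 100,
  failedByReason: { TIMEOUT: 20, NETWORK_ERR: 10, CIRCUIT_OPEN: 50 },
};
console.log(breachesDispatchThreshold(exampleWindow)); // true: 30 of 100 counted
```

Excluding CIRCUIT_OPEN from the numerator means that fast-fail rejections from an already-open breaker do not inflate the ratio; only failures from real dispatch attempts count.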


5. Grafana Dashboard Panels

Dashboard cbc-bridge-service.json provides three rows targeting different audiences:

5.1 NOC row (always-on single pane of glass)

  • Tile: current broadcasts in-flight + their state
  • Tile: per-MNO circuit state (red/amber/green)
  • Tile: last drill status + next drill ETA
  • Panel: broadcast final-status distribution (24h)
  • Panel: per-MNO dispatch latency P50/P95/P99 heatmap
  • Panel: HSM latency + error rate

5.2 Regulator row

  • Panel: audit-chain verifier status + last-verified-age
  • Panel: authorised-caller cert expiry (top-10 nearest)
  • Panel: PKI verification success/fail per caller-org (24h)
  • Panel: drill cadence vs. schedule
  • Panel: signed-file generation pipeline status (PDF + package delivery)
  • Panel: per-region broadcast distribution (regulatory geo audit)

5.3 Engineering row

  • Panel: NATS publish lag by subject
  • Panel: adapter availability per MNO/adapter
  • Panel: cell-database age per MNO
  • Panel: Postgres/Redis/HSM pool utilisation
  • Panel: K8s HPA activity (replicas over time)

Dashboard links to the NOC dashboard (EP-ADMDASH-09) and regulator workbench (EP-ADMDASH-10).


6. Runbook Index

| Alert | Runbook |
|---|---|
| CbcBroadcastDispatchFailureCritical | runbooks/cbc/mno-dispatch-failure.md |
| CbcBroadcastAllMnoFailed | runbooks/cbc/all-mno-failed.md (CEO-paging incident) |
| CbcPkiVerifyFailureSpike | runbooks/cbc/pki-verify-spike.md (probing detection) |
| CbcHsmUnavailable | runbooks/cbc/hsm-unavailable.md |
| CbcAuditChainBroken | runbooks/cbc/chain-broken.md |
| CbcDrillOverdue | runbooks/cbc/drill-overdue.md |
| CbcCellDatabaseStale | runbooks/cbc/cell-db-stale.md |
| CbcAuthorisedCallerCertExpiringSoon | runbooks/cbc/caller-cert-renewal.md |
| CbcBroadcastAcceptLatencyHigh | runbooks/cbc/accept-latency.md |
| CbcPartialDispatchRateHigh | runbooks/cbc/partial-rate.md |

Each runbook has: detection signal, hypotheses, immediate mitigations, escalation tree (incl. CEO + Board Secretary if emergency-broadcast-level impact), post-incident review template.


7. SLIs / SLOs

Bound to the platform NFR catalog (EP-PLAT-NB-09). Concrete SLOs:

| SLI | SLO target | Error-budget window |
|---|---|---|
| BroadcastEmergency accept latency (P99) | ≤ 500 ms | 30 d |
| Broadcast.dispatched.v1 emit latency (P95) from accept | ≤ 15 s | 30 d |
| Any-MNO ACK latency (P95) from dispatch | ≤ 30 s | 30 d |
| Broadcast final-status DELIVERED or PARTIAL rate | ≥ 99.9% | 90 d |
| PKI verification success rate (genuine callers) | ≥ 99.99% | 30 d |
| HSM availability | ≥ 99.95% | 30 d |
| Audit chain integrity | 100% (no breaks) | continuous |
| Monthly drill completion | 100% | annual |

Error-budget burn alerts fire at 5% and 25% of monthly budget consumed.
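
As a worked illustration of the burn thresholds, the sketch below computes the fraction of an error budget consumed from an SLO target and event counts, then reports which of the 5% and 25% thresholds have been reached. The function names and the event counts are hypothetical, and treating "fire at" as "greater than or equal to" is an assumption.

```typescript
// Fraction of the error budget consumed: bad events divided by the number
// of bad events the SLO allows for the given total.
function errorBudgetConsumed(sloTarget: number, totalEvents: number, badEvents: number): number {
  const allowedBad = (1 - sloTarget) * totalEvents;
  // A 100% target (for example audit-chain integrity) has no budget at all.
  if (allowedBad === 0) return badEvents > 0 ? Infinity : 0;
  return badEvents / allowedBad;
}

// Thresholds from the text: 5% and 25% of the budget consumed.
const BURN_THRESHOLDS = [0.05, 0.25];

// ASSUMPTION: an alert fires once consumption is at or above a threshold.
function firedBurnThresholds(consumed: number): number[] {
  return BURN_THRESHOLDS.filter((threshold) => consumed >= threshold);
}

// Hypothetical window: 100,000 broadcasts against a 99.9% target allows
// about 100 bad outcomes; 10 bad outcomes consume about 10% of the budget.
const exampleConsumed = errorBudgetConsumed(0.999, 100_000, 10);
console.log(firedBurnThresholds(exampleConsumed)); // [0.05]
```

Under a 100% target any single bad event exhausts the budget immediately, which is consistent with CbcAuditChainBroken firing with no hold period.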


8. Log Retention

| Stream | Hot (Loki) | Cold (S3) |
|---|---|---|
| Broadcast + PKI + audit events | 14 days | 13 months + 7 years object-lock |
| Operational + debug | 14 days | 30 days |

Cold-tier queries run against ClickHouse (per analytics-service EP-ANLYT-02).