cbc-bridge-service — Observability

Version: 1.0
Status: Draft
Owner: Government / Emergency + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, docs/architecture/ADR-0004-national-backbone-resilience.md, services/compliance-engine/OBSERVABILITY.md

Because cbc-bridge-service handles civil emergency alerts with a government-PKI-authenticated request boundary and per-MNO dispatch fan-out, its observability posture is designed for two distinct audiences: (1) the 24×7 NOC that must confirm a broadcast landed, and (2) the regulator/Legal stakeholders who need immutable audit of who initiated what.

1. Prometheus Metrics

All metrics carry the standard labels service="cbc-bridge-service", region, namespace, pod. Additional labels are listed per metric.

1.1 Broadcast-pipeline metrics

Metric	Type	Labels	Description
`cbc_broadcast_requested_total`	Counter	`severity` (P0/P1/P2), `caller_org`, `region`	Incoming broadcast submissions
`cbc_broadcast_accepted_total`	Counter	`severity`, `caller_org`	After PKI + authorised-caller check
`cbc_broadcast_rejected_total`	Counter	`reason` (UNAUTHENTICATED, CALLER_NOT_REGISTERED, INVALID_ARGUMENT, UNAUTHORIZED_SEVERITY, UNAUTHORIZED_REGION)	Broadcast rejected
`cbc_broadcast_accept_seconds`	Histogram	`severity`	Time from request receipt to 202 response; buckets `[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]`
`cbc_broadcast_dispatch_seconds`	Histogram	`mno`, `severity`	Time from accept → MNO CBE dispatch; buckets `[0.1, 0.5, 1, 2, 5, 10, 30, 60]`
`cbc_broadcast_ack_seconds`	Histogram	`mno`, `severity`	Time from dispatch → MNO ACK; buckets `[0.5, 1, 2, 5, 10, 30, 60, 120]`
`cbc_broadcast_final_status_total`	Counter	`status` (DELIVERED/PARTIAL/FAILED/CANCELLED), `severity`	Final verdict distribution
`cbc_broadcast_cancelled_total`	Counter	`reason` (DUAL_CONTROL, RECALLED, EXPIRED)	Cancellations

1.2 Per-MNO dispatch metrics

Metric	Type	Labels	Description
`cbc_mno_dispatch_total`	Counter	`mno`, `adapter` (standard3gpp/ericsson/huawei)	Dispatches attempted
`cbc_mno_dispatch_success_total`	Counter	`mno`, `adapter`	`ACKED`
`cbc_mno_dispatch_failed_total`	Counter	`mno`, `adapter`, `reason` (TIMEOUT, CBE_REJECT, CIRCUIT_OPEN, NETWORK_ERR)	Failures
`cbc_mno_circuit_state`	Gauge	`mno`, `adapter`	0=closed, 1=half-open, 2=open
`cbc_mno_adapter_available`	Gauge	`mno`, `adapter`	0 or 1

1.3 PKI + authorised-caller metrics

Metric	Type	Labels	Description
`cbc_pki_signature_verified_total`	Counter	`result` (SUCCESS/FAILURE), `failure_reason`	Signature outcomes
`cbc_pki_crl_check_total`	Counter	`result`, `ca_subject`	CRL/OCSP
`cbc_pki_cert_expiring_days`	Gauge	`caller_org`, `cert_subject`	Days until caller cert expires
`cbc_authorised_callers_total`	Gauge	`status` (ACTIVE/REVOKED)	Registry size
`cbc_authorised_caller_mismatch_total`	Counter	`reason` (SUBJECT_UNKNOWN, SEVERITY_DENIED, REGION_DENIED)	Caller authorised-but-out-of-scope
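
As an illustration of how `cbc_pki_cert_expiring_days` could be derived, the sketch below computes whole days from a certificate's not-after timestamp. The function name and the choice to round down (so an expired certificate yields a negative value) are assumptions, not taken from the service.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: whole days until the certificate's notAfter instant.
// Rounds down, so a cert that expired 12 hours ago reports -1.
function certExpiringDays(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// Example: a cert valid until 1 May 2026, evaluated on 21 April 2026 (UTC).
const daysLeft = certExpiringDays(
  new Date("2026-05-01T00:00:00Z"),
  new Date("2026-04-21T00:00:00Z"),
);
console.log(daysLeft);
```

Passing `now` explicitly keeps the calculation deterministic and easy to test.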

1.4 Audit + drill metrics

Metric	Type	Labels	Description
`cbc_audit_chain_verifier_status`	Gauge	`verifier_run_id`	0=OK, 1=break detected
`cbc_audit_chain_last_verified_seconds`	Gauge	—	Age since last successful verification
`cbc_audit_rows_written_total`	Counter	`event_type`	Append-only writes
`cbc_drill_scheduled_total`	Counter	`kind` (monthly/ad-hoc)	Drills scheduled
`cbc_drill_completed_total`	Counter	`status` (SUCCESS/PARTIAL/FAILED)	Drills completed
`cbc_drill_overdue_seconds`	Gauge	—	Seconds past next-expected drill window
`cbc_cell_database_last_refresh_seconds`	Gauge	`mno`	Age of cell-tower database per MNO

1.5 HSM + downstream-service metrics

Metric	Type	Labels	Description
`cbc_hsm_operation_total`	Counter	`op` (verify/sign), `result`	HSM calls
`cbc_hsm_operation_seconds`	Histogram	`op`	HSM latency; buckets `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]`
`cbc_nats_publish_total`	Counter	`subject`, `result`	NATS publish outcomes
`cbc_nats_publish_lag_seconds`	Histogram	`subject`	NATS ack latency

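Bucket boundaries are chosen to bracket the latency expected at each stage at national-backbone scale: 1 ms to 1 s for HSM operations, 10 ms to 5 s for accept, 100 ms to 60 s for dispatch, and 500 ms to 120 s for MNO ACK.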
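
To make the bucket semantics concrete, the sketch below shows how a Prometheus-style cumulative histogram counts an observation: every bucket whose upper bound is at or above the value is incremented. It is a self-contained illustration rather than the service's exporter; the `Histogram` shape and function names are hypothetical, and the bounds are those listed for `cbc_broadcast_accept_seconds`.

```typescript
// Upper bounds (le) for cbc_broadcast_accept_seconds, copied from section 1.1.
const ACCEPT_BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5];

// Hypothetical in-memory shape mirroring Prometheus histogram semantics.
interface Histogram {
  bounds: number[]; // ascending upper bounds
  counts: number[]; // cumulative count per bound
  infCount: number; // implicit +Inf bucket, equal to total observations
  sum: number; // sum of all observed values
}

function newHistogram(bounds: number[]): Histogram {
  return { bounds, counts: bounds.map(() => 0), infCount: 0, sum: 0 };
}

// Buckets are cumulative: an observation increments every bucket whose
// upper bound is greater than or equal to the observed value.
function observe(h: Histogram, seconds: number): void {
  h.bounds.forEach((le, i) => {
    if (seconds <= le) h.counts[i] += 1;
  });
  h.infCount += 1;
  h.sum += seconds;
}

const accept = newHistogram(ACCEPT_BUCKETS);
[0.03, 0.4, 7].forEach((s) => observe(accept, s));
console.log(accept.counts, accept.infCount);
```

The 7 s observation lands only in the implicit `+Inf` bucket, so a quantile estimate falling in that range cannot be resolved more finely than "above 5 s".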

2. Structured Log Events

All logs are emitted as Pino JSON with the standard fields (ts, traceId, spanId, service, region, level). Service-specific fields are listed per event class:

2.1 Broadcast lifecycle events

cbc.broadcast.requested: { broadcastId, callerOrg, certSubject, severity, geoTarget.kind, languages[], pkiSigHash }
cbc.broadcast.accepted: as above + acceptedAt, expectedDispatchBy
cbc.broadcast.dispatched: { broadcastId, mno, adapter, cbsMessageIdentifier, pduCount, latencyMs }
cbc.broadcast.acked: { broadcastId, mno, perMno: { status, latencyMs, error? }[] }
cbc.broadcast.final: { broadcastId, finalStatus, perMnoBreakdown }
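
The lifecycle events above could be typed so that a missing or misspelled field fails at compile time. The sketch below covers `cbc.broadcast.dispatched` only. The interface names, the field types, the helper, and the `event` key that carries the event name are all assumptions for illustration; in the service, Pino would add the standard fields itself.

```typescript
// Standard fields listed in section 2.
interface StandardFields {
  ts: string;
  traceId: string;
  spanId: string;
  service: string;
  region: string;
  level: "info" | "warn" | "error";
}

// Fields of cbc.broadcast.dispatched; the TypeScript types are assumptions.
interface BroadcastDispatched {
  broadcastId: string;
  mno: string;
  adapter: "standard3gpp" | "ericsson" | "huawei";
  cbsMessageIdentifier: number;
  pduCount: number;
  latencyMs: number;
}

// Hypothetical helper: merges standard and event fields into one JSON line.
// The "event" key name is an assumption, not taken from the service.
function dispatchedLogLine(std: StandardFields, evt: BroadcastDispatched): string {
  return JSON.stringify({ ...std, event: "cbc.broadcast.dispatched", ...evt });
}

// Illustrative values only.
const line = dispatchedLogLine(
  { ts: "2026-04-21T10:00:00Z", traceId: "t-1", spanId: "s-1",
    service: "cbc-bridge-service", region: "region-a", level: "info" },
  { broadcastId: "b-123", mno: "mno-a", adapter: "standard3gpp",
    cbsMessageIdentifier: 1, pduCount: 2, latencyMs: 850 },
);
console.log(line);
```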

2.2 Security events

cbc.pki.verified: { callerOrg, certSubject, certFingerprintSha256, verifyDurationMs }
cbc.pki.failed: { callerOrg?, certSubject?, reason (BAD_SIG/CERT_EXPIRED/CRL_REVOKED/OCSP_REJECT/REPLAY_NONCE), sourceIp }
cbc.authorisation.denied: { callerOrg, requested (severity, region), granted: false, reason }
cbc.cancel.initiated: { broadcastId, cancelInitiator, cancelApprover, timeDelta }

2.3 Operational events

cbc.mno.adapter.health: { mno, adapter, state (HEALTHY/DEGRADED/UNAVAILABLE), lastOkAt }
cbc.drill.scheduled: { drillId, scheduledFor }
cbc.drill.completed: { drillId, status, reachedMnos, unreachedMnos, reportS3Uri }
cbc.cellDb.refreshed: { mno, rowCount, newRowsDelta, refreshDurationMs }

2.4 Audit events

cbc.audit.appended: { recordId, prevHash, recordHash, eventType, payloadSha256 }
cbc.audit.chain_verified: { from, to, ok, firstBreakAt? }
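
The `prevHash` / `recordHash` fields imply a hash chain, and the verifier result maps naturally onto the `cbc.audit.chain_verified` event. The sketch below shows one way such a check could work. The canonical form that is hashed (`prevHash|eventType|payloadSha256`), the injected `hash` function, and all names are assumptions; a real verifier would pass a SHA-256 implementation and use whatever canonicalisation the service defines.

```typescript
interface AuditRecord {
  recordId: string;
  prevHash: string;
  recordHash: string;
  eventType: string;
  payloadSha256: string;
}

interface ChainResult {
  from: string;
  to: string;
  ok: boolean;
  firstBreakAt?: string;
}

type HashFn = (input: string) => string;

// ASSUMPTION: recordHash covers prevHash, eventType and payloadSha256.
function expectedHash(r: AuditRecord, hash: HashFn): string {
  return hash([r.prevHash, r.eventType, r.payloadSha256].join("|"));
}

// Walks the records in order and reports the first record whose link to its
// predecessor, or whose own hash, does not check out.
function verifyChain(records: AuditRecord[], hash: HashFn): ChainResult {
  if (records.length === 0) throw new Error("no records to verify");
  let firstBreakAt: string | undefined;
  for (let i = 0; i < records.length; i++) {
    const r = records[i];
    const linkOk = i === 0 || r.prevHash === records[i - 1].recordHash;
    const hashOk = r.recordHash === expectedHash(r, hash);
    if (!linkOk || !hashOk) {
      firstBreakAt = r.recordId;
      break;
    }
  }
  const result: ChainResult = {
    from: records[0].recordId,
    to: records[records.length - 1].recordId,
    ok: firstBreakAt === undefined,
  };
  if (firstBreakAt !== undefined) result.firstBreakAt = firstBreakAt;
  return result;
}
```

In this sketch the first record's `prevHash` is not checked against anything; a verifier that runs over a sub-range would need the hash of the record just before `from` to anchor it.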

PII policy

The caller-organisation identity is logged (government accountability is the point of this service). No subscriber MSISDN appears in any log — cell broadcast is targeted by area, not by individual. Cell IDs may appear.

3. OpenTelemetry Tracing

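The OpenTelemetry SDK (Node) is initialised before NestFactory is created, so that auto-instrumentation can patch modules before they are loaded. Auto-instrumentation covers gRPC, HTTP, Postgres, and NATS. Explicit manual spans are created for: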

cbc.broadcast.accept — incoming request → persisted row
cbc.pki.verify — HSM round-trip
cbc.cbs.encode — language × severity → CBS PDU
cbc.mno.dispatch — per-MNO fan-out (span per MNO)
cbc.mno.ack.wait — per-MNO ack-wait
cbc.audit.append — chain append
cbc.cbs.cancel — cancellation path

W3C TraceContext propagation from regulator-portal-service and government clients. Span attributes include broadcast.id, caller.org (but not cert secret material), mno, severity.

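Sampling: 100% for traces that contain errors; 100% for severity=P0; 10% for severity=P2; 1% for non-broadcast background jobs. The severity and job-type rates can be applied head-based at the root span. Retaining every error trace cannot be decided at trace start, because error status is only known once spans have ended, so that rule requires a tail-based decision.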
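
The sampling policy can be written down as a small rate lookup. The sketch below is illustrative only: the names are hypothetical, and the rate for severity=P1 is an assumption because the policy does not state one. Note that `hasError` is only known once a trace has finished, so in practice that branch would sit in a tail-sampling stage; it is folded in here to keep the whole policy in one place.

```typescript
type Severity = "P0" | "P1" | "P2";

interface TraceInfo {
  kind: "broadcast" | "background";
  severity?: Severity;
  hasError: boolean; // only known after the trace completes
}

// Returns the fraction of matching traces to keep.
function sampleRate(t: TraceInfo): number {
  if (t.hasError) return 1.0; // keep every error trace
  if (t.kind === "background") return 0.01; // non-broadcast background jobs
  switch (t.severity) {
    case "P0":
      return 1.0;
    case "P2":
      return 0.1;
    default:
      // ASSUMPTION: no rate is given for P1, so keep everything.
      return 1.0;
  }
}

// The random draw is injectable so that decisions can be tested.
function shouldSample(t: TraceInfo, random: number = Math.random()): boolean {
  return random < sampleRate(t);
}
```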

4. Alerting Rules

Alertmanager routes alerts to PagerDuty, with team=government-emergency as primary and team=sre as secondary.

groups:
- name: cbc-bridge.rules
  rules:
  - alert: CbcBroadcastDispatchFailureCritical
    expr: sum(rate(cbc_mno_dispatch_failed_total{reason!="CIRCUIT_OPEN"}[5m])) by (mno) / sum(rate(cbc_mno_dispatch_total[5m])) by (mno) > 0.25
    for: 2m
    labels: { severity: critical, team: government-emergency }
    annotations:
      summary: "CBC dispatch failing for {{ $labels.mno }}"
      runbook: https://runbooks.ghasi.io/cbc/mno-dispatch-failure

  - alert: CbcBroadcastAllMnoFailed
    expr: sum(increase(cbc_broadcast_final_status_total{status="FAILED"}[10m])) > 0
    for: 1m
    labels: { severity: critical, team: government-emergency, page: ceo }

  - alert: CbcPkiVerifyFailureSpike
    expr: sum(rate(cbc_pki_signature_verified_total{result="FAILURE"}[5m])) > 0.05
    for: 5m
    labels: { severity: high, team: security }

  - alert: CbcHsmUnavailable
    expr: up{job="cbc-bridge",subsystem="hsm"} == 0 or rate(cbc_hsm_operation_total{result="FAILURE"}[5m]) > 0.1
    for: 2m
    labels: { severity: critical, team: sre }

  - alert: CbcAuditChainBroken
    expr: cbc_audit_chain_verifier_status == 1
    for: 0m
    labels: { severity: critical, team: government-emergency, page: ciso }

  - alert: CbcDrillOverdue
    expr: cbc_drill_overdue_seconds > 604800
    for: 1h
    labels: { severity: high, team: government-emergency }

  - alert: CbcCellDatabaseStale
    expr: cbc_cell_database_last_refresh_seconds > 1209600
    for: 1h
    labels: { severity: medium, team: government-emergency }

  - alert: CbcAuthorisedCallerCertExpiringSoon
    expr: cbc_pki_cert_expiring_days < 14
    for: 1h
    labels: { severity: medium, team: government-emergency }

  - alert: CbcBroadcastAcceptLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(cbc_broadcast_accept_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels: { severity: medium, team: sre }

  - alert: CbcPartialDispatchRateHigh
    expr: sum(rate(cbc_broadcast_final_status_total{status="PARTIAL"}[1h])) / sum(rate(cbc_broadcast_final_status_total[1h])) > 0.1
    for: 30m
    labels: { severity: high, team: government-emergency }

Every alert has a linked runbook in runbooks/cbc/.
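
As a worked example of the `CbcBroadcastDispatchFailureCritical` condition, the sketch below recomputes the per-MNO failure ratio from raw counts over one window. Over a shared window, a ratio of rates equals a ratio of increases, so plain counts are enough for illustration. The types and function names are hypothetical, and returning 0 when nothing was dispatched is an assumption made to avoid dividing by zero.

```typescript
type FailureReason = "TIMEOUT" | "CBE_REJECT" | "CIRCUIT_OPEN" | "NETWORK_ERR";

interface MnoWindow {
  mno: string;
  dispatched: number; // increase of cbc_mno_dispatch_total in the window
  failedByReason: Partial<Record<FailureReason, number>>;
}

// CIRCUIT_OPEN is excluded, matching the reason!="CIRCUIT_OPEN" matcher.
const COUNTED_REASONS: FailureReason[] = ["TIMEOUT", "CBE_REJECT", "NETWORK_ERR"];

function dispatchFailureRatio(w: MnoWindow): number {
  if (w.dispatched === 0) return 0; // ASSUMPTION: no traffic means no breach
  const failed = COUNTED_REASONS.reduce(
    (acc, reason) => acc + (w.failedByReason[reason] ?? 0),
    0,
  );
  return failed / w.dispatched;
}

function breachesThreshold(w: MnoWindow, threshold: number = 0.25): boolean {
  return dispatchFailureRatio(w) > threshold;
}

// 8 timeouts + 4 CBE rejects out of 40 dispatches = 0.3, above 0.25.
// The 20 CIRCUIT_OPEN failures do not count towards the ratio.
const sample: MnoWindow = {
  mno: "mno-a",
  dispatched: 40,
  failedByReason: { TIMEOUT: 8, CBE_REJECT: 4, CIRCUIT_OPEN: 20 },
};
console.log(dispatchFailureRatio(sample), breachesThreshold(sample));
```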

5. Grafana Dashboard Panels

Dashboard cbc-bridge-service.json provides three rows targeting different audiences:

5.1 NOC row (always-on SPoG)

Tile: current broadcasts in-flight + their state
Tile: per-MNO circuit state (red/amber/green)
Tile: last drill status + next drill ETA
Panel: broadcast final-status distribution (24h)
Panel: per-MNO dispatch latency P50/P95/P99 heatmap
Panel: HSM latency + error rate

5.2 Regulator / Legal row

Panel: audit-chain verifier status + last-verified-age
Panel: authorised-caller cert expiry (top-10 nearest)
Panel: PKI verification success/fail per caller-org (24h)
Panel: drill cadence vs. schedule
Panel: signed-file generation pipeline status (PDF + package delivery)
Panel: per-region broadcast distribution (regulatory geo audit)

5.3 Engineering row

Panel: NATS publish lag by subject
Panel: adapter availability per MNO/adapter
Panel: cell-database age per MNO
Panel: Postgres/Redis/HSM pool utilisation
Panel: K8s HPA activity (replicas over time)

Dashboard links to the NOC dashboard (EP-ADMDASH-09) and regulator workbench (EP-ADMDASH-10).

6. Runbook Index

Alert	Runbook
CbcBroadcastDispatchFailureCritical	`runbooks/cbc/mno-dispatch-failure.md`
CbcBroadcastAllMnoFailed	`runbooks/cbc/all-mno-failed.md` (CEO-paging incident)
CbcPkiVerifyFailureSpike	`runbooks/cbc/pki-verify-spike.md` (probing detection)
CbcHsmUnavailable	`runbooks/cbc/hsm-unavailable.md`
CbcAuditChainBroken	`runbooks/cbc/chain-broken.md`
CbcDrillOverdue	`runbooks/cbc/drill-overdue.md`
CbcCellDatabaseStale	`runbooks/cbc/cell-db-stale.md`
CbcAuthorisedCallerCertExpiringSoon	`runbooks/cbc/caller-cert-renewal.md`
CbcBroadcastAcceptLatencyHigh	`runbooks/cbc/accept-latency.md`
CbcPartialDispatchRateHigh	`runbooks/cbc/partial-rate.md`

Each runbook has: detection signal, hypotheses, immediate mitigations, escalation tree (incl. CEO + Board Secretary if emergency-broadcast-level impact), post-incident review template.

7. SLIs / SLOs

Bound to the platform NFR catalog (EP-PLAT-NB-09). Concrete SLOs:

SLI	SLO target	Error-budget window
`BroadcastEmergency` accept latency (P99)	≤ 500 ms	30 d
`Broadcast.dispatched.v1` emit latency (P95) from accept	≤ 15 s	30 d
Any-MNO ACK latency (P95) from dispatch	≤ 30 s	30 d
Broadcast final-status `DELIVERED` or `PARTIAL` rate	≥ 99.9%	90 d
PKI verification success rate (genuine callers)	≥ 99.99%	30 d
HSM availability	≥ 99.95%	30 d
Audit chain integrity	100% (no breaks)	continuous
Monthly drill completion	100%	annual

Error-budget burn alerts fire at 5% and 25% of monthly budget consumed.
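
The burn thresholds can be made concrete with a short calculation: the budget is the share of events the SLO allows to fail, and consumption is the number of bad events divided by that allowance. The sketch below is illustrative; the function names are hypothetical and the event counts are made-up inputs, not measurements.

```typescript
// Fraction of the error budget consumed so far in the window.
// sloTarget is a fraction, e.g. 0.999 for 99.9%.
function errorBudgetConsumed(total: number, bad: number, sloTarget: number): number {
  if (total === 0) return 0;
  const allowedBad = total * (1 - sloTarget);
  // A 100% target (e.g. audit chain integrity) has no budget at all.
  if (allowedBad === 0) return bad > 0 ? Infinity : 0;
  return bad / allowedBad;
}

// Returns the burn thresholds that have been reached (5% and 25% by default).
function burnAlertsFired(consumed: number, thresholds: number[] = [0.05, 0.25]): number[] {
  return thresholds.filter((t) => consumed >= t);
}

// Illustrative inputs: 20,000 broadcasts at a 99.9% target allow about 20 bad
// outcomes, so 2 bad outcomes consume roughly 10% of the budget.
const consumed = errorBudgetConsumed(20000, 2, 0.999);
console.log(consumed, burnAlertsFired(consumed));
```

With these inputs only the 5% alert has fired; 6 bad outcomes (about 30% consumed) would trip both.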

8. Log Retention

Stream	Hot (Loki)	Cold (S3)
Broadcast + PKI + audit events	14 d	13 m + 7 y object-lock
Operational + debug	14 d	30 d

Cold-tier queries run against ClickHouse (per analytics-service EP-ANLYT-02).
