cdr-mediation-service — Observability
Version: 1.0
Status: Draft
Owner: Commerce + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, services/compliance-engine/OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md
cdr-mediation-service is the platform's canonical CDR pipeline — a batch-heavy async service whose availability semantics differ from hot-path services (compliance, firewall, consent). Its observability emphasises pipeline lag, export-to-regulator SLA, and chain integrity over request latency.
1. Prometheus Metrics
All metrics carry the common labels service="cdr-mediation-service", region, namespace, and pod, in addition to the per-metric labels listed below.
1.1 CDR ingestion pipeline
| Metric | Type | Labels | Description |
|---|---|---|---|
| cdr_ingest_events_total | Counter | source_stream (sms.dlr.inbound/compliance.audit.v1/billing.events.v1), status (PROCESSED/DEDUPED/REJECTED) | Events read from NATS |
| cdr_ingest_lag_seconds | Histogram | source_stream | Time from NATS publish to CDR-row persisted; buckets [0.1, 1, 5, 10, 30, 60, 300, 600] |
| cdr_records_written_total | Counter | tenant_id, operator_id, direction (MT/MO) | CDR records persisted |
| cdr_duplicate_suppressed_total | Counter | source_stream | Events already in CDR (idempotency) |
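Percentiles over cdr_ingest_lag_seconds are estimated from bucket counts rather than measured exactly, so the bucket boundaries limit the resolution of any lag alert or SLO. The sketch below approximates the linear interpolation that Prometheus's histogram_quantile applies to cumulative buckets. The function name and the sample counts are invented for illustration, and the handling of the +Inf bucket is simplified.

```python
import math

# Upper bounds of the cdr_ingest_lag_seconds buckets from the table above,
# plus the implicit +Inf bucket that every Prometheus histogram carries.
BUCKETS = [0.1, 1, 5, 10, 30, 60, 300, 600, math.inf]


def estimate_quantile(q, cumulative_counts, bounds=BUCKETS):
    """Estimate quantile q from cumulative bucket counts (hypothetical helper)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if len(cumulative_counts) != len(bounds):
        raise ValueError("one cumulative count per bucket bound is required")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, cum) in enumerate(zip(bounds, cumulative_counts)):
        if cum >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bound; report that bound.
                return bounds[i - 1]
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative_counts[i - 1] if i > 0 else 0
            in_bucket = cum - below
            if in_bucket == 0:
                return lower
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / in_bucket
    return bounds[-2]  # not reached when counts are monotonic


# Invented sample: 1000 events, most of them persisted within one second.
sample = [500, 900, 980, 990, 998, 1000, 1000, 1000, 1000]
p95 = estimate_quantile(0.95, sample)
p99 = estimate_quantile(0.99, sample)
```

With this sample the P95 estimate falls inside the 1 to 5 s bucket and the P99 estimate sits at the 10 s boundary. Any percentile that lands in a wide bucket such as 60 to 300 s is only known to within that bucket's width.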
1.2 Rollup pipeline
| Metric | Type | Labels | Description |
|---|---|---|---|
| cdr_rollup_runs_total | Counter | status (SUCCESS/FAILED/PARTIAL) | Hourly rollup executions |
| cdr_rollup_duration_seconds | Histogram | — | Time to roll up one hour window; buckets [5, 15, 30, 60, 120, 300, 600] |
| cdr_rollup_last_success_seconds | Gauge | — | Age since last successful rollup |
| cdr_rollup_rows_processed_total | Counter | hour | Rows aggregated per rollup |
| cdr_rollup_backlog_hours | Gauge | — | Hours behind schedule |
1.3 Chain + archive
| Metric | Type | Labels | Description |
|---|---|---|---|
| cdr_audit_chain_verifier_status | Gauge | — | 0=OK, 1=break |
| cdr_audit_chain_last_verified_seconds | Gauge | — | Age since last successful verifier run |
| cdr_s3_archive_bytes_total | Counter | region | Bytes written to cold tier |
| cdr_s3_archive_last_success_seconds | Gauge | — | Age since last successful archive |
| cdr_hot_retention_oldest_hour | Gauge | — | Age of the oldest hot-tier partition; the CdrHotRetentionOverflow alert compares it against a 35-day threshold |
1.4 Regulator export
| Metric | Type | Labels | Description |
|---|---|---|---|
| cdr_export_jobs_total | Counter | schema_version, destination (ATRA_SFTP/ATRA_HTTPS), status (SUCCESS/ACKED/REJECTED/TIMEOUT) | Export jobs |
| cdr_export_duration_seconds | Histogram | schema_version, destination | Time per export; buckets [10, 30, 60, 300, 600, 1800, 3600] |
| cdr_export_last_ack_seconds | Gauge | destination | Age since last ACK from regulator |
| cdr_export_retries_total | Counter | destination, attempt | Retry attempts |
| cdr_export_queue_depth | Gauge | — | Exports waiting |
| cdr_export_signed_file_size_bytes | Histogram | schema_version | Size distribution |
1.5 Adjustment records
| Metric | Type | Labels | Description |
|---|---|---|---|
| cdr_adjustment_total | Counter | reason (VOID/CORRECT), tenant_id | Adjustments created |
| cdr_adjustment_volume_ratio | Gauge | tenant_id | Adjustments as fraction of tenant's original CDRs (anomaly detection) |
1.6 HSM + downstream
| Metric | Type | Labels | Description |
|---|---|---|---|
| cdr_hsm_sign_total | Counter | result | HSM signing operations |
| cdr_hsm_sign_seconds | Histogram | — | HSM latency |
| cdr_nats_publish_total | Counter | subject, result | NATS emission |
| cdr_clickhouse_insert_total | Counter | status | ClickHouse analytics writes (EP-ANLYT-02) |
| cdr_clickhouse_insert_lag_seconds | Gauge | — | Analytics-tier lag |
2. Structured Log Events
Logs are emitted as Pino JSON. Each event carries the standard fields plus the service-specific fields listed below:
2.1 Ingest
- cdr.ingest.processed: { source, eventId, tenantId, operatorId, direction, ingestLatencyMs }
- cdr.ingest.duplicate: { source, eventId, existingCdrId }
- cdr.ingest.rejected: { source, eventId, reason }
2.2 Rollup
- cdr.rollup.started: { rollupId, hourStart, hourEnd }
- cdr.rollup.completed: { rollupId, rowsProcessed, durationMs }
- cdr.rollup.failed: { rollupId, error, rowsPartial }
2.3 Export
- cdr.export.started: { exportId, schemaVersion, destination, fileCount, totalRows }
- cdr.export.signed: { exportId, signedFileSha256 }
- cdr.export.delivered: { exportId, destination, deliveryDurationMs }
- cdr.export.acked: { exportId, destination, ackTimestamp }
- cdr.export.rejected: { exportId, destination, reason }
2.4 Adjustment + audit
- cdr.adjustment.created: { adjustmentId, originalCdrId, reason, initiator }
- cdr.audit.appended: { recordId, prevHash, recordHash, eventType }
- cdr.audit.verified: { from, to, ok, firstBreakAt? }
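The prevHash and recordHash fields on cdr.audit.appended imply a hash chain, and cdr.audit.verified reports the result of walking it. The sketch below shows one way such a chain could be built and checked. The hash construction (SHA-256 over prevHash plus canonical JSON), the all-zero genesis value, the body field, and the event-type names are assumptions for illustration; the service's actual record format is not specified in this document.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical prevHash used for the first record


def record_hash(prev_hash, record_id, event_type, body):
    """Assumed construction: SHA-256 over prevHash plus canonical JSON."""
    payload = json.dumps(
        {"recordId": record_id, "eventType": event_type, "body": body},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain, record_id, event_type, body):
    """Append a record whose prevHash points at the current chain tip."""
    prev_hash = chain[-1]["recordHash"] if chain else GENESIS
    chain.append({
        "recordId": record_id,
        "eventType": event_type,
        "body": body,
        "prevHash": prev_hash,
        "recordHash": record_hash(prev_hash, record_id, event_type, body),
    })


def verify(chain):
    """Walk the chain and return a dict shaped like cdr.audit.verified."""
    result = {
        "from": chain[0]["recordId"] if chain else None,
        "to": chain[-1]["recordId"] if chain else None,
        "ok": True,
    }
    expected_prev = GENESIS
    for rec in chain:
        recomputed = record_hash(
            rec["prevHash"], rec["recordId"], rec["eventType"], rec["body"]
        )
        if rec["prevHash"] != expected_prev or rec["recordHash"] != recomputed:
            result["ok"] = False
            result["firstBreakAt"] = rec["recordId"]
            break
        expected_prev = rec["recordHash"]
    return result


chain = []
append(chain, "r1", "CDR_WRITTEN", {"rows": 10})      # event types are made up
append(chain, "r2", "ADJUSTMENT_CREATED", {"rows": 1})
append(chain, "r3", "EXPORT_SIGNED", {"rows": 11})
clean = verify(chain)

chain[1]["body"]["rows"] = 99  # simulate tampering with the second record
tampered = verify(chain)
```

In this sketch the ok flag corresponds to cdr_audit_chain_verifier_status: 0 when ok is true, 1 when a break is found.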
PII policy
MSISDNs are hashed at ingest (SHA-256 per consent-ledger pattern); raw MSISDN never appears in logs or CDR body. Tenant IDs and operator IDs are logged (they're not subscriber PII).
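As a minimal sketch of hashing at ingest, the function below normalises an MSISDN to its digits and returns a SHA-256 hex digest. The normalisation rule, the function name, and the optional salt parameter are assumptions; the details of the consent-ledger pattern are not described in this document. The phone number is a made-up example.

```python
import hashlib

ASCII_DIGITS = "0123456789"


def hash_msisdn(msisdn, salt=b""):
    """Return a hex SHA-256 digest of a digit-normalised MSISDN (hypothetical helper)."""
    # Assumed normalisation: drop '+', spaces, and any other non-digit characters.
    digits = "".join(ch for ch in msisdn if ch in ASCII_DIGITS)
    if not digits:
        raise ValueError("MSISDN contains no digits")
    return hashlib.sha256(salt + digits.encode("ascii")).hexdigest()


# Only the digest is placed in the log event; the raw value is discarded.
log_event = {
    "event": "cdr.ingest.processed",
    "msisdnHash": hash_msisdn("+93 70 000 0000"),
}
```

Normalising before hashing keeps the digest stable across formatting variants of the same number, which matters if the hash is used as a join key between CDRs and other ledgers.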
3. OpenTelemetry Tracing
Manual spans:
- cdr.ingest.event: per NATS consume
- cdr.rollup.hour: per rollup run
- cdr.export.build: per export build
- cdr.export.sign: HSM sign span
- cdr.export.deliver: SFTP / HTTPS delivery
- cdr.audit.verify: daily verifier
W3C Trace Context is propagated from dlr-processor and compliance-engine. Sampling: 1% for ingest spans; 100% for rollup, export, and verify spans.
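The per-span sampling rates can be pictured as a deterministic decision on the trace ID, in the spirit of trace-ID ratio sampling. The sketch below is plain Python illustrating the decision logic only; it does not use the OpenTelemetry SDK. The rate table keyed by span name and the default for unlisted spans are assumptions.

```python
# Sampling rates keyed by the manual span names listed above.
SAMPLE_RATES = {
    "cdr.ingest.event": 0.01,
    "cdr.rollup.hour": 1.0,
    "cdr.export.build": 1.0,
    "cdr.export.sign": 1.0,
    "cdr.export.deliver": 1.0,
    "cdr.audit.verify": 1.0,
}

_MASK_64 = (1 << 64) - 1


def should_sample(span_name, trace_id):
    """Decide from the low 64 bits of the trace ID whether to keep a trace."""
    rate = SAMPLE_RATES.get(span_name, 1.0)  # assumption: unlisted spans are kept
    bound = round(rate * (1 << 64))
    return (trace_id & _MASK_64) < bound


# The same trace ID always yields the same decision, so services that share
# the rate table agree without coordinating.
kept = should_sample("cdr.ingest.event", 0x01)
dropped = should_sample("cdr.ingest.event", 1 << 63)
```

Because the decision depends only on the trace ID and the rate, a low-volume but high-value path such as export keeps every trace while the high-volume ingest path keeps roughly one in a hundred.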
4. Alerting Rules
groups:
  - name: cdr-mediation.rules
    rules:
      - alert: CdrIngestLagHigh
        expr: histogram_quantile(0.95, sum(rate(cdr_ingest_lag_seconds_bucket[5m])) by (le, source_stream)) > 30
        for: 10m
        labels: { severity: high, team: commerce }
      - alert: CdrRollupBehind
        expr: cdr_rollup_backlog_hours > 2
        for: 30m
        labels: { severity: high, team: commerce }
      - alert: CdrRollupFailed
        expr: increase(cdr_rollup_runs_total{status="FAILED"}[1h]) > 0
        for: 0m
        labels: { severity: high, team: commerce }
      - alert: CdrExportFailed
        expr: increase(cdr_export_jobs_total{status=~"REJECTED|TIMEOUT"}[1h]) > 0
        for: 0m
        labels: { severity: critical, team: regulator-liaison }
      - alert: CdrExportSlaBreach
        expr: cdr_export_last_ack_seconds > 129600 # 36h (daily + 12h buffer)
        for: 30m
        labels: { severity: critical, team: regulator-liaison, page: legal }
      - alert: CdrHsmUnavailable
        expr: rate(cdr_hsm_sign_total{result="FAILURE"}[5m]) > 0.1
        for: 5m
        labels: { severity: critical, team: sre }
      - alert: CdrChainBroken
        expr: cdr_audit_chain_verifier_status == 1
        for: 0m
        labels: { severity: critical, team: commerce, page: ciso }
      - alert: CdrArchiveStale
        expr: cdr_s3_archive_last_success_seconds > 86400
        for: 1h
        labels: { severity: high, team: sre }
      - alert: CdrHotRetentionOverflow
        expr: cdr_hot_retention_oldest_hour > 35 # > 35 days
        for: 1h
        labels: { severity: high, team: sre }
      - alert: CdrAdjustmentAnomaly
        expr: cdr_adjustment_volume_ratio > 0.05
        for: 1h
        labels: { severity: medium, team: commerce }
      - alert: CdrClickHouseIngestLag
        expr: cdr_clickhouse_insert_lag_seconds > 600
        for: 15m
        labels: { severity: medium, team: data-eng }
5. Grafana Dashboard Panels
cdr-mediation-service.json — three rows:
5.1 Commerce / Revenue Assurance
- Ingest throughput vs. sms.dlr.inbound rate (should match)
- CDR volume per tenant per day
- Adjustment volume per tenant (anomaly indicator)
- Per-operator CDR direction distribution
5.2 Regulator / Export Pipeline
- Today's export status (pipeline stages)
- 30-day export-delivery success rate per destination
- Signed-file size distribution
- ATRA ACK latency trend
5.3 SRE
- Ingest lag by source-stream
- Rollup backlog + per-hour duration
- S3 archive throughput
- HSM latency
- Chain-verifier status + last-run-age
- ClickHouse insert lag
6. Runbook Index
| Alert | Runbook |
|---|---|
| CdrIngestLagHigh | runbooks/cdr/ingest-lag.md |
| CdrRollupBehind / CdrRollupFailed | runbooks/cdr/rollup-recovery.md |
| CdrExportFailed | runbooks/cdr/export-failure.md |
| CdrExportSlaBreach | runbooks/cdr/export-sla-breach.md (regulator-notify) |
| CdrHsmUnavailable | runbooks/cdr/hsm-unavailable.md |
| CdrChainBroken | runbooks/cdr/chain-broken.md |
| CdrArchiveStale | runbooks/cdr/archive-stale.md |
| CdrHotRetentionOverflow | runbooks/cdr/retention-overflow.md |
| CdrAdjustmentAnomaly | runbooks/cdr/adjustment-anomaly.md |
| CdrClickHouseIngestLag | runbooks/cdr/clickhouse-lag.md |
7. SLIs / SLOs
| SLI | SLO target | Window |
|---|---|---|
| Ingest lag (DLR → CDR persisted) | P99 ≤ 10 s | 30 d |
| Hourly rollup completion | Within 30 min of hour boundary, 99% of the time | 30 d |
| Daily export delivery (ATRA ACK) | 100% of days within 36 h | 30 d |
| Chain integrity | 100% (no breaks) | Continuous |
| Hot retention | Oldest hot partition ≤ 35 days | Continuous |
| S3 archive success | 100% of rollups archived within 24 h | 30 d |
| HSM sign availability | ≥ 99.95% | 30 d |
Error-budget burn alerts fire when 5% and 25% of the budget for the window have been consumed.
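To make the burn thresholds concrete, the sketch below computes the fraction of an error budget consumed for a ratio-based SLO such as HSM sign availability (≥ 99.95% over 30 d). The function names and the event counts are invented for illustration; how the service actually derives burn is not specified here.

```python
def budget_consumed(total_events, bad_events, slo_target):
    """Fraction of the error budget used: bad events over allowed bad events."""
    allowed_bad = total_events * (1.0 - slo_target)
    if allowed_bad == 0:
        # A 100% target leaves no budget: any failure exhausts it immediately.
        return 0.0 if bad_events == 0 else float("inf")
    return bad_events / allowed_bad


def crossed_thresholds(consumed, thresholds=(0.05, 0.25)):
    """Return the burn thresholds (5% and 25%) that have been reached."""
    return [t for t in thresholds if consumed >= t]


# Invented example: 2,000,000 sign operations in the window at a 99.95% target
# allows about 1,000 failures.
early = budget_consumed(2_000_000, 60, 0.9995)   # about 6% of the budget
late = budget_consumed(2_000_000, 300, 0.9995)   # about 30% of the budget
```

For the SLOs with a 100% target (export delivery, chain integrity, archive success) the budget is zero, so a percentage-based burn alert has nothing to measure; those are covered by the direct alerts in section 4 instead.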
8. Log Retention
| Stream | Hot (Loki) | Cold (S3) |
|---|---|---|
| Ingest + rollup + export events | 14 d | 30 d |
| Audit append events | 14 d | 7 y (matching CDR retention) |
| Debug / trace | 7 d | — |