cdr-mediation-service — Observability

Version: 1.0
Status: Draft
Owner: Commerce + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, services/compliance-engine/OBSERVABILITY.md, docs/architecture/ADR-0004-national-backbone-resilience.md

cdr-mediation-service is the platform's canonical CDR pipeline — a batch-heavy async service whose availability semantics differ from hot-path services (compliance, firewall, consent). Its observability emphasises pipeline lag, export-to-regulator SLA, and chain integrity over request latency.


1. Prometheus Metrics

All metrics carry the common labels service="cdr-mediation-service", region, namespace, and pod, in addition to the per-metric labels listed below.

1.1 CDR ingestion pipeline

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cdr_ingest_events_total | Counter | source_stream (sms.dlr.inbound / compliance.audit.v1 / billing.events.v1), status (PROCESSED / DEDUPED / REJECTED) | Events read from NATS |
| cdr_ingest_lag_seconds | Histogram | source_stream | Time from NATS publish to CDR row persisted; buckets [0.1, 1, 5, 10, 30, 60, 300, 600] |
| cdr_records_written_total | Counter | tenant_id, operator_id, direction (MT / MO) | CDR records persisted |
| cdr_duplicate_suppressed_total | Counter | source_stream | Events already in CDR (idempotency) |
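
As a rough illustration of how one observation of cdr_ingest_lag_seconds maps onto the bucket boundaries listed above, the sketch below computes a publish-to-persist lag and lists the cumulative buckets it would increment. The function names and the millisecond timestamp inputs are assumptions made for illustration; they are not taken from the service's code.

```typescript
// Bucket upper bounds (seconds) for cdr_ingest_lag_seconds, as listed in the table above.
const INGEST_LAG_BUCKETS: readonly number[] = [0.1, 1, 5, 10, 30, 60, 300, 600];

// Hypothetical helper: lag in seconds between NATS publish and CDR row persisted.
function ingestLagSeconds(publishedAtMs: number, persistedAtMs: number): number {
  // Clamp at zero so clock skew between publisher and consumer never yields a negative lag.
  return Math.max(0, (persistedAtMs - publishedAtMs) / 1000);
}

// Prometheus histogram buckets are cumulative: an observation counts toward every
// bucket whose upper bound (le) is >= the observed value, plus the implicit +Inf bucket.
function bucketsIncremented(lagSeconds: number): string[] {
  const hit = INGEST_LAG_BUCKETS.filter((le) => lagSeconds <= le).map((le) => String(le));
  return [...hit, "+Inf"];
}

const exampleLag = ingestLagSeconds(1000000, 1012500); // 12.5 s between publish and persist
console.log(exampleLag, bucketsIncremented(exampleLag));
```

A 12.5 s lag falls above the 10 s boundary, so it only counts toward the 30 s bucket and larger ones. This is why the 10 s SLO target in section 7 can be read directly off the le="10" bucket.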

1.2 Rollup pipeline

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cdr_rollup_runs_total | Counter | status (SUCCESS / FAILED / PARTIAL) | Hourly rollup executions |
| cdr_rollup_duration_seconds | Histogram | - | Time to roll up one hour window; buckets [5, 15, 30, 60, 120, 300, 600] |
| cdr_rollup_last_success_seconds | Gauge | - | Seconds since the last successful rollup |
| cdr_rollup_rows_processed_total | Counter | hour | Rows aggregated per rollup |
| cdr_rollup_backlog_hours | Gauge | - | Hours behind schedule |
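
The document does not define exactly how cdr_rollup_backlog_hours is derived. One plausible reading, sketched below, counts the hour windows that have already closed but have not yet been rolled up. The function name and the choice to exclude the still-open window are assumptions.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical derivation of "hours behind schedule".
// lastRolledHourStartMs: start of the most recent hour window that was rolled up successfully.
// nowMs: current time. Both are epoch milliseconds, so hour boundaries align with UTC hours.
function rollupBacklogHours(lastRolledHourStartMs: number, nowMs: number): number {
  // The window containing "now" is still open and is not counted as backlog.
  const openWindowStart = Math.floor(nowMs / HOUR_MS) * HOUR_MS;
  // First window that still needs a rollup.
  const nextDueStart = lastRolledHourStartMs + HOUR_MS;
  return Math.max(0, Math.round((openWindowStart - nextDueStart) / HOUR_MS));
}

// The 10:00-11:00 window was the last one rolled up and it is now 13:20 UTC,
// so the 11:00-12:00 and 12:00-13:00 windows are outstanding.
const lastRolled = Date.UTC(2026, 3, 21, 10, 0, 0);
const checkedAt = Date.UTC(2026, 3, 21, 13, 20, 0);
console.log(rollupBacklogHours(lastRolled, checkedAt)); // 2
```

Under this reading, the CdrRollupBehind threshold of 2 would be crossed once a third closed window is waiting.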

1.3 Chain + archive

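| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cdr_audit_chain_verifier_status | Gauge | - | Verifier result: 0 = OK, 1 = chain break |
| cdr_audit_chain_last_verified_seconds | Gauge | - | Seconds since the last successful verifier run |
| cdr_s3_archive_bytes_total | Counter | region | Bytes written to cold tier |
| cdr_s3_archive_last_success_seconds | Gauge | - | Seconds since the last successful archive |
| cdr_hot_retention_oldest_hour | Gauge | - | Age of the oldest hot-tier partition (compared against a threshold in days by the CdrHotRetentionOverflow alert) |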

1.4 Regulator export

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cdr_export_jobs_total | Counter | schema_version, destination (ATRA_SFTP / ATRA_HTTPS), status (SUCCESS / ACKED / REJECTED / TIMEOUT) | Export jobs |
| cdr_export_duration_seconds | Histogram | schema_version, destination | Time per export; buckets [10, 30, 60, 300, 600, 1800, 3600] |
| cdr_export_last_ack_seconds | Gauge | destination | Seconds since the last ACK from the regulator |
| cdr_export_retries_total | Counter | destination, attempt | Retry attempts |
| cdr_export_queue_depth | Gauge | - | Exports waiting |
| cdr_export_signed_file_size_bytes | Histogram | schema_version | Size distribution of signed export files |

1.5 Adjustment records

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cdr_adjustment_total | Counter | reason (VOID / CORRECT), tenant_id | Adjustments created |
| cdr_adjustment_volume_ratio | Gauge | tenant_id | Adjustments as a fraction of the tenant's original CDRs (anomaly detection) |

1.6 HSM + downstream

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cdr_hsm_sign_total | Counter | result | HSM signing operations |
| cdr_hsm_sign_seconds | Histogram | - | HSM signing latency |
| cdr_nats_publish_total | Counter | subject, result | NATS emission |
| cdr_clickhouse_insert_total | Counter | status | ClickHouse analytics writes (EP-ANLYT-02) |
| cdr_clickhouse_insert_lag_seconds | Gauge | - | Analytics-tier lag |

2. Structured Log Events

Logs are emitted as Pino JSON. Each event carries the standard fields plus the service-specific fields listed below.

2.1 Ingest

  • cdr.ingest.processed: { source, eventId, tenantId, operatorId, direction, ingestLatencyMs }
  • cdr.ingest.duplicate: { source, eventId, existingCdrId }
  • cdr.ingest.rejected: { source, eventId, reason }

2.2 Rollup

  • cdr.rollup.started: { rollupId, hourStart, hourEnd }
  • cdr.rollup.completed: { rollupId, rowsProcessed, durationMs }
  • cdr.rollup.failed: { rollupId, error, rowsPartial }

2.3 Export

  • cdr.export.started: { exportId, schemaVersion, destination, fileCount, totalRows }
  • cdr.export.signed: { exportId, signedFileSha256 }
  • cdr.export.delivered: { exportId, destination, deliveryDurationMs }
  • cdr.export.acked: { exportId, destination, ackTimestamp }
  • cdr.export.rejected: { exportId, destination, reason }

2.4 Adjustment + audit

  • cdr.adjustment.created: { adjustmentId, originalCdrId, reason, initiator }
  • cdr.audit.appended: { recordId, prevHash, recordHash, eventType }
  • cdr.audit.verified: { from, to, ok, firstBreakAt? }
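
The prevHash / recordHash fields of cdr.audit.appended and the { from, to, ok, firstBreakAt? } shape of cdr.audit.verified suggest a hash chain that a verifier walks in order. The sketch below shows one minimal way such a check could work. The hashing scheme (SHA-256 over prevHash, recordId, and eventType), the all-zero genesis hash, and the helper names are assumptions; the document does not specify what the service actually hashes.

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  recordId: string;
  prevHash: string;
  recordHash: string;
  eventType: string;
}

interface VerifyResult {
  from: string;
  to: string;
  ok: boolean;
  firstBreakAt?: string;
}

// ASSUMPTION: recordHash = SHA-256(prevHash | recordId | eventType). Illustrative only.
function hashRecord(prevHash: string, recordId: string, eventType: string): string {
  return createHash("sha256").update(`${prevHash}|${recordId}|${eventType}`).digest("hex");
}

// Append a record whose prevHash is the previous record's hash (or a zero genesis hash).
function appendRecord(chain: AuditRecord[], recordId: string, eventType: string): AuditRecord[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].recordHash : "0".repeat(64);
  const recordHash = hashRecord(prevHash, recordId, eventType);
  return [...chain, { recordId, prevHash, recordHash, eventType }];
}

// Walk the chain in order and report the first record whose link or own hash does not match.
function verifyChain(chain: AuditRecord[]): VerifyResult {
  if (chain.length === 0) return { from: "", to: "", ok: true };
  const from = chain[0].recordId;
  const to = chain[chain.length - 1].recordId;
  let expectedPrev = chain[0].prevHash;
  for (const rec of chain) {
    const linked = rec.prevHash === expectedPrev;
    const intact = rec.recordHash === hashRecord(rec.prevHash, rec.recordId, rec.eventType);
    if (!linked || !intact) return { from, to, ok: false, firstBreakAt: rec.recordId };
    expectedPrev = rec.recordHash;
  }
  return { from, to, ok: true };
}

let auditChain: AuditRecord[] = [];
auditChain = appendRecord(auditChain, "rec-1", "cdr.written");
auditChain = appendRecord(auditChain, "rec-2", "cdr.adjusted");
auditChain = appendRecord(auditChain, "rec-3", "cdr.exported");
console.log(verifyChain(auditChain));
```

In this sketch, a result with ok set to false would correspond to cdr_audit_chain_verifier_status = 1 and the CdrChainBroken alert.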

PII policy

MSISDNs are hashed at ingest (SHA-256, following the consent-ledger pattern); the raw MSISDN never appears in logs or in the CDR body. Tenant IDs and operator IDs are logged because they are not subscriber PII.
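
A minimal sketch of hashing at ingest is shown below. The digit-only normalisation, the msisdnHash field name, and the helper names are assumptions. Whether the consent-ledger pattern also applies a salt or key is not stated in this document, so the sketch uses plain SHA-256.

```typescript
import { createHash } from "node:crypto";

// ASSUMPTION: normalise to digits only so "+99 000 000 0001" and "990000000001" hash identically.
function hashMsisdn(rawMsisdn: string): string {
  const digits = rawMsisdn.replace(/\D/g, "");
  return createHash("sha256").update(digits).digest("hex");
}

// Hypothetical CDR body shape: only the hash is stored, never the raw number.
interface CdrBody {
  tenantId: string;
  operatorId: string;
  direction: "MT" | "MO";
  msisdnHash: string;
}

function toCdrBody(tenantId: string, operatorId: string, direction: "MT" | "MO", rawMsisdn: string): CdrBody {
  return { tenantId, operatorId, direction, msisdnHash: hashMsisdn(rawMsisdn) };
}

// Obviously fake number used purely for illustration.
const cdrBody = toCdrBody("tenant-a", "operator-1", "MT", "+99 000 000 0001");
console.log(JSON.stringify(cdrBody));
```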


3. OpenTelemetry Tracing

Manual spans:

  • cdr.ingest.event — per NATS consume
  • cdr.rollup.hour — per rollup run
  • cdr.export.build — per export build
  • cdr.export.sign — HSM sign span
  • cdr.export.deliver — SFTP / HTTPS delivery
  • cdr.audit.verify — daily verifier

W3C TraceContext is propagated from dlr-processor and compliance-engine. Sampling is 1% for ingest spans and 100% for rollup, export, and verify spans.
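
To make the propagation and sampling rules concrete, the sketch below parses a W3C traceparent header and applies a per-span sampling ratio. The header layout (version, 32-hex trace ID, 16-hex parent ID, flags) follows the W3C Trace Context format. The ratio-by-trace-ID decision is a simplified stand-in and not the OpenTelemetry SDK's sampler. Mapping all three cdr.export.* spans to 100% and defaulting unknown spans to 100% are assumptions.

```typescript
interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a traceparent header; returns null for malformed or all-zero identifiers.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Sampling ratios per manual span, taken from the text above (1% ingest, 100% otherwise).
const SAMPLING_RATIOS = new Map<string, number>([
  ["cdr.ingest.event", 0.01],
  ["cdr.rollup.hour", 1],
  ["cdr.export.build", 1],
  ["cdr.export.sign", 1],
  ["cdr.export.deliver", 1],
  ["cdr.audit.verify", 1],
]);

// Simplified deterministic decision: map the last 32 bits of the trace ID to [0, 1)
// and keep the trace if it falls below the ratio. The same trace ID always gives the same answer.
function shouldSample(spanName: string, traceId: string): boolean {
  const ratio = SAMPLING_RATIOS.get(spanName) ?? 1; // ASSUMPTION: unknown spans are kept
  if (ratio >= 1) return true;
  const position = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return position < ratio;
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
if (incoming !== null) {
  console.log(incoming.traceId, incoming.sampled, shouldSample("cdr.rollup.hour", incoming.traceId));
}
```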


4. Alerting Rules

    groups:
      - name: cdr-mediation.rules
        rules:
          - alert: CdrIngestLagHigh
            expr: histogram_quantile(0.95, sum(rate(cdr_ingest_lag_seconds_bucket[5m])) by (le, source_stream)) > 30
            for: 10m
            labels: { severity: high, team: commerce }

          - alert: CdrRollupBehind
            expr: cdr_rollup_backlog_hours > 2
            for: 30m
            labels: { severity: high, team: commerce }

          - alert: CdrRollupFailed
            expr: increase(cdr_rollup_runs_total{status="FAILED"}[1h]) > 0
            for: 0m
            labels: { severity: high, team: commerce }

          - alert: CdrExportFailed
            expr: increase(cdr_export_jobs_total{status=~"REJECTED|TIMEOUT"}[1h]) > 0
            for: 0m
            labels: { severity: critical, team: regulator-liaison }

          - alert: CdrExportSlaBreach
            expr: cdr_export_last_ack_seconds > 129600 # 36h (daily + 12h buffer)
            for: 30m
            labels: { severity: critical, team: regulator-liaison, page: legal }

          - alert: CdrHsmUnavailable
            expr: rate(cdr_hsm_sign_total{result="FAILURE"}[5m]) > 0.1
            for: 5m
            labels: { severity: critical, team: sre }

          - alert: CdrChainBroken
            expr: cdr_audit_chain_verifier_status == 1
            for: 0m
            labels: { severity: critical, team: commerce, page: ciso }

          - alert: CdrArchiveStale
            expr: cdr_s3_archive_last_success_seconds > 86400
            for: 1h
            labels: { severity: high, team: sre }

          - alert: CdrHotRetentionOverflow
            expr: cdr_hot_retention_oldest_hour > 35 # > 35 days
            for: 1h
            labels: { severity: high, team: sre }

          - alert: CdrAdjustmentAnomaly
            expr: cdr_adjustment_volume_ratio > 0.05
            for: 1h
            labels: { severity: medium, team: commerce }

          - alert: CdrClickHouseIngestLag
            expr: cdr_clickhouse_insert_lag_seconds > 600
            for: 15m
            labels: { severity: medium, team: data-eng }
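
The CdrIngestLagHigh expression relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket where the target rank falls. The sketch below is a simplified reimplementation for classic histograms, intended only to show how a P95 above 30 s arises from bucket counts. The bucket values are invented sample data, and edge cases that Prometheus handles (such as missing +Inf buckets or q = 0) are left out.

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count (or rate) of observations <= le
}

// Simplified quantile estimate over cumulative buckets, assuming a +Inf bucket is present.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);
  // If the rank lands in the +Inf bucket, report the highest finite upper bound.
  if (idx === sorted.length - 1) return sorted[sorted.length - 2].le;
  const upper = sorted[idx].le;
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  // Linear interpolation between the bucket's lower and upper bound.
  return lower + (upper - lower) * ((rank - below) / (sorted[idx].count - below));
}

// Invented sample: cumulative counts over the cdr_ingest_lag_seconds bucket bounds.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 10 }, { le: 1, count: 40 }, { le: 5, count: 70 },
  { le: 10, count: 85 }, { le: 30, count: 93 }, { le: 60, count: 98 },
  { le: 300, count: 100 }, { le: 600, count: 100 }, { le: Infinity, count: 100 },
];

const p95 = histogramQuantile(0.95, sampleBuckets);
console.log(p95, p95 > 30); // rank 95 falls in the (30, 60] bucket, so the estimate is about 42 s
```

Because the estimate is interpolated, its precision depends on bucket width: between 30 s and 60 s the reported value can be off by several seconds from the true percentile.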

5. Grafana Dashboard Panels

cdr-mediation-service.json — three rows:

5.1 Commerce / Revenue Assurance

  • Ingest throughput vs. sms.dlr.inbound rate (should match)
  • CDR volume per tenant per day
  • Adjustment volume per tenant (anomaly indicator)
  • Per-operator CDR direction distribution

5.2 Regulator / Export Pipeline

  • Today's export status (pipeline stages)
  • 30-day export-delivery success rate per destination
  • Signed-file size distribution
  • ATRA ACK latency trend

5.3 SRE

  • Ingest lag by source-stream
  • Rollup backlog + per-hour duration
  • S3 archive throughput
  • HSM latency
  • Chain-verifier status + last-run-age
  • ClickHouse insert lag

6. Runbook Index

| Alert | Runbook |
| --- | --- |
| CdrIngestLagHigh | runbooks/cdr/ingest-lag.md |
| CdrRollupBehind / CdrRollupFailed | runbooks/cdr/rollup-recovery.md |
| CdrExportFailed | runbooks/cdr/export-failure.md |
| CdrExportSlaBreach | runbooks/cdr/export-sla-breach.md (regulator-notify) |
| CdrHsmUnavailable | runbooks/cdr/hsm-unavailable.md |
| CdrChainBroken | runbooks/cdr/chain-broken.md |
| CdrArchiveStale | runbooks/cdr/archive-stale.md |
| CdrHotRetentionOverflow | runbooks/cdr/retention-overflow.md |
| CdrAdjustmentAnomaly | runbooks/cdr/adjustment-anomaly.md |
| CdrClickHouseIngestLag | runbooks/cdr/clickhouse-lag.md |

7. SLIs / SLOs

| SLI | SLO target | Window |
| --- | --- | --- |
| Ingest lag (DLR → CDR persisted) | P99 ≤ 10 s | 30 d |
| Hourly rollup completion | Within 30 min of hour boundary, 99% of the time | 30 d |
| Daily export delivery (ATRA ACK) | 100% of days within 36 h | 30 d |
| Chain integrity | 100% (no breaks) | Continuous |
| Hot retention | Oldest hot partition ≤ 35 days | Continuous |
| S3 archive success | 100% of rollups archived within 24 h | 30 d |
| HSM sign availability | ≥ 99.95% | 30 d |

Error-budget burn alerts at 5% and 25% consumed.
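
As a worked example of the burn thresholds, the sketch below computes the fraction of error budget consumed for an event-based SLO such as HSM sign availability and reports which of the 5% and 25% thresholds have been crossed. The function names and the sample counts are illustrative assumptions, and the sketch treats the SLI as good events over total events within the window.

```typescript
// Fraction of the error budget consumed: bad events divided by the bad events the SLO allows.
function budgetConsumed(sloTarget: number, totalEvents: number, badEvents: number): number {
  const allowedBad = totalEvents * (1 - sloTarget);
  // A 100% target (for example chain integrity) has no budget: any bad event exhausts it.
  if (allowedBad === 0) return badEvents > 0 ? Infinity : 0;
  return badEvents / allowedBad;
}

// Burn thresholds from the text above: alert at 5% and 25% of the budget consumed.
const BURN_THRESHOLDS: readonly number[] = [0.05, 0.25];

function thresholdsCrossed(consumed: number): number[] {
  return BURN_THRESHOLDS.filter((t) => consumed >= t);
}

// Invented sample: 200,000 HSM sign operations in the 30 d window, 30 failures, SLO 99.95%.
// The SLO allows about 100 failures, so roughly 30% of the budget is gone.
const hsmConsumed = budgetConsumed(0.9995, 200000, 30);
console.log(hsmConsumed.toFixed(2), thresholdsCrossed(hsmConsumed));
```

For the SLOs with a 100% target there is no budget to burn, so the dedicated alerts in section 4 (for example CdrChainBroken) are the effective signal rather than a burn-rate alert.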


8. Log Retention

| Stream | Hot (Loki) | Cold (S3) |
| --- | --- | --- |
| Ingest + rollup + export events | 14 d | 30 d |
| Audit append events | 14 d | 7 y (matching CDR retention) |
| Debug / trace | 7 d | - |