cdr-mediation-service — Service Readiness

Version: 1.0
Status: Draft
Owner: Commerce + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, FAILURE_MODES.md, TESTING_STRATEGY.md

Production-readiness checklist. The bar emphasises regulator-facing correctness (ATRA handshake and export SLA), hash-chain integrity, HSM availability, and the hot→cold archive pipeline.


1. Code Readiness

| Criterion | Status |
| --- | --- |
| Ingest (NATS → Postgres) with dedup + idempotency | |
| Hourly rollup with distributed lock + idempotency | |
| Daily regulator export (build → HSM sign → SFTP/HTTPS deliver → ACK tracking) | |
| TAP 3.12 encoder + RAP encoder | |
| Schema-adapter abstraction (ATRA_TAP_312_V1 + configurable alternates) | |
| Adjustment (VOID/CORRECT) semantics with audit | |
| Hash-chain (prev_hash, record_hash) with canonical JSON (RFC 8785) | |
| Daily chain verifier | |
| S3 cold archive (13 months hot → S3 with object-lock 7 years) | |
| ClickHouse mirror (analytics-tier) | |
| Admin REST (CDR list, rollup status, export status, adjustment create, audit query) | |
| Chain-break alert + incident playbook | |
| mTLS on service mesh | |
| Idempotency on admin writes | |
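
The hash-chain criterion above (prev_hash, record_hash over canonical JSON) and the daily chain verifier can be illustrated with a minimal sketch. The record shape, the all-zero genesis value, the use of SHA-256, and the function names are assumptions made for illustration. `canonicalize` only sorts object keys and removes whitespace, which follows the spirit of RFC 8785 but has not been checked against the full specification.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical record shape; only prev_hash and record_hash are named in the checklist.
interface ChainedRecord {
  payload: Json;
  prev_hash: string;
  record_hash: string;
}

// Assumed genesis value for the first record in a chain.
const GENESIS_HASH = "0".repeat(64);

// Simplified canonical form: sorted object keys, no insignificant whitespace.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const obj = value as { [key: string]: Json };
  const members = Object.keys(obj)
    .sort()
    .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k] as Json));
  return "{" + members.join(",") + "}";
}

// record_hash = SHA-256(prev_hash || canonical(payload)), hex-encoded (assumed construction).
function hashRecord(prevHash: string, payload: Json): string {
  return createHash("sha256")
    .update(prevHash + canonicalize(payload), "utf8")
    .digest("hex");
}

// Appends a record that links to the current tail of the chain.
function appendRecord(chain: ChainedRecord[], payload: Json): ChainedRecord {
  const last = chain[chain.length - 1];
  const prev_hash = last ? last.record_hash : GENESIS_HASH;
  const record: ChainedRecord = { payload, prev_hash, record_hash: hashRecord(prev_hash, payload) };
  chain.push(record);
  return record;
}

// Returns the index of the first broken record, or -1 when the chain is intact.
function verifyChain(chain: ChainedRecord[]): number {
  let expectedPrev = GENESIS_HASH;
  for (let i = 0; i < chain.length; i++) {
    const rec = chain[i];
    if (!rec || rec.prev_hash !== expectedPrev || rec.record_hash !== hashRecord(rec.prev_hash, rec.payload)) {
      return i;
    }
    expectedPrev = rec.record_hash;
  }
  return -1;
}
```

Because each record_hash covers the previous hash, changing any stored payload makes `verifyChain` report the index of the altered record, which is the condition a chain-break alert would fire on.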

2. Testing Readiness

| Criterion | Target | Status |
| --- | --- | --- |
| Unit coverage | ≥ 90% line (domain), ≥ 80% branch | |
| TAP 3.12 + RAP encoder golden tests | ≥ 25 each | |
| Hash-chain tests | ≥ 15 | |
| Adjustment semantics tests | ≥ 20 | |
| Rollup idempotency tests | ≥ 15 | |
| Property-based tests (fast-check) | ≥ 10 × 500 runs | |
| Integration: NATS → PG → rollup → S3 → SFTP happy path | Passed | |
| Integration: ATRA mock ACK/reject/timeout | Passed | |
| Integration: HSM sign with softhsm2 | Passed | |
| Integration: adjustment + audit append + chain verify | Passed | |
| Integration: cold-tier archive + restore | Passed | |
| Contract with dlr-processor + compliance-engine + billing-service | Passed | |
| E2E: 10 k DLR → daily export → ATRA ACK | Passed | |
| Chaos: Postgres out → ingest queues; no CDR loss on recovery | Passed | |
| Chaos: HSM out → exports queue | Passed | |
| Chaos: ATRA unreachable → retries + manual fallback | Passed | |
| Chaos: S3 out → hot retention extends; manual intervention | Passed | |
| Security: chain tamper detected | Passed | |
| Security: UPDATE/DELETE rejected on CDR + audit | Passed | |
| Security: signed-file tamper detected at ATRA | Passed | |
| Load: 500 k DLR/h sustained 1 h, ingest lag < 30 s P99 | Passed | |
| Load: 10 M-row export builds + delivers within 30 min | Passed | |
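
As an illustration of what the rollup idempotency tests above are meant to guard, the following sketch recomputes an hourly bucket from its source CDRs and upserts the result by key, so running the same job twice leaves the stored rollup unchanged. The `Cdr` and `Rollup` shapes, the `tenantId|hour` key, and the in-memory `Map` standing in for Postgres are all hypothetical.

```typescript
// Hypothetical shapes; the real CDR and rollup schemas are not given in this document.
interface Cdr {
  id: string;
  tenantId: string;
  eventTime: string; // assumed UTC ISO-8601, e.g. "2026-04-21T10:15:00Z"
  units: number;
}

interface Rollup {
  tenantId: string;
  hour: string;
  cdrCount: number;
  totalUnits: number;
}

// Truncates a UTC ISO-8601 timestamp to the start of its hour.
function hourBucket(isoTime: string): string {
  return isoTime.slice(0, 13) + ":00:00Z";
}

// Recomputes the rollup for one hour from source rows and upserts it by key.
// Duplicate CDR ids are counted once, so redelivered input does not inflate totals.
function rollupHour(cdrs: Cdr[], hour: string, store: Map<string, Rollup>): void {
  const byTenant = new Map<string, Rollup>();
  const seen = new Set<string>();
  for (const cdr of cdrs) {
    if (hourBucket(cdr.eventTime) !== hour || seen.has(cdr.id)) continue;
    seen.add(cdr.id);
    const r = byTenant.get(cdr.tenantId) ?? { tenantId: cdr.tenantId, hour, cdrCount: 0, totalUnits: 0 };
    r.cdrCount += 1;
    r.totalUnits += cdr.units;
    byTenant.set(cdr.tenantId, r);
  }
  // Upsert: replacing the stored row (rather than incrementing it) is what makes a re-run a no-op.
  byTenant.forEach((r, tenantId) => store.set(tenantId + "|" + hour, r));
}
```

A test in this style would call `rollupHour` twice with the same input and assert that the stored rows are identical after each run.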

3. Observability Readiness

| Criterion | Status |
| --- | --- |
| Prometheus metrics all emitting (OBSERVABILITY §1) | |
| Grafana dashboard deployed (Commerce + Regulator + SRE rows) | |
| All alerts configured with runbook links | |
| Structured JSON logs with MSISDN hashing | |
| OTel trace propagation verified (dlr → cdr → regulator-portal) | |
| SIEM forwarding of cdr.audit.v1 verified | |
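
One possible way to satisfy the "structured JSON logs with MSISDN hashing" criterion is a keyed hash applied before the log line is serialised. The sketch below assumes HMAC-SHA256 with a secret key supplied by the caller; the field names, the normalisation rule, and both function names are hypothetical and not taken from the service.

```typescript
import { createHmac } from "node:crypto";

// Keyed hash of a normalised MSISDN, so the raw number never reaches the log pipeline.
// HMAC-SHA256 and the digits-only normalisation are assumptions for this sketch.
function hashMsisdn(msisdn: string, key: string): string {
  const normalized = msisdn.replace(/[^0-9]/g, ""); // drop "+", spaces, dashes
  return createHmac("sha256", key).update(normalized, "utf8").digest("hex");
}

// Builds one structured log line that carries only the hashed identifier.
function logLine(event: string, msisdn: string, key: string): string {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level: "info",
    event,
    msisdn_hash: hashMsisdn(msisdn, key),
  });
}
```

A keyed hash is used here rather than a plain digest because the MSISDN space is small enough to enumerate; without a secret key, hashed values could be reversed by brute force.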

Alerts Configured

  • CdrIngestLagHigh
  • CdrRollupBehind
  • CdrRollupFailed
  • CdrExportFailed
  • CdrExportSlaBreach (regulator-notify)
  • CdrHsmUnavailable
  • CdrChainBroken (CISO-paging)
  • CdrArchiveStale
  • CdrHotRetentionOverflow
  • CdrAdjustmentAnomaly
  • CdrClickHouseIngestLag

4. Security Readiness

| Criterion | Status |
| --- | --- |
| mTLS on service mesh; SPIRE SVIDs | |
| NetworkPolicy restricts egress to PG + Redis + NATS + S3 + CH + HSM + ATRA CIDRs | |
| ATRA SFTP key-auth (no password) | |
| ATRA HTTPS mTLS if supported by ATRA endpoint | |
| HSM signing key provisioned; dual-control on key rotation | |
| Postgres UPDATE/DELETE trigger rejects mutation on CDR + audit | |
| S3 bucket object-lock 7 years; bucket policy change dual-control | |
| Signed-file tamper detected at ATRA verified | |
| Pen test against REST admin | |
| Security team sign-off | |

5. Operational Readiness

| Criterion | Status |
| --- | --- |
| 3 Deployments (ingest, batch, exporter) reviewed | |
| HPA on ingest (lag-driven) | |
| PDB per Deployment | |
| Rolling update: zero-drop on ingest at 100 k DLR/h | |
| Graceful shutdown: batch finishes current job before restart | |
| Postgres conn pool sized (pgbouncer) | |
| Redis conn pool sized | |
| CronJobs for rollup, archive, verifier, daily export | |
| ClickHouse mirror operational | |
| Runbook set (OBSERVABILITY.md §6) complete | |
| On-call: Commerce primary; SRE secondary; Regulator Liaison exec | |
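
For reviewers of the lag-driven HPA criterion above, the sketch below shows the standard Kubernetes scaling rule, desired = ceil(current × metric / target), clamped to the configured bounds. Using ingest lag in seconds as the metric and the function name are assumptions; the real autoscaler also applies a tolerance band and stabilisation windows, which are omitted here.

```typescript
// Replica count the HPA rule would request for a given observed lag.
// The lag metric and its target are hypothetical inputs for this sketch.
function desiredReplicas(
  currentReplicas: number,
  lagSeconds: number,
  targetLagSeconds: number,
  minReplicas: number,
  maxReplicas: number,
): number {
  const raw = Math.ceil(currentReplicas * (lagSeconds / targetLagSeconds));
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}
```

For example, 4 replicas observing twice the target lag would scale to 8, subject to the maximum.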

6. Documentation Readiness

All 16 SERVICE_TEMPLATE docs at "Complete". Plus:

  • TAP 3.12 field-mapping reference sheet (Ghasi-canonical → TAP field)
  • RAP error-record format reference
  • ATRA-specific schema notes (any Afghan-regulator deviations from GSMA baseline)
  • Runbooks per alert
  • Operator handbook for adjustment reviewers

7. Compliance / Regulatory Readiness

| Criterion | Status |
| --- | --- |
| ATRA MoU for CDR submission | |
| ATRA schema dry-run passed (T-7d) | |
| SFTP credentials exchanged with ATRA (dual-control at both ends) | |
| 7-year retention configured (S3 object-lock governance mode) | |
| Audit retention (13 months hot + 7 years cold) tested via restore drill | |
| Revenue-assurance reconciliation with billing-service (EP-BILL-09) verified | |
| Signed-file format approved (PKCS#7 detached signature on ZIP) | |

8. Go/No-Go Criteria Summary

Production deployment is GO when:

  • All §1 Code Readiness complete.
  • Coverage targets met (§2).
  • Load + chaos tests pass.
  • ATRA dry-run exports delivered + ACKed (3 consecutive daily runs in staging).
  • 14-day shadow mode completed: CDRs generated + retained but no ATRA export.
  • Chain verifier: 14 days of clean runs in staging.
  • Regulator Liaison signs off on schema + delivery.
  • Legal signs off on retention policy + adjustment liability.
  • Rollback plan validated in staging.

9. Post-Launch Review

Within 30 days:

  • Ingest lag SLO attainment.
  • Export ACK SLA attainment (100% within 36 h).
  • Chain integrity (0 breaks).
  • Adjustment volume as % of total CDRs (anomaly check).
  • Revenue-assurance reconciliation delta (billing vs. CDR — target < 0.1%).
  • ClickHouse ingest lag trend.
  • Cost analysis: Postgres growth, S3 growth, HSM signing counts, ATRA egress.
  • Restore-drill time-to-recover for cold-tier query.
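
Two of the review metrics above are simple ratios, and a small sketch makes the thresholds concrete. The definitions used here are assumptions: the reconciliation delta is taken as the absolute difference relative to the billing total, and ACK attainment as the share of exports acknowledged within the SLA window, with `null` meaning no ACK was received.

```typescript
// Relative difference between billing and CDR totals, in percent (assumed definition).
function reconciliationDeltaPct(billingTotal: number, cdrTotal: number): number {
  if (billingTotal === 0) {
    return cdrTotal === 0 ? 0 : Infinity;
  }
  return (Math.abs(billingTotal - cdrTotal) * 100) / billingTotal;
}

// Share of exports ACKed within the SLA window, in percent.
// Each entry is the ACK delay in hours, or null when no ACK arrived.
function ackSlaAttainmentPct(ackDelaysHours: Array<number | null>, slaHours: number = 36): number {
  if (ackDelaysHours.length === 0) {
    return 100; // no exports in the period; treated as vacuously met in this sketch
  }
  const within = ackDelaysHours.filter((d) => d !== null && d <= slaHours).length;
  return (within * 100) / ackDelaysHours.length;
}
```

Under these definitions, a billing total of 1,000,000 against a CDR total of 999,500 gives a delta of about 0.05%, which is inside the 0.1% target, and any export with a missing or late ACK pulls attainment below the 100% goal.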

10. Phased Rollout

| Phase | Duration | Behaviour | Exit criteria |
| --- | --- | --- | --- |
| P0: Pre-migration | 30 d | ATRA engagement; schema dry-run; SFTP handshake; HSM provisioning | MoU signed; dry-run exports ACKed |
| P1: Shadow | 30 d | Generate CDRs; retain locally; no ATRA export | Schema validated; volume matches forecast |
| P2: Export Live | 30 d | Daily ATRA exports begin; observation mode | 3 consecutive daily exports ACKed |
| P3: Full Production | Ongoing | Adjustments live; tenant-facing CDR queries (via analytics); ongoing ATRA partnership | Steady state |

Rollback flags: CDR_EXPORT_ENABLED, CDR_ADJUSTMENT_ENABLED, CDR_CHAIN_VERIFY_FAIL_FAST.
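
How the rollback flags are delivered is not stated here. As a minimal sketch, assuming they arrive as environment variables with boolean string values, they could be parsed as follows. The defaults (exports and adjustments off, fail-fast on) and the accepted spellings are illustrative assumptions, not documented behaviour.

```typescript
type Env = Record<string, string | undefined>;

interface RollbackFlags {
  exportEnabled: boolean;
  adjustmentEnabled: boolean;
  chainVerifyFailFast: boolean;
}

// Accepts "true"/"1" and "false"/"0"; anything else falls back to the default.
function parseBool(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined) return fallback;
  const v = raw.trim().toLowerCase();
  if (v === "true" || v === "1") return true;
  if (v === "false" || v === "0") return false;
  return fallback;
}

// Reads the three flags named in this document; defaults are assumed, not specified.
function readRollbackFlags(env: Env): RollbackFlags {
  return {
    exportEnabled: parseBool(env["CDR_EXPORT_ENABLED"], false),
    adjustmentEnabled: parseBool(env["CDR_ADJUSTMENT_ENABLED"], false),
    chainVerifyFailFast: parseBool(env["CDR_CHAIN_VERIFY_FAIL_FAST"], true),
  };
}
```

In a Node process this would typically be called as `readRollbackFlags(process.env)`; taking the environment as a parameter keeps the parsing easy to test.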