cdr-mediation-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Commerce + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, SECURITY_MODEL.md

Catalog of how cdr-mediation-service fails. Operating principle: durability over latency — it is acceptable for exports to be delayed; it is not acceptable for CDR rows to be lost or the hash chain to break.


1. Operating Principle

  • CDR ingest: fail-loud but queue (NATS retries; rows only written when durable).
  • Rollup / archive / export: batch jobs — retries tolerated; alerts on prolonged failure.
  • HSM: fail-closed on export signing (no unsigned regulator files).
  • Audit chain break: critical — regulator-defensibility claim at stake.
  • ATRA unreachable: queue exports; manual-delivery fallback runbook.

2. Failure Mode Summary

| # | Name | Class | Detection | User-visible impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | NATS ingest consumer lag > 5 min | Infra | 5 min | CDR generation behind; billing-downstream may be stale | runbooks/cdr/ingest-lag.md |
| FM-02 | Postgres unavailable (ingest write-path) | Infra | < 30 s | Ingest stalls; NATS queues up (7 d retention) | runbooks/cdr/pg-out.md |
| FM-03 | S3 archive write fails | Infra | < 5 min | Hot retention stretches; manual intervention if sustained | runbooks/cdr/archive-fail.md |
| FM-04 | Hourly rollup cron fails | Ops | 15 min | Gap in aggregation; billing uses last-good aggregate | runbooks/cdr/rollup-fail.md |
| FM-05 | HSM unavailable | Dependency | < 30 s | Export signing pauses; queue fills | runbooks/cdr/hsm-out.md |
| FM-06 | ATRA SFTP/HTTPS unreachable | Dependency | Per-attempt | Daily export queued; manual delivery runbook if > 36 h | runbooks/cdr/atra-unreachable.md |
| FM-07 | ATRA rejects export (schema mismatch) | Dependency | < 60 s after delivery | Export marked REJECTED; adapter rollback | runbooks/cdr/export-rejected.md |
| FM-08 | Hash chain break detected | Correctness | 24 h (daily verifier) | Regulator-defensibility lost for affected period | runbooks/cdr/chain-broken.md |
| FM-09 | Adjustment-record race | Concurrency | < 5 s | Atomic state ensures only one winner | runbooks/cdr/adjustment-race.md |
| FM-10 | Schema-version mismatch during ATRA transition | Integration | Per-delivery | Adapter fallback to previous version | runbooks/cdr/schema-transition.md |
| FM-11 | Chain verifier detects tamper | Security | 24 h | Critical security incident | runbooks/cdr/tamper-detected.md |
| FM-12 | Cold archive S3 bucket policy change | Ops | < 5 min (weekly scan) | Archive write fails; alert | runbooks/cdr/s3-policy-drift.md |
| FM-13 | Duplicate CDR from NATS redelivery | Idempotency | < 1 s | Deduplicated; no impact | runbooks/cdr/dedup-working.md (informational) |
| FM-14 | ClickHouse ingest lag > 10 min | Analytics | 15 min | Analytics stale; hot-path unaffected | runbooks/cdr/clickhouse-lag.md |
| FM-15 | Region partition | Infra | < 1 min | Region-local operation continues; cross-region archive delayed | runbooks/cdr/region-split.md |

3. Detailed Failure Modes

FM-01 — NATS ingest consumer lag

Scenario. Consumer can't keep up (e.g., Postgres slow, CDR surge).

Impact. CDR-generation lag; downstream (billing revenue-assurance, regulator export) delayed.

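Mitigation. A Horizontal Pod Autoscaler (HPA) keyed on cdr_nats_consumer_lag scales the consumer out to 12 replicas. The Postgres connection pool is tuned for this. If lag exceeds 30 min, page the Commerce team: sustained lag usually indicates an upstream anomaly (for example, a surge of DLRs).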

Recovery. Auto-scale drains queue. No data loss (JetStream 7 d retention).


FM-02 — Postgres unavailable

Scenario. Primary + replica both unreachable.

Impact. Ingest stalls and NATS queues incoming events. Batch jobs skip the affected run. An export that runs during the outage may be built from a stale cache.

Mitigation. Postgres HA with auto-failover (≤ 30 s). NATS retention holds events for 7 d. Batch jobs idempotent: re-run catches up.

Recovery. DB recovery → ingest drains queue. Rollups re-run for missed hours.


FM-03 — S3 archive write fails

Scenario. S3 / MinIO unreachable or bucket-policy change blocks writes.

Impact. Hot-tier retention stretches past 30 d; if sustained > 7 d, hot partition count grows beyond DB-friendly limits.

Mitigation. Archive cron retries with exponential backoff. Alert at 24 h. Manual-intervention runbook if S3 credentials / policy issue. Cross-region replication alternative.

Recovery. Archive worker drains backlog once S3 recovered.
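
The retry behaviour described above can be sketched as a capped exponential backoff schedule. This is a minimal illustration only: the function name, the 60 s base delay, and the 1 h cap are assumptions, not values taken from the archive cron.

```python
import random


def backoff_delays(base_s, cap_s, attempts, jitter=False):
    """Return the wait before each retry: base_s * 2**n, capped at cap_s.

    All parameters are hypothetical; the real archive cron may use
    different values or a different schedule.
    """
    delays = []
    for n in range(attempts):
        delay = min(cap_s, base_s * (2 ** n))
        if jitter:
            # Full jitter spreads retries so workers do not retry in lockstep.
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays


# Eight attempts starting at 60 s, never waiting more than 1 h between tries.
print(backoff_delays(base_s=60, cap_s=3600, attempts=8))
```

Capping the delay keeps the worker probing S3 at a steady interval during a long outage, so the backlog starts draining soon after the bucket becomes writable again.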


FM-04 — Hourly rollup fails

Scenario. Rollup CronJob exits non-zero (OOM, DB error).

Impact. That hour missing from aggregates until retry.

Mitigation. CronJob backoffLimit: 3. Manual re-trigger available. Rollup idempotent (identical result on re-run).

Recovery. Successful re-run backfills aggregate.
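
As a rough sketch of why an idempotent rollup makes backfill safe: if each hourly aggregate is derived only from the source rows in that hour, a re-run produces the identical result, and finding the hours to backfill is a simple gap scan. The function names and the record shape (`ts`, `units`) are hypothetical and not taken from the service.

```python
from datetime import datetime, timedelta, timezone


def missing_hours(completed, start, end):
    """Hour buckets in [start, end) that have no completed rollup."""
    hours = []
    cursor = start.replace(minute=0, second=0, microsecond=0)
    while cursor < end:
        if cursor not in completed:
            hours.append(cursor)
        cursor += timedelta(hours=1)
    return hours


def rollup_hour(records, hour):
    """Aggregate one hour from source rows only, so re-runs are identical."""
    end = hour + timedelta(hours=1)
    in_bucket = [r for r in records if hour <= r["ts"] < end]
    return {
        "hour": hour,
        "count": len(in_bucket),
        "units": sum(r["units"] for r in in_bucket),
    }


# Usage: find the gap left by a failed run, then recompute it.
h10 = datetime(2026, 4, 21, 10, tzinfo=timezone.utc)
records = [{"ts": h10 + timedelta(hours=1, minutes=5), "units": 4}]
for hour in missing_hours({h10}, h10, h10 + timedelta(hours=2)):
    print(rollup_hour(records, hour))
```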


FM-05 — HSM unavailable

Scenario. HSM cluster outage.

Impact. Export signing blocked; exports queue in cdr.exports with state AWAITING_SIGN.

Mitigation. HSM HA with regional quorum. Export cron detects HSM outage and defers. Alert fires.

Recovery. HSM recovery → queued exports signed + delivered.
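
The fail-closed behaviour can be illustrated with a short sketch. Only the AWAITING_SIGN state comes from this document; the `Export` class, the `SIGNED` state, the `HsmUnavailable` exception, and the `sign` callable are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class HsmUnavailable(Exception):
    """Hypothetical error raised when the HSM cannot be reached."""


@dataclass
class Export:
    export_id: str
    payload: bytes
    state: str = "AWAITING_SIGN"
    signature: Optional[bytes] = None


def sign_pending(exports: List[Export], sign: Callable[[bytes], bytes]) -> List[Export]:
    """Sign queued exports; on HSM failure leave the remainder queued."""
    for export in exports:
        if export.state != "AWAITING_SIGN":
            continue
        try:
            export.signature = sign(export.payload)
        except HsmUnavailable:
            # Fail closed: never fall back to an unsigned file. Stop here and
            # let the next cron run retry the exports still AWAITING_SIGN.
            break
        export.state = "SIGNED"  # hypothetical follow-on state
    return [e for e in exports if e.state == "SIGNED"]
```

Only exports returned by `sign_pending` would be eligible for delivery; everything else stays in the queue until the HSM recovers.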


FM-06 — ATRA SFTP/HTTPS unreachable

Scenario. ATRA endpoint down or network partition.

Impact. Daily export queued. If sustained > 36 h, SLA breach.

Mitigation. Retry 3× with exponential backoff. If the outage persists, follow the manual-delivery runbook (hand the file to the ATRA liaison on USB; the exact procedure is still open, see FM-OPEN-02). Regulator Liaison is notified after 6 h.

Recovery. Automatic on ATRA recovery. Manual-delivery requires ATRA MoU update.
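
The escalation thresholds above can be expressed as a small decision function. This is a sketch under stated assumptions: the action names are hypothetical, and the 36 h limit is the proposed value still tracked under FM-OPEN-01.

```python
from datetime import timedelta

NOTIFY_LIAISON_AFTER = timedelta(hours=6)   # from the mitigation above
SLA_BREACH_AFTER = timedelta(hours=36)      # proposed value, see FM-OPEN-01


def escalation(outage: timedelta) -> str:
    """Map how long ATRA has been unreachable to the next action."""
    if outage > SLA_BREACH_AFTER:
        return "MANUAL_DELIVERY"   # hypothetical label: follow the runbook
    if outage > NOTIFY_LIAISON_AFTER:
        return "NOTIFY_LIAISON"    # hypothetical label
    return "RETRY"                 # keep the export queued and retry


print(escalation(timedelta(hours=2)))
print(escalation(timedelta(hours=12)))
print(escalation(timedelta(hours=40)))
```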


FM-07 — ATRA rejects export

Scenario. ATRA returns "schema mismatch" or content-validation error.

Impact. Export marked REJECTED. Regulator requires corrected re-submission.

Mitigation. Error details captured in cdr.export_delivery_log. Adapter rollback to previous schema version. Regulator Liaison escalation to diagnose.

Recovery. Adapter fix → re-run export → ACKed.


FM-08 — Hash chain break

Scenario. Daily verifier detects record_hash ≠ sha256(canonical(payload) || prev_hash).

Impact. Regulator-defensibility claim compromised.

Mitigation. Investigate (canonicalisation bug, race, tamper). Quarantine affected partition. Regulator notified within 24 h if audit already exported.

Recovery. Root-cause + fix + forward partition restart. Post-mortem + code review within 72 h.
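
To make the verifier's check concrete, here is a minimal sketch of the relation record_hash = sha256(canonical(payload) || prev_hash). The canonical form (JSON with sorted keys and no whitespace), the all-zero genesis hash, and the record layout are assumptions; the service's actual canonicalisation may differ.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed prev_hash for the first record


def canonical(payload):
    # Assumption: canonical form is compact JSON with sorted keys.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_hash(payload, prev_hash):
    return hashlib.sha256(canonical(payload) + prev_hash).digest()


def build_chain(payloads, genesis=GENESIS):
    prev, chain = genesis, []
    for payload in payloads:
        digest = record_hash(payload, prev)
        chain.append({"payload": payload, "record_hash": digest})
        prev = digest
    return chain


def first_break(records, genesis=GENESIS):
    """Index of the first record whose stored hash does not match, else None."""
    prev = genesis
    for i, rec in enumerate(records):
        if rec["record_hash"] != record_hash(rec["payload"], prev):
            return i
        prev = rec["record_hash"]
    return None


chain = build_chain([{"cdr": 1}, {"cdr": 2}, {"cdr": 3}])
print(first_break(chain))       # intact chain
chain[1]["payload"]["cdr"] = 99
print(first_break(chain))       # altered payload at index 1
```

The index returned by `first_break` marks where quarantine would start. A non-deterministic `canonical` (for example, unsorted keys) produces the same symptom as tampering, which is why a canonicalisation bug is among the first causes to rule out.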


FM-09 — Adjustment race

Scenario. Two simultaneous VOID adjustments on same CDR.

Impact. Potential duplicate adjustment.

Mitigation. Postgres row-level FOR UPDATE ensures only one commits; second rejected with INVALID_STATE.

Recovery. No action needed.
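
The single-winner guarantee can be illustrated without a database. In the sketch below a `threading.Lock` stands in for the row lock that `SELECT ... FOR UPDATE` takes in Postgres; the `CdrRow` class and the `ACTIVE` / `VOIDED` state names are hypothetical.

```python
import threading


class InvalidState(Exception):
    pass


class CdrRow:
    def __init__(self):
        self.state = "ACTIVE"
        self._lock = threading.Lock()  # stand-in for the Postgres row lock

    def void(self):
        # Check and transition happen under the lock, so they are atomic.
        with self._lock:
            if self.state != "ACTIVE":
                raise InvalidState("INVALID_STATE")
            self.state = "VOIDED"


def race(n=2):
    """Run n concurrent VOID attempts against one row; return the outcomes."""
    row = CdrRow()
    outcomes = []

    def attempt():
        try:
            row.void()
            outcomes.append("COMMITTED")
        except InvalidState:
            outcomes.append("INVALID_STATE")

    threads = [threading.Thread(target=attempt) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(outcomes)


print(race())
```

Whichever thread takes the lock first commits; the other sees the updated state and is rejected, which matches the behaviour described in the mitigation.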


FM-10 — Schema-version transition

Scenario. ATRA cutover from v1 to v2 schema; both required during transition window.

Impact. Potential export rejection if adapter only produces one version.

Mitigation. Adapter supports side-by-side schema generation during transition. Dual-delivery in parallel for up to 30 days during cutover. Regulator Liaison coordinates.

Recovery. Clean switchover after transition.


FM-11 — Tamper detected

Scenario. Chain-verifier detects mismatch that cannot be explained by bug.

Impact. Security incident.

Mitigation. Audit UPDATE/DELETE rejected by Postgres trigger (prevents most tampering). Any break → CISO + Commerce Lead + Legal paged. Investigate.

Recovery. Forensic analysis; root cause; possible regulator notification + notification to affected tenants.


FM-12 — S3 bucket-policy change

Scenario. IAM or bucket policy inadvertently changed, blocking writes.

Impact. Archive fails silently if monitoring lags.

Mitigation. Weekly scan verifies bucket policy; immediate alert on drift. Bucket policy changes require dual-control.

Recovery. Revert policy; archive worker resumes.


FM-13 — Duplicate CDR from NATS

Scenario. NATS redelivery after ack timeout.

Impact. Duplicate event arrives. Idempotency key catches it.

Mitigation. Correlation-ID unique constraint in cdr.records; duplicate INSERT → NO-OP.

Recovery. Automatic.
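
The deduplication path can be sketched with an in-memory SQLite table standing in for cdr.records. The column names and the `ingest` helper are assumptions. SQLite's `INSERT OR IGNORE` plays the role that `INSERT ... ON CONFLICT DO NOTHING` would play in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for cdr.records; the primary key plays the unique-constraint role.
conn.execute(
    "CREATE TABLE records (correlation_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
)


def ingest(conn, correlation_id, payload):
    """Insert a CDR; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (correlation_id, payload) VALUES (?, ?)",
        (correlation_id, payload),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint turned the insert into a no-op.
    return cur.rowcount == 1


print(ingest(conn, "corr-001", "first delivery"))
print(ingest(conn, "corr-001", "redelivery after ack timeout"))
```

Because the duplicate is a no-op and not an error, the consumer can safely ack the redelivered message.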


FM-14 — ClickHouse ingest lag

Scenario. ClickHouse cluster degraded or connection issue.

Impact. Analytics stale; regulator-portal long-range queries affected.

Mitigation. ClickHouse replica fail-over; buffer in Redis until recovery.

Recovery. Auto-retry syncs.


FM-15 — Region partition

Scenario. Kabul ↔ Mazar partition.

Impact. Each region operates region-local. Cross-region archive delayed. Daily export runs only in primary region.

Mitigation. Region-local Postgres + S3 bucket; cross-region replication async. Eventual consistency when partition heals.

Recovery. Replication catches up.


4. Graceful Degradation Summary

| Failure domain | Behaviour |
|---|---|
| NATS ingest | Queue + retry (7 d retention) |
| Postgres | Auto-failover; batch retries |
| HSM | Export queues; no silent fallback |
| ATRA unreachable | Export queues; manual delivery runbook |
| S3 archive | Retry; hot retention stretches |
| Chain break | Regulator-defensibility at stake; immediate investigation |
| ClickHouse | Analytics lag; hot-path unaffected |

5. Failure ↔ Consumer Experience Matrix

| FM | Billing | Regulator | Tenant | NOC |
|---|---|---|---|---|
| FM-01 Lag | Delayed rev assurance | None | None | Alert |
| FM-02 PG out | CDR delay | None | None | Critical |
| FM-03 Archive fail | None | Eventual retention concern | None | Alert |
| FM-04 Rollup fail | Gap in billing feed | None | None | Alert |
| FM-05 HSM out | None | Export delayed | None | Critical |
| FM-06 ATRA out | None | Export delayed; manual fallback | None | Critical after 6 h |
| FM-07 ATRA reject | None | Export rejected; manual fix | None | Regulator Liaison |
| FM-08 Chain break | None | Audit defensibility reduced | None | Critical; CISO paged |
| FM-09 Adjustment race | None | None | None | None |
| FM-10 Schema transition | None | Potential reject during transition | None | Regulator Liaison |
| FM-11 Tamper | None | Audit defensibility compromised | None | Security incident |
| FM-12 S3 drift | Archive stale | None | None | Alert |
| FM-13 Dup CDR | None | None | None | None |
| FM-14 ClickHouse lag | None | Stale analytics | None | Alert |
| FM-15 Region split | Region-local | None | None | Alert |

6. Open Points

| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact ATRA SLA for export ACK (36 h is Ghasi-proposed) | Regulator Liaison |
| FM-OPEN-02 | Manual-delivery fallback procedure to ATRA (USB drop-box? Trusted courier?) | Regulator Liaison + Legal |
| FM-OPEN-03 | Adjustment SLA once original CDR is exported + ACKed | Legal |