cdr-mediation-service — Failure Modes
Version: 1.0
Status: Draft
Owner: Commerce + Regulator Liaison + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md, SECURITY_MODEL.md
Catalog of how cdr-mediation-service fails. Operating principle: durability over latency — it is acceptable for exports to be delayed; it is not acceptable for CDR rows to be lost or the hash chain to break.
1. Operating Principle
- CDR ingest: fail-loud but queue (NATS redelivers until acknowledged; a message is acknowledged only once its row is durably written).
- Rollup / archive / export: batch jobs — retries tolerated; alerts on prolonged failure.
- HSM: fail-closed on export signing (no unsigned regulator files).
- Audit chain break: critical — regulator-defensibility claim at stake.
- ATRA unreachable: queue exports; manual-delivery fallback runbook.
2. Failure Mode Summary
| # | Name | Class | Detection | User-visible impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | NATS ingest consumer lag > 5 min | Infra | 5 min | CDR generation behind; billing-downstream may be stale | runbooks/cdr/ingest-lag.md |
| FM-02 | Postgres unavailable (ingest write-path) | Infra | < 30 s | Ingest stalls; NATS queues up (7 d retention) | runbooks/cdr/pg-out.md |
| FM-03 | S3 archive write fails | Infra | < 5 min | Hot retention stretches; manual intervention if sustained | runbooks/cdr/archive-fail.md |
| FM-04 | Hourly rollup cron fails | Ops | 15 min | Gap in aggregation; billing uses last-good aggregate | runbooks/cdr/rollup-fail.md |
| FM-05 | HSM unavailable | Dependency | < 30 s | Export signing pauses; queue fills | runbooks/cdr/hsm-out.md |
| FM-06 | ATRA SFTP/HTTPS unreachable | Dependency | Per-attempt | Daily export queued; manual delivery runbook if > 36 h | runbooks/cdr/atra-unreachable.md |
| FM-07 | ATRA rejects export (schema mismatch) | Dependency | < 60 s after delivery | Export marked REJECTED; adapter rollback | runbooks/cdr/export-rejected.md |
| FM-08 | Hash chain break detected | Correctness | 24 h (daily verifier) | Regulator-defensibility lost for affected period | runbooks/cdr/chain-broken.md |
| FM-09 | Adjustment-record race | Concurrency | < 5 s | Atomic state ensures only one winner | runbooks/cdr/adjustment-race.md |
| FM-10 | Schema-version mismatch during ATRA transition | Integration | Per-delivery | Adapter fallback to previous version | runbooks/cdr/schema-transition.md |
| FM-11 | Chain verifier detects tamper | Security | 24 h | Critical security incident | runbooks/cdr/tamper-detected.md |
| FM-12 | Cold archive S3 bucket policy change | Ops | ≤ 7 d for the drift itself (weekly scan); < 5 min once archive writes fail (FM-03) | Archive write fails; alert | runbooks/cdr/s3-policy-drift.md |
| FM-13 | Duplicate CDR from NATS redelivery | Idempotency | < 1 s | Deduplicated; no impact | runbooks/cdr/dedup-working.md (informational) |
| FM-14 | ClickHouse ingest lag > 10 min | Analytics | 15 min | Analytics stale; hot-path unaffected | runbooks/cdr/clickhouse-lag.md |
| FM-15 | Region partition | Infra | < 1 min | Region-local operation continues; cross-region archive delayed | runbooks/cdr/region-split.md |
3. Detailed Failure Modes
FM-01 — NATS ingest consumer lag
Scenario. Consumer can't keep up (e.g., Postgres slow, CDR surge).
Impact. CDR-generation lag; downstream (billing revenue-assurance, regulator export) delayed.
Mitigation. HPA on cdr_nats_consumer_lag; scale-out to 12 replicas. Postgres conn pool tuned. If lag > 30 min, page Commerce team — likely an upstream anomaly (surge of DLRs).
Recovery. Auto-scale drains queue. No data loss (JetStream 7 d retention).
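The scale-out step can be sketched with the standard Kubernetes HPA rule, desired = ceil(current × metric / target), clamped to the configured bounds. Only the 12-replica cap comes from this section. The target lag, the minimum replica count, and the assumption that cdr_nats_consumer_lag is reported in seconds are illustrative.

```python
import math

MAX_REPLICAS = 12           # cap stated in FM-01
MIN_REPLICAS = 2            # assumption: not specified in this document
TARGET_LAG_SECONDS = 60.0   # assumption: target value for cdr_nats_consumer_lag


def desired_replicas(current_replicas: int, observed_lag_seconds: float) -> int:
    """Apply the HPA scaling rule and clamp the result to [MIN, MAX]."""
    raw = math.ceil(current_replicas * observed_lag_seconds / TARGET_LAG_SECONDS)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))
```

With these assumed values, 3 replicas at 120 s of lag scale to 6, and any lag large enough to ask for more than 12 replicas is held at 12, which is the point where the 30 min paging rule becomes the backstop.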
FM-02 — Postgres unavailable
Scenario. Primary + replica both unreachable.
Impact. Ingest stalls; NATS queues. Batch jobs skip the affected run. An export built during the outage may use stale cached data.
Mitigation. Postgres HA with auto-failover (≤ 30 s). NATS retention holds events for 7 d. Batch jobs idempotent: re-run catches up.
Recovery. DB recovery → ingest drains queue. Rollups re-run for missed hours.
FM-03 — S3 archive write fails
Scenario. S3 / MinIO unreachable or bucket-policy change blocks writes.
Impact. Hot-tier retention stretches past 30 d; if sustained > 7 d, hot partition count grows beyond DB-friendly limits.
Mitigation. Archive cron retries with exponential backoff. Alert at 24 h. Manual-intervention runbook if the cause is an S3 credentials or policy issue. The cross-region replica bucket is an alternative write target.
Recovery. Archive worker drains backlog once S3 recovered.
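A minimal sketch of the retry-with-exponential-backoff behaviour named in the mitigation. The attempt count, base delay, and cap are assumptions, and `write` is a hypothetical callable standing in for the archive upload. The sleep function is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def archive_with_retry(
    write: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    attempts: int = 6,      # assumption
    base: float = 1.0,      # assumption: first delay in seconds
    factor: float = 2.0,    # delay doubles after each failure
    cap: float = 300.0,     # assumption: upper bound on a single delay
) -> bool:
    """Call write() until it succeeds or attempts are exhausted.

    Returns True on success, False if every attempt raised OSError.
    """
    delay = base
    for attempt in range(1, attempts + 1):
        try:
            write()
            return True
        except OSError:
            if attempt == attempts:
                return False  # leave the backlog for the next cron run
            sleep(min(delay, cap))
            delay *= factor
    return False
```

Returning False rather than raising keeps the batch in the hot tier, so the next scheduled run picks up the same backlog.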
FM-04 — Hourly rollup fails
Scenario. Rollup CronJob exits non-zero (OOM, DB error).
Impact. That hour missing from aggregates until retry.
Mitigation. CronJob backoffLimit: 3. Manual re-trigger available. Rollup idempotent (identical result on re-run).
Recovery. Successful re-run backfills aggregate.
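One way to get the "identical result on re-run" property is to recompute the hour from source rows and overwrite the aggregate keyed by hour. The sketch below uses SQLite so it runs standalone; the table and column names are hypothetical, and the production job runs against Postgres.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory database with hypothetical tables and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE records (id INTEGER PRIMARY KEY, hour TEXT, units INTEGER);
        CREATE TABLE hourly_rollup (
            hour TEXT PRIMARY KEY, cdr_count INTEGER, total_units INTEGER
        );
        INSERT INTO records (hour, units) VALUES
            ('2026-04-21T10', 3), ('2026-04-21T10', 4), ('2026-04-21T11', 5);
        """
    )
    return conn


def rollup_hour(conn: sqlite3.Connection, hour: str) -> None:
    """Recompute one hour from source rows and overwrite its aggregate row."""
    conn.execute(
        """
        INSERT OR REPLACE INTO hourly_rollup (hour, cdr_count, total_units)
        SELECT ?, COUNT(*), COALESCE(SUM(units), 0)
        FROM records WHERE hour = ?
        """,
        (hour, hour),
    )
    conn.commit()
```

Because the aggregate is derived only from the source rows and replaced by primary key, a manual re-trigger after a failed CronJob cannot double-count.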
FM-05 — HSM unavailable
Scenario. HSM cluster outage.
Impact. Export signing blocked; exports queue in cdr.exports with state AWAITING_SIGN.
Mitigation. HSM HA with regional quorum. Export cron detects HSM outage and defers. Alert fires.
Recovery. HSM recovery → queued exports signed + delivered.
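The fail-closed rule can be sketched as a small state transition: an export leaves AWAITING_SIGN only when the HSM returns a signature, and nothing unsigned is ever deliverable. AWAITING_SIGN comes from this section; the SIGNED state, the `Export` type, and the `hsm_sign` callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class HsmUnavailable(Exception):
    """Raised by the (hypothetical) HSM client when the cluster is down."""


@dataclass
class Export:
    export_id: str
    payload: bytes
    state: str = "AWAITING_SIGN"
    signature: Optional[bytes] = None


def try_sign(export: Export, hsm_sign: Callable[[bytes], bytes]) -> Export:
    """Sign if the HSM answers; otherwise leave the export queued."""
    try:
        signature = hsm_sign(export.payload)
    except HsmUnavailable:
        return export  # stays AWAITING_SIGN; no fallback signer
    export.signature = signature
    export.state = "SIGNED"  # hypothetical state name
    return export


def deliverable(export: Export) -> bool:
    """Fail closed: only signed exports may be sent to the regulator."""
    return export.state == "SIGNED" and export.signature is not None
```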
FM-06 — ATRA SFTP/HTTPS unreachable
Scenario. ATRA endpoint down or network partition.
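Impact. Daily export queued. If the outage lasts > 36 h, the export SLA is breached (36 h is the proposed figure; see FM-OPEN-01).
Mitigation. Retry 3× with exponential backoff. Regulator Liaison notified after 6 h. Manual-delivery runbook as fallback; the hand-off mechanism (e.g., USB to the ATRA liaison) is still open, see FM-OPEN-02.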
Recovery. Automatic on ATRA recovery. Manual-delivery requires ATRA MoU update.
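The time thresholds above can be read as a simple escalation ladder. The sketch encodes the 6 h and 36 h figures from this section; the action names are hypothetical labels, not identifiers from the service.

```python
from datetime import timedelta

NOTIFY_AFTER = timedelta(hours=6)   # Regulator Liaison notified (FM-06)
SLA_LIMIT = timedelta(hours=36)     # proposed SLA, pending FM-OPEN-01


def escalation(outage: timedelta) -> str:
    """Map how long ATRA has been unreachable to the next action."""
    if outage > SLA_LIMIT:
        return "MANUAL_DELIVERY"    # hypothetical label: follow the runbook
    if outage >= NOTIFY_AFTER:
        return "NOTIFY_LIAISON"     # hypothetical label
    return "RETRY"                  # hypothetical label: keep export queued
```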
FM-07 — ATRA rejects export
Scenario. ATRA returns "schema mismatch" or content-validation error.
Impact. Export marked REJECTED. Regulator requires corrected re-submission.
Mitigation. Error details captured in cdr.export_delivery_log. Adapter rollback to previous schema version. Regulator Liaison escalation to diagnose.
Recovery. Adapter fix → re-run export → ACKed.
FM-08 — Hash chain break
Scenario. Daily verifier detects record_hash ≠ sha256(canonical(payload) || prev_hash).
Impact. Regulator-defensibility claim compromised.
Mitigation. Investigate (canonicalisation bug, race, tamper). Quarantine affected partition. Regulator notified within 24 h if the affected audit data has already been exported.
Recovery. Root-cause + fix + forward partition restart. Post-mortem + code review within 72 h.
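The verifier check, record_hash = sha256(canonical(payload) || prev_hash), can be sketched as follows. The canonical form (sorted-key compact JSON) and the all-zero genesis hash are assumptions for illustration; the service's real canonicalisation is not specified in this document.

```python
import hashlib
import json
from typing import List, Optional, Tuple

GENESIS = b"\x00" * 32  # assumption: prev_hash used for the first record


def canonical(payload: dict) -> bytes:
    """Assumed canonical form: compact JSON with sorted keys, UTF-8."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_hash(payload: dict, prev_hash: bytes) -> bytes:
    """sha256(canonical(payload) || prev_hash), as stated in FM-08."""
    return hashlib.sha256(canonical(payload) + prev_hash).digest()


def build_chain(payloads: List[dict]) -> List[Tuple[dict, bytes]]:
    """Return (payload, record_hash) pairs linked from GENESIS."""
    chain, prev = [], GENESIS
    for payload in payloads:
        prev = record_hash(payload, prev)
        chain.append((payload, prev))
    return chain


def first_break(chain: List[Tuple[dict, bytes]]) -> Optional[int]:
    """Index of the first record whose stored hash does not verify, else None."""
    prev = GENESIS
    for index, (payload, stored) in enumerate(chain):
        if record_hash(payload, prev) != stored:
            return index
        prev = stored
    return None
```

Reporting the first failing index gives the quarantine boundary: every record before it verifies, and everything from it onward needs investigation.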
FM-09 — Adjustment race
Scenario. Two simultaneous VOID adjustments on same CDR.
Impact. Potential duplicate adjustment.
Mitigation. A Postgres row-level lock (SELECT ... FOR UPDATE) serialises the two transactions so only one commits; the second is rejected with INVALID_STATE.
Recovery. No action needed.
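The single-winner outcome can be demonstrated standalone. SQLite has no FOR UPDATE, so this sketch substitutes a conditional UPDATE (compare-and-set on the state column), which is a different technique that yields the same result: the first VOID succeeds and the second is rejected. The ACTIVE and VOIDED state names and the table layout are hypothetical.

```python
import sqlite3


class InvalidState(Exception):
    """Stands in for the INVALID_STATE rejection described above."""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute("INSERT INTO records (id, state) VALUES (1, 'ACTIVE')")
    conn.commit()
    return conn


def void_cdr(conn: sqlite3.Connection, cdr_id: int) -> None:
    """Void a CDR only if it is still ACTIVE; otherwise raise InvalidState."""
    cur = conn.execute(
        "UPDATE records SET state = 'VOIDED' WHERE id = ? AND state = 'ACTIVE'",
        (cdr_id,),
    )
    conn.commit()
    if cur.rowcount != 1:
        raise InvalidState(f"CDR {cdr_id} is not in a voidable state")
```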
FM-10 — Schema-version transition
Scenario. ATRA cutover from v1 to v2 schema; both required during transition window.
Impact. Potential export rejection if adapter only produces one version.
Mitigation. Adapter supports side-by-side schema generation during transition. Dual-delivery in parallel for up to 30 days during cutover. Regulator Liaison coordinates.
Recovery. Clean switchover after transition.
FM-11 — Tamper detected
Scenario. Chain-verifier detects mismatch that cannot be explained by bug.
Impact. Security incident.
Mitigation. Audit UPDATE/DELETE rejected by Postgres trigger (prevents most tampering). Any break → CISO + Commerce Lead + Legal paged. Investigate.
Recovery. Forensic analysis; root cause; possible regulator notification + notification to affected tenants.
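The append-only guard can be illustrated with a trigger that aborts any UPDATE or DELETE. The sketch uses SQLite so it runs standalone; the production trigger is in Postgres, where the syntax differs (a trigger function that raises an exception). The table and trigger names are hypothetical.

```python
import sqlite3


def make_audit_db() -> sqlite3.Connection:
    """In-memory audit table whose rows cannot be updated or deleted."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE audit_log (id INTEGER PRIMARY KEY, entry TEXT NOT NULL);
        CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
        BEGIN
            SELECT RAISE(ABORT, 'audit_log is append-only');
        END;
        CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
        BEGIN
            SELECT RAISE(ABORT, 'audit_log is append-only');
        END;
        """
    )
    return conn


def is_rejected(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the statement is refused by the database."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False
```

A trigger like this stops ordinary application-level writes but not a privileged actor who can drop it, which is why the daily chain verifier remains the detection layer.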
FM-12 — S3 bucket-policy change
Scenario. IAM or bucket policy inadvertently changed, blocking writes.
Impact. Archive fails silently if monitoring lags.
Mitigation. Weekly scan verifies the bucket policy and alerts as soon as drift is found. Bucket policy changes require dual-control.
Recovery. Revert policy; archive worker resumes.
FM-13 — Duplicate CDR from NATS
Scenario. NATS redelivery after ack timeout.
Impact. Duplicate event arrives. Idempotency key catches it.
Mitigation. Correlation-ID unique constraint in cdr.records; duplicate INSERT → NO-OP.
Recovery. Automatic.
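A minimal sketch of the dedup path: a unique constraint on the correlation ID turns a redelivered event into a no-op insert. SQLite's INSERT OR IGNORE is used so the example runs standalone; in Postgres the equivalent would be INSERT ... ON CONFLICT DO NOTHING. The table layout is hypothetical.

```python
import sqlite3


def make_records_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (correlation_id TEXT NOT NULL UNIQUE, payload TEXT NOT NULL)"
    )
    return conn


def ingest(conn: sqlite3.Connection, correlation_id: str, payload: str) -> bool:
    """Insert one CDR; return False if it was a duplicate and was skipped."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (correlation_id, payload) VALUES (?, ?)",
        (correlation_id, payload),
    )
    conn.commit()
    return cur.rowcount == 1
```

Either way the consumer can acknowledge the message: a duplicate means the row is already durable.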
FM-14 — ClickHouse ingest lag
Scenario. ClickHouse cluster degraded or connection issue.
Impact. Analytics stale; regulator-portal long-range queries affected.
Mitigation. ClickHouse replica fail-over; buffer in Redis until recovery.
Recovery. Auto-retry syncs.
FM-15 — Region partition
Scenario. Kabul ↔ Mazar partition.
Impact. Each region operates region-local. Cross-region archive delayed. Daily export runs only in primary region.
Mitigation. Region-local Postgres + S3 bucket; cross-region replication async. Eventual consistency when partition heals.
Recovery. Replication catches up.
4. Graceful Degradation Summary
| Failure domain | Behaviour |
|---|---|
| NATS ingest | Queue + retry (7 d retention) |
| Postgres | Auto-failover; batch retries |
| HSM | Export queues; no silent fallback |
| ATRA unreachable | Export queues; manual delivery runbook |
| S3 archive | Retry; hot retention stretches |
| Chain break | Regulator-defensibility at stake — immediate investigation |
| ClickHouse | Analytics lag; hot-path unaffected |
5. Failure ↔ Consumer Experience Matrix
| FM | Billing | Regulator | Tenant | NOC |
|---|---|---|---|---|
| FM-01 Lag | Delayed rev assurance | None | None | Alert |
| FM-02 PG out | CDR delay | None | None | Critical |
| FM-03 Archive fail | None | Eventual retention concern | None | Alert |
| FM-04 Rollup fail | Gap in billing feed | None | None | Alert |
| FM-05 HSM out | None | Export delayed | None | Critical |
| FM-06 ATRA out | None | Export delayed; manual fallback | None | Critical after 6 h |
| FM-07 ATRA reject | None | Export rejected; manual fix | None | Regulator Liaison |
| FM-08 Chain break | None | Audit defensibility reduced | None | Critical; CISO paged |
| FM-09 Adjustment race | None | None | None | None |
| FM-10 Schema transition | None | Potential reject during transition | None | Regulator Liaison |
| FM-11 Tamper | None | Audit defensibility compromised | None | Security incident |
| FM-12 S3 drift | Archive stale | None | None | Alert |
| FM-13 Dup CDR | None | None | None | None |
| FM-14 ClickHouse lag | None | Stale analytics | None | Alert |
| FM-15 Region split | Region-local | None | None | Alert |
6. Open Points
| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Exact ATRA SLA for export ACK (36 h is Ghasi-proposed) | Regulator Liaison |
| FM-OPEN-02 | Manual-delivery fallback procedure to ATRA (USB drop-box? Trusted courier?) | Regulator Liaison + Legal |
| FM-OPEN-03 | Adjustment SLA once original CDR is exported + ACKed | Legal |