Billing Service — Failure Modes
Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18

| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| 1 | PG primary down | Event ingestion fails; NATS NAK → redelivery | BillingPgErrors alert | NATS NAK with backoff; events replay when PG recovers; no data loss |
| 2 | PG replica lag | Stale reads on usage queries | Replica lag alert | Route usage queries to the primary when lag exceeds 2 s |
| 3 | Redis cluster down | Pricing cache misses; all lookups hit PG | BillingRedisErrors alert | Degraded mode: PG fallback; log warning; no data loss |
| 4 | NATS consumer lag growing | Billing events backed up; invoices reflect delayed data | NATS stream depth alert | Scale consumers; investigate dlr-processor publish rate |
| 5 | Pricing rule not found for event | Event cannot be priced; NAK with delay | BillingPricingNotFound alert | Alert ops; create missing pricing rule; event replays |
| 6 | Operator cost not found | Margin cannot be computed | BillingCostNotFound alert | Store operatorCost = 0; flag event for manual review |
| 7 | Invoice cron fails for one account | Account invoice stays DRAFT and is not FINALIZED | BillingInvoiceCronError alert | Per-account error isolation; retry on next run or trigger manually |
| 8 | S3 unavailable during invoice generation | PDF cannot be stored; invoice stays DRAFT | S3 error metrics | Retry S3 upload; invoice finalization blocked until S3 recovers |
| 9 | PDF render failure (template error) | Invoice stays DRAFT; account not invoiced | BillingInvoiceRenderError alert | Template validation in CI; manual re-run after fix |
| 10 | Duplicate NATS message delivery | None; the second insert attempt on the same message_id is a no-op | Expected behavior; handled by ON CONFLICT DO NOTHING | No-op insert; ACK; no double billing |
| 11 | Clock skew: chargedAt in the future | Wrong pricing period resolved | NTP monitoring | Server-side NTP; max 5 s tolerance; events exceeding the tolerance are flagged |
| 12 | Pricing version overlap / gap | Wrong price applied | CI overlap detection + alert | Partial unique index prevents overlap on active rows; gap → PricingNotFound |
| 13 | Invoice void after payment collected | Revenue reversal | Finance manual review | platform.finance role required; audit record mandatory |
| 14 | PII leak in event payload | Compliance incident | Log scanner | billing.events schema has no MSISDN/body; enforced by schema validation |
| 15 | Usage summary desync after replay | Summaries over-counted | Rare; bucket_hour aggregation is idempotent via UPSERT | Manual reconciliation script: compare SUM over billing_events against usage_summaries |
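
Rows 10 and 15 both rely on idempotent writes, so a small sketch may help make the pattern concrete. It is not the service's actual code: it assumes an in-memory SQLite database as a stand-in for PG (SQLite 3.24+ accepts the same `ON CONFLICT` clauses), and everything except the names `billing_events`, `usage_summaries`, `message_id`, and `bucket_hour` is hypothetical, including the `account_id`, `amount`, and `total` columns and the functions `ingest_event`, `refresh_summary`, and `find_desync`.

```python
import sqlite3

# Hypothetical schema; only the table names, message_id and bucket_hour
# come from the failure-mode table above.
SCHEMA = """
CREATE TABLE billing_events (
    message_id  TEXT PRIMARY KEY,  -- dedupe key for redelivered messages
    account_id  TEXT NOT NULL,
    bucket_hour TEXT NOT NULL,
    amount      INTEGER NOT NULL   -- assumed: charge in minor units
);
CREATE TABLE usage_summaries (
    account_id  TEXT NOT NULL,
    bucket_hour TEXT NOT NULL,
    total       INTEGER NOT NULL,
    PRIMARY KEY (account_id, bucket_hour)
);
"""


def ingest_event(conn, message_id, account_id, bucket_hour, amount):
    """Insert one event. Returns True if stored, False if it was a duplicate."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO billing_events (message_id, account_id, bucket_hour, amount) "
            "VALUES (?, ?, ?, ?) ON CONFLICT (message_id) DO NOTHING",
            (message_id, account_id, bucket_hour, amount),
        )
    # A skipped insert changes zero rows; the caller can ACK either way.
    return cur.rowcount == 1


def refresh_summary(conn, account_id, bucket_hour):
    """Recompute one bucket from billing_events and UPSERT it; safe to re-run."""
    with conn:
        conn.execute(
            "INSERT INTO usage_summaries (account_id, bucket_hour, total) "
            "SELECT account_id, bucket_hour, SUM(amount) FROM billing_events "
            "WHERE account_id = ? AND bucket_hour = ? "
            "GROUP BY account_id, bucket_hour "
            "ON CONFLICT (account_id, bucket_hour) DO UPDATE SET total = excluded.total",
            (account_id, bucket_hour),
        )


def find_desync(conn):
    """Return (account_id, bucket_hour, events_total, summary_total) mismatches.

    Only buckets that have at least one event are checked.
    """
    return conn.execute(
        "SELECT e.account_id, e.bucket_hour, e.total, s.total "
        "FROM (SELECT account_id, bucket_hour, SUM(amount) AS total "
        "      FROM billing_events GROUP BY account_id, bucket_hour) AS e "
        "LEFT JOIN usage_summaries AS s "
        "  ON s.account_id = e.account_id AND s.bucket_hour = e.bucket_hour "
        "WHERE s.total IS NULL OR s.total <> e.total"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
ingest_event(conn, "msg-1", "acct-a", "2026-04-18T10", 250)
ingest_event(conn, "msg-1", "acct-a", "2026-04-18T10", 250)  # redelivery, skipped
refresh_summary(conn, "acct-a", "2026-04-18T10")
print(find_desync(conn))
```

In this sketch the summary is replaced with a value recomputed from `billing_events` rather than incremented, which is one way to keep a replay from over-counting; whether the real aggregation works this way is an assumption. `find_desync` mirrors the comparison described for the reconciliation script in row 15.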