Billing Service — Failure Modes

Status: populated | Owner: Platform Engineering + SRE | Last updated: 2026-04-18

| # | Failure | User impact | Detection | Mitigation |
|---|---------|-------------|-----------|------------|
| 1 | PG primary down | Event ingestion fails; NATS NAK → redelivery | BillingPgErrors alert | NATS NAK with backoff; events replay when PG recovers; no data loss |
| 2 | PG replica lag | Stale usage query reads | Replica lag alert | Route usage queries to primary on lag > 2s |
| 3 | Redis cluster down | Pricing cache misses; all lookups hit PG | BillingRedisErrors alert | Degraded mode: PG fallback; log warning; no data loss |
| 4 | NATS consumer lag growing | Billing events backed up; invoices reflect delayed data | NATS stream depth alert | Scale consumers; investigate dlr-processor publish rate |
| 5 | Pricing rule not found for event | Event cannot be priced; NAK with delay | BillingPricingNotFound alert | Alert ops; create missing pricing rule; event replays |
| 6 | Operator cost not found | Margin cannot be computed | BillingCostNotFound alert | Store operatorCost = 0; flag event for manual review |
| 7 | Invoice cron fails for one account | Account invoice DRAFT but not FINALIZED | BillingInvoiceCronError alert | Per-account error isolation; retry next run or manual trigger |
| 8 | S3 unavailable during invoice generation | PDF cannot be stored; invoice stays DRAFT | S3 error metrics | Retry S3 upload; invoice finalization blocked until S3 recovers |
| 9 | PDF render failure (template error) | Invoice DRAFT; account not invoiced | BillingInvoiceRenderError alert | Template validation in CI; manual re-run after fix |
| 10 | Duplicate NATS message delivery | Second insert attempt on message_id | Expected; ON CONFLICT DO NOTHING | No-op insert; ACK; no double billing |
| 11 | Clock skew: chargedAt in future | Wrong pricing period resolved | NTP monitoring | Server-side NTP; max 5s tolerance; events with skew flagged |
| 12 | Pricing version overlap / gap | Wrong price applied | CI overlap detection + alert | Partial unique index prevents overlap on active rows; gap → PricingNotFound |
| 13 | Invoice void after payment collected | Revenue reversal | Finance manual review | platform.finance role required; audit record mandatory |
| 14 | PII leak in event payload | Compliance incident | Log scanner | billing.events schema has no MSISDN/body; enforced by schema validation |
| 15 | Usage summary desync after replay | Summaries over-counted | Rare; bucket_hour aggregation idempotent via UPSERT | Manual reconciliation script; compare billing_events SUM vs usage_summaries |
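
Failure 2 routes usage queries to the primary once replica lag exceeds 2s. A minimal sketch of that decision follows. The function name, the string return values, and the choice to fall back to the primary when lag is unknown are assumptions made for illustration, not the service's actual code.

```python
from typing import Optional

# Threshold taken from failure 2 in the table: route to primary on lag > 2s.
MAX_REPLICA_LAG_SECONDS = 2.0


def choose_read_target(replica_lag_seconds: Optional[float]) -> str:
    """Return "replica" or "primary" for a usage query.

    Assumption: an unknown lag (None) is treated as unsafe, so the query
    goes to the primary.
    """
    if replica_lag_seconds is None:
        return "primary"
    if replica_lag_seconds > MAX_REPLICA_LAG_SECONDS:
        return "primary"
    return "replica"
```

With this sketch, a lag of exactly 2s still reads from the replica, because the table specifies a strict "greater than" comparison.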
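
Failure 7 depends on per-account error isolation: one failing account must not stop the invoice run for the others. The loop below is a sketch of that pattern. `run_invoice_cron`, the `finalize_invoice` callback, and the fake finalizer are hypothetical names, and the real cron may record failures differently (for example by emitting the BillingInvoiceCronError alert).

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_invoice_cron(
    account_ids: Iterable[str],
    finalize_invoice: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Finalize invoices one account at a time.

    An exception for one account is recorded and the loop continues, so the
    failed account stays in DRAFT and can be retried on the next run or by a
    manual trigger.
    """
    finalized: List[str] = []
    failed: Dict[str, str] = {}
    for account_id in account_ids:
        try:
            finalize_invoice(account_id)
        except Exception as exc:  # isolate the failure to this account
            failed[account_id] = str(exc)
        else:
            finalized.append(account_id)
    return finalized, failed


def _fake_finalize(account_id: str) -> None:
    # Hypothetical stand-in that fails for exactly one account.
    if account_id == "acct-2":
        raise RuntimeError("render failed")


finalized, failed = run_invoice_cron(["acct-1", "acct-2", "acct-3"], _fake_finalize)
```

Here `acct-1` and `acct-3` are finalized even though `acct-2` raised, which is the behavior the mitigation column describes.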
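
Failure 10 treats duplicate NATS delivery as expected and relies on `ON CONFLICT DO NOTHING` keyed on `message_id`. The sketch below demonstrates the idea with an in-memory SQLite database as a stand-in for PG, since SQLite 3.24+ accepts the same clause. The table layout and the `amount_cents` column are assumptions; only `message_id` comes from the table above.

```python
import sqlite3

# Hypothetical minimal schema: only message_id matters for deduplication.
SCHEMA = """
CREATE TABLE billing_events (
    message_id   TEXT PRIMARY KEY,
    amount_cents INTEGER NOT NULL
)
"""

INSERT = """
INSERT INTO billing_events (message_id, amount_cents)
VALUES (?, ?)
ON CONFLICT (message_id) DO NOTHING
"""


def record_event(conn: sqlite3.Connection, message_id: str, amount_cents: int) -> bool:
    """Insert the event; return True for a new row, False for a duplicate.

    Either way the caller can ACK the message, because a duplicate is a no-op.
    """
    cur = conn.execute(INSERT, (message_id, amount_cents))
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = record_event(conn, "msg-001", 250)
second = record_event(conn, "msg-001", 250)  # redelivery of the same message
total = conn.execute("SELECT SUM(amount_cents) FROM billing_events").fetchone()[0]
```

The second call writes nothing, so the stored total reflects a single charge. This is what makes redelivery safe to ACK without double billing.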
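
Failure 11 flags events whose `chargedAt` lies in the future beyond a 5s tolerance. A sketch of such a check is shown below. The function name and the choice to compare against a caller-supplied `now` are assumptions, and the table does not say what happens to a flagged event afterwards.

```python
from datetime import datetime, timedelta, timezone

# Tolerance taken from failure 11 in the table: max 5s.
MAX_FUTURE_SKEW = timedelta(seconds=5)


def is_skewed(charged_at: datetime, now: datetime) -> bool:
    """Return True when charged_at is more than 5s ahead of server time.

    Both datetimes are expected to be timezone-aware so the subtraction is
    well defined.
    """
    return charged_at - now > MAX_FUTURE_SKEW


now = datetime(2026, 4, 18, 12, 0, 0, tzinfo=timezone.utc)
within_tolerance = is_skewed(now + timedelta(seconds=3), now)
beyond_tolerance = is_skewed(now + timedelta(seconds=6), now)
```

Passing `now` explicitly keeps the check deterministic in tests; a production caller would presumably supply the NTP-synchronized server clock.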
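
Failure 15 is mitigated by a manual reconciliation script that compares the `billing_events` sum against `usage_summaries`. The sketch below shows the comparison on in-memory data. The tuple and dict shapes, the `units` field, and the function name are hypothetical; the actual script would read both sides from the database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (account_id, bucket_hour), an assumed grouping key


def find_desynced_buckets(
    events: Iterable[Tuple[str, str, int]],
    summaries: Dict[Key, int],
) -> Dict[Key, Tuple[int, int]]:
    """Return {key: (expected_from_events, recorded_in_summary)} for mismatches."""
    expected: Dict[Key, int] = defaultdict(int)
    for account_id, bucket_hour, units in events:
        expected[(account_id, bucket_hour)] += units

    mismatches: Dict[Key, Tuple[int, int]] = {}
    # Check keys from both sides so a bucket missing on either side is reported.
    for key in expected.keys() | summaries.keys():
        want = expected.get(key, 0)
        have = summaries.get(key, 0)
        if want != have:
            mismatches[key] = (want, have)
    return mismatches


events = [
    ("acct-1", "2026-04-18T10", 3),
    ("acct-1", "2026-04-18T10", 2),
    ("acct-2", "2026-04-18T10", 1),
]
summaries = {
    ("acct-1", "2026-04-18T10"): 10,  # over-counted after a replay
    ("acct-2", "2026-04-18T10"): 1,
}
mismatches = find_desynced_buckets(events, summaries)
```

Only the `acct-1` bucket is reported, with the event-derived total of 5 against the recorded 10.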