Billing Service — Failure Modes

Status: populated
Owner: TBD
Last updated: 2026-04-17
Companion: Service Template

1. Failure catalog

| ID | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-01 | Postgres primary down | 503 on all writes | readyz failing, k8s restarts, DB monitor page | Fail over to replica promoted by Patroni; retry with exponential backoff; outbox unaffected |
| FM-02 | NATS JetStream unreachable | Events queue in outbox; downstream consumers drift | billing_outbox_pending climbs, billing_outbox_lag_seconds climbs | Outbox relay retries; consumers catch up after recovery; no data loss |
| FM-03 | tenant-service down | Licensing check fails; writes blocked | Timeouts on entitlement check | Use cached entitlement for up to 5 min; beyond that, fail closed for new tenants, open for known tenants |
| FM-04 | terminology-service down | Charge code validation fails | Timeouts | Fail closed on charge capture (400 with reason); allow draft charges; queue for validation on service return |
| FM-05 | fhir-gateway down | FHIR writes retry | Retry metric climbs | Async retry queue; publish internal event regardless of FHIR outcome |
| FM-06 | Payment gateway down | Card/mobile-money payment fails | 503 from adapter | Circuit breaker per tenant adapter; fallback to manual/cash method by cashier |
| FM-07 | Ledger integrity drift | Balance ≠ sum(entries) | Nightly integrity job | P1 page; freeze writes on affected account; run reconciliation script; create reversing adjustment |
| FM-08 | Duplicate charge capture | Double-charging patient | Dedup miss on CloudEvents id | Inbox unique-index prevents; if miss, automated reversal script + incident |
| FM-09 | Idempotency TTL misconfiguration | Long retries cause duplicate payments | Monitoring shows same-key duplicates | Enforce 24 h default TTL; alert on duplicate posting for same account |
| FM-10 | Refund approval stuck | Refund never processed | billing_refunds_requested_totalapproved > 50 for > 24 h | Ops review queue; auto-escalate to ops channel; clinician/supervisor paged |
| FM-11 | Statement run timeout | Statements not generated | Worker job metric | Retry partitioned; fail-open per account; report failed count in run detail |
| FM-12 | Clock skew across pods | Ordering / idempotency TTL drift | OTel detects time anomaly | NTP enforced; alert on skew > 500 ms |
| FM-13 | Currency misconfiguration for new tenant | Payments rejected | 400 spike on /payments | Onboarding checklist: currency default set; alert on tenant onboarding without currency |
| FM-14 | Price list lapse | Charge capture fails after effective_to | PRICE_NOT_FOUND rate spikes | Alert 7 days before effective_to expiry per list |
| FM-15 | RLS disabled by migration | Cross-tenant leak | Tenant-isolation integration test fails | Block release; revert migration; rotate JWT signing key |
| FM-16 | PDF renderer OOM on RTL + large statement | Statement fails | Worker restarts | Partition by account batch size; memory limit 512 MiB; fall back to basic PDF |
| FM-17 | Outbox relay behind by > 30 s | Downstream drift | Alert | Scale relay; inspect for poison message; move to DLQ |
| FM-18 | Poison event in inbox | Consumer retries forever | Consumer lag + logs | Dedicated billing.dlq.* subject after 5 redelivery attempts; quarantine and page ops |
| FM-19 | DB connection pool exhaustion | 503 | p95 latency climbs | Increase pool; open queries audited; pgbouncer in front |
| FM-20 | JWT signing key rotation without JWKS refresh | 401 across all routes | Auth failure rate spike | 10-min overlap window during rotation; preload JWKS in readiness probe |
| FM-21 | Cross-border data transfer violation | Compliance breach | Egress monitoring by region | Block egress beyond tenant residency allowlist at service mesh; alert ops + compliance |
| FM-22 | Data-residency drift during DR | PHI in wrong region | DR runbook gating | Region-tagged backups; DR only to region-compatible cluster |
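
The FM-03 mitigation (cached entitlement for up to 5 min, then fail closed for new tenants and open for known tenants) might be expressed as a small decision function. This is only a sketch: the names `CachedEntitlement`, `Decision`, and `decide_when_tenant_service_down` are hypothetical, and how a tenant counts as "known" is not stated in the catalog, so it is passed in as a plain flag here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

CACHE_MAX_AGE_SECONDS = 5 * 60  # "up to 5 min" from FM-03


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class CachedEntitlement:
    entitled: bool      # last answer received from tenant-service
    fetched_at: float   # epoch seconds when that answer was cached


def decide_when_tenant_service_down(
    cached: Optional[CachedEntitlement],
    now: float,
    known_tenant: bool,
) -> Decision:
    """Decide whether a write may proceed while tenant-service is unreachable.

    `known_tenant` is an assumption of this sketch: the caller decides what
    "known" means (for example, a tenant with prior billing activity).
    """
    if cached is not None and now - cached.fetched_at <= CACHE_MAX_AGE_SECONDS:
        # Fresh cache entry: reuse the last answer from tenant-service.
        return Decision.ALLOW if cached.entitled else Decision.DENY
    # No cache entry, or the entry is older than 5 min:
    # fail open for known tenants, fail closed for new ones.
    return Decision.ALLOW if known_tenant else Decision.DENY
```

Passing `now` in explicitly keeps the function free of clock reads, which makes the 5-minute boundary easy to test.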

2. Blast radius matrix

| Failure | Affects | Bypass |
|---|---|---|
| Postgres | All writes, most reads | Replica for read-only queries |
| NATS | Cross-service integration | Outbox stored; catch up on recovery |
| tenant-service | Licensing check | Cache |
| terminology | Charge capture only | Draft charges still accepted |
| fhir-gateway | External FHIR writes only | Internal ledger unaffected |
| payment gateway | Card/mobile payments | Cashier switches to cash |
| AI gateway | Assist features | Feature degrades to manual UX |
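
To keep a payment gateway outage contained to the affected tenant, FM-06 calls for a circuit breaker per tenant adapter with a fallback to manual/cash. A minimal sketch of that pattern follows. The class names, the failure threshold, the cool-down, and the `"gateway"` / `"manual_cash"` labels are all illustrative assumptions, not values from the service.

```python
import time
from typing import Callable, Dict, Optional


class CircuitBreaker:
    """Opens after consecutive failures; lets trial calls through after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,       # assumed value
        reset_after_s: float = 30.0,      # assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has elapsed, then let trial calls
        # through. A failure re-opens the breaker; a success closes it.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


class TenantBreakers:
    """One breaker per tenant, so one tenant's failing adapter does not trip others."""

    def __init__(self, **breaker_kwargs) -> None:
        self._kwargs = breaker_kwargs
        self._breakers: Dict[str, CircuitBreaker] = {}

    def for_tenant(self, tenant_id: str) -> CircuitBreaker:
        if tenant_id not in self._breakers:
            self._breakers[tenant_id] = CircuitBreaker(**self._kwargs)
        return self._breakers[tenant_id]


def payment_method(breakers: TenantBreakers, tenant_id: str) -> str:
    """Pick the gateway when the tenant's breaker allows it, else the cash fallback."""
    return "gateway" if breakers.for_tenant(tenant_id).allow_request() else "manual_cash"
```

The caller is expected to invoke `record_success` or `record_failure` after each adapter call. The injected `clock` exists only so the cool-down can be exercised in tests without sleeping.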

3. Recovery actions

| Action | When | Steps |
|---|---|---|
| Freeze writes | Ledger integrity mismatch | Roll maintenance_mode=true env; deployment drains; 503 on writes |
| Replay outbox | Lost event delivery | Query outbox by tenant+time; re-publish via ops tool |
| Reconcile ledger | Mismatch detected | Run billing reconcile --account <id> script; emit reversing adjustment |
| Rotate secrets | Gateway creds compromise | Vault rotate; propagate via k8s secret; restart pods |
| DR cutover | Region outage | Promote DR region; re-point Kong upstream; verify residency |
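
The ledger check behind "Reconcile ledger" (FM-07: balance ≠ sum(entries)) can be illustrated with a short sketch. It does not reproduce the actual `billing reconcile` script. The `Entry` type and both function names are hypothetical. The source does not say which side is authoritative, so the sketch assumes the adjustment is a new entry posted to bring the entries into line with the stored balance.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    account_id: str
    amount: Decimal   # signed: positive increases the balance (assumed convention)
    reason: str


def find_drift(stored_balance: Decimal, entries: List[Entry]) -> Decimal:
    """Return stored_balance - sum(entries); zero means the account is consistent."""
    return stored_balance - sum((e.amount for e in entries), Decimal("0"))


def reversing_adjustment(
    account_id: str, stored_balance: Decimal, entries: List[Entry]
) -> Optional[Entry]:
    """Build the entry that would cancel the drift, or None if there is none.

    Assumption: the adjustment is appended to the entries rather than editing
    history, so the ledger stays append-only.
    """
    drift = find_drift(stored_balance, entries)
    if drift == 0:
        return None
    return Entry(account_id, drift, "reconciliation adjustment")
```

Using `Decimal` rather than floats avoids rounding noise being reported as drift.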