FAILURE_MODES — billing-service
On-call runbook root. Each scenario lists detection, immediate mitigation, root-cause investigation, and recovery. Severities mirror OBSERVABILITY §6. Dial: PagerDuty billing-on-call, Slack #oncall-billing.
1. Cash drawer reconciliation failure
Symptoms: BillingCashDiscrepancyRateHigh alert; multiple cash_drawer.discrepancy_found.v1 events for the same property in a short window; daily_cash_reconciliations.status='mismatch' rows accumulating.
Detect: dashboard billing-cash-drawer → variance distribution by drawer + cashier; BillingReconciliationMismatch Cloud Monitoring alert.
Immediate mitigation:
- Confirm whether the affected property is single-tenant (likely cashier issue) or systemic (likely software issue).
- If systemic across tenants: check for a recent billing-api revision; roll back if the regression is in CashDrawerSession.expectedClosingFloat() or RecordCashPaymentUseCase.
- If single-tenant: contact the property's GM through notification-service escalation; freeze new sessions on the affected drawer by toggling cash_drawers.active=false (audited).
Root cause:
- Schema-mismatch from a partial migration (use-case writes a payment without joining a session), revealed by the audit chain.
- Time-zone bug: business date computation off-by-one across midnight; fixture in test/fixtures/timezones.spec.ts should be extended.
- Currency rounding: a multi-currency cash payment rounding to the wrong minor unit.
- Insider abuse: the AI cash pattern detector should already have flagged it.
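The time-zone root cause above usually comes down to deriving the business date from the UTC calendar day instead of the property's local day. A minimal sketch of the local-day derivation follows; the helper name businessDate and the Asia/Karachi zone in the usage note are illustrative assumptions, not names from the billing-service codebase.

```typescript
// Hypothetical helper (not from the codebase): derive the business date
// (YYYY-MM-DD) of an instant in the property's IANA time zone.
function businessDate(instant: Date, timeZone: string): string {
  // formatToParts avoids depending on any locale's date ordering.
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).formatToParts(instant);
  const get = (type: string): string =>
    parts.find((p) => p.type === type)?.value ?? "";
  return `${get("year")}-${get("month")}-${get("day")}`;
}

// 20:30 UTC is already 01:30 the next day at UTC+5.
const lateEvening = new Date("2024-03-01T20:30:00Z");
console.log(businessDate(lateEvening, "UTC"));          // 2024-03-01
console.log(businessDate(lateEvening, "Asia/Karachi")); // 2024-03-02
```

A payment recorded at that instant belongs to different reconciliation days depending on which zone is used, which is the kind of case the fixture should cover.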
Recovery:
- Create the corrective entries via supervisor override POST /folios/:id/refunds (or POST /credit-notes/...).
- Update daily_cash_reconciliations.status='resolved' (single SQL admin task; audited).
- Postmortem with finance ops within 5 business days.
2. Multi-currency rounding edge cases
Symptoms: folio.balance_due.v1 fires on a folio whose payments appear, to the operator, to equal the charges; Settlement.residual_micro non-zero on closed folios; complaints from front desk.
Detect: dashboard panel Folio close residual distribution; alert BillingFolioCloseResidualNonZeroSpike.
Immediate mitigation:
- Identify the affected (currency_pair, FXSnapshot.source) combo.
- If the FX snapshot is stale or wrong (e.g., pricing-service published bad rates), pause new folio opens for the tenant via tenant.settings.billing.eagerOpen=false and switch to deferred until rates are corrected (the deferred mode pulls a fresh snapshot at check-in).
- For affected folios, the recommended close path is to record a small adjustment charge (positive or negative) to absorb the residual, rather than block the guest at checkout.
Root cause:
- Banker's rounding misapplied: Money.mulFraction integer division truncates; we add a half-up step in the conversion. Verify FXSnapshot.fromBase against the property test.
- Stale rate snapshot: the snapshot pinned at confirm is the contractual rate; if the guest expects a different rate, that is a tenant policy decision (always honor the snapshot in code; surface the diff to the guest in the invoice if tenant.settings.billing.showFxAtClose=true).
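To make the truncation issue concrete, the sketch below contrasts truncating integer division with a half-up step on integer minor units. Both functions are hypothetical stand-ins written for this runbook; the real Money.mulFraction signature is not shown in this document and may differ.

```typescript
// Hypothetical stand-in: amount * num / den, truncated toward zero.
function mulFractionTruncating(amountMinor: number, num: number, den: number): number {
  return Math.trunc((amountMinor * num) / den);
}

// Hypothetical stand-in: same product, rounded half-up (half away from zero
// for negative amounts), using only integer quotient and remainder.
function mulFractionHalfUp(amountMinor: number, num: number, den: number): number {
  if (!Number.isInteger(amountMinor) || !Number.isInteger(num) || !Number.isInteger(den) || den <= 0) {
    throw new RangeError("amountMinor and num must be integers; den a positive integer");
  }
  const product = amountMinor * num;
  if (!Number.isSafeInteger(product)) {
    throw new RangeError("product exceeds the safe integer range");
  }
  const abs = Math.abs(product);
  const quotient = Math.floor(abs / den);
  const remainder = abs % den;
  // Round up when the remainder is at least half of the denominator.
  const rounded = remainder * 2 >= den ? quotient + 1 : quotient;
  if (rounded === 0) return 0; // avoid returning -0
  return product < 0 ? -rounded : rounded;
}

console.log(mulFractionTruncating(1000, 2, 3)); // 666
console.log(mulFractionHalfUp(1000, 2, 3));     // 667
```

A one-minor-unit difference like this, repeated across conversions, is enough to leave a non-zero residual on a folio that looks settled to the operator.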
Recovery:
- If a code bug, ship a forward fix and a back-fill SQL that recomputes residuals on impacted folios; do not mutate historical invoices; issue corrective credit notes instead.
3. Dunning failure (subscription invoice unpaid past hard cap)
Symptoms: BillingSubscriptionCycleStuck alert; subscription_invoices.status='failed' and next_retry_at in the past for many tenants.
Detect: dashboard billing-subscription → dunning state distribution; alert burns the SLO once correctness drops below 100%.
Immediate mitigation:
- Inspect payment-gateway-service health; if the processor is down, pause the cycle worker via gcloud scheduler jobs pause billing-subscription-cycle.
- If the failure is concentrated on one processor adapter (e.g., mfs-jazzcash), drain that path via processor failover in payment-gateway-service.
- Communicate with affected tenants via notification-service outbound; do not silently auto-suspend during a known platform incident.
Root cause:
- Processor outage: confirm via the gateway's status; track via incident.
- Wrong paymentMethodToken on file (tenant changed cards but didn't update the token in tenant-service): trigger update reminder.
- Bug in dunning state machine: walk subscription.recordPaymentFailure against the failing fixture.
Recovery:
- Once processor recovers, manually re-run cycle for affected tenants: POST /api/v1/subscriptions/:tenantId/cycle?force=false.
- For tenants auto-suspended during the incident, run POST /api/v1/subscriptions/:tenantId/reactivate with reason='platform_incident_<id>'.
4. Outbox drainer lag
Symptoms: BillingOutboxLag alert; _outbox.published_at IS NULL count growing.
Immediate mitigation:
- Scale billing-outbox-drainer to max replicas.
- Check Pub/Sub publish errors metric; if topic IAM is broken (e.g., a key rotation missed), repair via Workload Identity grant.
- If a single tenant schema is responsible (huge spike), shard the drainer pool to isolate the noisy neighbor.
Recovery:
- Drainer is at-least-once; consumers idempotently dedupe via inbox keys. After backlog drains, verify no DLQ growth on consumer side.
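The at-least-once guarantee means a consumer may see the same event more than once while the backlog drains. The sketch below illustrates inbox-key dedupe in memory; the OutboxEvent shape and the InboxDeduper class are assumptions for illustration, and a real consumer would persist the inbox key in the same transaction as its side effects.

```typescript
// Assumed event envelope for this sketch; the real schema may differ.
type OutboxEvent = { eventId: string; type: string; payload: unknown };

// Hypothetical in-memory inbox: remembers which event IDs were handled.
class InboxDeduper {
  private readonly seen = new Set<string>();

  // Returns true if the handler ran, false if the event was a duplicate.
  handle(event: OutboxEvent, handler: (e: OutboxEvent) => void): boolean {
    if (this.seen.has(event.eventId)) return false;
    handler(event);
    // Record the key only after the handler succeeds, so a failed
    // delivery can still be retried.
    this.seen.add(event.eventId);
    return true;
  }
}

const inbox = new InboxDeduper();
let applied = 0;
const evt: OutboxEvent = { eventId: "evt-1", type: "folio.balance_due.v1", payload: {} };
inbox.handle(evt, () => { applied += 1; });
inbox.handle(evt, () => { applied += 1; }); // redelivery is ignored
console.log(applied); // 1
```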
5. Invoice render failed
Symptoms: BillingInvoiceGenerationSlow alert; MELMASTOON.BILLING.INVOICE_RENDER_FAILED errors; invoices.pdf_uri IS NULL rows accumulating.
Immediate mitigation:
- The folio is already closed; the deferred-render queue picks it up automatically. Surface progress via the dashboard panel "Deferred invoices in queue".
- If the failure is template-specific (e.g., new RTL template), revert the template version in tenant.settings.billing.invoiceTemplates.
Recovery:
- Manually retrigger POST /api/v1/invoices/:id/regenerate (admin endpoint) for affected invoices once the renderer is healthy.
6. Tax rule missing spike
Symptoms: BillingTaxRuleMissingSpike; charges blocked at post time; front desk friction.
Immediate mitigation:
- Identify the tenant + jurisdiction; coordinate with tenant-success to update tenant-service.taxRules.
- Temporary mitigation: with tenant approval, enable tenant.settings.billing.allowUntaxed=true for the day (audited; reverts automatically at midnight).
Recovery:
- After the fact, and only if the tenant requests it, run a back-fill that emits an IssueCreditNote + PostCharge(tax) pair against impacted folios to apply the correct tax once the rule is in place; otherwise, the untaxed period remains as the audit shows.
7. Cross-tenant data exposure (worst-case)
Detect: an audit-service alert (audit.cross_tenant_access) or a customer report.
Immediate mitigation:
- Page P1 to security on-call.
- Capture the request trace ID and immediately rotate the offending pool's credentials.
- Snapshot the impacted tenant schemas for forensics.
Recovery:
- Layered defense should have prevented row-level exposure; verify which layers tripped (RLS, application guard) and which silently passed.
- Communicate to affected tenants per 07 Security §13.
- Public postmortem.
8. Cash drawer "stuck" in pending_close
Symptoms: cash_drawer_sessions.status='pending_close' for > 6 h; new session blocked; cashier complaints.
Immediate mitigation:
- Verify the desktop is online and the co-signer is available.
- If the desktop is offline-only (long blackout), advise the property to retry close once connectivity returns; the local action is queued.
- As a last resort, a supervisor on the cloud-only console can POST /api/v1/cash-sessions/:id/close directly with a fresh step-up token (assuming both staff are present and authenticated).
9. AI orchestrator hard down
Symptoms: BillingAICircuitOpen alert.
Mitigation: circuits open; signals stop flowing. The desktop hides AI panels gracefully. No billing operation degrades. Once recovered, the next mutation per folio re-evaluates within 5 s; the historical gap is documented in the postmortem and back-filled by the nightly job.
10. Cloud SQL primary failover
Mitigation: Cloud SQL HA flips automatically. Connection pools reconnect within 30 s. Any in-flight transaction is rolled back; idempotent retries succeed. Outbox drainer resumes; outbox guarantees no event loss because publish-after-commit semantics hold.
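The claim that idempotent retries succeed after a rolled-back transaction can be illustrated with a small synchronous sketch. IdempotentLedger and withRetry are hypothetical names invented here; a real client would also back off between attempts while pools reconnect, and would persist the idempotency key alongside the write.

```typescript
// Hypothetical in-memory ledger keyed by idempotency key.
class IdempotentLedger {
  private readonly applied = new Map<string, number>();
  total = 0;

  // Applies the amount once per key and returns the resulting total;
  // a repeated key returns the stored result without re-applying.
  apply(idempotencyKey: string, amountMinor: number): number {
    const prior = this.applied.get(idempotencyKey);
    if (prior !== undefined) return prior;
    this.total += amountMinor;
    this.applied.set(idempotencyKey, this.total);
    return this.total;
  }
}

// Retries a synchronous operation up to maxAttempts times.
function withRetry<T>(op: () => T, maxAttempts: number): T {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be >= 1");
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return op();
    } catch (err) {
      lastError = err; // e.g. connection reset during failover
    }
  }
  throw lastError;
}

const ledger = new IdempotentLedger();
let calls = 0;
const result = withRetry(() => {
  calls += 1;
  // Simulate the first attempt being rolled back by the failover.
  if (calls === 1) throw new Error("connection reset");
  return ledger.apply("pay-1", 500);
}, 3);
console.log(result, ledger.total); // 500 500
```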
11. Schema migration rollback
A bad migration on a few tenants:
- Pause billing-tenant-migrator.
- Apply backward-compatible migration revert per tenant via the same job (the migration framework supports down).
- Roll the API revision back if its code requires the new schema.
12. Pub/Sub DLQ accumulation
Symptoms: any <topic>.dlq size > 0.
Mitigation:
- Read DLQ messages; identify the failing handler.
- If the failure is a poison message (malformed payload), manually drop after persisting evidence.
- If the failure is a transient handler bug, fix forward, then replay DLQ via gcloud pubsub subscriptions pull --ack script with idempotent processing (inbox dedupe protects).
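The poison-versus-transient split above can be sketched as a triage pass over pulled DLQ messages. The DlqMessage shape and the well-formedness rule (valid JSON object with a string eventId) are assumptions for illustration; the actual envelope contract lives with the event schemas.

```typescript
// Assumed shape of a pulled DLQ message for this sketch.
type DlqMessage = { messageId: string; data: string };

// Assumed rule: a well-formed payload is a JSON object with a string eventId.
function isWellFormed(data: string): boolean {
  try {
    const parsed: unknown = JSON.parse(data);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { eventId?: unknown }).eventId === "string"
    );
  } catch {
    return false; // not valid JSON at all
  }
}

// Splits messages into poison (persist evidence, then drop) and
// replayable (replay after the handler fix ships).
function triageDlq(messages: readonly DlqMessage[]): { poison: DlqMessage[]; replayable: DlqMessage[] } {
  const poison: DlqMessage[] = [];
  const replayable: DlqMessage[] = [];
  for (const m of messages) {
    (isWellFormed(m.data) ? replayable : poison).push(m);
  }
  return { poison, replayable };
}
```

Replayable messages still rely on inbox dedupe downstream, since a replay can overlap with a redelivery of the same event.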
13. SQLCipher local DB corruption (desktop)
Mitigation: the desktop detects corruption on boot, wipes the local DB, and re-pulls. Any in-flight local_outbox rows that were corrupted are lost; the cloud is the source of truth and any cash-session-close that hadn't reached cloud is irrecoverable. The cashier must re-count and re-initiate.
14. Secret leak
If a database credential leaks:
- Rotate immediately in Secret Manager.
- Force pool reconnect by rolling Cloud Run revisions.
- Audit recent SQL via pg_audit extracts.
15. Index hot-spot on folio_charges
A property with rapid POS posting can hot-spot the (folio_id) index. Mitigation: posting rates above 50 items/s should be written as multi-row INSERT … VALUES batches of 25 rows; the application layer auto-batches above the threshold. If the index is still hot, partition folio_charges by (tenant_id, hash(folio_id)); this is a reserved tactic.
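The batching rule can be sketched as a plain chunking step ahead of the insert. The chunk helper and the BATCH_SIZE constant are hypothetical names; only the batch size of 25 comes from this runbook, and how the application layer measures the 50 items/s threshold is not shown here.

```typescript
// Batch size taken from the mitigation above; the constant name is illustrative.
const BATCH_SIZE = 25;

// Splits items into consecutive groups of at most `size` elements,
// one group per multi-row INSERT.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// 60 pending charges become three inserts of 25, 25, and 10 rows.
const pending = Array.from({ length: 60 }, (_, i) => ({ lineNo: i + 1 }));
console.log(chunk(pending, BATCH_SIZE).map((b) => b.length)); // [ 25, 25, 10 ]
```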
16. Step-up token replay
If a step-up token is presented twice, the second use returns MELMASTOON.IAM.STEP_UP_REPLAY from iam-service; we treat it as a security event and log it to audit. No state mutation occurs on the second attempt.
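The single-use behaviour can be pictured with the sketch below. StepUpTokenRegistry is a hypothetical in-memory model written for this runbook, not the iam-service implementation; only the error code is taken from the text above.

```typescript
// Hypothetical model of single-use step-up tokens.
class StepUpTokenRegistry {
  private readonly consumed = new Set<string>();

  // The first presentation consumes the token; any later presentation is
  // rejected with the replay code and changes no state.
  consume(tokenId: string): { ok: boolean; code?: string } {
    if (this.consumed.has(tokenId)) {
      return { ok: false, code: "MELMASTOON.IAM.STEP_UP_REPLAY" };
    }
    this.consumed.add(tokenId);
    return { ok: true };
  }
}

const registry = new StepUpTokenRegistry();
console.log(registry.consume("tok-1")); // { ok: true }
console.log(registry.consume("tok-1")); // { ok: false, code: 'MELMASTOON.IAM.STEP_UP_REPLAY' }
```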
17. Disaster recovery drill
Quarterly drill: restore Cloud SQL into a clean project from PITR at now() - 1h; verify RPO; verify all per-tenant schemas restored; smoke-test invoice generation against restored data.
18. Cross-references
- Alerts: OBSERVABILITY §6.
- Use cases that emit each error: APPLICATION_LOGIC §15.
- Security incidents: SECURITY_MODEL §10–§11.
- Migration & migrator failures: MIGRATION_PLAN.