FAILURE_MODES — billing-service

On-call runbook root. Each scenario lists detection, immediate mitigation, root-cause investigation, and recovery. Severities mirror OBSERVABILITY §6. Dial: PagerDuty billing-on-call, Slack #oncall-billing.

1. Cash drawer reconciliation failure

Symptoms: BillingCashDiscrepancyRateHigh alert; multiple cash_drawer.discrepancy_found.v1 events for the same property in a short window; daily_cash_reconciliations.status='mismatch' rows accumulating.

Detect: dashboard billing-cash-drawer → variance distribution by drawer + cashier; BillingReconciliationMismatch Cloud Monitoring alert.

Immediate mitigation:

  1. Confirm whether the discrepancy is confined to a single tenant's property (likely a cashier issue) or systemic across tenants (likely a software issue).
  2. If systemic across tenants: check for a recent billing-api revision; roll back if the regression is in CashDrawerSession.expectedClosingFloat() or RecordCashPaymentUseCase.
  3. If single-tenant: contact the property's GM through notification-service escalation; freeze new sessions on the affected drawer by toggling cash_drawers.active=false (audited).
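The single-tenant versus systemic decision in step 1 can be sketched as a grouping over recent discrepancy events. This is a minimal illustration only: the DiscrepancyEvent shape and the classifyDiscrepancyScope helper are hypothetical and not part of billing-service; the real cash_drawer.discrepancy_found.v1 payload may carry different fields.

```typescript
// Hypothetical, simplified view of a cash_drawer.discrepancy_found.v1 event.
// Field names are assumptions made for this sketch.
interface DiscrepancyEvent {
  tenantId: string;
  propertyId: string;
  drawerId: string;
}

type DiscrepancyScope = "none" | "single-tenant" | "systemic";

// Classifies a window of events by how many distinct tenants are affected.
function classifyDiscrepancyScope(events: DiscrepancyEvent[]): DiscrepancyScope {
  const tenants = new Set(events.map((e) => e.tenantId));
  if (tenants.size === 0) return "none";
  return tenants.size === 1 ? "single-tenant" : "systemic";
}
```

A "systemic" result points toward step 2 (check the recent billing-api revision), while "single-tenant" points toward step 3.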

Root cause:

  • Schema-mismatch from a partial migration (use-case writes a payment without joining a session), revealed by the audit chain.
  • Time-zone bug: business date computation off-by-one across midnight; fixture in test/fixtures/timezones.spec.ts should be extended.
  • Currency rounding: a multi-currency cash payment rounding to the wrong minor unit.
  • Insider abuse: the AI cash pattern detector should already have flagged it.
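To make the time-zone root cause concrete, the sketch below derives a business date from an instant and the property's IANA time zone, which is where an off-by-one across midnight typically appears. The businessDate function is a hypothetical helper, not the service's actual implementation, and the Asia/Karachi zone in the usage note is only an illustrative choice.

```typescript
// Returns the calendar date (YYYY-MM-DD) that `instant` falls on in `timeZone`.
// Uses formatToParts so the result does not depend on locale-specific ordering.
function businessDate(instant: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).formatToParts(instant);

  const get = (type: string): string => {
    const part = parts.find((p) => p.type === type);
    if (!part) throw new Error(`missing date part: ${type}`);
    return part.value;
  };

  return `${get("year")}-${get("month")}-${get("day")}`;
}
```

For example, an instant of 2024-03-10T20:30:00Z is still 2024-03-10 in UTC but already 2024-03-11 in Asia/Karachi (UTC+5), so a reconciliation that buckets by the UTC date would assign the payment to the wrong business day.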

Recovery:

  1. Create the corrective entries via supervisor override POST /folios/:id/refunds (or POST /credit-notes/...).
  2. Update daily_cash_reconciliations.status='resolved' (single SQL admin task — audited).
  3. Postmortem with finance ops within 5 business days.

2. Multi-currency rounding edge cases

Symptoms: folio.balance_due.v1 fires on a folio whose payments appear, to the operator, to equal the charges; Settlement.residual_micro non-zero on closed folios; complaints from front desk.

Detect: dashboard panel Folio close residual distribution; alert BillingFolioCloseResidualNonZeroSpike.

Immediate mitigation:

  1. Identify the affected (currency_pair, FXSnapshot.source) combo.
  2. If the FX snapshot is stale or wrong (e.g., pricing-service published bad rates), pause new folio opens for the tenant via tenant.settings.billing.eagerOpen=false and switch to deferred until rates are corrected (the deferred mode pulls a fresh snapshot at check-in).
  3. For affected folios, the recommended close path is to record a small adjustment charge (positive or negative) to absorb the residual, rather than block the guest at checkout.

Root cause:

  • Banker's rounding misapplied: Money.mulFraction integer division truncates; we add a half-up step in the conversion. Verify FXSnapshot.fromBase against the property test.
  • Stale rate snapshot: the snapshot pinned at confirm is the contractual rate; if the guest expects a different rate, that is a tenant policy decision (always honor the snapshot in code; surface the diff to the guest in the invoice if tenant.settings.billing.showFxAtClose=true).
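The difference between truncating integer division and a half-up step, as described in the first bullet, can be shown with a small sketch. Both functions below are hypothetical stand-ins and not the real Money.mulFraction or FXSnapshot.fromBase; the sketch also assumes that ties round away from zero for negative amounts, which the source does not specify.

```typescript
// Validates inputs: safe integers only, strictly positive denominator.
function checkFractionInputs(amountMinor: number, num: number, den: number): void {
  const product = amountMinor * num;
  if (![amountMinor, num, den, product].every(Number.isSafeInteger)) {
    throw new RangeError("inputs and product must be safe integers");
  }
  if (den <= 0) throw new RangeError("denominator must be positive");
}

// amountMinor * num / den, truncated toward zero (plain integer division).
function mulFractionTruncating(amountMinor: number, num: number, den: number): number {
  checkFractionInputs(amountMinor, num, den);
  const product = amountMinor * num;
  return (product - (product % den)) / den;
}

// Same computation with a half-up step: ties round away from zero (assumption).
function mulFractionHalfUp(amountMinor: number, num: number, den: number): number {
  checkFractionInputs(amountMinor, num, den);
  const product = amountMinor * num;
  const remainder = product % den; // carries the sign of product
  const truncated = (product - remainder) / den;
  return Math.abs(remainder) * 2 >= den ? truncated + Math.sign(product) : truncated;
}
```

With an amount of 1005 minor units and a fraction of 1/10, truncation yields 100 while the half-up step yields 101; a residual of one minor unit per conversion is the kind of discrepancy that surfaces as a non-zero Settlement.residual_micro.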

Recovery:

  1. If the cause is a code bug, ship a forward fix and a back-fill SQL that recomputes residuals on impacted folios; do not mutate historical invoices, but issue corrective credit notes instead.

3. Dunning failure (subscription invoice unpaid past hard cap)

Symptoms: BillingSubscriptionCycleStuck alert; subscription_invoices.status='failed' and next_retry_at in the past for many tenants.
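The "stuck" condition in the symptoms can be expressed as a simple predicate over invoice rows. The sketch below is illustrative only: the SubscriptionInvoice shape is a hypothetical in-memory projection of subscription_invoices, and the status values other than 'failed' are assumptions.

```typescript
// Hypothetical projection of a subscription_invoices row.
interface SubscriptionInvoice {
  tenantId: string;
  status: "paid" | "failed" | "pending"; // only 'failed' is named in the runbook
  nextRetryAt: Date | null;
}

// An invoice is stuck when it has failed and its scheduled retry is already overdue.
function findStuckInvoices(invoices: SubscriptionInvoice[], now: Date): SubscriptionInvoice[] {
  return invoices.filter(
    (inv) =>
      inv.status === "failed" &&
      inv.nextRetryAt !== null &&
      inv.nextRetryAt.getTime() < now.getTime(),
  );
}
```

Counting distinct tenantId values in the result gives a rough measure of how many tenants are affected.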

Detect: dashboard billing-subscription → dunning state distribution; alert burns the SLO once correctness drops below 100%.

Immediate mitigation:

  1. Inspect payment-gateway-service health; if the processor is down, pause the cycle worker via gcloud scheduler jobs pause billing-subscription-cycle.
  2. If the failure is concentrated on one processor adapter (e.g., mfs-jazzcash), drain that path via processor failover in payment-gateway-service.
  3. Communicate with affected tenants via notification-service outbound — do not silently auto-suspend during a known platform incident.

Root cause:

  • Processor outage: confirm via the gateway's status; track via incident.
  • Wrong paymentMethodToken on file (tenant changed cards but didn't update the token in tenant-service): trigger update reminder.
  • Bug in dunning state machine: walk subscription.recordPaymentFailure against the failing fixture.

Recovery:

  1. Once processor recovers, manually re-run cycle for affected tenants: POST /api/v1/subscriptions/:tenantId/cycle?force=false.
  2. For tenants auto-suspended during the incident, run POST /api/v1/subscriptions/:tenantId/reactivate with reason='platform_incident_<id>'.

4. Outbox drainer lag

Symptoms: BillingOutboxLag alert; _outbox.published_at IS NULL count growing.

Immediate mitigation:

  1. Scale billing-outbox-drainer to max replicas.
  2. Check Pub/Sub publish errors metric; if topic IAM is broken (e.g., a key rotation missed), repair via Workload Identity grant.
  3. If a single tenant schema is responsible (huge spike), shard the drainer pool to isolate the noisy neighbor.

Recovery:

  • Drainer is at-least-once; consumers idempotently dedupe via inbox keys. After backlog drains, verify no DLQ growth on consumer side.
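Because the drainer is at-least-once, a consumer may see the same message more than once while the backlog drains. The sketch below shows the inbox-key dedupe idea in its simplest form. The OutboxMessage shape and the InboxDedupingConsumer class are hypothetical, and the in-memory Set stands in for a durable inbox table; a real consumer would record the key in the same transaction as the handler's side effects.

```typescript
// Hypothetical message shape; `id` plays the role of the inbox key.
interface OutboxMessage {
  id: string;
  payload: string;
}

class InboxDedupingConsumer {
  private readonly seen = new Set<string>();
  private readonly handler: (message: OutboxMessage) => void;

  constructor(handler: (message: OutboxMessage) => void) {
    this.handler = handler;
  }

  // Returns true if the message was processed, false if it was a duplicate.
  receive(message: OutboxMessage): boolean {
    if (this.seen.has(message.id)) return false;
    this.handler(message); // if this throws, the key is not recorded and a redelivery retries
    this.seen.add(message.id);
    return true;
  }
}
```

Redelivering a message with the same id after a successful first delivery is a no-op, which is why scaling the drainer or replaying a backlog does not double-apply events.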

5. Invoice render failed

Symptoms: BillingInvoiceGenerationSlow alert; MELMASTOON.BILLING.INVOICE_RENDER_FAILED errors; invoices.pdf_uri IS NULL rows accumulating.

Immediate mitigation:

  1. The folio is already closed; the deferred-render queue picks it up automatically. Surface progress via the dashboard panel "Deferred invoices in queue".
  2. If the failure is template-specific (e.g., new RTL template), revert the template version in tenant.settings.billing.invoiceTemplates.

Recovery:

  1. Manually retrigger POST /api/v1/invoices/:id/regenerate (admin endpoint) for affected invoices once the renderer is healthy.

6. Tax rule missing spike

Symptoms: BillingTaxRuleMissingSpike; charges blocked at post time; front desk friction.

Immediate mitigation:

  1. Identify the tenant + jurisdiction; coordinate with tenant-success to update tenant-service.taxRules.
  2. Temporary mitigation: with tenant approval, enable tenant.settings.billing.allowUntaxed=true for the day (audited; reverts automatically at midnight).

Recovery:

  1. After the fact, run a back-fill that emits an IssueCreditNote + PostCharge(tax) pair against impacted folios to apply the correct tax once the rule is in place. Do this only if the tenant requests it; otherwise, the untaxed period stands as the audit shows.

7. Cross-tenant data exposure (worst-case)

Detect: an audit-service alert (audit.cross_tenant_access) or a customer report.

Immediate mitigation:

  1. Page P1 to security on-call.
  2. Capture the request trace ID and immediately rotate the offending pool's credentials.
  3. Snapshot the impacted tenant schemas for forensics.

Recovery:

  1. Layered defense should have prevented row-level exposure; verify which layers tripped (RLS, application guard) and which silently passed.
  2. Communicate to affected tenants per 07 Security §13.
  3. Public postmortem.

8. Cash drawer "stuck" in pending_close

Symptoms: cash_drawer_sessions.status='pending_close' for > 6 h; new session blocked; cashier complaints.

Immediate mitigation:

  1. Verify the desktop is online and the co-signer is available.
  2. If the desktop is offline-only (long blackout), advise the property to retry close once connectivity returns; the local action is queued.
  3. As a last resort, a supervisor on the cloud-only console can POST /api/v1/cash-sessions/:id/close directly with a fresh step-up token (assuming both staff are present and authenticated).

9. AI orchestrator hard down

Symptoms: BillingAICircuitOpen alert.

Mitigation: circuits open; signals stop flowing. The desktop hides AI panels gracefully. No billing operation degrades. Once recovered, the next mutation per folio re-evaluates within 5 s; the historical gap is documented in the postmortem and back-filled by the nightly job.

10. Cloud SQL primary failover

Mitigation: Cloud SQL HA flips automatically. Connection pools reconnect within 30 s. Any in-flight transaction is rolled back; idempotent retries succeed. Outbox drainer resumes; outbox guarantees no event loss because publish-after-commit semantics hold.

11. Schema migration rollback

If a bad migration has been applied to a few tenants:

  1. Pause billing-tenant-migrator.
  2. Apply backward-compatible migration revert per tenant via the same job (the migration framework supports down).
  3. Roll the API revision back if its code requires the new schema.

12. Pub/Sub DLQ accumulation

Symptoms: any <topic>.dlq size > 0.

Mitigation:

  1. Read DLQ messages; identify the failing handler.
  2. If the failure is a poison message (malformed payload), manually drop after persisting evidence.
  3. If the failure is a transient handler bug, fix forward, then replay the DLQ with a script built on gcloud pubsub subscriptions pull --auto-ack, relying on idempotent processing (inbox dedupe protects against duplicates).
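The poison-versus-replayable decision in steps 2 and 3 can be sketched as a first-pass check on the raw payload. This is only an illustration: triageDlqMessage is a hypothetical helper, and it assumes that a well-formed payload is a JSON object, which the source does not state. A message that parses cleanly may still fail for handler-specific reasons.

```typescript
type DlqVerdict = "poison" | "replayable";

// Treats anything that is not a JSON object as malformed (poison).
function triageDlqMessage(rawData: string): DlqVerdict {
  try {
    const parsed: unknown = JSON.parse(rawData);
    const isObject =
      typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
    return isObject ? "replayable" : "poison";
  } catch {
    return "poison"; // not valid JSON at all
  }
}
```

Messages classified as "poison" should have their evidence persisted before being dropped, as step 2 requires.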

13. SQLCipher local DB corruption (desktop)

Mitigation: the desktop detects corruption on boot, wipes the local DB, and re-pulls. Any in-flight local_outbox rows that were corrupted are lost; the cloud is the source of truth and any cash-session-close that hadn't reached cloud is irrecoverable. The cashier must re-count and re-initiate.

14. Secret leak

If a database credential leaks:

  1. Rotate immediately in Secret Manager.
  2. Force pool reconnect by rolling Cloud Run revisions.
  3. Audit recent SQL activity using pgAudit log extracts.

15. Index hot-spot on folio_charges

A property with rapid POS posting can hot-spot the (folio_id) index. Mitigation: posting rates above 50 items/s should be batched into multi-row INSERT … VALUES statements of 25 rows; the application layer auto-batches above the threshold. If the index is still hot, partition folio_charges by (tenant_id, hash(folio_id)); this is a reserved tactic.
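The auto-batching rule can be sketched as a chunking step gated on the observed posting rate. The constants mirror the numbers in the paragraph above (threshold of 50 items/s, batches of 25); the function names and the idea of passing the observed rate as a parameter are assumptions made for this sketch, not the application layer's actual interface.

```typescript
const BATCH_SIZE = 25;
const RATE_THRESHOLD_PER_SEC = 50;

// Splits items into consecutive groups of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Each inner array corresponds to one INSERT statement: single-row inserts at or
// below the threshold, multi-row inserts of up to BATCH_SIZE rows above it.
function planInserts<T>(items: T[], observedRatePerSec: number): T[][] {
  if (observedRatePerSec <= RATE_THRESHOLD_PER_SEC) {
    return items.map((item) => [item]);
  }
  return chunk(items, BATCH_SIZE);
}
```

For 60 pending charges at 80 items/s this plans three statements of 25, 25, and 10 rows instead of 60 single-row inserts.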

16. Step-up token replay

If a step-up token is presented twice, the second use returns MELMASTOON.IAM.STEP_UP_REPLAY from iam-service; we treat it as a security event and log it to audit. No state mutation occurs on the second attempt.
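The single-use behavior can be illustrated with a small in-memory sketch. The StepUpTokenRegistry class and its methods are hypothetical and do not represent the iam-service implementation; only the error code comes from the text above. A real service would track redeemed tokens in durable storage and write to the audit service rather than to an array.

```typescript
const STEP_UP_REPLAY = "MELMASTOON.IAM.STEP_UP_REPLAY";

type RedeemResult = { ok: true } | { ok: false; code: string };

class StepUpTokenRegistry {
  private readonly used = new Set<string>();
  private readonly auditLog: string[] = [];

  // Runs `mutate` only on the first presentation of `tokenId`.
  redeem(tokenId: string, mutate: () => void): RedeemResult {
    if (this.used.has(tokenId)) {
      // Replay: record a security event and skip the mutation entirely.
      this.auditLog.push(`replay:${tokenId}`);
      return { ok: false, code: STEP_UP_REPLAY };
    }
    this.used.add(tokenId);
    mutate();
    return { ok: true };
  }

  auditEntries(): string[] {
    return this.auditLog.slice();
  }
}
```

The token is marked as used before the mutation runs, so even a failed mutation consumes the token; whether that matches the production behavior is not stated in the source.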

17. Disaster recovery drill

Quarterly drill: restore Cloud SQL into a clean project from PITR at now() - 1h; verify RPO; verify all per-tenant schemas restored; smoke-test invoice generation against restored data.

18. Cross-references