OBSERVABILITY — billing-service

Conforms to the platform observability stack: OpenTelemetry traces + Cloud Trace, OpenTelemetry metrics + Cloud Monitoring, structured JSON logs to Cloud Logging, alerts in Cloud Monitoring with Slack #oncall-billing + PagerDuty routing.

1. Tracing

Every public entry point opens a root span:

| Span | Attributes |
|---|---|
| http.server | http.method, http.route, http.status_code, tenant.id, actor.id, idempotency.key, route.scope |
| pubsub.consume | pubsub.topic, pubsub.message.id, event.type, event.id, tenant.id, inbox.hit (bool) |
| cron.tick | cron.name, tenant.id?, period? |

Use cases open child spans billing.usecase.<name> with attributes:

  • aggregate.id, aggregate.type, tenant.id, actor.id, idempotency.key,
  • domain.error.code (only on failure),
  • outbox.event.count,
  • For folio close: folio.line_count, folio.payment_count, folio.refund_count, invoice.template, invoice.locale, pdf.bytes.

Storage spans annotate db.system=postgresql, db.namespace=tenant_<…>_billing, db.statement (parameterized only — never literals), db.operation. Pub/Sub publish spans annotate pubsub.topic, pubsub.message.id.

Trace sampling: 100% for failures and 5xx; 10% baseline for successful requests; 100% for hot-path use cases (CloseFolio, FinalizeCashSessionClose, GenerateInvoice).
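The sampling policy can be read as one keep/drop decision per finished trace. The following is a minimal sketch, assuming the decision is made after the outcome is known (tail-based, for example in a collector), since a failure cannot be known when the root span opens. The function name, its parameters, and the hash-based 10% baseline are illustrative assumptions, not the service's actual sampler.

```python
import hashlib
from typing import Optional

# Use cases the policy lists as always sampled.
HOT_PATH_USE_CASES = frozenset({"CloseFolio", "FinalizeCashSessionClose", "GenerateInvoice"})
BASELINE_RATE = 0.10  # 10% of successful, non-hot-path traces


def should_keep_trace(trace_id: str, failed: bool,
                      status_code: Optional[int] = None,
                      use_case: Optional[str] = None) -> bool:
    """Return True if the finished trace should be kept (hypothetical helper)."""
    # Failures and 5xx responses are always kept.
    if failed or (status_code is not None and status_code >= 500):
        return True
    # Hot-path use cases are always kept, even when they succeed.
    if use_case in HOT_PATH_USE_CASES:
        return True
    # Hash the trace ID into [0, 1) so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < BASELINE_RATE
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent if several components evaluate the same trace.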

2. Logs

Structured JSON. Mandatory fields:

  • time, level, service=billing-service, version=<git-sha>,
  • traceId, spanId,
  • tenantId, actorId?, route?, eventType?,
  • message, outcome (ok|domain_error|infra_error).

Domain errors include errorCode=MELMASTOON.BILLING.… and a redacted details object. Never log: PAN, token strings, customer PII, raw request bodies, the Authorization header, the stepUpToken value, or precise amountMicro values (log amountClass: small|medium|large instead, derived from tenant currency thresholds).
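A minimal sketch of a log line that carries the mandatory fields and applies the redaction rules. The threshold values, the deny-list keys, and the helper names are hypothetical. A key-based deny-list is only illustrative and will not catch arbitrary customer PII, so an allow-list of known-safe detail fields may be the safer design.

```python
import json
from datetime import datetime, timezone

# Hypothetical thresholds in micro-units: (upper bound of "small", upper bound of "medium").
AMOUNT_CLASS_THRESHOLDS = {"USD": (50_000_000, 1_000_000_000)}
# Illustrative deny-list; the real key names are an assumption.
FORBIDDEN_DETAIL_KEYS = {"pan", "token", "authorization", "stepUpToken", "requestBody"}
VALID_OUTCOMES = {"ok", "domain_error", "infra_error"}


def amount_class(amount_micro: int, currency: str) -> str:
    """Map a precise amount to a coarse class so the exact value is never logged."""
    small_max, medium_max = AMOUNT_CLASS_THRESHOLDS[currency]
    if amount_micro <= small_max:
        return "small"
    return "medium" if amount_micro <= medium_max else "large"


def redact_details(details: dict, currency: str) -> dict:
    """Drop forbidden keys and replace amountMicro with amountClass."""
    redacted = {}
    for key, value in details.items():
        if key == "amountMicro":
            redacted["amountClass"] = amount_class(value, currency)
        elif key not in FORBIDDEN_DETAIL_KEYS:
            redacted[key] = value
    return redacted


def build_log_line(*, level, message, outcome, trace_id, span_id, tenant_id,
                   version, error_code=None, details=None, currency="USD") -> str:
    """Serialize one structured log record with the mandatory fields."""
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": "billing-service",
        "version": version,
        "traceId": trace_id,
        "spanId": span_id,
        "tenantId": tenant_id,
        "message": message,
        "outcome": outcome,
    }
    if error_code is not None:
        record["errorCode"] = error_code
    if details is not None:
        record["details"] = redact_details(details, currency)
    return json.dumps(record)
```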

3. SLIs and SLOs

| SLI | Definition | SLO |
|---|---|---|
| Folio mutation latency | p95 of POST /folios/:id/{charges,payments,refunds} end-to-end | ≤ 350 ms (28-day rolling) |
| Folio mutation success ratio | 1 − (5xx_count / total) | ≥ 99.95% |
| Invoice generation latency | p95 of CloseFolio + GenerateInvoice (domain start → outbox commit) | ≤ 2 s |
| Invoice generation success | non-failure ratio (InvoiceRenderFailed excluded only when deferred path engaged successfully) | ≥ 99.9% |
| Cash drawer close latency | p95 of POST /cash-sessions/:id/close | ≤ 5 s |
| Cash drawer close success | non-failure ratio (excluding CASH_DRAWER_OFFLINE_CLOSE_FORBIDDEN user-error class) | ≥ 99.99% (very high; a failure here delays the entire shift handover) |
| Subscription cycle duration | wall-clock of monthly cycle worker per tenant | ≤ 30 s p95 |
| Subscription dunning correctness | (cycles where dunning state advanced exactly per spec) / (cycles where it should have) | = 100% (gated by daily integrity job, not by sampling) |
| Outbox lag | now() − min(_outbox.created_at WHERE published_at IS NULL) | p99 ≤ 30 s |
| Inbox dedupe coverage | % of consumed events with inbox check before processing | = 100% (assert at handler entry) |
| Tax computation correctness | CI-gated against fixture matrix; production hourly sample of 1% folios re-computed in shadow | = 100% (zero tolerance) |
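The outbox lag definition, now() − min(created_at) over unpublished rows, can be illustrated in memory. This is a sketch only: in the service the value would come from a query against the _outbox table, and reporting zero for an empty backlog is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple


def outbox_lag_seconds(rows: Iterable[Tuple[datetime, Optional[datetime]]],
                       now: datetime) -> float:
    """rows are (created_at, published_at); published_at is None while unpublished."""
    pending = [created_at for created_at, published_at in rows if published_at is None]
    if not pending:
        return 0.0  # assumption: an empty backlog reports zero lag
    # Lag is the age of the oldest row that has not been published yet.
    return (now - min(pending)).total_seconds()


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    (now - timedelta(seconds=45), None),                          # unpublished, 45 s old
    (now - timedelta(seconds=90), now - timedelta(seconds=80)),   # already published
    (now - timedelta(seconds=5), None),                           # unpublished, 5 s old
]
lag = outbox_lag_seconds(rows, now)  # 45.0: the oldest unpublished row is 45 s old
```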

Error-budget burn-rate alerts follow the Google SRE Workbook: fast burn (≥ 14× over 1 h) pages; slow burn (≥ 1× over 6 h) opens a ticket.
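Burn rate is the observed error ratio divided by the error budget (1 − SLO target). For the 99.95% folio mutation SLO the budget is 0.05%, so a 14× burn corresponds to an error ratio of about 0.7% over the window. A minimal sketch of the arithmetic and the page/ticket decision follows; the function names are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.0005 for a 99.95% SLO
    return (bad_events / total_events) / error_budget


def burn_action(rate_1h: float, rate_6h: float) -> str:
    """Apply the fast-burn (page) and slow-burn (ticket) thresholds."""
    if rate_1h >= 14.0:
        return "page"
    if rate_6h >= 1.0:
        return "ticket"
    return "none"
```

For example, 80 failed requests out of 10,000 in the last hour against a 99.95% target is a burn rate of roughly 16, which pages.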

4. Metrics (RED + USE + business)

billing_http_requests_total{route, method, tenant_id, status}
billing_http_request_duration_seconds_bucket{route, method, tenant_id}
billing_pubsub_consumed_total{event_type, outcome, tenant_id}
billing_pubsub_consume_duration_seconds_bucket{event_type}
billing_outbox_lag_seconds{schema} # gauge
billing_outbox_unpublished_count{schema} # gauge
billing_usecase_duration_seconds_bucket{name, outcome}
billing_usecase_domain_error_total{name, error_code}
billing_invoice_generated_total{tenant_id, template, locale}
billing_invoice_render_duration_seconds_bucket{template}
billing_cash_drawer_close_total{tenant_id, outcome} # outcome ∈ ok|reconciliation_blocked|step_up_failed|offline_blocked
billing_cash_drawer_variance_micro_bucket{tenant_id, drawer_id}
billing_subscription_cycle_total{outcome} # outcome ∈ ok|hard_cap_suspended|payment_failed
billing_subscription_dunning_state_total{state} # gauge
billing_ai_call_total{capability, outcome}
billing_ai_call_duration_seconds_bucket{capability}
billing_ai_signals_emitted_total{capability, severity}
billing_tax_engine_duration_seconds_bucket
billing_tax_engine_rule_missing_total{tenant_id}
billing_reconciliation_mismatch_total{tenant_id, property_id}
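The _bucket series are histogram buckets, and percentile figures such as p95 are estimated from them rather than measured exactly. The following is a minimal sketch of that estimate, assuming cumulative bucket counts and linear interpolation inside the bucket that contains the target rank (the approach Prometheus's histogram_quantile takes). The bucket bounds in the example are made up, and the monitoring backend's own percentile aggregation may differ in detail.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Cannot interpolate into +Inf; report the highest finite bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


# Illustrative latency buckets in seconds: 100 requests in total.
example = [(0.1, 50), (0.25, 90), (0.5, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, example)  # about 0.406 s, inside the 0.25-0.5 s bucket
```

The estimate can only be as precise as the bucket layout, so bucket bounds are worth placing close to the SLO thresholds (for example around 350 ms for folio mutations).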

USE indicators on the Cloud SQL instance: CPU, IO wait, replication lag (cross-region replica), connection saturation per pool, lock waits.

5. Dashboards (Grafana / Cloud Monitoring)

  1. billing-overview — request rate, error rate, p50/p95/p99 latencies, outbox lag, top tenants by request rate.
  2. billing-folio — open folios per tenant, charges/sec, payments/sec, refunds/sec, close success/fail, balance-due drop-off rate.
  3. billing-invoice — invoice generation throughput, render duration histogram, deferred-render queue depth, top templates / locales.
  4. billing-cash-drawer — open sessions count, close latency, variance distribution, discrepancy events, sessions blocked > 6 h count.
  5. billing-subscription — cycle progress (per region), dunning state distribution, payment failure rate, suspended count.
  6. billing-ai — AI call rate, latency, error rate per capability, signals emitted by severity, kill-switched tenants.
  7. billing-data — Cloud SQL CPU/IO/replication lag, top SQL by latency, lock waits, partition sizes for usage_records.

6. Alerts

| Alert | Condition | Severity | Routing |
|---|---|---|---|
| BillingHighErrorRate | error ratio ≥ 1% over 5 min | P1 | PagerDuty + #oncall-billing |
| BillingFolioMutationSloBurnFast | 14× burn over 1 h | P1 | PagerDuty |
| BillingInvoiceGenerationSlow | p95 > 4 s for 10 min | P2 | #oncall-billing |
| BillingCashDrawerCloseSlow | p95 > 8 s for 5 min | P1 | PagerDuty |
| BillingCashDiscrepancyRateHigh | > 5% of closes in 1 h carry reconciliation_blocked | P2 | #oncall-billing + finance ops |
| BillingOutboxLag | p99 > 60 s for 5 min | P1 | PagerDuty |
| BillingTaxRuleMissingSpike | > 50 events/min for any tenant | P2 | tenant-success channel |
| BillingReconciliationMismatch | any row with status='mismatch' | P1 | finance ops + #oncall-billing |
| BillingSubscriptionCycleStuck | cycle worker > 60 min for any tenant | P2 | platform ops |
| BillingAICircuitOpen | circuit open > 10 min for any capability | P3 | #observability |
| BillingShariaViolationAttempt | any MELMASTOON.BILLING.SHARIA_COMPLIANT_VIOLATION (config error or attack) | P2 | tenant admin + #oncall-billing |
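Several conditions have the form "value > threshold for N min". A minimal sketch of that semantics, assuming it means the threshold was exceeded on every evaluation across an unbroken run of at least N minutes; the alerting backend's own duration handling may differ, and the helper names are illustrative.

```python
from typing import List, Tuple


def breach_duration(samples: List[Tuple[float, float]], threshold: float) -> float:
    """Seconds the latest unbroken run of samples has stayed above the threshold.

    samples are (timestamp_seconds, value) pairs in ascending time order.
    """
    run_start = None
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp
        else:
            run_start = None  # any sample at or below the threshold resets the run
    if run_start is None:
        return 0.0
    return samples[-1][0] - run_start


def should_fire(samples: List[Tuple[float, float]], threshold: float,
                hold_seconds: float) -> bool:
    """True when the condition has held continuously for at least hold_seconds."""
    return breach_duration(samples, threshold) >= hold_seconds
```

For BillingCashDrawerCloseSlow this would be called with one p95 sample per evaluation interval, threshold=8.0 and hold_seconds=300.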

7. Audit / forensic queries

A read-only BigQuery dataset, billing-readonly, is populated by a daily Cloud SQL → BigQuery export and supports queries such as:

  • per-tenant folio totals,
  • variance trend per cashier,
  • subscription dunning history,
  • AI signal review outcomes vs. eventual fraud confirmations.

8. Synthetic monitoring

  • Cloud Scheduler hits a synthetic-tenant instance every 60 s exercising: open folio → post charge → record cash payment → close → fetch invoice PDF. End-to-end latency feeds billing_synthetic_e2e_seconds.
  • A second synthetic exercises cash drawer open → close (no two-staff requirement on the synthetic tenant; bypass mode flagged in audit).
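A minimal sketch of how a synthetic run could time the flow end to end and stop at the first failing step. The step callables stand in for the real API calls (open folio, post charge, and so on) and are hypothetical, as is the runner itself; the returned total is the value that would feed billing_synthetic_e2e_seconds.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_synthetic_flow(steps: List[Step],
                       clock: Callable[[], float] = time.monotonic
                       ) -> Tuple[float, Dict[str, float], Optional[str]]:
    """Run steps in order; return (total_seconds, per_step_seconds, failed_step)."""
    per_step: Dict[str, float] = {}
    start = clock()
    for name, step in steps:
        step_start = clock()
        try:
            step()
        except Exception:
            # Record the failing step's duration and abort the rest of the flow.
            per_step[name] = clock() - step_start
            return clock() - start, per_step, name
        per_step[name] = clock() - step_start
    return clock() - start, per_step, None


# Placeholder steps; a real probe would call the synthetic tenant's endpoints here.
flow: List[Step] = [
    ("open_folio", lambda: None),
    ("post_charge", lambda: None),
    ("record_cash_payment", lambda: None),
    ("close_folio", lambda: None),
    ("fetch_invoice_pdf", lambda: None),
]
total_seconds, step_seconds, failed_step = run_synthetic_flow(flow)
```

Recording per-step durations alongside the total makes it easier to see which stage of the flow regressed when the end-to-end number moves.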

9. Runbook entry points

  • FAILURE_MODES is the on-call runbook root.
  • Each alert above links to a section in FAILURE_MODES and to an associated set of metrics queries.
  • Rolling restart and drainer manual flush procedures are in DEPLOYMENT_TOPOLOGY §6.