OBSERVABILITY — billing-service
Conforms to the platform observability stack: OpenTelemetry traces + Cloud Trace, OpenTelemetry metrics + Cloud Monitoring, structured JSON logs to Cloud Logging, alerts in Cloud Monitoring with Slack #oncall-billing + PagerDuty routing.
1. Tracing
Every public entry point opens a root span:
| Span | Attributes |
|---|---|
| http.server | http.method, http.route, http.status_code, tenant.id, actor.id, idempotency.key, route.scope |
| pubsub.consume | pubsub.topic, pubsub.message.id, event.type, event.id, tenant.id, inbox.hit (bool) |
| cron.tick | cron.name, tenant.id?, period? |
Use cases open child spans billing.usecase.<name> with attributes:
- aggregate.id, aggregate.type, tenant.id, actor.id, idempotency.key, domain.error.code (only on failure), outbox.event.count.
- For folio close: folio.line_count, folio.payment_count, folio.refund_count, invoice.template, invoice.locale, pdf.bytes.
Storage spans annotate db.system=postgresql, db.namespace=tenant_<…>_billing, db.statement (parameterized only — never literals), db.operation. Pub/Sub publish spans annotate pubsub.topic, pubsub.message.id.
Trace sampling: 100% for failures and 5xx; 10% baseline for successful requests; 100% for hot-path use cases (CloseFolio, FinalizeCashSessionClose, GenerateInvoice).
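The sampling policy above could be expressed as a single decision function. The following is a minimal sketch, not the service's actual sampler: `SpanSummary`, `should_sample`, and the choice to derive the 10% baseline from a hash of the trace ID are assumptions made for illustration. Because the failure and 5xx rules depend on the outcome, a decision like this can only be made once the request has finished.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hot-path use cases named in the sampling policy.
HOT_PATH_USE_CASES = frozenset(
    {"CloseFolio", "FinalizeCashSessionClose", "GenerateInvoice"}
)
BASELINE_RATE = 0.10  # baseline for successful requests


@dataclass(frozen=True)
class SpanSummary:
    """Hypothetical summary of a finished root span."""
    trace_id: str
    use_case: Optional[str] = None
    http_status: Optional[int] = None
    failed: bool = False


def _trace_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_sample(span: SpanSummary) -> bool:
    """Return True if the trace should be kept under the policy above."""
    if span.failed:
        return True  # 100% for failures
    if span.http_status is not None and span.http_status >= 500:
        return True  # 100% for 5xx
    if span.use_case in HOT_PATH_USE_CASES:
        return True  # 100% for hot-path use cases
    # Deterministic 10% baseline: the same trace ID always gets the same answer.
    return _trace_fraction(span.trace_id) < BASELINE_RATE
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent if several components evaluate the same trace.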
2. Logs
Structured JSON. Mandatory fields:
time, level, service=billing-service, version=<git-sha>, traceId, spanId, tenantId, actorId?, route?, eventType?, message, outcome (ok|domain_error|infra_error).
Domain errors include errorCode=MELMASTOON.BILLING.… and a redacted details object. Never log: PAN, token strings, customer PII, raw request bodies, the Authorization header, the stepUpToken value, or precise amountMicro values (log amountClass: small|medium|large instead, derived from tenant currency thresholds).
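As an illustration of the amountClass rule, the sketch below builds one log line with the mandatory fields and replaces the precise amount with its class. The threshold values, the threshold key names, and the helper names `amount_class` and `build_log_line` are hypothetical; the real thresholds come from tenant currency configuration that this document does not specify.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical per-currency thresholds in micro-units; real values are tenant config.
EXAMPLE_THRESHOLDS = {"small_max_micro": 50_000_000, "medium_max_micro": 1_000_000_000}

ALLOWED_OUTCOMES = ("ok", "domain_error", "infra_error")


def amount_class(amount_micro: int, thresholds: dict) -> str:
    """Bucket an amount so the precise value never reaches the logs."""
    magnitude = abs(amount_micro)  # assumption: refunds are classed by magnitude
    if magnitude <= thresholds["small_max_micro"]:
        return "small"
    if magnitude <= thresholds["medium_max_micro"]:
        return "medium"
    return "large"


def build_log_line(*, level: str, message: str, outcome: str, version: str,
                   trace_id: str, span_id: str, tenant_id: str,
                   amount_micro: Optional[int] = None,
                   thresholds: Optional[dict] = None,
                   error_code: Optional[str] = None) -> str:
    """Serialize one structured log record with the mandatory fields."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": "billing-service",
        "version": version,
        "traceId": trace_id,
        "spanId": span_id,
        "tenantId": tenant_id,
        "message": message,
        "outcome": outcome,
    }
    if error_code is not None:
        record["errorCode"] = error_code
    if amount_micro is not None:
        # Only the class is logged; amount_micro itself is never written.
        record["amountClass"] = amount_class(amount_micro, thresholds or EXAMPLE_THRESHOLDS)
    return json.dumps(record)
```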
3. SLIs and SLOs
| SLI | Definition | SLO |
|---|---|---|
| Folio mutation latency | p95 of POST /folios/:id/{charges,payments,refunds} end-to-end | ≤ 350 ms (28-day rolling) |
| Folio mutation success ratio | 1 − (5xx_count / total) | ≥ 99.95% |
| Invoice generation latency | p95 of CloseFolio + GenerateInvoice (domain start → outbox commit) | ≤ 2 s |
| Invoice generation success | non-failure ratio (InvoiceRenderFailed excluded only when deferred path engaged successfully) | ≥ 99.9% |
| Cash drawer close latency | p95 of POST /cash-sessions/:id/close | ≤ 5 s |
| Cash drawer close success | non-failure ratio (excluding CASH_DRAWER_OFFLINE_CLOSE_FORBIDDEN user-error class) | ≥ 99.99% (very high — a failure here delays the entire shift handover) |
| Subscription cycle duration | wall-clock of monthly cycle worker per tenant | ≤ 30 s p95 |
| Subscription dunning correctness | (cycles where dunning state advanced exactly per spec) / (cycles where it should have) | = 100% (gated by daily integrity job, not by sampling) |
| Outbox lag | now() − min(_outbox.created_at WHERE published_at IS NULL) | p99 ≤ 30 s |
| Inbox dedupe coverage | % of consumed events with inbox check before processing | = 100% (assert at handler entry) |
| Tax computation correctness | CI-gated against fixture matrix; production hourly sample of 1% folios re-computed in shadow | = 100% (zero tolerance) |
Error budget burn and burn-rate alerts follow the Google SRE Workbook: fast burn (≥ 14× over 1 h) pages, slow burn (≥ 1× over 6 h) opens a ticket.
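Burn rate here means the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below shows that arithmetic and the page/ticket split described above; the function names and the idea of passing in pre-aggregated counts per window are assumptions for illustration, not the alerting implementation.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def classify_burn(rate_1h: float, rate_6h: float) -> str:
    """Apply the thresholds above: fast burn pages, slow burn opens a ticket."""
    if rate_1h >= 14.0:
        return "page"
    if rate_6h >= 1.0:
        return "ticket"
    return "none"
```

For the folio mutation success SLO of 99.95%, the error budget is 0.05%, so a 14× burn corresponds to an error ratio of about 0.7% over the hour.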
4. Metrics (RED + USE + business)
billing_http_requests_total{route, method, tenant_id, status}
billing_http_request_duration_seconds{route, method, tenant_id}_bucket
billing_pubsub_consumed_total{event_type, outcome, tenant_id}
billing_pubsub_consume_duration_seconds{event_type}_bucket
billing_outbox_lag_seconds{schema} # gauge
billing_outbox_unpublished_count{schema} # gauge
billing_usecase_duration_seconds{name, outcome}_bucket
billing_usecase_domain_error_total{name, error_code}
billing_invoice_generated_total{tenant_id, template, locale}
billing_invoice_render_duration_seconds{template}_bucket
billing_cash_drawer_close_total{tenant_id, outcome} # outcome ∈ ok|reconciliation_blocked|step_up_failed|offline_blocked
billing_cash_drawer_variance_micro{tenant_id, drawer_id}_bucket
billing_subscription_cycle_total{outcome} # outcome ∈ ok|hard_cap_suspended|payment_failed
billing_subscription_dunning_state_total{state} # gauge
billing_ai_call_total{capability, outcome}
billing_ai_call_duration_seconds{capability}_bucket
billing_ai_signals_emitted_total{capability, severity}
billing_tax_engine_duration_seconds_bucket
billing_tax_engine_rule_missing_total{tenant_id}
billing_reconciliation_mismatch_total{tenant_id, property_id}
USE indicators on the Cloud SQL instance: CPU, IO wait, replication lag (cross-region replica), connection saturation per pool, lock waits.
5. Dashboards (Grafana / Cloud Monitoring)
- billing-overview — request rate, error rate, p50/p95/p99 latencies, outbox lag, top tenants by request rate.
- billing-folio — open folios per tenant, charges/sec, payments/sec, refunds/sec, close success/fail, balance-due drop-off rate.
- billing-invoice — invoice generation throughput, render duration histogram, deferred-render queue depth, top templates / locales.
- billing-cash-drawer — open sessions count, close latency, variance distribution, discrepancy events, sessions blocked > 6 h count.
- billing-subscription — cycle progress (per region), dunning state distribution, payment failure rate, suspended count.
- billing-ai — AI call rate, latency, error rate per capability, signals emitted by severity, kill-switched tenants.
- billing-data — Cloud SQL CPU/IO/replication lag, top SQL by latency, lock waits, partition sizes for usage_records.
6. Alerts
| Alert | Condition | Severity | Routing |
|---|---|---|---|
| BillingHighErrorRate | error ratio ≥ 1% over 5 min | P1 | PagerDuty + #oncall-billing |
| BillingFolioMutationSloBurnFast | 14× burn over 1 h | P1 | PagerDuty |
| BillingInvoiceGenerationSlow | p95 > 4 s for 10 min | P2 | #oncall-billing |
| BillingCashDrawerCloseSlow | p95 > 8 s for 5 min | P1 | PagerDuty |
| BillingCashDiscrepancyRateHigh | > 5% of closes in 1 h carry reconciliation_blocked | P2 | #oncall-billing + finance ops |
| BillingOutboxLag | p99 > 60 s for 5 min | P1 | PagerDuty |
| BillingTaxRuleMissingSpike | > 50 events/min for any tenant | P2 | tenant-success channel |
| BillingReconciliationMismatch | any row with status='mismatch' | P1 | finance ops + #oncall-billing |
| BillingSubscriptionCycleStuck | cycle worker > 60 min for any tenant | P2 | platform ops |
| BillingAICircuitOpen | circuit open > 10 min for any capability | P3 | #observability |
| BillingShariaViolationAttempt | any MELMASTOON.BILLING.SHARIA_COMPLIANT_VIOLATION (config error or attack) | P2 | tenant admin + #oncall-billing |
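To make one ratio-based condition concrete, the sketch below evaluates BillingCashDiscrepancyRateHigh from per-outcome close counts over the last hour, such as increases of billing_cash_drawer_close_total grouped by outcome. The function name and the choice to count closes of every outcome in the denominator are assumptions; the table does not say which outcomes count as "closes".

```python
from typing import Mapping

DISCREPANCY_THRESHOLD = 0.05  # "> 5% of closes in 1 h"


def discrepancy_rate_high(close_counts_1h: Mapping[str, int]) -> bool:
    """True when more than 5% of closes in the window were reconciliation_blocked.

    close_counts_1h maps outcome label -> number of close attempts in the last hour.
    Assumption: every outcome counts toward the denominator.
    """
    total = sum(close_counts_1h.values())
    if total == 0:
        return False  # no closes in the window, nothing to alert on
    blocked = close_counts_1h.get("reconciliation_blocked", 0)
    return blocked / total > DISCREPANCY_THRESHOLD
```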
7. Audit / forensic queries
A read-only billing-readonly BigQuery dataset is sourced from a daily Cloud SQL → BigQuery export and supports forensic queries such as:
- per-tenant folio totals,
- variance trend per cashier,
- subscription dunning history,
- AI signal review outcomes vs. eventual fraud confirmations.
8. Synthetic monitoring
- Cloud Scheduler hits a synthetic-tenant instance every 60 s exercising: open folio → post charge → record cash payment → close → fetch invoice PDF. End-to-end latency feeds billing_synthetic_e2e_seconds.
- A second synthetic exercises cash drawer open → close (no two-staff requirement on the synthetic tenant; bypass mode flagged in audit).
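A synthetic check of this shape can be sketched as an ordered list of steps timed end to end. The runner below is a minimal illustration under assumptions: `run_synthetic_flow`, the result keys, and the step callables are hypothetical, and the real steps would call the billing API against the synthetic tenant.

```python
import time
from typing import Callable, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_synthetic_flow(steps: Sequence[Step],
                       clock: Callable[[], float] = time.monotonic) -> dict:
    """Run steps in order, stop at the first failure, and report elapsed seconds."""
    started = clock()
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any failing step fails the whole probe
            return {"ok": False, "failed_step": name,
                    "error": repr(exc), "seconds": clock() - started}
    return {"ok": True, "failed_step": None,
            "error": None, "seconds": clock() - started}


if __name__ == "__main__":
    # Placeholder steps; real ones would issue requests for the synthetic tenant.
    flow = [
        ("open_folio", lambda: None),
        ("post_charge", lambda: None),
        ("record_cash_payment", lambda: None),
        ("close_folio", lambda: None),
        ("fetch_invoice_pdf", lambda: None),
    ]
    print(run_synthetic_flow(flow))
```

The `seconds` value is what would be recorded as billing_synthetic_e2e_seconds; the injectable `clock` exists only to make the runner easy to test.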
9. Runbook entry points
- FAILURE_MODES is the on-call runbook root.
- Each alert above links to a section in FAILURE_MODES and an associated set of metrics queries.
- Rolling restart and drainer manual flush procedures are in DEPLOYMENT_TOPOLOGY §6.