Billing Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template
1. SLIs / SLOs
| SLI | SLO (monthly) | Error budget |
|---|---|---|
| Availability (2xx+3xx / total) | 99.9 % | 43.2 min |
| `POST /payments` latency p95 | ≤ 2 s | 1 % requests slower |
| `POST /charges` latency p95 | ≤ 1.5 s | 1 % |
| Statement run success rate | ≥ 99 % runs complete | 1 % |
| Outbox publish lag p95 | ≤ 10 s | 1 % events slower |
| Event consumer lag p95 | ≤ 30 s | 1 % |
| Ledger integrity (balance = sum(entries)) | 100 % | zero — alert immediately on mismatch |
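The availability error budget in the table follows directly from the SLO percentage. A minimal sketch of the arithmetic, assuming the "monthly" window is 30 days (the source does not state the window length; `errorBudgetMinutes` is a hypothetical helper name):

```typescript
// Convert an availability SLO into allowed downtime for a window.
// Assumption: a "monthly" window means 30 days.
function errorBudgetMinutes(sloPercent: number, windowDays: number = 30): number {
  const windowMinutes = windowDays * 24 * 60; // 43,200 minutes in 30 days
  const budgetFraction = (100 - sloPercent) / 100; // 0.001 for a 99.9 % SLO
  return windowMinutes * budgetFraction;
}

const budgetMinutes = errorBudgetMinutes(99.9);
console.log(budgetMinutes.toFixed(1)); // "43.2"
```

With a 31-day window the same SLO would allow roughly 44.6 minutes, so the window length is worth stating alongside the budget.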
2. Metrics (Prometheus via OpenTelemetry)
| Metric | Type | Labels |
|---|---|---|
| `billing_http_requests_total` | counter | route, method, status, tenant_id |
| `billing_http_request_duration_seconds` | histogram | route, method |
| `billing_charges_captured_total` | counter | tenant_id, facility_id, code_system |
| `billing_payments_posted_total` | counter | tenant_id, method, currency |
| `billing_payment_amount_minor_total` | counter | tenant_id, currency |
| `billing_refunds_requested_total` | counter | tenant_id, reason |
| `billing_refunds_approved_total` | counter | tenant_id |
| `billing_adjustments_applied_total` | counter | tenant_id, reason |
| `billing_invoices_issued_total` | counter | tenant_id |
| `billing_outbox_lag_seconds` | gauge | (none) |
| `billing_outbox_pending` | gauge | (none) |
| `billing_inbox_dedup_skipped_total` | counter | source |
| `billing_idempotency_replay_total` | counter | route |
| `billing_statement_run_duration_seconds` | histogram | facility_id |
| `billing_license_denied_total` | counter | tenant_id |
| `billing_ledger_integrity_mismatch_total` | counter | (none) |
| `billing_ai_calls_total` | counter | use_case, accepted |
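The p95 latency targets in section 1 would normally be read from histogram buckets such as those of `billing_http_request_duration_seconds`. The sketch below estimates a quantile from cumulative buckets by linear interpolation, similar in spirit to PromQL's `histogram_quantile`. The bucket boundaries and counts are made-up illustration data, and `estimateQuantile` is a hypothetical helper, not part of the service.

```typescript
// Cumulative histogram bucket: number of observations with value <= le.
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate quantile q (0..1) by interpolating linearly inside the bucket
// that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (!isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Hypothetical buckets for one route.
const exampleBuckets: Bucket[] = [
  { le: 0.5, count: 800 },
  { le: 1, count: 900 },
  { le: 2, count: 980 },
  { le: 5, count: 1000 },
  { le: Infinity, count: 1000 },
];
console.log(estimateQuantile(0.95, exampleBuckets)); // about 1.625 s
```

The estimate is only as precise as the bucket layout, so bucket bounds near the SLO thresholds (1.5 s and 2 s) would make the p95 readings more trustworthy.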
3. Traces
- Every request opens an OpenTelemetry span `http.request` with attributes `tenant.id`, `user.id`, `route`, `status`, `request.id`.
- Downstream calls to terminology-service, tenant-service, fhir-gateway, ai-gateway-service are child spans.
- Outbox publishes create a span `outbox.publish` correlated via CloudEvents `id`.
4. Logs
- Structured JSON via pino (`@ghasi/telemetry`). Mandatory fields: `ts`, `level`, `tenant_id`, `actor_id`, `request_id`, `trace_id`, `span_id`, `route`, `status`, `latency_ms`.
- Never log: raw monetary amounts with PII (PHI-financial combo), card data, secrets, full bodies of payment requests.
- Sensitive fields auto-redacted at logger level (`reference`, `external_ref`, `idempotency_key` suffix-redacted).
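The redaction rule above can be illustrated with a dependency-free sketch. It assumes that "suffix-redacted" means the leading characters stay visible and the rest is masked, that the rule applies to all three listed fields, and that four visible characters is an acceptable placeholder; none of these details are confirmed by the source. `redactSuffix` and `redactLogRecord` are hypothetical names. In the real logger this would more likely be wired through pino's `redact` option.

```typescript
// Fields the logging rules list as redacted at logger level.
const REDACTED_FIELDS = ["reference", "external_ref", "idempotency_key"];

// Assumption: keep the first `visible` characters and mask everything after them.
function redactSuffix(value: string, visible: number = 4): string {
  const head = value.slice(0, visible);
  const tail = value.slice(visible).replace(/[\s\S]/g, "*");
  // Short values are masked entirely so nothing leaks.
  return value.length <= visible ? value.replace(/[\s\S]/g, "*") : head + tail;
}

// Return a copy of the record with the sensitive string fields masked.
function redactLogRecord(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = { ...record };
  for (const field of REDACTED_FIELDS) {
    const value = out[field];
    if (typeof value === "string") out[field] = redactSuffix(value);
  }
  return out;
}

const redactedSample = redactLogRecord({
  route: "/payments",
  idempotency_key: "probe-1713340800",
});
console.log(redactedSample.idempotency_key); // "prob************"
```

Keeping a short prefix preserves enough of the key to correlate log lines during an incident without exposing the full value.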
5. Dashboards (Grafana)
| Dashboard | Panels |
|---|---|
| Billing Overview | Request rate, latency, error rate, tenant top-10, ledger integrity gauge |
| Revenue Capture | Charges captured per minute; amount per tenant/facility; code distribution |
| Payments | Payments per method, payment amount, idempotency replays, time-to-post |
| Invoices | Drafts vs issued vs paid; aging by facility |
| Statements | Runs queued / running / completed; failure count |
| Outbox / Events | Outbox lag, pending count, consumer lag per subject |
| AI Assist | Calls per use case, accept rate |
| Licensing | Denied requests per tenant; module activation |
6. Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| `BillingAvailability` | 5-min error rate > 1 % | P2 | rb-billing-availability |
| `PaymentLatencyBreach` | `POST /payments` p95 > 3 s for 10 min | P2 | rb-payment-latency |
| `OutboxLag` | `billing_outbox_lag_seconds` > 30 s | P1 | rb-outbox |
| `LedgerIntegrityMismatch` | any non-zero reading | P1 (page) | rb-ledger-integrity |
| `ConsumerLag` | consumer lag > 60 s for 5 min | P2 | rb-consumer-lag |
| `StatementRunFailure` | any run fails | P3 | rb-statement-run |
| `LicenseDeniedSpike` | `billing_license_denied_total` increases > 100 / min | P3 | rb-licensing |
| `RefundApprovalPending` | `billing_refunds_requested_total` - `billing_refunds_approved_total` > 50 for > 24 h | P3 | ops queue review |
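As an illustration of the `BillingAvailability` condition, the sketch below computes an error ratio from two snapshots of `billing_http_requests_total` taken five minutes apart. It follows the availability SLI in section 1, so only 2xx and 3xx responses count as good. The grouping by status class, the snapshot shape, and the function names are assumptions made for the example; in production the condition would be expressed as an alerting rule rather than application code.

```typescript
// Counter values grouped by status class at one instant, e.g. { "2xx": 1000, "5xx": 3 }.
type StatusCounts = Record<string, number>;

function totalRequests(counts: StatusCounts): number {
  let sum = 0;
  for (const key of Object.keys(counts)) sum += counts[key];
  return sum;
}

// Share of requests in the window that were neither 2xx nor 3xx.
function errorRatio(start: StatusCounts, end: StatusCounts): number {
  const delta = totalRequests(end) - totalRequests(start);
  if (delta <= 0) return 0; // no traffic or a counter reset: nothing to evaluate
  const goodStart = (start["2xx"] ?? 0) + (start["3xx"] ?? 0);
  const goodEnd = (end["2xx"] ?? 0) + (end["3xx"] ?? 0);
  return (delta - (goodEnd - goodStart)) / delta;
}

// Threshold of 1 % taken from the alert table.
function billingAvailabilityFiring(start: StatusCounts, end: StatusCounts, threshold: number = 0.01): boolean {
  return errorRatio(start, end) > threshold;
}

// Hypothetical snapshots five minutes apart: 1000 new requests, 30 of them not 2xx/3xx.
const windowStart: StatusCounts = { "2xx": 10000, "4xx": 50, "5xx": 10 };
const windowEnd: StatusCounts = { "2xx": 10950, "3xx": 20, "4xx": 65, "5xx": 25 };
console.log(billingAvailabilityFiring(windowStart, windowEnd)); // true
```

Counting 4xx responses as errors is a consequence of the SLI definition; if client errors should not page, the SLI and the alert would both need to exclude them.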
7. Runbook index
- `rb-billing-availability.md`: pod restarts, DB connectivity checks
- `rb-payment-latency.md`: gateway + DB health
- `rb-outbox.md`: relay lag triage, JetStream back-pressure
- `rb-ledger-integrity.md`: PAGE immediately; freeze writes, run reconciliation script
- `rb-consumer-lag.md`: NATS JetStream consumer health
- `rb-statement-run.md`: worker health, artifact storage
- `rb-licensing.md`: tenant-service entitlement lookup
8. Health endpoints
| Endpoint | Purpose |
|---|---|
| `GET /healthz` | Liveness (process alive) |
| `GET /readyz` | Readiness (DB, NATS, tenant-service reachable) |
| `GET /metrics` | Prometheus scrape |
9. Synthetic checks
- 1-min probe: POST a no-op charge to a shadow tenant (not billed).
- 5-min probe: attempt a dummy `POST /payments` with idempotency key `probe-<ts>`; assert replay behaviour.
- Nightly: ledger integrity batch (sum of entries vs materialised balance).
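The nightly ledger integrity batch compares the sum of entries with the materialised balance for each account. A minimal in-memory sketch of that comparison follows. The entry shape, the signed minor-unit amounts, and the name `findLedgerMismatches` are assumptions for illustration; the real batch would presumably run against the database and increment `billing_ledger_integrity_mismatch_total` for each mismatch found.

```typescript
// One ledger entry in minor currency units (for example cents).
// Assumption: credits are positive and debits are negative.
interface LedgerEntry {
  accountId: string;
  amountMinor: number;
}

// Return the ids of accounts whose materialised balance differs from the sum of their entries.
function findLedgerMismatches(entries: LedgerEntry[], balances: Record<string, number>): string[] {
  const sums: Record<string, number> = {};
  for (const entry of entries) {
    sums[entry.accountId] = (sums[entry.accountId] ?? 0) + entry.amountMinor;
  }
  const mismatched: string[] = [];
  const seen: Record<string, boolean> = {};
  // Check accounts that appear in either source, so a missing balance row is also caught.
  for (const id of Object.keys(balances).concat(Object.keys(sums))) {
    if (seen[id]) continue;
    seen[id] = true;
    if ((sums[id] ?? 0) !== (balances[id] ?? 0)) mismatched.push(id);
  }
  return mismatched;
}

const mismatches = findLedgerMismatches(
  [
    { accountId: "acct-1", amountMinor: 1500 },
    { accountId: "acct-1", amountMinor: -500 },
    { accountId: "acct-2", amountMinor: 200 },
  ],
  { "acct-1": 1000, "acct-2": 250 },
);
console.log(mismatches); // ["acct-2"]
```

Working in integer minor units keeps the comparison exact; floating-point major units could report spurious mismatches.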