Billing Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template

1. SLIs / SLOs

| SLI | SLO (monthly) | Error budget |
|---|---|---|
| Availability (2xx+3xx / total) | 99.9 % | 43.2 min |
| POST /payments latency p95 | ≤ 2 s | 1 % requests slower |
| POST /charges latency p95 | ≤ 1.5 s | 1 % |
| Statement run success rate | ≥ 99 % runs complete | 1 % |
| Outbox publish lag p95 | ≤ 10 s | 1 % events slower |
| Event consumer lag p95 | ≤ 30 s | 1 % |
| Ledger integrity (balance = sum(entries)) | 100 % | zero — alert immediately on mismatch |
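
The 43.2 min availability budget follows directly from the SLO target and the window length. A minimal sketch of that arithmetic, assuming a 30-day month (the window that reproduces the figure in the table); the function names are illustrative and not part of the service:

```typescript
// Error budget for a time-based availability SLO.
// Assumption: a "month" is a 30-day window, which reproduces the 43.2 min figure.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget > 1) {
    throw new RangeError("sloTarget must be in (0, 1]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Fraction of the budget already spent, given the observed bad minutes.
// A 100 % target (such as ledger integrity) has no budget: any bad minute
// is reported as Infinity, zero bad minutes as 0.
function budgetConsumed(badMinutes: number, sloTarget: number, windowDays: number): number {
  const budget = errorBudgetMinutes(sloTarget, windowDays);
  if (budget === 0) {
    return badMinutes > 0 ? Infinity : 0;
  }
  return badMinutes / budget;
}
```

For example, `errorBudgetMinutes(0.999, 30)` evaluates to roughly 43.2, and 21.6 bad minutes in that window would mean about half the budget is gone.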

2. Metrics (Prometheus via OpenTelemetry)

| Metric | Type | Labels |
|---|---|---|
| billing_http_requests_total | counter | route, method, status, tenant_id |
| billing_http_request_duration_seconds | histogram | route, method |
| billing_charges_captured_total | counter | tenant_id, facility_id, code_system |
| billing_payments_posted_total | counter | tenant_id, method, currency |
| billing_payment_amount_minor_total | counter | tenant_id, currency |
| billing_refunds_requested_total | counter | tenant_id, reason |
| billing_refunds_approved_total | counter | tenant_id |
| billing_adjustments_applied_total | counter | tenant_id, reason |
| billing_invoices_issued_total | counter | tenant_id |
| billing_outbox_lag_seconds | gauge | |
| billing_outbox_pending | gauge | |
| billing_inbox_dedup_skipped_total | counter | source |
| billing_idempotency_replay_total | counter | route |
| billing_statement_run_duration_seconds | histogram | facility_id |
| billing_license_denied_total | counter | tenant_id |
| billing_ledger_integrity_mismatch_total | counter | |
| billing_ai_calls_total | counter | use_case, accepted |
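
To make the name/label layout concrete, the sketch below renders a single sample the way it would appear in a Prometheus text-format scrape. It is dependency-free and purely illustrative: in the service itself this rendering would be done by the OpenTelemetry/Prometheus exporter, and `renderSample` and `escapeLabelValue` are hypothetical helper names.

```typescript
// Render one sample in the Prometheus text exposition format:
//   metric_name{label="value",...} <number>
type Labels = Record<string, string>;

function escapeLabelValue(value: string): string {
  // The text format requires escaping backslash, double quote and newline.
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderSample(name: string, labels: Labels, value: number): string {
  const pairs = Object.entries(labels)
    // Sorted only to make the output deterministic; the format does not require it.
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`);
  const labelBlock = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
  return `${name}${labelBlock} ${value}`;
}
```

A label-free gauge such as `billing_outbox_pending` renders as just the name and value, while `billing_http_requests_total` carries all four labels from the table on every sample.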

3. Traces

  • Every request opens an OpenTelemetry span http.request with attributes tenant.id, user.id, route, status, request.id.
  • Downstream calls to terminology-service, tenant-service, fhir-gateway, ai-gateway-service are child spans.
  • Outbox publishes create a span outbox.publish correlated via CloudEvents id.
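
A rough sketch of how the `outbox.publish` span could carry the CloudEvents id. The envelope shows only the required CloudEvents 1.0 attributes plus a payload. The attribute keys `cloudevents.event_id` and `cloudevents.event_type`, and the helper name `outboxSpanAttributes`, are assumptions for illustration; the actual keys used by the service are not stated here.

```typescript
// Minimal CloudEvents 1.0 envelope: required attributes plus a payload.
interface CloudEvent<T> {
  specversion: "1.0";
  id: string;
  source: string;
  type: string;
  data: T;
}

// Attribute bag for the outbox.publish span.
// Assumption: the key names below are illustrative, chosen so that a trace
// search by event id finds the publish span.
function outboxSpanAttributes(event: CloudEvent<unknown>, tenantId: string): Record<string, string> {
  return {
    "tenant.id": tenantId,
    "cloudevents.event_id": event.id,
    "cloudevents.event_type": event.type,
  };
}
```

With the event id on the span, a consumer that logs or traces the same id can be joined back to the publish that produced it.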

4. Logs

  • Structured JSON via pino (@ghasi/telemetry). Mandatory fields: ts, level, tenant_id, actor_id, request_id, trace_id, span_id, route, status, latency_ms.
  • Never log: raw monetary amounts with PII (PHI-financial combo), card data, secrets, full bodies of payment requests.
  • Sensitive fields auto-redacted at logger level (reference, external_ref, idempotency_key suffix-redacted).
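
The field rules above can be sketched as a small pre-logging step. This is not the @ghasi/telemetry implementation. It assumes that "suffix-redacted" means the leading characters stay visible and the remainder is masked, and the helper names and the `keep` default of 4 are hypothetical.

```typescript
// Mandatory fields, as listed above.
const MANDATORY_FIELDS = [
  "ts", "level", "tenant_id", "actor_id", "request_id",
  "trace_id", "span_id", "route", "status", "latency_ms",
] as const;

// Fields named above as auto-redacted.
const REDACTED_FIELDS = new Set(["reference", "external_ref", "idempotency_key"]);

// Assumption: keep the first `keep` characters and mask the rest.
// Values no longer than `keep` are masked entirely.
function redactSuffix(value: string, keep = 4): string {
  if (value.length <= keep) {
    return "*".repeat(value.length);
  }
  return value.slice(0, keep) + "*".repeat(value.length - keep);
}

// Reject records that lack a mandatory field, then mask sensitive string fields.
function prepareLogRecord(record: Record<string, unknown>): Record<string, unknown> {
  const missing = MANDATORY_FIELDS.filter((field) => !(field in record));
  if (missing.length > 0) {
    throw new Error(`missing log fields: ${missing.join(", ")}`);
  }
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] = REDACTED_FIELDS.has(key) && typeof value === "string" ? redactSuffix(value) : value;
  }
  return out;
}
```

Doing the masking in one place, before the record reaches the logger, means individual call sites cannot forget it.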

5. Dashboards (Grafana)

| Dashboard | Panels |
|---|---|
| Billing Overview | Request rate, latency, error rate, tenant top-10, ledger integrity gauge |
| Revenue Capture | Charges captured per minute; amount per tenant/facility; code distribution |
| Payments | Payments per method, payment amount, idempotency replays, time-to-post |
| Invoices | Drafts vs issued vs paid; aging by facility |
| Statements | Runs queued / running / completed; failure count |
| Outbox / Events | Outbox lag, pending count, consumer lag per subject |
| AI Assist | Calls per use case, accept rate |
| Licensing | Denied requests per tenant; module activation |

6. Alerts

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| BillingAvailability | 5-min error rate > 1 % | P2 | rb-billing-availability |
| PaymentLatencyBreach | POST /payments p95 > 3 s for 10 min | P2 | rb-payment-latency |
| OutboxLag | billing_outbox_lag_seconds > 30 s | P1 | rb-outbox |
| LedgerIntegrityMismatch | any non-zero reading | P1 — page | rb-ledger-integrity |
| ConsumerLag | consumer lag > 60 s for 5 min | P2 | rb-consumer-lag |
| StatementRunFailure | any run fails | P3 | rb-statement-run |
| LicenseDeniedSpike | billing_license_denied_total increases by > 100 / min | P3 | rb-licensing |
| RefundApprovalPending | billing_refunds_requested_total − billing_refunds_approved_total > 50 for > 24 h | P3 | ops queue review |
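
Several conditions above include a hold time (for example, p95 > 3 s for 10 min). The sketch below shows one way to evaluate "threshold exceeded continuously for a duration" over a series of samples. It is a simplified model for reasoning about the rules, not the alerting engine's implementation; `Sample` and `breachedFor` are illustrative names.

```typescript
interface Sample {
  tsSeconds: number; // sample timestamp, in seconds
  value: number;     // e.g. p95 latency in seconds
}

// Returns true when `value > threshold` has held for every sample from some
// point up to the latest sample, and that stretch spans at least `holdSeconds`.
// Any sample at or below the threshold resets the clock.
function breachedFor(samples: Sample[], threshold: number, holdSeconds: number): boolean {
  const ordered = samples.slice().sort((a, b) => a.tsSeconds - b.tsSeconds);
  let breachStart: number | null = null;
  let latest: number | null = null;
  for (const sample of ordered) {
    latest = sample.tsSeconds;
    if (sample.value > threshold) {
      if (breachStart === null) {
        breachStart = sample.tsSeconds;
      }
    } else {
      breachStart = null;
    }
  }
  return breachStart !== null && latest !== null && latest - breachStart >= holdSeconds;
}
```

For PaymentLatencyBreach this would be called with `threshold = 3` and `holdSeconds = 600`. A single healthy sample in the middle of the window resets the hold timer, which is what keeps short spikes from triggering the alert.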

7. Runbook index

  • rb-billing-availability.md — pod restarts, DB connectivity checks
  • rb-payment-latency.md — gateway + DB health
  • rb-outbox.md — relay lag triage, JetStream back-pressure
  • rb-ledger-integrity.md — PAGE immediately; freeze writes, run reconciliation script
  • rb-consumer-lag.md — NATS JetStream consumer health
  • rb-statement-run.md — worker health, artifact storage
  • rb-licensing.md — tenant-service entitlement lookup

8. Health endpoints

| Endpoint | Purpose |
|---|---|
| GET /healthz | Liveness (process alive) |
| GET /readyz | Readiness (DB, NATS, tenant-service reachable) |
| GET /metrics | Prometheus scrape |
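
Readiness depends on three downstream checks, so the `/readyz` decision can be modelled as a pure aggregation over their results. The sketch below assumes the conventional 200 (ready) / 503 (not ready) status codes, which are not specified here; the type and function names are hypothetical.

```typescript
type DependencyName = "db" | "nats" | "tenant-service";

interface ReadinessResult {
  httpStatus: 200 | 503; // assumption: conventional ready / not-ready codes
  failing: DependencyName[];
}

// Ready only when every dependency check passed; otherwise report which failed.
function evaluateReadiness(checks: Record<DependencyName, boolean>): ReadinessResult {
  const failing = (Object.keys(checks) as DependencyName[]).filter((name) => !checks[name]);
  return { httpStatus: failing.length === 0 ? 200 : 503, failing };
}
```

Returning the list of failing dependencies alongside the status makes the probe response useful during triage. `/healthz` would deliberately not run these checks, so that a dependency outage does not cause the process itself to be restarted.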

9. Synthetic checks

  • 1-min probe: POST a no-op charge to a shadow tenant (not billed).
  • 5-min probe: attempt a dummy POST /payments with idempotency key probe-<ts>; assert replay behaviour.
  • Nightly: ledger integrity batch (sum of entries vs materialised balance).
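
The nightly check compares each materialised balance with the sum of its ledger entries. A minimal sketch of that comparison, assuming amounts are held as integer minor units (as the `billing_payment_amount_minor_total` metric name suggests) so that equality is exact; the `Account` shape and function name are illustrative only.

```typescript
interface Account {
  id: string;
  materialisedBalanceMinor: number; // stored balance, integer minor units
  entriesMinor: number[];           // signed ledger entries, integer minor units
}

// Returns the ids of accounts whose stored balance differs from sum(entries).
// Integer minor units avoid floating-point rounding in the comparison.
function findLedgerMismatches(accounts: Account[]): string[] {
  return accounts
    .filter((account) => {
      const sum = account.entriesMinor.reduce((total, entry) => total + entry, 0);
      return sum !== account.materialisedBalanceMinor;
    })
    .map((account) => account.id);
}
```

Any non-empty result corresponds to the zero-budget ledger integrity SLI and would be expected to increment `billing_ledger_integrity_mismatch_total`.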