Billing Service — Observability
Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability
1. SLIs / SLOs
| SLI | SLO | Window |
|---|---|---|
| Event ingestion lag (NATS consumer lag) | P95 ≤ 30 s | 30 d |
| Pricing resolution latency (cache hit) | P95 ≤ 5 ms | 30 d |
| Pricing resolution latency (cache miss) | P95 ≤ 50 ms | 30 d |
| Invoice generation success rate | ≥ 99% of accounts invoiced within 1 h of cron start | monthly |
| Usage query API latency | P95 ≤ 300 ms | 30 d |
| Admin API availability | ≥ 99.5% | 30 d |
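
To make the availability target concrete, the sketch below converts an SLO into the downtime it allows over its window. It assumes availability is measured by time rather than by request count, and the function name is illustrative, not part of the service.

```typescript
// Illustrative helper (hypothetical name): converts an availability SLO
// into the downtime it allows over a window, assuming time-based availability.
function errorBudgetMinutes(sloRatio: number, windowDays: number): number {
  if (sloRatio <= 0 || sloRatio > 1) {
    throw new RangeError("sloRatio must be in (0, 1]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloRatio) * windowMinutes;
}

// Admin API availability: 99.5% over 30 d leaves roughly 216 minutes (3.6 hours).
const adminApiBudgetMinutes = errorBudgetMinutes(0.995, 30);
```

If availability is instead counted per request, the same ratio applies to the request total rather than to minutes.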
2. Metrics
billing_events_ingested_total{result="ok|duplicate|pricing_not_found|error"}
billing_events_ingestion_latency_seconds_bucket
billing_pricing_resolve_total{source="cache|db"}
billing_pricing_resolve_latency_seconds_bucket{source=...}
billing_invoice_cron_total{result="ok|error"}
billing_invoice_cron_accounts_total{result="ok|error"}
billing_invoice_cron_duration_seconds
billing_nats_consumer_lag (gauge)
billing_pg_errors_total{op=...}
billing_redis_errors_total{op=...}
billing_negative_margin_total
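
The P95 targets in section 1 are read from the `_bucket` histogram series above. As a rough illustration of how a quantile is estimated from cumulative buckets, the sketch below uses linear interpolation inside the bucket that contains the target rank, a simplified version of the approach Prometheus takes in `histogram_quantile`. The type and function names and the bucket boundaries in the example are assumptions, not the service's actual configuration.

```typescript
// One cumulative histogram bucket, as exposed by a *_bucket series:
// `count` is the number of observations <= `le` (seconds).
interface CumulativeBucket {
  le: number; // upper bound; use Infinity for the +Inf bucket
  count: number;
}

// Hypothetical helper: estimates quantile q (0..1) by linear interpolation.
function estimateQuantile(q: number, buckets: CumulativeBucket[]): number {
  if (q < 0 || q > 1) throw new RangeError("q must be in [0, 1]");
  const sorted = buckets
    .slice()
    .sort((a, b) => (a.le < b.le ? -1 : a.le > b.le ? 1 : 0));
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count < rank) continue;
    const prevLe = i > 0 ? sorted[i - 1].le : 0;
    const prevCount = i > 0 ? sorted[i - 1].count : 0;
    // Rank falls in the +Inf bucket: fall back to the highest finite bound.
    if (!isFinite(sorted[i].le)) return i > 0 ? prevLe : NaN;
    const inBucket = sorted[i].count - prevCount;
    if (inBucket === 0) return sorted[i].le;
    return prevLe + (sorted[i].le - prevLe) * ((rank - prevCount) / inBucket);
  }
  return NaN;
}

// Example with made-up bucket bounds (seconds) for cache-hit pricing resolution.
const exampleP95Seconds = estimateQuantile(0.95, [
  { le: 0.005, count: 90 },
  { le: 0.01, count: 98 },
  { le: 0.05, count: 100 },
  { le: Infinity, count: 100 },
]);
```

With these made-up counts the estimate lands between 5 ms and 10 ms, which would breach the 5 ms cache-hit target. The estimate is only as precise as the bucket layout, so bucket bounds close to each SLO threshold matter.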
3. Traces
OpenTelemetry spans (trace context is propagated from the traceparent header on billing.events NATS messages):
- billing.ingest_event (root from NATS)
  - billing.resolve_pricing
  - billing.resolve_cost
  - billing.pg.insert_billing_event
  - billing.pg.upsert_usage_summary
- billing.invoice_cron (root from scheduler)
  - billing.aggregate_usage{accountId=...}
  - billing.render_pdf
  - billing.s3.put
  - billing.pg.insert_invoice
  - billing.nats.publish_invoice_generated
Attributes: billing.account_id, billing.tenant_id, billing.operator_id, billing.message_id, billing.invoice_id.
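
For reference, the traceparent header follows the W3C Trace Context format `version-traceid-parentid-flags`. The sketch below shows how a consumer could validate and split a version `00` header before starting the billing.ingest_event span. It is a minimal illustration with hypothetical names. In practice the OpenTelemetry SDK's propagator would normally do this, and other header versions are simply rejected here.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  parentSpanId: string; // 16 lowercase hex chars
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Only version 00 is accepted in this sketch.
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null when the header is missing or malformed,
// in which case the consumer would start a new root trace.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (header === undefined) return null;
  const match = TRACEPARENT_V00.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const parentSpanId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return {
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```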
4. Logs (Pino → Loki)
Fields: level, ts, service=billing-service, messageId, accountId, tenantId, stage, durationMs, traceId.
Billing logs contain no MSISDN or SMS body; neither field is present in the billing events schema.
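
As an illustration of the field list above, a log line could be typed as follows. Only the field names come from this document. The types, the optional markers, the ISO 8601 timestamp format, and the `buildLogRecord` helper are assumptions, and the sketch builds a plain object rather than calling Pino.

```typescript
// Field names follow the list above; types and optionality are assumptions.
interface BillingLogRecord {
  level: "debug" | "info" | "warn" | "error";
  ts: string; // ISO 8601 timestamp (assumed format)
  service: "billing-service";
  messageId?: string;
  accountId?: string;
  tenantId?: string;
  stage: string;
  durationMs?: number;
  traceId?: string;
}

interface BillingLogContext {
  messageId?: string;
  accountId?: string;
  tenantId?: string;
  traceId?: string;
}

// Hypothetical helper: assembles one structured log record for a stage.
function buildLogRecord(
  level: BillingLogRecord["level"],
  stage: string,
  startedAtMs: number,
  nowMs: number,
  context: BillingLogContext
): BillingLogRecord {
  return {
    level: level,
    ts: new Date(nowMs).toISOString(),
    service: "billing-service",
    messageId: context.messageId,
    accountId: context.accountId,
    tenantId: context.tenantId,
    stage: stage,
    durationMs: nowMs - startedAtMs,
    traceId: context.traceId,
  };
}
```

Carrying `traceId` on every record is what lets a Loki query be joined back to the spans in section 3.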
5. Dashboards (Grafana)
- Billing Overview — ingestion rate, consumer lag, pricing cache hit ratio, negative margin rate
- Invoice Generation — cron run timeline, accounts invoiced/failed, PDF render duration
- Pricing Admin — active pricing rules by tier, operator cost coverage
- Dependencies — PG op latency, Redis hit/miss, S3 latency, NATS consumer lag
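
The pricing cache hit ratio on the Billing Overview dashboard can be derived from billing_pricing_resolve_total. The sketch below assumes that `source="db"` counts cache misses and that both inputs are counter increases over the same time range. The function name is illustrative.

```typescript
// Inputs are counter increases over the same range, e.g. the last 5 minutes.
function pricingCacheHitRatio(cacheResolves: number, dbResolves: number): number {
  const total = cacheResolves + dbResolves;
  // No resolutions in the range: the ratio is undefined rather than 0 or 1.
  return total === 0 ? NaN : cacheResolves / total;
}
```

Returning NaN for an idle range keeps a quiet period from showing up as a 0% hit ratio on the panel.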
6. Alerts
| Alert | Condition | Runbook |
|---|---|---|
| BillingPricingNotFound | pricing_not_found events > 0/min | runbooks/billing/pricing-not-found.md |
| BillingNatsLag | consumer lag > 10000 events | runbooks/billing/nats-lag.md |
| BillingPgErrors | PG errors > 5/min | runbooks/billing/pg-down.md |
| BillingRedisErrors | Redis errors > 10/min | runbooks/billing/redis-down.md |
| BillingInvoiceCronError | cron result=error | runbooks/billing/invoice-cron.md |
| BillingNegativeMargin | negative_margin_total increases | runbooks/billing/negative-margin.md |
| BillingS3Error | S3 PUT errors > 0 during cron | runbooks/billing/s3-unavailable.md |
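
The numeric conditions above are strict comparisons, so a consumer lag of exactly 10000 does not fire BillingNatsLag. The sketch below encodes four of the thresholds to make that boundary behavior explicit. The snapshot shape and function name are hypothetical, and real alert rules would normally live in the alerting system's own rule format rather than in application code.

```typescript
// Hypothetical point-in-time view of the alert inputs.
interface BillingSnapshot {
  pricingNotFoundPerMin: number;
  natsConsumerLag: number;
  pgErrorsPerMin: number;
  redisErrorsPerMin: number;
}

interface AlertRule {
  name: string;
  fires: (s: BillingSnapshot) => boolean;
}

// Thresholds copied from the alert table; all comparisons are strict.
const ALERT_RULES: AlertRule[] = [
  { name: "BillingPricingNotFound", fires: (s) => s.pricingNotFoundPerMin > 0 },
  { name: "BillingNatsLag", fires: (s) => s.natsConsumerLag > 10000 },
  { name: "BillingPgErrors", fires: (s) => s.pgErrorsPerMin > 5 },
  { name: "BillingRedisErrors", fires: (s) => s.redisErrorsPerMin > 10 },
];

// Returns the names of all rules whose condition holds for the snapshot.
function firingAlerts(snapshot: BillingSnapshot): string[] {
  const firing: string[] = [];
  for (let i = 0; i < ALERT_RULES.length; i++) {
    if (ALERT_RULES[i].fires(snapshot)) firing.push(ALERT_RULES[i].name);
  }
  return firing;
}
```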