Billing Service — Observability

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability

1. SLIs / SLOs

| SLI | SLO | Window |
|---|---|---|
| Event ingestion lag (NATS consumer lag) | P95 ≤ 30 s | 30 d |
| Pricing resolution latency (cache hit) | P95 ≤ 5 ms | 30 d |
| Pricing resolution latency (cache miss) | P95 ≤ 50 ms | 30 d |
| Invoice generation success rate | ≥ 99% of accounts invoiced within 1 h of cron start | monthly |
| Usage query API P95 | ≤ 300 ms | 30 d |
| Admin API availability | ≥ 99.5% | 30 d |
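
To make the targets above concrete, the sketch below turns an availability SLO into an error budget and checks a P95 latency target against raw samples. The helper names are hypothetical, and computing P95 from raw samples is an assumption for illustration; a histogram-based backend would estimate the quantile from buckets instead.

```typescript
// Hypothetical helpers; only the targets (99.5%, 30 d, P95 thresholds)
// come from the SLO table.

/** Minutes of downtime an availability SLO tolerates over its window. */
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (!(sloTarget > 0 && sloTarget <= 1)) {
    throw new RangeError("sloTarget must be in (0, 1]");
  }
  return (1 - sloTarget) * windowDays * 24 * 60;
}

/** Nearest-rank P95 over raw latency samples (any unit). */
function p95(samples: number[]): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

/** True when the P95 of the samples is within the SLO threshold. */
function meetsP95Slo(samples: number[], thresholdMs: number): boolean {
  return p95(samples) <= thresholdMs;
}
```

For the Admin API row, `errorBudgetMinutes(0.995, 30)` comes to about 216 minutes of tolerated unavailability per 30-day window.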

2. Metrics

billing_events_ingested_total{result="ok|duplicate|pricing_not_found|error"}
billing_events_ingestion_latency_seconds_bucket
billing_pricing_resolve_total{source="cache|db"}
billing_pricing_resolve_latency_seconds_bucket{source=...}
billing_invoice_cron_total{result="ok|error"}
billing_invoice_cron_accounts_total{result="ok|error"}
billing_invoice_cron_duration_seconds
billing_nats_consumer_lag (gauge)
billing_pg_errors_total{op=...}
billing_redis_errors_total{op=...}
billing_negative_margin_total
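
As a rough illustration of how a labeled counter such as `billing_events_ingested_total` behaves, the sketch below keeps one series per `result` value and renders it in the Prometheus text exposition format. `ResultCounter` is a hand-rolled, hypothetical stand-in, not the service's actual metrics client; a real service would normally use a client library.

```typescript
// Label values copied from billing_events_ingested_total in the list above.
type IngestResult = "ok" | "duplicate" | "pricing_not_found" | "error";
const INGEST_RESULTS: IngestResult[] = ["ok", "duplicate", "pricing_not_found", "error"];

// Hand-rolled stand-in for a metrics client library, for illustration only.
class ResultCounter {
  private name: string;
  private counts: Record<IngestResult, number>;

  constructor(name: string) {
    this.name = name;
    // Start every series at 0 so each label value is exported from the first scrape.
    this.counts = { ok: 0, duplicate: 0, pricing_not_found: 0, error: 0 };
  }

  inc(result: IngestResult, by: number = 1): void {
    if (by < 0) throw new RangeError("a counter can only increase");
    this.counts[result] += by;
  }

  /** Render the counter in the Prometheus text exposition format. */
  expose(): string {
    const lines = ["# TYPE " + this.name + " counter"];
    for (const result of INGEST_RESULTS) {
      lines.push(this.name + '{result="' + result + '"} ' + this.counts[result]);
    }
    return lines.join("\n") + "\n";
  }
}
```

Restricting `result` to a closed union keeps label cardinality fixed, so a typo in a label value fails at compile time instead of creating a new series.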

3. Traces

OpenTelemetry spans (trace context is propagated from the traceparent header on billing.events NATS messages):

  • billing.ingest_event (root from NATS)
    • billing.resolve_pricing
    • billing.resolve_cost
    • billing.pg.insert_billing_event
    • billing.pg.upsert_usage_summary
  • billing.invoice_cron (root from scheduler)
    • billing.aggregate_usage{accountId=...}
    • billing.render_pdf
    • billing.s3.put
    • billing.pg.insert_invoice
    • billing.nats.publish_invoice_generated

Attributes: billing.account_id, billing.tenant_id, billing.operator_id, billing.message_id, billing.invoice_id.
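
Continuing a trace from a NATS message means reading the `traceparent` header before the root span starts. The sketch below parses the W3C Trace Context version-00 layout by hand, purely to show what the header carries. `parseTraceparent` is a hypothetical helper; in practice an OpenTelemetry propagator would do this work.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version "-" trace-id "-" parent-id "-" trace-flags
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

/**
 * Parse a traceparent header value. Returns null when the header is
 * missing or malformed. Assumption: only version "00" is accepted here.
 */
function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const version = m[1];
  const traceId = m[2];
  const parentId = m[3];
  const flags = m[4];
  if (version !== "00") return null;
  // All-zero IDs are invalid under the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    traceId: traceId,
    parentId: parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}
```

When the function returns null, the consumer would presumably start a fresh trace for `billing.ingest_event` rather than reject the message.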

4. Logs (Pino → Loki)

Fields: level, ts, service=billing-service, messageId, accountId, tenantId, stage, durationMs, traceId. Billing logs contain no MSISDN or SMS body; neither field exists in the billing events schema.
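
The field list above can be pictured as a typed record. The sketch below builds one JSON log line from an allowlist of the documented keys, so a stray property never reaches the output. It does not use Pino and does not reproduce Pino's exact output. `buildLogLine`, the `Level` values, and the ISO-8601 format for `ts` are assumptions for illustration.

```typescript
type Level = "debug" | "info" | "warn" | "error";

// Context fields named in the section above; all optional in this sketch.
interface BillingLogContext {
  messageId?: string;
  accountId?: string;
  tenantId?: string;
  stage?: string;
  durationMs?: number;
  traceId?: string;
}

const CONTEXT_KEYS: (keyof BillingLogContext)[] = [
  "messageId", "accountId", "tenantId", "stage", "durationMs", "traceId",
];

/** Build one JSON log line containing only the documented fields. */
function buildLogLine(level: Level, context: BillingLogContext, now: Date = new Date()): string {
  const entry: Record<string, unknown> = {
    level: level,
    ts: now.toISOString(), // assumed timestamp format
    service: "billing-service",
  };
  // Copy only allowlisted keys; anything else on the object is dropped.
  for (const key of CONTEXT_KEYS) {
    const value = context[key];
    if (value !== undefined) entry[key] = value;
  }
  return JSON.stringify(entry);
}
```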

5. Dashboards (Grafana)

  • Billing Overview — ingestion rate, consumer lag, pricing cache hit ratio, negative margin rate
  • Invoice Generation — cron run timeline, accounts invoiced/failed, PDF render duration
  • Pricing Admin — active pricing rules by tier, operator cost coverage
  • Dependencies — PG op latency, Redis hit/miss, S3 latency, NATS consumer lag
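
The pricing cache hit ratio on the Billing Overview dashboard can plausibly be derived from `billing_pricing_resolve_total`. The sketch below assumes `source="cache"` counts hits and `source="db"` counts misses; `cacheHitRatio` is a hypothetical helper, and the real panel would more likely compute this in a dashboard query.

```typescript
/**
 * Fraction of pricing lookups served from cache over some interval.
 * Returns null when there were no lookups, to avoid dividing by zero.
 */
function cacheHitRatio(cacheCount: number, dbCount: number): number | null {
  if (cacheCount < 0 || dbCount < 0) {
    throw new RangeError("counts must be non-negative");
  }
  const total = cacheCount + dbCount;
  return total === 0 ? null : cacheCount / total;
}
```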

6. Alerts

| Alert | Condition | Runbook |
|---|---|---|
| BillingPricingNotFound | pricing_not_found events > 0/min | runbooks/billing/pricing-not-found.md |
| BillingNatsLag | consumer lag > 10000 events | runbooks/billing/nats-lag.md |
| BillingPgErrors | PG errors > 5/min | runbooks/billing/pg-down.md |
| BillingRedisErrors | Redis errors > 10/min | runbooks/billing/redis-down.md |
| BillingInvoiceCronError | cron result=error | runbooks/billing/invoice-cron.md |
| BillingNegativeMargin | negative_margin_total increases | runbooks/billing/negative-margin.md |
| BillingS3Error | S3 PUT errors > 0 during cron | runbooks/billing/s3-unavailable.md |
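
The threshold-style conditions above can be read as strict comparisons against precomputed inputs. The sketch below encodes four of them as data and returns the rules that would fire for a snapshot. The `BillingSnapshot` field names are hypothetical, and real alerting would normally live in the monitoring stack's rule files rather than in application code.

```typescript
// Hypothetical snapshot of already-computed inputs; field names are illustrative.
interface BillingSnapshot {
  pricingNotFoundPerMin: number;
  natsConsumerLag: number;
  pgErrorsPerMin: number;
  redisErrorsPerMin: number;
}

interface AlertRule {
  alert: string;
  runbook: string;
  isFiring: (s: BillingSnapshot) => boolean;
}

// Alert names, thresholds, and runbook paths are taken from the alerts table.
const RULES: AlertRule[] = [
  { alert: "BillingPricingNotFound", runbook: "runbooks/billing/pricing-not-found.md",
    isFiring: (s) => s.pricingNotFoundPerMin > 0 },
  { alert: "BillingNatsLag", runbook: "runbooks/billing/nats-lag.md",
    isFiring: (s) => s.natsConsumerLag > 10000 },
  { alert: "BillingPgErrors", runbook: "runbooks/billing/pg-down.md",
    isFiring: (s) => s.pgErrorsPerMin > 5 },
  { alert: "BillingRedisErrors", runbook: "runbooks/billing/redis-down.md",
    isFiring: (s) => s.redisErrorsPerMin > 10 },
];

/** Return every rule whose condition holds for the snapshot. */
function firingAlerts(s: BillingSnapshot): AlertRule[] {
  return RULES.filter((rule) => rule.isFiring(s));
}
```

Because every comparison is strict, a consumer lag of exactly 10000 events or exactly 5 PG errors per minute does not fire.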