Observability
:::info Source
Sourced from services/billing-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Events: billing.payment_intent.created|succeeded|failed, billing.invoice.created|paid|voided, billing.subscription.created|changed|canceled, billing.dunning.stage_advanced|resolved, billing.payout.initiated|completed|failed, billing.webhook.received|processed|rejected, billing.refund.processed.
Redact: PAN (never logged), CVV (never stored), full address (hashed).
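The redaction rules above could be applied just before a log event is serialized. The following is a minimal sketch, not the service's actual logger: the field names (`pan`, `cvv`, `address`, `tenant_id`), the `redact_event` helper, and the use of SHA-256 for hashing are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical field names; the real event schema is not given in the source.
DROP_FIELDS = {"pan", "cvv"}   # must never reach the log line
HASH_FIELDS = {"address"}      # logged only in hashed form


def redact_event(event: str, fields: dict) -> str:
    """Return a JSON log line with sensitive fields dropped or hashed."""
    safe = {}
    for key, value in fields.items():
        if key in DROP_FIELDS:
            continue  # drop defensively, even if a caller passed it by mistake
        if key in HASH_FIELDS:
            # SHA-256 is an assumed choice; the source only says "hashed".
            value = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        safe[key] = value
    return json.dumps({"event": event, **safe}, sort_keys=True)


line = redact_event(
    "billing.payment_intent.created",
    {"tenant_id": "t_123", "pan": "4242424242424242", "cvv": "123", "address": "1 Main St"},
)
print(line)
```

Dropping by field name only works if callers use those names consistently, so a real implementation would probably also need a schema or allow-list check.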
2. Metrics
RED
| Metric | Type |
|---|---|
| billing_api_requests_total{endpoint,status} | counter |
| billing_api_duration_seconds{endpoint} | histogram |
Domain
| Metric | Type | Notes |
|---|---|---|
| billing_payments_total{status} | counter | |
| billing_payments_amount_micro_total{currency,status} | counter | |
| billing_refunds_total{reason} | counter | |
| billing_subscriptions_created_total{plan} | counter | |
| billing_subscriptions_active | gauge | |
| billing_dunning_active | gauge | |
| billing_payouts_total{status} | counter | |
| billing_webhook_processing_duration_seconds | histogram | |
| billing_webhook_failures_total{reason} | counter | |
| billing_mrr_micro | gauge | Monthly recurring revenue |
| billing_arr_micro | gauge | Annual recurring revenue |
| billing_churn_rate | gauge | |
Reconciliation
| Metric | Type | Notes |
|---|---|---|
| billing_reconciliation_variance_micro | gauge | Difference between our ledger and Stripe's |
| billing_reconciliation_last_run_timestamp | gauge | |
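As a worked example of the reconciliation gauge, the sketch below computes a variance value from two totals. It assumes that the `_micro` suffix means millionths of a currency unit and that the gauge reports an absolute difference; neither is confirmed by the source, and the helper names are hypothetical.

```python
from decimal import Decimal

# Assumption: "_micro" means millionths of a currency unit.
MICROS_PER_UNIT = 1_000_000


def to_micros(amount: str) -> int:
    """Convert a decimal currency string such as "12.34" to integer micro-units."""
    return int(Decimal(amount) * MICROS_PER_UNIT)


def reconciliation_variance_micro(ledger_micro: int, stripe_micro: int) -> int:
    """Absolute difference between the internal ledger total and the Stripe total."""
    return abs(ledger_micro - stripe_micro)


ledger = to_micros("10000.00")
stripe = to_micros("9999.25")
variance = reconciliation_variance_micro(ledger, stripe)
print(variance)  # 750000 micro-units, i.e. 0.75 currency units
```

Parsing amounts with `Decimal` rather than `float` keeps the conversion exact, which matters when the quantity being measured is a small discrepancy.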
3. Traces
Spans: billing.payment.create_intent, billing.payment.confirm, billing.invoice.finalize, billing.subscription.renew, billing.payout.create, billing.webhook.process.
4. Dashboards
- Revenue: MRR/ARR trend, payment success rate, refund rate.
- Dunning: active processes, stage distribution, resolution rate.
- Payouts: upcoming, completed, failures.
- Webhooks: throughput, latency, DLQ.
- Reconciliation: variance over time.
5. Alerts
| Alert | Threshold | Severity |
|---|---|---|
| payment-failure-rate-spike | > 5% in 10 min | P2 |
| webhook-failure-rate | > 1% in 10 min | P1 |
| webhook-lag | p99 > 30s | P2 |
| reconciliation-variance | > $100 | P1 |
| payout-failed | any | P2 |
| dunning-not-advancing | process > 14 days | P3 |
| mrr-drop | > 5% day-over-day | P2 |
| kms-failure | > 5 failures in 1 min | P1 |
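The rate-based thresholds in the table above reduce to a ratio compared against a limit. The sketch below shows that arithmetic for payment-failure-rate-spike only. In practice the rule would most likely live in the monitoring system rather than in application code, and the function names here are hypothetical.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of failed payments in the window; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def should_fire(failed: int, total: int, threshold: float = 0.05) -> bool:
    """True when the windowed failure rate strictly exceeds the threshold."""
    return failure_rate(failed, total) > threshold


# Counts observed over a hypothetical 10-minute window.
print(should_fire(failed=12, total=200))  # 6% of payments failed -> True
print(should_fire(failed=10, total=200))  # exactly 5% -> False (threshold is strict)
print(should_fire(failed=0, total=0))     # no traffic -> False
```

Treating an empty window as a rate of zero avoids a division error, but it also means a total outage in payment traffic would not trip this alert on its own.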
6. SLOs
| SLI | Target |
|---|---|
| Payment intent p95 | < 500ms |
| Webhook processing p95 | < 200ms |
| Webhook success rate | ≥ 99.9% |
| Reconciliation variance | < 0.01% |
| Invoice generation p95 | < 5s |
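Two of the targets above can be checked with simple arithmetic: a p95 latency and the failure allowance implied by a 99.9% success rate. The sketch below uses a nearest-rank percentile, which is one of several percentile definitions and may differ from what the metrics backend computes. The sample latencies and event count are made up for illustration.

```python
from decimal import Decimal


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def error_budget(total_events: int, target: str = "0.999") -> int:
    """Number of failures allowed while still meeting the success-rate target."""
    allowed = Decimal(total_events) * (Decimal(1) - Decimal(target))
    return int(allowed)  # truncate: a partial failure cannot be spent


# Hypothetical webhook processing latencies in milliseconds.
latencies_ms = [120, 130, 150, 90, 110, 140, 160, 100, 180, 95]
print(p95(latencies_ms) < 200)   # True: 180 ms is under the 200 ms target
print(error_budget(1_000_000))   # 1000 failures allowed per million webhooks
```

With only ten samples the nearest-rank p95 is simply the maximum, so a real check would need a much larger window to be meaningful.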
7. Business Metrics
- Daily: GMV, net revenue, refunds, payouts.
- Monthly: MRR, ARR, churn, expansion.
- Per-tenant: revenue, refund rate.