
Observability

Source: services/billing-service/OBSERVABILITY.md in the documentation repo.

1. Logs

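Events:

  • billing.payment_intent.created, billing.payment_intent.succeeded, billing.payment_intent.failed
  • billing.invoice.created, billing.invoice.paid, billing.invoice.voided
  • billing.subscription.created, billing.subscription.changed, billing.subscription.canceled
  • billing.dunning.stage_advanced, billing.dunning.resolved
  • billing.payout.initiated, billing.payout.completed, billing.payout.failed
  • billing.webhook.received, billing.webhook.processed, billing.webhook.rejected
  • billing.refund.processed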

Redact: PAN (never logged), CVV (never stored), full address (hashed).
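
The redaction rules above could be applied as a filter that runs before any event is written. The following is a minimal sketch, not the service's actual implementation: the field names `pan`, `cvv`, and `address`, the `_sha256` suffix, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical field names: the source names only the data classes
# (PAN, CVV, full address), not the keys used in log events.
DROP_FIELDS = {"pan", "cvv"}
HASH_FIELDS = {"address"}


def redact(event: dict) -> dict:
    """Return a copy of a log event with sensitive fields dropped or hashed."""
    safe = {}
    for key, value in event.items():
        if key in DROP_FIELDS:
            continue  # PAN and CVV never reach the log
        if key in HASH_FIELDS:
            # Replace the raw value with a SHA-256 digest (assumed hash choice).
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            safe[key + "_sha256"] = digest
        else:
            safe[key] = value
    return safe
```

An unkeyed hash of a low-entropy value such as an address can be guessed by brute force, so a keyed hash (for example HMAC with a secret key) may be a better fit in practice.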

2. Metrics

RED

  • billing_api_requests_total{endpoint,status} counter
  • billing_api_duration_seconds{endpoint} histogram

Domain

  • billing_payments_total{status} counter
  • billing_payments_amount_micro_total{currency,status} counter
  • billing_refunds_total{reason} counter
  • billing_subscriptions_created_total{plan} counter
  • billing_subscriptions_active gauge
  • billing_dunning_active gauge
  • billing_payouts_total{status} counter
  • billing_webhook_processing_duration_seconds histogram
  • billing_webhook_failures_total{reason} counter
  • billing_mrr_micro gauge (monthly recurring revenue)
  • billing_arr_micro gauge (annual recurring revenue)
  • billing_churn_rate gauge
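
The `_micro` suffix suggests that monetary values are stored as integer micro-units. The sketch below assumes that 1 major currency unit equals 1,000,000 micro-units and that ARR is derived as the conventional 12 × MRR; neither is confirmed by the source, and the function names are hypothetical.

```python
from decimal import Decimal

# Assumption: 1 major currency unit = 1,000,000 micro-units.
MICRO = Decimal(1_000_000)


def to_micro(amount: str) -> int:
    """Convert a decimal amount string such as '19.99' to integer micro-units."""
    # Decimal avoids the rounding error that float arithmetic would introduce.
    return int(Decimal(amount) * MICRO)


def arr_from_mrr(mrr_micro: int) -> int:
    """Annualize MRR using the conventional ARR = 12 * MRR (assumed here)."""
    return 12 * mrr_micro
```

Keeping amounts as integers means counters such as billing_payments_amount_micro_total can be summed without accumulating floating-point error.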

Reconciliation

  • billing_reconciliation_variance_micro gauge (difference between our ledger and Stripe's)
  • billing_reconciliation_last_run_timestamp gauge
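
As an illustration of how the variance gauge might be computed, the sketch below takes the two totals in micro-units and returns both the absolute and the relative difference. The sign convention (ledger minus Stripe), the choice of the Stripe total as the denominator, and the function name are assumptions.

```python
def reconciliation_variance(ledger_micro: int, stripe_micro: int) -> tuple[int, float]:
    """Return (variance in micro-units, variance as a percentage of the Stripe total)."""
    variance = ledger_micro - stripe_micro  # assumed sign: ledger minus Stripe
    # Guard against division by zero when Stripe reports no volume.
    pct = abs(variance) / stripe_micro * 100 if stripe_micro else 0.0
    return variance, pct
```

The absolute value corresponds to a dollar-denominated alert threshold, while the percentage corresponds to a relative SLO target.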

3. Traces

Spans: billing.payment.create_intent, billing.payment.confirm, billing.invoice.finalize, billing.subscription.renew, billing.payout.create, billing.webhook.process.

4. Dashboards

  • Revenue: MRR/ARR trend, payment success rate, refund rate.
  • Dunning: active processes, stage distribution, resolution rate.
  • Payouts: upcoming, completed, failures.
  • Webhooks: throughput, latency, DLQ.
  • Reconciliation: variance over time.

5. Alerts

| Alert | Threshold | Severity |
| --- | --- | --- |
| payment-failure-rate-spike | > 5% in 10 min | P2 |
| webhook-failure-rate | > 1% in 10 min | P1 |
| webhook-lag | p99 > 30s | P2 |
| reconciliation-variance | > $100 | P1 |
| payout-failed | any | P2 |
| dunning-not-advancing | process > 14 days | P3 |
| mrr-drop | > 5% day-over-day | P2 |
| kms-failure | > 5 fail / 1 min | P1 |
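
Rate-based thresholds such as "> 5% in 10 min" boil down to a failure ratio over a sliding time window. In a real deployment this would most likely be an alerting rule in the monitoring system rather than application code; the Python below only illustrates the arithmetic, and the class and method names are hypothetical.

```python
from collections import deque


class WindowedFailureRate:
    """Track the failure ratio of events seen within the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, ts: float, failed: bool) -> None:
        self.events.append((ts, failed))

    def rate(self, now: float) -> float:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)


def should_fire(tracker: WindowedFailureRate, now: float, threshold: float) -> bool:
    """True when the windowed failure ratio exceeds the threshold (e.g. 0.05)."""
    return tracker.rate(now) > threshold
```

For the payment failure alert, a tracker with a 600-second window and a threshold of 0.05 would fire once more than 5% of the payments in the last 10 minutes have failed.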

6. SLOs

| SLI | Target |
| --- | --- |
| Payment intent p95 | < 500ms |
| Webhook processing p95 | < 200ms |
| Webhook success rate | ≥ 99.9% |
| Reconciliation variance | < 0.01% |
| Invoice generation p95 | < 5s |
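
To make the p95 targets concrete, the sketch below checks a list of raw latency samples against them using the nearest-rank percentile. It is only an illustration: the dictionary keys and function names are hypothetical, and a system that records latencies in histograms would normally estimate percentiles from bucket counts instead.

```python
# p95 targets from the SLO list, in seconds. The keys are hypothetical labels.
SLO_P95_SECONDS = {
    "payment_intent": 0.5,
    "webhook_processing": 0.2,
    "invoice_generation": 5.0,
}


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(sli: str, samples: list[float]) -> bool:
    """True when the observed p95 is strictly below the target for this SLI."""
    return percentile(samples, 95) < SLO_P95_SECONDS[sli]
```

With 100 samples the nearest-rank p95 is the 95th smallest value, so the target is still met when up to five requests are slow.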

7. Business Metrics

  • Daily: GMV, net revenue, refunds, payouts.
  • Monthly: MRR, ARR, churn, expansion.
  • Per-tenant: revenue, refund rate.