
Failure Modes

:::info Source
Sourced from services/billing-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Stripe API Outage

  • Symptom: PaymentIntent creation fails.
  • Mitigation:
    • Retry with exponential backoff.
    • Queue in pending state.
    • Alert ops; surface "payment temporarily unavailable" to users.
    • Fail over to secondary Stripe region if available.

1.2 Webhook Delivery Delay

  • Symptom: Stripe delivers webhook 30 min late.
  • Mitigation: The handler is idempotent on event.id, so events are processed correctly regardless of arrival time. Stripe retries delivery for up to 3 days.
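
A minimal sketch of idempotency keyed on event.id, assuming an in-memory set stands in for a durable table of processed event IDs. The `WebhookProcessor` class and its `process` method are hypothetical names used for illustration, not part of the billing service.

```python
from typing import Callable


class WebhookProcessor:
    """Handles each event at most once, keyed on event["id"]."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        # Assumption: a real service would persist processed IDs durably.
        self._seen = set()

    def process(self, event: dict) -> bool:
        """Return True if the event was handled, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False  # late or repeated delivery: no side effect
        self._handler(event)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can still be retried by a later delivery.
        self._seen.add(event_id)
        return True
```

Because the check depends only on the event ID, a webhook that arrives 30 minutes late is treated exactly like one that arrives on time.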

1.3 Webhook Signature Mismatch

  • Symptom: Invalid signature header.
  • Response: Return 400; alert; investigate (possibly a missed key rotation or a spoofing attempt).
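
The check behind this response can be sketched with the standard library, assuming Stripe's documented scheme: a `Stripe-Signature` header of the form `t=<timestamp>,v1=<hex digest>`, where the digest is an HMAC-SHA256 over `<timestamp>.<raw body>` keyed with the endpoint secret. The function names and the 300-second tolerance are illustrative assumptions. In practice the official Stripe library's `stripe.Webhook.construct_event` helper is the usual choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_signature(payload: bytes, header: str, secret: str,
                     tolerance_s: int = 300,
                     now: Optional[float] = None) -> bool:
    """Return True only if the header carries a valid, fresh v1 signature."""
    parts = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking digest prefixes via timing.
    return any(hmac.compare_digest(expected.encode(), c.encode())
               for c in candidates)


def status_for(payload: bytes, header: str, secret: str,
               now: Optional[float] = None) -> int:
    """Map the verification result to the HTTP status described above."""
    return 200 if verify_signature(payload, header, secret, now=now) else 400
```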

1.4 Double Payment Processing

  • Mitigation: The payment primary key is processor_ref (the Stripe PaymentIntent ID), so a second webhook for the same event is a no-op.
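
A minimal sketch of how a primary key on processor_ref makes the second write a no-op, using an in-memory SQLite database. The table layout, the `amount_cents` column, and the function names are assumptions made for illustration; only the processor_ref key comes from the text above.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a payments table keyed on processor_ref."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        " processor_ref TEXT PRIMARY KEY,"   # Stripe PaymentIntent ID
        " amount_cents INTEGER NOT NULL)"    # hypothetical column
    )
    return conn


def record_payment(conn: sqlite3.Connection, processor_ref: str,
                   amount_cents: int) -> bool:
    """Insert the payment once; return False if it was already recorded."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO payments (processor_ref, amount_cents)"
        " VALUES (?, ?)",
        (processor_ref, amount_cents),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the primary key already exists.
    return cur.rowcount == 1
```

Letting the database enforce uniqueness means two workers racing on the same webhook cannot both record the payment.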

1.5 Reconciliation Drift

  • Symptom: Our ledger != Stripe balance.
  • Mitigation: The daily reconciler detects drift. Below the threshold it auto-corrects; above the threshold it raises a P1 alert and requires manual review.
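
The threshold decision can be sketched as a pure function. The `Action` enum, the use of integer cents, and the choice to treat a drift exactly equal to the threshold as an alert are assumptions; the source does not state the threshold value or the boundary behavior.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    AUTO_CORRECT = "auto_correct"
    ALERT_P1 = "alert_p1"


def classify_drift(ledger_cents: int, stripe_cents: int,
                   threshold_cents: int) -> Action:
    """Decide what the daily reconciler should do about a balance difference."""
    drift = abs(ledger_cents - stripe_cents)
    if drift == 0:
        return Action.NONE
    if drift < threshold_cents:
        return Action.AUTO_CORRECT
    # Assumption: drift at or above the threshold goes to manual review.
    return Action.ALERT_P1
```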

1.6 Payout to Wrong Bank Account

  • Mitigation: Bank account verified via micro-deposits before first payout; high-value payouts require 4-eyes approval.

1.7 Dunning Stuck

  • Symptom: Dunning process not advancing.
  • Mitigation: Monitor next_attempt_at; alert if not advancing for 7 days.
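
One way to read this check, sketched below, is that an account counts as stuck when its next_attempt_at is 7 or more days in the past. That interpretation, the `stuck_accounts` name, and the dictionary input are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

STUCK_AFTER = timedelta(days=7)


def stuck_accounts(next_attempts: Dict[str, datetime],
                   now: datetime) -> List[str]:
    """Return account IDs whose next_attempt_at is overdue by 7 days or more."""
    return sorted(
        account
        for account, next_attempt_at in next_attempts.items()
        if now - next_attempt_at >= STUCK_AFTER
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = {"acct_a": now - timedelta(days=8), "acct_b": now}
    print(stuck_accounts(sample, now))
```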

1.8 Tax Provider Outage

  • Mitigation: Cache recent tax calculations (rate + jurisdiction); fall back to the cached rate for up to 1 hour; alert if the outage lasts longer.
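
A minimal sketch of the cached-rate fallback, assuming the cache is keyed by jurisdiction and that a cached entry is usable for one hour after it was fetched. `TaxRateCache`, `fetch_live`, and the injectable clock are hypothetical names; alerting is left out.

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_S = 3600  # cached rates are usable for up to 1 hour


class TaxRateCache:
    """Serves live tax rates, falling back to a recent cached rate on failure."""

    def __init__(self, fetch_live: Callable[[str], float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_live = fetch_live
        self._clock = clock
        self._cache: Dict[str, Tuple[float, float]] = {}  # jurisdiction -> (rate, fetched_at)

    def rate_for(self, jurisdiction: str) -> float:
        try:
            rate = self._fetch_live(jurisdiction)
        except Exception:
            cached = self._cache.get(jurisdiction)
            if cached is not None and self._clock() - cached[1] < CACHE_TTL_S:
                return cached[0]  # provider is down, cached rate is still fresh
            raise  # no usable fallback: surface the provider error
        self._cache[jurisdiction] = (rate, self._clock())
        return rate
```

Once the cached entry is older than an hour the original provider error propagates, which is the point at which the extended-outage alert would apply.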

1.9 Currency Conversion Drift

  • Mitigation: Snapshot rate at invoice creation; reconcile with processor's conversion on payout.

1.10 Chargeback / Dispute

  • Mitigation: Stripe notifies us via webhook → we emit billing.dispute.opened.v1; an admin collects evidence; the response is then submitted automatically with that evidence.

2. Retry / Backoff

| Op | Max | Backoff |
|---|---|---|
| Stripe API call | 3 | 200ms, 1s, 3s |
| Webhook retry (outbound) | 5 | 1s, 5s, 30s, 2m, 10m |
| Payout | 3 | 1 hour apart |
| Reconciliation | unlimited | daily |
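
A minimal sketch of a schedule-driven retry for the Stripe API call row, assuming "Max" counts retries after the first attempt and each retry is preceded by the corresponding delay (200ms, 1s, 3s). The function name and the injectable `sleep` are illustrative, and a real caller would retry only on transient errors rather than on every exception.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

STRIPE_BACKOFF_S = (0.2, 1.0, 3.0)  # 200ms, 1s, 3s


def call_with_retries(op: Callable[[], T],
                      backoff_s: Sequence[float] = STRIPE_BACKOFF_S,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying once per backoff entry after sleeping for that delay."""
    for delay in backoff_s:
        try:
            return op()
        except Exception:
            # Assumption: every exception is treated as transient here.
            sleep(delay)
    # Final attempt: if it fails, the exception propagates to the caller.
    return op()
```

The other rows follow the same shape with a different schedule, for example `(1, 5, 30, 120, 600)` seconds for outbound webhook retries.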

3. Circuit Breakers

Stripe: 10 failures within 30s → open for 60s. Tax provider: 10 failures within 30s → open for 60s.
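
These thresholds can be sketched as a small sliding-window breaker, assuming the figures mean "open after 10 failures within 30 seconds, then reject calls for 60 seconds". The `CircuitBreaker` class is hypothetical, and the sketch closes the circuit fully after the open period rather than modelling a half-open probe state.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Opens after max_failures within window_s, then rejects calls for open_s."""

    def __init__(self, max_failures: int = 10, window_s: float = 30.0,
                 open_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._open_s:
            # Open period elapsed: close the circuit and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```

A caller would check `allow()` before each Stripe or tax-provider request and call `record_failure()` when the request fails.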

4. Fallbacks

| Primary | Fallback |
|---|---|
| Live tax calculation | Cached rate (< 1h) |
| Stripe primary | Secondary region |
| Real-time reconciliation | Daily batch |

5. Chaos

  • Stripe mock returns 500 for 30s → retries succeed.
  • Webhook sent twice → single side-effect.
  • Kill worker mid-renewal → resumes on restart.
  • Inject $100 drift → reconciler detects + alerts.