Failure Modes
:::info Source
Sourced from services/billing-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Stripe API Outage
- Symptom: PaymentIntent creation fails.
- Mitigation:
  - Retry with exponential backoff.
  - Queue in `pending` state.
  - Alert ops; surface "payment temporarily unavailable" to users.
  - Fail over to secondary Stripe region if available.
1.2 Webhook Delivery Delay
- Symptom: Stripe delivers webhook 30 min late.
- Mitigation: The handler is idempotent on `event.id`, so events are processed correctly regardless of when they arrive. Stripe retries delivery for up to 3 days.
1.3 Webhook Signature Mismatch
- Symptom: Invalid signature header.
- Response: Return 400; alert; investigate (possibly a missed key rotation or a spoofing attempt).
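
The check behind this response can be sketched with the standard library alone. The sketch follows Stripe's documented scheme (an HMAC-SHA256 over `{timestamp}.{payload}`, carried in a header of the form `t=...,v1=...`); the function names, the 300-second tolerance, and the `response_status` helper are assumptions for illustration, and production code would more likely call the official Stripe library's `stripe.Webhook.construct_event`.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # assumed replay window, not taken from the source


def sign(payload: bytes, secret: str, timestamp: int) -> str:
    """Compute the hex HMAC-SHA256 of "<timestamp>.<payload>"."""
    signed = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def verify_signature(payload: bytes, header: str, secret: str,
                     now: Optional[float] = None) -> bool:
    """Return True only if the header carries a fresh, matching v1 signature."""
    parts = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamps = [value for key, value in parts if key == "t"]
    candidates = [value for key, value in parts if key == "v1"]
    if not timestamps or not candidates:
        return False
    try:
        timestamp = int(timestamps[0])
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: treat as a replay
    expected = sign(payload, secret, timestamp).encode()
    # Constant-time comparison avoids leaking how many characters matched.
    return any(hmac.compare_digest(expected, c.encode()) for c in candidates)


def response_status(payload: bytes, header: str, secret: str,
                    now: Optional[float] = None) -> int:
    """Hypothetical helper: map the verification result to an HTTP status."""
    return 200 if verify_signature(payload, header, secret, now) else 400
```

A missed key rotation shows up here as every request failing against the old secret, whereas a spoofing attempt typically fails only for the offending sender, which is one way to tell the two causes apart during investigation.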
1.4 Double Payment Processing
- Mitigation: Payment PK on `processor_ref` (Stripe PaymentIntent ID). Second webhook for same event is no-op.
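
A minimal sketch of how a primary key on `processor_ref` turns a repeated webhook into a no-op, using an in-memory SQLite database. The table layout, the `amount_cents` column, and the `record_payment` function are assumptions for illustration; the real schema is not shown in this document.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical payments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        " processor_ref TEXT PRIMARY KEY,"  # Stripe PaymentIntent ID
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def record_payment(conn: sqlite3.Connection, processor_ref: str,
                   amount_cents: int) -> bool:
    """Insert the payment; return False if it was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (processor_ref, amount_cents)"
                " VALUES (?, ?)",
                (processor_ref, amount_cents),
            )
    except sqlite3.IntegrityError:
        # The primary key rejected a duplicate: skip all side effects.
        return False
    return True
```

Letting the database enforce uniqueness, rather than checking with a `SELECT` first, keeps the guarantee intact when two deliveries of the same event are handled concurrently.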
1.5 Reconciliation Drift
- Symptom: Our ledger != Stripe balance.
- Mitigation: The daily reconciler detects drift; drift under the threshold is auto-corrected; drift above the threshold raises a P1 alert and requires manual review.
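
The threshold decision can be sketched as a small pure function. The threshold value (1000 cents), the `Action` names, and the choice to treat drift exactly at the threshold as an alert are assumptions; the source does not state the threshold or the boundary behaviour.

```python
from enum import Enum

AUTO_CORRECT_THRESHOLD_CENTS = 1000  # hypothetical value, not from the source


class Action(Enum):
    NONE = "none"
    AUTO_CORRECT = "auto_correct"
    ALERT_P1 = "alert_p1"


def classify_drift(ledger_cents: int, stripe_cents: int,
                   threshold_cents: int = AUTO_CORRECT_THRESHOLD_CENTS) -> Action:
    """Decide what the reconciler should do for a given pair of balances."""
    drift = abs(ledger_cents - stripe_cents)
    if drift == 0:
        return Action.NONE
    if drift < threshold_cents:
        return Action.AUTO_CORRECT
    # Assumption: drift at or above the threshold goes to a human.
    return Action.ALERT_P1
```

Amounts are kept as integer cents so the comparison is exact; floating-point balances could produce spurious sub-cent drift.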
1.6 Payout to Wrong Bank Account
- Mitigation: Bank account verified via micro-deposits before first payout; high-value payouts require 4-eyes approval.
1.7 Dunning Stuck
- Symptom: Dunning process not advancing.
- Mitigation: Monitor `next_attempt_at`; alert if not advancing for 7 days.
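
One possible reading of this monitor, sketched below, is to flag any dunning run whose `next_attempt_at` lies more than 7 days in the past, meaning the scheduled attempt never happened and the field was never moved forward. That interpretation, the `DunningRun` type, and the `invoice_id` field are assumptions; the actual monitor may instead track when the field last changed.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple

STUCK_AFTER = timedelta(days=7)


class DunningRun(NamedTuple):
    invoice_id: str            # hypothetical identifier
    next_attempt_at: datetime  # timezone-aware UTC timestamp


def find_stuck(runs: Iterable[DunningRun], now: datetime) -> List[str]:
    """Return invoice IDs whose next attempt is overdue by more than 7 days."""
    return [
        run.invoice_id
        for run in runs
        if now - run.next_attempt_at > STUCK_AFTER
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 15, tzinfo=timezone.utc)
    runs = [
        DunningRun("inv_ok", now + timedelta(days=1)),
        DunningRun("inv_stuck", now - timedelta(days=8)),
    ]
    print(find_stuck(runs, now))
```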
1.8 Tax Provider Outage
- Mitigation: Cache recent tax calculations (rate + jurisdiction); fall back to the cached rate for up to 1 hour; alert if the outage extends beyond that.
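
The cache-then-fallback behaviour could look roughly like the following. `TaxRateResolver`, `TaxProviderUnavailable`, and the injected `fetch_live` callable are hypothetical names; only the 1-hour limit and the jurisdiction-keyed cache come from the text above.

```python
import time
from decimal import Decimal
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 3600  # cached rates are usable for 1 hour


class TaxProviderUnavailable(Exception):
    """Raised by the (hypothetical) live tax client during an outage."""


class TaxRateResolver:
    def __init__(self, fetch_live: Callable[[str], Decimal],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch_live = fetch_live
        self._clock = clock
        # jurisdiction -> (rate, time the rate was cached)
        self._cache: Dict[str, Tuple[Decimal, float]] = {}

    def rate_for(self, jurisdiction: str) -> Decimal:
        """Prefer the live rate; fall back to a cached rate under 1 hour old."""
        try:
            rate = self._fetch_live(jurisdiction)
        except TaxProviderUnavailable:
            cached = self._cache.get(jurisdiction)
            if cached is not None and self._clock() - cached[1] < CACHE_TTL_SECONDS:
                return cached[0]
            raise  # no usable cache entry: the caller should alert
        self._cache[jurisdiction] = (rate, self._clock())
        return rate
```

Re-raising once the cache entry expires gives the caller a natural place to emit the "extended outage" alert instead of silently charging a stale rate.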
1.9 Currency Conversion Drift
- Mitigation: Snapshot rate at invoice creation; reconcile with processor's conversion on payout.
1.10 Chargeback / Dispute
- Mitigation: Stripe notifies via webhook → we emit `billing.dispute.opened.v1`; admin collects evidence; automatic response with evidence.
2. Retry / Backoff
| Op | Max | Backoff |
|---|---|---|
| Stripe API call | 3 | 200ms, 1s, 3s |
| Webhook retry (outbound) | 5 | 1s, 5s, 30s, 2m, 10m |
| Payout | 3 | 1 hour apart |
| Reconciliation | unlimited | daily |
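
As an illustration of the Stripe API row, the sketch below applies the 200ms, 1s, 3s schedule. It assumes "Max 3" means one initial attempt followed by up to three retries, which the table does not state explicitly; `TransientError` and `call_with_retries` are hypothetical names.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

STRIPE_BACKOFF_SECONDS = (0.2, 1.0, 3.0)  # schedule from the table


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(op: Callable[[], T],
                      delays: Sequence[float] = STRIPE_BACKOFF_SECONDS,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op; on TransientError wait for the next delay and try again."""
    for delay in delays:
        try:
            return op()
        except TransientError:
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return op()
```

Injecting `sleep` keeps the function testable without real waiting. Only errors explicitly marked transient are retried, so a declined card or a validation error fails immediately.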
3. Circuit Breakers
Stripe: 10 failures within 30s opens the breaker for 60s. Tax provider: 10 failures within 30s opens the breaker for 60s.
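
A minimal sketch of a breaker with these numbers, assuming a sliding 30-second failure window and a simple reset after the 60-second open period (no half-open probing). The `CircuitBreaker` class and `CircuitOpenError` are hypothetical and are not the service's actual implementation.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 10, window_seconds: float = 30.0,
                 open_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._window_seconds = window_seconds
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def call(self, op: Callable[[], T]) -> T:
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self._open_seconds:
                raise CircuitOpenError("circuit open; call rejected")
            # Open period elapsed: close the breaker and start fresh.
            self._opened_at = None
            self._failures.clear()
        try:
            return op()
        except Exception:
            self._record_failure(now)
            raise

    def _record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```

While the breaker is open, calls fail fast without touching the provider, which is what lets the fallbacks take over quickly during an outage.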
4. Fallbacks
| Primary | Fallback |
|---|---|
| Live tax calculation | Cached rate (< 1h) |
| Stripe primary | Secondary region |
| Real-time reconciliation | Daily batch |
5. Chaos
- Stripe mock returns 500 for 30s → retries succeed.
- Webhook sent twice → single side-effect.
- Kill worker mid-renewal → resumes on restart.
- Inject $100 drift → reconciler detects + alerts.