Failure Modes
:::info Source
Sourced from services/marketplace-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Purchase Saga Split-Brain with Licensing
- Payment succeeds but license grant fails.
- Mitigation: saga with idempotent steps + compensations; 30-min timeout; reconciliation job hourly.
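The saga-with-compensation idea above can be sketched roughly as follows. This is an in-memory illustration only: `SagaStep`, `run_saga`, and the step functions are hypothetical names, not taken from the marketplace service, and the 30-min timeout and hourly reconciliation job are out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]      # assumed idempotent: safe to re-run
    compensate: Callable[[dict], None]  # undoes the action; also idempotent


def run_saga(steps: List[SagaStep], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["failed_step"] = step.name
            ctx["error"] = str(exc)
            for prev in reversed(done):
                prev.compensate(ctx)
            return False
    return True


# Hypothetical purchase saga: payment succeeds, license grant fails.
def charge(ctx: dict) -> None:
    ctx["charged"] = True


def refund(ctx: dict) -> None:
    ctx["charged"] = False


def grant_license(ctx: dict) -> None:
    raise RuntimeError("licensing unavailable")


def revoke_license(ctx: dict) -> None:
    ctx["licensed"] = False


ctx = {"order_id": "order-1"}
ok = run_saga(
    [
        SagaStep("charge", charge, refund),
        SagaStep("grant_license", grant_license, revoke_license),
    ],
    ctx,
)
```

Here the failed license grant triggers the refund compensation, so the buyer is not left charged without a license.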
1.2 Double-Charge
- Mitigation: `Idempotency-Key` on order placement; billing also idempotent; duplicate events deduped by saga.
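As a rough illustration of idempotent order placement, the sketch below keeps an in-memory map from idempotency key to the order it produced. `OrderService`, `place_order`, and the field names are hypothetical; a real service would presumably persist the key with the order and enforce uniqueness atomically (for example with a unique constraint).

```python
from typing import Dict


class OrderService:
    """In-memory stand-in for an order service keyed by Idempotency-Key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}
        self.charges = 0  # counts how many times a charge was actually made

    def place_order(self, idempotency_key: str, listing_id: str, amount_cents: int) -> dict:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # Replay of the same request: return the original result, no new charge.
            return existing
        self.charges += 1
        order = {
            "order_id": f"order-{self.charges}",
            "listing_id": listing_id,
            "amount_cents": amount_cents,
        }
        self._by_key[idempotency_key] = order
        return order


svc = OrderService()
first = svc.place_order("key-abc", "listing-1", 4999)
retry = svc.place_order("key-abc", "listing-1", 4999)  # client retry, same key
```

The retried request returns the same order and `svc.charges` stays at 1.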
1.3 Refund After Enrollment Started
- Mitigation: refund policy (tenant-configurable); partial refunds; enrollment revocation event → learner notified.
1.4 Payment Processor Outage
- Mitigation: queue orders in `pending_payment`; 30-min timeout → compensate.
- UX: "payment processor temporarily unavailable; your order is on hold."
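One way to picture the 30-min timeout is a periodic sweep over queued orders. The sketch below is an assumption about how such a sweep might look: the order dictionary shape, the `compensating` status, and `expire_pending_orders` are hypothetical; only `pending_payment` and the 30-minute limit come from the text above.

```python
from datetime import datetime, timedelta, timezone

PENDING_TIMEOUT = timedelta(minutes=30)  # from the mitigation above


def expire_pending_orders(orders: list, now: datetime) -> list:
    """Flag orders stuck in pending_payment past the timeout for compensation."""
    expired = []
    for order in orders:
        age = now - order["created_at"]
        if order["status"] == "pending_payment" and age >= PENDING_TIMEOUT:
            order["status"] = "compensating"  # hypothetical status name
            expired.append(order["id"])
    return expired


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
orders = [
    {"id": "o1", "status": "pending_payment", "created_at": now - timedelta(minutes=45)},
    {"id": "o2", "status": "pending_payment", "created_at": now - timedelta(minutes=5)},
]
expired = expire_pending_orders(orders, now)
```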
1.5 KYC Provider Outage
- Providers cannot onboard; existing listings unaffected.
- Status: degraded; manual review fallback.
1.6 Coupon Code Race
- Mitigation: Postgres SELECT FOR UPDATE on coupon row; atomic redemption.
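A minimal sketch of row-locked redemption, assuming a DB-API connection to Postgres (for example psycopg) with autocommit off. The `coupons` table, its `redeemed_count` and `max_redemptions` columns, and `redeem_coupon` are hypothetical names chosen for illustration.

```python
def redeem_coupon(conn, code: str) -> bool:
    """Redeem one use of a coupon inside a single transaction.

    Returns True if the redemption was recorded, False if the coupon
    does not exist or is already exhausted.
    """
    cur = conn.cursor()
    try:
        # Row lock: concurrent redeemers of the same code wait here
        # until this transaction commits or rolls back.
        cur.execute(
            "SELECT redeemed_count, max_redemptions "
            "FROM coupons WHERE code = %s FOR UPDATE",
            (code,),
        )
        row = cur.fetchone()
        if row is None or row[0] >= row[1]:
            conn.rollback()
            return False
        cur.execute(
            "UPDATE coupons SET redeemed_count = redeemed_count + 1 WHERE code = %s",
            (code,),
        )
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because the check and the increment happen under the same row lock, two simultaneous redemptions of a single-use coupon cannot both pass the `redeemed_count` check.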
1.7 Listing with Invalid Course Reference
- Publish-time validation: courseId must resolve to a published CourseVersion.
1.8 Cross-Tenant Purchase Attempt
- JWT `tid` narrows Order scope to the buyer tenant; a cross-tenant purchase is legitimate (the provider tenant differs from the buyer tenant).
1.9 Provider Payout Failure
- Retry via billing; fallback to manual payout review.
1.10 Review Spam / Abuse
- Rate limit per user; ML classifier flags; admin queue.
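The per-user rate limit could be implemented in many ways; the sketch below uses a sliding window. The limit values, `ReviewRateLimiter`, and its method names are assumptions, since the source does not state the actual thresholds. The ML classifier and admin queue are not modeled.

```python
from collections import defaultdict, deque


class ReviewRateLimiter:
    """Sliding-window limiter: at most max_reviews per user per window."""

    def __init__(self, max_reviews: int, window_seconds: float) -> None:
        self.max_reviews = max_reviews
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps of accepted reviews

    def allow(self, user_id: str, now: float) -> bool:
        q = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window_seconds:
            q.popleft()
        if len(q) >= self.max_reviews:
            return False
        q.append(now)
        return True


# Hypothetical limit: 2 reviews per 60 seconds.
limiter = ReviewRateLimiter(max_reviews=2, window_seconds=60)
results = [
    limiter.allow("u1", 0.0),
    limiter.allow("u1", 1.0),
    limiter.allow("u1", 2.0),   # third review inside the window
    limiter.allow("u2", 2.0),   # different user, separate budget
    limiter.allow("u1", 61.0),  # earlier reviews have aged out
]
```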
2. Retry / Backoff
| Op | Max | Backoff |
|---|---|---|
| Billing call | 3 | 200ms, 1s, 3s |
| Postgres write | 3 | 10ms–200ms |
| Outbox | infinite | exp cap 5m |
| Saga step | 5 | exp cap 30s |
| Payout retry | 3 | 1 hour apart |
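
The schedules in the table can be sketched as below. Two assumptions are made: "Max" is read as the number of retries after the first attempt, and the 2-second base for the saga-step schedule is invented for illustration (the table gives only the 30s cap). `retry` and `backoff_delays` are hypothetical helpers.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

BILLING_DELAYS_S = (0.2, 1.0, 3.0)  # billing call row in the table above


def backoff_delays(base_s: float, cap_s: float, retries: int) -> List[float]:
    """Exponential backoff: base * 2**n for each retry, capped at cap_s."""
    return [min(cap_s, base_s * (2 ** n)) for n in range(retries)]


def retry(op: Callable[[], T], delays_s: Sequence[float],
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op; after each failure wait the next delay, then try again."""
    for delay in delays_s:
        try:
            return op()
        except Exception:
            sleep(delay)
    return op()  # final attempt; its exception propagates to the caller


# Saga step row: 5 retries, cap 30s, with an assumed 2s base.
saga_delays = backoff_delays(base_s=2.0, cap_s=30.0, retries=5)

# Billing call that fails twice, then succeeds; sleeps are recorded, not waited.
calls = {"n": 0}
slept: List[float] = []


def flaky_billing_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("billing unavailable")
    return "charged"


result = retry(flaky_billing_call, BILLING_DELAYS_S, sleep=slept.append)
```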
3. Circuit Breakers
- Billing: 10 failures within 30s → breaker opens for 60s.
- KYC: 10 failures within 60s → breaker opens for 120s.
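
A minimal circuit-breaker sketch, reading the billing threshold as "10 failures within 30 seconds open the breaker for 60 seconds" (that reading is an assumption). The `CircuitBreaker` class is hypothetical and deliberately omits a half-open probing state.

```python
from collections import deque
from typing import Optional


class CircuitBreaker:
    def __init__(self, max_failures: int, window_s: float, open_s: float) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._failures = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a call may proceed at time `now`."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.open_s:
            # Cool-down elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Keep only failures inside the rolling window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now


billing = CircuitBreaker(max_failures=10, window_s=30, open_s=60)
for t in range(10):
    billing.record_failure(float(t))  # 10 failures in 10 seconds

blocked = billing.allow(9.5)         # breaker opened at t=9
still_blocked = billing.allow(68.9)  # 59.9s after opening
recovered = billing.allow(69.0)      # 60s after opening
```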
4. Fallbacks
| Primary | Fallback |
|---|---|
| Stripe primary | Stripe secondary region |
| Real-time saga | Queue + async processing |
| KYC live | Manual review queue |
5. Chaos
- Payment succeeds; kill saga pod before license grant → saga resumes on restart; license granted idempotently.
- Double `payment.succeeded.v1` event → only one license granted.
- Network partition during refund flow → compensation completes on recovery.