Failure Modes

:::info Source
Sourced from services/marketplace-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Purchase Saga Split-Brain with Licensing

  • Payment succeeds but license grant fails.
  • Mitigation: saga with idempotent steps + compensations; 30-min timeout; reconciliation job hourly.
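The saga mitigation above can be sketched roughly as follows. This is a minimal in-memory illustration, not the service's actual implementation: the `Step` and `Saga` names are hypothetical, and a real saga would persist `done` so it can resume after a crash.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str                              # hypothetical step name, e.g. "payment"
    action: Callable[[dict], None]         # forward operation
    compensate: Callable[[dict], None]     # undo for a completed action


@dataclass
class Saga:
    steps: list[Step]
    done: list[str] = field(default_factory=list)  # would be persisted in practice

    def run(self, ctx: dict) -> bool:
        for step in self.steps:
            if step.name in self.done:
                continue  # already completed: skipping keeps a resumed run idempotent
            try:
                step.action(ctx)
                self.done.append(step.name)
            except Exception:
                # Compensate completed steps in reverse order.
                for prev in reversed(self.steps):
                    if prev.name in self.done:
                        prev.compensate(ctx)
                        self.done.remove(prev.name)
                return False
        return True
```

In the split-brain case (payment succeeds, license grant fails), the failing license step triggers the payment step's compensation, which here would be a refund.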

1.2 Double-Charge

  • Mitigation: Idempotency-Key on order placement; billing also idempotent; duplicate events deduped by saga.
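A minimal sketch of how an Idempotency-Key can prevent a double charge, assuming the key is supplied by the client on order placement. `OrderService` and its in-memory map are hypothetical stand-ins; a real service would store the key with a unique constraint in its database.

```python
import uuid


class OrderService:
    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}  # idempotency key -> order id
        self.charges: list[str] = []       # order ids that were charged

    def place_order(self, idempotency_key: str, amount_cents: int) -> str:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing  # replayed request: return the original order, no new charge
        order_id = str(uuid.uuid4())
        self.charges.append(order_id)      # stand-in for the billing call
        self._by_key[idempotency_key] = order_id
        return order_id
```

Retrying `place_order` with the same key returns the same order id and leaves `charges` with a single entry.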

1.3 Refund After Enrollment Started

  • Mitigation: refund policy (tenant-configurable); partial refunds; enrollment revocation event → learner notified.

1.4 Payment Processor Outage

  • Mitigation: queue orders in pending_payment; 30-min timeout → compensate.
  • UX: "payment processor temporarily unavailable; your order is on hold."
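The 30-minute timeout could be enforced by a periodic sweep along these lines. The order dictionaries and the `orders_to_compensate` helper are assumptions made for illustration; only the `pending_payment` status and the 30-minute limit come from the text above.

```python
from datetime import datetime, timedelta

PENDING_TIMEOUT = timedelta(minutes=30)


def orders_to_compensate(orders: list[dict], now: datetime) -> list[str]:
    """Return ids of orders stuck in pending_payment for 30 minutes or more."""
    return [
        o["id"]
        for o in orders
        if o["status"] == "pending_payment"
        and now - o["created_at"] >= PENDING_TIMEOUT
    ]
```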

1.5 KYC Provider Outage

  • Providers cannot onboard; existing listings unaffected.
  • Status: degraded; manual review fallback.

1.6 Coupon Code Race

  • Mitigation: Postgres SELECT FOR UPDATE on coupon row; atomic redemption.
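A sketch of the row-lock approach, assuming a DB-API cursor (for example from psycopg) that is already inside a transaction. The `coupons` table and its `redeemed_count` and `max_redemptions` columns are hypothetical; the real schema may differ.

```python
def redeem_coupon(cur, code: str) -> bool:
    """Redeem one use of a coupon. Must run inside a transaction."""
    # FOR UPDATE locks the coupon row, so a concurrent redemption of the
    # same code blocks here until this transaction commits or rolls back.
    cur.execute(
        "SELECT redeemed_count, max_redemptions FROM coupons "
        "WHERE code = %s FOR UPDATE",
        (code,),
    )
    row = cur.fetchone()
    if row is None:
        return False  # unknown coupon
    redeemed, limit = row
    if redeemed >= limit:
        return False  # already exhausted
    cur.execute(
        "UPDATE coupons SET redeemed_count = redeemed_count + 1 WHERE code = %s",
        (code,),
    )
    return True
```

Because the check and the increment happen under the same row lock, two buyers racing for the last redemption cannot both succeed.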

1.7 Listing with Invalid Course Reference

  • Publish-time validation: courseId must resolve to a published CourseVersion.

1.8 Cross-Tenant Purchase Attempt

  • The JWT tid claim scopes the Order to the buyer's tenant. A cross-tenant purchase is still legitimate, because the provider's tenant can differ from the buyer's.

1.9 Provider Payout Failure

  • Retry via billing; fallback to manual payout review.

1.10 Review Spam / Abuse

  • Rate limit per user; ML classifier flags; admin queue.
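One way to implement the per-user rate limit is a sliding window, sketched below. The limits are parameters because the source does not state them, and `PerUserRateLimiter` is a hypothetical name; a multi-instance deployment would need shared storage rather than process memory.

```python
from collections import defaultdict, deque


class PerUserRateLimiter:
    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)  # user id -> timestamps

    def allow(self, user_id: str, now: float) -> bool:
        q = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_events:
            return False  # over the limit: reject the review submission
        q.append(now)
        return True
```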

2. Retry / Backoff

| Op | Max | Backoff |
| --- | --- | --- |
| Billing call | 3 | 200ms, 1s, 3s |
| Postgres write | 3 | 10ms–200ms |
| Outbox | infinite | exp cap 5m |
| Saga step | 5 | exp cap 30s |
| Payout retry | 3 | 1 hour apart |
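The capped exponential schedules and the fixed billing schedule can be sketched as below. Two assumptions are made here: "Max" is read as the number of retries after the first attempt, and the base delay for the exponential rows is a parameter because the source gives only the cap. Both function names are hypothetical.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float, cap: float) -> list[float]:
    """Exponential delays (base, 2*base, 4*base, ...) clamped to cap, in seconds."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def retry(op: Callable[[], T], delays: Sequence[float],
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op, sleeping for each delay after a failure; re-raise once delays run out."""
    for delay in [*delays, None]:
        try:
            return op()
        except Exception:
            if delay is None:
                raise  # retries exhausted
            sleep(delay)
    raise AssertionError("unreachable")
```

Under these assumptions, the billing row corresponds to `retry(call_billing, [0.2, 1.0, 3.0])` (with `call_billing` standing in for the real call), and a saga step with a 4-second base would use `backoff_delays(5, 4.0, 30.0)`.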

3. Circuit Breakers

Billing: 10 failures within 30s → open for 60s. KYC: 10 failures within 60s → open for 120s.
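A simplified breaker matching these thresholds might look like the sketch below. It assumes the second figure is how long the breaker stays open, and it omits a half-open probe state; `CircuitBreaker` is a hypothetical class, not the service's actual one.

```python
from collections import deque
from typing import Optional


class CircuitBreaker:
    def __init__(self, max_failures: int, window: float, open_for: float) -> None:
        self.max_failures = max_failures
        self.window = window        # seconds over which failures are counted
        self.open_for = open_for    # seconds the breaker stays open
        self._failures: deque = deque()
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.open_for:
            # Open period elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Forget failures older than the counting window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```

With these semantics, the billing breaker would be `CircuitBreaker(10, 30, 60)` and the KYC breaker `CircuitBreaker(10, 60, 120)`.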

4. Fallbacks

| Primary | Fallback |
| --- | --- |
| Stripe primary | Stripe secondary region |
| Real-time saga | Queue + async processing |
| KYC live | Manual review queue |

5. Chaos

  • Payment succeeds; kill saga pod before license grant → saga resumes on restart; license granted idempotently.
  • Double payment.succeeded.v1 event → only one license granted.
  • Network partition during refund flow → compensation completes on recovery.
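The duplicate-event scenario relies on the license grant being idempotent. A minimal sketch, assuming the grant is keyed on the order id so that a redelivered payment.succeeded.v1 event becomes a no-op; `LicenseGranter` and the event's `order_id` field are hypothetical.

```python
class LicenseGranter:
    def __init__(self) -> None:
        self._granted: set[str] = set()  # order ids that already have a license
        self.licenses: list[str] = []

    def on_payment_succeeded(self, event: dict) -> bool:
        """Handle payment.succeeded.v1; return True only if a license was granted."""
        order_id = event["order_id"]
        if order_id in self._granted:
            return False  # duplicate delivery or saga resume: nothing to do
        self._granted.add(order_id)
        self.licenses.append(order_id)
        return True
```

Delivering the same event twice leaves exactly one entry in `licenses`, which is the property the chaos test checks.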