:::info Source
Sourced from services/progress-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Ingest Overload (Spike Beyond Capacity)
- Mitigation: Backpressure via 429; clients buffer statements in a local outbox. Ingest autoscales to 100 pods. Kafka-style load shedding is not used because at-least-once delivery is required.
- Recovery: scale-out completes within 2 min; backlog drains.
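The client side of this backpressure can be sketched as follows. This is a minimal illustration, not the service's actual client: `flush_outbox` and the `send` callable are hypothetical names, and treating every status other than 429 as final is a simplifying assumption.

```python
from collections import deque
from typing import Callable, Deque


def flush_outbox(outbox: Deque[dict], send: Callable[[dict], int]) -> int:
    """Send queued statements in order; stop at the first 429 and keep the rest queued.

    `send` is a hypothetical callable that posts one statement and returns the
    HTTP status code. Returns the number of statements removed from the outbox.
    """
    sent = 0
    while outbox:
        status = send(outbox[0])
        if status == 429:
            # Backpressure: leave the statement at the head of the outbox
            # so the next flush retries it first (at-least-once delivery).
            break
        # Assumption of this sketch: any other status is treated as final.
        outbox.popleft()
        sent += 1
    return sent
```

Because the statement stays at the head of the queue until it is accepted, a retry after a 429 can resend a statement the server already stored; the dedup in 1.4 is what makes that safe.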
1.2 Postgres Primary Failover
- Mitigation: Patroni promotes a replica automatically in under 30 s. Ingest queues statements in the outbox during failover.
- Recovery: replica promoted; clients retry.
1.3 Projection Lag
- Mitigation: Projector scales on the lag metric. Projections are idempotent; a rebuild from the statement log serves as disaster recovery.
- Recovery: scale projector, or run offline replay.
1.4 Duplicate Statement From Retry
- Mitigation: `statementId` PK dedups at insert. Spec-compliant (silently accept duplicate).
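A minimal sketch of primary-key dedup at insert. It uses an in-memory SQLite database as a stand-in for Postgres so the example runs on its own; the table layout, column names, and `insert_statement` helper are assumptions for illustration, not the service's schema.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (statement_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
)


def insert_statement(conn: sqlite3.Connection, statement_id: str, body: str) -> bool:
    """Insert a statement; return True if newly stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT INTO statements (statement_id, body) VALUES (?, ?) "
        "ON CONFLICT (statement_id) DO NOTHING",
        (statement_id, body),
    )
    # rowcount is 0 when the primary key already existed and nothing was written.
    return cur.rowcount == 1
```

The caller can return the same success response in both cases, so a client retry is indistinguishable from a first delivery.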
1.5 Out-of-Order Statement from Offline Replay
- Mitigation: `timestamp` is authoritative for ordering. A late statement may trigger attempt re-evaluation (the projection recomputes the outcome).
- Recovery: projector idempotent.
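One way to picture an order-independent, idempotent recompute is below. The `Statement` shape, the verb names, and the rule "latest terminal verb by `timestamp` wins" are assumptions made for this sketch; the real projection logic may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Statement:
    statement_id: str
    timestamp: datetime  # client-supplied time, authoritative for ordering
    verb: str            # hypothetical verbs: "attempted", "passed", "failed"


def attempt_outcome(statements: Iterable[Statement]) -> Optional[str]:
    """Recompute the outcome from scratch from all statements of one attempt."""
    # Drop replays of the same statement so re-delivery cannot change the result.
    unique = {s.statement_id: s for s in statements}
    # Order by timestamp, not by arrival; statement_id breaks ties deterministically.
    ordered = sorted(unique.values(), key=lambda s: (s.timestamp, s.statement_id))
    outcome = None
    for s in ordered:
        if s.verb in ("passed", "failed"):
            outcome = s.verb  # assumed rule: the latest terminal verb wins
    return outcome
```

Since the function depends only on the set of statements, a late arrival or a repeated run produces the same outcome as processing everything in order once.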
1.6 Completion Record Created Twice
- Mitigation: UNIQUE constraint `(tenant_id, attempt_id)` on `completion_records`. A second INSERT is a no-op.
1.7 GDPR Erasure Incomplete
- Mitigation: Saga participation mandatory; replay test in CI; compliance review quarterly.
1.8 Partition Detach During Query
- Mitigation: Detach is scheduled during off-peak hours; queries against older data are replayed against the cold tier.
1.9 xAPI Signed Statement — Signature Fails
- Mitigation: Rejected with `validation.statement.signature_invalid`. Sender retries after fixing.
1.10 Clock Skew on Ingest (Statement from Future)
- Mitigation: `timestamp` is accepted as sent; `stored` is server time. Alert if `timestamp` > `stored` + 5 min.
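The alert condition can be written as a one-line check. The function and constant names are hypothetical; only the 5-minute threshold comes from the rule above.

```python
from datetime import datetime, timedelta, timezone

MAX_FUTURE_SKEW = timedelta(minutes=5)  # threshold from the alert rule


def is_future_skewed(timestamp: datetime, stored: datetime) -> bool:
    """True when the client timestamp is more than 5 minutes ahead of server time."""
    return timestamp > stored + MAX_FUTURE_SKEW
```

Both arguments should be timezone-aware (for example UTC); Python raises a `TypeError` when comparing aware and naive datetimes. A statement exactly 5 minutes ahead does not alert, and statements from the past (offline replay) never do.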
2. Retry / Backoff
| Op | Max | Backoff | Budget |
|---|---|---|---|
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| NATS publish (outbox) | infinite | exp cap 5 min | — |
| Projection retry | 5 | exp cap 30s | 5 min |
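A sketch of how a fixed delay schedule with a time budget, and a capped exponential delay, might look. It reads "Max" as the number of retries after the initial attempt, which is an assumption (three delays are listed for the Postgres write). The function names and the `base_s` parameter are hypothetical; the table does not give a base delay.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_with_budget(
    op: Callable[[], T],
    delays_s: Sequence[float],
    budget_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call op, retrying once per entry in delays_s while the time budget allows."""
    deadline = clock() + budget_s
    for delay in delays_s:
        try:
            return op()
        except Exception:
            # Give up early if waiting would push us past the budget.
            if clock() + delay > deadline:
                raise
            sleep(delay)
    return op()  # final attempt; its exception propagates to the caller


def capped_exponential(attempt: int, base_s: float, cap_s: float) -> float:
    """Delay before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap_s, base_s * (2 ** attempt))
```

For the Postgres write row this would be called as `retry_with_budget(write, [0.010, 0.050, 0.200], 0.300)`. The listed delays sum to 260 ms, which leaves 40 ms of the 300 ms budget for the attempts themselves.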
3. Circuit Breakers
| Target | Trip | Reset |
|---|---|---|
| Postgres primary | 10 fail / 30s | 60s |
| NATS | 10 fail / 30s | 60s |
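The trip and reset values above could be implemented with a sliding-window breaker along these lines. This is a simplified sketch with a hypothetical `CircuitBreaker` class: it closes fully after the reset period and has no half-open probing state, which a production breaker would likely add.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Opens after max_failures within window_s seconds; closes again after reset_s."""

    def __init__(
        self,
        max_failures: int = 10,
        window_s: float = 30.0,
        reset_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.reset_s = reset_s
        self.clock = clock
        self.failures: Deque[float] = deque()
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the target may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_s:
            # Reset period elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        """Record one failed call and trip the breaker if the window is full."""
        now = self.clock()
        self.failures.append(now)
        # Discard failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

A caller checks `allow()` before each Postgres or NATS call and switches to the corresponding fallback while it returns `False`.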
4. Fallbacks
| Primary | Fallback |
|---|---|
| Postgres primary | Replica (read-only mode) |
| NATS publish | Outbox retains; alert ops |
| Real-time projection | Batch replay from log |
5. Chaos
- Partition Postgres from NATS for 2 min → verify outbox drains on reconnect.
- Corrupt a partition (remove a row) → verify the Merkle check detects it.
- Kill projector mid-batch → verify resumes from cursor.
- Ingest 100k/sec burst → verify backpressure + scale-out.