
Failure Modes

:::info Source
Sourced from services/progress-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Ingest Overload (Spike Beyond Capacity)

  • Mitigation: Backpressure via HTTP 429; clients hold unsent statements in their local statements outbox and resend later. Ingest autoscales up to 100 pods. Kafka-like load shedding is not used, because at-least-once guarantees are required.
  • Recovery: Scale-out completes within 2 min, after which the backlog drains.
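
The client side of this backpressure contract can be sketched as follows. This is a minimal illustration, not the actual client: it assumes the outbox is an in-memory queue and that `send` is a hypothetical transport callback returning the HTTP status code from ingest.

```python
from collections import deque
from typing import Callable, Deque, Dict


def flush_outbox(outbox: Deque[Dict], send: Callable[[Dict], int]) -> int:
    """Send queued statements in order and stop at the first 429.

    Returns the number of statements delivered. Anything not acknowledged
    stays in the outbox for the next flush, which preserves at-least-once.
    `send` is a hypothetical callback returning an HTTP status code.
    """
    delivered = 0
    while outbox:
        status = send(outbox[0])   # peek: do not drop before acknowledgement
        if status == 429:          # ingest is applying backpressure
            break
        if 200 <= status < 300:
            outbox.popleft()       # acknowledged, safe to remove
            delivered += 1
        else:
            break                  # other errors: keep the statement for later
    return delivered
```

Because a statement is removed only after a 2xx response, a flush interrupted by a 429 can resend a statement that ingest already stored; the duplicate handling in 1.4 covers that case.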

1.2 Postgres Primary Failover

  • Mitigation: Patroni handles automatic promotion in under 30 s. Ingest queues writes in the outbox during failover.
  • Recovery: The replica is promoted; clients retry.

1.3 Projection Lag

  • Mitigation: The projector scales on the lag metric. Projections are idempotent; a rebuild from the statement log serves as disaster recovery.
  • Recovery: Scale the projector, or run an offline replay.

1.4 Duplicate Statement From Retry

  • Mitigation: The statementId primary key deduplicates at insert. The duplicate is silently accepted, which is spec-compliant.
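
A minimal sketch of primary-key deduplication at insert. It uses SQLite from the Python standard library as a stand-in for Postgres (the `ON CONFLICT ... DO NOTHING` clause is valid in both); the table name `statements` and its columns are assumptions made for illustration.

```python
import sqlite3


def insert_statement(conn: sqlite3.Connection, statement_id: str, body: str) -> bool:
    """Insert a statement; return True if it was new, False if it was a duplicate.

    Either way the caller can answer the client with success, so a retried
    statement is silently accepted.
    """
    cur = conn.execute(
        "INSERT INTO statements (statement_id, body) VALUES (?, ?) "
        "ON CONFLICT (statement_id) DO NOTHING",
        (statement_id, body),
    )
    return cur.rowcount == 1


# Hypothetical schema: the primary key is what enforces deduplication.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (statement_id TEXT PRIMARY KEY, body TEXT)")
```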

1.5 Out-of-Order Statement from Offline Replay

  • Mitigation: timestamp is authoritative for ordering. Late statement may trigger attempt re-evaluation (projection recomputes outcome).
  • Recovery: The projector is idempotent, so re-applying statements is safe.
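
One way to picture the re-evaluation: keep every statement for an attempt keyed by its ID, and recompute the outcome from the full set ordered by timestamp whenever a statement arrives. The sketch below is illustrative only; the `Statement` shape and the rule "the outcome is the verb of the latest statement" are assumptions, not the projector's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Statement:
    statement_id: str
    timestamp: str  # ISO 8601 UTC with a fixed format, so string order is time order
    verb: str


def recompute_outcome(statements: Dict[str, Statement]) -> Optional[str]:
    """Derive the outcome from all known statements, ordered by timestamp."""
    ordered = sorted(statements.values(), key=lambda s: (s.timestamp, s.statement_id))
    return ordered[-1].verb if ordered else None  # assumed rule: latest verb wins


def apply_statement(statements: Dict[str, Statement], incoming: Statement) -> Optional[str]:
    """Idempotent apply: keyed by statement_id, so a replay changes nothing."""
    statements[incoming.statement_id] = incoming
    return recompute_outcome(statements)
```

Arrival order does not matter here: a late statement with an earlier timestamp is slotted into place by the sort, and applying the same statement twice leaves the result unchanged.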

1.6 Completion Record Created Twice

  • Mitigation: UNIQUE constraint (tenant_id, attempt_id) on completion_records. Second attempt to INSERT is a no-op.

1.7 GDPR Erasure Incomplete

  • Mitigation: Saga participation mandatory; replay test in CI; compliance review quarterly.

1.8 Partition Detach During Query

  • Mitigation: Detach is scheduled during off-peak hours; queries that replay against older data go to the cold tier.

1.9 xAPI Signed Statement — Signature Fails

  • Mitigation: Rejected with validation.statement.signature_invalid. Sender retries after fixing.

1.10 Clock Skew on Ingest (Statement from Future)

  • Mitigation: The statement's timestamp is accepted as sent; stored is set to server time. Alert if timestamp > stored + 5 min.
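
The alert condition is a single comparison. A minimal sketch, assuming both values are timezone-aware datetimes; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_FUTURE_SKEW = timedelta(minutes=5)  # threshold taken from the rule above


def is_future_skewed(timestamp: datetime, stored: datetime) -> bool:
    """True when the client timestamp is more than 5 min ahead of server time."""
    return timestamp > stored + MAX_FUTURE_SKEW


stored = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = stored + timedelta(minutes=4)
from_future = stored + timedelta(minutes=6)
```

The statement is stored in both cases; the check only decides whether to raise an alert.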

2. Retry / Backoff

| Operation | Max | Backoff | Budget |
| --- | --- | --- | --- |
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| NATS publish (outbox) | infinite | exp cap 5 min | |
| Projection retry | 5 | exp cap 30s | 5 min |
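
The Postgres write policy (fixed delays of 10ms, 50ms, 200ms within a 300ms budget) could look like the sketch below. It reads "Max 3" as one initial attempt plus up to three retries, which is an assumption, as are the function name and the injectable `sleep` and `clock` parameters.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_with_budget(
    op: Callable[[], T],
    backoffs: Sequence[float] = (0.010, 0.050, 0.200),  # seconds
    budget: float = 0.300,                               # seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Run op, retrying after each listed delay while the time budget allows."""
    start = clock()
    for delay in (*backoffs, None):  # None marks the final attempt
        try:
            return op()
        except Exception:
            # Give up when retries are exhausted or the next wait would
            # push total elapsed time past the budget.
            if delay is None or (clock() - start) + delay > budget:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

The budget check runs before sleeping, so a slow failing call ends the retry loop early instead of sleeping past 300ms.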

3. Circuit Breakers

| Target | Trip | Reset |
| --- | --- | --- |
| Postgres primary | 10 fail / 30s | 60s |
| NATS | 10 fail / 30s | 60s |
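
A minimal sketch of a breaker with these thresholds: it opens after 10 failures inside a 30s sliding window and allows calls again 60s later. The class and method names are hypothetical, and the reset here closes the breaker outright rather than passing through a half-open probe state, which the real breaker may do differently.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        threshold: int = 10,
        window: float = 30.0,
        reset_after: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.window = window
        self.reset_after = reset_after
        self.clock = clock
        self.failures: Deque[float] = deque()  # timestamps of recent failures
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            self.opened_at = None   # reset: close and forget old failures
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.opened_at = now    # trip
```

Callers check `allow()` before each call to the target and call `record_failure()` when it fails; while the breaker is open, the fallbacks in section 4 apply.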

4. Fallbacks

| Primary | Fallback |
| --- | --- |
| Postgres primary | Replica (read-only mode) |
| NATS publish | Outbox retains; alert ops |
| Real-time projection | Batch replay from log |

5. Chaos

  • Partition Postgres from NATS for 2 min → verify outbox drains on reconnect.
  • Corrupt a partition (remove a row) → verify that Merkle verification detects it.
  • Kill projector mid-batch → verify resumes from cursor.
  • Ingest 100k/sec burst → verify backpressure + scale-out.
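
The "kill projector mid-batch" experiment relies on the cursor advancing only after a whole batch has been applied. A minimal sketch of that property, assuming an in-memory log and a hypothetical `apply_event` callback that is idempotent:

```python
from typing import Callable, Dict, List


def run_projector(
    log: List[str],
    state: Dict[str, int],
    apply_event: Callable[[str], None],
    batch_size: int = 2,
) -> None:
    """Process the log from state["cursor"], committing the cursor per batch.

    If the process dies mid-batch the cursor still points at the start of
    that batch, so a restart re-applies it. That is safe only because
    apply_event is assumed to be idempotent.
    """
    while state["cursor"] < len(log):
        start = state["cursor"]
        batch = log[start:start + batch_size]
        for event in batch:
            apply_event(event)
        state["cursor"] = start + len(batch)  # commit only after the full batch
```

A chaos test can then raise inside `apply_event` partway through a batch, restart `run_projector` with the same `state`, and assert that every event ends up applied.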