Skip to main content

Failure Modes

:::info Source Sourced from services/assessment-service/FAILURE_MODES.md in the documentation repo. :::

1. Scenarios

1.1 AI Grading Provider Unavailable

  • Fallback: queue response for later grading; notify instructor; SLA = 4h.

1.2 Answer Key Decryption Failure

  • Symptom: KMS unavailable or DEK lost.
  • Response: 503 to scoring endpoint; alert P1; cached answer key for 5 min allows continued serving.

1.3 Scoring Timeout

  • Complex rubric grading exceeds 30s → async grading path; learner sees "grading in progress" state.

1.4 Duplicate Submission

  • Idempotent on attemptId + attemptNumber; second submission returns first result.

1.5 Offline-Computed Score Mismatch vs Server Recompute

  • Server recomputes on ingest; mismatch → log + trust server; alert on high mismatch rate (may indicate tampered bundle).

1.6 Branching Scenario Node Not Found

  • Domain invariant at publish prevents orphan; runtime defensive fallback: return to previous node.

1.7 AI Question Generation Refused

  • Show author the refusal reason; offer manual path; log for prompt tuning.

1.8 Rubric Grading Disagreement

  • AI vs human disagreement > threshold → flag; retrain eval corpus; prompt version may be rolled back.

2. Retry / Backoff

OpMaxBackoff
AI generate21s, 5s
AI grade32s, 10s, 30s
Postgres write310ms, 50ms, 200ms
Outboxinfiniteexp cap 5m

3. Circuit Breakers

ai-gateway: 10 fail/30s → 60s. KMS: 10 fail/30s → 60s.

4. Fallbacks

PrimaryFallback
AI gradingQueue + notify instructor
AI question genManual authoring
Real-time scoringAsync grading notification

5. Chaos

  • AI gateway 30s latency → verify UX degradation (not error).
  • KMS 30s outage → scoring queue builds up; drains on recovery.
  • Bundle tamper → scoring fails cleanly with diagnostic.