Failure Modes
:::info Source
Sourced from services/certification-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Rendering Failure (PDF/PNG)
- Cause: headless Chrome crash, malformed template.
- Mitigation: retry with exponential backoff; fall back to minimal template if template render fails; alert + manual intervention.
1.2 KMS Unavailable for Signing
- Certificates queue; issuance retries; critical alert if > 10 min outage.
1.3 Offline Claim Signature Invalid
- Reject; learner sees reason ("content integrity could not be verified — please resync"); support can manually issue if evidence supports it.
1.4 Duplicate Completion Event
- Idempotent: one certificate per (enrollment, course_version, state='issued'). Second event returns existing.
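A minimal sketch of the idempotency rule above, assuming an in-memory store in place of a database unique constraint. `CertificateStore`, `Certificate`, and `issue_once` are hypothetical names used only for illustration; the real service may enforce this differently.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Certificate:
    certificate_id: str
    enrollment_id: str
    course_version: str
    state: str = "issued"


@dataclass
class CertificateStore:
    """Hypothetical stand-in for a table with a unique key on
    (enrollment, course_version) for rows where state = 'issued'."""

    _issued: Dict[Tuple[str, str], Certificate] = field(default_factory=dict)

    def issue_once(self, enrollment_id: str, course_version: str) -> Tuple[Certificate, bool]:
        """Return (certificate, created). A repeated completion event
        gets the existing certificate back with created=False."""
        key = (enrollment_id, course_version)
        existing = self._issued.get(key)
        if existing is not None:
            return existing, False
        cert = Certificate(str(uuid.uuid4()), enrollment_id, course_version)
        self._issued[key] = cert
        return cert, True
```

In a real database the lookup and insert would need to be atomic (for example a unique index plus insert-or-select), otherwise two concurrent events could both pass the existence check.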
1.5 Clock Skew in Offline Claim
- Tolerance ±24h; outside window → rejected with specific reason; learner can request manual review.
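The tolerance check can be sketched as below. This is an illustration under assumptions: `check_claim_clock` is a hypothetical helper, the boundary at exactly 24h is treated as inside the window, and the reason strings are placeholders rather than the service's actual messages.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SKEW_TOLERANCE = timedelta(hours=24)  # the ±24h window from the scenario


def check_claim_clock(claimed_at: datetime, server_now: datetime) -> Optional[str]:
    """Return None when the claim is within tolerance, otherwise a
    rejection reason. Both datetimes must be timezone-aware."""
    skew = claimed_at - server_now
    if abs(skew) <= SKEW_TOLERANCE:  # inclusive boundary is an assumption
        return None
    direction = "ahead of" if skew > timedelta(0) else "behind"
    return f"claim timestamp is more than 24h {direction} server time"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(check_claim_clock(now - timedelta(hours=30), now))
```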
1.6 Public Verify Token Enumeration
- Rate-limited per IP + per tenant; alert on spike; WAF geo rules.
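One way to picture the per-IP plus per-tenant limit is a pair of sliding-window counters. The sketch below is illustrative only: `SlidingWindowLimiter`, `allow_verify`, and the limits passed in are hypothetical, and the source does not say which algorithm or thresholds the service uses.

```python
import time
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self._clock = clock
        self._hits: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def allow_verify(ip: str, tenant: str,
                 ip_limiter: SlidingWindowLimiter,
                 tenant_limiter: SlidingWindowLimiter) -> bool:
    """A request must pass both limits. Note: a request that passes the
    IP check but fails the tenant check still consumes IP quota here."""
    return ip_limiter.allow(ip) and tenant_limiter.allow(tenant)
```

In a multi-instance deployment the counters would need to live in shared storage; the in-process dictionaries here only demonstrate the two-key check.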
1.7 kid Rotation Without Overlap
- Published JWKS updated; 2-day overlap; if rushed, old bundles fail verify — requires emergency issuance.
1.8 CDN Cache Staleness on Revoke
- Revoke triggers CDN purge + `Cache-Control: no-cache` on response for 5 min.
- Short-window stale cache accepted; reinforced by verify endpoint re-check on user click.
1.9 Artifact S3 Outage
- Learner portfolio shows stub with "temporarily unavailable"; retries via CDN fallback to origin.
1.10 GDPR Anonymization Race
- Certificate retained with `user_display_name_at_issuance` → policy decides whether to anonymize on erasure.
2. Retry / Backoff
| Operation | Max retries | Backoff |
|---|---|---|
| KMS sign | 3 | 50ms, 200ms, 500ms |
| Render | 3 | 1s, 5s, 15s |
| S3 upload | 5 | exp 100ms–10s |
| Outbox | infinite | exp cap 5m |
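The schedules in the table can be sketched as delay sequences fed to a generic retry loop. This is a sketch under assumptions: `call_with_retry` and `exponential_delays` are hypothetical helpers, the doubling factor and the 1s outbox base are not given in the source, and jitter is omitted for brevity.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

# Fixed schedules taken from the table (seconds).
KMS_SIGN_DELAYS = (0.05, 0.2, 0.5)
RENDER_DELAYS = (1.0, 5.0, 15.0)


def exponential_delays(base: float, cap: float,
                       max_retries: Optional[int] = None) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... never exceeding `cap`.
    Endless when max_retries is None. Doubling is an assumed factor."""
    delay = min(base, cap)
    count = 0
    while max_retries is None or count < max_retries:
        yield delay
        delay = min(cap, delay * 2)
        count += 1


def call_with_retry(op: Callable[[], T], delays: Iterable[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `op`; on an exception, wait for the next delay and try again.
    Re-raises the last exception once the schedule is exhausted."""
    schedule = iter(delays)
    while True:
        try:
            return op()
        except Exception:
            delay = next(schedule, None)
            if delay is None:
                raise
            sleep(delay)


# Hypothetical wiring for the remaining two rows:
s3_upload_delays = lambda: exponential_delays(0.1, 10.0, max_retries=5)
outbox_delays = lambda: exponential_delays(1.0, 300.0)  # base is assumed
```

A production loop would normally retry only on errors known to be transient and add random jitter so that many clients do not retry in lockstep.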
3. Circuit Breakers
Each entry reads: failures within window → time the breaker stays open.
KMS: 10 failures/30s → open 60s. S3: 10 failures/30s → open 60s. Renderer: 5 failures/60s → open 120s.
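A minimal breaker sketch, reading each entry as "N failures within a window, then stay open for a fixed period" (an interpretation of the shorthand, not confirmed by the source). `CircuitBreaker` is a hypothetical class; it has no half-open probing state and simply closes again once the open period has elapsed.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(self, max_failures: int, window_s: float, open_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when calls may proceed (breaker closed)."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            # Open period elapsed: close and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Keep only failures that fall inside the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now


# Thresholds from this section (seconds).
kms_breaker = CircuitBreaker(max_failures=10, window_s=30, open_s=60)
s3_breaker = CircuitBreaker(max_failures=10, window_s=30, open_s=60)
renderer_breaker = CircuitBreaker(max_failures=5, window_s=60, open_s=120)
```

Callers would check `allow()` before each dependency call and invoke `record_failure()` when the call fails; while the breaker is open, the request goes straight to the queue or fallback path.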
4. Fallbacks
| Primary | Fallback |
|---|---|
| Custom template render | Minimal default template |
| OpenBadges generator | Basic PDF + PNG only (deferred OB) |
| CDN | Origin direct |
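Each row of the table follows the same pattern: try the primary, and on failure move to the next option. A sketch of that pattern, with `first_success` as a hypothetical helper and the option names chosen only for illustration:

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_success(options: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (name, callable) in order and return the first result
    together with the name of the option that produced it."""
    last_error: Optional[BaseException] = None
    for name, fn in options:
        try:
            return name, fn()
        except Exception as exc:  # any failure moves on to the next option
            last_error = exc
    raise RuntimeError("all options failed") from last_error


if __name__ == "__main__":
    def render_custom() -> bytes:
        raise ValueError("malformed template")  # simulated failure

    def render_minimal() -> bytes:
        return b"%PDF-minimal"

    used, pdf = first_success([
        ("custom_template", render_custom),
        ("minimal_default_template", render_minimal),
    ])
    print(used)
```

Returning the name of the option that succeeded lets the caller log or alert when a fallback was used, which matters for cases like the deferred OpenBadges generation.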
5. Chaos
- KMS 60s outage → issuance queue builds; drains cleanly.
- Renderer OOM on adversarial template → isolated pod crash; saga retries.
- Verify spike 10x baseline → rate limit + CDN absorb; alert on breach.