Failure Modes
:::info Source
Sourced from services/notification-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Provider Outage
- Fail over to the secondary provider (SES → SendGrid).
- Retry for up to 24h, then route the message to the dead-letter queue (DLQ).
- Raise a P2 alert.
1.2 Template Render Error
- Missing variable → fall back to the plain message ("An update is available").
- Syntax error in template → reject at save; prevent deploy.
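The two behaviours above can be sketched as follows. The source does not name the template engine, so Python's standard `string.Template` stands in for it here; `render_or_fallback`, `is_valid_template`, and `_AnyValue` are hypothetical names used only for illustration.

```python
from string import Template

FALLBACK_MESSAGE = "An update is available"  # plain message from the scenario above


class _AnyValue(dict):
    """Mapping that supplies an empty string for any placeholder name."""

    def __missing__(self, key):
        return ""


def render_or_fallback(template_text: str, variables: dict) -> str:
    """Render the template; use the plain message if a variable is missing."""
    try:
        return Template(template_text).substitute(variables)
    except KeyError:
        # A placeholder had no value: degrade to the generic message.
        return FALLBACK_MESSAGE


def is_valid_template(template_text: str) -> bool:
    """Save-time check: False if the template contains a malformed placeholder."""
    try:
        # Every name resolves via _AnyValue, so only syntax errors can raise.
        Template(template_text).substitute(_AnyValue())
    except ValueError:
        return False
    return True
```

A save handler would call `is_valid_template` and refuse to persist the template when it returns `False`, so that only missing-variable failures remain possible at send time.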
1.3 Bounce / Suppression
- Update the suppression list so that future sends to the address are blocked.
- Notify the sender of the high-bounce address.
1.4 Rate Limit Hit (Provider)
- Queue the message and retry with backoff.
- Apply a per-tenant throttle if the limit is hit repeatedly.
1.5 SMS Toll Fraud
- Detect unusual patterns (e.g., many sends to premium-rate numbers) → block and alert.
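One way to detect such a pattern is a sliding-window count of sends to premium-rate numbers. The sketch below is illustrative only: the prefixes, the threshold, the window length, and the choice to block at tenant level are all assumptions, not values from the source.

```python
from collections import defaultdict, deque

PREMIUM_PREFIXES = ("+1900", "+4487")  # hypothetical premium-rate prefixes
MAX_PREMIUM_SENDS = 3                  # hypothetical threshold per window
WINDOW_SECONDS = 60.0                  # hypothetical window length


class TollFraudGuard:
    """Blocks a tenant that sends too many premium-rate SMS in a short window."""

    def __init__(self):
        self._recent = defaultdict(deque)  # tenant -> timestamps of premium sends
        self.blocked = set()

    def allow(self, tenant: str, number: str, now: float) -> bool:
        if tenant in self.blocked:
            return False
        if not number.startswith(PREMIUM_PREFIXES):
            return True
        window = self._recent[tenant]
        window.append(now)
        # Drop timestamps that have fallen out of the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_PREMIUM_SENDS:
            # Block the tenant; a real service would also raise an alert here.
            self.blocked.add(tenant)
            return False
        return True
```

Passing `now` explicitly keeps the guard deterministic in tests; production code would likely read a monotonic clock instead.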
1.6 Push Token Invalid
- Provider returns "invalid token" → remove the token from the user's devices; the app re-registers on next launch.
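A minimal sketch of the cleanup step, assuming send results arrive as a mapping from token to a status string. The `"invalid_token"` status value and the `prune_invalid_tokens` helper are hypothetical; real providers report this condition with their own error codes.

```python
def prune_invalid_tokens(devices: dict, user_id: str, results: dict) -> list:
    """Remove tokens the provider reported as invalid; return the removed tokens.

    devices: user id -> set of registered push tokens
    results: token -> status string from the last send attempt
    """
    removed = [token for token, status in results.items() if status == "invalid_token"]
    # Users without registered devices are simply skipped.
    devices.get(user_id, set()).difference_update(removed)
    return removed
```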
1.7 Digest Batch OOM
- Render digests as a stream and cap the batch size.
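The idea can be illustrated with generators, which keep at most one batch in memory at a time. The cap of 500 and the function names are assumptions made for this sketch, and the rendering step is a placeholder.

```python
from itertools import islice
from typing import Iterable, Iterator

MAX_BATCH_SIZE = 500  # hypothetical cap; the source gives no number


def capped_batches(items: Iterable, size: int = MAX_BATCH_SIZE) -> Iterator[list]:
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def render_digests(recipients: Iterable[str], size: int = MAX_BATCH_SIZE) -> Iterator[str]:
    """Render one digest at a time so memory use is bounded by the batch size."""
    for batch in capped_batches(recipients, size):
        for recipient in batch:
            yield f"Digest for {recipient}"  # placeholder for real rendering
```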
1.8 Webhook Event Lost
- Rely on provider retries for redelivery; deduplicate on event.id.
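Because provider retries can deliver the same event more than once, the handler needs to be idempotent. The sketch below assumes each webhook payload carries its identifier under an `"id"` key and uses an in-memory set as a stand-in for whatever durable store the service actually uses; `WebhookDeduplicator` is a hypothetical name.

```python
class WebhookDeduplicator:
    """Processes each webhook event at most once, keyed by its id."""

    def __init__(self):
        self._seen = set()  # stand-in for a durable store

    def handle(self, event: dict, process) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False
        process(event)
        # Record the id only after success, so a failed attempt can be retried.
        self._seen.add(event_id)
        return True
```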
2. Retry / Backoff
| Operation | Max retries | Backoff |
|---|---|---|
| Provider send | 5 | 1s, 10s, 1m, 10m, 1h |
| Postgres | 3 | 10ms–200ms |
| Outbox | infinite | exponential, capped at 5m |
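The two schedule styles in the table can be expressed as small delay functions. This sketch reads the five provider-send delays as the waits before each of five retries, and assumes a 1-second base with doubling for the outbox; neither the base nor the growth factor is stated in the source.

```python
from typing import Optional

PROVIDER_SEND_DELAYS = [1, 10, 60, 600, 3600]  # seconds: 1s, 10s, 1m, 10m, 1h


def provider_send_delay(retry: int) -> Optional[int]:
    """Delay in seconds before the given retry (1-based); None once exhausted."""
    if 1 <= retry <= len(PROVIDER_SEND_DELAYS):
        return PROVIDER_SEND_DELAYS[retry - 1]
    return None


def outbox_delay(retry: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff capped at 5 minutes; base and factor are assumptions."""
    # Clamp the exponent so an unbounded retry count cannot overflow.
    exponent = min(max(retry - 1, 0), 32)
    return min(cap, base * 2 ** exponent)
```

A caller that receives `None` from `provider_send_delay` would stop retrying and hand the message to whatever comes next in the failure path.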
3. Circuit Breakers
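Each provider has its own circuit breaker: 20 failures within 30s open the circuit for 120s. While the circuit is open, traffic fails over automatically to the secondary provider.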
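A minimal sketch of a breaker with these thresholds (20 failures in 30 s, open for 120 s) is shown below. It is a simplification rather than the service's implementation: there is no half-open probing state, and the current time is passed in explicitly so the behaviour is easy to test.

```python
from collections import deque


class CircuitBreaker:
    """Opens after `max_failures` failures within `window` seconds."""

    def __init__(self, max_failures: int = 20, window: float = 30.0, open_for: float = 120.0):
        self.max_failures = max_failures
        self.window = window
        self.open_for = open_for
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = None

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Forget failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
            self._failures.clear()

    def is_open(self, now: float) -> bool:
        if self._opened_at is None:
            return False
        if now - self._opened_at >= self.open_for:
            self._opened_at = None  # cool-down elapsed: close the circuit
            return False
        return True
```

The sending path would check `is_open` before calling the primary provider and route to the secondary whenever it returns `True`.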
4. Fallbacks
| Primary | Fallback |
|---|---|
| SES | SendGrid |
| Twilio | Vonage |
| FCM | Web push / in-app |
| AI copy | Static template |
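The provider rows of the table can be treated as a lookup that the send path walks when a call fails. The sketch covers only the two one-to-one provider pairs (email and SMS); the `send_with_fallback` function and the shape of the `senders` mapping are hypothetical.

```python
FALLBACKS = {"SES": "SendGrid", "Twilio": "Vonage"}  # provider rows from the table


def send_with_fallback(primary: str, message: str, senders: dict) -> str:
    """Try the primary provider, then its fallback; return the provider that sent.

    senders: provider name -> callable(message) that raises on failure
    """
    provider = primary
    tried = []
    while provider is not None:
        tried.append(provider)
        try:
            senders[provider](message)
            return provider
        except Exception:
            # Broad catch keeps the sketch short; real code would catch
            # provider-specific errors only.
            provider = FALLBACKS.get(provider)
    raise RuntimeError(f"all providers failed: {tried}")
```

For example, with an SES sender that raises and a SendGrid sender that succeeds, `send_with_fallback("SES", "hi", senders)` returns `"SendGrid"`.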
5. Chaos
- Kill the primary provider → traffic fails over to the secondary.
- Introduce a template syntax error → caught at deploy.
- Send a duplicate event → a single notification is delivered (inbox dedup).