Notification Service — Failure Modes

Status: populated · Owner: Platform Engineering + SRE · Last updated: 2026-04-18

| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| 1 | SendGrid API down | Emails not delivered | NotifEmailDeliveryFailed alert; notification_log.status=FAILED | Retry 3 attempts (5s/30s/2min); log FAILED; ops investigates; no re-queue (event already ACKed) |
| 2 | sms-orchestrator down | Notification SMS not delivered | NotifSmsDeliveryFailed alert | Retry 3 attempts; log FAILED; email channel unaffected |
| 3 | PG primary down | Log writes fail; consumer NAKs | NotifPgErrors alert | NATS NAK; redelivery when PG recovers; notification may be delayed |
| 4 | Template missing for event type | Notification not sent | NotifTemplateMissing alert | Log FAILED with templateId=null; alert ops to create template |
| 5 | Template render error (invalid MJML) | HTML email not sent; plain-text fallback sent instead | NotifTemplateRenderError log | Use bodyText fallback; alert ops to fix template |
| 6 | Recipient not found (auth-service 404) | Notification not sent | NotifRecipientNotFound log | Log SUPPRESSED; event ACKed; ops verifies user record |
| 7 | auth-service unavailable | All recipient lookups fail | NotifAuthServiceErrors alert | NATS NAK with backoff; event replayed when auth-service recovers |
| 8 | NATS consumer lag grows | Delayed notifications | Consumer lag metric alert | Scale consumer pods; investigate upstream event burst |
| 9 | Duplicate event delivery (NATS redelivery) | Duplicate notification sent to recipient | source_event_id check in notification_log | Check for existing SENT log entry; suppress duplicate |
| 10 | S3 presigned URL in invoice email already expired | Customer clicks broken link | Link expiry set to 7 days in invoice email template | Invoice email generated within 1 min of invoice FINALIZED; 7-day link TTL sufficient |
| 11 | Opt-out table missing entry (new category) | Preference not respected | notification_preferences default is opted_out=false (recipients are opted in by default) | New categories treated as opted-in; expected behavior |
| 12 | SendGrid rate limit | Bulk of notifications delayed | SendGrid 429 response | Retry with exponential backoff; low notification volume makes this unlikely |
| 13 | CRITICAL system alert not delivered | Platform admin not notified | NotifSystemAlertFailed alert | SMS + email dual channel; retry; escalate to PagerDuty if both fail |
| 14 | Template version mismatch (variables_schema changed) | Render fails mid-delivery | NotifTemplateRenderError | variablesSchema validated on template save; CI schema test |
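
The ACK/NAK and retry rules in rows 1, 3, 6, 7, and 9 can be read as one decision flow. The following is a minimal sketch of that flow, not the service's actual code. The names `handle_event`, `deliver_with_retry`, `Outcome`, and `DependencyUnavailable` are hypothetical, and dependencies are passed in as plain callables. It assumes that "3 attempts (5s/30s/2min)" means one initial attempt followed by up to three retries with those delays, and that a suppressed duplicate writes no new log row; the table does not state either.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Set


class Action(Enum):
    ACK = "ack"  # event is consumed; NATS will not redeliver it
    NAK = "nak"  # ask NATS to redeliver the event later


class LogStatus(Enum):
    SENT = "SENT"
    FAILED = "FAILED"
    SUPPRESSED = "SUPPRESSED"


class DependencyUnavailable(Exception):
    """Hypothetical error: auth-service or PG is unreachable (rows 3 and 7)."""


@dataclass
class Outcome:
    action: Action
    status: Optional[LogStatus]  # None when no log row is written


# Retry delays from row 1 (5s/30s/2min), in seconds.
RETRY_DELAYS_S = (5, 30, 120)


def deliver_with_retry(
    send: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> LogStatus:
    """Try once, then retry after each delay; FAILED if every attempt fails."""
    if send():
        return LogStatus.SENT
    for delay in RETRY_DELAYS_S:
        sleep(delay)
        if send():
            return LogStatus.SENT
    return LogStatus.FAILED


def handle_event(
    event_id: str,
    sent_ids: Set[str],
    lookup_recipient: Callable[[str], Optional[str]],
    send: Callable[[str], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> Outcome:
    # Row 9: a redelivered event that already has a SENT entry is suppressed.
    if event_id in sent_ids:
        return Outcome(Action.ACK, None)
    try:
        recipient = lookup_recipient(event_id)
    except DependencyUnavailable:
        # Rows 3 and 7: NAK so the event is redelivered after recovery.
        return Outcome(Action.NAK, None)
    if recipient is None:
        # Row 6: recipient not found (404) is logged SUPPRESSED and ACKed.
        return Outcome(Action.ACK, LogStatus.SUPPRESSED)
    status = deliver_with_retry(lambda: send(recipient), sleep)
    if status is LogStatus.SENT:
        sent_ids.add(event_id)
    # Row 1: a FAILED delivery is still ACKed, so it is never re-queued.
    return Outcome(Action.ACK, status)
```

Passing `sleep` as a parameter lets a test record the delays instead of waiting for them. In this sketch `sent_ids` stands in for the `source_event_id` lookup against `notification_log`.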