Notification Service — Failure Modes

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18

| # | Failure | User impact | Detection | Mitigation |
|---|---------|-------------|-----------|------------|
| 1 | SendGrid API down | Emails not delivered | NotifEmailDeliveryFailed alert; notification_log.status=FAILED | Retry 3 attempts (5s/30s/2min); log FAILED; ops investigates; no re-queue (event already ACKed) |
| 2 | sms-orchestrator down | Notification SMS not delivered | NotifSmsDeliveryFailed alert | Retry 3 attempts; log FAILED; email channel unaffected |
| 3 | PG primary down | Log writes fail; consumer NAKs | NotifPgErrors alert | NATS NAK; redelivery when PG recovers; notification may be delayed |
| 4 | Template missing for event type | Notification not sent | NotifTemplateMissing alert | Log FAILED with templateId=null; alert ops to create template |
| 5 | Template render error (invalid MJML) | HTML email not sent; plain-text fallback sent instead | NotifTemplateRenderError log | Use bodyText fallback; alert ops to fix template |
| 6 | Recipient not found (auth-service 404) | Notification not sent | NotifRecipientNotFound log | Log SUPPRESSED; event ACKed; ops verifies user record |
| 7 | auth-service unavailable | All recipient lookups fail | NotifAuthServiceErrors alert | NATS NAK with backoff; event replayed when auth-service recovers |
| 8 | NATS consumer lag grows | Delayed notifications | Consumer lag metric alert | Scale consumer pods; investigate upstream event burst |
| 9 | Duplicate event delivery (NATS redelivery) | Duplicate notification sent to recipient | source_event_id check in notification_log | Check for existing SENT log entry; suppress duplicate |
| 10 | S3 presigned URL in invoice email already expired | Customer clicks broken link | Link expiry set to 7 days in invoice email template | Invoice email generated within 1 min of invoice FINALIZED; 7-day link TTL sufficient |
| 11 | Opt-out table missing entry (new category) | Preference not respected | notification_preferences default is opted_out=false (opt-in by default) | New categories treated as opted-in; expected behavior |
| 12 | SendGrid rate limit | Bulk of notifications delayed | SendGrid 429 response | Retry with exponential backoff; low notification volume makes this unlikely |
| 13 | CRITICAL system alert not delivered | Platform admin not notified | NotifSystemAlertFailed alert | SMS + email dual channel; retry; escalate to PagerDuty if both fail |
| 14 | Template version mismatch (variables_schema changed) | Render fails mid-delivery | NotifTemplateRenderError | variablesSchema validated on template save; CI schema test |
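
The ACK/NAK behavior described in rows 1, 3, 7, and 9 can be sketched as a single handler. This is a minimal illustration, not the service's actual code. The names `handle_event`, `send_with_retries`, `NotificationLog`, `DependencyUnavailable`, and `DeliveryError` are hypothetical, and the in-memory dictionary stands in for the notification_log table. It also assumes that "Retry 3 attempts (5s/30s/2min)" means one initial attempt followed by up to three retries, and it returns ACK only after the attempts finish, whereas the table only states that the event has already been ACKed by the time delivery fails.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence

# Delays from row 1 (5s/30s/2min), expressed in seconds.
RETRY_DELAYS_SECONDS = (5, 30, 120)


class DependencyUnavailable(Exception):
    """Hypothetical: a dependency such as auth-service or PG is down (rows 3, 7)."""


class DeliveryError(Exception):
    """Hypothetical: the provider (SendGrid, sms-orchestrator) rejected or failed a send."""


@dataclass
class NotificationLog:
    """In-memory stand-in for notification_log, keyed by source_event_id."""
    status_by_event: Dict[str, str] = field(default_factory=dict)


def send_with_retries(
    send: Callable[[], None],
    sleep: Callable[[float], None],
    delays: Sequence[float] = RETRY_DELAYS_SECONDS,
) -> bool:
    """Attempt once, then retry after each delay. Return True on success."""
    try:
        send()
        return True
    except DeliveryError:
        pass
    for delay in delays:
        sleep(delay)
        try:
            send()
            return True
        except DeliveryError:
            continue
    return False


def handle_event(
    event_id: str,
    lookup_recipient: Callable[[], str],
    send: Callable[[str], None],
    log: NotificationLog,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "ACK" or "NAK" for one consumed event."""
    # Row 9: a redelivered event that already has a SENT entry is suppressed.
    if log.status_by_event.get(event_id) == "SENT":
        return "ACK"
    # Row 7: if the lookup dependency is down, NAK so NATS redelivers later.
    try:
        recipient = lookup_recipient()
    except DependencyUnavailable:
        return "NAK"
    # Row 1: exhaust the retries, log the outcome, and do not re-queue.
    delivered = send_with_retries(lambda: send(recipient), sleep)
    log.status_by_event[event_id] = "SENT" if delivered else "FAILED"
    return "ACK"
```

Passing `sleep` as a parameter keeps the retry schedule testable without real waiting. A production consumer would also need to handle the SUPPRESSED case from row 6 and the PG write failure from row 3, which this sketch leaves out.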