Failure Modes

:::info Source
Sourced from services/notification-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Provider Outage

  • Fail over to secondary (SES → SendGrid).
  • Retry up to 24h then DLQ.
  • Alert P2.

1.2 Template Render Error

  • Missing variable → fallback to plain message ("An update is available").
  • Syntax error in template → reject at save; prevent deploy.
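The two template rules above can be sketched as follows. This is a minimal illustration that assumes `string.Template` syntax (`$name`); the service's real template engine is not stated in this document, and the function names `validate_template` and `render_or_fallback` are hypothetical.

```python
from string import Template

# Plain fallback message quoted in the scenario above.
FALLBACK_MESSAGE = "An update is available"


class _AnyVariable(dict):
    """Mapping that supplies an empty string for any variable name."""

    def __missing__(self, key: str) -> str:
        return ""


def validate_template(template_text: str) -> bool:
    """Save-time check: return False if the template has a syntax error."""
    try:
        # Every variable resolves, so only malformed placeholders raise.
        Template(template_text).substitute(_AnyVariable())
        return True
    except ValueError:
        return False


def render_or_fallback(template_text: str, variables: dict) -> str:
    """Send-time render: a missing variable degrades to the plain message."""
    try:
        return Template(template_text).substitute(variables)
    except KeyError:
        return FALLBACK_MESSAGE
```

A save handler would refuse any template for which `validate_template` returns `False`, so only missing-variable errors can still occur at send time.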

1.3 Bounce / Suppression

  • Update suppression list; future sends blocked.
  • Notify sender of high-bounce address.

1.4 Rate Limit Hit (Provider)

  • Queue + retry with backoff.
  • Per-tenant throttle if sustained.

1.5 SMS Toll Fraud

  • Detect unusual pattern (e.g., many sends to premium-rate numbers) → block + alert.
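One possible shape for the detection rule is a per-tenant sliding window over sends to premium-rate numbers. Everything in this sketch is an assumption for illustration: the prefix list, the threshold, the window length, and the `TollFraudDetector` name are not taken from the service.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

# Illustrative prefix only; a real list depends on the countries served.
PREMIUM_PREFIXES = ("+1900",)


class TollFraudDetector:
    """Blocks a tenant once too many premium-rate sends occur in a window."""

    def __init__(self, max_premium_sends: int, window_seconds: float,
                 alert: Callable[[str], None]) -> None:
        self.max_premium_sends = max_premium_sends
        self.window_seconds = window_seconds
        self.alert = alert
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, tenant_id: str, number: str, now: float) -> bool:
        """Return True if the send may proceed; block and alert otherwise."""
        if not number.startswith(PREMIUM_PREFIXES):
            return True
        recent = self._recent[tenant_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_premium_sends:
            self.alert(tenant_id)
            return False
        recent.append(now)
        return True
```

Passing `now` explicitly keeps the detector deterministic in tests; a caller would normally supply `time.monotonic()`.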

1.6 Push Token Invalid

  • Provider returns "invalid token" → remove from user devices; re-register on next app launch.
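A minimal sketch of the cleanup step, assuming device tokens are held in a per-user set. The `handle_push_result` name and the in-memory `devices` mapping are hypothetical, and the error string is the one quoted above rather than any specific provider's error code.

```python
from typing import Dict, Optional, Set

INVALID_TOKEN_ERROR = "invalid token"  # error text as quoted in this document


def handle_push_result(devices: Dict[str, Set[str]], user_id: str,
                       token: str, provider_error: Optional[str]) -> bool:
    """Remove the token if the provider rejected it; return True if removed."""
    if provider_error != INVALID_TOKEN_ERROR:
        return False
    user_tokens = devices.get(user_id, set())
    if token not in user_tokens:
        return False
    # The app registers a fresh token on its next launch.
    user_tokens.discard(token)
    return True
```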

1.7 Digest Batch OOM

  • Stream rendering; batch size capped.
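The mitigation can be illustrated with generators, so that at most one batch of items is held in memory at a time. This is a sketch under assumptions: `MAX_BATCH_SIZE`, `iter_batches`, and `render_digest` are hypothetical names, and the cap value is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator, List

MAX_BATCH_SIZE = 500  # hypothetical cap; the real value is not documented here


def iter_batches(items: Iterable[str],
                 batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items without loading all items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def render_digest(items: Iterable[str],
                  batch_size: int = MAX_BATCH_SIZE) -> Iterator[str]:
    """Stream rendered digest lines instead of building one large string."""
    for batch in iter_batches(items, batch_size):
        for item in batch:
            yield f"- {item}\n"
```

Because `render_digest` is itself a generator, its output can be written to the outgoing message chunk by chunk.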

1.8 Webhook Event Lost

  • Provider retries; event.id dedup.
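Because the provider retries delivery, the same event can arrive more than once, and deduplicating on the event ID makes handling idempotent. The sketch below uses an in-memory set for brevity; this is an assumption, and a production version would more likely use a persistent store with a uniqueness constraint. The `WebhookDeduplicator` name and the `"id"` key are hypothetical.

```python
from typing import Any, Callable, Dict, Set


class WebhookDeduplicator:
    """Runs the handler once per event ID, ignoring redelivered events."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def handle(self, event: Dict[str, Any],
               handler: Callable[[Dict[str, Any]], None]) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False
        handler(event)
        # Mark as seen only after success, so a failed handler run is
        # processed again when the provider retries.
        self._seen.add(event_id)
        return True
```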

2. Retry / Backoff

| Op | Max | Backoff |
| --- | --- | --- |
| Provider send | 5 | 1s, 10s, 1m, 10m, 1h |
| Postgres | 3 | 10ms–200ms |
| Outbox | infinite | exp cap 5m |
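The provider-send and outbox rows can be expressed as delay functions. This sketch assumes "Max" counts retries and that the outbox backoff doubles from a 1-second base; the base and the function names are assumptions, and only the listed delays and the 5-minute cap come from the table.

```python
from typing import Optional

# Provider send: fixed schedule of 1s, 10s, 1m, 10m, 1h (in seconds).
PROVIDER_SEND_DELAYS = [1, 10, 60, 600, 3600]

OUTBOX_BASE_SECONDS = 1.0   # hypothetical starting delay
OUTBOX_CAP_SECONDS = 300.0  # "exp cap 5m"


def provider_send_delay(retry_index: int) -> Optional[int]:
    """Delay before retry number retry_index (0-based), or None when exhausted."""
    if 0 <= retry_index < len(PROVIDER_SEND_DELAYS):
        return PROVIDER_SEND_DELAYS[retry_index]
    return None


def outbox_delay(retry_index: int) -> float:
    """Exponential backoff capped at five minutes; retries never run out."""
    # Clamp the exponent so very large retry counts cannot overflow a float.
    exponent = min(max(retry_index, 0), 32)
    return min(OUTBOX_CAP_SECONDS, OUTBOX_BASE_SECONDS * 2 ** exponent)
```

A caller treats `None` from `provider_send_delay` as the signal to stop retrying the send.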

3. Circuit Breakers

Each provider has its own breaker: 20 failures within 30s open it for 120s. While the breaker is open, traffic auto-fails over to the secondary.
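A minimal breaker with these thresholds might look like the following. It reads the rule as "20 failures inside a 30-second window open the breaker for 120 seconds", which is an interpretation; the `CircuitBreaker` class and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    """Opens after too many failures in a sliding window, then recloses."""

    def __init__(self, failure_threshold: int = 20,
                 window_seconds: float = 30.0,
                 open_seconds: float = 120.0) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def record_failure(self, now: float) -> None:
        """Register a failed call and open the breaker at the threshold."""
        self._failures.append(now)
        # Keep only failures that fall inside the sliding window.
        while now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
            self._failures.clear()

    def allow(self, now: float) -> bool:
        """Return True if calls to this provider may proceed."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.open_seconds:
            self._opened_at = None
            return True
        return False
```

When `allow` returns `False`, the caller routes the send to the secondary provider instead of calling the primary.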

4. Fallbacks

| Primary | Fallback |
| --- | --- |
| SES | SendGrid |
| Twilio | Vonage |
| FCM | Web push / in-app |
| AI copy | Static template |
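The delivery rows of the table can be read as an ordered fallback chain per primary. The sketch below assumes each channel is a callable that raises on failure; `send_with_fallback` and the `senders` mapping are hypothetical, and the AI copy row is left out because it concerns content rather than a delivery channel.

```python
from typing import Callable, Dict, List, Optional

# Fallback order taken from the table above.
FALLBACKS: Dict[str, List[str]] = {
    "SES": ["SendGrid"],
    "Twilio": ["Vonage"],
    "FCM": ["Web push", "in-app"],
}


def send_with_fallback(primary: str, message: str,
                       senders: Dict[str, Callable[[str], None]]) -> str:
    """Try the primary, then each fallback; return the channel that worked."""
    last_error: Optional[Exception] = None
    for channel in [primary, *FALLBACKS.get(primary, [])]:
        try:
            senders[channel](message)
            return channel
        except Exception as error:  # any failure moves on to the next channel
            last_error = error
    raise RuntimeError(f"all channels failed for {primary}") from last_error
```

Returning the channel name lets the caller record which provider actually delivered the message.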

5. Chaos

  • Kill primary provider → failover.
  • Template syntax error → caught at deploy.
  • Duplicate event → single notification (inbox dedup).