FAILURE_MODES — notification-service
Sibling: APPLICATION_LOGIC · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER
Strategic anchors: 02 Enterprise Architecture §14 Reliability · 04 Event-Driven §11 DLQ
This document enumerates the realistic failure modes for notification-service and the runbook for each. Severity follows the platform incident scale (P1: customer-impacting now; P2: customer-impacting soon or partial; P3: degradation, no immediate impact; P4: cosmetic).
1. Vendor (channel adapter) failures
F-NTF-01 — Single vendor returns 5xx / times out
Symptoms: rising notif.failed_total{vendor=…,reason=vendor_unreachable|timeout}; alert NotifVendorDegraded fires.
Auto-handling:
- Dispatch worker retries with exponential backoff (2^attempt s, capped at 1024 s; honours Retry-After).
- After 3 consecutive failures on a Channel, status flips active → degraded; emits channel.health_changed.v1.
- After 5 consecutive failures or 50 % failure rate over 10 min, status flips degraded → down. Worker switches to fallbackVendor if configured.
- If no fallback, transactional notifications keep retrying within their TTL (24 h default); marketing fails fast.
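The backoff and health thresholds above can be sketched as two small pure functions. This is a minimal illustration, not the service's actual code: the function names are hypothetical, and it assumes that a vendor Retry-After value takes precedence whenever it is longer than the computed delay.

```python
from typing import Optional

MAX_BACKOFF_S = 1024  # cap stated in the auto-handling rules


def retry_delay_s(attempt: int, retry_after_s: Optional[float] = None) -> float:
    """Delay before the next attempt: 2^attempt seconds, capped at 1024 s.

    Assumption: a vendor-supplied Retry-After wins when it exceeds the computed delay.
    """
    delay = float(min(2 ** attempt, MAX_BACKOFF_S))
    if retry_after_s is not None:
        delay = max(delay, retry_after_s)
    return delay


def next_status(status: str, consecutive_failures: int, failure_rate_10m: float) -> str:
    """Channel health transition using the thresholds listed above."""
    if status == "active" and consecutive_failures >= 3:
        return "degraded"
    if status == "degraded" and (consecutive_failures >= 5 or failure_rate_10m >= 0.5):
        return "down"
    return status
```

Keeping both rules free of I/O makes the thresholds easy to unit-test independently of the dispatch worker.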
Runbook (P2):
- Confirm vendor status page; correlate with reservation throughput.
- Verify fallbackVendor activation in Grafana; if not configured, enable per-tenant via API.
- If outage > 30 min: page tenant ops; consider disabling marketing categories temporarily.
- After recovery, prober flips status back to active; backlog should drain within 15 min.
F-NTF-02 — Vendor returns 4xx terminal (invalid recipient, bad sender, template not approved)
Auto-handling:
- Per-error mapping (e.g., Twilio 21211 → invalid_recipient → suppression; SendGrid 401 → vendor_credential_invalid → halt channel + page).
- Per-recipient terminal failures emit failed.v1 and add a SuppressionRecord.
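A per-error mapping of this kind is often just a lookup table. The sketch below includes only the two examples named above; the Disposition type, the action strings, and the default for unmapped codes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disposition:
    reason: str  # value used in notif.failed_total{reason=...}
    action: str  # hypothetical action label


# Only the two mappings mentioned in this document; a real table would be larger.
VENDOR_ERROR_MAP = {
    ("twilio", "21211"): Disposition("invalid_recipient", "suppress_recipient"),
    ("sendgrid", "401"): Disposition("vendor_credential_invalid", "halt_channel_and_page"),
}

# Assumption: unmapped terminal errors fail the notification without side effects.
DEFAULT = Disposition("unmapped_terminal", "fail_notification")


def classify(vendor: str, code: object) -> Disposition:
    """Map a vendor error code to a failure reason and follow-up action."""
    return VENDOR_ERROR_MAP.get((vendor.lower(), str(code)), DEFAULT)
```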
Runbook (P3): review delivery_attempts audit; if a tenant misconfiguration (e.g., unverified sender), notify them via in-app + email.
F-NTF-03 — WhatsApp template not approved
Symptoms: notif.failed_total{reason='whatsapp_template_not_approved'} non-zero.
Auto-handling:
- For transactional categories with an SMS counterpart in the trigger map, fall back to SMS automatically; emit a warning.
- For marketing, defer the send until template approval; alert tenant admin via in-app.
Runbook (P3):
- Run POST /api/v1/notification-channels/{id}/whatsapp-templates/sync to refresh approval status from Meta.
- If still pending after 24 h, advise tenant to revise the template per Meta policy.
2. Datastore failures
F-NTF-04 — Cloud SQL primary failover
Symptoms: connection errors spike for 30–60 s; pgbouncer reconnects.
Auto-handling:
- Application uses Cloud SQL Connector with built-in retry; PgBouncer recycles connections.
- Workers idempotently re-pick rows after reconnect.
- WS feed reconnects automatically (clients have backoff).
Runbook (P2):
- Confirm via Cloud SQL Operations dashboard.
- Watch notif.outbox.lag_seconds; it should recover within 2 min.
- If lag > 5 min, scale outbox-relay; if > 10 min, manually fail over reads to a replica.
F-NTF-05 — Memorystore Redis failover
Symptoms: rate-limit cache + suppression-set cache miss → fallback to DB lookup; latency spike.
Auto-handling:
- Suppression check falls back to SuppressionRepository.isSuppressed against Postgres (slower but correct).
- Rate-limit counters fall back to Postgres counters (notification_rate_counters table, degraded mode); accuracy is best-effort during the gap.
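The suppression fallback can be pictured as a cache-first lookup that treats both a cache miss and an unreachable cache as "ask the database". The callables below are hypothetical stand-ins for the Redis client and for SuppressionRepository.isSuppressed; the use of ConnectionError for an unreachable cache is an assumption.

```python
from typing import Callable, Optional


def is_suppressed(
    recipient: str,
    cache_get: Callable[[str], Optional[bool]],   # returns None on a cache miss
    db_is_suppressed: Callable[[str], bool],      # authoritative Postgres lookup
) -> bool:
    """Cache-first suppression check with a database fallback."""
    try:
        cached = cache_get(recipient)
    except ConnectionError:
        cached = None  # cache unreachable, e.g. during a failover
    if cached is not None:
        return cached
    return db_is_suppressed(recipient)


def unavailable_cache(recipient: str) -> Optional[bool]:
    """Simulates the cache during a failover window."""
    raise ConnectionError("cache unavailable")
```

Because the database answer is authoritative, the fallback trades latency for correctness, which matches the "slower but correct" note above.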
Runbook (P2):
- Wait for Memorystore HA failover (≤ 60 s).
- Cache warmer rehydrates trigger map and template caches automatically (TTL 30 s).
F-NTF-06 — Postgres partition pruning gap (pg_partman lag)
Symptoms: writes to a future date fail with a constraint violation or a "partition does not exist" error.
Auto-handling: the scheduled partition-maintainer cron job runs daily and keeps partitions created 14 months ahead.
Runbook (P3): manually CALL partman.run_maintenance_proc(); investigate why the cron failed (Cloud Scheduler logs).
3. Event bus failures
F-NTF-07 — Pub/Sub publish unavailable
Symptoms: notif.outbox.lag_seconds rising; outbox.publish_attempts rising; alert NotifOutboxLag fires.
Auto-handling:
- Outbox-relay backs off and retries; rows accumulate.
- API/Worker writes still succeed (outbox is in-DB).
Runbook (P1):
- Confirm GCP status; if widespread, follow GCP guidance.
- Watch outbox depth; the system catches up automatically on recovery.
- If outage > 60 min, scale outbox-relay (maxInst raised) so the catch-up completes within SLO.
F-NTF-08 — Pub/Sub subscriber stuck (high redelivery)
Symptoms: pubsub.consumer.messages_total{outcome='nack'} rising; ack rate dropping.
Auto-handling:
- After 5 redeliveries, message routes to DLQ (melmastoon.dlq.notif.<consumer>).
Runbook (P2):
- Inspect the DLQ via GET /api/v1/internal/dlq.
- Identify root cause (often: a malformed upstream event or a regression in the consumer).
- Patch + deploy; replay DLQ via POST /api/v1/internal/dlq/{id}/retry.
F-NTF-09 — Consumed event ordering violation (rare)
Symptoms: state appears to "jump back" (e.g., delivered arrives before dispatched).
Auto-handling:
- Use case explicitly validates state-machine transitions; out-of-order arrivals are buffered (re-enqueued with delay) up to 60 s, then dropped with audit.
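The validate-then-buffer behaviour can be sketched as a single decision function. The transition table is a hypothetical subset (only dispatched and delivered are named in this document), and the returned labels are illustrative.

```python
# Hypothetical transition table; the real state machine lives in APPLICATION_LOGIC.
ALLOWED = {
    "queued": {"dispatched", "failed"},
    "dispatched": {"delivered", "failed"},
    "delivered": set(),
    "failed": set(),
}

MAX_BUFFER_S = 60  # out-of-order events are buffered for up to 60 s


def decide(current: str, incoming: str, age_s: float) -> str:
    """Return what to do with an incoming status event.

    age_s is the time since the event was first seen by this consumer.
    """
    if incoming in ALLOWED.get(current, set()):
        return "apply"
    if age_s < MAX_BUFFER_S:
        return "requeue_with_delay"
    return "drop_with_audit"
```

For example, a delivered event that arrives while the notification is still queued is re-enqueued until the dispatched event lands or the 60 s budget runs out.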
Runbook (P3): investigate per-aggregate ordering at the upstream service; ordering keys should ensure correctness.
4. Webhook failures
F-NTF-10 — Vendor webhook flood
Symptoms: ingestion 99th-percentile latency rising; notif.webhook.received_total >> baseline.
Auto-handling:
- Cloud Armor throttles to 1000 rps/vendor.
- The persist-then-process pattern means we respond with 204 quickly; the queue absorbs spikes.
Runbook (P2): scale notification-api; confirm Cloud Armor rule is active; reach out to vendor if accidental retransmit storm.
F-NTF-11 — HMAC mismatch surge
Symptoms: notif.webhook.received_total{signature_valid=false} rising; alert NotifWebhookHmacFailures.
Possible causes:
- Vendor key rotated without coordination.
- Spoofing attempt.
Runbook (P1):
- Verify vendor signing key in Secret Manager; rotate if needed.
- If spoofing, add Cloud Armor block on offending source.
- Audit recent webhook_inbound.status='rejected' rows.
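For context on how an uncoordinated key rotation produces this alert, here is a minimal sketch of signature verification that accepts any currently active secret. It assumes an HMAC-SHA256 signature sent as a hex string; real vendors each define their own scheme, so the function names and the encoding are illustrative only.

```python
import hashlib
import hmac
from typing import Iterable


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def signature_valid(active_secrets: Iterable[bytes], body: bytes, received: str) -> bool:
    """True if the received signature matches any active secret.

    Comparing as bytes with hmac.compare_digest keeps the check constant-time
    and avoids a TypeError on non-ASCII header values.
    """
    received_bytes = received.encode("utf-8")
    return any(
        hmac.compare_digest(sign(secret, body).encode("ascii"), received_bytes)
        for secret in active_secrets
    )
```

If the vendor starts signing with a key that is not in the active set, every webhook fails this check, which is indistinguishable from spoofing until the key in Secret Manager is verified.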
F-NTF-12 — Late-correlation backlog
Symptoms: notif.webhook.late_correlation_total rising > 5 % of inbound.
Possible causes: dispatch attempt persisted vendorMessageId after the webhook arrived (race); or the notification belongs to a tenant that was deleted mid-flight.
Runbook (P3): tune dispatch flush ordering; for orphan webhooks > 24h, archive to DLQ for manual review.
5. Application failures
F-NTF-13 — Render error spike
Symptoms: notif.render.errors_total rising; transactional sends failing.
Possible causes:
- A tenant published a template version with bad Handlebars/MJML.
- A platform-global template change broke variable expectations.
- A locale's body has invalid bidi chars.
Auto-handling:
- Per-notification render error → failed.v1; per-template threshold (10 errors / 5 min) → automatic rollback to previous active version + alert.
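The per-template threshold can be sketched as a sliding window of error timestamps. This is an illustration under assumptions: the class name is hypothetical, timestamps are plain seconds, and the rollback is taken to trigger when the tenth error lands inside the window.

```python
from collections import deque


class RenderErrorWindow:
    """Tracks render errors for one template over a sliding time window."""

    def __init__(self, threshold: int = 10, window_s: float = 300.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._errors = deque()  # timestamps of recent errors, oldest first

    def record(self, now_s: float) -> bool:
        """Record an error at now_s; return True when a rollback should trigger."""
        self._errors.append(now_s)
        # Discard errors that have aged out of the window.
        while self._errors and self._errors[0] <= now_s - self.window_s:
            self._errors.popleft()
        return len(self._errors) >= self.threshold
```

A window per template keeps one tenant's broken template from rolling back anyone else's.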
Runbook (P2):
- Inspect failed notifications; identify offending templateVersion.
- Roll back manually via POST /api/v1/notification-templates/{id}/versions/{prevVersionId}/publish.
- Author + publish a fix.
F-NTF-14 — Sender-id missing for a tenant
Symptoms: notif.failed_total{reason='sender_id_missing'} for one tenant.
Runbook (P3): tenant-admin notification (auto); guide them through registering a sender id.
F-NTF-15 — Rate-limit tripped at the per-recipient bucket
Symptoms: end-user complains a confirmation never arrived; logs show suppressed reason='rate_limit'.
Runbook (P3): explain to tenant; review per-recipient daily cap; raise it per tenant policy if needed.
F-NTF-16 — Outbox row stuck "unpublished"
Symptoms: a single outbox.id retains published_at IS NULL indefinitely; last_error persistent.
Runbook (P2):
- Inspect last_error; it is usually a schema validation issue from a code change.
- Patch + redeploy; manually re-attempt via UPDATE outbox SET publish_attempts=0, last_error=null WHERE id=….
- As a last resort, archive the row to an outbox_skipped table with audit.
6. AI integration failures
F-NTF-17 — Orchestrator timeout / unavailable
Symptoms: notif.ai.fallback_to_deterministic_total{reason='timeout|5xx'} rising.
Auto-handling: fall back to deterministic template (functional correctness preserved); UI surfaces a "personalisation unavailable" hint.
Runbook (P3): coordinate with ai-orchestrator-service on-call.
F-NTF-18 — Safety-rejection rate spike
Symptoms: notif.ai.safety_rejections_total{reason} > 1 %/30 min; alert.
Runbook (P2):
- Check whether a tenant is sending unusual variables.
- Engage prompt/policy team if the orchestrator's safety classifier mis-flagged legitimate content.
F-NTF-19 — HITL queue stalled
Symptoms: notif.ai.hitl_queue_depth{tenant=…} > 200 for > 1 h.
Runbook (P3): notify tenant ops; consider time-bound auto-fail-safe per AI_INTEGRATION §4.3.
7. Security & compliance failures
F-NTF-20 — Suspected cross-tenant leak
Severity: P1 (security incident).
Triggers: alert NotifCrossTenantSuspect (synthetic check writes a canary row in tenant A, attempts to read from tenant B; should always return zero rows; non-zero pages on-call).
Runbook:
- Lock the affected tenant ids by setting tenant.notificationPolicy.suspendOutbound=true.
- Snapshot DB for forensic review.
- Engage security on-call; trigger DPO process for breach notification timer.
F-NTF-21 — Webhook signing-key compromise
Severity: P1.
Runbook:
- Rotate the vendor secret immediately via POST /api/v1/notification-channels/{id}/credentials/{credId}/rotate.
- Disable any prior overlapping versions (skip the 24 h overlap window in this incident).
- Audit webhook_inbound for the rotation period.
F-NTF-22 — Plaintext PII in logs
Severity: P2.
Trigger: nightly log scan finds an email/phone shape in our log stream.
Runbook:
- Identify the offending log line; correlate to code path.
- Patch + deploy.
- Run targeted log redaction (Cloud Logging exclusion + sink update).
F-NTF-23 — Mobile-key delivery failure
Severity: P1 (guest cannot enter their room).
Symptoms: notif.failed_total{template.key='mobile_key.issued.*'} non-zero.
Auto-handling:
- Trigger map fallback chain: WhatsApp → SMS → Email.
- If all channels fail, emit notification.failed.v1 so lock-integration-service can re-issue the credential or fall back to a mechanical key.
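The fallback chain above amounts to trying each channel in order and stopping at the first success. In this sketch the sender callables are hypothetical, and it assumes a sender reports failure by returning False rather than raising.

```python
from typing import Callable, Dict, Optional, Sequence

MOBILE_KEY_CHAIN = ("whatsapp", "sms", "email")  # order from the trigger map


def deliver_with_fallback(
    senders: Dict[str, Callable[[], bool]],
    chain: Sequence[str] = MOBILE_KEY_CHAIN,
) -> Optional[str]:
    """Try each channel in order; return the channel that succeeded, or None.

    A None result is the point where the caller would emit notification.failed.v1.
    """
    for channel in chain:
        sender = senders.get(channel)
        if sender is None:
            continue  # channel not configured for this tenant
        if sender():
            return channel
    return None
```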
Runbook (P1):
- Front-desk receives an automated alert in backoffice ("mobile-key not delivered for guest X — please verify channel availability or use mechanical").
- Engineer follows F-NTF-01 if a vendor is the cause.
8. Sync & client failures
F-NTF-24 — Desktop replica diverges
Symptoms: staff sees stale notification status; pull cursor invalid.
Auto-handling: client falls back to a full sync (see SYNC_CONTRACT §8).
F-NTF-25 — WS feed connection storm
Symptoms: notif.feed.ws.active_connections 2× baseline; alert NotifWSConnectionsSurge.
Possible causes: client retry storm after a regional outage.
Runbook (P2): scale notification-api; consider adding jittered reconnection guidance to clients.
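One common form of jittered reconnection is "full jitter": each client waits a random time between zero and an exponentially growing ceiling. The sketch below is a suggestion rather than the documented client behaviour; the base and cap values are assumptions.

```python
import random


def reconnect_delay_s(
    attempt: int,
    rng: random.Random,
    base_s: float = 1.0,   # assumed base delay
    cap_s: float = 60.0,   # assumed maximum delay
) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap_s, base_s * (2 ** min(attempt, 16)))
    return rng.uniform(0.0, ceiling)
```

Spreading reconnects across the whole interval avoids the synchronized waves that a fixed or purely exponential delay produces after a regional outage.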
9. Worker / scheduler failures
F-NTF-26 — Scheduler tick missed
Symptoms: worker.tick.outcome_total{worker='scheduler'} flat for > 2 ticks.
Auto-handling: Cloud Run recovers; on next tick the scheduler queries notification_scheduled for all overdue rows and processes them.
Runbook (P3): confirm Cloud Run revision health; check for pod-killer noisy neighbour.
F-NTF-27 — Late dispatch (queue backlog)
Symptoms: alert NotifDispatchBacklog (count > 1000 pending > 60 s).
Runbook (P1):
- Identify which channel.
- Scale the corresponding notification-worker-<channel> (maxInst bump).
- Investigate whether it's a vendor degradation (F-NTF-01) or a self-induced bottleneck.
10. DR (regional) failures
F-NTF-28 — Regional outage
Severity: P1.
Runbook:
- Confirm scope via GCP status page.
- Fail over Cloud SQL to the DR region (manual; promotes the cross-region replica).
- Update routing rules to point the affected tenants' regional backend to the DR region.
- Re-issue Cloud NAT egress IPs (update vendor allowlists if needed — typically reserved IPs persist).
- Monitor SLOs; expect 30 min recovery, 2 h full restoration.
- Post-incident: PIR within 5 business days; permanent fixes tracked in SERVICE_RISK_REGISTER.
11. DLQ runbook
GET /api/v1/internal/dlq lists DLQ entries by consumer. For each:
- Inspect payload + error.
- If transient (e.g., DB blip) → POST /api/v1/internal/dlq/{id}/retry.
- If a deterministic bug → patch, deploy, then retry.
- If poison message (malformed event from a misbehaving producer) → coordinate with producer team; if data is unsalvageable, archive to dlq_archived with audit.
DLQ growth alert (NotifPubSubDLQGrowing) pages on-call directly.
12. Backpressure rules
When backlogs grow, we shed lower-priority work to protect critical paths:
| Backlog signal | Action |
|---|---|
| queued count > 50 000 across all channels | pause marketing dispatch (workers skip rows where category='marketing'); alert |
| queued count > 200 000 | additionally pause reminder; only transactional+security flow |
| Outbox lag > 30 s | refuse new notifications/batch requests with 429 |
| Vendor concurrency exhausted | increase per-channel concurrency cap; if at max, queue with smaller channel parallelism |
These rules are encoded in BackpressureController and toggled by Memorystore-backed feature flags so they take effect within seconds.
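The threshold rows of the table can be read as a pure function from backlog signals to shedding actions. This is a sketch only: the action labels and function names are hypothetical and do not describe BackpressureController's real interface, and the vendor-concurrency row is omitted because it depends on per-channel state.

```python
from typing import Set


def backpressure_actions(queued_total: int, outbox_lag_s: float) -> Set[str]:
    """Derive shedding actions from the backlog thresholds in the table."""
    actions: Set[str] = set()
    if queued_total > 50_000:
        actions.add("pause_marketing")
    if queued_total > 200_000:
        actions.add("pause_reminder")
    if outbox_lag_s > 30:
        actions.add("reject_batch_429")
    return actions


def dispatchable(category: str, actions: Set[str]) -> bool:
    """Whether a worker should pick up a row of the given category."""
    if category == "marketing":
        return "pause_marketing" not in actions
    if category == "reminder":
        return "pause_reminder" not in actions
    return True  # transactional and security always flow
```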
13. Data corruption & rollback
If a deploy corrupts data (e.g., a faulty migration writes wrong status):
- Halt the rollout (auto if SLO breach; manual otherwise).
- Roll back to previous Cloud Run revision (Cloud Deploy one-click).
- Restore Cloud SQL with PITR to a point in time before the deploy.
- Replay outbox rows that were never published (safe — receivers dedupe by event id).
- Communicate with impacted tenants per the platform incident comms standard.
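Replaying unpublished outbox rows is safe only because receivers ignore event ids they have already processed. A minimal in-memory sketch of that dedupe follows; a real receiver would persist the seen ids, and the class name is hypothetical.

```python
from typing import Any, List, Set


class IdempotentReceiver:
    """Applies each event at most once, keyed by event id."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.applied: List[Any] = []

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.applied.append(payload)
        return True
```

With this in place, replaying a row that was in fact already published results in a no-op on the receiving side.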
14. Communication during incidents
- P1: pages on-call within 1 min; status page update within 10 min; tenant comms (in-app + email to tenant admins) within 30 min.
- P2: ticket; status page if customer-impacting > 30 min.
- P3: ticket only.
- Post-incident review (PIR) within 5 business days for P1/P2; remediation actions tracked in SERVICE_RISK_REGISTER.