
FAILURE_MODES — notification-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER

Strategic anchors: 02 Enterprise Architecture §14 Reliability · 04 Event-Driven §11 DLQ

This document enumerates the realistic failure modes for notification-service and the runbook for each. Severity follows the platform incident scale (P1: customer-impacting now; P2: customer-impacting soon or partial; P3: degradation, no immediate impact; P4: cosmetic).


1. Vendor (channel adapter) failures

F-NTF-01 — Single vendor returns 5xx / times out

Symptoms: rising notif.failed_total{vendor=…,reason=vendor_unreachable|timeout}; alert NotifVendorDegraded fires.

Auto-handling:

  1. Dispatch worker retries with exponential backoff (2^attempt s, capped at 1024 s; honours Retry-After).
  2. After 3 consecutive failures on a Channel, status flips active → degraded; emits channel.health_changed.v1.
  3. After 5 consecutive failures or 50 % failure rate over 10 min, status flips degraded → down. Worker switches to fallbackVendor if configured.
  4. If no fallback, transactional notifications keep retrying within their TTL (24 h default); marketing fails fast.
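
The retry delay and health transitions above can be sketched as follows. This is a minimal illustration, not the service's code: the function names are hypothetical, attempts are assumed to be 0-based, and a Retry-After value is assumed to override a shorter computed delay.

```python
from typing import Optional

MAX_BACKOFF_S = 1024  # cap stated in step 1


def backoff_delay(attempt: int, retry_after_s: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (assumed 0-based)."""
    delay = min(2 ** attempt, MAX_BACKOFF_S)
    # Assumption: honouring Retry-After means waiting at least that long.
    if retry_after_s is not None:
        delay = max(delay, retry_after_s)
    return float(delay)


def next_status(status: str, consecutive_failures: int, failure_rate_10m: float) -> str:
    """Channel health transition described in steps 2 and 3."""
    if status == "degraded" and (consecutive_failures >= 5 or failure_rate_10m >= 0.5):
        return "down"
    if status == "active" and consecutive_failures >= 3:
        return "degraded"
    return status
```

With these assumptions the delay sequence is 1 s, 2 s, 4 s, and so on until it reaches the 1024 s cap at attempt 10.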

Runbook (P2):

  1. Confirm vendor status page; correlate with reservation throughput.
  2. Verify fallbackVendor activation in Grafana; if not configured, enable per-tenant via API.
  3. If outage > 30 min: page tenant ops; consider disabling marketing categories temporarily.
  4. After recovery, prober flips status back to active; backlog should drain within 15 min.

F-NTF-02 — Vendor returns 4xx terminal (invalid recipient, bad sender, template not approved)

Auto-handling:

  • Per-error mapping (e.g., Twilio 21211 → invalid_recipient → suppression; SendGrid 401 → vendor_credential_invalid → halt channel + page).
  • Per-recipient terminal failures emit failed.v1 and add a SuppressionRecord.
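
A per-error mapping of this kind might look like the sketch below. Only the two vendor codes named above come from this document; the action labels, the `Outcome` type and the default for unmapped codes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    reason: str  # value used in the reason label of notif.failed_total
    action: str  # hypothetical action label, not a real identifier


# Only the two entries named in the text; a real table would be larger.
TERMINAL_ERRORS = {
    ("twilio", "21211"): Outcome("invalid_recipient", "suppress_recipient"),
    ("sendgrid", "401"): Outcome("vendor_credential_invalid", "halt_channel_and_page"),
}

# Assumption: unmapped terminal codes fail the single notification only.
DEFAULT_OUTCOME = Outcome("unmapped_terminal", "fail_notification")


def classify(vendor: str, code) -> Outcome:
    """Map a vendor error code to a failure reason and follow-up action."""
    return TERMINAL_ERRORS.get((vendor.lower(), str(code)), DEFAULT_OUTCOME)
```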

Runbook (P3): review the delivery_attempts audit; if the cause is a tenant misconfiguration (e.g., unverified sender), notify the tenant via in-app + email.

F-NTF-03 — WhatsApp template not approved

Symptoms: notif.failed_total{reason='whatsapp_template_not_approved'} non-zero.

Auto-handling:

  • For transactional categories with an SMS counterpart in the trigger map, fall back to SMS automatically; emit a warning.
  • For marketing, defer the send until template approval; alert tenant admin via in-app.

Runbook (P3):

  1. Run POST /api/v1/notification-channels/{id}/whatsapp-templates/sync to refresh approval status from Meta.
  2. If still pending after 24 h, advise tenant to revise the template per Meta policy.

2. Datastore failures

F-NTF-04 — Cloud SQL primary failover

Symptoms: connection errors spike for 30–60 s; PgBouncer reconnects.

Auto-handling:

  • Application uses Cloud SQL Connector with built-in retry; PgBouncer recycles connections.
  • Workers idempotently re-pick rows after reconnect.
  • WS feed reconnects automatically (clients have backoff).

Runbook (P2):

  1. Confirm via Cloud SQL Operations dashboard.
  2. Watch notif.outbox.lag_seconds — should recover within 2 min.
  3. If lag > 5 min, scale outbox-relay; if > 10 min, manually fail over reads to a replica.

F-NTF-05 — Memorystore Redis failover

Symptoms: rate-limit cache + suppression-set cache miss → fallback to DB lookup; latency spike.

Auto-handling:

  • Suppression check falls back to SuppressionRepository.isSuppressed against Postgres (slower but correct).
  • Rate-limit counters fall back to Postgres counters (notification_rate_counters table — degraded mode); accuracy is best-effort during the gap.
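
The cache-then-database fallback for the suppression check can be sketched as below. The callables stand in for the Redis lookup and SuppressionRepository.isSuppressed; their signatures, and the use of ConnectionError to signal a failover, are assumptions.

```python
from typing import Callable, Optional


def is_suppressed(
    recipient: str,
    cache_lookup: Callable[[str], Optional[bool]],
    db_lookup: Callable[[str], bool],
) -> bool:
    """Return the suppression state, preferring the cache when it answers."""
    try:
        cached = cache_lookup(recipient)  # None means cache miss
    except ConnectionError:
        cached = None  # failover in progress: treat as a miss
    if cached is not None:
        return cached
    # Slower but correct path against Postgres.
    return db_lookup(recipient)
```

The point of the design is that a cache outage costs latency only, never correctness: the database remains the source of truth.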

Runbook (P2):

  1. Wait for Memorystore HA failover (≤ 60 s).
  2. Cache warmer rehydrates trigger map and template caches automatically (TTL 30 s).

F-NTF-06 — Postgres partition pruning gap (pg_partman lag)

Symptoms: writes for a future date fail with a constraint violation or "partition does not exist" errors.

Auto-handling: a scheduled partition-maintainer cron job runs daily and keeps 14 months of future partitions created ahead of time.

Runbook (P3): manually CALL partman.run_maintenance_proc(); investigate why the cron failed (Cloud Scheduler logs).


3. Event bus failures

F-NTF-07 — Pub/Sub publish unavailable

Symptoms: notif.outbox.lag_seconds rising; outbox.publish_attempts rising; alert NotifOutboxLag fires.

Auto-handling:

  • Outbox-relay backs off and retries; rows accumulate.
  • API/Worker writes still succeed (outbox is in-DB).

Runbook (P1):

  1. Confirm GCP status; if widespread, follow GCP guidance.
  2. Watch outbox depth; the system catches up automatically on recovery.
  3. If outage > 60 min, scale outbox-relay (maxInst raised) so the catch-up completes within SLO.

F-NTF-08 — Pub/Sub subscriber stuck (high redelivery)

Symptoms: pubsub.consumer.messages_total{outcome='nack'} rising; ack rate dropping.

Auto-handling:

  • After 5 redeliveries, message routes to DLQ (melmastoon.dlq.notif.<consumer>).

Runbook (P2):

  1. Inspect DLQ in GET /api/v1/internal/dlq.
  2. Identify root cause (often: a malformed upstream event or a regression in the consumer).
  3. Patch + deploy; replay DLQ via POST /api/v1/internal/dlq/{id}/retry.

F-NTF-09 — Consumed event ordering violation (rare)

Symptoms: state appears to "jump back" (e.g., delivered arrives before dispatched).

Auto-handling:

  • Use case explicitly validates state-machine transitions; out-of-order arrivals are buffered (re-enqueued with delay) up to 60 s, then dropped with audit.
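
The transition check and 60 s buffering window can be sketched like this. The state names beyond dispatched and delivered, the transition table and the return labels are assumptions for illustration only.

```python
MAX_BUFFER_S = 60  # buffering window stated above

# Assumed transition table; only "dispatched" and "delivered" appear in the text.
ALLOWED = {
    "queued": {"dispatched", "failed"},
    "dispatched": {"delivered", "failed"},
}


def handle_event(current_state: str, incoming_state: str, age_s: float) -> str:
    """Decide what to do with an inbound status event.

    age_s is the time since the event was first seen by the consumer.
    """
    if incoming_state in ALLOWED.get(current_state, set()):
        return "apply"
    if age_s < MAX_BUFFER_S:
        return "requeue_with_delay"  # predecessor may still arrive
    return "drop_with_audit"
```

For example, a delivered event that arrives while the notification is still queued is re-enqueued until dispatched has been applied or the 60 s window expires.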

Runbook (P3): investigate per-aggregate ordering at the upstream service; ordering keys should ensure correctness.


4. Webhook failures

F-NTF-10 — Vendor webhook flood

Symptoms: ingestion 99th-percentile latency rising; notif.webhook.received_total >> baseline.

Auto-handling:

  • Cloud Armor throttles to 1000 rps/vendor.
  • Persist-then-process pattern means we acknowledge with 204 quickly; the queue absorbs spikes.
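
A minimal sketch of persist-then-process, assuming an in-memory list in place of the webhook_inbound table. The class name and the 'received' and 'processed' status values are hypothetical.

```python
from typing import Callable, Dict, List


class WebhookIngest:
    """Store the raw webhook first, do the expensive work later."""

    def __init__(self) -> None:
        self.inbox: List[Dict] = []  # stands in for webhook_inbound rows

    def receive(self, raw_body: bytes) -> int:
        # Hot path: persist only, then answer the vendor immediately.
        self.inbox.append({"body": raw_body, "status": "received"})
        return 204

    def process_pending(self, handler: Callable[[bytes], None]) -> int:
        # Background path: correlate and apply each stored webhook once.
        processed = 0
        for row in self.inbox:
            if row["status"] == "received":
                handler(row["body"])
                row["status"] = "processed"
                processed += 1
        return processed
```

Because the request handler does no correlation work, a vendor flood grows the stored backlog instead of request latency.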

Runbook (P2): scale notification-api; confirm Cloud Armor rule is active; reach out to vendor if accidental retransmit storm.

F-NTF-11 — HMAC mismatch surge

Symptoms: notif.webhook.received_total{signature_valid=false} rising; alert NotifWebhookHmacFailures.

Possible causes:

  • Vendor key rotated without coordination.
  • Spoofing attempt.

Runbook (P1):

  1. Verify vendor signing key in Secret Manager; rotate if needed.
  2. If spoofing, add Cloud Armor block on offending source.
  3. Audit recent webhook_inbound.status='rejected' rows.
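
For reference when debugging mismatches, a signature check generally has the shape below. This sketch assumes HMAC-SHA256 with a hex-encoded signature over the raw body; each vendor defines its own algorithm, encoding and signed payload, so treat these details as placeholders.

```python
import hashlib
import hmac
from typing import Iterable


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def signature_valid(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison against the expected signature."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def any_key_valid(secrets: Iterable[bytes], body: bytes, signature_hex: str) -> bool:
    """Accept any currently active key, e.g. during a rotation overlap."""
    return any(signature_valid(s, body, signature_hex) for s in secrets)
```

If a vendor rotates its key without coordination, every request fails the first check, which is what distinguishes that cause from a spoofing attempt coming from a small set of sources.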

F-NTF-12 — Late-correlation backlog

Symptoms: notif.webhook.late_correlation_total rising > 5 % of inbound.

Possible causes: dispatch attempt persisted vendorMessageId after the webhook arrived (race); or the notification belongs to a tenant that was deleted mid-flight.

Runbook (P3): tune dispatch flush ordering; for orphan webhooks older than 24 h, archive to DLQ for manual review.


5. Application failures

F-NTF-13 — Render error spike

Symptoms: notif.render.errors_total rising; transactional sends failing.

Possible causes:

  • A tenant published a template version with bad Handlebars/MJML.
  • A platform-global template change broke variable expectations.
  • A locale's body has invalid bidi chars.

Auto-handling:

  • Per-notification render error → failed.v1; per-template threshold (10 errors / 5 min) → automatic rollback to previous active version + alert.
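
The per-template threshold can be read as a sliding-window counter. The sketch below is an assumption about how such a check could work (the class name is hypothetical, and "10 errors / 5 min" is interpreted as 10 or more errors inside any 300 s window).

```python
from collections import deque


class RenderErrorWindow:
    """Tracks render errors for one template over a sliding window."""

    def __init__(self, threshold: int = 10, window_s: float = 300.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()

    def record(self, now_s: float) -> bool:
        """Record one error; return True when rollback should trigger."""
        self._events.append(now_s)
        # Drop errors that fell out of the window.
        while self._events and now_s - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold
```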

Runbook (P2):

  1. Inspect failed notifications; identify offending templateVersion.
  2. Roll back manually via POST /api/v1/notification-templates/{id}/versions/{prevVersionId}/publish.
  3. Author + publish a fix.

F-NTF-14 — Sender-id missing for a tenant

Symptoms: notif.failed_total{reason='sender_id_missing'} for one tenant.

Runbook (P3): tenant-admin notification (auto); guide them through registering a sender id.

F-NTF-15 — Rate-limit tripped at the per-recipient bucket

Symptoms: end-user complains a confirmation never arrived; logs show suppressed reason='rate_limit'.

Runbook (P3): explain to tenant; review per-recipient daily cap; raise it per tenant policy if needed.

F-NTF-16 — Outbox row stuck "unpublished"

Symptoms: a single outbox.id retains published_at IS NULL indefinitely; last_error persistent.

Runbook (P2):

  1. Inspect last_error — usually a schema validation issue from a code change.
  2. Patch + redeploy; manually re-attempt via UPDATE outbox SET publish_attempts=0, last_error=null WHERE id=….
  3. As a last resort, archive the row to an outbox_skipped table with audit.

6. AI integration failures

F-NTF-17 — Orchestrator timeout / unavailable

Symptoms: notif.ai.fallback_to_deterministic_total{reason='timeout|5xx'} rising.

Auto-handling: fall back to deterministic template (functional correctness preserved); UI surfaces a "personalisation unavailable" hint.

Runbook (P3): coordinate with ai-orchestrator-service on-call.

F-NTF-18 — Safety-rejection rate spike

Symptoms: notif.ai.safety_rejections_total{reason} rate > 1 % over 30 min; alert fires.

Runbook (P2):

  1. Check whether a tenant is sending unusual variables.
  2. Engage prompt/policy team if the orchestrator's safety classifier mis-flagged legitimate content.

F-NTF-19 — HITL queue stalled

Symptoms: notif.ai.hitl_queue_depth{tenant=…} > 200 for > 1 h.

Runbook (P3): notify tenant ops; consider time-bound auto-fail-safe per AI_INTEGRATION §4.3.


7. Security & compliance failures

F-NTF-20 — Suspected cross-tenant leak

Severity: P1 (security incident).

Triggers: alert NotifCrossTenantSuspect (synthetic check writes a canary row in tenant A, attempts to read from tenant B; should always return zero rows; non-zero pages on-call).

Runbook:

  1. Lock the affected tenant ids by setting tenant.notificationPolicy.suspendOutbound=true.
  2. Snapshot DB for forensic review.
  3. Engage security on-call; trigger DPO process for breach notification timer.

F-NTF-21 — Webhook signing-key compromise

Severity: P1.

Runbook:

  1. Rotate the vendor secret immediately via POST /api/v1/notification-channels/{id}/credentials/{credId}/rotate.
  2. Disable any prior overlapping versions (skip the 24 h overlap window in this incident).
  3. Audit webhook_inbound for the rotation period.

F-NTF-22 — Plaintext PII in logs

Severity: P2.

Trigger: nightly log scan finds an email/phone shape in our log stream.

Runbook:

  1. Identify the offending log line; correlate to code path.
  2. Patch + deploy.
  3. Run targeted log redaction (Cloud Logging exclusion + sink update).
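
A shape-based scan like the nightly one might use patterns along these lines. The regular expressions are illustrative assumptions (a loose email shape and an E.164-like phone shape), not the scanner's actual rules, and they will produce both false positives and false negatives.

```python
import re
from typing import List

# Loose email shape: local part, "@", domain with a TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# E.164-like phone shape: "+" followed by 8 to 15 digits.
PHONE_RE = re.compile(r"\+\d{8,15}\b")


def find_pii(line: str) -> List[str]:
    """Return email-shaped and phone-shaped substrings found in a log line."""
    return EMAIL_RE.findall(line) + PHONE_RE.findall(line)


def redact(line: str) -> str:
    """Replace matches with placeholders before the line is stored."""
    return PHONE_RE.sub("[phone]", EMAIL_RE.sub("[email]", line))
```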

F-NTF-23 — Mobile-key delivery failure

Severity: P1 (guest cannot enter their room).

Symptoms: notif.failed_total{template.key='mobile_key.issued.*'} non-zero.

Auto-handling:

  • Trigger map fallback chain: WhatsApp → SMS → Email.
  • If all channels fail, emit notification.failed.v1 so lock-integration-service can re-issue the credential or fall back to a mechanical key.
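
The fallback chain amounts to trying each channel in order and stopping at the first success. In this sketch the `send` callable and the function name are hypothetical; only the channel order comes from the trigger map described above.

```python
from typing import Callable, Optional, Sequence

FALLBACK_CHAIN = ("whatsapp", "sms", "email")  # order from the trigger map


def deliver_with_fallback(
    send: Callable[[str], bool],
    chain: Sequence[str] = FALLBACK_CHAIN,
) -> Optional[str]:
    """Return the channel that delivered, or None if every channel failed."""
    for channel in chain:
        if send(channel):
            return channel
    # Caller would emit notification.failed.v1 at this point.
    return None
```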

Runbook (P1):

  1. Front-desk receives an automated alert in backoffice ("mobile-key not delivered for guest X — please verify channel availability or use mechanical").
  2. Engineer follows F-NTF-01 if a vendor is the cause.

8. Sync & client failures

F-NTF-24 — Desktop replica diverges

Symptoms: staff sees stale notification status; pull cursor invalid.

Auto-handling: client falls back to a full sync (see SYNC_CONTRACT §8).

F-NTF-25 — WS feed connection storm

Symptoms: notif.feed.ws.active_connections 2× baseline; alert NotifWSConnectionsSurge.

Possible causes: client retry storm after a regional outage.

Runbook (P2): scale notification-api; consider adding jittered reconnection guidance to clients.
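
Jittered reconnection guidance for clients could follow the "full jitter" pattern sketched below. The base delay, cap and function name are assumptions, not values the clients currently use.

```python
import random


def reconnect_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0, rng=random) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2^attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Spreading each client's delay uniformly over the interval prevents clients that disconnected together from reconnecting in synchronized waves.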


9. Worker / scheduler failures

F-NTF-26 — Scheduler tick missed

Symptoms: worker.tick.outcome_total{worker='scheduler'} flat for > 2 ticks.

Auto-handling: Cloud Run recovers; on next tick the scheduler queries notification_scheduled for all overdue rows and processes them.

Runbook (P3): confirm Cloud Run revision health; check whether a noisy neighbour is causing instances to be killed.

F-NTF-27 — Late dispatch (queue backlog)

Symptoms: alert NotifDispatchBacklog (count > 1000 pending > 60 s).

Runbook (P1):

  1. Identify which channel.
  2. Scale the corresponding notification-worker-<channel> (maxInst bump).
  3. Investigate whether it's a vendor degradation (F-NTF-01) or a self-induced bottleneck.

10. DR (regional) failures

F-NTF-28 — Regional outage

Severity: P1.

Runbook:

  1. Confirm scope via GCP status page.
  2. Failover Cloud SQL to DR region (manual; promotes the cross-region replica).
  3. Update routing rules to point the affected tenants' regional backend to the DR region.
  4. Re-issue Cloud NAT egress IPs (update vendor allowlists if needed — typically reserved IPs persist).
  5. Monitor SLOs; expect 30 min recovery, 2 h full restoration.
  6. Post-incident: PIR within 5 business days; permanent fixes tracked in SERVICE_RISK_REGISTER.

11. DLQ runbook

GET /api/v1/internal/dlq lists DLQ entries by consumer. For each:

  1. Inspect payload + error.
  2. If transient (e.g., DB blip) → POST /api/v1/internal/dlq/{id}/retry.
  3. If a deterministic bug → patch, deploy, then retry.
  4. If poison message (malformed event from a misbehaving producer) → coordinate with producer team; if data is unsalvageable, archive to dlq_archived with audit.

DLQ growth alert (NotifPubSubDLQGrowing) pages on-call directly.


12. Backpressure rules

When backlogs grow, we shed lower-priority work to protect critical paths:

Backlog signal | Action
queued count > 50 000 across all channels | pause marketing dispatch (workers skip rows where category='marketing'); alert
queued count > 200 000 | additionally pause reminder; only transactional+security flow
Outbox lag > 30 s | refuse new notifications/batch requests with 429
Vendor concurrency exhausted | increase per-channel concurrency cap; if at max, queue with smaller channel parallelism

These rules are encoded in BackpressureController and toggled by Memorystore-backed feature flags so they take effect within seconds.
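
The queue-depth and outbox-lag rules can be expressed as a pure decision function. This is a sketch of the logic only: the `Decision` type and `evaluate` name are hypothetical and do not describe the real BackpressureController interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    paused_categories: frozenset  # categories workers should skip
    reject_new_batches: bool      # answer notifications/batch with 429


def evaluate(queued_total: int, outbox_lag_s: float) -> Decision:
    """Apply the shedding thresholds listed above."""
    paused = set()
    if queued_total > 50_000:
        paused.add("marketing")
    if queued_total > 200_000:
        paused.add("reminder")  # only transactional + security continue
    return Decision(frozenset(paused), outbox_lag_s > 30)
```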


13. Data corruption & rollback

If a deploy corrupts data (e.g., a faulty migration writes wrong status):

  1. Halt the rollout (auto if SLO breach; manual otherwise).
  2. Roll back to previous Cloud Run revision (Cloud Deploy one-click).
  3. Restore Cloud SQL with PITR to a point in time before the deploy.
  4. Replay outbox rows that were never published (safe — receivers dedupe by event id).
  5. Communicate with impacted tenants per the platform incident comms standard.
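
Replaying outbox rows is safe only because receivers ignore event ids they have already applied. A minimal in-memory sketch of that behaviour follows; the class name is hypothetical, and a real receiver would keep the seen ids in durable storage.

```python
from typing import Any, List, Set


class DedupingReceiver:
    """Applies each event at most once, keyed by event id."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.applied: List[Any] = []

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.applied.append(payload)
        return True
```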

14. Communication during incidents

  • P1: pages on-call within 1 min; status page update within 10 min; tenant comms (in-app + email to tenant admins) within 30 min.
  • P2: ticket; status page if customer-impacting > 30 min.
  • P3: ticket only.
  • Post-incident review (PIR) within 5 business days for P1/P2; remediation actions tracked in SERVICE_RISK_REGISTER.