FAILURE_MODES — notification-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER

Strategic anchors: 02 Enterprise Architecture §14 Reliability · 04 Event-Driven §11 DLQ

This document enumerates the realistic failure modes for notification-service and the runbook for each. Severity follows the platform incident scale (P1: customer-impacting now; P2: customer-impacting soon or partial; P3: degradation, no immediate impact; P4: cosmetic).

1. Vendor (channel adapter) failures

F-NTF-01 — Single vendor returns 5xx / times out

Symptoms: rising notif.failed_total{vendor=…,reason=vendor_unreachable|timeout}; alert NotifVendorDegraded fires.

Auto-handling:

Dispatch worker retries with exponential backoff (2^attempt s, capped at 1024 s; honours Retry-After).
After 3 consecutive failures on a Channel, status flips active → degraded; emits channel.health_changed.v1.
After 5 consecutive failures or 50 % failure rate over 10 min, status flips degraded → down. Worker switches to fallbackVendor if configured.
If no fallback, transactional notifications keep retrying within their TTL (24 h default); marketing fails fast.
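The retry and health rules above can be sketched as follows. This is a minimal illustration, not the worker's actual code: the function names are hypothetical, Retry-After is assumed to act as a lower bound on the delay, and a channel is assumed to move one status step per evaluation.

```python
from typing import Optional

MAX_DELAY_SECONDS = 1024  # cap stated in the retry policy


def retry_delay_seconds(attempt: int, retry_after: Optional[float] = None) -> float:
    """Delay before the next attempt: 2^attempt seconds, capped at 1024 s.

    Assumption: a vendor-supplied Retry-After is treated as a minimum delay.
    """
    delay = float(min(2 ** attempt, MAX_DELAY_SECONDS))
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay


def next_channel_status(status: str, consecutive_failures: int,
                        failure_rate_10m: float) -> str:
    """Apply the active -> degraded -> down thresholds, one step at a time."""
    if status == "degraded" and (consecutive_failures >= 5 or failure_rate_10m >= 0.5):
        return "down"
    if status == "active" and consecutive_failures >= 3:
        return "degraded"
    return status
```

With these assumptions, attempt 3 waits 8 s and any attempt from 10 onwards waits the full 1024 s unless the vendor asks for longer.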

Runbook (P2):

Confirm vendor status page; correlate with reservation throughput.
Verify fallbackVendor activation in Grafana; if not configured, enable per-tenant via API.
If outage > 30 min: page tenant ops; consider disabling marketing categories temporarily.
After recovery, prober flips status back to active; backlog should drain within 15 min.

F-NTF-02 — Vendor returns 4xx terminal (invalid recipient, bad sender, template not approved)

Auto-handling:

Per-error mapping (e.g., Twilio 21211 → invalid_recipient → suppression; SendGrid 401 → vendor_credential_invalid → halt channel + page).
Per-recipient terminal failures emit failed.v1 and add a SuppressionRecord.

Runbook (P3): review delivery_attempts audit; if a tenant misconfiguration (e.g., unverified sender), notify them via in-app + email.

F-NTF-03 — WhatsApp template not approved

Symptoms: notif.failed_total{reason='whatsapp_template_not_approved'} non-zero.

Auto-handling:

For transactional categories with an SMS counterpart in the trigger map, fall back to SMS automatically; emit a warning.
For marketing, defer the send until template approval; alert tenant admin via in-app.

Runbook (P3):

Run POST /api/v1/notification-channels/{id}/whatsapp-templates/sync to refresh approval status from Meta.
If still pending after 24 h, advise tenant to revise the template per Meta policy.

2. Datastore failures

F-NTF-04 — Cloud SQL primary failover

Symptoms: connection errors spike for 30–60 s; PgBouncer reconnects.

Auto-handling:

Application uses Cloud SQL Connector with built-in retry; PgBouncer recycles connections.
Workers idempotently re-pick rows after reconnect.
WS feed reconnects automatically (clients have backoff).

Runbook (P2):

Confirm via Cloud SQL Operations dashboard.
Watch notif.outbox.lag_seconds — should recover within 2 min.
If lag > 5 min, scale outbox-relay; if > 10 min, manually fail over reads to a replica.

F-NTF-05 — Memorystore Redis failover

Symptoms: rate-limit cache + suppression-set cache miss → fallback to DB lookup; latency spike.

Auto-handling:

Suppression check falls back to SuppressionRepository.isSuppressed against Postgres (slower but correct).
Rate-limit counters fall back to Postgres counters (notification_rate_counters table — degraded mode); accuracy is best-effort during the gap.
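A minimal sketch of the suppression fallback described above. The two lookups are passed in as hypothetical callables, and the built-in ConnectionError stands in for whatever exception the real Redis client raises during a failover.

```python
from typing import Callable


def is_suppressed(recipient: str,
                  cache_lookup: Callable[[str], bool],
                  db_lookup: Callable[[str], bool]) -> bool:
    """Check the cache first; fall back to the database if the cache is down."""
    try:
        return cache_lookup(recipient)
    except ConnectionError:
        # Cache unavailable (for example mid-failover): slower but correct path.
        return db_lookup(recipient)
```

The point of the design is that a cache outage degrades latency only; the answer still comes from the source of truth.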

Runbook (P2):

Wait for Memorystore HA failover (≤ 60 s).
Cache warmer rehydrates trigger map and template caches automatically (TTL 30 s).

F-NTF-06 — Postgres partition pruning gap (`pg_partman` lag)

Symptoms: writes targeting a future date fail with constraint-violation or "partition does not exist" errors.

Auto-handling: the scheduled partition-maintainer cron job runs daily and pre-creates partitions 14 months ahead.

Runbook (P3): manually CALL partman.run_maintenance_proc(); investigate why the cron failed (Cloud Scheduler logs).

3. Event bus failures

F-NTF-07 — Pub/Sub publish unavailable

Symptoms: notif.outbox.lag_seconds rising; outbox.publish_attempts rising; alert NotifOutboxLag fires.

Auto-handling:

Outbox-relay backs off and retries; rows accumulate.
API/Worker writes still succeed (outbox is in-DB).

Runbook (P1):

Confirm GCP status; if widespread, follow GCP guidance.
Watch outbox depth; the system catches up automatically on recovery.
If outage > 60 min, scale outbox-relay (maxInst raised) so the catch-up completes within SLO.

F-NTF-08 — Pub/Sub subscriber stuck (high redelivery)

Symptoms: pubsub.consumer.messages_total{outcome='nack'} rising; ack rate dropping.

Auto-handling:

After 5 redeliveries, message routes to DLQ (melmastoon.dlq.notif.<consumer>).

Runbook (P2):

Inspect the DLQ via GET /api/v1/internal/dlq.
Identify root cause (often: a malformed upstream event or a regression in the consumer).
Patch + deploy; replay DLQ via POST /api/v1/internal/dlq/{id}/retry.

F-NTF-09 — Consumed event ordering violation (rare)

Symptoms: state appears to "jump back" (e.g., delivered arrives before dispatched).

Auto-handling:

Use case explicitly validates state-machine transitions; out-of-order arrivals are buffered (re-enqueued with delay) up to 60 s, then dropped with audit.
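The buffering rule above might look roughly like this. The transition map is an assumption built from the statuses mentioned in this document (queued, dispatched, delivered, failed); the real state machine may allow other transitions.

```python
# Hypothetical transition map; the service's actual state machine may differ.
ALLOWED_TRANSITIONS = {
    "queued": {"dispatched", "failed"},
    "dispatched": {"delivered", "failed"},
}
MAX_BUFFER_SECONDS = 60  # out-of-order events are re-enqueued for up to 60 s


def decide(current_status: str, incoming_status: str, age_seconds: float) -> str:
    """Return 'apply', 'requeue', or 'drop_with_audit' for an incoming event."""
    if incoming_status in ALLOWED_TRANSITIONS.get(current_status, set()):
        return "apply"
    if age_seconds < MAX_BUFFER_SECONDS:
        # Likely arrived early; retry later once the predecessor has landed.
        return "requeue"
    return "drop_with_audit"
```

For example, a delivered event that arrives while the notification is still queued is re-enqueued until dispatched has been applied or the 60 s budget runs out.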

Runbook (P3): investigate per-aggregate ordering at the upstream service; ordering keys should ensure correctness.

4. Webhook failures

F-NTF-10 — Vendor webhook flood

Symptoms: ingestion 99th-percentile latency rising; notif.webhook.received_total >> baseline.

Auto-handling:

Cloud Armor throttles to 1000 rps/vendor.
Persist-then-process pattern means we acknowledge with 204 quickly; the queue absorbs spikes.

Runbook (P2): scale notification-api; confirm Cloud Armor rule is active; reach out to vendor if accidental retransmit storm.

F-NTF-11 — HMAC mismatch surge

Symptoms: notif.webhook.received_total{signature_valid=false} rising; alert NotifWebhookHmacFailures.

Possible causes:

Vendor key rotated without coordination.
Spoofing attempt.

Runbook (P1):

Verify vendor signing key in Secret Manager; rotate if needed.
If spoofing, add Cloud Armor block on offending source.
Audit recent webhook_inbound.status='rejected' rows.
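When debugging a mismatch surge it helps to reproduce the check by hand. The sketch below assumes HMAC-SHA256 over the raw request body with a hex-encoded signature; individual vendors use different algorithms and encodings, so treat the scheme and the function name as illustrative. Accepting several secrets models a rotation overlap window.

```python
import hashlib
import hmac
from typing import Iterable


def signature_is_valid(secrets: Iterable[bytes], body: bytes, signature_hex: str) -> bool:
    """Return True if the signature matches the body under any active secret."""
    provided = signature_hex.strip().lower().encode("utf-8")
    for secret in secrets:
        expected = hmac.new(secret, body, hashlib.sha256).hexdigest().encode("utf-8")
        # compare_digest is constant-time, which avoids leaking match position.
        if hmac.compare_digest(expected, provided):
            return True
    return False
```

If the check passes with the previous key but not the current one, the vendor has most likely not picked up the rotation yet.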

F-NTF-12 — Late-correlation backlog

Symptoms: notif.webhook.late_correlation_total rising > 5 % of inbound.

Possible causes: dispatch attempt persisted vendorMessageId after the webhook arrived (race); or the notification belongs to a tenant that was deleted mid-flight.

Runbook (P3): tune dispatch flush ordering; for orphan webhooks > 24h, archive to DLQ for manual review.

5. Application failures

F-NTF-13 — Render error spike

Symptoms: notif.render.errors_total rising; transactional sends failing.

Possible causes:

A tenant published a template version with bad Handlebars/MJML.
A platform-global template change broke variable expectations.
A locale's body has invalid bidi chars.

Auto-handling:

Per-notification render error → failed.v1; per-template threshold (10 errors / 5 min) → automatic rollback to previous active version + alert.
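The per-template threshold can be pictured as a sliding window over error timestamps. This is a sketch under assumptions: the class name is hypothetical, and reaching 10 errors (rather than exceeding 10) is taken as the trigger.

```python
from collections import deque


class RenderErrorWindow:
    """Tracks render errors for one template over a sliding time window."""

    def __init__(self, threshold: int = 10, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now`; return True if rollback should fire."""
        self._timestamps.append(now)
        cutoff = now - self.window_seconds
        # Discard errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

Keeping one window per template means a single bad tenant template rolls back on its own without affecting others.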

Runbook (P2):

Inspect failed notifications; identify offending templateVersion.
Roll back manually via POST /api/v1/notification-templates/{id}/versions/{prevVersionId}/publish.
Author + publish a fix.

F-NTF-14 — Sender-id missing for a tenant

Symptoms: notif.failed_total{reason='sender_id_missing'} for one tenant.

Runbook (P3): tenant-admin notification (auto); guide them through registering a sender id.

F-NTF-15 — Rate-limit tripped at the per-recipient bucket

Symptoms: end-user complains a confirmation never arrived; logs show suppressed reason='rate_limit'.

Runbook (P3): explain to tenant; review per-recipient daily cap; raise it per tenant policy if needed.

F-NTF-16 — Outbox row stuck "unpublished"

Symptoms: a single outbox.id retains published_at IS NULL indefinitely; last_error persistent.

Runbook (P2):

Inspect last_error — usually a schema validation issue from a code change.
Patch + redeploy; manually re-attempt via UPDATE outbox SET publish_attempts=0, last_error=null WHERE id=….
As a last resort, archive the row to an outbox_skipped table with audit.

6. AI integration failures

F-NTF-17 — Orchestrator timeout / unavailable

Symptoms: notif.ai.fallback_to_deterministic_total{reason='timeout|5xx'} rising.

Auto-handling: fall back to deterministic template (functional correctness preserved); UI surfaces a "personalisation unavailable" hint.

Runbook (P3): coordinate with ai-orchestrator-service on-call.

F-NTF-18 — Safety-rejection rate spike

Symptoms: notif.ai.safety_rejections_total{reason} > 1 %/30 min; alert.

Runbook (P2):

Check whether a tenant is sending unusual variables.
Engage prompt/policy team if the orchestrator's safety classifier mis-flagged legitimate content.

F-NTF-19 — HITL queue stalled

Symptoms: notif.ai.hitl_queue_depth{tenant=…} > 200 for > 1 h.

Runbook (P3): notify tenant ops; consider time-bound auto-fail-safe per AI_INTEGRATION §4.3.

7. Security & compliance failures

F-NTF-20 — Suspected cross-tenant leak

Severity: P1 (security incident).

Triggers: alert NotifCrossTenantSuspect (synthetic check writes a canary row in tenant A, attempts to read from tenant B; should always return zero rows; non-zero pages on-call).

Runbook:

Lock the affected tenant ids by setting tenant.notificationPolicy.suspendOutbound=true.
Snapshot DB for forensic review.
Engage security on-call; trigger DPO process for breach notification timer.

F-NTF-21 — Webhook signing-key compromise

Severity: P1.

Runbook:

Rotate the vendor secret immediately via POST /api/v1/notification-channels/{id}/credentials/{credId}/rotate.
Disable any prior overlapping versions (skip the 24 h overlap window in this incident).
Audit webhook_inbound for the rotation period.

F-NTF-22 — Plaintext PII in logs

Severity: P2.

Trigger: nightly log scan finds an email/phone shape in our log stream.

Runbook:

Identify the offending log line; correlate to code path.
Patch + deploy.
Run targeted log redaction (Cloud Logging exclusion + sink update).

F-NTF-23 — Mobile-key delivery failure

Severity: P1 (guest cannot enter their room).

Symptoms: notif.failed_total{template.key='mobile_key.issued.*'} non-zero.

Auto-handling:

Trigger map fallback chain: WhatsApp → SMS → Email.
If all channels fail, emit notification.failed.v1 so lock-integration-service can re-issue the credential or fall back to a mechanical key.
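The fallback chain above amounts to trying each channel in order and reporting failure only when every one has been exhausted. A minimal sketch, with hypothetical sender callables that return True on success:

```python
from typing import Callable, Dict, Optional, Sequence

FALLBACK_CHAIN = ("whatsapp", "sms", "email")  # order from the trigger map


def deliver_with_fallback(senders: Dict[str, Callable[[], bool]],
                          chain: Sequence[str] = FALLBACK_CHAIN) -> Optional[str]:
    """Return the channel that delivered, or None if every channel failed."""
    for channel in chain:
        send = senders.get(channel)
        if send is not None and send():
            return channel
    # Caller is expected to emit notification.failed.v1 at this point.
    return None
```

Channels missing from the mapping (for example a tenant without WhatsApp configured) are simply skipped.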

Runbook (P1):

Front-desk receives an automated alert in backoffice ("mobile-key not delivered for guest X — please verify channel availability or use mechanical").
Engineer follows F-NTF-01 if a vendor is the cause.

8. Sync & client failures

F-NTF-24 — Desktop replica diverges

Symptoms: staff sees stale notification status; pull cursor invalid.

Auto-handling: client falls back to a full sync (see SYNC_CONTRACT §8).

F-NTF-25 — WS feed connection storm

Symptoms: notif.feed.ws.active_connections 2× baseline; alert NotifWSConnectionsSurge.

Possible causes: client retry storm after a regional outage.

Runbook (P2): scale notification-api; consider adding jittered reconnection guidance to clients.
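One common form of jittered reconnection is "full jitter": pick a random delay between zero and an exponentially growing ceiling. The sketch below is illustrative; the base and cap values are assumptions, not documented client settings.

```python
import random
from typing import Optional


def reconnect_delay_seconds(attempt: int, base: float = 1.0, cap: float = 60.0,
                            rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

Spreading reconnects uniformly over the window keeps clients that dropped at the same moment from returning at the same moment.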

9. Worker / scheduler failures

F-NTF-26 — Scheduler tick missed

Symptoms: worker.tick.outcome_total{worker='scheduler'} flat for > 2 ticks.

Auto-handling: Cloud Run recovers; on next tick the scheduler queries notification_scheduled for all overdue rows and processes them.

Runbook (P3): confirm Cloud Run revision health; check for pod-killer noisy neighbour.

F-NTF-27 — Late dispatch (queue backlog)

Symptoms: alert NotifDispatchBacklog (count > 1000 pending > 60 s).

Runbook (P1):

Identify which channel.
Scale the corresponding notification-worker-<channel> (maxInst bump).
Investigate whether it's a vendor degradation (F-NTF-01) or a self-induced bottleneck.

10. DR (regional) failures

F-NTF-28 — Regional outage

Severity: P1.

Runbook:

Confirm scope via GCP status page.
Fail over Cloud SQL to the DR region (manual; promotes the cross-region replica).
Update routing rules to point the affected tenants' regional backend to the DR region.
Re-issue Cloud NAT egress IPs (update vendor allowlists if needed — typically reserved IPs persist).
Monitor SLOs; expect 30 min recovery, 2 h full restoration.
Post-incident: PIR within 5 business days; permanent fixes tracked in SERVICE_RISK_REGISTER.

11. DLQ runbook

GET /api/v1/internal/dlq lists DLQ entries by consumer. For each:

Inspect payload + error.
If transient (e.g., DB blip) → POST /api/v1/internal/dlq/{id}/retry.
If a deterministic bug → patch, deploy, then retry.
If poison message (malformed event from a misbehaving producer) → coordinate with producer team; if data is unsalvageable, archive to dlq_archived with audit.

The DLQ growth alert (NotifPubSubDLQGrowing) pages on-call directly.

12. Backpressure rules

When backlogs grow, we shed lower-priority work to protect critical paths:

| Backlog signal | Action |
| --- | --- |
| `queued` count > 50 000 across all channels | pause `marketing` dispatch (workers skip rows where `category='marketing'`); alert |
| `queued` count > 200 000 | additionally pause `reminder`; only `transactional`+`security` flow |
| Outbox lag > 30 s | refuse new `notifications/batch` requests with 429 |
| Vendor concurrency exhausted | increase per-channel concurrency cap; if at max, queue with smaller channel parallelism |

These rules are encoded in BackpressureController and toggled by Memorystore-backed feature flags so they take effect within seconds.
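The queue-depth and outbox-lag thresholds reduce to two small decisions. The functions below are a sketch of those rules only; the names are hypothetical and do not describe the real BackpressureController interface.

```python
from typing import Set


def paused_categories(queued_count: int) -> Set[str]:
    """Categories to pause for a given total `queued` count across channels."""
    if queued_count > 200_000:
        return {"marketing", "reminder"}
    if queued_count > 50_000:
        return {"marketing"}
    return set()


def accept_batch_request(outbox_lag_seconds: float) -> bool:
    """False means the batch endpoint should answer 429."""
    return outbox_lag_seconds <= 30
```

Transactional and security categories never appear in the paused set, which is what protects the critical paths.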

13. Data corruption & rollback

If a deploy corrupts data (e.g., a faulty migration writes wrong status):

Halt the rollout (auto if SLO breach; manual otherwise).
Roll back to previous Cloud Run revision (Cloud Deploy one-click).
Restore Cloud SQL via point-in-time recovery (PITR) to a moment before the deploy.
Replay outbox rows that were never published (safe — receivers dedupe by event id).
Communicate impacted tenants per the platform incident comms standard.
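The outbox replay step is safe only because receivers ignore event ids they have already processed. A minimal in-memory illustration of that dedupe follows; a real receiver would persist seen ids durably, and the class name is hypothetical.

```python
from typing import Any, List, Set


class DedupingReceiver:
    """Applies each event at most once, keyed by event id."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.applied: List[Any] = []

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.applied.append(payload)
        return True
```

Replaying an event that was in fact published before the rollback therefore has no effect on the receiver's state.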

14. Communication during incidents

P1: pages on-call within 1 min; status page update within 10 min; tenant comms (in-app + email to tenant admins) within 30 min.
P2: ticket; status page if customer-impacting > 30 min.
P3: ticket only.
Post-incident review (PIR) within 5 business days for P1/P2; remediation actions tracked in SERVICE_RISK_REGISTER.
