SERVICE_RISK_REGISTER — notification-service

Sibling: FAILURE_MODES · SECURITY_MODEL · SERVICE_READINESS · AI_INTEGRATION

Strategic anchors: docs/07-security-compliance-tenancy · docs/08-ai-architecture

A living register of risks specific to notification-service. Each row is reviewed at the monthly service review. Fields:

  • ID: R-NTF-NN, stable, never re-used.
  • Likelihood / Impact: low / med / high.
  • Severity: derived (low / med / high / critical).
  • Status: open / mitigating / accepted / closed.
  • Owner: accountable engineer/team.

Severity matrix:

| Likelihood ↓ / Impact → | low | med | high |
|---|---|---|---|
| low | low | low | med |
| med | low | med | high |
| high | med | high | critical |
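
Because severity is derived rather than entered by hand, the matrix can be encoded as a lookup and used to check register rows. The sketch below is illustrative only; `SEVERITY_MATRIX` and `derive_severity` are hypothetical names, not part of the service.

```python
# Hypothetical helper: derive severity from likelihood and impact,
# mirroring the severity matrix in this register.
LEVELS = ("low", "med", "high")

# Outer key is likelihood, inner key is impact.
SEVERITY_MATRIX = {
    "low":  {"low": "low", "med": "low",  "high": "med"},
    "med":  {"low": "low", "med": "med",  "high": "high"},
    "high": {"low": "med", "med": "high", "high": "critical"},
}


def derive_severity(likelihood: str, impact: str) -> str:
    """Return the derived severity, rejecting values outside low/med/high."""
    if likelihood not in LEVELS or impact not in LEVELS:
        raise ValueError(f"unknown level: {likelihood!r} / {impact!r}")
    return SEVERITY_MATRIX[likelihood][impact]
```

For example, a row with likelihood `med` and impact `high` resolves to `high`, which matches R-NTF-01.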

1. Active risk register

Domain & delivery

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-01 | Vendor concentration on a single provider per channel (e.g., SendGrid for email) → single point of failure | med | high | high | Multi-vendor Channel config with fallbackVendor; channel prober flips on degradation; quarterly DR drill. Short-term: only 60 % of tenants have a fallback configured. | mitigating | team-comms |
| R-NTF-02 | WhatsApp Business template approval delays block guest comms in target markets | med | med | med | Pre-approve all transactional templates centrally; auto-fallback to SMS when WA pending; dashboard surfacing pending templates. | mitigating | team-comms |
| R-NTF-03 | Country-specific sender-ID requirements (PK PTA, AF MoCIT) reject messages | med | high | high | Per-tenant per-country sender-ID registry; pre-flight validation in RegisterChannelUseCase; suppression on consistent reject reasons. | mitigating | team-comms |
| R-NTF-04 | Plaintext PII could leak via log lines added without lint coverage | low | high | med | Field-level redactor + structured logger; CI lint forbids console.*; quarterly log audit. Lint coverage incomplete on worker code paths. | mitigating | team-platform |
| R-NTF-05 | Webhook signing key rotation drift → vendor signs with key A while we accept only key B | low | high | med | Dual-key acceptance window for ≥ 72 h on rotation; rotation runbook (F-NTF-21); automated key-window calendar in Secret Manager. | mitigating | team-platform |
| R-NTF-06 | Mobile-key delivery failure (channel down at check-in) → guest cannot enter room | low | high | med | Highest-priority queue; SMS+push fallback; SLA in lock-integration-service triggers staff intervention via maintenance task. | mitigating | team-comms |
| R-NTF-07 | Marketing notifications spam recipients due to misuse of bulk API | med | med | med | Tenant-level rate caps; explicit category=marketing tag; opt-in proof required at consent table; per-recipient daily cap default 5. | mitigating | team-comms |
| R-NTF-08 | Template misconfiguration (wrong locale fallback) → guest receives wrong-language confirmation | med | med | med | TemplateRenderer strict-mode flag per tenant; canary against synthetic recipient before publish; staff approval gate for transactional templates. | mitigating | team-comms |
| R-NTF-09 | Time-zone bug in scheduledFor causing reminders to fire at 03:00 local | low | med | low | DST-aware scheduler; tenant-tz stamping at enqueue; integration test suite covers Asia/Kabul half-hour offset. | accepted | team-comms |
| R-NTF-10 | Vendor delivery webhooks arrive after retention boundary → orphan correlation | low | low | low | 30-day late-correlation window; orphans logged but not retried; reconciliation job emits delivery.orphan metric. | accepted | team-comms |
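
The dual-key acceptance window in R-NTF-05 can be sketched as follows. This is a minimal illustration, assuming HMAC-SHA256 hex signatures (vendors differ, and the register does not name a scheme); `SigningKeys` and `verify_webhook` are hypothetical names.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The register requires a window of at least 72 h; 72 h is used here.
DUAL_KEY_WINDOW = timedelta(hours=72)


@dataclass(frozen=True)
class SigningKeys:
    current: bytes
    previous: Optional[bytes]  # key in use before the last rotation, if any
    rotated_at: datetime       # when `current` became the primary key


def _matches(key: bytes, payload: bytes, signature_hex: str) -> bool:
    # Assumption: the vendor sends an HMAC-SHA256 hex digest of the raw body.
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def verify_webhook(keys: SigningKeys, payload: bytes,
                   signature_hex: str, now: datetime) -> bool:
    """Accept the current key always, and the previous key only in the window."""
    if _matches(keys.current, payload, signature_hex):
        return True
    if keys.previous is None or now - keys.rotated_at > DUAL_KEY_WINDOW:
        return False
    return _matches(keys.previous, payload, signature_hex)
```

Accepting both keys for the window means a vendor that is still signing with the old key is not rejected immediately after rotation, while the old key stops working once the window closes.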

AI integration

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-AI-01 | AI-drafted message contains hallucinated reservation details (wrong room, wrong dates) | med | high | high | All AI-rendered values are decorative; binding values come from authoritative event payload. HITL on ai_drafted template publish. Safety post-render diff check. | mitigating | team-comms+team-ai |
| R-NTF-AI-02 | Prompt injection from guest reply causes orchestrator to leak data | low | high | med | Replies are untrusted input; ai-orchestrator-service enforces system-prompt isolation, structured output contracts, and red-team eval suite. Sentiment classifier returns labels only, not free text. | mitigating | team-ai |
| R-NTF-AI-03 | Translation introduces culturally inappropriate phrasing in Pashto/Dari/Arabic | med | med | med | Native-speaker style profile per locale; ai_translated requires HITL on first publish per locale; tenant-staff review queue. | mitigating | team-ai |
| R-NTF-AI-04 | Cost runaway via uncapped AI personalisation on marketing batches | low | high | med | Per-tenant AI quota in ai-orchestrator-service; per-batch cost cap; fallback to deterministic templates when quota exceeded. Budget alert at 50/80/100 %. | mitigating | team-ai |
| R-NTF-AI-05 | HITL queue stalls during low staffing → AI-drafted templates blocked | med | med | med | TTL with revert-to-deterministic on expiry; escalation to platform reviewer when no tenant approver acts in 24 h. | mitigating | team-comms |
| R-NTF-AI-06 | Sentiment-aware reply suggestions surface tone mismatched with brand | low | med | low | Tone profile per tenant; suggestions are advisory only; staff edits tracked, used to refine prompt. | accepted | team-ai |
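
The quota, batch cap, and 50/80/100 % alerts in R-NTF-AI-04 could interact roughly as below. This is a sketch under stated assumptions: costs are tracked as integer units, and `TenantAiBudget`, `crossed_thresholds`, and `choose_renderer` are hypothetical names rather than the ai-orchestrator-service API.

```python
from dataclasses import dataclass
from typing import List

# Budget alert levels from the register, in percent of quota.
ALERT_THRESHOLDS_PCT = (50, 80, 100)


@dataclass
class TenantAiBudget:
    quota: int      # hypothetical: integer cost units per billing period
    spent: int = 0

    def crossed_thresholds(self, cost: int) -> List[int]:
        """Alert levels that spending `cost` more would newly reach."""
        before = self.spent * 100
        after = (self.spent + cost) * 100
        return [t for t in ALERT_THRESHOLDS_PCT
                if before < t * self.quota <= after]


def choose_renderer(budget: TenantAiBudget, estimated_cost: int,
                    batch_spent: int, batch_cap: int) -> str:
    """Fall back to deterministic templates when either cap would be exceeded."""
    over_quota = budget.spent + estimated_cost > budget.quota
    over_batch = batch_spent + estimated_cost > batch_cap
    return "deterministic" if over_quota or over_batch else "ai"
```

Integer units avoid floating-point surprises when a spend lands exactly on a threshold.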

Data, security & compliance

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-SEC-01 | Cross-tenant data leak via missing RLS predicate | low | high | med | RLS enforced at table level; integration test suite enumerates every read path; nightly leak-canary; F-NTF-20 runbook. | mitigating | team-platform |
| R-NTF-SEC-02 | Vendor API key compromise via supply-chain | low | high | med | Secret Manager + workload-identity; quarterly rotation; vendor-side IP allowlist where supported; SAST + dependency scan in CI. | mitigating | team-platform |
| R-NTF-SEC-03 | Opt-out token replay or guess | low | med | low | 32-byte random tokens hashed at rest; single-use; expiry 90 d; rate-limit on opt-out endpoint. | accepted | team-platform |
| R-NTF-SEC-04 | Data residency violation if tenant moves regions | low | high | med | Tenant region pinned at provisioning; region-routed traffic; cross-region replication only for ops/audit; migration runbook in MIGRATION_PLAN. | mitigating | team-platform |
| R-NTF-SEC-05 | GDPR Art 17 erasure incomplete (e.g., template variable history retains email) | low | high | med | Crypto-shredding of address ciphertext; variable hashing not raw values; erasure verifier nightly. | mitigating | team-platform |
| R-NTF-SEC-06 | DKIM/SPF misconfig per tenant → emails routed to spam | med | med | med | Tenant onboarding wizard validates DNS; periodic re-verification; alert on inbound-bounce uplift per tenant domain. | mitigating | team-comms |
| R-NTF-SEC-07 | Guest impersonation via display-name spoofing in WhatsApp | low | med | low | Sender display-name allowlisted per tenant; tenant brand verified; cannot be changed by self-serve marketing. | accepted | team-comms |
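
The opt-out token controls in R-NTF-SEC-03 (32-byte random tokens, hashed at rest, single-use, 90-day expiry) can be illustrated with an in-memory stand-in. SHA-256 is an assumption, since the register does not name the hash; `OptOutTokenStore` is a hypothetical class, and rate limiting is out of scope here.

```python
import hashlib
import secrets
from datetime import datetime, timedelta

TOKEN_TTL = timedelta(days=90)  # expiry from the register


class OptOutTokenStore:
    """In-memory stand-in for the real token table (hypothetical)."""

    def __init__(self) -> None:
        # Only the digest is stored, so a leaked table does not leak tokens.
        self._expiry_by_digest = {}

    @staticmethod
    def _digest(token: str) -> str:
        # Assumption: SHA-256; the token is high-entropy, so no salt is used.
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, now: datetime) -> str:
        token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe text
        self._expiry_by_digest[self._digest(token)] = now + TOKEN_TTL
        return token

    def redeem(self, token: str, now: datetime) -> bool:
        # pop() makes the token single-use: a replay finds nothing.
        expiry = self._expiry_by_digest.pop(self._digest(token), None)
        return expiry is not None and now <= expiry
```

A first `redeem` within 90 days succeeds; a second call with the same token, an expired token, or a guessed value all return `False`.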

Operational & infrastructure

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-OPS-01 | Cloud SQL primary failover causes ≥ 60 s of enqueue downtime | med | med | med | Backoff + retry in BFF; outbox queue absorbs enqueue burst on recovery; HA Cloud SQL with sub-60 s failover; F-NTF-04 runbook. | mitigating | team-sre |
| R-NTF-OPS-02 | Pub/Sub region partial outage → consumed events delayed | low | high | med | Cross-region subscription replicas; staffed escalation procedure; outbox replays on recovery. | mitigating | team-sre |
| R-NTF-OPS-03 | Partition pruning gap → inserts fail | low | med | low | Cron-driven partition creator runs daily; alert if next-month partition missing 3 d before boundary; F-NTF-06. | mitigating | team-platform |
| R-NTF-OPS-04 | Worker scaling hits Cloud Run instance cap → backlog | med | med | med | Quotas reviewed quarterly; load test against 2× peak; scheduled scale-up before known peaks (e.g., Hajj season). | mitigating | team-sre |
| R-NTF-OPS-05 | Memorystore eviction during traffic burst → render misses spike | low | low | low | Cache sized 2× peak working set; circuit breaker falls back to Postgres reads; metric on template_render.cache.miss_rate. | accepted | team-sre |
| R-NTF-OPS-06 | Cost overrun due to Pub/Sub fan-out for global topics | med | med | med | Per-topic cost dashboards; budget alerts; archival of low-value high-volume events to BigQuery. | mitigating | team-platform |
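
The alert rule in R-NTF-OPS-03 (next-month partition missing 3 d before the boundary) reduces to a small date check. The sketch assumes monthly partitions and a hypothetical `notifications_YYYY_MM` naming scheme; the real table and suffix format are not given in this register.

```python
from datetime import date, timedelta
from typing import Set

ALERT_LEAD = timedelta(days=3)  # lead time from the register


def next_month_start(today: date) -> date:
    """First day of the month after `today`, handling the December rollover."""
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def partition_name(month_start: date) -> str:
    # Hypothetical naming scheme, for illustration only.
    return f"notifications_{month_start:%Y_%m}"


def should_alert(today: date, existing_partitions: Set[str]) -> bool:
    """True when the boundary is within 3 days and the partition is missing."""
    boundary = next_month_start(today)
    within_lead = boundary - today <= ALERT_LEAD
    return within_lead and partition_name(boundary) not in existing_partitions
```

Run daily alongside the partition creator, this stays quiet until the last three days of the month and then fires only if the creator has not already done its job.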

Sync, client & UX

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-CLI-01 | Electron desktop replica diverges → stale notification status | low | low | low | Server-authoritative; periodic full-resync; conflict policy server_authoritative for status; F-NTF-24. | accepted | team-electron |
| R-NTF-CLI-02 | WS feed reconnect storm during regional failover | low | med | low | Jittered reconnect; subscription throttle; F-NTF-25; sticky session reuse where available. | mitigating | team-comms |
| R-NTF-CLI-03 | Tenant brand logo 404 → fallback to default | low | low | low | Asset CDN with origin failover; logo schema validates URL on save. | accepted | team-comms |
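
The jittered reconnect in R-NTF-CLI-02 is commonly implemented as capped exponential backoff with full jitter. The register does not say which jitter strategy the clients use, so the variant below is an assumption, and `reconnect_delay` with its default values is hypothetical.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based).

    Full jitter: pick uniformly between 0 and the capped exponential
    ceiling, so clients that dropped together do not reconnect together.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    source = rng if rng is not None else random
    # Clamp the exponent so very long outages cannot overflow the float.
    ceiling = min(cap_s, base_s * (2 ** min(attempt, 30)))
    return source.uniform(0.0, ceiling)
```

Spreading delays across the whole interval, rather than adding a small offset to a fixed delay, is what flattens the reconnect spike after a regional failover.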

People & process

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-ORG-01 | Single team owns notification + AI integration → bus factor | med | med | med | Cross-train SRE rotation; documented runbooks; pair on-call shadow program. | mitigating | team-comms |
| R-NTF-ORG-02 | Lack of 24×7 on-call coverage in early months | high | med | high | Follow-the-sun on-call shared with team-platform until headcount lands. | mitigating | team-comms |
| R-NTF-ORG-03 | Vendor relationship knowledge concentrated in one engineer | med | low | low | Vendor playbooks per provider; quarterly vendor-management review. | accepted | team-comms |

2. Closed risks (last 90 d)

| ID | Risk | Resolution | Closed |
|---|---|---|---|
| R-NTF-CLOSED-01 | Render pipeline susceptible to MJML version drift | Pinned MJML 4.x.x in lockfile; renderer version in event metadata; snapshot test suite. | 2026-03-12 |
| R-NTF-CLOSED-02 | Webhook DoS susceptibility | Cloud Armor rules + per-vendor rate limits + replay window enforced. | 2026-02-20 |

3. Risk acceptance log

Risks marked accepted carry an explicit acceptance, signed by the service tech lead and an authorising stakeholder (security, SRE, or product) and recorded in the rollout ticket.

Acceptance entries summarise: rationale, compensating controls, review date.
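
One way to keep acceptance entries uniform is a small record type that enforces the required signers and fields. This is only a sketch; `RiskAcceptance` and all of its field names are hypothetical and do not describe an existing schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple

# Stakeholder roles allowed to authorise an acceptance, per this section.
AUTHORISING_ROLES = {"security", "sre", "product"}


@dataclass(frozen=True)
class RiskAcceptance:
    risk_id: str
    rationale: str
    compensating_controls: Tuple[str, ...]
    review_date: date
    tech_lead: str          # first signer: service tech lead
    authoriser: str         # second signer: authorising stakeholder
    authoriser_role: str
    rollout_ticket: str     # where the acceptance is recorded

    def __post_init__(self) -> None:
        if self.authoriser_role not in AUTHORISING_ROLES:
            raise ValueError(f"role cannot authorise: {self.authoriser_role!r}")
        if not self.rationale or not self.compensating_controls:
            raise ValueError("rationale and compensating controls are required")

    def review_due(self, today: date) -> bool:
        """True once the agreed review date has been reached."""
        return today >= self.review_date
```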


4. Review cadence

| Cadence | Owner | Output |
|---|---|---|
| Monthly | service tech lead | Updated severity, status, new entries |
| Quarterly | platform architect + security lead | Trend review; promote/demote risks |
| On incident | incident commander | Add risk if root cause not in register |
| On contract change with vendor | team-comms | Re-evaluate concentration & key-rotation risks |

This register is the source of truth for "what could break the notification surface and what are we doing about it." It feeds the SERVICE_READINESS §11 sign-off matrix, the platform top-N risk dashboard, and the audit pack.