SERVICE_RISK_REGISTER — notification-service
Sibling: FAILURE_MODES · SECURITY_MODEL · SERVICE_READINESS · AI_INTEGRATION
Strategic anchors: docs/07-security-compliance-tenancy · docs/08-ai-architecture
A living register of risks specific to notification-service. Each row is reviewed at the monthly service review. Fields:
- ID: R-NTF-NN, stable, never re-used.
- Likelihood / Impact: low / med / high.
- Severity: derived (low / med / high / critical).
- Status: open / mitigating / accepted / closed.
- Owner: accountable engineer/team.
Severity matrix:
| Likelihood ↓ / Impact → | low | med | high |
|---|---|---|---|
| low | low | low | med |
| med | low | med | high |
| high | med | high | critical |
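
Because Severity is derived rather than entered by hand, the matrix can be expressed as a lookup. The following is a minimal sketch; the names `Level`, `Severity`, `SEVERITY_MATRIX`, and `deriveSeverity` are illustrative assumptions and not part of any existing codebase.

```typescript
// Levels used by the Likelihood and Impact columns.
type Level = "low" | "med" | "high";
// Severity adds a fourth value, produced only by the high/high cell.
type Severity = Level | "critical";

// Outer key is likelihood, inner key is impact, mirroring the severity matrix.
const SEVERITY_MATRIX: Record<Level, Record<Level, Severity>> = {
  low: { low: "low", med: "low", high: "med" },
  med: { low: "low", med: "med", high: "high" },
  high: { low: "med", med: "high", high: "critical" },
};

// Hypothetical helper: derive a row's Severity from its Likelihood and Impact.
function deriveSeverity(likelihood: Level, impact: Level): Severity {
  return SEVERITY_MATRIX[likelihood][impact];
}
```

A check like this could run at the monthly review to flag rows whose recorded Severity no longer matches their Likelihood and Impact.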
1. Active risk register
Domain & delivery
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-01 | Vendor concentration on a single provider per channel (e.g., SendGrid for email) → single point of failure | med | high | high | Multi-vendor Channel config with fallbackVendor; channel prober flips to the fallback on degradation; quarterly DR drill. Short-term gap: only 60 % of tenants have a fallback configured. | mitigating | team-comms |
| R-NTF-02 | WhatsApp Business template approval delays block guest comms in target markets | med | med | med | Pre-approve all transactional templates centrally; auto-fallback to SMS when WA pending; dashboard surfacing pending templates. | mitigating | team-comms |
| R-NTF-03 | Country-specific sender-ID requirements (PK PTA, AF MoCIT) reject messages | med | high | high | Per-tenant per-country sender-ID registry; pre-flight validation in RegisterChannelUseCase; suppression on consistent reject reasons. | mitigating | team-comms |
| R-NTF-04 | Plaintext PII could leak via log lines added without lint coverage | low | high | med | Field-level redactor + structured logger; CI lint forbids console.*; quarterly log audit. Gap: lint coverage is incomplete on worker code paths. | mitigating | team-platform |
| R-NTF-05 | Webhook signing key rotation drift → vendor signs with key A while we accept only key B | low | high | med | Dual-key acceptance window for ≥ 72 h on rotation; rotation runbook (F-NTF-21); automated key-window calendar in Secret Manager. | mitigating | team-platform |
| R-NTF-06 | Mobile-key delivery failure (channel down at check-in) → guest cannot enter room | low | high | med | Highest-priority queue; SMS+push fallback; SLA in lock-integration-service triggers staff intervention via maintenance task. | mitigating | team-comms |
| R-NTF-07 | Marketing notifications spam recipients due to misuse of bulk API | med | med | med | Tenant-level rate caps; explicit category=marketing tag; opt-in proof required in the consent table; per-recipient daily cap (default 5). | mitigating | team-comms |
| R-NTF-08 | Template misconfiguration (wrong locale fallback) → guest receives wrong-language confirmation | med | med | med | TemplateRenderer strict-mode flag per tenant; canary against synthetic recipient before publish; staff approval gate for transactional templates. | mitigating | team-comms |
| R-NTF-09 | Time-zone bug in scheduledFor causing reminders to fire at 03:00 local | low | med | low | DST-aware scheduler; tenant-tz stamping at enqueue; integration test suite covers Asia/Kabul half-hour offset. | accepted | team-comms |
| R-NTF-10 | Vendor delivery webhooks arrive after retention boundary → orphan correlation | low | low | low | 30-day late-correlation window; orphans logged but not retried; reconciliation job emits delivery.orphan metric. | accepted | team-comms |
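
The dual-key acceptance window in R-NTF-05 can be illustrated as follows. This is a sketch under stated assumptions: the `SigningKey` shape, the function names, and the hex HMAC-SHA256 signature over the raw body are all hypothetical, since vendors differ in how they sign webhooks. Only the 72 h minimum window comes from the register.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Minimum overlap during which both old and new keys are accepted (R-NTF-05).
const DUAL_KEY_WINDOW_MS = 72 * 60 * 60 * 1000;

// Hypothetical shape: a key plus the instant after which it is no longer accepted.
interface SigningKey {
  id: string;
  secret: string;
  acceptUntilMs: number | null; // null = current key, no retirement scheduled
}

// When a key is rotated out, keep accepting it for the full window.
function retireAtMs(rotatedAtMs: number): number {
  return rotatedAtMs + DUAL_KEY_WINDOW_MS;
}

// Assumed scheme: hex-encoded HMAC-SHA256 over the raw request body.
function sign(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Returns the id of the key that verified the signature, or null if none did.
function verifyWebhook(body: string, signatureHex: string, keys: SigningKey[], nowMs: number): string | null {
  const given = Buffer.from(signatureHex, "hex");
  for (const key of keys) {
    if (key.acceptUntilMs !== null && nowMs > key.acceptUntilMs) continue; // window closed
    const expected = Buffer.from(sign(body, key.secret), "hex");
    // Constant-time comparison; lengths must match before calling timingSafeEqual.
    if (given.length === expected.length && timingSafeEqual(given, expected)) return key.id;
  }
  return null;
}
```

Returning the matching key id, rather than a boolean, makes it possible to emit a metric on how much traffic still arrives signed with the outgoing key before the window closes.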
AI integration
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-AI-01 | AI-drafted message contains hallucinated reservation details (wrong room, wrong dates) | med | high | high | All AI-rendered values are decorative; binding values come from authoritative event payload. HITL on ai_drafted template publish. Safety post-render diff check. | mitigating | team-comms+team-ai |
| R-NTF-AI-02 | Prompt injection from guest reply causes orchestrator to leak data | low | high | med | Replies are untrusted input; ai-orchestrator-service enforces system-prompt isolation, structured output contracts, and red-team eval suite. Sentiment classifier returns labels only, not free text. | mitigating | team-ai |
| R-NTF-AI-03 | Translation introduces culturally inappropriate phrasing in Pashto/Dari/Arabic | med | med | med | Native-speaker style profile per locale; ai_translated requires HITL on first publish per locale; tenant-staff review queue. | mitigating | team-ai |
| R-NTF-AI-04 | Cost runaway via uncapped AI personalisation on marketing batches | low | high | med | Per-tenant AI quota in ai-orchestrator-service; per-batch cost cap; fallback to deterministic templates when quota exceeded. Budget alert at 50/80/100 %. | mitigating | team-ai |
| R-NTF-AI-05 | HITL queue stalls during low staffing → AI-drafted templates blocked | med | med | med | TTL with revert-to-deterministic on expiry; escalation to platform reviewer when no tenant approver acts in 24 h. | mitigating | team-comms |
| R-NTF-AI-06 | Sentiment-aware reply suggestions surface tone mismatched with brand | low | med | low | Tone profile per tenant; suggestions are advisory only; staff edits tracked, used to refine prompt. | accepted | team-ai |
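
The cost controls in R-NTF-AI-04 (budget alerts at 50/80/100 % and fallback to deterministic templates when the quota is exceeded) could look roughly like this. It is a sketch only: the function names, the `RenderMode` values, and the idea of projecting spend per batch are assumptions, not the behaviour of ai-orchestrator-service.

```typescript
// Alert thresholds from R-NTF-AI-04, as whole percentages of the budget.
const ALERT_THRESHOLDS_PCT = [50, 80, 100];

// Returns the thresholds newly crossed when spend moves from `before` to `after`.
// Integer arithmetic (spend * 100 vs pct * budget) avoids floating-point fractions.
function crossedThresholds(before: number, after: number, budget: number): number[] {
  return ALERT_THRESHOLDS_PCT.filter(
    (pct) => before * 100 < pct * budget && after * 100 >= pct * budget,
  );
}

type RenderMode = "ai_personalised" | "deterministic";

// Hypothetical policy: use AI only while the projected spend stays within quota.
function chooseRenderMode(spent: number, estimatedBatchCost: number, quota: number): RenderMode {
  return spent + estimatedBatchCost <= quota ? "ai_personalised" : "deterministic";
}
```

Reporting only newly crossed thresholds means each alert fires once per budget period instead of on every batch after the threshold is passed.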
Data, security & compliance
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-SEC-01 | Cross-tenant data leak via missing RLS predicate | low | high | med | RLS enforced at table level; integration test suite enumerates every read path; nightly leak-canary; F-NTF-20 runbook. | mitigating | team-platform |
| R-NTF-SEC-02 | Vendor API key compromise via supply-chain | low | high | med | Secret Manager + workload-identity; quarterly rotation; vendor-side IP allowlist where supported; SAST + dependency scan in CI. | mitigating | team-platform |
| R-NTF-SEC-03 | Opt-out token replay or guess | low | med | low | 32-byte random tokens hashed at rest; single-use; expiry 90 d; rate-limit on opt-out endpoint. | accepted | team-platform |
| R-NTF-SEC-04 | Data residency violation if tenant moves regions | low | high | med | Tenant region pinned at provisioning; region-routed traffic; cross-region replication only for ops/audit; migration runbook in MIGRATION_PLAN. | mitigating | team-platform |
| R-NTF-SEC-05 | GDPR Art. 17 erasure incomplete (e.g., template variable history retains email) | low | high | med | Crypto-shredding of address ciphertext; template variable history stores hashes, not raw values; nightly erasure verifier. | mitigating | team-platform |
| R-NTF-SEC-06 | DKIM/SPF misconfig per tenant → emails routed to spam | med | med | med | Tenant onboarding wizard validates DNS; periodic re-verification; alert on inbound-bounce uplift per tenant domain. | mitigating | team-comms |
| R-NTF-SEC-07 | Guest impersonation via display-name spoofing in WhatsApp | low | med | low | Sender display-name allowlisted per tenant; tenant brand verified; cannot be changed by self-serve marketing. | accepted | team-comms |
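
The opt-out token controls in R-NTF-SEC-03 (32-byte random tokens, hashed at rest, single-use, 90 d expiry) can be sketched as below. The in-memory `Map` stands in for the real table, SHA-256 is an assumed hash choice, and all names are hypothetical. Rate limiting on the endpoint is out of scope for this sketch.

```typescript
import { createHash, randomBytes } from "node:crypto";

const TOKEN_TTL_MS = 90 * 24 * 60 * 60 * 1000; // 90 d expiry

interface StoredToken {
  recipientId: string;
  expiresAtMs: number;
  used: boolean;
}

// In-memory stand-in for the token table; keyed by the token's hash, never the token.
const tokenStore = new Map<string, StoredToken>();

// Assumed hash: SHA-256, hex-encoded.
function hashToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

// Issues a 32-byte random token and stores only its hash.
function issueOptOutToken(recipientId: string, nowMs: number): string {
  const token = randomBytes(32).toString("hex");
  tokenStore.set(hashToken(token), { recipientId, expiresAtMs: nowMs + TOKEN_TTL_MS, used: false });
  return token;
}

// Redeems a token once; returns the recipient id, or null if unknown, used, or expired.
function redeemOptOutToken(token: string, nowMs: number): string | null {
  const record = tokenStore.get(hashToken(token));
  if (!record || record.used || nowMs > record.expiresAtMs) return null;
  record.used = true; // single-use: a replay of the same token fails
  return record.recipientId;
}
```

Storing only the hash means a leaked copy of the table does not yield usable tokens, which is what makes the replay and guess risk acceptable at low severity.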
Operational & infrastructure
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-OPS-01 | Cloud SQL primary failover causes ≥ 60 s of enqueue downtime | med | med | med | Backoff + retry in BFF; outbox queue absorbs enqueue burst on recovery; HA Cloud SQL with sub-60 s failover; F-NTF-04 runbook. | mitigating | team-sre |
| R-NTF-OPS-02 | Pub/Sub region partial outage → consumed events delayed | low | high | med | Cross-region subscription replicas; staffed escalation procedure; outbox replays on recovery. | mitigating | team-sre |
| R-NTF-OPS-03 | Partition pruning gap → inserts fail | low | med | low | Cron-driven partition creator runs daily; alert if next-month partition missing 3 d before boundary; F-NTF-06. | mitigating | team-platform |
| R-NTF-OPS-04 | Worker scaling hits Cloud Run instance cap → backlog | med | med | med | Quotas reviewed quarterly; load test against 2× peak; scheduled scale-up before known peaks (e.g., Hajj season). | mitigating | team-sre |
| R-NTF-OPS-05 | Memorystore eviction during traffic burst → render misses spike | low | low | low | Cache sized 2× peak working set; circuit breaker falls back to Postgres reads; metric on template_render.cache.miss_rate. | accepted | team-sre |
| R-NTF-OPS-06 | Cost overrun due to Pub/Sub fan-out for global topics | med | med | med | Per-topic cost dashboards; budget alerts; archival of low-value high-volume events to BigQuery. | mitigating | team-platform |
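
The alert rule in R-NTF-OPS-03 (alert if the next-month partition is missing 3 d before the boundary) reduces to a small date check. The sketch below assumes monthly partitions aligned to UTC month boundaries and a hypothetical `YYYY-MM` labelling scheme; the real partition naming may differ.

```typescript
const ALERT_LEAD_DAYS = 3;
const DAY_MS = 24 * 60 * 60 * 1000;

// Start of the month after the one containing `now`, in UTC milliseconds.
// Date.UTC normalises month 12 to January of the following year.
function nextBoundaryMs(now: Date): number {
  return Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1);
}

// Hypothetical label scheme: "YYYY-MM" for the partition covering that month.
function nextMonthLabel(now: Date): string {
  const next = new Date(nextBoundaryMs(now));
  const month = next.getUTCMonth() + 1;
  return `${next.getUTCFullYear()}-${month < 10 ? "0" : ""}${month}`;
}

// True when the boundary is at most 3 d away and the next partition does not exist yet.
function partitionAlertDue(now: Date, existingPartitions: Set<string>): boolean {
  const withinLead = nextBoundaryMs(now) - now.getTime() <= ALERT_LEAD_DAYS * DAY_MS;
  return withinLead && !existingPartitions.has(nextMonthLabel(now));
}
```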
Sync, client & UX
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-CLI-01 | Electron desktop replica diverges → stale notification status | low | low | low | Server-authoritative; periodic full-resync; conflict policy server_authoritative for status; F-NTF-24. | accepted | team-electron |
| R-NTF-CLI-02 | WS feed reconnect storm during regional failover | low | med | low | Jittered reconnect; subscription throttle; F-NTF-25; sticky session reuse where available. | mitigating | team-comms |
| R-NTF-CLI-03 | Tenant brand logo 404 → fallback to default | low | low | low | Asset CDN with origin failover; logo schema validates URL on save. | accepted | team-comms |
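
The jittered reconnect named in R-NTF-CLI-02 is commonly implemented as exponential backoff with full jitter. The sketch below shows that technique; the base and cap values and the function name are assumptions, not the client's actual settings.

```typescript
// Hypothetical defaults; tune to the real WS gateway.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

// Exponential backoff with full jitter: a uniform delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function reconnectDelayMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Spreading each client's delay across the whole interval, rather than adding a small offset to a fixed delay, keeps clients that dropped at the same instant during a regional failover from reconnecting in synchronised waves.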
People & process
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Status | Owner |
|---|---|---|---|---|---|---|---|
| R-NTF-ORG-01 | Single team owns notification + AI integration → bus factor | med | med | med | Cross-train SRE rotation; documented runbooks; pair on-call shadow program. | mitigating | team-comms |
| R-NTF-ORG-02 | Lack of 24×7 on-call coverage in early months | high | med | high | Follow-the-sun on-call shared with team-platform until headcount lands. | mitigating | team-comms |
| R-NTF-ORG-03 | Vendor relationship knowledge concentrated in one engineer | med | low | low | Vendor playbooks per provider; quarterly vendor-management review. | accepted | team-comms |
2. Closed risks (last 90 d)
| ID | Risk | Resolution | Closed |
|---|---|---|---|
| R-NTF-CLOSED-01 | Render pipeline susceptible to MJML version drift | Pinned MJML 4.x.x in lockfile; renderer version in event metadata; snapshot test suite. | 2026-03-12 |
| R-NTF-CLOSED-02 | Webhook DoS susceptibility | Cloud Armor rules + per-vendor rate limits + replay window enforced. | 2026-02-20 |
3. Risk acceptance log
Each risk marked accepted carries an explicit acceptance, signed by the service tech lead and an authorising stakeholder (security, SRE, or product) and recorded in the rollout ticket.
Acceptance entries summarise: rationale, compensating controls, review date.
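
As an illustration of what an acceptance entry might hold, the sketch below models the three summarised fields and flags accepted risks that either lack an entry or have passed their review date. The `AcceptanceEntry` shape and the `acceptanceGaps` helper are hypothetical; the register does not prescribe a storage format.

```typescript
// Hypothetical record shape built from the three fields an acceptance entry summarises.
interface AcceptanceEntry {
  riskId: string;
  rationale: string;
  compensatingControls: string[];
  reviewDate: string; // ISO date (YYYY-MM-DD), so string comparison orders correctly
}

// Returns the IDs of accepted risks with no entry or with a review date before `today`.
function acceptanceGaps(acceptedRiskIds: string[], entries: AcceptanceEntry[], today: string): string[] {
  const byId = new Map<string, AcceptanceEntry>();
  for (const entry of entries) byId.set(entry.riskId, entry);
  return acceptedRiskIds.filter((id) => {
    const entry = byId.get(id);
    return entry === undefined || entry.reviewDate < today;
  });
}
```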
4. Review cadence
| Cadence | Owner | Output |
|---|---|---|
| Monthly | service tech lead | Updated severity, status, new entries |
| Quarterly | platform architect + security lead | Trend review; promote/demote risks |
| On incident | incident commander | Add risk if root cause not in register |
| On contract change with vendor | team-comms | Re-evaluate concentration & key-rotation risks |
This register is the source of truth for "what could break the notification surface and what are we doing about it." It feeds the SERVICE_READINESS §11 sign-off matrix, the platform top-N risk dashboard, and the audit pack.