SECURITY_MODEL — notification-service

Sibling: API_CONTRACTS · DATA_MODEL · AI_INTEGRATION · OBSERVABILITY

Strategic anchors: 07 Security/Compliance/Tenancy · 02 Enterprise Architecture §11 Security · ADR-0002 Multi-Tenancy

The notification path is one of the highest-risk surfaces in the platform: it touches guest PII, sends messages on behalf of tenants, holds sender-IDs and channel credentials, and is a frequent phishing/abuse vector. The controls below assume honest-but-curious staff and a hostile internet.

1. Trust boundaries

                       ┌──────────────────────────────────────────────────────┐
                       │  Internet (vendors, end-users)                       │
                       │  - Vendor webhook callers (HMAC, no JWT)             │
                       │  - End-user opt-out token bearers                    │
                       └──────────────────┬───────────────────────────────────┘
                                          │ TLS 1.3 + WAF
                                          ▼
                       ┌──────────────────────────────────────────────────────┐
                       │  GCP Cloud Load Balancer + Cloud Armor               │
                       │  - DDoS protection, IP throttle, geo rules           │
                       │  - WAF managed rules                                 │
                       └──────────────────┬───────────────────────────────────┘
                                          │ mTLS (vendor mesh) / TLS+JWT (BFFs)
                                          ▼
                       ┌──────────────────────────────────────────────────────┐
                       │  notification-service Cloud Run                      │
                       │  - REST + WS API                                     │
                       │  - Pub/Sub subscribers                               │
                       │  - Background workers                                │
                       └─┬────────────────┬───────────────┬──────────────────┬┘
                         │                │               │                  │
                         ▼                ▼               ▼                  ▼
                  ┌──────────┐     ┌────────────┐   ┌────────────┐    ┌────────────┐
                  │ Cloud SQL│     │ Memorystore│   │   GCS      │    │ Secret Mgr │
                  │ Postgres │     │   Redis    │   │ (CMEK)     │    │ (vendor    │
                  │ (CMEK)   │     │ (in VPC)   │   │            │    │  creds)    │
                  └──────────┘     └────────────┘   └────────────┘    └────────────┘
                          ▲                                                  │
                          │  RLS-enforced session                            │
                          │                                                  ▼
                  ┌──────────────────┐                            ┌──────────────────┐
                  │  Cloud KMS CMEK  │                            │ vendor APIs      │
                  │  per region/env  │                            │ (egress via NAT) │
                  └──────────────────┘                            └──────────────────┘

All east-west traffic between services rides over the internal VPC with mTLS issued by iam-service's SPIFFE/SPIRE intermediate. North-south egress to vendors goes through Cloud NAT with a fixed egress IP per region (so vendors can allowlist us).

2. AuthN

Caller class	Mechanism	Issued by	Lifetime
Backoffice staff via `bff-backoffice-service`	OAuth2/OIDC → JWT (RS256) with `aud='notification-service'`	`iam-service`	15 min access; 8 h refresh
Guest via `bff-tenant-booking-service` (in-app feed, opt-out)	Anonymous booking session token (JWT, scoped)	`iam-service` (anonymous flow)	30 min sliding
Internal services (iam, billing, tenant, reservation, etc.)	mTLS + SPIFFE id + service JWT	`iam-service` SPIRE	1 h
Vendors (webhooks)	HMAC-SHA256 over raw body + per-vendor secret stored in Secret Manager	per-vendor	rotated quarterly
End users (opt-out token)	bearer token in URL; lookup by `sha256(token)` against `opt_out_tokens`	this service	30 days

JWTs are validated against the JWKS published by iam-service (cached for 10 min and refetched on a kid miss). Tokens presented more than 60 s outside their validity window (the clock-skew allowance) are rejected.
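As a rough illustration of the claim checks that run once a signature has been verified, the sketch below validates `aud` and applies a 60 s tolerance to the time claims. The claim shape, the name `checkClaims`, and the reading of the 60 s figure as a clock-skew allowance on `exp`/`nbf`/`iat` are assumptions made for this example, not the service's actual code; RS256 verification against the JWKS is deliberately left out.

```typescript
// Hypothetical claim shape. Only the audience value and the 60 s figure come
// from the text above; everything else is illustrative.
interface AccessClaims {
  aud: string | string[];
  exp: number; // expiry, seconds since epoch
  nbf?: number; // not-before, seconds since epoch
  iat?: number; // issued-at, seconds since epoch
}

const EXPECTED_AUDIENCE = "notification-service";
const CLOCK_SKEW_SECONDS = 60;

type ClaimCheck = { ok: true } | { ok: false; reason: string };

// Intended to run AFTER the signature has been verified against the JWKS.
function checkClaims(claims: AccessClaims, nowSeconds: number): ClaimCheck {
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (audiences.indexOf(EXPECTED_AUDIENCE) === -1) {
    return { ok: false, reason: "wrong_audience" };
  }
  if (nowSeconds > claims.exp + CLOCK_SKEW_SECONDS) {
    return { ok: false, reason: "expired" };
  }
  if (claims.nbf !== undefined && nowSeconds < claims.nbf - CLOCK_SKEW_SECONDS) {
    return { ok: false, reason: "not_yet_valid" };
  }
  if (claims.iat !== undefined && claims.iat > nowSeconds + CLOCK_SKEW_SECONDS) {
    return { ok: false, reason: "issued_in_future" };
  }
  return { ok: true };
}
```

Returning a reason code rather than throwing keeps the caller free to map every failure to the same 401 without leaking which check failed.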

For the WebSocket feed (WS /api/v1/notifications/feed/stream), the JWT is passed in Sec-WebSocket-Protocol and verified before the server responds with 101 Switching Protocols. Re-authentication happens mid-stream on each heartbeat; an expired token closes the socket with code 4401.

3. AuthZ — RBAC + scope claims

Roles are issued by iam-service per tenant; we evaluate them against route scopes. The full matrix lives in API_CONTRACTS §12; operational summary:

Scope	Granted to roles
`notifications:read`	`OWNER, GM, FRONT_DESK, RESERVATIONS, MARKETING_MANAGER, SUPPORT`
`notifications:write`	`OWNER, GM, FRONT_DESK, RESERVATIONS, SUPPORT` (transactional/operational only); `MARKETING_MANAGER` for `marketing` category
`notifications:batch`	`OWNER, GM, MARKETING_MANAGER`
`notifications:feed.read.self`	any authenticated user (guest or staff) for own feed
`notifications:feed.read.any`	`OWNER, GM, FRONT_DESK, SUPPORT` (with property scope)
`notifications:preferences.write.self`	the user themselves
`notifications:preferences.write.any`	`OWNER, GM, SUPPORT` with documented consent (audit-tagged)
`notifications:templates.read`	`OWNER, GM, MARKETING_MANAGER, FRONT_DESK`
`notifications:templates.write`	`OWNER, GM, MARKETING_MANAGER`
`notifications:templates.publish`	`OWNER, GM, MARKETING_MANAGER` (HITL-gated for `ai_drafted` source)
`notifications:channels.read`	`OWNER, GM, PLATFORM_ENG`
`notifications:channels.write`	`OWNER, PLATFORM_ENG`
`notifications:channels.rotate`	`OWNER, PLATFORM_ENG`
`notifications:internal`	`PLATFORM_ENG` only; available only via internal mesh
`notifications:suppressions.read/write`	`OWNER, GM, COMPLIANCE_OFFICER`
`notifications:suppressions.release`	`OWNER, COMPLIANCE_OFFICER` (4-eyes via approval ticket for hard bounces older than 30 days)

ABAC overlay: X-Property-Id is cross-checked against iam.property_assignments for staff scopes (read/write per property). Deny by default.
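The RBAC check and the ABAC overlay can be pictured as two gates evaluated in order, with anything unrecognised denied. The sketch below is a minimal illustration only: the `Caller` shape, the `authorize` function, and the in-memory `assignedPropertyIds` list (standing in for a lookup against iam.property_assignments) are hypothetical, and only a subset of the scope table is reproduced.

```typescript
type Role =
  | "OWNER" | "GM" | "FRONT_DESK" | "RESERVATIONS"
  | "MARKETING_MANAGER" | "SUPPORT";

// Subset of the scope table above, for illustration.
const SCOPE_ROLES = new Map<string, readonly Role[]>([
  ["notifications:read", ["OWNER", "GM", "FRONT_DESK", "RESERVATIONS", "MARKETING_MANAGER", "SUPPORT"]],
  ["notifications:batch", ["OWNER", "GM", "MARKETING_MANAGER"]],
  ["notifications:templates.write", ["OWNER", "GM", "MARKETING_MANAGER"]],
]);

// Hypothetical caller context derived from the JWT plus an IAM lookup.
interface Caller {
  roles: Role[];
  assignedPropertyIds: string[]; // stand-in for iam.property_assignments
}

function authorize(caller: Caller, scope: string, propertyId: string | undefined): boolean {
  const allowedRoles = SCOPE_ROLES.get(scope);
  if (allowedRoles === undefined) return false; // unknown scope: deny by default
  const hasRole = caller.roles.some((role) => allowedRoles.indexOf(role) !== -1);
  if (!hasRole) return false; // RBAC gate
  if (propertyId === undefined) return false; // assumption: staff scopes require X-Property-Id
  return caller.assignedPropertyIds.indexOf(propertyId) !== -1; // ABAC gate
}
```

Every branch that is not an explicit match returns `false`, which is one simple way to keep the deny-by-default property visible in the code.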

4. Database isolation (RLS + roles)

Three Postgres roles in melmastoon_notification:

Role	Capabilities
`melmastoon_notification_app`	`SELECT/INSERT/UPDATE/DELETE` on tenant-scoped tables; no `BYPASSRLS`; sets `app.tenant_id` per transaction (`SET LOCAL`)
`melmastoon_notification_dba`	`BYPASSRLS`; granted only via break-glass (audit-tagged); MFA-required
`melmastoon_notification_ro_analytics`	`SELECT` on materialised views with PII redacted; used by Datastream/BigQuery sink

The Drizzle middleware runs SET LOCAL app.tenant_id = '<tenantId>' at the start of every transaction. RLS policies (see DATA_MODEL §3) reject rows whose tenant_id does not match. Because the app role has no BYPASSRLS, it cannot read or write another tenant's rows.
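A minimal sketch of the per-transaction tenant binding follows. It is not the Drizzle middleware itself: `TxClient` is a hypothetical interface shaped like a typical Postgres client, and `withTenant` is an illustrative name. One detail worth noting is that `SET LOCAL` does not accept bind parameters, so the sketch uses PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which has the same transaction-scoped effect and lets the tenant id travel as a parameter.

```typescript
// Hypothetical minimal client interface; a real driver would return rows.
interface TxClient {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

async function withTenant<T>(
  client: TxClient,
  tenantId: string,
  work: (tx: TxClient) => Promise<T>,
): Promise<T> {
  if (tenantId.trim() === "") {
    throw new Error("tenantId is required before any tenant-scoped query");
  }
  await client.query("BEGIN");
  try {
    // Transaction-scoped equivalent of SET LOCAL app.tenant_id = '<tenantId>'.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Because the setting is local to the transaction, it is discarded on COMMIT or ROLLBACK, so a pooled connection does not carry one tenant's id into the next request.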

CMEK on:

Cloud SQL data at rest (per-region key in Cloud KMS).
Cloud SQL backups.
GCS buckets for rendered/webhooks/attachments.
Memorystore (TLS in transit; no on-disk keys; in-memory only).

KMS keys rotate automatically every 90 days; old key versions are retained for 5 years to decrypt historical backups.

5. PII handling

PII class	Where it lives	Encryption	Access
Guest email / phone (recipient address)	`recipient_addresses.address_ciphertext` (envelope-encrypted DEK + KMS KEK), plus `address_hash`	yes — envelope; key per tenant; KEK in KMS	server-side decrypt only when sending; never in logs/events; never in BigQuery; never on Electron
Vendor message id	`delivery_attempts.vendor_message_id`	none (opaque)	service + analytics
Guest name (display)	`recipients.display_name`	none (low risk; needed for staff UI)	RLS-bound
Free-text notes attached to ad-hoc sends	not persisted; only present in `render_snapshot`	server-side only; never exfiltrated to AI without explicit opt-in	RLS-bound
Webhook raw bodies (may contain user agents, IPs)	`gs://…/webhooks/**`	CMEK on bucket	restricted to platform engineers; access audited

Logging policy: addresses are NEVER logged in plaintext; only addressKindHash and recipient_id appear in logs. A pre-commit log-scanner (pii-grep) blocks PRs that introduce log lines containing @ or phone-number-shaped strings outside an allow-list. CI also runs a sample-based Loki query asserting that recent prod logs contain no plaintext PII.
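To make the scanner rule concrete, here is a small sketch of the kind of check described above. The real pii-grep rules are not given in this document, so the two patterns, the `findPiiViolations` name, and the allow-list format are all assumptions chosen for illustration.

```typescript
// Assumed heuristics: any "@" and any run of 9+ digits/separators that looks
// like a phone number. Real rules would likely be tuned to reduce noise.
const EMAIL_LIKE = /@/;
const PHONE_LIKE = /\+?\d[\d\s().-]{7,}\d/;

// Returns the candidate log lines that would block a PR.
// allowList entries must not use the `g` flag, since RegExp.test is stateful with it.
function findPiiViolations(logLines: string[], allowList: RegExp[]): string[] {
  return logLines.filter((line) => {
    if (allowList.some((rule) => rule.test(line))) return false;
    return EMAIL_LIKE.test(line) || PHONE_LIKE.test(line);
  });
}

// Example input: lines a diff introduces into logging calls.
const candidateLines = [
  'logger.info("notification sent", { recipient_id })',
  'logger.info("sent to guest@example.com")',
  'logger.warn("sms failed for +92 300 1234567")',
  'logger.info("support contact is noreply@example.com")',
];
const violations = findPiiViolations(candidateLines, [/noreply@example\.com/]);
```

Patterns this broad will also flag timestamps and similar digit runs, which is why an allow-list is part of the design.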

Crypto-shred on iam.user.deleted.v1: we destroy the per-tenant DEK that wraps the user's address ciphertexts; the row remains for suppression continuity but is unreadable.
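The envelope scheme and the crypto-shred step can be sketched with Node's built-in `crypto` module. This is an illustration under stated assumptions: the local `kek` buffer stands in for a Cloud KMS key (real code would call KMS to wrap and unwrap), AES-256-GCM is an assumed cipher choice, and `seal`, `open`, and `decryptAddress` are hypothetical helper names.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Sealed {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

// AES-256-GCM encrypt with a fresh 96-bit IV per call.
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

// Decrypt and authenticate; throws if the key or ciphertext is wrong.
function open(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

const kek = randomBytes(32); // stand-in for the KEK held in Cloud KMS
const dek = randomBytes(32); // data-encryption key
const wrappedDek: Sealed = seal(kek, dek); // what gets stored next to the data
const addressCiphertext: Sealed = seal(dek, Buffer.from("guest@example.com", "utf8"));

// Send path: unwrap the DEK, then decrypt the address. A null wrapped DEK
// models the state after crypto-shredding.
function decryptAddress(wrapped: Sealed | null, ciphertext: Sealed): string {
  if (wrapped === null) {
    throw new Error("DEK destroyed: address ciphertext is unreadable");
  }
  const plainDek = open(kek, wrapped);
  return open(plainDek, ciphertext).toString("utf8");
}
```

Once the wrapped DEK is gone, the ciphertext row can stay in place for suppression continuity, since nothing remaining in the system can decrypt it.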

6. Vendor credentials

All vendor API keys / OAuth client secrets / WhatsApp Business tokens / FCM service account JSONs live in Secret Manager (regional, CMEK).
A ChannelCredential row stores only a secret_ref (secretmanager://projects/.../secrets/.../versions/N); the plaintext is fetched at adapter-init time and held in process memory.
Rotation: POST /api/v1/notification-channels/{id}/credentials/{credId}/rotate writes a new Secret Manager version, marks the new ChannelCredential active, and leaves the prior one superseded for a 24 h overlap. After 24 h the superseded credential is revoked and its secret version is disabled.
Secret access audit: every Secret Manager AccessSecretVersion call is exported to BigQuery (audit.secret_access) and tagged with requesterServiceAccount.
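The rotation lifecycle above amounts to a small state machine: active, then superseded for 24 h, then revoked. The sketch below models only those transitions; the `ChannelCredential` fields, the function names, and the placeholder `secretRef` strings are hypothetical, and no Secret Manager calls are shown.

```typescript
type CredentialState = "active" | "superseded" | "revoked";

// Illustrative shape; the real row holds a secretmanager:// reference.
interface ChannelCredential {
  id: string;
  secretRef: string;
  state: CredentialState;
  supersededAtMs?: number;
}

const OVERLAP_MS = 24 * 60 * 60 * 1000;

// Demote the current active credential and append the new active one.
function rotate(
  existing: ChannelCredential[],
  next: { id: string; secretRef: string },
  nowMs: number,
): ChannelCredential[] {
  const demoted = existing.map((c): ChannelCredential =>
    c.state === "active" ? { ...c, state: "superseded", supersededAtMs: nowMs } : c,
  );
  const promoted: ChannelCredential = { id: next.id, secretRef: next.secretRef, state: "active" };
  return [...demoted, promoted];
}

// Periodic sweep: revoke credentials whose 24 h overlap has elapsed.
function expireSuperseded(creds: ChannelCredential[], nowMs: number): ChannelCredential[] {
  return creds.map((c): ChannelCredential =>
    c.state === "superseded" &&
    c.supersededAtMs !== undefined &&
    nowMs - c.supersededAtMs >= OVERLAP_MS
      ? { ...c, state: "revoked" }
      : c,
  );
}
```

Keeping the overlap as a separate sweep means in-flight sends that already loaded the old credential keep working until the window closes.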

Access surface:

Cloud Run service account: notification-service-runtime@<project>.iam.gserviceaccount.com — roles/secretmanager.secretAccessor only on notification-* secrets.
Tenant admins NEVER see plaintext credentials. The "create credential" UI accepts a one-shot input, returns success or error, and only echoes back the secret_ref and last-4 fingerprint thereafter.

7. Webhook security

Each vendor has a dedicated route POST /api/v1/webhooks/vendors/{vendor} with a per-vendor HMAC verifier picked from a registry.
The verifier reads the raw body before body parsing; SignedHeader + body bytes are concatenated per vendor algorithm; comparison is constant-time.
Replay window: most vendors include a timestamp in headers; requests with more than 5 min of skew are rejected with 401 and no body is persisted.
Even if an attacker sends a replay within the window, the application-level dedupe (vendor + vendorMessageId + type + occurredAt) prevents double-applying state.
Cloud Armor rule: 1000 req/s per vendor source, 429 above that; bodies larger than 1 MB are rejected at the LB.

For SendGrid / Mailgun / SES / Twilio / Infobip / Meta WhatsApp / FCM, the vendor-specific HMAC algorithms and headers are codified in src/infrastructure/webhooks/<vendor>/verifier.ts. A contract test (tests/contracts/webhooks/<vendor>.test.ts) exercises positive and negative samples per vendor.
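A generic verifier in the spirit of the rules above might look like the following sketch. The signing scheme here (hex HMAC-SHA256 over `timestamp + "." + rawBody`) is a made-up example and does not describe any particular vendor; the request shape and function names are likewise hypothetical. What it does show is the order of checks: replay window first, then a constant-time comparison, plus the dedupe key built from the four fields named above.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const WEBHOOK_MAX_SKEW_SECONDS = 5 * 60;

// Hypothetical normalised view of an inbound webhook.
interface WebhookRequest {
  rawBody: Buffer; // bytes as received, before any JSON parsing
  timestampHeader: string; // seconds since epoch, as sent by the vendor
  signatureHeader: string; // hex-encoded HMAC-SHA256
}

type VerifyResult = "ok" | "stale" | "bad_signature";

function verifyWebhook(req: WebhookRequest, secret: string, nowSeconds: number): VerifyResult {
  const sentAt = Number(req.timestampHeader);
  if (!Number.isFinite(sentAt) || Math.abs(nowSeconds - sentAt) > WEBHOOK_MAX_SKEW_SECONDS) {
    return "stale";
  }
  const expected = createHmac("sha256", secret)
    .update(req.timestampHeader)
    .update(".")
    .update(req.rawBody)
    .digest();
  const provided = Buffer.from(req.signatureHeader, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (provided.length !== expected.length) return "bad_signature";
  return timingSafeEqual(provided, expected) ? "ok" : "bad_signature";
}

// Application-level dedupe key for replays that arrive inside the window.
function dedupeKey(e: { vendor: string; vendorMessageId: string; type: string; occurredAt: string }): string {
  return [e.vendor, e.vendorMessageId, e.type, e.occurredAt].join("|");
}
```

Both "stale" and "bad_signature" would map to a 401 with nothing persisted; keeping them distinct internally is mainly useful for metrics.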

8. Opt-out tokens

Tokens are 256 bits of randomness, URL-safe base64 encoded; only the sha256 is stored, and the plaintext appears only in the URL.
Bound to (recipient_id, channel, tenant_id); single use; 30-day expiry.
The opt-out endpoint POST /api/v1/notification-preferences/opt-out/{token} is public, rate-limited (30 req/min/IP), and responds with Cache-Control: no-store.
Successful opt-out emits notification.opted_out.v1; the corresponding suppression row prevents future sends.
CSRF: the opt-out is a single explicit POST gated by a one-tap confirmation page; we do not rely on session cookies.
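The token lifecycle described in this section can be sketched as issue and redeem functions. The row shape loosely mirrors what an `opt_out_tokens` record might hold, but the field names, the in-memory `Map` store, and the function names are assumptions for illustration only.

```typescript
import { createHash, randomBytes } from "node:crypto";

const OPT_OUT_TTL_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Hypothetical row shape; only the hash is stored, never the token itself.
interface OptOutTokenRow {
  tokenHash: string;
  recipientId: string;
  channel: string;
  tenantId: string;
  expiresAtMs: number;
  usedAtMs: number | null;
}

const sha256Hex = (value: string): string =>
  createHash("sha256").update(value, "utf8").digest("hex");

function issueOptOutToken(
  binding: { recipientId: string; channel: string; tenantId: string },
  nowMs: number,
): { token: string; row: OptOutTokenRow } {
  const token = randomBytes(32).toString("base64url"); // 256 bits, URL-safe
  const row: OptOutTokenRow = {
    tokenHash: sha256Hex(token),
    ...binding,
    expiresAtMs: nowMs + OPT_OUT_TTL_MS,
    usedAtMs: null,
  };
  return { token, row };
}

// Returns the bound row on first valid use, otherwise null.
function redeemOptOutToken(
  store: Map<string, OptOutTokenRow>,
  token: string,
  nowMs: number,
): OptOutTokenRow | null {
  const row = store.get(sha256Hex(token));
  if (row === undefined || row.usedAtMs !== null || nowMs >= row.expiresAtMs) return null;
  row.usedAtMs = nowMs; // single use
  return row;
}
```

Looking the token up by its hash means a leaked database dump does not yield usable opt-out links.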

9. Rate limiting and abuse control

Surface	Limit	Action when exceeded
`POST /api/v1/notifications`	100 req/s/tenant; 5 req/s/IP from staff origin	429 with `Retry-After`; bulk patterns trigger Cloud Armor backoff
`POST /api/v1/notifications/batch`	5 req/min/tenant; max 1 active per category	429
`POST /api/v1/notifications/{id}/resend`	3 resends per notification per 24 h	429
Per-recipient per-day across all channels	tenant-policy default 5; tenant-configurable up to 20	suppress with `reason='rate_limit'`
WhatsApp marketing per recipient	1 per 24 h (Meta policy)	suppress
Vendor webhooks	1000 req/s/vendor IP	429
Opt-out endpoint	30 req/min/IP	429; sustained → Cloud Armor block
WS feed connections	1 per recipient	refuse second connect; existing wins

Repeated 401/429 responses to a single IP (>50/min) trigger a Cloud Armor adaptive rule for 1 h.
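As one way to picture the per-recipient daily cap from the table, the sketch below counts sends per recipient per UTC day and returns a suppression decision once the limit is reached. The class name, the UTC-day bucketing, and the in-memory `Map` (standing in for whatever shared store the service actually uses) are assumptions; only the default of 5, the ceiling of 20, and the `rate_limit` reason come from the table.

```typescript
const DEFAULT_DAILY_LIMIT = 5;
const MAX_DAILY_LIMIT = 20;
const DAY_MS = 24 * 60 * 60 * 1000;

type SendDecision = { send: true } | { send: false; reason: "rate_limit" };

class DailyRecipientLimiter {
  private readonly counts = new Map<string, number>();
  private readonly limitPerDay: number;

  constructor(limitPerDay: number = DEFAULT_DAILY_LIMIT) {
    if (!Number.isInteger(limitPerDay) || limitPerDay < 1 || limitPerDay > MAX_DAILY_LIMIT) {
      throw new RangeError("limitPerDay must be an integer between 1 and " + MAX_DAILY_LIMIT);
    }
    this.limitPerDay = limitPerDay;
  }

  // Counts across all channels; the caller records a suppression when send is false.
  decide(recipientId: string, nowMs: number): SendDecision {
    const dayBucket = Math.floor(nowMs / DAY_MS); // assumption: UTC day boundaries
    const key = recipientId + ":" + dayBucket;
    const used = this.counts.get(key) ?? 0;
    if (used >= this.limitPerDay) return { send: false, reason: "rate_limit" };
    this.counts.set(key, used + 1);
    return { send: true };
  }
}
```

A fixed daily window is the simplest reading of "per day"; a sliding 24 h window would be a reasonable alternative if bursts around midnight matter.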

10. Sender-ID and brand impersonation

A Sender is bound to a Channel and validated against per-tenant verification records:
- Email: DKIM/SPF for the fromEmail.domain must verify (verified out-of-band via tenant-service DNS workflow). EnqueueNotificationUseCase rejects sends whose Sender domain is not verified.
- SMS PK/AF: senderId must reference a registrationRef (PTA/regulator approval). Missing → 422.
- WhatsApp: whatsappPhoneNumberId must match Meta's verified number for the tenant.
Tenant-on-tenant brand impersonation: a tenant cannot configure a Sender that uses another tenant's verified domain or sender id. The verification table is unique per (domain | senderId | phoneNumberId).
Display-name spoof checks: Sender.fromName cannot exactly match well-known platform names ("Melmastoon", "Ghasi", "Support", etc.) for tenant senders.

11. Output sanitisation

Email HTML: rendered MJML output is sanitised server-side through dompurify with an allowlist; inline scripts and event handlers are stripped before the checksum is computed.
SMS / WhatsApp / push bodies: stripped of control chars; emoji allowed; URL footers signed (HMAC tracking) so we can attribute clicks without exposing internal ids.
Opt-out URLs and tracking pixels live on https://*.notify.melmastoon.com (separate cookie-isolated domain).
Subject lines have a length cap per channel (email <= 140, others n/a) enforced post-render.
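Two of the text-channel rules above lend themselves to a short sketch: stripping control characters and signing tracking URLs. The exact character set removed, the query parameter names `r` and `sig`, the opaque reference format, and the example host are all assumptions for illustration; the document only states that control characters are stripped and that URL footers carry an HMAC.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Removes C0 control characters and DEL, keeping tab, LF and CR (assumed policy).
// Emoji and other non-ASCII text pass through untouched.
function stripControlChars(body: string): string {
  return body.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "");
}

// Appends an opaque reference and its HMAC so clicks can be attributed
// without putting internal ids in the URL.
function signTrackingUrl(base: string, opaqueRef: string, secret: string): string {
  const url = new URL(base);
  url.searchParams.set("r", opaqueRef);
  url.searchParams.set("sig", createHmac("sha256", secret).update(opaqueRef).digest("base64url"));
  return url.toString();
}

// Returns the opaque reference if the signature checks out, otherwise null.
function verifyTrackingUrl(href: string, secret: string): string | null {
  const url = new URL(href);
  const ref = url.searchParams.get("r");
  const sig = url.searchParams.get("sig");
  if (ref === null || sig === null) return null;
  const expected = createHmac("sha256", secret).update(ref).digest();
  const provided = Buffer.from(sig, "base64url");
  if (provided.length !== expected.length) return null;
  return timingSafeEqual(provided, expected) ? ref : null;
}
```

The opaque reference would be resolved server-side to the real notification, so a recipient who inspects the link learns nothing about internal identifiers.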

12. Threat model (STRIDE summary)

Threat	Attack	Mitigation
Spoofing	Forged inbound webhook	HMAC + replay-window + dedupe + Cloud Armor
Spoofing	Forged staff token	RS256 JWKS validation, tenant claim check, mTLS for service-to-service
Tampering	Modify rendered HTML in transit	`renderSnapshot.checksum` (sha256) + immutable GCS object versioning
Repudiation	Tenant denies sending a marketing blast	Outbox + DispatchBatch + `template.published.v1` audit trail with `publishedBy`
Information disclosure	Cross-tenant data leak	RLS + `app.tenant_id` discipline + envelope encryption per tenant
Information disclosure	PII in logs	`pii-grep` pre-commit + sample log scan
Denial of service	Abuse of `POST /notifications`	per-route + per-tenant + per-recipient rate limits + Cloud Armor WAF
Denial of service	Vendor webhook flood	1000 rps cap + early HMAC reject + persist-then-process
Elevation of privilege	Staff with `RESERVATIONS` triggering marketing batch	Scope check (`notifications:batch` is granted only to `OWNER`, `GM`, `MARKETING_MANAGER`)
Elevation of privilege	Tenant admin reading another tenant's templates	RLS + tenant claim check
AI prompt injection	Guest name carries `ignore previous instructions`	Structured input, untrusted=true mark, output validation against canonical input — see AI_INTEGRATION §5
Phishing-by-template	Tenant authors a template that mimics a bank/login	Template lint flags suspicious patterns; OWNER required to publish; per-locale moderation queue for new tenants in first 30 days

13. Compliance hooks

GDPR Art 7 (consent) for marketing — recorded in RecipientPreferences.marketingConsent with timestamp + source.
GDPR Art 17 (erasure) — see crypto-shred path above.
CAN-SPAM, CASL, UK PECR: every marketing email carries a physical postal address (from tenant.profile.postalAddress) and a one-tap unsubscribe; marketing batch publish refuses if address is missing.
WhatsApp Business policy: marketing only with opt-in; transactional only via approved templates.
Local SMS regulators: PK PTA registered sender-IDs; AF/IR transmission via licensed aggregators; marketing templates for these markets require a registered short-code or business-name sender id.
Data residency: per-tenant region pinning enforced by routing (tenants_local.region); cross-region replication only for DR backups within the same data-residency zone.

Every compliance-relevant action publishes an event consumed by audit-service (see EVENT_SCHEMAS published list).

14. Penetration test scope

This service is in scope for the platform's annual third-party pentest. Specific test cases:

HMAC bypass on each vendor webhook route.
Forced opt-out via guessable tokens.
Cross-tenant template read via path manipulation.
Cross-tenant suppression read.
Prompt injection via guest-controlled variables.
Brand impersonation via Sender configuration.
DKIM bypass via misconfigured tenant override.
Resend abuse (loops via webhook → resend).
WS feed cross-tenant subscription.

Findings tracked in SERVICE_RISK_REGISTER with severity and time-bound remediation.
