SERVICE_OVERVIEW — notification-service
Bundle index: SERVICE_OVERVIEW · DOMAIN_MODEL · APPLICATION_LOGIC · API_CONTRACTS · EVENT_SCHEMAS · DATA_MODEL · SYNC_CONTRACT · AI_INTEGRATION · SECURITY_MODEL · OBSERVABILITY · TESTING_STRATEGY · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · LOCAL_DEV_SETUP · SERVICE_READINESS · SERVICE_RISK_REGISTER · MIGRATION_PLAN
Strategic anchors: 02 Enterprise Architecture · 04 Event-Driven Architecture · 05 API Design · 06 Data Models · 07 Security/Compliance/Tenancy · 08 AI Architecture · 09 Lock & Key · 10 Payments
1. Purpose
notification-service owns every outbound communication on Ghasi Melmastoon — the multi-tenant hotel SaaS platform whose backoffice is an Electron offline-first desktop and whose cloud is GCP. It is the single mouth of the platform: no other service writes to SMTP, Twilio, FCM, APNs, the WhatsApp Business API, or a voice gateway. Every email, SMS, WhatsApp message, push notification, in-app pop, and (Phase 3+) IVR call to a guest, staff member, tenant admin, or vendor flows through this service.
The service exists for four reasons that no other service can satisfy:
- One preference gate, one suppression list. Opt-outs, regulatory holds, hard-bounce suppression, and per-recipient rate limits must be enforced uniformly across every domain. Spreading them across services produces inconsistent compliance and embarrassing leaks.
- One render pipeline. Templates live versioned, multi-language (Pashto / Dari / Arabic RTL; English / French / Urdu LTR), per-tenant-branded, and previewable. Re-implementing rendering in each upstream service guarantees drift and broken RTL.
- One channel-adapter abstraction. Vendors change: SendGrid → Mailgun, Twilio → Africa's Talking, WhatsApp Business API → Meta Cloud API, FCM → APNs. The NotificationPort keeps every upstream service from owning vendor SDKs.
- One audit trail. Hospitality regulators (especially in PK and UAE) and tenant disputes require provable delivery records. We persist the full lifecycle (requested → scheduled → dispatched → delivered → opened/clicked, or failed/bounced/suppressed) with vendor message ids, latency, and error codes.
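The channel-adapter idea can be illustrated with a minimal TypeScript sketch. The source names NotificationPort but does not define it, so the message and outcome shapes, the withFallback helper, and the stub adapters below are all assumptions. The fallback rule is also simplified: the real policy (three consecutive failures, high-priority categories only) is not modelled.

```typescript
// Hypothetical shapes; the source does not specify the port's signature.
type ChannelKind = "email" | "sms" | "whatsapp" | "push";

interface OutboundMessage {
  tenantId: string;
  channel: ChannelKind;
  to: string;
  body: string;
  idempotencyKey: string;
}

type SendOutcome =
  | { ok: true; vendor: string; vendorMessageId: string }
  | { ok: false; vendor: string; errorCode: string; retryable: boolean };

// The only surface upstream code sees; no vendor SDK types leak through it.
interface NotificationPort {
  send(message: OutboundMessage): Promise<SendOutcome>;
}

// Compose two vendor adapters: try the primary, use the fallback on a retryable failure.
function withFallback(primary: NotificationPort, fallback: NotificationPort): NotificationPort {
  return {
    async send(message: OutboundMessage): Promise<SendOutcome> {
      const first = await primary.send(message);
      if (first.ok || !first.retryable) return first;
      return fallback.send(message);
    },
  };
}

// Stub adapters standing in for real vendor SDK wrappers (illustrative only).
const failingPrimary: NotificationPort = {
  send: async (): Promise<SendOutcome> => ({
    ok: false,
    vendor: "sendgrid",
    errorCode: "503",
    retryable: true,
  }),
};
const workingFallback: NotificationPort = {
  send: async (m: OutboundMessage): Promise<SendOutcome> => ({
    ok: true,
    vendor: "mailgun",
    vendorMessageId: `mg-${m.idempotencyKey}`,
  }),
};

const emailPort: NotificationPort = withFallback(failingPrimary, workingFallback);
```

Swapping a vendor then means replacing one adapter behind the port; callers of emailPort.send are unaffected.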
2. Bounded context
Context name: Communication
Domain class: Generic (commodity capability — reliable, but the differentiator is not sending — it's preferences, branding, audit, and AI-drafted content)
Ubiquitous language: Notification, Channel, Template, TemplateVersion, Recipient, RecipientPreferences, DeliveryAttempt, SuppressionRecord, ChannelCredential, WebhookInbound, TriggerMap, Sender (the per-(tenant, country, channel) identity), DispatchBatch, RenderSnapshot, OptOutToken, AIProvenance (reference only — owned by ai-orchestrator-service), MobileKeyToken (reference only — owned by lock-integration-service).
What is in:
- The Notification lifecycle and its state machine.
- The Template registry, version workflow, preview, test-send, publish, archive.
- RecipientPreferences, opt-out tokens, suppression list, regulatory overrides.
- The NotificationPort abstraction over channel adapters: SendGrid (email primary), Mailgun (fallback), Twilio (SMS + Voice primary), Africa's Talking (SMS fallback for KE/TZ Phase 4), WhatsApp Business API + Meta Cloud API, FCM, native APNs, Web Push (VAPID).
- Vendor webhook ingestion, signature verification, and feedback into the audit + suppression list.
- The trigger map (event-type → template keys) for every consumed event.
- The scheduled-send worker (pre-arrival, post-stay, dunning).
- Per-tenant per-channel daily and per-recipient daily rate limiting.
- Per-(tenant, country code) sender-ID resolution for SMS/WhatsApp.
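As an illustration of the sender-ID item above, here is a minimal sketch of a fail-fast lookup keyed by (tenant, country code, channel). The error code comes from this document; the error class, the in-memory map, and the sample sender-IDs (including "GHASIPK") are made-up stand-ins for the Channel / ChannelCredential lookup.

```typescript
type SenderChannel = "sms" | "whatsapp";

// Error code taken from this document; the class itself is illustrative.
class SenderIdMissingError extends Error {
  readonly code = "MELMASTOON.NOTIFICATION.SENDER_ID_MISSING";

  constructor(tenantId: string, countryCode: string, channel: SenderChannel) {
    super(`No sender-ID for ${tenantId}/${countryCode}/${channel}`);
  }
}

// Hypothetical in-memory stand-in for the per-tenant channel credentials.
const senderIds = new Map<string, string>([
  ["tnt_demo:AF:sms", "GHASI"],
  ["tnt_demo:PK:sms", "GHASIPK"],
]);

// Resolve the sender-ID or throw; never dispatch with a missing sender.
function resolveSenderId(tenantId: string, countryCode: string, channel: SenderChannel): string {
  const senderId = senderIds.get(`${tenantId}:${countryCode}:${channel}`);
  if (senderId === undefined) {
    throw new SenderIdMissingError(tenantId, countryCode, channel);
  }
  return senderId;
}
```

Throwing instead of returning a default keeps a non-compliant sender from ever reaching a vendor call.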
What is out:
- Generating message content: copy that requires AI personalisation/translation is generated by ai-orchestrator-service and arrives here as a rendered string + AIProvenance. We never call models.
- Owning recipient identity / contact verification: iam-service (for users) and reservation-service (for guest contact captured at booking) are sources of truth. We project to a local Recipient cache.
- Sender domain DKIM/SPF/DMARC provisioning: tenant-service owns the per-tenant domain configuration. We read the verified sender identity from there.
- Mobile-key generation: lock-integration-service issues the credential and emits key_credential.issued.v1 carrying the one-time-link token; we deliver the link.
- Folio/invoice rendering: billing-service renders the PDF and stores it; we attach it.
- Theme tokens: theme-config-service owns colors/logo/typography; we read them at render time.
3. Aggregates owned
| Aggregate | Cardinality | Purpose | Identity prefix |
|---|---|---|---|
| Notification | root, 1 per (recipient, channel, intent) | The dispatch record: status machine, attempts, render snapshot, suppression reason, source-event linkage | ntf_ |
| Template | root, platform-global or tenant-scoped | Logical template by key (e.g., reservation.confirmed.email); pointer to active version, archived versions | tpl_ |
| TemplateVersion | child of Template, 1..N | Immutable rendered body per locale; semver-versioned; states draft → active → archived | tpv_ |
| Recipient | root, 1 per (tenant, contact identity) | Cached projection of guest/staff/vendor identity with verified addresses | rcp_ |
| RecipientPreferences | child of Recipient, 1 | Channel × category opt-outs, locale, quiet hours, timezone | (composite) |
| DeliveryAttempt | child of Notification, 1..N (cap 6) | Per-attempt vendor record (vendor name, vendor message id, latency, outcome, error) | (composite, ULID) |
| SuppressionRecord | root, 1 per (tenant, channel, address-hash) | Hard-bounce / complaint / manual block | sup_ |
| Channel | root, 1 per (tenant, channel-kind) | Per-tenant channel configuration: status, primary vendor, fallback vendor | ch_ |
| ChannelCredential | child of Channel, 1..N | Vendor-specific credentials (API key ciphertext, sender-IDs, DKIM selector, WhatsApp display name, voice caller ID) | chc_ |
| WebhookInbound | root, 1 per inbound vendor callback | Audit row of every vendor delivery webhook received and processed | whi_ |
| OptOutToken | child of Recipient, 1..N | Single-use signed unsubscribe tokens emitted in email footers | (composite, ULID) |
| DispatchBatch | root, 1 per batched send | A marketing blast / dunning sweep: tracks the parent batch metadata + child Notification rows | dbt_ |
Template may be platform-global (tenant_id IS NULL) or tenant-overridden (tenant_id = tnt_…). Resolution is "tenant-override wins; otherwise fall back to platform"; a Template.key exists at most once per (tenant, key) and at most once at platform scope.
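The resolution rule can be sketched as follows. The TemplateRef shape and the resolveTemplate function are hypothetical; only the "tenant-override wins, otherwise platform" rule and the tenant_id IS NULL convention come from the text.

```typescript
interface TemplateRef {
  id: string;              // tpl_ prefix, per the aggregate table
  tenantId: string | null; // null means platform-global
  key: string;             // e.g. "reservation.confirmed.email"
}

// Tenant override wins; otherwise fall back to the platform-global template.
function resolveTemplate(
  templates: readonly TemplateRef[],
  tenantId: string,
  key: string,
): TemplateRef | undefined {
  const candidates = templates.filter((t) => t.key === key);
  return (
    candidates.find((t) => t.tenantId === tenantId) ??
    candidates.find((t) => t.tenantId === null)
  );
}

// Illustrative registry: one platform template and one override for tnt_a.
const registry: TemplateRef[] = [
  { id: "tpl_platform", tenantId: null, key: "reservation.confirmed.email" },
  { id: "tpl_override", tenantId: "tnt_a", key: "reservation.confirmed.email" },
];
```

Because a key exists at most once per tenant and at most once at platform scope, each find call can match at most one row.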
4. Responsibilities (numbered)
- Trigger-map projection. Subscribe to every consumed event; map (event type, payload signals) to one or more (templateKey, recipientResolver, channel) triples. The trigger map is data-driven, hot-reloadable, and tenant-overridable.
- Recipient resolution. Given a source event, derive the recipient(s): the booker guest from a reservation.confirmed.v1; the assigned vendor from a maintenance.work_order.assigned.v1; the tenant admin(s) for billing.subscription.payment_failed.v1. Resolution is via local Recipient cache, falling back to a synchronous read of iam-service / reservation-service if cold.
- Preference + suppression gate. For each (recipient, channel, category), decide send | suppress(reason) | defer(untilTimestamp). Security and regulated categories bypass opt-out (e.g., password reset cannot be opted out).
- Template selection + render. Pick the active TemplateVersion for (tenant?, key) in the recipient's preferred locale (with a fallback chain ps-AF → fa-AF → ar-SA → en). Render with variables, tenant theme tokens, and channel-specific transforms (MJML→HTML+inlined CSS for email; markdown→plaintext for SMS truncation; deep-link wrapping for push).
- Sender-ID resolution. For SMS/WhatsApp, look up the per-(tenant, recipient-country, channel) sender-ID from Channel / ChannelCredential. PK requires registered alphanumeric sender-IDs; UAE TRA the same; AF/IR/TJ accept generic long codes. Fail-fast with MELMASTOON.NOTIFICATION.SENDER_ID_MISSING rather than dispatching with a non-compliant sender.
- Rate-limit + budget gate. Enforce per-tenant per-channel per-day limits and per-recipient per-day limits. Marketing categories also check tenant.notification_budget_remaining. Defer or suppress with audit if exceeded.
- Dispatch. Hand the rendered message to the channel adapter via NotificationPort. The adapter calls the vendor with idempotency keys, captures the vendor message id, and returns an outcome. Retries follow exponential backoff with jitter (capped at 6).
- Webhook ingestion. Vendor delivery callbacks (SendGrid events, Twilio status callbacks, FCM/APNs feedback, WhatsApp BSP status, Meta Cloud API webhooks) are HMAC-validated, persisted as WebhookInbound, and applied to the corresponding Notification's state machine.
- Bounce/complaint handling. Hard bounces and explicit complaints add the address to the per-tenant SuppressionRecord table within 5 minutes of receipt; subsequent sends to that address auto-suppress.
- Scheduled sends. A scheduler worker enqueues pre-arrival reminders (T-24h before stay_start), post-stay thank-you (T+24h after checked_out), and dunning sequences (T+0/T+3/T+7) by reading the notification_scheduled table and creating Notification rows when run_after <= now().
- Mobile-key delivery. On lock_integration.key_credential.issued.v1, render and dispatch the one-time-link or QR-code message over the recipient's chosen channel. Coordinate token expiry with lock-integration-service via the mobile_key_token payload.
- AI-drafted content reception. Accept pre-rendered AI content via the ai_drafted_content_ready.v1 event from ai-orchestrator-service; store the rendered body + AIProvenance block; require HITL approval before activating an AI-drafted TemplateVersion for non-test sends.
- Channel health monitoring. Probe each channel's primary vendor every 60 s; flip Channel.status to degraded after 3 consecutive failures and emit melmastoon.notification.channel.health_changed.v1; fall back to the secondary vendor automatically for high-priority categories.
- Audit trail. Every state transition, every webhook, every suppression flip is appended to the audit projection consumed by audit-service for compliance reporting.
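The preference + suppression gate listed above could look roughly like the sketch below. The send | suppress | defer outcome and the opt-out bypass for security and compliance categories come from this document. The input shape, the check order (suppression list first), and the choice that regulated categories also skip quiet hours are assumptions.

```typescript
type GateChannel = "email" | "sms" | "whatsapp" | "push" | "inapp";

type GateDecision =
  | { kind: "send" }
  | { kind: "suppress"; reason: string }
  | { kind: "defer"; until: Date };

interface GateInput {
  channel: GateChannel;
  category: string;           // e.g. "security", "compliance", "marketing"
  addressSuppressed: boolean; // hit in the tenant's suppression list
  optedOut: boolean;          // recipient opted out of (channel, category)
  quietHoursEnd: Date | null; // non-null when "now" falls inside quiet hours
}

// Categories that bypass opt-out, per the invariants in this document.
const REGULATED_CATEGORIES = new Set<string>(["security", "compliance"]);

function decideGate(input: GateInput): GateDecision {
  // Assumption: a suppressed address is never dispatched to, whatever the category.
  if (input.addressSuppressed) return { kind: "suppress", reason: "suppression_list" };
  // Regulated categories ignore opt-outs (and, by assumption, quiet hours).
  if (REGULATED_CATEGORIES.has(input.category)) return { kind: "send" };
  if (input.optedOut) return { kind: "suppress", reason: "opted_out" };
  if (input.quietHoursEnd !== null) return { kind: "defer", until: input.quietHoursEnd };
  return { kind: "send" };
}
```

Returning a tagged union rather than a boolean keeps the suppression reason and the defer timestamp available for the audit trail.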
5. Upstream / downstream context map
┌─────────────────────────────────┐
│ tenant-service │ per-tenant theme,
│ │ branding, sender-IDs,
│ │ channel budget,
│ │ domain (DKIM/SPF/DMARC)
└────────────────┬────────────────┘
│ tenant.settings.changed.v1
│ tenant.invitation.sent.v1
│ tenant.domain.verified.v1
▼
┌──────────────────┐ ┌──────────────────────────┐
│ reservation-svc │ confirmed/cancelled/modified/checked_in/ │ billing-service │
│ │ checked_out/dates_changed │ invoice.generated/ │
└────────┬─────────┘ │ subscription.payment_ │
│ │ failed │
│ └─────────────┬────────────┘
┌────────▼─────────┐ │
│ lock-integration │ key_credential.issued/revoked/expired │
│ service │ │
└────────┬─────────┘ │
│ │
┌────────▼─────────┐ ┌─────────────▼────────────┐
│ iam-service │ password.reset_requested/ │ maintenance-service │
│ │ user.invited/ │ work_order.assigned/ │
│ │ session.suspicious_login │ vendor.notify_required │
└────────┬─────────┘ └─────────────┬────────────┘
│ │
▼ ▼
┌────────────────────────────────────────────────────────────────────────────────────────┐
│ notification-service │
│ ┌────────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Trigger map │→ │ Pref + supp │→ │ Renderer │→ │ Dispatcher │ → Vendor │
│ │ │ │ gate │ │ (i18n+RTL+ │ │ (Port + │ │
│ │ │ │ │ │ per-tenant │ │ adapters) │ │
│ │ │ │ │ │ branding) │ │ │ │
│ └────────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ▲ │ │
│ │ ▼ │
│ ┌──────── ─ ─ ────┐ ┌─────────────────────┐ │
│ │ ai-drafted_ │ ai_drafted_content_ready.v1 │ webhook ingest │ ◀── vendor │
│ │ content (HITL) │ ◀────────────────────────── │ (HMAC-validated) │ callbacks │
│ └─────────────────┘ from ai-orchestrator-service └─────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Vendors: SendGrid · Mailgun · Twilio · WhatsApp BSP / Meta │
│ Cloud API · FCM · APNs · Web Push (VAPID) · Voice │
└───────────────────────────────────────────────────────────────┘
│
▼
Recipients (guests, staff, vendors, tenant admins)
6. Booking-confirmation notification — ASCII sequence
The most-trafficked flow: a guest confirms a booking; we send confirmation in their language with the tenant's branding.
reservation-svc notification-svc ai-orchestrator-svc channel-adapter Vendor (e.g., Twilio)
│ │ │ │ │
│ reservation.confirmed.v1 │ │ │
├───────────────────────▶│ on event: │ │ │
│ │ 1. trigger-map lookup │ │ │
│ │ 2. resolve recipient │ │ │
│ │ 3. pref+supp gate │ │ │
│ │ 4. select TemplateVersion │ │ │
│ │ (tenant override?) │ │ │
│ │ 5. is AI personalisation │ │ │
│ │ enabled for tenant? │ │ │
│ │ YES ──────────────────▶ POST /draft (HITL gated) │ │
│ │ │ returns rendered body │ │
│ │ │ + AIProvenance │ │
│ │ ◀──────────────────────────│ │ │
│ │ NO → render via Handlebars + MJML deterministically │
│ │ 6. resolve sender-ID for │ │
│ │ +93 (AF) → 'GHASI' or LC │ │
│ │ 7. rate-limit check │ │
│ │ 8. persist Notification + outbox │ │
│ │ ──── melmastoon.notification.requested.v1 │ │
│ │ │ │
│ │ 9. dispatch via Port ──────────────────────────────▶ │ Twilio API call │
│ │ │ ───────────────────────▶ │
│ │ │ ◀─── 202 + sid ─────────│
│ │ ──── melmastoon.notification.dispatched.v1 │ │
│ │ │ │
│ │ ◀── webhook /webhooks/vendors/twilio (HMAC-signed) ──┴── status=delivered ──── │
│ │ 10. apply to Notification: state=delivered │ │
│ │ ──── melmastoon.notification.delivered.v1 │ │
│ │ │ │
│ (T-24h scheduled): pre-arrival reminder │ │
│ (T+24h scheduled): post-stay thank-you + invoice link │ │
Compensation paths:
- Recipient suppressed (hard-bounce on file) → Notification enters suppressed; emit melmastoon.notification.suppressed.v1; no dispatch attempt.
- Vendor down → after 3 consecutive 5xx, fall back to secondary vendor for high-priority categories; for low-priority (marketing) defer to retry queue with exponential backoff.
- WhatsApp template approval pending → fall back to SMS for transactional categories; for marketing, defer to template.approval_pending queue and alert tenant admin.
- Sender-ID missing for the recipient country → reject the dispatch with MELMASTOON.NOTIFICATION.SENDER_ID_MISSING; emit melmastoon.notification.failed.v1; alert tenant admin in backoffice.
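The retry path mentioned above (exponential backoff with jitter, at most 6 attempts) might be computed as in this sketch. Only the cap of 6 is from the document. The 1 s base, the 60 s ceiling, and the "full jitter" variant (a uniform draw between zero and the exponential ceiling) are assumptions.

```typescript
const MAX_ATTEMPTS = 6; // attempt cap stated in this document

// Returns the delay before the next attempt, or null when no retry is allowed.
// attemptsMade counts attempts already performed (1 after the first failure).
function nextRetryDelayMs(
  attemptsMade: number,
  random: () => number = Math.random, // injectable for deterministic tests
  baseMs: number = 1000,              // assumed base delay
  capMs: number = 60000,              // assumed upper bound on any delay
): number | null {
  if (attemptsMade >= MAX_ATTEMPTS) return null;
  const exponent = Math.max(0, attemptsMade - 1);
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, exponent));
  // Full jitter: spread retries uniformly over [0, ceiling) to avoid thundering herds.
  return Math.floor(random() * ceiling);
}
```

Injecting the random source makes the schedule reproducible in tests while production code keeps the default Math.random.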
7. Key invariants enforced in the domain layer
- No cross-tenant references. Every aggregate carries a TenantId value object; the constructor refuses missing or mismatched values. Platform-global templates carry tenantId = null only and are constructed via a separate PlatformTemplate.create() factory. (MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE)
- A Notification cannot dispatch without a resolved Sender. SMS/WhatsApp dispatches require a senderId present in Channel.credentials; email requires a verified DKIM-signed from domain. (MELMASTOON.NOTIFICATION.SENDER_ID_MISSING)
- Regulated categories cannot be opted out. RecipientPreferences.channels.security.email|sms always evaluates to instant; compliance.email|inapp always instant. Constructor rejects updates that would set them to off. (MELMASTOON.NOTIFICATION.CHANNEL_DISABLED is only raised for non-regulated categories.)
- A Notification's templateVersion is pinned at enqueue. Once the row is written, even if a new template version is published, the in-flight notification renders the snapshot. Re-render is a new Notification.
- State transitions follow the declared graph only (DOMAIN_MODEL §3). Re-sending a delivered notification creates a new sibling row rather than rewinding the state machine.
- attempts.length <= 6 for non-inapp channels. inapp has no retries (web socket deliver-or-drop with reconciliation on next connect).
- Suppression is tenant-scoped. A hard-bounce for guest@example.com against tenant A does not suppress the same address for tenant B (different sending domain, different reputation).
- AI-drafted templates require HITL before activation. A TemplateVersion whose source = 'ai_drafted' cannot transition draft → active without an approvedBy actor and approvedAt timestamp captured. (MELMASTOON.AI.HITL_REQUIRED)
- OCC version checked on every preference save. (MELMASTOON.GENERAL.PRECONDITION_FAILED)
- Webhook idempotency. Vendor webhook ingestion deduplicates by (vendor, vendor_message_id, event_type) for 7 days.
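The webhook idempotency invariant can be sketched as a deduplicator keyed on (vendor, vendor_message_id, event_type) with a 7-day window. The class and its in-memory Map are illustrative assumptions; the document does not say which store backs the window, and this sketch does not evict expired entries.

```typescript
const DEDUP_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7 days, per the invariant

interface VendorWebhookEvent {
  vendor: string;
  vendorMessageId: string;
  eventType: string;
}

class WebhookDeduplicator {
  // key -> epoch ms at which the event was first accepted
  private readonly firstSeen = new Map<string, number>();

  // Returns true when the event is new and should be applied to the Notification.
  accept(event: VendorWebhookEvent, nowMs: number): boolean {
    // JSON.stringify of a tuple avoids collisions from delimiter characters in ids.
    const key = JSON.stringify([event.vendor, event.vendorMessageId, event.eventType]);
    const seenAt = this.firstSeen.get(key);
    if (seenAt !== undefined && nowMs - seenAt < DEDUP_WINDOW_MS) return false;
    this.firstSeen.set(key, nowMs);
    return true;
  }
}
```

Including the event type in the key lets a delivered callback and a later opened callback for the same vendor message both pass, while vendor retries of either one are dropped.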
8. Hot read paths
| Read | Frequency | Caching strategy |
|---|---|---|
| Active TemplateVersion for (tenant?, key, locale) | per dispatch (very high) | Memorystore key `notif:tpl:<tenantId |
| RecipientPreferences for (tenantId, recipientId) | per dispatch | Memorystore (TTL 10 min, invalidated on preferences.updated.v1) |
| Suppression check for (tenantId, channel, addressHash) | per dispatch | Memorystore set membership (TTL 1 min) backed by Postgres suppression_records |
| Channel config for (tenantId, channelKind) | per dispatch | Memorystore (TTL 5 min, invalidated on channel.health_changed.v1 / channel CRUD) |
| Notification feed for an in-app recipient | UI poll every 30 s + WS push | Memorystore key notif:feed:<tenantId>:<userId> LRU 100 items, TTL 60 s |
| Notification audit lookup by (tenantId, sourceEventId) | low (admin) | Postgres indexed lookup; no cache |
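The TTL-plus-invalidation pattern in the table can be illustrated with a small read-through cache. This is an in-memory stand-in for Memorystore, not the service's code: the TtlCache class, the synchronous loader, and the notif:prefs: key format are assumptions (the table gives key formats only for templates and the in-app feed).

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch ms
}

class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Return the cached value while fresh; otherwise call the loader and cache its result.
  getOrLoad(key: string, nowMs: number, load: () => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > nowMs) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: nowMs + this.ttlMs });
    return value;
  }

  // Called when an invalidating event (e.g. preferences.updated.v1) arrives.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

interface CachedPreferences {
  locale: string;
}

// TTL 10 min, matching the RecipientPreferences row.
const prefsCache = new TtlCache<CachedPreferences>(10 * 60 * 1000);

// Hypothetical key format, modelled on the notif:feed:<tenantId>:<userId> key.
const prefsKey = (tenantId: string, recipientId: string): string =>
  `notif:prefs:${tenantId}:${recipientId}`;
```

Event-driven invalidation bounds staleness for preference changes, and the TTL covers the case where an invalidation event is missed.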
9. Cost & scale envelope
| Dimension | Target |
|---|---|
| Notifications per active tenant per day | 10 (smallest guesthouse) → 5,000 (large chain property + marketing blast) |
| Platform-wide steady state (Phase 2, 200 tenants) | ~500K notifications/day, peak 50/s |
| Template render p99 | < 30 ms (cached AST, warm Redis) |
| Dispatch p99 (excluding vendor latency) | < 100 ms |
| Webhook ingest p99 | < 200 ms (HMAC verify + persist + state apply) |
| Cloud Run min replicas (API) | 3 |
| Cloud Run min replicas (dispatch worker per channel) | 2 (email, SMS, WhatsApp, push share a pool with channel-affinity) |
| Cloud Run min replicas (scheduler worker) | 2 |
| Cloud Run min replicas (webhook ingest) | 1 (autoscale on RPS) |
| Cloud SQL Postgres CPU | shared with PMS-core services on the regional HA instance |
| delivery_attempts partitioned by month, 24-month rolling on hot tier | ~15M rows/month at Phase 2 |
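A quick arithmetic check shows how the figures in the table relate. It assumes a 30-day month and roughly one DeliveryAttempt row per notification; neither assumption is stated in the document.

```typescript
const notificationsPerDay = 500000; // Phase 2 steady state, from the table
const secondsPerDay = 24 * 60 * 60;

// Average rate implied by the daily volume.
const averagePerSecond = notificationsPerDay / secondsPerDay;

// Ratio between the stated 50/s peak and that average.
const peakToAverage = 50 / averagePerSecond;

// Monthly delivery_attempts rows if each notification yields about one attempt.
const attemptRowsPerMonth = notificationsPerDay * 30;

// Hot-tier size under the 24-month rolling window.
const hotTierRows = attemptRowsPerMonth * 24;
```

Under these assumptions the average is about 5.8 notifications/s, the stated peak is about 8.6 times the average, and the monthly row count matches the ~15M figure, giving roughly 360M rows across the 24-month hot tier.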
10. Decision log (anchors)
- Why a separate service rather than a library: a service spares every upstream team from duplicating preference, suppression, branding, audit, and adapter logic; it centralises vendor SDKs (and their secrets) in one runtime; and it gives one place to enforce per-market sender-ID rules.
- Why we accept events rather than expose a write API as the primary path: most notifications are reactions to domain facts, and consuming events decouples the notification subsystem from upstream service runtimes (a reservation-service outage doesn't lose us a confirmation; we replay events). An ad-hoc API exists for staff-initiated and AI-drafted sends.
- Why we do not call AI models directly: every AI call must flow through ai-orchestrator-service for routing, moderation, budget, HITL, and AIProvenance per 02 §11. We receive pre-rendered content via an event; we never embed model SDKs.
- Why SMS-first for AF/PK/IR/TJ: feature-phone reach. WhatsApp is preferred when the recipient has a known WA number, but transactional flows fall back to SMS automatically.
- Why per-tenant suppression scoping: sender domain reputation is per-tenant. Tenants with verified domains have their own DKIM-signed sender; a hard-bounce affects their reputation, not the platform's.
- Why 7-year retention for regulated categories: financial receipts, KYC notices, mobile-key issuance audit, and dunning sequences are subject to local regulator audit windows in PK/UAE and to tenant-side disputes; 7 years exceeds every relevant regime.
11. What this service depends on (libraries, ports, infrastructure)
- NestJS for presentation + DI composition root (out of the domain layer).
- Drizzle ORM for Postgres access in the infrastructure layer.
- @google-cloud/pubsub for outbox publishing and consumed-event subscription.
- @google-cloud/storage for storing rendered email HTML bodies (hot 30 d) and inbound webhook raw payloads (audit 90 d).
- Memorystore (Redis) for template AST cache, preferences cache, suppression set, rate counters, in-app feed.
- Handlebars (sandboxed) for variable interpolation; MJML for email layout → HTML; html-to-text for plain-text fallback; mjml-rtl for RTL flipping.
- Vendor SDKs behind adapters: @sendgrid/mail, mailgun.js, twilio, whatsapp-business, firebase-admin, node-apn, web-push.
- Ports the application layer depends on (interfaces only):
  - NotificationRepository
  - TemplateRepository
  - RecipientRepository
  - RecipientPreferencesRepository
  - SuppressionRepository
  - ChannelConfigRepository
  - WebhookInboundRepository
  - EventPublisher (outbox-backed)
  - EmailPort, SmsPort, WhatsAppPort, PushPort, InAppPort, VoicePort (one per channel kind, one adapter per vendor)
  - Clock, IdGenerator, Hasher, HmacVerifier
  - AIClient (calls ai-orchestrator-service only for HITL approval surface; never for live model calls)
  - IdentityResolver (resolves recipient identity from iam-service / reservation-service)
  - TenantConfigClient (reads per-tenant theme, sender-IDs, domain, budget from tenant-service)
The domain layer depends on nothing outside @ghasi/domain-primitives and the standard library. CI fails the build on any framework or I/O import inside src/domain/.
12. References
- Booking saga and notification ordering: 04 Event-Driven Architecture §7
- API conventions: 05 API Design
- Schema, RLS, ID prefixes: 06 Data Models
- Multi-tenancy and data residency: 07 Security/Compliance/Tenancy
- AI orchestration and provenance: 08 AI Architecture
- Mobile-key delivery interaction: 09 Lock & Key Integration
- Naming, error codes: standards/NAMING.md, standards/ERROR_CODES.md
- Sibling: reservation-service, billing-service, lock-integration-service, ai-orchestrator-service