SERVICE_OVERVIEW — notification-service

Bundle index: SERVICE_OVERVIEW · DOMAIN_MODEL · APPLICATION_LOGIC · API_CONTRACTS · EVENT_SCHEMAS · DATA_MODEL · SYNC_CONTRACT · AI_INTEGRATION · SECURITY_MODEL · OBSERVABILITY · TESTING_STRATEGY · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · LOCAL_DEV_SETUP · SERVICE_READINESS · SERVICE_RISK_REGISTER · MIGRATION_PLAN

Strategic anchors: 02 Enterprise Architecture · 04 Event-Driven Architecture · 05 API Design · 06 Data Models · 07 Security/Compliance/Tenancy · 08 AI Architecture · 09 Lock & Key · 10 Payments

1. Purpose

notification-service owns every outbound communication on Ghasi Melmastoon — the multi-tenant hotel SaaS platform whose backoffice is an Electron offline-first desktop and whose cloud is GCP. It is the single mouth of the platform: no other service writes to SMTP, Twilio, FCM, APNs, the WhatsApp Business API, or a voice gateway. Every email, SMS, WhatsApp message, push notification, in-app pop, and (Phase 3+) IVR call to a guest, staff member, tenant admin, or vendor flows through this service.

The service exists for four reasons that no other service can satisfy:

  1. One preference gate, one suppression list. Opt-outs, regulatory holds, hard-bounce suppression, and per-recipient rate limits must be enforced uniformly across every domain. Spreading them across services produces inconsistent compliance and embarrassing leaks.
  2. One render pipeline. Templates live versioned, multi-language (Pashto / Dari / Arabic RTL; English / French / Urdu LTR), per-tenant-branded, and previewable. Re-implementing rendering in each upstream service guarantees drift and broken RTL.
  3. One channel-adapter abstraction. Vendors change. SendGrid → Mailgun, Twilio → Africa's Talking, WhatsApp Business API → Meta Cloud API, FCM → APNs. The NotificationPort keeps every upstream service from owning vendor SDKs.
  4. One audit trail. Hospitality regulators (especially in PK, UAE) and tenant disputes require provable delivery records. We persist the full lifecycle (requested → scheduled → dispatched → delivered → opened/clicked or failed/bounced/suppressed) with vendor message ids, latency, and error codes.

2. Bounded context

Context name: Communication

Domain class: Generic (a commodity capability that must be reliable; the differentiator is not sending itself but preferences, branding, audit, and AI-drafted content)

Ubiquitous language: Notification, Channel, Template, TemplateVersion, Recipient, RecipientPreferences, DeliveryAttempt, SuppressionRecord, ChannelCredential, WebhookInbound, TriggerMap, Sender (the per-(tenant, country, channel) identity), DispatchBatch, RenderSnapshot, OptOutToken, AIProvenance (reference only; owned by ai-orchestrator-service), MobileKeyToken (reference only; owned by lock-integration-service).

What is in:

  • The Notification lifecycle and its state machine.
  • The Template registry, version workflow, preview, test-send, publish, archive.
  • RecipientPreferences, opt-out tokens, suppression list, regulatory overrides.
  • The NotificationPort abstraction over channel adapters: SendGrid (email primary), Mailgun (fallback), Twilio (SMS + Voice primary), Africa's Talking (SMS fallback for KE/TZ Phase 4), WhatsApp Business API + Meta Cloud API, FCM, native APNs, Web Push (VAPID).
  • Vendor webhook ingestion, signature verification, and feedback into the audit + suppression list.
  • The trigger map (event-type → template keys) for every consumed event.
  • The scheduled-send worker (pre-arrival, post-stay, dunning).
  • Per-tenant per-channel daily and per-recipient daily rate limiting.
  • Per-(tenant, country code) sender-ID resolution for SMS/WhatsApp.
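
The per-(tenant, country code) sender-ID lookup in the last bullet can be sketched as follows. This is a minimal illustration, not the service's implementation: `SenderIdEntry`, `SenderIdMissingError`, and `resolveSenderId` are hypothetical names, and the in-memory array stands in for whatever the Channel / ChannelCredential storage actually provides. Only the error code comes from this document.

```typescript
// Hypothetical shapes for illustration; the real model lives in Channel / ChannelCredential.
type SenderChannel = "sms" | "whatsapp";

interface SenderIdEntry {
  tenantId: string;
  countryCode: string; // assumed ISO 3166-1 alpha-2, e.g. "PK"
  channel: SenderChannel;
  senderId: string;
}

class SenderIdMissingError extends Error {
  // Error code as named in this document.
  readonly code = "MELMASTOON.NOTIFICATION.SENDER_ID_MISSING";

  constructor(tenantId: string, countryCode: string, channel: SenderChannel) {
    super(`No sender-ID for tenant=${tenantId} country=${countryCode} channel=${channel}`);
    this.name = "SenderIdMissingError";
  }
}

// Returns the configured sender-ID, or fails fast instead of dispatching
// with a sender that may not be compliant in the recipient's country.
function resolveSenderId(
  entries: readonly SenderIdEntry[],
  tenantId: string,
  countryCode: string,
  channel: SenderChannel,
): string {
  const match = entries.find(
    (e) => e.tenantId === tenantId && e.countryCode === countryCode && e.channel === channel,
  );
  if (match === undefined) {
    throw new SenderIdMissingError(tenantId, countryCode, channel);
  }
  return match.senderId;
}
```

Throwing rather than returning a default keeps the "no non-compliant sender" rule in one place; callers that want to alert a tenant admin can catch the error and inspect `code`.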

What is out:

  • Generating message content: copy that requires AI personalisation/translation is generated by ai-orchestrator-service and arrives here as a rendered string + AIProvenance. We never call models.
  • Owning recipient identity / contact verification: iam-service (for users) and reservation-service (for guest contact captured at booking) are sources of truth. We project to a local Recipient cache.
  • Sender domain DKIM/SPF/DMARC provisioning: tenant-service owns the per-tenant domain configuration. We read the verified sender identity from there.
  • Mobile-key generation: lock-integration-service issues the credential and emits key_credential.issued.v1 carrying the one-time-link token; we deliver the link.
  • Folio/invoice rendering: billing-service renders the PDF and stores it; we attach it.
  • Theme tokens: theme-config-service owns colors/logo/typography; we read them at render time.

3. Aggregates owned

| Aggregate | Cardinality | Purpose | Identity prefix |
|---|---|---|---|
| Notification | root, 1 per (recipient, channel, intent) | The dispatch record: status machine, attempts, render snapshot, suppression reason, source-event linkage | ntf_ |
| Template | root, platform-global or tenant-scoped | Logical template by key (e.g., reservation.confirmed.email); pointer to active version, archived versions | tpl_ |
| TemplateVersion | child of Template, 1..N | Immutable rendered body per locale; semver-versioned; states draft → active → archived | tpv_ |
| Recipient | root, 1 per (tenant, contact identity) | Cached projection of guest/staff/vendor identity with verified addresses | rcp_ |
| RecipientPreferences | child of Recipient, 1 | Channel × category opt-outs, locale, quiet hours, timezone | (composite) |
| DeliveryAttempt | child of Notification, 1..N (cap 6) | Per-attempt vendor record (vendor name, vendor message id, latency, outcome, error) | (composite, ULID) |
| SuppressionRecord | root, 1 per (tenant, channel, address-hash) | Hard-bounce / complaint / manual block | sup_ |
| Channel | root, 1 per (tenant, channel-kind) | Per-tenant channel configuration: status, primary vendor, fallback vendor | ch_ |
| ChannelCredential | child of Channel, 1..N | Vendor-specific credentials (API key ciphertext, sender-IDs, DKIM selector, WhatsApp display name, voice caller ID) | chc_ |
| WebhookInbound | root, 1 per inbound vendor callback | Audit row of every vendor delivery webhook received and processed | whi_ |
| OptOutToken | child of Recipient, 1..N | Single-use signed unsubscribe tokens emitted in email footers | (composite, ULID) |
| DispatchBatch | root, 1 per batched send | A marketing blast / dunning sweep: tracks the parent batch metadata + child Notification rows | dbt_ |

Template may be platform-global (tenant_id IS NULL) or tenant-overridden (tenant_id = tnt_…). Resolution is "tenant-override wins; otherwise fall back to platform"; a Template.key exists at most once per (tenant, key) and at most once at platform scope.
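
The resolution rule above ("tenant-override wins; otherwise fall back to platform") can be sketched in a few lines. This is an illustrative sketch only: `TemplateRef`, `resolveTemplate`, and the sample ids are hypothetical, and a real lookup would go through the TemplateRepository rather than an array.

```typescript
// Hypothetical minimal shape; the real Template aggregate carries more fields.
interface TemplateRef {
  id: string;
  key: string;
  tenantId: string | null; // null marks a platform-global template
}

// Tenant override wins; otherwise fall back to the platform-global template.
function resolveTemplate(
  templates: readonly TemplateRef[],
  tenantId: string,
  key: string,
): TemplateRef | undefined {
  const tenantOverride = templates.find((t) => t.key === key && t.tenantId === tenantId);
  if (tenantOverride !== undefined) {
    return tenantOverride;
  }
  return templates.find((t) => t.key === key && t.tenantId === null);
}

// Example registry with made-up ids: one platform template and one override for tnt_a.
const registry: TemplateRef[] = [
  { id: "tpl_platform", key: "reservation.confirmed.email", tenantId: null },
  { id: "tpl_override", key: "reservation.confirmed.email", tenantId: "tnt_a" },
];
```

With this registry, tenant `tnt_a` resolves to its own override, while any other tenant resolves to the platform template for the same key.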

4. Responsibilities (numbered)

  1. Trigger-map projection. Subscribe to every consumed event; map (event type, payload signals) to one or more (templateKey, recipientResolver, channel) triples. The trigger map is data-driven, hot-reloadable, and tenant-overridable.
  2. Recipient resolution. Given a source event, derive the recipient(s): the booker guest from a reservation.confirmed.v1; the assigned vendor from a maintenance.work_order.assigned.v1; the tenant admin(s) for billing.subscription.payment_failed.v1. Resolution is via local Recipient cache, falling back to a synchronous read of iam-service/reservation-service if cold.
  3. Preference + suppression gate. For each (recipient, channel, category), decide send | suppress(reason) | defer(untilTimestamp). Security and regulated categories bypass opt-out (e.g., password reset cannot be opted out).
  4. Template selection + render. Pick the active TemplateVersion for (tenant?, key) in the recipient's preferred locale (with a fallback chain ps-AF → fa-AF → ar-SA → en). Render with variables, tenant theme tokens, and channel-specific transforms (MJML→HTML+inlined CSS for email; markdown→plaintext for SMS truncation; deep-link wrapping for push).
  5. Sender-ID resolution. For SMS/WhatsApp, look up the per-(tenant, recipient-country, channel) sender-ID from Channel/ChannelCredential. PK requires registered alphanumeric sender-IDs, and the UAE TRA requires the same; AF/IR/TJ accept generic long codes. Fail fast with MELMASTOON.NOTIFICATION.SENDER_ID_MISSING rather than dispatching with a non-compliant sender.
  6. Rate-limit + budget gate. Enforce per-tenant per-channel per-day limits and per-recipient per-day limits. Marketing categories also check tenant.notification_budget_remaining. Defer or suppress with audit if exceeded.
  7. Dispatch. Hand the rendered message to the channel adapter via NotificationPort. The adapter calls the vendor with idempotency keys, captures the vendor message id, and returns an outcome. Retries follow exponential backoff with jitter (capped at 6).
  8. Webhook ingestion. Vendor delivery callbacks (SendGrid events, Twilio status callbacks, FCM/APNs feedback, WhatsApp BSP status, Meta Cloud API webhooks) are HMAC-validated, persisted as WebhookInbound, and applied to the corresponding Notification's state machine.
  9. Bounce/complaint handling. Hard bounces and explicit complaints add the address to the per-tenant SuppressionRecord table within 5 minutes of receipt; subsequent sends to that address auto-suppress.
  10. Scheduled sends. A scheduler worker enqueues pre-arrival reminders (T-24h before stay_start), post-stay thank-you (T+24h after checked_out), and dunning sequences (T+0/T+3/T+7) by reading the notification_scheduled table and creating Notification rows when run_after <= now().
  11. Mobile-key delivery. On lock_integration.key_credential.issued.v1, render and dispatch the one-time-link or QR-code message over the recipient's chosen channel. Coordinate token expiry with lock-integration-service via the mobile_key_token payload.
  12. AI-drafted content reception. Accept pre-rendered AI content via the ai_drafted_content_ready.v1 event from ai-orchestrator-service; store the rendered body + AIProvenance block; require HITL approval before activating an AI-drafted TemplateVersion for non-test sends.
  13. Channel health monitoring. Probe each channel's primary vendor every 60 s; flip Channel.status to degraded after 3 consecutive failures and emit melmastoon.notification.channel.health_changed.v1; fall back to the secondary vendor automatically for high-priority categories.
  14. Audit trail. Every state transition, every webhook, every suppression flip is appended to the audit projection consumed by audit-service for compliance reporting.
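
The preference + suppression gate (responsibility 3) decides send | suppress(reason) | defer(untilTimestamp). The sketch below shows one plausible ordering of the checks. It rests on assumptions that this document does not state: that the suppression list is checked first and applies to every category, that quiet hours are what produces a defer, and that regulated categories skip both opt-out and quiet hours. All type and function names are hypothetical.

```typescript
type GateDecision =
  | { kind: "send" }
  | { kind: "suppress"; reason: "suppression_list" | "opted_out" }
  | { kind: "defer"; until: Date };

// Hypothetical, pre-resolved inputs for one (recipient, channel, category).
interface GateInput {
  regulatedCategory: boolean; // security / compliance categories
  onSuppressionList: boolean; // hard bounce, complaint, or manual block on file
  optedOut: boolean; // recipient opted out of this channel x category
  quietHoursEnd: Date | null; // end of the current quiet-hours window, if inside one
}

function evaluateGate(input: GateInput, now: Date): GateDecision {
  // Assumption: a suppressed address is never dispatched to, whatever the category.
  if (input.onSuppressionList) {
    return { kind: "suppress", reason: "suppression_list" };
  }
  // Regulated categories bypass opt-out (and, by assumption, quiet hours).
  if (input.regulatedCategory) {
    return { kind: "send" };
  }
  if (input.optedOut) {
    return { kind: "suppress", reason: "opted_out" };
  }
  if (input.quietHoursEnd !== null && input.quietHoursEnd.getTime() > now.getTime()) {
    return { kind: "defer", until: input.quietHoursEnd };
  }
  return { kind: "send" };
}
```

Returning a tagged union rather than a boolean lets the caller persist the suppression reason or the defer timestamp on the Notification row without re-deriving it.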

5. Upstream / downstream context map

                           ┌─────────────────────────────────┐
                           │ tenant-service                  │  per-tenant theme,
                           │                                 │  branding, sender-IDs,
                           │                                 │  channel budget,
                           │                                 │  domain (DKIM/SPF/DMARC)
                           └────────────────┬────────────────┘
                                            │ tenant.settings.changed.v1
                                            │ tenant.invitation.sent.v1
                                            │ tenant.domain.verified.v1
                                            ▼

┌──────────────────┐                                          ┌──────────────────────────┐
│ reservation-svc  │ confirmed/cancelled/modified/checked_in/ │ billing-service          │
│                  │ checked_out/dates_changed                │ invoice.generated/       │
└────────┬─────────┘                                          │ subscription.payment_    │
         │                                                    │ failed                   │
         │                                                    └─────────────┬────────────┘
┌────────▼─────────┐                                                        │
│ lock-integration │ key_credential.issued/revoked/expired                  │
│ service          │                                                        │
└────────┬─────────┘                                                        │
         │                                                                  │
┌────────▼─────────┐                                          ┌─────────────▼────────────┐
│ iam-service      │ password.reset_requested/                │ maintenance-service      │
│                  │ user.invited/                            │ work_order.assigned/     │
│                  │ session.suspicious_login                 │ vendor.notify_required   │
└────────┬─────────┘                                          └─────────────┬────────────┘
         │                                                                  │
         ▼                                                                  ▼
┌────────────────────────────────────────────────────────────────────────────────────────┐
│                                  notification-service                                  │
│  ┌────────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐           │
│  │ Trigger map    │ → │ Pref + supp  │ → │ Renderer     │ → │ Dispatcher   │ → Vendor  │
│  │                │   │ gate         │   │ (i18n+RTL+   │   │ (Port +      │           │
│  │                │   │              │   │ per-tenant   │   │ adapters)    │           │
│  │                │   │              │   │ branding)    │   │              │           │
│  └────────────────┘   └──────────────┘   └──────────────┘   └──────────────┘           │
│           ▲                                                        │                   │
│           │                                                        ▼                   │
│  ┌─────────────────┐                              ┌─────────────────────┐              │
│  │ ai-drafted_     │ ai_drafted_content_ready.v1  │ webhook ingest      │ ◀── vendor   │
│  │ content (HITL)  │ ◀─────────────────────────── │ (HMAC-validated)    │     callbacks│
│  └─────────────────┘ from ai-orchestrator-service └─────────────────────┘              │
└────────────────────────────────────────────────────────────────────────────────────────┘
                                            │
                                            ▼
            ┌───────────────────────────────────────────────────────────────┐
            │ Vendors: SendGrid · Mailgun · Twilio · WhatsApp BSP / Meta    │
            │ Cloud API · FCM · APNs · Web Push (VAPID) · Voice             │
            └───────────────────────────────────────────────────────────────┘
                                            │
                                            ▼
                   Recipients (guests, staff, vendors, tenant admins)

6. Booking-confirmation notification — ASCII sequence

The most-trafficked flow: a guest confirms a booking, and we send a confirmation in their language with the tenant's branding.

reservation-svc           notification-svc                    ai-orchestrator-svc         channel-adapter           Vendor (e.g., Twilio)
│                         │                                   │                           │                         │
│ reservation.confirmed.v1│                                   │                           │                         │
├────────────────────────▶│ on event:                         │                           │                         │
│                         │ 1. trigger-map lookup             │                           │                         │
│                         │ 2. resolve recipient              │                           │                         │
│                         │ 3. pref+supp gate                 │                           │                         │
│                         │ 4. select TemplateVersion         │                           │                         │
│                         │    (tenant override?)             │                           │                         │
│                         │ 5. is AI personalisation          │                           │                         │
│                         │    enabled for tenant?            │                           │                         │
│                         │    YES ──────────────────────────▶│ POST /draft (HITL gated)  │                         │
│                         │                                   │ returns rendered body     │                         │
│                         │                                   │ + AIProvenance            │                         │
│                         │◀──────────────────────────────────│                           │                         │
│                         │    NO → render via Handlebars + MJML deterministically        │                         │
│                         │ 6. resolve sender-ID for          │                           │                         │
│                         │    +93 (AF) → 'GHASI' or LC       │                           │                         │
│                         │ 7. rate-limit check               │                           │                         │
│                         │ 8. persist Notification + outbox  │                           │                         │
│                         │    ──── melmastoon.notification.requested.v1                  │                         │
│                         │                                   │                           │                         │
│                         │ 9. dispatch via Port ────────────────────────────────────────▶│ Twilio API call         │
│                         │                                   │                           │────────────────────────▶│
│                         │                                   │                           │◀── 202 + sid ───────────│
│                         │    ──── melmastoon.notification.dispatched.v1                 │                         │
│                         │                                   │                           │                         │
│                         │◀── webhook /webhooks/vendors/twilio (HMAC-signed) ────────────┴── status=delivered ─────│
│                         │ 10. apply to Notification: state=delivered                    │                         │
│                         │    ──── melmastoon.notification.delivered.v1                  │                         │
│                         │                                   │                           │                         │
│                         │ (T-24h scheduled): pre-arrival reminder                       │                         │
│                         │ (T+24h scheduled): post-stay thank-you + invoice link         │                         │
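
The HMAC-signed webhook in the sequence above needs a verification step before any state is applied. The sketch below shows the general technique using Node's built-in crypto module, assuming a hex-encoded HMAC-SHA256 over the raw request body. That scheme is an assumption for illustration: each vendor defines its own signing input, algorithm, encoding, and header, so a real adapter must follow the vendor's documentation. `verifyWebhookSignature` is a hypothetical name.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: hex-encoded HMAC-SHA256 of the raw body with a shared secret.
function verifyWebhookSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const provided = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject unequal lengths first.
  if (provided.length !== expected.length) {
    return false;
  }
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(provided, expected);
}
```

Verification must run against the exact bytes received, before JSON parsing or any re-serialisation, since either can change the payload and invalidate the signature.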

Compensation paths:

  • Recipient suppressed (hard-bounce on file) → Notification enters suppressed; emit melmastoon.notification.suppressed.v1; no dispatch attempt.
  • Vendor down → after 3 consecutive 5xx, fall back to secondary vendor for high-priority categories; for low-priority (marketing) defer to retry queue with exponential backoff.
  • WhatsApp template approval pending → fall back to SMS for transactional categories; for marketing, defer to template.approval_pending queue and alert tenant admin.
  • Sender-ID missing for the recipient country → reject the dispatch with MELMASTOON.NOTIFICATION.SENDER_ID_MISSING; emit melmastoon.notification.failed.v1; alert tenant admin in backoffice.

7. Key invariants enforced in the domain layer

  1. No cross-tenant references. Every aggregate carries a TenantId value object; the constructor refuses missing or mismatched values. Platform-global templates are the only aggregates that carry tenantId = null, and they are constructed via a separate PlatformTemplate.create() factory. (MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE)
  2. A Notification cannot dispatch without a resolved Sender. SMS/WhatsApp dispatches require a senderId present in Channel.credentials; email requires a verified DKIM-signed from domain. (MELMASTOON.NOTIFICATION.SENDER_ID_MISSING)
  3. Regulated categories cannot be opted out. RecipientPreferences.channels.security.email|sms always evaluates to instant, and compliance.email|inapp always evaluates to instant. The constructor rejects updates that would set them to off. (MELMASTOON.NOTIFICATION.CHANNEL_DISABLED is only raised for non-regulated categories.)
  4. A Notification's templateVersion is pinned at enqueue. Once the row is written, even if a new template version is published, the in-flight notification renders the snapshot. Re-render is a new Notification.
  5. State transitions follow the declared graph only (DOMAIN_MODEL §3). Re-sending a delivered notification creates a new sibling row rather than rewinding the state machine.
  6. attempts.length <= 6 for non-inapp channels. inapp has no retries (WebSocket deliver-or-drop, with reconciliation on next connect).
  7. Suppression is tenant-scoped. A hard-bounce for guest@example.com against tenant A does not suppress the same address for tenant B (different sending domain, different reputation).
  8. AI-drafted templates require HITL before activation. A TemplateVersion whose source = 'ai_drafted' cannot transition draft → active without an approvedBy actor and approvedAt timestamp captured. (MELMASTOON.AI.HITL_REQUIRED)
  9. OCC version checked on every preference save. (MELMASTOON.GENERAL.PRECONDITION_FAILED)
  10. Webhook idempotency. Vendor webhook ingestion deduplicates by (vendor, vendor_message_id, event_type) for 7 days.
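
Invariant 10 (deduplicate by (vendor, vendor_message_id, event_type) for 7 days) can be illustrated with an in-memory sketch. The key tuple and the 7-day window come from this document; the `WebhookDeduplicator` class and its Map-based storage are hypothetical. A deployed service would need shared storage, for example a key with a TTL in Redis or a unique index in Postgres, and which of those is used here is not stated.

```typescript
const DEDUPE_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

class WebhookDeduplicator {
  // Dedupe key -> epoch milliseconds at which the entry expires.
  private readonly seen = new Map<string, number>();

  // Returns true if this event was already seen inside the window;
  // otherwise records it and returns false.
  isDuplicate(vendor: string, vendorMessageId: string, eventType: string, nowMs: number): boolean {
    // JSON-encode the tuple so separator characters inside ids cannot collide.
    const key = JSON.stringify([vendor, vendorMessageId, eventType]);
    const expiresAt = this.seen.get(key);
    if (expiresAt !== undefined && expiresAt > nowMs) {
      return true;
    }
    this.seen.set(key, nowMs + DEDUPE_WINDOW_MS);
    return false;
  }
}
```

Including the event type in the key matters because one vendor message legitimately produces several callbacks (for example sent, then delivered), and each must be applied to the state machine once.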

8. Hot read paths

| Read | Frequency | Caching strategy |
|---|---|---|
| Active TemplateVersion for (tenant?, key, locale) | per dispatch (very high) | Memorystore key `notif:tpl:<tenantId |
| RecipientPreferences for (tenantId, recipientId) | per dispatch | Memorystore (TTL 10 min, invalidated on preferences.updated.v1) |
| Suppression check for (tenantId, channel, addressHash) | per dispatch | Memorystore set membership (TTL 1 min) backed by Postgres suppression_records |
| Channel config for (tenantId, channelKind) | per dispatch | Memorystore (TTL 5 min, invalidated on channel.health_changed.v1 / channel CRUD) |
| Notification feed for an in-app recipient | UI poll every 30 s + WS push | Memorystore key notif:feed:<tenantId>:<userId> LRU 100 items, TTL 60 s |
| Notification audit lookup by (tenantId, sourceEventId) | low (admin) | Postgres indexed lookup; no cache |
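
The suppression check keyed on (tenantId, channel, addressHash) can be sketched as follows. The key components come from the table above; everything else is an assumption made for illustration: SHA-256 as the hash, trim-and-lowercase as the normalisation (reasonable for email, while phone numbers would need their own normalisation), the colon-joined key layout, and the `SuppressionIndex` class standing in for the Memorystore set.

```typescript
import { createHash } from "node:crypto";

type SuppressibleChannel = "email" | "sms" | "whatsapp";

// Assumed normalisation and hash; the service's actual address-hash scheme is not specified here.
function addressHash(address: string): string {
  return createHash("sha256").update(address.trim().toLowerCase(), "utf8").digest("hex");
}

// Hypothetical key layout; the tenant id is part of the key, so suppression stays tenant-scoped.
function suppressionKey(tenantId: string, channel: SuppressibleChannel, address: string): string {
  return `${tenantId}:${channel}:${addressHash(address)}`;
}

// In-memory stand-in for the cached set membership check.
class SuppressionIndex {
  private readonly keys = new Set<string>();

  add(tenantId: string, channel: SuppressibleChannel, address: string): void {
    this.keys.add(suppressionKey(tenantId, channel, address));
  }

  has(tenantId: string, channel: SuppressibleChannel, address: string): boolean {
    return this.keys.has(suppressionKey(tenantId, channel, address));
  }
}
```

Because the tenant id is part of the key, a hard bounce recorded for tenant A never matches a lookup made on behalf of tenant B, and hashing keeps raw addresses out of the cache.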

9. Cost & scale envelope

| Dimension | Target |
|---|---|
| Notifications per active tenant per day | 10 (smallest guesthouse) → 5,000 (large chain property + marketing blast) |
| Platform-wide steady state (Phase 2, 200 tenants) | ~500K notifications/day, peak 50/s |
| Template render p99 | < 30 ms (cached AST, warm Redis) |
| Dispatch p99 (excluding vendor latency) | < 100 ms |
| Webhook ingest p99 | < 200 ms (HMAC verify + persist + state apply) |
| Cloud Run min replicas (API) | 3 |
| Cloud Run min replicas (dispatch worker per channel) | 2 (email, SMS, WhatsApp, push share a pool with channel-affinity) |
| Cloud Run min replicas (scheduler worker) | 2 |
| Cloud Run min replicas (webhook ingest) | 1 (autoscale on RPS) |
| Cloud SQL Postgres CPU | shared with PMS-core services on the regional HA instance |
| delivery_attempts partitioned by month, 24-month rolling on hot tier | ~15M rows/month at Phase 2 |
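
As a quick sanity check on how the Phase 2 figures relate, the arithmetic below derives the average rate and the monthly row count from the 500K/day target. It assumes a 30-day month and roughly one delivery attempt per notification; both are simplifications, since retries add rows.

```typescript
const notificationsPerDay = 500_000; // Phase 2 steady-state target from the table
const secondsPerDay = 24 * 60 * 60; // 86,400

// Average rate across the day: about 5.79 notifications per second.
const averagePerSecond = notificationsPerDay / secondsPerDay;

// The stated peak of 50/s is therefore about 8.64 times the daily average.
const peakToAverageRatio = 50 / averagePerSecond;

// Assuming a 30-day month and about one attempt per notification: 15,000,000 rows.
const attemptRowsPerMonth = notificationsPerDay * 30;
```

The gap between the average and the peak is what the min-replica and autoscaling rows have to absorb.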

10. Decision log (anchors)

  • Why a separate service rather than a library: it spares every upstream service from duplicating preference, suppression, branding, audit, and adapter logic; it centralises vendor SDKs (and their secrets) in one runtime; and it gives one place to enforce per-market sender-ID rules.
  • Why we accept events rather than expose a write API as the primary path: most notifications are reactions to domain facts, and consuming the events decouples the notification subsystem from upstream service runtimes (a reservation-service outage doesn't lose us a confirmation; we replay events). An ad-hoc API exists for staff-initiated and AI-drafted sends.
  • Why we do not call AI models directly: every AI call must flow through ai-orchestrator-service for routing, moderation, budget, HITL, and AIProvenance per 02 §11. We receive pre-rendered content via an event; we never embed model SDKs.
  • Why SMS-first for AF/PK/IR/TJ: feature-phone reach. WhatsApp is preferred when the recipient has a known WA number, but transactional flows fall back to SMS automatically.
  • Why per-tenant suppression scoping: sender domain reputation is per-tenant. Tenants with verified domains have their own DKIM-signed sender; a hard-bounce affects their reputation, not the platform's.
  • Why 7-year retention for regulated categories: financial receipts, KYC notices, mobile-key issuance audit, and dunning sequences are subject to local regulator audit windows in PK/UAE and to tenant-side disputes; 7 years exceeds every relevant regime.

11. What this service depends on (libraries, ports, infrastructure)

  • NestJS for presentation + DI composition root (out of the domain layer).
  • Drizzle ORM for Postgres access in the infrastructure layer.
  • @google-cloud/pubsub for outbox publishing and consumed-event subscription.
  • @google-cloud/storage for storing rendered email HTML bodies (hot 30 d) and inbound webhook raw payloads (audit 90 d).
  • Memorystore (Redis) for template AST cache, preferences cache, suppression set, rate counters, in-app feed.
  • Handlebars (sandboxed) for variable interpolation; MJML for email layout → HTML; html-to-text for plain-text fallback; mjml-rtl for RTL flipping.
  • Vendor SDKs behind adapters: @sendgrid/mail, mailgun.js, twilio, whatsapp-business, firebase-admin, node-apn, web-push.
  • Ports the application layer depends on (interfaces only):
    • NotificationRepository
    • TemplateRepository
    • RecipientRepository
    • RecipientPreferencesRepository
    • SuppressionRepository
    • ChannelConfigRepository
    • WebhookInboundRepository
    • EventPublisher (outbox-backed)
    • EmailPort, SmsPort, WhatsAppPort, PushPort, InAppPort, VoicePort (one per channel kind, one adapter per vendor)
    • Clock, IdGenerator, Hasher, HmacVerifier
    • AIClient (calls ai-orchestrator-service only for HITL approval surface; never for live model calls)
    • IdentityResolver (resolves recipient identity from iam-service/reservation-service)
    • TenantConfigClient (reads per-tenant theme, sender-IDs, domain, budget from tenant-service)
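
To illustrate "interfaces only", the sketch below shows what a channel port and an application-layer caller might look like. Only the port name SmsPort comes from the list above. The message and outcome shapes, the `StubSmsAdapter` class, and the `sendConfirmationSms` function are hypothetical, and the stub returns canned results instead of calling any vendor SDK.

```typescript
// Hypothetical message and outcome shapes for illustration.
interface SmsMessage {
  to: string;
  senderId: string;
  body: string;
  idempotencyKey: string;
}

type DispatchOutcome =
  | { ok: true; vendor: string; vendorMessageId: string }
  | { ok: false; vendor: string; errorCode: string };

// The port: the application layer sees only this interface.
interface SmsPort {
  send(message: SmsMessage): Promise<DispatchOutcome>;
}

// Stand-in adapter; a real one would wrap a vendor SDK in the infrastructure layer.
class StubSmsAdapter implements SmsPort {
  private readonly vendor: string;

  constructor(vendor: string) {
    this.vendor = vendor;
  }

  async send(message: SmsMessage): Promise<DispatchOutcome> {
    return { ok: true, vendor: this.vendor, vendorMessageId: `${this.vendor}-${message.idempotencyKey}` };
  }
}

// Application-layer code depends on SmsPort, never on a vendor type.
async function sendConfirmationSms(port: SmsPort, to: string): Promise<DispatchOutcome> {
  return port.send({ to, senderId: "GHASI", body: "Your booking is confirmed.", idempotencyKey: "ntf_example" });
}
```

Swapping `new StubSmsAdapter("twilio")` for an adapter backed by another vendor changes nothing in `sendConfirmationSms`, which is the property that keeps vendor SDKs out of upstream code.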

The domain layer depends on nothing outside @ghasi/domain-primitives and the standard library. CI fails the build on any framework or I/O import inside src/domain/.

12. References