
DEPLOYMENT_TOPOLOGY — notification-service

Sibling: SECURITY_MODEL · OBSERVABILITY · FAILURE_MODES · SERVICE_READINESS

Strategic anchors: 02 Enterprise Architecture §9 Deployment · ADR-0001 Core Architecture & Tech Stack

notification-service runs on Google Cloud Platform: Cloud Run for compute, Cloud SQL Postgres for state, Memorystore Redis for hot caches, Pub/Sub for the event bus, Cloud Storage for renderable assets, Secret Manager for vendor credentials, Cloud KMS for CMEK. The service is regional with active-passive multi-region for DR.


1. Environments

| Environment | GCP project | Regions | Domain |
| --- | --- | --- | --- |
| local | none (Docker Compose) | | localhost |
| dev | melmastoon-dev | asia-south1 | *.dev.melmastoon.com |
| staging | melmastoon-staging | asia-south1 (active) + me-central1 (warm) | *.stg.melmastoon.com |
| prod | melmastoon-prod | asia-south1 (active) + me-central1 (active for ME-residency tenants) + europe-west4 (warm DR) | *.melmastoon.com |

Tenant data residency is enforced by routing: each tenant is pinned to a region by tenants_local.region, and the Cloud LB selects the regional backend from an X-Tenant-Id lookup at the gateway.


2. Compute topology (per region, prod)

┌────────────────────────┐
│ Cloud Load Balancer    │  (global) + Cloud Armor + Cloud CDN (where applicable)
└────────────┬───────────┘
             │ TLS 1.3
             ▼
┌─────────────────────────────────────────────────────────┐
│ Regional Backend Service (per tenant region)            │
└────────────┬────────────────────────────────────────────┘
             │
             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Cloud Run services (separate revisions per role)                        │
│                                                                          │
│  notification-api (REST + WS)                minInst=3, maxInst=200      │
│  notification-router (Pub/Sub subscribers)   minInst=2, maxInst=100      │
│  notification-worker-email                   minInst=2, maxInst=80       │
│  notification-worker-sms                     minInst=2, maxInst=80       │
│  notification-worker-whatsapp                minInst=2, maxInst=60       │
│  notification-worker-push                    minInst=1, maxInst=40       │
│  notification-worker-inapp                   minInst=2, maxInst=40       │
│  notification-worker-voice (phase 3)         minInst=0, maxInst=20       │
│  notification-scheduler                      minInst=1, maxInst=10       │
│  notification-outbox-relay                   minInst=2, maxInst=10       │
│  notification-webhook-correlator             minInst=1, maxInst=10       │
│  notification-channel-prober                 minInst=1, maxInst=4        │
│  notification-cache-warmer                   minInst=1, maxInst=4        │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┼──────────────────────┬─────────────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Cloud │ │ Memorystore │ │ Pub/Sub │ │ Secret Mgr │
│ SQL HA │ │ Redis HA │ │ topics │ │ │
└─────────┘ └──────────────┘ └──────────┘ └──────────────┘
│ │
▼ ▼
┌────────────┐ ┌──────────┐
│ Cloud KMS │ │ Cloud │
│ CMEK │ │ Storage │
└────────────┘ └──────────┘

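WebSocket traffic is served by the same Cloud Run service as REST (notification-api). Cloud Run session affinity keeps a WS client on the same instance on a best-effort basis, so clients must still be able to reconnect to a different instance.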


3. Cloud Run service inventory

Each role is its own service (own scaling, own SLO, own rollout). Common config:

  • Image: gcr.io/melmastoon-{env}/notification-service:{git-sha}.
  • Service account: notification-service-runtime@<project>.iam.gserviceaccount.com.
  • Network: VPC connector (notification-vpc-connector-{region}), egress through Cloud NAT with reserved egress IP per region.
  • Autoscaling: CPU=70 % target; concurrency=80 for API, 50 for workers (vendor latency dominated), 1 for outbox-relay (single-flight per partition).
  • Health: /api/v1/internal/health for liveness; /api/v1/internal/health?ready=true for readiness (checks DB pool + Redis ping + Pub/Sub publisher).
  • Startup CPU boost: enabled for notification-api and notification-router (cold-start sensitive).
  • Request timeout: 60 s API; 540 s workers (long Pub/Sub ack windows).
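
The liveness/readiness split above can be sketched as follows. This is a minimal illustration, not the service's actual handler: the function name check_health, the probe names, and the response body shape are assumptions made for the example. Only the two endpoints' behaviour (liveness always cheap, readiness checking DB pool, Redis, and the Pub/Sub publisher) comes from the list above.

```python
from typing import Callable, Dict, Tuple

# A probe returns True when its dependency is reachable (hypothetical type).
Probe = Callable[[], bool]


def check_health(probes: Dict[str, Probe], ready: bool) -> Tuple[int, dict]:
    """Return (HTTP status, body) for /api/v1/internal/health.

    ready=False models the liveness call: the process answers, nothing else
    is checked. ready=True models ?ready=true: every dependency probe runs
    and any failure (or exception) turns the response into a 503.
    """
    if not ready:
        return 200, {"status": "alive"}

    results: Dict[str, bool] = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises is treated the same as one that fails.
            results[name] = False

    ok = all(results.values())
    body = {"status": "ready" if ok else "not_ready", "checks": results}
    return (200 if ok else 503), body


if __name__ == "__main__":
    # Stub probes standing in for the DB pool, Redis ping and Pub/Sub publisher.
    stubs = {
        "db_pool": lambda: True,
        "redis": lambda: True,
        "pubsub_publisher": lambda: False,
    }
    print(check_health(stubs, ready=True))
```

Keeping liveness free of dependency checks means a Redis or Pub/Sub outage takes instances out of rotation through readiness rather than restarting them.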

| Service | Why separate? |
| --- | --- |
| notification-api | Synchronous HTTP/WS; needs low latency; serves staff |
| notification-router | Pub/Sub fan-in; bursty; isolated scaling |
| notification-worker-<channel> | Per-channel scaling (vendor latency varies); isolates blast radius (a sick SMS adapter doesn't take down email) |
| notification-scheduler | Single-flight semantics; small but periodic |
| notification-outbox-relay | Throughput-sensitive; isolated to prevent starvation |
| notification-webhook-correlator | Late-correlation pass; off the critical path |
| notification-channel-prober | Periodic; can be paused without downtime |
| notification-cache-warmer | Periodic; auxiliary |

4. Datastores

4.1 Cloud SQL Postgres 16

  • Instance: melmastoon-notification-{env}-{region}-pg (regional HA).
  • Tier: prod db-custom-8-32768; staging db-custom-4-16384; dev db-custom-2-8192.
  • Storage: SSD with auto-grow, starting at 500 GB prod / 100 GB staging.
  • Backups: automated daily + PITR 7 days; cross-region backup snapshots to DR region.
  • Maintenance window: Sunday 02:00 local region.
  • Connections: PgBouncer in transaction mode in front (notification-pgbouncer-{region} Cloud Run); max client connections 800.
  • Read replicas: 2 in primary region for analytics/read-only queries.
  • Extensions: pg_partman, pgcrypto, pg_stat_statements, uuid-ossp.
  • CMEK: per-region key in Cloud KMS (projects/.../keys/notification-data-{region}).

4.2 Memorystore Redis

  • Tier: STANDARD_HA (3 nodes); 6 GB prod, 1 GB staging, 256 MB dev.
  • Network: private VPC; TLS in transit.
  • Eviction: allkeys-lru.
  • Used for: rate-limit counters, suppression set, template cache, trigger-map cache, channel cache, WS routing.

4.3 Cloud Storage (GCS)

  • Bucket per env: melmastoon-notifications-{env} (dual-region in prod for ME residency).
  • CMEK; uniform bucket-level access; signed URLs only for staff downloads.
  • Lifecycle policies in DATA_MODEL §5.

4.4 Pub/Sub

  • Topics: one per published subject (melmastoon.notification.requested.v1, etc.). Schema-validated: an Avro schema attached to the topic where supported; otherwise JSON Schema enforced at the producer.
  • Subscriptions: one per consumer × topic, with DLQ topic per consumer (melmastoon.dlq.notif.<consumer>).
  • Dead-letter policy: max 5 delivery attempts; min ack deadline 30 s.
  • Ordering: enabled; ordering key per EVENT_SCHEMAS §1.
  • Message retention: 7 days operational topics; 30 days regulated.
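
As an illustration of how the subscription settings above fit together, the sketch below derives one consumer's settings from the listed values. The function name, the dictionary keys, and the "{topic}.{consumer}" subscription naming are assumptions for the example. The DLQ topic pattern, the 5-attempt limit, the 30 s ack deadline, ordering, and the retention periods are taken from the list.

```python
def subscription_config(consumer: str, topic: str, regulated: bool = False) -> dict:
    """Build the settings for one consumer x topic subscription.

    The returned dict is a plain description, not a Pub/Sub client call;
    key names are illustrative only.
    """
    if not consumer or not topic:
        raise ValueError("consumer and topic must be non-empty")
    return {
        # Hypothetical naming: the section only says "one per consumer x topic".
        "subscription": f"{topic}.{consumer}",
        "topic": topic,
        # One DLQ topic per consumer, as listed above.
        "dead_letter_topic": f"melmastoon.dlq.notif.{consumer}",
        "max_delivery_attempts": 5,
        # The section gives 30 s as the minimum ack deadline; used here as-is.
        "ack_deadline_seconds": 30,
        "enable_message_ordering": True,
        # 7 days for operational topics, 30 days for regulated ones.
        "topic_retention_days": 30 if regulated else 7,
    }


if __name__ == "__main__":
    cfg = subscription_config("worker-email", "melmastoon.notification.requested.v1")
    for key, value in cfg.items():
        print(f"{key}: {value}")
```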

4.5 Secret Manager

  • Per-vendor: notification/{env}/vendor/{vendor}/{tenantId}/{credKind}.
  • Per-platform: notification/{env}/platform/{purpose} (e.g., HMAC keys, opt-out signing key).
  • Rotation policy in SECURITY_MODEL §6.
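
The two naming patterns above can be expressed as small helpers. This is a sketch under stated assumptions: the function names, the segment validation rule, and the example values ("twilio", "t-123", "api-key", "optout-signing-key") are hypothetical; only the path layouts come from the list.

```python
import re

# Assumption: each path segment is restricted to letters, digits, "_" and "-"
# so that a tenant or vendor value cannot inject extra "/" separators.
_SEGMENT = re.compile(r"^[A-Za-z0-9_-]+$")


def _checked(*segments: str) -> None:
    for segment in segments:
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid secret path segment: {segment!r}")


def vendor_secret_path(env: str, vendor: str, tenant_id: str, cred_kind: str) -> str:
    """notification/{env}/vendor/{vendor}/{tenantId}/{credKind}"""
    _checked(env, vendor, tenant_id, cred_kind)
    return f"notification/{env}/vendor/{vendor}/{tenant_id}/{cred_kind}"


def platform_secret_path(env: str, purpose: str) -> str:
    """notification/{env}/platform/{purpose}"""
    _checked(env, purpose)
    return f"notification/{env}/platform/{purpose}"


if __name__ == "__main__":
    print(vendor_secret_path("prod", "twilio", "t-123", "api-key"))
    print(platform_secret_path("prod", "optout-signing-key"))
```

Validating each segment keeps a per-tenant credential from being addressed under another tenant's prefix through a crafted identifier.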

5. Networking

  • VPC: shared VPC melmastoon-{env} with subnets per region (/22).
  • Connector: Serverless VPC Access connector per region (notification-vpc-connector-{region}).
  • Egress: Cloud NAT (melmastoon-nat-{region}) with reserved external IPs (one per region) — vendors allowlist these.
  • Ingress: Cloud LB → Cloud Armor → Cloud Run with private endpoints; WAF enabled.
  • DNS: notify.melmastoon.com (public) → Cloud LB; notification-internal.melmastoon.com (private) → internal LB for service-mesh callers.
  • mTLS: SPIFFE/SPIRE certs issued by iam-service; sidecar melm-mesh-agent runs in every Cloud Run revision for cert rotation.

6. Release & rollout

CI: GitHub Actions → Artifact Registry → Cloud Deploy.

Pipeline:

  1. dev continuous deploy on every merge to develop.
  2. staging continuous deploy on every merge to main.
  3. prod manual promotion via Cloud Deploy.

Rollout strategy in prod (per service):

  • Canary: 5 % traffic for 15 min.
  • Half: 50 % traffic for 30 min.
  • Full: 100 %.
  • Auto-rollback if any of:
    • Error rate > 1 % over 5 min on canary
    • p95 latency > 2× baseline
    • Synthetic post-deploy check fails (see TESTING_STRATEGY §13)
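
The auto-rollback conditions above amount to a simple predicate over canary metrics. The sketch below is one way to express it; the CanaryMetrics type, its field names, and the reason labels are assumptions for illustration, while the three thresholds are the ones listed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate: float             # fraction of failed requests over the 5 min window
    p95_latency_ms: float         # p95 latency observed on the new revision
    baseline_p95_ms: float        # p95 latency of the previous revision
    synthetic_check_passed: bool  # result of the post-deploy synthetic check


def rollback_reasons(m: CanaryMetrics) -> List[str]:
    """Return every rollback condition that is currently violated."""
    reasons: List[str] = []
    if m.error_rate > 0.01:                      # error rate > 1 %
        reasons.append("error_rate")
    if m.p95_latency_ms > 2 * m.baseline_p95_ms:  # p95 > 2x baseline
        reasons.append("p95_latency")
    if not m.synthetic_check_passed:
        reasons.append("synthetic_check")
    return reasons


def should_rollback(m: CanaryMetrics) -> bool:
    """Roll back if any single condition is violated."""
    return bool(rollback_reasons(m))


if __name__ == "__main__":
    healthy = CanaryMetrics(0.002, 180.0, 150.0, True)
    slow = CanaryMetrics(0.002, 400.0, 150.0, True)
    print(should_rollback(healthy), rollback_reasons(slow))
```

Returning the list of violated conditions, rather than a bare boolean, gives the rollout log a reason to record alongside the rollback.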

Database migrations:

  • Applied before the new revision rolls out.
  • Forward-only; every schema change stays backwards-compatible for at least one revision (i.e., the prior revision must be able to read and write against the new schema).
  • Two-phase code+schema changes (drop column, rename column) span ≥ 2 releases per MIGRATION_PLAN.

Feature flags:

  • LaunchDarkly with platform-managed keys.
  • Notable flags: notifications.ai.enabled, notifications.whatsapp.enabled, notifications.voice.enabled (default off until phase 3), notifications.batch.enabled, notifications.sync.feed.enabled.

7. Capacity targets and scale envelope

| Capacity (prod, asia-south1) | Target | Burst |
| --- | --- | --- |
| API requests | 1 000 rps sustained | 5 000 rps for 60 s |
| Pub/Sub fan-in | 5 000 messages/s | 20 000 messages/s |
| Outbound dispatches | 3 000 sends/s aggregate | 12 000 sends/s |
| Webhook ingestion | 2 000 req/s/vendor | 10 000 req/s/vendor (Cloud Armor caps) |
| WS connections | 50 000 concurrent | 100 000 |
| Postgres TPS | 4 000 TPS | 12 000 TPS |
| Memorystore ops | 50 000 ops/s | 200 000 ops/s |

Scale tests in TESTING_STRATEGY §8 validate these before each major release.
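
A rough sanity check of the API row against the Cloud Run settings (concurrency=80, maxInst=200 for notification-api) can be done with Little's law: requests in flight equal arrival rate times mean latency. The sketch below is only a back-of-the-envelope aid. The 200 ms mean latency is an assumed figure, not a number from this document, and the function name is hypothetical.

```python
def required_instances(rps: int, mean_latency_ms: int, concurrency: int) -> int:
    """Instances needed so that in-flight requests fit within per-instance concurrency.

    in_flight = rps * latency (Little's law); instances = ceil(in_flight / concurrency).
    Integer arithmetic is used throughout to avoid float rounding surprises.
    """
    if rps < 0 or mean_latency_ms < 0 or concurrency <= 0:
        raise ValueError("rps and latency must be >= 0 and concurrency > 0")
    numerator = rps * mean_latency_ms   # in-flight requests, scaled by 1000
    denominator = 1000 * concurrency
    return (numerator + denominator - 1) // denominator  # ceiling division


if __name__ == "__main__":
    ASSUMED_LATENCY_MS = 200   # assumption for illustration only
    API_CONCURRENCY = 80       # from the Cloud Run common config
    API_MAX_INSTANCES = 200    # from the compute topology

    for label, rps in (("sustained", 1000), ("burst", 5000)):
        n = required_instances(rps, ASSUMED_LATENCY_MS, API_CONCURRENCY)
        print(f"{label}: {rps} rps -> {n} instances (max {API_MAX_INSTANCES})")
```

Under that latency assumption, 1 000 rps needs 3 instances and the 5 000 rps burst needs 13, both well inside maxInst=200. The estimate ignores CPU-based scaling and long-lived WebSocket connections, which occupy concurrency slots for their full lifetime.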


8. Disaster recovery

  • RPO: 5 minutes (PITR + warm cross-region replica).
  • RTO: 30 minutes for tenant-region active-active; 2 hours for cold DR region.
  • Strategy: Cloud SQL cross-region read replica in DR region; promoted manually during DR. Pub/Sub topics are global; subscriptions per region. GCS dual-region bucket avoids data loss for rendered/webhook stores.
  • DR drill: quarterly tabletop + annual live failover in staging.
  • Runbook: FAILURE_MODES §10.

9. Cost guardrails

  • Per-environment budget alerts (50 %, 80 %, 100 %) on the GCP billing account.
  • Per-tenant cost dashboards (vendor + GCP-attributed) — see OBSERVABILITY §10.
  • Cloud Run min instances are tuned per environment to balance cold-start latency and cost; dev/staging use minInst=0 for non-API services.

10. Cross-region routing example

Guest reservation in Kabul (tenant region asia-south1):
Browser → notify.melmastoon.com (global LB)
→ routed to backend service "notification-api-asia-south1"
→ Cloud Run notification-api in asia-south1
→ reads/writes Cloud SQL asia-south1
→ publishes events to Pub/Sub topic (global)
→ consumer subscriptions per region; routing rule keeps tenant traffic regional

Guest reservation in Saudi Arabia (tenant region me-central1):
Same global LB; routed to "notification-api-me-central1"; isolated stack

The routing decision uses a Cloud Armor edge rule populated from tenants_local.region. A request that arrives at a region other than the tenant's pinned region is rejected with HTTP 451 to enforce data residency.
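
The routing and residency check can be sketched as two small functions. This is an illustration only: the in-memory mapping stands in for tenants_local.region, the tenant IDs are made up, and the function names are hypothetical. The backend naming ("notification-api-{region}") and the 451 response follow the example above.

```python
from typing import Dict

# Hypothetical stand-in for the tenants_local.region lookup.
TENANT_REGIONS: Dict[str, str] = {
    "tenant-kabul": "asia-south1",
    "tenant-riyadh": "me-central1",
}


def backend_for(tenant_id: str, tenant_regions: Dict[str, str]) -> str:
    """Name of the regional backend service that should serve this tenant."""
    region = tenant_regions[tenant_id]  # KeyError for unknown tenants (assumption)
    return f"notification-api-{region}"


def residency_status(tenant_id: str, serving_region: str,
                     tenant_regions: Dict[str, str]) -> int:
    """HTTP status for a request that landed in serving_region.

    200 when the request is in the tenant's pinned region, 451 otherwise.
    """
    if tenant_regions[tenant_id] != serving_region:
        return 451
    return 200


if __name__ == "__main__":
    print(backend_for("tenant-kabul", TENANT_REGIONS))
    print(residency_status("tenant-riyadh", "asia-south1", TENANT_REGIONS))
```

How unknown tenants are handled is not specified in this section; raising KeyError here is a placeholder choice.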


11. Bootstrapping a new region

When the platform adds a new tenant region:

  1. Provision Cloud SQL HA, Memorystore, Pub/Sub subscriptions, Secret Manager replicas, GCS bucket (regional or dual-region per residency rule).
  2. Stand up Cloud Run services (Terraform module notification-service parameterised on region).
  3. Issue SPIFFE identities for the new region's service accounts.
  4. Configure Cloud LB backend service + edge routing rule for the new tenant region.
  5. Run smoke + synthetic from TESTING_STRATEGY §13.
  6. Update SERVICE_READINESS regional gates.

12. IaC

All infrastructure is Terraform (infra/terraform/notification-service/):

  • main.tf (modules: cloud_run, cloud_sql, memorystore, pubsub, secret_manager, cloud_storage, cloud_kms).
  • vars/{env}.tfvars per environment.
  • pipelines/cloud-deploy/{env} for the rollout pipeline definitions.
  • Drift detection runs nightly; PRs required to apply.