DEPLOYMENT_TOPOLOGY — notification-service

Sibling: SECURITY_MODEL · OBSERVABILITY · FAILURE_MODES · SERVICE_READINESS

Strategic anchors: 02 Enterprise Architecture §9 Deployment · ADR-0001 Core Architecture & Tech Stack

notification-service runs on Google Cloud Platform: Cloud Run for compute, Cloud SQL Postgres for state, Memorystore Redis for hot caches, Pub/Sub for the event bus, Cloud Storage for renderable assets, Secret Manager for vendor credentials, Cloud KMS for CMEK. The service is regional with active-passive multi-region for DR.

1. Environments

Environment	GCP project	Regions	Domain
`local`	(none — Docker Compose)	—	`localhost`
`dev`	`melmastoon-dev`	`asia-south1`	`*.dev.melmastoon.com`
`staging`	`melmastoon-staging`	`asia-south1` (active) + `me-central1` (warm)	`*.stg.melmastoon.com`
`prod`	`melmastoon-prod`	`asia-south1` (active) + `me-central1` (active for ME-residency tenants) + `europe-west4` (warm DR)	`*.melmastoon.com`

Tenant data residency is enforced by routing: each tenant is pinned to a region by tenants_local.region, and the Cloud LB selects the regional backend from an X-Tenant-Id lookup at the gateway.

2. Compute topology (per region, prod)

                  ┌────────────────────────┐
                  │ Cloud Load Balancer    │  (global) + Cloud Armor + Cloud CDN (where applicable)
                  └────────────┬───────────┘
                               │ TLS 1.3
                               ▼
                  ┌─────────────────────────────────────────────────────────┐
                  │ Regional Backend Service (per tenant region)            │
                  └────────────┬────────────────────────────────────────────┘
                               │
                               ▼
       ┌──────────────────────────────────────────────────────────────────────────┐
       │ Cloud Run services (one service per role)                                │
       │                                                                          │
       │   notification-api          (REST + WS)              minInst=3, maxInst=200
       │   notification-router       (Pub/Sub subscribers)    minInst=2, maxInst=100
       │   notification-worker-email                          minInst=2, maxInst=80
       │   notification-worker-sms                            minInst=2, maxInst=80
       │   notification-worker-whatsapp                       minInst=2, maxInst=60
       │   notification-worker-push                           minInst=1, maxInst=40
       │   notification-worker-inapp                          minInst=2, maxInst=40
       │   notification-worker-voice (phase 3)                minInst=0, maxInst=20
       │   notification-scheduler                             minInst=1, maxInst=10
       │   notification-outbox-relay                          minInst=2, maxInst=10
       │   notification-webhook-correlator                    minInst=1, maxInst=10
       │   notification-channel-prober                        minInst=1, maxInst=4
       │   notification-cache-warmer                          minInst=1, maxInst=4
       │                                                                          │
       └──────────────────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┬─────────────────────┐
        ▼                      ▼                      ▼                     ▼
   ┌─────────┐         ┌──────────────┐        ┌──────────┐          ┌──────────────┐
   │ Cloud   │         │ Memorystore  │        │  Pub/Sub │          │ Secret Mgr   │
   │ SQL HA  │         │ Redis HA     │        │  topics  │          │              │
   └─────────┘         └──────────────┘        └──────────┘          └──────────────┘
        │                                          │
        ▼                                          ▼
   ┌────────────┐                            ┌──────────┐
   │ Cloud KMS  │                            │ Cloud    │
   │ CMEK       │                            │ Storage  │
   └────────────┘                            └──────────┘

WebSocket traffic is served by the same Cloud Run service as REST (notification-api). WS connections are kept sticky via Cloud Run session affinity, which is best-effort rather than guaranteed.

3. Cloud Run service inventory

Each role is its own service (own scaling, own SLO, own rollout). Common config:

Image: gcr.io/melmastoon-{env}/notification-service:{git-sha}.
Service account: notification-service-runtime@<project>.iam.gserviceaccount.com.
Network: VPC connector (notification-vpc-connector-{region}), egress through Cloud NAT with reserved egress IP per region.
Autoscaling: CPU=70 % target; concurrency=80 for API, 50 for workers (vendor latency dominated), 1 for outbox-relay (single-flight per partition).
Health: /api/v1/internal/health for liveness; /api/v1/internal/health?ready=true for readiness (checks DB pool + Redis ping + Pub/Sub publisher).
Startup CPU boost: enabled for notification-api and notification-router (cold-start sensitive).
Request timeout: 60 s API; 540 s workers (long Pub/Sub ack windows).
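
The liveness/readiness split above can be sketched as follows. This is a minimal illustration, not the service's actual handler: the probe callables, the dependency names, and the choice of 503 for "not ready" are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dependency probes: each returns True when the dependency is usable.
Check = Callable[[], bool]


def liveness() -> Tuple[int, Dict[str, str]]:
    """Liveness only asserts that the process is up; it touches no dependency."""
    return 200, {"process": "ok"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency probe and return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a probe that raises counts as a failure
            results[name] = f"fail: {type(exc).__name__}"
    # Assumption: 503 signals "not ready" so the instance receives no traffic.
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

In this sketch the readiness probe would be called with one entry each for the DB pool, the Redis ping, and the Pub/Sub publisher. A single failing or raising probe is enough to report the instance as not ready.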

Service	Why separate?
`notification-api`	Synchronous HTTP/WS; needs low latency; serves staff
`notification-router`	Pub/Sub fan-in; bursty; isolated scaling
`notification-worker-<channel>`	Per-channel scaling (vendor latency varies); isolates blast radius (a sick SMS adapter doesn't take down email)
`notification-scheduler`	Single-flight semantics; small but periodic
`notification-outbox-relay`	Throughput-sensitive; isolated to prevent starvation
`notification-webhook-correlator`	Late-correlation pass; off the critical path
`notification-channel-prober`	Periodic; can be paused without downtime
`notification-cache-warmer`	Periodic; auxiliary

4. Datastores

4.1 Cloud SQL Postgres 16

Instance: melmastoon-notification-{env}-{region}-pg (regional HA).
Tier: prod db-custom-8-32768; staging db-custom-4-16384; dev db-custom-2-8192.
Storage: SSD with auto-grow, starting at 500 GB prod / 100 GB staging.
Backups: automated daily + PITR 7 days; cross-region backup snapshots to DR region.
Maintenance window: Sunday 02:00 local region.
Connections: PgBouncer in transaction mode in front (notification-pgbouncer-{region} Cloud Run); max client connections 800.
Read replicas: 2 in primary region for analytics/read-only queries.
Extensions: pg_partman, pgcrypto, pg_stat_statements, uuid-ossp.
CMEK: per-region key in Cloud KMS (projects/.../keys/notification-data-{region}).

4.2 Memorystore Redis

Tier: STANDARD_HA (3 nodes); 6 GB prod, 1 GB staging, 256 MB dev.
Network: private VPC; TLS in transit.
Eviction: allkeys-lru.
Used for: rate-limit counters, suppression set, template cache, trigger-map cache, channel cache, WS routing.

4.3 Cloud Storage (GCS)

Bucket per env: melmastoon-notifications-{env} (dual-region in prod for ME residency).
CMEK; uniform bucket-level access; signed URLs only for staff downloads.
Lifecycle policies in DATA_MODEL §5.

4.4 Pub/Sub

Topics: one per published subject (melmastoon.notification.requested.v1, etc.). Schema-validated: an Avro schema attached to the topic where supported; JSON Schema enforced at the producer otherwise.
Subscriptions: one per consumer × topic, with DLQ topic per consumer (melmastoon.dlq.notif.<consumer>).
Dead-letter policy: max 5 delivery attempts; min ack deadline 30 s.
Ordering: enabled; ordering key per EVENT_SCHEMAS §1.
Message retention: 7 days operational topics; 30 days regulated.
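
The dead-letter policy above can be modelled with a small sketch. Pub/Sub itself enforces the policy on the subscription; the code below only illustrates the expected outcome per delivery attempt. The consumer name used in the usage note and the outcome labels are hypothetical.

```python
MAX_DELIVERY_ATTEMPTS = 5  # from the dead-letter policy stated above


def dlq_topic(consumer: str) -> str:
    """DLQ topic name following the melmastoon.dlq.notif.<consumer> pattern."""
    return f"melmastoon.dlq.notif.{consumer}"


def outcome(delivery_attempt: int, handled_ok: bool) -> str:
    """Model what happens to a message after the given delivery attempt.

    delivery_attempt is 1-based, matching the counter Pub/Sub attaches to
    messages on subscriptions that have a dead-letter policy.
    """
    if handled_ok:
        return "acked"
    if delivery_attempt >= MAX_DELIVERY_ATTEMPTS:
        return "dead-lettered"  # forwarded to the consumer's DLQ topic
    return "redelivered"
```

For a consumer named worker-sms (an illustrative name), a message that fails on every attempt would be redelivered four times and land on melmastoon.dlq.notif.worker-sms after the fifth failure.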

4.5 Secret Manager

Per-vendor: notification/{env}/vendor/{vendor}/{tenantId}/{credKind}.
Per-platform: notification/{env}/platform/{purpose} (e.g., HMAC keys, opt-out signing key).
Rotation policy in SECURITY_MODEL §6.

5. Networking

VPC: shared VPC melmastoon-{env} with subnets per region (/22).
Connector: Serverless VPC Access connector per region (notification-vpc-connector-{region}).
Egress: Cloud NAT (melmastoon-nat-{region}) with reserved external IPs (one per region) — vendors allowlist these.
Ingress: Cloud LB → Cloud Armor → Cloud Run with private endpoints; WAF enabled.
DNS: notify.melmastoon.com (public) → Cloud LB; notification-internal.melmastoon.com (private) → internal LB for service-mesh callers.
mTLS: SPIFFE/SPIRE certs issued by iam-service; sidecar melm-mesh-agent runs in every Cloud Run revision for cert rotation.

6. Release & rollout

CI: GitHub Actions → Artifact Registry → Cloud Deploy.

Pipeline:

dev: continuous deploy on every merge to develop.
staging: continuous deploy on every merge to main.
prod: manual promotion via Cloud Deploy.

Rollout strategy in prod (per service):

Canary: 5 % traffic for 15 min.
Half: 50 % traffic for 30 min.
Full: 100 %.
Auto-rollback if any of:
- Error rate > 1 % over 5 min on canary
- p95 latency > 2× baseline
- Synthetic post-deploy check fails (see TESTING_STRATEGY §13)
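
The auto-rollback conditions above amount to a three-way OR, which can be sketched as a pure function. The CanaryWindow type and its field names are hypothetical; only the thresholds come from the list above. The real gate is expected to live in the Cloud Deploy pipeline, so treat this as an illustration of the rule rather than its implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryWindow:
    """Metrics observed for the new revision (hypothetical shape)."""
    error_rate: float             # fraction of failed requests over the last 5 min
    p95_latency_ms: float         # p95 latency of the new revision
    baseline_p95_ms: float        # p95 latency of the revision being replaced
    synthetic_check_passed: bool  # result of the synthetic post-deploy check


def rollback_reasons(w: CanaryWindow) -> List[str]:
    """Return every tripped condition; an empty list means the rollout may proceed."""
    reasons: List[str] = []
    if w.error_rate > 0.01:
        reasons.append("error rate > 1 %")
    if w.p95_latency_ms > 2 * w.baseline_p95_ms:
        reasons.append("p95 latency > 2x baseline")
    if not w.synthetic_check_passed:
        reasons.append("synthetic post-deploy check failed")
    return reasons
```

Returning the list of reasons instead of a bare boolean keeps the rollback decision explainable in the deploy log. Both comparisons are strict, so a canary sitting exactly on a threshold is not rolled back.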

Database migrations:

Applied before the new revision rolls out.
Forward-only and backwards-compatible for at least one revision (i.e., the prior revision must still be able to read and write against the new schema while the rollout is in progress).
Two-phase code+schema changes (drop column, rename column) span ≥ 2 releases per MIGRATION_PLAN.

Feature flags:

LaunchDarkly with platform-managed keys.
Notable flags: notifications.ai.enabled, notifications.whatsapp.enabled, notifications.voice.enabled (default off until phase 3), notifications.batch.enabled, notifications.sync.feed.enabled.

7. Capacity targets and scale envelope

Capacity (prod, asia-south1)	Target	Burst
API requests	1 000 rps sustained	5 000 rps for 60 s
Pub/Sub fan-in	5 000 messages/s	20 000 messages/s
Outbound dispatches	3 000 sends/s aggregate	12 000 sends/s
Webhook ingestion	2 000 req/s/vendor	10 000 req/s/vendor (Cloud Armor caps)
WS connections	50 000 concurrent	100 000
Postgres TPS	4 000 TPS	12 000 TPS
Memorystore ops	50 000 ops/s	200 000 ops/s

Scale tests in TESTING_STRATEGY §8 validate these before each major release.
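
As a rough cross-check of these targets against the Cloud Run limits in §2 and §3, instance counts can be estimated with Little's law: requests in flight = arrival rate × latency. The sketch below is a back-of-the-envelope aid only. The 200 ms mean latency is an assumed figure, not a measured one, and the CPU-based autoscaling target is ignored.

```python
import math


def instances_needed(rps: float, latency_ms: float, concurrency: int,
                     min_inst: int, max_inst: int) -> int:
    """Estimate instance count from request rate, latency, and per-instance concurrency."""
    in_flight = rps * latency_ms / 1000       # Little's law: L = lambda * W
    raw = math.ceil(in_flight / concurrency)  # instances needed to hold the in-flight requests
    return max(min_inst, min(raw, max_inst))  # clamp to the configured min/max


# notification-api: concurrency=80, minInst=3, maxInst=200; 200 ms latency is assumed.
sustained = instances_needed(1000, 200, 80, 3, 200)  # 200 in flight / 80 -> 3
burst = instances_needed(5000, 200, 80, 3, 200)      # 1000 in flight / 80 -> 13
```

Under that latency assumption the sustained API target fits within the minimum instance count and the burst target stays far below maxInst. Higher latencies shift both numbers proportionally.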

8. Disaster recovery

RPO: 5 minutes (PITR + warm cross-region replica).
RTO: 30 minutes for tenant-region active-active; 2 hours for the DR region (europe-west4 in prod, listed as warm DR in §1).
Strategy: Cloud SQL cross-region read replica in DR region; promoted manually during DR. Pub/Sub topics are global; subscriptions per region. GCS dual-region bucket avoids data loss for rendered/webhook stores.
DR drill: quarterly tabletop + annual live failover in staging.
Runbook: FAILURE_MODES §10.

9. Cost guardrails

Per-environment budget alerts (50 %, 80 %, 100 %) on the GCP billing account.
Per-tenant cost dashboards (vendor + GCP-attributed) — see OBSERVABILITY §10.
Cloud Run min instances are tuned per environment to balance cold-start latency and cost; dev/staging use minInst=0 for non-API services.

10. Cross-region routing example

Guest reservation in Kabul (tenant region asia-south1):
  Browser → notify.melmastoon.com (global LB)
        → routed to backend service "notification-api-asia-south1"
        → Cloud Run notification-api in asia-south1
        → reads/writes Cloud SQL asia-south1
        → publishes events to Pub/Sub topic (global)
        → consumer subscriptions per region; routing rule keeps tenant traffic regional

Guest reservation in Saudi Arabia (tenant region me-central1):
  Same global LB; routed to "notification-api-me-central1"; isolated stack

The routing decision uses a Cloud Armor edge rule populated from tenants_local.region; mismatches return 451 to enforce data residency.
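
The routing decision can be sketched as a lookup followed by a residency check. This is an illustration of the rule only: the tenant IDs, the in-memory map standing in for tenants_local.region, and the 404 for unknown tenants are assumptions. The 451 response and the backend naming pattern come from the example above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical snapshot of tenants_local.region, keyed by tenant ID.
TENANT_REGION: Dict[str, str] = {
    "tenant-kabul": "asia-south1",
    "tenant-riyadh": "me-central1",
}


def route(tenant_id: str, serving_region: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, backend service name) for a request reaching serving_region."""
    pinned = TENANT_REGION.get(tenant_id)
    if pinned is None:
        return 404, None  # assumption: unknown tenants are rejected outright
    if pinned != serving_region:
        return 451, None  # residency mismatch, as described above
    return 200, f"notification-api-{pinned}"
```

A request for tenant-riyadh that reaches the asia-south1 stack is refused with 451 rather than proxied across regions, which keeps tenant data from leaving its pinned region.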

11. Bootstrapping a new region

When the platform adds a new tenant region:

Provision Cloud SQL HA, Memorystore, Pub/Sub subscriptions, Secret Manager replicas, GCS bucket (regional or dual-region per residency rule).
Stand up Cloud Run services (Terraform module notification-service parameterised on region).
Issue SPIFFE identities for the new region's service accounts.
Configure Cloud LB backend service + edge routing rule for the new tenant region.
Run smoke + synthetic from TESTING_STRATEGY §13.
Update SERVICE_READINESS regional gates.
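
When preparing the first step, it can help to list the resource names a new region implies. The sketch below only formats the naming patterns that appear elsewhere in this document; the function name and dictionary keys are hypothetical. It does not validate the results against GCP naming limits, and the Terraform module remains the source of truth.

```python
from typing import Dict


def region_resources(env: str, region: str) -> Dict[str, str]:
    """Derive per-region resource names from the patterns used in this document."""
    return {
        "cloud_sql_instance": f"melmastoon-notification-{env}-{region}-pg",  # section 4.1
        "pgbouncer_service": f"notification-pgbouncer-{region}",             # section 4.1
        "cmek_key_name": f"notification-data-{region}",                      # section 4.1, short key name only
        "gcs_bucket": f"melmastoon-notifications-{env}",                     # section 4.3, per env not per region
        "vpc_connector": f"notification-vpc-connector-{region}",             # section 5
        "cloud_nat": f"melmastoon-nat-{region}",                             # section 5
        "lb_backend_service": f"notification-api-{region}",                  # section 10
    }
```

Calling region_resources("prod", "me-central1") yields, for example, the Cloud SQL instance name melmastoon-notification-prod-me-central1-pg.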

12. IaC

All infrastructure is Terraform (infra/terraform/notification-service/):

main.tf (modules: cloud_run, cloud_sql, memorystore, pubsub, secret_manager, cloud_storage, cloud_kms).
vars/{env}.tfvars per environment.
pipelines/cloud-deploy/{env} for the rollout pipeline definitions.
Drift detection runs nightly; PRs required to apply.
