DEPLOYMENT_TOPOLOGY — notification-service
Sibling: SECURITY_MODEL · OBSERVABILITY · FAILURE_MODES · SERVICE_READINESS
Strategic anchors: 02 Enterprise Architecture §9 Deployment · ADR-0001 Core Architecture & Tech Stack
notification-service runs on Google Cloud Platform: Cloud Run for compute, Cloud SQL Postgres for state, Memorystore Redis for hot caches, Pub/Sub for the event bus, Cloud Storage for renderable assets, Secret Manager for vendor credentials, Cloud KMS for CMEK. The service is regional with active-passive multi-region for DR.
1. Environments
| Environment | GCP project | Regions | Domain |
|---|---|---|---|
| local | (none; Docker Compose) | n/a | localhost |
| dev | melmastoon-dev | asia-south1 | *.dev.melmastoon.com |
| staging | melmastoon-staging | asia-south1 (active) + me-central1 (warm) | *.stg.melmastoon.com |
| prod | melmastoon-prod | asia-south1 (active) + me-central1 (active for ME-residency tenants) + europe-west4 (warm DR) | *.melmastoon.com |
Tenant data residency is enforced by routing: each tenant is pinned to a region by its tenants_local.region value, and the Cloud LB selects the regional backend from an X-Tenant-Id lookup at the gateway.
2. Compute topology (per region, prod)
┌────────────────────────┐
│ Cloud Load Balancer │ (global) + Cloud Armor + Cloud CDN (where applicable)
└────────────┬───────────┘
│ TLS 1.3
▼
┌─────────────────────────────────────────────────────────┐
│ Regional Backend Service (per tenant region) │
└────────────┬────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Cloud Run services (one service per role) │
│ │
│ notification-api (REST + WS) minInst=3, maxInst=200
│ notification-router (Pub/Sub subscribers) minInst=2, maxInst=100
│ notification-worker-email minInst=2, maxInst=80
│ notification-worker-sms minInst=2, maxInst=80
│ notification-worker-whatsapp minInst=2, maxInst=60
│ notification-worker-push minInst=1, maxInst=40
│ notification-worker-inapp minInst=2, maxInst=40
│ notification-worker-voice (phase 3) minInst=0, maxInst=20
│ notification-scheduler minInst=1, maxInst=10
│ notification-outbox-relay minInst=2, maxInst=10
│ notification-webhook-correlator minInst=1, maxInst=10
│ notification-channel-prober minInst=1, maxInst=4
│ notification-cache-warmer minInst=1, maxInst=4
│ │
└──────────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┬─────────────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Cloud │ │ Memorystore │ │ Pub/Sub │ │ Secret Mgr │
│ SQL HA │ │ Redis HA │ │ topics │ │ │
└─────────┘ └──────────────┘ └──────────┘ └──────────────┘
│ │
▼ ▼
┌────────────┐ ┌──────────┐
│ Cloud KMS │ │ Cloud │
│ CMEK │ │ Storage │
└────────────┘ └──────────┘
WebSocket traffic is served by the same Cloud Run service as REST (notification-api); WS connections are kept sticky via Cloud Run session affinity.
3. Cloud Run service inventory
Each role is its own service (own scaling, own SLO, own rollout). Common config:
- Image: gcr.io/melmastoon-{env}/notification-service:{git-sha}.
- Service account: notification-service-runtime@<project>.iam.gserviceaccount.com.
- Network: VPC connector (notification-vpc-connector-{region}), egress through Cloud NAT with reserved egress IP per region.
- Autoscaling: CPU=70 % target; concurrency=80 for API, 50 for workers (vendor latency dominated), 1 for outbox-relay (single-flight per partition).
- Health: /api/v1/internal/health for liveness; /api/v1/internal/health?ready=true for readiness (checks DB pool + Redis ping + Pub/Sub publisher).
- Startup CPU boost: enabled for notification-api and notification-router (cold-start sensitive).
- Request timeout: 60 s API; 540 s workers (long Pub/Sub ack windows).
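The liveness/readiness split in the Health item can be sketched as follows. This is a minimal illustration, not the service's actual handler: the function name `health_response`, the check names, and the response body shape are all assumptions; only the three dependencies (DB pool, Redis ping, Pub/Sub publisher) come from the list above.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dependency probes; each returns True when the dependency is usable.
Check = Callable[[], bool]


def health_response(ready: bool, checks: Dict[str, Check]) -> Tuple[int, Dict[str, object]]:
    """Return (status_code, body) for the health endpoint.

    ready=False models the liveness probe: the process is up, nothing else is checked.
    ready=True models ?ready=true: every dependency check must pass.
    """
    if not ready:
        return 200, {"status": "ok"}
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = False
    all_ok = all(results.values())
    status = "ok" if all_ok else "unavailable"
    return (200 if all_ok else 503), {"status": status, "checks": results}


# Stand-in probes; real ones would touch the DB pool, Redis and the publisher.
checks = {
    "db_pool": lambda: True,
    "redis_ping": lambda: True,
    "pubsub_publisher": lambda: False,  # simulate a failing publisher
}
print(health_response(False, checks))  # liveness still passes
print(health_response(True, checks))   # readiness fails with 503
```

Keeping liveness free of dependency checks means a Redis or Pub/Sub outage takes instances out of rotation without causing them to be restarted.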
| Service | Why separate? |
|---|---|
| notification-api | Synchronous HTTP/WS; needs low latency; serves staff |
| notification-router | Pub/Sub fan-in; bursty; isolated scaling |
| notification-worker-<channel> | Per-channel scaling (vendor latency varies); isolates blast radius (a sick SMS adapter doesn't take down email) |
| notification-scheduler | Single-flight semantics; small but periodic |
| notification-outbox-relay | Throughput-sensitive; isolated to prevent starvation |
| notification-webhook-correlator | Late-correlation pass; off the critical path |
| notification-channel-prober | Periodic; can be paused without downtime |
| notification-cache-warmer | Periodic; auxiliary |
4. Datastores
4.1 Cloud SQL Postgres 16
- Instance: melmastoon-notification-{env}-{region}-pg (regional HA).
- Tier: prod db-custom-8-32768; staging db-custom-4-16384; dev db-custom-2-8192.
- Storage: SSD with auto-grow, starting at 500 GB prod / 100 GB staging.
- Backups: automated daily + PITR 7 days; cross-region backup snapshots to DR region.
- Maintenance window: Sunday 02:00 local region.
- Connections: PgBouncer in transaction mode in front (notification-pgbouncer-{region} Cloud Run); max client connections 800.
- Read replicas: 2 in primary region for analytics/read-only queries.
- Extensions: pg_partman, pgcrypto, pg_stat_statements, uuid-ossp.
- CMEK: per-region key in Cloud KMS (projects/.../keys/notification-data-{region}).
4.2 Memorystore Redis
- Tier: STANDARD_HA (3 nodes); 6 GB prod, 1 GB staging, 256 MB dev.
- Network: private VPC; TLS in transit.
- Eviction: allkeys-lru.
- Used for: rate-limit counters, suppression set, template cache, trigger-map cache, channel cache, WS routing.
4.3 Cloud Storage (GCS)
- Bucket per env: melmastoon-notifications-{env} (dual-region in prod for ME residency).
- CMEK; uniform bucket-level access; signed URLs only for staff downloads.
- Lifecycle policies in DATA_MODEL §5.
4.4 Pub/Sub
- Topics: per published subject (melmastoon.notification.requested.v1, etc.). Schema-validated (Avro JSON schema on the topic when supported; JSON Schema enforced at producer otherwise).
- Subscriptions: one per consumer × topic, with DLQ topic per consumer (melmastoon.dlq.notif.<consumer>).
- Dead-letter policy: max 5 delivery attempts; min ack deadline 30 s.
- Ordering: enabled; ordering key per EVENT_SCHEMAS §1.
- Message retention: 7 days operational topics; 30 days regulated.
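To make the subscription and dead-letter settings concrete, here is a small sketch that derives a per-consumer subscription spec from the values listed above. It is illustrative only: `SubscriptionSpec`, `subscription_for`, `should_dead_letter`, the subscription naming scheme, and the consumer name `worker-email` are assumptions; the DLQ topic pattern, the 5-attempt limit, the 30 s ack deadline, and ordering come from the list. The dead-letter check approximates Pub/Sub's behaviour rather than reproducing it exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubscriptionSpec:
    name: str
    topic: str
    dead_letter_topic: str
    max_delivery_attempts: int
    ack_deadline_seconds: int
    enable_message_ordering: bool


def subscription_for(consumer: str, topic: str) -> SubscriptionSpec:
    """Build the spec for one consumer x topic pair."""
    return SubscriptionSpec(
        # Hypothetical naming scheme; the source does not define subscription names.
        name=f"{topic}.{consumer}",
        topic=topic,
        dead_letter_topic=f"melmastoon.dlq.notif.{consumer}",
        max_delivery_attempts=5,
        ack_deadline_seconds=30,
        enable_message_ordering=True,
    )


def should_dead_letter(spec: SubscriptionSpec, failed_attempts: int) -> bool:
    """True once the message has used up its delivery attempts."""
    return failed_attempts >= spec.max_delivery_attempts


spec = subscription_for("worker-email", "melmastoon.notification.requested.v1")
print(spec.dead_letter_topic)       # melmastoon.dlq.notif.worker-email
print(should_dead_letter(spec, 4))  # False
print(should_dead_letter(spec, 5))  # True
```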
4.5 Secret Manager
- Per-vendor: notification/{env}/vendor/{vendor}/{tenantId}/{credKind}.
- Per-platform: notification/{env}/platform/{purpose} (e.g., HMAC keys, opt-out signing key).
- Rotation policy in SECURITY_MODEL §6.
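The two naming patterns above can be captured in a pair of helpers so that callers never assemble paths by hand. This is a sketch under assumptions: the function names and the example values (`example-vendor`, `t-123`, `api-key`, `hmac-key`) are hypothetical, and the result is treated as a logical name only; how it maps onto actual Secret Manager secret IDs is not specified in the source.

```python
def _segment(value: str, field: str) -> str:
    """Reject empty segments or ones containing '/', which would corrupt the path."""
    if not value or "/" in value:
        raise ValueError(f"invalid {field}: {value!r}")
    return value


def vendor_secret_name(env: str, vendor: str, tenant_id: str, cred_kind: str) -> str:
    """Logical name for a per-tenant vendor credential."""
    return "notification/{}/vendor/{}/{}/{}".format(
        _segment(env, "env"),
        _segment(vendor, "vendor"),
        _segment(tenant_id, "tenantId"),
        _segment(cred_kind, "credKind"),
    )


def platform_secret_name(env: str, purpose: str) -> str:
    """Logical name for a platform-wide secret such as an HMAC key."""
    return "notification/{}/platform/{}".format(
        _segment(env, "env"), _segment(purpose, "purpose")
    )


# Example values are placeholders, not real vendors or tenants.
print(vendor_secret_name("prod", "example-vendor", "t-123", "api-key"))
print(platform_secret_name("prod", "hmac-key"))
```

Validating each segment keeps a malformed tenant ID from silently resolving to another tenant's credential path.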
5. Networking
- VPC: shared VPC melmastoon-{env} with subnets per region (/22).
- Connector: Serverless VPC Access connector per region (notification-vpc-connector-{region}).
- Egress: Cloud NAT (melmastoon-nat-{region}) with reserved external IPs (one per region); vendors allowlist these.
- Ingress: Cloud LB → Cloud Armor → Cloud Run with private endpoints; WAF enabled.
- DNS: notify.melmastoon.com (public) → Cloud LB; notification-internal.melmastoon.com (private) → internal LB for service-mesh callers.
- mTLS: SPIFFE/SPIRE certs issued by iam-service; sidecar melm-mesh-agent runs in every Cloud Run revision for cert rotation.
6. Release & rollout
CI: GitHub Actions → Artifact Registry → Cloud Deploy.
Pipeline:
- dev: continuous deploy on every merge to develop.
- staging: continuous deploy on every merge to main.
- prod: manual promotion via Cloud Deploy.
Rollout strategy in prod (per service):
- Canary: 5 % traffic for 15 min.
- Half: 50 % traffic for 30 min.
- Full: 100 %.
- Auto-rollback if any of:
- Error rate > 1 % over 5 min on canary
- p95 latency > 2× baseline
- Synthetic post-deploy check fails (see TESTING_STRATEGY §13)
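The auto-rollback conditions above amount to a simple predicate over canary metrics. The sketch below shows one way to express it; the `CanaryMetrics` type, its field names, and `rollback_reasons` are assumptions for illustration, while the thresholds (1 % error rate, 2× baseline p95, synthetic check) are the ones listed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryMetrics:
    error_rate: float               # fraction over the 5-minute window (0.012 = 1.2 %)
    p95_latency_ms: float           # observed on the new revision
    baseline_p95_latency_ms: float  # observed on the prior revision
    synthetic_check_passed: bool


def rollback_reasons(m: CanaryMetrics) -> List[str]:
    """Return every tripped condition; an empty list means the rollout may proceed."""
    reasons: List[str] = []
    if m.error_rate > 0.01:
        reasons.append("error rate above 1 %")
    if m.p95_latency_ms > 2 * m.baseline_p95_latency_ms:
        reasons.append("p95 latency above 2x baseline")
    if not m.synthetic_check_passed:
        reasons.append("synthetic post-deploy check failed")
    return reasons


healthy = CanaryMetrics(0.002, 180.0, 150.0, True)
slow = CanaryMetrics(0.002, 320.0, 150.0, True)
print(rollback_reasons(healthy))  # []
print(rollback_reasons(slow))     # ['p95 latency above 2x baseline']
```

Returning the list of reasons rather than a bare boolean gives the rollback event something to log.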
Database migrations:
- Applied before the new revision rolls out.
- Forward-only; backwards-compatible writes for at least one revision (i.e., the prior revision must be able to read the new schema).
- Two-phase code+schema changes (drop column, rename column) span ≥ 2 releases per MIGRATION_PLAN.
Feature flags:
- LaunchDarkly with platform-managed keys.
- Notable flags: notifications.ai.enabled, notifications.whatsapp.enabled, notifications.voice.enabled (default off until phase 3), notifications.batch.enabled, notifications.sync.feed.enabled.
7. Capacity targets and scale envelope
| Capacity (prod, asia-south1) | Target | Burst |
|---|---|---|
| API requests | 1 000 rps sustained | 5 000 rps for 60 s |
| Pub/Sub fan-in | 5 000 messages/s | 20 000 messages/s |
| Outbound dispatches | 3 000 sends/s aggregate | 12 000 sends/s |
| Webhook ingestion | 2 000 req/s/vendor | 10 000 req/s/vendor (Cloud Armor caps) |
| WS connections | 50 000 concurrent | 100 000 |
| Postgres TPS | 4 000 TPS | 12 000 TPS |
| Memorystore ops | 50 000 ops/s | 200 000 ops/s |
Scale tests in TESTING_STRATEGY §8 validate these before each major release.
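As a rough cross-check of the API row against the Cloud Run concurrency setting (80 for API), the instance count can be estimated with Little's law: requests in flight = arrival rate × average latency. The 200 ms average latency below is an assumption chosen for illustration, not a measured figure, and `instances_needed` is a hypothetical helper.

```python
import math


def instances_needed(rps: int, avg_latency_ms: int, concurrency: int) -> int:
    """Estimate instances required to hold the in-flight requests.

    in_flight = rps * latency (Little's law); each instance serves up to
    `concurrency` requests at once.
    """
    in_flight = rps * avg_latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency))


ASSUMED_LATENCY_MS = 200  # assumption, not taken from the source

print(instances_needed(1000, ASSUMED_LATENCY_MS, 80))  # 3  (sustained target)
print(instances_needed(5000, ASSUMED_LATENCY_MS, 80))  # 13 (60 s burst)
```

The estimate ignores CPU-based scaling and cold starts, so it is a lower bound to sanity-check min/max instance settings rather than a sizing rule.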
8. Disaster recovery
- RPO: 5 minutes (PITR + warm cross-region replica).
- RTO: 30 minutes for tenant-region active-active; 2 hours for cold DR region.
- Strategy: Cloud SQL cross-region read replica in DR region; promoted manually during DR. Pub/Sub topics are global; subscriptions per region. GCS dual-region bucket avoids data loss for rendered/webhook stores.
- DR drill: quarterly tabletop + annual live failover in staging.
- Runbook: FAILURE_MODES §10.
9. Cost guardrails
- Per-environment budget alerts (50 %, 80 %, 100 %) on the GCP billing account.
- Per-tenant cost dashboards (vendor + GCP-attributed) — see OBSERVABILITY §10.
- Cloud Run min instances are tuned per environment to balance cold-start latency and cost; dev/staging use minInst=0 for non-API services.
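The budget alert thresholds in the first item can be illustrated with a few lines of arithmetic. This sketch only shows which of the 50 %, 80 % and 100 % thresholds a given spend has crossed; the function name and the example amounts are hypothetical, and actual alerting would be configured on the billing account rather than computed in application code.

```python
from typing import List

ALERT_THRESHOLDS_PCT = (50, 80, 100)  # from the budget alert policy above


def crossed_thresholds(spend: float, budget: float) -> List[int]:
    """Return the alert thresholds (in percent) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in ALERT_THRESHOLDS_PCT if spend * 100 >= t * budget]


# Example amounts are placeholders.
print(crossed_thresholds(400, 1000))   # []
print(crossed_thresholds(850, 1000))   # [50, 80]
print(crossed_thresholds(1200, 1000))  # [50, 80, 100]
```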
10. Cross-region routing example
Guest reservation in Kabul (tenant region asia-south1):
Browser → notify.melmastoon.com (global LB)
→ routed to backend service "notification-api-asia-south1"
→ Cloud Run notification-api in asia-south1
→ reads/writes Cloud SQL asia-south1
→ publishes events to Pub/Sub topic (global)
→ consumer subscriptions per region; routing rule keeps tenant traffic regional
Guest reservation in Saudi Arabia (tenant region me-central1):
Same global LB; routed to "notification-api-me-central1"; isolated stack
The routing decision uses a Cloud Armor edge rule populated from tenants_local.region; mismatches return 451 to enforce data residency.
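The residency decision described above can be modelled as a lookup followed by a region comparison. The sketch below is a simplified stand-in, not the edge rule itself: the `TENANT_REGIONS` table (standing in for tenants_local.region), the tenant IDs, the `route` function, and the 404 for unknown tenants are assumptions; the backend naming and the 451 on mismatch follow the text.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for tenants_local.region.
TENANT_REGIONS: Dict[str, str] = {
    "tenant-kabul": "asia-south1",
    "tenant-riyadh": "me-central1",
}


def route(tenant_id: str, serving_region: str) -> Tuple[int, str]:
    """Return (status, backend) for a request arriving at `serving_region`.

    451 signals that the request reached a region other than the tenant's
    pinned one, so it must not be served there.
    """
    pinned = TENANT_REGIONS.get(tenant_id)
    if pinned is None:
        return 404, ""  # assumption: unknown tenants are rejected
    if pinned != serving_region:
        return 451, ""
    return 200, f"notification-api-{pinned}"


print(route("tenant-kabul", "asia-south1"))   # (200, 'notification-api-asia-south1')
print(route("tenant-riyadh", "asia-south1"))  # (451, '')
```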
11. Bootstrapping a new region
When the platform adds a new tenant region:
- Provision Cloud SQL HA, Memorystore, Pub/Sub subscriptions, Secret Manager replicas, GCS bucket (regional or dual-region per residency rule).
- Stand up Cloud Run services (Terraform module notification-service parameterised on region).
- Issue SPIFFE identities for the new region's service accounts.
- Configure Cloud LB backend service + edge routing rule for the new tenant region.
- Run smoke + synthetic from TESTING_STRATEGY §13.
- Update SERVICE_READINESS regional gates.
12. IaC
All infrastructure is Terraform (infra/terraform/notification-service/):
- main.tf (modules: cloud_run, cloud_sql, memorystore, pubsub, secret_manager, cloud_storage, cloud_kms).
- vars/{env}.tfvars per environment.
- pipelines/cloud-deploy/{env} for the rollout pipeline definitions.
- Drift detection runs nightly; PRs required to apply.