DEPLOYMENT_TOPOLOGY — reservation-service

Sibling: LOCAL_DEV_SETUP · OBSERVABILITY · FAILURE_MODES

Strategic anchors: 02 §12 GCP Reference Topology · 04 §11 Pub/Sub topology

reservation-service runs on Google Cloud Run (managed, regional) with the platform's standard NestJS base image. It ships as two Cloud Run services — the request-handling API and the hold-expiry worker — to keep their scaling and IAM boundaries independent.


1. Runtime

| Property | Value |
| --- | --- |
| Language | TypeScript |
| Runtime | Node 20 LTS |
| Framework | NestJS 10 (composition root only; domain framework-free) |
| Container base | `gcr.io/melmastoon-platform/node-20-distroless:<sha>` |
| Boot script | `node --enable-source-maps dist/main.js` (telemetry initialized first) |
| Health endpoints | `GET /internal/health` (liveness), `GET /internal/ready` (readiness) |
| Graceful shutdown | SIGTERM → drain in-flight HTTP and inbox handlers; max 30 s |
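
The graceful-shutdown row can be pictured as a small gate that stops admitting work on SIGTERM and allows exit once in-flight work has drained or the 30 s budget is spent. This is only a sketch: `ShutdownGate` and its method names are hypothetical, and flipping readiness to "not ready" at the start of the drain is an assumption rather than documented behaviour of the service.

```typescript
// Hypothetical helper; names are illustrative and not taken from the service code.
const MAX_DRAIN_MS = 30_000; // "max 30 s" from the runtime table

class ShutdownGate {
  private inFlight = 0;
  private drainStartedAt: number | null = null;

  /** Call when an HTTP request or inbox handler starts. Returns false once draining. */
  tryEnter(): boolean {
    if (this.drainStartedAt !== null) return false; // refuse new work after SIGTERM
    this.inFlight += 1;
    return true;
  }

  /** Call when that request or handler finishes. */
  leave(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  /** Call from the SIGTERM handler with the current time in milliseconds. */
  beginDrain(nowMs: number): void {
    if (this.drainStartedAt === null) this.drainStartedAt = nowMs;
  }

  /** Could back GET /internal/ready (assumption): not ready once draining has begun. */
  isReady(): boolean {
    return this.drainStartedAt === null;
  }

  /** True when the process may exit: everything drained, or the deadline has passed. */
  canExit(nowMs: number): boolean {
    if (this.drainStartedAt === null) return false;
    return this.inFlight === 0 || nowMs - this.drainStartedAt >= MAX_DRAIN_MS;
  }
}
```

Passing the clock in as `nowMs` keeps the gate deterministic, so the 30 s cutoff can be unit-tested without real timers.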

2. Cloud Run services

2.1 reservation-service (API + inbox handlers)

| Setting | Value |
| --- | --- |
| Region | me-central1 (primary) and asia-south1 (active for region-pinned tenants) |
| Min replicas | 3 per region (hot path; eliminates cold start during business hours) |
| Max replicas | 30 per region |
| Concurrency per instance | 80 |
| CPU | 2 vCPU (always-allocated) |
| Memory | 1 GiB |
| VPC connector | melmastoon-private-connector |
| Egress | private VPC (Cloud SQL, Memorystore, Pub/Sub via private endpoints) |
| Ingress | internal + load-balancer (Kong upstream) |
| Authentication | IAM (service-to-service); Pub/Sub push principal whitelisted; Cloud Scheduler principal whitelisted |
| Service account | `reservation-svc@<project>.iam.gserviceaccount.com` |

2.2 reservation-hold-expiry-worker (separate Cloud Run service)

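| Setting | Value |
| --- | --- |
| Schedule | Cloud Scheduler; intended cadence every 30 s, written as `*/30 * * * * *`. Cloud Scheduler's unix-cron format takes five fields and fires at most once per minute, so the six-field expression cannot be configured as written |
| Trigger | HTTPS POST to `/internal/jobs/expire-holds` on a dedicated single-replica Cloud Run service (not a Cloud Run job); chosen to keep a warm cache in steady state |
| Min replicas | 1 |
| Max replicas | 1 (single writer for the sweeper batch) |
| CPU | 1 vCPU (CPU-on-request) |
| Memory | 512 MiB |
| Service account | `reservation-holds-sweeper@<project>.iam.gserviceaccount.com` (RLS-bypass on reservation_holds only) |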

The sweeper is intentionally separated so its IAM scope is narrower and its outage cannot pin API capacity. It always runs single-replica to avoid two sweepers contending for the same hold row.
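
To illustrate why a single sweeper is safe to re-run, here is a minimal in-memory sketch of one pass. The `Hold` shape, the function names and the default `limit` of 100 are assumptions for illustration; the real sweeper works against the reservation_holds table, and its query is not shown in this document.

```typescript
// Hypothetical in-memory model of a hold; the real row has more columns.
interface Hold {
  id: string;
  expiresAt: number; // epoch milliseconds
  released: boolean;
}

/** Picks at most `limit` holds that have expired and are not yet released, oldest first. */
function selectExpiredHolds(holds: readonly Hold[], nowMs: number, limit = 100): Hold[] {
  return holds
    .filter((h) => !h.released && h.expiresAt <= nowMs)
    .sort((a, b) => a.expiresAt - b.expiresAt)
    .slice(0, limit);
}

/**
 * Releases one batch and returns how many holds were released.
 * Already-released holds are filtered out, so repeating the pass is a no-op.
 */
function sweep(holds: Hold[], nowMs: number, limit = 100): number {
  const batch = selectExpiredHolds(holds, nowMs, limit);
  for (const h of batch) h.released = true;
  return batch.length;
}
```

Running `sweep` twice with the same `nowMs` releases the expired holds on the first call and returns 0 on the second, which is the idempotence the single-replica design relies on.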


3. Infrastructure dependencies

| Dependency | Provisioning |
| --- | --- |
| Cloud SQL Postgres 15 (HA primary + read replica) | Shared instance with other PMS-core services; schema `reservation`; per-service IAM database users |
| Memorystore (Redis 7) | Shared with PMS-core for hot caches; namespaced keys `reservation:*` |
| GCP Pub/Sub | One topic per produced subject; push subscriptions for inbox (delivered to `/internal/events/*`); DLQs per subscription; ordering enabled per `<tenantId>:<aggregateId>` ordering key |
| Cloud KMS | Per-tenant CMK ring melmastoon-tenants for guest field-level DEKs |
| Secret Manager | tenantSalt per tenant (HMAC for hash-for-search); no payment or lock secrets |
| Cloud Scheduler | reservation-hold-expiry-30s job |
| Cloud Storage | None (we hold only mediaId references) |
| VPC Service Controls | Service is in the melmastoon-prod-perimeter VPC-SC perimeter; egress to non-perimeter Pub/Sub blocked |
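
The two key conventions in the table, the Pub/Sub ordering key `<tenantId>:<aggregateId>` and the `reservation:*` Redis namespace, could be built by helpers along these lines. The function names are hypothetical, and rejecting a `:` inside `tenantId` is an added assumption to keep the ordering key unambiguous.

```typescript
/** Builds the Pub/Sub ordering key in the <tenantId>:<aggregateId> form. */
function orderingKey(tenantId: string, aggregateId: string): string {
  // Assumption: tenant ids never contain ':'; otherwise the key could not be split back.
  if (tenantId.indexOf(":") !== -1) {
    throw new Error("tenantId must not contain ':'");
  }
  return `${tenantId}:${aggregateId}`;
}

/** Prefixes every cache key with the service namespace so it matches reservation:*. */
function cacheKey(...parts: string[]): string {
  return ["reservation", ...parts].join(":");
}
```

For example, `cacheKey("hold", "h-42")` yields `reservation:hold:h-42`, which stays inside the namespace on the shared Memorystore instance.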

4. Network topology

Internet ──► Kong (Cloud Run) ──► reservation-service (Cloud Run, internal+LB)
                                  │
                                  ├── Cloud SQL (private endpoint)
                                  ├── Memorystore (VPC connector)
                                  ├── Pub/Sub (Private Service Connect)
                                  ├── Cloud KMS (private endpoint)
                                  └── Secret Manager (private endpoint)

Pub/Sub push ──► reservation-service `/internal/events/*` (IAM-gated)
Cloud Scheduler ──► reservation-hold-expiry-worker `/internal/jobs/expire-holds` (IAM-gated)

There is no direct public ingress to reservation-service. The only public surface is via bff-tenant-booking-service and bff-backoffice-service, fronted by Kong and Cloudflare.


5. Deploy & release

| Stage | Mechanism |
| --- | --- |
| Build | GitHub Actions → Cloud Build → distroless image; SBOM + Cosign signature attached |
| Image registry | Artifact Registry `gcr.io/melmastoon-platform/reservation-service:<git-sha>` |
| Migrations | `drizzle-kit push` runs as a Cloud Build step before Cloud Run revision rollout; backwards-compatible only |
| Canary | 5% traffic split for 30 minutes; abort and roll back if any alert on the ladder (RESV-001..010) fires for 10 min |
| Rollback | `gcloud run services update-traffic reservation-service --to-revisions=<prev>=100`; image stays in registry |
| Promotion | Manual gate from staging to prod; tagged release notes link to PRs |
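
The canary rule in the table can be expressed as a small decision function. This is a sketch under assumptions: the names are hypothetical, alert duration is taken as continuous minutes of firing, and holding the canary open while an alert is active at the 30-minute mark is a choice made here, not something the table specifies.

```typescript
type CanaryDecision = "continue" | "complete" | "rollback";

const CANARY_WINDOW_MIN = 30;     // canary duration from the table
const ALERT_ABORT_AFTER_MIN = 10; // alert must fire this long before aborting

/**
 * @param elapsedMin     minutes since the 5% traffic split began
 * @param alertFiringMin continuous minutes any RESV-001..010 alert has been firing (0 if none)
 */
function decideCanary(elapsedMin: number, alertFiringMin: number): CanaryDecision {
  if (alertFiringMin >= ALERT_ABORT_AFTER_MIN) return "rollback";
  // Assumption: do not finish the rollout while an alert is still firing.
  if (elapsedMin >= CANARY_WINDOW_MIN && alertFiringMin === 0) return "complete";
  return "continue";
}
```

A "rollback" result corresponds to the `update-traffic` command in the Rollback row; "complete" means the new revision can take the remaining traffic.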

Terraform module references: terraform/modules/cloud-run-service and terraform/modules/cloud-scheduler-job from melmastoon-infra.


6. Resource sizing rationale

  • Min 3 replicas (API): the booking saga is a hot synchronous path; a single cold start would push p99 above the 5 s SLO. Three replicas survive a single-AZ blip and absorb bursts from the morning check-in spike.
  • Concurrency 80: database demand scales with concurrency × instances. At the 3-instance minimum that is 80 × 3 = 240 in-flight requests. The Postgres connection pool used by Drizzle is capped at 60 connections per instance, and requests beyond the cap block until a connection is free.
  • Single sweeper: the hold-expiry batch is bounded (≤ 100 holds per pass typical) and idempotent; concurrency would only add coordination overhead.
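
The sizing arithmetic above can be checked with a few lines. The `Sizing` type and `capacity` function are hypothetical names; the inputs (80 concurrency, 60 pooled connections per instance, 3 to 30 instances per region) come from this document.

```typescript
interface Sizing {
  instances: number;       // Cloud Run instances in one region
  concurrency: number;     // max concurrent requests per instance
  poolPerInstance: number; // Postgres connections per instance
}

function capacity(s: Sizing) {
  const maxInFlightRequests = s.instances * s.concurrency;
  const maxDbConnections = s.instances * s.poolPerInstance;
  // Requests that may have to wait for a connection if every request needs one at once.
  const mayBlock = Math.max(0, maxInFlightRequests - maxDbConnections);
  return { maxInFlightRequests, maxDbConnections, mayBlock };
}

const atMinScale = capacity({ instances: 3, concurrency: 80, poolPerInstance: 60 });
const atMaxScale = capacity({ instances: 30, concurrency: 80, poolPerInstance: 60 });
```

At the 3-instance minimum this gives 240 in-flight requests against 180 pooled connections. At the 30-instance maximum it gives 2,400 requests against 1,800 connections per region, a figure worth comparing with the shared Cloud SQL instance's connection limit.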

7. Region & residency

  • me-central1 (Doha): primary region; serves Afghan, Iranian (where lawful), GCC, Tajik tenants by default.
  • asia-south1 (Mumbai): secondary; serves South Asia tenants.
  • Tenant pinning is read from tenant.region; cross-region writes are blocked at the connection middleware. Cross-region reads are allowed only for audit-service and analytics-service.
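
The pinning rule can be sketched as a pure check that the connection middleware might apply. The function and type names are hypothetical, and identifying the caller by a service-name string is an assumption; the document does not say how the middleware recognises audit-service and analytics-service.

```typescript
type Region = "me-central1" | "asia-south1";
type Operation = "read" | "write";

// Services allowed to read across regions, per the rule above.
const CROSS_REGION_READERS: readonly string[] = ["audit-service", "analytics-service"];

/**
 * @param tenantRegion  value of tenant.region
 * @param localRegion   region this instance is running in
 * @param op            whether the connection will read or write
 * @param callerService name of the calling service (assumed to be known to the middleware)
 */
function isAccessAllowed(
  tenantRegion: Region,
  localRegion: Region,
  op: Operation,
  callerService: string,
): boolean {
  if (tenantRegion === localRegion) return true; // in-region access is unrestricted here
  if (op === "write") return false;              // cross-region writes are always blocked
  return CROSS_REGION_READERS.indexOf(callerService) !== -1;
}
```

For instance, `isAccessAllowed("asia-south1", "me-central1", "read", "audit-service")` is allowed, while the same call with `"write"` is rejected regardless of caller.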

8. Cross-references