DEPLOYMENT_TOPOLOGY — bff-tenant-booking-service
Sibling: DATA_MODEL · SECURITY_MODEL · LOCAL_DEV_SETUP
Cross-cutting: 02 Enterprise Architecture · §4 GCP Reference Architecture
1. Runtime
| Property | Value |
|---|---|
| Compute | Google Cloud Run (managed) |
| Region (primary) | asia-south1 (Mumbai) |
| Region (DR-warm) | europe-west4 (Eemshaven) |
| Container | Distroless Node 20, multi-stage build, non-root node user, read-only root FS |
| Min instances | 3 (per region) |
| Max instances | 30 (per region; raised to 60 in flashSale mode) |
| Concurrency per instance | 60 |
| CPU | 2 vCPU, always-allocated |
| Memory | 1 GiB |
| Startup latency budget | < 800 ms |
| Request timeout | 30 s (matches longest upstream chain) |
| VPC connector | bff-connector-asia-south1 (private egress) |
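The instance and concurrency limits above imply a ceiling on simultaneously in-flight requests per region. A minimal sketch of that arithmetic follows; the type and variable names are illustrative and do not come from the service code.

```typescript
// Sizing values copied from the Runtime table; names are illustrative.
interface ScalingProfile {
  minInstances: number;
  maxInstances: number;
  concurrencyPerInstance: number;
}

const normal: ScalingProfile = { minInstances: 3, maxInstances: 30, concurrencyPerInstance: 60 };
const flashSale: ScalingProfile = { ...normal, maxInstances: 60 };

// Upper bound on simultaneously in-flight requests for a given instance count.
function concurrentCapacity(instances: number, profile: ScalingProfile): number {
  return instances * profile.concurrencyPerInstance;
}

const warmFloor = concurrentCapacity(normal.minInstances, normal); // 3 * 60 = 180
const normalCeiling = concurrentCapacity(normal.maxInstances, normal); // 30 * 60 = 1800
const flashSaleCeiling = concurrentCapacity(flashSale.maxInstances, flashSale); // 60 * 60 = 3600

console.log({ warmFloor, normalCeiling, flashSaleCeiling });
```

These are concurrency ceilings, not throughput figures; sustainable RPS also depends on per-request latency.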
2. Ingress
Tenant subdomain (kabul-grand-hotel.melmastoon.ghasi.io) OR custom domain (booking.<tenant>.com)
│
▼
Cloud DNS (CNAME → GCLB anycast)
│
▼
Global HTTPS Load Balancer
├── Cloud Armor (WAF + bot rules)
├── Cloud CDN (cache for documented public GETs)
├── SNI cert (managed by Certificate Manager; per-tenant for custom domains)
│
▼
Serverless NEG → Cloud Run (bff-tenant-booking-service)
Custom domains: tenants point a DNS CNAME at booking.tenant.melmastoon.ghasi.io; we then provision a managed certificate via Cloud Certificate Manager (DNS-validated).
3. Egress (upstream connections)
All upstreams are reached over internal Cloud Run-to-Cloud Run calls through the VPC connector, authenticated with Google ID tokens minted for the bff-tenant-sa service account.
| Upstream | Hostname | Auth | Timeout | Retries |
|---|---|---|---|---|
| tenant-service | tenant.melmastoon.internal | Google ID token | 400 ms | 0 |
| theme-config-service | theme.melmastoon.internal | Google ID token | 500 ms | 1 |
| property-service | property.melmastoon.internal | Google ID token | 800 ms | 1 |
| inventory-service | inventory.melmastoon.internal | Google ID token | 700 ms | 1 |
| pricing-service | pricing.melmastoon.internal | Google ID token | 800 ms | 0 (quote); 1 (cheapest) |
| reservation-service | reservation.melmastoon.internal | Google ID token | 1500 ms | 1 (with idem-key) |
| payment-gateway-service | payment.melmastoon.internal | Google ID token | 2000 ms | 0 |
| billing-service | billing.melmastoon.internal | Google ID token | 600 ms | 1 |
| lock-integration-service | lock.melmastoon.internal | Google ID token | 600 ms | 0 (soft) |
| ai-orchestrator-service | ai.melmastoon.internal | Google ID token | 1200 ms | 0 |
| bff-consumer-service | bff-consumer.melmastoon.internal | Google ID token + shared HMAC key | 800 ms | 0 |
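To sanity-check the 30 s request timeout against the table above, the sketch below sums the worst case for each upstream, assuming every attempt runs to its full timeout, calls happen strictly in sequence, and retries have no backoff delay. All three are simplifying assumptions; a real request touches only a subset of upstreams, some of them in parallel. The `UpstreamPolicy` type is illustrative.

```typescript
interface UpstreamPolicy {
  name: string;
  timeoutMs: number;
  retries: number;
}

// Values from the egress table. pricing-service uses its higher retry count (the "cheapest" call).
const policies: UpstreamPolicy[] = [
  { name: "tenant-service", timeoutMs: 400, retries: 0 },
  { name: "theme-config-service", timeoutMs: 500, retries: 1 },
  { name: "property-service", timeoutMs: 800, retries: 1 },
  { name: "inventory-service", timeoutMs: 700, retries: 1 },
  { name: "pricing-service", timeoutMs: 800, retries: 1 },
  { name: "reservation-service", timeoutMs: 1500, retries: 1 },
  { name: "payment-gateway-service", timeoutMs: 2000, retries: 0 },
  { name: "billing-service", timeoutMs: 600, retries: 1 },
  { name: "lock-integration-service", timeoutMs: 600, retries: 0 },
  { name: "ai-orchestrator-service", timeoutMs: 1200, retries: 0 },
  { name: "bff-consumer-service", timeoutMs: 800, retries: 0 },
];

// Worst case for one upstream: the first attempt and every retry all hit the timeout.
function worstCaseMs(policy: UpstreamPolicy): number {
  return policy.timeoutMs * (1 + policy.retries);
}

const REQUEST_TIMEOUT_MS = 30000;
const sequentialWorstCaseMs = policies.reduce((sum, p) => sum + worstCaseMs(p), 0);

console.log(sequentialWorstCaseMs); // 14800
console.log(sequentialWorstCaseMs <= REQUEST_TIMEOUT_MS); // true
```

Under these assumptions even a chain that touched every upstream would finish in 14.8 s, inside the 30 s request timeout.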
4. Stateful dependencies
| Dependency | Resource (name, size, tier) | Region | HA |
|---|---|---|---|
| Memorystore (Redis 7) — cache tier | bff-tenant-cache-asia-south1, 5 GiB, standard | asia-south1 | Standby + auto-failover |
| Memorystore (Redis 7) — session tier (no eviction) | bff-tenant-session-asia-south1, 3 GiB, standard | asia-south1 | Standby + auto-failover |
| Cloud SQL (Postgres 16) | bff-tenant-db-asia-south1, db-custom-2-8192 | asia-south1 | Regional HA + cross-region read replica in europe-west4 |
| Pub/Sub topics | melmastoon.bff.tenant.* | global | n/a |
| Secret Manager | handoff-hmac (current+previous), pepper, recaptcha | global | replicated automatically |
Two Memorystore instances avoid cache eviction pressure displacing live booking-draft state.
5. CI/CD pipeline
GitHub PR → GitHub Actions
├── Lint + typecheck + unit + integration + contract tests
├── Build container (Cloud Build)
├── Trivy scan (block high/critical CVE)
├── Cosign sign with Fulcio identity
├── Push to Artifact Registry
├── Deploy to dev Cloud Run (no-traffic + smoke test → 100%)
├── Manual approval → stage
└── Manual approval → prod (canary 5% → 25% → 100% over 30 min, with metric guardrails)
Binary Authorization on the prod Cloud Run service admits only images that carry a valid Cosign signature.
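The prod canary progression (5% → 25% → 100%, gated by metric guardrails) can be sketched as a small decision function. This is an illustration, not the Cloud Deploy configuration: `guardrailsHealthy` is a hypothetical boolean summarising whatever metrics the guardrails check, and how the 30 minutes are split between stages is not specified here, so the sketch leaves timing out.

```typescript
// Canary stages from the pipeline description.
const CANARY_STAGES = [5, 25, 100] as const;

type CanaryDecision =
  | { action: "promote"; trafficPercent: number }
  | { action: "rollback"; trafficPercent: 0 }
  | { action: "done"; trafficPercent: 100 };

// Decide the next step given the current traffic share of the new revision.
function nextCanaryDecision(currentPercent: number, guardrailsHealthy: boolean): CanaryDecision {
  // Any guardrail breach sends all traffic back to the previous revision.
  if (!guardrailsHealthy) return { action: "rollback", trafficPercent: 0 };
  if (currentPercent >= 100) return { action: "done", trafficPercent: 100 };
  // Otherwise move to the first stage above the current share.
  const next = CANARY_STAGES.find((stage) => stage > currentPercent) ?? 100;
  return { action: "promote", trafficPercent: next };
}

console.log(nextCanaryDecision(0, true)); // { action: 'promote', trafficPercent: 5 }
console.log(nextCanaryDecision(25, false)); // { action: 'rollback', trafficPercent: 0 }
```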
6. Traffic management
- Default routing: tenant subdomain or custom domain → GCLB → nearest healthy region.
- Canary control: Cloud Deploy + Cloud Run revisions; rollback budget 5 minutes if SLO burn detected.
- Flash-sale mode: tenant flag promotes prod max-instances to 60; pre-warmed via Memorystore cache priming.
- Custom-domain provisioning: Cloud Certificate Manager (DNS-validated); BFF reads tenant→domain map from tenant-service.
7. Configuration
| Source | What |
|---|---|
| Cloud Run env vars | Non-secret toggles |
| Secret Manager (file mounts) | All secrets per SECURITY_MODEL §9 |
| Cloud Run Service YAML | Sizing, concurrency, scaling, VPC connector |
| bff-tenant-flags (Memorystore) | Feature flags + sample rates (refresh 30 s) |
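The 30 s refresh for feature flags can be sketched as a small in-process reader that reloads once the cached copy is stale. The `load` callback is a hypothetical stand-in for the Memorystore read and is synchronous here to keep the example short; a real reader would be asynchronous. The injectable `now` clock exists only to make the behaviour easy to test.

```typescript
type Flags = Record<string, boolean | number>;

// Returns a reader that reloads flags once the cached copy is older than refreshMs.
function createFlagReader(
  load: () => Flags, // hypothetical loader standing in for a Memorystore read
  now: () => number = Date.now,
  refreshMs = 30000, // refresh interval from the Configuration table
) {
  let flags: Flags = {};
  let loadedAt = Number.NEGATIVE_INFINITY; // forces a load on first read

  return (name: string): boolean | number | undefined => {
    const t = now();
    if (t - loadedAt >= refreshMs) {
      flags = load();
      loadedAt = t;
    }
    return flags[name];
  };
}

// Usage with made-up flag names.
const readFlag = createFlagReader(() => ({ flashSale: false, traceSampleRate: 0.1 }));
console.log(readFlag("traceSampleRate")); // 0.1
```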
8. Networking
- VPC: melmastoon-prod-vpc.
- Subnet (Cloud Run connector): bff-tenant-connector-asia-south1 (10.20.5.0/28).
- Private Service Access for Cloud SQL.
- Memorystore via VPC connector private IP.
- Egress NAT only for payment-gateway-service provider redirects (which actually leave from the gateway service, not this BFF; listed for completeness).
9. Cost posture
| Item | Estimated monthly @ 150 RPS steady |
|---|---|
| Cloud Run | ~$280 |
| Memorystore (cache 5 GiB + session 3 GiB) | ~$280 |
| Cloud SQL (db-custom-2-8192 + HA) | ~$240 |
| Pub/Sub | ~$50 |
| Cloud CDN | ~$30 |
| Cloud Armor | ~$30 |
| Cert Manager (custom domains × 50) | ~$50 |
| Logging + Trace + Monitoring | ~$80 |
| Total | ~$1,040 / month / region |
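The total can be reproduced from the line items, and dividing by the stated 150 RPS gives a rough cost per unit of steady load. A small sketch of that arithmetic, using only the estimates in the table:

```typescript
// Monthly estimates (USD) copied from the cost table.
const monthlyUsd: Record<string, number> = {
  "Cloud Run": 280,
  "Memorystore": 280,
  "Cloud SQL": 240,
  "Pub/Sub": 50,
  "Cloud CDN": 30,
  "Cloud Armor": 30,
  "Cert Manager": 50,
  "Logging + Trace + Monitoring": 80,
};

const STEADY_RPS = 150;
const totalPerRegionUsd = Object.values(monthlyUsd).reduce((sum, usd) => sum + usd, 0);
const usdPerSteadyRps = totalPerRegionUsd / STEADY_RPS;

console.log(totalPerRegionUsd); // 1040
console.log(usdPerSteadyRps.toFixed(2)); // "6.93"
```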
10. Disaster recovery
- RPO: 5 min (Cloud SQL PITR; Memorystore is ephemeral cache + best-effort session).
- RTO: 30 min (DNS + Cloud Run redeploy in DR).
- Quarterly DR drill: cut traffic to europe-west4; verify booking flow against replicated tenant catalog.
- Booking drafts in flight at the moment of failover: client receives MELMASTOON.BFF.TENANT.DRAFT_NOT_FOUND and is redirected to re-search; idempotency rows in Cloud SQL prevent double-charges if the user retries confirm after failover.
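The double-charge protection on a retried confirm can be illustrated with an idempotency-key lookup. The sketch below uses an in-memory `Map` as a stand-in for the idempotency rows in Cloud SQL; the function names, key format, and result shape are hypothetical.

```typescript
interface ConfirmResult {
  reservationId: string;
  charged: boolean;
}

// In-memory stand-in for the idempotency rows in Cloud SQL (illustrative only).
const idempotencyRows = new Map<string, ConfirmResult>();

// Runs `charge` at most once per idempotency key; replays return the stored outcome.
function confirmBooking(idempotencyKey: string, charge: () => ConfirmResult): ConfirmResult {
  const existing = idempotencyRows.get(idempotencyKey);
  if (existing) return existing; // retry after failover: no second charge
  const result = charge();
  idempotencyRows.set(idempotencyKey, result);
  return result;
}

// Usage: the second call with the same key does not invoke the charge function.
let chargeCalls = 0;
const chargeOnce = (): ConfirmResult => {
  chargeCalls += 1;
  return { reservationId: "res-001", charged: true };
};
confirmBooking("draft-42:confirm", chargeOnce);
confirmBooking("draft-42:confirm", chargeOnce);
console.log(chargeCalls); // 1
```

A database-backed version would rely on a unique constraint on the key and a transaction, so that two concurrent retries cannot both pass the lookup.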
11. Custom-domain operations
- Tenants self-serve domain claim in tenant-service via DNS CNAME challenge.
- Cert Manager auto-provisions via DNS-01.
- BFF reads tenant.config.customDomains[] and rejects requests for unclaimed domains with 404.
- DNS CAA records on tenant zones recommended (pki.goog, the issuer domain for Google Trust Services).
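The host check described above can be sketched as a lookup against each tenant's claimed domains. The `TenantConfig` shape, the `slug` field, and the handling of platform subdomains are assumptions for illustration; only the `customDomains` array and the 404 for unclaimed domains come from this section.

```typescript
interface TenantConfig {
  slug: string; // hypothetical: the tenant's platform subdomain label
  customDomains: string[];
}

const PLATFORM_SUFFIX = ".melmastoon.ghasi.io";

type Resolution = { status: 200; tenant: TenantConfig } | { status: 404 };

// Maps a Host header to a tenant, or 404 when no tenant has claimed the host.
function resolveTenant(host: string, tenants: TenantConfig[]): Resolution {
  // Normalise: lowercase, drop an optional port and a trailing dot.
  const h = host.toLowerCase().replace(/:\d+$/, "").replace(/\.$/, "");
  if (h.endsWith(PLATFORM_SUFFIX)) {
    const slug = h.slice(0, -PLATFORM_SUFFIX.length);
    const bySlug = tenants.find((t) => t.slug === slug);
    return bySlug ? { status: 200, tenant: bySlug } : { status: 404 };
  }
  const byDomain = tenants.find((t) => t.customDomains.includes(h));
  return byDomain ? { status: 200, tenant: byDomain } : { status: 404 };
}

// Usage with a made-up custom domain.
const tenants: TenantConfig[] = [
  { slug: "kabul-grand-hotel", customDomains: ["booking.kabulgrand.example"] },
];
console.log(resolveTenant("booking.kabulgrand.example", tenants).status); // 200
console.log(resolveTenant("unclaimed.example.org", tenants).status); // 404
```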
12. Health endpoints
- /health/live: returns 200 if the process is running.
- /health/ready: returns 200 if Memorystore and Postgres are reachable and at least 80% of upstream circuits are closed.
- Cloud Run liveness + readiness probes are pointed at these endpoints.
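The readiness rule can be sketched as a pure function over dependency state. Two points are assumptions rather than documented behaviour: a not-ready result is reported as 503, and a half-open circuit does not count as closed.

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface ReadinessInput {
  memorystoreOk: boolean;
  postgresOk: boolean;
  circuits: CircuitState[]; // one entry per upstream
}

const MIN_CLOSED_RATIO = 0.8;

// Ready only if both stores are reachable and at least 80% of circuits are closed.
function readinessStatus(input: ReadinessInput): 200 | 503 {
  const closed = input.circuits.filter((state) => state === "closed").length;
  const closedRatio = input.circuits.length === 0 ? 1 : closed / input.circuits.length;
  const ready = input.memorystoreOk && input.postgresOk && closedRatio >= MIN_CLOSED_RATIO;
  return ready ? 200 : 503;
}

// With the 11 upstreams in the egress table, 9 closed (about 82%) passes and 8 closed (about 73%) fails.
const nineOfEleven: CircuitState[] = [...Array<CircuitState>(9).fill("closed"), "open", "open"];
console.log(readinessStatus({ memorystoreOk: true, postgresOk: true, circuits: nineOfEleven })); // 200
```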