tenant-service — DEPLOYMENT_TOPOLOGY

Companion: SERVICE_OVERVIEW · DATA_MODEL · OBSERVABILITY · FAILURE_MODES · Platform: 02 Enterprise Architecture §9 GCP topology

Tenant-service is deployed as a regional Cloud Run service in each residency region. Read traffic is served active-active across regions. Writes are active-standby within each residency: they are pinned to the tenant's residency region, per data-residency policy.


1. GCP Topology

             ┌──────────────────────────── melmastoon-prod (GCP project) ────────────────────────────┐
             │                                                                                       │
             │    asia-south1 (Mumbai)         me-central1 (Doha)           europe-west1 (Belgium)   │
             │    ┌──────────────────┐         ┌──────────────────┐         ┌──────────────────┐     │
             │    │ Cloud Run        │         │ Cloud Run        │         │ Cloud Run        │     │
internal LB  │    │ tenant-service   │         │ tenant-service   │         │ tenant-service   │     │
────────────►│    │ min=2 max=50     │         │ min=2 max=50     │         │ min=2 max=50     │     │
             │    └──────────────────┘         └──────────────────┘         └──────────────────┘     │
             │             │                            │                            │               │
             │             │ workload identity          │                            │               │
             │             ▼                            ▼                            ▼               │
             │    ┌──────────────────┐         ┌──────────────────┐         ┌──────────────────┐     │
             │    │ Cloud SQL HA     │         │ Cloud SQL HA     │         │ Cloud SQL HA     │     │
             │    │ primary + sync   │         │ primary + sync   │         │ primary + sync   │     │
             │    │ + 1 read replica │         │ + 1 read replica │         │ + 1 read replica │     │
             │    └──────────────────┘         └──────────────────┘         └──────────────────┘     │
             │                                                                                       │
             │    ┌──── Memorystore (Redis Standard, regional) per region ───────────────────┐       │
             │    └──────────────────────────────────────────────────────────────────────────┘       │
             │                                                                                       │
             │    ┌──── Pub/Sub topics (global, with per-region subs) ───────────────────────┐       │
             │    │ melmastoon.tenant.* + tenant-deletion-saga DLQs                          │       │
             │    └──────────────────────────────────────────────────────────────────────────┘       │
             │                                                                                       │
             │    ┌──── Secret Manager (CMEK), Cloud KMS, Cloud Build, Artifact Registry ────┐       │
             │    └──────────────────────────────────────────────────────────────────────────┘       │
             └───────────────────────────────────────────────────────────────────────────────────────┘

Cross-region failover: tenant data does not cross residency boundaries. If a region fails, tenants pinned to that region experience downtime; we do not silently relocate data. The health-check-driven LB routes each tenant only to its matching residency region. The platform's RTO target for a regional outage is 30 min via the DR runbook; cutover is manual, never automatic.
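
The routing rule above can be sketched as a small, fail-closed lookup. This is a minimal illustration in Python, not the LB's actual configuration: the `Tenant` type, its `residency_region` field, and the `RegionUnavailable` error are hypothetical names, and the set of healthy regions is assumed to come from the health checks.

```python
from dataclasses import dataclass

# Regions taken from the topology diagram in section 1.
RESIDENCY_REGIONS = {"asia-south1", "me-central1", "europe-west1"}


class RegionUnavailable(Exception):
    """Raised when a tenant's residency region cannot serve traffic."""


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    residency_region: str  # hypothetical field name


def route_region(tenant: Tenant, healthy_regions: set[str]) -> str:
    """Return the only region allowed to serve this tenant, or fail."""
    region = tenant.residency_region
    if region not in RESIDENCY_REGIONS:
        raise ValueError(f"unknown residency region: {region}")
    if region not in healthy_regions:
        # No fallback to another region: data must not leave its residency.
        raise RegionUnavailable(f"{region} is down for tenant {tenant.tenant_id}")
    return region
```

The point of the sketch is the missing `else` branch: there is deliberately no "next best region", so an outage surfaces as an error rather than as a silent relocation.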


2. Cloud Run Configuration

| Setting | Value |
| --- | --- |
| Image | asia-south1-docker.pkg.dev/melmastoon-prod/services/tenant-service:<sha> (distroless Node 20) |
| CPU / Memory | 1 vCPU / 1 GiB (default), 2 vCPU / 2 GiB for me-central1 (highest tenant volume) |
| Concurrency | 80 requests / instance |
| Min instances | 2 per region |
| Max instances | 50 per region (auto-scale on CPU > 65 % and request concurrency > 70) |
| Timeout | 30 s (writes), 10 s (PDP authz/check) |
| Service account | tenant-service@melmastoon-prod.iam.gserviceaccount.com |
| VPC connector | vpc-tenant-internal (Cloud SQL + Memorystore reachable only via private IP) |
| Binary Authorization | required; image must be signed by cloudbuild-attestor |
| Egress | restricted (no public egress; AI / iam / billing all internal LB) |

Health probes: GET /healthz (liveness, 5 s), GET /readyz (readiness, 10 s) — see OBSERVABILITY §8.


3. Cloud SQL

| Setting | Value |
| --- | --- |
| Engine | Postgres 16 |
| Tier | db-custom-4-16384 (4 vCPU / 16 GiB); bumped to db-custom-8-32768 for me-central1 |
| HA | regional (synchronous standby in second zone) |
| Replicas | 1 read replica per region (analytics + read-heavy queries) |
| Backups | daily snapshot 02:00 local, 35 d retention; PITR enabled |
| Maintenance window | Wednesday 02:00–06:00 local |
| Encryption | CMEK from Cloud KMS keyring data |
| Connections | PgBouncer sidecar pool, 100 backend conns reserved per Cloud Run service |
| Extensions | ltree, pgcrypto, pg_trgm, pgaudit, uuid-ossp |

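PgBouncer sits in a Cloud Run sidecar pattern (separate revision) and runs in transaction-mode pooling. Because a backend connection can be handed to a different client after every transaction, the tenant context for RLS is set with SET LOCAL app.tenant_id inside each transaction: a session-level SET would leak to the next client that reuses the connection, whereas SET LOCAL is discarded at commit or rollback.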
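
A tenant-scoped transaction under transaction-mode pooling might look like the sketch below. It is written in Python against a generic DB-API 2.0 connection purely for illustration: the helper name `run_in_tenant_tx` is hypothetical, and the `%s` placeholder style and implicit transaction start are assumptions that hold for drivers such as psycopg. It uses Postgres's `set_config(name, value, is_local)` with `is_local = true`, which has the same transaction scope as SET LOCAL but accepts a bind parameter.

```python
from typing import Any, Callable


def run_in_tenant_tx(conn: Any, tenant_id: str, work: Callable[[Any], Any]) -> Any:
    """Run `work(cursor)` in one transaction with app.tenant_id set locally."""
    cur = conn.cursor()
    try:
        # Transaction-scoped setting: equivalent in scope to SET LOCAL, but
        # parameterised so the tenant id is never spliced into the SQL text.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        result = work(cur)
        conn.commit()  # the setting is discarded when the transaction ends
        return result
    except Exception:
        conn.rollback()  # also discards the setting, so nothing leaks
        raise
    finally:
        cur.close()
```

Every statement that relies on RLS has to run inside `work`, in the same transaction as the `set_config` call; a query issued afterwards may land on a different backend connection with no tenant context.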


4. Pub/Sub

Topics created via Terraform per EVENT_SCHEMAS §2. Subscriptions:

| Subscription | Type | Consumer |
| --- | --- | --- |
| tenant-self.iam-user-registered | push (Cloud Run OnUserRegistered) | self |
| tenant-self.iam-user-deleted | push | self |
| tenant-self.billing-cancelled | push | self |
| tenant-self.billing-reactivated | push | self |
| tenant-self.deletion-acked | push (saga) | self |

DLQ topics under …dlq.v1 route to audit-service and trigger a PagerDuty alert. Subscriptions enable ordered delivery, with ordering_key = tenant_id.
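
For the push subscriptions above, a handler first has to unwrap the Pub/Sub push envelope. The sketch below, in Python with the standard library only, shows one way to do that. The `PushEvent` type and `parse_push_envelope` name are hypothetical, the event payload is assumed to be JSON, and `orderingKey` is read defensively since it is only present when the message was published with an ordering key.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PushEvent:
    message_id: str
    ordering_key: str  # expected to carry the tenant_id
    payload: dict


def parse_push_envelope(body: bytes) -> PushEvent:
    """Decode the JSON body of a Pub/Sub push request."""
    envelope = json.loads(body)
    message = envelope["message"]
    # Pub/Sub base64-encodes the message data in push deliveries.
    raw = base64.b64decode(message.get("data", ""))
    payload = json.loads(raw or b"{}")  # assumption: payloads are JSON objects
    return PushEvent(
        message_id=message["messageId"],
        ordering_key=message.get("orderingKey", ""),
        payload=payload,
    )
```

A handler built on this would typically treat `message_id` as an idempotency key, since push delivery is at-least-once.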


5. Memorystore

| Setting | Value |
| --- | --- |
| Tier | Standard (HA, automatic failover) |
| Capacity | 5 GiB per region |
| TLS | required |
| Eviction | volatile-lru |

Key prefixes: t:<tenantId>:cfg, t:<tenantId>:mbr:<userId>, t:<tenantId>:flags, rl:<bucket>:<key>.
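
The key layout can be captured in a few builder functions so that no call site formats keys by hand. This is a sketch in Python; the function names are hypothetical, and the rule that rejects empty segments or segments containing `:` is an added assumption, meant to keep one tenant's identifier from colliding with another tenant's key space.

```python
def _segment(value: str) -> str:
    """Validate one key segment (assumed rule: non-empty, no ':')."""
    if not value or ":" in value:
        raise ValueError(f"invalid key segment: {value!r}")
    return value


def cfg_key(tenant_id: str) -> str:
    return f"t:{_segment(tenant_id)}:cfg"


def mbr_key(tenant_id: str, user_id: str) -> str:
    return f"t:{_segment(tenant_id)}:mbr:{_segment(user_id)}"


def flags_key(tenant_id: str) -> str:
    return f"t:{_segment(tenant_id)}:flags"


def rate_limit_key(bucket: str, key: str) -> str:
    # The trailing key is left unvalidated; it is the last segment.
    return f"rl:{_segment(bucket)}:{key}"
```

For example, `mbr_key("acme", "u42")` yields `t:acme:mbr:u42`.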


6. CI/CD

| Stage | Tool |
| --- | --- |
| Build | Cloud Build trigger on push to main and tag v* |
| Test | Cloud Build runs full unit + integration suite via Testcontainers (with Cloud Build's Docker daemon) |
| Sign | Binary Authorization attestor signs the image after tests pass |
| Deploy | Cloud Deploy pipeline: dev → staging → canary (5 %) → prod (10 %) → prod (50 %) → prod (100 %) |
| Rollback | One-click revert to previous Cloud Run revision |
| Migration | Flyway run as a Cloud Run Job before each deploy stage; backfills as separate jobs |

Canary criteria (auto-promote after 30 min if all green):

  • Error budget burn < 10 %.
  • p95 within 110 % of baseline.
  • Zero saga timeouts.
  • Zero tenant-isolation violations.
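
The four criteria and the 30-minute soak can be expressed as a single gate function. The sketch below is illustrative Python, not the Cloud Deploy verification config: the `CanaryMetrics` fields are hypothetical names, and "within 110 % of baseline" is read as p95 ≤ 1.10 × baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    error_budget_burn_pct: float
    p95_ms: float
    baseline_p95_ms: float
    saga_timeouts: int
    isolation_violations: int


def failed_criteria(m: CanaryMetrics) -> list[str]:
    """Return the names of the canary criteria that are not green."""
    failures = []
    if not m.error_budget_burn_pct < 10:
        failures.append("error budget burn")
    if m.p95_ms > 1.10 * m.baseline_p95_ms:
        failures.append("p95 latency")
    if m.saga_timeouts != 0:
        failures.append("saga timeouts")
    if m.isolation_violations != 0:
        failures.append("tenant-isolation violations")
    return failures


def may_promote(m: CanaryMetrics, minutes_green: int) -> bool:
    """Auto-promote only after 30 green minutes with no failed criterion."""
    return minutes_green >= 30 and not failed_criteria(m)
```

Returning the list of failed criteria, rather than a bare boolean, makes it straightforward to log why a promotion was held back.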

7. Secrets

All secrets are delivered via Secret Manager + Workload Identity. The tenant-service SA has roles/secretmanager.secretAccessor on:

  • tenant-postgres-password
  • tenant-pgcrypto-key
  • ai-orchestrator-token

No env-var secrets at runtime.


8. Resource Planning

Baseline (per region):

| Metric | Baseline | Notes |
| --- | --- | --- |
| Requests / s | ~ 2 000 | mostly authz/check and config reads |
| Cloud Run instances | 4 | typical, 50 burst |
| DB QPS | ~ 800 | 90 % read |
| DB CPU | ~ 30 % avg | well under threshold |
| Outbox publish rate | ~ 50 events/s | per region |
| Memorystore RAM | ~ 1.5 GiB | of 5 GiB |
| Monthly cost (estimate) | ~ $1 800 | Cloud SQL ~ 60 %, Cloud Run ~ 25 %, others ~ 15 % |

Scale targets at 100x current volume are documented in SERVICE_RISK_REGISTER.
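
The headroom implied by the baseline figures can be worked out directly. The Python sketch below only restates numbers from this document (the baseline table and the Cloud Run settings in section 2); the helper names are hypothetical.

```python
# Figures from the baseline table and the Cloud Run settings (per region).
MONTHLY_COST_USD = 1800
COST_SHARE_PCT = {"Cloud SQL": 60, "Cloud Run": 25, "others": 15}
MAX_INSTANCES = 50
CONCURRENCY_PER_INSTANCE = 80


def cost_split(total_usd: int, share_pct: dict[str, int]) -> dict[str, float]:
    """Split a monthly total by percentage share."""
    return {name: total_usd * pct / 100 for name, pct in share_pct.items()}


def utilisation_pct(used: float, capacity: float) -> float:
    """Share of capacity currently in use, in percent."""
    return 100 * used / capacity


# Upper bound on in-flight requests per region at full scale-out.
max_in_flight = MAX_INSTANCES * CONCURRENCY_PER_INSTANCE
```

With these inputs, the cost split comes to roughly $1 080 for Cloud SQL, $450 for Cloud Run, and $270 for everything else; Memorystore sits at 30 % of its 5 GiB, typical Cloud Run usage at 8 % of the instance ceiling, and a fully scaled-out region can hold 4 000 concurrent requests.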


9. Service Dependencies (boot order)

  1. iam-service (JWKS endpoint, user lookup)
  2. tenant-service (this) — boots after iam is ready
  3. All other services — must wait for tenant-service readyz before serving traffic

If tenant-service readyz is failing in a region for more than 60 s, the gateway in that region returns 503 MELMASTOON.AUTH.PDP_UNAVAILABLE to all callers (fail-closed).
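
The fail-closed rule can be sketched as a small state machine on the gateway side. This is an illustrative Python sketch, not the gateway's code: the `ReadinessGate` class is a hypothetical name, the clock is injected as a plain number of seconds, and behaviour during the first 60 s of failure is left to the caller because the rule above does not specify it.

```python
from typing import Optional

FAIL_CLOSED_AFTER_S = 60.0
ERROR_CODE = "MELMASTOON.AUTH.PDP_UNAVAILABLE"


class ReadinessGate:
    """Tracks how long tenant-service readyz has been failing in one region."""

    def __init__(self) -> None:
        self._failing_since: Optional[float] = None

    def record_probe(self, ready: bool, now: float) -> None:
        """Record one readyz result; `now` is a monotonic time in seconds."""
        if ready:
            self._failing_since = None
        elif self._failing_since is None:
            self._failing_since = now  # start of the current failure streak

    def check(self, now: float) -> Optional[tuple[int, str]]:
        """Return (503, code) once failing for more than 60 s, else None."""
        if self._failing_since is None:
            return None
        if now - self._failing_since > FAIL_CLOSED_AFTER_S:
            return (503, ERROR_CODE)
        return None
```

A single successful probe resets the streak, so the gate reopens as soon as readyz recovers.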


10. Disaster Recovery

| Scenario | RPO | RTO | Procedure |
| --- | --- | --- | --- |
| Cloud Run outage in one region | 0 | 5 min | LB sheds traffic; clients see 503; Cloud Run auto-recovers |
| Cloud SQL primary failure | < 1 min | 5 min | HA failover automatic |
| Cloud SQL region loss | 1 h | 30 min (manual) | Restore from cross-region backup; declare region-down per residency |
| Memorystore loss | 0 | 1 min | Cold cache; latency degrades within SLO grace |
| Pub/Sub outage | 0 | unknown | Outbox absorbs; events drain on recovery |
| Bad migration | 0 | 5 min | Cloud Deploy revert + Flyway down-migration |
| Image vulnerability | n/a | 60 min | Pull image; Binary Authorization revokes attestation; revert to previous |

Runbooks in runbooks/tenant-service/dr/.