tenant-service — DEPLOYMENT_TOPOLOGY
Companion: SERVICE_OVERVIEW · DATA_MODEL · OBSERVABILITY · FAILURE_MODES · Platform: 02 Enterprise Architecture §9 GCP topology
Tenant-service is deployed as a regional Cloud Run service in each residency region. It is active-active across regions for read traffic and active-standby per residency for writes (writes pinned to the residency region of the tenant, per data-residency policy).
1. GCP Topology
                 ┌──────────────────────── melmastoon-prod (GCP project) ─────────────────────────┐
                 │                                                                                │
                 │  asia-south1 (Mumbai)         me-central1 (Doha)        europe-west1 (Belgium) │
                 │  ┌──────────────────┐        ┌──────────────────┐        ┌──────────────────┐  │
                 │  │ Cloud Run        │        │ Cloud Run        │        │ Cloud Run        │  │
    internal LB  │  │ tenant-service   │        │ tenant-service   │        │ tenant-service   │  │
    ────────────►│  │ min=2 max=50     │        │ min=2 max=50     │        │ min=2 max=50     │  │
                 │  └──────────────────┘        └──────────────────┘        └──────────────────┘  │
                 │           │                           │                           │            │
                 │           │ workload identity         │                           │            │
                 │           ▼                           ▼                           ▼            │
                 │  ┌──────────────────┐        ┌──────────────────┐        ┌──────────────────┐  │
                 │  │ Cloud SQL HA     │        │ Cloud SQL HA     │        │ Cloud SQL HA     │  │
                 │  │ primary + sync   │        │ primary + sync   │        │ primary + sync   │  │
                 │  │ + 1 read replica │        │ + 1 read replica │        │ + 1 read replica │  │
                 │  └──────────────────┘        └──────────────────┘        └──────────────────┘  │
                 │                                                                                │
                 │  ┌──── Memorystore (Redis Standard, regional) per region ───────────────────┐  │
                 │  └──────────────────────────────────────────────────────────────────────────┘  │
                 │                                                                                │
                 │  ┌──── Pub/Sub topics (global, with per-region subs) ───────────────────────┐  │
                 │  │ melmastoon.tenant.* + tenant-deletion-saga DLQs                          │  │
                 │  └──────────────────────────────────────────────────────────────────────────┘  │
                 │                                                                                │
                 │  ┌──── Secret Manager (CMEK), Cloud KMS, Cloud Build, Artifact Registry ────┐  │
                 │  └──────────────────────────────────────────────────────────────────────────┘  │
                 └────────────────────────────────────────────────────────────────────────────────┘
Cross-region failover: tenant data does not cross residency boundaries. If a region fails, tenants pinned to that region experience downtime; we do not silently relocate data. The health-check-driven LB routes each tenant's requests only to that tenant's residency region. The platform's RTO target for a regional outage is 30 min via the DR runbook; cutover is manual, never automatic.
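The routing rule above can be sketched as a small pure function. This is an illustration only: the function name and the `healthy_regions` input are hypothetical, and the real decision is made by the LB and gateway, not by application code. The sketch shows the one property that matters, which is that an unhealthy residency region yields "no route" rather than a fallback to another region.

```python
from typing import Optional, Set

# Residency regions taken from the topology diagram.
RESIDENCY_REGIONS = {"asia-south1", "me-central1", "europe-west1"}


def route_for_tenant(residency_region: str, healthy_regions: Set[str]) -> Optional[str]:
    """Return the region that may serve this tenant, or None if it is down.

    Hypothetical helper: there is deliberately no fallback branch, because
    tenant data must not leave its residency region.
    """
    if residency_region not in RESIDENCY_REGIONS:
        raise ValueError(f"unknown residency region: {residency_region}")
    if residency_region in healthy_regions:
        return residency_region
    # Region is unhealthy: the caller surfaces an error (for example a 503)
    # instead of relocating the request to another region.
    return None
```

A caller would translate `None` into an error response and leave recovery to the manual DR runbook.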
2. Cloud Run Configuration
| Setting | Value |
|---|---|
| Image | asia-south1-docker.pkg.dev/melmastoon-prod/services/tenant-service:<sha> (distroless Node 20) |
| CPU / Memory | 1 vCPU / 1 GiB (default), 2 vCPU / 2 GiB for me-central1 (highest tenant volume) |
| Concurrency | 80 requests / instance |
| Min instances | 2 per region |
| Max instances | 50 per region (auto-scale on CPU > 65 % and request concurrency > 70) |
| Timeout | 30 s (writes), 10 s (PDP authz/check) |
| Service account | tenant-service@melmastoon-prod.iam.gserviceaccount.com |
| VPC connector | vpc-tenant-internal (Cloud SQL + Memorystore reachable only via private IP) |
| Binary Authorization | required; image must be signed by cloudbuild-attestor |
| Egress | restricted (no public egress; AI / iam / billing all internal LB) |
Health probes: GET /healthz (liveness, 5 s), GET /readyz (readiness, 10 s) — see OBSERVABILITY §8.
3. Cloud SQL
| Setting | Value |
|---|---|
| Engine | Postgres 16 |
| Tier | db-custom-4-16384 (4 vCPU / 16 GiB) — bumped to db-custom-8-32768 for me-central1 |
| HA | regional (synchronous standby in second zone) |
| Replicas | 1 read replica per region (analytics + read-heavy queries) |
| Backups | daily snapshot 02:00 local, 35 d retention; PITR enabled |
| Maintenance window | Wednesday 02:00–06:00 local |
| Encryption | CMEK from Cloud KMS keyring data |
| Connections | PgBouncer sidecar pool, 100 backend conns reserved per Cloud Run service |
| Extensions | ltree, pgcrypto, pg_trgm, pgaudit, uuid-ossp |
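PgBouncer runs in a Cloud Run sidecar pattern (separate revision). Transaction-mode pooling is required, and the RLS tenant context is set with SET LOCAL app.tenant_id inside each transaction: the setting is scoped to that transaction, so it cannot leak to another client that later reuses the same backend connection, which a session-level SET would.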
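A minimal sketch of setting the tenant context per transaction, assuming a PEP 249 (DB-API) connection with psycopg-style `%s` placeholders. The helper name `tenant_transaction` is hypothetical. It uses PostgreSQL's `set_config(name, value, true)`, which is the parameterizable equivalent of `SET LOCAL` (the third argument makes the setting transaction-local); `SET LOCAL` itself does not accept bind parameters.

```python
from contextlib import contextmanager

# Transaction-local equivalent of: SET LOCAL app.tenant_id = '<tenant>'
SET_TENANT_SQL = "SELECT set_config('app.tenant_id', %s, true)"


@contextmanager
def tenant_transaction(conn, tenant_id):
    """Run one transaction with the RLS tenant context set (hypothetical helper).

    Assumes the driver opens a transaction implicitly on first execute,
    as DB-API drivers do when autocommit is off.
    """
    cur = conn.cursor()
    try:
        # Must be the first statement so every later query sees the tenant.
        cur.execute(SET_TENANT_SQL, (tenant_id,))
        yield cur
        conn.commit()      # the setting is discarded at commit
    except Exception:
        conn.rollback()    # and also discarded at rollback
        raise
    finally:
        cur.close()
```

Because the setting ends with the transaction, PgBouncer can hand the backend connection to a different client afterwards without carrying the previous tenant's context.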
4. Pub/Sub
Topics created via Terraform per EVENT_SCHEMAS §2. Subscriptions:
| Subscription | Type | Consumer |
|---|---|---|
| tenant-self.iam-user-registered | push (Cloud Run OnUserRegistered) | self |
| tenant-self.iam-user-deleted | push | self |
| tenant-self.billing-cancelled | push | self |
| tenant-self.billing-reactivated | push | self |
| tenant-self.deletion-acked | push (saga) | self |
DLQ topics under …dlq.v1 route to audit-service and trigger a PagerDuty alert. Subscriptions enable ordered delivery with ordering_key = tenant_id.
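Since all subscriptions above are push subscriptions, each handler receives a JSON envelope over HTTP. The sketch below decodes that envelope, assuming the usual Pub/Sub push shape (`message.data` base64-encoded, `message.messageId`, `message.orderingKey`, top-level `subscription`); the `PushMessage` type and function name are hypothetical, and the field names should be checked against the Pub/Sub documentation before use.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PushMessage:
    message_id: str
    ordering_key: str   # expected to carry tenant_id, per the text above
    data: bytes         # decoded event payload
    subscription: str


def parse_push_envelope(body: bytes) -> PushMessage:
    """Decode a Pub/Sub push request body (assumed envelope shape)."""
    envelope = json.loads(body)
    msg = envelope["message"]
    return PushMessage(
        message_id=msg["messageId"],
        ordering_key=msg.get("orderingKey", ""),
        data=base64.b64decode(msg.get("data", "")),
        subscription=envelope.get("subscription", ""),
    )
```

A handler would typically return a 2xx status only after the event is processed, since a non-success response causes Pub/Sub to redeliver the message.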
5. Memorystore
| Setting | Value |
|---|---|
| Tier | Standard (HA, automatic failover) |
| Capacity | 5 GiB per region |
| TLS | required |
| Eviction | volatile-lru |
Key prefixes: t:<tenantId>:cfg, t:<tenantId>:mbr:<userId>, t:<tenantId>:flags, rl:<bucket>:<key>.
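The key layout above can be kept in one place with small builder functions. This is a sketch; the function names are hypothetical, and the rule that key components must not contain `:` is an assumption added here to keep keys unambiguous, not something the source states.

```python
def _part(value: str) -> str:
    """Validate one key component (assumed rule: non-empty, no ':')."""
    if not value or ":" in value:
        raise ValueError(f"invalid key component: {value!r}")
    return value


def tenant_config_key(tenant_id: str) -> str:
    return f"t:{_part(tenant_id)}:cfg"


def membership_key(tenant_id: str, user_id: str) -> str:
    return f"t:{_part(tenant_id)}:mbr:{_part(user_id)}"


def tenant_flags_key(tenant_id: str) -> str:
    return f"t:{_part(tenant_id)}:flags"


def rate_limit_key(bucket: str, key: str) -> str:
    return f"rl:{_part(bucket)}:{_part(key)}"
```

Note that with `volatile-lru`, Redis only evicts keys that have an expiry set, so cached entries written under these prefixes need a TTL to be eligible for eviction.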
6. CI/CD
| Stage | Tool |
|---|---|
| Build | Cloud Build trigger on push to main and tag v* |
| Test | Cloud Build runs full unit + integration suite via Testcontainers (with Cloud Build's Docker daemon) |
| Sign | Binary Authorization attestor signs the image after tests pass |
| Deploy | Cloud Deploy pipeline: dev → staging → canary (5 %) → prod (10 %) → prod (50 %) → prod (100 %) |
| Rollback | One-click revert to previous Cloud Run revision |
| Migration | Flyway run as a Cloud Run Job before each deploy stage; backfills as separate jobs |
Canary criteria (auto-promote after 30 min if all green):
- Error budget burn < 10 %.
- p95 within 110 % of baseline.
- Zero saga timeouts.
- Zero tenant-isolation violations.
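The canary criteria above can be expressed as a single gate function. This is a sketch under assumptions: the metric field names are hypothetical, "within 110 % of baseline" is read as p95 ≤ 1.10 × baseline p95, and the real gate lives in the Cloud Deploy pipeline rather than in service code.

```python
from dataclasses import dataclass
from typing import List

OBSERVATION_WINDOW_MIN = 30        # auto-promote only after 30 min
MAX_ERROR_BUDGET_BURN_PCT = 10.0   # burn must stay below 10 %
MAX_P95_RATIO = 1.10               # p95 within 110 % of baseline


@dataclass(frozen=True)
class CanaryMetrics:
    minutes_observed: float
    error_budget_burn_pct: float
    p95_ms: float
    baseline_p95_ms: float
    saga_timeouts: int
    isolation_violations: int


def failed_criteria(m: CanaryMetrics) -> List[str]:
    """Return the names of all criteria that are not green."""
    failures = []
    if m.error_budget_burn_pct >= MAX_ERROR_BUDGET_BURN_PCT:
        failures.append("error budget burn")
    if m.p95_ms > MAX_P95_RATIO * m.baseline_p95_ms:
        failures.append("p95 latency")
    if m.saga_timeouts != 0:
        failures.append("saga timeouts")
    if m.isolation_violations != 0:
        failures.append("tenant-isolation violations")
    return failures


def should_auto_promote(m: CanaryMetrics) -> bool:
    """Promote only when the window has elapsed and every criterion is green."""
    return m.minutes_observed >= OBSERVATION_WINDOW_MIN and not failed_criteria(m)
```

Returning the list of failed criteria, rather than a bare boolean, makes it easier to report why a canary was held back.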
7. Secrets
All secrets via Secret Manager + Workload Identity. tenant-service SA has roles/secretmanager.secretAccessor on:
- tenant-postgres-password
- tenant-pgcrypto-key
- ai-orchestrator-token
No env-var secrets at runtime.
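A sketch of reading one of these secrets at runtime instead of from an environment variable. It assumes the secrets live in the `melmastoon-prod` project and that the `latest` version alias is used; both are assumptions. The `client` argument is expected to be a `google.cloud.secretmanager.SecretManagerServiceClient` (or any object with a compatible `access_secret_version` method), authenticated through the service account rather than a key file.

```python
PROJECT_ID = "melmastoon-prod"  # assumption: secrets are in the service's project

# Secrets the tenant-service SA is granted access to, per the list above.
ALLOWED_SECRETS = (
    "tenant-postgres-password",
    "tenant-pgcrypto-key",
    "ai-orchestrator-token",
)


def secret_version_path(secret_id: str, version: str = "latest") -> str:
    """Build the Secret Manager resource name for a secret version."""
    if secret_id not in ALLOWED_SECRETS:
        raise PermissionError(f"secret not granted to tenant-service: {secret_id}")
    return f"projects/{PROJECT_ID}/secrets/{secret_id}/versions/{version}"


def read_secret(client, secret_id: str) -> str:
    """Fetch a secret value through the injected Secret Manager client."""
    response = client.access_secret_version(
        request={"name": secret_version_path(secret_id)}
    )
    return response.payload.data.decode("utf-8")
```

Injecting the client keeps the helper testable without network access and avoids caching secret values in process-wide globals.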
8. Resource Planning
Baseline (per region):
| Metric | Baseline | Notes |
|---|---|---|
| Requests / s | ~ 2 000 | mostly authz/check and config reads |
| Cloud Run instances | 4 | typical, 50 burst |
| DB QPS | ~ 800 | 90 % read |
| DB CPU | ~ 30 % avg | well under threshold |
| Outbox publish rate | ~ 50 events/s | per region |
| Memorystore RAM | ~ 1.5 GiB | of 5 GiB |
| Monthly cost (estimate) | ~ $1 800 | Cloud SQL ~ 60 %, Cloud Run ~ 25 %, others ~ 15 % |
Scale targets at 100x current volume documented in SERVICE_RISK_REGISTER.
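As a rough sanity check on the baseline, the figures above can be combined with the Cloud Run concurrency setting using Little's law (in-flight requests = arrival rate × mean latency). This is back-of-the-envelope arithmetic under the assumption that every instance can actually sustain its configured concurrency of 80; the autoscaler targets described earlier would add instances before that ceiling is reached.

```python
CONCURRENCY_PER_INSTANCE = 80   # Cloud Run concurrency setting
TYPICAL_INSTANCES = 4           # baseline instance count per region
MAX_INSTANCES = 50              # burst ceiling per region
BASELINE_RPS = 2000             # approximate requests per second per region


def concurrent_slots(instances: int) -> int:
    """Upper bound on in-flight requests for a given instance count."""
    return instances * CONCURRENCY_PER_INSTANCE


def max_mean_latency_s(rps: float, instances: int) -> float:
    """Little's law rearranged: W = L / lambda.

    Largest mean request latency (seconds) at which `instances` can
    absorb `rps` without exceeding their configured concurrency.
    """
    return concurrent_slots(instances) / rps


typical_slots = concurrent_slots(TYPICAL_INSTANCES)              # 320 in-flight requests
burst_slots = concurrent_slots(MAX_INSTANCES)                    # 4000 in-flight requests
typical_budget = max_mean_latency_s(BASELINE_RPS, TYPICAL_INSTANCES)  # 0.16 s
```

Under these assumptions, four instances cover the baseline load only while mean latency stays at or below roughly 160 ms; slower responses would push the service to scale out.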
9. Service Dependencies (boot order)
1. iam-service (JWKS endpoint, user lookup)
2. tenant-service (this): boots after iam is ready
3. All other services: must wait for tenant-service readyz before serving traffic
If tenant-service readyz is failing in a region for more than 60 s, the gateway in that region returns 503 MELMASTOON.AUTH.PDP_UNAVAILABLE to all callers (fail-closed).
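The fail-closed rule above can be sketched as a small state tracker in the gateway. The class and method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to test. The source does not say how the gateway behaves during the first 60 s of failures, so this sketch simply returns no override in that window.

```python
from typing import Optional, Tuple

FAIL_CLOSED_AFTER_S = 60.0
ERROR_CODE = "MELMASTOON.AUTH.PDP_UNAVAILABLE"


class ReadinessGate:
    """Tracks how long tenant-service readyz has been failing in a region."""

    def __init__(self) -> None:
        self._failing_since: Optional[float] = None

    def record_probe(self, ok: bool, now: float) -> None:
        """Record one readyz probe result at time `now` (seconds)."""
        if ok:
            self._failing_since = None          # any success resets the clock
        elif self._failing_since is None:
            self._failing_since = now           # start of a failure streak

    def check(self, now: float) -> Optional[Tuple[int, str]]:
        """Return (503, code) once readyz has failed for more than 60 s."""
        if (
            self._failing_since is not None
            and now - self._failing_since > FAIL_CLOSED_AFTER_S
        ):
            return 503, ERROR_CODE
        return None
```

In a gateway, `check` would run before forwarding each request, and a non-`None` result would be returned to the caller immediately.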
10. Disaster Recovery
| Scenario | RPO | RTO | Procedure |
|---|---|---|---|
| Cloud Run outage in one region | 0 | 5 min | LB sheds traffic; clients see 503; Cloud Run auto-recovers |
| Cloud SQL primary failure | < 1 min | 5 min | HA failover automatic |
| Cloud SQL region loss | 1 h | 30 min (manual) | Restore from cross-region backup; declare region-down per residency |
| Memorystore loss | 0 | 1 min | Cold cache; latency degrades within SLO grace |
| Pub/Sub outage | 0 | unknown | Outbox absorbs; events drain on recovery |
| Bad migration | 0 | 5 min | Cloud Deploy revert + Flyway down-migration |
| Image vulnerability | n/a | 60 min | Pull image; Binary Authorization revokes attestation; revert to previous |
Runbooks in runbooks/tenant-service/dr/.