DEPLOYMENT_TOPOLOGY — analytics-service

Sibling: SECURITY_MODEL · OBSERVABILITY · platform anchor: docs/02 §10 Deployment

analytics-service ships as four Cloud Run units per region plus orchestrated GCP-native components:

Unit	Purpose	Runtime
`analytics-api`	REST surface (queries, dashboards, widgets, metrics, internal sync)	Cloud Run service, Node 20 LTS
`analytics-pubsub-sink`	Pub/Sub → `events_raw.*` BigQuery sink + DLQ	Cloud Run service (push subscriptions)
`analytics-etl-worker`	Scheduled aggregation + DQ + forecast writeback	Cloud Run Job invoked by Cloud Workflows + Cloud Composer
`analytics-looker-broker`	Looker Studio embed token + binding management	Cloud Run service (internal-only)

Container base: node:20-alpine + minimal libs; no headless Chromium.

1. Compute & orchestration

Unit	Min	Max	Concurrency	CPU / Mem	Notes
`analytics-api`	2	50	80	2 / 4 GiB	Stateless; warm pool keeps latency stable
`analytics-pubsub-sink`	2	30	100	1 / 2 GiB	Push subscription with ack deadline 60 s
`analytics-etl-worker`	n/a	parallel up to 20 jobs	1	4 / 8 GiB	Cloud Run Jobs; per-job timeout 60 min
`analytics-looker-broker`	1	5	40	0.5 / 1 GiB	Mints embed JWTs
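
The table implies a request ceiling per unit: max instances multiplied by per-instance concurrency. A minimal sketch of that arithmetic follows; the `UnitScaling` type and both helper functions are illustrative names, not part of the service, and only the numbers come from the table above.

```typescript
// Illustrative only: type and function names are assumptions.
// Instance counts and concurrency values are copied from the table above.
interface UnitScaling {
  name: string;
  minInstances: number;
  maxInstances: number;
  concurrency: number;
}

const units: UnitScaling[] = [
  { name: "analytics-api", minInstances: 2, maxInstances: 50, concurrency: 80 },
  { name: "analytics-pubsub-sink", minInstances: 2, maxInstances: 30, concurrency: 100 },
  { name: "analytics-looker-broker", minInstances: 1, maxInstances: 5, concurrency: 40 },
];

// Upper bound on simultaneous requests a unit can hold at full scale-out.
function maxInFlight(u: UnitScaling): number {
  return u.maxInstances * u.concurrency;
}

// Simultaneous requests the warm pool can absorb before any scale-out.
function warmInFlight(u: UnitScaling): number {
  return u.minInstances * u.concurrency;
}
```

For `analytics-api` this gives 160 in-flight requests on the warm pool and 4,000 at the instance ceiling.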

Orchestration:

Cloud Workflows drives the ETL DAGs (extract → transform → load → DQ → publish). Each step invokes the `analytics-etl-worker` job with parameters.
Cloud Composer (Airflow 2.x) is used only for cross-domain DAGs (e.g., the demand-forecast feature pipeline that spans multiple services), with a single small environment per region.
Cloud Scheduler triggers Workflows on cron, following each metric definition's cadence.
Pub/Sub uses push subscriptions for the sink and pull subscriptions for control-plane events.
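
The step ordering of an ETL DAG can be sketched as a plan of job invocations. This is only an illustration: the parameter names (`metricId`, `windowStart`), the `planRun` and `runPlan` helpers, and the stop-at-first-failure behaviour are assumptions, and `execute` is a synchronous stand-in for the real job call.

```typescript
// Step order taken from the DAG description above.
const ETL_STEPS = ["extract", "transform", "load", "dq", "publish"] as const;
type EtlStep = (typeof ETL_STEPS)[number];

interface JobInvocation {
  job: "analytics-etl-worker";
  step: EtlStep;
  params: Record<string, string>; // hypothetical parameter bag
}

// Builds one invocation per step, in DAG order.
function planRun(metricId: string, windowStart: string): JobInvocation[] {
  return ETL_STEPS.map((step): JobInvocation => ({
    job: "analytics-etl-worker",
    step,
    params: { metricId, windowStart },
  }));
}

// Runs the plan in order and returns the first failing step, or null if all succeed.
// Halting on the first failure is an assumption, not documented behaviour.
function runPlan(
  plan: JobInvocation[],
  execute: (inv: JobInvocation) => boolean,
): EtlStep | null {
  for (const inv of plan) {
    if (!execute(inv)) return inv.step;
  }
  return null;
}
```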

2. Regions & residency

Region	Tenants	Cloud SQL	BigQuery dataset region	Composer
`europe-west3`	EU/MENA tenants	regional HA	`EU` (multi-region) → optional regional pin	`europe-west3`
`asia-south1`	South Asia (PK/IN tenants where allowed)	regional HA	`asia-south1`	`asia-south1`
`me-central1`	GCC	regional HA	`me-central1`	`me-central1`

Cross-region replication is forbidden. Per-tenant residency is decided at tenant creation by tenant-service and recorded; deployments are duplicated across regions.
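
A residency guard consistent with the rule above might look like the following sketch. The `TenantResidency` record and `assertResident` function are hypothetical; the source states only that tenant-service records the region at tenant creation.

```typescript
// Regions listed in the table above.
type Region = "europe-west3" | "asia-south1" | "me-central1";

// Hypothetical shape of the residency record held by tenant-service.
interface TenantResidency {
  tenantId: string;
  region: Region;
}

// Throws when a deployment in one region is asked to serve a tenant
// whose data is pinned to another region.
function assertResident(tenant: TenantResidency, servingRegion: Region): void {
  if (tenant.region !== servingRegion) {
    throw new Error(
      `Tenant ${tenant.tenantId} is resident in ${tenant.region}; refusing to serve from ${servingRegion}`,
    );
  }
}
```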

3. Service accounts (least privilege)

GSA	Scope
`analytics-api@…`	Cloud SQL Client; BigQuery Data Viewer on `analytics_curated.*`; BigQuery Job User; Secret Manager accessor (signing key)
`analytics-sink@…`	Pub/Sub Subscriber; BigQuery Data Editor on `events_raw.*`; KMS Encrypter
`analytics-etl@…`	BigQuery Job User; Data Editor on `analytics_curated.*` and `dq_results.*`; Cloud SQL Client (write)
`analytics-looker@…`	KMS Signer (embed key); Cloud SQL Client (read on `tenant_views.access_bindings`)
`looker-studio-<tenantId>@…`	Per-tenant principal; granted to authorized views only

Workload Identity binds Kubernetes/Cloud Run identities to GSAs. No JSON service-account keys are ever materialized.

4. Networking

All units use `internal-and-cloud-load-balancing` ingress; the only public path is API gateway → BFF.
Egress to BigQuery, Pub/Sub, Secret Manager, and KMS goes through Private Google Access.
VPC-SC perimeter encloses BigQuery + GCS + Pub/Sub for analytics; ingress from outside perimeter denied.

5. Configuration (12-factor)

Env-driven; all secrets via Secret Manager refs. Sample (prod):

NODE_ENV=production
SERVICE_NAME=analytics-service
REGION=europe-west3
DATABASE_URL=__resolved_at_boot__
BIGQUERY_PROJECT=ghasi-melmastoon-prod
BIGQUERY_LOCATION=europe-west3
BIGQUERY_CURATED_DATASET=analytics_curated
BIGQUERY_RAW_DATASET=events_raw
PUBSUB_PROJECT=ghasi-melmastoon-prod
PUBSUB_DLQ_TOPIC=analytics.dlq
DEFAULT_QUERY_BYTE_CAP=1073741824           # 1 GiB
DEFAULT_TENANT_DAILY_BUDGET=53687091200     # 50 GiB
LOOKER_EMBED_KMS_KEY=projects/.../cryptoKeys/melmastoon-analytics-embed-signer
AI_ORCHESTRATOR_BASE_URL=https://ai-orchestrator.internal
AI_ORCHESTRATOR_AUDIENCE=https://ai-orchestrator.internal
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.melmastoon.internal
LOG_LEVEL=info
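
A boot-time loader for a subset of these variables could be sketched as below. The `AnalyticsConfig` shape and the `loadConfig`, `required`, and `requiredBytes` helpers are hypothetical; the env map is passed in explicitly so the sketch has no runtime dependencies.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical typed view over a subset of the variables listed above.
interface AnalyticsConfig {
  region: string;
  bigqueryProject: string;
  bigqueryLocation: string;
  defaultQueryByteCap: number;
  defaultTenantDailyBudget: number;
  logLevel: string;
}

// Returns the trimmed value or fails fast so a misconfigured revision never serves traffic.
function required(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required env var: ${key}`);
  }
  return value.trim();
}

// Parses a positive byte count; rejects values that cannot be represented exactly.
function requiredBytes(env: Env, key: string): number {
  const n = Number(required(env, key));
  if (!Number.isSafeInteger(n) || n <= 0) {
    throw new Error(`Env var ${key} must be a positive integer byte count`);
  }
  return n;
}

function loadConfig(env: Env): AnalyticsConfig {
  return {
    region: required(env, "REGION"),
    bigqueryProject: required(env, "BIGQUERY_PROJECT"),
    bigqueryLocation: required(env, "BIGQUERY_LOCATION"),
    defaultQueryByteCap: requiredBytes(env, "DEFAULT_QUERY_BYTE_CAP"),
    defaultTenantDailyBudget: requiredBytes(env, "DEFAULT_TENANT_DAILY_BUDGET"),
    logLevel: required(env, "LOG_LEVEL"),
  };
}
```

In a Node process this would be called once at startup as `loadConfig(process.env)`.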

6. Deploy pipeline

Cloud Deploy targets per region:

build  →  unit-and-integration-tests  →  schema-drift-check  →  artifact-registry-push
        ↓
    deploy:dev (auto)  →  smoke + load smoke
        ↓
    deploy:stg (auto on develop)  →  Pact verify + DQ replay + canary 10% / 30 min
        ↓
    deploy:prod-eu (manual approve)  →  canary 10% / 30 min  →  100%
        ↓
    deploy:prod-asia / prod-me  (parallel after EU green)
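
The promotion order in the diagram can be expressed as a prerequisite map. The sketch below is a plain model of that ordering, not Cloud Deploy configuration; the `Target` type, the `PREREQ` map, and the `deployable` function are illustrative names.

```typescript
type Target = "dev" | "stg" | "prod-eu" | "prod-asia" | "prod-me";

// Each target's immediate prerequisite, read off the pipeline diagram.
const PREREQ: Record<Target, Target | null> = {
  dev: null,
  stg: "dev",
  "prod-eu": "stg",
  "prod-asia": "prod-eu",
  "prod-me": "prod-eu",
};

// Targets that are not yet green but whose prerequisite is green.
function deployable(green: ReadonlySet<Target>): Target[] {
  return (Object.keys(PREREQ) as Target[]).filter((target) => {
    if (green.has(target)) return false;
    const prerequisite = PREREQ[target];
    return prerequisite === null || green.has(prerequisite);
  });
}
```

Once `dev`, `stg`, and `prod-eu` are green, both `prod-asia` and `prod-me` become deployable at the same time, which matches the parallel final stage.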

Migrations:

Postgres migrations run from a one-shot Cloud Run Job before traffic shift.
BigQuery DDL is applied via Terraform; new tables ship as additive changes, while renames and breaking changes follow two-phase coexistence (MIGRATION_PLAN).

Rollback: each release is a Cloud Run revision; rollback flips traffic to previous revision in seconds. ETL job rollbacks restore Workflow definition + worker container.

7. Capacity & cost envelope (per-region steady state)

Resource	Estimate
`analytics-api`	4 instances avg, 12 peak
`analytics-pubsub-sink`	6 instances avg, 18 peak
`analytics-etl-worker`	~120 job runs/day (mixed cadences)
Cloud SQL	2 vCPU / 8 GiB HA
BigQuery storage (curated)	~10 GiB/tenant/year (active), ~3 GiB/tenant/year (long-term)
BigQuery slots	reservation 200 baseline + autoscale 200
Composer	small env (3 worker nodes)

Cost guardrails: per-tenant byte budgets (default 50 GiB/day), reservation autoscale ceiling, snapshot generators auto-paused when budget exceeded (SECURITY_MODEL §9).
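
A pre-flight check combining the per-query byte cap with the per-tenant daily budget might look like the sketch below. The `BudgetState` type, the `checkQuery` function, and the decision labels are assumptions; the source does not describe how enforcement is implemented, only the default limits.

```typescript
const GIB = 1024 * 1024 * 1024;

// Defaults from the configuration sample: 1 GiB per query, 50 GiB per tenant per day.
const DEFAULT_QUERY_BYTE_CAP = 1 * GIB;
const DEFAULT_TENANT_DAILY_BUDGET = 50 * GIB;

// Hypothetical per-tenant accounting record.
interface BudgetState {
  dailyBudgetBytes: number;
  usedTodayBytes: number;
}

type Decision = "allow" | "reject_query_cap" | "reject_daily_budget";

// Decides whether a query with the given estimated scan size may run.
function checkQuery(
  estimatedBytes: number,
  queryCapBytes: number,
  state: BudgetState,
): Decision {
  if (estimatedBytes > queryCapBytes) return "reject_query_cap";
  if (state.usedTodayBytes + estimatedBytes > state.dailyBudgetBytes) {
    return "reject_daily_budget";
  }
  return "allow";
}
```

The byte estimate would typically come from a BigQuery dry run before the query is submitted.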

8. Disaster recovery

Postgres: PITR 7 days, daily snapshot 35-day retention; HA replica.
BigQuery: time-travel 7 days; snapshot tables for curated layer weekly with 90-day retention.
Workflows / Composer: definitions in IaC (Terraform); restorable by re-applying.
Pub/Sub: subscription retention 7 days; replay possible from message storage.
RTO / RPO: RTO 30 min (region failover possible only within residency); RPO 5 min for Postgres, 1 h for curated tables (replayable from raw).
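
The Postgres retention windows above translate into a simple choice of recovery path for a given restore target. The sketch assumes that PITR covers the most recent 7 days and daily snapshots cover up to 35 days; the `postgresRecoveryPath` function and its labels are illustrative.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

type PgRecovery = "pitr" | "daily-snapshot" | "unrecoverable";

// Picks the finest-grained recovery mechanism whose retention still covers the target time.
// Both arguments are epoch milliseconds.
function postgresRecoveryPath(targetMs: number, nowMs: number): PgRecovery {
  const ageDays = (nowMs - targetMs) / DAY_MS;
  if (ageDays < 0) throw new Error("Restore target is in the future");
  if (ageDays <= 7) return "pitr"; // PITR window: 7 days
  if (ageDays <= 35) return "daily-snapshot"; // snapshot retention: 35 days
  return "unrecoverable";
}
```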

Cross-references: SECURITY_MODEL §3, OBSERVABILITY §10 cost, MIGRATION_PLAN.
