DEPLOYMENT_TOPOLOGY — analytics-service

Sibling: SECURITY_MODEL · OBSERVABILITY · platform anchor: docs/02 §10 Deployment

analytics-service ships as four Cloud Run units per region plus orchestrated GCP-native components:

| Unit | Purpose | Runtime |
|---|---|---|
| analytics-api | REST surface (queries, dashboards, widgets, metrics, internal sync) | Cloud Run service, Node 20 LTS |
| analytics-pubsub-sink | Pub/Sub → events_raw.* BigQuery sink + DLQ | Cloud Run service (push subscriptions) |
| analytics-etl-worker | Scheduled aggregation + DQ + forecast writeback | Cloud Run Job invoked by Cloud Workflows + Cloud Composer |
| analytics-looker-broker | Looker Studio embed token + binding management | Cloud Run service (internal-only) |

Container base: node:20-alpine + minimal libs; no headless Chromium.


1. Compute & orchestration

| Unit | Min instances | Max instances | Concurrency | vCPU / Mem | Notes |
|---|---|---|---|---|---|
| analytics-api | 2 | 50 | 80 | 2 / 4 GiB | Stateless; warm pool keeps latency stable |
| analytics-pubsub-sink | 2 | 30 | 100 | 1 / 2 GiB | Push subscription with ack deadline 60 s |
| analytics-etl-worker | n/a | parallel up to 20 jobs | 1 | 4 / 8 GiB | Cloud Run Jobs; per-job timeout 60 min |
| analytics-looker-broker | 1 | 5 | 40 | 0.5 / 1 GiB | Mints embed JWTs |
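
The scaling limits imply a hard ceiling on in-flight requests per region: max instances × per-instance concurrency. The sketch below works that arithmetic out, taking the service rows as max 50 / concurrency 80 (analytics-api), 30 / 100 (analytics-pubsub-sink) and 5 / 40 (analytics-looker-broker). The `UnitLimits` type and `requestCeilings` helper are illustrative names, not part of the service.

```typescript
// Scaling limits per Cloud Run service, as read from the table above.
interface UnitLimits {
  unit: string;
  maxInstances: number;
  concurrency: number; // max in-flight requests per instance
}

const UNITS: UnitLimits[] = [
  { unit: "analytics-api", maxInstances: 50, concurrency: 80 },
  { unit: "analytics-pubsub-sink", maxInstances: 30, concurrency: 100 },
  { unit: "analytics-looker-broker", maxInstances: 5, concurrency: 40 },
];

// Upper bound on simultaneous requests each service can hold in one region.
function requestCeilings(units: UnitLimits[]): Record<string, number> {
  const out: Record<string, number> = {};
  for (const u of units) {
    out[u.unit] = u.maxInstances * u.concurrency;
  }
  return out;
}

const ceilings = requestCeilings(UNITS);
// analytics-api: 4000, analytics-pubsub-sink: 3000, analytics-looker-broker: 200
```

These are ceilings, not targets: requests beyond them queue or are rejected by the platform, so downstream limits (Cloud SQL connections, BigQuery slots) should be sized against them.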

Orchestration:

  • Cloud Workflows drive ETL DAGs (extract → transform → load → DQ → publish). Each step invokes the analytics-etl-worker job with step-specific parameters.
  • Cloud Composer (Airflow 2.x) is used only for cross-domain DAGs (e.g., the demand-forecast feature pipeline that spans multiple services); one small environment per region.
  • Cloud Scheduler triggers Workflows on cron (per metric definition cadence).
  • Pub/Sub push subscriptions for the sink; pull for control-plane events.
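
The step ordering in the ETL DAG can be sketched as a loop that invokes the worker once per step and stops at the first failure, so that publish never runs after a failed DQ step. This is a minimal sketch, not the actual Workflow definition: `JobParams`, `RunJob` and `runEtlDag` are hypothetical names, and `runJob` is a synchronous stand-in for what would really be an asynchronous Cloud Run Job execution.

```typescript
type EtlStep = "extract" | "transform" | "load" | "dq" | "publish";

const ETL_STEPS: readonly EtlStep[] = ["extract", "transform", "load", "dq", "publish"];

// Hypothetical parameters handed to one analytics-etl-worker execution.
interface JobParams {
  step: EtlStep;
  metricId: string;
  runDate: string; // ISO date, e.g. "2024-01-31"
}

// Hypothetical runner: returns true when the job execution succeeded.
type RunJob = (params: JobParams) => boolean;

// Runs the steps in order and stops at the first failure.
// Returns the steps that completed successfully.
function runEtlDag(metricId: string, runDate: string, runJob: RunJob): EtlStep[] {
  const completed: EtlStep[] = [];
  for (const step of ETL_STEPS) {
    const ok = runJob({ step, metricId, runDate });
    if (!ok) {
      break;
    }
    completed.push(step);
  }
  return completed;
}
```

For example, a runner that fails the `dq` step yields `["extract", "transform", "load"]`, leaving the previously published data untouched.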

2. Regions & residency

| Region | Tenants | Cloud SQL | BigQuery dataset region | Composer |
|---|---|---|---|---|
| europe-west3 | EU/MENA tenants | regional HA | EU (multi-region) → optional regional pin | europe-west3 |
| asia-south1 | South Asia (PK/IN tenants where allowed) | regional HA | asia-south1 | asia-south1 |
| me-central1 | GCC | regional HA | me-central1 | me-central1 |

Cross-region replication is forbidden. Per-tenant residency is decided at tenant creation by tenant-service and recorded; deployments are duplicated across regions.
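
One way to enforce the residency rule in code is a guard that compares the tenant's recorded region with the region of the deployment handling the request. The sketch below assumes a `TenantResidency` record shape and a `resolveRegion` helper, both hypothetical; the source only states that tenant-service records the decision.

```typescript
type Region = "europe-west3" | "asia-south1" | "me-central1";

// Hypothetical shape of the residency record kept by tenant-service.
interface TenantResidency {
  tenantId: string;
  region: Region;
}

// Returns the region to use for the tenant's data, and refuses any request
// that this regional deployment would have to serve from another region.
function resolveRegion(tenant: TenantResidency, servingRegion: Region): Region {
  if (tenant.region !== servingRegion) {
    throw new Error(
      `residency violation: tenant ${tenant.tenantId} is pinned to ${tenant.region}, not ${servingRegion}`,
    );
  }
  return tenant.region;
}
```

Failing closed here means a misrouted request surfaces as an error rather than as a silent cross-region read.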


3. Service accounts (least privilege)

| GSA | Scope |
|---|---|
| analytics-api@… | Cloud SQL Client; BigQuery Data Viewer on analytics_curated.*; BigQuery Job User; Secret Manager accessor (signing key) |
| analytics-sink@… | Pub/Sub Subscriber; BigQuery Data Editor on events_raw.*; KMS Encrypter |
| analytics-etl@… | BigQuery Job User; Data Editor on analytics_curated.* and dq_results.*; Cloud SQL Client (write) |
| analytics-looker@… | KMS Signer (embed key); Cloud SQL Client (read on tenant_views.access_bindings) |
| looker-studio-<tenantId>@… | Per-tenant principal; granted to authorized views only |

Cloud Run units run as their attached GSA; Kubernetes workloads bind to GSAs through Workload Identity. No JSON keys are ever materialized.


4. Networking

  • All units use internal-and-cloud-load-balancing ingress; the only public path is API gateway → BFF.
  • Egress to BigQuery, Pub/Sub, Secret Manager, and KMS goes through Private Google Access.
  • VPC-SC perimeter encloses BigQuery + GCS + Pub/Sub for analytics; ingress from outside perimeter denied.

5. Configuration (12-factor)

Env-driven; all secrets via Secret Manager refs. Sample (prod):

NODE_ENV=production
SERVICE_NAME=analytics-service
REGION=europe-west3
DATABASE_URL=__resolved_at_boot__
BIGQUERY_PROJECT=ghasi-melmastoon-prod
BIGQUERY_LOCATION=europe-west3
BIGQUERY_CURATED_DATASET=analytics_curated
BIGQUERY_RAW_DATASET=events_raw
PUBSUB_PROJECT=ghasi-melmastoon-prod
PUBSUB_DLQ_TOPIC=analytics.dlq
DEFAULT_QUERY_BYTE_CAP=1073741824 # 1 GiB
DEFAULT_TENANT_DAILY_BUDGET=53687091200 # 50 GiB
LOOKER_EMBED_KMS_KEY=projects/.../cryptoKeys/melmastoon-analytics-embed-signer
AI_ORCHESTRATOR_BASE_URL=https://ai-orchestrator.internal
AI_ORCHESTRATOR_AUDIENCE=https://ai-orchestrator.internal
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.melmastoon.internal
LOG_LEVEL=info
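
A boot-time loader can validate these variables before the service accepts traffic. The following is a minimal sketch under stated assumptions: `loadConfig` and the `AnalyticsConfig` shape are hypothetical, only a subset of the variables is covered, and the trailing `# 1 GiB` style notes in the sample are treated as annotations rather than part of the value. The environment is passed in as a plain record (in Node, `process.env`) so the function stays pure.

```typescript
interface AnalyticsConfig {
  region: string;
  bigQueryProject: string;
  defaultQueryByteCap: number;
  defaultTenantDailyBudget: number;
}

type Env = Record<string, string | undefined>;

// Reads a mandatory variable and fails fast when it is missing or blank.
function required(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === "") {
    throw new Error(`missing required env var ${key}`);
  }
  return value.trim();
}

// Reads a mandatory positive byte count; rejects anything but plain digits.
function requiredBytes(env: Env, key: string): number {
  const raw = required(env, key);
  const parsed = Number(raw);
  if (!/^\d+$/.test(raw) || !Number.isSafeInteger(parsed) || parsed <= 0) {
    throw new Error(`${key} must be a positive whole number of bytes`);
  }
  return parsed;
}

function loadConfig(env: Env): AnalyticsConfig {
  const config: AnalyticsConfig = {
    region: required(env, "REGION"),
    bigQueryProject: required(env, "BIGQUERY_PROJECT"),
    defaultQueryByteCap: requiredBytes(env, "DEFAULT_QUERY_BYTE_CAP"),
    defaultTenantDailyBudget: requiredBytes(env, "DEFAULT_TENANT_DAILY_BUDGET"),
  };
  // A per-query cap above the daily budget could never take effect.
  if (config.defaultQueryByteCap > config.defaultTenantDailyBudget) {
    throw new Error("DEFAULT_QUERY_BYTE_CAP exceeds DEFAULT_TENANT_DAILY_BUDGET");
  }
  return config;
}
```

Failing at boot keeps a misconfigured revision from ever receiving traffic, which fits the revision-based rollout described for this service.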

6. Deploy pipeline

Cloud Deploy targets per region:

build → unit-and-integration-tests → schema-drift-check → artifact-registry-push

deploy:dev (auto) → smoke + load smoke

deploy:stg (auto on develop) → Pact verify + DQ replay + canary 10% / 30 min

deploy:prod-eu (manual approve) → canary 10% / 30 min → 100%

deploy:prod-asia / prod-me (parallel after EU green)
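
The promotion order above can be modelled as a prerequisite map: a target becomes deployable once every target before it is green. This sketch is illustrative only. It assumes stg waits on dev (the source shows the order but not the gate), it does not model the manual approval on prod-eu, and `PREREQS` and `deployable` are hypothetical names rather than Cloud Deploy constructs.

```typescript
type Target = "dev" | "stg" | "prod-eu" | "prod-asia" | "prod-me";
type Status = "pending" | "green" | "failed";

// Targets that must be green before a given target may deploy.
const PREREQS: Record<Target, Target[]> = {
  dev: [],
  stg: ["dev"],
  "prod-eu": ["stg"],
  "prod-asia": ["prod-eu"],
  "prod-me": ["prod-eu"],
};

// Returns every pending target whose prerequisites are all green.
function deployable(status: Record<Target, Status>): Target[] {
  return (Object.keys(PREREQS) as Target[]).filter(
    (t) => status[t] === "pending" && PREREQS[t].every((p) => status[p] === "green"),
  );
}
```

With dev, stg and prod-eu green, the function returns both `prod-asia` and `prod-me`, matching the parallel rollout after EU; if prod-eu has failed, it returns nothing.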

Migrations:

  • Postgres migrations run from a one-shot Cloud Run Job before traffic shift.
  • BigQuery DDL applies via Terraform; new tables ship as additive; renames/breaking changes follow two-phase coexistence (MIGRATION_PLAN).

Rollback: each release is a Cloud Run revision, so rolling back shifts traffic to the previous revision within seconds. ETL job rollbacks restore both the Workflow definition and the worker container.


7. Capacity & cost envelope (per-region steady state)

| Resource | Estimate |
|---|---|
| analytics-api | 4 instances avg, 12 peak |
| analytics-pubsub-sink | 6 instances avg, 18 peak |
| analytics-etl-worker | ~120 job runs/day (mixed cadences) |
| Cloud SQL | 2 vCPU / 8 GiB HA |
| BigQuery storage (curated) | ~10 GiB/tenant/year (active), ~3 GiB/tenant/year (long-term) |
| BigQuery slots | reservation 200 baseline + autoscale 200 |
| Composer | small env (3 worker nodes) |

Cost guardrails: per-tenant byte budgets (default 50 GiB/day), reservation autoscale ceiling, snapshot generators auto-paused when budget exceeded (SECURITY_MODEL §9).
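
The byte-budget guardrail can be sketched as an admission check run before a query is submitted. It uses the defaults from the configuration sample (1 GiB per query, 50 GiB per tenant per day). The `admitQuery` function and its types are hypothetical, and the byte estimate is assumed to come from something like a BigQuery dry run.

```typescript
interface BudgetPolicy {
  queryByteCap: number; // max bytes a single query may scan
  tenantDailyBudget: number; // max bytes a tenant may scan per day
}

const DEFAULT_POLICY: BudgetPolicy = {
  queryByteCap: 1073741824, // 1 GiB
  tenantDailyBudget: 53687091200, // 50 GiB
};

type Decision =
  | { allowed: true; remainingBytes: number }
  | { allowed: false; reason: "query_cap_exceeded" | "daily_budget_exceeded" };

// Decides whether a query may run, given its estimated scan size and the
// bytes the tenant has already consumed today.
function admitQuery(
  estimatedBytes: number,
  bytesUsedToday: number,
  policy: BudgetPolicy = DEFAULT_POLICY,
): Decision {
  if (estimatedBytes > policy.queryByteCap) {
    return { allowed: false, reason: "query_cap_exceeded" };
  }
  if (bytesUsedToday + estimatedBytes > policy.tenantDailyBudget) {
    return { allowed: false, reason: "daily_budget_exceeded" };
  }
  return {
    allowed: true,
    remainingBytes: policy.tenantDailyBudget - bytesUsedToday - estimatedBytes,
  };
}
```

Checking the per-query cap first keeps a single oversized query from being reported as a budget problem, which makes the rejection reason more useful to the caller.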


8. Disaster recovery

  • Postgres: PITR window 7 days; daily snapshots with 35-day retention; HA replica.
  • BigQuery: time-travel 7 days; snapshot tables for curated layer weekly with 90-day retention.
  • Workflows / Composer: definitions in IaC (Terraform); restorable by re-applying.
  • Pub/Sub: subscription retention 7 days; replay possible from message storage.
  • RTO / RPO: RTO 30 min (region failover possible only within residency); RPO 5 min for Postgres, 1 h for curated tables (replayable from raw).

Cross-references: SECURITY_MODEL §3, OBSERVABILITY §10 cost, MIGRATION_PLAN.