DEPLOYMENT_TOPOLOGY — theme-config-service

Sibling: SERVICE_OVERVIEW · LOCAL_DEV_SETUP · OBSERVABILITY

Platform anchors: docs/02-enterprise-architecture.md · docs/standards/SERVICE_TEMPLATE.md

This document describes how theme-config-service is deployed on GCP, the infrastructure components it depends on, capacity sizing, environment topology, deployment pipeline, and DR.


1. Environments

| Environment | Project | Region | Purpose |
|---|---|---|---|
| dev | melmastoon-dev | europe-west1 | shared developer environment + integration tests |
| staging | melmastoon-staging | europe-west1 | pre-prod soak + customer-validation theme migrations |
| prod-eu | melmastoon-prod | europe-west1 (primary), europe-west4 (DR replica) | production |
| prod-asia (Phase 2) | melmastoon-prod-asia | asia-south1 | regional expansion for South Asia tenants |

Sandbox / preview environments are spun up per-PR via the platform pr-preview reusable workflow but only deploy authoring components, not the CDN edge.


2. Runtime topology (per environment)

                    ┌──────────────────────────────────────────────┐
                    │ Cloud CDN (global edge)                      │
                    │ /themes/<id>/published.json                  │
                    └────────────┬─────────────────────────────────┘
                                 │ origin pull
                                 ▼
                    ┌──────────────────────────────────────────────┐
                    │ GCS bucket: melmastoon-theme-bundles-<env>   │
                    │ (object versioning, dual-region storage)     │
                    └────────────▲─────────────────────────────────┘
                                 │ writes (publish/rollback)
                                 │
    ┌──────────────────────────────────────────────────────────────────────────────┐
    │ Cloud Run (europe-west1)                                                     │
    │                                                                              │
    │  theme-config-service-api          theme-config-service-workers              │
    │  ┌──────────────────────┐          ┌────────────────────────────────────┐    │
    │  │ Container: NestJS    │          │ Containers (one per worker):       │    │
    │  │ Min 2, max 30        │          │ - outbox-publisher (min 1)         │    │
    │  │ CPU 1, Mem 1Gi       │          │ - cdn-invalidation-retrier (1)     │    │
    │  │ Concurrency 80       │          │ - preview-token-sweeper (1)        │    │
    │  │ HTTPS via gateway    │          │ - broken-asset-scanner (1)         │    │
    │  └──────────┬───────────┘          │ - tenant-purge (1)                 │    │
    │             │                      │ - cache-warmer (1)                 │    │
    │             │                      │ - inbox-consumer (per subscriber)  │    │
    │             │                      └──────────────┬─────────────────────┘    │
    │             ▼                                     ▼                          │
    │  Cloud SQL Auth Proxy (sidecar)      Cloud SQL Auth Proxy (sidecar)          │
    └─────────────┬─────────────────────────────────────┬──────────────────────────┘
                  │                                     │
                  ▼                                     ▼
          ┌────────────────┐                    ┌────────────────┐
          │ Cloud SQL PG16 │                    │ Memorystore    │
          │ Regional HA    │                    │ Redis 7 HA     │
          │ + RO replica   │                    │ 1 GB tier      │
          └────────────────┘                    └────────────────┘
                  ▲                                     ▲
                  │                                     │
                  │   Pub/Sub topics                    │
                  │   ─────────────                     │
                  │   melmastoon.theme.events           │
                  │   melmastoon.theme.versions         │
                  │   melmastoon.theme.cdn              │
                  │   (consumes) tenant.events,         │
                  │              media.events,          │
                  │              property.events        │


                    Cloud Logging / Cloud Monitoring / Cloud Trace
                    (OTel sidecar collector)

3. Compute

| Component | Service | Sizing (prod-eu start) | Scaling |
|---|---|---|---|
| API | Cloud Run (managed) | 1 vCPU, 1 GiB, concurrency 80, min 2, max 30 | request-based |
| outbox-publisher | Cloud Run (always-on) | 1 vCPU, 512 MiB, min 1, max 3 | CPU + Pub/Sub queue depth |
| cdn-invalidation-retrier | Cloud Run (always-on) | 0.5 vCPU, 256 MiB, min 1, max 2 | CPU |
| preview-token-sweeper | Cloud Run Jobs | n/a (cron) | hourly |
| broken-asset-scanner | Cloud Run Jobs | 1 vCPU, 1 GiB | daily 02:30 UTC + reactive |
| tenant-purge | Cloud Run Jobs | 0.5 vCPU, 512 MiB | hourly |
| cache-warmer | Cloud Run Jobs | 0.5 vCPU, 256 MiB | per-publish trigger |
| inbox-consumer (per subscription) | Cloud Run (always-on) | 0.5 vCPU, 512 MiB, min 1, max 5 | Pub/Sub backlog |

CPU allocation differs by workload: the outbox and inbox workers run with CPU always allocated, while the API is allocated CPU only during request processing.

VPC connector: melmastoon-prod-vpc-connector for private connectivity to Cloud SQL + Memorystore.


4. Data plane

4.1 Cloud SQL Postgres 16

| Setting | Value |
|---|---|
| Tier | db-custom-4-15360 (4 vCPU, 15 GiB) as the starting size; scale vertically with growth |
| HA | Regional, automatic failover |
| Backups | Nightly, 30-day retention, PITR 7 days |
| Maintenance window | Tuesdays 02:00–03:00 UTC (06:30–07:30 Afghanistan time, UTC+4:30, ahead of daytime business hours) |
| Connection pooler | PgBouncer in transaction mode, 80 client connections, 20 server pool |
| Read replica | europe-west4, async replication, used by analytics + DR |
| Encryption | CMEK with org-managed KMS key |
| IAM auth | Yes; passwords disabled |
| Flags | cloudsql.iam_authentication=on, max_connections=200, track_activity_query_size=4096, pg_stat_statements.track=ALL |
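
The pooler and max_connections settings above interact: each PgBouncer instance can open up to its server pool size in backend connections. The following is a minimal budget check, not the service's actual tooling. The number of pooler instances and the reserved-connection count are assumptions that this document does not specify; only the 200 and 20 figures come from the table.

```typescript
// Rough connection budget for Postgres behind PgBouncer (illustrative sketch).
interface PoolBudget {
  maxConnections: number;      // max_connections flag (200 in the table above)
  serverPoolSize: number;      // PgBouncer server pool per pooler instance (20 above)
  reservedConnections: number; // ASSUMPTION: slots held back for admin, replication, migrations
}

// Largest number of pooler instances that can all fill their server pool
// without exceeding the usable share of max_connections.
function maxPoolerInstances(budget: PoolBudget): number {
  const usable = budget.maxConnections - budget.reservedConnections;
  return Math.max(0, Math.floor(usable / budget.serverPoolSize));
}

// True when the planned number of pooler instances fits the budget.
function fitsBudget(budget: PoolBudget, poolerInstances: number): boolean {
  return poolerInstances <= maxPoolerInstances(budget);
}

const prodBudget: PoolBudget = {
  maxConnections: 200,
  serverPoolSize: 20,
  reservedConnections: 10, // hypothetical value, not from this document
};

console.log(maxPoolerInstances(prodBudget));
console.log(fitsBudget(prodBudget, 12));
```

With these illustrative inputs, nine poolers fit and twelve do not, which is the kind of check worth running before raising the maximum instance count of any component that carries its own pooler.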

4.2 Memorystore Redis 7

| Setting | Value |
|---|---|
| Tier | Standard HA, 1 GB |
| Auth | Yes (AUTH string in Secret Manager) |
| TLS | required |
| Eviction policy | allkeys-lfu |
| Use | published bundle URL cache, host→theme cache, tenant-config snapshot cache |

4.3 GCS

| Bucket | Purpose | Settings |
|---|---|---|
| melmastoon-theme-bundles-prod | published bundle JSONs | dual-region (eu), object versioning on, lifecycle rule: orphan deletion at 7d, version retention 30d, CMEK |
| melmastoon-theme-bundles-prod-cdn-logs | CDN access logs | regional |
| melmastoon-theme-evals-prod | AI eval cassettes & datasets | regional, restricted IAM |

4.4 Pub/Sub

Topics and subscriptions are defined in EVENT_SCHEMAS §2. Every subscription must have a dead-letter queue (DLQ) with a maximum of 5 delivery attempts, and DLQ inflow raises an alert.

4.5 Cloud CDN

  • Origin: melmastoon-theme-bundles-prod GCS bucket, attached to the external Application Load Balancer as a backend bucket.
  • Cache key: full URL + selected request headers (Accept-Encoding).
  • Cache modes: CACHE_ALL_STATIC. TTL: 86400s default + Cache-Control honored.
  • Negative caching: 60s for 404s.
  • Tagged-invalidation: enabled; tag theme:<themeId> issued on publish/rollback.
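
The caching rules above can be collected into one helper that derives the edge path, the Cache-Control value and the cache tag for a theme. This is a sketch under stated assumptions: the function and type names are hypothetical, and how the tag and header are attached to the origin response is not described in this document.

```typescript
const DEFAULT_TTL_SECONDS = 86400; // default TTL from the list above

interface BundleCacheMetadata {
  path: string;         // URL path served by the CDN edge
  cacheControl: string; // value the origin would return in Cache-Control
  cacheTag: string;     // tag used for tagged invalidation
}

// Path of the published bundle, as shown in the topology diagram.
function publishedBundlePath(themeId: string): string {
  return `/themes/${encodeURIComponent(themeId)}/published.json`;
}

// Cache tag issued on publish/rollback: theme:<themeId>.
function themeCacheTag(themeId: string): string {
  return `theme:${themeId}`;
}

// ASSUMPTION: a "public, max-age" directive is how the TTL is expressed;
// the document only states the default TTL and that Cache-Control is honored.
function bundleCacheMetadata(
  themeId: string,
  ttlSeconds: number = DEFAULT_TTL_SECONDS,
): BundleCacheMetadata {
  return {
    path: publishedBundlePath(themeId),
    cacheControl: `public, max-age=${ttlSeconds}`,
    cacheTag: themeCacheTag(themeId),
  };
}

console.log(bundleCacheMetadata("abc"));
```

Deriving the path and the tag from the same theme ID keeps publish and rollback from invalidating a tag that the served object never carried.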

5. Networking

  • All Cloud Run services on melmastoon-prod-vpc via Serverless VPC Access.
  • API ingress only from the platform API gateway (api.melmastoon.app) — Cloud Armor policy enforces gateway egress IP allow-list at the LB tier.
  • Internal mTLS endpoints (e.g. /internal/email-theme/...) reachable only from internal VPC (Cloud Run ingress = internal-and-cloud-load-balancing).
  • Egress for AI orchestrator and file-storage HTTP calls normally stays on private paths because both services are internal; Cloud NAT is the fallback route.

6. CI/CD pipeline

    PR opened
      └─ GitHub Actions: lint, typecheck, unit, integration, contract, http (Testcontainers)
    PR merged into main
      └─ Build container (Cloud Build) → push to Artifact Registry
      └─ Deploy to dev (auto)
      └─ Smoke test suite on dev
    Promotion to staging (auto on green dev)
      └─ Apply migrations (manual approval if migration is destructive; see MIGRATION_PLAN)
      └─ Deploy
      └─ Run E2E + load smoke
    Promotion to prod (manual approval, two-person)
      └─ Apply migrations (manual approval, runbook attached)
      └─ Deploy with canary: 10 % traffic for 30 min, watch error budget; auto-rollback on burn
      └─ Promote to 100 %

Migrations always run before the new container revision is rolled out, and migrations are required to be backwards-compatible with the previous revision (see MIGRATION_PLAN).
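
The canary step in the pipeline (10 % traffic for 30 min, auto-rollback on error-budget burn) could be expressed as a small decision function. This is a minimal sketch, not the platform's actual rollout controller: the SLO target, burn-rate threshold and minimum sample size are hypothetical values chosen for illustration, and only the 30-minute soak comes from this document.

```typescript
interface CanarySample {
  requests: number; // requests served by the canary revision so far
  errors: number;   // failed requests in the same period
}

interface CanaryPolicy {
  sloTarget: number;   // ASSUMPTION: availability SLO, must be below 1 (e.g. 0.999)
  maxBurnRate: number; // ASSUMPTION: burn rate above which the canary is rolled back
  minRequests: number; // ASSUMPTION: below this sample size no verdict is given
  soakMinutes: number; // 30 min in the pipeline above
}

type CanaryDecision = "promote" | "rollback" | "wait";

// Burn rate = observed error rate divided by the error rate the SLO allows.
function burnRate(sample: CanarySample, sloTarget: number): number {
  if (sample.requests === 0) return 0;
  const errorRate = sample.errors / sample.requests;
  const errorBudget = 1 - sloTarget;
  return errorRate / errorBudget;
}

function evaluateCanary(
  sample: CanarySample,
  elapsedMinutes: number,
  policy: CanaryPolicy,
): CanaryDecision {
  const enoughData = sample.requests >= policy.minRequests;
  // Roll back as soon as there is enough data showing excessive burn.
  if (enoughData && burnRate(sample, policy.sloTarget) > policy.maxBurnRate) {
    return "rollback";
  }
  // Otherwise keep soaking until both the time and sample-size bars are met.
  if (!enoughData || elapsedMinutes < policy.soakMinutes) {
    return "wait";
  }
  return "promote";
}

const canaryPolicy: CanaryPolicy = {
  sloTarget: 0.999,
  maxBurnRate: 2,
  minRequests: 500,
  soakMinutes: 30,
};

console.log(evaluateCanary({ requests: 10000, errors: 1 }, 30, canaryPolicy));
console.log(evaluateCanary({ requests: 1000, errors: 5 }, 10, canaryPolicy));
```

Checking the rollback condition before the soak timer means a badly burning canary is withdrawn early instead of running the full 30 minutes.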

Image hardening: distroless base, non-root, read-only filesystem (with tmpfs), no shell.


7. Secrets

All secrets are stored in Secret Manager and mounted as files through Cloud Run secret volume mounts:

| Secret | Path | Rotation |
|---|---|---|
| memorystore-auth-string | /secrets/redis-auth | 90 days |
| desktop-pairing-hmac | /secrets/desktop-pairing-hmac | 90 days |
| (workload-identity managed) | Cloud SQL, Pub/Sub, GCS, AI orchestrator mTLS | no static secret |

8. Capacity planning (year-1 prod-eu)

| Dimension | Year-1 expectation | Headroom designed |
|---|---|---|
| Active themes | 5 000 (tenants) | scales to 50 000 with current sizing |
| Theme versions per theme (live) | 1 published + ~3 drafts | unbounded archived |
| Authoring rps p95 | 50 | 800 (max instances × concurrency) |
| Public bundle reads rps | 5 000 (mostly CDN-cache-hit) | 50 000+ (CDN scales) |
| Publish events / day | 200 | 10 000 |
| AI calls / day | 2 000 | 20 000 (orchestrator-bound) |
| DB IOPS p95 | 200 | 8 000 (db-custom-4) |
| DB storage | 30 GB | 1 TB |
| Memorystore memory | 200 MB | 1 GB |
| GCS object count | 30 000 | unbounded |

Quarterly capacity review feeds the platform finance dashboard.
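
The authoring headroom row ties throughput to max instances × concurrency. The §3 limits (30 instances × concurrency 80) multiply to 2 400 concurrent requests, so turning that into requests per second needs a latency figure. The sketch below applies Little's law; the 3 s mean latency is purely an illustrative assumption and is not stated anywhere in this document.

```typescript
// Upper bound on concurrently served requests for a Cloud Run service.
function maxInFlightRequests(maxInstances: number, concurrency: number): number {
  return maxInstances * concurrency;
}

// Little's law: throughput = in-flight requests / mean time per request.
function sustainableRps(inFlight: number, meanLatencySeconds: number): number {
  if (meanLatencySeconds <= 0) {
    throw new RangeError("meanLatencySeconds must be positive");
  }
  return inFlight / meanLatencySeconds;
}

// How many times the expected load fits into the available capacity.
function headroomFactor(capacityRps: number, expectedRps: number): number {
  return capacityRps / expectedRps;
}

const inFlight = maxInFlightRequests(30, 80); // API limits from §3
const capacityRps = sustainableRps(inFlight, 3); // ASSUMPTION: 3 s mean latency
console.log(inFlight, capacityRps, headroomFactor(capacityRps, 50));
```

Under that assumed latency the capacity works out to 800 rps, sixteen times the year-1 p95 expectation of 50 rps; a different latency shifts the result proportionally.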


9. Disaster recovery

| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Cloud Run revision failure | 5 min | 0 | auto rollback to previous revision |
| Cloud SQL primary failure | 5 min | < 5 s | regional HA failover (automatic) |
| europe-west1 region failure | 1 h | < 1 min | promote replica in europe-west4; reconfigure private endpoint; redeploy from CD with region=europe-west4 flag |
| GCS bucket data loss | 4 h | 24 h | restore from versioned objects; rebuild from theme_versions snapshot if needed |
| Pub/Sub regional failure | 30 min | 0 (queue durable) | wait for region recovery; messages buffered |
| Misconfigured publish (bad theme deployed) | 5 min | 0 | rollback via POST /themes/:id/rollback |
| Catastrophic data corruption | 24 h | 24 h | PITR restore to a side instance; reconcile with outbox to replay events |

Runbooks per scenario in services/theme-config-service/runbooks/.
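
One way to keep the RTO/RPO targets above honest is to compare DR drill measurements against them. The following is a hypothetical sketch: the scenario keys, type names and the choice to treat "< 5 s" and "< 1 min" as inclusive limits are assumptions, and only a subset of the scenarios is encoded.

```typescript
interface RecoveryTarget {
  rtoSeconds: number; // maximum tolerated time to recover
  rpoSeconds: number; // maximum tolerated window of lost data
}

interface DrillResult {
  recoverySeconds: number; // measured time to recover during the drill
  dataLossSeconds: number; // measured data-loss window during the drill
}

const MINUTE = 60;
const HOUR = 60 * MINUTE;

// Targets copied from the table above; the keys are hypothetical identifiers.
const recoveryTargets: Record<string, RecoveryTarget> = {
  "cloud-run-revision-failure": { rtoSeconds: 5 * MINUTE, rpoSeconds: 0 },
  "cloud-sql-primary-failure": { rtoSeconds: 5 * MINUTE, rpoSeconds: 5 },
  "europe-west1-region-failure": { rtoSeconds: 1 * HOUR, rpoSeconds: 1 * MINUTE },
  "gcs-bucket-data-loss": { rtoSeconds: 4 * HOUR, rpoSeconds: 24 * HOUR },
};

// Returns a list of violated targets; an empty list means the drill passed.
function checkDrill(scenario: string, result: DrillResult): string[] {
  const target = recoveryTargets[scenario];
  if (target === undefined) {
    return [`unknown scenario: ${scenario}`];
  }
  const violations: string[] = [];
  if (result.recoverySeconds > target.rtoSeconds) {
    violations.push(`RTO exceeded: ${result.recoverySeconds}s > ${target.rtoSeconds}s`);
  }
  if (result.dataLossSeconds > target.rpoSeconds) {
    violations.push(`RPO exceeded: ${result.dataLossSeconds}s > ${target.rpoSeconds}s`);
  }
  return violations;
}

console.log(checkDrill("europe-west1-region-failure", { recoverySeconds: 5400, dataLossSeconds: 30 }));
```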


10. Observability deployment

  • OTel collector deployed as sidecar container in every Cloud Run revision.
  • Exports traces/metrics to Cloud Trace / Cloud Monitoring; logs collected via stdout to Cloud Logging.
  • Sinks: BigQuery for long-term log retention (90 d), Cloud Storage for compliance archive (7 y).

11. Cost levers (in priority order)

  1. CDN cache-hit ratio — every additional 1 % avoids ~ $300/mo at year-1 traffic.
  2. Memorystore tier size vs origin reads — keep at 1 GB until p95 origin hit ratio drops below 98 %.
  3. AI surfaces — orchestrator-side budgets prevent runaway spend; eval suite reduces re-prompting.
  4. Cloud SQL CPU class — vertical-scale only after pgBouncer pool saturation and slow-query optimisation.
  5. Pub/Sub message size — keep payloads small; never inline bundles in events.
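
Cost lever 5 amounts to sending a reference to the bundle instead of the bundle itself. A minimal sketch of that pattern follows; the event shape, field names, size budget and CDN base URL are all hypothetical, since the real schemas live in EVENT_SCHEMAS and are not reproduced here.

```typescript
// Hypothetical event shape: carries a pointer to the bundle, never its contents.
interface ThemePublishedEvent {
  themeId: string;
  version: number;
  bundleUrl: string;   // where consumers can fetch the published bundle
  publishedAt: string; // ISO 8601 timestamp
}

// ASSUMPTION: an illustrative payload budget, measured in JSON characters.
const MAX_EVENT_CHARS = 2048;

function buildPublishedEvent(
  themeId: string,
  version: number,
  cdnBaseUrl: string,
  publishedAt: Date,
): ThemePublishedEvent {
  return {
    themeId,
    version,
    bundleUrl: `${cdnBaseUrl}/themes/${encodeURIComponent(themeId)}/published.json`,
    publishedAt: publishedAt.toISOString(),
  };
}

// Serializes the event and rejects anything over the budget, which would
// catch an accidentally inlined bundle before it reaches Pub/Sub.
function serializeEvent(event: ThemePublishedEvent): string {
  const json = JSON.stringify(event);
  if (json.length > MAX_EVENT_CHARS) {
    throw new RangeError(`event payload too large: ${json.length} chars`);
  }
  return json;
}

const sampleEvent = buildPublishedEvent(
  "theme-123",
  7,
  "https://cdn.example.test", // placeholder host
  new Date("2025-01-01T00:00:00Z"),
);
console.log(serializeEvent(sampleEvent));
```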

12. References