DEPLOYMENT_TOPOLOGY — theme-config-service
Sibling: SERVICE_OVERVIEW · LOCAL_DEV_SETUP · OBSERVABILITY
Platform anchors: docs/02-enterprise-architecture.md · docs/standards/SERVICE_TEMPLATE.md
This document describes how theme-config-service is deployed on GCP, the infrastructure components it depends on, capacity sizing, environment topology, deployment pipeline, and DR.
1. Environments
| Environment | Project | Region | Purpose |
|---|---|---|---|
| dev | melmastoon-dev | europe-west1 | shared developer environment + integration tests |
| staging | melmastoon-staging | europe-west1 | pre-prod soak + customer-validation theme migrations |
| prod-eu | melmastoon-prod | europe-west1 (primary), europe-west4 (DR replica) | production |
| prod-asia (Phase 2) | melmastoon-prod-asia | asia-south1 | regional expansion for South Asia tenants |
Sandbox / preview environments are spun up per-PR via the platform pr-preview reusable workflow but only deploy authoring components, not the CDN edge.
2. Runtime topology (per environment)
┌──────────────────────────────────────────────┐
│ Cloud CDN (global edge) │
│ /themes/<id>/published.json │
└────────────┬─────────────────────────────────┘
│ origin pull
▼
┌──────────────────────────────────────────────┐
│ GCS bucket: melmastoon-theme-bundles-<env> │
│ (object versioning, dual-region storage) │
└────────────▲─────────────────────────────────┘
│ writes (publish/rollback)
│
┌──────────────────────────────────────────────────────────────────────────────┐
│ Cloud Run (europe-west1) │
│ │
│ theme-config-service-api theme-config-service-workers │
│ ┌──────────────────────┐ ┌────────────────────────────────────┐ │
│ │ Container: NestJS │ │ Containers (one per worker): │ │
│ │ Min 2, max 30 │ │ - outbox-publisher (min 1) │ │
│ │ CPU 1, Mem 1Gi │ │ - cdn-invalidation-retrier (1) │ │
│ │ Concurrency 80 │ │ - preview-token-sweeper (1) │ │
│ │ HTTPS via gateway │ │ - broken-asset-scanner (1) │ │
│ └──────────┬───────────┘ │ - tenant-purge (1) │ │
│ │ │ - cache-warmer (1) │ │
│ │ │ - inbox-consumer (per subscriber) │ │
│ │ └──────────────┬─────────────────────┘ │
│ ▼ ▼ │
│ Cloud SQL Auth Proxy (sidecar) Cloud SQL Auth Proxy (sidecar) │
└─────────────┬─────────────────────────────────────┬──────────────────────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Cloud SQL PG16 │ │ Memorystore │
│ Regional HA │ │ Redis 7 HA │
│ + RO replica │ │ 1 GB tier │
└────────────────┘ └────────────────┘
▲ ▲
│ │
│ Pub/Sub topics │
│ ───────────── │
│ melmastoon.theme.events │
│ melmastoon.theme.versions │
│ melmastoon.theme.cdn │
│ (consumes) tenant.events, │
│ media.events, │
│ property.events │
│
▼
Cloud Logging / Cloud Monitoring / Cloud Trace
(OTel sidecar collector)
3. Compute
| Component | Service | Sizing (prod-eu start) | Scaling |
|---|---|---|---|
| API | Cloud Run (managed) | 1 vCPU, 1 GiB, concurrency 80, min 2, max 30 | request-based |
| outbox-publisher | Cloud Run (always-on) | 1 vCPU, 512 MiB, min 1, max 3 | CPU + Pub/Sub queue depth |
| cdn-invalidation-retrier | Cloud Run (always-on) | 0.5 vCPU, 256 MiB, min 1, max 2 | CPU |
| preview-token-sweeper | Cloud Run Jobs | n/a (cron) | hourly |
| broken-asset-scanner | Cloud Run Jobs | 1 vCPU, 1 GiB | daily 02:30 UTC + reactive |
| tenant-purge | Cloud Run Jobs | 0.5 vCPU, 512 MiB | hourly |
| cache-warmer | Cloud Run Jobs | 0.5 vCPU, 256 MiB | per-publish trigger |
| inbox-consumer (per subscription) | Cloud Run (always-on) | 0.5 vCPU, 512 MiB, min 1, max 5 | Pub/Sub backlog |
The outbox and inbox workers run with CPU always allocated; the API runs with CPU allocated only during request processing.
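As a rough sanity check on the API sizing, Little's law (in-flight requests ≈ arrival rate × latency) gives an estimate of how many instances a given request rate needs. The sketch below is illustrative only: the function name is hypothetical and the 200 ms latency in the usage line is an assumption, not a measured value. The concurrency and instance bounds are taken from the table above.

```python
import math


def api_instances_needed(
    rps: float,
    latency_s: float,
    concurrency: int = 80,    # per-instance concurrency from the sizing table
    min_instances: int = 2,   # API min instances
    max_instances: int = 30,  # API max instances
) -> int:
    """Estimate the Cloud Run instance count for a request rate.

    Uses Little's law: average in-flight requests = rps * latency.
    The result is clamped to the configured min/max instance bounds.
    """
    in_flight = rps * latency_s
    wanted = math.ceil(in_flight / concurrency)
    return max(min_instances, min(max_instances, wanted))


if __name__ == "__main__":
    # 50 rps at an assumed 200 ms latency fits well within the 2 warm instances.
    print(api_instances_needed(50, 0.2))
```

Under these assumptions the year-1 authoring load stays at the minimum instance count, so the minimum is driven by availability rather than throughput.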
VPC connector: melmastoon-prod-vpc-connector for private connectivity to Cloud SQL + Memorystore.
4. Data plane
4.1 Cloud SQL Postgres 16
| Setting | Value |
|---|---|
| Tier | db-custom-4-15360 (4 vCPU, 15 GiB) — start; vertical scale per growth |
| HA | Regional, automatic failover |
| Backups | Nightly, 30-day retention, PITR 7 days |
| Maintenance window | Tuesdays 02:00–03:00 UTC (06:30–07:30 in Afghanistan, UTC+04:30, before business hours) |
| Connection pooler | PgBouncer in transaction mode, 80 client connections, 20 server pool |
| Read replica | europe-west4, async replication, used by analytics + DR |
| Encryption | CMEK with org-managed KMS key |
| IAM auth | Yes; passwords disabled |
| Flags | cloudsql.iam_authentication=on, max_connections=200, track_activity_query_size=4096, pg_stat_statements.track=ALL |
4.2 Memorystore Redis 7
| Setting | Value |
|---|---|
| Tier | Standard HA, 1 GB |
| Auth | Yes (AUTH string in Secret Manager) |
| TLS | required |
| Eviction policy | allkeys-lfu |
| Use | published bundle URL cache, host→theme cache, tenant-config snapshot cache |
4.3 GCS
| Bucket | Purpose | Settings |
|---|---|---|
| melmastoon-theme-bundles-prod | published bundle JSONs | dual-region (eu), object versioning on, lifecycle rule: orphan deletion at 7d, version retention 30d, CMEK |
| melmastoon-theme-bundles-prod-cdn-logs | CDN access logs | regional |
| melmastoon-theme-evals-prod | AI eval cassettes & datasets | regional, restricted IAM |
4.4 Pub/Sub
Topics and subscriptions are defined in EVENT_SCHEMAS §2. Every subscription must have a DLQ with a maximum of 5 delivery attempts, and DLQ inflow triggers an alert.
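The delivery-attempt limit can be pictured with a small simulation. This is a simplified model, not the Pub/Sub client API: `simulate_delivery` and the handlers are hypothetical names, and real dead-lettering in Pub/Sub is performed by the service on a best-effort basis once the attempt limit is reached.

```python
from typing import Callable, Tuple

MAX_DELIVERY_ATTEMPTS = 5  # limit stated in this section


def simulate_delivery(handler: Callable[[bytes], None], payload: bytes) -> Tuple[str, int]:
    """Model redelivery of one message until it is acked or dead-lettered.

    Returns (outcome, attempts). A handler that raises is treated as a nack,
    which causes another delivery attempt until the limit is exhausted.
    """
    for attempt in range(1, MAX_DELIVERY_ATTEMPTS + 1):
        try:
            handler(payload)
            return ("acked", attempt)
        except Exception:
            continue  # nack: the message is redelivered
    return ("dead-lettered", MAX_DELIVERY_ATTEMPTS)


def always_fails(_: bytes) -> None:
    raise RuntimeError("poison message")


if __name__ == "__main__":
    print(simulate_delivery(always_fails, b"{}"))
```

A message that can never be processed ends up in the DLQ after the fifth attempt, which is the inflow the alert is meant to catch.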
4.5 Cloud CDN
- Origin: melmastoon-theme-bundles-prod GCS bucket via a serverless NEG.
- Cache key: full URL + selected request headers (Accept-Encoding).
- Cache modes: CACHE_ALL_STATIC. TTL: 86400s default + Cache-Control honored.
- Negative caching: 60s for 404s.
- Tagged-invalidation: enabled; tag theme:<themeId> issued on publish/rollback.
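The naming and TTL rules above can be summarised in a few helper functions. This is a simplified sketch: the function names are hypothetical, and the TTL logic models only the three rules listed here (60 s for 404s, origin Cache-Control when present, 86400 s otherwise), not every clamp Cloud CDN applies.

```python
from typing import Optional

DEFAULT_TTL_S = 86_400    # default TTL from this section
NEGATIVE_TTL_404_S = 60   # negative caching for 404s


def bundle_path(theme_id: str) -> str:
    """CDN path of a theme's published bundle."""
    return f"/themes/{theme_id}/published.json"


def cache_tag(theme_id: str) -> str:
    """Cache tag issued on publish/rollback for tagged invalidation."""
    return f"theme:{theme_id}"


def effective_ttl(status: int, origin_max_age: Optional[int]) -> int:
    """Pick the TTL for a response under the rules listed above."""
    if status == 404:
        return NEGATIVE_TTL_404_S
    if origin_max_age is not None:
        return origin_max_age  # Cache-Control from the origin is honored
    return DEFAULT_TTL_S


if __name__ == "__main__":
    print(bundle_path("t-123"), cache_tag("t-123"), effective_ttl(200, None))
```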
5. Networking
- All Cloud Run services on melmastoon-prod-vpc via Serverless VPC Access.
- API ingress only from the platform API gateway (api.melmastoon.app); a Cloud Armor policy enforces the gateway egress IP allow-list at the LB tier.
- Internal mTLS endpoints (e.g. /internal/email-theme/...) reachable only from internal VPC (Cloud Run ingress = internal-and-cloud-load-balancing).
- Egress routed via Cloud NAT for AI orchestrator + file-storage HTTP calls (both internal so usually private; NAT used as fallback).
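The allow-list rule amounts to a CIDR membership test on the caller's source IP. The sketch below illustrates that check in plain Python. It is not Cloud Armor rule syntax, and the range shown is a placeholder from the RFC 5737 documentation block, not the real gateway egress range.

```python
import ipaddress
from typing import Iterable

# Hypothetical placeholder; the real gateway egress ranges are not listed in this document.
GATEWAY_EGRESS_RANGES = ["203.0.113.0/28"]


def is_allowed(source_ip: str, ranges: Iterable[str] = GATEWAY_EGRESS_RANGES) -> bool:
    """Return True if source_ip falls inside any allow-listed CIDR range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    print(is_allowed("203.0.113.5"))   # inside the /28
    print(is_allowed("198.51.100.1"))  # outside the allow-list
```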
6. CI/CD pipeline
PR opened
└─ GitHub Actions: lint, typecheck, unit, integration, contract, http (Testcontainers)
PR merged into main
└─ Build container (Cloud Build) → push to Artifact Registry
└─ Deploy to dev (auto)
└─ Smoke test suite on dev
Promotion to staging (auto on green dev)
└─ Apply migrations (manual approval if migration is destructive — see MIGRATION_PLAN)
└─ Deploy
└─ Run E2E + load smoke
Promotion to prod (manual approval, two-person)
└─ Apply migrations (manual approval, runbook attached)
└─ Deploy with canary: 10 % traffic for 30 min, watch error budget; auto-rollback on burn
└─ Promote to 100 %
Migrations always run before the new container revision is rolled out, and migrations are required to be backwards-compatible with the previous revision (see MIGRATION_PLAN).
Image hardening: distroless base, non-root, read-only filesystem (with tmpfs), no shell.
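The canary step in the pipeline above reduces to a three-way decision: roll back on error-budget burn, promote once the observation window has passed, otherwise keep waiting. A minimal sketch follows. The burn-rate threshold and the function name are assumptions made for illustration; the actual rollback criteria are not specified in this document.

```python
CANARY_TRAFFIC_PCT = 10   # share of traffic sent to the canary
CANARY_WINDOW_MIN = 30    # observation window in minutes
BURN_RATE_LIMIT = 2.0     # hypothetical threshold, not taken from this document


def canary_decision(elapsed_min: float, burn_rate: float) -> str:
    """Decide what to do with a canary revision.

    burn_rate is the error-budget burn multiple observed on the canary
    (1.0 means the budget is being consumed exactly on schedule).
    """
    if burn_rate > BURN_RATE_LIMIT:
        return "rollback"   # a burn check takes priority over elapsed time
    if elapsed_min >= CANARY_WINDOW_MIN:
        return "promote"    # window completed without excessive burn
    return "hold"


if __name__ == "__main__":
    print(canary_decision(10, 0.5))
    print(canary_decision(30, 0.5))
    print(canary_decision(5, 3.0))
```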
7. Secrets
All secrets live in Secret Manager and are mounted as files via the Secret Manager CSI driver:
| Secret | Path | Rotation |
|---|---|---|
| memorystore-auth-string | /secrets/redis-auth | 90 days |
| desktop-pairing-hmac | /secrets/desktop-pairing-hmac | 90 days |
| Cloud SQL, Pub/Sub, GCS, AI orchestrator mTLS (workload-identity managed) | n/a | n/a (no static secret) |
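Because secrets arrive as mounted files, application code only needs a file read. The helper below is a hypothetical sketch (the service itself is NestJS, so this shows the pattern rather than the actual implementation). Reading the file on each use, rather than caching the value at startup, is one way to pick up a rotated secret without a redeploy.

```python
from pathlib import Path


def read_secret(path: str) -> str:
    """Read a mounted secret file and drop the trailing newline, if any."""
    return Path(path).read_text(encoding="utf-8").rstrip("\n")


if __name__ == "__main__":
    # The path comes from the table above; the file exists only inside the container.
    redis_auth_path = "/secrets/redis-auth"
    if Path(redis_auth_path).exists():
        print(len(read_secret(redis_auth_path)))
```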
8. Capacity planning (year-1 prod-eu)
| Dimension | Year-1 expectation | Headroom designed |
|---|---|---|
| Active themes | 5 000 (tenants) | scales to 50 000 with current sizing |
| Theme versions per theme (live) | 1 published + ~3 drafts | unbounded archived |
| Authoring rps p95 | 50 | 2 400 concurrent requests (30 max instances × concurrency 80) |
| Public bundle reads rps | 5 000 (mostly CDN-cache-hit) | 50 000+ (CDN scales) |
| Publish events / day | 200 | 10 000 |
| AI calls / day | 2 000 | 20 000 (orchestrator-bound) |
| DB IOPS p95 | 200 | 8 000 (db-custom-4) |
| DB storage | 30 GB | 1 TB |
| Memorystore memory | 200 MB | 1 GB |
| GCS object count | 30 000 | unbounded |
Quarterly capacity review feeds the platform finance dashboard.
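For the quarterly review, the headroom in the table can be expressed as a multiple of the year-1 expectation. The sketch below copies a few rows from the table. The dictionary keys and function name are illustrative, and the "50 000+" figure for public reads is treated as a lower bound.

```python
from typing import Dict, Tuple

# (year-1 expectation, designed headroom), copied from the table above
CAPACITY: Dict[str, Tuple[float, float]] = {
    "active_themes": (5_000, 50_000),
    "public_bundle_reads_rps": (5_000, 50_000),  # "50 000+" taken as a floor
    "publish_events_per_day": (200, 10_000),
    "ai_calls_per_day": (2_000, 20_000),
    "db_iops_p95": (200, 8_000),
}


def headroom_multiples(capacity: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return designed headroom divided by expected load for each dimension."""
    return {name: designed / expected for name, (expected, designed) in capacity.items()}


if __name__ == "__main__":
    for name, multiple in sorted(headroom_multiples(CAPACITY).items(), key=lambda kv: kv[1]):
        print(f"{name}: {multiple:.0f}x")
```

Sorting by the multiple puts the tightest dimensions first, which is where a capacity review would normally start.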
9. Disaster recovery
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Cloud Run revision failure | 5 min | 0 | auto rollback to previous revision |
| Cloud SQL primary failure | 5 min | < 5 s | regional HA failover (automatic) |
| europe-west1 region failure | 1 h | < 1 min | promote replica in europe-west4; reconfigure private endpoint; redeploy from CD with region=europe-west4 flag |
| GCS bucket data loss | 4 h | 24 h | restore from versioned objects; rebuild from theme_versions snapshot if needed |
| Pub/Sub regional failure | 30 min | 0 (queue durable) | wait for region recovery; messages buffered |
| Misconfigured publish (bad theme deployed) | 5 min | 0 | rollback via POST /themes/:id/rollback |
| Catastrophic data corruption | 24 h | 24 h | PITR restore to a side instance; reconcile with outbox to replay events |
Runbooks per scenario in services/theme-config-service/runbooks/.
10. Observability deployment
- OTel collector deployed as sidecar container in every Cloud Run revision.
- Exports traces/metrics to Cloud Trace / Cloud Monitoring; logs collected via stdout to Cloud Logging.
- Sinks: BigQuery for long-term log retention (90 d), Cloud Storage for compliance archive (7 y).
11. Cost levers (in priority order)
- CDN cache-hit ratio — every additional 1 % avoids ~ $300/mo at year-1 traffic.
- Memorystore tier size vs origin reads — keep at 1 GB until p95 origin hit ratio drops below 98 %.
- AI surfaces — orchestrator-side budgets prevent runaway spend; eval suite reduces re-prompting.
- Cloud SQL CPU class — vertical-scale only after PgBouncer pool saturation and slow-query optimisation.
- Pub/Sub message size — keep payloads small; never inline bundles in events.
12. References
- Service template: docs/standards/SERVICE_TEMPLATE.md
- Migration policy: MIGRATION_PLAN
- Local dev: LOCAL_DEV_SETUP
- Observability: OBSERVABILITY