file-storage-service — DEPLOYMENT_TOPOLOGY

Companion: SERVICE_OVERVIEW · 02 Enterprise Architecture · ADR-0001 Core Architecture · LOCAL_DEV_SETUP

This service runs on GCP in the melmastoon-prod project (with melmastoon-stg and melmastoon-dev for staging and development). Compute is Cloud Run for the API and workers, with a GKE Autopilot carve-out for the long-running ClamAV cluster, which is stateful enough to warrant pods rather than per-request containers. The primary region is europe-west4; the failover read replica is in europe-west1. Desktop clients are Electron, never Tauri.

1. Workload inventory

Workload	Runtime	Replicas (prod)	Scaling	Image
`file-storage-api`	Cloud Run (Node 20, fastify)	min 3, max 50	CPU 60 % + concurrency target 60	`gcr.io/melmastoon-prod/file-storage-service:<sha>`
`file-storage-outbox-relay`	Cloud Run (always-on)	min 2, max 8	RPS-driven, custom metric `outbox_unpublished_total`	same image, `MODE=outbox-relay`
`file-storage-inbox-consumer`	Cloud Run (Pub/Sub push)	min 2, max 20	Pub/Sub backlog	same image, `MODE=inbox-consumer`
`file-storage-retention-sweeper`	Cloud Run Job	n/a (cron)	every 5 min	same image, `MODE=sweeper-retention`
`file-storage-session-cleaner`	Cloud Run Job	n/a (cron)	every 15 min	same image, `MODE=sweeper-sessions`
`file-storage-cdn-invalidation-worker`	Cloud Run	min 1, max 4	queue depth	same image, `MODE=cdn-invalidate`
`file-storage-optimizer-worker`	Cloud Run Job triggered by Pub/Sub via Eventarc	n/a	per-message; 200 max parallel	`gcr.io/melmastoon-prod/file-optimizer-worker:<sha>`
`file-storage-clamav`	GKE Autopilot StatefulSet (3 pods)	3 → 12 (HPA on queue length)	CPU + custom metric	`gcr.io/melmastoon-prod/clamav-wrapper:<sha>`
`file-storage-private-cdn-sidecar`	Cloud Run (per-region)	min 1, max 8	RPS	small Go binary that fronts private-bucket reads, checks signed-url-blacklist Redis ZSET

The service image and the optimizer image are separate because the optimizer's runtime base (debian-slim + sharp + ffmpeg) is large; keeping them apart minimizes API container cold-start time.
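
Because several workloads share one image and differ only in the MODE variable, the entrypoint has to pick a role at startup. The following is a minimal sketch of that dispatch, not the service's actual bootstrap code; `resolveMode` and `isMode` are hypothetical names, and failing fast on an unset MODE is an assumption.

```typescript
// Modes taken from the workload inventory and the Cloud Run config (MODE=api).
const MODES = [
  "api",
  "outbox-relay",
  "inbox-consumer",
  "sweeper-retention",
  "sweeper-sessions",
  "cdn-invalidate",
] as const;

type Mode = (typeof MODES)[number];

// Type guard: true when the string is one of the known modes.
function isMode(value: string): value is Mode {
  return (MODES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: validates the raw MODE value and fails fast on
// anything unknown (assumption: an unset MODE is treated as an error).
function resolveMode(raw: string | undefined): Mode {
  if (raw === undefined || !isMode(raw)) {
    throw new Error(
      `Unknown MODE "${String(raw)}"; expected one of: ${MODES.join(", ")}`
    );
  }
  return raw;
}
```

At startup this would be called once with the MODE environment variable, and the result used to select which server or worker loop to start.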

2. Regional topology

              ┌──────────────────────────── europe-west4 (primary) ─────────────────────────────┐
Cloud Load    │  Cloud Run: file-storage-api (multi-region)                                     │
Balancer ───► │  Cloud Run: outbox-relay, inbox-consumer, cdn-invalidator                       │
              │  GKE Autopilot: clamav (zonal HA across 3 zones)                                │
              │  Cloud SQL Postgres 16 HA primary (regional)                                    │
              │  Memorystore Redis (HA tier)                                                    │
              │  GCS buckets: media-prod, private-prod, archive-prod                            │
              │  Cloud KMS: file-storage key ring                                               │
              │  Pub/Sub: melmastoon.file.* topics + DLQs                                       │
              └─────────────────────────────────────────────────────────────────────────────────┘
                                              │
                       ┌──────────────────────┴──────────────────────┐
                       ▼                                              ▼
              ┌─── europe-west1 (warm DR) ───┐         ┌─── me-central1 (read POPs, Phase 3) ───┐
              │ Cloud SQL read replica       │         │ Cloud Run: file-storage-api (read)     │
              │ Cloud Run: file-storage-api  │         │ Memorystore Redis (cache only)         │
              │ (cold; promoted on failover) │         │ Reads only; writes always to ew4       │
              │ Memorystore Redis (mirror)   │         └────────────────────────────────────────┘
              └──────────────────────────────┘

GCS buckets are dual-region (eur4 = europe-west4 + europe-north1) for the media and archive classes; private is single-region (europe-west4) by policy to honour data residency.
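
The read/write split in the diagram above can be summarized as a small routing rule. This is an illustrative sketch only: in practice the load balancer and DNS make this decision, not application code. `targetRegion`, `RoutingInput`, and the `failoverActive` flag are hypothetical, and sending all traffic to europe-west1 during failover is an assumption.

```typescript
type Region = "europe-west4" | "europe-west1" | "me-central1";

interface RoutingInput {
  method: string;          // HTTP method of the request
  nearestRegion: Region;   // region closest to the caller
  failoverActive: boolean; // true once europe-west1 has been promoted
}

// Writes always go to the primary; reads may be served by a read POP.
function targetRegion(input: RoutingInput): Region {
  if (input.failoverActive) {
    // Assumption: after promotion, europe-west1 takes all traffic.
    return "europe-west1";
  }
  const method = input.method.toUpperCase();
  const isRead = method === "GET" || method === "HEAD";
  if (isRead && input.nearestRegion === "me-central1") {
    return "me-central1";
  }
  return "europe-west4";
}
```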

3. Infrastructure-as-code

All infra lives in infra/melmastoon/modules/file-storage/:

infra/melmastoon/modules/file-storage/
├── main.tf                          # Cloud Run services, CR jobs, IAM bindings
├── pubsub.tf                        # topics + subscriptions + DLQs
├── gcs.tf                           # buckets, lifecycle rules, CMEK bindings
├── kms.tf                           # key ring + key + rotation schedule
├── cloudsql.tf                      # primary + replica + private IP
├── memorystore.tf                   # Redis instance
├── cdn.tf                           # backend bucket, URL map, edge cache
├── alerts.tf                        # Cloud Monitoring alert policies
├── slo.tf                           # SLO objects per SLI
├── workload_identity.tf             # service accounts + WIF bindings
├── gke_clamav.tf                    # GKE namespace + StatefulSet + HPA
└── env/
    ├── dev.tfvars
    ├── stg.tfvars
    └── prod.tfvars

The Helm chart (GKE resources only) lives in infra/charts/clamav/ and is deployed by Argo CD from the infra-deploy repo.

4. Service accounts & IAM

Workload	Service account	Roles
`file-storage-api`	`file-storage-api@melmastoon-prod.iam`	`roles/storage.objectAdmin` (scoped to the 3 buckets), `roles/cloudkms.cryptoKeyEncrypterDecrypter`, `roles/iam.serviceAccountTokenCreator` (for `signBlob`), `roles/pubsub.publisher`, `roles/cloudsql.client`, `roles/secretmanager.secretAccessor`
`outbox-relay`	same as API	same
`inbox-consumer`	`file-storage-inbox@…`	`roles/pubsub.subscriber` on tenant.* and property.* topics
`optimizer-worker`	`file-optimizer@…`	`roles/storage.objectViewer` + `roles/storage.objectCreator` on private/media buckets, `roles/cloudkms.cryptoKeyEncrypterDecrypter`
`clamav`	`clamav-scan@…`	`roles/storage.objectViewer` on all 3 buckets (read-only), `roles/run.invoker` on the `/internal/v1/files/scan-callback` endpoint
`cdn-invalidator`	`cdn-invalidator@…`	`roles/compute.urlMapAdmin` (scoped to file-storage URL map)
`retention-sweeper`	`file-sweeper@…`	superset (read DB + delete GCS); audited

Workload Identity Federation is the only auth method — no static keys. The iam.serviceAccounts.signBlob permission is what allows the API to sign GCS V4 URLs without holding a long-lived private key.

5. Networking

All Cloud Run services attach to a Serverless VPC connector (europe-west4/file-storage-vpc-conn).
Cloud SQL is reachable over private IP only; it has no public IP.
Memorystore Redis is private.
Pub/Sub uses Google's private path; no public endpoint.
Cloud Run ingress is internal-and-cloud-load-balancing — only the platform LB can hit the API.
The CDN's backend bucket has Signed URL enforcement on; cache-fill goes via authenticated GCS access.
Egress IPs are pinned for outbound (ClamAV → Cloud Run callback) via NAT.

6. Cloud SQL configuration

Setting	Value
Engine	Postgres 16
Tier (prod)	`db-custom-8-32768` (8 vCPU / 32 GB RAM)
Disk	500 GB SSD, autogrow enabled
HA	Regional (sync replica in another zone)
Backups	daily 02:00 UTC, 35-day retention
PITR	enabled, 7-day window
Read replica	1 in `europe-west1` (warm DR)
Connections	PgBouncer sidecar in transaction pooling, default pool 25 / instance
Extensions	`pgcrypto`, `pg_trgm`, `pg_stat_statements`
Maintenance window	Sunday 04:00–05:00 UTC
CMEK	yes (key in `file-storage` key ring)
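
The PgBouncer pool size interacts with Cloud Run autoscaling, so it helps to work out the connection budget explicitly. The sketch below assumes every API instance's sidecar can open its full pool of 25 server connections; `connectionBudget` is a hypothetical helper, not part of the service.

```typescript
interface ScalingBounds {
  minInstances: number;
  maxInstances: number;
}

interface ConnectionBudget {
  atMinScale: number;
  atMaxScale: number;
}

// Worst-case Postgres connections, assuming each instance fills its pool.
function connectionBudget(
  bounds: ScalingBounds,
  poolSizePerInstance: number
): ConnectionBudget {
  return {
    atMinScale: bounds.minInstances * poolSizePerInstance,
    atMaxScale: bounds.maxInstances * poolSizePerInstance,
  };
}

// file-storage-api in prod: min 3, max 50 instances, pool of 25 each.
const apiBudget = connectionBudget({ minInstances: 3, maxInstances: 50 }, 25);
// apiBudget.atMinScale is 75; apiBudget.atMaxScale is 1250.
```

Under these assumptions the API alone could request up to 1250 server connections at full scale, which is worth comparing against the instance's max_connections setting before raising maxScale or the pool size.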

7. GCS buckets

Bucket	Class	Region	UBLA	Versioning	Lifecycle	CMEK
`melmastoon-media-prod`	Standard	`eur4` (dual)	on	30 d	thumbs/auto-delete after 30 d for `archived` rows; tx_lifecycle for upload session orphans (1 d)	Google-managed
`melmastoon-private-prod`	Standard	`europe-west4`	on	30 d	per-policy hard delete via service code (lifecycle rules disabled by default)	CMEK
`melmastoon-archive-prod`	Coldline	`eur4` (dual)	on	30 d	per-policy hard delete via service code; Bucket Lock = 7 y for tax_compliance scope	CMEK + Bucket Lock

A separate bucket melmastoon-quarantine-prod (Coldline, single-region, very restrictive IAM) holds quarantined files for the 30-day forensic window.

8. Cloud Run service config

# file-storage-api Cloud Run service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "3"
        autoscaling.knative.dev/maxScale: "50"
        run.googleapis.com/cpu-throttling: "false"
        run.googleapis.com/execution-environment: gen2
    spec:
      containerConcurrency: 80
      timeoutSeconds: 60
      serviceAccountName: file-storage-api@melmastoon-prod.iam.gserviceaccount.com
      containers:
        - image: gcr.io/melmastoon-prod/file-storage-service:<sha>
          resources:
            limits:
              cpu: "2"
              memory: "1Gi"
          env:
            - name: NODE_ENV
              value: production
            - name: MODE
              value: api
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef: { name: cloudsql-url, key: latest }
            - name: REDIS_URL
              valueFrom:
                secretKeyRef: { name: redis-url, key: latest }
            - name: PUBSUB_PROJECT_ID
              value: melmastoon-prod
            - name: KMS_KEY_RESOURCE
              value: projects/melmastoon-prod/locations/europe-west4/keyRings/file-storage/cryptoKeys/private
            - name: GCS_BUCKET_MEDIA
              value: melmastoon-media-prod
            - name: GCS_BUCKET_PRIVATE
              value: melmastoon-private-prod
            - name: GCS_BUCKET_ARCHIVE
              value: melmastoon-archive-prod
            - name: AI_ORCHESTRATOR_BASE_URL
              value: https://ai-orchestrator.internal.melmastoon.ghasi.io
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: https://otel-collector.internal.melmastoon.ghasi.io
            - name: SCAN_CALLBACK_HMAC_SECRET
              valueFrom:
                secretKeyRef: { name: scan-callback-hmac, key: latest }
          startupProbe: { httpGet: { path: /readyz, port: 8080 }, periodSeconds: 5, failureThreshold: 12 }
          livenessProbe: { httpGet: { path: /healthz, port: 8080 }, periodSeconds: 30 }
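
On the application side, the environment variables above are worth validating once at startup so that a missing secret fails the startup probe rather than a later request. A minimal sketch, assuming the variables listed here are all mandatory; `loadConfig` and `REQUIRED_ENV` are hypothetical names.

```typescript
// Variable names copied from the Cloud Run spec above; treating all of
// them as required is an assumption.
const REQUIRED_ENV = [
  "DATABASE_URL",
  "REDIS_URL",
  "PUBSUB_PROJECT_ID",
  "KMS_KEY_RESOURCE",
  "GCS_BUCKET_MEDIA",
  "GCS_BUCKET_PRIVATE",
  "GCS_BUCKET_ARCHIVE",
  "SCAN_CALLBACK_HMAC_SECRET",
] as const;

type RequiredKey = (typeof REQUIRED_ENV)[number];
type ServiceConfig = Record<RequiredKey, string>;

// Returns a typed config object, or throws listing every missing variable.
function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  const missing = REQUIRED_ENV.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`
    );
  }
  const config = {} as ServiceConfig;
  for (const key of REQUIRED_ENV) {
    config[key] = env[key] as string;
  }
  return config;
}
```

In a Node process this would typically be called as loadConfig(process.env) before the HTTP server starts listening.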

9. Pub/Sub topology

Topic	Producer	Subscribers	Retention	DLQ
`melmastoon.file.upload.lifecycle.v1`	api	bff-backoffice, analytics	7 d	yes
`melmastoon.file.scan.results.v1`	api	property, notification, billing, theme-config, security-siem	7 d	yes
`melmastoon.file.optimization.v1`	api	property, theme-config, search-aggregation	7 d	yes
`melmastoon.file.deletion.v1`	api	property, theme-config, billing	7 d	yes
`melmastoon.file.access.v1`	api	security-siem, audit-archive	7 d	yes
`melmastoon.file.retention.v1`	api	compliance-dashboard	30 d	yes
`melmastoon.file.erasure.v1`	api	tenant, reservation, audit-archive	30 d	yes
`melmastoon.file.quota.v1`	api	tenant, bff-backoffice	7 d	yes
`internal.file-storage.optimize-jobs`	api	optimizer-worker (Eventarc)	1 d	yes
`internal.file-storage.scan-jobs`	api	clamav (push)	1 d	yes

Every push subscription has a dead-letter topic, with a limit of 5 delivery attempts and exponential backoff between 1 s and 60 s. An alert fires when DLQ depth exceeds 10.
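
To give a feel for what the retry settings imply, the sketch below computes an illustrative delay schedule. It assumes a simple doubling strategy bounded by the minimum and maximum backoff; Pub/Sub chooses the actual delays itself within those bounds, so treat the numbers as an approximation. `RetryPolicy` and `backoffSchedule` are hypothetical names.

```typescript
interface RetryPolicy {
  minBackoffSeconds: number;
  maxBackoffSeconds: number;
  maxDeliveryAttempts: number;
}

// Values from the DLQ description: 5 attempts, backoff from 1 s to 60 s.
const DLQ_POLICY: RetryPolicy = {
  minBackoffSeconds: 1,
  maxBackoffSeconds: 60,
  maxDeliveryAttempts: 5,
};

// Returns the wait (in seconds) before each redelivery. There is one wait
// between consecutive attempts, so the list has maxDeliveryAttempts - 1 items.
function backoffSchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < policy.maxDeliveryAttempts; attempt++) {
    const doubled = policy.minBackoffSeconds * 2 ** (attempt - 1);
    delays.push(Math.min(policy.maxBackoffSeconds, doubled));
  }
  return delays;
}

const schedule = backoffSchedule(DLQ_POLICY); // [1, 2, 4, 8]
```

Under this doubling assumption a message that keeps failing reaches the DLQ after roughly 15 seconds of cumulative waiting, so the 60 s ceiling only comes into play if the attempt limit is raised.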

10. CDN

Cloud CDN fronts melmastoon-media-prod only:

Backend bucket: melmastoon-media-prod, signed URL required.
URL map: host cdn.melmastoon.ghasi.io, path /tenants/* → backend bucket.
Cache key: includes tenant_id and scope from the path; query strings are ignored, so image variants are distinguished only by the variant suffix in the path.
Default TTL: 8 h; max 24 h.
Negative TTL: 0 (don't cache 4xx).
Signed cookies for browser sessions on the booking site (tenant-booking BFF issues).
Invalidation API used by cdn-invalidation-worker on delete / variant publish.
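
The cache-key rule can be illustrated with a small parser. This sketch assumes a path layout of /tenants/<tenant_id>/<scope>/<object path>, which is not spelled out in this document; `deriveCacheKey` and the example file names are hypothetical. The real cache key is configured on the CDN backend, not computed in application code.

```typescript
interface CacheKeyParts {
  tenantId: string;
  scope: string;
  key: string; // path without the query string
}

// Drops the query string and extracts tenant and scope from the path.
function deriveCacheKey(requestPath: string): CacheKeyParts {
  const queryStart = requestPath.indexOf("?");
  const path = queryStart === -1 ? requestPath : requestPath.slice(0, queryStart);
  const [root, tenantId, scope, ...rest] = path
    .split("/")
    .filter((segment) => segment !== "");
  if (
    root !== "tenants" ||
    tenantId === undefined ||
    scope === undefined ||
    rest.length === 0
  ) {
    throw new Error(`Path does not match /tenants/<tenant_id>/<scope>/...: ${path}`);
  }
  return { tenantId, scope, key: path };
}
```

Two requests that differ only in their query string map to the same key, while a different variant suffix in the file name (for example a hypothetical hero_w640.webp versus hero_w1280.webp) yields a separate cache entry.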

11. Deployment workflow

PR merged to main → GH Actions builds image, runs full CI (see TESTING_STRATEGY §12), pushes to GCR.
Argo CD detects new image tag in the infra-deploy repo → applies Terraform plan against stg.
Smoke tests (Playwright + k6 light) run against staging.
Manual approval gate (Tech Lead + SRE) → Argo CD applies prod plan.
Canary: 5 % traffic to new revision for 30 minutes.
SLO burn check: if either the availability_write or the availability_read SLO drops by ≥ 0.1 % during the canary, the deploy rolls back automatically.
Promote to 100 % traffic → mark deploy successful in Cloud Deploy.

Migrations run as a pre-deploy Helm hook (Cloud Run Job): node-pg-migrate up. Forward-only; rollbacks require an explicit reverse migration committed in advance.
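
The canary gate in the workflow above reduces to a simple comparison. The sketch below is one possible reading, not the pipeline's actual check: it interprets "0.1 %" as 0.1 percentage points of availability measured against a pre-canary baseline, and `SloSnapshot` and `shouldRollback` are hypothetical names.

```typescript
interface SloSnapshot {
  availabilityWrite: number; // percent, e.g. 99.95
  availabilityRead: number;  // percent
}

// Assumption: the threshold is expressed in percentage points.
const MAX_DROP_PERCENTAGE_POINTS = 0.1;

// True when either SLO has dropped by at least the threshold.
function shouldRollback(baseline: SloSnapshot, canary: SloSnapshot): boolean {
  const writeDrop = baseline.availabilityWrite - canary.availabilityWrite;
  const readDrop = baseline.availabilityRead - canary.availabilityRead;
  return (
    writeDrop >= MAX_DROP_PERCENTAGE_POINTS ||
    readDrop >= MAX_DROP_PERCENTAGE_POINTS
  );
}
```

A production check would usually also require a minimum request volume during the 30-minute window, so that a handful of failed requests at 5 % traffic does not trigger a spurious rollback.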

12. Resource sizing (per environment)

Resource	dev	stg	prod
Cloud Run API minScale	0	1	3
Cloud Run API maxScale	5	20	50
Cloud SQL tier	`db-f1-micro`	`db-custom-2-7680`	`db-custom-8-32768`
Redis tier	basic 1 GB	standard 5 GB	standard 16 GB
ClamAV pods	1	2	3 (HPA → 12)
GCS storage class	Standard (single region)	Standard	Standard / Coldline
KMS keys	shared dev key	dedicated stg key	dedicated prod key

13. Disaster recovery

Failure	RTO	RPO	Procedure
Cloud Run rev bug	< 5 min	0	Auto rollback via canary SLO; if breached promote previous revision
Cloud SQL primary failure	< 60 s	0	Auto-failover to sync replica in same region
Region failure (`europe-west4`)	< 30 min	≤ 5 min	Promote `europe-west1` read replica; redirect traffic via DNS; rebuild Redis from cold
GCS partial outage (one region of dual-region)	transparent	0	Dual-region GCS handles automatically
KMS key revoked accidentally	< 30 min	0	Restore from previous key version (kept for retention horizon)
Pub/Sub topic deletion	< 60 min	possible event loss	Recreate from terraform; replay outbox table for unpublished window
Catastrophic data loss (DB + replica corrupt)	< 6 h	≤ 24 h	Restore from automated backup; replay outbox for 24 h

DR drills run quarterly in stg with a prod-replica snapshot.
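
For drill reports it can help to encode the RTO column as data and compare measured recovery times against it. A minimal sketch with hypothetical scenario keys; the GCS partial-outage row is omitted because its RTO is listed as transparent rather than as a duration.

```typescript
type Scenario =
  | "cloud-run-revision-bug"
  | "cloud-sql-primary-failure"
  | "region-failure"
  | "kms-key-revoked"
  | "pubsub-topic-deletion"
  | "catastrophic-data-loss";

// RTO targets from the table above, converted to seconds.
const RTO_SECONDS: Record<Scenario, number> = {
  "cloud-run-revision-bug": 5 * 60,
  "cloud-sql-primary-failure": 60,
  "region-failure": 30 * 60,
  "kms-key-revoked": 30 * 60,
  "pubsub-topic-deletion": 60 * 60,
  "catastrophic-data-loss": 6 * 60 * 60,
};

// The table states targets as "< X", so the comparison is strict.
function drillMetRto(scenario: Scenario, measuredSeconds: number): boolean {
  return measuredSeconds < RTO_SECONDS[scenario];
}
```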

14. Cost controls

Cloud Run min-instance is set per-env (0 in dev, 3 in prod) to balance cold-starts vs. idle cost.
GCS lifecycle rules clean up upload-session orphans (tx_lifecycle/) after 1 day.
BigQuery export uses partitioned tables with 13-month rolling drop.
Optimizer worker uses Cloud Run Job (per-message billed) rather than always-on; Eventarc fan-out caps at 200 parallel to control burst spend.
ClamAV pods are GKE Autopilot — pay per pod-second, no node management overhead.
Datastream-to-BigQuery filtered to non-PII columns to reduce BQ storage cost.

15. References

LOCAL_DEV_SETUP for the local docker-compose equivalent.
SECURITY_MODEL §6 §7 for KMS / signed URL details.
OBSERVABILITY for SLO and alert wiring referenced in alerts.tf.
