file-storage-service — DEPLOYMENT_TOPOLOGY

Companion: SERVICE_OVERVIEW · 02 Enterprise Architecture · ADR-0001 Core Architecture · LOCAL_DEV_SETUP

This service is deployed on GCP in the melmastoon-prod project (with melmastoon-stg and melmastoon-dev for the lower environments). Compute is Cloud Run for the API and workers, with a GKE Autopilot carve-out for the long-running ClamAV cluster, which is stateful enough to warrant pods rather than per-request containers. The primary region is europe-west4; the failover read replica is in europe-west1. Desktop clients are Electron, never Tauri.

1. Workload inventory

| Workload | Runtime | Replicas (prod) | Scaling | Image |
|---|---|---|---|---|
| file-storage-api | Cloud Run (Node 20, fastify) | min 3, max 50 | CPU 60 % + concurrency target 60 | gcr.io/melmastoon-prod/file-storage-service:<sha> |
| file-storage-outbox-relay | Cloud Run (always-on) | min 2, max 8 | RPS-driven, custom metric outbox_unpublished_total | same image, MODE=outbox-relay |
| file-storage-inbox-consumer | Cloud Run (Pub/Sub push) | min 2, max 20 | Pub/Sub backlog | same image, MODE=inbox-consumer |
| file-storage-retention-sweeper | Cloud Run Job | n/a (cron) | every 5 min | same image, MODE=sweeper-retention |
| file-storage-session-cleaner | Cloud Run Job | n/a (cron) | every 15 min | same image, MODE=sweeper-sessions |
| file-storage-cdn-invalidation-worker | Cloud Run | min 1, max 4 | queue depth | same image, MODE=cdn-invalidate |
| file-storage-optimizer-worker | Cloud Run Job triggered by Pub/Sub via Eventarc | n/a | per-message; 200 max parallel | gcr.io/melmastoon-prod/file-optimizer-worker:<sha> |
| file-storage-clamav | GKE Autopilot StatefulSet (3 pods) | 3 → 12 (HPA on queue length) | CPU + custom metric | gcr.io/melmastoon-prod/clamav-wrapper:<sha> |
| file-storage-private-cdn-sidecar | Cloud Run (per-region) | min 1, max 8 | RPS | small Go binary that fronts private-bucket reads, checks signed-url-blacklist Redis ZSET |

The service image and the optimizer image are separate because the optimizer's runtime base (debian-slim + sharp + ffmpeg) is large; keeping them apart minimizes API container cold-start time.
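
Because several workloads run the same image and differ only in MODE, the entrypoint has to pick a role at startup. The following is a minimal sketch of that dispatch, not the service's actual code: the MODE values come from the inventory above (plus api from the Cloud Run spec), while resolveMode, roleKind, and ROLE_KIND are hypothetical names introduced for illustration.

```typescript
// MODE values as listed in the workload inventory (and MODE=api for the API).
const MODES = [
  "api",
  "outbox-relay",
  "inbox-consumer",
  "sweeper-retention",
  "sweeper-sessions",
  "cdn-invalidate",
] as const;

type Mode = (typeof MODES)[number];

// Hypothetical helper: validate MODE up front so a typo fails the revision
// at startup instead of silently booting the wrong role.
function resolveMode(env: Record<string, string | undefined>): Mode {
  const raw = env["MODE"];
  if (raw === undefined || raw === "") {
    throw new Error("MODE is not set");
  }
  if ((MODES as readonly string[]).indexOf(raw) === -1) {
    throw new Error(`Unknown MODE "${raw}"; expected one of: ${MODES.join(", ")}`);
  }
  return raw as Mode;
}

// Hypothetical role table. A real entrypoint would map each mode to the
// function that starts the HTTP server, the consumer, or the one-shot job.
const ROLE_KIND: Record<Mode, "service" | "job"> = {
  "api": "service",
  "outbox-relay": "service",
  "inbox-consumer": "service",
  "sweeper-retention": "job",
  "sweeper-sessions": "job",
  "cdn-invalidate": "service",
};

function roleKind(env: Record<string, string | undefined>): "service" | "job" {
  return ROLE_KIND[resolveMode(env)];
}
```

Passing the environment in as a plain record, rather than reading a global, keeps the dispatch easy to unit-test for each MODE value.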

2. Regional topology

               ┌────────────────── europe-west4 (primary) ──────────────────┐
Cloud Load     │ Cloud Run: file-storage-api (multi-region)                 │
Balancer ───►  │ Cloud Run: outbox-relay, inbox-consumer, cdn-invalidator   │
               │ GKE Autopilot: clamav (zonal HA across 3 zones)            │
               │ Cloud SQL Postgres 16 HA primary (regional)                │
               │ Memorystore Redis (HA tier)                                │
               │ GCS buckets: media-prod, private-prod, archive-prod        │
               │ Cloud KMS: file-storage key ring                           │
               │ Pub/Sub: melmastoon.file.* topics + DLQs                   │
               └─────────────────────────────┬──────────────────────────────┘
                      ┌──────────────────────┴──────────────────────┐
                      ▼                                             ▼
      ┌─── europe-west1 (warm DR) ───┐         ┌─── me-central1 (read POPs, Phase 3) ───┐
      │ Cloud SQL read replica       │         │ Cloud Run: file-storage-api (read)     │
      │ Cloud Run: file-storage-api  │         │ Memorystore Redis (cache only)         │
      │ (cold; promoted on failover) │         │ Reads only; writes always to ew4       │
      │ Memorystore Redis (mirror)   │         └────────────────────────────────────────┘
      └──────────────────────────────┘

GCS buckets are dual-region (eur4 = europe-west4 + europe-north1) for the media and archive classes; private is single-region (europe-west4) by policy to honour data residency.

3. Infrastructure-as-code

All infra lives in infra/melmastoon/modules/file-storage/:

infra/melmastoon/modules/file-storage/
├── main.tf                # Cloud Run services, CR jobs, IAM bindings
├── pubsub.tf              # topics + subscriptions + DLQs
├── gcs.tf                 # buckets, lifecycle rules, CMEK bindings
├── kms.tf                 # key ring + key + rotation schedule
├── cloudsql.tf            # primary + replica + private IP
├── memorystore.tf         # Redis instance
├── cdn.tf                 # backend bucket, URL map, edge cache
├── alerts.tf              # Cloud Monitoring alert policies
├── slo.tf                 # SLO objects per SLI
├── workload_identity.tf   # service accounts + WIF bindings
├── gke_clamav.tf          # GKE namespace + StatefulSet + HPA
└── env/
    ├── dev.tfvars
    ├── stg.tfvars
    └── prod.tfvars

The Helm chart (for the GKE components only) lives in infra/charts/clamav/ and is deployed by Argo CD from the infra-deploy repo.

4. Service accounts & IAM

| Workload | Service account | Roles |
|---|---|---|
| file-storage-api | file-storage-api@melmastoon-prod.iam | roles/storage.objectAdmin (scoped to the 3 buckets), roles/cloudkms.cryptoKeyEncrypterDecrypter, roles/iam.serviceAccountTokenCreator (for signBlob), roles/pubsub.publisher, roles/cloudsql.client, roles/secretmanager.secretAccessor |
| outbox-relay | same as API | same |
| inbox-consumer | file-storage-inbox@… | roles/pubsub.subscriber on tenant.* and property.* topics |
| optimizer-worker | file-optimizer@… | roles/storage.objectViewer + roles/storage.objectCreator on private/media buckets, roles/cloudkms.cryptoKeyEncrypterDecrypter |
| clamav | clamav-scan@… | roles/storage.objectViewer on all 3 buckets (read-only), roles/run.invoker on the /internal/v1/files/scan-callback endpoint |
| cdn-invalidator | cdn-invalidator@… | roles/compute.urlMapAdmin (scoped to file-storage URL map) |
| retention-sweeper | file-sweeper@… | superset (read DB + delete GCS); audited |

Workload Identity Federation is the only auth method — no static keys. The iam.serviceAccounts.signBlob permission is what allows the API to sign GCS V4 URLs without holding a long-lived private key.

5. Networking

  • All Cloud Run services attach to a Serverless VPC connector (europe-west4/file-storage-vpc-conn).
  • Cloud SQL is private IP only; no public egress.
  • Memorystore Redis is private.
  • Pub/Sub uses Google's private path; no public endpoint.
  • Cloud Run ingress is internal-and-cloud-load-balancing — only the platform LB can hit the API.
  • The CDN's backend bucket has Signed URL enforcement on; cache-fill goes via authenticated GCS access.
  • Egress IPs are pinned for outbound (ClamAV → Cloud Run callback) via NAT.

6. Cloud SQL configuration

| Setting | Value |
|---|---|
| Engine | Postgres 16 |
| Tier (prod) | db-custom-8-32768 (8 vCPU / 32 GB RAM) |
| Disk | 500 GB SSD, autogrow enabled |
| HA | Regional (sync replica in another zone) |
| Backups | daily 02:00 UTC, 35-day retention |
| PITR | enabled, 7-day window |
| Read replica | 1 in europe-west1 (warm DR) |
| Connections | PgBouncer sidecar in transaction pooling, default pool 25 / instance |
| Extensions | pgcrypto, pg_trgm, pg_stat_statements |
| Maintenance window | Sunday 04:00–05:00 UTC |
| CMEK | yes (key in file-storage key ring) |
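
The pool size interacts with Cloud Run autoscaling, so it is worth checking the worst-case connection demand against the database's connection limit. The sketch below is only an arithmetic illustration: it assumes "instance" in the Connections row means a Cloud Run instance and that every instance fills its pool. PooledWorkload and worstCaseServerConnections are hypothetical names, and the database's actual connection limit is not stated in this document.

```typescript
interface PooledWorkload {
  name: string;
  maxInstances: number;        // Cloud Run maxScale for the workload
  poolSizePerInstance: number; // PgBouncer default pool size per instance
}

// Worst case: every instance is running and every pool is fully used.
function worstCaseServerConnections(workloads: PooledWorkload[]): number {
  let total = 0;
  for (const w of workloads) {
    total += w.maxInstances * w.poolSizePerInstance;
  }
  return total;
}

// API only, at prod maxScale: 50 instances * 25 connections = 1250.
const apiAtMaxScale = worstCaseServerConnections([
  { name: "file-storage-api", maxInstances: 50, poolSizePerInstance: 25 },
]);

// API only, at prod minScale: 3 instances * 25 connections = 75.
const apiAtMinScale = worstCaseServerConnections([
  { name: "file-storage-api", maxInstances: 3, poolSizePerInstance: 25 },
]);
```

Under these assumptions the API alone could ask for 1250 server-side connections at full scale, which is the figure to compare with the configured Postgres connection limit when tuning the pool size or maxScale.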

7. GCS buckets

| Bucket | Class | Region | UBLA | Versioning | Lifecycle | CMEK |
|---|---|---|---|---|---|---|
| melmastoon-media-prod | Standard | eur4 (dual) | on | 30 d | thumbs/auto-delete after 30 d for archived rows; tx_lifecycle for upload session orphans (1 d) | Google-managed |
| melmastoon-private-prod | Standard | europe-west4 | on | 30 d | per-policy hard delete via service code (lifecycle rules disabled by default) | CMEK |
| melmastoon-archive-prod | Coldline | eur4 (dual) | on | 30 d | per-policy hard delete via service code; Bucket Lock = 7 y for tax_compliance scope | CMEK + Bucket Lock |

A separate bucket melmastoon-quarantine-prod (Coldline, single-region, very restrictive IAM) holds quarantined files for the 30-day forensic window.

8. Cloud Run service config

# file-storage-api Cloud Run service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "3"
        autoscaling.knative.dev/maxScale: "50"
        run.googleapis.com/cpu-throttling: "false"
        run.googleapis.com/execution-environment: gen2
    spec:
      containerConcurrency: 80
      timeoutSeconds: 60
      serviceAccountName: file-storage-api@melmastoon-prod.iam.gserviceaccount.com
      containers:
        - image: gcr.io/melmastoon-prod/file-storage-service:<sha>
          resources:
            limits:
              cpu: "2"
              memory: "1Gi"
          env:
            - name: NODE_ENV
              value: production
            - name: MODE
              value: api
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef: { name: cloudsql-url, key: latest }
            - name: REDIS_URL
              valueFrom:
                secretKeyRef: { name: redis-url, key: latest }
            - name: PUBSUB_PROJECT_ID
              value: melmastoon-prod
            - name: KMS_KEY_RESOURCE
              value: projects/melmastoon-prod/locations/europe-west4/keyRings/file-storage/cryptoKeys/private
            - name: GCS_BUCKET_MEDIA
              value: melmastoon-media-prod
            - name: GCS_BUCKET_PRIVATE
              value: melmastoon-private-prod
            - name: GCS_BUCKET_ARCHIVE
              value: melmastoon-archive-prod
            - name: AI_ORCHESTRATOR_BASE_URL
              value: https://ai-orchestrator.internal.melmastoon.ghasi.io
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: https://otel-collector.internal.melmastoon.ghasi.io
            - name: SCAN_CALLBACK_HMAC_SECRET
              valueFrom:
                secretKeyRef: { name: scan-callback-hmac, key: latest }
          startupProbe: { httpGet: { path: /readyz, port: 8080 }, periodSeconds: 5, failureThreshold: 12 }
          livenessProbe: { httpGet: { path: /healthz, port: 8080 }, periodSeconds: 30 }
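
Since the container depends on a dozen injected variables, a startup check that fails fast on a missing one can keep a misconfigured revision from passing its startup probe. The following is a rough sketch of such a check, not the service's real bootstrap code: the variable names are copied from the env block above, while validateEnv, missingEnv, and KMS_KEY_PATTERN are hypothetical names.

```typescript
// Variable names taken from the env block of the Cloud Run spec above.
const REQUIRED_ENV: readonly string[] = [
  "NODE_ENV",
  "MODE",
  "DATABASE_URL",
  "REDIS_URL",
  "PUBSUB_PROJECT_ID",
  "KMS_KEY_RESOURCE",
  "GCS_BUCKET_MEDIA",
  "GCS_BUCKET_PRIVATE",
  "GCS_BUCKET_ARCHIVE",
  "AI_ORCHESTRATOR_BASE_URL",
  "OTEL_EXPORTER_OTLP_ENDPOINT",
  "SCAN_CALLBACK_HMAC_SECRET",
];

type Env = Record<string, string | undefined>;

// Names of required variables that are unset or empty.
function missingEnv(env: Env): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

// Shape of a full Cloud KMS CryptoKey resource name.
const KMS_KEY_PATTERN =
  /^projects\/[^/]+\/locations\/[^/]+\/keyRings\/[^/]+\/cryptoKeys\/[^/]+$/;

// Hypothetical startup check: returns a list of problems, empty when the
// environment looks usable.
function validateEnv(env: Env): string[] {
  const problems = missingEnv(env).map((name) => `${name} is missing or empty`);
  const kmsKey = env["KMS_KEY_RESOURCE"];
  if (kmsKey && !KMS_KEY_PATTERN.test(kmsKey)) {
    problems.push("KMS_KEY_RESOURCE is not a full CryptoKey resource name");
  }
  return problems;
}
```

A caller would typically log the returned problems and exit non-zero if the list is not empty, so the failure shows up before traffic is shifted to the revision.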

9. Pub/Sub topology

| Topic | Producer | Subscribers | Retention | DLQ |
|---|---|---|---|---|
| melmastoon.file.upload.lifecycle.v1 | api | bff-backoffice, analytics | 7 d | yes |
| melmastoon.file.scan.results.v1 | api | property, notification, billing, theme-config, security-siem | 7 d | yes |
| melmastoon.file.optimization.v1 | api | property, theme-config, search-aggregation | 7 d | yes |
| melmastoon.file.deletion.v1 | api | property, theme-config, billing | 7 d | yes |
| melmastoon.file.access.v1 | api | security-siem, audit-archive | 7 d | yes |
| melmastoon.file.retention.v1 | api | compliance-dashboard | 30 d | yes |
| melmastoon.file.erasure.v1 | api | tenant, reservation, audit-archive | 30 d | yes |
| melmastoon.file.quota.v1 | api | tenant, bff-backoffice | 7 d | yes |
| internal.file-storage.optimize-jobs | api | optimizer-worker (Eventarc) | 1 d | yes |
| internal.file-storage.scan-jobs | api | clamav (push) | 1 d | yes |

Each push topic has a DLQ subscription with 5-attempt redelivery and exponential backoff (1 s to 60 s). An alert fires when DLQ depth exceeds 10.
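
To make the redelivery policy concrete, the sketch below computes the wait before each retry for a 5-attempt limit with a 1 s minimum and 60 s maximum backoff. It assumes the delay doubles after each failed delivery; that growth factor is an illustration only, since this document does not specify the curve and Pub/Sub's own backoff may differ. The function names are hypothetical.

```typescript
// Values from the DLQ policy: 5 delivery attempts, backoff 1 s to 60 s.
const MAX_DELIVERY_ATTEMPTS = 5;
const MIN_BACKOFF_SECONDS = 1;
const MAX_BACKOFF_SECONDS = 60;

// Delay applied after the n-th failed delivery (n starts at 1).
// Assumption: the delay doubles each time and is capped at the maximum.
function backoffSeconds(failedAttempts: number): number {
  const uncapped = MIN_BACKOFF_SECONDS * Math.pow(2, failedAttempts - 1);
  return Math.min(MAX_BACKOFF_SECONDS, uncapped);
}

// Delays between attempts before the message is dead-lettered. The last
// failed attempt has no delay after it: the message goes to the DLQ.
function retrySchedule(maxAttempts: number = MAX_DELIVERY_ATTEMPTS): number[] {
  const delays: number[] = [];
  for (let failed = 1; failed < maxAttempts; failed++) {
    delays.push(backoffSeconds(failed));
  }
  return delays;
}
```

With doubling, retrySchedule() yields delays of 1, 2, 4, and 8 seconds, so a persistently failing message would reach the DLQ after about 15 seconds of backoff and the 60 s cap would never be hit within five attempts.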

10. CDN

Cloud CDN fronts melmastoon-media-prod only:

  • Backend bucket: melmastoon-media-prod, signed URL required.
  • URL map: host cdn.melmastoon.ghasi.io, path /tenants/* → backend bucket.
  • Cache key: includes tenant_id and scope from the path; query strings are ignored, and image variants are distinguished by their suffix in the path.
  • Default TTL: 8 h; max 24 h.
  • Negative TTL: 0 (don't cache 4xx).
  • Signed cookies for browser sessions on the booking site (tenant-booking BFF issues).
  • Invalidation API used by cdn-invalidation-worker on delete / variant publish.

11. Deployment workflow

  1. PR merged to main → GH Actions builds image, runs full CI (see TESTING_STRATEGY §12), pushes to GCR.
  2. Argo CD detects new image tag in the infra-deploy repo → applies Terraform plan against stg.
  3. Smoke tests (Playwright + k6 light) run against staging.
  4. Manual approval gate (Tech Lead + SRE) → Argo CD applies prod plan.
  5. Canary: 5 % traffic to new revision for 30 minutes.
  6. SLO burn check: if availability_write or availability_read SLOs drop ≥ 0.1 % during canary, automatic rollback.
  7. Promote to 100 % traffic → mark deploy successful in Cloud Deploy.
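
The SLO burn check in step 6 can be sketched as a small decision function. This is an illustration under stated assumptions, not the pipeline's actual gate: it reads "drop ≥ 0.1 %" as a fall of at least 0.1 percentage points relative to the pre-canary baseline, and the names AvailabilitySample and decideCanary are hypothetical.

```typescript
// Availability for each SLO, expressed as a percentage (0 to 100).
interface AvailabilitySample {
  write: number; // availability_write
  read: number;  // availability_read
}

type CanaryDecision = "promote" | "rollback";

// Threshold from step 6, interpreted as percentage points (an assumption).
const MAX_DROP_PCT_POINTS = 0.1;
// Small tolerance so floating-point noise does not hide a drop of exactly 0.1.
const EPSILON = 1e-9;

// Roll back if either SLO fell by at least the threshold during the canary.
function decideCanary(
  baseline: AvailabilitySample,
  canary: AvailabilitySample,
): CanaryDecision {
  const writeDrop = baseline.write - canary.write;
  const readDrop = baseline.read - canary.read;
  const breached =
    writeDrop >= MAX_DROP_PCT_POINTS - EPSILON ||
    readDrop >= MAX_DROP_PCT_POINTS - EPSILON;
  return breached ? "rollback" : "promote";
}
```

For example, a baseline of 99.95 % write availability and a canary reading of 99.80 % is a 0.15-point drop and would trigger a rollback, while a canary reading of 99.93 % would be promoted.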

Migrations run as a pre-deploy Helm hook (Cloud Run Job): node-pg-migrate up. Forward-only; rollbacks require an explicit reverse migration committed in advance.

12. Resource sizing (per environment)

| Resource | dev | stg | prod |
|---|---|---|---|
| Cloud Run API minScale | 0 | 1 | 3 |
| Cloud Run API maxScale | 5 | 20 | 50 |
| Cloud SQL tier | db-f1-micro | db-custom-2-7680 | db-custom-8-32768 |
| Redis tier | basic 1 GB | standard 5 GB | standard 16 GB |
| ClamAV pods | 1 | 2 | 3 (HPA → 12) |
| GCS storage class | Standard (single region) | Standard | Standard / Coldline |
| KMS keys | shared dev key | dedicated stg key | dedicated prod key |

13. Disaster recovery

| Failure | RTO | RPO | Procedure |
|---|---|---|---|
| Cloud Run rev bug | < 5 min | 0 | Auto rollback via canary SLO; if breached promote previous revision |
| Cloud SQL primary failure | < 60 s | 0 | Auto-failover to sync replica in same region |
| Region failure (europe-west4) | < 30 min | ≤ 5 min | Promote europe-west1 read replica; redirect traffic via DNS; rebuild Redis from cold |
| GCS partial outage (one region of dual-region) | transparent | 0 | Dual-region GCS handles automatically |
| KMS key revoked accidentally | < 30 min | 0 | Restore from previous key version (kept for retention horizon) |
| Pub/Sub topic deletion | < 60 min | possible event loss | Recreate from terraform; replay outbox table for unpublished window |
| Catastrophic data loss (DB + replica corrupt) | < 6 h | ≤ 24 h | Restore from automated backup; replay outbox for 24 h |

DR drills run quarterly in stg with a prod-replica snapshot.

14. Cost controls

  • Cloud Run min-instance is set per-env (0 in dev, 3 in prod) to balance cold-starts vs. idle cost.
  • GCS lifecycle rules clean up upload-session orphans (tx_lifecycle/) after 1 day.
  • BigQuery export uses partitioned tables with 13-month rolling drop.
  • Optimizer worker uses Cloud Run Job (per-message billed) rather than always-on; Eventarc fan-out caps at 200 parallel to control burst spend.
  • ClamAV pods are GKE Autopilot — pay per pod-second, no node management overhead.
  • Datastream-to-BigQuery filtered to non-PII columns to reduce BQ storage cost.

15. References