DEPLOYMENT_TOPOLOGY — staff-service

Catalog: docs/03-microservices/staff-service.md · 02 Enterprise Architecture · SECURITY_MODEL · OBSERVABILITY

GCP-native deployment. Cloud Run for compute, Cloud SQL for state, Memorystore for cache, KMS for crypto, Pub/Sub for events, Secret Manager for secrets. Multi-region active-active by M2.

1. Containers

| Container | Purpose | Replicas (prod) | CPU | Memory |
|---|---|---|---|---|
| staff-api | HTTP/REST surface (all /api/v1/*) | min 2, max 30 | 1 vCPU | 768 MiB |
| staff-worker | Outbox relay, inbox consumers, scheduled jobs (auto-close, cert-expiry, suggestion TTL) | min 2, max 10 | 1 vCPU | 512 MiB |
| staff-cron | Cloud Scheduler→Pub/Sub-triggered: nightly reconcile, weekly fairness report, sync-cursor sweep | min 1, max 2 | 0.5 vCPU | 256 MiB |

All three are built from the same source repo with different entrypoints. The image tag is the short git SHA; latest is not used in prod.
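
The tagging rule is easy to enforce in pipeline tooling. The following is a minimal sketch, not the project's actual tooling: `image_ref` is a hypothetical helper, and the 7 to 12 character length for a "short git SHA" is an assumption, since this document does not specify it.

```python
import re

# Assumption: a "short git SHA" is 7-12 lowercase hex characters; the
# document does not state the exact length.
_SHORT_SHA = re.compile(r"[0-9a-f]{7,12}")


def image_ref(repository: str, tag: str) -> str:
    """Return 'repository:tag', rejecting mutable or malformed tags."""
    if tag == "latest":
        raise ValueError("'latest' is not used in prod; pass the short git SHA")
    if not _SHORT_SHA.fullmatch(tag):
        raise ValueError(f"tag {tag!r} does not look like a short git SHA")
    return f"{repository}:{tag}"
```

Calling `image_ref("staff-service", "a1b2c3d")` yields `staff-service:a1b2c3d`, while passing `latest` raises before anything is pushed or deployed.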

2. Scaling Rules

| Container | Trigger | Threshold |
|---|---|---|
| staff-api | Concurrent requests | target 80 / instance |
| staff-api | CPU | target 60 % |
| staff-worker | Outbox depth | scale-up if depth > 100 for 30 s |
| staff-worker | Pub/Sub subscription backlog | scale-up if num_undelivered_messages > 500 |
| staff-cron | n/a | min instances = 1 |

Cold-start mitigation: min instances ≥ 2 for staff-api and staff-worker in prod. CPU is allocated even when idle for staff-api (Cloud Run "CPU always allocated").
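
The two staff-worker triggers combine a sustained condition (outbox depth) with an instantaneous one (subscription backlog). The sketch below shows one way to evaluate them; `WorkerScaler` is a hypothetical helper, and how the resulting decision is fed to Cloud Run is not described in this document.

```python
from dataclasses import dataclass
from typing import Optional

OUTBOX_DEPTH_THRESHOLD = 100   # rows, from the scaling table
OUTBOX_SUSTAIN_SECONDS = 30    # depth must stay above the threshold this long
BACKLOG_THRESHOLD = 500        # num_undelivered_messages, from the scaling table


@dataclass
class WorkerScaler:
    """Tracks how long the outbox depth has been over its threshold."""

    over_since: Optional[float] = None  # time the depth first exceeded the threshold

    def should_scale_up(self, now: float, outbox_depth: int, backlog: int) -> bool:
        if outbox_depth > OUTBOX_DEPTH_THRESHOLD:
            if self.over_since is None:
                self.over_since = now
        else:
            self.over_since = None  # reset as soon as the depth drops back

        sustained = (
            self.over_since is not None
            and now - self.over_since >= OUTBOX_SUSTAIN_SECONDS
        )
        return sustained or backlog > BACKLOG_THRESHOLD
```

A short spike in outbox depth therefore does not trigger scaling, whereas a backlog above 500 does so on the first sample.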

3. Resource Budgets

| Resource | Limit |
|---|---|
| Request timeout | 30 s (most requests < 200 ms; 30 s reserved for slow report exports) |
| Max concurrent requests / instance | 100 (tuned per release) |
| Container startup probe | GET /health/startup 200, deadline 30 s |
| Liveness probe | GET /health/live every 30 s |
| Readiness probe | GET /health/ready every 10 s; checks DB, Redis, KMS, Pub/Sub |
| Memory request | 50 % of limit |
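
The readiness probe aggregates several dependency checks into one status. A minimal sketch of that aggregation follows, assuming each dependency exposes a zero-argument check function; the `readiness` helper and the 503 status for "not ready" are illustrative assumptions, not the service's actual handler.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a check that raises counts as not ready
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

Wired to `/health/ready`, the handler would pass one check each for DB, Redis, KMS, and Pub/Sub. Running all checks rather than stopping at the first failure keeps the response body useful for diagnosis.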

4. Storage Topology

| Layer | Service | Config |
|---|---|---|
| Primary DB | Cloud SQL Postgres 16 (Enterprise Plus) | HA (regional, read replicas in 2 zones); CMEK; PITR 7 d; backup daily 35 d retention |
| Cache | Memorystore Redis 7 | HA (Standard tier, 2 GB), VPC-attached, AUTH enabled |
| Crypto | Cloud KMS | Region-pinned per data-residency; HSM keyring |
| Secrets | Secret Manager | Region-replicated; auto-rotation where supported |
| Events | Pub/Sub | Topics melmastoon.staff.*.v1, retention 7–30 d (per topic) |
| Sync state | Firestore | Native mode, multi-region |
| Cold export | BigQuery | Daily Datastream from staff schema; partitioned daily, clustered by tenant_id |
| Object storage | Cloud Storage | Cert documents (CMEK, signed URLs); attendance CSV exports (lifecycle 30 d) |

5. Region Topology

5.1 M0 (single region)

me-central1 (Doha)
├── staff-api (Cloud Run)
├── staff-worker (Cloud Run)
├── staff-cron (Cloud Run)
├── Cloud SQL primary + 2 read replicas
├── Memorystore primary + replica
├── Cloud KMS keyring
├── Pub/Sub
├── Firestore (multi-region default)
└── Secret Manager

5.2 M2 (multi-region active-active)

             ┌──── Cloud DNS (geo-routed) ─────┐
             │                                 │
             ▼                                 ▼
┌────────────────────────┐        ┌────────────────────────┐
│ me-central1            │        │ europe-west1           │
│ (primary)              │        │ (active-active)        │
│                        │        │                        │
│ staff-api ▸▸▸▸▸        │◀──────▶│ staff-api              │
│ staff-worker           │        │ staff-worker           │
│ staff-cron             │        │ (cron in primary only) │
│ Cloud SQL HA           │ pgsync │ Cloud SQL HA           │
│ Memorystore            │        │ Memorystore            │
│ KMS keyring            │        │ KMS keyring            │
│ Pub/Sub                │ global │ Pub/Sub                │
└────────────────────────┘        └────────────────────────┘
  • The Cloud SQL writer is single-region (per 02 §11); the secondary region hosts a read-only replica fed via Datastream, and promotion is covered by a fail-over runbook.
  • Pub/Sub topics are global resources; where messages are stored is governed by each topic's message storage policy.
  • Firestore is provisioned in a multi-region location.
  • KMS keyrings are region-pinned to honor tenant data-residency.
  • staff-cron runs only in the primary region (singleton).

6. Networking

  • All Cloud Run services attached to a Serverless VPC connector; outbound to Cloud SQL, Memorystore, and internal services routed over the VPC.
  • Inbound from bff-backoffice-service and bff-tenant-booking-service via internal load balancer; no public ingress.
  • mTLS enforced between services in M2 via Anthos Service Mesh (per SECURITY_MODEL §7).

7. Configuration

| Variable | Source | Notes |
|---|---|---|
| DATABASE_URL | Secret Manager | Resolved at startup via Secret Manager API |
| REDIS_URL | Secret Manager | |
| KMS_PIN_PEPPER_KEY | Configmap | Resource name; auth via workload identity |
| KMS_PII_ENVELOPE_KEY | Configmap | |
| PUBSUB_PROJECT_ID | Configmap | |
| IAM_SERVICE_BASE_URL | Configmap | Internal LB |
| PROPERTY_SERVICE_BASE_URL | Configmap | |
| AI_ORCHESTRATOR_BASE_URL | Configmap | |
| LOG_LEVEL | Configmap | info in prod, debug in dev |
| OTEL_EXPORTER_OTLP_ENDPOINT | Configmap | OpenTelemetry collector sidecar |
| STAFF_AUTO_CLOSE_GRACE_MIN | Configmap | default 60 |
| STAFF_GAP_WARN_MIN | Configmap | default 15 |
| STAFF_PIN_LOCKOUT_MIN | Configmap | default 15 |
| STAFF_PIN_PEPPER_VERSION | Configmap | currently v3 |

Configmap values are delivered as Cloud Run environment variables, set per environment (dev / staging / prod-me / prod-eu).
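
Because the tunables arrive as plain environment variables, the service can resolve them once at startup and fall back to the documented defaults. The sketch below illustrates that under stated assumptions: `StaffTuning` and `load_tuning` are hypothetical names, rejecting negative values is an added safeguard, and falling back to `info` when LOG_LEVEL is unset is a guess rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StaffTuning:
    auto_close_grace_min: int
    gap_warn_min: int
    pin_lockout_min: int
    log_level: str


def _minutes(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a non-negative integer number of minutes, or use the default."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value


def load_tuning(env: Mapping[str, str] = os.environ) -> StaffTuning:
    return StaffTuning(
        auto_close_grace_min=_minutes(env, "STAFF_AUTO_CLOSE_GRACE_MIN", 60),
        gap_warn_min=_minutes(env, "STAFF_GAP_WARN_MIN", 15),
        pin_lockout_min=_minutes(env, "STAFF_PIN_LOCKOUT_MIN", 15),
        # Assumption: "info" when unset; the table only gives per-environment values.
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Failing fast on a malformed value at startup surfaces a bad deploy during the startup probe instead of at the first request that needs the setting.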

8. Migrations

  • Flyway runs as a Cloud Run Job triggered by the deploy pipeline before traffic is shifted.
  • Job authenticates via workload identity to Cloud SQL.
  • A successful migration is required before staff-api rollout proceeds.
  • Failure → pipeline aborts; rollback runbook in runbooks/staff/migration-failure.md.

9. Deploy Pipeline

PR merged to main
→ CI (unit / contract / integration / sync / security / e2e + coverage gate)
→ image build (linux/amd64 + arm64)
→ push to Artifact Registry (`me-central1-docker.pkg.dev/melmastoon-prod/staff-service:<sha>`)
→ deploy to dev
→ smoke (Playwright headless top-3)
→ deploy to staging
→ load-test (k6 baseline)
→ manual approval (peer)
→ migration-job → traffic-shift 10 % → 50 % → 100 % (Cloud Deploy)
→ post-deploy: dashboard link + audit row

Rollback: Cloud Deploy rolls back to the previous revision, followed by a new traffic-shift. Flyway has no auto-revert, so schema problems are fixed forward only.
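
The ordering in the final pipeline stage (migration first, then staged traffic) can be sketched as a small gate function. This is an illustration only: Cloud Deploy performs the real rollout, the callbacks are hypothetical stand-ins, and the health check between stages is an assumption, since the document lists the percentages but not the promotion criteria.

```python
from typing import Callable, Iterable, List

STAGES = (10, 50, 100)  # traffic percentages from the pipeline above


def progressive_rollout(
    run_migration: Callable[[], bool],
    set_traffic: Callable[[int], None],
    healthy: Callable[[int], bool],
    stages: Iterable[int] = STAGES,
) -> List[int]:
    """Apply each stage in order; return the stages applied, raise on abort."""
    if not run_migration():
        # No traffic has moved yet, so the previous revision keeps serving.
        raise RuntimeError("migration failed; aborting before any traffic shift")
    applied: List[int] = []
    for pct in stages:
        set_traffic(pct)
        applied.append(pct)
        if not healthy(pct):
            set_traffic(0)  # send all traffic back to the previous revision
            raise RuntimeError(f"unhealthy at {pct} %; traffic rolled back")
    return applied
```

Note that the rollback path only moves traffic. Any schema change applied by the migration stays in place, which is why migrations need to remain compatible with the previous revision.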

10. Backup & Restore

  • Cloud SQL daily backup, 35 d retention, PITR 7 d.
  • Quarterly restore drill into a transient project; the drill is owned by Platform Ops.
  • BigQuery cold copies of audit_events and key tables retained 7 y.
  • Pub/Sub messages are retained for at least 7 d (7–30 d depending on the topic, per §4); replay is operator-initiated.