DEPLOYMENT_TOPOLOGY — staff-service

Catalog: docs/03-microservices/staff-service.md · 02 Enterprise Architecture · SECURITY_MODEL · OBSERVABILITY

GCP-native deployment. Cloud Run for compute, Cloud SQL for state, Memorystore for cache, KMS for crypto, Pub/Sub for events, Secret Manager for secrets. Multi-region active-active by M2.

1. Containers

| Container | Purpose | Replicas (prod) | CPU | Memory |
| --- | --- | --- | --- | --- |
| `staff-api` | HTTP/REST surface (all `/api/v1/*`) | min 2, max 30 | 1 vCPU | 768 MiB |
| `staff-worker` | Outbox relay, inbox consumers, scheduled jobs (auto-close, cert-expiry, suggestion TTL) | min 2, max 10 | 1 vCPU | 512 MiB |
| `staff-cron` | Cloud Scheduler→Pub/Sub-triggered: nightly reconcile, weekly fairness report, sync-cursor sweep | min 1, max 2 | 0.5 vCPU | 256 MiB |

All three images are built from the same source repo with different entrypoints. The image tag is the short git SHA; `latest` is never used in prod.
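The tagging rule can be enforced with a small pre-deploy check. The following is a minimal sketch, not the pipeline's actual guard: the function name `is_deployable_tag` is hypothetical, and the accepted length range (7 to 40 lowercase hex characters) is an assumption about what "short git SHA" means here.

```python
import re

# Assumption: a valid tag is 7-40 lowercase hex characters (abbreviated or full git SHA).
_SHA_TAG = re.compile(r"[0-9a-f]{7,40}")


def is_deployable_tag(tag: str) -> bool:
    """Return True only for SHA-style tags; mutable tags such as 'latest' are rejected."""
    return _SHA_TAG.fullmatch(tag) is not None
```

Because `latest` contains non-hex characters, it fails the same pattern that accepts SHA tags, so no separate deny-list is needed.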

2. Scaling Rules

| Container | Trigger | Threshold |
| --- | --- | --- |
| `staff-api` | Concurrent requests | target 80 / instance |
| `staff-api` | CPU | target 60 % |
| `staff-worker` | Outbox depth | scale-up if depth > 100 for 30 s |
| `staff-worker` | Pub/Sub subscription backlog | scale-up if `num_undelivered_messages` > 500 |
| `staff-cron` | n/a | min instances = 1 |

Cold-start mitigation: min instances ≥ 2 for staff-api and staff-worker in prod. CPU is allocated even when idle for staff-api (Cloud Run "CPU always allocated").
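The two `staff-worker` triggers can be read as a single decision over recent metric samples. The sketch below illustrates that reading; the `Sample` type, its field names, and `should_scale_up` are hypothetical, and how samples are collected is outside its scope.

```python
from dataclasses import dataclass
from typing import List

OUTBOX_DEPTH_THRESHOLD = 100   # from the table: depth > 100 ...
OUTBOX_SUSTAIN_SECONDS = 30    # ... sustained for 30 s
BACKLOG_THRESHOLD = 500        # num_undelivered_messages > 500


@dataclass(frozen=True)
class Sample:
    """One metrics sample (hypothetical shape)."""
    ts: float                  # seconds since epoch
    outbox_depth: int
    undelivered_messages: int


def should_scale_up(samples: List[Sample]) -> bool:
    """Return True if either staff-worker trigger fires. Samples are ordered oldest first."""
    if not samples:
        return False
    latest = samples[-1]
    if latest.undelivered_messages > BACKLOG_THRESHOLD:
        return True
    # Walk backwards while the depth stays above the threshold to find
    # when the current uninterrupted breach started.
    breach_start = None
    for sample in reversed(samples):
        if sample.outbox_depth <= OUTBOX_DEPTH_THRESHOLD:
            break
        breach_start = sample.ts
    return breach_start is not None and latest.ts - breach_start >= OUTBOX_SUSTAIN_SECONDS
```

A dip back to or below the threshold resets the 30 s window, so a brief spike does not trigger a scale-up on its own.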

3. Resource Budgets

| Resource | Limit |
| --- | --- |
| Request timeout | 30 s (most requests < 200 ms; 30 s reserved for slow report exports) |
| Max concurrent requests / instance | 100 (tuned per release) |
| Container startup probe | `GET /health/startup` 200, deadline 30 s |
| Liveness probe | `GET /health/live` every 30 s |
| Readiness probe | `GET /health/ready` every 10 s; checks DB, Redis, KMS, Pub/Sub |
| Memory request | 50 % of limit |
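The readiness probe aggregates four dependency checks into one response. A minimal sketch of that aggregation follows, assuming each check is a callable that raises on failure; the `readiness` function and the check names are illustrative and not taken from the service's code.

```python
from typing import Callable, Dict, Tuple

# Assumption: a check returns None on success and raises on failure.
Check = Callable[[], None]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the instance as not ready
            results[name] = f"fail: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

Running every check even after one fails keeps the response body useful for diagnosis, at the cost of a slightly slower failing probe. Each check should carry its own short timeout so the whole probe finishes well within the 10 s interval.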

4. Storage Topology

| Layer | Service | Config |
| --- | --- | --- |
| Primary DB | Cloud SQL Postgres 16 (Enterprise Plus) | HA (regional, read replicas in 2 zones); CMEK; PITR 7 d; backup daily 35 d retention |
| Cache | Memorystore Redis 7 | HA (Standard tier, 2 GB), VPC-attached, AUTH enabled |
| Crypto | Cloud KMS | Region-pinned per data-residency; HSM keyring |
| Secrets | Secret Manager | Region-replicated; auto-rotation where supported |
| Events | Pub/Sub | Topics `melmastoon.staff.*.v1`, retention 7–30 d (per topic) |
| Sync state | Firestore | Native mode, multi-region |
| Cold export | BigQuery | Daily Datastream from `staff` schema; partitioned daily, clustered by `tenant_id` |
| Object storage | Cloud Storage | Cert documents (CMEK, signed URLs); attendance CSV exports (lifecycle 30 d) |

5. Region Topology

5.1 M0 (single region)

me-central1 (Doha)
├── staff-api (Cloud Run)
├── staff-worker (Cloud Run)
├── staff-cron (Cloud Run)
├── Cloud SQL primary + 2 read replicas
├── Memorystore primary + replica
├── Cloud KMS keyring
├── Pub/Sub
├── Firestore (multi-region default)
└── Secret Manager

5.2 M2 (multi-region active-active)

                 ┌─ Cloud DNS (geo-routed) ─┐
                 │                          │
                 ▼                          ▼
         ┌─────────────────┐        ┌─────────────────┐
         │ me-central1     │        │ europe-west1    │
         │ (primary)       │        │ (active-active) │
         │                 │        │                 │
         │ staff-api ▸▸▸▸▸ │◀──────▶│ staff-api       │
         │ staff-worker    │        │ staff-worker    │
         │ staff-cron      │        │ (no staff-cron) │
         │ Cloud SQL HA    │ pgsync │ Cloud SQL HA    │
         │ Memorystore     │        │ Memorystore     │
         │ KMS keyring     │        │ KMS keyring     │
         │ Pub/Sub         │ global │ Pub/Sub         │
         └─────────────────┘        └─────────────────┘

Cloud SQL has a single-region writer (per 02 §11); the secondary region runs a read-only replica fed by Datastream, backed by a fail-over runbook.
Pub/Sub is global; topics are auto-replicated.
Firestore is multi-region by default.
KMS keyrings are region-pinned to honor tenant data-residency.
staff-cron runs only in the primary region (singleton).
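Two of the rules above, region-pinned KMS keyrings and the `staff-cron` singleton, reduce to small lookups. The sketch below shows one possible shape; the residency labels (`me`, `eu`), the mapping, and the function names are assumptions for illustration. Only the Cloud KMS resource-name layout (`projects/.../locations/.../keyRings/.../cryptoKeys/...`) is the standard format.

```python
PRIMARY_REGION = "me-central1"

# Hypothetical residency labels mapped to the two regions shown in the diagram.
RESIDENCY_TO_REGION = {
    "me": "me-central1",
    "eu": "europe-west1",
}


def kms_key_name(project: str, residency: str, key_ring: str, key: str) -> str:
    """Build the Cloud KMS key resource name pinned to the tenant's residency region."""
    region = RESIDENCY_TO_REGION[residency]  # raises KeyError for an unknown residency
    return f"projects/{project}/locations/{region}/keyRings/{key_ring}/cryptoKeys/{key}"


def cron_enabled(current_region: str) -> bool:
    """staff-cron is a singleton: only the primary region runs it."""
    return current_region == PRIMARY_REGION
```

Failing loudly on an unknown residency, instead of falling back to a default region, avoids silently encrypting tenant data with a key in the wrong jurisdiction.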

6. Networking

All Cloud Run services are attached to a Serverless VPC Access connector; outbound traffic to Cloud SQL, Memorystore, and internal services is routed over the VPC.
Inbound traffic arrives from bff-backoffice-service and bff-tenant-booking-service via an internal load balancer; there is no public ingress.
mTLS is enforced between services in M2 via Anthos Service Mesh (per SECURITY_MODEL §7).

7. Configuration

| Variable | Source | Notes |
| --- | --- | --- |
| `DATABASE_URL` | Secret Manager | Resolved at startup via Secret Manager API |
| `REDIS_URL` | Secret Manager | |
| `KMS_PIN_PEPPER_KEY` | Configmap | Resource name; auth via workload identity |
| `KMS_PII_ENVELOPE_KEY` | Configmap | |
| `PUBSUB_PROJECT_ID` | Configmap | |
| `IAM_SERVICE_BASE_URL` | Configmap | Internal LB |
| `PROPERTY_SERVICE_BASE_URL` | Configmap | |
| `AI_ORCHESTRATOR_BASE_URL` | Configmap | |
| `LOG_LEVEL` | Configmap | `info` in prod, `debug` in dev |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Configmap | OpenTelemetry collector sidecar |
| `STAFF_AUTO_CLOSE_GRACE_MIN` | Configmap | default 60 |
| `STAFF_GAP_WARN_MIN` | Configmap | default 15 |
| `STAFF_PIN_LOCKOUT_MIN` | Configmap | default 15 |
| `STAFF_PIN_PEPPER_VERSION` | Configmap | currently `v3` |

"Configmap" values are delivered as Cloud Run environment variables, set per environment (dev / staging / prod-me / prod-eu).
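The three `STAFF_*_MIN` variables have documented defaults, which suggests a loader that falls back when a variable is unset. A minimal sketch under that assumption follows; `StaffTuning`, `load_tuning`, and the rejection of negative values are illustrative choices, not the service's actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StaffTuning:
    auto_close_grace_min: int
    gap_warn_min: int
    pin_lockout_min: int


def _minutes(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a non-negative integer number of minutes, or use the documented default."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value


def load_tuning(env: Mapping[str, str] = os.environ) -> StaffTuning:
    """Load the tunables from the environment; defaults come from the table above."""
    return StaffTuning(
        auto_close_grace_min=_minutes(env, "STAFF_AUTO_CLOSE_GRACE_MIN", 60),
        gap_warn_min=_minutes(env, "STAFF_GAP_WARN_MIN", 15),
        pin_lockout_min=_minutes(env, "STAFF_PIN_LOCKOUT_MIN", 15),
    )
```

Passing the environment as a mapping keeps the loader testable without touching the process environment, and parsing at startup surfaces a malformed value before the instance takes traffic.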

8. Migrations

Flyway runs as a Cloud Run Job triggered by the deploy pipeline before traffic is shifted.
The job authenticates to Cloud SQL via workload identity.
A successful migration is required before staff-api rollout proceeds.
Failure → pipeline aborts; rollback runbook in runbooks/staff/migration-failure.md.

9. Deploy Pipeline

PR merged to main
  → CI (unit / contract / integration / sync / security / e2e + coverage gate)
  → image build (linux/amd64 + arm64)
  → push to Artifact Registry (`me-central1-docker.pkg.dev/melmastoon-prod/staff-service:<sha>`)
  → deploy to dev
  → smoke (Playwright headless top-3)
  → deploy to staging
  → load-test (k6 baseline)
  → manual approval (peer)
  → migration-job → traffic-shift 10 % → 50 % → 100 % (Cloud Deploy)
  → post-deploy: dashboard link + audit row

Rollback: Cloud Deploy rolls back to the previous revision and traffic is shifted again; Flyway migrations are not auto-reverted (forward fixes only).
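The final pipeline stage, with the migration gate followed by a 10 % → 50 % → 100 % traffic shift, can be sketched as a small driver. This is an illustration only: the callables `run_migration`, `set_traffic`, and `healthy` are hypothetical hooks, and the health check between steps is an assumption, since the pipeline description does not say what gates each step.

```python
from typing import Callable, Sequence

TRAFFIC_STEPS = (10, 50, 100)  # percentages from the pipeline description


def roll_out(
    run_migration: Callable[[], bool],
    set_traffic: Callable[[int], None],
    healthy: Callable[[], bool],
    steps: Sequence[int] = TRAFFIC_STEPS,
) -> str:
    """Gate the rollout on the migration, then shift traffic step by step."""
    if not run_migration():
        # No traffic is shifted when the migration fails.
        return "aborted: migration failed"
    for pct in steps:
        set_traffic(pct)
        if not healthy():  # assumed gate between steps
            return f"halted at {pct}%"
    return "done"
```

On a halt the caller would trigger the Cloud Deploy rollback; the schema change stays in place because migrations are forward-only, so each migration needs to remain compatible with the previous revision.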

10. Backup & Restore

Cloud SQL daily backup, 35 d retention, PITR 7 d.
Quarterly restore drill into a transient project; the drill is owned by Platform Ops.
BigQuery cold copies of audit_events and key tables retained 7 y.
Pub/Sub messages are retained 7–30 d depending on the topic (see 4. Storage Topology); replay is operator-initiated.
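The retention windows above imply which recovery path is available for a given target time. The sketch below encodes that reasoning; the function name and the returned labels are hypothetical, and the precision of restoring from a daily backup is not modelled.

```python
from datetime import datetime, timedelta

PITR_WINDOW = timedelta(days=7)        # Cloud SQL point-in-time recovery
BACKUP_RETENTION = timedelta(days=35)  # Cloud SQL daily backups


def restore_strategy(target: datetime, now: datetime) -> str:
    """Pick the recovery path for restoring state as of `target`."""
    age = now - target
    if age < timedelta(0):
        raise ValueError("target is in the future")
    if age <= PITR_WINDOW:
        return "pitr"
    if age <= BACKUP_RETENTION:
        return "daily-backup"
    # Beyond 35 d only the BigQuery cold copies (audit_events and key tables) remain.
    return "cold-copy-only"
```

Both arguments should be timezone-aware, or both naive, since Python refuses to subtract a naive datetime from an aware one.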
