DEPLOYMENT_TOPOLOGY — staff-service
Catalog: docs/03-microservices/staff-service.md · 02 Enterprise Architecture · SECURITY_MODEL · OBSERVABILITY
GCP-native deployment. Cloud Run for compute, Cloud SQL for state, Memorystore for cache, KMS for crypto, Pub/Sub for events, Secret Manager for secrets. Multi-region active-active by M2.
1. Containers
| Container | Purpose | Replicas (prod) | CPU | Memory |
|---|---|---|---|---|
| `staff-api` | HTTP/REST surface (all `/api/v1/*`) | min 2, max 30 | 1 vCPU | 768 MiB |
| `staff-worker` | Outbox relay, inbox consumers, scheduled jobs (auto-close, cert-expiry, suggestion TTL) | min 2, max 10 | 1 vCPU | 512 MiB |
| `staff-cron` | Cloud Scheduler → Pub/Sub-triggered: nightly reconcile, weekly fairness report, sync-cursor sweep | min 1, max 2 | 0.5 vCPU | 256 MiB |
All three are built from the same source repo with different entrypoints. The image tag is the short git SHA; `latest` is not used in prod.
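One way to serve three containers from a single image is a launcher that picks the role at startup. The sketch below assumes a hypothetical `STAFF_ROLE` environment variable and placeholder role functions; neither is defined in this document, and the real entrypoints may be wired differently (for example, as separate container commands).

```python
import os
from typing import Callable, Dict, Mapping


# Placeholder role functions (hypothetical). Real entrypoints would start
# an HTTP server, a consumer loop, or a batch job instead of returning text.
def run_api() -> str:
    return "staff-api: serving /api/v1/*"


def run_worker() -> str:
    return "staff-worker: outbox relay, inbox consumers, scheduled jobs"


def run_cron() -> str:
    return "staff-cron: Pub/Sub-triggered batch jobs"


ROLES: Dict[str, Callable[[], str]] = {
    "staff-api": run_api,
    "staff-worker": run_worker,
    "staff-cron": run_cron,
}


def dispatch(env: Mapping[str, str] = os.environ) -> str:
    """Select the entrypoint from STAFF_ROLE (assumed variable name)."""
    role = env.get("STAFF_ROLE", "")
    if role not in ROLES:
        # Fail fast so a misconfigured revision never reports healthy.
        raise SystemExit(f"unknown STAFF_ROLE {role!r}; expected one of {sorted(ROLES)}")
    return ROLES[role]()
```

Failing fast on an unknown role keeps a misconfigured revision from passing its startup probe.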
2. Scaling Rules
| Container | Trigger | Threshold |
|---|---|---|
| `staff-api` | Concurrent requests | target 80 / instance |
| `staff-api` | CPU | target 60 % |
| `staff-worker` | Outbox depth | scale-up if depth > 100 for 30 s |
| `staff-worker` | Pub/Sub subscription backlog | scale-up if `num_undelivered_messages` > 500 |
| `staff-cron` | n/a | min instances = 1 |
Cold-start mitigation: min instances ≥ 2 for staff-api and staff-worker in prod. CPU is allocated even when idle for staff-api (Cloud Run "CPU always allocated").
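The thresholds above can be read as simple decision rules. The sketch below is an illustration only: it models the concurrency target for `staff-api` (not the CPU target) and the two `staff-worker` triggers. The function names and the sample format are assumptions, and the real Cloud Run autoscaler is more involved than a single ceiling division. How the outbox-depth and backlog signals reach the scaler is not stated here.

```python
import math
from typing import Sequence, Tuple


def desired_instances(concurrent_requests: int, target_per_instance: int,
                      min_instances: int, max_instances: int) -> int:
    """Clamp ceil(load / target) to the configured instance range."""
    needed = math.ceil(concurrent_requests / target_per_instance)
    return max(min_instances, min(max_instances, needed))


def outbox_sustained(samples: Sequence[Tuple[float, int]],
                     threshold: int = 100, window_s: float = 30.0) -> bool:
    """True if depth stayed above threshold for at least window_s,
    measured back from the latest sample.

    samples: (timestamp_seconds, depth) pairs in ascending time order.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    for ts, depth in reversed(samples):
        if depth <= threshold:
            break  # the breach run ends here
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= window_s


def worker_should_scale_up(samples: Sequence[Tuple[float, int]], backlog: int,
                           backlog_threshold: int = 500) -> bool:
    """Either trigger from the table is enough to scale staff-worker up."""
    return outbox_sustained(samples) or backlog > backlog_threshold
```

For example, 800 concurrent requests at a target of 80 per instance gives 10 `staff-api` instances, and zero load still yields the prod minimum of 2.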
3. Resource Budgets
| Resource | Limit |
|---|---|
| Request timeout | 30 s (most requests < 200 ms; 30 s reserved for slow report exports) |
| Max concurrent requests / instance | 100 (tuned per release) |
| Container startup probe | GET /health/startup 200, deadline 30 s |
| Liveness probe | GET /health/live every 30 s |
| Readiness probe | GET /health/ready every 10 s; checks DB, Redis, KMS, Pub/Sub |
| Memory request | 50 % of limit |
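The readiness probe aggregates four dependency checks. A minimal sketch of that aggregation follows, assuming each dependency is wrapped in a zero-argument callable that raises on failure; the callables, the 503 status on failure, and the response shape are assumptions, not the service's actual handler.

```python
from typing import Callable, Dict, Tuple

# A check is any zero-argument callable that raises when the dependency is down.
Check = Callable[[], None]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every check and return (HTTP status, per-dependency result).

    All checks run even after a failure so the response names every
    unhealthy dependency, not just the first one.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency unready
            results[name] = f"fail: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

In use, the handler for `GET /health/ready` would pass checks keyed `db`, `redis`, `kms`, and `pubsub`, each with its own short timeout so the probe stays well inside its 10 s interval.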
4. Storage Topology
| Layer | Service | Config |
|---|---|---|
| Primary DB | Cloud SQL Postgres 16 (Enterprise Plus) | HA (regional, read replicas in 2 zones); CMEK; PITR 7 d; backup daily 35 d retention |
| Cache | Memorystore Redis 7 | HA (Standard tier, 2 GB), VPC-attached, AUTH enabled |
| Crypto | Cloud KMS | Region-pinned per data-residency; HSM keyring |
| Secrets | Secret Manager | Region-replicated; auto-rotation where supported |
| Events | Pub/Sub | Topics melmastoon.staff.*.v1, retention 7–30 d (per topic) |
| Sync state | Firestore | Native mode, multi-region |
| Cold export | BigQuery | Daily Datastream from staff schema; partitioned daily, clustered by tenant_id |
| Object storage | Cloud Storage | Cert documents (CMEK, signed URLs); attendance CSV exports (lifecycle 30 d) |
5. Region Topology
5.1 M0 (single region)
me-central1 (Doha)
├── staff-api (Cloud Run)
├── staff-worker (Cloud Run)
├── staff-cron (Cloud Run)
├── Cloud SQL primary + 2 read replicas
├── Memorystore primary + replica
├── Cloud KMS keyring
├── Pub/Sub
├── Firestore (multi-region default)
└── Secret Manager
5.2 M2 (multi-region active-active)
         ┌── Cloud DNS (geo-routed) ───┐
         │                             │
         ▼                             ▼
┌─────────────────┐        ┌────────────────────────┐
│ me-central1     │        │ europe-west1           │
│ (primary)       │        │ (active-active)        │
│                 │        │                        │
│ staff-api ▸▸▸▸▸ │◀──────▶│ staff-api              │
│ staff-worker    │        │ staff-worker           │
│ staff-cron      │        │ (cron in primary only) │
│ Cloud SQL HA    │ pgsync │ Cloud SQL HA           │
│ Memorystore     │        │ Memorystore            │
│ KMS keyring     │        │ KMS keyring            │
│ Pub/Sub         │ global │ Pub/Sub                │
└─────────────────┘        └────────────────────────┘
- Cloud SQL writer is single-region (per 02 §11); the secondary region runs a read-only replica via Datastream + a fail-over runbook.
- Pub/Sub is global; topics are auto-replicated.
- Firestore is multi-region by default.
- KMS keyrings are region-pinned to honor tenant data-residency.
- `staff-cron` runs only in the primary region (singleton).
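Region pinning of KMS keys can be illustrated with a lookup from tenant residency to key location. The sketch below uses the standard Cloud KMS resource-name layout (`projects/.../locations/.../keyRings/.../cryptoKeys/...`); the residency codes `me` and `eu`, the project ID, and the key ring and key names are hypothetical.

```python
# Hypothetical residency codes mapped to the two regions named in this section.
RESIDENCY_TO_LOCATION = {
    "me": "me-central1",
    "eu": "europe-west1",
}


def kms_key_name(project: str, residency: str, key_ring: str, key: str) -> str:
    """Build a region-pinned Cloud KMS key resource name for a tenant.

    Unknown residency codes fail closed: there is no default region, so a
    tenant's data is never encrypted with a key outside its jurisdiction.
    """
    try:
        location = RESIDENCY_TO_LOCATION[residency]
    except KeyError:
        raise ValueError(f"no KMS location configured for residency {residency!r}")
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )
```

The deliberate absence of a fallback region is the point of the sketch: a missing mapping should surface as an error rather than silently using the primary region's keyring.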
6. Networking
- All Cloud Run services attached to a Serverless VPC connector; outbound to Cloud SQL, Memorystore, and internal services routed over the VPC.
- Inbound from `bff-backoffice-service` and `bff-tenant-booking-service` via internal load balancer; no public ingress.
- mTLS enforced between services in M2 via Anthos Service Mesh (per SECURITY_MODEL §7).
7. Configuration
| Variable | Source | Notes |
|---|---|---|
| `DATABASE_URL` | Secret Manager | Resolved at startup via Secret Manager API |
| `REDIS_URL` | Secret Manager | |
| `KMS_PIN_PEPPER_KEY` | Configmap | Resource name; auth via workload identity |
| `KMS_PII_ENVELOPE_KEY` | Configmap | |
| `PUBSUB_PROJECT_ID` | Configmap | |
| `IAM_SERVICE_BASE_URL` | Configmap | Internal LB |
| `PROPERTY_SERVICE_BASE_URL` | Configmap | |
| `AI_ORCHESTRATOR_BASE_URL` | Configmap | |
| `LOG_LEVEL` | Configmap | `info` in prod, `debug` in dev |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Configmap | OpenTelemetry collector sidecar |
| `STAFF_AUTO_CLOSE_GRACE_MIN` | Configmap | default 60 |
| `STAFF_GAP_WARN_MIN` | Configmap | default 15 |
| `STAFF_PIN_LOCKOUT_MIN` | Configmap | default 15 |
| `STAFF_PIN_PEPPER_VERSION` | Configmap | currently `v3` |
Configmap values are delivered as Cloud Run env vars, one set per environment (dev / staging / prod-me / prod-eu).
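Because these values arrive as plain environment variables, the service can resolve them once at startup and apply the documented defaults. A minimal sketch, assuming a hypothetical `StaffSettings` container; the `info` fallback for `LOG_LEVEL` and the rejection of negative minutes are assumptions, since the table only gives per-environment values and defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StaffSettings:
    log_level: str
    auto_close_grace_min: int
    gap_warn_min: int
    pin_lockout_min: int


def _minutes(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a non-negative integer minute value, falling back to the default."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value


def load_settings(env: Mapping[str, str] = os.environ) -> StaffSettings:
    """Resolve settings from env vars using the defaults in the table above."""
    return StaffSettings(
        log_level=env.get("LOG_LEVEL", "info"),  # fallback is an assumption
        auto_close_grace_min=_minutes(env, "STAFF_AUTO_CLOSE_GRACE_MIN", 60),
        gap_warn_min=_minutes(env, "STAFF_GAP_WARN_MIN", 15),
        pin_lockout_min=_minutes(env, "STAFF_PIN_LOCKOUT_MIN", 15),
    )
```

Parsing at startup rather than at first use means a malformed value fails the startup probe instead of surfacing mid-request.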
8. Migrations
- Flyway runs as a Cloud Run Job triggered by the deploy pipeline before traffic is shifted.
- Job authenticates via workload identity to Cloud SQL.
- A successful migration is required before `staff-api` rollout proceeds.
- Failure → pipeline aborts; rollback runbook in `runbooks/staff/migration-failure.md`.
9. Deploy Pipeline
PR merged to main
→ CI (unit / contract / integration / sync / security / e2e + coverage gate)
→ image build (linux/amd64 + arm64)
→ push to Artifact Registry (`me-central1-docker.pkg.dev/melmastoon-prod/staff-service:<sha>`)
→ deploy to dev
→ smoke (Playwright headless top-3)
→ deploy to staging
→ load-test (k6 baseline)
→ manual approval (peer)
→ migration-job → traffic-shift 10 % → 50 % → 100 % (Cloud Deploy)
→ post-deploy: dashboard link + audit row
Rollback: Cloud Deploy rollback to previous revision; new traffic-shift; Flyway has no auto-revert (forward fixes only).
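The last pipeline stage and the rollback rule can be summarized as control flow: the migration gates the rollout, traffic moves in three steps, and a failed health check returns traffic to the previous revision without touching the schema. The sketch below models only that ordering; the three callables are hypothetical stand-ins for the migration job, the Cloud Deploy traffic shift, and a health check, and a shift to 0 % stands in for the rollback.

```python
from typing import Callable, List, Sequence


def rollout(run_migration: Callable[[], bool],
            shift_traffic: Callable[[int], None],
            healthy: Callable[[], bool],
            steps: Sequence[int] = (10, 50, 100)) -> List[str]:
    """Gate the staged traffic shift on a successful migration.

    Returns a log of what happened, in order.
    """
    log: List[str] = []
    if not run_migration():
        # No traffic has moved yet, so aborting here needs no rollback.
        log.append("migration failed: abort before any traffic shift")
        return log
    log.append("migration ok")
    for pct in steps:
        shift_traffic(pct)
        if not healthy():
            # Send traffic back to the previous revision. The schema stays
            # migrated: Flyway has no auto-revert, so fixes go forward.
            shift_traffic(0)
            log.append(f"unhealthy at {pct}%: traffic returned to previous revision")
            return log
        log.append(f"shifted {pct}%")
    return log
```

Since the schema is never rolled back, each migration has to stay compatible with the previous revision of `staff-api` for as long as that revision may still receive traffic.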
10. Backup & Restore
- Cloud SQL daily backup, 35 d retention, PITR 7 d.
- Quarterly restore drill into a transient project; the drill is owned by Platform Ops.
- BigQuery cold copies of `audit_events` and key tables retained 7 y.
- Pub/Sub messages have 7 d retention; replay is operator-initiated.
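The Cloud SQL retention figures above imply which restore path is available for a given point in time. A small sketch of that decision follows, assuming the two windows are measured back from the current time; the function name and return labels are hypothetical, and a real drill would also confirm that a backup exists for the chosen day.

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=7)        # point-in-time recovery
BACKUP_RETENTION = timedelta(days=35)  # daily backups


def restore_option(target: datetime, now: datetime) -> str:
    """Return which Cloud SQL restore path covers the target timestamp."""
    age = now - target
    if age < timedelta(0):
        raise ValueError("restore target is in the future")
    if age <= PITR_WINDOW:
        return "pitr"          # restore to the exact timestamp
    if age <= BACKUP_RETENTION:
        return "daily-backup"  # restore the nearest daily backup
    return "unavailable"       # outside Cloud SQL retention


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(restore_option(now - timedelta(days=3), now))
```

Data older than 35 d is outside Cloud SQL retention; only the BigQuery cold copies would still cover it, and only for the tables exported there.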