housekeeping-service — DEPLOYMENT_TOPOLOGY

Cloud: GCP. Compute: Cloud Run (hot path) + Cloud Run Jobs (schedulers). DB: Cloud SQL Postgres 15 (regional HA, CMEK). Messaging: Pub/Sub. Networking: internal LB + Serverless VPC connector. CI/CD: GitHub Actions → Artifact Registry → Cloud Deploy.


1. Topology

                        ┌──────────────────────────────────────────────────────────────┐
                        │ GCP project: melmastoon-<env>                                │
                        │                                                              │
Internet ─▶ HTTPS LB ─▶ │ ┌────────────────┐    push     ┌──────────────┐              │
                        │ │ housekeeping-  │ ◀────────── │   Pub/Sub    │              │
                        │ │ service        │             └──────────────┘              │
                        │ │ (Cloud Run)    │ ──── private VPC ─▶ ┌───────┐             │
                        │ │ min=2, max=20  │                     │ Cloud │             │
                        │ └─────┬──────────┘                     │  SQL  │             │
                        │       │ outbox relay (sidecar)         └───────┘             │
                        │       ▼                                                      │
                        │ ┌──────────────┐                                             │
                        │ │   Pub/Sub    │ ◀─── Cloud Run Jobs                         │
                        │ └──────────────┘      • shift-staffing-tick                  │
                        │                       • escalation-tick                      │
                        │                       • mid-stay-cadence                     │
                        │                       • lost-found-retention                 │
                        │                                                              │
                        └──────────────────────────────────────────────────────────────┘

2. Hot-path service

| Setting | Value |
| --- | --- |
| Region | asia-south1 (Mumbai) primary; asia-southeast1 warm standby for DR |
| Image | <region>-docker.pkg.dev/<project>/melmastoon/housekeeping-service:<sha> |
| Min instances | 2 (cold-start avoidance) |
| Max instances | 20 (scale on concurrency) |
| Concurrency | 80 per instance |
| CPU | 2 vCPU (always allocated) |
| Memory | 2 GiB |
| Request timeout | 30 s |
| Service account | housekeeping-svc@<project>.iam.gserviceaccount.com |
| Ingress | Internal + Cloud Load Balancing |
| VPC connector | melmastoon-svpc-connector (egress to Cloud SQL via private IP) |
| Secrets | Mounted from Secret Manager via --update-secrets |
| Health checks | /health/ready and /health/live; both 1 s timeout, 3-of-5 |
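
The scaling settings above imply a simple capacity envelope. The following is a rough sketch of that arithmetic, not a model of the Cloud Run autoscaler, which makes its own decisions; the class and function names are hypothetical, and only the three numeric defaults come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    """Defaults copied from the settings table; names are illustrative."""
    min_instances: int = 2
    max_instances: int = 20
    concurrency: int = 80


def concurrent_request_capacity(cfg: ScalingConfig) -> tuple[int, int]:
    """Return (warm, peak) in-flight request capacity."""
    return (cfg.min_instances * cfg.concurrency,
            cfg.max_instances * cfg.concurrency)


def instances_needed(in_flight: int, cfg: ScalingConfig) -> int:
    """Instances required for `in_flight` requests, clamped to [min, max]."""
    needed = -(-in_flight // cfg.concurrency)  # ceiling division
    return max(cfg.min_instances, min(cfg.max_instances, needed))


if __name__ == "__main__":
    cfg = ScalingConfig()
    warm, peak = concurrent_request_capacity(cfg)
    print(f"warm capacity: {warm}, peak capacity: {peak}")
```

Under these settings the two warm instances cover 160 in-flight requests, and the service tops out at 1,600 before requests start queueing or being rejected.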

3. Cloud Run Jobs

| Job | Schedule | Service account | Purpose |
| --- | --- | --- | --- |
| housekeeping-shift-staffing-tick | every 60 s (Cloud Scheduler) | housekeeping-cron@<project> | Detect staffing gaps |
| housekeeping-escalation-tick | every 30 s | same | Auto-escalate stuck urgent tasks |
| housekeeping-mid-stay-cadence | hourly | same | Enqueue daily mid-stay cleans for opt-in tenants |
| housekeeping-lost-found-retention | daily 03:00 (per tenant TZ via fan-out) | same | Auto-dispose expired lost items |
| housekeeping-board-snapshot-refresh | every 60 s | same | Safety-net refresh of board_snapshot_mat |
| housekeeping-partition-rotate | weekly | same | Run pg_partman maintenance |

Jobs are idempotent (replay-safe) and capped at 5 minutes of wall-clock time; runs that exceed the cap page on-call.
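
One way to read "idempotent with a wall-time cap" is sketched below. It assumes a durable dedupe store keyed by item id, represented here by an in-memory set; `run_tick`, `already_done`, and the return shape are hypothetical names, not the service's actual job code.

```python
import time
from typing import Callable, Iterable

WALL_TIME_CAP_S = 5 * 60  # the 5-minute cap stated above


def run_tick(
    items: Iterable[str],
    handle: Callable[[str], None],
    already_done: set[str],
    cap_s: float = WALL_TIME_CAP_S,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Process each item at most once; stop once the wall-time cap is hit.

    `already_done` stands in for a durable dedupe store (assumption), which
    is what turns a replayed tick into a no-op.
    """
    started = clock()
    processed = skipped = 0
    for item_id in items:
        if clock() - started >= cap_s:
            # The caller is expected to alert (page on-call) on timed_out.
            return {"processed": processed, "skipped": skipped, "timed_out": True}
        if item_id in already_done:
            skipped += 1
            continue
        handle(item_id)
        already_done.add(item_id)
        processed += 1
    return {"processed": processed, "skipped": skipped, "timed_out": False}
```

In a real job the side effect and the dedupe record would normally be written in the same database transaction; the two-step version here is only safe because the example is single-threaded and in-memory.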

4. Cloud SQL

  • Instance: melmastoon-<env>-pg-1 shared with adjacent core services (per platform standard); per-service schema (housekeeping).
  • Tier: db-custom-4-16384 (production); db-custom-2-8192 (staging).
  • HA: regional, automated failover.
  • Backups: daily; PITR 7 days.
  • Maintenance window: Sun 02:00–04:00 UTC.
  • Encryption: CMEK in melmastoon-<env>-keys keyring.
  • Connections: via private IP through Serverless VPC connector. Pooling via PgBouncer sidecar in transaction-pooling mode (max 200 server connections, 1k client connections).
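
Because the pooler runs as a sidecar, its server-connection limit multiplies with the instance count. The sketch below works through that arithmetic. The text does not say whether the 200-connection limit is per sidecar or in aggregate, so the per-sidecar reading is an assumption, and `db_max_connections` and `reserved` are placeholder inputs rather than values from this deployment.

```python
def worst_case_server_connections(max_instances: int, per_sidecar_limit: int) -> int:
    """Upper bound on Postgres connections if every sidecar fills its pool."""
    return max_instances * per_sidecar_limit


def per_sidecar_budget(db_max_connections: int, reserved: int, max_instances: int) -> int:
    """Largest per-sidecar pool size that keeps the whole fleet under the DB limit.

    `reserved` covers connections held back for other tenants of the shared
    instance, migrations, and admin access (all placeholder figures).
    """
    usable = db_max_connections - reserved
    if usable <= 0 or max_instances <= 0:
        raise ValueError("no connection budget left to distribute")
    return usable // max_instances


if __name__ == "__main__":
    # 20 instances and 200 server connections come from this document.
    print(worst_case_server_connections(20, 200))
    # 500 and 100 are made-up inputs for illustration only.
    print(per_sidecar_budget(db_max_connections=500, reserved=100, max_instances=20))
```

Comparing the first figure against the instance's actual `max_connections` setting shows whether the pool limits still protect the database at full scale-out.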

5. Pub/Sub

| Topic / sub | Type | Notes |
| --- | --- | --- |
| melmastoon.housekeeping.task.* | publish topics | producer |
| melmastoon.housekeeping.room.* | publish topics | producer |
| melmastoon.housekeeping.lost_item.* | publish topics | producer |
| melmastoon.housekeeping.linen.* | publish topics | producer |
| melmastoon.housekeeping.shift.* | publish topics | producer |
| melmastoon.housekeeping.inspection.* | publish topics | producer |
| melmastoon.housekeeping.checklist.* | publish topics | producer |
| housekeeping.consumer.reservation-checked-out | push subscription | OIDC, ACK 60 s |
| housekeeping.consumer.reservation-early-checkout | push subscription | |
| housekeeping.consumer.reservation-modification-requested | push subscription | |
| housekeeping.consumer.reservation-cancelled | push subscription | |
| housekeeping.consumer.maintenance-completed | push subscription | |
| housekeeping.consumer.staff-shift-started / -ended | push subscriptions | |
| housekeeping.consumer.ai-routing-suggestion | push subscription | |
| housekeeping.consumer.property-room-archived | push subscription | |
| housekeeping.consumer.tenant-settings-changed | push subscription | |
| melmastoon.dlq.housekeeping | DLQ topic | shared by all our subscriptions |
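
Push subscriptions deliver each message as a JSON envelope whose `data` field is base64-encoded. A success status acknowledges the message, and an error status leads to redelivery and eventually to the DLQ. The sketch below shows one way a consumer endpoint might decode and deduplicate such a request. It assumes JSON payloads, omits OIDC token verification, and uses hypothetical names (`handle_push`, `seen_ids`, `apply`).

```python
import base64
import json
from typing import Callable


def decode_push_envelope(body: bytes) -> dict:
    """Extract id, attributes, and payload from a Pub/Sub push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    raw = base64.b64decode(message.get("data", ""))
    return {
        "message_id": message["messageId"],
        "attributes": message.get("attributes", {}),
        "payload": json.loads(raw) if raw else None,  # JSON payload is an assumption
        "subscription": envelope.get("subscription"),
    }


def handle_push(body: bytes, seen_ids: set[str], apply: Callable[[dict], None]) -> int:
    """Return the HTTP status for the push request."""
    try:
        event = decode_push_envelope(body)
    except (KeyError, TypeError, ValueError):
        return 400  # malformed: not acked, so it is retried and later dead-lettered
    if event["message_id"] in seen_ids:
        return 204  # duplicate delivery: ack without re-applying
    apply(event)
    seen_ids.add(event["message_id"])  # stand-in for a durable dedupe record
    return 204
```

Deduplicating on `messageId` matters because push delivery is at-least-once: the same checkout event can arrive more than once, especially when an earlier response missed the ACK deadline.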

6. Networking

  • Public ingress only via the global HTTPS load balancer with Cloud Armor (rate limits, WAF rules) in front.
  • Internal endpoints (/internal/*) and desktop sync endpoints (/sync/v1/*, which arrive via the public LB but are auth-gated) require the iam-service JWT or OIDC, respectively.
  • Egress: through Serverless VPC connector to Cloud SQL private IP and ai-orchestrator-service's internal LB.

7. CI/CD

GitHub Actions pipeline:

lint → typecheck → unit + contract → migration-up/down → build image → push to Artifact Registry
→ deploy staging (Cloud Deploy) → smoke + integration core → manual approval → deploy prod
  • Trunk-based; PRs require green checks.
  • Cloud Deploy promotes the same image SHA from staging to prod.
  • Canary rollout: 10% → 30% → 100% traffic over 30 minutes; auto-rollback on error budget burn or 5xx spike.
  • Migrations run as a Cloud Run Job before the traffic split (expand step). The contract step runs in the next release, after a monitoring period.
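
The canary bullet above describes a staged traffic shift with automatic rollback. Cloud Deploy handles this in practice; the sketch below only illustrates the control flow. The callback names are hypothetical, and the wait between stages is left out because the text does not say how the 30 minutes are split.

```python
from typing import Callable, Sequence

CANARY_STAGES = (10, 30, 100)  # percent of traffic, from the rollout bullet


def run_canary(
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    stages: Sequence[int] = CANARY_STAGES,
) -> bool:
    """Advance through the stages; roll back on the first unhealthy check.

    `is_healthy` stands in for the error-budget-burn and 5xx-spike signals.
    """
    for percent in stages:
        set_traffic(percent)
        if not is_healthy(percent):
            set_traffic(0)  # rollback: all traffic returns to the previous revision
            return False
    return True
```

Because the expand migration has already run before the first stage, a rollback to the previous revision has to work against the expanded schema, which is why the contract step waits for a later release.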

8. Environments

| Env | Purpose | Notes |
| --- | --- | --- |
| dev | per-engineer | Cloud SQL dev shared, Pub/Sub emulator local |
| staging | shared | Real GCP; integration suite gates promotion |
| prod | live | SLOs apply; on-call paged |
| dr | warm | asia-southeast1; Cloud SQL replica + Cloud Run idle min=0 |

DR exercise (quarterly): pause prod traffic, route to DR for 30 minutes, and verify that SLOs hold.

9. Cost guardrails

  • Cloud Run min=2 → ~$45/month idle (offset by responsiveness gain).
  • Cloud SQL is shared (per-service schema) so per-service amortized cost is low.
  • Pub/Sub message cost from outbox publishing dominates; the relay batches up to 100 messages per tick.
  • Anomaly alerting: BillingDailyDelta > 25% pages platform on-call.
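
The batching guardrail for the outbox relay can be sketched as follows. This is a minimal illustration that assumes the outbox is an ordered list of pending rows and that `publish_batch` (a hypothetical callback) raises on failure; the real relay reads from the database and publishes to Pub/Sub.

```python
from typing import Callable, Sequence

MAX_BATCH = 100  # batch limit per relay tick, from the guardrail above


def relay_tick(
    pending: list[dict],
    publish_batch: Callable[[Sequence[dict]], None],
    max_batch: int = MAX_BATCH,
) -> int:
    """Publish at most `max_batch` rows from the head of the outbox.

    Rows are removed only after the publish call returns, so a failed
    publish leaves them in place for the next tick (at-least-once delivery).
    """
    batch = pending[:max_batch]
    if not batch:
        return 0
    publish_batch(batch)
    del pending[:len(batch)]
    return len(batch)
```

With 250 pending rows, three ticks drain the outbox in batches of 100, 100, and 50. Since a crash between publish and delete can resend a batch, downstream consumers need to tolerate duplicates.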