housekeeping-service — DEPLOYMENT_TOPOLOGY
Cloud: GCP. Compute: Cloud Run (hot path) + Cloud Run Jobs (schedulers). DB: Cloud SQL Postgres 15 (regional HA, CMEK). Messaging: Pub/Sub. Networking: internal LB + Serverless VPC connector. CI/CD: GitHub Actions → Artifact Registry → Cloud Deploy.
1. Topology
                            ┌─────────────────────────────────────────────────────┐
                            │ GCP project: melmastoon-<env>                       │
                            │                                                     │
                            │   ┌────────────────┐    push     ┌──────────────┐   │
    Internet ─▶ HTTPS LB ─────▶ │ housekeeping-  │ ◀────────── │   Pub/Sub    │   │
                            │   │ service        │             └──────────────┘   │
                            │   │ (Cloud Run)    │ ──── private VPC ──▶ ┌──────┐  │
                            │   │ min=2, max=20  │                      │Cloud │  │
                            │   └─────┬──────────┘                      │ SQL  │  │
                            │         │ outbox relay (sidecar)          └──────┘  │
                            │         ▼                                           │
                            │  ┌──────────────┐                                   │
                            │  │   Pub/Sub    │ ◀─── Cloud Run Jobs ────────┐     │
                            │  └──────────────┘   • shift-staffing-tick     │     │
                            │                     • escalation-tick         │     │
                            │                     • mid-stay-cadence        │     │
                            │                     • lost-found-retention    │     │
                            │                                                     │
                            └─────────────────────────────────────────────────────┘
2. Hot-path service
| Setting | Value |
|---|---|
| Region | asia-south1 (Mumbai) primary; asia-southeast1 warm-standby for DR |
| Image | <region>-docker.pkg.dev/<project>/melmastoon/housekeeping-service:<sha> |
| Min instances | 2 (cold-start avoidance) |
| Max instances | 20 (scale on concurrency) |
| Concurrency | 80 per instance |
| CPU | 2 vCPU (always allocated) |
| Memory | 2 GiB |
| Request timeout | 30 s |
| Service account | housekeeping-svc@<project>.iam.gserviceaccount.com |
| Ingress | Internal + Cloud Load Balancing |
| VPC connector | melmastoon-svpc-connector (egress to Cloud SQL via private IP) |
| Secrets | mounted from Secret Manager via --update-secrets |
| Healthchecks | /health/ready, /health/live, both 1 s timeout, 3-of-5 |
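
As a quick sanity check on the table above, the in-flight request bounds follow from instances × concurrency. This is a minimal sketch of that arithmetic only; the function name is illustrative and not part of the service.

```python
# Back-of-the-envelope request capacity from the hot-path settings.
MIN_INSTANCES = 2
MAX_INSTANCES = 20
CONCURRENCY_PER_INSTANCE = 80


def concurrent_request_capacity(
    instances: int, concurrency: int = CONCURRENCY_PER_INSTANCE
) -> int:
    """Upper bound on in-flight requests for a given instance count."""
    if instances < 0 or concurrency < 0:
        raise ValueError("instances and concurrency must be non-negative")
    return instances * concurrency


# Requests the always-warm instances can absorb before any scale-out.
warm_floor = concurrent_request_capacity(MIN_INSTANCES)
# Requests the service can hold at maximum scale.
ceiling = concurrent_request_capacity(MAX_INSTANCES)
```

With the values above, the warm floor is 160 concurrent requests and the ceiling is 1,600.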
3. Cloud Run Jobs
| Job | Schedule | Service account | Purpose |
|---|---|---|---|
| `housekeeping-shift-staffing-tick` | every 60 s (Cloud Scheduler) | `housekeeping-cron@<project>` | Detect staffing gaps |
| `housekeeping-escalation-tick` | every 30 s | same | Auto-escalate stuck urgent tasks |
| `housekeeping-mid-stay-cadence` | hourly | same | Enqueue daily mid-stay cleans for opt-in tenants |
| `housekeeping-lost-found-retention` | daily 03:00 (per tenant TZ via fan-out) | same | Auto-dispose expired lost items |
| `housekeeping-board-snapshot-refresh` | every 60 s | same | Safety-net refresh of board_snapshot_mat |
| `housekeeping-partition-rotate` | weekly | same | Run pg_partman maintenance |
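
The "daily 03:00 per tenant TZ via fan-out" schedule can be pictured as a dispatcher that wakes up hourly and selects the tenants whose local clock currently reads 03:xx. The sketch below is one possible shape, not the service's actual implementation: the hourly trigger, `tenants_due`, and the tenant IDs are all assumptions. Fixed UTC offsets keep the example self-contained; a real deployment would more likely pass `zoneinfo.ZoneInfo` objects so DST is handled.

```python
from datetime import datetime, timedelta, timezone, tzinfo

RETENTION_LOCAL_HOUR = 3  # "daily 03:00" from the schedule column


def tenants_due(now_utc: datetime, tenant_zones: dict[str, tzinfo]) -> list[str]:
    """Return tenant IDs whose local wall-clock hour is currently 03."""
    due = [
        tenant_id
        for tenant_id, zone in tenant_zones.items()
        if now_utc.astimezone(zone).hour == RETENTION_LOCAL_HOUR
    ]
    return sorted(due)


# Hypothetical tenants with fixed offsets (assumption for illustration).
zones = {
    "tenant-a": timezone(timedelta(hours=5, minutes=30)),
    "tenant-b": timezone(timedelta(hours=4, minutes=30)),
    "tenant-c": timezone.utc,
}

# At 22:00 UTC it is 03:30 for tenant-a, 02:30 for tenant-b, 22:00 for tenant-c.
due_now = tenants_due(datetime(2024, 1, 15, 22, 0, tzinfo=timezone.utc), zones)
```

Matching on the hour rather than on an exact time means an hourly dispatcher selects each tenant once per day, including tenants on half-hour offsets.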
Jobs are idempotent (replay-safe) and capped at 5 minutes of wall-clock time; runs that exceed the cap page on-call.
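
A replay-safe tick with a wall-time budget might look like the sketch below. It is illustrative only: `run_tick` is a hypothetical helper, and the in-memory `already_done` set stands in for whatever durable idempotency store the jobs actually use (for example a unique-keyed table).

```python
import time
from typing import Callable, Iterable

WALL_TIME_CAP_SECONDS = 5 * 60  # 5-minute cap from the text above


def run_tick(
    items: Iterable[str],
    already_done: set[str],
    handle: Callable[[str], None],
    budget_seconds: float = WALL_TIME_CAP_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[int, bool]:
    """Process each item at most once; stop early when the budget is spent.

    Returns (processed_count, finished). `finished` is False when the
    wall-time budget ran out, which is where a real job would alert.
    """
    deadline = clock() + budget_seconds
    processed = 0
    for key in items:
        if clock() >= deadline:
            return processed, False
        if key in already_done:
            continue  # replay: an earlier run already handled this key
        handle(key)
        already_done.add(key)
        processed += 1
    return processed, True
```

Running the same tick twice over overlapping input only handles the keys the first run did not reach, which is the property that makes a retry or a duplicate trigger harmless.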
4. Cloud SQL
- Instance: `melmastoon-<env>-pg-1`, shared with adjacent core services (per platform standard); per-service schema (`housekeeping`).
- Tier: `db-custom-4-16384` (production); `db-custom-2-8192` (staging).
- HA: regional, automated failover.
- Backups: daily; PITR 7 days.
- Maintenance window: Sun 02:00–04:00 UTC.
- Encryption: CMEK in `melmastoon-<env>-keys` keyring.
- Connections: via private IP through Serverless VPC connector. Pooling via PgBouncer sidecar in transaction-pooling mode (max 200 server connections, 1k client connections).
5. Pub/Sub
| Topic / sub | Type | Notes |
|---|---|---|
| `melmastoon.housekeeping.task.*` | publish topics | producer |
| `melmastoon.housekeeping.room.*` | publish topics | producer |
| `melmastoon.housekeeping.lost_item.*` | publish topics | producer |
| `melmastoon.housekeeping.linen.*` | publish topics | producer |
| `melmastoon.housekeeping.shift.*` | publish topics | producer |
| `melmastoon.housekeeping.inspection.*` | publish topics | producer |
| `melmastoon.housekeeping.checklist.*` | publish topics | producer |
| `housekeeping.consumer.reservation-checked-out` | push subscription | OIDC-authenticated push; ack deadline 60 s |
| `housekeeping.consumer.reservation-early-checkout` | push subscription | |
| `housekeeping.consumer.reservation-modification-requested` | push subscription | |
| `housekeeping.consumer.reservation-cancelled` | push subscription | |
| `housekeeping.consumer.maintenance-completed` | push subscription | |
| `housekeeping.consumer.staff-shift-started` / `-ended` | push subscriptions | |
| `housekeeping.consumer.ai-routing-suggestion` | push subscription | |
| `housekeeping.consumer.property-room-archived` | push subscription | |
| `housekeeping.consumer.tenant-settings-changed` | push subscription | |
| `melmastoon.dlq.housekeeping` | DLQ topic | shared by all of this service's subscriptions |
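
For the push subscriptions above, Pub/Sub delivers each message as an HTTP POST whose JSON body wraps the base64-encoded payload, and the response status decides whether the message is acknowledged. The sketch below shows that handling in isolation. It assumes JSON event payloads and uses an in-memory `seen_ids` set as a stand-in for a durable dedupe store; `handle_push` and `decode_push_envelope` are illustrative names, not the service's real handlers.

```python
import base64
import json
from typing import Any


def decode_push_envelope(body: bytes) -> tuple[str, dict[str, Any]]:
    """Parse a Pub/Sub push request body into (message_id, payload)."""
    envelope = json.loads(body)
    message = envelope["message"]
    payload = json.loads(base64.b64decode(message["data"]))
    return message["messageId"], payload


def handle_push(body: bytes, seen_ids: set[str]) -> int:
    """Return the HTTP status the push endpoint would send back.

    A success status (such as 204) acknowledges the message. Any other
    status leaves it unacknowledged, so Pub/Sub redelivers it and can
    eventually forward it to the dead-letter topic.
    """
    try:
        message_id, payload = decode_push_envelope(body)
    except (KeyError, TypeError, ValueError):
        return 400  # malformed envelope or payload: do not ack
    if message_id in seen_ids:
        return 204  # duplicate delivery: ack without reprocessing
    # Domain handling of `payload` would go here (assumed, not shown).
    seen_ids.add(message_id)
    return 204
```

Deduplicating on `messageId` matters because push delivery is at-least-once, so the same event can arrive more than once within or after the ack deadline.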
6. Networking
- Public ingress only via the global HTTPS load balancer with Cloud Armor (rate limits, WAF rules) in front.
- Internal endpoints (`/internal/*`; `/sync/v1/*` from desktop goes through the public LB but is auth-gated) are configured to require the `iam-service` JWT or OIDC respectively.
- Egress: through Serverless VPC connector to Cloud SQL private IP and `ai-orchestrator-service`'s internal LB.
7. CI/CD
GitHub Actions pipeline:
lint → typecheck → unit + contract → migration-up/down → build image → push to Artifact Registry
→ deploy staging (Cloud Deploy) → smoke + integration core → manual approval → deploy prod
- Trunk-based; PRs require green checks.
- Cloud Deploy promotes the same image SHA from staging to prod.
- Canary rollout: 10% → 30% → 100% traffic over 30 minutes; auto-rollback on error budget burn or 5xx spike.
- Migrations run as a Cloud Run Job before the traffic split (expand step). The contract step runs in the next release, after a monitoring period.
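
The canary policy above (10% → 30% → 100%, with automatic rollback) reduces to a small step function. This is a sketch of the decision logic only: how health is derived from error-budget burn or 5xx rate, and how the 30 minutes are divided between steps, is not specified here, so `healthy` is passed in as an assumption.

```python
CANARY_STEPS = (10, 30, 100)  # percent of traffic, from the rollout policy


def next_traffic_percent(current: int, healthy: bool) -> int:
    """Advance one canary step when healthy; return 0 (rollback) otherwise."""
    if current not in (0, *CANARY_STEPS):
        raise ValueError(f"unexpected traffic percentage: {current}")
    if not healthy:
        return 0
    for step in CANARY_STEPS:
        if step > current:
            return step
    return 100  # already fully rolled out
```

For example, a healthy revision at 10% moves to 30%, while an unhealthy one at any stage drops to 0%.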
8. Environments
| Env | Purpose | Notes |
|---|---|---|
| `dev` | per-engineer | Shared dev Cloud SQL instance; local Pub/Sub emulator |
| `staging` | shared | Real GCP; integration suite gates promotion |
| `prod` | live | SLOs apply; on-call paged |
| `dr` | warm | asia-southeast1; Cloud SQL replica + Cloud Run idle at min=0 |
A DR exercise runs quarterly: pause prod traffic, route to DR for 30 minutes, and verify the SLOs.
9. Cost guardrails
- Cloud Run min=2 → ~$45/month idle (offset by responsiveness gain).
- Cloud SQL is shared (per-service schema) so per-service amortized cost is low.
- Outbox publish → Pub/Sub message cost dominates; we batch up to 100 per relay tick.
- Anomaly alerting: `BillingDailyDelta > 25%` pages platform on-call.
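
The outbox batching mentioned in the list above (up to 100 messages per relay tick) can be sketched as a simple chunking helper. The name `batches` and the use of plain Python iterables are assumptions for illustration; the real relay presumably reads pending rows from the outbox table and publishes each chunk to Pub/Sub.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
MAX_BATCH = 100  # "up to 100 per relay tick"


def batches(rows: Iterable[T], size: int = MAX_BATCH) -> Iterator[list[T]]:
    """Yield consecutive chunks of at most `size` rows, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


# 250 pending rows become three ticks' worth of work.
batch_sizes = [len(chunk) for chunk in batches(range(250))]
```

Here `batch_sizes` is `[100, 100, 50]`, so 250 pending rows drain in three ticks rather than 250 individual publishes.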
10. Cross-link
- Service-level NFRs: `SERVICE_OVERVIEW.md` §10.
- DB layout & RLS: `DATA_MODEL.md`.
- Pub/Sub event topology: `EVENT_SCHEMAS.md` §4.
- Failure runbooks: `FAILURE_MODES.md`.