maintenance-service · DEPLOYMENT_TOPOLOGY
Two Cloud Run services share the same container image but use different entrypoints and IAM roles: api (request-serving) and workers (cron + Pub/Sub push handlers + outbox relay if not using shared platform relay). Configuration via env + Secret Manager. Region: europe-west1 (primary), europe-west4 (warm replica).
1. Runtime
| Property | Value |
|---|---|
| Language | TypeScript |
| Runtime | Node.js 20 LTS |
| Framework | NestJS 10 |
| Container base | node:20-alpine (multi-stage; final stage is non-root) |
| Image registry | Artifact Registry europe-west1-docker.pkg.dev/<project>/melmastoon/maintenance-service |
| Build tool | Cloud Build |
| Migration tool | node-pg-migrate (run as a Cloud Run Job pre-deploy) |
2. Cloud Run services
2.1 maintenance-service-api
Serves the public REST API and the /internal/pubsub/* push endpoints.
| Setting | Value |
|---|---|
| Region | europe-west1 (primary), europe-west4 (warm replica) |
| Min instances | 2 |
| Max instances | 12 |
| Concurrency | 80 |
| CPU | 1 vCPU (always-on; not boosted-on-request) |
| Memory | 512 MiB |
| Timeout | 30 s |
| Ingress | internal + load-balancer (Kong is the public edge) |
| VPC connector | melmastoon-vpc-conn-eu-west1 |
| Egress | private (all egress through VPC) |
| SA | maintenance-api@<project>.iam (least-privilege; see §4) |
| Liveness | GET /healthz |
| Readiness | GET /readyz (checks DB ping + Pub/Sub publisher token + outbox heartbeat) |
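The readiness row above combines three independent checks. A minimal sketch of how such a handler could aggregate them is shown here; the `Check` shape, the per-check timeout, and the 200/503 mapping are assumptions for illustration, not the service's actual implementation.

```typescript
// Hypothetical readiness aggregator; check names would mirror the Readiness row
// (DB ping, Pub/Sub publisher token, outbox heartbeat).
type CheckResult = { name: string; ok: boolean; detail?: string };
type Check = { name: string; run: () => Promise<void> };

// Reject if a check does not settle within `ms` milliseconds.
function withTimeout(p: Promise<void>, ms: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    p.then(
      () => { clearTimeout(timer); resolve(); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run all checks in parallel; report ready only if every check passes.
async function readiness(
  checks: Check[],
  timeoutMs = 1000, // assumed budget per check
): Promise<{ status: 200 | 503; results: CheckResult[] }> {
  const results = await Promise.all(
    checks.map(async (c): Promise<CheckResult> => {
      try {
        await withTimeout(c.run(), timeoutMs);
        return { name: c.name, ok: true };
      } catch (err) {
        const detail = err instanceof Error ? err.message : String(err);
        return { name: c.name, ok: false, detail };
      }
    }),
  );
  return { status: results.every((r) => r.ok) ? 200 : 503, results };
}
```

A controller for GET /readyz could return `status` as the HTTP code and `results` as the body, so a failing dependency is visible in the probe response.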
2.2 maintenance-service-workers
Runs the preventive scheduler tick, SLA breach scanner, vendor reminder, and asset health forecaster in one container. Cloud Scheduler triggers each task by calling an internal endpoint.
| Setting | Value |
|---|---|
| Region | europe-west1 |
| Min instances | 1 |
| Max instances | 3 |
| Concurrency | 4 (workers are I/O-bound but DB-heavy) |
| CPU | 1 vCPU |
| Memory | 1 GiB (forecaster needs more headroom) |
| Timeout | 540 s (max for some long sweeps) |
| Ingress | internal only |
| SA | maintenance-workers@<project>.iam |
2.3 Outbox relay
Uses the shared platform outbox-relay-service, which is configured to read the maintenance.outbox table. We do not run our own relay. The lag SLO and dashboards live in this doc, but the relay worker itself is shared.
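To make "lag" concrete, the sketch below computes it as the age of the oldest outbox row that has not been published yet. The row shape (`createdAt`, `publishedAt`) is hypothetical, since the maintenance.outbox schema is not described here, and no SLO threshold is assumed.

```typescript
// Hypothetical row shape; the real maintenance.outbox columns may differ.
interface OutboxRow {
  id: string;
  createdAt: Date;
  publishedAt: Date | null; // null while the relay has not published the row
}

// Lag = age in seconds of the oldest unpublished row (0 when nothing is pending).
function outboxLagSeconds(rows: OutboxRow[], now: Date): number {
  const pending = rows.filter((r) => r.publishedAt === null);
  if (pending.length === 0) return 0;
  const oldest = Math.min(...pending.map((r) => r.createdAt.getTime()));
  return Math.max(0, (now.getTime() - oldest) / 1000);
}
```

A dashboard or alert would compare this value against whatever lag target the team sets.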
3. Cloud Scheduler entries
| Job | Schedule | Endpoint | Purpose |
|---|---|---|---|
| mnt-preventive-tick | * * * * * (every minute) | POST /internal/cron/preventive-scheduler on workers | Materialise due preventive WOs |
| mnt-sla-tick | * * * * * | POST /internal/cron/sla-breach-scanner on workers | Detect SLA breaches |
| mnt-vendor-reminder | */5 * * * * | POST /internal/cron/vendor-reminder on workers | Re-notify pending vendors |
| mnt-asset-health | 0 * * * * (hourly) | POST /internal/cron/asset-health-forecaster on workers | AI health updates |
| mnt-preventive-due-digest | 0 6 * * * (per-tenant timezone via fan-out service) | POST /internal/cron/preventive-due-digest | Daily digest |
| mnt-archiver | 0 3 * * * | POST /internal/cron/archive-closed | Archive WOs > 24 mo to BigQuery |
| mnt-sweeper | 0 * * * * | POST /internal/cron/sweep | Prune outbox/inbox/idempotency rows |
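As a sanity check on how much load these schedules put on the workers service, the sketch below counts invocations per day from the cron expressions in the table. The parser is deliberately minimal and handles only the forms used here (`*`, `*/n`, and a single number in the minute and hour fields); it is an illustration, not a general cron library.

```typescript
// Count how many values in [min, max] a cron field matches.
// Supports only "*", "*/n" and a single integer.
function countMatches(field: string, min: number, max: number): number {
  if (field === "*") return max - min + 1;
  const step = /^\*\/(\d+)$/.exec(field);
  if (step) {
    const n = Number(step[1]);
    if (n <= 0) throw new Error(`invalid step in "${field}"`);
    return Math.floor((max - min) / n) + 1;
  }
  if (/^\d+$/.test(field)) {
    const v = Number(field);
    if (v < min || v > max) throw new Error(`value out of range in "${field}"`);
    return 1;
  }
  throw new Error(`unsupported cron field "${field}"`);
}

// Invocations per day for a five-field cron expression that runs every day.
function runsPerDay(schedule: string): number {
  const parts = schedule.trim().split(/\s+/);
  if (parts.length !== 5) throw new Error(`expected 5 fields, got ${parts.length}`);
  const [minute, hour, dom, month, dow] = parts;
  if (dom !== "*" || month !== "*" || dow !== "*") {
    throw new Error("only schedules that run every day are supported");
  }
  return countMatches(minute, 0, 59) * countMatches(hour, 0, 23);
}

// Schedules copied from the table above.
const schedules: Record<string, string> = {
  "mnt-preventive-tick": "* * * * *",
  "mnt-sla-tick": "* * * * *",
  "mnt-vendor-reminder": "*/5 * * * *",
  "mnt-asset-health": "0 * * * *",
  "mnt-preventive-due-digest": "0 6 * * *",
  "mnt-archiver": "0 3 * * *",
  "mnt-sweeper": "0 * * * *",
};

const totalPerDay = Object.values(schedules)
  .map(runsPerDay)
  .reduce((sum, n) => sum + n, 0);
console.log(`scheduler invocations per day: ${totalPerDay}`);
```

The two every-minute ticks dominate the total, which is why the workers service keeps a minimum of one instance warm.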
4. IAM and Workload Identity
maintenance-api@… roles
- roles/cloudsql.client (Cloud SQL connector)
- roles/secretmanager.secretAccessor on secrets/maintenance-db-password
- roles/pubsub.publisher on melmastoon.maintenance.* topics
- roles/iam.serviceAccountTokenCreator (for issuing OIDC tokens to call ai-orchestrator-service, notification-service, etc.)
- roles/cloudkms.cryptoKeyEncrypterDecrypter on data/maintenance-db (for app-side encryption of new fields if added)
- roles/storage.objectCreator and roles/storage.objectViewer on melmastoon-vendor-invoices/
maintenance-workers@… roles
- All of the above
- roles/run.invoker on itself (Cloud Scheduler must invoke it)
- roles/pubsub.subscriber on subscriptions starting with mnt.in.*
- roles/bigquery.dataEditor on melmastoon_events_v1.maintenance_* (archive job only)
No human user has direct DB access in production; access is via cloud-sql-proxy with org-wide audited break-glass roles.
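For the service-to-service calls mentioned under the api roles, the sketch below shows one common way to obtain an OIDC identity token on Cloud Run: asking the metadata server for a token whose audience is the target service URL. This is an assumption about the mechanism; the service may instead mint tokens through the IAM Credentials API. The helper names (`identityTokenUrl`, `fetchIdToken`, `callInternal`) are hypothetical.

```typescript
// Metadata server endpoint that returns an identity token for the runtime service account.
const METADATA_IDENTITY_URL =
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity";

// Build the token request URL for a given audience (the receiving service's URL).
function identityTokenUrl(audience: string): string {
  const url = new URL(METADATA_IDENTITY_URL);
  url.searchParams.set("audience", audience);
  return url.toString();
}

// Fetch an identity token; only works from inside GCP, where the metadata server is reachable.
async function fetchIdToken(audience: string): Promise<string> {
  const res = await fetch(identityTokenUrl(audience), {
    headers: { "Metadata-Flavor": "Google" },
  });
  if (!res.ok) throw new Error(`metadata server returned ${res.status}`);
  return res.text();
}

// POST a JSON body to another internal service with the token as a bearer credential.
async function callInternal(baseUrl: string, path: string, body: unknown): Promise<Response> {
  const token = await fetchIdToken(baseUrl);
  return fetch(new URL(path, baseUrl), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
}
```

In this sketch `baseUrl` would come from a variable such as NOTIFICATION_URL or AI_ORCHESTRATOR_URL, and the receiving service is expected to verify the token's audience.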
5. Infrastructure dependencies
| Dependency | Purpose |
|---|---|
| Cloud SQL Postgres 16 (regional HA, 4 vCPU / 16 GB at Phase 1) | Primary store; CMEK; PITR 7 days |
| Memorystore Redis 7.2 (1 GB Standard) | Hot caches |
| Pub/Sub | Event backbone; topics under melmastoon.maintenance.* and subscriptions mnt.in.* |
| KMS | Keyring data in europe-west1 |
| Secret Manager | DB password, internal API tokens (where used) |
| Cloud Scheduler | Cron jobs above |
| Cloud Storage | melmastoon-vendor-invoices/ (CMEK, 7-yr lifecycle) |
| Artifact Registry | Container images |
| Cloud Build | CI image build |
| BigQuery | Event archive sink + audit destination |
| OTLP collector | OpenTelemetry export (Cloud Operations + SigNoz) |
| Kong (or platform-equivalent) | Public edge for /api/v1/maintenance/* |
6. Network topology
    [Internet / BFFs / Other GCP services]
                      │ HTTPS (mTLS internal)
                      ▼
                 ┌─────────┐
                 │  Kong   │  (validates JWT, rate-limits per tenant)
                 └────┬────┘
                      │ (private)
                      ▼
      ┌───────────────────────────────┐
      │    maintenance-service-api    │  Cloud Run (private ingress only)
      └───┬───────────┬──────────┬────┘
          │           │          │
    ┌─────▼─────┐ ┌───▼───┐ ┌────▼────┐ ┌──────────┐
    │ Cloud SQL │ │ Redis │ │ Pub/Sub │ │ KMS / SM │
    └───────────┘ └───────┘ └─────────┘ └──────────┘

    [Cloud Scheduler] ──OIDC──► /internal/cron/* on workers
    [Pub/Sub push]    ──OIDC──► /internal/pubsub/* on api
Egress to the internet is denied at the VPC firewall except for OTLP export and platform-external services. Outbound notification gateways, for example, are reached via notification-service, never directly from this service.
7. Configuration
Environment variables (non-secret):
| Var | Purpose |
|---|---|
| NODE_ENV | production / staging / local |
| BUILD_VERSION | injected at build time |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP gRPC endpoint |
| OTEL_RESOURCE_ATTRIBUTES | service.name, service.namespace, deployment.environment |
| DB_HOST, DB_NAME, DB_USER, DB_SOCKET_PATH | DB connection (password from Secret Manager) |
| REDIS_HOST, REDIS_PORT | Memorystore |
| PUBSUB_PROJECT_ID | Topic/Subscription scope |
| AI_ORCHESTRATOR_URL | https://ai-orchestrator.<env>.melmastoon.app |
| NOTIFICATION_URL | https://notification.<env>.melmastoon.app |
| IAM_URL | https://iam.<env>.melmastoon.app |
| SYNC_URL | https://sync.<env>.melmastoon.app |
| RESERVATION_PROJECTION_URL | for relocation overlap lookup |
| OUTBOX_TABLE | maintenance.outbox |
| WORKER_BATCH_SIZE_PREVENTIVE | default 200 |
| WORKER_BATCH_SIZE_SLA | default 500 |
| VENDOR_REMINDER_DEFAULT_MINUTES | default 30 |
Secrets (mounted as env from Secret Manager):
- DB_PASSWORD
- JWT_PUBLIC_KEYS (rotated)
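The defaults listed in the variable table can be applied in one place at startup. The following is a minimal sketch of such a loader for a few of the variables; the `WorkerConfig` shape, the validation rules, and treating maintenance.outbox as the fallback for OUTBOX_TABLE are assumptions, not the service's actual configuration module.

```typescript
type Env = Record<string, string | undefined>;

interface WorkerConfig {
  nodeEnv: "production" | "staging" | "local";
  outboxTable: string;
  batchSizePreventive: number;
  batchSizeSla: number;
  vendorReminderMinutes: number;
}

// Parse a positive integer from the environment, falling back to the documented default.
function intOrDefault(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return n;
}

// Build a typed config object; fails fast on an unknown NODE_ENV.
function loadConfig(env: Env): WorkerConfig {
  const nodeEnv = env.NODE_ENV;
  if (nodeEnv !== "production" && nodeEnv !== "staging" && nodeEnv !== "local") {
    throw new Error(`NODE_ENV must be production, staging or local, got "${nodeEnv}"`);
  }
  return {
    nodeEnv,
    outboxTable: env.OUTBOX_TABLE ?? "maintenance.outbox", // assumed fallback
    batchSizePreventive: intOrDefault(env, "WORKER_BATCH_SIZE_PREVENTIVE", 200),
    batchSizeSla: intOrDefault(env, "WORKER_BATCH_SIZE_SLA", 500),
    vendorReminderMinutes: intOrDefault(env, "VENDOR_REMINDER_DEFAULT_MINUTES", 30),
  };
}
```

At startup this would typically be called once as `loadConfig(process.env)`, so a malformed value stops the revision before it receives traffic.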
8. Deploy & release process
- PR merged to main.
- GitHub Actions → Cloud Build →
  - lint, unit, integration with Testcontainers
  - build container, push to Artifact Registry tagged :<sha> and :staging
  - run node-pg-migrate up against staging Cloud SQL via Cloud Run Job
  - gcloud run deploy maintenance-service-api --image=:<sha> --region=europe-west1 --no-traffic
  - smoke tests against new revision (?revision=<id>)
  - shift traffic 10% → 50% → 100% with 5 min bake intervals (canary)
  - same for maintenance-service-workers (single-instance canary then promote)
- Production: tag a release; same pipeline against prod with mandatory manual approval gate.
Rollback: gcloud run services update-traffic ... --to-revisions=<previous>=100. Migrations are expand-only, so additive changes need no rollback migration.
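The canary shift and the rollback both come down to `gcloud run services update-traffic` with different `--to-revisions` values. The sketch below generates those commands for a given plan; the `CanaryPlan` type and the revision names in the usage comment are hypothetical, and the pipeline's real scripting is not shown in this document.

```typescript
interface CanaryPlan {
  service: string;
  region: string;
  revision: string; // new revision deployed with --no-traffic
  steps: number[];  // traffic percentages, e.g. [10, 50, 100]
  bakeMinutes: number;
}

// One update-traffic command per canary step, in order.
function canaryCommands(plan: CanaryPlan): string[] {
  const { service, region, revision, steps } = plan;
  if (steps.length === 0 || steps[steps.length - 1] !== 100) {
    throw new Error("the last canary step must be 100");
  }
  for (let i = 0; i < steps.length; i++) {
    const inRange = steps[i] > 0 && steps[i] <= 100;
    const ascending = i === 0 || steps[i] > steps[i - 1];
    if (!inRange || !ascending) throw new Error(`invalid canary step ${steps[i]}`);
  }
  return steps.map(
    (pct) =>
      `gcloud run services update-traffic ${service} --region=${region} --to-revisions=${revision}=${pct}`,
  );
}

// Send all traffic back to the previous revision.
function rollbackCommand(service: string, region: string, previousRevision: string): string {
  return `gcloud run services update-traffic ${service} --region=${region} --to-revisions=${previousRevision}=100`;
}

// Example with hypothetical revision names:
// canaryCommands({ service: "maintenance-service-api", region: "europe-west1",
//   revision: "maintenance-service-api-00042-abc", steps: [10, 50, 100], bakeMinutes: 5 })
```

The caller is expected to wait `bakeMinutes` between commands and to run `rollbackCommand` if smoke tests or error rates regress during a bake interval.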
9. Capacity planning
| Metric | Phase 0 (50 props) | Phase 1 (500 props) |
|---|---|---|
| Avg API RPS | 5 | 50 |
| Peak API RPS | 25 | 250 |
| Avg Pub/Sub publishes/s | 0.07 | 0.7 |
| Peak Pub/Sub publishes/s | 5 | 50 |
| Cloud SQL CPU avg | 10% (4 vCPU) | 35% (8 vCPU) |
| Storage growth | 1.5 GB / yr | 15 GB / yr |
| Cloud Run cost / mo (api) | ~$30 | ~$200 |
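The peak RPS figures can be related to the instance limits in §2.1 with Little's law: requests in flight equal arrival rate times average latency, and each instance absorbs up to its concurrency setting. The sketch below does that arithmetic; the 200 ms average latency is an assumption for illustration and is not a measured figure from this document.

```typescript
// Little's law: requests in flight = arrival rate x average time in system.
function instancesNeeded(peakRps: number, avgLatencyMs: number, concurrency: number): number {
  if (peakRps < 0 || avgLatencyMs <= 0 || concurrency <= 0) {
    throw new Error("inputs must be positive");
  }
  const inFlight = (peakRps * avgLatencyMs) / 1000;
  return Math.max(1, Math.ceil(inFlight / concurrency));
}

// Convert an average publish rate per second into publishes per day.
function publishesPerDay(avgPerSecond: number): number {
  return Math.round(avgPerSecond * 86_400);
}

const ASSUMED_LATENCY_MS = 200; // assumption, not taken from the capacity table
const API_CONCURRENCY = 80;     // from the api service settings

console.log({
  phase0Instances: instancesNeeded(25, ASSUMED_LATENCY_MS, API_CONCURRENCY),
  phase1Instances: instancesNeeded(250, ASSUMED_LATENCY_MS, API_CONCURRENCY),
  phase0PublishesPerDay: publishesPerDay(0.07),
  phase1PublishesPerDay: publishesPerDay(0.7),
});
```

Under that latency assumption a single instance covers peak load in both phases, so the configured minimum of 2 is driven by availability rather than throughput; slower requests raise the count proportionally.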
10. Disaster recovery
| Scenario | RTO | RPO | Action |
|---|---|---|---|
| Region europe-west1 outage | 30 min | 5 min | Promote europe-west4 Cloud SQL replica; flip Cloud Run traffic via global LB |
| Cloud SQL data corruption | 60 min | 5 min | PITR restore to fresh instance; redirect connection string |
| Container image registry corruption | 30 min | 0 | Re-build from source; we keep last 30 builds |
| Pub/Sub subscription deletion | < 5 min | 0 | Recreate from Terraform; outbox replays |
| Mass mis-configuration | 15 min | 0 | Roll back Cloud Run revision |
DR drill: quarterly, on staging. Production drill annually with executive sign-off.