maintenance-service · DEPLOYMENT_TOPOLOGY

Two Cloud Run services share the same container image but use different entrypoints and IAM roles: api (request-serving) and workers (cron endpoints, Pub/Sub push handlers, and the outbox relay when the shared platform relay is not used). Configuration comes from environment variables and Secret Manager. Regions: europe-west1 (primary) and europe-west4 (warm replica).

1. Runtime

Property	Value
Language	TypeScript
Runtime	Node.js 20 LTS
Framework	NestJS 10
Container base	`node:20-alpine` (multi-stage; final stage is non-root)
Image registry	Artifact Registry `europe-west1-docker.pkg.dev/<project>/melmastoon/maintenance-service`
Build tool	Cloud Build
Migration tool	`node-pg-migrate` (run as a Cloud Run Job pre-deploy)

2. Cloud Run services

2.1 `maintenance-service-api`

Serves the public REST API and the /internal/pubsub/* push endpoints.

Setting	Value
Region	`europe-west1` (primary), `europe-west4` (warm replica)
Min instances	2
Max instances	12
Concurrency	80
CPU	1 vCPU (always-on; not boosted-on-request)
Memory	512 MiB
Timeout	30 s
Ingress	internal + load-balancer (Kong is the public edge)
VPC connector	`melmastoon-vpc-conn-eu-west1`
Egress	private (all egress through VPC)
SA	`maintenance-api@<project>.iam` (least-privilege; see §4)
Liveness	`GET /healthz`
Readiness	`GET /readyz` (checks DB ping + Pub/Sub publisher token + outbox heartbeat)
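
The readiness probe in the table combines three dependency checks. As a minimal sketch of how their results might be folded into one response, assuming each check has already run and been reduced to a boolean (the names `DependencyCheck` and `summarizeReadiness` are hypothetical and do not come from the service code):

```typescript
// Hypothetical shape for one dependency check result; names are illustrative.
interface DependencyCheck {
  name: "db" | "pubsub" | "outbox";
  ok: boolean;
}

interface ReadinessSummary {
  httpStatus: 200 | 503;
  failing: string[];
}

// Ready only when every dependency check passed; otherwise report 503
// together with the names of the checks that failed.
function summarizeReadiness(checks: DependencyCheck[]): ReadinessSummary {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  return { httpStatus: failing.length === 0 ? 200 : 503, failing };
}
```

Returning the failing check names alongside the status is a design choice for this sketch; it makes a 503 from `/readyz` easier to diagnose from logs.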

2.2 `maintenance-service-workers`

Runs the preventive scheduler tick, SLA breach scanner, vendor reminder, and asset health forecaster. All tasks run in one container and are triggered by Cloud Scheduler calling internal endpoints.

Setting	Value
Region	`europe-west1`
Min instances	1
Max instances	3
Concurrency	4 (workers are I/O-bound but DB-heavy)
CPU	1 vCPU
Memory	1 GiB (forecaster needs more headroom)
Timeout	540 s (max for some long sweeps)
Ingress	internal only
SA	`maintenance-workers@<project>.iam`

2.3 Outbox relay

The service uses the shared platform outbox-relay-service, which is configured to read the maintenance.outbox table; we do not run our own relay. The lag SLO and dashboards live in this doc, but the relay worker itself is shared.

3. Cloud Scheduler entries

Job	Schedule	Endpoint	Purpose
`mnt-preventive-tick`	`* * * * *` (every minute)	`POST /internal/cron/preventive-scheduler` on workers	Materialise due preventive WOs
`mnt-sla-tick`	`* * * * *`	`POST /internal/cron/sla-breach-scanner` on workers	Detect SLA breaches
`mnt-vendor-reminder`	`*/5 * * * *` (every 5 minutes)	`POST /internal/cron/vendor-reminder` on workers	Re-notify pending vendors
`mnt-asset-health`	`0 * * * *` (hourly)	`POST /internal/cron/asset-health-forecaster` on workers	AI health updates
`mnt-preventive-due-digest`	`0 6 * * *` (per-tenant timezone via fan-out service)	`POST /internal/cron/preventive-due-digest`	Daily digest
`mnt-archiver`	`0 3 * * *`	`POST /internal/cron/archive-closed`	Archive WOs > 24 mo to BigQuery
`mnt-sweeper`	`0 * * * *`	`POST /internal/cron/sweep`	Prune outbox/inbox/idempotency rows
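
Cloud Scheduler expects unix-cron strings with five fields, so a malformed schedule is easy to miss in review. The sketch below restates the table as data and checks each schedule's shape. The `CronJob` type and both helper functions are hypothetical, and the vendor-reminder schedule is written as `*/5 * * * *` on the assumption that a five-minute interval is intended.

```typescript
// Hypothetical typed view of the scheduler table; not the real job definitions.
interface CronJob {
  name: string;
  schedule: string; // unix-cron: minute hour day-of-month month day-of-week
  endpoint: string;
}

const jobs: CronJob[] = [
  { name: "mnt-preventive-tick", schedule: "* * * * *", endpoint: "/internal/cron/preventive-scheduler" },
  { name: "mnt-sla-tick", schedule: "* * * * *", endpoint: "/internal/cron/sla-breach-scanner" },
  // Assumption: every five minutes.
  { name: "mnt-vendor-reminder", schedule: "*/5 * * * *", endpoint: "/internal/cron/vendor-reminder" },
  { name: "mnt-asset-health", schedule: "0 * * * *", endpoint: "/internal/cron/asset-health-forecaster" },
  { name: "mnt-preventive-due-digest", schedule: "0 6 * * *", endpoint: "/internal/cron/preventive-due-digest" },
  { name: "mnt-archiver", schedule: "0 3 * * *", endpoint: "/internal/cron/archive-closed" },
  { name: "mnt-sweeper", schedule: "0 * * * *", endpoint: "/internal/cron/sweep" },
];

// Accepts "*", "*/n", or a plain number per field. Ranges and lists are valid
// cron too, but this table does not use them, so the check leaves them out.
const FIELD = /^(\*|\*\/\d+|\d+)$/;

function isFiveFieldCron(schedule: string): boolean {
  const fields = schedule.trim().split(/\s+/);
  return fields.length === 5 && fields.every((f) => FIELD.test(f));
}

// Names of jobs whose schedule does not have the expected shape.
function invalidJobs(list: CronJob[]): string[] {
  return list.filter((j) => !isFiveFieldCron(j.schedule)).map((j) => j.name);
}
```

A check like `invalidJobs(jobs)` could run in CI before the scheduler definitions are applied. It validates shape only, not whether the values are in range.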

4. IAM and Workload Identity

`maintenance-api@…` roles

roles/cloudsql.client (Cloud SQL connector)
roles/secretmanager.secretAccessor on secrets/maintenance-db-password
roles/pubsub.publisher on melmastoon.maintenance.* topics
roles/iam.serviceAccountTokenCreator (for issuing OIDC tokens to call ai-orchestrator-service, notification-service, etc.)
roles/cloudkms.cryptoKeyEncrypterDecrypter on data/maintenance-db (for app-side encryption of new fields if added)
roles/storage.objectCreator and roles/storage.objectViewer on melmastoon-vendor-invoices/

`maintenance-workers@…` roles

All of the above
roles/run.invoker on itself (Cloud Scheduler must invoke it)
roles/pubsub.subscriber on subscriptions matching mnt.in.*
roles/bigquery.dataEditor on melmastoon_events_v1.maintenance_* (archive job only)

No human user has direct DB access in production; access is via cloud-sql-proxy with org-wide audited break-glass roles.

5. Infrastructure dependencies

Dependency	Purpose
Cloud SQL Postgres 16 (regional HA, 4 vCPU / 16 GB at Phase 1)	Primary store; CMEK; PITR 7 days
Memorystore Redis 7.2 (1 GB Standard)	Hot caches
Pub/Sub	Event backbone; topics under `melmastoon.maintenance.*` and subscriptions under `mnt.in.*`
KMS	Keyring `data` in `europe-west1`
Secret Manager	DB password, internal API tokens (where used)
Cloud Scheduler	Cron jobs above
Cloud Storage	`melmastoon-vendor-invoices/` (CMEK, 7-yr lifecycle)
Artifact Registry	Container images
Cloud Build	CI image build
BigQuery	Event archive sink + audit destination
OTLP collector	OpenTelemetry export (Cloud Operations + SigNoz)
Kong (or platform-equivalent)	Public edge for `/api/v1/maintenance/*`

6. Network topology

[Internet / BFFs / Other GCP services]
        │  HTTPS (mTLS internal)
        ▼
   ┌─────────┐
   │  Kong   │  (validates JWT, rate-limits per tenant)
   └────┬────┘
        │ (private)
        ▼
   ┌─────────────────────────┐
   │ maintenance-service-api │  Cloud Run (private ingress only)
   └────┬─────────┬──────────┘
        │         │
   ┌────▼───┐ ┌───▼────┐ ┌────▼─────┐ ┌──────────┐
   │Cloud SQL│ │ Redis  │ │ Pub/Sub  │ │ KMS / SM │
   └─────────┘ └────────┘ └──────────┘ └──────────┘

[Cloud Scheduler] ──OIDC──► /internal/cron/* on workers
[Pub/Sub push]    ──OIDC──► /internal/pubsub/* on api

Egress to the internet is denied at the VPC firewall except for OTLP export and platform-external services. Outbound notification gateways, for example, are reached via notification-service, never directly from this service.
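
The OIDC-authenticated calls in the diagram (Cloud Scheduler to workers, Pub/Sub push to api) rely on the receiver trusting only the expected caller. The sketch below shows the claim checks an endpoint might apply after the token's signature has been verified elsewhere. It is an illustration under stated assumptions: `OidcClaims`, `CallerPolicy`, and `isAllowedCaller` are hypothetical names, the audience and caller emails must come from deployment config, and signature verification is deliberately out of scope.

```typescript
// Subset of the claims carried by a Google-issued OIDC identity token.
interface OidcClaims {
  iss: string;
  aud: string;
  exp: number; // expiry, seconds since epoch
  email?: string;
  email_verified?: boolean;
}

// Hypothetical per-endpoint policy; values are supplied by configuration.
interface CallerPolicy {
  audience: string;
  allowedEmails: string[];
}

// Issuer values used by Google for identity tokens.
const GOOGLE_ISSUERS = ["https://accounts.google.com", "accounts.google.com"];

// Returns true only if issuer, audience, expiry, and caller identity all match.
// Assumes the token signature was already verified by the caller of this function.
function isAllowedCaller(claims: OidcClaims, policy: CallerPolicy, nowSec: number): boolean {
  if (GOOGLE_ISSUERS.indexOf(claims.iss) === -1) return false;
  if (claims.aud !== policy.audience) return false;
  if (claims.exp <= nowSec) return false;
  if (claims.email_verified !== true || claims.email === undefined) return false;
  return policy.allowedEmails.indexOf(claims.email) !== -1;
}
```

Cloud Run can also enforce `roles/run.invoker` at the platform level before a request reaches the container, so an application-level check of this kind would be an additional layer rather than the only gate.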

7. Configuration

Environment variables (non-secret):

Var	Purpose
`NODE_ENV`	`production` / `staging` / `local`
`BUILD_VERSION`	injected at build time
`OTEL_EXPORTER_OTLP_ENDPOINT`	OTLP gRPC endpoint
`OTEL_RESOURCE_ATTRIBUTES`	service.name, service.namespace, deployment.environment
`DB_HOST`, `DB_NAME`, `DB_USER`, `DB_SOCKET_PATH`	DB connection (password from Secret Manager)
`REDIS_HOST`, `REDIS_PORT`	Memorystore
`PUBSUB_PROJECT_ID`	Topic/Subscription scope
`AI_ORCHESTRATOR_URL`	`https://ai-orchestrator.<env>.melmastoon.app`
`NOTIFICATION_URL`	`https://notification.<env>.melmastoon.app`
`IAM_URL`	`https://iam.<env>.melmastoon.app`
`SYNC_URL`	`https://sync.<env>.melmastoon.app`
`RESERVATION_PROJECTION_URL`	for relocation overlap lookup
`OUTBOX_TABLE`	`maintenance.outbox`
`WORKER_BATCH_SIZE_PREVENTIVE`	default 200
`WORKER_BATCH_SIZE_SLA`	default 500
`VENDOR_REMINDER_DEFAULT_MINUTES`	default 30

Secrets (mounted as env from Secret Manager):

DB_PASSWORD
JWT_PUBLIC_KEYS (rotated)
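
The variables above could be read into a typed object at startup so that a missing or malformed value fails fast. The following is a minimal sketch covering only a subset of the settings. `loadConfig`, `WorkerConfig`, and the helpers are hypothetical names, and the environment is passed in as a plain record rather than read from the process, which keeps the function easy to test.

```typescript
const NODE_ENVS = ["production", "staging", "local"] as const;
type NodeEnv = (typeof NODE_ENVS)[number];
type Env = Record<string, string | undefined>;

// Illustrative subset of the settings listed in the tables above.
interface WorkerConfig {
  nodeEnv: NodeEnv;
  outboxTable: string;
  batchSizePreventive: number;
  batchSizeSla: number;
  vendorReminderMinutes: number;
  dbPassword: string;
}

function required(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") throw new Error(`${key} is required`);
  return value;
}

// Falls back to the documented default when the variable is unset.
function intOrDefault(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function parseNodeEnv(raw: string): NodeEnv {
  for (const candidate of NODE_ENVS) {
    if (candidate === raw) return candidate;
  }
  throw new Error(`NODE_ENV has unexpected value "${raw}"`);
}

function loadConfig(env: Env): WorkerConfig {
  return {
    nodeEnv: parseNodeEnv(required(env, "NODE_ENV")),
    outboxTable: required(env, "OUTBOX_TABLE"),
    batchSizePreventive: intOrDefault(env, "WORKER_BATCH_SIZE_PREVENTIVE", 200),
    batchSizeSla: intOrDefault(env, "WORKER_BATCH_SIZE_SLA", 500),
    vendorReminderMinutes: intOrDefault(env, "VENDOR_REMINDER_DEFAULT_MINUTES", 30),
    dbPassword: required(env, "DB_PASSWORD"), // mounted from Secret Manager
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`.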

8. Deploy & release process

PR merged to main.
GitHub Actions triggers Cloud Build, which runs these steps in order:
- lint, unit, integration with Testcontainers
- build container, push to Artifact Registry tagged :<sha> and :staging
- run node-pg-migrate up against staging Cloud SQL via Cloud Run Job
- gcloud run deploy maintenance-service-api --image=:<sha> --region=europe-west1 --no-traffic
- smoke tests against new revision (?revision=<id>)
- shift traffic 10% → 50% → 100% with 5 min bake intervals (canary)
- same for maintenance-service-workers (single-instance canary then promote)
Production: tag a release; same pipeline against prod with mandatory manual approval gate.
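
The canary step in the pipeline above (10% → 50% → 100% with 5-minute bakes) can be expressed as a small plan generator. This is a sketch only: `canaryPlan` and `trafficArg` are hypothetical helpers, the revision names are placeholders, and the real pipeline may drive `gcloud run services update-traffic` differently.

```typescript
interface TrafficStep {
  percent: number; // share of traffic routed to the new revision
  startAfterMin: number; // minutes after the canary begins
}

// Each step starts one bake interval after the previous one.
function canaryPlan(percents: number[], bakeMinutes: number): TrafficStep[] {
  return percents.map((percent, i) => ({ percent, startAfterMin: i * bakeMinutes }));
}

// Builds the --to-revisions argument for one step, splitting the remainder
// of the traffic onto the previous revision until the final 100% step.
function trafficArg(newRevision: string, previousRevision: string, percent: number): string {
  if (percent >= 100) return `--to-revisions=${newRevision}=100`;
  return `--to-revisions=${newRevision}=${percent},${previousRevision}=${100 - percent}`;
}

const plan = canaryPlan([10, 50, 100], 5);
```

With these inputs the plan reaches 100% ten minutes after the first shift. Whether a further bake follows the 100% step before the workers rollout begins is not stated in this document.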

Rollback: `gcloud run services update-traffic ... --to-revisions=<previous>=100`. Migrations are expand-only, so no rollback migration is needed for additive changes.

9. Capacity planning

Metric	Phase 0 (50 props)	Phase 1 (500 props)
Avg API RPS	5	50
Peak API RPS	25	250
Avg Pub/Sub publishes/s	0.07	0.7
Peak Pub/Sub publishes/s	5	50
Cloud SQL CPU avg	10% (4 vCPU)	35% (8 vCPU)
Storage growth	1.5 GB / yr	15 GB / yr
Cloud Run cost / mo (api)	~$30	~$200
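
As a rough cross-check of these figures against the Cloud Run settings in §2.1 (max 12 instances, concurrency 80), Little's law gives in-flight requests as RPS multiplied by mean latency. The sketch below works through that arithmetic. The 200 ms mean latency is an assumption chosen for illustration, not a measured value, and the function names are hypothetical.

```typescript
// Little's law: average in-flight requests = arrival rate x mean latency.
function inFlight(rps: number, meanLatencyMs: number): number {
  return (rps * meanLatencyMs) / 1000;
}

// Instances needed to hold that many concurrent requests.
function instancesNeeded(rps: number, meanLatencyMs: number, concurrency: number): number {
  return Math.ceil(inFlight(rps, meanLatencyMs) / concurrency);
}

// Highest RPS the service can absorb at full scale-out.
function maxRps(maxInstances: number, concurrency: number, meanLatencyMs: number): number {
  return (maxInstances * concurrency * 1000) / meanLatencyMs;
}

// Daily message volume from an average per-second publish rate.
function publishesPerDay(ratePerSec: number): number {
  return Math.round(ratePerSec * 86400);
}

const ASSUMED_LATENCY_MS = 200; // assumption, not from the capacity table

const phase1PeakInstances = instancesNeeded(250, ASSUMED_LATENCY_MS, 80); // 1
const ceilingRps = maxRps(12, 80, ASSUMED_LATENCY_MS); // 4800
const phase1DailyPublishes = publishesPerDay(0.7); // 60480
```

Under that latency assumption, the Phase 1 peak of 250 RPS keeps about 50 requests in flight, which fits within the two minimum instances. Slower responses reduce the headroom proportionally.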

10. Disaster recovery

Scenario	RTO	RPO	Action
Region `europe-west1` outage	30 min	5 min	Promote `europe-west4` Cloud SQL replica; flip Cloud Run traffic via global LB
Cloud SQL data corruption	60 min	5 min	PITR restore to fresh instance; redirect connection string
Container image registry corruption	30 min	0	Re-build from source; we keep last 30 builds
Pub/Sub subscription deletion	< 5 min	0	Recreate from Terraform; outbox replays
Mass mis-configuration	15 min	0	Roll back Cloud Run revision

DR drill: quarterly, on staging. Production drill annually with executive sign-off.
