
maintenance-service · DEPLOYMENT_TOPOLOGY

Two Cloud Run services share one container image but use different entrypoints and IAM roles: api (request-serving) and workers (cron, Pub/Sub push handlers, and the outbox relay when the shared platform relay is not used). Configuration comes from environment variables and Secret Manager. Regions: europe-west1 (primary) and europe-west4 (warm replica).

1. Runtime

| Property | Value |
|---|---|
| Language | TypeScript |
| Runtime | Node.js 20 LTS |
| Framework | NestJS 10 |
| Container base | node:20-alpine (multi-stage; final stage is non-root) |
| Image registry | Artifact Registry europe-west1-docker.pkg.dev/<project>/melmastoon/maintenance-service |
| Build tool | Cloud Build |
| Migration tool | node-pg-migrate (run as a Cloud Run Job pre-deploy) |

2. Cloud Run services

2.1 maintenance-service-api

Serves the public REST API and the internal/pubsub/* push endpoints.

| Setting | Value |
|---|---|
| Region | europe-west1 (primary), europe-west4 (warm replica) |
| Min instances | 2 |
| Max instances | 12 |
| Concurrency | 80 |
| CPU | 1 vCPU (always-on; not boosted-on-request) |
| Memory | 512 MiB |
| Timeout | 30 s |
| Ingress | internal + load-balancer (Kong is the public edge) |
| VPC connector | melmastoon-vpc-conn-eu-west1 |
| Egress | private (all egress through VPC) |
| SA | maintenance-api@<project>.iam (least-privilege; see §4) |
| Liveness | GET /healthz |
| Readiness | GET /readyz (checks DB ping + Pub/Sub publisher token + outbox heartbeat) |
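
The readiness row above names three dependency checks. A minimal sketch of how they could be aggregated into a single /readyz answer is shown here; the type names, the 200/503 mapping, and the per-probe timeout are assumptions for illustration, not the service's actual code.

```typescript
// Hypothetical names throughout; only the three checked dependencies come from the table.
type CheckName = "db" | "pubsubToken" | "outboxHeartbeat";

interface CheckResult {
  name: CheckName;
  ok: boolean;
  detail?: string;
}

interface ReadinessReport {
  httpStatus: 200 | 503; // assumption: 503 signals "not ready"
  failing: CheckName[];
}

// Pure aggregation: ready only if every dependency check passed.
function summarizeReadiness(results: CheckResult[]): ReadinessReport {
  const failing = results.filter((r) => !r.ok).map((r) => r.name);
  return { httpStatus: failing.length === 0 ? 200 : 503, failing };
}

// Runs one probe with a deadline so a hung dependency cannot stall /readyz.
async function runCheck(
  name: CheckName,
  probe: () => Promise<void>,
  timeoutMs: number,
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    await Promise.race([probe(), deadline]);
    return { name, ok: true };
  } catch (err) {
    return { name, ok: false, detail: err instanceof Error ? err.message : String(err) };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A handler would run the three probes in parallel with `Promise.all`, pass the results to `summarizeReadiness`, and return the resulting status code. Keeping the aggregation pure makes it easy to unit-test without a database or Pub/Sub.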

2.2 maintenance-service-workers

Runs the preventive scheduler tick, the SLA breach scanner, the vendor reminder, and the asset health forecaster. All tasks share one container and are triggered by Cloud Scheduler calls to internal endpoints.

| Setting | Value |
|---|---|
| Region | europe-west1 |
| Min instances | 1 |
| Max instances | 3 |
| Concurrency | 4 (workers are I/O-bound but DB-heavy) |
| CPU | 1 vCPU |
| Memory | 1 GiB (forecaster needs more headroom) |
| Timeout | 540 s (max for some long sweeps) |
| Ingress | internal only |
| SA | maintenance-workers@<project>.iam |

2.3 Outbox relay

The service uses the shared platform outbox-relay-service, which is configured to read the maintenance.outbox table; we do not run our own relay. The lag SLO and dashboards are documented here, but the relay worker itself is shared.

3. Cloud Scheduler entries

| Job | Schedule | Endpoint | Purpose |
|---|---|---|---|
| mnt-preventive-tick | * * * * * (every minute) | POST /internal/cron/preventive-scheduler on workers | Materialise due preventive WOs |
| mnt-sla-tick | * * * * * | POST /internal/cron/sla-breach-scanner on workers | Detect SLA breaches |
| mnt-vendor-reminder | */5 * * * * | POST /internal/cron/vendor-reminder on workers | Re-notify pending vendors |
| mnt-asset-health | 0 * * * * (hourly) | POST /internal/cron/asset-health-forecaster on workers | AI health updates |
| mnt-preventive-due-digest | 0 6 * * * (per-tenant timezone via fan-out service) | POST /internal/cron/preventive-due-digest | Daily digest |
| mnt-archiver | 0 3 * * * | POST /internal/cron/archive-closed | Archive WOs > 24 mo to BigQuery |
| mnt-sweeper | 0 * * * * | POST /internal/cron/sweep | Prune outbox/inbox/idempotency rows |
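
The schedule above can also be kept as typed data and checked in CI before it is applied. The sketch below is one possible shape; the `SchedulerJob` type and `validateJobs` function are hypothetical, and only the job names, cron strings, and paths are taken from the table.

```typescript
interface SchedulerJob {
  name: string;
  cron: string; // 5-field unix-cron expression
  endpoint: string; // path on the workers service
}

const jobs: SchedulerJob[] = [
  { name: "mnt-preventive-tick", cron: "* * * * *", endpoint: "/internal/cron/preventive-scheduler" },
  { name: "mnt-sla-tick", cron: "* * * * *", endpoint: "/internal/cron/sla-breach-scanner" },
  { name: "mnt-vendor-reminder", cron: "*/5 * * * *", endpoint: "/internal/cron/vendor-reminder" },
  { name: "mnt-asset-health", cron: "0 * * * *", endpoint: "/internal/cron/asset-health-forecaster" },
  { name: "mnt-preventive-due-digest", cron: "0 6 * * *", endpoint: "/internal/cron/preventive-due-digest" },
  { name: "mnt-archiver", cron: "0 3 * * *", endpoint: "/internal/cron/archive-closed" },
  { name: "mnt-sweeper", cron: "0 * * * *", endpoint: "/internal/cron/sweep" },
];

// Returns human-readable problems; an empty array means the registry passed these basic checks.
// This only checks shape (field count, path prefix, unique names), not cron semantics.
function validateJobs(list: SchedulerJob[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const job of list) {
    if (job.cron.trim().split(/\s+/).length !== 5) {
      problems.push(`${job.name}: cron expression must have 5 fields`);
    }
    if (!job.endpoint.startsWith("/internal/cron/")) {
      problems.push(`${job.name}: endpoint must live under /internal/cron/`);
    }
    if (seen.has(job.name)) {
      problems.push(`${job.name}: duplicate job name`);
    }
    seen.add(job.name);
  }
  return problems;
}
```

Because two jobs fire every minute, the handlers behind them should tolerate overlapping or repeated invocations; the registry check above does not enforce that.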

4. IAM and Workload Identity

maintenance-api@… roles

  • roles/cloudsql.client (Cloud SQL connector)
  • roles/secretmanager.secretAccessor on secrets/maintenance-db-password
  • roles/pubsub.publisher on melmastoon.maintenance.* topics
  • roles/iam.serviceAccountTokenCreator (for issuing OIDC tokens to call ai-orchestrator-service, notification-service, etc.)
  • roles/cloudkms.cryptoKeyEncrypterDecrypter on data/maintenance-db (for app-side encryption of new fields if added)
  • roles/storage.objectCreator and roles/storage.objectViewer on melmastoon-vendor-invoices/

maintenance-workers@… roles

  • All of the above
  • roles/run.invoker on itself (Cloud Scheduler must invoke it)
  • roles/pubsub.subscriber on subscriptions matching mnt.in.*
  • roles/bigquery.dataEditor on melmastoon_events_v1.maintenance_* (archive job only)

No human user has standing direct DB access in production; emergency access goes through cloud-sql-proxy under org-wide, audited break-glass roles.

5. Infrastructure dependencies

| Dependency | Purpose |
|---|---|
| Cloud SQL Postgres 16 (regional HA, 4 vCPU / 16 GB at Phase 1) | Primary store; CMEK; PITR 7 days |
| Memorystore Redis 7.2 (1 GB Standard) | Hot caches |
| Pub/Sub | Event backbone; topics under melmastoon.maintenance.* and subscriptions mnt.in.* |
| KMS | Keyring data in europe-west1 |
| Secret Manager | DB password, internal API tokens (where used) |
| Cloud Scheduler | Cron jobs above |
| Cloud Storage | melmastoon-vendor-invoices/ (CMEK, 7-yr lifecycle) |
| Artifact Registry | Container images |
| Cloud Build | CI image build |
| BigQuery | Event archive sink + audit destination |
| OTLP collector | OpenTelemetry export (Cloud Operations + SigNoz) |
| Kong (or platform-equivalent) | Public edge for /api/v1/maintenance/* |

6. Network topology

[Internet / BFFs / Other GCP services]
                       │ HTTPS (mTLS internal)
                       ▼
                  ┌─────────┐
                  │  Kong   │ (validates JWT, rate-limits per tenant)
                  └────┬────┘
                       │ (private)
                       ▼
┌──────────────────────────────────────────────┐
│           maintenance-service-api            │ Cloud Run (private ingress only)
└────┬──────────┬───────────┬───────────┬──────┘
     │          │           │           │
┌────▼────┐ ┌───▼────┐ ┌────▼─────┐ ┌───▼──────┐
│Cloud SQL│ │ Redis  │ │ Pub/Sub  │ │ KMS / SM │
└─────────┘ └────────┘ └──────────┘ └──────────┘

[Cloud Scheduler] ──OIDC──► /internal/cron/* on workers
[Pub/Sub push] ──OIDC──► /internal/pubsub/* on api

Egress to the internet is denied at the VPC firewall except for OTLP export and approved platform-external services. Outbound notification gateways, for example, are reached via notification-service, never directly from this service.
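
Both internal entry points in the diagram are called with OIDC tokens. As a sketch of what the receiving handler might check once a JWT library has already verified the token signature against Google's public keys, the function below validates the decoded claims. The `ExpectedCaller` shape and the error strings are hypothetical; the two issuer values are the ones Google uses for its ID tokens.

```typescript
// Claims of a Google-issued OIDC token, AFTER signature verification by a JWT library.
interface OidcClaims {
  iss: string;
  aud: string;
  email?: string;
  email_verified?: boolean;
  exp: number; // expiry, seconds since the Unix epoch
}

// Hypothetical per-endpoint expectation; the values would come from deployment config.
interface ExpectedCaller {
  audience: string; // audience the Scheduler job or push subscription is configured with
  serviceAccountEmail: string; // service account allowed to call this endpoint
}

const GOOGLE_ISSUERS = ["https://accounts.google.com", "accounts.google.com"];

// Returns null when the caller is acceptable, otherwise a short rejection reason.
function checkCallerClaims(
  claims: OidcClaims,
  expected: ExpectedCaller,
  nowSec: number,
): string | null {
  if (GOOGLE_ISSUERS.indexOf(claims.iss) === -1) return "unexpected issuer";
  if (claims.aud !== expected.audience) return "audience mismatch";
  if (claims.email !== expected.serviceAccountEmail) return "unexpected caller identity";
  if (claims.email_verified !== true) return "email not verified";
  if (claims.exp <= nowSec) return "token expired";
  return null;
}
```

This check complements, and does not replace, the Cloud Run IAM invoker check: IAM decides whether the request reaches the container, while the claim check lets each handler pin the specific caller it expects.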

7. Configuration

Environment variables (non-secret):

| Var | Purpose |
|---|---|
| NODE_ENV | production / staging / local |
| BUILD_VERSION | injected at build time |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP gRPC endpoint |
| OTEL_RESOURCE_ATTRIBUTES | service.name, service.namespace, deployment.environment |
| DB_HOST, DB_NAME, DB_USER, DB_SOCKET_PATH | DB connection (password from Secret Manager) |
| REDIS_HOST, REDIS_PORT | Memorystore |
| PUBSUB_PROJECT_ID | Topic/Subscription scope |
| AI_ORCHESTRATOR_URL | https://ai-orchestrator.<env>.melmastoon.app |
| NOTIFICATION_URL | https://notification.<env>.melmastoon.app |
| IAM_URL | https://iam.<env>.melmastoon.app |
| SYNC_URL | https://sync.<env>.melmastoon.app |
| RESERVATION_PROJECTION_URL | for relocation overlap lookup |
| OUTBOX_TABLE | maintenance.outbox |
| WORKER_BATCH_SIZE_PREVENTIVE | default 200 |
| WORKER_BATCH_SIZE_SLA | default 500 |
| VENDOR_REMINDER_DEFAULT_MINUTES | default 30 |

Secrets (mounted as env from Secret Manager):

  • DB_PASSWORD
  • JWT_PUBLIC_KEYS (rotated)
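
A minimal sketch of loading and validating part of this configuration at startup follows. The variable names and the defaults (200, 500, 30) come from the table above; the `WorkerConfig` shape, the helper names, and the fallback for OUTBOX_TABLE are assumptions made for illustration.

```typescript
type Env = Record<string, string | undefined>;

interface WorkerConfig {
  nodeEnv: "production" | "staging" | "local";
  outboxTable: string;
  batchSizePreventive: number;
  batchSizeSla: number;
  vendorReminderMinutes: number;
}

// Parses a positive integer, falling back to the documented default when unset.
function positiveInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return n;
}

// Fails fast on invalid values so a bad revision never starts serving.
function loadConfig(env: Env): WorkerConfig {
  const nodeEnv = env["NODE_ENV"];
  if (nodeEnv !== "production" && nodeEnv !== "staging" && nodeEnv !== "local") {
    throw new Error(`NODE_ENV must be production, staging or local, got "${nodeEnv}"`);
  }
  return {
    nodeEnv,
    // Assumption: fall back to the table's value when OUTBOX_TABLE is unset.
    outboxTable: env["OUTBOX_TABLE"] ?? "maintenance.outbox",
    batchSizePreventive: positiveInt(env, "WORKER_BATCH_SIZE_PREVENTIVE", 200),
    batchSizeSla: positiveInt(env, "WORKER_BATCH_SIZE_SLA", 500),
    vendorReminderMinutes: positiveInt(env, "VENDOR_REMINDER_DEFAULT_MINUTES", 30),
  };
}
```

In a Node.js process this would be called once as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function testable with plain objects.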

8. Deploy & release process

  1. PR merged to main.
  2. GitHub Actions → Cloud Build →
    • lint, unit, integration with Testcontainers
    • build container, push to Artifact Registry tagged :<sha> and :staging
    • run node-pg-migrate up against staging Cloud SQL via Cloud Run Job
    • gcloud run deploy maintenance-service-api --image=:<sha> --region=europe-west1 --no-traffic
    • smoke tests against new revision (?revision=<id>)
    • shift traffic 10% → 50% → 100% with 5 min bake intervals (canary)
    • same for maintenance-service-workers (single-instance canary then promote)
  3. Production: tag a release; same pipeline against prod with mandatory manual approval gate.

Rollback: gcloud run services update-traffic ... --to-revisions=<previous>=100. Migrations are expand-only, so additive changes need no rollback migration.
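
The canary progression and the rollback command can be expressed as a small plan generator. This is a sketch under stated assumptions: `canaryPlan`, `rollbackCommand`, and the `CanaryStep` shape are hypothetical helpers, and the sketch assumes no bake period is needed after the final 100% step. The gcloud subcommand and the `--to-revisions` and `--region` flags are the ones used in the steps above.

```typescript
interface CanaryStep {
  percent: number;
  bakeSeconds: number; // how long to observe before the next shift
  command: string;
}

// Builds the 10% -> 50% -> 100% progression with 5-minute bakes between shifts.
function canaryPlan(
  service: string,
  region: string,
  revision: string,
  percents: number[] = [10, 50, 100],
  bakeSeconds: number = 300,
): CanaryStep[] {
  return percents.map((percent, i) => ({
    percent,
    // Assumption: the final step needs no bake because there is no further shift.
    bakeSeconds: i === percents.length - 1 ? 0 : bakeSeconds,
    command:
      `gcloud run services update-traffic ${service}` +
      ` --region=${region} --to-revisions=${revision}=${percent}`,
  }));
}

// Sends all traffic back to the previous revision.
function rollbackCommand(service: string, region: string, previousRevision: string): string {
  return (
    `gcloud run services update-traffic ${service}` +
    ` --region=${region} --to-revisions=${previousRevision}=100`
  );
}
```

A pipeline step could iterate over the plan, run each command, wait `bakeSeconds` while watching error rate and latency, and call `rollbackCommand` as soon as a bake fails.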

9. Capacity planning

| Metric | Phase 0 (50 props) | Phase 1 (500 props) |
|---|---|---|
| Avg API RPS | 5 | 50 |
| Peak API RPS | 25 | 250 |
| Avg Pub/Sub publishes/s | 0.07 | 0.7 |
| Peak Pub/Sub publishes/s | 5 | 50 |
| Cloud SQL CPU avg | 10% (4 vCPU) | 35% (8 vCPU) |
| Storage growth | 1.5 GB / yr | 15 GB / yr |
| Cloud Run cost / mo (api) | ~$30 | ~$200 |
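
To relate the peak figures to the instance limits in §2.1, Little's law gives the number of in-flight requests as arrival rate times average latency. The sketch below applies it to the api service; the 200 ms average latency and the 60% target utilisation are illustrative assumptions, not measured values, and `estimateInstances` is a hypothetical helper.

```typescript
interface ServiceLimits {
  concurrency: number;
  minInstances: number;
  maxInstances: number;
}

interface Estimate {
  inFlight: number; // average concurrent requests (Little's law)
  instances: number; // instances needed, clamped to the configured bounds
  saturated: boolean; // true when demand exceeds what maxInstances can absorb
}

function estimateInstances(
  rps: number,
  avgLatencyMs: number,
  limits: ServiceLimits,
  targetUtilization: number = 0.6, // assumption: keep 40% concurrency headroom per instance
): Estimate {
  const inFlight = (rps * avgLatencyMs) / 1000;
  const usablePerInstance = limits.concurrency * targetUtilization;
  const needed = Math.ceil(inFlight / usablePerInstance);
  const instances = Math.min(limits.maxInstances, Math.max(limits.minInstances, needed));
  return { inFlight, instances, saturated: needed > limits.maxInstances };
}

// Limits from the maintenance-service-api settings table.
const apiLimits: ServiceLimits = { concurrency: 80, minInstances: 2, maxInstances: 12 };

// Phase 1 peak (250 RPS) at an assumed 200 ms average latency.
const phase1Peak = estimateInstances(250, 200, apiLimits);
```

Under these assumptions the Phase 1 peak keeps about 50 requests in flight, which the two minimum instances already cover. The hard ceiling of the configuration is 12 × 80 = 960 concurrent requests, so the estimate only reports saturation if latency rises well above the assumed value.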

10. Disaster recovery

| Scenario | RTO | RPO | Action |
|---|---|---|---|
| Region europe-west1 outage | 30 min | 5 min | Promote europe-west4 Cloud SQL replica; flip Cloud Run traffic via global LB |
| Cloud SQL data corruption | 60 min | 5 min | PITR restore to fresh instance; redirect connection string |
| Container image registry corruption | 30 min | 0 | Re-build from source; we keep last 30 builds |
| Pub/Sub subscription deletion | < 5 min | 0 | Recreate from Terraform; outbox replays |
| Mass mis-configuration | 15 min | 0 | Roll back Cloud Run revision |

DR drill: quarterly, on staging. Production drill annually with executive sign-off.