Skip to main content

maintenance-service · MIGRATION_PLAN

Two flavours of migration: schema (DDL changes via node-pg-migrate) and data / onboarding (importing legacy maintenance records when a new tenant joins from another PMS / spreadsheet).

1. Schema migration policy

1.1 Tooling

  • node-pg-migrate for all DDL changes; migrations live in services/maintenance-service/db/migrations/.
  • Migration filename: YYYYMMDDHHmm_<verb>_<subject>.sql.ts (e.g., 202604220900_add_relocation_required_to_work_orders.sql.ts).
  • One concern per migration. No mixed ALTER + data backfill in the same file (data backfills go through a worker; see §1.4).

1.2 Compatibility rules

Change kindAllowed in single step?Notes
Add nullable columnAlways with default if needed
Add column with non-null default⚠️Requires DEFAULT to avoid table rewrite on Postgres < 11; we're on 16 so default add is metadata-only
Add CHECK constraint❌ in single stepMust be NOT VALID first, then VALIDATE CONSTRAINT after backfill
Add UNIQUE / regular INDEXMust use CONCURRENTLY
Drop columnExpand-then-contract: stop reading → stop writing → drop in next release
Rename columnExpand-then-contract: add new column → dual-write → backfill → switch reads → drop old
Change column typeExpand-then-contract or via shadow column
Drop / change CHECK enum valueAdd new enum value → migrate writes → migrate reads → remove old in later release
Rename tableUse a view aliasing the new name during the transition

1.3 Pre-deploy gate

CI runs:

  1. node-pg-migrate up --dry-run against staging.
  2. The new migration applied to a shadow Postgres restored from a recent prod backup.
  3. pnpm test:int runs against the migrated shadow.
  4. If green: pipeline applies the migration to staging Cloud SQL via a Cloud Run Job, then deploys the new revision behind canary traffic.
  5. Production: same job runs before the new image is rolled out to traffic. A migration that fails halts the pipeline.

1.4 Long-running backfills

Data backfills are never done in a migration file — only schema changes are. Backfills are dedicated workers under services/maintenance-service/jobs/backfills/<id>/, each:

  • emits a metric mnt.backfill.<id>.processed_total and _remaining,
  • is restart-safe (resumes from last cursor),
  • can be paused via tools/backfill-control.ts,
  • writes audit entries.

Once the backfill is complete and verified, a follow-up migration can tighten constraints (e.g., VALIDATE CONSTRAINT, SET NOT NULL).

1.5 Sample expand-contract: adding relocation_required

  1. Migration 202604220900_add_relocation_required_to_work_orders.sql.ts:
    ALTER TABLE maintenance.work_orders
    ADD COLUMN relocation_required boolean NOT NULL DEFAULT false;
  2. Code release: writes set relocation_required correctly; reads ignore it.
  3. Code release: reads start using relocation_required (e.g., for the relocation event).
  4. Verify via metric and dashboards for ≥ 14 days.
  5. (No contract step needed; the column is here to stay.)

If we later want to remove it: stop reads → stop writes (1 release each) → drop column.

1.6 Rollback policy

  • Every migration must be forward-compatible with the previous app version (so canary works).
  • True rollback = revert app code; the schema can stay because of expand-contract.
  • DOWN scripts exist for emergency only and are reviewed but rarely executed in prod.

2. Data migration: legacy import (tenant onboarding)

2.1 Supported sources

SourceFormatNotes
Excel .xlsxone sheet per entitymost common; templates provided
CSVper-entity filesfor power users
PMS exportJSONadapter per known PMS (Cloudbeds, Mews — Phase 2; Phase 1 uses the Excel template)
Manual UIn/athe BFF supports adding assets/vendors directly; not a "migration" path

2.2 Required entities & order

Order matters because of FK dependencies:

  1. Vendor (no FKs)
  2. Asset (FK to property — must already exist in property-service)
  3. Part (FK to property)
  4. PreventiveSchedule (FK to asset)
  5. WorkOrder (FK to property/room/asset/vendor) — usually only the open ones at cutover; closed history is optional
  6. MaintenanceTask (FK to WO)
  7. PartUsage (FK to WO + part)

2.3 Excel template columns (Phase 1)

  • Vendors sheet: external_id (legacy), display_name, categories (semicolon-joined), contact_name, phone_e164, email, whatsapp_e164, channel_primary, channel_fallback, address_free_text.
  • Assets sheet: external_id, property_external_id, room_external_id (optional), class, display_name, model, manufacturer, serial_number, installed_at, last_serviced_at, run_hours, capacity_value, capacity_unit.
  • Parts sheet: external_id, property_external_id, part_number, display_name, category, on_hand, reorder_threshold, last_unit_cost_currency, last_unit_cost_amount.
  • PreventiveSchedules sheet: external_id, property_external_id, asset_external_id (or asset_class), category, title, cadence_kind, every_days, every_hours, tz, sla_target_minutes, default_severity, default_assignee_kind, default_assignee_external_id, notify_channel, active.
  • WorkOrders sheet: external_id, property_external_id, room_external_id, asset_external_id, category, severity, title, description, source, status, assignee_kind, assignee_external_id, created_at_iso.
  • PartUsages sheet: external_id, wo_external_id, part_external_id, quantity, unit_cost_currency, unit_cost_amount, recorded_at, recorded_by_external_id.

2.4 Legacy status mapping

Legacy termMaps to
new, reported, submittedopen
acknowledged, dispatched, assignedassigned
working, in-progress, wipin_progress
on-hold, paused, awaiting partsblocked
done, closed, completeresolved
verified, qc-approved, signed-offverified
cancelled, void, droppedcancelled
anything elseopen + warning row

Severity mapping is configurable per import (default low/medium/high/critical with synonyms).

2.5 Idempotency

  • Every imported row carries a legacy_external_id stored in a side table maintenance.legacy_id_map(tenant_id, entity, legacy_external_id, internal_id).
  • Re-running the importer with the same input produces zero new rows; only updates if the source has changed (hash compared).
  • New rows in source produce new internal rows; deletions in source do not propagate (manual decommission required).

2.6 Dry-run vs commit

tools/import-maintenance.ts modes:

  • --dry-run (default): produces report-<runId>.json with counts per entity, validation errors per row, and projected effects (creates / updates / skips). No DB writes.
  • --commit: applies changes inside one transaction per batch (default 500 rows). Failures roll back the batch and continue.
  • --resume <runId>: continues from last successful batch.

2.7 Side effects suppressed during import

  • Outbox events suppressed: import runs with OUTBOX_MODE=suppress; no work_order.created.v1 storm. Instead, a single melmastoon.maintenance.tenant.imported.v1 event is published at the end with summary counts.
  • Notifications disabled per import.
  • AI capabilities disabled per import (no model bills for legacy data).
  • SLA timers still computed but not actively scanned for breach until cutover flag is flipped.

2.8 Cutover

  1. Last sync of legacy → produce final Excel.
  2. Dry-run; review report-<runId>.json with the tenant's GM.
  3. Commit run during a maintenance window.
  4. Set tenant flag maintenance.cutoverComplete = true in tenant-service.
  5. Outbox event tenant.imported.v1 triggers analytics-service, search-aggregation-service, etc., to backfill projections.
  6. Enable SLA scanner for the tenant.
  7. Flip the BFF to point at maintenance-service for that tenant.

2.9 Reporting

After every commit run, an HTML report is uploaded to gs://melmastoon-import-reports/<tenantId>/<runId>/index.html containing:

  • Per-sheet success / failure counts
  • Validation errors with cell references
  • A diff preview for updated rows
  • Total time, batch sizes, total rows
  • A signed-off acknowledgement section for the tenant GM

2.10 Rollback

Every commit run records a runId in maintenance.import_runs. Rollback (tools/import-rollback.ts --runId=<id>):

  • Deletes all rows created with that runId (FK cascade respected).
  • Reverts updates to prior values from the run's snapshot table.
  • Does not restore deleted legacy_external_id mappings (those are kept for forward idempotency).

A rollback past cutover requires SRE + DBA approval; we do not silently undo a tenant who has been live.

3. Backfills for new columns / event versions

3.1 Adding a new column with computed value

Pattern: add column nullable → run worker to backfill → tighten to NOT NULL.

Example: adding WorkOrder.estimatedDurationMinutes (Phase 1.1):

  1. Schema migration: add nullable column.
  2. Backfill worker: for each closed WO, computes (resolved_at - created_at) clamped to 0..N; for each open WO, infers from category default.
  3. After 100% backfilled, schema migration tightens to NOT NULL.

3.2 New event version (e.g., work_order.created.v2)

  1. Add v2 publishing alongside v1 (dual-publish for ≥ 6 months).
  2. Consumers migrate independently.
  3. v1 frozen, then archived.

3.3 Re-emit historical events

For analytics consumers that join late, the importer can re-emit a synthetic work_order.created.v1 (with producedAt = original createdAt, marked replay=true in envelope metadata). Consumers must respect replay=true if they have side effects.

4. Tenant offboarding

When a tenant leaves the platform:

  1. Export: tools/export-maintenance.ts --tenantId=<id> --format=excel|json --bucket=gs://... Exports the entire maintenance dataset (WOs, tasks, parts, usages, vendors, assets, schedules, audit summary).
  2. Cool-down period (typically 30 days), during which the data is read-only.
  3. Erasure: tools/erase-maintenance.ts --tenantId=<id> --confirm=<token>:
    • Deletes operational rows (WOs, tasks, parts, usages, schedules).
    • Anonymises regulated rows (vendor_invoice_* retained 7 years per policy with all PII redacted).
    • Archives event payloads in BigQuery with PII fields nulled.
    • Vendor records anonymised (displayName[redacted-vnd-<n>], contact fields nulled).
  4. Audit entry written to audit-service with the export bucket URI and the erasure receipt.

5. Risks specific to migration

RiskMitigation
Legacy data has duplicate external_idImporter deduplicates within a sheet; warns; first wins
Legacy severity values can't mapImporter rejects row; staff adjusts mapping config
Large imports timeout single transactionBatch size 500; per-batch transactions; resume from cursor
Backfill worker drifts behindMetric mnt.backfill.<id>.lag_seconds; alert at 1h
Migration applies but schema desync from appCI gate compares Drizzle/Nest model vs DB introspection; build fails on diff

6. Linked artefacts

  • Schema policies: this doc §1
  • Tools: services/maintenance-service/tools/
  • Run history: maintenance.import_runs table
  • Reports: gs://melmastoon-import-reports/
  • Risk register entries: R12 (data corruption), R23 (prefix rename frozen)
  • Audit destination: audit-service