file-storage-service — MIGRATION_PLAN

Companion: DATA_MODEL · API_CONTRACTS · EVENT_SCHEMAS · SERVICE_READINESS

This document is the playbook for changing the surface of file-storage-service without breaking consumers, downstream services, or stored bytes. It covers (1) general policy, (2) the recurring expand → backfill → contract workflow for DB, events, and APIs, (3) named bootstrap migrations from the current "no service" state to MVP, (4) named upcoming migrations to Phase 2 and Phase 3, and (5) one-time data movements (re-bucketing, re-keying, CDN cutovers).


1. Migration policy

| Principle | Statement |
| --- | --- |
| Backward compatibility | All released contracts (REST, events, DB) are extended only via additive changes within a major version. Breaking changes require an ADR + a versioned, parallel surface for ≥ 90 d. |
| Expand → migrate → contract | Schema changes ship in stages: (1) expand: add new fields/tables, deploy code that writes both, (2) backfill, (3) contract: remove old fields/columns. Never combine stages in a single deploy. |
| Tenant safety | Every migration includes a dry-run in staging on a snapshot of prod. Any data path that touches bytes is gated by a feature flag at boot and fails closed. |
| Zero downtime | All deploys are zero-downtime. Long-running data work runs in idempotent, resumable batches (Cloud Run jobs) and is observable on a dedicated dashboard. |
| Reversibility | Every migration ships with a documented rollback. If rollback is impossible (e.g., bytes re-encrypted), it is called out explicitly in the ADR and the readiness sign-off. |
| Auditability | Every bulk data movement writes one audit_events row per affected FileObject, plus a single summary row, signed by the same key family used for erasure certificates. |

Hard invariants (FAILURE_MODES §11) may not be relaxed by a migration without an ADR co-signed by security and platform.


2. Standard migration types

2.1 Database schema (DDL)

Pattern: expand → backfill → contract

  1. Expand PR:
    • Add a nullable column, a new table, or a new index (built with CREATE INDEX CONCURRENTLY).
    • Update INSERT/UPDATE paths to write the new field; reads still use the old.
    • Migration test ensures it applies in < 60 s on a snapshot the size of prod.
  2. Backfill PR (or Cloud Run job):
    • Idempotent script in batches of 1 000 rows under a tenant lock; emits progress metric file_storage.migration.backfilled_rows{migration_id}.
  3. Read switch PR:
    • Reads start using the new column; old field still written for ≥ 14 d.
  4. Contract PR:
    • Stop writing old field.
    • After ≥ 14 d, drop column (ALTER … DROP COLUMN) in a quiet window.

An expand PR never adds NOT NULL to a new column that has no default; we add the default first, then backfill, then SET NOT NULL in a separate PR.
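The backfill stage above can be sketched as a keyset-paginated loop. This is a minimal illustration under stated assumptions, not the service's actual job: the `BackfillRow` shape, the `backfill` function, and its options are hypothetical, and an in-memory array stands in for the table (and synchronous calls for database I/O) so the control flow stays easy to follow.

```typescript
// Hypothetical row shape: `oldValue` is the legacy column, `newValue` the expanded one.
interface BackfillRow {
  id: number;
  oldValue: string;
  newValue: string | null;
}

interface BackfillResult {
  scanned: number;
  updated: number; // in dry-run mode: rows that would have been updated
  lastId: number;  // checkpoint to pass as `afterId` when resuming
}

function backfill(
  rows: BackfillRow[],
  opts: { afterId?: number; batchSize?: number; maxBatches?: number; dryRun?: boolean } = {},
): BackfillResult {
  const batchSize = opts.batchSize ?? 1000;
  const maxBatches = opts.maxBatches ?? Number.POSITIVE_INFINITY;
  let lastId = opts.afterId ?? 0;
  let scanned = 0;
  let updated = 0;

  for (let batchNo = 0; batchNo < maxBatches; batchNo++) {
    // Keyset pagination; stands in for `WHERE id > $1 ORDER BY id LIMIT $2`.
    const batch = rows
      .filter((r) => r.id > lastId)
      .sort((a, b) => a.id - b.id)
      .slice(0, batchSize);
    if (batch.length === 0) break;

    for (const row of batch) {
      scanned++;
      // Idempotent: rows that already carry the new value are left alone.
      if (row.newValue === null) {
        if (!opts.dryRun) row.newValue = row.oldValue;
        updated++;
      }
      lastId = row.id;
    }
  }
  return { scanned, updated, lastId };
}
```

Persisting `lastId` after each batch is what would make a real job resumable after a crash; a production version would also emit the progress metric once per batch.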

2.2 Event schemas

  • Additive optional fields → bump schemaVersion in JSON Schema; same topic version.
  • Renames / removals / type changes → publish a .v(N+1) topic alongside .vN. Both run for ≥ 90 d. Producers gated by per-tenant rollout flag. Old topic deprecation announced in contracts/events/DEPRECATIONS.md.
  • Pact provider verifications must pass for every consumer at every step.
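The rules above amount to a classification of each schema change. The sketch below is one way to express that decision; `EventFieldSpec`, `classifyChange`, and `nextTopic` are hypothetical names, and treating any change to a field's `required` flag as breaking is a conservative assumption that the rules above do not state.

```typescript
type EventFieldSpec = { type: string; required: boolean };
type EventSchema = Record<string, EventFieldSpec>;
type ChangeVerdict = "no-change" | "additive" | "breaking";

function classifyChange(prev: EventSchema, next: EventSchema): ChangeVerdict {
  for (const [name, spec] of Object.entries(prev)) {
    const updated: EventFieldSpec | undefined = next[name];
    // A removal, a rename (seen as removal + addition), or a type change is breaking.
    if (updated === undefined || updated.type !== spec.type) return "breaking";
    // Assumption: flipping `required` in either direction is treated as breaking.
    if (updated.required !== spec.required) return "breaking";
  }
  let additive = false;
  for (const [name, spec] of Object.entries(next)) {
    if (name in prev) continue;
    if (spec.required) return "breaking"; // a new required field is not additive
    additive = true;
  }
  return additive ? "additive" : "no-change";
}

// Breaking changes move to the next topic version; everything else stays put.
function nextTopic(topic: string, verdict: ChangeVerdict): string {
  if (verdict !== "breaking") return topic;
  const match = /^(.*\.v)(\d+)$/.exec(topic);
  if (match === null) throw new Error(`topic has no .vN suffix: ${topic}`);
  const [, prefix = "", version = "0"] = match;
  return `${prefix}${Number(version) + 1}`;
}
```

An "additive" verdict corresponds to bumping schemaVersion on the same topic; a "breaking" verdict corresponds to publishing the parallel topic.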

2.3 REST API

  • Additive response fields and optional request fields → no version bump.
  • Breaking changes → mount under /api/v2/... while v1 continues; v1 responses carry the Sunset header (RFC 8594) together with a Deprecation header; minimum overlap ≥ 180 d.
  • OpenAPI snapshot test enforces no accidental break.
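As an illustration of the deprecation headers and the 180 d overlap, a v1 handler could compute its headers roughly as follows. `deprecationHeaders` is a hypothetical helper. The Sunset value is an HTTP-date as RFC 8594 requires; the `@<unix seconds>` form used for Deprecation is an assumption about which revision of that header the gateway follows and should be checked before use.

```typescript
const MIN_OVERLAP_DAYS = 180;
const DAY_MS = 24 * 60 * 60 * 1000;

function deprecationHeaders(deprecatedAt: Date, sunsetAt: Date): Record<string, string> {
  const overlapDays = (sunsetAt.getTime() - deprecatedAt.getTime()) / DAY_MS;
  if (overlapDays < MIN_OVERLAP_DAYS) {
    throw new Error(`overlap is ${overlapDays} d; policy requires >= ${MIN_OVERLAP_DAYS} d`);
  }
  return {
    // Date.prototype.toUTCString() produces the HTTP-date format, e.g. "Tue, 01 Jul 2025 00:00:00 GMT".
    Sunset: sunsetAt.toUTCString(),
    // Assumption: structured-field date form (seconds since the Unix epoch, prefixed with "@").
    Deprecation: `@${Math.floor(deprecatedAt.getTime() / 1000)}`,
  };
}
```

Rejecting a too-short window at construction time keeps the overlap policy from depending on reviewers spotting a wrong date.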

2.4 GCS object key changes

Object keys are immutable. To change a key:

  1. Copy bytes to the new key (same bucket; CMEK preserved).
  2. Update file_objects.object_key in DB.
  3. Verify via reconciliation job.
  4. Delete old key.
  5. Invalidate CDN for any change touching public_media.

This sequence is wrapped by key-rewrite-job, which takes {tenantId, scope, fromPattern, toPattern}, is gated by a per-job feature flag, and supports a dry-run mode.
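The five steps can be sketched against in-memory stand-ins. Everything here is illustrative: `keyRewrite`, `KeyedFileObject`, and the use of a RegExp for fromPattern are assumptions, a `Map` plays the role of the bucket, the inline check stands in for the reconciliation job, and treating public_media as a scope value is a guess.

```typescript
interface KeyedFileObject {
  id: string;
  tenantId: string;
  scope: string;
  objectKey: string;
}

interface KeyRewriteInput {
  tenantId: string;
  scope: string;
  fromPattern: RegExp; // assumption: the job's pattern is expressed as a regular expression
  toPattern: string;   // replacement string; may reference capture groups ($1, $2, ...)
  dryRun?: boolean;
}

interface KeyRewriteReport {
  moved: Array<{ from: string; to: string }>;
  cdnInvalidations: string[];
}

function keyRewrite(
  bucket: Map<string, string>,
  fileObjects: KeyedFileObject[],
  input: KeyRewriteInput,
): KeyRewriteReport {
  const report: KeyRewriteReport = { moved: [], cdnInvalidations: [] };
  for (const fo of fileObjects) {
    if (fo.tenantId !== input.tenantId || fo.scope !== input.scope) continue;
    const from = fo.objectKey;
    const to = from.replace(input.fromPattern, input.toPattern);
    if (to === from) continue; // no match or already rewritten: re-runs are no-ops
    report.moved.push({ from, to });
    if (input.dryRun) continue; // dry run only reports the plan

    const bytes = bucket.get(from);
    if (bytes === undefined) throw new Error(`missing source object: ${from}`);
    bucket.set(to, bytes);                                              // 1. copy bytes to the new key
    fo.objectKey = to;                                                  // 2. update the DB pointer
    if (bucket.get(to) !== bytes) throw new Error(`verify failed: ${to}`); // 3. verify
    bucket.delete(from);                                                // 4. delete the old key
    if (fo.scope === "public_media") report.cdnInvalidations.push(from); // 5. queue CDN invalidation
  }
  return report;
}
```

Because the old key is deleted only after the pointer update and verification, a crash between steps leaves at worst a stray copy for reconciliation to clean up, never a dangling pointer.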

2.5 Bucket changes

Moving objects between buckets (e.g. private → archive) uses the same job pattern with two extra steps:

  • The destination bucket's CMEK and lifecycle are validated.
  • The source key is soft-deleted (versioned + lifecycled), not hard-deleted, for 30 d to allow rollback.

3. Bootstrap migrations (M0001 → M0010, MVP)

Initial schema lands in this order. Each migration is a separate file under src/infrastructure/migrations/.

| Migration | Description | Risk | Notes |
| --- | --- | --- | --- |
| M0001_init_schema.sql | Creates file_storage schema; extensions (pgcrypto) | low | Idempotent |
| M0002_buckets_and_retention_policies.sql | buckets, retention_policies; seeds canonical policies | low | Seeded for tenant_id IS NULL |
| M0003_file_objects.sql | Creates file_objects with all CHECKs incl. object_key LIKE 'tenants/...'; RLS | medium | Heaviest table; ensure indexes via CONCURRENTLY if seeding |
| M0004_upload_sessions.sql | upload_sessions table; RLS | low | |
| M0005_variants_scan_results.sql | variants, scan_results; RLS | low | |
| M0006_access_grants_audit.sql | access_grants, audit_events (append-only rules) | medium | Must apply rules in same migration |
| M0007_quotas.sql | quotas; RLS; default cap inserted only by tenant onboarding (handled by tenant-service) | low | |
| M0008_retention_holds_erasure.sql | retention_holds, erasure_requests; RLS | medium | |
| M0009_outbox_inbox_idempotency.sql | outbox, inbox, idempotency_records | low | |
| M0010_seed_jurisdiction_policies.sql | Seeds jurisdiction-specific policies | low | Append-only seed |

GCS-side bootstrap (managed in Terraform, not SQL):

| Step | Resource |
| --- | --- |
| T0001 | Create melmastoon-media-{env}, melmastoon-private-{env}, melmastoon-archive-{env}, melmastoon-quarantine-{env}, melmastoon-uploads-tmp-{env} with appropriate CMEK / lifecycle / IAM |
| T0002 | Create CDN URL map + backend bucket for melmastoon-media-{env} |
| T0003 | Create Pub/Sub topics + subscriptions per DEPLOYMENT_TOPOLOGY §8 |
| T0004 | Create signer SA + grant iam.serviceAccountTokenCreator to api SA |
| T0005 | Create KMS keys (file-storage-cmek, file-storage-erasure-signer) |
| T0006 | Create Memorystore instance |
| T0007 | Create Cloud SQL instance + private IP |

The full bootstrap is encapsulated in infra/terraform/file-storage/. Apply order is enforced by the Terraform dependency graph; there are no manual steps.


4. Phase 2 migrations

M0101 — Quota enforcement (warn → block)

  • Goal: switch quotas.cap_bytes/cap_objects from observability to hard enforcement.
  • Steps:
    1. Expand: add quotas.enforcement_mode TEXT NOT NULL DEFAULT 'warn' (values: 'warn' | 'block').
    2. Seed per-tenant value from tenant-service plan; default 'warn' for incumbents.
    3. Code change behind MELMASTOON_FLAG_QUOTA_ENFORCEMENT=true reads enforcement_mode and rejects on block.
    4. Per-tenant rollout: flip mode to block in batches; monitor uploads.initiated{result='quota_exceeded'}.
    5. Contract: once 100 % of tenants are in block mode, remove the flag.
  • Rollback: flip enforcement_mode back to warn.
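The interaction between the feature flag and enforcement_mode in step 3 can be summarised in a small decision function. This is a sketch under assumptions: `checkQuota`, `QuotaState`, and `QuotaDecision` are hypothetical names, only the byte cap is modelled (cap_objects would follow the same shape), and `flagEnabled` stands for the value of MELMASTOON_FLAG_QUOTA_ENFORCEMENT.

```typescript
type EnforcementMode = "warn" | "block";

interface QuotaState {
  capBytes: number;
  usedBytes: number;
  enforcementMode: EnforcementMode;
}

type QuotaDecision =
  | { allowed: true; warning?: "quota_exceeded" }
  | { allowed: false; reason: "quota_exceeded" };

function checkQuota(quota: QuotaState, uploadBytes: number, flagEnabled: boolean): QuotaDecision {
  const wouldExceed = quota.usedBytes + uploadBytes > quota.capBytes;
  if (!wouldExceed) return { allowed: true };
  // Reject only when the code path is enabled AND the tenant has been flipped to block.
  if (flagEnabled && quota.enforcementMode === "block") {
    return { allowed: false, reason: "quota_exceeded" };
  }
  // Otherwise stay in observability mode: let the upload through but record the breach.
  return { allowed: true, warning: "quota_exceeded" };
}
```

With this shape, the rollback described above (flipping enforcement_mode back to warn) changes behaviour without a deploy, and turning the flag off disables blocking for every tenant at once.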

M0102 — AI alt-text drafting

  • Goal: enable alt-text drafting for property_photo per tenant opt-in.
  • Steps:
    1. Add tenant.ai.altText.enabled flag on tenant-service.
    2. Subscribe optimizer-completion to fan out to alt-text task in orchestrator.
    3. Backfill: run alt-text-backfill-job for tenants opting in (rate-limited; per-tenant budget).
    4. Privacy report endpoint exposes per-tenant counts.
  • Rollback: disable flag; existing drafts remain (they're optional content).

M0103 — Per-tenant CMEK on private

  • Goal: allow enterprise tenants to bring their own CMEK key.
  • Steps:
    1. Add tenant.bucket.cmek_key_resource on tenant-service.
    2. Per-tenant prefix re-encrypted via re-encrypt-job (reads current bytes with current key, writes new bytes encrypted with tenant key into a temporary new key path, then renames).
    3. Strict tenant-by-tenant rollout; admin endpoint POST /admin/storage/cmek-rotate/{tenantId}.
  • Rollback: re-encrypt back to platform key (rare; bytes are not lost).

M0104 — DR replica for melmastoon-private-{env}

  • Goal: cross-region replication for private data class.
  • Steps:
    1. Enable Turbo replication to melmastoon-private-{env}-dr in europe-west4.
    2. Verify lag in dashboards.
    3. Add admin failover endpoint with two-person (four-eyes) approval.
  • Rollback: disable replication; bytes in destination remain readable for 7 d (lifecycle cleanup).

5. Phase 3 migrations

M0201 — Video transcoding

  • New Variant presets hls_360p|hls_720p|hls_1080p; new MIME allowlist for property_video scope; new optimizer pipeline using Transcoder API.
  • Requires ADR; consumers (property-service) must opt-in to the new variants.

M0202 — ME tenancy (me-central1)

  • New project melmastoon-prod-me; replicate Terraform module; tenant-service routes new tenants by jurisdiction. Existing tenants stay in EU.
  • Cross-region transfers require explicit migration job + customer consent.

M0203 — Multi-region private bucket

  • Move from regional → multi-region (europe-west) for private to remove DR window. Requires re-bucketing job + per-tenant verification.

6. One-time data movements

6.1 Re-keying (e.g., adoption of date-sharded keys)

If the date-sharded key shape ({YYYY}/{MM}/{DD}/) ever needs to change (e.g., to {YYYY}/{MM}/{tenantHash}/), use key-rewrite-job per §2.4. The job runs in per-tenant batches and issues CDN invalidations automatically.

6.2 Re-bucketing (e.g., archive cold tier)

To move tax-compliance invoices from private to archive after N years:

job: rebucket-job
inputs: { fromBucket: "melmastoon-private-prod", toBucket: "melmastoon-archive-prod", scope: "invoice_pdf", olderThanDays: 730 }
flow: iterate file_objects → copy → update object_key + bucket_id → soft-delete source → cdn n/a
audit: per-row + summary
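The flow in the job spec above could look roughly like this in code. It is a sketch only: `rebucket`, `StoredFile`, and `RebucketAuditRow` are hypothetical, using `createdAt` as the age reference is an assumption (the spec does not say which timestamp olderThanDays applies to), and the copy and soft-delete steps are reduced to a pointer update on in-memory rows.

```typescript
interface StoredFile {
  id: string;
  bucket: string;
  scope: string;
  objectKey: string;
  createdAt: Date; // assumption: age is measured from creation time
}

interface RebucketInput {
  fromBucket: string;
  toBucket: string;
  scope: string;
  olderThanDays: number;
}

interface RebucketAuditRow {
  kind: "row" | "summary";
  fileId?: string;
  detail: string;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function rebucket(files: StoredFile[], input: RebucketInput, now: Date): RebucketAuditRow[] {
  const cutoff = now.getTime() - input.olderThanDays * MS_PER_DAY;
  const audit: RebucketAuditRow[] = [];
  for (const file of files) {
    const eligible =
      file.bucket === input.fromBucket &&
      file.scope === input.scope &&
      file.createdAt.getTime() < cutoff;
    if (!eligible) continue;
    // A real job would copy the bytes and soft-delete the source here.
    file.bucket = input.toBucket;
    audit.push({ kind: "row", fileId: file.id, detail: `${input.fromBucket} -> ${input.toBucket}` });
  }
  // One summary row closes the run, as the auditability principle requires.
  audit.push({ kind: "summary", detail: `moved ${audit.length} object(s)` });
  return audit;
}
```

Rows already in the destination bucket no longer match fromBucket, so re-running the job after a partial failure picks up only what is left.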

6.3 CDN base URL cutover

If we change the public CDN host (e.g., move from cdn.melmastoon.com to media.melmastoon.com):

  1. Add new URL map; both URLs serve the same backend bucket.
  2. Notify property-service and theme-config-service to update embed URLs.
  3. After ≥ 30 d, retire the old hostname.
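For step 2, consumers that store absolute embed URLs could rewrite them with a helper along these lines. `rewriteCdnHost` is a hypothetical function, shown only to illustrate that the path and query must survive the host swap; it relies on the standard `URL` class.

```typescript
function rewriteCdnHost(url: string, oldHost: string, newHost: string): string {
  const parsed = new URL(url);
  // Leave URLs on any other host untouched so the rewrite is safe to run repeatedly.
  if (parsed.hostname !== oldHost) return url;
  parsed.hostname = newHost;
  return parsed.toString();
}
```

For example, `rewriteCdnHost("https://cdn.melmastoon.com/tenants/t1/a.jpg?w=400", "cdn.melmastoon.com", "media.melmastoon.com")` keeps the path and the `w=400` query while swapping the host.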

6.4 Tenant offboarding (full erasure)

Tenant deletion in tenant-service emits tenant.deleted.v1. Our handler:

  1. Inserts a retention_holds row spanning the regulatory window per scope (e.g., tax_compliance 7 y).
  2. After hold release, EraseByTenantUseCase runs and produces a per-tenant erasure certificate.
  3. The tenant's GCS prefix tenants/{tenantId}/ is deleted by the runner; the prefix itself ceases to exist.
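The ordering of steps 1 and 2 depends on when the longest hold expires. The sketch below shows that calculation; `HOLD_YEARS_BY_SCOPE`, `holdsForTenant`, and `earliestErasure` are hypothetical names, the 7-year tax_compliance window is the only value taken from the text above, and treating scopes without an entry as having no hold is an assumption.

```typescript
// Only tax_compliance comes from this document; other scopes would be added from the retention policies.
const HOLD_YEARS_BY_SCOPE: Record<string, number> = { tax_compliance: 7 };

interface RetentionHoldRow {
  tenantId: string;
  scope: string;
  holdUntil: Date;
}

function holdsForTenant(tenantId: string, scopes: string[], deletedAt: Date): RetentionHoldRow[] {
  const rows: RetentionHoldRow[] = [];
  for (const scope of scopes) {
    const years: number | undefined = HOLD_YEARS_BY_SCOPE[scope];
    if (years === undefined) continue; // assumption: no entry means no regulatory hold
    const holdUntil = new Date(deletedAt.getTime());
    holdUntil.setUTCFullYear(holdUntil.getUTCFullYear() + years);
    rows.push({ tenantId, scope, holdUntil });
  }
  return rows;
}

// Erasure may start once the latest hold has been released.
function earliestErasure(deletedAt: Date, holds: RetentionHoldRow[]): Date {
  return holds.reduce(
    (latest, hold) => (hold.holdUntil.getTime() > latest.getTime() ? hold.holdUntil : latest),
    deletedAt,
  );
}
```

A tenant with no held scopes becomes erasable at deletion time, while a single tax_compliance object pushes the tenant-wide prefix deletion out by the full window.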

7. Cross-service migration coordination

| Change | Coordinated services | Mechanism |
| --- | --- | --- |
| New event field | property-service, billing-service, theme-config-service | additive; Pact provider verification |
| New API field | bff-backoffice, bff-bookingsite | additive; OpenAPI |
| ID prefix added | platform (NAMING.md) | PR to docs repo + this repo in same release train |
| New scope value | tenant-service (quotas), security WG (allowlist), bff (UI) | RFC + ADR |
| Retention policy seed change | compliance, tenant-service (jurisdiction lookup) | seeded migration; consumers re-fetch via cache invalidation |
| AI provider added | ai-orchestrator-service, DPO sign-off | PR + DPIA update |

8. Templates

8.1 Migration template

src/infrastructure/migrations/_templates/MNNNN_<name>.sql:

-- migration: MNNNN_<name>
-- author: <handle>
-- jira: MEL-NNN
-- risk: <low|medium|high>
-- rollback: <link to rollback note>
-- prereqs: <link>

BEGIN;

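-- DDL goes here, idempotent where possible (IF NOT EXISTS / IF EXISTS).
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block; give it its own migration file without BEGIN/COMMIT.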

COMMIT;
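A CI check could enforce the header fields of this template. The following is a sketch of such a check, not an existing tool in this repo: `parseHeader` and `validateHeader` are hypothetical, and the `M` + four digits + lowercase snake_case rule is inferred from the bootstrap migration file names.

```typescript
const REQUIRED_HEADER_KEYS = ["migration", "author", "jira", "risk", "rollback", "prereqs"];
const RISK_LEVELS = ["low", "medium", "high"];

// Reads the leading `-- key: value` comment lines; stops at the first line that is not one.
function parseHeader(sql: string): Map<string, string> {
  const header = new Map<string, string>();
  for (const rawLine of sql.split("\n")) {
    const match = /^--\s*([a-z]+):\s*(.*)$/.exec(rawLine.trim());
    if (match === null) break;
    const [, key = "", value = ""] = match;
    header.set(key, value.trim());
  }
  return header;
}

// Returns a list of human-readable problems; an empty list means the header passes.
function validateHeader(sql: string): string[] {
  const header = parseHeader(sql);
  const problems: string[] = [];
  for (const key of REQUIRED_HEADER_KEYS) {
    if (!header.get(key)) problems.push(`missing header field: ${key}`);
  }
  const id = header.get("migration") ?? "";
  if (id !== "" && !/^M\d{4}_[a-z0-9_]+$/.test(id)) {
    problems.push(`malformed migration id: ${id}`);
  }
  const risk = header.get("risk") ?? "";
  if (risk !== "" && !RISK_LEVELS.includes(risk)) {
    problems.push(`risk must be low, medium or high: ${risk}`);
  }
  return problems;
}
```

Returning all problems at once, rather than failing on the first, lets an author fix a header in one pass.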

8.2 Backfill job template

scripts/backfill/MNNNN_<name>.ts exporting:

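export async function run(opts: { tenantId?: string; batchSize?: number; dryRun?: boolean }): Promise<{ scanned: number; updated: number }> {
  // … resumable, idempotent, observable.
  throw new Error("backfill MNNNN_<name> is not implemented yet"); // placeholder so the template type-checks
}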

8.3 Rollback note template

migrations/rollbacks/MNNNN_<name>.md with:

  • What it changes.
  • How to roll back (DDL or code path or feature flag).
  • Time-to-rollback target.
  • Any data loss caveats.

9. Versioning summary

| Surface | Versioning | Coexistence |
| --- | --- | --- |
| REST | /api/v1/...; /api/v2/... for breaking; deprecation ≥ 180 d | parallel mounts |
| Events | melmastoon.file.<aggregate>.<verb>.v1; .v2 for breaking | parallel topics ≥ 90 d |
| Event payload | additive within .vN; bump schemaVersion | forward-compatible |
| DB | expand → backfill → contract; ≥ 14 d between writes-stop and column-drop | rolling deploys safe |
| Object keys | immutable; copy-then-rewrite for change | two keys briefly; reconciliation cleans up |
| Buckets | new bucket parallel; rebucket-job; soft-delete source 30 d | parallel reads via DB pointer |

10. Acceptance criteria for any migration PR

A migration PR is accepted only if it includes:

  1. Migration file (or Terraform change) following template.
  2. Tests:
    • Migration applies + rolls back on a fresh DB.
    • For data changes: a backfill test on a representative fixture.
    • Pact and OpenAPI snapshot updates if surface changed.
  3. Updated documentation in this repo.
  4. Operator note (runbooks/migrations/MNNNN.md) with:
    • Pre-flight checks.
    • Rollout plan + flag names.
    • Verification queries.
    • Rollback steps.
  5. Sign-off labels:
    • migration:reviewed-by-platform
    • migration:reviewed-by-security (when touching private/archive/CMEK/RLS/signed URLs)
    • migration:reviewed-by-dpo (when touching PII or retention)

A migration that does not have all five is not eligible to merge.
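The sign-off rule in item 5 lends itself to an automated merge gate. The sketch below illustrates the label logic only; `requiredLabels`, `missingLabels`, and the `TouchedArea` values are hypothetical, and how a PR's touched areas are detected is left out.

```typescript
type TouchedArea = "private" | "archive" | "cmek" | "rls" | "signed_urls" | "pii" | "retention";

const SECURITY_AREAS: TouchedArea[] = ["private", "archive", "cmek", "rls", "signed_urls"];
const DPO_AREAS: TouchedArea[] = ["pii", "retention"];

function requiredLabels(touched: TouchedArea[]): string[] {
  // Platform review is always required; the other two depend on what the PR touches.
  const labels = ["migration:reviewed-by-platform"];
  if (touched.some((area) => SECURITY_AREAS.includes(area))) {
    labels.push("migration:reviewed-by-security");
  }
  if (touched.some((area) => DPO_AREAS.includes(area))) {
    labels.push("migration:reviewed-by-dpo");
  }
  return labels;
}

function missingLabels(touched: TouchedArea[], present: string[]): string[] {
  return requiredLabels(touched).filter((label) => !present.includes(label));
}
```

A merge check would fail the PR whenever `missingLabels` returns a non-empty list.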