file-storage-service — MIGRATION_PLAN
Companion: DATA_MODEL · API_CONTRACTS · EVENT_SCHEMAS · SERVICE_READINESS
This document is the playbook for changing the surface of file-storage-service without breaking consumers, downstream services, or stored bytes. It covers (1) general policy, (2) the recurring expand → backfill → contract workflow for DB, events, and APIs, (3) named bootstrap migrations from the current "no service" state to MVP, (4) named upcoming migrations to Phase 2 and Phase 3, and (5) one-time data movements (re-bucketing, re-keying, CDN cutovers).
1. Migration policy
| Principle | Statement |
|---|---|
| Backward compatibility | All released contracts (REST, events, DB) are extended only via additive changes within a major version. Breaking changes require an ADR + a versioned, parallel surface for ≥ 90 d. |
| Expand → backfill → contract | Schema changes ship in stages: (1) expand: add new fields/tables and deploy code that writes both old and new, (2) backfill existing rows, (3) contract: remove old fields/columns. Never combine stages in a single deploy. |
| Tenant safety | Every migration includes a dry-run in staging on a snapshot of prod. Any data path that touches bytes is gated by a feature flag at boot and fails closed. |
| Zero downtime | All deploys are zero-downtime. Long-running data work runs in idempotent, resumable batches (Cloud Run jobs) and is observable on a dedicated dashboard. |
| Reversibility | Every migration ships with a documented rollback. If rollback is impossible (e.g., bytes re-encrypted), it is called out explicitly in the ADR and the readiness sign-off. |
| Auditability | Every bulk data movement writes one audit_events row per affected FileObject, plus a single summary row, signed by the same key family used for erasure certificates. |
Hard invariants (FAILURE_MODES §11) may not be relaxed by a migration without an ADR co-signed by security and platform.
2. Standard migration types
2.1 Database schema (DDL)
Pattern: expand → backfill → contract
- Expand PR:
  - Add nullable column / new table / new index (CREATE INDEX CONCURRENTLY).
  - Update INSERT/UPDATE paths to write the new field; reads still use the old.
  - Migration test ensures it applies in < 60 s on a snapshot the size of prod.
- Backfill PR (or Cloud Run job):
  - Idempotent script in batches of 1 000 rows under a tenant lock; emits progress metric file_storage.migration.backfilled_rows{migration_id}.
- Read switch PR:
  - Reads start using the new column; old field still written for ≥ 14 d.
- Contract PR:
  - Stop writing old field.
  - After ≥ 14 d, drop column (ALTER … DROP COLUMN) in a quiet window.
Expand-only never includes a NOT NULL on a new column without a default; we add default first, then backfill, then SET NOT NULL in a separate PR.
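The backfill stage above can be sketched as a keyset-paginated loop. This is a minimal in-memory sketch under stated assumptions, not the service's actual job: the Row shape, the title/displayName columns, and the function names are hypothetical, and a real job would issue SQL under the tenant lock and emit the progress metric.

```typescript
// Hypothetical row: `title` is the old column, `displayName` the new nullable one.
interface Row {
  id: number;
  title: string;
  displayName: string | null;
}

interface BatchResult {
  scanned: number;
  updated: number;
  lastId: number | null; // cursor to persist so a restarted job can resume
}

// Process one batch of rows with id > afterId, in id order (keyset pagination).
function backfillBatch(rows: Row[], afterId: number, batchSize: number, dryRun: boolean): BatchResult {
  const batch = rows
    .filter((r) => r.id > afterId)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize);
  let updated = 0;
  let lastId: number | null = null;
  for (const row of batch) {
    lastId = row.id;
    // Idempotent: rows that already carry the new value are skipped.
    if (row.displayName === null) {
      if (!dryRun) row.displayName = row.title;
      updated += 1; // in dry-run mode this counts rows that would be updated
    }
  }
  return { scanned: batch.length, updated, lastId };
}

// Drive batches until no rows remain. Assumes ids are positive integers.
function runBackfill(
  rows: Row[],
  opts: { batchSize?: number; dryRun?: boolean; startAfterId?: number } = {},
): { scanned: number; updated: number } {
  const batchSize = opts.batchSize ?? 1000;
  const dryRun = opts.dryRun ?? false;
  let cursor = opts.startAfterId ?? 0;
  let scanned = 0;
  let updated = 0;
  for (;;) {
    const res = backfillBatch(rows, cursor, batchSize, dryRun);
    if (res.lastId === null) break; // nothing left past the cursor
    scanned += res.scanned;
    updated += res.updated;
    cursor = res.lastId;
  }
  return { scanned, updated };
}
```

Re-running the job after a full pass updates nothing, which is the property the contract stage relies on before reads switch to the new column.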
2.2 Event schemas
- Additive optional fields → bump schemaVersion in JSON Schema; same topic version.
- Renames / removals / type changes → publish .v(n+1) topic alongside .v(n). Both run for ≥ 90 d. Producers gated by per-tenant rollout flag. Old topic deprecation announced in contracts/events/DEPRECATIONS.md.
- Pact provider verifications must pass for every consumer at every step.
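The additive-versus-breaking rule above can be written as a small decision function. This is an illustrative sketch only: the SchemaChange categories, the VersionPlan shape, and the function names are assumptions made for the example, and the topic name in the usage note is hypothetical.

```typescript
// Kinds of payload change named in this section.
type SchemaChange = "add_optional_field" | "rename" | "removal" | "type_change";

type VersionPlan =
  | { action: "bump_schema_version"; topic: string }
  | { action: "publish_parallel_topic"; oldTopic: string; newTopic: string; minOverlapDays: number };

const MIN_TOPIC_OVERLAP_DAYS = 90; // both topic versions run for at least this long

// Turn "<name>.vN" into "<name>.v(N+1)"; throws if there is no numeric .vN suffix.
function nextTopicVersion(topic: string): string {
  const idx = topic.lastIndexOf(".v");
  const version = idx >= 0 ? Number(topic.slice(idx + 2)) : NaN;
  if (!Number.isInteger(version) || version < 1) {
    throw new Error(`topic has no .vN suffix: ${topic}`);
  }
  return `${topic.slice(0, idx)}.v${version + 1}`;
}

function planEventChange(topic: string, change: SchemaChange): VersionPlan {
  if (change === "add_optional_field") {
    // Additive: same topic, only schemaVersion moves.
    return { action: "bump_schema_version", topic };
  }
  // Breaking: new topic alongside the old one for the overlap window.
  return {
    action: "publish_parallel_topic",
    oldTopic: topic,
    newTopic: nextTopicVersion(topic),
    minOverlapDays: MIN_TOPIC_OVERLAP_DAYS,
  };
}
```

For example, planEventChange("melmastoon.file.object.uploaded.v1", "rename") plans a parallel .v2 topic, while "add_optional_field" keeps the topic unchanged.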
2.3 REST API
- Additive response fields and optional request fields → no version bump.
- Breaking changes → mount under /api/v2/... while v1 continues; deprecation headers Sunset (RFC 8594) + Deprecation (RFC 9745); minimum overlap ≥ 180 d.
- OpenAPI snapshot test enforces no accidental break.
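A deprecated v1 route could compute its two headers as follows. This is a hedged sketch: the function name and constants are hypothetical, and it assumes the Deprecation value uses the structured-field date form (@ followed by Unix seconds) defined in RFC 9745, while Sunset carries an HTTP-date as RFC 8594 requires.

```typescript
const MIN_OVERLAP_DAYS = 180; // minimum v1/v2 overlap from this section
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Build the headers a deprecated v1 route would attach to every response.
function deprecationHeaders(deprecatedAt: Date, overlapDays: number = MIN_OVERLAP_DAYS): Record<string, string> {
  if (overlapDays < MIN_OVERLAP_DAYS) {
    throw new Error(`overlap must be at least ${MIN_OVERLAP_DAYS} days`);
  }
  const sunsetAt = new Date(deprecatedAt.getTime() + overlapDays * MS_PER_DAY);
  return {
    // Assumption: structured-field date (@<unix seconds>) as in RFC 9745.
    Deprecation: `@${Math.floor(deprecatedAt.getTime() / 1000)}`,
    // RFC 8594: Sunset is an HTTP-date; toUTCString() emits that format.
    Sunset: sunsetAt.toUTCString(),
  };
}
```

Refusing an overlap shorter than 180 d in code keeps the policy from being shortened by a typo in route configuration.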
2.4 GCS object key changes
Object keys are immutable. To change a key:
- Copy bytes to the new key (same bucket; CMEK preserved).
- Update file_objects.object_key in DB.
- Verify via reconciliation job.
- Delete old key.
- Invalidate CDN for any change touching public_media.
This is wrapped by a job key-rewrite-job taking {tenantId, scope, fromPattern, toPattern} and is gated by a per-job feature flag and a dry-run mode.
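The five steps can be sketched for a single object as below. This is an in-memory illustration, not the key-rewrite-job itself: the Map stands in for the bucket, the FileObjectRow shape is hypothetical, and treating public_media as a value of a scope field is an assumption.

```typescript
// Hypothetical stand-ins for the bucket, one file_objects row, and the CDN.
interface FileObjectRow {
  id: string;
  objectKey: string;
  scope: string;
}

interface RewriteEnv {
  bucket: Map<string, string>; // object key → bytes (a string for brevity)
  cdnInvalidations: string[];  // keys submitted for CDN invalidation
}

type RewriteOutcome = "planned" | "rewritten" | "skipped";

function rewriteKey(env: RewriteEnv, row: FileObjectRow, toKey: string, dryRun: boolean): RewriteOutcome {
  const fromKey = row.objectKey;
  if (fromKey === toKey) return "skipped"; // already migrated; safe to re-run
  const bytes = env.bucket.get(fromKey);
  if (bytes === undefined) throw new Error(`source object missing: ${fromKey}`);
  if (dryRun) return "planned"; // dry-run: validate only, change nothing

  env.bucket.set(toKey, bytes);          // 1. copy bytes to the new key
  row.objectKey = toKey;                 // 2. update file_objects.object_key
  if (env.bucket.get(toKey) !== bytes) { // 3. verify before deleting anything
    throw new Error(`verification failed for ${toKey}`);
  }
  env.bucket.delete(fromKey);            // 4. delete the old key
  if (row.scope === "public_media") {    // 5. invalidate CDN (assumed scope check)
    env.cdnInvalidations.push(fromKey);
  }
  return "rewritten";
}
```

The old key is deleted only after the DB pointer has moved and the copy is verified, so a crash mid-run leaves at worst an orphaned old object for reconciliation to clean up, never a dangling pointer.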
2.5 Bucket changes
Moving objects between buckets (e.g. private → archive) uses the same job pattern with two extra steps:
- The destination bucket's CMEK and lifecycle are validated.
- The source key is soft-deleted (versioned + lifecycled), not hard-deleted, for 30 d to allow rollback.
3. Bootstrap migrations (M0001 → M0010, MVP)
Initial schema lands in this order. Each migration is a separate file under src/infrastructure/migrations/.
| Migration | Description | Risk | Notes |
|---|---|---|---|
| M0001_init_schema.sql | Creates file_storage schema; extensions (pgcrypto) | low | Idempotent |
| M0002_buckets_and_retention_policies.sql | buckets, retention_policies; seeds canonical policies | low | Seeded for tenant_id IS NULL |
| M0003_file_objects.sql | Creates file_objects with all CHECKs incl. object_key LIKE 'tenants/...'; RLS | medium | Heaviest table; ensure indexes via CONCURRENTLY if seeding |
| M0004_upload_sessions.sql | upload_sessions table; RLS | low | |
| M0005_variants_scan_results.sql | variants, scan_results; RLS | low | |
| M0006_access_grants_audit.sql | access_grants, audit_events (append-only rules) | medium | Must apply rules in same migration |
| M0007_quotas.sql | quotas; RLS; default cap inserted only by tenant onboarding (handled by tenant-service) | low | |
| M0008_retention_holds_erasure.sql | retention_holds, erasure_requests; RLS | medium | |
| M0009_outbox_inbox_idempotency.sql | outbox, inbox, idempotency_records | low | |
| M0010_seed_jurisdiction_policies.sql | Seeds jurisdiction-specific policies | low | Append-only seed |
GCS-side bootstrap (managed in Terraform, not SQL):
| Step | Resource |
|---|---|
| T0001 | Create melmastoon-media-{env}, melmastoon-private-{env}, melmastoon-archive-{env}, melmastoon-quarantine-{env}, melmastoon-uploads-tmp-{env} with appropriate CMEK / lifecycle / IAM |
| T0002 | Create CDN URL map + backend bucket for melmastoon-media-{env} |
| T0003 | Create Pub/Sub topics + subscriptions per DEPLOYMENT_TOPOLOGY §8 |
| T0004 | Create signer SA + grant iam.serviceAccountTokenCreator to api SA |
| T0005 | Create KMS keys (file-storage-cmek, file-storage-erasure-signer) |
| T0006 | Create Memorystore instance |
| T0007 | Create Cloud SQL instance + private IP |
The full bootstrap is encapsulated in infra/terraform/file-storage/. Apply order is enforced by Terraform graph; manual steps are zero.
4. Phase 2 migrations
M0101 — Quota enforcement (warn → block)
- Goal: switch quotas.cap_bytes / cap_objects from observability to hard enforcement.
- Steps:
  - Expand: add quotas.enforcement_mode TEXT NOT NULL DEFAULT 'warn' (values: 'warn' | 'block').
  - Seed per-tenant value from tenant-service plan; default 'warn' for incumbents.
  - Code change behind MELMASTOON_FLAG_QUOTA_ENFORCEMENT=true reads enforcement_mode and rejects on block.
  - Per-tenant rollout: flip mode to block in batches; monitor uploads.initiated{result='quota_exceeded'}.
  - Contract: when 100 % of tenants in block, drop the flag.
- Rollback: flip enforcement_mode back to warn.
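The interaction between the global flag and the per-tenant mode can be sketched as a pure decision function. The QuotaState shape, the decision names, and the byte-only check are assumptions for illustration; the real use case would also consider cap_objects.

```typescript
type EnforcementMode = "warn" | "block";
type QuotaDecision = "allow" | "allow_with_warning" | "reject";

// Hypothetical view of one tenant's quotas row.
interface QuotaState {
  capBytes: number;
  usedBytes: number;
  enforcementMode: EnforcementMode;
}

// flagEnabled mirrors MELMASTOON_FLAG_QUOTA_ENFORCEMENT=true.
function decideUpload(quota: QuotaState, uploadBytes: number, flagEnabled: boolean): QuotaDecision {
  const withinCap = quota.usedBytes + uploadBytes <= quota.capBytes;
  if (withinCap) return "allow";
  // Over cap: reject only when the flag is on AND this tenant is in block mode.
  if (flagEnabled && quota.enforcementMode === "block") return "reject";
  // Otherwise keep the pre-M0101 behaviour: accept, but surface the overage.
  return "allow_with_warning";
}
```

With this shape, both rollback paths (disabling the flag, or flipping a tenant back to warn) restore the observability-only behaviour without a deploy.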
M0102 — AI alt-text drafting
- Goal: enable alt-text drafting for property_photo per tenant opt-in.
- Steps:
  - Add tenant.ai.altText.enabled flag on tenant-service.
  - Subscribe optimizer-completion to fan out to alt-text task in orchestrator.
  - Backfill: run alt-text-backfill-job for tenants opting in (rate-limited; per-tenant budget).
  - Privacy report endpoint exposes per-tenant counts.
- Rollback: disable flag; existing drafts remain (they're optional content).
M0103 — Per-tenant CMEK on private
- Goal: allow enterprise tenants to bring their own CMEK key.
- Steps:
  - Add tenant.bucket.cmek_key_resource on tenant-service.
  - Per-tenant prefix re-encrypted via re-encrypt-job (reads current bytes with current key, writes new bytes encrypted with tenant key into a temporary new key path, then renames).
  - Strict tenant-by-tenant rollout; admin endpoint POST /admin/storage/cmek-rotate/{tenantId}.
- Rollback: re-encrypt back to platform key (rare; bytes are not lost).
M0104 — DR replica for melmastoon-private-{env}
- Goal: cross-region replication for private data class.
- Steps:
  - Enable Turbo replication to melmastoon-private-{env}-dr in europe-west4.
  - Verify lag in dashboards.
  - Add admin failover endpoint with 2-eyes approval.
- Rollback: disable replication; bytes in destination remain readable for 7 d (lifecycle cleanup).
5. Phase 3 migrations
M0201 — Video transcoding
- New Variant presets hls_360p | hls_720p | hls_1080p; new MIME allowlist for property_video scope; new optimizer pipeline using Transcoder API.
- Requires ADR; consumers (property-service) must opt in to the new variants.
M0202 — ME tenancy (me-central1)
- New project melmastoon-prod-me; replicate Terraform module; tenant-service routes new tenants by jurisdiction. Existing tenants stay in EU.
- Cross-region transfers require explicit migration job + customer consent.
M0203 — Multi-region private bucket
- Move from regional → multi-region (europe-west) for private to remove DR window. Requires re-bucketing job + per-tenant verification.
6. One-time data movements
6.1 Re-keying (e.g., adoption of date-sharded keys)
If the date-sharded key shape ({YYYY}/{MM}/{DD}/) ever needs to change (e.g., to {YYYY}/{MM}/{tenantHash}/), use key-rewrite-job per §2.4. Per-tenant batched; CDN invalidations automatic.
6.2 Re-bucketing (e.g., archive cold tier)
To move tax-compliance invoices from private to archive after N years:
job: rebucket-job
inputs: { fromBucket: "melmastoon-private-prod", toBucket: "melmastoon-archive-prod", scope: "invoice_pdf", olderThanDays: 730 }
flow: iterate file_objects → copy → update object_key + bucket_id → soft-delete source → cdn n/a
audit: per-row + summary
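The selection step of rebucket-job (the "iterate file_objects" part of the flow) could look like the sketch below. It assumes age is measured from a createdAt timestamp on each row; that column name, the ArchivableFile shape, and the function name are hypothetical.

```typescript
// Hypothetical projection of a file_objects row.
interface ArchivableFile {
  id: string;
  bucket: string;
  scope: string;
  createdAt: Date;
}

// Mirrors the job inputs shown above.
interface RebucketInputs {
  fromBucket: string;
  toBucket: string;
  scope: string;
  olderThanDays: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the rows the job would move: right bucket, right scope, older than the cutoff.
function selectForRebucket(files: ArchivableFile[], inputs: RebucketInputs, now: Date): ArchivableFile[] {
  const cutoff = now.getTime() - inputs.olderThanDays * DAY_MS;
  return files.filter(
    (f) => f.bucket === inputs.fromBucket && f.scope === inputs.scope && f.createdAt.getTime() < cutoff,
  );
}
```

Rows already living in toBucket never match fromBucket, so re-running the job after a partial pass picks up only what is left.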
6.3 CDN base URL cutover
If we change the public CDN host (e.g., move from cdn.melmastoon.com to media.melmastoon.com):
- Add new URL map; both URLs serve the same backend bucket.
- Notify property-service and theme-config-service to update embed URLs.
- After ≥ 30 d, retire the old hostname.
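On the consumer side, updating stored embed URLs during the overlap could be as small as the helper below. This is only a sketch of what property-service or theme-config-service might do; the function name and the example path are hypothetical.

```typescript
// Swap the hostname of a CDN URL, leaving path, query, and unrelated hosts untouched.
function rewriteCdnHost(url: string, oldHost: string, newHost: string): string {
  const parsed = new URL(url);
  if (parsed.hostname !== oldHost) return url; // not one of ours: leave as-is
  parsed.hostname = newHost;
  return parsed.toString();
}
```

For example, rewriteCdnHost("https://cdn.melmastoon.com/tenants/t1/a.jpg?w=200", "cdn.melmastoon.com", "media.melmastoon.com") keeps the path and query and changes only the host. Because both hostnames serve the same backend bucket during the overlap, consumers can apply the rewrite at their own pace.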
6.4 Tenant offboarding (full erasure)
Tenant deletion in tenant-service triggers tenant.deleted.v1. Our handler:
- Inserts a retention_holds row spanning the regulatory window per scope (e.g., tax_compliance 7 y).
- After hold release, EraseByTenantUseCase runs and produces a per-tenant erasure certificate.
- The tenant's GCS prefix tenants/{tenantId}/ is deleted by the runner; the prefix itself ceases to exist.
7. Cross-service migration coordination
| Change | Coordinated services | Mechanism |
|---|---|---|
| New event field | property-service, billing-service, theme-config-service | additive; Pact provider verification |
| New API field | bff-backoffice, bff-bookingsite | additive; OpenAPI |
| ID prefix added | platform (NAMING.md) | PR to docs repo + this repo in same release train |
| New scope value | tenant-service (quotas), security WG (allowlist), bff (UI) | RFC + ADR |
| Retention policy seed change | compliance, tenant-service (jurisdiction lookup) | seeded migration; consumers re-fetch via cache invalidation |
| AI provider added | ai-orchestrator-service, DPO sign-off | PR + DPIA update |
8. Templates
8.1 Migration template
src/infrastructure/migrations/_templates/MNNNN_<name>.sql:
-- migration: MNNNN_<name>
-- author: <handle>
-- jira: MEL-NNN
-- risk: <low|medium|high>
-- rollback: <link to rollback note>
-- prereqs: <link>
BEGIN;
-- DDL goes here, idempotent where possible (IF NOT EXISTS / IF EXISTS).
COMMIT;
8.2 Backfill job template
scripts/backfill/MNNNN_<name>.ts exporting:
export async function run(opts: { tenantId?: string; batchSize?: number; dryRun?: boolean }): Promise<{ scanned: number; updated: number }> {
// … resumable, idempotent, observable.
}
8.3 Rollback note template
migrations/rollbacks/MNNNN_<name>.md with:
- What it changes.
- How to roll back (DDL or code path or feature flag).
- Time-to-rollback target.
- Any data loss caveats.
9. Versioning summary
| Surface | Versioning | Coexistence |
|---|---|---|
| REST | /api/v1/... → /api/v2/... for breaking; deprecation ≥ 180 d | parallel mounts |
| Events | melmastoon.file.<aggregate>.<verb>.v1 → .v2 for breaking | parallel topics ≥ 90 d |
| Event payload | additive within .vN; bump schemaVersion | forward-compatible |
| DB | expand → backfill → contract; ≥ 14 d between writes-stop and column-drop | rolling deploys safe |
| Object keys | immutable; copy-then-rewrite for change | two keys briefly; reconciliation cleans up |
| Buckets | new bucket parallel; rebucket-job; soft-delete source 30 d | parallel reads via DB pointer |
10. Acceptance criteria for any migration PR
A migration PR is accepted only if it includes:
- Migration file (or Terraform change) following template.
- Tests:
- Migration applies + rolls back on a fresh DB.
- For data changes: a backfill test on a representative fixture.
- Pact and OpenAPI snapshot updates if surface changed.
- Updated documentation in this repo:
- This file (a row in §3/§4/§5 if it's a named migration; otherwise §2).
- SERVICE_READINESS if it lifts a checklist item.
- SERVICE_RISK_REGISTER if it changes residual risk.
- Operator note (runbooks/migrations/MNNNN.md) with:
  - Pre-flight checks.
  - Rollout plan + flag names.
  - Verification queries.
  - Rollback steps.
- Sign-off labels:
  - migration:reviewed-by-platform
  - migration:reviewed-by-security (when touching private / archive / CMEK / RLS / signed URLs)
  - migration:reviewed-by-dpo (when touching PII or retention)
A migration that does not have all five is not eligible to merge.
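The sign-off label rule lends itself to a merge check. The sketch below is one way a CI step might compute missing labels; the TouchedArea names and both functions are hypothetical, and how a PR's touched areas are detected is out of scope here.

```typescript
// Areas a migration PR may touch, as listed next to each conditional label.
type TouchedArea = "private" | "archive" | "cmek" | "rls" | "signed_urls" | "pii" | "retention";

const SECURITY_AREAS = new Set<TouchedArea>(["private", "archive", "cmek", "rls", "signed_urls"]);
const DPO_AREAS = new Set<TouchedArea>(["pii", "retention"]);

// Labels a PR must carry given what it touches.
function requiredLabels(touched: TouchedArea[]): string[] {
  const labels = ["migration:reviewed-by-platform"]; // always required
  if (touched.some((a) => SECURITY_AREAS.has(a))) labels.push("migration:reviewed-by-security");
  if (touched.some((a) => DPO_AREAS.has(a))) labels.push("migration:reviewed-by-dpo");
  return labels;
}

// Required labels that are not yet present on the PR.
function missingLabels(touched: TouchedArea[], present: string[]): string[] {
  return requiredLabels(touched).filter((label) => present.indexOf(label) === -1);
}
```

A check like this would block the merge whenever missingLabels returns a non-empty list.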