ai-orchestrator-service — Migration Plan

Companion to: DATA_MODEL.md · docs/standards/SERVICE_TEMPLATE.md

1. Migration tooling

Flyway for SQL migrations against Cloud SQL Postgres 16 + pgvector.
Migrations live in services/ai-orchestrator-service/src/db/migrations/V<N>__<name>.sql.
Forward-only; no down scripts. Rollback is via deploy-revert + restore-from-PITR for data accidents.
CI runs the full migration set against an empty DB and against a snapshot of staging on every PR; both must succeed.
Production migrations are applied by a one-shot Cloud Run Job before the corresponding service revision rolls out.

2. Backwards-compatibility rules

Every migration must be forward + backward compatible with the previous service revision so a rolling deploy can pass through a brief mixed state:

Change	Allowed in a single migration?
Add nullable column	yes
Add column with default	yes (ensure default is fast: literal, not function)
Add index `CONCURRENTLY`	yes
Add table	yes
Drop column	only after a release that stopped reading + writing it
Rename column	no — use add+backfill+drop in 3 releases
Tighten NOT NULL	yes if backfilled in the same migration AND prior release tolerated null reads
Add CHECK constraint	yes if validated NOT VALID, then VALIDATED in a later migration
Drop table	only after a release that stopped reading + writing it

3. Initial migration set (V1..V18)

File	Purpose
`V1__extensions.sql`	`CREATE EXTENSION IF NOT EXISTS pgcrypto, vector;`
`V2__capabilities_and_overrides.sql`	`capabilities`, `tenant_capability_overrides`
`V3__prompts.sql`	`prompts`, `prompt_versions`, `ab_assignments`
`V4__models_and_providers.sql`	`models`, `model_deployments`, `providers`
`V5__inference_audit.sql`	`inference_requests`, `inference_results`, `provenances`
`V6__budget.sql`	`budget_counters`
`V7__hitl.sql`	`hitl_gates`, `hitl_decisions`
`V8__eval.sql`	`eval_suites`, `eval_runs`
`V9__rag_corpora_and_chunks.sql`	`rag_corpora`, `rag_chunks`
`V10__embeddings_cloud.sql`	`embeddings_cloud` (vector(768)) + HNSW
`V11__embeddings_edge.sql`	`embeddings_edge` (vector(384)) + HNSW
`V12__edge_manifest.sql`	`edge_model_manifests`
`V13__moderation_audit.sql`	`moderation_audit`
`V14__prompt_cache_audit.sql`	`prompt_cache_audit`
`V15__outbox_inbox.sql`	`outbox`, `inbox`
`V16__rls_policies.sql`	enable RLS + per-table tenant policies
`V17__app_role_grants.sql`	`app_user`, `app_admin` roles + grants
`V18__seed_platform_capabilities.sql`	seed the platform capabilities (idempotent, `ON CONFLICT DO NOTHING`)

4. Release-train cadence

A release train cuts every 2 weeks.
Each train carries at most 5 schema migrations (smaller is better) to keep the deploy-job runtime under 60 s.
Migrations heavier than 60 s (large backfills, index builds on huge tables) move to multi-release plans (see §5).

5. Multi-release plans for large changes

For changes that can't fit in a single migration safely, the plan is documented in docs/migrations/AI-<n>-<slug>.md and tracked in this file. Active plans:

MP-AI-001: Embeddings 768 → 1024 dim (planned)

We expect to upgrade the cloud embedder to text-embedding-005 (1024 dim) in Q3.

Release	Action
R+0	Add `embeddings_cloud_1024` table with `vector(1024)`; dual-write from the embedding pipeline; reads still use `embeddings_cloud`
R+1	Reindex backfill job populates `embeddings_cloud_1024` from raw chunks (1 corpus per night); RAG query begins reading from new table when `corpus.embedding_dim=1024`
R+2	Cut all corpora over (per-tenant feature flag); deprecate the 768 table for new corpora
R+3	Drop `embeddings_cloud` after 30 days of zero reads (verified via metric)

MP-AI-002: HITL gate `policy_key` becomes ENUM-backed (planned)

Release	Action
R+0	Add `hitl_policies` table (lookup) with seed
R+1	Add nullable `policy_id` FK on `hitl_gates`; backfill from `policy_key`
R+2	Make `policy_id` NOT NULL; stop writing `policy_key`
R+3	Drop `policy_key` column

6. Data backfills

Backfills run as Cloud Run Jobs with the following discipline:

Always idempotent (UPSERT or filter WHERE col IS NULL).
Chunked by tenant_id to limit blast radius and to honour RLS context per chunk.
Throttled (≤ 200 rows/s by default; tunable per-job env var).
Emit a progress metric melmastoon.ai_orchestrator.backfill.processed_total{job_id, tenant_id}.
Resumable via --since-id cursor.

7. Rollback strategy

Class	Rollback
Code-only	Cloud Run revert to previous traffic split; no DB action
Additive schema (new col / table / index)	Code revert; new objects remain in DB harmlessly
Backfill	Stop the job; data correctness verified via spot queries
Destructive schema (drop / rename)	NOT performed live — multi-release plan ensures no destructive change happens during a single deploy
Catastrophic data corruption	Restore Cloud SQL PITR to a point before the bad migration; replay outbox from the corresponding cursor (the outbox table is in the restore so consumers see the same events again — consumers are idempotent)

8. Verification checklist (per release)

pnpm migrate succeeds against an empty DB.
pnpm migrate succeeds against a snapshot of staging.
No migration takes > 60 s on staging snapshot.
All RLS policies still pass IT-AI-007.
OpenAPI diff reviewed — no breaking changes inside a major version.
Event registry diff reviewed — no breaking changes inside a major version.
BigQuery ai_calls_fact schema unchanged or evolved with documented mapping.

1. Migration tooling​

2. Backwards-compatibility rules​

3. Initial migration set (V1..V18)​

4. Release-train cadence​

5. Multi-release plans for large changes​

MP-AI-001: Embeddings 768 → 1024 dim (planned)​

MP-AI-002: HITL gate policy_key becomes ENUM-backed (planned)​

6. Data backfills​

7. Rollback strategy​

8. Verification checklist (per release)​