ai-orchestrator-service — Migration Plan
Companion to:
DATA_MODEL.md·docs/standards/SERVICE_TEMPLATE.md
1. Migration tooling
- Flyway for SQL migrations against Cloud SQL Postgres 16 + pgvector.
- Migrations live in
services/ai-orchestrator-service/src/db/migrations/V<N>__<name>.sql. - Forward-only; no
downscripts. Rollback is via deploy-revert + restore-from-PITR for data accidents. - CI runs the full migration set against an empty DB and against a snapshot of staging on every PR; both must succeed.
- Production migrations are applied by a one-shot Cloud Run Job before the corresponding service revision rolls out.
2. Backwards-compatibility rules
Every migration must be forward + backward compatible with the previous service revision so a rolling deploy can pass through a brief mixed state:
| Change | Allowed in a single migration? |
|---|---|
| Add nullable column | yes |
| Add column with default | yes (ensure default is fast: literal, not function) |
Add index CONCURRENTLY | yes |
| Add table | yes |
| Drop column | only after a release that stopped reading + writing it |
| Rename column | no — use add+backfill+drop in 3 releases |
| Tighten NOT NULL | yes if backfilled in the same migration AND prior release tolerated null reads |
| Add CHECK constraint | yes if validated NOT VALID, then VALIDATED in a later migration |
| Drop table | only after a release that stopped reading + writing it |
3. Initial migration set (V1..V18)
| File | Purpose |
|---|---|
V1__extensions.sql | CREATE EXTENSION IF NOT EXISTS pgcrypto, vector; |
V2__capabilities_and_overrides.sql | capabilities, tenant_capability_overrides |
V3__prompts.sql | prompts, prompt_versions, ab_assignments |
V4__models_and_providers.sql | models, model_deployments, providers |
V5__inference_audit.sql | inference_requests, inference_results, provenances |
V6__budget.sql | budget_counters |
V7__hitl.sql | hitl_gates, hitl_decisions |
V8__eval.sql | eval_suites, eval_runs |
V9__rag_corpora_and_chunks.sql | rag_corpora, rag_chunks |
V10__embeddings_cloud.sql | embeddings_cloud (vector(768)) + HNSW |
V11__embeddings_edge.sql | embeddings_edge (vector(384)) + HNSW |
V12__edge_manifest.sql | edge_model_manifests |
V13__moderation_audit.sql | moderation_audit |
V14__prompt_cache_audit.sql | prompt_cache_audit |
V15__outbox_inbox.sql | outbox, inbox |
V16__rls_policies.sql | enable RLS + per-table tenant policies |
V17__app_role_grants.sql | app_user, app_admin roles + grants |
V18__seed_platform_capabilities.sql | seed the platform capabilities (idempotent, ON CONFLICT DO NOTHING) |
4. Release-train cadence
- A release train cuts every 2 weeks.
- Each train carries at most 5 schema migrations (smaller is better) to keep the deploy-job runtime under 60 s.
- Migrations heavier than 60 s (large backfills, index builds on huge tables) move to multi-release plans (see §5).
5. Multi-release plans for large changes
For changes that can't fit in a single migration safely, the plan is documented in docs/migrations/AI-<n>-<slug>.md and tracked in this file. Active plans:
MP-AI-001: Embeddings 768 → 1024 dim (planned)
We expect to upgrade the cloud embedder to text-embedding-005 (1024 dim) in Q3.
| Release | Action |
|---|---|
| R+0 | Add embeddings_cloud_1024 table with vector(1024); dual-write from the embedding pipeline; reads still use embeddings_cloud |
| R+1 | Reindex backfill job populates embeddings_cloud_1024 from raw chunks (1 corpus per night); RAG query begins reading from new table when corpus.embedding_dim=1024 |
| R+2 | Cut all corpora over (per-tenant feature flag); deprecate the 768 table for new corpora |
| R+3 | Drop embeddings_cloud after 30 days of zero reads (verified via metric) |
MP-AI-002: HITL gate policy_key becomes ENUM-backed (planned)
| Release | Action |
|---|---|
| R+0 | Add hitl_policies table (lookup) with seed |
| R+1 | Add nullable policy_id FK on hitl_gates; backfill from policy_key |
| R+2 | Make policy_id NOT NULL; stop writing policy_key |
| R+3 | Drop policy_key column |
6. Data backfills
Backfills run as Cloud Run Jobs with the following discipline:
- Always idempotent (
UPSERTor filterWHERE col IS NULL). - Chunked by
tenant_idto limit blast radius and to honour RLS context per chunk. - Throttled (≤ 200 rows/s by default; tunable per-job env var).
- Emit a progress metric
melmastoon.ai_orchestrator.backfill.processed_total{job_id, tenant_id}. - Resumable via
--since-idcursor.
7. Rollback strategy
| Class | Rollback |
|---|---|
| Code-only | Cloud Run revert to previous traffic split; no DB action |
| Additive schema (new col / table / index) | Code revert; new objects remain in DB harmlessly |
| Backfill | Stop the job; data correctness verified via spot queries |
| Destructive schema (drop / rename) | NOT performed live — multi-release plan ensures no destructive change happens during a single deploy |
| Catastrophic data corruption | Restore Cloud SQL PITR to a point before the bad migration; replay outbox from the corresponding cursor (the outbox table is in the restore so consumers see the same events again — consumers are idempotent) |
8. Verification checklist (per release)
-
pnpm migratesucceeds against an empty DB. -
pnpm migratesucceeds against a snapshot of staging. - No migration takes > 60 s on staging snapshot.
- All RLS policies still pass
IT-AI-007. - OpenAPI diff reviewed — no breaking changes inside a major version.
- Event registry diff reviewed — no breaking changes inside a major version.
- BigQuery
ai_calls_factschema unchanged or evolved with documented mapping.