Skip to main content

ai-orchestrator-service — Migration Plan

Companion to: DATA_MODEL.md · docs/standards/SERVICE_TEMPLATE.md

1. Migration tooling

  • Flyway for SQL migrations against Cloud SQL Postgres 16 + pgvector.
  • Migrations live in services/ai-orchestrator-service/src/db/migrations/V<N>__<name>.sql.
  • Forward-only; no down scripts. Rollback is via deploy-revert + restore-from-PITR for data accidents.
  • CI runs the full migration set against an empty DB and against a snapshot of staging on every PR; both must succeed.
  • Production migrations are applied by a one-shot Cloud Run Job before the corresponding service revision rolls out.

2. Backwards-compatibility rules

Every migration must be forward + backward compatible with the previous service revision so a rolling deploy can pass through a brief mixed state:

ChangeAllowed in a single migration?
Add nullable columnyes
Add column with defaultyes (ensure default is fast: literal, not function)
Add index CONCURRENTLYyes
Add tableyes
Drop columnonly after a release that stopped reading + writing it
Rename columnno — use add+backfill+drop in 3 releases
Tighten NOT NULLyes if backfilled in the same migration AND prior release tolerated null reads
Add CHECK constraintyes if validated NOT VALID, then VALIDATED in a later migration
Drop tableonly after a release that stopped reading + writing it

3. Initial migration set (V1..V18)

FilePurpose
V1__extensions.sqlCREATE EXTENSION IF NOT EXISTS pgcrypto, vector;
V2__capabilities_and_overrides.sqlcapabilities, tenant_capability_overrides
V3__prompts.sqlprompts, prompt_versions, ab_assignments
V4__models_and_providers.sqlmodels, model_deployments, providers
V5__inference_audit.sqlinference_requests, inference_results, provenances
V6__budget.sqlbudget_counters
V7__hitl.sqlhitl_gates, hitl_decisions
V8__eval.sqleval_suites, eval_runs
V9__rag_corpora_and_chunks.sqlrag_corpora, rag_chunks
V10__embeddings_cloud.sqlembeddings_cloud (vector(768)) + HNSW
V11__embeddings_edge.sqlembeddings_edge (vector(384)) + HNSW
V12__edge_manifest.sqledge_model_manifests
V13__moderation_audit.sqlmoderation_audit
V14__prompt_cache_audit.sqlprompt_cache_audit
V15__outbox_inbox.sqloutbox, inbox
V16__rls_policies.sqlenable RLS + per-table tenant policies
V17__app_role_grants.sqlapp_user, app_admin roles + grants
V18__seed_platform_capabilities.sqlseed the platform capabilities (idempotent, ON CONFLICT DO NOTHING)

4. Release-train cadence

  • A release train cuts every 2 weeks.
  • Each train carries at most 5 schema migrations (smaller is better) to keep the deploy-job runtime under 60 s.
  • Migrations heavier than 60 s (large backfills, index builds on huge tables) move to multi-release plans (see §5).

5. Multi-release plans for large changes

For changes that can't fit in a single migration safely, the plan is documented in docs/migrations/AI-<n>-<slug>.md and tracked in this file. Active plans:

MP-AI-001: Embeddings 768 → 1024 dim (planned)

We expect to upgrade the cloud embedder to text-embedding-005 (1024 dim) in Q3.

ReleaseAction
R+0Add embeddings_cloud_1024 table with vector(1024); dual-write from the embedding pipeline; reads still use embeddings_cloud
R+1Reindex backfill job populates embeddings_cloud_1024 from raw chunks (1 corpus per night); RAG query begins reading from new table when corpus.embedding_dim=1024
R+2Cut all corpora over (per-tenant feature flag); deprecate the 768 table for new corpora
R+3Drop embeddings_cloud after 30 days of zero reads (verified via metric)

MP-AI-002: HITL gate policy_key becomes ENUM-backed (planned)

ReleaseAction
R+0Add hitl_policies table (lookup) with seed
R+1Add nullable policy_id FK on hitl_gates; backfill from policy_key
R+2Make policy_id NOT NULL; stop writing policy_key
R+3Drop policy_key column

6. Data backfills

Backfills run as Cloud Run Jobs with the following discipline:

  • Always idempotent (UPSERT or filter WHERE col IS NULL).
  • Chunked by tenant_id to limit blast radius and to honour RLS context per chunk.
  • Throttled (≤ 200 rows/s by default; tunable per-job env var).
  • Emit a progress metric melmastoon.ai_orchestrator.backfill.processed_total{job_id, tenant_id}.
  • Resumable via --since-id cursor.

7. Rollback strategy

ClassRollback
Code-onlyCloud Run revert to previous traffic split; no DB action
Additive schema (new col / table / index)Code revert; new objects remain in DB harmlessly
BackfillStop the job; data correctness verified via spot queries
Destructive schema (drop / rename)NOT performed live — multi-release plan ensures no destructive change happens during a single deploy
Catastrophic data corruptionRestore Cloud SQL PITR to a point before the bad migration; replay outbox from the corresponding cursor (the outbox table is in the restore so consumers see the same events again — consumers are idempotent)

8. Verification checklist (per release)

  • pnpm migrate succeeds against an empty DB.
  • pnpm migrate succeeds against a snapshot of staging.
  • No migration takes > 60 s on staging snapshot.
  • All RLS policies still pass IT-AI-007.
  • OpenAPI diff reviewed — no breaking changes inside a major version.
  • Event registry diff reviewed — no breaking changes inside a major version.
  • BigQuery ai_calls_fact schema unchanged or evolved with documented mapping.