MIGRATION_PLAN — theme-config-service
Sibling: DATA_MODEL · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS
Platform anchors:
docs/standards/SERVICE_TEMPLATE.md
This document covers two distinct concerns:
- Schema migration policy: how we evolve the theme_config Postgres schema safely (zero-downtime, backwards-compatible).
- Initial data migration: how the platform bootstraps from "no themes" to "every existing tenant has a published theme" at first launch and at chain-onboarding events.
1. Schema migration policy
1.1 Tooling
- Drizzle Kit for migration generation (pnpm run db:migrate:generate).
- Migrations stored in services/theme-config-service/src/infrastructure/persistence/migrations/.
- Tracked in theme_config.__drizzle_migrations.
- Execution role: theme_config_migrator (DDL + minimal DML; no row-level read).
- Applied via a dedicated Cloud Run Job triggered by the deployment pipeline before the new container revision rolls out.
1.2 Expand-then-contract policy (mandatory)
Any change that would otherwise be breaking is split into two deployments:
Deployment N:
1. Add new schema element (column / table / index / constraint variant) — backwards compatible.
2. Code reads from BOTH old and new; writes to BOTH (dual-write) when applicable.
Deployment N+1 (after the data backfill completes):
3. Code reads/writes from new only.
4. Drop old schema element.
Backfills run as Cloud Run Jobs, batched with tenant_id partitioning, and tracked with progress checkpoints in a migration_checkpoints table.
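The dual-read/dual-write window in Deployment N can be sketched in TypeScript as follows. This is a minimal illustration, assuming a hypothetical column rename from logo_url to brand_logo_url; the row shape and helper names are invented for the example and are not the service's actual code.

```typescript
// Hypothetical row shape during the expand phase: both columns exist.
interface ThemeRow {
  logo_url: string | null;       // old column (assumed name)
  brand_logo_url: string | null; // new column (assumed name)
}

// Deployment N read path: prefer the new column, fall back to the old one
// for rows the backfill has not reached yet.
function readLogoUrl(row: ThemeRow): string | null {
  return row.brand_logo_url ?? row.logo_url;
}

// Deployment N write path: dual-write so either code version sees the value.
function writeLogoUrl(value: string | null): ThemeRow {
  return { logo_url: value, brand_logo_url: value };
}

// Backfill step for one row: copy old into new only when new is still empty,
// which keeps the operation idempotent.
function backfillRow(row: ThemeRow): ThemeRow {
  if (row.brand_logo_url !== null) return row;
  return { ...row, brand_logo_url: row.logo_url };
}
```

Once every row has passed through backfillRow, Deployment N+1 can read and write brand_logo_url alone and drop the old column.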
1.3 Reviewed migration types
| Change type | Strategy | Lock impact |
|---|---|---|
| Add nullable column | one deployment, backfill optional | metadata only |
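| Add NOT NULL column with default | add nullable → backfill → set NOT NULL across two deployments | metadata only for the add (a constant DEFAULT does not rewrite the table on PG 11+, including PG 16); SET NOT NULL scans the table unless a validated CHECK (col IS NOT NULL) constraint already exists |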
| Add index | CREATE INDEX CONCURRENTLY (must run outside a transaction block) | does not block reads or writes |
| Drop column | mark deprecated → stop writes → drop in next deployment | metadata only |
| Rename column | add new col → dual-write → backfill → cut reads → drop old | requires expand-then-contract |
| Add CHECK constraint | ALTER ... ADD CONSTRAINT ... NOT VALID then VALIDATE CONSTRAINT | full table scan during validate; reads and writes continue |
| Change column type (compatible) | add new col → backfill → cut over | requires expand-then-contract |
| Add table | one deployment | none |
| Drop table | confirm zero readers/writers via observability for ≥ 7 d → drop | none |
| Add RLS policy | always before adding the table to the app's reader/writer set | none |
| Migrate JSONB shape | introduce shape v2; application reads both; migrate row-by-row in a job; cut reads; remove v1 reader | none |
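The "application reads both" step of a JSONB shape migration can be sketched as a normalising reader. Both shapes below are hypothetical (the real JSONB layout is defined in DATA_MODEL); the point is only that callers always receive v2 while v1 rows still exist.

```typescript
// Both shapes are hypothetical placeholders for illustration.
interface PaletteV1 { primaryColor: string; secondaryColor: string }
interface PaletteV2 { version: 2; colors: { primary: string; secondary: string } }
type StoredPalette = PaletteV1 | PaletteV2;

// Type guard: v2 rows carry an explicit version tag, v1 rows do not.
function isV2(p: StoredPalette): p is PaletteV2 {
  return "version" in p;
}

// Reader used while both shapes coexist: always hand v2 to the rest of the code.
function readPalette(stored: StoredPalette): PaletteV2 {
  if (isV2(stored)) return stored;
  return {
    version: 2,
    colors: { primary: stored.primaryColor, secondary: stored.secondaryColor },
  };
}
```

The row-by-row migration job can reuse readPalette and write its result back, so the v1 branch becomes dead code that is safe to remove once no v1 rows remain.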
1.4 CI gates on migrations
- pnpm run migrate:lint enforces:
  - No DROP COLUMN without an explicit -- @safe-drop annotation referencing the prior expand deployment.
  - No ALTER TABLE ... ADD COLUMN ... NOT NULL without a DEFAULT.
  - No CREATE INDEX without CONCURRENTLY.
  - All new tables have RLS policies if they carry tenant_id.
- pnpm run migrate:diff against the staging DB produces a human-readable plan attached to the PR.
- Two-person approval required for any migration touching themes, theme_versions, theme_publications.
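A rough idea of what the first three lint rules might look like, written as a regex-based check over one migration file. This is a sketch under stated assumptions, not the actual migrate:lint implementation: lintMigration is a hypothetical name, the check does not parse SQL, it assumes the @safe-drop comment sits directly above the statement it annotates, and the RLS rule is omitted because it needs schema knowledge.

```typescript
// Returns human-readable violations for one migration file.
// Deliberately naive: pattern matching only, no SQL parsing.
function lintMigration(sql: string): string[] {
  const violations: string[] = [];
  // Splitting on ';' is adequate for simple DDL files whose string
  // literals and comments contain no semicolons.
  const statements = sql
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  for (const stmt of statements) {
    if (/\bDROP\s+COLUMN\b/i.test(stmt) && !/--\s*@safe-drop\b/.test(stmt)) {
      violations.push("DROP COLUMN without -- @safe-drop annotation");
    }
    if (
      /\bADD\s+COLUMN\b/i.test(stmt) &&
      /\bNOT\s+NULL\b/i.test(stmt) &&
      !/\bDEFAULT\b/i.test(stmt)
    ) {
      violations.push("ADD COLUMN ... NOT NULL without DEFAULT");
    }
    if (
      /\bCREATE\s+(UNIQUE\s+)?INDEX\b/i.test(stmt) &&
      !/\bINDEX\s+CONCURRENTLY\b/i.test(stmt)
    ) {
      violations.push("CREATE INDEX without CONCURRENTLY");
    }
  }
  return violations;
}
```

A CI step could fail the build whenever the returned array is non-empty and print the entries in the PR check output.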
1.5 Rollback
- Every migration MUST ship with a tested down migration generated alongside the up.
- Down migrations are exercised in CI on the local stack (pnpm run db:migrate:down).
- In production, rolling back a deployment does not automatically run the down migration; the on-call decides per the deployment runbook (often safer to roll the schema forward and patch in code).
1.6 Data backfill jobs
- Implemented as Cloud Run Jobs in src/infrastructure/migrations/backfills/.
- Idempotent and resumable (track last-processed (tenant_id, primary_key) checkpoint).
- Throttled to keep DB CPU below 60 %.
- Emit theme.migration.backfill.progress.v1 events for ops dashboards.
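The resumable part of a backfill can be illustrated with keyset pagination over (tenant_id, primary_key). The sketch below uses an in-memory array in place of the database and hypothetical names throughout (Row, Checkpoint, runBackfill). A real job would persist the checkpoint in migration_checkpoints and add throttling, which is not shown.

```typescript
interface Row { tenantId: string; id: string; migrated: boolean }
type Checkpoint = { tenantId: string; id: string } | null;

// Plain string ordering; Postgres collation may order differently.
function cmp(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// True when the row sorts strictly after the checkpoint on (tenantId, id).
function isAfter(row: Row, cp: Checkpoint): boolean {
  if (cp === null) return true;
  if (row.tenantId !== cp.tenantId) return row.tenantId > cp.tenantId;
  return row.id > cp.id;
}

// Stand-in for:
//   SELECT ... WHERE (tenant_id, id) > ($1, $2) ORDER BY tenant_id, id LIMIT $3
function nextBatch(table: Row[], cp: Checkpoint, size: number): Row[] {
  return table
    .filter((r) => isAfter(r, cp))
    .sort((a, b) =>
      a.tenantId === b.tenantId ? cmp(a.id, b.id) : cmp(a.tenantId, b.tenantId))
    .slice(0, size);
}

// Runs until the table is exhausted or maxBatches is reached (the latter
// simulates an interruption). Returns the checkpoint to resume from.
function runBackfill(
  table: Row[], start: Checkpoint, size: number, maxBatches: number,
): Checkpoint {
  let cp = start;
  for (let i = 0; i < maxBatches; i++) {
    const batch = nextBatch(table, cp, size);
    if (batch.length === 0) break;
    for (const row of batch) {
      row.migrated = true;                         // idempotent per-row work
      cp = { tenantId: row.tenantId, id: row.id }; // advance the checkpoint
    }
  }
  return cp;
}
```

Passing the returned checkpoint back into runBackfill resumes from the next unprocessed row, so an interrupted run never reprocesses earlier batches.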
2. Initial migration plan (Phase 0 launch)
theme-config-service is a Phase 0 service: there is no incumbent system to migrate from. The "migration" here is from "no themes" to "every existing tenant has a published theme" at the cutover moment.
2.1 Pre-launch state
- Tenant onboarding for existing pilot tenants happened via the tenant-service tenant.created.v1 event before this service was live.
- Booking flow served a hard-coded "platform default" brand for all tenants until launch.
2.2 Cutover plan
| Step | Owner | Detail |
|---|---|---|
| T-7d | Platform | Deploy theme-config-service in shadow mode: subscribes to events, but no consumer reads from it. |
| T-7d | Platform | Run a one-off bootstrap job that creates a default Theme + ThemeVersion (cloned from the MELMASTOON_DEFAULT_SCAFFOLD) for every existing tenant; publish each. |
| T-5d | Platform | Verify in staging with the BFFs reading from the new bundle URL; A/B compare to the hard-coded platform brand. |
| T-3d | FrontendPlatform | Customer-success outreach to pilot tenants: "your booking site brand will move to the new self-service editor on T-day; here's a 30 min walkthrough." |
| T-1d | Platform | Final dry-run of cutover script in staging. |
| T | Platform | Cutover: BFFs flip the read source from the hard-coded brand to GET /public/themes/<themeId>/published.json. Toggle is a feature flag (brand.source = 'theme-config-service'). |
| T+1d | Support | Monitor support tickets; on-call engineer dedicated; rollback by toggling the flag back. |
| T+7d | Platform | Remove the hard-coded brand fallback from BFFs; remove the feature flag. |
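The flag-driven read switch at T could look roughly like this on the BFF side. It is a sketch only: the flag values and the endpoint path come from the plan above, while resolveBrand, BrandLocation and the "platform-default" placeholder are hypothetical names introduced for illustration.

```typescript
type BrandSource = "hardcoded" | "theme-config-service";

// Placeholder for the hard-coded platform brand served before launch.
const HARDCODED_BRAND = { kind: "inline" as const, name: "platform-default" };

type BrandLocation =
  | typeof HARDCODED_BRAND
  | { kind: "bundle"; path: string };

// Maps the brand.source flag value to where the BFF should read the brand from.
function resolveBrand(source: BrandSource, themeId: string): BrandLocation {
  if (source === "theme-config-service") {
    return {
      kind: "bundle",
      path: `/public/themes/${encodeURIComponent(themeId)}/published.json`,
    };
  }
  return HARDCODED_BRAND;
}
```

Because the fallback branch stays in place until T+7d, flipping the flag back is sufficient to restore the previous behaviour.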
2.3 Bootstrap script
pnpm run migrate:bootstrap-existing-tenants — Cloud Run Job:
- List every active tenant via tenant-service API.
- For each tenant in batches of 50:
  - Skip if a Theme already exists for (tenantId, propertyId=null).
  - Otherwise invoke ProvisionDefaultThemeUseCase directly through the application layer, with actor: { kind: 'system', id: 'svc:theme-config-service:bootstrap' }.
- Emit theme.migration.bootstrap.progress.v1 per batch; final theme.migration.bootstrap.completed.v1 with totals.
The script is idempotent — re-running it after partial failure resumes from the next un-provisioned tenant.
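The control flow of the bootstrap job can be sketched as below. The BootstrapPorts interface is hypothetical and synchronous for brevity. The real job would call the tenant-service API and ProvisionDefaultThemeUseCase asynchronously, and the event payload shape is an assumption.

```typescript
interface Actor { kind: "system"; id: string }

// Hypothetical ports standing in for tenant-service, the repository,
// the application-layer use case and the event publisher.
interface BootstrapPorts {
  listActiveTenantIds(): string[];
  themeExists(tenantId: string): boolean; // for (tenantId, propertyId = null)
  provisionDefaultTheme(tenantId: string, actor: Actor): void;
  emit(event: string, totals: { provisioned: number; skipped: number }): void;
}

const BATCH_SIZE = 50;
const BOOTSTRAP_ACTOR: Actor = {
  kind: "system",
  id: "svc:theme-config-service:bootstrap",
};

function bootstrapExistingTenants(
  ports: BootstrapPorts,
): { provisioned: number; skipped: number } {
  const tenantIds = ports.listActiveTenantIds();
  let provisioned = 0;
  let skipped = 0;
  for (let start = 0; start < tenantIds.length; start += BATCH_SIZE) {
    for (const tenantId of tenantIds.slice(start, start + BATCH_SIZE)) {
      if (ports.themeExists(tenantId)) {
        skipped++; // already provisioned: this is what makes re-runs safe
        continue;
      }
      ports.provisionDefaultTheme(tenantId, BOOTSTRAP_ACTOR);
      provisioned++;
    }
    ports.emit("theme.migration.bootstrap.progress.v1", { provisioned, skipped });
  }
  ports.emit("theme.migration.bootstrap.completed.v1", { provisioned, skipped });
  return { provisioned, skipped };
}
```

The existence check is the only state the loop relies on, so a second run after a crash skips everything already provisioned and picks up the remainder.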
2.4 Validation
After bootstrap:
- Every tenant has exactly one active ThemePublication.
- Every published bundle is reachable at the CDN edge with HTTP 200.
- Every bundle's SHA matches theme_publications.bundle_sha256.
- The published_theme_view materialised view contains one row per tenant.
A validation job runs and fails the cutover if any of these checks fail.
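The four checks above can be expressed as a pure function over data the validation job has already collected. The TenantCheck shape and function names are hypothetical. Gathering the inputs (querying the database, fetching from the CDN, hashing the bundle) is out of scope for this sketch.

```typescript
// Hypothetical per-tenant snapshot gathered by the validation job.
interface TenantCheck {
  tenantId: string;
  activePublications: number; // count of active ThemePublication rows
  cdnStatus: number;          // HTTP status when fetching the bundle at the edge
  bundleSha256: string;       // hash of the fetched bundle
  recordedSha256: string;     // theme_publications.bundle_sha256
  viewRows: number;           // rows for this tenant in published_theme_view
}

function validateTenant(c: TenantCheck): string[] {
  const failures: string[] = [];
  if (c.activePublications !== 1) {
    failures.push(`${c.tenantId}: ${c.activePublications} active publications, expected 1`);
  }
  if (c.cdnStatus !== 200) {
    failures.push(`${c.tenantId}: CDN returned HTTP ${c.cdnStatus}`);
  }
  if (c.bundleSha256.toLowerCase() !== c.recordedSha256.toLowerCase()) {
    failures.push(`${c.tenantId}: bundle SHA-256 mismatch`);
  }
  if (c.viewRows !== 1) {
    failures.push(`${c.tenantId}: ${c.viewRows} rows in published_theme_view, expected 1`);
  }
  return failures;
}

// The cutover proceeds only when ok is true.
function validateCutover(checks: TenantCheck[]): { ok: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const check of checks) failures.push(...validateTenant(check));
  return { ok: failures.length === 0, failures };
}
```

Returning every failure rather than stopping at the first gives the on-call a complete list to work through before retrying.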
3. Phase-2 chain-branding migration
When chain support ships, existing tenants that adopt chain mode will have their tenant-scoped theme promoted to chain-baseline + property override:
| Step | Detail |
|---|---|
| 1 | Author triggers "Convert to chain branding" in backoffice. |
| 2 | Migration use case creates per-property Theme(scope='property', cloneFromThemeId=<tenantTheme>) rows. |
| 3 | The tenant-scoped theme remains as the chain baseline; property overrides apply on top via the BFF resolution rule. |
| 4 | BFFs cut over per property as soon as the property theme publishes. |
| 5 | An ADR (ADR-0006-chain-branding) gates this work; until then, scope='property' requires a feature flag. |
4. Cross-service migration coordination
| Counterparty | Coordination |
|---|---|
| bff-tenant-booking-service | Reads bundle URL; needs to flip from hard-coded brand to bundle source on cutover. |
| notification-service | Reads email-theme via internal mTLS; needs the endpoint live before publishing email-theme-updated events. |
| audit-service | Subscribes to all theme.* events; bootstrap events generate audit volume; pre-warned. |
| analytics-service | Subscribes to publication events; bootstrap is a one-off spike, annotated in dashboards. |
| desktop-sync-service | Per-property bundle sync ready before any property-scoped theme publishes. |
Each counterparty signs off via the cross-service contract acknowledgement in SERVICE_READINESS §6.
5. Rollback at the cutover
- Trigger: error budget burn on booking-flow rendering > 10 % for 15 min after cutover, OR support-ticket volume > 3× baseline.
- Action: Toggle feature flag brand.source = 'hardcoded' on every BFF region; rollback within 60 s.
- Side effect: any in-progress publishes are unaffected; theme data persists; we just stop reading it.
- Recovery: investigate, fix forward, retry cutover the next business day.
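The trigger condition can be written down as a small predicate. The sketch interprets "> 10 % for 15 min" as every one-minute sample in the last 15 minutes exceeding the threshold. That reading is an assumption, as are the CutoverSignals shape and the function name.

```typescript
interface CutoverSignals {
  // Error-budget burn on booking-flow rendering as a fraction (0.12 = 12 %),
  // one sample per minute, oldest first.
  burnPerMinute: number[];
  ticketsNow: number;      // current support-ticket volume
  ticketsBaseline: number; // baseline volume over the same period length
}

const BURN_THRESHOLD = 0.1;
const BURN_WINDOW_MINUTES = 15;
const TICKET_MULTIPLIER = 3;

function shouldRollback(s: CutoverSignals): boolean {
  const recent = s.burnPerMinute.slice(-BURN_WINDOW_MINUTES);
  // Require a full window so a short burst right after cutover does not trigger.
  const sustainedBurn =
    recent.length === BURN_WINDOW_MINUTES &&
    recent.every((burn) => burn > BURN_THRESHOLD);
  const ticketSpike = s.ticketsNow > TICKET_MULTIPLIER * s.ticketsBaseline;
  return sustainedBurn || ticketSpike;
}
```

Either condition alone is enough, which matches the OR in the trigger definition.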
6. Long-running migrations
The service is intentionally schema-light at Phase 0. Anticipated future migrations:
| Migration | Why | Strategy |
|---|---|---|
| Promote theme_publications.bundle_url to a separate theme_bundles table when storing variants per locale or per platform | Reduce row size; allow multi-bundle per publication | Expand-then-contract; backfill from existing rows |
| Partition outbox by month | Performance once volume scales | PG declarative partitioning; cutover with dual-read window |
| Drop archived versions older than 365 d | Storage hygiene | Background reaper writes a retention summary row before deleting |
Each will land its own dated migration plan referenced from this document.
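For the monthly outbox partitioning, the DDL for each partition is mechanical enough to generate. The helper below is a hypothetical sketch that assumes the parent table is range-partitioned on a timestamp or date column. The partition naming scheme is an assumption, and the future migration plan may choose differently.

```typescript
// Builds the DDL for one monthly partition of a range-partitioned table.
// month is 1-12; the upper bound is exclusive, as in Postgres range partitions.
function monthlyPartitionDdl(parent: string, year: number, month: number): string {
  if (!Number.isInteger(month) || month < 1 || month > 12) {
    throw new Error(`month out of range: ${month}`);
  }
  const pad = (n: number): string => (n < 10 ? `0${n}` : `${n}`);
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;
  const from = `${year}-${pad(month)}-01`;
  const to = `${nextYear}-${pad(nextMonth)}-01`;
  return (
    `CREATE TABLE ${parent}_${year}_${pad(month)} PARTITION OF ${parent} ` +
    `FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}
```

A scheduled job could call this a month or two ahead so that inserts never arrive before their partition exists.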
7. References
- Schema: DATA_MODEL
- Deployment pipeline: DEPLOYMENT_TOPOLOGY §6
- Readiness gates: SERVICE_READINESS
- Risks: SERVICE_RISK_REGISTER (TCS-R-019)