MIGRATION_PLAN — theme-config-service

Sibling: DATA_MODEL · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS

Platform anchors: docs/standards/SERVICE_TEMPLATE.md

This document covers two distinct concerns:

  1. Schema migration policy — how we evolve the theme_config Postgres schema safely (zero-downtime, backwards-compatible).
  2. Initial data migration — how the platform bootstraps from "no themes" to "every existing tenant has a published theme" at first launch and at chain-onboarding events.

1. Schema migration policy

1.1 Tooling

  • Drizzle Kit for migration generation (pnpm run db:migrate:generate).
  • Migrations stored in services/theme-config-service/src/infrastructure/persistence/migrations/.
  • Tracked in theme_config.__drizzle_migrations.
  • Execution role: theme_config_migrator (DDL + minimal DML; no row-level read).
  • Applied via a dedicated Cloud Run Job triggered by the deployment pipeline before the new container revision rolls out.

1.2 Expand-then-contract policy (mandatory)

Any change that would otherwise be breaking is split into two deployments:

Deployment N:
1. Add new schema element (column / table / index / constraint variant) — backwards compatible.
2. Code reads from BOTH old and new; writes to BOTH (dual-write) when applicable.

Deployment N+1 (after the data backfill completes):
3. Code reads from and writes to the new element only.
4. Drop old schema element.

Backfills run as Cloud Run Jobs, batched by tenant_id, and record progress checkpoints in a migration_checkpoints table.
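A minimal sketch of the dual-read / dual-write step in Deployment N, assuming a hypothetical rename of a column brand_name to display_name. The row shape and function names are illustrative and are not taken from the service's code.

```typescript
// Hypothetical row during the expand phase: both columns exist.
interface ThemeRow {
  id: string;
  brand_name: string | null;   // old column, dropped in Deployment N+1
  display_name: string | null; // new column, added in Deployment N
}

// Dual-read: prefer the new column, fall back to the old one
// for rows the backfill has not reached yet.
function readDisplayName(row: ThemeRow): string | null {
  return row.display_name ?? row.brand_name;
}

// Dual-write: every write sets both columns, so old and new
// container revisions see consistent data during the rollout.
function writeDisplayName(row: ThemeRow, value: string): ThemeRow {
  return { ...row, brand_name: value, display_name: value };
}
```

Once the backfill has filled display_name for every row, the fallback in readDisplayName becomes dead code and can be removed together with the old column.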

1.3 Reviewed migration types

| Change type | Strategy | Lock impact |
| --- | --- | --- |
| Add nullable column | one deployment, backfill optional | metadata only |
| Add NOT NULL column with default | add nullable → backfill → set NOT NULL across two deployments | metadata only; table rewrite avoided via DEFAULT on PG 16 |
| Add index | CREATE INDEX CONCURRENTLY | no write-blocking lock |
| Drop column | mark deprecated → stop writes → drop in next deployment | metadata only |
| Rename column | add new col → dual-write → backfill → cut reads → drop old | requires expand-then-contract |
| Add CHECK constraint | ALTER ... ADD CONSTRAINT ... NOT VALID then VALIDATE CONSTRAINT | brief table scan during validate |
| Change column type (compatible) | add new col → backfill → cut over | requires expand-then-contract |
| Add table | one deployment | none |
| Drop table | confirm zero readers/writers via observability for ≥ 7 d → drop | none |
| Add RLS policy | always before adding the table to the app's reader/writer set | none |
| Migrate JSONB shape | introduce shape v2; application reads both; migrate row-by-row in a job; cut reads; remove v1 reader | none |
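The "application reads both" step of a JSONB shape migration could be sketched as a reader that upgrades v1 documents on the fly. Both shapes here are hypothetical, since this document does not define the theme JSONB schema; the schemaVersion discriminator is likewise an assumption.

```typescript
// Hypothetical v1 and v2 shapes of a theme's stored JSONB document.
interface ThemeTokensV1 {
  primaryColor: string;
}
interface ThemeTokensV2 {
  schemaVersion: 2;
  colors: { primary: string };
}
type StoredTokens = ThemeTokensV1 | ThemeTokensV2;

// Type guard: v2 documents carry an explicit schemaVersion marker.
function isV2(t: StoredTokens): t is ThemeTokensV2 {
  return (t as ThemeTokensV2).schemaVersion === 2;
}

// Reader used while both shapes coexist: v2 passes through unchanged,
// v1 is converted in memory without touching the stored row.
function toV2(t: StoredTokens): ThemeTokensV2 {
  if (isV2(t)) return t;
  return { schemaVersion: 2, colors: { primary: t.primaryColor } };
}
```

The row-by-row migration job can reuse toV2 to write the converted document back, after which the v1 branch can be removed.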

1.4 CI gates on migrations

  • pnpm run migrate:lint enforces:
    • No DROP COLUMN without an explicit -- @safe-drop annotation referencing the prior expand deployment.
    • No ALTER TABLE ... ADD COLUMN ... NOT NULL without a DEFAULT.
    • No CREATE INDEX without CONCURRENTLY.
    • All new tables have RLS policies if they carry tenant_id.
  • pnpm run migrate:diff against the staging DB produces a human-readable plan attached to the PR.
  • Two-person approval required for any migration touching themes, theme_versions, theme_publications.
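The first three lint rules above could be approximated with a few pattern checks over the migration SQL. This is a rough sketch, not the actual migrate:lint implementation: the rule names are invented, the statement splitting is naive (it ignores semicolons or "--" inside string literals), the @safe-drop annotation is treated as file-wide, and the RLS rule is left out because it needs schema knowledge.

```typescript
interface LintFinding {
  rule: string;      // hypothetical rule identifier
  statement: string; // the offending SQL statement
}

function lintMigration(sql: string): LintFinding[] {
  const findings: LintFinding[] = [];
  // Simplification: one @safe-drop comment anywhere allows all drops in the file.
  const hasSafeDrop = /--\s*@safe-drop\b/.test(sql);

  // Strip line comments, then split naively on semicolons.
  const statements = sql
    .replace(/--.*$/gm, "")
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  for (const statement of statements) {
    if (/\bDROP\s+COLUMN\b/i.test(statement) && !hasSafeDrop) {
      findings.push({ rule: "drop-column-needs-safe-drop", statement });
    }
    if (
      /\bADD\s+COLUMN\b/i.test(statement) &&
      /\bNOT\s+NULL\b/i.test(statement) &&
      !/\bDEFAULT\b/i.test(statement)
    ) {
      findings.push({ rule: "not-null-needs-default", statement });
    }
    if (
      /\bCREATE\s+(UNIQUE\s+)?INDEX\b/i.test(statement) &&
      !/\bINDEX\s+CONCURRENTLY\b/i.test(statement)
    ) {
      findings.push({ rule: "index-needs-concurrently", statement });
    }
  }
  return findings;
}
```

A CI step would run this over each new migration file and fail the build when the returned list is non-empty.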

1.5 Rollback

  • Every migration MUST ship with a tested down migration generated alongside the up.
  • Down migrations are exercised in CI on the local stack (pnpm run db:migrate:down).
  • In production, rolling back a deployment does not automatically run the down migration; the on-call engineer decides per the deployment runbook (it is often safer to roll the schema forward and patch in code).

1.6 Data backfill jobs

  • Implemented as Cloud Run Jobs in src/infrastructure/migrations/backfills/.
  • Idempotent and resumable: each job tracks a last-processed (tenant_id, primary_key) checkpoint.
  • Throttled to keep DB CPU below 60 %.
  • Emit theme.migration.backfill.progress.v1 events for ops dashboards.
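A resumable backfill keyed on a (tenant_id, primary_key) checkpoint might look like the following in-memory sketch. The Row and Checkpoint types are assumptions; a real job would select the batch with a keyset query, persist the checkpoint in migration_checkpoints, and throttle between batches.

```typescript
interface Row {
  tenantId: string;
  id: number;
  backfilled: boolean;
}
interface Checkpoint {
  tenantId: string;
  id: number;
}

// Keyset comparison on (tenant_id, primary_key): is this row past the checkpoint?
function isAfter(row: Row, cp: Checkpoint | null): boolean {
  if (cp === null) return true;
  if (row.tenantId !== cp.tenantId) return row.tenantId > cp.tenantId;
  return row.id > cp.id;
}

// Processes one batch and returns the new checkpoint.
// Returns the checkpoint unchanged when no rows are left.
function runBatch(rows: Row[], cp: Checkpoint | null, batchSize: number): Checkpoint | null {
  const pending = rows
    .filter((r) => isAfter(r, cp))
    .sort((a, b) =>
      a.tenantId === b.tenantId ? a.id - b.id : a.tenantId < b.tenantId ? -1 : 1
    )
    .slice(0, batchSize);

  for (const row of pending) {
    row.backfilled = true; // idempotent: applying it again changes nothing
  }

  const last = pending[pending.length - 1];
  return last ? { tenantId: last.tenantId, id: last.id } : cp;
}
```

Because the checkpoint only advances after a batch has been applied, a job that crashes mid-batch reprocesses that batch on restart, which is safe as long as each row update is idempotent.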

2. Initial migration plan (Phase 0 launch)

theme-config-service is a Phase 0 service: there is no incumbent system to migrate from. The "migration" is from "no themes" to "every existing tenant has a published theme" at the cutover moment.

2.1 Pre-launch state

  • Tenant onboarding for existing pilot tenants happened via the tenant-service tenant.created.v1 event before this service was live.
  • Booking flow served a hard-coded "platform default" brand for all tenants until launch.

2.2 Cutover plan

| Step | Owner | Detail |
| --- | --- | --- |
| T-7d | Platform | Deploy theme-config-service in shadow mode: subscribes to events, but no consumer reads from it. |
| T-7d | Platform | Run a one-off bootstrap job that creates a default Theme + ThemeVersion (cloned from the MELMASTOON_DEFAULT_SCAFFOLD) for every existing tenant; publish each. |
| T-5d | Platform | Verify in staging with the BFFs reading from the new bundle URL; A/B compare to the hard-coded platform brand. |
| T-3d | FrontendPlatform | Customer-success outreach to pilot tenants: "your booking site brand will move to the new self-service editor on T-day; here's a 30 min walkthrough." |
| T-1d | Platform | Final dry-run of cutover script in staging. |
| T | Platform | Cutover: BFFs flip the read source from the hard-coded brand to GET /public/themes/<themeId>/published.json. Toggle is a feature flag (brand.source = 'theme-config-service'). |
| T+1d | Support | Monitor support tickets; on-call engineer dedicated; rollback by toggling the flag back. |
| T+7d | Platform | Remove the hard-coded brand fallback from BFFs; remove the feature flag. |
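On the BFF side, the flag-driven read source could be sketched as below. The Brand shape, the HARDCODED_BRAND values and the fetchPublished callback are all hypothetical; only the two flag values come from this plan. The fallback on a missing bundle is an assumption about how the pre-T+7d fallback behaves.

```typescript
// The two values of the brand.source feature flag named in this plan.
type BrandSource = "hardcoded" | "theme-config-service";

// Hypothetical brand shape and default values.
interface Brand {
  name: string;
  primaryColor: string;
}
const HARDCODED_BRAND: Brand = { name: "Platform default", primaryColor: "#000000" };

// fetchPublished stands in for the call to the published bundle endpoint;
// it returns null when no bundle could be loaded.
function resolveBrand(flag: BrandSource, fetchPublished: () => Brand | null): Brand {
  if (flag === "hardcoded") return HARDCODED_BRAND;
  // Until the fallback is removed at T+7d, a missing bundle degrades
  // to the hard-coded brand instead of failing the page render.
  return fetchPublished() ?? HARDCODED_BRAND;
}
```

Keeping the decision in one function means the T+7d cleanup only has to delete the flag check and the fallback constant.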

2.3 Bootstrap script

pnpm run migrate:bootstrap-existing-tenants runs as a Cloud Run Job:

  1. List every active tenant via tenant-service API.
  2. For each tenant in batches of 50:
    • Skip if a Theme already exists for (tenantId, propertyId=null).
    • Otherwise invoke ProvisionDefaultThemeUseCase directly through the application layer, with actor: { kind: 'system', id: 'svc:theme-config-service:bootstrap' }.
  3. Emit theme.migration.bootstrap.progress.v1 per batch; final theme.migration.bootstrap.completed.v1 with totals.

The script is idempotent: re-running it after a partial failure skips tenants that already have a theme and resumes from the next un-provisioned tenant.
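The bootstrap loop can be sketched as follows. The BootstrapDeps interface is a stand-in for the real collaborators (the theme lookup, ProvisionDefaultThemeUseCase and the event emitter), and the calls are synchronous for brevity; the actual job would presumably be async.

```typescript
interface Tenant {
  id: string;
}

// Hypothetical stand-ins for the real collaborators.
interface BootstrapDeps {
  themeExists(tenantId: string): boolean;         // lookup for (tenantId, propertyId=null)
  provisionDefaultTheme(tenantId: string): void;  // ProvisionDefaultThemeUseCase
  emitProgress(provisioned: number, skipped: number): void; // per-batch progress event
}

function bootstrapTenants(
  tenants: Tenant[],
  deps: BootstrapDeps,
  batchSize = 50
): { provisioned: number; skipped: number } {
  let provisioned = 0;
  let skipped = 0;
  for (let i = 0; i < tenants.length; i += batchSize) {
    for (const tenant of tenants.slice(i, i + batchSize)) {
      // The existence check is what makes a re-run safe after a partial failure.
      if (deps.themeExists(tenant.id)) {
        skipped++;
        continue;
      }
      deps.provisionDefaultTheme(tenant.id);
      provisioned++;
    }
    deps.emitProgress(provisioned, skipped);
  }
  return { provisioned, skipped };
}
```

The returned totals correspond to what the final completed event would carry.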

2.4 Validation

After bootstrap:

  • Every tenant has exactly one active ThemePublication.
  • Every published bundle is reachable at the CDN edge with HTTP 200.
  • Every bundle's SHA matches theme_publications.bundle_sha256.
  • The published_theme_view materialised view contains one row per tenant.

A validation job runs these checks and blocks the cutover if any of them fails.
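The four checks could be expressed as a pure function over per-tenant observations, as in this sketch. The TenantCheck shape is an assumption; gathering its fields (querying theme_publications, fetching the bundle from the CDN, hashing it) is left to the job.

```typescript
// Hypothetical per-tenant observations collected by the validation job.
interface TenantCheck {
  tenantId: string;
  activePublications: number;   // active ThemePublication rows for the tenant
  cdnStatus: number;            // HTTP status when fetching the bundle at the edge
  fetchedBundleSha256: string;  // SHA-256 of the fetched bundle bytes
  recordedBundleSha256: string; // theme_publications.bundle_sha256
  viewRows: number;             // rows for the tenant in published_theme_view
}

function validateTenant(c: TenantCheck): string[] {
  const failures: string[] = [];
  if (c.activePublications !== 1) failures.push("expected exactly one active ThemePublication");
  if (c.cdnStatus !== 200) failures.push("bundle not reachable with HTTP 200");
  if (c.fetchedBundleSha256.toLowerCase() !== c.recordedBundleSha256.toLowerCase()) {
    failures.push("bundle SHA mismatch");
  }
  if (c.viewRows !== 1) failures.push("expected one row in published_theme_view");
  return failures;
}

// The cutover is allowed only when no tenant has any failure.
function validateCutover(checks: TenantCheck[]): { ok: boolean; failed: Map<string, string[]> } {
  const failed = new Map<string, string[]>();
  for (const c of checks) {
    const failures = validateTenant(c);
    if (failures.length > 0) failed.set(c.tenantId, failures);
  }
  return { ok: failed.size === 0, failed };
}
```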


3. Phase-2 chain-branding migration

When chain support ships, existing tenants that adopt chain mode will have their tenant-scoped theme promoted to chain-baseline + property override:

| Step | Detail |
| --- | --- |
| 1 | Author triggers "Convert to chain branding" in backoffice. |
| 2 | Migration use case creates per-property Theme(scope='property', cloneFromThemeId=<tenantTheme>) rows. |
| 3 | The tenant-scoped theme remains as the chain baseline; property overrides apply on top via the BFF resolution rule. |
| 4 | BFFs cut over per property as soon as the property theme publishes. |
| 5 | An ADR (ADR-0006-chain-branding) gates this work; until then, scope='property' requires a feature flag. |
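This plan does not spell out the BFF resolution rule. One plausible reading, shown here purely as an assumption, is a shallow merge in which property-level tokens win over the tenant-scoped baseline.

```typescript
// Hypothetical flat token map; the real theme structure may be nested.
type ThemeTokens = Record<string, string>;

// Assumed rule: start from the chain baseline (the tenant-scoped theme)
// and let any property override replace individual tokens.
function resolveTheme(baseline: ThemeTokens, propertyOverride?: ThemeTokens): ThemeTokens {
  return { ...baseline, ...(propertyOverride ?? {}) };
}
```

A property without its own published theme resolves to the baseline unchanged, which matches step 4: each property cuts over only once its theme publishes.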

4. Cross-service migration coordination

| Counterparty | Coordination |
| --- | --- |
| bff-tenant-booking-service | Reads bundle URL; needs to flip from hard-coded brand to bundle source on cutover. |
| notification-service | Reads email-theme via internal mTLS; needs the endpoint live before publishing email-theme-updated events. |
| audit-service | Subscribes to all theme.* events; bootstrap events generate audit volume; pre-warned. |
| analytics-service | Subscribes to publication events; bootstrap is a one-off spike; annotated in dashboards. |
| desktop-sync-service | Per-property bundle sync ready before any property-scoped theme publishes. |

Each counterparty signs off via the cross-service contract acknowledgement in SERVICE_READINESS §6.


5. Rollback at the cutover

  • Trigger: error budget burn on booking-flow rendering > 10 % for 15 min after cutover, OR support-ticket volume > 3× baseline.
  • Action: toggle the feature flag to brand.source = 'hardcoded' on every BFF region; the rollback takes effect within 60 s.
  • Side effect: none on stored data. In-progress publishes are unaffected and theme data persists; the BFFs simply stop reading it.
  • Recovery: investigate, fix forward, retry cutover the next business day.
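The trigger condition can be written as a small predicate. This sketch assumes one error-budget-burn sample per minute and reads "> 10 % for 15 min" as "all of the last 15 samples exceed 10"; the CutoverSignals shape is hypothetical.

```typescript
// Hypothetical signal snapshot collected after cutover.
interface CutoverSignals {
  burnPctPerMinute: number[]; // error budget burn in percent, most recent sample last
  ticketVolume: number;       // support tickets in the current window
  ticketBaseline: number;     // tickets in a comparable pre-cutover window
}

function shouldRollback(s: CutoverSignals): boolean {
  // Burn must exceed 10 % for 15 consecutive one-minute samples.
  const recent = s.burnPctPerMinute.slice(-15);
  const sustainedBurn = recent.length === 15 && recent.every((p) => p > 10);
  // Ticket volume must be strictly more than 3x the baseline.
  const ticketSpike = s.ticketVolume > 3 * s.ticketBaseline;
  return sustainedBurn || ticketSpike;
}
```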

6. Long-running migrations

The service is intentionally schema-light at Phase 0. Anticipated future migrations:

| Migration | Why | Strategy |
| --- | --- | --- |
| Promote theme_publications.bundle_url to a separate theme_bundles table when storing variants per locale or per platform | Reduce row size; allow multi-bundle per publication | Expand-then-contract; backfill from existing rows |
| Partition outbox by month | Performance once volume scales | PG declarative partitioning; cutover with dual-read window |
| Drop archived versions older than 365 d | Storage hygiene | Background reaper writes a retention summary row before deleting |

Each will land its own dated migration plan referenced from this document.
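For the monthly outbox partitioning, a helper that generates the per-month DDL might look like this. It assumes the parent table is named outbox and is already declared with PARTITION BY RANGE on a timestamp column; the partition naming scheme is invented for illustration.

```typescript
// Builds the DDL for one monthly range partition of a declaratively
// partitioned parent table. The upper bound is exclusive, so it is
// the first day of the following month.
function monthlyPartitionDdl(parent: string, year: number, month: number): string {
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;
  const name = `${parent}_${year}_${pad(month)}`;
  return (
    `CREATE TABLE ${name} PARTITION OF ${parent} ` +
    `FOR VALUES FROM ('${year}-${pad(month)}-01') TO ('${nextYear}-${pad(nextMonth)}-01');`
  );
}
```

A scheduled job could call this ahead of each month so that the partition exists before the first row for that month arrives.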


7. References