MIGRATION_PLAN — notification-service
Sibling: DATA_MODEL · DEPLOYMENT_TOPOLOGY · EVENT_SCHEMAS · SYNC_CONTRACT · API_CONTRACTS
Strategic anchors: docs/06-data-models · docs/04-event-driven-architecture · docs/standards/SERVICE_TEMPLATE
This plan covers schema, event, API, and operational migrations that affect notification-service — both forward-only changes the team will execute and the playbooks for one-off cutovers (greenfield bootstrap, regional rollout, importing tenants from third-party providers).
Forward-only is the rule. Every change ships through the expand → migrate → contract pattern; no destructive (contract-phase) change ships until the migrate phase is verified across all consumers.
1. Migration philosophy
- Expand → Migrate → Contract: add new structure, dual-write/read while consumers move, then remove old structure.
- Backwards-compatible event evolution: only additive changes inside `vN`; breaking changes ship as `vN+1` with `vN` retained for the deprecation window (see docs/04-event-driven-architecture §11).
- API versioning: `/api/v1` is stable. Breaking surface changes go to `/api/v2`; deprecation period ≥ 180 days; `Sunset`/`Deprecation` headers per docs/05-api-design.
- Idempotent migrations: every Drizzle migration tolerates partial application and re-execution.
- Online by default: no whole-table locks; use partition swaps, `CREATE INDEX CONCURRENTLY`, lazy column backfill.
- Tested in staging snapshot: every migration runs against a daily-refreshed staging snapshot before production.
- Tenant-aware: tenant scoping must be preserved at every step; backfills never bypass RLS unless explicitly justified by a runbook entry.
2. Greenfield bootstrap
Order of operations when standing up notification-service in a fresh region.
┌──────────────────────────────────────────────────────────────────────────────┐
│ 0. Provision GCP project, VPC, Cloud SQL, Memorystore, Pub/Sub, GCS, KMS │
│ 1. Run migrations 0001..N (Drizzle) │
│ 2. Seed channel registry (vendor placeholders, not real keys) │
│ 3. Seed platform-scoped templates (versioned, status=draft) │
│ 4. Wire Pub/Sub topics & subscriptions per EVENT_SCHEMAS §2 │
│ 5. Deploy Cloud Run services (canary 5 % → 50 % → 100 %) │
│ 6. Run smoke tests + synthetic monitor │
│ 7. Onboard first tenant (provisioning workflow, see §6) │
└──────────────────────────────────────────────────────────────────────────────┘
2.1 Initial migration set (0001_init … 00xx)
| Migration | Purpose |
|---|---|
| 0001_init_schema_namespaces | Create notification schema, set search_path, set default tablespace. |
| 0002_tenants_local_projection | Local projection of tenant.* events. |
| 0003_templates_and_versions | templates, template_versions tables; check constraints on status lifecycle. |
| 0004_recipients_and_preferences | recipients, notification_preferences; default-deny preference rows. |
| 0005_notifications_partitioned | notifications parent + first 6 monthly partitions; CDC publication. |
| 0006_delivery_attempts_partitioned | delivery_attempts parent + 12 monthly partitions. |
| 0007_suppressions | suppressions table; partial unique index per (tenant, channel, address_hash). |
| 0008_channels_and_credentials | channels, channel_credentials (KMS-wrapped DEK columns). |
| 0009_webhook_inbound | webhook_inbound, webhook_inbound_events partitioned. |
| 0010_dispatch_batches | dispatch_batches (batch metadata + counters). |
| 0011_scheduled_workqueue | notification_scheduled work projection + index on (due_at). |
| 0012_trigger_map | notification_trigger_map; bootstrap rows per consumed event subject. |
| 0013_outbox_inbox_idempotency | Standard outbox/inbox/idempotency tables. |
| 0014_opt_out_tokens | opt_out_tokens with TTL index. |
| 0015_rls_policies | Enable RLS + define tenant_isolation policy on every tenant-scoped table. |
| 0016_partition_cron_helpers | pg_partman registration; partman.maintenance schedule. |
| 0017_seed_platform_templates | INSERT canonical platform templates (status=draft). |
| 0018_indexes_and_stats | Optional indexes that would slow seeding if added earlier; ANALYZE hot tables. |
Each migration is paired with a down_NNNN.sql that is never executed automatically — used only for emergency rollback under runbook approval (see §9).
3. Schema evolution playbook (Postgres)
For every change category, the canonical pattern.
3.1 Add a column
1. Migration: ADD COLUMN ... NULL (no DEFAULT to avoid table rewrite)
2. Application: write the new column on every INSERT/UPDATE.
3. Backfill job: chunked UPDATE in 5 000-row batches, throttled to <5 % CPU.
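4. Migration: SET NOT NULL once the backfill is complete and the verifier passes. On PostgreSQL 12+, first add and validate a CHECK (<col> IS NOT NULL) NOT VALID constraint so that SET NOT NULL can skip the full-table scan.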
5. Migration: optional DEFAULT for new rows.
Run pg_partman's run_maintenance_proc() separately when backfilling partitioned tables.
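The chunked backfill in step 3 can be sketched as a planner that splits a key range into 5 000-row batches and emits one bounded UPDATE per batch. This is a minimal sketch, not the team's backfill job: the table and column names, the integer `id` key, and the `planBackfill` helper are all hypothetical, and throttling is left to the caller.

```typescript
// Hypothetical names throughout: table, column, and an integer `id` key are assumptions.
interface BackfillBatch {
  fromId: number; // inclusive lower bound
  toId: number;   // inclusive upper bound
  sql: string;
}

// Split [minId, maxId] into contiguous batches of at most `batchSize` ids.
function planBackfill(
  table: string,
  column: string,
  valueExpr: string,
  minId: number,
  maxId: number,
  batchSize: number = 5000
): BackfillBatch[] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: BackfillBatch[] = [];
  for (let from = minId; from <= maxId; from += batchSize) {
    const to = Math.min(from + batchSize - 1, maxId);
    batches.push({
      fromId: from,
      toId: to,
      // `IS NULL` lets a re-run skip rows that were already backfilled.
      sql:
        `UPDATE ${table} SET ${column} = ${valueExpr} ` +
        `WHERE id BETWEEN ${from} AND ${to} AND ${column} IS NULL`,
    });
  }
  return batches;
}

const backfillPlan = planBackfill("notification.example_table", "new_col", "old_col", 1, 12000);
```

Each statement touches a bounded id range, so a failed or paused run can resume from the last completed batch without redoing earlier work.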
3.2 Rename a column
1. ADD COLUMN <new>; copy old → new on writes.
2. Backfill from old.
3. Read from <new> with COALESCE(<new>, <old>) for one release.
4. Drop coalesce; read only <new>.
5. Drop <old> (contract).
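On the application side, the rename steps above amount to a phase switch that controls which column is written and which is read. The sketch below assumes hypothetical columns `old_name` and `new_name` and a hypothetical `RenamePhase` flag; it only illustrates the ordering, not the service's actual data-access layer.

```typescript
// Hypothetical column names and phase flag, for illustration only.
type RenamePhase = "dual_write" | "coalesce_read" | "new_only";

interface RenamedRow {
  old_name?: string | null;
  new_name?: string | null;
}

// Writes keep both columns in sync until the old column is about to be dropped.
function writeRenamed(phase: RenamePhase, value: string): RenamedRow {
  return phase === "new_only"
    ? { new_name: value }
    : { new_name: value, old_name: value };
}

// Reads move from old, to COALESCE(new, old), to new only.
function readRenamed(phase: RenamePhase, row: RenamedRow): string | null {
  const oldValue = row.old_name === undefined ? null : row.old_name;
  const newValue = row.new_name === undefined ? null : row.new_name;
  if (phase === "dual_write") return oldValue;
  if (phase === "coalesce_read") return newValue !== null ? newValue : oldValue;
  return newValue;
}
```

The old column is dropped only after every deployed version runs in the `new_only` phase.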
3.3 Change a column type / constraint
1. ADD COLUMN new_typed.
2. Dual-write parsed value into new_typed.
3. Backfill in chunks.
4. Swap reads.
5. Drop old.
Never run ALTER COLUMN ... TYPE ... USING on hot tables: it rewrites the table while holding an ACCESS EXCLUSIVE lock.
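Step 2 (dual-writing the parsed value) might look like the following. The example assumes a hypothetical text column `ttl` being migrated to an integer column `ttl_seconds`; neither column is taken from DATA_MODEL.

```typescript
// Hypothetical example: text column `ttl` (e.g. "3600") moving to integer `ttl_seconds`.
interface LegacyTtlWrite {
  ttl: string;
}

interface DualTtlWrite extends LegacyTtlWrite {
  ttl_seconds: number | null;
}

// Returns null for unparseable input so the old column stays authoritative
// and a verifier can flag the row before reads are swapped.
function parseTtlSeconds(raw: string): number | null {
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) return null;
  return parseInt(trimmed, 10);
}

// Every write carries the legacy value and the newly typed value.
function toDualTtlWrite(input: LegacyTtlWrite): DualTtlWrite {
  return { ttl: input.ttl, ttl_seconds: parseTtlSeconds(input.ttl) };
}
```

Rows where the typed column is still NULL after the backfill are the ones to inspect before step 4.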
3.4 Add an index
CREATE INDEX CONCURRENTLY idx_xyz ON notification.notifications(...);
Followed by ANALYZE. Track creation in monitoring; skip during peak hours.
3.5 Add a partition
Automated by the pg_partman maintenance cron (registered in migration 0016); invoke it manually only when adding a new yearly partition pre-emptively. Always keep partitions created at least 3 months ahead of insert pressure.
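A check for the 3-months-ahead rule could be sketched as below. The `<parent>_pYYYY_MM` naming scheme is an assumption made for illustration; the real names depend on how pg_partman is configured.

```typescript
// Assumed naming scheme `<parent>_pYYYY_MM`; adjust to the configured pg_partman format.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Names of the monthly partitions that should exist from the current month
// through `monthsAhead` months in the future (inclusive).
function requiredMonthlyPartitions(parent: string, now: Date, monthsAhead: number = 3): string[] {
  const names: string[] = [];
  for (let i = 0; i <= monthsAhead; i++) {
    // Date.UTC normalises month overflow, so month index 12 becomes January of the next year.
    const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + i, 1));
    names.push(`${parent}_p${d.getUTCFullYear()}_${pad2(d.getUTCMonth() + 1)}`);
  }
  return names;
}

// Partitions that are required but not present in the catalog listing.
function missingPartitions(required: string[], existing: string[]): string[] {
  return required.filter((name) => existing.indexOf(name) === -1);
}
```

Feeding `missingPartitions` with the partition names read from the catalog gives a simple alert condition: any non-empty result means the maintenance job has fallen behind.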
3.6 Drop a partition
Detach first with ALTER TABLE ... DETACH PARTITION CONCURRENTLY, copy/archive to GCS as Parquet, then DROP TABLE. Retention windows per DATA_MODEL §10.
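The ordering matters: the DROP must never run unless the archive succeeded. A minimal sketch of that guard follows, with hypothetical helper names and the archive export itself left abstract.

```typescript
// Hypothetical step model: SQL statements plus one abstract archive action.
interface RetireStep {
  kind: "sql" | "archive";
  detail: string;
}

// Detach, archive, drop, in that order.
function retirePartitionSteps(parent: string, partition: string, gcsUri: string): RetireStep[] {
  return [
    { kind: "sql", detail: `ALTER TABLE ${parent} DETACH PARTITION ${partition} CONCURRENTLY` },
    { kind: "archive", detail: `export ${partition} as Parquet to ${gcsUri}` },
    { kind: "sql", detail: `DROP TABLE ${partition}` },
  ];
}

// Runs steps in order and stops at the first failure, so DROP TABLE is
// never reached unless the archive step reported success.
function runRetirement(steps: RetireStep[], run: (step: RetireStep) => boolean): RetireStep[] {
  const completed: RetireStep[] = [];
  for (let i = 0; i < steps.length; i++) {
    if (!run(steps[i])) break;
    completed.push(steps[i]);
  }
  return completed;
}
```

A detached but undropped partition is a safe resting state: the data is out of the query path yet still recoverable.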
3.7 Modify RLS policy
1. Create new policy with new predicate.
2. Test with synthetic tenant in staging.
3. Drop the old policy in the same migration transaction that creates the new one (RLS predicates are evaluated at query time, so the switch is atomic at commit).
Never operate without an RLS policy on tenant tables — even briefly.
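One way to keep the create and drop inseparable is to generate them as a single transactional statement list. In this sketch the new policy name and the `app.tenant_id` session setting are hypothetical; only `tenant_isolation` comes from migration 0015.

```typescript
// Emits the policy swap as one transaction so no query ever sees the table
// without a policy. The new policy is created before the old one is dropped.
function rlsPolicySwap(
  table: string,
  oldPolicy: string,
  newPolicy: string,
  usingPredicate: string
): string[] {
  return [
    "BEGIN",
    `CREATE POLICY ${newPolicy} ON ${table} USING (${usingPredicate})`,
    `DROP POLICY ${oldPolicy} ON ${table}`,
    "COMMIT",
  ];
}

// Hypothetical predicate: `app.tenant_id` is an assumed session setting name.
const swapStatements = rlsPolicySwap(
  "notification.notifications",
  "tenant_isolation",
  "tenant_isolation_v2",
  "tenant_id = current_setting('app.tenant_id')::uuid"
);
```

Because both statements commit together, concurrent sessions see either the old policy or the new one, never neither.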
3.8 Move data between tables
Use logical replication (pglogical) or batched COPY into the new table; cut over with a feature flag controlling which table the application reads from.
4. Event schema evolution
4.1 Additive change inside vN
- Add optional field; document as required-from-version-X in EVENT_SCHEMAS.
- Update producer; consumers may opt in lazily.
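For a consumer, an additive change means the new field is typed as optional and its absence is handled explicitly. The event shape and field name below are hypothetical and are not taken from EVENT_SCHEMAS.

```typescript
// Hypothetical v1 payload; `vendorLatencyMs` stands in for a newly added optional field.
interface ExampleSentEventV1 {
  notificationId: string;
  tenantId: string;
  vendorLatencyMs?: number; // absent on events from producers that have not upgraded
}

// A consumer that opted in still has to tolerate events without the field.
function describeLatency(event: ExampleSentEventV1): string {
  return event.vendorLatencyMs === undefined
    ? "latency unknown"
    : `latency ${event.vendorLatencyMs} ms`;
}
```

Consumers that have not opted in simply ignore the extra property, which is what keeps the change inside vN.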
4.2 Breaking change → vN+1
T0 Publish v1 only.
T1 Producer publishes v1 AND v2 (dual-publish via outbox fan-out).
T2 Consumers migrate to v2 one at a time; emit `consumer.migration.completed` event.
T3 When all consumers have reported migrated AND ≥ 30 d have elapsed, producer stops publishing v1.
T4 Subscriptions on v1 deleted; topic archived after retention window.
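The T3 gate can be expressed as a small predicate. The sketch assumes the 30-day clock starts when dual-publishing begins (T1), which the timeline does not state explicitly, and the consumer list shape is hypothetical.

```typescript
interface ConsumerMigrationStatus {
  name: string;
  migrated: boolean; // true once the consumer has emitted consumer.migration.completed
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True only when every consumer has migrated AND the minimum window has elapsed.
// Assumption: the window is measured from the start of dual-publishing (T1).
function canStopPublishingV1(
  consumers: ConsumerMigrationStatus[],
  dualPublishStartedAt: Date,
  now: Date,
  minDays: number = 30
): boolean {
  const allMigrated = consumers.every((c) => c.migrated);
  const elapsedDays = (now.getTime() - dualPublishStartedAt.getTime()) / MS_PER_DAY;
  return allMigrated && elapsedDays >= minDays;
}
```

Both conditions are required: a fast migration does not shorten the window, and an elapsed window does not excuse a lagging consumer.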
4.3 Subject rename
Treat as a breaking change. The new subject coexists with the old subject through dual-publish. Document mapping in EVENT_SCHEMAS section "Renames".
4.4 Retention/partitioning change
Provision new topic with new partitioning, dual-publish, switch consumer subscriptions, decommission old topic. Never alter partitioning of a live high-volume topic.
5. API evolution
| Change | Pattern |
|---|---|
| Add a field to a request body | Optional, default = legacy behaviour. |
| Add a field to a response body | Safe as long as clients ignore unknown fields. |
| Remove a field | Mark deprecated in OpenAPI; emit Deprecation + Sunset headers; remove in /api/v2. |
| Change an enum value | Add new value first; producers migrate; old value removed in next major. |
| Tighten validation | Roll out behind a feature flag; communicate to BFFs; log shadow rejections for ≥ 14 d. |
| Replace endpoint semantics | New path under /api/v2; redirect with 308 from /api/v1 only when semantically equivalent. |
OpenAPI spec is the contract; every change ships through PR review with the bff-backoffice-service, bff-tenant-booking-service, and bff-public-marketing-service owners as required reviewers.
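For the field-removal row, the headers and the ≥ 180-day deprecation period can be derived together. This sketch assumes `Sunset` carries an HTTP-date (RFC 8594) and `Deprecation` carries a structured-field date of the form `@<unix-seconds>` (RFC 9745); docs/05-api-design remains the authority on the exact format, and the helper name is hypothetical.

```typescript
const MIN_DEPRECATION_DAYS = 180;
const DAY_IN_MS = 24 * 60 * 60 * 1000;

// Builds the two headers and refuses a sunset date that violates the minimum window.
function deprecationHeaders(
  deprecatedAt: Date,
  sunsetAt?: Date
): { Deprecation: string; Sunset: string } {
  const earliest = new Date(deprecatedAt.getTime() + MIN_DEPRECATION_DAYS * DAY_IN_MS);
  const sunset = sunsetAt === undefined ? earliest : sunsetAt;
  if (sunset.getTime() < earliest.getTime()) {
    throw new Error("Sunset must be at least 180 days after deprecation");
  }
  return {
    // Assumed format: structured-field date, seconds since the Unix epoch.
    Deprecation: "@" + Math.floor(deprecatedAt.getTime() / 1000),
    // toUTCString() yields an HTTP-date such as "Mon, 30 Jun 2025 00:00:00 GMT".
    Sunset: sunset.toUTCString(),
  };
}
```

Deriving the sunset date from the deprecation date in one place makes it hard to announce a window shorter than the policy allows.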
6. Tenant onboarding & data import
When a new tenant joins the platform.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Step 1. tenant-service emits tenant.created.v1 │
│ Step 2. notification-service projects tenant into tenants_local │
│ Step 3. Platform-scoped templates auto-cloned to tenant scope (draft) │
│ Step 4. Tenant configures channel credentials (Secret Manager API) │
│ Step 5. Tenant verifies DKIM/SPF/DMARC via onboarding wizard │
│ Step 6. Tenant approves transactional templates (HITL if AI-drafted) │
│ Step 7. notification-service marks tenant `provisioned` │
│ Step 8. Optional: import historical notification audit from prior provider │
└─────────────────────────────────────────────────────────────────────────────┘
6.1 Importing from a previous provider (e.g., SendGrid suppressions)
| Step | Tool | Notes |
|---|---|---|
| Export suppressions from previous provider | provider CSV export | Includes bounces + complaints + unsubscribes |
| Validate format & dedupe | tools/import/suppressions-csv-validate.ts | Hashes address per DATA_MODEL §3.4 |
| Import via internal admin API | POST /internal/v1/suppressions/import (rate-limited) | Idempotent on (tenant, channel, address_hash) |
| Verify count | GET /internal/v1/suppressions/stats | Compare with provider total ± 0.1 % |
| Emit notification.suppressed.v1 per row | producer fans out via outbox | Optional; governed by import.emitEvents flag |
Recipient preference imports follow the same pattern: validate → import → emit preferences.updated.v1 (or skip event emission to avoid noise on bulk import).
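The validate-and-dedupe step and the count verification can be sketched as two small functions. The hashing scheme is defined in DATA_MODEL §3.4 and is not reproduced here, so the hasher is injected; the trim-and-lowercase normalisation and all type names are assumptions.

```typescript
interface SuppressionCsvRow {
  tenantId: string;
  channel: string;
  address: string;
}

interface HashedSuppression {
  tenantId: string;
  channel: string;
  addressHash: string;
}

// The real hash is specified in DATA_MODEL §3.4; callers pass it in.
type AddressHasher = (normalisedAddress: string) => string;

// Drops blank addresses and duplicates on (tenant, channel, address_hash).
function dedupeSuppressions(rows: SuppressionCsvRow[], hash: AddressHasher): HashedSuppression[] {
  const seen: Record<string, boolean> = {};
  const out: HashedSuppression[] = [];
  rows.forEach((row) => {
    const normalised = row.address.trim().toLowerCase(); // assumed normalisation
    if (normalised === "") return;
    const addressHash = hash(normalised);
    const key = [row.tenantId, row.channel, addressHash].join("|");
    if (seen[key] === true) return;
    seen[key] = true;
    out.push({ tenantId: row.tenantId, channel: row.channel, addressHash: addressHash });
  });
  return out;
}

// Verify step: imported count must be within ±0.1 % of the provider total.
function countWithinTolerance(imported: number, providerTotal: number, tolerance: number = 0.001): boolean {
  if (providerTotal === 0) return imported === 0;
  return Math.abs(imported - providerTotal) / providerTotal <= tolerance;
}
```

Deduplicating on the same key the import endpoint uses for idempotency keeps the validator's count comparable with what the endpoint actually stores.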
6.2 Template import
Templates may be imported from a tenant's previous CMS. Process:
1. Tenant uploads template package (.zip with handlebars + assets).
2. CI-grade validator runs in worker:
- schema check (variables declared, no unsafe helpers)
- rendering smoke against synthetic variables
- safety scan (links allowlist, no inline script, no remote-load assets)
3. Status set to `draft` per locale.
4. Tenant approves; status → `published`.
5. Optionally route through ai-orchestrator-service for tone/translation HITL.
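The safety scan in step 2 could start from checks like the ones below. This is a deliberately simple regex-based sketch, not a substitute for a real HTML parser or the actual validator; the function name and violation messages are hypothetical.

```typescript
interface TemplateScanResult {
  ok: boolean;
  violations: string[];
}

// Flags inline <script>, remotely loaded assets (absolute http/https `src`),
// and absolute links whose host is not on the allowlist.
function scanTemplateHtml(html: string, allowedLinkHosts: string[]): TemplateScanResult {
  const violations: string[] = [];
  if (/<script\b/i.test(html)) violations.push("inline script");

  const attr = /\b(href|src)\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null = attr.exec(html);
  while (match !== null) {
    const name = match[1].toLowerCase();
    const url = match[2];
    const host = /^https?:\/\/([^\/:?#]+)/i.exec(url);
    if (host !== null) {
      if (name === "src") {
        violations.push("remote-load asset: " + url);
      } else if (allowedLinkHosts.indexOf(host[1].toLowerCase()) === -1) {
        violations.push("link not on allowlist: " + url);
      }
    }
    match = attr.exec(html);
  }
  return { ok: violations.length === 0, violations: violations };
}
```

Handlebars placeholders such as `href="{{someUrl}}"` are not absolute URLs and pass through this check untouched, so variable-driven links need a separate rule at render time.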
7. Regional rollout (cross-region migration)
Follows the platform regional rollout pattern.
Phase A — Provision new region (e.g., me-central1)
Deploy infra (Cloud SQL, Memorystore, Pub/Sub, GCS, Cloud Run) via Terraform.
Apply migrations to fresh DB.
Verify smoke tests in isolation.
Phase B — Read-only shadow
Subscribe new region to global Pub/Sub feed (read-only ingestion).
Validate projections match origin region.
Phase C — Tenant pinning
For each ME tenant, set residency = me-central1 in tenant-service.
Tenant traffic begins routing to the new region via global LB.
Confirm enqueue/dispatch metrics and SLOs in new region.
Phase D — Decommission old replica (if any)
Crypto-shred ME-tenant data in origin region.
Audit-trail the deletion.
8. Tenant data residency move
Used when an existing tenant must be moved between regions (e.g., legal change, data-sovereignty enforcement).
1. Freeze tenant traffic: tenant-service emits tenant.frozen event; notification-service rejects new POSTs with 423 Locked for that tenant.
2. Drain inflight: workers continue dispatching queued items until the queue is empty (max 30 min); items still unsent at the deadline are failed fast.
3. Snapshot: pg_dump tenant rows under the app role with --enable-row-security so RLS is enforced (by default pg_dump turns row security off and errors out for roles that cannot bypass it) → encrypted to GCS in source region.
4. Transfer: bucket-to-bucket copy with VPC-SC perimeter; CMEK re-wrapping in destination region.
5. Apply: COPY into destination tenant-local tables with idempotent UPSERT on PKs.
6. Verify: row-count + checksum reconciliation report; integration test suite re-runs against destination.
7. Re-route: tenant-service updates tenant.residency = <new region>.
8. Unfreeze: tenant-service emits tenant.unfrozen; notification-service resumes accepting POSTs in destination region.
9. Crypto-shred: source-region tenant rows have their KMS DEKs revoked; rows scheduled for hard-delete after 30-day rollback window.
Estimated downtime per tenant: < 30 min, executed in a maintenance window.
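The reconciliation report in step 6 reduces to comparing per-table digests from both regions. The digest shape below is hypothetical, and how the checksum is computed is left open.

```typescript
interface TableDigest {
  table: string;
  rowCount: number;
  checksum: string; // however the source and destination agree to compute it
}

interface DigestMismatch {
  table: string;
  reason: string;
}

// Compares every source table against the destination; an empty result means
// the move can proceed to re-routing.
function reconcileDigests(source: TableDigest[], destination: TableDigest[]): DigestMismatch[] {
  const destByTable: Record<string, TableDigest | undefined> = {};
  destination.forEach((d) => { destByTable[d.table] = d; });

  const mismatches: DigestMismatch[] = [];
  source.forEach((s) => {
    const d = destByTable[s.table];
    if (d === undefined) {
      mismatches.push({ table: s.table, reason: "missing in destination" });
    } else if (d.rowCount !== s.rowCount) {
      mismatches.push({ table: s.table, reason: `row count ${s.rowCount} != ${d.rowCount}` });
    } else if (d.checksum !== s.checksum) {
      mismatches.push({ table: s.table, reason: "checksum differs" });
    }
  });
  return mismatches;
}
```

Row counts are compared before checksums because a count mismatch is cheaper to diagnose and makes the checksum result meaningless anyway.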
9. Rollback strategy
Default position: roll forward, not back. Migrations are forward-only and the down files are an emergency tool only.
| Failure window | Strategy |
|---|---|
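| Migration fails mid-run | The migration transaction aborts (Postgres DDL is transactional); service keeps running on prior schema. Statements that cannot run inside a transaction, such as CREATE INDEX CONCURRENTLY, may leave an INVALID index that must be dropped before retrying. Investigate, fix, re-deploy migration. |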
| Migration succeeds but app is unhappy | Roll back the app deploy, not the schema. Schema must remain backwards-compatible per Expand-Migrate-Contract. |
| Schema drift after partial backfill | Pause backfill; consumers still operate; investigate. |
| Catastrophic data corruption (e.g., RLS bug) | PITR restore to a clone; reconcile via diff + delta-replay; cut over via feature flag. Communicate via incident channel. |
The runbook for each scenario lives in FAILURE_MODES (F-NTF-13 for render rollback, F-NTF-04 for DB issues, F-NTF-28 for regional outage).
10. Migration scheduling & change windows
| Change class | Allowed window | Approval |
|---|---|---|
| Additive index, additive column | Anytime | PR review |
| Backfill | Off-peak (00:00–05:00 region-local) | PR + on-call ack |
| Partition rotation | Automated daily | none (alert on failure) |
| RLS policy change | Off-peak | Security lead + service tech lead |
| Event schema breaking change | Quarterly window | Platform architect + all consumer leads |
| Tenant residency move | Tenant-coordinated maintenance | Tenant ops + service tech lead |
| Regional rollout | Phased over 2 wk | Platform architect + CTO |
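The off-peak window for backfills can be enforced in tooling with a region-local time check. In this sketch the IANA time zone passed in is an assumption supplied by the caller (the plan does not map regions to time zones), and the window is treated as 00:00 inclusive to 05:00 exclusive.

```typescript
// Hour of day (0-23) in the given IANA time zone.
function regionLocalHour(at: Date, timeZone: string): number {
  const text = new Intl.DateTimeFormat("en-US", {
    hour: "numeric",
    hour12: false,
    timeZone: timeZone,
  }).format(at);
  // Some runtimes render midnight as "24"; the modulo folds it back to 0.
  return parseInt(text, 10) % 24;
}

// Off-peak is 00:00-05:00 region-local; 05:00:00 itself is outside the window.
function isOffPeak(at: Date, timeZone: string): boolean {
  const hour = regionLocalHour(at, timeZone);
  return hour >= 0 && hour < 5;
}
```

A backfill runner could call `isOffPeak` before each batch and pause when the window closes, rather than checking only once at start.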
All migrations require:
- A Migration Plan ticket with summary, risk class, plan, verification, rollback note.
- Successful execution in dev, then in staging (against the daily-refreshed snapshot), before production.
- Post-migration verification report attached to the ticket.
11. Verification checklist (run after every migration)
- `pg_stat_user_tables` shows expected row counts on touched tables.
- No replication lag spike on read replicas.
- CDC outbox publisher healthy (lag < 1 s p95).
- No new error rate on `notif.api.requests_total` or worker job series.
- No new pattern in `pg_stat_statements` indicating unintended query plan change.
- OpenTelemetry trace samples show expected span shape.
- Synthetic monitor green for 1 h post-migration.
- Audit-log entry recorded with migration ID and operator.
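Some of these checks are numeric and can be evaluated mechanically. The sketch below covers three of them (CDC lag p95, synthetic monitor duration, audit entry); the signal names are hypothetical and the percentile uses the nearest-rank method, which may differ from what the metrics backend reports.

```typescript
// Nearest-rank percentile over samples; `pct` is an integer percentage (e.g. 95).
function percentile(samples: number[], pct: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((pct * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical signal bundle collected after a migration.
interface PostMigrationSignals {
  cdcLagMsSamples: number[];
  syntheticGreenMinutes: number;
  auditEntryRecorded: boolean;
}

// Returns the names of the checks that failed; an empty list means all passed.
function failedChecks(signals: PostMigrationSignals): string[] {
  const failures: string[] = [];
  if (percentile(signals.cdcLagMsSamples, 95) >= 1000) failures.push("cdc lag p95 >= 1 s");
  if (signals.syntheticGreenMinutes < 60) failures.push("synthetic monitor green < 1 h");
  if (!signals.auditEntryRecorded) failures.push("audit-log entry missing");
  return failures;
}
```

The remaining checklist items (query-plan changes, span shape) need human review and are intentionally left out of the automated part.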
12. Inventory of pending forward-looking migrations
| Target release | Description | Status |
|---|---|---|
| 1.1 | Add delivery_attempts.network_latency_ms for vendor performance tracking | planned |
| 1.1 | Add WebSocket presence projection (recipient_presence) | planned |
| 1.2 | Voice/IVR aggregate fields (Phase 3) on notifications and delivery_attempts | planned |
| 1.2 | Tenant-self-serve template marketplace tables | planned |
| 1.3 | Promote sentiment-classifier metadata into a queryable inbound_replies table | planned |
| 1.3 | Add template_versions.aiQualityScore (eval pipeline) | planned |
| 2.0 | New event taxonomy version v2 (mostly additive consolidation; see EVENT_SCHEMAS) | planned (≥ 2026-Q4) |
Each item lands as its own PR with the full Expand-Migrate-Contract plan attached.
13. Cross-service migration coordination
Migrations that touch shared event schemas (especially with reservation-service, billing-service, lock-integration-service, ai-orchestrator-service) are coordinated via the platform Schema Council:
- Weekly review of proposed event changes.
- Pact contract validation gate per consumer.
- Shared deprecation calendar.
- Change-freeze windows around regulatory deadlines and tenant onboarding cohorts.