MIGRATION_PLAN — notification-service

Sibling: DATA_MODEL · DEPLOYMENT_TOPOLOGY · EVENT_SCHEMAS · SYNC_CONTRACT · API_CONTRACTS

Strategic anchors: docs/06-data-models · docs/04-event-driven-architecture · docs/standards/SERVICE_TEMPLATE

This plan covers schema, event, API, and operational migrations that affect notification-service — both forward-only changes the team will execute and the playbooks for one-off cutovers (greenfield bootstrap, regional rollout, importing tenants from third-party providers).

Forward-only is the rule. Every change is shipped through the expand → migrate → contract pattern; no destructive change ships before the contract phase is verified across all consumers.


1. Migration philosophy

  1. Expand → Migrate → Contract: add new structure, dual-write/read while consumers move, then remove old structure.
  2. Backwards-compatible event evolution: only additive changes inside vN; breaking changes ship as vN+1 with vN retained for the deprecation window (see docs/04-event-driven-architecture §11).
  3. API versioning: /api/v1 is stable. Breaking surface changes go to /api/v2; deprecation period ≥ 180 days; Sunset/Deprecation headers per docs/05-api-design.
  4. Idempotent migrations: every Drizzle migration tolerates partial application and re-execution.
  5. Online by default: no whole-table locks; use partition swaps, CREATE INDEX CONCURRENTLY, lazy column backfill.
  6. Tested in staging snapshot: every migration runs against a daily-refreshed staging snapshot before production.
  7. Tenant-aware: tenant scoping must be preserved at every step; backfills never bypass RLS unless explicitly justified by a runbook entry.

2. Greenfield bootstrap

Order of operations when standing up notification-service in a fresh region.

┌──────────────────────────────────────────────────────────────────────────────┐
│ 0. Provision GCP project, VPC, Cloud SQL, Memorystore, Pub/Sub, GCS, KMS │
│ 1. Run migrations 0001..N (Drizzle) │
│ 2. Seed channel registry (vendor placeholders, not real keys) │
│ 3. Seed platform-scoped templates (versioned, status=draft) │
│ 4. Wire Pub/Sub topics & subscriptions per EVENT_SCHEMAS §2 │
│ 5. Deploy Cloud Run services (canary 5 % → 50 % → 100 %) │
│ 6. Run smoke tests + synthetic monitor │
│ 7. Onboard first tenant (provisioning workflow, see §6) │
└──────────────────────────────────────────────────────────────────────────────┘

2.1 Initial migration set (0001_init to 00xx)

| Migration | Purpose |
| --- | --- |
| 0001_init_schema_namespaces | Create notification schema, set search_path, set default tablespace. |
| 0002_tenants_local_projection | Local projection of tenant.* events. |
| 0003_templates_and_versions | templates, template_versions tables; check constraints on status lifecycle. |
| 0004_recipients_and_preferences | recipients, notification_preferences; default-deny preference rows. |
| 0005_notifications_partitioned | notifications parent + first 6 monthly partitions; CDC publication. |
| 0006_delivery_attempts_partitioned | delivery_attempts parent + 12 monthly partitions. |
| 0007_suppressions | suppressions table; partial unique index per (tenant, channel, address_hash). |
| 0008_channels_and_credentials | channels, channel_credentials (KMS-wrapped DEK columns). |
| 0009_webhook_inbound | webhook_inbound, webhook_inbound_events partitioned. |
| 0010_dispatch_batches | dispatch_batches (batch metadata + counters). |
| 0011_scheduled_workqueue | notification_scheduled work projection + index on (due_at). |
| 0012_trigger_map | notification_trigger_map; bootstrap rows per consumed event subject. |
| 0013_outbox_inbox_idempotency | Standard outbox/inbox/idempotency tables. |
| 0014_opt_out_tokens | opt_out_tokens with TTL index. |
| 0015_rls_policies | Enable RLS + define tenant_isolation policy on every tenant-scoped table. |
| 0016_partition_cron_helpers | pg_partman registration; partman.maintenance schedule. |
| 0017_seed_platform_templates | INSERT canonical platform templates (status=draft). |
| 0018_indexes_and_stats | Optional indexes that would slow seeding if added earlier; analyse hot tables. |

Each migration is paired with a down_NNNN.sql that is never executed automatically — used only for emergency rollback under runbook approval (see §9).


3. Schema evolution playbook (Postgres)

For every change category, the canonical pattern.

3.1 Add a column

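1. Migration: ADD COLUMN ... NULL with no DEFAULT (a volatile default forces a table rewrite; a constant default is metadata-only on PostgreSQL 11+ and is deferred to step 5).
2. Application: write the new column on every INSERT/UPDATE.
3. Backfill job: chunked UPDATE in 5 000-row batches, throttled to <5 % CPU.
4. Migration: SET NOT NULL once the backfill is complete and the verifier passes. To avoid a full-table scan under ACCESS EXCLUSIVE lock, first add a CHECK (col IS NOT NULL) constraint as NOT VALID and VALIDATE it; PostgreSQL 12+ then skips the scan.
5. Migration: optional DEFAULT for new rows.

When backfilling partitioned tables, run pg_partman maintenance (run_maintenance_proc()) separately from the backfill job.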
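
The chunked backfill in step 3 can be sketched as below. This is a minimal illustration, not the team's backfill job: it assumes a numeric, roughly dense primary key, and `planBatches`, `runBackfill`, and the `BatchExecutor` callback are hypothetical names. A fixed pause between batches stands in for the CPU throttle, which a real job would drive from database metrics.

```typescript
// One inclusive id range per UPDATE statement.
type IdRange = { fromId: number; toId: number };

// Split [minId, maxId] into ranges of at most batchSize ids (5 000 per the playbook).
function planBatches(minId: number, maxId: number, batchSize = 5000): IdRange[] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const ranges: IdRange[] = [];
  for (let fromId = minId; fromId <= maxId; fromId += batchSize) {
    ranges.push({ fromId, toId: Math.min(fromId + batchSize - 1, maxId) });
  }
  return ranges;
}

// Hypothetical executor: runs the UPDATE for one range and resolves to the rows changed.
// Restricting the UPDATE to rows where the new column IS NULL keeps re-runs harmless.
type BatchExecutor = (range: IdRange) => Promise<number>;

// Run the batches sequentially, pausing between them as a crude throttle.
async function runBackfill(
  ranges: IdRange[],
  exec: BatchExecutor,
  pauseMs = 0,
): Promise<number> {
  let updated = 0;
  for (const range of ranges) {
    updated += await exec(range);
    if (pauseMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, pauseMs));
    }
  }
  return updated;
}
```

Keeping the plan separate from the executor lets the verifier in step 4 re-use the same ranges to check for remaining NULLs.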

3.2 Rename a column

1. ADD COLUMN <new>; copy old → new on writes.
2. Backfill from old.
3. Read from <new> with COALESCE(<new>, <old>) for one release.
4. Drop coalesce; read only <new>.
5. Drop <old> (contract).
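
On the application side, steps 1 and 3 amount to a dual-write and a fallback read. The sketch below assumes a text column being renamed from `old_name` to `new_name`; both column names and both helper functions are placeholders for illustration.

```typescript
// Hypothetical row shape while both columns exist.
interface RowDuringRename {
  old_name: string | null;
  new_name: string | null;
}

// Step 1: every write populates both columns so either reader sees the value.
function toDualWrite(value: string): RowDuringRename {
  return { old_name: value, new_name: value };
}

// Step 3: application-side equivalent of COALESCE(new_name, old_name).
// Rows not yet backfilled still resolve through the old column.
function readRenamed(row: RowDuringRename): string | null {
  return row.new_name ?? row.old_name;
}
```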

3.3 Change a column type / constraint

1. ADD COLUMN new_typed.
2. Dual-write parsed value into new_typed.
3. Backfill in chunks.
4. Swap reads.
5. Drop old.

Never ALTER COLUMN ... USING on hot tables.

3.4 Add an index

CREATE INDEX CONCURRENTLY idx_xyz ON notification.notifications(...);

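Follow with ANALYZE on the table. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, and a failed build leaves an INVALID index that must be dropped before retrying. Track creation in monitoring; skip during peak hours.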
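
To keep an index migration re-runnable, the statements can be chosen from the index's current state (read from pg_index.indisvalid). The following is a sketch under assumptions: `IndexSpec`, `indexStatements`, and the column names in the usage are hypothetical, and the statements are expected to run outside a transaction.

```typescript
// State of the index before the migration runs; "missing" means no pg_index row.
type IndexState = "missing" | "invalid" | "valid";

interface IndexSpec {
  schema: string;
  table: string;
  name: string;
  columns: string[];
}

// Standard SQL identifier quoting: wrap in double quotes, double any embedded quote.
function quoteIdent(ident: string): string {
  return '"' + ident.replace(/"/g, '""') + '"';
}

// Statements to run in order for a given starting state.
function indexStatements(spec: IndexSpec, state: IndexState): string[] {
  if (state === "valid") return []; // already built: re-running is a no-op
  const index = `${quoteIdent(spec.schema)}.${quoteIdent(spec.name)}`;
  const table = `${quoteIdent(spec.schema)}.${quoteIdent(spec.table)}`;
  const columns = spec.columns.map(quoteIdent).join(", ");
  const statements: string[] = [];
  if (state === "invalid") {
    // A failed concurrent build leaves an INVALID index behind; remove it first.
    statements.push(`DROP INDEX CONCURRENTLY ${index}`);
  }
  // The index name is unqualified: Postgres creates it in the table's schema.
  statements.push(
    `CREATE INDEX CONCURRENTLY ${quoteIdent(spec.name)} ON ${table} (${columns})`,
  );
  statements.push(`ANALYZE ${table}`);
  return statements;
}
```

For example, `indexStatements({ schema: "notification", table: "notifications", name: "idx_xyz", columns: ["tenant_id", "created_at"] }, "invalid")` yields a drop, a create, and an ANALYZE, in that order.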

3.5 Add a partition

Automated by pg_partman.maintenance cron; manually invoked when adding a new yearly partition pre-emptively. Always create +3 months ahead of insert pressure.
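
A pre-flight check for the "+3 months ahead" rule might look like the sketch below. The `<parent>_pYYYY_MM` naming scheme and both function names are assumptions for illustration; the actual naming is whatever pg_partman is configured to produce.

```typescript
// Names of the monthly partitions that should exist: the current month plus
// `monthsAhead` future months. Naming scheme is hypothetical.
function monthlyPartitionsAhead(parent: string, now: Date, monthsAhead = 3): string[] {
  const names: string[] = [];
  for (let offset = 0; offset <= monthsAhead; offset++) {
    // Date.UTC normalises month overflow, so month 12 becomes January of next year.
    const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + offset, 1));
    const month = ("0" + (d.getUTCMonth() + 1)).slice(-2);
    names.push(`${parent}_p${d.getUTCFullYear()}_${month}`);
  }
  return names;
}

// Partitions that are required but not present in the catalog listing.
function missingPartitions(required: string[], existing: string[]): string[] {
  return required.filter((name) => existing.indexOf(name) === -1);
}
```

A non-empty result from `missingPartitions` is the signal to invoke maintenance manually or raise an alert.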

3.6 Drop a partition

Detach first with ALTER TABLE <parent> DETACH PARTITION <child> CONCURRENTLY (PostgreSQL 14+; cannot run inside a transaction block), copy/archive to GCS as Parquet, then DROP TABLE. Retention windows per DATA_MODEL §10.

3.7 Modify RLS policy

1. Create new policy with new predicate.
2. Test with synthetic tenant in staging.
3. Drop old policy in the same migration, and the same transaction, that creates the new one (RLS predicates are evaluated at query time, so the switch is atomic at commit).

Never operate without an RLS policy on tenant tables — even briefly.

3.8 Move data between tables

Use logical replication (pglogical) or batched COPY into the new table; cut over with a feature flag controlling which table the application reads from.


4. Event schema evolution

4.1 Additive change inside vN

  • Add optional field; document as required-from-version-X in EVENT_SCHEMAS.
  • Update producer; consumers may opt in lazily.

4.2 Breaking change → vN+1

T0 Publish v1 only.
T1 Producer publishes v1 AND v2 (dual-publish via outbox fan-out).
T2 Consumers migrate to v2 one at a time; emit `consumer.migration.completed` event.
T3 When all consumers reported migrated AND ≥ 30 d elapsed, producer stops publishing v1.
T4 Subscriptions on v1 deleted; topic archived after retention window.
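
The timeline above can be expressed as two small decisions: which versions the outbox fans out in each phase, and whether the T3 gate has been met. This is a sketch only; the function names and the `OutboxRow` shape are hypothetical, and the 30 days are assumed to be measured from the start of dual-publish (T1), which the plan does not state explicitly.

```typescript
type RolloutPhase = "T0" | "T1" | "T2" | "T3" | "T4";

interface OutboxRow {
  subject: string;
  payload: unknown;
}

// Versions the producer publishes in each phase of the timeline.
function versionsToPublish(phase: RolloutPhase): number[] {
  switch (phase) {
    case "T0":
      return [1];
    case "T1":
    case "T2":
      return [1, 2]; // dual-publish while consumers migrate
    default:
      return [2]; // T3 onwards: v1 retired
  }
}

// One outbox row per published version for a single domain event.
function fanOut(
  baseSubject: string,
  phase: RolloutPhase,
  toPayload: (version: number) => unknown,
): OutboxRow[] {
  return versionsToPublish(phase).map((version) => ({
    subject: `${baseSubject}.v${version}`,
    payload: toPayload(version),
  }));
}

// T3 gate: every known consumer has reported migrated AND at least 30 days have elapsed.
function canStopV1(
  consumers: string[],
  migrated: string[],
  dualPublishStart: Date,
  now: Date,
): boolean {
  const allMigrated = consumers.every((c) => migrated.indexOf(c) !== -1);
  const elapsedDays = (now.getTime() - dualPublishStart.getTime()) / (24 * 60 * 60 * 1000);
  return allMigrated && elapsedDays >= 30;
}
```

In practice the `migrated` list would be built from the `consumer.migration.completed` events emitted at T2.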

4.3 Subject rename

Treat as a breaking change. The new subject coexists with the old subject through dual-publish. Document mapping in EVENT_SCHEMAS section "Renames".

4.4 Retention/partitioning change

Provision new topic with new partitioning, dual-publish, switch consumer subscriptions, decommission old topic. Never alter partitioning of a live high-volume topic.


5. API evolution

| Change | Pattern |
| --- | --- |
| Add a field to a request body | Optional, default = legacy behaviour. |
| Add a field to a response body | Always safe; clients ignore unknown. |
| Remove a field | Mark deprecated in OpenAPI; emit Deprecation + Sunset headers; remove in /api/v2. |
| Change an enum value | Add new value first; producers migrate; old value removed in next major. |
| Tighten validation | Roll out behind a feature flag; communicate to BFFs; log shadow rejections for ≥ 14 d. |
| Replace endpoint semantics | New path under /api/v2; redirect with 308 from /v1 only when semantically equivalent. |
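
As an illustration of the "Remove a field" row, the helper below builds the two headers and enforces the 180-day deprecation period from §1. It is a sketch, not the format mandated by docs/05-api-design: it assumes the Sunset value is an HTTP-date (RFC 8594) and the Deprecation value is a structured-field date of the form `@<unix-seconds>` (RFC 9745). The function and interface names are hypothetical.

```typescript
const MIN_DEPRECATION_DAYS = 180;

interface DeprecationHeaders {
  Deprecation: string;
  Sunset: string;
}

// Build response headers for a deprecated field or endpoint.
// Throws if the sunset date falls inside the minimum deprecation period.
function deprecationHeaders(deprecatedAt: Date, sunsetAt: Date): DeprecationHeaders {
  const days = (sunsetAt.getTime() - deprecatedAt.getTime()) / (24 * 60 * 60 * 1000);
  if (days < MIN_DEPRECATION_DAYS) {
    throw new Error(`sunset must be at least ${MIN_DEPRECATION_DAYS} days after deprecation`);
  }
  return {
    // Assumed format: "@" followed by seconds since the Unix epoch.
    Deprecation: `@${Math.floor(deprecatedAt.getTime() / 1000)}`,
    // toUTCString() produces an IMF-fixdate such as "Tue, 01 Jul 2025 00:00:00 GMT".
    Sunset: sunsetAt.toUTCString(),
  };
}
```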

OpenAPI spec is the contract; every change ships through PR review with the bff-backoffice-service, bff-tenant-booking-service, and bff-public-marketing-service owners as required reviewers.


6. Tenant onboarding & data import

When a new tenant joins the platform.

┌─────────────────────────────────────────────────────────────────────────────┐
│ Step 1. tenant-service emits tenant.created.v1 │
│ Step 2. notification-service projects tenant into tenants_local │
│ Step 3. Platform-scoped templates auto-cloned to tenant scope (draft) │
│ Step 4. Tenant configures channel credentials (Secret Manager API) │
│ Step 5. Tenant verifies DKIM/SPF/DMARC via onboarding wizard │
│ Step 6. Tenant approves transactional templates (HITL if AI-drafted) │
│ Step 7. notification-service marks tenant `provisioned` │
│ Step 8. Optional: import historical notification audit from prior provider │
└─────────────────────────────────────────────────────────────────────────────┘

6.1 Importing from a previous provider (e.g., SendGrid suppressions)

| Step | Tool | Notes |
| --- | --- | --- |
| Export suppressions from previous provider | provider CSV export | Includes bounces + complaints + unsubscribes |
| Validate format & dedupe | tools/import/suppressions-csv-validate.ts | Hashes address per DATA_MODEL §3.4 |
| Import via internal admin API | POST /internal/v1/suppressions/import (rate-limited) | Idempotent on (tenant, channel, address_hash) |
| Verify count | GET /internal/v1/suppressions/stats | Compare with provider total ± 0.1 % |
| Emit notification.suppressed.v1 per row | producer fans out via outbox | Optional; governed by import.emitEvents flag |

Recipient preference imports follow the same pattern: validate → import → emit preferences.updated.v1 (or suppress events to avoid noise on bulk import).
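
The dedupe and count-verification steps can be sketched as follows. The hashing shown (trim, lowercase, SHA-256 hex) is an assumption standing in for the scheme defined in DATA_MODEL §3.4, and the type and function names are hypothetical; this is not the contents of tools/import/suppressions-csv-validate.ts.

```typescript
import { createHash } from "node:crypto";

interface SuppressionRow {
  tenantId: string;
  channel: string;
  address: string;
}

interface HashedSuppression {
  tenantId: string;
  channel: string;
  addressHash: string;
}

// Assumed normalisation: trim + lowercase, then SHA-256 hex.
function hashAddress(address: string): string {
  return createHash("sha256").update(address.trim().toLowerCase()).digest("hex");
}

// Keep the first row per (tenant, channel, address_hash), matching the import's idempotency key.
function dedupeSuppressions(rows: SuppressionRow[]): HashedSuppression[] {
  const seen = new Set<string>();
  const unique: HashedSuppression[] = [];
  for (const row of rows) {
    const addressHash = hashAddress(row.address);
    const key = JSON.stringify([row.tenantId, row.channel, addressHash]);
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push({ tenantId: row.tenantId, channel: row.channel, addressHash });
  }
  return unique;
}

// "Verify count": imported total within ± 0.1 % of the provider's total.
function countsMatch(imported: number, providerTotal: number, tolerance = 0.001): boolean {
  if (providerTotal === 0) return imported === 0;
  return Math.abs(imported - providerTotal) / providerTotal <= tolerance;
}
```

Deduplicating on the same key the import endpoint is idempotent on keeps the local count comparable with the stats endpoint.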

6.2 Template import

Templates may be imported from a tenant's previous CMS. Process:

1. Tenant uploads template package (.zip with handlebars + assets).
2. CI-grade validator runs in worker:
- schema check (variables declared, no unsafe helpers)
- rendering smoke against synthetic variables
- safety scan (links allowlist, no inline script, no remote-load assets)
3. Status set to `draft` per locale.
4. Tenant approves; status → `published`.
5. Optionally route through ai-orchestrator-service for tone/translation HITL.
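
The safety scan in step 2 is sketched below in a deliberately naive, regex-based form to show the three checks. It is an illustration only: `scanTemplate` is a hypothetical name, inline event-handler attributes are not covered, and a production validator would parse the HTML rather than pattern-match it.

```typescript
interface ScanResult {
  ok: boolean;
  violations: string[];
}

// Hostname of an absolute http(s) URL, or null for anything else
// (relative links, mailto:, unresolved handlebars placeholders).
function httpHost(url: string): string | null {
  try {
    const parsed = new URL(url);
    return parsed.protocol === "http:" || parsed.protocol === "https:" ? parsed.hostname : null;
  } catch {
    return null;
  }
}

function scanTemplate(html: string, allowedHosts: string[]): ScanResult {
  const violations: string[] = [];

  // No inline script.
  if (/<script\b/i.test(html)) violations.push("inline-script");

  // Links allowlist: every absolute http(s) href must target an allowed host.
  const hrefRe = /href\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefRe.exec(html)) !== null) {
    const host = httpHost(match[1] ?? "");
    if (host !== null && allowedHosts.indexOf(host) === -1) {
      violations.push(`link-not-allowlisted:${host}`);
    }
  }

  // No remote-load assets: src must not be an absolute or protocol-relative URL.
  const srcRe = /src\s*=\s*["']([^"']+)["']/gi;
  while ((match = srcRe.exec(html)) !== null) {
    const src = match[1] ?? "";
    if (/^(https?:)?\/\//i.test(src)) violations.push(`remote-asset:${src}`);
  }

  return { ok: violations.length === 0, violations };
}
```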

7. Regional rollout (cross-region migration)

Follows the platform regional rollout pattern.

Phase A — Provision new region (e.g., me-central1)
Deploy infra (Cloud SQL, Memorystore, Pub/Sub, GCS, Cloud Run) via Terraform.
Apply migrations to fresh DB.
Verify smoke tests in isolation.

Phase B — Read-only shadow
Subscribe new region to global Pub/Sub feed (read-only ingestion).
Validate projections match origin region.

Phase C — Tenant pinning
For each ME tenant, set residency = me-central1 in tenant-service.
Tenant traffic begins routing to the new region via global LB.
Confirm enqueue/dispatch metrics and SLOs in new region.

Phase D — Decommission old replica (if any)
Crypto-shred ME-tenant data in origin region.
Audit-trail the deletion.

8. Tenant data residency move

Used when an existing tenant must be moved between regions (e.g., legal change, data-sovereignty enforcement).

1. Freeze tenant inflight: tenant-service emits tenant.frozen event; notification-service rejects new POSTs with 423 LOCKED for that tenant.
2. Drain inflight: workers continue dispatching queued items until the queue is empty (max 30 min); fail-fast unsent items.
3. Snapshot: pg_dump tenant rows (RLS-enforced via app role) → encrypted to GCS in source region.
4. Transfer: bucket-to-bucket copy with VPC-SC perimeter; CMEK re-wrapping in destination region.
5. Apply: COPY into destination tenant-local tables with idempotent UPSERT on PKs.
6. Verify: row-count + checksum reconciliation report; integration test suite re-runs against destination.
7. Re-route: tenant-service updates tenant.residency = <new region>.
8. Unfreeze: tenant-service emits tenant.unfrozen; notification-service resumes accepting POSTs in destination region.
9. Crypto-shred: source-region tenant rows have their KMS DEKs revoked; rows scheduled for hard-delete after 30-day rollback window.

Estimated downtime per tenant: < 30 min, executed in a maintenance window.
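
The reconciliation in step 6 can be illustrated with an order-independent digest computed on both sides. This sketch assumes rows fit in memory and carry a string primary key `id`; `TenantRow`, `tableDigest`, and `reconcile` are hypothetical names, and a real report would stream rows per table.

```typescript
import { createHash } from "node:crypto";

interface TenantRow {
  id: string;
  [column: string]: string | number | boolean | null;
}

interface TableDigest {
  count: number;
  checksum: string;
}

// Digest that ignores row order and column order: sort rows by primary key,
// sort column names, then hash a canonical JSON line per row.
function tableDigest(rows: TenantRow[]): TableDigest {
  const hash = createHash("sha256");
  const sorted = rows.slice().sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  for (const row of sorted) {
    const columns = Object.keys(row).sort();
    hash.update(JSON.stringify(columns.map((c) => [c, row[c]])));
    hash.update("\n");
  }
  return { count: rows.length, checksum: hash.digest("hex") };
}

// Source and destination reconcile when both row count and checksum agree.
function reconcile(source: TenantRow[], destination: TenantRow[]): boolean {
  const a = tableDigest(source);
  const b = tableDigest(destination);
  return a.count === b.count && a.checksum === b.checksum;
}
```

Comparing counts as well as checksums gives a cheaper first signal when a transfer is truncated.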


9. Rollback strategy

Default position: roll forward, not back. Migrations are forward-only and the down files are an emergency tool only.

| Failure window | Strategy |
| --- | --- |
| Migration fails mid-run | Drizzle's migration transaction aborts and Postgres rolls back the DDL; the service keeps running on the prior schema. Statements that cannot run inside a transaction (e.g. CREATE INDEX CONCURRENTLY) can leave partial state such as an INVALID index, which must be cleaned up first. Investigate, fix, re-deploy the migration. |
| Migration succeeds but app is unhappy | Roll back the app deploy, not the schema. Schema must remain backwards-compatible per Expand-Migrate-Contract. |
| Schema drift after partial backfill | Pause backfill; consumers still operate; investigate. |
| Catastrophic data corruption (e.g., RLS bug) | PITR restore to a clone; reconcile via diff + delta-replay; cut over via feature flag. Communicate via incident channel. |

The runbook for each scenario lives in FAILURE_MODES (F-NTF-13 for render rollback, F-NTF-04 for DB issues, F-NTF-28 for regional outage).


10. Migration scheduling & change windows

| Change class | Allowed window | Approval |
| --- | --- | --- |
| Additive index, additive column | Anytime | PR review |
| Backfill | Off-peak (00:00–05:00 region-local) | PR + on-call ack |
| Partition rotation | Automated daily | none (alert on failure) |
| RLS policy change | Off-peak | Security lead + service tech lead |
| Event schema breaking change | Quarterly window | Platform architect + all consumer leads |
| Tenant residency move | Tenant-coordinated maintenance | Tenant ops + service tech lead |
| Regional rollout | Phased over 2 wk | Platform architect + CTO |
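
A backfill or RLS job could guard itself with an off-peak check like the sketch below. It assumes each region is mapped to an IANA time zone by configuration (the mapping is not defined in this plan) and treats the window as 00:00 inclusive to 05:00 exclusive; the function names are hypothetical.

```typescript
// Hour of day (0-23) at `at` in the given IANA time zone.
function localHour(at: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "2-digit",
    hour12: false,
  }).formatToParts(at);
  for (const part of parts) {
    // Some engines render midnight as "24" with hour12: false, hence the modulo.
    if (part.type === "hour") return Number(part.value) % 24;
  }
  throw new Error("no hour part produced for time zone " + timeZone);
}

// Off-peak window from the table: 00:00-05:00 region-local.
function isOffPeak(at: Date, timeZone: string): boolean {
  const hour = localHour(at, timeZone);
  return hour >= 0 && hour < 5;
}
```

For example, with a region assumed to map to `Asia/Qatar` (UTC+3), 22:30 UTC is 01:30 local and falls inside the window.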

All migrations require:

  • A Migration Plan ticket with summary, risk class, plan, verification, rollback note.
  • Successful execution in dev, then in staging against the daily-refreshed snapshot (see §1), before production.
  • Post-migration verification report attached to the ticket.

11. Verification checklist (run after every migration)

  • pg_stat_user_tables shows expected row counts on touched tables (n_live_tup is an estimate; run COUNT(*) where an exact figure matters).
  • No replication lag spike on read replicas.
  • CDC outbox publisher healthy (lag < 1 s p95).
  • No new error rate on notif.api.requests_total or worker job series.
  • No new pattern in pg_stat_statements indicating unintended query plan change.
  • OpenTelemetry trace samples show expected span shape.
  • Synthetic monitor green for 1 h post-migration.
  • Audit-log entry recorded with migration ID and operator.

12. Inventory of pending forward-looking migrations

| Target release | Description | Status |
| --- | --- | --- |
| 1.1 | Add delivery_attempts.network_latency_ms for vendor performance tracking | planned |
| 1.1 | Add WebSocket presence projection (recipient_presence) | planned |
| 1.2 | Voice/IVR aggregate fields (Phase 3) on notifications and delivery_attempts | planned |
| 1.2 | Tenant-self-serve template marketplace tables | planned |
| 1.3 | Promote sentiment-classifier metadata into a queryable inbound_replies table | planned |
| 1.3 | Add template_versions.aiQualityScore (eval pipeline) | planned |
| 2.0 | New event taxonomy version v2 (mostly additive consolidation; see EVENT_SCHEMAS) | planned (≥ 2026-Q4) |

Each item lands as its own PR with the full Expand-Migrate-Contract plan attached.


13. Cross-service migration coordination

Migrations that touch shared event schemas (especially with reservation-service, billing-service, lock-integration-service, ai-orchestrator-service) are coordinated via the platform Schema Council:

  • Weekly review of proposed event changes.
  • Pact contract validation gate per consumer.
  • Shared deprecation calendar.
  • Change-freeze windows around regulatory deadlines and tenant onboarding cohorts.

14. References