MIGRATION_PLAN — staff-service
Sibling: SERVICE_OVERVIEW · DATA_MODEL · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS
Strategic anchors: 02 §16 Roadmap
How staff-service evolves from MVP through M2 (multi-region active-active) and how migrations are executed safely. Schema changes, event-schema versioning, sync-protocol versioning, and tenant onboarding/offboarding flows are all covered.
1. Service Maturity Milestones
| Milestone | Scope | Target quarter |
|---|---|---|
| M0 | Single-region (me-central1); core staff CRUD; manual shift assignment; PIN clock-in (online + Electron offline); leave; basic reports | Q2-26 |
| M1 | AI shift suggestions (advisory); fairness reports; certification expiry alerts; second region read replica | Q4-26 |
| M2 | Multi-region active-active; tenant-driven data residency; edge anomaly model in Electron; broader BFF surfaces | Q2-27 |
| M3 | Light payroll-export adapter (read-only feed to external payroll); workforce planning beyond 30 d | TBD |
M3 is on the watch list; it requires an ADR before scope expansion (see SERVICE_RISK_REGISTER §R-12).
2. Schema Migrations (Postgres / Flyway)
2.1 Principles
- Forward-only. No rollback migrations; revert via a new forward fix.
- Online migrations. No exclusive locks during business hours; large data fixes done in chunks via staff-cron.
- Backfill jobs are idempotent and resumable on a (tenant_id, last_id) cursor.
- Migrations run in a Cloud Run Job before the API rollout (per DEPLOYMENT_TOPOLOGY §8).
- Three-step column / table rename pattern for breaking changes:
  - Add new column / table; dual-write from app
  - Backfill historical data; switch reads
  - Drop old column / table in a later release
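The three-step pattern can be read as a small state machine over which columns the app writes and which one it reads. The sketch below is illustrative only: the phase names, `ColumnPlan`, and `columnPlan` are hypothetical, and it assumes dual-write continues until the old column is dropped, which the principles above do not state explicitly.

```typescript
// Hypothetical phase names for the three-step rename pattern.
type RenamePhase = "dual_write" | "reads_switched" | "old_dropped";

interface ColumnPlan {
  write: string[]; // columns the app writes in this phase
  read: string;    // column the app reads in this phase
}

function columnPlan(phase: RenamePhase, oldCol: string, newCol: string): ColumnPlan {
  switch (phase) {
    case "dual_write":
      // Step 1: the new column exists; the app writes both and still reads the old one.
      return { write: [oldCol, newCol], read: oldCol };
    case "reads_switched":
      // Step 2: backfill is complete and reads move to the new column.
      // Assumption: dual-write stays on so the old column remains usable until it is dropped.
      return { write: [oldCol, newCol], read: newCol };
    case "old_dropped":
      // Step 3: a later release drops the old column.
      return { write: [newCol], read: newCol };
  }
}
```

For example, `columnPlan("reads_switched", "full_name", "display_name")` reads `display_name` while still writing both columns.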
2.2 Naming
V<NNN>__<snake_case_summary>.sql, e.g. V017__add_staff_certifications_email_alert.sql.
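A naming check along these lines could run in CI. This is a sketch, not the project's actual tooling: `parseMigrationName` is a hypothetical helper, and the three-digit version width is inferred from the V017 example rather than stated as a rule.

```typescript
// Matches V<NNN>__<snake_case_summary>.sql.
// Assumption: NNN is exactly three digits, as in the V017 example.
const MIGRATION_NAME = /^V(\d{3})__([a-z0-9]+(?:_[a-z0-9]+)*)\.sql$/;

interface ParsedMigration {
  version: number;
  summary: string;
}

// Returns the parsed parts, or null when the file name breaks the convention.
function parseMigrationName(fileName: string): ParsedMigration | null {
  const match = MIGRATION_NAME.exec(fileName);
  if (match === null) return null;
  return { version: Number(match[1]), summary: match[2] };
}
```

Names such as `V17__AddColumn.sql` are rejected because of the two-digit version and the non-snake-case summary.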
2.3 Examples
| Change | Pattern |
|---|---|
| Add nullable column | Single forward migration; defaults handled in app |
| Add NOT NULL column | (a) add nullable, (b) backfill in app or cron, (c) enforce with ALTER COLUMN SET NOT NULL in a later migration |
| Rename column | Add new, dual-write, backfill, switch reads, drop old (over 3 releases) |
| Drop column | Mark in DATA_MODEL.md as "deprecated since vN"; remove only after all clients off old version |
| Change index | CREATE INDEX CONCURRENTLY for the new index; DROP INDEX CONCURRENTLY on the old one once the new index is valid and serving queries |
| New table | Forward migration; RLS enabled in same migration |
2.4 RLS gating
Every new tenant-scoped table MUST enable RLS in the migration that creates it; CI test rls-on-every-table.spec.ts blocks the deploy if missing.
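The contents of rls-on-every-table.spec.ts are not shown here; the following is a sketch of how such a check could work. The catalog query relies on `pg_class.relrowsecurity`, which Postgres sets to true when row-level security is enabled on a table. The `rlsViolations` helper and its exemption list for non-tenant-scoped tables are hypothetical.

```typescript
// Lists ordinary tables in the staff schema that do NOT have RLS enabled.
// pg_class.relrowsecurity is the Postgres catalog flag for row-level security.
const TABLES_WITHOUT_RLS_SQL = `
  SELECT c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE n.nspname = 'staff'
    AND c.relkind = 'r'
    AND NOT c.relrowsecurity
`;

// Hypothetical post-processing: drop tables that are explicitly exempt
// (tables that are not tenant-scoped) and report the rest as violations.
function rlsViolations(tablesWithoutRls: string[], exempt: string[]): string[] {
  return tablesWithoutRls
    .filter((name) => exempt.indexOf(name) === -1)
    .sort();
}
```

A test would run the query against a freshly migrated database and fail when `rlsViolations` returns a non-empty list.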
3. Event-Schema Versioning
Per 04 §13:
- The version suffix in the topic is permanent (melmastoon.staff.shift.scheduled.v1).
- Additive fields are emitted on the same version; new fields are nullable / defaulted.
- Breaking changes get a new topic suffix (.v2); both run in parallel for at least one milestone.
- Consumers MUST fall back to .v1 if they don't yet handle .v2.
3.1 Migration cookbook
| Change | Action |
|---|---|
| Add optional field | Bump JSON schema; emit on .v1; document in EVENT_SCHEMAS.md history table |
| Required field added or field renamed | Stand up .v2 topic + producer; dual-publish (write to both for ≥ 1 milestone); migrate consumers; deprecate .v1 |
| Field semantics change (same name) | Treat as breaking change (new .v2) |
| Topic split (e.g., shift.assigned → shift.assigned + shift.reassigned) | Stand up new topics; emit alongside; consumers migrate; old topic kept for 1 milestone |
A topic deprecation requires a deprecation notice to the producer and consumer DRIs at least 30 days before sunset.
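The dual-publish step for a breaking change can be sketched as a thin wrapper around the producer. Everything named here is an assumption for illustration: the `Publish` callback stands in for whatever Pub/Sub client the service uses, and the phase names are hypothetical.

```typescript
// Stand-in for the real publisher; assumed to take a topic name and a payload.
type Publish = (topic: string, payload: object) => void;

// Hypothetical lifecycle of a breaking event change.
type EventPhase = "v1_only" | "dual_publish" | "v2_only";

// Publishes to the versioned topics that are live in the given phase
// and returns the topic names it wrote to.
function publishVersioned(
  publish: Publish,
  baseTopic: string, // e.g. "melmastoon.staff.shift.scheduled"
  phase: EventPhase,
  v1Payload: object,
  v2Payload: object,
): string[] {
  const written: string[] = [];
  if (phase !== "v2_only") {
    publish(baseTopic + ".v1", v1Payload);
    written.push(baseTopic + ".v1");
  }
  if (phase !== "v1_only") {
    publish(baseTopic + ".v2", v2Payload);
    written.push(baseTopic + ".v2");
  }
  return written;
}
```

During the parallel-run milestone the producer stays in `dual_publish`; moving to `v2_only` corresponds to the sunset of the .v1 topic.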
4. Sync-Protocol Versioning
/sync/v1/pull and /sync/v1/push are versioned in the URL path. The Electron client sends Sync-Protocol-Version: 1; the server enforces compatibility and returns Sync-Protocol-Version-Supported: 1,2 once v2 is online.
4.1 Compatibility matrix (target)
| Server release | Client versions accepted |
|---|---|
| Q2-26 (M0) | v1 |
| Q4-26 (M1) | v1 |
| Q2-27 (M2) | v1, v2 (after Electron 2.0 ships) |
| Q4-27 (M2.1) | v2 only (v1 sunset after 6 months' notice) |
The server retains the older protocol implementation behind a feature flag and an integration test until full sunset.
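The header check described in this section can be sketched as a pure function. `checkSyncVersion` and `SyncDecision` are hypothetical names, and how the server rejects an unsupported client (status code, body) is not specified in this document, so the sketch only reports the decision and the header value.

```typescript
interface SyncDecision {
  accepted: boolean;
  // Value for the Sync-Protocol-Version-Supported response header.
  supportedHeader: string;
}

// clientHeader is the raw Sync-Protocol-Version request header (may be absent).
// supported is the list of protocol versions the current server release accepts.
function checkSyncVersion(clientHeader: string | undefined, supported: number[]): SyncDecision {
  const supportedHeader = supported.join(",");
  // Missing or non-numeric headers are never accepted.
  if (clientHeader === undefined || !/^\d+$/.test(clientHeader)) {
    return { accepted: false, supportedHeader };
  }
  const accepted = supported.indexOf(Number(clientHeader)) !== -1;
  return { accepted, supportedHeader };
}
```

With `supported = [1, 2]` this matches the Q2-27 row of the matrix; passing `[2]` models the state after v1 sunset.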
4.2 Conflict-policy changes
A change of conflict policy for any field-class is a breaking sync change and requires a new sync version. Documented per SYNC_CONTRACT §3.
5. Data Backfills
Run as Cloud Run Jobs invoked by staff-cron Pub/Sub messages or by manual operator command. Standard structure:
```typescript
async function backfill<TRow>(opts: {
  cursorKey: string;
  fetch: (after: TRow | null, limit: number) => Promise<TRow[]>;
  apply: (row: TRow) => Promise<void>;
  batchSize: number;
}): Promise<void>
```
The cursor is persisted in staff.backfill_cursors(name TEXT PRIMARY KEY, last_value TEXT, updated_at TIMESTAMPTZ). Jobs are restartable, idempotent, and emit progress metrics.
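A minimal sketch of a body for this shape follows. It departs from the signature above in two labeled ways: `fetch` takes the persisted cursor text instead of the last row, and `cursorOf` and `store` are added so the cursor can round-trip through staff.backfill_cursors. `CursorStore`, `BackfillOpts`, and `cursorOf` are hypothetical names, and progress metrics are omitted.

```typescript
// Hypothetical persistence port over staff.backfill_cursors (name, last_value).
interface CursorStore {
  load(name: string): Promise<string | null>;
  save(name: string, lastValue: string): Promise<void>;
}

interface BackfillOpts<TRow> {
  cursorKey: string;
  // Assumption: fetch resumes from the persisted cursor text, not from a row.
  fetch: (afterCursor: string | null, limit: number) => Promise<TRow[]>;
  apply: (row: TRow) => Promise<void>; // must be idempotent
  cursorOf: (row: TRow) => string;     // hypothetical: derives the cursor text from a row
  batchSize: number;
  store: CursorStore;
}

async function backfill<TRow>(opts: BackfillOpts<TRow>): Promise<void> {
  // Resume from wherever the previous run stopped (null on first run).
  let cursor = await opts.store.load(opts.cursorKey);
  for (;;) {
    const rows = await opts.fetch(cursor, opts.batchSize);
    if (rows.length === 0) return; // nothing left after the cursor
    for (const row of rows) {
      await opts.apply(row);
    }
    // Persist only after the whole batch is applied. A crash replays at most
    // one batch, which is safe because apply is idempotent.
    cursor = opts.cursorOf(rows[rows.length - 1]);
    await opts.store.save(opts.cursorKey, cursor);
  }
}
```

Saving the cursor once per batch rather than once per row trades a small amount of replayed work on restart for fewer writes to the cursor table.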
6. Tenant Lifecycle Migrations
6.1 Onboarding
Triggered by melmastoon.tenant.activated.v1 (consumed):
- Provision per-tenant defaults (departments, position catalog) via the tenant_seed job
- Set RLS GUC app.tenant_id for the seed run
- Index hotspot warm-up (synthetic capacity reads)
- Emit staff.tenant.bootstrapped.v1 (internal, not public)
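The GUC and seed steps could look roughly like the sketch below. The `Tx` interface is a hypothetical stand-in for a database transaction handle, and the `staff.departments` table with `(tenant_id, name)` columns is assumed for illustration. `set_config(name, value, is_local)` is the standard Postgres function; passing `true` scopes the setting to the current transaction.

```typescript
// Hypothetical transaction handle; the real job would use its own DB client.
interface Tx {
  query(sql: string, params: string[]): Promise<void>;
}

async function seedTenant(tx: Tx, tenantId: string, departments: string[]): Promise<void> {
  // Set the RLS GUC for this transaction only (is_local = true),
  // so the tenant context cannot leak to other work on the same connection.
  await tx.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);

  for (const name of departments) {
    // ON CONFLICT DO NOTHING keeps the seed idempotent if the event is redelivered.
    // Assumption: staff.departments(tenant_id, name) exists with a suitable unique constraint.
    await tx.query(
      "INSERT INTO staff.departments (tenant_id, name) VALUES ($1, $2) ON CONFLICT DO NOTHING",
      [tenantId, name],
    );
  }
}
```

Emitting staff.tenant.bootstrapped.v1 would follow once the transaction commits.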
6.2 Region transfer (data residency change)
Rare, operator-driven:
- Quiesce writes for the tenant (read-only mode flag at BFF)
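- Snapshot the tenant's rows in the staff schema. pg_dump cannot filter rows (it has no --where option), so export each table with a filtered query, e.g. psql \copy (SELECT * FROM staff.<table> WHERE tenant_id = '<id>') TO '<table>.csv' WITH (FORMAT csv)
- Restore to target-region Cloud SQL
- Re-emit tenant.region_transferred.v1; staff-service updates KMS key reference
- Lift the read-only flag in target region; archive the source-region rows after retention window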
A formal runbook is at runbooks/staff/region-transfer.md and is reviewed annually.
6.3 Tenant offboarding
Triggered by tenant.deactivated.v1:
- Mark all staff employmentStatus='terminated' (cascade-system source)
- Cancel future shifts
- After 90 d grace window, delete personal data per DSAR rules; retain audit_events for 7 y in BigQuery
- Destroy the tenant's KMS key after final retention (crypto-shredding); any remaining ciphertext becomes cryptographically inaccessible
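The offboarding timeline reduces to simple date arithmetic. The sketch below assumes both windows are counted from the deactivation timestamp; this document does not say when the 7-year audit retention starts, so treat that as an assumption. `offboardingSchedule` is a hypothetical helper.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface OffboardingSchedule {
  purgePersonalDataAt: Date;   // end of the 90 d grace window
  auditRetentionEndsAt: Date;  // end of the 7 y audit_events retention
}

// Assumption: both windows start at the tenant.deactivated.v1 timestamp.
function offboardingSchedule(deactivatedAt: Date): OffboardingSchedule {
  const purgePersonalDataAt = new Date(deactivatedAt.getTime() + 90 * DAY_MS);
  const auditRetentionEndsAt = new Date(deactivatedAt.getTime());
  // Calendar years in UTC, so the result does not depend on the server's time zone.
  auditRetentionEndsAt.setUTCFullYear(auditRetentionEndsAt.getUTCFullYear() + 7);
  return { purgePersonalDataAt, auditRetentionEndsAt };
}
```

For a tenant deactivated at 2026-01-01T00:00:00Z, personal data becomes eligible for deletion on 2026-04-01 and audit retention ends on 2033-01-01.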
7. Service Migration: M0 → M1
Key deltas:
- AI shift suggestions added as advisory surface (see AI_INTEGRATION §3.1)
- Read replica in europe-west1 (read-only); CDN-routed for tenants opting in to "fast read"
- Schema additions: staff.shift_suggestions, staff.staff_certifications.alert_sent_at
- New events: staff.certification.expired.v1 (was expires_soon.v1 only in M0)
Rollout:
- Ship schema migrations (M0+M1 combined) over 2 weeks
- Ship AI consumer behind feature flag ff_staff_ai_suggestions; enable per-tenant gradually
- Replica activated in europe-west1; routing flag enabled per-tenant
8. Service Migration: M1 → M2
Key deltas:
- Multi-region active-active writes (per 02 §11)
- Edge anomaly model in Electron clients
- Sync v2 with field-level vector clocks (replacing per-aggregate v_local for some aggregates)
- Tenant-pinned KMS keyring per region
Rollout:
- Stand up secondary region writers; shadow-write for 2 weeks; reconcile divergence (target zero)
- Enable round-robin via DNS for opted-in tenants
- Ship Electron 2.0 with edge model and v2 sync; stagger rollout 10 % → 50 % → 100 %
- Deprecate v1 sync 6 months after M2 GA
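One common way to stagger a rollout is to hash a stable identifier into a fixed bucket and compare it with the current percentage. The sketch below uses a 32-bit FNV-1a hash; the choice of hash, the use of a tenant or install ID as the key, and the function names are all assumptions, not a description of the actual rollout tooling.

```typescript
// 32-bit FNV-1a hash of the identifier, reduced to a bucket in [0, 99].
function bucketOf(id: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % 100;
}

// True when the identifier falls inside the current rollout percentage.
function inRollout(id: string, percent: number): boolean {
  return bucketOf(id) < percent;
}
```

Because the bucket is fixed per identifier, anything enabled at 10 % stays enabled at 50 % and 100 %, so widening the rollout never flips an early adopter back.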
9. Migration Communication
For any tenant-visible migration (sync version, breaking event change, region transfer), the change owner publishes:
- A 30-day-ahead notice in the customer changelog
- An in-app banner in the BFF
- A direct email to tenant admins with PII handling impacts (if any)
- A migration FAQ in the help center
Post-migration, a 14-day warranty window applies: any data drift discovered is fixed by a forward backfill, never by a rollback.
10. Migration Verification Gates
Every schema or event migration must include:
- A migration test (migrations/Vxxx.spec.ts) verifying schema state pre/post
- An integration test reading & writing the new shape end-to-end
- A backwards-compat test (where applicable) that asserts the prior client/contract still works
- A canary plan in the deploy pipeline (10 % → 50 % → 100 %)
CI blocks the merge if any of the above are missing for a PR touching migrations.
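The first gate lends itself to a mechanical check. The following is a sketch of one possible CI helper, not the actual pipeline: it assumes SQL files follow the V<NNN>__<summary>.sql naming and that the matching test lives at migrations/V<NNN>.spec.ts; `missingMigrationTests` is a hypothetical name.

```typescript
// Returns the spec files that should exist for the migrations touched by a PR
// but are absent from the repository file list.
function missingMigrationTests(changedFiles: string[], repoFiles: string[]): string[] {
  const missing: string[] = [];
  for (const file of changedFiles) {
    // Pick out the version prefix (e.g. "V017") from any changed migration file.
    const match = /(?:^|\/)(V\d+)__[^/]+\.sql$/.exec(file);
    if (match === null) continue; // not a migration file
    const expected = "migrations/" + match[1] + ".spec.ts";
    if (repoFiles.indexOf(expected) === -1 && missing.indexOf(expected) === -1) {
      missing.push(expected);
    }
  }
  return missing;
}
```

A non-empty result would fail the check; the integration, backwards-compat, and canary gates need their own signals and are not covered by this helper.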
11. Rollback Posture
- Schema: forward-only; if catastrophic, restore Cloud SQL from PITR (last 7 d) and replay outbox events from BigQuery audit cold copy
- Application: Cloud Deploy rollback to previous revision (revisions kept for 30 d)
- Sync protocol: server keeps prior version until full sunset; clients can downgrade by reinstalling older Electron build (ADR-0003)
- Events: older topic kept ≥ 1 milestone after .v2 introduction
12. Audit & Sign-off
Any migration touching:
- staff.audit_events schema → privacy officer sign-off
- staff.staff PII columns → security architect sign-off
- Region topology → SRE lead sign-off
- Tenant lifecycle handlers → platform architect sign-off
- AI surfaces or model rotations → AI lead sign-off
Sign-offs are recorded in the migration PR description and re-walked at quarterly readiness review (per SERVICE_READINESS §11).
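The sign-off matrix above is small enough to encode as a lookup that a PR template or bot could use. The area keys and the `requiredSignOffs` helper are hypothetical; only the area-to-role mapping comes from this section.

```typescript
// Hypothetical keys for the areas listed in this section.
type MigrationArea =
  | "audit_events_schema"
  | "staff_pii_columns"
  | "region_topology"
  | "tenant_lifecycle_handlers"
  | "ai_surfaces";

const SIGN_OFF_ROLE: Record<MigrationArea, string> = {
  audit_events_schema: "privacy officer",
  staff_pii_columns: "security architect",
  region_topology: "SRE lead",
  tenant_lifecycle_handlers: "platform architect",
  ai_surfaces: "AI lead",
};

// Returns the distinct roles that must sign off, in the order the areas were given.
function requiredSignOffs(areas: MigrationArea[]): string[] {
  const roles: string[] = [];
  for (const area of areas) {
    const role = SIGN_OFF_ROLE[area];
    if (roles.indexOf(role) === -1) roles.push(role);
  }
  return roles;
}
```

A migration that touches both PII columns and region topology would therefore list the security architect and the SRE lead in its PR description.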