MIGRATION_PLAN — staff-service

Sibling: SERVICE_OVERVIEW · DATA_MODEL · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS

Strategic anchors: 02 §16 Roadmap

How staff-service evolves from MVP through M2 (multi-region active-active) and how migrations are executed safely. Schema changes, event-schema versioning, sync-protocol versioning, and tenant onboarding/offboarding flows are all covered.


1. Service Maturity Milestones

| Milestone | Scope | Target quarter |
| --- | --- | --- |
| M0 | Single-region (me-central1); core staff CRUD; manual shift assignment; PIN clock-in (online + Electron offline); leave; basic reports | Q2-26 |
| M1 | AI shift suggestions (advisory); fairness reports; certification expiry alerts; second region read replica | Q4-26 |
| M2 | Multi-region active-active; tenant-driven data residency; edge anomaly model in Electron; broader BFF surfaces | Q2-27 |
| M3 | Light payroll-export adapter (read-only feed to external payroll); workforce planning beyond 30 d | TBD |

M3 is on the watch list; it requires an ADR before scope expansion (see SERVICE_RISK_REGISTER §R-12).


2. Schema Migrations (Postgres / Flyway)

2.1 Principles

  • Forward-only. No rollback migrations; revert via a new forward fix.
  • Online migrations. No exclusive locks during business hours; large data fixes done in chunks via staff-cron.
  • Backfill jobs are idempotent and resumable on a (tenant_id, last_id) cursor.
  • Migrations run in a Cloud Run Job before the API rollout (per DEPLOYMENT_TOPOLOGY §8).
  • Three-step column / table rename pattern for breaking changes:
    1. Add new column / table; dual-write from app
    2. Backfill historical data; switch reads
    3. Drop old column / table in a later release

2.2 Naming

V<NNN>__<snake_case_summary>.sql, e.g. V017__add_staff_certifications_email_alert.sql.
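
The naming rule lends itself to a mechanical check. The sketch below is a hypothetical helper, not part of the service: `parseMigrationName` and its return shape are invented for illustration, and it assumes that NNN means at least three digits and that the summary is lowercase snake_case.

```typescript
// Hypothetical helper: checks a file name against V<NNN>__<snake_case_summary>.sql.
// Assumption: NNN is at least three digits; the summary is lowercase snake_case.
const MIGRATION_NAME = /^V(\d{3,})__([a-z0-9]+(?:_[a-z0-9]+)*)\.sql$/;

interface ParsedMigration {
  version: number;
  summary: string;
}

// Returns the parsed parts, or null when the name breaks the convention.
function parseMigrationName(fileName: string): ParsedMigration | null {
  const match = MIGRATION_NAME.exec(fileName);
  if (match === null) return null;
  return { version: parseInt(match[1], 10), summary: match[2] };
}
```

A lint step could call this on every file in the migrations directory and fail on the first `null`.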

2.3 Examples

| Change | Pattern |
| --- | --- |
| Add nullable column | Single forward migration; defaults handled in app |
| Add NOT NULL column | (a) add nullable, (b) backfill in app or cron, (c) ALTER COLUMN ... SET NOT NULL |
| Rename column | Add new, dual-write, backfill, switch reads, drop old (over 3 releases) |
| Drop column | Mark in DATA_MODEL.md as "deprecated since vN"; remove only after all clients are off the old version |
| Change index | CREATE INDEX CONCURRENTLY; DROP INDEX CONCURRENTLY on the old index once the new one is valid and serving queries |
| New table | Forward migration; RLS enabled in same migration |

2.4 RLS gating

Every new tenant-scoped table MUST enable RLS in the migration that creates it; CI test rls-on-every-table.spec.ts blocks the deploy if missing.
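
To make the gate concrete, here is a minimal sketch of a static check over migration SQL. It is not the real rls-on-every-table.spec.ts, whose contents are not shown here. `tablesMissingRls` is a hypothetical name, and the sketch assumes tables are created as `CREATE TABLE staff.<name>` and that every staff table is tenant-scoped.

```typescript
// Hypothetical check, not the actual CI test.
// Assumptions: tables are created as "CREATE TABLE [IF NOT EXISTS] staff.<name>"
// and RLS is enabled as "ALTER TABLE staff.<name> ENABLE ROW LEVEL SECURITY"
// in the same migration file.
function tablesMissingRls(migrationSql: string): string[] {
  const sql = migrationSql.toLowerCase();
  const createRe = /create\s+table\s+(?:if\s+not\s+exists\s+)?(staff\.[a-z_][a-z0-9_]*)/g;
  const missing: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = createRe.exec(sql)) !== null) {
    const table = m[1];
    // Require whitespace right after the table name so that staff.shift
    // is not satisfied by a statement for staff.shift_suggestions.
    const enableRe = new RegExp(
      "alter\\s+table\\s+" + table.replace(".", "\\.") +
        "\\s+enable\\s+row\\s+level\\s+security"
    );
    if (!enableRe.test(sql)) missing.push(table);
  }
  return missing;
}
```

A check that runs against a migrated database could instead read `pg_class.relrowsecurity`, which avoids parsing SQL text.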


3. Event-Schema Versioning

Per 04 §13:

  • The version suffix in the topic is permanent (melmastoon.staff.shift.scheduled.v1).
  • Additive fields are emitted on the same version; new fields are nullable / defaulted.
  • Breaking changes get a new topic suffix (.v2); both run in parallel for at least one milestone.
  • Consumers MUST fall back to .v1 if they don't yet handle .v2.

3.1 Migration cookbook

| Change | Action |
| --- | --- |
| Add optional field | Bump JSON schema; emit on .v1; document in EVENT_SCHEMAS.md history table |
| Required field added, or field renamed | Stand up .v2 topic + producer; dual-publish (write to both for ≥ 1 milestone); migrate consumers; deprecate .v1 |
| Field semantics change (same name) | Treat as breaking change (new .v2) |
| Topic split (e.g., shift.assigned → shift.assigned + shift.reassigned) | Stand up new topics; emit alongside; consumers migrate; old topic kept for 1 milestone |

A topic deprecation requires a deprecation notice sent to the producer and consumer DRIs at least 30 days before sunset.
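
The additive-versus-breaking rules can be partly mechanised. The sketch below is illustrative only: `FieldSpec`, `classifyChange` and `targetTopic` are hypothetical names, and the field descriptor is a simplification of a JSON schema. It cannot detect a semantics change under an unchanged name, which still needs human review.

```typescript
// Hypothetical, simplified field descriptor (assumption, not the real schema format).
interface FieldSpec {
  type: string;
  required: boolean;
}
type EventFields = Record<string, FieldSpec>;
type ChangeKind = "none" | "additive" | "breaking";

function classifyChange(prev: EventFields, next: EventFields): ChangeKind {
  for (const name of Object.keys(prev)) {
    const before = prev[name];
    const after = next[name];
    // Removed or renamed fields, type changes, and optional -> required are breaking.
    if (after === undefined) return "breaking";
    if (after.type !== before.type) return "breaking";
    if (after.required && !before.required) return "breaking";
  }
  let additive = false;
  for (const name of Object.keys(next)) {
    if (prev[name] !== undefined) continue;
    if (next[name].required) return "breaking"; // new required field
    additive = true; // new optional field
  }
  return additive ? "additive" : "none";
}

// Breaking changes move to the next topic suffix; everything else stays on the same topic.
function targetTopic(currentTopic: string, kind: ChangeKind): string {
  const match = /^(.*)\.v(\d+)$/.exec(currentTopic);
  if (match === null) throw new Error("topic has no .vN suffix: " + currentTopic);
  if (kind !== "breaking") return currentTopic;
  return match[1] + ".v" + (parseInt(match[2], 10) + 1);
}
```

For example, adding an optional `note` field keeps `melmastoon.staff.shift.scheduled.v1`, while adding a required field yields `melmastoon.staff.shift.scheduled.v2`.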


4. Sync-Protocol Versioning

/sync/v1/pull and /sync/v1/push are versioned in the URL path. The Electron client sends Sync-Protocol-Version: 1; the server enforces compatibility and returns Sync-Protocol-Version-Supported: 1,2 once v2 is online.

4.1 Compatibility matrix (target)

| Server release | Client versions accepted |
| --- | --- |
| Q2-26 (M0) | v1 |
| Q4-26 (M1) | v1 |
| Q2-27 (M2) | v1, v2 (after Electron 2.0 ships) |
| Q4-27 (M2.1) | v2 only (sunset v1 after 6 months' notice) |

The server retains the older protocol implementation behind a feature flag and an integration test until full sunset.
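
As an illustration of how the matrix could drive request handling, the sketch below maps a server release and the client's `Sync-Protocol-Version` header to an accept or reject decision. The header names come from the text above. Everything else is an assumption: the `negotiateSyncVersion` name, the release keys, the choice of HTTP 426 for rejection, and returning the supported list on every response.

```typescript
// Hypothetical negotiation helper; the status codes and release keys are assumptions.
type Release = "M0" | "M1" | "M2" | "M2.1";

// Mirrors the target compatibility matrix.
const ACCEPTED_VERSIONS: Record<Release, number[]> = {
  "M0": [1],
  "M1": [1],
  "M2": [1, 2],
  "M2.1": [2],
};

interface NegotiationResult {
  status: number;
  headers: Record<string, string>;
}

function negotiateSyncVersion(
  release: Release,
  clientHeader: string | undefined
): NegotiationResult {
  const supported = ACCEPTED_VERSIONS[release];
  const headers = { "Sync-Protocol-Version-Supported": supported.join(",") };
  const requested = clientHeader === undefined ? NaN : Number(clientHeader.trim());
  if (!Number.isInteger(requested) || supported.indexOf(requested) === -1) {
    return { status: 426, headers }; // assumed rejection code (Upgrade Required)
  }
  return { status: 200, headers };
}
```

Keeping the matrix in one table makes the v1 sunset a one-line change that the retained integration test can cover.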

4.2 Conflict-policy changes

A change of conflict policy for any field-class is a breaking sync change and requires a new sync version. Documented per SYNC_CONTRACT §3.


5. Data Backfills

Run as Cloud Run Jobs invoked by staff-cron Pub/Sub messages or by manual operator command. Standard structure:

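async function backfill<TRow>(opts: {
  cursorKey: string; // key under which the cursor is persisted
  fetch: (after: TRow | null, limit: number) => Promise<TRow[]>; // next batch after the cursor; null on the first run
  apply: (row: TRow) => Promise<void>; // per-row change; must be idempotent
  batchSize: number; // rows per fetch
}): Promise<void>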

The cursor is persisted in staff.backfill_cursors(name TEXT PRIMARY KEY, last_value TEXT, updated_at TIMESTAMPTZ). Jobs are restartable, idempotent, and emit progress metrics.
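
One way the loop behind that signature might look is sketched below. It is an assumption-laden illustration, not the service's implementation: `CursorStore` and `runBackfill` are hypothetical, the extra `cursors` option is not in the signature above, and the last processed row is assumed to be stored as JSON in `last_value`. Progress metrics are omitted.

```typescript
// Hypothetical stand-in for staff.backfill_cursors.
interface CursorStore {
  load(name: string): Promise<string | null>;
  save(name: string, lastValue: string): Promise<void>;
}

async function runBackfill<TRow>(opts: {
  cursorKey: string;
  fetch: (after: TRow | null, limit: number) => Promise<TRow[]>;
  apply: (row: TRow) => Promise<void>;
  batchSize: number;
  cursors: CursorStore; // assumption: not part of the signature shown above
}): Promise<void> {
  // Resume from the persisted cursor, if any (assumed to be the last row as JSON).
  const saved = await opts.cursors.load(opts.cursorKey);
  let after: TRow | null = saved === null ? null : (JSON.parse(saved) as TRow);
  for (;;) {
    const batch = await opts.fetch(after, opts.batchSize);
    if (batch.length === 0) return; // nothing left to process
    for (const row of batch) {
      // A crash mid-batch replays the whole batch on restart,
      // which is why apply must be idempotent.
      await opts.apply(row);
    }
    after = batch[batch.length - 1];
    await opts.cursors.save(opts.cursorKey, JSON.stringify(after));
  }
}
```

Saving the cursor once per batch rather than per row trades a bounded amount of replayed work for fewer cursor writes.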


6. Tenant Lifecycle Migrations

6.1 Onboarding

Triggered by melmastoon.tenant.activated.v1 (consumed):

  1. Provision per-tenant defaults (departments, position catalog) via the tenant_seed job
  2. Set RLS GUC app.tenant_id for the seed run
  3. Index hotspot warm-up (synthetic capacity reads)
  4. Emit staff.tenant.bootstrapped.v1 (internal — not public)
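
Step 2 can be sketched as follows. The sketch assumes the seed job runs its statements on a single Postgres connection inside one transaction. `SqlClient` and `withTenantScope` are hypothetical names; `set_config` with a third argument of `true` is the standard Postgres way to scope a setting to the current transaction.

```typescript
// Hypothetical minimal client interface (assumption): any Postgres driver
// exposing query(text, params) on one dedicated connection would fit.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

async function withTenantScope<T>(
  db: SqlClient,
  tenantId: string,
  work: (db: SqlClient) => Promise<T>
): Promise<T> {
  await db.query("BEGIN");
  try {
    // The third argument (is_local = true) limits the setting to this transaction,
    // so it cannot leak to another tenant through a pooled connection.
    await db.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(db);
    await db.query("COMMIT");
    return result;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

The seed inserts would run inside `work`, so RLS policies that read `app.tenant_id` see the tenant being onboarded.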

6.2 Region transfer (data residency change)

Rare, operator-driven:

  1. Quiesce writes for the tenant (read-only mode flag at BFF)
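  2. Snapshot the tenant's rows from the staff schema. pg_dump has no row-level filter (it accepts neither --where nor --table-pattern), so export each table with, e.g., COPY (SELECT * FROM staff.<table> WHERE tenant_id = '<id>') TO STDOUT, and use pg_dump --schema-only --schema=staff if the DDL is needed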
  3. Restore to target-region Cloud SQL
  4. Re-emit tenant.region_transferred.v1; staff-service updates KMS key reference
  5. Lift the read-only flag in target region; archive the source-region rows after retention window

A formal runbook is at runbooks/staff/region-transfer.md and is reviewed annually.

6.3 Tenant offboarding

Triggered by tenant.deactivated.v1:

  1. Mark all staff employmentStatus='terminated' (cascade-system source)
  2. Cancel future shifts
  3. After 90 d grace window, delete personal data per DSAR rules; retain audit_events for 7 y in BigQuery
  4. Destroy the tenant's KMS key after final retention (crypto-shredding); any remaining ciphertext becomes cryptographically inaccessible
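
The retention windows in step 3 translate into two dates per tenant. The helper below is a hypothetical sketch. It assumes both windows are counted from the deactivation timestamp and computed in UTC; the text above does not state the anchor.

```typescript
// Hypothetical helper. Assumption: both windows start at deactivation time (UTC).
interface OffboardingDates {
  personalDataDeletion: Date; // end of the 90 d grace window
  auditRetentionEnd: Date;    // end of the 7 y audit_events retention
}

const DAY_MS = 24 * 60 * 60 * 1000;

function offboardingDates(deactivatedAt: Date): OffboardingDates {
  const personalDataDeletion = new Date(deactivatedAt.getTime() + 90 * DAY_MS);
  const auditRetentionEnd = new Date(deactivatedAt.getTime());
  auditRetentionEnd.setUTCFullYear(auditRetentionEnd.getUTCFullYear() + 7);
  return { personalDataDeletion, auditRetentionEnd };
}
```

For a tenant deactivated at 2026-01-01T00:00:00Z this gives a personal-data deletion date of 2026-04-01 and an audit retention end of 2033-01-01.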

7. Service Migration: M0 → M1

Key deltas:

  • AI shift suggestions added as advisory surface (see AI_INTEGRATION §3.1)
  • Read replica in europe-west1 (read-only); CDN-routed for tenants opting in to "fast read"
  • Schema additions: staff.shift_suggestions, staff.staff_certifications.alert_sent_at
  • New events: staff.certification.expired.v1 (M0 emitted only expires_soon.v1)

Rollout:

  1. Ship schema migrations (M0+M1 combined) over 2 weeks
  2. Ship AI consumer behind feature flag ff_staff_ai_suggestions; enable per-tenant gradually
  3. Replica activated in europe-west1; routing flag enabled per-tenant
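
Gradual per-tenant enablement of a flag such as ff_staff_ai_suggestions is commonly done with deterministic bucketing. The sketch below shows that technique under stated assumptions; it does not describe how the service's flag system actually works. `isEnabledForTenant` and the use of an FNV-1a hash are illustrative choices.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units; used only to spread tenants across buckets.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Hypothetical gate: a tenant is in the rollout when its bucket (0..99)
// is below the current rollout percentage.
function isEnabledForTenant(
  flag: string,
  tenantId: string,
  rolloutPercent: number
): boolean {
  const bucket = fnv1a(flag + ":" + tenantId) % 100;
  return bucket < rolloutPercent;
}
```

Because a tenant's bucket never changes, raising the percentage only adds tenants. A tenant enabled at 10 % stays enabled at 50 %.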

8. Service Migration: M1 → M2

Key deltas:

  • Multi-region active-active writes (per 02 §11)
  • Edge anomaly model in Electron clients
  • Sync v2 with field-level vector clocks (replacing per-aggregate v_local for some aggregates)
  • Tenant-pinned KMS keyring per region
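
For readers unfamiliar with vector clocks, the sketch below shows the generic comparison and merge operations that a field-level scheme would apply per field. It illustrates the general technique only. The actual v2 wire format and conflict policy are defined elsewhere, and the `VectorClock` shape, `compareClocks` and `mergeClocks` are assumptions.

```typescript
// Hypothetical representation: one counter per replica id (device or region).
type VectorClock = Record<string, number>;
type Ordering = "equal" | "before" | "after" | "concurrent";

// "after" means a has seen everything b has, plus more; "concurrent" means
// neither side dominates, so a conflict policy has to decide.
function compareClocks(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false;
  let bAhead = false;
  for (const id of Object.keys(a).concat(Object.keys(b))) {
    const av = a[id] ?? 0; // a missing entry counts as 0
    const bv = b[id] ?? 0;
    if (av > bv) aAhead = true;
    if (av < bv) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// Merging takes the per-replica maximum.
function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const merged: VectorClock = {};
  for (const id of Object.keys(a).concat(Object.keys(b))) {
    merged[id] = Math.max(a[id] ?? 0, b[id] ?? 0);
  }
  return merged;
}
```

Keeping one clock per field rather than per aggregate means two devices editing different fields of the same record compare as non-conflicting.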

Rollout:

  1. Stand up secondary region writers; shadow-write for 2 weeks; reconcile divergence (target zero)
  2. Enable round-robin via DNS for opted-in tenants
  3. Ship Electron 2.0 with edge model and v2 sync; stagger rollout 10 % → 50 % → 100 %
  4. Deprecate v1 sync 6 months after M2 GA

9. Migration Communication

For any tenant-visible migration (sync version, breaking event change, region transfer), the change owner publishes:

  1. A 30-day-ahead notice in the customer changelog
  2. An in-app banner in the BFF
  3. A direct email to tenant admins with PII handling impacts (if any)
  4. A migration FAQ in the help center

A 14-day warranty window follows each migration: any data drift discovered in that window is fixed by a forward backfill, never by a rollback.


10. Migration Verification Gates

Every schema or event migration must include:

  • A migration test (migrations/Vxxx.spec.ts) verifying schema state pre/post
  • An integration test reading & writing the new shape end-to-end
  • A backwards-compat test (where applicable) that asserts the prior client/contract still works
  • A canary plan in the deploy pipeline (10 % → 50 % → 100 %)

CI blocks the merge if any of the above are missing for a PR touching migrations.
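
The first gate (a migration test per migration) could be enforced with a check like the one below. This is a hypothetical sketch, not the actual CI rule. It assumes migrations live at `migrations/V<NNN>__*.sql`, that tests follow the `migrations/Vxxx.spec.ts` pattern named above, and that the spec must arrive in the same PR. The other three gates are not covered.

```typescript
// Hypothetical gate over the list of files changed in a PR.
// Assumption: migrations/V<NNN>__<summary>.sql pairs with migrations/V<NNN>.spec.ts.
function missingMigrationSpecs(changedFiles: string[]): string[] {
  const missing: string[] = [];
  for (const file of changedFiles) {
    const match = /^migrations\/(V\d+)__.+\.sql$/.exec(file);
    if (match === null) continue; // not a migration file
    const spec = "migrations/" + match[1] + ".spec.ts";
    if (changedFiles.indexOf(spec) === -1) missing.push(spec);
  }
  return missing;
}
```

A CI step would fail the build when the returned list is non-empty and print the expected spec paths.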


11. Rollback Posture

  • Schema: forward-only; if catastrophic, restore Cloud SQL from PITR (last 7 d) and replay outbox events from BigQuery audit cold copy
  • Application: Cloud Deploy rollback to previous revision (revisions kept for 30 d)
  • Sync protocol: server keeps prior version until full sunset; clients can downgrade by reinstalling an older Electron build (see ADR-0003)
  • Events: older topic kept ≥ 1 milestone after .v2 introduction

12. Audit & Sign-off

Any migration touching:

  • staff.audit_events schema → privacy officer sign-off
  • staff.staff PII columns → security architect sign-off
  • Region topology → SRE lead sign-off
  • Tenant lifecycle handlers → platform architect sign-off
  • AI surfaces or model rotations → AI lead sign-off

Sign-offs are recorded in the migration PR description and re-walked at quarterly readiness review (per SERVICE_READINESS §11).