MIGRATION_PLAN — payment-gateway-service

Sibling: DATA_MODEL · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS

This plan covers two scenarios: (a) the initial production cutover — moving the service from staging to prod with first tenants — and (b) per-tenant migrations that occur continuously as new tenants onboard, schemas evolve, and existing Stripe accounts attach via OAuth.

1. Initial production cutover (T-zero)

1.1 Pre-conditions

  • All checkboxes in SERVICE_READINESS complete and dated.
  • Vendor production accounts created and verified (Stripe Connect platform live, PayPal merchant approved, HesabPay live key issued).
  • Cloud SQL gm-payments-prod provisioned with HA, CMEK, regional backup.
  • Cloud Armor + Workload Identity policies in place.
  • DNS records (the api.melmastoon.ghasi.io route and webhooks.payments.melmastoon.ghasi.io) prepared but not yet pointed at the new service.

1.2 Cutover steps

  1. Deploy the production image to payment-api and payment-worker. No traffic is served yet: manifests stay at replicas: 0 so that scale-out can be checked first.
  2. Apply central migrations (migrations/central/*.sql).
  3. Smoke with internal synthetic tenant tnt_smoke_001.
  4. Scale up to baseline (payment-api 3 pods, payment-worker 2 pods).
  5. Cut DNS for the webhook subdomain first. Vendors begin delivering events, but there is nothing to apply yet, so an empty inbox is expected.
  6. Cut DNS for the API subdomain.
  7. Onboard pilot tenants (3 tenants in 1 week). For each, follow §2.
  8. Observe SLOs for 7 days; complete the post-launch readiness review.

1.3 Rollback

  • DNS swap back to legacy provider (if any) or to the staging deployment serving as a temporary stub.
  • Cloud SQL retains all transactions, so no data is lost.
  • Vendor webhook deliveries continue against the new endpoint and are acknowledged (200 OK); the resulting state changes can be replayed when the service is brought back.

2. Per-tenant onboarding

For each new tenant, tenant-service invokes the payments admin CLI and tracks completion in tenant_schema_registry.

2.1 Schema provisioning

payments-admin-cli provision-tenant \
  --tenant-id tnt_01H… \
  --pci-profile saq_a \
  --cmek-key projects/gm-payments-prod/locations/us/keyRings/gm-payments/cryptoKeys/tnt_01H…

This:

  1. Creates schema tnt_<tenantId> in Cloud SQL.
  2. Creates per-tenant role with USAGE on the schema only.
  3. Applies all tenant migrations from migrations/tenant/ in order.
  4. Inserts a row in payments_central.tenant_schema_registry with the CMEK URI and applied migration versions.
  5. Verifies role separation by attempting a cross-tenant SELECT and asserting permission denied.
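
As a rough illustration of steps 1, 2, and 5, the sketch below builds the statements such a command might issue and classifies the error expected from the cross-tenant check. It assumes PostgreSQL on Cloud SQL; the function names, the NOLOGIN role, and the identifier rules are hypothetical and not taken from payments-admin-cli.

```typescript
// Hypothetical sketch: SQL that a provisioning step might run for one tenant.
// Assumes PostgreSQL. Names and grants are illustrative, not the CLI's actual behavior.

// Quote an identifier, rejecting anything outside a conservative character set
// so that a tenant ID can never inject SQL.
function quoteIdent(name: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
    throw new Error(`unsafe identifier: ${name}`);
  }
  return `"${name}"`;
}

// Steps 1 and 2: create the schema and a role that only has USAGE on it.
function provisionStatements(schema: string, role: string): string[] {
  const s = quoteIdent(schema);
  const r = quoteIdent(role);
  return [
    `CREATE SCHEMA ${s}`,
    `CREATE ROLE ${r} NOLOGIN`,
    `GRANT USAGE ON SCHEMA ${s} TO ${r}`,
  ];
}

// Step 5: the cross-tenant SELECT should fail with insufficient_privilege,
// which PostgreSQL reports as SQLSTATE 42501.
function isPermissionDenied(sqlState: string): boolean {
  return sqlState === "42501";
}
```

Any other outcome of the cross-tenant SELECT, including success, would be treated as a failed provisioning run under this sketch.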

2.2 Vendor credentials

payments-admin-cli attach-vendor \
  --tenant-id tnt_01H… \
  --processor stripe \
  --env production \
  --account-ref acct_1NXz… \
  --api-key-from-secret projects/gm-payments-prod/secrets/tnt_01H…/stripe/prod/api_key/versions/latest \
  --webhook-secret-from-secret projects/gm-payments-prod/secrets/tnt_01H…/stripe/prod/webhook_secret/versions/latest \
  --precedence 100

Repeat per processor. Existing Stripe accounts attach via OAuth (Stripe Connect Standard): the tenant admin clicks "Connect Stripe" in bff-backoffice-service, completes OAuth at Stripe, and the callback handler stores the acct_…, refresh token, and webhook secret in Secret Manager, then calls attach-vendor.

2.3 Verification

  • Run a £1.00 / $1.00 test authorize+void against each enabled processor.
  • Confirm that the resulting webhook is received and applied.
  • Mark tenant payments_ready=true in tenant-service.

3. Schema evolution (existing tenants)

3.1 Approach

  • Tenant migrations are versioned per tenant in payments_central.tenant_migrations(tenant_id, version, applied_at).
  • Each deploy applies the latest schema to all tenants in batches of 50, migrating the tenants within a batch in parallel, with a 30-second timeout per tenant.
  • Failed applications are retried automatically up to 3 times; any remaining failures raise a P2 alert and are resumed manually via payments-admin-cli migrate-tenant <id>.
  • Any migration introducing a column with a non-null default uses the expand-then-migrate-then-contract pattern over multiple deploys.
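
The batching and retry behavior described above can be sketched as follows. This is a simplified, synchronous model: the per-tenant apply step is injected as a hypothetical callback that reports "ok", "failed", or "timeout", so real parallelism and the 30-second timer are left to the caller. It also assumes that "retried up to 3 times" means one initial attempt plus three retries.

```typescript
// Hypothetical sketch of the deploy-time migration loop. The applyOnce callback
// stands in for the real per-tenant migration and its 30-second timeout.
type Outcome = "ok" | "failed" | "timeout";

interface MigrationResult {
  tenantId: string;
  attempts: number;
  succeeded: boolean;
}

// Split the tenant list into fixed-size batches (50 in the plan).
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// One initial attempt plus up to maxRetries retries (assumption, see lead-in).
function migrateWithRetry(
  tenantId: string,
  applyOnce: (tenantId: string) => Outcome,
  maxRetries: number = 3
): MigrationResult {
  let attempts = 0;
  while (attempts <= maxRetries) {
    attempts += 1;
    if (applyOnce(tenantId) === "ok") {
      return { tenantId, attempts, succeeded: true };
    }
  }
  return { tenantId, attempts, succeeded: false };
}

// Returns the tenants that still failed after all retries; these are the ones
// that would feed the P2 alert and a manual migrate-tenant run.
function runDeploy(
  tenantIds: readonly string[],
  applyOnce: (tenantId: string) => Outcome,
  batchSize: number = 50
): string[] {
  const failed: string[] = [];
  for (const batch of chunk(tenantIds, batchSize)) {
    // A real runner would migrate the members of a batch concurrently.
    for (const id of batch) {
      if (!migrateWithRetry(id, applyOnce).succeeded) failed.push(id);
    }
  }
  return failed;
}
```

With 120 tenants this yields batches of 50, 50, and 20; a tenant whose migration never succeeds is attempted four times before it appears in the returned list.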

3.2 Backward compatibility rules

  • New columns must be nullable or carry a default; never add a NOT NULL column without one.
  • Renames within a single deploy are forbidden; instead, add the new column → backfill → swap reads → deprecate the old column.
  • Index creation uses CREATE INDEX CONCURRENTLY so that writes to the table are not blocked while the index builds.
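
Rules like these lend themselves to an automated check in CI. The sketch below is a deliberately naive, regex-based linter; the rule names are hypothetical, and the statement splitting ignores string literals and dollar-quoted bodies, so it is a starting point rather than a SQL parser.

```typescript
// Hypothetical lint pass over a migration file, one finding per violated rule.
// Heuristic only: splits on ";" and matches keywords with regular expressions.
interface LintFinding {
  rule: string;
  statement: string;
}

function lintMigration(sql: string): LintFinding[] {
  const findings: LintFinding[] = [];
  const statements = sql
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  for (const statement of statements) {
    // Normalize case and whitespace so the patterns below stay simple.
    const upper = statement.toUpperCase().replace(/\s+/g, " ");

    // Rule 1: added columns must not be NOT NULL unless they carry a DEFAULT.
    if (/\bADD\b/.test(upper) && /\bNOT NULL\b/.test(upper) && !/\bDEFAULT\b/.test(upper)) {
      findings.push({ rule: "not-null-without-default", statement });
    }
    // Rule 2: no renames inside a single deploy.
    if (/^ALTER TABLE\b.*\bRENAME\b/.test(upper)) {
      findings.push({ rule: "rename-in-single-deploy", statement });
    }
    // Rule 3: indexes must be built with CONCURRENTLY.
    if (/^CREATE (UNIQUE )?INDEX\b/.test(upper) && !/^CREATE (UNIQUE )?INDEX CONCURRENTLY\b/.test(upper)) {
      findings.push({ rule: "index-not-concurrent", statement });
    }
  }
  return findings;
}
```

One practical caveat: PostgreSQL does not allow CREATE INDEX CONCURRENTLY inside a transaction block, so a migration runner that wraps each file in a transaction needs a way to run such statements outside it.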

3.3 Read traffic during migration

  • The API stays online; ORM models tolerate unknown columns and the absence of optional columns.
  • Drizzle migrations are dry-run in staging against a snapshot of production tenants.

4. Vendor adapter additions

When introducing a new processor (e.g., Adyen):

  1. Implement AdyenAdapter against the PaymentPort interface (full unit and contract tests in CI).
  2. Add adyen to the processor enum check on relevant tables (tnt_<id>.transactions.processor etc.).
  3. Provision sandbox vendor credentials for each tenant that opts in (default precedence 200, disabled until the tenant flips a flag).
  4. Roll out to 1 pilot tenant; canary for 7 days; promote to all opt-ins.
  5. Add canary cron and dashboard panels.
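
To make steps 1 to 3 concrete, here is a minimal sketch of how the pieces might fit together. The shape of PaymentPort, the request and result types, and the processor names are assumptions made for illustration; the real interface is not shown in this plan.

```typescript
// Hypothetical shapes: the actual PaymentPort contract lives in the service code.
const PROCESSORS = ["stripe", "paypal", "hesabpay", "adyen"] as const;
type Processor = (typeof PROCESSORS)[number];

// Application-side mirror of the processor enum check on tenant tables (step 2).
function isKnownProcessor(value: string): value is Processor {
  return (PROCESSORS as readonly string[]).indexOf(value) >= 0;
}

interface AuthorizeRequest {
  tenantId: string;
  amountMinor: number; // amount in minor units, e.g. cents
  currency: string;
}

interface AuthorizeResult {
  processor: Processor;
  reference: string;
}

interface PaymentPort {
  readonly processor: Processor;
  authorize(req: AuthorizeRequest): Promise<AuthorizeResult>;
}

// Step 1: the new adapter. The body is a stub; a real adapter would call the vendor API.
class AdyenAdapter implements PaymentPort {
  readonly processor: Processor = "adyen";

  authorize(_req: AuthorizeRequest): Promise<AuthorizeResult> {
    return Promise.reject(new Error("AdyenAdapter.authorize is not implemented in this sketch"));
  }
}

interface VendorConfig {
  tenantId: string;
  processor: Processor;
  precedence: number;
  enabled: boolean;
}

// Step 3: opt-in tenants start at precedence 200 with the processor switched off.
function optInConfig(tenantId: string, processor: Processor): VendorConfig {
  return { tenantId, processor, precedence: 200, enabled: false };
}
```

Keeping the processor list in one constant means the type, the runtime check, and the database constraint can be reviewed side by side when a new vendor is added.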

5. Tenant offboarding

  1. Tenant submits deletion request via tenant-service.
  2. tenant-service checks with payment-gateway-service for non-terminal transactions; if any, deletion is paused.
  3. Soft-delete window of 30 days begins. During this period:
    • No new authorize/capture allowed.
    • Refunds and webhook ingestion continue.
    • Operators can run final reconciliation.
  4. After 30 days:
    • payments-admin-cli offboard-tenant <id> runs.
    • Schema dropped after final snapshot to GCS Coldline (7-year retention per legal hold).
    • Secret Manager entries scheduled for secrets:destroy (24 h grace).
    • Row in tenant_schema_registry set to archived_at = now().
  5. Webhook-inbox rows for the tenant are anonymized (tenant_id = NULL) and retained for 90 days for fraud forensics.
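
The soft-delete rules in steps 3 and 4 can be expressed as two small predicates. This is a sketch under stated assumptions: the operation names are hypothetical, and the 30-day window is measured as elapsed time from the moment the soft-delete begins.

```typescript
// Hypothetical guard functions for the soft-delete window.
type Operation = "authorize" | "capture" | "refund" | "webhook_ingest";

const SOFT_DELETE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// During the window, new authorize/capture calls are refused while refunds
// and webhook ingestion keep working.
function isAllowedDuringSoftDelete(op: Operation): boolean {
  return op === "refund" || op === "webhook_ingest";
}

// offboard-tenant may run only once the full window has elapsed.
function canOffboard(softDeleteStartedAt: Date, now: Date): boolean {
  const elapsedMs = now.getTime() - softDeleteStartedAt.getTime();
  return elapsedMs >= SOFT_DELETE_DAYS * MS_PER_DAY;
}
```

For example, with a window that starts at 2024-01-01T00:00:00Z, canOffboard is false on 2024-01-30 and becomes true at 2024-01-31T00:00:00Z.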

6. Disaster-driven re-platforming

Should a future ADR replace Cloud SQL with another database (e.g., AlloyDB), the migration approach is:

  1. Stand up the new platform alongside.
  2. Use logical replication for per-tenant schemas in batches.
  3. Cut writes per tenant during a defined low-traffic window.
  4. Validate row counts and outstanding transactions match.
  5. Swap connection strings in ConfigMap.
  6. Monitor; roll back by reverting the ConfigMap (replication is still running in the reverse direction, so the old platform stays current).
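
Step 4 is the gate for the whole cutover, so it helps to make the comparison mechanical. The sketch below compares per-table counts gathered from both platforms; how the counts are collected is out of scope, and the table names in the usage note are hypothetical. A count of non-terminal transactions can be included as just another key.

```typescript
// Hypothetical validation helper: compares counts from the old and new platform.
type Counts = Record<string, number>; // table (or metric) name -> count

// Returns one human-readable line per mismatch; an empty array means the
// tenant is safe to cut over as far as counts are concerned.
function countMismatches(source: Counts, target: Counts): string[] {
  // Union of keys, so a table missing on either side is reported as well.
  const tables = Object.keys(source).concat(
    Object.keys(target).filter((t) => !(t in source))
  );
  const mismatches: string[] = [];
  for (const table of tables.sort()) {
    const s: number | undefined = source[table];
    const t: number | undefined = target[table];
    if (s !== t) {
      mismatches.push(`${table}: source=${s ?? "missing"} target=${t ?? "missing"}`);
    }
  }
  return mismatches;
}
```

Calling countMismatches({ transactions: 10, refunds: 2 }, { transactions: 10, refunds: 1 }) returns a single entry for refunds, which would block the connection-string swap in step 5.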

7. Documentation & sign-off

Every production migration plan must be:

  • Linked to a Linear/Jira ticket with run sheet and rollback plan.
  • Reviewed by SecOps (PCI impact), SRE (downtime impact), and the service owner.
  • Recorded in the platform change log.

The first cutover and the first three tenant onboardings will be run as paired operations (at least two engineers, one acting as observer).