api-gateway (Kong) — Migration Plan

Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: ADR-0001 · SERVICE_OVERVIEW · Service Template

1. Purpose

Describe the cutover from the custom NestJS api-gateway to Kong Gateway as the platform edge. If no NestJS api-gateway is yet running in production (pre-GA), this plan reduces to a greenfield deploy. Where one already exists, the plan governs a safe dual-run and switchover.

2. Migration scope

In scope:

  • Replace NestJS api-gateway edge with Kong for all north-south HTTP traffic
  • Migrate JWT + API-key authentication to Kong plugins
  • Migrate rate limiting to Kong (Redis-backed)
  • Add correlation-id, OTel, http-log plugins
  • Decommission NestJS api-gateway

Out of scope:

  • SMPP ingress (stays at smpp-connector)
  • Business authorization (stays in services)
  • Idempotency / payload validation (move to sms-orchestrator as separate work)
  • Internal east-west traffic
  • Cloudflare configuration (unchanged other than origin target)

3. Preconditions

  • auth-service exposes JWKS at /.well-known/jwks.json and an internal API-key resolution endpoint.
  • sms-orchestrator has absorbed the idempotency + payload validation + NATS publish responsibilities previously held by NestJS api-gateway (tracked as a separate epic in that service).
  • decK YAML (ops/kong/<env>.kong.yaml) exists and passes CI lint.
  • Kong deployed to staging and passing SERVICE_READINESS checklist.

4. Phases

5. Dual-run strategy

Run Kong and the NestJS api-gateway concurrently behind Cloudflare for a bounded window.

  • Routing split: Cloudflare Worker (or Load Balancer origin rules) routes on either:
    • A request header (X-Edge-Route: kong) set by an internal proxy for pilot customers, or
    • A percentage weight (canary) — preferred once staging passes.
  • Both gateways forward to the same upstream services. Because idempotency keys are scoped in sms-orchestrator rather than at the edge, a request retried through the other gateway does not produce a duplicate send.
  • Observability: Grafana dashboard compares p95 latency, 5xx rate, auth failure rate, rate-limit rejection rate side-by-side.
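
The routing split described above can be sketched as a small decision function. This is a minimal illustration, not the Cloudflare Worker itself: the function name, the constants, and the idea of hashing a request key into buckets are assumptions made for the example. Cloudflare's weighted origin selection is per-request and not necessarily sticky; the hash is used here only so that the example is deterministic.

```python
import hashlib

KONG = "kong"
NESTJS = "nestjs"


def choose_origin(headers: dict[str, str], request_key: str, kong_weight_pct: float) -> str:
    """Pick the origin for one request during dual-run (illustrative only)."""
    # Header override: pilot traffic is pinned to Kong by the internal proxy.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-edge-route", "").lower() == "kong":
        return KONG

    # Canary weight: hash the key (hypothetical, e.g. a customer id) into one
    # of 10,000 buckets so the same key lands on the same side for a given weight.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return KONG if bucket < kong_weight_pct * 100 else NESTJS
```

With a weight of 0 only header-pinned pilot traffic reaches Kong, which matches the synthetic-probe stage; with a weight of 100 every request does.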

6. Cutover steps (prod)

  1. T-14 d: Kong in staging for 1 week; passes smoke, load, and security tests.
  2. T-7 d: Kong deployed to prod behind Cloudflare with 0 % traffic weight. Synthetic probes only.
  3. T-3 d: Flip 5 % of traffic (Cloudflare weighted origin). Monitor for 24 h.
  4. T-2 d: Ramp to 25 %. Monitor 12 h.
  5. T-1 d: Ramp to 50 %. Monitor 12 h.
  6. T-0: Ramp to 100 %. NestJS api-gateway still warm.
  7. T+7 d: Scale NestJS api-gateway to zero replicas (still deployable via GitOps as rollback).
  8. T+14 d: Remove NestJS api-gateway deployment manifests. Archive the code folder in the application monorepo. Keep the documentation history under services/api-gateway/_sources/ for audit.
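
As a sketch of how the ramp above could be encoded for tooling (for example, a check that the live Cloudflare weight matches the plan), the schedule can be written as data. The class and function names are hypothetical; the offsets, weights, and monitoring windows are taken from the steps above, with 0 used where the plan states no monitoring window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RampStep:
    day_offset: int       # days relative to T-0 (negative means before cutover)
    kong_weight_pct: int  # share of prod traffic sent to Kong
    monitor_hours: int    # 0 where the plan gives no explicit window


RAMP = [
    RampStep(-7, 0, 0),    # deployed to prod, synthetic probes only
    RampStep(-3, 5, 24),
    RampStep(-2, 25, 12),
    RampStep(-1, 50, 12),
    RampStep(0, 100, 0),   # NestJS api-gateway stays warm
]


def planned_weight(day_offset: int) -> int:
    """Return the Kong traffic weight the plan calls for on a given day."""
    weight = 0  # before T-7 Kong is not in prod, so it takes no traffic
    for step in RAMP:
        if day_offset >= step.day_offset:
            weight = step.kong_weight_pct
    return weight
```

For example, `planned_weight(-2)` returns 25 and any offset from T-0 onward returns 100.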

7. Client base URL / DNS

Option A (preferred): Keep the public hostname (api.ghasi.io) — Cloudflare routes transparently; clients do not change anything.

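Option B: If migration demands a hostname change (e.g. regulatory audit trail), communicate a new base URL and maintain a permanent redirect from the old base URL for at least 90 days (301, or 308 where the request method and body must be preserved, since many clients reissue a redirected POST as a GET after a 301).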

We choose Option A unless a specific regulatory reason mandates Option B.

8. Rollback

At any cutover step:

  • Cloudflare weighted routing: set the Kong origin weight back to 0 %; the NestJS api-gateway resumes 100 % of traffic within one Cloudflare propagation cycle (< 30 s).
  • Configuration rollback: run deck gateway sync with the decK file from a previous tag.
  • Image rollback: blue/green redeploy of the previous Kong image.

No destructive rollback is required because the NestJS api-gateway remains warm through T+7.
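
The configuration rollback could be scripted roughly as follows. This is a sketch under stated assumptions: that the decK file is versioned in Git and "previous tag" means a Git tag, that decK is a version providing the `deck gateway` subcommands, and that connection settings for decK come from its own configuration. The function name is hypothetical, and the function only builds the commands; it does not run them.

```python
def config_rollback_commands(env: str, git_tag: str) -> list[list[str]]:
    """Build the command sequence for a decK configuration rollback."""
    state_file = f"ops/kong/{env}.kong.yaml"
    return [
        # Restore the decK file as it was at the tag (assumes Git-tagged config).
        ["git", "checkout", git_tag, "--", state_file],
        # Show what would change before applying it.
        ["deck", "gateway", "diff", state_file],
        # Apply the restored state to the gateway.
        ["deck", "gateway", "sync", state_file],
    ]
```

Running the diff before the sync gives the operator a chance to confirm that the rollback touches only the intended entities.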

9. Data considerations

  • Rate-limit counters: Kong builds its own counters in Redis under kong:rl:*. NestJS counters (if any) live in a different keyspace and are not migrated. During dual-run, per-customer limits may be softer: each gateway counts only the traffic it receives, so a customer can exceed the nominal limit in aggregate before either gateway rejects. This is accepted for the 14-day window.
  • Consumer credentials: For pilot / static consumers, generate decK YAML from auth-service and include in the Kong config. For customer API keys, the custom ghasi-api-key-lookup plugin avoids migration entirely — keys resolve at request time.
  • JWTs: unchanged; both gateways validate against the same JWKS.
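
To put a number on how much softer the rate limit becomes during dual-run, the following sketch computes the highest aggregate rate a customer can send before either gateway starts rejecting. It assumes both gateways enforce the same per-customer limit independently and that the customer's traffic is split in proportion to the Kong weight; the function name is hypothetical.

```python
def max_unrejected_rate(limit: float, kong_share: float) -> float:
    """Highest total request rate that neither gateway rejects.

    limit: per-customer limit enforced by each gateway on its own counters.
    kong_share: fraction of the customer's traffic routed to Kong (0.0 to 1.0).
    """
    if not 0.0 <= kong_share <= 1.0:
        raise ValueError("kong_share must be between 0 and 1")
    # A gateway receiving share s of total rate T sees s * T, so it stays
    # under the limit while T <= limit / s. The tighter gateway decides.
    shares = [s for s in (kong_share, 1.0 - kong_share) if s > 0]
    return min(limit / s for s in shares)
```

Under these assumptions a limit of 100 requests per window allows about 105 in aggregate at a 5 % split and 200 at a 50 % split, which is the worst case; at 0 % or 100 % the nominal limit applies again.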

10. Validation (during dual-run)

  • Daily review of Grafana side-by-side dashboard.
  • Automated parity tests: same payload → both gateways → compare responses (minus latency headers).
  • Security scan of Kong edge; TLS/cipher parity with previous gateway.
  • Customer-side monitoring: if a customer assigned to Kong sees a regression, page immediately and route that customer back to the NestJS api-gateway.
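
The automated parity test can be sketched as a comparison that ignores headers expected to differ between the two gateways. The response type, the function names, and the exact ignore list are assumptions for illustration; the list would need tuning against real responses from both edges.

```python
from dataclasses import dataclass

# Headers assumed to differ legitimately between gateways (latency, tracing, timing).
IGNORED_HEADERS = frozenset({
    "date",
    "via",
    "x-kong-upstream-latency",
    "x-kong-proxy-latency",
    "x-kong-request-id",
})


@dataclass
class EdgeResponse:
    status: int
    headers: dict[str, str]
    body: bytes


def _comparable_headers(resp: EdgeResponse) -> dict[str, str]:
    # Header names are case-insensitive, so compare them lowercased.
    return {
        name.lower(): value
        for name, value in resp.headers.items()
        if name.lower() not in IGNORED_HEADERS
    }


def parity_diff(kong: EdgeResponse, nestjs: EdgeResponse) -> list[str]:
    """Return human-readable differences; an empty list means parity."""
    diffs = []
    if kong.status != nestjs.status:
        diffs.append(f"status: kong={kong.status} nestjs={nestjs.status}")
    kong_headers = _comparable_headers(kong)
    nestjs_headers = _comparable_headers(nestjs)
    for name in sorted(kong_headers.keys() | nestjs_headers.keys()):
        if kong_headers.get(name) != nestjs_headers.get(name):
            diffs.append(
                f"header {name}: kong={kong_headers.get(name)!r} "
                f"nestjs={nestjs_headers.get(name)!r}"
            )
    if kong.body != nestjs.body:
        diffs.append("body differs")
    return diffs
```

Returning a list of differences rather than a boolean makes a failed parity run easier to triage from CI output.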

11. Communication

  • Internal: engineering-wide announcement of the cutover plan 14 d ahead.
  • Customers: advance notice only if base URL changes (Option B). For Option A, a status-page note at cutover time is sufficient.
  • Partners on IP allow-lists: confirm partner source IPs match expectations before 100 % cutover.

12. Post-migration cleanup

  • Archive the NestJS api-gateway folder in the application monorepo (services/api-gateway/services/_retired/api-gateway-nestjs-<date>/) or delete per source-retention policy.
  • Keep services/api-gateway/_sources/ in the documentation repo for historical context.
  • Update the 01 Enterprise Architecture change log to mark the migration complete.
  • Retire legacy alerts / dashboards specific to the NestJS api-gateway.

13. Success criteria

  • 7 consecutive days at 100 % Kong with:
    • Edge availability ≥ 99.95 %.
    • p95 Kong latency ≤ 150 ms.
    • 5xx rate < 0.5 %.
    • No CRITICAL security findings.
  • NestJS api-gateway at zero replicas, no traffic, no pager events.
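
The numeric criteria above can be checked mechanically. The sketch below assumes the thresholds are evaluated per day and looks for 7 consecutive passing days; the source does not say whether the figures are per-day or aggregated over the window, so that is an assumption, as are the type and function names. The NestJS zero-replica condition is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyMetrics:
    kong_traffic_pct: float
    availability_pct: float
    p95_latency_ms: float
    error_5xx_pct: float
    critical_findings: int


def day_passes(m: DailyMetrics) -> bool:
    """Apply the thresholds from the success criteria to one day."""
    return (
        m.kong_traffic_pct == 100
        and m.availability_pct >= 99.95
        and m.p95_latency_ms <= 150
        and m.error_5xx_pct < 0.5
        and m.critical_findings == 0
    )


def has_seven_day_streak(days: list[DailyMetrics]) -> bool:
    """True once 7 consecutive days (in chronological order) all pass."""
    streak = 0
    for metrics in days:
        streak = streak + 1 if day_passes(metrics) else 0
        if streak >= 7:
            return True
    return False
```

A single failing day resets the streak, so the earliest the criteria can be met is T+7, which lines up with the scale-to-zero step.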

14. Open questions

  • Is there any in-production NestJS api-gateway to migrate from, or is this effectively a greenfield rollout? (Affects the length of the dual-run window.)
  • What is the exact dual-run window? The baseline is 14 d; it can be shortened if metrics stabilise faster.
  • Do we need a regulatory sign-off before 100 % cutover (telecom licence conditions)?