api-gateway (Kong) — Migration Plan
Status: populated | Owner: TBD (Platform / SRE) | Last updated: 2026-04-17 | Companion: ADR-0001 · SERVICE_OVERVIEW · Service Template
1. Purpose
Describe the cutover from the custom NestJS api-gateway to Kong Gateway as the platform edge. If no NestJS api-gateway is yet running in production (pre-GA), this plan reduces to a greenfield deploy. Where one already exists, the plan governs a safe dual-run and switchover.
2. Migration scope
| In scope | Out of scope |
|---|---|
| Replace NestJS api-gateway edge with Kong for all north-south HTTP traffic | SMPP ingress (stays at smpp-connector) |
| Migrate JWT + API-key authentication to Kong plugins | Business authorization — stays in services |
| Migrate rate limiting to Kong (Redis-backed) | Idempotency / payload validation — move to sms-orchestrator (separate work) |
| Add correlation-id, OTel, http-log plugins | Internal east-west traffic |
| Decommission NestJS api-gateway | Cloudflare configuration (unchanged other than origin target) |
3. Preconditions
- auth-service exposes JWKS at /.well-known/jwks.json and an internal API-key resolution endpoint.
- sms-orchestrator has absorbed the idempotency + payload validation + NATS publish responsibilities previously held by NestJS api-gateway (tracked as a separate epic in that service).
- decK YAML (ops/kong/<env>.kong.yaml) exists and passes CI lint.
- Kong deployed to staging and passing SERVICE_READINESS checklist.
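The first precondition could be smoke-checked before cutover with something like the sketch below. It only tests the shape of a JWK Set as defined in RFC 7517 (a JSON object with a `keys` array whose entries each carry a `kty`). The function name `looks_like_jwks` is hypothetical, and fetching the document from auth-service is left out.

```python
import json


def looks_like_jwks(document: str) -> bool:
    """Return True if the text parses as a non-empty JWK Set (shape check only)."""
    try:
        data = json.loads(document)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    keys = data.get("keys")
    # An empty key set is syntactically valid but useless for JWT validation,
    # so this sketch treats it as a failed precondition (an assumption).
    if not isinstance(keys, list) or not keys:
        return False
    return all(isinstance(key, dict) and isinstance(key.get("kty"), str) for key in keys)
```

This does not verify signatures or key material; it only guards against the endpoint returning an error page or an empty set.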
4. Phases
5. Dual-run strategy
Run Kong and the NestJS api-gateway concurrently behind Cloudflare for a bounded window.
- Routing split: Cloudflare Worker (or Load Balancer origin rules) routes on either:
  - A request header (X-Edge-Route: kong) set by an internal proxy for pilot customers, or
  - A percentage weight (canary), preferred once staging passes.
- Both gateways forward to the same upstream services; idempotency key scoping (in sms-orchestrator) ensures no duplicate sends.
- Observability: Grafana dashboard compares p95 latency, 5xx rate, auth failure rate, rate-limit rejection rate side-by-side.
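The routing split can be modelled as a small decision function. The sketch below is a logic model in Python, not Worker code. The header override comes from the plan. Hashing a per-customer key into a stable bucket is an assumption added here so that one customer sees one gateway consistently, whereas Cloudflare's own weighted routing may distribute requests differently. `choose_origin` and `sticky_key` are hypothetical names.

```python
import hashlib

KONG = "kong"
NESTJS = "nestjs"


def choose_origin(headers: dict[str, str], kong_weight_pct: int, sticky_key: str) -> str:
    """Pick the origin for one request: header override first, then canary weight."""
    # Header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    if normalised.get("x-edge-route", "").strip().lower() == "kong":
        return KONG  # pilot customers tagged by the internal proxy

    # Assumption: map the sticky key (e.g. a customer id) to a stable bucket 0..99.
    digest = hashlib.sha256(sticky_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return KONG if bucket < kong_weight_pct else NESTJS
```

With this shape, a weight of 0 sends only header-tagged pilot traffic to Kong, and a weight of 100 sends everything there.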
6. Cutover steps (prod)
- T-14 d: Kong in staging for 1 week; passes smoke, load, and security tests.
- T-7 d: Kong deployed to prod behind Cloudflare with 0 % traffic weight. Synthetic probes only.
- T-3 d: Flip 5 % of traffic (Cloudflare weighted origin). Monitor for 24 h.
- T-2 d: Ramp to 25 %. Monitor 12 h.
- T-1 d: Ramp to 50 %. Monitor 12 h.
- T-0: Ramp to 100 %. NestJS api-gateway still warm.
- T+7 d: Scale NestJS api-gateway to zero replicas (still deployable via GitOps as rollback).
- T+14 d: Remove NestJS api-gateway deployment manifests. Archive the code folder in the application monorepo. Keep the documentation history under services/api-gateway/_sources/ for audit.
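The ramp in the steps above can be captured as data, which makes the planned weight for any day easy to look up or assert in a runbook check. This is a sketch only. `RAMP` and `kong_weight_pct` are hypothetical names, and whole-day granularity is an assumption.

```python
# (day offset relative to T-0, planned Kong traffic weight in percent),
# transcribed from the cutover steps.
RAMP = [(-7, 0), (-3, 5), (-2, 25), (-1, 50), (0, 100)]


def kong_weight_pct(day_offset: int) -> int:
    """Planned Kong weight for a day offset (negative means before T-0)."""
    weight = 0  # before T-7, Kong takes no production traffic
    for start, pct in RAMP:
        if day_offset >= start:
            weight = pct  # the latest step already reached wins
    return weight
```

For example, `kong_weight_pct(-2)` gives the 25 % step, and any offset from T-0 onward stays at 100 %.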
7. Client base URL / DNS
Option A (preferred): Keep the public hostname (api.ghasi.io) — Cloudflare routes transparently; clients do not change anything.
Option B: If migration demands a hostname change (e.g. regulatory audit trail), communicate a new base URL and maintain a permanent redirect from the old hostname for at least 90 days. Prefer 308 over 301, since many HTTP clients turn a redirected POST into a GET on a 301.
We choose Option A unless a specific regulatory reason mandates Option B.
8. Rollback
At any cutover step:
- Cloudflare weighted routing: flip the Kong origin weight back to 0 %; NestJS api-gateway resumes 100 % traffic within one CF propagation cycle (< 30 s).
- Configuration rollback: deck gateway sync against a previous tag.
- Image rollback: blue/green redeploy of the previous Kong image.
No destructive rollback is required because the NestJS api-gateway remains warm through T+7.
9. Data considerations
- Rate-limit counters: Kong builds its own counters in Redis under kong:rl:*. NestJS counters (if any) live in a different keyspace and are not migrated; during dual-run, per-customer limits may be softer, because traffic is split and each gateway counts only its own share. This is acceptable in a 14-day window.
- Consumer credentials: For pilot / static consumers, generate decK YAML from auth-service and include in the Kong config. For customer API keys, the custom ghasi-api-key-lookup plugin avoids migration entirely, since keys resolve at request time.
- JWTs: unchanged; both gateways validate against the same JWKS.
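To make the softer rate limit during dual-run concrete, the sketch below estimates how much of one customer's traffic is admitted when each gateway enforces the limit independently. It assumes both gateways apply the same per-customer limit, keep separate counters, and receive traffic in proportion to the canary weight. None of these assumptions is confirmed by the plan, and `admitted_rate` is a hypothetical name.

```python
def admitted_rate(offered: float, limit: float, kong_weight_pct: int) -> float:
    """Requests per window admitted across both gateways for one customer."""
    kong_share = offered * kong_weight_pct / 100
    nestjs_share = offered - kong_share
    # Each gateway caps only the share it sees, so the caps add up.
    return min(kong_share, limit) + min(nestjs_share, limit)
```

Under these assumptions, a customer limited to 100 requests per window who offers 150 gets all 150 admitted at a 50 % split and 107.5 at a 5 % split. The worst case is twice the nominal limit.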
10. Validation (during dual-run)
- Daily review of Grafana side-by-side dashboard.
- Automated parity tests: same payload → both gateways → compare responses (minus latency headers).
- Security scan of Kong edge; TLS/cipher parity with previous gateway.
- Customer-side monitoring: if a customer is assigned to Kong and sees a regression, page immediately and flip back.
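The automated parity test could compare responses along these lines. The sketch works on already-captured responses, so no HTTP client is involved. The two X-Kong-*-Latency names are headers Kong adds by default. Ignoring Date as well is an assumption, and `EdgeResponse` and `parity_diffs` are hypothetical names.

```python
from dataclasses import dataclass

# Headers expected to differ between gateways and therefore excluded.
IGNORED_HEADERS = frozenset({"x-kong-upstream-latency", "x-kong-proxy-latency", "date"})


@dataclass
class EdgeResponse:
    status: int
    headers: dict[str, str]
    body: bytes


def normalise(resp: EdgeResponse) -> tuple[int, dict[str, str], bytes]:
    """Lower-case header names and drop the ignored ones."""
    headers = {
        name.lower(): value
        for name, value in resp.headers.items()
        if name.lower() not in IGNORED_HEADERS
    }
    return resp.status, headers, resp.body


def parity_diffs(kong: EdgeResponse, nestjs: EdgeResponse) -> list[str]:
    """Return human-readable differences; an empty list means parity."""
    k_status, k_headers, k_body = normalise(kong)
    n_status, n_headers, n_body = normalise(nestjs)
    diffs = []
    if k_status != n_status:
        diffs.append(f"status: kong={k_status} nestjs={n_status}")
    for name in sorted(k_headers.keys() | n_headers.keys()):
        if k_headers.get(name) != n_headers.get(name):
            diffs.append(f"header {name}: kong={k_headers.get(name)!r} nestjs={n_headers.get(name)!r}")
    if k_body != n_body:
        diffs.append("body differs")
    return diffs
```

Returning a list of differences rather than a boolean keeps the output usable in a daily review, since each mismatch names the field that diverged.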
11. Communication
- Internal: engineering-wide announcement of the cutover plan 14 d ahead.
- Customers: advance notice only if base URL changes (Option B). For Option A, a status-page note at cutover time is sufficient.
- Partners on IP allow-lists: confirm partner source IPs match expectations before 100 % cutover.
12. Post-migration cleanup
- Archive the NestJS api-gateway folder in the application monorepo (services/api-gateway/ → services/_retired/api-gateway-nestjs-<date>/) or delete per source-retention policy.
- Keep services/api-gateway/_sources/ in the documentation repo for historical context.
- Update the 01 Enterprise Architecture change log to mark the migration complete.
- Retire legacy alerts / dashboards specific to the NestJS api-gateway.
13. Success criteria
- 7 consecutive days at 100 % Kong with:
  - Edge availability ≥ 99.95 %.
  - p95 Kong latency ≤ 150 ms.
  - 5xx rate < 0.5 %.
  - No CRITICAL security findings.
- NestJS api-gateway at zero replicas, no traffic, no pager events.
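The criteria above can be expressed as a simple gate. In this sketch the thresholds are taken from the list, but evaluating them per day is an assumption, because the plan does not state the aggregation window. `DailyMetrics`, `day_passes` and `migration_succeeded` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DailyMetrics:
    availability_pct: float   # edge availability for the day
    p95_latency_ms: float     # Kong p95 latency
    error_5xx_pct: float      # share of responses that were 5xx
    critical_findings: int    # open CRITICAL security findings


def day_passes(m: DailyMetrics) -> bool:
    """Check one day against the thresholds in the success criteria."""
    return (
        m.availability_pct >= 99.95
        and m.p95_latency_ms <= 150
        and m.error_5xx_pct < 0.5
        and m.critical_findings == 0
    )


def migration_succeeded(days: list[DailyMetrics], nestjs_idle: bool, required_days: int = 7) -> bool:
    """True if the most recent `required_days` days all pass and NestJS is idle.

    `nestjs_idle` stands for: zero replicas, no traffic, no pager events.
    """
    if not nestjs_idle or len(days) < required_days:
        return False
    return all(day_passes(m) for m in days[-required_days:])
```

Only the trailing window is checked, so an early bad day does not block sign-off once seven clean consecutive days follow it.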
14. Open questions
- Is there any in-production NestJS api-gateway to migrate from, or is this effectively a greenfield rollout? (Affects the length of the dual-run window.)
- Exact dual-run window — 14 d baseline; shorter if metrics stabilise faster.
- Do we need a regulatory sign-off before 100 % cutover (telecom licence conditions)?