api-gateway (Kong) — Service Readiness

Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · DEPLOYMENT_TOPOLOGY · Service Template

1. Purpose

Kong must pass this checklist before it can be the sole production edge gateway (i.e. before the custom NestJS api-gateway is decommissioned per MIGRATION_PLAN).

2. Readiness gate checklist

2.1 Configuration

  • ops/kong/<env>.kong.yaml checked into the application monorepo, reviewed, and tagged.
  • All Routes declared in API_CONTRACTS §2.1 are present in prod config.
  • Every Route has an auth plugin or a public:true tag (CI lint green).
  • Every Route is tagged with env and owner.
  • No plaintext secrets in YAML; all secrets via vault references.
  • deck gateway sync has applied cleanly to staging, then to prod.
  • Nightly deck diff job scheduled and returning green.
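The auth and tagging rules above lend themselves to an automated check. What follows is a minimal sketch of such a CI lint, not the project's actual lint job. It assumes the decK file has already been parsed into a dict with the usual services → routes → plugins layout, that tags are plain strings of the form env:<value> and owner:<value>, and that a plugin attached to the parent service also counts as covering its routes. The function name lint_routes is hypothetical.

```python
# Hypothetical CI lint over a parsed decK state file.
# Assumed layout: services -> routes -> plugins/tags; tag formats are assumptions.
AUTH_PLUGINS = {"jwt", "key-auth"}
REQUIRED_TAG_PREFIXES = ("env:", "owner:")


def lint_routes(state):
    """Return a list of violations; an empty list means the lint is green."""
    violations = []
    for service in state.get("services", []):
        # Assumption: plugins on the parent service also protect its routes.
        service_plugins = {p["name"] for p in service.get("plugins", [])}
        for route in service.get("routes", []):
            name = route.get("name", "<unnamed>")
            tags = route.get("tags") or []
            plugins = service_plugins | {p["name"] for p in route.get("plugins", [])}
            if not (plugins & AUTH_PLUGINS) and "public:true" not in tags:
                violations.append(f"{name}: no auth plugin and no public:true tag")
            for prefix in REQUIRED_TAG_PREFIXES:
                if not any(t.startswith(prefix) for t in tags):
                    violations.append(f"{name}: missing {prefix}<value> tag")
    return violations
```

A CI step could fail the build whenever the returned list is non-empty and print each entry as a lint error.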

2.2 Plugins

  • jwt plugin configured on every JWT route; JWKS URL resolves; cache TTL set.
  • key-auth plugin configured on every key-auth route.
  • rate-limiting-advanced enabled with Redis backend; per-route limits reviewed by Product + SRE.
  • request-size-limiting enabled on write routes.
  • ip-restriction applied to /admin/* and partner routes.
  • correlation-id enabled globally.
  • opentelemetry enabled globally; OTel collector endpoint reachable.
  • http-log enabled globally; Loki push endpoint reachable.
  • prometheus enabled globally; Prometheus scrape configured.
  • bot-detection enabled on public routes.
  • Custom plugin (ghasi-api-key-lookup), if used: code-reviewed, unit-test coverage ≥ 80 %, image-pinned.
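The "enabled globally" items can be checked against the declarative config rather than by inspection. The sketch below assumes a parsed decK state in which top-level plugins entries with no service, route, or consumer reference are the global ones. The helper name missing_global_plugins is hypothetical, and the required set is taken from the checklist above.

```python
# Plugin names come from the checklist; the state layout is an assumption.
REQUIRED_GLOBAL_PLUGINS = {"correlation-id", "opentelemetry", "http-log", "prometheus"}


def missing_global_plugins(state):
    """Return the required plugins that are not enabled at global scope."""
    enabled = set()
    for plugin in state.get("plugins", []):
        # A top-level plugin scoped to a service, route, or consumer is not global.
        scoped = any(plugin.get(key) for key in ("service", "route", "consumer"))
        if not scoped and plugin.get("enabled", True):
            enabled.add(plugin["name"])
    return sorted(REQUIRED_GLOBAL_PLUGINS - enabled)
```

An empty result means every required plugin is present and enabled globally; reachability of the collector, Loki, and Prometheus endpoints still has to be verified separately.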

2.3 Security

  • TLS 1.2+ only; ciphers match Mozilla intermediate profile.
  • Origin cert pipeline green; alert set for < 14 d to expiry.
  • Admin API is network-isolated (NetworkPolicy verified).
  • Body logging disabled (verified by sending a test request to /v1/sms/send and confirming the body is not present in Loki).
  • Header scrubbing verified (X-Api-Key not forwarded; Server/X-Powered-By not returned).
  • JWT algorithm restricted to RS256.
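The header-scrubbing item can be turned into a repeatable assertion. This is a rough sketch that assumes the test harness can capture two header sets: the headers an upstream echo service received, and the headers the client got back. How those are captured is left open, and the helper name leaked_headers is hypothetical.

```python
# Header names come from the checklist; comparison is case-insensitive
# because HTTP header names are case-insensitive.
FORBIDDEN_UPSTREAM = {"x-api-key"}
FORBIDDEN_RESPONSE = {"server", "x-powered-by"}


def leaked_headers(headers, forbidden):
    """Return the forbidden header names (lowercased, sorted) found in headers."""
    return sorted({name.lower() for name in headers} & forbidden)
```

A verification test would assert that both leaked_headers(upstream_headers, FORBIDDEN_UPSTREAM) and leaked_headers(response_headers, FORBIDDEN_RESPONSE) return an empty list.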

2.4 Observability

  • Prometheus scraping Kong /metrics.
  • Grafana dashboards live: kong-overview, kong-route-drilldown, kong-auth, kong-rate-limit, kong-plugin-latency, kong-resource.
  • Alerts wired: KongHighErrorRate, KongLatencyP95High, KongUpstreamUnhealthy, KongRateLimitStorm, KongJWKSRefreshFail, KongCertExpirySoon, KongPodRestartLoop, KongRedisUnavailable, KongAuthFailureSpike, KongConfigDrift.
  • Runbooks authored for each alert under docs/ops/runbooks/kong/.
  • SLO dashboard showing 99.95 % availability target.
  • Synthetic probes live (health + SMS-send canary every 5 min).
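To make the 99.95 % target concrete, it helps to translate it into an error budget. The arithmetic below assumes a 30-day rolling window, which this checklist does not specify; adjust window_days to whatever the SLO dashboard actually uses.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability allowed by an availability SLO.

    window_days=30 is an assumption; the checklist does not state the window.
    """
    return (1.0 - slo) * window_days * 24 * 60


# 30 days = 43,200 minutes; 0.05 % of that is about 21.6 minutes.
budget = error_budget_minutes(0.9995)
```

Under that assumption, roughly 21.6 minutes of total downtime per 30 days exhausts the budget.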

2.5 Resilience

  • Blue/green upgrade procedure rehearsed in staging.
  • Rate-limiter fail-mode matrix (fail-closed / fail-open) verified per route.
  • Redis failover tested; Kong continues per matrix.
  • JWKS cache behaviour tested with auth-service down.
  • PDB (minAvailable=2) enforced; rolling restart verified.
  • HPA tested (traffic ramp → scale up → scale down).
  • Load test passed at 2× expected peak for 30 min; no memory leaks, no rate-limit counter drift.
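The fail-mode matrix is easier to verify when the expected behaviour is written down as an executable model. The following sketch models only the decision, not Kong's plugin internals. The route-to-mode entries are hypothetical placeholders, and defaulting unlisted routes to fail-closed is an assumption.

```python
# Hypothetical matrix: route path -> behaviour when the Redis backend is unreachable.
FAIL_MODE = {
    "/v1/sms/send": "closed",  # reject requests rather than risk unmetered sends
    "/health": "open",         # keep serving even without a counter
}


def allow_request(route, under_limit):
    """Decide whether to admit a request.

    under_limit is a callable that returns True if the route is below its
    limit and raises ConnectionError when the counter store is unavailable.
    """
    try:
        return under_limit(route)
    except ConnectionError:
        # Assumption: routes missing from the matrix fail closed.
        return FAIL_MODE.get(route, "closed") == "open"
```

During the Redis failover test, observed responses per route can be compared against what this model predicts.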

2.6 Operations

  • On-call rotation assigned; runbook ownership clear.
  • Deploy approval gate for prod on release tag.
  • DR region standby configured; failover procedure documented.
  • Backup strategy: Git history covers decK YAML; Vault DR per Vault runbook.
  • Access control: only CI/SRE can apply decK; admin API locked down.

2.7 Documentation

  • All 17 service docs populated (this checklist item passes with a full scan).
  • API_CONTRACTS accurately lists every prod route.
  • SECURITY_MODEL threat model reviewed by Security.
  • MIGRATION_PLAN executed (or signed off as in progress).

2.8 Sign-off

  • Tech lead (Platform) sign-off
  • SRE sign-off
  • Security sign-off
  • Product Ops sign-off

3. Post-readiness verification (prod canary)

  • Route 5 % of traffic to Kong (behind the Cloudflare header-split) for 30 min; error rate must stay within baseline and p95 latency within SLO.
  • Rollback plan: set the Cloudflare route weight to 0 %; the previous gateway remains warm during the migration window per MIGRATION_PLAN.
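The canary pass/fail rule can be expressed as a small gate function. This is a sketch under stated assumptions: "within baseline" is read as the canary error rate not exceeding the baseline by more than a configurable relative tolerance (default 0), and the p95 SLO threshold is passed in because this document does not state its value. The function name canary_decision is hypothetical.

```python
def canary_decision(canary_error_rate, baseline_error_rate,
                    canary_p95_ms, slo_p95_ms, tolerance=0.0):
    """Return "proceed" if the canary meets both criteria, else "rollback".

    tolerance is a relative allowance over the baseline error rate
    (an assumption; the checklist only says "within baseline").
    """
    error_ok = canary_error_rate <= baseline_error_rate * (1.0 + tolerance)
    latency_ok = canary_p95_ms <= slo_p95_ms
    return "proceed" if error_ok and latency_ok else "rollback"
```

A "rollback" result corresponds to setting the Cloudflare route weight to 0 %, as described in the rollback plan.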

4. Open questions

  • Which readiness checks are blocking vs advisory for initial staging rollout?