api-gateway (Kong) — Service Readiness
Status: populated Owner: TBD (Platform / SRE) Last updated: 2026-04-17 Companion: SERVICE_OVERVIEW · DEPLOYMENT_TOPOLOGY · Service Template
1. Purpose
Kong must pass this checklist before it can be the sole production edge gateway (i.e. before the custom NestJS api-gateway is decommissioned per MIGRATION_PLAN).
2. Readiness gate checklist
2.1 Configuration
-
ops/kong/<env>.kong.yamlchecked into the application monorepo, reviewed, and tagged. - All Routes declared in API_CONTRACTS §2.1 are present in prod config.
- Every Route has an auth plugin or a
public:truetag (CI lint green). - Every Route is tagged with
env,owner. - No plaintext secrets in YAML; all secrets via vault references.
-
deck gateway synchas applied cleanly to staging, then to prod. - Nightly
deck diffjob scheduled and returning green.
2.2 Plugins
-
jwtplugin configured on every JWT route; JWKS URL resolves; cache TTL set. -
key-authplugin configured on every key-auth route. -
rate-limiting-advancedenabled with Redis backend; per-route limits reviewed by Product + SRE. -
request-size-limitingenabled on write routes. -
ip-restrictionapplied to/admin/*and partner routes. -
correlation-idenabled globally. -
opentelemetryenabled globally; OTel collector endpoint reachable. -
http-logenabled globally; Loki push endpoint reachable. -
prometheusenabled globally; Prometheus scrape configured. -
bot-detectionenabled on public routes. - Custom plugin (
ghasi-api-key-lookup) — if used — code-reviewed, unit tests ≥ 80 % coverage, image-pinned.
2.3 Security
- TLS 1.2+ only; ciphers match Mozilla intermediate profile.
- Origin cert pipeline green; alert set for < 14 d to expiry.
- Admin API is network-isolated (NetworkPolicy verified).
- Body logging disabled (verified via a test
/v1/sms/send— body not present in Loki). - Header scrubbing verified (
X-Api-Keynot forwarded;Server/X-Powered-Bynot returned). - JWT algorithm restricted to
RS256.
2.4 Observability
- Prometheus scraping Kong
/metrics. - Grafana dashboards live:
kong-overview,kong-route-drilldown,kong-auth,kong-rate-limit,kong-plugin-latency,kong-resource. - Alerts wired:
KongHighErrorRate,KongLatencyP95High,KongUpstreamUnhealthy,KongRateLimitStorm,KongJWKSRefreshFail,KongCertExpirySoon,KongPodRestartLoop,KongRedisUnavailable,KongAuthFailureSpike,KongConfigDrift. - Runbooks authored for each alert under
docs/ops/runbooks/kong/. - SLO dashboard showing 99.95 % availability target.
- Synthetic probes live (health + SMS-send canary every 5 min).
2.5 Resilience
- Blue/green upgrade procedure rehearsed in staging.
- Rate-limiter fail-mode matrix (fail-closed / fail-open) verified per route.
- Redis failover tested; Kong continues per matrix.
- JWKS cache behaviour tested with
auth-servicedown. - PDB (
minAvailable=2) enforced; rolling restart verified. - HPA tested (traffic ramp → scale up → scale down).
- Load test passed at 2× expected peak for 30 min; no memory leaks, no rate-limit counter drift.
2.6 Operations
- On-call rotation assigned; runbook ownership clear.
- Deploy approval gate for prod on release tag.
- DR region standby configured; failover procedure documented.
- Backup strategy: Git history covers decK YAML; Vault DR per Vault runbook.
- Access control: only CI/SRE can apply decK; admin API locked down.
2.7 Documentation
- All 17 service docs populated (this checklist item passes with a full scan).
- API_CONTRACTS accurately lists every prod route.
- SECURITY_MODEL threat model reviewed by Security.
- MIGRATION_PLAN executed (or signed off in-progress).
2.8 Sign-off
- Tech lead (Platform) sign-off
- SRE sign-off
- Security sign-off
- Product Ops sign-off
3. Post-readiness verification (prod canary)
- 5 % traffic to Kong (behind Cloudflare header-split) for 30 min; error rate within baseline, p95 latency within SLO.
- Rollback plan: Cloudflare route weight to 0 %; previous gateway remains warm during migration window per MIGRATION_PLAN.
4. Open questions
- Which readiness checks are blocking vs advisory for initial staging rollout?