api-gateway (Kong) — Deployment Topology
Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · DATA_MODEL · Service Template
1. Runtime
| Property | Value |
|---|---|
| Product | Kong Gateway (edition: OSS or Enterprise — see SERVICE_OVERVIEW open questions) |
| Version | Pinned; upgrades follow the Kong LTS track |
| Mode | DB-less (preferred); DB mode is a fallback |
| Container image | Official kong:<version>-alpine (SBOM scanned in CI) |
| Platform | Kubernetes 1.29+ |
| Workload kind | Deployment |
| Replicas | 2 (staging) / 2–6 (prod with HPA) |
| CPU / mem request | 500m / 1 GiB per pod |
| CPU / mem limit | 2000m / 2 GiB per pod |
| HPA | CPU 70 % target; scale 2–6 |
| Pod disruption budget | minAvailable = 2 |
DaemonSet layout is an alternative for dedicated ingress nodes; not chosen by default.
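The HPA and pod disruption budget rows interact, and a small calculation makes that visible. The sketch below applies the standard Kubernetes HPA rule (desired = ceil(current × observed / target), clamped to the replica bounds) to the values in the table. It is a simplification: the default 10 % tolerance band and the scale-down stabilisation window are ignored, and the function names are illustrative only.

```python
import math

# Values taken from the Runtime table above.
MIN_REPLICAS = 2
MAX_REPLICAS = 6
CPU_TARGET_PCT = 70      # HPA target, measured against the CPU request
PDB_MIN_AVAILABLE = 2


def desired_replicas(current_replicas: int, current_cpu_pct: float) -> int:
    """Core HPA rule: ceil(current * observed / target), clamped to bounds.

    Simplified: ignores the HPA tolerance band and stabilisation windows.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / CPU_TARGET_PCT)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


def allowed_disruptions(ready_replicas: int) -> int:
    """Voluntary evictions the PDB permits with this many ready pods."""
    return max(0, ready_replicas - PDB_MIN_AVAILABLE)


if __name__ == "__main__":
    print(desired_replicas(2, 105))   # 2 pods at 105 % of request
    print(allowed_disruptions(2))     # budget at the replica floor
```

With these numbers, a deployment sitting at its floor of 2 replicas has a disruption budget of zero, so a node drain would wait until the HPA scales up or the budget is relaxed. That may be intended, but it is worth confirming against the node-maintenance procedure.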
2. Topology diagram
3. Configuration delivery (GitOps)
- Engineer opens a PR modifying `ops/kong/<env>.kong.yaml` in the application monorepo.
- CI runs the contract-test matrix (see TESTING_STRATEGY §2).
- On merge to `main`, CI runs `deck gateway sync` against staging.
- On release tag, CI runs `deck gateway sync` against production, with an approval gate.
- Nightly `deck diff` job detects drift.
In DB-less mode, Kong pods read a ConfigMap mounted as the `declarative_config` file. deck writes the ConfigMap; a config-watcher sidecar then sends SIGHUP so that Kong reloads in place. If SIGHUP reload is not configured, a rolling restart picks up the change instead.
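The promotion rules in this section can be sketched as a small mapping from Git event to sync action. This is an illustration, not the actual CI definition: the event and ref strings follow common Git conventions, every tag is treated as a release tag, and the file names `staging.kong.yaml` and `prod.kong.yaml` are assumed expansions of `ops/kong/<env>.kong.yaml`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SyncPlan:
    env: str
    command: List[str]
    needs_approval: bool


def plan_sync(event: str, ref: str) -> Optional[SyncPlan]:
    """Map a Git event to the deck sync it should trigger, if any.

    Assumptions (hypothetical): any pushed tag counts as a release tag,
    and the env names used in file paths are "staging" and "prod".
    """
    if event == "push" and ref == "refs/heads/main":
        env, approval = "staging", False
    elif event == "push" and ref.startswith("refs/tags/"):
        env, approval = "prod", True
    else:
        return None  # PRs and other branches only run contract tests
    state_file = f"ops/kong/{env}.kong.yaml"
    return SyncPlan(env, ["deck", "gateway", "sync", state_file], approval)


if __name__ == "__main__":
    print(plan_sync("push", "refs/heads/main"))
    print(plan_sync("push", "refs/tags/v1.4.0"))
```

Keeping the decision in one pure function makes the approval gate easy to unit-test separately from the CI system that enforces it.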
4. Cloudflare upstream
- Cloudflare sits in front as WAF + DDoS + CDN.
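- Request path: Cloudflare → Kubernetes LoadBalancer Service → Kong pods.
- TLS: Cloudflare terminates public TLS with the public certificate. The Cloudflare-to-origin connection terminates at Kong, which presents the origin certificate and verifies Cloudflare's client certificate (Authenticated Origin Pulls).
- Cloudflare applies bot management, IP reputation, and basic L7 rate limiting in front of Kong's finer-grained limits.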
5. Environments
| Env | Domain | Replicas | Redis | DB mode |
|---|---|---|---|---|
| dev | n/a (docker compose) | 1 | local | DB-less |
| staging | api.staging.ghasi.io | 2 | shared staging Redis | DB-less |
| prod | api.ghasi.io | 2–6 (HPA) | dedicated cluster | DB-less (see open Qs) |
| dr | api.dr.ghasi.io | 2 hot-standby | regional replica | DB-less |
6. Regions
- Primary: single region (e.g. `eu-west-1`), the same region as `sms-orchestrator` and Redis.
- DR: warm-standby region, Kong config replicated via Git. RTO 4 h, RPO 1 h (matches platform baseline, see 01 Enterprise Architecture §10).
7. Networking
| Concern | Setting |
|---|---|
| Public ingress | Cloudflare → cloud LB (L4 or L7) → Kong Service |
| Kong Service type | LoadBalancer (public subnet) or NodePort behind an external LB |
| Admin API | Internal-only Service; NetworkPolicy restricts to SRE/CI pods |
| Upstream calls | Cluster DNS; mTLS preferred (via service mesh or direct) |
| Egress | Allow-list: auth-service, Redis, OTel collector, Loki |
NetworkPolicy excerpts:
- Deny all ingress to the Kong admin port except from pods labelled `role=sre-tooling`.
- Kong pods may egress to cluster DNS, named services, and the OTel/Loki endpoints only.
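As a rough illustration of the ingress excerpt, the sketch below builds a NetworkPolicy manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. Several values are assumptions rather than settings from this document: the `kong` namespace, the `app: kong` pod label, the policy name, and Kong's default ports (8001 for the Admin API, 8000/8443 for the proxy). Because a NetworkPolicy that selects a pod denies all ingress not explicitly allowed, the proxy ports need their own rule.

```python
import json


def kong_ingress_policy(namespace: str = "kong") -> dict:
    """NetworkPolicy limiting Admin API access to SRE tooling pods.

    Hypothetical names: namespace, policy name, and the app=kong label.
    Ports are Kong defaults and may differ in the real deployment.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "kong-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": "kong"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Admin API: only pods labelled role=sre-tooling.
                    # A bare podSelector matches pods in this namespace only.
                    "from": [
                        {"podSelector": {"matchLabels": {"role": "sre-tooling"}}}
                    ],
                    "ports": [{"protocol": "TCP", "port": 8001}],
                },
                {
                    # Proxy ports: no "from" clause, so any source may connect.
                    "ports": [
                        {"protocol": "TCP", "port": 8000},
                        {"protocol": "TCP", "port": 8443},
                    ],
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(kong_ingress_policy(), indent=2))
```

If the SRE or CI pods run in a different namespace, the first rule would also need a `namespaceSelector`; the egress allow-list would be a separate policy of type `Egress`.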
8. Dependencies at runtime
| Dependency | Failure mode |
|---|---|
| auth-service (JWKS + key resolution) | Degrades JWT validation after cache TTL; custom plugin rejects new API keys |
| Redis | Rate-limit counters fail per route policy (closed on writes, open on reads) |
| OTel collector | Traces dropped; alerts on sustained export failure |
| Loki | Log buffer fills then drops; not request-path critical |
| Prometheus | No impact on request path |
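The Redis row describes a split failure policy, which a short sketch may make more concrete. The code below reads "closed on writes, open on reads" as: when Redis is unreachable, requests with write methods are rejected and requests with read methods are let through. That reading, the method sets, and the function names are assumptions; the real policy is defined per route.

```python
from typing import Callable

# Assumption: "reads" means safe HTTP methods; everything else is a write.
READ_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def allow_when_redis_down(http_method: str) -> bool:
    """Fail open for reads, fail closed for writes."""
    return http_method.upper() in READ_METHODS


def check_rate_limit(
    http_method: str, increment: Callable[[], int], limit: int
) -> bool:
    """Return True if the request may proceed.

    `increment` stands in for the Redis INCR call and returns the new
    counter value; it is expected to raise ConnectionError on failure.
    """
    try:
        count = increment()
    except ConnectionError:
        return allow_when_redis_down(http_method)
    return count <= limit


if __name__ == "__main__":
    def redis_down() -> int:
        raise ConnectionError("redis unavailable")

    print(check_rate_limit("GET", redis_down, limit=100))
    print(check_rate_limit("POST", redis_down, limit=100))
```

Failing closed on writes protects downstream SMS submission from unmetered traffic during a Redis outage, at the cost of rejecting legitimate writes until Redis recovers.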
9. Upgrade strategy
- Config changes: live reload via SIGHUP where configured; otherwise a rolling pod restart.
- Kong version bumps: blue/green. Stand up a new ReplicaSet with the new image, flip Service selector after smoke test, keep old ReplicaSet for 1 h.
- Custom plugin releases: ship as part of the Kong image (build-time install). Roll with blue/green.
10. Resource sizing (reference)
For an SMS peak of ~5 000 req/s at the edge with p95 Kong latency < 150 ms:
- 4 Kong pods @ 1 vCPU each, ~60 % CPU headroom.
- Redis sustained throughput ~30 k ops/s (rate limit INCR + EXPIRE). 3-node cluster, 4 GiB each.
Final numbers ratified at load test (see TESTING_STRATEGY §7).
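The reference figures above imply a few per-pod and per-request budgets that are useful as sanity checks before the load test. The arithmetic below is a back-of-envelope sketch: it assumes traffic is spread evenly across pods and reads "~60 % CPU headroom" as roughly 40 % utilisation at peak, neither of which this document states explicitly.

```python
# Inputs taken from the sizing reference above.
PEAK_RPS = 5_000
KONG_PODS = 4
VCPU_PER_POD = 1.0
CPU_HEADROOM = 0.60          # assumed to mean ~40 % utilisation at peak
REDIS_OPS_PER_SEC = 30_000

# Even load distribution across pods is an assumption.
rps_per_pod = PEAK_RPS / KONG_PODS                      # 1250.0 req/s

# CPU actually consumed per pod at peak, in vCPU.
cpu_used_per_pod = VCPU_PER_POD * (1 - CPU_HEADROOM)    # about 0.4

# Implied CPU budget per request, in milliseconds.
cpu_ms_per_request = cpu_used_per_pod / rps_per_pod * 1000   # about 0.32

# Redis operations implied per edge request.
redis_ops_per_request = REDIS_OPS_PER_SEC / PEAK_RPS    # 6.0

if __name__ == "__main__":
    print(f"{rps_per_pod:.0f} req/s per pod")
    print(f"{cpu_ms_per_request:.2f} ms CPU per request")
    print(f"{redis_ops_per_request:.0f} Redis ops per request")
```

Six Redis operations per request would correspond to, for example, three INCR + EXPIRE pairs if every request is rate-limited on three counters; the load test should confirm whether that matches the configured limits.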
11. Open questions
- DaemonSet vs Deployment on dedicated ingress nodes — revisit if ingress traffic grows significantly.
- Service mesh (Linkerd / Istio) for Kong ↔ upstream mTLS vs direct TLS.
- HPA metric: CPU only or custom metric (e.g. `kong_http_requests_total` rate).