api-gateway (Kong) — Service Risk Register
Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · FAILURE_MODES · Service Template
1. Purpose
Track known risks of standardising on Kong Gateway for the platform edge.
2. Risk register
| ID | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| GW-R-001 | Single point of failure on the north-south path — Kong outage = platform outage | Medium | Critical | HA (2–6 replicas) + PDB; multi-pod blue/green upgrades; DR region with warm standby; Cloudflare load balancing across origins | SRE |
| GW-R-002 | Config drift between live Kong and Git | Medium | High | Admin API write-locked to CI; nightly deck diff job; KongConfigDrift alert | SRE |
| GW-R-003 | Route drift vs upstream OpenAPI | Medium | Medium | CI lint: Routes must match upstream OpenAPI paths | Platform |
| GW-R-004 | Kong plugin version drift across environments | Medium | Medium | Pinned Kong image; single decK source of truth; CI uses the same image as prod | SRE |
| GW-R-005 | Kong major-version upgrade path (breaking config format changes) | Low | High | Follow LTS track; rehearse upgrade in staging; maintain blue/green rollback path | SRE |
| GW-R-006 | Vendor lock-in to Kong | Low | Low | Kong OSS is Apache-2.0 licensed; decK YAML is largely portable to other OSS gateways (Tyk, APISIX) with effort; custom plugin is isolated | Platform |
| GW-R-007 | Operational expertise — Kong Lua/Go plugin debugging requires specialised knowledge | Medium | Medium | Document plugin behaviour here; keep custom plugin count minimal (ideally one); train 2+ engineers | SRE |
| GW-R-008 | Rate-limit Redis dependency — correctness of SMS TPS gates | Medium | High | Redis cluster with redundancy; fail-mode matrix documented; synthetic rate-limit canary | SRE |
| GW-R-009 | JWT/JWKS tight coupling to auth-service, risking cascading failure | Low | Medium | JWKS cached for 5 min; staged upgrades of auth-service; alert on JWKS refresh failure | Platform |
| GW-R-010 | Custom plugin (ghasi-api-key-lookup) performance regression | Medium | Medium | Unit + load tests; Prometheus metrics; canary rollout | Platform |
| GW-R-011 | Cloudflare-to-Kong misconfiguration during rollout | Medium | High | MIGRATION_PLAN dual-running window + canary + documented rollback | SRE |
| GW-R-012 | Kong edition choice (OSS vs Enterprise vs Konnect) not finalised | Medium | Medium | Decision before prod readiness gate; OSS viable for Slice 0 / 1 | Platform lead |
| GW-R-013 | Admin API exposed publicly by accident | Low | Critical | NetworkPolicy + separate listen port + mTLS; CI check | SRE |
| GW-R-014 | Body logging re-enabled by accident (PII leak) | Low | Critical | Audited in integration tests; http-log config linted in CI | Security |
| GW-R-015 | Over-reliance on Kong for authorization (scope creep) | Medium | High | Keep authoritative authz in upstream services and restate this principle consistently; limit Kong plugins to coarse gating | Architecture |
| GW-R-016 | Kong licence cost escalation (if Enterprise chosen) | Medium | Medium | Track usage-based pricing; prefer OSS until clear Enterprise feature need | Finance |
| GW-R-017 | Plugin supply-chain compromise (malicious upstream plugin) | Low | Critical | Only bundled plugins; pinned versions; SBOM; code-review for custom plugin | Security |
| GW-R-018 | Cert rotation failure causing TLS outage | Low | Critical | cert-manager or Cloudflare auto-rotation; KongCertExpirySoon alert at T-14d | SRE |
| GW-R-019 | Misapplied NetworkPolicy blocking Admin API | Low | Medium | Staged NetworkPolicy change; test in staging first | SRE |
| GW-R-020 | Kong is incorrectly assumed to terminate SMPP | Low | Low | ADR-0001 §7 is explicit; SMPP terminates at smpp-connector | Architecture |
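
As an illustration of what the drift check in GW-R-002 reports, the sketch below classifies differences between the Git-declared state and a dump of the live gateway. It is not how decK computes its diff. It assumes both sides are already parsed into lists of entities with a unique `name` field, and `diff_entities`, `has_drift`, and the sample routes are hypothetical names used only here.

```python
from typing import Any

Entity = dict[str, Any]


def diff_entities(desired: list[Entity], live: list[Entity]) -> dict[str, list[str]]:
    """Classify drift between two entity lists, matched by their 'name' field."""
    desired_by_name = {e["name"]: e for e in desired}
    live_by_name = {e["name"]: e for e in live}
    shared = desired_by_name.keys() & live_by_name.keys()
    return {
        # Declared in Git but absent from the running gateway.
        "missing_in_live": sorted(desired_by_name.keys() - live_by_name.keys()),
        # Present on the gateway but never declared in Git.
        "unexpected_in_live": sorted(live_by_name.keys() - desired_by_name.keys()),
        # Present on both sides with different field values.
        "changed": sorted(n for n in shared if desired_by_name[n] != live_by_name[n]),
    }


def has_drift(report: dict[str, list[str]]) -> bool:
    """True if any drift category is non-empty (the condition an alert would key on)."""
    return any(report.values())


# Hypothetical sample data, not taken from the real configuration.
git_routes = [
    {"name": "messages", "paths": ["/v1/messages"]},
    {"name": "contacts", "paths": ["/v1/contacts"]},
]
live_routes = [
    {"name": "messages", "paths": ["/v1/messages", "/debug"]},
    {"name": "hotfix", "paths": ["/tmp"]},
]
report = diff_entities(git_routes, live_routes)
```

In this sample the report flags `contacts` as missing, `hotfix` as unexpected, and `messages` as changed, so a nightly job built on such a check would have something to alert on.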
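
The CI lint named in GW-R-003 could be sketched roughly as follows. It assumes routes are declared as plain prefix paths (no regex routes) and that both the route paths and the OpenAPI paths have already been extracted into lists. `find_route_drift` and the sample paths are hypothetical, and the segment-boundary rule is an assumption that is deliberately stricter than a plain string-prefix comparison.

```python
def _covers(route_path: str, api_path: str) -> bool:
    """True if api_path equals route_path or sits beneath it at a '/' boundary."""
    prefix = route_path.rstrip("/")
    return api_path == prefix or api_path.startswith(prefix + "/")


def find_route_drift(
    route_paths: list[str], openapi_paths: list[str]
) -> tuple[list[str], list[str]]:
    """Return (orphan_routes, unrouted_paths).

    orphan_routes: gateway routes with no OpenAPI path under them.
    unrouted_paths: OpenAPI paths that no gateway route would reach.
    """
    orphan_routes = sorted(
        r for r in route_paths if not any(_covers(r, p) for p in openapi_paths)
    )
    unrouted_paths = sorted(
        p for p in openapi_paths if not any(_covers(r, p) for r in route_paths)
    )
    return orphan_routes, unrouted_paths


# Hypothetical sample data, not taken from the real configuration.
routes = ["/v1/messages", "/v1/legacy"]
openapi = ["/v1/messages", "/v1/messages/{id}", "/v1/contacts"]
orphans, unrouted = find_route_drift(routes, openapi)
```

A CI step built on this would fail the build when either list is non-empty. Here `/v1/legacy` has no matching OpenAPI path and `/v1/contacts` has no route.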
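
The caching behaviour behind the GW-R-009 mitigation can be illustrated with a small TTL cache. This is a sketch of the idea, not Kong's or any plugin's implementation. The 5-minute TTL comes from the register. Serving stale keys when a refresh fails is an assumption made for this example, and `JwksCache`, `fetch`, and `on_refresh_failure` are hypothetical names.

```python
import time
from typing import Any, Callable, Optional

JWKS_TTL_SECONDS = 5 * 60  # "JWKS cached 5 min" (GW-R-009)


class JwksCache:
    """TTL cache that keeps serving the last good key set if a refresh fails."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl: float = JWKS_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
        on_refresh_failure: Optional[Callable[[Exception], None]] = None,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._on_failure = on_refresh_failure or (lambda exc: None)
        self._keys: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._keys  # still fresh, no call to auth-service
        try:
            keys = self._fetch()
        except Exception as exc:
            self._on_failure(exc)  # hook for the "refresh failed" alert
            if self._fetched_at is None:
                raise  # nothing cached yet, so there is nothing stale to serve
            return self._keys  # assumption: stale keys beat an outage
        self._keys = keys
        self._fetched_at = now
        return keys
```

Because a failed refresh leaves the timestamp untouched, the next call retries the fetch, so the cache recovers as soon as auth-service does. How long stale keys may be served is a policy decision this sketch does not settle.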
3. Residual risk statement
Adopting Kong concentrates edge risk into a single well-understood component rather than spreading it across a bespoke NestJS gateway. The net risk posture is lower than with a custom gateway, provided HA, observability, and config-as-code discipline are in place.
4. Open questions
- Should we adopt a second, diverse edge (e.g. NGINX ingress) as a hot-warm standby for Kong outages?
- Do any insurance or contractual obligations require a specific edge product or edition?