api-gateway (Kong) — Service Risk Register

Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · FAILURE_MODES · Service Template

1. Purpose

Track known risks of standardising on Kong Gateway for the platform edge.

2. Risk register

| ID | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| GW-R-001 | Single point of failure on the north-south path: Kong outage = platform outage | Medium | Critical | HA (2–6 replicas) + PDB; multi-pod blue/green upgrades; DR region with warm standby; Cloudflare load balancing across origins | SRE |
| GW-R-002 | Config drift between live Kong and Git | Medium | High | Admin API write-locked to CI; nightly deck diff job; KongConfigDrift alert | SRE |
| GW-R-003 | Route drift vs upstream OpenAPI | Medium | Medium | CI lint: Routes must match upstream OpenAPI paths | Platform |
| GW-R-004 | Kong plugin version drift across environments | Medium | Medium | Pinned Kong image; single decK source of truth; CI uses the same image as prod | SRE |
| GW-R-005 | Kong major-version upgrade path (breaking config format changes) | Low | High | Follow LTS track; rehearse upgrade in staging; maintain blue/green rollback path | SRE |
| GW-R-006 | Vendor lock-in to Kong | Low | Low | Kong OSS is Apache-2.0 licensed; decK YAML is largely portable to other OSS gateways (Tyk, APISIX) with effort; custom plugin is isolated | Platform |
| GW-R-007 | Operational expertise: Kong Lua/Go plugin debugging requires specialised knowledge | Medium | Medium | Document plugin behaviour here; keep custom plugin count minimal (ideally one); train 2+ engineers | SRE |
| GW-R-008 | Rate-limit Redis dependency: correctness of SMS TPS gates | Medium | High | Redis cluster with redundancy; fail-mode matrix documented; synthetic rate-limit canary | SRE |
| GW-R-009 | JWT/JWKS tight coupling to auth-service: cascading failure | Low | Medium | JWKS cached 5 min; staged upgrade of auth-service; alert on refresh fail | Platform |
| GW-R-010 | Custom plugin (ghasi-api-key-lookup) performance regression | Medium | Medium | Unit + load tests; Prometheus metrics; canary rollout | Platform |
| GW-R-011 | Cloudflare-to-Kong misconfiguration during rollout | Medium | High | MIGRATION_PLAN dual-running window + canary + documented rollback | SRE |
| GW-R-012 | Kong edition choice (OSS vs Enterprise vs Konnect) not finalised | Medium | Medium | Decision before prod readiness gate; OSS viable for Slice 0 / 1 | Platform lead |
| GW-R-013 | Admin API exposed publicly by accident | Low | Critical | NetworkPolicy + separate listen port + mTLS; CI check | SRE |
| GW-R-014 | Body logging re-enabled by accident (PII leak) | Low | Critical | Audited in integration tests; http-log config linted in CI | Security |
| GW-R-015 | Over-reliance on Kong for authorization (creep) | Medium | High | Repeatedly assert: authoritative authz in upstream services; keep Kong plugins to coarse gating | Architecture |
| GW-R-016 | Kong licence cost escalation (if Enterprise chosen) | Medium | Medium | Track usage-based pricing; prefer OSS until clear Enterprise feature need | Finance |
| GW-R-017 | Plugin supply-chain compromise (malicious upstream plugin) | Low | Critical | Only bundled plugins; pinned versions; SBOM; code-review for custom plugin | Security |
| GW-R-018 | Cert rotation failure causing TLS outage | Low | Critical | cert-manager or Cloudflare auto-rotation; KongCertExpirySoon alert at T-14d | SRE |
| GW-R-019 | Misapplied NetworkPolicy blocking Admin API | Low | Medium | Staged NetworkPolicy change; test in staging first | SRE |
| GW-R-020 | Kong is incorrectly assumed to terminate SMPP | Low | Low | ADR-0001 §7 is explicit; SMPP terminates at smpp-connector | Architecture |
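
As an illustration of the GW-R-003 mitigation (CI lint: Routes must match upstream OpenAPI paths), the check could look roughly like the sketch below. It is not the actual CI job. It assumes the decK file and the OpenAPI document have already been parsed into dictionaries, that Kong route paths are plain prefixes (regex routes are not handled), and that a route "matches" when it equals an OpenAPI path or is a whole-segment prefix of one. The function names and the sample service and route names are hypothetical.

```python
import re


def normalise(path: str) -> str:
    """Replace OpenAPI path parameters such as {id} with '*' and drop any trailing slash."""
    path = re.sub(r"\{[^/}]+\}", "*", path)
    return path.rstrip("/") or "/"


def covers(prefix: str, api_path: str) -> bool:
    """True when prefix equals api_path or is a whole-segment prefix of it."""
    if prefix == "/":
        return True
    return api_path == prefix or api_path.startswith(prefix + "/")


def find_route_drift(kong_config: dict, openapi_spec: dict) -> list[str]:
    """Return Kong route paths that do not correspond to any OpenAPI path.

    Assumption: kong_config follows the decK layout (services -> routes -> paths)
    and openapi_spec carries the usual top-level "paths" object.
    """
    api_paths = [normalise(p) for p in openapi_spec.get("paths", {})]
    drifted = []
    for service in kong_config.get("services", []):
        for route in service.get("routes", []):
            for raw in route.get("paths", []):
                if not any(covers(normalise(raw), p) for p in api_paths):
                    drifted.append(f"{service.get('name')}/{route.get('name')}: {raw}")
    return drifted


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    kong = {"services": [{"name": "sms", "routes": [
        {"name": "send", "paths": ["/v1/messages"]},
        {"name": "legacy", "paths": ["/v0/send"]},
    ]}]}
    spec = {"paths": {"/v1/messages": {}, "/v1/messages/{id}": {}}}
    for finding in find_route_drift(kong, spec):
        print("route drift:", finding)
```

In a CI job, a non-empty result would fail the build. Whether prefix matching is strict enough, or whether exact path-and-method matching is needed, is a design choice the register does not specify.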

3. Residual risk statement

Adopting Kong concentrates edge risk into a single well-understood component rather than spreading it across a bespoke NestJS gateway. The net risk posture is lower than with a custom gateway, provided HA, observability, and config-as-code discipline are in place.

4. Open questions

  • Should we adopt a second, diverse edge (e.g. NGINX ingress) as a hot-warm standby for Kong outages?
  • Do any insurance or contractual obligations demand a specific edge product or edition?