api-gateway (Kong) — Observability

Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · EVENT_SCHEMAS · Service Template

1. Purpose

Define the SLIs, SLOs, dashboards, alerts, and runbooks for Kong as the edge gateway. Kong is the first component in the critical path for every external request; its reliability directly gates platform SLAs.

2. Metrics (Prometheus)

Exposed by the prometheus plugin on an internal scrape port. See EVENT_SCHEMAS §4 for the full metric list.

2.1 Gold-signal SLIs

| SLI | Definition | Target (SLO) |
| --- | --- | --- |
| Edge availability | sum(rate(kong_http_requests_total{code!~"5.."}[5m])) / sum(rate(kong_http_requests_total[5m])) | ≥ 99.95 % |
| Edge latency (p95) | histogram_quantile(0.95, sum by (le) (rate(kong_http_latency_ms_bucket[5m]))) | ≤ 150 ms (Kong-only) |
| Edge latency (p99) | Same, with quantile 0.99 | ≤ 500 ms |
| Upstream error rate | sum(rate(kong_http_requests_total{code=~"5.."}[5m])) / sum(rate(kong_http_requests_total[5m])) | < 0.5 % of traffic |
| Rate-limit rejection rate | sum(rate(kong_http_requests_total{code="429"}[5m])) / sum(rate(kong_http_requests_total[5m])) | < 0.1 % (healthy); spike → alert |
| JWKS refresh failures | increase(kong_jwks_refresh_total{result="error"}[5m]) | 0 sustained |
| Auth failure rate on /v1/auth/login | rate(kong_http_requests_total{route="rt-auth-login",code="401"}[5m]) | baseline + 3σ alert |
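
The availability target implies an error budget: at 99.95 %, at most 0.05 % of requests in a window may end in a 5xx. The arithmetic can be sketched in Python as follows, assuming the request counts have already been read from Prometheus. The function names and the sample numbers are illustrative and do not come from the Kong configuration.

```python
def availability(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return (total_requests - server_errors) / total_requests


def error_budget_remaining(total_requests: int, server_errors: int,
                           slo: float = 0.9995) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means the SLO is breached.
    """
    allowed_errors = total_requests * (1.0 - slo)
    if allowed_errors == 0:
        return 1.0
    return 1.0 - server_errors / allowed_errors


if __name__ == "__main__":
    # Hypothetical window: 10 million requests, 2,000 of them 5xx.
    print(f"availability: {availability(10_000_000, 2_000):.4%}")
    print(f"budget left:  {error_budget_remaining(10_000_000, 2_000):.0%}")
```

In this hypothetical window the budget allows roughly 5,000 failed requests, so 2,000 failures leave about 60 % of the budget unspent.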

2.2 Secondary metrics

  • kong_kong_latency_ms (plugin overhead)
  • kong_upstream_latency_ms (upstream health)
  • kong_nginx_http_current_connections
  • kong_memory_lua_shared_dict_bytes
  • ghasi_api_key_lookup_latency_seconds / _total

3. Logs (Loki)

  • Plugin: http-log → Loki push endpoint.
  • Format: JSON per-request, schema in EVENT_SCHEMAS §3.
  • Labels: service="kong", env, route, code_class (2xx, 4xx, 5xx).
  • Retention: 14 d hot / 90 d cold.
  • Body logging: disabled (PII).
  • LogQL examples:
    • Slow errors: {service="kong", code_class="5xx"} | json | latency_ms > 1000
    • Auth failure spikes: {service="kong", route="rt-auth-login"} | json | status = 401 (status is a parsed JSON field, not a stream label, so it is filtered after | json)
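
For scripted triage, the same LogQL can be sent to Loki's HTTP range-query endpoint (/loki/api/v1/query_range). The Python sketch below only builds the request URL. The base URL is a placeholder and not the real Loki address, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def loki_query_range_url(base_url: str, logql: str,
                         start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki range-query URL; start and end are Unix epoch nanoseconds."""
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Slow 5xx responses, as in the first LogQL example above.
    slow_errors = '{service="kong", code_class="5xx"} | json | latency_ms > 1000'
    url = loki_query_range_url(
        "http://loki.example.internal:3100",  # placeholder host (assumption)
        slow_errors,
        start_ns=1_700_000_000_000_000_000,
        end_ns=1_700_000_300_000_000_000,
        limit=50,
    )
    print(url)
```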

4. Traces (OpenTelemetry)

  • Plugin: opentelemetry → OTel collector (OTLP gRPC).
  • Span name: kong.request
  • Attributes: see EVENT_SCHEMAS §5.
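  • Sampling: 10 % head-based by default; 100 % for /v1/auth/login; 100 % for 5xx. A head-based decision is made before the response status is known, so keeping every 5xx trace needs a tail-based decision (for example in the OTel collector); see §11.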
  • Upstream service spans chain as children via W3C traceparent.
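
How upstream spans attach to kong.request can be illustrated with the traceparent header itself. The Python sketch below assumes the W3C version 00 layout (version, trace ID, parent span ID, flags). The helper names are hypothetical, and this is not how Kong's opentelemetry plugin is implemented.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00 layout: 2 hex - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        """True when the sampled bit (lowest flag bit) is set."""
        return (int(self.flags, 16) & 0x01) == 0x01


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or all-zero."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    tp = TraceParent(*match.groups())
    if tp.version == "ff" or set(tp.trace_id) == {"0"} or set(tp.parent_id) == {"0"}:
        return None
    return tp


def child_header(parent: TraceParent) -> str:
    """Header an upstream call would carry: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{parent.version}-{parent.trace_id}-{span_id}-{parent.flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    parent = parse_traceparent(incoming)
    if parent is not None:
        print(parent.sampled, child_header(parent))
```

The trace ID stays constant across hops while each hop mints a new span ID, which is what lets the backend stitch upstream spans under the gateway span.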

5. Dashboards (Grafana)

Prebuilt dashboards under ops/grafana/dashboards/kong/:

  1. kong-overview — total RPS, error rate, p50/p95/p99 latency, 429 rate, top routes.
  2. kong-route-drilldown — same by Route; stacked by upstream status class.
  3. kong-auth — JWT success/failure, API-key lookup hit/miss, JWKS refresh state.
  4. kong-rate-limit — rejected requests by limit_by, Redis health, counter growth.
  5. kong-plugin-latency — Kong-internal latency by plugin phase (via OTel spans or synthetic).
  6. kong-resource — pod CPU/mem, connections, worker health.

6. Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| KongHighErrorRate | 5xx rate > 1 % for 5 min | critical | Page on-call; check upstreams + runbook |
| KongLatencyP95High | Kong p95 > 500 ms for 10 min | high | Investigate plugin cost / worker saturation |
| KongUpstreamUnhealthy | Upstream health check failing for > 2 min | critical | Failover or scale upstream |
| KongRateLimitStorm | 429 rate > 5 % of traffic for 5 min | high | Investigate abuse; may be legitimate traffic surge |
| KongJWKSRefreshFail | > 3 consecutive failures | high | Check auth-service health |
| KongCertExpirySoon | TLS cert < 14 d | medium | Trigger rotation |
| KongPodRestartLoop | CrashLoopBackOff | critical | Check config + resource limits |
| KongRedisUnavailable | Rate-limit plugin reports Redis errors > 10/min | high | Check Redis; review fail-open/closed behaviour |
| KongAuthFailureSpike | 401 rate on /v1/auth/login > baseline+3σ | high | Possible credential-stuffing; review IPs |
| KongConfigDrift | deck diff CI job detects drift between Git and live | medium | Investigate manual change; resync |

All alerts route to the platform-edge Pager rotation.
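
The baseline + 3σ condition used by KongAuthFailureSpike can be sketched in Python as below. It assumes the baseline is a list of recent 401-rate samples and uses the population standard deviation. The sample values are made up, and the real rule would more likely be written in PromQL (for example with avg_over_time and stddev_over_time).

```python
from statistics import mean, pstdev
from typing import Sequence


def three_sigma_threshold(baseline: Sequence[float], k: float = 3.0) -> float:
    """Mean of the baseline samples plus k population standard deviations."""
    return mean(baseline) + k * pstdev(baseline)


def is_spike(current: float, baseline: Sequence[float]) -> bool:
    """True when the current rate exceeds baseline mean + 3 sigma."""
    return current > three_sigma_threshold(baseline)


if __name__ == "__main__":
    # Hypothetical 401 rates (requests per second) from earlier windows.
    history = [2, 4, 4, 4, 5, 5, 7, 9]
    print(three_sigma_threshold(history))
    print(is_spike(12, history), is_spike(10, history))
```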

7. Runbooks

Runbooks live under docs/ops/runbooks/kong/ (to be authored):

| Alert | Runbook |
| --- | --- |
| KongHighErrorRate | kong-5xx-triage.md |
| KongUpstreamUnhealthy | kong-upstream-down.md |
| KongRateLimitStorm | kong-ratelimit-storm.md |
| KongJWKSRefreshFail | kong-jwks-refresh.md |
| KongCertExpirySoon | kong-cert-rotation.md |
| KongConfigDrift | kong-config-drift.md |

8. Health endpoints

| Path | Purpose | Exposure |
| --- | --- | --- |
| /health | Data-plane liveness (static 200) | Public (non-sensitive) |
| /ready | Readiness (JWKS loaded, Redis reachable) | Public |
| /status | Kong admin status (connections, workers) | Internal only |
| /metrics | Prometheus scrape | Internal only |

9. Tracing baggage

Kong propagates and enriches OTel baggage (baggage: account.id=...,tier=...) so upstream services tag spans/logs consistently. Upstream services must not trust baggage for authorization decisions; it is for observability only.
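
On the receiving side, an upstream service might read the baggage header roughly as in the Python sketch below. It assumes the W3C Baggage layout (comma-separated key=value members, optional ;properties, percent-encoded values). The function name and the sample values are hypothetical.

```python
from typing import Dict
from urllib.parse import unquote


def parse_baggage(header: str) -> Dict[str, str]:
    """Parse a baggage header into a dict, ignoring per-member properties."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0]  # drop optional ;properties
        if "=" not in key_value:
            continue  # skip malformed members instead of failing the request
        key, value = key_value.split("=", 1)
        key = key.strip()
        if key:
            entries[key] = unquote(value.strip())
    return entries


if __name__ == "__main__":
    # Hypothetical header value; real keys are set by Kong.
    tags = parse_baggage("account.id=acct_123, tier=gold;ttl=60")
    # Use the result for span/log tags only, never for authorization.
    print(tags)
```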

10. Synthetic monitoring

  • A Blackbox exporter probes https://api.ghasi.io/health every 30 s from multiple regions.
  • A synthetic /v1/sms/send request with a dedicated internal API key runs every 5 min in staging and prod; failure pages on-call after 2 consecutive fails.
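
The "page after 2 consecutive fails" rule for the synthetic request amounts to a small counter. The Python sketch below assumes the page should fire once when the threshold is reached, not on every later failure. The class name is hypothetical and the actual paging integration is not shown.

```python
class ConsecutiveFailurePager:
    """Signals a page once when a probe has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 2) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True if this result should page."""
        if success:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        # Fire exactly when the streak reaches the threshold, not afterwards.
        return self.consecutive_failures == self.threshold


if __name__ == "__main__":
    pager = ConsecutiveFailurePager(threshold=2)
    for ok in [True, False, False, False, True]:
        print(ok, pager.record(ok))
```

Requiring two consecutive failures trades up to one extra probe interval (5 min) of detection delay for fewer pages caused by a single transient failure.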

11. Open questions

  • Tail-based OTel sampling vs head-based (error-biased retention).
  • Separate Grafana tenant for SRE edge dashboards.