api-gateway (Kong) — Observability
Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · EVENT_SCHEMAS · Service Template
1. Purpose
Define the SLIs, SLOs, dashboards, alerts, and runbooks for Kong as the edge gateway. Kong is the first component in the critical path for every external request; its reliability directly gates platform SLAs.
2. Metrics (Prometheus)
Exposed by the prometheus plugin on an internal scrape port. See EVENT_SCHEMAS §4 for the full metric list.
2.1 Gold-signal SLIs
| SLI | Definition | Target (SLO) |
|---|---|---|
| Edge availability | sum(rate(kong_http_requests_total{code!~"5.."}[5m])) / sum(rate(kong_http_requests_total[5m])) | ≥ 99.95 % |
| Edge latency (p95) | histogram_quantile(0.95, sum by (le) (rate(kong_http_latency_ms_bucket[5m]))) | ≤ 150 ms (Kong-only) |
| Edge latency (p99) | Same, 0.99 | ≤ 500 ms |
| Upstream error rate | sum(rate(kong_http_requests_total{code=~"5.."}[5m])) / sum(rate(kong_http_requests_total[5m])) | < 0.5 % of traffic |
| Rate-limit rejection rate | sum(rate(kong_http_requests_total{code="429"}[5m])) / sum(rate(kong_http_requests_total[5m])) | < 0.1 % (healthy); spike → alert |
| JWKS refresh failures | increase(kong_jwks_refresh_total{result="error"}[5m]) | 0 sustained |
| Auth failure rate on /v1/auth/login | rate(kong_http_requests_total{route="rt-auth-login",code="401"}[5m]) | baseline + 3σ alert |
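To make the availability target concrete, the sketch below converts the 99.95 % SLO into an error budget and a burn rate. The 30-day window and the function names are assumptions for illustration; this document does not specify an SLO window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total outage the SLO tolerates over the window (window is an assumption)."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Budget consumption speed; 1.0 means errors arrive exactly at the budgeted rate."""
    return observed_error_ratio / (1.0 - slo)


if __name__ == "__main__":
    # 99.95 % over 30 days leaves roughly 21.6 minutes of budget.
    print(round(error_budget_minutes(0.9995), 1))
    # A sustained 1 % 5xx ratio burns that budget about 20 times faster than allowed.
    print(round(burn_rate(0.01, 0.9995), 1))
```

Under these assumptions, a 1 % 5xx ratio (the KongHighErrorRate threshold in §6) would exhaust a 30-day budget in about a day and a half if it persisted.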
2.2 Secondary metrics
- `kong_kong_latency_ms` (plugin overhead)
- `kong_upstream_latency_ms` (upstream health)
- `kong_nginx_http_current_connections`
- `kong_memory_lua_shared_dict_bytes`
- `ghasi_api_key_lookup_latency_seconds` / `_total`
3. Logs (Loki)
- Plugin: `http-log` → Loki push endpoint.
- Format: JSON per-request, schema in EVENT_SCHEMAS §3.
- Labels: `service="kong"`, `env`, `route`, `code_class` (`2xx`, `4xx`, `5xx`).
- Retention: 14 d hot / 90 d cold.
- Body logging: disabled (PII).
- LogQL examples:
  - `{service="kong", code_class="5xx"} | json | latency_ms > 1000` (slow errors)
  - `{service="kong", route="rt-auth-login"} | json | status = 401` (auth failure spikes; `status` is a JSON field, not a stream label, so it is filtered after `| json`)
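For scripted triage, a LogQL query like the slow-errors example can be sent to Loki's `query_range` HTTP endpoint. The sketch below only builds the request URL; the base URL is a hypothetical placeholder and the helper name is illustrative, not part of any existing tooling.

```python
from urllib.parse import urlencode

# Hypothetical Loki address; substitute the real internal endpoint.
LOKI_BASE_URL = "http://loki.example.internal:3100"

SLOW_ERRORS_QUERY = '{service="kong", code_class="5xx"} | json | latency_ms > 1000'


def build_query_range_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    one_hour_ns = 3600 * 10**9
    end = 1_700_000_000 * 10**9  # arbitrary example timestamp
    print(build_query_range_url(SLOW_ERRORS_QUERY, end - one_hour_ns, end))
```

`urlencode` percent-encodes the braces, quotes, and pipes in the query, so the LogQL string can be passed through unmodified.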
4. Traces (OpenTelemetry)
- Plugin: `opentelemetry` → OTel collector (OTLP gRPC).
- Span name: `kong.request`
- Attributes: see EVENT_SCHEMAS §5.
- Sampling: 10 % head-based default; 100 % for 5xx, 100 % for `/v1/auth/login`.
- Upstream service spans chain as children via W3C `traceparent`.
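The sampling policy above can be sketched as two decisions: one made at request start and one made once the status code is known. This is an illustration of the policy only, not how the Kong plugin or the collector implements it; the function names and the trace-ID comparison are assumptions.

```python
ALWAYS_SAMPLED_PATHS = {"/v1/auth/login"}
DEFAULT_RATIO = 0.10  # 10 % default from the policy above


def head_sample(trace_id_hex: str, path: str, ratio: float = DEFAULT_RATIO) -> bool:
    """Decide at request start, before any response status exists."""
    if path in ALWAYS_SAMPLED_PATHS:
        return True
    # Deterministic ratio sampling: compare the low 64 bits of the
    # 128-bit trace ID against a threshold proportional to the ratio.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(ratio * (1 << 64))


def keep_trace(trace_id_hex: str, path: str, status: int) -> bool:
    """Final retention decision; the 5xx rule needs the response status."""
    if 500 <= status <= 599:
        return True
    return head_sample(trace_id_hex, path)


if __name__ == "__main__":
    tid = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(keep_trace(tid, "/v1/auth/login", 200))  # always kept
    print(keep_trace(tid, "/v1/sms/send", 503))    # kept because of the 5xx rule
```

Note that the 5xx rule cannot be evaluated in `head_sample`, because the status is unknown when the request arrives. Retaining every 5xx trace therefore implies some decision after the response, which is the trade-off raised under open questions.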
5. Dashboards (Grafana)
Prebuilt dashboards under ops/grafana/dashboards/kong/:
- `kong-overview`: total RPS, error rate, p50/p95/p99 latency, 429 rate, top routes.
- `kong-route-drilldown`: same by Route; stacked by upstream status class.
- `kong-auth`: JWT success/failure, API-key lookup hit/miss, JWKS refresh state.
- `kong-rate-limit`: rejected requests by `limit_by`, Redis health, counter growth.
- `kong-plugin-latency`: Kong-internal latency by plugin phase (via OTel spans or synthetic).
- `kong-resource`: pod CPU/mem, connections, worker health.
6. Alerts
| Alert | Condition | Severity | Action |
|---|---|---|---|
| KongHighErrorRate | 5xx rate > 1 % for 5 min | critical | Page on-call; check upstreams + runbook |
| KongLatencyP95High | Kong p95 > 500 ms for 10 min | high | Investigate plugin cost / worker saturation |
| KongUpstreamUnhealthy | Upstream health check failing for > 2 min | critical | Failover or scale upstream |
| KongRateLimitStorm | 429 rate > 5 % of traffic for 5 min | high | Investigate abuse; may be legitimate traffic surge |
| KongJWKSRefreshFail | > 3 consecutive failures | high | Check auth-service health |
| KongCertExpirySoon | TLS cert < 14 d | medium | Trigger rotation |
| KongPodRestartLoop | CrashLoopBackOff | critical | Check config + resource limits |
| KongRedisUnavailable | Rate-limit plugin reports Redis errors > 10/min | high | Check Redis; review fail-open/closed behaviour |
| KongAuthFailureSpike | 401 rate on /v1/auth/login > baseline+3σ | high | Possible credential-stuffing; review IPs |
| KongConfigDrift | deck diff CI job detects drift between Git and live | medium | Investigate manual change; resync |
All alerts route to the platform-edge Pager rotation.
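The baseline + 3σ condition used by KongAuthFailureSpike can be illustrated with a small calculation. This is a sketch under assumptions: the baseline is taken as a list of historical 401 rates, the population standard deviation is used, and the sample numbers are invented for illustration.

```python
from statistics import mean, pstdev


def three_sigma_threshold(baseline_rates: list[float]) -> float:
    """Mean of the baseline samples plus three population standard deviations."""
    return mean(baseline_rates) + 3 * pstdev(baseline_rates)


def is_spike(current_rate: float, baseline_rates: list[float]) -> bool:
    """True when the current rate exceeds the baseline + 3 sigma threshold."""
    return current_rate > three_sigma_threshold(baseline_rates)


if __name__ == "__main__":
    # Invented 401-per-second samples: mean 5.0, standard deviation 2.0.
    baseline = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(three_sigma_threshold(baseline))  # 11.0
    print(is_spike(12.0, baseline))         # True
```

How the baseline window is chosen (for example, same hour on previous days) is left open here and would need to be fixed when the alert rule is written.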
7. Runbooks
Runbooks live under docs/ops/runbooks/kong/ (to be authored):
| Alert | Runbook |
|---|---|
| KongHighErrorRate | kong-5xx-triage.md |
| KongUpstreamUnhealthy | kong-upstream-down.md |
| KongRateLimitStorm | kong-ratelimit-storm.md |
| KongJWKSRefreshFail | kong-jwks-refresh.md |
| KongCertExpirySoon | kong-cert-rotation.md |
| KongConfigDrift | kong-config-drift.md |
8. Health endpoints
| Path | Purpose | Exposure |
|---|---|---|
/health | Data-plane liveness (static 200) | Public (non-sensitive) |
/ready | Readiness (JWKS loaded, Redis reachable) | Public |
/status | Kong admin status (connections, workers) | Internal only |
/metrics | Prometheus scrape | Internal only |
9. Tracing baggage
Kong propagates and enriches OTel baggage (baggage: account.id=...,tier=...) so upstream services tag spans/logs consistently. Upstream services must not trust baggage for authorization decisions; it is for observability only.
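On the upstream side, reading the baggage header for tagging purposes might look like the sketch below. It assumes the W3C Baggage list format (comma-separated `key=value` members, optional `;` properties, percent-encoded values); the example values and function names are hypothetical.

```python
from urllib.parse import unquote

# Keys taken from the baggage example above; used for tagging only.
OBSERVABILITY_KEYS = {"account.id", "tier"}


def parse_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header into a key/value dict, ignoring member properties."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        kv = member.strip().split(";", 1)[0]  # drop optional ;properties
        if "=" not in kv:
            continue  # skip empty or malformed members
        key, value = kv.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def span_attributes_from_baggage(header: str) -> dict[str, str]:
    """Select only the known observability keys for span/log tagging.

    These values must never feed an authorization decision.
    """
    parsed = parse_baggage(header)
    return {k: v for k, v in parsed.items() if k in OBSERVABILITY_KEYS}


if __name__ == "__main__":
    # Hypothetical header values for illustration.
    print(span_attributes_from_baggage("account.id=acct_123,tier=gold,debug=1"))
```

Restricting the output to an allow-list of keys keeps unexpected baggage entries out of span attributes and log labels.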
10. Synthetic monitoring
- A Blackbox exporter probes `https://api.ghasi.io/health` every 30 s from multiple regions.
- A synthetic `/v1/sms/send` request with a dedicated internal API key runs every 5 min in staging and prod; failure pages on-call after 2 consecutive fails.
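The "page after 2 consecutive fails" rule for the synthetic request can be sketched as a small state tracker. The class name is illustrative, and paging once per failure streak (rather than on every subsequent failure) is an assumption, not something this document specifies.

```python
class ConsecutiveFailurePager:
    """Tracks probe results and signals when the failure streak hits the threshold."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True if this result should page on-call."""
        if success:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        # Page exactly once, when the streak first reaches the threshold.
        return self.failures == self.threshold


if __name__ == "__main__":
    pager = ConsecutiveFailurePager(threshold=2)
    results = [True, False, True, False, False, False]
    print([pager.record(ok) for ok in results])
```

With a 5-minute probe interval, this rule means a hard outage pages roughly 5 to 10 minutes after it begins, while a single failed probe never pages.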
11. Open questions
- Tail-based OTel sampling vs head-based (error-biased retention).
- Separate Grafana tenant for SRE edge dashboards.