api-gateway (Kong) — Observability

Status: populated
Owner: TBD (Platform / SRE)
Last updated: 2026-04-17
Companion: SERVICE_OVERVIEW · EVENT_SCHEMAS · Service Template

1. Purpose

Define the SLIs, SLOs, dashboards, alerts, and runbooks for Kong as the edge gateway. Kong is the first component in the critical path for every external request; its reliability directly gates platform SLAs.

2. Metrics (Prometheus)

Exposed by the prometheus plugin on an internal scrape port. See EVENT_SCHEMAS §4 for the full metric list.

2.1 Gold-signal SLIs

SLI	Definition	Target (SLO)
Edge availability	`sum(rate(kong_http_requests_total{code!~"5.."}[5m])) / sum(rate(kong_http_requests_total[5m]))`	≥ 99.95 %
Edge latency (p95)	`histogram_quantile(0.95, sum by (le) (rate(kong_http_latency_ms_bucket[5m])))`	≤ 150 ms (Kong-only)
Edge latency (p99)	Same expression with quantile 0.99	≤ 500 ms
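Upstream error rate	`sum(rate(kong_http_requests_total{code=~"5.."}[5m])) / sum(rate(kong_http_requests_total[5m]))`	< 0.5 % of traffic
Rate-limit rejection rate	`sum(rate(kong_http_requests_total{code="429"}[5m])) / sum(rate(kong_http_requests_total[5m]))`	< 0.1 % of traffic when healthy; alert on spikes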
JWKS refresh failures	`increase(kong_jwks_refresh_total{result="error"}[5m])`	0 sustained
Auth failure rate on `/v1/auth/login`	`rate(kong_http_requests_total{route="rt-auth-login",code="401"}[5m])`	baseline + 3σ alert
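To make the availability target concrete, the sketch below converts an SLO percentage into an error budget. The 30-day window and both function names are assumptions for illustration; this document does not state a measurement window.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of total outage an availability SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_burned(good: int, total: int, slo_percent: float) -> float:
    """Fraction of the error budget consumed, given request counts."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    allowed_bad = total * (1 - slo_percent / 100)
    actual_bad = total - good
    return actual_bad / allowed_bad


# 99.95 % over an assumed 30-day window
print(round(error_budget_minutes(99.95), 1))  # 21.6
```

Under that assumption, the 99.95 % edge availability target leaves roughly 21.6 minutes of full outage per 30 days.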

2.2 Secondary metrics

kong_kong_latency_ms (plugin overhead)
kong_upstream_latency_ms (upstream health)
kong_nginx_http_current_connections
kong_memory_lua_shared_dict_bytes
ghasi_api_key_lookup_latency_seconds / _total

3. Logs (Loki)

Plugin: http-log → Loki push endpoint.
Format: JSON per-request, schema in EVENT_SCHEMAS §3.
Labels: service="kong", env, route, code_class (2xx, 4xx, 5xx).
Retention: 14 d hot / 90 d cold.
Body logging: disabled (PII).
LogQL examples:
- Slow errors: {service="kong", code_class="5xx"} | json | latency_ms > 1000
- Auth failure spikes: {service="kong", route="rt-auth-login", code_class="4xx"} | json | status == 401
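For scripted checks, a LogQL query like the slow-errors example above can be sent to Loki's `query_range` HTTP endpoint. The sketch below only builds the request URL, so it runs without network access. The host `loki.internal:3100` and the function name are hypothetical; the endpoint path and the `query`, `start`, `end`, and `limit` parameters are part of Loki's HTTP API.

```python
from urllib.parse import urlencode

# Slow-errors query from the examples above.
SLOW_ERRORS = '{service="kong", code_class="5xx"} | json | latency_ms > 1000'


def loki_query_range_url(base_url: str, logql: str,
                         start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki query_range URL; timestamps are Unix epoch nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    # urlencode percent-escapes braces, quotes, and pipes in the LogQL string.
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical host; replace with the real Loki address.
url = loki_query_range_url("http://loki.internal:3100", SLOW_ERRORS, 0, 1)
```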

4. Traces (OpenTelemetry)

Plugin: opentelemetry → OTel collector (OTLP gRPC).
Span name: kong.request
Attributes: see EVENT_SCHEMAS §5.
Sampling: 10 % head-based default; 100 % for 5xx, 100 % for /v1/auth/login.
Upstream service spans chain as children via W3C traceparent.
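As an illustration of how a child span chains to its parent, the sketch below derives a child `traceparent` value from an incoming one: the trace ID and flags are kept and a fresh span ID is generated. It handles only version `00` of the W3C Trace Context format and is not a description of how Kong's plugin is implemented; the function name is hypothetical.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(parent_header: str) -> str:
    """Return a traceparent for a child span of the given parent header."""
    match = _TRACEPARENT.match(parent_header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {parent_header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The upstream service receives `outgoing`, so its span shares the trace ID with `kong.request` and records the new span ID as its parent.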

5. Dashboards (Grafana)

Prebuilt dashboards under ops/grafana/dashboards/kong/:

kong-overview — total RPS, error rate, p50/p95/p99 latency, 429 rate, top routes.
kong-route-drilldown — same by Route; stacked by upstream status class.
kong-auth — JWT success/failure, API-key lookup hit/miss, JWKS refresh state.
kong-rate-limit — rejected requests by limit_by, Redis health, counter growth.
kong-plugin-latency — Kong-internal latency by plugin phase (via OTel spans or synthetic).
kong-resource — pod CPU/mem, connections, worker health.

6. Alerts

Alert	Condition	Severity	Action
`KongHighErrorRate`	5xx rate > 1 % for 5 min	critical	Page on-call; check upstreams + runbook
`KongLatencyP95High`	Kong p95 > 500 ms for 10 min	high	Investigate plugin cost / worker saturation
`KongUpstreamUnhealthy`	Upstream health check failing for > 2 min	critical	Failover or scale upstream
`KongRateLimitStorm`	429 rate > 5 % of traffic for 5 min	high	Investigate abuse; may be legitimate traffic surge
`KongJWKSRefreshFail`	> 3 consecutive failures	high	Check `auth-service` health
`KongCertExpirySoon`	TLS cert < 14 d	medium	Trigger rotation
`KongPodRestartLoop`	CrashLoopBackOff	critical	Check config + resource limits
`KongRedisUnavailable`	Rate-limit plugin reports Redis errors > 10/min	high	Check Redis; review fail-open/closed behaviour
`KongAuthFailureSpike`	401 rate on `/v1/auth/login` > baseline+3σ	high	Possible credential-stuffing; review IPs
`KongConfigDrift`	`deck diff` CI job detects drift between Git and live	medium	Investigate manual change; resync

All alerts route to the platform-edge Pager rotation.
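The baseline+3σ condition used by `KongAuthFailureSpike` can be sketched as follows. How the baseline window is chosen is not specified here, so the sample values and function names are assumptions; in PromQL the same idea is usually expressed with `avg_over_time` and `stddev_over_time`.

```python
from statistics import mean, pstdev


def three_sigma_threshold(baseline_rates: list[float]) -> float:
    """Mean of the baseline samples plus three population standard deviations."""
    return mean(baseline_rates) + 3 * pstdev(baseline_rates)


def is_spike(current_rate: float, baseline_rates: list[float]) -> bool:
    """True when the current rate exceeds the baseline + 3 sigma threshold."""
    return current_rate > three_sigma_threshold(baseline_rates)


# Hypothetical 401 rates (requests/s) sampled over the baseline window.
baseline = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(three_sigma_threshold(baseline))  # 11.0
```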

7. Runbooks

Runbooks live under docs/ops/runbooks/kong/ (to be authored):

Alert	Runbook
`KongHighErrorRate`	`kong-5xx-triage.md`
`KongUpstreamUnhealthy`	`kong-upstream-down.md`
`KongRateLimitStorm`	`kong-ratelimit-storm.md`
`KongJWKSRefreshFail`	`kong-jwks-refresh.md`
`KongCertExpirySoon`	`kong-cert-rotation.md`
`KongConfigDrift`	`kong-config-drift.md`

8. Health endpoints

Path	Purpose	Exposure
`/health`	Data-plane liveness (static 200)	Public (non-sensitive)
`/ready`	Readiness (JWKS loaded, Redis reachable)	Public
`/status`	Kong admin status (connections, workers)	Internal only
`/metrics`	Prometheus scrape	Internal only

9. Tracing baggage

Kong propagates and enriches OTel baggage (baggage: account.id=...,tier=...) so upstream services tag spans/logs consistently. Upstream services must not trust baggage for authorization decisions; it is for observability only.
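An upstream service might read the baggage header along these lines. This is a minimal sketch of W3C Baggage parsing, not a full implementation: it ignores per-entry properties and size limits, and the function name and sample values are hypothetical.

```python
from urllib.parse import unquote


def parse_baggage(header: str) -> dict[str, str]:
    """Parse a W3C baggage header value into a key/value dict."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        # Drop optional ";property" metadata that may follow the value.
        key_value = member.split(";", 1)[0].strip()
        if "=" not in key_value:
            continue  # skip empty or malformed members
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


# Use the values for span/log tagging only, never for authorization.
tags = parse_baggage("account.id=acct_123, tier=gold;ttl=60")
```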

10. Synthetic monitoring

A Blackbox exporter probes https://api.ghasi.io/health every 30 s from multiple regions.
A synthetic /v1/sms/send request with a dedicated internal API key runs every 5 min in staging and prod; failure pages on-call after 2 consecutive fails.
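The "page after 2 consecutive fails" rule for the synthetic request can be sketched as a small counter. The class name is hypothetical, and paging only once when the threshold is first reached is a design assumption, not something this document specifies.

```python
class ConsecutiveFailurePager:
    """Track probe results and signal a page after N consecutive failures."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if this result should page."""
        if ok:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        # Page exactly once, at the moment the streak reaches the threshold.
        return self.failures == self.threshold


pager = ConsecutiveFailurePager(threshold=2)
results = [pager.record(ok) for ok in (True, False, True, False, False, False)]
```

A single failed probe followed by a success does not page; only the second failure in a row does, and further failures in the same streak stay silent.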

11. Open questions

Tail-based OTel sampling vs head-based (error-biased retention).
Separate Grafana tenant for SRE edge dashboards.
