api-gateway (Kong) — Failure Modes
Status: populated Owner: TBD (Platform / SRE) Last updated: 2026-04-17 Companion: SERVICE_OVERVIEW · OBSERVABILITY · Service Template
1. Purpose
Catalogue what breaks, how it surfaces, and how we mitigate. Kong is the first critical-path component for every external request; failure modes here are often platform-wide.
2. Catalog
2.1 Kong pod crash / restart loop
- Symptom: `CrashLoopBackOff`; 5xx spike from LB health check.
- Detection: Kubernetes event + `KongPodRestartLoop` alert.
- User impact: Elevated 5xx if all pods are affected; none if a surviving replica absorbs traffic (HPA min 2).
- Mitigation: Roll back last config (`deck gateway reset` + resync from previous tag). Check resource limits. PDB enforces `minAvailable: 2`.
- Runbook: `kong-pod-crash.md`.
2.2 Upstream service down
- Symptom: Upstream returns 5xx or connection refused; Kong returns `502/503/504` problem+json.
- Detection: `KongUpstreamUnhealthy` alert; trace waterfalls show upstream failures.
- User impact: Feature-specific: `/v1/sms/send` rejects; other routes unaffected.
- Mitigation: Kong passive health checks take the unhealthy target out of rotation. Page upstream service's on-call. Consider returning a maintenance response if prolonged.
2.3 Redis unavailable (rate-limit store)
- Symptom: `rate-limiting-advanced` reports Redis errors.
- Detection: `KongRedisUnavailable` alert; error logs from plugin.
- User impact: Per route policy:
  - `/v1/sms/send`, `/v1/sms/bulk`, `/v1/auth/login`: fail-closed (503 with `Retry-After`). Rationale: unbounded SMS send is a business-risk event.
  - `GET /v1/sms/{id}`, `/v1/analytics/*`: fail-open (serve without rate limit). Rationale: read-only endpoints are less abusable.
- Mitigation: Restore Redis (Redis cluster redundancy, failover); losing counter state is acceptable.
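The per-route policy above amounts to a small decision function. The following is a sketch under stated assumptions, not the plugin's implementation: the function name, the glob-style matching, the `Retry-After` value of 30 seconds, and the fail-closed default for unlisted routes are all hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional, Tuple

RETRY_AFTER_SECONDS = 30  # hypothetical; the policy only says "503 with Retry-After"

# (method pattern, path pattern) pairs copied from the policy above.
FAIL_CLOSED = [("*", "/v1/sms/send"), ("*", "/v1/sms/bulk"), ("*", "/v1/auth/login")]
FAIL_OPEN = [("GET", "/v1/sms/*"), ("*", "/v1/analytics/*")]


def _matches(rules, method: str, path: str) -> bool:
    return any(fnmatchcase(method, m) and fnmatchcase(path, p) for m, p in rules)


def on_redis_unavailable(method: str, path: str) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return a rejection (status, headers), or None to proxy without rate limiting."""
    # Fail-closed rules are checked first so that GET /v1/sms/send cannot slip
    # through the broader GET /v1/sms/* fail-open pattern.
    if _matches(FAIL_CLOSED, method, path):
        return 503, {"Retry-After": str(RETRY_AFTER_SECONDS)}
    if _matches(FAIL_OPEN, method, path):
        return None
    # Assumption: routes not named in the policy default to fail-closed.
    return 503, {"Retry-After": str(RETRY_AFTER_SECONDS)}
```

Checking the fail-closed list before the fail-open patterns keeps the stricter rule authoritative when two patterns overlap.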
2.4 JWKS fetch failure (auth-service unreachable)
- Symptom: New JWTs with unknown `kid` fail 401; existing cached `kid`s keep working until TTL.
- Detection: `KongJWKSRefreshFail` alert.
- User impact: Progressive: existing sessions keep working, new/rotated keys fail.
- Mitigation: Restore `auth-service`; Kong auto-refreshes JWKS on next cycle. Manual `DELETE /cache/jwks` if forced.
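The progressive failure described above follows from how a TTL-bound key cache behaves when its refresh source is down. A minimal sketch, assuming a hypothetical `JwksCache` class and an injected `fetch` callable standing in for the call to `auth-service` (Kong's real caching internals are not shown here):

```python
import time
from typing import Callable, Dict, Optional, Tuple


class JwksCache:
    """Caches signing keys by kid; refreshes only on a miss or after expiry."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # hypothetical: returns {kid: key material}
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, Tuple[str, float]] = {}  # kid -> (key, expires_at)

    def get(self, kid: str) -> Optional[str]:
        now = self._clock()
        entry = self._keys.get(kid)
        if entry is not None and entry[1] > now:
            return entry[0]          # cached and still within TTL: no fetch needed
        try:
            fresh = self._fetch()
        except Exception:
            # Sketch-level catch-all: refresh failed, so the caller answers 401.
            return None
        expires_at = now + self._ttl
        self._keys = {k: (v, expires_at) for k, v in fresh.items()}
        entry = self._keys.get(kid)
        return entry[0] if entry is not None else None
```

While `fetch` keeps failing, lookups for cached `kid`s succeed until their TTL lapses and lookups for unknown `kid`s return `None` immediately, which matches the progressive impact listed above.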
2.5 Custom plugin (ghasi-api-key-lookup) upstream failure
- Symptom: API-key routes return 503.
- Detection: `ghasi_api_key_lookup_total{result="error"}` spikes; alert.
- User impact: Customers using API keys cannot authenticate; JWT customers unaffected.
- Mitigation: Restore `auth-service`; plugin continues serving from cache until entries expire. Never fail-open (security invariant).
2.6 TLS certificate expiry
- Symptom: Browsers/clients reject TLS handshake.
- Detection: `KongCertExpirySoon` at T-14 d; synthetic probe fails at T-0.
- User impact: Total outage if T-0 reached.
- Mitigation: cert-manager auto-renew or Cloudflare origin cert auto-rotate; manual rotation runbook if automation fails.
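The T-14 d and T-0 thresholds can be expressed as a small classification over the certificate's `notAfter` timestamp. This is a sketch only: `cert_state` and its return labels are hypothetical and are not the alert rule behind `KongCertExpirySoon`.

```python
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(days=14)  # from the T-14 d alert threshold above


def cert_state(not_after: datetime, now: datetime) -> str:
    """Classify a certificate by the time left before its notAfter timestamp."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"          # T-0 reached: clients reject the handshake
    if remaining <= WARN_WINDOW:
        return "expiring_soon"    # inside the alert window
    return "ok"
```

Passing `now` explicitly keeps the function deterministic, so the same check can be exercised against fixed dates in a test.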
2.7 Config drift (live Kong diverges from Git)
- Symptom: Behaviour differs from expected; `deck diff` shows changes.
- Detection: `KongConfigDrift` nightly alert.
- User impact: Unpredictable; depends on drift.
- Mitigation: Investigate source (manual admin API change?); resync from Git. Lock down admin API writes to CI only (already enforced).
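When investigating the source of drift, it helps to sort differences into three buckets: entities declared in Git but absent live, entities present live but not declared, and entities present in both with different settings. The sketch below illustrates that classification on plain dictionaries keyed by entity name; it is not a substitute for decK's own diff, and `classify_drift` and its input shape are assumptions.

```python
from typing import Any, Dict, List


def classify_drift(declared: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare two {entity_name: config} maps and bucket the differences."""
    shared = declared.keys() & live.keys()
    return {
        # In Git but not on the gateway: a sync was skipped or failed.
        "missing_live": sorted(declared.keys() - live.keys()),
        # On the gateway but not in Git: likely a manual Admin API change.
        "unmanaged_live": sorted(live.keys() - declared.keys()),
        # Same name, different settings.
        "changed": sorted(name for name in shared if declared[name] != live[name]),
    }
```

An entry under `unmanaged_live` is the strongest hint of a write that bypassed CI.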
2.8 Route config drift vs upstream OpenAPI
- Symptom: Kong routes traffic to a path the upstream no longer serves (404) or rejects (400).
- Detection: CI lint fails on PR; post-deploy, upstream 404 rate rises.
- User impact: Specific endpoint returns 404/400.
- Mitigation: The CI lint is the primary prevention. If it lands anyway, hotfix the decK YAML or the upstream.
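One way such a CI lint could work is to compare the paths Kong routes against the paths the upstream's OpenAPI document declares. The sketch below assumes both sides are available as lists of path templates in `{param}` style; the function names are hypothetical, and real Kong routes that use prefixes or regexes would need extra handling.

```python
import re
from typing import Iterable, List


def normalize(path: str) -> str:
    """Make /v1/sms/{id} and /v1/sms/{message_id} compare equal."""
    return re.sub(r"\{[^/}]+\}", "{}", path.rstrip("/")) or "/"


def routes_missing_upstream(kong_paths: Iterable[str],
                            openapi_paths: Iterable[str]) -> List[str]:
    """Return Kong route paths that the upstream spec no longer declares."""
    served = {normalize(p) for p in openapi_paths}
    return sorted(p for p in kong_paths if normalize(p) not in served)
```

A non-empty result would fail the PR check before the stale route reaches production.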
2.9 Plugin misconfiguration (e.g. JWT audience wrong)
- Symptom: Sudden 401 rate on a specific route.
- Detection: `KongAuthFailureSpike` alert.
- User impact: All traffic to that route fails auth.
- Mitigation: Revert to last known-good decK tag.
2.10 Plugin version conflict
- Symptom: Kong startup fails or plugins disabled after upgrade.
- Detection: Pod unready; readiness probe failing.
- User impact: Pod removed from pool.
- Mitigation: Blue/green upgrades (see DEPLOYMENT_TOPOLOGY §9); roll back to the previous image.
2.11 DoS / traffic surge beyond capacity
- Symptom: p95 latency balloons; 5xx from Kong due to worker saturation.
- Detection: `KongLatencyP95High` + HPA scaling events.
- User impact: Slow responses; possible 503 overflow.
- Mitigation: Cloudflare challenges; HPA scale; enforce stricter global rate limit (5 000 req/s guardrail exists). If sustained, raise Kong replica ceiling.
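To make the effect of a global req/s guardrail concrete, here is a token-bucket sketch. Note the swap: a token bucket is used here because it is compact to show, whereas Kong's rate-limiting plugins count requests in time windows, so this illustrates the idea rather than the plugin's algorithm. The class name and the burst size equal to the rate are assumptions.

```python
class TokenBucket:
    """Admits requests while tokens remain; tokens refill at a fixed rate."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would answer 429/503 depending on policy


# Guardrail from the mitigation above: 5 000 req/s (burst size is an assumption).
guardrail = TokenBucket(rate_per_s=5000, burst=5000)
```

Once the bucket is drained, further requests in the same instant are refused until the refill catches up, which is the overflow behaviour the user-impact line describes.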
2.12 Loki push backpressure (log pipeline)
- Symptom: `http-log` plugin reports push errors.
- Detection: Kong error log; Loki push metric.
- User impact: None on the request path; logs are dropped or buffered.
- Mitigation: Keep buffer sizes at sane defaults; scale Loki. This failure does not block requests.
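The reason log backpressure stays off the request path is that entries go into a bounded buffer that sheds load instead of blocking. The sketch below shows a drop-oldest variant; it is an illustration with hypothetical names, not the `http-log` plugin's actual queue.

```python
from collections import deque
from typing import Any, Callable


class DropOldestBuffer:
    """Bounded log buffer: push never blocks, the oldest entry is shed when full."""

    def __init__(self, max_entries: int):
        self._queue = deque(maxlen=max_entries)
        self.dropped = 0

    def push(self, entry: Any) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1        # count what backpressure costs us
        self._queue.append(entry)    # deque discards the oldest entry when full

    def drain(self, send: Callable[[Any], None], batch_size: int) -> int:
        """Try to ship up to batch_size entries; stop at the first push error."""
        sent = 0
        while self._queue and sent < batch_size:
            try:
                send(self._queue[0])
            except Exception:
                break                # sink still pushing back; keep the entry
            self._queue.popleft()
            sent += 1
        return sent
```

Exposing `dropped` as a metric would make silent log loss visible during a Loki incident.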
2.13 OTel collector down
- Symptom: Traces absent; Kong otel export errors.
- Detection: Missing spans; OTel collector health.
- User impact: None.
- Mitigation: Scale/restart OTel collector. Configure a local buffer on Kong.
2.14 Misapplied NetworkPolicy blocking Admin API / plugin egress
- Symptom: `deck sync` fails; custom plugin cannot reach `auth-service`.
- Detection: CI failure; custom plugin 503.
- User impact: Deploy blocked; API-key customers fail.
- Mitigation: Revert NetworkPolicy change; SRE review.
2.15 Cloudflare misconfiguration (origin mismatch)
- Symptom: 525/526 TLS errors at CF; clients see gateway errors.
- Detection: Cloudflare dashboard; synthetic probes.
- User impact: Total or partial outage.
- Mitigation: Revert CF change; verify origin cert and SNI config.
3. Fail-mode matrix (quick reference)
| Route | Redis down | Auth-service down | Upstream down |
|---|---|---|---|
| `POST /v1/sms/send` | 503 (closed) | 401/503 (JWT path), 503 (key-auth) | 503 |
| `GET /v1/sms/{id}` | 200 (open) | 401 (new tokens) | 503 |
| `/v1/auth/login` | 503 (closed) | n/a (target is auth-service) | 503 |
| `/v1/analytics/*` | 200 (open) | 401 (new tokens) | 503 |
| `/admin/*` | 503 (closed) | 401 | 503 |
4. Blast-radius notes
- Kong is a single platform-wide component. Any total Kong outage is a platform-wide outage on the north-south path. HA replicas + blue/green upgrades are therefore load-bearing.
- East-west traffic (NATS, gRPC between services) bypasses Kong and is unaffected.
5. Open questions
- Add a standby Kong instance in DR region with DNS failover (Cloudflare Load Balancer)?
- Circuit breaker behaviour on upstream 503 — pass through as-is vs insert a retry-after header?