Skip to main content

api-gateway (Kong) — Failure Modes

Status: populated Owner: TBD (Platform / SRE) Last updated: 2026-04-17 Companion: SERVICE_OVERVIEW · OBSERVABILITY · Service Template

1. Purpose

Catalogue what breaks, how it surfaces, and how we mitigate. Kong is the first critical-path component for every external request; failure modes here are often platform-wide.

2. Catalog

2.1 Kong pod crash / restart loop

  • Symptom: CrashLoopBackOff; 5xx spike from LB health check.
  • Detection: Kubernetes event + KongPodRestartLoop alert.
  • User impact: Elevated 5xx if all pods are affected; none if a surviving replica absorbs traffic (HPA min 2).
  • Mitigation: Roll back last config (deck gateway reset + resync from previous tag). Check resource limits. PDB enforces 2 minAvailable.
  • Runbook: kong-pod-crash.md.

2.2 Upstream service down

  • Symptom: Upstream returns 5xx or connection refused; Kong returns 502/503/504 problem+json.
  • Detection: KongUpstreamUnhealthy alert; trace waterfalls show upstream failures.
  • User impact: Feature-specific: /v1/sms/send rejects; other routes unaffected.
  • Mitigation: Kong passive health checks take the unhealthy target out of rotation. Page upstream service's on-call. Consider returning a maintenance response if prolonged.

2.3 Redis unavailable (rate-limit store)

  • Symptom: rate-limiting-advanced reports Redis errors.
  • Detection: KongRedisUnavailable alert; error logs from plugin.
  • User impact: Per route policy:
    • /v1/sms/send, /v1/sms/bulk, /v1/auth/loginfail-closed (503 with Retry-After). Rationale: unbounded SMS send is a business-risk event.
    • GET /v1/sms/{id}, /v1/analytics/*fail-open (serve without rate limit). Rationale: read-only endpoints are less abusable.
  • Mitigation: Restore Redis (Redis cluster redundancy, failover); counters losing state is acceptable.

2.4 JWKS fetch failure (auth-service unreachable)

  • Symptom: New JWTs with unknown kid fail 401; existing cached kids keep working until TTL.
  • Detection: KongJWKSRefreshFail alert.
  • User impact: Progressive — existing sessions keep working, new/rotated keys fail.
  • Mitigation: Restore auth-service; Kong auto-refreshes JWKS on next cycle. Manual DELETE /cache/jwks if forced.

2.5 Custom plugin (ghasi-api-key-lookup) upstream failure

  • Symptom: API-key routes return 503.
  • Detection: ghasi_api_key_lookup_total{result="error"} spikes; alert.
  • User impact: Customers using API keys cannot authenticate; JWT customers unaffected.
  • Mitigation: Restore auth-service; plugin continues serving from cache until entries expire. Never fail-open (security invariant).

2.6 TLS certificate expiry

  • Symptom: Browsers/clients reject TLS handshake.
  • Detection: KongCertExpirySoon at T-14 d; synthetic probe fails at T-0.
  • User impact: Total outage if T-0 reached.
  • Mitigation: cert-manager auto-renew or Cloudflare origin cert auto-rotate; manual rotation runbook if automation fails.

2.7 Config drift (live Kong diverges from Git)

  • Symptom: Behaviour differs from expected; deck diff shows changes.
  • Detection: KongConfigDrift nightly alert.
  • User impact: Unpredictable; depends on drift.
  • Mitigation: Investigate source (manual admin API change?); resync from Git. Lock down admin API writes to CI only (already enforced).

2.8 Route config drift vs upstream OpenAPI

  • Symptom: Kong routes traffic to a path the upstream no longer serves (404) or rejects (400).
  • Detection: CI lint fails on PR; post-deploy, upstream 404 rate rises.
  • User impact: Specific endpoint returns 404/400.
  • Mitigation: The CI lint is the primary prevention. If it lands anyway, hotfix the decK YAML or the upstream.

2.9 Plugin misconfiguration (e.g. JWT audience wrong)

  • Symptom: Sudden 401 rate on a specific route.
  • Detection: KongAuthFailureSpike alert.
  • User impact: All traffic to that route fails auth.
  • Mitigation: Revert to last known-good decK tag.

2.10 Plugin version conflict

  • Symptom: Kong startup fails or plugins disabled after upgrade.
  • Detection: Pod unready; readiness probe failing.
  • User impact: Pod removed from pool.
  • Mitigation: Blue/green upgrades (see DEPLOYMENT_TOPOLOGY §9) — rollback to previous image.

2.11 DoS / traffic surge beyond capacity

  • Symptom: p95 latency balloons; 5xx from Kong due to worker saturation.
  • Detection: KongLatencyP95High + HPA scaling events.
  • User impact: Slow responses; possible 503 overflow.
  • Mitigation: Cloudflare challenges; HPA scale; enforce stricter global rate limit (5 000 req/s guardrail exists). If sustained, raise Kong replica ceiling.

2.12 Loki push backpressure (log pipeline)

  • Symptom: http-log plugin reports push errors.
  • Detection: Kong error log; Loki push metric.
  • User impact: None request-path; logs dropped or buffered.
  • Mitigation: Buffer size sane defaults; Loki scaled. This failure does not block requests.

2.13 OTel collector down

  • Symptom: Traces absent; Kong otel export errors.
  • Detection: Missing spans; OTel collector health.
  • User impact: None.
  • Mitigation: Scale/restart OTel collector. Configure a local buffer on Kong.

2.14 Misapplied NetworkPolicy blocking Admin API / plugin egress

  • Symptom: deck sync fails; custom plugin cannot reach auth-service.
  • Detection: CI failure; custom plugin 503.
  • User impact: Deploy blocked; API-key customers fail.
  • Mitigation: Revert NetworkPolicy change; SRE review.

2.15 Cloudflare misconfiguration (origin mismatch)

  • Symptom: 525/526 TLS errors at CF; clients see gateway errors.
  • Detection: Cloudflare dashboard; synthetic probes.
  • User impact: Total or partial outage.
  • Mitigation: Revert CF change; verify origin cert and SNI config.

3. Fail-mode matrix (quick reference)

RouteRedis downAuth-service downUpstream down
POST /v1/sms/send503 (closed)401/503 (JWT path), 503 (key-auth)503
GET /v1/sms/{id}200 (open)401 (new tokens)503
/v1/auth/login503 (closed)n/a (target is auth-service)503
/v1/analytics/*200 (open)401 on new tokens503
/admin/*503 (closed)401503

4. Blast-radius notes

  • Kong is a single platform-wide component. Any total Kong outage is a platform-wide outage on the north-south path. HA replicas + blue/green upgrades are therefore load-bearing.
  • East-west traffic (NATS, gRPC between services) bypasses Kong and is unaffected.

5. Open questions

  • Add a standby Kong instance in DR region with DNS failover (Cloudflare Load Balancer)?
  • Circuit breaker behaviour on upstream 503 — pass through as-is vs insert a retry-after header?