AI Gateway Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template

1. Catalogue

| ID | Failure | Detection | User impact | Mitigation | Runbook |
|---|---|---|---|---|---|
| F1 | Primary provider error spike | aigw_provider_error_total | Slower response; fallback activated | Circuit open → fallback provider; emit provider.degraded | provider-outage.md |
| F2 | All providers down | Both circuits open | 503 AI_PROVIDER_UNAVAILABLE | Page P1; vendor status check; manual failover to on-prem | provider-outage.md |
| F3 | Policy service timeout | aigw_policy_latency_ms p99 breach | 403 fail-closed | Cache short-lived allow decisions off by default; investigate policy service | policy-degraded.md |
| F4 | Moderation classifier offline | Health check red | 422 on all assists or fail-open? Default: fail-closed; emit assist.failed | Run local fallback classifier; escalate | moderation-degraded.md |
| F5 | Redis unavailable | quota calls error | Quota enforced conservatively in memory; degraded | Switch to per-instance quota; Redis recovery; alert | cache-outage.md |
| F6 | Postgres slow / down | integration test + p99 breach | Assist 5xx | HA failover; read-only admin queries | |
| F7 | NATS partition | publish errors | Outbox backs up; assist still succeeds | Outbox relay resumes; DLQ monitored | nats-partition.md |
| F8 | HITL queue backlog | aigw_hitl_queue_depth | Drafts wait; owning module cannot finalise | Notify lead reviewer; escalate to supervisor role; consider auto-reject over N days | hitl-backlog.md |
| F9 | Prompt injection detected | moderation flag | 422 at call site | Block, emit event, log template hash and feature; add corpus sample | |
| F10 | Provider returns PHI leak | post-moderation block | Output suppressed; 200 with null draft | ai.moderation.flagged.v1 stage=output; reviewer manual triage | |
| F11 | Clock skew | provenance requestedAt > completedAt | Invariant violation | Monotonic clock in adapter; health check | |
| F12 | Schema drift (event) | contract test fail | Consumers fail | Roll back schema change; publish .v2 additive | |
| F13 | Consent lookup failure | ABAC deny (consent missing) | 403 AI_CONSENT_REQUIRED | Reviewer checks consent module; restore consent DB | |
| F14 | Quota misconfigured | Sudden 429 spike | Users blocked | Config rollback via config-service; quota override endpoint | |
| F15 | Reviewer over-privileged | manual audit | Accidental acceptance | Quarterly role review; split reviewer/approver when scaled | |
| F16 | Circular saga (assist → finalise → assist) | trace loop detection | Resource exhaustion | Assist denies when X-AI-Originator header present | |
| F17 | Audit publish lost | audit ingestion dedup gap | Compliance risk | DLQ replay; provenance copy on provenance table as source of truth | |
| F18 | KMS outage | encryption failures | Assist fails with null draft persistence | Fail open for metadata only; draft text not stored; emit assist.failed reason KMS_UNAVAILABLE | |
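
Rows F1 and F2 describe a per-provider circuit that opens on an error spike, routes to the fallback provider, and returns 503 AI_PROVIDER_UNAVAILABLE once every circuit is open. The following is a minimal sketch of that behaviour, not the gateway's actual implementation. The names Circuit, assist, and ProviderUnavailable, the failure threshold, and the reset interval are all assumptions made for illustration; only the provider.degraded event name and the error code come from the catalogue.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error for F2; a caller would map it to 503 AI_PROVIDER_UNAVAILABLE."""


class Circuit:
    """Consecutive-failure circuit breaker (thresholds are illustrative assumptions)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Half-open: allow a trial call; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def record(self, ok: bool) -> bool:
        """Record a call outcome; return True if this outcome opened the circuit."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
            return True
        return False


Provider = Tuple[str, Circuit, Callable[[str], str]]


def assist(prompt: str, providers: Sequence[Provider],
           events: List[Tuple[str, str]]) -> str:
    """Try providers in order (primary first), skipping any whose circuit is open."""
    for name, circuit, call in providers:
        if circuit.is_open():
            continue
        try:
            result = call(prompt)
        except Exception:
            if circuit.record(False):
                # F1: the circuit just opened, so emit provider.degraded.
                events.append(("provider.degraded", name))
            continue
        circuit.record(True)
        return result
    # F2: every circuit is open or every attempt failed.
    raise ProviderUnavailable("AI_PROVIDER_UNAVAILABLE")
```

The clock is injectable so the open and half-open transitions can be exercised in a test without sleeping. A production breaker would more likely track an error rate over a window, as the aigw_provider_error_total metric suggests, rather than a consecutive-failure count.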

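Row F5 says that when Redis is unavailable the quota is enforced conservatively in memory, per instance. One way to read "conservatively" is that each instance takes an equal share of the shared limit, so the fleet as a whole cannot exceed it. The sketch below assumes that reading; QuotaGuard, shared_incr, and instance_count are hypothetical names, and the shared counter is injected as a plain callable rather than a real Redis client.

```python
from collections import defaultdict
from typing import Callable, Dict


class QuotaGuard:
    """Quota check that falls back to a per-instance counter when the shared store fails."""

    def __init__(self, shared_incr: Callable[[str], int], limit: int,
                 instance_count: int) -> None:
        # shared_incr(tenant) is assumed to increment and return the shared
        # counter, raising ConnectionError when the store is unreachable.
        self.shared_incr = shared_incr
        self.limit = limit
        # Conservative local share: an equal slice of the limit, at least 1.
        self.local_limit = max(1, limit // instance_count)
        self.local_counts: Dict[str, int] = defaultdict(int)
        self.degraded = False

    def allow(self, tenant: str) -> bool:
        try:
            used = self.shared_incr(tenant)
        except ConnectionError:
            # F5: shared store unavailable, enforce the per-instance share.
            self.degraded = True
            self.local_counts[tenant] += 1
            return self.local_counts[tenant] <= self.local_limit
        self.degraded = False
        return used <= self.limit
```

The sketch leaves out window resets and the alert that F5 calls for; the degraded flag marks where such an alert would be raised.
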
2. Fail-closed vs fail-open matrix

| Dependency | Assist behaviour on failure |
|---|---|
| access-policy | fail-closed (deny) |
| config-service | fail-closed if no cached routing rule; else last-known good |
| moderation | fail-closed (block) |
| audit-service | fail-open (publish to outbox; never block assist) |
| provider | try fallback; exhaust → fail-closed |
| Redis (quota) | fall back to per-instance counter (best-effort) |
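
The matrix above can be encoded as a single decision function so that every call site resolves a dependency failure the same way. This is a sketch under assumptions: the Policy enum, the function name, the "redis-quota" key, and the choice to fail closed for unknown dependencies are illustrative and not taken from the service.

```python
from enum import Enum


class Policy(Enum):
    FAIL_CLOSED = "fail-closed"  # deny or block the assist
    FAIL_OPEN = "fail-open"      # continue; never block the assist
    DEGRADE = "degrade"          # continue on a fallback path


def on_dependency_failure(dependency: str, *,
                          has_cached_routing_rule: bool = False,
                          fallback_exhausted: bool = False) -> Policy:
    """Return the assist behaviour for a failed dependency, per the matrix."""
    if dependency in ("access-policy", "moderation"):
        return Policy.FAIL_CLOSED
    if dependency == "config-service":
        # Last-known good routing rule if one is cached, otherwise deny.
        return Policy.DEGRADE if has_cached_routing_rule else Policy.FAIL_CLOSED
    if dependency == "audit-service":
        # Publish to the outbox; the assist itself is not blocked.
        return Policy.FAIL_OPEN
    if dependency == "provider":
        # Try the fallback provider; fail closed once fallbacks are exhausted.
        return Policy.FAIL_CLOSED if fallback_exhausted else Policy.DEGRADE
    if dependency == "redis-quota":
        # Per-instance counter, best-effort.
        return Policy.DEGRADE
    # Assumption: anything not listed in the matrix is treated as fail-closed.
    return Policy.FAIL_CLOSED
```

Keeping the matrix in one function makes it straightforward to assert each row in a unit test, so a change to the table and a change to the behaviour can be reviewed together.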