AI Gateway Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template

1. SLIs / SLOs

| SLI | Target (SLO) |
| --- | --- |
| Assist success rate (policy allow → returned draft or accepted provider block) | ≥ 99.5% monthly |
| Assist p95 end-to-end latency | ≤ 4.0 s (excluding HITL time) |
| Assist p99 end-to-end latency | ≤ 8.0 s |
| Policy-client p99 latency | ≤ 200 ms |
| Moderation overhead p95 | ≤ 300 ms |
| Provider error rate | ≤ 1% weekly per provider |
| Events delivered to audit-service | ≥ 99.9% within 5 s |
| HITL queue median wait | ≤ 4 h business hours |
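
To make the success-rate target concrete, the error budget implied by the 99.5% monthly SLO can be worked out directly. The sketch below assumes a hypothetical volume of 1,000,000 policy-allowed assists in a month; the function names are illustrative and are not part of the service.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests allowed before the SLO is breached."""
    return int(round((1.0 - slo_target) * total_requests))


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once breached)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # No failures are tolerated at all for this target and volume.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / budget


# Hypothetical month: 1,000,000 assists, 1,250 of them failed.
allowed = error_budget(0.995, 1_000_000)               # 5000 failures allowed
remaining = budget_remaining(0.995, 1_000_000, 1_250)  # 0.75 of the budget left
```

With these assumed numbers, 1,250 failures consume a quarter of the monthly budget.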

2. Metrics (OpenTelemetry)

| Metric | Type | Labels |
| --- | --- | --- |
| aigw_assist_total | counter | tenant, feature, outcome |
| aigw_assist_latency_ms | histogram | tenant, feature, provider |
| aigw_policy_latency_ms | histogram | |
| aigw_moderation_latency_ms | histogram | stage, classifier |
| aigw_provider_error_total | counter | provider, error_code |
| aigw_quota_rejected_total | counter | tenant, feature |
| aigw_hitl_queue_depth | gauge | facility, feature |
| aigw_hitl_wait_ms | histogram | facility, feature |
| aigw_provider_circuit_open | gauge | provider, feature |
| aigw_token_usage | counter | provider, model, direction (prompt\|completion) |
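
One way to keep label sets from drifting is to hold the catalogue as data and check label keys before a measurement is recorded. The sketch below uses only the standard library and copies the metric names, types, and labels listed above; the `check_labels` helper is a hypothetical illustration and is not an OpenTelemetry API.

```python
from __future__ import annotations

# Metric name -> (instrument type, expected label keys), as listed in this section.
METRICS: dict[str, tuple[str, set[str]]] = {
    "aigw_assist_total": ("counter", {"tenant", "feature", "outcome"}),
    "aigw_assist_latency_ms": ("histogram", {"tenant", "feature", "provider"}),
    "aigw_policy_latency_ms": ("histogram", set()),  # no labels listed
    "aigw_moderation_latency_ms": ("histogram", {"stage", "classifier"}),
    "aigw_provider_error_total": ("counter", {"provider", "error_code"}),
    "aigw_quota_rejected_total": ("counter", {"tenant", "feature"}),
    "aigw_hitl_queue_depth": ("gauge", {"facility", "feature"}),
    "aigw_hitl_wait_ms": ("histogram", {"facility", "feature"}),
    "aigw_provider_circuit_open": ("gauge", {"provider", "feature"}),
    "aigw_token_usage": ("counter", {"provider", "model", "direction"}),
}

TOKEN_DIRECTIONS = {"prompt", "completion"}


def check_labels(metric: str, labels: dict[str, str]) -> None:
    """Raise if the label keys do not match the catalogue (hypothetical helper)."""
    if metric not in METRICS:
        raise KeyError(f"unknown metric: {metric}")
    _, expected = METRICS[metric]
    got = set(labels)
    if got != expected:
        raise ValueError(
            f"{metric}: missing {sorted(expected - got)}, unexpected {sorted(got - expected)}"
        )
    if metric == "aigw_token_usage" and labels["direction"] not in TOKEN_DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(TOKEN_DIRECTIONS)}")
```

Rejecting unexpected keys, not only missing ones, is a deliberate choice here: it stops high-cardinality values such as request IDs from slipping into metric labels.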

3. Tracing

Span hierarchy (per assist): root span ai.assist, with child spans ai.policy.evaluate, ai.quota.consume, ai.moderation.input, ai.provider.generate, ai.moderation.output, ai.persist, ai.event.publish. All spans carry the attributes tenant.id, feature.key, correlation.id, and decision.id.
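
The shape of a single assist trace can be sketched with a small stand-in data structure. This is not the OpenTelemetry SDK; the `Span` class and `build_assist_trace` function are hypothetical. The sketch also assumes that ai.assist is the root span and that the remaining spans are its direct children in the order listed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Child span names in the order given in this section.
CHILD_SPANS = [
    "ai.policy.evaluate",
    "ai.quota.consume",
    "ai.moderation.input",
    "ai.provider.generate",
    "ai.moderation.output",
    "ai.persist",
    "ai.event.publish",
]

# Attributes every span must carry.
REQUIRED_ATTRS = ("tenant.id", "feature.key", "correlation.id", "decision.id")


@dataclass
class Span:
    """Minimal stand-in for a tracing span (illustrative only)."""
    name: str
    attributes: Dict[str, str]
    children: List["Span"] = field(default_factory=list)


def build_assist_trace(common: Dict[str, str]) -> Span:
    """Build the root span and its children, copying the common attributes to each."""
    missing = [key for key in REQUIRED_ATTRS if key not in common]
    if missing:
        raise ValueError(f"missing span attributes: {missing}")
    root = Span("ai.assist", dict(common))
    for name in CHILD_SPANS:
        root.children.append(Span(name, dict(common)))
    return root
```

Copying the four identifiers onto every span, not just the root, lets a single child span be found by correlation.id or decision.id without first loading its parent.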

4. Logs

  • Structured JSON logs; no raw prompt or output text in default logs.
  • Fields: correlation_id, decision_id, tenant_id, actor_id, feature_key, provider, latency_ms, outcome, reason_code.
  • Error logs include the stack trace, with the PHI redactor applied.
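
A minimal sketch of the allow-list approach to structured logging follows, using only the Python standard library. The formatter class name is hypothetical, and the PHI redactor is out of scope here. Fields are assumed to arrive through the `extra=` argument of the logging call; anything not in the allow-list (for example a raw prompt) is dropped from the output.

```python
import json
import logging

# Field names taken from the list above.
ALLOWED_FIELDS = (
    "correlation_id", "decision_id", "tenant_id", "actor_id",
    "feature_key", "provider", "latency_ms", "outcome", "reason_code",
)


class AllowListJsonFormatter(logging.Formatter):
    """Emit one JSON object per record, keeping only allow-listed fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key in ALLOWED_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


# Usage: attach the formatter to a handler and pass fields via `extra`.
handler = logging.StreamHandler()
handler.setFormatter(AllowListJsonFormatter())
logger = logging.getLogger("aigw.example")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "assist completed",
    extra={"tenant_id": "t-1", "outcome": "success", "latency_ms": 812,
           "prompt": "raw text that must not be logged"},
)
```

The allow-list only filters structured fields. The message string itself still has to be kept free of prompt or output text by the caller.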

5. Dashboards

| Dashboard | Panels |
| --- | --- |
| AIGW — Executive | assists/day per tenant, cost/tenant, HITL queue, SLO status |
| AIGW — Reliability | assist success rate, p95/p99, provider error rate, circuit breaker state, policy/moderation latency |
| AIGW — Safety | moderation block rate, HITL rejection rate by feature, flagged categories distribution |
| AIGW — Capacity | tokens/day per provider, quota consumption, cache hit rate |

6. Alerts

| Alert condition | Action | Severity |
| --- | --- | --- |
| Assist success rate < 99% for 15 min | page | P1 |
| Any provider circuit open > 10 min | page | P2 |
| Policy client p99 > 1 s for 10 min | page | P2 |
| HITL queue depth > 200 for > 1 h | ticket | P3 |
| Moderation block rate > 5% sudden spike | ticket | P3 |
| Tenant quota hit rate > 50% for 1 h | ticket | P3 |
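
Most of these conditions have the form "value beyond a threshold for a sustained period". The sketch below illustrates that semantics for the first alert (success rate < 99% for 15 min). The alerting backend and its evaluation interval are not specified here, so the function name, the one-sample-per-minute cadence, and the sample values are assumptions.

```python
from typing import Iterable, Optional, Tuple


def sustained_below(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    duration_s: float,
) -> bool:
    """Return True if the value has stayed below `threshold` for at least
    `duration_s` seconds, up to and including the latest sample.

    `samples` are (timestamp_seconds, value) pairs sorted by timestamp.
    """
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # any recovery resets the timer
        last_ts = ts
    if breach_start is None or last_ts is None:
        return False
    return last_ts - breach_start >= duration_s


# Hypothetical data: one sample per minute, success rate stuck at 98.5%.
degraded = [(minute * 60.0, 0.985) for minute in range(16)]
should_page = sustained_below(degraded, threshold=0.99, duration_s=15 * 60)
```

Resetting the timer on any healthy sample means a brief recovery in the middle of the window suppresses the page. That matches a strict reading of "for 15 min", though a real alerting system may evaluate the condition differently.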

7. Runbooks

| Scenario | Runbook |
| --- | --- |
| All assists failing | runbooks/aigw/provider-outage.md |
| Moderation classifier offline | runbooks/aigw/moderation-degraded.md |
| HITL queue backlog | runbooks/aigw/hitl-backlog.md |
| Policy client timeout surge | runbooks/aigw/policy-degraded.md |
| Quota misconfiguration | runbooks/aigw/quota-incident.md |