AI Gateway Service — Observability
Status: populated
Owner: TBD
Last updated: 2026-04-17
Companion: Service Template
1. SLIs / SLOs
| SLI | Target (SLO) |
|---|
| Assist success rate (policy allow → returned draft or accepted provider block) | ≥ 99.5% monthly |
| Assist p95 end-to-end latency | ≤ 4.0 s (excluding HITL time) |
| Assist p99 end-to-end latency | ≤ 8.0 s |
| Policy-client p99 latency | ≤ 200 ms |
| Moderation overhead p95 | ≤ 300 ms |
| Provider error rate | ≤ 1% weekly per provider |
Events delivered to audit-service | ≥ 99.9% within 5 s |
| HITL queue median wait | ≤ 4 h business hours |
2. Metrics (OpenTelemetry)
| Metric | Type | Labels |
|---|
aigw_assist_total | counter | tenant, feature, outcome |
aigw_assist_latency_ms | histogram | tenant, feature, provider |
aigw_policy_latency_ms | histogram | — |
aigw_moderation_latency_ms | histogram | stage, classifier |
aigw_provider_error_total | counter | provider, error_code |
aigw_quota_rejected_total | counter | tenant, feature |
aigw_hitl_queue_depth | gauge | facility, feature |
aigw_hitl_wait_ms | histogram | facility, feature |
aigw_provider_circuit_open | gauge | provider, feature |
aigw_token_usage | counter | provider, model, direction (prompt|completion) |
3. Tracing
Span hierarchy (per assist): ai.assist → ai.policy.evaluate, ai.quota.consume, ai.moderation.input, ai.provider.generate, ai.moderation.output, ai.persist, ai.event.publish. All spans carry tenant.id, feature.key, correlation.id, decision.id.
4. Logs
- Structured JSON logs; no raw prompt or output text in default logs.
- Fields:
correlation_id, decision_id, tenant_id, actor_id, feature_key, provider, latency_ms, outcome, reason_code.
- Error logs include stack (PHI redactor applied).
5. Dashboards
| Dashboard | Panels |
|---|
| AIGW — Executive | assists/day per tenant, cost/tenant, HITL queue, SLO status |
| AIGW — Reliability | assist success rate, p95/p99, provider error rate, circuit breaker state, policy/moderation latency |
| AIGW — Safety | moderation block rate, HITL rejection rate by feature, flagged categories distribution |
| AIGW — Capacity | tokens/day per provider, quota consumption, cache hit rate |
6. Alerts
| Alert | Condition | Severity |
|---|
| Assist success rate < 99% for 15 min | page | P1 |
| Any provider circuit open > 10 min | page | P2 |
| Policy client p99 > 1 s for 10 min | page | P2 |
| HITL queue depth > 200 for > 1 h | ticket | P3 |
| Moderation block rate > 5% sudden spike | ticket | P3 |
| Tenant quota hit rate > 50% for 1 h | ticket | P3 |
7. Runbooks
| Scenario | Runbook |
|---|
| All assists failing | runbooks/aigw/provider-outage.md |
| Moderation classifier offline | runbooks/aigw/moderation-degraded.md |
| HITL queue backlog | runbooks/aigw/hitl-backlog.md |
| Policy client timeout surge | runbooks/aigw/policy-degraded.md |
| Quota misconfiguration | runbooks/aigw/quota-incident.md |