ai-orchestrator-service — Observability
Companion to: docs/02-enterprise-architecture.md · docs/08-ai-architecture.md §13
The AI service is the highest-cost service in the platform per request (token spend) and the most user-visible when it degrades. Observability therefore must answer four questions in real time:
- Is the service up? (golden signals)
- Is it producing safe, schema-valid outputs? (output-quality SLO)
- How much money are we spending, and by which tenant / capability / model? (token cost dashboard)
- When something failed, what was the exact provider, prompt version, redacted input, model output, and provenance? (forensics)
1. Golden signals
| Signal | Source | Target |
|---|---|---|
| Availability (1 − HTTP 5xx ratio) | Cloud Run + LB | ≥ 99.9% over 28 d (excluding declared incident windows) |
| Latency p95: `/ai/complete` cloud | OTel histogram `ai.inference.latency_ms` filtered by `provider!='onnx-edge'` | ≤ 2 500 ms |
| Latency p95: `/ai/complete` edge replay | Same | ≤ 6 000 ms |
| Error rate: schema invalid | Counter `ai.output.invalid_total` / `ai.inference.completed_total` | ≤ 0.5% |
| Error rate: provider unavailable | Counter `ai.provider.unavailable_total` | ≤ 0.5% (after retry) |
| HITL queue depth | Gauge `ai.hitl.gates.open` per tenant | Alert if > 50 for > 30 min |
| Cache hit ratio | Counter `ai.cache.hit_total` / (`ai.cache.hit_total` + `ai.cache.miss_total`) | ≥ 35% (cost SLO) |
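The ratio-based targets in this table can be checked from raw counter values. The following is a minimal TypeScript sketch, assuming counter deltas for one evaluation window have already been read from the metrics backend. The `CounterSnapshot` shape and the function names are hypothetical, and the cache denominator is assumed to be hits plus misses.

```typescript
// Hypothetical snapshot of counter deltas over one evaluation window.
interface CounterSnapshot {
  inferenceCompleted: number; // ai.inference.completed_total
  outputInvalid: number;      // ai.output.invalid_total
  cacheHit: number;           // ai.cache.hit_total
  cacheMiss: number;          // ai.cache.miss_total
}

interface GoldenSignalReport {
  schemaInvalidRatio: number;
  cacheHitRatio: number;
  schemaInvalidOk: boolean; // target: <= 0.5%
  cacheHitOk: boolean;      // target: >= 35%
}

// Safe division: an empty window yields 0 rather than NaN.
function ratio(numerator: number, denominator: number): number {
  return denominator > 0 ? numerator / denominator : 0;
}

function evaluateGoldenSignals(s: CounterSnapshot): GoldenSignalReport {
  const schemaInvalidRatio = ratio(s.outputInvalid, s.inferenceCompleted);
  const cacheHitRatio = ratio(s.cacheHit, s.cacheHit + s.cacheMiss);
  return {
    schemaInvalidRatio,
    cacheHitRatio,
    schemaInvalidOk: schemaInvalidRatio <= 0.005,
    cacheHitOk: cacheHitRatio >= 0.35,
  };
}
```

Note that an empty window reports a cache hit ratio of 0 and therefore fails the cache target; a real alert rule would probably need a minimum-traffic guard.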
2. Metrics
All metrics use OpenTelemetry, exported to Cloud Monitoring (Managed Prometheus) and BigQuery (long-term cost analytics).
2.1 Inference metrics
| Metric | Type | Labels | Notes |
|---|---|---|---|
| `melmastoon.ai.inference.latency_ms` | histogram | `tenant_id`, `capability_key`, `provider`, `model`, `cache_hit`, `local` | The single most-queried metric |
| `melmastoon.ai.inference.tokens.input` | histogram | same as above | |
| `melmastoon.ai.inference.tokens.output` | histogram | same as above | |
| `melmastoon.ai.inference.cost_micros` | histogram | same as above | Provider-billed cost in micro-USD; 0 for cache hits and edge calls |
| `melmastoon.ai.inference.completed_total` | counter | `tenant_id`, `capability_key`, `provider` | |
| `melmastoon.ai.inference.failed_total` | counter | `tenant_id`, `capability_key`, `error_code` | |
| `melmastoon.ai.cache.hit_total` | counter | `tenant_id`, `capability_key` | |
| `melmastoon.ai.cache.miss_total` | counter | same as above | |
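As an illustration of how one inference maps onto these histograms, here is a small TypeScript sketch. It deliberately omits the OpenTelemetry instrument calls and only derives the label set and sample values. The `InferenceOutcome` type and the function names are assumptions made for this example, not part of the service's code.

```typescript
// Hypothetical result object produced at the end of one inference.
interface InferenceOutcome {
  tenantId: string;
  capabilityKey: string;
  provider: string;
  model: string;
  cacheHit: boolean;
  local: boolean; // true when the call was served at the edge
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
  providerBilledMicros: number;
}

type MetricLabels = Record<string, string | boolean>;

// Label set shared by the latency, token and cost histograms.
function inferenceLabels(o: InferenceOutcome): MetricLabels {
  return {
    tenant_id: o.tenantId,
    capability_key: o.capabilityKey,
    provider: o.provider,
    model: o.model,
    cache_hit: o.cacheHit,
    local: o.local,
  };
}

// Value recorded in cost_micros: zero for cache hits and edge calls.
function recordedCostMicros(o: InferenceOutcome): number {
  return o.cacheHit || o.local ? 0 : o.providerBilledMicros;
}

// One (metric name, value) pair per histogram listed in section 2.1.
function histogramSamples(o: InferenceOutcome): Array<[string, number]> {
  return [
    ["melmastoon.ai.inference.latency_ms", o.latencyMs],
    ["melmastoon.ai.inference.tokens.input", o.tokensIn],
    ["melmastoon.ai.inference.tokens.output", o.tokensOut],
    ["melmastoon.ai.inference.cost_micros", recordedCostMicros(o)],
  ];
}
```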
2.2 HITL metrics
| Metric | Type | Labels |
|---|---|---|
| `melmastoon.ai.hitl.gates.open` | gauge | `tenant_id`, `policy_key` |
| `melmastoon.ai.hitl.decision_latency_ms` | histogram | `tenant_id`, `policy_key`, `outcome` |
| `melmastoon.ai.hitl.timeouts_total` | counter | `tenant_id`, `policy_key` |
| `melmastoon.ai.hitl.outcome_total` | counter | `tenant_id`, `policy_key`, `outcome` (accept/modify/reject) |
2.3 Provider/router metrics
| Metric | Type | Labels |
|---|---|---|
| `melmastoon.ai.provider.unavailable_total` | counter | `provider`, `model` |
| `melmastoon.ai.provider.fallback_used_total` | counter | `from_provider`, `to_provider`, `capability_key` |
| `melmastoon.ai.provider.rate_limited_total` | counter | `provider`, `model` |
| `melmastoon.ai.provider.timeouts_total` | counter | `provider`, `model` |
2.4 Eval & quality metrics
| Metric | Type | Labels |
|---|---|---|
| `melmastoon.ai.eval.score` | gauge | `eval_suite_id`, `prompt_version_id`, `metric` (precision/recall/bleu/rougeL/score) |
| `melmastoon.ai.output.invalid_total` | counter | `capability_key`, `provider` |
| `melmastoon.ai.output.repaired_total` | counter | `capability_key` |
| `melmastoon.ai.moderation.flagged_total` | counter | `side` (input/output), `category` |
2.5 Budget metrics
| Metric | Type | Labels |
|---|---|---|
| `melmastoon.ai.budget.consumed_micros_total` | counter | `tenant_id`, `purpose_id`, `period` |
| `melmastoon.ai.budget.warnings_total` | counter | `tenant_id`, `purpose_id` |
| `melmastoon.ai.budget.exceeded_total` | counter | `tenant_id`, `purpose_id` |
| `melmastoon.ai.budget.utilization` | gauge | `tenant_id`, `purpose_id` (gauge value ranges over 0..1) |
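The utilization gauge and the warning/exceeded counters can be related with a short calculation. The sketch below is illustrative only: the `BudgetState` shape, the `limitMicros` field, and the 0.8 warning threshold are assumptions, since this document does not state where the budget limit is stored or when a warning fires.

```typescript
// Hypothetical view of one tenant/purpose budget for the current period.
interface BudgetState {
  consumedMicros: number;
  limitMicros: number; // assumed to come from tenant configuration
}

type BudgetSignal = "ok" | "warning" | "exceeded";

// Gauge value: consumed / limit, clamped to the documented 0..1 range.
function budgetUtilization(b: BudgetState): number {
  if (b.limitMicros <= 0) return 1; // assumption: no budget counts as fully used
  return Math.min(1, Math.max(0, b.consumedMicros / b.limitMicros));
}

// Which counter (if any) would increment; warnAt is a hypothetical threshold.
function budgetSignal(b: BudgetState, warnAt: number = 0.8): BudgetSignal {
  if (b.consumedMicros >= b.limitMicros) return "exceeded";
  return budgetUtilization(b) >= warnAt ? "warning" : "ok";
}
```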
3. Tracing
OpenTelemetry tracing is wired through the entire RunInferenceUseCase pipeline. The request span tree typically looks like:
ai.complete (root, http server span)
├─ ai.capability.resolve
├─ ai.prompt.resolve_active
├─ ai.budget.consume
├─ ai.moderation.input
├─ ai.redact.input
├─ ai.cache.lookup (sets attribute ai.cache.hit=true|false)
├─ ai.router.pick (sets ai.router.provider, ai.router.model)
├─ ai.provider.call (sets provider span attributes; SpanKind=CLIENT)
│ └─ provider.<vertex|anthropic|openai|onnx-edge>.invoke
├─ ai.moderation.output
├─ ai.output.validate
├─ ai.hitl.maybe_open
├─ ai.provenance.persist
├─ ai.outbox.append
└─ ai.bigquery.stream
Every span carries the baggage entries `tenant_id`, `capability_key`, `request_id`, `correlation_id`, and `causation_id`. The W3C `traceparent` header from the inbound request propagates to provider HTTP calls so that Vertex traces can be correlated.
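For reference, trace propagation to provider calls relies on the W3C Trace Context header format `version-traceid-parentid-flags`. In practice an OpenTelemetry propagator handles this; the dependency-free TypeScript sketch below only illustrates the header layout for version `00`, and its function names are hypothetical.

```typescript
// Fields of a W3C Trace Context `traceparent` header (version 00).
interface TraceParent {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters (the parent span)
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  if (traceId === undefined || spanId === undefined || flags === undefined) return null;
  // All-zero trace or span IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

function formatTraceparent(ctx: TraceParent): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Header for an outbound provider call: same trace ID, with the calling
// span (for example ai.provider.call) as the new parent.
function outboundTraceparent(inbound: TraceParent, callingSpanId: string): string {
  return formatTraceparent({ ...inbound, spanId: callingSpanId });
}
```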
Sampling: head-based 10% by default; tail-based 100% if any of error=true, ai.budget.exceeded=true, ai.hitl.opened=true, ai.output.invalid=true.
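The sampling rule can be expressed as a single keep/drop decision per completed trace. The following TypeScript sketch assumes the four flags are available as booleans on a trace summary. The `TraceSummary` type is hypothetical, and deriving the 10% head decision from the low 32 bits of the trace ID is one common deterministic approach, not necessarily what the collector here does.

```typescript
// Hypothetical summary of a finished trace, built from its span attributes.
interface TraceSummary {
  traceId: string; // 32 lowercase hex characters
  error: boolean;          // error=true
  budgetExceeded: boolean; // ai.budget.exceeded=true
  hitlOpened: boolean;     // ai.hitl.opened=true
  outputInvalid: boolean;  // ai.output.invalid=true
}

const HEAD_SAMPLE_RATE = 0.1;

// Deterministic head decision: map the low 32 bits of the trace ID to [0, 1).
function headSampled(traceId: string, rate: number = HEAD_SAMPLE_RATE): boolean {
  const low32 = parseInt(traceId.slice(-8), 16);
  return low32 / 4294967296 < rate;
}

// Tail rule: always keep traces carrying any of the four flags.
function shouldKeepTrace(t: TraceSummary): boolean {
  const flagged = t.error || t.budgetExceeded || t.hitlOpened || t.outputInvalid;
  return flagged || headSampled(t.traceId);
}
```

Because the head decision depends only on the trace ID, every service that sees the same trace reaches the same decision without coordination.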
4. Logging
Structured JSON logs (Pino) with mandatory fields:
| Field | Source |
|---|---|
| `timestamp`, `level`, `message` | logger |
| `service` | `ai-orchestrator-service` |
| `version` | git sha + semver |
| `tenant_id`, `request_id`, `correlation_id`, `causation_id` | request context |
| `capability_key`, `prompt_version_id`, `provider`, `model` | inference context |
| `tokens_in`, `tokens_out`, `cost_micros`, `latency_ms`, `cache_hit`, `local` | result context |
| `error_code` | on failure |
Logs never contain raw user inputs or model outputs. They contain input_hash, redacted_input_hash, and output_hash. The forensics flow (§7) retrieves the redacted text from the Postgres audit row, never from logs.
Log retention: 30 days hot in Cloud Logging; archived to GCS (CMEK) for 1 year.
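A lightweight check can enforce the mandatory-field and no-raw-text rules before a record reaches the logger. The TypeScript sketch below is an assumption-laden illustration: the forbidden key names are hypothetical (this document only states that raw inputs and outputs must not be logged), and "on failure" is approximated as `level === "error"`.

```typescript
// Fields the logging table marks as mandatory on every record.
const ALWAYS_REQUIRED = [
  "timestamp", "level", "message", "service", "version",
  "tenant_id", "request_id", "correlation_id", "causation_id",
  "capability_key", "prompt_version_id", "provider", "model",
  "tokens_in", "tokens_out", "cost_micros", "latency_ms", "cache_hit", "local",
] as const;

// Hypothetical key names that would indicate raw text leaking into a log line.
const FORBIDDEN_KEYS = ["input", "output", "raw_input", "raw_output"];

type LogRecord = Record<string, unknown>;

// Returns a list of human-readable problems; an empty list means the record passes.
function logRecordProblems(record: LogRecord): string[] {
  const problems: string[] = [];
  for (const field of ALWAYS_REQUIRED) {
    if (record[field] === undefined) problems.push(`missing field: ${field}`);
  }
  // Assumption: failures are logged at level "error" and must carry error_code.
  if (record["level"] === "error" && record["error_code"] === undefined) {
    problems.push("missing field: error_code");
  }
  for (const key of FORBIDDEN_KEYS) {
    if (key in record) problems.push(`forbidden field: ${key}`);
  }
  return problems;
}
```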
5. BigQuery: token cost dashboard
A streaming insert pipeline writes one row per inference into ai_calls_fact (see DATA_MODEL.md). The dashboard queries:
| Tile | Query summary |
|---|---|
| Daily spend by tenant | SELECT tenant_id, SUM(cost_micros)/1e6 AS usd FROM ai_calls_fact WHERE event_date >= CURRENT_DATE() - 7 GROUP BY 1 ORDER BY usd DESC |
| Spend by capability | SELECT capability_key, SUM(cost_micros)/1e6 FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1 |
| Cache effectiveness | SELECT capability_key, COUNTIF(cache_hit)/COUNT(*) AS hit_ratio FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1 |
| Provider mix | SELECT provider, model, COUNT(*) FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1, 2 |
| Top expensive prompts | SELECT prompt_version_id, AVG(cost_micros) FROM ai_calls_fact WHERE event_date >= CURRENT_DATE() - 1 GROUP BY 1 ORDER BY 2 DESC LIMIT 20 |
| Edge ratio | SELECT capability_key, COUNTIF(local)/COUNT(*) FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1 |
The dashboard (Looker Studio plus a Grafana mirror) is open to platform-engineering and ai-engineering. A per-tenant slice is exposed read-only to tenant admins via the backoffice; it runs the same queries, gated by RLS via a dataset-level row policy.
6. SLOs and alerts
| SLO | Target | Burn-rate alert (Cloud Monitoring) |
|---|---|---|
| `/ai/complete` availability | 99.9% rolling 28 d | 14× burn over 1 h or 6× over 6 h → page |
| `/ai/complete` p95 cloud latency | ≤ 2 500 ms 28 d | 5× over 1 h → page |
| Output schema validity | ≥ 99.5% 7 d | < 99% over 1 h → page |
| Cache hit ratio | ≥ 35% 7 d | < 25% over 24 h → ticket |
| Token spend variance | ≤ 30% above 7-d trailing avg per tenant per day | exceed → page on-call |
| HITL p95 decision latency (auto-rejectable) | ≤ 2 h | exceed → ticket |
All alerts route to PagerDuty service ai-orchestrator; routing rules send budget alerts to the finance ops Slack as well.
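To make the burn-rate thresholds concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a 14× burn against a 99.9% target corresponds to roughly 1.4% of requests failing. The TypeScript sketch below illustrates the availability and token-spend rules from the table. The function names are hypothetical, and the actual alerts are defined in Cloud Monitoring, not in application code.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
function burnRate(errorRatio: number, sloTarget: number): number {
  if (sloTarget >= 1) throw new RangeError("SLO target must be below 1");
  return errorRatio / (1 - sloTarget);
}

// Availability rule: page on 14x burn over 1 h or 6x burn over 6 h.
function availabilityShouldPage(
  errorRatio1h: number,
  errorRatio6h: number,
  sloTarget: number = 0.999,
): boolean {
  return burnRate(errorRatio1h, sloTarget) >= 14 || burnRate(errorRatio6h, sloTarget) >= 6;
}

// Token spend rule: today's spend more than 30% above the trailing average.
function spendExceedsTrailingAverage(
  todayMicros: number,
  trailingDailyMicros: number[],
): boolean {
  if (trailingDailyMicros.length === 0) return false; // no baseline yet
  const total = trailingDailyMicros.reduce((sum, v) => sum + v, 0);
  const average = total / trailingDailyMicros.length;
  return todayMicros > average * 1.3;
}
```

For example, `availabilityShouldPage(0.02, 0.001)` pages because a 2% error ratio over one hour is about a 20× burn, while `availabilityShouldPage(0.005, 0.002)` does not.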
7. Forensics flow ("what just happened?")
For any failed or suspicious inference:
- From the trace ID in the user's incident report, retrieve the root span in Cloud Trace.
- The root span's `request_id` attribute is the `inference_requests.id`.
- Fetch from Postgres: the `inference_requests` row + the `inference_results` row + the `provenances` row + the `moderation_audit` rows.
- Reconstruct the redacted input / output (NEVER the raw input; only redacted forms are persisted).
- Replay the call against the same prompt version + model deployment via `POST /api/v1/ai/replay/:requestId` (admin-only, dry-run; emits an `ai.replay.executed` event for audit).
- If the replay matches, the failure is reproducible; if not, the root cause is provider-side flakiness (and the corresponding `ai.provider.flaky_total` counter increments).
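The last two steps of this flow can be sketched as a small helper. This is a hedged illustration only: the `OutcomeSummary` shape and the definition of "matches" (same error code and same output hash) are assumptions, since this document does not specify how a replay is compared with the original call.

```typescript
type ReplayVerdict = "reproducible" | "provider-flaky";

// Hypothetical summary of an inference outcome, built from the audit rows.
interface OutcomeSummary {
  errorCode: string | null;  // null when the call succeeded
  outputHash: string | null; // null when no output was produced
}

// Path of the admin-only dry-run replay endpoint for a given request ID.
function replayPath(requestId: string): string {
  return `/api/v1/ai/replay/${encodeURIComponent(requestId)}`;
}

// Assumption: a replay "matches" when error code and output hash are both equal.
function classifyReplay(original: OutcomeSummary, replay: OutcomeSummary): ReplayVerdict {
  const matches =
    original.errorCode === replay.errorCode &&
    original.outputHash === replay.outputHash;
  return matches ? "reproducible" : "provider-flaky";
}
```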
8. Health probes
| Probe | What it checks |
|---|---|
| `GET /healthz` (liveness) | Process up; event loop responsive within 200 ms |
| `GET /readyz` (readiness) | Postgres reachable; Memorystore reachable; Vertex AI reachable (5 s budget); KMS reachable for manifest signer |
| `GET /degradedz` | Returns the active degradation level (none / cache-only / deterministic-fallback / read-only) |
The `/degradedz` endpoint feeds the global status banner shown in the backoffice.
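The readiness and degradation semantics can be modelled with a couple of small types. The TypeScript sketch below assumes each dependency check has already produced a boolean. The 200/503 status mapping and the banner rule are conventional choices assumed for illustration, not confirmed behaviour of this service.

```typescript
type Dependency = "postgres" | "memorystore" | "vertex-ai" | "kms";
type DegradationLevel = "none" | "cache-only" | "deterministic-fallback" | "read-only";

interface ReadinessReport {
  ready: boolean;
  httpStatus: 200 | 503; // assumed mapping: 503 when any dependency fails
  failing: Dependency[];
}

// Aggregate per-dependency results into a single /readyz answer.
function readinessReport(checks: Record<Dependency, boolean>): ReadinessReport {
  const failing = (Object.keys(checks) as Dependency[]).filter((d) => !checks[d]);
  const ready = failing.length === 0;
  return { ready, httpStatus: ready ? 200 : 503, failing };
}

// Assumption: the backoffice banner is shown for any level other than "none".
function showStatusBanner(level: DegradationLevel): boolean {
  return level !== "none";
}
```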