
ai-orchestrator-service — Observability

Companion to: docs/02-enterprise-architecture.md · docs/08-ai-architecture.md §13

The AI service is the highest-cost service in the platform per request (token spend) and the most user-visible when it degrades. Observability therefore must answer four questions in real time:

  1. Is the service up? (golden signals)
  2. Is it producing safe, schema-valid outputs? (output-quality SLO)
  3. How much money are we spending, and by which tenant / capability / model? (token cost dashboard)
  4. When something failed, what was the exact provider, prompt version, redacted input, model output, and provenance? (forensics)

1. Golden signals

| Signal | Source | Target |
|---|---|---|
| Availability (1 - HTTP 5xx ratio) | Cloud Run + LB | ≥ 99.9% over 28 d (excluding declared incident windows) |
| Latency p95, /ai/complete (cloud) | OTel histogram ai.inference.latency_ms filtered by provider!='onnx-edge' | ≤ 2 500 ms |
| Latency p95, /ai/complete (edge replay) | Same | ≤ 6 000 ms |
| Error rate, schema invalid | Counter ai.output.invalid_total / ai.inference.completed_total | ≤ 0.5% |
| Error rate, provider unavailable | Counter ai.provider.unavailable_total | ≤ 0.5% (after retry) |
| HITL queue depth | Gauge ai.hitl.gates.open per tenant | Alert if > 50 for > 30 min |
| Cache hit ratio | Counter ai.cache.hit_total / (ai.cache.hit_total + ai.cache.miss_total) | ≥ 35% (cost SLO) |
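
The ratio targets in this table can be checked from counter deltas over an evaluation window. The following is a minimal sketch, not the service's actual evaluator: the `CounterWindow` shape and the function names are hypothetical, and treating hits plus misses as the total number of cache lookups is an assumption based on the counters listed in §2.1.

```typescript
// Counter deltas observed over one evaluation window (hypothetical shape).
interface CounterWindow {
  outputInvalid: number;      // ai.output.invalid_total
  inferenceCompleted: number; // ai.inference.completed_total
  cacheHit: number;           // ai.cache.hit_total
  cacheMiss: number;          // ai.cache.miss_total
}

interface SignalReport {
  schemaInvalidRatio: number;
  cacheHitRatio: number;
  schemaInvalidOk: boolean; // target from the table: <= 0.5%
  cacheHitOk: boolean;      // target from the table: >= 35%
}

// Safe division: an empty window yields 0 instead of NaN.
function ratio(numerator: number, denominator: number): number {
  return denominator > 0 ? numerator / denominator : 0;
}

function evaluateWindow(w: CounterWindow): SignalReport {
  const schemaInvalidRatio = ratio(w.outputInvalid, w.inferenceCompleted);
  // Assumption: total lookups = hits + misses.
  const cacheHitRatio = ratio(w.cacheHit, w.cacheHit + w.cacheMiss);
  return {
    schemaInvalidRatio,
    cacheHitRatio,
    schemaInvalidOk: schemaInvalidRatio <= 0.005,
    cacheHitOk: cacheHitRatio >= 0.35,
  };
}
```

Note that an empty window reports a cache hit ratio of 0 and therefore fails the cache target; a real evaluator would probably skip windows with no traffic.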

2. Metrics

All metrics use OpenTelemetry, exported to Cloud Monitoring (Managed Prometheus) and BigQuery (long-term cost analytics).

2.1 Inference metrics

| Metric | Type | Labels | Notes |
|---|---|---|---|
| melmastoon.ai.inference.latency_ms | histogram | tenant_id, capability_key, provider, model, cache_hit, local | The single most-queried metric |
| melmastoon.ai.inference.tokens.input | histogram | same | |
| melmastoon.ai.inference.tokens.output | histogram | same | |
| melmastoon.ai.inference.cost_micros | histogram | same | Provider-billed micros; 0 for cache & edge |
| melmastoon.ai.inference.completed_total | counter | tenant_id, capability_key, provider | |
| melmastoon.ai.inference.failed_total | counter | tenant_id, capability_key, error_code | |
| melmastoon.ai.cache.hit_total | counter | tenant_id, capability_key | |
| melmastoon.ai.cache.miss_total | counter | same | |
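
The note on `cost_micros` ("0 for cache & edge") implies a small rule at recording time. Here is a sketch of that rule under stated assumptions: the `Pricing` shape, the per-token prices, and the function name are hypothetical and are not taken from the service code.

```typescript
// Labels shared by the inference histograms in the table above.
interface InferenceLabels {
  tenant_id: string;
  capability_key: string;
  provider: string;
  model: string;
  cache_hit: boolean;
  local: boolean; // true when served by the edge (on-device) model
}

// Hypothetical per-token pricing, expressed in micros (1e-6 USD).
interface Pricing {
  inputMicrosPerToken: number;
  outputMicrosPerToken: number;
}

// Value to record in melmastoon.ai.inference.cost_micros.
function costMicros(
  labels: InferenceLabels,
  tokensIn: number,
  tokensOut: number,
  pricing: Pricing,
): number {
  // Cache hits and edge inferences are not billed by a provider.
  if (labels.cache_hit || labels.local) return 0;
  return Math.round(
    tokensIn * pricing.inputMicrosPerToken +
      tokensOut * pricing.outputMicrosPerToken,
  );
}
```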

2.2 HITL metrics

| Metric | Type | Labels |
|---|---|---|
| melmastoon.ai.hitl.gates.open | gauge | tenant_id, policy_key |
| melmastoon.ai.hitl.decision_latency_ms | histogram | tenant_id, policy_key, outcome |
| melmastoon.ai.hitl.timeouts_total | counter | tenant_id, policy_key |
| melmastoon.ai.hitl.outcome_total | counter | tenant_id, policy_key, outcome (accept/modify/reject) |

2.3 Provider/router metrics

| Metric | Type | Labels |
|---|---|---|
| melmastoon.ai.provider.unavailable_total | counter | provider, model |
| melmastoon.ai.provider.fallback_used_total | counter | from_provider, to_provider, capability_key |
| melmastoon.ai.provider.rate_limited_total | counter | provider, model |
| melmastoon.ai.provider.timeouts_total | counter | provider, model |

2.4 Eval & quality metrics

| Metric | Type | Labels |
|---|---|---|
| melmastoon.ai.eval.score | gauge | eval_suite_id, prompt_version_id, metric (precision/recall/bleu/rougeL/score) |
| melmastoon.ai.output.invalid_total | counter | capability_key, provider |
| melmastoon.ai.output.repaired_total | counter | capability_key |
| melmastoon.ai.moderation.flagged_total | counter | side (input/output), category |

2.5 Budget metrics

| Metric | Type | Labels |
|---|---|---|
| melmastoon.ai.budget.consumed_micros_total | counter | tenant_id, purpose_id, period |
| melmastoon.ai.budget.warnings_total | counter | tenant_id, purpose_id |
| melmastoon.ai.budget.exceeded_total | counter | tenant_id, purpose_id |
| melmastoon.ai.budget.utilization | gauge | tenant_id, purpose_id (0..1) |
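
The utilization gauge is bounded to 0..1, so the value has to be clamped once a budget is overspent. The sketch below shows one way the gauge value and the warning/exceeded distinction could be derived. The `BudgetState` shape, the 0.8 warning threshold, and the handling of a missing limit are assumptions made for illustration; the source does not specify them.

```typescript
interface BudgetState {
  consumedMicros: number; // spend so far in the current period
  limitMicros: number;    // configured budget for the period
}

type BudgetStatus = "ok" | "warning" | "exceeded";

// Gauge value for melmastoon.ai.budget.utilization, clamped to [0, 1].
function utilization(b: BudgetState): number {
  if (b.limitMicros <= 0) return 1; // assumption: no usable limit counts as full
  return Math.min(1, Math.max(0, b.consumedMicros / b.limitMicros));
}

// warnAt is a hypothetical threshold, not a value taken from the service.
function budgetStatus(b: BudgetState, warnAt = 0.8): BudgetStatus {
  if (b.limitMicros <= 0 || b.consumedMicros > b.limitMicros) return "exceeded";
  return utilization(b) >= warnAt ? "warning" : "ok";
}
```

In this sketch a "warning" result would increment `warnings_total` and an "exceeded" result would increment `exceeded_total`.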

3. Tracing

OpenTelemetry tracing is wired through the entire RunInferenceUseCase pipeline. The request span tree typically looks like:

ai.complete (root, http server span)
├─ ai.capability.resolve
├─ ai.prompt.resolve_active
├─ ai.budget.consume
├─ ai.moderation.input
├─ ai.redact.input
├─ ai.cache.lookup (sets attribute ai.cache.hit=true|false)
├─ ai.router.pick (sets ai.router.provider, ai.router.model)
├─ ai.provider.call (sets provider span attributes; SpanKind=CLIENT)
│ └─ provider.<vertex|anthropic|openai|onnx-edge>.invoke
├─ ai.moderation.output
├─ ai.output.validate
├─ ai.hitl.maybe_open
├─ ai.provenance.persist
├─ ai.outbox.append
└─ ai.bigquery.stream

Every span carries the baggage entries tenant_id, capability_key, request_id, correlation_id, and causation_id. The W3C traceparent header from the inbound request is propagated to provider HTTP calls so that Vertex traces can be correlated.
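
To make the propagation step concrete, here is a small sketch that parses an inbound W3C `traceparent` header and builds the header for an outgoing provider call. In practice the OpenTelemetry propagators handle this; the helper names below are hypothetical, and only version `00` of the header format is handled.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const traceId = m[1] as string;
  const parentId = m[2] as string;
  const flags = m[3] as string;
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Header for the outgoing provider call: same trace, new span as parent.
function childTraceparent(ctx: TraceContext, newSpanId: string): string {
  return `00-${ctx.traceId}-${newSpanId}-${ctx.sampled ? "01" : "00"}`;
}
```

The trace ID stays constant across the hop, which is what allows the provider-side trace to be joined with the `ai.provider.call` span.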

Sampling: head-based at 10% by default; tail-based at 100% when any of error=true, ai.budget.exceeded=true, ai.hitl.opened=true, or ai.output.invalid=true is set.
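
The sampling rule can be expressed as a single keep/drop decision per finished trace. This is a sketch of the policy only, not the collector configuration: mapping the trace ID onto [0, 1) via its last eight hex digits is a hypothetical choice made here so that the decision is deterministic.

```typescript
type SpanAttributes = Record<string, string | number | boolean | undefined>;

// Attributes that force a trace to be kept regardless of the head ratio.
const ALWAYS_KEEP_FLAGS = [
  "error",
  "ai.budget.exceeded",
  "ai.hitl.opened",
  "ai.output.invalid",
];

// Deterministically map a hex trace ID onto [0, 1) using its last 8 hex digits.
function traceIdToUnit(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 4294967296; // 2^32
}

function keepTrace(traceId: string, attrs: SpanAttributes, headRatio = 0.1): boolean {
  // Tail rule: keep 100% of traces carrying any of the flags.
  if (ALWAYS_KEEP_FLAGS.some((flag) => attrs[flag] === true)) return true;
  // Head rule: keep roughly headRatio of the remaining traces.
  return traceIdToUnit(traceId) < headRatio;
}
```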

4. Logging

Structured JSON logs (Pino) with mandatory fields:

| Field | Source |
|---|---|
| timestamp, level, message | logger |
| service | ai-orchestrator-service |
| version | git sha + semver |
| tenant_id, request_id, correlation_id, causation_id | request context |
| capability_key, prompt_version_id, provider, model | inference context |
| tokens_in, tokens_out, cost_micros, latency_ms, cache_hit, local | result context |
| error_code | on failure |

Logs never contain raw user inputs or model outputs. They contain input_hash, redacted_input_hash, and output_hash. The forensics flow (§7) retrieves the redacted text from the Postgres audit row, never from logs.
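
A log record that follows these rules could be assembled as sketched below. The function and type names are hypothetical, the message strings are placeholders, and the hash function is injected because the source does not name an algorithm; SHA-256 via Node's `crypto.createHash` would be a typical choice.

```typescript
type Hasher = (text: string) => string;

interface RequestContext {
  tenant_id: string;
  request_id: string;
  correlation_id: string;
  causation_id: string;
  capability_key: string;
  prompt_version_id: string;
  provider: string;
  model: string;
}

interface ResultContext {
  tokens_in: number;
  tokens_out: number;
  cost_micros: number;
  latency_ms: number;
  cache_hit: boolean;
  local: boolean;
  error_code?: string; // present only on failure
}

interface InferenceTexts {
  rawInput: string;
  redactedInput: string;
  output: string;
}

// Builds the structured record; only hashes of the texts are included.
function buildLogRecord(
  ctx: RequestContext,
  result: ResultContext,
  texts: InferenceTexts,
  hash: Hasher,
  version: string,
) {
  return {
    timestamp: new Date().toISOString(),
    level: result.error_code ? "error" : "info",
    message: result.error_code ? "inference failed" : "inference completed",
    service: "ai-orchestrator-service",
    version,
    ...ctx,
    ...result,
    input_hash: hash(texts.rawInput),
    redacted_input_hash: hash(texts.redactedInput),
    output_hash: hash(texts.output),
  };
}
```

The resulting object can be passed to the logger as-is; because the text fields never enter the record, no serializer-level redaction is needed for them.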

Log retention: 30 days hot in Cloud Logging; archived to GCS (CMEK) for 1 year.

5. BigQuery: token cost dashboard

A streaming insert pipeline writes one row per inference into ai_calls_fact (see DATA_MODEL.md). The dashboard queries:

| Tile | Query summary |
|---|---|
| Daily spend by tenant | SELECT tenant_id, SUM(cost_micros)/1e6 AS usd FROM ai_calls_fact WHERE event_date >= CURRENT_DATE() - 7 GROUP BY 1 ORDER BY usd DESC |
| Spend by capability | SELECT capability_key, SUM(cost_micros)/1e6 FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1 |
| Cache effectiveness | SELECT capability_key, COUNTIF(cache_hit)/COUNT(*) AS hit_ratio FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1 |
| Provider mix | SELECT provider, model, COUNT(*) FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1, 2 |
| Top expensive prompts | SELECT prompt_version_id, AVG(cost_micros) FROM ai_calls_fact WHERE event_date >= CURRENT_DATE() - 1 GROUP BY 1 ORDER BY 2 DESC LIMIT 20 |
| Edge ratio | SELECT capability_key, COUNTIF(local)/COUNT(*) FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1 |

The dashboard (Looker Studio, with a Grafana mirror) is open to platform-engineering and ai-engineering. A per-tenant slice is exposed read-only to tenant admins via the backoffice; it runs the same queries, gated by row-level security (a BigQuery row access policy on ai_calls_fact).
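
For unit-testing dashboard logic without BigQuery, the "spend by tenant" aggregation can be mirrored in memory. This is an illustrative sketch: the `AiCallFact` interface lists only columns that appear in the queries above, and the real schema lives in DATA_MODEL.md.

```typescript
// Subset of ai_calls_fact columns used by the dashboard queries.
interface AiCallFact {
  tenant_id: string;
  capability_key: string;
  cost_micros: number;
  cache_hit: boolean;
}

// In-memory equivalent of:
//   SELECT tenant_id, SUM(cost_micros)/1e6 AS usd ... GROUP BY 1 ORDER BY usd DESC
function spendByTenantUsd(rows: AiCallFact[]): Array<{ tenant_id: string; usd: number }> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    // Sum integer micros first; convert to USD once at the end.
    totals.set(row.tenant_id, (totals.get(row.tenant_id) ?? 0) + row.cost_micros);
  }
  return Array.from(totals.entries())
    .map(([tenant_id, micros]) => ({ tenant_id, usd: micros / 1e6 }))
    .sort((a, b) => b.usd - a.usd);
}
```

Summing in micros and dividing once keeps the totals exact for integer inputs, which matches the `SUM(cost_micros)/1e6` shape of the SQL.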

6. SLOs and alerts

| SLO | Target | Burn-rate alert (Cloud Monitoring) |
|---|---|---|
| /ai/complete availability | 99.9% rolling 28 d | 14× burn over 1 h or 6× over 6 h → page |
| /ai/complete p95 cloud latency | ≤ 2 500 ms 28 d | 5× over 1 h → page |
| Output schema validity | ≥ 99.5% 7 d | < 99% over 1 h → page |
| Cache hit ratio | ≥ 35% 7 d | < 25% over 24 h → ticket |
| Token spend variance | ≤ 30% above 7-d trailing avg per tenant per day | exceed → page on-call |
| HITL p95 decision latency (auto-rejectable) | ≤ 2 h | exceed → ticket |

All alerts route to the PagerDuty service ai-orchestrator; routing rules also send budget alerts to the finance ops Slack.
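
A burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). With a 99.9% target the budget is 0.1%, so a 14× burn corresponds to roughly 1.4% of requests failing. The sketch below applies the availability and token-spend rules from the table; the type and function names are hypothetical, and treating the thresholds as inclusive is an assumption.

```typescript
interface EventWindow {
  bad: number;   // failed requests in the window
  total: number; // all requests in the window
}

// Burn rate = observed error ratio / error budget.
function burnRate(w: EventWindow, sloTarget: number): number {
  if (w.total === 0) return 0;
  const errorBudget = 1 - sloTarget;
  return w.bad / w.total / errorBudget;
}

// Availability rule: 14x burn over 1 h or 6x over 6 h pages.
function availabilityShouldPage(oneHour: EventWindow, sixHours: EventWindow): boolean {
  const slo = 0.999;
  return burnRate(oneHour, slo) >= 14 || burnRate(sixHours, slo) >= 6;
}

// Token spend rule: today's spend is more than 30% above the trailing daily average.
function spendVarianceExceeded(todayMicros: number, trailingDailyMicros: number[]): boolean {
  if (trailingDailyMicros.length === 0) return false;
  const sum = trailingDailyMicros.reduce((acc, v) => acc + v, 0);
  const avg = sum / trailingDailyMicros.length;
  return todayMicros > avg * 1.3;
}
```

Pairing a short and a long window lets the alert catch both fast outages and slow, sustained budget drain.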

7. Forensics flow ("what just happened?")

For any failed or suspicious inference:

  1. From the trace ID in the user's incident report, retrieve the root span in Cloud Trace.
  2. The root span's request_id attribute is the inference_requests.id.
  3. Fetch from Postgres: the inference_requests row + the inference_results row + the provenances row + the moderation_audit rows.
  4. Reconstruct the redacted input / output (NEVER the raw input — only redacted forms are persisted).
  5. Replay the call against the same prompt version + model deployment via POST /api/v1/ai/replay/:requestId (admin-only, dry-run; emits ai.replay.executed event for audit).
  6. If the replay reproduces the original result, the failure is reproducible; if it does not, the likely root cause is provider-side flakiness, and the corresponding ai.provider.flaky_total counter is incremented.
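
The last two steps of the flow can be sketched as a path builder for the replay endpoint plus a comparison of the original and replayed outcomes. What counts as a "match" is not defined in the source; the sketch assumes that two failures match when their error codes are equal and two successes match when their output hashes are equal. All type and function names are hypothetical.

```typescript
type ReplayVerdict = "reproducible" | "suspected-provider-flakiness";

interface InferenceOutcome {
  errorCode: string | null;  // null when the call succeeded
  outputHash: string | null; // null when no output was produced
}

// Path of the admin-only, dry-run replay endpoint for a given request.
function replayPath(requestId: string): string {
  return `/api/v1/ai/replay/${encodeURIComponent(requestId)}`;
}

// Step 6: decide whether the replay reproduced the original outcome.
function compareReplay(original: InferenceOutcome, replay: InferenceOutcome): ReplayVerdict {
  const sameError = original.errorCode === replay.errorCode;
  // Output hashes are only compared when both calls succeeded (assumption).
  const bothSucceeded = original.errorCode === null && replay.errorCode === null;
  const sameOutput = !bothSucceeded || original.outputHash === replay.outputHash;
  return sameError && sameOutput ? "reproducible" : "suspected-provider-flakiness";
}
```

Model outputs are not always deterministic, so a hash mismatch between two successful calls is weaker evidence than a failure that disappears on replay.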

8. Health probes

| Probe | What it checks |
|---|---|
| GET /healthz (liveness) | Process up; event loop responsive within 200 ms |
| GET /readyz (readiness) | Postgres reachable; Memorystore reachable; Vertex AI reachable (5 s budget); KMS reachable for manifest signer |
| GET /degradedz | Returns the active degradation level (none / cache-only / deterministic-fallback / read-only) |

The /degradedz endpoint feeds the global status banner shown in the backoffice.
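
The readiness decision and the degradation levels can be modelled as plain data. This is a sketch under assumptions: the 200/503 status codes follow common probe conventions rather than anything stated above, and the type and function names are hypothetical.

```typescript
// Levels reported by GET /degradedz, as listed in the table.
type DegradationLevel = "none" | "cache-only" | "deterministic-fallback" | "read-only";

const DEGRADATION_LEVELS: DegradationLevel[] = [
  "none",
  "cache-only",
  "deterministic-fallback",
  "read-only",
];

function isDegradationLevel(value: string): value is DegradationLevel {
  return (DEGRADATION_LEVELS as string[]).indexOf(value) !== -1;
}

// Result of each dependency check performed by GET /readyz.
interface ReadinessChecks {
  postgres: boolean;
  memorystore: boolean;
  vertexAi: boolean; // checked within the 5 s budget
  kms: boolean;      // needed by the manifest signer
}

interface ReadinessResponse {
  status: 200 | 503;
  failing: string[];
}

// Ready only when every dependency is reachable.
function evaluateReadiness(checks: ReadinessChecks): ReadinessResponse {
  const names = Object.keys(checks) as Array<keyof ReadinessChecks>;
  const failing = names.filter((name) => !checks[name]);
  return { status: failing.length === 0 ? 200 : 503, failing };
}
```

Returning the list of failing dependencies alongside the status makes the probe output directly useful during an incident.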