ai-orchestrator-service — Observability

Companion to: docs/02-enterprise-architecture.md · docs/08-ai-architecture.md §13

The AI service is the highest-cost service in the platform per request (token spend) and the most user-visible when it degrades. Observability therefore must answer four questions in real time:

Is the service up? (golden signals)
Is it producing safe, schema-valid outputs? (output-quality SLO)
How much money are we spending, and by which tenant / capability / model? (token cost dashboard)
When something failed, what was the exact provider, prompt version, redacted input, model output, and provenance? (forensics)

1. Golden signals

Signal	Source	Target
Availability (share of non-5xx HTTP responses)	Cloud Run + LB	≥ 99.9% over 28 d (excluding declared incident windows)
Latency p95, `/ai/complete` cloud	OTel histogram `ai.inference.latency_ms` filtered by `provider!='onnx-edge'`	≤ 2 500 ms
Latency p95, `/ai/complete` edge replay	Same histogram, filtered by `provider='onnx-edge'`	≤ 6 000 ms
Error rate — schema invalid	Counter `ai.output.invalid_total` / `ai.inference.completed_total`	≤ 0.5%
Error rate — provider unavailable	Counter `ai.provider.unavailable_total`	≤ 0.5% (after retry)
HITL queue depth	Gauge `ai.hitl.gates.open` per tenant	Alert if > 50 for > 30 min
Cache hit ratio	Counters `ai.cache.hit_total` / (`ai.cache.hit_total` + `ai.cache.miss_total`)	≥ 35% (cost SLO)
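
The two ratio targets in the table (schema-invalid rate and cache hit ratio) can be checked from raw counter values. The following is a minimal sketch, assuming a hypothetical `CounterWindow` snapshot that holds the counter deltas for one evaluation window; the type and function names are illustrative and not part of the service.

```typescript
// Hypothetical snapshot of counter deltas over one evaluation window.
interface CounterWindow {
  completed: number;   // ai.inference.completed_total
  invalid: number;     // ai.output.invalid_total
  cacheHits: number;   // ai.cache.hit_total
  cacheMisses: number; // ai.cache.miss_total
}

interface SignalCheck {
  name: string;
  value: number;
  ok: boolean;
}

// Safe division: an empty window yields 0 rather than NaN.
function ratio(numerator: number, denominator: number): number {
  return denominator > 0 ? numerator / denominator : 0;
}

// Compares the two ratio signals against the targets from the table.
function checkGoldenSignals(w: CounterWindow): SignalCheck[] {
  const invalidRate = ratio(w.invalid, w.completed);
  const hitRatio = ratio(w.cacheHits, w.cacheHits + w.cacheMisses);
  return [
    { name: "schema_invalid_rate", value: invalidRate, ok: invalidRate <= 0.005 },
    { name: "cache_hit_ratio", value: hitRatio, ok: hitRatio >= 0.35 },
  ];
}
```

Note that an empty window reports a cache hit ratio of 0 and therefore fails the cache target in this sketch; a real evaluator would probably skip windows without traffic.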

2. Metrics

All metrics use OpenTelemetry, exported to Cloud Monitoring (Managed Prometheus) and BigQuery (long-term cost analytics).

2.1 Inference metrics

Metric	Type	Labels	Notes
`melmastoon.ai.inference.latency_ms`	histogram	`tenant_id`, `capability_key`, `provider`, `model`, `cache_hit`, `local`	The single most-queried metric
`melmastoon.ai.inference.tokens.input`	histogram	same
`melmastoon.ai.inference.tokens.output`	histogram	same
`melmastoon.ai.inference.cost_micros`	histogram	same	Provider-billed cost in micro-USD (1e-6 USD); 0 for cache hits and edge inference
`melmastoon.ai.inference.completed_total`	counter	`tenant_id`, `capability_key`, `provider`
`melmastoon.ai.inference.failed_total`	counter	`tenant_id`, `capability_key`, `error_code`
`melmastoon.ai.cache.hit_total`	counter	`tenant_id`, `capability_key`
`melmastoon.ai.cache.miss_total`	counter	same
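
To make the shared label set concrete, the sketch below maps one inference result to the four histogram observations from the table. It is an illustration only: `InferenceResult`, `Observation`, and `toObservations` are hypothetical names, and a real implementation would hand the values to OpenTelemetry histogram instruments instead of returning plain objects.

```typescript
// Hypothetical shape of a finished inference; field names are assumptions.
interface InferenceResult {
  tenantId: string;
  capabilityKey: string;
  provider: string;
  model: string;
  cacheHit: boolean;
  local: boolean;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
  billedMicros: number; // provider-billed cost in micro-USD
}

interface Observation {
  metric: string;
  value: number;
  labels: Record<string, string>;
}

function toObservations(r: InferenceResult): Observation[] {
  // The label set shared by all four inference histograms.
  const labels: Record<string, string> = {
    tenant_id: r.tenantId,
    capability_key: r.capabilityKey,
    provider: r.provider,
    model: r.model,
    cache_hit: String(r.cacheHit),
    local: String(r.local),
  };
  // Cost is recorded as 0 for cache hits and edge (local) inference.
  const costMicros = r.cacheHit || r.local ? 0 : r.billedMicros;
  const prefix = "melmastoon.ai.inference";
  return [
    { metric: `${prefix}.latency_ms`, value: r.latencyMs, labels },
    { metric: `${prefix}.tokens.input`, value: r.tokensIn, labels },
    { metric: `${prefix}.tokens.output`, value: r.tokensOut, labels },
    { metric: `${prefix}.cost_micros`, value: costMicros, labels },
  ];
}
```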

2.2 HITL metrics

Metric	Type	Labels
`melmastoon.ai.hitl.gates.open`	gauge	`tenant_id`, `policy_key`
`melmastoon.ai.hitl.decision_latency_ms`	histogram	`tenant_id`, `policy_key`, `outcome`
`melmastoon.ai.hitl.timeouts_total`	counter	`tenant_id`, `policy_key`
`melmastoon.ai.hitl.outcome_total`	counter	`tenant_id`, `policy_key`, `outcome` (accept/modify/reject)

2.3 Provider/router metrics

Metric	Type	Labels
`melmastoon.ai.provider.unavailable_total`	counter	`provider`, `model`
`melmastoon.ai.provider.fallback_used_total`	counter	`from_provider`, `to_provider`, `capability_key`
`melmastoon.ai.provider.rate_limited_total`	counter	`provider`, `model`
`melmastoon.ai.provider.timeouts_total`	counter	`provider`, `model`

2.4 Eval & quality metrics

Metric	Type	Labels
`melmastoon.ai.eval.score`	gauge	`eval_suite_id`, `prompt_version_id`, `metric` (precision/recall/bleu/rougeL/score)
`melmastoon.ai.output.invalid_total`	counter	`capability_key`, `provider`
`melmastoon.ai.output.repaired_total`	counter	`capability_key`
`melmastoon.ai.moderation.flagged_total`	counter	`side` (input/output), `category`

2.5 Budget metrics

Metric	Type	Labels
`melmastoon.ai.budget.consumed_micros_total`	counter	`tenant_id`, `purpose_id`, `period`
`melmastoon.ai.budget.warnings_total`	counter	`tenant_id`, `purpose_id`
`melmastoon.ai.budget.exceeded_total`	counter	`tenant_id`, `purpose_id`
`melmastoon.ai.budget.utilization`	gauge	`tenant_id`, `purpose_id` (0..1)
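
The utilization gauge and the warning/exceeded counters can be derived from the same two numbers. This is a sketch under stated assumptions: the per-period limit is available as `limitMicros`, and the warning threshold of 0.8 is a hypothetical default, since the source does not state one.

```typescript
type BudgetState = "ok" | "warning" | "exceeded";

// Value for the utilization gauge, clamped to the documented 0..1 range.
function budgetUtilization(consumedMicros: number, limitMicros: number): number {
  if (limitMicros <= 0) return 1; // no budget configured: treat as fully used
  return Math.min(1, Math.max(0, consumedMicros / limitMicros));
}

// Decides which counter (if any) should be incremented.
// warnAt is an assumed threshold, not a documented value.
function budgetState(consumedMicros: number, limitMicros: number, warnAt: number = 0.8): BudgetState {
  if (limitMicros <= 0 || consumedMicros >= limitMicros) return "exceeded";
  return consumedMicros / limitMicros >= warnAt ? "warning" : "ok";
}
```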

3. Tracing

OpenTelemetry tracing is wired through the entire `RunInferenceUseCase` pipeline. The request span tree typically looks like:

ai.complete               (root, http server span)
├─ ai.capability.resolve
├─ ai.prompt.resolve_active
├─ ai.budget.consume
├─ ai.moderation.input
├─ ai.redact.input
├─ ai.cache.lookup        (sets attribute ai.cache.hit=true|false)
├─ ai.router.pick         (sets ai.router.provider, ai.router.model)
├─ ai.provider.call       (sets provider span attributes; SpanKind=CLIENT)
│  └─ provider.<vertex|anthropic|openai|onnx-edge>.invoke
├─ ai.moderation.output
├─ ai.output.validate
├─ ai.hitl.maybe_open
├─ ai.provenance.persist
├─ ai.outbox.append
└─ ai.bigquery.stream

Every span carries the baggage entries `tenant_id`, `capability_key`, `request_id`, `correlation_id`, and `causation_id`. The W3C `traceparent` header from the inbound request propagates to provider HTTP calls so that Vertex traces can be correlated.
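
For reference, the propagated trace context header has the W3C form `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`. The sketch below parses an inbound header and builds the outbound one for a provider call. It handles version `00` only, and the function names are illustrative; in practice the OpenTelemetry propagator does this work.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  flags: string;    // 2 lowercase hex characters, "01" means sampled
}

const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers or all-zero ids, which the spec forbids.
function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return null;
  const [, traceId = "", parentId = "", flags = ""] = match;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, flags };
}

// The outbound header keeps the trace id and flags; the parent id becomes
// the id of the client span that wraps the provider call.
function outboundTraceParent(inbound: TraceParent, clientSpanId: string): string {
  return `00-${inbound.traceId}-${clientSpanId}-${inbound.flags}`;
}
```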

Sampling: head-based at 10% by default; tail-based sampling keeps 100% of traces in which any of `error=true`, `ai.budget.exceeded=true`, `ai.hitl.opened=true`, or `ai.output.invalid=true` is set.
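
The sampling rule can be read as a single keep/drop decision per finished trace. The sketch below is an illustration, not the collector configuration: `headSampled` uses a simple deterministic bucket derived from the trace id (an assumption, not the OpenTelemetry SDK's exact algorithm), and `keepTrace` applies the four tail-sampling attributes on top.

```typescript
type SpanAttributes = Record<string, string | number | boolean>;

// Attributes that force a trace to be kept, per the sampling rule.
const KEEP_IF_TRUE = ["error", "ai.budget.exceeded", "ai.hitl.opened", "ai.output.invalid"];

// Deterministic head decision: map the first 32 bits of the trace id
// to [0, 1) and keep the trace if it falls below the ratio.
function headSampled(traceId: string, ratio: number = 0.1): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

// Tail decision: keep if any span carries one of the flags, else fall
// back to the head decision.
function keepTrace(traceId: string, spans: SpanAttributes[]): boolean {
  const flagged = spans.some((attrs) => KEEP_IF_TRUE.some((key) => attrs[key] === true));
  return flagged || headSampled(traceId);
}
```

Tail-based rules only work if spans are buffered until the trace completes, so a decision like this would normally run in a collector rather than in the service process.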

4. Logging

Structured JSON logs (Pino) with mandatory fields:

Field	Source
`timestamp`, `level`, `message`	logger
`service`	`ai-orchestrator-service`
`version`	git sha + semver
`tenant_id`, `request_id`, `correlation_id`, `causation_id`	request context
`capability_key`, `prompt_version_id`, `provider`, `model`	inference context
`tokens_in`, `tokens_out`, `cost_micros`, `latency_ms`, `cache_hit`, `local`	result context
`error_code`	on failure

Logs never contain raw user inputs or model outputs. Instead, they contain `input_hash`, `redacted_input_hash`, and `output_hash`. The forensics flow (§7) retrieves the redacted text from the Postgres audit row, never from logs.
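
A guard in tests or in a log hook can enforce both rules: the mandatory fields are present and no raw text slips in. The sketch below checks only the fields every record can carry (logger, service, and request context); the forbidden key names are hypothetical examples, since the source does not list them.

```typescript
// Fields from the table that every record is expected to carry.
const MANDATORY_FIELDS = [
  "timestamp", "level", "message", "service", "version",
  "tenant_id", "request_id", "correlation_id", "causation_id",
];

// Hypothetical key names that would indicate raw text in a log record.
const FORBIDDEN_FIELDS = ["input", "output", "raw_input", "raw_output", "prompt"];

// Returns a list of problems; an empty list means the record is acceptable.
function validateLogRecord(record: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of MANDATORY_FIELDS) {
    if (record[field] === undefined || record[field] === null) {
      problems.push(`missing:${field}`);
    }
  }
  for (const field of FORBIDDEN_FIELDS) {
    if (field in record) problems.push(`forbidden:${field}`);
  }
  return problems;
}
```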

Log retention: 30 days hot in Cloud Logging; archived to GCS (CMEK) for 1 year.

5. BigQuery: token cost dashboard

A streaming insert pipeline writes one row per inference into `ai_calls_fact` (see DATA_MODEL.md). The dashboard tiles run the following queries:

Tile	Query summary
Daily spend by tenant	`SELECT event_date, tenant_id, SUM(cost_micros)/1e6 AS usd FROM ai_calls_fact WHERE event_date >= CURRENT_DATE() - 7 GROUP BY 1, 2 ORDER BY event_date DESC, usd DESC`
Spend by capability	`SELECT capability_key, SUM(cost_micros)/1e6 FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1`
Cache effectiveness	`SELECT capability_key, COUNTIF(cache_hit)/COUNT(*) AS hit_ratio FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1`
Provider mix	`SELECT provider, model, COUNT(*) FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1, 2`
Top expensive prompts	`SELECT prompt_version_id, AVG(cost_micros) FROM ai_calls_fact WHERE event_date >= CURRENT_DATE() - 1 GROUP BY 1 ORDER BY 2 DESC LIMIT 20`
Edge ratio	`SELECT capability_key, COUNTIF(local)/COUNT(*) FROM ai_calls_fact WHERE event_date = CURRENT_DATE() GROUP BY 1`

The dashboard (Looker Studio plus a Grafana mirror) is open to platform-engineering and ai-engineering. A per-tenant slice is exposed read-only to tenant admins via the backoffice (the same queries, gated by RLS via a dataset-level row policy).
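
When validating dashboard numbers against a small export, it can help to recompute two of the tiles in application code. The sketch below mirrors the spend and cache-effectiveness queries over in-memory rows. The `AiCallRow` type includes only the columns those two queries use, and the function names are illustrative.

```typescript
// Subset of ai_calls_fact columns used by the two tiles.
interface AiCallRow {
  tenant_id: string;
  capability_key: string;
  cost_micros: number;
  cache_hit: boolean;
}

// Mirrors SUM(cost_micros)/1e6 grouped by tenant_id.
function spendUsdByTenant(rows: AiCallRow[]): Record<string, number> {
  const micros: Record<string, number> = {};
  for (const row of rows) {
    micros[row.tenant_id] = (micros[row.tenant_id] ?? 0) + row.cost_micros;
  }
  const usd: Record<string, number> = {};
  for (const tenant of Object.keys(micros)) {
    usd[tenant] = (micros[tenant] ?? 0) / 1e6; // micro-USD to USD
  }
  return usd;
}

// Mirrors COUNTIF(cache_hit)/COUNT(*) grouped by capability_key.
function cacheHitRatioByCapability(rows: AiCallRow[]): Record<string, number> {
  const hits: Record<string, number> = {};
  const totals: Record<string, number> = {};
  for (const row of rows) {
    totals[row.capability_key] = (totals[row.capability_key] ?? 0) + 1;
    hits[row.capability_key] = (hits[row.capability_key] ?? 0) + (row.cache_hit ? 1 : 0);
  }
  const ratios: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    ratios[key] = (hits[key] ?? 0) / (totals[key] ?? 1);
  }
  return ratios;
}
```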

6. SLOs and alerts

SLO	Target	Burn-rate alert (Cloud Monitoring)
`/ai/complete` availability	99.9% rolling 28 d	14× burn over 1 h or 6× over 6 h → page
`/ai/complete` p95 cloud latency	≤ 2 500 ms 28 d	5× over 1 h → page
Output schema validity	≥ 99.5% 7 d	< 99% over 1 h → page
Cache hit ratio	≥ 35% 7 d	< 25% over 24 h → ticket
Token spend variance	≤ 30% above 7-d trailing avg per tenant per day	exceed → page on-call
HITL p95 decision latency (auto-rejectable)	≤ 2 h	exceed → ticket

All alerts route to the PagerDuty service `ai-orchestrator`; routing rules also send budget alerts to the finance ops Slack.
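
The burn-rate column expresses how fast the error budget is being consumed relative to plan: burn rate = observed error ratio / (1 − SLO target). The sketch below applies the availability row (14× over 1 h or 6× over 6 h). `WindowCounts` and the function names are assumptions for illustration; in the actual setup the alert policy is evaluated by Cloud Monitoring.

```typescript
// Bad and total request counts for one lookback window.
interface WindowCounts {
  bad: number;
  total: number;
}

// Burn rate 1 means the budget would last exactly the SLO period.
function burnRate(w: WindowCounts, errorBudget: number): number {
  if (w.total === 0) return 0; // no traffic, nothing burned
  return w.bad / w.total / errorBudget;
}

// Availability SLO of 99.9% leaves an error budget of 0.1% of requests.
function availabilityShouldPage(oneHour: WindowCounts, sixHours: WindowCounts): boolean {
  const errorBudget = 0.001;
  return burnRate(oneHour, errorBudget) >= 14 || burnRate(sixHours, errorBudget) >= 6;
}
```

A sustained 14× burn corresponds to about 1.4% of requests failing, which would exhaust the 28-day budget in roughly two days.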

7. Forensics flow ("what just happened?")

For any failed or suspicious inference:

1. From the trace ID in the user's incident report, retrieve the root span in Cloud Trace.
2. Read the root span's `request_id` attribute; it is the `inference_requests.id`.
3. Fetch from Postgres the `inference_requests` row, the `inference_results` row, the `provenances` row, and the `moderation_audit` rows.
4. Reconstruct the redacted input and output. The raw input is never available; only redacted forms are persisted.
5. Replay the call against the same prompt version and model deployment via `POST /api/v1/ai/replay/:requestId` (admin-only, dry-run; emits an `ai.replay.executed` event for audit).
6. If the replay matches the original result, the failure is reproducible; if not, the likely root cause is provider-side flakiness, and the corresponding `ai.provider.flaky_total` counter increments.
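
The final comparison in the flow can be sketched as a small pure function. The source does not define what "matches" means, so this sketch assumes a replay matches when both the error code and the output hash are identical; the `InferenceOutcome` type and `classifyReplay` name are hypothetical.

```typescript
// Minimal view of an inference outcome, as could be read from the
// inference_results row (field names are assumptions).
interface InferenceOutcome {
  errorCode: string | null;
  outputHash: string | null;
}

type ReplayVerdict = "reproducible" | "provider-flaky";

// Assumed matching rule: same error code and same output hash.
function classifyReplay(original: InferenceOutcome, replay: InferenceOutcome): ReplayVerdict {
  const sameError = original.errorCode === replay.errorCode;
  const sameOutput = original.outputHash === replay.outputHash;
  return sameError && sameOutput ? "reproducible" : "provider-flaky";
}
```

Exact hash equality is a strict criterion for sampled model output; a looser comparison, such as the same validation error, may fit better for non-deterministic models.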

8. Health probes

Probe	What it checks
`GET /healthz` (liveness)	Process up; event loop responsive within 200 ms
`GET /readyz` (readiness)	Postgres reachable; Memorystore reachable; Vertex AI reachable (5 s budget); KMS reachable for manifest signer
`GET /degradedz`	Returns the active degradation level (none / cache-only / deterministic-fallback / read-only)

The `/degradedz` endpoint feeds the global status banner shown in the backoffice.
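
As an illustration of the probe semantics, the sketch below aggregates dependency checks into a readiness result and wraps the degradation level for the banner. The response shapes and the 200/503 status codes are assumptions based on common probe conventions; the source does not specify them.

```typescript
// Degradation levels as listed in the probe table.
type DegradationLevel = "none" | "cache-only" | "deterministic-fallback" | "read-only";

interface DependencyCheck {
  name: string;       // e.g. "postgres", "memorystore", "vertex-ai", "kms"
  reachable: boolean; // result of the check within its time budget
}

interface ReadinessResult {
  status: 200 | 503;
  failing: string[];
}

// Ready only when every dependency is reachable.
function readyz(checks: DependencyCheck[]): ReadinessResult {
  const failing = checks.filter((c) => !c.reachable).map((c) => c.name);
  return { status: failing.length === 0 ? 200 : 503, failing };
}

// Payload a status banner could consume.
function degradedz(level: DegradationLevel): { level: DegradationLevel; degraded: boolean } {
  return { level, degraded: level !== "none" };
}
```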
