Observability

:::info Source
Sourced from services/assignment-service/OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry

1. Signals

The service emits three telemetry signals (traces, metrics, and logs) via the OpenTelemetry SDK, exported over OTLP to SigNoz:

| Signal | Transport | Retention |
| --- | --- | --- |
| Traces | OTLP/HTTP | 14 days hot, 90 days warm |
| Metrics | OTLP/HTTP | 13 months |
| Logs | OTLP/HTTP (structured JSON) | 30 days hot, 18 months archive (S3) |

Service name: assignment-service. Always tagged with:

service.version, service.namespace="ghasi"
deployment.environment
tenant.id (where known — omitted for cross-tenant ops)
slice (S4|S5) for progressive-rollout dashboards
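
As a rough illustration of the tagging rule above, the sketch below builds the common attribute set and leaves `tenant.id` out when the tenant is unknown. The function name `common_attributes` and its parameters are hypothetical, not taken from the service code.

```python
from typing import Optional


def common_attributes(
    version: str,
    environment: str,
    slice_name: str,
    tenant_id: Optional[str] = None,
) -> dict[str, str]:
    """Build the attribute set listed above; tenant.id is left out when unknown."""
    if slice_name not in ("S4", "S5"):
        raise ValueError(f"unexpected slice: {slice_name!r}")
    attrs = {
        "service.name": "assignment-service",
        "service.version": version,
        "service.namespace": "ghasi",
        "deployment.environment": environment,
        "slice": slice_name,
    }
    if tenant_id is not None:
        # Cross-tenant operations carry no tenant.id key at all (not an empty string).
        attrs["tenant.id"] = tenant_id
    return attrs
```

Omitting the key, rather than sending an empty value, keeps cross-tenant operations from showing up as a phantom tenant in per-tenant dashboards.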

2. Key Spans

| Span | Start | End | Attributes |
| --- | --- | --- | --- |
| `http.server` | request received | response sent | standard |
| `assignment.create` | handler start | commit | `assignment.id`, `tenant.id`, `created_by` |
| `assignment.activate` | handler start | commit | `assignment.id`, `estimated_window_count` |
| `materializer.run` | job start | commit | `assignment.id`, `windows.created`, `horizon.until` |
| `materializer.batch` | batch start | batch commit | `batch.size` |
| `window.transition` | before update | after update | `from_state`, `to_state`, `window.id` |
| `overdue.sweep` | sweeper tick | tick done | `windows.transitioned` |
| `closed_missed.sweep` | same as `overdue.sweep` | same as `overdue.sweep` | same as `overdue.sweep` |
| `escalation.fire` | action eval | event published | `window.id`, `level`, `action.kind` |
| `reminder.dispatch` | eval | publish | `window.id`, `trigger.hash` |
| `ai.gateway.call` (child) | Gateway client call | response | `prompt.id`, `prompt.version`, `cost.micro_usd` |
| `outbox.publish` | read batch | ack from NATS | `subject`, `batch.size` |
| `consumer.handle` | onMessage | ack | `subject`, `ce.id`, `attempt` |

Every span carries tenant.id where it is known, and the service continues the inbound traceparent rather than starting a new root trace.
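
To make "continued, no new root" concrete, here is a minimal sketch that parses a W3C `traceparent` header and derives a child span context reusing the caller's trace id. In practice the OpenTelemetry SDK's propagators do this. The helper `continue_trace` and the returned dictionary shape are hypothetical and only illustrate the rule.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def continue_trace(traceparent: str) -> dict[str, str]:
    """Return a child span context that reuses the caller's trace id."""
    match = _TRACEPARENT.match(traceparent)
    if match is None:
        raise ValueError("malformed traceparent header")
    _version, trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("all-zero trace or span id is invalid")
    return {
        "trace_id": trace_id,              # unchanged: no new root is started
        "parent_span_id": parent_span_id,  # the caller's span becomes our parent
        "span_id": secrets.token_hex(8),   # fresh 16-hex-character id for this span
        "trace_flags": flags,
    }
```

For example, `continue_trace("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` keeps the trace id `4bf92f3577b34da6a3ce929d0e0e4736` and records `00f067aa0ba902b7` as the parent span.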

3. Metrics (RED / USE)

Histograms use base-2 exponential buckets.
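
The source does not state a scale factor, so the following is only a sketch of how a base-2 exponential bucket index can be computed, assuming the OpenTelemetry exponential histogram convention where bucket `i` covers `(base**i, base**(i+1)]` and `base = 2**(2**-scale)`. The function name `bucket_index` is hypothetical.

```python
import math


def bucket_index(value: float, scale: int = 0) -> int:
    """Index of the bucket (base**i, base**(i+1)] holding value, with base = 2**(2**-scale)."""
    if value <= 0:
        raise ValueError("only positive values map to an exponential bucket")
    # log_base(value) equals log2(value) * 2**scale; upper bounds are inclusive.
    return math.ceil(math.log2(value) * 2**scale) - 1
```

At scale 0 the bucket boundaries are plain powers of two, so a 4-second observation falls in bucket 1, which covers (2, 4], and a 5-second observation falls in bucket 2, which covers (4, 8]. Higher scales subdivide each power-of-two range further.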

3.1 Request-level

assignment_http_requests_total{route, status, tenant_id} — counter
assignment_http_duration_seconds{route} — histogram
assignment_http_inflight{route} — gauge

3.2 Business

assignment_created_total{tenant_id, ai_suggested} — counter
assignment_activated_total{tenant_id} — counter
assignment_window_opened_total{tenant_id, assignment_id} — counter
assignment_window_state_transitions_total{from, to, tenant_id} — counter
assignment_window_open_count{tenant_id} — gauge (sampled every 60 s)
assignment_window_overdue_count{tenant_id} — gauge
assignment_compliance_rate{tenant_id, assignment_id} — gauge (percent)
assignment_materializer_duration_seconds — histogram
assignment_materializer_windows_created_total{tenant_id} — counter
assignment_escalation_fired_total{level, tenant_id} — counter
assignment_reminder_sent_total{tenant_id} — counter

3.3 Saga health

assignment_saga_lag_seconds{event} — histogram (wall-clock delta between publishing event and downstream observable effect)
assignment_saga_retries_total{event} — counter
assignment_saga_dlq_total{event} — counter (must be 0 in steady state)

3.4 AI

assignment_ai_suggest_total{tenant_id, outcome} (outcome=accepted|rejected|expired|invalid)
assignment_ai_cost_micro_usd_total{tenant_id} — counter
assignment_ai_latency_seconds — histogram

3.5 Infra

assignment_db_query_duration_seconds{op} — histogram
assignment_db_connections{state} — gauge
assignment_outbox_backlog{tenant_id} — gauge
assignment_nats_publish_total{subject} — counter
assignment_nats_ack_lag_seconds{consumer} — histogram

4. Logs

Structured JSON, one line per event, with:

{
  "ts": "2026-04-15T10:22:31.102Z",
  "level": "info",
  "msg": "assignment.activated",
  "service": "assignment-service",
  "version": "1.7.3",
  "env": "prod",
  "tenant_id": "tnt_…",
  "trace_id": "01HXYZ…",
  "span_id": "abc123…",
  "actor": "usr_…",
  "assignment_id": "asn_…"
}

Log levels:

error — handler error, unrecoverable
warn — retryable, degraded
info — state transitions, outbound events
debug — only when LOG_LEVEL=debug (disabled in prod)

PII redaction: names and email addresses must never appear in logs. Tenant IDs and user IDs are allowed.
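
A minimal sketch of the logging rules above: one JSON object per line, with fields that could carry names or emails dropped before serialization. The denylist `REDACTED_KEYS` and the helper `format_log` are hypothetical. The source only states the policy, not how it is enforced.

```python
import json
from datetime import datetime, timezone

# Hypothetical denylist; the source only says names and emails must not be logged.
REDACTED_KEYS = {"name", "email"}


def format_log(level: str, msg: str, **fields: object) -> str:
    """Render one log event as a single JSON line, dropping denylisted fields."""
    now = datetime.now(timezone.utc)
    record = {
        # Millisecond UTC timestamp with a trailing Z, matching the example record.
        "ts": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "msg": msg,
        "service": "assignment-service",
    }
    record.update({k: v for k, v in fields.items() if k not in REDACTED_KEYS})
    return json.dumps(record, ensure_ascii=False)
```

Calling `format_log("info", "assignment.activated", tenant_id="tnt_1", email="a@example.com")` yields a line that contains `tenant_id` but no `email` key. A key-based denylist does not catch PII embedded in free-text values, so it complements rather than replaces review of what gets logged.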

5. Dashboards

Pre-built dashboards in SigNoz under assignment-service/:

Overview — RED + saga health + DLQ count
Tenant Drill-down — picks tenant; shows compliance rate heatmap, window state distribution over time
Materializer — job runtime, batches/hour, windows/hour
Escalation & Reminders — fires/hour, top-10 targets
AI Suggest — latency p50/p95/p99, cost/day, acceptance rate, golden-eval pass rate
Saga Integrity — open → in_progress → completed conversion funnel, overdue→closed_missed rate

6. SLOs

| SLI | Target | Alert at |
| --- | --- | --- |
| HTTP success rate (non-5xx) | ≥ 99.9% | ≤ 99.5% (5m) |
| HTTP p95 (create) | ≤ 250 ms | > 400 ms (10m) |
| Window opened → enrollment.created freshness p95 | ≤ 2 s | > 5 s (10m) |
| DLQ | 0 msgs | > 0 |
| Compliance report p95 | ≤ 1.5 s | > 3 s (10m) |
| AI suggest p95 | ≤ 8 s | > 15 s (10m) |

Error budget: 0.1% / month for availability.
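
To give the 0.1% budget a concrete size, the sketch below converts an availability target into allowed failed requests and, as a rough time-based equivalent, minutes of full outage. Both helper names are hypothetical, and the 30-day month is an assumption. The SLI itself is request-based (non-5xx rate), so the minutes figure is only an approximation.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the budget tolerates for a request-based SLO."""
    return (1.0 - slo_target) * total_requests


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of total outage that would consume the whole budget."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes
```

With a 99.9% target, a 30-day window allows roughly 43.2 minutes of complete unavailability, or about 1,000 failed requests per million served.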

7. Alerts

| Alert | Severity | Runbook |
| --- | --- | --- |
| DLQ > 0 | P1 | `rb/assignment/dlq.md` |
| Outbox backlog > 10k for 10m | P1 | `rb/assignment/outbox.md` |
| Materializer failed > 3x | P2 | `rb/assignment/materializer.md` |
| Overdue sweep stalled (no transitions in 30m) | P2 | `rb/assignment/sweeper.md` |
| AI suggest error rate > 5% | P3 | `rb/assignment/ai.md` |
| RLS violation exception | P0 | `rb/common/tenant-leak.md` |
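
Several of these alerts are "threshold held for a duration" conditions, such as outbox backlog above 10k for 10 minutes. The sketch below shows one way such a condition could be evaluated over gauge samples. The real rules live in the alerting backend, and `breached_for` with its parameters is a hypothetical illustration.

```python
from typing import Sequence, Tuple


def breached_for(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    duration_s: float,
    now: float,
) -> bool:
    """True when samples cover the last duration_s seconds and all of them exceed threshold.

    samples is a time-ordered sequence of (unix_timestamp, value) pairs.
    """
    window_start = now - duration_s
    if not samples or samples[0][0] > window_start:
        return False  # not enough history to prove the condition held that long
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value > threshold for value in recent)
```

Requiring the whole window to be covered avoids paging on a single high sample right after a restart. A single dip to or below the threshold inside the window resets the condition.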

8. Correlation IDs

Every response carries the X-Trace-Id and X-Correlation-Id headers. The audit log, DB logs, outbox rows, and emitted events all carry the same traceId/correlationId, so a request can be stitched together end to end.

9. Health Endpoints

| Endpoint | Purpose | Checks |
| --- | --- | --- |
| `/api/v1/healthz` | K8s liveness | process up, event loop responsive |
| `/api/v1/readyz` | K8s readiness | DB reachable, NATS reachable, outbox publisher lag < threshold |
| `/metrics` | Prometheus scrape (optional; primary is OTLP push) | all metrics above |
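
The readiness semantics above can be sketched as an aggregation over named dependency checks: the endpoint reports ready only when every check passes. The `readiness` helper and the check names are hypothetical, and the outbox lag threshold is not specified in the source.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; report 200 only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency, not a crashed probe.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

A caller might pass `{"db": ping_db, "nats": ping_nats, "outbox_lag": lag_below_threshold}`, where each value is a zero-argument callable supplied by the service. Liveness deliberately checks none of these, so a database outage removes the pod from rotation without restarting it.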

10. Trace Sampling

Head sampling: 100% of errors, 20% uniform sampling otherwise. Tail-based sampling at the OTel collector downsamples further.
Full capture always for: assignment.activate, assignment.suggest, GDPR handlers.
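
The policy above can be modeled as a small decision function. This is a sketch of the stated rules, not the service's sampler: `should_sample` is a hypothetical name, the deterministic trace-id ratio check is an assumed technique for the 20% share, and the GDPR handler span names are not given in the source, so they are left out of the full-capture set.

```python
# Span names the source lists for full capture; GDPR handler names are not specified.
ALWAYS_SAMPLE = {"assignment.activate", "assignment.suggest"}
UNIFORM_RATIO = 0.20


def should_sample(span_name: str, trace_id: str, is_error: bool = False) -> bool:
    """Keep errors and full-capture spans; otherwise keep a fixed share of traces."""
    if is_error or span_name in ALWAYS_SAMPLE:
        return True
    # Deterministic ratio sampling: compare the low 64 bits of the trace id
    # against the ratio, so the same trace always gets the same decision.
    low_bits = int(trace_id[-16:], 16)
    return low_bits < int(UNIFORM_RATIO * 2**64)
```

Deriving the decision from the trace id, rather than a random draw per span, keeps a trace from being half-sampled across services. Note that a pure head sampler cannot know at span start that an error will occur, so the 100% error capture presumably relies on the collector's tail-based stage or on re-evaluating the decision when the span ends.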

11. Synthetic Checks

A synthetic create/activate/list/archive flow runs every 5 minutes in staging against a dedicated synthetic tenant, tnt_synthetic_assignment.
Each run records a latency and correctness verdict.
