Observability
:::info Source
Sourced from services/assignment-service/OBSERVABILITY.md in the documentation repo.
:::
Companion: 15 Observability & Telemetry
1. Signals
The service emits the three pillars of telemetry via the OpenTelemetry SDK, exported over OTLP to SigNoz:
| Signal | Transport | Retention |
|---|---|---|
| Traces | OTLP/HTTP | 14 days hot, 90 days warm |
| Metrics | OTLP/HTTP | 13 months |
| Logs | OTLP/HTTP (structured JSON) | 30 days hot, 18 months archive (S3) |
Service name: assignment-service. Always tagged with:
- service.version
- service.namespace="ghasi"
- deployment.environment
- tenant.id (where known; omitted for cross-tenant ops)
- slice (S4|S5) for progressive-rollout dashboards
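The tagging rule above can be sketched as a small helper. This is only an illustration: `build_common_tags` and `ALLOWED_SLICES` are hypothetical names, and the real service presumably sets these attributes through its OpenTelemetry SDK configuration rather than a hand-built dictionary.

```python
from typing import Dict, Optional

# Slice values named in this document; anything else is rejected.
ALLOWED_SLICES = {"S4", "S5"}


def build_common_tags(
    version: str,
    environment: str,
    slice_name: str,
    tenant_id: Optional[str] = None,
) -> Dict[str, str]:
    """Return the tags that every signal from assignment-service carries."""
    if slice_name not in ALLOWED_SLICES:
        raise ValueError(f"unknown slice: {slice_name!r}")
    tags = {
        "service.name": "assignment-service",
        "service.version": version,
        "service.namespace": "ghasi",
        "deployment.environment": environment,
        "slice": slice_name,
    }
    # tenant.id is attached only where known; cross-tenant ops omit it.
    if tenant_id is not None:
        tags["tenant.id"] = tenant_id
    return tags
```

Omitting the key entirely for cross-tenant operations, instead of sending an empty string, keeps per-tenant dashboards from growing a blank tenant bucket.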
2. Key Spans
| Span | Start | End | Attributes |
|---|---|---|---|
| http.server | request received | response sent | standard |
| assignment.create | handler start | commit | assignment.id, tenant.id, created_by |
| assignment.activate | handler start | commit | assignment.id, estimated_window_count |
| materializer.run | job start | commit | assignment.id, windows.created, horizon.until |
| materializer.batch | batch start | batch commit | batch.size |
| window.transition | before update | after update | from_state, to_state, window.id |
| overdue.sweep | sweeper tick | tick done | windows.transitioned |
| closed_missed.sweep | sweeper tick | tick done | windows.transitioned |
| escalation.fire | action eval | event published | window.id, level, action.kind |
| reminder.dispatch | eval | publish | window.id, trigger.hash |
| ai.gateway.call (child) | Gateway client call | response | prompt.id, prompt.version, cost.micro_usd |
| outbox.publish | read batch | ack from NATS | subject, batch.size |
| consumer.handle | onMessage | ack | subject, ce.id, attempt |
Every span carries tenant.id where possible, and the inbound traceparent is continued rather than starting a new root.
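Continuing an inbound traceparent can be sketched as follows. The header layout (version, 32-hex trace id, 16-hex parent id, 2-hex flags) comes from the W3C Trace Context format; `SpanContext` and `continue_trace` are hypothetical names, and in practice the OpenTelemetry propagator handles this. The sketch accepts only the four-field form and falls back to a new root solely when no usable header arrives.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    flags: str


def continue_trace(traceparent: Optional[str]) -> SpanContext:
    """Create a span context that continues the inbound trace if one exists."""
    match = _TRACEPARENT.match(traceparent or "")
    new_span_id = secrets.token_hex(8)  # 16 hex characters
    if (
        match
        and match.group(1) != "ff"        # version ff is invalid
        and match.group(2) != "0" * 32    # all-zero trace id is invalid
        and match.group(3) != "0" * 16    # all-zero parent id is invalid
    ):
        # Same trace id, new span id, inbound span becomes the parent.
        return SpanContext(match.group(2), new_span_id, match.group(3), match.group(4))
    # No usable inbound context: this is the only case that starts a root.
    # The "01" (sampled) flag here is an assumption of this sketch.
    return SpanContext(secrets.token_hex(16), new_span_id, None, "01")
```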
3. Metrics (RED / USE)
Histograms use base-2 exponential buckets.
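To make the bucket layout concrete, here is a minimal sketch that maps a measurement to a power-of-two bucket. It assumes upper-inclusive buckets of the form (2^k, 2^(k+1)], which matches scale 0 of the OpenTelemetry exponential histogram; the scale actually configured for this service is not stated here, and `bucket_index` and `bucket_bounds` are hypothetical helper names.

```python
import math
from typing import Tuple


def bucket_index(value: float) -> int:
    """Return k such that value falls in the bucket (2**k, 2**(k+1)]."""
    if value <= 0 or not math.isfinite(value):
        raise ValueError("value must be positive and finite")
    # frexp gives value == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(value)
    # An exact power of two lies on an inclusive upper boundary,
    # so it belongs to the bucket below.
    return exponent - 2 if mantissa == 0.5 else exponent - 1


def bucket_bounds(index: int) -> Tuple[float, float]:
    """Return the (exclusive lower, inclusive upper) bounds of a bucket."""
    return (2.0 ** index, 2.0 ** (index + 1))
```

Under these assumptions a 0.3 s request lands in the bucket (0.25, 0.5], so relative resolution stays constant across fast and slow requests.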
3.1 Request-level
- assignment_http_requests_total{route, status, tenant_id}: counter
- assignment_http_duration_seconds{route}: histogram
- assignment_http_inflight{route}: gauge
3.2 Business
- assignment_created_total{tenant_id, ai_suggested}: counter
- assignment_activated_total{tenant_id}: counter
- assignment_window_opened_total{tenant_id, assignment_id}: counter
- assignment_window_state_transitions_total{from, to, tenant_id}: counter
- assignment_window_open_count{tenant_id}: gauge (sampled every 60 s)
- assignment_window_overdue_count{tenant_id}: gauge
- assignment_compliance_rate{tenant_id, assignment_id}: gauge (percent)
- assignment_materializer_duration_seconds: histogram
- assignment_materializer_windows_created_total{tenant_id}: counter
- assignment_escalation_fired_total{level, tenant_id}: counter
- assignment_reminder_sent_total{tenant_id}: counter
3.3 Saga health
- assignment_saga_lag_seconds{event}: histogram (wall-clock delta between publishing the event and its downstream observable effect)
- assignment_saga_retries_total{event}: counter
- assignment_saga_dlq_total{event}: counter (must be 0 in steady state)
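The saga lag measurement reduces to a timestamp subtraction, sketched below. The function name and the ISO-8601 string inputs are assumptions for illustration, and clamping negative deltas to zero (to absorb clock skew between hosts) is a choice of this sketch, not something this document specifies.

```python
from datetime import datetime


def _parse(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z", so normalise it first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def saga_lag_seconds(published_at: str, observed_at: str) -> float:
    """Wall-clock delta between publishing an event and its downstream effect."""
    delta = (_parse(observed_at) - _parse(published_at)).total_seconds()
    # Clock skew between hosts can yield a small negative delta; clamp to 0.
    return max(0.0, delta)
```

The resulting value would then be recorded into the assignment_saga_lag_seconds histogram with the event label.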
3.4 AI
- assignment_ai_suggest_total{tenant_id, outcome} (outcome=accepted|rejected|expired|invalid)
- assignment_ai_cost_micro_usd_total{tenant_id}: counter
- assignment_ai_latency_seconds: histogram
3.5 Infra
- assignment_db_query_duration_seconds{op}: histogram
- assignment_db_connections{state}: gauge
- assignment_outbox_backlog{tenant_id}: gauge
- assignment_nats_publish_total{subject}: counter
- assignment_nats_ack_lag_seconds{consumer}: histogram
4. Logs
Structured JSON, one line per event, with:
{
"ts": "2026-04-15T10:22:31.102Z",
"level": "info",
"msg": "assignment.activated",
"service": "assignment-service",
"version": "1.7.3",
"env": "prod",
"tenant_id": "tnt_…",
"trace_id": "01HXYZ…",
"span_id": "abc123…",
"actor": "usr_…",
"assignment_id": "asn_…"
}
Log levels:
- error: handler error, unrecoverable
- warn: retryable, degraded
- info: state transitions, outbound events
- debug: only when LOG_LEVEL=debug (disabled in prod)
PII redaction: names and emails must never appear in logs. Tenant IDs and user IDs are allowed.
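A one-line JSON log record with basic redaction could be produced along these lines. This is a sketch under assumptions: `format_log_line` is a hypothetical helper, and the `PII_KEYS` deny-list is invented for illustration, since the document only says that names and emails must stay out of logs.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict

# Hypothetical deny-list of field names that could carry names or emails.
PII_KEYS = {"name", "full_name", "email"}


def format_log_line(level: str, msg: str, ts: datetime, **fields: Any) -> str:
    """Serialise one event as a single JSON line, dropping PII fields."""
    stamp = ts.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    record: Dict[str, Any] = {
        "ts": stamp.replace("+00:00", "Z"),
        "level": level,
        "msg": msg,
        "service": "assignment-service",
    }
    # Tenant ids and user ids pass through; deny-listed keys are dropped.
    record.update({k: v for k, v in fields.items() if k not in PII_KEYS})
    # json.dumps escapes newlines inside values, so the result is one line.
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))
```

A key-based deny-list only catches PII in known fields; free-text values would need separate handling.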
5. Dashboards
Pre-built dashboards in SigNoz under assignment-service/:
- Overview — RED + saga health + DLQ count
- Tenant Drill-down — picks tenant; shows compliance rate heatmap, window state distribution over time
- Materializer — job runtime, batches/hour, windows/hour
- Escalation & Reminders — fires/hour, top-10 targets
- AI Suggest — latency p50/p95/p99, cost/day, acceptance rate, golden-eval pass rate
- Saga Integrity — open → in_progress → completed conversion funnel, overdue→closed_missed rate
6. SLOs
| SLI | Target | Alert at |
|---|---|---|
| HTTP success rate (non-5xx) | ≥ 99.9% | ≤ 99.5% (5m) |
| HTTP p95 (create) | ≤ 250 ms | > 400 ms (10m) |
| Window opened → enrollment.created freshness p95 | ≤ 2 s | > 5 s (10m) |
| DLQ | 0 msgs | > 0 |
| Compliance report p95 | ≤ 1.5 s | > 3 s (10m) |
| AI suggest p95 | ≤ 8 s | > 15 s (10m) |
Error budget: 0.1% / month for availability.
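As a worked example of the arithmetic, the sketch below converts the 0.1% budget into minutes and into a consumed share. It assumes a 30-day month and a request-based budget measured against the HTTP success rate; both function names are hypothetical. Fractions are used so the results are exact.

```python
from fractions import Fraction


def error_budget_minutes(budget: Fraction, days_in_month: int) -> Fraction:
    """Minutes of full unavailability the budget allows in one month."""
    return budget * days_in_month * 24 * 60


def budget_consumed(failed: int, total: int, budget: Fraction) -> Fraction:
    """Share of the budget already used, where 1 means fully spent."""
    if total == 0:
        return Fraction(0)
    return Fraction(failed, total) / budget


BUDGET = Fraction(1, 1000)  # 0.1% per month

# 0.001 * 30 days * 24 h * 60 min = 43.2 minutes
print(float(error_budget_minutes(BUDGET, 30)))
# 50 failures in 100,000 requests is a 0.05% error rate, half the budget
print(float(budget_consumed(50, 100_000, BUDGET)))
```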
7. Alerts
| Alert | Severity | Runbook |
|---|---|---|
| DLQ > 0 | P1 | rb/assignment/dlq.md |
| Outbox backlog > 10k for 10m | P1 | rb/assignment/outbox.md |
| Materializer failed > 3x | P2 | rb/assignment/materializer.md |
| Overdue sweep stalled (no transitions in 30m) | P2 | rb/assignment/sweeper.md |
| AI suggest error rate > 5% | P3 | rb/assignment/ai.md |
| RLS violation exception | P0 | rb/common/tenant-leak.md |
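Conditions such as "Outbox backlog > 10k for 10m" combine a threshold with a duration. The sketch below shows one way to evaluate that, assuming the alert fires only if every sample in the trailing window exceeds the threshold; the actual SigNoz rule semantics may differ, and `breached_for` is a hypothetical name.

```python
from typing import Iterable, Optional, Tuple


def breached_for(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    duration_s: float,
) -> bool:
    """True if the value stayed above threshold for at least duration_s,
    measured up to the most recent sample.

    samples are (unix_timestamp_seconds, value) pairs in any order.
    """
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in sorted(samples):
        last_ts = ts
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any dip resets the window
    if breach_start is None or last_ts is None:
        return False
    return last_ts - breach_start >= duration_s
```

For the outbox alert this would be called with threshold=10_000 and duration_s=600 on the assignment_outbox_backlog samples.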
8. Correlation IDs
Every response carries the X-Trace-Id and X-Correlation-Id headers. Audit logs, DB logs, outbox rows, and emitted events all carry the same traceId/correlationId for end-to-end stitching.
9. Health Endpoints
| Endpoint | Purpose | Checks |
|---|---|---|
| /api/v1/healthz | K8s liveness | process up, event loop responsive |
| /api/v1/readyz | K8s readiness | DB reachable, NATS reachable, outbox publisher lag < threshold |
| /metrics | Prometheus scrape (optional; primary is OTLP push) | all metrics above |
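The readiness checks in the table can be sketched as a single aggregation function. Everything here is illustrative: `readiness` and `_safe` are hypothetical names, the dependency probes are passed in as callables, and the lag threshold is a parameter because its value is not given in this document. Returning 503 on failure follows the usual Kubernetes probe convention.

```python
from typing import Callable, Dict, Tuple


def _safe(check: Callable[[], bool]) -> bool:
    """Run a dependency probe, treating any exception as 'not reachable'."""
    try:
        return bool(check())
    except Exception:
        return False


def readiness(
    db_ping: Callable[[], bool],
    nats_ping: Callable[[], bool],
    outbox_lag_seconds: float,
    max_lag_seconds: float,
) -> Tuple[int, Dict[str, bool]]:
    """Return an HTTP status and per-check results for /api/v1/readyz."""
    checks = {
        "db": _safe(db_ping),
        "nats": _safe(nats_ping),
        "outbox_lag": outbox_lag_seconds < max_lag_seconds,
    }
    # Ready only if every check passes; otherwise take the pod out of rotation.
    status = 200 if all(checks.values()) else 503
    return status, checks
```

Returning the per-check map in the response body makes it quicker to see which dependency caused a pod to leave rotation.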
10. Trace Sampling
- Head: 100% for errors, 20% uniform sampling otherwise (tail-based sampling at the OTel collector downsamples further).
- Full capture always for: assignment.activate, assignment.suggest, GDPR handlers.
11. Synthetic Checks
- 5-min synthetic create/activate/list/archive in staging with a dedicated synthetic tenant tnt_synthetic_assignment.
- Latency + correctness verdict per run.
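One synthetic run could be structured as below. This is only a sketch: `run_synthetic` is a hypothetical name, each step is an injected callable that receives the tenant id and returns whether its result was correct, and stopping at the first failed step is an assumption made because later steps depend on earlier ones.

```python
import time
from typing import Any, Callable, Dict, List

STEPS = ("create", "activate", "list", "archive")
SYNTHETIC_TENANT = "tnt_synthetic_assignment"


def run_synthetic(
    actions: Dict[str, Callable[[str], bool]],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Run the four steps in order and return a latency + correctness verdict."""
    results: List[Dict[str, Any]] = []
    for step in STEPS:
        started = clock()
        try:
            ok = bool(actions[step](SYNTHETIC_TENANT))
        except Exception:
            ok = False  # an exception counts as a failed step
        results.append({"step": step, "ok": ok, "latency_s": clock() - started})
        if not ok:
            break  # later steps depend on earlier ones
    return {
        "tenant": SYNTHETIC_TENANT,
        "passed": len(results) == len(STEPS) and all(r["ok"] for r in results),
        "total_latency_s": sum(r["latency_s"] for r in results),
        "steps": results,
    }
```

A scheduler would invoke this every five minutes against staging and export the verdict as a metric or log event.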