Observability

:::info Source
Sourced from services/assignment-service/OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry


1. Signals

The service emits the three pillars of telemetry (traces, metrics, and logs) via the OpenTelemetry SDK, exported over OTLP to SigNoz:

| Signal | Transport | Retention |
| --- | --- | --- |
| Traces | OTLP/HTTP | 14 days hot, 90 days warm |
| Metrics | OTLP/HTTP | 13 months |
| Logs | OTLP/HTTP (structured JSON) | 30 days hot, 18 months archive (S3) |

Service name: assignment-service. All telemetry is always tagged with:

  • service.version, service.namespace="ghasi"
  • deployment.environment
  • tenant.id (where known — omitted for cross-tenant ops)
  • slice (S4|S5) for progressive-rollout dashboards
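
As a rough illustration, the tag set above could be assembled once at startup and attached to every signal. The sketch below uses a plain Python dictionary rather than any particular SDK. The function name `resource_attributes` and its parameters are hypothetical, and `service.name` is assumed to be the key that carries the service name.

```python
from typing import Dict, Optional

ALLOWED_SLICES = ("S4", "S5")  # slice values listed above


def resource_attributes(version: str, environment: str, slice_name: str,
                        tenant_id: Optional[str] = None) -> Dict[str, str]:
    """Build the common tag set (hypothetical helper, not the service's code)."""
    if slice_name not in ALLOWED_SLICES:
        raise ValueError(f"unknown slice: {slice_name!r}")
    attrs = {
        "service.name": "assignment-service",
        "service.version": version,
        "service.namespace": "ghasi",
        "deployment.environment": environment,
        "slice": slice_name,
    }
    # tenant.id is omitted for cross-tenant operations.
    if tenant_id is not None:
        attrs["tenant.id"] = tenant_id
    return attrs


if __name__ == "__main__":
    print(resource_attributes("1.7.3", "prod", "S4", tenant_id="tnt_example"))
```

Leaving `tenant_id` unset models a cross-tenant operation: the key is absent rather than set to an empty string.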

2. Key Spans

| Span | Start | End | Attributes |
| --- | --- | --- | --- |
| http.server | request received | response sent | standard |
| assignment.create | handler start | commit | assignment.id, tenant.id, created_by |
| assignment.activate | handler start | commit | assignment.id, estimated_window_count |
| materializer.run | job start | commit | assignment.id, windows.created, horizon.until |
| materializer.batch | batch start | batch commit | batch.size |
| window.transition | before update | after update | from_state, to_state, window.id |
| overdue.sweep | sweeper tick | tick done | windows.transitioned |
| closed_missed.sweep | same as overdue.sweep | same as overdue.sweep | same as overdue.sweep |
| escalation.fire | action eval | event published | window.id, level, action.kind |
| reminder.dispatch | eval | publish | window.id, trigger.hash |
| ai.gateway.call (child) | Gateway client call | response | prompt.id, prompt.version, cost.micro_usd |
| outbox.publish | read batch | ack from NATS | subject, batch.size |
| consumer.handle | onMessage | ack | subject, ce.id, attempt |

Every span carries tenant.id where possible, and the inbound traceparent is continued rather than starting a new root trace.
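
A minimal sketch of what continuing the inbound traceparent can look like, assuming the W3C Trace Context header layout (version, trace ID, parent span ID, and flags as lowercase hex, separated by hyphens). `SpanContext` and `continue_trace` are illustrative names, not part of the service, and starting a new root when the header is missing or invalid is an assumption. In practice the OpenTelemetry propagator would normally do this instead of hand parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    flags: str


def continue_trace(traceparent: Optional[str]) -> SpanContext:
    """Reuse the caller's trace id; only the span id is new."""
    new_span_id = secrets.token_hex(8)
    if traceparent:
        match = _TRACEPARENT.match(traceparent.strip())
        if match:
            version, trace_id, parent_id, flags = match.groups()
            # Version ff and all-zero ids are invalid per the spec.
            if version != "ff" and int(trace_id, 16) and int(parent_id, 16):
                return SpanContext(trace_id, new_span_id, parent_id, flags)
    # Assumption: with no usable header, the service starts a new root trace.
    return SpanContext(secrets.token_hex(16), new_span_id, None, "01")


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(continue_trace(header))
```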

3. Metrics (RED / USE)

Histograms use base-2 exponential buckets.
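
To make the bucket layout concrete, the sketch below maps a measurement to a base-2 bucket. It assumes the OpenTelemetry exponential histogram convention at scale 0, where bucket `i` covers the half-open range (2^i, 2^(i+1)]. The actual scale used by the service is not stated here, so treat this as an illustration only. The function names are hypothetical.

```python
import math
from typing import Tuple


def bucket_index(value: float) -> int:
    """Index i such that 2**i < value <= 2**(i + 1)."""
    if value <= 0 or not math.isfinite(value):
        raise ValueError("value must be positive and finite")
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent
    if mantissa == 0.5:
        # Exact power of two: it belongs to the bucket it closes (upper-inclusive).
        return exponent - 2
    return exponent - 1


def bucket_bounds(index: int) -> Tuple[float, float]:
    """Exclusive lower bound and inclusive upper bound of a bucket."""
    return math.ldexp(1.0, index), math.ldexp(1.0, index + 1)


if __name__ == "__main__":
    for seconds in (0.3, 1.0, 1.5, 2.0):
        idx = bucket_index(seconds)
        print(seconds, idx, bucket_bounds(idx))
```

Each bucket is twice as wide as the one before it, so relative error stays bounded across very small and very large durations.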

3.1 Request-level

  • assignment_http_requests_total{route, status, tenant_id} — counter
  • assignment_http_duration_seconds{route} — histogram
  • assignment_http_inflight{route} — gauge

3.2 Business

  • assignment_created_total{tenant_id, ai_suggested} — counter
  • assignment_activated_total{tenant_id} — counter
  • assignment_window_opened_total{tenant_id, assignment_id} — counter
  • assignment_window_state_transitions_total{from, to, tenant_id} — counter
  • assignment_window_open_count{tenant_id} — gauge (sampled every 60 s)
  • assignment_window_overdue_count{tenant_id} — gauge
  • assignment_compliance_rate{tenant_id, assignment_id} — gauge (percent)
  • assignment_materializer_duration_seconds — histogram
  • assignment_materializer_windows_created_total{tenant_id} — counter
  • assignment_escalation_fired_total{level, tenant_id} — counter
  • assignment_reminder_sent_total{tenant_id} — counter

3.3 Saga health

  • assignment_saga_lag_seconds{event} — histogram (wall-clock delta between publishing event and downstream observable effect)
  • assignment_saga_retries_total{event} — counter
  • assignment_saga_dlq_total{event} — counter (must be 0 in steady state)

3.4 AI

  • assignment_ai_suggest_total{tenant_id, outcome} (outcome=accepted|rejected|expired|invalid)
  • assignment_ai_cost_micro_usd_total{tenant_id} — counter
  • assignment_ai_latency_seconds — histogram

3.5 Infra

  • assignment_db_query_duration_seconds{op} — histogram
  • assignment_db_connections{state} — gauge
  • assignment_outbox_backlog{tenant_id} — gauge
  • assignment_nats_publish_total{subject} — counter
  • assignment_nats_ack_lag_seconds{consumer} — histogram

4. Logs

Structured JSON, one line per event, with:

{
  "ts": "2026-04-15T10:22:31.102Z",
  "level": "info",
  "msg": "assignment.activated",
  "service": "assignment-service",
  "version": "1.7.3",
  "env": "prod",
  "tenant_id": "tnt_…",
  "trace_id": "01HXYZ…",
  "span_id": "abc123…",
  "actor": "usr_…",
  "assignment_id": "asn_…"
}

Log levels:

  • error — handler error, unrecoverable
  • warn — retryable, degraded
  • info — state transitions, outbound events
  • debug — only when LOG_LEVEL=debug (disabled in prod)

PII redaction: names and email addresses must never appear in logs. Tenant IDs and user IDs are acceptable.
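
The log format and the redaction rule can be sketched together as a small helper that emits one JSON object per line. This is an assumption-laden illustration: `log_line`, the `PII_KEYS` set, and the email pattern are hypothetical, and the service's real logger may redact differently.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical list of field names that are dropped outright.
PII_KEYS = {"name", "first_name", "last_name", "email"}
# Simple pattern for email addresses embedded in free-text values.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def log_line(level: str, msg: str, **fields: object) -> str:
    """Return a single-line JSON log record with PII removed."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "ts": now.replace("+00:00", "Z"),
        "level": level,
        "msg": msg,
        "service": "assignment-service",
    }
    for key, value in fields.items():
        if key in PII_KEYS:
            continue  # names and emails never reach the log
        if isinstance(value, str):
            value = EMAIL.sub("[redacted]", value)
        record[key] = value
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


if __name__ == "__main__":
    print(log_line("info", "assignment.activated",
                   tenant_id="tnt_example", actor="usr_example",
                   email="someone@example.com"))
```

Tenant and user IDs pass through untouched, while the `email` field in the usage example is dropped before serialization.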

5. Dashboards

Pre-built dashboards in SigNoz under assignment-service/:

  1. Overview — RED + saga health + DLQ count
  2. Tenant Drill-down — select a tenant; shows a compliance-rate heatmap and the window state distribution over time
  3. Materializer — job runtime, batches/hour, windows/hour
  4. Escalation & Reminders — fires/hour, top-10 targets
  5. AI Suggest — latency p50/p95/p99, cost/day, acceptance rate, golden-eval pass rate
  6. Saga Integrity — open → in_progress → completed conversion funnel, overdue→closed_missed rate

6. SLOs

| SLI | Target | Alert at |
| --- | --- | --- |
| HTTP success rate (non-5xx) | ≥ 99.9% | ≤ 99.5% (5m) |
| HTTP p95 (create) | ≤ 250 ms | > 400 ms (10m) |
| Window opened → enrollment.created freshness p95 | ≤ 2 s | > 5 s (10m) |
| DLQ | 0 msgs | > 0 |
| Compliance report p95 | ≤ 1.5 s | > 3 s (10m) |
| AI suggest p95 | ≤ 8 s | > 15 s (10m) |

Error budget: 0.1% per month for availability (the complement of the 99.9% HTTP success-rate target).
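
As a worked example of what a 0.1% monthly budget means, the sketch below converts it to minutes of full outage and to a remaining-budget fraction based on request counts. It assumes a 30-day month and a request-based availability SLI. Both function names are hypothetical.

```python
def allowed_downtime_minutes(budget_fraction: float = 0.001, days: int = 30) -> float:
    """Minutes of total outage the budget tolerates over the period."""
    return budget_fraction * days * 24 * 60


def budget_remaining(total_requests: int, failed_requests: int,
                     budget_fraction: float = 0.001) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = total_requests * budget_fraction
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 30 days * 24 h * 60 min = 43,200 min; 0.1% of that is about 43.2 min.
    print(round(allowed_downtime_minutes(), 1))
    # 250 failures out of 1,000,000 requests spends a quarter of the budget.
    print(round(budget_remaining(1_000_000, 250), 2))
```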

7. Alerts

| Alert | Severity | Runbook |
| --- | --- | --- |
| DLQ > 0 | P1 | rb/assignment/dlq.md |
| Outbox backlog > 10k for 10m | P1 | rb/assignment/outbox.md |
| Materializer failed > 3x | P2 | rb/assignment/materializer.md |
| Overdue sweep stalled (no transitions in 30m) | P2 | rb/assignment/sweeper.md |
| AI suggest error rate > 5% | P3 | rb/assignment/ai.md |
| RLS violation exception | P0 | rb/common/tenant-leak.md |
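
Several of these alerts only fire after a condition has held for a period, such as "Outbox backlog > 10k for 10m". The sketch below shows one common way to evaluate such a rule: remember when the breach started and fire once it has lasted long enough. The class `ForDurationRule` is hypothetical, and the alerting backend's own evaluation semantics may differ.

```python
from typing import Optional


class ForDurationRule:
    """Fires when a value stays above a threshold for at least hold_s seconds."""

    def __init__(self, threshold: float, hold_s: float) -> None:
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_since: Optional[float] = None

    def observe(self, ts: float, value: float) -> bool:
        """Feed one sample (timestamp in seconds); return True if firing."""
        if value > self.threshold:
            if self._breach_since is None:
                self._breach_since = ts  # breach starts now
            return ts - self._breach_since >= self.hold_s
        # Any sample at or below the threshold resets the timer.
        self._breach_since = None
        return False


if __name__ == "__main__":
    rule = ForDurationRule(threshold=10_000, hold_s=600)
    for ts, backlog in [(0, 12_000), (300, 15_000), (600, 11_000), (660, 9_000)]:
        print(ts, backlog, rule.observe(ts, backlog))
```

Resetting on the first healthy sample avoids paging on short spikes, at the cost of delaying detection by the hold period.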

8. Correlation IDs

Every response carries the X-Trace-Id and X-Correlation-Id headers. The audit log, DB logs, outbox rows, and emitted events all carry the same traceId and correlationId, so a request can be stitched together end to end.
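
A small sketch of how the same two identifiers might be threaded through a response and an outbox row. Reusing an inbound X-Correlation-Id when one is present, the helper names, and the row's field names other than traceId and correlationId are all assumptions made for illustration.

```python
import uuid
from typing import Dict, Mapping


def response_headers(trace_id: str, inbound: Mapping[str, str]) -> Dict[str, str]:
    """Headers attached to every response."""
    lowered = {key.lower(): value for key, value in inbound.items()}
    # Assumption: keep the caller's correlation id, otherwise mint a new one.
    correlation_id = lowered.get("x-correlation-id") or uuid.uuid4().hex
    return {"X-Trace-Id": trace_id, "X-Correlation-Id": correlation_id}


def outbox_row(event_type: str, payload: dict, headers: Mapping[str, str]) -> dict:
    """Outbox record carrying the same ids as the HTTP response."""
    return {
        "event_type": event_type,  # hypothetical column name
        "payload": payload,        # hypothetical column name
        "traceId": headers["X-Trace-Id"],
        "correlationId": headers["X-Correlation-Id"],
    }


if __name__ == "__main__":
    hdrs = response_headers("4bf92f3577b34da6a3ce929d0e0e4736",
                            {"x-correlation-id": "corr-123"})
    print(outbox_row("assignment.activated", {"assignment_id": "asn_example"}, hdrs))
```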

9. Health Endpoints

| Endpoint | Purpose | Checks |
| --- | --- | --- |
| /api/v1/healthz | K8s liveness | process up, event loop responsive |
| /api/v1/readyz | K8s readiness | DB reachable, NATS reachable, outbox publisher lag < threshold |
| /metrics | Prometheus scrape (optional; primary is OTLP push) | all metrics above |
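
The readiness endpoint combines several dependency checks into one verdict. A minimal sketch of that aggregation follows, assuming HTTP 200 for ready and 503 for not ready. The `readiness` function, the check names, and the 30-second lag threshold in the usage example are hypothetical, since the real threshold is not given here.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each check; any failure or exception makes the pod not ready."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # an unreachable dependency counts as a failed check
        results[name] = "ok" if ok else "fail"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


if __name__ == "__main__":
    outbox_lag_s = 4.0           # would come from the outbox publisher
    lag_threshold_s = 30.0       # hypothetical threshold
    print(readiness({
        "db": lambda: True,      # stand-in for a real DB ping
        "nats": lambda: True,    # stand-in for a real NATS ping
        "outbox_lag": lambda: outbox_lag_s < lag_threshold_s,
    }))
```

Liveness stays deliberately cheaper than this: it should not depend on downstream systems, otherwise a database outage would restart healthy pods.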

10. Trace Sampling

  • Head: 100% for errors, 20% uniform sampling otherwise; tail-based sampling at the OTel collector downsamples further.
  • Full capture always for: assignment.activate, assignment.suggest, GDPR handlers.
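
The sampling rules above can be sketched as a single decision function. This is an illustration under assumptions: the `is_error` and `is_gdpr` flags are hypothetical inputs (how the service identifies errors or GDPR handlers at decision time is not described here), and the 20% decision is made deterministically from the low 64 bits of the trace ID, a common ratio-sampling approach that keeps all spans of one trace consistent.

```python
ALWAYS_CAPTURE = {"assignment.activate", "assignment.suggest"}
SAMPLE_RATIO = 0.20
# Trace ids whose low 64 bits fall below this bound are kept.
_BOUND = int(SAMPLE_RATIO * (1 << 64))


def should_sample(span_name: str, trace_id_hex: str,
                  is_error: bool = False, is_gdpr: bool = False) -> bool:
    """Return True if the trace should be recorded."""
    if is_error or is_gdpr or span_name in ALWAYS_CAPTURE:
        return True
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < _BOUND


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(should_sample("assignment.activate", trace_id))
    print(should_sample("http.server", trace_id))
    print(should_sample("http.server", trace_id, is_error=True))
```

Because the decision depends only on the trace ID, every service that applies the same ratio keeps or drops the same traces.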

11. Synthetic Checks

  • A synthetic create/activate/list/archive check runs every 5 minutes in staging against a dedicated synthetic tenant, tnt_synthetic_assignment.
  • Each run produces a latency and correctness verdict.
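
A synthetic run of this shape can be sketched as an ordered list of steps, each timed and checked, with one verdict at the end. The runner below is hypothetical: the steps are passed in as callables that return True when the response looked correct, and the latency budget is an assumed parameter, not a documented value.

```python
import time
from typing import Callable, Dict, List, Tuple

SYNTHETIC_TENANT = "tnt_synthetic_assignment"


def run_synthetic_check(steps: List[Tuple[str, Callable[[], bool]]],
                        latency_budget_s: float) -> Dict[str, object]:
    """Run steps in order and return a latency + correctness verdict."""
    results = []
    for name, step in steps:
        started = time.perf_counter()
        try:
            correct = bool(step())
        except Exception:
            correct = False
        results.append({"step": name, "correct": correct,
                        "latency_s": time.perf_counter() - started})
        if not correct:
            break  # later steps depend on earlier ones succeeding
    total_latency = sum(r["latency_s"] for r in results)
    all_correct = len(results) == len(steps) and all(r["correct"] for r in results)
    return {
        "tenant": SYNTHETIC_TENANT,
        "correct": all_correct,
        "within_latency": total_latency <= latency_budget_s,
        "steps": results,
    }


if __name__ == "__main__":
    # Stand-in steps; real ones would call the staging API.
    steps = [(n, lambda: True) for n in ("create", "activate", "list", "archive")]
    print(run_synthetic_check(steps, latency_budget_s=5.0))
```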