Claims Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: SERVICE_OVERVIEW · Service Template · 02 DDD
SLIs and SLOs
| SLI | SLO | Measurement Window |
|---|---|---|
| Claim assembly API availability | ≥ 99.9% | 30-day rolling |
| Claim assembly p95 latency | < 1500ms | 1-hour |
| Claim submission success rate | ≥ 99% (excluding payer-side errors) | 7-day rolling |
| Eligibility check p95 latency | < 3000ms (includes payer round-trip) | 1-hour |
| ERA ingestion processing time | < 60s from receipt to allocations applied | per-file |
| Outbox relay lag | < 30s p99 | 1-hour |
| FHIR EOB read p95 latency | < 500ms | 1-hour |
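To make the availability target concrete, the error budget it implies can be worked out directly. The sketch below is illustrative only: `errorBudgetMinutes` and `budgetRemaining` are hypothetical helper names, not part of the claims-service codebase, and the calculation assumes availability is measured in whole minutes of downtime.

```typescript
// Error-budget arithmetic for an availability SLO.
// Hypothetical helpers for illustration; not taken from the service code.

/** Minutes of unavailability allowed by an availability SLO over a window. */
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget > 1) {
    throw new RangeError("sloTarget must be in (0, 1]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

/** Fraction of the budget still unspent, given observed bad minutes. */
function budgetRemaining(
  sloTarget: number,
  windowDays: number,
  badMinutes: number,
): number {
  const budget = errorBudgetMinutes(sloTarget, windowDays);
  return budget === 0 ? 0 : Math.max(0, 1 - badMinutes / budget);
}

// Claim assembly API availability: >= 99.9% over a 30-day rolling window.
const assemblyBudget = errorBudgetMinutes(0.999, 30);
console.log(assemblyBudget.toFixed(1)); // 43.2
```

Under these assumptions, the 99.9% target leaves roughly 43 minutes of unavailability per 30-day window before the SLO is breached.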
OpenTelemetry Instrumentation
The claims-service is instrumented with the OpenTelemetry SDK for Node.js and emits:
- Traces: HTTP requests, DB queries, EDI adapter calls, payer API calls
- Metrics: counters, histograms, and gauges via the OpenTelemetry SDK → Prometheus scrape
- Logs: Structured JSON logs → OpenTelemetry log bridge → Loki
All spans include tenant_id, claim_id (where applicable), and correlation_id attributes.
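One way to keep the span attribute set consistent is to build it in a single place. The following is a minimal sketch, not the service's actual code: `buildSpanAttributes` and `SpanContextInput` are hypothetical names, and the only thing taken from this document is the three attribute keys and the rule that `claim_id` appears only where applicable.

```typescript
// Hypothetical helper that builds the attribute set described above.
type SpanAttributes = Record<string, string>;

interface SpanContextInput {
  tenantId: string;
  correlationId: string;
  claimId?: string; // only present for claim-scoped operations
}

function buildSpanAttributes(input: SpanContextInput): SpanAttributes {
  const attrs: SpanAttributes = {
    tenant_id: input.tenantId,
    correlation_id: input.correlationId,
  };
  // claim_id is added only where applicable, so it is never an empty string.
  if (input.claimId !== undefined) {
    attrs.claim_id = input.claimId;
  }
  return attrs;
}

// An eligibility check has no claim yet; a submission does.
const eligibilityAttrs = buildSpanAttributes({
  tenantId: "tenant-a",
  correlationId: "corr-123",
});
const submissionAttrs = buildSpanAttributes({
  tenantId: "tenant-a",
  correlationId: "corr-123",
  claimId: "claim-42",
});
```

With `@opentelemetry/api`, a map like this could be passed to `span.setAttributes(...)` when the span is started; the sketch stays dependency-free so it can be run on its own.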
Key Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| claims_assembled_total | Counter | tenant_id, channel | Total claims assembled |
| claims_submitted_total | Counter | tenant_id, channel, status | Claim submissions (success/failure) |
| claims_denied_total | Counter | tenant_id, denial_code | Claims denied by payer |
| claims_paid_total | Counter | tenant_id | Claims paid in full |
| eligibility_check_duration_seconds | Histogram | tenant_id, channel | Eligibility inquiry latency |
| era_processing_duration_seconds | Histogram | tenant_id | ERA ingestion to allocation applied |
| outbox_lag_seconds | Gauge | tenant_id | Oldest unpublished outbox record age |
| submission_adapter_errors_total | Counter | tenant_id, adapter, error_code | Adapter-level errors |
| coverage_active_count | Gauge | tenant_id | Active coverage records per tenant |
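As an illustration of how a gauge such as outbox_lag_seconds could be derived, the sketch below computes the age of the oldest unpublished record. The `OutboxRecord` shape and the `outboxLagSeconds` function are assumptions made for this example; the real outbox schema is not described in this document.

```typescript
// Hypothetical outbox record shape; the actual table schema may differ.
interface OutboxRecord {
  id: string;
  createdAt: Date;
  publishedAt: Date | null; // null while the record is still unpublished
}

/** Age in seconds of the oldest unpublished record, or 0 if none are pending. */
function outboxLagSeconds(records: OutboxRecord[], now: Date): number {
  let oldest: number | null = null;
  for (const record of records) {
    if (record.publishedAt !== null) continue; // already relayed
    const created = record.createdAt.getTime();
    if (oldest === null || created < oldest) oldest = created;
  }
  return oldest === null ? 0 : Math.max(0, (now.getTime() - oldest) / 1000);
}

const now = new Date("2026-04-18T12:00:00Z");
const pending: OutboxRecord[] = [
  { id: "a", createdAt: new Date("2026-04-18T11:59:15Z"), publishedAt: null },
  { id: "b", createdAt: new Date("2026-04-18T11:58:00Z"), publishedAt: new Date("2026-04-18T11:58:01Z") },
  { id: "c", createdAt: new Date("2026-04-18T11:59:50Z"), publishedAt: null },
];
console.log(outboxLagSeconds(pending, now)); // 45
```

Because the metric carries a tenant_id label, a computation like this would presumably run once per tenant over that tenant's records.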
Dashboards
| Dashboard | Purpose |
|---|---|
| Claims Pipeline | Assembly rate, submission rate, denial rate, paid rate by tenant |
| ERA Processing | ERA ingestion throughput, processing time, allocations per ERA |
| Eligibility | Check volume, latency by payer/channel, error rate |
| Outbox Health | Relay lag, unpublished event count, relay throughput |
| Payer Adapter | Per-adapter error rates, latency, circuit breaker status |
Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| High claim denial rate | denial_rate > 15% over 1 hour for any tenant | Warning | runbooks/claims-high-denial-rate.md |
| Submission adapter failures | submission_adapter_errors_total increases by more than 10 in 5 minutes | Critical | runbooks/claims-adapter-failure.md |
| ERA processing timeout | ERA not processed within 120s of receipt | Warning | runbooks/claims-era-timeout.md |
| Outbox relay lag spike | outbox_lag_seconds > 120 | Warning | runbooks/claims-outbox-lag.md |
| Eligibility check SLO breach | p95 > 3000ms for 10 minutes | Warning | runbooks/claims-eligibility-slow.md |
| Payer circuit open | circuit breaker open for any payer adapter | Critical | runbooks/claims-payer-circuit-open.md |
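The denial-rate condition is a ratio of counter increases rather than a raw counter value. The sketch below shows one way to evaluate it, under a loudly stated assumption: this document does not define the denominator, so the example uses the increase in claims_submitted_total over the same window. `CounterSample`, `denialRate`, and `shouldFireDenialAlert` are hypothetical names.

```typescript
// Per-tenant counter readings at the start and end of the alert window.
// Assumption: denial rate = increase in denied / increase in submitted.
interface CounterSample {
  denied: number; // claims_denied_total
  submitted: number; // claims_submitted_total
}

/** Denial rate over the window, or null when it cannot be computed. */
function denialRate(start: CounterSample, end: CounterSample): number | null {
  const denied = end.denied - start.denied;
  const submitted = end.submitted - start.submitted;
  // No submissions in the window, or a counter reset: no meaningful rate.
  if (submitted <= 0 || denied < 0) return null;
  return denied / submitted;
}

/** True when the rate exceeds the threshold (15% by default, as in the table). */
function shouldFireDenialAlert(
  start: CounterSample,
  end: CounterSample,
  threshold = 0.15,
): boolean {
  const rate = denialRate(start, end);
  return rate !== null && rate > threshold;
}

const hourStart: CounterSample = { denied: 100, submitted: 1000 };
const hourEnd: CounterSample = { denied: 140, submitted: 1200 };
console.log(denialRate(hourStart, hourEnd)); // 0.2
console.log(shouldFireDenialAlert(hourStart, hourEnd)); // true
```

Payer denials often arrive well after submission, so a same-window ratio is only an approximation; the actual alert rule may define the rate differently.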
Health Endpoints
| Endpoint | Purpose |
|---|---|
| GET /health/live | Kubernetes liveness probe: returns 200 if process is alive |
| GET /health/ready | Readiness probe: checks DB connection, NATS connection, adapter connectivity |
| GET /health/startup | Startup probe: confirms migrations have run |
| GET /metrics | Prometheus metrics scrape endpoint |
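The readiness probe combines several dependency checks into one response. A minimal sketch of that aggregation follows; `evaluateReadiness` and its types are hypothetical, and the 503 status for a failed check is an assumption (this document only specifies the 200 returned by the liveness probe). Kubernetes HTTP probes treat any status from 200 to 399 as success, so any code outside that range would mark the pod as not ready.

```typescript
// Dependencies listed for GET /health/ready in the table above.
type CheckName = "db" | "nats" | "adapters";

interface CheckResult {
  name: CheckName;
  ok: boolean;
}

interface ReadinessResponse {
  status: 200 | 503; // 503 is an assumed choice for "not ready"
  body: { ready: boolean; failed: CheckName[] };
}

/** Ready only when every dependency check passed. */
function evaluateReadiness(results: CheckResult[]): ReadinessResponse {
  const failed = results.filter((r) => !r.ok).map((r) => r.name);
  const ready = failed.length === 0;
  return { status: ready ? 200 : 503, body: { ready, failed } };
}

const healthy = evaluateReadiness([
  { name: "db", ok: true },
  { name: "nats", ok: true },
  { name: "adapters", ok: true },
]);
const degraded = evaluateReadiness([
  { name: "db", ok: true },
  { name: "nats", ok: false },
  { name: "adapters", ok: true },
]);
console.log(healthy.status, degraded.status, degraded.body.failed); // 200 503 [ 'nats' ]
```

Listing the failed checks in the response body is a design choice that makes a not-ready pod easier to diagnose from probe logs.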