Document Service — Observability
Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry
1. SLIs and SLOs
| SLI | SLO | Measurement window |
|---|---|---|
| Synchronous generation p95 latency | < 5 s | 5-minute rolling |
| Template list / metadata read p95 latency | < 200 ms | 5-minute rolling |
| Document download (presigned URL) p95 latency | < 300 ms | 5-minute rolling |
| Service availability (non-5xx rate) | ≥ 99.5 % | 30-day rolling |
| Virus scan success rate (clean or quarantined, no scan errors) | ≥ 99.9 % | 1-day rolling |
| Async render job completion rate (completed / total started) | ≥ 95 % | 1-hour rolling |
| Async render job p95 completion time | < 30 s | 1-hour rolling |
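As an illustration of how two of these SLIs could be checked against their SLOs, the following is a minimal sketch rather than the service's actual evaluation logic. It assumes a nearest-rank p95 over raw samples from one window; the helper names and the sample data are hypothetical, and a production setup would more likely derive the percentile from histogram buckets in the metrics backend.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def availability(total_requests, server_errors):
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        return 1.0  # assumption: an empty window counts as fully available
    return (total_requests - server_errors) / total_requests


# Hypothetical sync-generation latencies (ms) from one 5-minute window.
window_ms = [1200, 1850, 2340, 2900, 3100, 3400, 3900, 4200, 4700, 6100]

# SLO from the table: synchronous generation p95 < 5 s.
sync_latency_ok = p95(window_ms) < 5000

# SLO from the table: non-5xx rate >= 99.5 % (hypothetical request counts).
availability_ok = availability(200_000, 800) >= 0.995
```

With these sample values the latency SLO is missed, because a single slow request is enough to move the p95 of a ten-sample window, while the availability SLO is met.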
2. Key Metrics (OpenTelemetry)
| Metric name | Type | Labels |
|---|---|---|
| document.generate.duration_ms | Histogram | tenantId, templateCategory, locale, success |
| document.generate.total | Counter | tenantId, success, errorCode |
| document.render_job.status | Gauge | tenantId, status |
| document.render_job.duration_ms | Histogram | tenantId, templateCategory |
| document.upload.total | Counter | tenantId, result (clean, quarantined, error) |
| document.virus_scan.duration_ms | Histogram | tenantId |
| document.download.total | Counter | tenantId |
| document.event.published_total | Counter | event_type |
| document.event.dlq_depth | Gauge | stream |
| document.fhir.call_duration_ms | Histogram | operation, success |
| document.storage.put_duration_ms | Histogram | tenantId, success |
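To show how the labels partition a counter into separate series, the sketch below models document.upload.total with an in-memory `collections.Counter` and derives the virus scan success rate (clean or quarantined over total) from it. This is a stand-in for illustration only and does not use the OpenTelemetry SDK; `record_upload` and `scan_success_rate` are hypothetical names, and the tenant ID is taken from the log sample in this document.

```python
from collections import Counter

# Allowed values of the `result` label, as listed in the metrics table.
ALLOWED_RESULTS = {"clean", "quarantined", "error"}

# In-memory stand-in for document.upload.total, keyed by (tenantId, result).
upload_total = Counter()


def record_upload(tenant_id, result):
    """Increment the upload counter, rejecting unknown label values."""
    if result not in ALLOWED_RESULTS:
        raise ValueError(f"unknown result label: {result!r}")
    upload_total[(tenant_id, result)] += 1


def scan_success_rate(tenant_id):
    """Share of uploads whose scan finished (clean or quarantined)."""
    counts = {r: upload_total[(tenant_id, r)] for r in ALLOWED_RESULTS}
    total = sum(counts.values())
    if total == 0:
        return None  # no uploads observed, so the rate is undefined
    return (counts["clean"] + counts["quarantined"]) / total


# Hypothetical traffic: 998 clean, 1 quarantined, 1 scan error.
for _ in range(998):
    record_upload("ten_afg_moph_001", "clean")
record_upload("ten_afg_moph_001", "quarantined")
record_upload("ten_afg_moph_001", "error")
```

Restricting `result` to a fixed set keeps the number of series per tenant bounded, which matters because every distinct label combination becomes its own time series.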
3. Distributed Tracing
All inbound and outbound calls use W3C trace context. Key spans:
| Span | Attributes |
|---|---|
| document.generate.sync | tenant_id, template_version_id, patient_id, locale, duration_ms |
| document.render_job.execute | tenant_id, job_id, template_version_id |
| document.fhir.resolve_bindings | tenant_id, resource_types[], duration_ms |
| document.pdf.render | tenant_id, template_version_id, page_count |
| document.storage.put | tenant_id, key, size_bytes |
| document.virus_scan | tenant_id, result, duration_ms |
| document.presigned_url.generate | tenant_id, ttl_seconds |
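For readers unfamiliar with W3C trace context, the sketch below parses a version-00 `traceparent` header and builds the header for an outbound child call. It is a simplified illustration, not the service's propagation code: in practice an OpenTelemetry propagator would normally handle this, `tracestate` is ignored, and the function names are hypothetical. The sample header value is the example from the W3C Trace Context specification.

```python
import re
import secrets

# version "00": 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the parsed fields of a version-00 traceparent, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent):
    """Build the header for an outbound call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
outgoing = child_traceparent(context)
```

The trace ID stays constant across every span in the table above, which is what lets a single document generation be followed from the inbound request through FHIR binding resolution, rendering, and storage.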
4. Dashboards
| Dashboard | Contents |
|---|---|
| Document Service Overview | Request rate, p95/p99 latency, error rate, availability |
| PDF Generation Performance | Sync vs async generation rate; p95 duration; timeout rate |
| Render Job Queue | Queued / running / completed / failed job counts; p95 completion time |
| Upload and Virus Scan | Upload rate; quarantine rate; scan duration; scan error rate |
| Object Storage | PUT/GET rate; presigned URL generation rate; storage errors |
| FHIR Dependency | Binding resolution rate; FHIR 4xx/5xx rate; p95 latency |
5. Alerts
| Alert condition | Evaluation window | Severity | Runbook |
|---|---|---|---|
| Sync generation p95 > 8 s | 5-minute window | Warning | runbooks/doc-slow-generation.md |
| Sync generation p95 > 15 s | 5-minute window | Critical | runbooks/doc-slow-generation.md |
| Render job queue depth > 500 | Any | Warning | runbooks/doc-render-queue.md |
| Render job failure rate > 5 % | 15-minute window | Critical | runbooks/doc-render-failures.md |
| Virus scan error rate > 0.1 % | 1-hour window | Critical | runbooks/doc-virus-scan.md |
| Object storage error rate > 1 % | 5-minute window | Critical | runbooks/doc-storage.md |
| NATS DLQ depth > 5 | Any | Warning | runbooks/doc-nats-dlq.md |
| FHIR binding failure rate > 2 % | 15-minute window | Warning | runbooks/doc-fhir-binding.md |
6. Structured Logging
All logs include: traceId, spanId, tenantId, service: "document-service", level, timestamp.
Generation log sample (INFO):
{
  "level": "info",
  "message": "document.generated",
  "traceId": "...",
  "tenantId": "ten_afg_moph_001",
  "templateVersionId": "tv_01JRXXXX",
  "patientId": "[REDACTED]",
  "durationMs": 2340,
  "inputSnapshotHash": "sha256:abcdef...",
  "documentReferenceId": "DocRef/01JRXXXX"
}
Note: patientId is logged as [REDACTED] at INFO level; the actual value appears only in DEBUG-level logs during authorized debugging sessions.
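The redaction rule can be sketched as a small log-entry builder. This is a hedged illustration, not the service's logger: `build_log_entry` and `SENSITIVE_FIELDS` are hypothetical names, the patient and span identifiers are made-up sample values, and the check that a debugging session is actually authorized is assumed to happen elsewhere.

```python
import json
from datetime import datetime, timezone

# Fields whose values are masked unless the entry is logged at DEBUG level.
SENSITIVE_FIELDS = {"patientId"}


def build_log_entry(level, message, trace_id, span_id, tenant_id, **fields):
    """Return a JSON log line carrying the common fields plus extra fields."""
    entry = {
        "level": level,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": "document-service",
        "traceId": trace_id,
        "spanId": span_id,
        "tenantId": tenant_id,
    }
    for key, value in fields.items():
        if key in SENSITIVE_FIELDS and level != "debug":
            value = "[REDACTED]"
        entry[key] = value
    return json.dumps(entry)


info_line = build_log_entry(
    "info",
    "document.generated",
    "4bf92f3577b34da6a3ce929d0e0e4736",
    "00f067aa0ba902b7",
    "ten_afg_moph_001",
    patientId="pat_example",  # hypothetical identifier
    durationMs=2340,
)
```

Masking at the point where the entry is built, rather than in a downstream log pipeline, means the raw identifier never leaves the process at INFO level.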