Document Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry


1. SLIs and SLOs

| SLI | SLO | Measurement window |
| --- | --- | --- |
| Synchronous generation p95 latency | < 5 s | 5-minute rolling |
| Template list / metadata read p95 latency | < 200 ms | 5-minute rolling |
| Document download (presigned URL) p95 latency | < 300 ms | 5-minute rolling |
| Service availability (non-5xx rate) | ≥ 99.5 % | 30-day rolling |
| Virus scan success rate (clean or quarantined, no scan errors) | ≥ 99.9 % | 1-day rolling |
| Async render job completion rate (completed / total started) | ≥ 95 % | 1-hour rolling |
| Async render job p95 completion time | < 30 s | 1-hour rolling |
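
The availability SLO can be read as an error budget: with a non-5xx target of 99.5 %, up to 0.5 % of requests in the 30-day window may return 5xx. The following is a minimal sketch of that arithmetic, not the service's actual tooling. The function names and the basis-point representation are assumptions made for illustration.

```python
# Hypothetical helper names; the 99.5 % target comes from the SLO table.
AVAILABILITY_SLO_BP = 9950  # 99.5 % expressed in basis points (1 bp = 0.01 %)


def allowed_5xx(total_requests: int, slo_bp: int = AVAILABILITY_SLO_BP) -> int:
    """Number of 5xx responses the window can absorb without breaching the SLO."""
    # Integer arithmetic avoids floating-point rounding on the 0.5 % share.
    return total_requests * (10_000 - slo_bp) // 10_000


def budget_remaining(total_requests: int, observed_5xx: int,
                     slo_bp: int = AVAILABILITY_SLO_BP) -> int:
    """Remaining error budget in requests; a negative value means the SLO is breached."""
    return allowed_5xx(total_requests, slo_bp) - observed_5xx
```

For example, a window with 1,000,000 requests tolerates `allowed_5xx(1_000_000)` = 5,000 failed requests; after 1,200 observed 5xx responses, 3,800 remain.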

2. Key Metrics (OpenTelemetry)

| Metric name | Type | Labels |
| --- | --- | --- |
| document.generate.duration_ms | Histogram | tenantId, templateCategory, locale, success |
| document.generate.total | Counter | tenantId, success, errorCode |
| document.render_job.status | Gauge | tenantId, status |
| document.render_job.duration_ms | Histogram | tenantId, templateCategory |
| document.upload.total | Counter | tenantId, result (clean, quarantined, error) |
| document.virus_scan.duration_ms | Histogram | tenantId |
| document.download.total | Counter | tenantId |
| document.event.published_total | Counter | event_type |
| document.event.dlq_depth | Gauge | stream |
| document.fhir.call_duration_ms | Histogram | operation, success |
| document.storage.put_duration_ms | Histogram | tenantId, success |
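
Because each metric has a fixed label set, instrumentation code can be checked against the table before a data point is recorded, which helps keep identifiers such as patient IDs out of metric labels. The sketch below shows one way to do that with plain Python. It is an illustration only: `ALLOWED_LABELS`, `UPLOAD_RESULTS`, and `label_problems` are hypothetical names, only a subset of the metrics is listed, and the check assumes that labels are optional per data point (for example, `errorCode` only on failures).

```python
from typing import Dict, List, Set

# Allowed label names per metric, copied from the metrics table (subset shown).
ALLOWED_LABELS: Dict[str, Set[str]] = {
    "document.generate.duration_ms": {"tenantId", "templateCategory", "locale", "success"},
    "document.generate.total": {"tenantId", "success", "errorCode"},
    "document.upload.total": {"tenantId", "result"},
    "document.event.published_total": {"event_type"},
}

# Allowed values for the `result` label on document.upload.total.
UPLOAD_RESULTS = {"clean", "quarantined", "error"}


def label_problems(metric: str, attributes: Dict[str, object]) -> List[str]:
    """Return human-readable problems; an empty list means the labels are acceptable."""
    allowed = ALLOWED_LABELS.get(metric)
    if allowed is None:
        return [f"unknown metric: {metric}"]
    # Flag labels that are not in the documented set (missing labels are tolerated).
    problems = [f"unexpected label: {name}" for name in sorted(set(attributes) - allowed)]
    if metric == "document.upload.total":
        result = attributes.get("result")
        if result is not None and result not in UPLOAD_RESULTS:
            problems.append(f"invalid result value: {result}")
    return problems
```

A call such as `label_problems("document.generate.total", {"tenantId": "ten_1", "patientId": "p"})` returns `["unexpected label: patientId"]`.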

3. Distributed Tracing

All inbound and outbound calls use W3C trace context. Key spans:

| Span | Attributes |
| --- | --- |
| document.generate.sync | tenant_id, template_version_id, patient_id, locale, duration_ms |
| document.render_job.execute | tenant_id, job_id, template_version_id |
| document.fhir.resolve_bindings | tenant_id, resource_types[], duration_ms |
| document.pdf.render | tenant_id, template_version_id, page_count |
| document.storage.put | tenant_id, key, size_bytes |
| document.virus_scan | tenant_id, result, duration_ms |
| document.presigned_url.generate | tenant_id, ttl_seconds |
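
W3C trace context travels in the `traceparent` HTTP header, which for version `00` has the form `version-traceid-parentid-flags` (2, 32, 16, and 2 lowercase hex characters). In practice an OpenTelemetry propagator handles this; the sketch below only illustrates the header layout. `TraceContext`, `parse_traceparent`, and `format_traceparent` are hypothetical names, and the parser deliberately accepts only version `00`.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace and parent ids are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context for an outbound call (version 00)."""
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{'01' if ctx.sampled else '00'}"
```

An outbound call to the FHIR server or object storage would carry the same `trace_id` with the current span's id as `parent_id`, so that spans such as document.fhir.resolve_bindings attach to the originating request.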

4. Dashboards

| Dashboard | Contents |
| --- | --- |
| Document Service Overview | Request rate, p95/p99 latency, error rate, availability |
| PDF Generation Performance | Sync vs async generation rate; p95 duration; timeout rate |
| Render Job Queue | Queued / running / completed / failed job counts; p95 completion time |
| Upload and Virus Scan | Upload rate; quarantine rate; scan duration; scan error rate |
| Object Storage | PUT/GET rate; presigned URL generation rate; storage errors |
| FHIR Dependency | Binding resolution rate; FHIR 4xx/5xx rate; p95 latency |

5. Alerts

| Alert condition | Evaluation window | Severity | Runbook |
| --- | --- | --- | --- |
| Sync generation p95 > 8 s | 5-minute window | Warning | runbooks/doc-slow-generation.md |
| Sync generation p95 > 15 s | 5-minute window | Critical | runbooks/doc-slow-generation.md |
| Render job queue depth > 500 | Any | Warning | runbooks/doc-render-queue.md |
| Render job failure rate > 5 % | 15-minute window | Critical | runbooks/doc-render-failures.md |
| Virus scan error rate > 0.1 % | 1-hour window | Critical | runbooks/doc-virus-scan.md |
| Object storage error rate > 1 % | 5-minute window | Critical | runbooks/doc-storage.md |
| NATS DLQ depth > 5 | Any | Warning | runbooks/doc-nats-dlq.md |
| FHIR binding failure rate > 2 % | 15-minute window | Warning | runbooks/doc-fhir-binding.md |
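
The two sync generation alerts share one signal with tiered thresholds: above 8 s is a warning, above 15 s is critical. The sketch below illustrates that tiering over the raw latency samples of one 5-minute window. It is an assumption-laden example: a real deployment would most likely evaluate the histogram in the metrics backend, and the names `p95` and `sync_generation_severity` as well as the nearest-rank percentile method are choices made here for illustration.

```python
from typing import Optional, Sequence

WARNING_MS = 8_000    # Sync generation p95 > 8 s
CRITICAL_MS = 15_000  # Sync generation p95 > 15 s


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the samples in one evaluation window."""
    if not samples_ms:
        raise ValueError("no samples in window")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sync_generation_severity(samples_ms: Sequence[float]) -> Optional[str]:
    """Return "critical", "warning", or None for one 5-minute window."""
    value = p95(samples_ms)
    if value > CRITICAL_MS:  # check the stricter threshold first
        return "critical"
    if value > WARNING_MS:
        return "warning"
    return None
```

Checking the critical threshold first ensures that a window breaching both thresholds is reported once, at the higher severity.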

6. Structured Logging

All logs include: traceId, spanId, tenantId, service: "document-service", level, timestamp.

Generation log sample (INFO):

{
  "level": "info",
  "message": "document.generated",
  "traceId": "...",
  "tenantId": "ten_afg_moph_001",
  "templateVersionId": "tv_01JRXXXX",
  "patientId": "[REDACTED]",
  "durationMs": 2340,
  "inputSnapshotHash": "sha256:abcdef...",
  "documentReferenceId": "DocRef/01JRXXXX"
}

Note: patientId is logged as [REDACTED] at INFO level; the actual value appears only in DEBUG-level output during authorized debugging sessions.
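
The redaction rule can be sketched as a small log-line builder that always emits the common fields and masks patientId unless the record is DEBUG level and the session is authorized. This is an illustrative sketch, not the service's logger: `build_log_line`, `SENSITIVE_FIELDS`, and the `debug_authorized` flag are hypothetical, and the ISO 8601 UTC timestamp format is an assumption.

```python
import json
from datetime import datetime, timezone

# Fields masked outside authorized DEBUG sessions (hypothetical constant).
SENSITIVE_FIELDS = {"patientId"}


def build_log_line(level: str, message: str, trace_id: str, span_id: str,
                   tenant_id: str, debug_authorized: bool = False,
                   **fields: object) -> str:
    """Build one JSON log line with the common fields and redaction applied."""
    record = {
        "level": level,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        "service": "document-service",
        "traceId": trace_id,
        "spanId": span_id,
        "tenantId": tenant_id,
    }
    reveal = level == "debug" and debug_authorized
    for name, value in fields.items():
        # Mask sensitive values unless this is an authorized DEBUG record.
        record[name] = value if (reveal or name not in SENSITIVE_FIELDS) else "[REDACTED]"
    return json.dumps(record)
```

Called with `level="info"` and `patientId="pat_123"`, the resulting line contains `"patientId": "[REDACTED]"`; the raw value is emitted only when `level="debug"` and `debug_authorized=True`.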