Patient Portal Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 12 observability-telemetry · 02 DDD

1. Service-Level Indicators (SLIs)

SLI	Description	Measurement
Availability	% of portal requests returning non-5xx	`1 - (5xx_count / total_count)` per 1-min window
API Latency p95	95th percentile response time for all `/v1/portal/*` endpoints	OTEL histogram `http.server.duration`
API Latency p99	99th percentile response time for the same endpoints; captures tail latency	OTEL histogram `http.server.duration`
Upstream error rate	% of upstream service calls returning 5xx or timing out	Per upstream adapter metric
Login success rate	% of login attempts completing without error	`login_success / login_attempts`
Export job completion rate	% of export jobs reaching `complete` within 5 minutes	Job status metrics
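
The ratio-based SLIs in this table reduce to simple arithmetic over per-window counters. A minimal sketch, assuming the counts have already been aggregated into 1-minute windows; the `WindowCounts` shape and the function names are hypothetical and not taken from the service code:

```typescript
// Hypothetical shape: request counts aggregated over one 1-minute window.
interface WindowCounts {
  total: number;        // all portal requests in the window
  serverErrors: number; // requests that returned 5xx
}

// Availability = 1 - (5xx_count / total_count).
// A window with no traffic is treated as fully available (an assumption).
function availability(w: WindowCounts): number {
  if (w.total === 0) return 1;
  return 1 - w.serverErrors / w.total;
}

// Generic success ratio, e.g. login_success / login_attempts.
// Returns null when there were no attempts, so callers can skip the window.
function successRate(successes: number, attempts: number): number | null {
  if (attempts === 0) return null;
  return successes / attempts;
}

// 10 server errors out of 2000 requests gives roughly 0.995 (99.5%).
const exampleAvailability = availability({ total: 2000, serverErrors: 10 });
// 495 successful logins out of 500 attempts gives 0.99 (99%).
const exampleLoginRate = successRate(495, 500);
```

How empty windows are treated is a design choice: counting them as available keeps quiet periods from dragging the SLI down, while returning `null` lets a dashboard show a gap instead.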

2. Service-Level Objectives (SLOs)

SLO	Target	Measurement window
Availability	≥ 99.5%	30-day rolling
API Latency p95	≤ 800 ms	5-min rolling
API Latency p99	≤ 2000 ms	5-min rolling
Upstream error rate	≤ 2%	5-min rolling
Login success rate	≥ 99%	1-day rolling
Export job completion rate	≥ 95% within 5 min	1-day rolling
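
An availability target translates directly into an error budget. For the 99.5% target over a 30-day window, the budget is 0.5% of 43,200 minutes, which is 216 minutes of full unavailability. The sketch below works through that arithmetic; the function names are illustrative only:

```typescript
// Minutes of total downtime allowed by an availability SLO over a window.
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - targetPercent)) / 100;
}

// Fraction of the request-based budget still unspent.
// A negative result means the SLO has been breached for the window.
function budgetRemaining(targetPercent: number, total: number, failed: number): number {
  const allowedFailures = (total * (100 - targetPercent)) / 100;
  if (allowedFailures === 0) return failed === 0 ? 1 : -Infinity;
  return 1 - failed / allowedFailures;
}

const availabilityBudget = errorBudgetMinutes(99.5, 30); // 216 minutes
const halfSpent = budgetRemaining(99.5, 1_000_000, 2_500); // 0.5
```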

3. OpenTelemetry Instrumentation

Traces:

Span	Attributes
`portal.api.request`	`http.method`, `http.route`, `tenant_id`, `account_id_hash`
`portal.upstream.call`	`upstream.service`, `upstream.endpoint`, `upstream.status`
`portal.cache.hit`	`cache.key_prefix`, `cache.hit` (bool)
`portal.access_event.write`	`event_type`, `acting_as_proxy`

Metrics:

Metric	Type	Labels
`portal.requests.total`	Counter	`method`, `route`, `status_code`, `tenant_id`
`portal.request.duration`	Histogram	`method`, `route`, `tenant_id`
`portal.upstream.calls.total`	Counter	`upstream`, `status_code`
`portal.cache.hits.total`	Counter	`key_prefix`
`portal.login.attempts.total`	Counter	`mfa_used`, `result` (success/fail)
`portal.export_jobs.total`	Counter	`status` (complete/failed)
`portal.proxy_session.total`	Counter	`tenant_id`
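
The metrics above map onto counter and histogram instruments that take a value plus a label set. The sketch below uses a small in-memory stand-in rather than the real OpenTelemetry SDK so that it runs on its own. The `add(value, labels)` shape mirrors the OTel metrics API, but `InMemoryCounter`, `recordRequest`, and the example route are hypothetical:

```typescript
type Labels = Record<string, string>;

// Minimal stand-in for a metrics counter: one running total per distinct label set.
class InMemoryCounter {
  private readonly totals = new Map<string, number>();

  constructor(readonly metricName: string) {}

  add(value: number, labels: Labels): void {
    const key = InMemoryCounter.keyOf(labels);
    this.totals.set(key, (this.totals.get(key) ?? 0) + value);
  }

  get(labels: Labels): number {
    return this.totals.get(InMemoryCounter.keyOf(labels)) ?? 0;
  }

  // Sort label names so that property order does not create a new series.
  private static keyOf(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
  }
}

const requestsTotal = new InMemoryCounter("portal.requests.total");

// Record one request with the labels listed for portal.requests.total.
// `route` should be the route template, not the raw URL, to keep cardinality bounded.
function recordRequest(method: string, route: string, statusCode: number, tenantId: string): void {
  requestsTotal.add(1, {
    method,
    route,
    status_code: String(statusCode),
    tenant_id: tenantId,
  });
}

// Hypothetical route, used only for illustration.
recordRequest("GET", "/v1/portal/appointments", 200, "tenant-a");
recordRequest("GET", "/v1/portal/appointments", 200, "tenant-a");
```

Each distinct combination of label values becomes its own time series, which is why per-tenant labels such as `tenant_id` are worth watching for cardinality growth.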

Logs: Structured JSON via `@ghasi/logger`. All logs include `traceId`, `spanId`, `tenantId`, and `service: patient-portal-service`. PHI fields (`patientId`, `resourceId`) are hashed in log output.
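
One way to enforce the hashing rule is to hash PHI fields at the point where the log record is assembled. The sketch below does not use the `@ghasi/logger` API, which is not shown in this document; `buildLogRecord`, `PHI_FIELDS`, `LogContext`, and the injected `hash` function are all hypothetical. In practice the hasher would likely be a keyed cryptographic hash such as HMAC-SHA-256, which is an assumption rather than something this document specifies.

```typescript
type LogFields = Record<string, string>;

// Fields treated as PHI, per the logging rule above.
const PHI_FIELDS = new Set<string>(["patientId", "resourceId"]);

// Context that every log line is required to carry.
interface LogContext {
  traceId: string;
  spanId: string;
  tenantId: string;
}

// Build a structured log record: replace PHI values with their hash,
// then attach the required context. Context is spread last so that
// caller-supplied fields cannot overwrite traceId, spanId, or tenantId.
function buildLogRecord(
  message: string,
  ctx: LogContext,
  fields: LogFields,
  hash: (value: string) => string,
): Record<string, string> {
  const safe: LogFields = {};
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    safe[key] = PHI_FIELDS.has(key) ? hash(value) : value;
  }
  return { ...safe, ...ctx, service: "patient-portal-service", message };
}
```

Injecting the hasher keeps the record builder free of key management and makes it straightforward to test with a deterministic stand-in.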

4. Dashboards

Dashboard	Key panels
Portal Overview	Request rate, error rate, p95 latency, active accounts
Login & Auth	Login attempts, MFA usage, failed auth rate
Upstream Health	Per-upstream error rate and latency
Access Events	Volume by event type, proxy session rate
Export Jobs	Job success/fail rate, p95 completion time
Cache Performance	Hit rate by key prefix, Redis latency

5. Alerts

Alert	Condition	Severity	Action
High error rate	`5xx_rate > 5%` for 3 min	P1	Page on-call; check upstream dependencies
API p95 latency breach	`p95 > 2s` for 5 min	P2	Check upstream health; scale pods
Login failure spike	`login_fail_rate > 20%` for 2 min	P1	Investigate auth issues; check Keycloak
Upstream unavailable	Circuit breaker open for registration/lab/scheduling	P2	Alert on-call; check upstream service
Export job failures	`export_fail_rate > 10%` for 1 hour	P2	Check object storage and NATS
Pod count below 2	Kubernetes PDB violation	P1	Auto-scale review; check node health
Outbox relay lag	`outbox_unpublished_count > 100` for 10 min	P2	Check NATS connectivity; relay worker health
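
Most conditions in this table have the form "threshold exceeded for N minutes". A minimal sketch of that evaluation, assuming one sample per minute in chronological order; `sustainedAbove` is a hypothetical helper and not the alerting system's own rule syntax:

```typescript
// True when the most recent `minutes` samples all exceed `threshold`.
// Returns false when fewer than `minutes` samples are available.
function sustainedAbove(samples: number[], threshold: number, minutes: number): boolean {
  if (minutes <= 0 || samples.length < minutes) return false;
  return samples.slice(-minutes).every((value) => value > threshold);
}

// "High error rate": 5xx_rate > 5% for 3 min.
const errorRates = [0.01, 0.06, 0.07, 0.08];
const shouldPage = sustainedAbove(errorRates, 0.05, 3); // true: last three samples all exceed 5%

// A single dip below the threshold resets the condition.
const withDip = sustainedAbove([0.06, 0.01, 0.08], 0.05, 3); // false
```

Requiring every sample in the window to breach the threshold avoids paging on a single noisy minute, at the cost of delaying the alert by the length of the window.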

6. Runbook Links

Scenario	Runbook
Portal service down	`docs/runbooks/portal-service-outage.md`
Keycloak patient realm unreachable	`docs/runbooks/keycloak-patient-realm.md`
Upstream lab results unavailable	`docs/runbooks/lab-service-unavailable.md`
Export job stuck	`docs/runbooks/export-job-debug.md`
Redis cache eviction storm	`docs/runbooks/redis-cache-eviction.md`
