Patient Portal Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry · 02 DDD

1. Service-Level Indicators (SLIs)

| SLI | Description | Measurement |
| --- | --- | --- |
| Availability | % of portal requests returning non-5xx | 1 - (5xx_count / total_count) per 1-min window |
| API Latency p95 | 95th percentile response time for all /v1/portal/* endpoints | OTEL histogram http.server.duration |
| API Latency p99 | 99th percentile (captures tail latency) | OTEL histogram |
| Upstream error rate | % of upstream service calls returning 5xx or timing out | Per upstream adapter metric |
| Login success rate | % of login attempts completing without error | login_success / login_attempts |
| Export job completion rate | % of export jobs reaching complete within 5 minutes | Job status metrics |
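
The ratio-style SLIs above (availability and login success rate) reduce to simple arithmetic over counters. The following is a minimal sketch of that arithmetic, not the service's actual implementation. The `RequestWindow` shape and both function names are hypothetical, and treating a window with no traffic as fully available is an assumption.

```typescript
// Hypothetical shape for one 1-minute window of request counts.
interface RequestWindow {
  totalCount: number;
  serverErrorCount: number; // responses with a 5xx status
}

// Availability = 1 - (5xx_count / total_count).
// Assumption: a window with no traffic counts as fully available.
function availability(w: RequestWindow): number {
  if (w.totalCount === 0) return 1;
  return 1 - w.serverErrorCount / w.totalCount;
}

// Generic success ratio, e.g. login_success / login_attempts.
// Assumption: zero attempts counts as a rate of 1.
function successRate(successes: number, attempts: number): number {
  if (attempts === 0) return 1;
  return successes / attempts;
}

const sampleWindow: RequestWindow = { totalCount: 2000, serverErrorCount: 10 };
console.log(availability(sampleWindow).toFixed(3)); // "0.995"
console.log(successRate(990, 1000).toFixed(2)); // "0.99"
```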

2. Service-Level Objectives (SLOs)

| SLO | Target | Measurement window |
| --- | --- | --- |
| Availability | ≥ 99.5% | 30-day rolling |
| API Latency p95 | ≤ 800 ms | 5-min rolling |
| API Latency p99 | ≤ 2000 ms | 5-min rolling |
| Upstream error rate | ≤ 2% | 5-min rolling |
| Login success rate | ≥ 99% | 1-day rolling |
| Export job completion ≤ 5 min | ≥ 95% | 1-day rolling |
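
To make the availability target concrete, it can be converted into an error budget. The sketch below uses a time-based reading of the SLO, which is an assumption: a request-based budget would instead be 0.5% of requests in the window. Both function names are hypothetical.

```typescript
// Minutes of full unavailability permitted by an availability target
// over a rolling window of `windowDays` days (time-based reading).
function errorBudgetMinutes(target: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - target) * windowMinutes;
}

// Fraction of the error budget consumed so far
// (0 = untouched, 1 = exhausted, above 1 = SLO violated).
function budgetConsumed(target: number, observedAvailability: number): number {
  return (1 - observedAvailability) / (1 - target);
}

console.log(errorBudgetMinutes(0.995, 30).toFixed(1)); // "216.0"
console.log(budgetConsumed(0.995, 0.998).toFixed(2)); // "0.40"
```

Under this reading, the 99.5% target over 30 days allows roughly 216 minutes of total outage.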

3. OpenTelemetry Instrumentation

Traces:

| Span | Attributes |
| --- | --- |
| portal.api.request | http.method, http.route, tenant_id, account_id_hash |
| portal.upstream.call | upstream.service, upstream.endpoint, upstream.status |
| portal.cache.hit | cache.key_prefix, cache.hit (bool) |
| portal.access_event.write | event_type, acting_as_proxy |

Metrics:

| Metric | Type | Labels |
| --- | --- | --- |
| portal.requests.total | Counter | method, route, status_code, tenant_id |
| portal.request.duration | Histogram | method, route, tenant_id |
| portal.upstream.calls.total | Counter | upstream, status_code |
| portal.cache.hits.total | Counter | key_prefix |
| portal.login.attempts.total | Counter | mfa_used, result (success/fail) |
| portal.export_jobs.total | Counter | status (complete/failed) |
| portal.proxy_session.total | Counter | tenant_id |
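
As an illustration of how the two request metrics might be recorded with the labels listed above, here is a dependency-free sketch. `CounterLike` and `HistogramLike` are local stand-ins shaped like `Counter.add` and `Histogram.record` in `@opentelemetry/api`. The `RequestOutcome` type, the `recordRequest` helper, the example route, and the choice of milliseconds as the duration unit are assumptions, not taken from the service.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Local stand-ins mirroring the shape of the OpenTelemetry instruments.
interface CounterLike {
  add(value: number, attributes?: Attributes): void;
}
interface HistogramLike {
  record(value: number, attributes?: Attributes): void;
}

// Hypothetical summary of one finished HTTP request.
interface RequestOutcome {
  method: string;
  route: string; // route template (e.g. "/v1/portal/example"), not the raw URL
  statusCode: number;
  tenantId: string;
  durationMs: number; // assumed unit
}

function recordRequest(
  requestsTotal: CounterLike,
  requestDuration: HistogramLike,
  o: RequestOutcome,
): void {
  // portal.requests.total: method, route, status_code, tenant_id
  requestsTotal.add(1, {
    method: o.method,
    route: o.route,
    status_code: o.statusCode,
    tenant_id: o.tenantId,
  });
  // portal.request.duration: method, route, tenant_id (no status_code label)
  requestDuration.record(o.durationMs, {
    method: o.method,
    route: o.route,
    tenant_id: o.tenantId,
  });
}
```

Passing the route template rather than the raw URL keeps label cardinality bounded, which matters because `tenant_id` already multiplies the number of series.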

Logs: Structured JSON via @ghasi/logger. All logs include traceId, spanId, tenantId, service: patient-portal-service. PHI fields (patientId, resourceId) are hashed in log output.
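
The PHI hashing rule can be sketched as a pre-processing step applied to log fields before they reach the logger. This does not use or describe the `@ghasi/logger` API, which is not documented here. The `hashPhi` helper and the choice of unsalted SHA-256 are assumptions; a keyed hash (HMAC) may be preferable where identifiers are guessable.

```typescript
import { createHash } from "node:crypto";

// Fields named in the text as PHI.
const PHI_FIELDS = ["patientId", "resourceId"] as const;

type LogFields = Record<string, unknown>;

function sha256Hex(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

// Returns a copy of the log fields with PHI values replaced by their hashes.
// Non-PHI fields and non-string/non-number PHI values are left untouched.
function hashPhi(fields: LogFields): LogFields {
  const out: LogFields = { ...fields };
  for (const key of PHI_FIELDS) {
    const value = out[key];
    if (typeof value === "string" || typeof value === "number") {
      out[key] = sha256Hex(String(value));
    }
  }
  return out;
}

const exampleEntry = hashPhi({
  service: "patient-portal-service",
  tenantId: "tenant-a", // illustrative value
  patientId: "p-123", // illustrative value
  msg: "lab results viewed",
});
console.log(JSON.stringify(exampleEntry));
```

Because the hash is deterministic, log lines for the same patient can still be correlated without exposing the raw identifier.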


4. Dashboards

| Dashboard | Key panels |
| --- | --- |
| Portal Overview | Request rate, error rate, p95 latency, active accounts |
| Login & Auth | Login attempts, MFA usage, failed auth rate |
| Upstream Health | Per-upstream error rate and latency |
| Access Events | Volume by event type, proxy session rate |
| Export Jobs | Job success/fail rate, p95 completion time |
| Cache Performance | Hit rate by key prefix, Redis latency |

5. Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| High error rate | 5xx_rate > 5% for 3 min | P1 | Page on-call; check upstream dependencies |
| API p95 latency breach | p95 > 2s for 5 min | P2 | Check upstream health; scale pods |
| Login failure spike | login_fail_rate > 20% for 2 min | P1 | Investigate auth issues; check Keycloak |
| Upstream unavailable | Circuit breaker open for registration/lab/scheduling | P2 | Alert on-call; check upstream service |
| Export job failures | export_fail_rate > 10% for 1 hour | P2 | Check object storage and NATS |
| Pod count below 2 | Kubernetes PDB violation | P1 | Auto-scale review; check node health |
| Outbox relay lag | outbox_unpublished_count > 100 for 10 min | P2 | Check NATS connectivity; relay worker health |
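
Most conditions above take the form "value > threshold for N minutes", meaning the breach must be sustained before the alert fires. The alerting backend is not named here, so the sketch below only illustrates that semantics. It assumes one sample per minute, ordered oldest first, and the `breachedFor` name is hypothetical.

```typescript
// True when the most recent `forMinutes` samples all exceed `threshold`.
// Assumes one sample per minute, oldest first.
function breachedFor(samples: number[], threshold: number, forMinutes: number): boolean {
  if (forMinutes <= 0 || samples.length < forMinutes) return false;
  return samples.slice(-forMinutes).every((v) => v > threshold);
}

// "High error rate": 5xx_rate > 5% for 3 min
console.log(breachedFor([0.01, 0.02, 0.07, 0.08, 0.06], 0.05, 3)); // true
// The latest minute recovered, so the alert does not fire.
console.log(breachedFor([0.07, 0.08, 0.04], 0.05, 3)); // false
```

Requiring every sample in the window to breach keeps a single noisy minute from paging on-call.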

| Scenario | Runbook |
| --- | --- |
| Portal service down | docs/runbooks/portal-service-outage.md |
| Keycloak patient realm unreachable | docs/runbooks/keycloak-patient-realm.md |
| Upstream lab results unavailable | docs/runbooks/lab-service-unavailable.md |
| Export job stuck | docs/runbooks/export-job-debug.md |
| Redis cache eviction storm | docs/runbooks/redis-cache-eviction.md |