Patient Portal Service — Observability
Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry · 02 DDD
1. Service-Level Indicators (SLIs)
| SLI | Description | Measurement |
|---|---|---|
| Availability | % of portal requests returning non-5xx | 1 - (5xx_count / total_count) per 1-min window |
| API Latency p95 | 95th percentile response time for all /v1/portal/* endpoints | OTEL histogram http.server.duration |
| API Latency p99 | 99th percentile response time for the same endpoints; captures tail latency | OTEL histogram http.server.duration |
| Upstream error rate | % of upstream service calls returning 5xx or timing out | Per upstream adapter metric |
| Login success rate | % of login attempts completing without error | login_success / login_attempts |
| Export job completion rate | % of export jobs reaching the complete status within 5 minutes | Job status metrics |
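The ratio-based SLIs in the table can be sketched as small pure functions. This is a minimal illustration, not the service's implementation: the `WindowCounts` shape and the function names are hypothetical, and treating an empty window as fully healthy is an assumption the table does not state.

```typescript
// Hypothetical per-window counters; field names are illustrative only.
interface WindowCounts {
  total: number;        // all portal requests observed in the 1-min window
  serverErrors: number; // responses with a 5xx status in the same window
}

// Availability = 1 - (5xx_count / total_count).
// Assumption: a window with no traffic counts as fully available.
function availability(w: WindowCounts): number {
  if (w.total === 0) return 1;
  return 1 - w.serverErrors / w.total;
}

// Login success rate = login_success / login_attempts.
// Assumption: a window with no attempts counts as fully successful.
function loginSuccessRate(success: number, attempts: number): number {
  if (attempts === 0) return 1;
  return success / attempts;
}
```

For example, one 5xx response out of 200 requests in a window gives an availability of 0.995, which sits exactly at the 99.5% target.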
2. Service-Level Objectives (SLOs)
| SLO | Target | Measurement window |
|---|---|---|
| Availability | ≥ 99.5% | 30-day rolling |
| API Latency p95 | ≤ 800 ms | 5-min rolling |
| API Latency p99 | ≤ 2000 ms | 5-min rolling |
| Upstream error rate | ≤ 2% | 5-min rolling |
| Login success rate | ≥ 99% | 1-day rolling |
| Export job completion ≤ 5 min | ≥ 95% | 1-day rolling |
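The availability target implies an error budget: 0.5% of a 30-day window (43,200 minutes) is 216 minutes of allowed unavailability. The sketch below shows that arithmetic; the function names are hypothetical, and measuring the budget in whole "bad minutes" is an assumption rather than something this document defines.

```typescript
// Allowed "bad" minutes for an availability target over a rolling window.
function errorBudgetMinutes(target: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - target) * windowMinutes;
}

// Fraction of the budget already spent (1.0 means the budget is exhausted).
function budgetConsumed(badMinutes: number, target: number, windowDays: number): number {
  const budget = errorBudgetMinutes(target, windowDays);
  // A 100% target leaves no budget, so any bad minute exhausts it.
  if (budget === 0) return badMinutes > 0 ? Infinity : 0;
  return badMinutes / budget;
}

// 99.5% over 30 days: roughly 216 minutes (3.6 hours) of budget.
const budget = errorBudgetMinutes(0.995, 30);
```

Under these assumptions, 54 bad minutes in the window would consume about a quarter of the budget.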
3. OpenTelemetry Instrumentation
Traces:
| Span | Attributes |
|---|---|
| portal.api.request | http.method, http.route, tenant_id, account_id_hash |
| portal.upstream.call | upstream.service, upstream.endpoint, upstream.status |
| portal.cache.hit | cache.key_prefix, cache.hit (bool) |
| portal.access_event.write | event_type, acting_as_proxy |
Metrics:
| Metric | Type | Labels |
|---|---|---|
| portal.requests.total | Counter | method, route, status_code, tenant_id |
| portal.request.duration | Histogram | method, route, tenant_id |
| portal.upstream.calls.total | Counter | upstream, status_code |
| portal.cache.hits.total | Counter | key_prefix |
| portal.login.attempts.total | Counter | mfa_used, result (success/fail) |
| portal.export_jobs.total | Counter | status (complete/failed) |
| portal.proxy_session.total | Counter | tenant_id |
Logs: Structured JSON via @ghasi/logger. All logs include traceId, spanId, tenantId, service: patient-portal-service. PHI fields (patientId, resourceId) are hashed in log output.
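One way to hash PHI fields before a record is logged is sketched below. It does not show the `@ghasi/logger` API, which this document does not describe. The `redactPhi` helper, the `PHI_FIELDS` set, and the choice of a keyed HMAC-SHA256 digest are all assumptions made for illustration; the source says only that the fields are hashed.

```typescript
import { createHmac } from "node:crypto";

// Field names taken from the logging rule above; the set itself is illustrative.
const PHI_FIELDS = new Set(["patientId", "resourceId"]);

// Returns a copy of the record with PHI fields replaced by a hex digest.
// Assumption: a keyed hash (HMAC-SHA256) is used so values cannot be
// recovered by hashing guessed identifiers without the secret.
function redactPhi(
  record: Record<string, unknown>,
  secret: string,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] =
      PHI_FIELDS.has(key) && value != null
        ? createHmac("sha256", secret).update(String(value)).digest("hex")
        : value;
  }
  return out;
}
```

Because the digest is deterministic for a given secret, the same patient still correlates across log lines without the raw identifier appearing in output.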
4. Dashboards
| Dashboard | Key panels |
|---|---|
| Portal Overview | Request rate, error rate, p95 latency, active accounts |
| Login & Auth | Login attempts, MFA usage, failed auth rate |
| Upstream Health | Per-upstream error rate and latency |
| Access Events | Volume by event type, proxy session rate |
| Export Jobs | Job success/fail rate, p95 completion time |
| Cache Performance | Hit rate by key prefix, Redis latency |
5. Alerts
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High error rate | 5xx_rate > 5% for 3 min | P1 | Page on-call; check upstream dependencies |
| API p95 latency breach | p95 > 2s for 5 min | P2 | Check upstream health; scale pods |
| Login failure spike | login_fail_rate > 20% for 2 min | P1 | Investigate auth issues; check Keycloak |
| Upstream unavailable | Circuit breaker open for registration/lab/scheduling | P2 | Alert on-call; check upstream service |
| Export job failures | export_fail_rate > 10% for 1 hour | P2 | Check object storage and NATS |
| Pod count below 2 | Kubernetes PDB violation | P1 | Auto-scale review; check node health |
| Outbox relay lag | outbox_unpublished_count > 100 for 10 min | P2 | Check NATS connectivity; relay worker health |
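Most conditions in the table share one shape: a value stays above a threshold for N consecutive minutes. The sketch below shows that check, assuming one sample per minute ordered oldest first. In practice the alerting backend would evaluate this rule; the `breachesFor` helper is hypothetical and only illustrates the semantics.

```typescript
// Returns true when the most recent `minutes` samples all exceed the threshold.
// Assumption: `samples` holds one value per minute, oldest first.
function breachesFor(samples: number[], threshold: number, minutes: number): boolean {
  // Not enough history yet: do not fire.
  if (minutes <= 0 || samples.length < minutes) return false;
  return samples.slice(-minutes).every((v) => v > threshold);
}

// "5xx_rate > 5% for 3 min": fires because the last three samples all exceed 0.05.
const highErrorRate = breachesFor([0.01, 0.06, 0.07, 0.08], 0.05, 3);
```

A single sample at or below the threshold inside the window resets the condition, which keeps short spikes from paging on-call.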
6. Runbook Links
| Scenario | Runbook |
|---|---|
| Portal service down | docs/runbooks/portal-service-outage.md |
| Keycloak patient realm unreachable | docs/runbooks/keycloak-patient-realm.md |
| Upstream lab results unavailable | docs/runbooks/lab-service-unavailable.md |
| Export job stuck | docs/runbooks/export-job-debug.md |
| Redis cache eviction storm | docs/runbooks/redis-cache-eviction.md |