
12 — Observability and Telemetry

Status: populated
Last updated: 2026-04-18
Companion: 01 enterprise-architecture · 04 event-driven-architecture · 11 risks-and-tradeoffs · 13 security-compliance-tenancy · 18 testing-strategy-qa

Observability is not optional instrumentation — it is the operating substrate that lets Ghasi-eHealth prove clinical safety, correctness, availability, cost discipline, and tamper-evidence across 27 services and multiple offline surfaces.

1. Principles

| # | Principle | Consequence |
|---|---|---|
| P1 | Three pillars, one correlation ID | Every log line, metric exemplar, and span carries trace_id, tenant_id, request_id, actor_id_hash. |
| P2 | Structured by default | No free-form production log strings. JSON, schema-validated, versioned (log_schema_version). |
| P3 | PHI never leaves the boundary unredacted | Redaction is a library, not a discipline. Applied at emitter, verified at collector, re-verified at sink. |
| P4 | Tenant isolation extends to telemetry | Tenant-scoped dashboards, alerts, retention, export. |
| P5 | Sampling is policy, not accident | Head-based for hot paths; tail-based for errors/slow paths; 100 % for safety-critical (AI, prescribing, audit). |
| P6 | Cost is a first-class signal | AI spend, egress, storage have SLOs like latency. |
| P7 | Offline is observable when it reconnects | Device-side telemetry buffer with tamper-evident framing; reconciled on sync. |
| P8 | Events are the ledger | Domain events (see 04) are replayable; telemetry augments, not replaces. |
| P9 | Every alert is actionable | Alerts carry a runbook slug, owner, and auto-remediation hook where applicable. |
| P10 | Privacy > curiosity | When in doubt, drop the field. Patient wellbeing outranks debuggability. |

2. Reference stack

| Layer | Tooling (normative) | Notes |
|---|---|---|
| Instrumentation | OpenTelemetry SDK (Node/TypeScript) | Single vendor-neutral API. |
| Collection | OTel Collector (gateway + agent tiers) | Redaction, tenant routing, sampling decisions. |
| Logs | Loki (hot 14 d) → S3 + Parquet (cold 395 d) | Labels: {service, env, region, severity, tenant_id}. |
| Metrics | Prometheus (hot 30 d) → Mimir (13 mo) | Remote-write from Collector. |
| Traces | Tempo (hot 7 d) → S3 (90 d, sampled) | Exemplars link metrics → traces → logs. |
| Dashboards | Grafana (per-tenant folders, RBAC) | Stored as code in grafana/. |
| Alerts | Alertmanager + PagerDuty + Slack #oncall-* | Alerts declared in Git, PR-reviewed. |
| SLO engine | Sloth → Prometheus rules | Generates burn-rate alerts. |
| Audit sink | audit-service (append-only, WORM S3) | Security/compliance only. |
| Incident | PagerDuty + Statuspage + incident-bot | Auto-declares, pulls runbook, opens bridge. |

Services must not import vendor SDKs directly — they import @ghasi/telemetry which wraps OTel and enforces field contracts.

3. Correlation and context

3.1 Required context keys

Every telemetry signal includes the keys below where values exist. Absent values use null, never "".

| Key | Type | Source | Notes |
|---|---|---|---|
| trace_id | hex(32) | W3C traceparent | Generated at edge if absent. |
| span_id | hex(16) | OTel | |
| request_id | uuidv7 | Kong edge | Survives through NATS via baggage. |
| tenant_id | UUID | JWT → baggage | Mandatory outside onboarding. |
| facility_id | UUID? | JWT / context | Hierarchy scope. |
| actor_id_hash | sha256(actor_id + tenant_salt) | auth layer | Raw actor_id never in telemetry. |
| actor_role | enum | JWT | patient, clinician, nurse, admin, system, anonymous |
| encounter_id_hash | sha256 | domain context | Links clinical workflow spans. |
| session_id | ULID | UI | Cross-request stitching. |
| device_id_hash | sha256 | offline SDK | Device binding. |
| app | string | build | clinician-web, patient-portal, provider-mobile, etc. |
| app_version | semver | build | |
| env | enum | runtime | dev, staging, prod, sandbox |
| region | string | runtime | af-kbl-1, af-mzs-1, … |
| log_schema_version | int | library | Current: 1. |
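
As an illustration of the hashed keys in this table, the following is a minimal sketch of a tenant-salted hash, assuming the literal construction sha256(id + tenant_salt) with hex output. The function names are hypothetical and are not taken from @ghasi/telemetry; salt storage and rotation are out of scope here.

```typescript
import { createHash } from "node:crypto";

// Sketch of the tenant-salted hash from the table: sha256(id + tenant_salt), hex-encoded.
// Function names are illustrative assumptions, not the real library API.
export function tenantSaltedHash(rawId: string, tenantSalt: string): string {
  return createHash("sha256").update(rawId + tenantSalt, "utf8").digest("hex");
}

// Absent values are emitted as null, never as an empty string.
export function hashOrNull(rawId: string | null | undefined, tenantSalt: string): string | null {
  return rawId ? tenantSaltedHash(rawId, tenantSalt) : null;
}
```

Because the salt is per tenant, the same actor_id yields different hashes in different tenants, so hashed identifiers cannot be joined across tenant boundaries.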

3.2 Baggage

OTel Baggage carries tenant_id, request_id, actor_role, facility_id, offline_origin, ai_budget_id across HTTP, NATS, and background workers. Baggage is stripped at egress to external APIs.

3.3 Cross-boundary propagation

  • HTTP: traceparent, tracestate, x-ghasi-tenant, x-ghasi-request-id.
  • NATS JetStream: CloudEvents extension attributes carry traceparent + tenantid.
  • WebRTC / virtual care: trace_id per session; each signalling message is a span.
  • Offline → online: Device emits sync_batch_id; server links replayed signals to the original offline trace_id preserved in the batch envelope.
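
The HTTP leg of this propagation can be sketched as follows. The traceparent layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) comes from the W3C Trace Context specification; the helper names and the outboundHeaders function are assumptions for illustration, and tracestate handling is omitted.

```typescript
// W3C Trace Context `traceparent`: version-traceid-parentid-flags.
// Only version 00 is handled in this sketch.
export interface TraceContext {
  traceId: string; // 32 lowercase hex chars, not all zeros
  spanId: string;  // 16 lowercase hex chars, not all zeros
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

export function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  if (traceId === undefined || spanId === undefined || flags === undefined) return null;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

export function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Hypothetical helper: headers a service would attach to an outbound internal HTTP call.
export function outboundHeaders(
  ctx: TraceContext,
  tenantId: string,
  requestId: string,
): Record<string, string> {
  return {
    traceparent: formatTraceparent(ctx),
    "x-ghasi-tenant": tenantId,
    "x-ghasi-request-id": requestId,
  };
}
```

A parse result of null corresponds to the "generated at edge if absent" case in §3.1: the edge would mint a fresh trace_id rather than trust a malformed header.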

4. Logging

4.1 Log schema (v1)

{
  "ts": "2026-04-18T09:12:33.214Z",
  "level": "INFO|DEBUG|WARN|ERROR|FATAL|AUDIT",
  "msg": "short human summary, ≤120 chars, no PHI interpolation",
  "event": "chart.encounter.opened",
  "service": "patient-chart-service",
  "component": "EncounterController",
  "trace_id": "…", "span_id": "…", "request_id": "…",
  "tenant_id": "…", "facility_id": "…",
  "actor_id_hash": "…", "actor_role": "clinician",
  "encounter_id_hash": "…",
  "session_id": "…", "device_id_hash": "…",
  "app": "clinician-web", "app_version": "2026.4.1",
  "env": "prod", "region": "af-kbl-1",
  "attrs": { "duration_ms": 212, "cache_hit": true },
  "error": { "type": "…", "message": "…", "stack": "…" },
  "log_schema_version": 1
}

Rules:

  • msg is static; variable data goes in attrs.
  • event uses domain.entity.action (see 04 catalog).
  • attrs keys are snake_case and namespaced.
  • error.stack only at ERROR/FATAL and only in non-prod, or after scrubbing in prod.
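
The rules above lend themselves to a mechanical check. The sketch below is a hypothetical validator, not the actual @ghasi/telemetry schema validation; it checks snake_case only (the namespacing convention for attrs is not specified here), and it cannot tell whether a prod stack trace has been scrubbed, so it flags any stack in prod.

```typescript
// Hypothetical validator for the rules above; names are illustrative only.
export interface LogDraft {
  level: "INFO" | "DEBUG" | "WARN" | "ERROR" | "FATAL" | "AUDIT";
  msg: string;
  event: string;
  env: "dev" | "staging" | "prod" | "sandbox";
  attrs?: Record<string, unknown>;
  error?: { type: string; message: string; stack?: string };
}

const EVENT_RE = /^[a-z0-9_]+\.[a-z0-9_]+\.[a-z0-9_]+$/; // domain.entity.action
const SNAKE_CASE_RE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

export function validateLogDraft(log: LogDraft): string[] {
  const problems: string[] = [];
  if (log.msg.length > 120) problems.push("msg exceeds 120 chars");
  if (!EVENT_RE.test(log.event)) problems.push("event is not domain.entity.action");
  for (const key of Object.keys(log.attrs ?? {})) {
    if (!SNAKE_CASE_RE.test(key)) problems.push(`attrs key "${key}" is not snake_case`);
  }
  if (log.error?.stack !== undefined) {
    const levelOk = log.level === "ERROR" || log.level === "FATAL";
    if (!levelOk) problems.push("error.stack only allowed at ERROR/FATAL");
    // Prod stacks are allowed only after scrubbing; this sketch cannot verify that.
    if (log.env === "prod") problems.push("error.stack in prod must be scrubbed first");
  }
  return problems;
}
```

An empty result means the draft satisfies the checked rules; it says nothing about whether msg is truly static, which would need a lint rule on the call site instead.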

4.2 Levels

| Level | Use | Sampling |
|---|---|---|
| FATAL | Process-terminating | 100 % |
| ERROR | Contract violation, unhandled | 100 % |
| WARN | Recoverable anomaly, degraded mode | 100 % |
| AUDIT | Security/compliance events | 100 %, also to audit-service (synchronous) |
| INFO | Domain event emission, lifecycle | 100 % prod (filtered by category) |
| DEBUG | Developer detail | 0 % prod, 100 % sandbox |

4.3 PHI / PII redaction

Redaction runs in @ghasi/telemetry at emit time; the Collector re-runs it for defense in depth; a nightly scanner sweeps Loki for leaks.

Deny-list (never logged, not even hashed with the platform key): password, password_hash, otp, totp_secret, access_token, refresh_token, id_token, private_key, webhook_secret, national_id, phone_e164, email, dob, home_address, insurance_card_number, credit_card, clinical_note_free_text, ai_prompt_raw, ai_response_raw, patient_name, provider_personal_phone.

Hashed (tenant-salt): actor_id, device_id, ip_address, patient_id, encounter_id, mrn, npi.

Truncated + categorised (never raw):

  • Clinical notes → note_length, note_lang, note_has_ai_draft.
  • AI prompts / responses → prompt_category, prompt_length, prompt_lang, safety_tags[].
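
A minimal sketch of the three treatments (drop, hash, categorise) applied to a flat attribute map. It carries only a subset of the lists above, handles top-level keys only, and assumes hashed keys are renamed with a _hash suffix to match the key names in §3.1; the hash function is injected so the sketch stays dependency-free.

```typescript
// Subsets of the deny-list and hashed list above; a real list would carry every field.
const DENY = new Set(["password", "otp", "access_token", "email", "dob", "patient_name", "ai_prompt_raw"]);
const HASHED = new Set(["actor_id", "device_id", "ip_address", "patient_id", "encounter_id", "mrn", "npi"]);

type Attrs = Record<string, unknown>;

// `hash` is injected (for example a tenant-salted SHA-256); nested objects are not traversed.
export function redactAttrs(input: Attrs, hash: (value: string) => string): Attrs {
  const out: Attrs = {};
  for (const [key, value] of Object.entries(input)) {
    if (key === "clinical_note_free_text") {
      // Categorised, never raw: keep only the length of the note.
      if (typeof value === "string") out["note_length"] = value.length;
      continue;
    }
    if (DENY.has(key)) continue; // dropped entirely
    if (HASHED.has(key)) {
      // Absent values stay null rather than hashing an empty string.
      out[`${key}_hash`] = value === null || value === undefined ? null : hash(String(value));
      continue;
    }
    out[key] = value;
  }
  return out;
}
```

Running the same function at the emitter and again in the Collector is idempotent for this sketch, since the output contains no deny-listed or raw hashed keys.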

4.4 Mandatory spans per service layer

Hexagonal architecture (02 DDD contexts) aligns with these span layers:

| Layer | Span name pattern | Required attributes |
|---|---|---|
| Presentation (controller) | {service}.controller.{handler} | http.method, http.route, http.status_code |
| Application (use case) | {service}.usecase.{name} | tenant_id, actor_role |
| Domain (aggregate) | {service}.domain.{aggregate}.{method} | domain-specific attributes (no PHI) |
| Port (interface) | {service}.port.{name} | |
| Adapter (infra) | {service}.adapter.{name} (e.g., ...adapter.postgres, ...adapter.nats) | db.system, db.name, messaging.system |

Span lifecycle is enforced by decorators in @ghasi/telemetry.

4.5 Multi-tenancy in logs

  • Loki label set: {service, env, region, severity, tenant_id}. Unlabeled logs dropped at Collector.
  • Per-tenant retention overrides supported via Collector routing.
  • Tenant admin can export only their tenant's logs (signed, time-boxed S3 URL, 24 h).

4.6 Audit logs (separate pipeline)

Audit events are never best-effort. They go through a synchronous, ack'd write to audit-service before the user-facing response completes. Failures return 503 — we do not transact without audit.

Audit categories:

  • Auth: login, MFA, session revoke, impersonation, break-glass.
  • Data: export (DSAR), delete, bulk access.
  • Clinical: order sign, note sign, result release, medication dispense.
  • Interop: HL7 ingest, FHIR export, e-prescribing route.
  • Moderation: AI block, human override, appeal.
  • Admin: tenant config, role change, policy change, licensing assignment.
  • Break-glass: invocation, reason, window.

Schema: canonical log schema + audit.action, audit.target, audit.before_hash, audit.after_hash, audit.signed_by, audit.chain_prev_hash (hash-chained for tamper evidence).
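
To illustrate how audit.chain_prev_hash gives tamper evidence, here is a minimal hash-chain sketch. It assumes SHA-256 over a JSON-encoded tuple of the previous hash and the record fields; the record shape, the hash field, and the all-zero genesis value are assumptions for the example, and signing (audit.signed_by) is left out.

```typescript
import { createHash } from "node:crypto";

export interface AuditRecord {
  action: string;
  target: string;
  chain_prev_hash: string; // hash of the previous record, or GENESIS for the first
  hash: string;            // hash of this record (hypothetical field for the sketch)
}

const GENESIS = "0".repeat(64);

function recordHash(action: string, target: string, prevHash: string): string {
  // JSON array encoding keeps field boundaries unambiguous.
  return createHash("sha256").update(JSON.stringify([prevHash, action, target])).digest("hex");
}

export function appendAudit(chain: readonly AuditRecord[], action: string, target: string): AuditRecord[] {
  const last = chain[chain.length - 1];
  const prev = last ? last.hash : GENESIS;
  return [...chain, { action, target, chain_prev_hash: prev, hash: recordHash(action, target, prev) }];
}

// Returns the index of the first broken link, or -1 if the chain is intact.
export function firstBrokenLink(chain: readonly AuditRecord[]): number {
  let prev = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const rec = chain[i];
    if (rec === undefined) return i;
    if (rec.chain_prev_hash !== prev || rec.hash !== recordHash(rec.action, rec.target, prev)) return i;
    prev = rec.hash;
  }
  return -1;
}
```

Editing any stored record changes its recomputed hash, so verification fails at that index; an integrity ratio such as audit_chain_integrity_ratio could be derived from the share of chains that verify cleanly.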

5. Metrics taxonomy

5.1 Families

Three families, strictly named:

  • USE per resource: process_cpu_seconds_total, db_pool_connections{state}, nats_consumer_pending_messages.
  • RED per request path: http_requests_total, http_request_duration_seconds, http_requests_errors_total.
  • Domain (DKPIs): <domain>_<entity>_<action>_total — see §5.3+.

Naming:

  • _total for counters; _seconds for latency; _ratio for 0–1; _bytes for sizes.
  • Labels must have bounded cardinality. tenant_id is allowed; user_id is never a label.
  • High-cardinality dimensions go to exemplars + analytics, not Prometheus labels.
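
The naming rules can be checked in CI with a small linter. The sketch below covers only the four suffix rules and the user_id ban stated above; the lowercase snake_case check is an added assumption based on the example names, and gauges (such as nats_consumer_pending_messages) are deliberately not covered.

```typescript
export type MetricKind = "counter" | "latency" | "ratio" | "size";

const SUFFIX: Record<MetricKind, string> = {
  counter: "_total",
  latency: "_seconds",
  ratio: "_ratio",
  size: "_bytes",
};

// user_id is the only label the text bans by name; extend as needed.
const FORBIDDEN_LABELS = new Set(["user_id"]);

export function lintMetric(metricName: string, kind: MetricKind, labels: string[]): string[] {
  const problems: string[] = [];
  // Assumption: names are lowercase snake_case, as in all examples in this section.
  if (!/^[a-z][a-z0-9_]*$/.test(metricName)) problems.push("name must be lowercase snake_case");
  if (!metricName.endsWith(SUFFIX[kind])) problems.push(`${kind} must end with ${SUFFIX[kind]}`);
  for (const label of labels) {
    if (FORBIDDEN_LABELS.has(label)) problems.push(`label ${label} is not allowed`);
  }
  return problems;
}
```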

5.2 Standard labels

Every metric: service, env, region, tenant_tier (public|district|regional|referral|national). Domain metrics may add tenant_id when cardinality permits; Collector enforces a per-series cap.

5.3 Per-service SLIs (default, all services)

| SLI | Definition | Target |
|---|---|---|
| Availability | 1 − errors/total over read paths (5xx excluding 499) | 99.9 % |
| Latency p95 | http_request_duration_seconds p95 | ≤ 300 ms internal, ≤ 500 ms edge |
| Latency p99 | | ≤ 800 ms |
| DB saturation | db_pool_in_use / db_pool_max | < 0.8 sustained |
| NATS consumer lag | nats_consumer_pending_messages | < 5 000 / partition |
| Audit write success | audit_write_success_ratio | ≥ 99.99 % |

5.4 Domain metrics — highlights

  • identity: auth_login_total{result}, auth_mfa_challenge_total{method,result}, auth_session_revoked_total{reason}, auth_breakglass_invoked_total{reason_code}.
  • registration: reg_patient_registered_total{source}, reg_duplicate_detected_total, reg_merge_applied_total.
  • patient-chart: chart_opened_total, chart_read_latency_seconds, allergy_banner_rendered_ratio.
  • orders: order_placed_total{class}, order_signed_total, order_ddi_flag_total.
  • medication: rx_issued_total, rx_dispensed_total, rx_dispense_latency_seconds, rx_substituted_total{reason}.
  • eprescribing-gateway: eprx_route_total{corridor,result}, eprx_subscription_backlog.
  • laboratory: lab_accession_total, lab_specimen_rejected_total{reason}, lab_result_released_total{priority}.
  • radiology: rad_study_total, rad_report_finalised_total, rad_tat_seconds.
  • virtual-care: vcare_session_started_total, vcare_session_dropped_total{reason}, vcare_ttfi_seconds.
  • patient-portal: portal_login_total, portal_result_viewed_total, portal_appt_booked_total.
  • immunizations: imm_dose_total{product}, imm_field_offline_queue_size.
  • billing/claims: bill_charge_total, claim_submitted_total, claim_paid_total, claim_denial_total{reason}.
  • interop: hl7_msg_total{type,result}, fhir_subscription_delivered_total.
  • audit: audit_events_total{category}, audit_chain_integrity_ratio.
  • ai-gateway: see §7.
  • offline: see §8.

5.5 Exemplars

Every domain counter/histogram carries exemplars linking to trace IDs: 1 in N successful requests and 100 % of errored requests. Grafana panels enable one-click metric → trace → log drill-down.

6. Distributed tracing

6.1 Rules

  • Every inbound HTTP, WebSocket, and NATS consumer creates a root or child span.
  • Every outbound DB, cache, HTTP, NATS producer, object-store, AI call creates a child span.
  • Spans have otel.status_code, error.type, domain-specific attributes (no PHI).
  • Any span crossing a trust boundary (tenant → external, online → offline, sync-in, e-prescribing cross-corridor) carries trust.boundary=<name> and its own error budget.

6.2 Domain span attributes

  • ghasi.tenant_id, ghasi.facility_id, ghasi.actor_role.
  • ghasi.patient_id_hash, ghasi.encounter_id_hash.
  • ghasi.ai.model, ghasi.ai.provider, ghasi.ai.purpose, ghasi.ai.tokens_in, ghasi.ai.tokens_out, ghasi.ai.cost_micro_usd.
  • ghasi.offline.batch_id, ghasi.offline.source.
  • ghasi.eprx.corridor, ghasi.eprx.jurisdiction.

6.3 Sampling

| Path | Strategy | Rate |
|---|---|---|
| Health / liveness | Drop | 0 % |
| Read APIs (cached) | Head-based | 1 % |
| Write APIs | Head-based | 10 % |
| Clinical safety-critical (order sign, dispense, result release, allergy write) | Always-on | 100 % |
| AI inference | Always-on | 100 % |
| Errors (4xx ≥ 429, all 5xx) | Tail-based override | 100 % |
| Slow requests (> SLO p99) | Tail-based override | 100 % |
| Offline sync replay | Always-on | 100 % |
| Audit writes | Always-on | 100 % |
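
The sampling policy can be read as a head decision plus a tail override. The sketch below is one possible encoding, not the Collector's actual configuration: the path-class names are invented for the example, and the head decision is made deterministic by bucketing on the first 8 hex characters of trace_id so that every service reaches the same verdict for a given trace.

```typescript
export type PathClass = "health" | "read" | "write" | "safety_critical" | "ai" | "offline_replay" | "audit";

// Head sampling rates from the table above.
const HEAD_RATE: Record<PathClass, number> = {
  health: 0,
  read: 0.01,
  write: 0.1,
  safety_critical: 1,
  ai: 1,
  offline_replay: 1,
  audit: 1,
};

// Deterministic head decision: map the first 8 hex chars of trace_id onto [0, 1).
export function headSample(traceId: string, pathClass: PathClass): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < HEAD_RATE[pathClass];
}

export interface Outcome {
  statusCode: number;
  durationMs: number;
  sloP99Ms: number;
}

// Tail override: keep errors (4xx from 429 upward, all 5xx) and requests slower than the SLO p99.
export function tailKeep(o: Outcome): boolean {
  return o.statusCode >= 429 || o.durationMs > o.sloP99Ms;
}

export function keepTrace(traceId: string, pathClass: PathClass, o: Outcome): boolean {
  if (pathClass === "health") return false; // health checks are always dropped
  return headSample(traceId, pathClass) || tailKeep(o);
}
```

In a real pipeline the tail decision is taken after the trace completes, typically in the Collector gateway tier, because status and duration are unknown when the root span starts.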

6.4 Redaction in spans

db.statement captured parameterised only (placeholders, never values). URL paths templatised (/v1/patients/:id/chart). Request/response bodies never in span attributes. FHIR resource bodies never in spans.
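
URL templatisation can be sketched as a segment-wise rewrite. This is a fallback heuristic under the assumption that identifiers are UUIDs or purely numeric; where the router already exposes the matched route pattern, that value is the better source for http.route.

```typescript
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_RE = /^\d+$/;

// Replace identifier-looking path segments with ":id" and drop the query string,
// so no patient or encounter identifier ends up in a span attribute.
export function templatisePath(rawUrlPath: string): string {
  const pathOnly = rawUrlPath.split("?")[0] ?? "";
  return pathOnly
    .split("/")
    .map((segment) => (UUID_RE.test(segment) || NUMERIC_RE.test(segment) ? ":id" : segment))
    .join("/");
}
```

Version segments such as v1 are left alone because they are not purely numeric.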

7. AI telemetry

AI is the highest-risk, highest-cost surface in a clinical platform. It gets first-class observability.

7.1 Per-invocation dimensions

| Dimension | Example | Purpose |
|---|---|---|
| ai.purpose | chart.scribe, orders.suggest, rad.report_draft, moderation.clinical | SLOs, cost allocation |
| ai.model / ai.model_version | claude-sonnet-4-6@20261001 | Drift, A/B |
| ai.provider | anthropic, openai, local-llm | Failover |
| ai.prompt_template_id + hash | chart.scribe.v7 | Provenance |
| ai.tokens_in / ai.tokens_out | | Cost |
| ai.cost_micro_usd | | Cost |
| ai.latency_ms_ttfb / ttlb | | UX |
| ai.cache.hit / ai.cache.key_hash | | Cost |
| ai.safety.pre / ai.safety.post | {"unsafe_clinical":0.01,...} | Safety |
| ai.safety.action | allow, redact, block, escalate | Safety |
| ai.guardrail.violations[] | phi_leak, dose_unsafe, contraindication_missed | Safety |
| ai.output.citations[] | FHIR resource IDs (hashed) | Provenance |
| ai.output.grounding_score | 0–1 | Hallucination |
| ai.human_override | bool | HITL closure |

Prompts and responses themselves go to a separate, encrypted, tenant-scoped store (ai-transcripts within ai-gateway-service) with tighter retention. Telemetry carries only hashes, categories, and safety signals.

7.2 AI SLIs and SLOs

| SLI | Target |
|---|---|
| Scribe TTFB p95 online | ≤ 1.2 s |
| Scribe TTFB p95 offline (local SLM) | ≤ 500 ms |
| Moderation decision latency p99 | ≤ 400 ms |
| Safety false-negative rate (sampled audit) | ≤ 0.1 % |
| Safety false-positive rate | ≤ 2 % |
| AI cost per active clinician-month | ≤ budget per tenant tier |
| Cache hit rate (prompt + ctx) | ≥ 40 % scribe, ≥ 60 % moderation |
| Provider failover success | ≥ 99 % |
| Grounding score (RAG paths) | p50 ≥ 0.8 |

7.3 AI cost observability

  • Budget IDs attach to every invocation; rollups by tenant_id × purpose × model.
  • Circuit breakers fire when a tenant/purpose pair exceeds 120 % of its hourly budget. The response escalates in order: model downgrade (Opus → Sonnet → Haiku), aggressive caching, then fail-closed to static / human-only.
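
A sketch of the cost arithmetic and the breaker ladder follows. Token prices are inputs (no real price list is assumed), and the escalation rule is an assumption: the text gives the order of the steps but not what triggers each one, so this example moves one step per evaluation that still finds a breach and resets on a clean evaluation.

```typescript
export type DegradeStep = "normal" | "model_downgrade" | "aggressive_cache" | "fail_closed";

const LADDER: DegradeStep[] = ["normal", "model_downgrade", "aggressive_cache", "fail_closed"];

// Prices are inputs, in micro-USD per 1 000 tokens; no real price list is assumed.
export function costMicroUsd(
  tokensIn: number,
  tokensOut: number,
  priceInPer1k: number,
  priceOutPer1k: number,
): number {
  return Math.round((tokensIn * priceInPer1k + tokensOut * priceOutPer1k) / 1000);
}

// Breach means spend above 120 % of the hourly budget; integer math avoids float rounding.
export function isBudgetBreached(hourlySpendMicroUsd: number, hourlyBudgetMicroUsd: number): boolean {
  return hourlySpendMicroUsd * 10 > hourlyBudgetMicroUsd * 12;
}

// Assumption: each evaluation that still finds a breach moves one step down the ladder,
// and a clean evaluation resets to normal.
export function nextStep(current: DegradeStep, breached: boolean): DegradeStep {
  if (!breached) return "normal";
  const i = LADDER.indexOf(current);
  return LADDER[Math.min(i + 1, LADDER.length - 1)] ?? "fail_closed";
}
```

For example, 1 000 input tokens at 3 000 micro-USD per 1 000 and 500 output tokens at 15 000 micro-USD per 1 000 come to 3 000 + 7 500 = 10 500 micro-USD; those prices are made up for the arithmetic.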

7.4 Provenance and replay

Each AI output stores {prompt_template_id, prompt_hash, context_hash, model, model_version, params, safety_decisions, citations} — sufficient to replay for audit. Replay exposed via internal ai-audit admin tool, logged as AUDIT.

8. Offline telemetry

8.1 Device SDK

Offline runtimes (provider mobile, desktop registration, web fallback) run a local telemetry buffer:

  • SQLite-backed, append-only, size-capped (default 64 MB).
  • Each batch MAC-signed with the device binding key; server detects tampering on replay.
  • Buffer encrypts at rest using device-bound key.
  • Batches chunked by sync_batch_id, ordered by monotonic sequence number; gaps flagged.
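
The tamper-evident framing can be sketched as below. HMAC-SHA256 is an assumption (the text says only "MAC-signed"), as are the field names other than sync_batch_id; key derivation from the device binding and at-rest encryption are not shown.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

export interface SignedBatch {
  sync_batch_id: string;
  seq: number;     // monotonic sequence number
  payload: string; // serialised telemetry records
  mac: string;     // hex HMAC-SHA256 over the fields above (assumed algorithm)
}

function computeMac(key: Buffer, batchId: string, seq: number, payload: string): Buffer {
  return createHmac("sha256", key).update(JSON.stringify([batchId, seq, payload])).digest();
}

export function signBatch(key: Buffer, batchId: string, seq: number, payload: string): SignedBatch {
  return { sync_batch_id: batchId, seq, payload, mac: computeMac(key, batchId, seq, payload).toString("hex") };
}

export function verifyBatch(key: Buffer, batch: SignedBatch): boolean {
  const expected = computeMac(key, batch.sync_batch_id, batch.seq, batch.payload);
  const actual = Buffer.from(batch.mac, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}

// Returns the sequence numbers missing between the lowest and highest seq received.
export function findGaps(batches: readonly SignedBatch[]): number[] {
  const seqs = batches.map((b) => b.seq);
  if (seqs.length === 0) return [];
  const seen = new Set(seqs);
  const gaps: number[] = [];
  for (let s = Math.min(...seqs); s <= Math.max(...seqs); s++) {
    if (!seen.has(s)) gaps.push(s);
  }
  return gaps;
}
```

On replay the server would verify each batch and then check for gaps; a failed MAC or a missing sequence number is the kind of evidence that could feed offline.tamper.suspected.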

8.2 Offline signals

| Signal | Fields |
|---|---|
| offline.bundle.activated | bundle_id, size_bytes, integrity_ok |
| offline.bundle.integrity_failure | bundle_id, expected_hash, actual_hash, reason |
| offline.sync.started/completed/failed | batch_id, items, bytes, duration_ms, conflicts |
| offline.conflict.detected | entity, strategy, winner, loser_preserved |
| offline.device.bind/unbind/rebind_denied | device_id_hash, reason |
| offline.tamper.suspected | signal, severity, evidence_hash |
| offline.clock.skew | skew_seconds |
| offline.outbox.size | rolling |

8.3 Offline SLIs

| SLI | Target |
|---|---|
| Sync success rate | ≥ 99 % per device/day |
| Conflict rate (of synced writes) | ≤ 1 % |
| Bundle tamper detection | 100 % of test cases caught |
| Device-binding mismatch blocked | 100 % |
| Reconnect → sync complete for ≤ 10 MB | ≤ 60 s |

9. Dashboard catalogue

Dashboards live as JSON in grafana/, provisioned via CI. Each has an owner and SLO link.

9.1 Global

  1. Platform Overview — availability, latency, saturation across 27 services.
  2. Error Budget Burn — per service, per SLO, 1 h / 6 h / 24 h.
  3. Tenant Health — per-tenant error rate, latency, AI spend, offline sync.
  4. Release Radar — deploys correlated with error-rate / latency deltas.
  5. Cost Control — infra + AI + egress.
  6. Audit Integrity — hash-chain continuity, write-failure rate.

9.2 Per-capability

  • Identity & Access — login, MFA, session revocations, break-glass.
  • Clinical — chart open, order placement, result release, allergy writes, note sign.
  • Pharmacy — queue depth, verify/dispense latency, substitutions.
  • Lab — accession funnel, TAT, result release.
  • Radiology — study volume, TAT, AI-drafted report acceptance rate.
  • Virtual care — session lifecycle, drop rate, reconnect success.
  • Interop (FHIR + HL7 v2) — ingest volume, error rate, DLQ depth.
  • E-prescribing — route success, subscription backlog, corridor latency.
  • Immunizations / HMIS — dose volume, offline queue depth, HMIS export success.
  • Billing / claims — charge funnel, denial rate, remittance lag.
  • Patient portal — login, result view, appointment book success.
  • AI / Scribe — TTFB, cache hit, safety actions, grounding, cost.
  • Offline — sync success, conflict rate, tamper flags.
  • Safety & moderation — blocks, overrides, SLA to decision.
  • Data platform — NATS consumer lag, DLQ depth, outbox lag.

9.3 Per-service template

Each service automatically gets a dashboard with: RED panels, USE panels, top errors, top slow endpoints, DB pool, NATS lag, dependency map, audit integrity.

10. Alerts and SLOs

10.1 SLO framework

  • All SLOs defined in slo/*.yaml (Sloth format), reviewed via PR.
  • Multi-window, multi-burn-rate alerts (Google SRE): 1 h + 5 min (fast burn), 6 h + 30 min (slow burn).
  • 28-day rolling windows; error-budget policy enforced.
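
The burn-rate arithmetic behind these alerts can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The threshold is a parameter because this section does not state one; 14.4 for the fast pair and 6 for the slow pair are the values commonly quoted from the Google SRE Workbook for a 30-day window and are used here only as example inputs.

```typescript
// burn rate = observed error ratio / error budget ratio, where budget = 1 - SLO target.
export function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Minutes of full outage a target allows over the rolling window.
export function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

export interface WindowPair {
  longRatio: number;  // error ratio over the long window, e.g. 1 h
  shortRatio: number; // error ratio over the short window, e.g. 5 min
}

// Fires only when both windows exceed the threshold: the long window shows the burn is
// significant, the short window shows it is still happening.
export function burnAlert(w: WindowPair, sloTarget: number, threshold: number): boolean {
  return burnRate(w.longRatio, sloTarget) > threshold && burnRate(w.shortRatio, sloTarget) > threshold;
}
```

As a worked example, a 99.9 % target over 28 days allows 0.001 × 28 × 24 × 60 ≈ 40.3 minutes of full outage. A sustained 2 % error ratio against that target is a burn rate of about 20, which would trip a fast-burn threshold of 14.4.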

10.2 Alert contract

alert: OrderSignFailure
expr: rate(orders_sign_errors_total[5m]) > 0
for: 1m
severity: SEV-1
owner: clinical-orders
runbook: https://runbooks.ghasi/clinical/order-sign-failure
auto_remediation: none
dashboards: [clinical/orders, audit/integrity]
slos: [orders.sign.availability]

Alerts without runbook + owner are rejected in CI.
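
The CI rejection rule could look like the sketch below. The interface mirrors only a few fields of the contract above, and the severity check against the SEV-1 to SEV-4 ladder is an added assumption; parsing the alert files themselves is out of scope.

```typescript
export interface AlertRule {
  alert: string;
  expr: string;
  severity: string;
  owner?: string;
  runbook?: string;
}

export function lintAlert(rule: AlertRule): string[] {
  const problems: string[] = [];
  if (!rule.owner || rule.owner.trim() === "") problems.push(`${rule.alert}: missing owner`);
  if (!rule.runbook || rule.runbook.trim() === "") problems.push(`${rule.alert}: missing runbook`);
  // Assumption: severities are restricted to the SEV-1..SEV-4 ladder.
  if (!/^SEV-[1-4]$/.test(rule.severity)) problems.push(`${rule.alert}: severity must be SEV-1..SEV-4`);
  return problems;
}

// CI would fail the build when any rule reports a problem.
export function lintAll(rules: AlertRule[]): string[] {
  const problems: string[] = [];
  for (const rule of rules) problems.push(...lintAlert(rule));
  return problems;
}
```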

10.3 Severity ladder

| Severity | Definition | Response |
|---|---|---|
| SEV-1 | Patient-safety, data-loss, audit breach, payment outage | Page + bridge + Statuspage ≤ 5 min |
| SEV-2 | Capability degraded, SLO fast-burn | Page primary on-call |
| SEV-3 | Slow-burn SLO, non-blocking | Ticket + Slack |
| SEV-4 | Housekeeping | Ticket |

10.4 Example alerts (non-exhaustive)

  • IdentityLoginErrorRate (SEV-2): 5xx on /auth/* > 1 % for 5 m.
  • OrderSignFailure (SEV-1): any ERROR on order sign path.
  • AllergyWriteFailure (SEV-1): any ERROR on allergy write; safety-critical.
  • AuditWriteFailure (SEV-1): any failed audit write; blocks transactions.
  • BreakGlassSpike (SEV-2): break-glass invocations > 3× 7-day baseline in a facility.
  • ResultReleaseDelay (SEV-2): verified lab results unreleased > 30 min.
  • EprxCorridorLatency (SEV-2): p95 > 5 s on cross-corridor routing.
  • AIScribeTTFBSlow (SEV-3): 6 h burn > 1× budget on scribe TTFB.
  • OfflineBundleTamper (SEV-1): any offline.bundle.integrity_failure.
  • OfflineSyncConflictSpike (SEV-2): conflict rate > 2 % over 30 m per tenant.
  • AICostBudgetBreach (SEV-2): tenant hourly AI spend > 120 % budget.
  • NATSConsumerLagHigh (SEV-2): any consumer lag > 50 k for 10 m.
  • HL7IngestBacklog (SEV-2): HL7 queue depth > threshold 15 m.
  • DSARDeletionSLA (SEV-2): open DSAR > 27 days.

10.5 Error budget policy

  • 50 % burn → notify service owner; feature-freeze optional.
  • 75 % → feature-freeze mandatory; reliability PRs only.
  • 100 % → rollback recent risky changes; post-incident review required.

Policy enforcement is an automated GitHub check against the SLO service.
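
The thresholds in this policy map onto a simple gate. The sketch below is hypothetical: the action names, the "reliability" PR label, and the choice to keep the freeze in force at 100 % burn are assumptions, not a description of the actual GitHub check.

```typescript
export type BudgetAction = "none" | "notify_owner" | "feature_freeze" | "rollback_and_review";

// burnedFraction is the share of the window's error budget consumed, e.g. 0.6 for 60 %.
export function budgetAction(burnedFraction: number): BudgetAction {
  if (burnedFraction >= 1.0) return "rollback_and_review";
  if (burnedFraction >= 0.75) return "feature_freeze";
  if (burnedFraction >= 0.5) return "notify_owner";
  return "none";
}

// Hypothetical merge gate: from the mandatory freeze onward, only reliability PRs pass.
export function mayMerge(burnedFraction: number, prLabels: string[]): boolean {
  const action = budgetAction(burnedFraction);
  if (action === "none" || action === "notify_owner") return true;
  return prLabels.includes("reliability");
}
```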

11. Runbook template

Every alert points to a runbook. Standard shape:

# Runbook: {Alert}

**Severity:** SEV-N
**Owner:** {team}
**Dashboards:** [{dashboard-links}]
**Related SLOs:** [{slo-ids}]

## Symptoms
- What users / clinicians see.

## Immediate triage (≤ 5 min)
1. Confirm alert (query: `{promql}`).
2. Check dependencies (DB, NATS, Keycloak, Kong).
3. Check recent deploys.

## Diagnosis paths
- Path A: {hypothesis} → {commands} → {indicator}.
- Path B: {hypothesis} → {commands} → {indicator}.

## Resolution steps
- {action} (auto-remediation `{hook}` if applicable).
- Rollback: `{command}`.

## Follow-up
- Postmortem required if SEV-1/2.
- Action items tracked against error-budget dashboard.

Runbooks live in runbooks/ repo; incident-bot surfaces them in the incident channel.

12. Incident response hooks

12.1 Auto-declare

A SEV-1, or two concurrent SEV-2s, auto-declares an incident:

  1. incident-bot opens Slack #inc-YYYYMMDD-NN.
  2. Creates PagerDuty incident; pages on-call.
  3. Posts runbook, burn-rate, recent deploys, related alerts.
  4. Opens bridge (Zoom / Meet) with auto-invite.
  5. Updates Statuspage with tenant-scoped visibility.
  6. Timeline logger captures human comments + alert transitions.
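
The trigger condition and channel naming can be sketched as below. Treating "concurrent" as "firing at the same evaluation", using the UTC date, and zero-padding NN to two digits are assumptions; the functions are illustrative and not incident-bot's real interface.

```typescript
export interface FiringAlert {
  alertName: string;
  severity: "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";
}

// Auto-declare on any SEV-1, or on two or more SEV-2 alerts firing together.
export function shouldAutoDeclare(firing: readonly FiringAlert[]): boolean {
  const hasSev1 = firing.some((a) => a.severity === "SEV-1");
  const sev2Count = firing.filter((a) => a.severity === "SEV-2").length;
  return hasSev1 || sev2Count >= 2;
}

// Channel name in the #inc-YYYYMMDD-NN shape; NN is the incident's ordinal for that UTC day.
export function incidentChannel(dayUtc: Date, ordinal: number): string {
  const ymd = dayUtc.toISOString().slice(0, 10).replace(/-/g, "");
  return `#inc-${ymd}-${String(ordinal).padStart(2, "0")}`;
}
```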

12.2 Automated remediation

| Trigger | Action |
|---|---|
| AI provider error rate > 5 % | Failover to secondary + model downgrade |
| Scribe TTFB p95 breach | Reduce streaming concurrency per tenant |
| NATS DLQ spike | Pause producer, alert, enable dead-letter drain |
| Offline tamper spike from tenant | Auto-revoke affected device bindings; require re-enrol |
| Audit write failure | Trip global "no-write" breaker on affected capability |
| Cost breach | Model downgrade → cache-only → 503 with graceful copy |

Every auto-remediation action is logged as AUDIT and is reversible with a single command.

12.3 Postmortems

  • Blameless template generated from incident timeline + telemetry (pm-bot).
  • Required within 5 business days for SEV-1/2.
  • Action items tracked with SLA; overdue action items appear on the error-budget dashboard.

13. Data retention and residency (telemetry)

| Signal | Retention tiers (hot → warm → cold) | Max |
|---|---|---|
| App logs (non-PHI) | 14 d Loki → 90 d S3 → 395 d Glacier | 395 d |
| Audit logs | 30 d hot → 7 y WORM S3 | 7 y |
| Metrics | 30 d Prom → 13 mo Mimir | 13 mo |
| Traces (sampled) | 7 d Tempo → 90 d S3 | 90 d |
| Traces (errors, AI, safety-critical) | 30 d → 395 d | 395 d |
| AI transcripts | 30 d → 180 d tenant-config | 365 d |
| Safety-flagged AI | 180 d → 2 y | 2 y |
| Offline device telemetry | 14 d post-sync → 90 d | 90 d |

Residency: tenant-pinned to home region (af-kbl-1, af-mzs-1, …). Cross-region replication off by default.

14. Per-service implementation checklist

A service is not production-ready until:

  • Uses @ghasi/telemetry wrapper (no raw OTel).
  • Emits §3 correlation context on every signal.
  • Passes PHI redaction CI test (fixtures of bad logs).
  • Declares RED SLIs + ≥ 1 domain SLI.
  • Has a per-service dashboard provisioned.
  • Has ≥ 1 SEV-2 alert with runbook.
  • Declares NATS consumer lag alert (if a consumer).
  • Uses synchronous audit write on PHI-touching writes.
  • Offline surfaces use buffered, MAC-signed telemetry.
  • AI surfaces emit §7.1 dimensions and pass safety-replay test.
  • Domain events registered with telemetry + analytics projections.
  • Retention / residency overrides documented if non-default.

15. PagerDuty integration

  • Services map to PagerDuty services 1:1.
  • Escalation: primary (5 min) → secondary (15 min) → manager (30 min).
  • Schedules are Git-managed via pagerduty-tf.
  • Alerts carry severity, service, runbook, dashboards as payload — PagerDuty auto-formats the incident.
  • Statuspage integration: SEV-1 auto-publishes a component-scoped incident; SEV-2 post-mortem-first with 30-min delay.

16. Governance

  • This document is versioned; material changes require an RFC under rfcs/observability/.
  • Log schema breaking changes bump log_schema_version and carry a 2-release deprecation window.
  • Alert and SLO changes are PR-reviewed by SRE + capability owner + (for safety) Clinical Informatics.

17. Open questions

  • Whether on-device anomaly detection for offline tamper should ship in provider-mobile v2 or remain server-side only.
  • Long-term strategy for per-facility (not just per-tenant) SLOs — currently tenant-tier granularity; some facilities will want dedicated views.