12 — Observability and Telemetry

Status: populated
Last updated: 2026-04-18
Companion: 01 enterprise-architecture · 04 event-driven-architecture · 11 risks-and-tradeoffs · 13 security-compliance-tenancy · 18 testing-strategy-qa

Observability is not optional instrumentation — it is the operating substrate that lets Ghasi-eHealth prove clinical safety, correctness, availability, cost discipline, and tamper-evidence across 27 services and multiple offline surfaces.

1. Principles

#	Principle	Consequence
P1	Three pillars, one correlation ID	Every log line, metric exemplar, and span carries `trace_id`, `tenant_id`, `request_id`, `actor_id_hash`.
P2	Structured by default	No free-form production log strings. JSON, schema-validated, versioned (`log_schema_version`).
P3	PHI never leaves the boundary unredacted	Redaction is a library, not a discipline. Applied at emitter, verified at collector, re-verified at sink.
P4	Tenant isolation extends to telemetry	Tenant-scoped dashboards, alerts, retention, export.
P5	Sampling is policy, not accident	Head-based for hot paths; tail-based for errors/slow paths; 100 % for safety-critical (AI, prescribing, audit).
P6	Cost is a first-class signal	AI spend, egress, storage have SLOs like latency.
P7	Offline is observable when it reconnects	Device-side telemetry buffer with tamper-evident framing; reconciled on sync.
P8	Events are the ledger	Domain events (see 04) are replayable; telemetry augments, not replaces.
P9	Every alert is actionable	Alerts carry a runbook slug, owner, and auto-remediation hook where applicable.
P10	Privacy > curiosity	When in doubt, drop the field. Patient wellbeing outranks debuggability.

2. Reference stack

Layer	Tooling (normative)	Notes
Instrumentation	OpenTelemetry SDK (Node/TypeScript)	Single vendor-neutral API.
Collection	OTel Collector (gateway + agent tiers)	Redaction, tenant routing, sampling decisions.
Logs	Loki (hot 14 d) → S3 + Parquet (cold 395 d)	Labels: `{service, env, region, severity, tenant_id}`.
Metrics	Prometheus (hot 30 d) → Mimir (13 mo)	Remote-write from Collector.
Traces	Tempo (hot 7 d) → S3 (90 d, sampled)	Exemplars link metrics → traces → logs.
Dashboards	Grafana (per-tenant folders, RBAC)	Stored as code in `grafana/`.
Alerts	Alertmanager + PagerDuty + Slack `#oncall-*`	Alerts declared in Git, PR-reviewed.
SLO engine	Sloth → Prometheus rules	Generates burn-rate alerts.
Audit sink	audit-service (append-only, WORM S3)	Security/compliance only.
Incident	PagerDuty + Statuspage + `incident-bot`	Auto-declares, pulls runbook, opens bridge.

Services must not import vendor SDKs directly — they import @ghasi/telemetry which wraps OTel and enforces field contracts.

3. Correlation and context

3.1 Required context keys

Every telemetry signal includes the keys below where values exist. Absent values use null, never "".

Key	Type	Source	Notes
`trace_id`	hex(32)	W3C `traceparent`	Generated at edge if absent.
`span_id`	hex(16)	OTel
`request_id`	uuidv7	Kong edge	Survives through NATS via baggage.
`tenant_id`	UUID	JWT → baggage	Mandatory outside onboarding.
`facility_id`	UUID?	JWT / context	Hierarchy scope.
`actor_id_hash`	sha256(actor_id + tenant_salt)	auth layer	Raw `actor_id` never in telemetry.
`actor_role`	enum	JWT	`patient\|clinician\|nurse\|admin\|system\|anonymous`
`encounter_id_hash`	sha256	domain context	Links clinical workflow spans.
`session_id`	ULID	UI	Cross-request stitching.
`device_id_hash`	sha256	offline SDK	Device binding.
`app`	string	build	`clinician-web`, `patient-portal`, `provider-mobile`, etc.
`app_version`	semver	build
`env`	enum	runtime	`dev\|staging\|prod\|sandbox`
`region`	string	runtime	`af-kbl-1`, `af-mzs-1`, …
`log_schema_version`	int	library	Current: `1`.
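The null-versus-empty-string rule above can be enforced mechanically at the emitter. The following is a minimal sketch, not the @ghasi/telemetry implementation: `TelemetryContext`, `CONTEXT_KEYS`, and `normalizeContext` are hypothetical names, and only a subset of the keys is shown.

```typescript
// Hypothetical subset of the §3.1 context keys, for illustration only.
type ContextKey = "trace_id" | "request_id" | "tenant_id" | "facility_id" | "actor_role";

type TelemetryContext = Record<ContextKey, string | null>;

const CONTEXT_KEYS: readonly ContextKey[] = [
  "trace_id", "request_id", "tenant_id", "facility_id", "actor_role",
];

// Maps missing, undefined, or blank values to null so that "absent"
// has exactly one representation downstream.
function normalizeContext(
  raw: Partial<Record<ContextKey, string | null | undefined>>,
): TelemetryContext {
  const out = {} as TelemetryContext;
  for (const key of CONTEXT_KEYS) {
    const value = raw[key];
    out[key] = value === undefined || value === null || value.trim() === "" ? null : value;
  }
  return out;
}
```

Normalising once, in the library, means dashboards and queries only ever need to test for null.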

3.2 Baggage

OTel Baggage carries tenant_id, request_id, actor_role, facility_id, offline_origin, ai_budget_id across HTTP, NATS, and background workers. Baggage is stripped at egress to external APIs.

3.3 Cross-boundary propagation

HTTP: traceparent, tracestate, x-ghasi-tenant, x-ghasi-request-id.
NATS JetStream: CloudEvents extension attributes carry traceparent + tenantid.
WebRTC / virtual care: trace_id per session; each signalling message is a span.
Offline → online: Device emits sync_batch_id; server links replayed signals to the original offline trace_id preserved in the batch envelope.
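As an illustration of the HTTP and NATS propagation rules above, the sketch below parses a W3C `traceparent` header and re-emits it as CloudEvents extension attributes. It handles only version `00` of the header, and `cloudEventExtensions` is a hypothetical helper rather than a documented part of the platform.

```typescript
// Fields of a W3C Trace Context `traceparent` header, version 00.
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers and for the all-zero IDs the spec forbids.
function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const traceId = m[1], spanId = m[2], flags = m[3];
  if (!traceId || !spanId || !flags) return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.spanId}-${tp.sampled ? "01" : "00"}`;
}

// Hypothetical helper: extension attributes attached to a NATS CloudEvent.
function cloudEventExtensions(tp: TraceParent, tenantId: string): Record<string, string> {
  return { traceparent: formatTraceparent(tp), tenantid: tenantId };
}
```

A consumer that receives a null result would start a new root trace instead of trusting the malformed header.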

4. Logging

4.1 Log schema (v1)

{
  "ts": "2026-04-18T09:12:33.214Z",
  "level": "INFO|DEBUG|WARN|ERROR|FATAL|AUDIT",
  "msg": "short human summary, ≤120 chars, no PHI interpolation",
  "event": "chart.encounter.opened",
  "service": "patient-chart-service",
  "component": "EncounterController",
  "trace_id": "…", "span_id": "…", "request_id": "…",
  "tenant_id": "…", "facility_id": "…",
  "actor_id_hash": "…", "actor_role": "clinician",
  "encounter_id_hash": "…",
  "session_id": "…", "device_id_hash": "…",
  "app": "clinician-web", "app_version": "2026.4.1",
  "env": "prod", "region": "af-kbl-1",
  "attrs": { "duration_ms": 212, "cache_hit": true },
  "error": { "type": "…", "message": "…", "stack": "…" },
  "log_schema_version": 1
}

Rules:

msg is static; variable data goes in attrs.
event uses domain.entity.action (see 04 catalog).
attrs keys are snake_case and namespaced.
error.stack only at ERROR/FATAL and only in non-prod, or after scrubbing in prod.
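Three of the rules above lend themselves to an automated check. The sketch below is illustrative only: `LogRecordLike` and `validateLogRecord` are hypothetical names, the regular expressions are one plausible reading of "domain.entity.action" and "snake_case", and namespacing of `attrs` keys is not checked.

```typescript
// Minimal view of a v1 log record: just the fields the checks need.
interface LogRecordLike {
  msg: string;
  event: string;
  attrs: Record<string, unknown>;
}

const EVENT_RE = /^[a-z][a-z0-9_-]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$/; // domain.entity.action
const SNAKE_CASE_RE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Returns a list of rule violations; an empty list means the record passes.
function validateLogRecord(rec: LogRecordLike): string[] {
  const problems: string[] = [];
  if (rec.msg.length > 120) problems.push("msg exceeds 120 characters");
  if (!EVENT_RE.test(rec.event)) problems.push("event is not domain.entity.action");
  for (const key of Object.keys(rec.attrs)) {
    if (!SNAKE_CASE_RE.test(key)) problems.push(`attrs key "${key}" is not snake_case`);
  }
  return problems;
}
```

A check of this kind could run in unit tests or at the Collector; whether `msg` is truly static cannot be verified from a single record and still needs code review.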

4.2 Levels

Level	Use	Sampling
`FATAL`	Process-terminating	100 %
`ERROR`	Contract violation, unhandled	100 %
`WARN`	Recoverable anomaly, degraded mode	100 %
`AUDIT`	Security/compliance events	100 %, also to audit-service (synchronous)
`INFO`	Domain event emission, lifecycle	100 % prod (filtered by category)
`DEBUG`	Developer detail	0 % prod, 100 % sandbox

4.3 PHI / PII redaction

Redaction runs in @ghasi/telemetry at emit time; the Collector re-runs it for defense in depth, and a nightly scanner sweeps Loki for leaks.

Deny-list (never logged, even hashed with platform key): password, password_hash, otp, totp_secret, access_token, refresh_token, id_token, private_key, webhook_secret, national_id, phone_e164, email, dob, home_address, insurance_card_number, credit_card, clinical_note_free_text, ai_prompt_raw, ai_response_raw, patient_name, provider_personal_phone.

Hashed (tenant-salt): actor_id, device_id, ip_address, patient_id, encounter_id, mrn, npi.

Truncated + categorised (never raw):

Clinical notes → note_length, note_lang, note_has_ai_draft.
AI prompts / responses → prompt_category, prompt_length, prompt_lang, safety_tags[].
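The deny-list and hashed-field rules can be sketched as a single pass over a record's fields. This is an assumption-laden illustration, not the library's code: the lists are abbreviated, `redact` and `tenantHash` are hypothetical names, and the hash follows the `sha256(actor_id + tenant_salt)` formula from §3.1 using Node's built-in `crypto` module.

```typescript
import { createHash } from "node:crypto";

// Abbreviated copies of the §4.3 lists, for illustration only.
const DENY = new Set(["password", "otp", "access_token", "email", "dob", "patient_name"]);
const HASH = new Set(["actor_id", "device_id", "ip_address", "patient_id", "encounter_id", "mrn", "npi"]);

// sha256(value + tenant_salt), hex-encoded.
function tenantHash(value: string, tenantSalt: string): string {
  return createHash("sha256").update(value + tenantSalt, "utf8").digest("hex");
}

// Drops deny-listed fields, replaces hash-listed fields with `<name>_hash`,
// and passes every other field through unchanged.
function redact(fields: Record<string, string>, tenantSalt: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(fields)) {
    if (DENY.has(name)) continue;
    if (HASH.has(name)) out[`${name}_hash`] = tenantHash(value, tenantSalt);
    else out[name] = value;
  }
  return out;
}
```

Because the salt is per tenant, the same patient ID hashes differently in two tenants, which keeps hashed identifiers from being joined across tenant boundaries.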

4.4 Mandatory spans per service layer

Hexagonal architecture (02 DDD contexts) aligns with these span layers:

Layer	Span name pattern	Required attributes
Presentation (controller)	`{service}.controller.{handler}`	`http.method`, `http.route`, `http.status_code`
Application (use case)	`{service}.usecase.{name}`	`tenant_id`, `actor_role`
Domain (aggregate)	`{service}.domain.{aggregate}.{method}`	domain-specific attributes (no PHI)
Port (interface)	`{service}.port.{name}`
Adapter (infra)	`{service}.adapter.{name}` (e.g., `...adapter.postgres`, `...adapter.nats`)	`db.system`, `db.name`, `messaging.system`

Span lifecycle is enforced by decorators in @ghasi/telemetry.

4.5 Multi-tenancy in logs

Loki label set: {service, env, region, severity, tenant_id}. Unlabeled logs dropped at Collector.
Per-tenant retention overrides supported via Collector routing.
Tenant admin can export only their tenant's logs (signed, time-boxed S3 URL, 24 h).

4.6 Audit logs (separate pipeline)

Audit events are never best-effort. They go through a synchronous, ack'd write to audit-service before the user-facing response completes. Failures return 503 — we do not transact without audit.

Audit categories:

Auth: login, MFA, session revoke, impersonation, break-glass.
Data: export (DSAR), delete, bulk access.
Clinical: order sign, note sign, result release, medication dispense.
Interop: HL7 ingest, FHIR export, e-prescribing route.
Moderation: AI block, human override, appeal.
Admin: tenant config, role change, policy change, licensing assignment.
Break-glass: invocation, reason, window.

Schema: canonical log schema + audit.action, audit.target, audit.before_hash, audit.after_hash, audit.signed_by, audit.chain_prev_hash (hash-chained for tamper evidence).
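Hash chaining via `audit.chain_prev_hash` can be illustrated with a small verifier. The sketch assumes SHA-256 over a JSON array of the entry's fields as the canonical form; the actual canonicalisation used by audit-service is not specified here, and `AuditEntry`, `entryHash`, and `firstBrokenLink` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  action: string;
  target: string;
  chain_prev_hash: string | null; // null only for the first entry in a chain
}

// Assumed canonical form: a JSON array with a fixed field order.
function entryHash(e: AuditEntry): string {
  const canonical = JSON.stringify([e.action, e.target, e.chain_prev_hash]);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// Returns the index of the first entry whose chain_prev_hash does not match
// the hash of its predecessor, or -1 when the chain is intact.
function firstBrokenLink(chain: readonly AuditEntry[]): number {
  let prev: string | null = null;
  for (let i = 0; i < chain.length; i++) {
    const entry = chain[i];
    if (entry === undefined || entry.chain_prev_hash !== prev) return i;
    prev = entryHash(entry);
  }
  return -1;
}
```

Editing any stored entry changes its hash, so the break surfaces at the entry that follows it; a metric such as `audit_chain_integrity_ratio` could be derived from periodic runs of a check like this.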

5. Metrics taxonomy

5.1 Families

Three families, strictly named:

USE per resource: process_cpu_seconds_total, db_pool_connections{state}, nats_consumer_pending_messages.
RED per request path: http_requests_total, http_request_duration_seconds, http_requests_errors_total.
Domain (DKPIs): <domain>_<entity>_<action>_total — see §5.3+.

Naming:

_total for counters; _seconds for latency; _ratio for 0–1; _bytes for sizes.
Labels must have bounded cardinality. tenant_id is allowed; user_id is never a label.
High-cardinality dimensions go to exemplars + analytics, not Prometheus labels.

5.2 Standard labels

Every metric: service, env, region, tenant_tier (public|district|regional|referral|national). Domain metrics may add tenant_id when cardinality permits; Collector enforces a per-series cap.

5.3 Per-service SLIs (default, all services)

SLI	Definition	Target
Availability	`1 − errors/total` over read paths (errors = 5xx responses; client-closed 499s are not counted)	99.9 %
Latency p95	`http_request_duration_seconds` p95	≤ 300 ms internal, ≤ 500 ms edge
Latency p99	`http_request_duration_seconds` p99	≤ 800 ms
DB saturation	`db_pool_in_use / db_pool_max`	< 0.8 sustained
NATS consumer lag	`nats_consumer_pending_messages`	< 5 000 / partition
Audit write success	`audit_write_success_ratio`	≥ 99.99 %

5.4 Domain metrics — highlights

identity: auth_login_total{result}, auth_mfa_challenge_total{method,result}, auth_session_revoked_total{reason}, auth_breakglass_invoked_total{reason_code}.
registration: reg_patient_registered_total{source}, reg_duplicate_detected_total, reg_merge_applied_total.
patient-chart: chart_opened_total, chart_read_latency_seconds, allergy_banner_rendered_ratio.
orders: order_placed_total{class}, order_signed_total, order_ddi_flag_total.
medication: rx_issued_total, rx_dispensed_total, rx_dispense_latency_seconds, rx_substituted_total{reason}.
eprescribing-gateway: eprx_route_total{corridor,result}, eprx_subscription_backlog.
laboratory: lab_accession_total, lab_specimen_rejected_total{reason}, lab_result_released_total{priority}.
radiology: rad_study_total, rad_report_finalised_total, rad_tat_seconds.
virtual-care: vcare_session_started_total, vcare_session_dropped_total{reason}, vcare_ttfi_seconds.
patient-portal: portal_login_total, portal_result_viewed_total, portal_appt_booked_total.
immunizations: imm_dose_total{product}, imm_field_offline_queue_size.
billing/claims: bill_charge_total, claim_submitted_total, claim_paid_total, claim_denial_total{reason}.
interop: hl7_msg_total{type,result}, fhir_subscription_delivered_total.
audit: audit_events_total{category}, audit_chain_integrity_ratio.
ai-gateway: see §7.
offline: see §8.

5.5 Exemplars

Every domain counter/histogram carries exemplars linking to trace IDs: 1 in N successful requests and 100 % of errored requests. Grafana panels enable one-click metric → trace → log drill-down.

6. Distributed tracing

6.1 Rules

Every inbound HTTP, WebSocket, and NATS consumer creates a root or child span.
Every outbound DB, cache, HTTP, NATS producer, object-store, AI call creates a child span.
Spans have otel.status_code, error.type, domain-specific attributes (no PHI).
Any span crossing a trust boundary (tenant → external, online → offline, sync-in, e-prescribing cross-corridor) carries trust.boundary=<name> and its own error budget.

6.2 Domain span attributes

ghasi.tenant_id, ghasi.facility_id, ghasi.actor_role.
ghasi.patient_id_hash, ghasi.encounter_id_hash.
ghasi.ai.model, ghasi.ai.provider, ghasi.ai.purpose, ghasi.ai.tokens_in, ghasi.ai.tokens_out, ghasi.ai.cost_micro_usd.
ghasi.offline.batch_id, ghasi.offline.source.
ghasi.eprx.corridor, ghasi.eprx.jurisdiction.

6.3 Sampling

Path	Strategy	Rate
Health / liveness	Drop	0 %
Read APIs (cached)	Head-based	1 %
Write APIs	Head-based	10 %
Clinical safety-critical (order sign, dispense, result release, allergy write)	Always-on	100 %
AI inference	Always-on	100 %
Errors (4xx ≥ 429, all 5xx)	Tail-based override	100 %
Slow requests (> SLO p99)	Tail-based override	100 %
Offline sync replay	Always-on	100 %
Audit writes	Always-on	100 %
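The table above combines a head decision taken at span start with a tail override applied once the outcome is known. The sketch below shows one way to express that; the `PathClass` names are hypothetical, and deriving the head decision from the trace ID (so that every service reaches the same verdict) is an assumption of this example rather than something the table prescribes.

```typescript
type PathClass = "health" | "read" | "write" | "safety_critical" | "ai" | "offline_replay" | "audit";

// Head-based rates from the §6.3 table (1 = always-on, 0 = drop).
const HEAD_RATE: Record<PathClass, number> = {
  health: 0, read: 0.01, write: 0.1,
  safety_critical: 1, ai: 1, offline_replay: 1, audit: 1,
};

// Head decision: map the first 32 bits of the trace ID to [0, 1) and
// keep the trace when that value falls below the configured rate.
function headSample(pathClass: PathClass, traceId: string): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < HEAD_RATE[pathClass];
}

// Tail override: keep errors (429 and above in the 4xx range, all 5xx)
// and requests slower than the SLO p99, regardless of the head decision.
function tailOverride(status: number, durationMs: number, sloP99Ms: number): boolean {
  const isError = status >= 500 || (status >= 429 && status < 500);
  return isError || durationMs > sloP99Ms;
}
```

A trace is retained when either function returns true. Whether the tail override should also apply to health checks is left open here.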

6.4 Redaction in spans

db.statement captured parameterised only (placeholders, never values). URL paths templatised (/v1/patients/:id/chart). Request/response bodies never in span attributes. FHIR resource bodies never in spans.
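URL templatising can be approximated with a segment-level heuristic, sketched below under the assumption that identifiers appear as UUIDs or plain integers. `templatisePath` is a hypothetical helper; where the HTTP framework exposes the matched route, using that route directly is more reliable than pattern matching.

```typescript
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_RE = /^\d+$/;

// Replaces identifier-like path segments with `:id` and drops the query string,
// so that no patient or encounter identifier ends up in a span attribute.
function templatisePath(url: string): string {
  const path = url.split("?")[0] ?? "";
  return path
    .split("/")
    .map((seg) => (UUID_RE.test(seg) || NUMERIC_RE.test(seg) ? ":id" : seg))
    .join("/");
}
```

The heuristic misses identifiers in other formats (for example ULIDs), which is one reason to prefer the router's own template.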

7. AI telemetry

AI is the highest-risk, highest-cost surface in a clinical platform. It gets first-class observability.

7.1 Per-invocation dimensions

Dimension	Example	Purpose
`ai.purpose`	`chart.scribe`, `orders.suggest`, `rad.report_draft`, `moderation.clinical`	SLOs, cost allocation
`ai.model` / `ai.model_version`	`claude-sonnet-4-6@20261001`	Drift, A/B
`ai.provider`	`anthropic`, `openai`, `local-llm`	Failover
`ai.prompt_template_id` + hash	`chart.scribe.v7`	Provenance
`ai.tokens_in` / `ai.tokens_out`		Cost
`ai.cost_micro_usd`		Cost
`ai.latency_ms_ttfb` / `ttlb`		UX
`ai.cache.hit` / `ai.cache.key_hash`		Cost
`ai.safety.pre` / `ai.safety.post`	`{"unsafe_clinical":0.01,...}`	Safety
`ai.safety.action`	`allow\|redact\|block\|escalate`	Safety
`ai.guardrail.violations[]`	`phi_leak`, `dose_unsafe`, `contraindication_missed`	Safety
`ai.output.citations[]`	FHIR resource IDs (hashed)	Provenance
`ai.output.grounding_score`	0–1	Hallucination
`ai.human_override`	bool	HITL closure

Prompts and responses themselves go to a separate, encrypted, tenant-scoped store (ai-transcripts within ai-gateway-service) with tighter retention. Telemetry carries only hashes, categories, and safety signals.

7.2 AI SLIs and SLOs

SLI	Target
Scribe TTFB p95 online	≤ 1.2 s
Scribe TTFB p95 offline (local SLM)	≤ 500 ms
Moderation decision latency p99	≤ 400 ms
Safety false-negative rate (sampled audit)	≤ 0.1 %
Safety false-positive rate	≤ 2 %
AI cost per active clinician-month	≤ budget per tenant tier
Cache hit rate (prompt + ctx)	≥ 40 % scribe, ≥ 60 % moderation
Provider failover success	≥ 99 %
Grounding score (RAG paths)	p50 ≥ 0.8

7.3 AI cost observability

Budget IDs attach to every invocation; rollups by tenant_id × purpose × model.
Circuit breakers fire when tenant/purpose breaches 120 % hourly budget — model downgrade (Opus → Sonnet → Haiku), aggressive caching, then fail-closed to static / human-only.
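The degradation order above can be modelled as a ladder that is climbed while spend stays over the 120 % threshold. In the sketch below the step names and `nextStep` are hypothetical, and escalating exactly one step per evaluation is an assumption: the section fixes the order of the steps, not their pacing.

```typescript
// Degradation ladder from §7.3, mildest first.
const LADDER = ["opus", "sonnet", "haiku", "cache_only", "fail_closed"] as const;
type Step = (typeof LADDER)[number];

const BREACH_RATIO = 1.2; // 120 % of the hourly budget

// Returns the step to use for the next evaluation period. While the
// tenant/purpose pair is over budget, each call moves one step down the ladder.
function nextStep(current: Step, spentMicroUsd: number, hourlyBudgetMicroUsd: number): Step {
  if (hourlyBudgetMicroUsd <= 0) throw new RangeError("hourly budget must be positive");
  if (spentMicroUsd / hourlyBudgetMicroUsd <= BREACH_RATIO) return current;
  const i = LADDER.indexOf(current);
  return LADDER[Math.min(i + 1, LADDER.length - 1)] ?? "fail_closed";
}
```

Recovery (moving back up the ladder once spend falls) is deliberately left out; it would need its own hysteresis rule to avoid flapping.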

7.4 Provenance and replay

Each AI output stores {prompt_template_id, prompt_hash, context_hash, model, model_version, params, safety_decisions, citations} — sufficient to replay for audit. Replay exposed via internal ai-audit admin tool, logged as AUDIT.

8. Offline telemetry

8.1 Device SDK

Offline runtimes (provider mobile, desktop registration, web fallback) run a local telemetry buffer:

SQLite-backed, append-only, size-capped (default 64 MB).
Each batch MAC-signed with the device binding key; server detects tampering on replay.
Buffer encrypts at rest using device-bound key.
Batches chunked by sync_batch_id, ordered by monotonic sequence number; gaps flagged.
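Server-side replay checks for the buffer described above might look like the sketch below. HMAC-SHA256 is an assumed choice of MAC (the section does not name the algorithm), the `SignedBatch` shape is hypothetical, and the functions use Node's built-in `crypto` module.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SignedBatch {
  sync_batch_id: string;
  seq: number;     // monotonic sequence number
  payload: string; // serialised telemetry records
  mac: string;     // hex MAC over id, seq, and payload
}

// Computes the MAC with the device binding key (HMAC-SHA256 assumed).
function batchMac(key: Buffer, b: Omit<SignedBatch, "mac">): string {
  return createHmac("sha256", key)
    .update(JSON.stringify([b.sync_batch_id, b.seq, b.payload]), "utf8")
    .digest("hex");
}

function verifyBatch(key: Buffer, b: SignedBatch): boolean {
  const expected = Buffer.from(batchMac(key, b), "hex");
  const actual = Buffer.from(b.mac, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

// Returns the sequence numbers missing between the lowest and highest seen.
function findGaps(seqs: readonly number[]): number[] {
  const seen = new Set(seqs);
  if (seen.size === 0) return [];
  const gaps: number[] = [];
  for (let s = Math.min(...seqs); s <= Math.max(...seqs); s++) {
    if (!seen.has(s)) gaps.push(s);
  }
  return gaps;
}
```

Including the sequence number in the MAC input means a batch cannot be silently renumbered to hide a gap.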

8.2 Offline signals

Signal	Fields
`offline.bundle.activated`	`bundle_id`, `size_bytes`, `integrity_ok`
`offline.bundle.integrity_failure`	`bundle_id`, `expected_hash`, `actual_hash`, `reason`
`offline.sync.started/completed/failed`	`batch_id`, `items`, `bytes`, `duration_ms`, `conflicts`
`offline.conflict.detected`	`entity`, `strategy`, `winner`, `loser_preserved`
`offline.device.bind/unbind/rebind_denied`	`device_id_hash`, `reason`
`offline.tamper.suspected`	`signal`, `severity`, `evidence_hash`
`offline.clock.skew`	`skew_seconds`
`offline.outbox.size`	rolling

8.3 Offline SLIs

SLI	Target
Sync success rate	≥ 99 % per device/day
Conflict rate (of synced writes)	≤ 1 %
Bundle tamper detection	100 % of test cases caught
Device-binding mismatch blocked	100 %
Reconnect → sync complete for ≤ 10 MB	≤ 60 s

9. Dashboard catalogue

Dashboards live as JSON in grafana/, provisioned via CI. Each has an owner and SLO link.

9.1 Global

Platform Overview — availability, latency, saturation across 27 services.
Error Budget Burn — per service, per SLO, 1 h / 6 h / 24 h.
Tenant Health — per-tenant error rate, latency, AI spend, offline sync.
Release Radar — deploys correlated with error-rate / latency deltas.
Cost Control — infra + AI + egress.
Audit Integrity — hash-chain continuity, write-failure rate.

9.2 Per-capability

Identity & Access — login, MFA, session revocations, break-glass.
Clinical — chart open, order placement, result release, allergy writes, note sign.
Pharmacy — queue depth, verify/dispense latency, substitutions.
Lab — accession funnel, TAT, result release.
Radiology — study volume, TAT, AI-drafted report acceptance rate.
Virtual care — session lifecycle, drop rate, reconnect success.
Interop (FHIR + HL7 v2) — ingest volume, error rate, DLQ depth.
E-prescribing — route success, subscription backlog, corridor latency.
Immunizations / HMIS — dose volume, offline queue depth, HMIS export success.
Billing / claims — charge funnel, denial rate, remittance lag.
Patient portal — login, result view, appointment book success.
AI / Scribe — TTFB, cache hit, safety actions, grounding, cost.
Offline — sync success, conflict rate, tamper flags.
Safety & moderation — blocks, overrides, SLA to decision.
Data platform — NATS consumer lag, DLQ depth, outbox lag.

9.3 Per-service template

Each service automatically gets a dashboard with: RED panels, USE panels, top errors, top slow endpoints, DB pool, NATS lag, dependency map, audit integrity.

10. Alerts and SLOs

10.1 SLO framework

All SLOs defined in slo/*.yaml (Sloth format), reviewed via PR.
Multi-window, multi-burn-rate alerts (Google SRE): 1 h + 5 min (fast burn), 6 h + 30 min (slow burn).
28-day rolling windows; error-budget policy enforced.
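The burn-rate arithmetic behind these alerts is short enough to show in full. The sketch below assumes the budget fractions commonly used with this window pairing in the Google SRE Workbook (2 % of the budget in 1 h, 5 % in 6 h); those fractions and the function names are assumptions, not values taken from slo/*.yaml.

```typescript
const WINDOW_HOURS = 28 * 24; // 28-day rolling SLO window

// Burn rate = observed error ratio divided by the error budget (1 - SLO target).
// A burn rate of 1 spends exactly the whole budget over the SLO window.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Burn rate at which `budgetFraction` of the budget is spent within `alertWindowHours`.
function burnThreshold(budgetFraction: number, alertWindowHours: number): number {
  return (budgetFraction * WINDOW_HOURS) / alertWindowHours;
}

// Multi-window rule: fire only when both the long and the short window
// exceed the threshold, so alerts clear quickly once the problem stops.
function shouldAlert(longBurn: number, shortBurn: number, threshold: number): boolean {
  return longBurn > threshold && shortBurn > threshold;
}
```

Under these assumptions the fast-burn threshold is `burnThreshold(0.02, 1)`, about 13.44, and the slow-burn threshold is `burnThreshold(0.05, 6)`, about 5.6. For a 99.9 % SLO, a sustained 1 % error ratio is a burn rate of roughly 10, which trips the slow-burn alert but not the fast-burn one.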

10.2 Alert contract

alert: OrderSignFailure
expr: rate(orders_sign_errors_total[5m]) > 0
for: 1m
severity: SEV-1
owner: clinical-orders
runbook: https://runbooks.ghasi/clinical/order-sign-failure
auto_remediation: none
dashboards: [clinical/orders, audit/integrity]
slos: [orders.sign.availability]

Alerts without runbook + owner are rejected in CI.
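A CI lint for the contract could be as small as the sketch below. `AlertRule` and `lintAlert` are hypothetical names; the owner and runbook checks mirror the stated rule, while the HTTPS pattern and the severity check are extra assumptions added for illustration.

```typescript
interface AlertRule {
  alert: string;
  expr: string;
  severity: string;
  owner?: string;
  runbook?: string;
}

const SEVERITIES = new Set(["SEV-1", "SEV-2", "SEV-3", "SEV-4"]);

// Returns the reasons a rule would be rejected; an empty array means it passes.
function lintAlert(rule: AlertRule): string[] {
  const errors: string[] = [];
  if (!rule.owner || rule.owner.trim() === "") errors.push("missing owner");
  if (!rule.runbook || !/^https:\/\/\S+$/.test(rule.runbook)) {
    errors.push("missing or malformed runbook URL");
  }
  if (!SEVERITIES.has(rule.severity)) errors.push(`unknown severity "${rule.severity}"`);
  return errors;
}
```

The `OrderSignFailure` rule shown above passes this lint; removing its `runbook` line would produce a single error.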

10.3 Severity ladder

Severity	Definition	Response
SEV-1	Patient-safety, data-loss, audit breach, payment outage	Page + bridge + Statuspage ≤ 5 min
SEV-2	Capability degraded, SLO fast-burn	Page primary on-call
SEV-3	Slow-burn SLO, non-blocking	Ticket + Slack
SEV-4	Housekeeping	Ticket

10.4 Example alerts (non-exhaustive)

IdentityLoginErrorRate (SEV-2): 5xx on /auth/* > 1 % for 5 m.
OrderSignFailure (SEV-1): any ERROR on order sign path.
AllergyWriteFailure (SEV-1): any ERROR on allergy write; safety-critical.
AuditWriteFailure (SEV-1): any failed audit write; blocks transactions.
BreakGlassSpike (SEV-2): break-glass invocations > 3× 7-day baseline in a facility.
ResultReleaseDelay (SEV-2): verified lab results unreleased > 30 min.
EprxCorridorLatency (SEV-2): p95 > 5 s on cross-corridor routing.
AIScribeTTFBSlow (SEV-3): 6 h burn > 1× budget on scribe TTFB.
OfflineBundleTamper (SEV-1): any offline.bundle.integrity_failure.
OfflineSyncConflictSpike (SEV-2): conflict rate > 2 % over 30 m per tenant.
AICostBudgetBreach (SEV-2): tenant hourly AI spend > 120 % budget.
NATSConsumerLagHigh (SEV-2): any consumer lag > 50 k for 10 m.
HL7IngestBacklog (SEV-2): HL7 queue depth > threshold 15 m.
DSARDeletionSLA (SEV-2): open DSAR > 27 days.

10.5 Error budget policy

50 % burn → notify service owner; feature-freeze optional.
75 % → feature-freeze mandatory; reliability PRs only.
100 % → rollback recent risky changes; post-incident review required.

Policy enforcement is an automated GitHub check against the SLO service.
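The thresholds above translate directly into a decision function that a GitHub check could call. The names below are hypothetical, and treating the 100 % tier as a continued freeze for non-reliability PRs is an assumption; the policy text only states that rollback and a post-incident review are required at that point.

```typescript
type BudgetAction = "none" | "notify_owner" | "feature_freeze" | "rollback_and_review";

// Maps the fraction of error budget consumed (0 to 1 and beyond) to the §10.5 action.
function budgetAction(consumed: number): BudgetAction {
  if (consumed >= 1.0) return "rollback_and_review";
  if (consumed >= 0.75) return "feature_freeze";
  if (consumed >= 0.5) return "notify_owner";
  return "none";
}

// Merge gate: below 75 % everything may merge; from 75 % upward
// only PRs labelled as reliability work are allowed through.
function mergeAllowed(consumed: number, isReliabilityPr: boolean): boolean {
  const action = budgetAction(consumed);
  return action === "none" || action === "notify_owner" || isReliabilityPr;
}
```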

11. Runbook template

Every alert points to a runbook. Standard shape:

# Runbook: {Alert}

**Severity:** SEV-N
**Owner:** {team}
**Dashboards:** [{dashboard-links}]
**Related SLOs:** [{slo-ids}]

## Symptoms
- What users / clinicians see.

## Immediate triage (≤ 5 min)
1. Confirm alert (query: `{promql}`).
2. Check dependencies (DB, NATS, Keycloak, Kong).
3. Check recent deploys.

## Diagnosis paths
- Path A: {hypothesis} → {commands} → {indicator}.
- Path B: {hypothesis} → {commands} → {indicator}.

## Resolution steps
- {action} (auto-remediation `{hook}` if applicable).
- Rollback: `{command}`.

## Follow-up
- Postmortem required if SEV-1/2.
- Action items tracked against error-budget dashboard.

Runbooks live in runbooks/ repo; incident-bot surfaces them in the incident channel.

12. Incident response hooks

12.1 Auto-declare

SEV-1 or two concurrent SEV-2 auto-declare:

incident-bot opens Slack #inc-YYYYMMDD-NN.
Creates PagerDuty incident; pages on-call.
Posts runbook, burn-rate, recent deploys, related alerts.
Opens bridge (Zoom / Meet) with auto-invite.
Updates Statuspage with tenant-scoped visibility.
Timeline logger captures human comments + alert transitions.
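The auto-declare trigger itself reduces to a small predicate over the set of currently firing alerts. The sketch below uses hypothetical type and function names and interprets "two concurrent SEV-2" as two or more.

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";

interface FiringAlert {
  name: string;
  severity: Severity;
}

// True when any SEV-1 is firing, or when at least two SEV-2 alerts fire at once.
function shouldAutoDeclare(firing: readonly FiringAlert[]): boolean {
  const hasSev1 = firing.some((a) => a.severity === "SEV-1");
  const sev2Count = firing.filter((a) => a.severity === "SEV-2").length;
  return hasSev1 || sev2Count >= 2;
}
```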

12.2 Automated remediation

Trigger	Action
AI provider error rate > 5 %	Failover to secondary + model downgrade
Scribe TTFB p95 breach	Reduce streaming concurrency per tenant
NATS DLQ spike	Pause producer, alert, enable dead-letter drain
Offline tamper spike from tenant	Auto-revoke affected device bindings; require re-enrol
Audit write failure	Trip global "no-write" breaker on affected capability
Cost breach	Model downgrade → cache-only → 503 with graceful copy

Every auto-remediation action is logged as AUDIT and is reversible with a single command.

12.3 Postmortems

Blameless template generated from incident timeline + telemetry (pm-bot).
Required within 5 business days for SEV-1/2.
Action items tracked with SLA; overdue action items appear on the error-budget dashboard.

13. Data retention and residency (telemetry)

Signal	Hot	Warm	Cold	Max
App logs (non-PHI)	14 d Loki	90 d S3	395 d Glacier	395 d
Audit logs	30 d hot	7 y WORM S3	—	7 y
Metrics	30 d Prom	13 mo Mimir	—	13 mo
Traces (sampled)	7 d Tempo	90 d S3	—	90 d
Traces (errors, AI, safety-critical)	30 d	395 d	—	395 d
AI transcripts	30 d	180 d tenant-config	—	365 d
Safety-flagged AI	180 d	2 y	—	2 y
Offline device telemetry	14 d post-sync	90 d	—	90 d

Residency: tenant-pinned to home region (af-kbl-1, af-mzs-1, …). Cross-region replication off by default.

14. Per-service implementation checklist

A service is not production-ready until:

15. PagerDuty integration

Services map to PagerDuty services 1:1.
Escalation: primary (5 min) → secondary (15 min) → manager (30 min).
Schedules are Git-managed via pagerduty-tf.
Alerts carry severity, service, runbook, dashboards as payload — PagerDuty auto-formats the incident.
Statuspage integration: SEV-1 auto-publishes a component-scoped incident; SEV-2 post-mortem-first with 30-min delay.

16. Governance

This document is versioned; material changes require an RFC under rfcs/observability/.
Log schema breaking changes bump log_schema_version and carry a 2-release deprecation window.
Alert and SLO changes are PR-reviewed by SRE + capability owner + (for safety) Clinical Informatics.

17. Open questions

Whether on-device anomaly detection for offline tamper should ship in provider-mobile v2 or remain server-side only.
Long-term strategy for per-facility (not just per-tenant) SLOs — currently tenant-tier granularity; some facilities will want dedicated views.
