12 — Observability and Telemetry
Status: populated
Last updated: 2026-04-18
Companion: 01 enterprise-architecture · 04 event-driven-architecture · 11 risks-and-tradeoffs · 13 security-compliance-tenancy · 18 testing-strategy-qa
Observability is not optional instrumentation — it is the operating substrate that lets Ghasi-eHealth prove clinical safety, correctness, availability, cost discipline, and tamper-evidence across 27 services and multiple offline surfaces.
1. Principles
| # | Principle | Consequence |
|---|---|---|
| P1 | Three pillars, one correlation ID | Every log line, metric exemplar, and span carries trace_id, tenant_id, request_id, actor_id_hash. |
| P2 | Structured by default | No free-form production log strings. JSON, schema-validated, versioned (log_schema_version). |
| P3 | PHI never leaves the boundary unredacted | Redaction is a library, not a discipline. Applied at emitter, verified at collector, re-verified at sink. |
| P4 | Tenant isolation extends to telemetry | Tenant-scoped dashboards, alerts, retention, export. |
| P5 | Sampling is policy, not accident | Head-based for hot paths; tail-based for errors/slow paths; 100 % for safety-critical (AI, prescribing, audit). |
| P6 | Cost is a first-class signal | AI spend, egress, storage have SLOs like latency. |
| P7 | Offline is observable when it reconnects | Device-side telemetry buffer with tamper-evident framing; reconciled on sync. |
| P8 | Events are the ledger | Domain events (see 04) are replayable; telemetry augments, not replaces. |
| P9 | Every alert is actionable | Alerts carry a runbook slug, owner, and auto-remediation hook where applicable. |
| P10 | Privacy > curiosity | When in doubt, drop the field. Patient wellbeing outranks debuggability. |
2. Reference stack
| Layer | Tooling (normative) | Notes |
|---|---|---|
| Instrumentation | OpenTelemetry SDK (Node/TypeScript) | Single vendor-neutral API. |
| Collection | OTel Collector (gateway + agent tiers) | Redaction, tenant routing, sampling decisions. |
| Logs | Loki (hot 14 d) → S3 + Parquet (cold 395 d) | Labels: {service, env, region, severity, tenant_id}. |
| Metrics | Prometheus (hot 30 d) → Mimir (13 mo) | Remote-write from Collector. |
| Traces | Tempo (hot 7 d) → S3 (90 d, sampled) | Exemplars link metrics → traces → logs. |
| Dashboards | Grafana (per-tenant folders, RBAC) | Stored as code in grafana/. |
| Alerts | Alertmanager + PagerDuty + Slack #oncall-* | Alerts declared in Git, PR-reviewed. |
| SLO engine | Sloth → Prometheus rules | Generates burn-rate alerts. |
| Audit sink | audit-service (append-only, WORM S3) | Security/compliance only. |
| Incident | PagerDuty + Statuspage + incident-bot | Auto-declares, pulls runbook, opens bridge. |
Services must not import vendor SDKs directly — they import @ghasi/telemetry which wraps OTel and enforces field contracts.
3. Correlation and context
3.1 Required context keys
Every telemetry signal includes the keys below where values exist. Absent values use null, never "".
| Key | Type | Source | Notes |
|---|---|---|---|
| `trace_id` | hex(32) | W3C traceparent | Generated at edge if absent. |
| `span_id` | hex(16) | OTel | |
| `request_id` | uuidv7 | Kong edge | Survives through NATS via baggage. |
| `tenant_id` | UUID | JWT → baggage | Mandatory outside onboarding. |
| `facility_id` | UUID? | JWT / context | Hierarchy scope. |
| `actor_id_hash` | sha256(actor_id + tenant_salt) | auth layer | Raw actor_id never in telemetry. |
| `actor_role` | enum | JWT | patient \| clinician \| nurse \| admin \| system \| anonymous |
| `encounter_id_hash` | sha256 | domain context | Links clinical workflow spans. |
| `session_id` | ULID | UI | Cross-request stitching. |
| `device_id_hash` | sha256 | offline SDK | Device binding. |
| `app` | string | build | clinician-web, patient-portal, provider-mobile, etc. |
| `app_version` | semver | build | |
| `env` | enum | runtime | dev \| staging \| prod \| sandbox |
| `region` | string | runtime | af-kbl-1, af-mzs-1, … |
| `log_schema_version` | int | library | Current: 1. |
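As a rough illustration of the table above, the context can be modelled as a typed object with a helper for the hashed actor ID. This is a minimal sketch: the interface name, the helper names, the UTF-8 concatenation order, and the hex encoding are assumptions, since the table specifies only `sha256(actor_id + tenant_salt)`. Only a subset of keys is shown.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for part of the §3.1 context. Key names follow the
// table; the interface itself is illustrative, not the library's API.
interface TelemetryContext {
  trace_id: string | null;
  span_id: string | null;
  request_id: string | null;
  tenant_id: string | null;
  facility_id: string | null;
  actor_id_hash: string | null;
  actor_role: "patient" | "clinician" | "nurse" | "admin" | "system" | "anonymous" | null;
  env: "dev" | "staging" | "prod" | "sandbox";
  region: string;
  log_schema_version: number;
}

// Assumed encoding: UTF-8 concatenation of actor_id and salt, lowercase hex digest.
function hashActorId(actorId: string, tenantSalt: string): string {
  return createHash("sha256").update(actorId + tenantSalt, "utf8").digest("hex");
}

// Absent values are emitted as null, never as an empty string.
function orNull(value: string | null | undefined): string | null {
  return value === undefined || value === null || value === "" ? null : value;
}

const ctx: TelemetryContext = {
  trace_id: null,
  span_id: null,
  request_id: null,
  tenant_id: orNull("tenant-a"),
  facility_id: orNull(""),
  actor_id_hash: hashActorId("actor-1", "salt-for-tenant-a"),
  actor_role: "clinician",
  env: "sandbox",
  region: "af-kbl-1",
  log_schema_version: 1,
};
```

Because the salt is per tenant, the same actor ID hashes to different values in different tenants, which keeps hashes from being joined across tenants.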
3.2 Baggage
OTel Baggage carries tenant_id, request_id, actor_role, facility_id, offline_origin, ai_budget_id across HTTP, NATS, and background workers. Baggage is stripped at egress to external APIs.
3.3 Cross-boundary propagation
- HTTP: `traceparent`, `tracestate`, `x-ghasi-tenant`, `x-ghasi-request-id`.
- NATS JetStream: CloudEvents extension attributes carry `traceparent` + `tenantid`.
- WebRTC / virtual care: `trace_id` per session; each signalling message is a span.
- Offline → online: Device emits `sync_batch_id`; server links replayed signals to the original offline `trace_id` preserved in the batch envelope.
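Every transport above carries a W3C `traceparent` value, so a small parser is a useful reference point. The sketch below handles the version `00` layout (`version-traceid-spanid-flags`); the function and type names are hypothetical, and it is stricter than the specification for future versions, which may append extra fields.

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers so the edge can generate a fresh trace_id.
function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const version = m[1];
  const traceId = m[2];
  const spanId = m[3];
  const flags = m[4];
  // Version ff and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

A NATS publisher could place the output of `formatTraceparent` in the CloudEvents extension attribute, and the consumer could call `parseTraceparent` before starting its child span.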
4. Logging
4.1 Log schema (v1)
{
"ts": "2026-04-18T09:12:33.214Z",
"level": "INFO|DEBUG|WARN|ERROR|FATAL|AUDIT",
"msg": "short human summary, ≤120 chars, no PHI interpolation",
"event": "chart.encounter.opened",
"service": "patient-chart-service",
"component": "EncounterController",
"trace_id": "…", "span_id": "…", "request_id": "…",
"tenant_id": "…", "facility_id": "…",
"actor_id_hash": "…", "actor_role": "clinician",
"encounter_id_hash": "…",
"session_id": "…", "device_id_hash": "…",
"app": "clinician-web", "app_version": "2026.4.1",
"env": "prod", "region": "af-kbl-1",
"attrs": { "duration_ms": 212, "cache_hit": true },
"error": { "type": "…", "message": "…", "stack": "…" },
"log_schema_version": 1
}
Rules:
- `msg` is static; variable data goes in `attrs`.
- `event` uses `domain.entity.action` (see 04 catalog).
- `attrs` keys are snake_case and namespaced.
- `error.stack` only at `ERROR`/`FATAL` and only in non-prod, or after scrubbing in prod.
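Some of these rules can be checked mechanically before a record is emitted. The following is a minimal sketch of such a check; the function name, the regular expressions, and the returned problem strings are assumptions, and it does not try to prove that `msg` is static or that `attrs` keys are namespaced, since neither can be decided from a single record.

```typescript
type Level = "INFO" | "DEBUG" | "WARN" | "ERROR" | "FATAL" | "AUDIT";

interface LogInput {
  level: Level;
  msg: string;
  event: string;
  attrs?: Record<string, unknown>;
  error?: { type: string; message: string; stack?: string };
}

// domain.entity.action, each part lowercase snake_case (assumed pattern).
const EVENT_RE = /^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$/;
const SNAKE_RE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

function validateLog(input: LogInput): string[] {
  const problems: string[] = [];
  if (input.msg.length > 120) problems.push("msg longer than 120 chars");
  if (!EVENT_RE.test(input.event)) problems.push("event is not domain.entity.action");
  for (const key of Object.keys(input.attrs ?? {})) {
    if (!SNAKE_RE.test(key)) problems.push(`attrs key not snake_case: ${key}`);
  }
  const hasStack = input.error !== undefined && input.error.stack !== undefined;
  if (hasStack && input.level !== "ERROR" && input.level !== "FATAL") {
    problems.push("error.stack only allowed at ERROR/FATAL");
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a CI fixture test report every violation in a bad log line at once.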
4.2 Levels
| Level | Use | Sampling |
|---|---|---|
| `FATAL` | Process-terminating | 100 % |
| `ERROR` | Contract violation, unhandled | 100 % |
| `WARN` | Recoverable anomaly, degraded mode | 100 % |
| `AUDIT` | Security/compliance events | 100 %, also to audit-service (synchronous) |
| `INFO` | Domain event emission, lifecycle | 100 % prod (filtered by category) |
| `DEBUG` | Developer detail | 0 % prod, 100 % sandbox |
4.3 PHI / PII redaction
Redaction in @ghasi/telemetry at emit; Collector re-runs for defense in depth; nightly scanner sweeps Loki for leaks.
Deny-list (never logged, even hashed with platform key):
password, password_hash, otp, totp_secret, access_token, refresh_token, id_token, private_key, webhook_secret, national_id, phone_e164, email, dob, home_address, insurance_card_number, credit_card, clinical_note_free_text, ai_prompt_raw, ai_response_raw, patient_name, provider_personal_phone.
Hashed (tenant-salt):
actor_id, device_id, ip_address, patient_id, encounter_id, mrn, npi.
Truncated + categorised (never raw):
- Clinical notes → `note_length`, `note_lang`, `note_has_ai_draft`.
- AI prompts / responses → `prompt_category`, `prompt_length`, `prompt_lang`, `safety_tags[]`.
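The deny-list and hashed-field rules above can be sketched as a single pass over an attribute map. This is an illustration under assumptions, not the `@ghasi/telemetry` implementation: the function name is hypothetical, only top-level keys are inspected, and hashed fields are re-emitted under a `<key>_hash` name by analogy with `actor_id_hash`.

```typescript
import { createHash } from "node:crypto";

// Field names copied from the deny-list and hashed list in this section.
const DENY_LIST = new Set([
  "password", "password_hash", "otp", "totp_secret", "access_token",
  "refresh_token", "id_token", "private_key", "webhook_secret", "national_id",
  "phone_e164", "email", "dob", "home_address", "insurance_card_number",
  "credit_card", "clinical_note_free_text", "ai_prompt_raw", "ai_response_raw",
  "patient_name", "provider_personal_phone",
]);
const HASHED = new Set([
  "actor_id", "device_id", "ip_address", "patient_id", "encounter_id", "mrn", "npi",
]);

function redactAttrs(attrs: Record<string, unknown>, tenantSalt: string): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(attrs)) {
    const value = attrs[key];
    if (DENY_LIST.has(key)) continue; // dropped entirely, never hashed
    if (HASHED.has(key)) {
      // Assumed output name and encoding: <key>_hash, hex sha256 of value + salt.
      out[`${key}_hash`] = value === null || value === undefined
        ? null
        : createHash("sha256").update(String(value) + tenantSalt, "utf8").digest("hex");
      continue;
    }
    out[key] = value;
  }
  return out;
}
```

The same function could run at the emitter and again at the Collector, which is what makes a leaked field in one layer recoverable in the next.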
4.4 Mandatory spans per service layer
Hexagonal architecture (02 DDD contexts) aligns with these span layers:
| Layer | Span name pattern | Required attributes |
|---|---|---|
| Presentation (controller) | {service}.controller.{handler} | http.method, http.route, http.status_code |
| Application (use case) | {service}.usecase.{name} | tenant_id, actor_role |
| Domain (aggregate) | {service}.domain.{aggregate}.{method} | domain-specific attributes (no PHI) |
| Port (interface) | {service}.port.{name} | |
| Adapter (infra) | {service}.adapter.{name} (e.g., ...adapter.postgres, ...adapter.nats) | db.system, db.name, messaging.system |
Span lifecycle is enforced by decorators in @ghasi/telemetry.
4.5 Multi-tenancy in logs
- Loki label set: `{service, env, region, severity, tenant_id}`. Unlabeled logs dropped at Collector.
- Per-tenant retention overrides supported via Collector routing.
- Tenant admin can export only their tenant's logs (signed, time-boxed S3 URL, 24 h).
4.6 Audit logs (separate pipeline)
Audit events are never best-effort. They go through a synchronous, ack'd write to audit-service before the user-facing response completes. Failures return 503 — we do not transact without audit.
Audit categories:
- Auth: login, MFA, session revoke, impersonation, break-glass.
- Data: export (DSAR), delete, bulk access.
- Clinical: order sign, note sign, result release, medication dispense.
- Interop: HL7 ingest, FHIR export, e-prescribing route.
- Moderation: AI block, human override, appeal.
- Admin: tenant config, role change, policy change, licensing assignment.
- Break-glass: invocation, reason, window.
Schema: canonical log schema + audit.action, audit.target, audit.before_hash, audit.after_hash, audit.signed_by, audit.chain_prev_hash (hash-chained for tamper evidence).
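Hash chaining can be illustrated with a few lines of code. In this sketch each record stores the hash of its predecessor in `chain_prev_hash`, so changing any earlier record breaks every later link. The record shape is reduced to three fields, and the canonical serialisation (a JSON array in fixed order) is an assumption; the real audit schema carries more fields.

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  action: string;
  target: string;
  chain_prev_hash: string | null; // null only for the first record
}

// Assumed canonical form: JSON array with a fixed field order.
function recordHash(r: AuditRecord): string {
  const canonical = JSON.stringify([r.action, r.target, r.chain_prev_hash]);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

function appendRecord(chain: AuditRecord[], action: string, target: string): AuditRecord[] {
  const prev = chain.length > 0 ? recordHash(chain[chain.length - 1]) : null;
  return chain.concat([{ action, target, chain_prev_hash: prev }]);
}

// Returns the index of the first record whose back-link does not match,
// or -1 when the whole chain is intact.
function firstBrokenLink(chain: AuditRecord[]): number {
  if (chain.length > 0 && chain[0].chain_prev_hash !== null) return 0;
  for (let i = 1; i < chain.length; i++) {
    if (chain[i].chain_prev_hash !== recordHash(chain[i - 1])) return i;
  }
  return -1;
}
```

A metric such as `audit_chain_integrity_ratio` could be derived from how often a periodic `firstBrokenLink` sweep returns -1.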
5. Metrics taxonomy
5.1 Families
Three families, strictly named:
- USE per resource: `process_cpu_seconds_total`, `db_pool_connections{state}`, `nats_consumer_pending_messages`.
- RED per request path: `http_requests_total`, `http_request_duration_seconds`, `http_requests_errors_total`.
- Domain (DKPIs): `<domain>_<entity>_<action>_total` (see §5.3+).

Naming:
- `_total` for counters; `_seconds` for latency; `_ratio` for 0–1; `_bytes` for sizes.
- Labels bounded cardinality. `tenant_id` allowed; `user_id` never a label.
- High-cardinality dimensions go to exemplars + analytics, not Prometheus labels.
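The naming rules lend themselves to a lint step. The sketch below assumes each metric is declared with a kind; the `MetricKind` type, the function name, and the idea of a declaration-time check are hypothetical, and gauges such as `nats_consumer_pending_messages` are deliberately exempt from a suffix rule.

```typescript
type MetricKind = "counter" | "latency" | "ratio" | "size" | "gauge";

// Suffix required per kind, following the naming rules above.
const REQUIRED_SUFFIX: Record<MetricKind, string | null> = {
  counter: "_total",
  latency: "_seconds",
  ratio: "_ratio",
  size: "_bytes",
  gauge: null,
};

// Labels that must never appear because of unbounded cardinality.
const FORBIDDEN_LABELS = ["user_id"];

function lintMetric(name: string, kind: MetricKind, labels: string[]): string[] {
  const problems: string[] = [];
  const suffix = REQUIRED_SUFFIX[kind];
  if (suffix !== null && name.slice(-suffix.length) !== suffix) {
    problems.push(`${name}: ${kind} metrics must end in ${suffix}`);
  }
  for (const label of labels) {
    if (FORBIDDEN_LABELS.indexOf(label) !== -1) {
      problems.push(`${name}: label ${label} is not allowed`);
    }
  }
  return problems;
}
```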
5.2 Standard labels
Every metric: service, env, region, tenant_tier (public|district|regional|referral|national). Domain metrics may add tenant_id when cardinality permits; Collector enforces a per-series cap.
5.3 Per-service SLIs (default, all services)
| SLI | Definition | Target |
|---|---|---|
| Availability | 1 − errors/total over read paths (errors = 5xx responses; client-closed 499s are not counted as errors) | 99.9 % |
| Latency p95 | http_request_duration_seconds p95 | ≤ 300 ms internal, ≤ 500 ms edge |
| Latency p99 | http_request_duration_seconds p99 | ≤ 800 ms |
| DB saturation | db_pool_in_use / db_pool_max | < 0.8 sustained |
| NATS consumer lag | nats_consumer_pending_messages | < 5 000 / partition |
| Audit write success | audit_write_success_ratio | ≥ 99.99 % |
5.4 Domain metrics — highlights
- identity: `auth_login_total{result}`, `auth_mfa_challenge_total{method,result}`, `auth_session_revoked_total{reason}`, `auth_breakglass_invoked_total{reason_code}`.
- registration: `reg_patient_registered_total{source}`, `reg_duplicate_detected_total`, `reg_merge_applied_total`.
- patient-chart: `chart_opened_total`, `chart_read_latency_seconds`, `allergy_banner_rendered_ratio`.
- orders: `order_placed_total{class}`, `order_signed_total`, `order_ddi_flag_total`.
- medication: `rx_issued_total`, `rx_dispensed_total`, `rx_dispense_latency_seconds`, `rx_substituted_total{reason}`.
- eprescribing-gateway: `eprx_route_total{corridor,result}`, `eprx_subscription_backlog`.
- laboratory: `lab_accession_total`, `lab_specimen_rejected_total{reason}`, `lab_result_released_total{priority}`.
- radiology: `rad_study_total`, `rad_report_finalised_total`, `rad_tat_seconds`.
- virtual-care: `vcare_session_started_total`, `vcare_session_dropped_total{reason}`, `vcare_ttfi_seconds`.
- patient-portal: `portal_login_total`, `portal_result_viewed_total`, `portal_appt_booked_total`.
- immunizations: `imm_dose_total{product}`, `imm_field_offline_queue_size`.
- billing/claims: `bill_charge_total`, `claim_submitted_total`, `claim_paid_total`, `claim_denial_total{reason}`.
- interop: `hl7_msg_total{type,result}`, `fhir_subscription_delivered_total`.
- audit: `audit_events_total{category}`, `audit_chain_integrity_ratio`.
- ai-gateway: see §7.
- offline: see §8.
5.5 Exemplars
Every domain counter/histogram carries exemplars linking to trace IDs — 1 in N successful requests, 100 % of errored. Grafana panels enable one-click metric → trace → log drill-down.
6. Distributed tracing
6.1 Rules
- Every inbound HTTP, WebSocket, and NATS consumer creates a root or child span.
- Every outbound DB, cache, HTTP, NATS producer, object-store, AI call creates a child span.
- Spans have `otel.status_code`, `error.type`, domain-specific attributes (no PHI).
- Any span crossing a trust boundary (tenant → external, online → offline, sync-in, e-prescribing cross-corridor) carries `trust.boundary=<name>` and its own error budget.
6.2 Domain span attributes
- `ghasi.tenant_id`, `ghasi.facility_id`, `ghasi.actor_role`.
- `ghasi.patient_id_hash`, `ghasi.encounter_id_hash`.
- `ghasi.ai.model`, `ghasi.ai.provider`, `ghasi.ai.purpose`, `ghasi.ai.tokens_in`, `ghasi.ai.tokens_out`, `ghasi.ai.cost_micro_usd`.
- `ghasi.offline.batch_id`, `ghasi.offline.source`.
- `ghasi.eprx.corridor`, `ghasi.eprx.jurisdiction`.
6.3 Sampling
| Path | Strategy | Rate |
|---|---|---|
| Health / liveness | Drop | 0 % |
| Read APIs (cached) | Head-based | 1 % |
| Write APIs | Head-based | 10 % |
| Clinical safety-critical (order sign, dispense, result release, allergy write) | Always-on | 100 % |
| AI inference | Always-on | 100 % |
| Errors (4xx ≥ 429, all 5xx) | Tail-based override | 100 % |
| Slow requests (> SLO p99) | Tail-based override | 100 % |
| Offline sync replay | Always-on | 100 % |
| Audit writes | Always-on | 100 % |
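The sampling table can be read as a decision function. The sketch below folds the head-based rate and the tail-based overrides into one call for illustration; in practice the head decision is made at trace start and the tail decision at the Collector once the span outcome is known. The type names, the `random` parameter, and the choice to let the health-check drop rule win over the error override are assumptions.

```typescript
type PathClass =
  | "health" | "read" | "write" | "safety_critical" | "ai" | "offline_sync" | "audit";

interface SpanSummary {
  pathClass: PathClass;
  statusCode: number;
  durationMs: number;
  sloP99Ms: number;
}

// Head-based rates taken from the table (1.0 means always-on).
const HEAD_RATE: Record<PathClass, number> = {
  health: 0,
  read: 0.01,
  write: 0.1,
  safety_critical: 1,
  ai: 1,
  offline_sync: 1,
  audit: 1,
};

// Tail overrides: errors (4xx >= 429, all 5xx) and requests slower than SLO p99.
function tailOverride(s: SpanSummary): boolean {
  return s.statusCode >= 429 || s.durationMs > s.sloP99Ms;
}

// `random` is injectable so the decision is testable; it defaults to Math.random().
function shouldKeep(s: SpanSummary, random: number = Math.random()): boolean {
  if (s.pathClass === "health") return false; // assumed: drop rule takes precedence
  if (tailOverride(s)) return true;
  return random < HEAD_RATE[s.pathClass];
}
```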
6.4 Redaction in spans
db.statement captured parameterised only (placeholders, never values). URL paths templatised (/v1/patients/:id/chart). Request/response bodies never in span attributes. FHIR resource bodies never in spans.
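Path templatising can be approximated when the router's route template is not available. The sketch below replaces segments that look like identifiers with `:id`; which segments count as identifiers (UUIDs and all-digit values here) is an assumption, and a real service would normally take the template from its router instead.

```typescript
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

// Drops the query string and replaces identifier-like segments with ":id".
function templatisePath(rawPath: string): string {
  const pathOnly = rawPath.split("?")[0];
  return pathOnly
    .split("/")
    .map((segment) =>
      UUID_SEGMENT.test(segment) || NUMERIC_SEGMENT.test(segment) ? ":id" : segment,
    )
    .join("/");
}
```

Dropping the query string matters as much as rewriting segments, because search parameters are a common place for patient identifiers to leak.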
7. AI telemetry
AI is the highest-risk, highest-cost surface in a clinical platform. It gets first-class observability.
7.1 Per-invocation dimensions
| Dimension | Example | Purpose |
|---|---|---|
| `ai.purpose` | chart.scribe, orders.suggest, rad.report_draft, moderation.clinical | SLOs, cost allocation |
| `ai.model` / `ai.model_version` | claude-sonnet-4-6@20261001 | Drift, A/B |
| `ai.provider` | anthropic, openai, local-llm | Failover |
| `ai.prompt_template_id` + hash | chart.scribe.v7 | Provenance |
| `ai.tokens_in` / `ai.tokens_out` | | Cost |
| `ai.cost_micro_usd` | | Cost |
| `ai.latency_ms_ttfb` / `ttlb` | | UX |
| `ai.cache.hit` / `ai.cache.key_hash` | | Cost |
| `ai.safety.pre` / `ai.safety.post` | {"unsafe_clinical":0.01,...} | Safety |
| `ai.safety.action` | allow \| redact \| block \| escalate | Safety |
| `ai.guardrail.violations[]` | phi_leak, dose_unsafe, contraindication_missed | Safety |
| `ai.output.citations[]` | FHIR resource IDs (hashed) | Provenance |
| `ai.output.grounding_score` | 0–1 | Hallucination |
| `ai.human_override` | bool | HITL closure |
Prompts and responses themselves go to a separate, encrypted, tenant-scoped store (ai-transcripts within ai-gateway-service) with tighter retention. Telemetry carries only hashes, categories, and safety signals.
7.2 AI SLIs and SLOs
| SLI | Target |
|---|---|
| Scribe TTFB p95 online | ≤ 1.2 s |
| Scribe TTFB p95 offline (local SLM) | ≤ 500 ms |
| Moderation decision latency p99 | ≤ 400 ms |
| Safety false-negative rate (sampled audit) | ≤ 0.1 % |
| Safety false-positive rate | ≤ 2 % |
| AI cost per active clinician-month | ≤ budget per tenant tier |
| Cache hit rate (prompt + ctx) | ≥ 40 % scribe, ≥ 60 % moderation |
| Provider failover success | ≥ 99 % |
| Grounding score (RAG paths) | p50 ≥ 0.8 |
7.3 AI cost observability
- Budget IDs attach to every invocation; rollups by `tenant_id × purpose × model`.
- Circuit breakers fire when tenant/purpose breaches 120 % hourly budget: model downgrade (Opus → Sonnet → Haiku), aggressive caching, then fail-closed to static / human-only.
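The degradation ladder can be sketched as a small state step. This is an illustration only: the step names, the rule that each evaluation above 120 % moves one step down, and the strict "greater than" comparison are assumptions, and recovery back up the ladder is not modelled.

```typescript
// Ordered from least to most restrictive.
const LADDER = ["opus", "sonnet", "haiku", "cache_aggressive", "fail_closed"] as const;
type LadderStep = (typeof LADDER)[number];

// Integer arithmetic on micro-USD avoids floating-point drift at the threshold.
function isBreached(spendMicroUsd: number, hourlyBudgetMicroUsd: number): boolean {
  return spendMicroUsd * 100 > hourlyBudgetMicroUsd * 120;
}

// Moves one step down the ladder per evaluation while the budget is breached.
function nextStep(current: LadderStep, spendMicroUsd: number, hourlyBudgetMicroUsd: number): LadderStep {
  if (!isBreached(spendMicroUsd, hourlyBudgetMicroUsd)) return current;
  const index = LADDER.indexOf(current);
  return LADDER[Math.min(index + 1, LADDER.length - 1)];
}
```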
7.4 Provenance and replay
Each AI output stores {prompt_template_id, prompt_hash, context_hash, model, model_version, params, safety_decisions, citations} — sufficient to replay for audit. Replay exposed via internal ai-audit admin tool, logged as AUDIT.
8. Offline telemetry
8.1 Device SDK
Offline runtimes (provider mobile, desktop registration, web fallback) run a local telemetry buffer:
- SQLite-backed, append-only, size-capped (default 64 MB).
- Each batch MAC-signed with the device binding key; server detects tampering on replay.
- Buffer encrypts at rest using device-bound key.
- Batches chunked by `sync_batch_id`, ordered by monotonic sequence number; gaps flagged.
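The signing and gap-detection behaviour can be sketched as follows. HMAC-SHA256 is an assumption (the text says only "MAC-signed"), as are the batch shape, the JSON-array canonical form, and the function names; key storage and encryption at rest are out of scope here.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SignedBatch {
  sync_batch_id: string;
  seq: number;
  payload: string;
  mac: string; // hex-encoded
}

// Assumed canonical form: JSON array, which avoids delimiter ambiguity.
function computeMac(key: string, id: string, seq: number, payload: string): Buffer {
  return createHmac("sha256", key).update(JSON.stringify([id, seq, payload])).digest();
}

function signBatch(key: string, id: string, seq: number, payload: string): SignedBatch {
  return { sync_batch_id: id, seq, payload, mac: computeMac(key, id, seq, payload).toString("hex") };
}

// Server side: constant-time comparison of the presented and recomputed MAC.
function verifyBatch(key: string, batch: SignedBatch): boolean {
  const expected = computeMac(key, batch.sync_batch_id, batch.seq, batch.payload);
  const actual = Buffer.from(batch.mac, "hex");
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}

// Returns the sequence numbers missing between the lowest and highest seen.
function findGaps(seqs: number[]): number[] {
  const sorted = seqs.slice().sort((a, b) => a - b);
  const missing: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    for (let n = sorted[i - 1] + 1; n < sorted[i]; n++) missing.push(n);
  }
  return missing;
}
```

Including `seq` in the MAC input means a replayed or reordered batch fails verification rather than silently filling a gap.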
8.2 Offline signals
| Signal | Fields |
|---|---|
| `offline.bundle.activated` | bundle_id, size_bytes, integrity_ok |
| `offline.bundle.integrity_failure` | bundle_id, expected_hash, actual_hash, reason |
| `offline.sync.started/completed/failed` | batch_id, items, bytes, duration_ms, conflicts |
| `offline.conflict.detected` | entity, strategy, winner, loser_preserved |
| `offline.device.bind/unbind/rebind_denied` | device_id_hash, reason |
| `offline.tamper.suspected` | signal, severity, evidence_hash |
| `offline.clock.skew` | skew_seconds |
| `offline.outbox.size` | rolling |
8.3 Offline SLIs
| SLI | Target |
|---|---|
| Sync success rate | ≥ 99 % per device/day |
| Conflict rate (of synced writes) | ≤ 1 % |
| Bundle tamper detection | 100 % of test cases caught |
| Device-binding mismatch blocked | 100 % |
| Reconnect → sync complete for ≤ 10 MB | ≤ 60 s |
9. Dashboard catalogue
Dashboards live as JSON in grafana/, provisioned via CI. Each has an owner and SLO link.
9.1 Global
- Platform Overview — availability, latency, saturation across 27 services.
- Error Budget Burn — per service, per SLO, 1 h / 6 h / 24 h.
- Tenant Health — per-tenant error rate, latency, AI spend, offline sync.
- Release Radar — deploys correlated with error-rate / latency deltas.
- Cost Control — infra + AI + egress.
- Audit Integrity — hash-chain continuity, write-failure rate.
9.2 Per-capability
- Identity & Access — login, MFA, session revocations, break-glass.
- Clinical — chart open, order placement, result release, allergy writes, note sign.
- Pharmacy — queue depth, verify/dispense latency, substitutions.
- Lab — accession funnel, TAT, result release.
- Radiology — study volume, TAT, AI-drafted report acceptance rate.
- Virtual care — session lifecycle, drop rate, reconnect success.
- Interop (FHIR + HL7 v2) — ingest volume, error rate, DLQ depth.
- E-prescribing — route success, subscription backlog, corridor latency.
- Immunizations / HMIS — dose volume, offline queue depth, HMIS export success.
- Billing / claims — charge funnel, denial rate, remittance lag.
- Patient portal — login, result view, appointment book success.
- AI / Scribe — TTFB, cache hit, safety actions, grounding, cost.
- Offline — sync success, conflict rate, tamper flags.
- Safety & moderation — blocks, overrides, SLA to decision.
- Data platform — NATS consumer lag, DLQ depth, outbox lag.
9.3 Per-service template
Each service auto-gets a dashboard with: RED panels, USE panels, top errors, top slow endpoints, DB pool, NATS lag, dependency map, audit integrity.
10. Alerts and SLOs
10.1 SLO framework
- All SLOs defined in `slo/*.yaml` (Sloth format), reviewed via PR.
- Multi-window, multi-burn-rate alerts (Google SRE): 1 h + 5 min (fast burn), 6 h + 30 min (slow burn).
- 28-day rolling windows; error-budget policy enforced.
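The burn-rate arithmetic behind these alerts is short enough to write out. Burn rate is the observed error ratio divided by the error budget (1 − SLO target), and an alert fires only when both the long and the short window exceed the threshold. In the sketch below the function names are hypothetical, and the 2 % and 5 % budget fractions used in the usage note are assumptions borrowed from common SRE practice, not values stated in this document.

```typescript
// Burn rate 1.0 means the budget is consumed exactly over the SLO period.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Burn rate at which `budgetFraction` of the period's budget is spent
// within `windowHours`, for a rolling period of `periodDays`.
function burnThreshold(budgetFraction: number, windowHours: number, periodDays: number = 28): number {
  return (budgetFraction * periodDays * 24) / windowHours;
}

// Multi-window rule: both windows must exceed the threshold, which keeps
// the alert from firing long after a short spike has ended.
function shouldAlert(
  longWindowErrorRatio: number,
  shortWindowErrorRatio: number,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}
```

With a 28-day period, spending 2 % of the budget in 1 h corresponds to a threshold of about 13.44, and 5 % in 6 h to about 5.6; these are the values that would pair with the 1 h + 5 min and 6 h + 30 min windows under those assumed fractions.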
10.2 Alert contract
alert: OrderSignFailure
expr: rate(orders_sign_errors_total[5m]) > 0
for: 1m
severity: SEV-1
owner: clinical-orders
runbook: https://runbooks.ghasi/clinical/order-sign-failure
auto_remediation: none
dashboards: [clinical/orders, audit/integrity]
slos: [orders.sign.availability]
Alerts without runbook + owner are rejected in CI.
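The CI rejection rule can be sketched as a lint function over parsed alert definitions. The `AlertRule` shape mirrors a subset of the contract fields shown above; the function name is hypothetical, and the severity check is an extra assumption beyond the stated runbook and owner requirement.

```typescript
interface AlertRule {
  alert: string;
  expr: string;
  severity: string;
  owner?: string;
  runbook?: string;
}

const KNOWN_SEVERITIES = ["SEV-1", "SEV-2", "SEV-3", "SEV-4"];

function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// Returns one message per violation; an empty list means the rule passes.
function lintAlert(rule: AlertRule): string[] {
  const problems: string[] = [];
  if (isBlank(rule.owner)) problems.push(`${rule.alert}: missing owner`);
  if (isBlank(rule.runbook)) problems.push(`${rule.alert}: missing runbook`);
  // Assumed extra check, not stated in the contract text.
  if (KNOWN_SEVERITIES.indexOf(rule.severity) === -1) {
    problems.push(`${rule.alert}: unknown severity ${rule.severity}`);
  }
  return problems;
}
```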
10.3 Severity ladder
| Severity | Definition | Response |
|---|---|---|
| SEV-1 | Patient-safety, data-loss, audit breach, payment outage | Page + bridge + Statuspage ≤ 5 min |
| SEV-2 | Capability degraded, SLO fast-burn | Page primary on-call |
| SEV-3 | Slow-burn SLO, non-blocking | Ticket + Slack |
| SEV-4 | Housekeeping | Ticket |
10.4 Example alerts (non-exhaustive)
- `IdentityLoginErrorRate` (SEV-2): 5xx on `/auth/*` > 1 % for 5 m.
- `OrderSignFailure` (SEV-1): any ERROR on order sign path.
- `AllergyWriteFailure` (SEV-1): any ERROR on allergy write; safety-critical.
- `AuditWriteFailure` (SEV-1): any failed audit write; blocks transactions.
- `BreakGlassSpike` (SEV-2): break-glass invocations > 3× 7-day baseline in a facility.
- `ResultReleaseDelay` (SEV-2): verified lab results unreleased > 30 min.
- `EprxCorridorLatency` (SEV-2): p95 > 5 s on cross-corridor routing.
- `AIScribeTTFBSlow` (SEV-3): 6 h burn > 1× budget on scribe TTFB.
- `OfflineBundleTamper` (SEV-1): any `offline.bundle.integrity_failure`.
- `OfflineSyncConflictSpike` (SEV-2): conflict rate > 2 % over 30 m per tenant.
- `AICostBudgetBreach` (SEV-2): tenant hourly AI spend > 120 % budget.
- `NATSConsumerLagHigh` (SEV-2): any consumer lag > 50 k for 10 m.
- `HL7IngestBacklog` (SEV-2): HL7 queue depth > threshold 15 m.
- `DSARDeletionSLA` (SEV-2): open DSAR > 27 days.
10.5 Error budget policy
- 50 % burn → notify service owner; feature-freeze optional.
- 75 % → feature-freeze mandatory; reliability PRs only.
- 100 % → rollback recent risky changes; post-incident review required.
Policy enforcement is an automated GitHub check against the SLO service.
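The thresholds above map directly onto a lookup that an automated check could call. The sketch below is illustrative: the policy names and the inclusive boundaries (for example, exactly 75 % already triggers the freeze) are assumptions.

```typescript
type BudgetPolicy = "none" | "notify_owner" | "feature_freeze" | "rollback_and_review";

// `burnedFraction` is the share of the error budget consumed (1.0 = 100 %).
function policyFor(burnedFraction: number): BudgetPolicy {
  if (burnedFraction >= 1.0) return "rollback_and_review";
  if (burnedFraction >= 0.75) return "feature_freeze";
  if (burnedFraction >= 0.5) return "notify_owner";
  return "none";
}

// A merge gate might allow only reliability PRs once the freeze applies.
function allowsFeaturePr(burnedFraction: number): boolean {
  const policy = policyFor(burnedFraction);
  return policy === "none" || policy === "notify_owner";
}
```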
11. Runbook template
Every alert points to a runbook. Standard shape:
# Runbook: {Alert}
**Severity:** SEV-N
**Owner:** {team}
**Dashboards:** [{dashboard-links}]
**Related SLOs:** [{slo-ids}]
## Symptoms
- What users / clinicians see.
## Immediate triage (≤ 5 min)
1. Confirm alert (query: `{promql}`).
2. Check dependencies (DB, NATS, Keycloak, Kong).
3. Check recent deploys.
## Diagnosis paths
- Path A: {hypothesis} → {commands} → {indicator}.
- Path B: {hypothesis} → {commands} → {indicator}.
## Resolution steps
- {action} (auto-remediation `{hook}` if applicable).
- Rollback: `{command}`.
## Follow-up
- Postmortem required if SEV-1/2.
- Action items tracked against error-budget dashboard.
Runbooks live in runbooks/ repo; incident-bot surfaces them in the incident channel.
12. Incident response hooks
12.1 Auto-declare
SEV-1 or two concurrent SEV-2 auto-declare:
- `incident-bot` opens Slack `#inc-YYYYMMDD-NN`.
- Creates PagerDuty incident; pages on-call.
- Posts runbook, burn-rate, recent deploys, related alerts.
- Opens bridge (Zoom / Meet) with auto-invite.
- Updates Statuspage with tenant-scoped visibility.
- Timeline logger captures human comments + alert transitions.
12.2 Automated remediation
| Trigger | Action |
|---|---|
| AI provider error rate > 5 % | Failover to secondary + model downgrade |
| Scribe TTFB p95 breach | Reduce streaming concurrency per tenant |
| NATS DLQ spike | Pause producer, alert, enable dead-letter drain |
| Offline tamper spike from tenant | Auto-revoke affected device bindings; require re-enrol |
| Audit write failure | Trip global "no-write" breaker on affected capability |
| Cost breach | Model downgrade → cache-only → 503 with graceful copy |
All auto-remediation logs as AUDIT and is single-command reversible.
12.3 Postmortems
- Blameless template generated from incident timeline + telemetry (`pm-bot`).
- Required within 5 business days for SEV-1/2.
- Action items tracked with SLA; overdue AIs appear on error-budget dashboard.
13. Data retention and residency (telemetry)
| Signal | Hot | Warm | Cold | Max |
|---|---|---|---|---|
| App logs (non-PHI) | 14 d Loki | 90 d S3 | 395 d Glacier | 395 d |
| Audit logs | 30 d hot | 7 y WORM S3 | — | 7 y |
| Metrics | 30 d Prom | 13 mo Mimir | — | 13 mo |
| Traces (sampled) | 7 d Tempo | 90 d S3 | — | 90 d |
| Traces (errors, AI, safety-critical) | 30 d | 395 d | — | 395 d |
| AI transcripts | 30 d | 180 d tenant-config | — | 365 d |
| Safety-flagged AI | 180 d | 2 y | — | 2 y |
| Offline device telemetry | 14 d post-sync | 90 d | — | 90 d |
Residency: tenant-pinned to home region (af-kbl-1, af-mzs-1, …). Cross-region replication off by default.
14. Per-service implementation checklist
A service is not production-ready until:
- Uses `@ghasi/telemetry` wrapper (no raw OTel).
- Emits §3 correlation context on every signal.
- Passes PHI redaction CI test (fixtures of bad logs).
- Declares RED SLIs + ≥ 1 domain SLI.
- Has a per-service dashboard provisioned.
- Has ≥ 1 SEV-2 alert with runbook.
- Declares NATS consumer lag alert (if a consumer).
- Uses synchronous audit write on PHI-touching writes.
- Offline surfaces use buffered, MAC-signed telemetry.
- AI surfaces emit §7.1 dimensions and pass safety-replay test.
- Domain events registered with telemetry + analytics projections.
- Retention / residency overrides documented if non-default.
15. PagerDuty integration
- Services map to PagerDuty services 1:1.
- Escalation: primary (5 min) → secondary (15 min) → manager (30 min).
- Schedules are Git-managed via `pagerduty-tf`.
- Alerts carry `severity`, `service`, `runbook`, `dashboards` as payload; PagerDuty auto-formats the incident.
- Statuspage integration: SEV-1 auto-publishes a component-scoped incident; SEV-2 post-mortem-first with 30-min delay.
16. Governance
- This document is versioned; material changes require an RFC under `rfcs/observability/`.
- Log schema breaking changes bump `log_schema_version` and carry a 2-release deprecation window.
- Alerts and SLO changes PR-reviewed by SRE + capability owner + (for safety) Clinical Informatics.
17. Open questions
- Whether on-device anomaly detection for offline tamper should ship in provider-mobile v2 or remain server-side only.
- Long-term strategy for per-facility (not just per-tenant) SLOs — currently tenant-tier granularity; some facilities will want dedicated views.