Observability

:::info Source
Sourced from services/tenant-service/OBSERVABILITY.md in the documentation repo.
:::

Blueprint doc 10 of 17. Companion: 15 Observability | SECURITY_MODEL | FAILURE_MODES


1. Stack

Per the platform's normative stack (doc 15 §2):

| Layer | Tool | Purpose |
|---|---|---|
| Instrumentation | OpenTelemetry SDK via @ghasi/telemetry wrapper | Unified emitter |
| Collection | OTel Collector (gateway + agent) | Redaction, tenant routing, sampling |
| Logs | Loki (hot 14d) → S3 Parquet (cold 395d) | Indexed by tenant_id, service, severity |
| Metrics | Prometheus (hot 30d) → Mimir (13mo) | Remote-write from collector |
| Traces | Tempo (hot 7d) → S3 (90d, sampled) | Exemplars link metrics→traces→logs |
| Dashboards | Grafana | Tenant-folder RBAC; stored as code |
| Alerts | Alertmanager → PagerDuty + Slack | Declared in Git |
| SLO engine | Sloth → Prometheus rules | Burn-rate alerts |

Services import only @ghasi/telemetry, never vendor SDKs directly.


2. Required Context Keys

Every log line, metric exemplar, and span carries (§3.1 of doc 15):

| Key | Source |
|---|---|
| trace_id, span_id | OTel |
| request_id | UUIDv7 (edge) |
| tenant_id | JWT → baggage |
| actor_id_hash | sha256(actor + tenant_salt) |
| actor_role | JWT |
| service | tenant-service |
| service_version | build |
| region | runtime |
| env | dev/staging/prod |
| log_schema_version | 3 |
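
One way to keep the required keys from being dropped is to model them as a type and check records before they are emitted. The following is a minimal sketch only: `RequiredContext`, `REQUIRED_KEYS`, and `missingContextKeys` are hypothetical names chosen for illustration and are not taken from @ghasi/telemetry.

```typescript
// Hypothetical shape mirroring the required context keys in the table above.
type Env = "dev" | "staging" | "prod";

interface RequiredContext {
  trace_id: string;
  span_id: string;
  request_id: string;
  tenant_id: string;
  actor_id_hash: string;
  actor_role: string;
  service: string;
  service_version: string;
  region: string;
  env: Env;
  log_schema_version: number;
}

const REQUIRED_KEYS: ReadonlyArray<keyof RequiredContext> = [
  "trace_id", "span_id", "request_id", "tenant_id", "actor_id_hash",
  "actor_role", "service", "service_version", "region", "env",
  "log_schema_version",
];

// Returns the required keys that are absent or empty in a record.
function missingContextKeys(record: Record<string, unknown>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = record[key];
    return value === undefined || value === null || value === "";
  });
}
```

A check like this could run in a unit test or a debug-only hook, so that a missing `tenant_id` is caught before the record reaches the collector.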

3. Metrics (RED + USE + Domain)

3.1 RED (Rate / Errors / Duration)

| Metric | Type | Labels | Purpose |
|---|---|---|---|
| tenant_http_requests_total | counter | route, method, status, tenant_id_hash | Request rate |
| tenant_http_request_duration_seconds | histogram | route, method, tenant_id_hash | Latency p50/p95/p99 |
| tenant_http_errors_total | counter | route, error_code, tenant_id_hash | Error rate |
| tenant_nats_events_published_total | counter | subject, tenant_id_hash | Event publish rate |
| tenant_nats_events_consumed_total | counter | subject, result (ok/skip/err) | Consumer throughput |
| tenant_nats_event_processing_duration_seconds | histogram | subject | Consumer latency |

3.2 USE (Utilization / Saturation / Errors)

| Metric | Purpose |
|---|---|
| tenant_db_connections_active | Postgres pool usage |
| tenant_db_connections_waiting | Saturation indicator |
| tenant_db_query_duration_seconds (histogram, by operation) | Query latency |
| tenant_redis_connections_active | Redis pool usage |
| tenant_outbox_lag_seconds | Outbox publish lag |
| tenant_outbox_size (gauge) | Unpublished outbox rows |
| tenant_inbox_deduplication_total (counter) | Duplicate inbox messages detected |

3.3 Domain KPIs

| Metric | Labels | Purpose |
|---|---|---|
| tenant_provisioned_total | type, region | Business KPI: new tenants |
| tenant_invite_sent_total | tenant_id_hash | Invite volume |
| tenant_invite_accepted_total | tenant_id_hash | Invite acceptance |
| tenant_invite_acceptance_rate | derived | Invite acceptance / sent |
| tenant_invite_expired_total | tenant_id_hash | Expiry volume (funnel loss) |
| tenant_active_memberships (gauge) | tenant_id_hash | Per-tenant active user count |
| tenant_suspended_total | reason | Suspension events |
| tenant_authz_checks_total | allowed, cached, tenant_id_hash | Authz PDP throughput |
| tenant_authz_check_duration_seconds | cached (bool) | PDP latency |
| tenant_authz_cache_hit_ratio | derived | Cache effectiveness |
| tenant_dynamic_group_evaluations_total | tenant_id_hash | DG eval rate |
| tenant_dynamic_group_evaluation_duration_seconds | tenant_id_hash | DG eval latency |
| tenant_dynamic_group_member_count (histogram) | tenant_id_hash | Group size distribution |
| tenant_role_churn_total | op (create/update/delete), tenant_id_hash | Permission changes |
| tenant_feature_flag_overrides_active (gauge) | tenant_id_hash | Flag override count |
| tenant_sso_login_total | tenant_id_hash, protocol, status | SSO usage |
| tenant_residency_migrations_total | from, to, status | Migration tracking |
| tenant_ai_suggestions_total | capability, accepted (bool) | AI advisory usage |
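
The two metrics marked "derived" are ratios of counters rather than separately emitted series. The sketch below shows the arithmetic under stated assumptions: the inputs are counter increases over one evaluation window, and the cache hit ratio is taken to be cached checks divided by all checks, which the table does not spell out. All names are hypothetical.

```typescript
// Ratio of two counter increases over the same window. Returns undefined when
// the denominator is zero so a dashboard can show "no data" instead of NaN.
function ratio(numerator: number, denominator: number): number | undefined {
  if (denominator <= 0) return undefined;
  return numerator / denominator;
}

// Assumed inputs: increases of the underlying counters over one window.
interface InviteWindow {
  sent: number;      // increase of tenant_invite_sent_total
  accepted: number;  // increase of tenant_invite_accepted_total
}

interface AuthzWindow {
  cachedChecks: number; // increase of tenant_authz_checks_total with cached=true
  totalChecks: number;  // increase of tenant_authz_checks_total, all labels
}

function inviteAcceptanceRate(w: InviteWindow): number | undefined {
  return ratio(w.accepted, w.sent);
}

function authzCacheHitRatio(w: AuthzWindow): number | undefined {
  return ratio(w.cachedChecks, w.totalChecks);
}
```

In production these would more likely be recording rules evaluated by Prometheus; the functions only make the intended numerator and denominator explicit.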

4. Service Level Objectives (SLOs)

| SLI | Target | Window | Error budget | Alert (burn rate) |
|---|---|---|---|---|
| Tenant resolution (NATS RR) availability | 99.99% | 30d | 0.01% (≈ 4m / month) | 2% in 1h → page; 5% in 6h → ticket |
| Tenant resolution latency p95 ≤ 5ms | 99.9% | 30d | 0.1% | as above |
| Authz check availability | 99.95% | 30d | 0.05% | 2% in 1h → page |
| Authz check latency p95 ≤ 20ms uncached | 99% | 30d | 1% | 5% in 6h → ticket |
| REST API availability | 99.9% | 30d | 0.1% | 2% in 1h → page |
| REST API latency p95 ≤ 200ms | 99% | 30d | 1% | 5% in 6h |
| Event publish lag p95 ≤ 2s | 99.9% | 30d | 0.1% | 5% in 1h → page |
| Invite acceptance success rate ≥ 99% | 99% | 7d | 1% | day-over-day drop > 10% → ticket |
| Dynamic group eval p95 ≤ 5s | 99% | 30d | 1% | 5% in 6h → ticket |
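
The error-budget and burn-rate figures above follow from simple arithmetic. The sketch below assumes that "2% in 1h" means 2% of the window's error budget is consumed within one hour, which is the usual multi-window burn-rate reading; the function names are illustrative. Under that assumption, a 99.99% target over 30 days allows about 4.32 minutes of unavailability, "2% in 1h" corresponds to a burn rate of 14.4, and "5% in 6h" to a burn rate of 6.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Allowed "bad" minutes in the window for an availability target in (0, 1).
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * MINUTES_PER_DAY;
}

// Burn-rate multiplier at which `budgetFraction` of the error budget is
// consumed within `alertHours`, for an SLO window of `windowDays`.
// A burn rate of 1 means the budget lasts exactly the whole window.
function burnRateThreshold(
  budgetFraction: number,
  alertHours: number,
  windowDays: number,
): number {
  return (budgetFraction * windowDays * 24) / alertHours;
}
```

For example, `burnRateThreshold(0.02, 1, 30)` yields the multiplier a one-hour error-rate query would be compared against for the paging alert.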

5. Dashboards (Grafana)

All dashboards are defined as code in the grafana/ folder:

5.1 Service Overview

  • RED for all routes
  • Error code heatmap (problem+json codes)
  • Outbox lag & depth
  • NATS consumer lag per subject
  • DB pool saturation

5.2 Authorization PDP

  • Authz checks per second (allowed vs denied)
  • Cache hit rate
  • Latency p50/p95/p99 (cached vs uncached)
  • Top denial reasons
  • Per-tenant heavy hitters

5.3 Tenancy Health

  • Active tenants by region
  • New tenant provisioning funnel (signup → trial → active)
  • Membership invite funnel (sent → accepted → activated)
  • Role churn
  • Feature flag override count

5.4 Dynamic Groups

  • Evaluation rate, latency histogram
  • Top-N largest groups by tenant
  • Re-evaluation trigger reasons
  • Failure rate

5.5 Migration Saga

  • In-flight residency migrations
  • Step-level duration breakdown
  • Rollback rate

5.6 Security

  • Cross-tenant isolation test results (daily canary)
  • Authz denials spike
  • Invite abuse classifier alerts
  • SSO failures by tenant

Dashboards are tenant-folder scoped in Grafana; platform_admin sees all, tenant admins see their own folder.


6. Tracing

6.1 Instrumented Spans

| Span name | Attributes |
|---|---|
| tenant.http.request | route, method, status, tenant_id |
| tenant.use_case.{name} | use_case, result |
| tenant.repo.{entity}.{op} | entity, operation, rows_affected |
| tenant.nats.publish | subject, size_bytes |
| tenant.nats.consume | subject, result, retry_count |
| tenant.authz.evaluate | resource, action, allowed, matched_permission_count, cache_hit |
| tenant.policy_engine.predicate | operator, depth |
| tenant.dynamic_group.evaluate | group_id, member_count |
| tenant.ai.call | prompt_id, prompt_version, provider, cost_micro_usd |

6.2 Sampling

| Path | Rate |
|---|---|
| authz.check (allowed) | 1% head-based |
| authz.check (denied) | 100% |
| tenant.provision | 100% |
| dynamic_group.evaluate | 100% |
| Default | 10% head-based, tail-based for errors + p99 latency |
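
The rates above can be read as a lookup plus a deterministic keep/drop decision. The sketch below is an illustration under assumptions, not the collector's actual configuration: `SpanInfo`, `sampleRate`, and `shouldSample` are hypothetical names, and the decision derives a pseudo-random number from the trace ID so that every service reaches the same verdict for a given trace. Note that whether an authz check was denied is only known once it completes, so in practice that distinction would likely be applied at the tail. The tail-based rules for errors and p99 latency are not modeled here.

```typescript
interface SpanInfo {
  path: string;       // e.g. "authz.check", "tenant.provision"
  allowed?: boolean;  // only meaningful for authz.check
}

// Sampling rate per the table above, as a fraction in [0, 1].
function sampleRate(span: SpanInfo): number {
  switch (span.path) {
    case "authz.check":
      // Denied checks are always kept; allowed (or unknown) checks at 1%.
      return span.allowed === false ? 1.0 : 0.01;
    case "tenant.provision":
    case "dynamic_group.evaluate":
      return 1.0;
    default:
      return 0.1;
  }
}

// Deterministic decision: interpret the last 8 hex digits of the trace ID as
// a number in [0, 1) and keep the trace when it falls below the rate.
function shouldSample(traceIdHex: string, rate: number): boolean {
  const low32 = parseInt(traceIdHex.slice(-8), 16);
  return low32 / 0x100000000 < rate;
}
```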

6.3 Baggage Propagation

Outgoing requests to other services carry baggage: tenant_id, request_id, actor_role. Tenant-service itself receives baggage from the API gateway.
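
On the wire, baggage travels in the W3C `baggage` HTTP header as comma-separated `key=value` members with percent-encoded values. The sketch below shows that encoding for the three keys; in a real service the OpenTelemetry propagator would normally handle this, and `TenantBaggage`, `toBaggageHeader`, and `parseBaggageHeader` are hypothetical helper names.

```typescript
interface TenantBaggage {
  tenant_id: string;
  request_id: string;
  actor_role: string;
}

// Serialises the three keys into a W3C Baggage header value.
function toBaggageHeader(b: TenantBaggage): string {
  const members: Array<[string, string]> = [
    ["tenant_id", b.tenant_id],
    ["request_id", b.request_id],
    ["actor_role", b.actor_role],
  ];
  return members.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join(",");
}

// Parses a Baggage header value into a key/value map. Member properties
// (anything after ";") and members without "=" are ignored.
function parseBaggageHeader(header: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const member of header.split(",")) {
    const semi = member.indexOf(";");
    const pair = semi === -1 ? member : member.slice(0, semi);
    const eq = pair.indexOf("=");
    if (eq <= 0) continue;
    const key = pair.slice(0, eq).trim();
    result[key] = decodeURIComponent(pair.slice(eq + 1).trim());
  }
  return result;
}
```

Only low-sensitivity identifiers belong in baggage, since the header is forwarded to every downstream service that receives the request.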


7. Structured Logs

7.1 Log Schema v3

{
  "timestamp": "2026-04-15T10:00:00.123Z",
  "level": "info",
  "message": "membership_activated",
  "service": "tenant-service",
  "service_version": "1.4.2",
  "env": "prod",
  "region": "eu-fra-1",
  "trace_id": "00-...-01",
  "span_id": "...",
  "request_id": "018f...",
  "tenant_id": "tnt_01HX...",
  "actor_id_hash": "sha256:...",
  "actor_role": "service",
  "log_schema_version": 3,
  "event": "membership_activated",
  "entity_id": "mbr_01HX...",
  "duration_ms": 12
}

7.2 Log Levels

| Level | Usage |
|---|---|
| error | Failed use case, unhandled exception, event DLQ |
| warn | Degraded behavior (cache miss storm, retried publish) |
| info | Use case success, state transitions, consumer processing |
| debug | Dev-only detailed flow (disabled in prod unless flag) |

7.3 Redaction

Enforced by @ghasi/telemetry:

  • email → local part redacted, leaving only @domain.com
  • invite_token → never logged
  • sso_client_secret → never logged
  • permissions[].condition values → redacted if they contain literals
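
The first three rules can be sketched as a small field filter. This is an illustration under assumptions, not the @ghasi/telemetry implementation: the names are hypothetical, only top-level fields are handled, the `[redacted]` placeholder for values without an `@` is invented for the example, and the permissions[].condition rule is left out.

```typescript
type LogFields = Record<string, unknown>;

// Fields that must never reach the log pipeline.
const NEVER_LOGGED: string[] = ["invite_token", "sso_client_secret"];

// Keeps only the domain of an address: "a.b@example.com" -> "@example.com".
// Values without "@" are replaced entirely (assumed placeholder).
function redactEmail(email: string): string {
  const at = email.lastIndexOf("@");
  return at === -1 ? "[redacted]" : email.slice(at);
}

// Returns a copy of the fields with secrets dropped and emails reduced to
// their domain. The input object is not mutated.
function redact(fields: LogFields): LogFields {
  const cleaned: LogFields = {};
  for (const key of Object.keys(fields)) {
    if (NEVER_LOGGED.indexOf(key) !== -1) continue;
    const value = fields[key];
    cleaned[key] =
      key === "email" && typeof value === "string" ? redactEmail(value) : value;
  }
  return cleaned;
}
```

Applying the filter in the shared wrapper rather than at call sites means a service cannot accidentally bypass it.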

8. Alerts

| Alert | Trigger | Severity | Runbook |
|---|---|---|---|
| TenantResolveSLOBurn | 2% burn in 1h on availability | page | runbook://tenant/resolve-burn |
| AuthzCheckLatencyHigh | p95 > 50ms for 5 min | page | runbook://tenant/authz-latency |
| AuthzDenialSpike | denials > 10x baseline for 5 min | page (security) | runbook://tenant/authz-spike |
| OutboxLagHigh | p95 publish lag > 30s for 10 min | page | runbook://tenant/outbox-lag |
| OutboxDepthGrowing | unpublished > 10k rows for 5 min | page | runbook://tenant/outbox-depth |
| DLQNonEmpty | any DLQ message | ticket (or page if > 100) | runbook://tenant/dlq |
| DynamicGroupEvalSlow | p95 > 30s | ticket | runbook://tenant/dg-slow |
| InviteAbuseSpike | abuse classifier > 100/hour for one tenant | page (abuse) | runbook://tenant/invite-abuse |
| DBPoolSaturation | waiting > 20 for 2 min | page | runbook://tenant/db-pool |
| ResidencyMigrationStalled | saga step > 2x expected duration | ticket | runbook://tenant/residency-stalled |
| CrossTenantCanaryFailure | daily two-tenant isolation test fails | page (sev-1) | runbook://tenant/xtenant-failure |
| LastOwnerRiskAlert | tenant with only 1 org_owner for 24h | ticket | runbook://tenant/last-owner |

Every alert references a runbook slug, owner, and auto-remediation hook where applicable.
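
Several triggers above have the form "value above threshold for N minutes", for example DBPoolSaturation (waiting > 20 for 2 min). The sketch below illustrates that "sustained" semantics on a list of samples. It is a simplified model for reasoning about the rule, not how Alertmanager or Prometheus evaluate it, and `Sample` and `sustainedAbove` are hypothetical names.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// True when the series reaches back at least `forMs` before `nowMs` and every
// sample inside that trailing window exceeds `threshold`.
// Samples must be sorted by timestamp, ascending.
function sustainedAbove(
  samples: Sample[],
  threshold: number,
  forMs: number,
  nowMs: number,
): boolean {
  const windowStart = nowMs - forMs;
  const first = samples[0];
  // Not enough history to claim the condition held for the full duration.
  if (first === undefined || first.timestampMs > windowStart) return false;
  const inWindow = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs,
  );
  return inWindow.length > 0 && inWindow.every((s) => s.value > threshold);
}
```

A single dip below the threshold inside the window resets the condition, which is what keeps short spikes from paging.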


9. Health Endpoints

| Endpoint | Purpose |
|---|---|
| GET /health/live | Liveness (process up) |
| GET /health/ready | Readiness (DB, Redis, NATS reachable; outbox relay running; JWKS loaded) |
| GET /health/startup | Startup probe (migrations complete, system roles seeded) |

The readiness probe gates traffic at the load balancer.
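
A readiness handler of this kind typically aggregates several dependency checks into one verdict. The sketch below shows one possible shape, assuming the conventional 200/503 status codes, which this document does not specify. `NamedCheck`, `ReadinessResult`, and `readiness` are hypothetical names, and the individual checks (DB, Redis, NATS, outbox relay, JWKS) would be supplied by the service.

```typescript
interface NamedCheck {
  name: string;
  run: () => Promise<boolean>;
}

interface ReadinessResult {
  status: 200 | 503;
  checks: Record<string, boolean>;
}

// Runs all dependency checks in parallel. A check that throws or rejects
// counts as failed, and any failed check makes the service not ready.
async function readiness(checks: NamedCheck[]): Promise<ReadinessResult> {
  const results = await Promise.all(
    checks.map(async (c) => {
      try {
        return { name: c.name, ok: await c.run() };
      } catch {
        return { name: c.name, ok: false };
      }
    }),
  );
  const report: Record<string, boolean> = {};
  for (const r of results) report[r.name] = r.ok;
  return { status: results.every((r) => r.ok) ? 200 : 503, checks: report };
}
```

Returning the per-check report alongside the status makes it easier to see which dependency is holding the instance out of rotation.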


10. Continuous Verification

| Canary | Schedule | Action |
|---|---|---|
| Two-tenant isolation test | Every 5 min | Provision ephemeral tenants A & B; verify no cross-access; destroy |
| Authz latency canary | Every minute | Simulated authz check; alert if p95 > 20ms |
| Event round-trip | Every 5 min | Publish canary event; verify consumed and acked |
| Full saga dry-run (residency) | Nightly on staging | End-to-end migration on synthetic tenant |
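
The two-tenant isolation test follows a provision, probe, destroy pattern. The sketch below illustrates that flow against an injected client. The `TenantApi` interface and all of its methods are hypothetical stand-ins invented for this example and do not describe the service's real API.

```typescript
// Hypothetical client surface the canary needs; not the real service API.
interface TenantApi {
  provision(label: string): Promise<string>;          // returns a tenant ID
  createResource(tenantId: string): Promise<string>;  // returns a resource ID
  canRead(asTenantId: string, resourceId: string): Promise<boolean>;
  destroy(tenantId: string): Promise<void>;
}

// Returns true when each tenant can read its own resource and neither can
// read the other's. Ephemeral tenants are destroyed even if a probe throws.
async function isolationCanary(api: TenantApi): Promise<boolean> {
  const a = await api.provision("canary-a");
  const b = await api.provision("canary-b");
  try {
    const resA = await api.createResource(a);
    const resB = await api.createResource(b);
    const ownAccess =
      (await api.canRead(a, resA)) && (await api.canRead(b, resB));
    const crossAccess =
      (await api.canRead(a, resB)) || (await api.canRead(b, resA));
    return ownAccess && !crossAccess;
  } finally {
    await api.destroy(a);
    await api.destroy(b);
  }
}
```

Checking own-tenant access as well as cross-tenant access guards against a false pass where reads fail for everyone.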

11. Incident Response

Integration:

  • When an alert pages through PagerDuty, incident-bot auto-declares an incident.
  • Bridge link auto-populated with: Grafana dashboards, Tempo traces for recent errors, Loki log slice (last 15 min, tenant-filtered).
  • Runbook URL injected into incident description.
  • Statuspage auto-update for availability alerts (5-min delay unless manually promoted).