Observability

Source: services/tenant-service/OBSERVABILITY.md in the documentation repo.

Blueprint doc 10 of 17. Companion: 15 Observability | SECURITY_MODEL | FAILURE_MODES

1. Stack

Per platform normative stack (15 §2):

Layer	Tool	Purpose
Instrumentation	OpenTelemetry SDK via `@ghasi/telemetry` wrapper	Unified emitter
Collection	OTel Collector (gateway + agent)	Redaction, tenant routing, sampling
Logs	Loki (hot 14d) → S3 Parquet (cold 395d)	Indexed by tenant_id, service, severity
Metrics	Prometheus (hot 30d) → Mimir (13mo)	Remote-write from collector
Traces	Tempo (hot 7d) → S3 (90d, sampled)	Exemplars link metrics→traces→logs
Dashboards	Grafana	Tenant-folder RBAC; stored as code
Alerts	Alertmanager → PagerDuty + Slack	Declared in Git
SLO engine	Sloth → Prometheus rules	Burn-rate alerts

Services import only @ghasi/telemetry, never vendor SDKs directly.

2. Required Context Keys

Every log line, metric exemplar, and span carries the following keys (§3.1 of doc 15):

Key	Source
`trace_id`, `span_id`	OTel
`request_id`	UUIDv7 (edge)
`tenant_id`	JWT → baggage
`actor_id_hash`	sha256(actor + tenant_salt)
`actor_role`	JWT
`service`	`tenant-service`
`service_version`	build
`region`	runtime
`env`	dev/staging/prod
`log_schema_version`	3
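The `actor_id_hash` derivation above can be sketched as follows. This is a minimal illustration, not the `@ghasi/telemetry` implementation: the function name `actorIdHash` is hypothetical, plain string concatenation of actor and salt is an assumption, and the `sha256:` prefix is inferred from the sample record in section 7.1.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper. The table only states sha256(actor + tenant_salt);
// the concatenation order and the "sha256:" prefix are assumptions.
export function actorIdHash(actorId: string, tenantSalt: string): string {
  const digest = createHash("sha256")
    .update(actorId + tenantSalt, "utf8")
    .digest("hex");
  return `sha256:${digest}`;
}
```

Because the salt is per tenant, the same actor yields a different hash in each tenant, so hashed identifiers cannot be joined across tenants.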

3. Metrics (RED + USE + Domain)

3.1 RED (Rate / Errors / Duration)

Metric	Type	Labels	Purpose
`tenant_http_requests_total`	counter	route, method, status, tenant_id_hash	Request rate
`tenant_http_request_duration_seconds`	histogram	route, method, tenant_id_hash	Latency p50/p95/p99
`tenant_http_errors_total`	counter	route, error_code, tenant_id_hash	Error rate
`tenant_nats_events_published_total`	counter	subject, tenant_id_hash	Event publish rate
`tenant_nats_events_consumed_total`	counter	subject, result (ok/skip/err)	Consumer throughput
`tenant_nats_event_processing_duration_seconds`	histogram	subject	Consumer latency

3.2 USE (Utilization / Saturation / Errors)

Metric	Purpose
`tenant_db_connections_active`	Postgres pool usage
`tenant_db_connections_waiting`	Saturation indicator
`tenant_db_query_duration_seconds` (histogram, by operation)	Query latency
`tenant_redis_connections_active`	Redis pool usage
`tenant_outbox_lag_seconds`	Outbox publish lag
`tenant_outbox_size` (gauge)	Unpublished outbox rows
`tenant_inbox_deduplication_total` (counter)	Inbox deduplication hits

3.3 Domain KPIs

Metric	Labels	Purpose
`tenant_provisioned_total`	type, region	Business KPI: new tenants
`tenant_invite_sent_total`	tenant_id_hash	Invite volume
`tenant_invite_accepted_total`	tenant_id_hash	Invite acceptance
`tenant_invite_acceptance_rate`	derived	Invite acceptance / sent
`tenant_invite_expired_total`	tenant_id_hash	Expiry volume (funnel loss)
`tenant_active_memberships` (gauge)	tenant_id_hash	Per-tenant active user count
`tenant_suspended_total`	reason	Suspension events
`tenant_authz_checks_total`	allowed, cached, tenant_id_hash	Authz PDP throughput
`tenant_authz_check_duration_seconds`	cached (bool)	PDP latency
`tenant_authz_cache_hit_ratio`	derived	Cache effectiveness
`tenant_dynamic_group_evaluations_total`	tenant_id_hash	DG eval rate
`tenant_dynamic_group_evaluation_duration_seconds`	tenant_id_hash	DG eval latency
`tenant_dynamic_group_member_count` (histogram)	tenant_id_hash	Group size distribution
`tenant_role_churn_total`	op (create/update/delete), tenant_id_hash	Permission changes
`tenant_feature_flag_overrides_active` (gauge)	tenant_id_hash	Flag override count
`tenant_sso_login_total`	tenant_id_hash, protocol, status	SSO usage
`tenant_residency_migrations_total`	from, to, status	Migration tracking
`tenant_ai_suggestions_total`	capability, accepted (bool)	AI advisory usage

4. Service Level Objectives (SLOs)

SLI	Target	Window	Error budget	Alert (burn rate)
Tenant resolution (NATS RR) availability	99.99%	30d	0.01% (≈ 4m / month)	2% in 1h → page; 5% in 6h → ticket
Tenant resolution latency p95 ≤ 5ms	99.9%	30d	0.1%	as above
Authz check availability	99.95%	30d	0.05%	2% in 1h → page
Authz check latency p95 ≤ 20ms uncached	99%	30d	1%	5% in 6h → ticket
REST API availability	99.9%	30d	0.1%	2% in 1h → page
REST API latency p95 ≤ 200ms	99%	30d	1%	5% in 6h → ticket
Event publish lag p95 ≤ 2s	99.9%	30d	0.1%	5% in 1h → page
Invite acceptance success rate ≥ 99%	99%	7d	1%	day-over-day drop > 10% → ticket
Dynamic group eval p95 ≤ 5s	99%	30d	1%	5% in 6h → ticket
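The error-budget and burn-rate figures in the table can be checked with a little arithmetic. The sketch below assumes that "2% in 1h" means 2% of the window's error budget consumed within one hour, which is the usual multi-window burn-rate reading; the function names are illustrative.

```typescript
// Minutes of allowed unavailability for an SLO target over a window.
export function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * 24 * 60;
}

// Burn-rate multiplier implied by "budgetFraction of the budget consumed
// within alertHours". A burn rate of 1 spends exactly the whole budget
// over the full window.
export function burnRate(
  budgetFraction: number,
  alertHours: number,
  windowDays: number,
): number {
  return (budgetFraction * windowDays * 24) / alertHours;
}

// 99.99% over 30d allows about 4.32 minutes, matching "≈ 4m / month".
const resolveBudget = errorBudgetMinutes(0.9999, 30);
// "2% in 1h" corresponds to a burn rate of about 14.4; "5% in 6h" to about 6.
const pageRate = burnRate(0.02, 1, 30);
const ticketRate = burnRate(0.05, 6, 30);
```

Under this reading, the page threshold fires when errors arrive roughly 14 times faster than the budget allows, and the ticket threshold at roughly 6 times.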

5. Dashboards (Grafana)

All dashboards are defined as code in the grafana/ folder:

5.1 Service Overview

RED for all routes
Error code heatmap (problem+json codes)
Outbox lag & depth
NATS consumer lag per subject
DB pool saturation

5.2 Authorization PDP

Authz checks per second (allowed vs denied)
Cache hit rate
Latency p50/p95/p99 (cached vs uncached)
Top denial reasons
Per-tenant heavy hitters

5.3 Tenancy Health

Active tenants by region
New tenant provisioning funnel (signup → trial → active)
Membership invite funnel (sent → accepted → activated)
Role churn
Feature flag override count

5.4 Dynamic Groups

Evaluation rate, latency histogram
Top-N largest groups by tenant
Re-evaluation trigger reasons
Failure rate

5.5 Migration Saga

In-flight residency migrations
Step-level duration breakdown
Rollback rate

5.6 Security

Cross-tenant isolation test results (daily canary)
Authz denials spike
Invite abuse classifier alerts
SSO failures by tenant

Dashboards are tenant-folder scoped in Grafana; platform_admin sees all, tenant admins see their own folder.

6. Tracing

6.1 Instrumented Spans

Span name	Attributes
`tenant.http.request`	route, method, status, tenant_id
`tenant.use_case.{name}`	use_case, result
`tenant.repo.{entity}.{op}`	entity, operation, rows_affected
`tenant.nats.publish`	subject, size_bytes
`tenant.nats.consume`	subject, result, retry_count
`tenant.authz.evaluate`	resource, action, allowed, matched_permission_count, cache_hit
`tenant.policy_engine.predicate`	operator, depth
`tenant.dynamic_group.evaluate`	group_id, member_count
`tenant.ai.call`	prompt_id, prompt_version, provider, cost_micro_usd

6.2 Sampling

Path	Rate
`authz.check` (allowed)	1% head-based
`authz.check` (denied)	100%
`tenant.provision`	100%
`dynamic_group.evaluate`	100%
Default	10% head-based, tail-based for errors + p99 latency
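The head-based rates in the table could be expressed as a lookup plus a deterministic decision on the trace ID. This is a sketch under assumptions: the names `headSampleRate` and `shouldSample` are hypothetical, the `denied` flag is assumed to be known when the decision is made, and the tail-based rule for errors and p99 latency is not modeled here.

```typescript
type SampleInput = { path: string; denied?: boolean };

// Rates copied from the sampling table; anything unlisted uses the default.
export function headSampleRate({ path, denied = false }: SampleInput): number {
  switch (path) {
    case "authz.check":
      return denied ? 1.0 : 0.01;
    case "tenant.provision":
    case "dynamic_group.evaluate":
      return 1.0;
    default:
      return 0.1;
  }
}

// Deterministic decision: map the first 8 hex characters of the trace ID
// onto [0, 1) and keep the trace if it falls below the rate. Every service
// that sees the same trace ID reaches the same decision.
export function shouldSample(traceIdHex: string, rate: number): boolean {
  const bucket = parseInt(traceIdHex.slice(0, 8), 16) / 0x100000000;
  return bucket < rate;
}
```

Deriving the decision from the trace ID rather than a random number keeps a trace either fully sampled or fully dropped across service hops.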

6.3 Baggage Propagation

Outgoing requests to other services carry baggage: tenant_id, request_id, actor_role. Tenant-service itself receives baggage from the API gateway.
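For illustration, the three baggage entries could be serialized into a W3C `baggage` header value (comma-separated `key=value` members with percent-encoded values) roughly as below. The helper names are hypothetical; in practice the OpenTelemetry propagator behind `@ghasi/telemetry` would be expected to handle this.

```typescript
type TenantBaggage = { tenant_id: string; request_id: string; actor_role: string };

// Serialize the context into a baggage header value.
export function toBaggageHeader(ctx: TenantBaggage): string {
  return Object.entries(ctx)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(",");
}

// Parse a baggage header value back into key/value pairs,
// ignoring optional ";property" suffixes on each member.
export function parseBaggageHeader(header: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const member of header.split(",")) {
    const [pair] = member.split(";");
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip malformed members
    out[pair.slice(0, eq).trim()] = decodeURIComponent(pair.slice(eq + 1).trim());
  }
  return out;
}
```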

7. Structured Logs

7.1 Log Schema v3

{
  "timestamp": "2026-04-15T10:00:00.123Z",
  "level": "info",
  "message": "membership_activated",
  "service": "tenant-service",
  "service_version": "1.4.2",
  "env": "prod",
  "region": "eu-fra-1",
  "trace_id": "00-...-01",
  "span_id": "...",
  "request_id": "018f...",
  "tenant_id": "tnt_01HX...",
  "actor_id_hash": "sha256:...",
  "actor_role": "service",
  "log_schema_version": 3,
  "event": "membership_activated",
  "entity_id": "mbr_01HX...",
  "duration_ms": 12
}
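A record like the one above can be checked against the required context keys from section 2. The sketch below is an assumption about how such a check might look; `missingContextKeys` is a hypothetical helper, not part of `@ghasi/telemetry`.

```typescript
// Required context keys, as listed in section 2.
const REQUIRED_KEYS = [
  "trace_id", "span_id", "request_id", "tenant_id", "actor_id_hash",
  "actor_role", "service", "service_version", "region", "env",
  "log_schema_version",
] as const;

// Returns the required keys that are absent or empty in a log record.
export function missingContextKeys(record: Record<string, unknown>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = record[key];
    return value === undefined || value === null || value === "";
  });
}
```

A check of this kind could run in unit tests or in the collector to reject records that would not be attributable to a tenant or a trace.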

7.2 Log Levels

Level	Usage
`error`	Failed use case, unhandled exception, event DLQ
`warn`	Degraded behavior (cache miss storm, retried publish)
`info`	Use case success, state transitions, consumer processing
`debug`	Dev-only detailed flow (disabled in prod unless flag)

7.3 Redaction

Enforced by @ghasi/telemetry:

email → redacted to @domain.com
invite_token → never logged
sso_client_secret → never logged
permissions[].condition values → redacted if they contain literals
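The first three rules above might look like this in code. It is a sketch only: `redact` is a hypothetical function, "redacted to @domain.com" is interpreted as keeping only the domain part of the address, and the `permissions[].condition` rule is omitted because its literal detection is not specified here.

```typescript
// Fields that must never reach a log sink.
const NEVER_LOGGED = new Set(["invite_token", "sso_client_secret"]);

export function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (NEVER_LOGGED.has(key)) continue; // drop the field entirely
    if (key === "email" && typeof value === "string") {
      // Keep only "@domain"; fall back to a marker if no "@" is present.
      const at = value.lastIndexOf("@");
      out[key] = at >= 0 ? value.slice(at) : "[redacted]";
      continue;
    }
    out[key] = value;
  }
  return out;
}
```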

8. Alerts

Alert	Trigger	Severity	Runbook
`TenantResolveSLOBurn`	2% burn in 1h on availability	page	runbook://tenant/resolve-burn
`AuthzCheckLatencyHigh`	p95 > 50ms for 5 min	page	runbook://tenant/authz-latency
`AuthzDenialSpike`	denials > 10x baseline for 5 min	page (security)	runbook://tenant/authz-spike
`OutboxLagHigh`	p95 publish lag > 30s for 10 min	page	runbook://tenant/outbox-lag
`OutboxDepthGrowing`	unpublished > 10k rows for 5 min	page	runbook://tenant/outbox-depth
`DLQNonEmpty`	any DLQ message	ticket (or page if > 100)	runbook://tenant/dlq
`DynamicGroupEvalSlow`	p95 > 30s	ticket	runbook://tenant/dg-slow
`InviteAbuseSpike`	abuse classifier > 100/hour for one tenant	page (abuse)	runbook://tenant/invite-abuse
`DBPoolSaturation`	waiting > 20 for 2 min	page	runbook://tenant/db-pool
`ResidencyMigrationStalled`	saga step > 2x expected duration	ticket	runbook://tenant/residency-stalled
`CrossTenantCanaryFailure`	daily two-tenant isolation test fails	page (sev-1)	runbook://tenant/xtenant-failure
`LastOwnerRiskAlert`	tenant with only 1 org_owner for 24h	ticket	runbook://tenant/last-owner

Every alert references a runbook slug, owner, and auto-remediation hook where applicable.

9. Health Endpoints

Endpoint	Purpose
`GET /health/live`	Liveness (process up)
`GET /health/ready`	Readiness (DB, Redis, NATS reachable; outbox relay running; JWKS loaded)
`GET /health/startup`	Startup probe (migrations complete, system roles seeded)

The readiness probe gates traffic at the load balancer.
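The readiness rule can be read as "ready only if every dependency check passes". A minimal sketch of that aggregation follows; the check names, the `readinessStatus` function, and the 200/503 status codes are assumptions based on common probe conventions.

```typescript
// Names mirror the dependencies listed for GET /health/ready.
type ReadinessChecks = {
  db: boolean;
  redis: boolean;
  nats: boolean;
  outboxRelay: boolean;
  jwks: boolean;
};

// Ready (200) only when every check passes; otherwise 503 with the failures.
export function readinessStatus(
  checks: ReadinessChecks,
): { status: 200 | 503; failed: string[] } {
  const failed = Object.entries(checks)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

Returning the list of failed checks in the response body makes it quicker to see which dependency took an instance out of rotation.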

10. Continuous Verification

Canary	Schedule	Action
Two-tenant isolation test	Every 5 min	Provision ephemeral tenants A & B; verify no cross-access; destroy
Authz latency canary	Every minute	Simulated authz check; alert if p95 > 20ms
Event round-trip	Every 5 min	Publish canary event; verify consumed and acked
Full saga dry-run (residency)	Nightly on staging	End-to-end migration on synthetic tenant

11. Incident Response

Integration:

PagerDuty on alert → incident-bot auto-declares.
Bridge link auto-populated with: Grafana dashboards, Tempo traces for recent errors, Loki log slice (last 15 min, tenant-filtered).
Runbook URL injected into incident description.
Statuspage auto-update for availability alerts (5-min delay unless manually promoted).
