Observability
:::info Source
Sourced from services/tenant-service/OBSERVABILITY.md in the documentation repo.
:::
Blueprint doc 10 of 17. Companion: 15 Observability | SECURITY_MODEL | FAILURE_MODES
1. Stack
Per platform normative stack (15 §2):
| Layer | Tool | Purpose |
|---|---|---|
| Instrumentation | OpenTelemetry SDK via @ghasi/telemetry wrapper | Unified emitter |
| Collection | OTel Collector (gateway + agent) | Redaction, tenant routing, sampling |
| Logs | Loki (hot 14d) → S3 Parquet (cold 395d) | Indexed by tenant_id, service, severity |
| Metrics | Prometheus (hot 30d) → Mimir (13mo) | Remote-write from collector |
| Traces | Tempo (hot 7d) → S3 (90d, sampled) | Exemplars link metrics→traces→logs |
| Dashboards | Grafana | Tenant-folder RBAC; stored as code |
| Alerts | Alertmanager → PagerDuty + Slack | Declared in Git |
| SLO engine | Sloth → Prometheus rules | Burn-rate alerts |
Services import only @ghasi/telemetry, never vendor SDKs directly.
2. Required Context Keys
Every log line, metric exemplar, and span carries (§3.1 of doc 15):
| Key | Source |
|---|---|
| `trace_id`, `span_id` | OTel |
| `request_id` | UUIDv7 (edge) |
| `tenant_id` | JWT → baggage |
| `actor_id_hash` | `sha256(actor + tenant_salt)` |
| `actor_role` | JWT |
| `service` | `tenant-service` |
| `service_version` | build |
| `region` | runtime |
| `env` | dev/staging/prod |
| `log_schema_version` | 3 |
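
A check for the required keys can be sketched as follows. This is a minimal illustration, not the `@ghasi/telemetry` API: `missingContextKeys` is a hypothetical helper, and treating empty strings as missing is an assumption.

```typescript
// Required context keys, copied from the table above.
const REQUIRED_CONTEXT_KEYS = [
  "trace_id",
  "span_id",
  "request_id",
  "tenant_id",
  "actor_id_hash",
  "actor_role",
  "service",
  "service_version",
  "region",
  "env",
  "log_schema_version",
] as const;

type ContextKey = (typeof REQUIRED_CONTEXT_KEYS)[number];

// Hypothetical helper: returns the required keys that are absent,
// null, or empty in a log/span attribute record.
function missingContextKeys(record: Record<string, unknown>): ContextKey[] {
  return REQUIRED_CONTEXT_KEYS.filter((key) => {
    const value = record[key];
    return value === undefined || value === null || value === "";
  });
}
```

A check like this could run in unit tests or a lint step so that records missing `tenant_id` or `trace_id` are caught before they reach the collector.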
3. Metrics (RED + USE + Domain)
3.1 RED (Rate / Errors / Duration)
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `tenant_http_requests_total` | counter | route, method, status, tenant_id_hash | Request rate |
| `tenant_http_request_duration_seconds` | histogram | route, method, tenant_id_hash | Latency p50/p95/p99 |
| `tenant_http_errors_total` | counter | route, error_code, tenant_id_hash | Error rate |
| `tenant_nats_events_published_total` | counter | subject, tenant_id_hash | Event publish rate |
| `tenant_nats_events_consumed_total` | counter | subject, result (ok/skip/err) | Consumer throughput |
| `tenant_nats_event_processing_duration_seconds` | histogram | subject | Consumer latency |
3.2 USE (Utilization / Saturation / Errors)
| Metric | Purpose |
|---|---|
| `tenant_db_connections_active` | Postgres pool usage |
| `tenant_db_connections_waiting` | Saturation indicator |
| `tenant_db_query_duration_seconds` (histogram, by operation) | Query latency |
| `tenant_redis_connections_active` | Redis pool usage |
| `tenant_outbox_lag_seconds` | Outbox publish lag |
| `tenant_outbox_size` (gauge) | Unpublished outbox rows |
| `tenant_inbox_deduplication_total` (counter) | Inbox messages dropped as duplicates |
3.3 Domain KPIs
| Metric | Labels | Purpose |
|---|---|---|
| `tenant_provisioned_total` | type, region | Business KPI: new tenants |
| `tenant_invite_sent_total` | tenant_id_hash | Invite volume |
| `tenant_invite_accepted_total` | tenant_id_hash | Invite acceptance |
| `tenant_invite_acceptance_rate` | derived | Invites accepted / invites sent |
| `tenant_invite_expired_total` | tenant_id_hash | Expiry volume (funnel loss) |
| `tenant_active_memberships` (gauge) | tenant_id_hash | Per-tenant active user count |
| `tenant_suspended_total` | reason | Suspension events |
| `tenant_authz_checks_total` | allowed, cached, tenant_id_hash | Authz PDP throughput |
| `tenant_authz_check_duration_seconds` | cached (bool) | PDP latency |
| `tenant_authz_cache_hit_ratio` | derived | Cache effectiveness |
| `tenant_dynamic_group_evaluations_total` | tenant_id_hash | DG eval rate |
| `tenant_dynamic_group_evaluation_duration_seconds` | tenant_id_hash | DG eval latency |
| `tenant_dynamic_group_member_count` (histogram) | tenant_id_hash | Group size distribution |
| `tenant_role_churn_total` | op (create/update/delete), tenant_id_hash | Permission changes |
| `tenant_feature_flag_overrides_active` (gauge) | tenant_id_hash | Flag override count |
| `tenant_sso_login_total` | tenant_id_hash, protocol, status | SSO usage |
| `tenant_residency_migrations_total` | from, to, status | Migration tracking |
| `tenant_ai_suggestions_total` | capability, accepted (bool) | AI advisory usage |
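
The two derived metrics above are ratios of counters. A minimal sketch of the arithmetic, assuming the cache hit ratio is computed as cached checks over all checks (the table only says "derived"); the function names are illustrative and not part of any library:

```typescript
// Ratio of two counter increases over the same time range.
// Returns null when the denominator is zero so that "no traffic"
// is not reported as a 0% rate.
function ratio(numerator: number, denominator: number): number | null {
  if (denominator <= 0) return null;
  return numerator / denominator;
}

// tenant_invite_acceptance_rate: accepted / sent.
function inviteAcceptanceRate(accepted: number, sent: number): number | null {
  return ratio(accepted, sent);
}

// tenant_authz_cache_hit_ratio (assumed definition):
// checks with cached=true over all checks.
function authzCacheHitRatio(cachedChecks: number, totalChecks: number): number | null {
  return ratio(cachedChecks, totalChecks);
}
```

In practice these would more likely be Prometheus recording rules over the underlying counters; the sketch only makes the definitions explicit.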
4. Service Level Objectives (SLOs)
| SLI | Target | Window | Error budget | Alert (burn rate) |
|---|---|---|---|---|
| Tenant resolution (NATS RR) availability | 99.99% | 30d | 0.01% (≈ 4m / month) | 2% in 1h → page; 5% in 6h → ticket |
| Tenant resolution latency p95 ≤ 5ms | 99.9% | 30d | 0.1% | as above |
| Authz check availability | 99.95% | 30d | 0.05% | 2% in 1h → page |
| Authz check latency p95 ≤ 20ms uncached | 99% | 30d | 1% | 5% in 6h → ticket |
| REST API availability | 99.9% | 30d | 0.1% | 2% in 1h → page |
| REST API latency p95 ≤ 200ms | 99% | 30d | 1% | 5% in 6h → ticket |
| Event publish lag p95 ≤ 2s | 99.9% | 30d | 0.1% | 5% in 1h → page |
| Invite acceptance success rate ≥ 99% | 99% | 7d | 1% | day-over-day drop > 10% → ticket |
| Dynamic group eval p95 ≤ 5s | 99% | 30d | 1% | 5% in 6h → ticket |
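
The error-budget and burn-rate columns follow from simple arithmetic. A minimal sketch, assuming "N% in Wh" means N% of the window's error budget consumed within W hours (the common multi-window burn-rate reading; this document does not spell it out). Function names are illustrative:

```typescript
// Error budget expressed as minutes of full unavailability
// for a given SLO target (e.g. 0.9999) and window length in days.
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * 24 * 60;
}

// Burn-rate multiplier implied by "budgetFraction of the budget
// consumed within alertWindowHours" for an SLO window of sloWindowDays.
// A burn rate of 1 would spend exactly the whole budget over the window.
function burnRate(
  budgetFraction: number,
  alertWindowHours: number,
  sloWindowDays: number,
): number {
  return (budgetFraction * sloWindowDays * 24) / alertWindowHours;
}

// 99.99% over 30 days: roughly 4.32 minutes of budget.
const resolveBudget = errorBudgetMinutes(0.9999, 30);

// "2% in 1h" on a 30-day window: burn rate of roughly 14.4.
const pageBurn = burnRate(0.02, 1, 30);

// "5% in 6h" on a 30-day window: burn rate of roughly 6.
const ticketBurn = burnRate(0.05, 6, 30);
```

Under this reading, the page threshold corresponds to consuming budget about 14.4 times faster than sustainable, and the ticket threshold about 6 times faster.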
5. Dashboards (Grafana)
All dashboards are defined as code in the grafana/ folder:
5.1 Service Overview
- RED for all routes
- Error code heatmap (problem+json codes)
- Outbox lag & depth
- NATS consumer lag per subject
- DB pool saturation
5.2 Authorization PDP
- Authz checks per second (allowed vs denied)
- Cache hit rate
- Latency p50/p95/p99 (cached vs uncached)
- Top denial reasons
- Per-tenant heavy hitters
5.3 Tenancy Health
- Active tenants by region
- New tenant provisioning funnel (signup → trial → active)
- Membership invite funnel (sent → accepted → activated)
- Role churn
- Feature flag override count
5.4 Dynamic Groups
- Evaluation rate, latency histogram
- Top-N largest groups by tenant
- Re-evaluation trigger reasons
- Failure rate
5.5 Migration Saga
- In-flight residency migrations
- Step-level duration breakdown
- Rollback rate
5.6 Security
- Cross-tenant isolation test results (daily canary)
- Authz denials spike
- Invite abuse classifier alerts
- SSO failures by tenant
Dashboards are tenant-folder scoped in Grafana; platform_admin sees all, tenant admins see their own folder.
6. Tracing
6.1 Instrumented Spans
| Span name | Attributes |
|---|---|
| `tenant.http.request` | route, method, status, tenant_id |
| `tenant.use_case.{name}` | use_case, result |
| `tenant.repo.{entity}.{op}` | entity, operation, rows_affected |
| `tenant.nats.publish` | subject, size_bytes |
| `tenant.nats.consume` | subject, result, retry_count |
| `tenant.authz.evaluate` | resource, action, allowed, matched_permission_count, cache_hit |
| `tenant.policy_engine.predicate` | operator, depth |
| `tenant.dynamic_group.evaluate` | group_id, member_count |
| `tenant.ai.call` | prompt_id, prompt_version, provider, cost_micro_usd |
6.2 Sampling
| Path | Rate |
|---|---|
| `authz.check` (allowed) | 1% head-based |
| `authz.check` (denied) | 100% |
| `tenant.provision` | 100% |
| `dynamic_group.evaluate` | 100% |
| Default | 10% head-based, tail-based for errors + p99 latency |
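
The rate table can be expressed as a lookup plus a deterministic keep/drop decision. This is a sketch under assumptions, not the collector's or the SDK's actual sampler: the function names are hypothetical, the decision maps the low 32 bits of the trace ID onto [0, 1), and the tail-based rule for errors and p99 latency is not modelled. Because a denial is only known after evaluation, the `denied` flag assumes the decision is taken once the outcome is available.

```typescript
interface SpanInfo {
  path: string;
  denied?: boolean;
}

// Sampling rate per path, following the table above.
function headSampleRate({ path, denied = false }: SpanInfo): number {
  if (path === "authz.check") return denied ? 1.0 : 0.01;
  if (path === "tenant.provision" || path === "dynamic_group.evaluate") return 1.0;
  return 0.1; // default
}

// Deterministic decision: every service that sees the same trace ID
// and rate reaches the same verdict, so traces are kept or dropped whole.
function shouldSample(traceIdHex: string, rate: number): boolean {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  const low32 = parseInt(traceIdHex.slice(-8), 16);
  if (Number.isNaN(low32)) return false; // malformed ID: drop (assumption)
  return low32 / 2 ** 32 < rate;
}
```

Deriving the decision from the trace ID rather than from a random draw keeps sampling consistent across services without extra coordination.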
6.3 Baggage Propagation
Outgoing requests to other services carry baggage: tenant_id, request_id, actor_role. Tenant-service itself receives baggage from the API gateway.
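
For illustration, the three baggage entries could be serialized into a W3C `baggage` header value as below. This is a sketch only: in practice the OpenTelemetry propagator (via `@ghasi/telemetry`) would do this, and `toBaggageHeader` is a hypothetical name.

```typescript
interface TenantBaggage {
  tenant_id: string;
  request_id: string;
  actor_role: string;
}

// Builds a comma-separated key=value list with percent-encoded values,
// the general shape of a W3C baggage header.
function toBaggageHeader(baggage: TenantBaggage): string {
  const entries: Array<[string, string]> = [
    ["tenant_id", baggage.tenant_id],
    ["request_id", baggage.request_id],
    ["actor_role", baggage.actor_role],
  ];
  return entries
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(",");
}
```

Only these three keys are propagated; anything else attached to the request context stays local to the service.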
7. Structured Logs
7.1 Log Schema v3
```json
{
  "timestamp": "2026-04-15T10:00:00.123Z",
  "level": "info",
  "message": "membership_activated",
  "service": "tenant-service",
  "service_version": "1.4.2",
  "env": "prod",
  "region": "eu-fra-1",
  "trace_id": "00-...-01",
  "span_id": "...",
  "request_id": "018f...",
  "tenant_id": "tnt_01HX...",
  "actor_id_hash": "sha256:...",
  "actor_role": "service",
  "log_schema_version": 3,
  "event": "membership_activated",
  "entity_id": "mbr_01HX...",
  "duration_ms": 12
}
```
7.2 Log Levels
| Level | Usage |
|---|---|
| error | Failed use case, unhandled exception, event DLQ |
| warn | Degraded behavior (cache miss storm, retried publish) |
| info | Use case success, state transitions, consumer processing |
| debug | Dev-only detailed flow (disabled in prod unless flag) |
7.3 Redaction
Enforced by @ghasi/telemetry:
- `email` → redacted to `@domain.com`
- `invite_token` → never logged
- `sso_client_secret` → never logged
- `permissions[].condition` values → redacted if they contain literals
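
The first three rules can be sketched as a field filter. This is an illustration, not the `@ghasi/telemetry` implementation: it assumes "redacted to @domain.com" means only the domain part is kept, the `[redacted]` fallback is an invented placeholder, and the `permissions[].condition` rule is omitted because its exact semantics are not given here.

```typescript
// Fields that must never appear in logs.
const NEVER_LOGGED = new Set(["invite_token", "sso_client_secret"]);

// Keeps only the domain part of an address (assumed reading of the rule).
function redactEmail(email: string): string {
  const at = email.lastIndexOf("@");
  return at === -1 ? "[redacted]" : email.slice(at);
}

// Returns a copy of the fields that is safe to log.
function redactFields(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (NEVER_LOGGED.has(key)) continue; // drop the field entirely
    safe[key] =
      key === "email" && typeof value === "string" ? redactEmail(value) : value;
  }
  return safe;
}
```

Dropping secret fields outright, rather than masking them, avoids leaking even their length or presence pattern into log storage.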
8. Alerts
| Alert | Trigger | Severity | Runbook |
|---|---|---|---|
| `TenantResolveSLOBurn` | 2% burn in 1h on availability | page | runbook://tenant/resolve-burn |
| `AuthzCheckLatencyHigh` | p95 > 50ms for 5 min | page | runbook://tenant/authz-latency |
| `AuthzDenialSpike` | denials > 10x baseline for 5 min | page (security) | runbook://tenant/authz-spike |
| `OutboxLagHigh` | p95 publish lag > 30s for 10 min | page | runbook://tenant/outbox-lag |
| `OutboxDepthGrowing` | unpublished > 10k rows for 5 min | page | runbook://tenant/outbox-depth |
| `DLQNonEmpty` | any DLQ message | ticket (or page if > 100) | runbook://tenant/dlq |
| `DynamicGroupEvalSlow` | p95 > 30s | ticket | runbook://tenant/dg-slow |
| `InviteAbuseSpike` | abuse classifier > 100/hour for one tenant | page (abuse) | runbook://tenant/invite-abuse |
| `DBPoolSaturation` | waiting > 20 for 2 min | page | runbook://tenant/db-pool |
| `ResidencyMigrationStalled` | saga step > 2x expected duration | ticket | runbook://tenant/residency-stalled |
| `CrossTenantCanaryFailure` | daily two-tenant isolation test fails | page (sev-1) | runbook://tenant/xtenant-failure |
| `LastOwnerRiskAlert` | tenant with only 1 org_owner for 24h | ticket | runbook://tenant/last-owner |
Every alert references a runbook slug, owner, and auto-remediation hook where applicable.
9. Health Endpoints
| Endpoint | Purpose |
|---|---|
| `GET /health/live` | Liveness (process up) |
| `GET /health/ready` | Readiness (DB, Redis, NATS reachable; outbox relay running; JWKS loaded) |
| `GET /health/startup` | Startup probe (migrations complete, system roles seeded) |
The readiness probe gates traffic at the load balancer.
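
The readiness conditions can be combined into a single verdict along these lines. A minimal sketch: the check names mirror the table, while `evaluateReadiness` and the 200/503 status mapping are assumptions based on common probe conventions rather than anything stated in this document.

```typescript
// One boolean per dependency listed for GET /health/ready.
type ReadinessChecks = {
  db: boolean;
  redis: boolean;
  nats: boolean;
  outboxRelay: boolean;
  jwks: boolean;
};

// Ready only if every check passes; otherwise report which ones failed.
function evaluateReadiness(checks: ReadinessChecks): {
  status: 200 | 503;
  failing: string[];
} {
  const failing = Object.entries(checks)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  return { status: failing.length === 0 ? 200 : 503, failing };
}
```

Returning the list of failing checks in the response body makes it easier to tell from the load balancer's probe logs which dependency took the instance out of rotation.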
10. Continuous Verification
| Canary | Schedule | Action |
|---|---|---|
| Two-tenant isolation test | Every 5 min | Provision ephemeral tenants A & B; verify no cross-access; destroy |
| Authz latency canary | Every minute | Simulated authz check; alert if p95 > 20ms |
| Event round-trip | Every 5 min | Publish canary event; verify consumed and acked |
| Full saga dry-run (residency) | Nightly on staging | End-to-end migration on synthetic tenant |
11. Incident Response
Integration:
- PagerDuty on alert → `incident-bot` auto-declares.
- Bridge link auto-populated with: Grafana dashboards, Tempo traces for recent errors, Loki log slice (last 15 min, tenant-filtered).
- Runbook URL injected into incident description.
- Statuspage auto-update for availability alerts (5-min delay unless manually promoted).