Tenant Service — Observability
Status: populated Owner: TBD Last updated: 2026-04-18 Companion: Service Template · 12 Observability
1. SLIs and SLOs
| SLI | SLO | Measurement window |
|---|---|---|
| Availability (HTTP success rate) | ≥ 99.5% of requests return 2xx or 4xx (not 5xx) | 30-day rolling |
| Activation p95 latency | ≤ 500 ms (excluding downstream retry wait windows) | 24 h |
| evaluate() p95 latency | ≤ 100 ms | 24 h |
| Event publish failure rate | < 1% over any 5-min window | 5-min sliding |
| Config write p99 latency | ≤ 200 ms | 24 h |
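
The availability SLO above can be checked with simple arithmetic. The sketch below is illustrative only: `RequestCounts`, `availability`, and `error_budget_remaining` are hypothetical names, and it assumes every response that is not a 5xx counts as a success.

```python
from dataclasses import dataclass

SLO_TARGET = 0.995  # availability target taken from the SLO table


@dataclass
class RequestCounts:
    """Request totals over the 30-day window (hypothetical container)."""
    total: int
    server_errors: int  # responses with a 5xx status


def availability(counts: RequestCounts) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if counts.total == 0:
        return 1.0  # assumption: with no traffic the SLI counts as met
    return (counts.total - counts.server_errors) / counts.total


def error_budget_remaining(counts: RequestCounts, target: float = SLO_TARGET) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_errors = counts.total * (1.0 - target)
    if allowed_errors == 0:
        return 1.0 if counts.server_errors == 0 else float("-inf")
    return 1.0 - counts.server_errors / allowed_errors
```

For example, 2,500 server errors out of 1,000,000 requests gives an availability of 99.75% and leaves about half of the 0.5% error budget.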
2. Key metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| tenant_lifecycle_transitions_total | Counter | from_status, to_status | State machine transitions |
| tenant_activation_duration_seconds | Histogram | outcome (success/failure) | Activation saga wall time |
| tenant_activation_saga_retries_total | Counter | step (hierarchy/iam/licensing) | Retries per saga step |
| tenant_evaluate_duration_seconds | Histogram | decision (allow/deny) | RBAC/ABAC evaluation latency |
| tenant_config_writes_total | Counter | key | Config mutations per key |
| tenant_event_publish_failures_total | Counter | subject | Outbox publish failures |
| tenant_outbox_unpublished_age_seconds | Gauge | (none) | Age of oldest unpublished outbox row |
| tenant_membership_changes_total | Counter | operation (assign/remove) | Membership mutations |
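
The latency SLOs are percentiles, while histogram metrics such as tenant_evaluate_duration_seconds only store bucket counts, so a p95 has to be estimated. A minimal sketch of one common approach, linear interpolation inside the bucket that contains the target rank, is shown below. The function name and the bucket boundaries are assumptions for illustration, not the service's actual configuration.

```python
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative histogram buckets.

    buckets holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound; the last bound may be float("inf").
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations recorded
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if upper_bound == float("inf") or in_bucket == 0:
                # Cannot interpolate here; fall back to the last finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical bucket layout: 25 ms, 50 ms, 100 ms, 250 ms, +Inf.
example_buckets = [(0.025, 600), (0.05, 900), (0.1, 980), (0.25, 1000), (float("inf"), 1000)]
p95 = estimate_quantile(0.95, example_buckets)
```

The accuracy of the estimate depends on the bucket layout, so it helps to place a bucket boundary at or near each SLO threshold (100 ms, 200 ms, 500 ms).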
3. Dashboards
| Dashboard | Description |
|---|---|
| Tenant Lifecycle | State machine heatmap; activation success/failure rates; saga step latency |
| Authorization | evaluate() latency p50/p95/p99; allow/deny ratio; ABAC policy match rate |
| Event Health | Outbox unpublished age; publish failure rate; NATS consumer lag |
| Data Isolation | RLS policy bypass alerts; cross-tenant access attempts |
4. Alerts
| Alert | Condition | Severity | Action |
|---|---|---|---|
| TenantActivationFailed | Saga step retries exhausted (tenant_activation_saga_retries_total > 3) | Critical | Page on-call; check downstream service health |
| EventPublishFailureHigh | Publish failure rate > 1% over 5 min | High | Page SRE; check NATS health; manual outbox replay |
| OutboxUnpublishedOld | Oldest unpublished outbox row > 60 s | High | Check NATS relay; page SRE |
| EvaluateLatencyHigh | evaluate() p95 > 200 ms for 5 min | Medium | Check DB queries; check Redis cache |
| TenantSuspensionLag | tenant.tenant.suspended.v1 not consumed by identity-service within 30 s | High | Check identity-service consumer |
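
The two Event Health conditions above reduce to threshold checks on a point-in-time snapshot. The sketch below shows that logic only. `EventHealthSnapshot` and `firing_alerts` are hypothetical names, and in practice these rules would normally live in the alerting system rather than in application code.

```python
from dataclasses import dataclass
from typing import List

PUBLISH_FAILURE_RATE_LIMIT = 0.01  # > 1% over 5 min, from the alert table
OUTBOX_AGE_LIMIT_SECONDS = 60.0    # oldest unpublished row > 60 s


@dataclass
class EventHealthSnapshot:
    """Values sampled at evaluation time (hypothetical container)."""
    publish_attempts_5m: int           # publish attempts in the last 5 minutes
    publish_failures_5m: int           # failed attempts in the same window
    oldest_unpublished_age_s: float    # tenant_outbox_unpublished_age_seconds


def firing_alerts(snapshot: EventHealthSnapshot) -> List[str]:
    """Return the names of the Event Health alerts whose condition holds."""
    alerts: List[str] = []
    if snapshot.publish_attempts_5m > 0:
        rate = snapshot.publish_failures_5m / snapshot.publish_attempts_5m
        if rate > PUBLISH_FAILURE_RATE_LIMIT:
            alerts.append("EventPublishFailureHigh")
    if snapshot.oldest_unpublished_age_s > OUTBOX_AGE_LIMIT_SECONDS:
        alerts.append("OutboxUnpublishedOld")
    return alerts
```

Both comparisons are strict, matching the table: a failure rate of exactly 1% or an outbox age of exactly 60 s does not fire.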
5. Traces
Every request carries a W3C traceparent header, propagated via OpenTelemetry. Key spans:
| Span name | Description |
|---|---|
| tenant.lifecycle.activate | Full activation saga |
| tenant.lifecycle.activate.create_root_node | hierarchy-service call |
| tenant.lifecycle.activate.seed_admin_user | identity-service call |
| tenant.lifecycle.activate.seed_licenses | licensing call |
| tenant.access.evaluate | RBAC/ABAC decision |
| tenant.outbox.publish | Outbox relay publish |
| tenant.cache.get / tenant.cache.set | Redis operations |
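
For reference, the traceparent header has the form version-traceid-parentid-flags in lowercase hex. The sketch below parses the version 00 layout and derives a header for a child span. An OpenTelemetry SDK normally does this automatically, so the helper names here are hypothetical and the code only illustrates the wire format.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(r"^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```

A downstream call such as tenant.lifecycle.activate.create_root_node would send the child header, which keeps the whole saga under one trace ID.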
6. Log standards
- Structured JSON logs; level, msg, tenantId, requestId, traceId.
- No PII in log messages; use userId reference only.
- Log all lifecycle state transitions at info level.
- Log all evaluate() deny decisions at warn level with reasons.
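
The rules above could be met with a small JSON formatter. The sketch below uses Python's standard logging module purely as an illustration. The `JsonFormatter` class, the optional-field handling, and the example values are assumptions, not the service's actual logging setup.

```python
import json
import logging

# Map Python's level names onto the names used in the log standard.
_LEVEL_NAMES = {logging.WARNING: "warn"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    REQUIRED_FIELDS = ("tenantId", "requestId", "traceId")
    OPTIONAL_FIELDS = ("userId", "reasons")  # identifier only, never PII

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": _LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "msg": record.getMessage(),
        }
        for field in self.REQUIRED_FIELDS:
            # Emit null rather than dropping a required field.
            entry[field] = getattr(record, field, None)
        for field in self.OPTIONAL_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(name: str = "tenant-service") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A deny decision would then be logged as `logger.warning("evaluate denied", extra={"tenantId": "t-1", "requestId": "r-1", "traceId": "4bf92f3577b34da6a3ce929d0e0e4736", "userId": "u-42", "reasons": ["no_matching_policy"]})`, where all values are placeholders.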