Tenant Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 12 Observability

1. SLIs and SLOs

| SLI | SLO | Measurement window |
| --- | --- | --- |
| Availability (HTTP success rate) | ≥ 99.5% of requests return 2xx or 4xx (not 5xx) | 30-day rolling |
| Activation p95 latency | ≤ 500 ms (excluding downstream retry wait windows) | 24 h |
| evaluate() p95 latency | ≤ 100 ms | 24 h |
| Event publish failure rate | < 1% over any 5-min window | 5-min sliding |
| Config write p99 latency | ≤ 200 ms | 24 h |
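
As a rough illustration of the availability SLI, the sketch below classifies responses and computes how much of the 0.5% error budget remains. The function names are hypothetical, and treating every non-5xx status (including 1xx and 3xx) as successful is an assumption; the table only names 2xx and 4xx explicitly.

```python
SLO_TARGET = 0.995  # from the table: at least 99.5% of requests are not 5xx


def is_good(status_code: int) -> bool:
    """A request counts as successful unless the service returned a 5xx.

    Assumption: 1xx/3xx responses are also treated as successful.
    """
    return not 500 <= status_code <= 599


def availability(status_codes: list[int]) -> float:
    """Fraction of requests in the window that were good (1.0 if empty)."""
    if not status_codes:
        return 1.0
    good = sum(1 for code in status_codes if is_good(code))
    return good / len(status_codes)


def error_budget_remaining(total: int, bad: int, target: float = SLO_TARGET) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    allowed_bad = total * (1 - target)
    if allowed_bad == 0:
        return 0.0 if bad else 1.0
    return 1 - bad / allowed_bad
```

With one million requests in the 30-day window, the budget allows roughly 5,000 responses with a 5xx status; 2,500 such responses would leave about half of the budget.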

2. Key metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| tenant_lifecycle_transitions_total | Counter | from_status, to_status | State machine transitions |
| tenant_activation_duration_seconds | Histogram | outcome (success/failure) | Activation saga wall time |
| tenant_activation_saga_retries_total | Counter | step (hierarchy/iam/licensing) | Retries per saga step |
| tenant_evaluate_duration_seconds | Histogram | decision (allow/deny) | RBAC/ABAC evaluation latency |
| tenant_config_writes_total | Counter | key | Config mutations per key |
| tenant_event_publish_failures_total | Counter | subject | Outbox publish failures |
| tenant_outbox_unpublished_age_seconds | Gauge | (none) | Age of oldest unpublished outbox row |
| tenant_membership_changes_total | Counter | operation (assign/remove) | Membership mutations |
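
The latency SLOs are percentiles, while the duration metrics above are histograms, so a percentile has to be estimated from bucket counts. The sketch below shows one common way to do that: linear interpolation within cumulative buckets, similar in spirit to Prometheus's histogram_quantile. That this service uses Prometheus-style cumulative buckets is an assumption, and the bucket boundaries in the usage note are invented for illustration.

```python
import math


def quantile_from_buckets(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, calling it with q=0.95 on hypothetical tenant_evaluate_duration_seconds buckets such as [(0.05, 50), (0.1, 90), (0.25, 98), (0.5, 100), (math.inf, 100)] gives an estimate in seconds that can be compared against the 100 ms target. The estimate is only as precise as the bucket layout, so boundaries near the SLO thresholds matter.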

3. Dashboards

| Dashboard | Description |
| --- | --- |
| Tenant Lifecycle | State machine heatmap; activation success/failure rates; saga step latency |
| Authorization | evaluate() latency p50/p95/p99; allow/deny ratio; ABAC policy match rate |
| Event Health | Outbox unpublished age; publish failure rate; NATS consumer lag |
| Data Isolation | RLS policy bypass alerts; cross-tenant access attempts |

4. Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| TenantActivationFailed | Activation saga retries exhausted (> 3), per tenant_activation_saga_retries_total | Critical | Page on-call; check downstream service health |
| EventPublishFailureHigh | Publish failure rate > 1% over 5 min | High | Page SRE; check NATS health; replay the outbox manually |
| OutboxUnpublishedOld | Oldest unpublished outbox row older than 60 s | High | Check NATS relay; page SRE |
| EvaluateLatencyHigh | evaluate() p95 > 200 ms for 5 min | Medium | Check DB queries; check Redis cache |
| TenantSuspensionLag | tenant.tenant.suspended.v1 not consumed by identity-service within 30 s | High | Check identity-service consumer |
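
To make the EventPublishFailureHigh condition concrete, here is a minimal sketch of a sliding-window failure-rate check. The class and method names are hypothetical; in practice this condition would more likely be expressed as a rule in the alerting system than in application code.

```python
from collections import deque


class PublishFailureWindow:
    """Tracks publish outcomes over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.01):
        self.window_seconds = window_seconds  # 5 min, as in the alert condition
        self.threshold = threshold            # alert above 1% failures
        self._events: deque[tuple[float, bool]] = deque()  # (timestamp, failed)
        self._failures = 0

    def record(self, timestamp: float, failed: bool) -> None:
        """Record one publish attempt; timestamps must be non-decreasing."""
        self._events.append((timestamp, failed))
        if failed:
            self._failures += 1
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        """Drop attempts that have aged out of the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, failed = self._events.popleft()
            if failed:
                self._failures -= 1

    def failure_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        return self._failures / len(self._events)

    def should_alert(self, now: float) -> bool:
        return self.failure_rate(now) > self.threshold
```

The comparison is strict, so a window with exactly 1% failures does not fire, which matches the "> 1%" wording in the table.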

5. Traces

Every request carries a W3C Trace Context traceparent header, propagated via OpenTelemetry. Key spans:

| Span name | Description |
| --- | --- |
| tenant.lifecycle.activate | Full activation saga |
| tenant.lifecycle.activate.create_root_node | hierarchy-service call |
| tenant.lifecycle.activate.seed_admin_user | identity-service call |
| tenant.lifecycle.activate.seed_licenses | licensing call |
| tenant.access.evaluate | RBAC/ABAC decision |
| tenant.outbox.publish | Outbox relay publish |
| tenant.cache.get / tenant.cache.set | Redis operations |
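
For reference, a version-00 traceparent value has four lowercase hex fields: version, trace ID, parent span ID, and flags. The sketch below parses and validates such a value. The helper names are hypothetical, and a real service would normally rely on the OpenTelemetry SDK's propagators instead of parsing the header by hand.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent value; return None if it is malformed.

    This sketch only accepts the four-field layout used by version 00.
    """
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit marks the trace as sampled
    return TraceContext(version, trace_id, parent_id, sampled)
```

The trace_id field is the value that would appear as traceId in the structured logs, which is what lets a log line be joined to the spans listed above.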

6. Log standards

  • Structured JSON logs with the fields level, msg, tenantId, requestId, and traceId.
  • No PII in log messages; refer to users by userId only.
  • Log all lifecycle state transitions at info level.
  • Log all evaluate() deny decisions at warn level with reasons.
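
The standards above could be met with a small formatter on top of Python's standard logging module, sketched below. The class name, logger name, and the optional userId and reasons fields are assumptions for illustration, and the level is emitted as Python's lowercase level name (so "warning" rather than "warn").

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    REQUIRED = ("tenantId", "requestId", "traceId")
    OPTIONAL = ("userId", "reasons")  # hypothetical extras; never raw PII

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname.lower(), "msg": record.getMessage()}
        for field in self.REQUIRED:
            # Missing context is emitted as null so gaps are visible in queries.
            payload[field] = getattr(record, field, None)
        for field in self.OPTIONAL:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str = "tenant-service") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr at info level and above."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A deny decision could then be logged with logger.warning("evaluate denied", extra={"tenantId": ..., "requestId": ..., "traceId": ..., "reasons": [...]}), where the context values come from the current request.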