tenant-service — OBSERVABILITY
Companion: APPLICATION_LOGIC · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · SERVICE_READINESS
Tenant-service is a tier-1 dependency for every other service on the platform: if it degrades, every other service degrades. Observability standards therefore exceed the platform default.
1. Telemetry Stack
| Signal | Producer | Sink |
|---|---|---|
| Traces | OpenTelemetry SDK (@opentelemetry/sdk-node), W3C tracecontext | Cloud Trace (prod) + Tempo (dev) |
| Metrics | OpenTelemetry + Prometheus exporter | Cloud Monitoring (prod) + Prometheus (dev) |
| Logs | Pino → JSON stdout | Cloud Logging (prod) + Loki (dev) |
| Events / Audit | audit_events table → BigQuery via audit-service | BigQuery melmastoon_audit.tenant_* |
| Real-user (Backoffice) | OpenTelemetry browser SDK | Cloud Trace via gateway forwarder |
All telemetry carries the same context attributes via OTel baggage:
tenant.id, tenant.residency, user.id, role.codes, request.id, trace.id, device.id, route, http.status_code, app.version, deployment.region
2. Service Level Objectives
| SLO | Target | Window | Error budget |
|---|---|---|---|
| GET /api/v1/tenants/{id}/config p95 | ≤ 25 ms | 30 d | 0.1 % |
| GET /api/v1/tenants/{id}/config availability | ≥ 99.95 % | 30 d | 21.6 min/mo |
| POST /api/v1/authz/check p95 | ≤ 20 ms | 30 d | 0.05 % |
| POST /api/v1/authz/check availability | ≥ 99.99 % | 30 d | 4.3 min/mo |
| PATCH /api/v1/tenants/{id}/config p95 | ≤ 200 ms | 30 d | 0.5 % |
| Outbox → Pub/Sub publish lag p95 | ≤ 2 s | 30 d | 1 % |
| Sync pull p95 | ≤ 200 ms | 30 d | 1 % |
| Two-tenant isolation incidents | 0 | always | 0 |
Error budget burn drives release freezes: if more than 50 % of the monthly budget is consumed within 24 h, prod deploys are frozen.
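The availability budgets in the table and the freeze rule reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes burn is measured as minutes of unavailability observed in the last 24 h.

```typescript
// Hypothetical helpers. Only the targets, the 30 d window, and the
// "more than 50 % in 24 h" rule come from this document.
const MINUTES_PER_DAY = 24 * 60;

/** Allowed downtime, in minutes, for an availability target over a window. */
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * MINUTES_PER_DAY;
}

/** True when more than half of the window's budget was burned in the last 24 h. */
function shouldFreezeDeploys(
  target: number,
  windowDays: number,
  badMinutesLast24h: number,
): boolean {
  return badMinutesLast24h > 0.5 * errorBudgetMinutes(target, windowDays);
}

console.log(errorBudgetMinutes(0.9995, 30).toFixed(1)); // 21.6 (config availability)
console.log(errorBudgetMinutes(0.9999, 30).toFixed(1)); // 4.3  (authz availability)
```

Under these assumptions, 11 bad minutes in one day against the 99.95 % target would trigger a freeze, while 10 would not.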
3. Metrics Catalog
3.1 RED (per route)
http_requests_total{route, method, status, tenant_id}
http_request_duration_seconds_bucket{route, method, status}
http_request_inflight{route}
http_request_errors_total{route, error_code}
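The p95 SLOs are read from the `_bucket` histogram series. As a rough sketch of how a quantile can be estimated from cumulative buckets, the function below interpolates linearly inside the bucket that contains the target rank, which is the same general approach PromQL's `histogram_quantile` takes. The `Bucket` type and function name are hypothetical, and the bucket bounds in the usage note are made up for illustration.

```typescript
// Hypothetical types: one entry per `le` label of a *_bucket series.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the overflow bucket
  count: number; // cumulative count of observations <= le
}

/** Estimate quantile q (0..1) from cumulative histogram buckets. */
function quantileFromBuckets(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count < rank) continue;
    if (sorted[i].le === Infinity) {
      // Rank falls in the overflow bucket: report the highest finite bound.
      return i > 0 ? sorted[i - 1].le : NaN;
    }
    const lowerBound = i > 0 ? sorted[i - 1].le : 0;
    const lowerCount = i > 0 ? sorted[i - 1].count : 0;
    const inBucket = sorted[i].count - lowerCount;
    if (inBucket === 0) return sorted[i].le;
    // Linear interpolation between the bucket's lower and upper bounds.
    return lowerBound + (sorted[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
  }
  return NaN;
}
```

With cumulative counts of 50, 90, 100 and 100 at bounds 0.01, 0.025, 0.05 and +Inf, the estimated p95 is about 0.0375 s, which would breach a 25 ms target.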
3.2 USE (resources)
db_pool_connections_active / _idle / _waiting
db_query_duration_seconds_bucket{operation, table}
redis_command_duration_seconds_bucket{command}
process_cpu_seconds_total
process_resident_memory_bytes
nodejs_eventloop_lag_seconds
3.3 Domain KPIs
tenant_resolve_total{cache_hit}
tenant_resolve_duration_seconds_bucket
tenant_status_total{status} # gauge
membership_resolution_total{cache_hit}
membership_active_total # gauge per tenant
role_assignment_changes_total{by, role_code}
invitation_sent_total{tenant_id, locale}
invitation_accepted_total
invitation_expired_total
invitation_failed_total{reason} # rate_limited, escalation, ...
authz_check_total{decision} # allow|deny
authz_deny_total{reason_code}
last_owner_block_total # should be near zero
role_escalation_block_total # should be near zero
outbox_pending_total # gauge
outbox_publish_lag_seconds # histogram
saga_step_duration_seconds_bucket{saga, step}
saga_timeout_total{saga, step}
ai_call_total{prompt_id, decision}
ai_call_duration_seconds_bucket{prompt_id}
ai_budget_exhausted_total{tenant_id}
3.4 Sync (server side)
sync_pull_total{scope=tenant}
sync_pull_duration_seconds_bucket
sync_pull_items_total{scope, kind}
sync_pull_bytes_total{scope}
sync_push_rejected_total{reason=online_required}
4. Dashboards
Three Grafana / Cloud Monitoring dashboards (versioned in dashboards/ of this repo):
- tenant-service / Overview: RED, error budget burn, top 10 tenants by request volume, top 10 by error count, Cloud Run instance count.
- tenant-service / Domain: Membership creation rate, invitation funnel, role assignment activity, last-owner blocks, escalation blocks, anomaly notifications.
- tenant-service / Eventing: Outbox depth, publish lag, DLQ counts per topic, consumer lag for inbox, saga in-flight + timeout.

Each dashboard is pinned to the tenant-service runbook hub.
5. Tracing
5.1 Instrumented spans (selected)
| Span | Attributes |
|---|---|
| tenant.provision | tenant.slug, actor.user_id, db.txn.duration |
| tenant.config.update | tenant.id, expectVersion, changedFields[] |
| tenant.suspend / reactivate | tenant.id, by, reason |
| tenant.close.saga.start / …step / …ack / …complete | saga.id, service, outcome |
| membership.invite | tenant.id, actor.user_id, roles_proposed, ai.decision |
| membership.role_change | tenant.id, membership.id, added[], removed[] |
| authz.check | principal.user_id, resource.type, resource.id, decision, cache.hit |
| outbox.publish | topic, lagMs, payload.bytes |
| sync.tenant.pull | device.id, cursor.kind, items.count |
Spans propagate to downstream service calls via W3C traceparent. The tenant id rides as OTel baggage so every downstream log entry carries it without further plumbing.
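For orientation, this is roughly what travels on the wire. In the service the OpenTelemetry SDK's propagators would produce these headers; the sketch below only illustrates the W3C `traceparent` layout (version, trace id, parent span id, flags) and a comma-separated `baggage` header. `SpanRef` and `buildPropagationHeaders` are hypothetical names, not part of any SDK.

```typescript
// Illustration only: hand-built headers in the W3C Trace Context / Baggage formats.
interface SpanRef {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

function buildPropagationHeaders(
  span: SpanRef,
  baggage: { [key: string]: string },
): { traceparent: string; baggage: string } {
  if (!/^[0-9a-f]{32}$/.test(span.traceId) || !/^[0-9a-f]{16}$/.test(span.spanId)) {
    throw new Error("invalid trace or span id");
  }
  const flags = span.sampled ? "01" : "00";
  // Baggage values are percent-encoded; keys are assumed to be valid tokens already.
  const entries = Object.keys(baggage)
    .map((key) => key + "=" + encodeURIComponent(baggage[key]))
    .join(",");
  return {
    traceparent: "00-" + span.traceId + "-" + span.spanId + "-" + flags,
    baggage: entries,
  };
}
```

A downstream service that reads the `baggage` header can stamp `tenant.id` on its own logs without the caller passing it explicitly.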
5.2 Sampling
- Tail-based sampling: 100 % of traces with errors, 100 % of traces with authz.check.decision = 'deny' for sensitive resources, 1 % of all others.
- Exemption: GET /healthz, GET /readyz, GET /metrics are excluded from tracing entirely.
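The sampling policy can be read as a small decision function evaluated once a trace is complete. The following is a sketch under assumptions, not the collector's actual configuration: the `TraceSummary` shape and the `sensitiveResource` flag are hypothetical (this document does not define which resources count as sensitive), and the 1 % baseline is made deterministic by bucketing on the trace id.

```typescript
// Hypothetical summary of a finished trace, as a tail sampler might see it.
interface TraceSummary {
  traceId: string;             // hex string
  hasError: boolean;
  authzDecision?: "allow" | "deny";
  sensitiveResource?: boolean; // assumption: set by whoever classifies resources
  rootRoute?: string;          // e.g. "GET /readyz"
}

const EXCLUDED_ROUTES = ["GET /healthz", "GET /readyz", "GET /metrics"];
const BASELINE_RATE = 0.01;

function keepTrace(t: TraceSummary): boolean {
  // Exempt routes should never be traced at all; this guard is a safety net.
  if (t.rootRoute !== undefined && EXCLUDED_ROUTES.indexOf(t.rootRoute) !== -1) return false;
  if (t.hasError) return true;
  if (t.authzDecision === "deny" && t.sensitiveResource === true) return true;
  // Deterministic baseline: map the last 8 hex chars of the trace id onto [0, 1).
  const bucket = parseInt(t.traceId.slice(-8), 16) / 0x100000000;
  return bucket < BASELINE_RATE;
}
```

Bucketing on the trace id, rather than calling a random number generator, means every collector instance reaches the same keep/drop decision for a given trace.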
6. Logs
Structured JSON, one event per line. Schema:
{
"ts": "2026-04-22T08:00:00.123Z",
"level": "info",
"service": "tenant-service",
"version": "1.7.3",
"region": "asia-south1",
"tenant_id": "tnt_01H...",
"user_id": "usr_01H...",
"request_id": "req_01H...",
"trace_id": "...",
"span_id": "...",
"route": "POST /api/v1/invitations",
"msg": "invitation_sent",
"event": "invitation.sent",
"invitation_id": "inv_01H...",
"rolesProposed": ["rol_01H..."],
"duration_ms": 41
}
The PII scrubber rewrites email → em_<sha8> and phone → ph_<sha8>, and removes address.* fields.
Log volume target: ≤ 1 KB / request average (excluding pretty traces in dev).
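A minimal sketch of the scrubbing rules, under stated assumptions: `<sha8>` is taken to mean the first 8 hex characters of a SHA digest of the original value, the field names `email`, `phone` and `address` are guesses at the log keys, and only top-level keys are handled. The digest function is injected so the sketch does not depend on a particular runtime; `scrubPii` and `Digest` are hypothetical names.

```typescript
// Assumption: "<sha8>" = first 8 hex chars of a SHA digest of the raw value.
// In Node the digest could wrap createHash("sha256") from node:crypto.
type Digest = (value: string) => string; // returns a lowercase hex digest
type LogRecord = { [key: string]: unknown };

function scrubPii(record: LogRecord, digestHex: Digest): LogRecord {
  const out: LogRecord = {};
  for (const key of Object.keys(record)) {
    const value = record[key];
    if (key === "address" || key.indexOf("address.") === 0) {
      continue; // address.* is removed outright
    }
    if (key === "email" && typeof value === "string") {
      out[key] = "em_" + digestHex(value).slice(0, 8);
    } else if (key === "phone" && typeof value === "string") {
      out[key] = "ph_" + digestHex(value).slice(0, 8);
    } else {
      out[key] = value; // non-PII fields pass through unchanged
    }
  }
  return out;
}
```

Replacing values with a stable digest prefix keeps log lines joinable on the same user without exposing the raw address or number.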
7. Alerts
| Alert | Condition | Severity | Action |
|---|---|---|---|
| TenantConfigP95Breach | p95 > 50 ms for 5 min | warn | Investigate cache; check Cloud SQL CPU |
| AuthzCheckErrorRate | error rate > 0.1 % for 5 min | page | PDP unavailable → fail-closed in callers; immediate triage |
| OutboxBacklog | outbox_pending_total > 1000 for 10 min | page | Poller stuck or Pub/Sub down |
| OutboxLagP95 | publish lag p95 > 10 s for 5 min | warn | Throughput insufficient; scale poller |
| LastOwnerBlockSpike | > 5 / min for 5 min | warn | Possible fraud / mistake; alert security |
| RoleEscalationBlockSpike | > 10 / min | warn | Possible compromised account |
| TenantIsolationFailure | any non-zero on tenant_isolation_violation_total (CI canary metric) | page | Production incident; freeze deploy |
| SagaTimeout | saga_timeout_total > 0 for any saga | page | Cascade saga stuck; runbook |
| InvitationRateLimitSpike | invitation_failed_total{reason=rate_limited} > 50/min for 1 tenant | warn | Possible abuse; alert tenant.owner |
| BillingSuspensionDelay | tenants in pending_suspend > 15 min past gracePeriodEndsAt | warn | Billing consumer stuck |
| AIClientFailure | AI 5xx rate > 25 % for 10 min | warn | Circuit opens; degrade gracefully |
All alerts route through PagerDuty tenant-service-oncall, each with a linked runbook URL.
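Most conditions above have the form "value above threshold for N minutes". One common reading, assumed here, is that the condition must hold at every evaluation across the whole duration before the alert fires. The sketch uses one sample per minute; `sustainedBreach` is a hypothetical name, and real alert rules would live in the monitoring backend rather than in service code.

```typescript
/**
 * Fires when the most recent `forMinutes` per-minute samples all exceed the
 * threshold. A single sample at or below the threshold resets the condition.
 */
function sustainedBreach(
  perMinute: number[],
  threshold: number,
  forMinutes: number,
): boolean {
  if (forMinutes <= 0 || perMinute.length < forMinutes) return false;
  const recent = perMinute.slice(-forMinutes);
  return recent.every((value) => value > threshold);
}

// TenantConfigP95Breach: p95 (ms) sampled once per minute, threshold 50 ms, 5 min.
console.log(sustainedBreach([40, 60, 55, 52, 51, 70], 50, 5)); // true
console.log(sustainedBreach([60, 55, 40, 51, 70], 50, 5));     // false
```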
8. Health Endpoints
| Endpoint | Checks | Used by |
|---|---|---|
| GET /healthz | process up | Cloud Run liveness |
| GET /readyz | Postgres SELECT 1, Memorystore PING, Pub/Sub publisher heartbeat, JWKS cache loaded | Cloud Run readiness |
| GET /metrics | Prometheus scrape (auth-gated) | Monitoring agent |
| GET /__/admin/outbox?status=pending | platform-only operator view | On-call runbook |
When any dependency check fails, GET /readyz returns 503 with per-dependency status JSON so on-call can triage at a glance.
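One possible shape for that response is sketched below. The response body layout and the names `buildReadiness` and `ReadinessResponse` are assumptions; only the four checks and the 503 status come from the table above. The checks themselves (SELECT 1, PING and so on) are assumed to have run already and are passed in as booleans.

```typescript
type DependencyStatus = "ok" | "fail";

interface ReadinessResponse {
  httpStatus: 200 | 503;
  body: {
    status: "ready" | "not_ready";
    checks: { [dependency: string]: DependencyStatus };
  };
}

/** Aggregate per-dependency results into a status code plus JSON body. */
function buildReadiness(results: { [dependency: string]: boolean }): ReadinessResponse {
  const checks: { [dependency: string]: DependencyStatus } = {};
  let allOk = true;
  for (const dependency of Object.keys(results)) {
    checks[dependency] = results[dependency] ? "ok" : "fail";
    if (!results[dependency]) allOk = false;
  }
  return {
    httpStatus: allOk ? 200 : 503,
    body: { status: allOk ? "ready" : "not_ready", checks },
  };
}

const readinessExample = buildReadiness({
  postgres: true,
  memorystore: true,
  pubsub_publisher: false, // a stale heartbeat alone is enough to fail readiness
  jwks_cache: true,
});
console.log(readinessExample.httpStatus, JSON.stringify(readinessExample.body));
```

Listing every dependency in the body, including the healthy ones, lets on-call see which single check failed without opening logs.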
9. Continuous Verification
- Synthetic monitor in each region: every 60 s, calls GET /api/v1/tenants/{seed-tenant}/config and POST /authz/check with a known seed; latency + correctness recorded as synthetic_* metrics.
- Canary tenants: two seed tenants per env; the two-tenant simulator runs against them in production hourly to verify isolation.
- Chaos drills quarterly: kill the Pub/Sub publisher; confirm outbox accumulates without data loss; restore.
10. Operator Runbooks
Each alert links to a runbook under runbooks/tenant-service/:
- outbox-backlog.md
- pdp-unavailable.md
- tenant-isolation-violation.md (Sev-1; comms template)
- saga-timeout.md
- last-owner-block-spike.md
Runbooks are reviewed every release.