tenant-service — OBSERVABILITY

Companion: APPLICATION_LOGIC · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · SERVICE_READINESS

Tenant-service is a tier-1 dependency for every other service on the platform: if it degrades, every other service degrades. Observability standards therefore exceed the platform default.


1. Telemetry Stack

| Signal | Producer | Sink |
|---|---|---|
| Traces | OpenTelemetry SDK (@opentelemetry/sdk-node), W3C tracecontext | Cloud Trace (prod) + Tempo (dev) |
| Metrics | OpenTelemetry + Prometheus exporter | Cloud Monitoring (prod) + Prometheus (dev) |
| Logs | Pino → JSON stdout | Cloud Logging (prod) + Loki (dev) |
| Events / Audit | audit_events table → BigQuery via audit-service | BigQuery melmastoon_audit.tenant_* |
| Real-user (Backoffice) | OpenTelemetry browser SDK | Cloud Trace via gateway forwarder |

All telemetry carries the same context attributes via OTel baggage:

tenant.id, tenant.residency, user.id, role.codes, request.id, trace.id, device.id, route, http.status_code, app.version, deployment.region
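
As an illustration of how these attributes could travel between services, the sketch below serializes a few of them into a W3C `baggage` header value (comma-separated `key=value` members with percent-encoded values) and parses them back. It is a minimal sketch, not the service's code: in practice the OTel SDK's baggage propagator would normally do this, and the function names `toBaggageHeader` and `fromBaggageHeader` as well as the sample values are hypothetical.

```typescript
// Hypothetical container for a subset of the context attributes listed above.
type TelemetryContext = Record<string, string>;

// Serialize attributes into a W3C `baggage` header value.
// Values are percent-encoded so commas inside a value (e.g. role.codes)
// cannot be mistaken for member separators.
function toBaggageHeader(ctx: TelemetryContext): string {
  return Object.entries(ctx)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(",");
}

// Parse a `baggage` header value back into attributes.
// Optional member properties (after ";") are ignored in this sketch.
function fromBaggageHeader(header: string): TelemetryContext {
  const ctx: TelemetryContext = {};
  for (const member of header.split(",")) {
    const pair = member.split(";")[0] ?? "";
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip malformed members
    const key = pair.slice(0, eq).trim();
    ctx[key] = decodeURIComponent(pair.slice(eq + 1).trim());
  }
  return ctx;
}

// Sample values are made up for illustration.
const header = toBaggageHeader({
  "tenant.id": "tnt_example",
  "role.codes": "owner,admin",
});
console.log(header);
console.log(fromBaggageHeader(header));
```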

2. Service Level Objectives

| SLO | Target | Window | Error budget |
|---|---|---|---|
| GET /api/v1/tenants/{id}/config p95 | ≤ 25 ms | 30 d | 0.1 % |
| GET /api/v1/tenants/{id}/config availability | ≥ 99.95 % | 30 d | 21.6 min/mo |
| POST /api/v1/authz/check p95 | ≤ 20 ms | 30 d | 0.05 % |
| POST /api/v1/authz/check availability | ≥ 99.99 % | 30 d | 4.3 min/mo |
| PATCH /api/v1/tenants/{id}/config p95 | ≤ 200 ms | 30 d | 0.5 % |
| Outbox → Pub/Sub publish lag p95 | ≤ 2 s | 30 d | 1 % |
| Sync pull p95 | ≤ 200 ms | 30 d | 1 % |
| Two-tenant isolation incidents | 0 | always | 0 |

Error budget burn drives release freezes: if more than 50 % of the monthly budget is consumed within 24 h, prod deploys are frozen.
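
The availability budgets in the table follow from the target and the 30-day window, and the freeze rule is a simple comparison against half of that budget. The sketch below works through the arithmetic; the function names are hypothetical, and it assumes "monthly" means the same 30-day window used by the SLOs.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Allowed downtime, in minutes, for an availability target over a window.
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * MINUTES_PER_DAY;
}

// Freeze rule from the text: more than 50 % of the monthly budget
// consumed within the last 24 h freezes prod deploys.
function shouldFreezeDeploys(
  consumedLast24hMinutes: number,
  monthlyBudgetMinutes: number,
): boolean {
  return consumedLast24hMinutes > 0.5 * monthlyBudgetMinutes;
}

const configBudget = errorBudgetMinutes(0.9995, 30); // about 21.6 min
const authzBudget = errorBudgetMinutes(0.9999, 30); // about 4.32 min
console.log(configBudget.toFixed(1), authzBudget.toFixed(2));

// 12 min of downtime in one day exceeds half of the 21.6 min budget (10.8 min).
console.log(shouldFreezeDeploys(12, configBudget));
```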


3. Metrics Catalog

3.1 RED (per route)

http_requests_total{route, method, status, tenant_id}
http_request_duration_seconds_bucket{route, method, status}
http_request_inflight{route}
http_request_errors_total{route, error_code}
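
The p95 targets in section 2 are evaluated from `_bucket` histograms such as `http_request_duration_seconds_bucket`. The sketch below shows one common way to estimate a quantile from cumulative buckets, using linear interpolation inside the bucket that contains the target rank (the same idea as Prometheus's `histogram_quantile`, simplified). The bucket boundaries and counts are invented for illustration, and `estimateQuantile` is a hypothetical helper.

```typescript
// One cumulative bucket: number of observations with value <= le (seconds).
interface Bucket {
  le: number;
  count: number;
}

// Estimate quantile q (0..1) from cumulative histogram buckets.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (!last || last.count === 0) return NaN; // no observations

  const rank = q * last.count;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (b.le === Infinity) return prevLe;
      // Empty bucket (only reachable when rank is 0): nothing to interpolate.
      if (b.count === prevCount) return prevLe;
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Invented sample: 1000 requests, cumulative counts per upper bound.
const sample: Bucket[] = [
  { le: 0.01, count: 600 },
  { le: 0.025, count: 940 },
  { le: 0.05, count: 990 },
  { le: 0.1, count: 1000 },
  { le: Infinity, count: 1000 },
];
// Rank 950 lies in the (0.025, 0.05] bucket, so p95 is about 0.030 s.
console.log(estimateQuantile(0.95, sample).toFixed(3));
```

With this sample the estimate is about 30 ms, which would miss the 25 ms config target while staying under the 50 ms TenantConfigP95Breach threshold.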

3.2 USE (resources)

db_pool_connections_active / _idle / _waiting
db_query_duration_seconds_bucket{operation, table}
redis_command_duration_seconds_bucket{command}
process_cpu_seconds_total
process_resident_memory_bytes
nodejs_eventloop_lag_seconds

3.3 Domain KPIs

tenant_resolve_total{cache_hit}
tenant_resolve_duration_seconds_bucket
tenant_status_total{status} # gauge
membership_resolution_total{cache_hit}
membership_active_total # gauge per tenant
role_assignment_changes_total{by, role_code}
invitation_sent_total{tenant_id, locale}
invitation_accepted_total
invitation_expired_total
invitation_failed_total{reason} # rate_limited, escalation, ...
authz_check_total{decision} # allow|deny
authz_deny_total{reason_code}
last_owner_block_total # should be near zero
role_escalation_block_total # should be near zero
outbox_pending_total # gauge
outbox_publish_lag_seconds # histogram
saga_step_duration_seconds_bucket{saga, step}
saga_timeout_total{saga, step}
ai_call_total{prompt_id, decision}
ai_call_duration_seconds_bucket{prompt_id}
ai_budget_exhausted_total{tenant_id}

3.4 Sync (server side)

sync_pull_total{scope=tenant}
sync_pull_duration_seconds_bucket
sync_pull_items_total{scope, kind}
sync_pull_bytes_total{scope}
sync_push_rejected_total{reason=online_required}

4. Dashboards

Three Grafana / Cloud Monitoring dashboards (versioned in dashboards/ of this repo):

  1. tenant-service / Overview — RED, error budget burn, top 10 tenants by request volume, top 10 by error count, Cloud Run instance count.
  2. tenant-service / Domain — Membership creation rate, invitation funnel, role assignment activity, last-owner blocks, escalation blocks, anomaly notifications.
  3. tenant-service / Eventing — Outbox depth, publish lag, DLQ counts per topic, consumer lag for inbox, saga in-flight + timeout.

Each dashboard is pinned to the tenant-service runbook hub.


5. Tracing

5.1 Instrumented spans (selected)

| Span | Attributes |
|---|---|
| tenant.provision | tenant.slug, actor.user_id, db.txn.duration |
| tenant.config.update | tenant.id, expectVersion, changedFields[] |
| tenant.suspend / reactivate | tenant.id, by, reason |
| tenant.close.saga.start / …step / …ack / …complete | saga.id, service, outcome |
| membership.invite | tenant.id, actor.user_id, roles_proposed, ai.decision |
| membership.role_change | tenant.id, membership.id, added[], removed[] |
| authz.check | principal.user_id, resource.type, resource.id, decision, cache.hit |
| outbox.publish | topic, lagMs, payload.bytes |
| sync.tenant.pull | device.id, cursor.kind, items.count |

Spans propagate to downstream service calls via W3C traceparent. The tenant id rides as OTel baggage so every downstream log entry carries it without further plumbing.
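
For readers unfamiliar with the header, the sketch below parses and formats a version-00 W3C `traceparent` value (`00-<trace-id>-<parent-id>-<flags>`). It is illustrative only: the OTel SDK handles this propagation automatically, the helper names are hypothetical, and the sample value is the example from the W3C Trace Context specification.

```typescript
interface TraceParent {
  traceId: string; // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars (the caller's span id)
  sampled: boolean; // bit 0 of the trace-flags byte
}

// Only version 00 is handled in this sketch.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const traceId = m[1]!;
  const parentId = m[2]!;
  const flags = m[3]!;
  // All-zero ids are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

function formatTraceparent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentId}-${tp.sampled ? "01" : "00"}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsed = parseTraceparent(incoming);
console.log(parsed);
if (parsed) console.log(formatTraceparent(parsed) === incoming);
```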

5.2 Sampling

  • Tail-based sampling: 100 % of traces with errors, 100 % of traces with authz.check.decision = 'deny' for sensitive resources, 1 % of all others.
  • Exemption: GET /healthz, GET /readyz, GET /metrics excluded from tracing entirely.
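
The sampling policy above can be sketched as a keep/drop decision made once a trace has completed. This is a rough illustration under stated assumptions, not the collector configuration: the `CompletedTrace` shape is hypothetical, the set of sensitive resource types is invented because the document does not enumerate them, and the 1 % baseline is derived deterministically from the trace id rather than from a random draw.

```typescript
// Hypothetical summary of a finished trace.
interface CompletedTrace {
  traceId: string; // 32 hex chars
  hasError: boolean;
  authzDenies: { resourceType: string }[]; // authz.check spans with decision = deny
}

// Assumption: which resource types count as sensitive is not stated in the source.
const SENSITIVE_RESOURCE_TYPES = new Set(["tenant", "membership", "role"]);

// Routes excluded from tracing entirely.
const EXCLUDED_ROUTES = new Set(["GET /healthz", "GET /readyz", "GET /metrics"]);

const BASELINE_RATE = 0.01; // 1 % of all other traces

function isTraced(route: string): boolean {
  return !EXCLUDED_ROUTES.has(route);
}

function keepTrace(trace: CompletedTrace): boolean {
  if (trace.hasError) return true; // 100 % of traces with errors
  if (trace.authzDenies.some((d) => SENSITIVE_RESOURCE_TYPES.has(d.resourceType))) {
    return true; // 100 % of denies on sensitive resources
  }
  // Map the low 32 bits of the trace id onto [0, 1) and keep the lowest 1 %.
  const low32 = parseInt(trace.traceId.slice(-8), 16);
  return low32 / 2 ** 32 < BASELINE_RATE;
}

console.log(isTraced("GET /healthz"));
console.log(
  keepTrace({ traceId: "4bf92f3577b34da6a3ce929d0e0e4736", hasError: true, authzDenies: [] }),
);
```

Deriving the baseline decision from the trace id keeps it stable if the same trace is evaluated more than once.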

6. Logs

Structured JSON, one event per line. Schema:

{
  "ts": "2026-04-22T08:00:00.123Z",
  "level": "info",
  "service": "tenant-service",
  "version": "1.7.3",
  "region": "asia-south1",
  "tenant_id": "tnt_01H...",
  "user_id": "usr_01H...",
  "request_id": "req_01H...",
  "trace_id": "...",
  "span_id": "...",
  "route": "POST /api/v1/invitations",
  "msg": "invitation_sent",
  "event": "invitation.sent",
  "invitation_id": "inv_01H...",
  "rolesProposed": ["rol_01H..."],
  "duration_ms": 41
}

The PII scrubber redacts email → em_<sha8> and phone → ph_<sha8>, and removes address.* fields.
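
A minimal sketch of such a scrubber follows. It assumes that `<sha8>` means the first 8 hex characters of a SHA-256 digest of the raw value, which the document does not specify, and it only inspects top-level keys named `email`, `phone`, and `address` / `address.*`. The function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumption: <sha8> = first 8 hex chars of SHA-256 over the raw value.
function sha8(value: string): string {
  return createHash("sha256").update(value).digest("hex").slice(0, 8);
}

type LogRecord = Record<string, unknown>;

// Return a copy of the record with PII redacted (top-level keys only).
function scrubPii(record: LogRecord): LogRecord {
  const out: LogRecord = {};
  for (const [key, value] of Object.entries(record)) {
    if (key === "address" || key.startsWith("address.")) continue; // address.* removed
    if (key === "email" && typeof value === "string") {
      out[key] = `em_${sha8(value)}`;
    } else if (key === "phone" && typeof value === "string") {
      out[key] = `ph_${sha8(value)}`;
    } else {
      out[key] = value;
    }
  }
  return out;
}

// Sample values are made up for illustration.
console.log(
  scrubPii({
    msg: "invitation_sent",
    email: "person@example.com",
    phone: "+10000000000",
    "address.city": "Example City",
  }),
);
```

Because the digest is deterministic, the same address always maps to the same token, so log lines for one recipient can still be correlated without exposing the raw value.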

Log volume target: ≤ 1 KB / request average (excluding pretty traces in dev).


7. Alerts

| Alert | Condition | Severity | Action |
|---|---|---|---|
| TenantConfigP95Breach | p95 > 50 ms for 5 min | warn | Investigate cache; check Cloud SQL CPU |
| AuthzCheckErrorRate | error rate > 0.1 % for 5 min | page | PDP unavailable → fail-closed in callers; immediate triage |
| OutboxBacklog | outbox_pending_total > 1000 for 10 min | page | Poller stuck or Pub/Sub down |
| OutboxLagP95 | publish lag p95 > 10 s for 5 min | warn | Throughput insufficient; scale poller |
| LastOwnerBlockSpike | > 5 / min for 5 min | warn | Possible fraud / mistake; alert security |
| RoleEscalationBlockSpike | > 10 / min | warn | Possible compromised account |
| TenantIsolationFailure | any non-zero on tenant_isolation_violation_total (CI canary metric) | page | Production incident; freeze deploy |
| SagaTimeout | saga_timeout_total > 0 for any saga | page | Cascade saga stuck; runbook |
| InvitationRateLimitSpike | invitation_failed_total{reason=rate_limited} > 50/min for 1 tenant | warn | Possible abuse; alert tenant.owner |
| BillingSuspensionDelay | tenants in pending_suspend > 15 min past gracePeriodEndsAt | warn | Billing consumer stuck |
| AIClientFailure | AI 5xx rate > 25 % for 10 min | warn | Circuit opens; degrade gracefully |

All alerts route through PagerDuty tenant-service-oncall with a linked runbook URL.
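
Several conditions in the table must hold "for N min" before the alert fires. One way to read that, borrowed from Prometheus-style `for` clauses, is that the threshold must be breached continuously for the whole duration. The sketch below evaluates that rule over a series of samples; the `Sample` shape and `breachedFor` helper are hypothetical, and the real alerting backend may evaluate the condition differently.

```typescript
interface Sample {
  tsMs: number; // sample timestamp in milliseconds
  value: number; // metric value, e.g. outbox_pending_total
}

// True when the value has stayed above the threshold, without interruption,
// for at least forMs up to nowMs.
function breachedFor(
  samples: Sample[],
  threshold: number,
  forMs: number,
  nowMs: number,
): boolean {
  const sorted = samples
    .filter((s) => s.tsMs <= nowMs)
    .sort((a, b) => a.tsMs - b.tsMs);

  let breachStart: number | null = null;
  for (const s of sorted) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.tsMs; // breach begins
    } else {
      breachStart = null; // any recovery resets the timer
    }
  }
  return breachStart !== null && nowMs - breachStart >= forMs;
}

// OutboxBacklog: outbox_pending_total > 1000 for 10 min, sampled every minute.
const MINUTE = 60_000;
const backlog: Sample[] = Array.from({ length: 11 }, (_, i) => ({
  tsMs: i * MINUTE,
  value: 1500,
}));
console.log(breachedFor(backlog, 1000, 10 * MINUTE, 10 * MINUTE));
```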


8. Health Endpoints

| Endpoint | Checks | Used by |
|---|---|---|
| GET /healthz | process up | Cloud Run liveness |
| GET /readyz | Postgres SELECT 1, Memorystore PING, Pub/Sub publisher heartbeat, JWKS cache loaded | Cloud Run readiness |
| GET /metrics | Prometheus scrape (auth-gated) | Monitoring agent |
| GET /__/admin/outbox?status=pending | platform-only operator view | On-call runbook |

When any dependency check fails, /readyz returns 503 with per-dependency status JSON so on-call can triage at a glance.
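
A readiness handler along these lines could aggregate its probes as sketched below. The response shape, the dependency key names, and the helpers `runChecks` and `toReadyzResponse` are assumptions for illustration; the document only states that the 503 body carries per-dependency status.

```typescript
type DependencyStatus = "ok" | "fail";

interface ReadyzResponse {
  httpStatus: 200 | 503;
  body: {
    status: "ready" | "not_ready";
    checks: Record<string, DependencyStatus>;
  };
}

// Run all probes concurrently; a probe that throws or rejects counts as failed.
async function runChecks(
  probes: Record<string, () => Promise<void>>,
): Promise<Record<string, boolean>> {
  const names = Object.keys(probes);
  const results = await Promise.all(
    names.map(async (name) => {
      try {
        await probes[name]!();
        return true;
      } catch {
        return false;
      }
    }),
  );
  const out: Record<string, boolean> = {};
  names.forEach((name, i) => {
    out[name] = results[i] === true;
  });
  return out;
}

// Turn probe results into the HTTP status and JSON body.
function toReadyzResponse(checks: Record<string, boolean>): ReadyzResponse {
  const statuses: Record<string, DependencyStatus> = {};
  let allOk = true;
  for (const [name, ok] of Object.entries(checks)) {
    statuses[name] = ok ? "ok" : "fail";
    if (!ok) allOk = false;
  }
  return {
    httpStatus: allOk ? 200 : 503,
    body: { status: allOk ? "ready" : "not_ready", checks: statuses },
  };
}

// Hypothetical probes standing in for the four checks in the table.
runChecks({
  postgres: async () => {},
  memorystore: async () => {},
  pubsub_publisher: async () => {
    throw new Error("heartbeat stale");
  },
  jwks_cache: async () => {},
}).then((checks) => console.log(JSON.stringify(toReadyzResponse(checks))));
```

Running the probes concurrently keeps the endpoint's latency close to that of the slowest dependency rather than the sum of all four.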


9. Continuous Verification

  • Synthetic monitor in each region: every 60 s, calls GET /api/v1/tenants/{seed-tenant}/config and POST /authz/check with a known seed; latency + correctness recorded as synthetic_* metrics.
  • Canary tenants: two seed tenants per env; the two-tenant simulator runs against them in production hourly to verify isolation.
  • Chaos drills quarterly: kill the Pub/Sub publisher; confirm outbox accumulates without data loss; restore.

10. Operator Runbooks

Each alert links to a runbook under runbooks/tenant-service/:

  • outbox-backlog.md
  • pdp-unavailable.md
  • tenant-isolation-violation.md (Sev-1; comms template)
  • saga-timeout.md
  • last-owner-block-spike.md

Runbooks are reviewed every release.