tenant-service — OBSERVABILITY

Companion: APPLICATION_LOGIC · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · SERVICE_READINESS

Tenant-service is a tier-1 dependency for every other service on the platform: if it degrades, every other service degrades. Observability standards therefore exceed the platform default.

1. Telemetry Stack

Signal	Producer	Sink
Traces	OpenTelemetry SDK (`@opentelemetry/sdk-node`), W3C tracecontext	Cloud Trace (prod) + Tempo (dev)
Metrics	OpenTelemetry + Prometheus exporter	Cloud Monitoring (prod) + Prometheus (dev)
Logs	Pino → JSON stdout	Cloud Logging (prod) + Loki (dev)
Events / Audit	`audit_events` table → BigQuery via `audit-service`	BigQuery `melmastoon_audit.tenant_*`
Real-user (Backoffice)	OpenTelemetry browser SDK	Cloud Trace via gateway forwarder

All telemetry carries the same context attributes via OTel baggage:

tenant.id, tenant.residency, user.id, role.codes, request.id, trace.id, device.id, route, http.status_code, app.version, deployment.region
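
As an illustration of how these attributes could travel between services, the sketch below serialises a set of context attributes into a W3C `baggage` header value and parses it back. The helper names `toBaggageHeader` and `fromBaggageHeader` are hypothetical; in the real service the OTel propagators presumably do this work, so treat this as a picture of the wire format rather than the service's code.

```typescript
// Hypothetical helpers, not part of tenant-service: they only illustrate
// the comma-separated key=value layout of the W3C `baggage` header.
type ContextAttributes = Record<string, string>;

function toBaggageHeader(attrs: ContextAttributes): string {
  const members: string[] = [];
  for (const key of Object.keys(attrs)) {
    const value = attrs[key];
    if (value === "") continue; // skip empty attributes
    // Values are percent-encoded so commas and semicolons stay unambiguous.
    members.push(key + "=" + encodeURIComponent(value));
  }
  return members.join(",");
}

function fromBaggageHeader(header: string): ContextAttributes {
  const attrs: ContextAttributes = {};
  for (const member of header.split(",")) {
    const eq = member.indexOf("=");
    if (eq <= 0) continue; // skip malformed members
    const key = member.slice(0, eq).trim();
    let rawValue = member.slice(eq + 1);
    const semi = rawValue.indexOf(";");
    if (semi !== -1) rawValue = rawValue.slice(0, semi); // drop optional properties
    attrs[key] = decodeURIComponent(rawValue.trim());
  }
  return attrs;
}
```

A multi-valued attribute such as `role.codes` is assumed here to be joined into one string before encoding; the source does not say how the service represents it.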

2. Service Level Objectives

SLO	Target	Window	Error budget
`GET /api/v1/tenants/{id}/config` p95	≤ 25 ms	30 d	0.1 %
`GET /api/v1/tenants/{id}/config` availability	≥ 99.95 %	30 d	21.6 min/mo
`POST /api/v1/authz/check` p95	≤ 20 ms	30 d	0.05 %
`POST /api/v1/authz/check` availability	≥ 99.99 %	30 d	4.3 min/mo
`PATCH /api/v1/tenants/{id}/config` p95	≤ 200 ms	30 d	0.5 %
Outbox → Pub/Sub publish lag p95	≤ 2 s	30 d	1 %
Sync pull p95	≤ 200 ms	30 d	1 %
Two-tenant isolation incidents	0	always	0

Error budget burn drives release freezes: if more than 50 % of the monthly budget is consumed within 24 h, prod deploys are frozen.
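
The availability budgets in the table follow directly from the target and the 30-day window, and the freeze rule is a simple comparison against that budget. A minimal sketch of both calculations follows; the function names are illustrative, and it covers only the time-based availability budgets, not the request-percentage budgets of the latency SLOs.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Allowed downtime in minutes for an availability target over a window.
function errorBudgetMinutes(availabilityPct: number, windowDays: number): number {
  return (1 - availabilityPct / 100) * windowDays * MINUTES_PER_DAY;
}

// Freeze rule from the text: more than 50 % of the monthly budget
// consumed within the last 24 h.
function shouldFreezeDeploys(budgetMinutes: number, consumedLast24hMinutes: number): boolean {
  return consumedLast24hMinutes > 0.5 * budgetMinutes;
}

const configBudget = errorBudgetMinutes(99.95, 30); // about 21.6 minutes
const authzBudget = errorBudgetMinutes(99.99, 30);  // about 4.32 minutes
```

For the config endpoint, 11 minutes of unavailability in one day would exceed half of the 21.6-minute budget and trigger the freeze.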

3. Metrics Catalog

3.1 RED (per route)

http_requests_total{route, method, status, tenant_id}
http_request_duration_seconds_bucket{route, method, status}
http_request_inflight{route}
http_request_errors_total{route, error_code}
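
The p95 latency SLOs are evaluated from `http_request_duration_seconds_bucket`. The sketch below shows one way to estimate a quantile from cumulative histogram buckets using linear interpolation within the target bucket, in the spirit of Prometheus `histogram_quantile`. The `Bucket` type, the function name, and the sample counts are made up for illustration.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number;
  count: number;
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => (a.le < b.le ? -1 : a.le > b.le ? 1 : 0));
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count < rank) continue;
    const lowerBound = i > 0 ? sorted[i - 1].le : 0;
    const lowerCount = i > 0 ? sorted[i - 1].count : 0;
    // The +Inf bucket has no upper bound, so fall back to the last finite bound.
    if (sorted[i].le === Infinity) return i > 0 ? lowerBound : NaN;
    const inBucket = sorted[i].count - lowerCount;
    if (inBucket === 0) return lowerBound;
    // Assume observations are spread evenly inside the bucket.
    return lowerBound + (sorted[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
  }
  return NaN;
}

// Illustrative counts only, not measured data.
const sampleBuckets: Bucket[] = [
  { le: 0.005, count: 50 },
  { le: 0.01, count: 80 },
  { le: 0.025, count: 96 },
  { le: 0.05, count: 100 },
  { le: Infinity, count: 100 },
];
const sampleP95 = estimateQuantile(0.95, sampleBuckets); // 0.0240625 s, about 24 ms
```

Because the estimate interpolates inside a bucket, its accuracy depends on bucket boundaries sitting close to the SLO thresholds (25 ms and 20 ms here).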

3.2 USE (resources)

db_pool_connections_active / _idle / _waiting
db_query_duration_seconds_bucket{operation, table}
redis_command_duration_seconds_bucket{command}
process_cpu_seconds_total
process_resident_memory_bytes
nodejs_eventloop_lag_seconds

3.3 Domain KPIs

tenant_resolve_total{cache_hit}
tenant_resolve_duration_seconds_bucket
tenant_status_total{status}                                   # gauge
membership_resolution_total{cache_hit}
membership_active_total                                       # gauge per tenant
role_assignment_changes_total{by, role_code}
invitation_sent_total{tenant_id, locale}
invitation_accepted_total
invitation_expired_total
invitation_failed_total{reason}                               # rate_limited, escalation, ...
authz_check_total{decision}                                   # allow|deny
authz_deny_total{reason_code}
last_owner_block_total                                        # should be near zero
role_escalation_block_total                                   # should be near zero
outbox_pending_total                                          # gauge
outbox_publish_lag_seconds                                    # histogram
saga_step_duration_seconds_bucket{saga, step}
saga_timeout_total{saga, step}
ai_call_total{prompt_id, decision}
ai_call_duration_seconds_bucket{prompt_id}
ai_budget_exhausted_total{tenant_id}

3.4 Sync (server side)

sync_pull_total{scope=tenant}
sync_pull_duration_seconds_bucket
sync_pull_items_total{scope, kind}
sync_pull_bytes_total{scope}
sync_push_rejected_total{reason=online_required}

4. Dashboards

Three Grafana / Cloud Monitoring dashboards (versioned in dashboards/ of this repo):

tenant-service / Overview — RED, error budget burn, top 10 tenants by request volume, top 10 by error count, Cloud Run instance count.
tenant-service / Domain — Membership creation rate, invitation funnel, role assignment activity, last-owner blocks, escalation blocks, anomaly notifications.
tenant-service / Eventing — Outbox depth, publish lag, DLQ counts per topic, consumer lag for inbox, saga in-flight + timeout.

Each dashboard is pinned to the tenant-service runbook hub.

5. Tracing

5.1 Instrumented spans (selected)

Span	Attributes
`tenant.provision`	`tenant.slug`, `actor.user_id`, `db.txn.duration`
`tenant.config.update`	`tenant.id`, `expectVersion`, `changedFields[]`
`tenant.suspend` / `reactivate`	`tenant.id`, `by`, `reason`
`tenant.close.saga.start` / `…step` / `…ack` / `…complete`	`saga.id`, `service`, `outcome`
`membership.invite`	`tenant.id`, `actor.user_id`, `roles_proposed`, `ai.decision`
`membership.role_change`	`tenant.id`, `membership.id`, `added[]`, `removed[]`
`authz.check`	`principal.user_id`, `resource.type`, `resource.id`, `decision`, `cache.hit`
`outbox.publish`	`topic`, `lagMs`, `payload.bytes`
`sync.tenant.pull`	`device.id`, `cursor.kind`, `items.count`

Spans propagate to downstream service calls via W3C traceparent. The tenant id rides as OTel baggage so every downstream log entry carries it without further plumbing.
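
For readers unfamiliar with the header, the sketch below parses a version-00 W3C `traceparent` value and builds the header a child span would send downstream. The OTel SDK normally handles this; the names `parseTraceparent` and `childTraceparent` are hypothetical and exist only to show the four-field layout.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  if (m[1] === "ff") return null; // version ff is not allowed
  // All-zero trace or parent ids are invalid.
  if (/^0+$/.test(m[2]) || /^0+$/.test(m[3])) return null;
  return {
    version: m[1],
    traceId: m[2],
    parentId: m[3],
    sampled: (parseInt(m[4], 16) & 1) === 1,
  };
}

// The child keeps the trace id and sampling decision but carries its own span id.
function childTraceparent(parent: TraceParent, newSpanId: string): string {
  return "00-" + parent.traceId + "-" + newSpanId + "-" + (parent.sampled ? "01" : "00");
}
```

The trace id stays constant across every hop, which is what lets a log line's `trace_id` be joined back to the originating request.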

5.2 Sampling

Tail-based sampling: 100 % of traces with errors, 100 % of traces with authz.check.decision = 'deny' for sensitive resources, 1 % of all others.
Exemption: GET /healthz, GET /readyz, and GET /metrics are excluded from tracing entirely.
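
The sampling policy above can be read as a decision function applied once a trace is complete. The sketch below is one possible rendering: the `CompletedTrace` shape and the boolean flag for "deny on a sensitive resource" are assumptions, since the source does not define which resources count as sensitive. The random source is injectable so the behaviour is testable.

```typescript
// Assumed summary of a finished trace; field names are illustrative.
interface CompletedTrace {
  rootRoute: string;
  hasError: boolean;
  authzDeniedOnSensitiveResource: boolean;
}

const UNTRACED_ROUTES = ["GET /healthz", "GET /readyz", "GET /metrics"];
const BASELINE_RATE = 0.01; // 1 % of all other traces

function shouldKeepTrace(trace: CompletedTrace, random: () => number = Math.random): boolean {
  // Exempt routes should never produce spans; this is only a guard.
  if (UNTRACED_ROUTES.indexOf(trace.rootRoute) !== -1) return false;
  if (trace.hasError) return true;
  if (trace.authzDeniedOnSensitiveResource) return true;
  return random() < BASELINE_RATE;
}
```

Deciding at the tail, after the outcome is known, is what makes it possible to keep every error and every sensitive deny while still discarding most routine traffic.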

6. Logs

Structured JSON, one event per line. Schema:

{
  "ts": "2026-04-22T08:00:00.123Z",
  "level": "info",
  "service": "tenant-service",
  "version": "1.7.3",
  "region": "asia-south1",
  "tenant_id": "tnt_01H...",
  "user_id": "usr_01H...",
  "request_id": "req_01H...",
  "trace_id": "...",
  "span_id": "...",
  "route": "POST /api/v1/invitations",
  "msg": "invitation_sent",
  "event": "invitation.sent",
  "invitation_id": "inv_01H...",
  "roles_proposed": ["rol_01H..."],
  "duration_ms": 41
}

The PII scrubber replaces email with em_<sha8> and phone with ph_<sha8>, and removes address.* fields.
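
A minimal sketch of such a scrubber follows. It rests on several assumptions not stated in the source: that `<sha8>` means the first 8 hex characters of a SHA-family digest, that log records are flat objects with top-level `email`, `phone`, and `address` or `address.*` keys, and that the hash function is injected by the caller. The name `scrubPii` is hypothetical.

```typescript
// Assumed to return a lowercase hex digest (for example SHA-256) of its input.
type Hasher = (input: string) => string;

function scrubPii(record: Record<string, unknown>, hash: Hasher): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(record)) {
    const value = record[key];
    // address and address.* fields are dropped entirely.
    if (key === "address" || key.indexOf("address.") === 0) continue;
    if (key === "email" && typeof value === "string") {
      out[key] = "em_" + hash(value).slice(0, 8);
    } else if (key === "phone" && typeof value === "string") {
      out[key] = "ph_" + hash(value).slice(0, 8);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Replacing a value with a stable hash prefix, rather than deleting it, keeps log lines for the same address or number correlatable without exposing the raw value.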

Log volume target: ≤ 1 KB / request average (excluding pretty traces in dev).

7. Alerts

Alert	Condition	Severity	Action
`TenantConfigP95Breach`	p95 > 50 ms for 5 min	warn	Investigate cache; check Cloud SQL CPU
`AuthzCheckErrorRate`	error rate > 0.1 % for 5 min	page	PDP unavailable → fail-closed in callers; immediate triage
`OutboxBacklog`	`outbox_pending_total > 1000` for 10 min	page	Poller stuck or Pub/Sub down
`OutboxLagP95`	publish lag p95 > 10 s for 5 min	warn	Throughput insufficient; scale poller
`LastOwnerBlockSpike`	`> 5 / min` for 5 min	warn	Possible fraud / mistake; alert security
`RoleEscalationBlockSpike`	`> 10 / min`	warn	Possible compromised account
`TenantIsolationFailure`	any non-zero on `tenant_isolation_violation_total` (CI canary metric)	page	Production incident; freeze deploy
`SagaTimeout`	`saga_timeout_total > 0` for any saga	page	Cascade saga stuck; runbook
`InvitationRateLimitSpike`	`invitation_failed_total{reason=rate_limited} > 50/min for 1 tenant`	warn	Possible abuse; alert tenant.owner
`BillingSuspensionDelay`	tenants in `pending_suspend` > 15 min past `gracePeriodEndsAt`	warn	Billing consumer stuck
`AIClientFailure`	AI 5xx rate > 25 % for 10 min	warn	Circuit opens; degrade gracefully

All alerts route through PagerDuty tenant-service-oncall, each with a linked runbook URL.
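
Most conditions in the table have the form "threshold exceeded for N minutes", meaning the alert fires only when the condition holds continuously over that span. The sketch below illustrates that semantics under the assumption of one sample per minute, oldest first; the function name and sample values are illustrative, and the real evaluation happens in the monitoring backend.

```typescript
// Returns true when every one of the last `forMinutes` samples exceeds the threshold.
function firesFor(samplesPerMinute: number[], threshold: number, forMinutes: number): boolean {
  if (forMinutes <= 0 || samplesPerMinute.length < forMinutes) return false;
  const recent = samplesPerMinute.slice(samplesPerMinute.length - forMinutes);
  return recent.every((value) => value > threshold);
}

// OutboxBacklog: outbox_pending_total > 1000 for 10 min (made-up samples).
const sustainedBacklog = [1200, 1300, 1250, 1400, 1500, 1600, 1550, 1700, 1800, 1900];
const backlogFires = firesFor(sustainedBacklog, 1000, 10); // true
```

A single dip to or below the threshold inside the window resets the condition, which keeps short spikes from triggering an alert.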

8. Health Endpoints

Endpoint	Checks	Used by
`GET /healthz`	process up	Cloud Run liveness
`GET /readyz`	Postgres SELECT 1, Memorystore PING, Pub/Sub publisher heartbeat, JWKS cache loaded	Cloud Run readiness
`GET /metrics`	Prometheus scrape (auth-gated)	Monitoring agent
`GET /__/admin/outbox?status=pending`	platform-only operator view	On-call runbook

When any dependency check fails, GET /readyz returns 503 with per-dependency status JSON so on-call can triage at a glance.
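
One way to shape the readiness response is sketched below. The body layout (`ready` plus a `checks` map of "ok" or "fail") and the dependency key names are assumptions for illustration; the source specifies only the 503 status and that per-dependency status is returned as JSON.

```typescript
// Map of dependency name to whether its check passed.
type CheckResults = Record<string, boolean>;

interface ReadyzResponse {
  status: number;
  body: { ready: boolean; checks: Record<string, "ok" | "fail"> };
}

function buildReadyzResponse(results: CheckResults): ReadyzResponse {
  const checks: Record<string, "ok" | "fail"> = {};
  let ready = true;
  for (const name of Object.keys(results)) {
    const passed = results[name];
    checks[name] = passed ? "ok" : "fail";
    if (!passed) ready = false; // one failing dependency makes the instance unready
  }
  return { status: ready ? 200 : 503, body: { ready, checks } };
}

const degraded = buildReadyzResponse({ postgres: true, memorystore: true, pubsub: false, jwks: true });
// degraded.status is 503 and degraded.body.checks.pubsub is "fail"
```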

9. Continuous Verification

Synthetic monitor in each region: every 60 s, it calls GET /api/v1/tenants/{seed-tenant}/config and POST /api/v1/authz/check with a known seed; latency and correctness are recorded as synthetic_* metrics.
Canary tenants: two seed tenants per env; the two-tenant simulator runs against them in production hourly to verify isolation.
Chaos drills quarterly: kill the Pub/Sub publisher; confirm outbox accumulates without data loss; restore.

10. Operator Runbooks

Each alert links to a runbook under runbooks/tenant-service/:

outbox-backlog.md
pdp-unavailable.md
tenant-isolation-violation.md (Sev-1; comms template)
saga-timeout.md
last-owner-block-spike.md

Runbooks are reviewed every release.
