Identity Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template · 03 platform-services · 17 Tech Stack

1. Signal model

| Signal | Tool | Emission path |
| --- | --- | --- |
| Traces | OpenTelemetry → OTLP → SigNoz/Tempo | @ghasi/telemetry auto-instruments NestJS, HTTP, pg, NATS |
| Metrics | Prometheus-compatible scrape | /metrics endpoint (Node) + OTEL metrics exporter |
| Logs | Structured JSON → Loki | @ghasi/telemetry logger with traceId, tenantId, userId fields |
| Events | Domain events → NATS → audit-service & BI | Business-level observability |

2. Service level indicators (SLIs)

| SLI | Definition | Source |
| --- | --- | --- |
| auth.login.success_rate | count(login.success) / count(login.attempt) over 5 min | metric |
| auth.login.latency_p95 | Histogram over 5 min, excluding MFA challenge wait | metric |
| auth.mfa.failure_ratio | count(mfa_failed) / count(mfa_attempt) | metric |
| auth.refresh.rotation_latency_p95 | | metric |
| license.effective.resolve_latency_p95 | End-to-end incl. tenant-service ancestor call | metric |
| license.cache.hit_ratio | hits / (hits + misses) | metric |
| license.assignment.event_to_cache_eviction_p95 | Time from publish to key delete | metric (synthetic probe) |
| federation.upstream.success_rate | per provider.vendor | metric |
| tenant_isolation.cross_tenant_attempts | counter | metric |
| outbox.lag_seconds_p95 | Publish delay on unpublished rows | metric |
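
The ratio and percentile SLIs in the table above can be illustrated with a minimal sketch. It assumes raw counters and latency samples held in memory for one 5-minute window. The `WindowCounts` shape and the function names are hypothetical, and the nearest-rank percentile stands in for whatever histogram quantile the metrics backend actually computes.

```typescript
// Hypothetical counters for one 5-minute window; a real service would read
// these from its metrics backend rather than hold them in memory.
interface WindowCounts {
  loginAttempts: number;
  loginSuccesses: number;
  cacheHits: number;
  cacheMisses: number;
}

// Ratio with an explicit "no data" result, so an idle window is not
// reported as 0% or 100%.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

// Nearest-rank percentile over raw latency samples (milliseconds).
function percentile(samplesMs: number[], p: number): number | null {
  if (samplesMs.length === 0) return null;
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function computeSlis(counts: WindowCounts, loginLatenciesMs: number[]) {
  return {
    "auth.login.success_rate": ratio(counts.loginSuccesses, counts.loginAttempts),
    "license.cache.hit_ratio": ratio(counts.cacheHits, counts.cacheHits + counts.cacheMisses),
    "auth.login.latency_p95": percentile(loginLatenciesMs, 95),
  };
}
```

Returning `null` for an empty window is a design choice in this sketch: it keeps "no traffic" distinguishable from "all requests failed" when the value is later compared against an SLO target.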

3. SLOs (initial; reviewed quarterly)

| Objective | Target | Window |
| --- | --- | --- |
| Login success latency | p95 ≤ 400 ms (no MFA); p95 ≤ 800 ms (with MFA) | 28d rolling |
| Login availability | ≥ 99.9% | 28d |
| Refresh latency | p95 ≤ 150 ms | 28d |
| Effective-license resolve | p95 ≤ 120 ms cached, ≤ 500 ms cold | 28d |
| License cache hit ratio | ≥ 90% | 28d |
| License event → cache evict | ≤ 30 s p99 | 28d |
| JWKS endpoint availability | ≥ 99.99% | 28d |
| Outbox publish lag | p95 ≤ 2 s | 28d |

Error budget for login availability: 0.1% of a 30-day month ≈ 43 min of allowed unavailability (≈ 40 min over the 28-day SLO window).
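
The error budget figure above follows from simple arithmetic. The sketch below shows it for both a 30-day month and the 28-day SLO window; the function name `errorBudgetMinutes` is illustrative and not part of any existing library.

```typescript
// Allowed unavailability, in minutes, for an availability target
// (for example 0.999) over a window of `windowDays` days.
function errorBudgetMinutes(targetAvailability: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - targetAvailability) * windowMinutes;
}

const budget30d = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
const budget28d = errorBudgetMinutes(0.999, 28); // about 40.3 minutes
```

The same function applies to the JWKS objective: a 99.99% target over 28 days leaves roughly 4 minutes of budget.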

4. Dashboards

| Dashboard | Panels |
| --- | --- |
| Identity — AuthN health | Login attempts + success by backend + vendor; MFA failure rate; latency histograms; circuit-breaker state per IdP |
| Identity — Sessions | Active session count by tenant; refresh rotation rate; refresh-replay counter; logout-all events |
| Identity — Licensing | Cache hit ratio; effective-resolve latency; assignments by status; dependency violation counter |
| Identity — Federation | Upstream IdP error rate; SAML assertion latency; OIDC callback latency |
| Identity — Security | cross_tenant_attempts, refresh_replay, mfa_challenge_failed, break_glass events |
| Identity — Outbox | Unpublished row count + age; relay error rate |

All dashboards are tagged with service=identity-service and tenant_id (as a low-cardinality bucket: tenant_size=small|medium|large).
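
One way to derive the low-cardinality tenant_size tag is sketched below. The source does not say what separates small, medium, and large tenants, so the user-count thresholds and the function names here are assumptions for illustration only. The sketch also assumes the bucket replaces a per-tenant label on these tags.

```typescript
type TenantSize = "small" | "medium" | "large";

// Hypothetical thresholds on active user count; the real bucketing
// criteria are not stated in this document.
const MEDIUM_MIN_USERS = 100;
const LARGE_MIN_USERS = 5000;

function tenantSizeBucket(activeUsers: number): TenantSize {
  if (activeUsers >= LARGE_MIN_USERS) return "large";
  if (activeUsers >= MEDIUM_MIN_USERS) return "medium";
  return "small";
}

// Tag set for a dashboard or metric: only the bucket is emitted, so the
// number of distinct label values stays at three regardless of tenant count.
function dashboardTags(activeUsers: number): Record<string, string> {
  return {
    service: "identity-service",
    tenant_size: tenantSizeBucket(activeUsers),
  };
}
```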

5. Alerts

| Alert | Trigger | Severity | Runbook |
| --- | --- | --- | --- |
| IdentityLoginAvailabilityBreach | success_rate < 99.5% for 5 min | P1 | runbooks/identity/login-availability.md |
| IdentityMfaFailureSpike | mfa.failure_ratio > 25% for 10 min | P2 | runbooks/identity/mfa-spike.md |
| IdentityRefreshReplay | refresh_replay_count > 3 per 5 min | P2 | runbooks/identity/refresh-replay.md |
| IdentityLicenseCacheEvictionLag | event_to_cache_eviction_p95 > 45 s | P2 | runbooks/identity/license-eviction.md |
| IdentityOutboxLagHigh | outbox.lag_seconds_p95 > 30 s | P2 | runbooks/identity/outbox.md |
| IdentityCrossTenantAttempt | any nonzero in 1 min | P3 (pager off-hours) | runbooks/identity/cross-tenant.md |
| IdentityFederationCircuitOpen | breaker open > 5 min | P2 | runbooks/identity/federation.md |
| IdentityJWKSUnavailable | public probe 2 consecutive failures | P1 | runbooks/identity/jwks.md |
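
Several triggers above take the form "condition for N minutes". A minimal sketch of that semantics follows, assuming the alert fires only after the condition has held on every evaluation for the full duration and resets as soon as it clears. The `SustainedAlert` class is hypothetical; the actual alerting backend may evaluate these rules differently.

```typescript
class SustainedAlert {
  private breachingSinceMs: number | null = null;
  private forMs: number;
  private breaches: (value: number) => boolean;

  constructor(forMs: number, breaches: (value: number) => boolean) {
    this.forMs = forMs;
    this.breaches = breaches;
  }

  // Feed one evaluation; returns true once the condition has held
  // continuously for at least `forMs` milliseconds.
  evaluate(value: number, nowMs: number): boolean {
    if (!this.breaches(value)) {
      this.breachingSinceMs = null; // condition cleared, so restart the timer
      return false;
    }
    if (this.breachingSinceMs === null) {
      this.breachingSinceMs = nowMs; // first breaching evaluation
    }
    return nowMs - this.breachingSinceMs >= this.forMs;
  }
}

// Example wiring for the login availability trigger: success_rate < 99.5% for 5 min.
const loginAvailabilityBreach = new SustainedAlert(5 * 60 * 1000, (rate) => rate < 0.995);
```

Resetting on the first healthy evaluation keeps brief dips from paging, at the cost of missing a flapping signal that never stays below the threshold for the full duration.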

6. Runbook template (excerpt)

Each runbook contains: symptoms, dashboards, quick checks, rollback steps, and escalation. All runbooks live in docs/runbooks/identity/*.md.

7. Logs

  • All logs are JSON with the required fields: service, env, version, traceId, spanId, tenantId (when set), userId (when set), requestId, severity, msg, err?.
  • PII is never logged at DEBUG; email is logged only at INFO and only for admin actions; passwords, TOTP values, and WebAuthn signatures are never logged.
  • Log retention: 30 days hot in Loki, 1 year cold in S3 Glacier.
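
The field and redaction rules above can be sketched as a small log-line builder. This is not the @ghasi/telemetry logger, whose API is not shown in this document. `buildLogLine`, the `LogContext` shape, and the key names in `NEVER_LOGGED` are hypothetical, and the email rule is read here as "INFO severity and an admin action".

```typescript
type LogSeverity = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface LogContext {
  service: string;
  env: string;
  version: string;
  traceId: string;
  spanId: string;
  requestId: string;
  tenantId?: string; // included only when set
  userId?: string;   // included only when set
}

// Hypothetical key names for values that must never reach the log stream.
const NEVER_LOGGED = ["password", "totp", "webauthnSignature"];

function buildLogLine(
  ctx: LogContext,
  severity: LogSeverity,
  msg: string,
  extra: Record<string, unknown> = {},
  isAdminAction = false
): string {
  const safe: Record<string, unknown> = {};
  for (const key of Object.keys(extra)) {
    if (NEVER_LOGGED.indexOf(key) !== -1) continue; // secrets dropped at every level
    // Email is PII: keep it only at INFO and only for admin actions.
    if (key === "email" && !(severity === "INFO" && isAdminAction)) continue;
    safe[key] = extra[key];
  }
  // Context is spread last so extra fields cannot overwrite required ones.
  return JSON.stringify({ ...safe, ...ctx, severity, msg });
}
```

Unset `tenantId` and `userId` values are `undefined`, which `JSON.stringify` omits, matching the "when set" wording.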

8. Synthetic probes

| Probe | Cadence | Alert on failure |
| --- | --- | --- |
| POST /api/v1/auth/login with canary account | 1 min | 2 consecutive fails → P1 |
| GET /.well-known/jwks.json | 30 s | 2 consecutive fails → P1 |
| GET /internal/identity/licensing/effective?...canary... | 1 min | 3 fails → P2 |
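
The failure thresholds in the probe table could be tracked as in the sketch below. `ProbeTracker` and `ProbePolicy` are hypothetical names, and the sketch assumes all three thresholds count consecutive failures, which the source states explicitly only for the first two probes.

```typescript
type ProbeSeverity = "P1" | "P2";

interface ProbePolicy {
  failuresToAlert: number; // consecutive failures before alerting
  severity: ProbeSeverity;
}

class ProbeTracker {
  private consecutiveFailures = 0;
  private policy: ProbePolicy;

  constructor(policy: ProbePolicy) {
    this.policy = policy;
  }

  // Record one probe result; returns the severity to raise, or null
  // when the failure streak is still below the policy threshold.
  record(ok: boolean): ProbeSeverity | null {
    if (ok) {
      this.consecutiveFailures = 0; // any success ends the streak
      return null;
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures >= this.policy.failuresToAlert
      ? this.policy.severity
      : null;
  }
}

// Policies taken from the probe table.
const jwksProbe = new ProbeTracker({ failuresToAlert: 2, severity: "P1" });
const licensingProbe = new ProbeTracker({ failuresToAlert: 3, severity: "P2" });
```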