Identity Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template · 03 platform-services · 17 Tech Stack
1. Signal model
| Signal | Tool | Emission path |
|---|---|---|
| Traces | OpenTelemetry → OTLP → SigNoz/Tempo | @ghasi/telemetry auto-instruments NestJS, HTTP, pg, NATS |
| Metrics | Prometheus-compatible scrape | /metrics endpoint (Node) + OTEL metrics exporter |
| Logs | Structured JSON → Loki | @ghasi/telemetry logger with traceId, tenantId, userId fields |
| Events | Domain events → NATS → audit-service & BI | Business-level observability |
2. Service level indicators (SLIs)
| SLI | Definition | Source |
|---|---|---|
| auth.login.success_rate | count(login.success) / count(login.attempt) over 5 min | metric |
| auth.login.latency_p95 | Histogram over 5 min, excluding MFA challenge wait | metric |
| auth.mfa.failure_ratio | count(mfa_failed) / count(mfa_attempt) | metric |
| auth.refresh.rotation_latency_p95 | — | metric |
| license.effective.resolve_latency_p95 | End-to-end incl. tenant-service ancestor call | metric |
| license.cache.hit_ratio | hits / (hits + misses) | metric |
| license.assignment.event_to_cache_eviction_p95 | Time from publish to key delete | metric (synthetic probe) |
| federation.upstream.success_rate | per provider.vendor | metric |
| tenant_isolation.cross_tenant_attempts | counter | metric |
| outbox.lag_seconds_p95 | Publish delay on unpublished rows | metric |
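
The ratio-style SLIs in the table (auth.login.success_rate, auth.mfa.failure_ratio, license.cache.hit_ratio) share one shape: a numerator counter divided by a denominator counter over a window. The sketch below illustrates that arithmetic only. The function names are hypothetical and are not part of @ghasi/telemetry, and returning `undefined` for an empty window is an assumption, chosen so that a window with no traffic does not read as 0% or NaN.

```typescript
// Ratio SLI with a guard for empty windows.
// Assumption: "no data" is reported as undefined rather than 0 or NaN.
function ratio(numerator: number, denominator: number): number | undefined {
  if (denominator <= 0) return undefined;
  return numerator / denominator;
}

// license.cache.hit_ratio = hits / (hits + misses)
function cacheHitRatio(hits: number, misses: number): number | undefined {
  return ratio(hits, hits + misses);
}

// auth.login.success_rate = count(login.success) / count(login.attempt)
function loginSuccessRate(successes: number, attempts: number): number | undefined {
  return ratio(successes, attempts);
}
```

For example, `cacheHitRatio(90, 10)` evaluates to 0.9, which sits exactly on the 90% cache hit target.
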
3. SLOs (initial; reviewed quarterly)
| Objective | Target | Window |
|---|---|---|
| Login success latency | p95 ≤ 400 ms (no MFA); p95 ≤ 800 ms (with MFA) | 28d rolling |
| Login availability | ≥ 99.9% | 28d |
| Refresh latency | p95 ≤ 150 ms | 28d |
| Effective-license resolve | p95 ≤ 120 ms cached, ≤ 500 ms cold | 28d |
| License cache hit ratio | ≥ 90% | 28d |
| License event → cache evict | ≤ 30 s p99 | 28d |
| JWKS endpoint availability | ≥ 99.99% | 28d |
| Outbox publish lag | p95 ≤ 2 s | 28d |
Error budget: 0.1% → about 43 min per 30-day month (about 40 min per 28-day window) for login availability.
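
The budget figure follows from multiplying the allowed failure fraction by the length of the window. A minimal sketch of that arithmetic, assuming a purely time-based budget (the function name is hypothetical):

```typescript
// Error budget in minutes for an availability SLO over a window of whole days.
// sloTarget is a fraction, e.g. 0.999 for 99.9%.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget > 1) {
    throw new RangeError("sloTarget must be in (0, 1]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}
```

With a 99.9% target this gives roughly 43.2 minutes over 30 days and roughly 40.3 minutes over a 28-day window. The 99.99% JWKS target leaves roughly 4 minutes over 28 days.
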
4. Dashboards
| Dashboard | Panels |
|---|---|
| Identity — AuthN health | Login attempts + success by backend + vendor; MFA failure rate; latency histograms; circuit-breaker state per IdP |
| Identity — Sessions | Active session count by tenant; refresh rotation rate; refresh-replay counter; logout-all events |
| Identity — Licensing | Cache hit ratio; effective-resolve latency; assignments by status; dependency violation counter |
| Identity — Federation | Upstream IdP error rate; SAML assertion latency; OIDC callback latency |
| Identity — Security | cross_tenant_attempts, refresh_replay, mfa_challenge_failed, break_glass events |
| Identity — Outbox | Unpublished row count + age; relay error rate |
All dashboards are tagged with service=identity-service and tenant_id, the latter exposed as a low-cardinality bucket (tenant_size=small|medium|large).
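
One way to derive the tenant_size label is to bucket tenants on a single size measure before the label reaches a metric. The sketch below is illustrative only: bucketing by active user count and the threshold values 100 and 5000 are assumptions, not values defined in this document.

```typescript
type TenantSize = "small" | "medium" | "large";

// Maps a tenant to a low-cardinality label value.
// Assumption: size is measured in active users; both thresholds are hypothetical.
function tenantSizeBucket(
  activeUsers: number,
  mediumFrom: number = 100,
  largeFrom: number = 5000,
): TenantSize {
  if (activeUsers >= largeFrom) return "large";
  if (activeUsers >= mediumFrom) return "medium";
  return "small";
}
```

Labelling by bucket keeps the label to three possible values however many tenants exist, which is the point of preferring it over a raw tenant_id label.
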
5. Alerts (with runbook links)
| Alert | Trigger | Severity | Runbook |
|---|---|---|---|
| IdentityLoginAvailabilityBreach | success_rate < 99.5% for 5 min | P1 | runbooks/identity/login-availability.md |
| IdentityMfaFailureSpike | mfa.failure_ratio > 25% for 10 min | P2 | runbooks/identity/mfa-spike.md |
| IdentityRefreshReplay | refresh_replay_count > 3 / 5min | P2 | runbooks/identity/refresh-replay.md |
| IdentityLicenseCacheEvictionLag | event_to_cache_eviction_p95 > 45s | P2 | runbooks/identity/license-eviction.md |
| IdentityOutboxLagHigh | outbox.lag_seconds_p95 > 30s | P2 | runbooks/identity/outbox.md |
| IdentityCrossTenantAttempt | any nonzero in 1 min | P3 (pager off-hours) | runbooks/identity/cross-tenant.md |
| IdentityFederationCircuitOpen | breaker open > 5 min | P2 | runbooks/identity/federation.md |
| IdentityJWKSUnavailable | public probe 2 consecutive failures | P1 | runbooks/identity/jwks.md |
6. Runbook template (excerpt)
Each runbook contains: symptoms, dashboards, quick checks, rollback steps, and escalation. All runbooks live in docs/runbooks/identity/*.md.
7. Logs
- All logs are JSON with required fields: service, env, version, traceId, spanId, tenantId (when set), userId (when set), requestId, severity, msg, err?.
- PII is never logged at DEBUG; email is only logged at INFO for admin actions; password / TOTP / WebAuthn signatures are never logged.
- Log retention: 30 days hot in Loki, 1 year cold in S3 Glacier.
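
The field and redaction rules above can be sketched as a small record builder. This is not the @ghasi/telemetry logger API. The function, the type names, and the extra-field keys (`password`, `totp`, `webauthnSignature`, `email`) are hypothetical and only illustrate how the rules might be applied before a record is serialized.

```typescript
type Severity = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface LogContext {
  service: string;
  env: string;
  version: string;
  traceId: string;
  spanId: string;
  requestId: string;
  tenantId?: string; // present only when set
  userId?: string;   // present only when set
}

// Hypothetical key names for secrets that must never reach a log line.
const NEVER_LOGGED = new Set(["password", "totp", "webauthnSignature"]);

function buildLogRecord(
  ctx: LogContext,
  severity: Severity,
  msg: string,
  extra: Record<string, unknown> = {},
  isAdminAction: boolean = false,
): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const key of Object.keys(extra)) {
    if (NEVER_LOGGED.has(key)) continue;
    // email is kept only at INFO and only for admin actions
    if (key === "email" && !(severity === "INFO" && isAdminAction)) continue;
    safe[key] = extra[key];
  }
  // Required fields are spread last so extras cannot overwrite them.
  return { ...safe, ...ctx, severity, msg };
}
```

Dropping the sensitive keys in the builder, before the record reaches the transport, means a call site that passes them by mistake still cannot leak them.
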
8. Synthetic probes
| Probe | Cadence | Alert on failure |
|---|---|---|
| POST /api/v1/auth/login with canary account | 1 min | 2 consecutive fails → P1 |
| GET /.well-known/jwks.json | 30 s | 2 consecutive fails → P1 |
| GET /internal/identity/licensing/effective?...canary... | 1 min | 3 fails → P2 |
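
The "N consecutive fails" rule in the table can be sketched as a counter that resets on any success. The class below is a hypothetical illustration, not the probe runner's actual code. It assumes the alert stays active on every further failure once the threshold is reached, and it reads "3 fails" for the licensing probe as three consecutive failures.

```typescript
// Tracks consecutive probe failures and reports when an alert should be active.
class ConsecutiveFailureTracker {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    if (threshold < 1) throw new RangeError("threshold must be >= 1");
    this.threshold = threshold;
  }

  // Record one probe result; returns true while the failure streak
  // is at or above the threshold.
  record(success: boolean): boolean {
    this.failures = success ? 0 : this.failures + 1;
    return this.failures >= this.threshold;
  }
}
```

With a threshold of 2, a fail-fail sequence activates the alert on the second failure, while fail-success-fail never does, because the success resets the streak.
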