Observability

:::info Source
Sourced from services/identity-service/OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry · SECURITY_MODEL

Identity-service observability is critical because authentication failures are both security incidents and availability incidents. This document specifies logs, metrics, traces, dashboards, alerts, and SLOs for identity-service.

1. Logging

1.1 Log Schema (conforms to platform log_schema_version: 3)

Every log line is structured JSON with mandatory fields:

{
  "ts": "2026-04-15T10:00:00.123Z",
  "level": "info",
  "service": "identity-service",
  "instance": "identity-7f8d-abc",
  "version": "1.4.2",
  "commit": "abc123",
  "trace_id": "00-abc...-def...-01",
  "span_id": "abc...",
  "request_id": "req_01HN...",
  "tenant_id": "ten_01HN...",
  "actor_id_hash": "sha256:...",
  "actor_role": "learner",
  "route": "POST /api/v1/auth/login",
  "message": "login.success",
  "duration_ms": 87,
  "attrs": {
    "user_id_hash": "sha256:...",
    "device_id": "dev_01HN...",
    "mfa_used": true,
    "amr": ["pwd", "totp"]
  }
}
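A minimal sketch of how a log consumer or CI check might verify that a line carries the mandatory fields. The exact mandatory set is not spelled out here, so `MANDATORY_FIELDS` is an assumption taken from the example record, and the function names are hypothetical.

```python
import json

# Assumption: keys taken from the example record above; the platform schema
# (log_schema_version 3) is the authoritative list.
MANDATORY_FIELDS = (
    "ts", "level", "service", "instance", "version", "commit",
    "trace_id", "span_id", "request_id", "message",
)
ALLOWED_LEVELS = {"fatal", "error", "warn", "info", "debug", "trace"}


def missing_fields(line: str) -> list:
    """Return the mandatory keys that are absent from one JSON log line."""
    record = json.loads(line)
    return [name for name in MANDATORY_FIELDS if name not in record]


def is_valid_level(line: str) -> bool:
    """Check that the `level` field is one of the documented levels."""
    record = json.loads(line)
    return record.get("level") in ALLOWED_LEVELS
```

A check like this could run against sampled production logs to catch services that drift from the schema.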

1.2 Redaction Rules

| Field | Redaction |
| --- | --- |
| primary_email | SHA-256 hash with tenant salt → email_hash |
| ip | /24 IPv4 or /48 IPv6 → ip_masked |
| user_id | SHA-256 hash with tenant salt → user_id_hash |
| Password | Never logged, ever. Redacted at framework level via log processor. |
| refresh_token | Redacted. First 8 chars only if logged at all. |
| webauthn_credential | Redacted. |
| mfa_code | Redacted. |
| api_key raw value | Redacted. Prefix only. |
| User agent | Truncated to 256 chars; retained in full in audit log only |
| Stack traces | Scrubbed for potential secret strings (regex-based) |

Redaction is enforced in the shared @ghasi/telemetry logger, not by discipline.
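The hashing and masking rules can be sketched as follows. This is an illustration, not the `@ghasi/telemetry` implementation: how the tenant salt is combined with the value is not specified here, so the use of HMAC-SHA256 is an assumption, and all function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def salted_hash(value: str, tenant_salt: bytes) -> str:
    """SHA-256 digest keyed with the tenant salt, formatted as `sha256:<hex>`.

    Assumption: HMAC is used to combine salt and value.
    """
    digest = hmac.new(tenant_salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"sha256:{digest}"


def mask_ip(ip: str) -> str:
    """Keep only the /24 network for IPv4 and the /48 network for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network)


def truncate_user_agent(user_agent: str, limit: int = 256) -> str:
    """Truncate the user agent to the documented 256-character limit."""
    return user_agent[:limit]
```

For example, `mask_ip("203.0.113.77")` returns `203.0.113.0/24`, so the host part of the address never reaches the log line.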

1.3 Log Levels

| Level | Usage |
| --- | --- |
| fatal | Unrecoverable; service shutting down |
| error | Request failed due to server issue (5xx) |
| warn | Suspicious activity, failed login, token reuse, etc. |
| info | Successful operations, session lifecycle events |
| debug | Dev only; disabled in prod |
| trace | Never in prod |

1.4 Key Log Events

| Event | Level | When |
| --- | --- | --- |
| login.success | info | Successful auth |
| login.failure.invalid_credentials | warn | Wrong password |
| login.failure.account_locked | warn | Login attempted on locked account |
| login.failure.mfa_required | info | MFA challenge issued |
| login.failure.mfa_invalid | warn | Wrong MFA code |
| session.revoked | info | Session ended |
| session.rotation_reuse_detected | error | Security: possible token theft |
| password.reset.requested | info | Reset initiated |
| password.reset.completed | info | Reset finished |
| device.registered | info | New device |
| device.offline_cert.issued | info | Offline binding |
| device.revoked | warn | Device revoked |
| api_key.issued | info | API key created |
| api_key.revoked | info | API key revoked |
| api_key.unauthorized_use | error | Unknown or revoked key attempted |
| user.locked | warn | Lockout triggered |
| jwks.rotation | info | Signing key rotated |
| sso.callback.failure | error | SSO validation failed |
| gdpr.erasure.completed | info | User data anonymized |
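One way to keep event names and levels from drifting apart is a single lookup that call sites must go through. The sketch below mirrors the event table; the `EVENT_LEVELS` mapping and `level_for` helper are hypothetical names, not part of the shared logger.

```python
# Event name -> log level, copied from the Key Log Events table.
EVENT_LEVELS = {
    "login.success": "info",
    "login.failure.invalid_credentials": "warn",
    "login.failure.account_locked": "warn",
    "login.failure.mfa_required": "info",
    "login.failure.mfa_invalid": "warn",
    "session.revoked": "info",
    "session.rotation_reuse_detected": "error",
    "password.reset.requested": "info",
    "password.reset.completed": "info",
    "device.registered": "info",
    "device.offline_cert.issued": "info",
    "device.revoked": "warn",
    "api_key.issued": "info",
    "api_key.revoked": "info",
    "api_key.unauthorized_use": "error",
    "user.locked": "warn",
    "jwks.rotation": "info",
    "sso.callback.failure": "error",
    "gdpr.erasure.completed": "info",
}


def level_for(event: str) -> str:
    """Return the documented level, rejecting events that are not registered."""
    try:
        return EVENT_LEVELS[event]
    except KeyError:
        raise ValueError(f"unregistered log event: {event}") from None
```

Rejecting unknown event names makes a typo fail in tests instead of producing an event that no dashboard or alert matches.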

2. Metrics

Standard OTel metrics are exposed at /metrics for Prometheus to scrape:

2.1 RED Metrics (Rate / Errors / Duration)

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| identity_http_requests_total | counter | route, method, status_class | Request rate |
| identity_http_request_duration_seconds | histogram | route, method | Latency distribution |
| identity_http_errors_total | counter | route, method, code | Error rate by error code |
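The `status_class` label keeps cardinality low by collapsing status codes into classes. The sketch below uses an in-memory counter to show the idea and how a 5xx error ratio falls out of it. The `2xx`/`5xx` label format and the helper names are assumptions; a real service would use its metrics SDK instead of a `Counter`.

```python
from collections import Counter


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into a class such as `2xx` or `5xx`."""
    return f"{status_code // 100}xx"


# Stand-in for identity_http_requests_total, keyed by (route, method, status_class).
requests_total = Counter()


def observe_request(route: str, method: str, status_code: int) -> None:
    """Count one finished request under its label set."""
    requests_total[(route, method, status_class(status_code))] += 1


def error_ratio(counts: Counter) -> float:
    """Fraction of counted requests whose status class is 5xx."""
    total = sum(counts.values())
    errors = sum(n for (_, _, cls), n in counts.items() if cls == "5xx")
    return errors / total if total else 0.0
```

After observing two 200 responses and one 503 on the same route, `error_ratio(requests_total)` is one third.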

2.2 Authentication Metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| identity_login_attempts_total | counter | method, result, tenant_id | Login volume |
| identity_login_failures_total | counter | method, reason | Breakdown of failure reasons |
| identity_login_duration_seconds | histogram | method | Login latency |
| identity_mfa_challenges_total | counter | factor, result | MFA usage |
| identity_mfa_adoption_ratio | gauge | tenant_id | Fraction of users with MFA enrolled |
| identity_adaptive_mfa_triggered_total | counter | reason | Adaptive MFA invocation breakdown |
| identity_sso_callbacks_total | counter | provider, result | SSO usage |
| identity_session_count_active | gauge | tenant_id | Active sessions per tenant |
| identity_session_refresh_total | counter | result | Refresh token rotations |
| identity_rotation_reuse_detected_total | counter | tenant_id | Security critical: token reuse |
| identity_account_locks_total | counter | reason | Lockout events |

2.3 Device & Offline Metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| identity_device_registrations_total | counter | tenant_id | Device registration rate |
| identity_offline_certs_issued_total | counter | tenant_id | Offline binding rate |
| identity_offline_certs_expiring_soon | gauge | | Certs expiring < 7 days |
| identity_device_revocations_total | counter | reason | Device revocation rate |

2.4 API Key Metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| identity_api_key_uses_total | counter | tenant_id, key_id_hash, scope | API key usage |
| identity_api_key_validation_duration_seconds | histogram | | Validation latency |
| identity_api_key_unauthorized_total | counter | reason | Failed key validations |

2.5 Infrastructure Metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| identity_outbox_depth | gauge | | Unpublished events |
| identity_outbox_lag_seconds | gauge | | Oldest unpublished event age |
| identity_outbox_publish_rate | counter | topic | Events published per second |
| identity_outbox_publish_failures_total | counter | topic, reason | Publish failures |
| identity_inbox_consume_rate | counter | topic | Events consumed per second |
| identity_dlq_depth | gauge | stream | DLQ backlog |
| identity_db_pool_active | gauge | | Active DB connections |
| identity_db_pool_idle | gauge | | Idle DB connections |
| identity_redis_operations_duration_seconds | histogram | operation | Redis latency |
| identity_kms_operations_duration_seconds | histogram | operation | KMS latency |
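The two outbox gauges are related: depth counts unpublished events, and lag is the age of the oldest one. A minimal sketch, assuming the unpublished rows expose a creation timestamp in Unix seconds (the `outbox_gauges` helper and its input shape are hypothetical):

```python
def outbox_gauges(unpublished_created_at: list, now: float) -> tuple:
    """Return (depth, lag_seconds) for a list of creation timestamps.

    An empty outbox reports zero lag rather than a stale value.
    """
    depth = len(unpublished_created_at)
    if depth == 0:
        return 0, 0.0
    lag_seconds = max(0.0, now - min(unpublished_created_at))
    return depth, lag_seconds
```

Lag is the more useful alerting signal of the two, since a deep but fast-draining outbox is healthy while a shallow outbox with one old event is stalled.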

2.6 AI Risk Classifier Metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| identity_ai_risk_calls_total | counter | result | AI gateway call volume |
| identity_ai_risk_fallback_total | counter | reason | Fallback to rules-only |
| identity_risk_score_distribution | histogram | | Risk score distribution |

3. Distributed Tracing

3.1 Span Naming

| Span | Parent | Attributes |
| --- | --- | --- |
| identity.http.login | API gateway | user_id_hash, tenant_id, mfa_required, result |
| identity.domain.verify_password | login | duration_argon2_ms |
| identity.domain.evaluate_mfa | login | risk_score, reasons[] |
| identity.ai.risk_classify | evaluate_mfa | ai_gateway.trace_id, cache_hit |
| identity.db.load_user | login | db.duration_ms, db.rows |
| identity.kms.sign_jwt | login | kms.key_id, duration_ms |
| identity.outbox.publish | any domain write | topic, event_id |
| identity.sso.callback | | provider, tenant_id, result |

3.2 Context Propagation

  • W3C traceparent header on every HTTP hop.
  • NATS messages include traceparent in envelope header.
  • Baggage: tenant_id, request_id, actor_role across all spans.
  • Sensitive baggage is stripped at egress to external IdPs.
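The propagation rules above can be illustrated with a small parser for the W3C `traceparent` header and a baggage filter for egress. This sketch accepts only the four-field form of the header. Which baggage keys count as sensitive is not stated here, so `SENSITIVE_BAGGAGE_KEYS` is an assumption, as are the function names.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

# Assumption: keys that must not leave the platform.
SENSITIVE_BAGGAGE_KEYS = frozenset({"tenant_id", "actor_role"})


def parse_traceparent(header: str):
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def strip_baggage_for_egress(baggage: dict) -> dict:
    """Drop sensitive baggage entries before calling an external IdP."""
    return {k: v for k, v in baggage.items() if k not in SENSITIVE_BAGGAGE_KEYS}
```

A malformed or missing header returns `None`, in which case the caller would start a new trace instead of propagating a broken context.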

3.3 Sampling

  • Head-based 10% baseline.
  • Tail-based 100% on error spans.
  • Tail-based 100% on identity.rotation_reuse_detected spans.
  • 100% for identity.ai.risk_classify (AI compliance requirement).
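The sampling policy can be sketched as a single keep-or-drop decision made once a trace is complete. This merges the head-based and tail-based rules into one function for illustration; a real deployment would split them between the SDK and a collector. Hashing the trace id to get a deterministic 10% is an assumption, and the function names are hypothetical.

```python
import hashlib

HEAD_SAMPLE_RATE = 0.10
# Spans that force a trace to be kept, per the rules above.
ALWAYS_KEEP_SPANS = frozenset({
    "identity.rotation_reuse_detected",
    "identity.ai.risk_classify",
})


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head decision: map the trace id into [0, 1) and compare."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def keep_trace(trace_id: str, span_names: list, has_error: bool) -> bool:
    """Apply the error, always-keep, and baseline rules in that order."""
    if has_error:
        return True
    if ALWAYS_KEEP_SPANS.intersection(span_names):
        return True
    return head_sampled(trace_id)
```

Deriving the head decision from the trace id means every service that sees the same trace reaches the same verdict without coordination.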

4. Dashboards

4.1 Identity SRE Dashboard (Grafana)

  • Panels:
    • Request rate (by route)
    • Error rate (by code) — stacked area
    • Latency p50/p95/p99 (by route)
    • Login success/fail ratio over time
    • MFA adoption ratio (line)
    • Active sessions (line)
    • Outbox depth + lag (dual-axis)
    • DLQ depth
    • DB pool utilization
    • KMS call latency

4.2 Identity Security Dashboard

  • Panels:
    • Login failures by reason (stacked)
    • Credential stuffing heatmap (IP → failed attempts)
    • Rotation reuse events (should be rare; any spike is an incident)
    • MFA bypass attempts (count)
    • Account lockouts (rate + reason breakdown)
    • Adaptive MFA trigger reasons (pie)
    • Geo distribution of logins
    • Device registrations (rate)

4.3 Per-Tenant Dashboard

  • Shared template, tenant-id filter:
    • Login volume
    • MFA adoption
    • Active sessions
    • API key usage
    • Recent security events

5. SLOs

| SLO | Target | Window | Alert Threshold |
| --- | --- | --- | --- |
| Auth availability | 99.99% | 30d rolling | burn rate 2x for 1h, 14x for 5min |
| Login latency p95 | < 100ms | 30d rolling | p95 > 200ms for 10min |
| Refresh latency p95 | < 30ms | 30d rolling | p95 > 100ms for 10min |
| JWKS endpoint availability | 99.999% | 30d rolling | any 5min downtime |
| Outbox publish lag p95 | < 5s | 30d rolling | p95 > 30s for 5min |
| Auth error rate | < 0.1% (5xx) | 7d rolling | > 1% for 5min |
| SSO callback success | > 99.5% | 7d rolling | < 99% for 10min |

Error budget: 0.01% downtime over 30 days = ~4.3 minutes.
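The error budget figure and the burn-rate thresholds follow from simple arithmetic. The sketch below uses the common definition of burn rate as the observed error ratio divided by the error ratio the SLO allows; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return observed_error_ratio / (1.0 - slo)


def minutes_to_exhaustion(rate: float, window_days: int) -> float:
    """Time until the whole budget is gone if the burn rate stays constant."""
    return window_days * 24 * 60 / rate
```

`error_budget_minutes(0.9999, 30)` gives about 4.32 minutes, matching the figure above. At a sustained 14x burn rate, the 30-day budget would be gone in a little over two days, which is why that threshold pages after only 5 minutes.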

6. Alerts

6.1 Critical (PagerDuty immediate)

| Alert | Condition | Runbook |
| --- | --- | --- |
| IdentityDown | Health check failing for > 1 min | runbooks/identity/service-down.md |
| IdentityHighErrorRate | 5xx rate > 5% for 5 min | runbooks/identity/high-errors.md |
| IdentityKMSUnavailable | KMS operation failures > 1% | runbooks/identity/kms-outage.md |
| IdentityRotationReuseSpike | > 5 rotation_reuse events / 5 min | runbooks/identity/token-theft.md |
| IdentityJWTKeyRotationFailed | Rotation job failure | runbooks/identity/jwt-rotation.md |
| IdentityOutboxStalled | Outbox lag > 5 min | runbooks/identity/outbox-stalled.md |
| IdentityDLQNonEmpty | DLQ depth > 0 for 5 min | runbooks/identity/dlq-triage.md |

6.2 High (PagerDuty business hours)

| Alert | Condition | Runbook |
| --- | --- | --- |
| IdentityLoginLatencyHigh | p95 > 200ms for 10 min | runbooks/identity/latency.md |
| IdentityCredentialStuffing | Failed login rate > 50/sec for 2 min | runbooks/identity/credential-stuffing.md |
| IdentityAccountLockoutSpike | Lockout rate > 10x baseline | runbooks/identity/lockout-spike.md |
| IdentityMFAAdoptionDrop | MFA adoption ratio drops > 5% week/week | runbooks/identity/mfa-adoption.md |
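Most conditions in these tables have the form "value above a threshold for a duration". A minimal sketch of that evaluation, assuming evenly scraped `(unix_ts, value)` samples; the `fires` helper is hypothetical and only approximates how an alerting engine tracks pending alerts.

```python
def fires(samples: list, threshold: float, for_seconds: float, now: float) -> bool:
    """True if the value has stayed above `threshold` for at least `for_seconds`.

    Walks backwards from the newest sample and measures how long the current
    unbroken run of breaching samples has lasted.
    """
    streak_start = None
    for ts, value in sorted(samples, reverse=True):
        if ts > now:
            continue  # ignore samples from the future
        if value <= threshold:
            break  # the run of breaching samples ends here
        streak_start = ts
    return streak_start is not None and now - streak_start >= for_seconds
```

For IdentityCredentialStuffing this would be called with `threshold=50` and `for_seconds=120`. A single dip below the threshold resets the run, which keeps short bursts from paging.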

6.3 Warning (Slack #oncall-identity)

| Alert | Condition |
| --- | --- |
| IdentityAIRiskFallback | > 10% of risk classifications falling back to rules |
| IdentityOfflineCertsExpiring | > 100 certs expiring in < 7 days |
| IdentityAPIKeyRotationOverdue | > 50 API keys older than 180 days |

7. Correlation

Every log, metric exemplar, and span carries:

  • trace_id (W3C)
  • tenant_id (when applicable)
  • request_id (gateway-issued)
  • actor_id_hash (SHA-256 with tenant salt)

Dashboards link: metric exemplar → trace → logs filtered by trace_id.

8. Runbook Index

| Scenario | Runbook |
| --- | --- |
| Service down | runbooks/identity/service-down.md |
| High error rate | runbooks/identity/high-errors.md |
| KMS outage | runbooks/identity/kms-outage.md |
| Token theft suspected | runbooks/identity/token-theft.md |
| Credential stuffing attack | runbooks/identity/credential-stuffing.md |
| Outbox stalled | runbooks/identity/outbox-stalled.md |
| DLQ triage | runbooks/identity/dlq-triage.md |
| JWT key rotation | runbooks/identity/jwt-rotation.md |
| SSO provider outage | runbooks/identity/sso-outage.md |
| GDPR erasure stuck | runbooks/identity/gdpr-stuck.md |

9. Synthetic Monitoring

Canary probes every 60 seconds from 3 regions:

| Probe | Endpoint | Expected |
| --- | --- | --- |
| Health | GET /health/live | 200 |
| Readiness | GET /health/ready | 200 |
| JWKS | GET /.well-known/jwks.json | 200, valid JSON, ≥ 1 key |
| Login roundtrip | Full login with canary account | 200 with valid JWT |
| SSO discovery | OIDC well-known endpoint | 200 |

A canary failure counts against the SLO: probe results are aggregated into the availability measurement.
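The JWKS probe's expectation (200, valid JSON, at least one key) can be expressed as a pure check on the response. This is a sketch with a hypothetical function name; it relies only on the standard JWK Set shape, a JSON object with a `keys` array, and leaves the HTTP call to the probe runner.

```python
import json


def jwks_probe_ok(status_code: int, body: str) -> bool:
    """Return True if the response looks like a JWK Set with at least one key."""
    if status_code != 200:
        return False
    try:
        document = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(document, dict):
        return False
    keys = document.get("keys")
    return isinstance(keys, list) and len(keys) >= 1
```

Keeping the check separate from the network call makes it easy to unit test against captured responses, including an empty key set during a botched rotation.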

10. Cost Observability

  • identity_kms_cost_microusd_total counter — tracks KMS API calls converted to cost.
  • identity_ai_classifier_cost_microusd_total counter — AI gateway cost.
  • Monthly cost dashboard per tenant for chargeback.