Observability
:::info Source
Sourced from services/identity-service/OBSERVABILITY.md in the documentation repo.
:::
Companion: 15 Observability & Telemetry · SECURITY_MODEL
Identity-service observability is critical because authentication failures are both security incidents and availability incidents. This document specifies logs, metrics, traces, dashboards, alerts, and SLOs for identity-service.
1. Logging
1.1 Log Schema (conforms to platform log_schema_version: 3)
Every log line is structured JSON with mandatory fields:
{
  "ts": "2026-04-15T10:00:00.123Z",
  "level": "info",
  "service": "identity-service",
  "instance": "identity-7f8d-abc",
  "version": "1.4.2",
  "commit": "abc123",
  "trace_id": "00-abc...-def...-01",
  "span_id": "abc...",
  "request_id": "req_01HN...",
  "tenant_id": "ten_01HN...",
  "actor_id_hash": "sha256:...",
  "actor_role": "learner",
  "route": "POST /api/v1/auth/login",
  "message": "login.success",
  "duration_ms": 87,
  "attrs": {
    "user_id_hash": "sha256:...",
    "device_id": "dev_01HN...",
    "mfa_used": true,
    "amr": ["pwd", "totp"]
  }
}
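A minimal sketch of how a log line could be checked against the schema before it is emitted. It assumes that every top-level key in the example above is mandatory; the function name `missing_fields` is hypothetical and not part of the platform logger.

```python
import json

# Assumption: every top-level key in the example log line is mandatory.
MANDATORY_FIELDS = frozenset({
    "ts", "level", "service", "instance", "version", "commit",
    "trace_id", "span_id", "request_id", "tenant_id",
    "actor_id_hash", "actor_role", "route", "message",
    "duration_ms", "attrs",
})


def missing_fields(line: str) -> list:
    """Return the mandatory fields absent from one JSON log line, sorted."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("log line must be a JSON object")
    return sorted(MANDATORY_FIELDS - set(record))


# A deliberately incomplete line: most mandatory fields are reported missing.
example = json.dumps({
    "ts": "2026-04-15T10:00:00.123Z",
    "level": "info",
    "service": "identity-service",
    "message": "login.success",
})
print(missing_fields(example))
```

A check like this fits best in a unit test or CI lint over sample output, so that schema drift is caught before deployment rather than at query time.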
1.2 Redaction Rules
| Field | Redaction |
|---|---|
| `primary_email` | SHA-256 hash with tenant salt → `email_hash` |
| `ip` | /24 IPv4 or /48 IPv6 → `ip_masked` |
| `user_id` | SHA-256 hash with tenant salt → `user_id_hash` |
| Password | Never logged, ever. Redacted at framework level via log processor. |
| `refresh_token` | Redacted. First 8 chars only if logged at all. |
| `webauthn_credential` | Redacted. |
| `mfa_code` | Redacted. |
| `api_key` raw value | Redacted. Prefix only. |
| User agent | Truncated to 256 chars; retained in full in audit log only |
| Stack traces | Scrubbed for potential secret strings (regex-based) |
Redaction is enforced centrally in the shared @ghasi/telemetry logger rather than left to developer discipline.
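The hashing and masking rules in the table can be sketched as follows. This is illustrative only: the function names are hypothetical, and combining salt and value by simple concatenation is an assumption (an HMAC keyed with the tenant salt would be a common alternative). It does not reflect the actual @ghasi/telemetry implementation.

```python
import hashlib
import ipaddress


def salted_hash(value: str, tenant_salt: bytes) -> str:
    """SHA-256 over salt + value, rendered in the 'sha256:<hex>' log form."""
    digest = hashlib.sha256(tenant_salt + value.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def mask_ip(ip: str) -> str:
    """Keep only the /24 network for IPv4 or the /48 network for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets us pass a host address and get its enclosing network.
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


def token_prefix(token: str, keep: int = 8) -> str:
    """Keep the first `keep` characters; the '...' suffix is an assumed marker."""
    return token[:keep] + "..."


def truncate_user_agent(user_agent: str, limit: int = 256) -> str:
    """Truncate the user agent to the documented 256-character limit."""
    return user_agent[:limit]


print(mask_ip("203.0.113.77"))         # 203.0.113.0/24
print(mask_ip("2001:db8:abcd:12::1"))  # 2001:db8:abcd::/48
```

Because the salt is per tenant, the same email produces different hashes in different tenants, which prevents cross-tenant correlation from logs alone.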
1.3 Log Levels
| Level | Usage |
|---|---|
| `fatal` | Unrecoverable; service shutting down |
| `error` | Request failed due to server issue (5xx) |
| `warn` | Suspicious activity, failed login, token reuse, etc. |
| `info` | Successful operations, session lifecycle events |
| `debug` | Dev only; disabled in prod |
| `trace` | Never in prod |
1.4 Key Log Events
| Event | Level | When |
|---|---|---|
| `login.success` | info | Successful auth |
| `login.failure.invalid_credentials` | warn | Wrong password |
| `login.failure.account_locked` | warn | Login attempted on locked account |
| `login.failure.mfa_required` | info | MFA challenge issued |
| `login.failure.mfa_invalid` | warn | Wrong MFA code |
| `session.revoked` | info | Session ended |
| `session.rotation_reuse_detected` | error | Security: possible token theft |
| `password.reset.requested` | info | Reset initiated |
| `password.reset.completed` | info | Reset finished |
| `device.registered` | info | New device |
| `device.offline_cert.issued` | info | Offline binding |
| `device.revoked` | warn | Device revoked |
| `api_key.issued` | info | API key created |
| `api_key.revoked` | info | API key revoked |
| `api_key.unauthorized_use` | error | Unknown or revoked key attempted |
| `user.locked` | warn | Lockout triggered |
| `jwks.rotation` | info | Signing key rotated |
| `sso.callback.failure` | error | SSO validation failed |
| `gdpr.erasure.completed` | info | User data anonymized |
2. Metrics
Standard OTel metrics exposed via /metrics (Prometheus scrape):
2.1 RED Metrics (Rate / Errors / Duration)
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `identity_http_requests_total` | counter | route, method, status_class | Request rate |
| `identity_http_request_duration_seconds` | histogram | route, method | Latency distribution |
| `identity_http_errors_total` | counter | route, method, code | Error rate by error code |
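To illustrate how the `status_class` label keeps cardinality low and how an error ratio falls out of the request counter, here is a small in-memory sketch. The `"2xx"`-style label values and all function names are assumptions; a real service would use its metrics SDK rather than a `Counter`.

```python
from collections import Counter

# In-memory stand-in for identity_http_requests_total,
# keyed by (route, method, status_class).
requests_total = Counter()


def status_class(status: int) -> str:
    """Collapse an HTTP status code into a class label such as '2xx'."""
    if not 100 <= status <= 599:
        raise ValueError(f"not an HTTP status code: {status}")
    return f"{status // 100}xx"


def record_request(route: str, method: str, status: int) -> None:
    requests_total[(route, method, status_class(status))] += 1


def server_error_ratio() -> float:
    """Fraction of all recorded requests that ended in a 5xx class."""
    total = sum(requests_total.values())
    errors = sum(n for (_, _, cls), n in requests_total.items() if cls == "5xx")
    return errors / total if total else 0.0


for status in (200, 200, 201, 503):
    record_request("/api/v1/auth/login", "POST", status)
print(server_error_ratio())  # one of the four requests was a 5xx
```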
2.2 Authentication Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `identity_login_attempts_total` | counter | method, result, tenant_id | Login volume |
| `identity_login_failures_total` | counter | method, reason | Breakdown of failure reasons |
| `identity_login_duration_seconds` | histogram | method | Login latency |
| `identity_mfa_challenges_total` | counter | factor, result | MFA usage |
| `identity_mfa_adoption_ratio` | gauge | tenant_id | Fraction of users with MFA enrolled |
| `identity_adaptive_mfa_triggered_total` | counter | reason | Adaptive MFA invocation breakdown |
| `identity_sso_callbacks_total` | counter | provider, result | SSO usage |
| `identity_session_count_active` | gauge | tenant_id | Active sessions per tenant |
| `identity_session_refresh_total` | counter | result | Refresh token rotations |
| `identity_rotation_reuse_detected_total` | counter | tenant_id | Security critical: token reuse |
| `identity_account_locks_total` | counter | reason | Lockout events |
2.3 Device & Offline Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `identity_device_registrations_total` | counter | tenant_id | Device registration rate |
| `identity_offline_certs_issued_total` | counter | tenant_id | Offline binding rate |
| `identity_offline_certs_expiring_soon` | gauge | (none) | Certs expiring < 7 days |
| `identity_device_revocations_total` | counter | reason | Device revocation rate |
2.4 API Key Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `identity_api_key_uses_total` | counter | tenant_id, key_id_hash, scope | API key usage |
| `identity_api_key_validation_duration_seconds` | histogram | (none) | Validation latency |
| `identity_api_key_unauthorized_total` | counter | reason | Failed key validations |
2.5 Infrastructure Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `identity_outbox_depth` | gauge | (none) | Unpublished events |
| `identity_outbox_lag_seconds` | gauge | (none) | Oldest unpublished event age |
| `identity_outbox_publish_rate` | counter | topic | Events published per second |
| `identity_outbox_publish_failures_total` | counter | topic, reason | Publish failures |
| `identity_inbox_consume_rate` | counter | topic | Events consumed per second |
| `identity_dlq_depth` | gauge | stream | DLQ backlog |
| `identity_db_pool_active` | gauge | (none) | Active DB connections |
| `identity_db_pool_idle` | gauge | (none) | Idle DB connections |
| `identity_redis_operations_duration_seconds` | histogram | operation | Redis latency |
| `identity_kms_operations_duration_seconds` | histogram | operation | KMS latency |
2.6 AI Risk Classifier Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `identity_ai_risk_calls_total` | counter | result | AI gateway call volume |
| `identity_ai_risk_fallback_total` | counter | reason | Fallback to rules-only |
| `identity_risk_score_distribution` | histogram | (none) | Risk score distribution |
3. Distributed Tracing
3.1 Span Naming
| Span | Parent | Attributes |
|---|---|---|
| `identity.http.login` | API gateway | user_id_hash, tenant_id, mfa_required, result |
| `identity.domain.verify_password` | login | duration_argon2_ms |
| `identity.domain.evaluate_mfa` | login | risk_score, reasons[] |
| `identity.ai.risk_classify` | evaluate_mfa | ai_gateway.trace_id, cache_hit |
| `identity.db.load_user` | login | db.duration_ms, db.rows |
| `identity.kms.sign_jwt` | login | kms.key_id, duration_ms |
| `identity.outbox.publish` | any domain write | topic, event_id |
| `identity.sso.callback` | (none) | provider, tenant_id, result |
3.2 Context Propagation
- W3C `traceparent` header on every HTTP hop.
- NATS messages include `traceparent` in envelope header.
- Baggage: `tenant_id`, `request_id`, `actor_role` across all spans.
- At egress to external IdPs, sensitive baggage stripped.
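As an illustration of what propagating `traceparent` across a hop involves, the sketch below parses a version `00` header and derives the header for the next hop. The function names are hypothetical, only version `00` is handled, and restarting the trace with the sampled flag unset when the incoming header is invalid is an assumption about local policy.

```python
import re
import secrets
from typing import Optional

# Version "00" layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    """Return trace_id, parent_id and flags, or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are explicitly invalid in the W3C Trace Context format.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Header for the next hop: same trace id and flags, fresh span id."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        # Assumption: an invalid header restarts the trace, sampled flag unset.
        trace_id, flags = secrets.token_hex(16), "00"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

The same string can be placed in an HTTP header or in a NATS envelope header, since the format is transport-independent.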
3.3 Sampling
- Head-based 10% baseline.
- Tail-based 100% on error spans.
- Tail-based 100% on `identity.rotation_reuse_detected` spans.
- 100% for `identity.ai.risk_classify` (AI compliance requirement).
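The sampling rules above can be sketched as a single keep/drop decision per completed trace. Several things here are assumptions: deriving the head decision deterministically from the trace id, treating the `identity.ai.risk_classify` rule as a tail-stage check, and the `Span` type and function names, which are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

# Span names that force retention regardless of the head decision.
ALWAYS_KEEP_SPANS = frozenset({
    "identity.rotation_reuse_detected",
    "identity.ai.risk_classify",
})


@dataclass
class Span:
    name: str
    is_error: bool = False


def head_sampled(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic head decision: every service agrees for a given trace id."""
    bucket = int(trace_id[-8:], 16) / 2**32  # value in [0, 1)
    return bucket < rate


def keep_trace(trace_id: str, spans: Iterable[Span], rate: float = 0.10) -> bool:
    """Tail rules first (errors, always-keep spans), then the head baseline."""
    if any(s.is_error or s.name in ALWAYS_KEEP_SPANS for s in spans):
        return True
    return head_sampled(trace_id, rate)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(keep_trace(trace_id, [Span("identity.http.login")]))
```

Deriving the head decision from the trace id rather than from a random draw means upstream and downstream services keep or drop the same traces, so sampled traces stay complete.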
4. Dashboards
4.1 Identity SRE Dashboard (Grafana)
- Panels:
  - Request rate (by route)
  - Error rate (by code), stacked area
  - Latency p50/p95/p99 (by route)
  - Login success/fail ratio over time
  - MFA adoption ratio (line)
  - Active sessions (line)
  - Outbox depth + lag (dual-axis)
  - DLQ depth
  - DB pool utilization
  - KMS call latency
4.2 Identity Security Dashboard
- Panels:
  - Login failures by reason (stacked)
  - Credential stuffing heatmap (IP → failed attempts)
  - Rotation reuse events (should be rare; any spike is an incident)
  - MFA bypass attempts (count)
  - Account lockouts (rate + reason breakdown)
  - Adaptive MFA trigger reasons (pie)
  - Geo distribution of logins
  - Device registrations (rate)
4.3 Per-Tenant Dashboard
- Shared template, tenant-id filter:
  - Login volume
  - MFA adoption
  - Active sessions
  - API key usage
  - Recent security events
5. SLOs
| SLO | Target | Window | Alert Threshold |
|---|---|---|---|
| Auth availability | 99.99% | 30d rolling | burn rate 2x for 1h, 14x for 5min |
| Login latency p95 | < 100ms | 30d rolling | p95 > 200ms for 10min |
| Refresh latency p95 | < 30ms | 30d rolling | p95 > 100ms for 10min |
| JWKS endpoint availability | 99.999% | 30d rolling | any 5min downtime |
| Outbox publish lag p95 | < 5s | 30d rolling | p95 > 30s for 5min |
| Auth error rate | < 0.1% (5xx) | 7d rolling | > 1% for 5min |
| SSO callback success | > 99.5% | 7d rolling | < 99% for 10min |
Error budget: 0.01% downtime over 30 days = ~4.3 minutes.
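The error-budget figure and the burn-rate thresholds can be reproduced with a few lines of arithmetic. This is a worked example, not tooling used by the service, and the function names are hypothetical. A burn rate of 1 means the budget is consumed exactly over the full window.

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability target such as 0.9999."""
    return (1.0 - target) * window_days * 24 * 60


def hours_until_exhausted(burn_rate: float, window_days: int) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    return window_days * 24 / burn_rate


print(round(error_budget_minutes(0.9999, 30), 2))        # auth availability: 4.32 minutes
print(round(error_budget_minutes(0.99999, 30) * 60, 1))  # JWKS endpoint: about 25.9 seconds
print(round(hours_until_exhausted(14, 30), 1))           # 14x burn: about 51.4 hours
print(round(hours_until_exhausted(2, 30), 1))            # 2x burn: 360.0 hours (15 days)
```

Note that the JWKS target leaves a budget of well under a minute per 30 days, so a single 5-minute outage already exhausts it many times over.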
6. Alerts
6.1 Critical (PagerDuty immediate)
| Alert | Condition | Runbook |
|---|---|---|
| `IdentityDown` | Health check failing for > 1 min | runbooks/identity/service-down.md |
| `IdentityHighErrorRate` | 5xx rate > 5% for 5 min | runbooks/identity/high-errors.md |
| `IdentityKMSUnavailable` | KMS operation failures > 1% | runbooks/identity/kms-outage.md |
| `IdentityRotationReuseSpike` | > 5 rotation_reuse events / 5 min | runbooks/identity/token-theft.md |
| `IdentityJWTKeyRotationFailed` | Rotation job failure | runbooks/identity/jwt-rotation.md |
| `IdentityOutboxStalled` | Outbox lag > 5 min | runbooks/identity/outbox-stalled.md |
| `IdentityDLQNonEmpty` | DLQ depth > 0 for 5 min | runbooks/identity/dlq-triage.md |
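To make the semantics of a windowed condition such as `IdentityRotationReuseSpike` concrete, here is a sliding-window sketch. In practice this would normally be an alerting rule over `identity_rotation_reuse_detected_total`; the class and function names below are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamp falls inside the trailing window."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._events = deque()

    def record(self, ts: float) -> None:
        self._events.append(ts)

    def count(self, now: float) -> int:
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= now - self.window:
            self._events.popleft()
        return len(self._events)


def reuse_spike(counter: SlidingWindowCounter, now: float, threshold: int = 5) -> bool:
    """True when more than `threshold` events occurred in the window."""
    return counter.count(now) > threshold


window = SlidingWindowCounter(window_seconds=5 * 60)
for ts in (0, 10, 20, 30, 40, 50):
    window.record(ts)
print(reuse_spike(window, now=60))  # six events in five minutes exceeds the threshold
```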
6.2 High (PagerDuty business hours)
| Alert | Condition | Runbook |
|---|---|---|
| `IdentityLoginLatencyHigh` | p95 > 200ms for 10 min | runbooks/identity/latency.md |
| `IdentityCredentialStuffing` | Failed login rate > 50/sec for 2 min | runbooks/identity/credential-stuffing.md |
| `IdentityAccountLockoutSpike` | Lockout rate > 10x baseline | runbooks/identity/lockout-spike.md |
| `IdentityMFAAdoptionDrop` | MFA adoption ratio drops > 5% week/week | runbooks/identity/mfa-adoption.md |
6.3 Warning (Slack #oncall-identity)
| Alert | Condition |
|---|---|
| `IdentityAIRiskFallback` | > 10% of risk classifications falling back to rules |
| `IdentityOfflineCertsExpiring` | > 100 certs expiring in < 7 days |
| `IdentityAPIKeyRotationOverdue` | > 50 API keys older than 180 days |
7. Correlation
Every log, metric exemplar, and span carries:
- `trace_id` (W3C)
- `tenant_id` (when applicable)
- `request_id` (gateway-issued)
- `actor_id_hash` (SHA-256 with tenant salt)
Dashboards link: metric exemplar → trace → logs filtered by trace_id.
8. Runbook Index
| Scenario | Runbook |
|---|---|
| Service down | runbooks/identity/service-down.md |
| High error rate | runbooks/identity/high-errors.md |
| KMS outage | runbooks/identity/kms-outage.md |
| Token theft suspected | runbooks/identity/token-theft.md |
| Credential stuffing attack | runbooks/identity/credential-stuffing.md |
| Outbox stalled | runbooks/identity/outbox-stalled.md |
| DLQ triage | runbooks/identity/dlq-triage.md |
| JWT key rotation | runbooks/identity/jwt-rotation.md |
| SSO provider outage | runbooks/identity/sso-outage.md |
| GDPR erasure stuck | runbooks/identity/gdpr-stuck.md |
9. Synthetic Monitoring
Canary probes every 60 seconds from 3 regions:
| Probe | Endpoint | Expected |
|---|---|---|
| Health | GET /health/live | 200 |
| Readiness | GET /health/ready | 200 |
| JWKS | GET /.well-known/jwks.json | 200, valid JSON, ≥ 1 key |
| Login roundtrip | Full login with canary account | 200 with valid JWT |
| SSO discovery | OIDC well-known endpoint | 200 |
A canary failure counts against the SLO: probe results are aggregated into the availability measurement.
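The JWKS probe's expectation (200, valid JSON, at least one key) can be expressed as a small check on the response. This sketch covers only the validation step and leaves the HTTP call to the probe runner; the function name is hypothetical. It relies on the standard JWK Set layout, a JSON object with a `keys` array.

```python
import json


def jwks_probe_ok(status: int, body: str) -> bool:
    """True when the response is 200, parses as JSON, and lists >= 1 key."""
    if status != 200:
        return False
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return False
    keys = doc.get("keys") if isinstance(doc, dict) else None
    return isinstance(keys, list) and len(keys) >= 1


print(jwks_probe_ok(200, '{"keys": [{"kty": "RSA", "kid": "example"}]}'))
```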
10. Cost Observability
- `identity_kms_cost_microusd_total` counter: tracks KMS API calls converted to cost.
- `identity_ai_classifier_cost_microusd_total` counter: AI gateway cost.
- Monthly cost dashboard per tenant for chargeback.
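A rough sketch of how call counts could be converted into a micro-USD counter and rolled up per tenant for chargeback. The per-call price is a made-up placeholder, keying the counter by `tenant_id` is an assumption, and the function names are hypothetical. Keeping the counter in integer micro-USD avoids floating-point drift as it accumulates.

```python
from collections import Counter

MICRO_USD_PER_USD = 1_000_000

# In-memory stand-in for identity_kms_cost_microusd_total, keyed by tenant.
kms_cost_microusd_total = Counter()


def record_kms_calls(tenant_id: str, calls: int, microusd_per_call: int) -> None:
    """Add the cost of `calls` KMS operations to the tenant's running total."""
    kms_cost_microusd_total[tenant_id] += calls * microusd_per_call


def chargeback_usd(tenant_id: str) -> float:
    """Convert the accumulated integer total to USD for reporting."""
    return kms_cost_microusd_total[tenant_id] / MICRO_USD_PER_USD


# Placeholder price of 3 micro-USD per call, for illustration only.
record_kms_calls("ten_example", calls=10_000, microusd_per_call=3)
print(chargeback_usd("ten_example"))
```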