iam-service — Observability
Catalog summary: docs/03-microservices/iam-service.md · SECURITY_MODEL · FAILURE_MODES · DEPLOYMENT_TOPOLOGY
iam-service is a tier-0 service: any degradation is a platform incident. This document defines the SLIs, SLOs, dashboards, alerts, traces, and runbook entry points.
1. Telemetry Stack
| Concern | Tool |
|---|---|
| Metrics | Cloud Monitoring (OpenTelemetry → OTLP → Cloud Run sidecar → GMP) |
| Logs | Cloud Logging (structured JSON, severity, traceId) |
| Traces | Cloud Trace (W3C traceparent, sampling 10% normal / 100% error) |
| Profiling | Cloud Profiler (CPU + heap, 1% sample) |
| Errors | error-reporting-service aggregates exceptions |
| Synthetic checks | Cloud Monitoring uptime checks (/healthz, /.well-known/jwks.json) |
| Audit / SIEM | Pub/Sub → BigQuery → Chronicle |
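Logs and traces are correlated through the W3C traceparent header named in the table. As a rough illustration of that header's layout (version, trace-id, parent-id, flags), the sketch below parses one in Python. The names `parse_traceparent` and `TraceContext` are hypothetical and not part of iam-service, and the parser only accepts the four-field version 00 layout.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str   # would populate the traceId log field
    span_id: str    # would populate the spanId log field
    sampled: bool   # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

A malformed header returns `None`, in which case a service would normally start a new trace rather than reject the request.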
2. Service-Level Indicators / Objectives
2.1 SLOs
| # | SLI | Window | Target | Error budget |
|---|---|---|---|---|
| S1 | Login latency p99 | 30 d rolling | < 800 ms | 1% |
| S2 | Refresh latency p99 | 30 d rolling | < 250 ms | 1% |
| S3 | JWKS availability (Cloud CDN edge) | 30 d rolling | 99.99% | 4 m / 30 d |
| S4 | JWT verify success (downstream sample) | 30 d | 99.99% | — |
| S5 | MFA challenge success (TOTP / WebAuthn) | 30 d | > 99.5% | 0.5% |
| S6 | SSO callback success (per provider) | 30 d | > 99.0% | 1% |
| S7 | Outbox publish lag p99 | 30 d | < 5 s | — |
| S8 | Service availability (HTTP 5xx ratio) | 30 d | 99.95% | 22 m / 30 d |
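The time-based error budgets for S3 and S8 follow from the availability target and the 30-day window. A minimal sketch of that arithmetic, assuming the budget is expressed as minutes of full unavailability (the function name is illustrative only):

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that an availability target allows."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1.0 - target) * window_minutes


for slo, target in (("S3", 0.9999), ("S8", 0.9995)):
    print(f"{slo}: {error_budget_minutes(target):.2f} min / 30 d")
# S3: 4.32 min / 30 d
# S8: 21.60 min / 30 d
```

The table rounds these to 4 m and 22 m respectively.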
2.2 SLI Definitions
```promql
# S1 login latency p99
histogram_quantile(0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{
      service="iam-service",
      route="POST /api/v1/auth/login",
      status!~"5.."
    }[5m])
  )
)

# S3 JWKS availability
1 - (
  sum(rate(jwks_request_total{result="error"}[5m]))
  /
  sum(rate(jwks_request_total[5m]))
)

# S5 MFA challenge success
sum(rate(mfa_challenge_total{result="success"}[5m]))
  / sum(rate(mfa_challenge_total[5m]))

# S7 outbox lag
max(outbox_publish_lag_seconds{service="iam-service"})
```
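The S1 query relies on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below reproduces that idea in simplified form so the p99 figure is easier to reason about. The bucket bounds and counts are made-up sample data, and edge cases that Prometheus handles (such as out-of-range quantiles or non-monotonic buckets) are deliberately left out.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical login latency buckets (seconds, cumulative request count).
sample = [(0.1, 50.0), (0.25, 80.0), (0.5, 95.0), (1.0, 100.0), (math.inf, 100.0)]
print(round(histogram_quantile(0.99, sample), 3))
```

With this sample the estimate lands at about 0.9 s, which would breach the S1 target of 800 ms. Because the value is interpolated, its accuracy depends on how closely the bucket boundaries bracket the SLO threshold.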
3. Custom Metrics
| Metric | Type | Labels |
|---|---|---|
| iam_login_attempts_total | counter | tenant_id, result, factor |
| iam_login_duration_seconds | histogram | tenant_id, factor |
| iam_refresh_total | counter | tenant_id, result (ok, reuse_detected, expired, revoked) |
| iam_token_signing_duration_seconds | histogram | kid |
| iam_mfa_challenge_total | counter | tenant_id, kind, result |
| iam_sso_callback_total | counter | tenant_id, provider, result |
| iam_magic_link_request_total | counter | tenant_id |
| iam_device_registered_total | counter | tenant_id, os_family |
| iam_offline_binding_issued_total | counter | tenant_id |
| iam_offline_binding_active | gauge | tenant_id |
| iam_lockout_total | counter | tenant_id, reason |
| iam_breach_check_total | counter | result (hit, miss, error) |
| iam_outbox_pending | gauge | none |
| iam_outbox_publish_lag_seconds | gauge | none |
| iam_kms_call_duration_seconds | histogram | op (sign, verify) |
| iam_kms_errors_total | counter | op, code |
| iam_jwks_cache_hits_total | counter | none |
| iam_audit_chain_writes_total | counter | action |
4. Dashboards (Grafana / Cloud Monitoring)
- iam-service / Overview — RPS, p50/p95/p99 latency, 5xx rate, instance count, CPU/memory.
- iam-service / Auth Funnel — login_attempts → mfa_required → mfa_success → session_created (per tenant).
- iam-service / Sessions & Tokens — active sessions, refresh rotations, reuse-detect events, JWT signing latency, KMS error rate.
- iam-service / SSO — per-provider callback success/error, end-to-end latency.
- iam-service / Devices — registrations/day, offline bindings active, certs expiring next 24 h.
- iam-service / Security — failed login spikes (z-score), lockouts, breach-list hits, suspicious refresh patterns.
- iam-service / Outbox & Events — outbox lag, dead-letter count, per-subject publish rate.
- iam-service / SLO Burndown — error-budget burn for S1, S3, S5, S8.
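This document does not say how the SLO Burndown dashboard computes burn. A common definition, used here as an assumption, is burn rate = observed error ratio / (1 - target), where a burn rate of 1 consumes the budget exactly over the SLO window. A minimal sketch:

```python
import math


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    budget = 1.0 - target
    return error_ratio / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current burn rate persists."""
    return math.inf if rate <= 0 else window_days / rate


# S8 (99.95% availability) while the 5xx ratio sits at 1%.
rate = burn_rate(error_ratio=0.01, target=0.9995)
print(f"burn rate ~{rate:.1f}, budget exhausted in ~{days_to_exhaustion(rate):.1f} d")
```

Under these assumptions, a sustained 1% 5xx ratio burns the S8 budget roughly 20 times faster than allowed and would exhaust it in about a day and a half.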
5. Alerts
5.1 Critical (page on-call)
| ID | Trigger | Action |
|---|---|---|
| A-CRIT-1 | iam_login_duration_seconds:p99 > 1.5 s for 5 m | Page; check Postgres + KMS latency |
| A-CRIT-2 | JWKS endpoint 5xx > 1% for 5 m | Page; verify Cloud CDN + signing key |
| A-CRIT-3 | KMS sign error rate > 0.5% for 5 m | Page; engage KMS runbook RUN-IAM-002 |
| A-CRIT-4 | Refresh-reuse rate > 0.1% for 10 m (anomaly) | Page; potential token theft campaign |
| A-CRIT-5 | Outbox lag p99 > 60 s for 5 m | Page; events backlogged |
| A-CRIT-6 | HTTP 5xx ratio > 1% for 5 m | Page |
| A-CRIT-7 | Audit chain write failure (any) | Page; possible tampering |
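Most critical triggers take the form "value above threshold for N minutes". The sketch below shows one way to read that condition, assuming samples arrive at a fixed evaluation step and that every sample in the trailing window must breach. The function name and the 60-second step are illustrative, not taken from the alerting configuration.

```python
import math
from typing import Sequence


def sustained(values: Sequence[float], threshold: float,
              duration_s: int, step_s: int = 60) -> bool:
    """True if every sample in the trailing duration exceeds the threshold."""
    needed = math.ceil(duration_s / step_s)
    if len(values) < needed:
        return False  # not enough history yet to cover the window
    return all(v > threshold for v in values[-needed:])


# A-CRIT-1: login p99 above 1.5 s for 5 m, one sample per minute.
print(sustained([1.2, 1.6, 1.7, 1.9, 1.8, 1.6], threshold=1.5, duration_s=300))  # True
print(sustained([1.6, 1.7, 1.2, 1.8, 1.6], threshold=1.5, duration_s=300))       # False
```

A single sample that dips back under the threshold resets the condition, which keeps short spikes from paging on-call.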
5.2 Major (notify channel)
| ID | Trigger |
|---|---|
| A-MAJ-1 | Failed-login spike per tenant (z-score > 3) sustained 10 m → suggests credential stuffing |
| A-MAJ-2 | MFA challenge success < 99% for 30 m |
| A-MAJ-3 | SSO provider success < 95% for 15 m |
| A-MAJ-4 | Offline bindings expiring next 24 h spike (rotation gap) |
| A-MAJ-5 | Breach-list provider error rate > 10% for 30 m |
| A-MAJ-6 | TOTP drift complaints (mfa_failed{kind=totp} outliers) |
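A-MAJ-1 is defined in terms of a z-score on failed logins per tenant. As a sketch of that calculation, assuming the baseline is a list of recent per-interval failure counts for one tenant and that the population standard deviation is used (neither is specified here):

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def z_score(history: Sequence[float], current: float) -> float:
    """Standard deviations between the current value and the baseline mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat baseline: any deviation is treated as an extreme outlier.
        return 0.0 if current == mu else math.copysign(math.inf, current - mu)
    return (current - mu) / sigma


# Hypothetical failed-login counts per interval for one tenant.
baseline = [10, 12, 11, 9, 13, 10, 11, 12]
print(z_score(baseline, 40) > 3)  # True: well outside the normal range
print(z_score(baseline, 12) > 3)  # False: ordinary variation
```

The alert additionally requires the condition to hold for 10 m, which this sketch does not model.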
5.3 Info (audit log only)
| ID | Trigger |
|---|---|
| A-INF-1 | New device registered in tenant (first 30 d only) |
| A-INF-2 | API key issued |
| A-INF-3 | Tenant CA usage spike |
Alert routing: critical → PagerDuty + on-call SMS + Slack #iam-prod. Major → Slack only. Info → audit log only.
6. Logging Conventions
Every log line is a JSON object with the following fields:
| Field | Always | Notes |
|---|---|---|
| timestamp, severity, message | yes | |
| traceId, spanId | yes | from W3C |
| service = iam-service | yes | |
| version | yes | semver + commit |
| tenantId | when applicable | NEVER PII |
| userId | when applicable | pseudonymous ID only |
| route | yes (HTTP) | template, not interpolated |
| httpStatus, durationMs | yes (HTTP) | |
| errorCode | on error | MELMASTOON.IAM.* |
| kid | on signing | |
| factor, riskScore, aiModelVersion | on adaptive MFA | |
Forbidden in logs: password, password_hash, refresh_token, access_token, totp_secret, magic_link_token, email, webauthn.credential, device_private_key. Pre-commit hook blocks any literal string password|secret|token in log statements.
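To make the field rules concrete, the sketch below builds one log line and refuses any field on the forbidden list. It is an illustration only: `build_log_line` is a hypothetical helper, the sample version string and IDs are invented, and a runtime check like this would complement rather than replace the pre-commit hook.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Field names that must never appear in a log line (from the list above).
FORBIDDEN = frozenset({
    "password", "password_hash", "refresh_token", "access_token",
    "totp_secret", "magic_link_token", "email", "webauthn.credential",
    "device_private_key",
})


def build_log_line(severity: str, message: str, trace_id: str, span_id: str,
                   version: str, fields: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one structured log record, rejecting forbidden fields."""
    fields = fields or {}
    leaked = FORBIDDEN.intersection(fields)
    if leaked:
        raise ValueError(f"forbidden log fields: {sorted(leaked)}")
    record = {
        **fields,  # optional fields first so required ones cannot be overridden
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "message": message,
        "traceId": trace_id,
        "spanId": span_id,
        "service": "iam-service",
        "version": version,
    }
    return json.dumps(record)


print(build_log_line(
    "INFO", "login succeeded",
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "1.4.2+abc1234",
    {"route": "POST /api/v1/auth/login", "httpStatus": 200, "durationMs": 212},
))
```

Passing `{"email": "..."}` in `fields` raises `ValueError` instead of emitting the line.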
7. Tracing
Spans of interest:
- auth.login
  - domain.user.load
  - crypto.argon2id.verify
  - risk.classify (calls ai-orchestrator-service)
  - mfa.challenge (if forced)
  - crypto.jwt.sign (KMS)
  - db.session.insert
  - outbox.append
- auth.refresh
  - db.session.lookup
  - policy.reuse_detect
  - crypto.jwt.sign
  - db.session.update
- auth.sso.callback
  - oidc.code_exchange (external)
  - oidc.id_token.verify
  - domain.user.upsert
Sampling: 10% baseline, 100% on 5xx, 100% on auth.refresh.reuse_detected.
8. Synthetic & Probes
| Probe | Frequency | Asserts |
|---|---|---|
| GET /healthz | 30 s | 200, body {status:'ok'} |
| GET /readyz | 30 s | 200; checks Postgres, KMS warmup |
| GET /.well-known/jwks.json (4 regions) | 30 s | 200, ≥ 1 active key, kid overlap policy honored |
| Synthetic login (test tenant) | 5 m | < 1.5 s, returns access+refresh |
| Synthetic refresh | 5 m | < 500 ms |
| Synthetic SSO | 15 m (per provider) | success |
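As an illustration of what the JWKS probe might assert, the sketch below checks a response for status 200 and at least one key carrying a `kid`. How "active" is determined and what the kid overlap policy requires are not defined in this document, so the sketch treats any key with a non-empty `kid` as active and only checks that `kid` values are unique. Both choices are assumptions.

```python
import json


def jwks_probe_ok(status: int, body: str) -> bool:
    """Return True if the response looks like a usable JWKS document."""
    if status != 200:
        return False
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return False
    keys = doc.get("keys") if isinstance(doc, dict) else None
    if not isinstance(keys, list):
        return False
    kids = [k["kid"] for k in keys if isinstance(k, dict) and k.get("kid")]
    # Assumption: at least one identifiable key, and no duplicate kid values.
    return len(kids) >= 1 and len(kids) == len(set(kids))


good = json.dumps({"keys": [{"kty": "RSA", "kid": "2024-06-a"},
                            {"kty": "RSA", "kid": "2024-07-b"}]})
print(jwks_probe_ok(200, good))            # True
print(jwks_probe_ok(200, '{"keys": []}'))  # False: no keys published
print(jwks_probe_ok(503, good))            # False: endpoint unhealthy
```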
9. Runbooks
| ID | Title | Trigger |
|---|---|---|
| RUN-IAM-001 | Emergency JWT signing key rotation | Suspected key compromise |
| RUN-IAM-002 | KMS regional outage | A-CRIT-3 |
| RUN-IAM-003 | Refresh-token reuse storm | A-CRIT-4 |
| RUN-IAM-004 | Outbox backlog | A-CRIT-5 |
| RUN-IAM-005 | Account lockout flood (legitimate) | A-MAJ-1 |
| RUN-IAM-006 | SSO provider down | A-MAJ-3 |
| RUN-IAM-007 | Tenant CA rotation | scheduled |
| RUN-IAM-008 | Breach-list provider down | A-MAJ-5 |
Runbooks live in D:/GhasiTech/ghasi-e-documentation/ghasi-melmastoon/services/iam-service/runbooks/RUN-IAM-*.md (linked from each alert).
10. Cross-References