iam-service — Observability

Catalog summary: docs/03-microservices/iam-service.md · SECURITY_MODEL · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

iam-service is a tier-0 service: any degradation is a platform incident. This document defines the SLIs, SLOs, dashboards, alerts, traces, and runbook entry points.

1. Telemetry Stack

| Concern | Tool |
| --- | --- |
| Metrics | Cloud Monitoring (OpenTelemetry → OTLP → Cloud Run sidecar → GMP) |
| Logs | Cloud Logging (structured JSON, severity, traceId) |
| Traces | Cloud Trace (W3C traceparent, sampling 10% normal / 100% error) |
| Profiling | Cloud Profiler (CPU + heap, 1% sample) |
| Errors | error-reporting-service aggregates exceptions |
| Synthetic checks | Cloud Monitoring uptime checks (/healthz, /.well-known/jwks.json) |
| Audit / SIEM | Pub/Sub → BigQuery → Chronicle |

2. Service-Level Indicators / Objectives

2.1 SLOs

| # | SLI | Window | Target | Error budget |
| --- | --- | --- | --- | --- |
| S1 | Login latency p99 | 30 d rolling | < 800 ms | 1% |
| S2 | Refresh latency p99 | 30 d rolling | < 250 ms | 1% |
| S3 | JWKS availability (Cloud CDN edge) | 30 d rolling | 99.99% | 4 m / 30 d |
| S4 | JWT verify success (downstream sample) | 30 d | 99.99% | |
| S5 | MFA challenge success (TOTP / WebAuthn) | 30 d | > 99.5% | 0.5% |
| S6 | SSO callback success (per provider) | 30 d | > 99.0% | 1% |
| S7 | Outbox publish lag p99 | 30 d | < 5 s | |
| S8 | Service availability (HTTP 5xx ratio) | 30 d | 99.95% | 22 m / 30 d |
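The time-based error budgets in the table follow directly from the target and the window. A minimal sketch of that arithmetic, assuming a plain 30-day window of 43,200 minutes (the function name is illustrative and not part of iam-service):

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for an availability target over the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


s3_budget = error_budget_minutes(0.9999)  # S3: about 4.32 minutes
s8_budget = error_budget_minutes(0.9995)  # S8: about 21.6 minutes
```

The table rounds these to 4 m and 22 m respectively.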

2.2 SLI Definitions

# S1 login latency p99
histogram_quantile(0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{
      service="iam-service",
      route="POST /api/v1/auth/login",
      status!~"5.."
    }[5m])
  )
)

# S3 JWKS availability
1 - (
  sum(rate(jwks_request_total{result="error"}[5m]))
  /
  sum(rate(jwks_request_total[5m]))
)

# S5 MFA challenge success
sum(rate(mfa_challenge_total{result="success"}[5m]))
  / sum(rate(mfa_challenge_total[5m]))

# S7 outbox lag
max(outbox_publish_lag_seconds{service="iam-service"})
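For readers unfamiliar with how histogram_quantile turns cumulative buckets into a p99, the following is a simplified sketch of the linear interpolation it performs. It is an illustration only: it assumes non-negative bucket bounds, sorted input, and 0 < q <= 1, and the sample bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead of interpolating to infinity.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical login latency buckets (seconds, cumulative request count).
sample = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)  # falls in the (0.5, 1.0] bucket
```

Because the estimate is interpolated, its precision depends on the bucket layout: a boundary close to the S1 threshold of 800 ms makes the p99 reading more trustworthy.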

3. Custom Metrics

| Metric | Type | Labels |
| --- | --- | --- |
| iam_login_attempts_total | counter | tenant_id, result, factor |
| iam_login_duration_seconds | histogram | tenant_id, factor |
| iam_refresh_total | counter | tenant_id, result (ok, reuse_detected, expired, revoked) |
| iam_token_signing_duration_seconds | histogram | kid |
| iam_mfa_challenge_total | counter | tenant_id, kind, result |
| iam_sso_callback_total | counter | tenant_id, provider, result |
| iam_magic_link_request_total | counter | tenant_id |
| iam_device_registered_total | counter | tenant_id, os_family |
| iam_offline_binding_issued_total | counter | tenant_id |
| iam_offline_binding_active | gauge | tenant_id |
| iam_lockout_total | counter | tenant_id, reason |
| iam_breach_check_total | counter | result (hit, miss, error) |
| iam_outbox_pending | gauge | |
| iam_outbox_publish_lag_seconds | gauge | |
| iam_kms_call_duration_seconds | histogram | op (sign, verify) |
| iam_kms_errors_total | counter | op, code |
| iam_jwks_cache_hits_total | counter | |
| iam_audit_chain_writes_total | counter | action |

4. Dashboards (Grafana / Cloud Monitoring)

  1. iam-service / Overview — RPS, p50/p95/p99 latency, 5xx rate, instance count, CPU/memory.
  2. iam-service / Auth Funnel — login_attempts → mfa_required → mfa_success → session_created (per tenant).
  3. iam-service / Sessions & Tokens — active sessions, refresh rotations, reuse-detect events, JWT signing latency, KMS error rate.
  4. iam-service / SSO — per-provider callback success/error, end-to-end latency.
  5. iam-service / Devices — registrations/day, offline bindings active, certs expiring next 24 h.
  6. iam-service / Security — failed login spikes (z-score), lockouts, breach-list hits, suspicious refresh patterns.
  7. iam-service / Outbox & Events — outbox lag, dead-letter count, per-subject publish rate.
  8. iam-service / SLO Burndown — error-budget burn for S1, S3, S5, S8.
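The SLO Burndown view is usually read in terms of burn rate: the observed error ratio divided by the ratio the SLO allows. A minimal sketch, assuming this conventional definition (the document does not state which formula the dashboard uses, and the function names are illustrative):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget = 1.0 - slo_target
    if not 0.0 < budget < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# A sustained 1% 5xx ratio measured against S8 (99.95%).
s8_rate = burn_rate(0.01, 0.9995)
s8_hours = hours_to_exhaustion(s8_rate)
```

With these numbers the burn rate is about 20, so a full 30 d budget would last roughly 36 hours. A burn rate of 1 means the budget runs out exactly at the end of the window.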

5. Alerts

5.1 Critical (page on-call)

| ID | Trigger | Action |
| --- | --- | --- |
| A-CRIT-1 | iam_login_duration_seconds:p99 > 1.5 s for 5 m | Page; check Postgres + KMS latency |
| A-CRIT-2 | JWKS endpoint 5xx > 1% for 5 m | Page; verify Cloud CDN + signing key |
| A-CRIT-3 | KMS sign error rate > 0.5% for 5 m | Page; engage KMS runbook RUN-IAM-002 |
| A-CRIT-4 | Refresh-reuse rate > 0.1% for 10 m (anomaly) | Page; potential token theft campaign |
| A-CRIT-5 | Outbox lag p99 > 60 s for 5 m | Page; events backlogged |
| A-CRIT-6 | HTTP 5xx ratio > 1% for 5 m | Page |
| A-CRIT-7 | Audit chain write failure (any) | Page; possible tampering |

5.2 Major (notify channel)

| ID | Trigger |
| --- | --- |
| A-MAJ-1 | Failed-login spike per tenant (z-score > 3) sustained 10 m → suggests credential stuffing |
| A-MAJ-2 | MFA challenge success < 99% for 30 m |
| A-MAJ-3 | SSO provider success < 95% for 15 m |
| A-MAJ-4 | Offline bindings expiring next 24 h spike (rotation gap) |
| A-MAJ-5 | Breach-list provider error rate > 10% for 30 m |
| A-MAJ-6 | TOTP drift complaints (mfa_failed{kind=totp} outliers) |
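A-MAJ-1 relies on a z-score over per-tenant failed-login counts. The sketch below shows one way such a check could work, assuming a baseline of recent per-minute counts and a population standard deviation. The baseline length, sampling interval, and function names are assumptions; the production rule is not specified here.

```python
import math
from statistics import mean, pstdev


def z_score(history: list[float], current: float) -> float:
    """Distance of `current` from the baseline mean, in standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat baseline: any deviation is infinitely unusual.
        if current == mu:
            return 0.0
        return math.copysign(math.inf, current - mu)
    return (current - mu) / sigma


def sustained_spike(history: list[float], recent: list[float],
                    threshold: float = 3.0) -> bool:
    """True only if every recent sample exceeds the z-score threshold."""
    return bool(recent) and all(z_score(history, x) > threshold for x in recent)


# Hypothetical failed logins per minute for one tenant.
baseline = [10, 12, 11, 9, 10, 12, 11, 9]
```

Requiring every recent sample to exceed the threshold approximates the "sustained 10 m" condition and avoids alerting on a single noisy minute.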

5.3 Informational

| ID | Trigger |
| --- | --- |
| A-INF-1 | New device registered in tenant (first 30 d only) |
| A-INF-2 | API key issued |
| A-INF-3 | Tenant CA usage spike |

Alert routing: critical → PagerDuty + on-call SMS + Slack #iam-prod. Major → Slack only. Info → audit log only.

6. Logging Conventions

Every log line is a JSON object with the following fields:

| Field | Always | Notes |
| --- | --- | --- |
| timestamp, severity, message | yes | |
| traceId, spanId | yes | from W3C |
| service = iam-service | yes | |
| version | yes | semver + commit |
| tenantId | when applicable | NEVER PII |
| userId | when applicable | pseudonymous ID only |
| route | yes (HTTP) | template, not interpolated |
| httpStatus, durationMs | yes (HTTP) | |
| errorCode | on error | MELMASTOON.IAM.* |
| kid | on signing | |
| factor, riskScore, aiModelVersion | on adaptive MFA | |

Forbidden in logs: password, password_hash, refresh_token, access_token, totp_secret, magic_link_token, email, webauthn.credential, device_private_key. Pre-commit hook blocks any literal string password|secret|token in log statements.
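The pre-commit hook catches literals at review time. A runtime scrubber can serve as a second line of defence. The following is a minimal sketch of such a scrubber, assuming log records are built as dictionaries before serialization. The function names are hypothetical and not part of iam-service.

```python
import json

# Field names taken from the forbidden list above.
FORBIDDEN_KEYS = frozenset({
    "password", "password_hash", "refresh_token", "access_token",
    "totp_secret", "magic_link_token", "email", "webauthn.credential",
    "device_private_key",
})


def scrub(record: dict, _prefix: str = "") -> dict:
    """Return a copy of `record` without forbidden fields.

    Nested dicts are checked both by key and by dotted path, so
    {"webauthn": {"credential": ...}} is caught as "webauthn.credential".
    """
    clean = {}
    for key, value in record.items():
        path = f"{_prefix}{key}".lower()
        if key.lower() in FORBIDDEN_KEYS or path in FORBIDDEN_KEYS:
            continue
        if isinstance(value, dict):
            clean[key] = scrub(value, _prefix=f"{path}.")
        else:
            clean[key] = value
    return clean


def log_line(record: dict) -> str:
    """Serialize a scrubbed record as one JSON log line."""
    return json.dumps(scrub(record), sort_keys=True)
```

Dropping the field silently keeps the log line usable. A stricter variant could raise in non-production builds so that the offending call site is fixed rather than masked.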

7. Tracing

Spans of interest:

  • auth.login
    • domain.user.load
    • crypto.argon2id.verify
    • risk.classify (calls ai-orchestrator-service)
    • mfa.challenge (if forced)
    • crypto.jwt.sign (KMS)
    • db.session.insert
    • outbox.append
  • auth.refresh
    • db.session.lookup
    • policy.reuse_detect
    • crypto.jwt.sign
    • db.session.update
  • auth.sso.callback
    • oidc.code_exchange (external)
    • oidc.id_token.verify
    • domain.user.upsert

Sampling: 10% baseline, 100% on 5xx, 100% on auth.refresh.reuse_detected.
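The sampling rule above can be expressed as a single decision function. This is a sketch under two assumptions: the decision is made once the outcome of the request is known (tail-style), and the baseline uses the low 64 bits of the W3C trace ID so that every service reaches the same verdict for the same trace. Neither detail is confirmed by this document.

```python
def should_sample(trace_id: str, http_status: int, reuse_detected: bool,
                  baseline: float = 0.10) -> bool:
    """Decide whether to keep a trace.

    trace_id is the 32-hex-character W3C trace ID.
    """
    # Always keep server errors and refresh-token reuse detections.
    if http_status >= 500 or reuse_detected:
        return True
    # Deterministic baseline: map the low 64 bits to [0, 1) and compare.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < baseline
```

Deriving the baseline verdict from the trace ID, rather than from a random draw, avoids partially sampled traces when several services apply the same rule.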

8. Synthetic & Probes

| Probe | Frequency | Asserts |
| --- | --- | --- |
| GET /healthz | 30 s | 200, body {status:'ok'} |
| GET /readyz | 30 s | 200; checks Postgres, KMS warmup |
| GET /.well-known/jwks.json (4 regions) | 30 s | 200, ≥ 1 active key, kid overlap policy honored |
| Synthetic login (test tenant) | 5 m | < 1.5 s, returns access+refresh |
| Synthetic refresh | 5 m | < 500 ms |
| Synthetic SSO | 15 m (per provider) | success |
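The body assertion for the JWKS probe could look roughly like the sketch below. It assumes the standard JWK Set shape (a top-level "keys" array whose entries carry a "kid") and treats any key with a kid as active. The kid overlap policy is not defined in this document, so it is not checked here.

```python
import json


def check_jwks(body: str, min_keys: int = 1) -> list[str]:
    """Return a list of problems found in a JWKS response body.

    An empty list means the body passed these checks.
    """
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    keys = doc.get("keys") if isinstance(doc, dict) else None
    if not isinstance(keys, list):
        return ["missing 'keys' array"]
    problems = []
    kids = [k["kid"] for k in keys if isinstance(k, dict) and k.get("kid")]
    if len(kids) < min_keys:
        problems.append(f"expected at least {min_keys} key(s) with a kid, found {len(kids)}")
    if len(set(kids)) != len(kids):
        problems.append("duplicate kid values")
    return problems
```

Returning a list of problems rather than a boolean lets the probe report every failed assertion in one run.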

9. Runbooks

| ID | Title | Trigger |
| --- | --- | --- |
| RUN-IAM-001 | Emergency JWT signing key rotation | Suspected key compromise |
| RUN-IAM-002 | KMS regional outage | A-CRIT-3 |
| RUN-IAM-003 | Refresh-token reuse storm | A-CRIT-4 |
| RUN-IAM-004 | Outbox backlog | A-CRIT-5 |
| RUN-IAM-005 | Account lockout flood (legitimate) | A-MAJ-1 |
| RUN-IAM-006 | SSO provider down | A-MAJ-3 |
| RUN-IAM-007 | Tenant CA rotation | scheduled |
| RUN-IAM-008 | Breach-list provider down | A-MAJ-5 |

Runbooks live in D:/GhasiTech/ghasi-e-documentation/ghasi-melmastoon/services/iam-service/runbooks/RUN-IAM-*.md (linked from each alert).

10. Cross-References