Auth Service — Jira-Ready Epics & User Stories

Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Service prefix: AUTH
Scope: New epics/stories covering Kong JWT/API-key integration (ADR-0001), JWKS publication, Firebase federation, session management, and account provisioning.


Epic Summary

Epic ID    | Title                                       | Stories                   | Points
EP-AUTH-01 | Kong JWT Integration (JWKS Publisher)       | US-AUTH-001 – US-AUTH-005 | 26
EP-AUTH-02 | API Key Lifecycle & Kong Plugin Integration | US-AUTH-010 – US-AUTH-015 | 34
EP-AUTH-03 | User & Account Management                   | US-AUTH-020 – US-AUTH-025 | 28
EP-AUTH-04 | Session Management & Token Refresh          | US-AUTH-030 – US-AUTH-034 | 20
EP-AUTH-05 | Security Hardening & Observability          | US-AUTH-040 – US-AUTH-044 | 18

EP-AUTH-01 · Kong JWT Integration (JWKS Publisher)

Context: Kong's jwt plugin validates Bearer tokens using the auth-service JWKS endpoint. Auth-service must publish RS256 public keys at /.well-known/jwks.json without a Kong route (Kong polls it directly via cluster-internal DNS).

US-AUTH-001 · JWKS endpoint publication

Type: Feature | Points: 5

Description:
As Kong's jwt plugin, I need to poll /.well-known/jwks.json from auth-service so that RS256 JWT access tokens issued by auth-service can be validated at the edge without forwarding requests to auth-service.

Acceptance Criteria:

  • GET /.well-known/jwks.json returns RFC 7517 JWK Set with all active RS256 public keys
  • Each JWK entry: { kty: "RSA", use: "sig", alg: "RS256", kid, n, e }
  • Endpoint reachable cluster-internally (no Kong route); NOT proxied through Kong
  • Response cached in memory with 5-minute TTL to avoid Vault round-trips on every request
  • Response includes Cache-Control: public, max-age=300 header
  • Unit test: JWKS response validates against RFC 7517 schema
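
A minimal TypeScript sketch of the JWK conversion and the 5-minute in-memory cache described above. The names JwksCache, ActiveKey and toJwk are hypothetical, and the key pair is generated in-process only so the example runs on its own; in the service the public keys would come from the key store.

```typescript
import { generateKeyPairSync, KeyObject } from "node:crypto";

// One JWK Set entry, with the fields listed in the acceptance criteria.
interface RsaSigningJwk {
  kty: "RSA";
  use: "sig";
  alg: "RS256";
  kid: string;
  n: string;
  e: string;
}

// Hypothetical shape of an active key as loaded from the key store.
interface ActiveKey {
  kid: string;
  publicKey: KeyObject;
}

// Convert an RSA public key into its JWK representation.
function toJwk(key: ActiveKey): RsaSigningJwk {
  const jwk = key.publicKey.export({ format: "jwk" });
  if (jwk.kty !== "RSA" || typeof jwk.n !== "string" || typeof jwk.e !== "string") {
    throw new Error("expected an RSA public key");
  }
  return { kty: "RSA", use: "sig", alg: "RS256", kid: key.kid, n: jwk.n, e: jwk.e };
}

// Hypothetical cache: rebuilds the JWK Set at most once per TTL window.
class JwksCache {
  private cached: { keys: RsaSigningJwk[] } | null = null;
  private expiresAtMs = 0;
  private readonly loadActiveKeys: () => ActiveKey[];
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(
    loadActiveKeys: () => ActiveKey[],
    ttlMs: number = 5 * 60 * 1000,
    now: () => number = () => Date.now(),
  ) {
    this.loadActiveKeys = loadActiveKeys;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(): { keys: RsaSigningJwk[] } {
    let current = this.cached;
    if (current === null || this.now() >= this.expiresAtMs) {
      current = { keys: this.loadActiveKeys().map(toJwk) };
      this.cached = current;
      this.expiresAtMs = this.now() + this.ttlMs;
    }
    return current;
  }
}

// Usage with a throwaway key pair (assumption: real keys are loaded elsewhere).
const demoPair = generateKeyPairSync("rsa", { modulusLength: 2048 });
const jwksCache = new JwksCache(() => [{ kid: "demo-kid", publicKey: demoPair.publicKey }]);
const jwksBody = JSON.stringify(jwksCache.get());
```

The HTTP handler would return jwksBody together with the Cache-Control: public, max-age=300 header.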

US-AUTH-002 · RS256 key generation and Vault storage

Type: Feature | Points: 8

Description:
As the auth-service, I need RSA 2048-bit signing key pairs generated and stored in HashiCorp Vault so that private key material is never persisted by the application.

Acceptance Criteria:

  • Key pairs generated using Vault PKI or transit engine; private key path: secret/auth/jwks/{kid}/private
  • Public key material stored in auth.jwk_keys table alongside kid, algorithm, status (ACTIVE/RETIRING/RETIRED), createdAt, retiredAt
  • Key kid = UUID v4 generated at creation
  • Minimum 1 active key at all times; max 2 active keys during rotation
  • Key material never logged or returned in API responses

US-AUTH-003 · JWT issuance with RS256

Type: Feature | Points: 5

Description:
As an authenticated user, I need JWT access tokens issued with RS256 so that Kong can validate them using the public JWKS without a secret shared between Kong and auth-service.

Acceptance Criteria:

  • Access token: alg: RS256, exp: now + 15min, iss: https://auth.ghasi.io, sub: userId, tenantId, roles: [], kid (matching active JWK)
  • Refresh token: opaque random string (32 bytes base62), stored hashed in auth.refresh_tokens
  • Token pair returned from POST /v1/auth/login and POST /v1/auth/refresh
  • kid header in JWT matches entry in JWKS response
  • Unit test: issued JWT verifiable using JWKS public key
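
A hedged sketch of RS256 issuance and verification using only node:crypto. The helper names issueAccessToken and verifySignature are hypothetical, the typ header is a conventional addition not listed above, and the private key is held in-process purely for illustration (the stories in this epic keep it in Vault).

```typescript
import { createSign, createVerify, generateKeyPairSync, KeyObject } from "node:crypto";

// Claims listed in the acceptance criteria for the access token.
interface AccessTokenClaims {
  iss: string;
  sub: string;
  tenantId: string;
  roles: string[];
  exp: number;
}

const ACCESS_TOKEN_TTL_SECONDS = 15 * 60;

function encodeSegment(value: object): string {
  return Buffer.from(JSON.stringify(value), "utf8").toString("base64url");
}

// Build and sign a compact JWT. "RSA-SHA256" is RSASSA-PKCS1-v1_5 with SHA-256, i.e. RS256.
function issueAccessToken(
  privateKey: KeyObject,
  kid: string,
  userId: string,
  tenantId: string,
  roles: string[],
  nowSeconds: number,
): string {
  const header = { alg: "RS256", typ: "JWT", kid };
  const claims: AccessTokenClaims = {
    iss: "https://auth.ghasi.io",
    sub: userId,
    tenantId,
    roles,
    exp: nowSeconds + ACCESS_TOKEN_TTL_SECONDS,
  };
  const signingInput = `${encodeSegment(header)}.${encodeSegment(claims)}`;
  const signature = createSign("RSA-SHA256").update(signingInput).sign(privateKey);
  return `${signingInput}.${signature.toString("base64url")}`;
}

// Signature check only; a real verifier would also check exp, iss and kid.
function verifySignature(token: string, publicKey: KeyObject): boolean {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const [headerB64, payloadB64, signatureB64] = parts;
  if (!headerB64 || !payloadB64 || !signatureB64) return false;
  return createVerify("RSA-SHA256")
    .update(`${headerB64}.${payloadB64}`)
    .verify(publicKey, Buffer.from(signatureB64, "base64url"));
}

// Usage with a throwaway key pair.
const signingPair = generateKeyPairSync("rsa", { modulusLength: 2048 });
const sampleToken = issueAccessToken(signingPair.privateKey, "demo-kid", "user-1", "tenant-1", ["customer"], 1000);
const sampleTokenValid = verifySignature(sampleToken, signingPair.publicKey);
```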

US-AUTH-004 · JWKS key rotation

Type: Feature | Points: 5

Description:
As the platform, I need JWKS key rotation to work without downtime so that signing keys can be cycled periodically or on compromise without invalidating current sessions.

Acceptance Criteria:

  • Rotation: generate new key pair, add to JWKS (status=ACTIVE), set old key to RETIRING
  • RETIRING keys remain in JWKS response for 30 minutes (JWT TTL grace period)
  • After 30 minutes: old key removed from JWKS, status set to RETIRED
  • Rotation can be triggered via POST /v1/internal/auth/rotate-jwks (admin mTLS, not via Kong)
  • auth_key_rotation_total Prometheus counter incremented on each rotation
  • Alert: if active key count < 1 → PagerDuty critical
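
The rotation lifecycle above can be sketched as a small state transition over key rows. This is an illustrative model only: SigningKeyRow, rotateKeys and retireExpired are hypothetical names, and persistence, Vault calls and metrics are left out.

```typescript
type SigningKeyStatus = "ACTIVE" | "RETIRING" | "RETIRED";

interface SigningKeyRow {
  kid: string;
  status: SigningKeyStatus;
  retiringSinceMs: number | null;
}

// 30-minute grace period during which RETIRING keys stay in the JWKS response.
const GRACE_PERIOD_MS = 30 * 60 * 1000;

// Add a new ACTIVE key and demote every currently ACTIVE key to RETIRING.
function rotateKeys(keys: SigningKeyRow[], newKid: string, nowMs: number): SigningKeyRow[] {
  const demoted = keys.map((k): SigningKeyRow =>
    k.status === "ACTIVE" ? { kid: k.kid, status: "RETIRING", retiringSinceMs: nowMs } : k,
  );
  return demoted.concat([{ kid: newKid, status: "ACTIVE", retiringSinceMs: null }]);
}

// Move RETIRING keys to RETIRED once the grace period has elapsed.
function retireExpired(keys: SigningKeyRow[], nowMs: number): SigningKeyRow[] {
  return keys.map((k): SigningKeyRow =>
    k.status === "RETIRING" && k.retiringSinceMs !== null && nowMs - k.retiringSinceMs >= GRACE_PERIOD_MS
      ? { kid: k.kid, status: "RETIRED", retiringSinceMs: k.retiringSinceMs }
      : k,
  );
}

// Keys that belong in the JWKS response: everything not yet RETIRED.
function publishedKids(keys: SigningKeyRow[]): string[] {
  return keys.filter((k) => k.status !== "RETIRED").map((k) => k.kid);
}

// Input for the "active key count < 1" alert.
function activeKeyCount(keys: SigningKeyRow[]): number {
  return keys.filter((k) => k.status === "ACTIVE").length;
}
```

Keeping the transitions as pure functions makes the 30-minute boundary easy to unit test with a fixed clock.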

US-AUTH-005 · Kong jwt plugin configuration for auth-service routes

Type: Configuration | Points: 3

Description:
As a platform operator, I need Kong's jwt plugin configured to verify RS256 tokens using auth-service's JWKS so that all protected routes validate tokens at the edge.

Acceptance Criteria:

  • jwt plugin configured globally or per-route with secret_is_base64: false, algorithm: RS256
  • config.jwks_uri set to http://auth-service:3002/.well-known/jwks.json (cluster-internal)
  • config.key_claim_name: "kid" to select correct JWK
  • 401 returned by Kong if token expired, invalid signature, or missing
  • Declarative config in services/api-gateway/kong/plugins/jwt.yaml

EP-AUTH-02 · API Key Lifecycle & Kong Plugin Integration

Context: Customers use long-lived API keys (no expiry by default) for programmatic access. Kong's ghasi-api-key-lookup custom plugin validates keys by calling auth-service's internal lookup endpoint.

US-AUTH-010 · API key creation endpoint

Type: Feature | Points: 5

Description:
As a customer, I need to create API keys from the customer portal so that I can authenticate programmatic requests to the platform.

Acceptance Criteria:

  • POST /v1/api-keys accepts { label: string, expiresAt?: ISO8601 }
  • Generates raw key: ghasi_live_ prefix followed by 24 random base62 characters
  • Stores sha256(rawKey) hash in auth.api_keys; raw key returned exactly once in response
  • Response: { id, label, key: "ghasi_live_...", createdAt, expiresAt } — key never returned again
  • Quota: max 10 active keys per account
  • Unit test: verify stored hash matches sha256(returnedKey)
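
One way the key generation and hashing could look, as a sketch under stated assumptions: the function names are hypothetical and the base62 alphabet ordering is an arbitrary choice. Rejection sampling is used so that each character is drawn uniformly.

```typescript
import { createHash, randomBytes } from "node:crypto";

const BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const API_KEY_PREFIX = "ghasi_live_";

// Draw `length` base62 characters from the CSPRNG. Bytes >= 248 are rejected
// so that `byte % 62` stays uniform (248 = 4 * 62).
function randomBase62(length: number): string {
  let out = "";
  while (out.length < length) {
    const bytes = randomBytes(length);
    for (let i = 0; i < bytes.length && out.length < length; i++) {
      const byte = bytes.readUInt8(i);
      if (byte < 248) out += BASE62_ALPHABET.charAt(byte % 62);
    }
  }
  return out;
}

function sha256Hex(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

interface CreatedApiKey {
  rawKey: string;  // returned to the caller exactly once
  keyHash: string; // the only form persisted in auth.api_keys
}

function createApiKey(): CreatedApiKey {
  const rawKey = API_KEY_PREFIX + randomBase62(24);
  return { rawKey, keyHash: sha256Hex(rawKey) };
}
```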

US-AUTH-011 · API key listing and revocation

Type: Feature | Points: 3

Description:
As a customer, I need to list and revoke my API keys so that I can manage access credentials and respond to key compromise.

Acceptance Criteria:

  • GET /v1/api-keys returns [{ id, label, keyPrefix (ghasi_live_ plus the first 8 random chars), createdAt, expiresAt, status }]
  • Raw key NOT returned in listing; only prefix for identification
  • DELETE /v1/api-keys/:id sets status = REVOKED, revokedAt = now()
  • Revoked keys immediately fail validation (no TTL grace period)
  • auth_api_key_revoked_total counter incremented

US-AUTH-012 · Internal API key lookup endpoint (Kong plugin integration)

Type: Feature | Points: 8

Description:
As Kong's ghasi-api-key-lookup custom plugin, I need an internal endpoint that validates a hashed API key and returns the associated account so that Kong can authenticate API key requests at the edge.

Acceptance Criteria:

  • GET /v1/api-keys/lookup?hash=<sha256hex> returns { accountId, tenantId, roles, status } for valid key
  • Returns 404 for unknown hash
  • Returns 403 for REVOKED or EXPIRED key with { reason: "REVOKED" | "EXPIRED" }
  • Endpoint accessible cluster-internally ONLY (no Kong route; bind on internal interface)
  • Response cached in Redis auth:key:{hash} TTL 60s
  • mTLS client certificate required (Kong service account)
  • P95 response time ≤ 5 ms (Redis cache hit path)

US-AUTH-013 · API key lookup Redis caching

Type: Feature | Points: 5

Description:
As the lookup endpoint, I need Redis caching for key lookups so that Kong's high-frequency validation calls don't overwhelm the auth-service database.

Acceptance Criteria:

  • Cache key: auth:key:{sha256hash}, TTL 60s, value: { accountId, tenantId, roles, status }
  • Cache miss: query PG, populate cache
  • Cache invalidated immediately on key revocation: DEL auth:key:{hash}
  • auth_key_cache_hit_total and auth_key_cache_miss_total counters tracked
  • Cache hit path P95 ≤ 2 ms
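
The cache-aside flow above, sketched with an in-memory map standing in for Redis so the example is self-contained. TtlCache and CachedKeyLookup are hypothetical names, and the calls are synchronous for brevity; a real Redis client and PG query would be asynchronous.

```typescript
interface KeyRecord {
  accountId: string;
  tenantId: string;
  roles: string[];
  status: "ACTIVE" | "REVOKED" | "EXPIRED";
}

// Hypothetical in-memory stand-in for Redis GET / SET EX / DEL.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAtMs: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, ttlSeconds: number): void {
    this.entries.set(key, { value, expiresAtMs: this.now() + ttlSeconds * 1000 });
  }

  del(key: string): void {
    this.entries.delete(key);
  }
}

class CachedKeyLookup {
  hits = 0;   // stands in for auth_key_cache_hit_total
  misses = 0; // stands in for auth_key_cache_miss_total
  private readonly cache: TtlCache<KeyRecord>;
  private readonly queryDb: (hash: string) => KeyRecord | undefined;

  constructor(cache: TtlCache<KeyRecord>, queryDb: (hash: string) => KeyRecord | undefined) {
    this.cache = cache;
    this.queryDb = queryDb;
  }

  lookup(hash: string): KeyRecord | undefined {
    const cacheKey = `auth:key:${hash}`;
    const cached = this.cache.get(cacheKey);
    if (cached !== undefined) {
      this.hits += 1;
      return cached;
    }
    this.misses += 1;
    const record = this.queryDb(hash); // cache miss: fall through to the database
    if (record !== undefined) this.cache.set(cacheKey, record, 60);
    return record;
  }

  // Called on revocation so the next lookup cannot be served a stale ACTIVE entry.
  invalidate(hash: string): void {
    this.cache.del(`auth:key:${hash}`);
  }
}
```

Without the explicit invalidation, a revoked key could keep validating for up to the 60 s TTL, which is why revocation deletes the cache entry.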

US-AUTH-014 · API key expiry enforcement

Type: Feature | Points: 5

Description:
As the platform, I need API keys with expiresAt set to be rejected after their expiry time so that time-limited keys provide bounded access windows.

Acceptance Criteria:

  • Lookup endpoint checks expiresAt < now() → returns 403 { reason: "EXPIRED" }
  • Background job runs daily: updates status = EXPIRED for keys where expiresAt < now() AND status = ACTIVE
  • Expired keys not returned in active key listing
  • auth_api_key_expired_total counter incremented by daily job
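
A sketch of how the lookup decision and the daily sweep could fit together. The types and function names are hypothetical. The lookup checks expiresAt itself, so a key that expired since the last daily run is still rejected even though its stored status is ACTIVE.

```typescript
type ApiKeyStatus = "ACTIVE" | "REVOKED" | "EXPIRED";

interface ApiKeyRow {
  accountId: string;
  tenantId: string;
  roles: string[];
  status: ApiKeyStatus;
  expiresAtMs: number | null; // null = no expiry
}

type LookupResponse =
  | { httpStatus: 200; body: { accountId: string; tenantId: string; roles: string[]; status: "ACTIVE" } }
  | { httpStatus: 403; body: { reason: "REVOKED" | "EXPIRED" } }
  | { httpStatus: 404 };

// Map a stored row (or its absence) to the lookup endpoint's response.
function resolveLookup(row: ApiKeyRow | undefined, nowMs: number): LookupResponse {
  if (row === undefined) return { httpStatus: 404 };
  if (row.status === "REVOKED") return { httpStatus: 403, body: { reason: "REVOKED" } };
  const pastExpiry = row.expiresAtMs !== null && row.expiresAtMs < nowMs;
  if (row.status === "EXPIRED" || pastExpiry) return { httpStatus: 403, body: { reason: "EXPIRED" } };
  return {
    httpStatus: 200,
    body: { accountId: row.accountId, tenantId: row.tenantId, roles: row.roles, status: "ACTIVE" },
  };
}

// Daily job: flip ACTIVE keys past their expiry to EXPIRED and return how many
// were changed (the value added to auth_api_key_expired_total).
function sweepExpired(rows: ApiKeyRow[], nowMs: number): number {
  let flipped = 0;
  for (const row of rows) {
    if (row.status === "ACTIVE" && row.expiresAtMs !== null && row.expiresAtMs < nowMs) {
      row.status = "EXPIRED";
      flipped += 1;
    }
  }
  return flipped;
}
```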

US-AUTH-015 · Kong ghasi-api-key-lookup plugin

Type: Feature | Points: 8

Description:
As Kong, I need a custom Lua plugin that extracts the X-API-Key header, hashes it, calls auth-service lookup, and injects tenant context into upstream headers so that API key authentication works at the edge.

Acceptance Criteria:

  • Plugin reads X-API-Key request header; returns 401 if absent
  • Computes sha256(value) → calls http://auth-service:3002/v1/api-keys/lookup?hash={hash}
  • On 200: injects X-Tenant-Id, X-Account-Id, X-Roles headers into upstream request
  • On 404/403: returns 401 Unauthorized to client
  • Plugin timeout: 100ms; on timeout returns 503 Service Unavailable with Retry-After: 1
  • Plugin deployed to Kong as custom plugin; declarative config in services/api-gateway/kong/plugins/

EP-AUTH-03 · User & Account Management

US-AUTH-020 · User registration and Firebase federation

Type: Feature | Points: 8

Description:
As a new user, I need to register via email/password with Firebase Auth so that I get a Ghasi platform account linked to my Firebase identity.

Acceptance Criteria:

  • POST /v1/auth/register accepts { email, password, displayName, organizationName }
  • Creates Firebase user via Firebase Admin SDK; sets the customer role as a custom claim
  • Creates auth.accounts and auth.users rows in a PG transaction
  • Publishes auth.events: { type: "user.registered", userId, tenantId, email } to NATS
  • Password stored in Firebase only (never in auth-service DB)
  • Returns 201 Created { userId, accountId, email, displayName }
  • Duplicate email → 409 Conflict

US-AUTH-021 · Login and token issuance

Type: Feature | Points: 5

Description:
As a registered user, I need to log in with email/password and receive a JWT access token and a refresh token so that I can authenticate subsequent requests.

Acceptance Criteria:

  • POST /v1/auth/login accepts { email, password }; validates credentials via the Firebase Auth REST sign-in API (the Admin SDK cannot verify passwords)
  • Returns { accessToken (JWT RS256, 15m), refreshToken (opaque, 30d), expiresIn: 900 }
  • Refresh token stored as argon2id hash in auth.refresh_tokens
  • Invalid credentials → 401 { code: "INVALID_CREDENTIALS" }
  • Brute force protection: 5 failed attempts per IP in 15m → 429 Too Many Requests

US-AUTH-022 · Token refresh

Type: Feature | Points: 3

Description:
As an authenticated client, I need to refresh my access token using a refresh token so that sessions persist beyond the 15-minute access token TTL.

Acceptance Criteria:

  • POST /v1/auth/refresh accepts { refreshToken } in body
  • Validates refresh token hash in auth.refresh_tokens; checks expiresAt and revokedAt
  • Issues new access token (15m) and rotates refresh token (30d); old refresh token invalidated
  • Returns same shape as login response
  • Used/revoked refresh token → 401 { code: "INVALID_REFRESH_TOKEN" }

US-AUTH-023 · Account and user profile management

Type: Feature | Points: 5

Description:
As an authenticated user, I need to view and update my profile and account details so that I can manage my Ghasi account.

Acceptance Criteria:

  • GET /v1/users/me returns { userId, accountId, email, displayName, roles, createdAt }
  • PUT /v1/users/me accepts { displayName } (email change requires re-auth)
  • GET /v1/accounts/me returns { accountId, organizationName, tier, status }
  • All endpoints require valid JWT Bearer token (validated by Kong)
  • Tenant isolation: X-Tenant-Id header injected by Kong; service verifies userId belongs to tenantId

US-AUTH-024 · RBAC role assignment

Type: Feature | Points: 5

Description:
As a platform admin, I need to assign roles to users so that access control is enforced throughout the platform.

Acceptance Criteria:

  • POST /v1/admin/users/:id/roles accepts { roles: string[] } (admin-only route)
  • Valid roles: customer, admin, operator-admin, billing-admin
  • Role assignment updates Firebase custom claims via Admin SDK and auth.user_roles table atomically
  • Role changes take effect on next token refresh (access token carries stale claims for up to 15m)
  • Audit log entry created in auth.audit_log for every role change

US-AUTH-025 · Account status management (suspend/activate)

Type: Feature | Points: 2

Description:
As a platform admin, I need to suspend and reactivate accounts so that compromised or non-paying accounts can be blocked.

Acceptance Criteria:

  • PUT /v1/admin/accounts/:id/status accepts { status: "ACTIVE" | "SUSPENDED" }
  • Suspended accounts: API key lookup returns 403 { reason: "ACCOUNT_SUSPENDED" }; JWTs still pass Kong validation, but auth-service middleware returns 403
  • Reactivation restores normal access immediately
  • Status change event published to auth.events

EP-AUTH-04 · Session Management & Token Refresh

US-AUTH-030 · Logout and session revocation

Type: Feature | Points: 3

Description:
As an authenticated user, I need a logout endpoint that revokes my current session so that the session cannot be refreshed after sign-out.

Acceptance Criteria:

  • POST /v1/auth/logout requires valid JWT; revokes the associated refresh token
  • Refresh token revokedAt set to now()
  • Firebase revokeRefreshTokens(userId) called to invalidate all Firebase sessions
  • Returns 204 No Content
  • Subsequent refresh attempts with revoked token → 401

US-AUTH-031 · Refresh token rotation

Type: Feature | Points: 3

Description:
As the security model, I need refresh token rotation so that a stolen refresh token has a bounded exploit window.

Acceptance Criteria:

  • Each /v1/auth/refresh call issues a new refresh token and invalidates the previous one
  • Old refresh token revokedAt set; new token linked to same sessionId
  • Refresh token reuse detection: using an already-rotated token invalidates the entire session (auth.refresh_tokens for sessionId all revoked)
  • auth_refresh_token_reuse_total counter incremented on reuse detection
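
The rotation and reuse-detection rules can be modelled in a few lines. This sketch makes several simplifying assumptions: tokens are identified by a plain tokenId instead of an argon2id hash, expiry is ignored, and RefreshTokenStore is a hypothetical in-memory stand-in for auth.refresh_tokens.

```typescript
interface RefreshTokenRow {
  tokenId: string;
  sessionId: string;
  revokedAtMs: number | null;
}

type RefreshResult =
  | { ok: true; newTokenId: string }
  | { ok: false; code: "INVALID_REFRESH_TOKEN" };

class RefreshTokenStore {
  reuseDetected = 0; // stands in for auth_refresh_token_reuse_total
  private readonly rows = new Map<string, RefreshTokenRow>();
  private nextId = 1;

  // Issue a fresh token bound to a session.
  issue(sessionId: string): string {
    const tokenId = `rt-${this.nextId}`;
    this.nextId += 1;
    this.rows.set(tokenId, { tokenId, sessionId, revokedAtMs: null });
    return tokenId;
  }

  refresh(tokenId: string, nowMs: number): RefreshResult {
    const row = this.rows.get(tokenId);
    if (row === undefined) return { ok: false, code: "INVALID_REFRESH_TOKEN" };

    if (row.revokedAtMs !== null) {
      // An already-rotated token was presented: treat it as theft and revoke
      // every token that belongs to the same session.
      this.rows.forEach((other) => {
        if (other.sessionId === row.sessionId && other.revokedAtMs === null) {
          other.revokedAtMs = nowMs;
        }
      });
      this.reuseDetected += 1;
      return { ok: false, code: "INVALID_REFRESH_TOKEN" };
    }

    // Normal rotation: revoke the presented token, issue a successor in the same session.
    row.revokedAtMs = nowMs;
    return { ok: true, newTokenId: this.issue(row.sessionId) };
  }
}
```

Revoking the whole session on reuse means that both the attacker and the legitimate client are forced to re-authenticate, which bounds the exploit window to a single rotation.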

US-AUTH-032 · Refresh token storage with argon2id

Type: Feature | Points: 3

Description:
As the security model, I need refresh tokens stored as argon2id hashes so that a database compromise does not expose usable tokens.

Acceptance Criteria:

  • Refresh token stored as argon2id hash: m=65536 (64MB), t=3, p=4
  • Raw token (32 bytes base62) returned to client exactly once; never stored or logged
  • Hash verification on each refresh: argon2.verify(storedHash, incomingToken)
  • Unit test: hash → verify round trip passes; tampered token fails

US-AUTH-033 · Session listing and revocation (all devices)

Type: Feature | Points: 5

Description:
As a user, I need to list all active sessions and revoke all sessions so that I can respond to a suspected account compromise.

Acceptance Criteria:

  • GET /v1/auth/sessions returns [{ sessionId, createdAt, lastUsedAt, userAgent, ipAddress }]
  • DELETE /v1/auth/sessions revokes all refresh tokens for the user
  • DELETE /v1/auth/sessions/:sessionId revokes a single session
  • Firebase revokeRefreshTokens called when all sessions revoked

US-AUTH-034 · Brute force protection

Type: Feature | Points: 6

Description:
As the security model, I need brute force rate limiting on login and token refresh endpoints so that credential stuffing attacks are mitigated.

Acceptance Criteria:

  • Login: 5 failed attempts per IP per 15 minutes → 429 with Retry-After header
  • Login: 10 failed attempts per email per 15 minutes → 429
  • Counters stored in Redis: auth:ratelimit:login:ip:{ip} and auth:ratelimit:login:email:{email}
  • TTL = 900s on both counters
  • Successful login resets counters
  • auth_login_rate_limited_total counter incremented on block
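
A sketch of the two fixed-window counters, with an in-memory map standing in for Redis INCR plus a 900 s expiry set when the counter is first created. LoginRateLimiter and its method names are hypothetical.

```typescript
const LOGIN_WINDOW_SECONDS = 900;
const LOGIN_IP_LIMIT = 5;
const LOGIN_EMAIL_LIMIT = 10;

class LoginRateLimiter {
  private readonly counters = new Map<string, { count: number; expiresAtMs: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Read a live counter, dropping it if its TTL has passed.
  private live(key: string): { count: number; expiresAtMs: number } | undefined {
    const entry = this.counters.get(key);
    if (entry !== undefined && this.now() >= entry.expiresAtMs) {
      this.counters.delete(key);
      return undefined;
    }
    return entry;
  }

  private increment(key: string): void {
    const entry = this.live(key);
    if (entry === undefined) {
      this.counters.set(key, { count: 1, expiresAtMs: this.now() + LOGIN_WINDOW_SECONDS * 1000 });
    } else {
      entry.count += 1;
    }
  }

  private secondsLeftIfOver(key: string, limit: number): number {
    const entry = this.live(key);
    if (entry === undefined || entry.count < limit) return 0;
    return Math.ceil((entry.expiresAtMs - this.now()) / 1000);
  }

  // 0 means "not blocked"; otherwise the value for the Retry-After header on the 429.
  retryAfterSeconds(ip: string, email: string): number {
    return Math.max(
      this.secondsLeftIfOver(`auth:ratelimit:login:ip:${ip}`, LOGIN_IP_LIMIT),
      this.secondsLeftIfOver(`auth:ratelimit:login:email:${email}`, LOGIN_EMAIL_LIMIT),
    );
  }

  recordFailure(ip: string, email: string): void {
    this.increment(`auth:ratelimit:login:ip:${ip}`);
    this.increment(`auth:ratelimit:login:email:${email}`);
  }

  // A successful login resets both counters.
  recordSuccess(ip: string, email: string): void {
    this.counters.delete(`auth:ratelimit:login:ip:${ip}`);
    this.counters.delete(`auth:ratelimit:login:email:${email}`);
  }
}
```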

EP-AUTH-05 · Security Hardening & Observability

US-AUTH-040 · Password hashing with argon2id

Type: Feature | Points: 3

Description:
As the security model, I need passwords hashed with argon2id before sending to Firebase (or storing locally if Firebase offline) so that password exposure risk is minimised.

Acceptance Criteria:

  • argon2id parameters: m=65536, t=3, p=4 (above the OWASP-recommended minimum of m=19456, t=2, p=1)
  • Password never stored in auth-service DB (Firebase is authoritative)
  • Hash parameters stored as part of Firebase password hash configuration
  • Unit test verifies argon2id hash/verify round-trip

US-AUTH-041 · Audit logging for security events

Type: Feature | Points: 5

Description:
As the compliance team, I need all security-relevant events recorded in an immutable audit log so that breach investigations have a reliable event trail.

Acceptance Criteria:

  • Events logged: login success/failure, logout, token refresh, API key create/revoke, role change, account suspend/activate, JWKS rotation
  • Each entry: { id, userId, accountId, action, outcome, ipAddress, userAgent, timestamp, metadata }
  • Stored in auth.audit_log; no UPDATE or DELETE on audit rows (append-only enforced via PG trigger)
  • Audit log queryable by admin via GET /v1/admin/audit-log with pagination

US-AUTH-042 · Prometheus metrics for auth-service

Type: Feature | Points: 3

Description:
As Prometheus, I need /metrics from auth-service so that authentication health is monitored.

Acceptance Criteria:

  • Metrics: auth_login_total{outcome}, auth_token_refresh_total, auth_api_key_lookup_total{result}, auth_key_cache_hit_total, auth_key_cache_miss_total, auth_login_rate_limited_total
  • Histogram: auth_api_key_lookup_duration_seconds
  • /metrics endpoint not behind Kong (cluster-internal only)

US-AUTH-043 · Health and readiness endpoints

Type: Feature | Points: 2

Description:
As Kubernetes, I need health and readiness endpoints for auth-service so that pod lifecycle is managed correctly.

Acceptance Criteria:

  • GET /health/live → 200 always if process running
  • GET /health/ready → 200 only if PG, Redis, Vault, Firebase Admin SDK all reachable
  • GET /health/ready → 503 with dependency map if any are down
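
A sketch of the readiness aggregation, assuming each dependency exposes a probe function. The probe signature and the checkReadiness name are hypothetical, and the probes are synchronous here for brevity.

```typescript
type Dependency = "postgres" | "redis" | "vault" | "firebase";
type DependencyState = "up" | "down";

interface ReadinessReport {
  httpStatus: 200 | 503;
  dependencies: Record<Dependency, DependencyState>;
}

const READINESS_DEPENDENCIES: Dependency[] = ["postgres", "redis", "vault", "firebase"];

// Run every probe; a probe that returns false or throws marks its dependency "down".
// The endpoint is ready (200) only when all dependencies are "up".
function checkReadiness(probes: Record<Dependency, () => boolean>): ReadinessReport {
  const dependencies: Record<Dependency, DependencyState> = {
    postgres: "down",
    redis: "down",
    vault: "down",
    firebase: "down",
  };
  let allUp = true;
  for (const name of READINESS_DEPENDENCIES) {
    let up = false;
    try {
      up = probes[name]();
    } catch {
      up = false;
    }
    dependencies[name] = up ? "up" : "down";
    if (!up) allUp = false;
  }
  return { httpStatus: allUp ? 200 : 503, dependencies };
}
```

The dependencies map is what the 503 response body would carry, so an operator can see which dependency failed.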

US-AUTH-044 · Kubernetes deployment manifest with secret management

Type: DevOps | Points: 5

Description:
As the platform, I need auth-service deployed with secrets injected via Vault Agent sidecar so that no credentials are stored in Kubernetes ConfigMaps or env files.

Acceptance Criteria:

  • Vault Agent sidecar annotation on Deployment pod template
  • Firebase Admin SDK service account JSON injected at /vault/secrets/firebase-admin.json
  • Vault token for Vault transit/PKI injected via Kubernetes ServiceAccount
  • PostgreSQL credentials injected from secret/auth/db
  • Redis password injected from secret/auth/redis
  • HPA: min 2 replicas, scale on CPU > 70% or auth_api_key_lookup_total rate

EP-AUTH-06 · Tenant Sub-Org / Reseller Hierarchy + Cross-Tenant Token Revocation Propagation

Context: Enterprise tenants (telcos, agencies, banks) need to model sub-organisations under a single legal entity. The platform must enforce hierarchy in RBAC, scope token issuance to the right org, and propagate revocation across all sub-orgs of a parent.

US-AUTH-050 · Sub-org data model and parent-child constraint

Type: Feature | Points: 5

Description: As the platform, I need an auth.tenant_orgs table representing the legal-entity → sub-org tree so that one customer can model multiple business units without operating multiple Ghasi accounts.

Acceptance Criteria:

  • Table auth.tenant_orgs (id, tenantId, parentId NULLABLE, name, kind ENUM(LEGAL_ENTITY,BUSINESS_UNIT,AGENCY_CLIENT), createdAt).
  • Constraint: max depth 3 (legal entity → business unit → agency client).
  • FK on auth.users.orgId referencing auth.tenant_orgs.id; backfill assigns existing users to their tenant's root org (tenant.rootOrg).
  • Cycle prevention: trigger rejects parent set to a descendant.
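
The depth and cycle constraints can be expressed as a single validation over the tree, sketched here in application code (the story places the enforcement in a database trigger; this only illustrates the logic). OrgNode, ancestorsOf and validateParent are hypothetical names. The same check would apply to the re-parenting workflow in US-AUTH-056.

```typescript
interface OrgNode {
  id: string;
  parentId: string | null;
}

const MAX_ORG_DEPTH = 3; // legal entity → business unit → agency client

// Path from an org up to its root, starting with the org itself.
function ancestorsOf(orgs: Map<string, OrgNode>, id: string): string[] {
  const path: string[] = [];
  let current = orgs.get(id);
  // The size guard stops the walk if stored data already contains a cycle.
  while (current !== undefined && path.length <= orgs.size) {
    path.push(current.id);
    current = current.parentId === null ? undefined : orgs.get(current.parentId);
  }
  return path;
}

// Number of levels in the subtree rooted at `id` (a leaf has height 1).
function subtreeHeight(orgs: Map<string, OrgNode>, id: string): number {
  let tallestChild = 0;
  orgs.forEach((org) => {
    if (org.parentId === id) tallestChild = Math.max(tallestChild, subtreeHeight(orgs, org.id));
  });
  return 1 + tallestChild;
}

type ParentCheck = "OK" | "UNKNOWN_ORG" | "CYCLE" | "TOO_DEEP";

// Would setting orgId's parent to newParentId keep the tree valid?
function validateParent(orgs: Map<string, OrgNode>, orgId: string, newParentId: string): ParentCheck {
  if (!orgs.has(orgId) || !orgs.has(newParentId)) return "UNKNOWN_ORG";
  const parentPath = ancestorsOf(orgs, newParentId);
  // If orgId appears on the new parent's path, the parent is orgId itself or one of its descendants.
  if (parentPath.indexOf(orgId) !== -1) return "CYCLE";
  if (parentPath.length + subtreeHeight(orgs, orgId) > MAX_ORG_DEPTH) return "TOO_DEEP";
  return "OK";
}
```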

US-AUTH-051 · Org-scoped JWT claim

Type: Feature | Points: 3

Description: The platform JWT must carry orgId so downstream services can scope queries by org without an extra DB lookup.

Acceptance Criteria:

  • JWT payload includes tenantId, orgId, orgPath: ["legalId","buId","clientId"].
  • Downstream services use orgPath for ancestor-based authorisation.
  • OpenAPI spec updated; contract test passes.

US-AUTH-052 · Cross-org admin role with explicit scope

Type: Feature | Points: 5

Description: A legal-entity admin must be able to administer descendant orgs but never sibling orgs.

Acceptance Criteria:

  • Role org.admin is scoped to a specific orgId and applies to that org plus all descendants.
  • Role assignment endpoint validates the assigner is admin of an ancestor org.
  • Negative tests: sibling-org admin cannot read another sibling's data (RLS verified).
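
Given the orgPath claim from US-AUTH-051, the ancestor-or-self rule reduces to a membership test on the target org's path. The sketch below assumes each resource or user carries the orgPath of the org it belongs to; OrgAdminGrant and canAdminister are hypothetical names.

```typescript
interface OrgAdminGrant {
  role: "org.admin";
  orgId: string; // the org the role is scoped to
}

// A grant covers a target org when the granted orgId appears anywhere on the
// target's path from legal entity down to the target itself. Sibling orgs never
// appear on each other's paths, so they are excluded by construction.
function canAdminister(grant: OrgAdminGrant, targetOrgPath: string[]): boolean {
  return targetOrgPath.indexOf(grant.orgId) !== -1;
}

// Example tree: legal → buA → clientA1, and legal → buB.
const legalAdmin: OrgAdminGrant = { role: "org.admin", orgId: "legal" };
const buAAdmin: OrgAdminGrant = { role: "org.admin", orgId: "buA" };

const pathClientA1 = ["legal", "buA", "clientA1"];
const pathBuB = ["legal", "buB"];
const pathLegal = ["legal"];
```

With these values, legalAdmin covers every path, buAAdmin covers pathClientA1 but not pathBuB (a sibling) or pathLegal (an ancestor).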

US-AUTH-053 · Token revocation propagation across descendants

Type: Feature | Points: 5

Description: When an org is suspended, every active session and API key for that org and its descendants must be revoked within 60 s.

Acceptance Criteria:

  • POST /v1/admin/orgs/:orgId/suspend triggers cascade revocation.
  • All matching auth.refresh_tokens rows set revokedAt = now().
  • All matching auth.api_keys.status = REVOKED.
  • Redis cache auth:rbac:{userId} invalidated.
  • auth.org.suspended.v1 NATS event published with descendant org IDs.
  • Integration test: suspend root org → verify child-org user gets 401 within 60 s on next call.

US-AUTH-054 · Per-org API key quota and rate limits

Type: Feature | Points: 3

Description: Quotas (max API keys, max sender IDs, max RPS) must be inherited from parent org but overridable downward (parent ≥ child).

Acceptance Criteria:

  • Quota table auth.org_quotas with hierarchical resolution (child quota ≤ parent).
  • Quota check on API-key creation; 422 with code: "QUOTA_EXCEEDED" on breach.
  • Telemetry: auth_quota_breach_total{resource,orgId} counter.
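
One possible reading of the hierarchical resolution, sketched for the API-key quota only: the effective limit for an org is the smallest explicit limit found on the path to the root, which guarantees child ≤ parent even if a child row stores a larger number. The row shape, the null-means-inherit convention and the function names are assumptions.

```typescript
interface OrgQuotaNode {
  orgId: string;
  parentId: string | null;
  maxApiKeys: number | null; // null = no explicit value, inherit from parent
}

// Walk from the org to the root and take the minimum explicit limit.
// Returns null when no org on the path defines a limit.
function effectiveMaxApiKeys(nodes: Map<string, OrgQuotaNode>, orgId: string): number | null {
  let effective: number | null = null;
  let current = nodes.get(orgId);
  let hops = 0;
  while (current !== undefined && hops <= nodes.size) {
    if (current.maxApiKeys !== null) {
      effective = effective === null ? current.maxApiKeys : Math.min(effective, current.maxApiKeys);
    }
    current = current.parentId === null ? undefined : nodes.get(current.parentId);
    hops += 1;
  }
  return effective;
}

type QuotaDecision =
  | { allowed: true }
  | { allowed: false; httpStatus: 422; code: "QUOTA_EXCEEDED" };

// Check performed before creating one more API key for the org.
function checkApiKeyQuota(
  nodes: Map<string, OrgQuotaNode>,
  orgId: string,
  activeKeyCount: number,
): QuotaDecision {
  const limit = effectiveMaxApiKeys(nodes, orgId);
  if (limit !== null && activeKeyCount >= limit) {
    return { allowed: false, httpStatus: 422, code: "QUOTA_EXCEEDED" };
  }
  return { allowed: true };
}
```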

US-AUTH-055 · Org-scoped audit log queries

Type: Feature | Points: 3

Description: Auditors of an ancestor org must be able to query audit events of all descendants but never of siblings.

Acceptance Criteria:

  • GET /v1/audit?orgId=:id returns rows for :id and descendants only.
  • RLS policy enforces ancestor-or-self.
  • Negative test: non-ancestor cannot read.

US-AUTH-056 · Org transfer and re-parenting workflow

Type: Feature | Points: 5

Description: Platform admins must be able to move an agency-client org from one business-unit parent to another (e.g., when a customer restructures).

Acceptance Criteria:

  • POST /v1/admin/orgs/:id/reparent accepts { newParentId }; validates depth and cycle constraints.
  • Existing tokens for the moved org are re-issued with the new orgPath on next refresh.
  • auth.org.reparented.v1 event includes old and new path.
  • Audit entry mandatory; reason field required.

EP-AUTH-07 · HSM-Backed JWT Signing (replaces Vault-only key handling)

Context: Per EP-PLAT-NB-04 (HSM-backed key custody), the platform JWT signing root must move from Vault-managed RSA keys to a PKCS#11 HSM (FIPS 140-2 Level 3). This epic replaces the key generation/storage story in EP-AUTH-01 (US-AUTH-002) for production environments while keeping a local-dev fallback.

US-AUTH-060 · PKCS#11 client and HSM connection pool

Type: Feature | Points: 5

Description: As auth-service, I need a PKCS#11 client integrated so that JWT signing operations call into the HSM rather than holding private key material in process memory.

Acceptance Criteria:

  • PKCS#11 library wrapper (@ghasi/hsm-client) loaded; supports Thales nShield, Entrust, and SoftHSM2 (for local-dev).
  • Connection pool of 4 sessions per pod; health-checked every 10 s.
  • HSM_PROVIDER env var selects backend; default softhsm2 for local-dev, nshield for prod.
  • Sign latency P99 ≤ 5 ms when HSM is colocated; P99 ≤ 25 ms across-region.

US-AUTH-061 · Migrate existing JWKS keys into HSM

Type: Migration | Points: 5

Description: Existing Vault-stored RSA keys must be migrated into the HSM partition without invalidating active sessions.

Acceptance Criteria:

  • Migration script generates new HSM-resident key pair with new kid; adds to JWKS as ACTIVE alongside the existing Vault-stored key (now RETIRING).
  • Both keys served via JWKS during the rotation grace window (30 min default).
  • After grace period, Vault key set to RETIRED and removed from JWKS.
  • Vault secret/auth/jwks/{kid}/private keys are wiped and audit-logged.
  • Rollback plan: keep Vault key ACTIVE until tokens signed by the HSM key have passed ≥ 10 000 successful verifications.

US-AUTH-062 · HSM-aware key rotation cron

Type: Feature | Points: 5

Description: Key rotation must continue to work end-to-end with HSM as the issuer.

Acceptance Criteria:

  • POST /v1/internal/auth/rotate-jwks generates a new HSM-resident key pair (PKCS#11 C_GenerateKeyPair).
  • New key added as ACTIVE; previous key marked RETIRING for 30 min.
  • Cron 0 0 1 * * (monthly) auto-rotates with PagerDuty notification.
  • Metric auth_hsm_rotation_total and auth_hsm_rotation_duration_seconds.
  • Alert: rotation failure → AuthHsmRotationFailed (Critical).

US-AUTH-063 · HSM unavailability fail-fast and circuit breaker

Type: Feature | Points: 3

Description: HSM unavailability must fail fast (no silent fallback to in-process keys) so that a hardware fault is loud and visible.

Acceptance Criteria:

  • HSM error during sign → 503 with code: "HSM_UNAVAILABLE"; no in-memory key fallback ever.
  • Circuit breaker opens after 3 consecutive HSM errors; half-open after 30 s.
  • When circuit open, /health/ready returns 503 (pod removed from Service endpoints).
  • Alert AuthHsmUnavailable (Critical) with runbook link.
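
A sketch of the breaker behaviour described above: three consecutive failures open the circuit, it becomes half-open after 30 s, and every failure surfaces as HSM_UNAVAILABLE with no fallback path. HsmCircuitBreaker and its method names are hypothetical, and the wrapped operation is synchronous for brevity.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class HsmUnavailableError extends Error {
  readonly code = "HSM_UNAVAILABLE";
  readonly httpStatus = 503;
  constructor() {
    super("HSM unavailable");
  }
}

const HSM_FAILURE_THRESHOLD = 3;
const HSM_HALF_OPEN_AFTER_MS = 30 * 1000;

class HsmCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAtMs === null) return "CLOSED";
    return this.now() - this.openedAtMs >= HSM_HALF_OPEN_AFTER_MS ? "HALF_OPEN" : "OPEN";
  }

  // Input for /health/ready: report not ready while the circuit is open.
  isReady(): boolean {
    return this.state() !== "OPEN";
  }

  // Run an HSM operation through the breaker. There is deliberately no fallback.
  run<T>(operation: () => T): T {
    const before = this.state();
    if (before === "OPEN") throw new HsmUnavailableError(); // fail fast, HSM not called
    try {
      const result = operation();
      this.consecutiveFailures = 0;
      this.openedAtMs = null; // a success closes the circuit
      return result;
    } catch {
      this.consecutiveFailures += 1;
      // A failed half-open trial, or the third consecutive failure, (re)opens the circuit.
      if (before === "HALF_OPEN" || this.consecutiveFailures >= HSM_FAILURE_THRESHOLD) {
        this.openedAtMs = this.now();
      }
      throw new HsmUnavailableError();
    }
  }
}
```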

EP-AUTH-08 · Break-Glass Admin Access + WebAuthn for Platform Staff

Context: Per [13-security-compliance-tenancy.md §2.2], break-glass platform-admin accounts bypass tenant IdPs and must use hardware WebAuthn. This epic implements that capability end-to-end with audit and dual-control.

US-AUTH-070 · Break-glass account model and WebAuthn enrolment

Type: Feature | Points: 5

Description: Platform staff with break-glass authority must enrol at least two FIDO2 hardware authenticators (YubiKey or equivalent).

Acceptance Criteria:

  • auth.break_glass_users table (userId, role, status ENUM(ACTIVE,SUSPENDED,RETIRED)).
  • WebAuthn enrolment ceremony at /v1/admin/break-glass/enroll; minimum 2 authenticators per user.
  • Authenticators stored as auth.webauthn_credentials (credentialId, publicKey, signCount, createdAt).
  • Enrolment requires existing platform admin approval (dual-control).

US-AUTH-071 · Break-glass login flow

Type: Feature | Points: 5

Description: Break-glass login must bypass all tenant IdPs and require WebAuthn assertion + a dual-approver acknowledgement.

Acceptance Criteria:

  • POST /v1/admin/break-glass/login/init → returns WebAuthn challenge.
  • POST /v1/admin/break-glass/login/finish → verifies assertion, then notifies a dual-approver (PagerDuty incident + Slack).
  • The dual-approver confirms via /v1/admin/break-glass/approve/:requestId; access is granted only if approval arrives within 5 minutes of the request AND a justification is supplied.
  • Issued JWT TTL is 15 min, non-refreshable, scope platform.break_glass.
  • Audit event auth.break_glass.granted.v1 includes initiator, approver, justification, source IP.

US-AUTH-072 · Break-glass session monitoring

Type: Feature | Points: 3

Description: Every API call made under a break-glass token must be logged in real time to a SIEM-forwarded stream.

Acceptance Criteria:

  • All break-glass requests carry X-BreakGlass: true and are mirrored to NATS subject auth.break_glass.activity.v1.
  • regulator-portal-service SIEM forwarder (per EP-REG-02) ingests this stream.
  • Real-time NOC banner shown when any break-glass session is active.

US-AUTH-073 · Break-glass quarterly access review

Type: Feature | Points: 3

Description: Break-glass authority must be reviewed quarterly; users not reviewed are auto-suspended.

Acceptance Criteria:

  • Cron 0 0 1 */3 * lists auth.break_glass_users and emits review tickets.
  • Users not re-attested within 30 d → status SUSPENDED.
  • Re-attestation requires CISO + CTO sign-off, recorded in audit.