Platform Admin Service — User Stories
Status: populated | Owner: TBD | Last updated: 2026-04-18
Story index
| ID | Epic | Summary | Priority | Milestone |
|---|---|---|---|---|
| PLTADM-US-001 | PLTADM-EPIC-01 | Create and update platform config entries | Must | M0 |
| PLTADM-US-002 | PLTADM-EPIC-01 | Read and delete platform config entries | Must | M0 |
| PLTADM-US-003 | PLTADM-EPIC-02 | Create and update feature flags | Must | M0 |
| PLTADM-US-004 | PLTADM-EPIC-02 | Manage tenant overrides on feature flags | Must | M0 |
| PLTADM-US-005 | PLTADM-EPIC-02 | Evaluate feature flag for a tenant | Must | M0 |
| PLTADM-US-006 | PLTADM-EPIC-03 | View aggregate platform health | Must | M0 |
| PLTADM-US-007 | PLTADM-EPIC-03 | Register and update health sources dynamically | Must | M1 |
| PLTADM-US-008 | PLTADM-EPIC-04 | Verify service meets coverage and latency targets | Must | M1 |
| PLTADM-US-009 | PLTADM-EPIC-04 | Validate observability and audit trail completeness | Must | M1 |
| PLTADM-US-010 | PLTADM-EPIC-05 | Retrieve paginated config history | Should | M1 |
| PLTADM-US-011 | PLTADM-EPIC-05 | List feature flags visible to a tenant admin | Should | M1 |
PLTADM-US-001 — Create and update platform config entries
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform operator, I can create and update config entries so that I can govern platform-wide and tenant-scoped settings |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | config-module |
| Fix version | M0 |
| Epic link | PLTADM-EPIC-01 |
| FR references | FR-PLTADM-CFG-001, FR-PLTADM-CFG-002 |
| Legacy FR refs | FR-ADM-CFG-001, FR-ADM-CFG-002 |
User story:
As a platform operator, I want to create and update platform config entries via the admin API so that I can govern platform-wide settings and per-tenant overrides from a single governed store.
Acceptance criteria:
Scenario: Create a new PLATFORM-scoped config entry
Given I am authenticated as SUPER_ADMIN
And the config key "session.timeout_minutes" exists in the allow-list
When I POST /api/v1/admin/platform-config
with { "key": "session.timeout_minutes", "value": "60", "scope": "PLATFORM" }
Then the response is 201 Created
And the config entry is persisted with scope=PLATFORM
And event "platform_admin.config.updated.v1" is published to NATS
Scenario: Reject an unknown config key
Given I am authenticated as SUPER_ADMIN
When I POST /api/v1/admin/platform-config
with { "key": "unknown.key", "value": "x", "scope": "PLATFORM" }
Then the response is 400 Bad Request
And the error code is "ADM_CONFIG_KEY_UNKNOWN"
And no event is published
Scenario: Update an existing config entry
Given config entry "mfa.required" exists with value "false"
When I PATCH /api/v1/admin/platform-config/mfa.required
with { "value": "true" }
Then the response is 200 OK
And a history record is appended to config_history with previous_value="false" and new_value="true"
Scenario: Create a TENANT-scoped config entry
Given tenant "ten_DEMO001" exists and is ACTIVE
When I POST /api/v1/admin/platform-config
with { "key": "session.timeout_minutes", "value": "30", "scope": "TENANT", "tenantId": "ten_DEMO001" }
Then the response is 201 Created
And the config entry is stored with tenantId="ten_DEMO001"
Scenario: Reject config mutation from non-SUPER_ADMIN
Given I am authenticated as TENANT_ADMIN
When I POST /api/v1/admin/platform-config
Then the response is 403 Forbidden
Technical notes:
- Allow-list enforced in `ConfigAllowListPort`; unknown keys return `ADM_CONFIG_KEY_UNKNOWN` (400).
- `scope` enum: `PLATFORM | TENANT`. TENANT scope requires `tenantId`.
- Type validation: each allow-listed key has a defined type (`string | number | boolean | json`); value is validated against the type before persist.
- `secret` type keys: value stored via Secrets Manager reference; `GET` returns `***REDACTED***`.
- Mutation emits `platform_admin.config.updated.v1`; downstream consumers (e.g., identity-service for `session.timeout_minutes`) subscribe.
- History record written in same DB transaction as config update (transactional outbox).
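The validation order in these notes (allow-list lookup, scope check, type check) can be sketched as a pure function. This is a minimal illustration, not the service's implementation: the `ConfigWrite` and `ValidationResult` shapes, the sample allow-list entries and their types, and the error codes `ADM_CONFIG_TENANT_ID_REQUIRED` and `ADM_CONFIG_VALUE_INVALID` are assumptions. Only `ADM_CONFIG_KEY_UNKNOWN` (400) comes from the story.

```typescript
// Hypothetical shapes: the story names ConfigAllowListPort but does not define it.
type ConfigValueType = "string" | "number" | "boolean" | "json" | "secret";

interface ConfigWrite {
  key: string;
  value: string;
  scope: "PLATFORM" | "TENANT";
  tenantId?: string;
}

type ValidationResult =
  | { ok: true }
  | { ok: false; status: number; code: string };

// Illustrative subset of the allow-list; the types of the first two keys are assumed.
const allowList = new Map<string, ConfigValueType>([
  ["session.timeout_minutes", "number"],
  ["mfa.required", "boolean"],
  ["smtp.password", "secret"],
]);

// Checks that a string value can be read as the declared type.
function matchesType(value: string, type: ConfigValueType): boolean {
  switch (type) {
    case "number":
      return value.trim() !== "" && Number.isFinite(Number(value));
    case "boolean":
      return value === "true" || value === "false";
    case "json":
      try {
        JSON.parse(value);
        return true;
      } catch {
        return false;
      }
    default:
      // "string" and "secret" accept any string value in this sketch.
      return true;
  }
}

function validateConfigWrite(
  input: ConfigWrite,
  allowed: Map<string, ConfigValueType>,
): ValidationResult {
  const type = allowed.get(input.key);
  if (type === undefined) {
    return { ok: false, status: 400, code: "ADM_CONFIG_KEY_UNKNOWN" };
  }
  if (input.scope === "TENANT" && !input.tenantId) {
    // Assumed error code; the story only says TENANT scope requires tenantId.
    return { ok: false, status: 400, code: "ADM_CONFIG_TENANT_ID_REQUIRED" };
  }
  if (!matchesType(input.value, type)) {
    // Assumed error code for a type mismatch.
    return { ok: false, status: 400, code: "ADM_CONFIG_VALUE_INVALID" };
  }
  return { ok: true };
}
```

Keeping validation free of I/O lets the allow-list and type rules be unit-tested without a database or NATS connection.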
Definition of done:
- `POST /api/v1/admin/platform-config` and `PATCH /api/v1/admin/platform-config/:key` implemented and tested
- Allow-list validation enforced; 23 seeded keys present in test fixtures
- `config_history` row appended on every mutation
- `platform_admin.config.updated.v1` published via outbox
- SUPER_ADMIN scope enforced; 403 returned for non-SUPER_ADMIN callers
- Unit coverage ≥ 80% for config-module
PLTADM-US-002 — Read and delete platform config entries
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform operator, I can list and delete config entries so that I have full read/delete lifecycle control over governed settings |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | config-module |
| Fix version | M0 |
| Epic link | PLTADM-EPIC-01 |
| FR references | FR-PLTADM-CFG-003, FR-PLTADM-CFG-004 |
| Legacy FR refs | FR-ADM-CFG-003, FR-ADM-CFG-004 |
User story:
As a platform operator, I want to list all config entries and delete individual ones so that I can inspect the current configuration state and remove obsolete tenant overrides.
Acceptance criteria:
Scenario: List all config entries
Given 10 config entries exist (3 PLATFORM-scoped, 7 TENANT-scoped across 2 tenants)
When I GET /api/v1/admin/platform-config
Then the response is 200 OK
And the response contains all 10 entries
And secret-type entries show value "***REDACTED***"
Scenario: Get a single config entry by key and scope
Given config entry "mfa.required" exists with scope=PLATFORM
When I GET /api/v1/admin/platform-config/mfa.required?scope=PLATFORM
Then the response is 200 OK
And the response includes { "key": "mfa.required", "scope": "PLATFORM", "value": "true" }
Scenario: Delete a TENANT-scoped config entry
Given a TENANT-scoped config for key "session.timeout_minutes" exists for tenant "ten_DEMO001"
When I DELETE /api/v1/admin/platform-config/session.timeout_minutes?scope=TENANT&tenantId=ten_DEMO001
Then the response is 204 No Content
And the entry is removed from platform_configs
Scenario: Secret key value is redacted in list response
Given config entry "smtp.password" with type=secret exists
When I GET /api/v1/admin/platform-config
Then the entry for "smtp.password" shows value "***REDACTED***"
Technical notes:
- `GET /api/v1/admin/platform-config` returns array; no pagination required at M0 (max ~200 entries at launch).
- Delete does not write a history record (deletion is audited via the outbox event only).
- Redis cache for config reads: TTL 5 min; cache invalidated on mutation event.
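The read cache described above (5-minute TTL, invalidated on mutation) follows a cache-aside pattern. The sketch below uses an in-memory map as a stand-in for Redis so it runs on its own. `TtlCache`, `readConfig`, and the cache key format are hypothetical names, and the injectable clock exists only to make expiry testable.

```typescript
// In-memory stand-in for the Redis cache; all names here are hypothetical.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined when missing or expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Called from the mutation event handler so the next read goes to the database.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

// Cache-aside read: try the cache first, then load and populate on a miss.
function readConfig(
  cache: TtlCache<string>,
  cacheKey: string,
  loadFromDb: (cacheKey: string) => string | undefined,
): string | undefined {
  const cached = cache.get(cacheKey);
  if (cached !== undefined) return cached;
  const fresh = loadFromDb(cacheKey);
  if (fresh !== undefined) cache.set(cacheKey, fresh);
  return fresh;
}
```

A cache constructed as `new TtlCache<string>(5 * 60 * 1000)` matches the 5-minute TTL in the note above.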
Definition of done:
- `GET /api/v1/admin/platform-config` and `GET /api/v1/admin/platform-config/:key` implemented
- `DELETE /api/v1/admin/platform-config/:key` implemented with scope/tenantId query params
- Secret-type redaction verified in list and get responses
- SUPER_ADMIN scope enforced on all write/delete endpoints
PLTADM-US-003 — Create and update feature flags
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform engineer, I can create and update feature flags so that I can control feature rollout without code deployment |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | feature-flag-module |
| Fix version | M0 |
| Epic link | PLTADM-EPIC-02 |
| FR references | FR-PLTADM-FF-001, FR-PLTADM-FF-002, FR-PLTADM-FF-003 |
| Legacy FR refs | FR-ADM-FF-001, FR-ADM-FF-002, FR-ADM-FF-003 |
User story:
As a platform engineer, I want to create, update, and archive feature flags via the admin API so that I can safely control feature rollout across all tenants or specific tenants without a code deployment.
Acceptance criteria:
Scenario: Create a new feature flag
Given I am authenticated as SUPER_ADMIN
When I POST /api/v1/admin/flags
with { "key": "OFFLINE_SYNC_V2", "defaultEnabled": false, "description": "Enable offline sync v2" }
Then the response is 201 Created
And the flag is persisted with status=ACTIVE and defaultEnabled=false
And event "platform_admin.flag.created.v1" is published
Scenario: Reject duplicate flag key
Given flag "OFFLINE_SYNC_V2" already exists
When I POST /api/v1/admin/flags with the same key
Then the response is 409 Conflict
And the error code is "ADM_FLAG_KEY_DUPLICATE"
Scenario: Update flag defaultEnabled and description
Given flag "OFFLINE_SYNC_V2" exists with defaultEnabled=false
When I PATCH /api/v1/admin/flags/OFFLINE_SYNC_V2
with { "defaultEnabled": true, "description": "Enable offline sync v2 - GA" }
Then the response is 200 OK
And the flag defaultEnabled is updated to true
And event "platform_admin.flag.updated.v1" is published
Scenario: Archive a flag
Given flag "OFFLINE_SYNC_V1" exists with status=ACTIVE
When I DELETE /api/v1/admin/flags/OFFLINE_SYNC_V1
Then the response is 200 OK
And the flag status is set to ARCHIVED
And event "platform_admin.flag.archived.v1" is published
And subsequent evaluate calls for this flag return false
Scenario: Cannot update an archived flag
Given flag "OFFLINE_SYNC_V1" has status=ARCHIVED
When I PATCH /api/v1/admin/flags/OFFLINE_SYNC_V1
Then the response is 422 Unprocessable Entity
And the error code is "ADM_FLAG_ARCHIVED"
Technical notes:
- `key` is globally unique; validated as UPPER_SNAKE_CASE.
- Archive is terminal: no reactivation path.
- Redis cache invalidated on create/update/archive via `platform_admin.flag.*.v1` events.
- Flag `status` enum: `ACTIVE | ARCHIVED`.
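The key rules and the terminal archive state can be illustrated with a small domain sketch. It is written under assumptions: the exact UPPER_SNAKE_CASE regular expression, the `Result` shape, the in-memory `Map` standing in for persistence, and the error code `ADM_FLAG_KEY_INVALID` are not defined by the story. `ADM_FLAG_KEY_DUPLICATE` (409) and `ADM_FLAG_ARCHIVED` (422) are taken from the acceptance criteria.

```typescript
type FlagStatus = "ACTIVE" | "ARCHIVED";

interface FeatureFlag {
  key: string;
  defaultEnabled: boolean;
  description: string;
  status: FlagStatus;
}

type Result<T> =
  | { ok: true; value: T }
  | { ok: false; status: number; code: string };

// Assumed pattern: uppercase letters and digits in underscore-separated segments.
const UPPER_SNAKE_CASE = /^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$/;

function createFlag(
  store: Map<string, FeatureFlag>,
  input: { key: string; defaultEnabled: boolean; description: string },
): Result<FeatureFlag> {
  if (!UPPER_SNAKE_CASE.test(input.key)) {
    // Assumed error code; the story does not name one for an invalid key format.
    return { ok: false, status: 400, code: "ADM_FLAG_KEY_INVALID" };
  }
  if (store.has(input.key)) {
    return { ok: false, status: 409, code: "ADM_FLAG_KEY_DUPLICATE" };
  }
  const flag: FeatureFlag = { ...input, status: "ACTIVE" };
  store.set(flag.key, flag);
  return { ok: true, value: flag };
}

function updateFlag(
  flag: FeatureFlag,
  patch: { defaultEnabled?: boolean; description?: string },
): Result<FeatureFlag> {
  // Archive is terminal, so any update to an archived flag is rejected.
  if (flag.status === "ARCHIVED") {
    return { ok: false, status: 422, code: "ADM_FLAG_ARCHIVED" };
  }
  return {
    ok: true,
    value: {
      ...flag,
      defaultEnabled: patch.defaultEnabled ?? flag.defaultEnabled,
      description: patch.description ?? flag.description,
    },
  };
}

function archiveFlag(flag: FeatureFlag): FeatureFlag {
  return { ...flag, status: "ARCHIVED" };
}
```

Because no function transitions a flag from `ARCHIVED` back to `ACTIVE`, the terminal state holds by construction rather than by a runtime check.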
Definition of done:
- `POST /api/v1/admin/flags`, `PATCH /api/v1/admin/flags/:key`, `DELETE /api/v1/admin/flags/:key` implemented
- Duplicate key rejection with `ADM_FLAG_KEY_DUPLICATE` (409)
- Archive terminal state enforced; archived flag updates return `ADM_FLAG_ARCHIVED` (422)
- Events published via outbox for create/update/archive
- Redis cache invalidated on mutation
PLTADM-US-004 — Manage tenant overrides on feature flags
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform engineer, I can set per-tenant overrides on feature flags so that I can enable or disable features for specific tenants independently |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | feature-flag-module |
| Fix version | M0 |
| Epic link | PLTADM-EPIC-02 |
| FR references | FR-PLTADM-FF-004, FR-PLTADM-FF-005 |
| Legacy FR refs | FR-ADM-FF-004, FR-ADM-FF-005 |
User story:
As a platform engineer, I want to add or remove per-tenant overrides on a feature flag so that I can enable a beta feature for specific tenants or block a feature for a problem tenant without affecting the global default.
Acceptance criteria:
Scenario: Enable a flag for a specific tenant (override)
Given flag "NEW_DASHBOARD" has defaultEnabled=false
When I POST /api/v1/admin/flags/NEW_DASHBOARD/overrides
with { "tenantId": "ten_DEMO001", "override": "ENABLED" }
Then the response is 200 OK
And ten_DEMO001 is added to enabledTenantIds
And evaluate(NEW_DASHBOARD, ten_DEMO001) returns true
And event "platform_admin.flag.updated.v1" is published
Scenario: Disable a flag for a specific tenant
Given flag "NEW_DASHBOARD" has defaultEnabled=true
When I POST /api/v1/admin/flags/NEW_DASHBOARD/overrides
with { "tenantId": "ten_DEMO001", "override": "DISABLED" }
Then ten_DEMO001 is added to disabledTenantIds
And evaluate(NEW_DASHBOARD, ten_DEMO001) returns false
Scenario: Remove a tenant override
Given ten_DEMO001 is in enabledTenantIds for "NEW_DASHBOARD"
When I DELETE /api/v1/admin/flags/NEW_DASHBOARD/overrides/ten_DEMO001
Then the tenant is removed from enabledTenantIds
And evaluate falls back to defaultEnabled
Scenario: Cannot set override on archived flag
Given flag "OLD_FEATURE" has status=ARCHIVED
When I POST /api/v1/admin/flags/OLD_FEATURE/overrides
Then the response is 422 Unprocessable Entity
And the error code is "ADM_FLAG_ARCHIVED"
Technical notes:
- `enabledTenantIds` and `disabledTenantIds` are JSONB arrays on the `feature_flags` row.
- A tenant can appear in at most one of the two arrays; adding to one removes from the other.
- Redis cache key: `flag:{key}:{tenantId}` and `flag:{key}:*`; invalidate both on override change.
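The mutual-exclusion rule for the two arrays is easiest to see as a pure function over the override lists. The sketch below is illustrative only: `FlagOverrides`, `setOverride`, `removeOverride`, and `overrideCacheKeys` are hypothetical names, and the cache keys are returned as plain strings instead of being sent to Redis.

```typescript
type Override = "ENABLED" | "DISABLED";

interface FlagOverrides {
  enabledTenantIds: string[];
  disabledTenantIds: string[];
}

// Removes the tenant from both arrays first, so it can only end up in one of them.
function setOverride(
  flag: FlagOverrides,
  tenantId: string,
  override: Override,
): FlagOverrides {
  const enabled = flag.enabledTenantIds.filter((id) => id !== tenantId);
  const disabled = flag.disabledTenantIds.filter((id) => id !== tenantId);
  if (override === "ENABLED") {
    enabled.push(tenantId);
  } else {
    disabled.push(tenantId);
  }
  return { enabledTenantIds: enabled, disabledTenantIds: disabled };
}

// Dropping the override makes evaluation fall back to defaultEnabled.
function removeOverride(flag: FlagOverrides, tenantId: string): FlagOverrides {
  return {
    enabledTenantIds: flag.enabledTenantIds.filter((id) => id !== tenantId),
    disabledTenantIds: flag.disabledTenantIds.filter((id) => id !== tenantId),
  };
}

// Both the tenant-scoped key and the flag-wide pattern are invalidated on change.
function overrideCacheKeys(flagKey: string, tenantId: string): string[] {
  return [`flag:${flagKey}:${tenantId}`, `flag:${flagKey}:*`];
}
```

Both mutators return new objects and never modify their input, which keeps the rule easy to assert in a domain-level unit test.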
Definition of done:
- `POST /api/v1/admin/flags/:key/overrides` and `DELETE /api/v1/admin/flags/:key/overrides/:tenantId` implemented
- Mutual exclusion of enabled/disabled arrays enforced at domain level
- Archived flag override rejection (`ADM_FLAG_ARCHIVED`)
- Cache invalidation on override change (scoped and global flag cache keys)
PLTADM-US-005 — Evaluate feature flag for a tenant
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a downstream service, I can call the internal evaluate endpoint so that I can determine feature availability for a tenant with sub-120ms latency |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | feature-flag-module |
| Fix version | M0 |
| Epic link | PLTADM-EPIC-02 |
| FR references | FR-PLTADM-FF-006, FR-PLTADM-ENH-003, FR-PLTADM-ENH-004 |
| Legacy FR refs | FR-ADM-FF-006, FR-ADM-ENH-003, FR-ADM-ENH-004 |
User story:
As a downstream service, I want to call `GET /internal/admin/flags/:key/evaluate?tenantId=...` so that I can determine whether a feature is enabled for a tenant with deterministic logic and p95 latency ≤ 120 ms.
Acceptance criteria:
Scenario: Evaluate archived flag returns false
Given flag "OLD_FEATURE" has status=ARCHIVED
When GET /internal/admin/flags/OLD_FEATURE/evaluate?tenantId=ten_DEMO001
Then the response is 200 OK
And { "enabled": false, "reason": "ARCHIVED" }
Scenario: Evaluate flag with disabled tenant override
Given flag "NEW_DASHBOARD" has defaultEnabled=true
And ten_DEMO001 is in disabledTenantIds
When GET /internal/admin/flags/NEW_DASHBOARD/evaluate?tenantId=ten_DEMO001
Then { "enabled": false, "reason": "TENANT_DISABLED" }
Scenario: Evaluate flag with enabled tenant override
Given flag "NEW_DASHBOARD" has defaultEnabled=false
And ten_DEMO001 is in enabledTenantIds
When GET /internal/admin/flags/NEW_DASHBOARD/evaluate?tenantId=ten_DEMO001
Then { "enabled": true, "reason": "TENANT_ENABLED" }
Scenario: Evaluate flag falls back to defaultEnabled
Given flag "NEW_DASHBOARD" has defaultEnabled=true
And ten_DEMO001 has no override
When GET /internal/admin/flags/NEW_DASHBOARD/evaluate?tenantId=ten_DEMO001
Then { "enabled": true, "reason": "DEFAULT" }
Scenario: Evaluate returns p95 <= 120ms under load
Given 1000 concurrent evaluate calls with Redis cache warm
Then p95 response time is <= 120ms
Scenario: Bootstrap endpoint returns all flag evaluations for a tenant
Given 15 active flags, 2 with overrides for ten_DEMO001
When GET /internal/admin/flags/bootstrap?tenantId=ten_DEMO001
Then the response contains all 15 flags with their evaluated enabled values
Technical notes:
- Evaluation logic order: `ARCHIVED → false` > `disabledTenantIds → false` > `enabledTenantIds → true` > `defaultEnabled`.
- Redis cache: `flag:{key}:{tenantId}` TTL 60 s; warm on first call; invalidated via NATS event listener.
- Bootstrap endpoint is used by service startup and client SDK hydration.
- Internal endpoint restricted to cluster-internal IPs (no JWT required at evaluate path; network policy enforces).
- Compatibility routes: `GET /api/platform/flags/:key/evaluate` redirects to internal path with deprecation header.
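The four-step evaluation order maps directly onto an ordered chain of early returns. The sketch below shows that ordering as a pure function. The `Flag` shape and the `bootstrapFlags` return shape are assumptions, while the `reason` values match those in the acceptance criteria.

```typescript
type EvaluationReason = "ARCHIVED" | "TENANT_DISABLED" | "TENANT_ENABLED" | "DEFAULT";

interface Flag {
  key: string;
  status: "ACTIVE" | "ARCHIVED";
  defaultEnabled: boolean;
  enabledTenantIds: string[];
  disabledTenantIds: string[];
}

interface Evaluation {
  enabled: boolean;
  reason: EvaluationReason;
}

// Steps are checked in priority order; the first match wins.
function evaluateFlag(flag: Flag, tenantId: string): Evaluation {
  if (flag.status === "ARCHIVED") {
    return { enabled: false, reason: "ARCHIVED" };
  }
  if (flag.disabledTenantIds.includes(tenantId)) {
    return { enabled: false, reason: "TENANT_DISABLED" };
  }
  if (flag.enabledTenantIds.includes(tenantId)) {
    return { enabled: true, reason: "TENANT_ENABLED" };
  }
  return { enabled: flag.defaultEnabled, reason: "DEFAULT" };
}

// Assumed bootstrap shape: a map from flag key to evaluated value, active flags only.
function bootstrapFlags(flags: Flag[], tenantId: string): Record<string, boolean> {
  const result: Record<string, boolean> = {};
  for (const flag of flags) {
    if (flag.status === "ACTIVE") {
      result[flag.key] = evaluateFlag(flag, tenantId).enabled;
    }
  }
  return result;
}
```

Because `evaluateFlag` has no I/O, its result for a given `(flag, tenantId)` pair is deterministic and can be cached under `flag:{key}:{tenantId}` without extra bookkeeping.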
Definition of done:
- `GET /internal/admin/flags/:key/evaluate` implements 4-step logic deterministically
- `GET /internal/admin/flags/bootstrap` returns all active flags for tenant
- Redis cache warm + TTL 60 s verified
- Event-driven cache invalidation wired (flag.updated + flag.archived events)
- p95 ≤ 120 ms verified under 1000 RPS load test
- Compatibility route present with `Deprecation` and `Sunset` response headers
PLTADM-US-006 — View aggregate platform health
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform operator, I can call the health aggregate endpoint so that I can triage incidents with a single view of all service statuses |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | health-module |
| Fix version | M0 |
| Epic link | PLTADM-EPIC-03 |
| FR references | FR-PLTADM-HLT-001, FR-PLTADM-HLT-002 |
| Legacy FR refs | FR-ADM-HLT-001, FR-ADM-HLT-002 |
User story:
As a platform operator, I want to call `GET /api/v1/admin/health/aggregate` so that I can see the overall platform health status and per-service breakdown to quickly triage incidents.
Acceptance criteria:
Scenario: Aggregate health returns overall UP when all services are healthy
Given 5 registered health sources, all returning healthy in the last poll
When GET /api/v1/admin/health/aggregate
Then the response is 200 OK
And { "overall": "UP", "services": [ { "name": "...", "status": "UP", "latencyMs": ... }, ... ] }
Scenario: Aggregate health returns DEGRADED when one service is unhealthy
Given service "notification-service" returned UNHEALTHY in last poll
When GET /api/v1/admin/health/aggregate
Then { "overall": "DEGRADED", "services": [ ..., { "name": "notification-service", "status": "DOWN" } ] }
Scenario: Response is served from 10s cache
Given the cache was populated 5 seconds ago
When two rapid successive GET /api/v1/admin/health/aggregate calls are made
Then both return 200 within 50ms (cache hit)
And no upstream health probes are triggered
Scenario: Response returns within 2 seconds
Given 27 registered health sources with staggered probe results
When GET /api/v1/admin/health/aggregate
Then the response time is <= 2000ms
Scenario: Non-authenticated request returns 401
Given no Authorization header
When GET /api/v1/admin/health/aggregate
Then 401 Unauthorized
Technical notes:
- Response cached at 10 s TTL in Redis; `HealthPollerJob` probes each source every 15 s in background.
- `overall` logic: `UP` if all sources healthy; `DEGRADED` if ≥1 down but <50%; `DOWN` if ≥50% down.
- At M0 health sources are seeded statically; dynamic registration added in PLTADM-US-007 (M1).
- Response size: 27 services × ~100 bytes = ~2.7 KB; no pagination needed.
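The `overall` thresholds can be checked with a short worked example. The function below is a sketch of the stated rule (`UP` when nothing is down, `DOWN` at 50% or more, `DEGRADED` otherwise). The type names are hypothetical, and treating an empty source list as `UP` is an assumption the story does not cover.

```typescript
type ServiceStatus = "UP" | "DOWN";
type OverallStatus = "UP" | "DEGRADED" | "DOWN";

function overallStatus(statuses: ServiceStatus[]): OverallStatus {
  const down = statuses.filter((s) => s === "DOWN").length;
  // Assumption: with no registered sources, nothing is down, so report UP.
  if (down === 0) return "UP";
  // down / total >= 0.5, written without division to avoid rounding issues.
  return down * 2 >= statuses.length ? "DOWN" : "DEGRADED";
}
```

With the 27 sources used in the latency scenario, 13 unhealthy sources (about 48%) still yield `DEGRADED`, and the 14th (about 52%) tips the result to `DOWN`.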
Definition of done:
- `GET /api/v1/admin/health/aggregate` implemented with correct overall logic
- 10 s Redis cache wired to HealthPollerJob
- p99 response time ≤ 2 s verified under load
- SUPER_ADMIN authentication enforced
- Static seed list of health sources populates on service start
PLTADM-US-007 — Register and update health sources dynamically
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a service instance, I can register myself as a health source so that dynamic deployments are reflected in the aggregate health without hardcoded lists |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S1 |
| Components | health-module |
| Fix version | M1 |
| Epic link | PLTADM-EPIC-03 |
| FR references | FR-PLTADM-HLT-003, FR-PLTADM-HLT-004, FR-PLTADM-ENH-002 |
| Legacy FR refs | FR-ADM-HLT-003, FR-ADM-HLT-004, FR-ADM-ENH-002 |
User story:
As a Kubernetes service instance, I want to POST my health endpoint to `/internal/admin/health/sources` on startup so that I am automatically included in the aggregate health view without requiring a hardcoded list update.
Acceptance criteria:
Scenario: Register a new health source
Given service "new-service" has not previously registered
When POST /internal/admin/health/sources
with { "name": "new-service", "healthUrl": "http://new-service:3020/health" }
Then the response is 201 Created
And the source is stored in health_sources
And event "platform_admin.health_source.registered.v1" is published
Scenario: Re-registration updates heartbeat timestamp
Given source "identity-service" was last registered 30 seconds ago
When POST /internal/admin/health/sources again with same payload
Then the response is 200 OK
And lastRegisteredAt is updated to now
Scenario: Stale source is marked unhealthy
Given source "old-service" last registered 90 seconds ago (staleness threshold=60s)
When HealthPollerJob runs
Then "old-service" status is set to UNHEALTHY in health_sources
And aggregate health reflects the degraded status
Scenario: Stale source re-registers and recovers
Given "old-service" is marked UNHEALTHY due to staleness
When "old-service" POSTs to /internal/admin/health/sources
Then lastRegisteredAt is updated
And on next poll the source returns to HEALTHY if its health endpoint responds 200
Technical notes:
- `POST /internal/admin/health/sources` is idempotent on `name`; upsert by name.
- Staleness threshold configurable via `PLTADM_HEALTH_STALENESS_S` (default 60 s).
- `HealthPollerJob` CronJob runs every 15 s; probes `healthUrl`; updates `health_check_results`.
- Dynamic registration replaces static seed list at M1; static seed remains as fallback behind feature flag `DYNAMIC_HEALTH_REGISTRATION`.
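The upsert-by-name and staleness rules can be sketched with an in-memory map standing in for the `health_sources` table. All function names here are hypothetical. The threshold is passed as a parameter instead of being read from `PLTADM_HEALTH_STALENESS_S`, and treating a source as stale only when its age is strictly greater than the threshold is an assumption about the boundary case.

```typescript
interface HealthSource {
  name: string;
  healthUrl: string;
  lastRegisteredAt: number; // epoch milliseconds
}

// Upsert by name: 201 for a first registration, 200 for a heartbeat refresh.
function registerSource(
  sources: Map<string, HealthSource>,
  input: { name: string; healthUrl: string },
  nowMs: number,
): 200 | 201 {
  const status = sources.has(input.name) ? 200 : 201;
  sources.set(input.name, { ...input, lastRegisteredAt: nowMs });
  return status;
}

// Assumption: a source is stale only when its age strictly exceeds the threshold.
function isStale(source: HealthSource, nowMs: number, thresholdS = 60): boolean {
  return nowMs - source.lastRegisteredAt > thresholdS * 1000;
}

// What the poller would mark UNHEALTHY on this run.
function staleSourceNames(
  sources: Map<string, HealthSource>,
  nowMs: number,
  thresholdS = 60,
): string[] {
  return Array.from(sources.values())
    .filter((source) => isStale(source, nowMs, thresholdS))
    .map((source) => source.name);
}
```

Because registration always overwrites `lastRegisteredAt`, a stale source recovers simply by registering again, which is the behaviour the recovery scenario describes.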
Definition of done:
- `POST /internal/admin/health/sources` upserts by name with heartbeat update
- Staleness check in HealthPollerJob marks stale sources UNHEALTHY
- `platform_admin.health_source.registered.v1` published on new registration
- Static seed list kept behind `DYNAMIC_HEALTH_REGISTRATION` flag for rollback
- Integration test: register → probe → aggregate reflects new source
PLTADM-US-008 — Verify service meets coverage and latency targets
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform team lead, I can confirm test coverage ≥ 80% and flag evaluate p95 ≤ 120ms so that platform-admin-service meets quality gates before M1 sign-off |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | cross-cutting |
| Fix version | M1 |
| Epic link | PLTADM-EPIC-04 |
| FR references | FR-PLTADM-NFR-001, FR-PLTADM-NFR-002 |
| Legacy FR refs | NFR-ADM-001, NFR-ADM-002 |
User story:
As a platform team lead, I want to run the test suite and see coverage ≥ 80% with zero lint/typecheck errors, and confirm flag evaluate p95 ≤ 120 ms so that platform-admin-service can pass the quality gate and proceed to production.
Acceptance criteria:
Scenario: Unit and integration coverage threshold met
When pnpm test:cov is executed
Then overall statement coverage is >= 80%
And branch coverage is >= 80%
And the following test files exist:
- config-module.spec.ts (allow-list, type validation, history)
- feature-flag.spec.ts (CRUD, evaluation logic, cache)
- health-aggregate.spec.ts (aggregation, staleness)
- tenant-isolation.spec.ts
- outbox.spec.ts
- inbox.spec.ts
Scenario: ESLint and TypeScript type checks pass
When pnpm lint && pnpm typecheck is executed
Then exit code is 0 with zero errors
Scenario: Flag evaluate p95 latency target met
Given Redis cache warm
When 1000 concurrent evaluate requests are sent
Then p95 response time is <= 120ms
And p99 response time is <= 200ms
Scenario: Aggregate health p99 latency target met
Given 27 registered health sources with cached results
When 100 concurrent aggregate health requests are sent
Then p99 response time is <= 2000ms
Technical notes:
- Load test script: `k6 run tests/load/flag-evaluate.k6.js` targeting 1000 RPS for 60 s.
- Coverage report generated by Vitest with `--coverage` flag; threshold enforced in `vitest.config.ts`.
- CI gate: coverage check runs in the `test` job; load test runs in a separate `perf-test` job on pre-prod.
Definition of done:
- `vitest.config.ts` coverage thresholds set to 80% (statements, branches, functions, lines)
- All 6 mandatory test files present and passing
- `pnpm lint && pnpm typecheck` returns exit 0
- k6 load test confirms p95 ≤ 120 ms at 1000 RPS (recorded in CI artefact)
- Health aggregate p99 ≤ 2 s verified
PLTADM-US-009 — Validate observability and audit trail completeness
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform SRE, I can confirm OTel traces are visible, SLO burn alerts are configured, and config audit history is preserved for 7 years |
| Status | To Do |
| Priority | Must |
| Labels | service:platform-admin, domain:platform_admin, slice:S0 |
| Components | cross-cutting |
| Fix version | M1 |
| Epic link | PLTADM-EPIC-04 |
| FR references | FR-PLTADM-NFR-003 |
| Legacy FR refs | NFR-ADM-003 |
User story:
As a platform SRE, I want to confirm that OpenTelemetry traces flow through to the tracing backend, SLO burn-rate alerts are active, and config_history rows are retained for 7 years so that the service meets platform-wide observability and compliance requirements.
Acceptance criteria:
Scenario: OTel trace visible for flag evaluate
Given OTel exporter is configured and staging is running
When GET /internal/admin/flags/OFFLINE_SYNC_V2/evaluate?tenantId=ten_DEMO001
Then a trace is visible in the tracing backend
And span "flag.evaluate" includes attributes: flag_key, tenant_id, evaluation_result, cache_hit
Scenario: OTel trace visible for config mutation
When PATCH /api/v1/admin/platform-config/:key is called
Then a trace with span "config.update" includes: config_key, config_scope, actor_sub
Scenario: SLO burn-rate alert fires on latency regression
Given the SLO for flag evaluate p95 is 120ms
When flag evaluate p95 exceeds 120ms for 5 consecutive minutes
Then an alert fires to the SRE on-call channel
Scenario: Config audit history retained for 7 years
Given a config mutation happened 2 years ago
When querying config_history for that key
Then the record is still present
And the retention policy annotation confirms 7-year retention with S3 archive after 2 years
Technical notes:
- OTel SDK: `@opentelemetry/sdk-node`; exporter: OTLP HTTP to `OTEL_EXPORTER_OTLP_ENDPOINT`.
- Key span names: `flag.evaluate`, `flag.cache.hit`, `flag.cache.miss`, `config.update`, `health.poll`.
- SLO burn-rate alert configured in Prometheus/Alertmanager: `platform_admin_flag_evaluate_p95_ms > 120` for 5 min → page SRE.
- `config_history` retention: 7-year PostgreSQL retention policy; rows older than 2 years archived to S3 via nightly job.
Definition of done:
- OTel instrumentation in config-module, feature-flag-module, health-module (all key spans instrumented)
- Traces visible in staging tracing backend
- SLO burn-rate Prometheus alert rule deployed and tested (fire/resolve cycle)
- `config_history` retention policy set via `pg_partman` or equivalent; S3 archive job configured
- Compliance team sign-off on audit trail documented in SERVICE_READINESS.md
PLTADM-US-010 — Retrieve paginated config history
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a platform operator, I can retrieve paginated config change history so that I can audit who changed what and when |
| Status | To Do |
| Priority | Should |
| Labels | service:platform-admin, domain:platform_admin, slice:S1 |
| Components | config-module |
| Fix version | M1 |
| Epic link | PLTADM-EPIC-05 |
| FR references | FR-PLTADM-ENH-001 |
| Legacy FR refs | FR-ADM-ENH-001 |
User story:
As a platform operator, I want to call `GET /api/v1/admin/platform-config/:key/history` so that I can audit every change to a config entry with who made it and what values changed, with cursor-based pagination.
Acceptance criteria:
Scenario: Retrieve history for a config key
Given 50 change records exist for key "session.timeout_minutes"
When GET /api/v1/admin/platform-config/session.timeout_minutes/history?limit=20
Then the response is 200 OK
And the response contains 20 records sorted by changed_at DESC
And each record includes: id, key, previous_value, new_value, changed_by, changed_at
And a nextCursor is included in the response
Scenario: Cursor-based pagination yields consistent results
Given the first page returned nextCursor "cursor_abc"
When GET /api/v1/admin/platform-config/session.timeout_minutes/history?limit=20&cursor=cursor_abc
Then the next 20 records are returned without duplicates
Scenario: History for unknown key returns 404
When GET /api/v1/admin/platform-config/unknown.key/history
Then 404 Not Found with error code "ADM_CONFIG_KEY_UNKNOWN"
Scenario: Secret-type key history redacts values
Given key "smtp.password" has type=secret
When GET /api/v1/admin/platform-config/smtp.password/history
Then all previous_value and new_value fields show "***REDACTED***"
Technical notes:
- Cursor: opaque base64-encoded `{ id, changed_at }` for keyset pagination.
- Default sort: `changed_at DESC`.
- History endpoint is read-only; no write operations.
- `changed_by` is the `sub` claim from the SUPER_ADMIN JWT that performed the mutation.
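Keyset pagination with an opaque cursor can be sketched over an in-memory array in place of the `config_history` table. Several details are assumptions: the function names, the use of `id` as a tie-breaker when two rows share a `changed_at`, ISO-8601 UTC timestamps (which sort correctly as strings), and the global `btoa`/`atob` functions (available in Node 16+ and browsers) for base64.

```typescript
interface HistoryRecord {
  id: string;
  key: string;
  previous_value: string;
  new_value: string;
  changed_by: string;
  changed_at: string; // assumed ISO-8601 UTC, so string comparison matches time order
}

interface Cursor {
  id: string;
  changed_at: string;
}

// The cursor is opaque to clients: base64 of the JSON keyset position.
function encodeCursor(cursor: Cursor): string {
  return btoa(JSON.stringify(cursor));
}

function decodeCursor(encoded: string): Cursor {
  return JSON.parse(atob(encoded)) as Cursor;
}

// Descending order on (changed_at, id); a positive result means a sorts after b.
function compareDesc(a: Cursor, b: Cursor): number {
  if (a.changed_at !== b.changed_at) return a.changed_at < b.changed_at ? 1 : -1;
  if (a.id !== b.id) return a.id < b.id ? 1 : -1;
  return 0;
}

function pageHistory(
  records: HistoryRecord[],
  limit: number,
  cursor?: string,
): { items: HistoryRecord[]; nextCursor?: string } {
  const sorted = [...records].sort(compareDesc);
  const after = cursor === undefined ? undefined : decodeCursor(cursor);
  // Keep only rows strictly after the cursor position in the sort order.
  const remaining =
    after === undefined ? sorted : sorted.filter((r) => compareDesc(r, after) > 0);
  const items = remaining.slice(0, limit);
  const last = items[items.length - 1];
  if (remaining.length > limit && last !== undefined) {
    return {
      items,
      nextCursor: encodeCursor({ id: last.id, changed_at: last.changed_at }),
    };
  }
  return { items };
}
```

Unlike offset pagination, the cursor pins a position in the sort order, so rows appended between requests do not shift later pages or produce duplicates.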
Definition of done:
- `GET /api/v1/admin/platform-config/:key/history` implemented with cursor pagination
- Sort order `changed_at DESC` enforced
- Secret-type value redaction in history responses
- `changed_by` populated from JWT `sub` on every mutation
- Integration test: 50-record seed → paginate through all records in 3 pages
PLTADM-US-011 — List feature flags visible to a tenant admin
| Field | Value |
|---|---|
| Issue type | Story |
| Summary | As a tenant admin, I can list feature flags applicable to my tenant so that I can understand which features are available to my organization |
| Status | To Do |
| Priority | Should |
| Labels | service:platform-admin, domain:platform_admin, slice:S1 |
| Components | feature-flag-module |
| Fix version | M1 |
| Epic link | PLTADM-EPIC-05 |
| FR references | FR-PLTADM-ENH-003 |
| Legacy FR refs | FR-ADM-ENH-003 |
User story:
As a tenant admin, I want to call `GET /api/v1/tenant/flags` so that I can see all active feature flags and their evaluated status for my tenant, enabling me to understand which features my organization can use.
Acceptance criteria:
Scenario: Tenant admin lists flags for their tenant
Given I am authenticated as TENANT_ADMIN for tenant "ten_DEMO001"
And 12 active flags exist; 3 have specific overrides for ten_DEMO001
When GET /api/v1/tenant/flags
Then the response is 200 OK
And the response contains 12 flags (archived flags excluded)
And each flag shows: key, description, enabled (evaluated for ten_DEMO001), hasOverride
Scenario: Archived flags are excluded from tenant listing
Given 2 flags have status=ARCHIVED
When GET /api/v1/tenant/flags
Then the response does not include the 2 archived flags
Scenario: Tenant admin can only see their own tenant's flag evaluation
Given I am TENANT_ADMIN for "ten_DEMO001"
When GET /api/v1/tenant/flags
Then all enabled values are evaluated for tenantId="ten_DEMO001"
And I cannot see flags evaluated for any other tenant
Scenario: SUPER_ADMIN can query flags for any tenant
Given I am authenticated as SUPER_ADMIN
When GET /api/v1/tenant/flags?tenantId=ten_DEMO001
Then flags evaluated for ten_DEMO001 are returned
Technical notes:
- `tenantId` resolved from JWT `tenant_id` claim for TENANT_ADMIN; can be overridden with query param for SUPER_ADMIN only.
- Response uses cached evaluate results where available (60 s TTL); falls back to DB.
- `hasOverride`: `true` if tenant appears in either `enabledTenantIds` or `disabledTenantIds`.
- Archived flags filtered at query level (`WHERE status = 'ACTIVE'`).
- Compatibility route: `GET /api/platform/flags` with deprecation header redirects to this endpoint.
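Tenant resolution and the per-flag view can be sketched as two pure functions. The `Caller` and `TenantFlagView` shapes are hypothetical, and silently ignoring a `tenantId` query parameter from a TENANT_ADMIN (rather than rejecting the request) is an assumption. The evaluation order matches the rule in PLTADM-US-005.

```typescript
interface Flag {
  key: string;
  description: string;
  status: "ACTIVE" | "ARCHIVED";
  defaultEnabled: boolean;
  enabledTenantIds: string[];
  disabledTenantIds: string[];
}

// Hypothetical view of the authenticated caller, built from JWT claims.
interface Caller {
  role: "SUPER_ADMIN" | "TENANT_ADMIN";
  tenantId?: string; // from the tenant_id claim
}

interface TenantFlagView {
  key: string;
  description: string;
  enabled: boolean;
  hasOverride: boolean;
}

// Only SUPER_ADMIN may override the tenant through the query parameter.
function resolveTenantId(caller: Caller, queryTenantId?: string): string | undefined {
  if (caller.role === "SUPER_ADMIN" && queryTenantId !== undefined) {
    return queryTenantId;
  }
  return caller.tenantId;
}

function listTenantFlags(flags: Flag[], tenantId: string): TenantFlagView[] {
  return flags
    .filter((flag) => flag.status === "ACTIVE") // mirrors WHERE status = 'ACTIVE'
    .map((flag) => {
      const disabled = flag.disabledTenantIds.includes(tenantId);
      const enabled = flag.enabledTenantIds.includes(tenantId);
      return {
        key: flag.key,
        description: flag.description,
        // Disabled override wins, then enabled override, then the default.
        enabled: disabled ? false : enabled ? true : flag.defaultEnabled,
        hasOverride: disabled || enabled,
      };
    });
}
```

Resolving the tenant before any flag data is read keeps the isolation rule in one place: `listTenantFlags` only ever sees a single `tenantId`.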
Definition of done:
- `GET /api/v1/tenant/flags` returns active flags with per-tenant evaluation
- Archived flags excluded from response
- TENANT_ADMIN scoped to their own `tenantId`; SUPER_ADMIN can override with query param
- `hasOverride` field correctly populated
- Compatibility route `GET /api/platform/flags` present with `Deprecation` response header
- Unit test: 15 flags (3 archived, 3 with overrides) → correct subset returned with correct enabled values