Platform Admin Service — Observability
Status: populated; Owner: TBD; Last updated: 2026-04-18; Companion: 12 Observability
1. SLIs and SLOs
| SLI | SLO | Window |
|---|---|---|
| Internal flag evaluate availability | ≥ 99.9% | 30-day rolling |
| Internal flag evaluate p95 latency | ≤ 120 ms | 24 h |
| Aggregate health endpoint availability | ≥ 99.5% | 30-day |
| Aggregate health p95 response time | ≤ 2 s | 24 h |
| Config write p99 latency | ≤ 300 ms | 24 h |
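The two availability targets can also be read as error budgets: the amount of unavailability each window tolerates before the SLO is breached. The Python sketch below assumes each 30-day window is exactly 30 × 24 h. `error_budget` is a hypothetical helper, not part of the service.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Unavailability an availability SLO tolerates over `window`.

    Hypothetical helper: budget = window * (1 - SLO).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # timedelta * float rounds the result to the nearest microsecond.
    return window * (1.0 - slo)


if __name__ == "__main__":
    thirty_days = timedelta(days=30)
    # Internal flag evaluate availability, >= 99.9% over 30 days
    print(error_budget(0.999, thirty_days))  # 0:43:12
    # Aggregate health endpoint availability, >= 99.5% over 30 days
    print(error_budget(0.995, thirty_days))  # 3:36:00
```

Under that assumption, the flag evaluate path can be unavailable for about 43 minutes per window. The aggregate health endpoint can be unavailable for about 3.6 hours.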
2. Key metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| pltadm_flag_evaluate_duration_seconds | Histogram | key, decision, reason | Flag evaluation latency |
| pltadm_flag_cache_hits_total | Counter | key | Redis cache hits for flag evaluation |
| pltadm_config_mutations_total | Counter | key, scope | Config upsert/archive events |
| pltadm_health_aggregate_status | Gauge | overall | 0=healthy, 1=degraded, 2=unhealthy |
| pltadm_health_source_status | Gauge | service_id | Per-service health status |
| pltadm_health_poll_duration_seconds | Histogram | service_id | Health probe duration |
| pltadm_outbox_unpublished_age_seconds | Gauge | (none) | Age of oldest unpublished event |
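This document does not say how pltadm_health_aggregate_status is derived from the per-service gauges. The sketch below makes two assumptions. First, pltadm_health_source_status uses the same 0/1/2 encoding documented for the aggregate. Second, the rollup is worst-of, so the overall status is the most severe source status. The sketch also treats an empty source set as unhealthy. `HealthStatus`, `aggregate_status`, and the service ids in the example are illustrative names only.

```python
from enum import IntEnum
from typing import Mapping


class HealthStatus(IntEnum):
    """Gauge values, ordered so that a larger value is more severe."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def aggregate_status(per_service: Mapping[str, HealthStatus]) -> HealthStatus:
    """Assumed worst-of rollup: overall status is the most severe source status."""
    if not per_service:
        # Assumption: with no sources reporting, fail safe instead of reporting healthy.
        return HealthStatus.UNHEALTHY
    return max(per_service.values())


if __name__ == "__main__":
    sources = {
        "billing": HealthStatus.HEALTHY,  # hypothetical service_id
        "search": HealthStatus.DEGRADED,  # hypothetical service_id
    }
    overall = aggregate_status(sources)
    # int(overall) is the value that would be exported as the aggregate gauge.
    print(overall.name, int(overall))  # DEGRADED 1
```

With a worst-of rollup, a single unhealthy source drives the aggregate gauge to 2. That is the condition PlatformHealthUnhealthy pages on.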
3. Alerts
| Alert | Condition | Severity |
|---|---|---|
| FlagEvaluateLatencyHigh | p95 > 150 ms for 5 min | High |
| FlagCacheHitRateLow | Cache hit ratio (cache hits ÷ flag evaluations) < 50% for 5 min | Medium |
| PlatformHealthUnhealthy | pltadm_health_aggregate_status = 2 for > 2 min | Critical |
| HealthSourceStale | Any source not polled within 2× staleness threshold | High |
| OutboxUnpublishedOld | pltadm_outbox_unpublished_age_seconds > 60 s | High |
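Three details of the alert table are easy to get wrong when implementing these rules:

- A "for N min" condition must hold on every evaluation across the whole interval.
- pltadm_flag_cache_hits_total alone does not give the cache hit rate a denominator.
- HealthSourceStale compares each source's poll age against twice its staleness threshold.

The sketch below assumes the denominator is the number of flag evaluations over the same interval, for example the observation count of pltadm_flag_evaluate_duration_seconds. It also assumes staleness is measured from each source's last poll time. `PendingAlert`, `cache_hit_ratio`, and `stale_sources` are hypothetical names. The alerting system would normally evaluate these rules; this only illustrates the intended semantics.

```python
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass
class PendingAlert:
    """Fires once its condition has held on every evaluation for `hold_seconds`."""
    name: str
    hold_seconds: float
    _true_since: Optional[float] = field(default=None, init=False, repr=False)

    def evaluate(self, condition: bool, now: float) -> bool:
        if not condition:
            self._true_since = None  # any false evaluation resets the timer
            return False
        if self._true_since is None:
            self._true_since = now  # condition just became true: alert is pending
        return now - self._true_since >= self.hold_seconds


def cache_hit_ratio(hits: float, evaluations: float) -> Optional[float]:
    """Hits divided by evaluations over the same interval; None when there was no traffic."""
    if evaluations <= 0:
        return None
    return hits / evaluations


def stale_sources(
    last_polled: Mapping[str, float], now: float, staleness_threshold: float
) -> list[str]:
    """Sources whose last poll is older than 2x the staleness threshold."""
    limit = 2 * staleness_threshold
    return sorted(sid for sid, ts in last_polled.items() if now - ts > limit)


if __name__ == "__main__":
    low_hit_rate = PendingAlert("FlagCacheHitRateLow", hold_seconds=300)
    # One evaluation per minute with a 40% hit ratio: firing becomes True at t=300.
    for t in range(0, 360, 60):
        ratio = cache_hit_ratio(hits=40, evaluations=100)
        firing = low_hit_rate.evaluate(ratio is not None and ratio < 0.5, now=t)
        print(t, firing)
    # "billing" was last polled 70 s ago, beyond 2 x 30 s, so it is stale.
    print(stale_sources({"billing": 0.0, "search": 50.0}, now=70.0, staleness_threshold=30.0))
```

`cache_hit_ratio` returns None when there were no evaluations. This keeps FlagCacheHitRateLow from firing during periods with no flag traffic at all.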
4. Dashboards
| Dashboard | Description |
|---|---|
| Feature Flag Operations | Evaluate latency; cache hit ratio; flag mutation frequency |
| Platform Health | Aggregate status heatmap; per-service status over time |
| Config Governance | Config mutation frequency; history growth |
| Outbox Health | Unpublished event age; publish failure rate |
5. Traces
| Span | Description |
|---|---|
| pltadm.flag.evaluate | Full flag evaluation including cache check |
| pltadm.health.aggregate | Health aggregation query |
| pltadm.health.poll | Per-service health probe |
| pltadm.config.upsert | Config write + history row |
| pltadm.outbox.publish | Event publish via outbox relay |
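The pltadm.flag.evaluate span covers the full evaluation including the cache check. It should therefore open before the cache lookup and close after any fallback evaluation. The sketch below shows that shape using several stand-ins, all assumptions:

- A hypothetical in-memory `RecordingTracer` replaces a real tracing SDK.
- A plain dict replaces Redis.
- A caller-supplied `evaluate_rules` function replaces the flag rule engine.

The `cache_hit` and `decision` attribute names are also assumptions.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Span:
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    duration_s: float = 0.0


class RecordingTracer:
    """Hypothetical stand-in for a tracing SDK: keeps finished spans in memory."""

    def __init__(self) -> None:
        self.finished: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: object) -> Iterator[Span]:
        s = Span(name, dict(attributes))
        start = time.perf_counter()
        try:
            yield s
        finally:
            # Recorded even if the body raises, so failed evaluations still appear.
            s.duration_s = time.perf_counter() - start
            self.finished.append(s)


def evaluate_flag(
    tracer: RecordingTracer,
    cache: Dict[str, bool],
    key: str,
    evaluate_rules: Callable[[str], bool],
) -> bool:
    # One span wraps both the cache check and any fallback rule evaluation.
    with tracer.span("pltadm.flag.evaluate", key=key) as span:
        hit = key in cache
        decision = cache[key] if hit else evaluate_rules(key)
        if not hit:
            cache[key] = decision
        span.attributes.update(cache_hit=hit, decision=decision)
        return decision


if __name__ == "__main__":
    tracer = RecordingTracer()
    cache: Dict[str, bool] = {}
    evaluate_flag(tracer, cache, "new-checkout", lambda key: True)  # cache miss
    evaluate_flag(tracer, cache, "new-checkout", lambda key: False)  # cache hit
    for s in tracer.finished:
        print(s.name, s.attributes)
```

Because the cache check sits inside the span, the span duration on a miss includes rule evaluation. Comparing hit and miss spans therefore shows what a cache miss costs.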