Platform Admin Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: 12 Observability

1. SLIs and SLOs

| SLI | SLO | Window |
| --- | --- | --- |
| Internal flag evaluate availability | ≥ 99.9% | 30-day rolling |
| Internal flag evaluate p95 latency | ≤ 120 ms | 24 h |
| Aggregate health endpoint availability | ≥ 99.5% | 30-day |
| Aggregate health p95 response time | ≤ 2 s | 24 h |
| Config write p99 latency | ≤ 300 ms | 24 h |
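
To make the availability targets concrete, the sketch below converts an availability SLO into an allowed-downtime budget for its window. It assumes a time-based reading of availability (fraction of the window in which the endpoint is up); a request-based reading would count failed requests instead. The function name `error_budget` is illustrative and not part of the service.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime permitted by an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


window = timedelta(days=30)
flag_budget = error_budget(0.999, window)    # internal flag evaluate availability
health_budget = error_budget(0.995, window)  # aggregate health endpoint availability

print(f"flag evaluate budget: {flag_budget.total_seconds() / 60:.1f} min")
print(f"aggregate health budget: {health_budget.total_seconds() / 60:.1f} min")
```

Under this reading, the 99.9% target leaves roughly 43 minutes of downtime per 30 days, and the 99.5% target leaves roughly 216 minutes.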

2. Key metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pltadm_flag_evaluate_duration_seconds | Histogram | key, decision, reason | Flag evaluation latency |
| pltadm_flag_cache_hits_total | Counter | key | Redis cache hits for flag evaluation |
| pltadm_config_mutations_total | Counter | key, scope | Config upsert/archive events |
| pltadm_health_aggregate_status | Gauge | overall | 0 = healthy, 1 = degraded, 2 = unhealthy |
| pltadm_health_source_status | Gauge | service_id | Per-service health status |
| pltadm_health_poll_duration_seconds | Histogram | service_id | Health probe duration |
| pltadm_outbox_unpublished_age_seconds | Gauge | (none) | Age of oldest unpublished event |
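
The latency SLOs are percentiles, while a histogram metric only stores bucket counts, so a percentile has to be estimated. The sketch below assumes pltadm_flag_evaluate_duration_seconds is a Prometheus-style histogram with cumulative buckets and applies linear interpolation inside the bucket that contains the target rank, which is the same idea PromQL's `histogram_quantile` uses. The bucket boundaries and counts are made-up sample values, not measurements from the service.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +inf bucket
    whose count is the total number of observations.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since no upper limit is known.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts; bounds are in seconds.
sample_buckets = [
    (0.025, 400),
    (0.05, 700),
    (0.1, 940),
    (0.25, 990),
    (math.inf, 1000),
]

p95 = histogram_quantile(0.95, sample_buckets)
print(f"estimated p95: {p95 * 1000:.0f} ms")
```

With these sample counts the estimate lands at about 130 ms. The accuracy of such an estimate depends on bucket placement, so bucket bounds near the 120 ms SLO and the alert threshold would give tighter results than the coarse bounds used here.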

3. Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| FlagEvaluateLatencyHigh | p95 > 150 ms for 5 min | High |
| FlagCacheHitRateLow | Cache hit rate < 50% for 5 min | Medium |
| PlatformHealthUnhealthy | pltadm_health_aggregate_status = 2 for > 2 min | Critical |
| HealthSourceStale | Any source not polled within 2× staleness threshold | High |
| OutboxUnpublishedOld | Age > 60 s | High |
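
Several of the conditions above carry a hold duration ("for 5 min"), meaning the threshold must be breached continuously before the alert fires. The sketch below is one minimal way to model that behavior; the class `ForDurationAlert`, the 60-second sampling interval, and the sample p95 values are all assumptions for illustration, and the actual alerting backend is not specified in this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ForDurationAlert:
    """Fires once `condition` has held continuously for `hold_seconds`."""

    condition: Callable[[float], bool]
    hold_seconds: float
    _breach_started: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample and return True if the alert is firing."""
        if not self.condition(value):
            # Any healthy sample resets the hold timer.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.hold_seconds


# FlagEvaluateLatencyHigh: p95 > 150 ms for 5 min.
latency_alert = ForDurationAlert(condition=lambda p95: p95 > 0.150, hold_seconds=300)

# Hypothetical p95 samples in seconds, one every 60 s.
samples = [0.16, 0.17, 0.14, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16]
fired = [latency_alert.observe(i * 60.0, value) for i, value in enumerate(samples)]
print(fired)
```

In this run the dip to 140 ms at the third sample resets the timer, so the alert only fires on the last sample, 300 seconds after the breach resumed. The hold duration trades detection speed for fewer alerts on short spikes.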

4. Dashboards

| Dashboard | Description |
| --- | --- |
| Feature Flag Operations | Evaluate latency; cache hit ratio; flag mutation frequency |
| Platform Health | Aggregate status heatmap; per-service status over time |
| Config Governance | Config mutation frequency; history growth |
| Outbox Health | Unpublished event age; publish failure rate |

5. Traces

| Span | Description |
| --- | --- |
| pltadm.flag.evaluate | Full flag evaluation including cache check |
| pltadm.health.aggregate | Health aggregation query |
| pltadm.health.poll | Per-service health probe |
| pltadm.config.upsert | Config write + history row |
| pltadm.outbox.publish | Event publish via outbox relay |