Operator Management Service — Observability
Status: populated | Owner: Platform Engineering + SRE | Last updated: 2026-04-18 | Companion: 12 Observability
1. SLIs / SLOs
| SLI | SLO | Window | Measurement |
|---|---|---|---|
| Admin API P95 latency | ≤ 300 ms | 30 d | All admin endpoints |
| Admin API availability | ≥ 99.5% | 30 d | Non-5xx ratio at Kong |
| Internal API P95 latency | ≤ 50 ms | 30 d | /v1/internal/operators/:id/credentials |
| Vault read P95 | ≤ 200 ms | 7 d | ops_vault_read_duration_seconds |
| Config event publish success rate | ≥ 99.9% | 7 d | ops_nats_publish_total{result="ok"} / ops_nats_publish_total (all results) |
| Health state propagation latency | ≤ 5 s | 7 d | smpp-connector event → Redis SET |
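
To make the ratio targets above concrete, an SLO can be restated as an error budget. The following is a minimal sketch, not part of the service: the helper names `errorBudgetMinutes` and `budgetRemaining` are hypothetical, and the request counts in the usage note are invented for illustration.

```typescript
// Hypothetical helpers for reasoning about ratio SLOs; names are assumptions.

/** Minutes of full unavailability a ratio SLO tolerates over its window. */
function errorBudgetMinutes(sloRatio: number, windowDays: number): number {
  if (sloRatio <= 0 || sloRatio >= 1) {
    throw new RangeError("sloRatio must be strictly between 0 and 1");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloRatio) * windowMinutes;
}

/**
 * Fraction of the error budget still unspent, given observed request counts.
 * 1 means untouched, 0 means exhausted, negative means the SLO is breached.
 */
function budgetRemaining(sloRatio: number, total: number, failed: number): number {
  if (sloRatio <= 0 || sloRatio >= 1) {
    throw new RangeError("sloRatio must be strictly between 0 and 1");
  }
  if (total === 0) return 1; // no traffic, so nothing has been spent
  const allowedFailures = (1 - sloRatio) * total;
  return 1 - failed / allowedFailures;
}
```

Under these assumptions, the 99.5% admin availability target over 30 days allows roughly 216 minutes of full outage, and `budgetRemaining(0.995, 100000, 250)` reports about half the budget left.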
2. Metrics
Exposed at /metrics (Prometheus):
ops_operator_create_total{result="ok|duplicate|vault_error"}
ops_operator_update_total{result="ok|not_found|vault_error"}
ops_operator_delete_total{result="ok|not_found"}
ops_credentials_read_total{result="ok|not_found|vault_error"}
ops_vault_read_duration_seconds_bucket
ops_vault_write_duration_seconds_bucket
ops_nats_publish_total{event="created|updated|deleted|health", result="ok|error"}
ops_health_transition_total{from, to}
ops_pg_errors_total{op="insert|update|select"}
ops_redis_set_total{result="ok|error"}
ops_routing_rule_create_total{result="ok|conflict|error"}
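
The `_bucket` series above are cumulative histogram buckets, so a P95 such as the Vault read SLI has to be estimated from bucket counts. The sketch below shows the linear interpolation commonly used for this; Prometheus's `histogram_quantile` works along the same lines. The function name `estimateQuantile` and the bucket boundaries in the usage note are assumptions, since the document does not specify the service's bucket layout.

```typescript
// Illustrative only: bucket shape and function name are assumptions.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

/** Estimate the q-quantile (0..1) from cumulative histogram buckets. */
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new RangeError("q must be in [0, 1]");
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;

  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN; // no observations yet

  const rank = q * total;
  // First bucket whose cumulative count reaches the rank; always found,
  // because the last bucket holds the total.
  const i = sorted.findIndex((b) => b.count >= rank);

  // The +Inf bucket has no finite width, so fall back to the highest finite bound.
  if (sorted[i].le === Infinity) {
    return i > 0 ? sorted[i - 1].le : NaN;
  }

  const lowerBound = i > 0 ? sorted[i - 1].le : 0;
  const lowerCount = i > 0 ? sorted[i - 1].count : 0;
  const inBucket = sorted[i].count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  const fraction = (rank - lowerCount) / inBucket;
  return lowerBound + (sorted[i].le - lowerBound) * fraction;
}
```

With hypothetical buckets at 0.05, 0.1, 0.2, 0.5 seconds and +Inf holding cumulative counts 50, 80, 96, 100, 100, the estimated P95 is about 0.194 s, which would sit just inside a 200 ms target. The estimate is only as fine as the bucket boundaries near the SLO threshold.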
3. Traces
OpenTelemetry spans (parent from Kong or internal caller):
ops.admin.createOperator → ops.vault.writeCredentials → ops.pg.insertOperator → ops.nats.publish{event=operator.config.created}
ops.internal.getCredentials → ops.vault.readCredentials
ops.health.ingest → ops.redis.setHealthCache → ops.nats.publish{event=operator.health}
Attributes: ops.operator_id, ops.action, vault.path.
4. Logs (Pino → Loki)
Fields: level, ts, service=operator-management-service, operatorId, action, adminId, durationMs, traceId, spanId.
Sensitive fields never logged: password, full Vault secret payload.
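
One way to enforce the rule above is to scrub log objects before they reach the logger. The sketch below is a dependency-free illustration; the key names `password` and `vaultSecret` and the helper `redact` are assumptions, not the service's actual field names. Pino also ships a built-in `redact` option that can serve the same purpose.

```typescript
// Hypothetical key names; adjust to whatever the service actually uses.
const SENSITIVE_KEYS: ReadonlySet<string> = new Set(["password", "vaultSecret"]);

type LogValue =
  | string
  | number
  | boolean
  | null
  | undefined
  | LogValue[]
  | { [key: string]: LogValue };

/** Return a deep copy of `value` with sensitive keys replaced by a marker. */
function redact(value: LogValue, sensitive: ReadonlySet<string> = SENSITIVE_KEYS): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, sensitive));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const key of Object.keys(value)) {
      // Replace the whole subtree so no part of a secret payload leaks.
      out[key] = sensitive.has(key) ? "[REDACTED]" : redact(value[key], sensitive);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Replacing the whole value under a sensitive key, rather than walking into it, keeps a full Vault payload out of the logs even when its inner structure is unknown.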
5. Dashboards (Grafana)
- Operator Config Overview — create/update/delete rates, Vault latency, NATS publish success rate
- Health State Map — real-time health state per operator (colored table)
- Vault Dependency — read/write latency, error rate, token renewal status
- Internal API — credential read rate, latency, error ratio
6. Alerts
| Alert | Condition | Runbook |
|---|---|---|
| OpsVaultErrors | Vault error rate > 5% for 2 m | runbooks/ops/vault-errors.md |
| OpsNatsPublishErrors | Publish error > 3/min for 5 m | runbooks/ops/nats-degraded.md |
| OpsPgErrors | PG errors > 5/min | runbooks/ops/pg-down.md |
| OpsRedisErrors | Redis errors > 5/min | runbooks/ops/redis-down.md |
| OpsHealthCacheStaleness | Any operator health key missing for > 90 s | runbooks/ops/health-cache-stale.md |
| OpsAdminHigh5xx | 5xx ratio > 2% for 5 m | runbooks/ops/admin-5xx.md |
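
Several conditions above combine a threshold with a hold duration ("for 2 m", "for 5 m"). The sketch below illustrates that semantics under the assumption that the condition must hold continuously across evaluations before the alert fires. The document does not say how alerts are evaluated, so the `Sample` shape and the `isFiring` helper are hypothetical.

```typescript
// Illustrative model of a "threshold for duration" alert; not the real rule engine.
interface Sample {
  t: number;     // evaluation time in seconds
  value: number; // e.g. error ratio at that time
}

/**
 * True if, at the last sample, `value > threshold` has held continuously
 * for at least `forSeconds`. Samples must be ordered by time.
 */
function isFiring(samples: Sample[], threshold: number, forSeconds: number): boolean {
  let pendingSince: number | null = null;
  let firing = false;
  for (const s of samples) {
    if (s.value > threshold) {
      if (pendingSince === null) pendingSince = s.t; // condition just became true
      firing = s.t - pendingSince >= forSeconds;
    } else {
      pendingSince = null; // any dip resets the hold timer
      firing = false;
    }
  }
  return firing;
}
```

For an OpsVaultErrors-style rule (ratio above 0.05 for 120 s), five consecutive 30-second samples at 0.08 would fire at the fifth one, while a single dip below the threshold in the middle restarts the timer.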
7. Readiness Probe
/health/ready returns 200 only when all four dependencies are healthy: PG reachable, Redis reachable, NATS connected, and Vault reachable with a valid token.
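
The aggregation behind such a probe can be sketched as below. This is an illustration under stated assumptions: the `Check` type, the helper names, and the 503 status for the not-ready case are not specified by this document, and the real checks would call the actual PG, Redis, NATS, and Vault clients.

```typescript
// Hypothetical readiness aggregation; 503 on failure is an assumption.
type Check = () => Promise<boolean>;

interface ReadinessResult {
  status: 200 | 503;
  failed: string[]; // names of dependencies that did not pass
}

/** Map per-dependency outcomes to a probe result. */
function summarize(results: Record<string, boolean>): ReadinessResult {
  const failed = Object.keys(results).filter((name) => !results[name]);
  return { status: failed.length === 0 ? 200 : 503, failed };
}

/** Run all checks concurrently; a rejected check counts as not ready. */
async function checkReadiness(checks: Record<string, Check>): Promise<ReadinessResult> {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map((name) => checks[name]().catch(() => false)),
  );
  const results: Record<string, boolean> = {};
  names.forEach((name, i) => {
    results[name] = outcomes[i];
  });
  return summarize(results);
}
```

Running the checks concurrently keeps probe latency close to the slowest dependency rather than the sum of all four. In practice each check would likely also need its own timeout, which this sketch leaves out.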