Operator Management Service — Observability

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability

1. SLIs / SLOs

SLI	SLO	Window	Measurement
Admin API P95 latency	≤ 300 ms	30 d	All admin endpoints
Admin API availability	≥ 99.5%	30 d	Non-5xx ratio at Kong
Internal API P95 latency	≤ 50 ms	30 d	`/v1/internal/operators/:id/credentials`
Vault read P95	≤ 200 ms	7 d	`ops_vault_read_duration_seconds`
Config event publish success rate	≥ 99.9%	7 d	`ops_nats_publish_total{result="ok"}` / `ops_nats_publish_total` (all results)
Health state propagation latency	≤ 5 s	7 d	smpp-connector event → Redis SET
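
As a rough illustration of how the ratio-based SLIs above (Admin API availability, config event publish success rate) relate to an error budget, the sketch below computes the SLI from two counts and reports how much of the allowed failure budget has been used. `evaluateRatioSlo` and `SloResult` are hypothetical names, and the request counts are made-up inputs, not measurements.

```typescript
// Hypothetical helper: names and sample counts are illustrative only.
interface SloResult {
  sli: number;            // observed good/total ratio
  met: boolean;           // true when sli >= target
  budgetConsumed: number; // fraction of allowed failures used (1 = exhausted)
}

function evaluateRatioSlo(good: number, total: number, target: number): SloResult {
  if (total <= 0) {
    // No traffic in the window: treat the SLO as met with an untouched budget.
    return { sli: 1, met: true, budgetConsumed: 0 };
  }
  const sli = good / total;
  const allowedFailures = (1 - target) * total;
  const actualFailures = total - good;
  let budgetConsumed: number;
  if (allowedFailures > 0) {
    budgetConsumed = actualFailures / allowedFailures;
  } else {
    // A 100% target leaves no budget: any failure exhausts it.
    budgetConsumed = actualFailures === 0 ? 0 : Infinity;
  }
  return { sli, met: sli >= target, budgetConsumed };
}

// Example: 100000 admin requests in the 30 d window, 300 of them 5xx, target 99.5%.
const adminAvailability = evaluateRatioSlo(99700, 100000, 0.995);
console.log(adminAvailability);
```

With these inputs the SLI is 99.7%, the SLO is met, and roughly 60% of the error budget is consumed.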

2. Metrics

Exposed at /metrics (Prometheus):

ops_operator_create_total{result="ok|duplicate|vault_error"}
ops_operator_update_total{result="ok|not_found|vault_error"}
ops_operator_delete_total{result="ok|not_found"}
ops_credentials_read_total{result="ok|not_found|vault_error"}
ops_vault_read_duration_seconds_bucket
ops_vault_write_duration_seconds_bucket
ops_nats_publish_total{event="created|updated|deleted|health", result="ok|error"}
ops_health_transition_total{from, to}
ops_pg_errors_total{op="insert|update|select"}
ops_redis_set_total{result="ok|error"}
ops_routing_rule_create_total{result="ok|conflict|error"}
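
The `_bucket` series above are cumulative histograms, and the P95 targets in section 1 are typically derived from them by interpolating inside the bucket that contains the target rank. The sketch below is a simplified version of that estimate (similar in spirit to PromQL's `histogram_quantile`). The bucket boundaries and counts are invented for illustration; the service's real bucket layout is not specified here.

```typescript
// Cumulative histogram bucket, as exposed by a Prometheus `_bucket` series.
interface Bucket {
  le: number;    // upper bound in seconds (Infinity for the +Inf bucket)
  count: number; // cumulative count of observations <= le
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank. Buckets must be sorted by `le` ascending.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets[buckets.length - 1];
  if (last === undefined || last.count === 0) return NaN;
  const rank = q * last.count;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Rank falls in the +Inf bucket: fall back to the highest finite bound.
      if (!Number.isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Made-up sample for ops_vault_read_duration_seconds_bucket.
const vaultReadBuckets: Bucket[] = [
  { le: 0.05, count: 600 },
  { le: 0.1, count: 900 },
  { le: 0.25, count: 980 },
  { le: 0.5, count: 1000 },
  { le: Infinity, count: 1000 },
];
const p95Seconds = estimateQuantile(0.95, vaultReadBuckets);
console.log(p95Seconds <= 0.2 ? "within 200 ms target" : "over 200 ms target");
```

Because the estimate is interpolated, its accuracy depends on how closely bucket boundaries bracket the SLO threshold; a boundary at or near 0.2 s would make the Vault read SLO easier to judge.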

3. Traces

OpenTelemetry spans (parent from Kong or internal caller):

ops.admin.createOperator
- ops.vault.writeCredentials
- ops.pg.insertOperator
- ops.nats.publish{event=operator.config.created}
ops.internal.getCredentials
- ops.vault.readCredentials
ops.health.ingest
- ops.redis.setHealthCache
- ops.nats.publish{event=operator.health}

Attributes: ops.operator_id, ops.action, vault.path.
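
To make the parent/child structure above concrete, here is a dependency-free sketch that records the `ops.admin.createOperator` tree. `withSpan` and `SpanRecord` are hypothetical stand-ins and not the OpenTelemetry API; a real service would use the OpenTelemetry SDK's tracer and context propagation, including for asynchronous calls. The operator ID and the `vault.path` value are placeholders.

```typescript
// Hypothetical stand-in for a tracer: records span name, parent, and attributes.
interface SpanRecord {
  name: string;
  parent: string | null;
  attributes: Record<string, string>;
}

const recorded: SpanRecord[] = [];
const openSpans: string[] = []; // names of currently open spans (synchronous code only)

function withSpan<T>(name: string, attributes: Record<string, string>, fn: () => T): T {
  const parent = openSpans[openSpans.length - 1] ?? null;
  recorded.push({ name, parent, attributes });
  openSpans.push(name);
  try {
    return fn();
  } finally {
    openSpans.pop(); // close the span even if fn throws
  }
}

function createOperator(operatorId: string): void {
  const rootAttrs = { "ops.operator_id": operatorId, "ops.action": "create" };
  withSpan("ops.admin.createOperator", rootAttrs, () => {
    // The three child spans mirror the tree listed above; bodies are left empty.
    withSpan("ops.vault.writeCredentials", { "vault.path": "<vault-path>" }, () => {});
    withSpan("ops.pg.insertOperator", {}, () => {});
    withSpan("ops.nats.publish", { event: "operator.config.created" }, () => {});
  });
}

createOperator("op-123");
for (const s of recorded) {
  console.log(`${s.parent ?? "(root)"} -> ${s.name}`);
}
```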

4. Logs (Pino → Loki)

Fields: level, ts, service=operator-management-service, operatorId, action, adminId, durationMs, traceId, spanId.

Sensitive fields never logged: password, full Vault secret payload.
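
One way to enforce this rule is to strip sensitive keys before a record reaches the logger. The sketch below is dependency-free and only illustrative: `password` comes from the rule above, while `secret` and `vaultPayload` are assumed names for how a Vault payload might appear in a log object. Pino also provides a built-in `redact` option that serves the same purpose.

```typescript
// Keys treated as sensitive. Only `password` is named in this document;
// `secret` and `vaultPayload` are hypothetical names for the Vault payload.
const SENSITIVE_KEYS = new Set(["password", "secret", "vaultPayload"]);

type LogValue =
  | string
  | number
  | boolean
  | null
  | LogValue[]
  | { [key: string]: LogValue };

// Return a deep copy with the value of every sensitive key replaced by a marker.
function redact(value: LogValue): LogValue {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

// Sample log record with made-up values.
const entry = redact({
  level: "info",
  service: "operator-management-service",
  operatorId: "op-123",
  action: "update",
  durationMs: 42,
  credentials: { username: "smsc-user", password: "hunter2" },
});
console.log(JSON.stringify(entry));
```

Redacting by key name catches nested occurrences as well, but it cannot detect a secret logged under an unexpected key, so it complements rather than replaces care at the call site.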

5. Dashboards (Grafana)

Operator Config Overview — create/update/delete rates, Vault latency, NATS publish success rate
Health State Map — real-time health state per operator (colored table)
Vault Dependency — read/write latency, error rate, token renewal status
Internal API — credential read rate, latency, error ratio

6. Alerts

Alert	Condition	Runbook
`OpsVaultErrors`	Vault error rate > 5% for 2 m	`runbooks/ops/vault-errors.md`
`OpsNatsPublishErrors`	Publish error > 3/min for 5 m	`runbooks/ops/nats-degraded.md`
`OpsPgErrors`	PG errors > 5/min	`runbooks/ops/pg-down.md`
`OpsRedisErrors`	Redis errors > 5/min	`runbooks/ops/redis-down.md`
`OpsHealthCacheStaleness`	Any operator health key missing for > 90 s	`runbooks/ops/health-cache-stale.md`
`OpsAdminHigh5xx`	5xx ratio > 2% for 5 m	`runbooks/ops/admin-5xx.md`
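
Several conditions above pair a threshold with a hold duration (for example "for 2 m"). The sketch below illustrates that semantics: the alert fires only when the threshold has been breached continuously for the full duration, and any healthy sample resets the timer. `isFiring`, the 30 s sampling interval, and the sample values are assumptions for illustration; in practice the alerting system performs this evaluation.

```typescript
interface Sample {
  tsSeconds: number; // sample timestamp in seconds
  value: number;     // e.g. Vault error ratio in [0, 1]
}

// True when `value > threshold` has held continuously for at least `forSeconds`,
// judged from samples sorted by ascending timestamp.
function isFiring(samples: Sample[], threshold: number, forSeconds: number): boolean {
  let breachStart: number | null = null;
  let lastTs: number | null = null;
  for (const s of samples) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.tsSeconds;
    } else {
      breachStart = null; // any healthy sample resets the hold timer
    }
    lastTs = s.tsSeconds;
  }
  return breachStart !== null && lastTs !== null && lastTs - breachStart >= forSeconds;
}

// OpsVaultErrors: error ratio > 5% for 2 m, sampled every 30 s (made-up data).
const sustained: Sample[] = [0, 30, 60, 90, 120].map((t) => ({ tsSeconds: t, value: 0.08 }));
const blip: Sample[] = [
  { tsSeconds: 0, value: 0.08 },
  { tsSeconds: 30, value: 0.01 },
  { tsSeconds: 60, value: 0.08 },
  { tsSeconds: 90, value: 0.08 },
  { tsSeconds: 120, value: 0.08 },
];
console.log(isFiring(sustained, 0.05, 120)); // true
console.log(isFiring(blip, 0.05, 120));      // false
```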

7. Readiness Probe

/health/ready returns 200 only when: PG reachable, Redis reachable, NATS connected, Vault reachable (token valid).
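
A minimal sketch of the all-or-nothing aggregation described above, assuming each dependency exposes an asynchronous check. The check functions here are stubs, `evaluateReadiness` is a hypothetical name, and returning 503 on failure is an assumption (the rule above only states when 200 is returned).

```typescript
type Check = () => Promise<boolean>;

interface ReadinessResult {
  status: 200 | 503; // 503 on failure is an assumption, not stated in this document
  checks: Record<string, boolean>;
}

// Run all dependency checks concurrently; a check that throws counts as failed.
async function evaluateReadiness(checks: Record<string, Check>): Promise<ReadinessResult> {
  const entries = Object.entries(checks);
  const outcomes = await Promise.all(
    entries.map(([, check]) => Promise.resolve().then(check).catch(() => false)),
  );
  const result: Record<string, boolean> = {};
  entries.forEach(([name], i) => {
    result[name] = outcomes[i] === true;
  });
  const allOk = outcomes.every((ok) => ok === true);
  return { status: allOk ? 200 : 503, checks: result };
}

// Stubbed checks: Vault fails as if its token were no longer valid.
const readiness = evaluateReadiness({
  pg: async () => true,
  redis: async () => true,
  nats: async () => true,
  vault: async () => {
    throw new Error("token expired");
  },
});
readiness.then((r) => console.log(r.status, r.checks));
```

Reporting the per-dependency outcome alongside the status code makes a failing probe easier to diagnose, provided the response body carries no secrets.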
