
Operator Management Service — Observability

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability

1. SLIs / SLOs

| SLI | SLO | Window | Measurement |
|---|---|---|---|
| Admin API P95 latency | ≤ 300 ms | 30 d | All admin endpoints |
| Admin API availability | ≥ 99.5% | 30 d | Non-5xx ratio at Kong |
| Internal API P95 latency | ≤ 50 ms | 30 d | /v1/internal/operators/:id/credentials |
| Vault read P95 | ≤ 200 ms | 7 d | ops_vault_read_duration_seconds |
| Config event publish success rate | ≥ 99.9% | 7 d | ops_nats_publish_total{result='ok'} / total |
| Health state propagation latency | ≤ 5 s | 7 d | smpp-connector event → Redis SET |
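
To make the ratio-style targets above concrete, the following sketch converts an SLO target and window into an error budget. The type and function names are hypothetical and do not come from the service code; only the target values are taken from the table.

```typescript
// Hypothetical helpers (not part of the service): error budget arithmetic
// for ratio-style SLOs such as availability or publish success rate.

interface RatioSlo {
  name: string;
  target: number;     // e.g. 0.995 for 99.5%
  windowDays: number; // rolling window length in days
}

// Fraction of events (or time) that may fail within the window.
function errorBudgetFraction(slo: RatioSlo): number {
  return 1 - slo.target;
}

// The same budget expressed as minutes of full outage within the window.
function errorBudgetMinutes(slo: RatioSlo): number {
  return errorBudgetFraction(slo) * slo.windowDays * 24 * 60;
}

// Share of the budget already consumed, given observed good/total counts.
// A value of 1 or more means the SLO is violated for the window.
function budgetConsumed(slo: RatioSlo, good: number, total: number): number {
  if (total === 0) return 0;
  const badRatio = (total - good) / total;
  return badRatio / errorBudgetFraction(slo);
}

const adminAvailability: RatioSlo = {
  name: "Admin API availability",
  target: 0.995,
  windowDays: 30,
};

const configPublishSuccess: RatioSlo = {
  name: "Config event publish success rate",
  target: 0.999,
  windowDays: 7,
};

console.log(adminAvailability.name, errorBudgetMinutes(adminAvailability).toFixed(1), "min");
console.log(configPublishSuccess.name, errorBudgetMinutes(configPublishSuccess).toFixed(1), "min");
```

With these targets, the Admin API budget works out to roughly 216 minutes (3.6 hours) of full outage per 30 days, and the publish budget to roughly 10 minutes per 7 days.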

2. Metrics

Exposed at /metrics (Prometheus):

ops_operator_create_total{result="ok|duplicate|vault_error"}
ops_operator_update_total{result="ok|not_found|vault_error"}
ops_operator_delete_total{result="ok|not_found"}
ops_credentials_read_total{result="ok|not_found|vault_error"}
ops_vault_read_duration_seconds_bucket
ops_vault_write_duration_seconds_bucket
ops_nats_publish_total{event="created|updated|deleted|health", result="ok|error"}
ops_health_transition_total{from, to}
ops_pg_errors_total{op="insert|update|select"}
ops_redis_set_total{result="ok|error"}
ops_routing_rule_create_total{result="ok|conflict|error"}
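
As an illustration of how the enumerated label values above might be recorded, here is a dependency-free sketch of a labeled counter that renders the Prometheus text exposition format. The `LabeledCounter` class and `recordCreate` helper are hypothetical; a real service would more likely use a Prometheus client library rather than this stand-in.

```typescript
// Illustrative stand-in for a metrics library; NOT the service's actual code.

// Allowed values for the `result` label of ops_operator_create_total.
type CreateResult = "ok" | "duplicate" | "vault_error";

class LabeledCounter<L extends string> {
  private readonly name: string;
  private readonly labelNames: L[];
  private readonly values = new Map<string, number>();

  constructor(name: string, labelNames: L[]) {
    this.name = name;
    this.labelNames = labelNames;
  }

  // Increments the series identified by the given label values.
  inc(labels: Record<L, string>, by: number = 1): void {
    const key = this.labelNames.map((n) => `${n}="${labels[n]}"`).join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Renders all series in the Prometheus text exposition format.
  // Label values are not escaped here because they come from fixed enums.
  render(): string {
    const lines: string[] = [`# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`);
    });
    return lines.join("\n");
  }
}

const operatorCreateTotal = new LabeledCounter("ops_operator_create_total", ["result"]);
const natsPublishTotal = new LabeledCounter("ops_nats_publish_total", ["event", "result"]);

// Typing the outcome as a union keeps label cardinality bounded at compile time.
function recordCreate(result: CreateResult): void {
  operatorCreateTotal.inc({ result });
}
```

Restricting `result` to a closed union is one way to keep label cardinality predictable; free-form strings such as error messages should not be used as label values.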

3. Traces

OpenTelemetry spans (parent from Kong or internal caller):

  • ops.admin.createOperator
    • ops.vault.writeCredentials
    • ops.pg.insertOperator
    • ops.nats.publish{event=operator.config.created}
  • ops.internal.getCredentials
    • ops.vault.readCredentials
  • ops.health.ingest
    • ops.redis.setHealthCache
    • ops.nats.publish{event=operator.health}

Attributes: ops.operator_id, ops.action, vault.path.
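
The parent/child nesting listed above can be sketched as follows. The `withSpan` helper is a tiny in-memory stand-in written for this example and is not the OpenTelemetry API; the handler body, the `vaultPath` parameter, and the choice to carry the event name as an `event` attribute are likewise assumptions.

```typescript
// In-memory stand-in for a tracer, used only to show span nesting.
interface SpanRecord {
  name: string;
  parent: string | null;
  attributes: Record<string, string>;
}

const recordedSpans: SpanRecord[] = [];
const activeStack: string[] = [];

// Records a span, runs `fn` with the span as the active parent, then pops it.
function withSpan<T>(name: string, attributes: Record<string, string>, fn: () => T): T {
  const parent = activeStack[activeStack.length - 1] ?? null;
  recordedSpans.push({ name, parent, attributes });
  activeStack.push(name);
  try {
    return fn();
  } finally {
    activeStack.pop();
  }
}

// Hypothetical handler: the three child steps are empty placeholders.
function createOperator(operatorId: string, vaultPath: string): void {
  const common = { "ops.operator_id": operatorId, "ops.action": "create" };
  withSpan("ops.admin.createOperator", common, () => {
    withSpan("ops.vault.writeCredentials", { ...common, "vault.path": vaultPath }, () => {});
    withSpan("ops.pg.insertOperator", common, () => {});
    withSpan("ops.nats.publish", { ...common, event: "operator.config.created" }, () => {});
  });
}
```

In a real deployment the root span would itself be a child of the span context propagated from Kong or the internal caller.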

4. Logs (Pino → Loki)

Fields: level, ts, service=operator-management-service, operatorId, action, adminId, durationMs, traceId, spanId.

Sensitive fields never logged: password, full Vault secret payload.
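
One way to enforce the rule above is to redact sensitive keys before a record reaches the logger. The sketch below is dependency-free; the key name `secret` is an assumed stand-in for the full Vault secret payload, and the `redact` function is hypothetical. Pino also offers a built-in `redact` option that may be a better fit in practice.

```typescript
// Keys whose values must never be logged. "password" comes from the rule above;
// "secret" is an ASSUMED key name for the Vault secret payload.
const SENSITIVE_KEYS = new Set<string>(["password", "secret"]);

type LogValue =
  | string
  | number
  | boolean
  | null
  | LogValue[]
  | { [key: string]: LogValue };

// Returns a deep copy of `value` with sensitive keys replaced by a marker.
function redact(value: LogValue): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const key of Object.keys(value)) {
      const child = value[key];
      if (SENSITIVE_KEYS.has(key)) {
        out[key] = "[REDACTED]";
      } else {
        out[key] = child === undefined ? null : redact(child);
      }
    }
    return out;
  }
  return value;
}

const safeRecord = redact({
  level: "info",
  service: "operator-management-service",
  operatorId: "op-1",
  action: "create",
  password: "not-for-logs",
  vault: { secret: { user: "u", pass: "p" } },
});
console.log(JSON.stringify(safeRecord));
```

The function copies rather than mutates its input, so the original object can still be used for the Vault write after the log line is emitted.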

5. Dashboards (Grafana)

  • Operator Config Overview — create/update/delete rates, Vault latency, NATS publish success rate
  • Health State Map — real-time health state per operator (colored table)
  • Vault Dependency — read/write latency, error rate, token renewal status
  • Internal API — credential read rate, latency, error ratio

6. Alerts

| Alert | Condition | Runbook |
|---|---|---|
| OpsVaultErrors | Vault error rate > 5% for 2 m | runbooks/ops/vault-errors.md |
| OpsNatsPublishErrors | Publish error > 3/min for 5 m | runbooks/ops/nats-degraded.md |
| OpsPgErrors | PG errors > 5/min | runbooks/ops/pg-down.md |
| OpsRedisErrors | Redis errors > 5/min | runbooks/ops/redis-down.md |
| OpsHealthCacheStaleness | Any operator health key missing for > 90 s | runbooks/ops/health-cache-stale.md |
| OpsAdminHigh5xx | 5xx ratio > 2% for 5 m | runbooks/ops/admin-5xx.md |
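
Several conditions above combine a threshold with a hold duration ("for 2 m", "for 5 m"). The sketch below shows one plausible reading of that semantics: the alert fires only if every sample has exceeded the threshold continuously for at least the hold duration. The `Sample` type and `firesFor` function are hypothetical and only approximate how an alerting engine evaluates such rules.

```typescript
interface Sample {
  tsSeconds: number; // sample timestamp
  value: number;     // e.g. an error ratio between 0 and 1
}

// Returns true when the threshold has been exceeded without interruption
// for at least `holdSeconds`, judged at `nowSeconds`.
// Samples must be sorted by ascending timestamp.
function firesFor(
  samples: Sample[],
  threshold: number,
  holdSeconds: number,
  nowSeconds: number
): boolean {
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      // Remember when the current uninterrupted breach began.
      if (breachStart === null) breachStart = sample.tsSeconds;
    } else {
      // Any sample at or below the threshold resets the hold timer.
      breachStart = null;
    }
  }
  return breachStart !== null && nowSeconds - breachStart >= holdSeconds;
}

// OpsVaultErrors-style condition: ratio > 5% held for 2 minutes.
const sustained: Sample[] = [0, 30, 60, 90, 120].map((t) => ({ tsSeconds: t, value: 0.06 }));
console.log(firesFor(sustained, 0.05, 120, 120));
```

The hold duration keeps short spikes from paging anyone: a single sample back under the threshold restarts the timer.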

7. Readiness Probe

/health/ready returns 200 only when all four dependencies are healthy: PG is reachable, Redis is reachable, NATS is connected, and Vault is reachable with a valid token.
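
The readiness rule can be sketched as a pure function over the results of the four dependency checks. The `DependencyState` shape and `readiness` function are hypothetical, and the 503 status for the not-ready case is an assumption; the source only specifies when 200 is returned.

```typescript
// Result of the individual dependency checks, gathered by the probe handler.
interface DependencyState {
  pg: boolean;    // PG reachable
  redis: boolean; // Redis reachable
  nats: boolean;  // NATS connected
  vault: boolean; // Vault reachable AND token valid
}

interface ReadinessResult {
  status: 200 | 503; // 503 is an ASSUMED not-ready status
  failing: string[]; // names of the dependencies that failed
}

// Ready only when every dependency check passed.
function readiness(state: DependencyState): ReadinessResult {
  const names = Object.keys(state) as (keyof DependencyState)[];
  const failing = names.filter((name) => !state[name]);
  return { status: failing.length === 0 ? 200 : 503, failing };
}

console.log(readiness({ pg: true, redis: true, nats: true, vault: true }));
console.log(readiness({ pg: true, redis: false, nats: true, vault: false }));
```

Returning the list of failing dependencies alongside the status makes it easier to tell from the probe response which runbook applies.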