Operator Management Service — Service Readiness
Status: populated Owner: Platform Engineering + SRE Last updated: 2026-04-18
Service is production-ready only when EVERY box below is checked.
Docs
- All 17 service docs complete (no stubs remain).
- Vault bootstrap runbook linked from LOCAL_DEV_SETUP.
Code + Tests
- TypeScript strict, zero errors.
- ESLint: no banned patterns (no PG password columns, no direct Vault SDK from controllers).
- Unit coverage: aggregates/VOs ≥ 95%, domain services ≥ 90%, use cases ≥ 90%.
- Mutation score on changed files ≥ 75%.
- Integration tests pass: create-operator, duplicate-prevention, soft-delete, credentials-endpoint, vault-failure, health-ingest, routing-rules.
- Contract tests green: schema registry for all 4 produced events.
- OpenAPI diff gate: no breaking change without major bump.
Security
-
security-revieweragent run, zero critical/high. - Password never appears in API responses (automated test asserts absence of
passwordkey in GET responses). - Password never appears in NATS event payloads (schema registry enforces).
- Vault policy scoped correctly (no access outside
secret/ops/operators/*). - mTLS enforced on internal API (NetworkPolicy + PeerAuthentication in staging).
- OWASP ZAP baseline passed against staging admin API.
Observability
-
/metrics,/health/live,/health/readyup. - All metric families visible in Grafana dashboard.
- OTel spans in trace backend.
- All 6 alerts have runbooks linked.
Infra / Rollout
- Helm chart + Terraform module committed.
- Vault Kubernetes auth role configured in Terraform.
- K8s PDB, HPA, resource limits per DEPLOYMENT_TOPOLOGY.
- NetworkPolicy applied in staging and validated.
- mTLS cert rotation tested (cert-manager).
- On-call rotation assigned.
Data
- PG migrations applied in staging.
-
pg_partmanconfigured forops.operator_health_logpartitions. - Legacy operator migration script executed in staging; ops team validated config.
Sign-off
- Tech lead ✅
- SRE ✅
- Security ✅
- Carrier Relations (ops team) ✅