
SMS Orchestrator — Service Readiness

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18

Service is production-ready only when EVERY box below is checked.

Docs

  • All 17 service docs complete (no stubs remain).
  • ADR-0001 migration consequences reflected in MIGRATION_PLAN.

Code + tests

  • ESLint: domain import-restriction rule passes.
  • TypeScript strict, zero errors.
  • Unit coverage: aggregates ≥ 95%, VOs 100%, domain services ≥ 90%.
  • Mutation testing on changed files: score ≥ 75% (aggregates), ≥ 85% (VOs).
  • Integration tests pass: tenant-isolation, submit-happy-path, submit-idempotency, pipeline-retry, validation-failure, redis-outage.
  • Contract tests green: Pact with routing-engine, schema registry for all 4 produced events.
  • OpenAPI diff gate: no breaking change without major bump.
  • Load test: 2500 TPS sustained, P95 submit ≤ 200 ms, P95 pipeline ≤ 500 ms.
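
The load-test thresholds above could be enforced as an automated gate. The following is a minimal sketch, assuming the load tool exports raw latency samples; the `LoadTestRun` shape and the function names are hypothetical, and the nearest-rank percentile is one common definition rather than necessarily the one the team's tooling uses.

```typescript
// Hypothetical shape of one load-test run; field names are illustrative only.
interface LoadTestRun {
  sustainedTps: number;
  submitLatenciesMs: number[];
  pipelineLatenciesMs: number[];
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Returns one human-readable message per violated threshold.
// Thresholds are the ones stated in the checklist.
function loadTestFailures(run: LoadTestRun): string[] {
  const failures: string[] = [];
  if (run.sustainedTps < 2500) {
    failures.push(`sustained TPS ${run.sustainedTps} < 2500`);
  }
  const submitP95 = percentile(run.submitLatenciesMs, 95);
  if (submitP95 > 200) {
    failures.push(`P95 submit ${submitP95} ms > 200 ms`);
  }
  const pipelineP95 = percentile(run.pipelineLatenciesMs, 95);
  if (pipelineP95 > 500) {
    failures.push(`P95 pipeline ${pipelineP95} ms > 500 ms`);
  }
  return failures;
}
```

An empty result means the run meets all three thresholds; a CI job could fail the build whenever the array is non-empty.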

Security

  • security-reviewer agent run completed with zero critical or high findings.
  • Tenant isolation integration test passes.
  • PII mask in Pino transport verified (no message body and no raw "to" number reaches Loki).
  • Kong JWT + key-auth integration tested end-to-end in staging.
  • RLS policies active on orch.sms_messages + orch.idempotency_keys.
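
To illustrate the PII masking item, here is a minimal sketch of a recursive masking function that could run before a log record is shipped. It assumes the sensitive fields are literally named `body` and `to`; the `maskPii` helper, the placeholder string, and the record shape are hypothetical and do not describe how the actual Pino transport is wired.

```typescript
// Field names assumed to carry PII, taken from the checklist item.
const MASKED_FIELDS = new Set(["body", "to"]);
const PLACEHOLDER = "[REDACTED]";

// Returns a deep copy of the value with every masked field replaced by a
// placeholder, at any nesting depth. The input is not mutated.
function maskPii(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(maskPii);
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = MASKED_FIELDS.has(key) ? PLACEHOLDER : maskPii(inner);
    }
    return out;
  }
  return value;
}
```

A verification test might serialize a masked record and assert that neither the original message text nor the recipient number appears anywhere in the output.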

Observability

  • /metrics, /health/live, /health/ready endpoints up.
  • All six metric families (submit, pipeline stage, retry, DLQ, deps) visible in Grafana.
  • OTel spans visible in trace backend with parent Kong span chaining.
  • All six alerts (OrchHigh5xx, OrchDlqBurst, OrchRoutingEngineDown, OrchRedisErrors, OrchPgErrors, OrchNatsPublishErrors) have runbooks linked.
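
The runbook requirement could be checked mechanically. This is a rough sketch, assuming runbook links are available as a map from alert name to URL (for example, read from alert-rule annotations); that map and the `alertsMissingRunbooks` helper are hypothetical.

```typescript
// The six alert names listed in the checklist.
const REQUIRED_ALERTS = [
  "OrchHigh5xx",
  "OrchDlqBurst",
  "OrchRoutingEngineDown",
  "OrchRedisErrors",
  "OrchPgErrors",
  "OrchNatsPublishErrors",
] as const;

// Returns the required alerts that have no runbook link, or only a blank one.
function alertsMissingRunbooks(
  runbooks: Record<string, string | undefined>,
): string[] {
  return REQUIRED_ALERTS.filter((name) => !runbooks[name]?.trim());
}
```

An empty result means every required alert has a linked runbook.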

Infra / rollout

  • Helm chart + Terraform module committed.
  • K8s HPA, PDB, resource requests/limits configured per DEPLOYMENT_TOPOLOGY.
  • Canary deploy completed in staging (5% / 30m) and rolled forward.
  • Rollback verified in staging (image revert).
  • On-call rotation assigned.

Data

  • Migrations applied in staging.
  • Partition maintenance cron scheduled.
  • Idempotency key purge cron scheduled.
  • Archival target (S3 parquet) configured for 90-day-old partitions.
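
The 90-day archival rule can be sketched as a selection over partition metadata. This assumes each partition exposes the exclusive upper bound of its time range and that "90-day-old" means the whole partition lies at least 90 days in the past; the `Partition` shape and the partition names are hypothetical.

```typescript
// Hypothetical partition metadata; the real catalog query is not shown.
interface Partition {
  name: string;
  rangeEnd: Date; // exclusive upper bound of the partition's time range
}

const ARCHIVE_AFTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Selects partitions whose entire range ended at least 90 days before `now`.
function partitionsToArchive(partitions: Partition[], now: Date): Partition[] {
  const cutoff = now.getTime() - ARCHIVE_AFTER_DAYS * MS_PER_DAY;
  return partitions.filter((p) => p.rangeEnd.getTime() <= cutoff);
}
```

Passing `now` explicitly keeps the selection deterministic, so a cron job and its tests can share the same function.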

Sign-off

  • Tech lead ✅
  • SRE ✅
  • Security ✅
  • Product ✅