SMS Orchestrator — Observability
Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability
1. SLIs / SLOs
| SLI | SLO | Window | Measurement |
|---|---|---|---|
| Submit latency P95 | ≤ 200 ms | 30 d | POST /v1/sms/send through 202 |
| Submit availability | ≥ 99.9% | 30 d | Non-5xx ratio at Kong |
| Pipeline end-to-end P95 (accept → SMPP publish) | ≤ 500 ms | 30 d | span orch.pipeline.total |
| DLQ rate | ≤ 0.05% | 7 d | orch_dlq_total / orch_submit_total |
| Retry rate | ≤ 2% | 7 d | orch_retry_total / orch_pipeline_stage_total{stage='route'} |
| Idempotency replay rate | observational | — | orch_submit_total{result='replay'} / orch_submit_total |
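The ratio SLIs in the table reduce to dividing one counter increase by another over the window. The following is a minimal sketch of that arithmetic, not the service's actual evaluation code: the `RatioSlo` type, the helper names, and the sample counter values are all hypothetical, and treating an idle window as a ratio of 0.0 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioSlo:
    name: str
    max_ratio: float  # SLO ceiling as a fraction, e.g. 0.0005 for 0.05%


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, treating an idle window as 0.0 (assumption)."""
    return numerator / denominator if denominator > 0 else 0.0


def within_slo(slo: RatioSlo, numerator: float, denominator: float) -> bool:
    """True when the observed ratio is at or below the SLO ceiling."""
    return ratio(numerator, denominator) <= slo.max_ratio


DLQ_RATE = RatioSlo("DLQ rate", 0.0005)    # <= 0.05% over 7 d
RETRY_RATE = RatioSlo("Retry rate", 0.02)  # <= 2% over 7 d

# Hypothetical 7-day counter increases, for illustration only.
submits = 1_200_000      # orch_submit_total
dlq = 480                # orch_dlq_total
route_stage = 1_250_000  # orch_pipeline_stage_total{stage='route'}
retries = 30_000         # orch_retry_total

print(within_slo(DLQ_RATE, dlq, submits))            # True  (0.04%)
print(within_slo(RETRY_RATE, retries, route_stage))  # False (2.4%)
```

In practice these ratios would normally be computed by the metrics backend over the stated window; the sketch only makes the numerator, denominator, and threshold explicit.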
2. Metrics
Exposed at /metrics (Prometheus):
orch_submit_total{result="accepted|replay|rejected"}
orch_submit_latency_seconds_bucket{...}
orch_pipeline_stage_total{stage="idempotency|validate|route|publish|persist", result="ok|fail|transient"}
orch_pipeline_stage_duration_seconds_bucket{stage=...}
orch_retry_total{attempt="1|2|3"}
orch_dlq_total{reason="validation|no_route|publish_fail|max_attempts"}
orch_idempotency_cache_hit_total
orch_routing_engine_errors_total{grpc_code=...}
orch_nats_publish_errors_total{stream=...}
orch_pg_errors_total{op=...}
orch_redis_errors_total{op=...}
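The latency SLIs are P95 values derived from the `_bucket` histogram series above. As an illustration of how a quantile can be estimated from cumulative buckets, here is a sketch that uses linear interpolation within the bucket containing the target rank, similar in spirit to Prometheus's `histogram_quantile`. The function name and the bucket boundaries are hypothetical; the document does not specify the real bucket layout.

```python
import math
from typing import Sequence, Tuple


def quantile_from_buckets(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets.

    The highest bucket must have an upper bound of +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report that bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for orch_submit_latency_seconds_bucket.
example = [(0.05, 600), (0.1, 850), (0.2, 960), (0.5, 995), (math.inf, 1000)]
p95 = quantile_from_buckets(0.95, example)
print(p95 <= 0.2)  # compares the estimate against the 200 ms submit SLO
```

The estimate is only as precise as the bucket layout allows, so a bucket boundary at or near each SLO threshold (for example 0.2 s and 0.5 s) keeps the comparison meaningful.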
3. Traces
OpenTelemetry spans (parent kong.proxy span):
- orch.http.submit (HTTP accept path)
  - orch.validate
  - orch.idempotency.check
  - orch.pg.insert
  - orch.nats.publish{subject=sms.outbound.request}
- orch.pipeline.total (NATS consumer root)
  - orch.pipeline.idempotency
  - orch.pipeline.validate
  - orch.pipeline.route → routing-engine.select_route
  - orch.pipeline.publish{operator=...}
  - orch.pipeline.persist
Attributes: sms.message_id, sms.tenant_id, sms.account_id, sms.operator_id, sms.attempt, error.type.
4. Logs (Pino → Loki)
Fields: level, ts, service=sms-orchestrator, messageId, tenantId, accountId, stage, durationMs, traceId, spanId.
The message body and the destination MSISDN (to) are masked: logs carry only body_len and to_hash=sha256(to), never the raw values.
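The masking rule can be sketched as a small pure function. The service itself logs through Pino, so this Python version only illustrates how the two fields are derived. The function name, UTF-8 encoding, hex output, and counting `body_len` in characters rather than bytes are all assumptions, since the source states only `body_len` and `to_hash=sha256(to)`.

```python
import hashlib
from typing import Dict, Union


def mask_sms_fields(body: str, to: str) -> Dict[str, Union[int, str]]:
    """Return the log-safe replacements for the message body and destination."""
    return {
        # Length only; the body text itself never reaches the log record.
        "body_len": len(body),
        # Unsalted SHA-256 of the MSISDN, hex-encoded (encoding is an assumption).
        "to_hash": hashlib.sha256(to.encode("utf-8")).hexdigest(),
    }


# Hypothetical values, for illustration only.
fields = mask_sms_fields("hello", "+15550100")
print(sorted(fields))  # ['body_len', 'to_hash']
```

Because the hash is deterministic, the same destination always yields the same `to_hash`, which lets log queries correlate messages to one recipient without exposing the number.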
5. Dashboards (Grafana)
- Orchestrator Overview — submit rate, latency, 2xx/4xx/5xx split, active in-flight, DLQ rate
- Pipeline Stages — per-stage latency histogram, error ratio, attempt distribution
- Dependencies — routing-engine gRPC latency/error, Redis / PG op latency, NATS publish latency
- Idempotency — replay rate, Redis hit ratio
6. Alerts
| Alert | Condition | Runbook |
|---|---|---|
| OrchHigh5xx | 5xx ratio > 1% for 5m | runbooks/orch/high-5xx.md |
| OrchDlqBurst | DLQ events > 10/min for 5m | runbooks/orch/dlq-burst.md |
| OrchRoutingEngineDown | gRPC error ratio > 20% for 2m | runbooks/orch/routing-engine-down.md |
| OrchRedisErrors | Redis errors > 5/min | runbooks/orch/redis-down.md |
| OrchPgErrors | PG errors > 5/min | runbooks/orch/pg-down.md |
| OrchNatsPublishErrors | NATS publish errors > 5/min | runbooks/orch/nats-degraded.md |
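Conditions of the form "X > threshold for 5m" fire only when the threshold is exceeded continuously for the whole duration, so a single spike does not page. The sketch below illustrates that behaviour under the assumption of one evaluation sample per minute. The function name and sample values are hypothetical, and a real alerting engine tracks pending and firing state rather than re-scanning a list.

```python
from typing import Sequence


def fires(samples: Sequence[float], threshold: float, for_samples: int) -> bool:
    """True when the most recent `for_samples` samples all exceed `threshold`."""
    if for_samples <= 0 or len(samples) < for_samples:
        return False
    return all(value > threshold for value in samples[-for_samples:])


# Hypothetical per-minute 5xx ratios for OrchHigh5xx (> 1% for 5m).
sustained = [0.002, 0.015, 0.02, 0.03, 0.012, 0.011]
with_dip = [0.02, 0.02, 0.005, 0.02, 0.02]

print(fires(sustained, 0.01, 5))  # True: last five samples all exceed 1%
print(fires(with_dip, 0.01, 5))   # False: one sample dropped below 1%
```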
7. Readiness probe
/health/ready returns 200 only when: PG reachable, Redis reachable, NATS connected, routing-engine reachable (shallow ping).
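The readiness rule is an all-of aggregation over the four dependency checks. A minimal sketch follows, with hypothetical check callables standing in for the real PG, Redis, NATS, and routing-engine probes. Returning 503 on failure and treating a raised exception as "not ready" are assumptions; the source specifies only when 200 is returned.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check and return (http_status, per-check results)."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A probe that raises counts as a failed dependency (assumption).
            results[name] = False
    status = 200 if all(results.values()) else 503  # 503 is an assumption
    return status, results


# Hypothetical probes; real ones would issue a shallow ping to each dependency.
status, detail = readiness({
    "pg": lambda: True,
    "redis": lambda: True,
    "nats": lambda: True,
    "routing_engine": lambda: False,
})
print(status, detail["routing_engine"])  # 503 False
```

Running every check instead of stopping at the first failure keeps the per-dependency detail available for the response body or logs.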