SMS Orchestrator — Observability

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability

1. SLIs / SLOs

| SLI | SLO | Window | Measurement |
| --- | --- | --- | --- |
| Submit latency P95 | ≤ 200 ms | 30 d | POST /v1/sms/send through 202 |
| Submit availability | ≥ 99.9% | 30 d | Non-5xx ratio at Kong |
| Pipeline end-to-end P95 (accept → SMPP publish) | ≤ 500 ms | 30 d | span orch.pipeline.total |
| DLQ rate | ≤ 0.05% | 7 d | orch_dlq_total / orch_submit_total |
| Retry rate | ≤ 2% | 7 d | orch_retry_total / orch_pipeline_stage_total{stage='route'} |
| Idempotency replay rate | observational | | orch_submit_total{result='replay'} / orch_submit_total |
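
To make the targets concrete, the sketch below converts the availability SLO into an error budget and evaluates the two ratio SLIs from counter increases. Only the thresholds come from the table; the function names and the sample counter values are hypothetical and chosen purely for illustration.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability permitted by an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def ratio(numerator: float, denominator: float) -> float:
    """Counter ratio that returns 0 for an empty denominator instead of raising."""
    return numerator / denominator if denominator else 0.0


# Thresholds from the SLO table, expressed as fractions rather than percentages.
DLQ_RATE_SLO = 0.0005   # 0.05% over 7 d
RETRY_RATE_SLO = 0.02   # 2% over 7 d

# Hypothetical 7-day counter increases (not real measurements).
sample = {
    "orch_submit_total": 1_000_000,
    "orch_dlq_total": 320,
    "orch_retry_total": 15_400,
    "orch_pipeline_stage_total_route": 1_010_000,
}

dlq_rate = ratio(sample["orch_dlq_total"], sample["orch_submit_total"])
retry_rate = ratio(sample["orch_retry_total"], sample["orch_pipeline_stage_total_route"])

print(f"availability budget: {error_budget_minutes(0.999, 30):.1f} min per 30 d")
print(f"DLQ rate {dlq_rate:.4%}, within SLO: {dlq_rate <= DLQ_RATE_SLO}")
print(f"retry rate {retry_rate:.2%}, within SLO: {retry_rate <= RETRY_RATE_SLO}")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget, which is why the 5xx alert further down uses a short evaluation window.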

2. Metrics

Exposed at /metrics (Prometheus):

orch_submit_total{result="accepted|replay|rejected"}
orch_submit_latency_seconds_bucket{...}
orch_pipeline_stage_total{stage="idempotency|validate|route|publish|persist", result="ok|fail|transient"}
orch_pipeline_stage_duration_seconds_bucket{stage=...}
orch_retry_total{attempt="1|2|3"}
orch_dlq_total{reason="validation|no_route|publish_fail|max_attempts"}
orch_idempotency_cache_hit_total
orch_routing_engine_errors_total{grpc_code=...}
orch_nats_publish_errors_total{stream=...}
orch_pg_errors_total{op=...}
orch_redis_errors_total{op=...}
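
As a rough illustration of how these series combine into the ratio SLIs, the sketch below parses a small piece of Prometheus text exposition and sums counters across label values. The scrape content and its numbers are invented, and the parser is a simplified stand-in that handles only plain sample lines, not the full exposition grammar.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> dict:
    """Map metric name -> list of (labels dict, value) parsed from exposition text."""
    families = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        families[name].append((labels, float(value)))
    return families


def total(families: dict, name: str, **want: str) -> float:
    """Sum a counter over every series whose labels match the given filters."""
    return sum(
        value
        for labels, value in families.get(name, [])
        if all(labels.get(k) == v for k, v in want.items())
    )


# Hypothetical scrape output; values are made up for illustration.
SCRAPE = '''
# TYPE orch_submit_total counter
orch_submit_total{result="accepted"} 9800
orch_submit_total{result="replay"} 150
orch_submit_total{result="rejected"} 50
# TYPE orch_dlq_total counter
orch_dlq_total{reason="no_route"} 3
orch_dlq_total{reason="max_attempts"} 1
'''

families = parse_metrics(SCRAPE)
submits = total(families, "orch_submit_total")
replay_rate = total(families, "orch_submit_total", result="replay") / submits
dlq_rate = total(families, "orch_dlq_total") / submits
print(f"replay rate={replay_rate:.3%} dlq rate={dlq_rate:.3%}")
```

Raw counter values accumulate from process start, so an SLI over a 7 d or 30 d window would normally be computed in Prometheus from the increase of each counter over that window rather than from a single scrape.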

3. Traces

OpenTelemetry spans (parent kong.proxy span):

  • orch.http.submit — HTTP accept path
    • orch.validate
    • orch.idempotency.check
    • orch.pg.insert
    • orch.nats.publish{subject=sms.outbound.request}
  • orch.pipeline.total (NATS consumer root)
    • orch.pipeline.idempotency
    • orch.pipeline.validate
    • orch.pipeline.route
      • routing-engine.select_route
    • orch.pipeline.publish{operator=...}
    • orch.pipeline.persist

Attributes: sms.message_id, sms.tenant_id, sms.account_id, sms.operator_id, sms.attempt, error.type.
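
The consumer-side hierarchy can be pictured with a toy recorder. This is not the OpenTelemetry SDK: SketchTracer, Span, and render are hypothetical stdlib-only stand-ins, the attribute values are invented, and nesting routing-engine.select_route under orch.pipeline.route is an assumption about the intended structure.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict
    children: list = field(default_factory=list)
    duration_ms: float = 0.0


class SketchTracer:
    """Toy tracer that records span nesting so the hierarchy can be inspected."""

    def __init__(self) -> None:
        self.roots: list = []
        self._stack: list = []

    @contextmanager
    def span(self, name: str, attributes: dict = None):
        node = Span(name, dict(attributes or {}))
        # Attach to the currently open span, or register as a new root.
        (self._stack[-1].children if self._stack else self.roots).append(node)
        self._stack.append(node)
        started = time.perf_counter()
        try:
            yield node
        except Exception as exc:
            node.attributes["error.type"] = type(exc).__name__
            raise
        finally:
            node.duration_ms = (time.perf_counter() - started) * 1000
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list:
    """Flatten a span tree into indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = SketchTracer()
common = {"sms.message_id": "msg-123", "sms.tenant_id": "tenant-a", "sms.attempt": 1}

with tracer.span("orch.pipeline.total", common):
    for stage in ("idempotency", "validate", "route", "publish", "persist"):
        with tracer.span(f"orch.pipeline.{stage}", common):
            if stage == "route":
                # Assumed child span for the call into the routing engine.
                with tracer.span("routing-engine.select_route", common):
                    pass

print("\n".join(render(tracer.roots[0])))
```

Carrying the same sms.* attributes on every span is one way to keep a message searchable by tenant or message ID at any depth of the trace.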

4. Logs (Pino → Loki)

Fields: level, ts, service=sms-orchestrator, messageId, tenantId, accountId, stage, durationMs, traceId, spanId. The message body and the destination MSISDN (to) are never logged in clear text; they are replaced by body_len and to_hash=sha256(to).
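
A minimal sketch of the masking rule follows. It is written with the Python standard library rather than Pino, and it assumes that body_len counts characters and that to_hash is the hex digest of the UTF-8 encoded number; neither detail is stated above. The helper names mask_sms_fields and log_line are hypothetical.

```python
import hashlib
import json
import time


def mask_sms_fields(body: str, to: str) -> dict:
    """Replace the message body and destination MSISDN with masked stand-ins."""
    return {
        "body_len": len(body),  # assumption: length in characters, not bytes
        "to_hash": hashlib.sha256(to.encode("utf-8")).hexdigest(),
    }


def log_line(stage: str, duration_ms: float, body: str, to: str, **ids: str) -> str:
    """Build one JSON log record; the raw body and MSISDN never enter the record."""
    record = {
        "level": "info",
        "ts": int(time.time() * 1000),
        "service": "sms-orchestrator",
        "stage": stage,
        "durationMs": duration_ms,
        **ids,
        **mask_sms_fields(body, to),
    }
    return json.dumps(record)


# Illustrative values only.
print(log_line("validate", 1.5, "hello", "+15550100", messageId="m1", tenantId="t1"))
```

Masking inside the record builder, rather than at each call site, keeps the raw values out of the log pipeline even when a new stage starts logging.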

5. Dashboards (Grafana)

  • Orchestrator Overview — submit rate, latency, 2xx/4xx/5xx split, active in-flight, DLQ rate
  • Pipeline Stages — per-stage latency histogram, error ratio, attempt distribution
  • Dependencies — routing-engine gRPC latency/error, Redis / PG op latency, NATS publish latency
  • Idempotency — replay rate, Redis hit ratio

6. Alerts

| Alert | Condition | Runbook |
| --- | --- | --- |
| OrchHigh5xx | 5xx ratio > 1% for 5m | runbooks/orch/high-5xx.md |
| OrchDlqBurst | DLQ events > 10/min for 5m | runbooks/orch/dlq-burst.md |
| OrchRoutingEngineDown | gRPC error ratio > 20% for 2m | runbooks/orch/routing-engine-down.md |
| OrchRedisErrors | Redis errors > 5/min | runbooks/orch/redis-down.md |
| OrchPgErrors | PG errors > 5/min | runbooks/orch/pg-down.md |
| OrchNatsPublishErrors | publish error > 5/min | runbooks/orch/nats-degraded.md |
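
The "for 5m" qualifier means a condition must hold continuously before the alert fires, so a single bad sample does not page anyone. The sketch below approximates that behaviour over a list of evaluation samples; Sample and fires are hypothetical names, the sample values are invented, and this is a simplification rather than a reproduction of the alerting engine's own evaluation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float     # evaluation time in seconds
    value: float  # e.g. the 5xx ratio at that instant


def fires(samples: list, threshold: float, for_seconds: float) -> bool:
    """True if the value has stayed above threshold for at least for_seconds,
    measured up to the most recent sample. Samples must be sorted by ts."""
    if not samples:
        return False
    breach_start = None
    for s in samples:
        if s.value > threshold:
            if breach_start is None:
                breach_start = s.ts  # start of the current unbroken breach
        else:
            breach_start = None      # any healthy sample resets the clock
    return breach_start is not None and samples[-1].ts - breach_start >= for_seconds


# OrchHigh5xx: 5xx ratio > 1% for 5m, evaluated once a minute.
steady = [Sample(t * 60, 0.02) for t in range(6)]
blip = [Sample(t * 60, 0.005 if t == 3 else 0.02) for t in range(6)]

print("steady breach fires:", fires(steady, 0.01, 300))
print("interrupted breach fires:", fires(blip, 0.01, 300))
```

The three alerts without a "for" clause would, under this model, use for_seconds=0 and fire on the first sample above the threshold.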

7. Readiness probe

/health/ready returns 200 only when: PG reachable, Redis reachable, NATS connected, routing-engine reachable (shallow ping).
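
The all-or-nothing rule can be sketched as a small aggregator over per-dependency checks. The check callables here are stubs, and returning 503 when any dependency fails is an assumption; the text above only specifies when 200 is returned.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Return (HTTP status, per-dependency detail); 200 only if every check passes."""
    detail = {}
    for name, check in checks.items():
        try:
            detail[name] = bool(check())
        except Exception:
            # A check that raises is treated the same as an unreachable dependency.
            detail[name] = False
    status = 200 if all(detail.values()) else 503  # 503 is an assumed failure code
    return status, detail


# Stub checks standing in for real PG, Redis, NATS and routing-engine probes.
checks = {
    "pg": lambda: True,
    "redis": lambda: True,
    "nats": lambda: True,
    "routing-engine": lambda: True,
}

print(readiness(checks))
```

Returning the per-dependency detail alongside the status makes it easier to see which dependency took the pod out of rotation.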