routing-engine — Observability
Status: populated | Last updated: 2026-04-18
1. Metrics (Prometheus)
All metrics are exposed at GET /metrics on port 3001.
| Metric name | Type | Labels | Description |
|---|---|---|---|
| routing_engine_grpc_requests_total | Counter | method, status_code | Total gRPC requests by method and outcome |
| routing_engine_grpc_request_duration_seconds | Histogram | method | gRPC request latency; buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms |
| routing_engine_cache_hits_total | Counter | cache_type (decision, health) | Redis cache hits |
| routing_engine_cache_misses_total | Counter | cache_type | Redis cache misses |
| routing_engine_cache_invalidations_total | Counter | reason (health_unbound, ttl_expiry) | Cache invalidation events |
| routing_engine_operators_healthy_total | Gauge | none | Current count of operators with BOUND or FAILBACK status |
| routing_engine_health_events_consumed_total | Counter | status (BOUND, UNBOUND, FAILBACK) | NATS operator.health events processed |
| routing_engine_db_query_duration_seconds | Histogram | query | PostgreSQL query latency |
| routing_engine_prefix_cache_size | Gauge | none | Number of entries in the in-process prefix cache |
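
The latency histogram lists its bucket boundaries in milliseconds, while the metric itself is recorded in seconds. The sketch below shows the same boundaries in seconds and how a cumulative histogram would count an observation. It is a stand-alone illustration rather than the service's metrics code; `BUCKETS_SECONDS`, `HistogramState`, `newHistogram`, and `observe` are hypothetical names.

```typescript
// Bucket upper bounds from the table, converted from milliseconds to seconds.
const BUCKETS_SECONDS: number[] = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25];

// Cumulative counts per bucket, plus the implicit +Inf bucket at the end.
type HistogramState = { counts: number[]; sum: number; count: number };

function newHistogram(): HistogramState {
  return { counts: new Array(BUCKETS_SECONDS.length + 1).fill(0), sum: 0, count: 0 };
}

// Prometheus histograms are cumulative: an observation increments every
// bucket whose upper bound is >= the observed value, and always +Inf.
function observe(h: HistogramState, seconds: number): void {
  BUCKETS_SECONDS.forEach((le, i) => {
    if (seconds <= le) h.counts[i] += 1;
  });
  h.counts[BUCKETS_SECONDS.length] += 1; // +Inf bucket
  h.sum += seconds;
  h.count += 1;
}

const latency = newHistogram();
observe(latency, 0.03); // a 30 ms request
observe(latency, 0.4);  // a 400 ms request
```

With these boundaries, a request slower than 250 ms is counted only in the +Inf bucket, so quantile estimates above that bound cannot be resolved any further.
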
Key Alerts
| Alert | Condition | Severity |
|---|---|---|
| RoutingEngineHighLatency | p95(grpc_request_duration_seconds{method="SelectOperator"}) > 0.05 for 2 min | critical |
| RoutingEngineHighErrorRate | rate(grpc_requests_total{status_code!="OK"}[5m]) > 0.01 | warning |
| RoutingEngineNoHealthyOperators | operators_healthy_total == 0 for 30 s | critical |
| RoutingEngineCacheMissRatioHigh | cache_misses / (cache_hits + cache_misses) > 0.5 for 5 min | warning |
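
The cache miss ratio condition can be checked by hand from the two counters. The following is a minimal sketch of that arithmetic only; it does not model the 5-minute hold period, and `cacheMissRatio` and `missRatioIsHigh` are hypothetical helper names, not part of the service.

```typescript
// Threshold taken from the RoutingEngineCacheMissRatioHigh condition.
const MISS_RATIO_THRESHOLD = 0.5;

// Ratio of misses to all lookups over some window. Returns 0 when there were
// no lookups, so an idle service does not produce NaN.
function cacheMissRatio(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : misses / total;
}

// True when the ratio is strictly above the threshold, matching the ">" in
// the alert condition.
function missRatioIsHigh(hits: number, misses: number): boolean {
  return cacheMissRatio(hits, misses) > MISS_RATIO_THRESHOLD;
}
```

For example, 30 hits and 70 misses give a ratio of 0.7, which is above the threshold, while an even 50/50 split is not, because the comparison is strict.
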
2. Structured Logging
All logs are emitted as JSON to stdout. Log level is controlled by the LOG_LEVEL environment variable (default: info).
Standard log fields
| Field | Type | Description |
|---|---|---|
| timestamp | ISO 8601 | Event time |
| level | string | debug / info / warn / error |
| service | string | Always routing-engine |
| traceId | string | OpenTelemetry trace ID (if available) |
| spanId | string | OpenTelemetry span ID |
| message | string | Human-readable summary |
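
To make the field layout and level filtering concrete, here is a minimal sketch of a formatter that produces one JSON line with the standard fields. It is an assumption-laden illustration, not the service's logger: `formatLog`, `LEVEL_ORDER`, and `TraceContext` are hypothetical names, and the minimum level is passed in explicitly where the service would read it from LOG_LEVEL.

```typescript
type Level = "debug" | "info" | "warn" | "error";

// Numeric ordering used to compare an entry's level with the minimum level.
const LEVEL_ORDER: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };

interface TraceContext {
  traceId?: string;
  spanId?: string;
}

// Builds one JSON log line with the standard fields, or returns null when the
// entry is below the configured minimum level. Extra fields are appended
// after the standard ones.
function formatLog(
  minLevel: Level,
  level: Level,
  message: string,
  extra: Record<string, unknown> = {},
  trace: TraceContext = {},
  now: Date = new Date(),
): string | null {
  if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return null;
  return JSON.stringify({
    timestamp: now.toISOString(), // ISO 8601
    level,
    service: "routing-engine",
    ...trace, // traceId / spanId only when available
    message,
    ...extra,
  });
}
```

With a minimum level of info, a debug entry such as cache.hit yields null and is dropped, while a warn entry such as operator.unavailable is serialized with its extra fields.
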
Key log events
| Event (message) | Level | Extra fields |
|---|---|---|
| grpc.request.received | debug | method, to (masked to prefix), accountId |
| cache.hit | debug | cacheKey, strategy |
| cache.miss | debug | cacheKey |
| operator.selected | info | operatorId, strategy, candidateCount |
| operator.unavailable | warn | prefix, accountId, messageType, candidateCount |
| health.event.received | info | operatorId, status, previousStatus |
| cache.invalidated | info | operatorId, keysInvalidated |
| prefix.cache.refreshed | debug | entryCount |
| startup.ready | info | port, dbConnected, redisConnected |
PII masking: the to field is masked to the matched prefix (e.g. +447***) in all logs.
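
The masking rule can be sketched as a small pure function. This is one possible reading of the rule, assuming the mask is the matched prefix followed by a fixed `***` marker as in the example; `maskRecipient` is a hypothetical name, and the fallback for a non-matching prefix is an assumption rather than documented behaviour.

```typescript
// Returns the matched prefix followed by a fixed "***" marker, so the log
// line never contains digits beyond the prefix.
function maskRecipient(to: string, matchedPrefix: string): string {
  if (!to.startsWith(matchedPrefix)) {
    // Assumed fallback: reveal nothing rather than risk logging the number.
    return "***";
  }
  return `${matchedPrefix}***`;
}
```

Using a fixed-length marker, rather than one asterisk per hidden digit, also avoids leaking the length of the number.
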
3. Distributed Tracing (OpenTelemetry)
- Instrumented with @opentelemetry/sdk-node and gRPC auto-instrumentation.
- Trace context is propagated via gRPC metadata (grpc-trace-bin header) from sms-orchestrator.
- Spans created:
  - routing-engine.SelectOperator (root span for each gRPC call)
  - routing-engine.redis.get (cache lookup)
  - routing-engine.postgres.query (on cache miss)
  - routing-engine.redis.set (cache write)
- Traces exported to the cluster's OpenTelemetry Collector (OTLP gRPC endpoint).
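
The span names above imply a different trace shape for a cache hit than for a cache miss. The sketch below illustrates that difference with a tiny in-memory recorder. It deliberately does not use the OpenTelemetry API: `SpanRecorder` and `selectOperator` are hypothetical stand-ins, the calls are synchronous for brevity, and it assumes the cache write follows the database query on a miss.

```typescript
// Hypothetical stand-in for a tracer: records span names in start order.
class SpanRecorder {
  readonly names: string[] = [];
  span<T>(name: string, fn: () => T): T {
    this.names.push(name);
    return fn();
  }
}

// Simplified lookup flow: a Map plays the role of Redis and queryDb the role
// of PostgreSQL.
function selectOperator(
  rec: SpanRecorder,
  cache: Map<string, string>,
  queryDb: (key: string) => string,
  key: string,
): string {
  return rec.span("routing-engine.SelectOperator", () => {
    const cached = rec.span("routing-engine.redis.get", () => cache.get(key));
    if (cached !== undefined) return cached; // cache hit: no further spans
    const operatorId = rec.span("routing-engine.postgres.query", () => queryDb(key));
    rec.span("routing-engine.redis.set", () => {
      cache.set(key, operatorId);
    });
    return operatorId;
  });
}

// First call misses and populates the cache; second call hits.
const decisionCache = new Map<string, string>();
const missTrace = new SpanRecorder();
selectOperator(missTrace, decisionCache, () => "op-1", "decision:+447");
const hitTrace = new SpanRecorder();
selectOperator(hitTrace, decisionCache, () => "op-1", "decision:+447");
```

Under these assumptions, the miss records all four span names and the hit records only the root span and routing-engine.redis.get.
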