
routing-engine — Observability

Status: populated | Last updated: 2026-04-18

1. Metrics (Prometheus)

All metrics are exposed at GET /metrics on port 3001.

| Metric name | Type | Labels | Description |
|---|---|---|---|
| routing_engine_grpc_requests_total | Counter | method, status_code | Total gRPC requests by method and outcome |
| routing_engine_grpc_request_duration_seconds | Histogram | method | gRPC request latency; buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms |
| routing_engine_cache_hits_total | Counter | cache_type (decision, health) | Redis cache hits |
| routing_engine_cache_misses_total | Counter | cache_type | Redis cache misses |
| routing_engine_cache_invalidations_total | Counter | reason (health_unbound, ttl_expiry) | Cache invalidation events |
| routing_engine_operators_healthy_total | Gauge | (none) | Current count of operators with BOUND or FAILBACK status |
| routing_engine_health_events_consumed_total | Counter | status (BOUND, UNBOUND, FAILBACK) | NATS operator.health events processed |
| routing_engine_db_query_duration_seconds | Histogram | query | PostgreSQL query latency |
| routing_engine_prefix_cache_size | Gauge | (none) | Number of entries in the in-process prefix cache |
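
To make the bucket layout of routing_engine_grpc_request_duration_seconds concrete, here is a minimal sketch of how observations fold into cumulative buckets, the way Prometheus histograms expose them through the "le" label. The helper name observeAll and the HistogramSnapshot type are hypothetical and not part of routing-engine; only the bucket boundaries come from the table above, converted to seconds.

```typescript
// Upper bounds in seconds, matching the buckets listed for
// routing_engine_grpc_request_duration_seconds (5ms up to 250ms).
const BUCKETS_SECONDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25];

interface HistogramSnapshot {
  // Cumulative counts keyed by the "le" (less-than-or-equal) upper bound.
  buckets: Map<string, number>;
  sum: number;
  count: number;
}

// Hypothetical helper: folds raw latency observations (in seconds) into the
// cumulative bucket layout used by Prometheus histograms.
function observeAll(latenciesSeconds: number[]): HistogramSnapshot {
  const buckets = new Map<string, number>();
  for (const le of BUCKETS_SECONDS) buckets.set(String(le), 0);
  buckets.set("+Inf", 0);

  let sum = 0;
  for (const value of latenciesSeconds) {
    sum += value;
    // Cumulative: an observation increments every bucket whose bound it fits under.
    for (const le of BUCKETS_SECONDS) {
      if (value <= le) buckets.set(String(le), (buckets.get(String(le)) ?? 0) + 1);
    }
    // The +Inf bucket always counts every observation.
    buckets.set("+Inf", (buckets.get("+Inf") ?? 0) + 1);
  }
  return { buckets, sum, count: latenciesSeconds.length };
}

// Three example requests: 3 ms, 20 ms, and 300 ms.
const snapshot = observeAll([0.003, 0.02, 0.3]);
```

In this sketch the 300 ms request is counted only in the +Inf bucket, so with 250ms as the largest bound, latencies above that value cannot be resolved any further from the histogram alone.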

Key Alerts

| Alert | Condition | Severity |
|---|---|---|
| RoutingEngineHighLatency | p95(grpc_request_duration_seconds{method="SelectOperator"}) > 0.05 for 2 min | critical |
| RoutingEngineHighErrorRate | rate(grpc_requests_total{status_code!="OK"}[5m]) > 0.01 | warning |
| RoutingEngineNoHealthyOperators | operators_healthy_total == 0 for 30 s | critical |
| RoutingEngineCacheMissRatioHigh | cache_misses / (cache_hits + cache_misses) > 0.5 for 5 min | warning |
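
The RoutingEngineCacheMissRatioHigh condition is simple arithmetic, sketched below under stated assumptions. The function names are hypothetical, the "for 5 min" duration is not modelled, and returning 0 when there have been no lookups is a choice made for this sketch rather than documented behaviour.

```typescript
// Threshold taken from the alert table: miss ratio above 0.5.
const MISS_RATIO_THRESHOLD = 0.5;

// Hypothetical helper mirroring cache_misses / (cache_hits + cache_misses).
function cacheMissRatio(hits: number, misses: number): number {
  const total = hits + misses;
  // With no lookups the ratio is undefined; this sketch reports 0 so that an
  // idle service does not look like a cache failure (an assumption).
  return total === 0 ? 0 : misses / total;
}

// True when the instantaneous ratio is strictly above the threshold.
function missRatioBreached(hits: number, misses: number): boolean {
  return cacheMissRatio(hits, misses) > MISS_RATIO_THRESHOLD;
}
```

For example, 40 hits and 60 misses give a ratio of 0.6 and breach the threshold, while 50 hits and 50 misses give exactly 0.5 and do not, because the comparison is strict.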

2. Structured Logging

All logs are emitted as JSON to stdout. Log level is controlled by the LOG_LEVEL environment variable (default: info).

Standard log fields

| Field | Type | Description |
|---|---|---|
| timestamp | ISO 8601 | Event time |
| level | string | debug / info / warn / error |
| service | string | Always routing-engine |
| traceId | string | OpenTelemetry trace ID (if available) |
| spanId | string | OpenTelemetry span ID |
| message | string | Human-readable summary |
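
As an illustration of the standard fields, the following sketch builds a single JSON log line and applies a level threshold. It is not the service's actual logger: buildLogLine, parseLevel, and the TraceContext type are hypothetical, and the threshold is passed in as a parameter where the real service would read LOG_LEVEL from the environment.

```typescript
type Level = "debug" | "info" | "warn" | "error";

// Ordered from least to most severe; the index doubles as the severity rank.
const LEVELS: Level[] = ["debug", "info", "warn", "error"];

interface TraceContext {
  traceId?: string;
  spanId?: string;
}

// Hypothetical helper: maps a raw LOG_LEVEL value to a Level, falling back to
// the documented default (info) when the value is missing or unrecognized.
function parseLevel(raw: string | undefined): Level {
  const found = LEVELS.find((l) => l === raw);
  return found ?? "info";
}

// Hypothetical helper: returns one JSON log line with the standard fields, or
// null when the entry is below the configured threshold.
function buildLogLine(
  threshold: Level,
  level: Level,
  message: string,
  extra: Record<string, unknown> = {},
  trace: TraceContext = {},
  now: Date = new Date(),
): string | null {
  if (LEVELS.indexOf(level) < LEVELS.indexOf(threshold)) return null;
  return JSON.stringify({
    timestamp: now.toISOString(), // ISO 8601
    level,
    service: "routing-engine",
    ...trace, // traceId / spanId are omitted when not available
    message,
    ...extra,
  });
}

// Illustrative values only; the operator ID and strategy name are made up.
const line = buildLogLine(
  parseLevel(undefined),
  "info",
  "operator.selected",
  { operatorId: "op-1", strategy: "example", candidateCount: 3 },
  {},
  new Date("2026-04-18T00:00:00.000Z"),
);
```

With the default threshold of info, a debug event such as cache.hit yields null in this sketch and would not be written to stdout.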

Key log events

| Event (message) | Level | Extra fields |
|---|---|---|
| grpc.request.received | debug | method, to (masked to prefix), accountId |
| cache.hit | debug | cacheKey, strategy |
| cache.miss | debug | cacheKey |
| operator.selected | info | operatorId, strategy, candidateCount |
| operator.unavailable | warn | prefix, accountId, messageType, candidateCount |
| health.event.received | info | operatorId, status, previousStatus |
| cache.invalidated | info | operatorId, keysInvalidated |
| prefix.cache.refreshed | debug | entryCount |
| startup.ready | info | port, dbConnected, redisConnected |

PII masking: the to field is masked to the matched prefix (e.g. +447***) in all logs.
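
One way the masking rule could look in code is sketched below. The function maskDestination is hypothetical, and the fallback for a number that does not start with the given prefix is an assumption of this sketch, not documented behaviour.

```typescript
// Hypothetical helper: keeps only the matched routing prefix of a destination
// number and replaces the remainder with a fixed "***" marker.
function maskDestination(to: string, matchedPrefix: string): string {
  // Assumption: if the number does not start with the prefix, reveal nothing.
  if (matchedPrefix.length === 0 || !to.startsWith(matchedPrefix)) return "***";
  return `${matchedPrefix}***`;
}

// Example with a made-up destination number.
const masked = maskDestination("+447700900123", "+447");
```

A fixed-length marker, rather than one asterisk per hidden digit, also avoids leaking the length of the original number.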


3. Distributed Tracing (OpenTelemetry)

  • Instrumented with @opentelemetry/sdk-node and gRPC auto-instrumentation.
  • Trace context is propagated via gRPC metadata (grpc-trace-bin header) from sms-orchestrator.
  • Spans created:
    • routing-engine.SelectOperator (root span for each gRPC call)
    • routing-engine.redis.get (cache lookup)
    • routing-engine.postgres.query (on cache miss)
    • routing-engine.redis.set (cache write)
  • Traces exported to the cluster's OpenTelemetry Collector (OTLP gRPC endpoint).
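
The span list above implies a different trace shape for cache hits and cache misses. The sketch below encodes that expectation, for example as a starting point for checking exported traces. The helper expectedChildSpans is hypothetical, and it assumes that routing-engine.redis.set is only emitted after a miss has been resolved from PostgreSQL.

```typescript
// Root span name as listed in this section.
const ROOT_SPAN = "routing-engine.SelectOperator";

// Hypothetical helper: child spans expected under the root span for one call.
// Assumes the cache write (redis.set) happens only on the cache-miss path.
function expectedChildSpans(cacheHit: boolean): string[] {
  const spans = ["routing-engine.redis.get"]; // the cache lookup always runs
  if (!cacheHit) {
    spans.push("routing-engine.postgres.query", "routing-engine.redis.set");
  }
  return spans;
}

const hitTrace = [ROOT_SPAN, ...expectedChildSpans(true)];
const missTrace = [ROOT_SPAN, ...expectedChildSpans(false)];
```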