DLR Processor — Observability

Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Companion: FAILURE_MODES · DEPLOYMENT_TOPOLOGY

1. Metrics (Prometheus)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dlr_messages_received_total | Counter | operator_id | Total DLR events received from NATS |
| dlr_messages_processed_total | Counter | dlr_status, operator_id | Successfully processed DLRs |
| dlr_duplicates_total | Counter | operator_id | DLRs deduplicated (already exists) |
| dlr_orphans_total | Counter | operator_id | DLRs with unresolvable operatorMessageId |
| dlr_validation_errors_total | Counter | (none) | Malformed inbound events discarded |
| dlr_processing_duration_seconds | Histogram | dlr_status | End-to-end processing latency |
| dlr_db_errors_total | Counter | operation | PostgreSQL errors by operation type |
| dlr_outbox_pending_count | Gauge | (none) | Unpublished events in outbox table |
| dlr_orphan_rate | Gauge | (none) | Rolling 5-min orphan percentage |
| dlr_nats_consumer_status | Gauge | (none) | 1 = active, 0 = disconnected |
| dlr_nats_ack_pending | Gauge | (none) | NATS consumer unacked message count |
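As an illustration of how a rolling gauge such as dlr_orphan_rate might be derived, here is a minimal dependency-free sketch. The class name, the in-memory sample list and the choice to report a 0..1 ratio are all assumptions. The ratio reading is suggested by the DlrHighOrphanRate alert, which compares the gauge to 0.005 for a 0.5% threshold, but it is not confirmed by this document.

```typescript
// Hypothetical sketch: RollingOrphanRate is not part of the service's code.
type Sample = { atMs: number; orphaned: boolean };

class RollingOrphanRate {
  private samples: Sample[] = [];
  private readonly windowMs: number;

  // Default window of 5 minutes, matching the metric description.
  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Record one handled DLR and whether it turned out to be an orphan.
  record(orphaned: boolean, nowMs: number): void {
    this.evict(nowMs);
    this.samples.push({ atMs: nowMs, orphaned });
  }

  // Fraction of orphaned DLRs inside the window; 0 when the window is empty.
  rate(nowMs: number): number {
    this.evict(nowMs);
    if (this.samples.length === 0) return 0;
    const orphans = this.samples.filter((s) => s.orphaned).length;
    return orphans / this.samples.length;
  }

  // Drop samples that have aged out of the window.
  private evict(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    this.samples = this.samples.filter((s) => s.atMs > cutoff);
  }
}

// Usage: the returned value is what a gauge could be set to on each scrape.
const orphanTracker = new RollingOrphanRate();
orphanTracker.record(false, Date.now());
orphanTracker.record(true, Date.now());
const currentOrphanRate = orphanTracker.rate(Date.now());
```

Keeping every sample in memory is only reasonable at modest throughput; a production version would more likely keep per-second buckets of counts.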

2. Structured Logs (Pino / JSON)

All logs are emitted as JSON to stdout, collected by Fluent Bit, and shipped to Loki.

Key Log Events

| Level | event field | When |
| --- | --- | --- |
| INFO | dlr.received | Each inbound DLR (includes operatorMessageId, operatorId) |
| INFO | dlr.correlated | Successful correlation (includes messageId, accountId, dlrStatus) |
| INFO | dlr.duplicate | Duplicate detected (includes operatorMessageId, existingReceiptId) |
| WARN | dlr.orphaned | No correlation found (includes full context) |
| WARN | dlr.validation_failed | Schema validation error (sanitised payload) |
| ERROR | dlr.db_error | Database error (includes operation, redacted query) |
| ERROR | dlr.nats_publish_error | Outbox publish failure |

Log Field Standards

{
  "level": "info",
  "time": "2026-04-18T10:23:45.123Z",
  "service": "dlr-processor",
  "traceId": "abc123",
  "spanId": "def456",
  "event": "dlr.correlated",
  "operatorMessageId": "OP-MSG-20240418-00123",
  "messageId": "c3d4e5f6-...",
  "accountId": "d4e5f6a7-...",
  "dlrStatus": "DELIVERED",
  "durationMs": 47
}

Logs must never contain phone numbers or message body content.
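One way to enforce the field standards is to build every record from an allowlist, so that anything not explicitly permitted (a phone number, a message body) can never reach stdout. The sketch below is dependency-free and hypothetical: buildLogRecord, ALLOWED_FIELDS and the input keys msisdn and body are illustrative names, not the service's actual code. Pino also provides a built-in redact option, which may be the more natural fit in practice.

```typescript
// Hypothetical allowlist, derived from the fields named in this section.
const ALLOWED_FIELDS: readonly string[] = [
  "traceId",
  "spanId",
  "operatorMessageId",
  "operatorId",
  "messageId",
  "accountId",
  "dlrStatus",
  "existingReceiptId",
  "operation",
  "durationMs",
];

type LogRecord = Record<string, unknown>;

// Build a log record that carries only allowlisted context fields.
function buildLogRecord(
  level: string,
  event: string,
  context: Record<string, unknown>,
  now: Date = new Date(),
): LogRecord {
  const record: LogRecord = {
    level,
    time: now.toISOString(),
    service: "dlr-processor",
    event,
  };
  for (const key of ALLOWED_FIELDS) {
    if (key in context) record[key] = context[key];
  }
  return record;
}

// Usage: msisdn and body are silently dropped because they are not allowlisted.
const exampleRecord = buildLogRecord("info", "dlr.correlated", {
  operatorMessageId: "OP-MSG-20240418-00123",
  dlrStatus: "DELIVERED",
  durationMs: 47,
  msisdn: "+440000000000",
  body: "hello",
});
const exampleLine = JSON.stringify(exampleRecord);
```

An allowlist fails safe: a newly added context field stays out of the logs until someone deliberately permits it.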

3. Distributed Tracing (OpenTelemetry)

Trace spans are created for:

  • dlr.process — root span per NATS message
  • dlr.validate — schema validation
  • dlr.correlate — PG correlation query
  • dlr.persist — DB transaction
  • dlr.publish_billing — outbox insert
  • dlr.publish_webhook — outbox insert

The W3C traceparent is propagated from the NATS message headers when present (it is set by smpp-connector).
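To make the propagation step concrete, the following sketch parses a traceparent header value as laid out in the W3C Trace Context specification (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function and type names are hypothetical, and the sketch accepts only the four-field layout. In the real service the OpenTelemetry propagation API would normally do this work, so treat this as an explanation of the header rather than the implementation.

```typescript
type TraceContext = { traceId: string; parentSpanId: string; sampled: boolean };

// version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed context, or null when the header is absent or invalid,
// in which case the consumer would start a fresh root trace.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (header === undefined) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;

  const version = match[1] ?? "";
  const traceId = match[2] ?? "";
  const parentSpanId = match[3] ?? "";
  const flags = match[4] ?? "";

  // The specification forbids version ff and all-zero IDs.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;

  // Bit 0 of trace-flags is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { traceId, parentSpanId, sampled };
}

// Usage with the example value from the W3C specification.
const parsedContext = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
```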

4. Alerting Rules

# Orphan rate alert
- alert: DlrHighOrphanRate
  expr: dlr_orphan_rate > 0.005
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "DLR orphan rate > 0.5%"

# Outbox lag alert
- alert: DlrOutboxLag
  expr: dlr_outbox_pending_count > 1000
  for: 2m
  labels: { severity: warning }

# Consumer disconnected
- alert: DlrConsumerDisconnected
  expr: dlr_nats_consumer_status == 0
  for: 1m
  labels: { severity: critical }

# Processing latency p99
- alert: DlrHighLatency
  expr: histogram_quantile(0.99, rate(dlr_processing_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels: { severity: warning }
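The DlrHighLatency rule depends on how a quantile is estimated from histogram buckets, so a small sketch of that estimate may help when choosing bucket boundaries. It follows the linear interpolation that Prometheus applies to classic histograms as far as I understand it. It is a simplified illustration with hypothetical names and made-up bucket values, not Prometheus's code, and it returns NaN for edge cases that Prometheus handles differently.

```typescript
// One cumulative histogram bucket: observations with value <= le.
type Bucket = { le: number; cumulativeCount: number };

// Estimate the q-quantile (0 < q <= 1) by interpolating inside the bucket
// that contains the target rank. Requires a final +Inf bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q <= 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.le !== Infinity || last.cumulativeCount === 0) {
    return NaN;
  }

  const rank = q * last.cumulativeCount;
  let prevLe = 0; // the lowest bucket is assumed to start at 0
  let prevCount = 0;
  for (const bucket of sorted) {
    if (bucket.cumulativeCount >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (bucket.le === Infinity) return prevLe;
      const inBucket = bucket.cumulativeCount - prevCount;
      return prevLe + (bucket.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = bucket.le;
    prevCount = bucket.cumulativeCount;
  }
  return NaN;
}

// Usage with illustrative bucket counts (seconds).
const exampleBuckets: Bucket[] = [
  { le: 0.1, cumulativeCount: 50 },
  { le: 0.25, cumulativeCount: 90 },
  { le: 0.5, cumulativeCount: 98 },
  { le: 1, cumulativeCount: 100 },
  { le: Infinity, cumulativeCount: 100 },
];
const exampleP99 = estimateQuantile(0.99, exampleBuckets);
```

Because the estimate is interpolated, its resolution near the 0.5 s threshold depends on having bucket boundaries close to 0.5.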

5. Dashboards

Grafana dashboard dlr-processor-overview includes:

  • DLR throughput (received/processed/orphaned) — time series
  • Processing latency p50/p95/p99 — time series
  • Outbox pending count — gauge + time series
  • DLR status breakdown — pie chart
  • Orphan rate — gauge with threshold colouring
  • DB error rate — time series