DLR Processor — Observability
Status: populated | Owner: Platform Engineering | Last updated: 2026-04-18 | Companion: FAILURE_MODES · DEPLOYMENT_TOPOLOGY
1. Metrics (Prometheus)
| Metric | Type | Labels | Description |
|---|---|---|---|
| dlr_messages_received_total | Counter | operator_id | Total DLR events received from NATS |
| dlr_messages_processed_total | Counter | dlr_status, operator_id | Successfully processed DLRs |
| dlr_duplicates_total | Counter | operator_id | DLRs deduplicated (receipt already exists) |
| dlr_orphans_total | Counter | operator_id | DLRs with unresolvable operatorMessageId |
| dlr_validation_errors_total | Counter | — | Malformed inbound events discarded |
| dlr_processing_duration_seconds | Histogram | dlr_status | End-to-end processing latency |
| dlr_db_errors_total | Counter | operation | PostgreSQL errors by operation type |
| dlr_outbox_pending_count | Gauge | — | Unpublished events in outbox table |
| dlr_orphan_rate | Gauge | — | Rolling 5-min orphan rate as a fraction (0.005 = 0.5%) |
| dlr_nats_consumer_status | Gauge | — | 1 = active, 0 = disconnected |
| dlr_nats_ack_pending | Gauge | — | NATS consumer unacked message count |
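
The table describes dlr_orphan_rate as a rolling 5-minute gauge but does not say how it is computed. One possible approach is a sliding window over recent outcomes, sketched below. The class name `RollingOrphanRate` and its methods are hypothetical and not taken from the service. The result is a fraction between 0 and 1, which is the scale the alert threshold of 0.005 implies.

```typescript
// Hypothetical sketch: a sliding-window orphan ratio. Not the service's actual code.
type Outcome = { atMs: number; orphaned: boolean };

class RollingOrphanRate {
  private readonly windowMs: number;
  private events: Outcome[] = [];

  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Record one handled DLR and whether it turned out to be an orphan.
  record(orphaned: boolean, nowMs: number): void {
    this.events.push({ atMs: nowMs, orphaned });
    this.evict(nowMs);
  }

  // Fraction of orphans inside the window, in [0, 1]; 0 when the window is empty.
  value(nowMs: number): number {
    this.evict(nowMs);
    if (this.events.length === 0) return 0;
    const orphans = this.events.filter((e) => e.orphaned).length;
    return orphans / this.events.length;
  }

  // Drop events that are a full window old or older.
  private evict(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    this.events = this.events.filter((e) => e.atMs > cutoff);
  }
}
```

A gauge collector could call `value(Date.now())` on each scrape. Keeping every event in memory is the simplest option; at high throughput, per-second counters bucketed over the window would likely use less memory.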
2. Structured Logs (Pino / JSON)
All logs are emitted as JSON to stdout and collected by Fluent Bit, which forwards them to Loki.
Key Log Events
| Level | event field | When |
|---|---|---|
| INFO | dlr.received | Each inbound DLR (includes operatorMessageId, operatorId) |
| INFO | dlr.correlated | Successful correlation (includes messageId, accountId, dlrStatus) |
| INFO | dlr.duplicate | Duplicate detected (includes operatorMessageId, existingReceiptId) |
| WARN | dlr.orphaned | No correlation found (includes full context) |
| WARN | dlr.validation_failed | Schema validation error (sanitised payload) |
| ERROR | dlr.db_error | Database error (includes operation, redacted query) |
| ERROR | dlr.nats_publish_error | Outbox publish failure |
Log Field Standards
```json
{
  "level": "info",
  "time": "2026-04-18T10:23:45.123Z",
  "service": "dlr-processor",
  "traceId": "abc123",
  "spanId": "def456",
  "event": "dlr.correlated",
  "operatorMessageId": "OP-MSG-20240418-00123",
  "messageId": "c3d4e5f6-...",
  "accountId": "d4e5f6a7-...",
  "dlrStatus": "DELIVERED",
  "durationMs": 47
}
```
Logs must never contain phone numbers or message body content.
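
As an illustration of the field standard and the rule about sensitive data, the sketch below builds a log line in the documented shape and drops fields by name before serialising. The function `buildLogLine` and the key names in `FORBIDDEN_KEYS` are assumptions made for this example; the real event schema may name these fields differently, and in the actual service Pino's own `redact` option may be a better fit than a hand-written filter.

```typescript
type LogFields = Record<string, string | number | boolean>;

// Assumed names for sensitive fields; the real event schema may use others.
const FORBIDDEN_KEYS = new Set(["msisdn", "phoneNumber", "messageBody", "body", "text"]);
// Standard fields are set by the builder and cannot be overridden by callers.
const RESERVED_KEYS = new Set(["level", "time", "service", "event"]);

function buildLogLine(level: string, event: string, fields: LogFields, now: Date): string {
  const safe: LogFields = {};
  for (const [key, value] of Object.entries(fields)) {
    // Skip anything sensitive, and anything that would shadow a standard field.
    if (FORBIDDEN_KEYS.has(key) || RESERVED_KEYS.has(key)) continue;
    safe[key] = value;
  }
  return JSON.stringify({
    level,
    time: now.toISOString(),
    service: "dlr-processor",
    event,
    ...safe,
  });
}

// Example: the msisdn and body fields are dropped from the output line.
const exampleLine = buildLogLine(
  "info",
  "dlr.correlated",
  { operatorMessageId: "OP-MSG-20240418-00123", dlrStatus: "DELIVERED", durationMs: 47, msisdn: "+447700900123", body: "hello" },
  new Date("2026-04-18T10:23:45.123Z"),
);
console.log(exampleLine);
```

The filter works on key names rather than on value patterns. A digit-matching pattern would be risky here because identifiers such as operatorMessageId legitimately contain long digit runs.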
3. Distributed Tracing (OpenTelemetry)
Trace spans created for:
- dlr.process — root span per NATS message
- dlr.validate — schema validation
- dlr.correlate — PG correlation query
- dlr.persist — DB transaction
- dlr.publish_billing — outbox insert
- dlr.publish_webhook — outbox insert
The W3C traceparent header is propagated from the NATS message headers when present (set by smpp-connector).
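
To make the propagation step concrete, here is a minimal sketch of parsing a traceparent value once it has been read from the message headers. The function name `parseTraceparent` and the `TraceContext` type are hypothetical. How the header is read from the NATS message is left out, and in practice an OpenTelemetry propagator would normally do this work. The sketch handles only version 00 of the header format.

```typescript
type TraceContext = { traceId: string; parentSpanId: string; sampled: boolean };

// version "-" 32 hex trace-id "-" 16 hex parent-id "-" 2 hex flags, all lowercase.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  // A missing header means the consumer should start a new trace.
  if (header === undefined) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;

  const version = match[1] ?? "";
  const traceId = match[2] ?? "";
  const parentSpanId = match[3] ?? "";
  const flags = match[4] ?? "";

  // Only version 00 is handled in this sketch.
  if (version !== "00") return null;
  // All-zero identifiers are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;

  // The lowest flag bit marks the trace as sampled.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Returning `null` for a missing or malformed header lets the caller fall back to a fresh root span for dlr.process, which matches the "when present" wording above.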
4. Alerting Rules
```yaml
# Orphan rate alert
- alert: DlrHighOrphanRate
  expr: dlr_orphan_rate > 0.005
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "DLR orphan rate > 0.5%"

# Outbox lag alert
- alert: DlrOutboxLag
  expr: dlr_outbox_pending_count > 1000
  for: 2m
  labels: { severity: warning }

# Consumer disconnected
- alert: DlrConsumerDisconnected
  expr: dlr_nats_consumer_status == 0
  for: 1m
  labels: { severity: critical }

# Processing latency p99
- alert: DlrHighLatency
  expr: histogram_quantile(0.99, rate(dlr_processing_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels: { severity: warning }
```
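
The latency alert depends on how a quantile is estimated from cumulative histogram buckets. The sketch below approximates that estimation with linear interpolation inside the bucket that contains the target rank, which is how `histogram_quantile` is commonly documented to behave. The function `histogramQuantile`, the `Bucket` type, and the sample bucket boundaries and counts are all made up for illustration and do not reflect the service's real bucket layout. The alert expression feeds per-second rates rather than raw counts, but the interpolation is unaffected by that scaling.

```typescript
// One cumulative bucket: the number of observations with value <= le.
type Bucket = { le: number; cumulativeCount: number };

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0) return -Infinity;
  if (q > 1) return Infinity;

  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  // Needs at least one finite bucket plus the +Inf bucket, and some observations.
  if (sorted.length < 2 || last === undefined || last.le !== Infinity) return NaN;
  if (last.cumulativeCount === 0) return NaN;

  const rank = q * last.cumulativeCount;
  let prevLe = 0; // assumes non-negative observations, as with latencies
  let prevCount = 0;
  for (const bucket of sorted) {
    if (bucket.cumulativeCount >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (bucket.le === Infinity) return prevLe;
      const inBucket = bucket.cumulativeCount - prevCount;
      if (inBucket === 0) return bucket.le;
      // Assume observations are spread evenly across the bucket.
      return prevLe + (bucket.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = bucket.le;
    prevCount = bucket.cumulativeCount;
  }
  return NaN;
}

// Made-up sample: 1000 observations across illustrative latency buckets (seconds).
const sampleBuckets: Bucket[] = [
  { le: 0.1, cumulativeCount: 900 },
  { le: 0.25, cumulativeCount: 980 },
  { le: 0.5, cumulativeCount: 995 },
  { le: 1, cumulativeCount: 999 },
  { le: Infinity, cumulativeCount: 1000 },
];
const sampleP99 = histogramQuantile(0.99, sampleBuckets);
console.log(sampleP99 > 0.5 ? "DlrHighLatency condition met" : "below threshold");
```

With these sample buckets the p99 rank (990 of 1000) lands in the 0.25 to 0.5 s bucket, and interpolation gives roughly 0.417 s, which is below the 0.5 s threshold. The estimate can never be more precise than the bucket boundaries, so a boundary at or near 0.5 s matters for this alert.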
5. Dashboards
Grafana dashboard dlr-processor-overview includes:
- DLR throughput (received/processed/orphaned) — time series
- Processing latency p50/p95/p99 — time series
- Outbox pending count — gauge + time series
- DLR status breakdown — pie chart
- Orphan rate — gauge with threshold colouring
- DB error rate — time series