DLR Processor — Epics & User Stories

Status: populated
Owner: Platform Engineering / Product
Last updated: 2026-04-18


EP-DLR-01: Core DLR Processing Pipeline

Description: Build the foundational event-driven pipeline that consumes delivery receipts from NATS, normalises status codes, and persists results to PostgreSQL.

Acceptance Criteria:

  • Durable NATS consumer dlr-processor active on sms.dlr.inbound
  • DLR status normalisation covers all SMPP stat values
  • dlr.delivery_receipts written with idempotency on operatorMessageId
  • orch.sms_messages updated for terminal statuses
  • Processing latency p99 < 500 ms at 500 msg/s sustained load

US-DLR-001: Consume DLR Events from NATS JetStream

Title: As the platform, I want the DLR Processor to consume sms.dlr.inbound messages durably so that no delivery receipt is lost.

Description: Implement a durable NATS JetStream consumer (dlr-processor) with the AckExplicit policy and MaxConcurrency: 10. The consumer must survive pod restarts and resume from the last acked position.

Acceptance Criteria:

  • Consumer name dlr-processor appears in NATS consumer list
  • On pod restart, processing resumes from last acked offset
  • MaxConcurrency: 10 enforced; concurrent processing verified under load
  • /ready returns 200 only when consumer is active

Story Points: 3


US-DLR-002: Normalise SMPP Status Codes to Canonical DlrStatus

Title: As the platform, I want all carrier status strings mapped to a canonical DlrStatus enum so that downstream services work with consistent values.

Description: Implement the stateless DlrStatusNormaliser module mapping all 8 known SMPP stat values plus fallback to UNKNOWN. Must be case-insensitive.

Acceptance Criteria:

  • DELIVRD → DELIVERED, UNDELIV → UNDELIVERED, EXPIRED → EXPIRED, DELETED → FAILED, ACCEPTD → UNKNOWN, REJECTD → REJECTED, UNKNOWN → UNKNOWN, FAILED → FAILED
  • Any unrecognised value → UNKNOWN
  • Case-insensitive input handled
  • Unit tests cover all 9 cases
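
The mapping above can be sketched as a stateless lookup. This is a minimal illustration, not the actual module: the function name normaliseDlrStatus, the whitespace trimming, and the handling of null input are assumptions; only the eight mappings and the UNKNOWN fallback come from the acceptance criteria.

```typescript
// Canonical statuses named in US-DLR-002.
type DlrStatus = "DELIVERED" | "UNDELIVERED" | "EXPIRED" | "FAILED" | "REJECTED" | "UNKNOWN";

// Mapping copied from the acceptance criteria; keys are upper-case SMPP stat values.
const STAT_TO_STATUS: Record<string, DlrStatus> = {
  DELIVRD: "DELIVERED",
  UNDELIV: "UNDELIVERED",
  EXPIRED: "EXPIRED",
  DELETED: "FAILED",
  ACCEPTD: "UNKNOWN",
  REJECTD: "REJECTED",
  UNKNOWN: "UNKNOWN",
  FAILED: "FAILED",
};

// Case-insensitive lookup; anything unrecognised (including empty input) maps to UNKNOWN.
// Trimming and null handling are assumptions added for robustness.
function normaliseDlrStatus(rawStat: string | null | undefined): DlrStatus {
  const key = (rawStat ?? "").trim().toUpperCase();
  // hasOwnProperty avoids accidental hits on inherited keys such as "CONSTRUCTOR".
  const hit = Object.prototype.hasOwnProperty.call(STAT_TO_STATUS, key)
    ? STAT_TO_STATUS[key]
    : undefined;
  return hit ?? "UNKNOWN";
}
```

The nine unit-test cases then reduce to the eight table entries plus one unrecognised value such as ENROUTE.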

Story Points: 2


US-DLR-003: Persist Delivery Receipts with Idempotency

Title: As the platform, I want each DLR written to dlr.delivery_receipts exactly once so that duplicate events from operators cause no side effects.

Description: Use the INSERT ... ON CONFLICT (operator_message_id) DO NOTHING pattern. If a conflict is detected, Ack the NATS message and exit without further processing.

Acceptance Criteria:

  • First occurrence of operatorMessageId creates row
  • Second occurrence of same operatorMessageId results in dlr_duplicates_total counter increment and immediate Ack
  • No billing or webhook events emitted for duplicates
  • Integration test confirms single row after 5 rapid duplicates
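
A possible shape for the idempotent write is sketched below. The Queryable interface is a hypothetical abstraction modelled loosely on the query method of a PostgreSQL client, and every column except operator_message_id is an assumption. The sketch relies on the fact that an INSERT with ON CONFLICT DO NOTHING reports zero affected rows when the row already exists.

```typescript
// Hypothetical minimal DB interface, so the sketch does not depend on a specific driver.
interface Queryable {
  query(text: string, params: unknown[]): Promise<{ rowCount: number | null }>;
}

// Field names are illustrative, not taken from the real event schema.
interface DlrEvent {
  operatorMessageId: string;
  operatorId: string;
  status: string;
  receivedAt: Date;
}

// Only operator_message_id is named in the story; the other columns are assumed.
const INSERT_RECEIPT_SQL = `
  INSERT INTO dlr.delivery_receipts (operator_message_id, operator_id, status, received_at)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT (operator_message_id) DO NOTHING`;

// Returns "inserted" for the first occurrence and "duplicate" when the row already existed.
async function persistReceipt(db: Queryable, dlr: DlrEvent): Promise<"inserted" | "duplicate"> {
  const result = await db.query(INSERT_RECEIPT_SQL, [
    dlr.operatorMessageId,
    dlr.operatorId,
    dlr.status,
    dlr.receivedAt,
  ]);
  // DO NOTHING affects zero rows on conflict.
  return (result.rowCount ?? 0) > 0 ? "inserted" : "duplicate";
}
```

On "duplicate" the caller would increment dlr_duplicates_total, Ack the NATS message, and skip billing and webhook emission.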

Story Points: 3


US-DLR-004: Update sms_messages Status on Terminal DLR

Title: As the platform, I want orch.sms_messages updated with DLR status so that message tracking reflects final delivery outcome.

Description: Execute targeted UPDATE setting status, dlr_status, dlr_received_at, processed_at within the same DB transaction as the delivery_receipts insert. Guard clause prevents overwriting already-terminal rows.

Acceptance Criteria:

  • Terminal DLR statuses (DELIVERED, FAILED, UNDELIVERED, EXPIRED, REJECTED) update orch.sms_messages
  • UNKNOWN status does NOT update orch.sms_messages.status
  • Guard clause verified: second DLR for same messageId does not overwrite terminal state
  • Verified with integration test using Testcontainers
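
The guard clause can be expressed both in SQL and as a small predicate. The sketch below is illustrative only: the id column, the parameter order, and the idea that orch.sms_messages.status shares the DlrStatus vocabulary are assumptions, and shouldUpdateStatus is a hypothetical helper name.

```typescript
type DlrStatus = "DELIVERED" | "UNDELIVERED" | "EXPIRED" | "FAILED" | "REJECTED" | "UNKNOWN";

// Terminal statuses as listed in the acceptance criteria.
const TERMINAL_STATUSES: string[] = ["DELIVERED", "FAILED", "UNDELIVERED", "EXPIRED", "REJECTED"];

function isTerminal(status: string): boolean {
  return TERMINAL_STATUSES.indexOf(status) !== -1;
}

// True only when the incoming DLR is terminal and the stored row is not yet terminal.
function shouldUpdateStatus(currentStatus: string, incoming: DlrStatus): boolean {
  if (!isTerminal(incoming)) return false; // UNKNOWN never changes status
  return !isTerminal(currentStatus); // guard: do not overwrite a terminal row
}

// The same guard pushed into the UPDATE so it holds under concurrency.
// "id" and the parameter layout are assumptions.
const UPDATE_MESSAGE_SQL = `
  UPDATE orch.sms_messages
     SET status = $2, dlr_status = $3, dlr_received_at = $4, processed_at = now()
   WHERE id = $1
     AND status NOT IN ('DELIVERED', 'FAILED', 'UNDELIVERED', 'EXPIRED', 'REJECTED')`;
```

Keeping the guard in the WHERE clause means a second DLR for the same messageId affects zero rows, which is easy to assert in the Testcontainers test.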

Story Points: 3


EP-DLR-02: Orphan Handling & Reconciliation

Description: Handle DLRs that arrive with an operatorMessageId that cannot be correlated to a known outbound message.


US-DLR-005: Write Orphaned Receipts on Correlation Failure

Title: As the platform, I want unresolvable DLRs stored in dlr.orphaned_receipts so that they can be reconciled later.

Description: When the correlation query returns no rows for operatorMessageId, insert the full event payload into dlr.orphaned_receipts and Ack the NATS message. Do not Nak for redelivery: retrying will not help without a matching row.

Acceptance Criteria:

  • dlr.orphaned_receipts row created with operatorMessageId, rawPayload, receivedAt
  • dlr_orphans_total Prometheus counter incremented
  • dlr.orphan_rate gauge updated
  • NATS message Acked (not Nak'd)
  • Log entry dlr.orphaned at WARN level with operatorMessageId and operatorId

Story Points: 2


US-DLR-006: Publish sms.dlr.unmatched for Orphaned Receipts

Title: As the reconciliation subsystem, I want to receive a sms.dlr.unmatched NATS event for each orphan so that automated reconciliation can be triggered.

Description: After writing to dlr.orphaned_receipts, publish sms.dlr.unmatched including orphanId for cross-reference.

Acceptance Criteria:

  • sms.dlr.unmatched written to dlr.outbox in the same DB transaction as the orphaned_receipts insert (outbox pattern); the relay publishes it to NATS afterwards
  • Event includes orphanId, operatorMessageId, operatorId, rawStat, receivedAt
  • Integration test: publish DLR with unknown ID → verify sms.dlr.unmatched received by test consumer

Story Points: 2


US-DLR-007: Correlation Retry Before Orphaning (Race Condition Mitigation)

Title: As the platform, I want a brief retry window before orphaning a DLR so that DLRs arriving slightly before the SENT status update are not incorrectly orphaned.

Description: When correlation fails, retry the lookup after a 3-second delay up to 2 times before writing to orphaned_receipts. This mitigates the race condition where the smpp-connector DLR arrives before the orchestrator has written SENT status.

Acceptance Criteria:

  • Correlation retried up to 2 times with 3s delay between attempts
  • If found on retry: normal processing path (not orphaned)
  • If not found after retries: orphan path
  • dlr_correlation_retry_total counter tracks retry attempts
  • Processing latency SLO accounts for up to 6s retry window
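
The retry window could be implemented roughly as follows. This is a sketch under assumptions: correlateWithRetry, the injectable sleep function, and the onRetry hook are hypothetical names introduced so the logic can be tested without real delays. One initial lookup plus up to two retries gives the 6 s worst case mentioned above.

```typescript
type Lookup<T> = (operatorMessageId: string) => Promise<T | null>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface RetryOptions {
  maxRetries: number; // 2 in the story
  delayMs: number; // 3000 in the story
  sleep: Sleep; // injectable for tests
  onRetry?: () => void; // e.g. increment dlr_correlation_retry_total
}

// Looks up the correlation row, retrying after a delay while nothing is found.
// Resolves to null when every attempt fails, which signals the orphan path.
async function correlateWithRetry<T>(
  operatorMessageId: string,
  lookup: Lookup<T>,
  opts: RetryOptions = { maxRetries: 2, delayMs: 3000, sleep: defaultSleep }
): Promise<T | null> {
  let found = await lookup(operatorMessageId);
  for (let attempt = 1; found === null && attempt <= opts.maxRetries; attempt++) {
    await opts.sleep(opts.delayMs);
    if (opts.onRetry) opts.onRetry();
    found = await lookup(operatorMessageId);
  }
  return found;
}
```

Because the delay holds one of the 10 concurrent processing slots, a burst of uncorrelated DLRs can reduce throughput; that trade-off is worth covering in the load test.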

Story Points: 3


EP-DLR-03: Downstream Event Publishing

Description: Fan processed DLR outcomes to billing and webhook downstream consumers via the transactional outbox pattern.


US-DLR-008: Publish billing.events via Transactional Outbox

Title: As the billing service, I want a billing.events event for each terminal DLR so that charges can be calculated.

Description: Insert billing.events payload into dlr.outbox within the same DB transaction as the delivery_receipts insert. Outbox relay publishes to NATS asynchronously.

Acceptance Criteria:

  • billing.events published for DELIVERED, FAILED, UNDELIVERED, EXPIRED, REJECTED statuses
  • NOT published for UNKNOWN status
  • Event includes messageId, accountId, segmentCount, dlrStatus, operatorId
  • Outbox pattern: no billing event lost even if NATS is temporarily unavailable
  • Pact contract test verified with billing-service team

Story Points: 3


US-DLR-009: Publish webhook.dispatch via Transactional Outbox

Title: As the webhook dispatcher, I want a webhook.dispatch event for every correlated DLR so that customer webhooks can be notified.

Description: Insert webhook.dispatch payload into dlr.outbox within the same DB transaction. Published for all DlrStatus values including UNKNOWN.

Acceptance Criteria:

  • webhook.dispatch published for ALL DlrStatus values
  • Event includes accountId, messageId, dlrStatus, to, occurredAt
  • Outbox pattern ensures at-least-once delivery
  • Pact contract test verified with webhook-dispatcher team
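
The fan-out rules in US-DLR-008 and US-DLR-009 can be summarised as one function that decides which rows go into dlr.outbox. The sketch below is an assumption-laden illustration: buildOutboxRows, CorrelatedDlr, and OutboxRow are hypothetical names, while the subjects and payload fields follow the acceptance criteria.

```typescript
type DlrStatus = "DELIVERED" | "UNDELIVERED" | "EXPIRED" | "FAILED" | "REJECTED" | "UNKNOWN";

// Hypothetical shape of a DLR after successful correlation.
interface CorrelatedDlr {
  messageId: string;
  accountId: string;
  operatorId: string;
  segmentCount: number;
  dlrStatus: DlrStatus;
  to: string;
  occurredAt: string;
}

interface OutboxRow {
  subject: "billing.events" | "webhook.dispatch";
  payload: Record<string, unknown>;
}

// Statuses that trigger a billing event; UNKNOWN is deliberately absent.
const BILLABLE: DlrStatus[] = ["DELIVERED", "FAILED", "UNDELIVERED", "EXPIRED", "REJECTED"];

// Rows to insert into dlr.outbox in the same transaction as the delivery_receipts insert.
function buildOutboxRows(dlr: CorrelatedDlr): OutboxRow[] {
  const rows: OutboxRow[] = [
    {
      subject: "webhook.dispatch", // published for every DlrStatus
      payload: {
        accountId: dlr.accountId,
        messageId: dlr.messageId,
        dlrStatus: dlr.dlrStatus,
        to: dlr.to,
        occurredAt: dlr.occurredAt,
      },
    },
  ];
  if (BILLABLE.indexOf(dlr.dlrStatus) !== -1) {
    rows.push({
      subject: "billing.events", // terminal statuses only
      payload: {
        messageId: dlr.messageId,
        accountId: dlr.accountId,
        segmentCount: dlr.segmentCount,
        dlrStatus: dlr.dlrStatus,
        operatorId: dlr.operatorId,
      },
    });
  }
  return rows;
}
```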

Story Points: 2


EP-DLR-04: Observability & Operations


US-DLR-010: Instrument Prometheus Metrics

Title: As an SRE, I want comprehensive Prometheus metrics so that I can monitor DLR processing health in production.

Description: Implement all metrics listed in OBSERVABILITY.md including throughput counters, latency histogram, orphan gauge, outbox pending gauge, and NATS consumer status gauge.

Acceptance Criteria:

  • All 11 metrics from OBSERVABILITY.md §1 implemented
  • /metrics endpoint returns Prometheus text format
  • dlr_processing_duration_seconds histogram has buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s
  • Grafana dashboard dlr-processor-overview loads without errors

Story Points: 3


US-DLR-011: Structured JSON Logging

Title: As an SRE, I want structured JSON logs for all key processing events so that I can query and correlate incidents in Loki.

Description: Implement Pino structured logging for all processing paths with event field, traceId, spanId, and domain context. No PII in log lines.

Acceptance Criteria:

  • All 8 log events from OBSERVABILITY.md §2 implemented
  • Phone numbers (destAddr, sourceAddr) never appear in log output
  • traceId and spanId present when OTEL trace is active
  • Log level configurable via LOG_LEVEL env var

Story Points: 2


US-DLR-012: Alerting Rules for Critical Failure Modes

Title: As an SRE, I want Prometheus alerting rules for orphan rate, outbox lag, and consumer disconnection so that on-call engineers are paged promptly.

Description: Implement the 4 alerting rules from OBSERVABILITY.md §4.

Acceptance Criteria:

  • DlrHighOrphanRate fires when orphan rate > 0.5% for 5 min
  • DlrOutboxLag fires when outbox pending > 1000 for 2 min
  • DlrConsumerDisconnected fires when consumer status = 0 for 1 min
  • DlrHighLatency fires when p99 > 500 ms for 5 min
  • All rules tested in staging with simulated conditions

Story Points: 2


EP-DLR-05: Security & Compliance


US-DLR-013: Least-Privilege NATS Permissions

Title: As a security engineer, I want the DLR Processor NATS account scoped to minimum required subjects so that a compromise cannot publish arbitrary events.

Acceptance Criteria:

  • NATS user dlr-processor can only: subscribe sms.dlr.inbound, publish billing.events, publish webhook.dispatch, publish sms.dlr.unmatched
  • No wildcard subjects granted
  • Verified in NATS security review

Story Points: 1


US-DLR-014: Least-Privilege PostgreSQL Grants

Title: As a security engineer, I want the dlr-processor DB user restricted to minimum required operations so that a compromise cannot affect other service schemas.

Acceptance Criteria:

  • dlr_svc has SELECT, INSERT on dlr.* tables only
  • dlr_svc has UPDATE(status, dlr_status, dlr_received_at, processed_at) on orch.sms_messages only
  • No DROP, TRUNCATE, or schema-level privileges
  • Verified by DBA review

Story Points: 1


EP-DLR-06: Inbound DLR Idempotency & Replay Quarantine

Description: MNOs occasionally re-deliver deliver_sm DLRs from stale buffers (especially after MNO-side restarts). The processor must dedupe by (operatorId, operatorMessageId, status, timestampBucket) and quarantine suspicious replays for forensic review.


US-DLR-015: DLR Fingerprint Dedup

Title: As the dlr-processor, I want every inbound DLR fingerprinted and deduped so that duplicate delivery from MNO buffers does not double-bill or double-trigger webhooks.

Acceptance Criteria:

  • Fingerprint = sha256(operatorId || operatorMessageId || status || floor(receivedAt / 60s))
  • Redis key dlr:fp:{fingerprint} set with TTL 24h on first seen
  • Duplicate fingerprint → ACK NATS, increment dlr_dedup_total{operatorId}, do NOT process downstream
  • Integration test: same DLR replayed within 60 s → only one billing event, one webhook
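
The fingerprint formula can be sketched as follows, using Node's built-in crypto module. Several details are assumptions: the "|" field separator (the story only says the fields are concatenated), the function and class names, and the in-memory FingerprintStore, which merely stands in for a Redis set-if-absent on dlr:fp:{fingerprint} with a 24 h TTL.

```typescript
import { createHash } from "node:crypto";

const BUCKET_MS = 60000; // 60 s timestamp bucket
const TTL_MS = 24 * 60 * 60 * 1000; // 24 h retention of seen fingerprints

// sha256 over operatorId, operatorMessageId, status and the 60 s bucket of receivedAt.
// The "|" separator is an assumption; it avoids ambiguous concatenations ("ab"+"c" vs "a"+"bc").
function dlrFingerprint(
  operatorId: string,
  operatorMessageId: string,
  status: string,
  receivedAtMs: number
): string {
  const bucket = Math.floor(receivedAtMs / BUCKET_MS);
  return createHash("sha256")
    .update([operatorId, operatorMessageId, status, String(bucket)].join("|"))
    .digest("hex");
}

// In-memory stand-in for Redis; maps key to expiry time in ms.
class FingerprintStore {
  private expiries: Record<string, number> = {};

  // Returns true on first sight (and records the key), false for a live duplicate.
  firstSeen(fingerprint: string, nowMs: number, ttlMs: number = TTL_MS): boolean {
    const key = "dlr:fp:" + fingerprint;
    const expiry = this.expiries[key];
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiries[key] = nowMs + ttlMs;
    return true;
  }
}
```

One consequence of bucketing is that two copies arriving a few seconds apart on either side of a minute boundary get different fingerprints, so the ON CONFLICT safeguard in US-DLR-003 remains relevant as a second line of defence.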

Story Points: 3


US-DLR-016: Replay Quarantine (Suspicious Patterns)

Title: As an SRE, I want DLRs replayed more than 5 min after the original DLR window to be quarantined so that forensic review can determine MNO misbehaviour.

Acceptance Criteria:

  • If now - dlrTimestamp > 5 min AND original message reached terminal status > 24h ago → quarantine
  • Quarantine table dlr.quarantined_dlrs (operatorId, payload JSONB, receivedAt, fingerprint)
  • Daily report aggregates quarantine count per operator; flag any operator > 1% replay rate
  • Manual release endpoint POST /v1/admin/dlr/quarantine/:id/release
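
The quarantine condition in the first criterion is a pure predicate and can be sketched as below. The function name shouldQuarantine and the use of null for "original message not yet terminal" are assumptions; the two thresholds come from the criterion.

```typescript
const FIVE_MINUTES_MS = 5 * 60 * 1000;
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// nowMs: current time; dlrTimestampMs: timestamp carried by the DLR;
// terminalAtMs: when the original message reached terminal status, or null if it has not.
function shouldQuarantine(
  nowMs: number,
  dlrTimestampMs: number,
  terminalAtMs: number | null
): boolean {
  if (terminalAtMs === null) return false; // nothing to replay against yet
  const dlrIsStale = nowMs - dlrTimestampMs > FIVE_MINUTES_MS;
  const terminalLongAgo = nowMs - terminalAtMs > ONE_DAY_MS;
  return dlrIsStale && terminalLongAgo; // both conditions must hold
}
```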

Story Points: 3


US-DLR-017: Late DLR Handling (After Message Terminal Status)

Title: As the dlr-processor, I want DLRs received after a message reaches terminal status to be logged but not propagated downstream so that billing and webhooks are not reopened.

Acceptance Criteria:

  • If sms_messages.status is terminal (DELIVERED, FAILED, UNDELIVERED, EXPIRED, REJECTED) and a DLR arrives → log to dlr.late_dlrs, do not republish
  • Metric dlr_late_total{operatorId} counter
  • Late-DLR rate dashboard panel; alert if > 0.5% of total DLRs sustained for 1 h

Story Points: 2


EP-DLR-07: Segment-Aware DLR Aggregation for Concatenated SMS

Description: A 3-segment concatenated SMS yields up to 3 separate DLRs. Per-segment status must be aggregated into a single message-level status using documented rules so downstream consumers see one event per outbound message.


US-DLR-018: Segment Correlation Table and Aggregation Rules

Title: As the dlr-processor, I want segment-level DLRs correlated to the parent message and aggregated per documented rules so downstream consumers see a single per-message status.

Acceptance Criteria:

  • Table dlr.segment_dlrs (messageId, segmentIndex, totalSegments, status, receivedAt)
  • Aggregation: DELIVERED only when all segments DELIVERED; FAILED if any segment FAILED; PARTIAL on mixed; PENDING until all received or 24h grace expires
  • Message-level status published to sms.dlr.aggregated.v1 once all segments received OR 24h grace expires
  • Downstream consumers (billing, webhook) consume aggregated subject only
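
The aggregation rules leave the precedence between FAILED, PARTIAL, and PENDING open, so the sketch below picks one interpretation and should be read as an assumption rather than the documented behaviour: PENDING while segments are outstanding and the grace period has not expired; then FAILED if any received segment FAILED; DELIVERED only if all segments arrived as DELIVERED; PARTIAL otherwise. The names aggregateSegments and SegmentDlr are hypothetical.

```typescript
type SegmentStatus = "DELIVERED" | "FAILED" | "UNDELIVERED" | "EXPIRED" | "REJECTED" | "UNKNOWN";
type AggregateStatus = "DELIVERED" | "FAILED" | "PARTIAL" | "PENDING";

interface SegmentDlr {
  segmentIndex: number;
  status: SegmentStatus;
}

// Collapses per-segment DLRs into one message-level status (assumed precedence, see lead-in).
function aggregateSegments(
  received: SegmentDlr[],
  totalSegments: number,
  graceExpired: boolean
): AggregateStatus {
  // Keep the latest status per segment index so repeated DLRs are not double-counted.
  const byIndex: Record<number, SegmentStatus> = {};
  for (const segment of received) byIndex[segment.segmentIndex] = segment.status;
  const statuses = Object.keys(byIndex).map((k) => byIndex[Number(k)]);

  const complete = statuses.length >= totalSegments;
  if (!complete && !graceExpired) return "PENDING";
  if (statuses.some((s) => s === "FAILED")) return "FAILED";
  if (complete && statuses.every((s) => s === "DELIVERED")) return "DELIVERED";
  return "PARTIAL"; // mixed outcomes, or segments still missing after the grace period
}
```

Under this reading, sms.dlr.aggregated.v1 is published exactly when the function stops returning PENDING.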

Story Points: 5


US-DLR-019: Aggregation Timeout and Partial Reporting

Title: As the dlr-processor, I want to publish a PARTIAL aggregated DLR if not all segments have DLRs within 24 h so downstream is unblocked.

Acceptance Criteria:

  • Cron every 5 min selects messages with any segment in PENDING past 24h
  • Publishes sms.dlr.aggregated.v1 with status: "PARTIAL", received: N, expected: M
  • Audit log entry per partial
  • Webhook payload includes per-segment detail array
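
The partial event described above could be assembled as in this sketch. The status, received, and expected fields come from the criteria; the function name buildPartialEvent, the 1-based segment indices, the segments field name, and the use of "PENDING" for segments with no DLR are all assumptions.

```typescript
interface SegmentRow {
  segmentIndex: number; // assumed to be 1-based
  status: string;
}

interface PartialEvent {
  messageId: string;
  status: "PARTIAL";
  received: number; // N: segments with a DLR
  expected: number; // M: total segments
  segments: { segmentIndex: number; status: string }[]; // per-segment detail array
}

// Builds the sms.dlr.aggregated.v1 payload for a message whose grace period has expired.
function buildPartialEvent(
  messageId: string,
  rows: SegmentRow[],
  totalSegments: number
): PartialEvent {
  const byIndex: Record<number, string> = {};
  for (const row of rows) byIndex[row.segmentIndex] = row.status;

  const segments: { segmentIndex: number; status: string }[] = [];
  let received = 0;
  for (let i = 1; i <= totalSegments; i++) {
    const status = byIndex[i];
    if (status !== undefined) received++;
    // Segments without a DLR are reported as PENDING (assumption).
    segments.push({ segmentIndex: i, status: status !== undefined ? status : "PENDING" });
  }
  return { messageId, status: "PARTIAL", received, expected: totalSegments, segments };
}
```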

Story Points: 3


US-DLR-020: Customer-Portal Segment-Detail View

Title: As a customer, I want to see per-segment status of my long messages so I can diagnose partial failures.

Acceptance Criteria:

  • GET /v1/sms/{messageId}/segments returns array of segment-status entries
  • Customer-portal message-detail page renders segment table
  • Tenant isolation enforced via RLS
  • Contract test verifies segment-level visibility

Story Points: 3


EP-DLR-08: Orphan-DLR Burial Queue with Time-Boxed Retention

Description: DLRs that cannot be correlated to any known message (correlation expired, MNO mis-routed, replay attack) must be buried in a quarantine queue with bounded retention so the system isn't poisoned.


US-DLR-021: Orphan DLR Detection and Burial

Title: As the dlr-processor, I want DLRs with no correlation row buried in dlr.orphan_dlrs so downstream is never confused.

Acceptance Criteria:

  • No row in smpp.message_correlations for (operatorId, operatorMessageId) → insert into dlr.orphan_dlrs; ACK NATS
  • Metric dlr_orphan_total{operatorId} counter
  • Alert DlrOrphanRateHigh if > 1% sustained for 15 min

Story Points: 3


US-DLR-022: Orphan-DLR Retention and Archive

Title: As an SRE, I want orphan DLRs retained 30 d hot, 1 y cold, then purged so storage doesn't grow unbounded.

Acceptance Criteria:

  • Partitioned by month; 30-d-old partitions archived to S3 (s3://ghasi-dlr-archive/orphans/{yyyy}/{mm}/)
  • 1-y-old archives purged via lifecycle policy
  • Manual export endpoint for ops investigations
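
The archive path template and the 30-day threshold can be sketched as two small helpers. The helper names are hypothetical, and measuring partition age from the end of the partition's month (in UTC) is an assumption; the bucket path follows the template in the first criterion.

```typescript
const ARCHIVE_ROOT = "s3://ghasi-dlr-archive/orphans";
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Expands {yyyy}/{mm} for the month a partition covers, using UTC.
function archivePrefix(partitionMonth: Date): string {
  const yyyy = partitionMonth.getUTCFullYear();
  const mm = ("0" + (partitionMonth.getUTCMonth() + 1)).slice(-2); // zero-padded month
  return ARCHIVE_ROOT + "/" + yyyy + "/" + mm + "/";
}

// A partition is due once its end lies at least 30 days in the past (assumed definition of age).
function isDueForArchive(partitionEnd: Date, now: Date): boolean {
  return now.getTime() - partitionEnd.getTime() >= THIRTY_DAYS_MS;
}
```

The one-year purge itself needs no application code if it is handled by the bucket lifecycle policy, as the second criterion suggests.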

Story Points: 2