DLR Processor — Epics & User Stories
Status: populated
Owner: Platform Engineering / Product
Last updated: 2026-04-18
EP-DLR-01: Core DLR Processing Pipeline
Description: Build the foundational event-driven pipeline that consumes delivery receipts from NATS, normalises status codes, and persists results to PostgreSQL.
Acceptance Criteria:
- Durable NATS consumer dlr-processor active on sms.dlr.inbound
- DLR status normalisation covers all SMPP stat values
- dlr.delivery_receipts written with idempotency on operatorMessageId
- orch.sms_messages updated for terminal statuses
- Processing latency p99 < 500 ms at 500 msg/s sustained load
US-DLR-001: Consume DLR Events from NATS JetStream
Title: As the platform, I want the DLR Processor to consume sms.dlr.inbound messages durably so that no delivery receipt is lost.
Description: Implement a durable NATS JetStream consumer (dlr-processor) with AckExplicit policy and MaxConcurrency: 10. Consumer must survive pod restarts and replay from last acked position.
Acceptance Criteria:
- Consumer name dlr-processor appears in NATS consumer list
- On pod restart, processing resumes from last acked offset
- MaxConcurrency: 10 enforced; concurrent processing verified under load
- /ready returns 200 only when consumer is active
Story Points: 3
US-DLR-002: Normalise SMPP Status Codes to Canonical DlrStatus
Title: As the platform, I want all carrier status strings mapped to a canonical DlrStatus enum so that downstream services work with consistent values.
Description: Implement the stateless DlrStatusNormaliser module mapping all 8 known SMPP stat values plus fallback to UNKNOWN. Must be case-insensitive.
Acceptance Criteria:
- DELIVRD→DELIVERED, UNDELIV→UNDELIVERED, EXPIRED→EXPIRED, DELETED→FAILED, ACCEPTD→UNKNOWN, REJECTD→REJECTED, UNKNOWN→UNKNOWN, FAILED→FAILED
- Any unrecognised value → UNKNOWN
- Case-insensitive input handled
- Unit tests cover all 9 cases
Story Points: 2
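A minimal sketch of the mapping in US-DLR-002, assuming a TypeScript codebase. The function name normaliseDlrStatus and the whitespace trimming are illustrative assumptions; only the eight mappings and the UNKNOWN fallback come from the acceptance criteria.

```typescript
// Canonical statuses named in the acceptance criteria.
type DlrStatus =
  | "DELIVERED" | "UNDELIVERED" | "EXPIRED"
  | "FAILED" | "REJECTED" | "UNKNOWN";

// The eight SMPP stat values listed in US-DLR-002, keyed in upper case.
const STAT_TO_STATUS = new Map<string, DlrStatus>([
  ["DELIVRD", "DELIVERED"],
  ["UNDELIV", "UNDELIVERED"],
  ["EXPIRED", "EXPIRED"],
  ["DELETED", "FAILED"],
  ["ACCEPTD", "UNKNOWN"],
  ["REJECTD", "REJECTED"],
  ["UNKNOWN", "UNKNOWN"],
  ["FAILED", "FAILED"],
]);

// Stateless and case-insensitive. Trimming whitespace is an added assumption.
function normaliseDlrStatus(rawStat: string): DlrStatus {
  return STAT_TO_STATUS.get(rawStat.trim().toUpperCase()) ?? "UNKNOWN";
}
```

Keeping the table as data rather than a switch statement lets the "9 cases" unit test iterate over the map and add one unrecognised value.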
US-DLR-003: Persist Delivery Receipts with Idempotency
Title: As the platform, I want each DLR written to dlr.delivery_receipts exactly once so that duplicate events from operators cause no side effects.
Description: Use the INSERT ... ON CONFLICT (operator_message_id) DO NOTHING pattern. If a conflict is detected, Ack the NATS message and exit without further processing.
Acceptance Criteria:
- First occurrence of operatorMessageId creates row
- Second occurrence of same operatorMessageId results in dlr_duplicates_total counter increment and immediate Ack
- No billing or webhook events emitted for duplicates
- Integration test confirms single row after 5 rapid duplicates
Story Points: 3
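One way the duplicate check in US-DLR-003 could be expressed, as a sketch only. PostgreSQL reports zero affected rows when ON CONFLICT DO NOTHING skips an insert, so the row count is enough to tell a first occurrence from a duplicate. All column names other than operator_message_id, and the helper names, are assumptions for illustration.

```typescript
// Columns other than operator_message_id are assumed for illustration.
const INSERT_RECEIPT_SQL = `
  INSERT INTO dlr.delivery_receipts
    (operator_message_id, message_id, dlr_status, received_at)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT (operator_message_id) DO NOTHING`;

type InsertOutcome = "inserted" | "duplicate";

// rowCount is the number of rows the INSERT actually wrote (0 on conflict).
function classifyInsert(rowCount: number): InsertOutcome {
  return rowCount === 1 ? "inserted" : "duplicate";
}

// Hypothetical handler step: only a fresh row continues to downstream
// publishing; a duplicate is Acked immediately and counted.
function nextStep(outcome: InsertOutcome): "continue" | "ack-and-stop" {
  return outcome === "inserted" ? "continue" : "ack-and-stop";
}
```

In this sketch the dlr_duplicates_total increment would sit on the "ack-and-stop" branch.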
US-DLR-004: Update sms_messages Status on Terminal DLR
Title: As the platform, I want orch.sms_messages updated with DLR status so that message tracking reflects final delivery outcome.
Description: Execute targeted UPDATE setting status, dlr_status, dlr_received_at, processed_at within the same DB transaction as the delivery_receipts insert. Guard clause prevents overwriting already-terminal rows.
Acceptance Criteria:
- Terminal DLR statuses (DELIVERED, FAILED, UNDELIVERED, EXPIRED, REJECTED) update orch.sms_messages
- UNKNOWN status does NOT update orch.sms_messages.status
- Guard clause verified: second DLR for same messageId does not overwrite terminal state
- Verified with integration test using Testcontainers
Story Points: 3
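The guard clause in US-DLR-004 can be sketched as a pure predicate, which is easy to unit-test before it is translated into a WHERE condition on the UPDATE. The function names are hypothetical; the terminal set is the one listed in the acceptance criteria, and "SENT" is used as an example of a non-terminal stored status.

```typescript
type DlrStatus =
  | "DELIVERED" | "UNDELIVERED" | "EXPIRED"
  | "FAILED" | "REJECTED" | "UNKNOWN";

// Terminal statuses as listed in US-DLR-004.
const TERMINAL: ReadonlySet<string> = new Set([
  "DELIVERED", "FAILED", "UNDELIVERED", "EXPIRED", "REJECTED",
]);

function isTerminal(status: string): boolean {
  return TERMINAL.has(status);
}

// currentStatus is the value already stored in orch.sms_messages.status.
function shouldUpdateMessageStatus(currentStatus: string, incoming: DlrStatus): boolean {
  if (!isTerminal(incoming)) return false; // UNKNOWN never changes status
  return !isTerminal(currentStatus);       // guard: never overwrite a terminal row
}
```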
EP-DLR-02: Orphan Handling & Reconciliation
Description: Handle DLRs that arrive with an operatorMessageId that cannot be correlated to a known outbound message.
US-DLR-005: Write Orphaned Receipts on Correlation Failure
Title: As the platform, I want unresolvable DLRs stored in dlr.orphaned_receipts so that they can be reconciled later.
Description: When correlation query returns no rows for operatorMessageId, insert full event payload into dlr.orphaned_receipts and Ack the NATS message (no retry since retrying will not help without a matching row).
Acceptance Criteria:
- dlr.orphaned_receipts row created with operatorMessageId, rawPayload, receivedAt
- dlr_orphans_total Prometheus counter incremented
- dlr.orphan_rate gauge updated
- NATS message Acked (not Nak'd)
- Log entry dlr.orphaned at WARN level with operatorMessageId and operatorId
Story Points: 2
US-DLR-006: Publish sms.dlr.unmatched for Orphaned Receipts
Title: As the reconciliation subsystem, I want to receive a sms.dlr.unmatched NATS event for each orphan so that automated reconciliation can be triggered.
Description: After writing to dlr.orphaned_receipts, publish sms.dlr.unmatched including orphanId for cross-reference.
Acceptance Criteria:
- sms.dlr.unmatched published within the same DB transaction scope (outbox pattern)
- Event includes orphanId, operatorMessageId, operatorId, rawStat, receivedAt
- Integration test: publish DLR with unknown ID → verify sms.dlr.unmatched received by test consumer
Story Points: 2
US-DLR-007: Correlation Retry Before Orphaning (Race Condition Mitigation)
Title: As the platform, I want a brief retry window before orphaning a DLR so that DLRs arriving slightly before the SENT status update are not incorrectly orphaned.
Description: When correlation fails, retry the lookup after a 3-second delay up to 2 times before writing to orphaned_receipts. This mitigates the race condition where the smpp-connector DLR arrives before the orchestrator has written SENT status.
Acceptance Criteria:
- Correlation retried up to 2 times with 3s delay between attempts
- If found on retry: normal processing path (not orphaned)
- If not found after retries: orphan path
- dlr_correlation_retry_total counter tracks retry attempts
- Processing latency SLO accounts for up to 6s retry window
Story Points: 3
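A possible shape for the retry window in US-DLR-007, assuming the correlation lookup is an async function that resolves to null when no row is found. The lookup and sleep parameters are injected so tests can skip the real 3-second delay; all names here are hypothetical.

```typescript
type Lookup<T> = (operatorMessageId: string) => Promise<T | null>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// One initial lookup, then up to maxRetries further lookups spaced delayMs apart.
// With the defaults, the worst case adds 2 x 3 s = 6 s before the orphan path.
async function correlateWithRetry<T>(
  operatorMessageId: string,
  lookup: Lookup<T>,
  sleep: Sleep = realSleep,
  maxRetries = 2,
  delayMs = 3000,
): Promise<{ match: T | null; retries: number }> {
  let match = await lookup(operatorMessageId);
  let retries = 0;
  while (match === null && retries < maxRetries) {
    await sleep(delayMs);
    retries += 1; // would feed dlr_correlation_retry_total
    match = await lookup(operatorMessageId);
  }
  return { match, retries };
}
```

A null match after the loop means the caller takes the orphan path; a non-null match continues with normal processing.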
EP-DLR-03: Downstream Event Publishing
Description: Fan processed DLR outcomes to billing and webhook downstream consumers via the transactional outbox pattern.
US-DLR-008: Publish billing.events via Transactional Outbox
Title: As the billing service, I want a billing.events event for each terminal DLR so that charges can be calculated.
Description: Insert billing.events payload into dlr.outbox within the same DB transaction as the delivery_receipts insert. Outbox relay publishes to NATS asynchronously.
Acceptance Criteria:
- billing.events published for DELIVERED, FAILED, UNDELIVERED, EXPIRED, REJECTED statuses
- NOT published for UNKNOWN status
- Event includes messageId, accountId, segmentCount, dlrStatus, operatorId
- Outbox pattern: no billing event lost even if NATS is temporarily unavailable
- Pact contract test verified with billing-service team
Story Points: 3
US-DLR-009: Publish webhook.dispatch via Transactional Outbox
Title: As the webhook dispatcher, I want a webhook.dispatch event for every DLR correlation so that customer webhooks can be notified.
Description: Insert webhook.dispatch payload into dlr.outbox within the same DB transaction. Published for all DlrStatus values including UNKNOWN.
Acceptance Criteria:
- webhook.dispatch published for ALL DlrStatus values
- Event includes accountId, messageId, dlrStatus, to, occurredAt
- Outbox pattern ensures at-least-once delivery
- Pact contract test verified with webhook-dispatcher team
Story Points: 2
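The publishing rules in US-DLR-008 and US-DLR-009 differ only in which statuses qualify. A small sketch, with a hypothetical function name, of how the outbox rows for one DLR might be selected:

```typescript
type DlrStatus =
  | "DELIVERED" | "UNDELIVERED" | "EXPIRED"
  | "FAILED" | "REJECTED" | "UNKNOWN";
type OutboxSubject = "billing.events" | "webhook.dispatch";

// Statuses that produce a billing event (US-DLR-008).
const BILLABLE: ReadonlySet<DlrStatus> = new Set<DlrStatus>([
  "DELIVERED", "FAILED", "UNDELIVERED", "EXPIRED", "REJECTED",
]);

// webhook.dispatch is written for every status, billing.events only for
// terminal ones. Both rows would go into dlr.outbox in the same transaction.
function outboxSubjectsFor(status: DlrStatus): OutboxSubject[] {
  return BILLABLE.has(status)
    ? ["billing.events", "webhook.dispatch"]
    : ["webhook.dispatch"];
}
```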
EP-DLR-04: Observability & Operations
US-DLR-010: Instrument Prometheus Metrics
Title: As an SRE, I want comprehensive Prometheus metrics so that I can monitor DLR processing health in production.
Description: Implement all metrics listed in OBSERVABILITY.md including throughput counters, latency histogram, orphan gauge, outbox pending gauge, and NATS consumer status gauge.
Acceptance Criteria:
- All 11 metrics from OBSERVABILITY.md §1 implemented
- /metrics endpoint returns Prometheus text format
- dlr_processing_duration_seconds histogram has buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s
- Grafana dashboard dlr-processor-overview loads without errors
Story Points: 3
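Because the metric name ends in _seconds, the bucket bounds listed above would normally be configured in seconds rather than milliseconds. A small sketch of that conversion; the constant and function names are assumptions, and no particular metrics library is implied.

```typescript
// Bucket bounds from the acceptance criteria, written in milliseconds.
const BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000];

// Upper bounds in seconds, matching the unit in the metric name.
const BUCKETS_SECONDS = BUCKETS_MS.map((ms) => ms / 1000);

// Index of the first bucket whose upper bound covers the observation.
// Returns -1 when the value only falls into the implicit +Inf bucket.
function bucketIndex(seconds: number): number {
  return BUCKETS_SECONDS.findIndex((le) => seconds <= le);
}
```

The 500 ms bound lines up with the p99 < 500 ms target in EP-DLR-01, so the SLO threshold falls exactly on a bucket boundary.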
US-DLR-011: Structured JSON Logging
Title: As an SRE, I want structured JSON logs for all key processing events so that I can query and correlate incidents in Loki.
Description: Implement Pino structured logging for all processing paths with event field, traceId, spanId, and domain context. No PII in log lines.
Acceptance Criteria:
- All 8 log events from OBSERVABILITY.md §2 implemented
- Phone numbers (destAddr, sourceAddr) never appear in log output
- traceId and spanId present when OTEL trace is active
- Log level configurable via LOG_LEVEL env var
Story Points: 2
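A library-independent sketch of the "no PII in log lines" rule, assuming destAddr and sourceAddr only appear as top-level fields of the log context. Pino also offers a built-in redact option that could serve the same purpose; the helper names below are hypothetical.

```typescript
// Fields that must never reach log output (from the acceptance criteria).
const PII_FIELDS: ReadonlySet<string> = new Set(["destAddr", "sourceAddr"]);

// Returns a copy of the context without PII fields (top-level keys only).
function redactPii(context: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const key of Object.keys(context)) {
    if (!PII_FIELDS.has(key)) safe[key] = context[key];
  }
  return safe;
}

// Builds one structured JSON log line with the event field described above.
function logLine(level: string, event: string, context: Record<string, unknown>): string {
  return JSON.stringify({ level, event, ...redactPii(context) });
}
```

A unit test that serialises a context containing a phone number and asserts its absence is a cheap way to cover the second acceptance criterion.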
US-DLR-012: Alerting Rules for Critical Failure Modes
Title: As an SRE, I want Prometheus alerting rules for orphan rate, outbox lag, and consumer disconnection so that on-call engineers are paged promptly.
Description: Implement the 4 alerting rules from OBSERVABILITY.md §4.
Acceptance Criteria:
- DlrHighOrphanRate fires when orphan rate > 0.5% for 5 min
- DlrOutboxLag fires when outbox pending > 1000 for 2 min
- DlrConsumerDisconnected fires when consumer status = 0 for 1 min
- DlrHighLatency fires when p99 > 500 ms for 5 min
- All rules tested in staging with simulated conditions
Story Points: 2
EP-DLR-05: Security & Compliance
US-DLR-013: Least-Privilege NATS Permissions
Title: As a security engineer, I want the DLR Processor NATS account scoped to minimum required subjects so that a compromise cannot publish arbitrary events.
Acceptance Criteria:
- NATS user dlr-processor can only: subscribe sms.dlr.inbound, publish billing.events, publish webhook.dispatch, publish sms.dlr.unmatched
- No wildcard subjects granted
- Verified in NATS security review
Story Points: 1
US-DLR-014: Least-Privilege PostgreSQL Grants
Title: As a security engineer, I want the dlr-processor DB user restricted to minimum required operations so that a compromise cannot affect other service schemas.
Acceptance Criteria:
- dlr_svc has SELECT, INSERT on dlr.* tables only
- dlr_svc has UPDATE(status, dlr_status, dlr_received_at, processed_at) on orch.sms_messages only
- No DROP, TRUNCATE, or schema-level privileges
- Verified by DBA review
Story Points: 1
EP-DLR-06: Inbound DLR Idempotency & Replay Quarantine
Description: MNOs occasionally re-deliver deliver_sm DLRs from stale buffers (especially after MNO-side restarts). The processor must dedupe by (operatorId, operatorMessageId, status, timestampBucket) and quarantine suspicious replays for forensic review.
US-DLR-015 — DLR fingerprint dedup
Title: As the dlr-processor, I want every inbound DLR fingerprinted and deduped so that duplicate delivery from MNO buffers does not double-bill or double-trigger webhooks.
Acceptance Criteria:
- Fingerprint = sha256(operatorId || operatorMessageId || status || floor(receivedAt / 60s))
- Redis key dlr:fp:{fingerprint} set with TTL 24h on first seen
- Duplicate fingerprint → ACK NATS, increment dlr_dedup_total{operatorId}, do NOT process downstream
- Integration test: same DLR replayed within 60 s → only one billing event, one webhook
Story Points: 3
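A sketch of the fingerprint in US-DLR-015 using Node's built-in crypto module. The formula does not specify how the four fields are joined, so the "|" separator, the hex encoding, and the millisecond epoch input are assumptions, as are the function names.

```typescript
import { createHash } from "node:crypto";

// receivedAtMs is assumed to be a Unix timestamp in milliseconds.
function dlrFingerprint(
  operatorId: string,
  operatorMessageId: string,
  status: string,
  receivedAtMs: number,
): string {
  const bucket = Math.floor(receivedAtMs / 60_000); // floor(receivedAt / 60s)
  // A separator avoids ambiguous concatenations such as ("ab","c") vs ("a","bc").
  const material = [operatorId, operatorMessageId, status, String(bucket)].join("|");
  return createHash("sha256").update(material).digest("hex");
}

// Redis key layout from the acceptance criteria.
function fingerprintKey(fingerprint: string): string {
  return `dlr:fp:${fingerprint}`;
}
```

Note that two copies of the same DLR received on either side of a minute boundary fall into different buckets and therefore get different fingerprints, so the integration test may want to cover that edge. Setting the key with Redis SET ... NX EX 86400 would make "first seen" and the 24h TTL a single atomic step.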
US-DLR-016 — Replay quarantine (suspicious patterns)
Title: As an SRE, I want DLRs replayed more than 5 min after the original DLR window to be quarantined so that forensic review can determine MNO misbehaviour.
Acceptance Criteria:
- If now - dlrTimestamp > 5 min AND original message reached terminal status > 24h ago → quarantine
- Quarantine table dlr.quarantined_dlrs(operatorId, payload JSONB, receivedAt, fingerprint)
- Daily report aggregates quarantine count per operator; flag any operator > 1% replay rate
- Manual release endpoint POST /v1/admin/dlr/quarantine/:id/release
Story Points: 3
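The quarantine condition in US-DLR-016 combines two age checks. A minimal sketch as a pure predicate, assuming all timestamps are Unix milliseconds and that a message with no terminal timestamp is never quarantined; the function name is hypothetical.

```typescript
const FIVE_MINUTES_MS = 5 * 60 * 1000;
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// terminalAtMs is when the original message reached a terminal status,
// or null if it has not reached one yet (assumed: then no quarantine).
function shouldQuarantine(
  nowMs: number,
  dlrTimestampMs: number,
  terminalAtMs: number | null,
): boolean {
  if (terminalAtMs === null) return false;
  const dlrIsStale = nowMs - dlrTimestampMs > FIVE_MINUTES_MS;
  const terminalLongAgo = nowMs - terminalAtMs > ONE_DAY_MS;
  return dlrIsStale && terminalLongAgo;
}
```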
US-DLR-017 — Late DLR handling (after message terminal status)
Title: As the dlr-processor, I want DLRs received after a message reaches terminal status logged but not propagated downstream so billing/webhooks are not reopened.
Acceptance Criteria:
- If sms_messages.status is terminal (DELIVERED/FAILED/EXPIRED) and DLR arrives → log to dlr.late_dlrs, do not republish
- Metric dlr_late_total{operatorId} counter
- Late-DLR rate dashboard panel; alert if > 0.5% of total DLRs sustained for 1 h
Story Points: 2
EP-DLR-07: Segment-Aware DLR Aggregation for Concatenated SMS
Description: A 3-segment concatenated SMS yields up to 3 separate DLRs. Per-segment status must be aggregated into a single message-level status using documented rules so downstream consumers see one event per outbound message.
US-DLR-018 — Segment correlation table and aggregation rules
Title: As the dlr-processor, I want segment-level DLRs correlated to the parent message and aggregated per documented rules so downstream consumers see a single per-message status.
Acceptance Criteria:
- Table dlr.segment_dlrs(messageId, segmentIndex, totalSegments, status, receivedAt)
- Aggregation: DELIVERED only when all segments DELIVERED; FAILED if any segment FAILED; PARTIAL on mixed; PENDING until all received or 24h grace expires
- Message-level status published to sms.dlr.aggregated.v1 once all segments received OR 24h grace expires
- Downstream consumers (billing, webhook) consume aggregated subject only
Story Points: 5
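The aggregation rules in US-DLR-018 leave their precedence open, so the sketch below picks one reading and labels it: the result stays PENDING until every segment has reported or the grace period has expired, a grace expiry with missing segments yields PARTIAL (as in US-DLR-019), and only a literal FAILED segment produces FAILED. The function name and these precedence choices are assumptions.

```typescript
type DlrStatus =
  | "DELIVERED" | "UNDELIVERED" | "EXPIRED"
  | "FAILED" | "REJECTED" | "UNKNOWN";
type AggregateStatus = "DELIVERED" | "FAILED" | "PARTIAL" | "PENDING";

// received holds one status per segment DLR seen so far.
function aggregateSegments(
  received: DlrStatus[],
  totalSegments: number,
  graceExpired: boolean,
): AggregateStatus {
  const complete = received.length >= totalSegments;
  if (!complete && !graceExpired) return "PENDING";
  if (!complete) return "PARTIAL"; // grace expired with segments still missing
  if (received.some((s) => s === "FAILED")) return "FAILED";
  if (received.every((s) => s === "DELIVERED")) return "DELIVERED";
  return "PARTIAL"; // mixed outcome, e.g. DELIVERED plus EXPIRED
}
```

For a 3-segment message, two DELIVERED receipts give PENDING, a third DELIVERED gives DELIVERED, and a third EXPIRED gives PARTIAL under this reading.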
US-DLR-019 — Aggregation timeout and partial reporting
Title: As the dlr-processor, I want to publish a PARTIAL aggregated DLR if not all segments have DLRs within 24 h so downstream is unblocked.
Acceptance Criteria:
- Cron every 5 min selects messages with any segment in PENDING past 24h
- Publishes sms.dlr.aggregated.v1 with status: "PARTIAL", received: N, expected: M
- Audit log entry per partial
- Webhook payload includes per-segment detail array
Story Points: 3
US-DLR-020 — Customer-portal segment-detail view
Title: As a customer, I want to see per-segment status of my long messages so I can diagnose partial failures.
Acceptance Criteria:
- GET /v1/sms/{messageId}/segments returns array of segment-status entries
- Customer-portal message-detail page renders segment table
- Tenant isolation enforced via RLS
- Contract test verifies segment-level visibility
Story Points: 3
EP-DLR-08: Orphan-DLR Burial Queue with Time-Boxed Retention
Description: DLRs that cannot be correlated to any known message (correlation expired, MNO mis-routed, replay attack) must be buried in a quarantine queue with bounded retention so the system isn't poisoned.
US-DLR-021 — Orphan DLR detection and burial
Title: As the dlr-processor, I want DLRs with no correlation row buried in dlr.orphan_dlrs so downstream is never confused.
Acceptance Criteria:
- No row in smpp.message_correlations for (operatorId, operatorMessageId) → insert into dlr.orphan_dlrs; ACK NATS
- Metric dlr_orphan_total{operatorId} counter
- Alert DlrOrphanRateHigh if > 1% sustained for 15 min
Story Points: 3
US-DLR-022 — Orphan-DLR retention and archive
Title: As an SRE, I want orphan DLRs retained 30 d hot, 1 y cold, then purged so storage doesn't grow unbounded.
Acceptance Criteria:
- Partitioned by month; 30-d-old partitions archived to S3 (s3://ghasi-dlr-archive/orphans/{yyyy}/{mm}/)
- 1-y-old archives purged via lifecycle policy
- Manual export endpoint for ops investigations
Story Points: 2
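A small sketch of how an archive job for US-DLR-022 might derive the S3 prefix and decide when a monthly partition is due. The zero-padded month, the use of UTC, and measuring the 30 days from the end of the partition are assumptions; the function names are hypothetical.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Builds the prefix s3://ghasi-dlr-archive/orphans/{yyyy}/{mm}/ for a partition month.
function orphanArchivePrefix(partitionMonth: Date): string {
  const yyyy = String(partitionMonth.getUTCFullYear());
  const mm = String(partitionMonth.getUTCMonth() + 1).padStart(2, "0");
  return `s3://ghasi-dlr-archive/orphans/${yyyy}/${mm}/`;
}

// A partition is due once its end lies at least 30 days in the past.
function isDueForArchive(partitionEnd: Date, now: Date): boolean {
  return now.getTime() - partitionEnd.getTime() >= THIRTY_DAYS_MS;
}
```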