smpp-connector — Failure Modes
Status: populated | Last updated: 2026-04-18
Failure Scenarios
| # | Failure scenario | Detection method | Impact | Mitigation |
|---|---|---|---|---|
| F1 | SMPP session TCP disconnect (network blip) | enquire_link timeout or TCP RST; smpp.session.unbound log | Messages queued in NATS; not transmitted until rebind | Exponential backoff reconnect (5 s → 60 s); operator.health UNBOUND published; routing-engine redirects to backup operator |
| F2 | No enquire_link response from MNO (silent session death) | 10 s timer fires; enquire_link.timeout log + metric | Session appears alive but is dead; submit_sm requests will time out | Force disconnect on timeout; trigger reconnect; handle as F1 |
| F3 | MNO rejects bind (wrong credentials) | bind_resp with error code (e.g. ESME_RBINDFAIL); logged at error level | Cannot bind; messages remain in NATS queue | Alert fires; operator credentials must be rotated in Vault via operator-management-service; auto-retry paused until manual resolution |
| F4 | TPS limit exceeded | Redis INCR counter exceeds tpsLimit; smpp.submit_sm.throttled log | NATS messages NAKed with 500 ms delay | NATS redelivery handles queuing; no message loss; alert if throttling is sustained for > 5 min |
| F5 | submit_sm_resp timeout (30 s) | Timer fires for pending PDU; pendingPduMap eviction log | Uncertain delivery state; possible double-send on retry | Correlation record written optimistically; DLR will confirm; sms-orchestrator handles SUBMITTED status timeout |
| F6 | DLR operator_message_id not in correlation table | dlr.correlation.not_found log + metric | DLR event published without messageId (degraded) | Alert if miss rate > 1/min; likely cause: correlation record expired (72 h TTL) or DB write failure on submit |
| F7 | Redis unavailable | PING fails; warn-level log | TPS enforcement disabled (fail-open); throttle metric stops | Alert fires; operator TPS contracts may be breached; MNO may throttle at SMPP level (ESME_RTHROTTLED) |
| F8 | PostgreSQL unavailable | Connection pool error; error log | Cannot write correlation records; DLR correlation impossible for new messages | Alert fires; existing in-memory pendingPduMap provides partial DLR correlation for in-flight messages |
| F9 | operator-management-service unavailable at bind time | HTTP 5xx / timeout; error log | Cannot fetch credentials; cannot bind new/reconnected sessions | Existing BOUND sessions continue; alert fires; reconnect attempts wait; manual credential injection as emergency procedure |
| F10 | Pod OOMKilled | Kubernetes event; liveness probe stops responding | All in-progress PDUs lost; pendingPduMap cleared | Kubernetes restarts pod; NATS messages with SUBMITTED status re-dispatched by sms-orchestrator after timeout; session re-established on startup |
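
Rows F4 and F7 together describe a counter-based TPS check that throttles when the limit is exceeded and fails open when Redis is unreachable. The following is a minimal sketch of that decision logic, not the connector's actual code. The names `CounterStore`, `InMemoryStore`, `Decision`, `check_tps`, and the `tps:{operator}:{second}` key layout are all hypothetical, and the in-memory store stands in for Redis so the example runs without a server. Only the 500 ms NAK delay and the fail-open behaviour come from the table above.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class CounterStore(Protocol):
    """Hypothetical minimal interface over a Redis-like counter."""

    def incr(self, key: str) -> int: ...


class InMemoryStore:
    """Stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.available = True  # flip to False to simulate F7

    def incr(self, key: str) -> int:
        if not self.available:
            raise ConnectionError("counter store unavailable")
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


@dataclass
class Decision:
    allowed: bool           # True: send submit_sm now
    nak_delay_ms: int = 0   # delay to request on NAK when throttled
    fail_open: bool = False  # True: limit was not enforced (F7)


NAK_DELAY_MS = 500  # redelivery delay stated in row F4


def check_tps(
    store: CounterStore,
    operator_id: str,
    tps_limit: int,
    now: Optional[float] = None,
) -> Decision:
    """Fixed one-second window: count submissions per operator per second."""
    second = int(time.time() if now is None else now)
    # Assumed key layout. With a real Redis the key would also need a
    # short expiry so that old windows do not accumulate.
    key = f"tps:{operator_id}:{second}"
    try:
        count = store.incr(key)
    except ConnectionError:
        # F7: store unreachable, so allow the send and flag it for alerting.
        return Decision(allowed=True, fail_open=True)
    if count > tps_limit:
        # F4: over the limit, so NAK the message and let it be redelivered.
        return Decision(allowed=False, nak_delay_ms=NAK_DELAY_MS)
    return Decision(allowed=True)
```

With `tps_limit=2`, three calls inside the same second allow the first two and throttle the third. Setting `store.available = False` makes every call return `allowed=True` with `fail_open=True`, which is the condition under which the F7 alert would be expected to fire.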