smpp-connector — Failure Modes

Status: populated | Last updated: 2026-04-18

Failure Scenarios

| # | Failure scenario | Detection method | Impact | Mitigation |
|---|---|---|---|---|
| F1 | SMPP session TCP disconnect (network blip) | enquire_link timeout or TCP RST; smpp.session.unbound log | Messages queued in NATS; not transmitted until rebind | Exponential backoff reconnect (5 s → 60 s); operator.health UNBOUND published; routing-engine redirects to backup operator |
| F2 | MNO enquire_link no response (silent session death) | 10 s timer fires; enquire_link.timeout log + metric | Session appears alive but is actually dead; submit_sm will time out | Force disconnect on timeout; trigger reconnect; treat as F1 |
| F3 | MNO rejects bind (wrong credentials) | bind_resp with error code (e.g. ESME_RBINDFAIL); logged at error | Cannot bind; messages remain in NATS queue | Alert fires; operator credentials must be rotated in Vault via operator-management-service; auto-retry paused until manual resolution |
| F4 | TPS limit exceeded | Redis INCR exceeds tpsLimit; smpp.submit_sm.throttled log | NATS messages NAKed with 500 ms delay | NATS redelivery handles queuing; no message loss; alert if throttle rate is sustained > 5 min |
| F5 | submit_sm_resp timeout (30 s) | Timer fires for pending PDU; pendingPduMap eviction log | Uncertain delivery state; possible double-send on retry | Correlation record written optimistically; DLR will confirm; sms-orchestrator handles SUBMITTED status timeout |
| F6 | DLR operator_message_id not in correlation table | dlr.correlation.not_found log + metric | DLR event published without messageId (degraded) | Alert if miss rate > 1/min; likely cause: correlation record expired (72 h TTL) or DB write failure on submit |
| F7 | Redis unavailable | PING fail; warn log | TPS enforcement disabled (fail-open); throttle metric stops | Alert fires; operator TPS contracts may be breached; MNO may throttle at SMPP level (ESME_RTHROTTLED) |
| F8 | PostgreSQL unavailable | Connection pool error; error log | Cannot write correlation records; DLR correlation impossible for new messages | Alert fires; existing in-memory pendingPduMap provides partial DLR correlation for in-flight messages |
| F9 | operator-management-service unavailable at bind time | HTTP 5xx / timeout; error log | Cannot fetch credentials; cannot bind new/reconnected sessions | Existing BOUND sessions continue; alert fires; reconnect attempts wait; manual credential injection as emergency procedure |
| F10 | Pod OOMKilled | Kubernetes event; liveness probe stops responding | All in-progress PDUs lost; pendingPduMap cleared | Kubernetes restarts pod; NATS messages with SUBMITTED status re-dispatched by sms-orchestrator after timeout; session re-established on startup |
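
The F1 mitigation names an exponential backoff between 5 s and 60 s but does not state the growth factor. The sketch below assumes the delay doubles after each failed attempt and is capped at the upper bound; the function name `reconnect_delays` and the `factor` parameter are hypothetical, not taken from the connector.

```python
from itertools import islice
from typing import Iterator


def reconnect_delays(
    initial_s: float = 5.0,   # lower bound from the F1 row
    max_s: float = 60.0,      # upper bound from the F1 row
    factor: float = 2.0,      # ASSUMPTION: doubling; the source does not give the factor
) -> Iterator[float]:
    """Yield the wait (in seconds) before each successive reconnect attempt."""
    delay = initial_s
    while True:
        yield min(delay, max_s)
        # Grow the delay but never let it exceed the cap.
        delay = min(delay * factor, max_s)


# First six waits under the doubling assumption.
first_delays = list(islice(reconnect_delays(), 6))
print(first_delays)
```

Under these assumptions the delay reaches the 60 s cap on the fifth attempt and stays there until the session rebinds. A production version would usually add jitter so that several sessions do not reconnect in lockstep; that is omitted here because the source does not mention it.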
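
Rows F4 and F7 together describe a counter-based TPS check that fails open when the counter backend is down. The following is a minimal sketch of that behaviour, assuming a fixed one-second window. The key layout, the `CounterUnavailable` exception, and the in-memory counter that stands in for Redis INCR are all hypothetical and exist only to keep the example runnable.

```python
from typing import Callable, Dict


class CounterUnavailable(Exception):
    """Hypothetical error raised when the counter backend cannot be reached."""


def allow_submit(
    incr: Callable[[str], int],  # stand-in for an atomic increment such as Redis INCR
    operator_id: str,
    tps_limit: int,
    now_s: float,
) -> bool:
    """Return True if a submit_sm may be sent in the current one-second window."""
    key = f"tps:{operator_id}:{int(now_s)}"  # ASSUMPTION: one counter per operator per second
    try:
        count = incr(key)
    except CounterUnavailable:
        # F7: enforcement is disabled (fail-open) when the counter is unreachable.
        return True
    # F4: throttle once the incremented value exceeds the limit.
    return count <= tps_limit


class InMemoryCounter:
    """Test double that mimics an increment-and-return counter."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


def broken_incr(key: str) -> int:
    raise CounterUnavailable(key)


counter = InMemoryCounter()
decisions = [allow_submit(counter.incr, "op-1", 2, 100.2) for _ in range(3)]
fail_open = allow_submit(broken_incr, "op-1", 2, 100.2)
print(decisions, fail_open)
```

A caller that receives `False` would NAK the NATS message with the 500 ms delay listed in F4. The fail-open branch trades contract compliance for availability, which is why F7 warns that the MNO may respond with ESME_RTHROTTLED.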
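
Row F5 relies on evicting entries from pendingPduMap once a submit_sm_resp has been outstanding for 30 s. The sketch below illustrates one way such a sweep could work, assuming the map is keyed by PDU sequence number and stores the send time. The `PendingPdu` fields and the `evict_timed_out` helper are hypothetical; the source only names the map and the timeout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PendingPdu:
    message_id: str   # hypothetical field: internal message identifier
    sent_at_s: float  # hypothetical field: time the submit_sm was sent, in seconds


def evict_timed_out(
    pending: Dict[int, PendingPdu],  # ASSUMPTION: keyed by SMPP sequence number
    now_s: float,
    timeout_s: float = 30.0,         # submit_sm_resp timeout from the F5 row
) -> List[PendingPdu]:
    """Remove and return every PDU that has waited at least timeout_s for a response."""
    expired = [seq for seq, pdu in pending.items() if now_s - pdu.sent_at_s >= timeout_s]
    # Pop after collecting keys so the dict is not mutated while iterating.
    return [pending.pop(seq) for seq in expired]


pending_pdu_map = {
    1: PendingPdu("m-1", sent_at_s=0.0),
    2: PendingPdu("m-2", sent_at_s=20.0),
}
evicted = evict_timed_out(pending_pdu_map, now_s=31.0)
print([p.message_id for p in evicted], sorted(pending_pdu_map))
```

Each evicted entry is in the uncertain state the table describes: the operator may or may not have accepted the message, so a later DLR or the sms-orchestrator SUBMITTED timeout has to resolve it.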