Skip to main content

FAILURE_MODES — payment-gateway-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER

This document enumerates the failure modes payments engineers must understand to operate this service. Each entry follows: Symptom → Detection → Cause → Mitigation → Long-term fix. Items map to runbook URLs at https://runbooks.melmastoon.ghasi.io/payments/<slug>.

1. Vendor adapter unavailable

Symptom: Authorize/capture/refund returns 502 MELMASTOON.PAYMENT.GATEWAY_UNAVAILABLE; circuit breaker opens.

Detection: payments_adapter_circuit_state{processor="…"} = 2 for > 5 m → P2 alert.

Cause: Vendor incident, network partition, certificate expiry, exhausted rate limit, expired API key.

Mitigation:

  1. Circuit breaker stops new calls to the failing adapter.
  2. Per-tenant adapter precedence routes new intents to the next-preference processor (e.g., Stripe → PayPal). Cash-on-arrival is always available.
  3. In-flight intents that succeeded but missed the response are reconciled when the daily batch runs (we hold the (intent_id, processor_idempotency_key) mapping).
  4. UI surfaces a property-aware notice via notification-service; staff can record cash receipts as an alternative.

Long-term fix: Add additional adapters per region (e.g., Adyen for EMEA, Razorpay for South Asia) and bake more aggressive fallback into the saga via reservation-service policy.

2. Webhook lost or delayed

Symptom: Transaction stuck in authorized state past expected capture, or pending_settlement for hours.

Detection: Daily reconciliation surfaces a vendor_only line; payments_webhook_inbox_lag_seconds rising; alert P2.

Cause: Vendor delivery failure, our endpoint returned non-2xx, signature secret rotated incorrectly, network drop.

Mitigation:

  1. Reconciliation matches platform vs vendor settlement and back-fills missing state changes (e.g., the captured state is applied based on the settlement report when the webhook never arrived). This is the canonical repair path.
  2. Operator can replay via payments-admin-cli webhooks replay --vendor stripe --since=… which re-fetches events from the vendor API and re-runs the dispatcher.
  3. webhook.duplicate_dropped.v1 ensures idempotency if the original eventually arrives.

Long-term fix: Verify each vendor's "list events since X" endpoint is wired into a 6-hourly catch-up job (currently only daily).

3. Double-charge attempt

Symptom: Same guest, same reservation, two transactions both captured.

Detection: A pre-write check in AuthorizePayment compares (reservationId, paymentMethodId, amount, currency) against the last 5 minutes of transactions; collisions emit payments_idempotency_collision_total{outcome="different_body"}. The 409 response indicates UI should retry as same-body.

Cause: Frontend retried without preserving the original Idempotency-Key; saga emitted held.v1 twice without dedupe at the inbox; race between two operators.

Mitigation:

  1. Idempotency-Key required on every authorize; same key + same body collapses to one charge.
  2. Saga inbox dedupe by (reservationId, sagaStepId).
  3. If two captures land despite controls, an immediate void+refund is issued and a P1 incident is opened.

Long-term fix: Move from advisory pre-check to a unique constraint at the application layer keyed on (tenantId, reservationId, paymentMethodId, amountMicro, currency) for a 5-minute sliding window using a TTL'd attempt_lock table.

4. Refund queue overflow

Symptom: Refunds remain pending for hours; refund SLO breaches.

Detection: payments_refund_total{outcome="pending"} increases without matching outcome="refunded"; payments_webhook_inbox_lag_seconds correlated.

Cause: Vendor processing slower than our submission rate, mass cancellation event (e.g., property closure), webhook backlog.

Mitigation:

  1. Mass-cancellation events trigger a per-tenant rate limit on refund submission (5 RPS) to avoid vendor throttling.
  2. Refund worker runs on its own queue with HPA on backlog; can scale to 30 pods.
  3. Operator dashboard shows pending count with ETA.

Long-term fix: Group refunds where a vendor supports batch refunds; pre-validate refund amounts against captures to fail fast.

5. FX provider down or stale

Symptom: Authorize fails with MELMASTOON.PAYMENT.FX_UNAVAILABLE; or quoted rate is older than tolerance.

Detection: payments_fx_lookup_total{outcome!="ok"} rising; alert at 1% over 5 m.

Cause: ECB/OANDA outage, network partition.

Mitigation:

  1. Cached rates (24-h TTL) are accepted with source="cached_24h" and emit a WARN log; the FX context records the fallback so finance can audit later.
  2. If cache is also stale beyond 24 h, the use case fails fast; the UI shows "Foreign exchange unavailable; try again or use local currency."
  3. Cash-on-arrival is unaffected.

Long-term fix: Add a second FX provider (e.g., OANDA + Open Exchange Rates) and pick the median quote.

6. PAN exposure attempt

Symptom: MELMASTOON.PAYMENT.PAN_EXPOSURE_BLOCKED raised.

Detection: any occurrence → P0; SecOps paged.

Cause: A bug or misuse caused PAN-shaped data to enter our request body, log line, or event payload.

Mitigation:

  1. The redaction proxy and serializer both block before the data leaves the request handler.
  2. The transaction is rejected with a 400; no card data persists.
  3. Incident: snapshot the offending request (with PAN redacted), trace it to the caller, file a P0.

Long-term fix: PCI scanner test in CI plus runtime detector — both must remain green. New fields touching guest input require a security review checkbox in the PR template.

7. Tenant offboarding with pending refunds

Symptom: Tenant requested deletion; refunds still in pending.

Detection: Tenant lifecycle event tenant.deletion.requested.v1 triggers a check; if any refunds are not in terminal state, the deletion is paused with a P2 alert to the accountant.

Cause: Vendor delay during sunset window.

Mitigation:

  1. Soft-delete for 30 days defers schema drop; refunds continue to process as webhooks arrive.
  2. After 30 days, any remaining pending refunds are escalated via notification-service; legal review may extend retention.
  3. Card tokens are deleted at processor level immediately on tenant request; we keep only the financial record.

Long-term fix: Pre-deletion checklist enforced by tenant-service requiring zero pending refunds before scheduling a deletion.

8. MFS confirmation timeout (HesabPay / async vendors)

Symptom: Transaction stuck in pending_settlement past the configured window (default 24 h).

Detection: Cron payments-mfs-timeouts-sweep (hourly) finds rows with created_at < now() - 24h AND status='pending_settlement' AND processor IN (mfs_*).

Cause: Guest abandoned the MFS confirmation flow, or vendor never delivered the confirmation.

Mitigation:

  1. Sweep marks the transaction as failed with MELMASTOON.PAYMENT.MFS_CONFIRMATION_TIMEOUT.
  2. Saga reacts: transaction.failed.v1 triggers reservation-service to release the hold or retry with a different method.
  3. Guest receives a notification with a re-payment link.

Long-term fix: Configurable per-tenant timeout; expose to admin UI in bff-backoffice-service.

9. Chargeback for cash payment

Symptom: Vendor sends a chargeback webhook that maps to a cash_on_arrival transaction.

Detection: RecordChargebackUseCase rejects the operation as logically impossible and raises MELMASTOON.PAYMENT.CHARGEBACK_ON_CASH_IMPOSSIBLE. chargeback.received.v1 is emitted with fraudSignal: true, impossible: true.

Cause: Most likely vendor mis-association, or a fraudulent dispute. In our model cash never traversed a vendor processor, so a chargeback against it is structurally impossible — it's a strong fraud signal worth investigating.

Mitigation:

  1. Open an investigation case in bff-backoffice-service; do not auto-issue any refund.
  2. Notify accountant + property manager.
  3. Capture details and submit "no charge ever existed" evidence to vendor.

Long-term fix: Push vendors for stricter merchant-record matching; consider adding a watermark on cash receipt artifacts.

10. Outbox flush failure

Symptom: Outbound events delayed; downstream services don't react.

Detection: payments_central.outbox row count where published_at IS NULL rising; alert at > 1000 rows or > 5 m oldest.

Cause: Pub/Sub outage, IAM permission revoked, schema drift between event payload and registered schema (validator rejects).

Mitigation:

  1. Outbox worker retries with backoff; events eventually flush when Pub/Sub returns.
  2. Schema rejects emit ERROR logs; payload is requeued and the bad envelope investigated by the team.

Long-term fix: Add a circuit breaker on the publisher and a backpressure signal so use cases can fail fast rather than fill the outbox if Pub/Sub is down for long.

11. Optimistic concurrency conflict storm

Symptom: 409 MELMASTOON.PAYMENT.OPTIMISTIC_CONFLICT rate climbs; UI sees retries.

Detection: payments_request_total{status="409"} correlated with payments_capture_total or payments_refund_total.

Cause: A buggy caller polling at high frequency; a saga retry storm; concurrent operators.

Mitigation: Rate limiting kicks in at 30 RPS per tenant; UI shows debounced state; logs include payment.id for triage.

Long-term fix: Add command-merge logic where appropriate (e.g., capture+capture for the same payment within 1 s collapses).

12. Adapter circuit half-open misbehavior

Symptom: Adapter recovers, closed → half-open → open repeatedly (flapping).

Detection: payments_adapter_circuit_transitions_total{from="half_open",to="open"} / time > 1/min for 10 m.

Cause: Vendor recovered partially; or our health-probe traffic is too aggressive after open.

Mitigation: Increase half-open probe interval to 60 s; require 5 consecutive successes before close.

Long-term fix: Adaptive half-open thresholds based on historical recovery patterns.

13. Local-cash desktop outbox stuck

Symptom: Property's desktop reports cash receipts pending for hours.

Detection: payments_desktop_cash_outbox_age_seconds{property_id} > 14400 → P3 alert.

Cause: Property network outage; bad operator session; expired JWT not refreshing.

Mitigation: Notification to property manager with troubleshooting steps; manual escalation can drain the outbox by re-issuing JWT and triggering sync.

Long-term fix: Build sync health into the desktop status bar with a one-click "force sync" affordance.

14. Database schema-migration partial failure

Symptom: A new tenant migration applies on 99 of 100 tenant schemas; one tenant remains on previous version.

Detection: payments_central.tenant_migrations query returns version mismatch; CI gate post-deploy.

Cause: Lock contention, ad-hoc DDL on the failing tenant, transient connection error.

Mitigation: Per-tenant migrations are idempotent retriable; payments-admin-cli migrate-tenant <id> resumes; alerts gate the deploy promotion.

Long-term fix: Pre-deploy dry-run on a sampled set of tenants; per-tenant lock timeout tuned.