
Webhook Dispatcher — Epics & User Stories

Status: populated
Owner: Platform Engineering / Product
Last updated: 2026-04-18


EP-HOOK-01: Webhook Configuration Management

Description: Provide customers with a REST API to register, manage, and monitor their webhook endpoints, enabling them to receive DLR notifications via HTTP.

Acceptance Criteria:

  • Full CRUD operations on webhook configurations
  • Maximum 10 webhooks per account enforced
  • Webhook secrets stored encrypted; never returned in API responses
  • All endpoints authenticated via JWT / Kong

US-HOOK-001: Register a Webhook Endpoint

Title: As a customer, I want to register a webhook URL so that I receive HTTP notifications when my messages are delivered.

Description: Implement POST /v1/webhooks. Validate URL (HTTPS), secret (16–128 chars), optional event filter, and max-10 constraint. Encrypt secret before persisting.

Acceptance Criteria:

  • 201 Created response with new webhookId
  • secret NOT in response
  • 400 for non-HTTPS URL
  • 400 for secret < 16 chars
  • 422 when account already has 10 active webhooks
  • Integration test: register → verify DB row; verify secret encrypted in secret_enc

Story Points: 3
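
The validation rules in this story can be sketched as a pure function. This is a minimal illustration, not the service's actual handler: the request shape, the function name validateRegistration, and the error messages are assumptions; only the HTTPS rule, the 16–128 character secret length, the limit of 10, and the 400/422 status codes come from the story.

```typescript
// Hypothetical request shape; only url, secret, and events are named in the story.
interface RegisterWebhookRequest {
  url: string;
  secret: string;
  events?: string[];
}

type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400 | 422; error: string };

const MAX_WEBHOOKS_PER_ACCOUNT = 10;
const SECRET_MIN_LENGTH = 16;
const SECRET_MAX_LENGTH = 128;

// activeWebhookCount is assumed to be looked up by the caller before validation.
function validateRegistration(
  req: RegisterWebhookRequest,
  activeWebhookCount: number,
): ValidationResult {
  let parsed: URL;
  try {
    parsed = new URL(req.url);
  } catch {
    return { ok: false, status: 400, error: "url is not a valid URL" };
  }
  if (parsed.protocol !== "https:") {
    return { ok: false, status: 400, error: "url must use HTTPS" };
  }
  if (req.secret.length < SECRET_MIN_LENGTH || req.secret.length > SECRET_MAX_LENGTH) {
    return { ok: false, status: 400, error: "secret must be 16-128 characters" };
  }
  if (activeWebhookCount >= MAX_WEBHOOKS_PER_ACCOUNT) {
    return { ok: false, status: 422, error: "account already has 10 active webhooks" };
  }
  return { ok: true };
}
```

Checking the quota last means a malformed request is reported as 400 even when the account is already at its limit; the story does not specify this ordering, so treat it as a design choice.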


US-HOOK-002: List, Update, and Delete Webhooks

Title: As a customer, I want to manage my webhook configurations so that I can change endpoints or disable notifications.

Description: Implement GET /v1/webhooks, PUT /v1/webhooks/:id, DELETE /v1/webhooks/:id. Enforce account isolation on all operations.

Acceptance Criteria:

  • GET returns paginated list with isActive, url, events but not secret
  • PUT supports partial update; re-encrypts secret if updated
  • DELETE hard-deletes config; delivery_attempts rows retained
  • 404 when attempting to access another account's webhook
  • Unit tests cover ownership enforcement

Story Points: 3


US-HOOK-003: Filter Webhooks by Event Type

Title: As a customer, I want to register webhooks that only fire for specific DLR outcomes so that I can route different events to different endpoints.

Description: Support events array on webhook_configs (default: all event types). Dispatch logic filters active webhooks by event type match before creating delivery attempts.

Acceptance Criteria:

  • Webhook with events: ['DLR_DELIVERED'] receives only DLR_DELIVERED dispatches
  • Default (empty events array) receives all event types
  • PUT can update events array
  • Integration test: two webhooks (different event filters) — correct routing verified

Story Points: 2
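
A minimal sketch of the filter step described above, assuming an in-memory list of configurations. The WebhookConfig shape and the selectWebhooks name are illustrative; the rule that an empty events array matches every event type comes from the acceptance criteria.

```typescript
// Illustrative subset of a webhook_configs row; field names are assumptions.
interface WebhookConfig {
  webhookId: string;
  isActive: boolean;
  events: string[];
}

// Returns the active webhooks that should receive a dispatch of the given event type.
// An empty events array is treated as "all event types".
function selectWebhooks(configs: WebhookConfig[], eventType: string): WebhookConfig[] {
  return configs.filter(
    (c) => c.isActive && (c.events.length === 0 || c.events.indexOf(eventType) !== -1),
  );
}
```

For example, with one webhook filtered to DLR_DELIVERED and one with an empty events array, a DLR_DELIVERED dispatch selects both, while any other event type selects only the second.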


EP-HOOK-02: Webhook Delivery Engine

Description: Build the core delivery engine that consumes webhook.dispatch NATS events, performs HMAC-signed HTTP POSTs, and manages retry state in PostgreSQL.


US-HOOK-004: Consume webhook.dispatch Events from NATS JetStream

Title: As the platform, I want the Webhook Dispatcher to consume webhook.dispatch events durably so that no delivery notification is missed.

Description: Implement durable NATS consumer webhook-dispatcher with AckExplicit, MaxConcurrency: 20. Ack after delivery_attempts rows written to DB (before HTTP attempt).

Acceptance Criteria:

  • Consumer webhook-dispatcher present in NATS consumer list
  • Ack happens post-DB-write, pre-HTTP attempt
  • On pod restart, no events reprocessed if already Acked (DB records guard idempotency)
  • /ready returns 503 when consumer disconnected

Story Points: 3


US-HOOK-005: HMAC-SHA256 Request Signing

Title: As a customer, I want each webhook delivery to include an HMAC-SHA256 signature so that I can verify the request authenticity.

Description: Compute sha256=<hex> HMAC over raw request body using per-webhook secret. Include as X-Ghasi-Signature header. Also include X-Ghasi-Event, X-Ghasi-Delivery-Id, X-Ghasi-Timestamp.

Acceptance Criteria:

  • X-Ghasi-Signature: sha256=<64-char hex> present on every delivery
  • Signature computed over UTF-8 bytes of raw JSON body
  • Reference test: known body + known secret → expected signature hash verified
  • Unit test: tampered body produces different signature

Story Points: 2
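
The signing step can be sketched with Node's built-in crypto module. The function names below are hypothetical, and the X-Ghasi-Timestamp format (Unix seconds) is an assumption because the story does not define it; the header names and the sha256=<hex> format are taken from the description.

```typescript
import { createHmac } from "node:crypto";

// Computes the signature over the UTF-8 bytes of the raw JSON body.
function signBody(rawBody: string, secret: string): string {
  const hex = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return `sha256=${hex}`;
}

// Builds the delivery headers listed in the story.
// Assumption: the timestamp is sent as Unix seconds; the story does not specify the format.
function buildDeliveryHeaders(
  rawBody: string,
  secret: string,
  eventType: string,
  deliveryId: string,
  now: Date,
): Record<string, string> {
  return {
    "X-Ghasi-Signature": signBody(rawBody, secret),
    "X-Ghasi-Event": eventType,
    "X-Ghasi-Delivery-Id": deliveryId,
    "X-Ghasi-Timestamp": String(Math.floor(now.getTime() / 1000)),
  };
}
```

The signature must be computed over the exact string that is sent on the wire; re-serialising the JSON after signing can change byte order or whitespace and invalidate the signature.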


US-HOOK-006: HTTP Delivery with Timeout and No Redirects

Title: As the platform, I want webhook delivery to POST with a 5-second timeout and reject redirects so that delivery is fast and predictable.

Description: Use undici or native fetch with AbortSignal.timeout(5000) and redirect: 'manual'. 2xx = SUCCESS; all other responses = FAILED.

Acceptance Criteria:

  • Delivery times out and is marked FAILED at 5 s; it must not still be in flight at 5.1 s
  • HTTP 3xx response treated as failure (not followed)
  • HTTP status code stored in delivery_attempts.http_status_code
  • First 512 chars of response body stored in response_body_preview
  • Integration test using mock HTTP server verifying all failure cases

Story Points: 3
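
A sketch of the delivery call using native fetch, with the outcome classification split into a pure function so it can be unit tested without a network. The DeliveryOutcome type and function names are assumptions; AbortSignal.timeout(5000), redirect: 'manual', the 2xx rule, and the 512-character preview come from the story. It assumes a Node runtime where fetch with redirect: 'manual' returns the 3xx response itself rather than following it.

```typescript
// Hypothetical outcome shape mirroring delivery_attempts columns named in the story.
type DeliveryOutcome = {
  status: "SUCCESS" | "FAILED";
  httpStatusCode: number | null;
  responseBodyPreview: string | null;
};

// 2xx is SUCCESS; everything else, including 3xx, is FAILED.
function classifyResponse(httpStatusCode: number, bodyText: string): DeliveryOutcome {
  const ok = httpStatusCode >= 200 && httpStatusCode < 300;
  return {
    status: ok ? "SUCCESS" : "FAILED",
    httpStatusCode,
    responseBodyPreview: bodyText.slice(0, 512),
  };
}

async function deliver(
  url: string,
  rawBody: string,
  headers: Record<string, string>,
): Promise<DeliveryOutcome> {
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json", ...headers },
      body: rawBody,
      redirect: "manual", // do not follow 3xx responses
      signal: AbortSignal.timeout(5000), // abort the request after 5 s
    });
    return classifyResponse(res.status, await res.text());
  } catch {
    // Timeout, DNS failure, connection reset, etc.: no HTTP status is available.
    return { status: "FAILED", httpStatusCode: null, responseBodyPreview: null };
  }
}
```

Storing null for the status code on timeouts and network errors is an assumption; the story only states that the HTTP status code is stored when one exists.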


US-HOOK-007: Exponential Backoff Retry Schedule

Title: As the platform, I want failed deliveries retried with exponential backoff so that transient customer endpoint outages are handled without losing events.

Description: Implement retry schedule: immediate → 30 s → 5 min → 30 min → 2 h. Retry state stored in hook.delivery_attempts (no Redis). Retry poller uses SKIP LOCKED for fan-out.

Acceptance Criteria:

  • next_retry_at values match defined schedule for each attempt number
  • SKIP LOCKED prevents double-processing across pods
  • Retry poller runs every 10 s
  • Failed attempt updates status = FAILED_RETRY and correct next_retry_at
  • Integration test: verify 3 consecutive failures result in correct next_retry_at timestamps

Story Points: 5
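
The schedule above can be expressed as a lookup keyed by the attempt that just failed. This is a sketch under stated assumptions: attempts are numbered from 1, the first attempt is the immediate one, and a null result means no further retry is scheduled. The SQL string shows one common way to claim due rows with SKIP LOCKED in PostgreSQL; the id column and the $1 batch-size parameter are hypothetical.

```typescript
// Delay before the next attempt, indexed by (failed attempt number - 1).
// Attempt 1 is immediate, so the delays cover attempts 2 through 5.
const RETRY_DELAYS_MS = [30_000, 5 * 60_000, 30 * 60_000, 2 * 60 * 60_000];
const MAX_ATTEMPTS = 5;

// Returns the next_retry_at value after a failed attempt, or null when attempts are exhausted.
function nextRetryAt(failedAttemptNumber: number, failedAt: Date): Date | null {
  if (!Number.isInteger(failedAttemptNumber) || failedAttemptNumber < 1) {
    throw new RangeError("failedAttemptNumber must be a positive integer");
  }
  if (failedAttemptNumber >= MAX_ATTEMPTS) {
    return null;
  }
  return new Date(failedAt.getTime() + RETRY_DELAYS_MS[failedAttemptNumber - 1]);
}

// Illustrative poller query (not executed here). Rows locked by another pod are skipped,
// so two pollers never claim the same delivery attempt.
const CLAIM_DUE_RETRIES_SQL = `
  SELECT id
    FROM hook.delivery_attempts
   WHERE status = 'FAILED_RETRY'
     AND next_retry_at <= now()
   ORDER BY next_retry_at
   LIMIT $1
     FOR UPDATE SKIP LOCKED
`;
```

The row locks taken by FOR UPDATE SKIP LOCKED last only for the enclosing transaction, so the poller would need to update the claimed rows (or complete the attempt) before committing.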


US-HOOK-008: Dead-Letter After Maximum Retries

Title: As the platform, I want deliveries that exhaust all 5 attempts to be dead-lettered so that they can be monitored and replayed by support.

Description: After attempt 5 fails: set status = DEAD_LETTER, publish webhook.dispatch.deadletter to NATS, increment hook_deliveries_dead_lettered_total.

Acceptance Criteria:

  • DEAD_LETTER status set after exactly 5 failed attempts
  • webhook.dispatch.deadletter NATS event published with reason: MAX_RETRIES_EXCEEDED
  • No further retry attempts after dead-letter
  • hook_deliveries_dead_lettered_total counter incremented
  • Alert fires when dead-letter rate > 100/min

Story Points: 3


EP-HOOK-03: Observability & Operations


US-HOOK-009: Delivery Attempt History API

Title: As a customer, I want to query my webhook delivery history via API so that I can diagnose delivery failures and track retry status.

Description: Implement GET /v1/webhooks/deliveries with optional webhookId and status filters. Paginated response including httpStatusCode, nextRetryAt, attemptNumber.

Acceptance Criteria:

  • Returns paginated delivery_attempts rows scoped to requesting account
  • webhookId filter narrows results
  • status filter accepts PENDING, SUCCESS, FAILED_RETRY, DEAD_LETTER
  • Does not expose payload_snapshot (internal field) — only safe fields returned
  • Response time p99 < 200 ms

Story Points: 2


US-HOOK-010: Prometheus Metrics Instrumentation

Title: As an SRE, I want comprehensive Prometheus metrics so that I can monitor delivery health and retry queue depth.

Description: Implement all 14 metrics from OBSERVABILITY.md §1.

Acceptance Criteria:

  • All metrics present at /metrics endpoint
  • hook_delivery_duration_seconds histogram with buckets (in seconds): 0.1, 0.5, 1, 2, 5, 10
  • hook_retry_poller_lag_seconds gauge updates each poller cycle
  • Grafana dashboard webhook-dispatcher-overview loads without errors

Story Points: 3


US-HOOK-011: Structured JSON Logging

Title: As an SRE, I want structured JSON log events for all delivery paths so that I can correlate webhook failures with specific accounts and delivery IDs.

Description: Implement Pino structured logging for all 8 log events in OBSERVABILITY.md §2. No phone numbers (to field) in logs.

Acceptance Criteria:

  • All 8 log events implemented with correct fields
  • to (E.164 phone) never in log output
  • traceId and spanId present when OTLP trace active
  • deliveryId present in all delivery-related log events for correlation

Story Points: 2


EP-HOOK-04: Security


US-HOOK-012: Webhook Secret Encryption at Rest

Title: As a security engineer, I want webhook signing secrets encrypted in the database so that a DB compromise does not expose customer secrets.

Description: Implement AES-256-GCM envelope encryption using KMS-managed key. Plaintext secret never written to DB. Decryption happens in-process at delivery time only.

Acceptance Criteria:

  • secret_enc column contains ciphertext only
  • Plaintext secret absent from all DB queries, logs, and API responses
  • Secret rotation via PUT /v1/webhooks/:id re-encrypts with current KMS key
  • Security review sign-off obtained

Story Points: 3
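
The AES-256-GCM step can be sketched with Node's crypto module. This covers only the symmetric encryption of the secret: in an envelope scheme the 32-byte data key would itself be generated and wrapped by KMS, which is out of scope here, so dataKey is simply passed in. The EncryptedSecret shape and function names are assumptions.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical layout of what would be serialised into secret_enc.
interface EncryptedSecret {
  iv: Buffer;
  authTag: Buffer;
  ciphertext: Buffer;
}

// dataKey must be 32 bytes (AES-256). Assumption: it is obtained from KMS by the caller.
function encryptSecret(plaintext: string, dataKey: Buffer): EncryptedSecret {
  const iv = randomBytes(12); // 96-bit nonce, unique per encryption
  const cipher = createCipheriv("aes-256-gcm", dataKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, authTag: cipher.getAuthTag(), ciphertext };
}

// Throws if the ciphertext or auth tag has been modified.
function decryptSecret(enc: EncryptedSecret, dataKey: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", dataKey, enc.iv);
  decipher.setAuthTag(enc.authTag);
  return Buffer.concat([decipher.update(enc.ciphertext), decipher.final()]).toString("utf8");
}
```

A fresh random IV per encryption matters for GCM: reusing an IV with the same key undermines both confidentiality and integrity.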


US-HOOK-013: SSRF Prevention for Webhook URLs

Title: As a security engineer, I want webhook delivery to be blocked from accessing internal network ranges so that a malicious customer cannot use the service for SSRF.

Description: Enforce NetworkPolicy blocking egress to private IP ranges and cloud metadata endpoint. Document behaviour: URL pointing to private range will fail at network layer.

Acceptance Criteria:

  • NetworkPolicy blocks 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.169.254/32
  • Test in staging: register webhook pointing to internal service → delivery fails at network level
  • Security review sign-off obtained

Story Points: 2
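
To make the blocked ranges concrete, the sketch below checks whether an IPv4 address falls inside any of the CIDR blocks listed in the acceptance criteria. It is an illustration of what the ranges cover (for example, when writing the staging test), not a replacement for the NetworkPolicy; the function names are assumptions and IPv6 is not handled.

```typescript
// CIDR blocks taken from the acceptance criteria.
const BLOCKED_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.169.254/32"];

// Converts dotted-quad IPv4 to an unsigned integer; throws on malformed input.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4) throw new Error(`not an IPv4 address: ${ip}`);
  let value = 0;
  for (const part of parts) {
    const octet = Number(part);
    if (!/^\d{1,3}$/.test(part) || octet > 255) throw new Error(`not an IPv4 address: ${ip}`);
    value = value * 256 + octet;
  }
  return value;
}

// True when ip lies inside the given CIDR block.
function inCidr(ip: string, cidr: string): boolean {
  const [base, prefix] = cidr.split("/");
  const size = Math.pow(2, 32 - Number(prefix)); // number of addresses in the block
  const start = Math.floor(ipv4ToInt(base) / size) * size;
  const addr = ipv4ToInt(ip);
  return addr >= start && addr < start + size;
}

function isBlockedAddress(ip: string): boolean {
  return BLOCKED_CIDRS.some((cidr) => inCidr(ip, cidr));
}
```

Arithmetic is used instead of bitwise operators because JavaScript bitwise operations work on signed 32-bit integers, which complicates comparisons for addresses above 127.255.255.255.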


US-HOOK-014: Rate Limiting on REST API

Title: As a security engineer, I want the webhook management REST API rate-limited so that automated abuse is prevented.

Description: Configure the Kong rate-limiting plugin at 60 requests/minute per account. Request body size limited to 1 KB.

Acceptance Criteria:

  • 61st request within 60 s returns HTTP 429 Too Many Requests
  • Request body > 1 KB returns HTTP 413 Payload Too Large
  • Rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) present in responses

Story Points: 1


EP-HOOK-05: Customer-Endpoint Circuit Breaker + Tenant-Portal Alerts

Description: When a tenant's webhook endpoint becomes persistently unhealthy, the dispatcher must protect itself (and the platform) by opening a circuit breaker per (tenantId, endpointId), surfacing the failure to the tenant in the customer portal, and pausing further attempts until the tenant takes action OR the breaker auto-recovers.


US-HOOK-015: Per-endpoint circuit breaker state machine

Title: As the webhook-dispatcher, I want a CLOSED → OPEN → HALF_OPEN → CLOSED state machine per endpoint so that consistently-failing endpoints don't waste retry budget.

Acceptance Criteria:

  • State stored in Redis under key hook:cb:{endpointId}; snapshot persisted to hook.endpoint_state every minute
  • Trips to OPEN after 10 consecutive failures or failureRate > 50% over 100 attempts
  • While OPEN, new attempts are dropped immediately without an HTTP call; only bookkeeping is recorded
  • After 5 min in OPEN → HALF_OPEN; allows 3 trial requests
  • 3 trial successes → CLOSED; any failure → back to OPEN with doubling timeout (max 60 min)
  • Metric hook_circuit_state{endpointId,state} gauge
  • Unit tests cover all transitions

Story Points: 5
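
The transitions above can be sketched as an in-memory state machine. This is a simplified illustration: it implements only the consecutive-failure trip rule (the failure-rate rule over 100 attempts is omitted), keeps state in process memory rather than Redis, and takes the current time as a parameter so it can be tested deterministically. Class and method names are assumptions.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

const FAILURE_THRESHOLD = 10;
const TRIAL_REQUESTS = 3;
const BASE_OPEN_MS = 5 * 60_000;
const MAX_OPEN_MS = 60 * 60_000;

class CircuitBreaker {
  private state: CircuitState = "CLOSED";
  private consecutiveFailures = 0;
  private trialsStarted = 0;
  private trialSuccesses = 0;
  private openedAtMs = 0;
  private openTimeoutMs = BASE_OPEN_MS;

  getState(): CircuitState {
    return this.state;
  }

  // Returns true if a delivery attempt may be sent at time nowMs.
  allow(nowMs: number): boolean {
    if (this.state === "OPEN" && nowMs - this.openedAtMs >= this.openTimeoutMs) {
      this.state = "HALF_OPEN";
      this.trialsStarted = 0;
      this.trialSuccesses = 0;
    }
    if (this.state === "OPEN") return false;
    if (this.state === "HALF_OPEN") {
      if (this.trialsStarted >= TRIAL_REQUESTS) return false;
      this.trialsStarted += 1;
    }
    return true;
  }

  onSuccess(): void {
    if (this.state === "HALF_OPEN") {
      this.trialSuccesses += 1;
      if (this.trialSuccesses >= TRIAL_REQUESTS) {
        this.state = "CLOSED";
        this.consecutiveFailures = 0;
        this.openTimeoutMs = BASE_OPEN_MS;
      }
      return;
    }
    this.consecutiveFailures = 0;
  }

  onFailure(nowMs: number): void {
    if (this.state === "OPEN") return; // late result; the circuit is already open
    if (this.state === "HALF_OPEN") {
      // A failed trial reopens the circuit with a doubled timeout, capped at 60 min.
      this.openTimeoutMs = Math.min(this.openTimeoutMs * 2, MAX_OPEN_MS);
      this.trip(nowMs);
      return;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= FAILURE_THRESHOLD) this.trip(nowMs);
  }

  private trip(nowMs: number): void {
    this.state = "OPEN";
    this.openedAtMs = nowMs;
    this.consecutiveFailures = 0;
  }
}
```

Resetting the open timeout to 5 min once the circuit closes is an assumption; the acceptance criteria state the doubling and the 60 min cap but not when the timeout resets.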


US-HOOK-016: Tenant portal alert on circuit OPEN

Title: As a customer, I want a banner in the customer portal when one of my webhook endpoints is in OPEN state so I can fix it.

Acceptance Criteria:

  • hook.circuit.opened.v1 NATS event emitted on transition to OPEN
  • notification-service consumes and creates portal banner + email digest
  • Banner shows endpoint URL, last error, retry-after timestamp
  • Banner persists until circuit returns to CLOSED

Story Points: 3


US-HOOK-017: Manual circuit reset

Title: As a customer, I want to manually reset a webhook circuit after I've fixed my endpoint.

Acceptance Criteria:

  • POST /v1/webhooks/{endpointId}/reset resets circuit to CLOSED with no waiting
  • Audit event published; customer-portal action logged
  • Rate-limited to 1/min per endpoint to prevent abuse

Story Points: 3


US-HOOK-018: Dead-endpoint pruning

Title: As a platform operator, I want webhook endpoints OPEN for > 30 d auto-disabled so they don't accumulate.

Acceptance Criteria:

  • Daily cron job: endpoints OPEN for > 30 d are set to status = DISABLED
  • Tenant notified by email + portal banner
  • Re-enable requires explicit tenant action

Story Points: 3


US-HOOK-019: Per-endpoint failure analytics

Title: As an SRE, I want per-endpoint failure-rate analytics so I can spot tenant integration issues.

Acceptance Criteria:

  • Dashboard panel: top-50 endpoints by failure rate (24h)
  • Drill-down to endpoint shows: HTTP status code distribution, latency histogram, retry count
  • Linked alert WebhookEndpointMassiveFailure when an endpoint fails 1000 attempts in 1 h

Story Points: 5


EP-HOOK-06: Per-Tenant Egress Pool, Back-Pressure, Rate-Limit Caps

Description: A national event (e.g., emergency broadcast or campaign storm) can produce a webhook stampede that saturates egress and looks like DDoS to customers. Per-tenant egress pools and back-pressure prevent one tenant from starving another or overwhelming a small endpoint.


US-HOOK-020: Per-tenant worker pool

Title: As a platform engineer, I want each tenant to have a bounded concurrent-worker pool so one tenant's stampede doesn't drain platform-wide workers.

Acceptance Criteria:

  • Worker pool size per tenant tier: TRIAL=10, STANDARD=50, ENTERPRISE=200
  • Excess work queued in hook.tenant_queue with FIFO order
  • Metric hook_worker_active{tenantId}, hook_queue_depth{tenantId}

Story Points: 5
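
A minimal model of the bounded pool with a FIFO overflow queue, using the tier sizes from the acceptance criteria. It tracks slots and queue order only, with no real workers or persistence; the TenantPool class and its method names are assumptions, and the in-memory array stands in for hook.tenant_queue.

```typescript
type Tier = "TRIAL" | "STANDARD" | "ENTERPRISE";

const POOL_SIZE: Record<Tier, number> = { TRIAL: 10, STANDARD: 50, ENTERPRISE: 200 };

class TenantPool {
  private active = 0;
  private readonly queue: string[] = [];
  private readonly limit: number;

  constructor(tier: Tier) {
    this.limit = POOL_SIZE[tier];
  }

  // Starts the delivery if a worker slot is free; otherwise queues it in arrival order.
  submit(deliveryId: string): "started" | "queued" {
    if (this.active < this.limit) {
      this.active += 1;
      return "started";
    }
    this.queue.push(deliveryId);
    return "queued";
  }

  // Called when a worker finishes. The freed slot goes to the oldest queued delivery,
  // whose id is returned; if the queue is empty the slot is released.
  complete(): string | undefined {
    const next = this.queue.shift();
    if (next === undefined) this.active -= 1;
    return next;
  }

  activeCount(): number {
    return this.active;
  }

  queueDepth(): number {
    return this.queue.length;
  }
}
```

activeCount and queueDepth correspond to the values that hook_worker_active{tenantId} and hook_queue_depth{tenantId} would report for this tenant.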


US-HOOK-021: Per-endpoint outbound rate limit

Title: As a tenant, I want to declare a max RPS my endpoint can handle so the dispatcher paces itself.

Acceptance Criteria:

  • hook.endpoints.maxRps field; default 50
  • Dispatcher enforces using token bucket per endpoint (Redis)
  • Excess work waits in queue (with TTL); hook_pacing_delayed_total counter

Story Points: 5
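
The pacing rule can be illustrated with an in-memory token bucket. The story places the bucket in Redis; this sketch keeps it in process memory to show the arithmetic only. The class name and the choice of a burst capacity equal to one second of traffic (maxRps tokens) are assumptions.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly maxRps: number;

  // Assumption: the bucket starts full and holds at most one second of traffic.
  constructor(maxRps: number, nowMs: number) {
    this.maxRps = maxRps;
    this.tokens = maxRps;
    this.lastRefillMs = nowMs;
  }

  // Refills according to elapsed time, then tries to take one token.
  // Returns false when the delivery should wait in the queue instead.
  tryTake(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.maxRps, this.tokens + elapsedSec * this.maxRps);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With the default maxRps of 50, a burst of 50 requests passes immediately and later requests are admitted at one per 20 ms on average. Each false return is where hook_pacing_delayed_total would be incremented.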


US-HOOK-022: Webhook payload deduplication

Title: As a customer, I want at-least-once delivery with dedup-key so retries don't cause my application to process the same event twice.

Acceptance Criteria:

  • Header X-Ghasi-Idempotency-Key: {messageId}-{eventType} set on every POST
  • Customer can dedup using this header
  • Documented in developer portal

Story Points: 2
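
A sketch of both sides of the deduplication contract: how the key is formed from {messageId}-{eventType}, and how a customer application might use it. The Deduplicator class is a hypothetical customer-side helper; it keeps keys in memory, whereas a real receiver would more likely persist them with an expiry.

```typescript
// Dispatcher side: value of the X-Ghasi-Idempotency-Key header.
function idempotencyKey(messageId: string, eventType: string): string {
  return `${messageId}-${eventType}`;
}

// Customer side (illustrative): process each key at most once.
class Deduplicator {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is seen and false on every retry of the same event.
  shouldProcess(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Because the key is derived from the message and event type rather than the delivery attempt, every retry of the same event carries the same key.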


US-HOOK-023: Tenant-level outbound bytes/s budget

Title: As a finance stakeholder, I want each tenant's outbound webhook bytes/s capped so we can predict bandwidth cost.

Acceptance Criteria:

  • Per-tier budget: TRIAL 1MB/s, STANDARD 10MB/s, ENTERPRISE 100MB/s
  • Excess paced (not dropped); customer notified if sustained
  • Metric hook_egress_bytes_total{tenantId} counter

Story Points: 3


EP-HOOK-07: Signing-Key Rotation with Dual-Sig Grace Period

Description: HMAC signing keys must rotate on a cadence (maximum key age of 365 d per tenant). During rotation, both old and new signatures are sent so that the customer's verifier doesn't break mid-flight.


US-HOOK-024: Per-tenant signing-key versioning

Title: As a platform engineer, I want each tenant to have versioned HMAC keys so rotation is traceable.

Acceptance Criteria:

  • Table hook.signing_keys with columns (tenantId, keyId, secret (HMAC-SHA256 key), status ENUM(ACTIVE, RETIRING, RETIRED), createdAt)
  • Always exactly one ACTIVE; up to one RETIRING
  • Secret stored as KMS-encrypted blob; never returned by any API after creation

Story Points: 3


US-HOOK-025: Dual-signature grace period

Title: As a customer, I want both old and new signatures during rotation so my verifier doesn't break.

Acceptance Criteria:

  • During RETIRING window (default 7 d): outbound POST includes X-Ghasi-Signature: v1=<old>; v2=<new>
  • After RETIRING expires: only new signature
  • Customer-portal alert when retiring window starts and ends

Story Points: 5
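
A sketch of the dual-signature header and a customer-side verifier that accepts either signature. It assumes v1 carries the signature from the RETIRING key and v2 the signature from the ACTIVE key, both as hex HMAC-SHA256 over the raw body, matching the v1=<old>; v2=<new> format above. Function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function hmacHex(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Dispatcher side: header value sent during the RETIRING window.
function dualSignatureHeader(rawBody: string, retiringSecret: string, activeSecret: string): string {
  return `v1=${hmacHex(rawBody, retiringSecret)}; v2=${hmacHex(rawBody, activeSecret)}`;
}

// Customer side: accept the request if any signature in the header matches
// any secret the customer currently trusts.
function verifySignature(rawBody: string, header: string, trustedSecrets: string[]): boolean {
  const provided = header
    .split(";")
    .map((part) => part.trim().split("=")[1])
    .filter((sig): sig is string => typeof sig === "string" && sig.length > 0);
  for (const secret of trustedSecrets) {
    const expected = Buffer.from(hmacHex(rawBody, secret), "hex");
    for (const sig of provided) {
      const candidate = Buffer.from(sig, "hex");
      // timingSafeEqual requires equal lengths, so check that first.
      if (candidate.length === expected.length && timingSafeEqual(candidate, expected)) {
        return true;
      }
    }
  }
  return false;
}
```

A verifier written this way keeps working through the rotation whether the customer has configured only the old key, only the new key, or both.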


US-HOOK-026: Self-serve key rotation

Title: As a customer, I want to rotate my webhook signing key from the portal.

Acceptance Criteria:

  • POST /v1/webhooks/signing-keys/rotate initiates rotation
  • New key shown once in modal (not retrievable again)
  • Old key enters RETIRING state
  • Audit event webhook.signing_key.rotated.v1

Story Points: 3


EP-HOOK-08: mTLS-to-Customer Webhooks (optional, per-tenant)

Description: Enterprise customers may require Ghasi to present a client certificate (mTLS) on webhook delivery so they can pin Ghasi as the source. Pairs with EP-KONG-07 US-KONG-038.


US-HOOK-027: Per-tenant mTLS toggle and platform CA exposure

Title: As an enterprise tenant, I want to enable mTLS on my webhook endpoint so I can reject any caller other than Ghasi.

Acceptance Criteria:

  • hook.endpoints.mtlsEnabled boolean per endpoint
  • Customer-portal page shows the platform CA chain + SPIFFE ID for the tenant to install in their TLS reverse proxy
  • Egress proxy presents the platform-wide client SVID (rotated every 1 h via SPIRE)

Story Points: 5


US-HOOK-028: mTLS handshake failure handling

Title: As the dispatcher, I want mTLS handshake failures distinguished from HTTP errors so customers can debug TLS issues.

Acceptance Criteria:

  • Failure logged as code: "TLS_HANDSHAKE_FAILED" in hook.delivery_logs
  • Distinct counter hook_mtls_handshake_failed_total
  • Customer-portal banner explains the likely cause (cert expired, wrong CA, SNI mismatch)

Story Points: 3