Webhook Dispatcher — Epics & User Stories
Status: populated
Owner: Platform Engineering / Product
Last updated: 2026-04-18
EP-HOOK-01: Webhook Configuration Management
Description: Provide customers with a REST API to register, manage, and monitor their webhook endpoints, enabling them to receive DLR notifications via HTTP.
Acceptance Criteria:
- Full CRUD operations on webhook configurations
- Maximum 10 webhooks per account enforced
- Webhook secrets stored encrypted; never returned in API responses
- All endpoints authenticated via JWT / Kong
US-HOOK-001: Register a Webhook Endpoint
Title: As a customer, I want to register a webhook URL so that I receive HTTP notifications when my messages are delivered.
Description: Implement POST /v1/webhooks. Validate URL (HTTPS), secret (16–128 chars), optional event filter, and max-10 constraint. Encrypt secret before persisting.
Acceptance Criteria:
- 201 Created response with new webhookId
- secret NOT in response
- 400 for non-HTTPS URL
- 400 for secret < 16 chars
- 422 when account already has 10 active webhooks
- Integration test: register → verify DB row; verify secret encrypted in secret_enc
Story Points: 3
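A minimal sketch of the validation order this story implies. The request shape, function name, and result type are hypothetical, the caller is assumed to supply the account's current active-webhook count, and returning 400 for a secret longer than 128 chars is an assumption (the criteria only name the < 16 case).

```typescript
// Hypothetical request shape; field names follow the story text where given.
interface RegisterWebhookRequest {
  url: string;
  secret: string;
  events?: string[];
}

type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400 | 422; reason: string };

const MAX_WEBHOOKS_PER_ACCOUNT = 10;

function validateRegistration(
  req: RegisterWebhookRequest,
  activeWebhookCount: number,
): ValidationResult {
  let parsed: URL;
  try {
    parsed = new URL(req.url);
  } catch {
    return { ok: false, status: 400, reason: "url is not a valid URL" };
  }
  if (parsed.protocol !== "https:") {
    return { ok: false, status: 400, reason: "url must use HTTPS" };
  }
  // Upper bound handled as 400 by assumption; the spec only lists the lower bound.
  if (req.secret.length < 16 || req.secret.length > 128) {
    return { ok: false, status: 400, reason: "secret must be 16-128 chars" };
  }
  if (activeWebhookCount >= MAX_WEBHOOKS_PER_ACCOUNT) {
    return { ok: false, status: 422, reason: "account already has 10 active webhooks" };
  }
  return { ok: true };
}
```

Checking the payload before the account limit means a malformed request gets a 400 even when the account is already full; the opposite order would also satisfy the criteria as written.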
US-HOOK-002: List, Update, and Delete Webhooks
Title: As a customer, I want to manage my webhook configurations so that I can change endpoints or disable notifications.
Description: Implement GET /v1/webhooks, PUT /v1/webhooks/:id, DELETE /v1/webhooks/:id. Enforce account isolation on all operations.
Acceptance Criteria:
- GET returns paginated list with isActive, url, events but not secret
- PUT supports partial update; re-encrypts secret if updated
- DELETE hard-deletes config; delivery_attempts rows retained
- 404 when attempting to access another account's webhook
- Unit tests cover ownership enforcement
Story Points: 3
US-HOOK-003: Filter Webhooks by Event Type
Title: As a customer, I want to register webhooks that only fire for specific DLR outcomes so that I can route different events to different endpoints.
Description: Support events array on webhook_configs (default: all event types). Dispatch logic filters active webhooks by event type match before creating delivery attempts.
Acceptance Criteria:
- Webhook with events: ['DLR_DELIVERED'] receives only DELIVERED dispatches
- Default (empty events array) receives all event types
- PUT can update events array
- Integration test: two webhooks (different event filters); correct routing verified
Story Points: 2
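The filtering rule can be sketched as a pure function. The WebhookConfig shape and the function name are hypothetical; only the events array, the isActive flag, and the "empty means all" default come from the stories above.

```typescript
// Hypothetical in-memory view of a webhook_configs row.
interface WebhookConfig {
  id: string;
  isActive: boolean;
  events: string[]; // empty array means "all event types"
}

// Returns the active webhooks that should receive a dispatch for eventType.
function selectWebhooksForEvent(
  configs: WebhookConfig[],
  eventType: string,
): WebhookConfig[] {
  return configs.filter(
    (c) => c.isActive && (c.events.length === 0 || c.events.includes(eventType)),
  );
}
```

For example, given one webhook with events ['DLR_DELIVERED'] and one with an empty array, a DLR_DELIVERED event selects both, while any other event type selects only the second.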
EP-HOOK-02: Webhook Delivery Engine
Description: Build the core delivery engine that consumes webhook.dispatch NATS events, performs HMAC-signed HTTP POSTs, and manages retry state in PostgreSQL.
US-HOOK-004: Consume webhook.dispatch Events from NATS JetStream
Title: As the platform, I want the Webhook Dispatcher to consume webhook.dispatch events durably so that no delivery notification is missed.
Description: Implement durable NATS consumer webhook-dispatcher with AckExplicit, MaxConcurrency: 20. Ack after delivery_attempts rows written to DB (before HTTP attempt).
Acceptance Criteria:
- Consumer webhook-dispatcher present in NATS consumer list
- Ack happens post-DB-write, pre-HTTP attempt
- On pod restart, no events reprocessed if already Acked (DB records guard idempotency)
- /ready returns 503 when consumer disconnected
Story Points: 3
US-HOOK-005: HMAC-SHA256 Request Signing
Title: As a customer, I want each webhook delivery to include an HMAC-SHA256 signature so that I can verify the request authenticity.
Description: Compute sha256=<hex> HMAC over raw request body using per-webhook secret. Include as X-Ghasi-Signature header. Also include X-Ghasi-Event, X-Ghasi-Delivery-Id, X-Ghasi-Timestamp.
Acceptance Criteria:
- X-Ghasi-Signature: sha256=<64-char hex> present on every delivery
- Signature computed over UTF-8 bytes of raw JSON body
- Reference test: known body + known secret → expected signature hash verified
- Unit test: tampered body produces different signature
Story Points: 2
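A minimal sketch of the signing step using Node's built-in crypto module. The function names are illustrative, and the verifier shows how a receiving endpoint might check the header; it is not a prescribed customer implementation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes the X-Ghasi-Signature value over the UTF-8 bytes of the raw body.
function signBody(rawBody: string, secret: string): string {
  const hex = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return `sha256=${hex}`;
}

// Receiver-side check: recompute and compare in constant time.
function verifySignature(rawBody: string, secret: string, header: string): boolean {
  const expected = Buffer.from(signBody(rawBody, secret), "utf8");
  const received = Buffer.from(header, "utf8");
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The signature must be computed over the exact bytes that are sent; re-serializing the JSON on either side can change key order or whitespace and break verification.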
US-HOOK-006: HTTP Delivery with Timeout and No Redirects
Title: As the platform, I want webhook delivery to POST with a 5-second timeout and reject redirects so that delivery is fast and predictable.
Description: Use undici or native fetch with AbortSignal.timeout(5000) and redirect: 'manual'. 2xx = SUCCESS; all other responses = FAILED.
Acceptance Criteria:
- Delivery times out and fails after 5 s (not 5.1 s)
- HTTP 3xx response treated as failure (not followed)
- HTTP status code stored in delivery_attempts.http_status_code
- First 512 chars of response body stored in response_body_preview
- Integration test using mock HTTP server verifying all failure cases
Story Points: 3
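A sketch of the delivery call, assuming a runtime with a global fetch and AbortSignal.timeout (for example a recent Node.js release). The DeliveryOutcome type and both function names are hypothetical; recording a timeout or network error with a null status code is also an assumption.

```typescript
type DeliveryOutcome = {
  status: "SUCCESS" | "FAILED";
  httpStatusCode: number | null;
  responseBodyPreview: string | null;
};

// 2xx is SUCCESS; every other status, including 3xx, is FAILED.
function classifyResponse(httpStatusCode: number, body: string): DeliveryOutcome {
  const success = httpStatusCode >= 200 && httpStatusCode < 300;
  return {
    status: success ? "SUCCESS" : "FAILED",
    httpStatusCode,
    responseBodyPreview: body.slice(0, 512), // first 512 chars only
  };
}

async function deliver(
  url: string,
  rawBody: string,
  headers: Record<string, string>,
): Promise<DeliveryOutcome> {
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json", ...headers },
      body: rawBody,
      redirect: "manual", // do not follow redirects
      signal: AbortSignal.timeout(5000), // abort after 5 s
    });
    return classifyResponse(res.status, await res.text());
  } catch {
    // Timeout, DNS failure, connection reset, and similar errors land here.
    return { status: "FAILED", httpStatusCode: null, responseBodyPreview: null };
  }
}
```

Keeping the classification in a separate pure function makes the 2xx/3xx/4xx/5xx cases unit-testable without a mock server; the mock server is then needed only for timeout and redirect behaviour.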
US-HOOK-007: Exponential Backoff Retry Schedule
Title: As the platform, I want failed deliveries retried with exponential backoff so that transient customer endpoint outages are handled without losing events.
Description: Implement retry schedule: immediate → 30 s → 5 min → 30 min → 2 h. Retry state stored in hook.delivery_attempts (no Redis). Retry poller uses SKIP LOCKED for fan-out.
Acceptance Criteria:
- next_retry_at values match defined schedule for each attempt number
- SKIP LOCKED prevents double-processing across pods
- Retry poller runs every 10 s
- Failed attempt updates status = FAILED_RETRY and correct next_retry_at
- Integration test: verify 3 consecutive failures result in correct next_retry_at timestamps
Story Points: 5
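One way to read the schedule is that attempt 1 is the immediate delivery and the four listed delays separate attempts 1→2, 2→3, 3→4, and 4→5. The sketch below encodes that interpretation, which is an assumption; the function and constant names are illustrative.

```typescript
// Delay applied after a failed attempt N (index N-1), under the reading that
// attempt 1 is sent immediately and five attempts are made in total.
const RETRY_DELAYS_MS: number[] = [
  30_000, // after attempt 1: 30 s
  5 * 60_000, // after attempt 2: 5 min
  30 * 60_000, // after attempt 3: 30 min
  2 * 60 * 60_000, // after attempt 4: 2 h
];
const MAX_ATTEMPTS = 5;

// Returns the next_retry_at value for a failed attempt, or null when the
// attempt was the last one and the delivery should be dead-lettered instead.
function nextRetryAt(failedAttemptNumber: number, failedAt: Date): Date | null {
  if (!Number.isInteger(failedAttemptNumber) || failedAttemptNumber < 1) {
    throw new RangeError("attempt number must be a positive integer");
  }
  if (failedAttemptNumber >= MAX_ATTEMPTS) {
    return null;
  }
  return new Date(failedAt.getTime() + RETRY_DELAYS_MS[failedAttemptNumber - 1]);
}
```

A null result corresponds to the dead-letter path in US-HOOK-008; any non-null result is what a FAILED_RETRY row would store.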
US-HOOK-008: Dead-Letter After Maximum Retries
Title: As the platform, I want deliveries that exhaust all 5 attempts to be dead-lettered so that they can be monitored and replayed by support.
Description: After attempt 5 fails: set status = DEAD_LETTER, publish webhook.dispatch.deadletter to NATS, increment hook_deliveries_dead_lettered_total.
Acceptance Criteria:
- DEAD_LETTER status set after exactly 5 failed attempts
- webhook.dispatch.deadletter NATS event published with reason: MAX_RETRIES_EXCEEDED
- No further retry attempts after dead-letter
- hook_deliveries_dead_lettered_total counter incremented
- Alert fires when dead-letter rate > 100/min
Story Points: 3
EP-HOOK-03: Observability & Operations
US-HOOK-009: Delivery Attempt History API
Title: As a customer, I want to query my webhook delivery history via API so that I can diagnose delivery failures and track retry status.
Description: Implement GET /v1/webhooks/deliveries with optional webhookId and status filters. Paginated response including httpStatusCode, nextRetryAt, attemptNumber.
Acceptance Criteria:
- Returns paginated delivery_attempts rows scoped to requesting account
- webhookId filter narrows results
- status filter accepts PENDING, SUCCESS, FAILED_RETRY, DEAD_LETTER
- Does not expose payload_snapshot (internal field); only safe fields returned
- Response time p99 < 200 ms
Story Points: 2
US-HOOK-010: Prometheus Metrics Instrumentation
Title: As an SRE, I want comprehensive Prometheus metrics so that I can monitor delivery health and retry queue depth.
Description: Implement all 14 metrics from OBSERVABILITY.md §1.
Acceptance Criteria:
- All metrics present at /metrics endpoint
- hook_delivery_duration_seconds histogram with buckets: 100ms, 500ms, 1s, 2s, 5s, 10s
- hook_retry_poller_lag_seconds gauge updates each poller cycle
- Grafana dashboard webhook-dispatcher-overview loads without errors
Story Points: 3
US-HOOK-011: Structured JSON Logging
Title: As an SRE, I want structured JSON log events for all delivery paths so that I can correlate webhook failures with specific accounts and delivery IDs.
Description: Implement Pino structured logging for all 8 log events in OBSERVABILITY.md §2. No phone numbers (to field) in logs.
Acceptance Criteria:
- All 8 log events implemented with correct fields
- to (E.164 phone) never in log output
- traceId and spanId present when OTLP trace active
- deliveryId present in all delivery-related log events for correlation
Story Points: 2
EP-HOOK-04: Security
US-HOOK-012: Webhook Secret Encryption at Rest
Title: As a security engineer, I want webhook signing secrets encrypted in the database so that a DB compromise does not expose customer secrets.
Description: Implement AES-256-GCM envelope encryption using KMS-managed key. Plaintext secret never written to DB. Decryption happens in-process at delivery time only.
Acceptance Criteria:
- secret_enc column contains ciphertext only
- Plaintext secret absent from all DB queries, logs, and API responses
- Secret rotation via PUT /v1/webhooks/:id re-encrypts with current KMS key
- Security review sign-off obtained
Story Points: 3
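A sketch of the AES-256-GCM step using Node's crypto module. In envelope encryption the 32-byte data key would itself be generated and wrapped by the KMS; here a locally generated random key stands in for it, and the EncryptedSecret layout and function names are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical layout of what would be serialized into secret_enc.
interface EncryptedSecret {
  iv: Buffer; // 12-byte nonce, unique per encryption
  ciphertext: Buffer;
  authTag: Buffer; // 16-byte GCM authentication tag
}

function encryptSecret(plaintext: string, dataKey: Buffer): EncryptedSecret {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", dataKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

// Throws if the ciphertext or tag was modified, or if the key is wrong.
function decryptSecret(enc: EncryptedSecret, dataKey: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", dataKey, enc.iv);
  decipher.setAuthTag(enc.authTag);
  return Buffer.concat([decipher.update(enc.ciphertext), decipher.final()]).toString("utf8");
}

// Stand-in for a KMS-provided data key (assumption for this sketch only).
const exampleDataKey = randomBytes(32);
const stored = encryptSecret("example-webhook-secret", exampleDataKey);
const recovered = decryptSecret(stored, exampleDataKey);
```

Because GCM is authenticated, a tampered secret_enc value fails at decryption time instead of silently producing a wrong signing secret.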
US-HOOK-013: SSRF Prevention for Webhook URLs
Title: As a security engineer, I want webhook delivery to be blocked from accessing internal network ranges so that a malicious customer cannot use the service for SSRF.
Description: Enforce NetworkPolicy blocking egress to private IP ranges and cloud metadata endpoint. Document behaviour: URL pointing to private range will fail at network layer.
Acceptance Criteria:
- NetworkPolicy blocks 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.169.254/32
- Test in staging: register webhook pointing to internal service → delivery fails at network level
- Security review sign-off obtained
Story Points: 2
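The block itself is enforced by the NetworkPolicy, not by application code. The helper below only illustrates which IPv4 addresses the listed CIDRs cover, for example when choosing targets for the staging test; the function names are hypothetical and IPv6 is out of scope for this sketch.

```typescript
// CIDR ranges taken from the acceptance criteria above.
const BLOCKED_CIDRS: string[] = [
  "10.0.0.0/8",
  "172.16.0.0/12",
  "192.168.0.0/16",
  "169.254.169.254/32",
];

// Converts dotted-quad IPv4 text to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".").map((p) => Number(p));
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  const [a, b, c, d] = parts;
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  // Build a mask with the top `bits` bits set; /0 needs special handling
  // because shifting a 32-bit value by 32 is a no-op in JavaScript.
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// True when the address falls inside any range the policy blocks.
function isBlockedAddress(ip: string): boolean {
  return BLOCKED_CIDRS.some((cidr) => inCidr(ip, cidr));
}
```

Note that 172.16.0.0/12 ends at 172.31.255.255, so an address such as 172.32.0.1 is outside the blocked set.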
US-HOOK-014: Rate Limiting on REST API
Title: As a security engineer, I want the webhook management REST API rate-limited so that automated abuse is prevented.
Description: Configure Kong rate-limit plugin at 60 requests/minute per account. Request size limited to 1 KB.
Acceptance Criteria:
- 61st request within 60 s returns HTTP 429 Too Many Requests
- Request body > 1 KB returns HTTP 413 Payload Too Large
- Rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) present in responses
Story Points: 1
EP-HOOK-05: Customer-Endpoint Circuit Breaker + Tenant-Portal Alerts
Description: When a tenant's webhook endpoint becomes persistently unhealthy, the dispatcher must protect itself (and the platform) by opening a circuit breaker per (tenantId, endpointId), surfacing the failure to the tenant in the customer portal, and pausing further attempts until the tenant takes action OR the breaker auto-recovers.
US-HOOK-015 — Per-endpoint circuit breaker state machine
Title: As the webhook-dispatcher, I want a CLOSED → OPEN → HALF_OPEN → CLOSED state machine per endpoint so that consistently-failing endpoints don't waste retry budget.
Acceptance Criteria:
- State stored in Redis hook:cb:{endpointId}; persisted snapshot to hook.endpoint_state every minute
- Trips to OPEN after 10 consecutive failures or failureRate > 50% over 100 attempts
- OPEN drops new attempts immediately; bookkeeping only
- After 5 min in OPEN → HALF_OPEN; allows 3 trial requests
- 3 trial successes → CLOSED; any failure → back to OPEN with doubling timeout (max 60 min)
- Metric hook_circuit_state{endpointId,state} gauge
- Unit tests cover all transitions
Story Points: 5
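An in-memory sketch of the transitions listed above, with time passed in explicitly so it can be unit-tested. It is deliberately incomplete: the failureRate > 50% trip condition, the cap of 3 in-flight trial requests, and the Redis/PostgreSQL persistence are omitted, and the class and method names are hypothetical.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

const BASE_OPEN_TIMEOUT_MS = 5 * 60_000; // 5 min
const MAX_OPEN_TIMEOUT_MS = 60 * 60_000; // 60 min
const FAILURES_TO_TRIP = 10;
const TRIAL_SUCCESSES_TO_CLOSE = 3;

class EndpointCircuit {
  private state: CircuitState = "CLOSED";
  private consecutiveFailures = 0;
  private trialSuccesses = 0;
  private openTimeoutMs = BASE_OPEN_TIMEOUT_MS;
  private openedAtMs = 0;

  getState(): CircuitState {
    return this.state;
  }

  // Returns true if an attempt may be sent at nowMs; moves OPEN to HALF_OPEN
  // once the current open timeout has elapsed.
  allow(nowMs: number): boolean {
    if (this.state === "OPEN" && nowMs - this.openedAtMs >= this.openTimeoutMs) {
      this.state = "HALF_OPEN";
      this.trialSuccesses = 0;
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    if (this.state === "HALF_OPEN") {
      this.trialSuccesses += 1;
      if (this.trialSuccesses >= TRIAL_SUCCESSES_TO_CLOSE) {
        this.state = "CLOSED";
        this.openTimeoutMs = BASE_OPEN_TIMEOUT_MS;
        this.consecutiveFailures = 0;
      }
    } else {
      this.consecutiveFailures = 0;
    }
  }

  recordFailure(nowMs: number): void {
    if (this.state === "HALF_OPEN") {
      // A failed trial reopens the circuit with a doubled, capped timeout.
      this.openTimeoutMs = Math.min(this.openTimeoutMs * 2, MAX_OPEN_TIMEOUT_MS);
      this.trip(nowMs);
    } else if (this.state === "CLOSED") {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= FAILURES_TO_TRIP) {
        this.trip(nowMs);
      }
    }
  }

  private trip(nowMs: number): void {
    this.state = "OPEN";
    this.openedAtMs = nowMs;
    this.consecutiveFailures = 0;
  }
}
```

Resetting the open timeout to 5 min on a successful close is an assumption; the criteria specify the doubling but not when it resets.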
US-HOOK-016 — Tenant portal alert on circuit OPEN
Title: As a customer, I want a banner in the customer portal when one of my webhook endpoints is in OPEN state so I can fix it.
Acceptance Criteria:
- hook.circuit.opened.v1 NATS event emitted on transition to OPEN
- notification-service consumes and creates portal banner + email digest
- Banner shows endpoint URL, last error, retry-after timestamp
- Banner persists until circuit returns to CLOSED
Story Points: 3
US-HOOK-017 — Manual circuit reset
Title: As a customer, I want to manually reset a webhook circuit after I've fixed my endpoint.
Acceptance Criteria:
- POST /v1/webhooks/{endpointId}/reset resets circuit to CLOSED with no waiting
- Audit event published; customer-portal action logged
- Rate-limited to 1/min per endpoint to prevent abuse
Story Points: 3
US-HOOK-018 — Dead-endpoint pruning
Title: As a platform operator, I want webhook endpoints OPEN for > 30 d auto-disabled so they don't accumulate.
Acceptance Criteria:
- Cron daily: endpoints OPEN > 30 d → status = DISABLED
- Tenant notified by email + portal banner
- Re-enable requires explicit tenant action
Story Points: 3
US-HOOK-019 — Per-endpoint failure analytics
Title: As an SRE, I want per-endpoint failure-rate analytics so I can spot tenant integration issues.
Acceptance Criteria:
- Dashboard panel: top-50 endpoints by failure rate (24h)
- Drill-down to endpoint shows: HTTP status code distribution, latency histogram, retry count
- Linked alert WebhookEndpointMassiveFailure when an endpoint fails 1000 attempts in 1 h
Story Points: 5
EP-HOOK-06: Per-Tenant Egress Pool, Back-Pressure, Rate-Limit Caps
Description: A national event (e.g., emergency broadcast or campaign storm) can produce a webhook stampede that saturates egress and looks like DDoS to customers. Per-tenant egress pools and back-pressure prevent one tenant from starving another or overwhelming a small endpoint.
US-HOOK-020 — Per-tenant worker pool
Title: As a platform engineer, I want each tenant to have a bounded concurrent-worker pool so one tenant's stampede doesn't drain platform-wide workers.
Acceptance Criteria:
- Worker pool size per tenant tier: TRIAL=10, STANDARD=50, ENTERPRISE=200
- Excess work queued in hook.tenant_queue with FIFO order
- Metrics hook_worker_active{tenantId}, hook_queue_depth{tenantId}
Story Points: 5
US-HOOK-021 — Per-endpoint outbound rate limit
Title: As a tenant, I want to declare a max RPS my endpoint can handle so the dispatcher paces itself.
Acceptance Criteria:
- hook.endpoints.maxRps field; default 50
- Dispatcher enforces using token bucket per endpoint (Redis)
- Excess work waits in queue (with TTL); hook_pacing_delayed_total counter
Story Points: 5
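A single-process sketch of the token-bucket pacing rule. The story calls for Redis-backed state shared across pods; this in-memory class only illustrates the refill arithmetic. The class name is hypothetical, and a burst capacity equal to maxRps is an assumption.

```typescript
class TokenBucket {
  private readonly maxRps: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(maxRps: number, nowMs: number) {
    this.maxRps = maxRps;
    this.tokens = maxRps; // start full: burst capacity assumed equal to maxRps
    this.lastRefillMs = nowMs;
  }

  // Returns true and consumes one token if a request may be sent at nowMs;
  // returns false when the caller should queue the work instead.
  tryAcquire(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.maxRps, this.tokens + elapsedSec * this.maxRps);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A false result maps to the "waits in queue" path, which is where hook_pacing_delayed_total would be incremented.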
US-HOOK-022 — Webhook payload deduplication
Title: As a customer, I want at-least-once delivery with dedup-key so retries don't cause my application to process the same event twice.
Acceptance Criteria:
- Header X-Ghasi-Idempotency-Key: {messageId}-{eventType} set on every POST
- Customer can dedup using this header
- Documented in developer portal
Story Points: 2
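A sketch of both sides of the dedup contract: how the key is formed, and how a receiving application might use it. The SeenKeys class is a hypothetical in-memory stand-in; a real consumer would persist keys with an expiry.

```typescript
// Builds the X-Ghasi-Idempotency-Key value in the documented format.
function idempotencyKey(messageId: string, eventType: string): string {
  return `${messageId}-${eventType}`;
}

// Customer-side illustration: remember keys already processed.
class SeenKeys {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is observed, false on any repeat.
  firstTime(key: string): boolean {
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}
```

Because the key depends only on the message and event type, every retry of the same delivery carries the same value, which is what makes receiver-side dedup possible under at-least-once delivery.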
US-HOOK-023 — Tenant-level outbound bytes/s budget
Title: As a finance stakeholder, I want each tenant's outbound webhook bytes/s capped so we can predict bandwidth cost.
Acceptance Criteria:
- Per-tier budget: TRIAL 1MB/s, STANDARD 10MB/s, ENTERPRISE 100MB/s
- Excess paced (not dropped); customer notified if sustained
- Metric hook_egress_bytes_total{tenantId} counter
Story Points: 3
EP-HOOK-07: Signing-Key Rotation with Dual-Sig Grace Period
Description: HMAC signing keys must rotate on a cadence (per-tenant max 365 d). During rotation, both old and new signatures are sent so that the customer's verifier doesn't break mid-flight.
US-HOOK-024 — Per-tenant signing-key versioning
Title: As a platform engineer, I want each tenant to have versioned HMAC keys so rotation is traceable.
Acceptance Criteria:
- hook.signing_keys (tenantId, keyId, secret HMAC-256, status ENUM(ACTIVE,RETIRING,RETIRED), createdAt)
- Always exactly one ACTIVE; up to one RETIRING
- Secret stored as KMS-encrypted blob; never returned by any API after creation
Story Points: 3
US-HOOK-025 — Dual-signature grace period
Title: As a customer, I want both old and new signatures during rotation so my verifier doesn't break.
Acceptance Criteria:
- During RETIRING window (default 7 d): outbound POST includes X-Ghasi-Signature: v1=<old>; v2=<new>
- After RETIRING expires: only new signature
- Customer-portal alert when retiring window starts and ends
Story Points: 5
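A sketch of building and checking the dual-signature header, assuming each value is the hex HMAC-SHA256 of the raw body and that v1 carries the retiring key's signature and v2 the active key's. Function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function hmacHex(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Header value sent during the RETIRING window: old signature, then new.
function dualSignatureHeader(
  rawBody: string,
  retiringSecret: string,
  activeSecret: string,
): string {
  return `v1=${hmacHex(rawBody, retiringSecret)}; v2=${hmacHex(rawBody, activeSecret)}`;
}

function safeEqual(a: string, b: string): boolean {
  const bufA = Buffer.from(a, "utf8");
  const bufB = Buffer.from(b, "utf8");
  return bufA.length === bufB.length && timingSafeEqual(bufA, bufB);
}

// Receiver-side check: accept if any signature in the header matches the
// secret the receiver currently holds, whether that is the old or new one.
function verifyWithKnownSecret(rawBody: string, header: string, knownSecret: string): boolean {
  const expected = hmacHex(rawBody, knownSecret);
  return header
    .split(";")
    .map((part) => part.trim())
    .map((part) => part.slice(part.indexOf("=") + 1))
    .some((value) => safeEqual(value, expected));
}
```

A verifier written this way keeps working before, during, and after the customer swaps in the new secret, which is the purpose of the grace period.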
US-HOOK-026 — Self-serve key rotation
Title: As a customer, I want to rotate my webhook signing key from the portal.
Acceptance Criteria:
- POST /v1/webhooks/signing-keys/rotate initiates rotation
- New key shown once in modal (not retrievable again)
- Old key enters RETIRING state
- Audit event webhook.signing_key.rotated.v1
Story Points: 3
EP-HOOK-08: mTLS-to-Customer Webhooks (optional, per-tenant)
Description: Enterprise customers may require Ghasi to present a client certificate (mTLS) on webhook delivery so they can pin Ghasi as the source. Pairs with EP-KONG-07 US-KONG-038.
US-HOOK-027 — Per-tenant mTLS toggle and platform CA exposure
Title: As an enterprise tenant, I want to enable mTLS on my webhook endpoint so I can reject any caller other than Ghasi.
Acceptance Criteria:
- hook.endpoints.mtlsEnabled boolean per endpoint
- Customer-portal page shows the platform CA chain + SPIFFE ID for the tenant to install in their TLS reverse proxy
- Egress proxy presents the platform-wide client SVID (rotated 1 h via SPIRE)
Story Points: 5
US-HOOK-028 — mTLS handshake failure handling
Title: As the dispatcher, I want mTLS handshake failures distinguished from HTTP errors so customers can debug TLS issues.
Acceptance Criteria:
- Failure logged as code: "TLS_HANDSHAKE_FAILED" in hook.delivery_logs
- Distinct counter hook_mtls_handshake_failed_total
- Customer-portal banner explains the likely cause (cert expired, wrong CA, SNI mismatch)
Story Points: 3