SMS Orchestrator — Jira-Ready Epics & User Stories

Status: populated | Owner: Platform Engineering | Last updated: 2026-04-18 | Service prefix: ORCH
Scope: New epics/stories covering HTTP submit migration (from retired api-gateway per ADR-0001), pipeline orchestration, idempotency, retry/DLQ, and observability.


Epic Summary

Epic ID | Title | Stories | Points
EP-ORCH-01 | HTTP Submit API (Kong-Fronted) | US-ORCH-001 – US-ORCH-006 | 34
EP-ORCH-02 | Outbound Pipeline Orchestration | US-ORCH-010 – US-ORCH-016 | 40
EP-ORCH-03 | Idempotency & Deduplication | US-ORCH-020 – US-ORCH-023 | 18
EP-ORCH-04 | Retry & Dead-Letter Handling | US-ORCH-030 – US-ORCH-034 | 22
EP-ORCH-05 | Observability & Readiness | US-ORCH-040 – US-ORCH-044 | 16

EP-ORCH-01 · HTTP Submit API (Kong-Fronted)

Context: Per ADR-0001 the retired custom api-gateway is replaced by Kong. HTTP submit responsibility moves to sms-orchestrator. This epic covers implementing the HTTP-facing submit endpoints that Kong proxies.

US-ORCH-001 · Implement POST /v1/sms/send endpoint

Type: Feature | Points: 5

Description:
As a Kong upstream target, I need a POST /v1/sms/send endpoint in sms-orchestrator that accepts a single outbound SMS request so that clients can submit messages through Kong.

Acceptance Criteria:

  • POST /v1/sms/send accepts { to, from, body, messageId?, metadata? } JSON payload
  • Reads X-Tenant-Id and X-Request-Id headers injected by Kong
  • Returns 202 Accepted with { messageId, status: "QUEUED", acceptedAt } on success
  • Returns 400 Bad Request with structured error body on Zod validation failure
  • Returns 409 Conflict on idempotency key collision (duplicate Idempotency-Key header within 48h)
  • messageId auto-generated as UUID v4 if not provided by client
  • Integration test: valid payload → 202 with messageId

US-ORCH-002 · Implement POST /v1/sms/bulk endpoint

Type: Feature | Points: 8

Description:
As a Kong upstream target, I need a POST /v1/sms/bulk endpoint accepting up to 1,000 SMS submissions per request so that clients can submit bulk campaigns efficiently.

Acceptance Criteria:

  • Accepts { messages: Array<{ to, from, body, messageId? }> } with max 1,000 items
  • Returns 202 Accepted with { batchId, accepted: N, rejected: N, results: [...] }
  • Each message in results includes messageId and status (QUEUED or INVALID)
  • Invalid messages within a batch are rejected individually; valid ones proceed
  • Returns 400 if all messages fail validation
  • Returns 413 if array exceeds 1,000 items
  • E2E test: 500 messages, mix of valid/invalid → correct accepted/rejected counts
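
The partial-acceptance rules above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the service's implementation: processBulk, isValid, and newId are hypothetical names, the validator is passed in as a stand-in for the Zod schema, and batchId is left out for brevity.

```typescript
// Hypothetical shapes; only the field names come from the acceptance criteria.
interface BulkMessage { to: string; from: string; body: string; messageId?: string }
interface BulkResult { messageId: string; status: "QUEUED" | "INVALID" }
interface BulkResponse {
  httpStatus: 202 | 400 | 413;
  accepted: number;
  rejected: number;
  results: BulkResult[];
}

const MAX_BATCH_SIZE = 1000;

function processBulk(
  messages: BulkMessage[],
  isValid: (m: BulkMessage) => boolean, // stand-in for the Zod schema check
  newId: () => string,                  // stand-in for UUID v4 generation
): BulkResponse {
  // Oversized batches are refused before any per-message work.
  if (messages.length > MAX_BATCH_SIZE) {
    return { httpStatus: 413, accepted: 0, rejected: 0, results: [] };
  }
  // Each message is judged on its own; one bad item does not sink the batch.
  const results = messages.map((m): BulkResult => ({
    messageId: m.messageId ?? newId(),
    status: isValid(m) ? "QUEUED" : "INVALID",
  }));
  const accepted = results.filter((r) => r.status === "QUEUED").length;
  const rejected = results.length - accepted;
  // 400 only when nothing in the batch survived validation.
  const httpStatus = accepted === 0 ? 400 : 202;
  return { httpStatus, accepted, rejected, results };
}
```

Keeping the decision logic free of I/O makes the 500-message mixed-batch case easy to cover in a unit test before the E2E test runs.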

US-ORCH-003 · Implement GET /v1/sms/{messageId} status endpoint

Type: Feature | Points: 3

Description:
As a client, I need to poll GET /v1/sms/{messageId} to check the current status of a submitted message.

Acceptance Criteria:

  • Returns { messageId, status, tenantId, to, from, createdAt, updatedAt } for known message
  • Returns 404 for unknown messageId
  • Returns 403 if X-Tenant-Id header does not match message's tenantId
  • Response time P95 ≤ 50 ms (PG indexed query on messageId)

US-ORCH-004 · Zod schema validation middleware

Type: Feature | Points: 5

Description:
As the submit pipeline, I need all incoming payloads validated against Zod schemas before processing so that malformed requests are rejected at the HTTP boundary.

Acceptance Criteria:

  • E.164 regex validation on to field; returns field-level error path on failure
  • from (sender ID): 1–11 chars alphanumeric or 1–15 digit numeric
  • body: 1–1600 characters; segment count computed and returned in 202 response
  • messageId: UUID v4 format when provided
  • Validation errors return { errors: [{ field, message, code }] } array
  • Unit tests for all validation rules including boundary values
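
The field rules above can be illustrated without any library. The sketch below is a hand-rolled stand-in for the Zod schemas, not the schemas themselves: the regular expressions, the error codes, and the validateSend name are assumptions, and segment counting is left out because the criteria do not state the encoding rules.

```typescript
interface FieldError { field: string; message: string; code: string }
interface SendPayload { to: string; from: string; body: string; messageId?: string }

// Assumed patterns; the criteria name the rules but not the exact regexes.
const E164 = /^\+[1-9]\d{1,14}$/;
const ALPHANUMERIC_SENDER = /^[A-Za-z0-9]{1,11}$/;
const NUMERIC_SENDER = /^\d{1,15}$/;
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function validateSend(p: SendPayload): FieldError[] {
  const errors: FieldError[] = [];
  if (!E164.test(p.to)) {
    errors.push({ field: "to", message: "must be an E.164 number", code: "INVALID_E164" });
  }
  // A sender ID passes if it satisfies either the alphanumeric or the numeric rule.
  if (!ALPHANUMERIC_SENDER.test(p.from) && !NUMERIC_SENDER.test(p.from)) {
    errors.push({ field: "from", message: "must be 1-11 alphanumeric or 1-15 digits", code: "INVALID_SENDER" });
  }
  if (p.body.length < 1 || p.body.length > 1600) {
    errors.push({ field: "body", message: "must be 1-1600 characters", code: "INVALID_BODY_LENGTH" });
  }
  // messageId is optional, but must be a UUID v4 when the client supplies it.
  if (p.messageId !== undefined && !UUID_V4.test(p.messageId)) {
    errors.push({ field: "messageId", message: "must be a UUID v4", code: "INVALID_UUID" });
  }
  return errors;
}
```

A 400 response would then wrap the returned array as { errors: [...] }, matching the error shape in the criteria.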

US-ORCH-005 · Idempotency-Key header processing (HTTP layer)

Type: Feature | Points: 8

Description:
As the HTTP submit layer, I need to process Idempotency-Key headers so that duplicate requests within 48 hours return the original response without reprocessing.

Acceptance Criteria:

  • On first request: compute sha256(tenantId + ":" + Idempotency-Key), store in Redis orch:submit-idem:{hash} with 48h TTL, value = serialized 202 response
  • On replay: return stored 202 response with Idempotency-Replayed: true header, skip pipeline
  • On Redis unavailable: process request normally (fail open) + emit warn log
  • Key collision scenario tested: two concurrent requests with same key → only one processed
  • SET NX EX used atomically
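
The first-request and replay paths above can be sketched as follows. This is an illustration only: FakeRedis is an in-memory stand-in that mimics SET NX EX semantics (it is not a Redis client), submitOnce and idemKey are hypothetical names, and the candidate 202 body is assumed to be built before the key is claimed so that it can be stored as the key's value.

```typescript
import { createHash } from "node:crypto";

// In-memory stand-in for Redis `SET key value NX EX ttl`; not a real client.
class FakeRedis {
  private store = new Map<string, { value: string; expiresAt: number }>();

  // Returns true when the key was set, false when a live key already exists.
  setNx(key: string, value: string, ttlSeconds: number, nowMs: number): boolean {
    const hit = this.store.get(key);
    if (hit !== undefined && hit.expiresAt > nowMs) return false;
    this.store.set(key, { value, expiresAt: nowMs + ttlSeconds * 1000 });
    return true;
  }

  get(key: string, nowMs: number): string | undefined {
    const hit = this.store.get(key);
    return hit !== undefined && hit.expiresAt > nowMs ? hit.value : undefined;
  }
}

const IDEM_TTL_SECONDS = 172800; // 48h

// sha256(tenantId + ":" + Idempotency-Key), prefixed with the documented key pattern.
function idemKey(tenantId: string, idempotencyKey: string): string {
  const hash = createHash("sha256").update(`${tenantId}:${idempotencyKey}`).digest("hex");
  return `orch:submit-idem:${hash}`;
}

interface SubmitOutcome { replayed: boolean; body: string }

function submitOnce(
  redis: FakeRedis,
  tenantId: string,
  idempotencyKey: string,
  candidateBody: string, // serialized 202 response for this request
  nowMs: number,
): SubmitOutcome {
  const key = idemKey(tenantId, idempotencyKey);
  if (redis.setNx(key, candidateBody, IDEM_TTL_SECONDS, nowMs)) {
    // This request won the claim; the caller goes on to enqueue the message.
    return { replayed: false, body: candidateBody };
  }
  // Replay or lost race: answer with what the first request stored, skip the pipeline.
  const stored = redis.get(key, nowMs);
  return { replayed: true, body: stored ?? candidateBody };
}
```

When replayed is true, the HTTP layer would add the Idempotency-Replayed: true header to the response.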

US-ORCH-006 · Kong route configuration for /v1/sms/* routes

Type: Configuration | Points: 5

Description:
As a platform operator, I need Kong routes for /v1/sms/send, /v1/sms/bulk, and /v1/sms/{messageId} pointing to sms-orchestrator so that client traffic reaches the correct upstream.

Acceptance Criteria:

  • Kong Service resource: sms-orchestrator, upstream http://sms-orchestrator:3001
  • Kong Route resources for all three paths with correct methods (POST, POST, GET)
  • jwt plugin applied (validates Bearer token from auth-service JWKS)
  • correlation-id plugin injects X-Request-Id
  • request-transformer plugin injects X-Tenant-Id from JWT sub claim
  • Configuration stored in services/api-gateway/kong/ declarative config
  • Integration test through Kong: 401 without token, 202 with valid token + payload

EP-ORCH-02 · Outbound Pipeline Orchestration

Context: Core NATS consumer pipeline: idempotency → validation → routing → operator publish → state persistence.

US-ORCH-010 · NATS consumer setup (sms.outbound.request)

Type: Feature | Points: 5

Description:
As the pipeline, I need a durable NATS JetStream consumer on sms.outbound.request so that submitted messages are processed reliably with at-least-once delivery.

Acceptance Criteria:

  • Durable consumer name: orch-consumer
  • AckExplicit mode — NATS message acknowledged only after pipeline completion
  • AckWait 30s; MaxDeliver 3 (application handles retries, not NATS)
  • Configurable MAX_CONCURRENCY (default 10 in-flight messages)
  • Reconnect on NATS disconnect without losing in-flight messages
  • Metrics: nats_consumer_lag, nats_messages_in_flight exposed on /metrics

US-ORCH-011 · Pipeline idempotency check (NATS layer)

Type: Feature | Points: 3

Description:
As the NATS consumer pipeline, I need to check Redis for a processed messageId before executing pipeline stages so that NATS redeliveries don't double-process messages.

Acceptance Criteria:

  • Key pattern: orch:idem:{messageId} checked with Redis GET
  • On key present: ACK NATS message, emit warn log with duplicate: true, return
  • On key absent: SET NX with 48h TTL before processing
  • On Redis unavailable: proceed with processing, emit warn

US-ORCH-012 · Domain validation (pipeline stage)

Type: Feature | Points: 3

Description:
As the pipeline, I need domain-level validation of the NATS message payload so that structurally invalid messages are terminated early without retrying.

Acceptance Criteria:

  • E.164 to validation, non-empty from, body length ≤ 1600 chars, valid UUID messageId, non-empty tenantId
  • On failure: update PG status to FAILED, publish sms.outbound.deadletter, ACK NATS, no retry
  • Segment count computed and stored in sms_messages.segment_count

US-ORCH-013 · gRPC routing stage (routing-engine integration)

Type: Feature | Points: 8

Description:
As the pipeline, I need to call the routing-engine via gRPC to select an operator for each message so that messages are dispatched to the correct SMPP connector.

Acceptance Criteria:

  • gRPC call: SelectOperator(tenantId, to, from, messageType, messageId) → { operatorId, operatorSubject }
  • NO_ROUTE_FOUND error → permanent failure: FAILED status + DLQ, no retry
  • Transient gRPC error (timeout, UNAVAILABLE) → triggers retry mechanism (EP-ORCH-04)
  • P95 gRPC call latency ≤ 50 ms (measured via span)
  • operatorId and routeId stored in sms_messages on success
  • Update PG status to ROUTING before gRPC call, ROUTED on success

US-ORCH-014 · Operator NATS publish stage

Type: Feature | Points: 5

Description:
As the pipeline, I need to publish the SMS payload to smpp.operator.{operatorId} after routing so that the SMPP connector receives the message for carrier submission.

Acceptance Criteria:

  • Published subject: smpp.operator.{operatorId} with SmppOutboundMessage schema
  • X-Correlation-ID NATS header set to messageId
  • Original messageId, tenantId, and routing metadata included in payload
  • Update PG status to SENT only after successful NATS publish ACK
  • On NATS publish failure: triggers retry mechanism

US-ORCH-015 · Domain event emission (sms.events.status)

Type: Feature | Points: 8

Description:
As downstream consumers (billing, webhooks), I need status change events published to sms.events.status after every PG write so that consumers can react to message lifecycle transitions.

Acceptance Criteria:

  • Payload: { messageId, tenantId, previousStatus, newStatus, timestamp, metadata? }
  • Published after PG commit, not before
  • Publish failure logged but does not fail the pipeline stage
  • All transitions emitted: QUEUED→ROUTING, ROUTING→ROUTED, ROUTED→SENT, *→FAILED, *→DEAD_LETTER, *→RETRY
  • TypeScript interface defined in event-schemas.ts
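
One possible shape for the event interface and the publish-after-commit rule is sketched below. The field names come from the payload bullet; the timestamp format, the persistThenEmit helper, and its callback parameters are assumptions made for illustration, and only the transitions listed above are encoded.

```typescript
type MessageStatus = "QUEUED" | "ROUTING" | "ROUTED" | "SENT" | "FAILED" | "DEAD_LETTER" | "RETRY";

interface StatusEvent {
  messageId: string;
  tenantId: string;
  previousStatus: MessageStatus;
  newStatus: MessageStatus;
  timestamp: string; // assumed to be ISO 8601
  metadata?: Record<string, unknown>;
}

// Forward transitions listed in the criteria.
const FORWARD: Partial<Record<MessageStatus, MessageStatus>> = {
  QUEUED: "ROUTING",
  ROUTING: "ROUTED",
  ROUTED: "SENT",
};
// Targets reachable from any status (the "*→X" entries).
const FROM_ANY: ReadonlySet<MessageStatus> = new Set<MessageStatus>(["FAILED", "DEAD_LETTER", "RETRY"]);

function isEmittedTransition(prev: MessageStatus, next: MessageStatus): boolean {
  return FROM_ANY.has(next) || FORWARD[prev] === next;
}

function persistThenEmit(
  event: StatusEvent,
  commit: () => void,                    // hypothetical PG write; throws on failure
  publish: (e: StatusEvent) => void,     // hypothetical NATS publish; throws on failure
  warn: (message: string) => void,
): void {
  commit(); // if this throws, nothing is published for the uncommitted write
  try {
    publish(event);
  } catch (err) {
    // Publish failure is logged but does not fail the pipeline stage.
    warn(`status event publish failed: ${String(err)}`);
  }
}
```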

US-ORCH-016 · Message state persistence (PostgreSQL)

Type: Feature | Points: 8

Description:
As the audit trail, I need all message state transitions written to orch.sms_messages atomically so that message history is reliable and queryable.

Acceptance Criteria:

  • INSERT on QUEUED (HTTP layer); UPDATE on all subsequent transitions
  • All status updates wrapped in PG transactions
  • status_updated_at updated on every transition
  • processed_at set on first SENT or DEAD_LETTER terminal transition
  • attempt_count incremented on each RETRY
  • last_error updated with failure reason on each failed attempt
  • PG partitioned by month (PARTITION BY RANGE (created_at)), 90-day retention policy

EP-ORCH-03 · Idempotency & Deduplication

US-ORCH-020 · Redis SET NX idempotency key creation

Type: Feature | Points: 3

Description:
As the pipeline, I need atomic SET NX operations for idempotency keys so that concurrent message deliveries don't result in double-processing in multi-replica deployments.

Acceptance Criteria:

  • SET NX EX 172800 used for all idempotency key writes
  • Key pattern orch:idem:{messageId} for pipeline-level; orch:submit-idem:{hash} for HTTP-level
  • Race condition test: two concurrent workers with the same messageId → only one proceeds
  • Redis MULTI/EXEC not required (SET NX is atomic)

US-ORCH-021 · Idempotency TTL and expiry behaviour

Type: Feature | Points: 3

Description:
As the platform, I need idempotency keys to expire after 48 hours so that Redis memory is bounded and replay protection windows are well-defined.

Acceptance Criteria:

  • TTL = 172800 seconds (48h) set at key creation
  • Expired keys allow re-processing (new submission treated as fresh request)
  • TTL visible in Redis key inspection (e.g. TTL orch:idem:{messageId})
  • Redis key count monitored in Prometheus: redis_key_count{prefix="orch:idem"}

US-ORCH-022 · Idempotency replay response for HTTP clients

Type: Feature | Points: 5

Description:
As an HTTP client, I need replayed requests to return the original 202 response body so that retrying clients receive consistent responses.

Acceptance Criteria:

  • Original 202 response body serialized and stored in Redis alongside idempotency key
  • Replay returns identical { messageId, status, acceptedAt } body
  • Response includes Idempotency-Replayed: true header
  • Storage overhead bounded: response body stored as compact JSON string

US-ORCH-023 · Idempotency Redis failover behaviour

Type: Feature | Points: 7

Description:
As the platform, I need graceful degradation when Redis is unavailable so that idempotency failures don't block message submission.

Acceptance Criteria:

  • Redis connection failure → warn log emitted with component: idempotency
  • Message processing continues (fail open) — no 503 to client
  • redis_idempotency_skip_total counter incremented on each skip
  • Alert rule: redis_idempotency_skip_total > 10 in 5m window → PagerDuty warning

EP-ORCH-04 · Retry & Dead-Letter Handling

US-ORCH-030 · Exponential backoff retry policy

Type: Feature | Points: 5

Description:
As the pipeline, I need transient failures to trigger an exponential backoff retry so that temporary operator or routing outages don't immediately dead-letter messages.

Acceptance Criteria:

  • Max 3 attempts (1 initial + 2 retries)
  • Delays: attempt 1 → 1s, attempt 2 → 2s, attempt 3 → 4s
  • attempt_count incremented in PG on each attempt
  • Status updated to RETRY with last_error populated
  • Retry timing enforced via NATS delayed NAK (nak(delay))
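
The delay schedule above follows a doubling pattern that can be written as a formula. The sketch below is one reading of the criteria, not a confirmed design: it treats each listed delay as the wait scheduled after that failed attempt and takes the cap from "Max 3 attempts", so under this reading the 4 s step is computed but only used if the cap is raised. backoffDelayMs and nextStep are hypothetical names.

```typescript
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;

// attempt 1 → 1000 ms, attempt 2 → 2000 ms, attempt 3 → 4000 ms
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

type NextStep = { action: "retry"; delayMs: number } | { action: "dead-letter" };

// Decide what to do after the given attempt has failed with a transient error.
function nextStep(failedAttempt: number): NextStep {
  if (failedAttempt >= MAX_ATTEMPTS) return { action: "dead-letter" };
  return { action: "retry", delayMs: backoffDelayMs(failedAttempt) };
}
```

In the pipeline, the delayMs value would be the argument passed to the delayed NAK.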

US-ORCH-031 · sms.outbound.retry event on each retry

Type: Feature | Points: 3

Description:
As downstream consumers and operations, I need a sms.outbound.retry event emitted on each retry attempt so that retry patterns are observable.

Acceptance Criteria:

  • Published to sms.outbound.retry subject with { messageId, tenantId, attemptNumber, failureReason, nextRetryAt }
  • TypeScript interface defined in event-schemas
  • Published before delayed NAK
  • Unit test: 3 consecutive failures → 3 retry events emitted

US-ORCH-032 · Dead-letter queue routing after max retries

Type: Feature | Points: 5

Description:
As the platform, I need exhausted messages routed to sms.outbound.deadletter so that no message is silently lost and dead-letter consumers can handle reprocessing or alerting.

Acceptance Criteria:

  • After 3 failed attempts: publish to sms.outbound.deadletter
  • DLQ payload: { messageId, tenantId, to, from, body, attemptCount, failureReason, failedAt }
  • PG status updated to DEAD_LETTER
  • Original NATS message ACK'd only after successful DLQ publish
  • DLQ publish failure retried up to 3 times independently before logging error

US-ORCH-033 · Permanent failure handling (validation + no-route)

Type: Feature | Points: 5

Description:
As the pipeline, I need certain failure types (invalid payload, no route found) to skip retries and go directly to DLQ so that unretryable messages don't consume retry budget.

Acceptance Criteria:

  • Validation failure → immediate FAILED + DLQ, attemptCount: 1
  • NO_ROUTE_FOUND gRPC error → immediate FAILED + DLQ, attemptCount: 1
  • Failure reason encoded as structured object { code: "VALIDATION_FAILED" | "NO_ROUTE" | ..., detail: string }
  • Unit tests for each permanent failure path
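
A sketch of how permanent and transient failures might be told apart is shown below. The source leaves the code union open-ended ("..."), so ROUTING_UNAVAILABLE and PUBLISH_FAILED are placeholder codes invented for this illustration, as are the classifyRoutingError and dispose names.

```typescript
// VALIDATION_FAILED and NO_ROUTE come from the criteria; the other two are placeholders.
type FailureCode = "VALIDATION_FAILED" | "NO_ROUTE" | "ROUTING_UNAVAILABLE" | "PUBLISH_FAILED";
interface FailureReason { code: FailureCode; detail: string }

const PERMANENT_CODES: ReadonlySet<FailureCode> = new Set<FailureCode>(["VALIDATION_FAILED", "NO_ROUTE"]);

// Map a routing-engine error name onto a structured failure reason.
function classifyRoutingError(grpcError: string, detail: string): FailureReason {
  return grpcError === "NO_ROUTE_FOUND"
    ? { code: "NO_ROUTE", detail }
    : { code: "ROUTING_UNAVAILABLE", detail };
}

type Disposition =
  | { kind: "permanent"; status: "FAILED"; deadLetter: true; attemptCount: 1; reason: FailureReason }
  | { kind: "transient"; reason: FailureReason };

function dispose(reason: FailureReason): Disposition {
  if (PERMANENT_CODES.has(reason.code)) {
    // Permanent failures skip the retry budget: FAILED + DLQ on the first attempt.
    return { kind: "permanent", status: "FAILED", deadLetter: true, attemptCount: 1, reason };
  }
  // Everything else is handed to the retry mechanism.
  return { kind: "transient", reason };
}
```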

US-ORCH-034 · Stuck ROUTED row reconciliation job

Type: Feature | Points: 4

Description:
As the platform, I need a periodic reconciliation job that detects messages stuck in ROUTED status (operator publish confirmed but crash before PG update) so that they can be recovered or alerting triggered.

Acceptance Criteria:

  • Cron job runs every 5 minutes
  • Queries orch.sms_messages WHERE status = 'ROUTED' AND status_updated_at < NOW() - INTERVAL '2 minutes'
  • Emits warn log per stuck message; increments orch_stuck_routed_total counter
  • Configurable threshold: STUCK_ROUTED_THRESHOLD_SECONDS env var (default 120)
  • Does NOT auto-retry (manual intervention or separate DLQ consumer)
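
The selection rule in the query above can be mirrored in memory for unit testing. This is only a sketch: MessageRow and findStuckRouted are hypothetical names, and the real job would run the SQL against orch.sms_messages rather than filter an array.

```typescript
interface MessageRow { messageId: string; status: string; statusUpdatedAt: Date }

// Mirrors: status = 'ROUTED' AND status_updated_at < NOW() - threshold
function findStuckRouted(rows: MessageRow[], now: Date, thresholdSeconds = 120): MessageRow[] {
  const cutoffMs = now.getTime() - thresholdSeconds * 1000;
  return rows.filter((r) => r.status === "ROUTED" && r.statusUpdatedAt.getTime() < cutoffMs);
}
```

Each returned row would produce one warn log and one increment of orch_stuck_routed_total; nothing is retried automatically.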

EP-ORCH-05 · Observability & Readiness

US-ORCH-040 · Health and readiness endpoints

Type: Feature | Points: 2

Description:
As Kubernetes, I need /health/live and /health/ready endpoints so that liveness and readiness probes work correctly.

Acceptance Criteria:

  • GET /health/live → 200 always if process is running
  • GET /health/ready → 200 only if NATS, PG, Redis, and routing-engine gRPC are reachable
  • GET /health/ready → 503 with { dependencies: { nats: "down", ... } } if any dependency unhealthy
  • Response time ≤ 200 ms when all dependencies are healthy; each dependency check times out after 1 s

US-ORCH-041 · Prometheus metrics endpoint

Type: Feature | Points: 3

Description:
As Prometheus, I need a /metrics endpoint in OpenMetrics format so that all pipeline metrics are scrapable.

Acceptance Criteria:

  • Metrics exposed: orch_messages_submitted_total, orch_pipeline_duration_seconds, orch_retry_total, orch_dlq_total, orch_idempotency_hit_total, nats_consumer_lag
  • All metrics labeled with tenant_id, status, operator_id where applicable
  • Histogram buckets for orch_pipeline_duration_seconds: 50ms, 100ms, 200ms, 500ms, 1s, 2s
  • /metrics endpoint not exposed via Kong (internal only)

US-ORCH-042 · Structured JSON logging

Type: Feature | Points: 3

Description:
As the operations team, I need all logs emitted as structured JSON so that Loki can index and query them efficiently.

Acceptance Criteria:

  • Log fields: level, timestamp, messageId, tenantId, traceId, spanId, service: "sms-orchestrator", action, durationMs
  • pino logger (or equivalent) configured with JSON transport
  • Sensitive fields redacted: body (SMS content) not logged at INFO; only at DEBUG with explicit opt-in
  • Log level configurable via LOG_LEVEL env var

US-ORCH-043 · OpenTelemetry trace propagation

Type: Feature | Points: 5

Description:
As the observability platform, I need trace context propagated from Kong through sms-orchestrator to downstream services so that end-to-end request traces are available in Grafana Tempo.

Acceptance Criteria:

  • W3C TraceContext header propagated from Kong X-Request-Id → OTel span
  • Spans created for each pipeline stage: validate, idempotency, route, publish, persist
  • gRPC call to routing-engine propagates trace context
  • NATS publish includes traceparent header
  • OTel exporter configured via OTEL_EXPORTER_OTLP_ENDPOINT env var

US-ORCH-044 · Kubernetes deployment manifest and HPA

Type: DevOps | Points: 3

Description:
As the platform, I need a Kubernetes Deployment, Service, and HPA for sms-orchestrator so that it scales horizontally under load.

Acceptance Criteria:

  • Deployment with at least 2 replicas (HPA minReplicas: 2); resource requests cpu: 100m, memory: 128Mi; limits cpu: 500m, memory: 512Mi
  • HPA scaling on nats_consumer_lag > 1000 (KEDA) or CPU > 70%
  • Service ClusterIP on port 3001
  • livenessProbe and readinessProbe wired to health endpoints
  • PodDisruptionBudget ensuring at least 1 pod available during rolling updates

EP-ORCH-06 · Priority-Lane Routing on Inbound Submit

Context: Per EP-PLAT-NB-06 (national priority lanes), every inbound submit must be tagged into one of P0 (emergency), P1 (OTP), P2 (transactional), P3 (marketing), P4 (broadcast). The orchestrator is the single point where lane assignment happens, and lane assignment determines NATS subject (lane.p0.* … lane.p4.*), TPS budget, compliance treatment, and SLA target.

US-ORCH-050 · Accept and validate X-Priority-Lane header

Type: Feature | Points: 5

Description: As a tenant, I need to optionally hint the priority lane via X-Priority-Lane: P0|P1|P2|P3|P4 so that traffic class is explicit. The orchestrator must validate the tenant is authorised for the requested lane.

Acceptance Criteria:

  • Header values constrained to P0..P4; invalid values return 400 with code: "INVALID_LANE".
  • Authorisation table auth.tenant_lane_grants (tenantId, allowedLanes[]); enforced in orchestrator. Unauthorised lane → 403 code: "LANE_NOT_GRANTED".
  • Default lane assignment when header absent: per tenant tier (tier=ENTERPRISE → P2, tier=STANDARD → P3, tier=TRIAL → P3).
  • P0 requests require both X-Priority-Lane: P0 and a valid government PKI signature in X-Gov-Signature (else 403).
  • Unit test covers: each lane × authorised vs. unauthorised matrix (10 cases minimum).
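
The lane resolution rules above could be sketched as a single decision function. Several details are assumptions for illustration: resolveLane and LaneRequest are hypothetical names, the grant list and signature check are passed in as already-evaluated inputs, the error code for a missing government signature is a placeholder (the criteria only say 403), and the default lane is returned without a grant check because the criteria do not say whether one applies.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";
type Tier = "ENTERPRISE" | "STANDARD" | "TRIAL";

const LANES: readonly Lane[] = ["P0", "P1", "P2", "P3", "P4"];
const DEFAULT_LANE: Record<Tier, Lane> = { ENTERPRISE: "P2", STANDARD: "P3", TRIAL: "P3" };

interface LaneRequest {
  header?: string;                 // raw X-Priority-Lane value, if present
  tier: Tier;
  allowedLanes: readonly Lane[];   // from auth.tenant_lane_grants
  govSignatureValid: boolean;      // result of verifying X-Gov-Signature
}

type LaneDecision =
  | { ok: true; lane: Lane }
  | { ok: false; status: 400 | 403; code: string };

function resolveLane(req: LaneRequest): LaneDecision {
  // No hint: fall back to the tenant tier default.
  if (req.header === undefined) return { ok: true, lane: DEFAULT_LANE[req.tier] };
  const lane = LANES.find((l) => l === req.header);
  if (lane === undefined) return { ok: false, status: 400, code: "INVALID_LANE" };
  if (!req.allowedLanes.includes(lane)) return { ok: false, status: 403, code: "LANE_NOT_GRANTED" };
  // P0 additionally needs a valid government signature; the code name is a placeholder.
  if (lane === "P0" && !req.govSignatureValid) {
    return { ok: false, status: 403, code: "GOV_SIGNATURE_REQUIRED" };
  }
  return { ok: true, lane };
}
```

A function of this shape makes the lane × authorised/unauthorised test matrix a simple table-driven test.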

US-ORCH-051 · Lane-aware NATS subject routing

Type: Feature | Points: 5

Description: As the orchestrator, I need to publish sms.outbound.request to a lane-specific subject so that downstream consumers (compliance, routing, smpp) can apply lane-aware processing.

Acceptance Criteria:

  • Subjects: lane.p0.outbound.request, lane.p1.outbound.request, …, lane.p4.outbound.request.
  • Subject choice driven by the resolved lane (US-ORCH-050).
  • Each lane subject has its own JetStream consumer with independent ack/lag metrics.
  • Metric orch_publish_lane_total{lane} increments per publish.

US-ORCH-052 · Per-lane back-pressure and 429-shaping

Type: Feature | Points: 5

Description: As the orchestrator, I need to shed inbound P3/P4 traffic before P0/P1 if NATS lag spikes so that high-priority traffic always wins.

Acceptance Criteria:

  • Redis gauge lane:lag:{P} updated every 5 s by a sidecar that reads JetStream consumer lag.
  • If lag.P0 > 0 (any backlog) for 30 s → reject all P3/P4 with 429 code: "LANE_SHED".
  • If lag.P1 > 1000 for 30 s → reject P4 with 429 code: "LANE_SHED".
  • Shed responses include Retry-After: 60.
  • Telemetry: orch_lane_shed_total{shedLane,protectedLane} counter.
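
The shedding rules above combine a sustained-lag check with a per-lane decision. The sketch below shows one way to express both; SustainedBreach and shedDecision are hypothetical names, and lag samples are assumed to arrive from the sidecar with a timestamp.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";

// Tracks whether lag has stayed above a threshold for at least holdMs.
class SustainedBreach {
  private since: number | undefined = undefined;
  private readonly threshold: number;
  private readonly holdMs: number;

  constructor(threshold: number, holdMs: number) {
    this.threshold = threshold;
    this.holdMs = holdMs;
  }

  // Feed one lag sample; returns true once the breach has lasted holdMs.
  observe(lag: number, nowMs: number): boolean {
    if (lag > this.threshold) {
      if (this.since === undefined) this.since = nowMs;
      return nowMs - this.since >= this.holdMs;
    }
    this.since = undefined; // any sample at or below the threshold resets the clock
    return false;
  }
}

type ShedDecision =
  | { shed: false }
  | { shed: true; status: 429; code: "LANE_SHED"; retryAfterSeconds: 60; protectedLane: Lane };

function shedDecision(lane: Lane, p0Breached: boolean, p1Breached: boolean): ShedDecision {
  // Sustained P0 backlog sheds both P3 and P4.
  if (p0Breached && (lane === "P3" || lane === "P4")) {
    return { shed: true, status: 429, code: "LANE_SHED", retryAfterSeconds: 60, protectedLane: "P0" };
  }
  // Sustained P1 backlog sheds P4 only.
  if (p1Breached && lane === "P4") {
    return { shed: true, status: 429, code: "LANE_SHED", retryAfterSeconds: 60, protectedLane: "P1" };
  }
  return { shed: false };
}
```

The lane and protectedLane values map directly onto the shedLane and protectedLane labels of orch_lane_shed_total.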

US-ORCH-053 · Lane carried in NATS payload and DB row

Type: Feature | Points: 3

Description: The lane assignment is a permanent property of the message and must be persisted on orch.sms_messages.lane and included in every downstream NATS payload so it can be enforced and reported on.

Acceptance Criteria:

  • Schema migration adds lane VARCHAR(2) NOT NULL DEFAULT 'P3' to orch.sms_messages.
  • All downstream NATS payloads (sms.outbound.request, compliance.evaluate.request, sms.dispatch.command) include lane field.
  • GET /v1/sms/{messageId} response includes lane in the payload.

US-ORCH-054 · Lane-SLO Prometheus instrumentation

Type: Feature | Points: 3

Description: As SRE, I need Prometheus histograms per lane so that lane-specific SLO alerts can be wired (P1 OTP submit→DLR P95 ≤ 3 s, P2 transactional ≤ 10 s, …).

Acceptance Criteria:

  • Histograms orch_submit_to_dlr_seconds{lane} and orch_submit_to_ack_seconds{lane} registered.
  • Buckets tuned per lane (P1 buckets 0.5–10 s, P3 buckets 5–300 s).
  • Recording rules emit per-lane SLO compliance percentages.
  • Alert OrchLaneSloBreach{lane=P1} fires when 5-min P95 > 3 s.

EP-ORCH-07 · Trusted-Tenant Fast-Path Submit (signed-template short-circuit)

Context: Per EP-PLAT-NB-08 and EP-CE-13 (trusted-tenant fast-path), pre-vetted regulated tenants (banks, ministries, healthcare) submit using a pre-approved signed template; the orchestrator verifies fingerprint and routes with compliance in shadow mode rather than blocking. This delivers OTP-class latency without losing compliance evidence.

US-ORCH-060 · Accept X-Template-Id and variable bindings

Type: Feature | Points: 5

Description: As a trusted tenant, I want to submit by template ID + variable bindings so that the message body is reconstructed server-side from a pre-approved template.

Acceptance Criteria:

  • POST /v1/sms/send accepts { to, from, templateId, variables } instead of body.
  • Template fetched from compliance.approved_templates (cached in Redis 5 min); 404 with code: "TEMPLATE_NOT_FOUND" if missing or revoked.
  • Variables substituted via Mustache; unsupplied variables → 400 with code: "TEMPLATE_VARIABLE_MISSING".
  • Resulting body length recomputed; segment count returned in 202 response.
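
Variable substitution with a missing-variable check can be illustrated with a minimal renderer. This is a hand-rolled stand-in, not the Mustache library: it handles only plain {{name}} tags, does no escaping, and the renderTemplate name and the missing list in the error are assumptions.

```typescript
type RenderResult =
  | { ok: true; body: string }
  | { ok: false; status: 400; code: "TEMPLATE_VARIABLE_MISSING"; missing: string[] };

function renderTemplate(template: string, variables: ReadonlyMap<string, string>): RenderResult {
  const missing: string[] = [];
  // Replace each {{name}} tag; remember any name the caller did not supply.
  const body = template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (_match: string, name: string): string => {
      const value = variables.get(name);
      if (value === undefined) {
        missing.push(name);
        return "";
      }
      return value;
    },
  );
  if (missing.length > 0) {
    return { ok: false, status: 400, code: "TEMPLATE_VARIABLE_MISSING", missing };
  }
  return { ok: true, body };
}
```

The rendered body is what the length and segment-count computation would then run against.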

US-ORCH-061 · Verify content fingerprint and tenant approval

Type: Feature | Points: 5

Description: As the orchestrator, I need to verify that the rendered body fingerprint matches the approved template hash and that the tenant is on the trusted-tenant allow-list for that template.

Acceptance Criteria:

  • Fingerprint = sha256(templateId || normalised(rendered_body)).
  • compliance.approved_templates.fingerprint_pattern is a regex (allows variable spans); orchestrator verifies the rendered body matches.
  • compliance.template_tenant_grants (templateId, tenantId, expiresAt); enforced.
  • Mismatch → fall back to full compliance evaluation; emit orch.fastpath.fallback.v1 event.

US-ORCH-062 · Compliance shadow-mode evaluation in fast path

Type: Feature | Points: 5

Description: Even when the fast path is taken, compliance must be evaluated in shadow mode so that any drift is detected without blocking delivery.

Acceptance Criteria:

  • Orchestrator publishes compliance.evaluate.shadow.v1 (non-blocking, fire-and-forget within ack budget).
  • Compliance verdict on shadow path is logged but does not alter delivery decision.
  • If shadow verdict ≠ ALLOW, alert ComplianceFastpathDrift{tenantId,templateId} fires.
  • 1-in-1000 sample of fast-path messages is re-evaluated in blocking mode for drift detection.

US-ORCH-063 · Fast-path metrics and audit

Type: Feature | Points: 3

Description: As an auditor, I need metrics distinguishing fast-path from full-compliance traffic so that the share of bypass is monitorable.

Acceptance Criteria:

  • Metric orch_submit_path_total{path="fastpath"|"full"} increments per submit.
  • orch.sms_messages.compliance_path column (FAST_PATH | FULL) populated.
  • Audit event orch.fastpath.taken.v1 emitted with tenantId, templateId, messageId, fingerprintMatched.
  • Grafana panel: fast-path share by tenant per hour.

US-ORCH-064 · Per-tenant fast-path kill-switch

Type: Feature | Points: 3

Description: As a security incident responder, I need to disable fast-path for a specific tenant or template within 30 s so that an abused fast-path can be revoked instantly.

Acceptance Criteria:

  • POST /v1/internal/orch/fastpath/disable accepts { tenantId?, templateId? } and pushes a Redis key with TTL 24 h.
  • Orchestrator checks the kill-switch on every submit before granting fast path.
  • Re-enable via DELETE or TTL expiry.
  • Audit event orch.fastpath.killed.v1 published.
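
The kill-switch check could be sketched with an in-memory map standing in for the Redis keys. Everything named here is hypothetical, including the key layout, and one semantic choice is assumed: when both tenantId and templateId are supplied, each is disabled independently.

```typescript
const KILL_TTL_MS = 24 * 60 * 60 * 1000; // 24 h

interface KillTarget { tenantId?: string; templateId?: string }

// Hypothetical key layout; the criteria do not name the Redis keys.
function killKeys(target: KillTarget): string[] {
  const keys: string[] = [];
  if (target.tenantId !== undefined) keys.push(`tenant:${target.tenantId}`);
  if (target.templateId !== undefined) keys.push(`template:${target.templateId}`);
  return keys;
}

class FastPathKillSwitch {
  private expiresAt = new Map<string, number>();

  disable(target: KillTarget, nowMs: number): void {
    for (const key of killKeys(target)) this.expiresAt.set(key, nowMs + KILL_TTL_MS);
  }

  // Explicit re-enable (the DELETE path); TTL expiry re-enables implicitly.
  enable(target: KillTarget): void {
    for (const key of killKeys(target)) this.expiresAt.delete(key);
  }

  // Checked on every submit before the fast path is granted.
  isAllowed(tenantId: string, templateId: string, nowMs: number): boolean {
    return !this.isKilled(`tenant:${tenantId}`, nowMs) && !this.isKilled(`template:${templateId}`, nowMs);
  }

  private isKilled(key: string, nowMs: number): boolean {
    const until = this.expiresAt.get(key);
    return until !== undefined && until > nowMs;
  }
}
```

When isAllowed returns false, the submit would fall back to full compliance evaluation rather than being rejected.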