
Admin Dashboard — Failure Modes

Status: populated | Owner: Platform Engineering (Frontend) | Last updated: 2026-04-18

1. Failure Taxonomy

| Mode | Detection | Dashboard Behavior | Recovery |
|---|---|---|---|
| Kong gateway down | 502/504 from fetch | Full-page error boundary; alert banner: "Backend unreachable" | Retry on next navigation |
| auth-service down | 502 on /v1/internal/auth/me | Login fails with "Authentication service unavailable" | Retry when service recovers |
| JWT expired (normal) | 401 from Kong | Middleware silently refreshes using __admin_refresh cookie | Transparent |
| JWT expired + refresh expired | 401 from refresh | Redirect to /login?reason=session_expired | Re-authenticate |
| analytics-service down | 502 on analytics endpoints | Dashboard shows stale data with "Metrics unavailable" banner; charts show last-known state | Polling continues; recovers automatically |
| operator-management-service down | 502 on /v1/internal/operators | /operators page shows error state card | Manual refresh |
| routing-engine down | 502 on routing endpoints | /routing page shows error state; DnD reorder disabled | Manual refresh |
| Polling failure (3 consecutive) | admin_poll_total{result="error"} counter | Toast: "Metrics refresh paused — backend error" | Polling resumes after 5 min with exponential backoff |
| Drag-and-drop rule reorder conflict | 409 from routing-engine | Toast: "Reorder failed — rule list was updated by another admin" + list re-fetches | Automatic re-sync |
| Rate limiting (429) | 429 from Kong | Toast: "Rate limit exceeded" | Wait and retry |
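
The two JWT rows describe a refresh-then-replay flow. A minimal sketch of that flow follows, assuming the request and the refresh call are injected as functions; `requestWithRefresh`, `AuthDeps`, and the response shape are hypothetical names for illustration, and the refresh endpoint itself is not specified in this document.

```typescript
// Minimal response shape; a real client would likely use the Fetch API's Response.
interface ApiResponse {
  status: number;
}

// Outcome after the refresh logic has run.
type AuthOutcome =
  | { kind: "response"; response: ApiResponse }
  | { kind: "redirect"; location: string };

// Hypothetical dependencies, injected so the flow can run without a network.
interface AuthDeps {
  send: () => Promise<ApiResponse>; // the original request, routed through Kong
  refresh: () => Promise<ApiResponse>; // refresh call; the browser attaches __admin_refresh
}

// On a 401, attempt one refresh and replay the request once. If the refresh
// itself returns 401, redirect to the login page with reason=session_expired.
async function requestWithRefresh(deps: AuthDeps): Promise<AuthOutcome> {
  const first = await deps.send();
  if (first.status !== 401) {
    return { kind: "response", response: first };
  }

  const refreshed = await deps.refresh();
  if (refreshed.status === 401) {
    return { kind: "redirect", location: "/login?reason=session_expired" };
  }
  if (refreshed.status >= 400) {
    // Assumption: other refresh failures (for example a 502) are surfaced unchanged.
    return { kind: "response", response: refreshed };
  }

  // Replay exactly once so a persistent 401 cannot cause a refresh loop.
  const second = await deps.send();
  return { kind: "response", response: second };
}
```

Replaying only once keeps the "transparent" case invisible to the admin while bounding the number of extra round trips to two.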

2. Partial Degradation Strategy

  • Each dashboard section (MetricsSummary, ThroughputChart, DeliveryBreakdown, TopOperatorsTable) is wrapped in an independent Suspense + ErrorBoundary.
  • A failure in analytics does not prevent the operator health section from rendering.
  • The system health page remains independent of dashboard polling.
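
The Suspense + ErrorBoundary wrappers isolate failures at render time. The same principle can be sketched at the data-loading level without any UI framework: each section's loader settles on its own, and a rejection becomes an error state for that section only. `loadSections` and `SectionState` are hypothetical names, and the loaders are stand-ins for whatever each section actually fetches.

```typescript
type SectionName =
  | "MetricsSummary"
  | "ThroughputChart"
  | "DeliveryBreakdown"
  | "TopOperatorsTable";

type SectionState<T> =
  | { status: "ready"; data: T }
  | { status: "error"; message: string };

// Run every loader independently. A failing loader yields an error state for
// its own section and never rejects the combined promise.
async function loadSections<T>(
  loaders: Record<SectionName, () => Promise<T>>,
): Promise<Record<SectionName, SectionState<T>>> {
  const names = Object.keys(loaders) as SectionName[];

  const entries = await Promise.all(
    names.map((name) =>
      // Promise.resolve().then(...) also captures loaders that throw synchronously.
      Promise.resolve()
        .then(() => loaders[name]())
        .then(
          (data): [SectionName, SectionState<T>] => [name, { status: "ready", data }],
          (err: unknown): [SectionName, SectionState<T>] => [
            name,
            { status: "error", message: err instanceof Error ? err.message : String(err) },
          ],
        ),
    ),
  );

  const result = {} as Record<SectionName, SectionState<T>>;
  entries.forEach(([name, state]) => {
    result[name] = state;
  });
  return result;
}
```

With this shape, an analytics failure in one loader leaves the other three sections in the "ready" state, which mirrors the behavior described in the bullets above.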

3. Concurrent Admin Edit Conflict

When multiple admins edit the same operator or routing rule simultaneously:

  • Optimistic updates are not used for operator or routing rule mutations.
  • All mutations are request-response: the admin waits for a 200/204 before the UI updates.
  • On 409 Conflict, the dashboard shows an error toast: "Please reload the list to see the latest state."
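
The request-response rule above can be sketched as a helper that only adopts the new value after a 200/204. `saveWithoutOptimism` and its types are hypothetical; the toast text for statuses other than 409 is an assumption, since this document only specifies the 409 message.

```typescript
interface MutationResponse {
  status: number;
}

type MutationResult =
  | { applied: true }
  | { applied: false; toast: string };

// Non-optimistic mutation: local state changes only after the server confirms.
async function saveWithoutOptimism<T>(
  current: T,
  next: T,
  send: (value: T) => Promise<MutationResponse>,
): Promise<{ state: T; result: MutationResult }> {
  const res = await send(next);

  if (res.status === 200 || res.status === 204) {
    return { state: next, result: { applied: true } };
  }

  if (res.status === 409) {
    // Another admin changed the resource first; keep the current state.
    return {
      state: current,
      result: { applied: false, toast: "Please reload the list to see the latest state." },
    };
  }

  // Assumption: any other failure keeps the current state and shows a generic message.
  return {
    state: current,
    result: { applied: false, toast: `Request failed with status ${res.status}` },
  };
}
```

Because the UI never shows an unconfirmed value, a 409 requires no rollback: the previous state is simply kept.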

4. SMPP Operator Deletion with Active Routing Rules

The backend (operator-management-service) returns a 422 if an operator is referenced by active routing rules. The dashboard surfaces this as:

"This operator is referenced by [N] active routing rules. Update the routing rules before deleting."

The dashboard links directly to the /routing page.
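
A minimal sketch of how the 422 response could be turned into that message and link. The `activeRuleCount` field is a hypothetical name: this document does not specify how the backend reports the rule count, so the response shape here is an assumption.

```typescript
interface DeleteOperatorResponse {
  status: number;
  // Hypothetical field; the actual 422 body format is not documented here.
  activeRuleCount: number;
}

interface DeleteBlockedNotice {
  message: string;
  linkHref: string;
}

// Returns a notice for a blocked deletion, or null when the response is not a 422.
function describeDeleteFailure(res: DeleteOperatorResponse): DeleteBlockedNotice | null {
  if (res.status !== 422) {
    return null;
  }
  return {
    message:
      `This operator is referenced by ${res.activeRuleCount} active routing rules. ` +
      "Update the routing rules before deleting.",
    linkHref: "/routing",
  };
}
```

For example, `describeDeleteFailure({ status: 422, activeRuleCount: 3 })` yields the message with `3` substituted for `[N]` and a link target of `/routing`.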

5. Known Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| 30s polling lag for metrics | Alert detection delayed up to 30s | SSE-based push planned for post-MVP |
| No optimistic updates on operator CRUD | Slightly slower UX for create/edit | Acceptable for low-frequency admin operations |
| Single Cloudflare Access zone | If Cloudflare Access is down, admin login is blocked | Emergency bypass via VPN + direct cluster access (documented in runbook) |
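
The 30s polling interval, combined with the "3 consecutive" failure rule from the failure taxonomy, can be sketched as a small state transition. This is one possible reading, not the actual implementation: the sketch assumes the first pause lasts 5 minutes, that each further pause without an intervening success doubles, and that the error count restarts after a pause. `nextPoll` and `PollState` are hypothetical names.

```typescript
const POLL_INTERVAL_MS = 30_000; // 30s polling interval
const FAILURES_BEFORE_PAUSE = 3; // "3 consecutive" polling failures
const BASE_PAUSE_MS = 5 * 60_000; // polling resumes after 5 min

interface PollState {
  consecutiveErrors: number;
  pauses: number; // pauses taken since the last successful poll
}

interface PollStep {
  state: PollState;
  delayMs: number; // wait before the next poll attempt
  paused: boolean; // true when the "refresh paused" toast should be shown
}

const initialPollState: PollState = { consecutiveErrors: 0, pauses: 0 };

// Compute the next state and delay from the outcome of one poll.
function nextPoll(state: PollState, ok: boolean): PollStep {
  if (ok) {
    // Any success clears both the error count and the backoff level.
    return { state: initialPollState, delayMs: POLL_INTERVAL_MS, paused: false };
  }

  const errors = state.consecutiveErrors + 1;
  if (errors < FAILURES_BEFORE_PAUSE) {
    return {
      state: { consecutiveErrors: errors, pauses: state.pauses },
      delayMs: POLL_INTERVAL_MS,
      paused: false,
    };
  }

  // Assumption: exponential backoff means 5 min, 10 min, 20 min, ... per pause.
  const delayMs = BASE_PAUSE_MS * 2 ** state.pauses;
  return {
    state: { consecutiveErrors: 0, pauses: state.pauses + 1 },
    delayMs,
    paused: true,
  };
}
```

Keeping the transition pure makes the pause thresholds easy to unit test separately from timers and network calls.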