Admin Dashboard — Failure Modes
Status: populated | Owner: Platform Engineering (Frontend) | Last updated: 2026-04-18
1. Failure Taxonomy
| Mode | Detection | Dashboard Behavior | Recovery |
|---|---|---|---|
| Kong gateway down | 502/504 from fetch | Full-page error boundary; alert banner: "Backend unreachable" | Retry on next navigation |
| auth-service down | 502 on /v1/internal/auth/me | Login fails with "Authentication service unavailable" | Retry when service recovers |
| JWT expired (normal) | 401 from Kong | Middleware silently refreshes using __admin_refresh cookie | Transparent |
| JWT expired + refresh expired | 401 from refresh | Redirect to /login?reason=session_expired | Re-authenticate |
| analytics-service down | 502 on analytics endpoints | Dashboard shows stale data with "Metrics unavailable" banner; charts show last-known state | Polling continues; recovers automatically |
| operator-management-service down | 502 on /v1/internal/operators | /operators page shows error state card | Manual refresh |
| routing-engine down | 502 on routing endpoints | /routing page shows error state; DnD reorder disabled | Manual refresh |
| Polling failure (3 consecutive) | admin_poll_total{result="error"} counter | Toast: "Metrics refresh paused — backend error" | Polling resumes after 5 min with exponential backoff |
| Drag-and-drop rule reorder conflict | 409 from routing-engine | Toast: "Reorder failed — rule list was updated by another admin" + list re-fetches | Automatic re-sync |
| Rate limiting (429) | 429 from Kong | Toast: "Rate limit exceeded" | Wait and retry |
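
The detection column above can be read as a small classification step over failed responses. The following is a minimal sketch of that idea, not the dashboard's actual code. The `FailureMode` names, the `isRefreshCall` flag, and the `/v1/internal/analytics` and `/v1/internal/routing` prefixes are assumptions made for illustration; only `/v1/internal/auth/me` and `/v1/internal/operators` appear in the table. A 502 cannot always be attributed reliably to one service, so the path-based branch is a heuristic.

```typescript
// Illustrative failure-mode names; these identifiers are hypothetical.
type FailureMode =
  | "gateway_down"
  | "auth_service_down"
  | "jwt_expired"
  | "session_expired"
  | "analytics_down"
  | "operators_down"
  | "routing_down"
  | "reorder_conflict"
  | "rate_limited"
  | "unknown";

interface FailedResponse {
  status: number;
  path: string; // request path, e.g. "/v1/internal/operators"
  isRefreshCall?: boolean; // assumed flag: the failing request was the token refresh
}

// Assumed prefixes: the table only says "analytics endpoints" and "routing endpoints".
const ANALYTICS_PREFIX = "/v1/internal/analytics";
const ROUTING_PREFIX = "/v1/internal/routing";

function classifyFailure(res: FailedResponse): FailureMode {
  const { status, path } = res;
  // 401 on a normal call triggers a silent refresh; 401 on the refresh ends the session.
  if (status === 401) return res.isRefreshCall ? "session_expired" : "jwt_expired";
  if (status === 429) return "rate_limited";
  if (status === 409 && path.startsWith(ROUTING_PREFIX)) return "reorder_conflict";
  // The table lists 504 only for the gateway itself.
  if (status === 504) return "gateway_down";
  if (status === 502) {
    // Heuristic: attribute the 502 to the service behind the requested path.
    if (path.startsWith("/v1/internal/auth/")) return "auth_service_down";
    if (path.startsWith("/v1/internal/operators")) return "operators_down";
    if (path.startsWith(ANALYTICS_PREFIX)) return "analytics_down";
    if (path.startsWith(ROUTING_PREFIX)) return "routing_down";
    return "gateway_down";
  }
  return "unknown";
}
```

Keeping the classification in one pure function would let each page choose its own behavior (error boundary, banner, toast) from a single shared result.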
2. Partial Degradation Strategy
- Each dashboard section (MetricsSummary, ThroughputChart, DeliveryBreakdown, TopOperatorsTable) is wrapped in an independent Suspense + ErrorBoundary.
- A failure in analytics does not prevent the operator health section from rendering.
- System health page remains independent of dashboard polling.
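
The isolation property described above can be illustrated without React. The sketch below is a framework-free analogy, not the Suspense + ErrorBoundary implementation itself: each section's render function runs inside its own try/catch, so a throwing section produces a fallback while the others still render. The `renderSections` helper, the `SectionResult` type, and the fallback text are hypothetical.

```typescript
type SectionResult =
  | { ok: true; content: string }
  | { ok: false; fallback: string };

// Runs each section's render function in isolation, so that one throwing
// section yields a fallback instead of aborting the remaining sections.
function renderSections(
  sections: Record<string, () => string>
): Record<string, SectionResult> {
  const out: Record<string, SectionResult> = {};
  for (const [name, render] of Object.entries(sections)) {
    try {
      out[name] = { ok: true, content: render() };
    } catch {
      // Hypothetical fallback text; the real boundary would render an error card.
      out[name] = { ok: false, fallback: `${name} unavailable` };
    }
  }
  return out;
}

// Example: analytics-backed sections fail, the operators table still renders.
const rendered = renderSections({
  MetricsSummary: () => {
    throw new Error("502 from analytics");
  },
  TopOperatorsTable: () => "operators table",
});
```

In `rendered`, the `MetricsSummary` entry carries a fallback while `TopOperatorsTable` carries its content, which is the per-section behavior the error boundaries are meant to provide.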
3. Concurrent Admin Edit Conflict
When multiple admins edit the same operator or routing rule simultaneously:
- Optimistic updates are not used for operator or routing rule mutations.
- All mutations are request-response: the admin waits for a 200/204 before the UI updates.
- On 409 Conflict, the dashboard shows an error toast: "Please reload the list to see the latest state."
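
The request-response rule above can be sketched as a pure state transition: the list changes only once the server has answered with 200/204, and a 409 leaves it untouched. This is an illustrative sketch under assumptions. The `Rule` and `ListState` shapes, the `applyMutationResult` name, and the generic failure toast are hypothetical; only the 409 toast text comes from this document.

```typescript
interface Rule {
  id: string;
  priority: number;
}

interface ListState {
  rules: Rule[];
  toast: string | null;
}

// Applies the outcome of a completed mutation request to the list state.
// Nothing is applied before the response arrives (no optimistic update).
function applyMutationResult(
  state: ListState,
  status: number,
  updated?: Rule
): ListState {
  if (status === 200 || status === 204) {
    // A 204 carries no body, so the list is left as-is here; a real client
    // would typically re-fetch it.
    const rules = updated
      ? state.rules.map((r) => (r.id === updated.id ? updated : r))
      : state.rules;
    return { rules, toast: null };
  }
  if (status === 409) {
    // Conflict: keep the current list and ask the admin to reload.
    return {
      rules: state.rules,
      toast: "Please reload the list to see the latest state.",
    };
  }
  // Hypothetical generic message for any other failure.
  return { rules: state.rules, toast: `Request failed (${status})` };
}
```

Because the function never touches the list on a non-2xx status, a rejected edit cannot leave the UI showing a state the backend never accepted.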
4. SMPP Operator Deletion with Active Routing Rules
The backend (operator-management-service) returns a 422 if an operator is referenced by active routing rules. The dashboard surfaces this as:
"This operator is referenced by [N] active routing rules. Update the routing rules before deleting."
The dashboard links directly to the /routing page.
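
One way the 422 handling could look is sketched below. The shape of the 422 response body is not specified in this document, so the `active_rule_count` field, the `DeleteOutcome` type, the `handleDeleteResponse` name, and the generic failure message are all assumptions. The blocked-deletion message and the /routing link are taken from the text above.

```typescript
// Assumed 422 body shape; the source only states that the count [N] is shown.
interface DeleteBlockedBody {
  active_rule_count: number;
}

interface DeleteOutcome {
  deleted: boolean;
  message: string | null;
  link: string | null; // where the admin is sent to resolve the blocker
}

function handleDeleteResponse(
  status: number,
  body?: DeleteBlockedBody
): DeleteOutcome {
  if (status === 200 || status === 204) {
    return { deleted: true, message: null, link: null };
  }
  if (status === 422 && body) {
    return {
      deleted: false,
      message:
        `This operator is referenced by ${body.active_rule_count} active routing rules. ` +
        "Update the routing rules before deleting.",
      link: "/routing",
    };
  }
  // Hypothetical generic message for any other failure.
  return { deleted: false, message: "Delete failed", link: null };
}
```

Returning the link alongside the message keeps the navigation target next to the condition that produces it, so the toast or dialog component only has to render what it receives.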
5. Known Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| 30s polling lag for metrics | Alert detection delayed up to 30s | SSE-based push planned for post-MVP |
| No optimistic updates on operator CRUD | Slightly slower UX for create/edit | Acceptable for low-frequency admin operations |
| Single Cloudflare Access zone | If Cloudflare Access is down, admin login is blocked | Emergency bypass via VPN + direct cluster access (documented in runbook) |