OBSERVABILITY — staff-service
Sibling: DEPLOYMENT_TOPOLOGY · FAILURE_MODES · SECURITY_MODEL
Strategic anchors: 02 §12 Resilience · 02 §14 Observability · standards/SERVICE_TEMPLATE
staff-service exposes the canonical platform telemetry stack (OpenTelemetry → Cloud Trace / Cloud Logging / Cloud Monitoring; Pub/Sub → BigQuery for events). The dashboards and SLOs below are operationally focused: capacity-signal freshness, clock-in correctness, and outbox health are the three things that absolutely must alert.
1. SLOs
| SLO | Target | Window | Error budget |
|---|---|---|---|
| Punch availability (POST /clock/punch 2xx) | 99.95 % | 30 d | 21.6 min |
| Punch latency p95 | < 400 ms | 30 d | n/a (latency) |
| Capacity freshness | ≤ 60 s end-to-end (event-to-cache) | 30 d | 1 % > 60 s |
| Outbox publish lag p99 | < 30 s | 30 d | 0.1 % > 30 s |
| Inbox processing lag p99 | < 60 s | 30 d | 0.1 % > 60 s |
| Schedule grid availability (GET /shifts 2xx) | 99.9 % | 30 d | 43.2 min |
| DSAR export turnaround | < 48 h | 90 d | n/a (compliance) |
Error budget exhaustion triggers the freeze policy in DOD §Error budget.
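The minute-based error budgets in the table follow directly from the target and the window. The sketch below reproduces that arithmetic; the function names are illustrative and not part of the service.

```typescript
// Allowed "bad" minutes for an availability target over a rolling window.
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  const allowedFailureFraction = (100 - targetPercent) / 100;
  return windowMinutes * allowedFailureFraction;
}

// Fraction of the budget already spent, given observed bad minutes.
function budgetConsumed(
  badMinutes: number,
  targetPercent: number,
  windowDays: number,
): number {
  return badMinutes / errorBudgetMinutes(targetPercent, windowDays);
}

// Punch availability: 99.95 % over 30 d, about 21.6 minutes.
const punchBudgetMinutes = errorBudgetMinutes(99.95, 30);
// Schedule grid availability: 99.9 % over 30 d, about 43.2 minutes.
const gridBudgetMinutes = errorBudgetMinutes(99.9, 30);
```

A value of 1.0 or more from `budgetConsumed` corresponds to the exhaustion condition that triggers the freeze policy.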
2. SLIs (Cloud Monitoring)
| SLI name | Source | Notes |
|---|---|---|
| staff.api.punch.success_rate | API HTTP metrics, code in [200, 299] | Excludes intentional 4xx (PIN failures) |
| staff.api.punch.latency.p95 | API HTTP metrics | Per-region |
| staff.api.shifts.success_rate | API HTTP metrics | |
| staff.outbox.depth | Custom gauge, staff.outbox count where published_at IS NULL | Sampled every 15 s |
| staff.outbox.publish.latency | Outbox relay span timing | |
| staff.inbox.lag | Pub/Sub oldest_unacked_message_age | Per subscription |
| staff.capacity.cache.hit_rate | Custom counter | Target ≥ 90 % |
| staff.capacity.event_to_cache_ms | Trace span | Punch-event publish to cache invalidate |
| staff.pin.failure_rate.per_property | Custom counter | Excessive → security alert |
| staff.shift.staffing_gap_count | Counter from staffing_gap_detected.v1 | Trended in dashboard, not alerted (informational) |
| staff.iam.revoke.failure_count | Counter | Alert if > 0 in 5 min window |
| staff.ai.shift_suggestion.applied_rate | Counter | Health of AI value loop |
| staff.ai.edge.anomaly_rate | Per-property gauge from edge telemetry | Drift signal |
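One plausible reading of the punch success-rate SLI is that intentional PIN rejections are dropped from the denominator, so they neither help nor hurt the ratio. The sketch below assumes that reading. The `pinFailure` flag is hypothetical: how the service actually marks an intentional 4xx is not specified here.

```typescript
interface PunchRequest {
  status: number;      // HTTP status code of the response
  pinFailure: boolean; // hypothetical marker for an intentional PIN rejection
}

// Success rate = 2xx responses / requests that count toward the SLI.
// Returns null when no request counts, so callers can skip the data point.
function punchSuccessRate(requests: PunchRequest[]): number | null {
  const counted = requests.filter(
    (r) => !(r.pinFailure && r.status >= 400 && r.status < 500),
  );
  if (counted.length === 0) return null;
  const ok = counted.filter((r) => r.status >= 200 && r.status <= 299).length;
  return ok / counted.length;
}
```

Under this definition a burst of wrong PINs does not burn the availability budget; it shows up in staff.pin.failure_rate.per_property instead.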
3. Dashboards
3.1 staff/overview (Cloud Monitoring)
- Punch RPS, success rate, latency p50/p95/p99
- Schedule grid RPS + latency
- Capacity freshness (event_to_cache_ms p95 + cache hit rate)
- Active shift count by property
- DB CPU, connections, slow queries
3.2 staff/clock-in
- Punches per minute, by source (electron_pin, electron_jwt, mobile_jwt, web_jwt, manager_override, offline_replay, system_auto)
- PIN failure rate per property + per device
- Manager-override count per actor (top 10)
- Multi-device-collision count (rare, alert at > 0)
- Offline-replay age distribution
3.3 staff/outbox
- Outbox depth (gauge)
- Publish latency p95
- Failed publish count (DLQ candidates)
- Per-topic publish rate
3.4 staff/inbox
- Per-consumed-subject lag p99
- DLQ count (alert at > 0)
- Reprocessing count (replays)
3.5 staff/leave
- Leave requests by status, type, day
- Approval latency p95 (request → decided)
- Force-unassigned count per approval
3.6 staff/ai
- Shift suggestion: viewed, applied, rejected (counts)
- Forecast call count + degraded rate
- Edge anomaly rate, model version distribution
4. Tracing
OpenTelemetry instrumentation:
- HTTP server spans (http.method, http.route, http.status_code, tenant.id, property.id, actor.id).
- DB spans (db.system=postgresql, db.statement.template, parameter cardinality only, no values).
- Pub/Sub spans (publish + subscribe; correlation via traceparent carried in event envelope metadata.traceparent).
- Outbound HTTP to iam-service, property-service, ai-orchestrator-service.
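To illustrate the Pub/Sub correlation item, the sketch below formats and parses a W3C traceparent value as it might be carried in metadata.traceparent. The envelope shape and helper names are assumptions for illustration. Only the `00-<trace-id>-<span-id>-<flags>` layout comes from the W3C Trace Context specification.

```typescript
// Hypothetical envelope shape; only metadata.traceparent is named in this document.
interface EventEnvelope<T> {
  type: string;
  payload: T;
  metadata: { traceparent?: string };
}

interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns null for malformed values so the consumer can start a fresh trace.
function parseTraceparent(value: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(value);
  if (m === null) return null;
  const traceId = m[1];
  const spanId = m[2];
  const flags = parseInt(m[3], 16);
  // All-zero ids are invalid per the W3C specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (flags & 0x01) === 0x01 };
}

// Publisher side: stamp the current span context onto the outgoing event.
function withTrace<T>(type: string, payload: T, ctx: TraceContext): EventEnvelope<T> {
  return { type, payload, metadata: { traceparent: formatTraceparent(ctx) } };
}
```

In practice the OpenTelemetry SDK's propagator would normally perform this injection and extraction; the helpers only show what ends up on the wire.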
Trace exemplars are linked from latency dashboards. Sampling: 100 % for 5xx and 4xx∈{401,403,409,422,423,429}; 5 % head-based for the rest; 1 % adaptive for high-volume clock.punch (always sampled when status non-2xx).
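The sampling policy can be read as a lookup from route and status to a keep probability. The sketch below encodes it as a pure function, assuming the high-volume punch route is /clock/punch. It is not the service's sampler: a real head-based sampler decides before the status is known, so the status-dependent rules would need tail-based or deferred handling.

```typescript
// 4xx statuses that are always kept, per the policy above.
const ALWAYS_SAMPLED_4XX = new Set<number>([401, 403, 409, 422, 423, 429]);

// Probability of keeping a trace for a finished request.
function sampleRate(route: string, status: number): number {
  if (status >= 500 || ALWAYS_SAMPLED_4XX.has(status)) return 1.0;
  if (route === "/clock/punch") {
    // High-volume route: 1 % for successes, everything else is kept.
    return status >= 200 && status <= 299 ? 0.01 : 1.0;
  }
  return 0.05;
}

// `random` is injectable so the decision is deterministic in tests.
function shouldSample(
  route: string,
  status: number,
  random: () => number = Math.random,
): boolean {
  return random() < sampleRate(route, status);
}
```

The fixed 0.01 stands in for the adaptive rate, which would presumably vary with load.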
5. Logging
JSON-structured, single-line, fields:
| Field | Source |
|---|---|
| timestamp | RFC 3339 |
| severity | INFO / WARN / ERROR / CRITICAL |
| service | 'staff-service' |
| tenantId | from app.tenant_id |
| actorId | JWT subject |
| propertyId | header |
| route | HTTP route template |
| method | HTTP |
| status | HTTP |
| latencyMs | request duration |
| traceId | W3C |
| spanId | W3C |
| event | one of request.complete, outbox.publish.ok, outbox.publish.fail, inbox.process.ok, inbox.process.fail, pin.fail, pin.lock, clock.override, iam.revoke.fail, audit.write, ai.degraded, migration.run |
| metadata | event-specific |
PII-clean: emails, phone numbers, emergency contacts, and PIN values are never logged. A CI lint rule (@ghasi/eslint-pii) flags string templates containing forbidden field names.
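A minimal sketch of a formatter that emits the fields above as one JSON line. The runtime scrub of metadata keys is an added assumption: this document only describes the build-time lint rule, and the deny-list here is hypothetical.

```typescript
type Severity = "INFO" | "WARN" | "ERROR" | "CRITICAL";

interface LogFields {
  severity: Severity;
  event: string;
  tenantId?: string;
  actorId?: string;
  propertyId?: string;
  route?: string;
  method?: string;
  status?: number;
  latencyMs?: number;
  traceId?: string;
  spanId?: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical deny-list; the authoritative names live in the lint rule's config.
const FORBIDDEN_KEYS = new Set<string>(["email", "phone", "emergencyContact", "pin"]);

// Drops top-level metadata keys that match the deny-list.
function scrub(metadata: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (!FORBIDDEN_KEYS.has(key)) clean[key] = value;
  }
  return clean;
}

// JSON.stringify without an indent argument always yields a single line.
function formatLogLine(fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({
    timestamp: now.toISOString(), // ISO 8601 in UTC, which is valid RFC 3339
    service: "staff-service",
    ...fields,
    metadata: scrub(fields.metadata ?? {}),
  });
}
```

The scrub only inspects top-level keys; nested objects would need a recursive walk.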
6. Alerts
Pager-grade alerts route to PagerDuty service staff-service; warn-grade route to Slack #staff-svc-warnings.
| Alert | Severity | Trigger | Runbook |
|---|---|---|---|
| Punch error rate > 0.5 % for 5 min | P1 | SLO burn 14× normal | runbooks/staff/clock-error-spike.md |
| Outbox depth > 1000 sustained 5 min | P1 | Publishers stuck | runbooks/staff/outbox-stalled.md |
| Outbox publish lag p99 > 60 s for 10 min | P2 | Pub/Sub or DB pressure | runbooks/staff/outbox-stalled.md |
| Inbox lag p99 > 5 min for 5 min | P1 | Consumer backed up | runbooks/staff/inbox-stalled.md |
| Inbox DLQ depth > 0 | P2 | Schema mismatch or consumer bug | runbooks/staff/inbox-dlq.md |
| iam.revoke.failure_count > 0 in 5 min | P1 | Termination cascade incomplete | runbooks/staff/iam-revoke-failed.md |
| multi_device_punch_detected > 0 in 1 h | P2 | Reconciliation needed | runbooks/staff/multi-device-punch.md |
| PIN failure rate > 1 / s sustained 2 min for any property | P2 | Possible brute-force | runbooks/staff/pin-bruteforce.md |
| Capacity freshness p95 > 120 s for 10 min | P2 | Cache invalidation stuck | runbooks/staff/capacity-stale.md |
| Edge anomaly model drift (FPR > 12 % weekly) | P3 | Model needs retraining | runbooks/staff/edge-anomaly-drift.md |
| DSAR export overdue > 48 h | P2 | Compliance breach risk | runbooks/staff/dsar-overdue.md |
| Migration failed in prod | P1 | Flyway non-zero exit | runbooks/staff/migration-failure.md |
Each alert page includes the dashboard panel link and the runbook link. Runbooks live in runbooks/staff/ in the docs repo and ship with the service.
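Several triggers above use a "threshold sustained for N minutes" condition. Cloud Monitoring alert policies evaluate this natively. The sketch below only illustrates the semantics, using the outbox-depth alert (> 1000 for 5 min, sampled every 15 s) as the example. All names are illustrative.

```typescript
interface Sample {
  atMs: number;  // sample timestamp, epoch milliseconds
  value: number; // gauge reading, e.g. outbox depth
}

// True when the window is fully covered by samples and every one exceeds the threshold.
function sustainedAbove(
  samples: Sample[],
  threshold: number,
  durationMs: number,
  intervalMs: number,
  nowMs: number,
): boolean {
  const inWindow = samples.filter(
    (s) => s.atMs > nowMs - durationMs && s.atMs <= nowMs,
  );
  // Requiring the expected sample count avoids firing on a nearly empty window.
  const expected = Math.floor(durationMs / intervalMs);
  return inWindow.length >= expected && inWindow.every((s) => s.value > threshold);
}

// Outbox depth > 1000 sustained 5 min, with the 15 s sampling cadence.
function outboxStalled(samples: Sample[], nowMs: number): boolean {
  return sustainedAbove(samples, 1000, 5 * 60 * 1000, 15 * 1000, nowMs);
}
```

A single reading at or below the threshold resets the condition, which is what separates a sustained stall from a short burst.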
7. Health Endpoints
| Endpoint | Returns |
|---|---|
| GET /health/startup | 200 once Flyway migrations complete and JWKS is loaded (deadline 30 s) |
| GET /health/live | 200 if process can serve any request (no deps) |
| GET /health/ready | 200 if Postgres reachable + Redis reachable + KMS reachable + Pub/Sub publisher healthy + outbox depth < 5000 |
/health/ready returns a JSON body with per-dep status for Cloud Run probes and on-call diagnosis.
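The readiness rule can be sketched as a pure aggregation over dependency checks. The body shape, the field names, and the 503 status for the not-ready case are assumptions; this document only fixes the conditions for returning 200.

```typescript
type DepName = "postgres" | "redis" | "kms" | "pubsubPublisher";
type CheckStatus = "ok" | "fail";

interface ReadinessInput {
  deps: Record<DepName, boolean>; // true when the dependency probe succeeded
  outboxDepth: number;            // unpublished outbox rows
}

interface ReadinessResult {
  httpStatus: 200 | 503; // 503 for not-ready is an assumption
  body: {
    status: "ready" | "not_ready";
    checks: Record<string, CheckStatus>;
    outboxDepth: number;
  };
}

const OUTBOX_DEPTH_LIMIT = 5000;

function readiness(input: ReadinessInput): ReadinessResult {
  const checks: Record<string, CheckStatus> = {};
  for (const [name, healthy] of Object.entries(input.deps)) {
    checks[name] = healthy ? "ok" : "fail";
  }
  // Ready requires depth strictly below the limit.
  checks["outboxDepth"] = input.outboxDepth < OUTBOX_DEPTH_LIMIT ? "ok" : "fail";

  const ready = Object.values(checks).every((c) => c === "ok");
  return {
    httpStatus: ready ? 200 : 503,
    body: {
      status: ready ? "ready" : "not_ready",
      checks,
      outboxDepth: input.outboxDepth,
    },
  };
}
```

Keeping the per-dependency map in the body lets on-call see which check failed without opening logs.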
8. Edge Telemetry (Anomaly Model Drift)
The Electron edge anomaly model (AI_INTEGRATION §4) emits a small heartbeat to bff-backoffice-service containing:
- Aggregate count of scores in three buckets (< 0.3, 0.3..0.7, ≥ 0.7)
- Model version
- Property + device
staff-service aggregates these heartbeats as staff.ai.edge.anomaly_rate. A weekly batch job recomputes the false-positive rate against ground truth (did a manager override actually happen for that punch?) and alerts if FPR > 12 %.
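A sketch of the weekly FPR check under two stated assumptions: a punch counts as flagged when its score falls in the top bucket (≥ 0.7), and FPR is flagged punches without an override divided by all punches without an override. The actual job may define either differently, and the type and function names are hypothetical.

```typescript
interface ScoredPunch {
  anomalyScore: number;     // edge model score in [0, 1]
  managerOverride: boolean; // ground truth: an override actually happened
}

const FLAG_THRESHOLD = 0.7;       // assumption: top bucket means "flagged"
const FPR_ALERT_THRESHOLD = 0.12; // 12 % weekly threshold

// FPR = false positives / all ground-truth negatives; null when there are no negatives.
function falsePositiveRate(punches: ScoredPunch[]): number | null {
  const negatives = punches.filter((p) => !p.managerOverride);
  if (negatives.length === 0) return null;
  const falsePositives = negatives.filter(
    (p) => p.anomalyScore >= FLAG_THRESHOLD,
  ).length;
  return falsePositives / negatives.length;
}

function driftAlert(punches: ScoredPunch[]): boolean {
  const fpr = falsePositiveRate(punches);
  return fpr !== null && fpr > FPR_ALERT_THRESHOLD;
}
```

This needs per-punch scores joined to override records. The heartbeat's bucket counts alone are not sufficient, so the batch job presumably reads from a richer source.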
9. Audit Log Visibility
- Audit rows are queryable via bff-backoffice-service admin endpoints (capability staff.audit.read).
- BigQuery cold export: melmastoon.staff.audit_events table; partitioned daily, clustered by tenant_id.
- Suspicious-action analytics dashboards are owned by security-service, not by us.
10. Cost Telemetry
| Counter | Use |
|---|---|
| Pub/Sub publish bytes per topic | Pub/Sub cost attribution |
| BigQuery export rows per day | Cold storage cost |
| Cloud SQL row growth per table | DB sizing forecast |
| AI orchestrator calls per surface | AI cost attribution (cross-checked with ai-orchestrator) |
| Egress bytes per BFF | Network cost |
A monthly tenant cost report is computed from BigQuery and surfaced via tenant-service (we do not bill).