Observability

:::info Source
Sourced from services/sync-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

Events: sync.push.received|completed, sync.pull.served, sync.ack.received, sync.mutation.applied|conflicted|rejected, sync.delta.projected, sync.conflict.resolved, sync.cursor.advanced|out_of_range, sync.device.stale|disconnected, sync.full_resync.initiated.

Attrs: device_id_hash, scope, mutation_count, delta_count, lamport, conflict_policy.
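The event names and attributes above map naturally onto one JSON object per log line. The following is a minimal sketch, not the service's actual logging code: the logger name `sync`, the helpers `hash_device_id`, `build_event` and `log_event`, the 16-character hash truncation, and the example attribute values are all assumptions made for illustration.

```python
import hashlib
import json
import logging
import sys

logger = logging.getLogger("sync")  # hypothetical logger name


def hash_device_id(device_id: str) -> str:
    """Return a truncated SHA-256 hex digest so raw device IDs never reach the logs."""
    return hashlib.sha256(device_id.encode("utf-8")).hexdigest()[:16]


def build_event(event: str, device_id: str, **attrs) -> dict:
    """Assemble one structured record: event name, hashed device ID, extra attrs."""
    record = {"event": event, "device_id_hash": hash_device_id(device_id)}
    record.update(attrs)
    return record


def log_event(event: str, device_id: str, **attrs) -> None:
    """Emit the record as a single JSON line at INFO level."""
    logger.info(json.dumps(build_event(event, device_id, **attrs), sort_keys=True))


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    # Attribute values below are made up for the example.
    log_event(
        "sync.push.completed",
        "device-123",
        scope="workspace",
        mutation_count=12,
        lamport=4071,
    )
```

Hashing in one helper keeps the raw identifier out of every call site, which is presumably the intent behind logging `device_id_hash` rather than the ID itself.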

2. Metrics

RED

sync_api_requests_total{endpoint,status} counter
sync_api_duration_seconds{endpoint} histogram

Domain

sync_push_mutations_total{status=applied|conflicted|rejected} counter
sync_push_batch_size histogram
sync_pull_deltas_total{scope} counter
sync_pull_duration_seconds{scope} histogram
sync_conflicts_total{entity_type,policy} counter
sync_conflicts_pending gauge
sync_conflicts_resolution_duration_seconds histogram
sync_cursor_out_of_range_total counter (full-resync triggers)
sync_delta_projection_lag_seconds gauge (event → delta materialized)
sync_devices_by_status{status=healthy|stale|disconnected} gauge
sync_device_mutation_backlog histogram

USE

sync_outbox_lag_seconds gauge
sync_deltas_table_rows{scope} gauge
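The metric names above follow Prometheus-style conventions (`_total` counters, label sets in braces). As a rough illustration of how a labelled counter such as `sync_push_mutations_total{status}` behaves, here is a tiny in-memory stand-in that renders the Prometheus text exposition format. The `Counter` class is hypothetical and written for this example only; a real service would use its metrics client library instead.

```python
from collections import defaultdict


class Counter:
    """Tiny in-memory labelled counter (a stand-in for a real metrics client)."""

    def __init__(self, name: str, label_names: tuple):
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(float)

    def _key(self, labels: dict) -> tuple:
        # Reject missing or unexpected labels so series stay consistent.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        return tuple(labels[n] for n in self.label_names)

    def inc(self, amount: float = 1.0, **labels) -> None:
        """Add a non-negative amount to the series identified by the labels."""
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[self._key(labels)] += amount

    def value(self, **labels) -> float:
        """Current value of one series, or 0.0 if it was never incremented."""
        return self._values.get(self._key(labels), 0.0)

    def render(self) -> str:
        """Render all series in Prometheus text exposition format."""
        lines = [f"# TYPE {self.name} counter"]
        for key in sorted(self._values):
            labels = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{labels}}} {self._values[key]:g}")
        return "\n".join(lines)


# Example: one push batch of 100 mutations (counts are made up).
push_mutations = Counter("sync_push_mutations_total", ("status",))
push_mutations.inc(97, status="applied")
push_mutations.inc(2, status="conflicted")
push_mutations.inc(1, status="rejected")
```

Calling `push_mutations.render()` yields one line per `status` value, which is the shape a rejection-rate or conflict-rate panel would divide over.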

3. Traces

Spans: sync.push.batch, sync.push.mutation.route{service}, sync.pull.query, sync.delta.project, sync.conflict.detect, sync.conflict.resolve.

4. Dashboards

Push: throughput, batch size, rejection rate, conflict rate.
Pull: throughput, delta volume, cursor age distribution.
Conflicts: pending count, resolution rate, policy distribution.
Device health: healthy/stale/disconnected distribution, offline duration histogram.
Delta projection lag.
Full-resync frequency (should be rare).

5. Alerts

Alert	Threshold	Severity
conflict-rate-spike	> 5% of pushes	P2
delta-projection-lag	> 60s p99	P2
full-resync-spike	> 100/day	P2
mutation-rejection-spike	> 3%	P2
stale-devices	> 10% of active devices	P3
outbox-lag	> 30s p99	P2
push-failure	> 1%	P2
conflict-backlog	> 1000 pending	P3
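To make the thresholds concrete, the table above can be sketched as data plus a single comparison. This is an illustrative sketch only: `AlertRule`, `RULES` and `firing` are hypothetical names, percentages are expressed as fractions, and the evaluation window for each rule is not specified in the table, so it is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float  # the rule fires when the observed value is strictly greater
    severity: str


# Thresholds copied from the alert table; ratios are fractions (5% -> 0.05),
# lags are seconds, and counts are absolute.
RULES = [
    AlertRule("conflict-rate-spike", 0.05, "P2"),
    AlertRule("delta-projection-lag", 60.0, "P2"),
    AlertRule("full-resync-spike", 100.0, "P2"),
    AlertRule("mutation-rejection-spike", 0.03, "P2"),
    AlertRule("stale-devices", 0.10, "P3"),
    AlertRule("outbox-lag", 30.0, "P2"),
    AlertRule("push-failure", 0.01, "P2"),
    AlertRule("conflict-backlog", 1000.0, "P3"),
]


def firing(observed: dict) -> list:
    """Return (name, severity) for every rule whose observed value exceeds its threshold.

    Rules with no observation are skipped rather than treated as firing.
    """
    return [
        (rule.name, rule.severity)
        for rule in RULES
        if rule.name in observed and observed[rule.name] > rule.threshold
    ]


# Made-up snapshot: projection lag p99 of 75 s and 1200 pending conflicts.
snapshot = {"delta-projection-lag": 75.0, "conflict-backlog": 1200, "push-failure": 0.002}
alerts = firing(snapshot)
```

With this snapshot, `alerts` contains the `delta-projection-lag` (P2) and `conflict-backlog` (P3) entries, while `push-failure` stays quiet at 0.2%.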

6. SLOs

SLI	Target
Push p95	< 500ms (100-mutation batch)
Pull p95	< 300ms (500-entity delta)
Delta projection lag p99	< 30s
Conflict auto-resolution rate	≥ 95%
Full-resync rate	< 0.1% of active devices/week
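As a worked illustration of how the latency and rate targets above could be checked against raw samples, here is a minimal sketch using a nearest-rank percentile. The function names are hypothetical, and a production setup would normally compute percentiles from histogram buckets in the metrics backend rather than from raw samples.

```python
import math


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def push_p95_ok(latencies_ms) -> bool:
    """Push SLO: p95 below 500 ms (measured on 100-mutation batches)."""
    return percentile(latencies_ms, 95) < 500


def pull_p95_ok(latencies_ms) -> bool:
    """Pull SLO: p95 below 300 ms (measured on 500-entity deltas)."""
    return percentile(latencies_ms, 95) < 300


def auto_resolution_ok(auto_resolved: int, total_conflicts: int) -> bool:
    """Conflict auto-resolution SLO: at least 95% resolved without manual intervention."""
    return total_conflicts == 0 or auto_resolved / total_conflicts >= 0.95


def full_resync_ok(resynced_devices: int, active_devices: int) -> bool:
    """Full-resync SLO: fewer than 0.1% of active devices per week."""
    return active_devices == 0 or resynced_devices / active_devices < 0.001


# Made-up sample: 19 fast pushes and one slow outlier still meet the p95 target.
push_samples_ms = [120] * 19 + [900]
```

With 20 samples the 95th percentile is the 19th value in sorted order, so a single 900 ms outlier does not breach the push target, but two would.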

6a. EP-20 multi-device NFRs (product ↔ operations)

Epic EP-20 (docs/07-epics-and-user-stories.md) states end-to-end propagation windows (e.g. < 30 s for cursor/conflict visibility while online, ≤ 60 s for wipe/revocation once a device is next online). These are customer-facing SLOs: they are validated in staging and production rather than asserted as fixed wall-clock limits in every local dev run.

How we evidence them

Story need	Primary signals / dashboards	Notes
US-100 / US-101: < 30 s	`sync_outbox_lag_seconds` (p99), `sync_pull_duration_seconds`, `sync_conflicts_resolution_duration_seconds`, `delta-projection-lag` alert (fires at > 60s p99, see §5)	End-to-end latency includes the client pull and the UI; the server-side share is bounded by outbox lag + pull duration + projection lag.
US-102 / US-103: ≤ 60 s after next online	Same as above, plus `sync_bundle_revoked_total` and device reconnect sync	“After next online” is client-dependent; the server guarantees only that the wipe/revocation appears in pull responses once it has been projected.

Release criteria: Grafana (or equivalent) panels show p99 within targets for the windows above; alert policies in §5 are green during soak. Local integration tests assert correctness of pull/wipe/revoke paths; latency SLOs are operational.
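The note that the server-side share is bounded by outbox, pull and projection suggests a simple budget check during soak. The sketch below adds the three per-stage p99 values and compares the total with the 30 s window. It is a heuristic under stated assumptions: the sum of per-stage p99s is not the p99 of end-to-end latency, and the function name, field names and constant are hypothetical.

```python
SERVER_WINDOW_SECONDS = 30.0  # US-100 / US-101 window from EP-20


def server_side_budget(
    outbox_lag_p99: float,
    projection_lag_p99: float,
    pull_duration_p99: float,
    window_seconds: float = SERVER_WINDOW_SECONDS,
) -> dict:
    """Sum the per-stage p99 values (seconds) and report headroom against the window.

    This is a rough budget check, not a measured end-to-end percentile.
    """
    used = outbox_lag_p99 + projection_lag_p99 + pull_duration_p99
    return {
        "used_seconds": used,
        "headroom_seconds": window_seconds - used,
        "within_window": used < window_seconds,
    }


# Made-up soak readings: 4 s outbox lag, 12.5 s projection lag, 0.25 s pull.
report = server_side_budget(4.0, 12.5, 0.25)
```

Whatever headroom remains is what the client pull interval and UI have left to stay inside the customer-facing window.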

7. Device Telemetry

Offline devices buffer metrics.
On reconnect, sync pushes device metrics (offline duration, local mutation count, app crashes).
Correlate these device metrics with the sync health dashboard.
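The buffer-then-push behaviour described above can be sketched on the client side as follows. This is an illustrative sketch, not the actual client implementation: `DeviceTelemetryBuffer`, its method names and the payload field names are assumptions, and the clock is injectable only so the example can be tested deterministically.

```python
import time


class DeviceTelemetryBuffer:
    """Accumulates device metrics while offline and drains them on reconnect."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._offline_since = None
        self._counts = {"local_mutation_count": 0, "app_crashes": 0}

    def mark_offline(self) -> None:
        """Record when connectivity was lost; repeated calls keep the first timestamp."""
        if self._offline_since is None:
            self._offline_since = self._clock()

    def record_mutation(self) -> None:
        self._counts["local_mutation_count"] += 1

    def record_crash(self) -> None:
        self._counts["app_crashes"] += 1

    def drain_on_reconnect(self) -> dict:
        """Build the payload to attach to the next sync push, then reset the buffer."""
        if self._offline_since is None:
            offline = 0.0
        else:
            offline = self._clock() - self._offline_since
        payload = {"offline_duration_seconds": offline, **self._counts}
        self._offline_since = None
        self._counts = {name: 0 for name in self._counts}
        return payload
```

A device would call `mark_offline()` when connectivity drops, record mutations and crashes locally, and send the result of `drain_on_reconnect()` with its first push after coming back online.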
