Observability
:::info Source
Sourced from services/media-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Events: media.upload.initiated|completed|failed, media.scan.*, media.transcode.*, media.caption.*, media.transcript.*, media.ai.*, media.quarantine.*, media.delete.*.
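
As an illustration of how these events might be emitted, here is a minimal sketch using only Python's standard `logging` module to write one JSON object per line. The logger name, the `log_event` helper, and the field names (`media_id`, `tenant_id`, `size_bytes`, `reason`) are assumptions made for this example; the source specifies only the event names.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonEventFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),  # e.g. "media.upload.completed"
        }
        # Hypothetical structured fields, passed through `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("media-service")  # assumed logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonEventFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_event(logger: logging.Logger, event: str, level: int = logging.INFO, **fields) -> None:
    """Emit one event; keyword arguments become top-level JSON fields."""
    logger.log(level, event, extra={"fields": fields})


buffer = io.StringIO()
logger = build_logger(buffer)
log_event(logger, "media.upload.completed", media_id="m_123", tenant_id="t_42", size_bytes=1048576)
log_event(logger, "media.upload.failed", level=logging.ERROR, media_id="m_124", reason="checksum_mismatch")
```

Keeping the event name in a dedicated `event` field lets a log pipeline filter on prefixes such as `media.scan.` without parsing free-form messages.
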
2. Metrics
RED

| Metric | Type |
|---|---|
| `media_api_requests_total{endpoint,status}` | counter |
| `media_api_duration_seconds{endpoint}` | histogram |

Domain

| Metric | Type |
|---|---|
| `media_uploads_total{kind}` | counter |
| `media_uploads_bytes_total` | counter |
| `media_scan_duration_seconds` | histogram |
| `media_scan_result_total{result}` | counter (clean/quarantined) |
| `media_transcode_duration_seconds{profile}` | histogram |
| `media_transcode_success_rate` | gauge |
| `media_ai_image_total{model}` | counter |
| `media_ai_audio_total{model}` | counter |
| `media_caption_duration_seconds` | histogram |
| `media_storage_bytes{tenant_id,class=hot\|cold}` | gauge |
| `media_cdn_cache_hit_ratio` | gauge |

Cost

| Metric | Type |
|---|---|
| `media_ai_cost_micro_usd_total{tenant_id}` | counter |
| `media_storage_cost_estimate{tenant_id}` | gauge |
| `media_egress_bytes{tenant_id}` | counter |

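
The `micro_usd` unit in the AI cost counter suggests that costs are accumulated as integer millionths of a dollar, which would avoid floating-point drift across many small charges. The sketch below shows that idea under this assumption. `usd_to_micro` and `AiCostCounter` are hypothetical names, not part of the service, and the per-call price is invented for the example.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

MICRO_PER_USD = 1_000_000


def usd_to_micro(usd: str) -> int:
    """Convert a decimal USD amount (given as a string) to integer micro-USD."""
    micro = Decimal(usd) * MICRO_PER_USD
    return int(micro.to_integral_value(rounding=ROUND_HALF_UP))


class AiCostCounter:
    """In-memory stand-in for a per-tenant, monotonically increasing counter."""

    def __init__(self) -> None:
        self._totals = defaultdict(int)

    def add(self, tenant_id: str, usd: str) -> None:
        micro = usd_to_micro(usd)
        if micro < 0:
            raise ValueError("counters can only increase")
        self._totals[tenant_id] += micro

    def total_micro_usd(self, tenant_id: str) -> int:
        return self._totals.get(tenant_id, 0)


counter = AiCostCounter()
for _ in range(3):
    counter.add("t_42", "0.0004")  # hypothetical price per AI call
total = counter.total_micro_usd("t_42")  # integer micro-USD for tenant t_42
```

Dividing the total by `MICRO_PER_USD` only at display time keeps the stored value exact.
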
3. Traces
Spans: media.upload.finalize, media.scan, media.transcode, media.caption.generate, media.ai.image, media.delete.
4. Dashboards
- Upload volume + success rate.
- Transcode queue + latency.
- Scan outcomes.
- AI usage + cost.
- Storage cost per tenant.
- CDN cache hit.
5. Alerts
| Alert | Threshold | Severity |
|---|---|---|
| upload-failure-rate | > 2% | P2 |
| scan-queue-backlog | > 5000 | P2 |
| transcode-failure-rate | > 3% | P2 |
| quarantine-spike | > 10/min | P1 (possible attack) |
| ai-budget-exhausted | tenant budget > 100% | P3 |
| storage-cost-spike | > 50% WoW | P3 |
| cdn-cache-hit-low | < 85% | P3 |
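
To make the comparison direction of each threshold explicit, the table above can be encoded as data and evaluated against current values. This is only a sketch: `AlertRule` and `firing` are hypothetical, percentages are expressed as fractions, and how each value is measured (window, aggregation) is not stated in the source. A real deployment would more likely define these rules in the alerting system itself.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    name: str
    compare: Callable[[float, float], bool]  # compare(current_value, threshold)
    threshold: float
    severity: str


RULES = [
    AlertRule("upload-failure-rate", operator.gt, 0.02, "P2"),
    AlertRule("scan-queue-backlog", operator.gt, 5000, "P2"),
    AlertRule("transcode-failure-rate", operator.gt, 0.03, "P2"),
    AlertRule("quarantine-spike", operator.gt, 10, "P1"),       # quarantines per minute
    AlertRule("ai-budget-exhausted", operator.gt, 1.00, "P3"),  # fraction of tenant budget used
    AlertRule("storage-cost-spike", operator.gt, 0.50, "P3"),   # week-over-week growth
    AlertRule("cdn-cache-hit-low", operator.lt, 0.85, "P3"),    # fires when BELOW threshold
]


def firing(values: Dict[str, float]) -> List[AlertRule]:
    """Return breached rules, most severe first; rules with no value are skipped."""
    hits = [
        rule for rule in RULES
        if rule.name in values and rule.compare(values[rule.name], rule.threshold)
    ]
    return sorted(hits, key=lambda rule: rule.severity)  # "P1" sorts before "P2"


current = {
    "upload-failure-rate": 0.031,
    "quarantine-spike": 12,
    "cdn-cache-hit-low": 0.91,
}
active = [rule.name for rule in firing(current)]
```

Note that `cdn-cache-hit-low` is the only rule that fires when the value drops below its threshold; all others fire when the value exceeds it.
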
6. SLOs
| SLI | Target |
|---|---|
| Upload URL p95 | < 200ms |
| Scan p95 | < 30s |
| Transcode (1-min video 1080p) p95 | < 3 min |
| Stream URL p95 | < 100ms |
| AI image gen p95 | < 15s |
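
As a worked illustration of what a p95 target means, the sketch below computes a nearest-rank 95th percentile over raw latency samples and compares it with the targets above. The SLI keys and the `p95` and `slo_met` helpers are assumptions for this example. In practice, the percentile would more likely be derived from the histogram metrics than from raw samples.

```python
from typing import Dict, Sequence

# Targets from the table above, converted to seconds; the keys are hypothetical.
SLO_TARGETS_SECONDS: Dict[str, float] = {
    "upload_url": 0.2,
    "scan": 30.0,
    "transcode_1min_1080p": 180.0,
    "stream_url": 0.1,
    "ai_image_gen": 15.0,
}


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least 95% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def slo_met(sli: str, samples: Sequence[float]) -> bool:
    """True when the observed p95 is strictly below the target for this SLI."""
    return p95(samples) < SLO_TARGETS_SECONDS[sli]


# 95 fast requests and 5 slow ones: the p95 is still a fast request.
mostly_fast = [0.05] * 95 + [0.5] * 5
# One more slow request pushes the p95 onto the slow tail.
too_many_slow = [0.05] * 94 + [0.5] * 6

ok = slo_met("upload_url", mostly_fast)
breached = not slo_met("upload_url", too_many_slow)
```

The two sample sets show that a p95 target tolerates up to 5% of requests exceeding the limit, and that the SLO is missed as soon as the slow share grows past that.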