file-storage-service — OBSERVABILITY
Companion: SERVICE_OVERVIEW §7 SLOs · APPLICATION_LOGIC · FAILURE_MODES · platform: @ghasi/telemetry
This service emits structured logs, OpenTelemetry traces, and Prometheus-format metrics through the platform @ghasi/telemetry package. Telemetry is initialized before NestFactory.create in src/main.ts so that the very first DB connection and Pub/Sub publish are traced. All telemetry tags carry tenant_id, service, version, region, and (where applicable) scope, data_class, caller_surface.
1. SLIs and SLOs
| SLI | Definition | Target | Window |
|---|---|---|---|
| `initiate_upload_p95_latency_seconds` | Server-side latency of `POST /api/v1/files/uploads` | ≤ 0.150 s | rolling 30 d |
| `confirm_upload_p95_latency_seconds` | Server-side latency of `POST /uploads/{ups}/confirm` | ≤ 0.250 s | rolling 30 d |
| `download_url_p95_latency_seconds` | Server-side latency of `POST /files/{med}/download-url` | ≤ 0.120 s | rolling 30 d |
| `scan_completion_p95_seconds` | `scan.passed.v1.occurredAt` − `upload.completed.v1.occurredAt` | ≤ 15 s | rolling 30 d |
| `optimization_p95_seconds` | `optimization.completed.v1` − `scan.passed.v1` for bytes ≤ 5 MiB | ≤ 30 s | rolling 30 d |
| `cdn_get_p95_seconds_cache_hit` | Cloud CDN log latency, hits only | ≤ 0.080 s | rolling 30 d |
| `availability_read_pct` | 1 − (5xx / total) for read endpoints, as a percentage | ≥ 99.95 % | rolling 30 d |
| `availability_write_pct` | 1 − (5xx / total) for write endpoints, as a percentage | ≥ 99.9 % | rolling 30 d |
| `cross_tenant_access_count` | `file.access.denied.v1` events with `reason='cross_tenant'` | 0 | rolling 24 h |
| `retention_sweep_lag_seconds` | `now() − min(hard_delete_after WHERE not yet purged)` | ≤ 3600 s | continuous |
| `outbox_publish_lag_p95_seconds` | `now() − created_at` for unpublished rows | ≤ 2 s | continuous |
| `quarantine_rate_pct_24h` | quarantined / total uploads | ≤ 0.5 % | rolling 24 h (anomaly bound) |
| `signed_url_revoke_to_block_seconds_p95` | Time from `DELETE` access-grant to next request blocked | ≤ 30 s | rolling 24 h |
Error budgets are calculated on the rolling 30 d window. Burn-rate alerts fire at 2 % / hr (slow) and 14.4 % / hr (fast).
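As a rough illustration of how a burn rate relates to budget consumption, the sketch below treats burn rate as the observed error ratio divided by the error ratio the SLO allows. The share of the budget consumed is then the burn rate multiplied by the window length over the SLO period. The function names are hypothetical and not part of `@ghasi/telemetry`; the alerting itself runs in Cloud Monitoring.

```typescript
// Sketch only: names are illustrative, not part of @ghasi/telemetry.

/** Burn rate = observed error ratio / error ratio allowed by the SLO. */
function burnRate(errorRatio: number, sloTarget: number): number {
  const allowed = 1 - sloTarget; // e.g. 0.001 for a 99.9 % SLO
  return errorRatio / allowed;
}

/** Fraction of the whole error budget consumed during `windowHours`. */
function budgetConsumed(rate: number, windowHours: number, sloPeriodDays = 30): number {
  return (rate * windowHours) / (sloPeriodDays * 24);
}

// availability_write_pct targets 99.9 %, so 1.44 % of write requests
// failing corresponds to a burn rate of about 14.4.
const writeBurn = burnRate(0.0144, 0.999);

// Sustained for 1 h, that burn rate consumes about 2 % of the 30 d budget.
const consumedInOneHour = budgetConsumed(writeBurn, 1);
```

Under these definitions, a 14.4× burn rate held for one hour uses roughly 2 % of a 30 d budget, which is how the two figures usually relate in multi-window burn-rate alerting.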
2. Metrics (Prometheus exposition on port 9090)
2.1 Request metrics
file_storage_http_requests_total{method,route,status,tenant_id,caller_surface}
file_storage_http_request_duration_seconds_bucket{method,route,status,le,...}
file_storage_http_request_size_bytes{method,route}
file_storage_http_response_size_bytes{method,route}
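The latency SLIs in section 1 are p95 values, and `_bucket` series only carry cumulative counts per upper bound. One common way to estimate a quantile from such buckets is linear interpolation inside the bucket that contains the target rank, in the manner of PromQL's `histogram_quantile`. The sketch below assumes that approach; the `Bucket` type and `estimateQuantile` name are hypothetical, and the sample counts are made up for illustration.

```typescript
// Sketch: estimate a quantile from cumulative histogram buckets using
// linear interpolation inside the bucket containing the target rank.
interface Bucket {
  le: number;    // upper bound in seconds (Infinity for the +Inf bucket)
  count: number; // cumulative count of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (!Number.isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + ((b.le - prevLe) * (rank - prevCount)) / inBucket;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Illustrative (made-up) cumulative counts for one route.
const sampleBuckets: Bucket[] = [
  { le: 0.05, count: 50 },
  { le: 0.1, count: 90 },
  { le: 0.25, count: 99 },
  { le: Infinity, count: 100 },
];
const p95Seconds = estimateQuantile(0.95, sampleBuckets);
```

With these sample counts the estimate lands between 0.1 s and 0.25 s. The estimate can only be as precise as the bucket boundaries, so boundaries near the SLO targets (0.120 s, 0.150 s, 0.250 s) keep it meaningful.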
2.2 Domain metrics
file_storage_uploads_initiated_total{scope,data_class,tenant_id}
file_storage_uploads_confirmed_total{scope,alias,tenant_id}
file_storage_uploads_failed_total{scope,reason,tenant_id}
file_storage_downloads_issued_total{scope,variant,tenant_id,purpose}
file_storage_downloads_denied_total{scope,reason,tenant_id} # reason=cross_tenant|expired|revoked|quarantined|scan_pending|missing_role
file_storage_scan_results_total{scanner,verdict}
file_storage_scan_duration_seconds_bucket{scanner,scope,le}
file_storage_optimization_duration_seconds_bucket{scope,preset,le}
file_storage_optimization_failed_total{scope,preset,reason}
file_storage_dedupe_alias_total{scope,tenant_id}
file_storage_quarantine_total{scope,reason} # reason=virus|ai_safety|magic_byte|polyglot
file_storage_retention_purges_total{scope,policy_name}
file_storage_erasure_runs_total{scope_kind,outcome}
file_storage_erasure_purged_objects_total{scope_kind}
file_storage_erasure_deferred_objects_total{scope_kind}
file_storage_quota_bytes_used{tenant_id,scope}
file_storage_quota_bytes_cap{tenant_id}
file_storage_quota_objects_used{tenant_id,scope}
file_storage_quota_objects_cap{tenant_id}
2.3 Infrastructure metrics
file_storage_outbox_unpublished_total
file_storage_outbox_publish_lag_seconds_bucket{le}
file_storage_outbox_publish_failures_total{reason}
file_storage_inbox_dedupe_skips_total{event_type}
file_storage_inbox_handler_failures_total{event_type,handler}
file_storage_db_pool_in_use{pool}
file_storage_db_pool_idle{pool}
file_storage_db_query_duration_seconds_bucket{op,le}
file_storage_db_constraint_violations_total{constraint}
file_storage_redis_command_duration_seconds_bucket{cmd,le}
file_storage_signed_url_cache_hits_total
file_storage_signed_url_cache_misses_total
file_storage_signed_url_blacklist_size
file_storage_pubsub_publish_total{topic,result}
file_storage_pubsub_consume_total{subscription,result}
file_storage_gcs_op_duration_seconds_bucket{op,le} # op=sign,head,delete,copy
file_storage_gcs_op_failures_total{op,error}
file_storage_cdn_invalidate_attempts_total{result}
file_storage_cdn_invalidate_duration_seconds_bucket{le}
file_storage_cdn_invalidate_backlog
2.4 AI metrics
file_storage_ai_calls_total{purpose,outcome,tier}
file_storage_ai_latency_seconds_bucket{purpose,tier,le}
file_storage_ai_cost_micro_usd_total{purpose,tier,tenant_id}
file_storage_ai_hitl_pending{purpose}
file_storage_ai_hitl_oldest_age_seconds{purpose}
3. Tracing (OpenTelemetry → Cloud Trace)
Span attributes required on every span:
| Attribute | Source |
|---|---|
| `tenant_id` | request context |
| `request_id` | `X-Request-Id` header |
| `caller_surface` | derived from JWT |
| `actor.kind` / `actor.userId` | JWT subject |
| `route` | controller path |
| `idempotency_key` | request header (writes only) |
Spans of interest:
- `http.server.request` (root)
  - `usecase.<command>` (e.g., `usecase.initiateUpload`)
  - `db.<operation>` (e.g., `db.file_objects.insert`)
  - `db.outbox.append`
  - `gcs.signed_url.upload`
  - `pubsub.publish` (carries `topic`, `messageId`, `orderingKey`)
  - For confirm:
    - `gcs.head`
    - `gcs.stream_sha256_compare` (only on missing object metadata)
    - `usecase.dedupe_lookup`
- `outbox.relay.batch`: root for the relay worker; one span per batch, child spans per message published.
- `inbox.consumer.<subscription>`: root for each consumed Pub/Sub message.
- `sweeper.retention.scan` and `sweeper.session.cleanup`: long-running spans with periodic events.
- `ai.<purpose>`: wraps every orchestrator call; child of the originating use-case span.
Trace sampling: head-based 5 %; tail-based always-keep if any span has error=true, ai.tier=cloud, gcs.op_failures ≥ 1, or cross_tenant=true.
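The sampling rule can be read as a predicate over the spans of a finished trace. The sketch below is one possible interpretation, not the collector configuration: the `SpanLike` shape, the function names, and the choice to pass the head-sampling draw in as a number in [0, 1) are all assumptions. The attribute keys are the ones listed in the rule above.

```typescript
// Sketch only: SpanLike and the function names are hypothetical.
type AttrValue = string | number | boolean;

interface SpanLike {
  attributes: Record<string, AttrValue>;
}

/** True if this span alone forces the whole trace to be kept. */
function mustKeep(span: SpanLike): boolean {
  const a = span.attributes;
  const failures = a["gcs.op_failures"];
  return (
    a["error"] === true ||
    a["ai.tier"] === "cloud" ||
    (typeof failures === "number" && failures >= 1) ||
    a["cross_tenant"] === true
  );
}

/**
 * Keep the trace if any span matches the always-keep rule; otherwise fall
 * back to the head-based rate. `headSampleDraw` is a uniform draw in [0, 1)
 * made when the trace started.
 */
function keepTrace(spans: SpanLike[], headSampleDraw: number, headRate = 0.05): boolean {
  if (spans.some(mustKeep)) return true;
  return headSampleDraw < headRate;
}
```

Passing the draw in rather than calling `Math.random()` inside keeps the decision deterministic and easy to test.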
4. Dashboards (Grafana / Cloud Monitoring)
4.1 file-storage / overview
- RED panels per route: rate, errors, p50/p95/p99 latency.
- Upload funnel: initiated → confirmed → ready → quarantined.
- Outbox lag, inbox lag, Pub/Sub backlog per topic.
- 30 d error budget burn (multi-window).
4.2 file-storage / quotas
- Top 20 tenants by bytes_used %, objects_used %.
- Tenants in 80–95 % band (warn), ≥ 95 % (critical).
- Reserved bytes pending confirm (leaks?).
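The warn and critical bands on this dashboard amount to a simple threshold check on used versus cap. A minimal sketch, assuming the bands are evaluated per tenant on a used/cap percentage; the `quotaBand` name and the handling of a non-positive cap are assumptions, not documented behavior.

```typescript
// Sketch only: quotaBand is a hypothetical helper.
type QuotaBand = "ok" | "warn" | "critical";

function quotaBand(used: number, cap: number): QuotaBand {
  // Assumption: a zero or negative cap is treated as exhausted.
  if (cap <= 0) return "critical";
  const pct = (used / cap) * 100;
  if (pct >= 95) return "critical"; // >= 95 % band
  if (pct >= 80) return "warn";     // 80-95 % band
  return "ok";
}
```

The same function would apply to either pair of gauges, `file_storage_quota_bytes_used` / `file_storage_quota_bytes_cap` or the object-count equivalents.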
4.3 file-storage / scan-and-quarantine
- Scan throughput (per scanner).
- Quarantine count + reason distribution (last 24 h, 7 d).
- Time-to-scan p50/p95.
- ClamAV definitions version freshness.
4.4 file-storage / optimization
- Variant build success / fail (per preset, per scope).
- Optimization latency buckets per byte-size band.
- Optimizer DLQ depth.
4.5 file-storage / cdn
- CDN hit ratio per host.
- p95 edge latency.
- Invalidation backlog and time-to-invalidate.
4.6 file-storage / erasure-and-retention
- Erasure runs in flight, completed, partial, failed (last 30 d).
- Deferred objects by policy.
- Retention sweep lag.
- Holds about to release in next 7 d.
4.7 file-storage / ai
- Per purpose: call rate, latency, refusal rate, cost / 1k calls.
- HITL queue depth and oldest age per purpose.
5. Logs (pino JSON → Cloud Logging)
Required fields on every log line:
{ "ts", "level", "service":"file-storage-service", "version", "tenant_id", "trace_id", "span_id",
"request_id", "caller_surface", "actor": { "userId" | "service" }, "route", "msg" }
Sensitive fields are redacted by pino's redactor: Authorization, Cookie, any header with signature, Signature, X-Goog-Signature, request bodies on PII-scope endpoints (guest_id_scan upload), and PUT URL query strings.
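To make the redaction rules concrete, the sketch below restates two of them as plain functions: header names that are always redacted or contain "signature", and query strings stripped from URLs. In the service this is handled by pino's redactor; the function names and the `[Redacted]` placeholder here are assumptions for illustration only.

```typescript
// Sketch only: illustrates the rules, not the pino configuration itself.
const REDACTED = "[Redacted]";
const ALWAYS_REDACT = new Set(["authorization", "cookie"]);

/** Returns a copy of the headers with sensitive values replaced. */
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(headers)) {
    const lower = key.toLowerCase();
    const sensitive = ALWAYS_REDACT.has(lower) || lower.includes("signature");
    out[key] = sensitive ? REDACTED : value;
  }
  return out;
}

/** Drops the query string, which is where signed-URL credentials live. */
function stripQuery(rawUrl: string): string {
  const i = rawUrl.indexOf("?");
  return i === -1 ? rawUrl : rawUrl.slice(0, i);
}
```

Matching on the lower-cased header name covers both `signature` and `Signature` spellings with one check.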
Notable log lines:
- `upload.initiate`: info on success; warn when quota is near 95 %.
- `upload.confirm.dedupe`: info; carries `aliasOf`.
- `upload.confirm.hash_mismatch`: warn; alert if rate spikes (potential client bug or attack).
- `download.issue`: info for `private`/`archive`, sampled 1/100 for `public_media` (volume).
- `access.denied`: warn; SIEM keys off this line in addition to the event.
- `scan.callback.passed|failed`: info / warn.
- `scan.callback.late`: warn; `scan_completion_seconds > 60`.
- `optimize.callback.failed`: warn; `attempts >= 3`.
- `cdn.invalidate.failed`: error; backlog alert.
- `erasure.completed`: info with counts; `erasure.partial`: warn.
- `retention.sweep.lag`: warn at lag > 30 min, error at > 60 min.
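Two of the rules above are threshold or sampling decisions that are easy to get subtly wrong. The sketch below shows one way they might be expressed; both function names are hypothetical, and logging every scope other than `public_media` in full is an assumption drawn from the `private`/`archive` wording.

```typescript
// Sketch only: helper names are hypothetical.
type Level = "info" | "warn" | "error";

/** Log level for retention.sweep.lag: warn above 30 min, error above 60 min. */
function sweepLagLevel(lagSeconds: number): Level {
  const minutes = lagSeconds / 60;
  if (minutes > 60) return "error";
  if (minutes > 30) return "warn";
  return "info";
}

/**
 * Whether to emit download.issue for this request. `sequence` is a
 * per-process request counter; public_media keeps 1 in 100 lines.
 */
function shouldLogDownload(scope: string, sequence: number): boolean {
  if (scope !== "public_media") return true;
  return sequence % 100 === 0;
}
```

A counter gives an exact 1/100 rate per process; a random draw would work equally well if exactness does not matter.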
6. Audit log export
Domain events on these topics are dual-published to BigQuery via a Pub/Sub-to-BigQuery subscription with a 7-year table retention:
- `melmastoon.file.scan.results.v1`
- `melmastoon.file.access.v1` (denied; granted for non-public scopes only)
- `melmastoon.file.deletion.v1`
- `melmastoon.file.retention.v1`
- `melmastoon.file.erasure.v1`
- `melmastoon.file.upload.lifecycle.v1`

The `access_grants` table is exported nightly to BigQuery with a 13-month rolling partition retention; older partitions are dropped from Postgres and queried from BigQuery for historical audits.
7. Alerts (Cloud Monitoring + PagerDuty)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| `FileStorage_ErrorBudgetFastBurn` | 14.4× burn over 1 h on `availability_write` | P1 | `file/error-budget-burn` |
| `FileStorage_CrossTenantAccessDenied` | ≥ 1 in 5 min from same actor | P1 | `file/cross-tenant-leak-suspected` |
| `FileStorage_OutboxLagHigh` | `publish_lag` p95 > 30 s for 5 m | P2 | `file/outbox-lag` |
| `FileStorage_PubSubDLQGrowth` | DLQ count > 10 in 5 m on any topic | P2 | `file/dlq-growth` |
| `FileStorage_ScanLatencyHigh` | `scan_completion` p95 > 60 s for 10 m | P2 | `file/scan-latency` |
| `FileStorage_ScanQueueBacklog` | ClamAV queue depth > 1000 for 5 m | P2 | `file/quarantine-storm` |
| `FileStorage_QuarantineRateHigh` | `quarantine_rate_24h` > 5 % for 30 m | P2 | `file/quarantine-anomaly` |
| `FileStorage_QuotaBlock` | any tenant with `blocked_at` not null and uploads attempted ≥ 100/min | P3 | `file/quota-runaway` |
| `FileStorage_OptimizerDLQ` | optimizer DLQ depth > 50 | P3 | `file/optimizer-dlq` |
| `FileStorage_CdnInvalidationBacklog` | backlog > 50 for 15 m | P3 | `file/cdn-invalidation-backlog` |
| `FileStorage_RetentionSweepLag` | `sweep_lag` > 60 min | P2 | `file/retention-sweep-lag` |
| `FileStorage_ErasureFailed` | any `erasure_request` with `status='failed'` or `'partial'` | P2 | `file/erasure-cert-failure` |
| `FileStorage_AIBudgetExhausted` | budget refusals > 100/min for 10 m | P3 | `file/ai-budget` |
| `FileStorage_HITLBacklog` | `hitl_oldest_age` > 24 h | P3 | `file/hitl-backlog` |
| `FileStorage_KMS_Unavailable` | KMS error rate > 1 % for 5 m | P1 | `file/kms-unavailable` |
| `FileStorage_DBConnPoolSaturated` | `in_use / max` > 0.9 for 10 m | P2 | `file/db-saturation` |
8. SLO computation pipeline
SLO numbers are computed in BigQuery from the `file_objects` table (replicated via Datastream) and the raw event topics. A scheduled query refreshes the `slos_file_storage` table every 5 min, and the Grafana datasource for the SLO panels reads from it. Error-budget-burn alerts run on Cloud Monitoring metrics instead, for sub-minute responsiveness.
9. Cardinality controls
- The `tenant_id` label on per-tenant metrics is capped at the top 100 tenants by usage; remaining tenants roll up to `tenant_id="other"`. Dashboards include both.
- Avoid `route` cardinality explosion by templatizing IDs (`/files/{med}` not `/files/med_01HXY...`).
- `scope` and `data_class` are bounded enums (≤ 9 + 3 values).
- `actor.userId` is never a metric label; only a log/span attribute.
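The first two controls can be sketched as small label-normalizing helpers applied before a metric is recorded. This is illustrative only: the function names are hypothetical, the `med_` prefix is taken from the example path above, and the `ups_` prefix for upload-session IDs is an assumption.

```typescript
// Sketch only: helper names and the ups_ prefix are assumptions.

/** Collapse tenants outside the top-N set into a single "other" label. */
function tenantLabel(tenantId: string, topTenants: ReadonlySet<string>): string {
  return topTenants.has(tenantId) ? tenantId : "other";
}

/** Replace concrete IDs in a path with their route template placeholders. */
function templatizeRoute(path: string): string {
  return path
    .replace(/\/med_[0-9A-Za-z]+/g, "/{med}")
    .replace(/\/ups_[0-9A-Za-z]+/g, "/{ups}");
}
```

Where the framework exposes the matched route pattern, using it directly as the `route` label avoids regex rewriting altogether.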
10. Validation
Three CI gates on this doc:
- Every metric named here must be emitted by source code (script `bin/check-metrics.ts` greps `file_storage_*` against the Prometheus collector registry).
- Every alert here must have a corresponding Terraform resource in `infra/melmastoon/modules/file-storage/alerts.tf`.
- Every runbook URL must resolve (404 fails CI).
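The first gate boils down to a set difference between the metric names in this document and the names in the collector registry. The sketch below is not the contents of `bin/check-metrics.ts`; it is a minimal illustration under two assumptions: names are extracted with a regular expression, and histograms are registered under their base name, so the `_bucket` suffix is stripped before comparison.

```typescript
// Sketch only: function names are hypothetical.

/** Collect every file_storage_* metric name mentioned in the doc text. */
function metricNamesInDoc(docText: string): Set<string> {
  const names = new Set<string>();
  const re = /\bfile_storage_[a-z0-9_]+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(docText)) !== null) {
    // Assumption: histograms are registered without the _bucket suffix.
    names.add(m[0].replace(/_bucket$/, ""));
  }
  return names;
}

/** Names documented but absent from the registry, sorted for stable output. */
function missingMetrics(docText: string, registered: Iterable<string>): string[] {
  const have = new Set(registered);
  return Array.from(metricNamesInDoc(docText))
    .filter((n) => !have.has(n))
    .sort();
}
```

A CI wrapper would read this file, pass in the registered names, and exit non-zero when the returned list is not empty.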