Observability
:::info Source
Sourced from services/certification-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Events: certification.certificate.issued, .revoked, .verified (AUDIT), .render.started, .render.completed, .render.failed, .offline_claim.submitted, .offline_claim.verified / .rejected, .template.updated, .kid.rotated.
Attrs: certificate_id, template_id, course_version_id, verification_token_fingerprint.
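A minimal sketch of how one of these events might be emitted as a structured log line. The source does not say how verification_token_fingerprint is derived; the truncated SHA-256 below, the helper names (token_fingerprint, build_event, log_event), and the JSON layout are all assumptions for illustration.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("certification")


def token_fingerprint(token: str) -> str:
    """Assumed scheme: truncated SHA-256, so the raw token never reaches the logs."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]


def build_event(event: str, *, certificate_id: str, template_id: str,
                course_version_id: str, verification_token: str) -> dict:
    """Build a log record carrying the attributes listed above."""
    return {
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "certificate_id": certificate_id,
        "template_id": template_id,
        "course_version_id": course_version_id,
        # Only the fingerprint is logged, never the token itself.
        "verification_token_fingerprint": token_fingerprint(verification_token),
    }


def log_event(record: dict) -> None:
    """Emit the record as a single JSON line (hypothetical log format)."""
    logger.info(json.dumps(record, sort_keys=True))
```

Logging a fingerprint rather than the token lets log lines be correlated per certificate without exposing a value that could be replayed against the public verify endpoint.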
2. Metrics
RED
- cert_api_requests_total{endpoint,status} (counter)
- cert_api_duration_seconds{endpoint} (histogram)
Domain
- cert_issued_total{tenant_id,template_id} (counter)
- cert_revoked_total{reason} (counter)
- cert_verification_total{result=issued|revoked|not_found|invalid} (counter)
- cert_offline_claim_total{status=verified|rejected} (counter)
- cert_render_duration_seconds{format=pdf|png|openbadges} (histogram)
- cert_issuance_latency_seconds (histogram; completion event → cert visible, target p95 < 10s)
- cert_public_verify_cache_hit_ratio (gauge)
USE
- cert_outbox_lag_seconds (gauge)
- cert_render_queue_depth (gauge)
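The p95 target on cert_issuance_latency_seconds is read from histogram buckets rather than raw samples. The sketch below estimates a quantile from cumulative buckets using linear interpolation inside the matching bucket, similar in spirit to what Prometheus-style backends do. The metrics backend is not named in the source, and the function name and bucket boundaries here are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with the last bound being +inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket snapshot for cert_issuance_latency_seconds.
snapshot = [(1.0, 50), (5.0, 90), (10.0, 98), (30.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, snapshot)
meets_target = p95 < 10.0  # target from the metric definition above
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries; placing a boundary at the 10s target makes the comparison against the target more reliable.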
3. Traces
Spans: cert.issue, cert.render.pdf, cert.render.png, cert.sign_jws, cert.verify.public, cert.offline_claim.verify.
4. Dashboards
- Issuance funnel: completion events → certs issued rate.
- Verification: total, per-tenant, cache hit.
- Rendering: latency + failure rate.
- Revocations: rate + reason distribution.
- JWKS: serve rate, rotations.
5. Alerts
| Alert | Threshold | Severity |
|---|---|---|
| issuance-latency-high | p95 > 30s | P2 |
| issuance-failure-rate | > 1% | P2 |
| revocation-spike | > 100/hour for a tenant | P2 (may be AI flag candidate) |
| public-verify-4xx | > 10% | P3 |
| offline-claim-rejection-spike | > 10% | P2 |
| render-queue-backlog | > 500 | P2 |
| kid-rotation-pending-overdue | | P3 |
| jwks-serve-error | > 0.1% | P1 (verification broken) |
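As an illustration, the numeric thresholds in the table can be expressed as data and checked against observed values. This is a sketch only: the AlertRule type, the firing helper, and the shape of the observed dictionary are assumptions, and kid-rotation-pending-overdue is left out because the table gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float  # the alert fires when the observed value exceeds this
    severity: str


# Thresholds copied from the table; ratios are expressed as fractions.
RULES = [
    AlertRule("issuance-latency-high", 30.0, "P2"),          # p95 seconds
    AlertRule("issuance-failure-rate", 0.01, "P2"),          # > 1%
    AlertRule("revocation-spike", 100.0, "P2"),              # per tenant, per hour
    AlertRule("public-verify-4xx", 0.10, "P3"),              # > 10%
    AlertRule("offline-claim-rejection-spike", 0.10, "P2"),  # > 10%
    AlertRule("render-queue-backlog", 500.0, "P2"),          # queue depth
    AlertRule("jwks-serve-error", 0.001, "P1"),              # > 0.1%
]


def firing(observed: dict[str, float]) -> list[tuple[str, str]]:
    """Return (alert name, severity) for every rule whose threshold is exceeded."""
    return [
        (rule.name, rule.severity)
        for rule in RULES
        if observed.get(rule.name, 0.0) > rule.threshold
    ]
```

For example, `firing({"jwks-serve-error": 0.002, "render-queue-backlog": 500})` would report only the JWKS alert, since a queue depth of exactly 500 does not exceed its threshold.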
6. SLOs
| SLI | Target |
|---|---|
| Issuance pipeline p95 (completion → cert) | < 10s |
| Public /verify p95 | < 100ms (cache hit); < 300ms (miss) |
| Issuance success rate | ≥ 99.9% |
| JWKS availability | 99.999% |
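To make the targets concrete, the sketch below converts them into error budgets. The 30-day window and the request volume are assumptions; the source does not state the SLO evaluation window.

```python
SECONDS_PER_DAY = 24 * 3600


def downtime_budget_seconds(availability: float, window_days: int = 30) -> float:
    """Seconds of unavailability allowed in the window for a given availability target."""
    return (1.0 - availability) * window_days * SECONDS_PER_DAY


def failure_budget(success_rate: float, total_requests: int) -> float:
    """Number of failed requests allowed for a given success-rate target."""
    return (1.0 - success_rate) * total_requests


# JWKS availability of 99.999% over an assumed 30-day window.
jwks_budget = downtime_budget_seconds(0.99999)

# Issuance success rate of 99.9% over a hypothetical 100,000 issuances.
issuance_budget = failure_budget(0.999, 100_000)
```

Under these assumptions the JWKS target leaves roughly 26 seconds of downtime per 30 days, and the issuance target allows about 100 failures per 100,000 issuances.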
7. RUM
- Learner certificate page LCP < 1.5s.
- Verifier page LCP < 1s (mostly cached).
8. Public Verify Analytics
- Geographic distribution of verifies.
- User-agent distribution (helps detect bot scraping).
- Anomaly: the same cert token verified 10,000 times within 1 minute → alert.
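The anomaly rule above can be sketched as a sliding-window counter keyed by token fingerprint. This is an in-process illustration under stated assumptions: the VerifyBurstDetector class is hypothetical, and a real deployment would likely need shared state across instances.

```python
from collections import defaultdict, deque


class VerifyBurstDetector:
    """Flags a token once it is verified `limit` times within `window_seconds`."""

    def __init__(self, limit: int = 10_000, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def record(self, token_fingerprint: str, now: float) -> bool:
        """Record one verification at time `now` (seconds); return True if the alert should fire."""
        hits = self._hits[token_fingerprint]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        cutoff = now - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) >= self.limit
```

Keying on the fingerprint rather than the raw token keeps the detector consistent with what is already logged. Memory grows with the number of distinct tokens seen, so idle entries would need periodic cleanup in a long-running process.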