
Metrics

Stage: Alpha Status: Draft

IronRoot metrics describe PKI operations, API health, security posture, storage latency, and telemetry pipeline health. Metrics are available at /metrics when Prometheus export is enabled and can also be exported through OTLP.
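As a rough illustration of what a scrape of /metrics returns, the sketch below parses a small Prometheus text-format payload and keeps only the pki_ series. The sample payload, its label values, and the `parse_metrics` helper are invented for illustration. The parser handles only the simple `name{labels} value` form, not the full exposition format (no escaping, timestamps, or exemplars).

```python
import re

# Matches `name{labels} value` or `name value`; a deliberate simplification
# of the Prometheus text exposition format (no escapes, no timestamps).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text, prefix="pki_"):
    """Return (name, labels, value) tuples for series starting with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload; real label values depend on the deployment.
SAMPLE_PAYLOAD = """\
# TYPE pki_api_requests_total counter
pki_api_requests_total{method="GET",route="/health",status_code="200"} 42
pki_enrollments_total 7
go_goroutines 12
"""

for name, labels, value in parse_metrics(SAMPLE_PAYLOAD):
    print(name, labels, value)
```

In practice a Prometheus server or an OTLP collector does this parsing; the sketch is only meant to show the shape of the data behind the tables below.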

API Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_api_requests_total | Counter | method, route, status_code | requests | API traffic volume and endpoint usage. |
| pki_api_request_duration_seconds | Histogram | method, route, status_code | seconds | API latency. |
| pki_api_request_failures_total | Counter | method, route, status_code | requests | Server-side API failures. |

Good values are steady request rates, low p95 latency, and near-zero 5xx responses. Bad values are latency spikes, request floods, or elevated 5xx responses.

rate(pki_api_requests_total[5m])
histogram_quantile(0.95, rate(pki_api_request_duration_seconds_bucket[5m]))
rate(pki_api_request_failures_total[5m])

Alert on sustained high p95 latency or elevated 5xx rates.
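To make the p95 query above less opaque, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach PromQL's histogram_quantile takes for classic histograms. The `estimate_quantile` helper and the bucket boundaries and counts are hypothetical; IronRoot's actual bucket layout is not stated here.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf.
    This sketch requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for pki_api_request_duration_seconds_bucket.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, buckets))
```

With these made-up counts the estimate lands between the 0.5 s and 1.0 s bounds, at roughly 0.81 s. The result is only as precise as the bucket boundaries, so a p95 alert threshold is best chosen near a bucket edge.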

Enrollment Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_enrollments_total | Counter | none | enrollments | Successful enrollment activity. |
| pki_enrollment_failures_total | Counter | none | failures | Failed enrollment attempts. |
| pki_bootstrap_token_validation_failures_total | Counter | none | failures | Invalid, expired, or revoked bootstrap token attempts. |

Use these metrics to spot onboarding problems and possible token-guessing attacks.

increase(pki_enrollment_failures_total[15m])
increase(pki_bootstrap_token_validation_failures_total[15m])
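The increase() queries above report how much a counter grew inside a window. A minimal sketch of that arithmetic, including the handling of counter resets after a restart, might look like the following. The `counter_increase` helper, the sample values, and the threshold are all hypothetical, and unlike PromQL this version does not extrapolate to the window edges.

```python
def counter_increase(samples):
    """Sum the growth of a counter across samples, tolerating resets.

    A drop between consecutive samples is treated as a process restart,
    so the post-reset value counts as new growth.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero
    return total


# Hypothetical 15-minute window of pki_bootstrap_token_validation_failures_total.
window = [10, 12, 15, 2, 4]
TOKEN_FAILURE_THRESHOLD = 5  # illustrative value only, not an IronRoot default

failures = counter_increase(window)
if failures > TOKEN_FAILURE_THRESHOLD:
    print(f"{failures:.0f} token validation failures in window: investigate")
```

A sudden jump in this number with no matching rise in successful enrollments is the pattern to look for when token guessing is suspected.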

Certificate Lifecycle Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_certificates_issued_total | Counter | certificate_type, issuer, status | certificates | New certificate issuance. |
| pki_certificates_renewed_total | Counter | certificate_type, issuer, status | certificates | Renewal activity. |
| pki_certificates_revoked_total | Counter | certificate_type, issuer, status | certificates | Revocation activity. |
| pki_active_certificates_total | UpDownCounter | certificate_type, issuer, status | certificates | Approximate active inventory. |
| pki_expiring_certificates_total | UpDownCounter | certificate_type, issuer, status | certificates | Certificates approaching expiry. |

increase(pki_certificates_issued_total[24h])
increase(pki_certificates_revoked_total[1h])
pki_active_certificates_total

A high revocation rate may indicate key compromise or an automation error. A growing count of expiring certificates means renewals are not keeping up.

Security-Check Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_security_check_results_total | Counter | severity, status, category | checks | Security posture checks by outcome. |

Alert when critical- or high-severity checks fail.

increase(pki_security_check_results_total{severity=~"critical|high",status="fail"}[1h])

Bootstrap Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_bootstrap_runs_total | Counter | none | runs | Bootstrap guide executions. |
| pki_bootstrap_failures_total | Counter | none | failures | Failed bootstrap runs. |

Repeated failures usually mean automation is missing required acknowledgements or config paths.
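One way to tell repeated failures from an isolated one is to compare failures against total runs over the same window. The sketch below assumes that pki_bootstrap_runs_total counts every run, failed ones included, which this page does not state. The `failure_ratio` helper and the numbers are hypothetical.

```python
def failure_ratio(failures, runs):
    """Fraction of runs that failed; 0.0 when nothing ran in the window."""
    if runs <= 0:
        return 0.0
    return failures / runs


# Hypothetical one-hour increases of the two bootstrap counters, i.e. the
# results of increase(pki_bootstrap_failures_total[1h]) and
# increase(pki_bootstrap_runs_total[1h]).
ratio = failure_ratio(failures=3, runs=12)
print(f"bootstrap failure ratio: {ratio:.0%}")
```

Guarding the zero-run case matters because bootstrap is typically infrequent, and an empty window would otherwise divide by zero.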

Database Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_database_query_duration_seconds | Histogram | operation, status | seconds | Storage latency per repository operation. |
| pki_database_errors_total | Counter | operation, status | errors | Storage failures. |

histogram_quantile(0.95, rate(pki_database_query_duration_seconds_bucket[5m]))
rate(pki_database_errors_total[5m])

High DB latency often points to slow persistent storage or an overloaded filesystem under SQLite. With a future PostgreSQL backend, it could also point to network problems.

OpenTelemetry Exporter Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_otel_export_failures_total | Counter | operation | failures | Exporter initialization or export failures. |
| pki_otel_export_duration_seconds | Histogram | operation, status | seconds | Exporter operation latency. |

Failures here mean observability may be degraded even while PKI operations continue to work.

CLI Metrics

| Metric | Type | Labels | Unit | Meaning |
| --- | --- | --- | --- | --- |
| pki_cli_command_duration_seconds | Histogram | command, status | seconds | CLI operation duration. |
| pki_cli_command_errors_total | Counter | command, status | errors | CLI command failures. |
| pki_cli_command_results_total | Counter | command, status | commands | CLI success and failure count. |

Use CLI metrics to identify slow enrollment, renewal friction, and operator workflow failures.
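To find which commands cause the most operator friction, the counts in pki_cli_command_results_total can be grouped by the command label and turned into a failure rate. The sketch below works on hypothetical, already-windowed samples. The command names and the "success" and "failure" status values are assumptions, since the actual label values are not listed here.

```python
from collections import defaultdict


def failure_rate_by_command(samples):
    """Map each command to failed / total from labeled counter increases."""
    totals = defaultdict(float)
    failed = defaultdict(float)
    for labels, value in samples:
        command = labels["command"]
        totals[command] += value
        if labels["status"] != "success":  # assumed status value
            failed[command] += value
    return {cmd: failed[cmd] / totals[cmd] for cmd in totals if totals[cmd] > 0}


# Hypothetical increases of pki_cli_command_results_total over one day.
samples = [
    ({"command": "enroll", "status": "success"}, 18),
    ({"command": "enroll", "status": "failure"}, 2),
    ({"command": "renew", "status": "success"}, 40),
]

for command, rate in sorted(failure_rate_by_command(samples).items()):
    print(f"{command}: {rate:.0%} failed")
```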

Runtime Metrics

When runtime metrics are enabled by the collector or Go runtime instrumentation, track memory, goroutines, GC duration, CPU, and open file descriptors. Good values are stable. Bad values are steadily rising memory, long GC pauses, or file descriptor exhaustion.
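"Steadily rising" can be made concrete by fitting a line through recent samples and checking the sign and size of its slope, which is similar in spirit to what PromQL's deriv() does for gauges. The following is a minimal sketch under that assumption. The `trend_slope` helper and the sample values are hypothetical and not tied to any particular runtime metric name.

```python
def trend_slope(points):
    """Least-squares slope of (timestamp_seconds, value) points, per second."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    return covariance / variance if variance else 0.0


# Hypothetical hourly heap-size samples in bytes: growing by 10 MB per hour.
heap_samples = [(0, 100e6), (3600, 110e6), (7200, 120e6), (10800, 130e6)]
stable_samples = [(0, 50e6), (3600, 50e6), (7200, 50e6), (10800, 50e6)]

print(f"growing: {trend_slope(heap_samples):.1f} bytes/s")
print(f"stable:  {trend_slope(stable_samples):.1f} bytes/s")
```

A slope that stays positive across several windows is a stronger leak signal than any single high reading, because garbage collection makes individual samples noisy.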

Dashboard Guidance

Dashboards should include API health, certificate issuance, renewals, revocations, expiring certificates, failed enrollments, failed token validations, security-check failures, DB latency, and telemetry export health. Example dashboards live in examples/grafana/.