Alerting

Stage: Alpha | Status: Draft

Every alert should map to a concrete operator action. Do not page on one-off failures unless they indicate a trust compromise or a control-plane outage.

Recommended alerts:

| Alert | Severity | Example PromQL | Remediation |
|---|---|---|---|
| API unavailable | critical | `up{job="ironroot"} == 0` | Check pod/process health, TLS, Service, and network policy. |
| High API latency | medium | `histogram_quantile(0.95, rate(pki_api_request_duration_seconds_bucket[5m])) > 1` | Inspect traces for slow DB or CA spans. |
| Elevated 5xx errors | high | `rate(pki_api_request_failures_total[5m]) > 0.05` | Check server logs and recent config changes. |
| Enrollment attack pattern | high | `increase(pki_bootstrap_token_validation_failures_total[15m]) > 20` | Revoke exposed tokens and review source IPs. |
| Excessive revocations | high | `increase(pki_certificates_revoked_total[1h]) > 10` | Confirm whether this is planned rotation or incident response. |
| DB failures | high | `rate(pki_database_errors_total[5m]) > 0` | Check storage permissions, disk space, and SQLite PVC health. |
| Telemetry export failures | medium | `rate(pki_otel_export_failures_total[5m]) > 0` | Check collector endpoint, TLS, and network policy. |
| Critical security-check failure | critical | `increase(pki_security_check_results_total{severity="critical",status="fail"}[1h]) > 0` | Run `ironroot-admin security-check` and apply remediation. |
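
As an illustration, a few of the expressions above could be wired into a Prometheus rule file roughly as follows. This is a minimal sketch, not a shipped configuration: the group name, alert names, `for` durations, and annotation wording are assumptions chosen for the example. Only the expressions, severities, and remediation text come from the table.

```yaml
# Hypothetical rule file (for example ironroot-alerts.yaml); names and
# "for" durations are illustrative assumptions, not project defaults.
groups:
  - name: ironroot-alerts
    rules:
      # Pages when the scrape target is down for a sustained period,
      # so a single missed scrape does not fire the alert.
      - alert: IronrootAPIUnavailable
        expr: up{job="ironroot"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Ironroot API is unavailable
          description: Check pod/process health, TLS, Service, and network policy.

      # p95 request latency above 1 second, held for 10 minutes.
      - alert: IronrootHighAPILatency
        expr: histogram_quantile(0.95, rate(pki_api_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: Ironroot API p95 latency is above 1s
          description: Inspect traces for slow DB or CA spans.

      # Any critical security-check failure in the last hour fires
      # immediately, so no "for" clause is set.
      - alert: IronrootCriticalSecurityCheckFailure
        expr: increase(pki_security_check_results_total{severity="critical",status="fail"}[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: A critical security check failed
          description: Run ironroot-admin security-check and apply remediation.
```

A file like this can be validated with `promtool check rules` before it is loaded. When adapting thresholds, keep in mind that `rate()` returns a per-second value, so a condition such as `rate(pki_api_request_failures_total[5m]) > 0.05` compares against failures per second rather than an error ratio.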

Also monitor Root CA and Intermediate CA expiry through security-check reports until dedicated CA expiry gauges are added.