Monitoring Certificate Health in Production: Surviving the 90-Day Lifespan and Automation Failures
In April 2023, a global outage rippled through the Starlink network, leaving users worldwide without internet access. The root cause wasn't a sophisticated cyberattack or a catastrophic hardware failure in space. It was an expired TLS certificate on a ground station.
Starlink is not alone. Cisco had to issue an urgent field notice when a root certificate hardcoded into its SD-WAN vManage software expired, severing communication between controllers and edge routers. In fact, industry reports from PKI leaders reveal that over 80% of organizations have experienced at least one certificate-related outage in the past 24 months.
Certificate health monitoring has transitioned from a routine IT administrative chore to a critical pillar of Site Reliability Engineering (SRE) and DevSecOps. With Google’s impending push to reduce the maximum validity of public TLS certificates from 398 days to just 90 days, the volume and velocity of certificate renewals are about to explode.
If your team is still relying on spreadsheets, calendar invites, or basic uptime checks to monitor certificate health, you are operating on borrowed time. This guide explores the modern landscape of certificate health monitoring, the metrics you must track, and how to build a resilient, automated observability stack.
The 90-Day Mandate and the "Automation Trap"
Google's "Moving Forward, Together" initiative proposes a drastic reduction in public TLS certificate lifespans. The era of buying a multi-year certificate, installing it on a load balancer, and forgetting about it is officially dead.
To survive a 90-day lifecycle, organizations are universally adopting the Automated Certificate Management Environment (ACME) protocol, utilizing authorities like Let's Encrypt and automated clients like cert-manager for Kubernetes.
However, this reliance on automation has created a dangerous new blind spot: The Automation Trap.
Engineering teams often assume that because a certificate is configured to auto-renew, it is safe. But automation silently fails all the time. Common causes include:
* Firewall regressions: A network engineer inadvertently blocks Port 80, causing an ACME HTTP-01 validation challenge to fail.
* DNS API Rate Limits: Your cloud provider throttles the API calls required for DNS-01 validation.
* Permissions changes: The IAM role granting your Kubernetes cluster access to AWS Route53 or Azure DNS is modified, breaking the renewal pipeline.
Modern certificate monitoring is no longer just about tracking expiration dates; it is about implementing a "Trust but Verify" architecture. You must monitor for automation failures well before they result in a production outage.
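To make the first failure mode concrete, here is a minimal Python sketch (example.com is a placeholder domain) that verifies the ACME HTTP-01 challenge path is reachable on port 80 from the outside, the same request the CA's validation servers must be able to make:

# http01_reachability.py -- a minimal sketch; "example.com" is a placeholder.
# A 404 is fine (there is no live challenge token); a connection error means
# the CA cannot reach you either, and HTTP-01 renewal will fail.
import sys
import urllib.error
import urllib.request

DOMAIN = "example.com"
URL = f"http://{DOMAIN}/.well-known/acme-challenge/reachability-test"

try:
    urllib.request.urlopen(URL, timeout=10)
    print("OK: challenge path reachable")
except urllib.error.HTTPError as e:
    # Any HTTP response (even 404) proves port 80 is open end to end.
    print(f"OK: challenge path reachable (HTTP {e.code})")
except (urllib.error.URLError, OSError) as e:
    print(f"FAIL: port 80 unreachable -- HTTP-01 renewal will break: {e}")
    sys.exit(1)

Run from a vantage point outside your own firewall; a daily cron job like this catches firewall regressions weeks before the renewal window closes.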
The 5 Certificate Metrics You Must Actually Monitor
To build robust defense-in-depth observability, your monitoring tools must look beyond the basic expiration date. Here are the five critical metrics your stack should track:
1. Absolute Expiration (Days Until Expiry)
This remains the baseline metric. However, the alerting thresholds must adapt to automated lifecycles. If you are using Let's Encrypt (90-day validity), cert-manager will typically attempt to renew the certificate 30 days before expiration.
* Best Practice: Do not alert at 30 days. Instead, set your critical alerts to trigger at 15 days, 7 days, and 1 day. If a certificate drops below 15 days of validity, the automated renewal that should have fired at the 30-day mark has already failed.
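A standalone version of this check is easy to script. The following Python sketch (hostname and threshold are placeholders) performs a live handshake, computes the days of validity remaining, and exits nonzero so it can drive a cron-based alert:

# days_until_expiry.py -- a sketch, not a hardened probe; "example.com" is a placeholder.
import socket
import ssl
import sys
import time

HOST, PORT, THRESHOLD_DAYS = "example.com", 443, 15

ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

# 'notAfter' is a string like 'Jun  1 12:00:00 2025 GMT'
expires = ssl.cert_time_to_seconds(cert["notAfter"])
days_left = (expires - time.time()) / 86400
print(f"{HOST}: {days_left:.1f} days of validity remaining")

# Below 15 days means the 30-day automated renewal has already missed its window.
if days_left < THRESHOLD_DAYS:
    sys.exit(1)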
2. Chain of Trust Validity
A leaf certificate might be perfectly valid, but if an intermediate certificate expires or is misconfigured on the server, modern browsers will throw a NET::ERR_CERT_AUTHORITY_INVALID error. Your monitoring must validate the entire chain up to the Root CA, ensuring no "broken chain" errors exist in production.
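Because Python's ssl module verifies the full chain against the system trust store by default, a short probe can surface broken-chain errors explicitly. A sketch, again with a placeholder host:

# chain_check.py -- a sketch assuming "example.com" as a placeholder endpoint.
import socket
import ssl

HOST, PORT = "example.com", 443

ctx = ssl.create_default_context()  # validates the chain up to a trusted root
try:
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST):
            print(f"{HOST}: full chain validated")
except ssl.SSLCertVerificationError as e:
    # "unable to get local issuer certificate" typically means the server
    # is not sending a required intermediate certificate.
    print(f"{HOST}: chain validation FAILED -- {e.verify_message} (code {e.verify_code})")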
3. Signature Algorithm and Key Length
Compliance frameworks like PCI-DSS v4.0 require strict adherence to modern cryptography. Monitoring systems should actively scan for deprecated algorithms (like SHA-1) or weak keys (like RSA under 2048-bit). As the industry prepares for Post-Quantum Cryptography (PQC) standards recently finalized by NIST, identifying legacy cryptographic primitives in your infrastructure is the first step toward crypto-agility.
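As a quick audit of these properties, the sketch below pulls a live endpoint's leaf certificate and flags deprecated hashes and undersized RSA keys. It assumes the third-party cryptography package (pip install cryptography) and a placeholder hostname:

# crypto_audit.py -- a sketch; "example.com" is a placeholder.
import ssl

from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec, rsa

HOST, PORT = "example.com", 443

pem = ssl.get_server_certificate((HOST, PORT))
cert = x509.load_pem_x509_certificate(pem.encode())

algo = cert.signature_hash_algorithm  # may be None for e.g. Ed25519 signatures
print(f"Signature hash: {algo.name if algo else 'n/a'}")
if algo and algo.name in ("sha1", "md5"):
    print("FAIL: deprecated signature algorithm")

key = cert.public_key()
if isinstance(key, rsa.RSAPublicKey):
    print(f"RSA key size: {key.key_size}")
    if key.key_size < 2048:
        print("FAIL: RSA key below 2048 bits")
elif isinstance(key, ec.EllipticCurvePublicKey):
    print(f"EC curve: {key.curve.name}")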
4. Issuer Name (Detecting Shadow IT)
Developers often spin up rogue infrastructure, bypassing centralized IT. By actively monitoring the Issuer Name on your endpoints and cross-referencing public Certificate Transparency (CT) logs, you can detect unauthorized Certificate Authorities issuing certificates for your domains.
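One lightweight approach is to query a public CT search service and diff the observed issuers against an allowlist. The sketch below uses crt.sh's unofficial JSON endpoint (its URL format and issuer_name field are conventions of that community service, not a guaranteed API), with a placeholder domain and allowlist:

# ct_issuer_check.py -- a sketch against crt.sh's unofficial JSON endpoint;
# "example.com" and ALLOWED_ISSUERS are placeholders.
import json
import urllib.parse
import urllib.request

DOMAIN = "example.com"
ALLOWED_ISSUERS = ("Let's Encrypt", "DigiCert")  # your approved CAs

url = f"https://crt.sh/?q={urllib.parse.quote('%.' + DOMAIN)}&output=json"
with urllib.request.urlopen(url, timeout=30) as resp:
    entries = json.load(resp)

issuers = {entry["issuer_name"] for entry in entries}
for issuer in sorted(issuers):
    ok = any(allowed in issuer for allowed in ALLOWED_ISSUERS)
    print(f"{'OK  ' if ok else 'WARN'} {issuer}")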
5. Revocation Status (OCSP/CRL)
If a private key is compromised, the certificate must be revoked. However, if the server continues to serve the revoked certificate, you remain vulnerable. Advanced monitoring tools must check the Online Certificate Status Protocol (OCSP) or Certificate Revocation Lists (CRL) to ensure the certificate is actively trusted by the CA, not just unexpired.
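Performing an OCSP check by hand takes three steps: read the responder URL from the leaf's Authority Information Access (AIA) extension, build a request over the leaf and its issuer, and POST it to the responder. A sketch using the cryptography and requests packages, assuming both certificates are already on disk (leaf.pem and issuer.pem are placeholder paths):

# ocsp_check.py -- a sketch; requires `pip install cryptography requests`;
# leaf.pem / issuer.pem are placeholder paths for the cert and its issuing CA.
import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID

leaf = x509.load_pem_x509_certificate(open("leaf.pem", "rb").read())
issuer = x509.load_pem_x509_certificate(open("issuer.pem", "rb").read())

# Find the OCSP responder URL in the leaf's AIA extension.
aia = leaf.extensions.get_extension_for_class(x509.AuthorityInformationAccess).value
ocsp_url = next(
    d.access_location.value
    for d in aia
    if d.access_method == AuthorityInformationAccessOID.OCSP
)

request = ocsp.OCSPRequestBuilder().add_certificate(leaf, issuer, hashes.SHA1()).build()
resp = requests.post(
    ocsp_url,
    data=request.public_bytes(serialization.Encoding.DER),
    headers={"Content-Type": "application/ocsp-request"},
    timeout=10,
)
ocsp_resp = ocsp.load_der_ocsp_response(resp.content)
if ocsp_resp.response_status == ocsp.OCSPResponseStatus.SUCCESSFUL:
    print(f"OCSP status: {ocsp_resp.certificate_status}")  # GOOD, REVOKED, or UNKNOWN
else:
    print(f"Responder error: {ocsp_resp.response_status}")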
Technical Implementation: Whitebox vs. Blackbox Monitoring
A resilient observability strategy combines internal metrics (Whitebox) with external validation (Blackbox).
Whitebox Monitoring: Inside the Kubernetes Cluster
If you are running cloud-native infrastructure, you likely use cert-manager. It natively exposes rich Prometheus metrics that give you a direct view into the state of your internal certificate requests.
You can query the certmanager_certificate_expiration_timestamp_seconds metric to find certificates nearing expiration. Here is a practical PromQL alert rule that triggers if an internal certificate will expire in less than 15 days:
groups:
- name: cert-manager-alerts
  rules:
  - alert: CertificateRenewalFailed
    expr: |
      certmanager_certificate_expiration_timestamp_seconds - time() < (15 * 24 * 3600)
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "Certificate {{ $labels.name }} is failing to renew"
      description: "The certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in less than 15 days. ACME renewal has likely failed."
While Whitebox monitoring is excellent for debugging why a renewal failed, it has a fatal flaw: it assumes the certificate successfully made it from the Kubernetes secret to the actual listening Ingress controller or Load Balancer.
Blackbox Monitoring: The External Source of Truth
To verify what your users are actually experiencing, you must simulate client handshakes from the outside. The industry standard for this in the open-source ecosystem is the Prometheus Blackbox Exporter.
By configuring the Blackbox Exporter's tcp probe, you can extract the TLS certificate directly from the live endpoint:
# blackbox.yml configuration
modules:
  tls_check:
    prober: tcp
    tcp:
      tls: true
      tls_config:
        insecure_skip_verify: false
You can then alert on the probe_ssl_earliest_cert_expiry metric:
- alert: EndpointCertificateExpiring
  expr: probe_ssl_earliest_cert_expiry - time() < (15 * 24 * 3600)
  for: 1h
  labels:
    severity: page
  annotations:
    summary: "Public endpoint {{ $labels.instance }} certificate expiring"
Blackbox monitoring is the ultimate safety net. It doesn't care if you use AWS ACM, Azure Key Vault, or a manual installation—it only cares about the cryptographic truth being served on port 443.
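To see the mechanics without wiring up Prometheus scraping, you can query a running Blackbox Exporter directly: its /probe endpoint takes target and module parameters and returns plain-text metrics. A Python sketch, assuming the exporter runs at localhost:9115 with the tls_check module defined above and a placeholder target:

# probe_blackbox.py -- a sketch; assumes a Blackbox Exporter at localhost:9115
# configured with the tls_check module; "example.com:443" is a placeholder.
import re
import time
import urllib.parse
import urllib.request

EXPORTER = "http://localhost:9115/probe"
TARGET = "example.com:443"

url = f"{EXPORTER}?{urllib.parse.urlencode({'module': 'tls_check', 'target': TARGET})}"
with urllib.request.urlopen(url, timeout=30) as resp:
    body = resp.read().decode()

# The exporter returns Prometheus text exposition; extract the expiry metric.
match = re.search(r"^probe_ssl_earliest_cert_expiry\s+(\S+)$", body, re.MULTILINE)
if match:
    days_left = (float(match.group(1)) - time.time()) / 86400
    print(f"{TARGET}: earliest cert in chain expires in {days_left:.1f} days")
else:
    print(f"{TARGET}: probe failed or TLS metrics missing")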
Tool Comparison: Building Your Certificate Observability Stack
Depending on the scale of your infrastructure, you have several paths for tooling. Here is how the landscape breaks down:
1. Heavyweight Enterprise CLM (Venafi, Keyfactor)
For massive enterprises managing hundreds of thousands of machine identities across legacy datacenters and multi-cloud environments, full Certificate Lifecycle Management (CLM) platforms like Venafi or Keyfactor are the gold standard. They provide centralized control planes, policy enforcement, and automated provisioning.
* Pros: Total lifecycle control, deep integration with HSMs (Hardware Security Modules), and strict policy enforcement.
* Cons: Highly complex integration, significant engineering overhead to deploy, and enterprise-tier pricing.
2. Deep Cryptographic Auditing (Testssl.sh, SSL Labs)
When you need to deeply inspect cipher suites, protocol support (e.g., ensuring TLS 1.0/1.1 are disabled), and vulnerability to specific attacks (like Heartbleed or BEAST), tools like Testssl.sh and the SSL Labs API are unparalleled.
* Pros: Incredibly thorough cryptographic analysis.
* Cons: Designed for point-in-time auditing rather than continuous, high-frequency observability. Difficult to integrate into real-time alerting pipelines.
3. Dedicated External Expiry Monitoring (Expiring.at)
For teams that want the ultimate "Trust but Verify" safety net without the overhead of maintaining complex Prometheus Blackbox configurations or paying for heavy CLM enterprise software, dedicated external monitoring is the ideal solution.
Services like Expiring.at act as an independent, external observer. They continuously probe your public endpoints, validating the entire certificate chain, monitoring for impending expirations, and alerting you via Slack, email, or webhooks long before an outage occurs.
* Pros: Zero-configuration setup, independent from your internal infrastructure (no single point of failure), and catches edge cases where internal automation silently fails.
* Cons: Focuses primarily on public-facing endpoints and external validation rather than internal, private PKI issuance.
The Hidden Danger: Internal mTLS and Zero Trust
While much of the focus is on public-facing websites, the outages that cost millions often hide in internal networks.
Modern microservice architectures rely heavily on Zero Trust Architecture (ZTA), which enforces mutual TLS (mTLS) for all service-to-service communication. Service meshes like Istio or Linkerd automatically rotate these internal certificates—sometimes every 24 hours.
If the internal Certificate Authority (often running within the cluster) goes down, or if a specific microservice gets disconnected from the control plane and fails to rotate its mTLS certificate, it will be instantly isolated by the rest of the mesh, since every peer will reject its expired credentials.