Beyond Expiration Dates: A DevOps Guide to Certificate Health in Production
In the modern enterprise, an expired TLS certificate is no longer just an embarrassing oversight; it is a catastrophic single point of failure. According to recent industry reports, 81% of organizations have experienced at least one certificate-related outage in the past 24 months, with the average cost of a severe outage exceeding $300,000 per hour in lost revenue and productivity.
We don't have to look far for real-world examples. In April 2023, an expired ground station certificate triggered a global outage for Starlink users, and in April 2021 Epic Games suffered a cascading backend failure that locked millions of players out of their accounts, all traced back to a single expired internal certificate.
Monitoring certificate health in production has evolved from a basic IT housekeeping task into a critical Site Reliability Engineering (SRE) discipline. Furthermore, the definition of "certificate health" has expanded dramatically. It is no longer just about tracking the notAfter expiration date. Today, DevOps engineers must monitor cryptographic strength, chain of trust validity, revocation status, and protocol compliance.
With impending industry mandates and the rise of zero-trust architectures, manual tracking is officially obsolete. Here is your comprehensive guide to monitoring certificate health in production for 2024 and beyond.
The New Reality: Why Manual Tracking is Dead
If your organization is still tracking TLS certificates in a shared spreadsheet or relying on calendar reminders, you are operating on borrowed time. Two major industry shifts are forcing engineering teams to completely rethink their certificate lifecycle management.
The 90-Day Certificate Mandate
Google’s Chrome Root Program has announced its intention to reduce the maximum validity of public TLS certificates from 398 days to just 90 days. While the exact enforcement date is still pending, the security industry is treating the current landscape as the final preparation window.
When public certificates expire every 90 days, manual renewal stops scaling. Each endpoint needs at least four renewals a year (365 / 90 ≈ 4.06), so a team managing just 100 public-facing endpoints would run a manual renewal, deployment, and validation cycle more than 400 times a year, and more often still if certificates are replaced ahead of expiry.
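The arithmetic behind that figure can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name and the `renew_margin_days` parameter are illustrative assumptions, not part of any tool mentioned in this article.

```python
def renewals_per_year(endpoints: int, validity_days: int, renew_margin_days: int = 0) -> float:
    """Estimate yearly renewal operations for a fleet of certificates.

    renew_margin_days is a hypothetical knob: how many days before expiry
    each certificate is replaced (0 means "renew on the last valid day").
    """
    interval_days = validity_days - renew_margin_days
    if interval_days <= 0:
        raise ValueError("renewal margin must be shorter than the validity period")
    return endpoints * 365 / interval_days


if __name__ == "__main__":
    for validity in (398, 90):
        print(f"{validity}-day certs: {renewals_per_year(100, validity):.0f} renewals/year")
    print(f"90-day certs, renewed 30 days early: {renewals_per_year(100, 90, 30):.0f} renewals/year")
```

For 100 endpoints this works out to roughly 92 renewals a year at 398 days, about 406 at 90 days, and about 608 if each 90-day certificate is replaced 30 days before it expires.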
The Explosion of mTLS and Zero Trust
The transition to Zero Trust Architectures and the widespread adoption of service meshes like Istio and Linkerd rely heavily on mutual TLS (mTLS). This architectural shift has caused an explosion in the sheer volume of internal, short-lived certificates.
The average enterprise now manages hundreds of thousands of internal and external certificates. When microservices require cryptographic identity to communicate, a single expired internal certificate can cascade into a massive system-wide outage. Monitoring must now shift from focusing solely on public-facing load balancers to observing the deep, internal communications of your infrastructure.
Redefining Certificate "Health"
When setting up observability, the first instinct is to monitor the expiration date. While crucial, an unexpired certificate can still be deeply unhealthy and pose a massive security risk. A robust monitoring stack must track the following metrics:
1. Validity Start and End Times
Monitoring notAfter (time to expiry) is the baseline. However, you should also monitor notBefore (validity start). A certificate whose notBefore is still in the future is rejected just like an expired one, so tracking it confirms that newly deployed certificates are actually valid and surfaces subtle clock-skew issues between your servers and your Certificate Authority (CA).
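As a rough illustration of checking both ends of the validity window, the sketch below classifies a certificate from its notBefore and notAfter values. The status names and the five-minute skew tolerance are assumptions chosen for the example; `ssl.cert_time_to_seconds` is the standard-library helper for parsing the date strings that Python's `ssl` module returns.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_cert_time(value: str) -> datetime:
    """Parse a date such as 'Jan  5 09:34:43 2018 GMT' into an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def validity_status(not_before, not_after, now=None, skew=timedelta(minutes=5)):
    """Classify a certificate's validity window relative to `now` (aware UTC datetimes)."""
    now = now or datetime.now(timezone.utc)
    if now > not_after:
        return "expired"
    if now < not_before:
        return "not_yet_valid"
    if now - not_before < skew:
        # Valid by our clock, but a client whose clock runs slightly behind may reject it.
        return "valid_within_skew_window"
    return "valid"
```

A monitor could alert on `not_yet_valid` immediately after a deployment, and treat `valid_within_skew_window` as a hint to issue certificates slightly ahead of rollout.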
2. Full Chain Validation
A common monitoring blind spot is checking only the leaf (endpoint) certificate. If the intermediate or root certificate in your chain of trust expires or is revoked, the endpoint will fail validation regardless of the leaf certificate's health. Your monitoring tools must validate the entire trust chain.
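One way to exercise the whole chain rather than just the leaf is to perform a real TLS handshake with verification enabled. The sketch below uses only Python's standard library; the function names are illustrative, and it assumes the probing host's system trust store is the one you care about, which may differ from what your clients actually trust.

```python
import socket
import ssl


def strict_context() -> ssl.SSLContext:
    """Client context that verifies the presented chain against the system trust store."""
    # create_default_context() enables CERT_REQUIRED and hostname checking.
    return ssl.create_default_context()


def chain_verifies(host: str, port: int = 443, timeout: float = 5.0):
    """Return (ok, detail); ok is False when certificate verification fails.

    Network errors (DNS failure, refused connection, timeout) are deliberately
    left to propagate so they are not mistaken for a certificate problem.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with strict_context().wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        return False, exc.verify_message
```

Calling `chain_verifies("example.com")` against a server with an expired or missing intermediate would be expected to return `False` along with the verifier's message, even when the leaf certificate itself is still within its validity window.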
3. Cryptographic Strength and Protocol Versions
With the National Institute of Standards and Technology (NIST) having finalized its first Post-Quantum Cryptography (PQC) standards (FIPS 203, 204, and 205), organizations are increasingly expected to build "crypto-agility" into their monitoring. You should actively scan for and flag deprecated algorithms (like SHA-1 signatures or RSA keys smaller than 2048 bits). Furthermore, your monitors should alert you if an endpoint negotiates a deprecated protocol such as TLS 1.0 or 1.1; treat TLS 1.2 as the strict minimum and prefer TLS 1.3.
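The checks described here can be reduced to a small policy function. The sketch below is a hedged illustration: the thresholds mirror the ones named above, while the function name, the input format, and the idea of feeding it values already extracted from a certificate and handshake are assumptions made for the example.

```python
DEPRECATED_SIGNATURE_HASHES = {"md5", "sha1"}
MIN_RSA_BITS = 2048
MIN_TLS = "TLSv1.2"

# Rank protocol names (as reported by Python's SSLSocket.version()) from oldest to newest.
TLS_RANK = {"SSLv3": 0, "TLSv1": 1, "TLSv1.1": 2, "TLSv1.2": 3, "TLSv1.3": 4}


def crypto_findings(signature_hash: str, key_type: str, key_bits: int, negotiated_tls: str) -> list:
    """Return a list of human-readable policy violations (empty means compliant)."""
    findings = []
    if signature_hash.lower() in DEPRECATED_SIGNATURE_HASHES:
        findings.append(f"deprecated signature hash: {signature_hash}")
    if key_type.upper() == "RSA" and key_bits < MIN_RSA_BITS:
        findings.append(f"RSA key too small: {key_bits} bits")
    # Unknown protocol names are flagged too, so new or odd values fail closed.
    if TLS_RANK.get(negotiated_tls, -1) < TLS_RANK[MIN_TLS]:
        findings.append(f"protocol below {MIN_TLS}: {negotiated_tls}")
    return findings
```

Keeping the policy in data (the sets and constants at the top) rather than in code paths is one small step toward crypto-agility: tightening the rules later means editing a constant, not rewriting the check.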
4. Revocation Blind Spots
If a private key is compromised, the certificate must be revoked. However, if your monitoring stack doesn't actively check Online Certificate Status Protocol (OCSP) stapling or Certificate Revocation Lists (CRLs), a compromised-but-unexpired certificate will still appear "healthy" on your dashboards.
Architecting a Production-Grade Monitoring Stack
To achieve total visibility, you need a defense-in-depth approach that utilizes both outside-in (synthetic endpoint monitoring) and inside-out (agent-based and cluster-native) monitoring.
Outside-In Monitoring with Prometheus Blackbox Exporter
For external-facing endpoints and APIs, the Prometheus Blackbox Exporter is the industry standard. It allows you to probe endpoints over HTTPS and scrape rich TLS metrics.
First, configure your blackbox.yml to enforce strict TLS verification during the probe:
modules:
  http_2xx_strict_tls:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      fail_if_ssl: false
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: false  # Enforce chain validation
Once Prometheus scrapes the Blackbox exporter, you can visualize the health of your endpoints in Grafana. The most critical PromQL query you will use is:
probe_ssl_earliest_cert_expiry{job="blackbox"} - time()
This query calculates the exact number of seconds remaining until the earliest certificate in the chain expires, allowing you to build highly accurate countdown dashboards.
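To make the "earliest certificate in the chain" idea concrete, here is a minimal Python sketch of the same calculation. It assumes you already have the notAfter timestamps (Unix seconds) for every certificate in a chain; the function name is illustrative and this is not how the exporter itself is implemented.

```python
import time


def seconds_until_earliest_expiry(chain_not_after, now=None):
    """Seconds left before the first certificate in the chain expires.

    chain_not_after: iterable of notAfter values as Unix timestamps,
    one per certificate (leaf, intermediates, root).
    """
    now = time.time() if now is None else now
    return min(chain_not_after) - now


if __name__ == "__main__":
    now = 1_700_000_000
    leaf = now + 80 * 86400          # leaf still has 80 days
    intermediate = now + 10 * 86400  # intermediate has only 10 days
    remaining = seconds_until_earliest_expiry([leaf, intermediate], now=now)
    print(f"{remaining / 86400:.0f} days until the chain breaks")
```

The point of taking the minimum is that a freshly renewed leaf offers no protection if an intermediate is about to lapse.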
Inside-Out Monitoring with cert-manager
For internal Kubernetes environments, cert-manager is the undisputed standard. It automates the issuance and renewal of certificates from various issuers (including Let's Encrypt and HashiCorp Vault) and exposes native Prometheus metrics out of the box.
Instead of probing endpoints over the network, cert-manager tells you the exact state of the Certificate resources living inside your cluster. You can track internal expiration using this PromQL query:
certmanager_certificate_expiration_timestamp_seconds - time()
By combining Blackbox Exporter metrics with cert-manager metrics, you cover both your public perimeter and your internal microservices.
Defeating Alert Fatigue with Smart Routing
One of the most common reasons certificate outages still happen is alert fatigue. If an SRE gets paged at 3:00 AM because a certificate expires in 30 days, they will eventually mute the alerting channel. When the certificate actually expires, no one notices.
To prevent this, you must implement a tiered alerting strategy in Alertmanager. Alerts should escalate based on urgency, routing to different channels.
Here is an example of a tiered Prometheus alerting rule configuration:
groups:
  - name: certificate_health
    rules:
      # The expressions divide by 86400, so $value is in days, not seconds.
      # Tier 1: Warning (30 Days) - Route to Slack/Jira
      - alert: CertificateExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in less than 30 days."
          description: "Plan routine maintenance to renew the certificate."
      # Tier 2: Critical (7 Days) - Route to Email/High-Priority Slack
      - alert: CertificateExpiringCritical
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        labels:
          severity: critical
        annotations:
          # humanizeDuration expects seconds; print the day count directly instead.
          summary: 'URGENT: Certificate for {{ $labels.instance }} expires in {{ $value | printf "%.1f" }} days.'
          description: "Immediate action required. Automated renewal may have failed."
      # Tier 3: Pager (48 Hours) - Route to PagerDuty/OpsGenie
      - alert: CertificateExpiringPager
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 2
        labels:
          severity: page
        annotations:
          summary: "PAGER: Certificate for {{ $labels.instance }} expires in less than 48 hours!"
          description: "Wake someone up. Outage imminent."
By segmenting your alerts, you ensure that routine warnings are handled during business hours, while imminent outages trigger immediate incident response.
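One detail worth noting is that the three thresholds overlap: a certificate with less than 48 hours left also satisfies the 7-day and 30-day conditions. In Alertmanager this is typically handled with inhibition rules so that only the most urgent alert notifies. The selection logic itself can be sketched in Python as follows; the tier table mirrors the thresholds above, and the function name is an assumption for illustration.

```python
from typing import Optional

# (threshold in days, severity), ordered from most to least urgent.
TIERS = [(2, "page"), (7, "critical"), (30, "warning")]


def severity_for(seconds_remaining: float) -> Optional[str]:
    """Return the single most urgent tier that applies, or None if no alert is due."""
    days = seconds_remaining / 86400
    for threshold_days, severity in TIERS:
        if days < threshold_days:
            return severity
    return None
```

Because the tiers are checked from most to least urgent, an already-expired certificate (negative time remaining) maps to `page` rather than to a lower tier.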
Industry Best Practices for 2024 and Beyond
As you mature your certificate monitoring capabilities, keep these industry best practices in mind to maintain compliance with frameworks like PCI-DSS v4.0 and the EU's Digital Operational Resilience Act (DORA), which applies from January 2025.
- Shift-Left Certificate Management: Integrate certificate linting and validation directly into your CI/CD pipelines. Do not allow deployments of containers with misconfigured keystores or deprecated ciphers. Tools like SSLyze can be run as CLI checks during your build process.
- ACME Everywhere: Adopt the Automated Certificate Management Environment (ACME) protocol universally. Do not restrict ACME to public Let's Encrypt certificates; use it for your internal PKI as well.
- Continuous Discovery: Shadow IT is a massive risk. Developers often spin up undocumented infrastructure with rogue certificates. Integrate Certificate Transparency (CT) log monitoring and automated network scanning to find unmanaged endpoints before they expire silently.
Centralizing Visibility with Expiring.at
Building and maintaining a custom Prometheus, Alertmanager, and Grafana stack for certificate monitoring requires significant engineering time. For teams looking to eliminate the overhead of managing complex observability infrastructure while still escaping the "spreadsheet trap," dedicated lifecycle monitoring platforms are the ideal solution.
This is where Expiring.at becomes an invaluable tool for your DevOps and security teams. Rather than writing custom PromQL queries and managing YAML configurations, Expiring.at provides a centralized, single source of truth for your entire certificate fleet.
With Expiring.at, you can:
* Automate Discovery and Tracking: Easily monitor domains, SSL/TLS certificates, and critical endpoints without deploying internal agents.
* Implement Smart Alerting: Leverage out-of-the-box tiered alerting that integrates seamlessly with the tools your team already uses, including Slack, email, and custom webhooks for incident management platforms.
* Gain Comprehensive Health Insights: Look beyond simple expiration dates with automated checks that ensure your infrastructure remains healthy, compliant, and secure.
By centralizing your certificate observability, you free up your engineering team to focus on building features rather than chasing down expiring infrastructure.
Conclusion
The era of static, long-lived certificates managed by manual calendar reminders is over. As maximum lifespans shrink toward 90 days and internal microservices multiply the number of certificates you have to track, automated monitoring is no longer a luxury; it is a survival tactic.
By redefining what certificate "health" means, implementing strict outside-in and inside-out observability, and leveraging intelligent tiered alerting, you can protect your organization from costly, embarrassing outages. Whether you build a custom Prometheus stack or rely on purpose-built platforms like Expiring.at to handle the heavy lifting, the time to automate your certificate lifecycle monitoring is right now.