Stop the Outages: A DevOps Guide to SSL/TLS Certificate Monitoring in the 90-Day Era
In the modern infrastructure landscape, a single SSL/TLS certificate nearing expiry is a ticking time bomb. Historically, managing these cryptographic assets was a tedious but manageable chore relegated to calendar reminders and shared spreadsheets. Today, in an era of highly distributed microservices, multi-cloud deployments, and edge computing, manual tracking is not just inefficient; it is a critical point of failure.
According to a recent State of Machine Identity Report, a staggering 80% of organizations have experienced at least one outage due to an expired certificate in the past 24 months. With enterprise IT downtime costing an estimated $300,000 per hour, the financial and reputational damage of a dropped certificate is immense. High-profile outages at companies like Epic Games, Spotify, and even Starlink’s global satellite network all share the same mundane root cause: an expired SSL certificate that slipped through the cracks.
The industry is currently undergoing a massive paradigm shift from reactive certificate tracking to Automated Certificate Lifecycle Management (CLM). However, automation without visibility is dangerous. Monitoring is no longer just about tracking expiration dates; it is the ultimate failsafe when your automation breaks.
Here is a comprehensive guide to modern SSL certificate monitoring best practices, the shifting industry standards, and how to build a resilient monitoring stack.
The 90-Day Mandate: Why Your Strategy is Already Obsolete
If your organization is still relying on manual renewals, your infrastructure is on a collision course with reality. Google’s "Moving Forward, Together" initiative has proposed reducing the maximum validity of public TLS certificates from 398 days to just 90 days.
While the exact enforcement date for Chrome is pending, the industry is already treating 90-day lifespans as the de facto standard. When certificates expire every three months, manual renewal across hundreds or thousands of endpoints becomes impractical for a human team to sustain.
Furthermore, the adoption of Zero Trust Architecture (ZTA) requires mutual TLS (mTLS) for every machine, microservice, and workload. This explodes the volume of certificates required. In a modern Kubernetes environment, you aren't just tracking your public-facing www.example.com; you are tracking thousands of ephemeral certificates securing East-West traffic between internal pods.
The Anatomy of a Certificate Outage
To build a robust monitoring strategy, DevOps and security teams must understand why certificate outages still happen despite the widespread adoption of automation tools like Let's Encrypt and ACME protocols.
The "Renewed vs. Deployed" Trap
This is the most common cause of "unexplainable" certificate outages. Your automated ACME client (like Certbot) successfully negotiates with the Certificate Authority, downloads the new certificate, and saves it to the disk. Your internal monitoring sees the new file and reports everything is fine.
However, the web server (Nginx, Apache, or Envoy) was never restarted or reloaded. It continues to serve the old, expiring certificate from its memory cache. When the expiration date hits, the site goes down, even though a valid certificate sits idle on the server's hard drive. Monitoring must check the live endpoint, not just the file on disk.
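One way to catch this trap is to compare the fingerprint of the certificate on disk with the fingerprint of the certificate the endpoint actually serves. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes the file on disk is PEM-encoded with the leaf certificate first (as in a typical full-chain file).

```python
import hashlib
import socket
import ssl

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def first_pem_block(pem_text: str) -> str:
    """Return the first certificate block of a PEM file (assumed to be the leaf)."""
    start = pem_text.index(PEM_BEGIN)
    end = pem_text.index(PEM_END, start) + len(PEM_END)
    return pem_text[start:end]


def fingerprint(der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()


def disk_fingerprint(path: str) -> str:
    """Fingerprint of the leaf certificate stored in a PEM file."""
    with open(path, encoding="ascii") as handle:
        leaf_pem = first_pem_block(handle.read())
    return fingerprint(ssl.PEM_cert_to_DER_cert(leaf_pem))


def served_fingerprint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fingerprint of the certificate the live endpoint presents."""
    ctx = ssl.create_default_context()
    # Validation is disabled on purpose: we want to read the certificate
    # even if it is expired or otherwise untrusted.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return fingerprint(tls.getpeercert(binary_form=True))


def is_deployed(path: str, host: str, port: int = 443) -> bool:
    """True when the server is serving the same certificate that is on disk."""
    return disk_fingerprint(path) == served_fingerprint(host, port)
```

A call such as `is_deployed("/etc/letsencrypt/live/example.com/fullchain.pem", "example.com")` (a hypothetical path and host) returning `False` after a renewal suggests the web server has not reloaded the new certificate.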
The Chain of Trust Failure
Monitoring tools often focus exclusively on the leaf (server) certificate. But TLS relies on a chain of trust. If an Intermediate CA or Root CA certificate expires or is revoked, clients can no longer validate the leaf certificate, even if the leaf itself is still within its validity period, triggering browser warnings and dropped connections. Monitoring must therefore track every certificate in the served chain.
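In practice, the effective expiry of an endpoint is the earliest expiry anywhere in its chain. The sketch below illustrates that idea with the standard library's `ssl.cert_time_to_seconds`; the chain dictionary and its dates are made-up sample data, and collecting the real `notAfter` values for each certificate in the chain is left to whatever tooling you already use.

```python
import ssl
from typing import Dict, Tuple


def weakest_link(chain: Dict[str, str]) -> Tuple[str, int]:
    """Return (name, expiry as Unix seconds) of the chain member expiring first.

    `chain` maps a label to a notAfter string in the format that
    SSLSocket.getpeercert() uses, e.g. "Jan 15 00:00:00 2026 GMT".
    """
    name = min(chain, key=lambda label: ssl.cert_time_to_seconds(chain[label]))
    return name, ssl.cert_time_to_seconds(chain[name])


# Hypothetical sample chain: the intermediate expires before the leaf.
sample_chain = {
    "leaf": "Mar  1 00:00:00 2026 GMT",
    "intermediate": "Jan 15 00:00:00 2026 GMT",
    "root": "Jun  1 00:00:00 2035 GMT",
}
name, expires_at = weakest_link(sample_chain)
print(f"First to expire: {name}")
```

Alerting on the result of `weakest_link` rather than on the leaf alone means an expiring intermediate is caught on the same schedule as an expiring server certificate.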
The Wildcard Blast Radius
Wildcard certificates (*.company.com) are heavily discouraged in modern security postures. If a single wildcard certificate expires, or worse, its private key is compromised, every subdomain that uses it is affected at once. Monitoring tools must flag wildcards as high-risk assets requiring aggressive alerting.
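Flagging wildcards can be as simple as inspecting the Subject Alternative Names of each certificate in your inventory. Here is a small sketch that works on the dictionary returned by Python's `SSLSocket.getpeercert()`; the function name and the sample certificate are illustrative.

```python
from typing import List


def wildcard_names(cert: dict) -> List[str]:
    """Return the wildcard DNS names listed in a certificate's SAN extension.

    `cert` is expected in the shape returned by ssl.SSLSocket.getpeercert(),
    where "subjectAltName" is a sequence of (type, value) pairs.
    """
    return [
        value
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS" and value.startswith("*.")
    ]


# Hypothetical certificate data for illustration.
sample_cert = {
    "subjectAltName": (
        ("DNS", "company.com"),
        ("DNS", "*.company.com"),
        ("DNS", "api.company.com"),
    )
}
print(wildcard_names(sample_cert))
```

Any certificate for which this list is non-empty could be tagged as high-risk and routed to a tighter alerting schedule.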
5 Core Best Practices for Modern SSL Monitoring
To prevent catastrophic downtime, organizations must adopt a multi-layered approach to certificate observability.
1. Ditch the Spreadsheet for Continuous Discovery
You cannot monitor what you cannot see. Shadow IT is a massive vulnerability; a developer might spin up a temporary staging server, secure it with a one-off certificate, and forget about it. Use continuous network scanning to build a dynamic, centralized inventory of all certificates across your IP ranges.
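To make the idea of continuous discovery concrete, here is a rough sketch of a scanner that walks a set of CIDR ranges, attempts a TLS handshake on each address, and records the fingerprint of whatever certificate it finds. The function names, port list, and worker count are assumptions chosen for illustration; a production scanner would also need rate limiting, SNI handling, and permission to scan the ranges in question.

```python
import hashlib
import ipaddress
import socket
import ssl
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Iterator, Optional, Tuple

Target = Tuple[str, int]


def scan_targets(cidrs: Iterable[str], ports: Iterable[int] = (443, 8443)) -> Iterator[Target]:
    """Expand CIDR ranges into (ip, port) pairs to probe."""
    ports = tuple(ports)
    for cidr in cidrs:
        for ip in ipaddress.ip_network(cidr).hosts():
            for port in ports:
                yield str(ip), port


def probe(target: Target) -> Optional[str]:
    """Return the SHA-256 fingerprint of the served certificate, or None."""
    host, port = target
    ctx = ssl.create_default_context()
    # Discovery wants to see every certificate, valid or not.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=2) as sock:
            with ctx.wrap_socket(sock) as tls:
                der = tls.getpeercert(binary_form=True)
    except OSError:  # covers refused connections, timeouts, and TLS errors
        return None
    return hashlib.sha256(der).hexdigest() if der else None


def discover(cidrs: Iterable[str], ports: Iterable[int] = (443,), workers: int = 32) -> Dict[Target, str]:
    """Probe all targets concurrently and return those that served a certificate."""
    targets = list(scan_targets(cidrs, ports))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(probe, targets))
    return {t: fp for t, fp in zip(targets, results) if fp}
```

Running `discover(["10.0.0.0/24"])` (a placeholder range) on a schedule and diffing the result against the previous run is one simple way to surface certificates nobody registered.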
2. Monitor from Multiple Vantage Points
Relying on a single source of truth is dangerous. A robust strategy requires monitoring from two distinct angles:
* Internal Monitoring: Checking the file system, secrets management tools, or internal load balancers to ensure automation is successfully fetching new certificates.
* External/Synthetic Monitoring: Probing the endpoint from the public internet to ensure the actually served certificate is valid and correctly configured.
3. Implement Tiered, Actionable Alerting
Alert fatigue is a real danger. If an engineer receives a PagerDuty call at 3:00 AM for a certificate expiring in 25 days, they will eventually mute the alerts. Implement tiered escalation:
* 30 Days Out: Informational alert (Slack/Microsoft Teams notification, automated Jira ticket creation).
* 15 Days Out: Warning alert (Direct email to the specific service owner).
* 7 Days Out: Critical alert (PagerDuty or Opsgenie routing to the on-call engineer, treating it as an impending outage).
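The tiers above translate naturally into a small lookup. The sketch below maps days remaining to a severity and a destination; the tier names and the routing strings are placeholders standing in for whatever integrations your team actually uses.

```python
from typing import Optional

# (threshold in days, tier), checked from most to least urgent.
TIERS = [(7, "critical"), (15, "warning"), (30, "info")]

# Placeholder routing targets mirroring the escalation list above.
ROUTES = {
    "info": "chat notification + ticket",
    "warning": "email to service owner",
    "critical": "page the on-call engineer",
}


def alert_tier(days_left: float) -> Optional[str]:
    """Return the most urgent tier whose threshold has been reached, or None."""
    for threshold, tier in TIERS:
        if days_left <= threshold:
            return tier
    return None


for days in (45, 25, 10, 3):
    tier = alert_tier(days)
    print(days, tier, ROUTES.get(tier, "no alert"))
```

With this mapping, a certificate 25 days from expiry produces only an informational notification, so nobody is paged at 3:00 AM for it.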
4. Track Certificate Transparency (CT) Logs
Modern monitoring doesn't just ping your known endpoints; it also watches public Certificate Transparency (CT) logs. CT logs are append-only, cryptographically verifiable ledgers of publicly trusted TLS certificates. By monitoring these logs for your domains, you can quickly detect rogue certificates obtained by malicious actors or requested by unauthorized internal teams, which helps you catch spoofing and shadow IT early.
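As a rough illustration, the sketch below pulls CT entries for a domain and flags any issued by a CA you do not expect. It assumes the public crt.sh search service and its JSON output, including the `issuer_name` field; treat the URL format and field names as assumptions to verify, and note that the allowlist contents are placeholders.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable, List


def fetch_ct_entries(domain: str) -> List[dict]:
    """Fetch CT log entries for a domain (assumes crt.sh's JSON output)."""
    query = urllib.parse.urlencode({"q": domain, "output": "json"})
    with urllib.request.urlopen("https://crt.sh/?" + query, timeout=30) as resp:
        return json.load(resp)


def unexpected_issuers(entries: Iterable[dict], allowed_issuers: Iterable[str]) -> List[dict]:
    """Return entries whose issuer matches none of the allowed substrings."""
    allowed = tuple(allowed_issuers)
    return [
        entry
        for entry in entries
        if not any(name in entry.get("issuer_name", "") for name in allowed)
    ]


# Hypothetical entries shaped like the assumed JSON output.
sample_entries = [
    {"name_value": "www.example.com", "issuer_name": "C=US, O=Let's Encrypt, CN=R3"},
    {"name_value": "vpn.example.com", "issuer_name": "C=XX, O=Unknown CA, CN=Mystery"},
]
for entry in unexpected_issuers(sample_entries, ["Let's Encrypt"]):
    print("Unexpected issuer for", entry["name_value"])
```

In a real deployment you would replace `sample_entries` with `fetch_ct_entries("example.com")` and send anything returned by `unexpected_issuers` to a human for review.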
5. Monitor Cryptographic Health (PQC Readiness)
Expiration is only one metric. NIST finalized its first Post-Quantum Cryptography (PQC) standards in August 2024 (FIPS 203, 204, 205), so organizations must work toward "crypto-agility." Your monitoring tools must track deprecated protocol versions (TLS 1.0/1.1), weak cipher suites, insufficient key lengths (RSA < 2048), and the specific cryptographic algorithms in use to prepare for the eventual migration to quantum-safe algorithms.
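A basic health check can record what a connection actually negotiates. The sketch below uses Python's standard `ssl` module to read the protocol version and cipher, then flags obvious weaknesses. The function names and the 128-bit threshold are assumptions for illustration. Two caveats: a modern client will usually refuse TLS 1.0/1.1 outright, so this reports what was negotiated rather than everything the server supports, and checking RSA key length requires parsing the certificate with a dedicated library, which is out of scope here.

```python
import socket
import ssl
from typing import List, Tuple

DEPRECATED_PROTOCOLS = {"SSLv2", "SSLv3", "TLSv1", "TLSv1.1"}
MIN_CIPHER_BITS = 128  # illustrative threshold, not an official standard


def negotiated(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[str, tuple]:
    """Return (protocol version, cipher tuple) negotiated with a live endpoint."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()


def findings(version: str, cipher: tuple) -> List[str]:
    """List weaknesses in a negotiated (version, cipher) pair.

    `cipher` has the shape returned by SSLSocket.cipher():
    (cipher name, protocol, secret bits).
    """
    name, _protocol, bits = cipher
    problems = []
    if version in DEPRECATED_PROTOCOLS:
        problems.append(f"deprecated protocol {version}")
    if bits < MIN_CIPHER_BITS:
        problems.append(f"weak cipher {name} ({bits} bits)")
    return problems


print(findings("TLSv1.1", ("ECDHE-RSA-AES128-SHA", "TLSv1.0", 128)))
```

Storing the output of `negotiated` alongside each certificate in your inventory gives you a starting point for a cryptographic inventory when migration planning begins.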
Technical Implementation: Building Your Monitoring Stack
How do you actually implement this visibility? Here are three concrete approaches depending on your infrastructure.
The Open-Source Standard: Prometheus & Blackbox Exporter
For teams heavily invested in the cloud-native ecosystem, Prometheus paired with the blackbox_exporter is the industry standard for external endpoint monitoring.
First, configure your blackbox.yml to probe over HTTPS:
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
      fail_if_ssl: false
      fail_if_not_ssl: true
Then, configure Prometheus to scrape your endpoints and evaluate the probe_ssl_earliest_cert_expiry metric. You can write a PromQL alerting rule to trigger when a certificate has less than 15 days (1,296,000 seconds) remaining:
groups:
  - name: ssl_expiry
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 15
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate for {{ $labels.instance }} expires in less than 15 days"
          description: "The SSL certificate for {{ $labels.instance }} will expire soon. Ensure automated renewal is functioning."
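To see what that expression computes, it can help to reproduce the arithmetic by hand. The sketch below parses the text a blackbox_exporter probe returns, extracts `probe_ssl_earliest_cert_expiry`, and applies the same 15-day comparison. The function names are illustrative, the sample output is fabricated for the example, and the exporter address in `fetch_probe` (localhost on port 9115) is an assumption about your deployment.

```python
import time
import urllib.parse
import urllib.request
from typing import Optional


def fetch_probe(target: str, module: str = "http_2xx", exporter: str = "http://localhost:9115") -> str:
    """Fetch raw probe metrics from a blackbox_exporter instance (assumed address)."""
    query = urllib.parse.urlencode({"target": target, "module": module})
    with urllib.request.urlopen(f"{exporter}/probe?{query}", timeout=15) as resp:
        return resp.read().decode("utf-8")


def parse_metric(text: str, name: str) -> Optional[float]:
    """Return the value of an unlabelled metric from Prometheus text output."""
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comments
        parts = line.split()
        if len(parts) == 2 and parts[0] == name:
            return float(parts[1])
    return None


def expiring_soon(expiry_epoch: float, now: float, days: int = 15) -> bool:
    """Mirror of the rule: probe_ssl_earliest_cert_expiry - time() < 86400 * days."""
    return expiry_epoch - now < 86400 * days


# Fabricated sample of probe output for illustration.
sample_output = """# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.8e+09
probe_success 1
"""
expiry = parse_metric(sample_output, "probe_ssl_earliest_cert_expiry")
print(expiring_soon(expiry, now=time.time()))
```

This is only a debugging aid; in normal operation Prometheus evaluates the rule for you on every scrape.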
Cloud-Native Automation: Kubernetes cert-manager
If you are running Kubernetes, cert-manager is the undisputed king of Automated Certificate Lifecycle Management. It handles the ACME challenges and provisions certificates as Kubernetes Secrets.
cert-manager natively exposes Prometheus metrics. Instead of probing from the outside, you can monitor the internal health of your certificate resources directly using the certmanager_certificate_expiration_timestamp_seconds metric.
- alert: CertManagerCertExpiring
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < (86400 * 15)
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Kubernetes Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in less than 15 days; renewal may be failing."
The Quick Sanity Check: Bash & OpenSSL
For legacy systems, edge devices, or quick troubleshooting, nothing beats the raw power of OpenSSL. You can easily script a check to pull the expiration date directly from a live server:
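#!/bin/bash
# Report how many days remain on the certificate a live server presents.
DOMAIN="example.com"
PORT="443"
# Fetch the served leaf certificate and extract its notAfter date
EXPIRY_DATE=$(echo | openssl s_client -servername "$DOMAIN" -connect "$DOMAIN:$PORT" 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
if [ -z "$EXPIRY_DATE" ]; then
  echo "Could not retrieve a certificate from $DOMAIN:$PORT" >&2
  exit 1
fi
# Convert to Unix timestamp for comparison (date -d requires GNU date)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))
echo "Certificate for $DOMAIN expires in $DAYS_LEFT days."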
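If you need the same check on a platform without GNU date, a standard-library Python version is one option. This is a minimal sketch with illustrative function names. Unlike the OpenSSL pipeline, it validates the certificate during the handshake, so an already-expired or untrusted certificate raises `ssl.SSLCertVerificationError` instead of returning a negative number.

```python
import socket
import ssl
import time
from typing import Optional


def cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter string of the certificate a live server presents."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now: Optional[float] = None) -> int:
    """Whole days between `now` (default: current time) and a notAfter string."""
    now = time.time() if now is None else now
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


# Offline example with fixed, made-up dates.
reference = ssl.cert_time_to_seconds("Jan  1 00:00:00 2026 GMT")
print(days_left("Jan 11 00:00:00 2026 GMT", now=reference))
```

Against a live endpoint, the equivalent of the shell script is `days_left(cert_not_after("example.com"))`.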
The Role of Dedicated Monitoring Solutions
While building a Prometheus stack is powerful, it requires significant engineering overhead to maintain, configure, and scale. Furthermore, internal monitoring tools often suffer from the "watching the watchers" problem: if your internal network goes down, your alerts go down with it.
This is where a dedicated, external vantage point becomes critical. Platforms like Expiring.at are designed specifically to bridge the gap between basic uptime pingers and heavy, complex enterprise CLM deployments.