The 90-Day Countdown: A DevOps Guide to Automated SSL Certificate Monitoring
The landscape of SSL/TLS certificate management is undergoing a seismic shift. For years, IT teams have scraped by using sprawling spreadsheets, calendar reminders, and ad-hoc scripts to track certificate expirations. Today, that approach is not just inefficient; at current certificate volumes and shrinking lifetimes, it is operationally unsustainable.
In 2023, Google signaled in its "Moving Forward, Together" roadmap an intention to reduce the maximum validity of public TLS certificates from 398 days to just 90 days. The CA/Browser Forum has since gone further: Ballot SC-081v3, adopted in April 2025, phases the maximum down to 200 days in March 2026, 100 days in March 2027, and 47 days in March 2029. Even at 90 days, certificates expire more than four times faster than under the 398-day limit. If your organization relies on human intervention to renew and deploy certificates, you are sitting on a ticking time bomb.
Certificate outages remain a leading cause of preventable downtime. The average enterprise now manages over 250,000 machine identities, a number growing by 20% annually. When these identities expire unexpectedly, the results can be catastrophic.
This guide explores the modern best practices for SSL/TLS certificate expiration monitoring, providing DevOps engineers, security professionals, and IT administrators with actionable strategies and technical implementations to eliminate certificate-related outages permanently.
The True Cost of Certificate Failure
Before diving into the technical solutions, it is crucial to understand the stakes. According to recent industry data, 80% of organizations have experienced at least one outage caused by an expired certificate in the past 24 months. For large enterprises, the cost of a certificate-related outage is estimated at $300,000 per hour, factoring in lost revenue, SLA penalties, and severe brand damage.
Even the most sophisticated engineering teams fall victim to this. Recent high-profile incidents highlight the fragility of manual certificate management:
- RubyGems (2024): The Ruby community package manager suffered a massive outage due to an expired Fastly TLS certificate, preventing developers globally from downloading critical dependencies.
- Starlink (2023/2024): SpaceX’s satellite internet service experienced a global outage caused by an expired ground station certificate, locking users out of the network.
- Cisco Meraki: Several instances of expired certificates have caused severe connectivity issues for enterprise edge devices, highlighting the specific dangers of embedded certificates in IoT and hardware.
Epic Games, Spotify, and Microsoft Teams have all suffered massive, highly publicized outages due to expired certificates. The lesson is clear: manual tracking only reacts after something has been missed, while automated, proactive monitoring is the only reliable cure.
Best Practice 1: Establish a Single Source of Truth
The most common reason certificates expire unexpectedly is "Shadow IT." A developer spins up a quick cloud instance, provisions a certificate using a personal email address, and moves on. A year later, the developer leaves the company, the certificate expires, and a critical internal service goes down.
To combat this, organizations must establish a Single Source of Truth (SSOT) for all certificates—public, private, cloud, and on-premises.
Continuous Discovery
Do not rely on manual entry to populate your SSOT. Implement continuous network scanning across port 443 and monitor Certificate Transparency (CT) logs to discover certificates provisioned outside official channels. Integration with your load balancers (such as HAProxy or F5) and web servers is essential to automatically discover and log certificates as they are deployed.
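As a minimal sketch of what a discovery scanner records for each endpoint, the Python below reads a certificate's notAfter field and converts it to days remaining. The function names are hypothetical, and the sketch assumes the standard library's default TLS context, which rejects certificates that are already expired or untrusted, so a real scanner would need to handle those handshake failures separately.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter value.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jun  1 12:00:00 2030 GMT".
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the leaf certificate's notAfter string.

    Raises ssl.SSLError (or a subclass) if the certificate does not validate.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Hypothetical inventory; in practice this list would come from the SSOT.
    for hostname in ["example.com"]:
        remaining = days_until_expiry(
            fetch_not_after(hostname), datetime.now(timezone.utc)
        )
        print(f"{hostname}: {remaining:.1f} days remaining")
```

Keeping the date arithmetic separate from the network call makes the expiry logic easy to test without touching a live endpoint.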
Best Practice 2: Implement Tiered Expiration Alerting
Alert fatigue is the silent killer of observability. If your monitoring system triggers a high-priority PagerDuty incident for a certificate that expires in 60 days, engineers will quickly learn to ignore those alerts. When a certificate is actually three days away from expiration, the alert will be lost in the noise.
Modern Site Reliability Engineering (SRE) demands a tiered alerting strategy. You should never send a critical page for an event that is weeks away, but you also cannot afford to stay silent.
Implement the following tiered escalation path:
- 60 Days (Informational): Send an automated email to the service owner and the security team. This provides ample time to review the certificate, especially if it requires manual validation or a complex deployment pipeline.
- 30 Days (Warning): Send a notification to the responsible team's Slack or Microsoft Teams channel. This elevates the visibility of the impending expiration to the broader team.
- 15 Days (Action Required): Create a high-priority ticket in Jira, ServiceNow, or your preferred issue tracker. This ensures the renewal is officially added to a sprint or work queue.
- 7 Days (Critical Incident): Trigger a PagerDuty or OpsGenie alert to the on-call engineer. At this point, automation has failed, standard workflows have been ignored, and immediate human intervention is required to prevent an outage.
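The escalation path above can be sketched as a small lookup. This is an illustration only: the tier names and channel labels are hypothetical placeholders, and the thresholds are taken directly from the list.

```python
from typing import Optional, Tuple

# (threshold in days, tier name, delivery channel), most urgent first.
# Names and channels are illustrative placeholders, not a real integration.
TIERS = [
    (7, "critical", "pager"),
    (15, "action_required", "ticket"),
    (30, "warning", "chat"),
    (60, "informational", "email"),
]


def alert_tier(days_remaining: float) -> Optional[Tuple[str, str]]:
    """Return (tier, channel) for the most urgent tier reached, or None."""
    for threshold, name, channel in TIERS:
        if days_remaining <= threshold:
            return name, channel
    return None  # More than 60 days left: stay silent.
```

Ordering the table from most to least urgent means the first match is always the highest applicable tier, so a certificate with three days left pages on-call rather than sending another email.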
Using a dedicated monitoring platform like Expiring.at allows you to easily configure these tiered webhooks and notifications without having to build custom alerting logic from scratch.
Best Practice 3: Shift-Left and Automate Everything
The ultimate goal of expiration monitoring in the modern era is not to remind a human to renew a certificate, but to monitor the success of your automated renewals.
The Automated Certificate Management Environment (ACME) protocol, popularized by Let's Encrypt and standardized as RFC 8555, is now the de facto standard. Most commercial Certificate Authorities (CAs) support ACME for enterprise automation. Your DevOps teams should be integrating certificate provisioning directly into CI/CD pipelines and Infrastructure as Code (IaC).
Kubernetes Automation with cert-manager
For containerized environments, cert-manager is the industry standard for Kubernetes. It automatically provisions certificates and stores them in Kubernetes Secrets, which Pods and Ingress resources then reference.
Here is an example of how you can define a Certificate resource in Kubernetes that automatically handles its own lifecycle:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: prod-api-certificate
  namespace: production
spec:
  secretName: prod-api-tls-secret
  duration: 2160h # 90 days
  renewBefore: 360h # 15 days
  subject:
    organizations:
      - My Enterprise Corp
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  dnsNames:
    - api.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
In this setup, cert-manager will automatically attempt to renew the certificate 15 days before it expires. Your external monitoring tools should be configured to alert you if this certificate drops below the 14-day mark, which indicates that the cert-manager automation has silently failed.
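To make the arithmetic concrete, here is a small sketch of the renewal window implied by the manifest's duration (2160h) and renewBefore (360h) values. The function names and the one-day grace period are assumptions chosen to match the 14-day mark mentioned above; this is not cert-manager's own code.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, duration: timedelta,
                 renew_before: timedelta) -> datetime:
    """Moment at which renewal is expected to start: expiry minus renewBefore."""
    return not_before + duration - renew_before


def renewal_overdue(not_after: datetime, renew_before: timedelta,
                    now: datetime,
                    grace: timedelta = timedelta(days=1)) -> bool:
    """True if the old certificate is still being served after the renewal
    window opened plus a grace period (assumed here to be one day)."""
    return now > not_after - renew_before + grace


# Values from the manifest: 2160h = 90 days, 360h = 15 days.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
duration = timedelta(hours=2160)
renew_before = timedelta(hours=360)
expires = issued + duration
```

With a 15-day renewBefore and a one-day grace period, the check starts returning True once fewer than 14 days remain on a certificate that should already have been replaced.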
Best Practice 4: Observability and Infrastructure Monitoring
To catch the failures of your automated systems, you need robust observability. For teams utilizing open-source stacks, Prometheus combined with the Blackbox Exporter is the industry standard for monitoring endpoint health and SSL certificate status.
Prometheus Blackbox Exporter Implementation
The Blackbox Exporter allows you to probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. When probing an HTTPS endpoint, it automatically extracts the SSL certificate metrics.
The critical metric to track is probe_ssl_earliest_cert_expiry, which returns the Unix timestamp at which the earliest-expiring certificate in the chain presented by the endpoint will expire.
You can create a powerful PromQL alert rule to trigger your tiered alerting system. Here is a real-world example of an alert that fires when a certificate has less than 15 days of validity remaining:
groups:
  - name: ssl_expiry_alerts
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "SSL Certificate expiring soon on {{ $labels.instance }}"
          description: "The SSL certificate for {{ $labels.instance }} will expire in less than 15 days. Automated renewal may have failed."
(Note: 86400 is the number of seconds in a day. Multiplying by 15 gives us the 15-day threshold).
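The same comparison can be reproduced outside Prometheus, for example in a reporting script. The sketch below assumes you have already run an instant query for probe_ssl_earliest_cert_expiry against the Prometheus HTTP API and decoded the JSON response; the function name and the sample payload are illustrative, not real data.

```python
def expiring_instances(api_response: dict, now: float,
                       threshold_days: float = 15) -> dict:
    """Return {instance: days_remaining} for series below the threshold.

    `api_response` is a decoded instant-query response whose samples hold
    the certificate expiry as a Unix timestamp.
    """
    expiring = {}
    for series in api_response["data"]["result"]:
        _, value = series["value"]  # [evaluation time, sample value as string]
        seconds_left = float(value) - now
        if seconds_left < 86400 * threshold_days:
            expiring[series["metric"]["instance"]] = seconds_left / 86400
    return expiring


# Illustrative payload in the shape of an instant-query vector result.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "https://a.example.com"},
             "value": [1700000000, "1700864000"]},
            {"metric": {"instance": "https://b.example.com"},
             "value": [1700000000, "1703456000"]},
        ],
    },
}
```

Here the first endpoint has 10 days left and is reported, while the second has 40 days left and is filtered out, mirroring the `< 86400 * 15` condition in the alert rule.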
While Prometheus is powerful, managing the Blackbox exporter, maintaining the PromQL rules, and routing the alerts can be overhead-intensive. This is where specialized tools like Expiring.at provide immediate value, offering out-of-the-box monitoring and alerting for SSL endpoints without the need to maintain complex infrastructure.
Best Practice 5: Monitor More Than Just Expiration
A comprehensive certificate monitoring strategy looks beyond the expiration date. A valid certificate utilizing deprecated cryptography is just as dangerous as an expired one.
Security and Compliance Checks
Your monitoring solution must validate the entire TLS configuration. This includes:
- Weak Cipher Suites and Protocols: Ensure that your endpoints are not negotiating connections using deprecated protocols like TLS 1.0 or TLS 1.1.
- Revocation Status: Monitor the Online Certificate Status Protocol (OCSP) and Certificate Revocation Lists (CRLs). A certificate might be perfectly valid by date, but revoked by the CA due to a key compromise.
- Impending CA Root Expirations: It is not just your leaf certificates that expire. If the root or intermediate CA certificate expires, the entire chain of trust is broken.
- Wildcard Certificate Risks: Security teams are rapidly moving away from Wildcard certificates (*.example.com). If a wildcard private key is compromised, every subdomain is vulnerable. Automation makes it easy to provision specific Subject Alternative Name (SAN) certificates instead. Monitor your environments to flag and phase out legacy wildcard usage.
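Two of the checks above lend themselves to a short sketch: flagging wildcard SAN entries and flagging deprecated protocol versions. The helper names are hypothetical. The input shapes are those of Python's ssl module, where SSLSocket.getpeercert() returns a dict with a subjectAltName tuple and SSLSocket.version() returns a string such as "TLSv1.2". Note that a modern client may refuse to negotiate TLS 1.0 or 1.1 at all, so actively probing for them usually needs a dedicated scanner.

```python
# Protocol strings as reported by ssl.SSLSocket.version().
DEPRECATED_PROTOCOLS = {"SSLv2", "SSLv3", "TLSv1", "TLSv1.1"}


def wildcard_names(peer_cert: dict) -> list:
    """DNS SAN entries that begin with '*.' in a getpeercert()-style dict."""
    return [
        value
        for kind, value in peer_cert.get("subjectAltName", ())
        if kind == "DNS" and value.startswith("*.")
    ]


def is_deprecated_protocol(version: str) -> bool:
    """True if the negotiated protocol version is one that should be retired."""
    return version in DEPRECATED_PROTOCOLS


# Illustrative certificate data, not taken from a real endpoint.
example_cert = {
    "subjectAltName": (
        ("DNS", "*.example.com"),
        ("DNS", "api.example.com"),
        ("IP Address", "192.0.2.1"),
    )
}
```

Running these helpers over every entry in the certificate inventory gives a simple report of where wildcard certificates and legacy protocols are still in use.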
Meeting these standards is not just about best practices; it is about compliance. The latest Payment Card Industry Data Security Standard (PCI DSS v4.0) requires strict management of cryptography, including an inventory of the keys and certificates that protect cardholder data in transit. Expired certificates or the use of deprecated TLS versions are likely to surface as audit findings. Furthermore, the National Institute of Standards and Technology (NIST) publishes detailed guidance in NIST SP 1800-16, a practice guide for securing web transactions and managing TLS server certificates.
Preparing for the Future: Post-Quantum Cryptography (PQC)
Looking beyond shorter certificate lifespans, the next massive hurdle for certificate management is Post-Quantum Cryptography (PQC). In August 2024, NIST released its first finalized PQC standards (FIPS 203, 204, and 205), which specify algorithms designed to withstand attacks from future quantum computers.
Organizations must begin auditing their certificate inventories immediately to prepare for the massive migration from RSA and ECC algorithms to quantum-safe certificates. "Crypto-agility"—the ability to rapidly swap out cryptographic algorithms and certificates