Beyond Expiration: A Modern Guide to Monitoring Certificate Health in Production


Tim Henrich
November 18, 2025

In February 2020, a single expired authentication certificate took Microsoft Teams offline for users worldwide. In April 2023, an expired ground-station certificate caused a global outage of Starlink's satellite internet service. These events are not isolated anomalies; they are symptoms of a growing challenge. Certificate management has become a business-critical function where failure results in public outages, security breaches, and significant damage to brand reputation.

The days of manually tracking certificate expiration dates in a spreadsheet are over. The industry's push towards 90-day certificate lifespans, the explosion of microservices and ephemeral infrastructure, and the looming threat of quantum computing have transformed certificate health into a complex, high-stakes discipline.

This guide provides a modern, actionable framework for monitoring certificate health in production. We'll move beyond simple expiry checks and explore how to build a robust, automated strategy that ensures reliability, security, and crypto-agility for your organization. You'll learn how to combine automated discovery, proactive monitoring with tools like Prometheus, and configuration validation to prevent outages before they happen.

The New Reality of Certificate Management

The landscape of digital certificates is changing rapidly. What worked five years ago is now a recipe for disaster. Three major trends are forcing engineering and security teams to rethink their entire approach.

The 90-Day Countdown is Here

Google has been a vocal proponent of reducing the maximum validity of public TLS certificates to 90 days, and the CA/Browser Forum has since gone further. In April 2025 it approved ballot SC-081v3, which cuts the maximum validity in stages: 200 days from March 2026, 100 days from March 2027, and 47 days from March 2029. This shift renders manual renewal processes untenable. An organization with hundreds or thousands of certificates cannot keep up with a 90-day cycle, let alone a shorter one, without end-to-end automation. The change forces the adoption of protocols like ACME (Automated Certificate Management Environment), the engine behind services like Let's Encrypt.
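To put rough numbers on that claim, here is a back-of-the-envelope sketch in Python. The fleet size of 1,000 certificates and the policy of renewing once two-thirds of the lifetime has elapsed are illustrative assumptions, not figures from any particular organization.

```python
def renewals_per_year(cert_count: int, lifetime_days: int,
                      renew_fraction: float = 2 / 3) -> float:
    """Estimate yearly renewal operations for a certificate fleet.

    Assumes every certificate is renewed once `renew_fraction` of its
    lifetime has elapsed (an illustrative policy, not a standard).
    """
    renewal_interval_days = lifetime_days * renew_fraction
    return cert_count * 365 / renewal_interval_days


if __name__ == "__main__":
    # 398 days is the long-standing maximum for public TLS certificates;
    # 90 days is the lifetime discussed in this section.
    for lifetime in (398, 90):
        yearly = renewals_per_year(1000, lifetime)
        print(f"{lifetime}-day certs: ~{yearly:,.0f} renewals/year "
              f"(~{yearly / 365:.1f} per day)")
```

Under these assumptions, moving from 398-day to 90-day certificates takes the same fleet from a handful of renewals per day to more than a dozen, every day of the year, which is the point at which a manual process stops being realistic.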

Infrastructure and Certificates as Code

Modern infrastructure is declarative and ephemeral. In environments like Kubernetes, services are spun up and torn down constantly. Manually issuing and installing a certificate for each new pod is impossible.

The solution is "Certificates-as-Code." Tools like cert-manager, the de facto standard for Kubernetes, allow you to define certificate requirements as version-controlled YAML files. The cluster's control loop then works to automatically issue, renew, and configure the certificate, shifting responsibility "left" into the hands of developers while platform teams set the governing policies.
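As a rough illustration of what such a declaration contains, the sketch below builds a cert-manager Certificate resource in Python and prints it as JSON, which kubectl accepts in place of YAML. The resource name, namespace, hostname, and the ClusterIssuer called letsencrypt-prod are hypothetical placeholders; the duration and renewal window are example values, not recommendations.

```python
import json


def certificate_manifest(name: str, namespace: str,
                         dns_names: list[str], issuer: str) -> dict:
    """Build a cert-manager Certificate resource as a plain dict."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "secretName": f"{name}-tls",  # Secret that will hold the key pair
            "dnsNames": dns_names,
            "duration": "2160h",          # requested lifetime: 90 days
            "renewBefore": "720h",        # start renewal 30 days before expiry
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


if __name__ == "__main__":
    # All argument values here are hypothetical.
    manifest = certificate_manifest(
        "web", "prod", ["app.example.com"], "letsencrypt-prod"
    )
    print(json.dumps(manifest, indent=2))
```

The output can be committed to version control and applied like any other manifest; in practice most teams write the equivalent YAML by hand or template it with their existing tooling.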

The Looming Quantum Threat

Post-Quantum Cryptography (PQC) is no longer a distant academic concept. NIST published its first finalized standards for quantum-resistant algorithms (FIPS 203, 204, and 205) in August 2024, and CAs are expected to begin offering post-quantum or hybrid certificates as client support matures. Certificate health monitoring must evolve to include checks for cryptographic compliance. Your monitoring system will need to identify legacy, non-PQC-compliant certificates and track the rollout of new ones to ensure a smooth transition without breaking clients.

The High Cost of Failure: Lessons from Real-World Outages

To understand the importance of robust monitoring, we need only look at recent high-profile incidents. These aren't just cautionary tales; they are blueprints of what not to do.

  • Microsoft Teams (2020): This outage wasn't caused by a lack of knowledge, but by a failure of process and visibility at scale. It shows that even the most sophisticated technology companies can be brought down by a single oversight. The key lesson is that your monitoring and automation must cover every endpoint, internal and external, not just your primary web servers.

  • Starlink (2023): The Starlink outage, traced to an expired ground-station certificate, demonstrated that certificate health is critical for core infrastructure, not just websites. For IoT, hardware, and embedded systems, certificate rotation is far more complex. This incident underscores the need for long-term planning and robust, automated processes that are tested well in advance.

  • Equifax (2017): This remains the foundational security lesson. The breach was not caused by an expired certificate, but an expired certificate on an internal traffic inspection device created a massive blind spot: with inspection of encrypted traffic effectively disabled, attacker traffic moved through the network undetected for months. Certificate health is therefore a core security function; one expired certificate can silently disable a critical security control.

Building a Multi-Layered Monitoring Strategy

A comprehensive monitoring strategy doesn't rely on a single tool or process. It's a multi-layered approach that combines discovery, proactive alerting, and deep configuration validation.

Layer 1: Automated Discovery and Inventory

You can't monitor what you don't know you have. The first step is to eliminate "shadow IT" certificates by creating a comprehensive, continuously updated inventory.

A robust discovery process combines two methods:

  1. Internal Scanning: Use network scanning tools to probe your internal IP ranges for services running on common TLS ports (443, 8443, 636, etc.). A simple Nmap script can get you started:
    # Scan a subnet for hosts with port 443 open and run the ssl-cert script
    nmap -p 443 --open -sV -oN nmap_scan.txt --script ssl-cert 192.168.1.0/24

  2. External Certificate Transparency (CT) Log Monitoring: Certificate Transparency logs are public, append-only records of all publicly trusted certificates issued by CAs. By monitoring these logs for your domains, you can discover every certificate issued, including those created by other teams or potentially malicious actors. You can manually search logs using tools like crt.sh, but automation is key.
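A minimal sketch of automating the CT lookup in Python follows. It assumes crt.sh's JSON output (the output=json query parameter and the name_value field) keeps behaving as it does at the time of writing; crt.sh does not publish a versioned API contract, so treat those details as assumptions. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def crtsh_url(domain: str) -> str:
    """Query URL for logged certificates under `domain` ('%' is a wildcard)."""
    query = urllib.parse.urlencode({"q": f"%.{domain}", "output": "json"})
    return "https://crt.sh/?" + query


def hostnames_from_entries(entries: list[dict]) -> set[str]:
    """Collect unique hostnames from crt.sh JSON entries.

    A single entry's `name_value` may hold several newline-separated names.
    """
    names = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            names.add(name.strip().lower())
    names.discard("")
    return names


def discover(domain: str) -> set[str]:
    """Fetch and parse log entries (network call; crt.sh can be slow)."""
    with urllib.request.urlopen(crtsh_url(domain), timeout=30) as resp:
        return hostnames_from_entries(json.load(resp))


if __name__ == "__main__":
    for host in sorted(discover("example.com")):
        print(host)
```

Running this on a schedule and diffing the result against your inventory is a simple way to surface hostnames nobody registered with the platform team.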

Services like Expiring.at automate this entire discovery process. By simply adding your domains, the platform continuously monitors CT logs and probes your subdomains to build a complete inventory of your public-facing certificates, giving you a single source of truth without any complex setup.

Layer 2: Proactive Expiration Monitoring with Prometheus

Once you have an inventory, you need to monitor each certificate for impending expiration. Prometheus, a leading open-source monitoring system, is perfectly suited for this task when paired with the blackbox_exporter. The blackbox_exporter allows Prometheus to probe endpoints over various protocols, including HTTPS, and collect metrics about them.

Here’s how to set it up:

Step 1: Configure the blackbox_exporter

In your blackbox.yml configuration file, define a module for probing TLS certificates.

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [] # An empty list means the default: only 2xx responses pass
      method: GET
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false # Ensure we validate the certificate chain

Step 2: Configure Prometheus to Scrape the Exporter

In your prometheus.yml, add a scrape job that tells Prometheus to use the blackbox_exporter to probe your target domains.

# prometheus.yml
scrape_configs:
  - job_name: 'blackbox_https_expiringat'
    metrics_path: /probe
    params:
      module: [http_2xx] # Use the module defined in blackbox.yml
    static_configs:
      - targets:
        - https://expiring.at
        - https://app.expiring.at
        - https://docs.expiring.at
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115 # The address of the blackbox_exporter
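To make the relabeling concrete, the following Python sketch builds the request Prometheus effectively sends to the exporter for one target and pulls the expiry metric out of the response text. The exporter address and module name come from the configuration above; the helper names are hypothetical, and the parser assumes the metric is exposed without labels, as the exporter's /probe endpoint does.

```python
import urllib.parse
from typing import Optional


def probe_url(exporter: str, module: str, target: str) -> str:
    """URL requested after relabeling: the target becomes a query parameter
    and the scrape address is swapped for the exporter's address."""
    query = urllib.parse.urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


def earliest_cert_expiry(metrics_text: str) -> Optional[float]:
    """Return probe_ssl_earliest_cert_expiry (Unix seconds), if present."""
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "probe_ssl_earliest_cert_expiry":
            return float(parts[1])
    return None


if __name__ == "__main__":
    print(probe_url("blackbox-exporter:9115", "http_2xx", "https://expiring.at"))
```

Opening the printed URL with curl against a running exporter is a quick way to confirm a module works before wiring it into Prometheus.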

Step 3: Create Alerting Rules

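The blackbox_exporter exposes a key metric: probe_ssl_earliest_cert_expiry. Its value is the Unix timestamp at which the earliest-expiring certificate in the served chain expires, so subtracting time() gives the number of seconds remaining. You can use PromQL (Prometheus Query Language) to create tiered alerts on that difference.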

Add the following rules to your alerting configuration file:

# alert_rules.yml
groups:
- name: SSLCertificates
  rules:
  - alert: SSLCertificateExpiresInLessThan30Days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate for {{ $labels.instance }} will expire in less than 30 days."
      description: "The SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}. Please plan for renewal."

  - alert: SSLCertificateExpiresInLessThan7Days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "SSL certificate for {{ $labels.instance }} will expire in less than 7 days!"
      description: "The SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}. RENEW IMMEDIATELY."

This setup will create a low-priority warning 30 days out and escalate to a high-priority critical alert when a certificate is within 7 days of expiration, giving your team ample time to react.
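The same tiering can be reproduced outside Prometheus, for example in a one-off script or a CI check. The sketch below is a hypothetical helper, not part of any tool mentioned here; it takes a certificate's notAfter string in the format Python's ssl module reports and returns the highest matching severity, using the 30-day and 7-day thresholds from the rules above.

```python
import ssl
import time
from typing import Optional

DAY = 86400  # seconds


def severity(not_after: str, now: Optional[float] = None) -> str:
    """Classify a certificate by time remaining until expiry.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    e.g. 'Jan  5 09:34:43 2018 GMT'. `now` is injectable for testing.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    remaining = expiry - (time.time() if now is None else now)
    if remaining < 7 * DAY:    # also covers certificates already expired
        return "critical"
    if remaining < 30 * DAY:
        return "warning"
    return "ok"
```

Unlike the Prometheus rules, where the warning and critical alerts both fire inside the 7-day window, this helper reports only the most severe tier.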

Layer 3: Validating Configuration and Trust Chains

Expiration is just one aspect of certificate health. A valid certificate served with a weak configuration is a major security risk. Your monitoring should also validate:

  • Protocol Support: Alert if servers support outdated and insecure protocols like SSLv3, TLS 1.0, or TLS 1.1.
  • Cipher Suites: Identify the use of weak cipher suites (e.g., those using RC4 or 3DES).
  • Trust Chain Integrity: Ensure the server provides the complete certificate chain, including all necessary intermediate certificates. While most modern browsers can fetch missing intermediates, many non-browser clients and older devices cannot, leading to trust failures.
  • OCSP Stapling: Verify that the server staples a valid, fresh OCSP response, which improves performance and privacy for clients.

You can use the API from Qualys SSL Labs for periodic deep analysis of public-facing endpoints or incorporate open-source tools like testssl.sh into your CI/CD pipelines to scan internal services before they are deployed.
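For a lightweight check between deeper scans, a sketch using only Python's standard library is shown below. It is deliberately limited: it reports what was negotiated with this client's default settings, not everything the server would accept, so it complements rather than replaces a scanner like testssl.sh. The marker list and function names are illustrative assumptions.

```python
import socket
import ssl

# Substrings that indicate a weak suite in OpenSSL-style cipher names.
WEAK_CIPHER_MARKERS = ("RC4", "3DES", "DES-CBC3", "NULL", "EXPORT", "MD5")
LEGACY_PROTOCOLS = {"SSLv3", "TLSv1", "TLSv1.1"}


def findings(protocol: str, cipher_name: str) -> list[str]:
    """Flag a legacy protocol or weak cipher in a negotiated session."""
    problems = []
    if protocol in LEGACY_PROTOCOLS:
        problems.append(f"legacy protocol negotiated: {protocol}")
    if any(marker in cipher_name.upper() for marker in WEAK_CIPHER_MARKERS):
        problems.append(f"weak cipher negotiated: {cipher_name}")
    return problems


def check_endpoint(host: str, port: int = 443) -> list[str]:
    """Connect with full verification and inspect the negotiated session.

    Python does not download missing intermediates, so an incomplete
    chain surfaces here as ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cipher_name, _, _ = tls.cipher()
            return findings(tls.version(), cipher_name)


if __name__ == "__main__":
    print(check_endpoint("expiring.at") or "no findings")
```

Because the client refuses to fetch intermediates, a verification failure from this script against an endpoint that loads fine in a browser is a strong hint that the server is serving an incomplete chain.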

Best Practices for Holistic Certificate Health

Tools are only part of the solution. A resilient certificate management strategy is built on a foundation of strong processes and clear ownership.

  1. Automate the Entire Lifecycle: Strive for zero-touch automation. This includes discovery, issuance, deployment, renewal, and revocation. Manual intervention should be reserved for emergencies only.
  2. Centralize Visibility: While Prometheus is excellent for metrics, a dedicated dashboard is crucial for at-a-glance visibility. A platform like Expiring.at provides a centralized inventory, intuitive expiration timelines, and proactive notifications via Slack, email, and webhooks, aggregating data from multiple sources into a single pane of glass.
  3. Establish Clear Ownership: Every certificate must have a designated owner—a team, not an individual. Use tags and metadata in your inventory to track ownership, application context, and contact information. When an alert fires, there should be no ambiguity about who is responsible for taking action.
  4. **Conduct "
