Surviving the 90-Day Cert Lifespan: Automating Kubernetes PKI in 2025
Managing certificates in Kubernetes has historically been treated as a best-effort operational chore—a task relegated to calendar reminders and frantic manual updates. Today, that approach is a guaranteed recipe for catastrophic cluster failure.
The security landscape has fundamentally shifted. Google's "Moving Forward, Together" proposal called for cutting the maximum lifespan of public TLS certificates from 398 days to just 90 days, and the industry is bracing for that change. In a Kubernetes environment where ephemeral pods scale dynamically and machine identities are reported to outnumber human identities by 45 to 1, manual Public Key Infrastructure (PKI) management is no longer just inefficient; it is operationally untenable.
Whether you are securing North-South traffic at your Ingress, enforcing East-West Zero Trust with a service mesh, or preparing for the looming transition to Post-Quantum Cryptography (PQC), automation is your only path forward. In this guide, we will explore the critical challenges of Kubernetes certificate management, dissect the tools required to automate your PKI, and establish resilient monitoring practices to ensure your clusters never suffer a silent outage.
The "Silent Outage" Epidemic in Kubernetes
According to recent state-of-machine-identity reports, a staggering 81% of organizations have experienced at least one certificate-related outage in the past 24 months. In Kubernetes, these outages are rarely loud, localized failures. They are cascading, silent disasters.
The Danger of Shadow PKI
Developers, eager to deploy microservices quickly, often bypass central IT to spin up self-signed certificates or utilize unauthorized Certificate Authorities (CAs). This creates "Shadow PKI"—a sprawling, untracked network of cryptographic identities. When these undocumented certificates inevitably expire, services silently drop connections, leaving platform teams scrambling to locate the source of the failure across thousands of containers.
The Webhook Expiration Trap
A uniquely devastating Kubernetes problem lies in Validating and Mutating Admission Webhooks. These webhooks intercept requests to the Kubernetes API server prior to persistence, allowing administrators to enforce policies or mutate pod specifications.
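Because the API server must communicate with these webhooks securely, they require TLS certificates. If a webhook's serving certificate expires, the API server can no longer complete the TLS handshake and the call fails. When the webhook's failurePolicy is Fail (the default in admissionregistration.k8s.io/v1), every request the webhook matches is rejected. The result? If the webhook matches pods, no pods can be created or updated. Your cluster is effectively frozen, yet basic health checks might still show the nodes as "Ready."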
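One common way to avoid this trap is to let cert-manager issue the webhook's serving certificate and have its cainjector component keep the webhook's caBundle in sync. The fragment below is a minimal sketch under assumptions: cert-manager and cainjector are installed, and every name in it (policy-webhook, policy-system, policy-webhook-cert, selfsigned-issuer, the /validate path) is hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: policy-webhook-cert          # hypothetical name
  namespace: policy-system           # hypothetical namespace
spec:
  # Secret that the webhook Deployment mounts for its TLS listener
  secretName: policy-webhook-tls
  dnsNames:
    # The API server calls the webhook through its Service DNS name
    - policy-webhook.policy-system.svc
  issuerRef:
    name: selfsigned-issuer          # hypothetical Issuer in the same namespace
    kind: Issuer
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-webhook
  annotations:
    # cainjector copies the CA of the Certificate above (namespace/name) into caBundle
    cert-manager.io/inject-ca-from: policy-system/policy-webhook-cert
webhooks:
  - name: validate.policy.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Fail closed: matched requests are rejected if the webhook is unreachable
    failurePolicy: Fail
    clientConfig:
      service:
        name: policy-webhook
        namespace: policy-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
```

Automated issuance only helps if the webhook process actually reloads the renewed certificate from the mounted Secret, so that behavior is worth verifying for each webhook you run.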
Control Plane Failures
Platform engineers often focus entirely on workload certificates while neglecting the control plane. Tools like kubeadm default to 1-year lifespans for critical internal certificates, such as those used by kube-apiserver and etcd. kubeadm renews these certificates during control plane upgrades, so the clusters at risk are the ones that go more than a year without an upgrade. Failing to rotate these certificates results in the total collapse of the control plane.
You can manually verify your control plane certificates at any time using:
kubeadm certs check-expiration
However, relying on manual CLI checks is a ticking time bomb.
North-South vs. East-West: Securing the Entire Traffic Flow
A common misconception in Kubernetes security is that encrypting public-facing traffic is sufficient. Modern compliance and Zero Trust architectures require a decoupled, comprehensive approach to both external and internal traffic.
North-South Traffic (Ingress)
North-South traffic represents data flowing into your cluster from the outside world. Securing this requires public trust. You must utilize public CAs like Let's Encrypt via the Automated Certificate Management Environment (ACME) protocol. This ensures that browsers and external APIs inherently trust the connection.
East-West Traffic (mTLS)
East-West traffic is the communication between pods and microservices inside your cluster. Securing this traffic requires Mutual TLS (mTLS), where both the client and the server cryptographically verify each other's identity.
For East-West traffic, public CAs are inappropriate. Instead, you need a robust internal private CA—such as HashiCorp Vault or AWS Private CA—to issue short-lived certificates (often valid for just hours) to your workloads. This aligns with the SPIFFE (Secure Production Identity Framework for Everyone) standard, which provides universal workload identity beyond simple, easily spoofed IP addresses.
The Gold Standard: Automating with cert-manager
The CNCF project cert-manager has become the de facto standard for Kubernetes certificate lifecycle automation. It extends the Kubernetes API by introducing Custom Resource Definitions (CRDs) that treat certificates as first-class citizens.
The core architecture relies on three primary resources:
1. Issuer / ClusterIssuer: Defines how to obtain a certificate (e.g., via ACME, Vault, or Venafi).
2. Certificate: Defines the desired state and lifecycle of a certificate.
3. CertificateRequest: The actual Certificate Signing Request (CSR) sent to the Issuer.
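To make the relationship between these resources concrete, here is a minimal sketch of a Certificate. It assumes a ClusterIssuer named example-issuer already exists; that name, the namespace, the hostname, and the renewal window are all illustrative rather than taken from a real deployment.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-cert           # illustrative name
  namespace: default               # illustrative namespace
spec:
  # Secret where cert-manager stores the signed certificate and its private key
  secretName: app-example-tls
  dnsNames:
    - app.example.com
  # Begin renewal 30 days before expiry (illustrative value)
  renewBefore: 720h
  issuerRef:
    name: example-issuer           # assumed to exist
    kind: ClusterIssuer
    group: cert-manager.io
```

When a Certificate like this is applied, cert-manager creates a CertificateRequest on your behalf, passes it to the referenced issuer, and writes the signed result into the named Secret. Running kubectl describe certificate app-example-cert shows the current status and any issuance errors.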
Implementation: ACME with Let's Encrypt
To automate your Ingress certificates, you first define a ClusterIssuer. This resource sits at the cluster level and can fulfill certificate requests for any namespace.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # The ACME server URL
    server: https://acme-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: security@yourdomain.com
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    # Enable the HTTP-01 challenge provider
    solvers:
      - http01:
          ingress:
            class: nginx
Once the ClusterIssuer is active, you simply annotate your Ingress resources. cert-manager will automatically detect the annotation, generate the CSR, solve the ACME challenge, and populate the resulting TLS certificate into a Kubernetes Secret.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  namespace: production
  annotations:
    # Trigger cert-manager to provision the certificate
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.yourdomain.com
      # cert-manager will automatically create and rotate this secret
      secretName: api-gateway-tls
  rules:
    - host: api.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 443
With this Infrastructure as Code (IaC) approach, your certificate configuration is version-controlled in Git, while the certificates and private keys themselves live only in cluster Secrets. Rotation happens entirely without human intervention.
Securing East-West Traffic with Service Meshes
While cert-manager is brilliant for Ingress and foundational PKI, managing individual mTLS certificates for thousands of pods manually is unscalable. This is where Service Meshes like Istio and Linkerd shine.
A service mesh deploys a lightweight proxy as a "sidecar" container inside every meshed pod (Envoy in Istio, linkerd2-proxy in Linkerd). The mesh control plane acts as an internal CA. It issues workload certificates, delivers them to the sidecars (in Istio, via Envoy's Secret Discovery Service (SDS) API), and rotates them automatically.
A Critical Warning: During migration, teams often run their mesh in a permissive mode (PERMISSIVE in Istio; in Linkerd, a default policy that still admits unauthenticated traffic) to prevent breaking legacy plaintext traffic. Permissive mode accepts both plaintext and encrypted traffic. Several high-profile security incidents have occurred because teams left meshes in Permissive mode indefinitely, allowing malicious internal actors to intercept unencrypted pod-to-pod traffic. "Permissive" is a migration state, not a resting state. Always enforce "Strict" mTLS once your migration is complete.
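For Istio, enforcing strict mTLS across the mesh comes down to a single small resource. The sketch below assumes Istio is installed with istio-system as its root namespace, which is the default but can be changed at install time.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  # A policy named "default" in the root namespace applies mesh-wide
  name: default
  namespace: istio-system      # assumption: the mesh root namespace
spec:
  mtls:
    # Reject plaintext; only mutual TLS connections are accepted
    mode: STRICT
```

Applying the same resource in a single application namespace first, and promoting it to the root namespace later, is a lower-risk way to finish the migration one team at a time. Newer Istio releases also serve this resource under security.istio.io/v1.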
Compliance Mandates Driving Automation
The push for automated Kubernetes PKI isn't just a best practice; it is rapidly becoming a legal requirement. Several major regulatory frameworks are forcing the hands of IT administrators:
- DORA (Digital Operational Resilience Act): Applicable in the EU from 17 January 2025, DORA mandates strict ICT risk management for financial entities. Manual certificate rotation makes it hard to demonstrate the "operational resilience" against preventable outages that DORA expects.
- NIS2 Directive: This EU cybersecurity directive requires risk-management measures that include access control and policies on cryptography and encryption. Leaving internal Kubernetes traffic unencrypted is difficult to justify under those requirements.
- PCI-DSS v4.0: Fully enforceable as of 31 March 2025, this standard calls for a maintained inventory of the keys and certificates that protect cardholder data in transit, along with stronger cryptography. Kubernetes environments in scope should be able to show that cardholder data is encrypted both at the edge (Ingress) and internally (mTLS).
Furthermore, now that NIST has finalized the first Post-Quantum Cryptography (PQC) standards (FIPS 203, 204, and 205, published in August 2024), organizations must achieve "crypto-agility." If you cannot automatically rotate your current RSA/ECC certificates today, you will be entirely incapable of swapping them for quantum-resistant algorithms tomorrow.
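In practice, crypto-agility means the key algorithm is a declared setting rather than something baked into a manual process. The cert-manager Certificate below is a sketch of that idea under assumptions: the internal-ca ClusterIssuer, the payments namespace, and the service name are hypothetical, and the lifetimes are illustrative.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-api-mtls          # hypothetical name
  namespace: payments              # hypothetical namespace
spec:
  secretName: payments-api-mtls
  # Short lifetime: every workload picks up a new algorithm within a day
  duration: 24h
  renewBefore: 8h
  dnsNames:
    - payments-api.payments.svc
  usages:
    - server auth
    - client auth
  privateKey:
    # Changing these two fields is the whole migration for this workload
    algorithm: ECDSA
    size: 384
    # Generate a fresh key on every renewal instead of reusing the old one
    rotationPolicy: Always
  issuerRef:
    name: internal-ca              # hypothetical private-CA ClusterIssuer
    kind: ClusterIssuer
```

As far as this sketch assumes, cert-manager's privateKey.algorithm accepts RSA, ECDSA, and Ed25519, so quantum-resistant keys would depend on future support in both cert-manager and the issuing CA. The point is that once rotation is automated, an algorithm change becomes a reviewed one-line edit rather than a fleet-wide manual project.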
Proactive Monitoring: Don't Rely on CA Emails
Automation is the goal, but blind trust in automation is a vulnerability. ACME challenges fail. Webhook configurations drift. CAs experience rate-limiting. If your automated renewal process breaks, you need to know before the certificate expires.
Relying on the standard "Your certificate is expiring in 20 days" email from your CA is insufficient. Those emails are often routed to unmonitored shared inboxes or ignored by developers who assume the automation will eventually kick in, and some CAs no longer send them at all: Let's Encrypt ended its expiration notification emails in June 2025.
Internal Metrics with Prometheus
If you are using cert-manager, you must expose its metrics to Prometheus. You can configure PromQL alerts to trigger PagerDuty or Slack notifications when a certificate drops below a safe threshold.
# Alert if a certificate expires in less than 15 days
certmanager_certificate_expiration_timestamp_seconds - time() < (15 * 24 * 3600)
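If you run the Prometheus Operator, the expression above can be packaged as a PrometheusRule so the alert is deployed alongside the rest of your manifests. This is a sketch under assumptions: the Prometheus Operator CRDs are installed, the resource name and namespace are hypothetical, and your Prometheus instance is configured to select rules from that namespace.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry-alerts         # hypothetical name
  namespace: monitoring            # hypothetical namespace
spec:
  groups:
    - name: cert-manager.rules
      rules:
        # Fires when any certificate has fewer than 15 days left
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < (15 * 24 * 3600)
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 15 days"
        # Fires when cert-manager reports a certificate as not ready
        - alert: CertificateNotReady
          expr: certmanager_certificate_ready_status{condition="False"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not ready"
```

The second rule is included because a certificate that never became ready has no useful expiry to alert on. Depending on your scrape configuration, the certificate's namespace may appear under a relabeled name such as exported_namespace, so check the label names before relying on the summaries.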
External Validation with Expiring.at
Internal metrics are great, but they only monitor what the cluster thinks is happening. What if your Ingress controller is failing to serve the newly rotated secret? What if a DNS misconfiguration is routing traffic to a deprecated, expiring endpoint?
To achieve true defense-in-depth, you must monitor your public-facing endpoints externally. This is where Expiring.at becomes an invaluable part of your platform engineering toolkit. By tracking your actual, externally visible SSL/TLS certificates and domain expirations, Expiring.at acts as the ultimate source of truth.
If cert-manager successfully updates a Secret, but your NGINX Ingress fails to reload it, Prometheus might report