Mastering Certificate Management in Kubernetes: Preparing for the 90-Day Lifespan
In modern cloud-native architectures, Kubernetes relies heavily on Public Key Infrastructure (PKI) and X.509 certificates. They are the cryptographic backbone of your cluster, securing everything from external web traffic to internal control plane communications and pod-to-pod service mesh routing.
Yet, despite their critical importance, certificates remain one of the leading causes of catastrophic "Day 2" outages. High-profile companies from Epic Games to Spotify have suffered massive global downtime due to a single expired certificate.
As we move through 2024 and 2025, the industry is undergoing a seismic shift. The impending reduction of public TLS lifespans to 90 days, the finalization of Post-Quantum Cryptography (PQC) standards, and the strict operational resilience mandates of DORA and NIS2 have changed the rules. Automated certificate lifecycle management (CLM) in Kubernetes is no longer an optional DevOps enhancement—it is a mandatory security and uptime requirement.
In this comprehensive guide, we will explore the unique complexities of Kubernetes certificate management, the tools required to automate it, and the non-negotiable best practices you need to implement to keep your clusters secure and online.
Why Kubernetes Certificate Management is Uniquely Complex
Managing certificates in a traditional virtual machine environment usually involves a load balancer and a few web servers. In Kubernetes, the scale and dynamic nature of the environment make the problem far harder.
A single production Kubernetes cluster can contain thousands of microservices, each requiring its own cryptographic identity. This complexity is divided into three distinct layers:
- The Control Plane: Kubernetes components (kube-apiserver, etcd, kubelet, kube-controller-manager) rely on internal certificates to communicate securely. Tools like kubeadm generate these automatically, but they typically expire after one year. If they expire, the components can no longer authenticate to one another, and you lose the ability to manage your cluster entirely.
- Ingress (External Traffic): These are the public-facing certificates that secure traffic flowing from the internet into your cluster via an Ingress Controller (like NGINX or Traefik). These require validation from public Certificate Authorities (CAs) like Let's Encrypt.
- Workload-to-Workload (mTLS): In Zero Trust architectures, network perimeters are assumed to be compromised. Pods must authenticate with each other using mutual TLS (mTLS). These certificates are often highly ephemeral, living for only hours or minutes.
Without a centralized, automated strategy, this sprawl creates massive "blind spots." Security teams lose visibility into what certificates exist, who issued them, and exactly when they expire.
The 2024-2025 Landscape: What's Forcing the Change?
Several industry shifts are forcing organizations to rethink how they manage Kubernetes PKI.
The 90-Day Public Certificate Mandate
Google's proposal to reduce the maximum validity of public TLS certificates from 398 days to 90 days is the most pressing driver for automation. If this policy takes effect, every public certificate will need renewing at least four times a year, and manual rotation will become impractical for teams managing dozens of domains and ingresses. Enterprises are adopting the ACME (Automated Certificate Management Environment) protocol in Kubernetes to prepare.
Post-Quantum Cryptography (PQC) Readiness
In August 2024, NIST finalized the first three PQC standards (FIPS 203, 204, and 205). Organizations must now audit their Kubernetes clusters for "crypto-agility"—the ability to rapidly swap out traditional RSA/ECC certificates for quantum-safe algorithms without incurring downtime. Static, manually deployed certificates make crypto-agility impossible.
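One way to picture crypto-agility is a certificate whose key algorithm is a declarative field rather than a property baked into a hand-made key file. The sketch below uses a cert-manager Certificate resource (cert-manager is covered later in this guide). The resource name, namespace, hostname, and issuer name are hypothetical, and the algorithm shown is a classical one: whether quantum-safe options are available depends on your cert-manager version and on the issuing CA.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-gateway-tls          # hypothetical name
  namespace: production          # hypothetical namespace
spec:
  secretName: api-gateway-tls
  dnsNames:
    - api.yourdomain.com         # hypothetical hostname
  privateKey:
    # Changing the algorithm is a one-line edit here
    algorithm: ECDSA
    size: 384
    # Generate a fresh private key at every renewal
    rotationPolicy: Always
  issuerRef:
    name: example-issuer         # hypothetical issuer
    kind: ClusterIssuer
```

Because the key is regenerated on each renewal, a change to the algorithm field takes effect at the next issuance without anyone handling key material by hand.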
SPIFFE/SPIRE and Workload Identity
The Secure Production Identity Framework for Everyone (SPIFFE) has become the standard for workload identity. Instead of relying on IP addresses or network policies for security, Kubernetes workloads are issued short-lived cryptographic identities known as SVIDs (SPIFFE Verifiable Identity Documents) to authenticate across heterogeneous environments.
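As an illustration of how an SVID is assigned, the following sketch assumes SPIRE is installed together with the SPIRE Controller Manager, which provides the ClusterSPIFFEID resource. The resource name and the pod label are hypothetical. The template derives each workload's SPIFFE ID from its namespace and service account.

```yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: default-workloads        # hypothetical name
spec:
  # Identity is built from the pod's namespace and service account,
  # not from its IP address
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      spiffe.io/spiffe-id: "true"   # hypothetical opt-in label
```

Only pods carrying the opt-in label receive an identity, which keeps the rollout incremental.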
The Standard Toolkit: cert-manager and the CRD Ecosystem
To achieve zero-touch certificate automation in Kubernetes, the undisputed industry standard is cert-manager, a Cloud Native Computing Foundation (CNCF) graduated project.
cert-manager extends the Kubernetes API using Custom Resource Definitions (CRDs) to treat certificates as first-class citizens. Understanding these CRDs is critical for any DevOps engineer:
- Issuer / ClusterIssuer: Represents the Certificate Authority (CA) that will sign your certificates. Issuers are scoped to a specific namespace, while ClusterIssuers are available globally across the entire cluster.
- Certificate: A human-readable definition of the desired certificate, including the domains (SANs), the duration, and the target secret name.
- CertificateRequest: The actual request (CSR) generated by cert-manager and sent to the Issuer.
- Secret: The standard Kubernetes Secret where cert-manager securely stores the resulting TLS private key and signed certificate.
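To make these resources concrete, here is a minimal sketch of a Certificate that ties them together. The certificate name, namespace, and hostname are hypothetical, and the issuer name assumes the letsencrypt-prod ClusterIssuer defined later in this guide. The 90-day duration is chosen to match the lifespan discussed above. Note that ACME issuers such as Let's Encrypt set the validity period themselves, so the duration field mainly matters for issuers that honor it.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: frontend-cert            # hypothetical name
  namespace: production          # hypothetical namespace
spec:
  # Secret where cert-manager stores the key and signed certificate
  secretName: frontend-cert-tls
  # Requested validity: 90 days
  duration: 2160h
  # Start renewing 30 days before expiry
  renewBefore: 720h
  # Subject Alternative Names (SANs)
  dnsNames:
    - app.yourdomain.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

When this resource is applied, cert-manager creates a CertificateRequest, submits it to the referenced issuer, and writes the result into the named Secret.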
Practical Implementation: Automating Let's Encrypt via ACME
To automate public-facing Ingress certificates, you must configure a ClusterIssuer to communicate with Let's Encrypt using the ACME protocol.
Here is a production-ready example of a Let's Encrypt ClusterIssuer using HTTP-01 challenge validation:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # The ACME server URL
    server: https://acme-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration and expiration notices
    email: devops@yourdomain.com
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    # Enable the HTTP-01 challenge provider
    solvers:
      - http01:
          ingress:
            class: nginx
```
Once the ClusterIssuer is active, you do not need to manually create Certificate resources for every web service. Instead, you simply add annotations to your standard Ingress resources. cert-manager watches for these annotations, creates the Certificate resource for you, solves the ACME challenge, and stores the signed certificate in the Secret named in the Ingress tls block, which your Ingress controller then serves.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend-ingress
  namespace: production
  annotations:
    # Trigger cert-manager to provision the certificate
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Force SSL redirection
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.yourdomain.com
      # cert-manager will create this secret automatically
      secretName: frontend-tls-secret
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80
```
Securing Internal Communication: Private PKI and Service Meshes
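While Let's Encrypt is well suited to public Ingress, it should not be used for internal pod-to-pod communication. Public CAs record every certificate they issue in public Certificate Transparency (CT) logs, so a certificate for an internal hostname (e.g., db-backend.internal.yourdomain.com) exposes that name, and with it part of your internal network architecture, to the public internet. Public CAs also cannot issue certificates for cluster-only names such as db-backend.default.svc.cluster.local, because those names are not publicly registered domains.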
Integrating HashiCorp Vault
For internal PKI, HashiCorp Vault is the enterprise standard. Vault's PKI Secrets Engine acts as a dynamic, private CA.
You can integrate Vault directly with cert-manager by creating a Vault ClusterIssuer. This allows developers to request internal certificates using the exact same Kubernetes native workflow they use for public certificates, while security teams maintain strict control over the root CA and issuance policies within Vault.
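A Vault-backed ClusterIssuer might look like the sketch below. Every value is an assumption for illustration: the Vault address, the PKI mount and role in the signing path, the Vault auth role, and the service account name are hypothetical and must match your own Vault configuration. It also assumes Vault's Kubernetes auth method is enabled and that the Vault server certificate is trusted by cert-manager (otherwise a CA bundle has to be supplied).

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-internal           # hypothetical name
spec:
  vault:
    # Address of the Vault server (hypothetical)
    server: https://vault.internal.yourdomain.com:8200
    # Signing endpoint: <pki mount>/sign/<pki role> (hypothetical)
    path: pki_int/sign/internal-services
    auth:
      kubernetes:
        # Vault role bound to the service account below (hypothetical)
        role: cert-manager
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: vault-issuer     # hypothetical service account
```

Developers then reference this issuer from a Certificate exactly as they would a Let's Encrypt issuer, while the allowed domains and maximum TTL stay under the control of the Vault PKI role.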
Service Meshes for Zero-Touch mTLS
For large-scale microservice architectures, manually defining internal certificates—even with cert-manager—becomes cumbersome. This is where Service Meshes like Istio or Linkerd excel.
Service meshes abstract internal certificate management entirely. They typically deploy a sidecar proxy alongside every pod. The mesh control plane acts as its own CA (or chains up to a root CA in Vault), automatically provisioning, rotating, and delivering short-lived certificates (often valid for just 1 to 24 hours) to the sidecar proxies. This provides zero-touch mTLS between all workloads without requiring developers to write a single line of cryptographic code.
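With Istio, for example, requiring mTLS for every workload can be expressed as a single policy. The sketch below assumes a default Istio installation whose root namespace is istio-system. Older Istio releases serve this resource under security.istio.io/v1beta1 instead.

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  # A policy named "default" in the root namespace applies mesh-wide
  name: default
  namespace: istio-system
spec:
  mtls:
    # Reject any plaintext traffic between workloads
    mode: STRICT
```

Certificate issuance and rotation for the proxies remain the mesh's job. This policy only controls whether unencrypted connections are accepted.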
5 Non-Negotiable Best Practices for Kubernetes CLM
To build a resilient, compliant, and secure Kubernetes environment, implement these five best practices.
1. Automate Everything (Zero Human Touch)
If a human has to generate a CSR, copy a private key, or manually apply a Kubernetes Secret, your architecture is flawed. Manual steps are a leading cause of certificate-related outages. Rely entirely on the ACME protocol for public certificates and on integrated private CAs (like Vault or AWS Private CA) for internal certificates.
2. Implement Short-Lived Certificates
Do not issue 1-year or 5-year certificates for internal services. Long-lived certificates expand the "blast radius" if a private key is compromised, and they mask automation failures. By reducing internal certificate lifespans to 7 days (or less), you limit the window of vulnerability and force your automation to continuously prove that it works. If rotation fails, you find out in days, not years.
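In cert-manager terms, a 7-day lifespan is just two fields on the Certificate. This is a minimal sketch: the certificate name and the issuer name are hypothetical, and the renewal window (one third of the lifetime) is an illustrative choice rather than a rule. The issuing CA must also permit a TTL this short.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: db-backend-tls           # hypothetical name
  namespace: default
spec:
  secretName: db-backend-tls
  # 7-day lifespan
  duration: 168h
  # Renew when about one third of the lifetime remains
  renewBefore: 56h
  dnsNames:
    - db-backend.default.svc.cluster.local
  issuerRef:
    name: internal-ca-issuer     # hypothetical private CA issuer
    kind: ClusterIssuer
```

With these values a failed renewal leaves a little over two days to react before the certificate actually expires.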
3. Protect Private Keys with Encryption at Rest
By default, Kubernetes Secrets are stored in etcd merely base64 encoded, not encrypted. Anyone with access to etcd or its backups can read your private TLS keys in plain text. Enable Encryption at Rest for Kubernetes Secrets using a Key Management Service (KMS) provider (such as AWS KMS, Google Cloud KMS, or Azure Key Vault) so that your private keys are cryptographically protected on disk. Encryption at rest does not restrict API access, so also limit which identities have RBAC permission to read Secrets, since any identity allowed to get a Secret still receives the decrypted key.
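On a self-managed control plane, encryption at rest is configured through a file passed to kube-apiserver with the --encryption-provider-config flag. The sketch below assumes a KMS v2 plugin is already running on the control plane node. The plugin name and socket path are hypothetical. Managed services such as EKS, GKE, and AKS expose the same capability as a cluster setting instead of this file.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # First provider is used to encrypt new writes
      - kms:
          apiVersion: v2
          name: example-kms-plugin                      # hypothetical name
          endpoint: unix:///var/run/kmsplugin/socket.sock   # hypothetical path
          timeout: 3s
      # Fallback so Secrets written before encryption was enabled stay readable
      - identity: {}
```

Existing Secrets are only encrypted when they are next written, so they need to be rewritten once after the configuration is enabled.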
4. Use Staging Environments for Rate Limits
Misconfigured automated issuance, such as a workload stuck in a crash loop that continuously requests certificates, can quickly exhaust Let's Encrypt's strict rate limits. If you hit these limits, your ACME account or the affected domains will be temporarily blocked from obtaining new certificates.
Always configure a letsencrypt-staging ClusterIssuer for development and testing environments. The staging environment has significantly higher rate limits and generates untrusted certificates, making it perfect for validating your cert-manager configuration without risking your production quotas.
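A staging issuer can mirror the production one, changing only the server URL and the account key secret. In this sketch the secret name is hypothetical, and the solver assumes the same NGINX ingress class used earlier in this guide.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Let's Encrypt staging endpoint (issues untrusted certificates)
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: devops@yourdomain.com
    # Separate account key from production (hypothetical secret name)
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```

Pointing an Ingress at it is then a matter of setting the cert-manager.io/cluster-issuer annotation to "letsencrypt-staging" until the configuration is proven.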
5. Monitor, Alert, and Track Expirations Externally
While cert-manager automation is highly reliable, it