The Silent Killer of Kubernetes Clusters: A Guide to Modern Certificate Management

Tim Henrich
November 17, 2025

It was a quiet Tuesday in March 2023 when a significant portion of the Microsoft Azure cloud began to fail. Services like Storage, Cosmos DB, and App Service went dark, impacting thousands of customers globally. The root cause wasn't a sophisticated cyberattack or a massive hardware failure. It was something far more common and insidious: a single, expired internal TLS certificate.

This incident is a stark reminder of a painful truth in the cloud-native world: certificate management is a critical, high-stakes discipline that is too often neglected. In the dynamic, ephemeral landscape of Kubernetes, where services spin up and down in seconds, manual tracking in a spreadsheet is not just inefficient; it's a direct path to catastrophic failure.

This guide will walk you through the modern approach to managing certificates in Kubernetes. We'll move beyond basic setup and dive into the automated, policy-driven strategies that define resilient and secure systems. You'll learn how to tame certificate sprawl, secure traffic both into and within your cluster, and gain the visibility needed to prevent the next silent, certificate-induced outage.

From Manual Toil to Declarative Automation: The cert-manager Standard

The first step in maturing your certificate strategy is to eliminate manual intervention. The dynamic nature of Kubernetes demands an automated, declarative approach, and the de facto standard for this is cert-manager.

cert-manager is a Kubernetes controller that automates the lifecycle of your TLS certificates: issuance, renewal, and reissuance when specs change. It introduces Custom Resource Definitions (CRDs) like Issuer, ClusterIssuer, and Certificate that allow you to manage certificates as code, perfectly aligning with GitOps principles.

How it Works: A Practical Example

Let's say you want to secure an application's ingress with a free, trusted certificate from Let's Encrypt.

Step 1: Install cert-manager

First, you need to install cert-manager into your cluster. The recommended method is using the official Helm chart:

# Add the Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io

# Update your local Helm chart repository cache
helm repo update

# Install the cert-manager chart
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.14.5 \
  --set installCRDs=true

Step 2: Configure a ClusterIssuer

An Issuer is a resource that represents a certificate authority (CA) from which to obtain certificates. A ClusterIssuer is a cluster-scoped version that can be used from any namespace. Here’s how to configure one for Let's Encrypt's production environment using the ACME protocol:

# cluster-issuer-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-private-key
    solvers:
    - http01:
        ingress:
          class: nginx

This manifest tells cert-manager how to communicate with Let's Encrypt. The http01 solver will automatically configure your NGINX ingress controller to solve the domain ownership challenge required by Let's Encrypt.

Step 3: Request a Certificate

Now, you can request a certificate simply by creating a Certificate resource. cert-manager will see this resource, communicate with the ClusterIssuer, and once the certificate is issued, store it in a Kubernetes Secret.

# my-app-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls
  namespace: production
spec:
  secretName: my-app-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: myapp.example.com
  dnsNames:
  - myapp.example.com
  - www.myapp.example.com

Apply this manifest, and within minutes, a Secret named my-app-tls-secret will appear in the production namespace containing a valid TLS certificate and private key. You can then reference this Secret from the tls section of your Ingress resource to enable HTTPS.
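Wiring the issued Secret into an Ingress looks something like this (a sketch: the backend service name, port, and nginx ingress class are assumptions for illustration):

```yaml
# my-app-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: production
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: my-app-tls-secret   # the Secret created by cert-manager
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app   # hypothetical backend Service
            port:
              number: 80
```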

The best part? cert-manager will monitor this certificate and automatically begin the renewal process long before it expires, ensuring continuous availability.

Beyond the Edge: Securing East-West Traffic with a Service Mesh

Securing traffic entering your cluster (north-south) is essential, but it's only half the battle. In a microservices architecture, a significant amount of traffic flows between services within the cluster (east-west). This internal traffic is often unencrypted, creating a major security blind spot. An attacker who compromises a single pod can then move laterally, sniffing traffic and attacking other services.

This is where a service mesh comes in. A service mesh like Istio or Linkerd provides a dedicated infrastructure layer for making service-to-service communication safe, reliable, and observable. One of its core security features is automatic mutual TLS (mTLS).

The Magic of Automatic mTLS

With a service mesh, every pod gets a sidecar proxy (like Envoy). This proxy intercepts all network traffic coming into and out of the application container. Here's how it enables mTLS automatically:

  1. Identity Bootstrapping: When a pod starts, the sidecar proxy requests an identity from the service mesh control plane (e.g., Istiod). This identity is based on the pod's Kubernetes Service Account and conforms to the SPIFFE (Secure Production Identity Framework for Everyone) standard. A typical SPIFFE ID looks like spiffe://cluster.local/ns/default/sa/my-app-sa.
  2. Certificate Issuance: The control plane, acting as a private Certificate Authority (CA), validates the request and issues a short-lived X.509 certificate back to the sidecar. This certificate encodes the pod's SPIFFE ID.
  3. Encrypted Communication: When Service A wants to talk to Service B, its sidecar proxy transparently initiates a TLS handshake with Service B's proxy. They present their certificates to each other, cryptographically verifying their identities. Once authenticated, they establish an encrypted channel.

This entire process is transparent to the application. Your developers don't need to modify their code to handle TLS; the mesh takes care of it. By enforcing mTLS by default, you adopt a Zero Trust posture inside your cluster—no service trusts another by default, and all communication must be authenticated and encrypted.
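In Istio, for example, requiring mTLS for all workloads in the mesh comes down to a single resource (a sketch, assuming a standard Istio install where istio-system is the mesh's root namespace):

```yaml
# strict-mtls.yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying to the root namespace makes this mesh-wide
spec:
  mtls:
    mode: STRICT   # reject any plaintext traffic between sidecars
```

With STRICT mode, any connection that does not present a valid mesh certificate is refused, rather than silently falling back to plaintext.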

Advanced Strategies for Bulletproof Certificate Management

Automating ingress and internal traffic is a huge step forward, but mature organizations need to layer on additional controls for policy, visibility, and security.

1. Enforce Policy as Code

As your organization grows, you need to enforce standards. Can a developer issue a wildcard certificate? Can a staging service use the production issuer? You can enforce these rules using a policy-as-code engine like Kyverno or OPA Gatekeeper.

For example, here is a simple Kyverno ClusterPolicy that prevents the creation of Certificate resources that request a wildcard domain:

# disallow-wildcard-certs.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-wildcard-certificates
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-dns-names
    match:
      any:
      - resources:
          kinds:
          - Certificate
    validate:
      message: "Wildcard certificates are not allowed."
      # Deny if any dnsName, or the commonName, begins with "*."
      deny:
        conditions:
          any:
          - key: "{{ request.object.spec.dnsNames[?starts_with(@, '*.')] || `[]` | length(@) }}"
            operator: GreaterThan
            value: 0
          - key: "{{ starts_with(request.object.spec.commonName || '', '*.') }}"
            operator: Equals
            value: true

When a user tries to apply a Certificate manifest with *.example.com, the Kubernetes API server will reject it, ensuring compliance with your security policies.
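For instance, a manifest like the following (the name and domain are illustrative) would be rejected at admission time:

```yaml
# wildcard-cert.yaml -- this request will be denied
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-cert
  namespace: production
spec:
  secretName: wildcard-cert-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - "*.example.com"   # a wildcard dnsName, blocked by the policy
```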

2. Embrace Short-Lived Certificates

The industry is rapidly moving away from certificates with one-year lifetimes. A shorter lifetime—from 90 days down to just a few hours for internal services—dramatically reduces the "blast radius" of a compromised private key. If a key is stolen, it's only useful to an attacker for a very short period.

This is only feasible with robust automation. A service mesh like Istio already issues certificates with a default lifetime of 24 hours, rotating them automatically. For public-facing certificates managed by cert-manager, you can configure shorter lifetimes, but this requires confidence in your renewal automation and monitoring.
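With cert-manager, lifetime and renewal timing are explicit fields on the Certificate resource. A sketch for an internal service (the internal-ca ClusterIssuer is a hypothetical private CA; note that ACME issuers like Let's Encrypt ignore the requested duration, so this only works with issuers that honor it):

```yaml
# short-lived-internal.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-api
  namespace: production
spec:
  secretName: internal-api-tls
  duration: 24h      # total certificate lifetime
  renewBefore: 8h    # start renewal when 8 hours remain
  issuerRef:
    name: internal-ca   # hypothetical private CA issuer
    kind: ClusterIssuer
  dnsNames:
  - internal-api.production.svc.cluster.local
```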

3. Centralize Visibility to Tame Sprawl

Automation is a double-edged sword. While cert-manager and service meshes make it easy to issue certificates, they can also lead to "certificate sprawl." In a large organization with dozens of clusters and thousands of services, you can quickly end up with tens of thousands of certificates. This raises critical questions for security and compliance teams:

  • Which certificates are expiring in the next 30 days across all our clusters?
  • Are any teams using weak cryptographic algorithms?
  • Do we have a complete inventory for our next compliance audit?

This is where automation tools fall short. They execute tasks but don't provide a centralized, human-friendly view of your entire certificate inventory. Relying on kubectl commands and sifting through Prometheus metrics is inefficient and error-prone.
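To see why this does not scale, consider the manual equivalent for even a single certificate: extract it (here a freshly generated self-signed certificate stands in for one pulled from a Secret) and interrogate it with openssl:

```shell
# Generate a self-signed certificate to stand in for one extracted from a Secret
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 30 -subj "/CN=myapp.example.com"

# Print the expiry date -- the field you would otherwise dig out of metrics
openssl x509 -in /tmp/demo.crt -noout -enddate

# Exit non-zero (and warn) if the certificate expires within 60 days (5184000s)
openssl x509 -in /tmp/demo.crt -noout -checkend 5184000 \
  || echo "WARNING: certificate expires within 60 days"
```

Now multiply that by tens of thousands of certificates across dozens of clusters, and the case for a centralized inventory makes itself.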

This is precisely the problem that a centralized certificate monitoring platform like Expiring.at is designed to solve. It integrates with your Kubernetes clusters and other sources to provide a single dashboard for all your certificates. Instead of wondering what's about to break, you can:

  • Discover and inventory all certificates automatically, including those managed by cert-manager.
  • Receive proactive alerts via Slack, email, or webhooks for upcoming expirations, preventing outages.
  • Track compliance by ensuring all certificates meet your organization's security policies.

By pairing powerful automation tools with a dedicated visibility platform, you get the best of both worlds: efficient, hands-off lifecycle management and the high-level oversight needed for governance and risk management.

Your Action Plan for Modern Certificate Management

The days of treating certificates as a "set it and forget it" task are over. In the world of Kubernetes, certificate management is a continuous, automated, and policy-driven process that is fundamental to both security and reliability.

Here are your key takeaways:

  1. Automate Everything: Immediately move away from manual processes. Install cert-manager and define all public-facing certificates as code in your Git repositories.
  2. Secure the Interior: Don't stop at the ingress. Implement a service mesh like Istio or Linkerd to enforce mTLS for all internal service-to-service communication.
  3. Enforce Your Rules: Use a policy engine like Kyverno to set and enforce guardrails on certificate issuance, preventing insecure configurations before they happen.
  4. Gain Centralized Visibility: You cannot manage what you cannot see. Implement a monitoring solution like Expiring.at to create a comprehensive inventory of all certificates across your environments, turning unknown risks into manageable assets.

By adopting these practices, you can transform certificate management from a source of stress and outages into a strategic advantage, building a more secure, resilient, and compliant Kubernetes platform.
