The End of Manual Certs: Scaling Certificate Management with Infrastructure as Code
In April 2023, Starlink suffered a massive global outage that left users without internet access for hours. The root cause wasn't a sophisticated cyberattack or a catastrophic hardware failure in orbit. It was a single, expired TLS certificate on their ground station infrastructure.
Similarly, Epic Games experienced a severe backend outage in April 2021 when an internal wildcard TLS certificate expired. These incidents highlight a painful reality in modern IT: manual certificate tracking is a ticking time bomb. According to industry research, outages caused by expired certificates cost organizations an average of $300,000 per hour in lost revenue and productivity.
With the explosion of microservices, Kubernetes clusters, and zero-trust architectures, the sheer volume of machine identities has outpaced human capacity. Furthermore, with Google's Chromium Root Program proposing to reduce the maximum validity of public TLS certificates from 398 days to just 90 days, manual renewals are no longer just inefficient; they are unworkable at scale.
The industry mandate is clear: certificate provisioning, renewal, and revocation must be integrated directly into your deployment pipelines. By leveraging Infrastructure as Code (IaC) for certificate management, DevOps and security teams can ensure that cryptography scales synchronously with infrastructure.
Here is how to architect, secure, and monitor certificate management using modern IaC practices.
The IaC Security Trap: Stop Leaking Private Keys in State Files
When teams first attempt to automate certificates using tools like Terraform or OpenTofu, they often make a critical security error: generating private keys directly within the IaC configuration.
Resources such as tls_private_key in the Terraform tls provider write the generated private key, unencrypted, into the terraform.tfstate file. If an attacker, or even an unauthorized internal user, gains read access to your state file, every certificate issued for those keys is compromised.
The Wrong Way: Generating Keys in State
```hcl
# DANGEROUS: This stores the private key in plaintext in your state file
resource "tls_private_key" "example" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_self_signed_cert" "example" {
  private_key_pem = tls_private_key.example.private_key_pem
  # ... other configurations
}
```
The Right Way: Requesting Keys from Secure Enclaves
The fundamental rule of IaC certificate management is to shift from generating keys to requesting them from secure enclaves. Your IaC should configure cloud-native services like AWS Certificate Manager (ACM) or HashiCorp Vault, ensuring the private key never touches the IaC state file.
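With Vault, for example, Terraform configures the PKI secrets engine and its roles, while workloads (through Vault Agent, cert-manager, or a direct API call) authenticate to Vault at runtime and request short-lived certificates. Vault generates each private key and returns it only to the requesting workload, so it never appears in your Git repository or state file. Be aware that issuing the certificate from Terraform itself with the vault_pki_secret_backend_cert resource would write the returned private key into state.
```hcl
# SAFER: Terraform only configures the PKI role. Workloads request
# certificates from Vault at runtime, so no private key lands in state.
resource "vault_pki_secret_backend_role" "app" {
  backend          = "pki_int"
  name             = "my-app-role"
  allowed_domains  = ["internal.company.com"]
  allow_subdomains = true      # permits app.internal.company.com
  max_ttl          = "2592000" # 30 days, in seconds
}
```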
OpenTofu (1.7+) has introduced native client-side state encryption, and Terraform backends such as S3 can encrypt state at rest, adding a much-needed layer of defense-in-depth for infrastructure secrets. However, decoupling key generation from state remains the architectural gold standard.
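As a rough illustration of that defense-in-depth layer, the sketch below enables passphrase-based state and plan encryption. It assumes OpenTofu 1.8 or later (1.7 introduced the encryption block; referencing a variable inside it is assumed to need the later release). The variable name state_passphrase and the labels "passphrase" and "default" are placeholders, not names from this article.

```hcl
# OpenTofu syntax: the encryption block is not part of Terraform's configuration language.
variable "state_passphrase" {
  type      = string
  sensitive = true # supply via environment or a secrets manager, never commit it
}

terraform {
  encryption {
    # Derive an encryption key from the passphrase.
    key_provider "pbkdf2" "passphrase" {
      passphrase = var.state_passphrase
    }

    # Encrypt with AES-GCM using the derived key.
    method "aes_gcm" "default" {
      keys = key_provider.pbkdf2.passphrase
    }

    # Refuse to read or write unencrypted state and plan files.
    state {
      method   = method.aes_gcm.default
      enforced = true
    }
    plan {
      method   = method.aes_gcm.default
      enforced = true
    }
  }
}
```

Encryption protects the file at rest, but anyone holding the passphrase can still read a key stored in state, which is why keeping keys out of state in the first place is still preferable.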
Day 2 Operations: Decoupling Provisioning from Renewal
A common pitfall in IaC certificate management is treating certificates strictly as "Day 1" infrastructure. Terraform and Pulumi are exceptional at provisioning resources when you run an apply. But what happens on Day 60 when that certificate is about to expire?
Traditional IaC pipelines only execute when triggered by a code change. If a certificate expires and no infrastructure changes have been merged to trigger a pipeline run, the certificate will silently expire, resulting in an outage.
To solve this, modern infrastructure teams decouple provisioning from renewal. Instead of using IaC to provision the certificate directly, use IaC to deploy automated renewal agents.
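One way to sketch this idea is to have Terraform install the renewal agent rather than the certificate. The example below deploys cert-manager through the Helm provider; it assumes a configured helm provider pointing at your cluster, and the pinned chart version and the crds.enabled value (which applies to newer chart releases) are assumptions to adjust for your environment.

```hcl
# Terraform installs the agent once; the agent then renews certificates on its own schedule.
resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  version          = "v1.15.0" # assumption: pin to a version you have tested
  namespace        = "cert-manager"
  create_namespace = true

  # Install cert-manager's CRDs together with the chart.
  values = [yamlencode({
    crds = { enabled = true }
  })]
}
```

After this apply, no further pipeline run is needed for renewals: the controller watches certificate objects in the cluster and acts on them continuously.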
Enter GitOps and cert-manager
In cloud-native environments, the convergence of Kubernetes, GitOps controllers like ArgoCD or Flux, and cert-manager has become the industry standard.
Instead of writing imperative scripts, you declare your certificate requirements as Kubernetes custom resources (such as cert-manager's Certificate kind, defined by its CRDs) stored in Git. ArgoCD applies these manifests to the cluster, and cert-manager takes over the lifecycle management. cert-manager communicates with your Certificate Authority (CA), whether that is Let's Encrypt via the ACME protocol, HashiCorp Vault, or a private enterprise CA, to handle private key generation, the initial CSR, and all subsequent automated renewals.
By leveraging GitOps, the private key is generated natively inside the cluster, keeping it entirely out of Git, while the renewal loop runs continuously without requiring a pipeline trigger.
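To make the declaration concrete, here is a minimal sketch of a cert-manager Certificate object. In a GitOps repository this would normally be a YAML manifest; it is written in HCL with the kubernetes_manifest resource only to stay consistent with the other examples. The object name app-tls, the namespace, and the ClusterIssuer name internal-ca are hypothetical, and kubernetes_manifest requires the cert-manager CRDs to exist in the cluster at plan time.

```hcl
resource "kubernetes_manifest" "app_certificate" {
  manifest = {
    apiVersion = "cert-manager.io/v1"
    kind       = "Certificate"
    metadata = {
      name      = "app-tls" # hypothetical name
      namespace = "default"
    }
    spec = {
      secretName  = "app-tls" # Secret where cert-manager stores the key and certificate
      dnsNames    = ["app.internal.company.com"]
      duration    = "2160h" # 90 days
      renewBefore = "720h"  # start renewal 30 days before expiry

      # The key is generated inside the cluster and rotated on every renewal.
      privateKey = {
        algorithm      = "ECDSA"
        size           = 256
        rotationPolicy = "Always"
      }

      issuerRef = {
        group = "cert-manager.io"
        kind  = "ClusterIssuer"
        name  = "internal-ca" # hypothetical issuer
      }
    }
  }
}
```

Only the desired state (names, lifetimes, issuer) lives in version control; the private key exists solely in the referenced Kubernetes Secret.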
Practical Implementation: Production-Ready Patterns
Depending on your infrastructure architecture, here are the three dominant patterns for implementing IaC-driven certificate management.
Pattern 1: Cloud-Native Automation (AWS ACM + Terraform)
For workloads running heavily in public clouds, leveraging native managed services is the path of least resistance. You can use Terraform to provision an Application Load Balancer (ALB) alongside an ACM certificate, utilizing DNS validation via Route53. Terraform automates the DNS record creation, allowing AWS to validate and issue the certificate seamlessly.
```hcl
resource "aws_acm_certificate" "cert" {
  domain_name       = "api.production.company.com"
  validation_method = "DNS"

  # Issue the replacement certificate before destroying the old one.
  lifecycle {
    create_before_destroy = true
  }
}

# One DNS validation record per domain name on the certificate.
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.cert.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  allow_overwrite = true
  name            = each.value.name
  records         = [each.value.record]
  ttl             = 60
  type            = each.value.type
  zone_id         = data.aws_route53_zone.main.zone_id # hosted zone lookup defined elsewhere
}
```
As long as the certificate stays attached to a supported AWS resource and its DNS validation records remain in place, ACM renews it automatically before expiry, removing most of the Day 2 operational burden from your team.
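The configuration above creates the certificate and its DNS records but does not yet wait for issuance or attach the certificate to the load balancer. A minimal sketch of those two remaining steps follows; aws_lb.app and aws_lb_target_group.app are hypothetical resources assumed to be defined elsewhere, and the ssl_policy value is one of AWS's predefined policies chosen here as an example.

```hcl
# Blocks until ACM has seen the DNS records and issued the certificate.
resource "aws_acm_certificate_validation" "cert" {
  certificate_arn         = aws_acm_certificate.cert.arn
  validation_record_fqdns = [for record in aws_route53_record.cert_validation : record.fqdn]
}

# HTTPS listener that serves the validated certificate.
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn # hypothetical ALB
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate_validation.cert.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn # hypothetical target group
  }
}
```

Referencing the certificate through the validation resource makes Terraform create the listener only after issuance has succeeded.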
Pattern 2: Enterprise Machine Identity Management
For massive enterprises with strict compliance requirements, tools like Venafi or Smallstep integrate deeply with Terraform to enforce machine identity policies. You can create standardized, approved IaC modules (e.g., a Terraform module for an NGINX server) that automatically request a certificate from the corporate PKI.
When developers consume this module, they get a compliant, properly configured certificate by default, eliminating shadow IT and rogue self-signed certificates.
Pattern 3: Automated mTLS in Service Meshes
For internal service-to-service communication, zero-trust principles dictate the use of mutual TLS (mTLS). Managing thousands of internal certificates manually is impossible. Instead, use IaC to deploy a service mesh like Istio or Linkerd. The IaC provisions the control plane, which in turn acts as a CA, dynamically issuing and rotating ephemeral certificates (often with lifespans measured in hours) to every microservice proxy without human intervention.
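As one possible shape for this pattern, the sketch below installs Istio's control plane with the Helm provider and then requires mutual TLS mesh-wide. It assumes configured helm and kubernetes providers; chart versions are left unpinned for brevity and should be pinned in practice. Because kubernetes_manifest needs the Istio CRDs to exist at plan time, the policy may have to be applied in a second stage.

```hcl
# Istio CRDs and cluster-wide base resources.
resource "helm_release" "istio_base" {
  name             = "istio-base"
  repository       = "https://istio-release.storage.googleapis.com/charts"
  chart            = "base"
  namespace        = "istio-system"
  create_namespace = true
}

# istiod: the control plane that issues and rotates workload certificates.
resource "helm_release" "istiod" {
  name       = "istiod"
  repository = "https://istio-release.storage.googleapis.com/charts"
  chart      = "istiod"
  namespace  = "istio-system"
  depends_on = [helm_release.istio_base]
}

# A PeerAuthentication named "default" in the root namespace applies to the whole mesh.
resource "kubernetes_manifest" "mesh_wide_mtls" {
  manifest = {
    apiVersion = "security.istio.io/v1beta1"
    kind       = "PeerAuthentication"
    metadata = {
      name      = "default"
      namespace = "istio-system"
    }
    spec = {
      mtls = { mode = "STRICT" } # reject plaintext traffic between sidecars
    }
  }
  depends_on = [helm_release.istiod]
}
```

No certificate resources appear in the code at all: the control plane handles issuance and rotation for every proxy.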
Shift-Left Security and Post-Quantum Agility
Treating infrastructure as code allows you to treat security as code. By integrating policy-as-code tools like Checkov, tfsec, or Open Policy Agent (OPA) into your CI/CD pipelines, you can "shift left" and validate certificate configurations before they are ever deployed.
You can write policies that automatically block pull requests if they attempt to:
* Allow weak protocol versions or cipher suites (e.g., anything below a TLS 1.3 minimum).
* Request certificates with excessive validity periods (e.g., anything beyond a 90-day maximum).
* Use unauthorized or untrusted Certificate Authorities.
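Alongside dedicated policy engines, some of these rules can be enforced inside a shared Terraform module through input validation. This is a lighter-weight complement to OPA or Checkov, not a replacement for them. The sketch below assumes a hypothetical certificate module; the variable names and the approved issuer names are illustrative only.

```hcl
# Inputs of a hypothetical shared certificate module.
variable "validity_days" {
  type        = number
  default     = 90
  description = "Requested certificate lifetime in days."

  validation {
    condition     = var.validity_days > 0 && var.validity_days <= 90
    error_message = "Certificate validity must be between 1 and 90 days."
  }
}

variable "issuer" {
  type        = string
  description = "Name of the CA or issuer that signs the certificate."

  validation {
    # Hypothetical allow-list of approved issuers.
    condition     = contains(["letsencrypt-prod", "vault-pki-int"], var.issuer)
    error_message = "Issuer must be one of the approved certificate authorities."
  }
}
```

A caller that passes validity_days = 365 or an unlisted issuer fails at plan time, before any pull request can be merged and applied.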
Furthermore, following NIST’s finalization of Post-Quantum Cryptography (PQC) standards in August 2024, organizations must begin preparing for "crypto-agility." When the time comes to migrate away from legacy RSA or ECC algorithms to quantum-resistant algorithms, having your certificates managed via IaC is your greatest asset. Instead of manually updating thousands of servers, your security team can update the central Terraform module or Vault PKI configuration, and the new quantum-resistant certificates will roll out across your entire fleet via a single code commit.
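The central-change idea can be sketched with today's algorithms. Below, one pair of local values drives the key type for a set of Vault PKI roles, so switching algorithms is a one-line change. The service names are hypothetical, and the example deliberately uses key types Vault's PKI engine already offers (rsa, ec, ed25519); it does not assume that post-quantum key types are available.

```hcl
locals {
  # Single source of truth for the fleet's key algorithm.
  pki_key_type = "ec"
  pki_key_bits = 256
}

resource "vault_pki_secret_backend_role" "fleet" {
  for_each = toset(["payments", "search", "checkout"]) # hypothetical services

  backend            = "pki_int"
  name               = "${each.key}-role"
  key_type           = local.pki_key_type
  key_bits           = local.pki_key_bits
  allowed_domains    = ["${each.key}.internal.company.com"]
  allow_bare_domains = true
  max_ttl            = "2592000" # 30 days, in seconds
}
```

With short-lived certificates, a change to the locals propagates as workloads renew, rather than requiring a coordinated manual rollout.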
Trust, but Verify: The Need for Continuous Monitoring
Automation is powerful, but silent failures in automation are deadly. An ACME rate limit, a misconfigured DNS record, a revoked API token, or a failing Kubernetes webhook can easily halt your automated renewal pipelines. If your automation fails and you have no external visibility, you will only find out when your users experience an outage.
This is where a robust monitoring safety net becomes essential. While IaC and GitOps handle the execution of your certificate lifecycle, Expiring.at provides the verifiable truth of your external security posture.
By continuously monitoring your endpoints independently of your deployment pipelines, Expiring.at acts as the ultimate failsafe. If your Terraform pipeline breaks or cert-manager fails to reconcile a certificate, Expiring.at detects the impending expiration and alerts your team via Slack, email, or webhook long before the 90-day window closes.
Integrating independent expiration tracking ensures that you are never caught off guard by a silent automation failure. You get the speed and scale of Infrastructure as Code, backed by the reliability of external, continuous validation.
Key Takeaways and Next Steps
The transition to 90-day TLS lifespans and the rise of ephemeral machine identities mean that manual certificate management is officially obsolete. To modernize your infrastructure:
- Stop generating keys in state files: Audit your existing Terraform/OpenTofu codebases and remove any instances of the tls provider generating private keys. Migrate to cloud KMS, ACM, or HashiCorp Vault.
- Standardize on ACME and GitOps: Implement cert-manager for Kubernetes workloads to automate CSR generation and renewals natively within the cluster.
- Enforce Policy as Code: Add OPA or Checkov to your CI/CD pipelines to block non-compliant certificate configurations before they reach production.
- Implement an External Failsafe: Don't rely solely on your automation's internal logs. Set up Expiring.at to monitor your critical endpoints and alert you if your automated renewal pipelines stall.
By treating certificates as ephemeral infrastructure and codifying their lifecycles, you eliminate the risk of expiration outages, ensure compliance by default, and free your engineering teams to focus on building features rather than chasing down expiring keys.