SSL Cert Auto-Renewal Failed Silently, Site Now Untrusted

Certbot or platform renewal stopped months ago. You only noticed when the browser flashed NET::ERR_CERT_DATE_INVALID. Find why and restore the cron.

The browser flashes red: NET::ERR_CERT_DATE_INVALID. Your Let’s Encrypt cert expired two days ago, even though you “set up auto-renewal months ago and never touched it again”. That is exactly how this fails. Let’s Encrypt certs last 90 days, the renewal job broke silently somewhere along the way, no one was watching the logs, and now you’re scrambling at 11pm. The renewal job has three independent failure surfaces (the scheduler, the challenge mechanism, the deploy hook), and any one breaking is enough to leave you with a stale cert that quietly counts down to expiry.

Common causes

Ordered by what we see most often in incident postmortems.

1. The cron / systemd timer never actually ran

Certbot’s renewal is typically scheduled via cron, a systemd timer, or a package-provided unit. If the timer was disabled around a server reboot or OS upgrade, or masked with systemctl mask, it has silently been a no-op for months.

How to spot it: systemctl list-timers | grep certbot shows no active timer, or journalctl -u certbot.timer --since '90 days ago' is empty. On cron-based setups, crontab -l shows no certbot entry and /etc/cron.d/certbot is missing or commented out.

2. HTTP-01 challenge can no longer reach .well-known/acme-challenge/

You added a CDN, a WAF, an auth_basic block, or a rewrite rule that intercepts /.well-known/acme-challenge/* and returns 401/403, or redirects to a target that does not serve the token. Let’s Encrypt cannot fetch the challenge token and renewal fails.

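How to spot it: curl -I http://yourdomain.com/.well-known/acme-challenge/test returns anything but 404. Test over plain HTTP, because HTTP-01 validation always starts on port 80. A 401/403 means a proxy or auth rule is intercepting. A 301/302 is only safe if its target still serves the challenge file, since Let’s Encrypt follows redirects.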
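The status-code reading above can be sketched as a small helper. This is only an illustration: the function name classify_challenge_status and the wording of its verdicts are made up for this article and are not part of certbot or curl.

```bash
#!/usr/bin/env bash
# Hypothetical helper: interpret the HTTP status returned for a
# non-existent file under /.well-known/acme-challenge/.
classify_challenge_status() {
  case "$1" in
    404)             echo "ok: request reaches the origin (the token just does not exist)" ;;
    401|403)         echo "blocked: an auth or WAF rule is intercepting the challenge path" ;;
    301|302|307|308) echo "redirect: confirm the target still serves the challenge file" ;;
    200)             echo "suspect: something answers 200 for a missing token (catch-all page?)" ;;
    *)               echo "unknown: inspect manually (status $1)" ;;
  esac
}

# Usage (performs a real request, so it is left commented out):
# status=$(curl -s -o /dev/null -w '%{http_code}' \
#   "http://yourdomain.com/.well-known/acme-challenge/test")
# classify_challenge_status "$status"
```

Keeping the classification separate from the curl call makes it easy to run the same check against several hostnames in a loop.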

3. DNS-01 challenge API credentials expired or rotated

You used a DNS plugin (Cloudflare, Route 53, Google Cloud DNS) with an API token. The token had an expiry, or you rotated keys, or the IAM policy changed — but the certbot config still has the old credentials.

How to spot it: /var/log/letsencrypt/letsencrypt.log shows 403 Forbidden or Authentication error from the DNS API during the renewal attempt.

4. Disk full or read-only filesystem blocks the renewal

Certbot writes to /etc/letsencrypt/live/, /etc/letsencrypt/archive/, and stages temporary files. If the partition is full, or if the system is in degraded mode and /etc is read-only, renewal aborts.

How to spot it: df -h /etc shows 100% used or mount | grep ' / ' shows (ro,...). Certbot log shows OSError: [Errno 28] No space left on device.

5. Cert renewed but the deploy hook never reloaded nginx / haproxy

The cert on disk is new — openssl x509 -in fullchain.pem -noout -dates shows a fresh expiry — but the running web server is still holding the old cert in memory because the post-renewal hook (--deploy-hook 'systemctl reload nginx') was never set or silently failed.

How to spot it: File mtime on fullchain.pem is recent, but openssl s_client -connect yourdomain.com:443 -servername yourdomain.com shows the expired cert.
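Comparing dates by eye works, but comparing fingerprints is less error-prone. The sketch below assumes you have already captured both fingerprints as strings; the function name served_cert_is_stale is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when the served cert differs
# from the on-disk cert, i.e. the web server was never reloaded.
served_cert_is_stale() {
  local disk_fp="$1" served_fp="$2"
  if [ -z "$disk_fp" ] || [ -z "$served_fp" ]; then
    echo "error: missing fingerprint" >&2
    return 2
  fi
  if [ "$disk_fp" = "$served_fp" ]; then
    echo "in sync"
    return 1
  fi
  echo "stale: server is not serving the on-disk cert"
  return 0
}

# Usage (touches the network and /etc/letsencrypt, so left commented out):
# disk_fp=$(openssl x509 -noout -fingerprint -sha256 \
#   -in /etc/letsencrypt/live/yourdomain.com/fullchain.pem)
# served_fp=$(echo | openssl s_client -connect yourdomain.com:443 \
#   -servername yourdomain.com 2>/dev/null | openssl x509 -noout -fingerprint -sha256)
# served_cert_is_stale "$disk_fp" "$served_fp" && sudo systemctl reload nginx
```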

6. Rate limit hit during a failed loop

If the renewal job retried in a tight loop, Let’s Encrypt’s rate limits kick in: 5 failed validations per hostname, per account, per hour, and 50 certificates per registered domain per week. Repeated forced reissues of the same set of names also hit the duplicate-certificate limit of 5 per week. Future renewal attempts get blocked even after you fix the underlying cause.

How to spot it: Log shows urn:ietf:params:acme:error:rateLimited or too many failed authorizations.
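Several of the causes above leave a recognizable string in the certbot log. As a rough triage aid, the sketch below maps the strings quoted in this section (plus the standard "Read-only file system" message) to a cause. The function name and the category labels are illustrative, and real logs may word errors differently.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper: read certbot log text on stdin and print
# the first matching cause. Patterns come from the strings quoted above.
classify_certbot_log() {
  local log
  log=$(cat)
  if printf '%s\n' "$log" | grep -qiE 'rateLimited|too many failed authorizations'; then
    echo "rate-limit"
  elif printf '%s\n' "$log" | grep -qE 'No space left on device|Read-only file system'; then
    echo "filesystem"
  elif printf '%s\n' "$log" | grep -qE '403 Forbidden|Authentication error'; then
    echo "dns-credentials"
  else
    echo "unknown"
  fi
}

# Usage (reads the real log, so left commented out):
# sudo tail -n 200 /var/log/letsencrypt/letsencrypt.log | classify_certbot_log
```

An "unknown" result is not a clean bill of health; it only means none of these specific strings appeared.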

Before you start

  • Note exactly how many hours/days the cert has been expired — that determines user impact and triage urgency.
  • Identify the cert tool in use: certbot, acme.sh, caddy-built-in, cert-manager (k8s), Vercel/Netlify/Cloudflare managed.
  • Have shell / sudo access to the host that runs the renewal.
  • Have a fallback HTTPS option in your back pocket: Cloudflare proxy in front (provides edge cert), or a manually issued cert as a one-off.

Information to collect

  • Output of openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates -subject.
  • Output of ls -la /etc/letsencrypt/live/yourdomain.com/.
  • Last 200 lines of /var/log/letsencrypt/letsencrypt.log.
  • systemctl status certbot.timer and systemctl list-timers --all | grep cert.
  • crontab -l and cat /etc/cron.d/certbot (or equivalent).
  • Whether anything sits in front of the origin: CDN, WAF, load balancer.
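The collection steps above can be bundled so that one failing command does not stop the rest. This is a sketch under assumptions: run_section and collect_cert_diagnostics are hypothetical names, and the paths are the certbot defaults mentioned in this article.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one diagnostic command, label its output,
# and keep going even if the command fails or is not installed.
run_section() {
  local title="$1"; shift
  echo "===== ${title} ====="
  "$@" 2>&1 || echo "(exit status $?: $*)"
  echo
}

# Hypothetical wrapper around the checklist above; run it as root.
collect_cert_diagnostics() {
  local domain="$1"
  run_section "served cert" sh -c "echo | openssl s_client -connect ${domain}:443 -servername ${domain} 2>/dev/null | openssl x509 -noout -dates -subject"
  run_section "on-disk cert files" ls -la "/etc/letsencrypt/live/${domain}/"
  run_section "certbot log tail"   tail -n 200 /var/log/letsencrypt/letsencrypt.log
  run_section "timer status"       systemctl status certbot.timer --no-pager
  run_section "cert timers"        sh -c 'systemctl list-timers --all | grep cert'
  run_section "root crontab"       crontab -l
  run_section "cron.d entry"       cat /etc/cron.d/certbot
}

# Usage: collect_cert_diagnostics yourdomain.com > cert-diag.txt
```

A non-zero exit status in a section is itself a finding: for example, a missing /etc/cron.d/certbot points at cause 1.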

Step-by-step fix

Ordered to get HTTPS back fastest, then fix automation properly.

Step 1: Run the renewal manually and capture the real error

sudo certbot renew --dry-run

--dry-run hits Let’s Encrypt’s staging environment so it does not eat your rate limit. The output tells you exactly what is failing — challenge fetch, DNS API, hook, filesystem. Read it line by line.

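If dry-run succeeds, run the real renewal. An expired cert is already due, so --force-renewal is unnecessary and only burns duplicate-certificate quota:

sudo certbot renew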

If dry-run fails, fix that error first before consuming production cert quota.
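The "dry-run first, production second" rule can be enforced rather than remembered. The sketch below assumes a plain certbot renew is enough for an expired cert; the CERTBOT_BIN variable and the function name are inventions of this example so the command can be swapped for a stub.

```bash
#!/usr/bin/env bash
# CERTBOT_BIN is an assumption of this sketch, not a certbot setting.
CERTBOT_BIN="${CERTBOT_BIN:-certbot}"

renew_if_dry_run_ok() {
  # Stage 1: simulate the renewal; stop here on any failure.
  if ! "$CERTBOT_BIN" renew --dry-run; then
    echo "dry-run failed: fix the reported error before a real renewal" >&2
    return 1
  fi
  # Stage 2: real renewal, reached only after a clean dry-run.
  "$CERTBOT_BIN" renew
}

# Usage (run as root): renew_if_dry_run_ok
```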

Step 2: If HTTP-01 challenge is being blocked

Test the challenge path directly:

sudo mkdir -p /var/www/letsencrypt/.well-known/acme-challenge
echo "test123" | sudo tee /var/www/letsencrypt/.well-known/acme-challenge/test
curl -I http://yourdomain.com/.well-known/acme-challenge/test

You want HTTP/1.1 200 OK over plain HTTP. If you get a 401/403, or a redirect whose target does not serve the file, find the offending nginx location block / WAF rule and exempt /.well-known/acme-challenge/ from auth and redirects:

location /.well-known/acme-challenge/ {
    root /var/www/letsencrypt;
    allow all;
    auth_basic off;
}

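Position in the file is not enough: a server-level return 301 https://... in the :80 server block runs before nginx selects a location, so the challenge block would never be reached. Move the redirect into a location / { return 301 https://$host$request_uri; } block so the more specific challenge location wins.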

Step 3: If DNS-01 challenge credentials are stale

Regenerate the API token (Cloudflare, Route 53, etc.). Update the credentials file:

sudo nano /etc/letsencrypt/cloudflare.ini
# dns_cloudflare_api_token = NEW_TOKEN_HERE
sudo chmod 600 /etc/letsencrypt/cloudflare.ini

Re-run renewal. If you also use the same API token elsewhere (Terraform, monitoring), update those references in the same change so you don’t break a different consumer.
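Because the credentials file holds a live API token, it is worth checking its mode before re-running renewal. A minimal sketch, assuming GNU stat (Linux); check_cred_perms is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file's mode is exactly 600.
# Relies on GNU stat's -c option.
check_cred_perms() {
  local file="$1" mode
  mode=$(stat -c '%a' "$file") || return 2
  if [ "$mode" != "600" ]; then
    echo "insecure mode ${mode} on ${file}; run: chmod 600 ${file}" >&2
    return 1
  fi
  echo "ok: ${file} is mode 600"
}

# Usage: sudo bash -c 'check_cred_perms /etc/letsencrypt/cloudflare.ini'
```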

Step 4: Restore the renewal timer / cron

For systemd:

sudo systemctl enable --now certbot.timer
systemctl list-timers | grep certbot

You should see a future Next time within ~12h. Certbot’s timer is typically twice a day.

For cron:

sudo crontab -e -u root

Add:

0 3,15 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"

The --deploy-hook is critical — it reloads nginx ONLY when a cert actually renewed, so it does not flap unnecessarily.
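A slightly more defensive hook tests the nginx config before reloading, so a broken config does not turn a successful renewal into an outage. This is a sketch, not certbot's own behavior: NGINX_TEST_CMD, NGINX_RELOAD_CMD, and deploy_hook are names invented here. RENEWED_DOMAINS and RENEWED_LINEAGE are the environment variables certbot sets for deploy hooks.

```bash
#!/usr/bin/env bash
# The two *_CMD variables are assumptions of this sketch so the real
# commands can be replaced with stubs.
NGINX_TEST_CMD="${NGINX_TEST_CMD:-nginx -t}"
NGINX_RELOAD_CMD="${NGINX_RELOAD_CMD:-systemctl reload nginx}"

deploy_hook() {
  echo "renewed: ${RENEWED_DOMAINS:-unknown} (${RENEWED_LINEAGE:-unknown})"
  # Unquoted on purpose: the variables hold a command plus its arguments.
  if ! $NGINX_TEST_CMD; then
    echo "nginx config test failed; not reloading" >&2
    return 1
  fi
  $NGINX_RELOAD_CMD
}

# Usage: save as an executable script that ends with a call to
# deploy_hook, then pass its path to --deploy-hook.
```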

Step 5: Reload the web server with the new cert

sudo systemctl reload nginx   # or haproxy, apache2, caddy
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates

The notAfter= date should now be ~89 days in the future. If the on-disk cert is new but s_client still shows the old cert, the web server still has the old cert loaded in memory. A reload (not restart) should re-read it; if it does not, do a full restart.

Step 6: Add proactive monitoring so this never silently breaks again

sudo crontab -e -u root

Add a check that emails / pings if the cert is within 25 days of expiry:

0 9 * * * /usr/local/bin/cert-expiry-check.sh

Where cert-expiry-check.sh is:

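#!/bin/bash
# Alert when the cert served for DOMAIN is close to expiry (requires GNU date).
DOMAIN="yourdomain.com"
THRESHOLD_DAYS=25
# Fetch the served certificate and extract its notAfter timestamp.
END=$(echo | openssl s_client -connect "${DOMAIN}:443" -servername "${DOMAIN}" 2>/dev/null \
      | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
if [ -z "${END}" ]; then
  # No cert retrieved: alert instead of computing a bogus date.
  echo "Could not read cert for ${DOMAIN}" | mail -s "CERT CHECK FAILED" you@example.com
  exit 1
fi
END_EPOCH=$(date -d "${END}" +%s)
NOW_EPOCH=$(date +%s)
DAYS=$(( (END_EPOCH - NOW_EPOCH) / 86400 ))
if [ "${DAYS}" -lt "${THRESHOLD_DAYS}" ]; then
  echo "Cert for ${DOMAIN} expires in ${DAYS} days" | mail -s "CERT WARN" you@example.com
fi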

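A 25-day threshold only fires once renewal is overdue: certbot normally starts trying to renew a 90-day cert when about 30 days remain, so an alert at 25 means roughly 5 days of failed attempts. It still leaves 25 days of slack before the cert actually expires, which is plenty of time to fix automation calmly.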
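The date arithmetic in the check can be pulled into small functions and tried against a fixed timestamp. This is a sketch assuming GNU date; days_left and alert_level are hypothetical names, and the 25-day threshold mirrors the script above.

```bash
#!/usr/bin/env bash
# days_left NOT_AFTER [NOW_EPOCH]: whole days until the notAfter
# timestamp (in the format openssl prints). Requires GNU date for -d.
days_left() {
  local end_epoch now_epoch
  end_epoch=$(date -d "$1" +%s) || return 2
  now_epoch="${2:-$(date +%s)}"
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# alert_level DAYS: map remaining whole days onto a severity.
alert_level() {
  if   [ "$1" -lt 1 ];  then echo "critical"   # expired or under a day left
  elif [ "$1" -lt 25 ]; then echo "warn"       # renewal is overdue
  else                       echo "ok"
  fi
}

# Usage:
# alert_level "$(days_left 'Jan 11 00:00:00 2025 GMT')"
```

Passing the current time as an optional second argument keeps the calculation reproducible, which is handy when checking the threshold logic.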

Verify

  • Browser loads https://yourdomain.com with no warning.
  • openssl s_client -connect yourdomain.com:443 -servername yourdomain.com returns a cert with notAfter ~89 days out.
  • systemctl list-timers | grep certbot shows an active future timer.
  • A certbot renew --dry-run exits clean.
  • Your monitoring (the script above, or external like UptimeRobot / Better Stack) confirms the new expiry date is being tracked.

Long-term prevention

  • Always pair certbot renew with a --deploy-hook — without it, renewed certs sit on disk while the running process serves the old one.
  • Do not rely on Let’s Encrypt’s expiry notification emails: the service was discontinued in June 2025. Route expiry alerts from your own monitoring to an address or channel someone actually reads.
  • Add external cert-expiry monitoring (UptimeRobot, Better Stack, Datadog, Pingdom) — out-of-band, not running on the same box that issues the cert.
  • After any OS upgrade, immediately re-test that certbot.timer is enabled. Upgrades sometimes disable third-party timers.
  • Document the renewal flow in your runbook so the next on-call engineer can fix it in 5 minutes, not 5 hours.

Common pitfalls

  • Running certbot renew over and over to “force it” while the underlying cause (blocked challenge) is unfixed. This burns rate limit quota, and once a weekly limit trips you can be locked out for days.
  • Assuming the platform auto-renews when you actually disabled the managed cert and switched to your own. Some platforms (Vercel, Netlify) only auto-renew certs they issued.
  • Renewing the cert but forgetting systemctl reload nginx — file is new, served cert is old.
  • Using --standalone in a cron when nginx is already bound to :80, which makes certbot fail to bind the port.
  • Setting up renewal monitoring on the same server that does the renewing — if the box dies, monitoring dies with it. Always external.

FAQ

Q: My cert just expired. Will users lose data or sessions?

Sessions are not lost server-side, but every user gets a browser warning. Mobile apps with strict cert pinning may stop working entirely. Restore HTTPS before worrying about sessions.

Q: Can I issue a new cert from a different CA as a fast fallback?

Yes — ZeroSSL, Buypass, and Sectigo all support ACME. Add a second issuer in your config. You can also park behind Cloudflare’s proxy for an edge cert in minutes (DNS change) while you fix Let’s Encrypt.

Q: I hit the rate limit. How long until I can issue again?

The 50-certs-per-domain-per-week limit is a rolling 7-day window. The 5-failed-validations-per-hour limit clears after 60 minutes. Use staging (--test-cert) until the production limit clears.

Q: Should I move from 90-day certs to 1-year certs to avoid this?

Public CAs no longer issue certs valid for more than 398 days, and the limit is shrinking: the CA/Browser Forum has approved a phased reduction that reaches 47 days by 2029. Fix automation instead of fighting the trend.

See also SSL cert delay, CAA record blocks cert issuance, and HTTPS not forced.

Tags: #Troubleshooting #SSL #letsencrypt #automation #certbot