robots.txt Not Taking Effect

You changed robots.txt but Google still crawls / indexes the same URLs.

You added Disallow: /admin/ to public/robots.txt or src/pages/robots.txt.ts, deployed, and Search Console’s robots.txt report still shows the old version while Googlebot keeps hitting /admin/. Usually it’s not a syntax error — Google caches robots.txt for up to 24 hours, your CDN piles on another cache layer, and many projects accidentally ship two robots.txt files that override each other.

This article breaks the problem into five causes, each with a curl or Search Console check.

Common causes

Ordered by how often each one turns out to be the culprit, most frequent first.

1. Google’s robots.txt cache hasn’t expired

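Google documents that it generally caches robots.txt for up to 24 hours, and it may keep the cached copy longer when a refresh fails (for example on timeouts or 5xx errors). If you just shipped, Googlebot is probably still using the cached copy.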

How to spot it: Search Console → Settings → robots.txt report shows when Google last fetched the file (the “Checked on” column). If that time is before your deploy and the content shown is the old version, it’s the cache.

2. CDN edge cache on the .txt response

CDNs such as Cloudflare and Vercel’s edge network can cache the robots.txt response, governed by Cache-Control or the CDN’s own defaults, commonly for several hours. The origin is fine, but Googlebot is pulling the stale copy from the CDN edge.

How to spot it:

curl -I "https://yourdomain.com/robots.txt"

Check cf-cache-status / x-vercel-cache: HIT means the response came from the cache. The age header tells you, in seconds, how long that copy has been there.
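If the full header dump is noisy, a small filter can keep only the cache-related lines. This is a minimal sketch: cache_summary is a hypothetical helper name, and the header list is just the ones mentioned above plus cache-control.

```shell
# Sketch: print only the cache-related headers from `curl -sI` output read on stdin.
# cache_summary is a hypothetical helper, not part of curl or any CDN tooling.
cache_summary() {
  tr -d '\r' | awk '
    {
      idx = index($0, ":")
      if (idx == 0) next                      # skip the status line and blank lines
      key = tolower(substr($0, 1, idx - 1))   # header names are case-insensitive
      val = substr($0, idx + 1)
      sub(/^[ \t]+/, "", val)
      if (key == "cf-cache-status" || key == "x-vercel-cache" ||
          key == "age" || key == "cache-control")
        print key ": " val
    }'
}

# Usage (placeholder domain from this article):
#   curl -sI "https://yourdomain.com/robots.txt" | cache_summary
```

A HIT together with a large age value points at the CDN rather than at Google.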

3. Two robots.txt files exist; the static one wins

The worst trap: you wrote a dynamic route such as src/pages/robots.txt.ts in Astro (or the equivalent in Next) to generate robots.txt, but an old public/robots.txt is still in the repo. Depending on the framework and hosting setup, the static file in public/ is often served first, silently overriding the dynamic route.

How to spot it:

ls public/robots.txt src/pages/robots.txt* 2>/dev/null

If both exist, that’s the bug.
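The same check can be turned into a guard that fails loudly, for example in CI. This is a sketch only: the function names are hypothetical, and the two paths are the ones named in this article, so adjust them for your framework.

```shell
# Sketch: list every file that could produce /robots.txt and fail if there is more than one.
# Paths are the two named in this article; other frameworks use different locations.
robots_sources() {
  root="${1:-.}"
  for f in "$root"/public/robots.txt "$root"/src/pages/robots.txt*; do
    # An unmatched glob stays literal in POSIX sh, so confirm the file really exists.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

check_single_robots_source() {
  count=$(robots_sources "${1:-.}" | wc -l | tr -d ' ')
  if [ "$count" -gt 1 ]; then
    echo "conflict: $count robots.txt sources found" >&2
    return 1
  fi
  return 0
}

# Usage, from the project root:
#   check_single_robots_source . || exit 1
```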

4. Using Disallow when you actually want noindex

Disallow: /private/ blocks crawling, not indexing. If an external link points to that URL, Google can still index it (the SERP shows “No information is available for this page”). You think robots.txt failed — it didn’t, it did exactly what it promises. You needed a different tool.

How to spot it: Search Console → URL Inspection on the URL. If “Crawl allowed?” says no (blocked by robots.txt) but the URL is still indexed, typically reported as “Indexed, though blocked by robots.txt”, robots.txt is working and you need noindex instead.

5. Syntax / casing errors

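Field names are case-insensitive, so disallow: and Disallow: behave the same, but path values are case-sensitive: Disallow: /Admin/ does not block /admin/. A misspelled field such as User-Agents: (instead of User-agent:) is ignored. Paths should start with /, so a rule like Disallow: */admin/ is unsupported by some bots.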

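How to spot it: Open the file in Search Console’s robots.txt report; it lists parsing errors and warnings with line numbers for the version Google fetched. (The older standalone robots.txt tester has been retired.)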
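A quick local check can catch the most common mistakes before Google ever fetches the file. The sketch below is not a full parser: lint_robots is a hypothetical helper, the list of accepted fields is an assumption (user-agent, allow, disallow, sitemap, plus the non-standard crawl-delay), and field names are matched case-insensitively, which is how Google’s parser treats them.

```shell
# Sketch: flag unknown field names and rule paths that do not start with "/".
# lint_robots is a hypothetical helper; the accepted-field list is an assumption.
lint_robots() {
  tr -d '\r' | awk '
    BEGIN { bad = 0 }
    {
      line = $0
      sub(/#.*/, "", line)                 # drop comments
      gsub(/^[ \t]+|[ \t]+$/, "", line)    # trim surrounding whitespace
      if (line == "") next
      idx = index(line, ":")
      if (idx == 0) { print "line " NR ": missing colon"; bad = 1; next }
      field = tolower(substr(line, 1, idx - 1))   # field names are case-insensitive
      gsub(/[ \t]+$/, "", field)
      value = substr(line, idx + 1)
      gsub(/^[ \t]+/, "", value)
      if (field !~ /^(user-agent|allow|disallow|sitemap|crawl-delay)$/) {
        print "line " NR ": unknown field \"" field "\""; bad = 1
      } else if ((field == "allow" || field == "disallow") && value != "" && value !~ /^\//) {
        print "line " NR ": path should start with /"; bad = 1
      }
    }
    END { exit bad }'
}

# Usage:
#   lint_robots < public/robots.txt
```

The function exits non-zero when it prints any finding, so it can gate a build.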

Shortest path to fix

Ordered by payoff for effort. The first three steps resolve most cases.

Step 1: Use curl + cache-buster to get the true origin response

curl -I "https://yourdomain.com/robots.txt?cb=$(date +%s)"
curl -s "https://yourdomain.com/robots.txt?cb=$(date +%s)"

The first shows headers; the second shows content. Checklist:

  • Status must be 200 (not 301 or 404)
  • content-type: text/plain or similar (not text/html — Google ignores HTML)
  • cf-cache-status / x-vercel-cache should be MISS or DYNAMIC (because of the buster)
  • The body is your latest version

If the cache-busted request returns the new version but the plain request returns the old one, it’s a cache problem (Step 2). If even the cache-busted request returns the old version, the deploy didn’t pick up your change (Step 3).
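The decision between Step 2 and Step 3 can be scripted once both responses are saved to disk. This is a sketch under assumptions: diagnose_robots is a hypothetical helper, and the rule you pass in is whatever line you expect to see in the new file.

```shell
# Sketch: classify the problem from two saved response bodies.
# $1 = a line expected in the new file
# $2 = body fetched with the cache-buster, $3 = body fetched without it
diagnose_robots() {
  expected="$1"; busted="$2"; plain="$3"
  if ! tr -d '\r' < "$busted" | grep -qxF -- "$expected"; then
    echo "deploy"   # even the cache-busted response is old: go to Step 3
  elif ! tr -d '\r' < "$plain" | grep -qxF -- "$expected"; then
    echo "cache"    # only the cached response is old: go to Step 2
  else
    echo "ok"
  fi
}

# Usage (placeholder domain from this article):
#   curl -s "https://yourdomain.com/robots.txt?cb=$(date +%s)" > busted.txt
#   curl -s "https://yourdomain.com/robots.txt" > plain.txt
#   diagnose_robots "Disallow: /admin/" busted.txt plain.txt
```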

Step 2: Purge the CDN for /robots.txt only

Don’t purge everything.

  • Cloudflare: Caching → Configuration → Purge Custom URLs, paste https://yourdomain.com/robots.txt
  • Vercel: From the project root run vercel --prod --force, or trigger a redeploy in the dashboard
  • Netlify: Deploys → Trigger deploy → Clear cache and deploy site

Then re-run Step 1’s curl without the buster and confirm MISS + updated content.
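On Cloudflare, the same single-URL purge can be done from the command line through the purge_cache API endpoint. Treat this as a sketch: CF_ZONE_ID and CF_API_TOKEN are assumed environment variables that you supply yourself, and the function names are hypothetical.

```shell
# Sketch: purge one URL through the Cloudflare API instead of the dashboard.
# CF_ZONE_ID and CF_API_TOKEN are assumed environment variables (not set by any tool).
purge_payload() {
  # Build the JSON body for a single-file purge.
  printf '{"files":["%s"]}' "$1"
}

purge_robots() {
  curl -sS -X POST "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/purge_cache" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$(purge_payload "$1")"
}

# Usage:
#   purge_robots "https://yourdomain.com/robots.txt"
```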

Step 3: Make sure there’s only one robots.txt source

# Option A: delete the static file and keep the dynamic route
rm public/robots.txt
# Option B: delete the dynamic route and keep the static file
# rm src/pages/robots.txt.ts

Pick one. Rebuild and deploy:

npm run build
ls dist/robots.txt && head -20 dist/robots.txt

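Confirm dist/robots.txt is the version you want. (dist/ is the default static output directory; a route that is rendered on the server at request time will not produce a file there, so check the live URL instead.)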

Step 4: Force Google to re-fetch

Search Console → Settings → robots.txt report → open the menu next to the file → “Request a recrawl”. The “Checked on” timestamp often updates within an hour or two, though Google gives no guarantee.

If your goal is actually deindexing (not just blocking crawl), don’t use Disallow — use a page meta:

<meta name="robots" content="noindex">

or the response header:

X-Robots-Tag: noindex

A Disallow rule stops Google from fetching the page, so it never sees the noindex tag. Combining the two defeats the noindex: the URL has to stay crawlable until it drops out of the index.
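Before waiting on Google, it is worth confirming that the page really sends a noindex signal. The sketch below works on a saved response; has_noindex is a hypothetical helper, and the meta pattern is deliberately loose (it assumes the name attribute comes before content inside the tag).

```shell
# Sketch: report whether a saved response carries a noindex signal.
# $1 = headers file (from curl -D), $2 = body file (from curl -o)
has_noindex() {
  if tr -d '\r' < "$1" | grep -qiE '^x-robots-tag:.*noindex'; then
    echo "header"; return 0
  fi
  # Loose pattern: any quote style, name attribute before content.
  if grep -qiE '<meta[^>]*name=.robots.[^>]*noindex' "$2"; then
    echo "meta"; return 0
  fi
  echo "none"; return 1
}

# Usage (the URL is a placeholder):
#   curl -s -D headers.txt -o body.html "https://yourdomain.com/private/page"
#   has_noindex headers.txt body.html
```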

Prevention

  • Document the robots.txt “source of truth” in your README so the team knows it’s public/ or a dynamic route, never both
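  • Validate robots.txt changes before deploying, for example with a local parser or linter; Search Console’s robots.txt report only checks the version that is already live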
  • For deindexing, use noindex meta or X-Robots-Tag, not Disallow
  • Add a post-deploy smoke test that curls /robots.txt and asserts 200 + text/plain + expected lines
  • Set a short Cache-Control (e.g. max-age=600) on robots.txt to reduce CDN wait time; Google may also adjust its own cache lifetime based on max-age
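The post-deploy smoke test from the list above could look roughly like this. It is a sketch, not a finished CI job: assert_robots is a hypothetical helper, it works on a response saved with curl, and the expected lines are whatever rules your own file must contain.

```shell
# Sketch of a post-deploy smoke test working on a saved response.
# $1 = headers file (curl -D), $2 = body file (curl -o), remaining args = lines that must appear.
assert_robots() {
  headers="$1"; body="$2"; shift 2
  # The status code is the second word of the first header line, e.g. "HTTP/2 200".
  code=$(tr -d '\r' < "$headers" | awk 'NR == 1 { print $2 }')
  if [ "$code" != "200" ]; then
    echo "FAIL: status $code" >&2; return 1
  fi
  if ! tr -d '\r' < "$headers" | grep -qiE '^content-type:[[:space:]]*text/plain'; then
    echo "FAIL: content-type is not text/plain" >&2; return 1
  fi
  for line in "$@"; do
    if ! tr -d '\r' < "$body" | grep -qxF -- "$line"; then
      echo "FAIL: missing line: $line" >&2; return 1
    fi
  done
  echo "OK"
}

# Usage (placeholder domain from this article):
#   curl -s -D headers.txt -o body.txt "https://yourdomain.com/robots.txt"
#   assert_robots headers.txt body.txt "User-agent: *" "Disallow: /admin/"
```

Run it without a cache-buster so the test sees what crawlers actually receive.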

Tags: #Hosting #Debug #Troubleshooting #SEO