You added Disallow: /admin/ to public/robots.txt or src/pages/robots.txt.ts, deployed, and Search Console’s robots.txt report still shows the old version while Googlebot keeps hitting /admin/. Usually it’s not a syntax error — Google caches robots.txt for up to 24 hours, your CDN piles on another cache layer, and many projects accidentally ship two robots.txt files that override each other.
This article breaks it into 5 causes ordered by hit rate, each with a curl or Search Console check.
Common causes
Ordered by hit rate, highest first.
1. Google’s robots.txt cache hasn’t expired
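Google documents that it generally caches robots.txt for up to 24 hours, and it may keep the cached copy longer if a refetch fails (timeouts or 5xx errors). If you just shipped, Googlebot is probably still using the cached copy.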
How to spot it: Search Console → Settings → robots.txt report shows a “Last fetched” timestamp. If it’s hours ago and the content is the old version, it’s the cache.
2. CDN edge cache on the .txt response
Cloudflare / Vercel Edge cache .txt files based on Cache-Control, often 4 to 24 hours. The origin is fine, but Googlebot is pulling the stale copy from the CDN edge.
How to spot it:
curl -I "https://yourdomain.com/robots.txt"
Check cf-cache-status / x-vercel-cache: HIT means cached. The age header tells you how long that copy has been there.
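To avoid reading the headers by eye, the check can be reduced to a one-line summary. This is a minimal sketch: `cache_summary` is a hypothetical helper name, and it assumes the only cache headers of interest are `cf-cache-status`, `x-vercel-cache`, and `age`.

```shell
# Summarize CDN cache headers from `curl -sI` output read on stdin.
# Header names are matched case-insensitively; the CRs from HTTP line
# endings are stripped first.
cache_summary() {
  tr -d '\r' | awk -F': *' '
    { name = tolower($1) }
    name == "cf-cache-status" || name == "x-vercel-cache" { status = $2 }
    name == "age" { age = $2 }
    END {
      if (status == "") status = "none"
      if (age == "") age = "unknown"
      printf "cache=%s age=%s\n", status, age
    }'
}
```

Usage: `curl -sI "https://yourdomain.com/robots.txt" | cache_summary`. A result such as `cache=HIT age=5400` would mean the edge has been serving that copy for about 90 minutes.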
3. Two robots.txt files exist; the static one wins
The worst trap: you wrote a dynamic src/pages/robots.txt.ts in Astro / Next to generate robots.txt, but an old public/robots.txt is still in the repo. Most frameworks serve public/ static files first, silently overriding the dynamic route.
How to spot it:
ls public/robots.txt src/pages/robots.txt* 2>/dev/null
If both exist, that’s the bug.
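The same check can run in CI so the duplicate never ships. A minimal sketch, assuming the only two candidate locations are the ones named above; `robots_sources` is a hypothetical function name.

```shell
# Print every robots.txt source found under a project root and
# return non-zero when more than one exists.
# Assumption: only public/robots.txt and src/pages/robots.txt* matter.
robots_sources() {
  root="${1:-.}"
  count=0
  for f in "$root"/public/robots.txt "$root"/src/pages/robots.txt*; do
    # An unmatched glob stays literal, so -e filters it out.
    if [ -e "$f" ]; then
      echo "$f"
      count=$((count + 1))
    fi
  done
  [ "$count" -le 1 ]
}
```

Usage: `robots_sources . || echo "more than one robots.txt source"`.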
4. Using Disallow when you actually want noindex
Disallow: /private/ blocks crawling, not indexing. If an external link points to that URL, Google can still index it (the SERP shows “No information is available for this page”). You think robots.txt failed — it didn’t, it did exactly what it promises. You needed a different tool.
How to spot it: Search Console → URL Inspection on the URL. If “Indexing allowed?” is Yes but “Crawl allowed?” says the URL is blocked by robots.txt, robots.txt is working and what you need is noindex.
5. Syntax / casing errors
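Google treats directive names as case-insensitive (disallow: parses the same as Disallow:), but path values are case-sensitive: Disallow: /Admin/ does not match /admin/. Misspelled directives are silently ignored, for example User-Agents: instead of User-agent:. Wildcards have placement rules: start every path with /, because a rule like Disallow: */admin/ is unsupported by some bots.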
How to spot it: Open Search Console’s robots.txt report (the standalone robots.txt Tester has been retired) and look at the fetched file; it flags the lines Google could not parse.
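For a quick local pass before deploying, a small lint can catch the most common slips. This is a rough sketch, not a full parser: `robots_lint` is a hypothetical name, the list of known directives is an assumption, and it only checks directive spelling and that Allow/Disallow paths start with `/` or `*`.

```shell
# Minimal robots.txt lint. Reads the file on stdin, prints "line N: message"
# for each suspicious line, and returns non-zero if anything was flagged.
# Directive names are compared case-insensitively.
robots_lint() {
  tr -d '\r' | awk '
    BEGIN { bad = 0 }
    {
      line = $0
      sub(/#.*/, "", line)                  # drop comments
      gsub(/^[ \t]+|[ \t]+$/, "", line)     # trim whitespace
      if (line == "") next
      i = index(line, ":")
      if (i == 0) { printf "line %d: missing colon\n", NR; bad = 1; next }
      name = tolower(substr(line, 1, i - 1))
      value = substr(line, i + 1)
      gsub(/[ \t]+$/, "", name)
      gsub(/^[ \t]+/, "", value)
      known = (name == "user-agent" || name == "allow" || name == "disallow" || name == "sitemap" || name == "crawl-delay")
      if (!known) { printf "line %d: unknown directive %s\n", NR, name; bad = 1; next }
      first = substr(value, 1, 1)
      if ((name == "allow" || name == "disallow") && value != "" && first != "/" && first != "*") {
        printf "line %d: path should start with / or *\n", NR; bad = 1
      }
    }
    END { exit bad }'
}
```

Usage: `robots_lint < public/robots.txt`. It says nothing about whether the rules match the URLs you intend, only whether each line is plausibly well-formed.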
Shortest path to fix
Ordered by ROI. The first three usually solve 80% of cases.
Step 1: Use curl + cache-buster to get the true origin response
curl -I "https://yourdomain.com/robots.txt?cb=$(date +%s)"
curl -s "https://yourdomain.com/robots.txt?cb=$(date +%s)"
The first shows headers; the second shows content. Checklist:
- Status must be 200 (not 301 or 404)
- content-type: text/plain or similar (not text/html; Google ignores HTML)
- cf-cache-status / x-vercel-cache should be MISS or DYNAMIC (because of the buster)
- The body is your latest version
If with buster you see the new version but without you see the old → cache (Step 2). If with buster it’s still old → the deploy didn’t pick it up (Step 3).
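That decision can be scripted by comparing the two fetched bodies against the file you expect to be live. A minimal sketch: `diagnose_robots` is a hypothetical helper, and it assumes you have a local copy of the expected robots.txt (for example the build output).

```shell
# Classify the stale-robots.txt case from three files:
#   $1 = the robots.txt you expect to be live
#   $2 = body fetched WITH the cache-buster
#   $3 = body fetched WITHOUT it
diagnose_robots() {
  if ! cmp -s "$1" "$2"; then
    echo "deploy"   # even the cache-busted fetch is old: go to Step 3
  elif ! cmp -s "$1" "$3"; then
    echo "cache"    # busted fetch is new, plain fetch is old: go to Step 2
  else
    echo "ok"       # both fetches match the expected file
  fi
}
```

Usage: save the two responses with `curl -s "https://yourdomain.com/robots.txt?cb=$(date +%s)" > busted.txt` and `curl -s "https://yourdomain.com/robots.txt" > plain.txt`, then run `diagnose_robots dist/robots.txt busted.txt plain.txt`.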
Step 2: Purge the CDN for /robots.txt only
Don’t purge everything.
- Cloudflare: Caching → Configuration → Purge Custom URLs, paste https://yourdomain.com/robots.txt
- Vercel: From the project root run vercel --prod --force, or trigger a redeploy in the dashboard
- Netlify: Deploys → Trigger deploy → Clear cache and deploy site
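On Cloudflare, the single-URL purge can also be done from a script through the zone's purge_cache API endpoint. A sketch under stated assumptions: `CF_ZONE_ID` and `CF_API_TOKEN` are environment variables you would set yourself, the token needs cache-purge permission, and `purge_body` / `purge_robots` are hypothetical helper names.

```shell
# Build the JSON body for a purge-by-URL request.
purge_body() {
  printf '{"files":["%s"]}\n' "$1"
}

# Purge a single URL from Cloudflare's cache.
# Assumes CF_ZONE_ID and CF_API_TOKEN are set in the environment.
purge_robots() {
  curl -sS -X POST \
    "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/purge_cache" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$(purge_body "$1")"
}
```

Usage: `purge_robots "https://yourdomain.com/robots.txt"`. The JSON response should report whether the purge succeeded.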
Then re-run Step 1’s curl without the buster and confirm MISS + updated content.
Step 3: Make sure there’s only one robots.txt source
# delete the static file, keep the dynamic route (or vice versa)
rm public/robots.txt
# or: delete the dynamic route
rm src/pages/robots.txt.ts
Pick one. Rebuild and deploy:
npm run build
ls dist/robots.txt && head -20 dist/robots.txt
Confirm dist/robots.txt is the version you want.
Step 4: Force Google to re-fetch
Search Console → Settings → robots.txt report → “Request a recrawl”. The “Last fetched” timestamp typically updates within 1-2 hours.
If your goal is actually deindexing (not just blocking crawl), don’t use Disallow — use a page meta:
<meta name="robots" content="noindex">
or the response header:
X-Robots-Tag: noindex
A Disallow rule stops Googlebot from fetching the page, so it never sees the noindex. The two work against each other: for noindex to take effect, the URL must stay crawlable.
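If you go the header route, it is worth confirming the header actually reaches the client. A minimal sketch, where `has_noindex_header` is a hypothetical helper that reads `curl -sI` output on stdin:

```shell
# Return 0 if the response headers on stdin include an X-Robots-Tag
# header whose value contains "noindex" (case-insensitive), else 1.
has_noindex_header() {
  tr -d '\r' | awk -F': *' '
    tolower($1) == "x-robots-tag" && tolower($0) ~ /noindex/ { found = 1 }
    END { exit (found ? 0 : 1) }'
}
```

Usage: `curl -sI "https://yourdomain.com/private/page" | has_noindex_header && echo "noindex is served"`. The path here is only an illustration; the check is meaningful only if that URL is not also blocked by a Disallow rule.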
Prevention
- Document the robots.txt “source of truth” in your README so the team knows it’s public/ or a dynamic route, never both
- Check Search Console’s robots.txt report after every deploy that changes robots.txt (the old standalone tester has been retired)
- For deindexing, use noindex meta or X-Robots-Tag, not Disallow
- Add a post-deploy smoke test that curls /robots.txt and asserts 200 + text/plain + expected lines
- Set a short Cache-Control (e.g. max-age=600) on robots.txt to reduce CDN wait time
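The smoke test from the list above could look roughly like this. It is a sketch, not a drop-in CI step: `check_robots_response` is a hypothetical function, and it works on saved header and body files so the assertions can run without network access.

```shell
# Validate a saved robots.txt response.
#   $1 = file with the response headers (e.g. from curl -sI)
#   $2 = file with the response body
#   $3 = a line that must appear verbatim in the body
check_robots_response() {
  headers=$(tr -d '\r' < "$1")
  # Second field of the status line, e.g. "HTTP/2 200" -> 200
  status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')
  ctype=$(printf '%s\n' "$headers" | awk -F': *' 'tolower($1) == "content-type" { print tolower($2) }')
  [ "$status" = "200" ] || { echo "FAIL: status $status"; return 1; }
  case "$ctype" in
    text/plain*) ;;
    *) echo "FAIL: content-type $ctype"; return 1 ;;
  esac
  grep -qxF -- "$3" "$2" || { echo "FAIL: missing line: $3"; return 1; }
  echo "OK"
}
```

Usage: after a deploy, run `curl -sI "https://yourdomain.com/robots.txt" > headers.txt` and `curl -s "https://yourdomain.com/robots.txt" > body.txt`, then `check_robots_response headers.txt body.txt "Disallow: /admin/"`. A non-zero exit fails the pipeline.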