robots.txt Not Taking Effect

You changed robots.txt but Google still crawls or indexes the same URLs. The fix is almost always a cache layer or a duplicate file — here is how to find which one.

Published: May 17, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You added Disallow: /admin/ to public/robots.txt or src/pages/robots.txt.ts, deployed, and the robots.txt report in Search Console still shows the old version while Googlebot keeps hitting /admin/. Usually it is not a syntax error.

Fastest fix: run curl -s "https://yourdomain.com/robots.txt?cb=$(date +%s)". If that cache-busted request shows your new rules, the origin is correct and the stale copy is coming from a cache layer — purge the CDN for that one file and wait out Google’s 24-hour cache. If the cache-busted request still shows the old rules, the deploy never shipped them, usually because a leftover public/robots.txt is overriding your dynamic route.

Why this happens: Google caches robots.txt for up to 24 hours (longer on 5xx or timeouts), your CDN piles on another cache layer, and many projects accidentally ship two robots.txt files that override each other. This article splits it into five causes ordered by hit rate, each with a curl or Search Console check.

Which bucket are you in

Run the cache-busted curl from “Shortest path to fix” Step 1 and the same command without the ?cb= parameter, then read this table.

Cache-busted curl shows	Plain curl shows	Most likely cause	Go to
New rules	Old rules	CDN edge cache	Cause 2, Fix Step 2
New rules	New rules	Google’s own 24h cache	Cause 1, Fix Step 4
Old rules	Old rules	Duplicate file or deploy didn’t ship	Cause 3, Fix Step 3
New rules but URL still indexed	—	`Disallow` cannot deindex	Cause 4, Fix Step 4
`text/html` content-type or non-200	—	Wrong route or error page served	Cause 5
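
The table can also be read as a small decision function. The sketch below is illustrative only: the `Fetched` shape, the `classify` name, and the bucket labels are hypothetical, and it assumes you have already captured the status, content-type, and body of both requests. The "URL still indexed" row is left out because curl alone cannot detect it.

```typescript
// Hypothetical helper: names and labels are illustrative, not part of any tool.
type Bucket =
  | "cdn-edge-cache"       // Cause 2
  | "google-cache"         // Cause 1
  | "duplicate-or-deploy"  // Cause 3
  | "wrong-response";      // Cause 5

interface Fetched {
  status: number;
  contentType: string; // value of the content-type response header
  body: string;
}

function classify(busted: Fetched, plain: Fetched, expectedRule: string): Bucket {
  // Cause 5: an error page or an HTML response is not a usable robots.txt.
  const isPlainText = busted.contentType.toLowerCase().startsWith("text/plain");
  if (busted.status !== 200 || !isPlainText) return "wrong-response";

  const bustedIsNew = busted.body.includes(expectedRule);
  const plainIsNew = plain.body.includes(expectedRule);

  if (!bustedIsNew) return "duplicate-or-deploy"; // the origin never shipped the rule
  if (!plainIsNew) return "cdn-edge-cache";       // origin is new, the edge copy is old
  return "google-cache";                          // both new: wait or request a recrawl
}
```

Pass a rule that only exists in the new file (for example `Disallow: /admin/`) as `expectedRule`, so an old body cannot match by accident.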

Common causes

Ordered by hit rate, highest first.

1. Google’s robots.txt cache hasn’t expired

Google documents that it caches robots.txt for up to 24 hours, and may hold it longer when it can’t refresh (timeouts or 5xx errors). You just shipped, so Googlebot is still using the cached copy. Google also advises against editing robots.txt many times a day for exactly this reason — your rapid edits won’t be seen until the cache turns over.

How to spot it: Search Console → Settings → robots.txt report. The “Checked on” timestamp shows when Google last fetched the file. If it is hours ago and the content is the old version, it is the cache.

2. CDN edge cache on the .txt response

Cloudflare and Vercel Edge cache .txt files based on Cache-Control, often 4 to 24 hours. The origin is fine, but Googlebot is pulling the stale copy from the CDN edge.

How to spot it:

curl -I "https://yourdomain.com/robots.txt"

Check cf-cache-status / x-vercel-cache: HIT means it was served from cache. The age header tells you how many seconds that cached copy has been sitting there.
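
If you want to automate that header check, a minimal sketch could look like the following. It assumes the response headers are already available as a plain object; `readCacheHeaders` and `CacheVerdict` are hypothetical names, and only the `HIT` value described above is treated as a cache hit.

```typescript
interface CacheVerdict {
  servedFromCache: boolean;
  ageSeconds: number | null; // null when no usable age header is present
}

function readCacheHeaders(headers: Record<string, string>): CacheVerdict {
  // Header names are case-insensitive, so normalise the keys first.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    lower[name.toLowerCase()] = (headers[name] ?? "").trim();
  }

  // Cloudflare reports cf-cache-status, Vercel reports x-vercel-cache.
  const status = (lower["cf-cache-status"] ?? lower["x-vercel-cache"] ?? "").toUpperCase();

  const rawAge = lower["age"];
  const age = rawAge === undefined ? NaN : Number.parseInt(rawAge, 10);

  return {
    servedFromCache: status === "HIT",
    ageSeconds: Number.isNaN(age) ? null : age,
  };
}
```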

3. Two robots.txt files exist; the static one wins

The worst trap: you wrote a dynamic route to generate robots.txt (src/pages/robots.txt.ts in Astro, app/robots.ts in the Next.js App Router), but an old public/robots.txt is still in the repo. Most frameworks serve public/ static files first, silently overriding the dynamic route.

How to spot it:

ls public/robots.txt src/pages/robots.txt* 2>/dev/null

If both exist, that is the bug. The static public/robots.txt is winning and your dynamic route never runs.
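
The same check can run in CI. The sketch below is one possible shape, not a framework feature: `findRobotsConflict` is a hypothetical helper, the candidate paths are the ones named in this article, and it assumes you pass in the list of tracked files (for example from `git ls-files`).

```typescript
// Candidate locations taken from this article; extend the lists for your own layout.
const STATIC_SOURCES = ["public/robots.txt"];
const DYNAMIC_SOURCES = ["src/pages/robots.txt.ts", "app/robots.ts"];

function findRobotsConflict(existingPaths: string[]): string | null {
  const present = new Set(existingPaths);
  const statics = STATIC_SOURCES.filter((p) => present.has(p));
  const dynamics = DYNAMIC_SOURCES.filter((p) => present.has(p));

  // A conflict exists only when a static file and a dynamic route are both present.
  if (statics.length > 0 && dynamics.length > 0) {
    return `Conflict: ${statics.join(", ")} shadows ${dynamics.join(", ")}`;
  }
  return null;
}
```

Failing the build when this returns a non-null message keeps a forgotten static file from reaching production.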

4. Using `Disallow` when you actually want `noindex`

Disallow: /private/ blocks crawling, not indexing. If an external link points to that URL, Google can still index it (the SERP shows “No information is available for this page”). You think robots.txt failed, but it did exactly what it promises — you needed a different tool.

How to spot it: Search Console → URL Inspection on the URL. If “Crawl allowed?” is No but the page is still in the index, robots is working and you need noindex instead.

5. Syntax, casing, or wrong content-type

Directive names are case-insensitive, so disallow: and Disallow: are read the same way, but path values are case-sensitive: Disallow: /Admin/ does not block /admin/. A misspelled directive such as User-Agents: instead of User-agent: may be skipped by stricter parsers. Google supports the * and $ wildcards, but a path should start with /, so a pattern like Disallow: */admin/ is treated loosely, and some non-Google bots ignore wildcards entirely. A more common failure: your route returns the rules as text/html (or a 404 page), and Google only honors a text/plain body returned with HTTP 200.

How to spot it: Google retired the standalone robots.txt tester in late 2023 / 2024; there is no in-Console tester anymore. Instead, paste the URL into URL Inspection to see whether a specific path is blocked, read the warnings shown in the robots.txt report, or validate the file offline against Google’s open-source robots.txt parser, which is the exact library Googlebot uses.
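
For a quick pre-commit check before reaching for the full parser, a rough lint along these lines can catch misspelled directives and paths that do not start with /. This is a simplified sketch, not a replacement for Google's parser: `lintRobotsTxt` is a hypothetical name, the known-field list covers only user-agent, allow, disallow, and sitemap (so fields such as crawl-delay would be reported), and field names are compared case-insensitively, which matches how Google's parser reads them.

```typescript
const KNOWN_FIELDS = new Set(["user-agent", "allow", "disallow", "sitemap"]);

function lintRobotsTxt(text: string): string[] {
  const problems: string[] = [];
  text.split(/\r?\n/).forEach((raw, index) => {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments and whitespace
    if (line === "") return;

    const colon = line.indexOf(":");
    if (colon === -1) {
      problems.push(`line ${index + 1}: missing ":" separator`);
      return;
    }

    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (!KNOWN_FIELDS.has(field)) {
      problems.push(`line ${index + 1}: unknown field "${field}"`);
      return;
    }

    // An empty Disallow value is valid (it allows everything), so only check non-empty paths.
    const isPathRule = field === "allow" || field === "disallow";
    if (isPathRule && value !== "" && !value.startsWith("/")) {
      problems.push(`line ${index + 1}: path should start with "/"`);
    }
  });
  return problems;
}
```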

Shortest path to fix

Ordered by ROI. The first three usually solve 80% of cases.

Step 1: Use curl + cache-buster to get the true origin response

curl -I "https://yourdomain.com/robots.txt?cb=$(date +%s)"
curl -s "https://yourdomain.com/robots.txt?cb=$(date +%s)"

The first shows headers; the second shows content. Checklist:

Status must be 200 (not 301 or 404)
content-type must be text/plain (not text/html — Google ignores an HTML body)
cf-cache-status / x-vercel-cache should be MISS or DYNAMIC (the buster forces a fresh fetch)
The body is your latest version
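
The checklist above can be turned into a post-deploy assertion. The following is a sketch under assumptions: `checkRobotsResponse` and `RobotsResponse` are hypothetical names, and the status, content-type, and body are expected to come from whatever HTTP client your pipeline already uses.

```typescript
interface RobotsResponse {
  status: number;
  contentType: string; // value of the content-type response header
  body: string;
}

function checkRobotsResponse(res: RobotsResponse, expectedLines: string[]): string[] {
  const failures: string[] = [];

  if (res.status !== 200) {
    failures.push(`status is ${res.status}, expected 200`);
  }
  if (!res.contentType.toLowerCase().startsWith("text/plain")) {
    failures.push(`content-type is "${res.contentType}", expected text/plain`);
  }

  // Compare whole trimmed lines so "Disallow: /admin/" does not match "Disallow: /admin/public/".
  const lines = res.body.split(/\r?\n/).map((l) => l.trim());
  for (const expected of expectedLines) {
    if (lines.indexOf(expected) === -1) failures.push(`missing line: ${expected}`);
  }
  return failures;
}
```

An empty array means the response passes; anything else can be printed and used to fail the deploy step.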

Match the result against the “Which bucket are you in” table above. In short: new-with-buster + old-without means a cache (Step 2); old even with the buster means the deploy didn’t ship the file (Step 3).

Step 2: Purge the CDN for `/robots.txt` only

Do not purge everything — that needlessly cold-starts your whole site cache.

Cloudflare: Caching → Configuration → Purge Cache → Custom Purge → “Purge by: URL”, paste https://yourdomain.com/robots.txt. Use the exact UTF-8 URL; single-file purge does not accept wildcards.
Vercel: a new production deployment automatically purges the edge cache, so run vercel --prod --force from the project root (bypasses the build cache too), or click Redeploy in the dashboard. To redeploy an existing build, vercel redeploy <deployment-url> --target production.
Netlify: Deploys → Trigger deploy → Clear cache and deploy site.

Then re-run Step 1’s curl without the buster and confirm cf-cache-status: MISS (it will flip to HIT again on the next request) plus updated content.

Step 3: Make sure there’s only one robots.txt source

# Option A: delete the static file and keep the dynamic route
rm public/robots.txt
# Option B: delete the dynamic route and keep the static file (run one option, not both)
# rm src/pages/robots.txt.ts

Pick one. Rebuild and deploy:

npm run build
ls dist/robots.txt && head -20 dist/robots.txt

Confirm dist/robots.txt is the version you want. (On Next.js with the App Router the dynamic file is app/robots.ts, and the same rule applies: a public/robots.txt will still win over it.)

Step 4: Force Google to re-fetch

Search Console → Settings → robots.txt report → click the more settings (three-dot) icon next to the file → Request a recrawl. The “Checked on” timestamp typically updates within minutes to a couple of hours; Google calls this the emergency path versus the standard ~24h auto-refresh.

If your goal is actually deindexing (not just blocking crawl), do not use Disallow — use a page-level meta tag:

<meta name="robots" content="noindex">

or the response header (better for non-HTML files like PDFs):

X-Robots-Tag: noindex

A Disallow rule actually prevents Google from crawling the page and therefore from ever seeing the noindex tag, so the two are opposites in practice. To deindex, leave the URL crawlable, add noindex, wait for a recrawl, and only then add Disallow if you also want to block crawling.
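
To catch this conflict before it ships, you could compare the paths you mark noindex against your Disallow prefixes. This is a deliberately simplified sketch: `hiddenNoindexPaths` is a hypothetical helper, and it assumes literal prefix matching with no wildcards and no overriding Allow rules.

```typescript
function hiddenNoindexPaths(disallowPrefixes: string[], noindexPaths: string[]): string[] {
  // A noindex signal is only read when the URL can be crawled, so any noindex
  // path that also falls under a Disallow prefix is returned as a problem.
  return noindexPaths.filter((path) =>
    disallowPrefixes.some((prefix) => prefix !== "" && path.startsWith(prefix))
  );
}
```

Any path this returns carries a noindex that Google cannot see until the matching Disallow rule is lifted.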

How to confirm it’s fixed

Plain curl -s https://yourdomain.com/robots.txt returns the new rules with 200 + text/plain and cf-cache-status: MISS (or a low age).
In Search Console’s robots.txt report, “Checked on” is recent and the displayed content matches your new file.
For a blocked path, URL Inspection shows “Crawl allowed? No”. For a deindexed path, it shows “Indexing allowed? No” and, after the next crawl, the URL drops out of the Pages → Indexed report.

Prevention

Document the robots.txt “source of truth” in your README so the team knows it is public/ or a dynamic route, never both.
Validate robots.txt changes against the open-source robots.txt parser before deploying — the old in-Console tester is gone.
For deindexing, use noindex meta or X-Robots-Tag, not Disallow.
Add a post-deploy smoke test that curls /robots.txt and asserts 200 + text/plain + the expected lines.
Set a short Cache-Control (for example max-age=600) on robots.txt; Google may adjust its own cache lifetime based on max-age, and the CDN will hold a stale copy for less time.
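
If you generate robots.txt from a dynamic route, the content type and cache lifetime can be set in one place. The sketch below is framework-agnostic and makes several assumptions: `buildRobotsResponse` and `PlainResponse` are hypothetical names, the rules apply to all user agents, and the 600-second default simply mirrors the example above.

```typescript
interface PlainResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

function buildRobotsResponse(disallow: string[], maxAgeSeconds = 600): PlainResponse {
  // One group for all crawlers, with one Disallow line per path.
  const lines = ["User-agent: *", ...disallow.map((p) => `Disallow: ${p}`)];
  return {
    status: 200,
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": `public, max-age=${maxAgeSeconds}`,
    },
    body: lines.join("\n") + "\n",
  };
}
```

In a route handler that returns a standard `Response`, the three fields map directly onto `new Response(body, { status, headers })`.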

FAQ

How long until Google respects my new robots.txt? Up to 24 hours on its own, or minutes to a couple of hours if you click “Request a recrawl” in the robots.txt report. The CDN cache is separate and can add hours on top — purge it (Step 2) so Googlebot fetches a fresh copy on its next visit.

I blocked a URL in robots.txt but it is still in Google’s index. Why? Disallow stops crawling, not indexing. A page Google already knows about (or that other sites link to) can stay indexed with a “No information is available” snippet. To remove it, allow crawling again and add noindex (Step 4); blocking crawl actually hides the noindex tag from Google.

curl shows the new file but Googlebot still uses the old one. What now? Your origin and CDN are correct, so this is Google’s own 24-hour cache. Request a recrawl in the robots.txt report and wait. There is no faster supported method.

Does the order of Allow and Disallow rules matter? For Googlebot, no — it matches the most specific (longest) rule, regardless of order. So Allow: /admin/public/ beats Disallow: /admin/ for /admin/public/page. Other crawlers may use first-match, so keep rules unambiguous.
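
To make the longest-match rule concrete, here is a minimal sketch of the precedence logic. It is not Google's implementation: `decide`, `Rule`, and `Verdict` are hypothetical names, paths are treated as literal prefixes (no * or $ handling), and an exact-length tie goes to Allow, which is the least restrictive outcome.

```typescript
type Verdict = "allow" | "disallow";

interface Rule {
  type: Verdict;
  path: string; // literal prefix only; wildcards are out of scope for this sketch
}

function decide(rules: Rule[], urlPath: string): Verdict {
  let best: Rule | null = null;
  for (const rule of rules) {
    // Skip empty rules and rules that do not match this path.
    if (rule.path === "" || !urlPath.startsWith(rule.path)) continue;

    const longer = best === null || rule.path.length > best.path.length;
    const tieGoesToAllow =
      best !== null && rule.path.length === best.path.length && rule.type === "allow";
    if (longer || tieGoesToAllow) best = rule;
  }
  // No matching rule means the URL may be crawled.
  return best === null ? "allow" : best.type;
}
```

With `Disallow: /admin/` and `Allow: /admin/public/`, `/admin/public/page` resolves to allow and `/admin/secret` to disallow, whichever order the rules are listed in.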

Why does my dynamic robots.txt.ts route get ignored? A static public/robots.txt (Astro/Vite) or public/robots.txt alongside app/robots.ts (Next.js) is served first and the dynamic route never runs. Delete the static file (Step 3) and rebuild.

Can I block a single PDF or image from search? Disallow blocks crawling, but for true deindexing of non-HTML files use the X-Robots-Tag: noindex response header, since you cannot put a meta tag inside a PDF.

Tags: #Hosting #Debug #Troubleshooting #SEO
