robots.txt — What to Put, What to Never Put (2026)

A surgical guide to robots.txt for indie sites: the working default, the rules that quietly deindex you, robots.txt vs noindex, and the AI-crawler tokens (GPTBot, ClaudeBot, Google-Extended) that matter in 2026.

Published: May 15, 2026 Updated: Jun 05, 2026 Author: AI Productivity Guide Team

robots.txt is a 500-byte text file at the root of your domain that can either do nothing useful or quietly destroy your indexing overnight. The default that most generators ship is fine. The “clever” 200-line versions people copy from old blogs are how indie sites accidentally tell Google to forget they exist — and, since early 2026, how they accidentally hand all their content to AI training crawlers (or block the search bots they actually wanted).

This guide gives you the working default, the exact rules that backfire, the real difference between robots.txt and noindex, and the current AI-crawler tokens (GPTBot, ClaudeBot, Google-Extended) as of June 2026.

TL;DR

For a typical indie content site, the correct robots.txt is three lines: User-agent: *, Allow: /, and a full-URL Sitemap: line.
Disallow means “do not crawl”, not “do not index”. A blocked page can still appear in Google with just its URL and no description.
To keep a page out of the index, use <meta name="robots" content="noindex"> on the page — never Disallow. The two conflict: Google cannot read your noindex if you blocked crawling.
Never block /_next/, /static/, /assets/, or your CSS/JS — Google needs them to render the page, and blocking them can hurt rankings.
AI crawlers (GPTBot, ClaudeBot, Google-Extended) obey separate User-agent tokens. If you want to opt out of model training, add them explicitly: a generic User-agent: * with Allow: / permits them along with every other crawler, and a crawler that finds a group naming its own token follows that group instead of *.
The standalone “robots.txt Tester” was sunset; Google replaced it with the robots.txt report under Search Console → Settings.

What robots.txt actually controls

robots.txt is a crawl-control file standardized as RFC 9309. It lives at exactly one place — https://yoursite.com/robots.txt — and tells crawlers which paths they may fetch.

The single most expensive misunderstanding: Disallow does NOT mean “do not index.” It means “do not crawl.” A page that is Disallow-ed can still be indexed (URL only, no snippet) if anything links to it. The right tool to prevent indexing is the page-level directive <meta name="robots" content="noindex">, not robots.txt.

You want to…	Use	Do NOT use
Stop a bot from fetching a path	`Disallow:` in robots.txt	`noindex` (page still gets crawled)
Keep a page out of search results	`<meta name="robots" content="noindex">`	`Disallow` (URL can still appear, no snippet)
Opt out of AI model training	named AI `User-agent` + `Disallow: /`	`User-agent: *` (does not target AI specifically)
Hide a staging site	HTTP auth / IP allowlist	`Disallow: /` (URLs can still be indexed)
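
The conflict in the second row can be checked mechanically. The following is a minimal sketch, assuming Python's standard-library urllib.robotparser as a stand-in for a real crawler's matcher (it is not Google's parser). The ROBOTS_TXT rules, the /thank-you/ path, and the helper names are illustrative assumptions, not part of any real site.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the site owner tried to "hide" /thank-you/ with Disallow.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thank-you/
"""


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_robots = attr.get("name", "").lower() == "robots"
        if is_robots and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def has_noindex(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def unreadable_noindex(robots_txt, pages, agent="Googlebot"):
    """Return paths that carry noindex but are blocked from crawling for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path, html in pages.items()
        if has_noindex(html) and not parser.can_fetch(agent, path)
    ]


# Hypothetical pages keyed by path.
pages = {
    "/thank-you/": '<html><head><meta name="robots" content="noindex"></head></html>',
    "/blog/post/": "<html><head><title>Post</title></head></html>",
}
print(unreadable_noindex(ROBOTS_TXT, pages))
```

Any path this reports has a noindex tag that a compliant crawler will never fetch, so the fix is to remove the Disallow rule and let the tag do its job.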

When this article is for you

You inherited or generated a robots.txt and have no idea whether it is correct.
Search Console’s Pages report shows “Blocked by robots.txt.”
site:yourdomain.com returns URLs with “A description for this result is not available because of this site’s robots.txt.”
You added Disallow rules to “hide” pages and they are still indexed.
You want to decide whether ChatGPT, Claude, or Google’s AI features train on your content.

The default that works

For most indie content sites, this is the entire correct file:

User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml

Add Disallow only for paths you genuinely do not want crawled — typically /admin/, /cart/, /api/, and internal search results. To keep a page out of the index, reach for noindex on the page, not Disallow here.
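
As a quick sanity check of the three-line default, the sketch below parses it with Python's standard-library urllib.robotparser. This is an assumption of convenience: the module approximates, but is not, Google's own matcher. yoursite.com is the placeholder used above, and SomeBot is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_ROBOTS = """\
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(DEFAULT_ROBOTS.splitlines())

# Every crawler falls under the * group, so every path is fetchable.
print(parser.can_fetch("Googlebot", "https://yoursite.com/blog/post/"))
print(parser.can_fetch("SomeBot", "https://yoursite.com/admin/"))

# Sitemap lines are independent of User-agent groups (site_maps needs Python 3.8+).
print(parser.site_maps())
```

Note that /admin/ is fetchable here too: the default blocks nothing until you add a Disallow line for it.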

Step by step

Open https://yoursite.com/robots.txt in a browser. A 404 means your server is not serving the file. Most static hosts serve it automatically if your framework places robots.txt in the public/ folder.
Confirm it starts with User-agent: * and contains a Sitemap: line with the full URL (https://yoursite.com/sitemap.xml), not a relative path.
Decide what to block (if anything). Commonly blocked: /admin/, /cart/, /api/, and internal search URLs such as /?q=. Never block /static/, /_next/, /assets/, your sitemap, or your CSS/JS; blocking those breaks rendering for Google.
For pages you want crawled but NOT indexed (thank-you pages, internal duplicates), add <meta name="robots" content="noindex"> to the HTML — do not Disallow them. Disallow + noindex conflict: Google cannot read the noindex if you blocked crawling.
Make your AI-crawler decision (see the next section) and add named User-agent blocks if you want to opt in or out.
Validate with the robots.txt report in Search Console → Settings, and use the URL Inspection tool to confirm a specific URL is allowed. The old standalone “robots.txt Tester” was removed.
After changing the file, you usually do not need to do anything — Google recrawls robots.txt frequently. If you just unblocked important URLs, request a recrawl from the robots.txt report to speed things up.
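
Some of the checks in these steps can be scripted before you open Search Console. The following is a rough sketch, not a replacement for the robots.txt report. It assumes Python's standard-library urllib.robotparser, which does plain prefix matching and does not understand * or $ in paths, so a file that relies on wildcards needs a different matcher. The function name, the RENDER_PATHS list, and the sample file are illustrative.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Paths the steps above say must stay crawlable so Google can render pages.
RENDER_PATHS = ["/_next/", "/static/", "/assets/"]


def lint_robots(robots_txt, agent="Googlebot"):
    """Return a list of human-readable warnings for a robots.txt string."""
    warnings = []
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # site_maps() returns None when the file has no Sitemap: line.
    sitemaps = parser.site_maps() or []
    if not sitemaps:
        warnings.append("no Sitemap: line")
    for url in sitemaps:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            warnings.append(f"Sitemap is not a full URL: {url}")

    for path in RENDER_PATHS:
        if not parser.can_fetch(agent, path):
            warnings.append(f"{path} is blocked for {agent}")

    if not parser.can_fetch(agent, "/"):
        warnings.append(f"the whole site is blocked for {agent}")
    return warnings


# Hypothetical file with two of the mistakes described above.
risky = """\
User-agent: *
Disallow: /admin/
Disallow: /static/
Sitemap: /sitemap.xml
"""
for warning in lint_robots(risky):
    print(warning)
```

For the sample file this flags the relative Sitemap path and the blocked /static/ folder, while the /admin/ rule passes without comment.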

AI crawlers in 2026: the part that is actually new

Since early 2026, the most consequential robots.txt decisions for many sites are about AI crawlers, not Googlebot. OpenAI and Anthropic both split their bots into separate, independently controllable tokens, so a blanket block is rarely what you want. There are two categories:

Training crawlers collect content to train or refine a model: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI/Gemini training), Applebot-Extended (Apple Intelligence), CCBot (Common Crawl).
Retrieval / search crawlers fetch pages in real time to answer a live query and usually cite you: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), PerplexityBot (Perplexity).

The practical split: blocking a training crawler keeps your content out of the next model; blocking a retrieval crawler removes you from AI answers and their citations — usually the opposite of what an indie publisher wants. As of June 2026, GPTBot is the single most-blocked AI token across the web, but blocking it does not affect ChatGPT’s live-browsing citations, which run through OAI-SearchBot/ChatGPT-User.

Token	Operator	Purpose	Typical indie choice
`GPTBot`	OpenAI	Model training	Block if you don’t want training
`OAI-SearchBot`	OpenAI	ChatGPT search index	Allow (you want citations)
`ChatGPT-User`	OpenAI	User-initiated live fetch	Allow
`ClaudeBot`	Anthropic	Model training	Block if you don’t want training
`Claude-SearchBot`	Anthropic	Claude search index	Allow
`Claude-User`	Anthropic	User-initiated live fetch	Allow
`Google-Extended`	Google	Gemini training opt-out token	Block to opt out; does NOT affect Google Search
`PerplexityBot`	Perplexity	Perplexity search index	Allow
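
If you maintain the file from a build script, the classification above can be treated as data. The sketch below is one possible way to do that, not an official tool: the AI_TOKENS dict copies the training/retrieval split described in this section, and build_robots is a hypothetical helper name.

```python
# Purpose of each token as described in this section; the dict itself is illustrative.
AI_TOKENS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "Applebot-Extended": "training",
    "CCBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "Claude-SearchBot": "retrieval",
    "Claude-User": "retrieval",
    "PerplexityBot": "retrieval",
}


def build_robots(blocked_purposes, sitemap_url):
    """Emit one Disallow group per token whose purpose is blocked, then the open default."""
    lines = []
    for token, purpose in AI_TOKENS.items():
        if purpose in blocked_purposes:
            lines += [f"User-agent: {token}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots({"training"}, "https://yoursite.com/sitemap.xml"))
```

Keeping the purpose next to each token makes the policy a one-word decision, and retrieval crawlers stay allowed simply because they never get a group of their own.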

A defensible “allow search, refuse training” block looks like this:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
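
One way to confirm the block above does what it claims is to feed it to a parser and ask about each token. This is a sketch assuming Python's standard-library urllib.robotparser, which matches tokens by case-insensitive substring and is only an approximation of how each vendor's crawler reads the file. The URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://yoursite.com/blog/post/"  # placeholder page
agents = [
    "GPTBot", "ClaudeBot", "CCBot",                        # training
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",   # retrieval
    "PerplexityBot", "Googlebot",
]
for agent in agents:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:18} {verdict}")
```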

Two things to know. First, Google-Extended is a training opt-out token only — it does not remove you from Google Search; regular Googlebot still crawls and ranks you. Second, robots.txt is an honor system; reputable crawlers (OpenAI, Anthropic, Google, Perplexity) document compliance, but it is not an enforcement mechanism. To actually block non-compliant scrapers you need server-side controls (WAF rules, rate limits, or a CDN bot product like Cloudflare’s).

Common pitfalls

Using Disallow: / to “hide” a public staging site. Google can still index the URLs (no content) if anyone links to them. Use HTTP auth or an IP allowlist on staging instead.
Blocking CSS, JS, or /_next/, /static/, /assets/. Google needs these to render the page; blocking them can hurt rankings.
Disallow-ing a page to “noindex” it. Disallow does not deindex — it stops crawling. The URL can still appear in results with no description.
Assuming User-agent: * expresses an AI preference. The * group applies to every crawler that has no group of its own, so Allow: / under it permits training crawlers too, and any crawler named in its own group ignores * entirely. Add explicit tokens if you have a training preference.
Copy-pasting a 200-line robots.txt from a different stack. WordPress robots.txt files on a non-WordPress site block paths that do not exist — harmless, but a tell that the file is not under your control. Write the few lines you actually need.
Leaving off the Sitemap: line. Not fatal (you also submit it in Search Console), but the redundancy costs nothing.

When to skip this

Skip this if your site runs custom enterprise crawl management: selective indexing per user-agent, crawl-rate negotiation, or a managed bot product. This article assumes you want most things crawled and want a sane AI-crawler stance.

FAQ

What is the difference between robots.txt and noindex?

robots.txt controls crawling: whether a bot fetches the page. noindex controls indexing: whether Google shows it in results. They are different layers, and Disallow + noindex actually conflict, because Google cannot read the noindex if crawling is blocked.

Will blocking GPTBot stop ChatGPT from citing me?

No. GPTBot is OpenAI’s training crawler. ChatGPT’s live browsing and citations run through OAI-SearchBot and ChatGPT-User. Block GPTBot to refuse training while keeping OAI-SearchBot allowed so you can still appear in ChatGPT answers.

Does Google-Extended remove me from Google Search?

No. Google-Extended is a training opt-out token for Gemini and Google’s AI features only. Regular Googlebot still crawls, indexes, and ranks your pages normally.

Should I block internal search-result pages (/?q=...)?

For most content sites, yes. Internal search results are thin and spawn infinite URL variations. Use a rule that matches your actual search URL pattern: Disallow: /search/ if results live under /search/, or Disallow: /*?q= if they are served from a ?q= query parameter (Google supports the * wildcard).

Can I list multiple sitemaps in robots.txt?

Yes. Add multiple Sitemap: lines, which is useful when you split sitemaps by language or content type. Google reads all of them.

How long until Google notices my robots.txt changes?

Google recrawls robots.txt frequently, usually within a day. After unblocking important URLs you can request a recrawl from the robots.txt report in Search Console → Settings to nudge it.
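
On the internal-search question, the rule only helps if it matches the URL shape your site actually produces. The following is a simplified sketch of RFC 9309-style path matching, where * matches any run of characters and a trailing $ anchors the end. It ignores percent-encoding normalization and Allow/Disallow precedence, and rule_matches and the sample paths are hypothetical.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches the start of `path`.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with a regex wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


# Hypothetical URL shapes for an internal search page.
for pattern in ["/search/", "/*?q="]:
    for path in ["/search/?q=robots", "/?q=robots", "/blog/post/"]:
        print(f"{pattern:10} {path:20} {rule_matches(pattern, path)}")
```

The point of the comparison is that a /search/ rule never touches a site whose search lives at /?q=, so check the pattern against a real search URL before relying on it.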

Tags: #Indie dev #SEO #Technical SEO #robots.txt
