robots.txt — What to Put, What to Never Put

A surgical guide to robots.txt for indie sites: the three-line default that works, the rules that quietly deindex you, and how robots.txt differs from noindex.

robots.txt is a 500-byte file that can either do nothing useful or destroy your indexing overnight. The default file most generators ship is fine. The “clever” versions people copy from old blogs are how indie sites accidentally tell Google to forget they exist.

Background

robots.txt is a crawl-control file at the root of your domain. It tells search engine crawlers which paths they may fetch. Critically, Disallow in robots.txt does NOT mean “do not index” — it means “do not crawl”. A page can still be indexed (with just its URL, no content) if it is linked from elsewhere and blocked by robots.txt. The right tool to prevent indexing is <meta name="robots" content="noindex"> — not robots.txt.

How to tell

  • You just inherited or generated a robots.txt and have no idea if it is correct.
  • Search Console shows “Blocked by robots.txt” in your Pages report.
  • site:yourdomain.com returns URLs with the description “A description for this result is not available because of this site’s robots.txt”.
  • You added Disallow rules to “hide” pages and they are still indexed.

Quick verdict

For a typical indie content site, the right robots.txt is three lines: User-agent: *, Allow: /, Sitemap: https://yoursite.com/sitemap.xml. Add Disallow only for paths you genuinely do not want crawled (admin, cart, search). To prevent indexing, use <meta name="robots" content="noindex"> on the page, not Disallow here.
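A minimal sketch of how those three lines behave, using Python's standard-library urllib.robotparser. The yoursite.com domain and the example path are placeholders. Python's parser only approximates Googlebot: it applies rules in file order and treats paths as plain prefixes. Treat the result as a sanity check, not as proof of what Google will do.

```python
from urllib.robotparser import RobotFileParser

# The three-line default from the text; yoursite.com is a placeholder domain.
DEFAULT_ROBOTS = """\
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


default_parser = parse_robots(DEFAULT_ROBOTS)

# An ordinary content URL (hypothetical path) should be crawlable.
blog_allowed = default_parser.can_fetch(
    "Googlebot", "https://yoursite.com/blog/post"
)

# site_maps() requires Python 3.8+; it returns None when no Sitemap: line exists.
declared_sitemaps = default_parser.site_maps() or []
```

Here blog_allowed is True and declared_sitemaps holds the single sitemap URL, which is all the default file is meant to do.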

Step by step

  1. Open https://yoursite.com/robots.txt in a browser. If you get a 404, your server is not serving the file. Most static hosts serve it automatically once robots.txt sits in your framework's public folder.
  2. Confirm it starts with User-agent: * and contains Sitemap: https://yoursite.com/sitemap.xml (full URL, not a relative path).
  3. Identify what to block (if anything). Common candidates: /admin/, /cart/, /api/, and internal search results (/?q=). Do NOT block /static/, /_next/, /assets/, your sitemap, or your CSS/JS; blocking those breaks rendering for Google.
  4. For pages you want crawled but NOT indexed (thank-you pages, internal duplicates), use <meta name="robots" content="noindex"> in the HTML, not Disallow in robots.txt. Disallow + noindex actually conflict — Google cannot read the noindex if you blocked crawling.
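  5. Test in Search Console. The legacy “robots.txt Tester” has been retired; the robots.txt report (under Settings) shows the file Google last fetched and any parse errors, and the URL Inspection tool reports whether a specific URL is blocked by robots.txt. Check candidate URLs to confirm they are allowed or disallowed as intended.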
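  6. After changing robots.txt, give Google time to re-fetch it. The file is generally cached for up to 24 hours, so do not expect changes to take effect instantly. If the change is urgent, the robots.txt report in Search Console (under Settings) lets you request a recrawl of the file.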
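Steps 2 to 5 can also be checked locally before deploying. The sketch below is one possible approach, not an official tool: it parses a candidate file with Python's urllib.robotparser and reports paths that end up on the wrong side of the rules. The path lists, the audit function name, and the yoursite.com domain are all illustrative assumptions. Python's parser uses plain prefix matching, so rules that rely on * wildcards are not evaluated the way Google evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Candidate file built from the paths named in step 3 (placeholder domain).
CANDIDATE = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /api/
Disallow: /search/
Sitemap: https://yoursite.com/sitemap.xml
"""

# Hypothetical sample paths; replace with real URLs from your own site.
SHOULD_BE_ALLOWED = [
    "/", "/blog/post", "/static/app.css",
    "/_next/chunk.js", "/assets/logo.png", "/sitemap.xml",
]
SHOULD_BE_BLOCKED = [
    "/admin/login", "/cart/checkout", "/api/users", "/search/widgets",
]


def audit(robots_text, base="https://yoursite.com", agent="Googlebot"):
    """Return a list of human-readable problems; an empty list means none found."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    problems = []

    for path in SHOULD_BE_ALLOWED:
        if not parser.can_fetch(agent, base + path):
            problems.append(f"blocked but should be allowed: {path}")

    for path in SHOULD_BE_BLOCKED:
        if parser.can_fetch(agent, base + path):
            problems.append(f"allowed but should be blocked: {path}")

    # Step 2: the Sitemap: line must exist and be a full URL.
    sitemaps = parser.site_maps() or []  # site_maps() needs Python 3.8+
    if not sitemaps:
        problems.append("no Sitemap: line found")
    for sitemap in sitemaps:
        if not sitemap.startswith(("http://", "https://")):
            problems.append(f"Sitemap is not a full URL: {sitemap}")

    return problems


candidate_problems = audit(CANDIDATE)
```

For the candidate file above, candidate_problems is an empty list. A file that disallows /static/ or uses a relative sitemap path would produce one entry per violation.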

Common pitfalls

  • Using Disallow: / to “hide” a staging site that is on the public web. Google may still index the URLs (just without content) if anyone links to them. Use HTTP auth on staging instead.
  • Blocking CSS, JS, or /_next/, /static/, /assets/ directories. Google needs those to render the page; blocking them can hurt rankings.
  • Disallowing a page to “noindex” it. Disallow does not deindex — it just stops crawling. The URL can still appear in results with no description.
  • Forgetting to include the Sitemap: line. Not fatal, since you can also submit the sitemap in Search Console, but redundancy here costs nothing.
  • Copy-pasting a 200-line robots.txt from a different site without understanding it. Especially WordPress robots.txt files on non-WordPress sites — they block paths that do not exist on your site, which is harmless but signals you do not know what you are doing.
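The Disallow-versus-noindex pitfall can be detected mechanically. The following is a hedged sketch that combines Python's urllib.robotparser with html.parser to flag a page whose noindex tag is unreachable because crawling is blocked. The class name, function name, status strings, and the /thank-you/ path are invented for illustration. It only inspects the meta robots tag described in this article, and Python's prefix-based rule matching is an approximation of Google's.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def crawl_index_status(robots_text, url, html, agent="Googlebot"):
    """Classify a page by whether it is crawlable and whether it carries noindex."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    crawlable = parser.can_fetch(agent, url)

    finder = RobotsMetaFinder()
    finder.feed(html)
    noindex = "noindex" in finder.directives

    if noindex and not crawlable:
        return "conflict: noindex cannot be read because crawling is blocked"
    if noindex:
        return "ok: crawlable, noindex can be seen"
    if not crawlable:
        return "crawl blocked: URL may still be indexed without a description"
    return "ok: crawlable and indexable"


# Hypothetical thank-you page that is both disallowed and tagged noindex.
ROBOTS_WITH_BLOCK = "User-agent: *\nDisallow: /thank-you/\n"
NOINDEX_HTML = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

status = crawl_index_status(
    ROBOTS_WITH_BLOCK, "https://yoursite.com/thank-you/", NOINDEX_HTML
)
```

In this example status reports the conflict. Removing the Disallow line lets the crawler reach the page and read the noindex tag.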

Who this is for

Anyone who has never opened their robots.txt or who just saw a “blocked by robots.txt” warning in Search Console.

When to skip this

Sites with custom enterprise crawl management (e.g. selective indexing per user-agent). This article assumes you want most things crawled.

FAQ

  • What is the difference between robots.txt and noindex?: robots.txt controls crawling — whether Google fetches the page. noindex controls indexing — whether Google shows it in results. They are different layers, and Disallow + noindex actually conflict because Google cannot read the noindex if crawling is blocked.
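  • Should I block search-result pages (/?q=...)?: For most content sites, yes. Internal search results are thin and create infinite URL variations. Match the rule to your actual search URL pattern: Disallow: /search/ for path-based search, or Disallow: /*?q= for query-string search like the example above (Google supports the * wildcard).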
  • Can I have multiple sitemaps in robots.txt?: Yes, just add multiple Sitemap: lines. Useful if you split sitemaps by language or by content type. All listed sitemaps are picked up by Google.
  • How long until Google notices my robots.txt changes?: Google generally caches robots.txt for up to 24 hours. After deploying a change, expect up to 24 hours before crawl behavior reflects it. If you need it sooner, the robots.txt report in Search Console (under Settings) lets you request a recrawl of the file.
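To illustrate the multiple-sitemap and search-blocking answers together, here is a small sketch using Python's urllib.robotparser. The sitemap file names, the /search/ path, and the sample URLs are assumptions made up for the example. The parser shown is Python's, so it confirms the file is well-formed rather than how Google will treat it.

```python
from urllib.robotparser import RobotFileParser

# Placeholder file: two sitemaps split by language, plus a path-based search block.
MULTI_SITEMAP = """\
User-agent: *
Disallow: /search/
Sitemap: https://yoursite.com/sitemap-en.xml
Sitemap: https://yoursite.com/sitemap-de.xml
"""


def read_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = read_rules(MULTI_SITEMAP)

# Every Sitemap: line is collected, in file order (Python 3.8+).
listed_sitemaps = rules.site_maps() or []

# Internal search results are blocked; ordinary articles are not.
search_blocked = not rules.can_fetch("Googlebot", "https://yoursite.com/search/shoes")
article_allowed = rules.can_fetch("Googlebot", "https://yoursite.com/articles/robots")
```

Both sitemap URLs appear in listed_sitemaps, and only the /search/ URL is refused.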

Tags: #Indie dev #SEO #Technical SEO #robots.txt