Next.js Sitemap and robots.txt: The App Router Way (2026)

A correct sitemap and robots.txt decide whether Google indexes your Next.js site. Here are the App Router idioms for both, verified against Next.js 16.

Published: May 15, 2026 Updated: Jun 05, 2026 Author: AI Productivity Guide Team

Sitemaps and robots.txt are boring infrastructure until Google reports “Discovered – currently not indexed” because your sitemap is missing half your pages, or until a stray Disallow: / left over from a staging deploy quietly de-indexes your whole site. Next.js App Router gives you two clean, typed ways to generate both. Pick one per file and own it.

TL;DR

Under ~100 static pages: a hand-written public/robots.txt plus a generated sitemap is fine. For anything driven by a content collection or database, use App Router’s app/robots.ts and app/sitemap.ts so the files rebuild on every deploy.
A single sitemap maxes out at 50,000 URLs / 50 MB (Google’s hard limit). Next.js does not auto-split — past that you call generateSitemaps().
robots.txt is a crawl preference, not access control. Well-behaved bots obey it; it does not stop anyone from fetching a URL.
Verified on Next.js 16.2 (App Router) as of June 2026.

Static vs dynamic: which to use

Situation	robots	sitemap
Marketing site, <100 fixed pages	`public/robots.txt`	`app/sitemap.ts` or hand-written `public/sitemap.xml`
Content/MDX site, pages from a collection	`app/robots.ts`	`app/sitemap.ts` reading the same collection
Over 50,000 URLs	`app/robots.ts`	`app/sitemap.ts` + `generateSitemaps()`
Per-tenant or per-locale rules	`app/robots.ts` (dynamic)	`app/sitemap.ts` (dynamic)

The rule of thumb: if the file’s contents change when you publish content, generate it in code. A hand-written sitemap goes stale the first time you forget to edit it.

Symptoms you actually have a problem

You launched a Next.js site and Google has indexed fewer than 30% of your articles after two weeks.
Search Console → Sitemaps shows “Couldn’t fetch”, “Sitemap could not be read”, or “0 discovered URLs”.
The deployed robots.txt contains Disallow: /. That blocks everything. It is the single most common self-inflicted SEO outage.
Your sitemap returns Content-Type: text/html instead of application/xml, so Google rejects it without a clear error.

Static robots.txt

For a fixed ruleset, drop a plain file at public/robots.txt. Next.js serves it verbatim at the root:

User-agent: *
Allow: /
Disallow: /api/
Disallow: /preview/
Disallow: /drafts/

Sitemap: https://yourdomain.com/sitemap.xml

Do not disallow /_next/. Modern Googlebot renders pages and needs your JS and CSS chunks; blocking /_next/ can break rendering and hurt indexing. Only disallow routes that genuinely should never be crawled.
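
The two mistakes above (a blanket Disallow: / and a blocked /_next/) are easy to catch mechanically. Below is a minimal sketch of a pre-deploy check; auditRobotsTxt is a hypothetical helper name, not part of Next.js, and it only understands plain User-agent and Disallow lines, not wildcards.

```typescript
// Sketch: flag risky rules in a robots.txt body before it ships.
// `auditRobotsTxt` is a hypothetical helper, not a Next.js API.
function auditRobotsTxt(body: string): string[] {
  const warnings: string[] = [];
  let agents: string[] = [];
  let lastWasAgent = false;

  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      agents = lastWasAgent ? [...agents, value] : [value];
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if (field !== 'disallow') continue;
    if (value === '/' && agents.includes('*')) {
      warnings.push('Disallow: / applies to all crawlers');
    }
    if (value.startsWith('/_next')) {
      warnings.push(`Disallow: ${value} blocks Next.js assets`);
    }
  }
  return warnings;
}
```

Feeding it the deployed file's text in CI and failing the build on any warning is one way to use it; a rule that blocks only a named bot (for example GPTBot) is deliberately not flagged.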

Dynamic app/robots.ts

The App Router idiom is a default-exported function returning a typed MetadataRoute.Robots object:

// app/robots.ts
import type { MetadataRoute } from 'next';

const SITE = 'https://yourdomain.com';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/', disallow: ['/api/', '/drafts/', '/preview/'] },
    ],
    sitemap: `${SITE}/sitemap.xml`,
    host: SITE,
  };
}

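sitemap also accepts an array of strings if you serve a sitemap index plus sub-sitemaps. The host field is still accepted in Next.js 16 and emits a Host line, but that directive is non-standard and Google ignores it. Treat it as optional, and rely on redirects and canonical tags to signal your canonical host.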
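
Because the file is generated in code, the rules can depend on the deploy environment, which is one way to keep a staging Disallow: / from reaching production. The following is a sketch under assumptions: buildRobots and the isProduction flag are hypothetical names, and the local RobotsFile type stands in for MetadataRoute.Robots so the snippet runs on its own.

```typescript
// Local stand-in for the relevant part of MetadataRoute.Robots so the
// sketch is self-contained; in a real app, import the type from 'next'.
type RobotsRule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
};
type RobotsFile = { rules: RobotsRule[]; sitemap?: string | string[] };

const SITE = 'https://yourdomain.com';

// `isProduction` is an assumed flag; derive it from whatever your host exposes.
function buildRobots(isProduction: boolean): RobotsFile {
  if (!isProduction) {
    // Preview and staging builds: keep every crawler out, advertise no sitemap.
    return { rules: [{ userAgent: '*', disallow: '/' }] };
  }
  return {
    rules: [
      { userAgent: '*', allow: '/', disallow: ['/api/', '/drafts/', '/preview/'] },
    ],
    sitemap: `${SITE}/sitemap.xml`,
  };
}
```

In app/robots.ts the default export would simply return buildRobots(...) with the flag taken from your deployment environment. On Vercel, for instance, that could be process.env.VERCEL_ENV === 'production'; other hosts expose different variables.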

Blocking AI crawlers: know what you are blocking

There are two distinct families of AI bots, and they are not the same lever (as of June 2026):

Training crawlers scrape content to train models: GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), Meta-ExternalAgent. Google-Extended belongs in the same list, although it is a robots.txt control token for Gemini training use rather than a separate crawler.
Search / RAG fetchers either index pages for an AI search product or fetch a page at answer time so it can be cited: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), PerplexityBot.

For a monetized content site, the usual stance is: block the training bots (they take your content and give nothing back) but keep the search bots (they can send you referral traffic). Blocking everything also removes you from AI answer citations.

// app/robots.ts — block training crawlers, keep AI-search citations
import type { MetadataRoute } from 'next';

const SITE = 'https://yourdomain.com';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/', disallow: ['/api/', '/drafts/', '/preview/'] },
      {
        userAgent: ['GPTBot', 'ClaudeBot', 'Google-Extended', 'CCBot', 'Meta-ExternalAgent'],
        disallow: '/',
      },
    ],
    sitemap: `${SITE}/sitemap.xml`,
    host: SITE,
  };
}

Remember robots.txt is voluntary. Cloudflare reported in 2025 that AI training crawling overtook the rest of AI bot activity, and not every operator obeys the file. If you need a hard block, do it at the CDN or WAF layer, not in robots.txt.
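
For bots that do comply, it helps to see why the two-rule setup blocks GPTBot but not OAI-SearchBot: a crawler named in its own group follows only that group, and everyone else falls back to the * group. The sketch below models that with a simplified matcher. isAllowed is a hypothetical helper, it treats paths as plain prefixes (no * or $ wildcards), and it is an approximation of the matching rules rather than a full robots.txt parser.

```typescript
type Rule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
};

// Normalize a string-or-array field and drop empty values.
const toArray = (v?: string | string[]): string[] =>
  (v === undefined ? [] : Array.isArray(v) ? v : [v]).filter((s) => s !== '');

// True when a compliant crawler with this product token may fetch `path`.
function isAllowed(rules: Rule[], botToken: string, path: string): boolean {
  const token = botToken.toLowerCase();
  const named = rules.filter((r) =>
    toArray(r.userAgent).some((ua) => ua.toLowerCase() === token),
  );
  // A bot named in its own group ignores the `*` group entirely.
  const groups =
    named.length > 0
      ? named
      : rules.filter((r) => toArray(r.userAgent).includes('*'));

  // Longest matching prefix wins; ties go to allow. No match means allowed.
  let best = { length: -1, allowed: true };
  for (const group of groups) {
    for (const prefix of toArray(group.allow)) {
      if (path.startsWith(prefix) && prefix.length >= best.length) {
        best = { length: prefix.length, allowed: true };
      }
    }
    for (const prefix of toArray(group.disallow)) {
      if (path.startsWith(prefix) && prefix.length > best.length) {
        best = { length: prefix.length, allowed: false };
      }
    }
  }
  return best.allowed;
}

// The same rules as the robots.ts example in this section.
const rules: Rule[] = [
  { userAgent: '*', allow: '/', disallow: ['/api/', '/drafts/', '/preview/'] },
  {
    userAgent: ['GPTBot', 'ClaudeBot', 'Google-Extended', 'CCBot', 'Meta-ExternalAgent'],
    disallow: '/',
  },
];
```

With these rules, isAllowed(rules, 'GPTBot', '/en/articles/foo/') is false, while the same path for 'OAI-SearchBot' is true because that bot only matches the * group.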

Dynamic app/sitemap.ts

Generate the sitemap from the same content source your pages render, so it can never drift. For a bilingual MDX site:

// app/sitemap.ts
import type { MetadataRoute } from 'next';
import { getAllArticles } from '@/lib/content';

const SITE = 'https://yourdomain.com';

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const articles = await getAllArticles();

  const staticPaths: MetadataRoute.Sitemap = [
    { url: `${SITE}/`,       changeFrequency: 'daily',   priority: 1.0 },
    { url: `${SITE}/about/`, changeFrequency: 'monthly', priority: 0.5 },
  ];

  const articlePaths: MetadataRoute.Sitemap = articles.map((a) => ({
    url: `${SITE}/en/articles/${a.slug}/`,
    lastModified: a.updatedAt ?? a.publishedAt,
    changeFrequency: 'weekly',
    priority: 0.8,
    alternates: {
      languages: {
        en: `${SITE}/en/articles/${a.slug}/`,
        zh: `${SITE}/zh/articles/${a.slug}/`,
        'x-default': `${SITE}/en/articles/${a.slug}/`,
      },
    },
  }));

  return [...staticPaths, ...articlePaths];
}

The supported entry fields are url, lastModified, changeFrequency, priority, alternates.languages, and images / videos for media sitemaps. Next emits each alternates.languages map as <xhtml:link rel="alternate" hreflang="..."> tags:

<url>
  <loc>https://yourdomain.com/en/articles/foo/</loc>
  <lastmod>2026-05-22T00:00:00.000Z</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.8</priority>
  <xhtml:link rel="alternate" hreflang="en" href="https://yourdomain.com/en/articles/foo/" />
  <xhtml:link rel="alternate" hreflang="zh" href="https://yourdomain.com/zh/articles/foo/" />
  <xhtml:link rel="alternate" hreflang="x-default" href="https://yourdomain.com/en/articles/foo/" />
</url>
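
The output above lists only the /en/ URL as a <loc>, with /zh/ appearing solely as an alternate. If you want each language version to be its own sitemap entry, one option is to emit one entry per locale that shares the same alternates map. This is a sketch under assumptions: articleEntries, the Article shape, and the LOCALES list are hypothetical, the local SitemapEntry type stands in for an element of MetadataRoute.Sitemap, and it assumes every article exists in both languages.

```typescript
// Hypothetical content shape; adapt to whatever your collection returns.
type Article = { slug: string; updatedAt?: string; publishedAt: string };

// Local stand-in for one element of MetadataRoute.Sitemap.
type SitemapEntry = {
  url: string;
  lastModified?: string | Date;
  alternates?: { languages?: Record<string, string> };
};

const SITE = 'https://yourdomain.com';
const LOCALES = ['en', 'zh'] as const;

function articleEntries(articles: Article[]): SitemapEntry[] {
  return articles.flatMap((a) => {
    // One shared hreflang map per article, reused by every locale's entry.
    const languages: Record<string, string> = {};
    for (const locale of LOCALES) {
      languages[locale] = `${SITE}/${locale}/articles/${a.slug}/`;
    }
    languages['x-default'] = `${SITE}/en/articles/${a.slug}/`;

    return LOCALES.map((locale) => ({
      url: `${SITE}/${locale}/articles/${a.slug}/`,
      lastModified: a.updatedAt ?? a.publishedAt, // real timestamps only
      alternates: { languages },
    }));
  });
}
```

With this shape the sitemap holds articles × locales entries, so any count-based sanity check needs to expect that number rather than the raw article count.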

Past 50,000 URLs: generateSitemaps()

Google caps a single sitemap at 50,000 URLs or 50 MB uncompressed. Next.js will not split for you; you export generateSitemaps() alongside the default function. Note the v16 change: the id argument is now a Promise that resolves to a string, so you must await it.

// app/articles/sitemap.ts
import type { MetadataRoute } from 'next';
import { countArticles, getArticlePage } from '@/lib/content';

const SITE = 'https://yourdomain.com';
const PER_FILE = 50_000;

export async function generateSitemaps() {
  const total = await countArticles();
  const count = Math.ceil(total / PER_FILE);
  return Array.from({ length: count }, (_, id) => ({ id }));
}

export default async function sitemap(props: {
  id: Promise<string>;
}): Promise<MetadataRoute.Sitemap> {
  const id = Number(await props.id);
  const articles = await getArticlePage(id * PER_FILE, PER_FILE);
  return articles.map((a) => ({
    url: `${SITE}/en/articles/${a.slug}/`,
    lastModified: a.updatedAt ?? a.publishedAt,
  }));
}

The chunks are served at /articles/sitemap/0.xml, /articles/sitemap/1.xml, and so on. Reference each one (or a sitemap index) from robots.ts.
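
Since the chunk count is derived from the article total, the list of chunk URLs for robots.ts can be computed with the same arithmetic instead of being hard-coded. A small sketch, assuming the app/articles/sitemap.ts location used above; sitemapChunkUrls is a hypothetical helper name.

```typescript
const SITE = 'https://yourdomain.com';
const PER_FILE = 50_000;

// Lists the chunk URLs that generateSitemaps() in app/articles/sitemap.ts
// would serve for a given article total (ids start at 0).
function sitemapChunkUrls(totalArticles: number, perFile: number = PER_FILE): string[] {
  const count = Math.ceil(totalArticles / perFile);
  return Array.from(
    { length: count },
    (_, id) => `${SITE}/articles/sitemap/${id}.xml`,
  );
}
```

The returned array can be passed as the sitemap field in robots.ts, which accepts an array of strings. For 120,001 articles it yields three URLs, ending in /articles/sitemap/2.xml.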

Verify after every deploy

Run these three checks against the live URL, not localhost:

curl -sI https://yourdomain.com/robots.txt | grep -i content-type
# content-type: text/plain; charset=utf-8

curl -sI https://yourdomain.com/sitemap.xml | grep -i content-type
# content-type: application/xml; charset=utf-8

curl -s https://yourdomain.com/sitemap.xml | grep -c '<loc>'
# should roughly equal total articles + static pages

If the <loc> count is far below your page count, getAllArticles() is probably filtering out more than you intended, such as a whole locale or published articles still flagged as drafts.

Submit and monitor in Search Console

Submit https://yourdomain.com/sitemap.xml under Indexing → Sitemaps. Status typically moves from “Pending” to “Success” within a day or two.
Check Pages weekly for the first month. Coverage should climb from a handful of URLs to most of the site.
If it plateaus, paste one stuck URL into the URL Inspection tool to read Google’s exact reason (“Discovered – currently not indexed”, “Crawled – currently not indexed”, “Blocked by robots.txt”, and so on).

For a deeper walkthrough of submission, see Submit Sitemap in Search Console.

Common pitfalls

Leaving Disallow: / in robots.txt after a staging or preview deploy. The classic full-site de-index.
Disallowing /_next/, which can stop Googlebot from loading your CSS/JS and rendering the page.
A route-handler typo that returns HTML instead of XML for the sitemap. Google rejects it with no useful error.
Trailing-slash mismatch between sitemap URLs and your canonical tags. Google treats /foo and /foo/ as two different URLs.
Advertising paginated or filtered URLs (?page=2, faceted filters) in the sitemap. Keep them out of the sitemap and handle them with canonical tags or noindex instead of promoting them.
Shipping only one language’s URLs. /en/foo and /zh/foo are separate URLs and both belong in the sitemap with hreflang alternates.
Faking lastModified to “now” on every URL each build. Google learns to ignore it. Only set it from a real content timestamp.
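
Several of these pitfalls (trailing-slash mismatches, query-string URLs, duplicates) can be caught by linting the URL list before it is returned from sitemap.ts. The following is a rough sketch; lintSitemapUrls is a hypothetical helper, and the checks are heuristics rather than anything Google or Next.js prescribes.

```typescript
// Sketch: report common sitemap URL problems. Returns human-readable messages.
function lintSitemapUrls(urls: string[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  let withSlash = 0;
  let withoutSlash = 0;

  for (const raw of urls) {
    const url = new URL(raw); // throws on a malformed absolute URL

    if (seen.has(raw)) problems.push(`duplicate: ${raw}`);
    seen.add(raw);

    // Paginated or filtered URLs usually carry a query string.
    if (url.search !== '') problems.push(`query string: ${raw}`);

    if (url.pathname === '/') continue; // the root path is always "/"
    if (url.pathname.endsWith('/')) withSlash += 1;
    else withoutSlash += 1;
  }

  if (withSlash > 0 && withoutSlash > 0) {
    problems.push(
      `mixed trailing slashes: ${withSlash} with, ${withoutSlash} without`,
    );
  }
  return problems;
}
```

Calling it on the mapped url values inside a test or a build step, and failing when the result is non-empty, keeps the sitemap consistent with whichever trailing-slash convention your canonical tags use.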

FAQ

Do I still need a sitemap with good internal linking? For an established site with solid internal links, Google usually finds everything without one. For a brand-new site with few backlinks, a submitted sitemap meaningfully accelerates first-time discovery.
How often should the sitemap update? On every publish. If app/sitemap.ts reads your content collection, it regenerates on every build automatically, with no manual step.
Should I include lastModified? Yes, when it reflects a real update timestamp. It helps Google prioritize re-crawling changed pages. Do not set it to the current date on every URL or it becomes noise Google ignores.
Can I keep AI search bots while blocking training bots? Yes. Block GPTBot, ClaudeBot, Google-Extended, CCBot, and Meta-ExternalAgent to opt out of model training, but leave OAI-SearchBot, Claude-SearchBot, ChatGPT-User, and PerplexityBot allowed so you stay citable in AI search results.
When would I use generateSitemaps()? Once you cross 50,000 URLs (or 50 MB) in one file. Below that, a single app/sitemap.ts is simpler and works fine.

Tags: #Indie dev #Next.js #SEO #Technical SEO #robots.txt #Indexing
