Sitemap Lists URLs That Carry noindex (How to Fix the Conflict)

Your sitemap.xml lists URLs that render `<meta name="robots" content="noindex">`, so Google indexes some and not others. Here is why it happens and how to make both signals agree.

Published: May 19, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Fastest fix: a URL should be in your sitemap.xml or carry noindex — never both. Decide per URL whether you want it indexed, then remove it from one side. Pull the conflict list with the audit command below, fix the source of truth, rebuild, resubmit the sitemap, and click Validate Fix in Search Console.

Search Console flags “Excluded by ‘noindex’ tag” (and sometimes “Indexed, though blocked by robots.txt”) on a batch of your URLs. Looking closer, those same URLs are in your sitemap.xml. The sitemap says “please index this.” The noindex meta says “please don’t.” When both fire on one URL you’ve sent Google contradictory instructions, so the outcome drifts — some pages get indexed, some don’t, and you’ve lost control of which.

The root cause is almost always that sitemap generation and the noindex decision read from different sources of truth. The fix is making them share one.

Which bucket are you in

Symptom in the sitemap	Likely cause	Fix
Drafts / retracted posts listed	Generator iterates all files, ignores `draft`	Filter drafts out of the sitemap
`?page=2`, `/page/3/` listed	Paginated URLs auto-collected, but `noindex`’d	Exclude paginated paths from sitemap
`/author/`, `/tag/`, `/category/` listed	Thin archive pages `noindex`’d but auto-discovered	Exclude those path prefixes
A real article listed but rendering `noindex`	Template bug (bad locale/field check)	Fix the template condition, keep in sitemap
Recently retracted URL still listed	Sitemap is stale	Rebuild the sitemap

Common causes

Ordered by hit rate, highest first.

1. Sitemap pulls all routes; noindex is template-decided

Your build generates a sitemap from “every URL that exists.” Your layout applies noindex from frontmatter (draft: true) or category logic. The sitemap generator never sees that decision, so it lists everything.

This is especially common on Astro sites. Per the official @astrojs/sitemap docs, the integration “can’t analyze a given page’s source code,” so it has no way to read noindex or draft from frontmatter. Unless you exclude those URLs yourself in filter() or serialize(), they land in the sitemap regardless of what the page renders.

How to spot it:

# Compare sitemap URLs against pages that actually render noindex.
# Note: grep -P needs GNU grep (macOS ships BSD grep, which lacks -P).
xmllint --xpath '//*[local-name()="loc"]' sitemap.xml | grep -oP 'https://[^<]+' > /tmp/sitemap_urls.txt
while read -r url; do
  # -L follows redirects so the final page's HTML is what gets checked
  if curl -sL "$url" | grep -q 'name="robots"[^>]*content="[^"]*noindex'; then
    echo "CONFLICT: $url"
  fi
done < /tmp/sitemap_urls.txt

Any output is a sitemap/noindex conflict.

2. Pagination pages have noindex but are in the sitemap

You set noindex on ?page=2 and beyond to avoid duplicate-content issues. The sitemap generator pulls everything, paginated URLs included.

How to spot it: grep your sitemap for ?page=, /page/N/, and similar. If they’re present and the pages render noindex, that’s the conflict.
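
The same check can be scripted once the sitemap URLs are in an array. This is a minimal sketch: the two patterns are assumptions based on the `?page=` and `/page/N/` shapes mentioned above, and `findPaginatedUrls` is an illustrative name, so adjust both to your own URL scheme.

```javascript
// Patterns are assumptions; extend them to match your site's pagination URLs.
const PAGINATION_PATTERNS = [
  /[?&]page=\d+/,     // ?page=2 or &page=2
  /\/page\/\d+\/?$/,  // /page/3/ or /page/3 at the end of the URL
];

// Return the sitemap URLs that look like paginated listings.
function findPaginatedUrls(urls) {
  return urls.filter((url) => PAGINATION_PATTERNS.some((re) => re.test(url)));
}

const sampleUrls = [
  'https://yoursite.com/blog/',
  'https://yoursite.com/blog/page/3/',
  'https://yoursite.com/blog/?page=2',
  'https://yoursite.com/blog/homepage-redesign/',
];

// Logs the two paginated URLs; the homepage-redesign article is not matched.
console.log(findPaginatedUrls(sampleUrls));
```

Anything this returns still needs the noindex check from cause 1 before it counts as a conflict.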

3. Author / tag / archive pages auto-listed

You set noindex on /author/foo/ (thin profile pages). The sitemap auto-discovers all routes and includes the author URLs.

How to spot it: look in the sitemap for /author/, /tag/, /category/ URLs and cross-reference against what’s noindex’d.
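
On a large sitemap it can help to see how many URLs each top-level section contributes before cross-referencing. The sketch below groups URLs by their first path segment; `countBySection` is an illustrative helper, not part of any sitemap library.

```javascript
// Count sitemap URLs by their first path segment, e.g. "author", "tag".
function countBySection(urls) {
  const counts = new Map();
  for (const url of urls) {
    const segments = new URL(url).pathname.split('/').filter(Boolean);
    const section = segments[0] ?? '(root)';
    counts.set(section, (counts.get(section) ?? 0) + 1);
  }
  return counts;
}

const sectionCounts = countBySection([
  'https://yoursite.com/',
  'https://yoursite.com/author/foo/',
  'https://yoursite.com/author/bar/',
  'https://yoursite.com/tag/seo/',
  'https://yoursite.com/blog/fix-sitemap/',
]);

// Prints one line per section with its URL count.
for (const [section, count] of sectionCounts) {
  console.log(section, count);
}
```

A large count under a section you have noindex'd, such as `author` or `tag`, points at the path prefix to exclude.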

4. Draft pages accidentally in the sitemap

Draft pages (draft: true) emit noindex. The sitemap generator iterates *.mdx files without filtering by draft status.

How to spot it: the Astro sitemap integration may include drafts depending on config. Because it can’t read frontmatter, you must exclude drafts by URL pattern in filter() — check astro.config.mjs for that.

5. Frontmatter says publish, but the page renders noindex from a bug

A conditional in your template fires noindex when it shouldn’t (missing field, wrong locale check). The sitemap includes the page legitimately; the template wrongly noindexes it.

How to spot it: manually inspect a flagged page. Compare its frontmatter against the rendered HTML’s robots meta (view-source: the live URL).
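
One way this bug can arise, shown as a sketch: the template negates a field that some pages never set. The `indexable` field and both function names are hypothetical; only `draft` comes from the frontmatter described above.

```javascript
// Buggy: a page without an `indexable` field has frontmatter.indexable === undefined,
// so `!undefined` is true and the page silently gets noindex.
function robotsBuggy(frontmatter) {
  return !frontmatter.indexable ? 'noindex' : 'index, follow';
}

// Safer: emit noindex only on an explicit signal.
function robotsFixed(frontmatter) {
  const blocked = frontmatter.draft === true || frontmatter.indexable === false;
  return blocked ? 'noindex' : 'index, follow';
}

const article = { title: 'A real article' }; // no `indexable` field at all

console.log(robotsBuggy(article)); // noindex
console.log(robotsFixed(article)); // index, follow
```

The design choice in the fixed version is that a missing field means "no opinion", so published articles stay indexable by default.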

6. URL was published, then changed to draft, but the sitemap wasn’t regenerated

You flipped draft: false to draft: true to retract a page. The template now emits noindex. The sitemap is stale and still lists the URL.

How to spot it: recently retracted articles still in the sitemap means a stale sitemap.
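
If you can export the list of URLs that are currently published, stale entries are a plain set difference. A minimal sketch, assuming both lists use the same URL format (same domain, same trailing-slash style); `findStaleEntries` is an illustrative name.

```javascript
// Return sitemap URLs that are no longer in the published set.
function findStaleEntries(sitemapUrls, publishedUrls) {
  const published = new Set(publishedUrls);
  return sitemapUrls.filter((url) => !published.has(url));
}

const staleEntries = findStaleEntries(
  ['https://yoursite.com/blog/a/', 'https://yoursite.com/blog/retracted/'],
  ['https://yoursite.com/blog/a/'],
);

console.log(staleEntries); // [ 'https://yoursite.com/blog/retracted/' ]
```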

Shortest path to fix

Step 1: Make the sitemap and noindex share one source of truth

In Astro, the filter() callback receives the full page URL as a string (including your domain) and returns true to keep it. Because the integration can’t see frontmatter, drive the exclusion from a path convention or a precomputed set of noindex URLs:

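// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  // @astrojs/sitemap needs `site` to build absolute URLs
  site: 'https://yoursite.com',
  integrations: [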
    sitemap({
      // `page` is the absolute URL string, e.g. 'https://yoursite.com/author/foo/'
      filter: (page) =>
        !page.includes('/draft/') &&
        !page.includes('/author/') &&
        !/\/page\/\d+\//.test(page) &&
        !page.includes('?page='),
    }),
  ],
});

For a build that tracks noindex per page, generate that set first (e.g. while rendering, write the slugs to a JSON file) and check it in filter() so the two never drift again.
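
A minimal sketch of that precomputed-set approach. It assumes your build writes the noindex paths to a JSON array; the file name `noindex-paths.json` and the helper `makeSitemapFilter` are hypothetical, and only the `filter` option itself comes from @astrojs/sitemap.

```javascript
// Treat "/drafts/wip" and "/drafts/wip/" as the same path.
function normalizePath(path) {
  return path.endsWith('/') ? path : path + '/';
}

// Build a filter() callback from a list of noindex paths.
// The callback receives the absolute page URL and returns true to keep it.
function makeSitemapFilter(noindexPaths) {
  const blocked = new Set(noindexPaths.map(normalizePath));
  return (page) => !blocked.has(normalizePath(new URL(page).pathname));
}

// In astro.config.mjs you might wire it up like this (file name is an assumption):
//   const noindexPaths = JSON.parse(fs.readFileSync('./noindex-paths.json', 'utf8'));
//   sitemap({ filter: makeSitemapFilter(noindexPaths) })

const keepInSitemap = makeSitemapFilter(['/drafts/wip', '/author/foo/']);
console.log(keepInSitemap('https://yoursite.com/drafts/wip/')); // false
console.log(keepInSitemap('https://yoursite.com/blog/post/'));  // true
```

Matching on the pathname rather than the full URL keeps the list independent of the domain, so the same file works for preview and production builds.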

For Next.js / next-sitemap:

// next-sitemap.config.js
module.exports = {
  siteUrl: 'https://yoursite.com',
  exclude: ['/draft/*', '/author/*', '/api/*'],
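  // Or programmatically, returning null to drop a URL.
  // isNoindex() is not part of next-sitemap; it is a helper you supply,
  // for example a lookup in a precomputed list of noindex paths.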
  transform: async (config, path) => {
    if (await isNoindex(path)) return null;
    return { loc: path, changefreq: 'weekly', priority: 0.7 };
  },
};

Step 2: Audit the live sitemap for conflicts

# Pull all URLs, check each for noindex
curl -s https://yoursite.com/sitemap.xml | grep -oP '<loc>\K[^<]+' | while read -r url; do
  # -L follows redirects so the final page's HTML is what gets checked
  if curl -sL "$url" | grep -q 'name="robots"[^>]*content="[^"]*noindex'; then
    echo "$url"
  fi
done > /tmp/conflicts.txt

wc -l /tmp/conflicts.txt

If your sitemap is a sitemap index (multiple child sitemaps), the top-level file lists child <loc> entries, not page URLs — fetch each child first, or point the command at sitemap-0.xml / the child files directly.
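
If you would rather handle both cases in a script, the sketch below detects a sitemap index and follows one level of child sitemaps. The function names are illustrative, and `fetchText` is passed in as a parameter (an assumption made so the logic can be tried without network access); in real use it could be `(url) => fetch(url).then((r) => r.text())`.

```javascript
// Extract <loc> values and report whether the document is a sitemap index.
function parseSitemap(xml) {
  const locs = [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
  const isIndex = /<sitemapindex[\s>]/.test(xml);
  return { isIndex, locs };
}

// Resolve a sitemap URL to page URLs, following one level of index nesting.
async function collectPageUrls(sitemapUrl, fetchText) {
  const top = parseSitemap(await fetchText(sitemapUrl));
  if (!top.isIndex) return top.locs;
  const childXml = await Promise.all(top.locs.map((loc) => fetchText(loc)));
  return childXml.flatMap((xml) => parseSitemap(xml).locs);
}

// Offline example with canned responses instead of real HTTP requests.
const cannedResponses = {
  'https://yoursite.com/sitemap-index.xml':
    '<sitemapindex><sitemap><loc>https://yoursite.com/sitemap-0.xml</loc></sitemap></sitemapindex>',
  'https://yoursite.com/sitemap-0.xml':
    '<urlset><url><loc>https://yoursite.com/a/</loc></url><url><loc>https://yoursite.com/b/</loc></url></urlset>',
};

collectPageUrls('https://yoursite.com/sitemap-index.xml', async (url) => cannedResponses[url])
  .then((urls) => console.log(urls));
```

The regex-based parsing mirrors the `grep` approach in the command above; it does not decode XML entities such as `&amp;` in URLs.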

Each URL in conflicts.txt needs a decision: keep noindex (remove from sitemap) or drop noindex (keep in sitemap).

Step 3: Decide per URL — index or not?

For each conflict, ask: does this URL give a Google searcher unique value?

Yes -> remove noindex, keep it in the sitemap. To clear the meta, render <meta name="robots" content="index, follow"> or simply omit the robots meta.
No -> keep noindex, remove it from the sitemap.

Pagination pages, author profiles with no bio, and empty tag archives are usually “no.” Real articles with content are usually “yes.”

Step 4: Regenerate and resubmit the sitemap

After cleanup:

npm run build  # regenerates the sitemap
curl -s https://yoursite.com/sitemap.xml | grep -c '<loc>'  # confirm URL count

In Search Console, go to Indexing -> Sitemaps, and submit the sitemap URL again even if it’s unchanged. That nudges Google to re-fetch it. For a handful of high-priority URLs, paste each into the URL Inspection bar at the top and click Request Indexing — but note the quota is small (roughly 10-12 URLs per day per property on a rolling 24-hour window, as of June 2026), so rely on the sitemap for bulk re-evaluation.

Step 5: Add a CI check so it can’t regress

// scripts/check-sitemap-noindex.mjs
import fs from 'node:fs';

const xml = fs.readFileSync('dist/sitemap-0.xml', 'utf8');
const urls = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);

const bad = []; // sitemap URLs whose HTML renders noindex
for (const url of urls) {
  const html = await fetch(url).then((r) => r.text());
  if (/name="robots"[^>]*content="[^"]*noindex/.test(html)) bad.push(url);
}
if (bad.length) {
  console.error('Sitemap contains noindex URLs:\n' + bad.join('\n'));
  process.exit(1);
}

Run it post-build and fail the build if any sitemap URL renders noindex. (For a local-only check, point fetch at your preview server or read the built HTML files directly instead of hitting production.)
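
For the read-the-built-files variant, two small helpers are enough. This is a sketch under stated assumptions: the build emits directory-style output (`dist/path/index.html`), and both function names are illustrative. Unlike the single regex above, `hasNoindexMeta` also matches when `content` appears before `name` or the attributes use single quotes.

```javascript
// True if any <meta name="robots"> tag has noindex in its content,
// regardless of attribute order or quote style.
function hasNoindexMeta(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some((tag) => {
    const name = /\bname\s*=\s*["']?([^"'\s>]+)/i.exec(tag)?.[1] ?? '';
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1] ?? '';
    return name.toLowerCase() === 'robots' && /\bnoindex\b/i.test(content);
  });
}

// Map a sitemap URL to the HTML file a static build would emit.
// Assumes directory-style output; adjust if your build writes /about.html instead.
function urlToDistPath(url, distDir = 'dist') {
  const pathname = new URL(url).pathname;
  if (pathname.endsWith('.html')) return distDir + pathname;
  const dir = pathname.endsWith('/') ? pathname : pathname + '/';
  return distDir + dir + 'index.html';
}

console.log(urlToDistPath('https://yoursite.com/blog/post/')); // dist/blog/post/index.html
console.log(hasNoindexMeta('<meta content="noindex, follow" name="robots">')); // true
```

In the CI script, the `fetch(url)` call could then be swapped for `fs.readFileSync(urlToDistPath(url), 'utf8')`, so the check runs against the build output rather than production.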

Step 6: Tell Search Console and watch it clear

In Search Console: Indexing -> Pages, scroll to “Why pages aren’t indexed,” open “Excluded by ‘noindex’ tag,” and click Validate Fix. This tells Google the URLs are ready for re-evaluation. Validation usually finishes within about two weeks, though large sites can take longer. The count under “Excluded by ‘noindex’ tag” should fall as Google re-crawls each URL.

How to confirm it’s fixed

The audit command in Step 2 prints zero lines.
view-source: on a previously conflicting URL shows no noindex (or the URL is gone from the sitemap, depending on your decision).
In Search Console, URL Inspection on a fixed URL reports “URL is on Google” or “Indexing allowed? Yes.”
The “Excluded by ‘noindex’ tag” count in the Pages report trends down over the following weeks.

When this is not on you

Search Console alerts lag by days. Verify against the live page source with view-source:, not just the report — the report often shows stale data after you’ve already fixed the issue. Use the URL Inspection -> Test Live URL button to see what Googlebot fetches right now versus the last cached crawl.

Easy to misdiagnose

People try to “strengthen” noindex by also disallowing the URL in robots.txt. That backfires: a disallowed page can’t be crawled, so Google never sees the noindex meta. If external links point at it, Google can still index the bare URL with no snippet — worse than before. Pick one mechanism: to keep a page out of the index, leave it crawlable and let it render noindex.

Prevention

Generate the sitemap from a “publishable AND indexable” filter, never from all routes.
CI check: any sitemap URL whose page renders noindex fails the build.
When you decide noindex for a whole category (pagination, author, drafts), update the sitemap filter in the same commit.
Never combine robots.txt Disallow with noindex — choose one.
Audit the sitemap quarterly against your “intended to index” list.
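
The first rule above can be sketched as a single predicate that both the template and the sitemap generator call. The page shape (`path`, `draft`, `noindex`) and the function names are assumptions for illustration, not any framework's API.

```javascript
// The one place that decides whether a page should be indexed.
function isIndexable(page) {
  return page.draft !== true && page.noindex !== true;
}

// Used by the layout to render the robots meta content.
function robotsMeta(page) {
  return isIndexable(page) ? 'index, follow' : 'noindex';
}

// Used by the sitemap generator to decide which URLs to list.
function sitemapUrls(pages, siteUrl) {
  return pages.filter(isIndexable).map((page) => new URL(page.path, siteUrl).href);
}

const pages = [
  { path: '/blog/a/' },
  { path: '/blog/wip/', draft: true },
  { path: '/author/foo/', noindex: true },
];

console.log(sitemapUrls(pages, 'https://yoursite.com')); // [ 'https://yoursite.com/blog/a/' ]
console.log(pages.map(robotsMeta)); // [ 'index, follow', 'noindex', 'noindex' ]
```

Because both outputs derive from `isIndexable`, a URL cannot be listed in the sitemap while rendering noindex.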

FAQ

Should robots.txt disallow my noindex pages? No. Google has to crawl a page to see its noindex. Disallow plus noindex is the most common foot-gun — the crawler is blocked, never reads the meta, and the URL can still surface from backlinks.

Can I use the X-Robots-Tag HTTP header instead of the meta tag? Yes. For HTML, an X-Robots-Tag: noindex response header and the <meta name="robots" content="noindex"> tag are equivalent. For non-HTML resources (PDFs, images), the header is the only option since they can’t carry a meta tag.

How long until “Excluded by ‘noindex’ tag” clears after I fix it? After clicking Validate Fix, expect up to about two weeks for Google to re-crawl and re-evaluate; large sites can take longer. The count drops gradually, not all at once.

Is it bad to have noindex pages at all? No — noindex is the correct tool for thin or duplicate pages. The bug is only the contradiction of also advertising those URLs in the sitemap.

Does removing a URL from the sitemap deindex it? No. The sitemap is a discovery hint, not an index command. To deindex, the page must render noindex (or return 410/404). Removing it from the sitemap only stops you re-advertising it.
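
Because noindex can also arrive as an `X-Robots-Tag` header (see the answer above), an audit that only inspects the meta tag can miss conflicts. The sketch below checks a header value; `headerHasNoindex` is an illustrative helper. It also treats `none` as a match, since Google documents `none` as equivalent to `noindex, nofollow`.

```javascript
// True if an X-Robots-Tag header value carries noindex (or "none").
function headerHasNoindex(headerValue) {
  if (!headerValue) return false;
  return headerValue
    .split(',')
    .map((part) => part.trim().toLowerCase())
    // Drop an optional user-agent prefix such as "googlebot: noindex".
    .map((part) => part.replace(/^[a-z0-9_-]+:\s*/, ''))
    .some((directive) => directive === 'noindex' || directive === 'none');
}

console.log(headerHasNoindex('noindex, nofollow'));  // true
console.log(headerHasNoindex('googlebot: noindex')); // true
console.log(headerHasNoindex('max-snippet:-1'));     // false
```

With `fetch`, the value to pass in is `response.headers.get('x-robots-tag')`, which returns `null` when the header is absent.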

Tags: #SEO #Troubleshooting #Debug #Structured data #robots.txt #Sitemap
