Search Console flags “Indexed, though blocked by robots.txt” or “Excluded by ‘noindex’ tag” on a chunk of your URLs. Looking closer, those same URLs are in your sitemap.xml. The sitemap tells Google “please index this.” The noindex meta tells Google “please don’t.” When both fire on the same URL, you’re sending mixed signals: noindex generally wins once Google has recrawled the page, but until then some pages stay indexed, some don’t, and you’ve effectively lost control over the outcome. This is one of the most common SEO bugs on indie sites that auto-generate both sitemap and robots meta independently.
The root cause is almost always that sitemap generation and noindex decisions use different sources of truth. The fix is making them share one.
Common causes
Ordered by hit rate, highest first.
1. Sitemap pulls all routes; noindex is template-decided
Your build generates a sitemap from “all URLs that exist.” Your layout applies noindex based on frontmatter draft: true or category logic. Sitemap doesn’t know about the noindex decision and lists everything.
How to spot it:
# Compare sitemap URLs against pages with noindex (grep -P needs GNU grep)
xmllint --xpath '//*[local-name()="loc"]' sitemap.xml | grep -oP 'https://[^<]+' > /tmp/sitemap_urls.txt
while read -r url; do
  if curl -s "$url" | grep -q 'name="robots"[^>]*content="[^"]*noindex'; then
    echo "CONFLICT: $url"
  fi
done < /tmp/sitemap_urls.txt
Any output = sitemap/noindex conflict.
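The grep above only matches when the name attribute comes before content and both use double quotes. If your templates emit the attributes in another order, a slightly more tolerant check can be sketched in JavaScript. The function names extractLocs and hasNoindexMeta are illustrative, and the regex-based parsing is a simplification, not a full HTML or XML parser.

```javascript
// Extract every <loc> value from a sitemap XML string.
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

// True if any <meta> tag named "robots" or "googlebot" has a content
// value containing "noindex", regardless of attribute order.
function hasNoindexMeta(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some((tag) => {
    const name = /\bname\s*=\s*["']?([\w-]+)/i.exec(tag)?.[1]?.toLowerCase();
    const content =
      /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1]?.toLowerCase() ?? '';
    return (name === 'robots' || name === 'googlebot') && content.includes('noindex');
  });
}
```

Feed extractLocs the sitemap body and hasNoindexMeta each page's HTML; any URL where the second returns true is a conflict.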
2. Pagination pages have noindex but are in the sitemap
You set noindex on ?page=2 and beyond to avoid duplicate content issues. Sitemap generator pulls everything including paginated URLs.
How to spot it: Grep your sitemap for URLs with ?page=, /page/N/, etc. If any are present and your pagination pages carry noindex, you have a conflict.
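If you would rather test for pagination programmatically than by eye, a small predicate is enough. This is a sketch that assumes the two URL shapes mentioned above (?page=N and /page/N/); adjust the patterns if your site paginates differently. The name isPaginatedUrl is made up for illustration.

```javascript
// True for page 2 and beyond, in either ?page=N or /page/N/ form.
function isPaginatedUrl(url) {
  const { pathname, searchParams } = new URL(url);

  // Query-string pagination: ?page=2, ?page=3, ...
  const pageParam = searchParams.get('page');
  if (pageParam !== null && Number(pageParam) > 1) return true;

  // Path pagination: /blog/page/2/, /blog/page/3, ...
  const match = /\/page\/(\d+)\/?$/.exec(pathname);
  return match !== null && Number(match[1]) > 1;
}
```

Run it over the sitemap URL list; every hit is a candidate conflict if those pages carry noindex.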
3. Author / tag / archive pages auto-listed
You set noindex on /author/foo/ (thin profile pages). Sitemap auto-discovers all routes and includes author URLs.
How to spot it: Look at sitemap for /author/, /tag/, /category/ URLs. Cross-reference with what’s noindex’d.
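A quick way to see which archive sections leaked into the sitemap is to count URLs by their first path segment. The helper below is a sketch; countByFirstSegment is a hypothetical name and the input is assumed to be an array of absolute URLs taken from the sitemap.

```javascript
// Group sitemap URLs by first path segment, e.g. { author: 12, tag: 40 }.
function countByFirstSegment(urls) {
  const counts = {};
  for (const url of urls) {
    const segment =
      new URL(url).pathname.split('/').filter(Boolean)[0] ?? '(root)';
    counts[segment] = (counts[segment] ?? 0) + 1;
  }
  return counts;
}
```

A non-zero count under author, tag, or category for sections you have noindex'd points at this cause.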
4. Draft pages accidentally in sitemap
Draft pages (draft: true in frontmatter) emit noindex. Sitemap generator iterates *.mdx files without filtering by draft status.
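How to spot it: Astro sitemap integration may include draft pages depending on config. Check the sitemap() call in astro.config.mjs for a filter option. Note that filter receives each page’s full URL as a string, not a page object, so a check like !page.draft never excludes anything. Match on the URL itself, or look the URL up in a precomputed list of draft URLs.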
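To build a draft list that a sitemap filter can consult, you need to read draft status out of the frontmatter. The sketch below assumes YAML frontmatter delimited by --- lines and a literal draft: true entry; isDraft is a hypothetical helper, and a real YAML parser would be more robust than this regex.

```javascript
// True if the file's YAML frontmatter contains "draft: true".
function isDraft(source) {
  // Capture everything between the opening and closing --- lines.
  const frontmatter = /^---\r?\n([\s\S]*?)\r?\n---/.exec(source)?.[1] ?? '';
  return /^draft:\s*true\s*$/m.test(frontmatter);
}
```

Read each .mdx file's contents, pass them through isDraft, and collect the matching URLs into a set that the sitemap generator excludes.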
5. Frontmatter says publish, but page rendered noindex due to bug
Bug in your template: a conditional fires noindex when it shouldn’t (e.g., based on a missing field, wrong locale check). Sitemap includes the page legitimately; template wrongly noindexes it.
How to spot it: Manually inspect any flagged page. Compare its frontmatter to its rendered HTML’s robots meta.
6. URL was published, then frontmatter changed to draft, but sitemap not regenerated
You changed draft: false to draft: true to retract a page. Template emits noindex. Sitemap is stale — still lists the URL.
How to spot it: Recently retracted articles in sitemap = stale sitemap.
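The six causes above come down to three facts you can gather per URL: what the content says should happen, what the rendered page does, and whether the sitemap lists it. The function below is one way to turn those facts into a rough diagnosis. The field names and return labels are invented for this sketch.

```javascript
// Hypothetical facts gathered per URL:
//   intendedIndexable: what the frontmatter / content rules say
//   renderedNoindex:   whether the built HTML carries a robots noindex
//   inSitemap:         whether the URL is listed in sitemap.xml
function diagnose({ intendedIndexable, renderedNoindex, inSitemap }) {
  if (intendedIndexable && renderedNoindex) {
    return 'template-bug'; // cause 5: noindex fired when it should not
  }
  if (!intendedIndexable && renderedNoindex && inSitemap) {
    return 'sitemap-unfiltered-or-stale'; // causes 1-4 and 6
  }
  if (!intendedIndexable && !renderedNoindex) {
    return 'other'; // not one of the six causes covered here
  }
  return 'consistent';
}
```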
Shortest path to fix
Step 1: Make sitemap and noindex share one source of truth
In Astro, configure sitemap to filter:
// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://yoursite.com', // required by @astrojs/sitemap
  integrations: [
    sitemap({
      // `page` is the full URL string, not a page object.
      // To filter on frontmatter, look the URL up in a precomputed map.
      // Return false to exclude from sitemap.
      filter: (page) => !page.includes('/draft/') && !page.includes('/author/'),
    }),
  ],
});
For Next.js / next-sitemap:
// next-sitemap.config.js
module.exports = {
  siteUrl: 'https://yoursite.com',
  exclude: ['/draft/*', '/author/*', '/api/*'],
  // Or programmatically (isNoindex is a helper you write yourself, not part of next-sitemap):
  transform: async (config, path) => {
    if (await isNoindex(path)) return null; // returning null drops the path
    return { loc: path, /* ... */ };
  },
};
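Both configs above still encode the rules a second time, separately from the template. One way to get a real single source of truth is a shared predicate that the layout and the sitemap generator both call. The sketch below assumes a hypothetical page record of the form { path, draft } and an illustrative rule set; swap in whatever your site actually uses.

```javascript
// Path prefixes that should never be indexed (illustrative values).
const NOINDEX_PREFIXES = ['/author/', '/tag/'];

// The single decision point: should this page be indexed?
function isIndexable(page) {
  if (page.draft) return false;
  if (NOINDEX_PREFIXES.some((prefix) => page.path.startsWith(prefix))) return false;
  if (/\/page\/\d+\/?$/.test(page.path)) return false; // paginated archives
  return true;
}

// Used by the layout: emit noindex exactly when the page is not indexable.
function robotsMeta(page) {
  return isIndexable(page) ? '' : '<meta name="robots" content="noindex">';
}

// Used by the sitemap generator: list exactly the indexable pages.
function sitemapPaths(pages) {
  return pages.filter(isIndexable).map((page) => page.path);
}
```

Because robotsMeta and sitemapPaths derive from the same function, a page cannot be both noindex'd and listed.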
Step 2: Audit existing sitemap for conflicts
# Pull all URLs, check each for noindex (grep -P needs GNU grep)
curl -s https://yoursite.com/sitemap.xml | grep -oP '<loc>\K[^<]+' | while read -r url; do
  if curl -s "$url" | grep -q 'name="robots"[^>]*content="[^"]*noindex'; then
    echo "$url"
  fi
done > /tmp/conflicts.txt
wc -l /tmp/conflicts.txt
Each URL needs a decision: keep noindex (remove from sitemap) or remove noindex (keep in sitemap).
Step 3: Decide per URL — index or not?
For each conflict, ask:
- “Does this URL provide unique value to a Google searcher?”
- Yes → remove noindex, keep in sitemap.
- No → keep noindex, remove from sitemap.
Pagination pages, author profile pages with no bio, empty tag archives — usually “no.” Real articles with content — usually “yes.”
Step 4: Regenerate sitemap and resubmit
After cleanup:
npm run build # regenerates sitemap
curl -s https://yoursite.com/sitemap.xml | grep -c '<loc>' # confirm URL count
In Search Console → Sitemaps → submit the sitemap URL again, even if it’s the same. This prompts Google to re-fetch the sitemap; the individual pages are then recrawled on Google’s own schedule.
Step 5: Add a CI check
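// scripts/check-sitemap-noindex.mjs
// Needs Node 18+ (global fetch). Assumes one sitemap-0.xml; sitemap-index.xml only lists child sitemaps.
import fs from 'node:fs';
const sitemap = fs.readFileSync('dist/sitemap-0.xml', 'utf8');
const urls = [...sitemap.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);
const conflicts = [];
for (const url of urls) {
  const html = await (await fetch(url)).text();
  if (/<meta[^>]*name="robots"[^>]*noindex/i.test(html)) conflicts.push(url);
}
conflicts.forEach((url) => console.error(`CONFLICT: ${url}`));
process.exit(conflicts.length > 0 ? 1 : 0);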
Fail the build if any sitemap URL has noindex.
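One caveat with fetching each URL in CI: the live site reflects the previous deploy, not the build under test. An alternative is to read the freshly built HTML from disk instead. The helper below sketches the URL-to-file mapping, assuming a static build that writes directory-style output (/blog/post/ becomes dist/blog/post/index.html); urlToDistPath is a hypothetical name and the layout is an assumption about your build.

```javascript
// Map a sitemap URL to the HTML file a static build would have written.
function urlToDistPath(url, distDir = 'dist') {
  const pathname = decodeURIComponent(new URL(url).pathname);

  // /blog/post/ -> dist/blog/post/index.html
  if (pathname.endsWith('/')) return `${distDir}${pathname}index.html`;

  // /404.html -> dist/404.html (path already names a file)
  if (/\.[a-z0-9]+$/i.test(pathname)) return `${distDir}${pathname}`;

  // /about -> dist/about/index.html
  return `${distDir}${pathname}/index.html`;
}
```

The CI script can then read each mapped file and apply the same noindex test, with no network access needed.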
Step 6: Watch Search Console clear
After a clean build, the “Excluded” count under Search Console → Indexing → Pages should shrink over 2-4 weeks as Google re-evaluates each URL.
When this is not on you
Search Console alerts lag for days. Verify by viewing source on live pages, not solely on the report. Sometimes the report shows old data even though the issue is fixed.
Easy to misdiagnose as
People try to “strengthen” noindex by also disallowing the URL in robots.txt. This is wrong: disallowed pages can’t be crawled, so Google can’t see the noindex meta. External backlinks can still cause Google to index the URL with no description, which is worse.
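To check whether you have already fallen into this trap, you can list the paths that robots.txt blocks and compare them with your noindex'd pages. The sketch below is deliberately simplified: it reads only the User-agent: * groups and treats each Disallow value as a plain prefix, ignoring wildcards and Allow rules. Both function names are illustrative.

```javascript
// Collect Disallow prefixes that apply to "User-agent: *".
function disallowedPrefixes(robotsTxt) {
  const prefixes = [];
  let applies = false; // does the current group cover "*"?
  let inGroupHeader = false; // are we still reading User-agent lines?
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*/, '').trim();
    if (!line) continue;
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(field.trim())) {
      if (!inGroupHeader) applies = false; // a new group starts here
      applies = applies || value === '*';
      inGroupHeader = true;
    } else {
      inGroupHeader = false;
      if (applies && /^disallow$/i.test(field.trim()) && value) prefixes.push(value);
    }
  }
  return prefixes;
}

// Paths that crawlers cannot fetch, so a noindex on them is never seen.
function blockedPaths(paths, robotsTxt) {
  const prefixes = disallowedPrefixes(robotsTxt);
  return paths.filter((path) => prefixes.some((prefix) => path.startsWith(prefix)));
}
```

Any noindex'd path that blockedPaths returns is one where the noindex cannot take effect.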
Prevention
- Generate sitemap from a “publishable + indexable” filter, not from all routes.
- CI check: any URL in sitemap whose page has noindex fails the build.
- When deciding noindex for a category of pages (pagination, author, drafts), update the sitemap filter at the same time.
- Don’t combine robots.txt Disallow with noindex; pick one.
- Audit sitemap quarterly against your “intended to index” list.
FAQ
- Should robots.txt disallow noindex pages? No — Google needs to crawl them to see the noindex. Disallow + noindex is the most common foot-gun.
- Can I rely on the X-Robots-Tag HTTP header instead? Yes for non-HTML resources (PDFs, images). For HTML, either meta tag or header works equivalently.
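If you do use the header, the conflict audit has to look at response headers as well as the meta tag. The function below sketches how to interpret an X-Robots-Tag value. It treats none as equivalent to noindex and accepts a user-agent prefix such as googlebot: noindex. The name headerHasNoindex is made up for this example.

```javascript
// True if an X-Robots-Tag header value asks for the page not to be indexed.
function headerHasNoindex(headerValue) {
  if (!headerValue) return false;
  return headerValue
    .toLowerCase()
    .split(',')
    .some((part) => {
      // Drop an optional "googlebot:" style prefix.
      const directive = (part.includes(':') ? part.split(':').pop() : part).trim();
      return directive === 'noindex' || directive === 'none';
    });
}
```

With fetch, pass it response.headers.get('x-robots-tag'), which returns null when the header is absent.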
Tags: #SEO #Troubleshooting #Debug #Structured data #robots.txt #Sitemap