You generated a single sitemap.xml from your content collection. The file is 18 MB and contains 73,000 URLs. Search Console either reports “Couldn’t fetch” or shows “Discovered URLs: 50,000” — it processed exactly the limit and silently dropped the rest. Pages beyond index 50,000 never get crawled via the sitemap. The sitemaps protocol caps a single file at 50,000 URLs and 50 MB uncompressed, and Google enforces both strictly.
The fix is a sitemap index that points at multiple per-section sitemap files. Below: how to split, name, and verify them.
Common causes
1. Single sitemap generator with no chunking
The build script writes every URL into one sitemap.xml regardless of count. Works fine until your content collection crosses 50k.
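How to spot it: wc -l public/sitemap.xml. If each entry sits on its own line and the count is over ~50,003 (header + 50k entries + footer), you’re at or over the limit. If the file is minified or entries span several lines, count the tags instead: grep -o '<url>' public/sitemap.xml | wc -l.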
2. Sitemap is over 50 MB uncompressed even with fewer URLs
Each URL entry has a long <loc>, <lastmod>, multiple <xhtml:link rel="alternate" hreflang> tags for bilingual sites, and <image:image> blocks. Even at 25k URLs you can hit 50 MB.
How to spot it: ls -lh public/sitemap.xml. Over 50 MB? Split regardless of URL count.
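Both limits can also be checked in one pass from Node. The following is a minimal sketch, not part of the original build script: checkSitemapLimits is a hypothetical helper name, and it assumes the sitemap fits in memory as a string (for example, read with fs.readFileSync('public/sitemap.xml', 'utf8')).

```javascript
// Limits from the Sitemaps protocol: 50,000 URLs and 50 MB (52,428,800 bytes) uncompressed.
const PROTOCOL_MAX_URLS = 50000;
const PROTOCOL_MAX_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: takes the sitemap XML as a string and reports both limits.
function checkSitemapLimits(xml) {
  // Count opening <url> tags wherever they appear, so single-line files work too.
  const urlCount = (xml.match(/<url>/g) || []).length;
  // Measure the UTF-8 byte size, not the character count.
  const bytes = Buffer.byteLength(xml, 'utf8');
  return {
    urlCount,
    bytes,
    overUrlLimit: urlCount > PROTOCOL_MAX_URLS,
    overSizeLimit: bytes > PROTOCOL_MAX_BYTES,
  };
}
```

If either flag comes back true, the file needs splitting regardless of what the other one says.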
3. Hreflang inflates the file dramatically
A bilingual site with en and zh versions adds 2-3 xhtml:link tags per URL. A 30k-article site doubles to ~60k URL entries (each language is a separate <url>), and each entry is larger.
How to spot it: grep -o '<url>' public/sitemap.xml | wc -l (grep -c counts matching lines, so it undercounts when several entries share a line). Compare to your distinct page count. If 2-3× higher, hreflang or duplicates inflated it.
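To get a feel for how much hreflang adds per entry, you can build one entry with and without alternates and compare byte sizes. This is only an illustrative sketch: buildUrlEntry is a hypothetical helper, and the /zh/ URL layout is an assumption, not something the original site necessarily uses.

```javascript
// Hypothetical helper: builds one <url> entry, optionally with hreflang alternates.
function buildUrlEntry(loc, alternates = []) {
  const links = alternates
    .map(a => `<xhtml:link rel="alternate" hreflang="${a.lang}" href="${a.href}"/>`)
    .join('');
  return `<url><loc>${loc}</loc>${links}</url>`;
}

const plain = buildUrlEntry('https://example.com/articles/hello/');
const bilingual = buildUrlEntry('https://example.com/articles/hello/', [
  { lang: 'en', href: 'https://example.com/articles/hello/' },
  { lang: 'zh', href: 'https://example.com/zh/articles/hello/' }, // assumed URL layout
]);

// Print the byte size of a single entry without and with alternates.
console.log(Buffer.byteLength(plain), Buffer.byteLength(bilingual));
```

Multiply the difference by the number of entries to estimate how far hreflang pushes the file toward the 50 MB cap.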
4. Sitemap index pointing to itself or missing
You created a sitemap index but it lists only one child sitemap (the original 73k file), or accidentally lists the index file inside itself.
How to spot it: cat public/sitemap-index.xml. Should be <sitemapindex> with multiple <sitemap><loc> children pointing to distinct child files, each under 50k URLs.
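The same inspection can be scripted. Below is a rough sketch under a few assumptions: findIndexProblems is a hypothetical helper, the index is small enough to read as a string, and a regex is good enough for a generated file (a real XML parser would be more robust).

```javascript
// Hypothetical helper: inspects a sitemap index (as a string) for common mistakes.
function findIndexProblems(indexXml, indexUrl) {
  // Pull every <loc> value out of the index, trimming surrounding whitespace.
  const locs = [...indexXml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);
  const problems = [];
  if (!/<sitemapindex[\s>]/.test(indexXml)) problems.push('root element is not <sitemapindex>');
  if (locs.length === 0) problems.push('index lists no child sitemaps');
  if (locs.includes(indexUrl)) problems.push('index lists itself as a child');
  if (new Set(locs).size !== locs.length) problems.push('index lists the same child more than once');
  return { children: locs, problems };
}
```

The returned children array also makes it easy to see when the index still points at a single oversized file.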
5. Splits by alphabet rather than count
Naive split: sitemap-a.xml, sitemap-b.xml, etc. by URL slug first letter. If “p” articles are 80k of your 300k total, that file still exceeds the limit.
How to spot it: wc -l public/sitemap-*.xml. If any individual file has more than 50,003 lines (assuming one entry per line), your split logic is wrong.
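You can also predict the skew before generating anything. The sketch below groups slugs by first letter and reports any bucket that would exceed the per-file limit. The helper name oversizedLetterBuckets and the input shape (an array of slug strings) are assumptions for illustration.

```javascript
// Hypothetical helper: groups slugs by first letter and reports buckets over the limit.
function oversizedLetterBuckets(slugs, limit = 50000) {
  const counts = new Map();
  for (const slug of slugs) {
    const key = slug.charAt(0).toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Keep only the letters whose bucket would exceed the per-file URL limit.
  return [...counts]
    .filter(([, n]) => n > limit)
    .map(([letter, n]) => ({ letter, n }));
}
```

Any non-empty result means the alphabetical scheme needs a count-based split inside that letter, or should be dropped in favor of plain chunking.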
6. Compressed file under 50 MB but uncompressed over
Google checks the uncompressed size. An 8 MB gzipped file that’s 80 MB uncompressed will fail.
How to spot it: gzip -l public/sitemap.xml.gz shows compressed and uncompressed bytes.
7. Sitemap files don’t match robots.txt declaration
You split into sitemap-1.xml, sitemap-2.xml, but robots.txt still says Sitemap: https://example.com/sitemap.xml. Search Console doesn’t auto-discover the index.
How to spot it: curl https://yoursite.com/robots.txt | grep -i sitemap. Should list the sitemap-index URL.
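If you want this check in a script rather than a one-off curl, the Sitemap lines can be pulled out of the robots.txt body with a few lines of code. This is a minimal sketch: extractSitemapUrls is a hypothetical helper that takes the file contents as a string, and fetching the file is left to the caller.

```javascript
// Hypothetical helper: returns every URL declared on a "Sitemap:" line in robots.txt.
function extractSitemapUrls(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    // The directive name is matched case-insensitively; commented-out lines do not match.
    .map(line => line.match(/^\s*sitemap\s*:\s*(\S+)/i))
    .filter(Boolean)
    .map(m => m[1]);
}
```

After the split, the result should contain the index URL rather than a stale single-file URL or individual child files.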
Shortest path to fix
Step 1: Decide a chunk size and split scheme
Conservative target: 25,000 URLs or 25 MB per file (half the hard limit). Group by content type for human debuggability:
- sitemap-articles-1.xml … sitemap-articles-N.xml
- sitemap-categories.xml
- sitemap-tags.xml
- sitemap-pages.xml (static pages)
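Because the target has two dimensions (URL count and bytes), a chunker that only slices by count can still produce an oversized file when entries are large. One possible approach, sketched here with a hypothetical chunkEntries helper, is to close a chunk as soon as either budget would be exceeded. It assumes the entries are already rendered as <url>…</url> strings and ignores the few hundred bytes of XML header and wrapper.

```javascript
// Hypothetical helper: splits pre-rendered <url> entries into chunks that stay
// under both a URL-count budget and a byte budget.
function chunkEntries(entries, maxUrls = 25000, maxBytes = 25 * 1024 * 1024) {
  const chunks = [];
  let current = [];
  let currentBytes = 0;
  for (const entry of entries) {
    const size = Buffer.byteLength(entry, 'utf8');
    // Start a new chunk when adding this entry would break either budget.
    if (current.length > 0 && (current.length >= maxUrls || currentBytes + size > maxBytes)) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(entry);
    currentBytes += size;
  }
  // Flush the final, possibly partial, chunk.
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each returned chunk maps to one sitemap-articles-N.xml file. With the half-limit defaults there is ample headroom for the wrapper bytes.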
Step 2: Generate child sitemaps in chunks
// scripts/generate-sitemaps.mjs
import fs from 'node:fs';
const CHUNK = 25000;
const articles = JSON.parse(fs.readFileSync('articles.json', 'utf8'));
const total = articles.length;
const numFiles = Math.ceil(total / CHUNK);
for (let i = 0; i < numFiles; i++) {
  // Take the next CHUNK articles; the last slice may be shorter.
  const chunk = articles.slice(i * CHUNK, (i + 1) * CHUNK);
  const urls = chunk.map(a => `<url><loc>https://example.com/articles/${a.slug}/</loc><lastmod>${a.modifiedAt}</lastmod></url>`);
  // One entry per line keeps line-based checks such as wc -l meaningful.
  fs.writeFileSync(
    `public/sitemap-articles-${i + 1}.xml`,
    `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls.join('\n')}\n</urlset>`
  );
}
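One thing the generator does not do is escape the values it interpolates. The Sitemaps protocol requires &, <, >, and both quote characters to be entity-escaped, so a slug or URL containing any of them would produce invalid XML. A small helper along these lines could wrap the interpolated values; escapeXml is a hypothetical name, and the query-string URL is only an example.

```javascript
// Escapes the five characters the Sitemaps protocol requires to be entity-escaped.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Example: a URL with a query string (hypothetical).
console.log(escapeXml('https://example.com/articles/?a=1&b=2'));
// prints "https://example.com/articles/?a=1&amp;b=2"
```

In the generator this would look like ${escapeXml(a.slug)} inside the <loc> template. Slugs with non-ASCII characters may additionally need URL encoding.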
Step 3: Generate a sitemap index
const indexEntries = [];
for (let i = 1; i <= numFiles; i++) {
  indexEntries.push(`<sitemap><loc>https://example.com/sitemap-articles-${i}.xml</loc><lastmod>${new Date().toISOString()}</lastmod></sitemap>`);
}
indexEntries.push(`<sitemap><loc>https://example.com/sitemap-categories.xml</loc></sitemap>`);
fs.writeFileSync(
  'public/sitemap.xml',
  `<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${indexEntries.join('')}</sitemapindex>`
);
Keep the top-level file named sitemap.xml so existing robots.txt references still work — it’s now an index, not a urlset.
Step 4: Update robots.txt
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
That’s it — the index file at the same URL now lists all children.
Step 5: Validate each file
for f in public/sitemap*.xml; do
  printf '%s: ' "$f"
  # grep -o prints one match per line, so this counts tags even in single-line files
  grep -oE '<url>|<sitemap>' "$f" | wc -l
  xmllint --noout "$f" && echo "valid"
done
Each child sitemap must contain no more than 50,000 <url> entries. The index file lists <sitemap> entries (not <url>).
Step 6: Resubmit in Search Console
Search Console → Sitemaps → remove the old single sitemap entry, submit https://example.com/sitemap.xml (the index). Google will discover and process children automatically.
Step 7: Monitor processing
Search Console → Sitemaps → click each child sitemap. Watch the “Discovered URLs” count rise over a few days. If any child sitemap reports “Couldn’t fetch,” check that file’s HTTP status and XML validity.
When this is not on you
For sites under 50,000 URLs whose sitemap is also under 50 MB, splitting will not help. The limits only matter at scale. Don’t over-engineer a 5k-URL site.
Easy to misdiagnose as
A crawl-budget issue. The crawl budget concept exists but matters mainly for sites with millions of URLs. The 50k-per-sitemap cap is a much sharper, simpler problem to fix first.
Prevention
- Generate sitemaps with a hard chunk size (e.g., 25k URLs per file).
- Validate XML in CI before deploy.
- Keep robots.txt pointing at a single canonical sitemap-index URL.
- Track sitemap file sizes in build logs; alert if any exceeds 40 MB or 40k URLs.
- Use gzip for transport but verify uncompressed size against the 50 MB cap.
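The alert thresholds above can be turned into a small CI guard. The sketch below is one way to classify a file, assuming you already have its URL count and uncompressed byte size. The function name sitemapBudgetStatus and the three status strings are hypothetical.

```javascript
// Warning thresholds from the prevention list; hard limits from the Sitemaps protocol.
const WARN_URLS = 40000;
const WARN_BYTES = 40 * 1024 * 1024;
const HARD_URLS = 50000;
const HARD_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: classifies one sitemap file for a CI check.
function sitemapBudgetStatus(urlCount, bytes) {
  if (urlCount > HARD_URLS || bytes > HARD_BYTES) return 'fail';
  if (urlCount > WARN_URLS || bytes > WARN_BYTES) return 'warn';
  return 'ok';
}
```

A build step could log 'warn' results and exit non-zero on 'fail', so an oversized file never reaches production.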
FAQ
- Can I submit multiple sitemaps separately instead of an index? Yes, but an index file is the standard and is easier to maintain.
- Does Bing have the same 50,000 limit? Yes. The 50k / 50 MB cap comes from the official Sitemaps protocol, which all major search engines follow.
Related
- Sitemap Submitted but URLs Not Indexed
- Sitemap Pages Missing from Search Console
- Large Site Indexing Slow
- Sitemap lastmod Is Always Today and Google Stopped Trusting It
- Discovered Currently Not Indexed
- Internal Links Not Discovered by Google
- Indexing Slow on New Site
- Hreflang Warning in Search Console
- Pagination Noindex Follow Trap
- Orphan Pages Not Indexed
Tags: #SEO #Troubleshooting #Indexing #Search Console #Sitemap #xml