You have 10,000 articles, but Search Console shows only 2,500 indexed, a 25% rate. Crawl Stats says Googlebot fetches just 200 URLs per day, and new articles take 2 months to index.
The core constraint on large sites is crawl budget. Google uses quality signals at the site, section, and page level to decide how many requests to spend daily. The fix isn’t “publish more” — it’s cut the thin half, strengthen the rest.
Symptoms
- Index Coverage report “Discovered — currently not indexed” grows into the thousands
- Most new pages take 2+ months to index (or never get there)
- Crawl Stats on a 10k+ page site shows only 100-500 URLs/day fetched
- Overall indexing rate < 50%
Quick verdict
On large sites, crawl budget becomes the bottleneck. Google decides how much of your site is worth indexing based on quality signals at the site, section, and page level. The fix is not “publish more” — it is “cut the thin half, strengthen the rest.”
Common causes
1. Many pages thin or duplicate, eating crawl budget
If 5,000 of 10,000 pages are under 300 words or are nearly identical programmatic pages, Googlebot exhausts its daily budget on them and the real articles never get crawled.
How to confirm:
// scripts/count-thin.mjs
// Counts Markdown/MDX files whose body is under 300 words.
import fg from "fast-glob";
import fs from "node:fs";
import matter from "gray-matter";

let total = 0, thin = 0;
for (const f of fg.sync("src/content/**/*.{md,mdx}")) {
  const { content } = matter(fs.readFileSync(f, "utf8")); // strip frontmatter
  // Drop fenced code blocks so code does not inflate the word count
  const words = content.replace(/```[\s\S]*?```/g, "").split(/\s+/).filter(Boolean).length;
  total++;
  if (words < 300) thin++;
}
const pct = total ? (thin / total * 100).toFixed(1) : "0.0"; // avoid NaN when the glob matches nothing
console.log(`Thin (<300 words): ${thin}/${total} = ${pct}%`);
If 20% or more of your pages are thin, that is a major problem.
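The word count catches short pages but not long, nearly identical programmatic ones. One common way to flag those is Jaccard similarity over word shingles. The sketch below is a minimal version of that idea; the function names and the shingle size of 3 are illustrative assumptions, and it says nothing about how Google itself detects duplicates.

```javascript
// Near-duplicate check: Jaccard similarity of word shingles.
// shingleSize (and any cutoff you compare the score against) is an assumption to tune.
function shingles(text, shingleSize = 3) {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const set = new Set();
  for (let i = 0; i + shingleSize <= words.length; i++) {
    set.add(words.slice(i, i + shingleSize).join(" "));
  }
  return set;
}

function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0; // nothing to compare
  let shared = 0;
  for (const s of a) if (b.has(s)) shared++;
  return shared / (a.size + b.size - shared); // |A ∩ B| / |A ∪ B|
}

// Returns a score from 0 (no shared shingles) to 1 (identical shingle sets).
function similarity(textA, textB) {
  return jaccard(shingles(textA), shingles(textB));
}
```

Run `similarity` over pairs of page bodies within one template; pairs that score close to 1 are candidates for merging or noindex.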
2. Faceted navigation / URL parameters generate massive low-value URLs
Typical for e-commerce sites and large blogs:
/products
/products?color=red
/products?color=red&size=M
/products?color=red&size=M&sort=price
... combinatorial explosion can yield 100,000+ URLs
Every variant is treated as a separate URL, wasting budget.
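To get a feel for how quickly this multiplies, here is a rough count, assuming every facet is optional and parameter order is fixed. The facet names and sizes in the usage note are made up for illustration.

```javascript
// Count the URL variants one listing page can produce.
// facets maps a parameter name to its number of possible values;
// each parameter is either absent or set to one value, hence (count + 1).
function countFacetUrls(facets) {
  return Object.values(facets).reduce((total, count) => total * (count + 1), 1);
}
```

For a hypothetical listing with 12 colors, 8 sizes, 20 brands and 4 sort orders, `countFacetUrls({ color: 12, size: 8, brand: 20, sort: 4 })` gives 13 × 9 × 21 × 5 = 12,285 variants for that single page. Ten such listings already pass 100,000 URLs, and the count grows further if the site emits the same parameters in different orders.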
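How to confirm: open Crawl Stats and check the sample URLs under each breakdown (by response, by file type, by purpose, by Googlebot type) to see whether parameter URLs dominate. For an exact per-path count, use your server access logs.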
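If you have server access logs, a small script can show where Googlebot's requests actually go. This is a sketch that assumes the common "combined" log format and trusts the user-agent string, which can be spoofed; `googlebotHitsBySection` is a hypothetical helper name.

```javascript
// Group Googlebot requests by top-level path section, splitting off parameter URLs.
// Assumes combined log format: ... "GET /path?query HTTP/1.1" ... "user agent"
function googlebotHitsBySection(logLines) {
  const counts = new Map();
  for (const line of logLines) {
    if (!/Googlebot/i.test(line)) continue; // UA match only; verify IPs separately if it matters
    const match = line.match(/"(?:GET|HEAD) (\S+) HTTP/);
    if (!match) continue;
    const [path, query] = match[1].split("?");
    const section = "/" + (path.split("/")[1] || "");
    const key = query === undefined ? section : section + "?params";
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Highest-traffic buckets first
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Feed it the lines of a day's log (for example `fs.readFileSync(file, "utf8").split("\n")`). If buckets ending in `?params` sit at the top, faceted URLs are consuming the budget.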
3. Poor internal link structure, most pages 5+ clicks from homepage
Home → /blog → /blog/page/15 → /blog/2023 → /blog/2023/04 → article
5-click-deep articles = lowest priority to Google.
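Click depth is just shortest-path distance from the homepage, so a breadth-first search over your internal link graph measures it. The sketch below assumes you already have the graph as an object mapping each URL to the URLs it links to (for example, exported from a crawler); `clickDepths` is an illustrative name.

```javascript
// Breadth-first search from the homepage: depth = minimum number of clicks.
// links: { "/": ["/blog", ...], "/blog": [...], ... }
function clickDepths(links, start = "/") {
  const depth = new Map([[start, 0]]);
  const queue = [start];
  for (let i = 0; i < queue.length; i++) {
    const page = queue[i];
    for (const target of links[page] || []) {
      if (!depth.has(target)) { // first visit is the shortest path in BFS
        depth.set(target, depth.get(page) + 1);
        queue.push(target);
      }
    }
  }
  return depth; // pages missing from the map are unreachable (orphans)
}
```

Filter the result for entries with depth 5 or more to list the pages that need a shorter path; pages absent from the result have no internal links pointing to them at all.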
4. Sitemap is enormous, mixed with parameter URLs / noindex / 404
# Check sitemap size and URL count
curl -s https://yourdomain.com/sitemap.xml | wc -c
curl -s https://yourdomain.com/sitemap.xml | grep -c "<loc>"
# A single file over 50MB (uncompressed) or 50,000 URLs exceeds the sitemap limit and must be split
It might also contain 410 / noindex / redirected URLs, wasting Google’s time.
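If the URL count is the problem, the usual remedy is to split the list into several files and reference them from a sitemap index. A minimal sketch, assuming a `sitemap-N.xml` naming scheme (my choice, not a requirement) and handling only the URL-count limit, not the byte-size limit:

```javascript
const MAX_URLS_PER_SITEMAP = 50000; // per-file limit from the sitemap protocol

// Split a flat URL list into chunks that each fit in one sitemap file.
function chunkUrls(urls, size = MAX_URLS_PER_SITEMAP) {
  const chunks = [];
  for (let i = 0; i < urls.length; i += size) chunks.push(urls.slice(i, i + size));
  return chunks;
}

// Build the index file that points at sitemap-1.xml ... sitemap-N.xml.
function sitemapIndexXml(baseUrl, chunkCount) {
  const entries = Array.from({ length: chunkCount }, (_, i) =>
    `  <sitemap><loc>${baseUrl}/sitemap-${i + 1}.xml</loc></sitemap>`);
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}
```

Submit only the index file in Search Console; splitting by section (one file per content type) also makes it easier to see which section has the worst indexing rate.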
5. Low site authority, Google caps daily crawl rate
If your site has a Domain Rating (DR, a third-party authority estimate) below 20 or fewer than 10k monthly visits, the overall crawl budget is capped. Even with optimization, you are working within a small budget.
6. Server slow / unstable
Googlebot self-throttles on slow sites. If your server averages more than 2s per response to Googlebot or frequently returns 5xx errors, daily fetch volume can drop by half almost immediately.
Shortest path to fix
Step 1: Audit the sitemap
# List URLs in the sitemap, then print any that do not return 200
curl -s https://yourdomain.com/sitemap.xml | grep -oE 'https://[^<]+' > all-sitemap.txt
while IFS= read -r url; do
  # -I sends a HEAD request; -w prints only the status code
  status=$(curl -sI -o /dev/null -w "%{http_code}" "$url")
  [ "$status" != "200" ] && echo "$status $url"
done < all-sitemap.txt
From the sitemap, remove:
- 404 / 410 URLs
- noindex pages
- Parameter URLs (utm / sort / filter)
- Pages with < 100 words
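The removal rules above can be expressed as a single filter. This sketch assumes you have already collected one record per URL with `status`, `noindex` and `words` fields; that record shape and the `keepInSitemap` name are hypothetical, and the parameter list mirrors the utm / sort / filter examples.

```javascript
// Parameter names that should never appear in a sitemap URL.
const BLOCKED_PARAMS = [/^utm_/, /^sort$/, /^filter$/];

// page: { url: string, status: number, noindex: boolean, words: number }
function keepInSitemap(page) {
  if (page.status !== 200) return false; // 404 / 410 / redirects
  if (page.noindex) return false;        // noindex pages
  const params = [...new URL(page.url).searchParams.keys()];
  if (params.some((name) => BLOCKED_PARAMS.some((re) => re.test(name)))) return false;
  if (page.words < 100) return false;    // too thin to be worth listing
  return true;
}
```

Apply it with `pages.filter(keepInSitemap)` inside whatever generates the sitemap, so removed URLs do not creep back in on the next build.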
Step 2: Identify the thinnest 20-30%, merge or noindex
Use the thin-page script from cause 1 above. Decision matrix:
| Words | Action |
|---|---|
| < 100 | 410 delete |
| 100-300 | noindex,follow or merge into related hub |
| 300-500 | expand to 800+, otherwise noindex |
| 500+ | keep, check quality |
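The matrix translates directly into a function you can run over the word counts from the audit. The table's ranges overlap at 300 and 500; this sketch assumes each boundary value belongs to the higher bucket.

```javascript
// Map a page's word count to the action from the decision matrix.
// Assumption: boundary values (100, 300, 500) fall into the higher bucket.
function actionForWordCount(words) {
  if (words < 100) return "410 delete";
  if (words < 300) return "noindex,follow or merge into related hub";
  if (words < 500) return "expand to 800+, otherwise noindex";
  return "keep, check quality";
}
```

Word count is only a first pass: a 250-word page that fully answers a narrow question can still deserve to stay, so review the output before deleting anything.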
Step 3: Block faceted / parameter URLs via robots.txt or noindex
User-agent: *
# /*?*name matches the parameter anywhere in the query string, not only in first position
Disallow: /*?*utm_
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*color=
Disallow: /*?*size=
Disallow: /search
Disallow: /tag/
Sitemap: https://yourdomain.com/sitemap.xml
This step can immediately free up 50-80% of crawl budget from faceted URLs.
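Wildcard rules are easy to get subtly wrong, so it helps to test them against real URLs before deploying. The sketch below approximates Google's documented matching (prefix match, `*` for any character sequence, trailing `$` for end of URL). It ignores Allow rules and rule precedence, and the function names are illustrative.

```javascript
// Convert one robots.txt path pattern into a RegExp.
function robotsPatternToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");                                                 // * matches any sequence
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// True if any Disallow pattern matches the path (including its query string).
function isDisallowed(pathWithQuery, disallowPatterns) {
  return disallowPatterns.some((p) => robotsPatternToRegExp(p).test(pathWithQuery));
}
```

For example, `isDisallowed("/products?page=2&sort=price", ["/*?sort="])` is false because `sort` is not the first parameter, while the pattern `/*?*sort=` does match it. Treat Search Console's robots.txt report as the final authority.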
Step 4: Flatten the structure
Old: Home → /blog → /blog/page/15 → /blog/2023 → /blog/2023/04 → article (5 clicks)
New: Home → /articles index → article (2 clicks)
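---
// src/pages/articles/index.astro
import { getCollection } from 'astro:content';

// Assumption: each post's frontmatter has a `topic` field; adjust to your schema.
function groupByTopic(posts) {
  const groups = {};
  for (const post of posts) {
    const topic = post.data.topic ?? 'uncategorized';
    (groups[topic] ??= []).push(post);
  }
  return groups;
}

const all = await getCollection('posts');
const grouped = groupByTopic(all); // one group per hub / pillar page
---
{Object.entries(grouped).map(([topic, posts]) => (
  <section>
    <h2><a href={`/topic/${topic}/`}>{topic}</a> ({posts.length})</h2>
    <ul>{posts.map(p => <li><a href={`/articles/${p.slug}/`}>{p.data.title}</a></li>)}</ul>
  </section>
))}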
Important articles ≤ 3 clicks. Add hub / pillar pages to concentrate authority.
Step 5: Earn backlinks for strongest content
Site-wide authority is the crawl budget ceiling. Every 5-10 quality backlinks tends to bring a noticeable crawl budget lift.
Focus on your top 20 highest-value articles; aim for 1-3 backlinks each:
- Original content on Reddit / HN
- Guest post on industry sites
- awesome-list submissions
- Data-driven reports as news angle
Step 6: Optimize server response time
# Speed test simulating Googlebot
curl -sL -A "Mozilla/5.0 (compatible; Googlebot/2.1)" \
-w "%{time_total}\n" -o /dev/null https://yourdomain.com/articles/foo/
If the total is over 1500ms, add a CDN, add caching, and optimize SSR. Halving response time can double crawl volume.
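A single request can be misleading because of cold caches and network jitter, so it is worth sampling several and looking at the median and tail. The sketch below assumes Node 18+ (global `fetch` and `performance`); both function names are illustrative, and the percentile uses a simple nearest-rank pick.

```javascript
// Time several full fetches of one URL with a Googlebot-like user agent.
async function sampleResponseTimes(url, runs = 10) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const started = performance.now();
    const res = await fetch(url, {
      headers: { "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)" },
    });
    await res.arrayBuffer(); // include body download, similar to curl's time_total
    samples.push(performance.now() - started);
  }
  return samples;
}

// Summarize millisecond samples; expects a non-empty array.
function summarizeTimings(msSamples) {
  const sorted = [...msSamples].sort((a, b) => a - b);
  const at = (q) => sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
  return { median: at(0.5), p95: at(0.95), max: sorted[sorted.length - 1] };
}
```

Usage: `summarizeTimings(await sampleResponseTimes("https://yourdomain.com/articles/foo/"))`. A healthy median with a very high p95 usually points to cache misses or an overloaded origin rather than uniformly slow rendering.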
Step 7: Patience — 4-12 weeks
Measurable improvement typically takes 4-12 weeks:
- 4 weeks: crawl budget reallocates, new article fetch rate rises
- 8 weeks: thin pages deindexed, total indexed may dip then climb
- 12 weeks: overall indexing rate up 20-40%
When this is not on you
For a brand-new 10k-page site, Google deliberately ramps crawl rate over months. Even with perfect technical setup, 100% indexing within weeks is unrealistic. Patience + sustained quality + backlinks = the only path.
Easy to misdiagnose
- One-by-one URL Inspection doesn’t scale: with 10k pages at 10/day quota, you’d need 1000 days
- Thinking more sitemap submissions accelerate: sitemap only affects discovery, not budget allocation
- Thinking <priority>1.0 helps: Google ignores priority
- Thinking more content “boosts” authority: bulk low-quality content activates SpamBrain in the wrong direction
Prevention
- Only publish a page if it provides something the existing 1000 pages don’t
- Periodically run a content audit: keep top half, prune bottom half
- Parameter / faceted URLs default to noindex or robots-blocked from day one
- Sitemap always auto-generated; periodic health checks
- Weekly Crawl Stats review to catch budget waste early
FAQ
Q: How fast should a 10k-page site fully index? A: Realistically 3-9 months, often longer for new domains. 100% indexing is rare — 60-80% is the practical maximum for good sites.
Q: Does Google index sites at a fixed rate? A: No. Rate scales with site authority, content uniqueness, and historical crawl health, dynamically.
Q: Can I request higher crawl budget from Google? A: No direct request. Search Console's crawl rate setting could only cap the maximum rate (to prevent overload), never raise it, and Google retired that setting in early 2024. Lift comes from authority building.
Q: Does deleting half the pages really help the others? A: Yes. It frees crawl budget and lifts the overall quality signal; 6-12 weeks later, the retained pages’ indexing rates and rankings both tend to rise.
Related
- Discovered - currently not indexed
- Thin pages deprioritized by Google
- Google crawls homepage but never articles