Large Content Site Indexing Slowly

Thousands of pages, only a fraction indexed. How to tell if it's really crawl budget, and what actually moves the needle in 2026.

Published: May 19, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You have 10,000 articles. Search Console shows only 2,500 of them indexed (a 25% rate), Crawl Stats reports Googlebot fetching just 200 URLs per day, and new articles take two months to index.

Fastest fix: find the thin/duplicate half of your URLs and remove them from the crawl path (410 delete, noindex, or robots block), then concentrate internal links on the pages worth keeping. On a large site the constraint is crawl budget plus quality demand, not publishing volume. The move is “cut the thin half, strengthen the rest” — not “publish more.”

But first, confirm crawl budget is actually your problem. Per Google’s crawl budget guidance (as of June 2026), crawl budget only becomes a real bottleneck for:

Sites with 1,000,000+ unique pages whose content changes about weekly, or
Sites with 10,000+ unique pages whose content changes very rapidly (daily), or
Any site where “Discovered — currently not indexed” is large and growing.

If you are under those thresholds and pages still aren’t indexed, the cause is usually quality/demand (Google found the page but didn’t think it was worth indexing), not raw crawl capacity. The diagnosis section below tells you which bucket you’re in.
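The thresholds above can be written down as a quick self-check. This is a minimal sketch: the function name and input fields are hypothetical, and the cadence labels are a simplification of "changes about weekly" and "changes daily".

```javascript
// Rough check of whether a site is in the range where crawl budget is
// usually the real bottleneck. Field names are hypothetical.
function crawlBudgetLikelyMatters({ uniquePages, changeCadence, discoveredNotIndexedGrowing }) {
  const weeklyOrFaster = changeCadence === "weekly" || changeCadence === "daily";
  const largeAndWeekly = uniquePages >= 1_000_000 && weeklyOrFaster;
  const mediumAndDaily = uniquePages >= 10_000 && changeCadence === "daily";
  return largeAndWeekly || mediumAndDaily || Boolean(discoveredNotIndexedGrowing);
}

const site = { uniquePages: 10_000, changeCadence: "weekly", discoveredNotIndexedGrowing: false };
console.log(crawlBudgetLikelyMatters(site)); // false: look at quality/demand first
```

A `false` result does not prove crawling is healthy; it only suggests starting with the quality/demand explanation.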

Symptoms

Page indexing report: “Discovered - currently not indexed” grows into the thousands
Most new pages take 2+ months to index, or never get there
Crawl Stats on a 10k+ page site shows only 100-500 URLs/day fetched
Overall indexing rate < 50%

Which bucket are you in?

Signal	Likely cause	Where to look
“Discovered - currently not indexed” huge; Crawl Stats fetches few URLs/day	Crawl capacity / budget	Crawl Stats → total requests
“Crawled - currently not indexed” huge; Google did fetch but dropped it	Quality / demand	Index → Page indexing report
Crawl Stats shows mostly non-200 (3xx/4xx/5xx) responses	Wasted budget on junk URLs	Crawl Stats → “By response”
Average response time to Googlebot `> 1s`, or 5xx spikes	Server capacity throttling	Crawl Stats → “Average response time”

“Discovered — currently not indexed” means Google knows the URL exists but hasn’t spent budget crawling it yet — a crawl-budget signal. “Crawled — currently not indexed” means Google fetched the page and chose not to index it — a quality signal. The fixes differ, so read this report first.
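As a rough illustration, the table can be turned into a small triage function. The 30% and 50% cutoffs below are illustrative assumptions (the table only says "huge" and "mostly"), and the input field names are hypothetical; only the 1s response-time line comes from the table.

```javascript
// Map numbers copied from Search Console to the buckets in the table.
// Cutoffs other than the 1000 ms response time are assumptions.
function diagnose({ totalUrls, discoveredNotIndexed, crawledNotIndexed, non200Share, avgResponseMs }) {
  const causes = [];
  if (discoveredNotIndexed / totalUrls > 0.3) causes.push("crawl capacity / budget");
  if (crawledNotIndexed / totalUrls > 0.3) causes.push("quality / demand");
  if (non200Share > 0.5) causes.push("wasted budget on junk URLs");
  if (avgResponseMs > 1000) causes.push("server capacity throttling");
  return causes;
}

console.log(diagnose({
  totalUrls: 10000,
  discoveredNotIndexed: 6000,
  crawledNotIndexed: 1500,
  non200Share: 0.2,
  avgResponseMs: 1400,
}));
// [ 'crawl capacity / budget', 'server capacity throttling' ]
```

More than one bucket can apply at once, which is why the function returns a list rather than a single verdict.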

Quick verdict

On large sites, crawl budget plus content demand becomes the bottleneck. Google decides how much of your site is worth crawling and indexing based on serving capacity, popularity, uniqueness, and freshness, at the site, section, and page level. The fix is not “publish more” — it is “cut the thin half, strengthen the rest.”

Common causes

1. Many pages thin or duplicate, eating crawl budget

If 5,000 of 10,000 pages are < 300 words or near-identical programmatic pages, Googlebot exhausts daily budget on them and real articles can’t get in. Google explicitly treats duplicate/near-duplicate URLs as wasted “perceived inventory.”

How to confirm:

// scripts/count-thin.mjs
import fg from "fast-glob";
import fs from "node:fs";
import matter from "gray-matter";

let total = 0, thin = 0;
for (const f of fg.sync("src/content/**/*.{md,mdx}")) {
  const { content } = matter(fs.readFileSync(f, "utf8"));
  const words = content.replace(/```[\s\S]+?```/g, "").split(/\s+/).filter(Boolean).length;
  total++;
  if (words < 300) thin++;
}
console.log(`Thin (<300 words): ${thin}/${total} = ${(thin/total*100).toFixed(1)}%`);

A thin-page share of 20% or more is a major problem.

2. Faceted / parameter URLs multiply into near-duplicates

Typical on e-commerce sites and large blogs:

/products
/products?color=red
/products?color=red&size=M
/products?color=red&size=M&sort=price
... combinatorial explosion can yield 100,000+ URLs

Every variant is treated as a separate URL, wasting budget. This is the single most common crawl-budget sink Google calls out for large sites.

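How to confirm: Crawl Stats has no per-path breakdown (it groups requests by response, file type, purpose, and Googlebot type), so open the example URLs under “By response” or “By purpose” and look for repeated parameter patterns, or group Googlebot requests by path in your server logs.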
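To see which path patterns absorb the most Googlebot requests, one option is to group request URLs by path plus parameter names. This is a sketch that assumes you have already extracted the requested URLs from your own access logs; the sample list and the `countByPattern` name are hypothetical.

```javascript
// Group URLs by path plus the set of query parameter names, so that
// /products?color=red and /products?color=blue count as one pattern.
function countByPattern(urls) {
  const counts = new Map();
  for (const raw of urls) {
    // The base is only there so relative paths can be parsed.
    const u = new URL(raw, "https://example.com");
    const params = [...new Set(u.searchParams.keys())].sort();
    const pattern = params.length ? `${u.pathname}?${params.join("&")}` : u.pathname;
    counts.set(pattern, (counts.get(pattern) ?? 0) + 1);
  }
  // Most-requested patterns first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical sample of paths requested by Googlebot.
const sample = [
  "/products?color=red",
  "/products?color=blue",
  "/products?color=red&size=M",
  "/products?size=L&color=red",
  "/articles/foo/",
];
console.log(countByPattern(sample));
```

If a handful of parameter patterns account for most requests, those are the candidates for the robots.txt rules in the fix section.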

3. Poor internal link structure, most pages 5+ clicks from homepage

Home → /blog → /blog/page/15 → /blog/2023 → /blog/2023/04 → article

Articles buried 5 clicks deep get the lowest crawl priority. Internal links are how Google discovers and prioritizes URLs, so depth maps directly to crawl frequency.
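Click depth can be measured with a breadth-first search over your internal link graph. The sketch below assumes you already have a map from each page to the pages it links to (for example from a crawl of your own site); the `links` data is hypothetical and mirrors the path shown above.

```javascript
// Breadth-first search from the homepage. The first time a page is
// reached is its minimum click depth.
function clickDepths(links, start = "/") {
  const depth = new Map([[start, 0]]);
  const queue = [start];
  for (let i = 0; i < queue.length; i++) {
    const page = queue[i];
    for (const target of links[page] ?? []) {
      if (!depth.has(target)) {
        depth.set(target, depth.get(page) + 1);
        queue.push(target);
      }
    }
  }
  return depth;
}

// Hypothetical link graph: each key links to the listed pages.
const links = {
  "/": ["/blog"],
  "/blog": ["/blog/page/15"],
  "/blog/page/15": ["/blog/2023"],
  "/blog/2023": ["/blog/2023/04"],
  "/blog/2023/04": ["/blog/2023/04/article"],
};
const depths = clickDepths(links);
console.log(depths.get("/blog/2023/04/article")); // 5
```

Pages missing from the returned map are unreachable from the homepage through internal links, which is worth flagging separately.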

4. Sitemap is enormous, mixed with parameter URLs / noindex / 404

# Check sitemap size and URL count
curl -s https://yourdomain.com/sitemap.xml | wc -c
curl -s https://yourdomain.com/sitemap.xml | grep -c "<loc>"

# Single file > 50MB (uncompressed) or > 50,000 URLs → split it

A sitemap that contains 410 / noindex / redirected URLs sends Googlebot to dead ends and wastes the discovery you get from the sitemap. Use a sitemap index that points to multiple clean child sitemaps, and include an accurate <lastmod> so Google can prioritize what actually changed.
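A sitemap index with clean child sitemaps can be generated along these lines. This is a sketch, not a drop-in generator: the `buildSitemaps` name, the `sitemap-N.xml` file naming, and the `{ loc, lastmod }` entry shape are assumptions, and `lastmod` is assumed to be an ISO date string.

```javascript
const MAX_URLS_PER_FILE = 50000; // per-file URL limit mentioned above

function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Split entries into child sitemaps and build an index that points to them.
function buildSitemaps(entries, baseUrl, chunkSize = MAX_URLS_PER_FILE) {
  const files = [];
  for (let i = 0; i < entries.length; i += chunkSize) {
    const chunk = entries.slice(i, i + chunkSize);
    const urls = chunk
      .map(e => `  <url><loc>${escapeXml(e.loc)}</loc><lastmod>${e.lastmod}</lastmod></url>`)
      .join("\n");
    // ISO dates (YYYY-MM-DD) sort correctly as strings, so the last is the newest.
    const newest = chunk.map(e => e.lastmod).sort().pop();
    files.push({
      name: `sitemap-${files.length + 1}.xml`,
      lastmod: newest,
      xml: `<?xml version="1.0" encoding="UTF-8"?>\n` +
        `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`,
    });
  }
  const refs = files
    .map(f => `  <sitemap><loc>${escapeXml(baseUrl + "/" + f.name)}</loc><lastmod>${f.lastmod}</lastmod></sitemap>`)
    .join("\n");
  const index = `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${refs}\n</sitemapindex>`;
  return { index, children: files };
}

// Tiny chunk size so the split is visible with three hypothetical URLs.
const { index, children } = buildSitemaps([
  { loc: "https://yourdomain.com/articles/a/", lastmod: "2026-06-01" },
  { loc: "https://yourdomain.com/articles/b/", lastmod: "2026-06-15" },
  { loc: "https://yourdomain.com/articles/c/", lastmod: "2026-05-20" },
], "https://yourdomain.com", 2);
console.log(children.length); // 2
console.log(index);
```

Only feed this URLs that already passed the sitemap audit; splitting a dirty sitemap into several files does not make it cleaner.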

5. Low site authority, Google caps crawl demand

If your site has low authority (for example an Ahrefs Domain Rating below 20, a third-party proxy rather than a Google metric) or fewer than 10k monthly visits, crawl demand is low: Google simply doesn’t have much reason to crawl deeply yet. Even with perfect structure, you’re working within a small budget until popularity and uniqueness rise.

6. Server slow or unstable

Googlebot self-throttles on slow sites. If your average response to Googlebot is > 1s, or you return 5xx / 429 frequently, Google lowers the crawl capacity limit and daily fetch volume drops. This is the one crawl-capacity lever you fully control: faster, more reliable serving raises the ceiling.

Shortest path to fix

Step 1: Audit the sitemap

# List the URLs in the sitemap and print any that do not return 200
curl -s https://yourdomain.com/sitemap.xml | grep -oE '<loc>[^<]+' | sed 's/<loc>//' > all-sitemap.txt
while IFS= read -r url; do
  # -I sends a HEAD request; switch to a GET if your server rejects HEAD
  status=$(curl -sI -o /dev/null -w "%{http_code}" "$url")
  [ "$status" != "200" ] && echo "$status $url"
done < all-sitemap.txt

From the sitemap, remove:

404 / 410 URLs
noindex pages
Parameter URLs (utm / sort / filter)
Pages with < 100 words
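The removal list above can be applied as a filter before the sitemap is generated. This sketch assumes each entry has already been annotated by a crawl of your own site; the field names (`status`, `noindex`, `words`) and the parameter list in the regular expression are hypothetical and should be adapted to your URLs.

```javascript
// Matches utm_*, sort, and filter parameters anywhere in the query string.
const TRACKING_OR_FACET = /[?&](utm_[^=]*|sort|filter)=/;

function keepInSitemap(entry) {
  if (entry.status !== 200) return false;              // 404 / 410 / redirects
  if (entry.noindex) return false;                     // noindex pages
  if (TRACKING_OR_FACET.test(entry.url)) return false; // parameter URLs
  if (entry.words < 100) return false;                 // near-empty pages
  return true;
}

// Hypothetical crawl output.
const crawled = [
  { url: "https://yourdomain.com/articles/foo/", status: 200, noindex: false, words: 1200 },
  { url: "https://yourdomain.com/articles/old/", status: 410, noindex: false, words: 0 },
  { url: "https://yourdomain.com/articles/foo/?utm_source=x", status: 200, noindex: false, words: 1200 },
  { url: "https://yourdomain.com/tag/misc/", status: 200, noindex: true, words: 400 },
  { url: "https://yourdomain.com/articles/stub/", status: 200, noindex: false, words: 60 },
];
console.log(crawled.filter(keepInSitemap).map(e => e.url));
// [ 'https://yourdomain.com/articles/foo/' ]
```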

Step 2: Identify the thinnest 20-30%, merge or noindex

Use the thin-page script from cause 1 (“How to confirm”), then apply this matrix:

Words	Action
`< 100`	`410` delete
100-300	`noindex,follow` or merge into related hub
300-500	expand to 800+, otherwise `noindex`
500+	keep, check quality

To genuinely recover crawl budget, prefer 410/robots over noindex. A noindex page still has to be crawled for Google to see the tag, so it keeps consuming budget; a 410 or robots-blocked path stops the crawl entirely.
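The matrix can be encoded as a lookup so the thin-page audit emits a proposed action per page. This is a minimal sketch: the table's ranges share their boundary values, so the half-open boundaries chosen here (exactly 300 words falls in the 300-500 row) are an assumption, and the `actionFor` name is hypothetical.

```javascript
// Word count -> proposed action, following the matrix above.
function actionFor(words) {
  if (words < 100) return "410 delete";
  if (words < 300) return "noindex,follow or merge into related hub";
  if (words < 500) return "expand to 800+, otherwise noindex";
  return "keep, check quality";
}

for (const words of [40, 250, 420, 1500]) {
  console.log(words, "->", actionFor(words));
}
```

Treat the output as a proposal to review: a short page that earns traffic or links may deserve expansion rather than a 410.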

Step 3: Block faceted / parameter URLs via robots.txt

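User-agent: *
# "?*name=" matches the parameter anywhere in the query string,
# not only when it is the first parameter
Disallow: /*?*utm_
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*color=
Disallow: /*?*size=
Disallow: /search
Disallow: /tag/

Sitemap: https://yourdomain.com/sitemap.xml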

Google’s own large-site guidance recommends robots.txt (not noindex) for infinite parameter spaces, precisely because blocked URLs are never fetched. This step can immediately free 50-80% of crawl budget from faceted URLs.
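Before deploying rules like these, it can help to check which URLs a pattern really matches. Patterns are matched from the start of the path, so a rule such as `/*?size=` only applies when `size` is the first parameter. The matcher below is a sketch of the `*` and `$` wildcard behavior Google documents for robots.txt; it is not a full parser (no Allow/Disallow precedence, no user-agent groups), and the function name is hypothetical.

```javascript
// Returns true if a robots.txt path pattern matches the given path + query.
function matchesRobotsPattern(pattern, pathAndQuery) {
  const anchoredEnd = pattern.endsWith("$");
  const core = anchoredEnd ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters in the literal parts, then join on ".*"
  // so each "*" in the pattern matches any run of characters.
  const source = core
    .split("*")
    .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + source + (anchoredEnd ? "$" : "")).test(pathAndQuery);
}

console.log(matchesRobotsPattern("/*?size=", "/products?size=M"));            // true
console.log(matchesRobotsPattern("/*?size=", "/products?color=red&size=M"));  // false
console.log(matchesRobotsPattern("/*?*size=", "/products?color=red&size=M")); // true
console.log(matchesRobotsPattern("/tag/", "/tag/seo/"));                      // true
```

Running your real top-crawled URLs through a check like this shows whether the rules cover the combinations that actually get fetched.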

Step 4: Flatten the structure

Old: Home → /blog → /blog/page/15 → /blog/2023 → /blog/2023/04 → article (5 clicks)
New: Home → /articles index → article (2 clicks)

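---
// src/pages/articles/index.astro
import { getCollection } from 'astro:content';

// Group posts by topic. Assumes each post's frontmatter has a `topic`
// field (hypothetical; adjust to your own schema).
function groupByTopic(posts) {
  const groups = {};
  for (const post of posts) {
    const topic = post.data.topic ?? 'uncategorized';
    (groups[topic] ??= []).push(post);
  }
  return groups;
}

const all = await getCollection('posts');
const grouped = groupByTopic(all);  // hub / pillar pages
---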
{Object.entries(grouped).map(([topic, posts]) => (
  <section>
    <h2><a href={`/topic/${topic}/`}>{topic}</a> ({posts.length})</h2>
    <ul>{posts.map(p => <li><a href={`/articles/${p.slug}/`}>{p.data.title}</a></li>)}</ul>
  </section>
))}

Keep important articles within 3 clicks of the homepage. Add hub / pillar pages to concentrate authority and shorten the crawl path to your best content.

Step 5: Earn backlinks for strongest content

Popularity is a direct input to crawl demand, so external links raise both indexing priority and the crawl ceiling. Focus on your top 20 highest-value articles; aim for 1-3 quality backlinks each:

Original content on Reddit / HN
Guest posts on industry sites
awesome-list submissions
Data-driven reports with a news angle

Step 6: Optimize server response time

# Speed test simulating Googlebot
curl -sL -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  -w "%{time_total}\n" -o /dev/null https://yourdomain.com/articles/foo/

If response time is > 1000ms, add a CDN, enable caching, and optimize SSR. Google raises the crawl capacity limit when the site responds quickly and stops returning errors, so halving response time can noticeably lift crawl volume.

Step 7: Patience — 4-12 weeks

Measurable improvement typically takes 4-12 weeks:

4 weeks: crawl budget reallocates, new-article fetch rate rises
8 weeks: thin pages deindexed, total indexed may dip then climb
12 weeks: overall indexing rate up 20-40%

How to confirm it’s fixed

Don’t judge by the headline “Indexed” number alone — it moves slowly. Check these instead, weekly:

Crawl Stats → Total crawl requests trends up, and “By response” shows a higher share of 200s (less budget spent on 3xx/4xx/5xx).
Crawl Stats → Average response time trends down after Step 6.
Page indexing report: the “Discovered — currently not indexed” count stops growing, then shrinks.
Spot-check new URLs with the URL Inspection tool: “Crawl → Last crawl” shows a recent date instead of “URL is unknown to Google.”
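The weekly check on response codes is simple arithmetic: the share of 200s is the 200 count divided by all requests. The sketch below assumes you copy the “By response” counts from Crawl Stats by hand each week; the numbers and the `okShare` name are hypothetical.

```javascript
// Share of crawl requests that returned 200.
function okShare(byResponse) {
  const total = Object.values(byResponse).reduce((sum, n) => sum + n, 0);
  return total === 0 ? 0 : (byResponse["200"] ?? 0) / total;
}

// Hypothetical weekly snapshots copied from Crawl Stats.
const weeks = [
  { week: "W1", byResponse: { "200": 700, "301": 400, "404": 300 } },
  { week: "W4", byResponse: { "200": 1500, "301": 200, "404": 100 } },
];
for (const { week, byResponse } of weeks) {
  console.log(week, (okShare(byResponse) * 100).toFixed(1) + "%");
}
// W1 50.0%
// W4 83.3%
```

A rising share alongside rising total requests is the pattern to look for; a rising share with falling totals may only mean Google is crawling less overall.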

When this is not on you

For a brand-new 10k-page site, Google deliberately ramps crawl rate over months. Even with perfect technical setup, 100% indexing within weeks is unrealistic. Patience, sustained quality, and backlinks are the only path.

Easy to misdiagnose

One-by-one URL Inspection doesn’t scale: the manual “Request indexing” button is capped at roughly 10-12 URLs per day per property (as of June 2026), and it’s a nudge for discovery, not a budget increase. At 10k pages that’s pointless.
The Indexing API won’t help here: Google’s Indexing API officially supports only JobPosting and BroadcastEvent (livestream) pages. Submissions for normal articles fall outside that support: they may be ignored and, per Google Search Relations, can be dropped without notice. It does not index blog posts or product pages.
More sitemap submissions don’t accelerate indexing: a sitemap affects discovery, not budget allocation. Resubmitting daily changes nothing.
<priority> 1.0 does nothing: Google ignores the <priority> and <changefreq> sitemap hints.
More content does not “boost” authority: bulk low-quality publishing pushes quality signals (SpamBrain) the wrong way and can lower your overall crawl demand.

Prevention

Only publish a page if it offers something your existing pages don’t
Periodically run a content audit: keep the top half, prune the bottom half
Parameter / faceted URLs default to noindex or robots-blocked from day one
Sitemap always auto-generated with accurate <lastmod>; periodic health checks
Weekly Crawl Stats review to catch budget waste early

FAQ

Q: How fast should a 10k-page site fully index? A: Realistically 3-9 months, often longer for new domains. 100% indexing is rare; 60-80% is the practical maximum for good sites.

Q: Does Google index sites at a fixed rate? A: No. The rate scales dynamically with serving capacity, content uniqueness, popularity, and historical crawl health.

Q: Can I ask Google to raise my crawl budget? A: Not directly. The old Search Console “crawl rate” setting that let you limit Googlebot was removed on January 8, 2024 — Google now sets crawl rate automatically based on your server’s responsiveness. There has never been a control to raise it. The only levers are faster serving (raises the capacity limit) and more authority/uniqueness (raises demand). If Googlebot is actively overloading your server, you file the special “Report problems with Googlebot crawling” form instead.

Q: Does deleting half the pages really help the others? A: Yes. It frees crawl budget and lifts the overall quality signal. Six to twelve weeks later, the retained pages’ indexing rates and rankings both tend to rise.

Q: My pages say “Crawled — currently not indexed,” not “Discovered.” Same fix? A: No. “Crawled — currently not indexed” means Google fetched the page and judged it not worth indexing — that’s a quality/uniqueness problem, not a budget one. Improve depth and originality on those specific pages rather than touching robots.txt.

Tags: #SEO #Google #Search Console #Indexing #Troubleshooting #Large site #Crawl budget