Large Content Site Indexing Slowly

The site has thousands of pages, but only a fraction get indexed. Here is why that happens and what actually moves the needle.

Say you have 10,000 articles and Search Console reports only 2,500 as indexed, a 25% rate. Crawl Stats shows Googlebot fetching just 200 URLs per day, and new articles take about 2 months to index.

The core constraint on large sites is crawl budget. Google uses quality signals at the site, section, and page level to decide how many requests to spend daily. The fix isn’t “publish more” — it’s cut the thin half, strengthen the rest.

Symptoms

  • Page indexing report (formerly Index Coverage): "Discovered - currently not indexed" grows into the thousands
  • Most new pages take 2+ months to index (or never get there)
  • Crawl Stats on a 10k+ page site shows only 100-500 URLs/day fetched
  • Overall indexing rate < 50%
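
The symptoms above reduce to two numbers you can compute from Search Console figures. A minimal sketch, using the example figures from this article as inputs; the function name is illustrative, and the "full pass" estimate optimistically assumes Googlebot never refetches a URL:

```javascript
// Rough crawl health from three Search Console figures.
function crawlHealth({ totalPages, indexedPages, fetchesPerDay }) {
  const indexingRate = indexedPages / totalPages;
  // Days for Googlebot to request every URL once, assuming no refetches (optimistic).
  const daysPerFullPass = Math.ceil(totalPages / fetchesPerDay);
  return { indexingRate, daysPerFullPass };
}

const health = crawlHealth({ totalPages: 10000, indexedPages: 2500, fetchesPerDay: 200 });
console.log(
  `${(health.indexingRate * 100).toFixed(1)}% indexed, ` +
  `~${health.daysPerFullPass} days per full crawl pass`
);
```

At 200 fetches per day, one pass over 10,000 URLs takes about 50 days even in the best case, which lines up with new articles waiting roughly 2 months.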

Quick verdict

On large sites, crawl budget becomes the bottleneck. Google decides how much of your site is worth indexing based on quality signals at the site, section, and page level. The fix is not “publish more” — it is “cut the thin half, strengthen the rest.”

Common causes

1. Many pages thin or duplicate, eating crawl budget

If 5,000 of your 10,000 pages have fewer than 300 words or are nearly identical programmatic pages, Googlebot spends much of its daily budget on them and the real articles can't get in.

How to confirm:

// scripts/count-thin.mjs
import fg from "fast-glob";
import fs from "node:fs";
import matter from "gray-matter";

let total = 0, thin = 0;
for (const f of fg.sync("src/content/**/*.{md,mdx}")) {
  const { content } = matter(fs.readFileSync(f, "utf8"));
  const words = content.replace(/```[\s\S]+?```/g, "").split(/\s+/).filter(Boolean).length;
  total++;
  if (words < 300) thin++;
}
console.log(`Thin (<300 words): ${thin}/${total} = ${(thin/total*100).toFixed(1)}%`);

A thin-page share of 20% or more is a major problem.
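
Word count catches thin pages but not the "nearly identical programmatic pages" case. One common way to flag those is Jaccard similarity over word shingles. This is a sketch of that general technique, not something Google documents using; the shingle size of 3 and any cutoff you pick (for example 0.8 on full-length pages) are assumptions to tune on your own content:

```javascript
// Word-level shingles: every run of `size` consecutive words, lowercased.
function shingles(text, size = 3) {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const set = new Set();
  for (let i = 0; i + size <= words.length; i++) {
    set.add(words.slice(i, i + size).join(" "));
  }
  return set;
}

// Jaccard similarity: shared shingles / all distinct shingles, from 0 to 1.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const s of a) if (b.has(s)) shared++;
  return shared / (a.size + b.size - shared);
}

// Two programmatic pages that differ by a single city name.
const pageA = "Best running shoes in Austin for beginners and trail runners";
const pageB = "Best running shoes in Dallas for beginners and trail runners";
console.log(jaccard(shingles(pageA), shingles(pageB)).toFixed(2));
```

On very short text a single changed word removes several shingles, so scores look lower than on real pages; run it on full page bodies. Comparing every pair is quadratic, so on 10k pages compare within a template or section rather than across the whole site.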

2. Faceted navigation / URL parameters generate massive low-value URLs

Typical on e-commerce sites and large blogs:

/products
/products?color=red
/products?color=red&size=M
/products?color=red&size=M&sort=price
... combinatorial explosion can yield 100,000+ URLs

Every variant is treated as a separate URL, wasting budget.
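
To see how quickly this grows, count the combinations. A minimal sketch, assuming each facet is either absent or set to exactly one value and parameter order is fixed; the facet names and value counts are made up for illustration:

```javascript
// A facet with n values contributes (n + 1) choices: unset, or one of n values.
// The product across facets is the number of distinct URLs for ONE listing page
// (including the bare URL with no parameters).
function countFacetUrls(facets) {
  return Object.values(facets).reduce(
    (total, values) => total * (values.length + 1),
    1
  );
}

const facets = {
  color: ["red", "blue", "green", "black", "white"],
  size: ["XS", "S", "M", "L", "XL"],
  sort: ["price", "newest", "rating"],
  brand: Array.from({ length: 40 }, (_, i) => `brand-${i}`), // 40 hypothetical brands
};
console.log(countFacetUrls(facets)); // 6 * 6 * 4 * 41 = 5904
```

That is per listing page. Multiply by the number of category pages, and again if the site emits the same parameters in different orders, and 100,000+ URLs is easy to reach.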

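How to confirm: in Crawl Stats, open the breakdowns (by response, by file type, by purpose) and scan the example URLs for parameter patterns. Server access logs filtered to Googlebot show exactly which paths eat the most budget.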

3. Important articles buried too deep

Home → /blog → /blog/page/15 → /blog/2023 → /blog/2023/04 → article

Articles 5 clicks from the home page get the lowest crawl priority.
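
Click depth is the shortest path from the home page through internal links, so a breadth-first search finds it. A minimal sketch, assuming you already have the internal link graph as a plain object (for example, exported from a crawler); the `links` data below mirrors the path above and the article URL is hypothetical:

```javascript
// Breadth-first search from the home page; depth = minimum number of clicks.
function clickDepths(links, start = "/") {
  const depth = new Map([[start, 0]]);
  const queue = [start];
  for (let i = 0; i < queue.length; i++) {
    const page = queue[i];
    for (const target of links[page] ?? []) {
      if (!depth.has(target)) {
        depth.set(target, depth.get(page) + 1);
        queue.push(target);
      }
    }
  }
  return depth; // pages unreachable from `start` are absent from the map
}

const links = {
  "/": ["/blog"],
  "/blog": ["/blog/page/15"],
  "/blog/page/15": ["/blog/2023"],
  "/blog/2023": ["/blog/2023/04"],
  "/blog/2023/04": ["/blog/2023/04/some-article"],
};

// Print everything deeper than 3 clicks.
for (const [url, d] of clickDepths(links)) {
  if (d > 3) console.log(d, url);
}
```

Pages missing from the result are orphans: nothing links to them, which is usually worse than being deep.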

4. Sitemap is enormous, mixed with parameter URLs / noindex / 404

# Check sitemap size and URL count
curl -s https://yourdomain.com/sitemap.xml | wc -c
curl -s https://yourdomain.com/sitemap.xml | grep -c "<loc>"

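# A single file over 50 MB (uncompressed) or 50,000 URLs exceeds the sitemap limits
# and may not be processed in full; split it into several files under a sitemap index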

It might also contain 410 / noindex / redirected URLs, wasting Google’s time.
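
If the sitemap is over the per-file limit, the usual fix is to split it and publish a sitemap index. A minimal sketch of the splitting logic; the `sitemap-N.xml` file naming is hypothetical, and writing the individual part files is left out:

```javascript
const MAX_URLS_PER_SITEMAP = 50000; // per-file limit from the sitemap protocol

// Split a flat URL list into parts that each fit in one sitemap file.
function chunkUrls(urls, size = MAX_URLS_PER_SITEMAP) {
  const parts = [];
  for (let i = 0; i < urls.length; i += size) parts.push(urls.slice(i, i + size));
  return parts;
}

// Build the index file that points at each part (file naming is an assumption).
function buildSitemapIndex(baseUrl, partCount) {
  const entries = Array.from(
    { length: partCount },
    (_, i) => `  <sitemap><loc>${baseUrl}/sitemap-${i + 1}.xml</loc></sitemap>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}

const urls = Array.from({ length: 120000 }, (_, i) => `https://yourdomain.com/articles/post-${i}/`);
const parts = chunkUrls(urls);
console.log(parts.map((p) => p.length)); // part sizes
console.log(buildSitemapIndex("https://yourdomain.com", parts.length));
```

Splitting by section (articles, products, hubs) rather than by raw count makes the per-sitemap indexing numbers in Search Console more useful for spotting weak sections.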

5. Low site authority, Google caps daily crawl rate

If your site has a DR (Ahrefs Domain Rating, a third-party proxy for authority) below 20 or fewer than 10k monthly visits, the overall crawl budget is capped. Optimization still helps, but you're working within a small budget.

6. Server slow / unstable

Googlebot throttles itself on slow sites. If your server averages more than 2s per response to Googlebot or returns 5xx errors frequently, daily fetch volume drops sharply.
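
Your access logs can tell you whether you are over those thresholds. A minimal sketch, assuming the log lines are already parsed into objects with `userAgent`, `status`, and `responseMs` fields (those field names are hypothetical; map them from your own log format). It matches on the user-agent string only, which can be spoofed, so treat the numbers as approximate:

```javascript
// Average response time and 5xx share for requests claiming to be Googlebot.
function googlebotStats(entries) {
  const hits = entries.filter((e) => /Googlebot/i.test(e.userAgent));
  if (hits.length === 0) return { hits: 0, avgMs: 0, errorRate: 0 };
  const totalMs = hits.reduce((sum, e) => sum + e.responseMs, 0);
  const errors = hits.filter((e) => e.status >= 500).length;
  return {
    hits: hits.length,
    avgMs: totalMs / hits.length,
    errorRate: errors / hits.length,
  };
}

const sample = [
  { userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1)", status: 200, responseMs: 800 },
  { userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1)", status: 200, responseMs: 2600 },
  { userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1)", status: 503, responseMs: 3200 },
  { userAgent: "Mozilla/5.0 (Windows NT 10.0) Chrome/120", status: 200, responseMs: 300 },
];
const stats = googlebotStats(sample);
console.log(`avg ${stats.avgMs.toFixed(0)}ms, 5xx ${(stats.errorRate * 100).toFixed(0)}%`);
```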

Shortest path to fix

Step 1: Audit the sitemap

# List the URLs in the sitemap and print every one that does not return 200
curl -s https://yourdomain.com/sitemap.xml | grep -oE 'https://[^<]+' > all-sitemap.txt
while IFS= read -r url; do
  # -I sends a HEAD request; some servers answer HEAD differently from GET
  status=$(curl -sI -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then echo "$status $url"; fi
done < all-sitemap.txt

From the sitemap, remove:

  • 404 / 410 URLs and redirects (3xx)
  • noindex pages
  • Parameter URLs (utm / sort / filter)
  • Pages with < 100 words
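
The removal rules above can be applied as a filter in whatever generates the sitemap. A minimal sketch, assuming each page record carries `url`, `status`, `noindex`, and `words` fields (hypothetical names; your build pipeline will have its own):

```javascript
// Query parameters that should never appear in a sitemap URL.
const BLOCKED_PARAMS = ["sort", "filter"];

function hasBlockedParam(urlString) {
  const keys = [...new URL(urlString).searchParams.keys()];
  return keys.some((k) => k.startsWith("utm_") || BLOCKED_PARAMS.includes(k));
}

// Keep only live, indexable, parameter-free pages with at least 100 words.
function keepInSitemap(page) {
  return (
    page.status === 200 &&
    !page.noindex &&
    !hasBlockedParam(page.url) &&
    page.words >= 100
  );
}

const pages = [
  { url: "https://yourdomain.com/articles/foo/", status: 200, noindex: false, words: 900 },
  { url: "https://yourdomain.com/articles/foo/?utm_source=x", status: 200, noindex: false, words: 900 },
  { url: "https://yourdomain.com/articles/gone/", status: 410, noindex: false, words: 600 },
  { url: "https://yourdomain.com/articles/stub/", status: 200, noindex: false, words: 80 },
];
console.log(pages.filter(keepInSitemap).map((p) => p.url));
```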

Step 2: Identify the thinnest 20-30%, merge or noindex

Use the thin-page script from cause 1. Decision matrix:

Words      Action
< 100      410 delete
100-300    noindex,follow or merge into related hub
300-500    expand to 800+, otherwise noindex
500+       keep, check quality
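
The matrix can be folded into the audit script so every page gets a suggested action. A minimal sketch; since the ranges in the table touch at 300 and 500, it assumes the upper bound of each row is exclusive:

```javascript
// Map a word count to the action from the decision matrix.
// Assumption: upper bounds are exclusive (300 words falls in the 300-500 row).
function actionFor(words) {
  if (words < 100) return "410 delete";
  if (words < 300) return "noindex,follow or merge into related hub";
  if (words < 500) return "expand to 800+, otherwise noindex";
  return "keep, check quality";
}

for (const words of [40, 180, 420, 1200]) {
  console.log(`${words} words: ${actionFor(words)}`);
}
```

Word count is only the first cut. A 250-word page that answers a specific query well may deserve to stay, so review the borderline rows by hand.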

Step 3: Block faceted / parameter URLs via robots.txt or noindex

User-agent: *
# "?*" also matches when the parameter is not the first one in the query string
Disallow: /*?*utm_
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*color=
Disallow: /*?*size=
Disallow: /search
Disallow: /tag/

Sitemap: https://yourdomain.com/sitemap.xml

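This step can free up 50-80% of the crawl budget previously spent on faceted URLs. Pick one mechanism per URL pattern: a page blocked in robots.txt is never fetched, so Googlebot cannot see a noindex tag on it.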
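
Before deploying rules like these, it helps to test them against real URLs from your logs. A simplified sketch of Google-style pattern matching (`*` matches any run of characters, a trailing `$` anchors the end, everything else is a prefix match). It deliberately ignores `Allow` rules and longest-match precedence, so use it for a quick sanity check, not as a full robots.txt parser:

```javascript
// Convert one Disallow pattern into an equivalent regular expression.
function robotsPatternToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// True if any Disallow pattern matches the path plus query string.
function isDisallowed(pathAndQuery, disallowPatterns) {
  return disallowPatterns.some((p) => robotsPatternToRegExp(p).test(pathAndQuery));
}

const url = "/products?page=2&sort=price";
console.log(isDisallowed(url, ["/*?sort="]));  // false: sort is not the first parameter
console.log(isDisallowed(url, ["/*?*sort="])); // true: the extra * allows parameters before it
```

The two calls show why parameter position matters: a pattern ending in `?sort=` only catches URLs where `sort` comes directly after the question mark.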

Step 4: Flatten the structure

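Old: Home → /blog → /blog/page/15 → /blog/2023 → /blog/2023/04 → article (5 clicks)
New: Home → /articles index → article (2 clicks)

---
// src/pages/articles/index.astro
import { getCollection } from 'astro:content';

// Group posts by topic. Assumes each post's frontmatter has a `topic` field.
function groupByTopic(posts) {
  const groups = {};
  for (const post of posts) (groups[post.data.topic ?? 'misc'] ??= []).push(post);
  return groups;
}

const all = await getCollection('posts');
const grouped = groupByTopic(all);  // one group per hub / pillar page
---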
{Object.entries(grouped).map(([topic, posts]) => (
  <section>
    <h2><a href={`/topic/${topic}/`}>{topic}</a> ({posts.length})</h2>
    <ul>{posts.map(p => <li><a href={`/articles/${p.slug}/`}>{p.data.title}</a></li>)}</ul>
  </section>
))}

Keep important articles within 3 clicks of the home page. Add hub / pillar pages to concentrate authority.

Step 5: Build authority with backlinks

Site-wide authority is the crawl budget ceiling. Even 5-10 quality backlinks can produce a noticeable lift in crawl activity.

Focus on your top 20 highest-value articles; aim for 1-3 backlinks each:

  • Original content on Reddit / HN
  • Guest post on industry sites
  • awesome-list submissions
  • Data-driven reports as news angle

Step 6: Optimize server response time

# Speed test simulating Googlebot
curl -sL -A "Mozilla/5.0 (compatible; Googlebot/2.1)" \
  -w "%{time_total}\n" -o /dev/null https://yourdomain.com/articles/foo/

If the response takes more than 1500ms, add a CDN, add caching, and optimize SSR. Halving response time can roughly double crawl volume.

Step 7: Patience — 4-12 weeks

Measurable improvement typically takes 4-12 weeks:

  • 4 weeks: crawl budget reallocates, new article fetch rate rises
  • 8 weeks: thin pages deindexed, total indexed may dip then climb
  • 12 weeks: overall indexing rate up 20-40%

When this is not on you

For a brand-new 10k-page site, Google deliberately ramps crawl rate over months. Even with perfect technical setup, 100% indexing within weeks is unrealistic. Patience + sustained quality + backlinks = the only path.

Easy to misdiagnose

  • One-by-one URL Inspection doesn’t scale: with 10k pages at 10/day quota, you’d need 1000 days
  • Thinking that resubmitting the sitemap speeds things up: a sitemap only affects discovery, not budget allocation
  • Thinking <priority> 1.0 helps: Google ignores priority
  • Thinking more content “boosts” authority: bulk low-quality content is what SpamBrain is built to catch, and it pulls site-wide quality signals down

Prevention

  • Only publish a page if it provides something the existing 1000 pages don’t
  • Periodically run a content audit: keep top half, prune bottom half
  • Parameter / faceted URLs default to noindex or robots-blocked from day one
  • Sitemap always auto-generated; periodic health checks
  • Weekly Crawl Stats review to catch budget waste early
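
For the weekly review, a per-section breakdown of what Googlebot actually requested makes waste obvious. A minimal sketch, assuming you have a list of requested paths (with query strings) pulled from Googlebot entries in your access logs; the bucketing by first path segment is an illustrative choice, not a standard:

```javascript
// Bucket a requested path: any URL with a query string counts as a parameter URL,
// everything else is grouped by its first path segment.
function bucketOf(pathAndQuery) {
  if (pathAndQuery.includes("?")) return "parameter URLs";
  const firstSegment = pathAndQuery.split("/")[1];
  return firstSegment ? `/${firstSegment}` : "/";
}

// Share of Googlebot requests per bucket, largest first.
function crawlShare(paths) {
  const counts = new Map();
  for (const p of paths) {
    const bucket = bucketOf(p);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return [...counts]
    .map(([bucket, hits]) => ({ bucket, hits, share: hits / paths.length }))
    .sort((a, b) => b.hits - a.hits);
}

const requested = [
  "/products?color=red",
  "/products?color=red&size=M",
  "/tag/shoes/",
  "/articles/foo/",
  "/articles/bar/",
  "/products?sort=price",
];
for (const row of crawlShare(requested)) {
  console.log(`${(row.share * 100).toFixed(0)}%  ${row.bucket}`);
}
```

If parameter URLs or tag pages take a larger share than your articles, the robots.txt and sitemap fixes above are not yet working.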

FAQ

Q: How fast should a 10k-page site fully index? A: Realistically 3-9 months, often longer for new domains. 100% indexing is rare — 60-80% is the practical maximum for good sites.

Q: Does Google index sites at a fixed rate? A: No. Rate scales with site authority, content uniqueness, and historical crawl health, dynamically.

Q: Can I request higher crawl budget from Google? A: No direct request. The old Search Console crawl-rate setting could only lower the maximum crawl rate (to prevent overload), never raise it, and Google has since retired it. Lift comes from authority building.

Q: Does deleting half the pages really help the others? A: Yes. Frees crawl budget + lifts overall quality signal; 6-12 weeks later retained pages’ indexing rates and rankings both rise.

Tags: #SEO #Google #Search Console #Indexing #Troubleshooting #Large site #Crawl budget