Google Crawls My Homepage But Never the Article Pages

Googlebot crawls your homepage daily, but article pages stay at "Discovered — currently not indexed." Separate discovery failures from crawl-budget failures and fix the right one.

Published: May 19, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Search Console → Crawl Stats shows Googlebot hits your homepage dozens of times a day, but /articles/* combined gets a handful. New articles aren’t crawled for weeks. The sitemap was submitted long ago and doesn’t move the needle.

Fastest check (60 seconds): run curl -sL https://yourdomain.com/ | grep -o 'href="/articles/' | wc -l, which counts article links in the raw homepage HTML (grep -c would count matching lines, and undercounts on minified pages). If that number is near zero, Google cannot see your article links in the homepage HTML (a discovery failure), so fix that first. If the count is normal but the pages still sit at “Discovered — currently not indexed,” it’s a crawl-budget / quality failure, and the fix is different. Step 1 below makes the split precise.

Symptoms

Crawl Stats shows daily homepage hits but rare hits to /articles/*
New articles take 2-4 weeks to be discovered (well over the normal 3-7 days)
The sitemap is submitted but Google rarely fetches its URLs
URL Inspection reports article URLs as “Discovered — currently not indexed” or “URL is unknown to Google”

Which bucket are you in?

The two root causes need opposite fixes. This table tells you which one applies before you touch anything.

Signal	Discovery failure	Crawl-budget / quality failure
`curl` homepage → count `/articles/` links	0 or under 5	normal count
URL Inspection status	“URL is unknown to Google”	“Discovered — currently not indexed”
Is the URL in the sitemap and reachable by `<a href>`?	no	yes
What Google is telling you	“I never found this URL”	“I found it but chose not to spend budget yet”
Go to	Steps 2-3	Steps 4-8

“Discovered — currently not indexed” specifically means Google knows the URL (from your sitemap or a link) but has not crawled it yet. That is almost always a crawl-budget, queue, or perceived-quality decision rather than a technical block. (Google’s crawl-budget docs are aimed at sites with roughly 1M+ unique pages that change about weekly, or 10,000+ pages whose content changes daily, so on a small site this is usually a quality/internal-link signal, not a true budget cap.)

Common causes

1. Homepage uses JS-rendered article links Google can’t see

Most common. The “latest articles” list is mounted by React/Vue/Svelte after hydration, so the raw HTML Googlebot fetches contains no <a href> to any article.

How to confirm:

# Look at homepage WITHOUT executing JS (this is roughly what Googlebot fetches first)
curl -sL https://yourdomain.com/ > home.html

# Count article links in the raw HTML
grep -oE 'href="/articles/[^"]+"' home.html | wc -l
# 0 or very few = links are rendered by JS, not in the HTML

Google does render JS, but rendering is queued separately and can lag days to weeks behind the initial HTML crawl. Links present in the first HTML response are discovered immediately; links that only appear after hydration are not.
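The same raw-HTML count can be sketched in Python, which avoids depending on quote style or minification. This is only an illustrative sketch: the names `ArticleLinkCounter` and `count_article_links` are hypothetical, it assumes article links are relative hrefs starting with /articles/, and it parses HTML you have already saved (such as home.html) without executing any JavaScript.

```python
from html.parser import HTMLParser


class ArticleLinkCounter(HTMLParser):
    """Collects href values of <a> tags that start with a given path prefix."""

    def __init__(self, prefix="/articles/"):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(self.prefix):
            self.links.append(href)


def count_article_links(html, prefix="/articles/"):
    """Return the number of distinct article links present in raw HTML."""
    parser = ArticleLinkCounter(prefix)
    parser.feed(html)
    return len(set(parser.links))
```

Calling `count_article_links(open("home.html").read())` on a client-rendered homepage would typically return 0, since the links only exist after hydration.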

2. Homepage shows only 5-10 latest; older articles are 3+ clicks deep

Homepage (5 latest) → /blog (pagination page 1) → /blog/page/2 → article

3+ clicks from the homepage = Google treats the page as low priority and crawls it rarely.
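Click depth is just the shortest path in your internal-link graph, so it can be measured with a breadth-first search. The sketch below assumes you already have a mapping of page to outgoing internal links (for example from your own crawler); the `click_depths` helper and the sample graph are hypothetical.

```python
from collections import deque


def click_depths(links, start="/"):
    """Breadth-first search: minimum number of clicks from start to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph mirroring the path above.
site = {
    "/": ["/blog"],
    "/blog": ["/blog/page/2"],
    "/blog/page/2": ["/articles/old-post/"],
}
# Pages that sit three or more clicks from the homepage.
deep = {url: d for url, d in click_depths(site).items() if d >= 3}
```

Any URL that ends up in `deep` is a candidate for a direct link from an index or hub page.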

3. Article URLs missing from the sitemap, or the sitemap exceeds the limit

# Count URLs in the sitemap
curl -s https://yourdomain.com/sitemap.xml | grep -c "<loc>"
# Should roughly equal your actual article count

Hand-maintained sitemaps routinely miss new articles. Per the sitemaps.org protocol, a single sitemap file may contain at most 50,000 URLs and must be no larger than 50MB uncompressed; past either limit Google rejects the file and you must split into a sitemap index.
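A quick way to catch missing URLs is to diff the sitemap against the list of articles you expect. The sketch below is an assumption-laden illustration: `audit_sitemap` is a hypothetical helper, it expects the sitemap text to use the standard sitemaps.org namespace, and you supply the expected URL list yourself (for example from your CMS).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB uncompressed


def audit_sitemap(xml_text, expected_urls):
    """Compare the URLs in a sitemap against the URLs you expect to be listed."""
    root = ET.fromstring(xml_text)
    listed = {loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text}
    return {
        "url_count": len(listed),
        "missing": sorted(set(expected_urls) - listed),
        "over_url_limit": len(listed) > MAX_URLS,
        "over_size_limit": len(xml_text.encode("utf-8")) > MAX_BYTES,
    }
```

A non-empty `missing` list points to articles that never made it into the sitemap.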

4. Article pages too thin — Google deprioritizes the whole pattern

If Googlebot’s first crawls of /articles/* find pages that are under ~300 words or template-shaped, it learns “this URL pattern isn’t worth re-visiting” and drops the priority of the entire directory, not just the thin page.

5. Internal-link anchors too generic (“read more”)

<a href="/articles/foo/">read more</a>
<a href="/articles/bar/">view post</a>

Generic anchors are weak discovery signals. Google sees that a link exists but learns nothing about the topic it points to, so the target gets little ranking or crawl-priority benefit.
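Generic anchors are easy to find mechanically. The following sketch scans saved HTML for article links whose visible text is one of a few generic phrases; the phrase list, class name, and function name are all illustrative assumptions, so extend the set to match your own templates.

```python
from html.parser import HTMLParser

# Phrases treated as generic; adjust for your site.
GENERIC_ANCHORS = {"read more", "view post", "more", "read"}


class AnchorTextCollector(HTMLParser):
    """Records (href, visible text) for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so "read\n more" still matches.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def generic_article_anchors(html, prefix="/articles/"):
    """Return article links whose anchor text is a generic phrase."""
    collector = AnchorTextCollector()
    collector.feed(html)
    return [
        (href, text)
        for href, text in collector.anchors
        if href.startswith(prefix) and text.lower() in GENERIC_ANCHORS
    ]
```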

6. Article URLs blocked in robots.txt

User-agent: *
Disallow: /articles/draft/

A too-broad rule (e.g. Disallow: /a) silently blocks /articles/, because robots.txt rules match by path prefix. Confirm with URL Inspection → “Page indexing”; it will say “Blocked by robots.txt” if so. Note: noindex is not a crawl-saving alternative, since Google must still fetch the page to see the tag. To stop crawling entirely, use robots.txt, and only for pages you truly never want crawled.
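You can also test a robots.txt file locally before deploying it. The sketch below uses Python's standard `urllib.robotparser`; the `blocked_for_googlebot` helper is hypothetical, and the standard-library parser does plain prefix matching, so it may not mirror Google's handling of `*` and `$` wildcards exactly.

```python
from urllib.robotparser import RobotFileParser


def blocked_for_googlebot(robots_txt, urls):
    """Return the URLs that the given robots.txt text disallows for Googlebot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch("Googlebot", url)]


rules = "User-agent: *\nDisallow: /a\n"
urls = [
    "https://yourdomain.com/articles/foo/",
    "https://yourdomain.com/blog/",
]
# The too-broad /a prefix catches /articles/ but not /blog/.
blocked = blocked_for_googlebot(rules, urls)
```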

7. Server slow or 5xx-ing for Googlebot

Google’s crawl-capacity limit rises when your server responds fast and reliably, and drops when it slows down or returns errors. Frequent timeouts or 5xx make Google self-throttle the whole site.

How to confirm: Crawl Stats → “Crawl responses” → look at average response time and the share of 5xx / timeout responses. A rising response time or a non-trivial error share is the signal to act (Google doesn’t publish a hard millisecond threshold).
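If you prefer to track these two numbers from your own server logs, a rough calculation could look like the sketch below. It assumes you have already extracted Googlebot requests as (status code, response time in ms) pairs, with a status of None standing in for a timeout; the function name and input shape are hypothetical.

```python
def summarize_crawl_responses(responses):
    """Summarize (status_code, response_ms) pairs; status None means a timeout."""
    total = len(responses)
    if total == 0:
        return {"avg_ms": 0.0, "error_share": 0.0}
    # Count server errors and timeouts together as failures.
    failures = sum(1 for status, _ in responses if status is None or status >= 500)
    # Average only over requests that actually returned a response.
    timed = [ms for status, ms in responses if status is not None]
    avg_ms = sum(timed) / len(timed) if timed else 0.0
    return {"avg_ms": avg_ms, "error_share": failures / total}
```

Comparing the result week over week gives the trend that matters here, since no fixed threshold is published.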

Shortest path to fix

Step 1: Distinguish discovery vs. crawl-budget failure

# Fetch the homepage without executing JS and count article links in the raw HTML
curl -sL https://yourdomain.com/ | grep -o 'href="/articles/' | wc -l

0 or under 5 → discovery failure → Steps 2-3
Normal count → crawl-budget / quality failure → Steps 4-8

Cross-check one specific article in URL Inspection: “URL is unknown to Google” confirms discovery failure; “Discovered — currently not indexed” confirms the budget/quality bucket.
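The Step 1 decision can be written down as a tiny function, which is handy if you check many sites. This is a sketch of the rule as described above, not an official classification; `diagnose` and its return values are hypothetical, and the threshold of 5 links comes from the table earlier in this article.

```python
def diagnose(article_link_count, inspection_status):
    """Map the two Step 1 signals to a bucket and the steps to follow."""
    status = inspection_status.strip().lower()
    if article_link_count < 5 or status == "url is unknown to google":
        return ("discovery failure", "Steps 2-3")
    if status.startswith("discovered"):
        return ("crawl-budget / quality failure", "Steps 4-8")
    # Any other combination needs a manual look.
    return ("inconclusive", "re-check both signals")
```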

Step 2: Render listing components on the server (SSR / SSG)

In Next.js:

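// Wrong: client-side fetch in useEffect (links are absent from the first HTML response)
import { useEffect, useState } from 'react';

function LatestPosts() {
  const [posts, setPosts] = useState([]);
  useEffect(() => {
    fetch('/api/posts').then(r => r.json()).then(setPosts);
  }, []);
  return posts.map(p => <a key={p.url} href={p.url}>{p.title}</a>);
}

// Right: getStaticProps (Pages Router) runs at build time, so links are baked into the HTML
export async function getStaticProps() {
  const posts = await getAllPosts(); // your own data-loading helper
  return { props: { posts } };
}

export default function Home({ posts }) {
  return posts.map(p => <a key={p.url} href={p.url}>{p.title}</a>);
}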

Astro is SSG by default, so its <a href> links are already in the HTML. A React-only SPA can pre-render with react-snap, Prerender.io, or by moving to a framework that ships HTML. After deploying, re-run the Step 1 link count; it should jump.

Step 3: Add an `/articles` index that lists every article, and link it from the nav

---
// src/pages/articles/index.astro
import { getCollection } from 'astro:content';
const posts = await getCollection('posts');
const sorted = posts.sort((a, b) => b.data.publishedAt - a.data.publishedAt);
---
<h1>All articles ({sorted.length})</h1>
<ul>
  {sorted.map(p => (
    <li>
      <a href={`/articles/${p.slug}/`}>{p.data.title}</a>
      <span>{p.data.publishedAt.toLocaleDateString()}</span>
    </li>
  ))}
</ul>

Link /articles/ from both the homepage and the main nav so every article is at most two clicks from anywhere on the site.

Step 4: Fix the sitemap

Every article URL needs a real lastmod:

<url>
  <loc>https://yourdomain.com/articles/foo/</loc>
  <lastmod>2026-05-21</lastmod>
</url>

lastmod must be the genuine modification time. Stamping every URL with today’s date doesn’t speed anything up — Google treats an unreliable lastmod as noise and starts ignoring the field for your whole site.
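One way to keep lastmod honest is to generate the sitemap from the modification dates you already store per article. The sketch below assumes you can produce (URL, date) pairs from your CMS or from git history; `build_sitemap` is a hypothetical helper, not part of any framework.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a <urlset> document from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=NS)
    for url, modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # Use the stored modification date, never "today".
        ET.SubElement(node, "lastmod").text = modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")


example = build_sitemap([("https://yourdomain.com/articles/foo/", date(2026, 5, 21))])
```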

Past 50,000 URLs or 50MB, split into a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://yourdomain.com/sitemap-articles.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
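Splitting by URL count is simple chunking. The sketch below groups a URL list into files of at most 50,000 entries; the `split_into_sitemaps` helper and the `sitemap-articles-N.xml` naming are assumptions for illustration, and it does not check the 50MB size limit, which you would still need to verify per file.

```python
def split_into_sitemaps(urls, base_url, max_urls=50_000):
    """Group URLs into sitemap-sized chunks and name a file for each chunk."""
    chunks = [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
    return {
        f"{base_url}/sitemap-articles-{n}.xml": chunk
        for n, chunk in enumerate(chunks, start=1)
    }
```

The keys of the returned dict are the `<loc>` values to list in the sitemap index, and each value is the URL list for that child sitemap.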

Step 5: Force-crawl 5-10 priority articles via URL Inspection

This is the fastest way to confirm the pages themselves aren’t blocked, and it kick-starts indexing while your structural fixes propagate.

Search Console → URL Inspection → paste the article URL.
Click “Test Live URL” — confirm “URL is available to Google” and that the rendered HTML contains your content (not an empty shell).
Click “Request Indexing.”

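As of June 2026 the button is rate-limited to roughly 10-12 URLs per day per property, then greys out for 24 hours, so spend it on your highest-value pages. There is no supported bulk equivalent: the URL Inspection API only reports a URL’s index status and cannot request indexing, and Google’s Indexing API is officially limited to job-posting and livestream pages. For volume, rely on the sitemap and internal links. (Don’t bother with IndexNow for Google: as of 2026 Google still does not consume IndexNow; engines such as Bing and Yandex do.)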

Step 6: Audit the 10 most recent articles’ quality baseline

Each should have:

Exactly one real <h1>
An intro paragraph (80+ words)
A 600+ word body
2+ internal links from other articles pointing to it
1+ image with alt text

Pages below this bar should be improved before publishing — otherwise they drag the whole /articles/* priority down.
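Part of this checklist can be automated against the rendered HTML of each article. The sketch below covers only the <h1>, word-count, and image-alt items; the class and function names are hypothetical, the 600-word default comes from the list above, and the word count includes everything visible on the page (navigation, footer), so treat it as a rough upper bound.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Counts <h1> tags, visible words, and images that carry alt text."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.images_with_alt = 0
        self.words = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "img" and (dict(attrs).get("alt") or "").strip():
            self.images_with_alt += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text inside <script> and <style>.
        if not self._skip_depth:
            self.words += len(data.split())


def audit_article(html, min_words=600):
    """Check one article's HTML against part of the baseline listed above."""
    page = PageAudit()
    page.feed(html)
    return {
        "one_h1": page.h1_count == 1,
        "enough_words": page.words >= min_words,
        "has_image_with_alt": page.images_with_alt >= 1,
    }
```

Internal-link counts are not covered here because they require data from other pages, not just the article itself.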

Step 7: Fix internal-link anchor text

<!-- Bad -->
<a href="/articles/foo/">read more</a>

<!-- Good -->
<a href="/articles/foo/">Astro Deploy to Vercel: Complete Guide</a>

Topic-bearing anchors are a strong discovery and ranking signal. Add 2-3 contextual links from already-indexed, high-traffic pages to each new article within a day of publishing — that’s the single biggest lever for getting “Discovered” pages actually crawled.

Step 8: Fix server response time for Googlebot

# Time a page request with a Googlebot user-agent
curl -sL -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  -w "%{time_total}\n" -o /dev/null https://yourdomain.com/articles/foo/
# Rule of thumb (not a Google-published threshold): aim for well under 1.5s

Slow → add a CDN, cache statically, and cut server-render time. Faster, more reliable responses raise Google’s crawl-capacity limit for the whole site.

How to confirm it’s fixed

Re-run curl -sL https://yourdomain.com/ | grep -o 'href="/articles/' | wc -l. The count should now match the number of article links your homepage is meant to show, not zero.
In URL Inspection, a previously-stuck article should move off “Discovered — currently not indexed,” first showing as crawled and then as “URL is on Google” (allow days to weeks for the queue).
Crawl Stats → “By response → OK (200)” should show rising hits to /articles/* over the following 1-2 weeks. The directory-level trend, not any single day, is what confirms recovery.

When this isn’t on you

On very large sites Google deliberately throttles per-directory crawl rate. That isn’t a bug — it’s Google deciding how much budget your site has earned. The lever there is authority: earn backlinks and lift overall site quality so Google allocates more crawl budget over time.

Easy to misdiagnose

“<priority> in the sitemap helps.” It doesn’t: Google ignores the field.
“Resubmitting the sitemap repeatedly helps.” Google already has the sitemap; resubmission doesn’t re-prioritize crawling.
“Adding keywords to the homepage helps.” Keyword stuffing triggers spam signals instead.
“Pinging the sitemap speeds it up.” Google fully deprecated the sitemap ping endpoint in 2023; the old /ping?sitemap= URL now returns 404.

Prevention

Never rely solely on JavaScript-rendered links for important pages — always emit SSR / SSG <a href> in the HTML.
Keep every article at most two clicks from anywhere (Homepage → /articles index → article).
Build internal-link clusters around topics, with hub pages concentrating links.
Make every internal-link anchor a topic phrase — never “more” / “read.”
Watch Crawl Stats → response time and error rate; act when either trends up.

FAQ

Q: Does pinging the sitemap help anymore? A: No. Google deprecated the sitemap ping endpoint in 2023 and the old endpoint now returns 404. The strongest discovery signal is whether the URL is directly reachable via <a href> from authoritative pages on your own site.

Q: My article shows “Discovered — currently not indexed.” What’s the single most effective fix? A: Add 2-3 contextual internal links to it from already-indexed, high-traffic pages, then use URL Inspection → “Request Indexing.” That combination addresses both the priority signal and the crawl queue.

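Q: How many URLs can I force-crawl per day? A: As of June 2026, URL Inspection → “Request Indexing” allows roughly 10-12 URLs per day per property before greying out for 24 hours. There is no supported bulk alternative: the URL Inspection API only reports index status and cannot request indexing, so use the sitemap and internal links for higher volume.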

Q: Should I switch from XML to a JSON sitemap? A: No — sitemap format isn’t the bottleneck. Discovery (internal links) and quality are.

Q: Can Cloudflare caching improve Googlebot response time? A: Yes. For statically-cacheable articles, a cache hit can cut response time from hundreds of milliseconds to tens, which raises Google’s crawl-capacity limit for the site.

Tags: #SEO #Google #Search Console #Indexing #Troubleshooting #Crawl budget