Search Console Sees Many Low-Value URLs

"Crawled — currently not indexed" balloons to thousands of `?sort=`, `?page=`, tag-combo URLs. Triage by source, block via robots.txt / canonical / noindex, and reclaim crawl budget — without the common mistake that hides your noindex tag.

Published: May 19, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Search Console’s Pages report shows “Crawled — currently not indexed” at 3,400 URLs. You scroll and it’s all noise: ?sort=asc, ?sort=desc, ?page=12, ?page=13, ?tag=python&category=tutorials, /2024/03/, /author/jane/, /wp-content/uploads/.... Real articles are taking longer to get indexed because Googlebot is busy re-fetching parameter combinations.

Fastest fix: export the report, group the URLs by pattern, and block the biggest pattern at the source. For sort/filter/facet parameters and internal-search results, that means a robots.txt Disallow (stops the crawl before budget is spent). For thin archive and pagination pages you still want crawled-through, use noindex instead — and never apply both to the same URL (more on that below).

One reality check before you start: crawl budget is only a real constraint on large or fast-changing sites. Per Google’s own guidance, most sites under ~10,000 URLs are crawled efficiently and don’t need to worry about it. It becomes a genuine problem when you have a moderately large site (10,000+ URLs) that changes daily, or a very large site (1,000,000+ URLs). If your site is small and articles eventually index fine, “Crawled — not indexed” on junk URLs is cosmetic, not urgent. Tidy it for hygiene, but don’t panic.

That said, crawlable-but-valueless URLs still do two bad things on any site: they waste crawl budget (slowing real-article indexing on large sites) and they signal poor site hygiene (Google sees a sprawling, unstructured site). Below: how to categorize the noise, block it by source with the right tool, and confirm the fix worked.

Common causes

Ordered by hit rate, highest first.

1. URL parameters expose every sort / filter combination

?sort=asc, ?sort=desc&filter=x, ?sort=desc&filter=x&page=2 — every combination is a unique URL. A list with 3 filters, 2 sort options, and 10 pages = 60 URL combinations from one list page.

How to spot it: the report has many URLs with ? query strings. Group by parameter name; the offenders are the high-multiplicity parameters (sort, filter, order, view).
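As a rough illustration of grouping by parameter name, the sketch below counts how many exported URLs carry each query parameter. The `example.com` URLs and the function name `count_param_names` are made up for the example; feed it whatever list your export produces.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def count_param_names(urls):
    """Count how many URLs carry each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so "?sort=" still counts as a use of "sort".
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample; replace with the URLs from your own export.
sample = [
    "https://example.com/category/ai/?sort=asc",
    "https://example.com/category/ai/?sort=desc&filter=x",
    "https://example.com/category/ai/?sort=desc&filter=x&page=2",
    "https://example.com/category/ai/",
]

for name, n in count_param_names(sample).most_common():
    print(name, n)
```

The parameters at the top of that list are the ones worth handling first.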

Note: Search Console’s old URL Parameters tool that used to handle this was removed on April 26, 2022. You can no longer tell Google to ignore a parameter from inside GSC — control now lives entirely in robots.txt, canonical tags, and noindex.

2. Pagination beyond page 5-10 has no substantive content

/category/ai/page/12/ shows articles 121-130 — content nobody searches for directly. Google crawls but doesn’t index it because the page itself is thin.

How to spot it: the report lists many /page/N/ URLs with N greater than 5. Deep pagination usually accounts for a meaningful chunk of the noise.

3. Faceted navigation creates a URL combinatorial explosion

/products/?color=blue&size=large&brand=acme-style URLs. Every facet combination is a URL. 10 colors, 5 sizes, 20 brands = 1,000 URL combinations from one page. Google calls this out explicitly: crawling faceted URLs “tends to cost sites large amounts of computing resources” and can slow discovery of new content.

How to spot it: URLs with multiple stacked params, especially e-commerce facets. Combinatorial = many distinct URLs from a few real intents.
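The arithmetic behind the facet explosion is worth seeing once. The sketch below uses the option counts from the example (10 colors, 5 sizes, 20 brands) and assumes, as an illustration only, that a visitor may also leave any facet unset; parameter ordering and pagination would multiply the figure further.

```python
from math import prod

# Option counts from the example: colors, sizes, brands.
facets = {"color": 10, "size": 5, "brand": 20}

# Every facet set to exactly one value: 10 * 5 * 20.
all_selected = prod(facets.values())

# Assumption: each facet may also be left unset. Subtract the single
# URL where no facet is set at all (that is the plain listing page).
any_subset = prod(n + 1 for n in facets.values()) - 1

print(all_selected)  # 1000
print(any_subset)    # 1385
```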

4. Auto-generated archive pages (year / month / author) are publicly crawlable

/2024/, /2024/03/, /2024/03/15/, /author/jane/ — your CMS auto-generates these. Each is a separate URL but most have nothing Google would rank.

How to spot it: patterns like /YYYY/, /YYYY/MM/, /author/, /tag/. Date and author archives are usually thin.

5. Internal search results indexed accidentally

/search?q=ai+tools — your site search produced a results page that got crawled. Now Google has every search anyone ever ran. Google has called indexable internal search results a classic crawl-budget trap for years.

How to spot it: URLs starting with /search?, /?s=, or /?q=. Internal search results should never be indexable.

6. WordPress / CMS attachment pages

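?attachment_id=42 and similar URLs are attachment pages that some CMSes auto-generate: a near-empty HTML page wrapped around a single uploaded file. They have no real content; the image itself lives at a different URL, typically under /wp-content/uploads/ (for example /wp-content/uploads/image.png).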

How to spot it: URLs containing /wp-content/, attachment_id=, or similar CMS-specific attachment patterns.

Pick the right tool first

Before touching anything, get one rule straight, because mixing these up is the single most common mistake on this problem:

Tool	What it does	Crawl budget	Use it for
`robots.txt` `Disallow`	Blocks the crawl entirely — Googlebot never fetches the URL	Saves it (no fetch)	Params, facets, internal search, attachments you never want crawled
`rel="canonical"`	Asks Google to consolidate signals onto one URL; it still crawls the variants	Spends it (still fetched)	Sort/view permutations of a page you DO want indexed
`noindex` (meta or `X-Robots-Tag`)	Keeps the page out of the index; Google must crawl it to read the tag	Spends it (must fetch)	Thin pages you want kept out but still crawled-through (some archives)

The trap: if you Disallow a URL in robots.txt, Googlebot can’t fetch it, so it will never see a noindex tag or a rel="canonical" on that page. Blocking and tagging the same URL cancel each other out — and a blocked URL can still show up in search (without a snippet) if other pages link to it. So: either block it in robots.txt, or tag it with noindex/canonical — never both on the same URL. Google’s own faceted-navigation docs and the broader SEO consensus are firm on this. See noindex vs robots.txt for the full distinction.
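One way to catch this trap in an audit is to flag every URL that is both blocked and tagged. The sketch below assumes you have already collected, per URL, whether a Disallow rule matches it and whether the page carries a noindex or canonical; the `UrlAudit` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UrlAudit:
    url: str
    disallowed: bool      # matched by a robots.txt Disallow rule
    has_noindex: bool     # meta robots or X-Robots-Tag carries noindex
    has_canonical: bool   # page declares rel="canonical"


def hidden_tag_conflicts(audits):
    """Return URLs whose noindex/canonical can't be read because the crawl is blocked."""
    return [
        a.url
        for a in audits
        if a.disallowed and (a.has_noindex or a.has_canonical)
    ]


# Hypothetical audit rows.
audits = [
    UrlAudit("/search?q=ai", disallowed=True, has_noindex=True, has_canonical=False),
    UrlAudit("/2024/03/", disallowed=False, has_noindex=True, has_canonical=False),
    UrlAudit("/category/ai/?sort=asc", disallowed=True, has_noindex=False, has_canonical=False),
]

print(hidden_tag_conflicts(audits))
```

Any URL this returns needs a decision: drop the Disallow, or accept that the tag is invisible to Google.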

Shortest path to fix

Ordered by ROI. Step 1 categorizes; Steps 2-6 fix by source.

Step 1: Export and categorize the noise

Search Console → Pages → "Crawled — currently not indexed" → Export

Group URLs by pattern. A spreadsheet pivot on the part after ? (or on the path prefix) does this in minutes:

Category	Example pattern	Count
Parameter sort/filter	`?sort=`, `?filter=`	1200
Pagination beyond page 5	`/page/N` where `N > 5`	600
Facets	`?color=&size=&brand=*`	800
Archive pages	`/YYYY/`, `/author/*`	300
Internal search	`/search?q=*`	200
Attachments	`/wp-content/uploads/*`	300

The biggest categories are where to apply fixes first.
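If a spreadsheet pivot is awkward, the same grouping can be sketched in a few lines of Python. The category labels mirror the table above; the regular expressions, the facet parameter names, and the "deeper than page 5" threshold are assumptions to adapt to your own URL scheme.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# (label, regex) pairs checked in order; the first match wins.
CATEGORIES = [
    ("Internal search", re.compile(r"^/search\?|[?&](?:s|q)=")),
    ("Attachments", re.compile(r"^/wp-content/uploads/|[?&]attachment_id=")),
    ("Facets", re.compile(r"[?&](?:color|size|brand)=")),
    ("Parameter sort/filter", re.compile(r"[?&](?:sort|filter)=")),
    ("Pagination beyond page 5", re.compile(r"/page/(?:[6-9]|\d{2,})/?(?:\?|$)")),
    ("Archive pages", re.compile(r"^/(?:\d{4}(?:/\d{2}){0,2}|author/[^/]+)/?$")),
]


def categorize(url):
    """Return the first category whose pattern matches the URL's path and query."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    for label, pattern in CATEGORIES:
        if pattern.search(target):
            return label
    return "Other"


def summarize(urls):
    """Count URLs per category, largest category first."""
    return Counter(categorize(u) for u in urls).most_common()
```

Calling `summarize()` on the exported URL column gives the counts for the table; anything landing in "Other" is a prompt to add a pattern or to look at the URL by hand.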

Step 2: For sort / filter parameters — robots.txt (or canonical, not both)

If you never want these crawled (the usual case), block them in robots.txt:

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*sort=
Disallow: /*filter=

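The /*?sort= form catches the param when it’s first in the string; /*sort= catches it anywhere (e.g. ?page=2&sort=asc), which makes the first form redundant but harmless. Be aware that /*sort= also matches any parameter whose name merely ends in "sort" (e.g. ?resort=alps); pair /*?sort= with /*&sort= instead if that is a risk on your site. After deploying, check the robots.txt report under Settings in Search Console to confirm Google fetched the new file without errors, and run a few sample URLs through URL Inspection before relying on the patterns.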
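Python's standard `urllib.robotparser` does not understand the `*` and `$` wildcards, so a quick local check needs a small matcher of its own. The sketch below is a simplified approximation of how a single pattern is matched (prefix match from the start of the path, `*` as any run of characters, a trailing `$` as end-of-URL). It ignores Allow/Disallow precedence and user-agent groups, and the helper names are made up.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate one robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with ".*" for each "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_matched(pattern, path_and_query):
    """True if the pattern would apply to this path plus query string."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


print(is_matched("/*?sort=", "/category/ai/?sort=asc"))         # True
print(is_matched("/*?sort=", "/category/ai/?page=2&sort=asc"))  # False
print(is_matched("/*sort=", "/category/ai/?page=2&sort=asc"))   # True
print(is_matched("/*sort=", "/hotels/?resort=alps"))            # True (over-match)
print(is_matched("/*&sort=", "/hotels/?resort=alps"))           # False
```

Running real URLs from the export through `is_matched` shows both what a rule catches and what it catches by accident.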

If instead you want the clean version of these pages indexed and the variants merely consolidated, do NOT block them. Let Google crawl them, and have every variant declare the clean URL as its canonical (on the clean page itself, the same tag is self-referencing):

<link rel="canonical" href="https://yoursite.com/category/ai/" />

Then ?sort=asc and ?sort=desc&filter=x consolidate onto the canonical. Google’s docs note canonical “may, over time, decrease the crawl volume” of the variants — it’s slower and softer than a robots.txt block, so reserve it for URLs you actually want crawled.
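In a template, the canonical href is usually derived from the requested URL by dropping the presentation-only parameters. A minimal sketch, assuming `sort`, `filter`, `order`, and `view` are the parameters to strip (the set and the function name are illustrative, not a standard):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameters that only reorder or restyle the same list.
PRESENTATION_PARAMS = {"sort", "filter", "order", "view"}


def canonical_url(url):
    """Drop presentation-only parameters and the fragment; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://yoursite.com/category/ai/?sort=desc&filter=x"))
# https://yoursite.com/category/ai/
```

Parameters outside the set (such as `page`) are kept, so a paginated variant still canonicalizes to its own page rather than to page 1.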

Step 3: For facets — block by default, curate the winners

This is the standard e-commerce pattern. Follow Google’s official faceted-navigation guidance: disallow the facet params, allow one canonical “view all” exception.

User-agent: Googlebot
Disallow: /*?*color=
Disallow: /*?*size=
Disallow: /*?*brand=
Allow: /*?products=all$

Then promote only the high-intent combinations to real, indexable landing pages with their own H1, body copy, and self-canonical:

Default:  every facet combination = blocked in robots.txt (no crawl)
Curated:  /products/blue-running-shoes/ (real page, body content) = indexable

Case studies of exactly this approach (block the junk facets, self-canonical the clean pages, real pages for the few high-demand combos) consistently report a meaningful drop in wasted crawl and a long-tail traffic lift over roughly one to two months — directional outcomes, not a guaranteed number for any given site.

Step 4: For deep pagination — keep it crawlable, don’t blanket-noindex

Older advice said “noindex pages 3 and beyond.” Be careful: Google has stated that a page kept noindex long-term is eventually treated as nofollow too, so a noindex paginated page can stop passing crawl signals to the articles it links. As of June 2026 the safer pattern is:

Leave paginated pages indexable with a self-referencing canonical (each /page/N/ canonicals to itself, NOT to page 1).
Make sure every article is also reachable from a sitemap or category index, so no article depends on a paginated page being crawled.
Only consider noindex on very deep pages if you have a guaranteed alternate crawl path — and accept it’s a temporary measure.
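The second point, that no article should depend on a paginated page being crawled, can be checked mechanically. The sketch below compares a list of article URLs against the `<loc>` entries of a standard sitemap file. The sample XML and URLs are invented, and a sitemap index file (one that lists other sitemaps) would need an extra loop that this sketch leaves out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(article_urls, xml_text):
    """Return article URLs that the sitemap does not list."""
    return sorted(set(article_urls) - sitemap_urls(xml_text))


# Hypothetical sitemap content.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/articles/a/</loc></url>
  <url><loc>https://example.com/articles/b/</loc></url>
</urlset>"""

articles = ["https://example.com/articles/a/", "https://example.com/articles/c/"]
print(missing_from_sitemap(articles, SAMPLE))
```

Any URL it prints is reachable only through pagination, so fix the sitemap before touching the paginated pages.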

For short categories, the cleanest fix is to remove pagination entirely with a single “view all” page. Note Google no longer uses rel="next" / rel="prev" for indexing (it deprecated that signal), though Bing still honors it, so keeping the markup does no harm. See the pagination noindex/follow trap for the full breakdown.

Step 5: For archive pages — noindex if thin, leave them crawlable

<!-- /2024/, /author/jane/, /tag/python/ -->
<meta name="robots" content="noindex, follow" />

Use noindex here (not a robots.txt block) because you usually still want Google to crawl through these pages to reach the articles they list. If your archive pages carry real editorial content (a written intro, hand-picked curation), keep them indexable instead. If they’re auto-generated lists, noindex them. For the specific tag-page question, see should tag pages be noindex and should category pages be indexed.
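Where editing templates is impractical, the same directive can be sent as an `X-Robots-Tag` response header. The sketch below only decides what the header value should be for a given path; wiring it into your server or CMS is left out. The archive patterns and the `has_editorial_content` flag are assumptions for illustration.

```python
import re

# Assumption: date, author, and tag archives follow these path shapes.
THIN_ARCHIVE = re.compile(
    r"^/(?:\d{4}(?:/\d{2}){0,2}|author/[^/]+|tag/[^/]+)/?$"
)


def robots_header(path, has_editorial_content=False):
    """Return an X-Robots-Tag value for the path, or None to leave it indexable."""
    if THIN_ARCHIVE.match(path) and not has_editorial_content:
        return "noindex, follow"
    return None


print(robots_header("/2024/03/"))
print(robots_header("/tag/python/", has_editorial_content=True))
```

Returning `None` for curated archives keeps the editorial exception explicit instead of burying it in a pattern.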

Step 6: For internal search and attachments — block entirely

These should never be crawled at all, so robots.txt is the right tool:

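# robots.txt
User-agent: *
# Internal search results
Disallow: /search
Disallow: /*?s=
Disallow: /*?q=
# Attachments. Note: blocking /wp-content/uploads/ also stops Google from
# crawling the image files themselves; drop that line if image search matters.
Disallow: /wp-content/uploads/
Disallow: /*?attachment_id=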

On WordPress, also turn off attachment pages (Yoast SEO → Settings → “Media pages” → redirect to the file) and confirm “Discourage search engines” is OFF on your live site (Settings → Reading) — that toggle blocks your whole site, not just search results. See internal search page indexing for the WordPress-specific steps.

How to confirm it’s fixed

The report won’t clear overnight; Google re-crawls on its own schedule. Verify in this order:

Test the rule today. In Search Console → Settings → robots.txt report, confirm Google has fetched your updated file without errors. Then run a sample blocked URL through the URL Inspection tool: “Crawl allowed? No: blocked by robots.txt” confirms the rule matches. For noindex, URL Inspection showing “Indexing allowed? No: ‘noindex’ detected” confirms Google can read the tag (which means you did NOT also block it in robots.txt).
Watch the count over 2-6 weeks. Open Search Console → Pages → “Crawled — currently not indexed” and track the total. Blocked URLs drop out of “Crawled — not indexed” and move toward “Blocked by robots.txt” / “Excluded by ‘noindex’ tag” — that migration is the signal it’s working.
Track the upside, not just the cleanup. The real win is faster indexing of new articles. In Search Console → Pages, watch the “Indexed” count and the time-to-index on your newest posts; rising impressions on fresh content is the proxy that crawl budget was reclaimed.
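Before waiting on Search Console, it can help to confirm locally that a page really emits the directive. The sketch below checks already-fetched HTML, plus an optional `X-Robots-Tag` header value, for `noindex`. It is a pre-check only: it says nothing about what Google has crawled, and it does not handle user-agent-prefixed header values such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in {"robots", "googlebot"}:
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html, x_robots_tag=""):
    """True if the HTML or the header value carries a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header = {d.strip().lower() for d in x_robots_tag.split(",") if d.strip()}
    return "noindex" in parser.directives or "noindex" in header


print(has_noindex('<meta name="robots" content="noindex, follow" />'))  # True
```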

Don’t use “Remove URLs” for this — that tool only hides URLs from results for ~6 months; it does not stop crawling or fix the underlying budget waste.

Prevention

Plan URL structure to minimize parameter / facet combinations. Clean paths beat parameter explosions.
Set new param patterns to blocked-by-default in robots.txt; promote only the few worth indexing to real pages.
Make internal search, attachments, and admin/preview URLs Disallow-by-template, not something you clean up later.
Quarterly, re-export “Crawled — currently not indexed” and look for new noise categories your CMS started generating.
For e-commerce / faceted sites, treat each indexable URL as an editorial decision, not an automatic one.
Keep one rule memorized: block OR tag, never both — a Disallow-ed URL hides its own noindex and canonical.
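The quarterly re-export is easier to act on when it is compared against the previous one. A small sketch, assuming you keep per-category counts from each export as plain dictionaries; the 1.5x growth threshold and the sample numbers are arbitrary illustrations, not recommendations.

```python
def flag_categories(previous, current, growth=1.5):
    """Return categories that are new, or grew by at least `growth` times."""
    flagged = {}
    for label, count in current.items():
        before = previous.get(label, 0)
        if before == 0 and count > 0:
            flagged[label] = "new"
        elif before and count >= before * growth:
            flagged[label] = "growing"
    return flagged


# Hypothetical counts from two consecutive exports.
last_quarter = {"Facets": 800, "Archive pages": 300}
this_quarter = {"Facets": 820, "Archive pages": 600, "Preview URLs": 150}

print(flag_categories(last_quarter, this_quarter))
```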

FAQ

Should I use robots.txt or noindex to get rid of these URLs? Use robots.txt Disallow when you never want the URL crawled (sort/filter params, facets, internal search, attachments) — it saves crawl budget because Google never fetches the page. Use noindex when you want the page kept out of the index but still crawled-through to the links it contains (thin archives). Never use both on the same URL: a blocked URL can’t be fetched, so Google never sees the noindex.

Will blocking these URLs in robots.txt deindex pages already in Google? Not immediately, and not reliably. Once a URL is blocked, Google can’t re-crawl it to see a noindex, so a page already indexed may linger in results (shown without a snippet) if other sites link to it. To remove a page that’s already indexed, let it stay crawlable and add noindex first; only block it in robots.txt after it has dropped out of the index.

How long until “Crawled — currently not indexed” goes down? Typically 2 to 6 weeks. Google re-crawls on its own schedule and the report updates with a lag. You can speed verification by checking the robots.txt report and URL Inspection the same day, but the aggregate count moves slowly. Don’t repeatedly hit “Validate fix” expecting instant results.

Does crawl budget even matter for my small site? Probably not. Google says sites under roughly 10,000 URLs are crawled efficiently without special attention. Crawl budget becomes a real constraint on sites with 10,000+ URLs that change daily, or 1,000,000+ URLs total. If your articles index fine within a week or two, treat low-value URLs as a hygiene cleanup, not an emergency.

Where did the URL Parameters tool in Search Console go? Google removed it on April 26, 2022, after finding only about 1% of parameter configurations were actually useful. There’s no in-GSC parameter control anymore — handle parameters with robots.txt, canonical tags, and noindex instead.

Is noindex, follow still safe for pagination? Short term, yes — but Google has said pages left noindex long-term are eventually treated as nofollow, so the links stop passing signal. As of June 2026, prefer keeping paginated pages indexable with self-canonicals and guaranteeing every article is also reachable via a sitemap or index, rather than blanket-noindexing deep pagination.

Tags: #Content ops #Site quality #Site audit #Troubleshooting #Low-value URL
