Search Console’s “Crawled — currently not indexed” report has 3,400 URLs. You scroll: ?sort=asc, ?sort=desc, ?page=12, ?page=13, ?tag=python&category=tutorials, /2024/03/, /author/jane/, /wp-content/uploads/.... Crawl budget is being burned on these instead of your articles. Real articles take longer to get indexed because Googlebot is busy crawling parameter combinations.
Crawlable-but-valueless URLs do two bad things: waste crawl budget (slowing real article indexing) and signal poor site hygiene (Google sees a sprawling, unstructured site). Below: how to categorize the noise, block by source, and reclaim crawl budget for content that matters.
Common causes
Ordered by hit rate, highest first.
1. URL parameters expose every sort / filter combination
?sort=asc, ?sort=desc&filter=x, ?sort=desc&filter=x&page=2 — every combination is a unique URL. A list with 3 filters × 2 sort options × 10 pages = 60 URL combinations per list page.
How to spot it: GSC report has many URLs with ? query strings. Group by parameter pattern; the offenders are the high-multiplicity parameters.
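One way to do the grouping is to reduce each URL to the sorted set of its parameter names and count how often each set occurs. The sketch below assumes the report's URLs are already in a Python list; `sample_urls` is hypothetical stand-in data, not real export output.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def param_signature(url: str) -> str:
    """Return the URL's query parameter names, sorted and joined, e.g. 'filter&sort'."""
    query = urlsplit(url).query
    names = sorted({name for name, _ in parse_qsl(query, keep_blank_values=True)})
    return "&".join(names) if names else "(no parameters)"


def count_signatures(urls: list[str]) -> Counter:
    """Count how many URLs share each parameter signature."""
    return Counter(param_signature(u) for u in urls)


# Hypothetical sample standing in for the exported report.
sample_urls = [
    "https://yoursite.com/category/ai/?sort=asc",
    "https://yoursite.com/category/ai/?sort=desc",
    "https://yoursite.com/category/ai/?sort=desc&filter=x",
    "https://yoursite.com/category/ai/?filter=x&sort=asc&page=2",
    "https://yoursite.com/category/ai/",
]

for signature, count in count_signatures(sample_urls).most_common():
    print(f"{count:>4}  {signature}")
```

Signatures with the highest counts are the parameter patterns to fix first.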
2. Pagination beyond page 5-10 has no substantive content
/category/ai/page/12/ shows articles 111-120 (at 10 per page), content nobody searches for. Google crawls the page but doesn’t index it because it’s thin.
How to spot it: GSC lists many /page/N/ URLs with N > 5. In the example breakdown in Step 1, deep pagination accounts for 600 of the 3,400 URLs.
3. Faceted navigation creates URL combinatorial explosion
/products/?color=blue&size=large&brand=acme style URLs. Every facet combination is a URL. 10 colors × 5 sizes × 20 brands = 1,000 URL combinations from one page.
How to spot it: URLs with multiple params, especially e-commerce-style facets. Combinatorial = many distinct URLs from few real intents.
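The arithmetic is worth running for your own facet counts. The helper below is an illustrative sketch (the function name and the `optional` flag are assumptions, not part of any tool). It reproduces the 10 × 5 × 20 example, and also shows how the total grows when each facet may be left unset.

```python
from math import prod


def count_facet_urls(values_per_facet: list[int], optional: bool = False) -> int:
    """Count distinct facet URLs for one listing page.

    optional=False: every facet is set in every URL.
    optional=True: each facet may also be left unset; the fully
    unfiltered page is excluded from the count.
    """
    if optional:
        return prod(n + 1 for n in values_per_facet) - 1
    return prod(values_per_facet)


print(count_facet_urls([10, 5, 20]))                 # 1000
print(count_facet_urls([10, 5, 20], optional=True))  # 1385
print(count_facet_urls([3, 2, 10]))                  # 60
```

The count ignores parameter order; if the site also emits the same facets in different orders, each ordering is yet another URL.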
4. Auto-generated archive pages (year / month / author) are publicly crawlable
/2024/, /2024/03/, /2024/03/15/, /author/jane/ — your CMS auto-generates these. Each is a separate URL but most have nothing useful Google would rank.
How to spot it: URL patterns like /YYYY/, /YYYY/MM/, /author/. Date / author archives are usually thin.
5. Internal search results indexed accidentally
/search?q=ai+tools is a results page generated by your site search, and it got crawled. Every distinct query Googlebot discovers becomes another crawlable URL.
How to spot it: URLs starting with /search? or /?s=. Internal search results should never be indexable.
6. WordPress / CMS attachment pages
?attachment_id=42 and similar URLs are attachment pages that some CMSes auto-generate for each uploaded file. They have no real content; the image itself is at a different URL, such as /wp-content/uploads/image.png, which can show up in the report too.
How to spot it: URLs containing /wp-content/, ?attachment_id=, or similar CMS-specific attachment patterns.
Shortest path to fix
Ordered by ROI. Step 1 categorizes; Steps 2-6 fix by source.
Step 1: Export and categorize the noise
Search Console → Pages → "Crawled — currently not indexed" → Export
Group URLs by pattern:
| Category | Example | Count |
|---|---|---|
| Parameter sort/filter | ?sort=*, ?filter=* | 1200 |
| Pagination >page 5 | /page/N where N>5 | 600 |
| Facets | ?color=*&size=*&brand=* | 800 |
| Archive pages | /YYYY/, /author/* | 300 |
| Internal search | /search?q=* | 200 |
| Attachments | /wp-content/uploads/* | 300 |
The biggest categories are where to apply fixes first.
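A small script can do the grouping. This is a minimal sketch, assuming the exported URLs are loaded into a list; the parameter names in `FACET_PARAMS` and `SORT_FILTER_PARAMS`, the `categorize` function, and the sample URLs are illustrative and should be adjusted to your own site.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; replace with the ones your site uses.
FACET_PARAMS = {"color", "size", "brand"}
SORT_FILTER_PARAMS = {"sort", "filter"}


def categorize(url: str) -> str:
    """Assign a URL to one of the noise categories from the table above."""
    parts = urlsplit(url)
    path = parts.path
    query = parse_qs(parts.query, keep_blank_values=True)

    if path == "/search" or path.startswith("/search/") or "s" in query:
        return "Internal search"
    if "attachment_id" in query or path.startswith("/wp-content/uploads/"):
        return "Attachments"
    if FACET_PARAMS & query.keys():
        return "Facets"
    if SORT_FILTER_PARAMS & query.keys():
        return "Parameter sort/filter"

    # Pagination can live in the path (/page/12/) or the query (?page=12).
    page_match = re.search(r"/page/(\d+)/?$", path)
    page_number = int(page_match.group(1)) if page_match else None
    if page_number is None and query.get("page", [""])[0].isdigit():
        page_number = int(query["page"][0])
    if page_number is not None and page_number > 5:
        return "Pagination >page 5"

    # /YYYY/, /YYYY/MM/, /YYYY/MM/DD/ and author archives.
    if re.fullmatch(r"/\d{4}/(\d{2}/){0,2}", path) or path.startswith("/author/"):
        return "Archive pages"
    return "Other"


# Hypothetical sample standing in for the exported URL column.
sample_urls = [
    "https://yoursite.com/category/ai/?sort=asc",
    "https://yoursite.com/category/ai/page/12/",
    "https://yoursite.com/products/?color=blue&size=large&brand=acme",
    "https://yoursite.com/2024/03/",
    "https://yoursite.com/search?q=ai+tools",
    "https://yoursite.com/?attachment_id=42",
]

for category, count in Counter(map(categorize, sample_urls)).most_common():
    print(f"{count:>5}  {category}")
```

Anything that lands in "Other" is worth a manual look: it is either a real article that has not been indexed yet or a noise pattern the rules do not cover.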
Step 2: For parameters — canonical to parameterless
In your page’s <head>:
<link rel="canonical" href="https://yoursite.com/category/ai/" />
Then ?sort=asc, ?sort=desc&filter=x all consolidate to the canonical. Google’s signal accumulates on one URL instead of fragmenting.
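If the tag is generated in a template, the canonical URL can be derived by dropping the parameters that only reorder or narrow the list. The sketch below assumes `sort` and `filter` are those parameters; `NON_CANONICAL_PARAMS` and both function names are hypothetical.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change what the page is about.
NON_CANONICAL_PARAMS = {"sort", "filter"}


def canonical_url(url: str) -> str:
    """Strip non-canonical parameters (and any fragment) from a URL."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in NON_CANONICAL_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Render the <link rel="canonical"> tag for the page at `url`."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


print(canonical_link_tag("https://yoursite.com/category/ai/?sort=desc&filter=x"))
```

Parameters that are not in the set, such as `page`, are kept, so paginated URLs still get their own canonical.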
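For parameters Google should never crawl, use robots.txt instead of a canonical, not alongside it: a URL blocked in robots.txt is never fetched, so Google cannot read a canonical tag on it. Each parameter needs a ? rule and an & rule, because /*?sort= only matches when sort is the first parameter:
User-agent: *
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=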
Canonical is for consolidation; robots.txt is for outright blocking.
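Before deploying wildcard rules, it helps to check which URLs a pattern actually covers. The function below is a hand-rolled approximation of Google's documented matching (prefix match, `*` for any run of characters, trailing `$` to anchor the end). It is a sketch for testing patterns and ignores Allow rules, rule precedence, and percent-encoding normalization; `robots_rule_matches` is a hypothetical name.

```python
import re


def robots_rule_matches(rule: str, path_and_query: str) -> bool:
    """Return True if a Disallow pattern would match the given path and query."""
    if not rule:
        return False  # an empty Disallow blocks nothing
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives prefix matching.
    return re.match(regex, path_and_query) is not None


url = "/category/ai/?filter=x&sort=asc"
for rule in ("/*?sort=*", "/*&sort="):
    print(rule, robots_rule_matches(rule, url))
```

In this example the `?sort=` pattern does not match because `sort` is not the first parameter, while the `&sort=` pattern does, so both forms are needed to cover a parameter in any position.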
Step 3: For deep pagination — noindex pages 3+
<!-- /category/ai/page/3/ and beyond -->
<meta name="robots" content="noindex, follow" />
follow lets Google still crawl to articles linked from the page; noindex keeps the page itself out of the index.
Or implement view-all + remove pagination entirely for short categories.
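In a template, the page-3 threshold can be a single helper that decides the robots value per paginated page. This is a minimal sketch; `robots_meta_for_page` and its `first_noindex_page` default are illustrative, so pick the threshold that fits how deep your useful pagination goes.

```python
def robots_meta_for_page(page_number: int, first_noindex_page: int = 3) -> str:
    """Return the robots meta tag for a paginated listing page."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    if page_number >= first_noindex_page:
        # Keep the page out of the index but let links on it be followed.
        content = "noindex, follow"
    else:
        content = "index, follow"
    return f'<meta name="robots" content="{content}" />'


for page in (1, 2, 3, 12):
    print(page, robots_meta_for_page(page))
```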
Step 4: For facets — noindex by default, index curated combos
Default: every facet combination = noindex
Curated: /products/blue-shoes/ (manually picked, body content added) = indexable
This is the e-commerce SEO pattern: only the high-intent combinations get URL-as-content treatment; the rest are filtering noise.
Step 5: For archive pages — noindex if thin
<!-- /2024/, /author/jane/ -->
<meta name="robots" content="noindex" />
If your archive pages have editorial content (intro, curation), keep them indexable. If they’re auto-generated lists, noindex.
Step 6: For internal search and attachments — block entirely
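# robots.txt
User-agent: *
Disallow: /search?
Disallow: /?s=
# Caution: this also stops Google from fetching the image files themselves;
# drop this line if you want your images to appear in image search.
Disallow: /wp-content/uploads/
Disallow: /*?attachment_id=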
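If your CMS or SEO plugin has a setting that keeps search results and attachment pages out of the index, enable it as well. Do not confuse this with WordPress’s site-wide “Discourage search engines from indexing this site” option, which applies to every page.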
Prevention
- Plan URL structure to minimize parameter / facet combinations — clean URLs > parameter explosions
- Default new param patterns to noindex; curate the ones worth indexing
- Internal search results, attachments, deep pagination should be noindex by template default
- Quarterly: review GSC’s “Crawled - currently not indexed” for new noise categories
- For e-commerce / faceted sites, treat indexable URLs as an editorial decision, not an automatic one
- Reclaimed crawl budget shows up as faster indexing for new articles; track impressions on new content as a proxy