Too Many Tags Create Thin Archive Pages

800 tags, 600 with 1-2 articles each = 600 thin archive pages Google won't index. Set a per-tag article threshold, merge synonyms, noindex the rest, keep them out of the sitemap.

Published: May 19, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Two years of editors freely adding tags (“seems useful!”) left you with 800 tags. 600 of them have one or two articles each, and every tag generates its own archive page. So you now have ~600 archive pages that each list one or two articles, which makes them thinner than the articles they point to. In Google Search Console (Indexing -> Pages) they pile up under “Crawled - currently not indexed” and “Discovered - currently not indexed”, and Googlebot keeps re-fetching them instead of your real articles.

Fastest fix: stop generating a public page for any tag below a minimum article count (a threshold of >= 5 is a sane default), then noindex and remove from the sitemap any thin tag pages already in the index. Tags as metadata are fine — tags as public pages need a threshold or they become a thin-content factory. The full pass below: set the threshold, merge near-synonyms with 301s, noindex + de-sitemap the stragglers, and add governance so tag proliferation can’t recur.

One change worth knowing (confirmed June 2026): noindex, follow is not a permanent state. At first Google keeps a noindex, follow page out of search while still following its links, but if the noindex stays in place for months, Google treats the page as noindex, nofollow and stops following those links. So follow buys you a crawl-through window, not a permanent one. If a thin tag page is the only path to an article, fix the article’s internal links before you noindex the tag; don’t rely on follow to keep that article discoverable.

Common causes

Ordered by hit rate, highest first.

1. No tag governance — anyone can add a new tag

The CMS lets any author type a new tag into the frontmatter. Two authors writing the same week create ai-coding, ai-programming, and code-with-ai — three tags for one concept.

How to spot it: the tag list has obvious synonyms. Run a frequency count (see Step 1) and eyeball the long tail for variants of the same idea.

2. No threshold for “when does a tag become a page”

Every tag becomes a public page regardless of article count. A tag used once becomes a page with one article — thinner than the article itself.

How to spot it: in Search Console -> Indexing -> Pages, open the “Crawled - currently not indexed” examples and check whether most of the URLs are tag/archive paths. Or crawl with Screaming Frog / Sitebulb and sort by word count; thin tag pages cluster at the bottom.

3. No merge / cleanup process

Tags accumulate and nothing prunes them. After two years half are dormant or near-synonyms, but each still generates a page.

How to spot it: check git log on your tag registry or normalization script. “No edits in 12+ months” means there is no cleanup process.

4. Tags double as topics, sub-topics, and keywords

ai, ai-coding, claude-code-tutorials, setting-up-claude, claude-setup — all defensible individually, but they overlap heavily, and each is its own tag page.

How to spot it: pick a tag and check how many other tags share most of its articles. High overlap means no clear differentiation and duplicate-ish archive pages.
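One way to put a number on that overlap is to compare the article sets behind each tag. The following is a minimal sketch, not a CMS feature: the `Article` shape, the function names, and the 0.6 cutoff for "most of its articles" are all assumptions chosen for illustration.

```typescript
// Hypothetical article shape: only the fields this sketch needs.
interface Article {
  slug: string;
  tags: string[];
}

interface TagOverlap {
  tag: string;
  share: number; // fraction of the inspected tag's articles that also carry this tag
}

// Map each tag to the set of article slugs that use it.
function articlesByTag(articles: Article[]): Map<string, Set<string>> {
  const byTag = new Map<string, Set<string>>();
  for (const article of articles) {
    for (const tag of article.tags) {
      const slugs = byTag.get(tag) ?? new Set<string>();
      slugs.add(article.slug);
      byTag.set(tag, slugs);
    }
  }
  return byTag;
}

// For one tag, list the other tags that share at least `minShare` of its articles.
function overlappingTags(
  articles: Article[],
  tag: string,
  minShare = 0.6, // assumed cutoff, tune to taste
): TagOverlap[] {
  const byTag = articlesByTag(articles);
  const own = byTag.get(tag);
  if (own === undefined || own.size === 0) return [];

  const result: TagOverlap[] = [];
  byTag.forEach((slugs, other) => {
    if (other === tag) return;
    let shared = 0;
    own.forEach((slug) => {
      if (slugs.has(slug)) shared += 1;
    });
    const share = shared / own.size;
    if (share >= minShare) result.push({ tag: other, share });
  });
  // Highest overlap first: these are the strongest merge candidates.
  return result.sort((a, b) => b.share - a.share);
}
```

Tags that come back with a share near 1.0 are effectively duplicates of the tag you inspected and are natural inputs to the merge step.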

5. Auto-generated tag pages have zero editorial content

The tag page is just “Articles tagged X” plus a list — no intro, no curation, no editor’s note. Empty template plus one article equals thin.

How to spot it: view source of a tag page. If everything is boilerplate and the article count is <= 2, it is thin by definition.

6. Tag URLs proliferate via case / format variations

/tag/AI, /tag/ai, /tag/AI-coding, /tag/ai-coding — case-sensitive routing or inconsistent normalization creates duplicate tag pages from one concept.

How to spot it: crawl and look for tag URLs that differ only in case or separator. If both 200, normalization is broken.
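If you would rather check the tag list than a crawl, the same test can be run on the raw tag strings. This is a rough sketch under the assumption that the target convention is lowercase kebab-case; `normalizeTag` and `findVariantGroups` are illustrative names, not part of any framework.

```typescript
// Collapse case and separator differences into one comparable form.
// Assumed convention: lowercase kebab-case.
function normalizeTag(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[\s_]+/g, "-") // spaces and underscores become hyphens
    .replace(/-+/g, "-") // collapse repeated hyphens
    .replace(/^-|-$/g, ""); // drop a leading or trailing hyphen
}

// Group raw tags by normalized form and keep only the groups
// that contain more than one distinct spelling.
function findVariantGroups(rawTags: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const raw of rawTags) {
    const key = normalizeTag(raw);
    const spellings = groups.get(key) ?? [];
    if (spellings.indexOf(raw) === -1) spellings.push(raw);
    groups.set(key, spellings);
  }

  const dupes = new Map<string, string[]>();
  groups.forEach((spellings, key) => {
    if (spellings.length > 1) dupes.set(key, spellings);
  });
  return dupes;
}
```

Any group this returns is one concept published under several spellings, and each spelling is a candidate duplicate URL.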

Which bucket are you in

| Symptom in Search Console / crawl | Most likely cause | Go to |
| --- | --- | --- |
| Many near-synonym tags in the frequency list | No governance (1) | Step 3 + Step 5 |
| Hundreds of single-article tag pages | No threshold (2) | Step 2 |
| Tag list grew but never shrank | No cleanup (3) | Step 3 + Step 5 |
| Same articles under many overlapping tags | Tags = topics + keywords (4) | Step 3 |
| Tag pages have no text but the article list | Empty template (5) | Step 2 (cut) or curate hubs |
| `/tag/AI` and `/tag/ai` both return 200 | Case/format dupes (6) | Normalize in schema (Prevention) |

Shortest path to fix

Ordered by ROI. Step 1 audits; Steps 2-4 reduce; Steps 5-6 lock it in and measure.

Step 1: Audit tag distribution

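# Count articles per tag across the content tree.
# Assumes inline-array frontmatter on one line: tags: ["ai", "ai-coding"]
grep -rh "^tags:" src/content/articles/en \
  | sed 's/^tags:[[:space:]]*//; s/[][]//g' \
  | tr ',' '\n' \
  | sed 's/^[" ]*//; s/[" ]*$//' \
  | grep -v '^$' \
  | sort | uniq -c | sort -nr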

Output is count tag, highest first. Every tag with a count below your threshold (e.g. < 5) is a candidate to drop from public pages. Save this list — it’s the input to Steps 2 and 3.

Step 2: Set a public-page threshold

In your tag-page generator, only emit a route when the tag clears the threshold:

// src/pages/tag/[slug].astro (frontmatter script)
import { getCollection } from "astro:content";
// countTags is your own helper (not shown here): it returns { [tag]: articleCount }

export async function getStaticPaths() {
  const articles = await getCollection("articles");
  const tagCounts = countTags(articles);

  return Object.entries(tagCounts)
    .filter(([, count]) => count >= 5)   // only tags with >= 5 articles
    .map(([slug]) => ({ params: { slug }, props: { /* ... */ } }));
}

Tags below the threshold still appear in article frontmatter (useful as internal metadata and for related-article logic) but no longer generate a crawlable page. Picking the number: >= 5 is a safe default; raise it to >= 8-10 if your articles are short, lower it to >= 3 only if each tag page also carries real editorial text.
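The generator snippet relies on a `countTags` helper that the article does not define. A minimal sketch of what it could look like follows, assuming each collection entry exposes its frontmatter under `data` with an optional `tags` string array; the `TaggedEntry` type and `tagsBelowThreshold` are hypothetical names added for illustration.

```typescript
// Minimal entry shape for this sketch. A `tags` string array in the
// frontmatter schema is an assumption.
interface TaggedEntry {
  data: { tags?: string[] };
}

// Count how many entries use each tag. A tag repeated inside one
// entry is counted once for that entry.
function countTags(entries: TaggedEntry[]): Record<string, number> {
  // Null-prototype object so tag names like "constructor" cannot collide
  // with inherited properties.
  const counts: Record<string, number> = Object.create(null);
  for (const entry of entries) {
    const unique = new Set(entry.data.tags ?? []);
    unique.forEach((tag) => {
      counts[tag] = (counts[tag] ?? 0) + 1;
    });
  }
  return counts;
}

// Tags that would NOT get a public page at the given threshold.
function tagsBelowThreshold(counts: Record<string, number>, min: number): string[] {
  return Object.keys(counts)
    .filter((tag) => (counts[tag] ?? 0) < min)
    .sort();
}
```

Running `tagsBelowThreshold(countTags(articles), 5)` before you ship the filter gives you the exact list of tag URLs that are about to disappear, which is also the list to noindex or redirect.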

Step 3: Merge close synonyms

ai-coding     <- ai-programming, code-with-ai, ai-code
claude-setup  <- setting-up-claude, claude-installation

For each merge:

1. Pick the canonical tag (most-used, clearest, matches how people search).
2. Update every article’s frontmatter to use the canonical tag.
3. 301-redirect the old tag URLs to the canonical tag URL (so any inbound links and Google’s memory of the old URL transfer over).
4. Confirm the canonical tag page now lists the combined set of articles.

After re-running Step 1, the canonical tag should have absorbed the article counts of all the variants you merged.
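The merge steps above can be driven from a single map, so the frontmatter rewrite and the redirect list never drift apart. The sketch below is one possible layout, not a prescribed one: `MERGES` reuses the example tags from this step, the function names are illustrative, and the `/tag/<slug>/` URL pattern is assumed to match your routes.

```typescript
// Canonical tag -> variants to fold into it (the example merges from this step).
const MERGES: Record<string, string[]> = {
  "ai-coding": ["ai-programming", "code-with-ai", "ai-code"],
  "claude-setup": ["setting-up-claude", "claude-installation"],
};

interface RedirectRule {
  from: string;
  to: string;
  status: 301;
}

// Invert the map to variant -> canonical for fast lookup.
function buildAliasMap(merges: Record<string, string[]>): Map<string, string> {
  const alias = new Map<string, string>();
  for (const canonical of Object.keys(merges)) {
    for (const variant of merges[canonical] ?? []) alias.set(variant, canonical);
  }
  return alias;
}

// Rewrite one article's tags: replace variants, drop duplicates, keep order.
function canonicalizeTags(tags: string[], alias: Map<string, string>): string[] {
  const out: string[] = [];
  for (const tag of tags) {
    const canonical = alias.get(tag) ?? tag;
    if (out.indexOf(canonical) === -1) out.push(canonical);
  }
  return out;
}

// One 301 rule per retired tag URL (URL pattern is an assumption).
function buildRedirects(alias: Map<string, string>): RedirectRule[] {
  const rules: RedirectRule[] = [];
  alias.forEach((canonical, variant) => {
    rules.push({ from: `/tag/${variant}/`, to: `/tag/${canonical}/`, status: 301 });
  });
  return rules;
}
```

How the rules get applied depends on your host (a redirects file, server config, or framework setting), so the sketch stops at producing the list.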

Step 4: noindex the thin tag pages already indexed

For tag pages Google already crawled but that now sit below the threshold (or that you can’t merge), add a robots meta tag in the template:

<!-- tag page template, when articleCount < 5 -->
<meta name="robots" content="noindex, follow" />

Three things that make this actually work — all confirmed against Google’s current docs (June 2026):

The page must not be blocked in robots.txt. If Googlebot can’t fetch it, it never sees the noindex and the URL can linger in the index. Let it crawl, read the tag, and drop the page.
Remove these URLs from your XML sitemap. A noindexed URL listed in the sitemap is a conflicting signal that wastes crawl budget; sitemaps should contain only canonical, indexable URLs.
follow is temporary. Google follows the links at first, but a months-long noindex is eventually treated as noindex, nofollow. If a thin tag page is the only route to some article, add a real internal link to that article from a hub or related-articles block before you noindex.

If you’ve deleted a tag from the generator entirely, return 410 Gone for its URL instead. That signals permanent removal; Google treats 404 and 410 almost the same, though a 410 may drop out of the index slightly faster.
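It can help to keep the three outcomes from this step (indexable, noindexed, gone) in one function, so the template, the sitemap, and the server never disagree. This is a sketch under stated assumptions: `tagDisposition` and `sitemapTagUrls` are hypothetical helpers, the threshold of 5 comes from Step 2, and `https://example.com` is a placeholder domain.

```typescript
type TagDisposition =
  | { status: 200; robots: "index, follow"; inSitemap: true }
  | { status: 200; robots: "noindex, follow"; inSitemap: false }
  | { status: 410; robots: null; inSitemap: false };

const MIN_ARTICLES = 5; // threshold from Step 2

// Decide how a tag URL should respond.
function tagDisposition(articleCount: number, tagStillExists: boolean): TagDisposition {
  // Tag removed from the generator: the URL should die.
  if (!tagStillExists) return { status: 410, robots: null, inSitemap: false };
  // Thin page: keep it crawlable so Google can read the noindex,
  // but keep it out of the sitemap.
  if (articleCount < MIN_ARTICLES) {
    return { status: 200, robots: "noindex, follow", inSitemap: false };
  }
  return { status: 200, robots: "index, follow", inSitemap: true };
}

// Only indexable tag pages belong in the XML sitemap.
function sitemapTagUrls(counts: Record<string, number>, baseUrl: string): string[] {
  return Object.keys(counts)
    .filter((tag) => tagDisposition(counts[tag] ?? 0, true).inSitemap)
    .map((tag) => `${baseUrl}/tag/${tag}/`);
}
```

For example, `sitemapTagUrls({ ai: 12, "ai-code": 1 }, "https://example.com")` keeps only the `ai` tag page, because `ai-code` falls under the threshold.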

Step 5: Add tag governance

Create a curated allow-list in the repo:

// src/lib/allowed-tags.ts
export const ALLOWED_TAGS = [
  "ai", "ai-coding", "claude", "claude-code", "chatgpt", "cursor",
  "openai-api", "anthropic-api", "prompt-engineering",
  // ... ~50 tags total
] as const;

Then add a CI check (or a content:audit script run in prebuild) that fails the build when an article uses a tag outside ALLOWED_TAGS. That forces a deliberate code-review conversation before any new tag — and any new tag page — exists.
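A minimal sketch of such a check follows. It is an illustration rather than a drop-in script: `ALLOWED` stands in for the real `ALLOWED_TAGS` import, and the `ArticleTags` shape, `findTagViolations`, and `auditExitCode` are assumed names. Loading frontmatter from disk is left out because it depends on your content pipeline.

```typescript
// Stand-in for the curated list; a real script would import ALLOWED_TAGS instead.
const ALLOWED: readonly string[] = ["ai", "ai-coding", "claude", "claude-code"];

// Hypothetical input: one record per article, already parsed from frontmatter.
interface ArticleTags {
  file: string;
  tags: string[];
}

interface TagViolation {
  file: string;
  tag: string;
}

// Return every (file, tag) pair where the tag is not on the allow-list.
function findTagViolations(
  articles: ArticleTags[],
  allowed: readonly string[],
): TagViolation[] {
  const allowedSet = new Set<string>(allowed);
  const violations: TagViolation[] = [];
  for (const article of articles) {
    for (const tag of article.tags) {
      if (!allowedSet.has(tag)) violations.push({ file: article.file, tag });
    }
  }
  return violations;
}

// Print a readable report and return the exit code the build should use.
function auditExitCode(violations: TagViolation[]): number {
  for (const v of violations) {
    console.error(`${v.file}: tag "${v.tag}" is not in the allow-list`);
  }
  return violations.length === 0 ? 0 : 1;
}
```

In a Node prebuild script you would assign the returned value to `process.exitCode`, so a non-zero result fails the build.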

Step 6: Wait for re-crawl and measure

Recrawl is not instant. Google usually revisits and re-decides within two to four weeks for active sites; large tag-page batches can take longer. Track these in Search Console -> Indexing -> Pages:

Indexed tag-page count should fall as the thin pages drop out.
The Crawled - currently not indexed bucket should shrink (those were largely your thin tag pages).
Reclaimed crawl budget should show up as faster indexing of genuinely new articles — watch time-to-index on a few recent posts.

How to confirm it’s fixed

Generator: run a production build and grep the output for tag routes. Below-threshold tags should produce no /tag/<slug>/ file.
Live header: curl -I https://yoursite.com/tag/<thin-tag>/ should return either a 410 or a 200. For a 200, also fetch the HTML (curl -s without -I) and confirm the <head> contains <meta name="robots" content="noindex, follow">.
Sitemap: open sitemap.xml (or the tag sitemap) and confirm no noindexed tag URL is listed.
Search Console: use URL Inspection on one formerly-thin tag URL — it should report “noindex detected” / “Excluded by ‘noindex’ tag”, which means Google has seen and honored the directive.
Counts over time: the Pages report’s indexed count and “Crawled - currently not indexed” count trend down over the following weeks.
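The live check above can be scripted for a whole batch of URLs. The sketch below covers only the decision logic, given a status code and the page HTML; it is a rough regex-based check, not an HTML parser, and it assumes the directive is delivered as a `<meta name="robots">` tag rather than an `X-Robots-Tag` header. `checkThinTagPage` and the verdict names are illustrative.

```typescript
type Verdict = "gone" | "noindexed" | "still-indexable" | "unexpected-status";

// Extract the content attribute of <meta name="robots" ...>, if present.
function robotsMetaContent(html: string): string | null {
  const tag = html.match(/<meta\b[^>]*\bname\s*=\s*["']robots["'][^>]*>/i);
  if (tag === null) return null;
  const content = tag[0].match(/\bcontent\s*=\s*["']([^"']*)["']/i);
  if (content === null) return null;
  return content[1] ?? null;
}

// Classify one formerly-thin tag URL from its response.
function checkThinTagPage(status: number, html: string): Verdict {
  if (status === 410) return "gone";
  if (status !== 200) return "unexpected-status";
  const directives = (robotsMetaContent(html) ?? "")
    .toLowerCase()
    .split(",")
    .map((directive) => directive.trim());
  return directives.indexOf("noindex") !== -1 ? "noindexed" : "still-indexable";
}
```

In a runtime with a global `fetch`, you would pass `response.status` and the text of the response body to `checkThinTagPage`; any URL that comes back as "still-indexable" needs another look at the template.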

FAQ

Will noindexing 600 tag pages hurt my rankings? No, the opposite. Those pages weren’t ranking — they were diluting site quality and burning crawl budget. Removing thin, near-duplicate archive pages concentrates crawl and quality signals on the articles that actually rank.

noindex or 410: which should I use? If you are keeping the tag page and want Google to keep following its links for now, use noindex, follow. If you have killed the tag entirely and the URL should die, return 410 Gone. A 410 is the cleaner signal and may de-index slightly faster, but only use it once the tag truly no longer exists in your generator.

Why not just block tag pages in robots.txt? Because robots.txt blocks crawling, not indexing. A blocked URL Google already knows about can stay indexed (often as a URL with no description), and Googlebot never sees your noindex to drop it. For removal, allow crawling and use noindex; reserve robots.txt for URLs you never want fetched at all.

Does noindex, follow keep passing link equity forever? No. Google follows the links initially, but a long-standing noindex is eventually treated like noindex, nofollow, so those links stop counting. Don’t depend on a noindexed tag page as a permanent internal-linking layer — link important articles directly from hubs.

What threshold should a small site use? If you have a few hundred articles, >= 5 per tag is reasonable. Fewer than ~100 articles: consider >= 3, or skip public tag pages entirely and rely on category pages plus internal links until you have the depth to make a tag page genuinely useful.

How long until Search Console reflects the change? Typically two to four weeks for Google to recrawl and re-decide on an active site; large batches of tag pages can take a month or more. Don’t mass-submit them in URL Inspection — let the natural recrawl pick them up.

Prevention

Maintain a curated ALLOWED_TAGS list in code; introducing a new tag requires a code review.
Keep a minimum article threshold (e.g. >= 5) for public tag pages — everything else stays metadata-only.
Normalize tag formatting at the schema level (lowercase, kebab-case) so case/format dupes can’t appear.
Quarterly, re-run the Step 1 audit and merge new near-duplicates before they accumulate.
For high-value tags (>= 10 articles), treat the tag page as a hub: add an editorial intro and curated ordering so it earns its index slot.
Keep noindexed and non-canonical URLs out of the XML sitemap.
A site with ~50 curated tags and strong tag pages outperforms one with 800 tags and thin pages.

Tags: #Content ops #Site quality #Site audit #Troubleshooting #Tag page