Two years of editors freely adding tags (“seems useful!”) gave you 800 tags, and 600 of them have only one or two articles. Each tag generates an archive page, so you now have 600 archive pages that list one or two articles each. Google sees a site full of thin, near-duplicate content, and crawl budget burns on these pages instead of on articles.
Tags-as-metadata are fine. Tags-as-public-pages need a minimum article threshold or they become a thin-content factory. The fix: set a threshold (≥5 articles per tag page), merge near-synonyms, noindex below threshold, and add editorial governance so tag proliferation can’t repeat.
Common causes
Ordered by hit rate, highest first.
1. No tag governance — anyone can add new tags
The CMS lets any author type a new tag into the frontmatter. Two authors writing the same week create ai-coding, ai-programming, code-with-ai — three tags for one concept.
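How to spot it: The tag list has obvious synonyms. Assuming inline frontmatter such as tags: ["ai", "cursor"], grep -h "^tags:" *.mdx | sed 's/^tags: *//' | tr -d '[]" ' | tr ',' '\n' | sort | uniq -c | sort -nr lists every tag with its article count and shows the redundant variants.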
2. No threshold for “when does a tag become a page”
Every tag becomes a public page regardless of article count. A tag used once = a page with one article — page is thinner than the article it lists.
How to spot it: Crawl reveals many tag pages with <10 words of unique content. The tag-page template adds nothing the article doesn’t have.
3. No merge / cleanup process
Tags accumulate. Nothing prunes them. After 2 years, half are dormant or near-synonyms but each still generates a page.
How to spot it: Run git log on your tag registry or tag normalization script. If the last edit is 12+ months old, there’s no cleanup process.
4. Tags double as topics, sub-topics, and keywords
ai, ai-coding, claude-code-tutorials, setting-up-claude, claude-setup — all valid in some sense, but they create overlap. Each is a separate tag page.
How to spot it: Pick a tag; check if it shares articles with 3+ other tags. High overlap = no clear differentiation.
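The overlap check can be scripted rather than done by hand. The following is a minimal sketch, assuming each article is already loaded as an object with a `slug` and a `tags` array; the `Article` type and the `overlappingTags` helper are hypothetical names for illustration, not part of any CMS API. For one tag, it reports which other tags share articles with it and how many.

```typescript
// Hypothetical article shape; adapt to however your content is loaded.
type Article = { slug: string; tags: string[] };

// For a given tag, count how many of its articles also carry each other tag.
function overlappingTags(articles: Article[], tag: string): Map<string, number> {
  const shared = new Map<string, number>();
  for (const article of articles) {
    if (!article.tags.includes(tag)) continue;
    for (const other of article.tags) {
      if (other === tag) continue;
      shared.set(other, (shared.get(other) ?? 0) + 1);
    }
  }
  return shared;
}

// Illustrative data only.
const sample: Article[] = [
  { slug: "a", tags: ["ai", "ai-coding", "claude-setup"] },
  { slug: "b", tags: ["ai-coding", "setting-up-claude", "claude-setup"] },
  { slug: "c", tags: ["ai", "cursor"] },
];

const overlap = overlappingTags(sample, "ai-coding");
overlap.forEach((count, other) => console.log(`${other}: ${count} shared article(s)`));
```

A tag whose result map has three or more entries matches the "shares articles with 3+ other tags" signal described above.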
5. Auto-generated tag pages have zero editorial content
Tag page is just “Articles tagged X” + a list. No intro, no curation, no editor’s note. Empty template + 1 article = thin.
How to spot it: View source of a tag page. If everything is auto-generated and the article count is ≤2, it’s thin.
6. Tag URLs proliferate via case/format variations
/tag/AI, /tag/ai, /tag/AI-coding, /tag/ai-coding — case-sensitive routing or inconsistent normalization creates duplicate tag pages from the same concept.
How to spot it: Crawl finds tag URLs differing only in case. Normalization is broken.
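One way to catch these variants before a crawl does is to normalize every raw tag and group the ones that collapse to the same slug. This is a sketch under stated assumptions: tags are ASCII, the target format is lowercase kebab-case, and `normalizeTag` and `findVariants` are hypothetical helper names.

```typescript
// Normalize a raw tag to lowercase kebab-case so "AI Coding", "ai_coding",
// and "AI-coding" all map to the same slug. Assumes ASCII tags.
function normalizeTag(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // any run of non-alphanumerics becomes one hyphen
    .replace(/^-+|-+$/g, ""); // drop leading and trailing hyphens
}

// Group raw tags by normalized slug. Groups with more than one distinct raw
// spelling are formatting variants of the same tag.
function findVariants(rawTags: string[]): Map<string, string[]> {
  const groups = new Map<string, Set<string>>();
  for (const raw of rawTags) {
    const slug = normalizeTag(raw);
    if (!groups.has(slug)) groups.set(slug, new Set<string>());
    groups.get(slug)!.add(raw);
  }
  const variants = new Map<string, string[]>();
  groups.forEach((raws, slug) => {
    if (raws.size > 1) variants.set(slug, Array.from(raws));
  });
  return variants;
}

const variants = findVariants(["AI", "ai", "AI-coding", "ai-coding", "cursor"]);
variants.forEach((raws, slug) => console.log(`${slug}: ${raws.join(", ")}`));
```

Running the same `normalizeTag` at build time, before slugs are turned into URLs, keeps the router from ever seeing the variant spellings.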
Shortest path to fix
Ordered by ROI. Step 1 audits; Steps 2-4 reduce; Step 5 prevents a repeat; Step 6 measures.
Step 1: Audit tag distribution
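# Count articles per tag (assumes inline frontmatter such as: tags: ["ai", "cursor"])
shopt -s globstar   # bash: let ** match nested directories
grep -h "^tags:" src/content/articles/en/**/*.mdx \
  | sed 's/^tags:[[:space:]]*//' \
  | tr -d '[]' \
  | tr ',' '\n' \
  | sed 's/^[" ]*//;s/[" ]*$//' \
  | sort | uniq -c | sort -nr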
Output: one line per tag, with the article count first. Tags with a count below 5 are below-threshold candidates.
Step 2: Set a public-page threshold
In your tag-page generator:
// src/pages/tag/[slug].astro
import { getCollection } from "astro:content";

export async function getStaticPaths() {
  const articles = await getCollection("articles");
  // countTags is a project helper (not part of Astro) returning { tag: articleCount }
  const tagCounts = countTags(articles);
  return Object.entries(tagCounts)
    .filter(([, count]) => count >= 5) // only tags with ≥5 articles
    .map(([slug]) => ({ params: { slug }, props: { /* ... */ } }));
}
Tags below threshold still appear in article metadata but don’t generate a public page.
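The generator relies on a `countTags` helper that is not shown. A minimal sketch of what it could look like follows, assuming each collection entry exposes its tags as `data.tags`; the `Entry` type and the `publicTags` helper are hypothetical and only there to keep the example self-contained.

```typescript
// Hypothetical minimal shape of a content-collection entry; only data.tags is used.
type Entry = { data: { tags?: string[] } };

// Count how many entries carry each tag. A tag repeated within one entry counts once.
function countTags(entries: Entry[]): Record<string, number> {
  // Null-prototype object so a tag named "constructor" cannot collide with built-ins.
  const counts: Record<string, number> = Object.create(null);
  for (const entry of entries) {
    const unique = Array.from(new Set(entry.data.tags ?? []));
    for (const tag of unique) {
      counts[tag] = (counts[tag] ?? 0) + 1;
    }
  }
  return counts;
}

// Tags that meet the public-page threshold (default taken from the ≥5 rule above).
function publicTags(counts: Record<string, number>, minArticles = 5): string[] {
  return Object.keys(counts).filter((tag) => counts[tag] >= minArticles);
}
```

Keeping the threshold as a parameter makes it easy to preview how many tag pages would survive at 3, 5, or 10 articles before committing to a number.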
Step 3: Merge close synonyms
ai-coding → consolidate with: ai-programming, code-with-ai, ai-code
claude-setup → consolidate with: setting-up-claude, claude-installation
For each merge:
1. Pick the canonical tag (most-used, clearest name)
2. Update all article frontmatter to use the canonical
3. 301 redirect old tag URLs to canonical tag URL
4. Test that the canonical tag page now has the combined articles
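The merge steps above can be driven from a single mapping so that the frontmatter rewrite and the redirect list never drift apart. This is a sketch, not a complete migration script: `TAG_MERGES`, `canonicalizeTags`, and `tagRedirects` are hypothetical names, the entries are the examples from this step, and the `/tag/<slug>` URL shape is assumed from the tag URLs shown earlier.

```typescript
// Variant tag -> canonical tag, using the example merges from this step.
const TAG_MERGES = new Map<string, string>([
  ["ai-programming", "ai-coding"],
  ["code-with-ai", "ai-coding"],
  ["ai-code", "ai-coding"],
  ["setting-up-claude", "claude-setup"],
  ["claude-installation", "claude-setup"],
]);

// Rewrite one article's tags to canonical names, dropping duplicates the merge creates.
function canonicalizeTags(tags: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const tag of tags) {
    const canonical = TAG_MERGES.get(tag) ?? tag;
    if (!seen.has(canonical)) {
      seen.add(canonical);
      result.push(canonical);
    }
  }
  return result;
}

// Old-URL -> new-URL pairs for the 301 redirects (assumes pages live under /tag/<slug>).
function tagRedirects(): Array<{ from: string; to: string }> {
  const redirects: Array<{ from: string; to: string }> = [];
  TAG_MERGES.forEach((canonical, variant) => {
    redirects.push({ from: `/tag/${variant}`, to: `/tag/${canonical}` });
  });
  return redirects;
}

console.log(canonicalizeTags(["ai-programming", "ai-coding", "cursor"]));
console.log(tagRedirects());
```

The output of `tagRedirects` still has to be translated into whatever redirect format your host or framework expects.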
Step 4: noindex pre-existing thin tag pages
For tag pages already indexed but below threshold:
<!-- tag page template, when articleCount < 5 -->
<meta name="robots" content="noindex, follow" />
The follow directive lets Google still crawl through to the listed articles.
Or 410 the tag URLs entirely if you’ve removed them from the generator.
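The choice between indexing, noindex, and 410 can be written down as one small decision function so the template and the server agree. This is an illustrative sketch: `tagPagePolicy`, `robotsMeta`, and the `stillGenerated` flag are hypothetical names, and the default threshold of 5 comes from Step 2.

```typescript
type TagPagePolicy = "index" | "noindex" | "gone";

// Decide how a tag URL should be treated.
// stillGenerated: whether the generator still emits a page for this tag.
function tagPagePolicy(
  articleCount: number,
  stillGenerated: boolean,
  minArticles = 5,
): TagPagePolicy {
  if (articleCount >= minArticles) return "index";
  // Below threshold: keep the page but noindex it, or serve 410 if it was removed.
  return stillGenerated ? "noindex" : "gone";
}

// Robots meta tag for the page template, or null when no tag is needed.
function robotsMeta(policy: TagPagePolicy): string | null {
  return policy === "noindex" ? '<meta name="robots" content="noindex, follow" />' : null;
}

console.log(tagPagePolicy(2, true)); // below threshold, page still exists
console.log(tagPagePolicy(2, false)); // below threshold, page removed
```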
Step 5: Add tag governance
Establish a curated tag list in the repo:
// src/lib/allowed-tags.ts
export const ALLOWED_TAGS = [
  "ai", "ai-coding", "claude", "claude-code", "chatgpt", "cursor",
  "openai-api", "anthropic-api", "prompt-engineering",
  // ... ~50 tags total
] as const;
CI check: reject PRs that introduce a tag not in ALLOWED_TAGS. Forces conversation about whether a new tag is needed.
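The CI check itself can be a short script. The sketch below assumes the article tags have already been parsed out of frontmatter into plain objects; `ArticleTags` and `findDisallowedTags` are hypothetical names, and the allow-list is abbreviated to the tags shown above.

```typescript
// Abbreviated copy of the allow-list; in a real project, import it from src/lib/allowed-tags.ts.
const ALLOWED_TAGS = [
  "ai", "ai-coding", "claude", "claude-code", "chatgpt", "cursor",
  "openai-api", "anthropic-api", "prompt-engineering",
] as const;

// Hypothetical shape: one record per article file, tags already parsed from frontmatter.
type ArticleTags = { file: string; tags: string[] };

// Return every (file, tag) pair where the tag is not on the allow-list.
function findDisallowedTags(
  articles: ArticleTags[],
  allowed: readonly string[],
): Array<{ file: string; tag: string }> {
  const allowedSet = new Set<string>(allowed);
  const violations: Array<{ file: string; tag: string }> = [];
  for (const article of articles) {
    for (const tag of article.tags) {
      if (!allowedSet.has(tag)) violations.push({ file: article.file, tag });
    }
  }
  return violations;
}

// Illustrative input only.
const violations = findDisallowedTags(
  [{ file: "intro.mdx", tags: ["ai", "ai-programming"] }],
  ALLOWED_TAGS,
);
for (const v of violations) console.log(`${v.file}: tag "${v.tag}" is not in ALLOWED_TAGS`);
```

In a pipeline, the script would exit with a non-zero status whenever `violations` is non-empty, which is what makes the PR fail.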
Step 6: Wait for re-crawl + measure
# After 4-8 weeks:
# - Search Console: indexed tag-page count should drop
# - "Crawled - not indexed" count should drop (was thin tag pages)
# - Crawl budget reclaimed → check new-article indexing time
Prevention
- Maintain a curated ALLOWED_TAGS list in code; new tags require a code review
- Minimum article threshold (e.g., ≥5) for public tag pages; others stay metadata-only
- Normalize tag formatting (lowercase, kebab-case) at the schema level
- Quarterly: audit synonyms and merge near-duplicates
- For high-value tags (≥10 articles), treat the tag page as a hub article with editorial intro
- A site with 50 curated tags + strong tag pages outperforms one with 800 tags + thin pages