Content duplication on a content site is rarely two identical articles. It is almost always two articles targeting the same intent — different titles, different wording, same underlying query. Google picks one and demotes the other. On a fast-scaling site, this happens constantly unless you have a script-driven process.
Background
There are three kinds of duplication: (1) identical-content (same article published twice — rare), (2) near-duplicate (paraphrased same article — common with AI assistance), and (3) duplicate-intent (different articles targeting the same query — the worst kind because it looks fine to a human). Each needs a different fix, and you cannot do any of it by hand once you cross ~200 articles.
How to tell
- Two articles consistently swap positions in Search Console for the same query.
- “Indexed” count in Search Console is much lower than “Submitted”; on an affected site the gap is typically more than 5%.
- “Duplicate without user-selected canonical” or “Duplicate, Google chose different canonical” appears in the Pages report.
- Two articles’ H1s, stripped of modifiers, would mean the same thing.
- Your sitemap has more entries than your unique primary keywords.
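The H1 comparison in the list above can be roughed out in code. This is a minimal sketch, assuming a hand-maintained list of modifier words; the MODIFIERS set and both function names are hypothetical and only illustrate the idea of stripping filler before comparing titles.

```javascript
// Hypothetical modifier list (an assumption); extend it for your own niche.
const MODIFIERS = new Set([
  'a', 'an', 'the', 'to', 'for', 'in', 'on', 'of', 'and', 'your', 'how',
  'best', 'top', 'guide', 'complete', 'ultimate', 'simple', 'easy',
  'step', 'by', 'tutorial',
]);

// Reduce a title to a sorted list of core terms so that word order and
// filler words do not hide a shared intent.
function coreTerms(title) {
  const words = title
    .toLowerCase()
    .replace(/\b(19|20)\d{2}\b/g, ' ') // drop years such as 2024
    .split(/[^a-z0-9]+/)
    .filter((w) => w && !MODIFIERS.has(w));
  return [...new Set(words)].sort().join(' ');
}

// True when two titles collapse to the same core terms.
function sameCoreTitle(a, b) {
  return coreTerms(a) === coreTerms(b);
}

console.log(sameCoreTitle(
  'How to Submit a Sitemap to Search Console',
  'Submit Your Sitemap to Search Console: The Complete Guide',
)); // true
```

A match here is only a prompt for human review: two titles can share core terms and still serve different intents.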
Before you start
- Confirm the job: this is a content-ops cleanup, not a launch. Block 1-2 hours of focus.
- Have a backup of the content collection; git status should be clean before you run merge scripts.
- Make sure your hosting layer supports 301 redirects (Astro + _redirects, Firebase redirects, Vercel vercel.json).
Step by step
- Add a primaryKeyword field to every article’s frontmatter. This is the single string that tells you what the article is for. Existing example shape:
---
title: "How to Submit a Sitemap to Search Console"
urlSlug: "submit-sitemap-search-console"
primaryKeyword: "submit sitemap search console"
category: "indie-dev"
---
- Run a duplicate-keyword report. A short Node script over your content collection prints any keyword shared by two or more articles:
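// scripts/find-duplicate-keywords.mjs
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter';

const ROOT = 'src/content/articles/en';
const byKw = new Map(); // normalized keyword -> list of article paths

// Walk each category directory and group articles by primaryKeyword.
for (const cat of readdirSync(ROOT, { withFileTypes: true })) {
  if (!cat.isDirectory()) continue; // skip stray files in ROOT
  for (const file of readdirSync(join(ROOT, cat.name))) {
    if (!file.endsWith('.mdx')) continue;
    const { data } = matter(readFileSync(join(ROOT, cat.name, file), 'utf8'));
    const kw = String(data.primaryKeyword || '').toLowerCase().trim();
    if (!kw) continue;
    if (!byKw.has(kw)) byKw.set(kw, []);
    byKw.get(kw).push(`${cat.name}/${file}`);
  }
}

// Report every keyword claimed by more than one article.
for (const [kw, files] of byKw) {
  if (files.length > 1) {
    console.log(`DUP "${kw}":\n  ${files.join('\n  ')}`);
    process.exitCode = 1; // non-zero exit so the prebuild check fails
  }
}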
Run it as part of npm run audit:content so duplicates fail the prebuild check.
- For near-duplicates, merge with a 301. Pick the stronger URL (higher impressions in Search Console), move any unique content into it, then add a redirect. Firebase example:
{
  "hosting": {
    "redirects": [
      {
        "source": "/articles/scale-ai-content-safely",
        "destination": "/articles/scale-content-with-ai-safely",
        "type": 301
      }
    ]
  }
}
Astro static + Netlify-style _redirects:
/articles/scale-ai-content-safely /articles/scale-content-with-ai-safely 301
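If you deploy to more than one host, or simply want one source of truth, both redirect formats above can be generated from a single merge map. A minimal sketch under that assumption; the merges object and both function names are hypothetical, and writing the output into firebase.json or _redirects is left to your build script.

```javascript
// Hypothetical merge map: retired path -> surviving path.
const merges = {
  '/articles/scale-ai-content-safely': '/articles/scale-content-with-ai-safely',
};

// Shape matches the entries of the "redirects" array in firebase.json.
function toFirebaseRedirects(map) {
  return Object.entries(map).map(([source, destination]) => ({
    source,
    destination,
    type: 301,
  }));
}

// One "source destination status" line per redirect, as in a _redirects file.
function toNetlifyRedirects(map) {
  return Object.entries(map)
    .map(([source, destination]) => `${source} ${destination} 301`)
    .join('\n');
}

console.log(JSON.stringify({ hosting: { redirects: toFirebaseRedirects(merges) } }, null, 2));
console.log(toNetlifyRedirects(merges));
```

Keeping the map in one file also gives later checks (sitemap, redirect chains) a single list of retired URLs to read from.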
- For duplicate-intent, narrow scope or noindex. Either split the angle (one beginner, one advanced) by rewriting H1 and primaryKeyword, or mark the weaker one draft: true and add noindex to its meta. In the article layout:
{frontmatter.noindex && <meta name="robots" content="noindex,follow" />}
- Set a self-canonical on every page by default, and only override when you are sure the target is stronger. In an Astro layout:
<link rel="canonical" href={new URL(Astro.url.pathname, Astro.site).href} />
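One detail worth checking when building the canonical: a site origin serialized from a URL ends in a slash and a pathname starts with one, so plain string concatenation can yield a double slash. The sketch below compares the two approaches with plain strings; example.com and both function names are placeholders, and in an Astro layout the inputs would come from Astro.site and Astro.url.pathname.

```javascript
// Build an absolute canonical URL from a site origin and a request path.
function canonicalFor(pathname, site) {
  return new URL(pathname, site).href;
}

// Naive concatenation, shown only for comparison.
function naiveCanonical(pathname, site) {
  return `${site}${pathname}`;
}

const site = 'https://example.com/'; // placeholder origin (assumption)
const path = '/articles/submit-sitemap-search-console';

console.log(canonicalFor(path, site));
// https://example.com/articles/submit-sitemap-search-console
console.log(naiveCanonical(path, site));
// https://example.com//articles/submit-sitemap-search-console
```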
- Before publishing AI-assisted batches, run a similarity check. A simple cosine on title + first paragraph using OpenAI embeddings catches most issues:
// scripts/similarity-check.mjs (excerpt)
import OpenAI from 'openai';

const client = new OpenAI();

async function embed(text) {
  const r = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return r.data[0].embedding;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Flag any pair with cosine > 0.85 for human review.
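The excerpt stops before the pairwise comparison. One way to finish it, sketched here with precomputed vectors so it runs without an API key: findSimilarPairs and the toy three-dimensional embeddings are hypothetical stand-ins, and the cosine helper is repeated only to keep the block self-contained.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// items: [{ id, embedding }]. Compares every pair once and returns the
// pairs whose similarity exceeds the threshold, highest score first.
function findSimilarPairs(items, threshold = 0.85) {
  const pairs = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const score = cosine(items[i].embedding, items[j].embedding);
      if (score > threshold) {
        pairs.push({ a: items[i].id, b: items[j].id, score });
      }
    }
  }
  return pairs.sort((x, y) => y.score - x.score);
}

// Toy vectors stand in for real embeddings (assumption).
const sample = [
  { id: 'scale-ai-content-safely', embedding: [0.9, 0.1, 0.0] },
  { id: 'scale-content-with-ai-safely', embedding: [0.88, 0.12, 0.01] },
  { id: 'submit-sitemap-search-console', embedding: [0.0, 0.2, 0.95] },
];

for (const { a, b, score } of findSimilarPairs(sample)) {
  console.log(`REVIEW ${a} <-> ${b} (${score.toFixed(3)})`);
}
```

The comparison is quadratic in the number of articles, which is usually fine for a few thousand title-plus-intro embeddings; caching embeddings between runs avoids re-embedding unchanged articles.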
- After the cleanup deploy, ask Google to revalidate. Run URL Inspection on the redirected (merged-away) URL to confirm Google sees the 301, then use “Request indexing” on the surviving URL. For batch redirects, resubmit the sitemap.
Implementation checklist
- Every article has a primaryKeyword field; the audit script flags duplicates in prebuild.
- 301 redirects are in firebase.json / vercel.json / _redirects, not just in the article body.
- Self-canonical is set globally; cross-page canonicals are an opt-in per article.
- The similarity check is wired into your AI-content pipeline, not run manually.
After-launch verification
- Recrawl the merged URL with Search Console URL Inspection and confirm the response code is 301 and the destination is the canonical URL.
- Re-check the Pages report 1-2 weeks later: “Duplicate” reasons should drop.
- Confirm the sitemap no longer lists the merged-away URLs (grep the build output).
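The sitemap check can also be scripted instead of grepped. A minimal sketch, assuming the sitemap is a single XML file with plain loc entries; it uses a regular expression rather than a real XML parser, and the function names and sample URLs are hypothetical.

```javascript
// Pull the path of every <loc> entry out of a sitemap XML string.
function sitemapPaths(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)]
    .map((m) => new URL(m[1]).pathname);
}

// Return the retired paths that are still listed in the sitemap.
function staleSitemapEntries(xml, retiredPaths) {
  const listed = new Set(sitemapPaths(xml));
  return retiredPaths.filter((p) => listed.has(p));
}

// Sample sitemap with one merged-away URL still present (assumption).
const sampleSitemap = `<urlset>
  <url><loc>https://example.com/articles/scale-content-with-ai-safely</loc></url>
  <url><loc>https://example.com/articles/scale-ai-content-safely</loc></url>
</urlset>`;

console.log(staleSitemapEntries(sampleSitemap, ['/articles/scale-ai-content-safely']));
// [ '/articles/scale-ai-content-safely' ]
```

If your build emits trailing slashes, normalize both sides before comparing; the sketch compares paths exactly as written.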
Common pitfalls
- Trusting that “different titles = different articles”. Intent matters more than wording — failure mode is two articles trading impressions in Search Console.
- Using canonical to “hide” duplicates without fixing them. Google may ignore your canonical when it disagrees — you will see “Google chose different canonical” in the Pages report.
- Generating “10 best X for [profession]” pages where each profession version is 90% identical. This pattern can trigger helpful-content demotions and drag down the whole cluster.
- Treating duplication as a one-time cleanup. New duplicates appear every month — wire the check into prebuild.
- 301-ing to a URL that itself 301s elsewhere. Chains lose signal; always redirect once.
- Forgetting to remove the merged URL from your sitemap — Google will keep crawling and re-flagging it.
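Redirect chains are easy to catch before deploy if the redirects live in a simple source-to-destination map. The sketch below rewrites every source to point at its final destination and throws on loops; flattenRedirects and the sample paths are hypothetical.

```javascript
// Resolve every source in a redirect map to its final destination, so
// that no redirect points at a URL that itself redirects.
// Throws if the map contains a loop.
function flattenRedirects(map) {
  const flat = {};
  for (const source of Object.keys(map)) {
    const seen = new Set([source]);
    let target = map[source];
    while (Object.hasOwn(map, target)) {
      if (seen.has(target)) throw new Error(`Redirect loop at ${target}`);
      seen.add(target);
      target = map[target];
    }
    flat[source] = target;
  }
  return flat;
}

// '/articles/old' -> '/articles/newer' -> '/articles/newest' is a chain.
const chained = {
  '/articles/old': '/articles/newer',
  '/articles/newer': '/articles/newest',
};

console.log(flattenRedirects(chained));
// { '/articles/old': '/articles/newest', '/articles/newer': '/articles/newest' }
```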
FAQ
- Are translated versions of the same article duplicates?: No, if they’re properly tagged with hreflang. EN and ZH versions of the same article are two different URLs for two different audiences. Make sure both sides reference each other via <link rel="alternate" hreflang="...">.
- Does Google penalize duplication?: Not as a manual penalty in most cases, but it suppresses the weaker version and can affect site-wide quality signals once the ratio gets high.
- Can I just noindex duplicates?: Yes, but it wastes the writing effort. Merging via 301 keeps any link equity. Use noindex only when there is nothing to merge into.
- How do I detect near-duplicates at scale?: Embeddings + cosine similarity on title + first paragraph works well. Anything above ~0.85 deserves human review.
- What threshold should the similarity script flag?: Start at 0.85 for review and 0.92 for auto-block in CI. Raise the thresholds if you keep getting false positives on closely related but distinct topics.
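The two thresholds from the last answer map naturally onto a three-way decision. A minimal sketch, assuming scores at or above a threshold count as crossing it (the article does not specify boundary handling); the function names are hypothetical.

```javascript
// Starting points from the FAQ; tune them for your own content.
const REVIEW_AT = 0.85;
const BLOCK_AT = 0.92;

// Map a cosine score to 'block', 'review', or 'ok'.
// Treating a score equal to a threshold as crossing it is an assumption.
function classifySimilarity(score, { review = REVIEW_AT, block = BLOCK_AT } = {}) {
  if (score >= block) return 'block';
  if (score >= review) return 'review';
  return 'ok';
}

// In CI, fail the job only when some pair reaches the block threshold.
function shouldFailBuild(scores) {
  return scores.some((s) => classifySimilarity(s) === 'block');
}

console.log(classifySimilarity(0.95)); // block
console.log(classifySimilarity(0.88)); // review
console.log(classifySimilarity(0.5)); // ok
console.log(shouldFailBuild([0.4, 0.88])); // false
```

Passing overrides, for example classifySimilarity(score, { review: 0.88, block: 0.95 }), lets a noisy category use stricter thresholds without changing the defaults.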
Related
- Pillar and cluster pages
- Running a site-wide content audit
- Scaling content with AI safely
- Canonical URL explained
- Search Console canonical explained
- Content Site Quarterly Review Cadence That Actually Catches Drift
- Content Site Staffing and Roles for Solo to 5-Person Team
Tags: #Indie dev #Content ops #SEO #Website planning #Canonical #Technical SEO