How to Avoid Content Duplication When You're Scaling Fast

Duplication kills indexing once you cross a few hundred articles. Here's the script-driven workflow to catch it before Google does.

Content duplication on a content site is rarely two identical articles. It is almost always two articles targeting the same intent — different titles, different wording, same underlying query. Google picks one and demotes the other. On a fast-scaling site, this happens constantly unless you have a script-driven process.

Background

There are three kinds of duplication: (1) identical-content (same article published twice — rare), (2) near-duplicate (paraphrased same article — common with AI assistance), and (3) duplicate-intent (different articles targeting the same query — the worst kind because it looks fine to a human). Each needs a different fix, and you cannot do any of it by hand once you cross ~200 articles.

How to tell

  • Two articles consistently swap positions in Search Console for the same query.
  • The “Indexed” count in Search Console is well below the “Submitted” count; a gap of more than about 5% is worth investigating.
  • “Duplicate without user-selected canonical” or “Duplicate, Google chose different canonical” appears in the Pages report.
  • Two articles’ H1s, stripped of modifiers, would mean the same thing.
  • Your sitemap has more entries than your unique primary keywords.
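The "H1s stripped of modifiers" check can be roughed out in a few lines. The sketch below is only an illustration: the MODIFIERS list and the intentKey and findSameIntentTitles helpers are hypothetical names, the word list is a starting point rather than a complete one, and the regexes assume English titles.

```javascript
// Hypothetical modifier list: extend it with the filler words your own titles use.
const MODIFIERS = new Set([
  'a', 'an', 'the', 'to', 'for', 'in', 'of', 'your', 'how',
  'best', 'ultimate', 'complete', 'guide', 'easy', 'quick', 'step', 'by',
]);

// Reduce a title to its sorted core terms so word order and filler do not matter.
function intentKey(title) {
  return title
    .toLowerCase()
    .replace(/\b(19|20)\d{2}\b/g, ' ') // drop years such as "2024"
    .replace(/[^a-z0-9\s]/g, ' ')      // drop punctuation (English-only assumption)
    .split(/\s+/)
    .filter((word) => word && !MODIFIERS.has(word))
    .sort()
    .join(' ');
}

// Group titles by intent key and return only the groups with more than one title.
function findSameIntentTitles(titles) {
  const groups = new Map();
  for (const title of titles) {
    const key = intentKey(title);
    if (!key) continue;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(title);
  }
  return [...groups.values()].filter((group) => group.length > 1);
}
```

Feed it every H1 in the collection and review each returned group by hand; a shared key is a hint that two articles chase the same query, not proof.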

Before you start

  • Confirm the job: this is a content-ops cleanup, not a launch. Block 1-2 hours of focus.
  • Have a backup of the content collection — git status should be clean before you run merge scripts.
  • Make sure your hosting layer supports 301 redirects (a Netlify-style _redirects file for a static Astro build, Firebase Hosting redirects, or Vercel’s vercel.json).

Step by step

  1. Add a primaryKeyword field to every article’s frontmatter. This is the single string that tells you what the article is for. Existing example shape:
---
title: "How to Submit a Sitemap to Search Console"
urlSlug: "submit-sitemap-search-console"
primaryKeyword: "submit sitemap search console"
category: "indie-dev"
---
  2. Run a duplicate-keyword report. A short Node script (it needs the gray-matter package) walks your content collection and prints any keyword shared by two or more articles:
// scripts/find-duplicate-keywords.mjs
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter';

const ROOT = 'src/content/articles/en';
const byKw = new Map();

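for (const cat of readdirSync(ROOT, { withFileTypes: true })) {
  if (!cat.isDirectory()) continue; // skip stray files next to the category folders
  for (const file of readdirSync(join(ROOT, cat.name))) {
    if (!file.endsWith('.mdx')) continue;
    const { data } = matter(readFileSync(join(ROOT, cat.name, file), 'utf8'));
    // Normalize case and whitespace so trivial variants collide.
    const kw = (data.primaryKeyword || '').toLowerCase().trim();
    if (!kw) continue;
    if (!byKw.has(kw)) byKw.set(kw, []);
    byKw.get(kw).push(`${cat.name}/${file}`);
  }
}

for (const [kw, files] of byKw) {
  if (files.length < 2) continue;
  console.log(`DUP "${kw}":\n  ${files.join('\n  ')}`);
  process.exitCode = 1; // non-zero exit so the prebuild check actually fails
}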

Run it as part of npm run audit:content so duplicates fail the prebuild check.

  3. For near-duplicates, merge with a 301. Pick the stronger URL (higher impressions in Search Console), move any unique content into it, then add a redirect. Firebase example:
{
  "hosting": {
    "redirects": [
      { "source": "/articles/scale-ai-content-safely",
        "destination": "/articles/scale-content-with-ai-safely",
        "type": 301 }
    ]
  }
}

Astro static + Netlify-style _redirects:

/articles/scale-ai-content-safely  /articles/scale-content-with-ai-safely  301
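If you deploy to more than one host, it may help to keep the merges in a single map and generate both formats from it. This is a minimal sketch under that assumption; the merges object and the toFirebaseRedirects and toRedirectsFile helpers are hypothetical names, not part of any hosting SDK.

```javascript
// Hypothetical merge map: retired path -> surviving path.
const merges = {
  '/articles/scale-ai-content-safely': '/articles/scale-content-with-ai-safely',
};

// Build the array that goes under "hosting.redirects" in firebase.json.
function toFirebaseRedirects(map) {
  return Object.entries(map).map(([source, destination]) => ({
    source,
    destination,
    type: 301,
  }));
}

// Build the body of a Netlify-style _redirects file, one rule per line.
function toRedirectsFile(map) {
  const lines = Object.entries(map).map(
    ([source, destination]) => `${source}  ${destination}  301`,
  );
  return lines.join('\n') + '\n';
}

console.log(JSON.stringify(toFirebaseRedirects(merges), null, 2));
console.log(toRedirectsFile(merges));
```

Keeping one source of truth means a merge is recorded once, and the host-specific files cannot drift apart.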
  4. For duplicate-intent, narrow scope or noindex. Either split the angle (one beginner, one advanced) by rewriting the H1 and primaryKeyword, or set noindex: true in the weaker article’s frontmatter so the layout emits a robots meta tag (use draft: true instead if you would rather unpublish it). In the article layout:
{frontmatter.noindex && <meta name="robots" content="noindex,follow" />}
  5. Set a self-canonical on every page by default, and only override when you are sure the target is stronger. In an Astro layout:
<link rel="canonical" href={new URL(Astro.url.pathname, Astro.site).href} />
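When building the canonical URL, resolving the path against the site origin with the standard URL constructor is safer than gluing strings together: a site value that ends in "/" joined to a pathname that starts with "/" would otherwise produce a double slash. A small sketch, where canonicalFor is a hypothetical helper and its arguments stand in for Astro.url.pathname and Astro.site:

```javascript
// Resolve a request path against the site origin.
// `pathname` and `site` stand in for Astro.url.pathname and Astro.site.
function canonicalFor(pathname, site) {
  return new URL(pathname, site).href;
}

// Naive concatenation, shown only for comparison.
function naiveCanonical(pathname, site) {
  return `${site}${pathname}`;
}

console.log(canonicalFor('/articles/submit-sitemap-search-console', 'https://example.com/'));
console.log(naiveCanonical('/articles/submit-sitemap-search-console', 'https://example.com/'));
```

The first call prints a clean URL; the second prints the same URL with "//" after the host, which is a different address as far as a crawler is concerned.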
  6. Before publishing AI-assisted batches, run a similarity check. Cosine similarity over embeddings of title + first paragraph (OpenAI embeddings here) catches most issues:
// scripts/similarity-check.mjs (excerpt)
import OpenAI from 'openai';
const client = new OpenAI();

async function embed(text) {
  const r = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return r.data[0].embedding;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Flag any pair with cosine > 0.85 for human review.
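The excerpt stops before the comparison itself. One way to finish it, assuming you have already collected one embedding per article, is a plain pairwise loop. The flagSimilarPairs helper and the { slug, embedding } item shape are assumptions for this sketch, and cosine is repeated so the block runs on its own.

```javascript
// Cosine similarity of two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// items: [{ slug, embedding }]. Compares every pair once and returns the
// pairs scoring above the threshold, most similar first.
function flagSimilarPairs(items, threshold = 0.85) {
  const flagged = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const score = cosine(items[i].embedding, items[j].embedding);
      if (score > threshold) {
        flagged.push({ a: items[i].slug, b: items[j].slug, score });
      }
    }
  }
  return flagged.sort((x, y) => y.score - x.score);
}
```

The loop is quadratic in the number of articles, which is fine for a few hundred. For a new batch, comparing only the new items against the existing set keeps the work down.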
  7. After the cleanup deploy, ask Google to revalidate. Use URL Inspection on the merged URL, then “Request indexing” on the surviving one. For batch redirects, resubmit the sitemap.

Implementation checklist

  • Every article has a primaryKeyword field; the audit script flags duplicates in prebuild.
  • 301 redirects are in firebase.json / vercel.json / _redirects, not just in the article body.
  • Self-canonical is set globally; cross-page canonicals are an opt-in per article.
  • The similarity check is wired into your AI-content pipeline, not run manually.

After-launch verification

  • Recrawl the merged URL with Search Console URL Inspection and confirm the response code is 301 and the destination is the canonical URL.
  • Re-check the Pages report 1-2 weeks later: “Duplicate” reasons should drop.
  • Confirm the sitemap no longer lists the merged-away URLs (grep the build output).
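The sitemap check in the last item can be scripted instead of grepped. A minimal sketch, assuming a flat sitemap with absolute URLs in its loc elements; staleSitemapEntries and its helpers are hypothetical names, and the regex is a shortcut that a real XML parser would replace for anything more complex.

```javascript
// Treat "/a/" and "/a" as the same path when comparing.
function stripTrailingSlash(path) {
  return path.length > 1 && path.endsWith('/') ? path.slice(0, -1) : path;
}

// Pull the path of every <loc> entry out of the sitemap XML.
function sitemapPaths(xml) {
  const matches = xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g);
  return [...matches].map((m) => new URL(m[1]).pathname);
}

// Return the redirect sources that are still listed in the sitemap.
function staleSitemapEntries(xml, redirectSources) {
  const listed = new Set(sitemapPaths(xml).map(stripTrailingSlash));
  return redirectSources.filter((source) => listed.has(stripTrailingSlash(source)));
}
```

Run it against the built sitemap file with the list of redirect sources; an empty array means no merged-away URL is still being submitted.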

Common pitfalls

  • Trusting that “different titles = different articles”. Intent matters more than wording; the failure mode is two articles trading impressions in Search Console.
  • Using canonical to “hide” duplicates without fixing them. Google may ignore your canonical when it disagrees — you will see “Google chose different canonical” in the Pages report.
  • Generating “10 best X for [profession]” pages where each profession version is 90% identical. Google can treat this pattern as unhelpful, mass-produced content, and the demotion can spread to the whole cluster.
  • Treating duplication as a one-time cleanup. New duplicates appear every month — wire the check into prebuild.
  • 301-ing to a URL that itself 301s elsewhere. Chains lose signal; point every redirect straight at the final URL.
  • Forgetting to remove the merged URL from your sitemap — Google will keep crawling and re-flagging it.
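Redirect chains are easy to remove mechanically if the rules live in a source-to-destination map. The sketch below assumes that shape; flattenRedirects is a hypothetical helper that rewrites every rule to point at its final destination and refuses to continue if it finds a loop.

```javascript
// Follow each redirect to its final destination so every rule is a single hop.
// map: { [sourcePath]: destinationPath }. Throws if the rules contain a loop.
function flattenRedirects(map) {
  const flat = {};
  for (const source of Object.keys(map)) {
    const seen = new Set([source]);
    let target = map[source];
    // Keep following while the current target is itself redirected.
    while (Object.hasOwn(map, target)) {
      if (seen.has(target)) {
        throw new Error(`Redirect loop involving ${target}`);
      }
      seen.add(target);
      target = map[target];
    }
    flat[source] = target;
  }
  return flat;
}
```

Running this before writing the redirect config means a second merge of an already-merged article never leaves a two-hop path behind.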

FAQ

  • Are translated versions of the same article duplicates?: No, if they’re properly tagged with hreflang. EN and ZH versions of the same article are two different URLs for two different audiences. Make sure both sides reference each other via <link rel="alternate" hreflang="...">.
  • Does Google penalize duplication?: Not as a manual penalty in most cases, but it suppresses the weaker version and can affect site-wide quality signals once the ratio gets high.
  • Can I just noindex duplicates?: Yes, but it wastes the writing effort. Merging via 301 keeps any link equity. Use noindex only when there is nothing to merge into.
  • How do I detect near-duplicates at scale?: Embeddings + cosine similarity on title + first paragraph works well. Anything above ~0.85 deserves human review.
  • What threshold should the similarity script flag?: Start at 0.85 for review and 0.92 for auto-block in CI. Raise both thresholds if you keep getting false positives on closely related but distinct topics.
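The two-tier threshold from the last answer maps naturally onto a CI gate: pairs in the review band are printed, and pairs in the block band also fail the job. A sketch with the starting values above; classifyPair, gate and the { a, b, score } pair shape are assumptions for illustration.

```javascript
// Starting values from the FAQ; tune them against your own false positives.
const REVIEW_THRESHOLD = 0.85;
const BLOCK_THRESHOLD = 0.92;

// Map a similarity score to an action.
function classifyPair(score) {
  if (score > BLOCK_THRESHOLD) return 'block';
  if (score > REVIEW_THRESHOLD) return 'review';
  return 'ok';
}

// pairs: [{ a, b, score }]. Logs every pair that needs attention and
// returns the exit code a CI step could use (1 if anything is blocked).
function gate(pairs) {
  let exitCode = 0;
  for (const pair of pairs) {
    const verdict = classifyPair(pair.score);
    if (verdict === 'ok') continue;
    console.log(`${verdict.toUpperCase()} ${pair.score.toFixed(3)} ${pair.a} <-> ${pair.b}`);
    if (verdict === 'block') exitCode = 1;
  }
  return exitCode;
}
```

In a pipeline script, assigning the result to process.exitCode lets the job fail on blocked pairs while review-band pairs only show up in the log.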

Tags: #Indie dev #Content ops #SEO #Website planning #Canonical #Technical SEO