Avoid Content Duplication When Scaling a Content Site Fast

Past a few hundred articles, duplication quietly kills indexing. Here is the script-driven workflow to catch it before Google does — with real embedding costs and Search Console signals.

Published: May 15, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

Content duplication on a content site is rarely two identical articles. It is almost always two articles targeting the same intent: different titles, different wording, same underlying query. Google picks one as canonical and demotes the other. On a fast-scaling site — especially one where AI helps draft batches — this happens every month unless you run a script-driven process that catches it before Googlebot does.

TL;DR

Three kinds of duplication exist: identical, near-duplicate, and duplicate-intent. The last one looks fine to a human and does the most damage.
Tag every article with a primaryKeyword and fail the prebuild when two articles share one.
Merge near-duplicates with a 301 redirect (a strong canonical signal per Google), not a canonical tag alone.
Catch near-duplicates before publishing with embeddings + cosine similarity. text-embedding-3-small costs $0.02 per 1M tokens (June 2026), so screening 2,000 articles is a few cents.
Canonical changes take roughly 2 to 6 weeks for Google to honor, so cleanup is not instant.

Three kinds of duplication, three different fixes

Type	What it is	How common	Fix
Identical	Same article published at two URLs	Rare	301 the weaker URL
Near-duplicate	Paraphrased version of the same article	Common with AI drafting	Merge unique content, then 301
Duplicate-intent	Different articles answering the same query	The worst — invisible to a human	Re-scope (split angles) or noindex the weaker one

You cannot do any of this by hand once you cross roughly 200 articles. Each type needs its own detection signal and its own fix, which is why this has to be automated.

How to tell you have a duplication problem

Two articles consistently swap positions in Search Console for the same query.
The Indexed count in the Pages report is well below Submitted — a gap above 5% on a clean site is worth investigating.
“Duplicate without user-selected canonical” or “Duplicate, Google chose different canonical” shows up under “Why pages aren’t indexed” in the Pages report. The first means Google found duplicates and you set no preference; the second means Google overrode the canonical you set.
Two articles’ H1s, stripped of modifiers, would mean the same thing.
Your sitemap has more entries than you have unique primary keywords.
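The 5% rule of thumb above is easy to turn into a quick check. A minimal sketch, assuming you copy the Submitted and Indexed counts from the Pages report by hand; the function names and sample numbers are illustrative and not part of any Search Console API:

```javascript
// Share of submitted URLs that Google has not indexed (0.08 means 8%).
function indexingGap(submitted, indexed) {
  if (submitted <= 0) throw new RangeError('submitted must be positive');
  return (submitted - indexed) / submitted;
}

// The 5% default mirrors the rule of thumb in the list above.
function needsInvestigation(submitted, indexed, threshold = 0.05) {
  return indexingGap(submitted, indexed) > threshold;
}

console.log(needsInvestigation(1200, 1104)); // true  (8% gap)
console.log(needsInvestigation(1200, 1164)); // false (3% gap)
```

A gap on its own does not prove duplication; it only tells you the "Why pages aren’t indexed" reasons are worth reading.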

Before you start

This is a content-ops cleanup, not a launch. Block 1 to 2 hours of focus.
Back up the content collection. git status should be clean before you run any merge or redirect script.
Confirm your hosting layer supports 301 redirects: a _redirects file for an Astro static build on a Netlify-style host, redirects in firebase.json for Firebase Hosting, or redirects in vercel.json for Vercel.

Step by step

1. Add a `primaryKeyword` field to every article

This is the single string that states what the article is for. It is the anchor for every duplicate check that follows.

---
title: "How to Submit a Sitemap to Search Console"
urlSlug: "submit-sitemap-search-console"
primaryKeyword: "submit sitemap search console"
category: "indie-dev"
---
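Since every later check keys off this field, it can help to fail early on articles that lack it. A minimal sketch, assuming the frontmatter has already been parsed into plain objects (for example with gray-matter); `findUntagged` and the sample file names are hypothetical:

```javascript
// Returns the files whose frontmatter has no usable primaryKeyword.
function findUntagged(articles) {
  return articles
    .filter(({ data }) =>
      typeof data.primaryKeyword !== 'string' || data.primaryKeyword.trim() === '')
    .map(({ file }) => file);
}

// Hypothetical sample: one tagged article, one missing the field, one blank.
const sampleArticles = [
  { file: 'indie-dev/submit-sitemap-search-console.mdx',
    data: { primaryKeyword: 'submit sitemap search console' } },
  { file: 'indie-dev/untitled-draft.mdx', data: { title: 'Untitled' } },
  { file: 'indie-dev/blank-keyword.mdx', data: { primaryKeyword: '   ' } },
];

console.log(findUntagged(sampleArticles));
// [ 'indie-dev/untitled-draft.mdx', 'indie-dev/blank-keyword.mdx' ]
```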

2. Run a duplicate-keyword report

A ~30-line Node script over your content collection prints any keyword shared by two articles:

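// scripts/find-duplicate-keywords.mjs
// Groups articles by primaryKeyword and exits non-zero when a keyword is shared.
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter';

const ROOT = 'src/content/articles/en';
const byKw = new Map(); // normalized keyword -> list of "category/file.mdx"

for (const cat of readdirSync(ROOT, { withFileTypes: true })) {
  if (!cat.isDirectory()) continue; // skip stray files at the collection root
  for (const file of readdirSync(join(ROOT, cat.name))) {
    if (!file.endsWith('.mdx')) continue;
    const { data } = matter(readFileSync(join(ROOT, cat.name, file), 'utf8'));
    const kw = String(data.primaryKeyword || '').toLowerCase().trim();
    if (!kw) continue; // untagged articles are not covered by this check
    if (!byKw.has(kw)) byKw.set(kw, []);
    byKw.get(kw).push(`${cat.name}/${file}`);
  }
}

let dupes = 0;
for (const [kw, files] of byKw) {
  if (files.length < 2) continue;
  dupes++;
  console.log(`DUP "${kw}":\n  ${files.join('\n  ')}`);
}

// A non-zero exit code is what makes the prebuild fail.
if (dupes > 0) process.exitCode = 1;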

Wire it into npm run audit:content so duplicate keywords fail the prebuild instead of slipping into production.
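An exact string match misses keywords that differ only in word order or filler words. One possible refinement, offered as an assumption rather than part of the workflow above, is to compare a normalized key instead. The stop-word list and the `keywordKey` name are illustrative, and this tokenizer only handles ASCII keywords:

```javascript
// Assumption: word order and a few filler words do not change search intent.
const STOP_WORDS = new Set(['a', 'an', 'the', 'to', 'for', 'in', 'of', 'how']);

// Lowercase, split on non-alphanumerics, drop filler words, sort the rest.
function keywordKey(keyword) {
  return keyword
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word && !STOP_WORDS.has(word))
    .sort()
    .join(' ');
}

console.log(keywordKey('submit sitemap search console'));
// console search sitemap submit
console.log(keywordKey('How to Submit a Sitemap to Search Console'));
// console search sitemap submit
```

Treat matches on the normalized key as candidates for review, not automatic failures, because sorting words can occasionally merge two keywords that mean different things.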

3. Merge near-duplicates with a 301

Pick the stronger URL (higher impressions in Search Console), move any unique content into it, then add a redirect. Google’s canonicalization docs list a redirect as a strong canonicalization signal, alongside a rel="canonical" tag, and rank sitemap inclusion as a weak one. Unlike the tag, a redirect also takes the duplicate URL out of circulation. Firebase Hosting example:

{
  "hosting": {
    "redirects": [
      { "source": "/articles/scale-ai-content-safely",
        "destination": "/articles/scale-content-with-ai-safely",
        "type": 301 }
    ]
  }
}

Astro static with a Netlify-style _redirects file:

/articles/scale-ai-content-safely  /articles/scale-content-with-ai-safely  301
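Whichever file holds the redirects, each old URL should point straight at its final destination. A minimal sketch of flattening a redirect list before writing it out, assuming the redirects are kept as `[source, destination]` pairs; `flattenRedirects` and the `/articles/old-draft` path are hypothetical:

```javascript
// Rewrites every redirect so it points at the end of its chain.
function flattenRedirects(pairs) {
  const map = new Map(pairs);
  const flat = [];
  for (const [source, first] of map) {
    const seen = new Set([source]);
    let target = first;
    // Follow the chain until the target is not itself redirected.
    while (map.has(target)) {
      if (seen.has(target)) throw new Error(`Redirect loop involving ${target}`);
      seen.add(target);
      target = map.get(target);
    }
    flat.push([source, target]);
  }
  return flat;
}

const chained = [
  ['/articles/old-draft', '/articles/scale-ai-content-safely'],
  ['/articles/scale-ai-content-safely', '/articles/scale-content-with-ai-safely'],
];

// Print the result in _redirects format: source, destination, status.
for (const [source, target] of flattenRedirects(chained)) {
  console.log(`${source}  ${target}  301`);
}
```

Both old URLs now redirect once, directly to `/articles/scale-content-with-ai-safely`, and a loop throws instead of shipping.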

4. For duplicate-intent, re-scope or noindex

Either split the angle (one beginner article, one advanced) by rewriting the H1 and primaryKeyword so each truly targets a distinct query, or set noindex: true in the weaker article’s frontmatter so the layout emits a noindex meta tag. In the article layout:

{frontmatter.noindex && <meta name="robots" content="noindex,follow" />}

Use noindex,follow (not noindex,nofollow) so the page still passes link signals while staying out of the index.

5. Set a self-canonical on every page by default

Self-canonical everything, then override only when you are certain the target is stronger. In an Astro layout:

<link rel="canonical" href={new URL(Astro.url.pathname, Astro.site).href} />
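When building the canonical href, the standard `URL` constructor is a safer way to join the site origin and the path than gluing strings together. A small sketch in plain Node, where `site` stands in for `Astro.site` and the path for `Astro.url.pathname`; the domain and helper name are made up:

```javascript
// Resolves the path against the site origin, so slashes are handled for us.
function canonicalHref(pathname, site) {
  return new URL(pathname, site).href;
}

const site = new URL('https://example.com');
const pathname = '/articles/submit-sitemap-search-console/';

// A URL object stringifies with a trailing slash, so concatenation doubles it.
console.log(`${site}${pathname}`);
// https://example.com//articles/submit-sitemap-search-console/

console.log(canonicalHref(pathname, site));
// https://example.com/articles/submit-sitemap-search-console/
```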

Never use robots.txt or the URL removal tool to “fix” duplicates — Google’s docs explicitly warn against both. A robots-disallowed URL can still be indexed without its content, and the removal tool hides every version of a URL from Search.

6. Run a similarity check before publishing AI-assisted batches

A cosine-similarity pass on title + first paragraph using OpenAI embeddings catches most near-duplicates before they ship. As of June 2026, text-embedding-3-small is $0.02 per 1M input tokens ($0.01 with the Batch API), with 1,536 dimensions and an 8,192-token context window. Screening a 2,000-article library costs a few cents, so there is no reason to run this by hand.

// scripts/similarity-check.mjs (excerpt)
import OpenAI from 'openai';
const client = new OpenAI();

async function embed(text) {
  const r = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return r.data[0].embedding;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Flag any pair with cosine > 0.85 for human review.
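To make the flagging step concrete, here is a minimal sketch of the pairwise pass, assuming the embeddings have already been fetched. It uses the 0.85 review and 0.92 block thresholds from the FAQ; `classifyPairs`, the slugs, and the three-dimensional toy vectors are illustrative only (real embeddings have 1,536 dimensions):

```javascript
const REVIEW_THRESHOLD = 0.85; // above this: send the pair to a human
const BLOCK_THRESHOLD = 0.92;  // above this: fail the CI run

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Compares every unordered pair once and returns the flagged ones, highest score first.
function classifyPairs(items) {
  const flagged = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const score = cosine(items[i].embedding, items[j].embedding);
      if (score <= REVIEW_THRESHOLD) continue;
      const action = score > BLOCK_THRESHOLD ? 'block' : 'review';
      flagged.push({ a: items[i].slug, b: items[j].slug, score, action });
    }
  }
  return flagged.sort((x, y) => y.score - x.score);
}

// Toy vectors chosen so one pair blocks, one needs review, one passes.
const toyItems = [
  { slug: 'scale-ai-content-safely', embedding: [1, 0, 0] },
  { slug: 'scale-content-with-ai-safely', embedding: [0.9, 0.1, 0] },
  { slug: 'content-volume-vs-quality', embedding: [0.8, 0.6, 0] },
];

for (const pair of classifyPairs(toyItems)) {
  console.log(`${pair.action.toUpperCase()} ${pair.score.toFixed(3)} ${pair.a} <-> ${pair.b}`);
}
```

The loop is quadratic, which is about two million comparisons for 2,000 articles and still quick to run locally.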

If you want sharper separation between closely related topics, the larger models cost more but rarely change the verdict for this job:

Model (June 2026)	Price / 1M tokens	Dimensions	Notes
`text-embedding-3-small`	$0.02	1,536	Default; cheapest, fine for dedup
`text-embedding-3-large`	$0.13	3,072	More nuance, 6.5x the cost
`gemini-embedding-001`	$0.15	up to 3,072	Matryoshka truncation; 768 dims is the recommended sweet spot

For deduplication, 3-small is almost always the right call: the threshold matters far more than the model.
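The cost difference is simple arithmetic on the prices in the table. A rough estimator, assuming a deliberately generous 1,000 tokens per article (title plus first paragraph is usually far less); the token figure and the `estimateCost` name are assumptions, and the prices are the June 2026 values quoted above:

```javascript
// USD per 1M input tokens, copied from the table above.
const PRICE_PER_MILLION = {
  'text-embedding-3-small': 0.02,
  'text-embedding-3-large': 0.13,
  'gemini-embedding-001': 0.15,
};

// Cost in USD to embed `articleCount` texts of `tokensPerArticle` tokens each.
function estimateCost(model, articleCount, tokensPerArticle) {
  const price = PRICE_PER_MILLION[model];
  if (price === undefined) throw new Error(`Unknown model: ${model}`);
  return ((articleCount * tokensPerArticle) / 1e6) * price;
}

for (const model of Object.keys(PRICE_PER_MILLION)) {
  console.log(`${model}: $${estimateCost(model, 2000, 1000).toFixed(2)}`);
}
// text-embedding-3-small: $0.04
// text-embedding-3-large: $0.26
// gemini-embedding-001: $0.30
```

Even the most expensive option stays under a dollar at this scale, so the choice is about convenience more than budget.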

7. After the cleanup deploy, ask Google to revalidate

Use URL Inspection on the old, merged-away URL and confirm it returns the 301, then click “Request indexing” on the surviving URL. For batch redirects, resubmit the sitemap. Expect roughly 2 to 6 weeks for Google to recognize the canonical change, depending on how often your site is crawled.
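Before asking Google to look, you can confirm the redirect yourself from a script. A hedged sketch, assuming Node 18 or later (where `fetch` is global) and that your host answers with a plain 301 and a `Location` header; `evaluateRedirect` and `checkRedirect` are hypothetical names and example.com stands in for your domain:

```javascript
// Pure check: is this response a single 301 hop to the expected URL?
function evaluateRedirect(status, location, expectedUrl, requestUrl) {
  if (status !== 301) return { ok: false, reason: `expected 301, got ${status}` };
  if (!location) return { ok: false, reason: 'missing Location header' };
  // Location may be relative, so resolve it against the requested URL.
  const resolved = new URL(location, requestUrl).href;
  if (resolved !== new URL(expectedUrl).href) {
    return { ok: false, reason: `redirects to ${resolved}` };
  }
  return { ok: true, reason: 'ok' };
}

// Network wrapper: redirect 'manual' stops fetch from following the hop.
async function checkRedirect(oldUrl, expectedUrl, fetchImpl = fetch) {
  const res = await fetchImpl(oldUrl, { redirect: 'manual' });
  return evaluateRedirect(res.status, res.headers.get('location'), expectedUrl, oldUrl);
}

// Offline demonstration of the pure check with made-up values.
console.log(evaluateRedirect(
  301,
  '/articles/scale-content-with-ai-safely',
  'https://example.com/articles/scale-content-with-ai-safely',
  'https://example.com/articles/scale-ai-content-safely'
));
// { ok: true, reason: 'ok' }
```

This does not replace URL Inspection, which shows what Googlebot saw; it only catches a missing or misdirected redirect before you spend an indexing request on it.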

Implementation checklist

Every article has a primaryKeyword; the audit script flags duplicates during prebuild.
301 redirects live in firebase.json / vercel.json / _redirects, not just in the article body.
Self-canonical is set globally; cross-page canonicals are opt-in per article.
The similarity check is wired into your AI-content pipeline, not run manually.

After-launch verification

Inspect the merged-away URL with Search Console URL Inspection and confirm the response code is 301 and the destination is the canonical URL.
Re-check the Pages report 2 to 4 weeks later: the page counts under the “Duplicate” reasons should be falling, though the full change can take up to 6 weeks.
Confirm the sitemap no longer lists the merged-away URLs (grep the build output).
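The sitemap check in the last item can be scripted instead of grepped. A minimal sketch, assuming a standard `<urlset>` sitemap with absolute URLs and a list of redirected source paths; the regex extraction is a simplification rather than a full XML parser, and the function names and sample URLs are illustrative:

```javascript
// Treat "/a/b" and "/a/b/" as the same path.
function stripTrailingSlash(path) {
  return path.length > 1 && path.endsWith('/') ? path.slice(0, -1) : path;
}

// Pulls the pathname out of every <loc> entry.
function sitemapPaths(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)]
    .map((match) => new URL(match[1]).pathname);
}

// Returns sitemap paths that are also redirect sources and should be removed.
function findStaleEntries(xml, redirectedPaths) {
  const gone = new Set(redirectedPaths.map(stripTrailingSlash));
  return sitemapPaths(xml).filter((path) => gone.has(stripTrailingSlash(path)));
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/articles/scale-content-with-ai-safely/</loc></url>
  <url><loc>https://example.com/articles/scale-ai-content-safely/</loc></url>
</urlset>`;

console.log(findStaleEntries(sampleSitemap, ['/articles/scale-ai-content-safely']));
// [ '/articles/scale-ai-content-safely/' ]
```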

Common pitfalls

Trusting that “different titles = different articles.” Intent matters more than wording. The tell is two articles trading impressions in Search Console for one query.
Using a canonical tag to “hide” duplicates without fixing them. Google may override your canonical when it disagrees — that is exactly the “Google chose different canonical” status in the Pages report. A 301 is a stronger signal because it removes the duplicate entirely.
Generating “10 best X for [profession]” pages where each version is 90% identical. This pattern reads as scaled, low-value content and can drag the whole cluster down.
Treating duplication as a one-time cleanup. New duplicates appear every batch — wire the check into prebuild.
301-ing to a URL that itself 301s elsewhere. Redirect chains lose signal; always redirect once, directly to the final canonical.
Forgetting to remove the merged URL from your sitemap. Google keeps crawling it and re-flagging it.

FAQ

Are translated versions of the same article duplicates? No, as long as they are paired with hreflang. The EN and ZH versions are two URLs serving two audiences. Make sure each version lists itself and the other via <link rel="alternate" hreflang="..."> tags.
Does Google penalize duplication? Not as a manual action in most cases, but it suppresses the weaker version and can dent site-wide quality signals once the duplicate ratio gets high.
Can I just noindex duplicates? Yes, but it wastes the writing effort. A 301 merge keeps link equity; use noindex,follow only when there is nothing to merge into.
How do I detect near-duplicates at scale? Embeddings plus cosine similarity on title + first paragraph works well. text-embedding-3-small at $0.02 per 1M tokens makes a full-library scan trivially cheap.
What threshold should the similarity script flag? Start at 0.85 for human review and 0.92 for an auto-block in CI. Raise the review threshold if you get false positives on closely related but distinct topics, and lower it if real duplicates slip through.
How long until Google honors the canonical change? Typically 2 to 6 weeks, depending on crawl frequency and site size. Resubmitting the sitemap and requesting indexing speeds it up but does not guarantee a same-week recrawl.
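For the translated-versions answer, the reciprocal tags can be generated from one map so the EN and ZH pages never drift apart. A minimal sketch; the URLs are hypothetical and `hreflangTags` is an illustrative helper, not part of any framework:

```javascript
// Hypothetical URLs; swap in your real EN and ZH routes.
const versions = {
  en: 'https://example.com/articles/avoid-content-duplication',
  zh: 'https://example.com/zh/articles/avoid-content-duplication',
};

// Every language version emits the same full set, including a link to itself.
function hreflangTags(versionMap) {
  return Object.entries(versionMap).map(
    ([lang, href]) => `<link rel="alternate" hreflang="${lang}" href="${href}" />`
  );
}

console.log(hreflangTags(versions).join('\n'));
```

Rendering this same list in the head of both pages keeps the annotations symmetric, which is what lets them be treated as a language pair rather than duplicates.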

External references: Google Search Central — Consolidate duplicate URLs and OpenAI embeddings pricing.

Tags: #Indie dev #Content ops #SEO #Website planning #Canonical #Technical SEO
