Related-Articles Widget Keeps Surfacing Near-Duplicate Pages

Your 'Related articles' panel shows three near-identical posts on every page. Detect the leak, score 'related but not duplicate' with MMR, and stop recommending duplicates to yourself.

Published: May 24, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

You scroll to the bottom of an article on “ChatGPT keeps logging me out” and the Related panel shows three other posts: “ChatGPT session expires too fast,” “ChatGPT auto-logout problem,” and “ChatGPT logged out after 5 minutes.” All three are the same question with slightly different titles, written across a year as you chased the long tail. The recommender promoted them because every signal (tag overlap, title-token Jaccard, embedding cosine) said they were maximally related. Readers see the same article three times, bounce, and Google sees a site recommending duplicates to itself. This is one of the harder problems on a mature content site because the recommender is doing exactly what you asked: it found similar, when you wanted related. Those are different tasks.

Fastest fix: add a hard near-duplicate cutoff (Jaccard > 0.5 or cosine > 0.92) that runs before ranking, then rerank survivors with MMR (Maximal Marginal Relevance) so the top-3 trades a little relevance for diversity. If the audit shows true duplicates, merge them with a 301 instead of hiding them in the widget. Everything below is the long version.

What’s actually at stake (as of June 2026)

Google does not “penalize” duplicate content the way the old myth claims — it clusters near-duplicate URLs and then picks one canonical to represent the group (clustering happens first, canonicalization second). The damage is indirect but real: crawl budget is spent re-crawling siblings instead of new pages, backlink and ranking signals split across the cluster instead of consolidating, and a page stuck as Duplicate without user-selected canonical in Search Console will not be indexed as itself. When your own Related widget links three near-duplicates to each other, you are actively feeding that clustering. See Google’s canonicalization docs: redirects are the strongest canonical signal, rel="canonical" is strong, sitemap inclusion is weak, and internal links should always point at the canonical, never a variant.

Common causes

Ordered by hit rate, highest first.

1. Tag-overlap recommender treats near-duplicate tags as a perfect match

Two articles share 5 of 6 tags. The recommender sorts by tag overlap and they always top each other’s Related list, even though they answer the same question.

How to spot it: For each article, look at its top-3 related items. If a related title shares 70% or more of its words with the host article’s title, the recommender is rewarding near-duplicates.
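The title check above can be sketched as follows. This is a minimal illustration, assuming each article object carries a `title` string; the helper names (`titleWords`, `titleOverlap`, `flagRepetitivePanel`) and the sample titles are made up for the example and are not part of any existing recommender.

```javascript
// Lowercase a title and split it into a set of word tokens (letters and digits only).
function titleWords(title) {
  return new Set(title.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? []);
}

// Fraction of the candidate's title words that also appear in the host title.
function titleOverlap(hostTitle, candidateTitle) {
  const host = titleWords(hostTitle);
  const cand = titleWords(candidateTitle);
  if (cand.size === 0) return 0;
  let shared = 0;
  for (const w of cand) if (host.has(w)) shared++;
  return shared / cand.size;
}

// Return the related items whose titles overlap the host at or above the cutoff.
function flagRepetitivePanel(host, related, cutoff = 0.7) {
  return related.filter(r => titleOverlap(host.title, r.title) >= cutoff);
}

// Hypothetical sample data.
const host = { title: 'ChatGPT keeps logging me out' };
const related = [
  { title: 'ChatGPT keeps logging me out again' },
  { title: 'How to export ChatGPT conversations' },
];
console.log(flagRepetitivePanel(host, related).map(r => r.title));
```

Run it over every rendered panel and sort hosts by the number of flagged items to find the worst offenders first.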

2. Embedding similarity ranks duplicates above complements

You upgraded the recommender to use sentence-embedding cosine. Now near-duplicates score 0.95+ and rank #1, while complementary pieces (the canonical fix, the prevention guide, the related tool) score 0.7-0.8 and fall off the panel.

How to spot it: Print the cosine score next to each related candidate. If your top-3 are all > 0.92, you are recommending duplicates, not related content.
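A minimal sketch of that printout, assuming embeddings are available as plain arrays of numbers. The function names (`cosine`, `auditPanel`), the slugs, and the toy 3-dimensional vectors are illustrative assumptions; real sentence embeddings have hundreds of dimensions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Score each candidate against the host and report whether the whole panel exceeds the cutoff.
function auditPanel(hostVec, candidates, cutoff = 0.92) {
  const scored = candidates.map(c => ({ slug: c.slug, score: cosine(hostVec, c.vec) }));
  return { scored, allNearDup: scored.length > 0 && scored.every(s => s.score > cutoff) };
}

// Hypothetical sample data.
const hostVec = [1, 0, 0];
const panel = [
  { slug: 'session-expires-too-fast', vec: [0.99, 0.1, 0] },
  { slug: 'auto-logout-problem', vec: [0.98, 0, 0.15] },
];
const { scored, allNearDup } = auditPanel(hostVec, panel);
for (const s of scored) console.log(s.slug, s.score.toFixed(3));
console.log('all above cutoff:', allNearDup);
```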

3. Title-token overlap is the only feature

The old recommender only shingled title tokens. Articles with the same 3-4 head terms (“ChatGPT,” “login,” “fail”) form a tight cluster and recommend only each other.

How to spot it: Generate a co-recommendation graph. Tight 3-5 article cliques with no outbound edges are the leak.
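One way to find those cliques, sketched under the assumption that you can export the rendered panels as a map from host slug to recommended slugs. The function names and the single-letter slugs are hypothetical. The idea: follow Related links outward from each article; if the set you can reach stays small, nothing links out of it.

```javascript
// Collect every slug reachable from `start` by following Related links.
function reachable(start, related) {
  const seen = new Set([start]);
  const stack = [start];
  while (stack.length) {
    const slug = stack.pop();
    for (const next of related.get(slug) ?? []) {
      if (!seen.has(next)) { seen.add(next); stack.push(next); }
    }
  }
  return seen;
}

// A reachable set is closed by construction: no link leaves it.
// Small closed sets are the cliques described above.
function closedCliques(related, minSize = 3, maxSize = 5) {
  const found = new Map();
  for (const slug of related.keys()) {
    const group = [...reachable(slug, related)].sort();
    if (group.length >= minSize && group.length <= maxSize) {
      found.set(group.join('|'), group); // dedupe groups found from different members
    }
  }
  return [...found.values()];
}

// Hypothetical sample graph: a, b, c only recommend each other.
const related = new Map([
  ['a', ['b', 'c']], ['b', ['a', 'c']], ['c', ['a', 'b']],
  ['d', ['a', 'e']], ['e', ['d', 'f']], ['f', ['g']], ['g', ['h']], ['h', ['d']],
]);
console.log(closedCliques(related));
```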

4. You have actual duplicates that should be merged, not de-duplicated in the widget

The recommender is correctly surfacing duplicates because the duplicates exist. The fix is canonicalization in content, not a smarter widget.

How to spot it: Read the three “related” articles. If you cannot articulate one sentence of difference, they are duplicates and the widget is the symptom, not the bug.

5. Same canonical with multiple URLs leaks into Related

Tag pages, category pages, and an old slug all resolve to the same canonical article. The recommender treats each URL as a candidate and surfaces all three.

How to spot it: Follow each Related link through its redirects. If two or more items resolve to the same canonical URL, this is the leak.

6. Manual editorial picks override the algorithm, and editors picked duplicates

You shipped a “manual related” field in the CMS. Editors filled it with the highest-traffic siblings, which happen to be near-duplicates of the host page.

How to spot it: Compare algorithm output to the rendered widget. If the widget systematically prefers manual picks and those picks are near-duplicates, the cause is editorial, not the model.

7. Recommender doesn’t down-weight the host article’s own topic cluster

Every recommendation comes from the same subcategory. There is no diversity term, so the panel is always 3 articles from the same 10-article cluster.

How to spot it: Count distinct subcategories across all rendered Related panels. If 80%+ of recommendations stay within the host subcategory, the model has no diversity term.
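A small sketch of that count, assuming you have the rendered panels as a map of host slug to recommended slugs and a lookup from slug to subcategory. Both structures, the function name, and the sample slugs are assumptions for illustration.

```javascript
// Share of recommendations that stay inside the host article's subcategory.
function sameSubcategoryShare(panels, subcategoryOf) {
  let total = 0, same = 0;
  for (const [host, recs] of panels) {
    for (const rec of recs) {
      total++;
      if (subcategoryOf.get(rec) === subcategoryOf.get(host)) same++;
    }
  }
  return total ? same / total : 0;
}

// Hypothetical sample data.
const subcategoryOf = new Map([
  ['login-loop', 'auth'], ['session-expiry', 'auth'], ['auto-logout', 'auth'],
  ['export-chats', 'data'], ['rate-limits', 'billing'],
]);
const panels = new Map([
  ['login-loop', ['session-expiry', 'auto-logout', 'export-chats']],
  ['session-expiry', ['login-loop', 'auto-logout', 'rate-limits']],
]);
const share = sameSubcategoryShare(panels, subcategoryOf);
console.log(`${(share * 100).toFixed(0)}% of recommendations stay in the host subcategory`);
```

A share at or above 0.8 matches the 80%+ signal described above.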

Which bucket are you in

Symptom you observe	Most likely cause	Go to
Top-3 titles repeat the host’s head terms	Title-token / tag overlap (causes 1, 3)	Steps 1-2
Cosine scores all `> 0.92`	Embedding ranks dupes over complements (cause 2)	Steps 1-3
Can’t state a one-sentence difference between the three	Real duplicates (cause 4)	Step 4
Same article appears twice via different URLs	Redirect/canonical leak (cause 5)	Step 5
Widget differs from algorithm output	Manual editorial picks (cause 6)	Step 6
Every panel is same-subcategory	No diversity term (cause 7)	Step 3

Shortest path to fix

Step 1: Build a duplicate-detection signal that runs at build time

Compute a similarity score for every article pair and persist it. The recommender uses this as a hard cutoff. Token-shingle Jaccard is cheap, deterministic, and runs in the build without an embedding API; for thousands of articles, swap the inner loop for MinHash/LSH so you avoid the O(n^2) blowup.

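// scripts/article-similarity.mjs
import fs from 'node:fs';
import { encode } from 'gpt-tokenizer';
// Assumption: loadAllArticles is your own loader (the path is hypothetical).
// It should return [{ slug, title, description, body }, ...].
import { loadAllArticles } from './load-articles.mjs';

// Build the set of n-token shingles for a text.
function shingles(text, n = 5) {
  const tokens = encode(text.toLowerCase());
  const set = new Set();
  for (let i = 0; i <= tokens.length - n; i++) {
    set.add(tokens.slice(i, i + n).join('-'));
  }
  return set;
}

// Jaccard similarity of two sets. Two empty sets score 0 instead of NaN.
function jaccard(a, b) {
  const inter = [...a].filter(x => b.has(x)).length;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

const articles = loadAllArticles();
const shing = new Map(articles.map(a => [a.slug, shingles(a.title + ' ' + a.description + ' ' + a.body.slice(0, 2000))]));
const pairs = [];
// Compare every pair once and persist anything in or above the 0.4-0.5 gray zone.
for (let i = 0; i < articles.length; i++) {
  for (let j = i + 1; j < articles.length; j++) {
    const s = jaccard(shing.get(articles[i].slug), shing.get(articles[j].slug));
    if (s > 0.4) pairs.push({ a: articles[i].slug, b: articles[j].slug, score: s });
  }
}
fs.mkdirSync('data', { recursive: true }); // make sure the output directory exists
fs.writeFileSync('data/similar.json', JSON.stringify(pairs, null, 2));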

Calibrate the threshold on your own corpus: anything with Jaccard > 0.5 is almost always a near-duplicate the recommender must exclude; 0.4-0.5 is a gray zone worth eyeballing.
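For the larger-corpus case mentioned at the start of this step, here is a rough sketch of the MinHash half of that swap. It is an illustration under assumptions, not a tuned implementation: the seeded FNV-1a hash and the choice of 64 hash functions are picked for brevity, and the LSH banding that would generate candidate pairs from these signatures is not shown.

```javascript
// 32-bit FNV-1a hash of a string. The seed is mixed in first so each seed
// behaves like a separate hash function.
function fnv1a(str, seed) {
  let h = Math.imul(0x811c9dc5 ^ seed, 0x01000193) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// MinHash signature: for each seed, keep the smallest hash seen across the shingle set.
function minhashSignature(shingleSet, numHashes = 64) {
  const sig = new Uint32Array(numHashes).fill(0xffffffff);
  for (const s of shingleSet) {
    for (let k = 0; k < numHashes; k++) {
      const h = fnv1a(s, k);
      if (h < sig[k]) sig[k] = h;
    }
  }
  return sig;
}

// The fraction of matching signature slots estimates the Jaccard similarity.
function estimateJaccard(sigA, sigB) {
  let same = 0;
  for (let k = 0; k < sigA.length; k++) if (sigA[k] === sigB[k]) same++;
  return same / sigA.length;
}

// Hypothetical shingle sets; the exact Jaccard here is 2/4 = 0.5.
const setA = new Set(['chatgpt keeps logging', 'keeps logging me', 'logging me out']);
const setB = new Set(['chatgpt keeps logging', 'keeps logging me', 'logging me off']);
console.log(estimateJaccard(minhashSignature(setA), minhashSignature(setB)).toFixed(2));
```

The estimate is noisy at 64 hashes, so treat it as a prefilter: compute exact Jaccard only for pairs whose estimate lands near or above your cutoff.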

Step 2: Add a hard exclusion to the recommender

Whatever your similarity function is (tag overlap, embeddings, both), pass each candidate through the near-duplicate filter first, before ranking:

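import fs from 'node:fs';

// Load the Step 1 output into a lookup keyed by the two slugs, sorted and joined with '|'.
const pairs = JSON.parse(fs.readFileSync('data/similar.json', 'utf8'));
const similarityMap = new Map(pairs.map(p => [[p.a, p.b].sort().join('|'), p.score]));

const NEAR_DUP_THRESHOLD = 0.5; // the hard Jaccard cutoff from Step 1
function isNearDup(host, candidate) {
  const key = [host, candidate].sort().join('|');
  return (similarityMap.get(key) ?? 0) > NEAR_DUP_THRESHOLD;
}
const survivors = candidates.filter(c => !isNearDup(host.slug, c.slug));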

The widget will sometimes have fewer than 3 results. That is fine; better empty than misleading.

Step 3: Rerank survivors with MMR instead of a flat penalty

Filtering removes the obvious duplicates, but you can still end up with three near-neighbors that each squeak under the cutoff yet duplicate each other. MMR (Maximal Marginal Relevance) is the standard fix: it greedily picks the next item that is relevant to the host but dissimilar to what you already picked, so a panel never stacks three copies of the same idea. The formula is MMR = argmax over candidates of ( λ * Rel(host, c) - (1 - λ) * max Sim(c, alreadySelected) ), where λ (lambda) dials relevance vs diversity — start around 0.7.

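// Assumes embeddingCosine(a, b) returns the cosine similarity of two articles' embeddings.
function rerankMMR(host, survivors, lambda = 0.7, k = 3) {
  const selected = [];
  const pool = [...survivors];
  while (selected.length < k && pool.length) {
    // Default to the first item so a NaN score can never push null onto the panel.
    let best = pool[0], bestScore = -Infinity;
    for (const c of pool) {
      const rel = embeddingCosine(host, c); // relevance to the host
      // Redundancy: similarity to the closest item already on the panel.
      const maxSim = selected.length
        ? Math.max(...selected.map(s => embeddingCosine(s, c)))
        : 0;
      const mmr = lambda * rel - (1 - lambda) * maxSim;
      if (mmr > bestScore) { bestScore = mmr; best = c; }
    }
    selected.push(best);
    pool.splice(pool.indexOf(best), 1);
  }
  return selected;
}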
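To see the trade-off in numbers, here is a self-contained toy run of the same greedy selection. All scores are invented for illustration: A, B, and C form a tight cluster that sits just under a 0.92 cutoff, while D and E are less relevant but distinct. The name `mmrOrder` and the lookup tables are assumptions for this example only.

```javascript
// Toy data: relevance of each candidate to the host, plus pairwise candidate similarity.
const rel = { A: 0.90, B: 0.88, C: 0.86, D: 0.75, E: 0.70 };
const pairSim = {
  'A|B': 0.90, 'A|C': 0.89, 'B|C': 0.91, // tight cluster, each pair just under 0.92
  'A|D': 0.50, 'B|D': 0.50, 'C|D': 0.50,
  'A|E': 0.45, 'B|E': 0.45, 'C|E': 0.45, 'D|E': 0.40,
};
const sim = (x, y) => pairSim[[x, y].sort().join('|')];

// Greedy MMR over the toy data; returns the chosen labels in pick order.
function mmrOrder(lambda, k = 3) {
  const selected = [];
  const pool = Object.keys(rel);
  while (selected.length < k && pool.length) {
    let best = pool[0], bestScore = -Infinity;
    for (const c of pool) {
      const maxSim = selected.length ? Math.max(...selected.map(s => sim(s, c))) : 0;
      const score = lambda * rel[c] - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; best = c; }
    }
    selected.push(best);
    pool.splice(pool.indexOf(best), 1);
  }
  return selected;
}

console.log(mmrOrder(1.0)); // pure relevance picks A, B, C
console.log(mmrOrder(0.7)); // with the diversity term the picks are A, D, E
```

In the second round at λ = 0.7, B scores 0.7 × 0.88 − 0.3 × 0.90 = 0.346 while D scores 0.7 × 0.75 − 0.3 × 0.50 = 0.375, so the less relevant but distinct article wins the slot.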

If you have no embeddings, approximate Sim with a subcategory match: subtract a fixed penalty (e.g. 0.15) when a candidate shares the host’s subcategory, and tune so the top-3 has at most 1 same-subcategory neighbor. MMR with real pairwise similarity is usually the better choice because it diversifies on actual content, not just taxonomy.
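The no-embeddings fallback could look roughly like this. It is a sketch that assumes each candidate already carries a base relevance `score` (for example from tag overlap) and a `subcategory` field; both field names, the function names, and the sample data are hypothetical.

```javascript
// Rank by base relevance minus a flat penalty for sharing the host's subcategory.
function rerankWithPenalty(host, candidates, penalty = 0.15, k = 3) {
  return candidates
    .map(c => ({
      ...c,
      adjusted: c.score - (c.subcategory === host.subcategory ? penalty : 0),
    }))
    .sort((x, y) => y.adjusted - x.adjusted)
    .slice(0, k);
}

// Tuning check from the text: how many panel items share the host's subcategory.
function sameSubcategoryCount(host, panel) {
  return panel.filter(c => c.subcategory === host.subcategory).length;
}

// Hypothetical sample data.
const host = { slug: 'login-loop', subcategory: 'auth' };
const candidates = [
  { slug: 'session-expiry', subcategory: 'auth', score: 0.80 },
  { slug: 'auto-logout', subcategory: 'auth', score: 0.78 },
  { slug: 'cookie-settings', subcategory: 'browser', score: 0.70 },
  { slug: 'export-chats', subcategory: 'data', score: 0.66 },
];
const panel = rerankWithPenalty(host, candidates);
console.log(panel.map(c => c.slug), sameSubcategoryCount(host, panel));
```

If the count comes back above 1 across many hosts, raise the penalty in small steps and re-check.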

Step 4: Decide: merge or differentiate

When near-duplicates show up in the audit, you have two choices:

Merge: one canonical article, 301-redirect the others. Best when one clearly leads in traffic and the rest are stub variants. The 301 is the strongest signal you can send Google to collapse the cluster — stronger than rel="canonical" alone.
Differentiate: rewrite each so it answers a different sub-question. Best when each has its own backlinks or unique traffic.

Do not leave them as “duplicates the widget hides.” Hidden duplicates still appear in tag pages, sitemap, and search, and Google still clusters them — you have only hidden the symptom from one surface.

Step 5: Resolve candidates through redirects before recommending

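// Assumes redirectMap maps old paths to canonical paths, e.g. '/en/articles/old/' -> '/en/articles/new/'.
function canonicalSlug(slug) {
  const target = redirectMap.get(`/en/articles/${slug}/`);
  if (target) return target.replace(/^\/en\/articles\//, '').replace(/\/$/, '');
  return slug;
}
// Resolve each candidate, then keep only the first candidate per resolved slug.
const seen = new Set();
const candidates = rawCandidates
  .map(c => ({ ...c, slug: canonicalSlug(c.slug) }))
  .filter(c => !seen.has(c.slug) && seen.add(c.slug));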

Dedupe by resolved slug. Otherwise the same canonical article appears two or three times because tag, category, and old-slug URLs each came in as a separate candidate. This also helps with the `Duplicate, Google chose different canonical than user` status in Search Console: your internal links (the Related widget is one source of them) must point at the canonical, not a variant.

Step 6: Audit manual editorial picks

If your CMS supports an editor-curated Related field, run a one-time audit:

for (const article of articles) {
  for (const manual of article.manualRelated || []) {
    const sim = similarityMap.get([article.slug, manual].sort().join('|')) || 0;
    if (sim > 0.4) console.log(`Manual pick near-dup: ${article.slug} -> ${manual} (${sim})`);
  }
}

Send the list back to editorial with a required merge-or-differentiate decision per pair.

Step 7: CI guardrail

Fail the build if the audit produces more than N near-duplicate pairs site-wide. Trends matter more than the absolute number; if the count grows three weeks in a row, freeze new posts in the affected cluster until it is resolved.
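A minimal sketch of such a guard, assuming Node and the `data/similar.json` shape written in Step 1 (`{ a, b, score }` per pair). The function name `checkNearDupBudget`, the default budget of 10, and the sample pairs are placeholders; pick N from your own baseline.

```javascript
// Count pairs above the cutoff and compare against the allowed budget.
function checkNearDupBudget(pairs, { threshold = 0.5, maxPairs = 10 } = {}) {
  const offenders = pairs.filter(p => p.score > threshold);
  return { count: offenders.length, ok: offenders.length <= maxPairs, offenders };
}

// In CI, load the real file instead:
// JSON.parse(fs.readFileSync('data/similar.json', 'utf8'))
const pairs = [
  { a: 'session-expiry', b: 'auto-logout', score: 0.62 },
  { a: 'login-loop', b: 'auto-logout', score: 0.44 },
];

const result = checkNearDupBudget(pairs, { maxPairs: 1 });
console.log(`near-duplicate pairs above cutoff: ${result.count}`);
if (!result.ok) {
  for (const p of result.offenders) console.error(`${p.a} <-> ${p.b} (${p.score})`);
  process.exitCode = 1; // a non-zero exit code fails most CI steps
}
```

Logging `result.count` on every build also gives you the week-over-week trend described above.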

How to confirm it’s fixed

Spot-check the worst offender. Open the article that previously showed three near-duplicates and confirm the rendered Related panel now shows distinct answers (or fewer than 3 items, which is acceptable).
Re-run the audit. Run node scripts/article-similarity.mjs again, then confirm that no rendered Related pair (host and recommended article) appears in data/similar.json above your cutoff. Diff the output against the previous run.
Check subcategory spread. Across a sample of pages, the top-3 should span at least 2 subcategories where the catalog allows.
Watch Search Console. Over the following weeks, the Duplicate without user-selected canonical count in Pages should trend down, not up, for the affected cluster. This lags by days to weeks; do not expect same-day movement.
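The re-run check in the list above needs a join between the rendered panels and the persisted pair scores. A sketch, assuming panels are available as a map of host slug to recommended slugs and pairs use the Step 1 shape; `renderedNearDups` and the sample data are hypothetical.

```javascript
// Same key format as the Step 2 lookup: the two slugs sorted and joined with '|'.
const pairKey = (a, b) => [a, b].sort().join('|');

// Return every rendered (host, recommended) pair whose stored similarity exceeds the cutoff.
function renderedNearDups(panels, pairs, cutoff = 0.5) {
  const scoreOf = new Map(pairs.map(p => [pairKey(p.a, p.b), p.score]));
  const hits = [];
  for (const [host, recs] of panels) {
    for (const rec of recs) {
      const score = scoreOf.get(pairKey(host, rec)) ?? 0;
      if (score > cutoff) hits.push({ host, rec, score });
    }
  }
  return hits;
}

// Hypothetical sample data.
const pairs = [
  { a: 'auto-logout', b: 'session-expiry', score: 0.62 },
  { a: 'auto-logout', b: 'login-loop', score: 0.44 },
];
const panels = new Map([
  ['session-expiry', ['auto-logout', 'export-chats']],
  ['login-loop', ['auto-logout']],
]);
const hits = renderedNearDups(panels, pairs);
console.log(hits);
```

An empty `hits` array is the pass condition.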

When this is not on you

A content site that genuinely covers a long tail will always have some near-neighbors. The bar is not “no two articles are similar” — it is “no Related panel shows the same answer three times.” A Jaccard of 0.3-0.4 between siblings is fine and often helpful.

Easy to misdiagnose as

“Recommender model is bad.” The model may be excellent at “find similar articles.” The problem is you asked it for related and it gave you similar. Different task.
“We need more articles to break up the cluster.” Adding more usually makes it worse. The cluster is dense because the topic is narrow.
“Tag taxonomy is too coarse.” Sometimes true, but unrelated to whether the widget surfaces duplicates. Fix the widget first, then audit tags.

Prevention

Run pairwise Jaccard / embedding-cosine at build time and persist the matrix (MinHash/LSH once the corpus is large).
Recommender filters near-duplicates before ranking, then reranks survivors with MMR.
Diversity term so top-3 spans at least 2 subcategories where the catalog allows.
Resolve candidate slugs through the redirect map before deduping.
Quarterly audit: any pair with Jaccard > 0.5 gets a merge-or-differentiate decision logged in the CMS.
CI cap on the total number of near-duplicate pairs site-wide; if it grows, pause new posts in the affected cluster.

FAQ

Does Google penalize my site for these duplicates?

No. Google clusters near-duplicate URLs and picks one canonical rather than applying a penalty. The cost is indirect: split ranking signals, wasted crawl budget, and pages stuck as `Duplicate without user-selected canonical` that never get indexed as themselves.

Why not just hide duplicates in the widget without merging?

Because they still appear in tag pages, sitemap, and search, and Google clusters them anyway. Hiding in one surface does not fix the underlying duplication.

Will fewer related items hurt engagement?

Usually no. Three non-duplicate links outperform three duplicates because readers actually click the non-duplicates.

What should I set lambda to in MMR?

Start at `0.7` (relevance-leaning) and lower it toward `0.5` if panels still feel repetitive. Re-check the top-3 spread after each change.

Jaccard or embeddings: which threshold should I trust?

Use both: Jaccard `> 0.5` for a cheap, deterministic build-time gate; embedding cosine `> 0.92` as a second gate when you have vectors. Calibrate both on a hand-labeled sample of your own pairs.

Tags: #Content ops #Troubleshooting #SEO #internal-linking #duplicate-content #recommendations
