You scroll to the bottom of an article on “ChatGPT keeps logging me out” and the Related panel shows three other posts: “ChatGPT session expires too fast,” “ChatGPT auto-logout problem,” and “ChatGPT logged out after 5 minutes.” All three are the same question with slightly different titles, written across a year as you chased the long tail. The recommender promoted them because every signal (tag overlap, title token Jaccard, embedding cosine) said they were maximally related. Readers see what looks like the same article three times and bounce. Google sees a site that recommends duplicates to itself, which can read as a negative signal about content quality. This is one of the harder problems on a mature content site because the recommender is doing exactly what it was asked to do.
This article covers how to detect the leak, how to score “related but not duplicate,” and where in the pipeline to fix it.
Common causes
Ordered by hit rate, highest first.
1. Tag-overlap recommender treats near-duplicate tags as a perfect match
Two articles share 5 of 6 tags. The recommender sorts by tag overlap and they always top each other’s Related list, even though they answer the same question.
How to spot it: For each article, look at its top-3 related items. If a related title shares 70% or more of its words with the host title, the recommender is rewarding near-duplicates.
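A minimal sketch of that title check. The function names are hypothetical, and "word overlap" is taken here to mean the share of the related title's words that also appear in the host title, which is only one reasonable reading of the 70% rule.

```javascript
// Lowercase a title and split it into a set of words (letters and digits only).
function titleWords(title) {
  return new Set(title.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? []);
}

// Share of the candidate title's words that also occur in the host title.
function titleOverlap(hostTitle, candidateTitle) {
  const host = titleWords(hostTitle);
  const cand = titleWords(candidateTitle);
  if (cand.size === 0) return 0;
  let shared = 0;
  for (const w of cand) if (host.has(w)) shared++;
  return shared / cand.size;
}

// Return the related titles that overlap the host title by `threshold` or more.
function flagNearDuplicateTitles(hostTitle, relatedTitles, threshold = 0.7) {
  return relatedTitles.filter(t => titleOverlap(hostTitle, t) >= threshold);
}
```

Run it over every article's rendered top-3 and list the hosts where anything is flagged. Word overlap misses paraphrased titles, so treat an empty result as weak evidence, not a clean bill of health.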
2. Embedding similarity ranks duplicates above complements
You upgraded the recommender to use sentence-embedding cosine. Now near-duplicates score 0.95+ and rank #1, while complementary pieces (the canonical fix, the prevention guide, the related tool) score 0.7-0.8 and fall off the panel.
How to spot it: Print the cosine score next to each related candidate. If your top-3 are all > 0.92, you are recommending duplicates, not related content.
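One way to run that check, assuming each article object carries a precomputed `embedding` array next to its `slug` (both field names are assumptions, as are the function names). The sketch computes plain cosine similarity and tests whether the whole top-3 sits above 0.92.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Score every candidate against the host, highest first.
function scoreCandidates(host, candidates) {
  return candidates
    .map(c => ({ slug: c.slug, score: cosine(host.embedding, c.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// True when the top `k` candidates all exceed the duplicate threshold.
function topKLooksDuplicated(scored, k = 3, threshold = 0.92) {
  const top = scored.slice(0, k);
  return top.length === k && top.every(s => s.score > threshold);
}
```

Logging `scoreCandidates(host, candidates)` next to the rendered panel is usually enough to see whether the complements are being crowded out.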
3. Title-token overlap is the only feature
The old recommender only shingled title tokens. Articles with the same 3-4 head terms (“ChatGPT,” “login,” “fail”) form a tight cluster and recommend only each other.
How to spot it: Generate a co-recommendation graph. Tight 3-5 article cliques with no outbound edges are the leak.
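A rough sketch of that graph check, assuming you can export the rendered widget as a `Map` from each slug to the slugs it recommends (the shape and the function names are assumptions). It looks for small groups where every member can reach every other member and no link leaves the group.

```javascript
// All slugs reachable from `start` by following Related links (including start).
function reachable(graph, start) {
  const seen = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const next of graph.get(current) ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return seen;
}

// Groups of minSize..maxSize articles that only ever recommend each other.
function closedGroups(graph, minSize = 3, maxSize = 5) {
  const found = new Map();
  for (const slug of graph.keys()) {
    const group = reachable(graph, slug);
    if (group.size < minSize || group.size > maxSize) continue;
    // Every member must reach the whole group, so the group is mutual.
    const mutual = [...group].every(m => reachable(graph, m).size === group.size);
    if (mutual) {
      const members = [...group].sort();
      found.set(members.join('|'), members);
    }
  }
  return [...found.values()];
}
```

This repeats a breadth-first search per article, which is fine for a few thousand posts. A larger catalog would want a proper strongly-connected-components pass instead.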
4. You have actual duplicates that should be merged, not de-duplicated in the widget
The recommender is correctly surfacing duplicates because the duplicates exist. The fix is canonicalization in content, not a smarter widget.
How to spot it: Read the three “related” articles. If you cannot articulate one sentence of difference, they are duplicates and the widget is the symptom, not the bug.
5. Same canonical with multiple URLs leaks into Related
Tag pages, category pages, and an old slug all link to the same canonical article. The recommender treats each URL as a candidate and surfaces all three.
How to spot it: Follow each Related link through its redirects. If two or more items resolve to the same canonical URL, this is the cause.
6. Manual editorial picks override the algorithm, and editors picked duplicates
You shipped a “manual related” field in the CMS. Editors filled it with the highest-traffic siblings, which happen to be near-duplicates of the host page.
How to spot it: Compare algorithm output to the rendered widget. If the widget systematically prefers manual picks and those picks are near-duplicates, the cause is editorial, not the model.
7. Recommender doesn’t down-weight the host article’s own topic cluster
Every recommendation comes from the same subcategory. There is no breadth penalty, so the panel is always 3 articles from the same 10-article cluster.
How to spot it: Count distinct subcategories across all rendered Related panels. If 80%+ of recommendations stay within the host subcategory, the model has no cluster-diversity term.
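The 80% check can be computed directly from the rendered panels. This sketch assumes each panel is an object with a `host` and a `related` array, and that every article has a `subcategory` field. Those shapes are assumptions about your data, not something the site is known to expose.

```javascript
// Fraction of all rendered recommendations that stay in the host's subcategory.
function sameSubcategoryShare(panels) {
  let total = 0;
  let same = 0;
  for (const { host, related } of panels) {
    for (const item of related) {
      total++;
      if (item.subcategory === host.subcategory) same++;
    }
  }
  return total === 0 ? 0 : same / total;
}
```

A result of 0.8 or higher across the site matches the symptom described above.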
Shortest path to fix
Step 1: Build a duplicate-detection signal that runs at build time
Compute a similarity score for every article pair and persist it. The recommender will use this as a hard cutoff.
// scripts/article-similarity.mjs
import fs from 'node:fs';
import { encode } from 'gpt-tokenizer';

// Build the set of n-token shingles for a piece of text.
function shingles(text, n = 5) {
  const tokens = encode(text.toLowerCase());
  const set = new Set();
  for (let i = 0; i <= tokens.length - n; i++) {
    set.add(tokens.slice(i, i + n).join('-'));
  }
  return set;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets score 0 instead of NaN.
function jaccard(a, b) {
  const inter = [...a].filter(x => b.has(x)).length;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// loadAllArticles() is your own loader. It should return objects with
// slug, title, description, and body.
const articles = loadAllArticles();
const shing = new Map(articles.map(a => [a.slug, shingles(a.title + ' ' + a.description + ' ' + a.body.slice(0, 2000))]));

// Compare every pair once (O(n²)) and keep the ones that score above 0.4.
const pairs = [];
for (let i = 0; i < articles.length; i++) {
  for (let j = i + 1; j < articles.length; j++) {
    const s = jaccard(shing.get(articles[i].slug), shing.get(articles[j].slug));
    if (s > 0.4) pairs.push({ a: articles[i].slug, b: articles[j].slug, score: s });
  }
}

fs.mkdirSync('data', { recursive: true });
fs.writeFileSync('data/similar.json', JSON.stringify(pairs, null, 2));
Anything with Jaccard > 0.5 is a near-duplicate the recommender must exclude.
Step 2: Add a hard exclusion to the recommender
Whatever your similarity function is (tag overlap, embeddings, both), pass the candidate through the near-duplicate filter first:
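// similarityMap: Map of 'slugA|slugB' (slugs sorted) -> Jaccard score,
// loaded from data/similar.json.
const NEAR_DUP_THRESHOLD = 0.5;

function isNearDup(host, candidate) {
  const key = [host, candidate].sort().join('|');
  // Pairs missing from the map scored below the persisted cutoff.
  return (similarityMap.get(key) ?? 0) > NEAR_DUP_THRESHOLD;
}

// candidates is assumed to be ranked already, best first.
const related = candidates
  .filter(c => !isNearDup(host.slug, c.slug))
  .slice(0, 3);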
The widget will sometimes have fewer than 3 results. That is fine; better empty than misleading.
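The filter above reads from a `similarityMap` that none of the snippets build. One way to construct it from the pairs written in Step 1, assuming the `{ a, b, score }` shape of `data/similar.json` (the helper name is hypothetical):

```javascript
// Build a lookup keyed by the two slugs in sorted order, joined with '|'.
// Sorting makes the key independent of which article is the host.
function buildSimilarityMap(pairs) {
  const map = new Map();
  for (const { a, b, score } of pairs) {
    map.set([a, b].sort().join('|'), score);
  }
  return map;
}
```

In the recommender you would feed it the parsed file, for example `buildSimilarityMap(JSON.parse(fs.readFileSync('data/similar.json', 'utf8')))`. The `'|'` separator assumes slugs never contain that character.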
Step 3: Add a cluster-diversity term to the ranker
After the duplicate filter, penalize candidates that share the host’s subcategory. Roughly:
function score(host, candidate) {
  const base = embeddingCosine(host, candidate);
  const sameSub = host.subcategory === candidate.subcategory ? 0.15 : 0;
  return base - sameSub;
}
Tune the penalty so the top-3 has at most 1 same-subcategory neighbor.
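Tuning can be done offline against a sample of hosts. The sketch below assumes each candidate already carries its cosine score in a `base` field and tries a fixed list of penalty values. The field name, the step list, and both function names are illustrative, not part of any existing recommender.

```javascript
// Rank candidates by base score minus a same-subcategory penalty, keep the top k.
function topK(host, candidates, penalty, k = 3) {
  return candidates
    .map(c => ({
      ...c,
      score: c.base - (c.subcategory === host.subcategory ? penalty : 0),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Smallest penalty from `steps` for which every sample panel has at most
// `maxSame` same-subcategory items in its top k. Returns null if none works.
function tunePenalty(samples, steps = [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3], maxSame = 1) {
  for (const penalty of steps) {
    const ok = samples.every(({ host, candidates }) => {
      const same = topK(host, candidates, penalty)
        .filter(c => c.subcategory === host.subcategory).length;
      return same <= maxSame;
    });
    if (ok) return penalty;
  }
  return null;
}
```

Picking the smallest penalty that works keeps the ranking as close to raw relevance as the diversity constraint allows. If the function returns `null`, the cluster probably needs the merge-or-differentiate treatment, not a bigger penalty.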
Step 4: Decide: merge or differentiate
When near-duplicates show up in the audit, you have two choices:
- Merge: one canonical article, redirect the others. Best when one of them clearly leads in traffic and the others are stub variants.
- Differentiate: rewrite each so it answers a different sub-question. Best when each has its own backlinks or unique traffic.
Do not leave them as “duplicates the widget hides.” Hidden duplicates still appear in tag pages, sitemap, and search.
Step 5: Resolve candidates through redirects before recommending
function canonicalSlug(slug) {
  const target = redirectMap.get(`/en/articles/${slug}/`);
  if (target) return target.replace(/^\/en\/articles\//, '').replace(/\/$/, '');
  return slug;
}

const candidates = rawCandidates.map(c => ({ ...c, slug: canonicalSlug(c.slug) }));
Dedupe by resolved slug. Otherwise the same canonical article appears three times.
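The dedupe step itself is not shown above. A minimal version, assuming candidates arrive best-first with their slugs already resolved, might look like this. It also drops any candidate that resolves to the host itself, which is an added assumption that an old slug can redirect to the page being viewed.

```javascript
// Keep the first candidate for each resolved slug and drop the host itself.
function dedupeBySlug(candidates, hostSlug) {
  const seen = new Set([hostSlug]);
  const unique = [];
  for (const c of candidates) {
    if (seen.has(c.slug)) continue;
    seen.add(c.slug);
    unique.push(c);
  }
  return unique;
}
```

Because the first occurrence wins, the highest-ranked copy of each article keeps its position.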
Step 6: Audit manual editorial picks
If your CMS supports an editor-curated Related field, run a one-time audit:
// Flag manual picks that the build-time similarity data marks as near-duplicates.
for (const article of articles) {
  for (const manual of article.manualRelated || []) {
    const sim = similarityMap.get([article.slug, manual].sort().join('|')) || 0;
    if (sim > 0.5) console.log(`Manual pick near-dup: ${article.slug} -> ${manual} (${sim})`);
  }
}
Send the list back to editorial with the merge-or-differentiate decision required.
Step 7: CI guardrail
Fail the build if the audit produces more than N near-duplicate pairs site-wide. Trends matter more than the absolute number; if the count grows three weeks in a row, freeze new posts in the affected cluster until it is resolved.
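A sketch of both halves of that guardrail: the absolute cap and the three-weeks-of-growth trend. The cap of 20 stands in for N and is a placeholder, and the weekly history is assumed to be an array of counts, oldest first, that you store yourself.

```javascript
// Decide whether the build should fail, given the persisted pairs.
function checkNearDuplicateBudget(pairs, { threshold = 0.5, maxPairs = 20 } = {}) {
  const offenders = pairs.filter(p => p.score > threshold);
  return { count: offenders.length, ok: offenders.length <= maxPairs, offenders };
}

// True when the count rose in each of the last `weeks` week-over-week steps.
// Needs weeks + 1 data points, oldest first.
function grewEveryWeek(counts, weeks = 3) {
  if (counts.length < weeks + 1) return false;
  const recent = counts.slice(-(weeks + 1));
  return recent.every((n, i) => i === 0 || n > recent[i - 1]);
}
```

In CI, a failing result would typically set `process.exitCode = 1` after printing the offending pairs, so the log doubles as the list editorial needs to work through.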
When this is not on you
A content site that genuinely covers a long tail will always have some near-neighbors. The bar is not “no two articles are similar”; it is “no Related panel shows the same answer three times.” A Jaccard of 0.3-0.4 between siblings is fine and often helpful.
Easy to misdiagnose as
- “Recommender model is bad.” The model may be excellent at “find similar articles.” The problem is you asked it to find related, and it found similar. They are not the same task.
- “We need more articles to break up the cluster.” Adding more usually makes it worse. The cluster is dense because the topic is narrow.
- “Tag taxonomy is too coarse.” Sometimes true, but unrelated to whether the widget surfaces duplicates. Fix the widget first, then audit tags.
Prevention
- Run pairwise Jaccard / embedding-cosine at build time and persist the matrix.
- Recommender filters near-duplicates before ranking, not after.
- Cluster-diversity penalty so top-3 spans at least 2 subcategories where the catalog allows.
- Resolve candidate slugs through the redirect map before deduping.
- Quarterly audit: any pair with Jaccard > 0.5 gets a merge-or-differentiate decision logged in the CMS.
- CI cap on the total number of near-duplicate pairs site-wide; if it grows, pause new posts in the affected cluster.
FAQ
- Why not just hide duplicates in the widget without merging? Because they still appear in tag pages, sitemap, and search. Hiding in one surface does not fix the underlying duplication.
- Will fewer related items hurt engagement? Usually no. Three non-duplicate links outperform three duplicates because readers actually click the non-duplicates.
Tags: #Content ops #Troubleshooting #SEO #internal-linking #duplicate-content #recommendations