At 100 articles you can hold the whole site in your head. At 1,000 you cannot. The site starts to accumulate internal contradictions, pages that target the same intent, broken internal links, and dead clusters you forgot existed. Managing a site past 1,000 articles is mostly about systems (a generated content index, a duplicate scanner, a retire shelf), not writing.
Background
Sites that cross 1,000 articles without management discipline often see traffic plateau or drop, even while still publishing. The cause is usually internal: duplicate competition, link rot, and crawl budget waste, not external algorithm changes. Past 1,000, every workflow needs a script behind it.
How to tell
- You can’t answer “do I already have an article on X?” without searching.
- Search Console shows many “Duplicate without user-selected canonical” entries.
- Indexed page count is dropping even though you keep publishing.
- Multiple articles rank for the same query and cannibalize each other.
- Your sitemap has 1,200 entries but Search Console reports only 700 indexed.
Before you start
- Block 2-3 weeks for the first audit; budget a few hours per week for ongoing maintenance.
- git status clean before running any bulk-edit scripts.
- Search Console API access (OAuth credentials) for programmatic data pulls.
Step by step
- Generate a content index from the content folder, not by hand. A short Node script keeps it fresh:
// scripts/build-content-index.mjs
import { readdirSync, readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter';

const ROOT = 'src/content/articles';
const rows = [];

// Expected layout: src/content/articles/<lang>/<category>/<file>.mdx
for (const lang of readdirSync(ROOT)) {
  for (const cat of readdirSync(join(ROOT, lang))) {
    for (const file of readdirSync(join(ROOT, lang, cat))) {
      if (!file.endsWith('.mdx')) continue;
      const { data, content } = matter(readFileSync(join(ROOT, lang, cat, file), 'utf8'));
      rows.push({
        slug: data.urlSlug,
        lang,
        category: data.category,
        title: data.title,
        primaryKeyword: data.primaryKeyword || '',
        publishedAt: data.publishedAt,
        words: content.trim().split(/\s+/).filter(Boolean).length,
      });
    }
  }
}

// Quote every field and double embedded quotes so titles with commas or quotes stay in one column.
// YAML dates arrive as Date objects, so print them as YYYY-MM-DD.
const csvField = (v) => {
  const s = v instanceof Date ? v.toISOString().slice(0, 10) : String(v ?? '');
  return `"${s.replace(/"/g, '""')}"`;
};

writeFileSync('content-index.csv',
  'slug,lang,category,title,primaryKeyword,publishedAt,words\n' +
  rows.map(r => Object.values(r).map(csvField).join(',')).join('\n')
);
console.log(`Wrote ${rows.length} rows`);
Run it whenever you need the table; commit the CSV if you want a snapshot.
- Run a duplicate-intent scan. Group articles by primaryKeyword:
# Note: awk -F, splits on every comma, so a title that contains a comma shifts the columns.
awk -F, 'NR>1 {print $5}' content-index.csv | sort | uniq -c | sort -rn | head
# 3 "submit sitemap search console"
# 2 "firebase hosting cache"
# ...
# any count > 1 needs a merge or canonical decision
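The prebuild gate later in this guide calls scripts/find-duplicate-keywords.mjs without showing it. One way it could work is sketched below, assuming the CSV layout produced by the index script. The function names `parseCsvLine` and `findDuplicateKeywords` are hypothetical, keywords are lowercased and trimmed before grouping (an assumption, the awk one-liner does not do this), and fields that contain line breaks are not handled.

```javascript
// Parse one CSV line, honoring double-quoted fields and "" escapes.
function parseCsvLine(line) {
  const fields = [];
  let cur = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { cur += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else cur += ch;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(cur);
      cur = '';
    } else {
      cur += ch;
    }
  }
  fields.push(cur);
  return fields;
}

// Return [{ keyword, slugs }] for every primaryKeyword used by more than one article.
function findDuplicateKeywords(csvText) {
  const [header, ...lines] = csvText.trim().split('\n');
  const cols = parseCsvLine(header);
  const kwIdx = cols.indexOf('primaryKeyword');
  const slugIdx = cols.indexOf('slug');
  const groups = new Map();
  for (const line of lines) {
    const fields = parseCsvLine(line);
    const keyword = (fields[kwIdx] || '').trim().toLowerCase();
    if (!keyword) continue; // articles without a keyword are not duplicates of each other
    if (!groups.has(keyword)) groups.set(keyword, []);
    groups.get(keyword).push(fields[slugIdx]);
  }
  return [...groups]
    .filter(([, slugs]) => slugs.length > 1)
    .map(([keyword, slugs]) => ({ keyword, slugs }));
}
```

In a real script you would read content-index.csv, print the result, and set `process.exitCode = 1` when the returned array is not empty so the gate can fail the build.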
- Merge duplicates with a 301 redirect; never delete. Add the redirect once in your hosting config, then set draft: true on the merged file so it stops building:
# _redirects (Astro/Netlify style)
/articles/dup-slug-old /articles/canonical-slug 301
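After several rounds of merging, an old slug can end up pointing at a page that was itself merged later. The sketch below is one possible way to generate the redirect lines from a merge map so that every old path points straight at its final target. The function name `renderRedirects` and the input shape (an object mapping old path to new path) are assumptions for illustration, not part of any hosting provider's tooling.

```javascript
// merges: { '/old/path': '/new/path', ... } (hypothetical input shape)
// Returns _redirects-style lines with chains collapsed to the final target.
function renderRedirects(merges) {
  const map = new Map(Object.entries(merges));
  const lines = [];
  for (const from of map.keys()) {
    let target = map.get(from);
    const seen = new Set([from]);
    // Follow the chain until we reach a path that is not redirected itself.
    while (map.has(target)) {
      if (seen.has(target)) throw new Error(`Redirect loop involving ${target}`);
      seen.add(target);
      target = map.get(target);
    }
    lines.push(`${from} ${target} 301`);
  }
  return lines.join('\n');
}
```

Collapsing chains keeps each visitor and crawler to a single hop, and the loop check catches the case where two articles were accidentally merged into each other.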
- Audit internal links weekly. A site this size always has dead links from renamed slugs:
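// scripts/check-internal-links.mjs
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

const known = new Set(/* all live slugs from content-index.csv */);
const offenders = [];

// Recursively visit every .mdx file under dir.
function walk(dir, visit) {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) walk(path, visit);
    else if (entry.name.endsWith('.mdx')) visit(path);
  }
}

walk('src/content/articles', (file) => {
  const md = readFileSync(file, 'utf8');
  // Matches Markdown links of the form ](/<lang>/articles/<slug>/)
  const matches = md.matchAll(/\]\(\/[a-z]+\/articles\/([a-z0-9-]+)\/\)/g);
  for (const m of matches) {
    if (!known.has(m[1])) offenders.push({ file, broken: m[1] });
  }
});
console.table(offenders.slice(0, 50));
if (offenders.length > 0) process.exitCode = 1; // non-zero exit lets a build gate fail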
- Keep the sitemap to indexable URLs only. In astro.config.mjs:
sitemap({
  filter: (page) => {
    // exclude tag pages and draft pages; noindex pages need their own check
    return !/\/tag\//.test(page) && !page.endsWith('/draft/');
  },
})
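The filter above handles tag and draft URLs by pattern, but noindex pages usually cannot be recognized from the URL alone. A minimal sketch of one way to cover them, assuming you can build a list of noindex paths yourself (for example from frontmatter): `makeSitemapFilter` and `noindexPaths` are hypothetical names, and the URLs in the usage note are placeholders.

```javascript
// Build a filter callback from a list of noindex pathnames (hypothetical input).
// The callback receives a full page URL and returns true to keep it in the sitemap.
function makeSitemapFilter(noindexPaths) {
  const blocked = new Set(noindexPaths);
  return (page) => {
    const { pathname } = new URL(page);
    if (/\/tag\//.test(pathname)) return false;      // tag archive pages
    if (pathname.endsWith('/draft/')) return false;  // draft pages
    return !blocked.has(pathname);                   // pages marked noindex
  };
}
```

Usage would look like `sitemap({ filter: makeSitemapFilter(['/en/articles/old-post/']) })`, so the sitemap and the noindex list cannot drift apart.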
- Set up a “retire shelf” process. Pull 180-day Search Console data, list zero-click articles, decide per row:
# Pull last 180 days of clicks by page from the Search Console API.
# $SITE must be URL-encoded (for example https%3A%2F%2Fexample.com%2F).
curl -X POST "https://www.googleapis.com/webmasters/v3/sites/$SITE/searchAnalytics/query" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "startDate": "2025-11-23",
    "endDate": "2026-05-22",
    "dimensions": ["page"],
    "rowLimit": 25000
  }' \
  | jq -r '.rows[]? | select(.clicks==0) | .keys[0]' > retire-shelf.txt
wc -l retire-shelf.txt
# expect 200-400 lines on a 1k-article site
# The API only returns pages that had at least one impression,
# so pages with no impressions at all will not appear in this list.
For each: merge into a stronger neighbor, refresh in place, or noindex,follow. Record decisions in the content index.
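Two details of the pull are easy to get wrong: the date window is hard-coded, and the Search Analytics response only contains pages that had impressions, so articles nobody saw at all are missing from it. The sketch below shows one way to handle both, assuming you already have the list of live URLs (for example from the content index) in the same absolute form the API returns. `lastNDays` and `retireCandidates` are hypothetical helper names.

```javascript
// Build a rolling { startDate, endDate } window in YYYY-MM-DD (UTC).
function lastNDays(n, today = new Date()) {
  const iso = (d) => d.toISOString().slice(0, 10);
  const start = new Date(today);
  start.setUTCDate(start.getUTCDate() - n);
  return { startDate: iso(start), endDate: iso(today) };
}

// liveUrls: every published URL, in the same form the API returns.
// rows: the "rows" array of a searchAnalytics/query response with dimensions ["page"].
// Returns pages with zero clicks, including pages missing from the response entirely.
function retireCandidates(liveUrls, rows) {
  const clicksByUrl = new Map(rows.map((r) => [r.keys[0], r.clicks]));
  return liveUrls
    .filter((url) => (clicksByUrl.get(url) ?? 0) === 0)
    .map((url) => ({ url, seenInSearch: clicksByUrl.has(url) }));
}
```

The `seenInSearch` flag separates pages that get impressions but no clicks (often a title or intent problem) from pages that never appear in search at all (often an indexing problem), which can inform the merge, refresh, or noindex decision.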
- Shift the publish/maintenance ratio. A weekly content-ops checklist:
weekly:
  - audit_duplicates: 1 h   # run dupe scan, decide merges
  - audit_internal: 1 h     # broken-link check, fix
  - retire_review: 2 h      # pull GSC data, retire 5-10 articles
  - publish_new: 6 h        # ~3 articles
total: 10 h
maintenance_ratio: 40%
- Add a prebuild gate that fails the build if any of the audit thresholds are violated:
node scripts/find-duplicate-keywords.mjs || exit 1
node scripts/check-internal-links.mjs || exit 1
node scripts/audit-sitemap-vs-index.mjs || exit 1
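The third script in the gate is not shown in this guide. One plausible reading is that it compares the URLs in the built sitemap with the URLs expected from the content index; the sketch below implements that comparison under this assumption. `auditSitemap` is a hypothetical name, and both inputs are plain arrays of URL strings in the same form.

```javascript
// Compare sitemap URLs with the URLs expected from the content index.
function auditSitemap(sitemapUrls, indexUrls) {
  const inSitemap = new Set(sitemapUrls);
  const inIndex = new Set(indexUrls);
  return {
    // Published articles that never made it into the sitemap.
    missingFromSitemap: indexUrls.filter((u) => !inSitemap.has(u)),
    // Sitemap entries with no matching article (retired, renamed, or drafts).
    staleInSitemap: sitemapUrls.filter((u) => !inIndex.has(u)),
  };
}
```

A gate script could print both lists and set `process.exitCode = 1` when either one is not empty.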
Implementation checklist
- Content index is generated, not hand-maintained.
- Duplicate-keyword scan runs in prebuild.
- Internal link check runs at least weekly.
- Retire shelf is reviewed monthly with Search Console data.
- 30-40% of weekly time goes to maintenance, tracked.
After-launch verification
- Coverage report “Duplicate” counts trend down over 4-8 weeks.
- Indexed-vs-submitted ratio rises above 90%.
- Internal-link checker reports zero broken on the latest run.
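The indexed-vs-submitted target is simple arithmetic, sketched here so it can be tracked over time. The helper names are hypothetical, and the example numbers are the 1,200 submitted and 700 indexed from the symptom list earlier in this guide.

```javascript
// Share of submitted sitemap URLs that Search Console reports as indexed.
function indexedRatio(indexed, submitted) {
  if (submitted <= 0) throw new RangeError('submitted must be positive');
  return indexed / submitted;
}

// True when the ratio reaches the target (0.9 matches the 90% goal above).
function meetsTarget(indexed, submitted, target = 0.9) {
  return indexedRatio(indexed, submitted) >= target;
}

console.log(indexedRatio(700, 1200).toFixed(2)); // 0.58
console.log(meetsTarget(700, 1200));             // false
```

At 1,200 submitted URLs, the 90% target means at least 1,080 indexed.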
Common pitfalls
- Trying to scale with the same workflow you used at 100 articles. Manual everything breaks past 500.
- Refusing to retire old articles for sentimental reasons. Dead pages drag the whole site.
- Adding more authors without an editorial process. Inconsistency multiplies fast.
- Doing one massive audit yearly. Continuous small audits beat occasional huge ones.
- Deleting instead of 301-merging — you lose any link equity that did exist.
- Keeping noindex pages in the sitemap; Search Console flags them as “Submitted URL marked ‘noindex’”.
FAQ
- Should I split into multiple sites at 1,000? Only if topics genuinely don’t overlap and each side stands on its own. Splitting also splits authority, which hurts both.
- How many “dead” articles is normal? 20-40% of articles getting under 1 click/month is common even on healthy sites. The question is whether they actively hurt (duplicate intent) or just sit (low-priority refresh).
- Do I need a CMS at this scale? Not necessarily. Astro Content Collections plus a generated index works fine into the multi-thousand range. The bottleneck is process, not tooling.
- How long does a full audit take? For 1,000 articles, plan 2-3 weeks of part-time work the first time. Ongoing maintenance after that takes a few hours per week.
- Can AI help with the audit? Yes. Use embeddings for duplicate detection and a model to suggest merge candidates. Always human-review the merge decisions.
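For the embeddings suggestion, a minimal sketch of the comparison step, assuming each article already has an embedding vector from whatever model you use (producing the vectors is outside this sketch). `cosine` and `similarPairs` are hypothetical names, and the 0.9 threshold is an arbitrary starting point to tune, not a recommended value.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// items: [{ slug, vector }]. Returns pairs at or above the threshold, most similar first.
// Compares every pair, which is fine for a few thousand articles.
function similarPairs(items, threshold = 0.9) {
  const pairs = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const score = cosine(items[i].vector, items[j].vector);
      if (score >= threshold) pairs.push({ a: items[i].slug, b: items[j].slug, score });
    }
  }
  return pairs.sort((x, y) => y.score - x.score);
}
```

The output is a list of merge candidates for a human to review, not a list of merges to apply.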