Managing a Content Site Past 1,000 Articles: A Scripts-First Playbook

Past 1,000 articles you manage with scripts, not willpower. A generated content index, duplicate scanner, link checker, and Search Console retire-shelf workflow, with current 2026 API limits.

Published: May 15, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

At 100 articles you can hold the whole site in your head. At 1,000 you cannot. The site starts to contradict itself, grow duplicate-intent pages, accumulate broken internal links, and hide dead clusters you forgot you published. Managing a site past 1,000 articles is mostly about systems: a generated content index, a duplicate scanner, a link checker, and a retire shelf driven by Search Console data. Writing is the easy part now; keeping the existing 1,000 healthy is the work.

This playbook is the exact workflow this site runs at roughly 1,200 articles per language, with the current (June 2026) Google API limits you have to design around.

TL;DR

Generate the content index from the content folder with a script, never by hand. Stale spreadsheets are how duplicates slip through.
Scan for duplicate intent by grouping on the primary keyword; merge with a 301 and draft: true, never delete.
Run a broken-internal-link check at least weekly; a 1,000-page site renames slugs constantly.
Pull 180-day Search Console data and run a monthly retire shelf. The Search Analytics API returns up to 25,000 rows per request and caps at 50,000 rows per day per site per search type, so paginate with startRow.
Spend 30 to 40 percent of weekly content time on maintenance, and gate the build so audits actually run.

Why traffic stalls after 1,000 articles

Sites that cross 1,000 articles without management discipline often see traffic plateau or fall, even while still publishing. The cause is almost always internal, not an external algorithm change:

Duplicate competition. Two or three articles target the same query and split clicks; Google picks one canonical and ignores the rest.
Link rot. Renamed slugs leave dead internal links that waste crawl budget and bleed PageRank into 404s.
Crawl-budget waste. Thin, zero-click pages and tag archives consume the crawl Google allots you, so fresh pages get discovered slower.

Past 1,000, every recurring task needs a script behind it. Manual everything quietly breaks somewhere around 500.

Symptoms you have outgrown manual management

You cannot answer “do I already have an article on X?” without searching the repo.
Search Console shows a rising count of “Duplicate without user-selected canonical” under Indexing.
Indexed page count is dropping even though you keep publishing.
Multiple articles rank for the same query and cannibalize each other in the Performance report.
Your sitemap has 1,200 entries but Search Console reports only 700 indexed.

Before you start

Block 2 to 3 weeks for the first full audit; budget a few hours per week for ongoing maintenance.
Confirm git status is clean before running any bulk-edit script, so a bad run is one git checkout away from undone.
Set up Search Console API access (a service-account OAuth credential, verified as an owner on the property) for programmatic data pulls. The same credential drives both the Search Analytics and URL Inspection endpoints.

Step by step

1. Generate a content index from the folder, not by hand

A 40-line Node script keeps the index always fresh and removes the “I forgot to update the sheet” failure mode entirely:

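// scripts/build-content-index.mjs
// Walks src/content/articles/<lang>/<category>/*.mdx and writes content-index.csv.
import { readdirSync, readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter';

const ROOT = 'src/content/articles';
const rows = [];

// List only subdirectories, so stray files such as .DS_Store do not crash the walk.
const dirs = (p) =>
  readdirSync(p, { withFileTypes: true }).filter((d) => d.isDirectory()).map((d) => d.name);

for (const lang of dirs(ROOT)) {
  for (const cat of dirs(join(ROOT, lang))) {
    for (const file of readdirSync(join(ROOT, lang, cat))) {
      if (!file.endsWith('.mdx')) continue;
      const { data, content } = matter(readFileSync(join(ROOT, lang, cat, file), 'utf8'));
      rows.push({
        slug: data.urlSlug,
        lang,
        category: data.category,
        title: data.title,
        primaryKeyword: data.primaryKeyword || '',
        publishedAt: data.publishedAt,
        // filter(Boolean) drops the empty strings split() leaves at the edges
        words: content.split(/\s+/).filter(Boolean).length,
      });
    }
  }
}

// Quote every field and double embedded quotes so titles with quotes stay valid CSV.
// YAML dates arrive as Date objects, so print them as YYYY-MM-DD.
const cell = (v) => {
  const text = v instanceof Date ? v.toISOString().slice(0, 10) : String(v ?? '');
  return `"${text.replace(/"/g, '""')}"`;
};

writeFileSync('content-index.csv',
  'slug,lang,category,title,primaryKeyword,publishedAt,words\n' +
  rows.map(r => Object.values(r).map(cell).join(',')).join('\n')
);
console.log(`Wrote ${rows.length} rows`);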

Run it on demand; commit the CSV when you want a dated snapshot to diff against later.

2. Scan for duplicate intent

Group articles by primaryKeyword. Any keyword with more than one article is a merge-or-canonical decision waiting to happen:

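# primaryKeyword is the third column from the end; counting from the right
# keeps the split correct when a title contains commas. Empty keywords are skipped.
awk -F, 'NR>1 && $(NF-2) != "\"\"" {print $(NF-2)}' content-index.csv | sort | uniq -c | sort -rn | head
#  3 "submit sitemap search console"
#  2 "firebase hosting cache"
# ...
# any count > 1 needs a merge or canonical decision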

Exact-string matching catches the obvious cases. For near-duplicates with different wording, see the embeddings approach in the FAQ.
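The same grouping can be done in Node, which sidesteps CSV column splitting altogether. This is a minimal sketch, assuming the index rows are already loaded as objects with slug and primaryKeyword fields; the function name findDuplicateKeywords and the normalization rule (trim, lowercase, collapse whitespace) are illustrative choices, not this site's actual script.

```javascript
// Hypothetical helper, not the site's actual find-duplicate-keywords.mjs.
// rows: objects with at least { slug, primaryKeyword }.
function findDuplicateKeywords(rows) {
  const groups = new Map();
  for (const row of rows) {
    // Assumed normalization: trim, lowercase, collapse runs of whitespace.
    const key = (row.primaryKeyword || '').trim().toLowerCase().replace(/\s+/g, ' ');
    if (!key) continue; // a missing keyword is a gap, not a duplicate
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row.slug);
  }
  // Keep only keywords claimed by more than one article, largest groups first.
  return [...groups.entries()]
    .filter(([, slugs]) => slugs.length > 1)
    .map(([keyword, slugs]) => ({ keyword, slugs }))
    .sort((a, b) => b.slugs.length - a.slugs.length);
}

const sampleRows = [
  { slug: 'submit-sitemap-gsc', primaryKeyword: 'submit sitemap search console' },
  { slug: 'sitemap-submission-guide', primaryKeyword: 'Submit Sitemap  Search Console' },
  { slug: 'firebase-cache-headers', primaryKeyword: 'firebase hosting cache' },
  { slug: 'untagged-post', primaryKeyword: '' },
];
console.log(findDuplicateKeywords(sampleRows));
```

Normalizing before grouping catches case and spacing variants that exact string matching would treat as different keywords.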

3. Merge duplicates with a 301, never delete

Add the redirect once in your hosting config, then set draft: true on the merged file so it stops building. Deleting throws away whatever link equity the old URL had; a 301 passes it to the survivor.

# _redirects (Netlify / Cloudflare Pages style)
/articles/dup-slug-old  /articles/canonical-slug  301

On Firebase Hosting, the equivalent goes in firebase.json under hosting.redirects with "type": 301.
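If merge decisions are recorded as data, both redirect formats can be generated rather than typed. The sketch below assumes a hypothetical merges array of { from, to } pairs; the three function names are illustrative. It also flags redirect chains, where the survivor of one merge was itself merged later.

```javascript
// merges: [{ from, to }] pairs recorded during duplicate review (hypothetical shape).

// Lines for a _redirects file: source, destination, status.
function toRedirectsFile(merges) {
  return merges.map(({ from, to }) => `${from}  ${to}  301`).join('\n') + '\n';
}

// Entries for the hosting.redirects array in firebase.json.
function toFirebaseRedirects(merges) {
  return merges.map(({ from, to }) => ({ source: from, destination: to, type: 301 }));
}

// A target that is itself redirected creates a chain; repoint it at the final survivor.
function findRedirectChains(merges) {
  const sources = new Set(merges.map((m) => m.from));
  return merges.filter((m) => sources.has(m.to));
}

const merges = [
  { from: '/articles/dup-slug-old', to: '/articles/canonical-slug' },
];
console.log(toRedirectsFile(merges));
console.log(JSON.stringify(toFirebaseRedirects(merges), null, 2));
```

Generating both outputs from one list keeps the redirect config and the merge log from drifting apart.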

4. Audit internal links weekly

A site this size renames slugs constantly, and every rename can orphan inbound links. Catch them before Google does:

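// scripts/check-internal-links.mjs
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

// Known slugs = first column of content-index.csv (header skipped, quotes stripped).
const known = new Set(
  readFileSync('content-index.csv', 'utf8').split('\n').slice(1)
    .map((line) => line.split(',')[0].replace(/"/g, ''))
    .filter(Boolean)
);
const offenders = [];

// Recursively call fn on every .mdx file under dir.
function walk(dir, fn) {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) walk(full, fn);
    else if (entry.name.endsWith('.mdx')) fn(full);
  }
}

walk('src/content/articles', (file) => {
  const md = readFileSync(file, 'utf8');
  // Matches markdown links shaped like ](/<lang>/articles/<slug>/)
  const matches = md.matchAll(/\]\(\/[a-z]+\/articles\/([a-z0-9-]+)\/\)/g);
  for (const m of matches) {
    if (!known.has(m[1])) offenders.push({ file, broken: m[1] });
  }
});

console.table(offenders.slice(0, 50));
if (offenders.length > 0) process.exit(1); // non-zero exit fails a prebuild gate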

5. Keep the sitemap to indexable URLs only

Every URL in the sitemap that you also noindex gets reported in Search Console as a submitted page excluded by noindex, and it dilutes the crawl. With @astrojs/sitemap, the filter callback receives the full page URL and returns true to keep it; serialize can return undefined to drop an entry entirely:

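// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://example.com',     // placeholder; the sitemap integration needs site set
  integrations: [sitemap({
    filter: (page) =>
      !/\/tag\//.test(page) &&     // drop tag archives
      !/\/draft\//.test(page),     // drop draft routes
  })],
});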

6. Run a monthly retire shelf from Search Console data

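Pull 180 days of clicks by page and list the zero-click articles. The Search Analytics API returns up to 25,000 rows per request and caps at 50,000 rows per day per site per search type (as of June 2026), so a 1,000-plus-article site usually fits in one request; paginate with startRow if you ever need more. The response only contains pages that earned at least one impression in the window, so also compare it against the content index: an article that never appears is zero-click too. The query below lists the pages that had impressions but no clicks: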

# Pull last 180 days of clicks by page from the Search Analytics API.
# $SITE must be URL-encoded, e.g. https%3A%2F%2Fexample.com%2F or sc-domain%3Aexample.com
curl -s -X POST "https://www.googleapis.com/webmasters/v3/sites/$SITE/searchAnalytics/query" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "startDate": "2025-12-06",
    "endDate":   "2026-06-04",
    "dimensions": ["page"],
    "rowLimit": 25000,
    "startRow": 0
  }' \
  | jq -r '.rows[]? | select(.clicks==0) | .keys[0]' > retire-shelf.txt
# .rows[]? keeps jq from erroring when the response has no rows

wc -l retire-shelf.txt
# expect 200-400 lines on a 1k-article site

For each line, choose one of three actions: merge it into a stronger neighbor, refresh it in place, or set it to noindex,follow. Record the decision back in the content index so you do not re-litigate it next month.
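The API response only covers pages that had impressions, so a complete shelf is better built by subtracting "pages with clicks" from the content index. The sketch below shows that diff plus startRow pagination. It assumes a hypothetical fetchPage(startRow, rowLimit) function that you supply, which performs the POST and resolves to the parsed JSON; buildRetireShelf and fetchAllRows are illustrative names.

```javascript
const ROW_LIMIT = 25000; // per-request maximum stated above

// fetchPage is a hypothetical callback: (startRow, rowLimit) => Promise<parsed JSON response>.
async function fetchAllRows(fetchPage) {
  let all = [];
  for (let startRow = 0; ; startRow += ROW_LIMIT) {
    const res = await fetchPage(startRow, ROW_LIMIT);
    const rows = res.rows || []; // treat a missing "rows" field as an empty page
    all = all.concat(rows);
    if (rows.length < ROW_LIMIT) break; // a short page means the end was reached
  }
  return all;
}

// indexUrls: every live article URL from the content index.
// rows: Search Analytics rows queried with dimensions ["page"], so keys[0] is the URL.
function buildRetireShelf(indexUrls, rows) {
  const withClicks = new Set(rows.filter((r) => r.clicks > 0).map((r) => r.keys[0]));
  // Everything without clicks lands here, including URLs absent from the response.
  return indexUrls.filter((url) => !withClicks.has(url));
}

const exampleRows = [
  { keys: ['https://example.com/en/articles/a/'], clicks: 12 },
  { keys: ['https://example.com/en/articles/b/'], clicks: 0 },
];
const exampleIndex = [
  'https://example.com/en/articles/a/',
  'https://example.com/en/articles/b/',
  'https://example.com/en/articles/c/', // no impressions at all
];
console.log(buildRetireShelf(exampleIndex, exampleRows));
```

The URLs in the index must be formatted exactly as Search Console reports them (protocol, host, trailing slash), otherwise every article will look unmatched.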

7. Shift the publish-to-maintenance ratio

Below 1,000 articles you can publish nearly full time. Past it, maintenance has to earn a fixed slice of every week or the audits never run. A concrete weekly content-ops budget:

weekly:
  - audit_duplicates:  1 h    # run dupe scan, decide merges
  - audit_internal:    1 h    # broken-link check, fix
  - retire_review:     2 h    # pull GSC data, retire 5-10 articles
  - publish_new:       6 h    # ~3 articles
total: 10 h
maintenance_ratio: 40%

8. Gate the build on the audits

A script that only runs when you remember it does not run. Wire the audits into prebuild so a regression fails CI instead of shipping:

node scripts/find-duplicate-keywords.mjs || exit 1
node scripts/check-internal-links.mjs    || exit 1
node scripts/audit-sitemap-vs-index.mjs  || exit 1
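The sitemap audit is not shown above, so here is one way its core comparison could look. This is a sketch under assumptions: diffSitemap is a hypothetical name, the sitemap XML is passed in as a string (with @astrojs/sitemap the built file is typically dist/sitemap-0.xml), and URLs are assumed to contain no XML-escaped characters such as &amp;.

```javascript
// Hypothetical core of a sitemap-vs-index audit.
// sitemapXml: contents of the built sitemap; indexUrls: URLs that should be indexable.
function diffSitemap(sitemapXml, indexUrls) {
  // Pull every <loc> value out of the XML with a simple pattern match.
  const inSitemap = new Set(
    [...sitemapXml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1])
  );
  const expected = new Set(indexUrls);
  return {
    missingFromSitemap: indexUrls.filter((u) => !inSitemap.has(u)),
    unexpectedInSitemap: [...inSitemap].filter((u) => !expected.has(u)),
  };
}

const xml = `<urlset>
  <url><loc>https://example.com/en/articles/a/</loc></url>
  <url><loc>https://example.com/tag/seo/</loc></url>
</urlset>`;
const report = diffSitemap(xml, [
  'https://example.com/en/articles/a/',
  'https://example.com/en/articles/b/',
]);
console.log(report);
```

In a gate script, exiting with a non-zero code whenever either list is non-empty would make the build fail the same way the other two checks do.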

Faster re-indexing: what actually works in 2026

When you merge or retire pages, you want search engines to recrawl quickly. Two facts to set expectations:

Google does not support IndexNow (as of June 2026), and its Indexing API is restricted to job postings and livestream pages, not general content. For Google, your levers are a clean sitemap, internal links to the changed pages, and the occasional manual “Request indexing” in the URL Inspection tool.
IndexNow works for Bing, Yandex, Naver, and Seznam. You can ping up to 10,000 URLs per submission, which is plenty for a post-merge batch. It is a single HTTP request with your key.
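An IndexNow batch submission can be sketched as follows. The host, key, and keyLocation values are placeholders you must replace, the key file has to be reachable at keyLocation for the submission to be accepted, and the function names are illustrative. The code uses the global fetch available in Node 18 and later.

```javascript
const INDEXNOW_ENDPOINT = 'https://api.indexnow.org/indexnow';
const MAX_URLS_PER_POST = 10000; // per-submission limit stated above

// Split a URL list into request bodies of at most 10,000 URLs each.
function buildIndexNowPayloads({ host, key, keyLocation, urls }) {
  const payloads = [];
  for (let i = 0; i < urls.length; i += MAX_URLS_PER_POST) {
    payloads.push({ host, key, keyLocation, urlList: urls.slice(i, i + MAX_URLS_PER_POST) });
  }
  return payloads;
}

// POST each payload in turn and log the HTTP status.
async function submitToIndexNow(config) {
  for (const body of buildIndexNowPayloads(config)) {
    const res = await fetch(INDEXNOW_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json; charset=utf-8' },
      body: JSON.stringify(body),
    });
    console.log(`IndexNow responded ${res.status} for ${body.urlList.length} URLs`);
  }
}

// Example call (placeholder values, not executed here):
// submitToIndexNow({
//   host: 'example.com',
//   key: 'your-indexnow-key',
//   keyLocation: 'https://example.com/your-indexnow-key.txt',
//   urls: ['https://example.com/en/articles/canonical-slug/'],
// });
```

Submitting both the retired URL and its redirect target after a merge lets participating engines pick up the 301 and the updated survivor in one pass.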

For programmatic status checks, the URL Inspection API is capped at 2,000 queries per day and 600 per minute per property (as of June 2026), so audit indexing status in batches rather than hammering every URL nightly.
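One way to respect the daily cap is to rotate through the site, inspecting a different slice each day. The sketch below assumes a default budget of 1,500 URLs per day to leave headroom for manual checks; that number, and the names pickDailyBatch and inspectionRequest, are illustrative. The request body shape (inspectionUrl plus siteUrl) is the one the URL Inspection index:inspect method expects.

```javascript
// Return the slice of URLs to inspect on a given day, cycling through the whole list.
// dailyQuota defaults to an assumed 1,500, below the 2,000-per-day cap.
function pickDailyBatch(urls, dayIndex, dailyQuota = 1500) {
  if (urls.length === 0) return [];
  const batches = Math.ceil(urls.length / dailyQuota);
  const slot = dayIndex % batches; // wraps around once every URL has been covered
  return urls.slice(slot * dailyQuota, (slot + 1) * dailyQuota);
}

// Body for POST https://searchconsole.googleapis.com/v1/urlInspection/index:inspect
function inspectionRequest(siteUrl, inspectionUrl) {
  return { inspectionUrl, siteUrl };
}

const allUrls = Array.from({ length: 3200 }, (_, i) => `https://example.com/en/articles/post-${i}/`);
const today = pickDailyBatch(allUrls, 2);
console.log(today.length, inspectionRequest('sc-domain:example.com', today[0]));
```

Sending the requests one at a time stays far below the per-minute limit in practice; only parallel fan-out needs explicit throttling.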

Implementation checklist

Content index is generated by script, not hand-maintained.
Duplicate-keyword scan runs in prebuild.
Internal-link check runs at least weekly.
Retire shelf is reviewed monthly against 180-day Search Console data.
30 to 40 percent of weekly content time goes to maintenance, and it is tracked.

How to verify it is working

The “Duplicate without user-selected canonical” count in the Indexing report trends down over 4 to 8 weeks.
The indexed-to-submitted ratio rises above 90 percent.
The internal-link checker reports zero broken links on the latest run.
Average clicks per article stops falling as you publish (the cannibalization tax is gone).

Common pitfalls

Scaling with your 100-article workflow. Manual everything breaks past 500 articles.
Refusing to retire old articles for sentimental reasons. Dead pages drag the whole site’s crawl efficiency.
Adding authors without an editorial process. Inconsistency multiplies faster than output.
One giant yearly audit. Continuous small audits beat occasional huge ones every time.
Deleting instead of 301-merging. You throw away any link equity the old URL had.
Leaving noindex pages in the sitemap. Search Console reports them as submitted but excluded by noindex, and you waste crawl.

FAQ

Should I split into multiple sites at 1,000 articles? Only if the topics genuinely do not overlap and each side can stand on its own. Splitting also splits domain authority, which usually hurts both halves more than the cleaner focus helps.
How many “dead” articles is normal? Even on healthy sites, 20 to 40 percent of articles getting under one click per month is common. The real question is whether they actively hurt (duplicate intent, broken links) or just sit quietly as low-priority refresh candidates.
Do I need a CMS at this scale? Not necessarily. Astro Content Collections plus a generated index works comfortably into the multi-thousand range. The bottleneck is process, not tooling.
Can AI help with the audit? Yes, mainly for near-duplicate detection. Embed each article’s title and first paragraph with an embedding model, compute cosine similarity, and flag any pair above roughly 0.9 as a merge candidate. Then have a model draft the merge rationale, and human-review every actual merge.
How long does a full audit take? For 1,000 articles, plan 2 to 3 weeks of part-time work the first time. After that, ongoing maintenance runs a few hours per week.
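The near-duplicate check from the FAQ can be sketched as below. It assumes each article already carries an embedding array of equal length from whatever embedding model you use; the embedding step itself, the field names, and the function names are not specified by this article and are illustrative.

```javascript
// Cosine similarity of two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction to compare
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// articles: [{ slug, embedding }] (hypothetical shape). Compares every pair once.
function findMergeCandidates(articles, threshold = 0.9) {
  const pairs = [];
  for (let i = 0; i < articles.length; i++) {
    for (let j = i + 1; j < articles.length; j++) {
      const score = cosineSimilarity(articles[i].embedding, articles[j].embedding);
      if (score > threshold) pairs.push({ a: articles[i].slug, b: articles[j].slug, score });
    }
  }
  return pairs.sort((x, y) => y.score - x.score); // strongest matches first
}

// Toy 2-dimensional vectors; real embeddings have hundreds of dimensions or more.
const toyArticles = [
  { slug: 'submit-sitemap-gsc', embedding: [1, 0] },
  { slug: 'sitemap-submission-guide', embedding: [0.99, 0.1] },
  { slug: 'firebase-cache-headers', embedding: [0, 1] },
];
console.log(findMergeCandidates(toyArticles));
```

The pairwise loop is quadratic, which is about half a million comparisons at 1,000 articles and still quick; treat the flagged pairs as a review queue, not as automatic merges.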

Tags: #Indie dev #Content ops #SEO #Website planning #Workflow