Run a Site-Wide Content Audit: A Repeatable Quarterly Process

A scripted quarterly content audit: URL inventory, Search Console join, dead-page and near-ranker flags, duplicate scanner, linkinator broken-link check, and a diff-able decision log.

Published: May 15, 2026 Updated: Jun 05, 2026 Author: AI Productivity Guide Team

Most content-audit guides describe a process so heavy you do it once and never again. For an indie site, you want the opposite: a light audit you can actually run every quarter, backed by scripts that automate most of the work. The goal is not a perfect site. It is catching regressions while they are still cheap to fix, and keeping a written record of what you decided.

TL;DR

Treat a content audit as one big join — your file-system URL inventory plus a 28-day Google Search Console pull plus five heuristics — and write each step as a script in scripts/. The first pass on 200-500 articles takes 4-8 hours; every audit after that takes 1-3 hours because the tooling already exists. Record every keep/refresh/merge/delete decision in a committed CSV so next quarter starts from a diff, not from zero.

When you actually need this

Run the audit if two or more of these are true:

Your last audit was more than 6 months ago, or you have never done one.
In Search Console’s Pages report (the old “Index Coverage”), your indexed-to-submitted ratio is below 90%.
You have 100+ articles and can no longer list what is on the site without looking.
Your internal-link checker has not run in months.
You suspect duplicate-intent pages but cannot name them.

If only one is true, skip the full audit and just run the broken-link checker (step 6) — that is the highest-value 10 minutes.

What you need before you start

Search Console API access (OAuth). Without impression data the audit is mostly guessing. New properties need ~28 days of history before the numbers mean anything.
A file-based content layer (Astro/Hugo content collections, a Markdown folder, etc.) so the inventory is generated, not hand-maintained.
One CSV format you commit and reuse. Each completed audit becomes the baseline the next one diffs against.

A note on Search Console limits as of June 2026: performance data only goes back 16 months (Google deletes the oldest day each day), and the API returns at most 25,000 rows per request and 50,000 rows per day per search type. For a site under ~20,000 URLs a single 28-day page-dimension pull is well inside those caps.
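For properties large enough to hit the per-request cap, the same query can be repeated with an increasing startRow. The following is a minimal sketch of that paging loop, not a full client: fetch_page is a hypothetical callable you would write yourself to perform one searchAnalytics/query request and return its rows.

```python
from typing import Callable

ROW_LIMIT = 25_000  # per-request cap described above


def fetch_all_rows(fetch_page: Callable[[int, int], list]) -> list:
    """Collect rows by advancing startRow until a short page comes back.

    fetch_page(start_row, row_limit) is a hypothetical helper: it should run
    one searchAnalytics/query request and return the "rows" list (or []).
    """
    rows = []
    start_row = 0
    while True:
        page = fetch_page(start_row, ROW_LIMIT)
        rows.extend(page)
        # A page shorter than the limit means there is nothing left to fetch.
        if len(page) < ROW_LIMIT:
            return rows
        start_row += ROW_LIMIT
```

Passing the fetcher in as an argument keeps the loop testable without network access or credentials.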

Step by step

1. Generate the URL inventory

A ~30-line Node script reads every MDX file and emits a CSV. This is the spine of the audit — everything else joins onto it.

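// scripts/audit-step1-inventory.mjs
// Reads src/content/articles/<lang>/<category>/*.mdx and writes one CSV row per article.
import { readdirSync, readFileSync, writeFileSync } from 'node:fs';
import matter from 'gray-matter';

// Quote every field and double embedded quotes so titles containing " stay valid CSV.
const csvField = (v) => `"${String(v ?? '').replace(/"/g, '""')}"`;
// YAML can parse an unquoted date into a Date object; normalize to YYYY-MM-DD.
// Assumes string dates already start with YYYY-MM-DD.
const isoDate = (d) => (d instanceof Date ? d.toISOString() : String(d ?? '')).slice(0, 10);

const root = 'src/content/articles';
const rows = [];
for (const lang of ['en', 'zh']) {
  for (const cat of readdirSync(`${root}/${lang}`)) {
    for (const f of readdirSync(`${root}/${lang}/${cat}`)) {
      if (!f.endsWith('.mdx')) continue;
      const { data, content } = matter(
        readFileSync(`${root}/${lang}/${cat}/${f}`, 'utf8'));
      rows.push({
        url: `https://yourdomain.com/${lang}/articles/${data.urlSlug}/`,
        lang, category: cat,
        slug: data.urlSlug,
        title: data.title,
        primaryKeyword: data.primaryKeyword || '',
        publishedAt: isoDate(data.publishedAt),
        // filter(Boolean) drops the empty strings split() yields at the edges
        words: content.split(/\s+/).filter(Boolean).length,
      });
    }
  }
}
writeFileSync('audit-inventory.csv',
  'url,lang,category,slug,title,primaryKeyword,publishedAt,words\n' +
  rows.map(r => Object.values(r).map(csvField).join(',')).join('\n') + '\n');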

2. Pull 28-day Search Console data and join

Use a fixed 28-day window every quarter so quarter-over-quarter numbers are comparable. The page dimension gives you clicks, impressions, and average position per URL.

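# $SITE must be URL-encoded, e.g. https%3A%2F%2Fyourdomain.com%2F or sc-domain%3Ayourdomain.com
# startDate and endDate are both inclusive: Apr 25 through May 22 is exactly 28 days.
curl -s -X POST "https://www.googleapis.com/webmasters/v3/sites/$SITE/searchAnalytics/query" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  --data '{
    "startDate":"2026-04-25","endDate":"2026-05-22",
    "dimensions":["page"],"rowLimit":25000
  }' \
  | jq -r '.rows[]? | [.keys[0],.clicks,.impressions,.position] | @csv' \
  > gsc-28d.csv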
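Because the API treats startDate and endDate as inclusive, a 28-day window ending on a given day starts 27 days earlier. A small sketch for computing the pair instead of typing it by hand each quarter (audit_window is an illustrative name, not part of any library):

```python
from datetime import date, timedelta


def audit_window(end: date, days: int = 28) -> tuple:
    """Return (startDate, endDate) as ISO strings covering `days` days inclusive."""
    start = end - timedelta(days=days - 1)
    return start.isoformat(), end.isoformat()


start_date, end_date = audit_window(date(2026, 5, 22))
print(start_date, end_date)  # 2026-04-25 2026-05-22
```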

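If a single property exceeds 25,000 ranking URLs, page through with "startRow":25000 on a second request and concatenate. Then join GSC onto the inventory with a few lines of Python:

python3 -c "
import csv
# Index GSC rows by page URL; pages with no impressions are absent from the response.
with open('gsc-28d.csv', newline='') as f:
    gsc = {r[0]: r for r in csv.reader(f)}
with open('audit-inventory.csv', newline='') as src, open('audit-joined.csv', 'w', newline='') as dst:
    rows = csv.reader(src)
    out = csv.writer(dst)
    out.writerow(next(rows) + ['clicks', 'impressions', 'position'])  # header row
    for row in rows:
        m = gsc.get(row[0], ['', '0', '0', '0'])  # no GSC row means zero traffic
        out.writerow(row + m[1:4])
"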

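audit-joined.csv now has columns 1-8 from the inventory plus clicks (9), impressions (10), and position (11). The flags in steps 3, 4, 5, and 7 are each one awk line over this file. Those one-liners split on raw commas, so they assume no title or keyword contains a comma; check that before trusting the column numbers.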

3. Flag dead pages

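Pages that have been live more than 90 days with zero clicks and zero impressions. Google has effectively ignored them:

# cutoff = 90 days ago (GNU date; on macOS use: date -v-90d +%F)
# Assumes publishedAt (column 7) is an ISO YYYY-MM-DD string, so string comparison orders dates.
awk -F, -v cutoff="$(date -d '90 days ago' +%F)" \
  'NR>1 && $9==0 && $10==0 && $7<cutoff {print $4}' audit-joined.csv
# slugs needing a decision: merge / refresh / noindex / delete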

4. Flag near-rankers (the highest ROI list)

Average position 8-20 with more than 100 impressions. These pages already rank — a refresh often moves them onto page one, which is far cheaper than writing something new:

awk -F, 'NR>1 && $11>=8 && $11<=20 && $10>100' audit-joined.csv \
  | sort -t, -k10,10nr | head -30
# refresh list, sorted by impressions only (column 10), biggest wins first

5. Flag duplicates by intent

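Group by primary keyword within each language; any group of two or more is competing with itself for the same query:

awk -F, 'NR>1 && $6!="" {print $2 "," $6}' audit-joined.csv | sort | uniq -c \
  | awk '$1 > 1' | sort -rn
# primaryKeyword is column 6 (column 7 is publishedAt); rows with no keyword are skipped
# each count > 1 is a duplicate-intent cluster to merge or differentiate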
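The same grouping can be sketched in Python when you want the slugs in each cluster rather than just a count. This is an illustrative version, assuming rows are dicts keyed by the inventory header names (lang, slug, primaryKeyword), for example from csv.DictReader over audit-joined.csv. The case-insensitive match is an added assumption, not something the awk line does.

```python
from collections import defaultdict


def duplicate_clusters(rows):
    """Group slugs by (lang, primaryKeyword) and return only groups of 2+."""
    groups = defaultdict(list)
    for row in rows:
        keyword = row["primaryKeyword"].strip().lower()
        if not keyword:
            continue  # articles without a keyword cannot form a cluster
        groups[(row["lang"], keyword)].append(row["slug"])
    return {key: slugs for key, slugs in groups.items() if len(slugs) > 1}


# Hypothetical sample rows for illustration only.
rows = [
    {"lang": "en", "slug": "audit-guide", "primaryKeyword": "content audit"},
    {"lang": "en", "slug": "audit-checklist", "primaryKeyword": "Content Audit"},
    {"lang": "zh", "slug": "audit-guide-zh", "primaryKeyword": "content audit"},
    {"lang": "en", "slug": "no-keyword", "primaryKeyword": ""},
]
print(duplicate_clusters(rows))
# {('en', 'content audit'): ['audit-guide', 'audit-checklist']}
```

Keying on language as well as keyword keeps an English article and its Chinese counterpart from being reported as duplicates of each other.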

6. Run a broken-link check on the built site

linkinator (v7.6.1 as of June 2026) crawls the rendered site and reports non-200 links. Default concurrency is 100; throttle it so you do not hammer your own host, and use --retry so a transient 429 does not show up as a false positive:

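npx linkinator https://yourdomain.com \
  --recurse --retry --concurrency 10 --skip 'http(s)?://[^/]+/$' \
  --format CSV > linkinator-report.csv
# Column 2 is the HTTP status. Skip the header row. This also assumes skipped links
# carry the state SKIPPED in column 3, so they are excluded rather than shown as failures.
awk -F, 'NR>1 && $2 != "200" && $3 != "SKIPPED"' linkinator-report.csv | head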

Treat external 403s from sites that block crawlers as warnings, not failures (linkinator accepts status patterns like 403 or 4xx to ignore). Internal non-200s are real bugs — fix them.

7. Flag thin pages

Word count under 400 with no special reason (a glossary stub or redirect landing may be fine):

awk -F, 'NR>1 && $8<400 {print $4, $8}' audit-joined.csv

8. Write decisions back to the CSV

Add a decision column with one of: keep, refresh, merge:<target-slug>, noindex, delete. Commit the file. That committed CSV is the entire point of the audit — it is the baseline next quarter diffs against, and the record of why a URL is gone.
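To show what "diff-able" can mean in practice, here is a rough sketch that validates decision values and compares two quarters. It assumes each quarter's CSV has already been loaded into a dict of slug to decision; the function names and the slug pattern (lowercase letters, digits, hyphens) are illustrative assumptions.

```python
import re

# Allowed values from the step above; the merge target pattern is an assumption.
DECISION = re.compile(r"keep|refresh|noindex|delete|merge:[a-z0-9-]+")


def is_valid_decision(value: str) -> bool:
    """True if value is one of the allowed decisions, e.g. 'merge:some-slug'."""
    return DECISION.fullmatch(value) is not None


def diff_decisions(previous: dict, current: dict) -> dict:
    """Compare two {slug: decision} maps from consecutive audits."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(
            (slug, previous[slug], current[slug])
            for slug in shared
            if previous[slug] != current[slug]
        ),
    }


# Hypothetical decisions for two consecutive quarters.
q1 = {"a": "keep", "b": "refresh", "c": "delete"}
q2 = {"a": "keep", "b": "merge:a", "d": "keep"}
print(diff_decisions(q1, q2))
# {'added': ['d'], 'removed': ['c'], 'changed': [('b', 'refresh', 'merge:a')]}
```

In day-to-day use, git diff on the committed CSV gives the same information; a function like this is mainly useful if you want to script a summary.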

The five flags at a glance

Flag	Rule (column)	Typical action
Dead page	clicks (9) = 0 and impressions (10) = 0, publishedAt (7) more than 90 days ago	merge, refresh, noindex, or delete
Near-ranker	position (11) 8-20, impressions (10) > 100	refresh (highest ROI)
Duplicate intent	same primaryKeyword (6) on 2+ articles in one language	merge or differentiate
Broken link	linkinator status ≠ 200 (internal links)	fix the target or link
Thin page	words (8) < 400, no special reason	expand or merge
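The three per-row rules in the table (dead, near-ranker, thin) can be sketched as one function. This is an illustrative version, not the awk pipeline itself: it assumes each row is a dict with keys named clicks, impressions, position, words, and publishedAt (an ISO YYYY-MM-DD string). Duplicate intent and broken links are left out because they need cross-row or crawl data.

```python
from datetime import date


def flags_for(row: dict, today: date) -> list:
    """Return the per-row flags from the table for one joined CSV row."""
    clicks = float(row["clicks"])
    impressions = float(row["impressions"])
    position = float(row["position"])
    age_days = (today - date.fromisoformat(row["publishedAt"])).days

    flags = []
    if clicks == 0 and impressions == 0 and age_days > 90:
        flags.append("dead")
    if 8 <= position <= 20 and impressions > 100:
        flags.append("near-ranker")
    if int(row["words"]) < 400:
        flags.append("thin")
    return flags


# Hypothetical row: published Jan 1, no traffic, 350 words.
stub = {"clicks": "0", "impressions": "0", "position": "0",
        "words": "350", "publishedAt": "2026-01-01"}
print(flags_for(stub, date(2026, 6, 5)))  # ['dead', 'thin']
```

Parsing with Python's csv module before calling this avoids the comma-splitting problem that plain awk has with titles containing commas.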

What to expect from each audit

First run: 4-8 hours for 200-500 articles — most of it writing the scripts once.
Later runs: 1-3 hours; you only re-run scripts and review the new flags.
A healthy retire rate is 5-10% of articles per audit. Needing to retire 30%+ in one pass means the original content strategy, not this audit, is the problem.

Verifying it worked

After 4-8 weeks, the Pages report’s indexed-to-submitted ratio rises as dead pages get fixed or removed.
Re-running the audit produces a shorter dead-page list quarter over quarter.
linkinator reports zero internal non-200 links on the latest build.

Common pitfalls

Auditing without writing decisions down. Next quarter you re-discover the same problems from scratch.
Refusing to retire anything. The audit degrades into a “fix someday” list instead of a set of decisions.
Trying to fix everything in one sitting. The audit is the diagnosis; spread the surgery over the following weeks.
Skipping it because “things look fine.” Search Console always surfaces something the homepage view hides.
Keeping decisions in Google Sheets when a CSV in the repo is diff-able, scriptable, and versioned alongside the content.

FAQ

How long does it take? 4-8 hours for 200-500 articles the first time, then 1-3 hours per round once the scripts exist.
Can I automate the decisions with AI? Use AI for triage — flagging dead, thin, and duplicate candidates from the CSV is a clean job for any model. Keep retire/keep decisions human: judging an article’s role across an interconnected site is exactly where models guess wrong.
What if I have no Search Console data? Connect the property and wait ~28 days. Auditing without impression data is mostly guessing, since steps 3 and 4 both depend on it.
How far back does Search Console data go? 16 months as of June 2026 — Google drops the oldest day daily. If you want a longer baseline, export the API pull to your repo each quarter; that history never expires once it is in git.
How do I audit translated content? Audit each language separately. The same article in English and Chinese serves different markets and is not a duplicate; only compare within a language.

Tags: #Indie dev #Content ops #SEO #Website planning #Technical SEO #Workflow
