Running a Site-Wide Content Audit — A Repeatable Process

A quarterly content audit with concrete scripts: URL inventory, Search Console join, dead-page flags, dupe scanner, broken-link checker, and decision log.

Most “content audit” articles online describe an audit so heavy you do it once and never again. For an indie site, you need a lighter audit you can actually run every quarter — backed by scripts so the work is mostly automated. The goal isn’t perfection; it’s catching the regressions before they compound.

Background

A content audit is a join: your URL inventory plus Search Console data plus a few heuristics. If you write the join and heuristics as scripts, every subsequent audit takes hours instead of days. This article gives you the scripts.

How to tell you need an audit

  • Your last audit was more than 6 months ago (or never).
  • Search Console “Submitted vs Indexed” ratio is below 90%.
  • You have 100+ articles and you can no longer say what is on the site without looking.
  • Internal link checker has not been run in months.
  • You suspect you have duplicates but cannot point to them.

Quick verdict

Run a light audit every quarter, not a heavy one yearly. Frequent small audits surface issues while they’re still cheap to fix.

Before you start

  • Search Console API access (OAuth) — without it the audit is mostly guessing.
  • A content collection or other file-based content layer; the scripts below read .mdx files with front matter.
  • A spreadsheet or CSV format you will reuse — the audit becomes a baseline each time.

Step by step

  1. Generate the URL inventory. A 30-line Node script:
// scripts/audit-step1-inventory.mjs
import { readdirSync, readFileSync, writeFileSync } from 'node:fs';
import matter from 'gray-matter';

const rows = [];
for (const lang of ['en', 'zh']) {
  for (const cat of readdirSync(`src/content/articles/${lang}`)) {
    for (const f of readdirSync(`src/content/articles/${lang}/${cat}`)) {
      if (!f.endsWith('.mdx')) continue;
      const { data, content } = matter(readFileSync(`src/content/articles/${lang}/${cat}/${f}`, 'utf8'));
      rows.push({
        url: `https://yourdomain.com/${lang}/articles/${data.urlSlug}/`,
        lang, category: cat,
        slug: data.urlSlug,
        title: data.title,
        primaryKeyword: data.primaryKeyword || '',
        publishedAt: data.publishedAt,
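        // trim + filter so leading whitespace or an empty body does not count as a word
        words: content.trim().split(/\s+/).filter(Boolean).length,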
      });
    }
  }
}
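// Quote every field and double embedded quotes so titles containing quotes stay valid CSV.
// The YAML parser may return Date objects for unquoted dates; write those as YYYY-MM-DD.
const cell = v => `"${String(v instanceof Date ? v.toISOString().slice(0, 10) : v ?? '').replace(/"/g, '""')}"`;
writeFileSync('audit-inventory.csv',
  'url,lang,category,slug,title,primaryKeyword,publishedAt,words\n' +
  rows.map(r => Object.values(r).map(cell).join(',')).join('\n') + '\n');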
  2. Pull 28-day Search Console data and join. Per URL: impressions, clicks, average position:
# SITE must be the URL-encoded property, e.g. https%3A%2F%2Fyourdomain.com%2F
# 2026-04-25 to 2026-05-22 is 28 days inclusive.
curl -sS -X POST "https://www.googleapis.com/webmasters/v3/sites/$SITE/searchAnalytics/query" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  --data '{
    "startDate":"2026-04-25","endDate":"2026-05-22",
    "dimensions":["page"],"rowLimit":25000
  }' \
  | jq -r '.rows[]? | [.keys[0],.clicks,.impressions,.position] | @csv' \
  > gsc-28d.csv
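
Hard-coded dates are easy to get wrong from one quarter to the next. A minimal sketch of computing the window instead, assuming both dates are inclusive (as in the Search Console query above); the helper name gsc_window is illustrative, not part of any API:

```python
from datetime import date, timedelta


def gsc_window(end: date, days: int = 28) -> tuple[str, str]:
    """Return (startDate, endDate) as ISO strings for an inclusive window of `days` days."""
    # Inclusive on both ends, so the start is days - 1 before the end.
    start = end - timedelta(days=days - 1)
    return start.isoformat(), end.isoformat()


start_date, end_date = gsc_window(date(2026, 5, 22))
print(start_date, end_date)
```

The two strings can be substituted into the startDate and endDate fields of the request body.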

Join with a 5-line awk or Python:

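python3 -c "
import csv
# Index Search Console rows by page URL (first column); skip blank lines.
gsc = {r[0]: r for r in csv.reader(open('gsc-28d.csv', newline='')) if r}
with open('audit-inventory.csv', newline='') as src, open('audit-joined.csv', 'w', newline='') as dst:
    rows = csv.reader(src); out = csv.writer(dst)
    # Name the three joined columns in the header instead of padding it with zeros.
    out.writerow(next(rows) + ['clicks', 'impressions', 'position'])
    for row in rows:
        # Pages with no Search Console row get zeros.
        out.writerow(row + gsc.get(row[0], ['', '0', '0', '0'])[1:4])
"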
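  3. Flag dead pages. Live > 90 days with zero impressions:
# Columns: $4 slug, $5 title, $7 publishedAt, $9 clicks, $10 impressions. Assumes no field contains a comma.
awk -F, 'NR>1 && $9==0 && $10==0 {print $4, $5, $7}' audit-joined.csv
# slugs needing decision (merge / refresh / noindex / delete)
# This filter does not check age: skip rows whose publishedAt falls within the last 90 days.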
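
The age condition is awkward to express in awk, and a comma inside a title shifts the column numbers. A hedged Python sketch of the same flag that parses the CSV properly and applies the 90-day rule; it assumes the joined file has a header row naming slug, publishedAt, and impressions columns, and that publishedAt starts with a YYYY-MM-DD date. The function name dead_pages and the sample rows are made up for illustration:

```python
import csv
import io
from datetime import date


def dead_pages(rows, today, min_age_days=90):
    """Return slugs published more than min_age_days ago that have zero impressions."""
    flagged = []
    for r in rows:
        try:
            # Accept both "2025-01-10" and "2025-01-10T00:00:00.000Z".
            published = date.fromisoformat(str(r["publishedAt"])[:10])
        except ValueError:
            continue  # unparseable date: leave the row for manual review
        age_days = (today - published).days
        if age_days > min_age_days and float(r["impressions"] or 0) == 0:
            flagged.append(r["slug"])
    return flagged


# Illustrative data only; in practice read audit-joined.csv with csv.DictReader.
SAMPLE = """slug,publishedAt,impressions
old-zero,2025-01-10,0
new-zero,2026-05-01,0
old-seen,2025-01-10,42
"""
sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(dead_pages(sample_rows, today=date(2026, 5, 22)))
```

Only the first sample row is both old enough and unseen, so it is the only one flagged.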
  4. Flag near-rankers. Position 8-20 with impressions > 100:
awk -F, 'NR>1 && $11>=8 && $11<=20 && $10>100' audit-joined.csv \
  | sort -t, -k10 -rn | head -30
# refresh list, sorted by impressions
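  5. Flag duplicates. Group by primary keyword within each language:
# $2 is lang and $6 is primaryKeyword ($7 is publishedAt); rows without a keyword are skipped.
awk -F, 'NR>1 && $6!="" {print $2, $6}' audit-joined.csv | sort | uniq -c \
  | awk '$1 > 1' | sort -rn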
# any count > 1 = duplicate-intent group
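
Counts say that a group exists but not which pages are in it. A small sketch that lists the slugs in each duplicate-intent group, assuming rows are dicts with lang, slug, and primaryKeyword keys (the inventory column names); keyword normalization by lowercasing and collapsing whitespace is an assumption of this sketch, not something the audit requires:

```python
from collections import defaultdict


def duplicate_groups(rows):
    """Map (lang, normalized keyword) to the slugs sharing it, keeping groups of 2+."""
    groups = defaultdict(list)
    for r in rows:
        # Normalize so "Content  Audit" and "content audit" land in the same group.
        keyword = " ".join((r.get("primaryKeyword") or "").lower().split())
        if keyword:  # pages without a primary keyword cannot be grouped
            groups[(r["lang"], keyword)].append(r["slug"])
    return {key: slugs for key, slugs in groups.items() if len(slugs) > 1}


# Illustrative rows; in practice pass list(csv.DictReader(open("audit-joined.csv"))).
example = [
    {"lang": "en", "slug": "a", "primaryKeyword": "Content Audit"},
    {"lang": "en", "slug": "b", "primaryKeyword": "content  audit"},
    {"lang": "zh", "slug": "c", "primaryKeyword": "content audit"},
    {"lang": "en", "slug": "d", "primaryKeyword": ""},
]
print(duplicate_groups(example))
```

Grouping on the language as well keeps a translated pair from being reported as a duplicate.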
  6. Run a broken-link checker over the built site. Use linkinator or your own walker:
npx linkinator https://yourdomain.com \
  --recurse --concurrency 5 --skip 'http(s)?://[^/]+/$' \
  --format CSV > linkinator-report.csv
# NR>1 skips the CSV header row, which would otherwise match the filter.
awk -F, 'NR>1 && $2 != "200"' linkinator-report.csv | head
  7. Flag thin pages. Word count < 400 with no special reason:
awk -F, 'NR>1 && $8<400 {print $4, $8}' audit-joined.csv
  8. Write decisions back to the CSV. Add a decision column with values like keep, refresh, merge:<target-slug>, noindex, delete. Commit the CSV; it is the baseline for next quarter.
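
Because the decision column is free text, a quick consistency check before committing can catch typos and merges that point at a missing page. A minimal sketch, assuming rows are dicts with slug and decision keys; the function name check_decisions is hypothetical, and the vocabulary is the one listed in the step above:

```python
import re

# Decisions that take no argument; merge must name a target slug.
SIMPLE_DECISIONS = {"keep", "refresh", "noindex", "delete"}
MERGE_PATTERN = re.compile(r"^merge:(\S+)$")


def check_decisions(rows):
    """Return (slug, problem) pairs; an empty list means the decision log is consistent."""
    known_slugs = {r["slug"] for r in rows}
    problems = []
    for r in rows:
        decision = (r.get("decision") or "").strip()
        if decision in SIMPLE_DECISIONS:
            continue
        match = MERGE_PATTERN.match(decision)
        if not match:
            problems.append((r["slug"], f"unknown decision {decision!r}"))
        elif match.group(1) not in known_slugs:
            problems.append((r["slug"], f"merge target {match.group(1)!r} not in inventory"))
        elif match.group(1) == r["slug"]:
            problems.append((r["slug"], "merge target is the page itself"))
    return problems


# Illustrative rows only.
decisions = [
    {"slug": "a", "decision": "keep"},
    {"slug": "b", "decision": "merge:a"},
    {"slug": "c", "decision": "merge:zzz"},
    {"slug": "d", "decision": ""},
]
for slug, problem in check_decisions(decisions):
    print(slug, problem)
```

Rows with an empty decision are reported too, which doubles as a check that every page got a verdict.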

Implementation checklist

  • All scripts live in scripts/ and are runnable with npm run audit.
  • Inventory CSV is regenerated from the file system, not maintained by hand.
  • Search Console pull uses 28-day window consistently.
  • Decisions are recorded in the CSV before any actual content changes.
  • A diff between this quarter’s CSV and last quarter’s is reviewable.
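
A plain git diff of the committed CSV already covers the last item. For a summary keyed by slug rather than by line, something like the following sketch may help; it assumes both quarters' rows carry slug and decision columns, and the function name diff_audits is illustrative:

```python
def diff_audits(prev_rows, curr_rows, key="slug", field="decision"):
    """Compare two audit snapshots: pages added, pages removed, and changed decisions."""
    prev = {r[key]: r for r in prev_rows}
    curr = {r[key]: r for r in curr_rows}
    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    # For pages present in both snapshots, report (slug, old value, new value).
    changed = sorted(
        (k, prev[k].get(field, ""), curr[k].get(field, ""))
        for k in curr.keys() & prev.keys()
        if prev[k].get(field, "") != curr[k].get(field, "")
    )
    return added, removed, changed


# Illustrative snapshots; in practice load each quarter's CSV with csv.DictReader.
last_quarter = [{"slug": "a", "decision": "keep"}, {"slug": "b", "decision": "refresh"}]
this_quarter = [{"slug": "a", "decision": "refresh"}, {"slug": "c", "decision": "keep"}]
print(diff_audits(last_quarter, this_quarter))
```

Passing field="words" or field="impressions" instead would surface pages whose length or visibility changed between audits.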

After-launch verification

  • After 4-8 weeks, the share of submitted pages that Search Console reports as indexed rises (dead pages were either fixed or removed).
  • Re-running the audit produces a shorter list of dead pages.
  • Linkinator reports zero non-200 internal links on the latest build.

Common pitfalls

  • Auditing without writing decisions down. By next quarter you’ll re-discover the same problems.
  • Refusing to retire anything. The audit becomes a list of “things to fix someday” instead of decisions.
  • Trying to fix everything in one sitting. Spread fixes over the next few weeks; the audit is the diagnosis, not the surgery.
  • Skipping the audit because “things look fine”. Search Console always shows surprises.
  • Using Google Sheets when a CSV in the repo would be diff-able and scriptable.

FAQ

  • How long should this take? 4-8 hours for 200-500 articles the first time. Half that on the next round once the tooling is in place.
  • Can I do this with AI? Yes for triage (flagging candidates), but keep human review for retire/keep decisions. AI is bad at judging article context across an interconnected site.
  • What if I have no Search Console data? Connect Search Console first and wait 28 days. Auditing without impression data is mostly guessing.
  • How aggressive should I be with retiring? On a healthy site, retiring 5-10% of articles per audit is normal. Retiring 30%+ in one pass suggests deeper problems with the original content strategy.
  • What about translated content? Audit each language separately. The same topic in two languages serves different markets and is not a duplicate.
