AI Knowledge Base Cleanup: From Notion Graveyard to Searchable Wiki

An 8-step AI workflow to clean a sprawling Notion / Confluence / Google Sites wiki: export, batch-group, triage, merge to canonical, archive, tag, reindex, and schedule a quarterly recheck. Verified June 2026.

Published: May 17, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

TL;DR

A team wiki rots in a predictable way: three versions of the onboarding doc, a 2022 setup guide nobody dares delete, and a search box that returns five pages that contradict each other. This is an 8-step workflow that uses a long-context model (Claude Sonnet 4.6 or GPT-5.5) to do the grunt work of grouping, deduping, and merging — while you keep every irreversible decision (delete, publish) in human hands. Budget roughly 2-3 days of triage plus about a week of merges for a 1,000-page wiki with one person on it. The AI cost is small: at Sonnet 4.6’s $3 per million input tokens, even re-reading a 1,000-page export end to end is a few dollars.

Who this is for

Ops leads, knowledge managers, and team leads who just inherited a 500-2,000 page Notion, Confluence, or Google Sites wiki that nobody trusts anymore. If “is this page still current?” is a daily Slack question, this is for you.

Why AI, and where it stops

Two jobs in a cleanup are mind-numbing for a human and easy for a model: (1) reading every page to spot semantic duplicates, and (2) drafting a merged “canonical” doc with source citations. A 1M-token context window — standard on Claude Opus 4.7 and Sonnet 4.6, and on Gemini 3.1 Pro — fits roughly 1,500 pages of text in a single session, so the model can actually compare pages instead of guessing. (ChatGPT Plus caps the in-app window at about 320 pages; the full 1M window is reserved for the $200 Pro plan, so for big batches Claude or Gemini is the cheaper path.)
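As a rough check on the cost claim, the arithmetic can be sketched from the figures quoted in this article alone: about 1,500 pages per 1M tokens and $3 per million input tokens. The function name and the `passes` parameter are illustrative, and output tokens are not counted, so treat the result as a lower bound rather than a quote.

```python
# Back-of-envelope input cost, using only figures quoted in this article.
TOKENS_PER_WINDOW = 1_000_000   # 1M-token context window
PAGES_PER_WINDOW = 1_500        # "roughly 1,500 pages" fit in one window
PRICE_PER_MILLION_USD = 3.00    # Sonnet 4.6 input price as quoted here


def estimate_input_cost(pages: int, passes: int = 1) -> float:
    """Return the estimated input-token cost in USD for reading `pages` pages."""
    tokens_per_page = TOKENS_PER_WINDOW / PAGES_PER_WINDOW  # about 667
    total_tokens = pages * tokens_per_page * passes
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_USD


if __name__ == "__main__":
    print(f"one pass:     ${estimate_input_cost(1_000):.2f}")
    print(f"three passes: ${estimate_input_cost(1_000, passes=3):.2f}")
```

Under these assumptions a 1,000-page export comes to about $2 of input per full read, which is consistent with the "few dollars" estimate even after several re-runs.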

What AI must not do: delete pages, resolve factual conflicts between sources, or publish. Those stay manual. The model proposes; you decide.

Step by step

1. Structured export

AI cannot help until you have a complete, plain-text dump.

Notion (path changed in early 2026): Settings → Workspace → General → Export all workspace content → format Markdown & CSV → toggle Include subpages and Include content. You get a ZIP with one .md file per page, plus a .csv per database. A full-workspace export can take up to 30 hours to process; Notion emails you a download link that expires after 7 days, so grab it promptly. (Source: Notion Help — Export your content.)
Confluence: space sidebar → Space settings → Export → choose HTML (easiest for AI) or XML (full storage format, includes comments). Note the gotchas: HTML export omits page comments and Team Calendars, and you need the Export Space permission. (Source: Atlassian — Export a space’s content.)
Google Sites: no native bulk export. Pull the URL list from your sitemap or Search Console, then run `wget --mirror` against the public pages, or use a community gsites-exporter script.

Set up a working directory so every step has a home:

kb_cleanup_2026_06/
├── raw/                # original exported files
├── batches/            # files merged into batches for step 2
├── triage.csv          # the triage sheet from step 3
├── canonical_drafts/   # merged outputs from step 4
└── taxonomy.md         # tag list from step 6
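If you would rather script the layout than create it by hand, a minimal sketch follows. The function name `make_workspace` is hypothetical, and seeding triage.csv with the step 3 column header is a convenience assumed here, not something the workflow requires.

```python
from pathlib import Path

SUBDIRS = ("raw", "batches", "canonical_drafts")
TRIAGE_HEADER = "page_path,group,status,confidence,notes\n"


def make_workspace(root: Path) -> Path:
    """Create the cleanup folder layout; safe to re-run on an existing folder."""
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    triage = root / "triage.csv"
    if not triage.exists():  # never overwrite a sheet that already has rows
        triage.write_text(TRIAGE_HEADER, encoding="utf-8")
    (root / "taxonomy.md").touch(exist_ok=True)
    return root


if __name__ == "__main__":
    make_workspace(Path("kb_cleanup_2026_06"))
```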

2. Batch grouping

Concatenate 20-50 source files into one batch file, each page prefixed --- FILE: <path> --- on its own line so the model can cite paths back to you. Open Claude (Sonnet 4.6) or GPT-5.5 and send:

Below is a batch of [N] knowledge-base pages, separated by lines reading
--- FILE: [path] ---

[paste or upload the batch]

Output three Markdown tables:

Table 1 — group by topic:
| group_name | file paths included | core question (one sentence) |

Table 2 — near-duplicate pairs (semantically overlapping, different files):
| file A | file B | overlap dimension | overlap severity (high/med/low) |

Table 3 — orphan pages (in no group, no inbound links, not updated in 12 months):
| file | best guess what this page was for | suggested action (merge / archive / delete / keep) |

Use ONLY the [path] values I provided. Do not invent file paths.

Why batches of 20-50 and not the whole wiki at once: long-context recall degrades on a giant undifferentiated paste, and placing the grouping instruction at the very end, after the documents, measurably improves accuracy on long inputs. Run the prompt once per batch and append the resulting rows to triage.csv.
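The concatenation itself is mechanical and can be scripted. The sketch below assumes a Markdown export (one .md file per page under raw/) and writes numbered batch files; `build_batches`, the batch file naming, and the default size of 30 are illustrative choices, not part of any export tool.

```python
from pathlib import Path

BATCH_SIZE = 30  # anywhere in the 20-50 range works


def build_batches(raw_dir: Path, out_dir: Path, batch_size: int = BATCH_SIZE) -> list[Path]:
    """Concatenate exported .md files into batch files with FILE separators."""
    pages = sorted(raw_dir.rglob("*.md"))
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(pages), batch_size):
        parts = []
        for page in pages[start:start + batch_size]:
            # Paths are relative to raw/ so the model's citations map back to files.
            rel = page.relative_to(raw_dir).as_posix()
            text = page.read_text(encoding="utf-8", errors="replace")
            parts.append(f"--- FILE: {rel} ---\n{text.rstrip()}\n")
        batch_path = out_dir / f"batch_{start // batch_size + 1:03d}.md"
        batch_path.write_text("\n".join(parts), encoding="utf-8")
        written.append(batch_path)
    return written


if __name__ == "__main__":
    root = Path("kb_cleanup_2026_06")
    for path in build_batches(root / "raw", root / "batches"):
        print(path)
```

Sorting the paths keeps pages from the same folder in the same batch, which tends to put likely duplicates side by side.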

3. Triage sheet

Give every page one row and tag it with four fields: group, status, confidence, and notes. This sheet, not the AI chat, is your source of truth:

page_path,group,status,confidence,notes
docs/onboarding_v1.md,onboarding,merge,high,merge with v2/v3
docs/onboarding_v2.md,onboarding,merge,high,canonical base
docs/legacy_aws_setup.md,infra-legacy,archive,high,pre-2023 setup
docs/ceo_2022_strategy.md,founder-notes,archive,medium,keep as historical
docs/test_page.md,orphan,delete,high,test leftover

status is one of keep / merge / archive / delete. Set confidence to high / medium / low, and re-run the step 2 batch prompt on the low-confidence rows before acting on them.
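Because the sheet is the source of truth, it is worth checking mechanically before anyone acts on it. A minimal sketch, assuming the column names from the sample above; `check_triage` is a hypothetical helper that reports invalid values and lists the low-confidence rows that need a second look.

```python
import csv
import io

VALID_STATUS = {"keep", "merge", "archive", "delete"}
VALID_CONFIDENCE = {"high", "medium", "low"}


def check_triage(csv_text: str) -> tuple[list[str], list[str]]:
    """Return (problems, low_confidence_paths) for a triage sheet."""
    problems, low_confidence = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        status = (row.get("status") or "").strip().lower()
        confidence = (row.get("confidence") or "").strip().lower()
        if status not in VALID_STATUS:
            problems.append(f"line {line_no}: bad status {status!r}")
        if confidence not in VALID_CONFIDENCE:
            problems.append(f"line {line_no}: bad confidence {confidence!r}")
        if confidence == "low":
            low_confidence.append(row.get("page_path") or "")
    return problems, low_confidence


if __name__ == "__main__":
    with open("kb_cleanup_2026_06/triage.csv", encoding="utf-8") as fh:
        problems, low = check_triage(fh.read())
    print("\n".join(problems) or "no problems")
    print("low confidence:", low)
```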

4. Merge into canonical

For each status=merge group, paste the full text of all candidates and ask for one merged draft with citations:

Below are 3-5 pages under the same topic: "[group core question]".
I am merging them into one canonical doc.

[paste full text of all candidates, each prefixed --- FILE: [path] ---]

Produce a canonical draft:
1. Structure: what / why / how / examples / FAQ / related
2. Every factual claim carries a [from: path] tag showing its source page
3. When candidates conflict on the same fact, LIST the conflict — do not pick.
   Put a "Human decision needed" section at the bottom.
4. Flag stale candidates (tools or versions no longer in use) and suggest fixes.

End with one sentence on why this canonical beats any single source.

Save to canonical_drafts/[topic].md. A human resolves the “Human decision needed” section before anything is published — this is where the model is most likely to be confidently wrong.
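One cheap guard before the human review is to confirm that every [from: path] tag in a draft points at a page you actually supplied. The sketch below is one way to do that; `check_citations` is a hypothetical helper, and it assumes that a tag citing several pages separates them with commas.

```python
import re

FROM_TAG = re.compile(r"\[from:\s*([^\]]+)\]")


def check_citations(draft: str, candidate_paths: set[str]) -> dict[str, set[str]]:
    """Compare [from: path] tags in a draft against the pages actually supplied."""
    cited = set()
    for match in FROM_TAG.finditer(draft):
        # Assumption: multiple sources inside one tag are comma-separated.
        cited.update(p.strip() for p in match.group(1).split(",") if p.strip())
    return {
        "unknown": cited - candidate_paths,  # paths the model may have invented
        "unused": candidate_paths - cited,   # candidates that contributed nothing
    }


if __name__ == "__main__":
    draft = (
        "Install the CLI first. [from: docs/onboarding_v1.md]\n"
        "Request a laptop from IT. [from: docs/onboarding_v9.md]\n"
    )
    supplied = {"docs/onboarding_v1.md", "docs/onboarding_v2.md"}
    print(check_citations(draft, supplied))
```

An entry under "unknown" suggests an invented path; an entry under "unused" may mean a candidate was fully redundant or was silently dropped, so both are worth a quick look.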

5. Archive, do not delete

Never hard-delete. A deleted page turns every cached URL and old link into a 404, which hurts SEO and breaks bookmarks. Instead, banner the page and move it:

> **This page is archived (YYYY-MM).**
> Current version: [canonical page title](canonical-page-link)
> Kept so old links and search results don't 404. Do not edit further.

In Notion, add the callout block above, then move the page to an Archive space. In Confluence, insert an Info macro at the top, then Move the page to an Archive space. Strip the page from the navigation so it stops surfacing in browse, but leave the URL live.
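With dozens of pages to archive, generating the banner text from the triage sheet avoids copy-paste slips. A minimal sketch that only fills in the template shown above; the function name is hypothetical, the example URL is a placeholder, and pasting the result into Notion or Confluence remains a manual step.

```python
from datetime import date


def archive_banner(canonical_title: str, canonical_link: str, archived_on: date) -> str:
    """Return the Markdown archive banner for one page."""
    return (
        f"> **This page is archived ({archived_on:%Y-%m}).**\n"
        f"> Current version: [{canonical_title}]({canonical_link})\n"
        "> Kept so old links and search results don't 404. Do not edit further."
    )


if __name__ == "__main__":
    print(archive_banner("Onboarding", "https://example.com/onboarding", date.today()))
```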

6. AI taxonomy generation

Feed the surviving keep + merge rows back in and ask for a flat, non-overlapping tag set:

Below are the keep + merge pages and their core questions:

[paste keep/merge rows from triage.csv]

Give me a flat tag taxonomy:
- 15-30 tags total (do not exceed 30)
- Each tag with a one-sentence definition of what content belongs here
- No overlap — each page should match 3 tags at most, not all of them
- Format: tag-slug | one-line definition | 1-2 example pages

Then suggest 3 tags max per page. Output CSV: page_path,suggested_tags

Write the tags back into the KB (Notion Tags property, Confluence labels). Keep the cap at 30 tags — a taxonomy that grows past that becomes its own search problem.
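Both limits in the prompt (30 tags overall, 3 per page) can be verified on the model's CSV before anything is written back. The sketch below assumes the tags in the suggested_tags column are separated by semicolons; the prompt does not specify a separator, so adjust `sep` to whatever the model actually returns. `check_tags` is a hypothetical helper.

```python
import csv
import io

MAX_TAGS_TOTAL = 30
MAX_TAGS_PER_PAGE = 3


def check_tags(csv_text: str, sep: str = ";") -> list[str]:
    """Return rule violations for a page_path,suggested_tags CSV."""
    problems, all_tags = [], set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get("suggested_tags") or ""
        tags = [t.strip() for t in raw.split(sep) if t.strip()]
        all_tags.update(tags)
        if len(tags) > MAX_TAGS_PER_PAGE:
            problems.append(
                f"{row.get('page_path')}: {len(tags)} tags (max {MAX_TAGS_PER_PAGE})"
            )
    if len(all_tags) > MAX_TAGS_TOTAL:
        problems.append(f"{len(all_tags)} distinct tags (max {MAX_TAGS_TOTAL})")
    return problems


if __name__ == "__main__":
    sample = (
        "page_path,suggested_tags\n"
        "docs/onboarding.md,onboarding;hr\n"
        "docs/deploy.md,infra;ci;release;oncall\n"
    )
    print(check_tags(sample))
```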

7. Rebuild the home / index page

Combine the taxonomy with your top-20 traffic pages (Google Analytics, Notion’s page analytics, or Confluence’s “Popular” report):

Design a new KB home page structure:
- Top: "5 most-used entries" — the 5 highest-traffic canonical pages
- Middle: 6-8 sections by taxonomy tag, each listing 3-5 representative pages
- Bottom: an auto-generated "Updated in the last 7 days" list

Each entry gets a one-line description (10 words max) so a newcomer knows
what's behind the link without clicking.

8. Schedule the quarterly recheck

A cleaned wiki re-rots in about six months if no one tends it. Add a recurring calendar event every quarter, roughly every 90 days (for example, “2026-09-04 KB recheck”), and paste the kb_cleanup_2026_06/ path into the event description. On the next pass, reuse the same batch prompt and triage template; the second cleanup takes a fraction of the time of the first.
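The example date above is three calendar months after 2026-06-04; a strict 90-day interval would land two days earlier. If you want the dates for the calendar series up front, a small sketch using calendar-month steps follows. The helper names are hypothetical and the start date is assumed to be the day the cleanup finished.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(start.day, last_day))


def recheck_dates(cleanup_done: date, count: int = 4) -> list[date]:
    """Return the next `count` quarterly recheck dates."""
    return [add_months(cleanup_done, 3 * i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for d in recheck_dates(date(2026, 6, 4)):
        print(f"{d.isoformat()} KB recheck")
```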

The 8 steps at a glance

| # | Step | Who does it | Output |
|---|---|---|---|
| 1 | Structured export | Human | `raw/` dump |
| 2 | Batch grouping | AI | 3 tables per batch |
| 3 | Triage sheet | Human | `triage.csv` |
| 4 | Merge to canonical | AI drafts, human resolves | `canonical_drafts/` |
| 5 | Archive | Human | bannered, moved pages |
| 6 | Tag taxonomy | AI proposes, human approves | `taxonomy.md` |
| 7 | Rebuild index | AI drafts, human edits | new home page |
| 8 | Quarterly recheck | Human | calendar recurrence |

Which model for which step

| Task | Best fit (June 2026) | Why |
|---|---|---|
| Read a 1,000-page batch in one pass | Claude Sonnet 4.6 / Opus 4.7, Gemini 3.1 Pro | 1M-token window at standard pricing |
| Cheapest big-batch reading | Gemini 3.1 Pro ($2 / 1M input) | lowest input cost of the three |
| Drafting merged canonical docs | Claude Opus 4.7 | strongest at structured synthesis with citations |
| In-app from a Notion page | Notion AI Q&A | retrieves but won’t restructure; use it after the cleanup, not for it |

Notion’s own AI Q&A is a retrieval layer, not a cleanup tool: it answers questions over your pages but won’t dedupe or merge them. Run the cleanup first, then let Q&A shine on a wiki that’s actually clean.

Common mistakes

Deleting before archiving. Old links die silently and search rankings drop. Archive with a banner instead.
Skipping the merge-to-canonical step. If you only tag and reindex, the duplicates are still there and reappear in search within weeks.
No quarterly recurring pass. The graveyard reforms in roughly six months. The recheck is the only step that makes the cleanup stick.
Letting AI resolve conflicts. When two sources disagree on a fact, the model will often pick confidently and wrong. Force a “Human decision needed” section.

FAQ

How long does a 1,000-page wiki take? Roughly 2-3 focused days for export and triage, plus about a week for merges, if one person can work on it full-time. The second cleanup is far faster because the triage template is reusable.
Can AI delete pages directly? Keep deletion manual. The model proposes `delete` in the triage sheet; a human confirms and archives. There’s no safe way to let it hard-delete.
What does the AI cost? Small. At Claude Sonnet 4.6’s $3 per million input tokens, reading a full 1,000-page export (well under 1M tokens) costs a few dollars; even with re-runs you’re in single-digit dollars for the whole project.
Should I just use Notion AI Q&A instead of cleaning up? No. Q&A retrieves answers but inherits whatever mess is in the pages — three contradicting onboarding docs produce three contradicting answers. Clean first, then Q&A is genuinely useful.
Why batches of 20-50 pages instead of the whole export? Long-context recall is sharper on focused batches, and concatenating everything makes citation back to file paths unreliable. Batches also let you parallelize across a team.

Tags: #Tutorial #Productivity #Knowledge base #Cleanup
