What this covers
An 8-step AI workflow to clean up a sprawling Notion / Confluence / Google Sites knowledge base: dedupe, tag, archive, link, and reindex.
Key tools and concepts:
- Notion: A collaborative workspace combining docs, notes, databases, and project tracking.
Who this is for
Ops leads, knowledge managers, team leads inheriting a 500-2000 page wiki that no one trusts anymore.
When to reach for it
When a single search returns 5 contradictory pages, new hires can’t find anything, and “is this still current?” is a daily question.
Step by step
-
Structured export. AI cannot help without a complete dump.
- Notion: bottom-left Settings & members → Settings → Export all workspace content → format Markdown & CSV → check Include subpages and Include content. You get a ZIP, one .md file per page.
- Confluence: space top-right ••• → Export space → choose HTML or XML (HTML is easier for AI), check Include attachments.
- Google Sites: no native bulk export; use gsites-exporter, or get the URL list from Site Search and wget --mirror.
Set up a working directory:
    kb_cleanup_2026_05_21/
    ├── raw/               # original exported files
    ├── batches/           # files merged into batches for step 2
    ├── triage.csv         # the triage sheet from step 3
    ├── canonical_drafts/  # merged outputs from step 4
    └── taxonomy.md        # tag list from step 6
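The working directory can also be created by a short script, so every quarterly pass starts from the same layout. This is a minimal sketch, not part of the original workflow: the function name `create_workdir` is hypothetical, and writing a header row into `triage.csv` up front is an assumption based on the columns used in the triage step.

```python
from datetime import date
from pathlib import Path


def create_workdir(parent: Path, run_date: date) -> Path:
    """Create the kb_cleanup_<YYYY_MM_DD>/ skeleton and return its path."""
    root = parent / f"kb_cleanup_{run_date:%Y_%m_%d}"
    for sub in ("raw", "batches", "canonical_drafts"):
        (root / sub).mkdir(parents=True, exist_ok=True)

    # Assumption: seed the triage sheet with the column names used in step 3,
    # but never overwrite a sheet that already has rows in it.
    triage = root / "triage.csv"
    if not triage.exists():
        triage.write_text("page_path,group,status,confidence,notes\n", encoding="utf-8")

    # Empty placeholder for the tag list produced in step 6.
    (root / "taxonomy.md").touch()
    return root
```

Calling `create_workdir(Path.cwd(), date(2026, 5, 21))` would produce the `kb_cleanup_2026_05_21/` folder; unzip the export into its `raw/` subfolder afterwards.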
-
Batch grouping. Combine 20-50 pages into each batch file, prefixing every page with --- FILE: <path> ---. Open Claude (Sonnet 4.6+) or GPT-5.5-128k+ and send:
    Below is a batch of <N> knowledge-base pages, separated by `--- FILE: <path> ---`.
    <paste or upload the batch>

    Output three Markdown tables:

    Table 1 (group by topic):
    | group_name | file paths included | core question (≤1 sentence) |

    Table 2 (near-duplicate pairs: semantically overlapping but different files):
    | file A | file B | overlap dimension | overlap severity (high/med/low) |

    Table 3 (orphan pages: not in any group, no inbound links, not updated in 12 months):
    | file | best guess what this page was for | suggested action (merge / archive / delete / keep) |

    Do NOT hallucinate file paths. Use only the `<path>` values I provided.
Run once per batch and append the results to triage.csv.
-
Triage sheet. Give each row 5 fields:
    page_path,group,status,confidence,notes
    docs/onboarding_v1.md,onboarding,merge,high,merge with v2/v3
    docs/onboarding_v2.md,onboarding,merge,high,canonical base
    docs/legacy_aws_setup.md,infra-legacy,archive,high,pre-2023 setup
    docs/ceo_2022_strategy.md,founder-notes,archive,medium,keep as historical
    docs/test_page.md,orphan,delete,high,test leftover
status is one of keep / merge / archive / delete. confidence is one of high / medium / low; re-run step 4 for low-confidence rows.
-
Merge into canonical. For each status=merge group:
    Below are 3-5 pages all under the same topic "<group core question>". I'm merging them into one canonical doc.
    <paste full text of all candidates, each prefixed --- FILE: <path> --->

    Produce a canonical draft:
    1. Structure: what / why / how / examples / FAQ / related
    2. Every factual claim must carry [from: <path>] marking which source it came from
    3. When candidates conflict on the same fact, list the conflict. DO NOT pick for me. Put a "Human decision needed" section at the bottom
    4. Flag candidates with stale info (mentions tools / versions no longer in use), suggest updates

    After writing, give me 1 sentence on why this canonical is stronger than any single candidate.
Save to canonical_drafts/<topic>.md. Human-review the “Human decision needed” section before publishing.
-
Archive, do not delete. Keep in place, add a banner, then move to /archive:
    > **This page is archived (YYYY-MM).**
    > Current version: [<canonical page title>](<canonical page link>)
    > Kept so old links / search results don't 404. Do not edit further.
Notion: move the page to an Archive workspace and add the callout block above. Confluence: insert an Info macro, then Move to an Archive space. Do not delete: search engines have these URLs cached, so deletion creates 404s and hurts SEO.
-
AI taxonomy generation:
    Below is the list of `keep + merge` pages and their core questions:
    <paste keep/merge rows from triage.csv>

    Give me a flat tag taxonomy:
    - Total 15-30 tags (do not exceed 30)
    - Each tag with a 1-sentence definition of "what content belongs here"
    - No overlap: each page should match ≤3 tags, not blanket all
    - Format: `tag-slug | 1-line definition | 1-2 example pages`

    Then suggest ≤3 tags per page. Output as CSV: `page_path,suggested_tags`.
Write tags back into the KB (Notion Tags property, Confluence labels).
-
Rebuild home / index. Use the taxonomy + top 20 traffic pages (GA, Notion analytics, or Confluence “Popular”):
    Design a new KB home page structure:
    - Top: "5 most-used entries", the 5 highest-traffic canonical pages
    - Middle: 6-8 sections by taxonomy tag, each listing 3-5 representative pages
    - Bottom: an auto-generated "Updated in the last 7 days" list

    Each entry gets a 1-line description (≤10 words) so a newcomer knows what's behind the link without clicking.
-
Schedule the quarterly recheck. Add a recurring calendar event 2026-08-21 KB recheck (every 90 days). Drop the path kb_cleanup_2026_05_21/ in the event description so that next time you can reuse the same batch + prompt + triage template.
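The batch files for the batch-grouping step do not have to be assembled by hand. The sketch below is one possible way to do it, assuming the export was unzipped into `raw/` as `.md` files; the function name `build_batches`, the `batch_NNN.md` file naming, and the default of 30 pages per batch are illustrative choices, not something the workflow prescribes.

```python
from pathlib import Path


def build_batches(raw_dir: Path, batch_dir: Path, batch_size: int = 30) -> list[Path]:
    """Merge exported .md pages into batch files of at most batch_size pages each."""
    pages = sorted(raw_dir.rglob("*.md"))
    batch_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(pages), batch_size):
        parts = []
        for page in pages[start:start + batch_size]:
            # Use the path relative to raw/ so the model can only echo real paths.
            rel = page.relative_to(raw_dir).as_posix()
            text = page.read_text(encoding="utf-8", errors="replace")
            parts.append(f"--- FILE: {rel} ---\n{text.rstrip()}\n")
        out = batch_dir / f"batch_{start // batch_size + 1:03d}.md"
        out.write_text("\n".join(parts), encoding="utf-8")
        written.append(out)
    return written
```

Something like `build_batches(Path("kb_cleanup_2026_05_21/raw"), Path("kb_cleanup_2026_05_21/batches"))` would then fill `batches/`, and each resulting file can be pasted or uploaded with the step 2 prompt. Keep `batch_dir` outside `raw_dir` so earlier batches are not picked up as pages on a re-run.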
Recommended workflow
Export dump → batch grouping → triage sheet → merge into canonical pages → archive obsolete → tag taxonomy → rebuild index → recurring pass.
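Between the triage sheet and the merge stage, a quick sanity check of `triage.csv` can catch typos in `status` or `confidence` before any prompt is run. The following is a rough sketch under the assumption that the sheet uses the header `page_path,group,status,confidence,notes`; `load_triage` and the wording of the review reasons are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path

STATUSES = {"keep", "merge", "archive", "delete"}
CONFIDENCES = {"high", "medium", "low"}


def load_triage(path: Path):
    """Return (merge_groups, needs_review) for a triage sheet.

    merge_groups maps group name -> page paths with status=merge.
    needs_review lists (row_number, page_path, reason) for rows a human should check.
    """
    merge_groups = defaultdict(list)
    needs_review = []
    with path.open(newline="", encoding="utf-8") as fh:
        # Row numbers start at 2 because line 1 is the header.
        for row_no, row in enumerate(csv.DictReader(fh), start=2):
            page = (row.get("page_path") or "").strip()
            status = (row.get("status") or "").strip().lower()
            confidence = (row.get("confidence") or "").strip().lower()
            if status not in STATUSES:
                needs_review.append((row_no, page, f"unknown status {status!r}"))
            elif confidence not in CONFIDENCES:
                needs_review.append((row_no, page, f"unknown confidence {confidence!r}"))
            elif confidence == "low":
                needs_review.append((row_no, page, "low confidence"))
            elif status == "merge":
                merge_groups[(row.get("group") or "").strip()].append(page)
    return dict(merge_groups), needs_review
```

Each key in the returned `merge_groups` corresponds to one run of the merge prompt, and anything in `needs_review` goes back to a person rather than to the model.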
Common mistakes
- Deleting before archiving — old links die silently
- Skipping the merge-into-canonical step — duplicates come back
- No quarterly recurring pass — the graveyard reforms in 6 months
FAQ
- How long for a 1000-page wiki?: Roughly 2-3 focused days for triage + 1 week for merges if you can dedicate one person.
- Can AI delete pages directly?: No — keep the deletion step manual. AI proposes, you decide.