AI Knowledge Base Cleanup Workflow: From Notion Graveyard to Searchable Wiki

A 5-stage AI workflow to clean up a sprawling Notion / Confluence / Google Sites knowledge base — dedupe, tag, archive, link, and reindex.

What this covers

Key tools and concepts:

  • Notion: A collaborative workspace combining docs, notes, databases, and project tracking.

Who this is for

Ops leads, knowledge managers, and team leads inheriting a 500-2000 page wiki that no one trusts anymore.

When to reach for it

When a search returns five contradictory pages, new hires can’t find anything, and “is this still current?” is a daily question.

Step by step

  1. Structured export. AI cannot help without a complete dump.

    • Notion: bottom-left Settings & members → Settings → Export all workspace content → format Markdown & CSV → check Include subpages and Include content. You get a ZIP with one .md file per page.
    • Confluence: space top-right ••• → Export space → choose HTML or XML (HTML is easier for AI), check Include attachments.
    • Google Sites: no native bulk export; use gsites-exporter, or get the URL list from Site Search and wget --mirror.

    Set up a working directory:

    kb_cleanup_2026_05_21/
    ├── raw/                # original exported files
    ├── batches/            # files merged into batches for step 2
    ├── triage.csv          # the triage sheet from step 3
    ├── canonical_drafts/   # merged outputs from step 4
    └── taxonomy.md         # tag list from step 6
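The layout above can be created with a few lines of Python. This is a minimal sketch, not part of any export tool: the function name `init_workspace` is hypothetical, and it assumes you want empty placeholder files for triage.csv and taxonomy.md so later steps have something to append to.

```python
from pathlib import Path

# Directory and file names mirror the tree shown above.
SUBDIRS = ("raw", "batches", "canonical_drafts")
FILES = ("triage.csv", "taxonomy.md")


def init_workspace(root: Path) -> Path:
    """Create the cleanup directory layout; safe to re-run."""
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() leaves an existing file's contents alone.
        (root / name).touch(exist_ok=True)
    return root


if __name__ == "__main__":
    init_workspace(Path("kb_cleanup_2026_05_21"))
```

Unzip the export into raw/ afterwards; nothing in this sketch reads or modifies the exported pages.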
  2. Batch grouping. Combine 20-50 pages into one batch file, with each page prefixed by --- FILE: <path> ---. Open Claude (Sonnet 4.6+) or GPT-5.5-128k+ and send:

    Below is a batch of <N> knowledge-base pages, separated by `--- FILE: <path> ---`.
    
    <paste or upload the batch>
    
    Output three Markdown tables:
    
    Table 1 — group by topic:
    | group_name | file paths included | core question (≤1 sentence) |
    
    Table 2 — near-duplicate pairs (semantically overlapping but different files):
    | file A | file B | overlap dimension | overlap severity (high/med/low) |
    
    Table 3 — orphan pages (not in any group, no inbound links, not updated in 12 months):
    | file | best guess what this page was for | suggested action (merge / archive / delete / keep) |
    
    Do NOT hallucinate file paths — use only the `<path>` values I provided.

    Run once per batch, append into triage.csv.
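Building the batch files by hand gets tedious past a few dozen pages. The sketch below shows one way to do it, assuming the export is Markdown under raw/ and that 30 pages per batch fits your model's context window; `build_batches` and the `batch_NNN.md` naming are illustrative choices, not a fixed convention.

```python
from pathlib import Path


def build_batches(raw_dir: Path, out_dir: Path, batch_size: int = 30) -> list[Path]:
    """Concatenate exported .md pages into batch files of `batch_size` pages."""
    pages = sorted(raw_dir.rglob("*.md"))
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(pages), batch_size):
        parts = []
        for page in pages[start:start + batch_size]:
            # The path relative to raw/ is what the model must echo back.
            rel = page.relative_to(raw_dir).as_posix()
            text = page.read_text(encoding="utf-8", errors="replace")
            parts.append(f"--- FILE: {rel} ---\n{text.rstrip()}\n")
        batch_path = out_dir / f"batch_{start // batch_size + 1:03d}.md"
        batch_path.write_text("\n".join(parts), encoding="utf-8")
        written.append(batch_path)
    return written


if __name__ == "__main__":
    root = Path("kb_cleanup_2026_05_21")
    for path in build_batches(root / "raw", root / "batches"):
        print(path)
```

Sorting the paths keeps pages from the same folder in the same batch, which tends to make topic grouping easier for the model.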

  3. Triage sheet. Give each page one row and tag it with 4 fields (group, status, confidence, notes):

    page_path,group,status,confidence,notes
    docs/onboarding_v1.md,onboarding,merge,high,merge with v2/v3
    docs/onboarding_v2.md,onboarding,merge,high,canonical base
    docs/legacy_aws_setup.md,infra-legacy,archive,high,pre-2023 setup
    docs/ceo_2022_strategy.md,founder-notes,archive,medium,keep as historical
    docs/test_page.md,orphan,delete,high,test leftover

    status is one of keep / merge / archive / delete. confidence is one of high / medium / low. Re-run the step 2 grouping prompt for low-confidence rows.
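Because triage.csv is stitched together from many model responses, a quick sanity check is worth running before the merge step. The sketch below assumes the five-column header shown above; the function name `check_triage` is hypothetical. It reports rows with unexpected status or confidence values, lists low-confidence pages that need another pass, and counts rows per status.

```python
import csv
from collections import Counter

VALID_STATUS = {"keep", "merge", "archive", "delete"}
VALID_CONFIDENCE = {"high", "medium", "low"}


def check_triage(csv_path):
    """Return (problems, low_confidence_paths, status_counts) for a triage sheet."""
    problems = []
    low_confidence = []
    status_counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # Data starts on line 2; line 1 is the header.
        for line_no, row in enumerate(csv.DictReader(fh), start=2):
            path = (row.get("page_path") or "").strip()
            status = (row.get("status") or "").strip().lower()
            confidence = (row.get("confidence") or "").strip().lower()
            if status in VALID_STATUS:
                status_counts[status] += 1
            else:
                problems.append(f"line {line_no}: unknown status {status!r}")
            if confidence not in VALID_CONFIDENCE:
                problems.append(f"line {line_no}: unknown confidence {confidence!r}")
            elif confidence == "low":
                low_confidence.append(path)
    return problems, low_confidence, status_counts


if __name__ == "__main__":
    problems, low, counts = check_triage("kb_cleanup_2026_05_21/triage.csv")
    print(dict(counts))
    print("needs another pass:", low)
    print("\n".join(problems))
```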

  4. Merge into canonical. For each status=merge group:

    Below are 3-5 pages all under the same topic "<group core question>". I'm merging them into one canonical doc.
    
    <paste full text of all candidates, each prefixed --- FILE: <path> --->
    
    Produce a canonical draft:
    1. Structure: what / why / how / examples / FAQ / related
    2. Every factual claim must carry [from: <path>] marking which source it came from
    3. When candidates conflict on the same fact, list the conflict — DO NOT pick for me. Put a "Human decision needed" section at the bottom
    4. Flag candidates with stale info (mentions tools / versions no longer in use), suggest updates
    
    After writing, give me 1 sentence on why this canonical is stronger than any single candidate.

    Save to canonical_drafts/<topic>.md. Human-review the “Human decision needed” section before publishing.
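Before the human review, it helps to confirm that every [from: <path>] marker in the draft points at a page you actually supplied, since a model can still invent or mistype a path. A minimal sketch, assuming the markers follow the exact `[from: path]` form requested in the prompt; `audit_citations` is a hypothetical helper name.

```python
import re

# Matches "[from: some/path.md]" and captures the path without padding.
FROM_MARKER = re.compile(r"\[from:\s*([^\]]+?)\s*\]")


def audit_citations(draft: str, candidates: set[str]) -> dict[str, set[str]]:
    """Compare cited source paths in a draft against the supplied candidates."""
    cited = set(FROM_MARKER.findall(draft))
    return {
        "unknown": cited - candidates,  # cited, but never supplied to the model
        "unused": candidates - cited,   # supplied, but never cited in the draft
    }


if __name__ == "__main__":
    draft = "Laptops ship on day one [from: docs/onboarding_v1.md]."
    report = audit_citations(draft, {"docs/onboarding_v1.md", "docs/onboarding_v2.md"})
    print(report)
```

An "unknown" path is a red flag for a fabricated source. An "unused" path is softer: it may mean the candidate was fully redundant, or that its content was dropped by mistake.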

  5. Archive, do not delete. Keep the page, add a banner at the top, then move it to /archive:

    > **This page is archived (YYYY-MM).**
    > Current version: [<canonical page title>](<canonical page link>)
    > Kept so old links / search results don't 404. Do not edit further.

    Notion: move the page to an Archive workspace and add the callout block above. Confluence: insert an Info macro, then Move to an Archive space. Do not delete — search engines have these URLs cached; deletion creates 404s and hurts SEO.
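If you are archiving dozens of pages, generating the banner text from the triage sheet avoids copy-paste slips. The sketch below only builds the banner string; the function name and the example title and URL are placeholders, and pasting or pushing it into Notion or Confluence is left to you.

```python
from datetime import date


def archive_banner(canonical_title: str, canonical_link: str, archived_on: date) -> str:
    """Render the archive banner for one page as a Markdown blockquote."""
    return (
        f"> **This page is archived ({archived_on:%Y-%m}).**\n"
        f"> Current version: [{canonical_title}]({canonical_link})\n"
        "> Kept so old links / search results don't 404. Do not edit further.\n"
    )


if __name__ == "__main__":
    # Placeholder title and link for illustration only.
    print(archive_banner("Onboarding", "https://example.com/onboarding", date.today()))
```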

  6. AI taxonomy generation:

    Below is the list of `keep + merge` pages and their core questions:
    
    <paste keep/merge rows from triage.csv>
    
    Give me a flat tag taxonomy:
    - Total 15-30 tags (do not exceed 30)
    - Each tag with a 1-sentence definition of "what content belongs here"
    - No overlap — each page should match ≤3 tags, not blanket all
    - Format: `tag-slug | 1-line definition | 1-2 example pages`
    
    Then suggest ≤3 tags per page. Output as CSV: `page_path,suggested_tags`.

    Write tags back into the KB (Notion Tags property, Confluence labels).
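The constraints in the prompt (15-30 tags, at most 3 per page, no tags outside the taxonomy) are easy to check mechanically before writing anything back. This sketch assumes the taxonomy is saved as `tag-slug | definition | examples` lines and that multiple tags in the `suggested_tags` column are separated by semicolons; that separator is an assumption, so state it in your prompt or adjust the code. Both function names are hypothetical.

```python
import csv
import io


def parse_taxonomy(text: str) -> list[str]:
    """Extract tag slugs from 'tag-slug | definition | examples' lines."""
    slugs = []
    for line in text.splitlines():
        if "|" not in line:
            continue
        slug = line.split("|", 1)[0].strip().strip("`")
        if slug and slug != "tag-slug":  # skip a header row if present
            slugs.append(slug)
    return slugs


def check_page_tags(slugs: list[str], csv_text: str, max_tags: int = 3) -> list[str]:
    """Return human-readable problems; an empty list means the tagging passes."""
    problems = []
    known = set(slugs)
    if not 15 <= len(slugs) <= 30:
        problems.append(f"taxonomy has {len(slugs)} tags, expected 15-30")
    if len(known) != len(slugs):
        problems.append("taxonomy contains duplicate slugs")
    for row in csv.DictReader(io.StringIO(csv_text)):
        page = (row.get("page_path") or "").strip()
        raw = row.get("suggested_tags") or ""
        tags = [t.strip() for t in raw.split(";") if t.strip()]
        if len(tags) > max_tags:
            problems.append(f"{page}: {len(tags)} tags, limit is {max_tags}")
        unknown = [t for t in tags if t not in known]
        if unknown:
            problems.append(f"{page}: tags not in taxonomy: {', '.join(unknown)}")
    return problems
```

Any reported problem is a reason to re-prompt rather than hand-edit, so the taxonomy and the per-page tags stay in sync.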

  7. Rebuild home / index. Use the taxonomy + top 20 traffic pages (GA, Notion analytics, or Confluence “Popular”):

    Design a new KB home page structure:
    - Top: "5 most-used entries" — the 5 highest-traffic canonical pages
    - Middle: 6-8 sections by taxonomy tag, each listing 3-5 representative pages
    - Bottom: an auto-generated "Updated in the last 7 days" list
    
    Each entry gets a 1-line description (≤10 words) so a newcomer knows what's behind the link without clicking.
  8. Schedule the quarterly recheck. Add a recurring calendar event, KB recheck, starting 2026-08-21 and repeating every 3 months (roughly 90 days). Put the path kb_cleanup_2026_05_21/ in the event description so the next pass can reuse the same batch, prompt, and triage templates.

Export dump → batch grouping → triage sheet → merge into canonical pages → archive obsolete → tag taxonomy → rebuild index → recurring pass.

Common mistakes

  • Deleting before archiving — old links die silently
  • Skipping the merge-into-canonical step — duplicates come back
  • No quarterly recurring pass — the graveyard reforms in 6 months

FAQ

  • How long for a 1000-page wiki? Roughly 2-3 focused days for triage plus 1 week for merges, if you can dedicate one person.
  • Can AI delete pages directly? No. Keep the deletion step manual: AI proposes, you decide.

Tags: #Tutorial #Productivity #Knowledge base #Cleanup