Translation Pages Mismatched: EN Has 5 Sections, ZH Has 3

Solo edits let EN and ZH drift apart: sections, code blocks, and links diverge. Audit pairs by structure, diff the bilingual content, and enforce a translate-as-you-edit policy.

You open the EN article, you open its ZH counterpart, and they barely look like translations of each other. EN has five ## sections and three fenced code blocks; ZH has three sections and one code block. EN added an FAQ block six weeks ago; ZH never got it. EN renamed a step from “Step 1: Audit” to “Step 1: Inventory” and the ZH version still says the old phrasing. The pair shares a translationKey but the content has structurally diverged.

This is different from word-count drift, where ZH is simply terser because of the language. It is structural drift: section counts differ, code block counts differ, link targets differ, and headings on one side describe concepts that no longer exist on the other. Readers landing via hreflang feel cheated, and search engines that see mismatched alternates may trust both pages less. The fix has three legs: audit by structure (not just timestamps), enforce translate-as-you-edit at the PR layer, and accept “single-language” as a valid declaration for low-value pages.

Common causes

1. Solo edits never trigger a translation ticket

You edit en/foo.mdx to add a new section. You commit. Nothing reminds you that zh/foo.mdx now lacks that section. Repeat for six months across 200 articles and the structural gap is huge.

How to spot it: count ## headings per file and diff across pairs.

for f in src/content/articles/en/troubleshooting/*.mdx; do
  key=$(basename "$f")
  zh="src/content/articles/zh/troubleshooting/$key"
  [ -f "$zh" ] || continue
  en_sec=$(grep -c '^## ' "$f")
  zh_sec=$(grep -c '^## ' "$zh")
  if [ "$en_sec" != "$zh_sec" ]; then
    echo "$key: en=$en_sec zh=$zh_sec"
  fi
done

The loop prints every mismatch. Treat any pair where en and zh disagree by two or more as structural drift, not language verbosity.

2. Edits propagate in one direction only, and EN -> ZH usually stalls

Most content sites have a primary author who writes EN first. ZH gets translated weeks later, if at all. Subsequent EN edits never make it into ZH. The pair starts mirrored and drifts with every PR.

How to spot it: list pairs where EN mtime is more than 30 days newer than ZH.
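A minimal sketch of that check, assuming you already have a last-modified timestamp per file. Where the timestamps come from is left to the caller: a fresh git clone resets file mtimes to checkout time, so last-commit dates are usually the more reliable source in CI. The function names and sample data are hypothetical.

```javascript
// Sketch: flag pairs where EN was modified more than `maxLagDays` after ZH.
// Timestamps are epoch milliseconds supplied by the caller.
const DAY_MS = 24 * 60 * 60 * 1000;

function lagDays(enMs, zhMs) {
  return Math.floor((enMs - zhMs) / DAY_MS);
}

function stalePairs(pairs, maxLagDays = 30) {
  return pairs
    .map((p) => ({ ...p, lag: lagDays(p.enMs, p.zhMs) }))
    .filter((p) => p.lag > maxLagDays)
    .sort((a, b) => b.lag - a.lag); // most stale first
}

// Hypothetical sample data for illustration only.
const samplePairs = [
  { key: "a.mdx", enMs: Date.UTC(2024, 5, 1), zhMs: Date.UTC(2024, 0, 1) },
  { key: "b.mdx", enMs: Date.UTC(2024, 5, 1), zhMs: Date.UTC(2024, 4, 20) },
];
console.log(stalePairs(samplePairs).map((p) => `${p.key}: ZH is ${p.lag} days behind`));
```

Only a.mdx is reported here, because b.mdx is 12 days behind and falls under the 30-day threshold.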

3. Renamed translationKey or moved file breaks the pair silently

You renamed a slug in EN. The translationKey now points to a missing ZH file (or a stale one with the old key). hreflang emits a dangling pair. Nothing fails the build.

How to spot it: dump translationKeys from both locales and diff.

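# Needs bash or zsh for <(...). grep -r recurses on its own, so this does not rely on globstar.
diff \
  <(grep -rh --include='*.mdx' '^translationKey:' src/content/articles/en | sort -u) \
  <(grep -rh --include='*.mdx' '^translationKey:' src/content/articles/zh | sort -u)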
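The same comparison can be sketched in JavaScript if you would rather fold it into a Node audit script. This assumes each file's front matter holds a single `translationKey: value` line; the helper names and sample strings are hypothetical.

```javascript
// Sketch: find translationKeys present in one locale but missing in the other.
function extractKey(text) {
  // Accepts quoted or unquoted values on a `translationKey:` line.
  const m = text.match(/^translationKey:\s*["']?([^"'\r\n]+?)["']?\s*$/m);
  return m ? m[1] : null;
}

function unpairedKeys(enTexts, zhTexts) {
  const en = new Set(enTexts.map(extractKey).filter(Boolean));
  const zh = new Set(zhTexts.map(extractKey).filter(Boolean));
  return {
    onlyEn: [...en].filter((k) => !zh.has(k)).sort(),
    onlyZh: [...zh].filter((k) => !en.has(k)).sort(),
  };
}

// Hypothetical file contents for illustration only.
const enDocs = [
  "---\ntranslationKey: audit-pairs\n---\nBody",
  "---\ntranslationKey: \"new-slug\"\n---\nBody",
];
const zhDocs = [
  "---\ntranslationKey: audit-pairs\n---\nBody",
  "---\ntranslationKey: old-slug\n---\nBody",
];
console.log(unpairedKeys(enDocs, zhDocs));
```

A key that shows up in `onlyEn` together with a similar-looking one in `onlyZh` is the typical signature of a rename applied to one locale only.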

4. New code examples added in EN never copied to ZH

You added a fenced code block in EN with a fresh shell script. ZH still shows the old version (or has no code block at all). Code blocks are language-agnostic but the surrounding prose isn’t — so the ZH page now references a snippet that does not appear on the page.

How to spot it: count triple-backtick fences per pair and diff.
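One way to sketch this check is to list the language tag on each opening fence, so the report shows which kind of block is missing and not just how many. It assumes fences start at column 0 and use backticks only; the function names and sample pair are hypothetical.

```javascript
// Sketch: list the language tag of every fenced block, then compare a pair.
const FENCE = "`".repeat(3);

function codeBlockLangs(text) {
  const langs = [];
  let open = false;
  for (const line of text.split(/\r?\n/)) {
    if (!line.startsWith(FENCE)) continue;
    // Only the opening fence carries a language tag.
    if (!open) langs.push(line.slice(FENCE.length).trim() || "(none)");
    open = !open;
  }
  return langs;
}

function compareCodeBlocks(enText, zhText) {
  const en = codeBlockLangs(enText);
  const zh = codeBlockLangs(zhText);
  return { en, zh, delta: en.length - zh.length };
}

// Hypothetical sample pair for illustration only.
const enSample = ["Intro", FENCE + "sh", "ls", FENCE, "More", FENCE + "js", "1 + 1", FENCE].join("\n");
const zhSample = ["简介", FENCE + "sh", "ls", FENCE].join("\n");
console.log(compareCodeBlocks(enSample, zhSample));
```

A positive `delta` means EN has blocks that ZH lacks; comparing the two tag lists usually points at the specific snippet to copy over.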

5. FAQ block added on one side only

You added a ## FAQ section with three ### Question? entries on EN. ZH never got it. The FAQ JSON-LD only emits on EN. The ZH page loses a rich result opportunity and looks thinner.
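A rough way to spot this is to count the ### entries under the FAQ heading on each side. The sketch below assumes the EN heading is literally `## FAQ` and guesses at a ZH heading text; both the pattern and the sample documents are assumptions to adjust for your own titles. It does not skip fenced code.

```javascript
// Sketch: count `### ` entries that sit under a `## FAQ` heading.
function faqQuestions(text, headingPattern = /^## +(FAQ|常见问题)\s*$/i) {
  let inFaq = false;
  let count = 0;
  for (const line of text.split(/\r?\n/)) {
    if (/^## /.test(line)) {
      inFaq = headingPattern.test(line); // any new h2 ends the FAQ section
    } else if (inFaq && /^### /.test(line)) {
      count += 1;
    }
  }
  return count;
}

// Hypothetical sample pair for illustration only.
const enFaqDoc = ["## Steps", "### Step 1", "## FAQ", "### Why?", "### How?", "### When?"].join("\n");
const zhFaqDoc = ["## 步骤", "### 第一步"].join("\n");
console.log({ en: faqQuestions(enFaqDoc), zh: faqQuestions(zhFaqDoc) });
```

Any pair where one side returns zero and the other does not is a candidate for the one-sided FAQ described above.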

Shortest path to fix

Step 1: Run a structural diff across all pairs

Build a script that compares structure, not just mtime:

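// scripts/audit-pair-structure.mjs
// Flags EN/ZH pairs whose heading or code-block counts differ by two or more.
import fs from "node:fs";
import path from "node:path";

const EN_DIR = "src/content/articles/en/troubleshooting";
const ZH_DIR = "src/content/articles/zh/troubleshooting";

// Count structural markers in one file. Note: "## " lines inside fenced
// code (for example shell comments) are counted as headings too.
function metrics(file) {
  const txt = fs.readFileSync(file, "utf8");
  return {
    h2: (txt.match(/^## /gm) || []).length,
    h3: (txt.match(/^### /gm) || []).length,
    // Each block has an opening and a closing fence, hence the division by 2.
    code: (txt.match(/^```/gm) || []).length / 2,
    lines: txt.split("\n").length,
  };
}

for (const f of fs.readdirSync(EN_DIR)) {
  if (!f.endsWith(".mdx")) continue; // skip directories and non-article files
  const en = path.join(EN_DIR, f);
  const zh = path.join(ZH_DIR, f);
  if (!fs.existsSync(zh)) continue; // unpaired EN file, nothing to compare
  const a = metrics(en), b = metrics(zh);
  if (Math.abs(a.h2 - b.h2) >= 2 || Math.abs(a.code - b.code) >= 2) {
    console.log(`DRIFT ${f}: h2 en=${a.h2} zh=${b.h2}, code en=${a.code} zh=${b.code}`);
  }
}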

The output lists every pair that diverges by two or more sections or code blocks, in directory order. Sync the pairs with the largest gaps first.
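If you want the output actually sorted by how far apart the two sides are, a drift score can be sketched as below. It also skips headings that appear inside fenced code. The equal weighting of h2, h3, and code-block differences is an assumption, and the function names and sample pairs are hypothetical.

```javascript
// Sketch: count headings outside fenced code, then rank pairs by drift.
const FENCE = "`".repeat(3);

function structure(text) {
  const out = { h2: 0, h3: 0, code: 0 };
  let inCode = false;
  for (const line of text.split(/\r?\n/)) {
    if (line.startsWith(FENCE)) {
      if (!inCode) out.code += 1; // count each opening fence once
      inCode = !inCode;
    } else if (!inCode) {
      if (line.startsWith("## ")) out.h2 += 1;
      else if (line.startsWith("### ")) out.h3 += 1;
    }
  }
  return out;
}

// Summed absolute differences; equal weights are an assumption.
function driftScore(a, b) {
  return Math.abs(a.h2 - b.h2) + Math.abs(a.h3 - b.h3) + Math.abs(a.code - b.code);
}

function rankPairs(pairs) {
  return pairs
    .map(({ key, en, zh }) => ({ key, score: driftScore(structure(en), structure(zh)) }))
    .filter((p) => p.score > 0)
    .sort((x, y) => y.score - x.score); // most divergent first
}

// Hypothetical sample pairs for illustration only.
const rankSample = [
  { key: "same.mdx", en: "## A\n## B", zh: "## A\n## B" },
  { key: "minor.mdx", en: "## A\n## B", zh: "## A" },
  {
    key: "drifted.mdx",
    en: ["## A", FENCE + "sh", "## not a heading", FENCE, "## B", "### B1", "## C"].join("\n"),
    zh: "## A",
  },
];
console.log(rankPairs(rankSample));
```

The `## not a heading` line inside the fence is ignored, so drifted.mdx scores on its real headings and its one code block only.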

Step 2: For each drifted pair, decide sync or split

Three legitimate outcomes:

- Sync: bring the laggard up to match the leader's structure
- Split: content has legitimately diverged; remove the translationKey pair and treat as two distinct articles
- Mark single-language: low-traffic ZH; remove translationKey on ZH, drop hreflang alternate from EN

Do not “auto-translate the missing sections.” Bad MT is worse than a missing section. Either commit to a real translation or split the pair.

Step 3: Enforce translate-as-you-edit at the PR layer

Add a CI step that flags any PR touching en/*.mdx without touching the matching zh/*.mdx:

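# .github/workflows/translation-sync.yml fragment
# Assumes the checkout step fetched enough history for origin/main to resolve
# (for example, fetch-depth: 0 on actions/checkout).
- name: Check translation parity
  run: |
    # Files changed on this branch since it diverged from main
    CHANGED=$(git diff --name-only origin/main...HEAD)
    CHANGED_EN=$(echo "$CHANGED" | grep '^src/content/articles/en/.*\.mdx$' || true)
    for f in $CHANGED_EN; do
      zh=$(echo "$f" | sed 's|/en/|/zh/|')
      # -x -F: match the whole path literally, not as a regex substring
      if [ -f "$zh" ] && ! echo "$CHANGED" | grep -qxF "$zh"; then
        echo "::warning::EN changed: $f -- but ZH not updated: $zh"
      fi
    done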

This is a warning, not a failure. The author either updates ZH in the same PR or opens an i18n ticket and acknowledges the drift explicitly.
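The pairing logic itself is small enough to sketch as a pure function, which makes it easy to unit-test outside CI. This assumes the caller supplies the list of changed paths and the list of paths that exist in the repo; the function name and sample paths are hypothetical.

```javascript
// Sketch: report EN articles changed in a PR whose existing ZH counterpart
// was not touched in the same PR.
function unsyncedEdits(changedPaths, existingPaths) {
  const changed = new Set(changedPaths);
  const exists = new Set(existingPaths);
  return changedPaths
    .filter((p) => p.includes("/en/") && p.endsWith(".mdx"))
    .map((en) => ({ en, zh: en.replace("/en/", "/zh/") })) // first match only, like the sed call
    .filter(({ zh }) => exists.has(zh) && !changed.has(zh));
}

// Hypothetical sample data for illustration only.
const changedInPr = [
  "src/content/articles/en/troubleshooting/foo.mdx",
  "src/content/articles/en/troubleshooting/bar.mdx",
  "src/content/articles/zh/troubleshooting/bar.mdx",
];
const inRepo = [
  "src/content/articles/zh/troubleshooting/foo.mdx",
  "src/content/articles/zh/troubleshooting/bar.mdx",
];
console.log(unsyncedEdits(changedInPr, inRepo));
```

Set lookups compare whole paths, so a file such as foo.mdx is never mistaken for a longer path that merely contains it.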

Step 4: Backfill the worst offenders deliberately

Pick the top 20 drifted pairs by traffic (filter Search Console by /zh/articles/*). Sync those manually. Ignore the long tail until it earns the work.
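Picking that list can be sketched as a sort over the drifted slugs. Here `clicksByPath` is a hypothetical map you would build yourself from a Search Console export, keyed by the /zh/articles/slug/ path; the function name and numbers are illustrative only.

```javascript
// Sketch: order drifted pairs by ZH traffic and keep the top N.
function backfillQueue(driftedSlugs, clicksByPath, limit = 20) {
  return driftedSlugs
    .map((slug) => ({ slug, clicks: clicksByPath[`/zh/articles/${slug}/`] || 0 }))
    .sort((a, b) => b.clicks - a.clicks) // highest traffic first
    .slice(0, limit);
}

// Hypothetical sample data for illustration only.
const driftedSlugs = ["a", "b", "c"];
const clicksByPath = { "/zh/articles/a/": 5, "/zh/articles/c/": 40 };
console.log(backfillQueue(driftedSlugs, clicksByPath, 2));
```

Slugs with no recorded clicks sort to the bottom, which matches the advice to leave the long tail alone.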

Step 5: Verify hreflang still pairs cleanly

After sync, recheck hreflang in the sitemap. Each pair should emit:

<xhtml:link rel="alternate" hreflang="en" href="https://site.com/en/articles/slug/" />
<xhtml:link rel="alternate" hreflang="zh" href="https://site.com/zh/articles/slug/" />

If a page got marked single-language, drop the alternate entirely. Half-emitted hreflang is worse than none.
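A quick check for half-emitted pairs can be sketched over the sitemap text. This uses regular expressions instead of a real XML parser, which is a simplification that assumes one `<url>` element per page and double-quoted attributes; the function name and the /solo/ URL are hypothetical.

```javascript
// Sketch: flag <url> entries that declare some, but not all, expected hreflang alternates.
function halfEmitted(sitemapXml, locales = ["en", "zh"]) {
  const problems = [];
  const entries = sitemapXml.match(/<url>[\s\S]*?<\/url>/g) || [];
  for (const entry of entries) {
    const loc = (entry.match(/<loc>([^<]+)<\/loc>/) || [])[1] || "(no loc)";
    const found = new Set([...entry.matchAll(/hreflang="([^"]+)"/g)].map((m) => m[1]));
    const present = locales.filter((l) => found.has(l));
    // Zero alternates is fine (single-language); a partial set is not.
    if (present.length > 0 && present.length < locales.length) {
      problems.push({ loc, present });
    }
  }
  return problems;
}

// Hypothetical sitemap fragment for illustration only.
const sampleSitemap = `<urlset>
<url><loc>https://site.com/en/articles/slug/</loc>
<xhtml:link rel="alternate" hreflang="en" href="https://site.com/en/articles/slug/" />
<xhtml:link rel="alternate" hreflang="zh" href="https://site.com/zh/articles/slug/" />
</url>
<url><loc>https://site.com/en/articles/solo/</loc>
<xhtml:link rel="alternate" hreflang="en" href="https://site.com/en/articles/solo/" />
</url>
<url><loc>https://site.com/en/articles/plain/</loc></url>
</urlset>`;
console.log(halfEmitted(sampleSitemap));
```

Only the /solo/ entry is reported: it declares an en alternate without a zh one, while the entry with no alternates at all passes.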

Prevention

  • CI warning whenever EN/ZH change in isolation; author must respond
  • Structural audit (h2/h3/code-block count) runs weekly in prebuild
  • New ## FAQ block on one side requires an i18n ticket before merge
  • Renaming a slug requires updating both locales in the same PR; lint rule enforces it
  • Low-traffic pages explicitly marked single-language rather than left drifted
  • Quarterly review: top 20 drifted pairs get scheduled sync work

Tags: #Content ops #Site quality #Site audit #Troubleshooting #Bilingual