You generated a HeyGen / D-ID / Synthesia talking head, or fed an existing clip to SyncLabs / Wav2Lip, and the mouth movement is visibly out of sync with the audio. Sometimes the lips lead by a few frames, sometimes they trail, sometimes they form the wrong shape entirely. Viewers notice this in under 2 seconds and trust drops immediately. Fix it by cleaning the source audio, picking a lip-sync model that matches the language, and aligning offset in post.
Common causes
Ordered from most to least common.
1. Source audio has long silences or filler noise
Lip-sync models lock onto phonemes. Heavy background noise, music beds, or 2-second silences confuse the model and it freezes the mouth or syncs to noise transients.
How to spot it: Open the audio in Audacity. A visible noise floor above -40 dB, or silences longer than 1 second, suggest the model will struggle.
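If you would rather script the check than eyeball it, the same two measurements can be approximated in Python. This is a minimal sketch, assuming a 16-bit PCM WAV; the function names, the 50 ms window, and the use of the quietest window as the "noise floor" are illustrative choices, not anything Audacity or a lip-sync tool defines.

```python
import array
import math
import sys
import wave


def read_wav_mono16(path):
    """Load a 16-bit PCM WAV; returns (samples, sample_rate), first channel only."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":  # WAV sample data is little-endian
        samples.byteswap()
    return list(samples[::channels]), rate


def rms_dbfs(samples):
    """RMS level of 16-bit samples in dBFS (0 dBFS = full scale)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square / (32768.0 ** 2))


def scan_levels(samples, sample_rate, window_ms=50, silence_db=-40.0):
    """Return (noise_floor_db, longest_silence_s) over fixed windows.

    noise_floor_db is the level of the quietest window (an assumption);
    longest_silence_s is the longest run of windows below silence_db.
    """
    if not samples:
        raise ValueError("no samples")
    size = max(1, sample_rate * window_ms // 1000)
    levels = [rms_dbfs(samples[i:i + size]) for i in range(0, len(samples), size)]
    longest = run = 0
    for level in levels:
        run = run + 1 if level < silence_db else 0
        longest = max(longest, run)
    return min(levels), longest * window_ms / 1000.0
```

A floor above -40 or a longest silence over 1.0 second matches the warning signs described above. Call `scan_levels(*read_wav_mono16("clean.wav"))` on your own file; the file name is only an example.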
2. Model trained primarily on English, fed non-English audio
Wav2Lip, SadTalker, and early HeyGen avatars were trained mostly on English phonemes. Mandarin, Japanese, or Hindi phoneme shapes do not match, so the mouth lands on the closest English shape.
How to spot it: English audio syncs cleanly; same speaker in their native language drifts.
3. Audio bitrate or sample rate mismatch
Source audio at 22 kHz fed to a 16 kHz model, or vice versa, causes temporal stretching if the tool does not resample it first. Lip timing starts aligned and then drifts by an amount that grows over the clip.
How to spot it: Lips align at clip start, drift further by clip end. Check audio sample rate.
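The arithmetic behind growing drift is simple enough to check by hand. The sketch below assumes the worst case, where samples recorded at one rate are read at another with no resampling at all; the function name and the 24 fps default are illustrative.

```python
def misread_rate_drift(t_seconds, actual_hz, assumed_hz, fps=24.0):
    """Drift at time t when samples recorded at actual_hz are read as assumed_hz.

    Returns (drift_seconds, drift_frames); positive means the audio runs long.
    """
    stretch = actual_hz / assumed_hz  # how much longer the audio plays
    drift = t_seconds * (stretch - 1.0)
    return drift, drift * fps


# Ten seconds into a 22050 Hz file read as 16000 Hz
seconds, frames = misread_rate_drift(10, 22050, 16000)
print(f"{seconds:.2f} s late, about {frames:.0f} frames at 24 fps")
```

Drift is zero at the start and scales linearly with time, which is the signature to look for. A fully unresampled 22 kHz / 16 kHz mismatch reaches whole seconds within ten seconds of audio and also shifts pitch, so slower drift usually points to a smaller rate mismatch somewhere in the chain.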
4. Video framerate not declared correctly
A 24 fps source video uploaded as 30 fps metadata gets re-timed during lip-sync, producing a constant offset.
How to spot it: Run ffprobe on the source clip; check the declared fps versus actual.
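ffprobe reports frame rates as rationals such as 24000/1001, so comparing them as strings is unreliable. A minimal sketch of the comparison follows. It assumes that r_frame_rate can stand in for the declared rate and avg_frame_rate for the measured one, which is a simplification; the function names and the 1 percent tolerance are hypothetical.

```python
import json
import subprocess
from fractions import Fraction


def parse_rate(text):
    """Parse an ffprobe rational such as '24000/1001'; None if unreadable."""
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def rates_disagree(r_frame_rate, avg_frame_rate, tolerance=0.01):
    """True when the two rates differ by more than the relative tolerance."""
    declared = parse_rate(r_frame_rate)
    measured = parse_rate(avg_frame_rate)
    if not declared or not measured:
        return True  # unreadable or zero rate: check the file by hand
    return abs(float(declared) - float(measured)) / float(measured) > tolerance


def probe_rates(path):
    """Ask ffprobe for both rates of the first video stream (needs ffprobe on PATH)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate,avg_frame_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(result.stdout)["streams"][0]
    return stream["r_frame_rate"], stream["avg_frame_rate"]
```

Typical use would be `rates_disagree(*probe_rates("input.mp4"))`. A True result means the clip deserves a closer look before it goes into a lip-sync tool.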
5. Avatar / face crop too tight or too loose
If the face bounding box clips off the chin or includes too much of the neck, the lip detector mis-locks. The model still tries to animate something, producing wrong-shaped mouth movement.
How to spot it: Inspect the face detection overlay if your tool provides one. The bounding box should run from the forehead to the chin, with the chin fully inside the box.
6. Two-pass workflows with audio re-encoded
If you render a clip with built-in lip-sync and then re-encode it for delivery in a different codec, the audio stream may shift by 1 to 3 frames during encoding.
How to spot it: Lip-sync was fine in tool preview but off in final delivery.
Before you start
- Save the original audio file and the source video separately, untouched.
- Identify whether drift is constant (offset bug) or growing (sample rate bug).
- Note the lip-sync tool, its model version, and the language of the audio.
- Decide acceptable tolerance: under 2 frames is usually invisible, 3-5 frames is noticeable, over 5 is unacceptable.
- Back up the project before re-rendering.
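Editors often report offsets in frames while audio tools report milliseconds, so it helps to convert between the two before applying the tolerance bands above. This is a small sketch; the function names are illustrative, and treating anything up to and including 2 frames as the "invisible" band is an assumption, since the bands above leave the range between 2 and 3 frames unassigned.

```python
def frames_to_ms(frames, fps):
    """Convert an offset in frames to milliseconds at the given frame rate."""
    return frames * 1000.0 / fps


def tolerance_band(offset_frames):
    """Map an offset (either direction) onto the tolerance bands."""
    size = abs(offset_frames)
    if size <= 2:
        return "usually invisible"
    if size <= 5:
        return "noticeable"
    return "unacceptable"


for frames in (1, 4, 6):
    print(frames, f"{frames_to_ms(frames, 24):.1f} ms", tolerance_band(frames))
```

The same frame count covers less time at 30 fps than at 24 fps, so note the project frame rate next to any tolerance you agree on.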
Information to collect
- Audio file: format, sample rate, bitrate, duration, noise floor.
- Video file: codec, fps (declared and actual), resolution.
- Tool, model version, language setting.
- Specific timestamps where drift is worst.
- Whether drift varies by phoneme (model issue) or by time (sample rate issue).
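Once you have offsets at a few timestamps, a straight-line fit tells you whether you are looking at a constant offset or growing drift. The sketch below is one way to do it, assuming you measured the offsets by hand in an editor; the function names and the "more than one frame of change across the clip" threshold are illustrative choices.

```python
def fit_drift(points):
    """Least-squares line through (timestamp_s, offset_ms) measurements.

    Returns (slope_ms_per_s, intercept_ms).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_t = sum(t for t, _ in points) / n
    mean_o = sum(o for _, o in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("timestamps must differ")
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in points)
    slope = sxy / sxx
    return slope, mean_o - slope * mean_t


def classify_drift(points, fps=24.0):
    """'growing' if the fitted offset changes by a frame or more over the clip."""
    slope, _ = fit_drift(points)
    span = max(t for t, _ in points) - min(t for t, _ in points)
    one_frame_ms = 1000.0 / fps
    return "growing" if abs(slope * span) >= one_frame_ms else "constant"


# Offsets measured at 0 s, 30 s and 60 s
print(classify_drift([(0, 80), (30, 82), (60, 79)]))
print(classify_drift([(0, 0), (30, 60), (60, 120)]))
```

Offsets that scatter with no trend, and get worse on particular sounds rather than at particular times, point toward the model rather than the timing.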
Step-by-step fix
Step 1: Clean the source audio
Before feeding any lip-sync tool:
# Audacity (free)
- Noise Reduction: capture noise profile from 1s silence, apply at 12 dB
- Normalize to -3 dBFS
- Trim leading and trailing silence to under 200 ms
- Export as 16-bit PCM WAV at 22050 Hz, or at whatever rate your tool documents (Wav2Lip, for one, works at 16000 Hz internally)
# Adobe Podcast Enhance (faster)
- Upload, auto-enhance, download
- Re-encode to WAV at 22050 Hz
Clean audio is the single biggest lever.
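The trim and normalize steps can also be scripted for batch work. The following is a rough sketch of those two operations on raw 16-bit samples; it assumes mono audio already loaded into a list, it does not replace noise reduction, and the names, the threshold of 328 (about -40 dBFS) and the 150 ms padding are illustrative.

```python
def trim_silence(samples, sample_rate, threshold=328, keep_ms=150):
    """Cut leading and trailing quiet samples, keeping keep_ms of padding.

    threshold=328 is roughly -40 dBFS for 16-bit audio.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []
    pad = sample_rate * keep_ms // 1000
    start = max(0, loud[0] - pad)
    end = min(len(samples), loud[-1] + 1 + pad)
    return samples[start:end]


def normalize_peak(samples, target_dbfs=-3.0):
    """Scale samples so the loudest one sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)
    gain = 32767 * 10 ** (target_dbfs / 20) / peak
    # Clamp to the 16-bit range in case of rounding at the extremes
    return [max(-32768, min(32767, round(s * gain))) for s in samples]
```

Keeping the padding at 150 ms leaves the result inside the under-200 ms target while avoiding a hard cut on the first syllable.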
Step 2: Match the model to the language
# English
- Wav2Lip, SyncLabs, HeyGen, D-ID all work
# Mandarin / Cantonese
- HeyGen Chinese-trained avatars (label: "CN")
- DUIX (domestic option, lower latency)
- Avoid Wav2Lip without a fine-tune
# Japanese / Korean
- HeyGen has a Japanese tier; performance varies
- Synthesia's Asian-language avatars are stronger
# Indic languages
- SadTalker fine-tunes available on HuggingFace
- Test before committing
Step 3: Lock audio and video to matching specs
# Re-encode audio to canonical 22050 Hz WAV
ffmpeg -i input.mp3 -ar 22050 -ac 1 -c:a pcm_s16le clean.wav
# Verify video framerate
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate input.mp4
# Force video to declared fps if mismatched
ffmpeg -i input.mp4 -r 24 -c:v libx264 -crf 18 -c:a copy fixed.mp4
Step 4: Tighten the face crop
For tools that expose face crop settings (HeyGen advanced, SadTalker):
- Bounding box: forehead to chin (chin fully included), with 10 percent padding
- Center horizontally on the nose tip
- Avoid full-body or wide shots; tighter is better for lip detection
- For HeyGen, use Studio mode and select Talking Photo with custom crop
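If your tool accepts explicit crop coordinates, the padding and centering rules above reduce to a few lines of arithmetic. This sketch assumes you already have pixel positions for the forehead, chin and nose tip from some face detector; those inputs, the square crop shape, and the function name are all hypothetical and not tied to any particular tool.

```python
def face_crop(forehead_y, chin_y, nose_x, frame_w, frame_h, pad=0.10):
    """Square crop from forehead to chin with padding, centered on the nose.

    Returns (left, top, width, height) in pixels, clamped to the frame.
    """
    face_h = chin_y - forehead_y
    if face_h <= 0:
        raise ValueError("chin must be below forehead")
    margin = round(face_h * pad)
    top = max(0, forehead_y - margin)
    bottom = min(frame_h, chin_y + margin)
    side = min(bottom - top, frame_w)
    # Center on the nose tip, then push back inside the frame if needed
    left = max(0, min(nose_x - side // 2, frame_w - side))
    return left, top, side, bottom - top


print(face_crop(200, 500, 960, 1920, 1080))
```

Checking the numbers against the face detection overlay before rendering is cheaper than discovering a clipped chin after the first pass.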
Step 5: Align offset in post
If a constant offset remains:
# DaVinci Resolve / Premiere Pro
- Bring video to V1, audio to A1
- Slide audio 1-5 frames earlier or later until lips match
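- For drift growing over time, retime one stream instead of sliding it:
- Measure the offset near the clip start and near the clip end
- Apply a uniform speed change of roughly 99.5 to 100.5 percent across the whole clip; steady drift comes from a fixed rate mismatch, so the correction is a fixed rate too
- Time-stretch the shorter stream to match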
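Both corrections can be worked out from two measured offsets before you touch the timeline. The sketch below assumes offsets are in milliseconds with positive meaning the audio lands late, and that the drift grows linearly; the function names are illustrative. The result is formatted for ffmpeg's atempo audio filter as one possible way to apply it.

```python
def slide_frames(offset_ms, fps):
    """Whole frames to slide the audio for a constant offset in ms."""
    return round(offset_ms * fps / 1000.0)


def tempo_for_drift(start_offset_ms, end_offset_ms, span_s):
    """Audio speed factor that cancels drift measured across span_s seconds.

    Audio that falls further behind is running slow, so the factor is above 1.
    """
    if span_s <= 0:
        raise ValueError("span must be positive")
    return 1.0 + (end_offset_ms - start_offset_ms) / 1000.0 / span_s


def atempo_filter(tempo):
    """Format the factor as an ffmpeg atempo filter argument."""
    return f"atempo={tempo:.6f}"


print(slide_frames(83.3, 24))
print(atempo_filter(tempo_for_drift(0, 120, 60)))
```

A factor of 1.002 is 100.2 percent speed, so the percentages in the editor and the value passed to ffmpeg describe the same correction. Apply the speed change first, then slide out whatever constant offset is left.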
Step 6: Re-render with consistent codec
# Final delivery without re-encode shift
ffmpeg -i synced.mp4 -c:v libx264 -crf 18 -c:a aac -b:a 192k \
-movflags +faststart -avoid_negative_ts make_zero final.mp4
-avoid_negative_ts make_zero shifts timestamps so the output starts at zero, which guards against the 1-3 frame audio shift that can appear after a re-encode or re-mux.
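To confirm the delivery file did not pick up a shift, you can compare the start times of the audio and video streams. The sketch below parses the JSON printed by `ffprobe -v error -show_entries stream=codec_type,start_time -of json final.mp4`; the function name is illustrative, and it assumes both streams report a numeric start_time. Matching start times do not rule out every source of offset, so treat this as a quick sanity check rather than proof.

```python
import json


def av_start_offset_frames(probe_json, fps):
    """Audio start minus video start, in frames (positive = audio starts later)."""
    starts = {}
    for stream in json.loads(probe_json)["streams"]:
        kind = stream.get("codec_type")
        # Keep only the first video and first audio stream
        if kind in ("video", "audio") and kind not in starts:
            starts[kind] = float(stream["start_time"])
    if len(starts) != 2:
        raise ValueError("need one video and one audio stream")
    return (starts["audio"] - starts["video"]) * fps
```

Run it on both synced.mp4 and final.mp4 at the project frame rate; if the two results differ by a frame or more, the re-encode moved the audio.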
Step 7: Use a second-pass lip-sync if quality is critical
For ad-tier deliverables:
- First pass: generate talking head in HeyGen / D-ID
- Export with audio
- Second pass: feed the exported clip to SyncLabs lip-sync mode
- SyncLabs re-syncs on top of the existing video
- This compounds quality, especially for non-English audio
Verify
- Watch the first 5 seconds at 100 percent speed; lips should land on phonemes.
- Slow to 25 percent and watch hard consonants (b, p, m); mouth should close on these.
- Try with three different testers; if all say sync is fine, you are good.
- Export and watch on phone, laptop, and TV; small screens hide drift but TV reveals it.
Long-term prevention
- Standardize source audio at 22050 Hz WAV, mono, -3 dBFS, under 200 ms leading silence.
- Pick a lip-sync model trained on the target language from the start.
- Lock video to a single fps standard across the project (24 or 30).
- Tighten face crop in pre-flight checks, not after the first render.
- Use a single final-render codec config across the project.
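These standards are easy to enforce with a pre-flight check that runs before anything is uploaded. The following is a minimal sketch, assuming you have already gathered the measurements into a dictionary; the key names, the 0.5 dB peak tolerance, and the function name are hypothetical, and the values mirror the list above.

```python
STANDARD = {
    "sample_rate": 22050,
    "channels": 1,
    "peak_dbfs": -3.0,
    "max_leading_silence_ms": 200,
    "allowed_fps": (24, 30),
}


def preflight(meta, standard=STANDARD):
    """Return a list of human-readable problems; empty means ready to render."""
    problems = []
    if meta["sample_rate"] != standard["sample_rate"]:
        problems.append(f"sample rate is {meta['sample_rate']} Hz")
    if meta["channels"] != standard["channels"]:
        problems.append(f"audio has {meta['channels']} channels")
    if abs(meta["peak_dbfs"] - standard["peak_dbfs"]) > 0.5:
        problems.append(f"peak is {meta['peak_dbfs']} dBFS")
    if meta["leading_silence_ms"] >= standard["max_leading_silence_ms"]:
        problems.append(f"leading silence is {meta['leading_silence_ms']} ms")
    if meta["fps"] not in standard["allowed_fps"]:
        problems.append(f"video is {meta['fps']} fps")
    return problems


clip = {"sample_rate": 44100, "channels": 2, "peak_dbfs": -3.0,
        "leading_silence_ms": 120, "fps": 24}
for problem in preflight(clip):
    print("fix before render:", problem)
```

Returning a list rather than stopping at the first failure lets you fix every issue in one pass instead of one per render.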
Common pitfalls
- Treating lip-sync drift as a tool bug when it is almost always source-audio quality.
- Feeding noisy lavalier or zoom audio directly into a lip-sync tool.
- Mixing 24 fps and 30 fps content within a single project.
- Re-encoding for delivery in a codec different from the working codec without checking offset.
FAQ
Why does my lip-sync work in English but drift in Chinese? Most lip-sync models are English-phoneme dominant. Use HeyGen CN avatars or DUIX for Chinese.
Is 2-frame drift noticeable? Usually not. Up to about 2 frames at 24 fps (roughly 83 ms) is invisible to most viewers. 3-5 frames is noticeable, and over 5 frames is unacceptable.
Can I fix drift in post for a live-action talking head? Yes. A frame-level audio slide in Resolve handles a constant offset; growing drift needs a time-stretch.