Fix AI Lip Sync Mismatch in Generated Video

Mouth movements don't match your separately generated audio. Diagnose the cause and pick the right fix: re-render, post-sync, or compose around it.

Published: May 17, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

You generated a clip in Runway, Kling, or Pika, generated the voice-over separately in ElevenLabs or OpenAI TTS, dropped both onto one timeline, and the mouth does not match the audio. The error can be anything from a fraction-of-a-second timing slip to mouth shapes that plainly belong to different vowels. Human perception is unforgiving here: broadcast standards (EBU R37) target audio between roughly 5 ms early and 15 ms late, and viewers reliably notice errors past about 45 ms when audio leads or about 125 ms when it lags (ITU-R BT.1359). For film the working tolerance is tighter, around 22 ms.
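To make those thresholds easier to apply, here is a minimal sketch that sorts a measured offset into the bands quoted above. The function name and the sign convention (positive means audio leads the picture) are assumptions of this example, not part of either standard's text.

```python
def classify_av_offset(offset_ms: float) -> str:
    """Classify an audio/video offset in milliseconds.

    Sign convention (an assumption of this sketch): positive means the
    audio leads the picture, negative means the audio lags it.
    """
    # EBU R37 target quoted in this article: 5 ms early to 15 ms late.
    if -15 <= offset_ms <= 5:
        return "within the EBU R37 target"
    # ITU-R BT.1359 detectability edge quoted in this article:
    # about 45 ms leading, about 125 ms lagging.
    if -125 <= offset_ms <= 45:
        return "below the ITU-R BT.1359 detectability threshold"
    return "likely noticeable"
```

For example, `classify_av_offset(60)` falls outside both bands because audio that leads by 60 ms is past the 45 ms edge, while audio that lags by 60 ms (`-60`) is still inside it.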

Fastest fix: if you generated video and audio independently, the mouth was never aligned to your phonemes and no amount of timeline nudging will fix it. Run the clip plus your audio file through a post-sync tool that re-renders the mouth (Sync.so, or open-source MuseTalk/Wav2Lip), or re-generate the whole shot end-to-end in an avatar tool (HeyGen, Synthesia). Timeline fixes (trimming leading silence, conforming the frame rate) only help when the problem is leading silence or a frame-rate mismatch, not unaligned phonemes.

Which bucket are you in

Symptom	Most likely cause	Go to
Mouth moves but never matches the words; lip closures for M/P/B are absent	Audio + video generated independently; model never saw your audio	Step 2 or Step 3
First second looks fine, drift grows toward the end	Frame-rate mismatch or variable-frame-rate (VFR) source	Cause 3 + Step 4
Constant offset of ~50-200 ms from the very start	Leading silence padding on the TTS file	Cause 4 + Step 4
Sync slips proportionally after you fit audio to length	Time-stretched/remapped audio	Cause 5
Tool simply has no audio-driven mouth feature	Wrong tool for the job	Cause 2 + Step 2
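
If you can measure the offset at a few points in the clip, the distinction between a constant offset and growing drift in the table above can be sketched as a simple line fit. This is an illustrative helper only: the function name and both tolerance defaults are hypothetical values chosen for the example, not figures from any standard.

```python
def diagnose_sync(samples: list[tuple[float, float]],
                  drift_tolerance_ms_per_s: float = 2.0,
                  offset_tolerance_ms: float = 40.0) -> str:
    """Fit offset = intercept + slope * time through measured samples.

    samples: (timestamp in seconds, measured offset in ms) pairs.
    The tolerances are illustrative defaults, not standard values.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_t = sum(t for t, _ in samples) / n
    mean_o = sum(o for _, o in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("measurements must be at different timestamps")
    # Ordinary least-squares slope (ms of error gained per second).
    slope = sum((t - mean_t) * (o - mean_o) for t, o in samples) / var_t
    intercept = mean_o - slope * mean_t
    if abs(slope) > drift_tolerance_ms_per_s:
        return "growing drift"
    if abs(intercept) > offset_tolerance_ms:
        return "constant offset"
    return "no systematic timing error"
```

A result of "no systematic timing error" on a clip that still looks wrong suggests the timing is fine and the mouth shapes themselves are the problem, which points back to independent generation.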

Common causes

Ordered by what we see most often.

1. Audio and video generated independently

The most common case. You wrote a prompt like “woman speaking to the camera,” generated 5s of video, then generated 5s of audio. Neither model knew about the other. The mouth movements are generic speech-like motion, not aligned to specific phonemes.

How to spot it: Did you generate the video clip without uploading or referencing the exact audio file? If yes, this is your case. A reliable tell: the lip closures for M, P, B (lips fully meet) and the lower-lip-to-teeth shapes for F, V are missing or land on the wrong syllable.

2. Tool does not support audio-driven lip sync

Not every video model aligns the mouth to an audio track. As of June 2026:

Runway Gen-3/Gen-4 does not align mouth to an uploaded audio track. Runway’s lip-sync path is Act-Two (released July 2025), which transfers a reference performance video (your own face/gestures) onto a generated character, not arbitrary audio.
Pika added an audio-driven path (Pikaframes / “Pikaformance,” powered by Pika 2.5): upload an image or character plus an audio file and it animates the mouth. Free plan caps audio at ~10s, paid plans ~30s.
Kling has native lip sync (current line is the Kling 3.x era): upload an audio file and it animates the mouth. Keep each spoken line short — roughly 3-5s — because long monologues desync, and multi-person dialogue is still weak.
Sora: the standalone Sora web/app was discontinued on April 26, 2026; it never offered a post-hoc lip-sync tool for arbitrary footage.

If you used a tool with no audio-driven mouth feature, no amount of prompt tweaking will fix it.

How to spot it: Check the tool’s docs for “lip sync” or “audio-driven motion.” If the feature is absent (or, as with Runway, only accepts a reference video not an audio file), the tool is the bottleneck.

3. Frame-rate mismatch or variable-frame-rate source

You generated video at 24fps and the audio is being mapped against a 30fps timeline (or vice versa). The mismatch compounds over time: the first second looks fine, by the fourth it is half a phoneme off. A nastier version is variable frame rate (VFR) — common in screen recordings, phone footage, and some AI exports. Editors like DaVinci Resolve assume a constant frame rate, so VFR media drifts out of sync no matter how carefully you cut it.
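
As a rough illustration of how this compounds, the sketch below computes the drift for one specific case: frames authored at one rate being played back frame-for-frame at another. That playback model is an assumption of the example; editors that resample frames to preserve duration will not drift this way. The function name is hypothetical.

```python
def drift_ms(elapsed_s: float, source_fps: float, timeline_fps: float) -> float:
    """Milliseconds by which the picture runs ahead of the audio after
    elapsed_s seconds of playback, assuming each source frame occupies
    exactly one timeline frame. Negative values mean the picture lags.
    """
    # After elapsed_s seconds the timeline has shown elapsed_s * timeline_fps
    # frames, which covers (elapsed_s * timeline_fps / source_fps) seconds
    # of the original footage.
    return elapsed_s * (timeline_fps / source_fps - 1.0) * 1000.0
```

Under this model a 24fps clip reinterpreted on a 30fps timeline is a full second ahead after four seconds, while the subtler 23.976-versus-24 case (`drift_ms(60, 24000 / 1001, 24)`) gains about 60 ms per minute.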

How to spot it: Open the video and check its fps in the editor’s clip attributes; compare with the project sequence/timeline fps. To check for VFR, run ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of default=noprint_wrappers=1 input.mp4 — if r_frame_rate and avg_frame_rate differ, the file is VFR. Conform it to constant frame rate first (see Step 4).
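
If you check many files, the comparison can be scripted. The sketch below parses the text that the ffprobe command above prints (lines such as `r_frame_rate=30/1`) and compares the two rates exactly. The function names are hypothetical, and the result is a hint rather than proof: some constant-frame-rate files report a slightly different average because of rounding.

```python
from fractions import Fraction


def parse_rates(ffprobe_output: str) -> dict[str, Fraction]:
    """Extract r_frame_rate and avg_frame_rate from key=value lines."""
    rates: dict[str, Fraction] = {}
    for line in ffprobe_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or key not in ("r_frame_rate", "avg_frame_rate"):
            continue
        num, _, den = value.partition("/")
        denominator = int(den or 1)
        if denominator == 0:
            continue  # a rate such as 0/0 means "unknown"; skip it
        rates[key] = Fraction(int(num), denominator)
    return rates


def looks_vfr(ffprobe_output: str) -> bool:
    """True when the two reported rates differ (a VFR hint, not proof)."""
    rates = parse_rates(ffprobe_output)
    if len(rates) < 2:
        raise ValueError("could not read both frame rates")
    return rates["r_frame_rate"] != rates["avg_frame_rate"]
```

Using `Fraction` avoids false mismatches from floating-point rounding, for example when both fields report 30000/1001.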

4. Audio has silence padding at the start

ElevenLabs and OpenAI TTS sometimes prepend a short stretch of silence (commonly tens to a couple hundred milliseconds; some TTS engines add far more) to the output. If you snapped the audio to clip start, the spoken portion is now offset by that padding and lip sync is misaligned by exactly that amount — a constant offset that does not grow.

How to spot it: Zoom into the waveform at the start. If there is flat audio before the first phoneme, that is your offset. Trim it (Step 4).
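
To put a number on that padding instead of eyeballing the waveform, a minimal sketch using only the Python standard library follows. It assumes the TTS output has been saved as a 16-bit PCM WAV file; the function name and the amplitude threshold of 500 are hypothetical choices for illustration, so tune the threshold to your file's noise floor.

```python
import array
import sys
import wave


def leading_silence_ms(path: str, threshold: int = 500) -> float:
    """Milliseconds before the first sample louder than threshold.

    Assumes a 16-bit PCM WAV file. threshold is an absolute sample
    value (0-32767) picked for illustration only.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Samples are interleaved, so divide by the channel count
            # to get the frame index before converting to time.
            return (index // channels) / rate * 1000.0
    return len(samples) / channels / rate * 1000.0  # file is all silence
```

The returned value is the amount to trim from the head of the audio, or equivalently the amount to slide the clip earlier on the timeline.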

5. Stretched / time-remapped audio

You sped up or slowed down the audio to fit the video duration. The lip movements are at the video’s original speed; the remapped audio is not. Sync drifts proportionally — small at the head, large at the tail.

6. Talking-head shot generated with vague “speaking” cues

Some video models produce mouth-open/mouth-closed motion that maps onto a vague sense of “speech” but never matches specific words. The decisive lip shapes for M, P, B (closures) and F, V are missing entirely. This is really a sub-case of cause 1: there is no per-phoneme target, so post-sync (Step 3) is the only real fix.

Before you change anything

Save both source assets (video and audio) at their original quality.
Note the exact tool, model, and version used to generate each.
Decide how important sync is: a brand explainer needs tight sync; B-roll voiceover does not.
Confirm the frame rates and sample rates of both assets match your edit timeline.
Back up the current edit before re-rendering — re-generation burns credits.

Information to collect

Both source files, original quality.
Frame rate of the video (constant vs VFR), sample rate and codec of the audio, project sequence settings.
Whether the audio includes leading silence padding.
Transcript of the audio with timestamps (most TTS tools export this; it makes drift easy to measure).
A specific timestamp where the mismatch is most obvious.

Shortest path to fix

Step 1: Decide your sync strategy

Three legitimate paths:

End-to-end lip sync in one tool — best sync quality, least control over voice/look.
Generate video and audio separately, then post-sync with a dedicated tool — best control, most steps.
Compose around the mismatch — cut away from the face during words that obviously break sync.

Choose based on the use case.

Step 2: For end-to-end lip sync, use a tool that supports it

Current options (June 2026):

HeyGen — avatar-based; type a script or upload an audio file and the avatar speaks it. Its video-translate feature also re-syncs an existing presenter’s mouth to translated audio across 175+ languages. Best out-of-the-box for talking-head/brand work.
Synthesia — avatar library + voice cloning + script-to-video.
D-ID — upload a portrait, give it a script, get a talking-head video.
Pika (Pikaframes/Pikaformance) — upload an image/character + audio; Pika 2.5 drives the mouth.
Kling lip sync (3.x era) — upload audio; keep lines short (~3-5s).
Runway Act-Two — drive a generated character with a reference performance video (not an audio file).

For brand work where lip sync must be tight, generate the entire shot in one of these rather than stitching separately.

Step 3: For separate generation, post-sync with a dedicated tool

After you have video and audio separately, send both to a tool that re-renders the mouth region to your audio:

Sync.so (the company behind Wav2Lip; formerly Sync Labs) — edits the lips of any speaker in any clip to match a target audio file. API and pay-per-use billing; Hobbyist tier starts around $5/month. Strongest on natural-looking faces.
MuseTalk — open-source, near-photorealistic, supports near real-time; a good free option if you can run it.
Wav2Lip / SadTalker — open-source pipelines you can run locally. Note: the open-source Wav2Lip license is for personal/research/non-commercial use, and its maintainers now direct commercial users to Sync’s API. Check the license before using it in paid work.

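The workflow is the same for Sync.so, MuseTalk, and Wav2Lip: upload the video clip plus the audio file, and the tool re-renders only the mouth area aligned to your audio. SadTalker differs in that it animates a single still image rather than re-rendering an existing clip.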

Step 4: Match frame rate and trim silence

Even with the right tool:

Pick one project frame rate, 24fps (cinema) or 30fps (web), and conform everything to it. Convert VFR sources to constant frame rate first: ffmpeg -i input.mp4 -r 30 -c:v libx264 -c:a copy output.mp4. Set the -r value to your project frame rate (use -r 24 for a 24fps project).
Trim leading silence before placing audio on the timeline. Use silence detection — Premiere, DaVinci Resolve, and Audition all have it.
Lock the audio sample rate to 48kHz (video standard); re-export TTS at 48kHz when the tool allows.
Let the editor align for you when you have a constant reference: in Premiere, select both clips and use Synchronize / Merge Clips (Audio waveform); in DaVinci Resolve, Auto Sync Audio in the Media Pool or Auto Align Clips → Based on Waveform in the timeline.
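
For batch work, the conform and the 48kHz resample can be combined into one ffmpeg call. The sketch below only builds the argument list. It swaps in ffmpeg's fps video filter as one way to force a constant rate, and it re-encodes the audio to AAC because resampling cannot be combined with stream copy. The function name and default values are assumptions of this example.

```python
def conform_command(src: str, dst: str, fps: int = 30,
                    sample_rate: int = 48000) -> list[str]:
    """Build an ffmpeg argument list that conforms video to a constant
    frame rate and resamples audio to sample_rate."""
    return [
        "ffmpeg",
        "-i", src,
        "-vf", f"fps={fps}",       # fps filter duplicates or drops frames
        "-c:v", "libx264",
        "-ar", str(sample_rate),   # resample audio, so it must be re-encoded
        "-c:a", "aac",
        dst,
    ]
```

Pass the result to `subprocess.run(conform_command("input.mp4", "output.mp4"), check=True)` on a machine with ffmpeg installed. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.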

Step 5: For the compose-around path

If you cannot regenerate or post-sync:

Cut away from the face during the worst-matched moments — B-roll, product shots, environment cuts that hide the mouth for 1-2 seconds.
Use over-the-shoulder shots where the mouth is angled away.
Use lower-third graphics during quoted text.
For social formats (TikTok, Reels), burned-in captions draw attention away from the lips.

Step 6: Re-record the voice to the video’s cadence

If you have voice control (ElevenLabs Studio, OpenAI TTS), re-render the voice to match the video’s mouth cadence: insert short pauses or speed up sections so speech lands on mouth opens. In ElevenLabs, force pauses with the <break time="0.5s" /> tag rather than relying on punctuation.
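
When you are retiming several lines, building the script text programmatically keeps the pauses consistent. This is a minimal sketch that joins phrases with the break tag shown above. The function name is hypothetical, and the 3-second cap is an assumption about the maximum pause the service accepts, so check the current ElevenLabs documentation before relying on it.

```python
def with_breaks(phrases: list[str], pauses_s: list[float],
                max_pause_s: float = 3.0) -> str:
    """Join phrases, inserting a break tag between consecutive phrases.

    pauses_s[i] is the pause after phrases[i]; a pause of 0 inserts
    no tag. max_pause_s is an assumed upper limit, not a verified one.
    """
    if not phrases:
        return ""
    if len(pauses_s) != len(phrases) - 1:
        raise ValueError("need exactly one pause between each pair of phrases")
    parts = [phrases[0]]
    for pause, phrase in zip(pauses_s, phrases[1:]):
        if pause > 0:
            pause = min(pause, max_pause_s)
            parts.append(f'<break time="{pause:.1f}s" />')
        parts.append(phrase)
    return " ".join(parts)
```

Measure the gaps between mouth opens in the video, pass them as `pauses_s`, and paste the returned string into the TTS tool.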

How to confirm it’s fixed

Play at full speed with sound. The eye should not catch the mismatch.
Play at 25% speed. Individual phoneme alignment should land within roughly 40-80 ms — viewers are more forgiving of audio lagging than leading, so if you must err, err late.
Check the lip closures for M/P/B words specifically; those are where bad sync shows first.
Show the clip to someone unfamiliar with the project and ask whether anything feels off.
Sample a moment from the middle and one from the end — drift compounds, so the end is the hardest test.

FAQ

Can I just nudge the audio track left or right to fix it?

Only if the cause is a constant offset, such as leading silence (Cause 4). Growing drift from a frame-rate or VFR mismatch (Cause 3) needs a conform to constant frame rate, because a single nudge can only line up one point in the clip. If the video and audio were generated independently (Cause 1), the mouth shapes don’t correspond to your words at any offset, so nudging just trades one bad frame for another. Post-sync (Step 3) is the fix.

Why does the start look synced but the end drift apart?

Classic frame-rate or VFR mismatch (Cause 3) or time-stretched audio (Cause 5). A constant per-frame error accumulates, so the head looks fine and the tail is half a phoneme off. Conform to a constant frame rate and remove any speed-ramping on the audio.

Which tool gives the best lip sync on a real person’s existing footage?

For re-syncing an existing recording of a real person (e.g. dubbing into another language), Sync.so or HeyGen’s video-translate path are the strongest. HeyGen is tuned for avatars and translation; Sync.so is the mouth-retargeting layer you bolt onto your own transcribe/translate/voice pipeline.

Is Wav2Lip free to use for client work?

The open-source Wav2Lip model is licensed for personal, research, and non-commercial use only; its maintainers point commercial users to Sync’s API. For paid/client work, use Sync.so or another commercially licensed option rather than the bare open-source model.

How tight does lip sync actually need to be?

Broadcast (EBU R37) aims for audio 5 ms early to 15 ms late; the detectability edge (ITU-R BT.1359) is around +45 ms (audio leading) to -125 ms (audio lagging); film practice is roughly within 22 ms. For web/social, anything under ~80 ms reads as in-sync to most viewers.

If it still fails

Reduce to the smallest reproduction: just the 2-second segment where the mismatch is worst. Most “the whole clip is off” complaints collapse to one segment.
Try the alternate path — if end-to-end was too rigid, try separate-then-post-sync; if post-sync looked off, try a different post-sync tool.
For business-critical content, accept an avatar tool (HeyGen, Synthesia) even if you wanted a more cinematic look — sync wins.
Package the source video, source audio, edit timeline, frame-rate details, and the bad moment before asking the community for help.

Prevention

Decide before generating whether lip sync matters; pick tools accordingly.
Standardize project frame rate (24fps cinema or 30fps web) across all generations, and conform any VFR source to constant frame rate on import.
Use 48kHz audio throughout to match video standards.
Build a “sync workflow” note per use case (brand video → HeyGen; B-roll → separate-then-post-sync).
Add ~200ms of buffer at each cut point so minor sync drift hides behind cuts.

Tags: #Prompt #Debug #Troubleshooting #Video generation