AI Character Motion Workflow: Stop the Uncanny Glitching

A repeatable image-to-video workflow for AI character clips that move naturally. Constrain the motion, lock the camera, batch-generate. Verified June 2026.

Published: May 17, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

Character motion is the part of AI video that breaks most often. Faces flicker between frames, fingers multiply, walk cycles slide or spasm, and the character you carefully designed turns into a near-relative two seconds in. The fix is not a better prompt adjective. It is a discipline: one motion per clip, a locked camera, a single reference image driving every shot, and a brutal accept/reject pass. This guide gives you the exact tools, prompts, costs, and checks, current as of June 2026.

TL;DR

Always start from a reference image and use image-to-video, never pure text-to-video. Runway Gen-4 holds character identity from a single reference image across shots; Veo 3.1 accepts up to three reference images.
One motion per clip, 3-5 seconds (a simple head turn can run as short as 2). Walking, a head turn, sitting, reaching. Never combine motions or stack camera movement on top of character movement.
Lock the camera in every prompt (fixed camera, no pan, no zoom, no dolly), or the tool adds a default push that compounds drift.
Batch and discard. Expect roughly 1 in 5-8 generations to pass. Budget for 6-10 attempts per shot.
Stitch the survivors in DaVinci Resolve / CapCut / Premiere, hiding cuts at head turns or occlusions, with one LUT over everything.

Who this is for

Indie animators, comic and story creators, short-film makers, and anyone producing character-driven AI video where a person has to move on screen and stay recognizably the same person.

This workflow is the wrong fit for two jobs: complex multi-character action (two or more people interacting physically still falls apart), and tight lip-sync to existing audio. For dialogue, generate the visuals here, then add the mouth movement with a dedicated lip-sync model (see the FAQ).

Which tool to use (June 2026)

The Sora consumer app shut down on April 26, 2026, so it is no longer a practical option for most creators. The current image-to-video field for character work:

Tool	Model (Jun 2026)	Character ref	Image-to-video cost	Entry price
Runway	Gen-4 / Gen-4.5 / Gen-4 Turbo	1 reference image, ~95% identity hold	Turbo 5 credits/sec, Gen-4 & 4.5 12 credits/sec	Standard $12/mo annual ($15 monthly), 625 credits/mo
Kling	Kling 2.5 / 2.6	Reference + start frame	Kling 2.5 Pro 1080p 5s ≈ 210 credits	Standard ≈ $6.99/mo, Pro $25.99/mo
Luma Dream Machine	Ray 3 (DM 2.0)	Character reference	Native 1080p, ~3x cheaper than Ray 2	Free tier; paid plans available
Google Veo 3.1	Veo 3.1	Up to 3 reference images, synced audio	Via Gemini app / Flow / Vids	Google AI Pro $19.99/mo

Practical default: iterate on Runway Gen-4 Turbo (5 credits/sec is cheap enough to brute-force), then re-roll your two or three best shots on Gen-4.5 for the hero takes. Reach for Veo 3.1 when you need synced audio or want to fix identity from more than one angle.

Step 1 - Build the reference image

Use Midjourney, FLUX, or Nano Banana (Gemini 2.5 Flash Image) to produce one front-facing, natural-light, medium shot (waist up) that is at least 1024 px on its shorter side; the prompt below asks for a 9:16 frame. Replace the bracketed placeholder with your real description.

[character description: age / gender / clothing / hair color / distinguishing features], neutral expression, looking at camera, medium shot from waist up, soft window light from camera-left, plain light gray studio background, sharp focus on face, 35mm lens, photo-realistic, 9:16

Save it as character_ref_v1.jpg. This same file is the image-to-video input for every clip in the project. Reusing one reference is what keeps the face consistent; a new reference per shot is the single biggest cause of identity drift.
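One way to enforce the single-reference rule is to keep a small shot list and flag any shot that points at a different file. The sketch below is illustrative only: the `Shot` class, the shot names, and `shots_off_reference` are hypothetical helpers, not part of any tool's API. Only the file name `character_ref_v1.jpg` comes from this guide.

```python
from dataclasses import dataclass

# File name taken from this guide; everything else here is illustrative.
REFERENCE_IMAGE = "character_ref_v1.jpg"


@dataclass(frozen=True)
class Shot:
    name: str             # hypothetical shot label, e.g. "walk_01"
    reference_image: str  # image-to-video input for this shot
    prompt: str           # motion prompt text


def shots_off_reference(shots, reference=REFERENCE_IMAGE):
    """Return the names of shots that do not use the shared reference image."""
    return [shot.name for shot in shots if shot.reference_image != reference]


shots = [
    Shot("walk_01", "character_ref_v1.jpg", "character walks from left edge to right edge of frame"),
    Shot("turn_01", "character_ref_v2.jpg", "character slowly turns head toward camera"),
]

# Any name printed here is a shot that risks identity drift.
print(shots_off_reference(shots))
```

Running a check like this before spending credits catches a swapped reference early, when it costs nothing to fix.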

Step 2 - One motion per clip

For each shot, the image-to-video input is the reference image and the text is one of these prompts. These four motions are the most stable across every current model. Keep one motion only.

Walk side-on across frame (3s):

character walks from left edge to right edge of frame, natural side-profile gait, one full stride per second (3 strides total), fixed camera, no zoom, no pan, character maintains identical face and clothing throughout, soft window light, plain background, 24fps, 3 seconds

Turn head toward camera (2s):

character starts facing 3/4 right, slowly turns head toward camera, eyes meet lens at 1.5s, subtle smile, eyebrow micro-lift, no body movement, fixed camera, 2 seconds

Sit down on chair (4s):

character is standing, looks down at chair, lowers body smoothly into seated posture, hands settle on knees, single fluid motion, no glitching limbs, fixed camera at chest height, side angle, 4 seconds

Reach for object (3s):

character extends right arm forward and slightly down to pick up a small object from desk, fingers close around object, brings hand back to neutral, no other body movement, fixed close-up on torso and arm, 3 seconds
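If you write many of these prompts, a tiny helper can keep the structure consistent: one motion, optional details, a camera lock, a duration. This is a minimal sketch under stated assumptions. `build_clip_prompt` is a hypothetical helper, and the 5-second ceiling simply encodes this guide's advice rather than any tool's limit.

```python
# Camera-lock phrase from Step 4 of this guide.
CAMERA_LOCK = "fixed camera, no pan, no zoom, no dolly"


def build_clip_prompt(motion: str, seconds: int, extras: tuple[str, ...] = ()) -> str:
    """Join one motion description, optional details, a camera lock, and the duration."""
    if not 1 <= seconds <= 5:
        # Assumption: mirrors this guide's "5 seconds or less" advice.
        raise ValueError("keep character clips between 1 and 5 seconds")
    parts = [motion.strip(), *extras, CAMERA_LOCK, f"{seconds} seconds"]
    return ", ".join(parts)


prompt = build_clip_prompt(
    "character starts facing 3/4 right, slowly turns head toward camera",
    2,
    extras=("no body movement",),
)
print(prompt)
```

The helper takes a single `motion` string on purpose: if you find yourself wanting to pass two, that is a sign the shot should be split into two clips.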

Step 3 - Use image-to-video, not text-to-video

Pure text-to-video has nothing to anchor the face to, so identity wanders. Always feed the reference image. Entry points per tool:

Runway: sidebar Generate → choose Gen-4 (or Gen-4 Turbo) → upload your reference under the image input
Kling: home → Image to Video tab → upload reference, select Kling 2.5 or 2.6
Luma: generate page → set the reference as the start frame in Dream Machine
Veo 3.1 (Gemini / Flow): add your reference image(s) as ingredients, then prompt the motion

Step 4 - Lock the camera

Force one of these into every prompt:

fixed camera, no pan, no zoom, no dolly
locked-off tripod shot, no camera movement
static wide shot, camera stationary

Skip this and the model adds a default Ken Burns push. That camera drift stacks on top of your character motion, and you get double drift: the background slides while the figure also slips. One source of motion at a time.
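A quick text check can catch prompts that forgot the lock or that sneak in a camera move. The sketch below is a rough heuristic, not a parser: `camera_issues` is a hypothetical helper, the phrase list is drawn from the three lock lines above, and the word list (pan, zoom, dolly) is an assumption you would extend for your own prompts. It treats a move word as negated only when it directly follows "no ".

```python
import re

# Substrings drawn from the three lock phrases listed above.
LOCK_PHRASES = (
    "fixed camera",
    "locked-off tripod shot",
    "no camera movement",
    "camera stationary",
)

# A camera-move word that is NOT directly preceded by "no ".
UNNEGATED_MOVE = re.compile(r"(?<!no )\b(pan|zoom|dolly)\b")


def camera_issues(prompt: str) -> list[str]:
    """Return a list of camera problems found in the prompt (empty means OK)."""
    text = prompt.lower()
    issues = []
    if not any(phrase in text for phrase in LOCK_PHRASES):
        issues.append("missing camera lock phrase")
    for match in UNNEGATED_MOVE.finditer(text):
        issues.append(f"unnegated camera move: {match.group(1)}")
    return issues


print(camera_issues("character walks, fixed camera, no pan, no zoom, no dolly, 3 seconds"))
print(camera_issues("character waves, slow zoom on face"))
```

A plain word match will misfire on prompts where "pan" is an object rather than a camera move, so treat the output as a prompt to reread, not a verdict.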

Step 5 - Keep clips at 3-5 seconds

Identity holds well for the first few seconds of a clip and degrades after that. Default durations per tool:

Runway Gen-4 / Gen-4.5: 5s base, extendable (don’t extend a character shot; identity drifts past ~10s)
Kling 2.5 / 2.6: 5s or 10s (pick 5s for character work)
Luma Ray 3: 5s
Veo 3.1: ~8s clips

Need a longer beat? Generate several 5s clips and stitch them, hiding the cuts at head turns or occlusion moments where a small jump is invisible.
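Planning the split is simple arithmetic. The sketch below divides a longer beat into equal clips that never exceed 5 seconds. `plan_segments` is a hypothetical helper, and the equal-length split is a design choice of this example; using fixed 5-second clips with a shorter final clip would work just as well.

```python
import math


def plan_segments(total_seconds: float, max_clip: float = 5.0) -> list[float]:
    """Split a beat into equal-length clips, each no longer than max_clip seconds."""
    if total_seconds <= 0 or max_clip <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_clip)
    return [total_seconds / count] * count


print(plan_segments(12))  # three clips of 4 seconds each
print(plan_segments(30))  # six clips of 5 seconds each
```

Each segment still needs its own single motion and a cut point at a head turn or occlusion, so the segment count is also the number of seams you have to hide.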

Step 6 - Batch generate, expect a low hit rate

Run each motion 6-10 times with the same reference image and the same prompt, re-rolling each time. Plan for a keeper rate of roughly 1 in 5 to 1 in 8. Per-clip cost as of June 2026:

Runway Gen-4 Turbo: 5 credits/sec → ~25 credits for a 5s clip (≈ $0.25 at the $0.01/credit top-up rate)
Runway Gen-4.5: 12 credits/sec → ~60 credits for a 5s clip (≈ $0.60)
Kling 2.5 Pro: ~210 credits for a 1080p 5s clip

So a single hero shot at 8 Gen-4.5 attempts runs roughly 480 credits (≈ $4.80 at the same top-up rate), and you should expect 1-2 usable takes. Iterate on Turbo to find the framing and prompt that work, then spend the expensive credits only on the final pass.
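The budget arithmetic above can be sketched as a small calculator. The credit rates, the $0.01 top-up price, and the 1-in-5 to 1-in-8 keeper range are the figures quoted in this guide; `shot_budget` and the model keys are hypothetical names for illustration.

```python
# Rates quoted in this guide (June 2026); the dictionary keys are made-up labels.
CREDITS_PER_SECOND = {"gen4_turbo": 5, "gen4_5": 12}
USD_PER_CREDIT = 0.01  # top-up rate quoted in this guide


def shot_budget(model: str, clip_seconds: int = 5, attempts: int = 8) -> dict:
    """Estimate credits, dollars, and expected keepers for one shot."""
    credits = CREDITS_PER_SECOND[model] * clip_seconds * attempts
    return {
        "credits": credits,
        "usd": round(credits * USD_PER_CREDIT, 2),
        # Keeper rate of 1 in 8 (pessimistic) to 1 in 5 (optimistic).
        "expected_keepers": (attempts / 8, attempts / 5),
    }


print(shot_budget("gen4_5"))      # 8 attempts x 5 s x 12 credits/s = 480 credits
print(shot_budget("gen4_turbo"))  # 8 attempts x 5 s x 5 credits/s = 200 credits
```

Comparing the two outputs makes the iterate-on-Turbo advice concrete: the same eight attempts cost 200 credits on Turbo against 480 on Gen-4.5.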

Step 7 - Accept/reject every clip

Run each new clip through these five checks the moment it finishes. If any fails, discard it; do not try to salvage a broken take.

[ ] Face is the reference character's face, not a distant cousin
[ ] Clothing / hair color / distinguishing features unchanged across the clip
[ ] Correct number of fingers, no limbs clipping through the body
[ ] Natural gait (not glide or spasm)
[ ] Single motion type (no "walking then suddenly running")
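If you log reviews for a large batch, the five checks map naturally onto a small record where any single failure rejects the clip. This is a sketch only: `ClipReview` and its field names are hypothetical, and the judgments themselves still come from a person watching the clip.

```python
from dataclasses import dataclass, fields


@dataclass
class ClipReview:
    """One reviewer's verdict on the five checks from the list above."""
    face_matches_reference: bool
    wardrobe_unchanged: bool
    anatomy_correct: bool   # finger count, no limbs clipping through the body
    gait_natural: bool      # no glide or spasm
    single_motion: bool     # no "walking then suddenly running"

    def failed_checks(self) -> list[str]:
        """Names of the checks that failed, in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    @property
    def accepted(self) -> bool:
        """A clip is kept only when every check passes."""
        return not self.failed_checks()


review = ClipReview(True, True, False, True, True)
print(review.accepted, review.failed_checks())
```

Recording which check failed, not just pass or fail, shows over time whether your rejects cluster on one problem such as hands or gait.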

Step 8 - Stitch into the final cut

Take the best survivor per motion into DaVinci Resolve, CapCut, or Premiere:

Place cuts at the final frame before a head turn or an occlusion so the seam is hidden
Add a 2-3 frame cross-dissolve between clips to absorb tiny color jumps
Apply one unifying LUT across the whole edit so independently generated clips read as a single shoot
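Short dissolves eat a little running time, which matters when you are cutting to a fixed length. The sketch below estimates the final timeline length, assuming each cross-dissolve is made by overlapping adjacent clips by the dissolve length and that clips run at 24fps as in the walk prompt. `timeline_frames` is a hypothetical helper; editors that build dissolves from spare handle frames instead will not shorten the timeline this way.

```python
def timeline_frames(clip_seconds: list[float], fps: int = 24, dissolve_frames: int = 2) -> int:
    """Total frames when each adjacent pair of clips overlaps by dissolve_frames."""
    frames = [round(seconds * fps) for seconds in clip_seconds]
    overlaps = dissolve_frames * max(len(frames) - 1, 0)
    return sum(frames) - overlaps


# The four motions from Step 2: walk 3s, head turn 2s, sit 4s, reach 3s.
total = timeline_frames([3, 2, 4, 3])
print(total, total / 24)  # frames, then seconds
```

For the four Step 2 clips, three 2-frame dissolves remove 6 frames from 288, a quarter of a second across the edit.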

For more on per-clip drift, see the guides on fixing AI video motion drift and on the broader image-to-video workflow.

Common mistakes

Complex motion in long clips. The longer and busier the shot, the more drift wins. Shorten and simplify.
Camera movement plus character movement. Pick exactly one source of motion.
Asking for facial expression in a wide shot. Models render faces poorly at distance; save expression for medium and close shots.
Skipping the reference image. Text-only character motion is dramatically harder and almost never holds identity.
Extending a clip past 10 seconds. The “extend” button is convenient and it will drift your character. Stitch instead.

Advanced tips

For every clip of the same character, feed the same reference image. That single file is your identity anchor across the whole project.
Side-angle walking drifts less than head-on or back-shot walking, because the model has fewer ambiguous limb crossings to resolve.
For multi-angle scenes, Veo 3.1 accepts up to three reference images of the same character, which helps it reconstruct the face from angles a single front-facing reference can’t cover.
For dialogue, generate clean visuals first, then add mouth movement with a lip-sync model rather than asking the video model to lip-sync from scratch.

FAQ

Can I generate a 30-second character monologue in one go? Not reliably. Identity and lip movement both degrade well before 30 seconds. Generate 5-second chunks and edit them together with cuts that hide the seams at head turns or pauses.

Which tool keeps the character most consistent? For a single reference image, Runway Gen-4 holds identity best in independent tests (around 95% consistency from one reference). If you have multiple reference angles, Veo 3.1’s up-to-three-image input gives the model more to work from.

What about lip-sync? Raw video generation rarely matches existing audio accurately. Generate the visuals here, then run them through a dedicated lip-sync model: Hedra’s Character-3 is strong for talking-photo realism, and Sync.so is the most accurate for syncing dialogue onto existing footage. Veo 3.1 can also generate synced audio natively if you are scripting the dialogue at generation time.

Why does my character keep changing clothes mid-clip? Almost always a too-long clip, a missing or low-resolution reference image, or a prompt that doesn’t pin the wardrobe. Keep clips at or under 5 seconds, use a 1024px+ reference, and add "maintains identical clothing throughout" to the prompt.

Is Sora still an option? The Sora consumer app shut down on April 26, 2026. The API runs until September 24, 2026, but for hands-on creators the practical choices are now Runway, Kling, Luma, and Veo 3.1.

