AI Video Motion Doesn't Make Sense: 6 Causes + Fix Path

Legs pump forward but the body slides backward, water pours up, hands phase through objects. The usual causes are an ambiguous verb and a model that doesn't simulate physics. The fix: specify direction, target, and start/end pose.

Published: May 17, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

You prompted a man running across a field and got back a man whose legs pump forward while his body slides backward. Or pouring water into a glass and the water flows up. Or picking up a phone and the hand passes through it.

These are physics and action-coherence failures. The model recognizes the action label and the appearance of the motion, but it does not simulate mass, gravity, or momentum, so each region of the frame can move semi-independently. That is why the legs and torso disagree.

Fastest fix: rewrite the verb with explicit direction and a destination (running forward toward the camera), keep it to one action per clip, and for anything involving fluids or precise hand-object contact, generate two static beats and cut between them instead of rendering the transition. If your tool supports it, paint the motion directly (Runway Multi Motion Brush) or set start/end keyframes (Kling) rather than relying on a text verb. If a physics shot keeps failing, re-run it on the strongest current physics model you can access (Kling 3.0 or Veo 3.1 as of June 2026) before you keep fighting the prompt.

Pick your bucket first

Match your symptom to the most likely cause before you start editing the prompt.

Symptom	Most likely cause	Go to
Legs/arms move one way, body another	Verb has no shared motion vector	Step 1
Action morphs or “teleports” between poses	Start/end pose unspecified	Step 2
Water flows up, cloth clips, fingers melt on contact	Physics the model can’t simulate	Step 4
Beats get skipped or compressed	Too many actions in one clip	Step 3
Action runs off the edge and breaks	Subject framed at the edge	Step 5
Motion looks confused even with a clean prompt	Conflicting motion cues	Cause 6

Common causes

Ordered by how often each cause turns up, most frequent first.

1. Action verb has multiple valid interpretations

"jumping" — vertical jump? long jump? jumping jacks? trampoline?
"throwing" — throwing forward? throwing up? throwing away?
"reaching" — reaching forward? up? sideways?

Without a direction and a target, the model picks one interpretation and may render it mechanically wrong.

How to spot it: the verb in your prompt has no direction modifier.

2. Start and end pose not specified

The model has to guess where the action begins and ends. If you don’t say “starts standing, ends seated,” it interpolates between two poses it invented, which is where the “teleport” look comes from.

How to spot it: the prompt has the verb but no description of the starting pose or ending state.

3. Action involves physics the model can’t simulate

Even the strongest 2026 models still struggle with:

Liquids flowing realistically (water, pouring, splashing)
Cloth wrapping and draping naturally
Precise hand-object contact (grabbing, holding, releasing)
Multi-step manual actions (cooking, typing, driving)

As of June 2026 the two best physics models you can actually access are Kling 3.0 (the current benchmark for fluid, fabric, and particle behavior; shipped Feb 4, 2026, native 4K/60fps) and Google Veo 3.1 (realistic physics plus strong temporal coherence and native synced audio). If a fluid or hand-contact shot is failing on a weaker model, re-running it on Kling 3.0 or Veo 3.1 is often cheaper than fighting the prompt.

One thing that changed: OpenAI Sora 2 used to be the physics leader, but it is no longer a practical option. OpenAI discontinued the Sora web and app experiences on April 26, 2026, and the Sora API is deprecated and shuts down on September 24, 2026 (OpenAI’s discontinuation notice). Unless you are calling the deprecated API directly before that date, treat Kling 3.0 and Veo 3.1 as your best-physics fallbacks. Neither model has solved physics completely, so the workarounds in Step 4 still apply.

How to spot it: the action involves fine fluid, cloth, or hand interaction.

4. Multiple actions in one prompt

"a man walks to a chair, sits down, opens a book, starts reading"

That is four actions in a 4-to-8-second clip. The model either blends the beats together or silently skips some of them.

How to spot it: the prompt has 3 or more sequential verbs.

5. Subject occupies the edge of the frame

Actions that start or end at the frame edge are interpreted ambiguously because the model can’t see where the limb or object is going.

How to spot it: the subject is near a frame edge in the start frame.

6. Conflicting motion cues

"jogging slowly while sprinting forward at high speed"

Conflicts can also be implicit (a graceful run vs. a frantic dash). The model averages the cues and produces something that reads as neither.

How to spot it: your motion descriptors disagree on speed or intensity.
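
A rough way to catch this before you generate is to check the prompt against two small word lists. The sketch below is a minimal illustration in Python; the word lists and function names are assumptions made for this example, not part of any video tool, and you would extend the lists with the descriptors you actually use.

```python
import re

# Illustrative vocabularies (assumptions, not a standard list).
SLOW_CUES = {"slowly", "slow", "gentle", "gently", "graceful", "leisurely", "jogging"}
FAST_CUES = {"sprinting", "fast", "frantic", "rapidly", "quickly", "dash"}


def conflicting_cues(prompt: str) -> tuple[set[str], set[str]]:
    """Return the slow cues and the fast cues found in the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return words & SLOW_CUES, words & FAST_CUES


def has_conflict(prompt: str) -> bool:
    """True when the prompt mixes at least one slow cue with one fast cue."""
    slow, fast = conflicting_cues(prompt)
    return bool(slow) and bool(fast)


if __name__ == "__main__":
    prompt = "jogging slowly while sprinting forward at high speed"
    slow, fast = conflicting_cues(prompt)
    print("slow cues:", sorted(slow))
    print("fast cues:", sorted(fast))
    print("conflict:", has_conflict(prompt))
```

A keyword match only catches cues that are on the lists, so treat a clean result as "nothing obvious", not as proof that the prompt is consistent.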

Shortest path to fix

Step 1: Use unambiguous, direction-specified verbs

# Bad — ambiguous            # Good — direction + target
"jumping"          →  "jumping vertically straight up"
"running"          →  "running forward toward the camera"
"throwing a ball"  →  "throwing a ball to the right, ball moves off-screen right"
"reaching"         →  "reaching the right arm forward toward the table"

Always include: direction + target/destination + speed. Giving every body part a shared motion vector is what stops the legs and torso from disagreeing.
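
If you write many prompts, the "no direction modifier" check from Cause 1 can be automated crudely. The following Python sketch flags prompts that contain no direction word at all; the word list and the function name are assumptions for illustration, and the check cannot tell whether the direction actually belongs to the main verb.

```python
import re

# Illustrative word list, not exhaustive; extend it for your own prompts.
DIRECTION_WORDS = {
    "forward", "backward", "up", "down", "left", "right",
    "toward", "towards", "away", "vertically", "horizontally", "sideways",
}


def missing_direction(prompt: str) -> bool:
    """Return True when the prompt contains no direction word at all."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return not (words & DIRECTION_WORDS)


if __name__ == "__main__":
    for p in ["jumping", "jumping vertically straight up"]:
        status = "needs a direction" if missing_direction(p) else "ok"
        print(f"{p!r}: {status}")
```

Note the limitation: a prompt like "a man sits down" passes because "down" is on the list, even though nothing says where he sits. Use the check as a reminder, then read the prompt yourself.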

Step 2: Specify start pose and end state

# Template
"starts [pose], performs [action], ends [pose]"

# Example
"starts standing with arms at sides, performs a single forward step,
ends with right foot in front, left foot back, arms still at sides"

# For object interaction
"starts with empty hands, picks up the red cup with right hand,
ends holding the cup at chest height"

If your tool supports start/end keyframes (Kling’s first-frame/last-frame mode, Runway’s keyframe inputs), upload two still images instead of describing the poses in text. The model then only has to interpolate the path between two poses you control, which removes most pose-guessing. First-frame/last-frame control is available across Kling 1.6, 2.x, and 3.0; Kling 3.0 (shipped Feb 4, 2026) generates noticeably more natural motion paths between the two frames, which is exactly what helps here.
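
When you stay in text instead of keyframes, the "starts / performs / ends" template at the top of this step can be filled mechanically so no part is forgotten. This is a minimal Python sketch; the function name `pose_prompt` is hypothetical and the output is just a string you paste into your tool.

```python
def pose_prompt(start: str, action: str, end: str) -> str:
    """Fill the 'starts [pose], performs [action], ends [pose]' template."""
    for name, value in (("start", start), ("action", action), ("end", end)):
        if not value.strip():
            # An empty part is exactly the pose-guessing case from Cause 2.
            raise ValueError(f"{name} must not be empty")
    return f"starts {start.strip()}, performs {action.strip()}, ends {end.strip()}"


if __name__ == "__main__":
    print(pose_prompt(
        "standing with arms at sides",
        "a single forward step",
        "with right foot in front, left foot back, arms still at sides",
    ))
```

Raising on an empty part is a deliberate choice here: a prompt with a missing start or end pose is the failure this step is trying to prevent.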

Step 3: One action per clip

# Bad — multi-action
"a man walks to a chair, sits down, opens a book, starts reading"

# Good — split into clips
Clip 1: "a man walks toward a chair, camera follows"
Clip 2: "a man sits down on the chair, settling into the seat"
Clip 3: "a man opens a book on his lap, looks down to read"

# Stitch the clips together in your editor

One continuous motion inside a clip (walk forward, then sit) is fine. Three discrete actions almost always lose a beat.
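
For a first draft of the split, a comma-separated action chain can be broken into one prompt per beat. The Python sketch below is deliberately naive: it assumes the prompt starts with the subject and that commas separate the beats, which holds for the example above but not for prompts in general. The function name `split_actions` is hypothetical.

```python
def split_actions(prompt: str, subject: str) -> list[str]:
    """Split a comma-separated action chain into one prompt per beat.

    Assumes `prompt` starts with `subject` and that each comma
    marks the start of a new action.
    """
    if not prompt.startswith(subject):
        raise ValueError("prompt must start with the subject")
    rest = prompt[len(subject):]
    beats = [beat.strip() for beat in rest.split(",") if beat.strip()]
    return [f"{subject} {beat}" for beat in beats]


if __name__ == "__main__":
    chain = "a man walks to a chair, sits down, opens a book, starts reading"
    for i, clip in enumerate(split_actions(chain, "a man"), start=1):
        print(f"Clip {i}: {clip}")
```

The output is a starting point only. You would still merge beats that form one continuous motion and add direction and framing to each clip by hand, as in the three-clip version above.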

Step 4: Work around physics the model can’t do

If the model can’t simulate it, don’t render the transition — cut around it:

# Water pouring — hard
- Generate a "before" shot (kettle tilted) and an "after" shot (cup full)
- Cut between them
- Add a subtle water sound for continuity

# Hand picking up an object — hard
- Generate "hand near object" and "hand already holding object"
- Cut between them quickly
- Don't show the actual grab transition

# Multi-step cooking
- One shot per step, edited together
- Don't try to render the full sequence in one clip
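
If you would rather script the cut than open an editor, one option is ffmpeg's concat demuxer. The Python sketch below only builds the list-file text and the command arguments; it does not run anything. The file names are placeholders, and stream copy (`-c copy`) assumes both shots share codec, resolution, and frame rate. If they differ, re-encode instead.

```python
def concat_list(clips: list[str]) -> str:
    """Build the text of an ffmpeg concat-demuxer list file.

    Paths containing single quotes are rejected to keep the sketch simple.
    """
    for path in clips:
        if "'" in path:
            raise ValueError(f"path needs escaping: {path}")
    return "".join(f"file '{path}'\n" for path in clips)


def concat_command(list_path: str, output: str) -> list[str]:
    """Arguments for joining the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]


if __name__ == "__main__":
    # Placeholder file names for the "before" and "after" beats.
    print(concat_list(["kettle_tilted.mp4", "cup_full.mp4"]), end="")
    print(" ".join(concat_command("shots.txt", "pour.mp4")))
```

To use it, write the list text to the file you pass as `list_path`, then run the command with `subprocess.run` or paste it into a shell.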

For morphing or melting objects mid-motion, add rigidity language to the positive prompt, for example “the metal door stays solid and rigid throughout the swing, hinges stable.” For melting fingers on contact, a negative prompt helps, but only on tools that actually have a negative field:

Kling 3.0 has a negative_prompt field (the API default is “blur, distort, low quality”; the field was restored on May 23, 2026 after a brief removal). Add stability words: “morphing, extra fingers, mutated hands, deformed limbs, sliding feet.”
Veo 3.1 supports negative_prompt in its generation config, but Google’s guidance is to use plain descriptive words, not instructive phrasing: write “mutated hands, extra fingers, morphing,” not “no mutated hands” (Veo 3.1 prompting guide).
Runway (Gen-4.5) does not expose a negative-prompt field for video. Runway’s official guidance is to phrase everything positively, so describe the clean result you want (“solid intact hands, five fingers, stable anatomy”) in the normal prompt instead.
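
The per-tool rules above can be captured in one helper so the same shot description produces the right fields for each tool. This is a sketch in Python under stated assumptions: the dict shape and the function name `build_prompts` are hypothetical, and only the `negative_prompt` field name comes from the text above. Check each provider's API reference for the real request format.

```python
STABILITY_TERMS = [
    "morphing", "extra fingers", "mutated hands",
    "deformed limbs", "sliding feet",
]


def build_prompts(tool: str, prompt: str, clean_result: str) -> dict[str, str]:
    """Return illustrative prompt fields for one tool (not a real API payload)."""
    name = tool.lower()
    if name in ("kling", "veo"):
        # Plain descriptive terms with no "no ..." phrasing, per the Veo guidance.
        return {"prompt": prompt, "negative_prompt": ", ".join(STABILITY_TERMS)}
    if name == "runway":
        # No negative field: describe the clean result positively instead.
        return {"prompt": f"{prompt}, {clean_result}"}
    raise ValueError(f"unknown tool: {tool}")


if __name__ == "__main__":
    shot = "a hand picks up the red cup"
    clean = "solid intact hands, five fingers, stable anatomy"
    for tool in ("kling", "veo", "runway"):
        print(tool, build_prompts(tool, shot, clean))
```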

Step 5: Frame the subject in the middle, not the edge

# Bad start frame
- Subject at the frame edge with the action going off-screen

# Good start frame
- Subject centered or on a rule-of-thirds line
- The action has room to happen inside the frame
- Use wider framing if the action needs space
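
If you know roughly where the subject sits in the start frame, the "room to happen" rule can be checked numerically. The Python sketch below tests whether a subject's bounding box stays clear of the frame edges; the 15% margin and the function name `has_room` are assumptions for illustration, and the box can come from any source, including eyeballing pixel coordinates in an image editor.

```python
def has_room(bbox: tuple[float, float, float, float],
             frame_w: float, frame_h: float,
             margin: float = 0.15) -> bool:
    """True when the box keeps at least `margin` of the frame free on every side.

    bbox is (left, top, right, bottom) in pixels, origin at the top-left.
    """
    left, top, right, bottom = bbox
    mx, my = frame_w * margin, frame_h * margin
    return (left >= mx and top >= my
            and right <= frame_w - mx and bottom <= frame_h - my)


if __name__ == "__main__":
    centered = (700, 300, 1200, 900)
    at_edge = (0, 300, 400, 900)
    print("centered subject:", has_room(centered, 1920, 1080))
    print("edge subject:", has_room(at_edge, 1920, 1080))
```

For an action that travels in one direction, you may want a larger margin on that side only; the uniform margin here is the simplest version.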

Step 6: Paint or keyframe the motion directly

Text verbs are the weakest way to specify motion. If your tool offers direct motion control, use it:

# Runway Multi Motion Brush
- Paint motion vectors directly onto the image
- Assign independent motion to up to 5 separate regions
- Per region, set motion across 3 axes — horizontal (x),
  vertical (y), and proximity/zoom (z) — plus an ambient-noise slider
- Keep your TEXT prompt consistent with the painted directions,
  or you get tearing artifacts

# Kling start/end keyframes
- Upload a start frame and an end frame
- The model generates the path between them

One caveat from Runway’s own guidance: your text prompt must agree with your brush directions. If you brush a river flowing down but the prompt says “still water,” you get artifacts.

For tools without a motion brush or keyframes, simplify the prompt and accept fewer simultaneous actions. Runway’s current Gen-4.5 prompting guidance says the same thing: start with a simple prompt that captures only the essential motion, then add detail only if you need it. Runway is explicit that Gen-4.5 “thrives on prompt simplicity”: a focused prompt with clear motion direction beats an overloaded paragraph. Let a good input image carry the visuals and use the text to describe what moves.

How to confirm it’s fixed

Track one limb. Watch a single foot or hand frame by frame (scrub slowly). It should move in one consistent direction with the body, not reverse or stutter.
Check the contact moment. For grabs/pours, the failure is almost always at the contact frame. If you cut around it (Step 4), confirm there’s no phasing in the kept frames.
Count the beats. If you split a multi-action shot, each clip should contain exactly one motion, start to finish, no skipped beat.
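
The "track one limb" check can be made a little more objective if you note the limb's horizontal position on each frame while scrubbing. The Python sketch below counts direction reversals in such a list; the positions, the jitter threshold, and the function name `direction_reversals` are all hypothetical, and the numbers can come from manual notes or any tracker you already use.

```python
def direction_reversals(xs: list[float], jitter: float = 2.0) -> int:
    """Count sign flips in frame-to-frame movement.

    Steps no larger than `jitter` pixels are ignored as noise.
    """
    reversals = 0
    last_sign = 0  # 0 means no significant movement seen yet
    for prev, cur in zip(xs, xs[1:]):
        step = cur - prev
        if abs(step) <= jitter:
            continue
        sign = 1 if step > 0 else -1
        if last_sign and sign != last_sign:
            reversals += 1
        last_sign = sign
    return reversals


if __name__ == "__main__":
    steady = [100, 110, 121, 130, 142]    # moves right the whole time
    stutter = [100, 110, 95, 108, 90]     # keeps changing direction
    print("steady:", direction_reversals(steady))
    print("stutter:", direction_reversals(stutter))
```

Zero reversals matches the "one consistent direction" criterion above. Any count above zero is a cue to look at those frames, not an automatic fail, since some actions legitimately change direction.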

Prevention

Test new verbs on a simple subject first to see what the model actually does with them.
Default to one action per clip; multi-action shots compose poorly.
Keep a running “verbs the model handles well” list and a “verbs to avoid” list.
For complex physics, plan cuts up front instead of betting on a full motion render.

FAQ

Q: Why do the legs move forward while the body moves backward? A: The model recognizes the verb “running” and the visual pose of running, but it doesn’t enforce the rigid-body constraint that legs and torso share one motion. Each region is generated semi-independently. Specifying direction (running forward toward the camera) gives every body part the same motion vector and resolves it.

Q: Will newer models fix physics coherence automatically? A: Each generation reduces gross errors. As of June 2026, Kling 3.0 and Veo 3.1 handle water, gravity, fabric, and collisions far better than 2025 models, and hand generation has improved a lot. But precise multi-step physics — pouring, threading, controlled grabs — still lags. Don’t design a workflow around the next model fixing it; design around the limit with cuts and split clips.

Q: Does adding “realistic physics” or “correct anatomy” to the prompt help? A: Marginally. Those tokens nudge style but add no actual simulation. They can even hurt: a long descriptive prompt competes with the action verb for the model’s attention. If your tool has a negative-prompt field (Kling 3.0 and Veo 3.1 do; Runway Gen-4.5 does not), put stability words like morphing, mutated hands, sliding feet in the negative slot — that does more than the same words in the positive prompt. On Veo, use plain descriptive words, not “no”/“don’t”. On Runway, phrase positively and describe the clean result instead.

Q: When should I split a multi-action shot vs. push for one coherent clip? A: If the actions span more than about 2 seconds each, or the subject changes location, split. One continuous motion inside a 4-to-8-second clip (walk forward, sit, look up) reads as natural; four discrete actions (walk, sit, open book, read) almost always lose a beat or compress awkwardly.

Q: My fluid/hand shot fails on every tool. What now? A: Re-run it on the strongest physics model you can access — Kling 3.0 or Veo 3.1 as of June 2026 (Sora 2’s app shut down April 26, 2026, so it’s no longer the easy answer it used to be). If it still fails, stop fighting it: generate two static beats (before/after) and cut between them. A clean cut reads as intentional; a broken transition reads as an AI artifact.

Tags: #Video generation #Debug #Troubleshooting