You prompted “a man running across a field” and got back a man whose legs pump forward while his body moves backward. Or you asked for water pouring into a glass and the water flowed up. Or for a hand picking up a phone, and the hand passed straight through it.
These are physics / action coherence failures — the model knows the action label but doesn’t know how the action actually decomposes mechanically. This is one of the hardest video issues to fix because the model genuinely lacks understanding; you have to constrain the action heavily.
Common causes
Ordered by hit rate, highest first.
1. Action verb has multiple valid interpretations
"jumping" — vertical jump? long jump? jumping jacks? trampoline?
"throwing" — throwing forward? throwing up? throwing away?
"reaching" — reaching forward? up? sideways?
Without a direction and a target, the model picks one interpretation on its own and may render it with the wrong mechanics.
How to spot it: verb in prompt has no direction modifier.
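The check above can be roughed out in code. This is a minimal sketch, not a real linter: the word lists `ACTION_VERBS` and `DIRECTION_WORDS` and the function `undirected_verbs` are hypothetical names made up for illustration, and the six-word look-ahead window is an arbitrary assumption.

```python
import re

# Hypothetical word lists for illustration; extend them for your own prompts.
ACTION_VERBS = {"jumping", "throwing", "reaching", "running", "walking"}
DIRECTION_WORDS = {
    "forward", "backward", "up", "down", "left", "right",
    "toward", "towards", "away", "vertically", "sideways",
}

def undirected_verbs(prompt: str, window: int = 6) -> list[str]:
    """Return action verbs that have no direction word in the next `window` words."""
    words = re.findall(r"[a-z]+", prompt.lower())
    flagged = []
    for i, word in enumerate(words):
        if word in ACTION_VERBS:
            following = words[i + 1 : i + 1 + window]
            if not any(w in DIRECTION_WORDS for w in following):
                flagged.append(word)
    return flagged
```

Calling `undirected_verbs("a man jumping in a field")` flags `jumping`, while `"a man jumping vertically straight up"` passes. A keyword match like this only catches missing direction words; it cannot tell whether the direction you gave is the one you meant.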
2. Start / end pose not specified
The model has to guess where the action begins and ends. If you don’t say “starts standing, ends seated,” it may interpolate between the wrong poses.
How to spot it: prompt has the verb but no description of starting pose or ending state.
3. Action involves complex physics the model can’t do
Models still struggle with:
- Water flowing realistically
- Cloth wrapping naturally
- Hand-object precise interactions (grabbing, holding, releasing)
- Multi-step actions (cook, type, drive)
How to spot it: action involves fine fluid / cloth / hand interaction.
4. Multiple actions in one prompt
"a man walks to a chair, sits down, opens a book, starts reading"
That is too many actions for a 4-second clip. The model gets confused or skips beats.
How to spot it: prompt has 3+ sequential verbs.
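A rough way to count sequential actions before you generate is to split the prompt into clauses and count the ones that contain an action verb. The sketch below assumes a small hypothetical verb list (`ACTION_VERBS`) and treats commas, semicolons, and the word "then" as clause boundaries; both are simplifications.

```python
import re

# Hypothetical verb list for illustration; real prompts need a broader one.
ACTION_VERBS = {"walks", "sits", "opens", "starts", "runs", "jumps", "picks", "pours"}

def count_action_beats(prompt: str) -> int:
    """Count comma/semicolon/'then'-separated clauses that contain a known action verb."""
    clauses = re.split(r",|;|\bthen\b", prompt.lower())
    beats = 0
    for clause in clauses:
        words = set(re.findall(r"[a-z]+", clause))
        if words & ACTION_VERBS:
            beats += 1
    return beats
```

For the example prompt above, `count_action_beats` returns 4, which is over the 3+ threshold. Clauses without a listed verb (lighting, camera notes) are ignored, so the count depends entirely on how complete the verb list is.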
5. Subject occupies edge of frame
Actions starting / ending at frame edges are interpreted ambiguously. The model can’t see where things go.
How to spot it: subject is near frame edge in start frame.
6. Conflicting motion cues
"jogging slowly while sprinting forward at high speed"
Or implicit conflicts (graceful run vs frantic dash).
How to spot it: motion descriptors disagree on speed / intensity.
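Explicit conflicts like the one above can be caught with a simple keyword check. This sketch assumes two hypothetical descriptor sets, `SLOW_CUES` and `FAST_CUES`; which words belong in which set is a judgment call, and implicit conflicts will still need a human read.

```python
import re

# Hypothetical descriptor groups; the split into "slow" and "fast" is an assumption.
SLOW_CUES = {"slowly", "slow", "gentle", "graceful", "jogging", "strolling"}
FAST_CUES = {"sprinting", "fast", "frantic", "dash", "rapid", "rapidly"}

def conflicting_cues(prompt: str) -> tuple[set[str], set[str]]:
    """Return (slow words, fast words) when both kinds appear, else two empty sets."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    slow = words & SLOW_CUES
    fast = words & FAST_CUES
    if slow and fast:
        return slow, fast
    return set(), set()
```

Running it on `"jogging slowly while sprinting forward at high speed"` returns both groups non-empty, which is the signal to pick one pace and delete the other.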
Shortest path to fix
Step 1: Use unambiguous, direction-specified verbs
# Ambiguous → direction-specified
"jumping" → "jumping vertically straight up"
"running" → "running forward toward the camera"
"throwing a ball" → "throwing a ball to the right, ball moves off-screen right"
"reaching" → "reaching the right arm forward toward the table"
Always include: direction + target / destination, plus speed wherever pace matters.
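If you generate prompts from a script, one option is to make the direction a required field so an undirected verb cannot slip through. The `ActionSpec` class below is a hypothetical helper, not part of any video tool's API; it only assembles a text fragment.

```python
from dataclasses import dataclass

@dataclass
class ActionSpec:
    """Hypothetical container for one action; only verb and direction are required."""
    verb: str         # e.g. "running"
    direction: str    # e.g. "forward"
    target: str = ""  # e.g. "toward the camera"
    speed: str = ""   # e.g. "at a steady pace"

    def to_prompt(self) -> str:
        # Refuse to build a fragment that would leave the verb undirected.
        if not self.verb.strip() or not self.direction.strip():
            raise ValueError("verb and direction are required")
        parts = [self.verb, self.direction, self.target, self.speed]
        return " ".join(p.strip() for p in parts if p.strip())
```

For example, `ActionSpec("running", "forward", "toward the camera").to_prompt()` yields `running forward toward the camera`, and leaving the direction empty raises an error instead of producing an ambiguous prompt.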
Step 2: Specify start pose and end state
# Template
"starts [pose], performs [action], ends [pose]"
# Example
"starts standing with arms at sides, performs a single forward step,
ends with right foot in front, left foot back, arms still at sides"
# For object interaction
"starts with empty hands, picks up the red cup with right hand,
ends holding the cup at chest height"
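The start / action / end pattern is easy to enforce with a small helper. `pose_prompt` is a hypothetical function written for this illustration; it follows the object-interaction example, where the action is passed as a complete phrase.

```python
def pose_prompt(start: str, action: str, end: str) -> str:
    """Assemble 'starts ..., <action>, ends ...' and reject any empty part."""
    for name, value in (("start", start), ("action", action), ("end", end)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return f"starts {start.strip()}, {action.strip()}, ends {end.strip()}"
```

Called as `pose_prompt("with empty hands", "picks up the red cup with right hand", "holding the cup at chest height")`, it reproduces the object-interaction example. The point of the helper is the validation: a missing start or end pose fails loudly rather than leaving the model to guess.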
Step 3: One action per clip
# Bad — multi-action
"a man walks to a chair, sits down, opens a book, starts reading"
# Good — split
Clip 1: "a man walks toward a chair, camera follows"
Clip 2: "a man sits down on the chair, ends seated and still"
Clip 3: "a man opens a book on his lap, looks down to read"
# Stitch in editor
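If you would rather stitch on the command line than in an editor, ffmpeg's concat demuxer is one option. The sketch below only builds the list-file text and the argument list; the helper names and file names are hypothetical, and it assumes all clips share the same codec, resolution, and frame rate (otherwise `-c copy` will not work and you need to re-encode).

```python
def concat_list(clip_paths: list[str]) -> str:
    """Build the text of an ffmpeg concat-demuxer list file, one clip per line."""
    lines = []
    for path in clip_paths:
        # Inside single quotes, a literal quote is written as '\''
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"

def concat_command(list_file: str, output: str) -> list[str]:
    """Argument list for a stream-copy concat; pass it to subprocess.run if desired."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]
```

Write the result of `concat_list(["clip1.mp4", "clip2.mp4", "clip3.mp4"])` to a text file, then run the command from `concat_command` against it. Hard cuts are all this gives you; transitions still need an editor.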
Step 4: Avoid physics-complex actions
If the model can’t render it, work around it:
# Water pouring — hard
- Use a "before" shot (kettle tilted) + "after" shot (cup full)
- Cut between them
- Add subtle water sound for continuity
# Hand picking up object — hard
- Generate "hand near object" + "hand holding object"
- Cut between them quickly
- Avoid showing the actual grab transition
# Multi-step cooking
- One shot per step, edited together
- Don't try to render full sequence in one clip
Step 5: Frame subject in the middle, not the edge
# Bad start frame
- Subject at frame edge with action going off-screen
# Good start frame
- Subject centered or in rule-of-thirds position
- Action has space to happen within the frame
- Use wider framing if action needs space
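When start frames come from a pipeline, the "space to happen" rule can be checked numerically. This is a sketch under assumptions: `has_room` is a hypothetical helper, the bounding box is assumed to come from whatever detector or manual annotation you already have, and the 15% margin is an arbitrary default.

```python
def has_room(bbox: tuple[float, float, float, float],
             frame_w: float, frame_h: float, margin: float = 0.15) -> bool:
    """True if the subject box (x, y, w, h in pixels) stays clear of every frame edge
    by at least `margin` (a fraction of the frame size)."""
    x, y, w, h = bbox
    mx, my = frame_w * margin, frame_h * margin
    return (x >= mx and y >= my
            and x + w <= frame_w - mx
            and y + h <= frame_h - my)
```

For a 1280×720 frame, `has_room((400, 150, 400, 400), 1280, 720)` passes, while a subject starting at `x = 0` fails. A failing check suggests widening the framing or recentering before you generate.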
Step 6: Use direct motion controls (motion brush)
Some tools let you specify motion on the image itself instead of describing it in text:
# Runway Motion Brush
- Paint motion vectors directly on the image
- Specify exactly which parts move in which direction
# Kling Motion Brush
- Similar capability
# This removes most of the directional ambiguity
For tools without a motion brush, simplify the prompt and accept fewer simultaneous actions.
Prevention
- Test verbs on simple subjects first to see what the model actually does
- Default to one action per clip; multi-action shots compose poorly
- Keep a “verbs the model handles well” list and a “verbs to avoid” list
- For complex physics, use cuts instead of rendering the full motion
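The two verb lists can be kept as a small tally rather than from memory. `VerbLog` below is a hypothetical helper; the 70% pass rate and the three-attempt minimum are arbitrary assumptions to tune for your own model and tolerance.

```python
from collections import defaultdict

class VerbLog:
    """Tally of how often each verb produced a coherent clip."""

    def __init__(self) -> None:
        # verb -> [coherent results, total attempts]
        self._results = defaultdict(lambda: [0, 0])

    def record(self, verb: str, coherent: bool) -> None:
        entry = self._results[verb]
        entry[1] += 1
        if coherent:
            entry[0] += 1

    def split(self, min_rate: float = 0.7, min_attempts: int = 3):
        """Return (handles_well, avoid); verbs with too few attempts are left out."""
        good, avoid = [], []
        for verb, (passes, attempts) in sorted(self._results.items()):
            if attempts < min_attempts:
                continue
            (good if passes / attempts >= min_rate else avoid).append(verb)
        return good, avoid
```

Record one result per test generation with `log.record("walking", True)` and call `log.split()` when planning shots. Verbs with fewer than three attempts stay off both lists until there is enough evidence.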
FAQ
Q: Why does the model render legs moving forward while the body moves backward? A: The model recognizes the verb “running” and the visual pose of running, but it doesn’t enforce the physical constraint that the legs and torso belong to one body moving in one direction. In effect, each region is generated semi-independently. Specifying direction (“running forward toward the camera”) gives the model a shared motion vector for all body parts.
Q: Will newer models fix physics-coherence issues automatically? A: Each generation reduces gross errors (hands phasing through objects, water flowing up), but precise multi-step physics — pouring, grabbing, threading — still lags. Don’t bet a workflow on the next model fixing it; design around the limit with cuts and split clips.
Q: Does adding “realistic physics” or “correct anatomy” to the prompt help? A: Marginally. These tokens nudge style but don’t add physical simulation. Removing them often helps more than adding them, because long descriptive prompts compete with the action verb for the model’s attention. Short, direction-specified prompts win.
Q: When should I split a multi-action shot vs. push for a single coherent clip? A: If the actions span more than ~2 seconds each or change subject location, split. Within a single 4s clip, one continuous motion (walk forward, sit, look up) reads as natural; four discrete actions (walk → sit → open book → read) almost always lose a beat or compress awkwardly.