Image-to-Video Prompts (All Subjects): 10 Templates That Don't Break

Animate a still frame on Sora 2, Veo 3.1, Kling 2.6, Runway Gen-4.5, Hailuo, and Pika. Ten copy-ready image-to-video prompt templates plus a current model + pricing table.

Published: May 12, 2026 · Updated: Jun 04, 2026 · Author: AI Productivity Guide Team

Image-to-video is the highest-ROI mode in any AI video tool today. Instead of asking the model to invent everything from a text prompt, you hand it a perfect first frame and tell it only what should move. That single change kills most “AI weirdness”: six-fingered hands, faces that morph between frames, props that vanish. The ten templates below are written to survive on the June 2026 generation of models, and there is a current pricing and version table at the bottom so you pick the right tool before you burn credits.

TL;DR

Give the model a clean still frame, then describe one motion, the camera behavior, and what must not change. Keep it to 1-3 sentences.
On image-to-video, short prompts beat long ones. Long prose fights the starting frame and triggers drift.
For a single hero shot, Kling 2.6 and Runway Gen-4.5 are the strongest on real human motion and faces; Veo 3.1 wins on physics and built-in audio; Hailuo 2.3 is the cheapest reliable option and handles Chinese prompts well.
Single generations run 5-10 seconds on most tools. To go longer, chain clips with start/end-frame features rather than asking for one long render.

Why image-to-video beats text-to-video for almost everything

The model does not invent the subject, only the motion, so identity stays stable.
Frame-to-frame consistency is far higher because the look is locked by your input image.
You control the art direction in advance with Midjourney or Flux.
Reusing the same starting-frame style keeps a series visually unified.
When something breaks, you regenerate one source still instead of the whole render.

If you are new to AI video, start here and only touch text-to-video once you understand how motion prompts behave.

What an image-to-video prompt should specify

Three things, in this order:

What moves — name the subject and the single action.
How the camera behaves — static, slow dolly, gentle pan, or orbit.
Atmosphere / detail motion — wind, fog drift, light flicker, fabric movement.

Keep the prompt short. The sweet spot is roughly 25-60 words. Past 100 words, most models get confused and start over-interpreting, which fights your fixed first frame.
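As a rough illustration of the structure above, the sketch below assembles a prompt from the three parts and checks it against the 25-60 word sweet spot. The `MotionPrompt` class and the `length_verdict` helper are hypothetical names made up for this example; no video tool exposes them, and the thresholds are simply the numbers quoted in this section.

```python
from dataclasses import dataclass


@dataclass
class MotionPrompt:
    """Hypothetical container for the three parts of an image-to-video prompt."""
    subject_action: str   # what moves: the subject and its single action
    camera: str           # how the camera behaves
    atmosphere: str = ""  # optional detail motion: wind, fog, light flicker

    def render(self) -> str:
        # Keep the recommended order: motion, then camera, then atmosphere.
        parts = [self.subject_action, self.camera, self.atmosphere]
        return " ".join(p.strip() for p in parts if p.strip())


def word_count(prompt: str) -> int:
    # Whitespace-separated tokens are a good enough proxy for words here.
    return len(prompt.split())


def length_verdict(prompt: str) -> str:
    """Classify a prompt against the 25-60 word sweet spot and 100-word ceiling."""
    n = word_count(prompt)
    if n > 100:
        return "too long"
    if 25 <= n <= 60:
        return "sweet spot"
    return "acceptable"


if __name__ == "__main__":
    prompt = MotionPrompt(
        subject_action="The woman in the frame slowly tilts her head up "
                       "and gives a small soft smile.",
        camera="Camera stays static.",
        atmosphere="A gentle breeze moves a few strands of her hair.",
    ).render()
    print(prompt)
    print(word_count(prompt), length_verdict(prompt))
```

The verdicts are only a sanity check. A prompt slightly under 25 words can still work well when the action is very simple.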

10 copy-ready prompt templates

1. Portrait: subtle look up

The woman in the frame slowly tilts her head up and gives a small soft smile. A gentle breeze moves a few strands of her hair. Eyes blink once. Camera stays static.
Duration: 5 seconds, no other movement.

2. Cityscape: gentle traffic flow

Cars in the background move slowly through the intersection. The traffic light cycles from red to green once. Neon signs flicker subtly. Camera is locked.
Duration: 6 seconds, no camera movement.

3. Landscape: wind on grass

Wind moves the grass and trees gently. Clouds drift slowly from left to right. Distant water ripples softly.
Camera: very slow dolly forward.
Duration: 7 seconds.

4. Product shot: slow rotation reveal

The product rotates slowly clockwise about 90 degrees, revealing the label. Light reflections shift across the surface. Background and lighting remain identical.
Camera: static.
Duration: 5 seconds.

5. Drink: pour motion

Liquid pours smoothly from the bottle into the glass, filling halfway. Small bubbles rise. Nothing else in the frame moves.
Camera: static medium close-up.
Duration: 5 seconds.

6. Anime character: idle breathing animation

The character breathes naturally, shoulders rising and falling slightly. Hair sways gently as if a small breeze passes. Eyes blink once. Pose stays exactly the same.
Camera: still.
Duration: 5 seconds.

7. Coffee cup: steam rising

Steam slowly rises from the cup in soft wisps. The liquid surface ripples once. Nothing else moves.
Camera: static close-up.
Duration: 5 seconds.

8. Cinematic portrait: single head turn

The person slowly turns their head to look directly at the camera, holding eye contact for the last 2 seconds. Background blur stays consistent.
Camera: very subtle slow zoom in.
Duration: 6 seconds.

9. Game splash art: particle ambience

Magical sparkles drift slowly around the character. The cape moves gently as if in light wind. Eyes glow a little brighter. Character pose stays exactly the same.
Camera: static.
Duration: 6 seconds.

10. Landscape: drone-style slow rise

Camera slowly rises straight up, revealing more of the landscape below. The main subject stays in frame. Clouds drift slowly.
Duration: 7 seconds, no rotation.

Tuning tips

Always write “camera is static” unless you specifically want motion. Sora 2, Veo 3.1, and Runway all default to a small drift you usually don’t want.
Always set a duration. 5-7 seconds is the safe range; 5 seconds is the most stable on every tool tested.
State what must not change: background remains identical, hairstyle stays the same, lighting unchanged. Models honor explicit negatives better than implied ones.
Specify one action clearly. No compound actions in a single clip.
If your tool exposes a seed (Kling, Hailuo, and Runway do), fix it so you can iterate motion on the same image without re-rolling the whole look.
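The first three tips can be turned into a quick pre-flight check before spending credits. The sketch below is a simple keyword heuristic, not a feature of any tool; `lint_prompt` and its phrase list are assumptions made for illustration and will miss unusual phrasings.

```python
import re


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for a prompt, based on the tuning tips above."""
    text = prompt.lower()
    warnings = []

    # Tip: always say what the camera does, even if it does nothing.
    if "camera" not in text:
        warnings.append("no camera instruction; add 'camera is static' or name the move")

    # Tip: always set a duration (matches "5s", "5 sec", "5 seconds").
    if not re.search(r"\b\d+\s*(?:s|sec|seconds?)\b", text):
        warnings.append("no duration; 5-7 seconds is the safe range")

    # Tip: state what must not change. Illustrative phrase list, not exhaustive.
    lock_phrases = ("remain", "stays", "unchanged", "nothing else", "no other")
    if not any(phrase in text for phrase in lock_phrases):
        warnings.append("nothing is locked; state what must not change")

    return warnings


if __name__ == "__main__":
    product_shot = (
        "The product rotates slowly clockwise about 90 degrees, revealing the label. "
        "Light reflections shift across the surface. "
        "Background and lighting remain identical. Camera: static. Duration: 5 seconds."
    )
    for warning in lint_prompt(product_shot):
        print("WARN:", warning)
    for warning in lint_prompt("A dog runs across the field."):
        print("WARN:", warning)
```

An empty list means the prompt covers camera, duration, and a lock statement. It says nothing about whether the action itself is a single, simple one.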

Common mistakes

Compound actions in one clip: walks to the table, picks up cup, drinks — all three will look broken. Split into three clips.
No “static camera” instruction: the camera drifts and adds artifacts on every current model.
Trying to “fix” a flawed starting frame with a long prompt: fix the frame first; the prompt cannot repair bad source geometry.
Asking for too much length: single generations top out near 5-10 seconds. Asking for 15+ seconds in one render still degrades on most consumer tiers.
Motion that contradicts the still: don’t request a head turn to the front when the source only shows the back of the head.

Workflow for stitching multiple clips into a longer scene

Single image-to-video generations are short by design (see the table below), so longer scenes are chained, not rendered in one pass:

Generate a first frame in Midjourney or Flux.
Run image-to-video and save the clip.
Extract the last frame, or use a start/end-frame feature directly: Kling’s Start/End Frame, Pika’s Pikaframes (which can stretch to ~25 seconds), and Runway all support this.
Feed that frame as the start of the next clip and continue chaining.
Cut and color-match in any editor (CapCut, Premiere, Resolve).
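For the manual route in the steps above, the last frame can be pulled out of a saved clip with ffmpeg. The sketch below only builds the command; it assumes ffmpeg is installed and on your PATH, and the file names are placeholders. It relies on ffmpeg's `-sseof` input option (seek relative to the end of the file) and the image writer's `-update 1` option (keep overwriting one output image), so the frame left on disk is the final one.

```python
import shlex


def last_frame_command(clip: str, out_image: str, tail_seconds: float = 1.0) -> list[str]:
    """Build an ffmpeg command that saves the final frame of `clip` to `out_image`."""
    if tail_seconds <= 0:
        raise ValueError("tail_seconds must be positive")
    return [
        "ffmpeg", "-y",
        "-sseof", f"-{tail_seconds}",  # start reading this many seconds before the end
        "-i", clip,
        "-update", "1",                # overwrite a single image; the last frame wins
        "-q:v", "2",                   # high JPEG quality
        out_image,
    ]


if __name__ == "__main__":
    cmd = last_frame_command("clip_01.mp4", "clip_01_last.jpg")
    print(shlex.join(cmd))
    # To actually run it (requires ffmpeg):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Upload the resulting image as the first frame of the next generation. If the clip ends on heavy motion blur, the extracted frame may be a weak starting point, in which case trimming the clip a few frames earlier before extracting can help.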

See How to Improve Motion Consistency in AI Videos for the full chaining workflow and seed-locking details.

Current models, durations, and pricing (June 2026)

Tool	Latest model	Single-clip length	Best at	Entry price (consumer)
Sora 2 / Sora 2 Pro	Sora 2	4/8/12s (Pro: 10/15/25s)	Physics, prompt adherence	API only — consumer app retired Apr 26, 2026
Veo 3.1	Veo 3.1 (Fast/Lite)	8s per generation	Physics, native audio	Google AI Pro $19.99/mo (~1,000 Flow credits)
Kling	Kling 2.6 Pro	5-10s (Extend to 2-3 min)	Real human motion, ads, audio EN/ZH	Standard from $6.99/mo
Runway	Gen-4.5	up to 60s	Stylized motion, expressions, audio	Basic $12/mo
Hailuo (MiniMax)	Hailuo 2.3	6s at 1080p, 10s at 768p	Cheapest reliable, Chinese prompts	Standard $9.99/mo
Pika	Pika 2.2	5/10s (Pikaframes ~25s)	Fast previews, transitions	Free tier 150 credits/mo

Notes: the Sora consumer app (web + iOS) was discontinued on April 26, 2026; Sora 2 lives on as an API only and is scheduled to sunset on September 24, 2026, so it’s no longer the casual pick it once was. Veo 3.1 generations are capped at 8 seconds each, so anything longer means chaining. Always confirm current numbers on the vendor’s own pricing page before committing a budget — these tiers move quarterly.
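The per-generation caps translate directly into how many clips a scene needs. A minimal sketch of that arithmetic, using the single-clip lengths from the table above; `clips_needed` is a hypothetical helper and ignores any overlap or trimming at the joins.

```python
import math


def clips_needed(target_seconds: float, max_clip_seconds: float) -> int:
    """Number of chained generations needed to cover a target scene length."""
    if target_seconds <= 0 or max_clip_seconds <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(target_seconds / max_clip_seconds)


if __name__ == "__main__":
    # Longest single generation per tool, in seconds, taken from the table above.
    caps = {"Veo 3.1": 8, "Kling 2.6 Pro": 10, "Hailuo 2.3": 10, "Pika 2.2": 10}
    scene_seconds = 30
    for tool, cap in caps.items():
        print(f"{tool}: {clips_needed(scene_seconds, cap)} clips for {scene_seconds}s")
```

Because 5 seconds is the most stable length, planning around a 5-second clip (`clips_needed(30, 5)`) gives a more conservative credit estimate than planning around each tool's maximum.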

External references: Runway pricing · Kling pricing · Pika pricing.

FAQ

Q: Best resolution for the starting frame? A: Match the model’s native output. Most image-to-video models render 720p or 1080p, so a 1024x1024 or 1920x1080 source is ideal. Going much higher just gets downscaled, which can introduce artifacts. Hailuo 2.3, for example, outputs 1080p at 6 seconds or 768p at 10 seconds — feed it a clean 1080p still.
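To size an oversized source still yourself rather than leaving the downscale to the tool, the target dimensions are simple arithmetic. The sketch below assumes a 1920x1080 output box; `fit_within` is a hypothetical helper that only computes the numbers, so the actual resize still happens in your image editor.

```python
def fit_within(width: int, height: int,
               max_width: int = 1920, max_height: int = 1080) -> tuple[int, int]:
    """Scale dimensions down to fit the box, keeping aspect ratio; never upscale."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    for size in [(4096, 2304), (3000, 3000), (1280, 720)]:
        print(size, "->", fit_within(*size))
```

A source that is already at or below the box is returned unchanged, since upscaling a small still adds no real detail.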

Q: Which tool has the best image-to-video right now? A: As of June 2026, Kling 2.6 and Runway Gen-4.5 lead on real human motion and faces; Veo 3.1 wins on physical realism and built-in audio; Hailuo 2.3 is the best value and handles Chinese prompts natively. Sora 2 is strong on physics but is API-only now. Test your specific image on two of them — results vary a lot by source frame.

Q: My subject still morphs partway through. Why? A: Usually one of two causes: the clip is too long (drop to 5 seconds), or you asked for an action the starting frame can’t support, like turning a head from the back to the front. Shorten, simplify the action, and start from a higher-resolution still.

Q: How do I make a video longer than ~10 seconds? A: Chain clips. Use a start/end-frame feature (Kling Start/End Frame, Pika Pikaframes up to ~25 seconds) or extract the last frame and reuse it as the next clip’s input. Kling’s Extend can lengthen a clip up to 2-3 minutes. Don’t ask for 15+ seconds in a single render.

Q: Can I specify camera motion AND character motion? A: Yes, but keep both small. One subtle action plus one slow camera move is the limit before consistency breaks. Pushing both to high intensity degrades quality fast on every current model.

Q: How do I add sound? A: Veo 3.1, Kling 2.6 Pro, and Runway Gen-4.5 now generate native audio. On tools that don’t, add it in post (CapCut, Premiere, Resolve).

Tags: #Image-to-video #Video generation #Prompt