Image-to-Video Doesn't Follow the Source Image

Output looks nothing like the input image. Raise image strength, lower motion, shrink the text prompt, and make sure the source is ≥1024px on the short side with the subject filling >40% of the frame.

You upload a specific image to Runway / Pika / Kling, write a prompt about gentle motion — and the output video doesn’t really look like your input image. The subject is slightly different. The lighting changed. The composition shifted. It’s “inspired by” your image rather than animating it.

Image-to-video tools have several knobs that together control how strictly the video follows the source: image strength / fidelity, motion strength (which dilutes the image), text prompt weight (which can override the image), and source image quality. Each tool weights them differently.

Common causes

Ordered by hit rate, highest first.

1. Image strength / fidelity slider too low

Many tools default image strength to around 0.5-0.6. At that level, the model treats the source as a loose reference and takes creative liberties. For strict adherence, push it to 0.8 or higher.

How to spot it: check the tool’s image strength setting. Below 0.7, expect drift.

2. Motion strength too high

High motion overrides image fidelity. Even at full image strength, a high motion setting (e.g. 8 on a 1-10 slider) will deform the subject.

How to spot it: motion slider is at default or higher. Drop it.

3. Text prompt overrides the image

In some tools (especially Runway, Kling), the text prompt has equal or higher weight than the image. A descriptive text prompt can rewrite what’s in the image.

How to spot it: your text prompt describes things not in the image (clothing, hair, expression). It’s fighting the image.

4. Source image too small or low quality

A 512×512 reference carries less identity detail than a 2048×2048 one, so the model has more to work with at higher resolution.

How to spot it: source image is <1024px on the short side.

5. Subject too small in the source image

If the subject is 20% of frame, the model has fewer pixels to anchor on. Subject should be >40% of frame.

How to spot it: subject is small in source. Crop tighter before uploading.

6. Image is heavily stylized / illustration

Anime, painting, and sketch sources translate to video worse than photo sources. The model has to “interpret” them.

How to spot it: source is an illustration; output is a photorealistic version that doesn’t match.
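The six causes above can be run as a quick triage pass before you start changing settings. This is a hypothetical helper, not a feature of any of these tools. It assumes you have normalized image strength and motion to a 0-1 scale yourself (each tool uses its own slider range), and the 0.3 motion cutoff is an assumed threshold rather than a vendor number; the 0.7, 1024px, and 40% thresholds come from the text. Matches come back in the same hit-rate order as the list.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of one image-to-video attempt.

    image_strength and motion are assumed to be normalized to 0-1;
    convert from your tool's own slider scale before filling them in.
    """
    image_strength: float
    motion: float                   # 0 = no motion, 1 = the tool's maximum
    prompt_describes_subject: bool  # prompt mentions clothing, hair, expression...
    short_side_px: int              # shorter side of the source image, in pixels
    subject_fraction: float         # share of the frame the subject fills, 0-1
    is_illustration: bool           # anime, painting, sketch, etc.


def likely_causes(job: Job, motion_cutoff: float = 0.3) -> list[str]:
    """Return the causes that match `job`, highest hit rate first."""
    checks = [
        (job.image_strength < 0.7, "1. image strength too low"),
        (job.motion > motion_cutoff, "2. motion strength too high"),
        (job.prompt_describes_subject, "3. text prompt overrides the image"),
        (job.short_side_px < 1024, "4. source image too small"),
        (job.subject_fraction < 0.4, "5. subject too small in the frame"),
        (job.is_illustration, "6. stylized / illustration source"),
    ]
    return [label for matched, label in checks if matched]


# A typical drifting job: default strength, medium motion, descriptive prompt.
drifting = Job(0.55, 0.5, True, 768, 0.25, False)
print(likely_causes(drifting))
```

Fix the first match before touching the rest; lower-ranked causes often disappear once image strength and motion are set correctly.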

Shortest path to fix

Step 1: Verify source image quality

# Source image checklist
- Resolution: ≥1024px on short side (1536+ better)
- Format: PNG (not heavily-compressed JPEG)
- Subject takes up >40% of frame vertically
- Subject in focus, sharp
- Lighting is clear, no extreme shadows hiding features

If your source fails any of these, fix the source first.
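Two checklist items, resolution and format, can be checked without opening an editor. A minimal sketch, assuming the source is a PNG: the first 24 bytes of a PNG file hold the 8-byte signature followed by the IHDR chunk (4-byte length, the type `IHDR`, then width and height as big-endian 32-bit integers). Focus, lighting, and subject size still need a visual check.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header: bytes) -> tuple[int, int]:
    """Return (width, height) from the first 24 bytes of a PNG file.

    Layout: 8-byte signature, 4-byte IHDR length, b"IHDR",
    then width and height as big-endian unsigned 32-bit integers.
    """
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def check_source(header: bytes, min_short_side: int = 1024) -> list[str]:
    """Flag the resolution and format items from the source checklist."""
    try:
        width, height = png_size(header)
    except ValueError:
        return ["not a PNG: re-save the source as a high-quality PNG"]
    short_side = min(width, height)
    if short_side < min_short_side:
        return [f"short side is {short_side}px, below {min_short_side}px"]
    return []


# Usage (the file name is a placeholder):
# with open("source.png", "rb") as f:
#     print(check_source(f.read(24)))
```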

Step 2: Push image strength to maximum

# Runway Gen-3 Alpha
- Image strength → 0.8 to 1.0
- Or use "Image to video" strict mode

# Pika 2.0
- Strength → 0.85+
- Or use "Image conditioning" mode

# Kling 1.6
- Image strength ("túxiàng qiángdù" slider) → max
- Reference faithfulness → high

# Hailuo / Luma
- Image reference weight → high / max

Step 3: Drop motion to minimum

# Runway: motion 1-2
# Pika: 0.2-0.3
# Kling: "subtle"
# Luma: low

Less motion = less deformation of the source subject.
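Steps 2 and 3 amount to one rule for the first generation: the strict end of image strength, the low end of motion. The table below only transcribes the values given above into a hypothetical lookup; it is not an API for any of these tools. Numbers stay on each tool's own slider scale, strings are the UI choices to pick, Pika's upper bound of 1.0 is an assumption, and Hailuo is left out because no motion value is listed for it.

```python
# Hypothetical first-pass presets transcribed from Steps 2 and 3.
# Tuples are (low, high) ranges on each tool's own scale; strings are UI choices.
FIRST_PASS = {
    "runway": {"image_strength": (0.8, 1.0), "motion": (1, 2)},
    "pika": {"image_strength": (0.85, 1.0), "motion": (0.2, 0.3)},  # 1.0 cap assumed
    "kling": {"image_strength": "max", "motion": "subtle"},
    "luma": {"image_strength": "high / max", "motion": "low"},
}


def _pick(value, end: int):
    """Take one end of a (low, high) range; pass qualitative labels through."""
    return value[end] if isinstance(value, tuple) else value


def first_pass(tool: str) -> dict:
    """Strictest starting point: top of image strength, bottom of motion."""
    preset = FIRST_PASS[tool.lower()]
    return {
        "image_strength": _pick(preset["image_strength"], 1),
        "motion": _pick(preset["motion"], 0),
    }


print(first_pass("Runway"))  # {'image_strength': 1.0, 'motion': 1}
```

From there, raise motion one notch at a time only if the clip is too static, and leave image strength where it is.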

Step 4: Strip or minimize the text prompt

Counter-intuitive but important: less text = more image fidelity.

# Bad — overrides image
"a beautiful woman in a red dress walking confidently through a vibrant city street, cinematic, warm sunset"

# Good — only motion hint
"slight head turn, gentle smile"

# Best — no text prompt
""   # empty; let image speak for itself

Keep text to 5-10 words describing ONLY the motion, not the subject.
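The motion-only rule is easy to check mechanically before you submit. A rough sketch: it enforces the 10-word ceiling (an empty prompt passes, since that is the "Best" case) and flags words that restate the subject or scene. `SUBJECT_WORDS` is a hypothetical starter list seeded from the "Bad" example; extend it for your own sources. This is a heuristic, not something the tools themselves run.

```python
import re

# Hypothetical starter list of words that describe the subject or scene
# rather than motion; seeded from the "Bad" example above. Extend as needed.
SUBJECT_WORDS = {
    "woman", "man", "dress", "hair", "shirt", "beautiful", "red",
    "city", "street", "cinematic", "sunset",
}


def lint_prompt(prompt: str, max_words: int = 10) -> list[str]:
    """Return problems with an image-to-video text prompt (empty list = OK)."""
    words = re.findall(r"[a-z']+", prompt.lower())
    problems = []
    if len(words) > max_words:
        problems.append(f"{len(words)} words; keep it to {max_words} or fewer")
    restated = sorted(SUBJECT_WORDS.intersection(words))
    if restated:
        problems.append("restates the image: " + ", ".join(restated))
    return problems


print(lint_prompt("slight head turn, gentle smile"))  # []
```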

Step 5: Crop source so subject fills frame

# Before upload
- Open source in Photoshop / Preview / Pixelmator
- Crop so subject is at least 50% of frame
- Pad with similar-color background if aspect ratio matters
- Re-save as high-quality PNG
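If you know roughly where the subject sits, the crop box can be computed instead of eyeballed. A minimal sketch, assuming you supply the subject's bounding box yourself in pixel coordinates: it keeps the original aspect ratio, sizes the crop so the subject spans the target share of the crop height (50% by default, following the step above), centers on the subject, and slides the box back inside the image. If the subject already fills enough of the frame, it returns the full image; padding stays a manual step.

```python
def crop_for_subject(img_w: int, img_h: int,
                     box: tuple[int, int, int, int],
                     target: float = 0.5) -> tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) crop with the image's aspect ratio
    in which the subject box spans at least `target` of the crop height.

    `box` is the subject's (left, top, right, bottom) in pixels.
    """
    x0, y0, x1, y1 = box
    sub_w, sub_h = x1 - x0, y1 - y0
    crop_h = sub_h / target
    crop_w = crop_h * img_w / img_h      # keep the original aspect ratio
    if crop_w < sub_w:                   # very wide subject: fit it by width
        crop_w = sub_w
        crop_h = crop_w * img_h / img_w
    if crop_h >= img_h:                  # subject already fills enough: no crop
        return (0, 0, img_w, img_h)
    # Center on the subject, then slide the box back inside the image.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    left = min(max(cx - crop_w / 2, 0), img_w - crop_w)
    top = min(max(cy - crop_h / 2, 0), img_h - crop_h)
    return (round(left), round(top), round(left + crop_w), round(top + crop_h))


print(crop_for_subject(2000, 3000, (800, 1200, 1200, 2100)))  # (400, 750, 1600, 2550)
```

The returned tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` takes. A crop only removes pixels, so check that the cropped short side still meets the ≥1024px rule from Step 1.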

Step 6: For illustration sources, use a stylized video model

# Anime / illustration → video
- Sora (handles illustration well)
- Kling 1.6 with "Stylized" mode
- Try "ChampVision" or other illustration-specific models

# Force photoreal conversion (if that's what you want)
- Accept that source style won't be preserved
- Use a ControlNet-style reference to preserve composition only

Prevention

  • Always prep source images to ≥1024px with subject filling >40% before uploading
  • Default to minimum motion + maximum image strength for first generation; raise motion only if needed
  • Keep text prompts to motion-only descriptions; don’t describe the subject again
  • For multi-clip projects with same source, save the prepped source and reuse exactly
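"Reuse exactly" can be verified rather than trusted, since re-exporting or re-cropping a source can change its bytes even when it looks the same. A small sketch using Python's standard `hashlib`: record the SHA-256 of the prepped file once and compare it before each new clip. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def fingerprint_bytes(data: bytes) -> str:
    """SHA-256 hex digest; matching digests mean byte-identical sources."""
    return hashlib.sha256(data).hexdigest()


def fingerprint_file(path: Path) -> str:
    """Digest of a prepped source file on disk."""
    return fingerprint_bytes(Path(path).read_bytes())


def same_source(path: Path, recorded_digest: str) -> bool:
    """True if the file is exactly the source recorded for earlier clips."""
    return fingerprint_file(path) == recorded_digest


# Usage (the file name is a placeholder):
# digest = fingerprint_file(Path("hero_prepped.png"))   # store with the project
# assert same_source(Path("hero_prepped.png"), digest)  # before each new clip
```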

Tags: #Video generation #Debug #Troubleshooting