What this tutorial solves
The pain: you have a finished script you are proud of, you paste it into an AI video tool, and what comes back is a series of literal stock-footage-looking clips that miss the subtext. The fix is not a better model — it is breaking the script into shot-level prompts and deliberately deciding, per shot, whether you want a literal visual or an evocative one. This workflow turns a 60-second script into 8 to 12 well-paced shots in roughly 90 minutes, including generation time.
Who this is for
Writers and content creators who script first and visualize second — essayists, video podcasters, indie filmmakers, brand marketers narrating product stories, and educators producing explainer content. Especially useful when the script’s voice or thesis matters more than glossy production — the literal-vs-evocative discipline is what carries voice into picture.
When to reach for it
You have a finished script (narration, monologue, voiceover, or dialogue) and need accompanying visuals. It also helps when revising an existing video: re-mapping shots onto a script you have already produced lets you swap weak visuals without recutting audio. It is useful, too, for repurposing a podcast clip as a short video: the audio is already cut, so you just need 30 to 60 seconds of picture.
When this is NOT the right tool
Pure improvisational visual work where you compose in the moment. Scripts where the visual is the script (storyboarded before written — animation, music videos). Documentary work where the picture has to be real footage of real events. Any project where rights to AI-generated imagery are a problem (some publications still reject AI visuals for editorial work).
Step by step
- Read the script aloud and time it. Mark moments where a visual shift would help — generally one every 4 to 8 seconds. Mark with timestamps so durations align later.
- For each marked moment, decide: literal visual (showing what the script says) or evocative visual (showing what the script means). Literal works for concrete nouns; evocative works for abstract claims and emotional beats.
- Aim for roughly 60% literal, 40% evocative. Too literal feels like a slideshow; too evocative loses the viewer. Once every mark is tagged, go back through the script and count the ratio.
- Write an AI video prompt per shot. Set the shot length to the duration of the script section plus a 0.5 to 1 second buffer on each end for editing. Example: a 6-second script line gets a 7 to 8 second generation.
- Generate all shots. Keep them slightly longer than needed and accept a 30 to 50% re-generation rate on first pass — abstract beats often need 2 to 3 tries.
- Edit script audio first, then drop visuals on top. Audio drives the cut; visuals serve audio. This is the opposite of how you would cut shot footage.
- Test the cut without sound first. If the visuals tell the story alone — or at least a recognizable version of it — you have enough. If not, two or three shots need to be stronger or different.
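The marking, tagging, and duration steps above can be sketched as a small planning script. This is only an illustration in Python, not part of any video tool: the `Mark` class, the `plan_shots` and `literal_share` helpers, and the sample marks are all hypothetical, and the 0.5 second per-side buffer is simply the low end of the range suggested above.

```python
from dataclasses import dataclass

BUFFER_S = 0.5  # assumed per-side editing buffer (tutorial range: 0.5 to 1 second)


@dataclass
class Mark:
    start_s: float  # timestamp where the visual shift begins
    line: str       # script text covered by this shot
    kind: str       # "literal" or "evocative"


def plan_shots(marks, script_end_s, buffer_s=BUFFER_S):
    """Return one dict per mark with its script duration and generation length."""
    shots = []
    for i, mark in enumerate(marks):
        # A shot runs until the next mark, or until the script ends.
        end_s = marks[i + 1].start_s if i + 1 < len(marks) else script_end_s
        duration = end_s - mark.start_s
        shots.append({
            "line": mark.line,
            "kind": mark.kind,
            "script_s": duration,
            "generate_s": duration + 2 * buffer_s,  # buffer on both ends
        })
    return shots


def literal_share(shots):
    """Fraction of shots tagged literal (the target is roughly 0.6)."""
    literal = sum(1 for shot in shots if shot["kind"] == "literal")
    return literal / len(shots)


# Hypothetical 30-second script with a mark every 6 seconds.
marks = [
    Mark(0.0, "Opening claim", "literal"),
    Mark(6.0, "Why it matters", "evocative"),
    Mark(12.0, "Concrete example", "literal"),
    Mark(18.0, "Second example", "literal"),
    Mark(24.0, "Closing thought", "evocative"),
]
shots = plan_shots(marks, script_end_s=30.0)
for shot in shots:
    print(f'{shot["kind"]:<10} script {shot["script_s"]:.1f}s -> generate {shot["generate_s"]:.1f}s')
print(f"literal share: {literal_share(shots):.0%}")
```

Keeping the buffer as a parameter makes it easy to raise it toward 1 second for shots you expect to trim heavily.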
First-run exercise
Pick the shortest script you have — 30 to 45 seconds. Run the full workflow including the silent-test pass. Most writers find their first pass is too literal (80/20 instead of 60/40) and the silent test catches it immediately: the video plays like an inventory of nouns. For the second run, swap two literal shots for evocative ones in the same script. The before/after comparison is more useful than reading another tutorial.
Quality check
- Shot count matches your mark density. A 60-second script has 8 to 12 shots, not 30, not 4.
- Literal-to-evocative ratio is roughly 60/40. Recount after generation — some literal prompts come back evocative anyway.
- Every shot has 0.5 to 1 second of buffer on each end. No back-to-back hard cuts on the frame boundary.
- Audio cuts and visual cuts align within 100 ms. If they drift, the brain reads “out of sync” even before you can name why.
- The silent-test cut conveys the script’s gist. If you have to read a transcript to understand, your visuals are too evocative.
- No shot does double duty for two unrelated lines. Each prompt is for exactly one beat.
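The 100 ms alignment check in the list above is easy to automate if you can export cut times from your editor. The sketch below assumes you already have two lists of cut timestamps in milliseconds, one for audio and one for picture; the function name `drifting_cuts` and the sample numbers are hypothetical, and how you export the timestamps depends on your editor.

```python
SYNC_TOLERANCE_MS = 100  # threshold taken from the quality check above


def drifting_cuts(audio_cuts_ms, visual_cuts_ms, tolerance_ms=SYNC_TOLERANCE_MS):
    """Return (cut index, drift in ms) for every cut pair outside the tolerance.

    A positive drift means the picture cut lands after the audio cut.
    """
    if len(audio_cuts_ms) != len(visual_cuts_ms):
        raise ValueError("each audio cut needs exactly one visual cut")
    return [
        (i, visual - audio)
        for i, (audio, visual) in enumerate(zip(audio_cuts_ms, visual_cuts_ms))
        if abs(visual - audio) > tolerance_ms
    ]


# Hypothetical cut lists: the third picture cut lands 150 ms late.
audio_cuts = [0, 6000, 12400]
visual_cuts = [0, 6080, 12550]
problems = drifting_cuts(audio_cuts, visual_cuts)
print(problems)
```

An empty list means every cut is within tolerance; anything else tells you which cut to nudge and in which direction.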
How to reuse this workflow
- Save the read-aloud-and-mark step as a template doc. Two columns: timestamp + line, literal/evocative + prompt seed.
- Build a small library of prompts that have produced strong evocative shots for recurring themes you cover (remote work, AI fatigue, creativity). Reuse the seeds; tweak the specifics.
- Track your re-generation rate per shot type. If evocative shots take 3+ tries, your prompts are too vague — add concrete visual anchors.
- For series or multi-episode work, develop a visual vocabulary doc: characters, palette, recurring motifs. AI shots stay more consistent when you reference the doc each time.
- Every few weeks, regenerate a successful shot with a new model snapshot. When the new model produces equal or better quality, migrate.
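Tracking the re-generation rate per shot type, as suggested above, needs nothing more than a log of how many tries each finished shot took. Here is a minimal sketch, assuming you record one `(kind, tries)` pair per shot; the function name `average_tries` and the sample log are hypothetical.

```python
from collections import defaultdict

VAGUE_PROMPT_THRESHOLD = 3  # "3+ tries" from the reuse tip above


def average_tries(log):
    """log: iterable of (kind, tries) pairs, one per finished shot."""
    totals = defaultdict(lambda: [0, 0])  # kind -> [shot count, total tries]
    for kind, tries in log:
        totals[kind][0] += 1
        totals[kind][1] += tries
    return {kind: tries / count for kind, (count, tries) in totals.items()}


# Hypothetical log from one project.
log = [
    ("literal", 1),
    ("literal", 2),
    ("evocative", 3),
    ("evocative", 4),
    ("literal", 1),
]
averages = average_tries(log)
needs_anchors = [kind for kind, avg in averages.items() if avg >= VAGUE_PROMPT_THRESHOLD]
print(averages)
print("add concrete visual anchors to:", needs_anchors)
```

With this sample log the evocative shots average 3.5 tries, which is the signal to tighten those prompts.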
Recommended workflow
Read script aloud and time it → mark visual cut points every 4 to 8 seconds → tag each as literal or evocative aiming for 60/40 → write one shot prompt per mark with duration plus buffer → generate (expect 30 to 50% re-gen rate) → cut audio first, drop visuals on top, align cuts to audio → silent-test the cut and revise weak shots.
Common mistakes
- One visual per sentence. Sentences are too small a unit; you end up with 30 shots in a minute and the cut feels frantic.
- Only literal visuals. The video feels like a slideshow inventorying nouns. Voice carries the literal info; let picture do something else.
- Only evocative visuals. The viewer loses the thread. Anchor every 8 to 12 seconds with a literal beat.
- Generating before reading aloud. You miss the natural pacing of your own writing and end up with cuts in the middle of phrases.
- Letting visuals drive cuts instead of audio. AI shots have arbitrary length; if you cut to them, the script gets chopped.
- Skipping the silent test. Sound covers a multitude of visual problems; mute reveals them.
Advanced tips
- For dialogue-heavy scripts, alternate close-ups (character emotion) with wides (environment). Two close-ups in a row read as repetitive.
- For narration, lean evocative — the voiceover already carries literal info. 50/50 or even 40/60 literal/evocative often works for narration.
- Save the script-to-shot-mapping in a doc with a column for which generations succeeded and which were re-rolled. Future similar projects benefit from the prior pattern.
- For interview or podcast clips, the visual job is to maintain attention, not explain — lean toward atmospheric or abstract evocative shots.
- For ads with a call to action, end on a literal product shot. Evocative endings tend to perform worse on click-through rate (CTR).
Output checklist
- Script read aloud and visual moments marked with timestamps.
- Each mark tagged literal or evocative, with a target ratio around 60/40.
- Shot durations match script pacing with 0.5 to 1 second buffer per side.
- Audio cut completed before visuals are placed.
- Tested as silent video to confirm visual story works.
- Final cut never stays in one category (literal or evocative) for more than 12 seconds in a row.
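The last item on the checklist, the 12-second limit on consecutive shots of one category, can be checked mechanically. The sketch below assumes the final cut is available as an ordered list of `(kind, duration_s)` pairs; the function name `long_runs` and the sample cut are hypothetical.

```python
from itertools import groupby

MAX_RUN_S = 12  # limit from the checklist above


def long_runs(shots, max_run_s=MAX_RUN_S):
    """shots: list of (kind, duration_s) in cut order.

    Returns (kind, total seconds) for each unbroken run that exceeds the limit.
    """
    runs = []
    # groupby collapses consecutive shots that share the same kind.
    for kind, group in groupby(shots, key=lambda shot: shot[0]):
        total = sum(duration for _, duration in group)
        if total > max_run_s:
            runs.append((kind, total))
    return runs


# Hypothetical cut: three literal shots in a row add up to 15 seconds.
cut = [
    ("literal", 6),
    ("literal", 5),
    ("literal", 4),
    ("evocative", 7),
    ("literal", 6),
]
print(long_runs(cut))
```

A flagged run is a hint to swap one of its shots for the other category rather than to shorten the shots themselves.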
FAQ
- Should I write the script first and then visualize, or visualize first?: For an AI workflow, script first. Generating without a script to direct the shots wastes credits and produces incoherent material.
- What about voiceover quality?: AI voiceover (ElevenLabs, OpenAI voices) is acceptable for many cases. For high-stakes brand work or anywhere voice is the product, hire a human.
- My script has 4-minute monologues. Same workflow?: Yes, but split into 30 to 60 second segments and apply the workflow per segment. A single 4-minute master is too unwieldy.
- How do I keep characters or settings consistent across shots?: Use the same seed when possible, describe characters in the same words across prompts, and reference a “style sheet” line in every prompt. See AI video style consistency.
- What about subtitles or captions?: Add in post. Generating subtitles into the video locks them; overlaying in the editor lets you re-time without re-rendering.
- Can I use this for animated shorts?: Yes, but the literal-vs-evocative split shifts toward evocative — animation is already a stylized layer over reality.
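For the long-monologue question above, one way to pick segment boundaries is to split at natural pauses you noted while reading aloud. The sketch below is a simple greedy approach under that assumption: `split_points`, the pause timestamps, and the 240-second total are hypothetical, and the function only enforces the 60-second upper bound, so check by eye that no segment comes out much shorter than 30 seconds.

```python
def split_points(pauses_s, total_s, max_len_s=60):
    """Greedy split: repeatedly cut at the latest pause that keeps
    the current segment no longer than max_len_s."""
    cuts = []
    start = 0
    while total_s - start > max_len_s:
        candidates = [p for p in pauses_s if start < p <= start + max_len_s]
        if not candidates:
            raise ValueError(f"no pause within {max_len_s}s after {start}s")
        start = max(candidates)
        cuts.append(start)
    return cuts


# Hypothetical 4-minute monologue with pauses noted during the read-aloud.
pauses = [28, 55, 97, 110, 150, 171, 205, 228]
cuts = split_points(pauses, total_s=240)
bounds = [0] + cuts + [240]
segment_lengths = [end - begin for begin, end in zip(bounds, bounds[1:])]
print(cuts)
print(segment_lengths)
```

Each resulting segment then goes through the full workflow on its own, including its own silent test.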