You generated a two-shot: a woman in red on the left, a man in blue on the right. Around second 3 they pass behind a pillar, and when they re-emerge the man is wearing red and the woman is in blue, or their faces have swapped entirely. The model lost track of which character was which. This is identity swap, and it is one of the hardest multi-character problems in current AI video.
The fixes, in short: generate each character separately and composite them, lock both characters into the first frame with image-to-video, or use a tool with per-character reference image support, such as Runway Gen-3 References or Kling references.
Common causes
Ordered by how often each one turns out to be the culprit.
1. Model has weak per-character anchoring
Text-to-video models reason about “a woman in red and a man in blue” as a single bag of attributes. Once their pixels mix (occlusion, close framing, hugging), the model is free to reassign which person owns which attribute.
How to spot it: Identity swap happens precisely at moments of occlusion or close contact. The model “redraws” both characters when they re-emerge and gets the assignment wrong.
2. Prompt does not strongly differentiate the characters
“Two friends walking” gives the model nothing to anchor on. “A tall woman in a red coat with short black hair, walking alongside a shorter man with curly blond hair in a blue jacket” gives much stronger anchors.
How to spot it: Re-read your prompt. If you cannot tell from the prompt alone which character is on which side, the model has no reason to keep them straight.
3. Reference image has both characters fused
For image-to-video, your reference image might have both characters in the same crop region or with overlapping silhouettes. The model treats them as one entity.
How to spot it: Look at the reference. If you cannot draw a clean bounding box around each character, the model can’t either.
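The bounding-box test can be made concrete. The sketch below assumes you have already drawn one box per character by hand (or with any detector) and only checks whether the two boxes intersect. The `Box` type, the function names, and the pixel coordinates are hypothetical, chosen for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def overlap_area(a: Box, b: Box) -> int:
    """Return the area shared by two boxes, or 0 if they do not intersect."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    if width <= 0 or height <= 0:
        return 0
    return width * height


def horizontal_gap(a: Box, b: Box) -> int:
    """Horizontal distance between the boxes; negative if they overlap horizontally."""
    return max(a.left, b.left) - min(a.right, b.right)


def cleanly_separated(a: Box, b: Box) -> bool:
    """True when the two character boxes share no pixels."""
    return overlap_area(a, b) == 0


# Illustrative coordinates for a 1280x1000 reference frame.
woman = Box(left=100, top=200, right=400, bottom=900)
man = Box(left=520, top=180, right=860, bottom=900)
print(cleanly_separated(woman, man), horizontal_gap(woman, man))
```

If `cleanly_separated` returns `False`, recrop or regenerate the reference before spending credits on video.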
4. Action requires them to swap positions
If your prompt says “they switch sides as they walk,” the model did what you asked, but the requested position swap can carry the characters' attributes along with it, so clothing or faces end up exchanged as well. Common in choreographed shots.
How to spot it: Does your prompt include “switch,” “pass,” “cross,” or “exchange”? Position swap can drag identity swap with it.
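A quick way to run this check on a batch of prompts is a crude word scan. The sketch below looks only for the four words listed above and their inflections; the function name is made up for illustration, and a stem match like this will also flag harmless words such as "crosswalk" or "passenger", so treat hits as a cue to re-read the prompt rather than as a verdict.

```python
import re

# The four trigger words from the check above, matched as stems.
RISKY_STEMS = ("switch", "pass", "cross", "exchange")
_PATTERN = re.compile(r"\b(?:" + "|".join(RISKY_STEMS) + r")\w*")


def find_risky_words(prompt: str) -> list[str]:
    """Return every word in the prompt that starts with a risky stem, lowercased."""
    return _PATTERN.findall(prompt.lower())


print(find_risky_words("They switch sides as they walk."))
print(find_risky_words("A woman and a man walk side by side."))
```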
5. Long clip duration on a multi-character scene
5-second two-shots are usually OK. 10-second two-shots have triple the chance of identity drift. This is the same drift you see in single-character clips, only worse, because there are two identities to lose.
How to spot it: Generate 4s and 10s versions. If 4s is stable and 10s swaps, duration is amplifying weak anchoring.
Shortest path to fix
Step 1: Strengthen per-character description in the prompt
# Weak
"Two friends walking through a park."
# Strong
"On the left, a tall woman in a red coat with shoulder-length black hair,
walking next to a shorter man on the right in a blue jacket with curly blond
hair. Maintain positions: woman always left, man always right.
Maintain clothing: red on woman, blue on man, throughout entire clip."
# Use distinct hair color, height, clothing color
# Repeat the assignment at start and end of prompt
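If you write many multi-character prompts, the "distinct differentiators, assignment repeated at start and end" pattern can be templated. The following is a minimal sketch, not any tool's API: the `Character` fields and the `build_prompt` helper are hypothetical names, and the wording of the anchor sentence is just one reasonable phrasing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    """Visual differentiators for one character (hypothetical structure)."""
    label: str     # e.g. "woman"
    side: str      # "left" or "right"
    height: str    # e.g. "tall"
    clothing: str  # e.g. "red coat"
    hair: str      # e.g. "shoulder-length black hair"


def anchor_sentences(characters: list[Character]) -> str:
    """One short sentence per character binding clothing and side to the label."""
    return " ".join(
        f"{c.label.capitalize()} in a {c.clothing} stays on the {c.side}."
        for c in characters
    )


def build_prompt(characters: list[Character], action: str) -> str:
    """Put the anchor sentences at both the start and the end of the prompt."""
    anchors = anchor_sentences(characters)
    descriptions = " ".join(
        f"On the {c.side}, a {c.height} {c.label} in a {c.clothing} with {c.hair}."
        for c in characters
    )
    return f"{anchors} {descriptions} {action} {anchors}"


woman = Character("woman", "left", "tall", "red coat", "shoulder-length black hair")
man = Character("man", "right", "shorter", "blue jacket", "curly blond hair")
print(build_prompt([woman, man], "They walk through a park."))
```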
Step 2: Generate each character separately and composite
The single most reliable fix:
# Generate character A alone in scene
"A tall woman in red coat walks through park from left to right,
empty path, no other people, locked tripod."
# Generate character B alone in same scene
"A shorter man in blue jacket walks through park from left to right,
empty path, no other people, locked tripod, matching lighting and color grade."
# Composite in After Effects or Resolve Fusion
- Mask each character out of its clip (rotoscope or key), as you would separate a subject from green-screen footage
- Layer A on bottom, B on top
- Adjust timing so they appear in same shot
- Add ground shadow to anchor them
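For readers who want to see what "A on bottom, B on top" means per pixel, here is a small sketch of the standard "over" compositing operator on single RGBA pixels. It is an illustration of the layering math only, not how After Effects or Fusion are scripted; the `Pixel` alias and function names are assumptions made for this example, and colors use straight (non-premultiplied) alpha.

```python
# Straight (non-premultiplied) RGBA, each channel in the range 0.0-1.0.
Pixel = tuple[float, float, float, float]


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Composite one pixel over another using the standard 'over' operator."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # both layers fully transparent

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers: list[Pixel]) -> Pixel:
    """Composite layers listed bottom-to-top, e.g. [plate, character_a, character_b]."""
    result: Pixel = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        result = over(layer, result)
    return result


plate = (0.5, 0.5, 0.5, 1.0)        # background plate, opaque gray
character_a = (1.0, 0.0, 0.0, 1.0)  # masked woman in red, opaque at this pixel
character_b = (0.0, 0.0, 1.0, 0.0)  # masked man in blue, masked out at this pixel
print(composite([plate, character_a, character_b]))
```

Where character B's mask is transparent, character A shows through; where both masks are opaque, B wins because it is the top layer. That is why clean masks matter more than layer order when the characters never overlap.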
Step 3: Use image-to-video with both characters in the first frame
If you must keep them in the same generation:
# Reference image checklist
- Both characters clearly visible
- Distinct silhouettes (height, hair, clothing)
- Clean spatial separation, no occlusion
- Strong color contrast in their outfits
# Image-to-video prompt
"Continuation of the depicted scene. Woman in red stays on left throughout.
Man in blue stays on right throughout. No swapping of positions or clothing.
Maintain identities from reference frame."
# Generate 4 seconds max; longer = higher swap risk
Step 4: Use per-character reference images where supported
# Runway Gen-3 References
- Upload reference image for Character A
- Upload reference image for Character B
- Tool conditions each separately
- Strength: 0.7-0.8 per character
# Kling 2.0 Multi-Subject References
- Add up to 2 subject references
- Bind each to a description in prompt
- Best for short clips; still drifts on long ones
# Pika Pikascenes
- Multiple character anchors
- Better than text-only multi-character
Step 5: Avoid occlusion or rewrite to single-character shots
If swaps persist, restructure the edit:
# Replace one two-shot with two single-shots
- Shot 1: Woman in red walking (4 seconds)
- Shot 2: Man in blue walking (4 seconds)
- Cut between them in editor
# Or generate the two-shot but cut before/after occlusion
- Generate 6 seconds
- Use only the first 3 seconds where no occlusion happens
- Discard the post-occlusion portion
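The "keep only the pre-occlusion portion" step can be done in any editor. As one hedged example, the sketch below builds an `ffmpeg` command that keeps the first N seconds of a clip using the `-t` output duration option. The file names are placeholders, and the helper only constructs and prints the command; it assumes `ffmpeg` is installed if you choose to run the result.

```python
import shlex


def trim_command(src: str, dst: str, keep_seconds: float) -> list[str]:
    """Build an ffmpeg command that keeps the first keep_seconds of src.

    Omitting '-c copy' makes ffmpeg re-encode, which lets the cut land on
    the requested time rather than on the nearest keyframe.
    """
    if keep_seconds <= 0:
        raise ValueError("keep_seconds must be positive")
    return ["ffmpeg", "-i", src, "-t", f"{keep_seconds:g}", dst]


# Placeholder file names; keep the 3 seconds before the occlusion.
cmd = trim_command("two_shot.mp4", "two_shot_trimmed.mp4", 3)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```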
Prevention
- Default to single-character generation plus composite for any shot where character identity matters.
- Always prompt distinct visual differentiators (color, height, hair) for multi-character scenes.
- Avoid prompts that require occlusion, swaps, or close physical contact between characters.
- Cap multi-character clips at 4-5 seconds; chain in post for longer.
- Build a reference-image library per character so future shots stay consistent.
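To apply the clip-length cap from the list above when a scene needs to run longer, you can plan the segments before generating anything. This is a small arithmetic sketch; the function name and the default cap of 4 seconds are assumptions taken from the guidance in this article, not settings from any particular tool.

```python
import math


def plan_segments(total_seconds: float, cap_seconds: float = 4.0) -> list[float]:
    """Split a target duration into equal segments, none longer than the cap."""
    if total_seconds <= 0 or cap_seconds <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / cap_seconds)
    return [total_seconds / count] * count


# A 10-second scene becomes three clips of about 3.33 seconds each,
# to be generated separately and chained in the editor.
for index, seconds in enumerate(plan_segments(10), start=1):
    print(f"Clip {index}: {seconds:.2f}s")
```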