The default behavior of every text-to-image model is to give you a slightly different person each generation. For indie authors, comic artists, game devs, and brand teams running a mascot across 20 banner images, that drift is the single biggest reason their assets feel “AI-generated” instead of “designed”. The fix is a canonical reference image plus a frozen structural description — applied with discipline, not improvisation.
What this tutorial solves
Generating one cool image is easy; generating ten that all look like the same person is hard. This guide turns “same character, different scenes” into a repeatable process built around a character bible (one reference image + one trait list) and tool-specific reference features (Midjourney --cref, ChatGPT image input, Stable Diffusion LoRA, Sora character system). The goal is consistency you can ship to a client, not a one-off lucky shot.
Who this is for
Indie authors illustrating chapters or covers, game devs needing the same NPC across portraits and combat poses, comic and webtoon artists who can’t redraw, marketing teams running a consistent mascot across asset packs, and educators producing cohort visuals for a course. If your character only appears once, you can skip this; if it appears five or more times, the discipline pays off immediately.
When to reach for it
Reach for it when a character will appear in five or more images and must look like the same person. It also fits launching a new brand mascot, illustrating a book or comic series, building a game’s character roster, or producing a short story with a recurring cast. It is less useful for editorial illustration, where each scene’s protagonist is incidental.
When this is NOT the right tool
One-off illustrations where consistency does not matter, real-person portraits (use a real photoshoot — AI cannot legally or ethically replicate a specific real person without permission), or characters with no defined visual where the drift is actually a feature. Also skip for photorealistic humans — small differences read as a different person, and current models cannot hold true photoreal consistency yet.
Before you start
- Decide your style early: stylized illustration, anime, painterly, photoreal-stylized, full photoreal. Stylized tolerates small variations; photoreal does not.
- Choose your toolchain: Midjourney with --cref and --sref, ChatGPT image with image input, Stable Diffusion + LoRA / DreamBooth, Sora for video frames, Flux with Redux for reference. Test on your character before committing.
- Reserve a folder structure: /character-bible/{character-name}/ containing canonical.png, traits.md, prompt-template.md, and outputs/.
- Block 1-2 hours just for canonical generation. The reference image is the most important asset; do not rush it.
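The folder structure above can be scaffolded with a few lines of Python. This is a minimal sketch, not part of any tool: the function name `scaffold_character_bible` and the `root` argument are assumptions made for illustration, and only the file and folder names come from the list above.

```python
from pathlib import Path


def scaffold_character_bible(root: Path, character_name: str) -> Path:
    """Create the bible folder for one character and return its path.

    Hypothetical helper: the layout mirrors the folder structure in the text.
    """
    character_dir = root / "character-bible" / character_name
    # outputs/ holds approved scene images; parents=True also creates the bible folder.
    (character_dir / "outputs").mkdir(parents=True, exist_ok=True)
    # Create the text files only if they are missing, so reruns never overwrite work.
    for filename in ("traits.md", "prompt-template.md"):
        (character_dir / filename).touch(exist_ok=True)
    # canonical.png is deliberately not created: it is the image you generate and pick by hand.
    return character_dir
```

Calling `scaffold_character_bible(Path("."), "mira")` would leave an empty bible ready for the canonical portrait to be saved into it.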
Step by step
- Generate the canonical portrait. Front-facing, neutral background, clear lighting, mid-shot. Generate 12-20 variations and pick the strongest. This is the only time you are searching for the character; afterwards you are matching to it.
- Write the trait list. 5-7 specific, visible traits: hair color + length + texture, eye color, skin tone, distinguishing marks (scar, freckles, tattoo placement), signature outfit or accessory, body type. Avoid abstract traits (“kind eyes” — model interpretation drifts).
- Use the reference image as input. In tools that support it: Midjourney --cref URL --cw 100, ChatGPT image with the canonical attached, Stable Diffusion with IP-Adapter or ControlNet reference preprocessor, Flux Redux node in ComfyUI. The reference image carries more signal than any prose.
- For tools without image input, paste the trait list verbatim. Do not rephrase. “Auburn shoulder-length wavy hair” stays exactly “auburn shoulder-length wavy hair” in every prompt. Tiny rephrases compound into a different person by image 5.
- For each new scene, change only background, lighting, and pose. Character description stays byte-identical. Keep a prompt-template.md with placeholders for scene and pose only.
- When the AI drifts, increase reference weight. Midjourney: bump --cw to 100. Stable Diffusion: raise IP-Adapter weight to 0.8-1.0. ChatGPT: re-attach the reference and remind the model explicitly.
- Maintain a character bible. Reference image + canonical traits + 3-5 already-approved scene outputs serve as future references. The bible grows with the project.
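One way to keep the character description byte-identical is to never type it by hand. The sketch below fills a template in which only the scene and pose vary, and optionally appends the Midjourney `--cref URL --cw 100` suffix mentioned in the steps. The names `TRAIT_BLOCK`, `PROMPT_TEMPLATE`, and `build_prompt` are hypothetical, the trait wording is borrowed from the Mira example in this guide, and the ordering of pose and scene inside the template is an assumption.

```python
from string import Template
from typing import Optional

# Frozen character description: copy it verbatim from traits.md and never rephrase it.
TRAIT_BLOCK = (
    "auburn shoulder-length wavy hair, green almond-shaped eyes, "
    "warm olive skin, small scar above right eyebrow"
)

# Only $pose and $scene change between images; the trait block always comes first.
PROMPT_TEMPLATE = Template("$traits, $pose, $scene")


def build_prompt(scene: str, pose: str, cref_url: Optional[str] = None, cw: int = 100) -> str:
    """Return a scene prompt that starts with the unchanged trait block."""
    prompt = PROMPT_TEMPLATE.substitute(traits=TRAIT_BLOCK, pose=pose, scene=scene)
    if cref_url is not None:
        # Midjourney-style reference suffix, as written in the steps above.
        prompt += f" --cref {cref_url} --cw {cw}"
    return prompt
```

For tools without image input, leave `cref_url` unset and the function returns the text prompt alone. The URL you pass is whatever address your canonical image is hosted at.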
Trait list example
Name: Mira
Hair: auburn, shoulder-length, wavy, side-parted left
Eyes: green, almond-shaped
Skin: warm olive
Marks: small scar above right eyebrow
Outfit: charcoal canvas jacket with brass buttons,
knee-high boots, leather satchel slung right shoulder
Build: medium height, athletic
Paste this block at the top of every scene prompt with a one-line action and setting appended.
First-run exercise
- Generate your canonical portrait. Spend the full 1-2 hours on it.
- Generate three scene images using the reference + frozen trait list.
- Print or screen-tile all four (canonical + three scenes). Squint. If any one image reads as a different person, the trait list is too vague or the reference weight too low.
- Adjust the variable that fixes it — usually reference weight — and re-run the three scenes.
Quality check
- Place canonical and new scene side by side at thumbnail size. Same person?
- Are the distinguishing marks present in the new scene? Missing scars and freckles are the easiest tell.
- Did the outfit drift in unspecified directions? Did “charcoal jacket” stay charcoal, or become “dark blue” by image 8?
- Did the character age or shift body type across the series? Subtle aging is a common drift.
How to reuse this workflow
- Once a character bible is stable, reuse it across projects. The same trait block produces consistent outputs even months later.
- Build a scenes/ library of approved outputs to use as additional references for new generations.
- For a series, generate all key panels in one session per character so the conversational model warms up.
- Update the bible when the character “ages” or changes outfit by intent (chapter break, season change) — make it a deliberate version, not a drift.
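To make bible updates deliberate rather than accidental, one option is to record a fingerprint of the trait block for each approved version and compare against it before a session. This is only a sketch under stated assumptions: `trait_fingerprint`, `matches_approved_version`, and the `approved` mapping are hypothetical names, and the 12-character digest length is an arbitrary choice.

```python
import hashlib


def trait_fingerprint(trait_text: str) -> str:
    """Short SHA-256 digest of the trait block, ignoring line-ending and edge whitespace noise."""
    normalized = trait_text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def matches_approved_version(trait_text: str, approved: dict, version: str) -> bool:
    """True only if the text is exactly what was approved under this version label."""
    return approved.get(version) == trait_fingerprint(trait_text)
```

A typical use would be to store `{"v1": trait_fingerprint(text)}` when the bible is approved, then refuse to generate if the current traits.md no longer matches. An intentional change (new outfit after a chapter break) gets a new label such as "v2" instead of silently replacing "v1".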
Recommended workflow
Build twelve scene images of a mascot: generate the canonical portrait first, write a 5-trait description, then prompt each scene as [trait block] doing [action] in [setting], [pose], [lighting]. The character should stay consistent across all twelve outputs. Total time is about 3-4 hours including iteration. Skipping the discipline takes roughly a tenth as long but produces unusable results.
Common mistakes
- Rephrasing the character description each scene. “Red hair” -> “ginger hair” -> “auburn hair” -> a different person by image 5.
- Adding new traits mid-series (“she has a pendant now”). Stick to what was in the canonical, or update the bible explicitly.
- Not saving the canonical image. You lose the only objective anchor and every drift compounds.
- Trying for hyperreal real-person consistency. Stick to stylized characters where the human eye forgives small differences.
- Using different models within a single character set. Midjourney Mira and Stable Diffusion Mira will not match; pick one.
- Letting prompt order drift. Always put the trait block first; moving it later in the prompt reduces its influence.
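Two of the mistakes above, rephrasing the description and letting the trait block slide later in the prompt, can be caught mechanically before anything is generated. The following is a minimal sketch; `lint_prompts` is a hypothetical helper, and it checks only exact text, so it cannot judge the images themselves.

```python
def lint_prompts(trait_block: str, prompts: list) -> list:
    """Return one message per prompt that does not begin with the exact trait block."""
    problems = []
    for index, prompt in enumerate(prompts, start=1):
        if prompt.startswith(trait_block):
            continue  # trait block is present, verbatim, and first
        if trait_block in prompt:
            problems.append(f"prompt {index}: trait block is not first")
        else:
            problems.append(f"prompt {index}: trait block missing or rephrased")
    return problems


traits = "auburn shoulder-length wavy hair"
batch = [
    "auburn shoulder-length wavy hair, reading in a library",
    "in a library, auburn shoulder-length wavy hair",
    "ginger shoulder-length wavy hair, reading in a library",
]
for message in lint_prompts(traits, batch):
    print(message)
# prints:
# prompt 2: trait block is not first
# prompt 3: trait block missing or rephrased
```

Running a check like this over a whole batch of scene prompts is cheap compared with discovering the drift at image 5.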
Advanced tips
- For maximum consistency, use tools with explicit character reference (Midjourney --cref at weight 100, Stable Diffusion LoRA trained on 15-30 images, Flux Redux, Sora character system).
- Stylized beats photoreal for consistency: cartoon and painterly characters tolerate small variations better than photoreal humans.
- For comics or story sequences, generate all panels in one session so the model “warms up” to the character.
- Train a LoRA once you have 15-30 approved outputs. Future generations will hold consistency without needing the reference image attached every time.
- For video (Sora, Veo), generate a strong canonical key frame first, then use image-to-video to drive motion. Pure text-to-video character consistency is still weak.
Output checklist
- Canonical character image saved at high resolution.
- Trait list (5-7 specific traits) reused identically across scenes.
- Reference image attached as input in every tool that supports it.
- Character bible doc maintained and version-controlled.
- Scene outputs reviewed side by side at thumbnail size before approval.
FAQ
- Best tool for character consistency?: Midjourney --cref for fast iteration, Stable Diffusion + LoRA for maximum control, Flux Redux for high-fidelity reference matching, Sora character system for video. Test all on your specific character; results vary by style.
- Can I train a model on my character?: For Stable Diffusion, yes. LoRA needs about 15-30 reference images and produces strong consistency. DreamBooth is more powerful but heavier. For closed models, use their built-in reference features instead.
- Why does the face still drift even with a reference image?: Reference weight is probably too low, or the reference image itself was inconsistent (multiple angles in one image confuse the system). Use a single clean front-facing canonical.
- How many traits is too many?: Above 8-10 specific traits, models start to drop some randomly. Stick to the most visible and distinguishing.
- Can I do photoreal human consistency?: Not reliably with current public models. Stylized photoreal (a slightly painterly look) works; pure photoreal does not yet hold across enough variations.
- What if I need the character in 100 images?: Train a LoRA after the first 20 approved outputs. Future generations cost much less effort and stay more consistent.