AI Image Reference Image Ignored: Fix img2img & Style Transfer

Uploaded a reference for img2img, IP-Adapter, Flux Redux, or ControlNet and the output barely resembles it? Diagnose by strength sweep, confirm the reference reached the model, then crank the right knob.

Published: May 24, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

You uploaded a clean photo of a teal ceramic mug as the img2img / reference input and wrote “same mug, on a wooden table”. The output is a brown mug on a granite counter. The reference looks like it was glanced at and discarded.

Fastest fix: the single most common cause is denoising strength set too high. In a standard diffusion pipeline, strength=1.0 adds full noise and ignores the reference entirely (that is just text-to-image), while strength near 0 leaves the reference almost untouched. Drop img2img strength to 0.5-0.65 and re-run. If you wanted “same style, different subject” instead of “same composition”, you are in the wrong mode entirely — use IP-Adapter or Flux Redux, not img2img.

Reference conditioning in diffusion pipelines is not a copy operation. It is a weighted bias on the denoising trajectory, and that bias loses to a stronger text prompt, a wrong strength setting, or a model that has no image-conditioning component loaded at all. The fix is knowing which “reference” mode you are in and how to crank it.

Which bucket are you in?

Read this first. Each row points at the exact section below.

Symptom	Likely cause	Jump to
Output unrelated to reference	Strength / denoise too high (near 1.0)	Cause 1
Output is a near-copy of reference	Strength / scale too low	Cause 1
Wanted “same style, new subject” but got “same layout”	Wrong mode (used img2img for a style job)	Cause 2
Reference has faint influence even at high steps	IP-Adapter scale / Redux strength at default	Cause 3
Subject missing from a wide/tall reference	Destructive center-crop to square	Cause 4
Long detailed prompt, reference ignored	Text conditioning drowns the image	Cause 5
Reference accepted but zero effect	Model has no image-conditioning head loaded	Cause 6
Reference weak only when prompt is strict	CFG too high	Cause 7

Common causes

In rough order of frequency.

1. Strength / denoise is too high

In img2img, strength (a.k.a. “denoising strength”) sets how much of the original image survives. The number maps directly to noise steps: with 50 sampling steps, strength=0.8 adds noise for 40 of them and denoises from there, so most of the reference is overwritten. At strength=1.0 the reference is fully replaced — the run is pure text-to-image. At 0.3 the output is barely different from the reference. As of June 2026, the diffusers and ComfyUI defaults still treat strength this way.

How to spot it: output unrelated to reference = strength too high; output identical to reference = strength too low. Sweet spot for “recognizably the same scene, restyled” is usually 0.5-0.65.
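
The step arithmetic above is easy to check by hand. Here is a minimal sketch of it, assuming the pipeline truncates steps * strength to an integer (the helper name img2img_steps is made up for illustration, and your pipeline's rounding may differ):

```python
def img2img_steps(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Return (steps_skipped, steps_run) for an img2img call.

    Assumption: the number of noised-then-denoised steps is
    int(num_inference_steps * strength), capped at the total.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


for s in (0.5, 0.8, 1.0):
    skipped, run = img2img_steps(50, s)
    print(f"strength={s}: denoise {run} of 50 steps, skip {skipped}")
```

The more steps are run, the less of the reference latent survives; at strength=1.0 nothing is skipped, which is why the run behaves like text-to-image.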

2. Wrong conditioning mode (img2img vs IP-Adapter vs Flux Redux vs ControlNet)

Four distinct mechanisms, all called “reference” by different UIs:

img2img: reference becomes the starting (partially noised) latent. Controls composition and rough color.
IP-Adapter: reference is encoded by a CLIP image encoder and injected into cross-attention. Controls style and content semantically; composition can change freely.
Flux Redux: Flux’s native image-prompt adapter (the CLIPVisionEncode -> StyleModelApply path in ComfyUI). At full strength it largely ignores your text and produces variations of the reference; you dial it down to let the prompt back in.
ControlNet: reference is preprocessed (canny edge, depth, pose) into a control map. Controls structure only.

Each mode preserves a different aspect of the reference, so picking the wrong one preserves the wrong thing and the result looks weak.

How to spot it: if you wanted “same style, different subject” and used img2img, you used the wrong mode. IP-Adapter (any model) or Flux Redux (Flux models) is for style; img2img is for composition.

3. IP-Adapter scale / Redux strength is at a soft default

IP-Adapter respects a scale parameter. Per the diffusers docs, scale=1.0 conditions on the image prompt only and scale=0.5 gives a balanced text/image blend — so the common default of 0.5 deliberately holds the reference back. At 0.3 it is barely visible; at 0.8-0.9 it dominates. Many UIs do not expose this slider clearly.

For Flux Redux, strength is reduced by downsampling the conditioning tensor (e.g. the Apply Style Model / Advanced Redux Control downsampling_factor, where 3 is roughly “medium”). If Redux output looks too loose, lower the downsampling.

How to spot it: hunt for the image weight / ip_adapter_scale (IP-Adapter) or downsampling_factor (Redux). If IP-Adapter scale is 0.5 or unset, raise to 0.8.
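
One way to picture what the Redux downsampling knob does is to count conditioning tokens. The sketch below assumes the reference is encoded as a 27x27 grid of image tokens and that the node shrinks each side of that grid by downsampling_factor using floor division. The grid size and the rounding rule are assumptions about a typical Redux setup, not something stated above, so treat the counts as illustrative only.

```python
def redux_tokens(grid_side: int = 27, downsampling_factor: int = 1) -> int:
    """Illustrative count of image-conditioning tokens after downsampling.

    Assumptions (not from the article): a square token grid of side
    `grid_side`, shrunk per side with floor division.
    """
    if downsampling_factor < 1:
        raise ValueError("downsampling_factor must be >= 1")
    side = max(grid_side // downsampling_factor, 1)
    return side * side


for factor in (1, 2, 3, 5):
    print(f"downsampling_factor={factor}: {redux_tokens(27, factor)} image tokens")
```

Fewer image tokens means the reference carries less detail into the model, which is consistent with the rule of thumb here: more downsampling gives the prompt more room.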

4. Reference is being resized / cropped destructively

Many pipelines crop the reference to a square at 512x512 or 1024x1024. If your reference is wide or tall, the cropped version may no longer contain the subject.

How to spot it: open the reference the pipeline actually saw (some UIs save the preprocessed image). If the subject is cropped out, that is the reason.
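
If your UI does not save the preprocessed image, you can estimate what a center-crop would keep. This is a minimal sketch, assuming a plain center-crop to a square and a subject bounding box you measured yourself in pixels (both helper names are hypothetical):

```python
def center_square_crop(width: int, height: int) -> tuple[int, int, int, int]:
    """Return the (left, top, right, bottom) box of a centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def subject_survives(crop_box, subject_box) -> bool:
    """True if the subject box lies fully inside the crop box."""
    cl, ct, cr, cb = crop_box
    sl, st, sr, sb = subject_box
    return sl >= cl and st >= ct and sr <= cr and sb <= cb


crop = center_square_crop(1920, 1080)
print(crop)                                              # crop window in pixels
print(subject_survives(crop, (100, 300, 400, 800)))      # mug near the left edge
print(subject_survives(crop, (600, 200, 1200, 900)))     # mug near the center
```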

5. Text prompt overpowers the image prompt

The model balances text and image conditioning. A long, specific text prompt (photorealistic, 4k, studio lighting, magazine cover, sharp focus, ...) outweighs a vague reference; the model trusts the text more.

How to spot it: shorten the text prompt to a few words and re-run. If the reference now shows up, the text was drowning it.

6. Model has no image-conditioning head

Plain SDXL-base does not take reference images natively. You need img2img through the pipeline, or an IP-Adapter, or a ControlNet loaded as a separate model. Some UIs silently accept the reference but discard it if no head can consume it. (IP-Adapter and Redux are also model-family specific: an SDXL IP-Adapter will not load against a Flux checkpoint, and vice versa.)

How to spot it: check whether the workflow / API call loads an ip-adapter-*.safetensors or a controlnet-*.safetensors, and that it matches your base model’s family. If not, the reference is going into the void.
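
For comparison, this is roughly what an explicit adapter load looks like in diffusers. It is a sketch under assumptions: the SDXL base and h94/IP-Adapter repository names are my choice for illustration, not something this guide prescribes, and running it needs a GPU plus a weight download. If your own script has no equivalent of the load_ip_adapter call, nothing is consuming the reference.

```python
def load_sdxl_with_ip_adapter(scale: float = 0.8):
    """Load an SDXL base pipeline, attach an SDXL IP-Adapter, set its scale."""
    # Imported lazily so this file can be imported on a machine without
    # torch/diffusers installed.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",  # assumed base model
        torch_dtype=torch.float16,
    )
    # The adapter family must match the base model family (SDXL with SDXL).
    pipe.load_ip_adapter(
        "h94/IP-Adapter",                 # assumed adapter repository
        subfolder="sdxl_models",
        weight_name="ip-adapter_sdxl.bin",
    )
    pipe.set_ip_adapter_scale(scale)
    return pipe


# Usage (requires a GPU; `reference` is a PIL image you loaded yourself):
# pipe = load_sdxl_with_ip_adapter(0.8).to("cuda")
# image = pipe(prompt="a mug on a wooden table", ip_adapter_image=reference).images[0]
```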

7. CFG scale is starving the image conditioning

In multi-conditioning setups, very high CFG (12+) amplifies the text prompt and proportionally reduces image-prompt influence.

How to spot it: drop CFG to 5-7 and re-run. If reference influence increases, CFG was the issue.

Before you start

Identify which mode you are in: img2img, IP-Adapter, Flux Redux, or ControlNet. Read the workflow / UI labels carefully.
Save the reference at the resolution and aspect ratio you want, and let the pipeline resize down rather than crop.
Decide which aspect of the reference you actually want preserved: composition, style, structure, or all three. Different modes serve different goals.

Information to collect

The reference image file you uploaded.
The strength / denoise value (img2img), scale value (IP-Adapter), downsampling factor (Redux), or weight value (ControlNet).
Full text prompt.
Model name and any adapter / ControlNet files loaded, plus their model family.
CFG scale.
Output resolution and reference resolution — is one being resized?

Step-by-step fix

Step 1: Pick the correct mode for your goal

Goal	Use
Same composition, different details	img2img, strength 0.4-0.6
Same style, different subject (SDXL/SD)	IP-Adapter, scale 0.7-0.9
Same style, different subject (Flux)	Flux Redux, dial downsampling for prompt room
Same pose / structure, anything else	ControlNet (pose, depth, canny)
Same character, different scene	IP-Adapter Face + ControlNet pose

Pick one and commit. Mixing modes without intention usually fails.
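
If you script your runs, the table above can live in code so the choice is explicit. A minimal sketch: the short goal keys and the pick_mode helper are hypothetical names, and the settings are copied from the table.

```python
from typing import Optional

# goal key -> (mode, starting setting); keys are illustrative shorthand.
MODE_FOR_GOAL: dict[str, tuple[str, Optional[str]]] = {
    "composition": ("img2img", "strength 0.4-0.6"),
    "style_sdxl": ("IP-Adapter", "scale 0.7-0.9"),
    "style_flux": ("Flux Redux", "dial downsampling for prompt room"),
    "structure": ("ControlNet", "pose, depth, or canny"),
    "character": ("IP-Adapter Face + ControlNet pose", None),
}


def pick_mode(goal: str) -> tuple[str, Optional[str]]:
    """Look up the conditioning mode for a goal, failing loudly on typos."""
    key = goal.strip().lower()
    if key not in MODE_FOR_GOAL:
        options = ", ".join(sorted(MODE_FOR_GOAL))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {options}")
    return MODE_FOR_GOAL[key]


mode, setting = pick_mode("style_flux")
print(mode, "->", setting)
```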

Step 2: Sweep strength systematically

Generate 5 images at strengths 0.3, 0.5, 0.7, 0.85, 0.95 (img2img) or the same values for IP-Adapter scale. Keep the seed fixed so only one variable moves, then look at the row.

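import torch

# Assumes `pipe` (an img2img pipeline), `prompt`, and `reference` are already defined.
for strength in [0.3, 0.5, 0.7, 0.85, 0.95]:
    out = pipe(
        prompt=prompt,
        image=reference,
        strength=strength,
        # Re-seed on every run so strength is the only variable that changes.
        generator=torch.Generator().manual_seed(42),
    )
    out.images[0].save(f"sweep_{strength}.png")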

Pick the value where the output respects both the reference and your prompt.

Step 3: Verify the reference actually reached the model

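In API mode, trace the multipart upload (plain -v shows headers only; --trace-ascii - also dumps the request body to stdout):

curl --trace-ascii - -X POST "$API_URL" \
  -F "prompt=teal ceramic mug on wooden table" \
  -F "image=@reference.png" \
  -F "strength=0.6"

Confirm the traced request contains a form part with name="image" and filename="reference.png". If the upload is there but the server still ignores it, the field name is probably wrong: some APIs want init_image, reference_image, or image_prompt. In a node UI, confirm the reference node is actually wired into the sampler, not left dangling.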

Step 4: Shorten the text prompt

Strip the text to the minimum needed to specify the differences from the reference:

BEFORE: photorealistic 4k studio lighting teal ceramic mug on
        oak wooden table soft window light shallow depth of
        field magazine quality

AFTER:  same mug, on a wooden table

Let the reference carry the style and lighting; let the text carry only the changes.

Step 5: Crank IP-Adapter scale (or lower Redux downsampling)

If using IP-Adapter:

ip_adapter_scale = 0.85  # was 0.5 (the balanced default)

At 0.85 the reference’s style and content dominate; at 1.0 the output is conditioned on the image only. For a style-only result that keeps your prompt’s layout, the diffusers docs support per-block scaling — push style (up block_0) while zeroing layout (down block_2):

pipeline.set_ip_adapter_scale({
    "up":   {"block_0": [0.0, 1.0, 0.0]},
    "down": {"block_2": [0.0, 0.0]},
})

If using Flux Redux and it ignores your prompt, raise the downsampling_factor (more downsampling = weaker reference, more prompt) instead.

Step 6: Pre-crop the reference to a matching aspect

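If your output is 1024x1024 and the reference is 1920x1080, the pipeline will letterbox or center-crop. A blind center-crop kills wide references whose subject sits off-center. Pre-crop the reference yourself to match the output aspect, and check that the subject is inside the crop window (ffmpeg's crop filter centers the window unless you append x and y offsets):

ffmpeg -i ref.jpg -vf "crop=min(iw\,ih):min(iw\,ih)" ref_square.jpg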

Or pad to square non-destructively before upload.
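
The padding route boils down to computing a canvas size and a paste offset. A minimal sketch with a hypothetical helper, pad_to_aspect, which only does the arithmetic; creating the canvas and pasting the image (for example with Pillow) is left to you.

```python
def pad_to_aspect(width: int, height: int, target_w: int, target_h: int):
    """Return (canvas_w, canvas_h, offset_x, offset_y) for lossless padding.

    The canvas has aspect target_w:target_h (rounded up to whole pixels) and
    is just large enough to hold the whole reference, centered.
    """
    if min(width, height, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")
    if width * target_h >= height * target_w:
        # Reference is wider than the target aspect: keep width, grow height.
        canvas_w = width
        canvas_h = -(-width * target_h // target_w)   # ceiling division
    else:
        # Reference is taller than the target aspect: keep height, grow width.
        canvas_h = height
        canvas_w = -(-height * target_w // target_h)  # ceiling division
    return canvas_w, canvas_h, (canvas_w - width) // 2, (canvas_h - height) // 2


# A 1920x1080 reference padded for a square output.
print(pad_to_aspect(1920, 1080, 1, 1))
```

Padding keeps every pixel of the reference, at the cost of bars the model may try to paint over, so compare it against a manual crop before settling on one.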

Step 7: Stack ControlNet for structural control

If your goal is “exact same composition”, img2img alone is fragile. Add a ControlNet canny or depth on top. Start controlnet_conditioning_scale at 0.5 and raise toward 1.0 for stricter adherence (values above 1.0 enforce the edges hard but cost image quality):

out = pipe(
    prompt=prompt,
    image=reference,
    strength=0.7,
    control_image=canny_of_ref,
    controlnet_conditioning_scale=0.8,
)

ControlNet pins structure; img2img pins color/lighting. Together they preserve the reference reliably. See AI image character consistency for the related multi-image case.

How to confirm it’s fixed

Generate 3 outputs at the chosen strength. The reference’s most important attribute (style / composition / subject, whichever you wanted) should be visible in all 3.
Drop strength to 0.1 and confirm the output is nearly identical to the reference. This proves the reference is loaded.
Raise strength to 0.99 and confirm the output diverges almost entirely. This proves the strength control is responsive.

If the 0.1 test does not look like your reference, the reference is not reaching the model — go back to Step 3 before touching any other knob.
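
If you would rather not eyeball the 0.1 and 0.99 tests, a crude pixel distance is enough to confirm the ordering. A minimal sketch, assuming you have already loaded the reference and both outputs as equal-length flat sequences of 0-255 grayscale values at the same resolution (the helper names are hypothetical, and mean absolute difference is my choice of metric, not the only option):

```python
def mean_abs_diff(a, b) -> float:
    """Mean absolute per-pixel difference between two flat pixel sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def strength_control_responds(reference, out_low, out_high) -> bool:
    """True if the low-strength output is closer to the reference than the
    high-strength output, which is what a working strength knob should give."""
    return mean_abs_diff(reference, out_low) < mean_abs_diff(reference, out_high)


# Toy 4-pixel "images" standing in for real data.
reference = [10, 20, 30, 40]
out_low = [12, 19, 31, 40]      # e.g. the strength=0.1 run
out_high = [200, 5, 90, 250]    # e.g. the strength=0.99 run
print(mean_abs_diff(reference, out_low), mean_abs_diff(reference, out_high))
print(strength_control_responds(reference, out_low, out_high))
```

If the two distances come out about equal, the reference is probably not reaching the model at all.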

Long-term prevention

Label your workflow with the conditioning mode in big text. You will forget which mode it is a week later.
Save the strength / scale values that work for your common use cases as named presets.
Keep reference images at the same aspect ratio as the intended output to avoid silent crops.
Use short text prompts when a strong reference is in play — let the reference carry style.
Stack ControlNet + img2img when you need both structure and color preservation.
Match adapter family to base model (SDXL adapter with SDXL, Flux Redux with Flux) and test new IP-Adapter / ControlNet / Redux versions with a controlled prompt before relying on them in production.
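
Named presets can be as simple as a JSON file next to your workflow. A minimal sketch: the ReferencePreset fields and the preset names are hypothetical, and the numbers are just the starting points suggested earlier in this guide.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReferencePreset:
    mode: str          # "img2img", "ip-adapter", "redux", or "controlnet"
    base_family: str   # keeps the adapter family visible next to the numbers
    strength: float    # strength or scale, depending on the mode
    cfg: float
    note: str = ""


PRESETS = {
    "restyle_same_scene": ReferencePreset("img2img", "sdxl", 0.6, 6.0),
    "style_transfer": ReferencePreset("ip-adapter", "sdxl", 0.85, 6.0, "short prompt"),
}


def dump_presets(presets: dict) -> str:
    """Serialize presets to JSON text that can be written to a file."""
    return json.dumps({name: asdict(p) for name, p in presets.items()},
                      indent=2, sort_keys=True)


def load_presets(text: str) -> dict:
    """Rebuild the preset objects from JSON text."""
    return {name: ReferencePreset(**fields)
            for name, fields in json.loads(text).items()}


print(dump_presets(PRESETS))
```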

Common pitfalls

Using img2img at strength 0.9 and expecting the reference to survive. It will not.
Loading IP-Adapter but forgetting to set its scale. Some Comfy nodes default the scale to 0.0, which silently disables it.
Writing a 200-token text prompt with a reference and wondering why the reference is ignored.
Uploading a portrait-oriented reference for a landscape output — the pipeline center-crops and discards the subject.
Treating “reference image” as one universal feature. Four different mechanisms exist; pick the right one.
Loading an SDXL IP-Adapter against a Flux base model (or a Flux adapter against SDXL). It either errors or no-ops.
Running ControlNet pose detection on a reference where the pose is unclear (heavy clothing, occlusion). The detector outputs garbage and the model follows garbage.

FAQ

Q: How is img2img different from IP-Adapter for style transfer?

img2img literally starts from the reference as the initial latent, so the output is shaped like the reference. IP-Adapter encodes the reference semantically through CLIP and biases generation, so the output can have completely different composition but matching style. For style transfer, IP-Adapter (or Flux Redux on Flux) is better.

Q: What is Flux Redux and when do I use it instead of IP-Adapter?

Redux is Flux’s built-in image-prompt adapter (the CLIPVisionEncode -> StyleModelApply nodes in ComfyUI). Use it when your base model is Flux, since SDXL IP-Adapters do not load on Flux. At full strength Redux mostly reproduces the reference and ignores text; increase downsampling to give your prompt room.

Q: My reference is photorealistic but the output is illustration-style. Why?

The text prompt or checkpoint is overriding the reference style. Switch to a photorealistic checkpoint, shorten the prompt, and consider IP-Adapter at scale 0.85.

Q: Can I use multiple reference images?

Yes. IP-Adapter Plus, multi-image Redux, and some Comfy workflows accept several. They blend, which may not be what you want. A more reliable pattern is one reference for style plus a ControlNet pose from another reference.

Q: Does seed matter for reference-based generation?

Yes. Same seed + same reference + same prompt + same strength = same output, as long as the model, sampler, and hardware also stay the same. Lock the seed so you can isolate the effect of a single strength or scale change.

Tags: #Troubleshooting #ai-image #img2img #reference-image #controlnet