AI Image Reference Image Mostly Ignored

You uploaded a reference for img2img or style transfer and the output barely resembles it — strength, mode, and model architecture all matter. Diagnose by sweeping strength.

You uploaded a clean reference photo of a teal ceramic mug as the img2img / reference input. You wrote “same mug, on a wooden table”. The output is a brown mug on a granite counter. The reference image looks like it was glanced at and discarded. Reference image conditioning in diffusion pipelines is not a copy operation: it is a weighted bias on the denoising trajectory, and that bias loses to a stronger text prompt, to a setting that gives the reference too little weight (denoising strength too high in img2img, adapter scale too low in IP-Adapter), or to a model that has no image-conditioning path loaded at all. Fix this by working out which “reference” mode you are using and how to turn its influence up.

Common causes

In rough order of frequency.

1. Strength / denoise is too high

In img2img, “strength” or “denoising strength” controls how much of the original image survives. At 0.85+ the model destroys the reference and effectively does text-to-image. At 0.3 the output is barely different from the reference.

How to spot it: If output looks unrelated to reference, strength is too high. If output looks identical to reference, strength is too low.
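To make the strength numbers concrete, here is a minimal sketch of how an img2img pipeline typically turns strength into the number of denoising steps it actually runs. The arithmetic follows the convention used by Hugging Face diffusers' img2img pipelines; other UIs may round differently, and the function name steps_actually_run is made up for illustration.

```python
def steps_actually_run(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising steps an img2img run executes.

    Assumption: the pipeline skips the first (1 - strength) share of the
    schedule, so only the last `strength` share modifies the reference.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


if __name__ == "__main__":
    # Print the step budget for a 40-step schedule at a few strengths.
    for strength in (0.0, 0.25, 0.5, 1.0):
        print(strength, steps_actually_run(40, strength))
```

With a 40-step schedule, strength 0.25 leaves only 10 steps to change the reference, while strength 1.0 runs the whole schedule from pure noise, which is why the reference disappears.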

2. Wrong conditioning mode (img2img vs IP-Adapter vs ControlNet)

Three distinct ways to use a reference, all called “reference” by different UIs:

  • img2img: reference is encoded, partially noised, and used as the starting point for denoising; controls composition and rough color.
  • IP-Adapter: reference is encoded by a CLIP image encoder; controls style and content semantically.
  • ControlNet: reference is preprocessed (canny edge, depth, pose); controls structure.

Picking the wrong mode for your goal gives weak results.

How to spot it: If you wanted “same style, different subject” and used img2img, you used the wrong mode. IP-Adapter is for style; img2img is for composition.

3. IP-Adapter weight is at default 0.5

IP-Adapter respects a weight parameter. At 0.5 (a common default) the reference has moderate influence; at 0.3 it is barely visible; at 0.9 it dominates. Many UIs do not expose this slider clearly.

How to spot it: Hunt for the “image weight” or “ip_adapter_scale” parameter. If it is 0.5 or unset, raise to 0.8.

4. Reference is being resized / cropped destructively

Many pipelines crop the reference to a square at 512x512 or 1024x1024. If your reference is wide or tall, the cropped version may not contain the subject anymore.

How to spot it: Open the reference file the pipeline actually saw (some UIs save it). If the subject is cropped out, that is the reason.
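If the UI does not save the image it actually used, you can estimate the damage yourself. The sketch below assumes the pipeline center-crops to a square (an assumption; check your tool) and that you can eyeball a rough bounding box for the subject. Both function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def center_square_crop(width: int, height: int) -> Box:
    """Return the box that a center crop to a square would keep."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def fraction_kept(subject: Box, crop: Box) -> float:
    """Return the share of the subject box that falls inside the crop box."""
    sl, st, sr, sb = subject
    cl, ct, cr, cb = crop
    overlap_w = max(0, min(sr, cr) - max(sl, cl))
    overlap_h = max(0, min(sb, cb) - max(st, ct))
    subject_area = (sr - sl) * (sb - st)
    return (overlap_w * overlap_h) / subject_area if subject_area else 0.0


if __name__ == "__main__":
    crop = center_square_crop(1920, 1080)
    # A subject near the left edge of a wide frame mostly falls outside the crop.
    print(crop, fraction_kept((100, 300, 500, 700), crop))
```

A low fraction means the model never saw most of the subject, no matter how the strength or scale is set.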

5. Text prompt overpowers the image prompt

The model balances text and image conditioning. A long, specific text prompt (“photorealistic, 4k, studio lighting, magazine cover, sharp focus, …”) outweighs a vague reference. The model trusts the text more.

How to spot it: Shorten the text prompt to a few words and re-run. If the reference now shows up, the text was drowning it.

6. Model has no image-conditioning head

Plain SDXL-base does not take reference images natively — you need img2img through the pipeline, or an IP-Adapter, or ControlNet loaded as a separate model. Some UIs silently accept the reference but discard it if the model has no head for it.

How to spot it: Check if the workflow / API call loads an ip-adapter-*.safetensors or a controlnet-*.safetensors. If not, your reference is going into the void.

7. CFG scale is starving the image conditioning

In multi-conditioning setups, very high CFG (12+) amplifies the text prompt and can crowd out the image prompt's influence.

How to spot it: Drop CFG to 5-7 and re-run. If reference influence increases, CFG was the issue.
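For intuition, this is the standard classifier-free guidance formula written as a few lines of Python. It only shows the text-side amplification: how the image conditioning is attached (to both branches or only the conditional one) depends on the pipeline, so treat this as a sketch rather than a model of your exact setup. The function name apply_cfg is hypothetical.

```python
from typing import List


def apply_cfg(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    guided = uncond + scale * (cond - uncond)
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]  # toy prediction without the text prompt
    cond = [1.0, 1.0]    # toy prediction with the text prompt
    for scale in (5.0, 7.0, 12.0):
        print(scale, apply_cfg(uncond, cond, scale))
```

Wherever the text prompt changes the prediction, the change is multiplied by the scale, so at 12 the text-driven direction is far larger than at 5.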

Before you start

  • Identify which mode you are in: img2img, IP-Adapter, or ControlNet. Read the workflow / UI labels carefully.
  • Save the reference image at the resolution and aspect ratio you want — let the pipeline resize down, not crop.
  • Decide what aspect of the reference you actually want preserved: composition, style, structure, or all three. Different modes serve different goals.

Information to collect

  • Reference image file (the one you uploaded).
  • Strength / denoise value (for img2img) or scale value (for IP-Adapter) or weight value (for ControlNet).
  • Full text prompt.
  • Model name and any adapter / ControlNet files loaded.
  • CFG scale.
  • Output resolution and reference resolution — is one being resized?
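One way to keep these details together is a small record you fill in per run and save next to the output. This is only a sketch: the class name RunRecord, its field names, and the example values are all made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RunRecord:
    reference_path: str
    mode: str          # "img2img", "ip-adapter", or "controlnet"
    influence: float   # strength, scale, or weight, depending on mode
    prompt: str
    model_name: str
    adapter_files: List[str] = field(default_factory=list)
    cfg_scale: Optional[float] = None
    output_size: Optional[Tuple[int, int]] = None     # (width, height)
    reference_size: Optional[Tuple[int, int]] = None  # (width, height)

    def aspect_mismatch(self) -> bool:
        """True when reference and output aspect ratios differ."""
        if not self.output_size or not self.reference_size:
            return False
        ow, oh = self.output_size
        rw, rh = self.reference_size
        return ow * rh != oh * rw


if __name__ == "__main__":
    record = RunRecord(
        reference_path="reference.png",
        mode="img2img",
        influence=0.6,
        prompt="same mug, on a wooden table",
        model_name="example-model",  # placeholder, not a real checkpoint name
        cfg_scale=7.0,
        output_size=(1024, 1024),
        reference_size=(1920, 1080),
    )
    print(json.dumps(asdict(record), indent=2))
    print("aspect mismatch:", record.aspect_mismatch())
```

An aspect mismatch does not prove the subject was cropped out, but it tells you a resize or crop is happening somewhere and is worth checking.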

Step-by-step fix

Step 1: Pick the correct mode for your goal

Goal                                   | Use
Same composition, different details    | img2img, strength 0.4-0.6
Same style, different subject          | IP-Adapter, scale 0.7-0.9
Same pose / structure, anything else   | ControlNet (pose, depth, canny)
Same character, different scene        | IP-Adapter Face + ControlNet pose

Pick one and commit. Mixing modes without intention usually fails.

Step 2: Sweep strength systematically

Generate 5 images at strengths 0.3, 0.5, 0.7, 0.85, 0.95 (img2img) or scales 0.3, 0.5, 0.7, 0.85, 0.95 (IP-Adapter). Look at the row.

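import torch  # assumes `pipe` is a diffusers-style img2img pipeline, already loaded

for strength in [0.3, 0.5, 0.7, 0.85, 0.95]:
    # Re-seed on every run so strength is the only variable that changes.
    generator = torch.Generator(device="cpu").manual_seed(42)
    out = pipe(
        prompt=prompt,
        image=reference,
        strength=strength,
        generator=generator,
    )
    # diffusers-style pipelines return an object holding a list of images.
    out.images[0].save(f"sweep_{strength}.png")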

Pick the value where the output respects the reference and your prompt.

Step 3: Verify the reference actually reached the model

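In API mode, dump the multipart upload:

curl --trace-ascii - -X POST "$API_URL" \
  -F "prompt=teal ceramic mug on wooden table" \
  -F "image=@reference.png" \
  -F "strength=0.6"

Confirm the trace contains a form-data part with name="image" and filename="reference.png", followed by the file contents. If the part is there but the output still ignores the reference, the field name may be wrong for this API. Some APIs want init_image, reference_image, or image_prompt.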

Step 4: Shorten the text prompt

Strip the text prompt to the minimum needed to specify the differences from the reference:

BEFORE: photorealistic 4k studio lighting teal ceramic mug on
        oak wooden table soft window light shallow depth of
        field magazine quality

AFTER:  same mug, on a wooden table

The reference carries the style and lighting; let the text only carry the changes.

Step 5: Crank IP-Adapter scale

If using IP-Adapter:

ip_adapter_scale = 0.85  # was 0.5

At 0.85 the reference’s style and content dominate. At 0.95 the output is nearly a recolor of the reference.

Step 6: Pre-crop the reference to a matching aspect

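If your output is 1024x1024 and reference is 1920x1080, the pipeline will either letterbox or center-crop. Center-crop kills wide references whose subject sits off-center. Pre-crop the reference yourself to match the output aspect, so you choose which part survives:

ffmpeg -i ref.jpg -vf "crop=min(iw\,ih):min(iw\,ih)" ref_square.jpg

This command crops around the center by default. If the subject is off-center, append x and y offsets (crop=w:h:x:y) so the square lands on the subject. Or pad to square in a non-destructive way before upload.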
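For the padding route, the geometry is simple enough to sketch. The helper below only computes the canvas size and where to paste the reference; the name pad_to_square is hypothetical.

```python
from typing import Tuple


def pad_to_square(width: int, height: int) -> Tuple[int, Tuple[int, int]]:
    """Return (canvas_side, (x_offset, y_offset)) for centering on a square canvas."""
    side = max(width, height)
    return side, ((side - width) // 2, (side - height) // 2)


if __name__ == "__main__":
    # A 1920x1080 reference needs a 1920x1920 canvas with bars above and below.
    print(pad_to_square(1920, 1080))
```

With Pillow, these numbers could feed Image.new("RGB", (side, side), fill_color) followed by canvas.paste(img, offset). The fill color is your choice; a neutral color sampled from the reference background tends to distract the model less than pure black, though that is a rule of thumb, not a guarantee.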

Step 7: Stack ControlNet for structural control

If your goal is “exact same composition”, img2img alone is fragile. Add ControlNet canny or depth on top:

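# Assumes `pipe` is a ControlNet + img2img pipeline and `canny_of_ref` is the
# Canny edge map of the reference. Parameter names vary by pipeline: the
# built-in diffusers ControlNet img2img pipeline calls the edge-map input
# `control_image`.
out = pipe(
    prompt=prompt,
    image=reference,                             # img2img starting point
    strength=0.7,
    controlnet_conditioning_image=canny_of_ref,  # structural guide
    controlnet_conditioning_scale=0.8,
)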

ControlNet pins structure; img2img pins color/lighting. Together they preserve the reference reliably. See AI image character consistency for the related multi-image case.

Verify

  • Generate 3 outputs at the chosen strength. The reference’s most important attribute (style / composition / subject — whichever you wanted) should be visible in all 3.
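  • In img2img, drop strength to 0.1 and confirm the output is nearly identical to the reference. This proves the reference is loaded. (With IP-Adapter the direction is reversed: raise the scale to its maximum instead.)
  • In img2img, raise strength to 0.99 and confirm the output diverges almost entirely. This proves the strength control is responsive.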
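If you would rather not judge “nearly identical” by eye, a crude pixel metric is enough for this sanity check. The sketch below works on flat sequences of 0-255 channel values; the function name mean_abs_diff is hypothetical and no particular threshold is implied.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference between two equal-length sequences of pixel values."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of values")
    if not a:
        raise ValueError("images are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    reference = [10, 200, 30, 40]
    low_strength_output = [12, 198, 30, 41]    # toy values, close to the reference
    high_strength_output = [240, 15, 180, 90]  # toy values, far from the reference
    print(mean_abs_diff(reference, low_strength_output))
    print(mean_abs_diff(reference, high_strength_output))
```

With Pillow, one way to get comparable sequences is to resize both images to the same size and call list(img.convert("L").getdata()) on each. The number at strength 0.1 should be much smaller than the number at 0.99; if the two are similar, either the reference is not loaded or the strength control is not wired up.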

Long-term prevention

  • Label your workflow with the conditioning mode in big text. Users (and you, a week later) forget.
  • Document the strength / scale values that work for your common use cases as named presets.
  • Keep reference images at the same aspect ratio as the intended output to avoid silent crops.
  • Use short text prompts when a strong reference is in play — let the reference carry style.
  • Stack ControlNet + img2img when you need both structure and color preservation.
  • Test new IP-Adapter or ControlNet versions with a controlled prompt before relying on them in production.
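Named presets can be as simple as a dictionary checked into the project. The preset names below are invented, and the values are midpoints of the ranges suggested in Step 1 and Step 7, not tuned recommendations.

```python
PRESETS = {
    "keep_composition": {"mode": "img2img", "strength": 0.5},
    "keep_style": {"mode": "ip-adapter", "scale": 0.8},
    "keep_structure": {"mode": "controlnet", "preprocessor": "canny", "weight": 0.8},
}


def load_preset(name: str) -> dict:
    """Return a copy of a named preset, failing loudly on typos."""
    try:
        return dict(PRESETS[name])
    except KeyError:
        raise KeyError(
            f"unknown preset {name!r}; choose from {sorted(PRESETS)}"
        ) from None


if __name__ == "__main__":
    print(load_preset("keep_style"))
```

Returning a copy keeps one experiment's tweaks from leaking into the shared defaults, and because the mode is part of every preset, the conditioning mechanism stays visible alongside the numbers.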

Common pitfalls

  • Using img2img at strength 0.9 and expecting the reference to survive. It will not.
  • Loading IP-Adapter but forgetting to set its scale parameter. Default may be 0.0 in some Comfy nodes.
  • Writing a 200-token text prompt with a reference and wondering why the reference is ignored.
  • Uploading a portrait-oriented reference for a landscape output — the pipeline center-crops and discards the subject.
  • Assuming “reference image” in the UI is one universal feature — three different mechanisms exist; pick the right one.
  • Using ControlNet pose detection on a reference where the pose is unclear (heavy clothing, occlusion). The detector outputs garbage, and the model follows garbage.

FAQ

Q: How is img2img different from IP-Adapter for style transfer?

img2img starts denoising from a noised copy of the reference image, so the output is shaped like the reference. IP-Adapter encodes the reference semantically through a CLIP image encoder and biases generation, so the output can have a completely different composition but matching style. For style transfer, IP-Adapter is better.

Q: My reference is photorealistic but output is illustration-style. Why?

The text prompt or checkpoint is overriding the reference style. Switch to a photorealistic checkpoint, shorten the prompt, and consider IP-Adapter at scale 0.85.

Q: Can I use multiple reference images?

Yes — IP-Adapter Plus and some Comfy workflows accept multiple. They blend, which may not be what you want. Stronger pattern: one reference for style + ControlNet pose from another reference.

Q: Does seed matter for reference-based generation?

Yes. Same seed + same reference + same prompt + same strength = same output, as long as the model, sampler, and software stack are also unchanged. Use the seed to lock variants and isolate the effect of strength changes.
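The principle is easy to see with a stand-in. The sketch below uses Python's standard random module in place of a real pipeline's noise generator (diffusers, for example, takes a seeded torch.Generator instead); the function name initial_noise is hypothetical.

```python
import random
from typing import List


def initial_noise(seed: int, n: int = 4) -> List[float]:
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    print(initial_noise(42) == initial_noise(42))  # same seed, same noise
    print(initial_noise(42) == initial_noise(43))  # different seed, different noise
```

Because the starting noise is identical for a fixed seed, any difference between two runs in a sweep can be attributed to the setting you changed.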
