Fix AI Image Wrong Perspective or Scale

Flat tables, sideways stairs, oversized heads. Open the prompt with a 5-7 word camera block (focal length + viewpoint + perspective style). Works in Midjourney, ChatGPT Images, Nano Banana, and ControlNet.

Published: May 21, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

You generated a kitchen scene and the table is flat against the camera — no horizon, no depth. Or the staircase goes “up” but the perspective lines point sideways. Or a person’s head is half the height of their body. The content is right, but the space makes no sense.

Fastest fix: add a 5-7 word camera block to the front of your prompt, in the form [focal length] + [viewpoint] + [perspective style]. For example: "wide-angle 24mm, eye level, two-point perspective, ...". That single move corrects 60-70% of perspective and scale failures across Midjourney, ChatGPT Images, and Gemini’s Nano Banana. The rest of this page covers the other buckets and a hard-lock option (ControlNet Depth) for architecture.

Perspective and scale break because the prompt gives the model no spatial anchor — no lens, no viewpoint, no perspective style. Without those, the model averages across thousands of possible camera setups and renders a confused middle.

Which bucket are you in?

| Symptom in the image | Most likely cause | Jump to |
| --- | --- | --- |
| Environment looks flat, no depth or horizon | No focal length / lens word | Cause 1 |
| Composition is dead-on flat, no angle | No viewpoint word | Cause 2 |
| Objects sit at impossible relative sizes | Too many objects competing for scale | Cause 3 |
| Subject is cropped or stretched awkwardly | Aspect ratio fights the subject | Cause 4 |
| Geometry is self-contradictory (looks “wrong” in a way you can’t name) | Conflicting perspective cues | Cause 5 |
| Lines bend, melt, or refuse to converge | Style word that breaks perspective | Cause 6 |

Common causes

Ordered by hit rate, highest first.

1. No focal length specified

Without a lens word — wide-angle, telephoto, 24mm, 85mm — the model picks a generic mid-focal length. Fine for close-ups, but it flattens or distorts environments. Specifying a focal length forces the model to imitate real optical physics: wide lenses (16-24mm) exaggerate depth and stretch edges; long lenses (85mm+) compress depth and isolate the subject. Google’s own Nano Banana prompting guide (June 2026) recommends exactly this — wide-angle lens “to show a vast scale,” macro lens “for intricate details” — so the technique is current across every major model, not just Stable Diffusion.

How to spot it: prompt has no lens / focal length word. Add one.
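
To build intuition for what each lens word evokes, you can work through the standard angle-of-view formula for a rectilinear lens, FOV = 2 * atan(sensor width / (2 * focal length)). The sketch below assumes a full-frame sensor 36 mm wide; image models do not simulate a sensor, so treat the numbers as a guide to the look you are asking for, not as something the model computes.

```python
import math

FULL_FRAME_WIDTH_MM = 36.0  # assumption: full-frame sensor width


def horizontal_fov_degrees(focal_length_mm: float,
                           sensor_width_mm: float = FULL_FRAME_WIDTH_MM) -> float:
    """Horizontal angle of view for a rectilinear lens focused at infinity."""
    if focal_length_mm <= 0:
        raise ValueError("focal length must be positive")
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))


# Compare the focal lengths mentioned in this section.
for focal in (16, 24, 50, 85):
    print(f"{focal}mm -> {horizontal_fov_degrees(focal):.1f} degrees")
```

Under that assumption a 24mm lens covers roughly 74 degrees horizontally and an 85mm lens roughly 24 degrees, which is why the wide lens word pulls in the whole room and the long one isolates the subject.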

2. No viewpoint specified

Pick one viewpoint: eye level, low angle, high angle, bird's eye view, or dutch angle. Without one, the model defaults to a flat eye-level view that often kills depth.

How to spot it: prompt has no viewpoint word.

3. Too many objects fighting for depth

Each object carries its own implied scale. A vase, chair, sofa, window, plant, and cat each imply a different distance. The model gets confused trying to compose them consistently and breaks perspective. This is the one cause that camera words alone will not fix.

How to spot it: prompt names more than 5 distinct objects.
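
If you keep a scene's objects as a priority-ordered list before writing the prompt, trimming becomes mechanical. This is a minimal sketch: `trim_objects` and `MAX_OBJECTS` are hypothetical names, and the cap of 5 is an assumption you should tune for your own scenes.

```python
MAX_OBJECTS = 5  # assumed cap; tune it for your own scenes


def trim_objects(objects_by_priority: list[str],
                 limit: int = MAX_OBJECTS) -> tuple[list[str], list[str]]:
    """Split a priority-ordered object list into (kept, dropped)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return objects_by_priority[:limit], objects_by_priority[limit:]


# Most important objects first; everything past the limit is dropped.
kitchen = ["table", "four chairs", "fridge", "stove", "microwave",
           "coffee maker", "blender", "sink", "window", "plant"]
kept, dropped = trim_objects(kitchen, limit=3)
print("keep:", ", ".join(kept))
print("drop:", ", ".join(dropped))
```

Ordering by priority is the design choice that matters here: the helper cannot know which objects carry the scene, so you decide that before it cuts.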

4. Aspect ratio mismatch to subject

A wide environment shot forced into 9:16 vertical makes the model crop perspective awkwardly; a tall figure forced into 16:9 does the opposite. As of June 2026, Midjourney v7/v8 supports any ratio between 1:2 and 2:1, ChatGPT Images 2.0 supports 3:1 down to 1:3, and Nano Banana Pro supports 21:9, 16:9, 4:5, 9:16 and more — so there is no excuse to fight the frame.

How to spot it: the aspect ratio is the opposite of what the subject naturally suggests.

5. Conflicting perspective cues

"bird's eye view of a person standing tall, looking up at the camera"

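"bird's eye view" puts the camera overhead, looking down, which foreshortens a standing figure to little more than head and shoulders. "standing tall" asks for the figure's full height, which needs a level or low camera. The two cues contradict, so the model picks one or averages both into mush.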

How to spot it: the viewpoint words and the subject-angle words disagree.

6. Style anchor that breaks normal perspective

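cubist, surreal, mc escher, dali, isometric, axonometric: these style words either distort linear perspective on purpose (cubist, surreal, mc escher, dali) or replace it with a parallel projection that has no vanishing point (isometric, axonometric). If one slipped into your prompt, that is why the geometry looks impossible.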

How to spot it: a style word in your prompt evokes broken perspective.
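
The "how to spot it" checks for Causes 1, 2, and 6 are simple word searches, so they can be sketched as a small prompt linter. The word lists below are illustrative, taken from this page rather than from any tool's documentation, and `lint_prompt` is a hypothetical helper. It matches exact words only, so "surrealist" would not trigger the "surreal" check.

```python
import re

# Illustrative word lists drawn from the causes above; extend them as needed.
LENS_PATTERN = re.compile(r"\b(\d{2,3}mm|wide-angle|telephoto|macro lens)\b")
VIEWPOINT_WORDS = ("eye level", "low angle", "high angle", "bird's eye view",
                   "worm's eye view", "dutch angle", "three-quarter view",
                   "top-down")
PERSPECTIVE_BREAKING_STYLES = ("cubist", "surreal", "mc escher", "dali",
                               "isometric", "axonometric")


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for Cause 1 (lens), Cause 2 (viewpoint), Cause 6 (style)."""
    # Normalize case and curly apostrophes so "Bird’s Eye View" still matches.
    text = prompt.lower().replace("\u2019", "'")
    warnings = []
    if not LENS_PATTERN.search(text):
        warnings.append("cause 1: no focal length or lens word")
    if not any(word in text for word in VIEWPOINT_WORDS):
        warnings.append("cause 2: no viewpoint word")
    styles = [s for s in PERSPECTIVE_BREAKING_STYLES
              if re.search(rf"\b{re.escape(s)}\b", text)]
    if styles:
        warnings.append("cause 6: perspective-breaking style: " + ", ".join(styles))
    return warnings


print(lint_prompt("a kitchen with a table"))
print(lint_prompt("wide-angle 24mm, eye level, two-point perspective, a kitchen"))
```

An empty list means none of the three checks fired; it does not prove the prompt is free of the other causes.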

Shortest path to fix

Step 1: Add a camera block at the prompt opening

Template:

[focal length] + [viewpoint] + [perspective style] + [your subject]

Examples:

# Architecture / environment
"wide-angle 24mm, eye level, two-point perspective, ..."

# Portrait
"85mm portrait lens, eye level with subject, shallow depth f/1.8, ..."

# Cinematic landscape
"35mm anamorphic wide, low angle from ground, three-point perspective, ..."

# Top-down product / flat-lay
"top-down shot 90 degrees overhead, no perspective, flat, ..."

# Interior architecture
"24mm wide, eye level standing 5ft above floor, two-point perspective, ..."

This single addition fixes 60-70% of perspective issues. Put it at the front: in most current models, leading tokens tend to carry more weight.
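
If you assemble prompts in a script, the template can be enforced so the camera block always lands first. A minimal sketch, assuming you pass the three camera parts and the subject as separate strings; `build_prompt` is a hypothetical helper, not part of any tool's API.

```python
def build_prompt(focal_length: str, viewpoint: str,
                 perspective: str, subject: str) -> str:
    """Join the camera block and the subject, camera block first."""
    parts = [focal_length, viewpoint, perspective, subject]
    # Strip stray spaces and commas so the joined prompt stays clean.
    cleaned = [part.strip(" ,") for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the template needs a value")
    return ", ".join(cleaned)


prompt = build_prompt("wide-angle 24mm", "eye level", "two-point perspective",
                      "a sunlit kitchen with a wooden dining table")
print(prompt)
```

Raising on an empty part is deliberate: a silently missing lens or viewpoint word is the failure this step exists to prevent.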

Step 2: Pick ONE viewpoint and commit

eye level             — natural, default
low angle             — looking up, heroic
high angle            — looking down, vulnerable
bird's eye view       — straight down or near-vertical
worm's eye view       — straight up
dutch angle           — tilted camera, dynamic
three-quarter view    — 30-45 degree offset
straight-on           — flat, head-on, no perspective
isometric             — engineering-style flat angled

Don’t combine incompatible ones (see Cause 5).
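
One way to catch incompatible combinations before you generate is to group viewpoint words by where they put the camera and flag any prompt that lands in more than one group. The grouping below is an assumption based on the list above. `dutch angle` and `three-quarter view` are left out because they describe camera roll and horizontal offset, which can combine with any height.

```python
# Assumed grouping of the viewpoint words by camera direction.
CAMERA_DIRECTION = {
    "looks down": ("high angle", "bird's eye view", "top-down", "overhead"),
    "looks up": ("low angle", "worm's eye view"),
    "level": ("eye level", "straight-on"),
}


def find_viewpoint_conflicts(prompt: str) -> dict[str, list[str]]:
    """Return the conflicting groups and their matched terms, or {} if consistent."""
    # Normalize case and curly apostrophes before matching.
    text = prompt.lower().replace("\u2019", "'")
    hits = {}
    for direction, terms in CAMERA_DIRECTION.items():
        found = [term for term in terms if term in text]
        if found:
            hits[direction] = found
    # Only one camera direction (or none) means there is nothing to resolve.
    return hits if len(hits) > 1 else {}


print(find_viewpoint_conflicts("bird's eye view of a plaza, low angle"))
print(find_viewpoint_conflicts("low angle, dutch angle, a skyscraper"))
```

When the function returns a non-empty result, keep the word that matches the shot you want and delete the rest.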

Step 3: Drop the object count if the scene is busy

Cap the prompt at 3-5 named objects. Fewer objects, more consistent perspective.

# Before — too many
"a kitchen with a table, four chairs, fridge, stove, microwave, coffee maker, blender, sink, window, plant"

# After — focused
"a kitchen scene featuring a wooden dining table with four chairs in soft morning light"

Step 4: Match aspect ratio to the scene type

- Wide landscape / environment   -> 16:9 or 21:9 (landscape)
- Standing person / portrait      -> 4:5 or 9:16 (portrait)
- Flat-lay / top-down             -> 1:1 (square)
- Architecture, wide              -> 21:9 (cinematic)
- Architecture, tall              -> 9:16 (vertical)

In Midjourney add --ar 16:9; in ChatGPT Images and Nano Banana, state the ratio in plain words (for example “16:9 landscape”). If a Midjourney output snaps back to square, you are on a pre-v5 setting or used a decimal ratio like 9:19.5 — use whole numbers.
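
The mapping above and the whole-number rule can be combined in a few lines. A sketch under stated assumptions: the scene-type keys and the `tool` argument are hypothetical labels, each scene type is mapped to the first ratio the list suggests, and only the `--ar` flag itself comes from Midjourney.

```python
# Assumed scene-type labels, each mapped to the first ratio suggested above.
SCENE_RATIOS = {
    "landscape": "16:9",
    "portrait": "4:5",
    "flat-lay": "1:1",
    "architecture-wide": "21:9",
    "architecture-tall": "9:16",
}


def validate_ratio(ratio: str) -> str:
    """Return the ratio unchanged if it is two positive whole numbers."""
    width, sep, height = ratio.partition(":")
    if not (sep and width.isdecimal() and height.isdecimal()
            and int(width) > 0 and int(height) > 0):
        raise ValueError(f"use whole numbers, got {ratio!r}")
    return ratio


def ratio_for(scene_type: str, tool: str) -> str:
    """Format the ratio as a Midjourney flag or as plain words for other tools."""
    ratio = validate_ratio(SCENE_RATIOS[scene_type])
    if tool == "midjourney":
        return f"--ar {ratio}"
    width, height = (int(n) for n in ratio.split(":"))
    orientation = ("square" if width == height
                   else "landscape" if width > height else "portrait")
    return f"{ratio} {orientation}"


print(ratio_for("landscape", "midjourney"))
print(ratio_for("portrait", "chatgpt-images"))
```

`validate_ratio` rejects a decimal ratio such as "9:19.5" up front, so the mistake surfaces in your script rather than as a square image.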

Step 5: Add explicit perspective-style words

two-point perspective     — standard architectural
one-point perspective     — single vanishing point, hallway / road
three-point perspective   — looking up/down at tall buildings
forced perspective        — exaggerated depth
isometric perspective     — engineering, no vanishing point
no perspective            — flat, top-down or straight-on
deep depth of field       — everything in focus, perspective visible

Step 6: Use ControlNet Depth for strict perspective

When you have a reference image with the exact perspective you want, stop fighting the prose and lock it geometrically:

# ComfyUI (June 2026)
1. Install comfyui_controlnet_aux for the depth preprocessor
2. Load a Depth (or LineArt) ControlNet:
   - SDXL: xinsir ControlNet Union (one model, 10+ control types)
   - Flux: Shakker Labs ControlNet Union Pro 2.0
3. Feed the reference image with the perspective you want
4. Strength 0.6-0.8 — locks perspective, model fills content

SDXL ControlNets do not load on Flux and vice versa — match the ControlNet to your base model. For precise architectural perspective, a quick 3D render from SketchUp or Blender makes an ideal Depth input.
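
The two rules that most often trip people up here, matching the ControlNet to the base model and staying in the 0.6-0.8 strength range, can be written as a pre-flight check. This is plain Python, not ComfyUI's API; `DepthSetup` and `check_depth_setup` are hypothetical names for illustration only.

```python
from dataclasses import dataclass

RECOMMENDED_STRENGTH = (0.6, 0.8)  # range given in the steps above


@dataclass(frozen=True)
class DepthSetup:
    base_model: str        # for example "sdxl" or "flux"
    controlnet_for: str    # base model family the ControlNet targets
    strength: float = 0.7


def check_depth_setup(setup: DepthSetup) -> list[str]:
    """Return a list of problems; an empty list means the setup looks sane."""
    problems = []
    if setup.base_model.lower() != setup.controlnet_for.lower():
        problems.append(
            f"ControlNet built for {setup.controlnet_for} "
            f"will not load on {setup.base_model}")
    low, high = RECOMMENDED_STRENGTH
    if not low <= setup.strength <= high:
        problems.append(f"strength {setup.strength} is outside {low}-{high}")
    return problems


print(check_depth_setup(DepthSetup("sdxl", "sdxl", 0.7)))
print(check_depth_setup(DepthSetup("flux", "sdxl", 0.95)))
```

A strength outside the range is reported as a problem, not clamped, because a deliberately loose or tight lock is sometimes what you want.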

How to confirm it’s fixed

Horizon test: can you point to where the floor meets the wall, or where ground meets sky? If yes, depth is back.
Vanishing-point test: trace two parallel lines (table edges, floorboards, ceiling). They should converge toward a consistent point, not run parallel or diverge.
Head-to-body test: for figures, the head should be roughly 1/7 to 1/8 of standing height. If it’s near half, scale is still broken — cut object count (Step 3) or switch to a portrait aspect ratio.
Re-roll once: generate two or three variations with the same fixed prompt. If perspective is now consistent across all of them, the camera block took; if only one is right, you got lucky and a cue still conflicts.
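
The vanishing-point and head-to-body tests can be made numeric if you trace a few points by hand in an image editor. The sketch below assumes pixel coordinates you measured yourself; the function names, the 25-pixel tolerance, and the 0.01 ratio slack are all assumptions. Feed it receding edges only, since edges that run across the frame stay parallel even in a correct one-point view.

```python
import math
from itertools import combinations

Point = tuple[float, float]
Line = tuple[Point, Point]


def line_intersection(p1: Point, p2: Point, p3: Point, p4: Point):
    """Intersection of line p1-p2 with line p3-p4, or None if parallel."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-9:
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / denom,
            (a * (y3 - y4) - (y1 - y2) * b) / denom)


def lines_converge(lines: list[Line], tolerance_px: float = 25.0) -> bool:
    """True if every pair of traced edges meets near the same point."""
    if len(lines) < 2:
        raise ValueError("trace at least two edges")
    points = []
    for (a, b), (c, d) in combinations(lines, 2):
        point = line_intersection(a, b, c, d)
        if point is None:  # edges run parallel in the image
            return False
        points.append(point)
    return all(math.dist(p, q) <= tolerance_px
               for p, q in combinations(points, 2))


def head_ratio_ok(head_px: float, body_px: float, slack: float = 0.01) -> bool:
    """True if head height is roughly 1/8 to 1/7 of standing height."""
    return 1 / 8 - slack <= head_px / body_px <= 1 / 7 + slack
```

With two edges the check only confirms that they are not parallel; trace three or more to test whether they agree on a single vanishing point.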

Prevention

Open every prompt with a 5-7 word camera block: focal length + viewpoint + perspective style.
Default focal lengths: 24mm wide for environments, 50mm normal for half-body, 85mm for portrait.
Match aspect ratio to the scene type (landscape for environments, portrait for people).
For architectural or product work, make ControlNet Depth your default rather than a rescue.

FAQ

Q: Do camera and lens words still work in 2026 models, or only old Stable Diffusion?

They work everywhere. Google’s Nano Banana guide and Midjourney’s v7/v8 docs both recommend explicit lens, f-stop, and angle terms. The newer “thinking” models (ChatGPT Images 2.0, Nano Banana Pro) actually honor them more reliably than 2024-era models did.

Q: My head-to-body ratio is still off after adding a camera block. Why?

That is usually object overload (Cause 3) or a portrait shot squeezed into a landscape frame (Cause 4), not a lens problem. Cut to 3-5 objects and switch to a 4:5 or 9:16 ratio before touching the lens word.

Q: Which tool handles perspective and geometry best right now?

As of June 2026, ChatGPT Images 2.0 (gpt-image-2) leads on technical geometry: layouts, architecture, infographics. Nano Banana Pro wins on photo edits and detail retention but still stumbles on hard geometry. For pixel-exact perspective, neither beats ControlNet Depth on a local SDXL or Flux pipeline.

Q: Why does my image keep coming out flat no matter what I write?

Two likely causes: you put the camera words at the end of a long prompt (move them to the front), or you used straight-on / no perspective / flat-lay without meaning to. Swap in an angled viewpoint like three-quarter view plus two-point perspective.

Q: Can I fix perspective without regenerating the whole image?

Partly. Nano Banana Pro and ChatGPT Images 2.0 can do targeted edits, but they rebuild geometry from scratch, so perspective is not reliably preserved. For a true edit-in-place that keeps geometry, run the original through ControlNet Depth at strength 0.7.
