Model Fills Missing Details With Unwanted Assumptions

You gave a partial spec; the model invented the rest to look complete.

You asked for a quarterly summary of a customer’s account activity. You provided login dates and ticket counts. The model returned a tidy summary that included the customer’s “renewal date in Q3”, their “primary contact at the company”, and an “estimated revenue impact of approximately $42k”. You never gave it any of those. They sound right. They are all invented. The model interpreted “write a quarterly summary” as “produce a complete-looking quarterly summary” and completeness, by training-distribution norms, requires renewal dates and contact names. So it filled them.

This page walks through why models default to gap-filling and how to make “I do not have this detail” a first-class output state.

Common causes

1. Prompt asks for “complete”, “comprehensive”, or “full”

These words mean “leave no gaps”. The model takes them literally and invents gaps closed.

How to spot it: your prompt verb or adjective implies completeness.

2. Output schema has required fields with no missing-data path

If your schema says { "renewal_date": string } with no null option, the model has to put a string there. It will, true or not.

How to spot it: schema has no explicit null / unknown handling.

3. No rule for “when unsure”

You never told the model what to do when it does not know. The default behavior is “produce something plausible”.

How to spot it: prompt has no if unknown: rule.

4. Common-sense gap-filling

For very common patterns (employee names, dates, addresses), models fill from prior-training expectations of “what this kind of record looks like”. It is not random — it is statistical priors.

How to spot it: invented details are plausible and follow common patterns (e.g., “Q3” for renewal, “John Smith” for a contact).

5. Output format encourages completeness

Tables especially: an empty cell looks wrong, so the model fills it. Bullet lists are similar — half a bullet list reads as incomplete.

How to spot it: format is table, structured list, or schema-driven.

Before you change anything

  • Identify which details in the output you never supplied.
  • Save the invented output for comparison.
  • Decide the policy: should missing data be UNKNOWN, null, blank, or an error?
  • Check your schema for required fields that have no missing-data path.
  • Decide whether the workflow needs the missing data, or whether marking it missing is sufficient.

Information to collect

  • The original input data (what you actually gave).
  • The output the model produced.
  • Specific invented details, marked.
  • Your schema (if any) and required fields.
  • Model and temperature.

Shortest path to fix

Step 1: Explicit “if unknown” rule

Rules for missing data:
- If a required detail was not provided in the input, output "UNKNOWN".
- Do not infer. Do not estimate.
- Do not write phrases like "approximately", "around", "roughly" unless
  the input contains a number.
- If more than 3 fields would be UNKNOWN, stop and ask for the missing data.

This single block solves 70% of gap-filling.

Step 2: Schema with explicit null handling

Schema:
{
  "renewal_date": "<ISO date OR null>",
  "primary_contact": "<name OR null>",
  "revenue_impact_usd": "<number OR null>",
  "data_gaps": ["<field name that was null and why>"]
}

The null option + data_gaps array makes “missing” a first-class output.

Step 3: Make the model list assumptions before producing

Step 1: List every assumption you would need to make to produce
        a complete answer. Number them.
Step 2: For each, mark whether the input data supports it (YES) or
        whether you would be inventing it (NO).
Step 3: Produce the answer using only YES assumptions. For NO
        assumptions, write UNKNOWN in the output.

This converts implicit fabrication into an explicit, audit-able step.

Step 4: Use few-shot examples that include UNKNOWN

Show the model what acceptable “I do not know” output looks like:

Example 1:
Input: Login dates only
Output:
{
  "renewal_date": null,
  "primary_contact": null,
  "revenue_impact_usd": null,
  "data_gaps": ["renewal_date (not in input)", "primary_contact (not in input)", "revenue_impact_usd (not in input)"]
}

Now produce for: <real input>

Step 5: Forbid filler vocabulary

Forbidden phrases (do not use unless input contains evidence):
- "approximately", "around", "roughly", "likely", "estimated"
- "based on industry norms"
- "typical", "average", "standard"

These phrases are the linguistic signature of gap-filling. Banning them forces the model to either provide evidence or say UNKNOWN.

Step 6: Have the model verify its output against the input

Append:

After producing, list each non-UNKNOWN claim with the exact input line
that supports it. If you cannot point to a supporting input line,
the claim must become UNKNOWN.

Verification step catches the rest.

How to confirm the fix

  • Spot-check 5 outputs: every specific claim has a supporting line in the input.
  • The data_gaps array is non-empty when input is partial.
  • “UNKNOWN” appears in outputs where it should.
  • Filler vocabulary count is 0.
  • A teammate reviewing the output cannot point to any “wait, where did that come from” detail.

If it still fails

  1. The model may need temperature 0 for strict factuality — try lowering it.
  2. Switch to structured output mode (JSON schema, tool use) — schema enforcement catches invented fields.
  3. Add a separate verification pass: prompt 1 produces, prompt 2 verifies each claim against input.
  4. For high-stakes work, use retrieval (RAG) so the model has explicit sources; require source citation per claim.

Prevention

  • Default rule in every prompt: “If unknown, output UNKNOWN. Do not infer.”
  • Mark required vs optional fields in schemas; require explicit UNKNOWN for missing required.
  • Forbid filler vocabulary in prompts where factuality matters.
  • Audit production outputs for invented specifics monthly.
  • For repeated workflows, build a verification step into the pipeline.
  • Treat “completeness” as a separate concern from “correctness” — sometimes incomplete output is the correct output.

Tags: #Troubleshooting #Prompt #Prompt quality #Hallucination