Prompt Asks for "Best" Without Defining a Decision Rule

Run the same "what's best?" prompt three times, get three answers. Replace "best" with an axis, weights, and a tie-breaker to get one defensible pick.

Published: May 20, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You asked “what’s the best database for this project?” and got back a confident-sounding pick. Run the same prompt three times and you get three different “best” databases. Each pick has plausible reasoning. None of the reasoning anchors to your actual project.

Fastest fix: stop asking for “best”. Ask for “best on a named axis, given a threshold”. Replace "What's the best database?" with "What's the cheapest database that sustains 10k writes/min at p99 read latency under 100ms, for a 90/10 read/write, 100GB workload?". The axis (cheapest), the bound (p99 < 100ms), and the workload kill the ambiguity that produces a different answer every run.

The model is not malfunctioning. “Best” with no axis, no weights, and no tie-breaker collapses onto whatever option had the strongest positive associations in training data for the phrase “best X”, which skews toward whatever was trendy when the model was trained. On top of that, hosted LLM inference is not byte-for-byte reproducible even at temperature 0. Decoding is effectively greedy at that setting, but floating-point arithmetic is not associative and the order of operations shifts with how the server batches requests, so the same prompt can drift between runs (Thinking Machines: Defeating Nondeterminism in LLM Inference). So a vague “best” prompt is doubly unstable. The fix is to remove the ambiguity the model is forced to guess at, not to chase determinism you cannot get from the API.
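A minimal illustration of the floating-point part of that drift, in plain Python rather than a real inference stack: adding the same three numbers in a different order gives a slightly different result. The assumption here is only that a GPU kernel's summation order can change with batch size, which is the mechanism the cited article describes; the numbers are arbitrary.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one reduction order
right_to_left = a + (b + c)   # a different reduction order

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

A difference this small is harmless on its own, but when two candidate tokens are nearly tied it can be enough to flip which one is chosen, and every later token then follows a different path.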

Which bucket are you in?

Symptom	Likely cause	Go to
Prompt has no “cheapest / fastest / most reliable” word	No axis specified	Step 1
You listed 2+ goals (“fast and cheap and easy”) with no ranking	Multiple criteria, no weights	Step 2
Model picked AWS / OpenAI / Stripe with no support from your data	No constraints on what to ignore	Step 3
Re-runs swap the winner inside the top 2-3	No tie-breaker	Step 4
Same prompt in a fresh chat picks differently	Implicit criteria carried from earlier turns	Step 5

Common causes

1. No axis specified

“Best” along what axis? Cost, latency, scalability, team familiarity? You did not say. The model picks an axis it likes, usually the one with the strongest training-data sentiment.

How to spot it: your prompt has no axis word (cheapest, fastest, most reliable).

2. Multiple criteria, no weights

You named criteria — “should be fast and cheap and easy to maintain” — but did not say which matters most. The model picks the criterion with the strongest sentiment in its prior.

How to spot it: 2+ criteria listed with no order or weight.

3. No constraints on what to ignore

If you do not say “ignore vendor brand reputation” or “ignore popularity”, those factors quietly dominate. Brand recognition is a large implicit weight.

How to spot it: the model picked a default-popular option (AWS, OpenAI, Stripe, Postgres) without your input data supporting it.

4. No tie-breaker

When two options are close, the model picks the one mentioned first or the one with more presence in training data. That is effectively random, and it changes between runs because hosted inference is non-deterministic even at low temperature.

How to spot it: re-running the prompt picks different “winners” within the top 2-3.

5. Implicit criteria from prior turns

In a long conversation, an earlier turn established a framing (“we care about cost”) that you forgot. The model still anchors on it for “best”.

How to spot it: the same prompt in a new chat picks differently.

Before you change anything

Write down the single axis you actually care about.
If multiple, rank them and assign rough weights that sum to 100%.
Identify what factors you want the model to ignore.
Decide your tie-breaker rule.
For decisions you make often, store these as a reusable rubric.

Information to collect

The current prompt with “best” in it.
Outputs from 3 separate runs (use fresh chats so prior turns do not leak in).
Your actual project context (workload, constraints, team).
Past decisions you would consider “right” and why.
The model and any system prompt in play.

Shortest path to fix

Step 1: Replace “best” with “best on [axis]”

Bad:  "What is the best database for this?"
Good: "What is the cheapest database that handles 10k writes/min
       with p99 read latency under 100ms? Workload: 90% reads,
       10% writes, 3 KB average row, 100GB total."

The axis (cheapest), the threshold (10k writes/min, p99 < 100ms), and the workload spec eliminate the ambiguity that was causing the run-to-run drift.
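If you write this kind of prompt often, a small helper can refuse to build it when the axis or threshold is missing. The sketch below is one possible shape; `build_axis_prompt` and its parameters are hypothetical names, and the values are the ones from the example above.

```python
def build_axis_prompt(item: str, axis: str, thresholds: list[str], workload: list[str]) -> str:
    """Render a 'best on a named axis' question from explicit parts."""
    if not axis.strip():
        raise ValueError("name an axis (cheapest, fastest, most reliable)")
    if not thresholds:
        raise ValueError("give at least one threshold")
    lines = [f"What is the {axis} {item} that meets all of these thresholds?"]
    lines += [f"- {t}" for t in thresholds]
    lines.append("Workload: " + ", ".join(workload) + ".")
    return "\n".join(lines)


prompt = build_axis_prompt(
    item="database",
    axis="cheapest",
    thresholds=["handles 10k writes/min", "p99 read latency under 100ms"],
    workload=["90% reads", "10% writes", "3 KB average row", "100GB total"],
)
print(prompt)
```

The point of the guard clauses is that a prompt with no axis never gets sent, so the vague version cannot slip back in by habit.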

Step 2: Use a weighted rubric

Use this for multi-criteria decisions. Good rubric criteria are specific, measurable, and independent (no two criteria measuring the same thing), and the weights sum to 100%:

Criteria with weights (score each 1-5; 5 is always the best outcome):
- Cost: 40% (5=cheapest; hard cap $300/mo)
- p99 read latency: 30% (5=lowest latency; hard cap 100ms)
- Team familiarity: 20% (1=never used, 5=daily)
- Maintenance burden: 10% (1=self-hosted, 5=fully managed service)

For each option:
1. Score 1-5 on each criterion.
2. Multiply each score by its weight (as a fraction, so totals fall between 1.0 and 5.0).
3. Sum for a total score.
4. Pick the highest.
5. Show the full scoring table so I can recompute it.

This forces the model to show its reasoning instead of hiding it behind a confident sentence. It is the same structure teams use for rubric-based LLM evaluation: name orthogonal dimensions, weight them, and score each one.
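Recomputing the totals yourself takes a few lines. The sketch below uses the weights from the rubric above and assumes every criterion is scored so that 5 is the best outcome; the option names and their scores are made up for illustration.

```python
# Weights are percentages and must sum to 100 (values from the rubric above).
WEIGHTS = {"cost": 40, "p99_latency": 30, "familiarity": 20, "maintenance": 10}

# Hypothetical 1-5 scores for made-up options; higher is always better here.
SCORES = {
    "option_a": {"cost": 4, "p99_latency": 3, "familiarity": 5, "maintenance": 2},
    "option_b": {"cost": 3, "p99_latency": 5, "familiarity": 2, "maintenance": 4},
}


def weighted_total(scores, weights):
    """Return the weighted score on the same 1-5 scale as the inputs."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if set(scores) != set(weights):
        raise ValueError("score every criterion exactly once")
    return sum(scores[c] * w for c, w in weights.items()) / 100


totals = {name: weighted_total(s, WEIGHTS) for name, s in SCORES.items()}
winner = max(totals, key=totals.get)
print(totals)  # {'option_a': 3.7, 'option_b': 3.5}
print(winner)  # option_a
```

If the model's stated total disagrees with what this computes from its own table, trust the table and recomputed math, not the sentence announcing the winner.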

Step 3: Declare what to ignore

Ignore in your evaluation:
- Vendor brand reputation and general popularity.
- Whether the option is mentioned in HN/Reddit threads.
- Training-data recency bias (do not favor the newest option by default).
- Marketing claims; use only documented limits and published pricing.

Naming ignored factors prevents them from sneaking in as implicit weights.

Step 4: Declare a tie-breaker

Tie-breaker (use when the top 2 total scores are within 0.3 of each other on the 1-5 weighted scale):
1. Prefer the option whose team has the most senior contributor.
2. If still tied, prefer the option with the longer track record.
3. If still tied, return both with a one-line note on the trade-off.

A written tie-breaker is what makes a close call reproducible instead of coin-flip random.
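The same rules can be applied mechanically once you have the totals. This is a sketch under assumptions: `break_tie` is a hypothetical helper, "seniority" is read here as years of experience of the most senior contributor on each option, and the inputs are invented.

```python
def break_tie(totals, seniority, track_record, margin=0.3):
    """Return (picks, reason). `picks` has two names only if every rule ties."""
    ranked = sorted(totals, key=totals.get, reverse=True)
    first, second = ranked[0], ranked[1]
    # Round so float noise does not decide whether the gap is "within 0.3".
    if round(totals[first] - totals[second], 6) > margin:
        return [first], "clear winner on total score"
    # Apply the written rules in order; the first one that differs decides.
    for rule, values in (("seniority", seniority), ("track record", track_record)):
        if values[first] != values[second]:
            best = max((first, second), key=values.get)
            return [best], f"tie broken by {rule}"
    return [first, second], "still tied; report both with the trade-off"


picks, reason = break_tie(
    totals={"option_a": 3.7, "option_b": 3.5},
    seniority={"option_a": 2, "option_b": 2},         # years, hypothetical
    track_record={"option_a": 8, "option_b": 15},     # years, hypothetical
)
print(picks, reason)  # ['option_b'] tie broken by track record
```

Note that the option with the lower total wins here. That is the intended behavior of a tie-breaker: inside the margin, the written rule decides, not a 0.2-point gap that could come from scoring noise.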

Step 5: Ask for trade-offs as part of the answer

For the chosen option, list:
- 2 things this choice is worse at than the runner-up
- 1 thing you would lose if you were forced to switch later
- 1 unknown that could change the answer

Trade-offs separate “best” from “best given our constraints”.

Step 6: Build a reusable rubric file

For decisions you make often (vendor choice, library selection, architecture pattern), store the rubric in a file. Each new decision reuses the same rubric with new options, so the criteria and weights stay stable across decisions and across teammates.
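One way to store the rubric is a small JSON file that is validated on load. The layout below is a hypothetical example, not a standard: the field names (`decision_type`, `criteria`, `ignore`, `tie_break_margin`) and the file name are illustrative.

```python
import json
import tempfile
from pathlib import Path


def load_rubric(path):
    """Load a rubric file and reject ones that would score inconsistently."""
    rubric = json.loads(Path(path).read_text(encoding="utf-8"))
    weights = {c["name"]: c["weight"] for c in rubric["criteria"]}
    if len(weights) != len(rubric["criteria"]):
        raise ValueError("criterion names must be unique")
    if sum(weights.values()) != 100:
        raise ValueError(f"weights sum to {sum(weights.values())}, expected 100")
    return rubric


# Hypothetical file contents, using the criteria from Step 2.
example = {
    "decision_type": "database",
    "criteria": [
        {"name": "cost", "weight": 40},
        {"name": "p99_read_latency", "weight": 30},
        {"name": "team_familiarity", "weight": 20},
        {"name": "maintenance_burden", "weight": 10},
    ],
    "ignore": ["brand reputation", "popularity"],
    "tie_break_margin": 0.3,
}

# Write the example to a temporary file, then load it the way a teammate would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "database_rubric.json"
    path.write_text(json.dumps(example, indent=2), encoding="utf-8")
    rubric = load_rubric(path)

print([c["name"] for c in rubric["criteria"]])
```

Validating on load means an edited rubric whose weights no longer sum to 100 fails loudly before it shapes a decision.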

How to confirm the fix

The scoring table is reproducible: you can recompute the totals by hand and get the same winner.
Running the same prompt 3 times produces the same pick (with the rubric forcing the math, minor wording drift between runs no longer changes the winner).
The trade-off list names concrete weaknesses, not vibes.
A teammate following the same rubric reaches the same conclusion.
The decision is defensible in writing without re-asking the model.
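The first two checks can be scripted: recompute each run's totals from its scoring table and confirm the winner is the same. The tables below are hypothetical stand-ins for what three fresh-chat runs might return, with small score differences between runs.

```python
WEIGHTS = {"cost": 40, "latency": 30, "familiarity": 20, "maintenance": 10}


def recompute_winner(table, weights):
    """Recompute totals from a scoring table instead of trusting the model's sum."""
    totals = {
        option: sum(scores[c] * w for c, w in weights.items()) / 100
        for option, scores in table.items()
    }
    return max(totals, key=totals.get)


# Hypothetical scoring tables from three separate runs of the same prompt.
runs = [
    {"option_a": {"cost": 4, "latency": 3, "familiarity": 5, "maintenance": 2},
     "option_b": {"cost": 3, "latency": 5, "familiarity": 2, "maintenance": 4}},
    {"option_a": {"cost": 4, "latency": 4, "familiarity": 5, "maintenance": 2},
     "option_b": {"cost": 3, "latency": 5, "familiarity": 2, "maintenance": 4}},
    {"option_a": {"cost": 4, "latency": 3, "familiarity": 5, "maintenance": 2},
     "option_b": {"cost": 3, "latency": 5, "familiarity": 2, "maintenance": 3}},
]

winners = [recompute_winner(table, WEIGHTS) for table in runs]
stable = len(set(winners)) == 1
print(winners, stable)  # ['option_a', 'option_a', 'option_a'] True
```

If `stable` comes back false, look at which criterion's score moved between runs; that criterion is usually the one that still needs a unit or a cap.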

Note: if you need byte-identical model output across runs, no hosted API can give you that, which is why the rubric pins the decision rather than the prose. Identical text is not guaranteed even with a fixed seed and temperature 0: OpenAI’s seed is best-effort and tied to a system_fingerprint that changes when they update infrastructure (OpenAI Cookbook: reproducible outputs), and Anthropic’s API exposes no seed parameter at all, so Claude output stays non-deterministic even at temperature 0.

If it still fails

Criteria may still be too vague — make each one quantitative with a unit and a cap.
Add 1-2 worked examples of past decisions, with their scores, to anchor the scale.
For decisions with sensitive or private context, the model may simply lack the data — provide it inline.
Some decisions genuinely lack a single “best”. The right answer is “any of these top 3 is fine; here are the trade-offs”, and a good rubric will surface that as a near-tie rather than a fake winner.

FAQ

Q: Why do I get a different “best” every time I run the same prompt?

Two reasons stack up. First, “best” with no axis forces the model to guess which criterion you care about, and it guesses differently each time. Second, hosted LLM inference is non-deterministic even at temperature 0, because floating-point arithmetic is not associative and the order of operations changes with server-side batching, so the answer can drift run to run. A weighted rubric fixes the first cause and makes the second cause irrelevant to the final pick.

Q: Does setting temperature to 0 make the answer reproducible?

No. temperature 0 reduces but does not remove variation. The same prompt can still produce different tokens across runs and across GPU types. Pin the decision with a rubric instead of trying to pin the text.

Q: How many criteria should a rubric have?

Usually 3 to 5. Fewer than 3 and “best” is barely more defined than before; more than 5 and the weights get noisy and criteria start double-counting. Keep each criterion independent and measurable, with weights summing to 100%.

Q: What if I don’t know the right weights yet?

Ask the model to propose a rubric first (“propose 4 weighted criteria for choosing a database for this workload”), then edit the weights to match what you actually value before asking for the pick. You review and own the rubric; the model just drafts it.

Q: Should I just ask the model to “be objective”?

No. “Be objective” is another undefined instruction. Objectivity in a decision comes from a written rubric with weights and a tie-breaker, not from a vague phrase the model can interpret however it likes.

Prevention

Default: convert every “best” prompt into a weighted rubric before you send it.
Keep one rubric file per decision type (vendor, architecture, library, tool).
For team decisions, agree on the rubric before running the prompt.
Audit “best” decisions monthly: were they reproducible? If not, the rubric is missing a criterion or a tie-breaker.
Replace “best” / “most” / “top” with axis-specific language as a writing reflex.
When in doubt, ask the model to propose a rubric first, then review and adopt it before asking for the pick.

Tags: #Troubleshooting #Prompt #Prompt quality #Prompt engineering