Local Model Ignores the Tool-Calling Format

A local LLM writes tool names in prose instead of returning structured JSON, or ignores the tools list entirely. The usual fixes are a tool-capable model, the --jinja flag for llama-server, and Ollama's format JSON-schema constraint.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You send a prompt to a local Ollama or llama-server endpoint with a tools list in the OpenAI /v1/chat/completions format, and a prompt that clearly needs one of those tools. Instead of returning a structured tool_calls array like {"tool_calls": [{"function": {"name": "search_web", "arguments": {"query": "..."}}}]}, the model writes “I’ll use the search_web function with query…” in plain prose and then hallucinates an answer. Or it emits broken JSON with a missing bracket or a wrong key.

Fastest fix (covers most cases): pick a model actually trained for tool calling (as of June 2026, Qwen3 is the most reliable local pick, followed by GPT-OSS and Llama 3.3), run it on a current runtime (Ollama v0.30.x or a recent llama-server build), and start llama-server with the --jinja flag. The single most common cause is llama-server started without --jinja, which silently disables tool-call parsing. The second is using a model that was never fine-tuned for tool calling.

This is almost never “the model isn’t smart enough.” It’s that the tool-call format in the chat template doesn’t match what the model was fine-tuned to emit, or the runtime isn’t constraining the output.

Which bucket are you in?

Symptom	Most likely cause	Jump to
Prose every time, never any JSON	`llama-server` missing `--jinja`, or model not tool-trained	Causes 1, 2, 3
JSON sometimes, prose other times	Temperature too high / no constraint	Cause 5, Step 3
JSON appears but keys/brackets are wrong	Quant too low, or schema not enforced	Cause 7, Step 3
Tool always ignored, even with obvious prompt	Tools not injected into the prompt	Cause 4
Right shape, wrong tool or wrong args	Schema too complex for a small model	Cause 6

Common causes

Ordered by hit rate, highest first.

1. `llama-server` started without `--jinja`

As of mid-2026, llama.cpp’s llama-server only does OpenAI-style tool calling when you pass --jinja. Without it, the server ignores the tools array, applies a plain chat template, and the model answers in prose. This is the single most common cause for llama-server users.

How to spot it: Look at the exact command that launched llama-server. If --jinja is absent, that is your problem. (Ollama applies tool templates automatically, so this cause is llama-server-specific.)
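If the launch command lives in a script or a process listing, the check can be automated. The following is a minimal sketch; the helper name launch_has_jinja is hypothetical, and it only inspects a command string you hand it (it does not query a running server).

```python
import shlex


def launch_has_jinja(command_line: str) -> bool:
    """Return True if --jinja appears as its own argument in the command."""
    return "--jinja" in shlex.split(command_line)


# Two example launch commands (illustrative, not taken from a real system).
print(launch_has_jinja("llama-server --jinja -fa -m model.gguf --port 8080"))
print(launch_has_jinja("llama-server -m model.gguf --port 8080"))
```

Splitting with shlex rather than searching the raw string avoids a false positive when "--jinja" only appears inside a quoted path or another argument's value.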

2. Model not fine-tuned for tool calling

Not every instruction-tuned model supports tool calling. Base models and many general instruction fine-tunes have no concept of emitting structured tool-call JSON. Only models explicitly trained with function-calling data will reliably produce it.

As of June 2026, strong local tool-callers include Qwen3 (lowest dropped-call rate in independent benchmarks), GPT-OSS (20B is stable, 120B is top-tier), Llama 3.3, Gemma 4 (native function calling in the weights), and the older but still solid Llama 3.1, Mistral Nemo, Qwen2.5, Hermes 3, and Functionary v3.x.

How to spot it: On Ollama, browse the tool-capable model list — if your model isn’t tagged tools, it won’t follow the format. On HuggingFace, check the model card for “function calling,” “tool use,” or “tools.”
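The same check can be scripted against a local Ollama server. This is a sketch under two assumptions that you should verify on your build: that POST /api/show returns a capabilities list containing "tools" for tool-capable models, and that older builds without that field expose tool support through a template that references .Tools. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_show(model: str, host: str = "http://localhost:11434") -> dict:
    """POST /api/show for one model and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def supports_tools(show_body: dict) -> bool:
    """Check the (assumed) capabilities list, then fall back to the template."""
    if "tools" in show_body.get("capabilities", []):
        return True
    # Assumption: Ollama Go templates reference `.Tools` when they render tools.
    return ".Tools" in show_body.get("template", "")


# Offline demonstration with hand-written bodies, not real server output.
print(supports_tools({"capabilities": ["completion", "tools"]}))
print(supports_tools({"capabilities": ["completion"], "template": "{{ .Prompt }}"}))
```

With a server running, supports_tools(fetch_show("qwen3")) would perform the live check.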

3. Wrong tool-calling template applied

Each family encodes tool calls differently: Mistral uses [TOOL_CALLS] tokens, Llama 3.x uses <|python_tag|> plus a JSON block, and Qwen and Hermes wrap the call in <tool_call> tags. If the runtime applies the wrong template (for example a Llama 2 template on a Mistral model), the model never sees its trained tool-call tokens and falls back to prose.

llama-server ships a PEG-based autoparser that recognizes the native formats for Llama 3.1/3.2/3.3, Functionary v3.1/v3.2, Hermes 2/3, Qwen 2.5, Mistral Nemo, FireFunction v2, Command R7B, and DeepSeek R1 (WIP). For anything it doesn’t recognize, it falls back to a generic JSON format that works but costs more tokens. Some templates (DeepSeek R1) need an explicit override file.

How to spot it: For Ollama, run ollama show modelname --modelfile | grep -A20 TEMPLATE and compare the tool-call token format against the model’s HuggingFace tokenizer_config.json chat_template. For llama-server, if the model isn’t on the recognized list, pass --chat-template-file with a tool-aware Jinja template.
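A quick way to tell a template or parser mismatch from a model that simply answers in prose is to look for raw tool-call markers leaking into the message content. The sketch below uses the marker strings named above; the function names and the three-way classification are illustrative assumptions, not part of either runtime.

```python
# Marker strings per model family, as described in the text above.
TOOL_MARKERS = {
    "mistral": "[TOOL_CALLS]",
    "llama3": "<|python_tag|>",
    "qwen": "<tool_call>",
}


def leaked_tool_markers(content: str) -> list[str]:
    """Return the families whose raw tool-call marker appears in the text."""
    return sorted(fam for fam, marker in TOOL_MARKERS.items() if marker in content)


def diagnose(message: dict) -> str:
    """Classify one assistant message from an OpenAI-style response."""
    if message.get("tool_calls"):
        return "parsed"
    leaked = leaked_tool_markers(message.get("content") or "")
    if leaked:
        # The model emitted its trained format, but the runtime did not parse it.
        return "unparsed tool call (" + ", ".join(leaked) + " format)"
    return "prose"


print(diagnose({"content": '<tool_call>{"name": "get_weather"}</tool_call>'}))
print(diagnose({"content": "I'll use the get_weather function."}))
```

An "unparsed tool call" result points at the runtime side (template or parser), while "prose" points back at the model or at tools missing from the prompt.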

4. Tools list not injected into the prompt

Some serving stacks (older builds, custom wrappers) accept the tools parameter in the API call but never render the tool definitions into the actual prompt. The model never sees the available tools and has no basis for calling one.

How to spot it: Turn on debug logging and read the full rendered prompt. For llama-server, start with --verbose; for Ollama, run OLLAMA_DEBUG=1 ollama serve. If the tool names, descriptions, and parameter schemas are not in the logged prompt, they aren’t being injected.

5. Temperature too high, sampling away from structured output

At temperature 0.8+, even a well-configured tool-calling model can drift from valid JSON because high-entropy sampling occasionally picks a token that breaks syntax. Tool calling wants low-temperature or constrained sampling.

How to spot it: Re-run the same prompt at temperature=0. If structured output appears reliably at 0, temperature was the cause.
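To see why temperature matters here, consider the first token of the reply, where the model chooses between opening a JSON structure and starting a sentence. The sketch below applies temperature-scaled softmax to three made-up logits; the numbers are hypothetical and only illustrate the trend, not any particular model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the first generated token: "{" starts a tool call,
# the other two start a prose answer.
tokens = ["{", "I", "Sure"]
logits = [5.0, 3.0, 2.0]

for t in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    off_format = 1.0 - probs[0]
    print(f"T={t}: P(first token is not '{{') = {off_format:.4f}")
```

The probability of leaving the structured path grows with temperature, and every later token in the JSON faces the same risk, so the per-call failure rate compounds. temperature=0 corresponds to always taking the top token, which this function does not model directly.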

6. Tool schema too complex for a small model

Small models (Llama 3.1 8B, Qwen3 4B) handle simple schemas (2-3 parameters) well but struggle with deeply nested schemas or with more than 8-10 tools in one request. The output looks syntactically valid but selects the wrong tool or the wrong parameters.

How to spot it: Cut the schema down to a single tool with one required parameter and test. If it now calls correctly, schema complexity is the issue — move up to a 14B+ model or split the toolset.
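Before trimming by hand, it can help to measure how heavy the toolset is. The sketch below counts tools, parameters, and nesting depth for OpenAI-style tool definitions. The tool and parameter limits loosely follow the ranges mentioned above; the depth limit of 1 is an assumption for illustration, and all function names are hypothetical.

```python
def schema_depth(schema) -> int:
    """Levels of nested objects/arrays: scalars are 0, a flat object is 1."""
    if not isinstance(schema, dict):
        return 0
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    if not children:
        return 0
    return 1 + max(schema_depth(child) for child in children)


def summarize_tools(tools: list[dict]) -> dict:
    """Collect simple size metrics for a list of OpenAI-style tool definitions."""
    params = [tool["function"].get("parameters", {}) for tool in tools]
    return {
        "tool_count": len(tools),
        "max_params": max((len(p.get("properties", {})) for p in params), default=0),
        "max_depth": max((schema_depth(p) for p in params), default=0),
    }


def looks_heavy(summary: dict, max_tools: int = 8, max_params: int = 3, max_depth: int = 1) -> bool:
    """Flag toolsets that exceed the (assumed) comfort zone of a small model."""
    return (
        summary["tool_count"] > max_tools
        or summary["max_params"] > max_params
        or summary["max_depth"] > max_depth
    )


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
summary = summarize_tools([weather_tool])
print(summary, "heavy:", looks_heavy(summary))
```

A toolset flagged as heavy is a candidate for splitting across requests or for a larger model; a toolset that is not flagged but still fails suggests looking at the other causes first.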

7. Quantization too low for stable JSON

Tool calling demands precise token sequences ("function", {, }, quotes). At aggressive quants (IQ3 and below) the probability distribution at those positions flattens, so brackets and key names go wrong more often and the JSON fails to parse.

How to spot it: Re-pull a Q5_K_M or Q8_0 build of the same model and re-test. A clear jump in success rate points to quantization.

Shortest path to fix

Step 1: Verify the model supports tool calling

# Ollama: check whether the model's template handles tools
# (tool-capable models are also tagged "tools" on the Ollama library page)
ollama show llama3.3 --modelfile | grep -i tool

# Models that call tools reliably as of June 2026:
# - qwen3              (most stable; lowest dropped-call rate)
# - gpt-oss:20b        (clean tool calling, agent-tuned)
# - llama3.3           (needs 48GB+ VRAM for 70B)
# - gemma4 / gemma     (native function calling in the weights)
# - llama3.1:8b, qwen2.5, mistral-nemo, hermes3

Step 2 (llama-server): start with `--jinja`

This is the fix for most llama-server users. Without --jinja, tool calling is off.

# --jinja turns on OpenAI-style tool calling and the PEG tool-call autoparser
llama-server --jinja -fa \
  -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M \
  --port 8080

# For a model whose template the autoparser doesn't recognize,
# point it at a tool-aware Jinja template:
llama-server --jinja -fa -m model.gguf \
  --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja \
  --port 8080

Then the /v1/chat/completions endpoint will parse and return a real tool_calls array:

import requests

response = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
            }
        }
    }],
    "tool_choice": "auto",
    "temperature": 0.1,
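})
# A parsed call shows up under message.tool_calls; None means the model answered in prose
print(response.json()["choices"][0]["message"].get("tool_calls"))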

Step 2 (Ollama): use a tool-tagged model and the right endpoint

Ollama applies the tool template for you — there is no --jinja flag. Use a current Ollama build (v0.30.x as of June 2026); tool calling has been supported since the Llama 3.1 release, streaming tool calls landed in v0.8.0, and later releases kept refining tool-call parsing during thinking. Hit either /api/chat (native) or /v1/chat/completions (OpenAI-compatible).

import openai

client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.1,
)
msg = resp.choices[0].message
print(msg.tool_calls[0].function if msg.tool_calls else msg.content)

Step 3: Constrain the output structure

To get reliability that does not depend on sampling luck, constrain the decoder. Ollama exposes this through the format parameter, which accepts a full JSON Schema and forces the decode to match it (structured outputs, available since Ollama v0.5.0). llama-server does the same internally via GBNF grammars derived from the tool schema when --jinja is on.

# Ollama native /api/chat: force a JSON shape with `format`
import ollama

resp = ollama.chat(
    model="qwen3",
    messages=[{"role": "user", "content": "Weather in Paris, JSON only."}],
    format={
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
    options={"temperature": 0},
)
print(resp["message"]["content"])  # constrained to the schema (can still be cut short by a token limit)

A constraint guarantees the JSON is syntactically valid. It does not guarantee the model picked the right tool or the right argument values — that still depends on the model being tool-trained (Step 1).
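Because the constraint only covers syntax, an application-level check on the values is still worthwhile. One cheap heuristic, sketched below, is to flag free-text arguments that never appear in the user's message, which catches some hallucinated values. The function name and the choice of which keys to check are assumptions for illustration; this is not a general semantic validator.

```python
import json


def ungrounded_arguments(content: str, user_text: str, keys: tuple[str, ...]) -> list[str]:
    """Return the checked keys whose string value does not occur in the user text."""
    args = json.loads(content)
    haystack = user_text.lower()
    return [
        key
        for key in keys
        if isinstance(args.get(key), str) and args[key].lower() not in haystack
    ]


user_text = "Weather in Paris, JSON only."
# Both strings are schema-valid; only the first one matches what the user asked.
print(ungrounded_arguments('{"city": "Paris", "unit": "celsius"}', user_text, ("city",)))
print(ungrounded_arguments('{"city": "London", "unit": "celsius"}', user_text, ("city",)))
```

Only keys that are expected to echo the user's wording (here city) should be checked; enum-style values such as unit are legitimately absent from the prompt.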

Step 4: Drop the temperature for tool requests

# Reuses `client` and `tools` from the Ollama example above; `messages` is your conversation list
response = client.chat.completions.create(
    model="qwen3",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    temperature=0,    # deterministic; 0.1 is also fine
    max_tokens=256,   # tool calls are short
)

Step 5: Log the raw prompt to confirm tools are injected

# llama-server: dump the rendered prompt
llama-server --jinja --verbose -m model.gguf 2>&1 | grep -A 50 "prompt"

# Ollama: enable debug logging on the server process
OLLAMA_DEBUG=1 ollama serve

Confirm the tool names and JSON schema appear in the logged prompt. If they don’t, your runtime isn’t injecting the tools and no model setting will help.
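Once the rendered prompt has been captured from the log, the comparison can be automated. This is a minimal sketch: it assumes you have already extracted the prompt text into a string, and the helper name is hypothetical. The sample prompts are hand-written, not real log output.

```python
def missing_from_prompt(rendered_prompt: str, tools: list[dict]) -> list[str]:
    """Return the names of tools that never appear in the rendered prompt."""
    names = [tool["function"]["name"] for tool in tools]
    return [name for name in names if name not in rendered_prompt]


tools = [
    {"type": "function", "function": {"name": "get_weather"}},
    {"type": "function", "function": {"name": "search_web"}},
]

# Hand-written stand-ins for a prompt copied out of the debug log.
injected = 'You may call these tools: {"name": "get_weather"} {"name": "search_web"}'
not_injected = "You are a helpful assistant.\nUser: What's the weather in Paris?"

print(missing_from_prompt(injected, tools))
print(missing_from_prompt(not_injected, tools))
```

An empty list means every tool name reached the model; a full list means the runtime dropped the tools before rendering.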

Step 6: Test with a minimal single-tool schema first

minimal_tool = {
    "type": "function",
    "function": {
        "name": "echo",
        "description": "Echo back the input string",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}

# A prompt that unambiguously requires the tool
messages = [{"role": "user", "content": "Echo the phrase 'hello world' using the echo tool."}]

If the minimal case works and your real schema doesn’t, the cause is schema complexity (Cause 6) — simplify, split tools, or move to a larger model.
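Before blaming complexity, it may be worth ruling out a malformed schema, since a wrong type name or a required key with no matching property can produce similar symptoms. The linter below is a small sketch covering only those two problems; it is not a full JSON Schema validator, and the function name is hypothetical.

```python
# The seven primitive type names defined by JSON Schema (all lowercase).
VALID_TYPES = {"string", "number", "integer", "boolean", "object", "array", "null"}


def lint_schema(schema: dict, path: str = "parameters") -> list[str]:
    """Report invalid type names and required keys missing from properties."""
    problems = []
    type_name = schema.get("type")
    if isinstance(type_name, str) and type_name not in VALID_TYPES:
        problems.append(f"{path}: invalid type {type_name!r}")
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f"{path}: required key {name!r} not in properties")
    # Recurse into nested objects and array item schemas.
    for name, sub in properties.items():
        if isinstance(sub, dict):
            problems.extend(lint_schema(sub, f"{path}.{name}"))
    if isinstance(schema.get("items"), dict):
        problems.extend(lint_schema(schema["items"], f"{path}[]"))
    return problems


bad = {
    "type": "object",
    "properties": {"text": {"type": "String"}},
    "required": ["text", "lang"],
}
for problem in lint_schema(bad):
    print(problem)
```

If the linter reports nothing and the minimal tool still works where the real one fails, schema complexity remains the likelier explanation.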

How to confirm it’s fixed

Run the same request 5-10 times and check that every response carries a populated tool_calls array (or, with format, schema-valid JSON) with no surrounding prose. A setup that is genuinely fixed should hit the structure every time at temperature=0. If you still see 1-2 prose responses out of 10, the model is probably under-trained for tools; switch models rather than fighting the prompt.
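The repeated check is easy to script. The sketch below takes any zero-argument callable that returns an OpenAI-style response dict, so it works with either runtime; the function names are hypothetical, and the canned responses stand in for a real endpoint so the example runs offline.

```python
from itertools import cycle


def is_clean_tool_call(response: dict) -> bool:
    """True if the reply has tool_calls and no prose in content."""
    message = response["choices"][0]["message"]
    has_call = bool(message.get("tool_calls"))
    has_prose = bool((message.get("content") or "").strip())
    return has_call and not has_prose


def tool_call_rate(send, runs: int = 10) -> float:
    """Call send() `runs` times and return the fraction of clean tool calls."""
    hits = sum(is_clean_tool_call(send()) for _ in range(runs))
    return hits / runs


# Canned stand-ins for real responses: four good replies, then one prose reply.
good = {"choices": [{"message": {"content": "", "tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]}}]}
bad = {"choices": [{"message": {"content": "I'll use the get_weather function."}}]}
canned = cycle([good, good, good, good, bad])

rate = tool_call_rate(lambda: next(canned), runs=10)
print(f"clean tool calls: {rate:.0%}")
```

In real use, send would wrap the request from Step 2, for example lambda: requests.post(url, json=payload).json(). Some thinking-capable models may return content alongside tool_calls, in which case the prose check might need relaxing.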

Prevention

Pick a model from the tool-capable list for any agent use; don’t assume a generic instruct model supports tools. Qwen3 and GPT-OSS are the safe defaults as of June 2026.
For llama-server, bake --jinja into your launch script and treat its absence as a bug.
Keep temperature at 0 or 0.1 for any request that includes a tools parameter.
Run a current runtime: Ollama v0.30.x+ or a recent llama-server build. Older versions lack the autoparser and streaming tool-call support.
Prefer Q5_K_M or higher quant for tool-calling workloads; sub-Q4 quants break JSON more often.
Validate tool schemas against JSON Schema before sending — all type values lowercase ("string", not "String"), complete properties.
Test with one simple tool before adding complex multi-tool schemas.
In production, add a validator that checks for tool_calls and falls back gracefully (one retry with a stronger constraint, then a text path) when the model ignores the format.
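The fallback described in the last item could look roughly like this. It is a sketch only: send is a hypothetical callable you supply, and what strict=True means is up to you (a format schema, a forced tool choice, or a firmer system prompt are all options to evaluate on your runtime).

```python
def call_with_fallback(send, retries: int = 1) -> dict:
    """Try once normally, retry with a stronger constraint, then fall back to text.

    `send(strict=bool)` must return an OpenAI-style response dict.
    """
    message = {}
    for attempt in range(retries + 1):
        response = send(strict=attempt > 0)
        message = response["choices"][0]["message"]
        if message.get("tool_calls"):
            return {"kind": "tool_calls", "value": message["tool_calls"], "attempts": attempt + 1}
    # Every attempt ignored the format: hand the caller the last text answer.
    return {"kind": "text", "value": message.get("content") or "", "attempts": retries + 1}


def fake_send(strict: bool) -> dict:
    """Stand-in endpoint: answers in prose unless the strict path is used."""
    if strict:
        call = {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
        return {"choices": [{"message": {"content": "", "tool_calls": [call]}}]}
    return {"choices": [{"message": {"content": "I'll check the weather for you."}}]}


result = call_with_fallback(fake_send)
print(result["kind"], "after", result["attempts"], "attempt(s)")
```

Returning a tagged result rather than raising lets the caller route tool calls and plain text through separate paths without exception handling.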

FAQ

Q: Does Ollama support the OpenAI tool-calling API natively? A: Yes. Use /api/chat (native) or /v1/chat/completions (OpenAI-compatible) with a tools array. The model still has to be tool-trained — look for the tools tag on its Ollama model page. Streaming tool calls and think are both supported in current builds.

Q: My llama-server ignores the tools array completely. Why? A: You almost certainly started it without --jinja. That flag is what enables OpenAI-style tool calling and the tool-call autoparser. Restart with llama-server --jinja ... and re-test.

Q: Can grammar / the format constraint force any model to produce tool-call JSON? A: It forces syntactic validity — valid JSON matching the schema. It does not make a non-tool-trained model pick the right tool or sensible arguments. Constraint plus a tool-trained model is the reliable combination; constraint alone on a base model gives valid-looking but nonsensical calls.

Q: Why does the model produce a correct tool call sometimes and prose other times? A: Sampling stochasticity. At higher temperatures the model reaches a fork where “I’ll call” (prose) and {"tool_calls" (JSON) are both plausible next tokens. Set temperature=0, or constrain the output via --jinja / Ollama’s format, to rule out the prose path.

Q: Why is tool calling 100% on GPT-5.5 but only 60-70% on a local 7B? A: Frontier models had far more function-calling fine-tuning and constrained decoding behind the API. Small open models have much less. Move to Qwen3 or GPT-OSS, raise the quant, drop the temperature, and add a retry-with-constraint fallback in your app layer.

Q: Does parallel tool calling (multiple tools in one turn) work locally? A: Mainstream local models (Qwen3, Llama 3.3) support it, but less reliably than single-tool calls. Confirm single-tool stability first, enable parallel calls second, and parse each tool call defensively.
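Defensive parsing for multiple calls might look like the sketch below. It accepts arguments either as a JSON string (as the OpenAI-compatible endpoint returns them) or as an already-parsed object (as Ollama's native /api/chat returns them), and it keeps the good calls even when one entry is broken. The function name and the error-message format are hypothetical.

```python
import json


def parse_tool_calls(message: dict, known_tools: set[str]) -> tuple[list, list]:
    """Split a message's tool_calls into usable (name, args) pairs and error notes."""
    calls, errors = [], []
    for index, call in enumerate(message.get("tool_calls") or []):
        function = call.get("function") or {}
        name = function.get("name")
        args = function.get("arguments")
        if isinstance(args, str):
            try:
                args = json.loads(args)
            except json.JSONDecodeError as exc:
                errors.append(f"call {index}: arguments are not valid JSON ({exc.msg})")
                continue
        if name not in known_tools:
            errors.append(f"call {index}: unknown tool {name!r}")
            continue
        if not isinstance(args, dict):
            errors.append(f"call {index}: arguments are not an object")
            continue
        calls.append((name, args))
    return calls, errors


message = {"tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}},
    {"function": {"name": "get_weather", "arguments": {"city": "Rome"}}},
    {"function": {"name": "launch_rocket", "arguments": "{}"}},
    {"function": {"name": "get_weather", "arguments": '{"city": '}},
]}
calls, errors = parse_tool_calls(message, {"get_weather"})
print(calls)
print(errors)
```

Handling each call independently means one malformed entry in a parallel batch does not discard the others.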

Tags: #local-llm #ollama #Troubleshooting
