You configure an Ollama 0.4 or llama-server endpoint with a tools list in the OpenAI /v1/chat/completions format and send a prompt that clearly requires using one of the defined tools. Instead of returning {"tool_calls": [{"function": {"name": "search_web", "arguments": {"query": "..."}}}]}, the model writes “I’ll use the search_web function with query…” in plain prose and continues generating a fabricated answer. Or it outputs malformed JSON with missing brackets. The issue is almost never that the model is “not smart enough” — it’s that the tool-calling format in the chat template doesn’t match what the model was fine-tuned to produce, or grammar-based constraint enforcement isn’t enabled.
Common causes
Ordered by hit rate, highest first.
1. Model not fine-tuned for tool calling
Not every instruction-tuned model supports tool calling. Base models and many general instruction fine-tunes have no concept of producing structured tool-call JSON. Only models explicitly trained with function-calling data (Mistral v0.3+, Llama 3.1+, Qwen2.5+, Functionary, Hermes 3, etc.) will reliably output tool-call JSON.
How to spot it: Check the model card on HuggingFace for “function calling,” “tool use,” or “tools” in the features list. If none of these is mentioned, assume the model was not trained on tool-call data and is unlikely to follow the tool-calling format reliably.
2. Wrong tool-calling template in the chat template
Mistral v0.3 uses [TOOL_CALLS] tokens. Llama 3.1 uses <|python_tag|> for code interpreter and a JSON block for regular tool calls. Qwen2.5 uses <tool_call> XML-like tags. If Ollama or llama-server applies the wrong template (e.g., Llama 2 template on a Mistral v0.3 model), the model never sees its trained tool-call tokens and produces prose instead.
How to spot it: Run ollama show modelname --modelfile | grep TEMPLATE and compare the tool-call token format against the model’s HuggingFace tokenizer_config.json.
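The comparison above can be partly automated. The following is a minimal sketch, assuming you have already copied the template text into a string; the family labels and the function name `find_tool_markers` are hypothetical, and the marker strings are the ones named in this section. Note that Llama 3.1 emits plain JSON for regular tool calls, so a missing `<|python_tag|>` alone is not proof of a wrong template.

```python
# Marker strings are the tool-call tokens described above; the family
# labels are names chosen for this sketch only.
TOOL_MARKERS = {
    "mistral-v0.3": "[TOOL_CALLS]",
    "llama-3.1": "<|python_tag|>",
    "qwen2.5": "<tool_call>",
}


def find_tool_markers(template: str) -> list[str]:
    """Return the model families whose tool-call marker appears in the template."""
    return [family for family, marker in TOOL_MARKERS.items() if marker in template]


# Two illustrative template fragments (not the real templates of any model).
llama2_style = "[INST] {{ .System }} {{ .Prompt }} [/INST]"
mistral_style = "[INST] {{ .Prompt }} [/INST][TOOL_CALLS] {{ .Response }}"

print(find_tool_markers(llama2_style))
print(find_tool_markers(mistral_style))
```

An empty result for a model that should support tools is a strong hint that the applied template is the wrong one.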
3. Grammar enforcement not enabled in llama-server
llama-server supports GBNF (GGML BNF) grammars that constrain the output to valid JSON. Without grammar enforcement, the model will sometimes produce correct JSON and sometimes produce malformed JSON or prose, depending on the prompt and temperature. For reliable tool-call extraction, grammar-based constraints are essential.
How to spot it: Check your llama-server API call for a grammar field. If it’s absent, output format is unconstrained.
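As a rough illustration of that check, the sketch below inspects a request payload before it is sent. The `grammar` field is the one named above; treating `json_schema` and `response_format` as alternative constraint fields is an assumption about your server build, and the tiny GBNF grammar is illustrative only and should be verified against your llama.cpp version.

```python
# Fields treated as "constraining" in this sketch. "grammar" is the field
# named in the text; the other two are assumptions about the server build.
CONSTRAINT_FIELDS = ("grammar", "json_schema", "response_format")


def constraint_fields(payload: dict) -> list[str]:
    """Return the constraint fields that are present and non-empty."""
    return [field for field in CONSTRAINT_FIELDS if payload.get(field)]


# Illustrative GBNF grammar that admits exactly one call shape
# (hypothetical get_weather tool with a single string argument).
WEATHER_GRAMMAR = r'''
root ::= "{\"name\": \"get_weather\", \"arguments\": {\"city\": " string "}}"
string ::= "\"" [^"\\]* "\""
'''.strip()

unconstrained = {"messages": [{"role": "user", "content": "Weather in Paris?"}]}
constrained = dict(unconstrained, grammar=WEATHER_GRAMMAR)

print(constraint_fields(unconstrained))
print(constraint_fields(constrained))
```

When the output is constrained by a hand-written grammar like this, expect the call to arrive as plain text that your own code has to parse.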
4. Tools list not injected into the system prompt
Some local serving stacks (older llama-server versions, custom wrappers) accept the tools parameter in the API call but don’t actually inject the tool definitions into the prompt. The model never sees the available tools and has no basis for producing a tool call.
How to spot it: Enable verbose mode in llama-server (--verbose) and log the full prompt string sent to the model. If the tool definitions (names, descriptions, parameter schemas) are not present in the rendered prompt, the tools aren’t being injected.
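Once you have the rendered prompt as a string, the check can be sketched in a few lines. This assumes the tools are in the OpenAI format used throughout this article; `missing_tools` and the sample prompts are hypothetical.

```python
def missing_tools(rendered_prompt: str, tools: list[dict]) -> list[str]:
    """Return the names of tools that never appear in the rendered prompt."""
    names = [tool["function"]["name"] for tool in tools]
    return [name for name in names if name not in rendered_prompt]


tools = [
    {"type": "function", "function": {"name": "search_web", "parameters": {}}},
    {"type": "function", "function": {"name": "get_weather", "parameters": {}}},
]

# Hypothetical prompts as they might appear in a verbose log.
prompt_without_tools = "[INST] What's the weather in Paris? [/INST]"
prompt_with_tools = '[{"name": "search_web"}, {"name": "get_weather"}] [INST] Weather? [/INST]'

print(missing_tools(prompt_without_tools, tools))
print(missing_tools(prompt_with_tools, tools))
```

A name match is only a coarse signal; it confirms the tool was mentioned, not that its full parameter schema was rendered.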
5. Temperature too high causing sampling away from structured output
At temperature 0.9+, even a well-configured tool-calling model may deviate from the expected JSON structure, because high-entropy sampling occasionally picks tokens that break the JSON syntax. Tool calling requires low-temperature or grammar-constrained sampling.
How to spot it: Reduce temperature to 0 or 0.1 for the same prompt and compare. If structured output appears at temperature 0, temperature is the cause.
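To see why temperature matters, here is a small numerical sketch of temperature-scaled softmax. The three logits are made up for illustration (think of them as the JSON-opening token, a prose token, and everything else); they are not measured from any model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: JSON token, prose token, other.
logits = [5.0, 3.0, 1.0]

for temperature in (0.1, 0.9, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the top token, while higher temperatures hand a growing share to the alternatives, which is where a single off-format token can derail the whole tool call.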
6. Tool schema too complex for the model’s context or capability
Models like Llama 3.1 8B handle simple tool schemas (2-3 parameters) reliably but struggle with deeply nested schemas or more than 8-10 tools in a single call. The model produces a syntactically valid-looking output but uses the wrong tool or wrong parameters.
How to spot it: Simplify the tool schema to a single tool with one required parameter and test. If the model now calls it correctly, the complexity of the full schema is the issue.
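A quick way to quantify "too complex" before sending a request is to count tools and measure schema nesting. The sketch below is one possible heuristic: the default of 8 tools comes from the range mentioned above, while the depth limit of 2 and the function names are assumptions to tune for your model.

```python
def schema_depth(schema: dict) -> int:
    """Count nested object/array levels; a scalar schema has depth 0."""
    children = [c for c in schema.get("properties", {}).values() if isinstance(c, dict)]
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    if not children:
        return 0
    return 1 + max(schema_depth(child) for child in children)


def complexity_warnings(tools: list[dict], max_tools: int = 8, max_depth: int = 2) -> list[str]:
    """Return human-readable warnings for an oversized or deeply nested tool list."""
    warnings = []
    if len(tools) > max_tools:
        warnings.append(f"{len(tools)} tools in one request (limit {max_tools})")
    for tool in tools:
        function = tool["function"]
        depth = schema_depth(function.get("parameters", {}))
        if depth > max_depth:
            warnings.append(f"{function['name']}: nesting depth {depth} (limit {max_depth})")
    return warnings


flat_tool = {"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}}
nested_tool = {"type": "function", "function": {
    "name": "search",
    "parameters": {"type": "object", "properties": {
        "filters": {"type": "object", "properties": {
            "range": {"type": "object", "properties": {"min": {"type": "number"}}},
        }},
    }},
}}

print(complexity_warnings([flat_tool]))
print(complexity_warnings([flat_tool, nested_tool]))
```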
Shortest path to fix
Step 1: Verify the model supports tool calling
# Check for explicit tool-call support markers in model metadata
ollama show mistral:7b-instruct-v0.3 --modelfile | grep -i tool
# Or check HuggingFace model card
# Models known to support tool calling well:
# - mistralai/Mistral-7B-Instruct-v0.3
# - meta-llama/Llama-3.1-8B-Instruct
# - Qwen/Qwen2.5-7B-Instruct
# - NousResearch/Hermes-3-Llama-3.1-8B
# - meetkai/functionary-small-v3.2
Step 2: Use grammar-constrained tool calling in llama-server
# Start llama-server with the built-in tool-call grammar
./llama-server \
-m models/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
--chat-template mistral-v3-tekken \
--port 8080
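# Whether /v1/chat/completions enforces the tool-call JSON structure when
# tools are provided depends on the llama.cpp build: recent builds typically
# need the --jinja flag so the model's own chat template handles the tools.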
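The request itself follows the OpenAI format. On older llama-server builds that do not derive a constraint from the tools list, you also need to pass a GBNF grammar explicitly in the request's grammar field (not shown here):
# Send a chat completion request with a single tool definition
import requests

response = requests.post("http://localhost:8080/v1/chat/completions", json={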
"model": "mistral",
"messages": [{"role": "user", "content": "What's the weather in Paris?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"}
},
"required": ["city"]
}
}
}],
"tool_choice": "auto",
"temperature": 0.1,
})
Step 3: Set temperature to 0 or 0.1 for tool-calling requests
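from openai import OpenAI

# Point the OpenAI client at Ollama's OpenAI-compatible endpoint.
# The API key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# messages and tools are lists in the same format as the request above
response = client.chat.completions.create(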
model="mistral:7b-instruct-v0.3",
messages=messages,
tools=tools,
tool_choice="auto",
temperature=0.1, # Low temperature for structured output
max_tokens=256, # Tool calls are short
)
Step 4: For Ollama, update the Modelfile to include tool-call template tokens
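For Mistral v0.3, the base instruct template is shown below. On its own it contains no tool tokens: for tool calling, the template must also render the tool definitions and the [TOOL_CALLS] marker the model was trained on, so compare it against the template shipped with the library model before overriding anything: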
FROM mistral:7b-instruct-v0.3
TEMPLATE """[INST] {{ if .System }}{{ .System }}
{{ end }}{{ .Prompt }} [/INST]{{ .Response }}"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER stop "</s>"
For Llama 3.1 with tool calling, use a current Ollama release (0.3 or later, including 0.4), which has native tool-call support for Llama 3.1; no custom template is needed.
Step 5: Log the raw prompt to verify tool definitions are injected
# Enable verbose output from llama-server
LLAMA_LOG_LEVEL=debug ./llama-server \
-m model.gguf \
--chat-template mistral-v3-tekken \
--verbose 2>&1 | grep -A 50 "input:"
Confirm the tool JSON schema appears in the logged prompt. If it doesn’t, your serving framework isn’t injecting the tools correctly.
Step 6: Test with a minimal single-tool schema first
# Minimal tool to isolate the issue
minimal_tool = {
"type": "function",
"function": {
"name": "echo",
"description": "Echo back the input string",
"parameters": {
"type": "object",
"properties": {
"text": {"type": "string"}
},
"required": ["text"]
}
}
}
# Prompt that unambiguously requires the tool
messages = [{"role": "user", "content": "Echo the phrase 'hello world' using the echo tool."}]
Prevention
- Choose models from the “function calling” category when your use case requires tool calling — don’t assume any instruct model supports it.
- Always set temperature=0.1 or lower for requests that include a tools parameter.
- Use a current Ollama release (0.3 or later) for Llama 3.1 and Mistral v0.3 tool calling; older versions don’t have native tool-call template support.
- Validate tool schemas against JSON Schema Draft-7 before sending — malformed schemas cause silent tool-call failures.
- Test tool calling with a single simple tool before adding complex multi-tool schemas.
- Log the raw prompt (enable --verbose in llama-server) for any new model+framework combination to confirm tools are being injected.
- For production use, add a response validator that checks for tool_calls in the response and falls back gracefully when the model ignores the format.
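The response validator from the last point could look roughly like this. It is a sketch that assumes an OpenAI-style response dictionary; `extract_tool_calls` is a hypothetical helper, and it accepts arguments either as a JSON string or as an already-parsed object, since serving stacks differ on this.

```python
import json


def extract_tool_calls(response: dict) -> list[dict]:
    """Return [{'name': ..., 'arguments': {...}}]; empty if the model ignored the format."""
    try:
        message = response["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return []
    if not isinstance(message, dict):
        return []
    calls = []
    for call in message.get("tool_calls") or []:
        function = call.get("function", {})
        arguments = function.get("arguments", {})
        if isinstance(arguments, str):
            try:
                arguments = json.loads(arguments)
            except json.JSONDecodeError:
                continue  # malformed JSON: skip this call
        if function.get("name") and isinstance(arguments, dict):
            calls.append({"name": function["name"], "arguments": arguments})
    return calls


good = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]}}]}
prose = {"choices": [{"message": {"content": "I'll use the get_weather function..."}}]}
broken = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Paris"'}}]}}]}

for response in (good, prose, broken):
    calls = extract_tool_calls(response)
    print(calls if calls else "no valid tool call, falling back")
```

What the fallback does (retry at lower temperature, return the prose as a normal answer, or raise) is an application decision that this sketch leaves open.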
FAQ
Q: Does Ollama support the OpenAI tool-calling API natively?
A: Yes, starting in Ollama 0.3 for select models (Llama 3.1, Mistral v0.3, Qwen2.5). Use the /v1/chat/completions endpoint with a tools parameter. The model must support tool calling in its training. Use ollama show modelname to check for tool-call capability in the model description.
Q: What’s the difference between tool calling and function calling?
A: They are the same concept under different names. OpenAI originally used “function calling” (GPT-3.5, GPT-4), then renamed it “tool calling” in the API v1 update. The JSON structure changed slightly: function calling uses function_call in the response, tool calling uses tool_calls (a list). Most local serving stacks use the newer tool_calls format.
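If your client code has to handle both shapes, a small adapter can normalize them. This sketch assumes OpenAI-style message dictionaries; `normalize_tool_calls` is a hypothetical name, and the legacy format carries no call id, so none is invented here.

```python
def normalize_tool_calls(message: dict) -> list[dict]:
    """Return calls in the tool_calls list shape, whichever format the message uses."""
    if message.get("tool_calls"):
        return list(message["tool_calls"])
    legacy = message.get("function_call")
    if legacy:
        # Wrap the single legacy call in the newer list-of-calls structure.
        return [{
            "type": "function",
            "function": {"name": legacy["name"], "arguments": legacy["arguments"]},
        }]
    return []


legacy_message = {"function_call": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
modern_message = {"tool_calls": [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}]}

print(normalize_tool_calls(legacy_message))
print(normalize_tool_calls(modern_message))
```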
Q: Can I force any model to produce JSON tool calls using grammar constraints?
A: You can constrain the output format to valid JSON using GBNF grammar, but the model still needs to “know” which tool to call and with what arguments. Grammar constraints ensure syntactic correctness but not semantic correctness. A model not trained for tool calling will produce syntactically valid but semantically nonsensical tool call arguments.
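Because grammar constraints stop at syntax, a light semantic check on the parsed call is a useful complement. The sketch below assumes a call already parsed into a name and an arguments dictionary, plus tools in the OpenAI format; `check_call` is a hypothetical helper and it only checks tool names and argument keys, not value types.

```python
def check_call(call: dict, tools: list[dict]) -> list[str]:
    """Return a list of problems with a parsed call; an empty list means it looks sane."""
    specs = {tool["function"]["name"]: tool["function"] for tool in tools}
    spec = specs.get(call["name"])
    if spec is None:
        return [f"unknown tool: {call['name']}"]
    parameters = spec.get("parameters", {})
    properties = parameters.get("properties", {})
    problems = []
    for required in parameters.get("required", []):
        if required not in call["arguments"]:
            problems.append(f"missing required argument: {required}")
    for key in call["arguments"]:
        if key not in properties:
            problems.append(f"unexpected argument: {key}")
    return problems


weather_tools = [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}}]

print(check_call({"name": "get_weather", "arguments": {"city": "Paris"}}, weather_tools))
print(check_call({"name": "get_weather", "arguments": {"town": "Paris"}}, weather_tools))
print(check_call({"name": "lookup", "arguments": {}}, weather_tools))
```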
Q: Why does the model sometimes produce the right tool call and sometimes prose?
A: This is sampling stochasticity at work. At temperature 0.7, the model may reach a probability fork where “I’ll call” (prose) and {"tool_calls" (JSON) are both plausible next tokens. Set temperature=0 for deterministic, structured tool-call output, or use grammar constraints to rule out the prose path entirely.