The soaring popularity of AI agents and Model Context Protocol (MCP) clients and servers, along with increasing mandates to integrate existing business applications with AI models, is leading to an explosion of use cases that require accurate tool calling in inference APIs and large language models.
The OpenAI Chat Completions API introduced a tools parameter in inference requests and a tool_calls parameter in inference responses to give us a structured way of dealing with tools. In the legacy OpenAI Completions API, tool calls were done manually via client-side prompting and client-side parsing of model responses. In the Chat Completions API, and in the even newer Responses API, that responsibility has moved to the inference server.
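As a rough illustration of what that old client-side pattern involved, here’s a minimal sketch using the OpenAI Python SDK against a local OpenAI-compatible endpoint. The prompt format, parsing regex, endpoint URL, and model name are all made-up placeholders for this example; they are not from any particular model’s documentation.

# Rough sketch of legacy, client-side tool calling. The prompt format, regex,
# endpoint, and model name here are illustrative placeholders only.
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

prompt = (
    'You have one tool: get_current_time(location).\n'
    'If you need it, answer exactly in the form: CALL get_current_time(location="...")\n\n'
    'User: What is the current time in Boston, MA?\nAssistant:'
)

completion = client.completions.create(model="some-model", prompt=prompt, max_tokens=64)
text = completion.choices[0].text

# The client is responsible for parsing the model's free-form text itself.
match = re.search(r'CALL get_current_time\(location="([^"]+)"\)', text)
if match:
    print("Tool call requested for:", match.group(1))
else:
    print("Plain text answer:", text)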
Having the server handle this is great because it’s one less thing clients have to get right. However, it also means the inference server has to be set up properly to get the best results from the model. The rest of this article focuses on arming the local developer, platform engineer, or infrastructure admin with enough knowledge to get maximum tool calling results for the model in use.
I’ll be using vLLM as my inference server in all these examples - partly because it’s what I know best, but mostly because it has the easily configurable chat templates and tool call parsers that we need to do this properly. This is less configurable (or not configurable at all) in most other inference servers, making it harder to demonstrate in this post. And, the lack of configurability in some servers limits model choices in more advanced real-world scenarios.
The three parts to successful tool calls #
There are three main components that have to be matched together to get the best tool call results - the language model, the chat template in use by the inference server, and the tool call parsing code (or plugin) of the inference server.
The language model #
Some language models are better than others at recognizing when a tool should be called and at constructing those tool call responses. There are various leaderboards in the wild that track the performance of these models, with the most popular one being the Berkeley Function-Calling Leaderboard V3, often abbreviated to just BFCL or BFCL-v3. The higher a model is on that leaderboard, the better it is at overall tool calling. There are many different types of tool calling, and some models excel at some categories more than others. But, for the purpose of this article I’ll ignore those nuances and just focus on the overall score.
At the time of this writing, the Salesforce/Llama-xLAM-2-70b-fc-r model sits in the top spot on this leaderboard. Meanwhile, meta-llama/Llama-3.3-70B-Instruct is in position #72, in the lower half of all models tested. I’ll use these two models later on to demonstrate the importance of properly configuring the chat template and tool call parsing for each model.
The chat template #
The job of the chat template is to take the list of messages and tools from the inference request and turn those into the prompt that actually gets sent to the model. Here’s an example Llama-3.1 chat template for reference. These templates are often complex, as they need to handle multiple sets of possible scenarios based on what kind of messages are sent in the inference requests.
Every model trained for tool calling has an ideal way it should be prompted with tools to achieve the best results. Models typically ship a default chat template in their tokenizer configuration. This is usually the best general-purpose chat template for a model and is sometimes, but not always, the correct tool calling template to use for the model.
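To see what a chat template actually produces, you can render one locally with the Hugging Face transformers library. This is a quick sketch, not part of the original workflow; it assumes the transformers package is installed and that you have access to the gated Llama model’s tokenizer.

# Quick sketch: inspect a model's default chat template and render the prompt
# that messages + tools would produce. Assumes transformers is installed and
# the gated Llama tokenizer is accessible.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# The default chat template ships in the tokenizer configuration.
print(tokenizer.chat_template)

messages = [{"role": "user", "content": "What is the current time in Boston, MA?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Get the current time in a specific location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

# Render the exact prompt string the model would be fed by the inference server.
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True
)
print(prompt)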
For example, the Salesforce/Llama-xLAM-2-70b-fc-r model ships a default chat template optimized for tool calling out of the box, as that is the main purpose of this model. However, the meta-llama/Llama-3.3-70B-Instruct model can be prompted to construct either JSON or Python tool calls in its responses. Its default prompt is set up for JSON tool calling, but that may not be optimal for all tool use cases. For the curious, Meta provides very detailed documentation about how to prompt their Llama models and why both JSON and Python tool call syntaxes exist. I won’t dive that deeply into the Llama models in this article, but know that some models have more than one chat template to choose from, depending on how the model is being used.
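To make the JSON-versus-Python distinction concrete, here are two illustrative strings showing roughly what the raw model output looks like in each style. These are hand-written examples of the general shape, not captured model output:

# Illustrative only: the rough shape of the two tool call output styles.
json_style_output = '{"name": "get_current_time", "parameters": {"location": "Boston, MA"}}'
pythonic_style_output = '[get_current_time(location="Boston, MA")]'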
The inference server and its tool call parsing #
The job of the tool call parser is to take the raw output from the model and attempt to parse it into structured function objects in the tool_calls response. If models were perfect, this would be a trivial task. Unfortunately, most of the current models are nowhere close to perfect, which means there’s often some fuzzy parsing logic involved in this step.
Different inference servers take different approaches to choosing the tool call parser to use. Ollama, for example, attempts to extract the expected tool call format out of the Golang chat templates uploaded with each model in their model registry.
vLLM, on the other hand, ships its many tool call parser implementations with the inference server itself instead of with the models. Some of these implementations are more mature than others, and it’s also possible to write and use a custom tool call parser with vLLM at runtime without having to build it into the server. This gives a lot of flexibility, but also requires diligence to ensure the tool call parser in use matches the format expected by the chat template and model selected.
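To get a feel for what that fuzzy parsing involves, here’s a toy, standalone sketch of a JSON-style tool call parser. It is not vLLM’s implementation or its plugin interface; real parsers also deal with streaming output, multiple calls, surrounding text, and model-specific quirks.

# Toy sketch of JSON-style tool call parsing. This is NOT vLLM's parser or its
# plugin interface; it just shows the general idea of fuzzy extraction.
import json
import re
import uuid


def parse_tool_calls(raw_output: str) -> list[dict]:
    """Try to pull structured tool calls out of raw model text."""
    tool_calls = []
    # Look for anything that resembles a JSON object; real parsers are far pickier.
    for candidate in re.findall(r"\{.*\}", raw_output, flags=re.DOTALL):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            tool_calls.append({
                "id": f"call_{uuid.uuid4().hex}",
                "type": "function",
                "function": {
                    "name": obj["name"],
                    "arguments": json.dumps(obj.get("parameters", obj.get("arguments", {}))),
                },
            })
    # If nothing parses, the raw text would fall through as plain content.
    return tool_calls


print(parse_tool_calls('{"name": "get_current_time", "parameters": {"location": "Boston, MA"}}'))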
Putting the three parts together with some examples #
In all of these examples, I’m running vLLM on a machine with 4x NVIDIA L40S GPUs. The --tensor-parallel-size and --max-model-len parameters below are just to get these models loading within my available GPU memory; they may need to be tweaked for other setups running on different hardware.
A properly configured Llama-3.3-70B-Instruct for JSON tool calls #
To configure Llama-3.3-70B-Instruct for JSON-format tool calls, we need to specify its JSON chat template and JSON tool call parser. While the default chat template shipped in this model’s tokenizer is set up for JSON tool calls, we’ll explicitly download and use the latest one available at the time of writing, from the vLLM version 0.9.1 repository here, to pick up the latest improvements.
Note that this chat template says llama3.1_json, but the Llama 3.3 models use the same chat template and tool call format as the Llama 3.1 models.
# Download the JSON chat template
curl https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.1/examples/tool_chat_template_llama3.1_json.jinja -o $HOME/tool_chat_template_llama3.1_json.jinja
# Start vLLM with the JSON template
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 48000 \
--chat-template $HOME/tool_chat_template_llama3.1_json.jinja \
--enable-auto-tool-choice \
--tool-call-parser llama3_json
Now let’s execute a chat completion request with a tool call:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{
"role": "user",
"content": "What is the current time in Boston, MA?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current time in a specific location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. Atlanta, GA"
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}' | jq .
If everything went well, we’ll get a tool_calls object in the response, like in the example below:
{
  "id": "chatcmpl-8eaf3f043f0746f392c8dc7249a6aab1",
  "object": "chat.completion",
  "created": 1750540007,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-71bf0c7c2ae24e4c966fc39b394e5657",
            "type": "function",
            "function": {
              "name": "get_current_time",
              "arguments": "{\"location\": \"Boston, MA\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": 128008
    }
  ],
  "usage": {
    "prompt_tokens": 265,
    "total_tokens": 291,
    "completion_tokens": 26,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
Because our chat template and our tool call parser are matched, the model generates JSON-style tool calls that the tool call parser can turn into the expected structured format, giving us these tool_calls results. If they weren’t set up properly, we’d see the model’s tool call as a text response instead of parsed tool_calls. We’ll look at an example of that a bit later.
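As a side note, the same request works through the OpenAI Python SDK pointed at vLLM’s OpenAI-compatible endpoint. A minimal sketch, assuming the openai package is installed and the server started above is running on localhost:8000 (the api_key value is arbitrary since we’re not enforcing auth):

# Minimal sketch: the same tool call request via the OpenAI Python SDK against
# the local vLLM server started above. The api_key value is arbitrary here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "What is the current time in Boston, MA?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current time in a specific location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. Atlanta, GA",
                    }
                },
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print("No structured tool call; raw content:", message.content)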
A misconfigured Llama-3.3-70B-Instruct for Python tool calls #
As mentioned earlier, there are some cases (such as when using built-in tools, or in some more advanced tool calling scenarios) where Llama models respond more accurately with Python-formatted tool calls. However, neither vLLM nor any other inference server I’ve seen ships a matched set of chat templates and pythonic tool call parsers designed for Llama 3.3 models. There are some that are close, but not quite right, so let’s try those and see what happens.
Since we have no pythonic chat template for Llama 3.3 models, let’s use the one designed for Llama 3.2 models. And, we’ll use the pythonic tool call parser, which was also designed for Llama 3.2 models. This is a mistake someone could easily make in the wild, assuming chat templates and tool call parsers designed for Llama 3.2 models would also work for Llama 3.3.
First, start vLLM with the pythonic chat template and tool call parser that weren’t designed for our exact model:
# Download the Llama 3.2 pythonic chat template
curl https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.1/examples/tool_chat_template_llama3.2_pythonic.jinja -o $HOME/tool_chat_template_llama3.2_pythonic.jinja
# Start vLLM with the Llama 3.2 pythonic template
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 48000 \
--chat-template $HOME/tool_chat_template_llama3.2_pythonic.jinja \
--enable-auto-tool-choice \
--tool-call-parser pythonic
Let’s send the same chat completion request as before:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{
"role": "user",
"content": "What is the current time in Boston, MA?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current time in a specific location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. Atlanta, GA"
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}' | jq .
This time our response looks a bit different. Instead of getting the expected tool_calls result, we got the model’s tool call as text content that was not successfully parsed:
{
  "id": "chatcmpl-c8f39cb4ba2f4a31a07e5cacb9cff5d0",
  "object": "chat.completion",
  "created": 1750544289,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "[\"get_current_time(location=\\\"Boston, MA\\\")\"]",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 128008
    }
  ],
  "usage": {
    "prompt_tokens": 269,
    "total_tokens": 281,
    "completion_tokens": 12,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
This is a typical example of tool calling failure, where the model actually generated a reasonable tool call but in a format that didn’t match what our tool call parser expected. So, instead of the structured tool_calls output, we get text containing what looks like a Python list of function calls with some escaped quotes.
If outputs like these show up when testing tool calling, it’s a good sign that the chat templates and/or tool call parsing code are not optimally designed for the model in use.
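One simple way to catch this during testing is a client-side guard that flags responses whose text content looks like an unparsed tool call. This is a rough heuristic for illustration, not a robust detector:

# Rough heuristic for spotting tool calls that leaked into text content.
# Illustrative only; not a robust or complete detector.
import re


def looks_like_unparsed_tool_call(content: str, tool_names: list[str]) -> bool:
    if not content:
        return False
    for name in tool_names:
        # A known tool name followed by "(" suggests a pythonic-style call.
        if re.search(rf'{re.escape(name)}\s*\(', content):
            return True
        # A known tool name inside a "name" field suggests a JSON-style call.
        if re.search(rf'"name"\s*:\s*"{re.escape(name)}"', content):
            return True
    return False


content = '["get_current_time(location=\\"Boston, MA\\")"]'
print(looks_like_unparsed_tool_call(content, ["get_current_time"]))  # True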
Example of tool calling with the Salesforce xLAM model #
Remember the Salesforce xLAM model mentioned earlier, the one sitting on top of the Berkeley Function-Calling Leaderboard? The team behind that model understands the importance of properly matching the chat template to the tool call parser, and they actually ship their own tool call parser for vLLM to get optimal results. And, its default chat template is already set up for proper tool calling.
Let’s do our same tool call test using that model:
# Download the xLAM tool call parser
curl https://huggingface.co/Salesforce/Llama-xLAM-2-70b-fc-r/raw/main/xlam_tool_call_parser.py -o $HOME/xlam_tool_call_parser.py
# Start vLLM with the xLAM tool call parser
vllm serve Salesforce/Llama-xLAM-2-70b-fc-r \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 48000 \
--enable-auto-tool-choice \
--tool-parser-plugin $HOME/xlam_tool_call_parser.py \
--tool-call-parser xlam
Send the sample tool call request again:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/Llama-xLAM-2-70b-fc-r",
"messages": [
{
"role": "user",
"content": "What is the current time in Boston, MA?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current time in a specific location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. Atlanta, GA"
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}' | jq .
And the results, with the tool_calls as expected:
{
  "id": "chatcmpl-df454b5a3fa446ddbf98779e093334ea",
  "object": "chat.completion",
  "created": 1750546019,
  "model": "Salesforce/Llama-xLAM-2-70b-fc-r",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": null,
        "tool_calls": [
          {
            "id": "call_0_ee61d2bf22d34c35a960ce21b9b18983",
            "type": "function",
            "function": {
              "name": "get_current_time",
              "arguments": "{\"location\": \"Boston, MA\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 282,
    "total_tokens": 304,
    "completion_tokens": 22,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
A misconfigured Salesforce xLAM model #
What happens if someone doesn’t read the model card of the xLAM model? Perhaps they were using the Llama-3.3-70B model, saw the better results from the xLAM model on the BFCL-v3 leaderboard, and just swapped the model out in their vllm serve command without changing the tool call parser. What kind of results would we get then?
Start vLLM with the xLAM model but the llama3_json tool call parser:
# Start vLLM with the xLAM model but llama3_json
vllm serve Salesforce/Llama-xLAM-2-70b-fc-r \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 48000 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json
Send the sample tool call request:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/Llama-xLAM-2-70b-fc-r",
"messages": [
{
"role": "user",
"content": "What is the current time in Boston, MA?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current time in a specific location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. Atlanta, GA"
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}' | jq .
And, because we have the wrong tool call parser for this model, our tool call comes out as text content in the model’s output instead of as tool_calls:
{
  "id": "chatcmpl-16f119c6cf6548018ecd7c0774a11069",
  "object": "chat.completion",
  "created": 1750546771,
  "model": "Salesforce/Llama-xLAM-2-70b-fc-r",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "[{\"name\": \"get_current_time\", \"arguments\": {\"location\": \"Boston, MA\"}}]",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 282,
    "total_tokens": 304,
    "completion_tokens": 22,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
Takeaways #
The chat template and tool call parser must match the model #
It’s not enough to just choose a model that’s good at tool calling. The chat template and tool call parser must also be matched to that model. If a model doesn’t publish details on how to properly prompt it and how to parse its tool calls (ideally with predefined chat templates and tool call parsers for vLLM, like the xLAM models), then it may require deeper investigation or even writing custom chat templates and tool call parsers to get the expected results.
Don’t assume chat templates and tool call parsers can be reused from one model to another in the same model family, unless the creators explicitly state that it’s ok to do so. We saw an example of how this often doesn’t work when using the Llama 3.2 chat template and parser with the Llama 3.3 model.
If clients are ever seeing what looks like tool calls in the content of their model responses (as opposed to structured tool_calls), it’s a sure sign that something is not properly matched between the model’s generated responses and the tool call parser in use.
Model authors need to test, document, and ship tool calling chat templates and parsers #
At their launch, Meta’s Llama 4 models were hard to prompt properly for tool calling and parsers weren’t available for popular inference servers. This led to lower results on benchmarks like the BFCL-v3 than the models are actually capable of, and worse tool calling performance in the real world with popular inference servers and SaaS services.
Even now, a couple of months after launch, SaaS inference providers of Llama 4 models are still quite poor at getting this right. There are dedicated chat templates and tool call parsers for both JSON and pythonic tool calling in vLLM now, although the pythonic tool call parser still struggles with some of the common Llama 4 model tool calling outputs.
Some of the underwhelming reception of these models at launch could have been avoided, at least on the tool calling front, by working more closely with the SaaS services and inference servers to get this right out of the gate.
There’s room for innovation, improvements, and standardization in tool call parsing #
There is not a single consistent format that all models are trained to output when generating tool calls. Many tool call parsers are not much more than regular expressions, and they can fail to handle all types of model output under all circumstances (especially streaming vs non-streaming output). For vLLM, users have to know to select the right tool call parser for their model and chat template. In Ollama, users don’t get a choice and are stuck with the tool call parser Ollama automatically generates from parsing the model’s chat template, which may or may not give the best results possible.
There’s an opportunity here to come up with something better. Perhaps model tokenizers should specify a tool call grammar in addition to their chat template so that each chat template is at least matched to some reasonable tool call parser. Perhaps the community should coordinate on a shared library for tool call parsers that all inference servers use, reducing the fragmentation in this space where each server has its own method for parsing tool calls.
This space is still evolving, and there’s plenty of opportunity for motivated individuals to jump in and lend a hand in these upstream communities.
Want to talk more about tool calling or vLLM? #
Even if we had perfectly accurate tool calling models (which we don’t), there’s a lot of nuance involved in getting a production tool calling setup working as accurately as possible. If you want to chat more about these things, feel free to reach out to me directly via any of my contact details listed on this site.