Function Calling Training Data Construction Principles
Chapter 22: Function Calling Training Data Construction Principles
Hermes's outstanding performance on Function Calling tasks stems from systematic training data engineering. This chapter analyzes the design logic behind Hermes's 11 tool call parsers, methods for constructing high-quality training samples, annotation standards for multi-step tool call sequences, and the technical sources of Hermes 4's advantages on Function Calling benchmarks.
22.1 The Challenges of Function Calling
Function Calling requires a language model to do more than "output text" โ it must generate structured tool invocation instructions at the right moment and continue reasoning after receiving tool responses. This involves multiple sub-capabilities:
- Intent recognition: Determining whether the current task requires a tool call
- Tool selection: Choosing the most appropriate tool from available options
- Parameter extraction: Extracting correct parameter values from user intent
- Type constraints: Generating schema-compliant parameters (types, enums, required fields)
- Multi-step reasoning: Deciding the next action based on tool return values
- Error recovery: Adaptive strategies when tool calls fail
22.2 11 Tool Call Parser Designs
Hermes provides 11 tool call parsers to accommodate different model architectures and output format preferences:
Parser Classification
| Parser Type | Format | Target Models | Characteristics |
|---|---|---|---|
hermes |
XML-like tags | Hermes series | Native format, most stable parsing |
mistral |
Embedded JSON | Mistral series | Requires special [TOOL_CALLS] marker |
llama3 |
JSON blocks | Llama 3 series | Uses <|python_tag|> delimiter |
qwen |
JSON with markers | Qwen series | <tool_call> tag variant |
functionary |
Strict JSON mode | Functionary series | Strictest schema validation |
openai |
OpenAI compatible | GPT-compatible models | JSON tool_calls array |
claude |
XML tags | Claude series | <invoke> + <parameter> |
gemini |
protobuf-like | Gemini series | Strong-typed parameter definitions |
pythonic |
Python function call syntax | Code-optimized models | func_name(arg1=val1) |
react |
ReAct format | General purpose | Action: / Action Input: text |
json_schema |
Pure JSON Schema | General purpose | Max compatibility, lower efficiency |
Format Examples by Parser
hermes parser (native):
[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio", "max_results": 5}}
[/tool_call]
mistral parser:
[TOOL_CALLS] [{"name": "web_search", "arguments": {"query": "Python asyncio"}}]
llama3 parser:
<|python_tag|>{"name": "web_search", "arguments": {"query": "Python asyncio"}}<|eom_id|>
react parser:
Thought: I need to search for Python asyncio information.
Action: web_search
Action Input: {"query": "Python asyncio", "max_results": 5}
pythonic parser:
web_search(query="Python asyncio", max_results=5)
Parser Configuration
from hermes.function_calling import ToolCallParser
# Auto-detect based on model name
parser = ToolCallParser.auto_detect("NousResearch/Hermes-2-Pro-Mistral-7B")
# Manual specification
parser = ToolCallParser.create("mistral")
# Register a custom parser
@ToolCallParser.register("my_custom")
class MyCustomParser(BaseParser):
def parse(self, text: str) -> list[ToolCall]:
...
def format(self, tool_call: ToolCall) -> str:
...
22.3 Training Data Formats
22.3.1 Basic Single Tool Call Sample
{
"id": "fc_train_001",
"type": "single_tool_call",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant with access to the following tools:\n\n[{\"name\": \"get_weather\", \"description\": \"Get weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"], \"default\": \"celsius\"}}, \"required\": [\"city\"]}}]"
},
{
"role": "user",
"content": "What's the weather like in Beijing today?"
},
{
"role": "assistant",
"content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\", \"unit\": \"celsius\"}}\n[/tool_call]"
},
{
"role": "tool",
"content": "{\"city\": \"Beijing\", \"temperature\": 22, \"condition\": \"Sunny\", \"humidity\": 45}"
},
{
"role": "assistant",
"content": "Beijing is sunny today with a temperature of 22ยฐC and humidity at 45%. A great day to be outside!"
}
],
"quality_score": 0.95,
"verified": true
}
22.3.2 Multi-Step Tool Call Sequence Sample
{
"id": "fc_train_002",
"type": "multi_step_tool_call",
"task_description": "Research a stock and generate an analysis report",
"messages": [
{
"role": "system",
"content": "You are a financial analysis AI assistant..."
},
{
"role": "user",
"content": "Please analyze Apple (AAPL) stock price trends over the past 30 days"
},
{
"role": "assistant",
"content": "[inner_monologue]\nI need to first get AAPL's historical price data, then perform technical analysis.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"stock_price_history\", \"arguments\": {\"symbol\": \"AAPL\", \"days\": 30}}\n[/tool_call]"
},
{
"role": "tool",
"content": "{\"symbol\": \"AAPL\", \"data\": [{\"date\": \"2024-01-01\", \"close\": 185.2}, ...]}"
},
{
"role": "assistant",
"content": "[inner_monologue]\nGot 30 days of data. Now I need to calculate technical indicators: MA5, MA20, RSI.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"calculate_technical_indicators\", \"arguments\": {\"data\": \"<previous_result>\", \"indicators\": [\"MA5\", \"MA20\", \"RSI\"]}}\n[/tool_call]"
},
{
"role": "tool",
"content": "{\"MA5\": 187.3, \"MA20\": 183.5, \"RSI\": 58.4, \"trend\": \"bullish\"}"
},
{
"role": "assistant",
"content": "## Apple (AAPL) 30-Day Stock Analysis\n\n**Trend Assessment**: Bullish\n- MA5 (187.3) > MA20 (183.5): short-term MA above long-term, bullish signal\n- RSI = 58.4: neutral-to-strong range, not overbought\n"
}
],
"tool_call_count": 2,
"quality_score": 0.92
}
22.3.3 Negative (No Tool Call) Sample
High-quality training data must include scenarios where tools should not be called:
{
"id": "fc_train_003",
"type": "no_tool_call",
"reason": "Question can be answered directly without tools",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant with web_search available"
},
{
"role": "user",
"content": "What is 1+1?"
},
{
"role": "assistant",
"content": "1+1 equals 2."
}
],
"annotation": {
"should_call_tool": false,
"reason": "Basic math doesn't need web search; a direct answer is more efficient"
}
}
22.4 Principles for Constructing High-Quality Training Samples
Parameter Quality Scoring
class ParameterQualityScorer:
def score(self, tool_call: dict, ground_truth: dict) -> dict:
scores = {}
# 1. Completeness: are all required fields provided?
required_fields = ground_truth.get("required", [])
provided = tool_call.get("arguments", {}).keys()
scores["completeness"] = len(set(required_fields) & set(provided)) / max(len(required_fields), 1)
# 2. Accuracy: are values correct?
correct_count = sum(
1 for k, v in ground_truth["arguments"].items()
if tool_call["arguments"].get(k) == v
)
scores["accuracy"] = correct_count / len(ground_truth["arguments"])
# 3. Type precision: do types match?
type_correct = sum(
1 for k, v in tool_call.get("arguments", {}).items()
if isinstance(v, type(ground_truth["arguments"].get(k)))
)
scores["type_precision"] = type_correct / max(len(tool_call.get("arguments", {})), 1)
# 4. No hallucination: are any non-existent parameters introduced?
extra_params = set(tool_call.get("arguments", {}).keys()) - set(ground_truth.get("properties", {}).keys())
scores["no_hallucination"] = 1.0 if not extra_params else 0.5
scores["overall"] = sum(scores.values()) / len(scores)
return scores
Training Data Diversity Requirements
| Dimension | Requirement | Recommended Distribution |
|---|---|---|
| Tool count | Single / Dual / Three+ | 50% / 30% / 20% |
| Call steps | 1-step / 2-step / 3-step+ | 40% / 35% / 25% |
| Should call? | Yes / No | 70% / 30% |
| Parameter types | string / number / array / object | Balanced |
| Error recovery | Includes tool failure scenarios | At least 15% |
| Language | English / Chinese / Mixed | As needed |
22.5 Multi-Step Tool Call Annotation Standards
Step-Level Data Flow Annotation
{
"steps": [
{
"step_id": 1,
"tool": "web_search",
"arguments": {"query": "AAPL stock price"},
"output_var": "$search_results",
"produces": ["$search_results.results[0].url"]
},
{
"step_id": 2,
"tool": "browser_navigate",
"arguments": {"url": "$search_results.results[0].url"},
"depends_on": [1],
"output_var": "$page_content",
"data_flow": "step1_output -> url_parameter"
}
],
"critical_path": [1, 2],
"annotations": {
"data_dependencies": "Step 1's result URL is passed into Step 2"
}
}
Decision Point Annotation
Decision points are the most critical annotations โ they teach the model to make judgments based on intermediate results:
{
"decision_points": [
{
"after_step": 1,
"condition": "web_search.result_count == 0",
"action": "retry_with_broader_query",
"reasoning": "If search returns nothing, broaden the query and retry"
},
{
"after_step": 2,
"condition": "browser_navigate.status_code == 403",
"action": "try_alternative_url",
"reasoning": "403 means access denied; switch to an alternative source"
}
]
}
22.6 Why Hermes 4 Excels at Function Calling
Performance Improvements
| Area | Hermes 3 | Hermes 4 | Improvement |
|---|---|---|---|
| Parameter accuracy | 87.3% | 94.1% | +6.8pp |
| Tool selection accuracy | 91.2% | 97.8% | +6.6pp |
| Multi-step success rate | 72.4% | 88.9% | +16.5pp |
| Error recovery rate | 58.3% | 79.2% | +20.9pp |
| Schema compliance rate | 95.1% | 99.3% | +4.2pp |
Technical Improvements
1. Enhanced tool description understanding: Training data intentionally features incomplete or ambiguous tool descriptions, forcing the model to infer parameter semantics from context.
2. Joint training across 11 parsers: The same tool call scenario is annotated in all 11 formats simultaneously, giving the model format generalization capabilities.
3. Negative sample reinforcement: 30% of training samples are "should not call tool" scenarios, significantly reducing hallucinated tool calls.
4. Error injection training: Tool call errors are deliberately introduced (wrong parameter values, type mismatches), then correctly annotated recovery paths โ training the model's self-correction ability:
# Error injection sample
{
"type": "error_recovery",
"messages": [
{
"role": "assistant",
"content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": 123}}\n[/tool_call]"
},
{
"role": "tool",
"content": "{\"error\": \"TypeError: city must be string, got int\"}"
},
{
"role": "assistant",
"content": "[inner_monologue]\nI made a type error โ city should be a string, not an integer. Let me correct this.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\"}}\n[/tool_call]"
}
]
}
22.7 Summary
This chapter systematically covered the training data engineering behind Hermes Function Calling:
- 11 parsers: Cover format preferences across different model architectures, from native hermes format to react text format
- Training data formats: Four major types โ single call, multi-step, parallel calls, and rejection (no-call)
- Quality dimensions: Four-dimensional scoring โ completeness, accuracy, type precision, and no-hallucination
- Annotation standards: Step-level data flow and decision point annotations are key to multi-step reasoning capability
- Hermes 4 advantages: Three technical pillars โ joint training, negative sample reinforcement, error injection
Review Questions
-
When generating training data, using a strong model (e.g., GPT-4) to generate samples and then filtering them vs. using human annotation โ what are the trade-offs of each approach? How would you design a hybrid strategy?
-
The "should not call tool" negative sample ratio is set at 30%. If this ratio were too high (e.g., 50%), what behavioral bias would the model develop?
-
Joint training across 11 parsers might cause format selection "instability" (the model sometimes generates hermes format, sometimes react format for the same task). How would you use system prompts to enforce a specific format?