Chapter 22

Function Calling Training Data Construction Principles

Chapter 22: Function Calling Training Data Construction Principles

Hermes's outstanding performance on Function Calling tasks stems from systematic training data engineering. This chapter analyzes the design logic behind Hermes's 11 tool call parsers, methods for constructing high-quality training samples, annotation standards for multi-step tool call sequences, and the technical sources of Hermes 4's advantages on Function Calling benchmarks.

22.1 The Challenges of Function Calling

Function Calling requires a language model to do more than "output text" — it must generate structured tool invocation instructions at the right moment and continue reasoning after receiving tool responses. This involves multiple sub-capabilities:

Intent recognition: Determining whether the current task requires a tool call
Tool selection: Choosing the most appropriate tool from available options
Parameter extraction: Extracting correct parameter values from user intent
Type constraints: Generating schema-compliant parameters (types, enums, required fields)
Multi-step reasoning: Deciding the next action based on tool return values
Error recovery: Adaptive strategies when tool calls fail

22.2 11 Tool Call Parser Designs

Hermes provides 11 tool call parsers to accommodate different model architectures and output format preferences:

Parser Classification

Parser Type	Format	Target Models	Characteristics
`hermes`	XML-like tags	Hermes series	Native format, most stable parsing
`mistral`	Embedded JSON	Mistral series	Requires special [TOOL_CALLS] marker
`llama3`	JSON blocks	Llama 3 series	Uses `<\|python_tag\|>` delimiter
`qwen`	JSON with markers	Qwen series	`<tool_call>` tag variant
`functionary`	Strict JSON mode	Functionary series	Strictest schema validation
`openai`	OpenAI compatible	GPT-compatible models	JSON `tool_calls` array
`claude`	XML tags	Claude series	`<invoke>` + `<parameter>`
`gemini`	protobuf-like	Gemini series	Strong-typed parameter definitions
`pythonic`	Python function call syntax	Code-optimized models	`func_name(arg1=val1)`
`react`	ReAct format	General purpose	`Action:` / `Action Input:` text
`json_schema`	Pure JSON Schema	General purpose	Max compatibility, lower efficiency

Format Examples by Parser

hermes parser (native):

[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio", "max_results": 5}}
[/tool_call]

mistral parser:

[TOOL_CALLS] [{"name": "web_search", "arguments": {"query": "Python asyncio"}}]

llama3 parser:

<|python_tag|>{"name": "web_search", "arguments": {"query": "Python asyncio"}}<|eom_id|>

react parser:

Thought: I need to search for Python asyncio information.
Action: web_search
Action Input: {"query": "Python asyncio", "max_results": 5}

pythonic parser:

web_search(query="Python asyncio", max_results=5)

Parser Configuration

from hermes.function_calling import ToolCallParser

# Auto-detect based on model name
parser = ToolCallParser.auto_detect("NousResearch/Hermes-2-Pro-Mistral-7B")

# Manual specification
parser = ToolCallParser.create("mistral")

# Register a custom parser
@ToolCallParser.register("my_custom")
class MyCustomParser(BaseParser):
    def parse(self, text: str) -> list[ToolCall]:
        ...
    
    def format(self, tool_call: ToolCall) -> str:
        ...

22.3 Training Data Formats

22.3.1 Basic Single Tool Call Sample

{
  "id": "fc_train_001",
  "type": "single_tool_call",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant with access to the following tools:\n\n[{\"name\": \"get_weather\", \"description\": \"Get weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"], \"default\": \"celsius\"}}, \"required\": [\"city\"]}}]"
    },
    {
      "role": "user",
      "content": "What's the weather like in Beijing today?"
    },
    {
      "role": "assistant",
      "content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\", \"unit\": \"celsius\"}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"city\": \"Beijing\", \"temperature\": 22, \"condition\": \"Sunny\", \"humidity\": 45}"
    },
    {
      "role": "assistant",
      "content": "Beijing is sunny today with a temperature of 22°C and humidity at 45%. A great day to be outside!"
    }
  ],
  "quality_score": 0.95,
  "verified": true
}

22.3.2 Multi-Step Tool Call Sequence Sample

{
  "id": "fc_train_002",
  "type": "multi_step_tool_call",
  "task_description": "Research a stock and generate an analysis report",
  "messages": [
    {
      "role": "system",
      "content": "You are a financial analysis AI assistant..."
    },
    {
      "role": "user",
      "content": "Please analyze Apple (AAPL) stock price trends over the past 30 days"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\nI need to first get AAPL's historical price data, then perform technical analysis.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"stock_price_history\", \"arguments\": {\"symbol\": \"AAPL\", \"days\": 30}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"symbol\": \"AAPL\", \"data\": [{\"date\": \"2024-01-01\", \"close\": 185.2}, ...]}"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\nGot 30 days of data. Now I need to calculate technical indicators: MA5, MA20, RSI.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"calculate_technical_indicators\", \"arguments\": {\"data\": \"<previous_result>\", \"indicators\": [\"MA5\", \"MA20\", \"RSI\"]}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"MA5\": 187.3, \"MA20\": 183.5, \"RSI\": 58.4, \"trend\": \"bullish\"}"
    },
    {
      "role": "assistant",
      "content": "## Apple (AAPL) 30-Day Stock Analysis\n\n**Trend Assessment**: Bullish\n- MA5 (187.3) > MA20 (183.5): short-term MA above long-term, bullish signal\n- RSI = 58.4: neutral-to-strong range, not overbought\n"
    }
  ],
  "tool_call_count": 2,
  "quality_score": 0.92
}

22.3.3 Negative (No Tool Call) Sample

High-quality training data must include scenarios where tools should not be called:

{
  "id": "fc_train_003",
  "type": "no_tool_call",
  "reason": "Question can be answered directly without tools",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant with web_search available"
    },
    {
      "role": "user",
      "content": "What is 1+1?"
    },
    {
      "role": "assistant",
      "content": "1+1 equals 2."
    }
  ],
  "annotation": {
    "should_call_tool": false,
    "reason": "Basic math doesn't need web search; a direct answer is more efficient"
  }
}

22.4 Principles for Constructing High-Quality Training Samples

Parameter Quality Scoring

class ParameterQualityScorer:
    def score(self, tool_call: dict, ground_truth: dict) -> dict:
        scores = {}
        
        # 1. Completeness: are all required fields provided?
        required_fields = ground_truth.get("required", [])
        provided = tool_call.get("arguments", {}).keys()
        scores["completeness"] = len(set(required_fields) & set(provided)) / max(len(required_fields), 1)
        
        # 2. Accuracy: are values correct?
        correct_count = sum(
            1 for k, v in ground_truth["arguments"].items()
            if tool_call["arguments"].get(k) == v
        )
        scores["accuracy"] = correct_count / len(ground_truth["arguments"])
        
        # 3. Type precision: do types match?
        type_correct = sum(
            1 for k, v in tool_call.get("arguments", {}).items()
            if isinstance(v, type(ground_truth["arguments"].get(k)))
        )
        scores["type_precision"] = type_correct / max(len(tool_call.get("arguments", {})), 1)
        
        # 4. No hallucination: are any non-existent parameters introduced?
        extra_params = set(tool_call.get("arguments", {}).keys()) - set(ground_truth.get("properties", {}).keys())
        scores["no_hallucination"] = 1.0 if not extra_params else 0.5
        
        scores["overall"] = sum(scores.values()) / len(scores)
        return scores

Training Data Diversity Requirements

Dimension	Requirement	Recommended Distribution
Tool count	Single / Dual / Three+	50% / 30% / 20%
Call steps	1-step / 2-step / 3-step+	40% / 35% / 25%
Should call?	Yes / No	70% / 30%
Parameter types	string / number / array / object	Balanced
Error recovery	Includes tool failure scenarios	At least 15%
Language	English / Chinese / Mixed	As needed

22.5 Multi-Step Tool Call Annotation Standards

Step-Level Data Flow Annotation

{
  "steps": [
    {
      "step_id": 1,
      "tool": "web_search",
      "arguments": {"query": "AAPL stock price"},
      "output_var": "$search_results",
      "produces": ["$search_results.results[0].url"]
    },
    {
      "step_id": 2,
      "tool": "browser_navigate",
      "arguments": {"url": "$search_results.results[0].url"},
      "depends_on": [1],
      "output_var": "$page_content",
      "data_flow": "step1_output -> url_parameter"
    }
  ],
  "critical_path": [1, 2],
  "annotations": {
    "data_dependencies": "Step 1's result URL is passed into Step 2"
  }
}

Decision Point Annotation

Decision points are the most critical annotations — they teach the model to make judgments based on intermediate results:

{
  "decision_points": [
    {
      "after_step": 1,
      "condition": "web_search.result_count == 0",
      "action": "retry_with_broader_query",
      "reasoning": "If search returns nothing, broaden the query and retry"
    },
    {
      "after_step": 2,
      "condition": "browser_navigate.status_code == 403",
      "action": "try_alternative_url",
      "reasoning": "403 means access denied; switch to an alternative source"
    }
  ]
}

22.6 Why Hermes 4 Excels at Function Calling

Performance Improvements

Area	Hermes 3	Hermes 4	Improvement
Parameter accuracy	87.3%	94.1%	+6.8pp
Tool selection accuracy	91.2%	97.8%	+6.6pp
Multi-step success rate	72.4%	88.9%	+16.5pp
Error recovery rate	58.3%	79.2%	+20.9pp
Schema compliance rate	95.1%	99.3%	+4.2pp

Technical Improvements

1. Enhanced tool description understanding: Training data intentionally features incomplete or ambiguous tool descriptions, forcing the model to infer parameter semantics from context.

2. Joint training across 11 parsers: The same tool call scenario is annotated in all 11 formats simultaneously, giving the model format generalization capabilities.

3. Negative sample reinforcement: 30% of training samples are "should not call tool" scenarios, significantly reducing hallucinated tool calls.

4. Error injection training: Tool call errors are deliberately introduced (wrong parameter values, type mismatches), then correctly annotated recovery paths — training the model's self-correction ability:

# Error injection sample
{
  "type": "error_recovery",
  "messages": [
    {
      "role": "assistant",
      "content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": 123}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"error\": \"TypeError: city must be string, got int\"}"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\nI made a type error — city should be a string, not an integer. Let me correct this.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\"}}\n[/tool_call]"
    }
  ]
}

22.7 Summary

This chapter systematically covered the training data engineering behind Hermes Function Calling:

11 parsers: Cover format preferences across different model architectures, from native hermes format to react text format
Training data formats: Four major types — single call, multi-step, parallel calls, and rejection (no-call)
Quality dimensions: Four-dimensional scoring — completeness, accuracy, type precision, and no-hallucination
Annotation standards: Step-level data flow and decision point annotations are key to multi-step reasoning capability
Hermes 4 advantages: Three technical pillars — joint training, negative sample reinforcement, error injection

Review Questions

When generating training data, using a strong model (e.g., GPT-4) to generate samples and then filtering them vs. using human annotation — what are the trade-offs of each approach? How would you design a hybrid strategy?
The "should not call tool" negative sample ratio is set at 30%. If this ratio were too high (e.g., 50%), what behavioral bias would the model develop?
Joint training across 11 parsers might cause format selection "instability" (the model sometimes generates hermes format, sometimes react format for the same task). How would you use system prompts to enforce a specific format?

Rate this chapter

4.5 / 5 (10 ratings)