Chapter 22

Function Calling Training Data Construction Principles

Chapter 22: Function Calling Training Data Construction Principles

Hermes's outstanding performance on Function Calling tasks stems from systematic training data engineering. This chapter analyzes the design logic behind Hermes's 11 tool call parsers, methods for constructing high-quality training samples, annotation standards for multi-step tool call sequences, and the technical sources of Hermes 4's advantages on Function Calling benchmarks.


22.1 The Challenges of Function Calling

Function Calling requires a language model to do more than "output text" — it must generate structured tool invocation instructions at the right moment and continue reasoning after receiving tool responses. This involves multiple sub-capabilities:

  1. Intent recognition: Determining whether the current task requires a tool call
  2. Tool selection: Choosing the most appropriate tool from available options
  3. Parameter extraction: Extracting correct parameter values from user intent
  4. Type constraints: Generating schema-compliant parameters (types, enums, required fields)
  5. Multi-step reasoning: Deciding the next action based on tool return values
  6. Error recovery: Adaptive strategies when tool calls fail

22.2 11 Tool Call Parser Designs

Hermes provides 11 tool call parsers to accommodate different model architectures and output format preferences:

Parser Classification

Parser Type Format Target Models Characteristics
hermes XML-like tags Hermes series Native format, most stable parsing
mistral Embedded JSON Mistral series Requires special [TOOL_CALLS] marker
llama3 JSON blocks Llama 3 series Uses <|python_tag|> delimiter
qwen JSON with markers Qwen series <tool_call> tag variant
functionary Strict JSON mode Functionary series Strictest schema validation
openai OpenAI compatible GPT-compatible models JSON tool_calls array
claude XML tags Claude series <invoke> + <parameter>
gemini protobuf-like Gemini series Strong-typed parameter definitions
pythonic Python function call syntax Code-optimized models func_name(arg1=val1)
react ReAct format General purpose Action: / Action Input: text
json_schema Pure JSON Schema General purpose Max compatibility, lower efficiency

Format Examples by Parser

hermes parser (native):

[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio", "max_results": 5}}
[/tool_call]

mistral parser:

[TOOL_CALLS] [{"name": "web_search", "arguments": {"query": "Python asyncio"}}]

llama3 parser:

<|python_tag|>{"name": "web_search", "arguments": {"query": "Python asyncio"}}<|eom_id|>

react parser:

Thought: I need to search for Python asyncio information.
Action: web_search
Action Input: {"query": "Python asyncio", "max_results": 5}

pythonic parser:

web_search(query="Python asyncio", max_results=5)

Parser Configuration

from hermes.function_calling import ToolCallParser

# Auto-detect based on model name
parser = ToolCallParser.auto_detect("NousResearch/Hermes-2-Pro-Mistral-7B")

# Manual specification
parser = ToolCallParser.create("mistral")

# Register a custom parser
@ToolCallParser.register("my_custom")
class MyCustomParser(BaseParser):
    def parse(self, text: str) -> list[ToolCall]:
        ...
    
    def format(self, tool_call: ToolCall) -> str:
        ...

22.3 Training Data Formats

22.3.1 Basic Single Tool Call Sample

{
  "id": "fc_train_001",
  "type": "single_tool_call",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant with access to the following tools:\n\n[{\"name\": \"get_weather\", \"description\": \"Get weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"], \"default\": \"celsius\"}}, \"required\": [\"city\"]}}]"
    },
    {
      "role": "user",
      "content": "What's the weather like in Beijing today?"
    },
    {
      "role": "assistant",
      "content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\", \"unit\": \"celsius\"}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"city\": \"Beijing\", \"temperature\": 22, \"condition\": \"Sunny\", \"humidity\": 45}"
    },
    {
      "role": "assistant",
      "content": "Beijing is sunny today with a temperature of 22°C and humidity at 45%. A great day to be outside!"
    }
  ],
  "quality_score": 0.95,
  "verified": true
}

22.3.2 Multi-Step Tool Call Sequence Sample

{
  "id": "fc_train_002",
  "type": "multi_step_tool_call",
  "task_description": "Research a stock and generate an analysis report",
  "messages": [
    {
      "role": "system",
      "content": "You are a financial analysis AI assistant..."
    },
    {
      "role": "user",
      "content": "Please analyze Apple (AAPL) stock price trends over the past 30 days"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\nI need to first get AAPL's historical price data, then perform technical analysis.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"stock_price_history\", \"arguments\": {\"symbol\": \"AAPL\", \"days\": 30}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"symbol\": \"AAPL\", \"data\": [{\"date\": \"2024-01-01\", \"close\": 185.2}, ...]}"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\nGot 30 days of data. Now I need to calculate technical indicators: MA5, MA20, RSI.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"calculate_technical_indicators\", \"arguments\": {\"data\": \"<previous_result>\", \"indicators\": [\"MA5\", \"MA20\", \"RSI\"]}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"MA5\": 187.3, \"MA20\": 183.5, \"RSI\": 58.4, \"trend\": \"bullish\"}"
    },
    {
      "role": "assistant",
      "content": "## Apple (AAPL) 30-Day Stock Analysis\n\n**Trend Assessment**: Bullish\n- MA5 (187.3) > MA20 (183.5): short-term MA above long-term, bullish signal\n- RSI = 58.4: neutral-to-strong range, not overbought\n"
    }
  ],
  "tool_call_count": 2,
  "quality_score": 0.92
}

22.3.3 Negative (No Tool Call) Sample

High-quality training data must include scenarios where tools should not be called:

{
  "id": "fc_train_003",
  "type": "no_tool_call",
  "reason": "Question can be answered directly without tools",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant with web_search available"
    },
    {
      "role": "user",
      "content": "What is 1+1?"
    },
    {
      "role": "assistant",
      "content": "1+1 equals 2."
    }
  ],
  "annotation": {
    "should_call_tool": false,
    "reason": "Basic math doesn't need web search; a direct answer is more efficient"
  }
}

22.4 Principles for Constructing High-Quality Training Samples

Parameter Quality Scoring

class ParameterQualityScorer:
    def score(self, tool_call: dict, ground_truth: dict) -> dict:
        scores = {}
        
        # 1. Completeness: are all required fields provided?
        required_fields = ground_truth.get("required", [])
        provided = tool_call.get("arguments", {}).keys()
        scores["completeness"] = len(set(required_fields) & set(provided)) / max(len(required_fields), 1)
        
        # 2. Accuracy: are values correct?
        correct_count = sum(
            1 for k, v in ground_truth["arguments"].items()
            if tool_call["arguments"].get(k) == v
        )
        scores["accuracy"] = correct_count / len(ground_truth["arguments"])
        
        # 3. Type precision: do types match?
        type_correct = sum(
            1 for k, v in tool_call.get("arguments", {}).items()
            if isinstance(v, type(ground_truth["arguments"].get(k)))
        )
        scores["type_precision"] = type_correct / max(len(tool_call.get("arguments", {})), 1)
        
        # 4. No hallucination: are any non-existent parameters introduced?
        extra_params = set(tool_call.get("arguments", {}).keys()) - set(ground_truth.get("properties", {}).keys())
        scores["no_hallucination"] = 1.0 if not extra_params else 0.5
        
        scores["overall"] = sum(scores.values()) / len(scores)
        return scores

Training Data Diversity Requirements

Dimension Requirement Recommended Distribution
Tool count Single / Dual / Three+ 50% / 30% / 20%
Call steps 1-step / 2-step / 3-step+ 40% / 35% / 25%
Should call? Yes / No 70% / 30%
Parameter types string / number / array / object Balanced
Error recovery Includes tool failure scenarios At least 15%
Language English / Chinese / Mixed As needed

22.5 Multi-Step Tool Call Annotation Standards

Step-Level Data Flow Annotation

{
  "steps": [
    {
      "step_id": 1,
      "tool": "web_search",
      "arguments": {"query": "AAPL stock price"},
      "output_var": "$search_results",
      "produces": ["$search_results.results[0].url"]
    },
    {
      "step_id": 2,
      "tool": "browser_navigate",
      "arguments": {"url": "$search_results.results[0].url"},
      "depends_on": [1],
      "output_var": "$page_content",
      "data_flow": "step1_output -> url_parameter"
    }
  ],
  "critical_path": [1, 2],
  "annotations": {
    "data_dependencies": "Step 1's result URL is passed into Step 2"
  }
}

Decision Point Annotation

Decision points are the most critical annotations — they teach the model to make judgments based on intermediate results:

{
  "decision_points": [
    {
      "after_step": 1,
      "condition": "web_search.result_count == 0",
      "action": "retry_with_broader_query",
      "reasoning": "If search returns nothing, broaden the query and retry"
    },
    {
      "after_step": 2,
      "condition": "browser_navigate.status_code == 403",
      "action": "try_alternative_url",
      "reasoning": "403 means access denied; switch to an alternative source"
    }
  ]
}

22.6 Why Hermes 4 Excels at Function Calling

Performance Improvements

Area Hermes 3 Hermes 4 Improvement
Parameter accuracy 87.3% 94.1% +6.8pp
Tool selection accuracy 91.2% 97.8% +6.6pp
Multi-step success rate 72.4% 88.9% +16.5pp
Error recovery rate 58.3% 79.2% +20.9pp
Schema compliance rate 95.1% 99.3% +4.2pp

Technical Improvements

1. Enhanced tool description understanding: Training data intentionally features incomplete or ambiguous tool descriptions, forcing the model to infer parameter semantics from context.

2. Joint training across 11 parsers: The same tool call scenario is annotated in all 11 formats simultaneously, giving the model format generalization capabilities.

3. Negative sample reinforcement: 30% of training samples are "should not call tool" scenarios, significantly reducing hallucinated tool calls.

4. Error injection training: Tool call errors are deliberately introduced (wrong parameter values, type mismatches), then correctly annotated recovery paths — training the model's self-correction ability:

# Error injection sample
{
  "type": "error_recovery",
  "messages": [
    {
      "role": "assistant",
      "content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": 123}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"error\": \"TypeError: city must be string, got int\"}"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\nI made a type error — city should be a string, not an integer. Let me correct this.\n[/inner_monologue]\n[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\"}}\n[/tool_call]"
    }
  ]
}

22.7 Summary

This chapter systematically covered the training data engineering behind Hermes Function Calling:

Review Questions

  1. When generating training data, using a strong model (e.g., GPT-4) to generate samples and then filtering them vs. using human annotation — what are the trade-offs of each approach? How would you design a hybrid strategy?

  2. The "should not call tool" negative sample ratio is set at 30%. If this ratio were too high (e.g., 50%), what behavioral bias would the model develop?

  3. Joint training across 11 parsers might cause format selection "instability" (the model sometimes generates hermes format, sometimes react format for the same task). How would you use system prompts to enforce a specific format?

Rate this chapter
4.5  / 5  (10 ratings)

💬 Comments