第 22 章

Function Calling 训练数据构造原理

第22章：Function Calling 训练数据构造原理

Hermes 模型在 Function Calling 任务上的卓越表现，源于其系统化的训练数据工程。本章深入解析 Hermes 支持 11 种工具调用解析器的设计逻辑、高质量训练样本的构造方法、多步工具调用序列的标注规范，以及 Hermes 4 在 Function Calling 任务上相比其他模型的优势来源。

22.1 Function Calling 的挑战

Function Calling（函数调用）要求语言模型不只是"输出文本"，而是在正确的时机生成结构化的工具调用指令，并在获得工具响应后继续推理。这涉及多个子能力：

意图识别：判断当前任务是否需要工具调用
工具选择：从可用工具中选择最合适的
参数提取：从用户意图中提取正确的参数值
类型约束：生成符合 Schema 的参数（类型、枚举、必填性）
多步推理：根据工具返回值决定下一步行动
错误恢复：工具调用失败后的自适应策略

22.2 11 种工具调用解析器设计

Hermes 提供 11 种工具调用解析器（Tool Call Parsers），以适配不同的模型架构和输出格式偏好：

22.2.1 解析器分类

解析器类型	格式	适用模型	特点
`hermes`	XML-like 标签	Hermes 系列	原生格式，解析最稳定
`mistral`	JSON 内嵌	Mistral 系列	需要特殊 [TOOL_CALLS] 标记
`llama3`	JSON 块	Llama 3 系列	使用 `<
`qwen`	JSON with markers	Qwen 系列	`<tool_call>` 标签变体
`functionary`	JSON 严格模式	Functionary 系列	最严格的 Schema 验证
`openai`	OpenAI 兼容	GPT 兼容模型	JSON `tool_calls` 数组
`claude`	XML 标签	Claude 系列	`<invoke>` + `<parameter>`
`gemini`	protobuf-like	Gemini 系列	强类型参数定义
`pythonic`	Python 函数调用语法	代码优化模型	`func_name(arg1=val1)`
`react`	ReAct 格式	通用	`Action:` / `Action Input:` 文本
`json_schema`	纯 JSON Schema	通用	最大兼容性，最低效率

22.2.2 各解析器格式示例

hermes 解析器（原生格式）：

[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio", "max_results": 5}}
[/tool_call]

mistral 解析器：

[TOOL_CALLS] [{"name": "web_search", "arguments": {"query": "Python asyncio"}}]

llama3 解析器：

<|python_tag|>{"name": "web_search", "arguments": {"query": "Python asyncio"}}<|eom_id|>

react 解析器：

Thought: I need to search for Python asyncio information.
Action: web_search
Action Input: {"query": "Python asyncio", "max_results": 5}

pythonic 解析器：

web_search(query="Python asyncio", max_results=5)

22.2.3 解析器配置与选择

from hermes.function_calling import ToolCallParser

# 自动检测（根据模型名称推断）
parser = ToolCallParser.auto_detect("NousResearch/Hermes-2-Pro-Mistral-7B")

# 手动指定
parser = ToolCallParser.create("mistral")

# 注册自定义解析器
@ToolCallParser.register("my_custom")
class MyCustomParser(BaseParser):
    def parse(self, text: str) -> list[ToolCall]:
        # 自定义解析逻辑
        ...
    
    def format(self, tool_call: ToolCall) -> str:
        # 自定义格式化逻辑
        ...

22.3 训练数据格式

22.3.1 基础单次工具调用样本

{
  "id": "fc_train_001",
  "type": "single_tool_call",
  "messages": [
    {
      "role": "system",
      "content": "你是一个有帮助的 AI 助手，可以使用以下工具：\n\n[{\"name\": \"get_weather\", \"description\": \"获取指定城市的天气\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\", \"description\": \"城市名称\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"], \"default\": \"celsius\"}}, \"required\": [\"city\"]}}]"
    },
    {
      "role": "user",
      "content": "北京今天天气怎么样？"
    },
    {
      "role": "assistant",
      "content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"北京\", \"unit\": \"celsius\"}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"city\": \"北京\", \"temperature\": 22, \"condition\": \"晴天\", \"humidity\": 45}"
    },
    {
      "role": "assistant",
      "content": "北京今天天气晴好，气温 22°C，湿度 45%，是出行的好天气！"
    }
  ],
  "quality_score": 0.95,
  "verified": true
}

22.3.2 多步工具调用序列样本

{
  "id": "fc_train_002",
  "type": "multi_step_tool_call",
  "task_description": "研究某公司的股价并生成分析报告",
  "messages": [
    {
      "role": "system",
      "content": "你是一个金融分析 AI 助手..."
    },
    {
      "role": "user",
      "content": "帮我分析一下苹果公司（AAPL）最近30天的股价走势"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\n需要先获取 AAPL 的历史股价数据，然后进行技术分析。\n[/inner_monologue]\n[tool_call]\n{\"name\": \"stock_price_history\", \"arguments\": {\"symbol\": \"AAPL\", \"days\": 30}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"symbol\": \"AAPL\", \"data\": [{\"date\": \"2024-01-01\", \"close\": 185.2}, ...]}"
    },
    {
      "role": "assistant",
      "content": "[inner_monologue]\n获取到了30天的数据。现在我需要计算技术指标：MA5、MA20、RSI。\n[/inner_monologue]\n[tool_call]\n{\"name\": \"calculate_technical_indicators\", \"arguments\": {\"data\": \"<previous_result>\", \"indicators\": [\"MA5\", \"MA20\", \"RSI\"]}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"MA5\": 187.3, \"MA20\": 183.5, \"RSI\": 58.4, \"trend\": \"bullish\"}"
    },
    {
      "role": "assistant",
      "content": "[tool_call]\n{\"name\": \"generate_chart\", \"arguments\": {\"type\": \"candlestick\", \"data\": \"<price_data>\", \"indicators\": [\"MA5\", \"MA20\"]}}\n[/tool_call]"
    },
    {
      "role": "tool",
      "content": "{\"chart_url\": \"https://charts.example.com/aapl_30d.png\"}"
    },
    {
      "role": "assistant",
      "content": "## 苹果公司（AAPL）30天股价分析\n\n**趋势研判**: 多头趋势\n- MA5 (187.3) > MA20 (183.5)，短期均线在长期均线上方，看涨信号\n- RSI = 58.4，处于中性偏强区间，未过热\n\n[查看完整图表](https://charts.example.com/aapl_30d.png)"
    }
  ],
  "tool_call_count": 3,
  "quality_score": 0.92
}

22.3.3 拒绝调用（Negative）样本

高质量训练数据必须包含不应该调用工具的场景：

{
  "id": "fc_train_003",
  "type": "no_tool_call",
  "reason": "问题可以直接回答，无需工具",
  "messages": [
    {
      "role": "system",
      "content": "你是一个有帮助的 AI 助手，有 web_search 工具可用"
    },
    {
      "role": "user",
      "content": "1+1 等于多少？"
    },
    {
      "role": "assistant",
      "content": "1+1 等于 2。"
    }
  ],
  "annotation": {
    "should_call_tool": false,
    "reason": "基础数学问题不需要网络搜索，直接回答更高效"
  }
}

22.4 构造高质量训练样本的原则

22.4.1 参数质量维度

# 参数质量评分系统
class ParameterQualityScorer:
    def score(self, tool_call: dict, ground_truth: dict) -> dict:
        scores = {}
        
        # 1. 参数完整性：必填字段是否都提供了？
        required_fields = ground_truth.get("required", [])
        provided = tool_call.get("arguments", {}).keys()
        scores["completeness"] = len(set(required_fields) & set(provided)) / max(len(required_fields), 1)
        
        # 2. 参数准确性：值是否正确？
        correct_count = sum(
            1 for k, v in ground_truth["arguments"].items()
            if tool_call["arguments"].get(k) == v
        )
        scores["accuracy"] = correct_count / len(ground_truth["arguments"])
        
        # 3. 参数精度：类型是否匹配？
        type_correct = sum(
            1 for k, v in tool_call.get("arguments", {}).items()
            if isinstance(v, type(ground_truth["arguments"].get(k)))
        )
        scores["type_precision"] = type_correct / max(len(tool_call.get("arguments", {})), 1)
        
        # 4. 无幻觉：是否引入了不存在的参数？
        extra_params = set(tool_call.get("arguments", {}).keys()) - set(ground_truth.get("properties", {}).keys())
        scores["no_hallucination"] = 1.0 if not extra_params else 0.5
        
        scores["overall"] = sum(scores.values()) / len(scores)
        return scores

22.4.2 训练数据多样性要求

维度	要求	占比建议
工具数量	单工具 / 双工具 / 三工具+	50% / 30% / 20%
调用步数	1步 / 2步 / 3步+	40% / 35% / 25%
是否调用	应调用 / 不应调用	70% / 30%
参数类型	string / number / array / object	均衡分布
错误恢复	含工具失败场景	至少 15%
语言	中文 / 英文 / 混合	按需分配

22.4.3 批量生成训练数据（代码示例）

import json
import random
from hermes.data import TrainingDataGenerator

generator = TrainingDataGenerator(
    llm_backend="gpt-4",         # 使用强模型生成训练数据
    tools_pool_path="./tools_definitions.json",
    output_dir="./training_data/",
)

# 配置生成策略
config = {
    "total_samples": 10000,
    "distribution": {
        "single_tool_call": 0.40,
        "multi_step_tool_call": 0.30,
        "parallel_tool_calls": 0.15,
        "no_tool_call": 0.15,
    },
    "quality_threshold": 0.85,   # 低于此分数的样本自动丢弃
    "languages": ["zh", "en"],
    "domains": [
        "web_research", "code_assistance", "data_analysis",
        "file_operations", "calendar_management"
    ],
}

# 生成并过滤
results = generator.generate_batch(config)
print(f"生成 {results.total} 条，保留 {results.passed} 条（通过率: {results.pass_rate:.1%}）")

22.5 多步工具调用序列的标注规范

22.5.1 步骤间数据流标注

{
  "steps": [
    {
      "step_id": 1,
      "tool": "web_search",
      "arguments": {"query": "AAPL stock price"},
      "output_var": "$search_results",
      "produces": ["$search_results.results[0].url"]
    },
    {
      "step_id": 2,
      "tool": "browser_navigate",
      "arguments": {"url": "$search_results.results[0].url"},
      "depends_on": [1],
      "output_var": "$page_content",
      "data_flow": "step1_output -> url_parameter"
    },
    {
      "step_id": 3,
      "tool": "html_extract",
      "arguments": {
        "html": "$page_content.html",
        "selector": ".price-display"
      },
      "depends_on": [2],
      "output_var": "$price_text"
    }
  ],
  "critical_path": [1, 2, 3],
  "parallelizable": [],
  "annotations": {
    "data_dependencies": "步骤1的结果URL传入步骤2，步骤2的HTML传入步骤3",
    "decision_points": "如果步骤1搜索结果为空，需要修改查询重试"
  }
}

22.5.2 决策点（Decision Point）标注

决策点是训练数据中最关键的标注，它们教模型"根据中间结果做判断"：

{
  "decision_points": [
    {
      "after_step": 1,
      "condition": "web_search.result_count == 0",
      "action": "retry_with_broader_query",
      "reasoning": "搜索结果为空时，应该放宽查询条件重试"
    },
    {
      "after_step": 2,
      "condition": "browser_navigate.status_code == 403",
      "action": "try_alternative_url",
      "reasoning": "403 说明网页拒绝访问，应切换到其他来源"
    },
    {
      "after_step": 3,
      "condition": "html_extract.content == null",
      "action": "use_full_page_text",
      "reasoning": "CSS 选择器失效时，退回到全文提取"
    }
  ]
}

22.6 Hermes 4 在 Function Calling 上的优势来源

22.6.1 核心改进维度

改进领域	Hermes 3	Hermes 4	提升幅度
参数准确率	87.3%	94.1%	+6.8pp
工具选择准确率	91.2%	97.8%	+6.6pp
多步调用成功率	72.4%	88.9%	+16.5pp
错误恢复率	58.3%	79.2%	+20.9pp
Schema 合规率	95.1%	99.3%	+4.2pp

22.6.2 技术改进要点

1. 增强型工具描述理解： Hermes 4 的训练数据中，工具描述被故意设计得不完整或模糊，强制模型学会从上下文推断参数语义。

2. 11种解析器联合训练： 同一个工具调用场景用 11 种格式同时标注，使模型具备格式泛化能力。

3. 负样本强化： 训练数据中 30% 为"不应调用工具"场景，显著减少了幻觉工具调用。

4. 错误注入训练： 人工引入工具调用错误（参数错误、类型错误），然后标注正确的恢复路径，训练模型的自我修正能力。

# 错误注入样本示例
{
  "type": "error_recovery",
  "messages": [
    {"role": "assistant", "content": "[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": 123}}\n[/tool_call]"},
    {"role": "tool", "content": "{\"error\": \"TypeError: city must be string, got int\"}"},
    {"role": "assistant", "content": "[inner_monologue]\n我犯了参数类型错误，city 应该是字符串，不是数字。让我修正。\n[/inner_monologue]\n[tool_call]\n{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Beijing\"}}\n[/tool_call]"}
  ]
}

22.7 小结

本章系统讲解了 Hermes Function Calling 的训练数据工程：

11种解析器：覆盖不同模型架构的格式偏好，从原生 hermes 格式到 react 文本格式
训练数据格式：包含单次调用、多步调用、并行调用、拒绝调用四大类型
质量维度：参数完整性、准确性、类型精度、无幻觉四维评分
标注规范：步骤间数据流、决策点标注是多步推理能力的关键
Hermes 4 优势：联合训练、负样本强化、错误注入三大技术支柱

思考题

在生成训练数据时，使用强模型（如 GPT-4）生成样本再进行过滤，vs 人工标注，各有什么优缺点？如何设计混合策略？
"不应该调用工具"的负样本比例设为 30%。如果这个比例过高（如 50%），模型会产生什么行为偏差？
11 种解析器联合训练是否会导致模型在格式选择上产生"不稳定性"（即同样的任务有时生成 hermes 格式，有时生成 react 格式）？如何通过系统提示词强制指定格式？

本章评分

4.5 / 5 (10 评分)