
Data Generator

by HuaiBuer · GitHub ↗ · v2.2.0 · MIT-0
cross-platform · ✓ Security Clean
Downloads: 166 · Stars: 0 · Active Installs: 0 · Versions: 6
Install in OpenClaw
/install data-generator
Description
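A training-data generation skill. Given a tool name and a list of user instructions, it generates JSONL training data in multi-turn dialogue format. Trigger scenarios: (1) a tool name and a list of user instructions are passed in to generate complete training data; (2) batch generation of labeled data for a specified tool; (3) given a list of instructions, output JSONL dialogue samples.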
README (SKILL.md)

Data Generator

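Converts a list of user instructions into standard JSONL training data.

Input

Two required parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| tool_name | string | Tool name, e.g. dev_control, scene_generator, weather |
| user_instructions | string[] | List of user instructions, e.g. ["5分钟后打开空调", "3分钟后关灯"] ("turn on the air conditioner in 5 minutes", "turn off the light in 3 minutes") |

Output JSONL format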

{"conversations":[
  {"from":"human","value":"<当前用户指令>打开客厅空调</当前用户指令>\n<本地设备>格力冷静王(空调)</本地设备>\n<当前时间>2026-03-15 14:22:47</当前时间>\n<用户场景列表>[{\"scene_id\":1001,\"scene_name\":\"回家模式\",\"room_name\":\"全屋\"},{\"scene_id\":1002,\"scene_name\":\"睡眠模式\",\"room_name\":\"主卧\"}]</用户场景列表>\n<用户设备列表>{\"客厅\":[\"格力冷静王(空调)\",\"洗碗机A1(洗碗机)\"],\"主卧\":[\"美的舒省风(空调)\"]}</用户设备列表>"},
  {"from":"assistant","value":"<tool_call>{\"tool_name\":\"dev_control\",\"query\":\"打开客厅空调\"}</tool_call>"},
  {"from":"observation","value":"<tool_response>客厅空调已打开</tool_response>"},
  {"from":"assistant","value":"好的,客厅空调已经打开啦~"}
],"system":"","history":[]}
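
A record of this shape can be produced with Python's standard json module. The following is a minimal sketch, not the skill's own gen_data.py: the helper names build_human_value and build_record are hypothetical, and it assumes the five context fields are separated by newline characters. Passing ensure_ascii=False keeps the Chinese text readable, and json.dumps escapes the embedded newlines so each record stays on one physical line.

```python
import json

# Compact JSON without ASCII-escaping, used for values embedded inside tags.
_COMPACT = {"ensure_ascii": False, "separators": (",", ":")}


def build_human_value(instruction, local_device, now, scenes, devices):
    """Join the five context fields in fixed order (newline-separated: an assumption)."""
    return "\n".join([
        f"<当前用户指令>{instruction}</当前用户指令>",
        f"<本地设备>{local_device}</本地设备>",
        f"<当前时间>{now}</当前时间>",
        f"<用户场景列表>{json.dumps(scenes, **_COMPACT)}</用户场景列表>",
        f"<用户设备列表>{json.dumps(devices, **_COMPACT)}</用户设备列表>",
    ])


def build_record(human_value, tool_name, query, tool_response, reply):
    """Assemble one four-turn sample: human, tool call, observation, final reply."""
    call = json.dumps({"tool_name": tool_name, "query": query}, **_COMPACT)
    return {
        "conversations": [
            {"from": "human", "value": human_value},
            {"from": "assistant", "value": f"<tool_call>{call}</tool_call>"},
            {"from": "observation", "value": f"<tool_response>{tool_response}</tool_response>"},
            {"from": "assistant", "value": reply},
        ],
        "system": "",
        "history": [],
    }


human = build_human_value(
    "打开客厅空调",
    "格力冷静王(空调)",
    "2026-03-15 14:22:47",
    [{"scene_id": 1001, "scene_name": "回家模式", "room_name": "全屋"}],
    {"客厅": ["格力冷静王(空调)"]},
)
record = build_record(human, "dev_control", "打开客厅空调", "客厅空调已打开", "好的,客厅空调已经打开啦~")
jsonl_line = json.dumps(record, ensure_ascii=False)  # one record per line
```

Appending jsonl_line plus a trailing newline to a file opened with encoding="utf-8" yields one sample per line.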

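Format rules

  1. human value = the full context, in a fixed format:
    <当前用户指令>original user instruction</当前用户指令>
    <本地设备>device name(type)</本地设备>
    <当前时间>YYYY-MM-DD HH:mm:ss</当前时间>
    <用户场景列表>[{"scene_id":xxx,"scene_name":"scene name","room_name":"room name"},...]</用户场景列表>
    <用户设备列表>{"room":["device name(type)",...]}</用户设备列表>
    The tag names are literal and stay in Chinese: 当前用户指令 (current user instruction), 本地设备 (local device), 当前时间 (current time), 用户场景列表 (user scene list), 用户设备列表 (user device list).
  2. assistant tool_call = output the tool_call tag directly, with no filler prefix
  3. observation = <tool_response>...</tool_response><tool_call>{...}</tool_call> (for tools such as dev_info and weather)
  4. assistant final reply = the reply content directly, with no filler prefix
  5. system = "", history = []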
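
Several of these rules can be checked mechanically before a file is used for training. The sketch below is one possible validator, not part of the skill: check_record is a hypothetical name, it covers only rules 1, 2, and 5, and it leaves the observation format (rule 3) and the wording of the final reply (rule 4) unchecked.

```python
import re

# Tag order required by rule 1.
HUMAN_TAGS = ["当前用户指令", "本地设备", "当前时间", "用户场景列表", "用户设备列表"]
TIME_RE = re.compile(r"<当前时间>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}</当前时间>")


def check_record(record):
    """Return a list of rule violations; an empty list means the record passes."""
    turns = record.get("conversations") or []
    if not turns or turns[0].get("from") != "human":
        return ["first turn must come from 'human'"]
    problems = []

    # Rule 1: all five tags present, in the fixed order, with a valid timestamp.
    human, pos = turns[0].get("value", ""), 0
    for tag in HUMAN_TAGS:
        start = human.find(f"<{tag}>", pos)
        end = human.find(f"</{tag}>", start) if start != -1 else -1
        if end == -1:
            problems.append(f"missing or out-of-order tag: {tag}")
        else:
            pos = end
    if not TIME_RE.search(human):
        problems.append("当前时间 must use YYYY-MM-DD HH:mm:ss")

    # Rule 2: an assistant tool-call turn starts directly with the tag.
    for turn in turns[1:]:
        value = turn.get("value", "")
        if (turn.get("from") == "assistant" and "<tool_call>" in value
                and not value.startswith("<tool_call>")):
            problems.append("assistant tool_call has a prefix before the tag")

    # Rule 5: system is an empty string and history an empty list.
    if record.get("system") != "":
        problems.append("system must be an empty string")
    if record.get("history") != []:
        problems.append("history must be an empty list")
    return problems
```

Running check_record over every parsed line of a generated file and collecting the non-empty results gives a quick report of malformed samples.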

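Workflow

1. Receive tool_name + user_instructions[]
         ↓
2. Load the prompt: general requirements + tool-specific requirements (references/tools/{tool}.txt)
         ↓
3. Inject user_instructions into the prompt
         ↓
4. Generate JSONL (each record independent)
         ↓
5. Write the .jsonl file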
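
The outer loop of this workflow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bundled gen_data.py: run_pipeline and fake_generate are hypothetical names, and the generate_one callable stands in for steps 2 to 4, whose real implementation is not shown on this page.

```python
import io
import json


def run_pipeline(tool_name, user_instructions, generate_one, out):
    """Write one JSON record per instruction to the text stream `out`.

    `generate_one(tool_name, instruction)` is a hypothetical callable that
    stands in for prompt loading, injection, and generation, and must
    return a dict.
    """
    if not tool_name or not user_instructions:
        raise ValueError("tool_name and user_instructions are both required")
    count = 0
    for instruction in user_instructions:
        record = generate_one(tool_name, instruction)
        # With the default indent=None, json.dumps emits no raw newline,
        # so each record occupies exactly one line.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def fake_generate(tool_name, instruction):
    """Placeholder generator used only to exercise the loop."""
    return {
        "conversations": [{"from": "human", "value": instruction}],
        "system": "",
        "history": [],
    }


buffer = io.StringIO()
written = run_pipeline("dev_info", ["家里空调数量", "有几个空调"], fake_generate, buffer)
lines = buffer.getvalue().splitlines()
```

To produce a real file, pass a handle from open(path, "w", encoding="utf-8") in place of the in-memory buffer.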

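Prompt assembly

Assembly rule:

[general requirements]
# ═══════════════════════════════════════════════════════════════════
# 【工具特定要求】
# 本次只调用:{TOOL_NAME}
# ────────────────────────────────────────────────────────────────
[contents of references/tools/{TOOL_NAME}.txt]

The two banner lines are literal prompt text: 【工具特定要求】 means "tool-specific requirements" and 本次只调用 means "this run calls only".

Assembly script: scripts/build_prompt.py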
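
The assembly rule amounts to a plain string concatenation. The sketch below is not the actual scripts/build_prompt.py, whose source is not shown here: the function names, the general_path parameter, and the exact rule lengths are assumptions made for illustration.

```python
from pathlib import Path

# Banner text copied from the template above; the rule lengths are illustrative.
HEAVY_RULE = "# " + "═" * 67
LIGHT_RULE = "# " + "─" * 64


def assemble_prompt(general_text, tool_name, tool_text):
    """Concatenate general requirements, the banner, and the tool-specific text."""
    return "\n".join([
        general_text.rstrip("\n"),
        HEAVY_RULE,
        "# 【工具特定要求】",
        f"# 本次只调用:{tool_name}",
        LIGHT_RULE,
        tool_text.rstrip("\n"),
    ])


def build_prompt(general_path, tool_name, tools_dir="references/tools"):
    """Read both files from disk and assemble them (general_path is an assumption)."""
    general_text = Path(general_path).read_text(encoding="utf-8")
    tool_text = (Path(tools_dir) / f"{tool_name}.txt").read_text(encoding="utf-8")
    return assemble_prompt(general_text, tool_name, tool_text)


prompt = assemble_prompt("general rules\n", "weather", "weather rules\n")
```

Keeping assemble_prompt free of file access makes the concatenation easy to test, while build_prompt handles the reads.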

Tool-to-file mapping

| Tool | Requirements file |
| --- | --- |
| dev_control | references/tools/dev_control.txt |
| scene_generator | references/tools/scene_generator.txt |
| alarm_remind | references/tools/alarm_remind.txt |
| weather | references/tools/weather.txt |
| scene_control | references/tools/scene_control.txt |
| dev_info | references/tools/dev_info.txt |
| exit_dialog | references/tools/exit_dialog.txt |
| GreeQA | references/tools/GreeQA.txt |
| scene_guide | references/tools/scene_guide.txt |
| chat | references/tools/chat.txt |
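
Because every requirements file follows the same naming pattern, the mapping can be expressed as an allow-list plus a path template. The following is a hypothetical helper, not code from the skill; checking the name against the list also keeps a caller from smuggling path separators into tool_name.

```python
from pathlib import PurePosixPath

# The ten tool names listed in the mapping above.
SUPPORTED_TOOLS = (
    "dev_control", "scene_generator", "alarm_remind", "weather", "scene_control",
    "dev_info", "exit_dialog", "GreeQA", "scene_guide", "chat",
)


def requirement_file(tool_name):
    """Return the requirements file for a tool, rejecting unknown names."""
    if tool_name not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported tool: {tool_name!r}")
    return str(PurePosixPath("references/tools") / f"{tool_name}.txt")
```

For example, requirement_file("GreeQA") returns "references/tools/GreeQA.txt", and an unlisted name raises ValueError.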

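Usage example

Input:

tool_name: "dev_info"
user_instructions: ["家里空调数量", "有几个空调"]

Both instructions ask how many air conditioners the home has.

Output fields:

| Field | Description |
| --- | --- |
| conversations[0].value | <当前用户指令> + full context |
| conversations[1].value | <tool_call>{"tool_name":"dev_info"}</tool_call> |
| conversations[2].value | Result returned by dev_info (device list) |
| conversations[3].value | Final text reply |
| system | Empty string "" |
| history | Empty array [] |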

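Bug-fix data

When the tool_name passed in is the corrected tool after a fix, the generated data should reflect:

  • A correctly formatted tool call (conforming to the tool's requirements file)
  • A correctly formatted query field (e.g. delayed instructions include a timing field)
  • A text reply that matches expectations (including a description of the delay time)

For the exact format, see references/tools/scene_generator.txt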

Usage Guidance
This skill appears to do only what it claims: generate JSONL training samples from tool-specific templates and user instruction lists. Before running: (1) inspect a sample output to ensure it doesn't accidentally encode any sensitive content you might include in the input file; (2) do not pass paths to sensitive system files to the --file option (the script will embed file contents into the generated data); (3) if you plan to include real user data or production logs, sanitize or anonymize them first. Otherwise it is safe to install from a provenance perspective (no network calls, no secrets requested).
Capability Analysis
Type: OpenClaw Skill
Name: data-generator
Version: 2.2.0
The data-generator skill bundle is a utility for creating synthetic JSONL training datasets for a smart home AI assistant. The included Python scripts (scripts/build_prompt.py and scripts/gen_data.py) generate simulated multi-turn dialogues by combining user instructions with randomized device and scene contexts. The logic is entirely focused on data formatting and string manipulation based on tool definitions found in the references/ directory. No indicators of malicious intent, such as data exfiltration, unauthorized system access, or harmful prompt injection, were detected.
Capability Assessment
Purpose & Capability
The name and description (生成训练数据, "generate training data") match the included artifacts: prompt templates, per-tool rule files under references/tools/, and two generator scripts. All required files and behavior (creating JSONL from user instructions) are coherent with the stated purpose.
Instruction Scope
SKILL.md and scripts constrain behavior to assembling prompts and producing JSONL records. The generator reads local reference files and, if invoked with --file, will read a user-supplied file of instructions — this is expected for the task but means you should not point it at sensitive system files because their contents would be embedded into generated data.
Install Mechanism
No install spec; instruction-only skill with bundled Python scripts. No downloads, no external package installs, and no code executed from remote URLs.
Credentials
The skill requests no environment variables, credentials, or config paths. The scripts operate on provided inputs and bundled reference files only, which is proportionate to the stated functionality.
Persistence & Privilege
Skill is not always-enabled and uses normal model invocation semantics. It does write output files (the generated .jsonl) as expected, but does not modify other skills or request elevated/system-wide privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install data-generator
  3. After installation, invoke the skill by name or use /data-generator
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v2.2.0
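Updated the scene_generator prompt: added a supplementary rule that datetime must not be empty when repeat_type=holidays; updated the GreeQA prompt
v2.1.0
Updated the GreeQA prompt: corrected "格力小智" to "格仔", refined the return-distribution rule (2 items 50% / 1 item 25% / 0 items 25%), added a description of the two-method structure and example model replies
v2.0.0
New v2 format: conversations[0].value carries the full context, system is an empty string and history an empty array, assistant turns have no filler prefix
v1.2.0
SKILL.md streamlined: input is tool_name + a user_instructions list; focus on JSONL generation
v1.1.0
Added gen_data.py: takes tool_name + an instruction list and generates JSONL; improved the input/output format description in SKILL.md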
v1.0.0
Data Generator skill initial release.
- Enables generating multi-turn dialogue training data for smart home scenarios in JSONL format based on natural language instructions.
- Supports 10 different tool types, each with specific data generation requirements.
- Provides a prompt-building script for composing tool-specific generation requests.
- Allows batch generation of specified tool training data for standard, special, and bug-fix scenarios.
- Includes comprehensive documentation and usage examples for both manual and automated workflows.
Metadata
Slug: data-generator
Version: 2.2.0
License: MIT-0
All-time Installs: 0
Active Installs: 0
Total Versions: 6
Frequently Asked Questions

What is Data Generator?

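A training-data generation skill. Given a tool name and a list of user instructions, it generates JSONL training data in multi-turn dialogue format. Trigger scenarios: (1) a tool name and a list of user instructions are passed in to generate complete training data; (2) batch generation of labeled data for a specified tool; (3) given a list of instructions, output JSONL dialogue samples. It is an AI Agent Skill for Claude Code / OpenClaw, with 166 downloads so far.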

How do I install Data Generator?

Run "/install data-generator" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Data Generator free?

Yes, Data Generator is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Data Generator support?

Data Generator is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Data Generator?

It is built and maintained by HuaiBuer (@huaibuer); the current version is v2.2.0.
