← 返回 Skills 市场

data-synthesis

Name: data-synthesis
Author: erxiong0

作者 chichisyun · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ 安全检测通过

总下载

当前安装

版本数

在 OpenClaw 中安装

/install data-synthesis

功能描述

从 CSV 语料切块后，用同一套 LLM 接口依次生成问题与答案，输出 JSONL 训练数据。适用于文档/表格语料合成 QA、微调数据准备；支持 OpenAI 兼容网关与内网 Qwen 等服务。

使用说明 (SKILL.md)

Data Synthesis

流程概览

CSV 语料 → 按字符切块 → 每块生成问题列表 → 对每个问题基于该块生成答案 → 写出 JSONL。
不包含单独的小模型质量过滤环节。

阶段	说明
输入	带表头的 CSV，需有一列长文本可供切块
切块与遍历	`scripts/synthesize_qa.py`
问题生成	每个文本块调用一次 LLM
答案生成	每个「块 + 问题」调用一次 LLM
输出	每行一条 JSON（JSONL）

Agent 建议执行顺序

准备：运行 scripts/parse_file.py，确认列名、行数、文本列与内容预览。
合成：运行 scripts/synthesize_qa.py，得到 JSONL。
模式：未开启 API 时为 dry-run（不联网）；设置 DATA_SYNTHESIS_USE_API=1 后走真实推理。若网关需要鉴权，再配置 OPENAI_API_KEY。

脚本与依赖

脚本	作用
`scripts/parse_file.py`	校验 CSV，输出列名、行数、`detected_text_column`、文本预览
`scripts/synthesize_qa.py`	完整 QA 流水线；默认输出 `\x3C输入文件名_stem>_qa.jsonl`，可用 `-o` 指定路径

仅依赖 Python 标准库。API 模式通过 urllib 调用 OpenAI 兼容的 POST .../v1/chat/completions（OPENAI_BASE_URL 可写成根路径 .../v1，也可写成完整 URL 直至 .../chat/completions）。

命令示例

# 1. 检查语料
python scripts/parse_file.py path/to/corpus.csv

# 2. 本地 dry-run（不写密钥、不调外网）
python scripts/synthesize_qa.py path/to/corpus.csv -o path/to/out.jsonl

内网 Qwen 示例（与环境变量一致，推荐）：

export DATA_SYNTHESIS_USE_API=1
export OPENAI_BASE_URL='http://xxx'
export DATA_SYNTHESIS_MODEL='xxx'
export DATA_SYNTHESIS_MAX_TOKENS=2000
export DATA_SYNTHESIS_TEMPERATURE=0.7

python scripts/synthesize_qa.py path/to/corpus.csv \
  --text-column text \
  --chunk-size 6000 \
  --chunk-overlap 200 \
  --sleep 0.3 \
  -o path/to/out.jsonl

仅用命令行指定模型（与上面环境变量二选一即可）：

python scripts/synthesize_qa.py path/to/corpus.csv \
  --text-column text \
  --model xxx \
  -o path/to/out.jsonl

常用参数：--max-rows 试跑前几行；--sleep 控制请求间隔。勿在已设置 DATA_SYNTHESIS_MODEL 时再传冲突的 --model。

输入格式

CSV 第一行为表头。
文本列自动匹配：text、content、body、正文、文本；否则使用第一列。
手动指定：--text-column 列名。

输出格式

JSONL：每行一个对象，主要字段包括 chunk_id、row_index、chunk、question、answer，以及 source_fields（除主文本列外、值非空的其它列）。
运行结束会在标准输出打印统计 JSON：rows、chunks、questions_generated、qa_pairs_written、errors。

能力与局限

能力： 适合从表格语料批量得到结构化 QA，便于后续清洗与训练；出题与作答共用同一模型与网关配置。

局限： 切块为字符滑动窗口，不是按段落语义切分；问题列表依赖模型返回可解析的 JSON，失败会记入 errors。

安全使用建议

This skill appears to do exactly what it says: chunk CSV text and produce JSONL Q/A pairs. Notes before you run it: (1) By default it runs in dry-run mode; it only calls external LLMs if you set DATA_SYNTHESIS_USE_API=1. (2) If you enable API mode and set OPENAI_API_KEY (or point OPENAI_BASE_URL to a gateway), the script will POST the text chunks and questions to that endpoint — that may reveal sensitive data from your CSV and incur costs. Audit the CSV for PII before sending, prefer an internal/enterprise gateway if available, and test with dry-run and small inputs first. (3) The output 'source_fields' includes other non-empty columns from each row — remove or redact sensitive columns beforehand. (4) Use a scoped API key and monitor usage. Overall the skill is internally coherent and there are no hidden endpoints or obfuscated behaviors in the provided code.

功能分析

Type: OpenClaw Skill Name: data-synthesis Version: 1.0.0 The skill bundle provides a legitimate utility for generating QA datasets from CSV files using LLM APIs. The scripts (parse_file.py and synthesize_qa.py) use Python's standard library for CSV processing and HTTP requests, following standard patterns for OpenAI-compatible integrations without any evidence of data exfiltration, unauthorized execution, or malicious prompt injection.

能力评估

✓ Purpose & Capability

Name/description (synthesize QA from CSV for training) matches the included scripts and SKILL.md. The scripts only read the input CSV, chunk text, call an LLM endpoint when enabled, and write JSONL output — all consistent with the stated function.

✓ Instruction Scope

SKILL.md and scripts limit actions to: validate/preview CSV, chunk text, call LLM (when DATA_SYNTHESIS_USE_API=1), and write output JSONL. There are no instructions to read unrelated files, access external endpoints beyond the configured OPENAI_BASE_URL, or exfiltrate other system data.

✓ Install Mechanism

Instruction-only skill with bundled Python scripts; no install spec, no downloads, and only Python standard library usage (urllib, csv, json). No suspicious install sources or archive extraction.

✓ Credentials

Optional environment variables (DATA_SYNTHESIS_USE_API, OPENAI_API_KEY, OPENAI_BASE_URL, DATA_SYNTHESIS_MODEL, etc.) are appropriate for contacting an LLM gateway. The registry metadata lists no required env vars, which is consistent because API use is opt-in. No unrelated credentials are requested.

✓ Persistence & Privilege

Skill does not request always:true, does not modify other skills or system configuration, and does not persist credentials. Agent autonomy is default and not combined with other concerning privileges.

如何使用

确保已安装 OpenClaw（本地或 Docker 部署）
在对话框中输入安装命令：/install data-synthesis
安装完成后，直接呼叫该 Skill 的名称或使用 /data-synthesis 触发
根据 Skill 的参数说明提供必要输入，即可获得结构化输出

版本历史

v1.0.0

- Initial release of "data-synthesis": generates QA training data by splitting CSV text into chunks, then using an LLM to create questions and answers for each chunk. - Supports both OpenAI-compatible APIs and internal Qwen services. - Implements a dry-run mode (no API calls) and flexible environment configuration. - Provides scripts for CSV parsing and full QA data synthesis, outputting JSONL with key metadata. - Designed for efficient document/table data QA synthesis, with clear input/output formats and usage instructions.

元数据

Slug data-synthesis

版本 1.0.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 1

常见问题

data-synthesis 是什么？

从 CSV 语料切块后，用同一套 LLM 接口依次生成问题与答案，输出 JSONL 训练数据。适用于文档/表格语料合成 QA、微调数据准备；支持 OpenAI 兼容网关与内网 Qwen 等服务。它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 80 次。

如何安装 data-synthesis？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install data-synthesis」即可一键安装，无需额外配置。

data-synthesis 是免费的吗？

是的，data-synthesis 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

data-synthesis 支持哪些平台？

data-synthesis 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 data-synthesis？

由 chichisyun（@erxiong0）开发并维护，当前版本 v1.0.0。