← Back to Skills Marketplace
erxiong0

data-synthesis

by chichisyun · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
80
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install data-synthesis
Description
从 CSV 语料切块后,用同一套 LLM 接口依次生成问题与答案,输出 JSONL 训练数据。 适用于文档/表格语料合成 QA、微调数据准备;支持 OpenAI 兼容网关与内网 Qwen 等服务。
README (SKILL.md)

Data Synthesis

流程概览

CSV 语料 → 按字符切块 → 每块生成问题列表 → 对每个问题基于该块生成答案 → 写出 JSONL。
不包含单独的小模型质量过滤环节。

阶段 说明
输入 带表头的 CSV,需有一列长文本可供切块
切块与遍历 scripts/synthesize_qa.py
问题生成 每个文本块调用一次 LLM
答案生成 每个「块 + 问题」调用一次 LLM
输出 每行一条 JSON(JSONL)

Agent 建议执行顺序

  1. 准备:运行 scripts/parse_file.py,确认列名、行数、文本列与内容预览。
  2. 合成:运行 scripts/synthesize_qa.py,得到 JSONL。
  3. 模式:未开启 API 时为 dry-run(不联网);设置 DATA_SYNTHESIS_USE_API=1 后走真实推理。若网关需要鉴权,再配置 OPENAI_API_KEY

脚本与依赖

脚本 作用
scripts/parse_file.py 校验 CSV,输出列名、行数、detected_text_column、文本预览
scripts/synthesize_qa.py 完整 QA 流水线;默认输出 \x3C输入文件名_stem>_qa.jsonl,可用 -o 指定路径

仅依赖 Python 标准库。API 模式通过 urllib 调用 OpenAI 兼容POST .../v1/chat/completionsOPENAI_BASE_URL 可写成根路径 .../v1,也可写成完整 URL 直至 .../chat/completions)。

命令示例

# 1. 检查语料
python scripts/parse_file.py path/to/corpus.csv

# 2. 本地 dry-run(不写密钥、不调外网)
python scripts/synthesize_qa.py path/to/corpus.csv -o path/to/out.jsonl

内网 Qwen 示例(与环境变量一致,推荐):

export DATA_SYNTHESIS_USE_API=1
export OPENAI_BASE_URL='http://xxx'
export DATA_SYNTHESIS_MODEL='xxx'
export DATA_SYNTHESIS_MAX_TOKENS=2000
export DATA_SYNTHESIS_TEMPERATURE=0.7

python scripts/synthesize_qa.py path/to/corpus.csv \
  --text-column text \
  --chunk-size 6000 \
  --chunk-overlap 200 \
  --sleep 0.3 \
  -o path/to/out.jsonl

仅用命令行指定模型(与上面环境变量二选一即可):

python scripts/synthesize_qa.py path/to/corpus.csv \
  --text-column text \
  --model xxx \
  -o path/to/out.jsonl

常用参数:--max-rows 试跑前几行;--sleep 控制请求间隔。勿在已设置 DATA_SYNTHESIS_MODEL 时再传冲突的 --model

输入格式

  • CSV 第一行为表头。
  • 文本列自动匹配:textcontentbody正文文本;否则使用第一列
  • 手动指定:--text-column 列名

输出格式

  • JSONL:每行一个对象,主要字段包括 chunk_idrow_indexchunkquestionanswer,以及 source_fields(除主文本列外、值非空的其它列)。
  • 运行结束会在标准输出打印统计 JSON:rowschunksquestions_generatedqa_pairs_writtenerrors

能力与局限

能力: 适合从表格语料批量得到结构化 QA,便于后续清洗与训练;出题与作答共用同一模型与网关配置。

局限: 切块为字符滑动窗口,不是按段落语义切分;问题列表依赖模型返回可解析的 JSON,失败会记入 errors

Usage Guidance
This skill appears to do exactly what it says: chunk CSV text and produce JSONL Q/A pairs. Notes before you run it: (1) By default it runs in dry-run mode; it only calls external LLMs if you set DATA_SYNTHESIS_USE_API=1. (2) If you enable API mode and set OPENAI_API_KEY (or point OPENAI_BASE_URL to a gateway), the script will POST the text chunks and questions to that endpoint — that may reveal sensitive data from your CSV and incur costs. Audit the CSV for PII before sending, prefer an internal/enterprise gateway if available, and test with dry-run and small inputs first. (3) The output 'source_fields' includes other non-empty columns from each row — remove or redact sensitive columns beforehand. (4) Use a scoped API key and monitor usage. Overall the skill is internally coherent and there are no hidden endpoints or obfuscated behaviors in the provided code.
Capability Analysis
Type: OpenClaw Skill Name: data-synthesis Version: 1.0.0 The skill bundle provides a legitimate utility for generating QA datasets from CSV files using LLM APIs. The scripts (parse_file.py and synthesize_qa.py) use Python's standard library for CSV processing and HTTP requests, following standard patterns for OpenAI-compatible integrations without any evidence of data exfiltration, unauthorized execution, or malicious prompt injection.
Capability Assessment
Purpose & Capability
Name/description (synthesize QA from CSV for training) matches the included scripts and SKILL.md. The scripts only read the input CSV, chunk text, call an LLM endpoint when enabled, and write JSONL output — all consistent with the stated function.
Instruction Scope
SKILL.md and scripts limit actions to: validate/preview CSV, chunk text, call LLM (when DATA_SYNTHESIS_USE_API=1), and write output JSONL. There are no instructions to read unrelated files, access external endpoints beyond the configured OPENAI_BASE_URL, or exfiltrate other system data.
Install Mechanism
Instruction-only skill with bundled Python scripts; no install spec, no downloads, and only Python standard library usage (urllib, csv, json). No suspicious install sources or archive extraction.
Credentials
Optional environment variables (DATA_SYNTHESIS_USE_API, OPENAI_API_KEY, OPENAI_BASE_URL, DATA_SYNTHESIS_MODEL, etc.) are appropriate for contacting an LLM gateway. The registry metadata lists no required env vars, which is consistent because API use is opt-in. No unrelated credentials are requested.
Persistence & Privilege
Skill does not request always:true, does not modify other skills or system configuration, and does not persist credentials. Agent autonomy is default and not combined with other concerning privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install data-synthesis
  3. After installation, invoke the skill by name or use /data-synthesis
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of "data-synthesis": generates QA training data by splitting CSV text into chunks, then using an LLM to create questions and answers for each chunk. - Supports both OpenAI-compatible APIs and internal Qwen services. - Implements a dry-run mode (no API calls) and flexible environment configuration. - Provides scripts for CSV parsing and full QA data synthesis, outputting JSONL with key metadata. - Designed for efficient document/table data QA synthesis, with clear input/output formats and usage instructions.
Metadata
Slug data-synthesis
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is data-synthesis?

从 CSV 语料切块后,用同一套 LLM 接口依次生成问题与答案,输出 JSONL 训练数据。 适用于文档/表格语料合成 QA、微调数据准备;支持 OpenAI 兼容网关与内网 Qwen 等服务。 It is an AI Agent Skill for Claude Code / OpenClaw, with 80 downloads so far.

How do I install data-synthesis?

Run "/install data-synthesis" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is data-synthesis free?

Yes, data-synthesis is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does data-synthesis support?

data-synthesis is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created data-synthesis?

It is built and maintained by chichisyun (@erxiong0); the current version is v1.0.0.

💬 Comments