Corpus Builder

Author: yuzhihui886 · GitHub · v1.1.2 · MIT-0
Cross-platform · Security check passed
141 total downloads · 0 favorites · 0 current installs · 8 versions
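Install in OpenClaw
/install corpus-builder
Description
A corpus-building tool with smart chunking, AI annotation, and vector storage. Annotation uses an LLM when one is available (requires the DashScope API) and falls back to a rule engine otherwise.
Usage (SKILL.md)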

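Corpus Builder - Corpus-Building Tool

A lightweight corpus builder optimized for Chinese novels. It supports scene-aware chunking, 10-dimension AI annotation, and ChromaDB vector storage.

Annotation Modes

  • LLM mode (recommended): uses the DashScope API for annotation (requires DASHSCOPE_API_KEY)
  • Rule mode (fallback): uses a rule engine for annotation when no API key is available (fully offline)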

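🔐 Security Notes

This skill commits to the following:

  • ✅ The API key is passed via the DASHSCOPE_API_KEY environment variable
  • Does not read the ~/.openclaw/ directory or any global configuration file
  • Does not store the API key in the skill directory or in any local file
  • Does not use subprocess to call external CLI tools
  • Does not access credentials belonging to other providers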

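Environment Setup

LLM Mode (API Key Required)

Set the environment variable (the only supported method):

# Temporary (current terminal session only)
export DASHSCOPE_API_KEY="sk-xxx"

# Permanent (append to ~/.bashrc)
echo 'export DASHSCOPE_API_KEY="sk-xxx"' >> ~/.bashrc
source ~/.bashrc

⚠️ Note: do not commit the API key to Git or share it with others.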

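Rule Mode (Fully Offline)

No API key is needed; annotation runs on the rule engine:

  • Leave the DASHSCOPE_API_KEY environment variable unset
  • The skill falls back to rule-based annotation automatically
  • Annotation quality is lower, but everything runs offline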
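
The fallback described above can be sketched as a single environment check. This is a minimal illustration, not the skill's actual code: the function name select_annotation_mode is hypothetical, and treating a blank value as "unset" is an assumption.

```python
import os


def select_annotation_mode(env=None):
    """Return "llm" when a non-empty DASHSCOPE_API_KEY is present, else "rule".

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    Hypothetical helper: the real skill may decide differently.
    """
    env = os.environ if env is None else env
    api_key = (env.get("DASHSCOPE_API_KEY") or "").strip()
    # Assumption: a blank or whitespace-only key counts as "not set".
    return "llm" if api_key else "rule"


if __name__ == "__main__":
    print(select_annotation_mode())
```

Running the snippet in a shell where the variable is not exported should report rule mode, which is a quick way to confirm that a session is set up for offline use.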

Optional: SQLite3 Compatibility

If you see the runtime error "sqlite3 version < 3.35.0":

# Install pysqlite3-binary (only needed on older systems)
pip3 install pysqlite3-binary --user

Modern systems (Ubuntu 20.04+, macOS 12+, Python 3.10+) usually do not need this.
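
To find out whether a given interpreter is affected before hitting the error, the SQLite version linked into Python can be checked directly. The sketch below uses only the standard library; the helper names are hypothetical, and the module swap is a commonly used workaround rather than something this skill is confirmed to do.

```python
import sqlite3
import sys

MIN_SQLITE = (3, 35, 0)  # minimum version named in the error message


def sqlite_is_compatible(version_info=None):
    """Return True if the SQLite library version is at least 3.35.0."""
    if version_info is None:
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= MIN_SQLITE


def use_pysqlite3_if_needed():
    """Swap in pysqlite3 for sqlite3 when the bundled SQLite is too old.

    Returns True if the swap happened. Call this before importing chromadb.
    """
    if sqlite_is_compatible():
        return False
    try:
        import pysqlite3  # installed by the pysqlite3-binary package
    except ImportError:
        return False
    sys.modules["sqlite3"] = pysqlite3
    return True


if __name__ == "__main__":
    print("SQLite", sqlite3.sqlite_version, "compatible:", sqlite_is_compatible())
```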

Quick Start

Build a Corpus

cd ~/.openclaw/workspace/skills/corpus-builder

# 1. Batch-process novel text files
python3 scripts/build_corpus.py \
    --source ~/workspace/novels/reference \
    --name 玄幻打斗 \
    --genre 玄幻 \
    --max-chunk-size 2000

# 2. View statistics
python3 scripts/build_corpus.py \
    --stats \
    --collection 玄幻打斗

# 3. Export annotation data
python3 scripts/build_corpus.py \
    --export json \
    --collection 玄幻打斗 \
    --output results.json

💡 Need to search the corpus? Use the corpus-search skill.

Annotation Data Example

{
    "scene_type": "打斗",
    "emotion": "紧张",
    "quality_score": 8,
    "original_text": "...",
    "source_file": "没钱修什么仙.txt"
}
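
Records of this shape are easy to post-process once exported. The following sketch assumes the export file is a JSON array of objects like the one above; the actual layout of results.json is not documented here, and load_records and filter_records are hypothetical helper names.

```python
import json


def load_records(path):
    """Load exported annotations. Assumes the file holds a JSON array of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def filter_records(records, scene_type=None, min_quality=0):
    """Keep records matching scene_type (if given) with quality_score >= min_quality."""
    kept = []
    for rec in records:
        if scene_type is not None and rec.get("scene_type") != scene_type:
            continue
        # Records without a score are treated as quality 0 (an assumption).
        if rec.get("quality_score", 0) < min_quality:
            continue
        kept.append(rec)
    return kept
```

For example, filter_records(load_records("results.json"), scene_type="打斗", min_quality=8) would keep only high-scoring fight scenes, provided the export has the assumed structure.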

Installing Dependencies

cd ~/.openclaw/workspace/skills/corpus-builder
pip3 install -r requirements.txt --user

Required Dependencies

Package Purpose
chromadb Vector database
sentence-transformers Embedding model
pyyaml YAML parsing
rich CLI formatting
psutil Memory monitoring

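Memory Optimization

  • Monitoring threshold: 2.5 GB
  • Automatic release: browser/model caches
  • Batching: AI annotation in batches of 5, embedding in batches of 32
  • Incremental processing: resumes from checkpoints and skips work already done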
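
The batching and threshold ideas above can be illustrated with two small helpers. This is a sketch under stated assumptions, not the skill's implementation: the names batches and over_limit are hypothetical, and reading 2.5 GB as 2500 MB is a guess based on the --memory-limit examples elsewhere in this document.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def over_limit(rss_bytes: int, limit_mb: int = 2500) -> bool:
    """Return True when resident memory exceeds the limit (assumed to be in MB)."""
    return rss_bytes > limit_mb * 1024 * 1024
```

With psutil installed, the current resident set size is available as psutil.Process().memory_info().rss, which can be passed to over_limit between batches to decide when to release caches.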

Configuration File

Edit configs/default_config.yml:

chunking:
  max_chunk_size: 2000
  min_chunk_size: 100
  overlap: 200
processing:
  batch_size: 5
  embedding_batch_size: 32
  max_workers: 3
models:
  embedding: "BAAI/bge-small-zh-v1.5"
  annotation: "dashscope-coding/qwen3.5-plus"
storage:
  persist_directory: "./corpus/chroma"
  checkpoint_dir: "./corpus/cache"
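
To make the chunking parameters concrete, here is a simple fixed-window sketch showing how max_chunk_size and overlap interact. It is deliberately simplified: the skill itself chunks by scene, min_chunk_size is not modeled, and chunk_text is a hypothetical name.

```python
def chunk_text(text, max_chunk_size=2000, overlap=200):
    """Split text into windows of at most max_chunk_size characters.

    Each window starts (max_chunk_size - overlap) characters after the previous
    one, so neighbouring chunks share `overlap` characters of context.
    """
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults above, a 5,000-character chapter would yield three chunks starting at offsets 0, 1800, and 3600. A larger overlap preserves more context across chunk boundaries at the cost of storing more duplicated text.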

Troubleshooting

High Memory Usage

# Lower the memory limit and batch size
python3 scripts/build_corpus.py \
    --source ./novels \
    --name test \
    --memory-limit 1500 \
    --batch-size 3

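LLM Call Failures

The skill falls back to rule-based annotation. Annotations are still produced, but with lower quality scores.

ChromaDB Errors

Delete the vector store and rebuild it: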

rm -rf corpus/chroma/{collection_name}
python3 scripts/build_corpus.py --source ./novels --name test

Related Scripts

Script Purpose
scripts/build_corpus.py Main program (corpus construction)

License

MIT License

Created for OpenClaw 🦞
Version: 1.0.0
Last Updated: 2026-03-28

Safe-Use Recommendations
What to check before installing:

  • The only sensitive input is an optional DASHSCOPE_API_KEY environment variable used for LLM annotation. If you don't set it, the skill will run in rule-based (offline) mode.
  • Inspect the annotator's HTTP/OpenAI fallback (_call_llm_http) before use to confirm requests go only to the expected DashScope endpoint. The visible code uses the OpenAI-compatible client with base_url pointing to coding.dashscope.aliyuncs.com, which matches the README, but the fallback implementation was truncated in the provided view; verify it doesn't send data elsewhere.
  • Packaging mismatch: requirements.txt includes 'openai' and pysqlite3-binary while pyproject.toml does not list 'openai' as a dependency. If you install via pip -r requirements.txt you will pull the OpenAI SDK (needed for LLM mode); if you install with pyproject tools you may not. Use the install method you trust and review dependencies.
  • The skill stores checkpoints, embeddings, and the Chroma DB under local directories (configurable). Ensure you point persist_directory/checkpoint_dir to storage you control and do not include sensitive texts you don't want stored.
  • Prefer rule-based mode (no DASHSCOPE_API_KEY) if you want fully offline operation; test with a small dataset first and run the included unit tests (pytest) to validate behavior in your environment.
  • Avoid putting API keys into files checked into git or shared shells; prefer per-session environment variables or a secrets manager.

Overall: the code and docs largely match the stated purpose; the mismatches are documentation/packaging points to confirm rather than indicators of malicious behavior.
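
When reviewing the annotator, one way to make the endpoint check concrete is to compare the host of any configured base URL against the one named above. This is a hypothetical review aid, not part of the skill; is_expected_endpoint is an illustrative name, and requiring HTTPS is an added assumption.

```python
from urllib.parse import urlparse

EXPECTED_HOST = "coding.dashscope.aliyuncs.com"  # host named in the review above


def is_expected_endpoint(base_url, expected_host=EXPECTED_HOST):
    """Return True only for HTTPS URLs whose host exactly matches expected_host."""
    parsed = urlparse(base_url)
    # Exact comparison rejects look-alike hosts that merely start with the name.
    return parsed.scheme == "https" and parsed.hostname == expected_host
```

An exact host comparison is used instead of a substring or prefix test because a URL such as one ending in a different domain could otherwise pass while sending data elsewhere.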
Functional Analysis
Type: OpenClaw Skill
Name: corpus-builder
Version: 1.1.2
The 'corpus-builder' skill bundle is a legitimate tool designed for processing text files into a vector database with AI-generated annotations. The code implements standard NLP workflows including scene-based chunking (chunker.py), feature extraction, and LLM-based annotation via the DashScope API (annotator.py). It handles sensitive information like the DASHSCOPE_API_KEY appropriately by reading from environment variables without local storage or exfiltration. The use of pysqlite3-binary for SQLite version compatibility is a standard practice for ChromaDB integration. No malicious patterns such as unauthorized data access, prompt injection, or suspicious remote execution were identified.
Capability Assessment
Purpose & Capability
Name/description (corpus building, chunking, AI annotation, embeddings) align with the shipped code: chunker, annotator, embedder, store and CLI script are present. The only credential mentioned (DASHSCOPE_API_KEY) is appropriate for the optional LLM annotation mode.
Instruction Scope
SKILL.md instructs running the included scripts and passing the optional DASHSCOPE_API_KEY via env var; code shown reads the key only from the environment. The docs/examples reference a path under ~/.openclaw/workspace/skills for where to run the project, but the code does not appear to read global OpenClaw config files (the README/CHANGELOG note that reading ~/.openclaw was removed). No instructions ask the agent to read unrelated system files or other credentials.
Install Mechanism
This is instruction-only (no platform install spec). The package contains Python code and a requirements.txt; installation is via pip as documented. No external binary downloads or URL-based extract/install steps are present in the provided metadata.
Credentials
The skill declares only one optional env var (DASHSCOPE_API_KEY) for LLM mode, which is proportionate. Minor inconsistency: requirements.txt lists 'openai' (and pysqlite3-binary) while pyproject.toml's dependencies do not include openai; this is a packaging/documentation mismatch to verify. No other secret env vars are requested.
Persistence & Privilege
Skill is not marked always:true and does not request persistent system-wide privileges. It writes checkpoints/embeddings into local directories under the skill (config-controlled) which is expected behavior for a corpus builder.
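How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install corpus-builder
  3. Once installed, invoke the skill by name or trigger it with /corpus-builder
  4. Provide the inputs described in the skill's parameter documentation to get structured output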
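Version History
v1.1.2
Fixed issues raised in the ClawHub review
v1.1.1
Security documentation fixes: clarified the environment variable requirement, removed outdated docs, declared optional dependencies
v1.1.0
Modular code refactor; added unit tests and detailed documentation
v1.0.4
General improvements: added pyproject.toml, README.md, and CHANGELOG.md
v1.0.3
Code quality improvements
v1.0.2
Security fix: removed reading of global configuration; only environment variables are used
v1.0.1
Documentation fix: clarified LLM integration and API requirements
v1.0.0
Corpus-building tool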
Metadata
Slug corpus-builder
Version 1.1.2
License MIT-0
Total installs 0
Current installs 0
Versions 8
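FAQ

What is Corpus Builder?

A corpus-building tool with smart chunking, AI annotation, and vector storage. Annotation uses an LLM when one is available (requires the DashScope API) and falls back to a rule engine otherwise. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 141 times so far.

How do I install Corpus Builder?

Run the command "/install corpus-builder" in the OpenClaw or Claude Code chat box to install it in one step, with no extra configuration.

Is Corpus Builder free?

Yes. Corpus Builder is completely free and released under the MIT-0 license; you may download, install, and use it freely.

Which platforms does Corpus Builder support?

Corpus Builder is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops Corpus Builder?

It is developed and maintained by yuzhihui886 (@yuzhihui886). The current version is v1.1.2.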
