wangm-a3

Datacrawl Debug

Author: WangM-A3 · GitHub · v1.1.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 71
Favorites: 0
Current installs: 0
Versions: 5
Install in OpenClaw
/install datacrawl-debug
Description
Use when user needs to process web data, debug data collection code, clean processed data, or iterate on data processing strategies. Use when generating data...
Usage (SKILL.md)

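DataProcess Debug: an end-to-end data processing toolkit

Process it · Fix it · Clean it · Keep it running

Core positioning

The "emergency room plus gym" of data processing: bring failures to the ER (DebugRunner), do routine training in the gym (IterateOptimizer), and keep a nutritionist on hand throughout (DataCleaner).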

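Five core modules

1. ProcessEngine: processing config generation and result parsing

scripts/process-engine.py config --url URL --fields FIELD1 FIELD2 --mode static|dynamic|api
scripts/process-engine.py extract --html "HTML_CONTENT" --fields FIELD1 FIELD2
  • Automatic site-type detection (e-commerce / B2B / social media / content / government / developer)
  • Tool recommendations for the three modes, plus suggested CSS/XPath selectors
  • Structured HTML extraction (text / links / images / tables / lists)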
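The structured extraction step can be illustrated with the standard library alone. The following is a minimal sketch, not the code in process-engine.py: the class name `StructuredExtractor`, the `extract` helper, and the returned keys are assumptions made for this example, and it covers only text, links, and images.

```python
from html.parser import HTMLParser


class StructuredExtractor(HTMLParser):
    """Collects visible text, link targets, and image sources (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self.images = [], [], []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style blocks.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def extract(html):
    parser = StructuredExtractor()
    parser.feed(html)
    parser.close()
    return {"text": parser.text, "links": parser.links, "images": parser.images}


sample = '<div><h1>Widget</h1><a href="/p/1">Details</a><img src="w.png"><script>x=1</script></div>'
result = extract(sample)
print(result)
```

Tables and lists would need extra state tracking per row or item; a dedicated parser such as BeautifulSoup is likely the more practical choice for those.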

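2. CodeGenerator: automatic generation of data processing code

scripts/code-generator.py --name PROJECT_NAME --url URL --fields FIELD1 FIELD2 --mode requests_bs4|playwright|api_client
  • Automatic choice among three templates: static page / dynamically rendered page / API endpoint
  • Generates complete runnable code, dependency installation instructions, and usage steps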

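3. DebugRunner: code debugging and repair

scripts/debug-runner.py --error "ERROR_MESSAGE"
  • Library of 8 error patterns: connection/http_error/timeout/selector_error/encoding/json_parse/selenium_playwright/rate_limit
  • Per-status HTTP diagnosis (403 forbidden, 429 rate limited, 503 service unavailable, and so on, each with its own remedy)
  • Code snippet scanning (automatically detects missing exception handling, timeouts, request delays, and User-Agent headers)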
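An error pattern library of this kind can be approximated with an ordered list of regular expressions. The sketch below reuses the eight category names listed above, but the regexes, the `classify` and `http_hint` helpers, and the hint texts are illustrative assumptions, not the rules shipped in debug-runner.py.

```python
import re

# Hypothetical pattern table. Order matters: the first match wins,
# so the more specific rate_limit check runs before the generic http_error one.
ERROR_PATTERNS = [
    ("rate_limit", r"\b429\b|too many requests|rate limit"),
    ("http_error", r"\b(40[0-9]|50[0-9])\b|HTTPError"),
    ("timeout", r"timed? ?out"),
    ("connection", r"ConnectionError|connection refused|Name or service not known"),
    ("selector_error", r"NoneType.*attribute|no such element|selector"),
    ("encoding", r"UnicodeDecodeError|UnicodeEncodeError|codec can't"),
    ("json_parse", r"JSONDecodeError|Expecting value"),
    ("selenium_playwright", r"playwright|selenium|WebDriverException"),
]

# Illustrative per-status hints (assumed wording).
HTTP_HINTS = {
    403: "check authorization and request headers",
    429: "slow down and honor Retry-After",
    503: "retry later with backoff",
}


def classify(message):
    """Return the first matching category name, or 'unknown'."""
    for category, pattern in ERROR_PATTERNS:
        if re.search(pattern, message, re.IGNORECASE):
            return category
    return "unknown"


def http_hint(message):
    """Return a hint for the first 4xx/5xx status found in the message, if any."""
    match = re.search(r"\b([45]\d\d)\b", message)
    return HTTP_HINTS.get(int(match.group(1))) if match else None


print(classify("HTTPError: 429 Too Many Requests"))
print(classify("AttributeError: 'NoneType' object has no attribute 'text'"))
print(http_hint("403 Client Error: Forbidden for url"))
```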

4. DataCleaner: data cleaning and formatting

scripts/data-cleaner.py clean --input DATA --remove-html --remove-duplicates
scripts/data-cleaner.py normalize --input DATA --schema TYPE_DEFINITIONS
scripts/data-cleaner.py format --input DATA --format json|csv|jsonl --fields FIELD_LIST
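To make the clean and format steps concrete, here is a rough sketch of what `--remove-html`, `--remove-duplicates`, and `--format jsonl` might do to a list of records. The function names and the record shape are assumptions for illustration; the actual behavior of data-cleaner.py is not shown in the source.

```python
import json
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(value):
    """Drop tags, decode entities, and collapse whitespace in string values."""
    if not isinstance(value, str):
        return value
    text = unescape(TAG_RE.sub(" ", value))
    return " ".join(text.split())


def clean(records):
    """Strip HTML from every field, then drop exact duplicate records."""
    seen, out = set(), []
    for record in records:
        cleaned = {k: strip_html(v) for k, v in record.items()}
        key = json.dumps(cleaned, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            out.append(cleaned)
    return out


def to_jsonl(records, fields):
    """Serialize the selected fields, one JSON object per line."""
    return "\n".join(
        json.dumps({f: r.get(f) for f in fields}, ensure_ascii=False) for r in records
    )


rows = [
    {"name": "<b>Widget</b> A", "price": "9.90"},
    {"name": "Widget A", "price": "9.90"},
    {"name": "Widget&nbsp;B", "price": "12.00"},
]
cleaned = clean(rows)
print(to_jsonl(cleaned, ["name", "price"]))
```

Deduplicating after HTML removal matters here: the first two rows differ only in markup and collapse into one record once cleaned.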

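5. IterateOptimizer: self-iterating optimization

scripts/iterate-optimizer.py analyze --input RUN_HISTORY.json
scripts/iterate-optimizer.py improve --config CURRENT_CONFIG --analysis ANALYSIS_RESULT
  • Success-rate trend / error clustering / field coverage / optimization suggestions
  • Automatically adjusts delay, timeout, and retries, and switches modes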
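The analyze and improve steps can be pictured as follows. This is a sketch under stated assumptions: the run-history schema (`ok`, `error`, `fields`), the config keys (`delay`, `timeout`, `retries`), and the doubling rules are all hypothetical, since the source does not document the file formats used by iterate-optimizer.py.

```python
from collections import Counter


def analyze(runs, expected_fields):
    """Compute success rate, error counts, and per-field coverage (assumed schema)."""
    successes = [r for r in runs if r["ok"]]
    errors = Counter(r["error"] for r in runs if not r["ok"])
    coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in successes if r["fields"].get(field) not in (None, ""))
        coverage[field] = filled / len(successes) if successes else 0.0
    return {
        "success_rate": len(successes) / len(runs) if runs else 0.0,
        "errors": errors,
        "coverage": coverage,
    }


def improve(config, analysis):
    """Adjust the config according to the most frequent error (illustrative rules)."""
    new_config = dict(config)
    top = analysis["errors"].most_common(1)
    if top:
        kind = top[0][0]
        if kind == "rate_limit":
            new_config["delay"] = config["delay"] * 2
        elif kind == "timeout":
            new_config["timeout"] = config["timeout"] * 2
            new_config["retries"] = config["retries"] + 1
    return new_config


history = [
    {"ok": True, "error": None, "fields": {"title": "A", "price": "1"}},
    {"ok": True, "error": None, "fields": {"title": "B", "price": ""}},
    {"ok": False, "error": "rate_limit", "fields": {}},
    {"ok": False, "error": "rate_limit", "fields": {}},
    {"ok": False, "error": "timeout", "fields": {}},
]
report = analyze(history, ["title", "price"])
tuned = improve({"delay": 1.0, "timeout": 10, "retries": 2}, report)
print(report["success_rate"], report["coverage"], tuned)
```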

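Case study: data processing for foreign-trade content creators

Built-in scripts/trade-contact-scorer.py

  • Five-dimension follower quality score (engagement rate / save ratio / comment activity / follower count / foreign-trade relevance)
  • Five tiers: S / A / B / C / D
  • Follower persona inference (factory owner / cross-border seller / SOHO / company operator / beginner)
  • Batch processing (deduplication + foreign-trade filtering + scoring + persona inference)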
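The shape of a five-dimension score with S/A/B/C/D tiers can be sketched as below. The source does not publish the scoring model, so a plain weighted sum stands in for it here; the weights, the tier thresholds, and the assumption that each metric is already normalized to the range 0 to 1 are all hypothetical.

```python
# Hypothetical weights for the five dimensions (they sum to 1.0).
WEIGHTS = {
    "engagement": 0.25,  # engagement rate
    "save_ratio": 0.15,  # saves relative to views
    "comments": 0.20,    # comment activity
    "scale": 0.15,       # follower count
    "relevance": 0.25,   # foreign-trade relevance
}

# Hypothetical tier cut-offs on a 0-100 scale, highest first.
TIERS = [(85, "S"), (70, "A"), (55, "B"), (40, "C")]


def score(metrics):
    """Weighted sum of metrics clamped to [0, 1], scaled to 0-100."""
    total = sum(
        weight * min(max(metrics.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    return round(total * 100, 1)


def tier(points):
    """Map a 0-100 score to a tier label; anything below the last cut-off is D."""
    for threshold, label in TIERS:
        if points >= threshold:
            return label
    return "D"


profile = {"engagement": 0.8, "save_ratio": 0.5, "comments": 0.6, "scale": 0.4, "relevance": 1.0}
points = score(profile)
print(points, tier(points))
```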

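Diagnosing common processing problems

Calling the API directly will inevitably hit restrictions. The recommended approach:

  1. Open the web version with Playwright
  2. Log in manually, then save the cookies
  3. Extract data through the search page
  4. Use this skill's scoring model instead of simple weighting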

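Workflow

  1. Configure: process-engine.py config → understand the target site and get a recommended approach
  2. Generate code: code-generator.py → get a starting code template
  3. Debug: on error → debug-runner.py → diagnosis within seconds
  4. Clean: data-cleaner.py → deduplicate, normalize, format
  5. Iterate: iterate-optimizer.py → keep improving based on run data
Security recommendations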
Install only if you intend to use it for authorized web-data processing. Before using cookie-based or anti-detection workflows, confirm you have permission, avoid sharing real account cookies, and run generated code in an isolated environment after reviewing it.
Functional analysis
Type: OpenClaw Skill
Name: datacrawl-debug
Version: 1.1.0
The skill bundle provides a comprehensive suite of tools for web scraping, data cleaning, and error diagnosis. Key components include 'code-generator.py' for creating scraping templates, 'debug-runner.py' for analyzing common HTTP and parsing errors, and 'trade-contact-scorer.py' for evaluating lead quality. The logic across all scripts is transparent and strictly aligned with the stated purpose of data processing; no evidence of malicious intent, unauthorized data exfiltration, or harmful prompt injection was found.
Capability assessment
Purpose & Capability
The included scripts for parsing, cleaning, debugging, and generating crawler code are coherent with the stated purpose, but the reference material also promotes anti-detection tactics that go beyond ordinary data processing.
Instruction Scope
Instructions include proxy rotation, cookie pools, fingerprint obfuscation, human-behavior simulation, and saved login cookies, without consistently bounding use to authorized data collection.
Install Mechanism
There is no automatic install step, but generated code may instruct users to install unpinned Python packages such as requests, beautifulsoup4, lxml, and playwright.
Credentials
The local scripts do not show exfiltration or hidden endpoints, but the recommended use of authenticated browser state and anti-bot evasion is broader than the metadata-declared no-credential setup.
Persistence & Privilege
No background persistence was found in code, but the documentation recommends cookie persistence and cookie pools, which can preserve account session authority beyond a single run.
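How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install datacrawl-debug
  3. Once installed, invoke the skill by name or trigger it with /datacrawl-debug
  4. Supply the inputs described in the skill's parameter notes to get structured output
Version history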
v1.0.4
Brand update
v1.0.2
Brand update
v1.1.0
v1.1.0: rename crawl→process, remove platform-specific scraper, add trade-contact-scorer, full compliance cleanup
v1.0.1
Fix: compliance wording
v1.0.0
Initial release
Metadata
Slug: datacrawl-debug
Version: 1.1.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 5
FAQ

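What is Datacrawl Debug?

Use when user needs to process web data, debug data collection code, clean processed data, or iterate on data processing strategies. Use when generating data... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 71 total downloads so far.

How do I install Datacrawl Debug?

Run the command "/install datacrawl-debug" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.

Is Datacrawl Debug free?

Yes. Datacrawl Debug is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does Datacrawl Debug support?

Datacrawl Debug is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Datacrawl Debug?

It is developed and maintained by WangM-A3 (@wangm-a3); the current version is v1.1.0.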
