Deep Web Fetcher

Author: xueylee-dotcom · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 247
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install deep-web-fetcher
Description
Fetch and extract structured content from JS-rendered web pages, including main text, metadata, and key domain-specific metrics, without paid APIs.
Usage (SKILL.md)

Skill: Deep Web Fetcher

Version: 1.0.0
Description: Free web fetching, content extraction, and structured output, with no paid API required


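Core features

  • Web fetching: supports JS rendering and automatically waits for the page to load
  • Main-text extraction: identifies the article body and filters out ads and navigation
  • Metadata extraction: automatically extracts the title, author, and publication time
  • Metric extraction: pulls key figures from the main text (sample size, AUC, cost, etc.)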
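As a rough illustration of the metadata step, the sketch below reads the title, author, and publication time from `<title>` and `<meta>` tags using only the standard library. The `META_KEYS` mapping, the `MetaCollector` class, and `extract_metadata` are hypothetical names chosen for this example; which tags web-fetcher.py actually inspects is not documented here.

```python
from html.parser import HTMLParser

# Hypothetical mapping from <meta> keys to output fields; the real script
# may look at different tags.
META_KEYS = {
    "author": "author",
    "article:author": "author",
    "article:published_time": "published_date",
    "date": "published_date",
    "og:title": "title",
}


class MetaCollector(HTMLParser):
    """Collect title, author, and publication date from <title>/<meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": None, "author": None, "published_date": None}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            field = META_KEYS.get(key)
            content = (attrs.get("content") or "").strip()
            # Keep the first non-empty value seen for each field.
            if field and content and self.fields[field] is None:
                self.fields[field] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.fields["title"] is None and data.strip():
            self.fields["title"] = data.strip()


def extract_metadata(html):
    """Return a dict with title, author, and published_date (None if absent)."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.fields
```

Fields that cannot be found stay `None`, which matches the nullable style of the documented JSON output.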

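Trigger command

/web-fetcher <url> [--domain <domain>]

Parameters

Parameter   Default    Description
url         required   URL of the target web page
--domain    general    Research domain; determines which metric-extraction rules apply

Domain options

  • general: general-purpose extraction
  • healthcare: healthcare and health
  • medical: medical research
  • insurance: insurance cost control
  • machine_learning: machine learning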
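To make the idea of domain-dependent rules concrete, here is a minimal sketch of regex-based metric extraction keyed by domain. The rule table, the patterns, and the function name `extract_metrics` are assumptions made for illustration; they are not the rules shipped in web-fetcher.py.

```python
import re

# Hypothetical rule tables: the patterns and metric names below are
# illustrative assumptions, not the rules shipped in web-fetcher.py.
COMMON_RULES = {
    "sample_size": re.compile(r"\b[nN]\s*=\s*(\d{1,3}(?:,\d{3})+|\d+)"),
}
DOMAIN_RULES = {
    "general": {},
    "machine_learning": {
        "auc": re.compile(r"\bAUC\b[^0-9]{0,20}(0?\.\d+)"),
        "accuracy": re.compile(r"\baccuracy\b[^0-9]{0,20}(\d+(?:\.\d+)?)\s*%", re.I),
    },
}


def extract_metrics(text, domain="general"):
    """Return the first match of each rule that applies to the given domain."""
    # Unknown domains fall back to the common rules only.
    rules = {**COMMON_RULES, **DOMAIN_RULES.get(domain, {})}
    metrics = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        if match:
            metrics[name] = match.group(1)  # kept as the matched string
    return metrics
```

In this sketch every value is returned as the matched string; converting to numbers is left to the caller.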

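Execution flow

1. Launch a Playwright browser
2. Open the target URL and wait for JS rendering to finish
3. Extract the main text with Readability
4. Extract metadata (title, author, time)
5. Extract key metrics according to the domain rules
6. Output the result as JSON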
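The flow above can be sketched roughly as follows, using the Playwright sync API and readability-lxml. This is not the code of web-fetcher.py: the function names, the `networkidle` wait condition, and the 30-second timeout are assumptions, and steps 4 and 5 (metadata and metrics) are left out to keep the sketch short. The renderer and extractor are passed in as parameters so that either can be swapped or stubbed.

```python
import json


def render_with_playwright(url, timeout_ms=30000):
    """Steps 1-2: load the page in headless Chromium and return rendered HTML."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Waiting for "networkidle" is one way to let JS finish (an assumption).
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_with_readability(html):
    """Step 3: return (title, main-content HTML, main-content text)."""
    from bs4 import BeautifulSoup
    from readability import Document

    doc = Document(html)
    content_html = doc.summary()
    text = BeautifulSoup(content_html, "lxml").get_text(" ", strip=True)
    return doc.short_title(), content_html, text


def fetch(url, render=render_with_playwright, extract=extract_with_readability):
    """Run the flow and return a dict shaped like the documented JSON output."""
    result = {"url": url, "success": False, "title": None, "content_text": None,
              "content_html": None, "word_count": 0, "error": None}
    try:
        html = render(url)
        title, content_html, text = extract(html)
        # word_count here is a simple whitespace-based count.
        result.update(success=True, title=title, content_html=content_html,
                      content_text=text, word_count=len(text.split()))
    except Exception as exc:  # report any failure in the "error" field
        result["error"] = str(exc)
    return result


def to_json(result):
    """Step 6: serialize the result, keeping non-ASCII text readable."""
    return json.dumps(result, ensure_ascii=False, indent=2)
```

Calling `print(to_json(fetch("https://example.com/article")))` would run the whole flow, provided the dependencies and the Chromium browser are installed.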

Output format

{
  "url": "https://example.com/article",
  "success": true,
  "title": "Article title",
  "author": "Author name",
  "published_date": "2024-01-15",
  "content_text": "Main text...",
  "content_html": "<html>...</html>",
  "word_count": 1500,
  "extracted_metrics": {
    "sample_size": "9,080",
    "auc": 0.85,
    "accuracy": 92.5
  },
  "error": null
}
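Note that in the example output `sample_size` is a string with a thousands separator, while `auc` and `accuracy` are numbers. A consumer may want to normalize that before doing arithmetic. The helpers below are a minimal sketch of such a consumer; `parse_count` and `load_result` are hypothetical names and are not part of the skill.

```python
import json


def parse_count(value):
    """Turn a count such as "9,080" (or 9080) into an int; None if not numeric."""
    if isinstance(value, int):
        return value
    try:
        return int(str(value).replace(",", "").strip())
    except ValueError:
        return None


def load_result(raw_json):
    """Parse one result and fail loudly when the fetch did not succeed."""
    result = json.loads(raw_json)
    if not result.get("success"):
        raise RuntimeError(
            f"fetch failed for {result.get('url')}: {result.get('error')}"
        )
    metrics = dict(result.get("extracted_metrics") or {})
    if "sample_size" in metrics:
        metrics["sample_size"] = parse_count(metrics["sample_size"])
    result["extracted_metrics"] = metrics
    return result
```

Checking `success` first keeps a failed fetch from being mistaken for an empty article.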

Usage examples

Fetching an arXiv paper

/web-fetcher "https://arxiv.org/abs/2301.12345" --domain "machine_learning"

Fetching a PubMed abstract

/web-fetcher "https://pubmed.ncbi.nlm.nih.gov/38134648/" --domain "medical"

Fetching a government report

/web-fetcher "https://www.gov.cn/zhengce/zhengceku/2024-01/15/content_6923456.htm" --domain "insurance"

Installing dependencies

# Install the Python dependencies
pip install playwright readability-lxml lxml beautifulsoup4

# Install the browser (the first run downloads about 100 MB)
playwright install chromium

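Notes

Anti-scraping measures

Some sites use anti-scraping mechanisms. If a fetch fails, you can:

  1. Add delay: adjust time.sleep() in the script
  2. Use a proxy: pass a proxy to browser.new_context()
  3. Rotate the UA: change the user_agent parameter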
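The three adjustments above might look like the following when combined. This is a sketch under assumptions: the helper names, the user-agent strings, and the delay bounds are illustrative, and only `browser.new_context(proxy=..., user_agent=...)` and `time.sleep()` come from the notes themselves. Use these options only on sites you are permitted to scrape.

```python
import random
import time

# Illustrative user-agent strings; substitute ones appropriate for your use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def context_options(proxy_server=None, user_agents=USER_AGENTS, rng=random):
    """Build keyword arguments for Playwright's browser.new_context()."""
    options = {"user_agent": rng.choice(user_agents)}
    if proxy_server:
        options["proxy"] = {"server": proxy_server}
    return options


def polite_delay(min_s=1.0, max_s=3.0, rng=random, sleep=time.sleep):
    """Sleep for a random interval between requests and return its length."""
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay


def render(url, proxy_server=None):
    """Fetch rendered HTML using the options above (requires playwright)."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(**context_options(proxy_server))
            page = context.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

Calling `polite_delay()` between successive `render()` calls spaces requests out; a random interval is used here rather than a fixed one, which is a design choice of this sketch.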

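Extraction accuracy

  • Standard pages (articles/blogs): ✅ works well
  • Complex layouts (multi-column/dynamically loaded): ⚠️ may need manual review
  • PDF pages: ❌ not supported; use a dedicated PDF tool

Execution speed

  • Single-page fetch: 5-15 seconds (including browser startup)
  • Batch fetching: a concurrency of 3-5 is recommended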
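One way to respect the recommended concurrency of 3-5 is a bounded thread pool. The sketch below assumes a caller-supplied `fetch_one(url)` function that returns a result dict; `fetch_many` and `fetch_one` are hypothetical names, and the skill's own batch mechanism, if any, is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(urls, fetch_one, max_workers=3):
    """Fetch URLs with at most max_workers running at once.

    fetch_one is any callable that takes a URL and returns a result dict;
    each call should create and close its own browser instance.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one page fails
                results[url] = {"url": url, "success": False, "error": str(exc)}
    # Return results in the order the URLs were given.
    return [results[url] for url in urls]
```

A failure on one page is recorded in that page's result instead of aborting the whole batch.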

Integration with Deep Research v6.0 (深度研究v6.0)

# Generate a card
/web-fetcher <url> --domain "insurance" > sources/card-xxx.json

# Convert the card format
python3 scripts/convert-to-card.py sources/card-xxx.json

File structure

skills/web-fetcher/
├── SKILL.md
└── scripts/
    └── web-fetcher.py

Version history

Version   Date         Changes
1.0.0     2026-03-19   Initial release

Completely free, runs locally, and data never leaves the machine

Security recommendations
This skill appears to do what it says: run a headless Chromium via Playwright, extract article text and simple metrics, and print structured JSON. Before installing or running:
  1. Run it in an isolated environment (virtualenv/container), because Playwright will download browser binaries and the tool will execute JS from arbitrary sites.
  2. Ensure you have legal permission to scrape your targets, and avoid aggressive concurrency to reduce IP blocking.
  3. Be cautious if you choose to configure proxies or automation to bypass anti-bot protections (those are documented but could be abused).
  4. Review and pin dependency versions (playwright, readability-lxml) before pip installing.
  5. If you will chain outputs to other services, remember that the SKILL.md claim that "data does not leave the machine" only holds if you don't forward the JSON elsewhere.
Overall the package is internally consistent and contains no obvious covert exfiltration, but follow standard operational safety and legal/ethical scraping practices.
Functional analysis
Type: OpenClaw Skill
Name: deep-web-fetcher
Version: 1.0.0
The skill bundle provides a legitimate web scraping and content extraction tool using Playwright and Readability. The Python script (web-fetcher.py) and instructions (SKILL.md) are consistent with the stated purpose of fetching and structuring web data locally, with no evidence of data exfiltration, malicious execution, or prompt injection.
Capability assessment
Purpose & Capability
Name/description, SKILL.md, and scripts/web-fetcher.py align: the script launches Playwright, renders JS pages, runs Readability and regex extraction for metrics. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md and the script stay within scraping and extraction. The docs explicitly recommend proxies, rotating user agents, and increasing delays to bypass anti-bot protections — these are legitimate for robust scraping but are also techniques that can be misused. The skill does not read local files or environment variables beyond what a normal script would use, and it prints JSON to stdout.
Install Mechanism
There is no packaged install spec; SKILL.md instructs pip installs and running 'playwright install chromium' (standard for Playwright). No downloads from untrusted hosts or embedded binaries are present in the bundle.
Credentials
The skill declares no required environment variables or credentials and the code does not attempt to access secrets or external auth tokens. The lack of credentials is proportionate to the stated local-scrape purpose.
Persistence & Privilege
always is false, the skill does not request persistent/system-wide changes, and the script does not modify other skills or agent configuration. It runs as a normal one-shot tool.
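How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install deep-web-fetcher
  3. Once installed, invoke the Skill by name or trigger it with /deep-web-fetcher
  4. Supply the inputs described in the Skill's parameter documentation to get structured output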
Version history
v1.0.0
Deep Web Fetcher 1.0.0: Initial Release
  • Provides free, local web scraping with JS rendering and no paid API required.
  • Extracts main article content, metadata (title, author, publish date), and key metrics (sample size, AUC, cost, etc.).
  • Supports multiple fields (general, healthcare, medical, insurance, machine learning) for tailored extraction.
  • Outputs structured JSON with text, HTML, and extracted metrics.
  • Integrates easily into the Deep Research v6.0 ("深度研究v6.0") workflow.
  • Includes anti-crawling tips and performance advice.
Metadata
Slug: deep-web-fetcher
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
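FAQ

What is Deep Web Fetcher?

Fetch and extract structured content from JS-rendered web pages, including main text, metadata, and key domain-specific metrics, without paid APIs. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 247 times to date.

How do I install Deep Web Fetcher?

Run the command "/install deep-web-fetcher" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Deep Web Fetcher free?

Yes. Deep Web Fetcher is completely free and released under the MIT-0 license; you may download, install, and use it freely.

Which platforms does Deep Web Fetcher support?

Deep Web Fetcher is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Deep Web Fetcher?

It is developed and maintained by xueylee-dotcom (@xueylee-dotcom); the current version is v1.0.0.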
