Deep Web Fetcher
by xueylee-dotcom · v1.0.0 · MIT-0
Downloads: 247 · Stars: 0 · Active installs: 0 · Versions: 1
Install in OpenClaw
/install deep-web-fetcher
Description
Fetch and extract structured content from JS-rendered web pages, including main text, metadata, and key domain-specific metrics, without paid APIs.
README (SKILL.md)
Skill: Deep Web Fetcher
Version: 1.0.0
Description: Free web fetching, content extraction, and structured output, with no paid API required
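Core features
- Web fetching: supports JS rendering and waits automatically for the page to load
- Main-text extraction: identifies the article body and filters out ads and navigation
- Metadata extraction: extracts the title, author, and publication time automatically
- Metric extraction: pulls key figures from the body text (sample size, AUC, cost, etc.)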
Trigger command
/web-fetcher <url> [--domain <domain>]
Parameters
| Parameter | Default | Description |
|---|---|---|
| url | required | Target page URL |
| --domain | general | Research domain; determines the metric-extraction rules |
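Domain options
- general: general-purpose extraction
- healthcare: healthcare and health
- medical: medical research
- insurance: insurance cost control
- machine_learning: machine learning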
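One way the domain option could select extraction rules is a lookup table from domain name to regex patterns. The sketch below is illustrative only: the pattern constants, the `DOMAIN_RULES` table, and the `extract_metrics` function are hypothetical names, and the regexes are guesses rather than the rules shipped in web-fetcher.py.

```python
import re

# Hypothetical rules: each entry is (regex with one capture group, cast for the value).
SAMPLE_SIZE = (r"\bn\s*=\s*(\d{1,3}(?:,\d{3})+|\d+)", str)
ACCURACY = (r"\baccuracy\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)\s*%", float)
AUC = (r"\bAUC\s*(?:of|=|:)?\s*(0?\.\d+|1(?:\.0+)?)", float)
COST = (r"\bcosts?\s*(?:of|=|:)?\s*([$¥€]\s?\d[\d,]*(?:\.\d+)?)", str)

COMMON = {"sample_size": SAMPLE_SIZE, "accuracy": ACCURACY}

# Assumed mapping from the documented --domain values to rule sets.
DOMAIN_RULES = {
    "general": COMMON,
    "healthcare": COMMON,
    "medical": {**COMMON, "auc": AUC},
    "insurance": {**COMMON, "cost": COST},
    "machine_learning": {**COMMON, "auc": AUC},
}


def extract_metrics(text, domain="general"):
    """Return the first match of each rule for the domain; unknown domains use 'general'."""
    rules = DOMAIN_RULES.get(domain, DOMAIN_RULES["general"])
    metrics = {}
    for name, (pattern, cast) in rules.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            metrics[name] = cast(match.group(1))
    return metrics
```

With this layout, adding a domain means adding one dictionary entry, and text that matches no rule simply yields an empty result.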
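Execution flow
1. Launch a Playwright browser
2. Visit the target URL and wait for JS rendering to finish
3. Extract the main text with Readability
4. Extract metadata (title, author, time)
5. Extract key metrics according to the domain rules
6. Generate the JSON output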
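The steps above can be sketched roughly as follows. This is not the code in web-fetcher.py: the function names (`render_html`, `extract_article`, `build_result`, `fetch`) are hypothetical, author/date extraction and metric extraction are left out, and waiting for `networkidle` is only one possible reading of "wait for JS rendering to finish". The third-party imports sit inside the functions so that the result-building helper stays usable without a browser installed.

```python
def render_html(url, timeout_ms=30_000):
    """Steps 1-2: load the page in headless Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: treat "network idle" as the signal that JS rendering is done.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_article(html):
    """Steps 3-4 (title only): run Readability, return (title, article_html, text)."""
    from bs4 import BeautifulSoup
    from readability import Document

    doc = Document(html)
    article_html = doc.summary()  # cleaned HTML of the main content
    text = BeautifulSoup(article_html, "lxml").get_text(" ", strip=True)
    return doc.title(), article_html, text


def build_result(url, title=None, text="", article_html="", error=None):
    """Step 6: assemble a subset of the documented JSON fields."""
    return {
        "url": url,
        "success": error is None,
        "title": title,
        "content_text": text,
        "content_html": article_html,
        # Simplification: whitespace splitting undercounts languages written without spaces.
        "word_count": len(text.split()),
        "error": error,
    }


def fetch(url):
    """Run the pipeline and report any failure in the "error" field instead of raising."""
    try:
        title, article_html, text = extract_article(render_html(url))
        return build_result(url, title, text, article_html)
    except Exception as exc:
        return build_result(url, error=str(exc))
```

Calling `fetch(url)` and passing the result to `json.dumps(..., ensure_ascii=False)` would print output in the same general shape as the skill's documented format.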
Output format
{
  "url": "https://example.com/article",
  "success": true,
  "title": "Article title",
  "author": "Author name",
  "published_date": "2024-01-15",
  "content_text": "Main text...",
  "content_html": "<html>...</html>",
  "word_count": 1500,
  "extracted_metrics": {
    "sample_size": "9,080",
    "auc": 0.85,
    "accuracy": 92.5
  },
  "error": null
}
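A downstream consumer might want to check the success flag and normalize string-formatted counts such as "9,080" before using the metrics. The sketch below assumes only the field names shown in the output format; `parse_count` and `load_result` are hypothetical helper names.

```python
import json


def parse_count(value):
    """Turn a formatted count such as "9,080" into an int; return None if not numeric."""
    digits = str(value).replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def load_result(raw):
    """Parse the skill's JSON output and return (title, metrics) or raise on failure."""
    result = json.loads(raw)
    if not result.get("success"):
        raise RuntimeError(f"fetch failed for {result.get('url')}: {result.get('error')}")
    metrics = dict(result.get("extracted_metrics") or {})
    if "sample_size" in metrics:
        metrics["sample_size"] = parse_count(metrics["sample_size"])
    return result.get("title"), metrics
```

Raising on `success: false` keeps failed fetches from silently flowing into later processing with empty metrics.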
Usage examples
Fetch an arXiv paper
/web-fetcher "https://arxiv.org/abs/2301.12345" --domain "machine_learning"
Fetch a PubMed abstract
/web-fetcher "https://pubmed.ncbi.nlm.nih.gov/38134648/" --domain "medical"
Fetch a government report
/web-fetcher "https://www.gov.cn/zhengce/zhengceku/2024-01/15/content_6923456.htm" --domain "insurance"
Installing dependencies
# Install the Python dependencies
pip install playwright readability-lxml lxml beautifulsoup4
# Install the browser driver (the first run downloads ~100MB)
playwright install chromium
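Notes
Anti-scraping measures
Some sites have anti-scraping mechanisms. If a fetch fails, you can:
- Increase the delay: adjust time.sleep() in the script
- Use a proxy: add a proxy in browser.new_context()
- Rotate the UA: change the user_agent parameter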
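As a rough illustration of the three adjustments, the sketch below builds keyword arguments for Playwright's `browser.new_context()`, which accepts `user_agent` and `proxy={"server": ...}`, and computes a jittered delay. The helper names, the user-agent strings, and the proxy address are placeholders, not values taken from web-fetcher.py.

```python
import random

# Example user-agent strings only; substitute values appropriate for your use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def context_options(proxy_server=None, user_agents=USER_AGENTS):
    """Build kwargs for browser.new_context(): a random UA and an optional proxy."""
    options = {"user_agent": random.choice(user_agents)}
    if proxy_server:
        options["proxy"] = {"server": proxy_server}
    return options


def polite_delay(base_seconds=2.0, jitter_seconds=1.0):
    """Return a delay in seconds with random jitter, suitable for time.sleep()."""
    return base_seconds + random.uniform(0, jitter_seconds)


# Intended use inside the script (not executed here):
#     context = browser.new_context(**context_options("http://127.0.0.1:8080"))
#     page = context.new_page()
#     time.sleep(polite_delay())
```

Keeping these settings in small helpers means the fetch logic itself does not change when a proxy is added or removed.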
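Extraction accuracy
- Standard pages (articles/blogs): ✅ excellent results
- Complex layouts (multi-column/dynamically loaded): ⚠️ may need manual review
- PDF pages: ❌ not supported; use a dedicated PDF tool
Execution speed
- Single page: 5-15 seconds (including browser startup)
- Batch fetching: a concurrency of 3-5 is recommended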
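The suggested concurrency of 3-5 could be enforced with a bounded thread pool. This is a sketch under assumptions: `fetch_many` is a hypothetical helper, and `fetch_one` stands for any callable that takes a URL and returns a result dictionary (with Playwright's sync API, each call would need to start its own browser instance).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(urls, fetch_one, max_workers=3):
    """Run fetch_one over urls with at most max_workers in flight; keeps input order."""

    def safe(url):
        # Convert exceptions into failed results so one bad URL does not abort the batch.
        try:
            return fetch_one(url)
        except Exception as exc:
            return {"url": url, "success": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, urls))
```

Because `pool.map` yields results in input order, the output list lines up with the URL list even when pages finish at different times.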
Integration with 深度研究v6.0 (Deep Research v6.0)
# Generate a card
/web-fetcher <url> --domain "insurance" > sources/card-xxx.json
# Convert the card format
python3 scripts/convert-to-card.py sources/card-xxx.json
File structure
skills/web-fetcher/
├── SKILL.md
└── scripts/
    └── web-fetcher.py
Version history
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-19 | Initial release |
Completely free, runs locally, and data never leaves the machine
Usage Guidance
This skill appears to do what it says: run a headless Chromium via Playwright, extract article text and simple metrics, and print structured JSON. Before installing/running: 1) run it in an isolated environment (virtualenv/container) because Playwright will download browser binaries and the tool will execute JS from arbitrary sites; 2) ensure you have legal permission to scrape your targets and avoid aggressive concurrency to reduce IP blocking; 3) be cautious if you choose to configure proxies or automation to bypass anti-bot protections (those are documented but could be abused); 4) review and pin dependency versions (playwright, readability-lxml) before pip installing; and 5) if you will chain outputs to other services, remember the SKILL.md claim that "data does not leave the machine" only holds if you don't forward the JSON elsewhere. Overall the package is internally consistent and contains no obvious covert exfiltration, but follow standard operational safety and legal/ethical scraping practices.
Capability Analysis
Type: OpenClaw Skill
Name: deep-web-fetcher
Version: 1.0.0
The skill bundle provides a legitimate web scraping and content extraction tool using Playwright and Readability. The Python script (web-fetcher.py) and instructions (SKILL.md) are consistent with the stated purpose of fetching and structuring web data locally, with no evidence of data exfiltration, malicious execution, or prompt injection.
Capability Assessment
Purpose & Capability
Name/description, SKILL.md, and scripts/web-fetcher.py align: the script launches Playwright, renders JS pages, runs Readability and regex extraction for metrics. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md and the script stay within scraping and extraction. The docs explicitly recommend proxies, rotating user agents, and increasing delays to bypass anti-bot protections — these are legitimate for robust scraping but are also techniques that can be misused. The skill does not read local files or environment variables beyond what a normal script would use, and it prints JSON to stdout.
Install Mechanism
There is no packaged install spec; SKILL.md instructs pip installs and running 'playwright install chromium' (standard for Playwright). No downloads from untrusted hosts or embedded binaries are present in the bundle.
Credentials
The skill declares no required environment variables or credentials and the code does not attempt to access secrets or external auth tokens. The lack of credentials is proportionate to the stated local-scrape purpose.
Persistence & Privilege
always is false, the skill does not request persistent/system-wide changes, and the script does not modify other skills or agent configuration. It runs as a normal one-shot tool.
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install deep-web-fetcher
- After installation, invoke the skill by name or use /deep-web-fetcher
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Deep Web Fetcher 1.0.0 – Initial Release
- Provides free, local web scraping with JS rendering and no paid API required.
- Extracts main article content, metadata (title, author, publish date), and key metrics (sample size, AUC, cost, etc.).
- Supports multiple fields (general, healthcare, medical, insurance, machine learning) for tailored extraction.
- Outputs structured JSON with text, HTML, and extracted metrics.
- Integrates easily into the "深度研究v6.0" (Deep Research v6.0) workflow.
- Includes anti-crawling tips and performance advice.
Frequently Asked Questions
What is Deep Web Fetcher?
Deep Web Fetcher fetches and extracts structured content from JS-rendered web pages, including main text, metadata, and key domain-specific metrics, without paid APIs. It is an AI Agent Skill for Claude Code / OpenClaw, with 247 downloads so far.
How do I install Deep Web Fetcher?
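Run "/install deep-web-fetcher" in the OpenClaw or Claude Code chat. Per the README, the script also needs its Python dependencies (pip install playwright readability-lxml lxml beautifulsoup4) and a Chromium download (playwright install chromium).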
Is Deep Web Fetcher free?
Yes, Deep Web Fetcher is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Deep Web Fetcher support?
Deep Web Fetcher is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Deep Web Fetcher?
It is built and maintained by xueylee-dotcom (@xueylee-dotcom); the current version is v1.0.0.