Downloads: 236
Stars: 0
Active Installs: 1
Versions: 1
Install in OpenClaw
/install defuddle-extractor
Description
Extracts the main content of a webpage using the Defuddle library and converts it to Markdown. Supports CLI and Node.js use for web scraping and text processing tasks.
README (SKILL.md)
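Defuddle Web Content Extraction Skill
Uses the Defuddle library to extract the main content from any web page and convert it to Markdown.
Features
- Content extraction: automatically detects and extracts the main content of a web page
- Markdown conversion: converts HTML content to Markdown
- Clutter removal: strips ads, sidebars, comments, and other page clutter
- CLI support: provides a command-line interface for quick use
- Node.js integration: can be used from a Node.js environment
- Custom configuration: supports custom content selectors and options
Implementation
- Uses the Defuddle library for web content extraction
- Supports multiple configuration options
- Provides a simple, easy-to-use API
Usage
1. Command line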
# Parse a URL and output Markdown
npx defuddle parse https://example.com/article --markdown
# Parse a local HTML file
npx defuddle parse page.html --markdown
# Output JSON (includes metadata)
npx defuddle parse page.html --json
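When the CLI is driven from a Node.js script rather than typed by hand, it can help to assemble the argument list in one place. The sketch below only builds the arguments for the invocations shown above. `buildParseArgs` is a hypothetical helper name, not part of Defuddle, and the only flags it emits are the `--markdown` and `--json` flags that appear in those commands.

```javascript
// Hypothetical helper (not part of Defuddle): builds the argument array for
// `npx defuddle parse <source> [--markdown | --json]`.
const OUTPUT_FLAGS = new Map([
  ['markdown', '--markdown'],
  ['json', '--json'],
]);

function buildParseArgs(source, format = 'markdown') {
  if (typeof source !== 'string' || source.trim() === '') {
    throw new TypeError('source must be a non-empty URL or file path');
  }
  const flag = OUTPUT_FLAGS.get(format);
  if (flag === undefined) {
    throw new RangeError(`unsupported output format: ${format}`);
  }
  // Returned as an array so the source is never re-parsed by a shell.
  return ['defuddle', 'parse', source.trim(), flag];
}

console.log(buildParseArgs('https://example.com/article').join(' '));
// prints: defuddle parse https://example.com/article --markdown
```

The array can be handed to `spawnSync('npx', args)` from `node:child_process`. Passing arguments as an array rather than a single command string avoids shell quoting problems with unusual URLs.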
2. Scripts
# Extract content from a URL and send it to the WeChat File Transfer assistant
bash scripts/extract_and_send.sh "https://example.com/article" "文件传输助手"
# Extract content from a URL and send it to Telegram
bash scripts/extract_and_send_telegram.sh "https://example.com/article" <chat_id>
3. Node.js API
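import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

// Fetch a page and extract its main content with Defuddle.
async function extractContent(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
  }
  const html = await response.text();
  // Passing the URL lets jsdom resolve relative links in the page.
  const dom = new JSDOM(html, { url });
  // separateMarkdown keeps `content` as HTML and adds `contentMarkdown`
  // (per the Defuddle README; verify against the installed version).
  const result = await Defuddle(dom.window.document, url, { separateMarkdown: true });
  return {
    title: result.title,
    content: result.content,
    markdown: result.contentMarkdown
  };
}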
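Depending on which options were passed to Defuddle, the `contentMarkdown` field may be missing from the result. Here is a minimal sketch of a defensive wrapper, assuming only the three field names used in the function above (`title`, `content`, `contentMarkdown`). `summarizeResult` is a hypothetical helper, and the sample object is a stand-in, not real Defuddle output.

```javascript
// Hypothetical helper: normalizes an extraction result so callers can tell
// whether Markdown was actually produced instead of receiving `undefined`.
function summarizeResult(result) {
  if (result === null || typeof result !== 'object') {
    throw new TypeError('result must be an object');
  }
  const markdown =
    typeof result.contentMarkdown === 'string' ? result.contentMarkdown : null;
  return {
    title: result.title ?? '',
    content: result.content ?? '',
    markdown,
    hasMarkdown: markdown !== null,
  };
}

// Stand-in object for illustration only.
const summary = summarizeResult({ title: 'Example', content: '<p>Hello</p>' });
console.log(summary.hasMarkdown); // prints: false
```

Callers can then branch on `hasMarkdown` and fall back to the HTML in `content` when no Markdown is available.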
Configuration options
- markdown: convert the output to Markdown
- debug: enable debug mode
- contentSelector: custom content selector
- removeImages: remove images
- removeHiddenElements: remove hidden elements
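A misspelled option name is easy to miss, so it may be worth validating an options object before passing it on. The sketch below uses the five option names listed above. The value types (booleans, plus a string for `contentSelector`) are assumptions, and `buildOptions` is a hypothetical helper rather than part of Defuddle.

```javascript
// Option names come from the list above; the expected types are assumptions.
const OPTION_TYPES = new Map([
  ['markdown', 'boolean'],
  ['debug', 'boolean'],
  ['contentSelector', 'string'],
  ['removeImages', 'boolean'],
  ['removeHiddenElements', 'boolean'],
]);

// Hypothetical helper: rejects unknown keys and wrong value types early.
function buildOptions(overrides = {}) {
  const options = {};
  for (const [key, value] of Object.entries(overrides)) {
    const expected = OPTION_TYPES.get(key);
    if (expected === undefined) {
      throw new RangeError(`Unknown option: ${key}`);
    }
    if (typeof value !== expected) {
      throw new TypeError(`Option ${key} must be a ${expected}`);
    }
    options[key] = value;
  }
  return options;
}

const exampleOptions = buildOptions({ markdown: true, contentSelector: 'article' });
console.log(Object.keys(exampleOptions).join(', ')); // prints: markdown, contentSelector
```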
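Scripts
- scripts/extract_content.sh: extracts content from a URL and prints it to the console
- scripts/extract_and_send.sh: extracts content and sends it to WeChat
- scripts/extract_and_send_telegram.sh: extracts content and sends it to Telegram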
Dependencies
- Node.js and npm (for the CLI)
- The defuddle library (installed via npm)
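Because everything here depends on Node.js, a script can check the runtime version up front instead of failing later. This is a minimal sketch, assuming the Node.js 18 minimum recommended in this README's notes. `meetsMinimumNode` is a hypothetical helper name.

```javascript
// Hypothetical preflight check: compares the major version of a Node.js
// version string (e.g. "v20.11.0") against a minimum major version.
function meetsMinimumNode(version, minMajor = 18) {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) {
    throw new TypeError(`unrecognized version string: ${version}`);
  }
  return Number(match[1]) >= minMajor;
}

if (!meetsMinimumNode(process.version)) {
  console.warn(`Node.js ${process.version} is older than the recommended 18.x`);
}
```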
Installation
npm install -g defuddle
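Notes
- Defuddle requires a Node.js environment (Node.js 18 or later is recommended)
- Some websites use anti-scraping measures that can cause extraction to fail
- Extracting content from large pages may take a long time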
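Blocked requests and slow pages, the two failure modes noted above, can be surfaced explicitly when fetching HTML before extraction. The sketch below assumes the global `fetch` and `AbortController` available in Node.js 18 and later. `fetchHtml`, the 15-second default, and the `fetchImpl` parameter (included so the function can be exercised without network access) are illustrative choices, not part of this skill.

```javascript
// Hypothetical helper: fetches a page with a timeout and a status check so
// that blocked or slow requests fail with a clear error.
async function fetchHtml(url, { timeoutMs = 15000, fetchImpl = globalThis.fetch } = {}) {
  const controller = new AbortController();
  // Abort the request if it takes longer than timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, { signal: controller.signal });
    if (!response.ok) {
      // Anti-scraping measures often show up as 403 or 429 responses.
      throw new Error(`Request for ${url} failed with HTTP ${response.status}`);
    }
    return await response.text();
  } finally {
    clearTimeout(timer);
  }
}
```

The returned HTML string can then be passed to the extraction step. A rejected promise tells the caller whether the page was refused or merely too slow.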
Usage Guidance
This skill's core feature (extract webpage content and convert to Markdown) is coherent, but review and proceed cautiously:
1) Inspect or remove the send scripts before use, because they transmit scraped content to WeChat/Telegram.
2) The WeChat script calls a helper through a hardcoded path under /Users/honcy/..., which will fail on your system and could reference another skill you don't control. Do not run it without validating that script.
3) npx will fetch and execute the 'defuddle' package from npm at runtime. Verify the npm package's source and reputation before running.
4) If you only want extraction, run the extraction commands in a controlled environment and avoid executing the send scripts until you confirm destinations and credentials.
5) If you need this skill, consider forking and cleaning the scripts to remove hardcoded paths and to require explicit consent and targets before sending data.
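Part of the script review suggested above can be automated by scanning script text for absolute paths into a specific user's home directory. This is a minimal sketch: `findHardcodedHomePaths` is a hypothetical helper, the regular expression only covers `/Users/<name>/` and `/home/<name>/` prefixes, and the sample script line is made up for illustration around the path reported in this listing.

```javascript
// Hypothetical helper: reports absolute paths under /Users/<name>/ or
// /home/<name>/ found in a script, with 1-based line numbers.
function findHardcodedHomePaths(scriptText) {
  const pattern = /\/(?:Users|home)\/[A-Za-z0-9._-]+\/[^\s"']*/g;
  const hits = [];
  scriptText.split('\n').forEach((line, index) => {
    for (const match of line.matchAll(pattern)) {
      hits.push({ line: index + 1, path: match[0] });
    }
  });
  return hits;
}

// Illustrative script text, not the actual contents of the bundled script.
const sampleScript = [
  '#!/bin/bash',
  'bash /Users/honcy/.openclaw/skills/WeChat-Send/scripts/wechat_send.sh "$1" "$2"',
].join('\n');

console.log(findHardcodedHomePaths(sampleScript));
```

An empty result does not prove a script is safe. It only rules out this one portability problem.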
Capability Analysis
Type: OpenClaw Skill
Name: defuddle-extractor
Version: 1.0.0
The skill bundle provides a legitimate utility for extracting webpage content into Markdown using the 'defuddle' library. It includes shell scripts to automate extraction and delivery to WeChat or Telegram via the OpenClaw CLI. While 'scripts/extract_and_send.sh' contains a hardcoded absolute path to a specific user's directory (/Users/honcy/), this is a portability flaw rather than a malicious indicator, and no evidence of data exfiltration or unauthorized execution was found.
Capability Assessment
Purpose & Capability
The SKILL.md and scripts clearly require Node.js/npm and the 'defuddle' npm package (and use npx). However the registry metadata lists no required binaries or environment variables — a mismatch. One bundled script calls a hardcoded path (/Users/honcy/.openclaw/skills/WeChat-Send/scripts/wechat_send.sh) which targets another skill/user-specific location that is not declared and is unlikely to exist for other users.
Instruction Scope
Instructions and scripts operate on arbitrary URLs and then transmit extracted content to external messaging endpoints (WeChat via a local script path and Telegram via openclaw message send). Transmitting arbitrary scraped content is consistent with the advertised 'send' scripts, but it is an exfiltration vector and the WeChat script reference expands scope to other local skill files. The SKILL.md and scripts do not declare or limit what user files or environment will be read beyond fetching URLs, but they do rely on the openclaw CLI and npx behavior.
Install Mechanism
There is no formal install spec (instruction-only), which limits on-disk installation risk by this bundle. However SKILL.md and scripts rely on 'npx defuddle' and suggest 'npm install -g defuddle' — which means runtime will fetch/execute code from the npm registry (npx executes remote packages), a supply-chain/execution vector to be aware of.
Credentials
The registry lists no required environment variables or credentials, which is consistent with the included files. But the scripts assume the availability of other platform credentials/agents: openclaw message send (Telegram) and a local WeChat helper script (which likely depends on credentials/config stored elsewhere). Those credentials are not declared by the skill and may be used implicitly when the scripts run.
Persistence & Privilege
always is false and there is no install-time persistence requested. The skill can be invoked by the agent (normal), and its scripts can send messages autonomously if run, so users should be aware of the ability to transmit extracted content but there is no elevated 'always' privilege or hidden persistence in the bundle.
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install defuddle-extractor
- After installation, invoke the skill by name or use /defuddle-extractor
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of defuddle-extractor.
- Extracts main content from any webpage using the Defuddle library.
- Converts HTML to Markdown, removing ads, sidebars, and comments.
- Provides both a command-line interface and Node.js API.
- Supports custom selectors, configuration options, and multiple output formats (Markdown, JSON).
- Includes scripts for sending extracted content to WeChat and Telegram.
- Compatible with macOS, Linux, and Windows.
Frequently Asked Questions
What is Defuddle?
Defuddle (defuddle-extractor) is an AI Agent Skill for Claude Code / OpenClaw that extracts the main content of a webpage using the Defuddle library and converts it to Markdown, with CLI and Node.js support for web scraping and text processing tasks. It has 236 downloads so far.
How do I install Defuddle?
Run "/install defuddle-extractor" in the OpenClaw or Claude Code chat to install it in one step. The extraction commands and scripts additionally need Node.js and npm on the machine.
Is Defuddle free?
Yes, Defuddle is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Defuddle support?
Defuddle is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Defuddle?
It is built and maintained by Honcy Ye (@yeholdon); the current version is v1.0.0.