samcheng0717

douyin-scraper

Author: SamCheng0717 · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 114
Favorites: 0
Current installs: 4
Versions: 1
Install in OpenClaw
/install douyin-scraper
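Description
A Douyin image-text note collection tool. Search by keyword → automatically apply the "图文 · 一周内" (image-text, within one week) filter → screenshot with Playwright (to work around anti-scraping measures) → recognize the text in images with Baidu OCR → output a Markdown report (with heat scores). Load this skill when the user mentions scenarios such as "抖音图文采集" (Douyin image-text collection), "抖音笔记抓取" (Douyin note scraping), "抖音爬虫" (Douyin crawler), or "抖音内容采集" (Douyin content collection).
Usage (SKILL.md)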

# douyin-scraper

A Douyin image-text note collection tool: one command runs search → filter image-text notes → screenshot → OCR → Markdown report.

## ⚠️ Prerequisites

### 1. Install dependencies

```bash
pip install playwright requests python-dotenv
python -m playwright install chromium
```

### 2. Configure the Baidu PaddleOCR token

Create a `.env` file in the skill directory:

```
BAIDU_PADDLEOCR_TOKEN=your_token
```

To get a token, visit [Baidu AI Studio](https://aistudio.baidu.com/paddleocr) and register for free; the free tier allows 10,000 calls per day.
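
As a minimal sketch of how a script might read this setting (an assumption; the actual loading code in `full_workflow.py` is not reproduced here): since `python-dotenv` is among the dependencies, its `load_dotenv()` would normally do this job, and the stdlib-only stand-in below just makes the parsing and the early failure explicit. The helper names `parse_env_text` and `require_token` are hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_token(env: dict[str, str]) -> str:
    """Return the OCR token, or fail early with a readable message."""
    token = env.get("BAIDU_PADDLEOCR_TOKEN", "").strip()
    if not token:
        raise RuntimeError("BAIDU_PADDLEOCR_TOKEN is missing; add it to .env")
    return token


example = parse_env_text("# OCR settings\nBAIDU_PADDLEOCR_TOKEN=abc123\n")
print(require_token(example))
```

Failing before the browser starts is a design choice for the sketch: it avoids running a full scrape only to discover at the OCR step that the token was never set.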
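
### 3. Log in to Douyin (one time only)

```bash
python <skill_path>/scripts/login.py
```

A browser window opens Douyin. Scan the QR code to log in, then close the browser. The login state is saved automatically (under `profile/`), so you do not need to log in again.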
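
Saved login state of this kind corresponds to what Playwright calls a persistent context. The following is a minimal sketch, assuming `login.py` works roughly this way (its source is not reproduced here); `profile_dir` and `open_douyin_with_profile` are hypothetical names, and the `profile/` location is taken from the directory structure.

```python
from pathlib import Path


def profile_dir(skill_path: str) -> Path:
    """Directory where the browser profile (cookies, session) is kept."""
    return Path(skill_path) / "profile"


def open_douyin_with_profile(skill_path: str) -> None:
    # Imported lazily so the path helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    user_data = profile_dir(skill_path)
    user_data.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        # A persistent context stores cookies in user_data_dir, so a QR-code
        # login done once is reused by later runs pointing at the same folder.
        context = p.chromium.launch_persistent_context(
            user_data_dir=str(user_data), headless=False
        )
        page = context.pages[0] if context.pages else context.new_page()
        page.goto("https://www.douyin.com/")
        input("Log in via QR code in the browser, then press Enter here...")
        context.close()


print(profile_dir("douyin-scraper"))
```

Because the profile folder holds a logged-in session, it is worth treating it like a credential: keep it out of version control and shared directories.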

---

## Usage

```bash
# Collect image-text notes with the default count (OCR enabled)
python <skill_path>/scripts/full_workflow.py --keyword "韩国医美"

# Specify the number of notes
python <skill_path>/scripts/full_workflow.py --keyword "减肥餐" --count 5

# Skip OCR (screenshots only)
python <skill_path>/scripts/full_workflow.py --keyword "咖啡" --no-ocr
```

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--keyword` | Search keyword | Required |
| `--count` | Number of notes to collect | `5` |
| `--no-ocr` | Skip OCR | Off |
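
The parameter table maps directly onto an `argparse` definition. This is a sketch of an equivalent parser, not the script's actual code; `build_parser` is a hypothetical name, and the defaults mirror the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Collect Douyin image-text notes")
    parser.add_argument("--keyword", required=True, help="search keyword")
    parser.add_argument("--count", type=int, default=5,
                        help="number of notes to collect")
    # store_true means the flag is off unless passed, matching the table.
    parser.add_argument("--no-ocr", action="store_true",
                        help="skip OCR and keep screenshots only")
    return parser


args = build_parser().parse_args(["--keyword", "coffee", "--no-ocr"])
print(args.keyword, args.count, args.no_ocr)  # coffee 5 True
```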

---

## Output

Reports are saved to `output/notes_{keyword}_{timestamp}.md`, and images are saved to `data/images/`.
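
To make the naming pattern concrete, here is a small sketch that builds such a path. The timestamp format and the keyword sanitizing are assumptions (the document specifies neither), and `report_path` is a hypothetical helper.

```python
import re
from datetime import datetime
from pathlib import Path


def report_path(keyword: str, now: datetime, output_dir: str = "output") -> Path:
    """Build output/notes_{keyword}_{timestamp}.md for one run."""
    # Replace whitespace and characters that are invalid in file names.
    safe_keyword = re.sub(r'[\\/:*?"<>|\s]+', "_", keyword).strip("_")
    timestamp = now.strftime("%Y%m%d_%H%M%S")  # assumed format
    return Path(output_dir) / f"notes_{safe_keyword}_{timestamp}.md"


print(report_path("coffee beans", datetime(2025, 1, 2, 3, 4, 5)).name)
# notes_coffee_beans_20250102_030405.md
```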
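
Each note includes:

- 🔥 Heat score (likes / days since publication) and the formula used
- 👍 Like count, publication time, author, and original link
- 📝 Original description
- 🔍 Text recognized from the images by OCR (multiple images supported)
- 🖼️ Local screenshot paths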
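
The heat score is simple enough to work through by hand. The sketch below computes likes divided by days since publication and ranks notes by it. Flooring the age at one day (to avoid dividing by zero for notes posted today) is an assumption, as are the sample data and the `heat_score` name.

```python
from datetime import datetime


def heat_score(likes: int, published: datetime, now: datetime) -> float:
    """likes / days_ago, with days_ago floored at 1 (an assumption)."""
    days_ago = max((now - published).total_seconds() / 86400, 1.0)
    return likes / days_ago


now = datetime(2025, 1, 8)
notes = [
    {"title": "A", "likes": 700, "published": datetime(2025, 1, 1)},  # 7 days old
    {"title": "B", "likes": 300, "published": datetime(2025, 1, 6)},  # 2 days old
]
ranked = sorted(
    notes,
    key=lambda n: heat_score(n["likes"], n["published"], now),
    reverse=True,
)
for n in ranked:
    print(n["title"], heat_score(n["likes"], n["published"], now))
# B 150.0
# A 100.0
```

Note B has fewer likes than A but ranks first because it accumulated them in two days rather than seven.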
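
---

## Technical features

- **Playwright screenshots**: captures content images with `element.screenshot()`, which sidesteps the anti-scraping protection on Douyin image URLs
- **Image-text filtering**: automatically detects and skips videos, collecting only notes of the "图文" (image-text) type
- **OCR noise filtering**: automatically removes Douyin navigation-bar text that appears in screenshots (精选 / 推荐 / 关注, i.e. Featured / Recommended / Following, and similar)
- **Multi-image support**: for a note with several images, each image is screenshotted and run through OCR, and the results are merged
- **Anti-detection**: a headed browser (`headless=False`) plus human-like pacing, to avoid triggering CAPTCHAs
- **Heat formula**: `likes / days_ago`; newer notes with more likes rank higher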
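
The noise filtering and multi-image merging described above could look roughly like the sketch below. It is an illustration only: the label set contains just the three examples named in the list, the matching rule (drop lines made up entirely of known labels) is an assumption, and `clean_ocr_lines` and `merge_ocr_results` are hypothetical names.

```python
# Navigation labels named in the feature list; the real list may be longer.
NAV_NOISE = {"精选", "推荐", "关注"}


def clean_ocr_lines(lines: list[str]) -> list[str]:
    """Drop empty lines and lines made up only of known navigation labels."""
    kept = []
    for line in lines:
        tokens = line.split()
        if not tokens or all(tok in NAV_NOISE for tok in tokens):
            continue
        kept.append(line.strip())
    return kept


def merge_ocr_results(per_image: list[list[str]]) -> str:
    """Clean each image's OCR lines, then join the images in order."""
    blocks = ["\n".join(clean_ocr_lines(lines)) for lines in per_image]
    return "\n\n".join(block for block in blocks if block)


merged = merge_ocr_results([
    ["精选 推荐 关注", "Day 1: oatmeal"],
    ["关注", "Day 2: salad"],
])
print(merged)
```

Matching whole lines rather than substrings keeps body text that merely mentions one of these words, at the cost of missing labels that OCR merges with neighbouring text.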

---

## Directory structure

```
douyin-scraper/
├── scripts/
│   ├── full_workflow.py   # Main pipeline
│   └── login.py           # Login script
├── data/
│   └── images/            # Screenshots
├── output/                # Markdown reports
├── profile/               # Browser login state
└── .env                   # Token configuration
```
Security recommendations
This package largely does what it says (it scrapes Douyin pages, screenshots content, and calls an OCR API), but there are two actionable concerns to address before using it with sensitive data:
(1) The project expects you to set BAIDU_PADDLEOCR_TOKEN, a requirement the registry metadata omits. Verify where that token will be used.
(2) The script's default OCR_API_URL is on aistudio-app.com (https://r41cd0p9x7dfp1s7.aistudio-app.com/layout-parsing), not the obvious official Baidu endpoint, so screenshots (base64-encoded) are uploaded to that host by default.
If you plan to use it, either (a) set BAIDU_PADDLEOCR_API_URL explicitly to an official Baidu PaddleOCR API endpoint you trust, or (b) inspect or replace the OCR POST implementation so that it uses an OCR provider you control. Also review, and be comfortable with, storing a browser profile in profile/ (it contains your logged-in session) and the legal and terms-of-service risks of scraping Douyin. If the author can confirm that the aistudio-app.com URL is an official Baidu-hosted endpoint (and update the docs and metadata), these concerns would be substantially reduced.
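
One way to act on option (a) is to refuse to upload anything unless the endpoint was set explicitly and its host is on a list you maintain. The sketch below is an illustration, not part of the skill: `resolve_ocr_url` is a hypothetical helper, and `ocr.example.com` is a placeholder for whichever host you have verified yourself.

```python
from urllib.parse import urlparse


def resolve_ocr_url(env: dict[str, str], trusted_hosts: set[str]) -> str:
    """Return the configured OCR URL, refusing hosts that are not allowlisted."""
    url = env.get("BAIDU_PADDLEOCR_API_URL", "").strip()
    if not url:
        # Deliberately no fallback: a silent default is the concern here.
        raise RuntimeError("Set BAIDU_PADDLEOCR_API_URL explicitly")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise RuntimeError(f"OCR endpoint must use https: {url}")
    host = parsed.hostname or ""
    if host not in trusted_hosts:
        raise RuntimeError(f"OCR host {host!r} is not in the trusted list")
    return url


TRUSTED = {"ocr.example.com"}  # placeholder; list only hosts you have verified
env = {"BAIDU_PADDLEOCR_API_URL": "https://ocr.example.com/layout-parsing"}
print(resolve_ocr_url(env, TRUSTED))
```

In real use the `env` argument would come from `os.environ` after the `.env` file is loaded.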
Functional analysis
Type: OpenClaw Skill
Name: douyin-scraper
Version: 1.0.0
The skill bundle is a functional Douyin (TikTok China) scraper designed to collect image-text notes, perform OCR, and generate Markdown reports. The code in `scripts/full_workflow.py` and `scripts/login.py` uses Playwright for browser automation and correctly implements session persistence in a local directory. While it communicates with a specific Baidu AI Studio app endpoint (`aistudio-app.com`) for OCR tasks, this behavior is documented and aligned with the tool's stated purpose. No evidence of data exfiltration, credential theft, or malicious prompt injection was found.
Capability assessment
Purpose & Capability
The skill's name/description (Douyin image-text scraping + OCR) aligns with the included scripts (Playwright scraping, screenshots, OCR, Markdown output). However, the registry metadata declares no required environment variables, while SKILL.md and the code require BAIDU_PADDLEOCR_TOKEN (and optionally BAIDU_PADDLEOCR_API_URL). Omitting the env requirement from the metadata is an inconsistency that reduces transparency.
Instruction Scope
SKILL.md instructs the agent to install Playwright, create a .env with BAIDU_PADDLEOCR_TOKEN, and run the login and full_workflow scripts — which is consistent with scraping + OCR. The runtime instructions do not warn that screenshots will be uploaded to a remote HTTP API; the code base64-encodes screenshots and POSTs them (with an Authorization header) to OCR_API_URL. The default OCR_API_URL in the script is aistudio-app.com (https://r41cd0p9x7dfp1s7.aistudio-app.com/layout-parsing) rather than an obviously-official Baidu API endpoint, and the README/SKILL.md do not document this alternate endpoint or the privacy implications of uploading screenshots.
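
To make the upload step concrete, the sketch below builds (but does not send) a request of the kind described: a base64-encoded screenshot in a POST body with an Authorization header. It is a hedged illustration, not the skill's code. The JSON field name `file` and the `token <value>` header scheme are placeholders, the URL is an example host, and the stdlib `urllib` is used here instead of `requests` so that the block runs without extra packages.

```python
import base64
import json
import urllib.request


def build_ocr_request(image_bytes: bytes, api_url: str, token: str) -> urllib.request.Request:
    """Build, without sending, a POST carrying a base64-encoded screenshot."""
    payload = {
        # Field name is a placeholder; check the OCR provider's API reference.
        "file": base64.b64encode(image_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Header scheme is an assumption; the analysis only states that
            # an Authorization header is sent.
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_ocr_request(b"\x89PNG...", "https://ocr.example.com/layout-parsing", "abc123")
print(req.get_method(), req.host)
print(len(req.data), "bytes would leave the machine")
```

Inspecting the request object before anything is sent shows exactly which host receives both the token and the image data.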
Install Mechanism
There is no install spec (instruction-only install via pip/playwright commands), so nothing arbitrary is downloaded by an installer. The install instructions require pip installing Playwright and running 'playwright install chromium' — expected for the declared functionality.
Credentials
Requiring a Baidu PaddleOCR token is proportionate for OCR. But the registry lists no required env vars while the code requires BAIDU_PADDLEOCR_TOKEN and supports BAIDU_PADDLEOCR_API_URL. The code's default OCR_API_URL points to a non-standard domain (aistudio-app.com subdomain) which may be a third-party/proxy endpoint; this makes the token and uploaded screenshots potentially usable by that third party rather than only by an official Baidu API, which is disproportionate and not documented.
Persistence & Privilege
The skill persists a Playwright browser profile under profile/ to store login state (login.py). It does not request always:true or system-wide config changes. Headful Playwright sessions and saved browser profile are reasonable for this use-case.
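How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install douyin-scraper
  3. Once installed, invoke the skill by name or trigger it with /douyin-scraper
  4. Provide the inputs described in the skill's parameter table to get structured output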
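Version history
v1.0.0
A Douyin image-text note collection tool. Search by keyword → automatically apply the "图文 · 一周内" (image-text, within one week) filter → screenshot with Playwright (to work around anti-scraping measures) → recognize the text in images with Baidu OCR → output a Markdown report (with heat scores).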
Metadata
Slug douyin-scraper
Version 1.0.0
License MIT-0
Total installs 4
Current installs 4
Versions 1
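FAQ

What is douyin-scraper?

A Douyin image-text note collection tool. Search by keyword → automatically apply the "图文 · 一周内" (image-text, within one week) filter → screenshot with Playwright (to work around anti-scraping measures) → recognize the text in images with Baidu OCR → output a Markdown report (with heat scores). Load this skill when the user mentions scenarios such as "抖音图文采集" (Douyin image-text collection), "抖音笔记抓取" (Douyin note scraping), "抖音爬虫" (Douyin crawler), or "抖音内容采集" (Douyin content collection). It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 114 total downloads to date.

How do I install douyin-scraper?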

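Run the command "/install douyin-scraper" in the OpenClaw or Claude Code chat box to install it in one step. Before the first run, complete the prerequisites listed in SKILL.md: install the dependencies, set BAIDU_PADDLEOCR_TOKEN in `.env`, and log in to Douyin once.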

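Is douyin-scraper free?

Yes. douyin-scraper is free of charge and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does douyin-scraper support?

douyin-scraper is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops douyin-scraper?

It is developed and maintained by SamCheng0717 (@samcheng0717). The current version is v1.0.0.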
