
Douyin Scraper V2

Name: Douyin Scraper V2
Author: terrycarter1985

Author: terrycarter1985 · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install douyin-scraper-v2

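Description

Douyin image-text note collection tool. Supports natural-language search (e.g. "搜索一下海鲜视频", "search for seafood videos"): it automatically extracts the keyword → searches → filters to "image-text · past week" → takes Playwright screenshots (bypassing anti-scraping) → recognizes image text with Baidu OCR → outputs a Markdown report (with heat scores). Triggered when the user mentions "抖音搜索" (Douyin search), "抖音图文采集" (Douyin image-text collection), "抖音笔记抓取" (Douyin note scraping), "抖音爬...

Usage (SKILL.md)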

# douyin-scraper

Douyin image-text note collection tool: one command runs search → filter image-text notes → screenshot → OCR → Markdown report.

## ⚠️ Prerequisites

### 1. Install dependencies

```bash
pip install playwright requests python-dotenv
python -m playwright install chromium
```

### 2. Configure the Baidu PaddleOCR token

Create a `.env` file in the skill directory:

```
BAIDU_PADDLEOCR_TOKEN=your_token
```

To get a token, visit [Baidu AI Studio](https://aistudio.baidu.com/paddleocr) and register for free; the free tier allows 10,000 calls per day.
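
Before a long scraping run, it can help to confirm that the token is actually readable. The sketch below is an illustration only: `read_env_file` and `get_ocr_token` are hypothetical helper names, and the parsing uses the standard library so the check runs without extra packages (the skill itself installs `python-dotenv`, which presumably handles loading in the real scripts).

```python
import os
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines (a stand-in for a dotenv loader)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_ocr_token(env_path: str = ".env") -> str:
    """Return the token from the environment or .env, or fail loudly."""
    token = os.environ.get("BAIDU_PADDLEOCR_TOKEN") or read_env_file(env_path).get(
        "BAIDU_PADDLEOCR_TOKEN", ""
    )
    if not token:
        raise RuntimeError("BAIDU_PADDLEOCR_TOKEN is not set; add it to .env")
    return token
```

Failing early with a clear message avoids spending a full search and screenshot pass only to have every OCR call rejected.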
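
### 3. Log in to Douyin (one-time)

```bash
python <skill_path>/scripts/login.py
```

A browser window opens Douyin. Scan the QR code to log in, then close the browser. The login state is saved automatically, so later runs do not need to log in again.

---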
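
## 🗣️ Natural-language search (agent entry point)

When a user states a search request in natural language, **extract the keyword first, then call the script**.

### Extraction rules

1. Extract the core search term from the user input, dropping filler phrases such as "搜索一下" (search for), "帮我找" (help me find), and "看看" (take a look at).
2. If the user specifies a quantity, pass it as `--count`; otherwise use the default.
3. If the user says "只要图片" (images only) or "不用识别文字" (no text recognition needed), add `--no-ocr`.
4. Keep the keyword short and precise (2-6 Chinese characters); do not pass the whole sentence as the keyword.

### Examples

| User input | Extracted keyword | Command arguments |
|----------|-----------|------|
| 搜索一下海鲜视频 (search for seafood videos) | 海鲜 (seafood) | `--keyword "海鲜"` |
| 帮我找找韩国医美相关内容 (help me find content about Korean aesthetic medicine) | 韩国医美 (Korean aesthetic medicine) | `--keyword "韩国医美"` |
| 抖音上最近有什么减肥餐笔记 (what diet-meal notes are on Douyin lately) | 减肥餐 (diet meals) | `--keyword "减肥餐"` |
| 看看咖啡相关的图文，要5条 (show image-text posts about coffee, 5 of them) | 咖啡 (coffee) | `--keyword "咖啡" --count 5` |
| 搜一下宠物猫，不用OCR (search for pet cats, no OCR) | 宠物猫 (pet cats) | `--keyword "宠物猫" --no-ocr` |
| 抖音搜索穿搭技巧 (Douyin search: outfit tips) | 穿搭技巧 (outfit tips) | `--keyword "穿搭技巧"` |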
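
Rules 2 and 3 are mechanical enough to sketch in code. The snippet below is a rough illustration, not the skill's implementation: `extract_flags`, the measure-word pattern, and the phrase list are assumptions drawn from the rules and examples above. Keyword extraction itself (rules 1 and 4) is left to the agent, since it depends on understanding the sentence.

```python
import re

# Phrases that mean "skip OCR"; the first two come from rule 3,
# the third from the example table (matched case-insensitively).
NO_OCR_PHRASES = ("只要图片", "不用识别文字", "不用ocr")

# Assumed pattern: digits followed by a common Chinese measure word.
COUNT_PATTERN = re.compile(r"(\d+)\s*[条篇个]")


def extract_flags(user_input: str) -> list[str]:
    """Return the optional CLI flags implied by a natural-language request."""
    flags: list[str] = []
    match = COUNT_PATTERN.search(user_input)
    if match:
        flags += ["--count", match.group(1)]
    lowered = user_input.lower()
    if any(phrase in lowered for phrase in NO_OCR_PHRASES):
        flags.append("--no-ocr")
    return flags


print(extract_flags("看看咖啡相关的图文，要5条"))
print(extract_flags("搜一下宠物猫，不用OCR"))
```

Requests that mention neither a quantity nor an OCR preference yield an empty list, so the script's defaults apply.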
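
### Agent execution steps

1. Extract the keyword from the user input.
2. Run the command:
   ```bash
   python <skill_path>/scripts/full_workflow.py --keyword "<extracted keyword>" [--count N] [--no-ocr]
   ```
3. After the script finishes, read the Markdown report generated under `output/`.
4. Summarize the report for the user (number of notes, the highest-heat entries, key findings).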
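
Because the keyword in step 2 originates from free-form user text, one cautious way to run the script is to build an argument list and skip the shell entirely. The following is a minimal sketch under that assumption; `build_command` is a hypothetical helper, not part of the skill.

```python
import sys
from pathlib import Path
from typing import Optional


def build_command(
    skill_path: str,
    keyword: str,
    count: Optional[int] = None,
    no_ocr: bool = False,
) -> list[str]:
    """Build an argv list for full_workflow.py without any shell quoting."""
    keyword = keyword.strip()
    if not keyword:
        raise ValueError("keyword must not be empty")
    if keyword.startswith("-"):
        # Prevent the keyword from being parsed as another option.
        raise ValueError("keyword must not start with '-'")
    script = Path(skill_path) / "scripts" / "full_workflow.py"
    cmd = [sys.executable, str(script), "--keyword", keyword]
    if count is not None:
        if count < 1:
            raise ValueError("count must be a positive integer")
        cmd += ["--count", str(count)]
    if no_ocr:
        cmd.append("--no-ocr")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` without `shell=True`, so quotes or semicolons in the keyword reach the script as literal text instead of being interpreted by a shell.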

---

## 🔧 Direct command-line usage

```bash
# Collect the default number of image-text notes (with OCR)
python <skill_path>/scripts/full_workflow.py --keyword "韩国医美"

# Specify the number of notes
python <skill_path>/scripts/full_workflow.py --keyword "减肥餐" --count 5

# Skip OCR (screenshots only)
python <skill_path>/scripts/full_workflow.py --keyword "咖啡" --no-ocr
```

| Parameter | Description | Default |
|------|------|--------|
| `--keyword` | Search keyword | Required |
| `--count` | Number of notes to collect | `5` |
| `--no-ocr` | Skip OCR | Off |
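
For readers who want to see how such an interface is typically declared, here is an `argparse` sketch that mirrors the table above. It is an assumption about the shape of the parser, not a copy of `full_workflow.py`; the help strings and the `build_parser` name are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented options with their documented defaults."""
    parser = argparse.ArgumentParser(
        description="Collect Douyin image-text notes for a keyword."
    )
    parser.add_argument("--keyword", required=True, help="search keyword")
    parser.add_argument(
        "--count", type=int, default=5, help="number of notes to collect"
    )
    parser.add_argument(
        "--no-ocr", action="store_true", help="skip OCR and keep screenshots only"
    )
    return parser


args = build_parser().parse_args(["--keyword", "咖啡", "--no-ocr"])
print(args.keyword, args.count, args.no_ocr)
```

Note that `argparse` exposes `--no-ocr` as the attribute `no_ocr`, which is `False` unless the flag is given.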

---

## Output

Reports are saved to `output/notes_{keyword}_{timestamp}.md`, and images are saved to `data/images/`.
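
To locate or pre-compute a report path, the naming pattern can be reproduced as below. The timestamp format is not documented, so `%Y%m%d_%H%M%S` is purely an assumption, and `report_path` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path


def report_path(keyword: str, now: datetime, output_dir: str = "output") -> Path:
    """Build output/notes_{keyword}_{timestamp}.md for a given moment."""
    # Assumed timestamp layout; the real script may format it differently.
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"notes_{keyword}_{timestamp}.md"


print(report_path("咖啡", datetime(2024, 1, 2, 3, 4, 5)).name)
```

When the exact format is unknown, globbing for `output/notes_<keyword>_*.md` and taking the most recently modified file is a safer way to find the latest report.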
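
Each note includes:

- 🔥 Heat score (likes / days since publication) and the formula used
- 👍 Like count, publish time, author, and original link
- 📝 Original description
- 🔍 OCR-recognized image text (multiple images supported)
- 🖼️ Local screenshot path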
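
The heat score in the first item is a simple ratio, sketched here with made-up sample notes. How the script treats a note published today is not stated, so clamping the age to one day is an assumption added to avoid division by zero.

```python
def heat_score(likes: int, days_ago: float) -> float:
    """Likes per day since publication."""
    # Assumption: treat anything newer than one day as one day old.
    return likes / max(days_ago, 1.0)


# Hypothetical sample data, for illustration only.
notes = [
    {"title": "A", "likes": 700, "days_ago": 7},
    {"title": "B", "likes": 300, "days_ago": 1},
    {"title": "C", "likes": 50, "days_ago": 0},
]

ranked = sorted(
    notes, key=lambda n: heat_score(n["likes"], n["days_ago"]), reverse=True
)
for note in ranked:
    print(note["title"], heat_score(note["likes"], note["days_ago"]))
```

With these values, note B (300 likes in one day) outranks note A (700 likes over seven days), which is the intended effect of dividing by age.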
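
---

## Technical features

- **Playwright screenshots**: captures content images with `element.screenshot()`, bypassing Douyin's anti-scraping protection on image URLs
- **Image-text filtering**: automatically detects and skips videos, collecting only "image-text" (图文) notes
- **OCR noise filtering**: automatically removes Douyin navigation-bar text (精选 / 推荐 / 关注, i.e. Featured / Recommended / Following, etc.) picked up from the screenshots
- **Multi-image support**: each image in a multi-image post is screenshotted and OCR'd separately, and the results are merged
- **Anti-detection**: headed browser (`headless=False`) plus human-like pacing to avoid triggering CAPTCHAs
- **Heat formula**: `likes / days_ago`; newer notes with more likes rank higher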
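
The OCR noise filtering item can be pictured as a small post-processing pass over recognized lines. This sketch assumes exact-match filtering against the three labels named above; the actual script's word list and matching strategy are not documented, and `strip_nav_noise` is a hypothetical name.

```python
# Navigation labels named in the feature list; the real list may be longer.
NAV_WORDS = {"精选", "推荐", "关注"}


def strip_nav_noise(ocr_lines: list[str]) -> list[str]:
    """Drop empty lines and lines that are exactly a navigation label."""
    cleaned: list[str] = []
    for line in ocr_lines:
        text = line.strip()
        if not text or text in NAV_WORDS:
            continue
        cleaned.append(text)
    return cleaned


print(strip_nav_noise(["精选", " 推荐 ", "关注", "今日推荐咖啡豆", ""]))
```

Matching whole lines instead of substrings keeps genuine content such as "今日推荐咖啡豆", which merely contains a navigation word.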

---

## Directory structure

```
douyin-scraper/
├── scripts/
│   ├── full_workflow.py   # Main pipeline
│   └── login.py           # Login script
├── data/
│   └── images/            # Screenshots
├── output/                # Markdown reports
├── profile/               # Browser login state
└── .env                   # Token configuration
```

Security recommendations

Review carefully before installing. If you proceed, use a dedicated Douyin account, confirm each scrape intentionally, verify the package identity mismatch with the publisher, inspect the OCR endpoint, and delete the profile/ directory when you no longer want the skill to retain login state.
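
Clearing the retained login state amounts to deleting the `profile/` directory inside the skill folder. A small sketch, assuming the directory layout documented in SKILL.md (`clear_login_state` is a hypothetical helper name):

```python
import shutil
from pathlib import Path


def clear_login_state(skill_dir: str) -> bool:
    """Delete <skill_dir>/profile; return True if something was removed."""
    profile = Path(skill_dir) / "profile"
    if not profile.is_dir():
        return False
    shutil.rmtree(profile)
    return True
```

After this runs, the next scrape will require logging in again through `login.py`.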

Functional analysis

Type: OpenClaw Skill
Name: douyin-scraper-v2
Version: 1.0.0

The skill facilitates Douyin scraping using Playwright for browser automation and a third-party Baidu PaddleOCR API for text extraction. It is classified as suspicious due to high-risk capabilities and a potential command injection vulnerability in the `SKILL.md` instructions, which direct the AI agent to construct shell commands using unvalidated user input (e.g., `--keyword "<提取的关键词>"`, where the placeholder stands for the extracted keyword). The workflow involves persisting browser sessions in a local `profile/` directory and exfiltrating captured image data to an external endpoint (aistudio-app.com) for processing, which, while aligned with the stated purpose, presents a privacy and security risk.

Capability assessment

⚠ Purpose & Capability

The code and documentation are broadly coherent with the advertised scraper, but the advertised capability includes bypassing anti-scraping controls and avoiding bot detection, which is materially risky for a user account.

⚠ Instruction Scope

Natural-language triggers instruct the agent to run the workflow script directly after extracting a keyword, with no explicit per-run confirmation for use of the logged-in Douyin profile or account-risking scraping behavior.

⚠ Install Mechanism

There is no install spec, the docs require manual unpinned package/browser installation, and the embedded _meta.json identity/version does not match the registry metadata shown for this submitted skill.

⚠ Credentials

Browser automation, local screenshots/reports, and OCR network calls are expected for the purpose, but the workflow also launches a persistent logged-in browser with anti-detection settings and sends image data to a default external app endpoint.

⚠ Persistence & Privilege

The skill persists Douyin login state under profile/ and reuses it on later runs, giving the automation ongoing access as the user until that profile is cleared.

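How to use

1. Make sure OpenClaw is installed (locally or via Docker).
2. Enter the install command in the chat box: /install douyin-scraper-v2
3. Once installed, invoke the Skill by name or trigger it with /douyin-scraper-v2
4. Provide the inputs described in the Skill's parameter reference to get structured output.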

Version history

v1.0.0

Natural language search support + agent instructions

Metadata

Slug: douyin-scraper-v2

Version: 1.0.0

License: MIT-0

Total installs: 0

Current installs: 0

Number of versions: 1

FAQ