Douyin Scraper V2

by terrycarter1985 · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
44 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install douyin-scraper-v2
Description
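A collection tool for Douyin image-text notes. It supports natural-language search (e.g. "搜索一下海鲜视频", "search for seafood videos"): it automatically extracts the keyword → searches → filters by "图文·一周内" (image-text, within the past week) → takes Playwright screenshots (bypassing anti-scraping measures) → recognizes the text in the images with Baidu OCR → outputs a Markdown report (with heat scores). When the user mentions "抖音搜索" (Douyin search), "抖音图文采集" (Douyin image-text collection), "抖音笔记抓取" (Douyin note scraping), "抖音爬...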
README (SKILL.md)

# douyin-scraper

A collection tool for Douyin image-text notes. One command runs the whole pipeline: search → filter image-text notes → screenshot → OCR → Markdown report.

## ⚠️ Prerequisites

### 1. Install dependencies

```bash
pip install playwright requests python-dotenv
python -m playwright install chromium
```

### 2. Configure the Baidu PaddleOCR token

Create a `.env` file in the skill directory:

```
BAIDU_PADDLEOCR_TOKEN=your_token
```

To get a token, visit [Baidu AI Studio](https://aistudio.baidu.com/paddleocr) and register for free; the free tier allows 10,000 calls per day.
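
As a rough illustration of how a script could pick up this token, the sketch below parses a `.env` file using only the standard library. It is a simplified stand-in for `python-dotenv` (which the install step pulls in), and the helper name `read_env_value` is hypothetical rather than taken from the skill's code.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str) -> Optional[str]:
    """Return the value for `key` from a KEY=VALUE file, or None if absent.

    Hypothetical helper: handles only plain KEY=VALUE lines, blank lines,
    and # comments, which is a subset of what python-dotenv supports.
    """
    if not env_path.is_file():
        return None
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes around the value.
            return value.strip().strip("'\"")
    return None
```

A caller could then fail early with a clear message when `read_env_value(Path(".env"), "BAIDU_PADDLEOCR_TOKEN")` returns `None`, instead of sending an unauthenticated OCR request.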
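
### 3. Log in to Douyin (one time only)

```bash
python <skill_path>/scripts/login.py
```

A browser opens Douyin. Scan the QR code to log in, then close the browser. The login state is saved automatically, so you do not need to repeat this step.

---

## 🗣️ Natural-language search (agent entry point)

When the user states a search request in natural language, **extract the keyword first, then call the script**.

### Extraction rules

1. Extract the core search term from the user's input, dropping filler phrases such as "搜索一下" (search for), "帮我找" (help me find), and "看看" (take a look at).
2. If the user specifies a quantity, pass it as `--count`; otherwise use the default.
3. If the user says "只要图片" (images only) or "不用识别文字" (no text recognition needed), add `--no-ocr`.
4. Keep the keyword short and precise (2 to 6 characters); do not use the whole sentence as the keyword.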
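
The rules above are written for an agent to apply with judgment, but a minimal rule-based sketch may make them concrete. Everything here is an assumption for illustration: the function names `parse_request` and `to_cli_args`, the phrase lists, and the count pattern are hypothetical, and this approach only copes with simple phrasings.

```python
import re
from typing import List, Optional, Tuple

# Illustrative filler phrases; longer variants come first so that
# "帮我找找" is stripped before "帮我找".
PREFIXES = ["搜索一下", "搜一下", "帮我找找", "帮我找", "抖音搜索", "看看"]
SUFFIXES = ["相关的图文", "相关内容", "视频", "笔记"]
NO_OCR_HINTS = ["只要图片", "不用识别文字", "不用OCR"]
COUNT_PATTERN = re.compile(r"(\d+)\s*[条篇个]")


def parse_request(user_input: str) -> Tuple[str, Optional[int], bool]:
    """Return (keyword, count, no_ocr) for one natural-language request."""
    count_match = COUNT_PATTERN.search(user_input)
    count = int(count_match.group(1)) if count_match else None
    no_ocr = any(hint in user_input for hint in NO_OCR_HINTS)

    # The keyword is taken from the first clause only.
    keyword = re.split(r"[,,。]", user_input)[0].strip()
    for prefix in PREFIXES:
        if keyword.startswith(prefix):
            keyword = keyword[len(prefix):]
            break
    for suffix in SUFFIXES:
        if keyword.endswith(suffix):
            keyword = keyword[: -len(suffix)]
            break
    return keyword, count, no_ocr


def to_cli_args(user_input: str) -> List[str]:
    """Translate a request into the flags described in the rules."""
    keyword, count, no_ocr = parse_request(user_input)
    args = ["--keyword", keyword]
    if count is not None:
        args += ["--count", str(count)]
    if no_ocr:
        args.append("--no-ocr")
    return args
```

Free-form questions fall outside these fixed phrase lists, which is why the skill leaves extraction to the agent rather than to string matching.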
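
### Examples

| User input | Extracted keyword | Command |
|------------|-------------------|---------|
| 搜索一下海鲜视频 (search for seafood videos) | 海鲜 | `--keyword "海鲜"` |
| 帮我找找韩国医美相关内容 (help me find content about Korean medical aesthetics) | 韩国医美 | `--keyword "韩国医美"` |
| 抖音上最近有什么减肥餐笔记 (what weight-loss meal notes are on Douyin lately) | 减肥餐 | `--keyword "减肥餐"` |
| 看看咖啡相关的图文,要5条 (show me coffee-related image-text notes, 5 of them) | 咖啡 | `--keyword "咖啡" --count 5` |
| 搜一下宠物猫,不用OCR (search for pet cats, no OCR) | 宠物猫 | `--keyword "宠物猫" --no-ocr` |
| 抖音搜索穿搭技巧 (Douyin search for outfit tips) | 穿搭技巧 | `--keyword "穿搭技巧"` |

### Agent execution steps

1. Extract the keyword from the user's input.
2. Run the command:
   ```bash
   python <skill_path>/scripts/full_workflow.py --keyword "<extracted keyword>" [--count N] [--no-ocr]
   ```
3. When the script finishes, read the Markdown report generated under `output/`.
4. Summarize the report for the user (number of notes, the highest-heat notes, key findings).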
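
Because the keyword in step 2 originates from user input, one cautious way to run the command is to pass the arguments as a list instead of interpolating them into a shell string. The sketch below assumes a hypothetical helper `build_command`; only the script path and the three flags come from the steps above.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_command(skill_path: Path, keyword: str,
                  count: Optional[int] = None, no_ocr: bool = False) -> List[str]:
    """Build an argv list for full_workflow.py (hypothetical helper)."""
    keyword = keyword.strip()
    if not keyword:
        raise ValueError("keyword must not be empty")
    if count is not None and count < 1:
        raise ValueError("count must be a positive integer")

    script = skill_path / "scripts" / "full_workflow.py"
    argv = [sys.executable, str(script), "--keyword", keyword]
    if count is not None:
        argv += ["--count", str(count)]
    if no_ocr:
        argv.append("--no-ocr")
    return argv


# With a list and no shell, the keyword reaches the script as a single
# argument even if it contains quotes or shell metacharacters:
# subprocess.run(build_command(Path("<skill_path>"), "咖啡", count=5), check=True)
```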

---

## 🔧 Direct command-line usage

```bash
# Collect image-text notes with the default count (OCR included)
python <skill_path>/scripts/full_workflow.py --keyword "韩国医美"

# Specify the number of notes
python <skill_path>/scripts/full_workflow.py --keyword "减肥餐" --count 5

# Skip OCR (screenshots only)
python <skill_path>/scripts/full_workflow.py --keyword "咖啡" --no-ocr
```

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--keyword` | Search keyword | required |
| `--count` | Number of notes to collect | `5` |
| `--no-ocr` | Skip OCR | off |
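
For readers who want to see how such an interface is typically declared, here is a minimal `argparse` sketch that matches the table. It is an assumption about structure, not the actual contents of `full_workflow.py`, and the function name `parse_cli` is hypothetical.

```python
import argparse
from typing import List, Optional


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the three documented flags; defaults mirror the table above."""
    parser = argparse.ArgumentParser(description="Collect Douyin image-text notes")
    parser.add_argument("--keyword", required=True, help="search keyword")
    parser.add_argument("--count", type=int, default=5,
                        help="number of notes to collect")
    parser.add_argument("--no-ocr", action="store_true",
                        help="skip OCR and keep screenshots only")
    return parser.parse_args(argv)
```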

---

## Output

Reports are saved to `output/notes_{keyword}_{timestamp}.md`, and images are saved to `data/images/`.
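
The naming pattern can be sketched as follows. The timestamp layout (`YYYYMMDD_HHMMSS`) and the helper name `report_path` are assumptions, since the README gives only the `{timestamp}` placeholder.

```python
from datetime import datetime
from pathlib import Path


def report_path(keyword: str, now: datetime,
                output_dir: Path = Path("output")) -> Path:
    """Compose output/notes_{keyword}_{timestamp}.md for one run."""
    # Assumed timestamp layout; the real script may format it differently.
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return output_dir / f"notes_{keyword}_{timestamp}.md"
```

Passing the time in as an argument, rather than calling `datetime.now()` inside the function, keeps the helper easy to test.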
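
Each note includes:

- 🔥 Heat score (likes / days since publication) and the formula used to compute it
- 👍 Like count, publication time, author, and a link to the original post
- 📝 Original description
- 🔍 Text recognized by OCR from the images (multiple images supported)
- 🖼️ Local screenshot paths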
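
The heat score in the first bullet is simple enough to sketch directly. The clamp for notes published today is an assumption (the README does not say how a zero-day age is handled), and the names `heat_score` and `rank_by_heat` are hypothetical.

```python
from typing import Dict, List


def heat_score(likes: int, days_ago: float) -> float:
    """Likes per day since publication.

    Assumption: notes less than one day old are clamped to one day so the
    division is always defined.
    """
    return likes / max(days_ago, 1.0)


def rank_by_heat(notes: List[Dict]) -> List[Dict]:
    """Return notes sorted hottest first without mutating the input list."""
    return sorted(notes,
                  key=lambda n: heat_score(n["likes"], n["days_ago"]),
                  reverse=True)
```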
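
---

## Technical features

- **Playwright screenshots**: captures content images with `element.screenshot()`, bypassing Douyin's anti-scraping protection on image URLs
- **Image-text filtering**: automatically detects and skips videos, collecting only notes of the "图文" (image-text) type
- **OCR noise filtering**: automatically removes Douyin navigation-bar text picked up in screenshots (精选 / 推荐 / 关注, i.e. Featured / Recommended / Following, and similar labels)
- **Multi-image support**: for a note with several images, each image is screenshotted and run through OCR in turn, and the results are merged
- **Anti-detection**: a headed browser (`headless=False`) plus human-like pacing, to avoid triggering CAPTCHAs
- **Heat formula**: `likes / days_ago`; newer and more-liked notes rank higher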
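
The OCR noise filtering and multi-image merging described above could look roughly like this. The label set contains only the three examples the list names, so it is partial by assumption, and `clean_ocr_lines` and `merge_ocr_results` are hypothetical names.

```python
from typing import Iterable, List

# Navigation labels named in the feature list; the full set the skill
# filters is not documented, so treat this as a partial, assumed list.
NAV_NOISE = {"精选", "推荐", "关注"}


def clean_ocr_lines(lines: Iterable[str]) -> List[str]:
    """Drop empty lines and lines made up only of navigation labels."""
    kept = []
    for line in lines:
        tokens = line.split()
        if not tokens or all(token in NAV_NOISE for token in tokens):
            continue
        kept.append(line.strip())
    return kept


def merge_ocr_results(per_image_lines: Iterable[Iterable[str]]) -> str:
    """Clean each image's OCR lines and join them into one text block."""
    merged: List[str] = []
    for lines in per_image_lines:
        merged.extend(clean_ocr_lines(lines))
    return "\n".join(merged)
```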

---

## Directory structure

```
douyin-scraper/
├── scripts/
│   ├── full_workflow.py   # main pipeline
│   └── login.py           # login script
├── data/
│   └── images/            # screenshots
├── output/                # Markdown reports
├── profile/               # browser login state
└── .env                   # token configuration
```
Usage Guidance
Review carefully before installing. If you proceed, use a dedicated Douyin account, confirm each scrape intentionally, verify the package identity mismatch with the publisher, inspect the OCR endpoint, and delete the profile/ directory when you no longer want the skill to retain login state.
Capability Analysis
Type: OpenClaw Skill · Name: douyin-scraper-v2 · Version: 1.0.0
The skill facilitates Douyin scraping using Playwright for browser automation and a third-party Baidu PaddleOCR API for text extraction. It is classified as suspicious due to high-risk capabilities and a potential command injection vulnerability in the `SKILL.md` instructions, which direct the AI agent to construct shell commands from unvalidated user input (e.g., `--keyword "<提取的关键词>"`, where the placeholder means "extracted keyword"). The workflow persists browser sessions in a local `profile/` directory and sends captured image data to an external endpoint (aistudio-app.com) for processing, which, while aligned with the stated purpose, presents a privacy and security risk.
Capability Assessment
Purpose & Capability
The code and documentation are broadly coherent with the advertised scraper, but the advertised capability includes bypassing anti-scraping controls and avoiding bot detection, which is materially risky for a user account.
Instruction Scope
Natural-language triggers instruct the agent to run the workflow script directly after extracting a keyword, with no explicit per-run confirmation for use of the logged-in Douyin profile or account-risking scraping behavior.
Install Mechanism
There is no install spec, the docs require manual unpinned package/browser installation, and the embedded _meta.json identity/version does not match the registry metadata shown for this submitted skill.
Credentials
Browser automation, local screenshots/reports, and OCR network calls are expected for the purpose, but the workflow also launches a persistent logged-in browser with anti-detection settings and sends image data to a default external app endpoint.
Persistence & Privilege
The skill persists Douyin login state under profile/ and reuses it on later runs, giving the automation ongoing access as the user until that profile is cleared.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install douyin-scraper-v2
  3. After installation, invoke the skill by name or use /douyin-scraper-v2
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Natural language search support + agent instructions
Metadata
Slug douyin-scraper-v2
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Douyin Scraper V2?

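A collection tool for Douyin image-text notes. It supports natural-language search (e.g. "搜索一下海鲜视频", "search for seafood videos"): it automatically extracts the keyword → searches → filters by "图文·一周内" (image-text, within the past week) → takes Playwright screenshots (bypassing anti-scraping measures) → recognizes the text in the images with Baidu OCR → outputs a Markdown report (with heat scores). When the user mentions "抖音搜索" (Douyin search), "抖音图文采集" (Douyin image-text collection), "抖音笔记抓取" (Douyin note scraping), "抖音爬... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.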

How do I install Douyin Scraper V2?

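Run "/install douyin-scraper-v2" in the OpenClaw or Claude Code chat to install it in one step. Before first use, complete the prerequisites listed in the README: install the Python dependencies, configure the Baidu PaddleOCR token, and log in to Douyin once.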

Is Douyin Scraper V2 free?

Yes, Douyin Scraper V2 is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Douyin Scraper V2 support?

Douyin Scraper V2 is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Douyin Scraper V2?

It is built and maintained by terrycarter1985 (@terrycarter1985); the current version is v1.0.0.
