gpttang

Douyin Content Tracker Skill

Author: yibo · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 99 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install douyin-content-tracker
Description
Scrapes Douyin creator videos, downloads audio (Playwright+ffmpeg with yt-dlp fallback), and transcribes with Whisper. Covers setup, daily tracking, cookie m...
Usage (SKILL.md)

Douyin Content Tracker

Crawls Douyin creator videos with MediaCrawler, downloads the audio with Playwright + ffmpeg (switching to yt-dlp automatically when blocked), and transcribes the speech with Whisper.

Quick start

One-step full pipeline (recommended)

SKILL_DIR=$(python -c "import pathlib; print(next(pathlib.Path.home().rglob('douyin-content-tracker-skill/SKILL.md')).parent)")
cd "$SKILL_DIR"

# 1. Collect the latest 3 videos
python scripts/track_latest.py --limit 3

# 2. Download audio and transcribe (on macOS, set KMP_DUPLICATE_LIB_OK first)
export KMP_DUPLICATE_LIB_OK=TRUE
WHISPER_MODEL=small python scripts/extract_subtitle.py

First-time setup

1. Install Python dependencies

cd "$SKILL_DIR"
pip install -r scripts/requirements.txt
pip install openai-whisper
python -m playwright install chromium

# yt-dlp (fallback audio downloader)
pip install yt-dlp
# or
brew install yt-dlp

2. Install MediaCrawler

# macOS/Linux
git clone https://github.com/NanmiCoder/MediaCrawler ~/MediaCrawler
cd ~/MediaCrawler && pip install -r requirements.txt

# Windows
git clone https://github.com/NanmiCoder/MediaCrawler D:/MediaCrawler
cd D:/MediaCrawler && pip install -r requirements.txt

3. Configure .env

cd "$SKILL_DIR"
cp .env.template .env

Edit .env:

# Required: path to MediaCrawler
MEDIACRAWLER_DIR=~/MediaCrawler

# Optional: output directory (default: ~/DouyinContentTracker)
OUTPUT_BASE_DIR=~/DouyinContentTracker

# Optional: Whisper model (small is recommended: faster and more stable)
WHISPER_MODEL=small
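
For readers who want to check how these three values resolve, here is a minimal Python sketch. It is not the skill's own loader: `load_env` and `resolve_settings` are hypothetical helpers, and falling back to `small` when WHISPER_MODEL is unset is an assumption. Only the `~/DouyinContentTracker` default comes from the comments above.

```python
from pathlib import Path


def load_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(env):
    """Apply defaults and expand a leading '~' in the two path settings."""
    return {
        # Required: a missing key raises KeyError on purpose.
        "mediacrawler_dir": Path(env["MEDIACRAWLER_DIR"]).expanduser(),
        # Documented default output directory.
        "output_base_dir": Path(
            env.get("OUTPUT_BASE_DIR", "~/DouyinContentTracker")
        ).expanduser(),
        # Assumed default; the source only recommends "small".
        "whisper_model": env.get("WHISPER_MODEL", "small"),
    }
```

Expanding `~` explicitly matters because a plain .env parser does not do it; whether the skill's scripts expand it themselves is not stated here.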

4. Add target accounts

Edit accounts.txt:

Creator name | https://www.douyin.com/user/MS4wLjABAAAA...
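
The line format above (a display name, a `|`, then the profile URL) can be read with a few lines of Python. This is a sketch under assumptions, not the skill's parser: `parse_accounts` is a hypothetical name, and the one-account-per-line layout and `#` comment handling are guesses.

```python
def parse_accounts(text):
    """Return a list of (name, url) tuples from 'name | url' lines."""
    accounts = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, url = line.partition("|")
        if not sep or not name.strip() or not url.strip():
            raise ValueError(f"expected 'name | url', got: {raw!r}")
        accounts.append((name.strip(), url.strip()))
    return accounts
```

Raising on a malformed line surfaces a typo in accounts.txt before any crawl starts.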

5. Get a cookie (choose one of three methods)

Method A: log in by scanning the QR code (generates a new cookie)

cd "$SKILL_DIR"
python scripts/scrape_profile.py

Method B: copy the cookie from WeChat (macOS)

cp ~/Library/Containers/com.tencent.xinWeChat/Data/Documents/xwechat_files/*/msg/file/*/.douyin_cookies.json \
   "$SKILL_DIR/.douyin_cookies.json"
chmod 600 "$SKILL_DIR/.douyin_cookies.json"

Method C: use an existing cookie. Make sure .douyin_cookies.json is in the skill directory.


Daily use

Collect + transcribe (standard workflow)

cd "$SKILL_DIR"

# 1. Collect (3 videos by default)
python scripts/track_latest.py

# 2. Clean the data
python scripts/clean_data.py

# 3. Download audio
python scripts/download_video.py

# 4. Speech recognition (on macOS, set the environment variable first)
export KMP_DUPLICATE_LIB_OK=TRUE
python scripts/extract_subtitle.py

Common commands

# Collect a specific number of videos
python scripts/track_latest.py --limit 5

# Use a custom accounts list
python scripts/track_latest.py --accounts-file /path/to/accounts.txt

# Collect data only (no audio download)
python scripts/track_latest.py --no-audio

# Transcribe only (skip the download)
export KMP_DUPLICATE_LIB_OK=TRUE
python scripts/extract_subtitle.py

Troubleshooting

❌ 0 videos extracted / "未获取到视频 URL" (no video URL obtained)

Cause: the cookie is expired or invalid, or the Douyin API is blocking Playwright requests.

Fix:

# 1. Check the cookie file
ls -la .douyin_cookies.json

# 2. Copy a fresh cookie (Method B)
cp ~/Library/Containers/com.tencent.xinWeChat/Data/Documents/xwechat_files/*/msg/file/*/.douyin_cookies.json \
   "$SKILL_DIR/.douyin_cookies.json"

# 3. Or scan the QR code again
python scripts/scrape_profile.py

# 4. Retry the download (falls back to yt-dlp automatically if Playwright fails)
python scripts/download_video.py

❌ yt-dlp is blocked too

# Retry with browser cookies
yt-dlp -x --audio-format m4a --cookies-from-browser chrome "<video URL>"

If the Chrome cookies work, add the following to the cmd list in the ytdlp_download_audio function of download_video.py:

"--cookies-from-browser", "chrome",

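To make the shape of that cmd list concrete, here is a minimal sketch of a command builder. It is not the code in download_video.py: `build_ytdlp_cmd`, its parameters, and the `-o` output template are assumptions for illustration. The yt-dlp options (`-x`, `--audio-format`, `--cookies-from-browser`, `-o`) are the real ones.

```python
def build_ytdlp_cmd(video_url, out_dir, browser=None):
    """Build a yt-dlp argument list that extracts audio as .m4a."""
    cmd = [
        "yt-dlp",
        "-x",                      # extract audio only
        "--audio-format", "m4a",   # same format as the primary path
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # assumed naming: <video id>.<ext>
    ]
    if browser:
        # Reuse cookies from a local browser profile, e.g. "chrome".
        cmd += ["--cookies-from-browser", browser]
    cmd.append(video_url)
    return cmd
```

A list like this can be passed straight to `subprocess.run(cmd, check=True)` with no shell quoting involved.
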
❌ Whisper crashes (SIGSEGV / OpenMP error)

Cause: conflicting OpenMP runtimes on macOS.

Fix:

export KMP_DUPLICATE_LIB_OK=TRUE
WHISPER_MODEL=small python scripts/extract_subtitle.py

Permanent fix: set WHISPER_MODEL=small in .env
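
When the transcription step is launched from a Python wrapper instead of a shell, the same two variables can be set on the child process only. This is a sketch: `whisper_env` is a hypothetical helper, not part of the skill.

```python
import os


def whisper_env(model="small", base_env=None):
    """Copy the environment and add the OpenMP workaround and model name."""
    env = dict(os.environ if base_env is None else base_env)
    env["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # tolerate duplicate OpenMP runtimes
    env["WHISPER_MODEL"] = model
    return env


# Example launch (not executed here):
#   subprocess.run([sys.executable, "scripts/extract_subtitle.py"],
#                  env=whisper_env(), check=True)
```

Setting the variables per process leaves the rest of the shell session unchanged.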

❌ Playwright browser missing

python -m playwright install chromium

❌ Cookie warning "已 N 天未更新" (not updated for N days)

python scripts/scrape_profile.py  # scan the QR code again
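
A staleness warning of this kind can be derived from the cookie file's modification time. The sketch below shows one way to do it and is not the skill's implementation: the function names and the 7-day threshold are assumptions, since the source leaves N unspecified.

```python
import time
from pathlib import Path


def cookie_age_days(path, now=None):
    """Days elapsed since the cookie file was last modified."""
    now = time.time() if now is None else now
    return (now - Path(path).stat().st_mtime) / 86400


def cookie_is_stale(path, max_age_days=7, now=None):
    """True when the file is older than max_age_days (assumed threshold)."""
    return cookie_age_days(path, now) > max_age_days
```

A stale result is a cue to rerun scripts/scrape_profile.py and scan the QR code again.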

Output directory structure

~/DouyinContentTracker/
├── data/                           # collected data
│   ├── 周凯谈烘焙_20260321_083047.csv
│   └── cleaned_周凯谈烘焙_20260321_083047.csv
├── audio/                          # audio files
│   └── 周凯谈烘焙/
│       ├── 7559900409483300105.m4a    (96 KB)
│       ├── 7491508890513886505.m4a    (584 KB)
│       └── 7446734179963866379.m4a    (1,775 KB)
├── subtitles/                      # transcripts
│   └── 周凯谈烘焙/
│       ├── 7559900409483300105.md
│       ├── 7491508890513886505.md
│       └── 7446734179963866379.md
└── models/                         # Whisper models
    └── small.pt
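
The layout above implies a fixed mapping from creator name and video ID to the audio and transcript files. Here is a small sketch of that mapping; `output_paths` is a hypothetical helper, not a function from the skill's scripts.

```python
from pathlib import Path


def output_paths(base_dir, creator, video_id):
    """Return the expected audio and transcript paths for one video."""
    base = Path(base_dir).expanduser()
    return {
        "audio": base / "audio" / creator / f"{video_id}.m4a",
        "subtitle": base / "subtitles" / creator / f"{video_id}.md",
    }
```

Comparing these two paths for each video ID is a quick way to spot audio files that have not been transcribed yet.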

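Progress report template

Report progress to the user after each step:

Step        Report contents
Collect     Creator name, number of videos collected, failure reason
Clean       Number of valid records, output file
Audio       Succeeded/total, number skipped
Transcribe  Number of subtitle files generated, output path
Done        Number of creators, videos, and subtitles; output directory

Example:

[Step 1/4 Collect] Creator "周凯谈烘焙": collection complete, 345 videos in total
[Step 2/4 Clean] 115 valid records → data/cleaned_周凯谈烘焙_20260321_083047.csv
[Step 3/4 Audio] Downloaded 3/3 → audio/周凯谈烘焙/
[Step 4/4 Subtitles] Generated 3 subtitle files → subtitles/周凯谈烘焙/
[Done] 1 creator · 3 videos · 3 subtitles, output directory: ~/DouyinContentTracker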
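
If the report lines are generated by a script rather than typed by hand, a tiny formatter keeps the bracketed prefix consistent. This is only a sketch: the function names and the English wording of the labels are assumptions, not taken from the skill.

```python
def step_report(step, total, label, message):
    """Format one progress line with a '[Step i/n Label]' prefix."""
    return f"[Step {step}/{total} {label}] {message}"


def audio_report(succeeded, attempted, skipped, out_dir):
    """Progress line for the audio step: successes, attempts, skips."""
    message = f"downloaded {succeeded}/{attempted}, skipped {skipped} -> {out_dir}"
    return step_report(3, 4, "Audio", message)
```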

Technical details

Pipeline

accounts.txt
    ↓
scrape_profile.py → MediaCrawler (CDP) → data/*.csv
    ↓
clean_data.py → cleaned_*.csv
    ↓
download_video.py → Playwright + ffmpeg → audio/{blogger}/*.m4a
                 ↘ (Playwright fails) yt-dlp ↗
    ↓
extract_subtitle.py → Whisper → subtitles/{blogger}/{video_id}.md

Two-path audio download strategy:

  • Primary path: Playwright opens the video page and intercepts the aweme API response to get the real URL → ffmpeg extracts the .m4a
  • Fallback path: when Playwright cannot get the URL, yt-dlp -x --audio-format m4a is called automatically; the output is also .m4a, so Whisper needs no changes
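
The two-path strategy comes down to trying one downloader and using the second only when the first yields nothing. Here is a minimal sketch of that control flow. `download_audio`, `primary`, and `fallback` are hypothetical names, and the real Playwright and yt-dlp calls are left out and passed in as plain callables.

```python
def download_audio(video_url, primary, fallback):
    """Try primary(video_url); on None or an exception, use fallback(video_url).

    Both callables are expected to return the path of the .m4a file,
    or None when no audio could be obtained.
    Returns (path, which) where which is "primary" or "fallback".
    """
    try:
        path = primary(video_url)
    except Exception:
        # Treat any primary-path error (e.g. a blocked request) as a miss.
        path = None
    if path is not None:
        return path, "primary"
    return fallback(video_url), "fallback"
```

Because both paths produce the same file type, the caller does not need to know which one succeeded, though returning the label is useful for the progress report.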

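Whisper model selection

Model   Size    Speed     Accuracy   Recommended for
tiny    75MB    fastest   fair       testing / quick preview
small   461MB   -         -          daily use (recommended)
medium  1.5GB   -         very good  high-accuracy needs
large   3GB     slowest   best       professional transcription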

Reference files

Load these when debugging or extending:

  • references/pipeline.md: script internals, data formats, key functions
  • references/troubleshooting.md: fixes for cookie, MediaCrawler, ffmpeg, Whisper, and data errors

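Changelog

2026-03-21 (second update)

  • ✅ Added yt-dlp as a fallback audio downloader (switches automatically when Playwright is blocked)
  • ✅ yt-dlp output stays in .m4a format, so the Whisper pipeline needs no changes
  • ✅ Added troubleshooting notes for --cookies-from-browser chrome

2026-03-21

  • ✅ Added the macOS WeChat cookie copy method
  • ✅ Added the OpenMP conflict workaround (KMP_DUPLICATE_LIB_OK=TRUE)
  • ✅ Recommended the small model (faster and more stable)
  • ✅ Added a one-step full pipeline example
  • ✅ Added an output directory structure example
  • ✅ Added a progress report template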
Security guidance
What to consider before installing/running this skill:
  • The skill will require you to clone and run a third-party project (NanmiCoder/MediaCrawler) and install Playwright, ffmpeg/imageio-ffmpeg, yt-dlp and openai-whisper; these are explicit but not reflected in the registry metadata, so expect manual setup.
  • The pipeline needs a valid Douyin cookie file (.douyin_cookies.json). SKILL.md suggests copying it from WeChat's local container; that points to sensitive local data. Prefer exporting cookies manually rather than letting scripts grab files automatically.
  • The scraper will pass the cookie string on the MediaCrawler command line and will load cookies into Playwright contexts. Passing secrets on a command line can expose them to other local users via process listings; if this is a concern, run the skill in an isolated environment (VM/container) or modify the code to read cookies from a protected file only.
  • The skill temporarily edits MediaCrawler's config file (config/base_config.py) to change fetch counts and then restores it. If you don't trust the cloned MediaCrawler repo, audit it first; the skill will write to files outside its own directory.
  • Run in an isolated environment (dedicated user account, container, or VM) if you plan to provide real cookies or clone repositories. Review the code (scrape_profile.py, download_video.py, extract_subtitle.py). Key behaviors: reading/writing .douyin_cookies.json, calling subprocesses (MediaCrawler, ffmpeg, yt-dlp), writing outputs under ~/DouyinContentTracker by default, and downloading Whisper models into models/.
  • If you decide to run it: set restrictive permissions on the cookie file (chmod 600), avoid running as root, and consider manually exporting cookies and setting MEDIACRAWLER_DIR in .env rather than copying from other app containers. If you want tighter safety, run the pipeline but skip the MediaCrawler step and feed sanitized CSVs instead.
  • Missing info that would increase confidence: an explicit manifest of required env vars/binaries in the registry metadata, a signed or verified MediaCrawler source, and a mode that avoids command-line cookie exposure (e.g., pass cookies via a file or stdin).
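One way to follow the "protected file" advice is to create the cookie file with owner-only permissions from the start, rather than writing it and running chmod afterwards. The sketch below is illustrative only: `write_cookie_file` is a hypothetical helper, and storing the cookies as a JSON list of objects is an assumption about the file format.

```python
import json
import os


def write_cookie_file(path, cookies):
    """Write cookies as JSON to a file readable and writable by the owner only."""
    # 0o600 applies when the file is created; on POSIX this avoids a window
    # in which the file exists with broader permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(cookies, handle)
    # Tighten the mode too in case the file already existed with a looser one.
    os.chmod(path, 0o600)
```

A consumer can then be pointed at the file path instead of receiving the cookie string as a command-line argument, which keeps it out of process listings.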
Capability assessment
Purpose & Capability
The code matches the stated purpose (scraping Douyin, extracting audio, running Whisper). However the registry metadata claims no required env vars/binaries, while the SKILL.md and scripts require: MEDIACRAWLER_DIR (in .env), a valid .douyin_cookies.json (or ability to scan a QR), Playwright/browser, ffmpeg (or imageio-ffmpeg), and optionally yt-dlp. The omission of these required inputs in the declared metadata is an inconsistency users should notice.
Instruction Scope
Runtime instructions and code do more than just call Douyin endpoints: they ask users to copy cookies from a local WeChat container path (SKILL.md suggests cp from ~/Library/Containers/.../xwechat_files), the scripts load and use that .douyin_cookies.json, and scrape_profile.py injects cookies into MediaCrawler via the command line. scrape_profile.py also temporarily writes to MediaCrawler's config file (config/base_config.py) to change CRAWLER_MAX_NOTES_COUNT — i.e., the skill modifies files in an external repo. These behaviors go beyond simple API integration and involve local sensitive data and changing third-party project files.
Install Mechanism
There is no formal install spec in the registry (instruction-only), but SKILL.md tells users to pip install requirements, pip/brew install yt-dlp, run 'python -m playwright install chromium', and to git clone https://github.com/NanmiCoder/MediaCrawler. Pulling and executing the upstream MediaCrawler code is expected for this skill, but it is an explicit external code dependency the user must fetch and run locally.
Credentials
The registry lists no required environment variables/credentials, yet the skill expects and uses .env values (MEDIACRAWLER_DIR, OUTPUT_BASE_DIR, WHISPER_MODEL) and a cookie file (.douyin_cookies.json). The skill's instructions advise copying cookies from a local WeChat container path (sensitive user data). Additionally, cookies are passed into a subprocess command line (MediaCrawler cmd includes --cookies '<cookie_str>') which can expose cookie content via process listings on some systems. These environment and credential demands are not proportionally declared.
Persistence & Privilege
The skill does not request 'always: true' and is not force-installed, but it writes persistent outputs under OUTPUT_BASE_DIR (data/, audio/, subtitles/, models/) — expected for this type of tool. The notable concern is that scrape_profile.py temporarily modifies MediaCrawler's config file (MEDIACRAWLER_DIR/config/base_config.py) and writes it back; modifying other software on disk is a privileged side-effect and should be acknowledged by users before running.
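The edit-then-restore pattern described here is safest when the restore runs even if the crawl fails. The sketch below shows one way to guarantee that with a context manager. It is not the code in scrape_profile.py: `temporary_setting` is a hypothetical helper, and it assumes the setting sits on a single `NAME = value` line.

```python
import re
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_setting(config_path, name, value):
    """Rewrite one 'NAME = value' line, then restore the file on exit."""
    path = Path(config_path)
    original = path.read_text(encoding="utf-8")
    pattern = re.compile(rf"^{re.escape(name)}\s*=.*$", flags=re.MULTILINE)
    # A function replacement avoids backslash processing in the new value.
    updated, count = pattern.subn(lambda _m: f"{name} = {value!r}", original)
    if count == 0:
        raise KeyError(f"{name} not found in {path}")
    path.write_text(updated, encoding="utf-8")
    try:
        yield
    finally:
        # Runs on success and on exceptions, so the original always comes back.
        path.write_text(original, encoding="utf-8")
```

Usage would look like `with temporary_setting(cfg, "CRAWLER_MAX_NOTES_COUNT", 3): ...`, with the crawl started inside the block. A hard kill of the process would still skip the restore, so keeping a backup copy of the config is a reasonable extra precaution.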
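How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install douyin-content-tracker
  3. After installation, call the skill by name or trigger it with /douyin-content-tracker
  4. Provide the inputs described in the skill's parameter notes to get structured output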
Version history
v1.0.0
Initial release of douyin-content-tracker.
  • Scrapes Douyin creator videos via MediaCrawler and downloads audio using Playwright + ffmpeg, with automatic fallback to yt-dlp if blocked.
  • Integrates Whisper for audio transcription, supporting quick and stable processing.
  • Includes setup instructions for dependency installation, environment configuration, and cookie management (WeChat, scan, or manual).
  • Provides step-by-step daily usage guide and detailed troubleshooting for common issues like cookie expiry, download failures, and Whisper/OpenMP errors.
  • Offers ready-to-use command examples, directory structure overview, and templated progress report output.
Metadata
Slug: douyin-content-tracker
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
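FAQ

What is Douyin Content Tracker Skill?

Scrapes Douyin creator videos, downloads audio (Playwright+ffmpeg with yt-dlp fallback), and transcribes with Whisper. Covers setup, daily tracking, cookie m... It is an AI agent skill plugin for Claude Code / OpenClaw, with 99 total downloads so far.

How do I install Douyin Content Tracker Skill?

Run "/install douyin-content-tracker" in the OpenClaw or Claude Code chat box to install the skill in one step. The dependencies described in SKILL.md (MediaCrawler, Playwright, Whisper, and a Douyin cookie) still have to be set up manually.

Is Douyin Content Tracker Skill free?

Yes. Douyin Content Tracker Skill is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Douyin Content Tracker Skill support?

Douyin Content Tracker Skill is listed as cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Douyin Content Tracker Skill?

It is developed and maintained by yibo (@gpttang); the current version is v1.0.0.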
