Description

This skill should be used when the user wants to scrape Douyin (TikTok China) creator content, download audio, and transcribe it with Whisper. Covers first-t...

Usage Instructions (SKILL.md)

Douyin Content Tracker

Name: Douyin Content Tracker Skill
Author: gpttang

Scrapes Douyin creator videos via MediaCrawler, downloads audio with ffmpeg, and transcribes speech with Whisper.

Finding the Skill Base Directory

All commands must run from this skill's directory. To locate it, run:

python -c "import pathlib; print([p for p in pathlib.Path.home().rglob('douyin-content-tracker-skill/SKILL.md')])"

Or check common locations:

~/.claude/skills/douyin-content-tracker-skill/
The path shown when the skill was installed

Set it as a variable for convenience:

SKILL_DIR="$HOME/.claude/skills/douyin-content-tracker-skill"   # adjust to actual path; $HOME is used because ~ does not expand inside quotes
cd "$SKILL_DIR"

First-Time Setup

Run these steps once on a new machine.

1. Install Python dependencies

cd "$SKILL_DIR"
pip install -r scripts/requirements.txt
python -m playwright install chromium

2. Install MediaCrawler

# Windows
git clone https://github.com/NanmiCoder/MediaCrawler D:/MediaCrawler
cd D:/MediaCrawler && pip install -r requirements.txt

# macOS/Linux
git clone https://github.com/NanmiCoder/MediaCrawler ~/MediaCrawler
cd ~/MediaCrawler && pip install -r requirements.txt

3. Configure `.env`

cd "$SKILL_DIR"
cp .env.template .env

Edit .env. Required field:

MEDIACRAWLER_DIR=D:/MediaCrawler    # adjust to actual MediaCrawler path (use ~/MediaCrawler on macOS/Linux)

Optional overrides:

# Where to store data/audio/subtitles/models (default: ~/DouyinContentTracker or %USERPROFILE%\DouyinContentTracker)
OUTPUT_BASE_DIR=/Users/me/DouyinContentTracker

# Whisper model size (default: medium)
WHISPER_MODEL=small
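The required and optional settings above could be resolved roughly as follows. This is a minimal sketch, assuming a plain KEY=VALUE file with `#` comments; the helper names `parse_env` and `resolve_settings` are hypothetical and are not taken from the skill's scripts, which may load .env differently.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments (hypothetical helper)."""
    settings = {}
    for raw in text.splitlines():
        # Drop a trailing inline comment such as "VALUE    # note".
        line = raw.split(" #", 1)[0].strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def resolve_settings(env: dict) -> dict:
    """Apply the documented defaults; MEDIACRAWLER_DIR is the only required key."""
    if not env.get("MEDIACRAWLER_DIR"):
        raise ValueError("MEDIACRAWLER_DIR is required in .env")
    default_out = Path.home() / "DouyinContentTracker"
    return {
        "mediacrawler_dir": Path(env["MEDIACRAWLER_DIR"]).expanduser(),
        "output_base_dir": Path(env.get("OUTPUT_BASE_DIR") or default_out).expanduser(),
        "whisper_model": env.get("WHISPER_MODEL") or "medium",
    }
```

Calling `resolve_settings(parse_env(Path(".env").read_text()))` from the skill directory would fail fast with a clear error if MEDIACRAWLER_DIR is missing, instead of failing later inside the scraper.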

4. Add target accounts

Edit accounts.txt (or set TRACKER_ACCOUNTS_FILE / pass --accounts-file when running):

Creator name | https://www.douyin.com/user/MS4wLjABAAAA...
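A line in this format can be split on the first `|` into a display name and a profile URL. The parser below is only an illustrative sketch: `parse_accounts` is a hypothetical name, and skipping blank lines and `#` comment lines is an assumption rather than documented behavior of the skill.

```python
from typing import List, Tuple


def parse_accounts(text: str) -> List[Tuple[str, str]]:
    """Return (name, url) pairs from 'name | url' lines (hypothetical helper)."""
    accounts = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, url = line.partition("|")
        if not sep:
            raise ValueError(f"expected 'name | url', got: {line!r}")
        accounts.append((name.strip(), url.strip()))
    return accounts
```

Validating the file this way before a run surfaces malformed lines early, before any browser or crawler is started.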

5. First login (generates cookie)

cd "$SKILL_DIR"
python scripts/scrape_profile.py

A browser opens; scan the Douyin QR code to log in. The cookie is saved to .douyin_cookies.json.

Daily Usage

cd "$SKILL_DIR"

# Track latest 3 videos per account (default). main.py mirrors track_latest.py
python scripts/track_latest.py
# or
python scripts/main.py

# Track latest N videos
python scripts/track_latest.py --limit 5

# Use a custom account list (also works via env TRACKER_ACCOUNTS_FILE)
python scripts/track_latest.py --accounts-file /path/to/accounts.txt

# Skip audio download and transcription (data only)
python scripts/track_latest.py --no-audio

Cookie Refresh

When scraping returns 0 videos or warns "Cookie 已 N 天未更新" ("cookie has not been updated for N days"):

cd "$SKILL_DIR"
python scripts/scrape_profile.py    # opens browser, scan QR
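To check cookie staleness before a run rather than after a failed scrape, the age of .douyin_cookies.json can be estimated from its modification time. This is a sketch under assumptions: the function names are hypothetical, the 7-day threshold is an arbitrary placeholder, and the skill's own warning may be computed differently.

```python
import time
from pathlib import Path
from typing import Optional


def cookie_age_days(cookie_path: Path, now: Optional[float] = None) -> Optional[float]:
    """Days since the cookie file was last modified, or None if it is missing."""
    if not cookie_path.exists():
        return None
    now = time.time() if now is None else now
    return (now - cookie_path.stat().st_mtime) / 86400


def needs_refresh(cookie_path: Path, max_age_days: float = 7.0,
                  now: Optional[float] = None) -> bool:
    """True when the cookie is missing or older than max_age_days (placeholder threshold)."""
    age = cookie_age_days(cookie_path, now)
    return age is None or age > max_age_days
```

If `needs_refresh(Path(".douyin_cookies.json"))` returns True, rerunning scripts/scrape_profile.py as shown above regenerates the cookie.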

Pipeline Flow

accounts.txt (or the list pointed to by --accounts-file / TRACKER_ACCOUNTS_FILE)
    ↓
scripts/scrape_profile.py   → MediaCrawler (CDP) → OUTPUT_BASE_DIR/data/*.csv
    ↓
scripts/clean_data.py       → normalized OUTPUT_BASE_DIR/data/cleaned_*.csv
    ↓
scripts/download_video.py   → Playwright + ffmpeg → OUTPUT_BASE_DIR/audio/{blogger}/*.m4a
    ↓
scripts/extract_subtitle.py → Whisper → OUTPUT_BASE_DIR/subtitles/{blogger}/{video_id}.md

Output Locations

All generated files live under OUTPUT_BASE_DIR (defaults to ~/DouyinContentTracker on macOS/Linux, %USERPROFILE%\DouyinContentTracker on Windows).

Path (relative to OUTPUT_BASE_DIR)	Contents
`data/cleaned_*.csv`	Scraped + normalized video metadata
`audio/{blogger}/{video_id}.m4a`	Extracted audio
`subtitles/{blogger}/{video_id}.md`	Whisper transcript (title as first line)
`subtitles/{blogger}.md`	All transcripts for one blogger merged
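The layout in the table can be expressed as a small path helper, which is convenient when checking whether a given video has already been processed. This is an illustrative sketch; `output_paths` is a hypothetical name and not a function from the skill.

```python
from pathlib import Path


def output_paths(base: Path, blogger: str, video_id: str) -> dict:
    """Build the per-video and per-blogger output paths under OUTPUT_BASE_DIR."""
    return {
        "audio": base / "audio" / blogger / f"{video_id}.m4a",
        "subtitle": base / "subtitles" / blogger / f"{video_id}.md",
        "merged": base / "subtitles" / f"{blogger}.md",
    }
```

For example, `output_paths(Path.home() / "DouyinContentTracker", "some_blogger", "123")["subtitle"].exists()` would tell whether a transcript for that video is already on disk.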

Execution Logging Guide

When running the pipeline, report progress to the user after each step completes. Do not wait until the entire pipeline finishes.

Step-by-step reporting template:

After each Bash tool call returns, immediately tell the user:

Step	What to report
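Scrape	Creator name and number of videos scraped; if the step failed, the reason
Clean	Number of valid records after cleaning
Audio download	Number of audio files downloaded out of the total, and number skipped
Speech recognition (Whisper)	Number of subtitle files generated and the output path
Done	Summary: number of creators processed, videos, and subtitle files generated, plus the output directory path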

If a step fails, stop the pipeline, report the error output verbatim, and suggest the matching fix from references/troubleshooting.md before asking the user whether to continue.

Example output style:

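[Step 1/4 Scrape] Creator "Example": scraping complete, 10 videos
[Step 2/4 Clean] 10 valid records → data/cleaned_profile_xxx.csv
[Step 3/4 Audio] Downloaded 8/10 (2 had no audio stream, skipped)
[Step 4/4 Subtitles] Generated 8 subtitle files → subtitles/Example/
[Done] 1 creator · 10 videos · 8 subtitle files, output directory: ~/DouyinContentTracker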
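Progress lines in this bracketed style can be produced with two small formatters so that every step is reported consistently. This is only a sketch; `format_step` and `format_summary` are hypothetical helpers, and the exact wording of each message is left to the caller.

```python
def format_step(index: int, total: int, label: str, message: str) -> str:
    """Format one progress line, e.g. '[Step 1/4 Scrape] ...'."""
    return f"[Step {index}/{total} {label}] {message}"


def format_summary(creators: int, videos: int, subtitles: int, output_dir: str) -> str:
    """Format the final summary line with totals and the output directory."""
    return (f"[Done] {creators} creator(s) · {videos} video(s) · "
            f"{subtitles} subtitle file(s), output directory: {output_dir}")
```

Emitting one `format_step` line right after each script returns keeps the user informed without waiting for the whole pipeline to finish.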

References

Load these files into context when debugging or extending the pipeline:

references/pipeline.md — per-script technical breakdown, data schemas, key function signatures
references/troubleshooting.md — fixes for cookie, MediaCrawler, ffmpeg, Whisper, and data errors

Safe Usage Recommendations

This package largely does what it says (scrapes Douyin, extracts audio, runs Whisper), but take these precautions before installing or running it:

- Confirm and set MEDIACRAWLER_DIR in the skill's .env (the registry metadata omits this required variable). The skill will not work without it.
- Back up your MediaCrawler installation before running: the scripts temporarily overwrite MediaCrawler/config/base_config.py to change the fetch count and then restore it. Ensure MEDIACRAWLER_DIR points to an isolated copy that you control.
- Be aware of how cookies are handled: the pipeline stores Douyin session cookies in a local .douyin_cookies.json and passes them to MediaCrawler as a command-line argument. Secrets passed on a command line can be visible in process listings to other users on the same machine, so avoid running this on multi-user or shared systems. After use, consider deleting or rotating the cookie.
- Expect large downloads and disk usage: Playwright browser binaries and Whisper model weights (the medium model is gigabyte-scale) will be downloaded, so make sure you have the bandwidth and storage.
- Review accounts.txt and .env for unintended targets or output directories, and set OUTPUT_BASE_DIR to an isolated folder you control.
- Confirm legal and ToS considerations: automated scraping may violate Douyin/TikTok terms of service or local law, so ensure you have the right to scrape the targeted accounts.

To proceed more safely, run the pipeline in an isolated VM, container, or dedicated user account, inspect the code (especially run_mediacrawler and set_mediacrawler_max_count), and avoid shared systems where process listings or filesystem writes could leak credentials or affect other software.
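The backup advice for base_config.py can be applied with a small context manager that snapshots a file and restores it even if the wrapped step raises. This is a sketch of the general backup-and-restore pattern, not the skill's own run_mediacrawler or set_mediacrawler_max_count logic; `preserved` is a hypothetical name, and the `.bak` suffix is an arbitrary choice.

```python
import os
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def preserved(path: Path):
    """Back up `path`, yield it, then restore the original contents on exit."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # snapshot including file metadata
    try:
        yield path
    finally:
        # os.replace overwrites the destination atomically where the OS allows it.
        os.replace(backup, path)
```

Wrapping a pipeline run in `with preserved(Path(mediacrawler_dir) / "config" / "base_config.py"):` (where `mediacrawler_dir` is your own MEDIACRAWLER_DIR value) gives a second safety net in case the scripts are interrupted before they restore the file themselves.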

Functional Analysis

Type: OpenClaw Skill
Name: douyin-content-tracker-skill
Version: 1.0.0

The douyin-content-tracker-skill is a legitimate tool designed to scrape Douyin video metadata, extract audio using ffmpeg, and generate transcriptions via OpenAI's Whisper model. The code logic is well-structured and aligns with the stated purpose, utilizing standard libraries like Playwright, pandas, and subprocess for orchestration. While it performs potentially risky actions such as modifying the MediaCrawler configuration file (scrape_profile.py) and searching the home directory for its installation path (SKILL.md), these behaviors are documented, targeted, and functionally necessary for the pipeline's operation. No evidence of data exfiltration, malicious persistence, or prompt injection was found.

Capability Assessment

ℹ Purpose & Capability

The skill's name/description (scrape Douyin, extract audio, transcribe with Whisper) matches the included scripts and pipeline. It legitimately needs Playwright, ffmpeg access, MediaCrawler, and Whisper model downloads. However, the registry metadata declares no required environment variables while the SKILL.md and code require/expect a MEDIACRAWLER_DIR value in .env — an inconsistency between claimed requirements and actual needs.

⚠ Instruction Scope

SKILL.md instructs the agent/user to clone and run MediaCrawler, run Playwright browser installs, create/modify a .env, run scripts that open a browser for QR login (producing a .douyin_cookies.json), and run local pipeline scripts that read/write many files. The scripts also modify an external project's config file (MediaCrawler config/base_config.py) temporarily to set the fetch count. The pipeline passes the user's Douyin cookies into MediaCrawler via a command-line argument. Those actions go beyond simple 'read-only' scraping guidance and introduce potential exposure (see Credentials below).

ℹ Install Mechanism

There is no automated install spec in the registry (instruction-only), which is lower automatic risk. The instructions ask the user to git clone a public GitHub repo (NanmiCoder/MediaCrawler) and to pip-install dependencies and Playwright's browser binaries. Using an official GitHub repo is typical; the user-run clone means no arbitrary binary downloads are silently executed by the platform. Still, installing Playwright and downloading Whisper model weights are large actions the user should expect.

⚠ Credentials

The skill's registry metadata lists no required credentials/env vars, yet the code expects and uses values from .env (MEDIACRAWLER_DIR, optional OUTPUT_BASE_DIR, WHISPER_MODEL). The pipeline relies on a local cookie file (.douyin_cookies.json) and passes the cookie string into MediaCrawler on the subprocess command line (--cookies <cookie_str>), which can expose session cookies via process listings to other local users. The code also writes into the external MediaCrawler repo (overwriting base_config.py and restoring it), which requires write access to user-specified paths and could have side effects if the path points to non-isolated locations.

ℹ Persistence & Privilege

The skill is not marked 'always: true' and does not auto-enable itself across agents. It writes run-state and output files under OUTPUT_BASE_DIR and will create/copy ffmpeg executables in library cache folders on Windows. The main privileged behavior is modifying the MediaCrawler base_config.py (temporary patch/restore) in a user-specified directory — this requires filesystem write permission to that installation and is outside the skill's own directory.

Version History

v1.0.0

Add step-by-step execution logging guide

Metadata

Slug douyin-content-tracker-skill

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ