Description

This skill automates end-to-end Douyin topic research and report generation. Given a search keyword and a target video count, it handles QR-code login, batch...

README (SKILL.md)

Douyin Topic Research & Report Skill

Name: 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证 (Douyin search-video full analysis tool, with QR-code login and automatic image verification)
Author: hhofchina

Purpose

Automate the full pipeline: keyword → data collection → CAPTCHA bypass → enrichment → analysis → HTML report. The pipeline replicates a workflow that collected and analyzed 100 videos on the topic "女性成长" (women's growth).

Applicable Scenarios

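"Analyze Douyin videos about [keyword] for me: which factors earn a video more likes and shares?"
"Collect data on the latest N Douyin videos for [topic] and generate an analysis report"
"I want to study what makes a certain type of Douyin content go viral"
"Give me a visual data report for Douyin [keyword]"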

Default Parameters

Parameter	Default	Notes
`KEYWORD`	`女性成长`	Search keyword (URL-encoded automatically)
`TOTAL`	`100`	Total videos to collect
`DETAIL_LIMIT`	`50`	Max videos to visit detail pages
`COMMENTS_TOP`	`5`	Top comments per video
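
The defaults above can be gathered into one small configuration object. The following is a minimal sketch, not the skill's own code: the `RunConfig` class and its `search_url` method are hypothetical names, and the URL shape is taken from the search URL quoted in Step 3. It shows what "URL-encoded automatically" could mean in practice, using the standard library's `urllib.parse.quote`.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the documented default parameters."""

    keyword: str = "女性成长"   # KEYWORD
    total: int = 100            # TOTAL
    detail_limit: int = 50      # DETAIL_LIMIT
    comments_top: int = 5       # COMMENTS_TOP

    def search_url(self) -> str:
        # Percent-encode the keyword as UTF-8 so it is safe inside a URL path.
        encoded = quote(self.keyword, safe="")
        return f"https://www.douyin.com/search/{encoded}?type=video"


cfg = RunConfig()
print(cfg.search_url())
```

Freezing the dataclass keeps a run's parameters from being changed halfway through the pipeline, which makes the saved output files easier to attribute to one configuration.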

Full Pipeline (5 Steps)

Step 1 — Environment Setup

cd <work_dir>
python3 -m venv venv && source venv/bin/activate
pip install playwright pillow numpy scipy scikit-image openpyxl
playwright install chromium

Step 2 — Login & Save Session

Run scripts/douyin_login.py (or adapt inline). The script:

Launches Chromium (headless=False)
Navigates to https://www.douyin.com
Waits for user to scan QR code (polls document.cookie until login detected)
Saves cookies to douyin_session.json

Key anti-detection settings (always apply):

args=["--disable-blink-features=AutomationControlled", "--no-sandbox",
      "--window-size=1440,900"]
# Init script:
"Object.defineProperty(navigator,'webdriver',{get:()=>undefined});"

Step 3 — Batch Video Collection

See scripts/collect_videos.py. Core logic:

Intercept search/item API response (aweme_list field contains video data)
Navigate to https://www.douyin.com/search/{keyword}?type=video
For batch 2+: scroll down 8× with window.scrollBy(0, 600) then wait 4s
Extract fields: aweme_id, desc (title), statistics (likes/shares/collects/comments), author.uid, author.nickname, author.follower_count, video.duration, text_extra (tags)
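
The field extraction listed above can be sketched as a pure function over one entry of `aweme_list`. This is an illustration under stated assumptions, not the code in scripts/collect_videos.py: the key names inside `statistics` (`digg_count`, `share_count`, `collect_count`, `comment_count`) and the `hashtag_name` key inside `text_extra` are assumptions, as are the helper names `flatten_aweme` and `merge_batches`. De-duplicating by `aweme_id` across batches is also an assumption, on the reasoning that scrolling may return overlapping items.

```python
from typing import Any, Dict, Iterable, List


def flatten_aweme(item: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one aweme_list entry into a flat record, tolerating missing keys."""
    stats = item.get("statistics") or {}
    author = item.get("author") or {}
    video = item.get("video") or {}
    tags = [
        t["hashtag_name"]  # assumed key name for a tag entry
        for t in (item.get("text_extra") or [])
        if t.get("hashtag_name")
    ]
    return {
        "aweme_id": item.get("aweme_id"),
        "title": item.get("desc", ""),
        "likes": stats.get("digg_count", 0),      # assumed key names below
        "shares": stats.get("share_count", 0),
        "collects": stats.get("collect_count", 0),
        "comments": stats.get("comment_count", 0),
        "author_uid": author.get("uid"),
        "author_nickname": author.get("nickname"),
        "author_followers": author.get("follower_count", 0),
        "duration": video.get("duration"),
        "tags": tags,
    }


def merge_batches(
    batches: Iterable[List[Dict[str, Any]]], total: int
) -> List[Dict[str, Any]]:
    """Flatten several batches, skip repeated aweme_id values, stop at `total`."""
    seen = set()
    records: List[Dict[str, Any]] = []
    for batch in batches:
        for item in batch:
            rec = flatten_aweme(item)
            if rec["aweme_id"] is None or rec["aweme_id"] in seen:
                continue
            seen.add(rec["aweme_id"])
            records.append(rec)
            if len(records) >= total:
                return records
    return records
```

Keeping the flattening separate from the browser code means it can be checked against a saved `douyin_raw_data.json` without opening a browser.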

Step 4 — Detail Enrichment + CAPTCHA Solving

See scripts/parse_videos.py and scripts/captcha_solver.py.

CAPTCHA Solving Algorithm (proven, use exactly)

The algorithm is embedded in scripts/captcha_solver.py. Key findings from empirical testing:

Template matching is the primary method (most accurate, directly gives left edge of gap)
Sobel edge detection is secondary (detects right edge of gap → left peak of dual-peak = left edge)
Decision: if diff ≤ 25px → weighted average (70% template + 30% Sobel); else → use template only

# Element selectors (Douyin captcha iframe)
captcha_frame selector: frame.url contains "verifycenter" or "captcha"
bg_el  = frame.locator(".captcha-verify-image").first
sl_el  = frame.locator(".captcha-verify-image-slide").first
btn_el = frame.locator(".captcha-slider-btn").first

# Slide distance formula
gap_center_abs = bg_bb["x"] + gap_x + sl_bb["width"] / 2
btn_center_abs = btn_bb["x"] + btn_bb["width"] / 2
slide_distance = gap_center_abs - btn_center_abs

Human-like Slide Path (ease-out + overshoot)

def ease_out_cubic(t): return 1 - (1 - t) ** 3

# overshoot 3-7px, then pull back in final 15% of path
# Y-axis jitter ±2px, X-axis jitter ±1px during 5%-80%
# Timing: fast phase (frac<0.5) 5-8ms, mid 10-18ms, slow 25-45ms

Refresh captcha between retries

rb = frame.locator(".vc-captcha-refresh,.captcha-refresh,[class*='refresh']").first

Step 5 — Analysis & Report Generation

See scripts/analyze_factors.py and scripts/generate_report.py.

Analysis dimensions (all proven to have measurable effect):

Dimension	Key Finding
Duration	2-3 min is sweet spot (15× better than >5 min)
Tag count	1-2 tags >> 5+ tags (up to 6× difference)
Best tags	#自我成长 #个人成长 #认知 #女生必看
Follower (log-corr)	r=0.617, moderate positive
Title with `！`	+2× likes vs no exclamation
Title length	11-20 chars optimal
Emotion keywords	Love/marriage/mood words → higher shares
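
Two of the dimensions in the table, the follower log-correlation and the duration buckets, can be sketched with the standard library alone. This is a hedged illustration, not the logic of scripts/analyze_factors.py: the function names, the bucket edges, the choice of median as the summary statistic, and the choice to log-transform both follower counts and likes are all assumptions, and durations are assumed to be in seconds.

```python
import math
from bisect import bisect_right
from statistics import median

# Assumed bucket edges in seconds; each bucket includes its lower edge.
EDGES_S = [60, 120, 180, 300]
LABELS = ["<1 min", "1-2 min", "2-3 min", "3-5 min", "5+ min"]


def log_pearson(followers, likes):
    """Pearson r between log10(1 + followers) and log10(1 + likes).

    Raises ZeroDivisionError if either series has zero variance.
    """
    x = [math.log10(1 + f) for f in followers]
    y = [math.log10(1 + v) for v in likes]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def median_likes_by_duration(durations_s, likes):
    """Group likes by duration bucket and return the median of each bucket."""
    buckets = {}
    for d, v in zip(durations_s, likes):
        label = LABELS[bisect_right(EDGES_S, d)]
        buckets.setdefault(label, []).append(v)
    return {label: median(vals) for label, vals in buckets.items()}
```

The log transform is a common choice for follower and like counts because both are heavy-tailed; a handful of very large accounts would otherwise dominate a plain Pearson correlation. Medians are used for the same reason.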

Report output: douyin_analysis_report.html with 10 interactive Chart.js charts.

File Structure

work_dir/
├── douyin_session.json      # saved login cookies
├── douyin_raw_data.json     # raw collected videos
├── douyin_parsed.json       # enriched with detail data
├── analysis_result.json     # computed analysis metrics
├── douyin_report.xlsx       # Excel version
└── douyin_analysis_report.html  # final interactive HTML report
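
One way to keep every file in this layout under a single directory is to resolve all paths from one root, rather than relying on absolute paths inside individual scripts. A minimal sketch follows; `OUTPUT_NAMES` and `resolve_outputs` are hypothetical names, and only the file names come from the tree above.

```python
from pathlib import Path

# File names as listed in the documented layout; the dict keys are illustrative.
OUTPUT_NAMES = {
    "session": "douyin_session.json",
    "raw": "douyin_raw_data.json",
    "parsed": "douyin_parsed.json",
    "analysis": "analysis_result.json",
    "excel": "douyin_report.xlsx",
    "report": "douyin_analysis_report.html",
}


def resolve_outputs(work_dir):
    """Create work_dir if needed and return every output path beneath it."""
    root = Path(work_dir).expanduser().resolve()
    root.mkdir(parents=True, exist_ok=True)
    return {key: root / name for key, name in OUTPUT_NAMES.items()}
```

Each script can then accept the working directory as an argument and look up, for example, `resolve_outputs(work_dir)["raw"]`, so nothing is written outside the directory the user chose.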

Critical Notes

headless=False is required for CAPTCHA solving (screenshot-based)
Always mask the slider overlay in the background image before edge detection: bg_arr[:mask_h, :mask_w] = column_mean_fill
search_start = sl_w + 12 to skip the initial slider position area
Max retries for captcha: 5 attempts with captcha refresh between each
After captcha success, wait 3s before continuing
The douyin_session.json expires; re-login if 401/redirect to login page

Dependencies

playwright, pillow, numpy, scipy, scikit-image, openpyxl

Usage Guidance

This package implements a full Douyin scraping pipeline including automated CAPTCHA solving and anti-detection tricks. Before running/installing:

1) Understand legal/ToS risk: automated scraping and CAPTCHA circumvention can violate Douyin's terms and local law.
2) Inspect and edit hard-coded paths: captcha_solver.py has SESSION_FILE and SAVE_DIR set to /Users/hhao/..., change these to a safe working directory or remove absolute paths so files are written where you expect.
3) Treat output files as sensitive: douyin_session.json contains cookies/session tokens and debug screenshots may expose private data; store them in an isolated location and delete when done.
4) Run in an isolated environment (VM/container) and avoid using privileged accounts.
5) If you don't want automated CAPTCHA bypass on your machine, remove/disable the captcha-solver logic and handle verification manually.
6) If you need to share results, scrub session files first.

Overall the code is coherent with its stated purpose but contains operations (CAPTCHA bypass, anti-detection) and unexpected absolute paths that warrant caution.
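
The advice on sensitive output files can be sketched as two small helpers: one that restricts the session file to its owner, and one that removes it before results are shared. This is an illustration only; `lock_down` and `scrub` are hypothetical names, and only `douyin_session.json` is handled because the names of the debug screenshots are not given here. The permission change relies on POSIX file modes and has limited effect on Windows.

```python
import os
from pathlib import Path

SENSITIVE_FILES = ("douyin_session.json",)


def lock_down(work_dir, names=SENSITIVE_FILES):
    """Restrict each existing sensitive file to owner read/write (mode 0o600)."""
    locked = []
    for name in names:
        path = Path(work_dir) / name
        if path.is_file():
            os.chmod(path, 0o600)
            locked.append(path.name)
    return locked


def scrub(work_dir, names=SENSITIVE_FILES):
    """Delete each existing sensitive file and return the names removed."""
    removed = []
    for name in names:
        path = Path(work_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling `lock_down(work_dir)` right after login and `scrub(work_dir)` before archiving or sharing the directory covers points 3 and 6 without touching the report files.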

Capability Analysis

Type: OpenClaw Skill
Name: douyin-report-search
Version: 1.0.0

The skill bundle provides a sophisticated automation framework for scraping Douyin data, including automated QR-code login session management and a custom image-processing-based CAPTCHA solver (scripts/captcha_solver.py). While the stated purpose is market research and report generation, the inclusion of automated bypass mechanisms for platform security controls and the handling of session cookies (douyin_session.json) represent high-risk browser automation behaviors. No evidence of data exfiltration to external domains was found, but the capabilities for session hijacking and automated scraping are significant.

Capability Assessment

ℹ Purpose & Capability

The skill's name/description align with the code: Playwright-based QR login, API interception, detail enrichment, CAPTCHA solving, analysis and HTML report generation. Requiring browser automation, image-processing libs, and cookie storage is consistent with the stated purpose. Minor inconsistency: some scripts (captcha_solver.py) use absolute paths (/Users/hhao/...) rather than the working directory the rest of the tool expects, which is unexpected and should be corrected.

⚠ Instruction Scope

SKILL.md and scripts explicitly instruct anti-detection measures (remove navigator.webdriver, Chromium args), automated CAPTCHA bypass (screenshot, template matching, simulated human slide paths), intercepting API responses and saving session cookies and screenshots. Those behaviors go beyond benign 'data collection' (they actively bypass protection mechanisms) and also instruct saving potentially sensitive artifacts (cookies, debug screenshots). The instructions do not attempt to access unrelated system secrets, but the CAPTCHA bypass + anti-detection gives the skill broad sensitive capabilities and operational risk.

✓ Install Mechanism

No installer in registry; SKILL.md instructs creating a venv and pip-installing playwright and several image/analysis libs and running `playwright install chromium`. This is proportionate for a Playwright-based scraper that uses image processing (scikit-image, scipy, pillow). There is no remote arbitrary download/install step in the registry metadata.

⚠ Credentials

The skill declares no required env vars or credentials (reasonable). However, captcha_solver.py contains hard-coded absolute SESSION_FILE and SAVE_DIR paths pointing to the author's local filesystem (/Users/hhao/WorkBuddy/...), which is disproportionate and problematic: code may attempt to read/write files outside the user's working directory, save screenshots and cookies to unexpected locations, and reveal the developer's local path. The scripts also save session cookies (douyin_session.json) and debug images which are sensitive and should be placed under the user's explicit work_dir.

✓ Persistence & Privilege

Flags show always:false and no special platform privileges. The skill writes session and output files to disk (expected). It does not attempt to change other skills or system-wide agent settings. Autonomous invocation is allowed by default (normal); weigh that together with the CAPTCHA bypass capability when deciding whether to enable the skill.

Version History

v1.0.0

- Initial release: end-to-end automation for Douyin topic research and interactive report generation.
- Handles login via QR code, API interception for batch video data, and robust slide CAPTCHA solving.
- Enriches video data with engagement metrics (likes, shares, comments, etc.) and author follower counts.
- Performs multi-factor analysis (duration, tags, title style, emotion keywords) to identify viral video patterns.
- Outputs a detailed, interactive HTML analysis report with visual charts; also exports Excel results.
- Includes clear setup, file structure, default parameters, and troubleshooting notes for seamless use.

Metadata

Slug douyin-report-search

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证?

This skill automates end-to-end Douyin topic research and report generation. Given a search keyword and a target video count, it handles QR-code login, batch... It is an AI Agent Skill for Claude Code / OpenClaw, with 316 downloads so far.

How do I install 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证?

Run "/install douyin-report-search" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证 free?

Yes, 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证 is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证 support?

抖音搜索视频全量分析工具，支持扫码登录，自动图片验证 is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created 抖音搜索视频全量分析工具，支持扫码登录，自动图片验证?

It is built and maintained by hhofchina (@hhofchina); the current version is v1.0.0.
