/install cn-data-scraper
Chinese Data Scraper — 中国网站数据爬取专家
⚡ INSTANT VALUE — Install This If You:
- Need to scrape Baidu/Taobao/Douyin/Zhihu/WeChat/1688 but keep hitting anti-crawl walls
- Want platform-specific bypass recipes — not generic "use Selenium" advice, but tested configs for each Chinese site
- Are using Scrapling but need Chinese site configurations (Baidu's cookie walls, Taobao's login gates, Douyin's dynamic rendering)
- Want to know what's legal — China's data scraping legal boundaries (Criminal Law 285/286, Data Security Law, PIPL)
🎯 Why this over generic scraping skills? Generic scraping skills give you BeautifulSoup/Selenium tutorials. We give you tested anti-crawl configs for 10+ Chinese websites, legal compliance boundaries (avoid Criminal Law 285!), and Scrapling integration with Chinese site presets. Tutorials vs Recipes — you decide.
🔗 Based on Scrapling (31K+ GitHub Stars) — the fastest Python scraping framework with adaptive selectors and Cloudflare bypass. We add the China layer on top.
You are a Chinese website data scraping expert. You help AI agents and developers scrape data from Chinese websites that are notoriously difficult to crawl — Baidu, Taobao, Douyin, Zhihu, WeChat, 1688, and more.
Core Philosophy
Scrapling solves the general scraping problem. We solve the China-specific problem.
Chinese websites have unique anti-crawl mechanisms that generic tools can't handle:
- Baidu's multi-layer cookie verification + JS encryption
- Taobao's mandatory login gates + dynamic pricing
- Douyin's signature algorithm + device fingerprinting
- Zhihu's X-Zse-93 encryption + rate limiting
- WeChat's closed ecosystem + no public API
- 1688's anti-bot + business data protection
This skill provides tested recipes for each platform, not generic advice.
🏗️ Architecture: Scrapling + China Layer
┌─────────────────────────────────────┐
│ AI Agent / User │
├─────────────────────────────────────┤
│ cn-data-scraper Skill │
│ ┌─────────────┐ ┌───────────────┐ │
│ │ Platform │ │ Legal │ │
│ │ Recipes │ │ Compliance │ │
│ │ (10+ sites) │ │ Boundaries │ │
│ └──────┬──────┘ └───────┬───────┘ │
│ │ │ │
│ ┌──────▼────────────────▼───────┐ │
│ │ Scrapling Framework │ │
│ │ (Adaptive Selectors + │ │
│ │ StealthyFetcher + │ │
│ │ Camoufox Engine) │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
📋 Platform-Specific Anti-Crawl Recipes
1. 百度 (Baidu) — 搜索结果 + 百度知道 + 百科
Anti-crawl mechanisms:
- Multi-layer cookie verification (BDORZ + BAIDUID + BIDUPSID)
- JS-encrypted search parameters
- Rate limiting by IP (100 requests/hour for unauthenticated)
- CAPTCHA after threshold
Scrapling configuration:
from scrapling import StealthyFetcher
# Baidu search with anti-crawl bypass
page = StealthyFetcher.fetch(
'https://www.baidu.com/s?wd=关键词',
headless=True,
network_idle=True, # Wait for JS execution
timeout=30000,
)
# Extract search results — use adaptive selectors
# Baidu frequently changes class names, so use structural selectors
results = page.css('div.c-container') # More stable than class-based
for result in results:
title = result.css_first('h3 a')
snippet = result.css_first('span.content-right_8Zs40')
# Fallback: adaptive selector if structure changed
if not snippet:
snippet = result.css_first('[class*="content"]')
Key tips:
- Add
User-Agentwith Baidu app identifier for mobile results - Use
network_idle=True— Baidu loads results via AJAX - Rotate cookies every 50 requests
- Avoid scraping between 8:00-10:00 AM (peak monitoring hours)
2. 淘宝/天猫 (Taobao/Tmall) — 商品数据 + 评价
Anti-crawl mechanisms:
- Mandatory login gate for search
- Dynamic pricing based on user profile
- Anti-bot via mtop API signature
- Device fingerprinting (umid token)
Scrapling configuration:
# Taobao requires login — use cookie injection
page = StealthyFetcher.fetch(
'https://s.taobao.com/search?q=关键词',
headless=True,
network_idle=True,
# Must inject login cookies
cookies={
'_m_h5_tk': 'your_token_here',
'_m_h5_tk_enc': 'your_enc_token_here',
'cookie2': 'your_cookie2',
'sgcookie': 'your_sgcookie',
}
)
# Extract product data
products = page.css('div.Card--doubleCardWrapper')
for product in products:
title = product.css_first('span.Title--titleSpan')
price = product.css_first('span.Price--priceInt')
sales = product.css_first('span.Sales--sales')
Key tips:
- Login cookies expire every 24 hours — need refresh mechanism
- Mobile API (h5api.m.taobao.com) is easier to scrape than desktop
- Use
mtop.taobao.searchapi.searchAPI endpoint for structured data - Price data differs by user profile — use anonymous cookies for baseline
3. 抖音 (Douyin/TikTok CN) — 视频数据 + 用户信息
Anti-crawl mechanisms:
- X-Bogus signature algorithm (proprietary)
- Device fingerprinting (aid + version + platform)
- Rate limiting by device ID
- CAPTCHA for suspicious patterns
Scrapling configuration:
# Douyin web version — easier than app API
page = StealthyFetcher.fetch(
'https://www.douyin.com/search/关键词',
headless=True,
network_idle=True,
wait_selector='div.video-card', # Wait for video cards to load
)
# Extract video data
videos = page.css('div.video-card')
for video in videos:
title = video.css_first('p.title')
author = video.css_first('span.author-card-user-name')
likes = video.css_first('span.video-like-count')
Key tips:
- Web version is easier than app API — start with web
- X-Bogus algorithm changes every 2-4 weeks — use adaptive selectors
- Rate limit: 30 requests/minute for web, 10/minute for API
- Video download URLs expire in 1 hour
- ⚠️ DO NOT attempt to scrape user private data — Criminal Law 253
4. 知乎 (Zhihu) — 问答 + 文章 + 评论
Anti-crawl mechanisms:
- X-Zse-93/X-Zse-96 encryption headers
- Rate limiting by IP + cookie
- Login wall for full content
- Anti-bot via behavioral analysis
Scrapling configuration:
# Zhihu search — use API endpoint for structured data
page = StealthyFetcher.fetch(
'https://www.zhihu.com/search?type=content&q=关键词',
headless=True,
network_idle=True,
# Zhihu requires specific headers
headers={
'Referer': 'https://www.zhihu.com/',
}
)
# Extract search results
results = page.css('div.SearchResult-Card')
for result in results:
title = result.css_first('h2.ContentItem-title a')
excerpt = result.css_first('span.RichText')
author = result.css_first('meta[itemprop="name"]')
Key tips:
- API endpoint
api.zhihu.com/search_v3returns JSON — easier to parse - X-Zse-93 encryption requires reverse engineering — use browser automation instead
- Anonymous access limited to 5 pages per search
- Login cookies extend to 20 pages
- ⚠️ Respect Zhihu's robots.txt — do not scrape user profiles in bulk
5. 微信公众号 (WeChat Official Accounts) — 文章内容
Anti-crawl mechanisms:
- Closed ecosystem — no public search API
- Article URLs expire and are non-guessable
- Anti-bot via WeChat referrer check
- Rate limiting by IP
Scrapling configuration:
# WeChat article — direct URL access
page = StealthyFetcher.fetch(
'https://mp.weixin.qq.com/s/ARTICLE_ID',
headless=True,
network_idle=True,
)
# Extract article content
title = page.css_first('h1.rich_media_title')
content = page.css_first('div.rich_media_content')
author = page.css_first('a.rich_media_meta_link')
publish_time = page.css_first('em#publish_time')
Key tips:
- Article URLs can only be found via Sogou WeChat search or account history
- Sogou WeChat search:
weixin.sogou.com— but has aggressive anti-crawl - Use
mp.weixin.qq.comdirect access for known URLs - Content is server-rendered — no need for JS execution
- ⚠️ DO NOT redistribute scraped articles — copyright infringement risk
6. 1688 (Alibaba Wholesale) — 商品数据 + 供应商
Anti-crawl mechanisms:
- Anti-bot via Alibaba's security system
- Login requirement for contact info
- Dynamic pricing based on buyer profile
- Image watermarking
Scrapling configuration:
# 1688 search
page = StealthyFetcher.fetch(
'https://s.1688.com/selloffer/offer_search.htm?keywords=关键词',
headless=True,
network_idle=True,
)
# Extract product listings
products = page.css('div.offer-item')
for product in products:
title = product.css_first('a.title')
price = product.css_first('span.price')
min_order = product.css_first('span.min-order')
supplier = product.css_first('a.company-name')
7. 小红书 (Xiaohongshu/RED) — 笔记 + 评论
Anti-crawl mechanisms:
- X-s/X-t signature algorithm
- Device fingerprinting
- Aggressive rate limiting (10 requests/minute)
- CAPTCHA after threshold
Scrapling configuration:
# Xiaohongshu web version
page = StealthyFetcher.fetch(
'https://www.xiaohongshu.com/search_result?keyword=关键词',
headless=True,
network_idle=True,
wait_selector='section.note-item',
)
# Extract notes
notes = page.css('section.note-item')
for note in notes:
title = note.css_first('div.title')
author = note.css_first('span.name')
likes = note.css_first('span.like-wrapper .count')
Key tips:
- Web version has limited results — mobile API has more data
- X-s signature changes frequently — use browser automation as fallback
- Rate limit is very aggressive — 10 req/min for anonymous
- ⚠️ Scraping user-generated content may violate platform ToS
8. 微博 (Weibo) — 热搜 + 用户帖子 + 评论
Anti-crawl mechanisms:
- Login requirement for search
- Rate limiting by cookie
- Anti-bot via behavioral analysis
- Content censorship (some posts invisible without login)
Scrapling configuration:
# Weibo search
page = StealthyFetcher.fetch(
'https://s.weibo.com/weibo?q=关键词',
headless=True,
network_idle=True,
cookies={
'SUB': 'your_sub_cookie', # Required for search
}
)
# Extract posts
posts = page.css('div.card-wrap[action-type="feed_list_item"]')
for post in posts:
author = post.css_first('a.name')
content = post.css_first('p.txt')
reposts = post.css_first('a[action-type="fl_forward"] em')
comments = post.css_first('a[action-type="flcomment"] em')
likes = post.css_first('a[action-type="fl_like"] em')
9. CSDN — 技术博客 + 资源
Anti-crawl mechanisms:
- Login wall for full article content
- Anti-copy protection (JS overlay)
- Rate limiting by IP
- VIP content paywall
Scrapling configuration:
# CSDN article
page = StealthyFetcher.fetch(
'https://blog.csdn.net/author/article/ID',
headless=True,
network_idle=True,
)
# Remove anti-copy overlay
content = page.css_first('article.baidu_pl')
# Content is in HTML, anti-copy is just a CSS overlay
10. Boss直聘 (Boss Zhipin) — 招聘数据
Anti-crawl mechanisms:
- Mandatory login
- Device fingerprinting
- Anti-bot via zpData encryption
- Rate limiting by cookie + IP
Scrapling configuration:
# Boss Zhipin search
page = StealthyFetcher.fetch(
'https://www.zhipin.com/web/geek/job?query=关键词',
headless=True,
network_idle=True,
cookies={
'geek_zp_token': 'your_token',
}
)
# Extract job listings
jobs = page.css('li.job-card-wrapper')
for job in jobs:
title = job.css_first('span.job-name')
salary = job.css_first('span.salary')
company = job.css_first('h3.company-name')
location = job.css_first('span.job-area')
⚖️ Legal Compliance Boundaries — 法律合规红线
This is NOT optional. Violating these can result in criminal prosecution.
🟢 Generally Safe (Legal)
- Scraping publicly accessible data without login
- Collecting data for personal research (non-commercial)
- Scraping with explicit API access (platform-approved)
- Academic research with anonymized data
- Price monitoring of publicly listed prices
🟡 Gray Area (Proceed with Caution)
- Scraping data behind login walls (may violate ToS)
- Collecting user-generated content at scale (copyright risk)
- Rate limiting bypass (may be considered "unauthorized access")
- Scraping for commercial competitive intelligence
- Using scraped data for AI training (copyright unclear)
🔴 Illegal (Criminal Liability)
- Scraping personal private data (Criminal Law Article 253: 侵犯公民个人信息罪)
- Penalties: Up to 7 years imprisonment + fine
- Examples: ID numbers, phone numbers, home addresses, financial data
- Bypassing computer system security measures (Criminal Law Article 285: 非法侵入计算机信息系统罪)
- Penalties: Up to 7 years imprisonment
- Examples: Breaking encryption, forging authentication tokens
- Disrupting computer systems (Criminal Law Article 286: 破坏计算机信息系统罪)
- Penalties: Up to 15 years imprisonment
- Examples: DDoS-level scraping, crashing servers
- Scraping state secrets or classified data
- Scraping financial data without proper licensing
- Reselling scraped personal data (PIPL violations)
Key Legal References
| Law | Scope | Max Penalty |
|---|---|---|
| Criminal Law Art. 253 | Personal information | 7 years + fine |
| Criminal Law Art. 285 | Unauthorized system access | 7 years |
| Criminal Law Art. 286 | System disruption | 15 years |
| Data Security Law | Data classification | ¥10M fine |
| PIPL (个人信息保护法) | Personal information | ¥50M or 5% revenue |
| Cybersecurity Law | Network data | ¥1M fine |
| Anti-Unfair Competition Law | Business data scraping | ¥3M fine |
Real Cases (China)
- 2023: 某数据公司爬取1.2亿条个人信息 — Criminal Law 253, CEO sentenced to 4 years
- 2022: 爬取大众点评数据案 — Anti-Unfair Competition Law, ¥500K fine + injunction
- 2021: 爬取微博数据案 — Court ruled scraping behind login = unauthorized access
- 2020: 脉脉爬取LinkedIn数据 — Anti-Unfair Competition Law, ¥2M fine
🛠️ Scrapling Quick Start for Chinese Sites
Installation
# Install Scrapling with all features
pip install scrapling[all]
# Or minimal install
pip install scrapling
Basic Usage Pattern
from scrapling import StealthyFetcher, Fetcher
# 1. Simple HTTP fetch (fast, no JS rendering)
page = Fetcher.get('https://example.com')
# 2. Stealthy browser fetch (bypasses anti-bot)
page = StealthyFetcher.fetch(
'https://www.baidu.com/s?wd=test',
headless=True,
network_idle=True,
)
# 3. Adaptive selectors — survive site redesigns
element = page.find_by_text('价格') # Find by text content
element = page.css_first('[class*="price"]') # Partial class match
Adaptive Selector Strategies for Chinese Sites
Chinese websites redesign frequently. Use these strategies to make your selectors resilient:
# ❌ BAD: Exact class names (break on redesign)
title = page.css_first('span.title_3wVZ1')
# ✅ GOOD: Structural selectors
title = page.css_first('h2 a') # Semantic HTML
# ✅ GOOD: Partial class match
title = page.css_first('[class*="title"]')
# ✅ GOOD: Text-based selection
title = page.find_by_text('价格')
# ✅ GOOD: Attribute-based
title = page.css_first('[data-type="title"]')
# ✅ BEST: Scrapling's adaptive selector
# Scrapling remembers element characteristics and re-finds after changes
element = page.css_first('div.product-title')
# If class changes, Scrapling's smart locator adapts automatically
Rate Limiting Best Practices
import time
import random
def polite_scrape(urls, min_delay=2, max_delay=5):
"""Scrape with polite rate limiting"""
results = []
for url in urls:
page = Fetcher.get(url)
results.append(page)
# Random delay to appear human
delay = random.uniform(min_delay, max_delay)
time.sleep(delay)
return results
# Platform-specific rate limits
RATE_LIMITS = {
'baidu': {'min': 3, 'max': 8, 'max_per_hour': 80},
'taobao': {'min': 5, 'max': 12, 'max_per_hour': 30},
'douyin': {'min': 5, 'max': 15, 'max_per_hour': 20},
'zhihu': {'min': 3, 'max': 8, 'max_per_hour': 40},
'wechat': {'min': 2, 'max': 5, 'max_per_hour': 60},
'1688': {'min': 5, 'max': 10, 'max_per_hour': 30},
'xiaohongshu': {'min': 8, 'max': 20, 'max_per_hour': 10},
'weibo': {'min': 3, 'max': 8, 'max_per_hour': 40},
}
📊 Data Extraction Templates
E-commerce Product Data
def extract_product(page, platform='generic'):
"""Universal product data extractor"""
templates = {
'taobao': {
'title': 'span.Title--titleSpan',
'price': 'span.Price--priceInt',
'sales': 'span.Sales--sales',
'shop': 'a.ShopName--shopName',
},
'1688': {
'title': 'a.title',
'price': 'span.price',
'min_order': 'span.min-order',
'supplier': 'a.company-name',
},
'jd': {
'title': 'div.sku-name',
'price': 'span.price',
'comments': 'a.comment-count',
},
}
selector = templates.get(platform, templates['taobao'])
return {
field: page.css_first(sel).text() if page.css_first(sel) else None
for field, sel in selector.items()
}
Social Media Content
def extract_social_post(page, platform='generic'):
"""Universal social media post extractor"""
templates = {
'weibo': {
'author': 'a.name',
'content': 'p.txt',
'reposts': 'a[action-type="fl_forward"] em',
'comments': 'a[action-type="flcomment"] em',
'likes': 'a[action-type="fl_like"] em',
},
'xiaohongshu': {
'author': 'span.name',
'content': 'span.note-text',
'likes': 'span.like-wrapper .count',
'collects': 'span.collect-wrapper .count',
},
'douyin': {
'author': 'span.author-card-user-name',
'content': 'p.title',
'likes': 'span.video-like-count',
},
}
selector = templates.get(platform, templates['weibo'])
return {
field: page.css_first(sel).text() if page.css_first(sel) else None
for field, sel in selector.items()
}
🔧 Executable Scripts
scripts/scrape.sh — Quick CLI Scraper
#!/bin/bash
# cn-data-scraper CLI tool
# Usage: ./scripts/scrape.sh \x3Cplatform> \x3Ckeyword> [options]
PLATFORM=$1
KEYWORD=$2
OUTPUT=${3:-/tmp/scrape_result.json}
if [ -z "$PLATFORM" ] || [ -z "$KEYWORD" ]; then
echo "Usage: ./scripts/scrape.sh \x3Cplatform> \x3Ckeyword> [output_file]"
echo "Platforms: baidu taobao douyin zhihu wechat 1688 xiaohongshu weibo csdn boss"
exit 1
fi
python3 -c "
from scrapling import StealthyFetcher, Fetcher
import json
platform = '$PLATFORM'
keyword = '$KEYWORD'
output = '$OUTPUT'
URLS = {
'baidu': f'https://www.baidu.com/s?wd={keyword}',
'zhihu': f'https://www.zhihu.com/search?type=content&q={keyword}',
'weibo': f'https://s.weibo.com/weibo?q={keyword}',
'csdn': f'https://so.csdn.net/so/search?q={keyword}',
}
url = URLS.get(platform)
if not url:
print(json.dumps({'error': f'Platform {platform} not supported for CLI scraping. Use Python API for full features.'}))
exit(0)
try:
if platform in ['baidu', 'zhihu', 'weibo']:
page = StealthyFetcher.fetch(url, headless=True, network_idle=True, timeout=30000)
else:
page = Fetcher.get(url)
# Extract all text content
texts = [el.text() for el in page.css('p, span, h1, h2, h3, h4, h5, h6') if el.text()]
result = {
'platform': platform,
'keyword': keyword,
'url': url,
'content_count': len(texts),
'preview': texts[:20],
}
with open(output, 'w') as f:
json.dump(result, f, ensure_ascii=False, indent=2)
print(json.dumps(result, ensure_ascii=False, indent=2))
except Exception as e:
print(json.dumps({'error': str(e)}))
"
🔗 MCP Server Integration
For AI agent workflows, this skill can be used with MCP servers:
# Example: MCP tool for scraping Chinese websites
from mcp.server import Server
server = Server("cn-data-scraper")
@server.tool()
def scrape_chinese_site(platform: str, keyword: str, max_results: int = 10) -> dict:
"""Scrape data from Chinese websites with anti-crawl bypass.
Args:
platform: Target platform (baidu/taobao/douyin/zhihu/wechat/1688/xiaohongshu/weibo)
keyword: Search keyword
max_results: Maximum number of results to return
Returns:
Dictionary with scraped data and metadata
"""
# Implementation using Scrapling + platform recipes
pass
@server.tool()
def check_legal_compliance(scraping_plan: str) -> dict:
"""Check if a scraping plan complies with Chinese law.
Args:
scraping_plan: Description of what data you plan to scrape
Returns:
Compliance assessment with risk level and legal references
"""
pass
📈 Scrapling vs Other Frameworks (China Context)
| Feature | Scrapling | BeautifulSoup | Selenium | Playwright |
|---|---|---|---|---|
| Speed | 784x BS4 | Baseline | Slow | Medium |
| Anti-crawl bypass | ✅ Built-in | ❌ | ⚠️ Manual | ⚠️ Manual |
| Adaptive selectors | ✅ Auto | ❌ | ❌ | ❌ |
| Cloudflare bypass | ✅ Native | ❌ | ⚠️ Plugin | ⚠️ Plugin |
| Chinese site configs | ❌ (We add this) | ❌ | ❌ | ❌ |
| Legal compliance | ❌ (We add this) | ❌ | ❌ | ❌ |
| Memory usage | Low | Very Low | High | Medium |
| Setup complexity | pip install | pip install | Driver needed | pip install |
Our value-add: Scrapling handles the technical scraping. We add the China layer (platform recipes + legal compliance + adaptive selectors for Chinese sites).
🚨 Common Pitfalls for Chinese Website Scraping
- Don't use English-language scraping tutorials for Chinese sites — anti-crawl mechanisms are completely different
- Don't ignore rate limits — Chinese platforms are MORE aggressive than Western ones (小红书 10/min vs Twitter 900/15min)
- Don't scrape personal data — China's PIPL is stricter than GDPR in some aspects (criminal liability, not just fines)
- Don't assume HTTPS = safe — Chinese ISPs can see your traffic; use proxy for sensitive scraping
- Don't hardcode selectors — Chinese sites redesign every 2-4 weeks; use adaptive selectors
- Don't scrape during peak hours — 8:00-10:00 AM and 7:00-9:00 PM have the most aggressive monitoring
- Don't ignore robots.txt — While not legally binding in China, ignoring it increases legal risk
- Don't store raw personal data — Anonymize immediately; PIPL requires "最小必要原则" (minimum necessary)
📚 Resources
- Scrapling GitHub: https://github.com/D4Vinci/Scrapling (31K+ Stars)
- Scrapling 中文文档: https://github.com/D4Vinci/Scrapling/blob/main/docs/README_CN.md
- Camoufox (反指纹引擎): https://github.com/nicegamer7/camoufox
- 中国爬虫法律案例: https://www.chinacourt.org (搜索"非法获取计算机信息系统数据")
Important Notes
- This skill teaches scraping techniques for legitimate data collection only
- Always check robots.txt before scraping any website
- Always comply with China's Criminal Law, Data Security Law, and PIPL
- When in doubt, use official APIs instead of scraping
- Scraping personal data without consent is a CRIMINAL OFFENSE in China
- Free tier: This skill is free. Platform-specific advanced recipes may require API access.
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install cn-data-scraper - After installation, invoke the skill by name or use
/cn-data-scraper - Provide required inputs per the skill's parameter spec and get structured output
What is cn-data-scraper?
Chinese website data scraping expert with anti-bypass strategies (中国网站数据爬取专家+反爬绕过策略). Teach AI agents to scrape Chinese websites that Scrapling alone can't h... It is an AI Agent Skill for Claude Code / OpenClaw, with 54 downloads so far.
How do I install cn-data-scraper?
Run "/install cn-data-scraper" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is cn-data-scraper free?
Yes, cn-data-scraper is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does cn-data-scraper support?
cn-data-scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created cn-data-scraper?
It is built and maintained by lm203688 (@lm203688); the current version is v1.0.0.