Description

High-performance containerized web scraper (Docker + Crawlee + Playwright). Use when user mentions any of these: 爬虫, 爬取, 抓取, 采集, 数据采集, 爬数据, 抓数据, 获取数据, scrape...

Usage (SKILL.md)

Deep Scraper

Name: Deep Scraper + Amazon
Author: jiafar

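A Docker-containerized scraper that supports bypassing anti-scraping measures (穿透反爬) and automatically selects among three modes.

Prerequisites

Docker is installed and running
The image has been built: docker build -t clawd-crawlee skills/deep-scraper/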

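Mode selection rules

1. Amazon mode (`amazon_handler.js`)

Auto-trigger: the URL contains amazon.com, or the user mentions keywords such as 亚马逊 (Amazon), Amazon, ASIN, BSR, 选品 (product selection), 竞品 (competing products), 畅销榜 (Best Sellers list), or 类目分析 (category analysis)

The page type is detected automatically from the URL: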

| URL pattern | Page type | Available fields |
| --- | --- | --- |
| `/zgbs/` or `/bestsellers/` | Best Sellers | rank, title, asin, price, rating, reviews, image, url |
| `/zg/new-releases/` | New Releases | same as Best Sellers |
| `/zg/movers-and-shakers/` | Movers & Shakers | same as Best Sellers |
| `/s?k=` or `/s/` | Search results | title, asin, price, rating, reviews, image, url, boughtPastMonth, sponsored |
| `/dp/` or `/gp/product/` | Product detail | title, asin, price, rating, reviews, brand, bsr, boughtPastMonth, dateFirstAvailable, category, bullets, details, image |
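
The URL-to-page-type mapping above can be sketched as a small classifier. This is only an illustration: `classifyAmazonUrl` is a hypothetical name, the labels `new-releases` and `movers-and-shakers` are assumptions (the documented output `type` lists only `bestsellers`, `search`, and `product-detail`), and the real `amazon_handler.js` may detect page types differently.

```javascript
// Hypothetical helper, not taken from amazon_handler.js.
// Maps an Amazon URL to one of the page types listed in the table.
function classifyAmazonUrl(rawUrl) {
  const { pathname } = new URL(rawUrl);

  // Check the more specific /zg/ ranking pages first.
  if (pathname.includes('/zg/new-releases/')) return 'new-releases';
  if (pathname.includes('/zg/movers-and-shakers/')) return 'movers-and-shakers';
  if (pathname.includes('/zgbs/') || pathname.includes('/bestsellers/')) return 'bestsellers';

  // Product pages may carry a title slug before /dp/, so match anywhere in the path.
  if (pathname.includes('/dp/') || pathname.includes('/gp/product/')) return 'product-detail';

  // Search pages: /s?k=... or /s/...
  if (pathname === '/s' || pathname.startsWith('/s/')) return 'search';

  return 'unknown';
}
```

For the three example URLs used in this document, the sketch yields `bestsellers`, `search`, and `product-detail` respectively.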

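⚠️ Important rules:

Best Sellers pages have no monthly sales (boughtPastMonth) data: Amazon does not show this information on ranking pages
To get monthly sales, you must use a search page (/s?k=<keyword>) or a product detail page (/dp/<ASIN>)
If the user needs both rank and monthly sales, the suggested approach is to scrape Best Sellers for the rank first, then fill in monthly sales from a search page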
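
The two-step approach in the last rule (rank from Best Sellers, monthly sales from search) amounts to a join on `asin`. A minimal sketch, assuming both inputs are the `products` arrays of two separate runs; `mergeRankWithMonthlySales` is a hypothetical name and is not part of the skill:

```javascript
// Hypothetical post-processing step, not part of the skill's handlers.
// bestsellerProducts: products from a /zgbs/ run (has rank, no boughtPastMonth)
// searchProducts:     products from a /s?k= run (has boughtPastMonth, no rank)
function mergeRankWithMonthlySales(bestsellerProducts, searchProducts) {
  // Index the search results by ASIN for constant-time lookup.
  const salesByAsin = new Map(
    searchProducts.map((p) => [p.asin, p.boughtPastMonth ?? null])
  );

  // Keep every ranked product; monthly sales stay null when the ASIN
  // did not appear in the search results.
  return bestsellerProducts.map((p) => ({
    ...p,
    boughtPastMonth: salesByAsin.get(p.asin) ?? null,
  }));
}
```

Products that appear in the ranking but not in the search results keep `boughtPastMonth: null`, so the gap stays visible instead of being silently dropped.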

# Best Sellers (has rank, no monthly sales)
docker run -t --rm clawd-crawlee node assets/amazon_handler.js "https://www.amazon.com/zgbs/electronics"

# Search results (has monthly sales, no rank)
docker run -t --rm clawd-crawlee node assets/amazon_handler.js "https://www.amazon.com/s?k=feather+duster"

# Product detail (most complete fields: BSR, brand, bullet points, monthly sales)
docker run -t --rm clawd-crawlee node assets/amazon_handler.js "https://www.amazon.com/dp/B001TQ6IHS"

# Multi-page scraping
docker run -t --rm clawd-crawlee node assets/amazon_handler.js "URL" --pages 2

Output format: JSON

{
  "status": "SUCCESS",
  "type": "bestsellers|search|product-detail",
  "category": "category name",
  "totalProducts": 30,
  "scrapedAt": "ISO timestamp",
  "products": [
    {
      "rank": 1,
      "title": "product title",
      "asin": "B001TQ6IHS",
      "price": 9.94,
      "priceStr": "$9.94",
      "rating": 4.6,
      "reviews": 20547,
      "boughtPastMonth": "1K+",
      "image": "https://...",
      "url": "https://..."
    }
  ]
}
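
A consumer of this JSON might check `status` and turn `boughtPastMonth` into a number. The sketch below is an assumption-laden illustration: `parseBoughtPastMonth` and `summarizeResult` are hypothetical names, and the only value shape confirmed by the sample above is `"1K+"`; the handling of plain numbers and an `M` suffix is a guess.

```javascript
// Hypothetical consumer code, not part of the skill.
// Assumption: values look like "1K+" or "500+"; anything else returns null.
function parseBoughtPastMonth(value) {
  if (typeof value !== 'string') return null;
  const match = /^(\d+(?:\.\d+)?)\s*([KkMm])?\+?/.exec(value.trim());
  if (!match) return null;
  const multiplier = { k: 1e3, m: 1e6 }[(match[2] || '').toLowerCase()] || 1;
  // "1K+" means "at least 1000", so the result is a lower bound.
  return Math.round(Number(match[1]) * multiplier);
}

// Parses the handler's stdout and keeps a few fields per product.
function summarizeResult(jsonText) {
  const result = JSON.parse(jsonText);
  if (result.status !== 'SUCCESS' || !Array.isArray(result.products)) {
    throw new Error(`Unexpected scraper output: status=${result.status}`);
  }
  return result.products.map((p) => ({
    asin: p.asin,
    price: p.price,
    minMonthlySales: parseBoughtPastMonth(p.boughtPastMonth),
  }));
}
```

Because Best Sellers results carry no `boughtPastMonth`, `minMonthlySales` is `null` for those products rather than `0`.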

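2. YouTube mode (`main_handler.js`)

Auto-trigger: the URL contains youtube.com, or the user mentions YouTube, 视频字幕 (video subtitles), 转录 (transcription), or transcript

Intercepts network requests to capture the subtitle API (timedtext)
Simulates clicks on the "expand description" and "transcript" buttons
Output: {status, type:"TRANSCRIPT"|"DESCRIPTION", videoId, data}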

docker run -t --rm clawd-crawlee node assets/main_handler.js "https://youtube.com/watch?v=xxx"

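3. Generic mode (`main_handler.js`)

Trigger: any URL that is neither Amazon nor YouTube, or the user asks to scrape (爬取/抓取) an arbitrary web page or social media

Playwright opens the page and waits for JavaScript to finish loading
Extracts document.body.innerText (plain text, with ad noise removed)
Output is capped at 10,000 characters
Output: {status:"SUCCESS", type:"GENERIC", title, data}

docker run -t --rm clawd-crawlee node assets/main_handler.js "https://<any-url>"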

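Agent invocation decision tree

Did the user provide a URL?
├─ Contains amazon.com → use amazon_handler.js
│   ├─ Monthly sales needed? → suggest a search URL (/s?k=) or a detail page (/dp/)
│   └─ Rank needed? → use a Best Sellers URL (/zgbs/)
├─ Contains youtube.com → use main_handler.js (YouTube mode is selected automatically)
└─ Any other site → use main_handler.js (generic mode)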
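
The URL branch of the tree above can be sketched as a handler selector. `selectHandler` is a hypothetical name and the `mode` labels are illustrative only. Note one deliberate deviation: the tree says "contains amazon.com", while this sketch matches on the hostname, which is an assumption chosen so that a URL such as `https://example.com/?ref=amazon.com` is not misrouted.

```javascript
// Hypothetical dispatcher for the agent side, not part of the skill.
function selectHandler(rawUrl) {
  const host = new URL(rawUrl).hostname.toLowerCase();
  // Match the domain itself or any subdomain (www.amazon.com, m.youtube.com, ...).
  const matches = (domain) => host === domain || host.endsWith(`.${domain}`);

  if (matches('amazon.com')) {
    return { script: 'assets/amazon_handler.js', mode: 'amazon' };
  }
  if (matches('youtube.com')) {
    // Same script as generic mode; per the tree, YouTube mode is picked automatically.
    return { script: 'assets/main_handler.js', mode: 'youtube' };
  }
  return { script: 'assets/main_handler.js', mode: 'generic' };
}
```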

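The user gave no URL, only a request?
├─ "Scrape the Amazon top list for category XX" / "XX category ranking" / "XX Best Sellers" → construct https://www.amazon.com/zgbs/<category>
├─ "Search Amazon for XX" / "keyword search for XX" / "find XX products" → construct https://www.amazon.com/s?k=<keyword>
├─ "Analyze a given ASIN" / "take a look at this product" / "details of XX" → construct https://www.amazon.com/dp/<ASIN>
├─ "Monthly sales of XX" / "how many XX were sold" / "how is XX selling" → use a search page or a detail page (both have boughtPastMonth)
├─ "Competitor analysis" / "competitor research" / "what are competitors selling" → search first, then scrape each detail page
├─ "Product selection" / "what sells well" / "category opportunities" / "market research" → combine Best Sellers with search
└─ Anything else → first find the URL with web_search, then scrape it with the matching mode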
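
When the agent constructs URLs from free text as in the tree above, the user-derived part should be encoded rather than concatenated raw. A minimal sketch with hypothetical helper names; the 10-character alphanumeric ASIN check is an assumption based on the examples in this document, and how a category name maps to a `/zgbs/` path segment is not specified by the source.

```javascript
// Hypothetical URL builders, not part of the skill.
const AMAZON_ORIGIN = 'https://www.amazon.com';

// /s?k=<keyword>: URLSearchParams handles the encoding (spaces become "+").
function buildSearchUrl(keyword) {
  const url = new URL('/s', AMAZON_ORIGIN);
  url.searchParams.set('k', keyword);
  return url.toString();
}

// /dp/<ASIN>: reject anything that does not look like an ASIN (assumed format).
function buildProductUrl(asin) {
  if (!/^[A-Z0-9]{10}$/.test(asin)) {
    throw new Error(`Not a 10-character ASIN: ${asin}`);
  }
  return `${AMAZON_ORIGIN}/dp/${asin}`;
}

// /zgbs/<category>: the caller must already know the category path segment.
function buildBestSellersUrl(categorySlug) {
  return `${AMAZON_ORIGIN}/zgbs/${encodeURIComponent(categorySlug)}`;
}
```

With these helpers, `buildSearchUrl("feather duster")` produces the same search URL used in the docker example earlier in this document.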

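Common user intents → action mapping

| User says | Action |
| --- | --- |
| "Take a look at Amazon category XX for me" | Scrape the /zgbs/<category> Best Sellers list |
| "How is XX selling on Amazon" | Search /s?k=XX and check monthly sales |
| "Analyze this ASIN: BXXXXXXXXX" | Scrape the /dp/<ASIN> detail page |
| "What opportunities are there in category XX" | Combined analysis of Best Sellers and search |
| "Scrape this link for me" | Determine the URL type and pick the matching handler |
| "What is this YouTube video about" | YouTube mode, capture the subtitles |
| "Grab the content of site XX for me" | Generic mode |
| "Search for competitors of XX" | Scrape the search page, then analyze |
| "What are XX's monthly sales" / "How many XX sell per month" | Search page or detail page |
| "Show me the top 100" / "popular products" | Best Sellers list |
| "What new products are there" / "what launched recently" | /zg/new-releases/ |
| "Which products are rising fast" / "Movers & Shakers" | /zg/movers-and-shakers/ |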

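Anti-scraping capabilities

Clears cookies on every run to simulate a brand-new user
Docker sandbox isolation, no fingerprint tracking
Playwright simulates real browser behavior
Auto-scrolls to load lazy-loaded content
Supports retries (maxRetries: 2)

Limitations

Generic mode output is capped at 10,000 characters
Amazon returns at most about 30-50 products per page
Pages that require login are not supported
Docker containers have a cold-start time of about 10 seconds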

Safe usage recommendations

Key things to check before installing or using:
- Verify there is a Dockerfile and inspect it before running docker build; do not build unreviewed Dockerfiles on production hosts.
- Expect Playwright to download browser binaries during install/run (large network activity and disk use). Run in an isolated environment if possible.
- The agent can autonomously construct and crawl URLs from user text. This may cause unintended scraping of many pages or of content you don't intend to fetch. Consider restricting invocation or asking users for explicit URLs.
- The skill overstates anti-bot capabilities; the code uses basic evasion (cookie clearing, simulated clicks) but not advanced fingerprinting protections. Don't rely on it to evade protections legally or ethically: scraping may violate sites' terms of service or laws.
- There are no obvious network exfiltration endpoints in the code, but always review for hidden remote uploads before running untrusted code.
- If you want to proceed: run builds and scrapes in a locked sandbox, review and, if needed, harden the code (rate limiting, domain allowlist, logging controls), and ensure compliance with target sites' policies and applicable law.
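
One of the hardening steps named above, a domain allowlist, could look roughly like the following pre-flight check on the agent side. Everything here is hypothetical: `checkScrapeRequest`, the allowlisted domains, and the page cap are illustrative policy values, not settings the skill provides.

```javascript
// Hypothetical pre-flight policy check, not part of the skill.
const ALLOWED_DOMAINS = ['amazon.com', 'youtube.com']; // assumed policy
const MAX_PAGES = 3;                                   // assumed cap for --pages

function checkScrapeRequest(rawUrl, pages = 1) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: 'invalid URL' };
  }
  if (url.protocol !== 'https:') {
    return { ok: false, reason: 'only https is allowed' };
  }
  const host = url.hostname.toLowerCase();
  const allowed = ALLOWED_DOMAINS.some((d) => host === d || host.endsWith(`.${d}`));
  if (!allowed) {
    return { ok: false, reason: `domain not allowlisted: ${host}` };
  }
  if (!Number.isInteger(pages) || pages < 1 || pages > MAX_PAGES) {
    return { ok: false, reason: `pages must be an integer between 1 and ${MAX_PAGES}` };
  }
  return { ok: true, reason: null };
}
```

Returning a reason instead of throwing lets the agent explain to the user why a request was refused and ask for an explicit URL.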

Functional analysis

Type: OpenClaw Skill
Name: deep-scraper-amazon
Version: 1.0.0

The skill is classified as suspicious due to a potential prompt injection vulnerability in SKILL.md. The instructions guide the AI agent to construct URLs from user input (e.g., for Amazon categories or search keywords) and then execute them via `docker run`. If the agent does not properly quote or escape the user-derived parts of the URL when forming the `docker run` command, it could lead to shell injection on the host system executing the Docker command. While the JavaScript code itself is benign and focused on web scraping, this vulnerability in the agent's interaction with the skill's instructions poses a significant risk.
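
One common way to avoid the quoting problem described above is to never build a shell string at all and instead pass the command as an argument array. The sketch below assumes a Node.js caller; `buildDockerArgs` is a hypothetical helper, and the flags and script paths are copied from the docker examples in SKILL.md.

```javascript
// Hypothetical helper, not part of the skill: builds argv for `docker run`.
const KNOWN_HANDLERS = ['assets/amazon_handler.js', 'assets/main_handler.js'];

function buildDockerArgs(script, url, pages) {
  if (!KNOWN_HANDLERS.includes(script)) {
    throw new Error(`Unknown handler script: ${script}`);
  }
  const parsed = new URL(url); // throws on malformed input
  if (!['http:', 'https:'].includes(parsed.protocol)) {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  // The URL is a single argv element, so shell metacharacters in it are
  // never interpreted, provided no shell is involved in the launch.
  const args = ['run', '-t', '--rm', 'clawd-crawlee', 'node', script, parsed.toString()];
  if (pages !== undefined) {
    if (!Number.isInteger(pages) || pages < 1) {
      throw new Error(`Invalid page count: ${pages}`);
    }
    args.push('--pages', String(pages));
  }
  return args;
}
```

The resulting array can be handed to `child_process.execFile('docker', args, callback)`, which by default starts the process without a shell. Whether the agent runtime actually launches commands this way is outside what the source describes.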

Capability assessment

ℹ Purpose & Capability

The name, description, SKILL.md, and code align: handlers for Amazon, YouTube, and generic pages exist and the package.json lists Playwright/Crawlee. Nothing obvious is requesting unrelated credentials or services. However the README claims robust 'anti-fingerprint' and '穿透反爬' (bypassing anti-scraping measures) features while the code uses only basic techniques (cookie clearing, headless Playwright, simulated clicks). That claim appears overstated compared to the implementation.

ℹ Instruction Scope

Runtime instructions and code operate within the stated scraping scope (navigating pages, clicking, intercepting network requests for YouTube timedtext, extracting innerText). A notable behavior: the agent is instructed to autonomously construct target URLs from user text when no URL is provided, giving the agent broad discretion to generate and crawl arbitrary Amazon pages or other sites — this is open‑ended and can lead to unexpected or high‑volume scraping. The SKILL.md does not instruct reading unrelated system files or environment variables.

⚠ Install Mechanism

There is no install spec (instruction-only), but package.json lists Playwright and Crawlee which require installing native browsers and can perform large network downloads. SKILL.md instructs a docker build (docker build -t clawd-crawlee skills/deep-scraper/) but no Dockerfile is included in the skill bundle — this is an inconsistency: the recommended build step may fail or rely on external files. Running Docker builds or installing playwright/browser binaries can be heavy and execute arbitrary code; you should verify the Dockerfile and build context before building.

✓ Credentials

The skill does not request environment variables, credentials, or config paths. The scraping handlers operate on user-supplied or constructed URLs only. There is no evidence in the code of hidden credential collection or requests to external exfiltration endpoints.

✓ Persistence & Privilege

The skill is not always-enabled and does not request elevated persistence. It does not modify other skills or system-wide config. It will run crawls when invoked and expects Docker/containerized execution per the README, which is normal for this use case.

Version history

v1.0.0

Fork of deep-scraper with Amazon-specific structured data extraction. Supports Best Sellers, search results, product detail pages with ASIN, price, rating, reviews, boughtPastMonth. Also includes YouTube transcript capture and generic page scraping.

Metadata

Slug deep-scraper-amazon

Version 1.0.0

License —

Total installs 2

Current installs 2

Number of versions 1

FAQ

What is Deep Scraper + Amazon?

High-performance containerized web scraper (Docker + Crawlee + Playwright). Use when user mentions any of these: 爬虫, 爬取, 抓取, 采集, 数据采集, 爬数据, 抓数据, 获取数据, scrape... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 669 cumulative downloads to date.

How do I install Deep Scraper + Amazon?

Run the command `/install deep-scraper-amazon` in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Deep Scraper + Amazon free?

Yes, Deep Scraper + Amazon is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Deep Scraper + Amazon support?

Deep Scraper + Amazon is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Deep Scraper + Amazon?

Developed and maintained by jiafar (@jiafar); the current version is v1.0.0.
