Description

Use DataLens MCP tools to scrape structured data from any website open in Chrome. Triggers when the user wants to extract lists, tables, comments, products,...

Usage Instructions (SKILL.md)

DataLens Scraping Skill

Name: web-scraper
Author: weird94

How Tool Calls Work

Every DataLens tool is invoked by running a terminal command. No MCP client configuration is required.

The datalens-mcp-call binary handles the MCP stdio handshake and prints the tool result to stdout as YAML/JSON.

run_in_terminal: datalens-mcp-call <tool_name> '<args_json>'

If datalens-mcp-call is not on PATH (e.g. not globally installed), use npx:

run_in_terminal: npx datalens-mcp-call <tool_name> '<args_json>'
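
The PATH fallback described above can be wrapped in a small shell function so that later commands do not need to care which form is available. This is only a minimal sketch: the function name dl_call is hypothetical and not part of DataLens.

```shell
# Hypothetical wrapper (not part of DataLens): prefer the globally installed
# binary and fall back to npx when it is not on PATH.
dl_call() {
  if command -v datalens-mcp-call >/dev/null 2>&1; then
    datalens-mcp-call "$@"
  else
    npx datalens-mcp-call "$@"
  fi
}
```

With this defined, `dl_call browser_list_tabs` should behave like whichever of the two invocations above applies on the machine.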

Prerequisites

datalens-mcp-server npm package installed: npm install -g datalens-mcp-server (or use npx).
DataLens Chrome extension installed and active in Chrome.
Chrome open with the target page loaded (or provide url in the tool args — the extension will open it).
Node.js ≥ 18 available in the terminal.
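
The terminal-side prerequisites above could be checked before the first tool call with something like the following sketch. The function names node_version_ok and check_prereqs are hypothetical; the only external fact relied on is that `node --version` prints a string such as `v18.19.0`. The Chrome extension cannot be checked from the shell and still has to be verified in the browser.

```shell
# Returns 0 when a Node.js version string such as "v18.19.0" is at least 18.
node_version_ok() {
  major=${1#v}          # strip the leading "v"
  major=${major%%.*}    # keep only the major component
  [ "$major" -ge 18 ] 2>/dev/null
}

# Prints one line per missing prerequisite (prints nothing when all pass).
check_prereqs() {
  if ! command -v node >/dev/null 2>&1; then
    echo "missing: node"
  elif ! node_version_ok "$(node --version)"; then
    echo "too old: node $(node --version), need >= 18"
  fi
  command -v datalens-mcp-call >/dev/null 2>&1 || command -v npx >/dev/null 2>&1 \
    || echo "missing: datalens-mcp-call (and no npx fallback)"
}
```

Running `check_prereqs` with no output suggests that Node.js and one of the two invocation paths are available.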

How This Works

datalens-mcp-call spawns the DataLens MCP proxy as a child process, performs the MCP initialization handshake over stdio, calls the requested tool, and prints the result.

AI Agent
  ↓ run_in_terminal
datalens-mcp-call <tool> <args>
  ↓ stdio JSON-RPC
DataLens MCP Proxy (datalens-mcp-proxy)
  ↓ WebSocket (localhost:17373)
Chrome Extension
  ↓
Browser Tab

Standard Scraping Workflow

Follow these steps in order. Steps marked optional may be omitted; do not skip any other step, and do not call scrape_start before scrape_analyze_columns completes.

Step 1 — Detect tables

datalens-mcp-call scrape_detect_tables '{"url":"https://example.com","prompt":"article list"}'

Returns a list of detected table structures with rootSelector, itemSelector, documentInfoPath. Pick the best matching table and copy those three values for subsequent steps.
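
Because selectors often contain double quotes (for example attribute selectors), copying the three values into a JSON argument by hand is error-prone. The sketch below builds the shared argument object from shell variables. It assumes all three values are plain strings, as the placeholders in the later steps suggest, and that they contain no newlines or control characters. The helper names json_escape and dl_selector_args are hypothetical.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# ASSUMPTION: values contain no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print the selector arguments shared by Steps 2, 2b and 3.
# Usage: dl_selector_args ROOT_SELECTOR ITEM_SELECTOR DOCUMENT_INFO_PATH
dl_selector_args() {
  printf '{"rootSelector":"%s","itemSelector":"%s","documentInfoPath":"%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}
```

For Step 2 the output can be passed directly, e.g. `datalens-mcp-call scrape_get_table_tree "$(dl_selector_args "$ROOT" "$ITEM" "$DOC")"`, where ROOT, ITEM and DOC are variables holding the values copied from the Step 1 output.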

If the page requires login, ask the user to log in to the site in Chrome first, then re-run this command.

Step 2 (optional) — Inspect tree for expand buttons

datalens-mcp-call scrape_get_table_tree '{"rootSelector":"<from step 1>","itemSelector":"<from step 1>","documentInfoPath":"<from step 1>"}'

Use when the data has nested replies, collapsed rows, or "load more" buttons. Inspect the _uid-annotated tree in the output to identify expand button UIDs.

Step 2b (optional) — Expand and re-detect

datalens-mcp-call scrape_click_expand_and_redetect '{"rootSelector":"...","itemSelector":"...","documentInfoPath":"...","expandButtonUids":[{"type":"reply","uids":["uid1","uid2"]}]}'

The extension clicks the buttons, waits for new content, then re-detects. Use the updated rootSelector/itemSelector/documentInfoPath from this output in Step 3.

Step 3 — Analyze columns

datalens-mcp-call scrape_analyze_columns '{"rootSelector":"...","itemSelector":"...","documentInfoPath":"...","url":"https://example.com","prompt":"article list"}'

Calls the backend AI to identify fields, data types, and pagination. Returns a scraperConfig and jobDraft. Confirm the field list looks correct before proceeding.

Step 4 — Start scraping

# Pass the jobDraft object returned by scrape_analyze_columns
datalens-mcp-call scrape_start '{"jobDraft":<paste jobDraft here>,"maxRecords":10}'

Returns a jobId. Use maxRecords: 10 for a preview run first.
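
Pasting a large jobDraft object inside single quotes breaks as soon as the JSON itself contains a single quote. One way around this, sketched below, is to save the jobDraft to a file and assemble the arguments from it. The helper name dl_start_args and the file name jobdraft.json are hypothetical; the file is assumed to hold the jobDraft JSON object exactly as returned by scrape_analyze_columns.

```shell
# Hypothetical helper: build the scrape_start arguments from a file holding
# the jobDraft JSON object. The second argument is maxRecords (default 10,
# matching the recommended preview run).
dl_start_args() {
  draft_file=$1
  max_records=${2:-10}
  printf '{"jobDraft":%s,"maxRecords":%s}\n' "$(cat "$draft_file")" "$max_records"
}
```

A preview run would then look like `datalens-mcp-call scrape_start "$(dl_start_args jobdraft.json)"`.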

Step 5 — Poll for status

datalens-mcp-call scrape_status '{"jobId":"<jobId>","waitMs":3000}'

Re-run until status is COMPLETED, FAILED, or STOPPED.

Key status fields:

status: QUEUED → PREPARING → RUNNING → COMPLETED / FAILED / STOPPED
scrapedCount: rows collected so far
error: present only on failure
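
The re-run-until-terminal rule can be sketched as a loop. Several things here are assumptions rather than documented behavior: that the tool output contains a line fragment of the form `status: RUNNING` (YAML) or `"status":"RUNNING"` (JSON), and that waitMs may or may not make the call block (the short sleep only guards against a tight loop if it does not). The names DATALENS_CALL, dl_extract_status and dl_wait_job are hypothetical.

```shell
# Command used to invoke tools; set to "npx datalens-mcp-call" if needed.
DATALENS_CALL=${DATALENS_CALL:-datalens-mcp-call}

# Reads tool output on stdin and prints the first upper-case status value.
# ASSUMPTION: output contains `status: X` or `"status":"X"`.
dl_extract_status() {
  sed -n 's/.*"\{0,1\}status"\{0,1\}[[:space:]]*:[[:space:]]*"\{0,1\}\([A-Z][A-Z]*\).*/\1/p' | head -n 1
}

# Polls scrape_status until a terminal status is seen, then prints it.
# Exit code: 0 COMPLETED, 1 FAILED/STOPPED, 2 tool call failed, 3 gave up.
dl_wait_job() {
  dl_job_id=$1
  dl_max_polls=${2:-200}
  dl_i=0
  while [ "$dl_i" -lt "$dl_max_polls" ]; do
    # $DATALENS_CALL is intentionally unquoted so "npx datalens-mcp-call" splits.
    dl_out=$($DATALENS_CALL scrape_status "{\"jobId\":\"$dl_job_id\",\"waitMs\":3000}") || return 2
    dl_status=$(printf '%s\n' "$dl_out" | dl_extract_status)
    case "$dl_status" in
      COMPLETED) echo "$dl_status"; return 0 ;;
      FAILED|STOPPED) echo "$dl_status"; return 1 ;;
    esac
    sleep 1
    dl_i=$((dl_i + 1))
  done
  echo "TIMEOUT"
  return 3
}
```

Called as `dl_wait_job <jobId>`, the function prints the final status, so a script can branch on its exit code before moving on to Step 6.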

Step 6 — Retrieve results

Save to file (recommended for large results):

datalens-mcp-call scrape_export_to_file '{"jobId":"<jobId>","outputDir":"/tmp/datalens","format":"json"}'

Returns the saved file path.

Inline preview (small result sets):

datalens-mcp-call scrape_result '{"jobId":"<jobId>","limit":50}'

Use the cursor field from each response to fetch the next page.

In-memory export:

datalens-mcp-call scrape_export '{"jobId":"<jobId>","format":"csv"}'

Returns base64-encoded file content.

Job Control

datalens-mcp-call scrape_pause  '{"jobId":"<jobId>"}'
datalens-mcp-call scrape_resume '{"jobId":"<jobId>"}'
datalens-mcp-call scrape_stop   '{"jobId":"<jobId>"}'

Browser Tab Management

datalens-mcp-call browser_list_tabs
datalens-mcp-call browser_open_tab  '{"url":"https://example.com"}'
datalens-mcp-call browser_use_tab   '{"tabId":123}'
datalens-mcp-call browser_close_tab '{"tabId":123}'

Tab management is usually not needed — scrape_detect_tables with a url arg handles tab opening automatically.

Agent Decision Rules

Never call scrape_start without a jobDraft or scraperConfig from a prior scrape_analyze_columns response. Fabricating a scraperConfig will produce wrong results.
Never skip scrape_analyze_columns and jump straight to scrape_start. The analyze step is required to build the config.
If scrape_detect_tables returns an empty list, the page may need login or may be dynamically loaded. Ask the user to open the target URL in Chrome and scroll to load content, then retry.
If scrape_status stays at QUEUED for more than 30 seconds, check that the Chrome extension is active and that a tab for the target URL is open.
Use maxRecords: 10 for a preview scrape to confirm the config is correct before running a full job.
Default export format is JSON. Use CSV or XLSX when the user asks for spreadsheet output.

End-to-End Example: Scrape Toutiao Headlines

# 1. Detect tables on the homepage
datalens-mcp-call scrape_detect_tables '{"url":"https://www.toutiao.com/?is_new_connect=0&is_new_user=0","prompt":"article list"}'

# 2. Analyze columns (fill in selectors from step 1 output)
datalens-mcp-call scrape_analyze_columns '{"rootSelector":"<from step 1>","itemSelector":"<from step 1>","documentInfoPath":"<from step 1>","url":"https://www.toutiao.com/?is_new_connect=0&is_new_user=0","prompt":"article list"}'

# 3. Preview run — first 10 rows (paste the full jobDraft JSON object from step 2)
datalens-mcp-call scrape_start '{"jobDraft":<paste jobDraft>,"maxRecords":10}'

# 4. Poll until status is COMPLETED
datalens-mcp-call scrape_status '{"jobId":"<jobId>","waitMs":3000}'

# 5. Save results to file
datalens-mcp-call scrape_export_to_file '{"jobId":"<jobId>","outputDir":"/tmp/datalens","format":"json"}'

Set DATALENS_TIMEOUT=180000 before running if a tool call takes longer than the default 120 s:

DATALENS_TIMEOUT=180000 datalens-mcp-call scrape_analyze_columns '...'

Debug Tools

These are for troubleshooting only. Do not use in normal scraping workflows.

datalens-mcp-call debug_get_logs '{"levels":["error"]}'
datalens-mcp-call debug_clear_logs '{}'
datalens-mcp-call debug_export_logs_to_file '{"outputDir":"/tmp/datalens"}'

Security Recommendations

Before installing or running this skill, consider the following:
- Inconsistency: the registry metadata lists no required binaries, but the SKILL.md requires `datalens-mcp-call`, Node ≥18, and installing `datalens-mcp-server` via npm (or using npx). Ask the publisher why the metadata omits these requirements and request a homepage/source repository.
- npm/npx risk: the instructions rely on npm/npx to fetch and run third-party code. Verify the exact package name and a specific version on the npm registry, inspect the package contents (or the Git repository), and prefer pinned versions over npx or global installs.
- Chrome extension risk: the extension will have access to page content and tabs. Confirm the extension's publisher, review its permissions, and audit its code (or install only from the official Chrome Web Store after validation). Avoid using it on pages with sensitive credentials, personal data, bank/account pages, or internal corporate systems until you validate privacy practices.
- Data exfiltration: the skill mentions a backend AI call during analysis. Clarify what data is sent to external servers, read the service privacy policy, and avoid scraping sensitive pages if the backend processes page content remotely.
- Operational safety: if you must test it, run it against non-sensitive public pages first, in an isolated environment, and prefer a local, audited build of the datalens tools rather than npx.

If the publisher/source cannot be verified, treat this skill as untrusted. Request a homepage, source repo, package links, and privacy/security documentation; that information could change the assessment to benign if it confirms provenance and appropriate safeguards.

Functional Analysis

Type: OpenClaw Skill
Name: datalens-web-scraper
Version: 1.0.0

The DataLens Web Scraper skill provides a structured interface for an AI agent to extract data from websites via a Chrome extension and an MCP server. The SKILL.md file outlines a legitimate multi-step workflow (detection, analysis, execution, and export) using a local binary `datalens-mcp-call`. There are no signs of data exfiltration, unauthorized file access, or malicious prompt injection; all actions are consistent with the stated purpose of web scraping.

Capability Assessment

⚠ Purpose & Capability

The SKILL.md clearly implements a web scraper using a local MCP proxy and a Chrome extension (consistent with the skill name). However, the registry metadata declares no required binaries or install steps, while the instructions require datalens-mcp-call (or npx), datalens-mcp-server (npm), Node ≥18, and a Chrome extension. This mismatch between metadata and instructions is an inconsistency worth questioning.

⚠ Instruction Scope

Instructions tell the agent to spawn a local MCP proxy that talks to a Chrome extension to click buttons, open/close tabs, and extract page content. They also mention a backend AI call during column analysis — implying scraped page content or selectors will be sent to an external DataLens backend. The SKILL.md does not describe where data is sent, privacy boundaries, or what the backend does with page content.

⚠ Install Mechanism

There is no formal install spec in the skill bundle, but the instructions ask users/agents to run `npm install -g datalens-mcp-server` or use `npx datalens-mcp-call`. Using npx/global npm installs will fetch and run code from the npm registry on demand — a moderate-to-high risk operation if the package source/version and publisher are not verified. The Chrome extension installation is also required and can grant wide permissions to browsing data; neither the extension source nor the npm package provenance is provided.

ℹ Credentials

The skill declares no required environment variables or credentials (which is consistent). However it implicitly requires the user to be logged into target sites in Chrome and to install an extension that can read all page content and tabs. While expected for a scraper, that level of browser access is high-privilege relative to the simple registry metadata.

ℹ Persistence & Privilege

The skill is not marked always:true and does not request persistent system-wide changes in the bundle. However, it enables the agent to run local commands and, if allowed to invoke autonomously, could execute npx or spawn servers that interact with the browser. Combined with the install-and-extension flow, the autonomous invocation capability increases the blast radius. Autonomous invocation alone is not flagged, but it amplifies the other concerns.

Version History

v1.0.0

Initial release of datalens-web-scraper skill.
- Enables structured data extraction from any website open in Chrome using DataLens MCP tools.
- Provides a clear, step-by-step scraping workflow covering table detection, column analysis, job start, status polling, and data export.
- Outlines prerequisite setup for required npm packages, Chrome extension, and Node.js.
- Supports browser tab management and job control commands.
- Includes troubleshooting tools and explicit agent rules for correct operation.

Metadata

Slug datalens-web-scraper

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ

What is web-scraper?

Use DataLens MCP tools to scrape structured data from any website open in Chrome. Triggers when the user wants to extract lists, tables, comments, products,... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 68 times so far.

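How do I install web-scraper?

Run the command "/install datalens-web-scraper" in the OpenClaw or Claude Code chat box to install the skill in one step. The install command itself needs no extra configuration; the prerequisites listed in SKILL.md (npm package, Chrome extension, Node.js ≥ 18) still have to be in place before the skill can run.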

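Is web-scraper free?

Yes. web-scraper is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does web-scraper support?

web-scraper is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops web-scraper?

It is developed and maintained by weird94 (@weird94). The current version is v1.0.0.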
