mx2013713828

Image-crawler

by MagicWolf · GitHub ↗ · v1.0.0 · MIT-0
cross-platform · ✓ Security Clean
Downloads: 131 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install image-crawler
Description
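An image collection/crawler tool that supports the Baidu and Bing image search engines. Use it when the user asks to collect, crawl, download, or gather images. It supports keyword expansion, image deduplication (by URL and content hash, persisted across runs), progress monitoring, and stall detection. Trigger phrases: 采集图片, 爬取图片, 下载图片, 图片爬虫, 抓取图片 (collect images, crawl images, download images, image crawler, scrape images).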
README (SKILL.md)

Image Crawler

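Batch-collects images through Baidu/Bing image search, with built-in deduplication, keyword expansion, and progress monitoring.

Quick workflow

1. Confirm requirements → 2. Generate expansion keywords → 3. Build the command → 4. Run and monitor → 5. Report results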

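Step 1: Confirm the collection requirements

Extract the following from the user's request:

  • Keywords (required): what images to collect
  • Count (default 100): how many images are needed
  • Output directory (default ./crawled_images): where to store them
  • Engine (default baidu): Baidu is usually more stable and gives better results for Chinese-language searches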

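Step 2: Keyword expansion

Use the LLM to generate 5-15 expansion keywords and pass them in via --expand-terms.

Expansion strategies (choose by domain):

Equipment/products: brand + model + usage scenario

User says "挖掘机" (excavator) → 三一,卡特,小松,沃尔沃,日立,临工,大型,小型,施工现场,工地 (SANY, Caterpillar, Komatsu, Volvo, Hitachi, Lingong, large, small, construction site, job site)

Animals/plants: breed + environment + state

User says "猫" (cat) → 橘猫,英短,布偶,暹罗,黑猫,可爱,睡觉,户外 (orange tabby, British Shorthair, Ragdoll, Siamese, black cat, cute, sleeping, outdoors)

Architecture/scenes: style + location + time

User says "别墅" (villa) → 欧式,中式,现代,豪华,花园,室内,外观,夜景 (European style, Chinese style, modern, luxury, garden, interior, exterior, night view)

General principle: expansion terms should add diversity rather than repeat one another. Mixing Chinese and English terms can broaden search coverage.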
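The listing does not say how the script combines base keywords with expansion terms. As a rough illustration only, the sketch below assumes each expansion term is appended to each base keyword to form an extra query. The helper names `parse_expand_terms` and `build_queries` are hypothetical and are not part of the skill.

```python
def parse_expand_terms(raw: str) -> list[str]:
    """Split a comma-separated --expand-terms value, dropping blanks and repeats."""
    seen: set[str] = set()
    terms: list[str] = []
    # Accept both ASCII and full-width commas, since terms are often typed in Chinese.
    for part in raw.replace(",", ",").split(","):
        term = part.strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


def build_queries(keywords: list[str], expand_terms: list[str]) -> list[str]:
    """Return the base keywords followed by 'keyword term' pairs (assumed scheme)."""
    queries = list(keywords)
    for keyword in keywords:
        for term in expand_terms:
            query = f"{keyword} {term}"
            if query not in queries:
                queries.append(query)
    return queries


terms = parse_expand_terms("橘猫,英短, 布偶,橘猫,")
queries = build_queries(["猫", "cat"], terms)
print(queries)
```

Removing repeated terms before building queries follows the general principle above: a repeated term adds requests without adding diversity.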

Step 3: Build and run the command

Script location: scripts/image_crawler.py (relative to this SKILL.md)

python {skill_dir}/scripts/image_crawler.py \
  -k "keyword1" -k "keyword2" \
  -n COUNT \
  -o OUTPUT_DIR \
  -e baidu \
  --expand --expand-terms "term1,term2,..." \
  --json

Always use --json mode so the output can be parsed.

Typical example:

# Collect 200 excavator images
python scripts/image_crawler.py \
  -k "挖掘机" -k "excavator" \
  -n 200 -o ./excavator_images \
  --expand --expand-terms "三一,卡特,小松,沃尔沃,临工,大型,施工现场" \
  --json
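
When the command is assembled programmatically, building it as an argument list avoids shell-quoting problems with non-ASCII keywords. The sketch below uses only the flags shown above (-k, -n, -o, -e, --expand, --expand-terms, --json). The function name `build_command` and the skill directory path in the demo are hypothetical.

```python
import shlex
import sys
from pathlib import Path


def build_command(skill_dir, keywords, count=100, output_dir="./crawled_images",
                  engine="baidu", expand_terms=None):
    """Assemble the image_crawler.py argument list from the Step 1 inputs."""
    script = Path(skill_dir) / "scripts" / "image_crawler.py"
    cmd = [sys.executable, str(script)]
    for keyword in keywords:
        cmd += ["-k", keyword]          # one -k flag per base keyword
    cmd += ["-n", str(count), "-o", output_dir, "-e", engine]
    if expand_terms:
        cmd += ["--expand", "--expand-terms", ",".join(expand_terms)]
    cmd.append("--json")                # always request JSON output
    return cmd


cmd = build_command(
    "skills/image-crawler",             # hypothetical install location
    ["挖掘机", "excavator"],
    count=200,
    output_dir="./excavator_images",
    expand_terms=["三一", "卡特", "小松", "沃尔沃", "临工", "大型", "施工现场"],
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.Popen` or `subprocess.run` without `shell=True`.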

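Step 4: Monitor the collection process

Run the script in background mode and check its output periodically:

  1. exec with background: true to start the script
  2. process(poll) to fetch the latest output
  3. Parse the JSON lines and watch for the following events:

| type | Meaning | Agent action |
| --- | --- | --- |
| progress | Download progress | Report progress and estimated time to the user |
| stall | Collection has stalled | Warn the user that something may be wrong |
| error | Serious error | Stop immediately and inform the user (anti-scraping or network problem) |
| done | Collection finished | Report the statistics |

Stall check: if poll returns no new progress output for a long time (>60s), proactively check the process status.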
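
The event handling above can be sketched as a small dispatcher over the line-delimited JSON output. Only the `type` field and the 60-second threshold come from this document. The class name `CrawlMonitor`, the action labels it returns, and the injected clock are illustrative assumptions; the full event schema is described in references/customization.md and is not reproduced here.

```python
import json
import time

STALL_SECONDS = 60  # threshold from the stall check above


class CrawlMonitor:
    """Maps each JSON output line to an agent action and tracks stalls."""

    def __init__(self, now=time.monotonic):
        self._now = now                  # injectable clock, handy for testing
        self.last_progress = now()
        self.finished = False

    def feed(self, line: str) -> str:
        """Return the action label for one line of script output."""
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            return "ignore"              # blank or non-JSON line
        if not isinstance(event, dict):
            return "ignore"
        kind = event.get("type")
        if kind == "progress":
            self.last_progress = self._now()
            return "report_progress"
        if kind == "stall":
            return "warn_user"
        if kind == "error":
            self.finished = True
            return "abort"
        if kind == "done":
            self.finished = True
            return "report_summary"
        return "ignore"                  # unknown event types are skipped

    def is_stalled(self) -> bool:
        """True when no progress event has arrived for more than STALL_SECONDS."""
        return (not self.finished
                and self._now() - self.last_progress > STALL_SECONDS)


# Demo with a fake clock so the stall check can be exercised instantly.
clock = [0.0]
monitor = CrawlMonitor(now=lambda: clock[0])
actions = [monitor.feed(line) for line in
           ['{"type": "progress"}', "not json", '{"type": "stall"}']]
clock[0] = 61.0
print(actions, monitor.is_stalled())
```

Tracking the time of the last progress event separately from the script's own stall events covers the case where the process hangs and emits nothing at all.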

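Step 5: Report results

After collection finishes, report to the user:

  • Successful downloads / target count
  • Number of duplicates removed
  • Total elapsed time
  • Output directory path
  • If there were failures, the likely causes (anti-scraping measures, network problems, source site unavailable)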

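Incremental collection

The script deduplicates across runs. If the user needs more images, simply run it again with the same output directory:

  • .dedup_hashes.json lets images that are already present be skipped automatically
  • File numbering increments automatically, so nothing is overwritten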
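
To illustrate how deduplication by URL and content hash can persist across runs, here is a minimal sketch. It assumes that .dedup_hashes.json holds two lists named "urls" and "hashes"; the real file layout is not shown in this document and may differ. MD5 is used because the capability analysis describes the deduplication as MD5-based. The class name `DedupStore` is hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class DedupStore:
    """Remembers seen URLs and content hashes in <output_dir>/.dedup_hashes.json."""

    def __init__(self, output_dir):
        self.path = Path(output_dir) / ".dedup_hashes.json"
        self.urls: set[str] = set()
        self.hashes: set[str] = set()
        if self.path.exists():
            # Assumed layout: {"urls": [...], "hashes": [...]}
            data = json.loads(self.path.read_text(encoding="utf-8"))
            self.urls = set(data.get("urls", []))
            self.hashes = set(data.get("hashes", []))

    def add(self, url: str, content: bytes) -> bool:
        """Record an image; return False if its URL or content was seen before."""
        digest = hashlib.md5(content).hexdigest()
        is_new = url not in self.urls and digest not in self.hashes
        self.urls.add(url)
        self.hashes.add(digest)
        return is_new

    def save(self) -> None:
        """Write the current state so that the next run can skip known images."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        payload = {"urls": sorted(self.urls), "hashes": sorted(self.hashes)}
        self.path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


# Demo: a second "run" on the same directory rejects identical content.
with tempfile.TemporaryDirectory() as tmp:
    first_run = DedupStore(tmp)
    first_added = first_run.add("http://example.com/a.jpg", b"image-bytes")
    first_run.save()
    second_run = DedupStore(tmp)
    second_added = second_run.add("http://example.com/b.jpg", b"image-bytes")
    print(first_added, second_added)
```

Checking the URL first lets a crawler skip a download entirely, while the content hash catches the same image served from a different URL.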

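Detailed interface and customization

See references/customization.md for:

  • Full CLI parameter table
  • Detailed JSON output format
  • How deduplication works
  • Guide to adding a new search engine
  • Troubleshooting common problems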

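Script templates

scripts/ contains two standalone engine templates, suitable for learning or further development:

  • baidu_crawler.py: Baidu image search, with a clear interface and good results for Chinese searches
  • bing_crawler.py: Bing image search, with broad coverage for English searches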
Usage Guidance
This skill appears to do what it says (scrape images from Baidu/Bing and deduplicate them). Before installing or running it:
  1. Run it in a controlled environment (a sandbox or non-privileged account), because it will download many files and use network bandwidth.
  2. Install Python and the 'requests' package (pip install requests); the skill doesn't declare this dependency in its metadata.
  3. Set a safe output directory and a disk quota to avoid filling your disk.
  4. Respect website terms of service and robots.txt, and be aware of the legal and ethical issues with mass scraping.
  5. Consider lowering concurrency and increasing delays (the code already exposes sleep/timeouts) to reduce anti-scraping risk.
  6. Review the scripts for any changes if you plan to run them on sensitive hosts. Although no hidden network sinks or credential access were found, the crawler will fetch arbitrary external URLs, which can host unexpected content.
  7. Do not run as root/administrator, and avoid supplying any unrelated credentials to the skill.
If you want higher assurance, ask the publisher to update the metadata to declare Python and requests as required, and to provide an explicit dependency/install instruction.
Capability Analysis
Type: OpenClaw Skill
Name: image-crawler
Version: 1.0.0
The image-crawler skill bundle is a functional tool designed for batch downloading images from Baidu and Bing. The core logic in scripts/image_crawler.py and its engine-specific templates (baidu_crawler.py, bing_crawler.py) focuses on legitimate web scraping, including features like MD5-based deduplication, keyword expansion, and progress reporting via JSON. No evidence of data exfiltration, malicious execution, or prompt injection was found; all network activity is directed at well-known search engines, and file operations are restricted to the user-specified output directory.
Capability Assessment
Purpose & Capability
Name/description match the provided scripts: the package contains crawler implementations for Baidu and Bing and a wrapper script that coordinates search, download, deduplication and progress reporting. However, the registry metadata declared no required binaries or environment variables while the SKILL.md and scripts assume a Python runtime and the 'requests' library; that runtime dependency is not declared in the metadata.
Instruction Scope
SKILL.md instructs the agent to extract keywords, expand them, run the bundled Python script in JSON mode and monitor its line-delimited JSON output. The instructions stay within the crawler's scope and do not request unrelated files, system credentials, or external endpoints beyond search engines and target image hosts. Use of the LLM to expand keywords is intentional for coverage and is documented.
Install Mechanism
This is an instruction-only skill (no install spec). The included code runs as Python scripts and makes network calls. There is no remote download/installation of code at install time and no obscure third-party install URLs. Note: the script will exit if 'requests' is not installed and prints instructions to pip install it — the dependency should be declared.
Credentials
The skill requests no environment variables or credentials and does not attempt to access system config paths beyond writing to the user-specified output directory. Network access to Bing, Baidu, and arbitrary image hosts is required and expected for its purpose.
Persistence & Privilege
The skill does not request permanent 'always' inclusion, nor does it modify other skills or system-wide settings. It persists deduplication hashes to a file under the chosen output directory (.dedup_hashes.json), which is consistent with stated behavior.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install image-crawler
  3. After installation, invoke the skill by name or use /image-crawler
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
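image-crawler v1.0.0: initial release
  • Batch image collection by keyword through Baidu and Bing image search
  • Built-in smart keyword expansion to increase image diversity
  • Image deduplication (URL and content hash, with persistence)
  • Progress monitoring, stall detection, and automated error handling
  • Scripts emit standard JSON for easy integration and result tracking
  • Incremental collection that automatically skips images already downloaded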
Metadata
Slug image-crawler
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Image-crawler?

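Image-crawler is an image collection/crawler tool that supports the Baidu and Bing image search engines. It is used when the user asks to collect, crawl, download, or gather images, and it supports keyword expansion, image deduplication (by URL and content hash, persisted across runs), progress monitoring, and stall detection. It is an AI Agent Skill for Claude Code / OpenClaw, with 131 downloads so far.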

How do I install Image-crawler?

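Run "/install image-crawler" in the OpenClaw or Claude Code chat to install it in one step. The bundled scripts also need a Python runtime with the 'requests' package (pip install requests), which the skill metadata does not declare.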

Is Image-crawler free?

Yes, Image-crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Image-crawler support?

Image-crawler is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Image-crawler?

It is built and maintained by MagicWolf (@mx2013713828); the current version is v1.0.0.
