Lobster Crawler Skill

Name: Lobster Crawler Skill
Author: 7487

Description

Targeted crawling of book and short-drama content from sites such as Webnovel and ReelShorts, with content grading and DingTalk broadcasting.

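Usage (SKILL.md)

Lobster Crawler Skill

Crawls structured content for Webnovel novels and ReelShorts short dramas, with incremental updates, content grading (high/medium/low), and DingTalk bot broadcasting.

Environment setup

Before first use, initialize the Python environment in the skill directory: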

cd {{skillPath}}
uv venv .venv
uv pip install -r requirements.txt

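No browser installation is needed. Anti-bot evasion relies on TLS fingerprint impersonation via curl_cffi, a Python package installed from requirements.txt with no additional system dependencies.

All subsequent commands are run through uv run, which automatically uses the .venv virtual environment.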

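Trigger conditions

Activate this skill when the user's message expresses any of the following intents:

Crawling or scraping novels, short dramas, webnovel, or reelshorts content
Checking crawler status or the list of crawled works
Broadcasting crawl results to DingTalk
Generating an RSS feed
Managing scheduled crawl tasks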

Commands

All commands must be run from the skill directory. First cd {{skillPath}}, then run the command.

Crawl content

uv run python -m src.cli crawl <spider_name>

spider_name accepts: webnovel (novels) or reelshorts (short dramas)
Spider arguments can be passed with -a: uv run python -m src.cli crawl webnovel -a max_pages=5
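When the crawl command is launched from a script rather than typed by hand, it helps to assemble the argument list programmatically. The following is a minimal sketch, not part of the skill: the helper name build_crawl_command is hypothetical, and it assumes each spider argument maps to one -a key=value pair as in the example above.

```python
import shlex

# Spider names documented for the crawl command.
VALID_SPIDERS = ("webnovel", "reelshorts")


def build_crawl_command(spider_name, **spider_args):
    """Return the argv list for one crawl run (hypothetical helper).

    Each keyword argument becomes one "-a key=value" pair.
    """
    if spider_name not in VALID_SPIDERS:
        raise ValueError(f"unknown spider: {spider_name!r}")
    argv = ["uv", "run", "python", "-m", "src.cli", "crawl", spider_name]
    for key, value in spider_args.items():
        argv += ["-a", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    # Print the command as it would be typed in a shell.
    print(shlex.join(build_crawl_command("webnovel", max_pages=5)))
```

The returned list can be passed to subprocess.run with cwd set to the skill directory, which avoids shell quoting issues.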

List crawled works

uv run python -m src.cli list [--site <site>] [--grade <grade>] [--limit <n>]

--site: filter by site (webnovel / reelshorts)
--grade: filter by grade (high / medium / low)
--limit: number of results to show, default 20
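To make the filter semantics concrete, here is a small sketch of how options like these could be parsed and applied. It is an illustration only and does not reproduce the skill's src.cli: make_list_parser, filter_works, and the dictionary keys site and grade are assumptions.

```python
import argparse


def make_list_parser():
    """Build a parser mirroring the documented list options (illustrative)."""
    parser = argparse.ArgumentParser(prog="list")
    parser.add_argument("--site", choices=["webnovel", "reelshorts"])
    parser.add_argument("--grade", choices=["high", "medium", "low"])
    parser.add_argument("--limit", type=int, default=20)
    return parser


def filter_works(works, site=None, grade=None, limit=20):
    """Keep works matching the given site and grade, then cap the count."""
    selected = [
        work for work in works
        if (site is None or work["site"] == site)
        and (grade is None or work["grade"] == grade)
    ]
    return selected[:limit]
```

Omitting a filter leaves that dimension unrestricted, so calling filter_works with only limit set returns the first entries regardless of site or grade.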

Check system status

uv run python -m src.cli status

Returns database statistics (number of works, chapters, and episodes) and the count for each grade.

Broadcast to DingTalk

uv run python -m src.cli broadcast [--site <site>] [--grade <grade>] [--title <title>]

Generates a Markdown message and sends it to a DingTalk group. Requires the DINGTALK_WEBHOOK environment variable to be set.
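For orientation, a DingTalk custom robot accepts a JSON body with msgtype set to markdown. The sketch below shows one way such a message could be built and posted using only the standard library. It is not the skill's broadcast module: the function names and the work fields (title, site, grade) are assumptions for illustration.

```python
import json
import os
import urllib.request


def build_markdown_payload(title, works):
    """Build a DingTalk markdown message body from a list of work dicts."""
    lines = [f"### {title}", ""]
    for work in works:
        lines.append(f"- **{work['title']}** ({work['site']}, {work['grade']})")
    return {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": "\n".join(lines)},
    }


def send_to_dingtalk(payload, webhook=None):
    """POST the payload to the webhook taken from DINGTALK_WEBHOOK."""
    webhook = webhook or os.environ.get("DINGTALK_WEBHOOK")
    if not webhook:
        raise RuntimeError("DINGTALK_WEBHOOK is not set")
    request = urllib.request.Request(
        webhook,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Failing fast when the webhook is missing keeps a misconfigured run from silently producing no broadcast.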

Manage scheduled tasks

uv run python -m src.cli schedule --action=list    # list tasks
uv run python -m src.cli schedule --action=load    # load tasks from config
uv run python -m src.cli schedule --action=start   # start the scheduler

Generate an RSS feed

uv run python -m src.cli rss [--format rss|atom] [--output <path>] [--site <site>] [--grade <grade>]

Output defaults to data/rss.xml.
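As a rough picture of what an RSS 2.0 document for crawled works looks like, the following sketch builds one with the standard library. It is an assumption-laden illustration, not the skill's feed generator: build_rss and the item keys title and link are hypothetical, and the URLs in any usage are placeholders.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, items):
    """Return a minimal RSS 2.0 document as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")
```

Writing the returned string to data/rss.xml would match the default output path described above.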

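Rules

Before first use, you must run the "Environment setup" steps to install dependencies. If uv run reports a missing module, rerun the setup.
Before running a crawler, run status to confirm the system is healthy.
If the user does not specify a site, ask whether to crawl webnovel or reelshorts.
Before broadcasting, use list to confirm there is data to broadcast.
DingTalk broadcasting requires confirming that the DINGTALK_WEBHOOK environment variable is configured.
Crawls can take a long time; tell the user in advance and report the results when finished.
Do not run multiple crawler instances at the same time, to avoid concurrency conflicts.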
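The single-instance rule can be enforced mechanically with a lock file. The sketch below is one possible approach and is not something the skill is known to ship: crawl_lock and the default path data/crawl.lock are hypothetical.

```python
import contextlib
import os


@contextlib.contextmanager
def crawl_lock(path="data/crawl.lock"):
    """Hold an exclusive lock file while a crawl runs (hypothetical helper)."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(
            f"another crawl appears to be running ({path} exists)"
        ) from None
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))
        yield path
    finally:
        os.close(fd)
        os.remove(path)
```

A crawl would then run inside a with crawl_lock(): block. One limitation of this design is that a process killed abruptly leaves a stale lock file, which has to be removed by hand.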

Security recommendations

This package appears to be a functioning crawler and DingTalk broadcaster, but I found several red flags you should address before installing or running it:
- Install metadata mismatch: the skill requires a 'uv' CLI, but the install block lists the package 'curl_cffi' and claims it will create 'uv', which is inconsistent. Do not run any automatic 'install' step until this is clarified. Prefer creating a Python venv and running 'pip install -r requirements.txt' yourself in an isolated environment.
- Environment variables: you must supply DINGTALK_WEBHOOK for broadcasts; the code also reads DINGTALK_SECRET (for signed webhooks), but that variable is not declared. If you supply a secret, ensure it is the intended value. Review any .env files before use.
- Hidden agent/LLM behavior: repository docs and scripts instruct running an LLM loop (claude_loop.sh) and persisting agent memory under ~/.claude/projects/... These actions are unrelated to simple crawling and give the project the ability to read and write outside the repo and to invoke an LLM repeatedly. Only run these parts if you trust the publisher and understand what will be written and sent.
- Run in isolation: test in a disposable environment (container or VM), with network access restricted if necessary. Inspect and, if needed, remove or disable scripts/claude_loop.sh and the CLAUDE.md steps that write to your home directory before allowing autonomous runs.
- Verify robots/ethics: review the target sites' robots.txt and legal terms; the repository itself has conflicting notes about obeying robots.txt.
Before proceeding, inspect the files that reference the problematic install block, the DINGTALK_SECRET usage, and the ~/.claude memory writes.

Capability assessment

⚠ Purpose & Capability

Name/description (crawler + DingTalk broadcast) align with most code (scrapy spiders, RSS, broadcast module). However, the declared required binary 'uv' and the install block are incoherent (install declares kind: uv with package 'curl_cffi' and bins: ['uv'] — installing curl_cffi would not create an 'uv' binary). The repo also bundles an LLM loop (scripts/claude_loop.sh, prompts/claude_loop_prompt.txt, CLAUDE.md) which is not strictly necessary for a crawler; that increases the runtime footprint beyond the stated simple crawler+broadcast purpose.

⚠ Instruction Scope

SKILL.md runtime instructions focus on using 'uv run' to run the CLI (crawl/list/status/broadcast/rss) which is coherent. But repository docs (CLAUDE.md, agent.md, scripts/claude_loop.sh and prompts) instruct an agent to run continuous LLM loops, to read and update repo docs and to persist project memory into ~/.claude/projects/... — that asks the agent to write to a global home-path and to run an external 'claude' binary. Those behaviours (writing to user home, running an LLM loop) go beyond crawling and are not declared in SKILL.md.

⚠ Install Mechanism

Declared install block is inconsistent: kind: uv package: 'curl_cffi' bins: ['uv'] — this does not make sense (curl_cffi is a Python library, not an installer that yields an 'uv' binary). The SKILL.md uses 'uv venv' and 'uv run', implying a dependency on a tool named 'uv' but the install metadata doesn't install that tool. The repo otherwise uses standard Python dependencies via requirements.txt (pip). This mismatch suggests either broken install metadata or sloppy packaging; treat automatic install as risky.

⚠ Credentials

Registry declares a single required env var DINGTALK_WEBHOOK (primaryEnv) which matches the broadcast feature. However code (src/broadcast/dingtalk.py) also reads DINGTALK_SECRET for HMAC signing, but that variable is not declared in requires.env. Additional optional envs appear in config logic (DB_PATH, LOG_LEVEL) and .env references in docker-compose. The skill also includes scripts and docs that reference ~/.claude memory paths and require an external 'claude' CLI — those introduce implicit credentials/configuration and external network usage that are not declared.
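For readers checking what DINGTALK_SECRET would be used for, the sketch below follows DingTalk's published signing scheme for custom robots as commonly documented: HMAC-SHA256 over the millisecond timestamp, a newline, and the secret, then Base64 and URL-encoding. The repository's src/broadcast/dingtalk.py may differ, and sign_webhook is a hypothetical name.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def sign_webhook(webhook, secret, timestamp_ms=None):
    """Append timestamp and sign query parameters to a DingTalk webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(
        secret.encode("utf-8"), string_to_sign, hashlib.sha256
    ).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    separator = "&" if "?" in webhook else "?"
    return f"{webhook}{separator}timestamp={timestamp_ms}&sign={sign}"
```

Because the secret never leaves the process and only the derived signature is sent, supplying the wrong value results in rejected requests rather than a leaked credential, but the value should still be reviewed before use.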

⚠ Persistence & Privilege

The skill is not marked always:true (good). Nevertheless, repository docs instruct agents to persist memory into a global ~/.claude/projects/ path, and scripts/claude_loop.sh creates .claude/out and .claude/logs inside the repo and calls the 'claude' binary. The combination of an LLM loop, automated write-to-home instructions, and webhook broadcasting increases the blast radius if run autonomously. This behavior is not explained in the high-level SKILL.md and is outside the crawler's minimal needs.

Version history

v0.7.0

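- Switched anti-bot handling from the Playwright headless browser to curl_cffi fingerprint impersonation; no browser or system dependencies need to be installed, which simplifies initialization.
- Removed Playwright and its installation and startup instructions; the skill environment is now lighter and Python-only.
- Updated requirements.txt, the skill metadata, and the install instructions to curl_cffi.
- Adapted the crawling logic and related middleware to the curl_cffi implementation.
- Simplified parts of the initialization documentation and removed the browser-related troubleshooting section.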

v0.6.0

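**Major overhaul of environment setup and dependency management; the crawler backend switched from crawl4ai to playwright.**
- Rewrote the environment installation flow; the browser dependency is now `playwright`, and Chromium and its system libraries must be installed explicitly.
- Updated the `install` configuration accordingly, removing the hard dependencies on `python3` and `crawl4ai-browser`; the main dependencies are reduced to `uv` and Python libraries.
- Added a "Troubleshooting" section on debugging Chromium/Playwright in sandboxed container environments.
- The original "first use" flow and command-line usage are unchanged.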

v0.5.1

- Bumped version to 0.5.1.
- Documentation updates in SKILL.md and MEMORY.md.
- Code/documentation changes in `src/spiders/middlewares.py`.

v0.5.0

- Updated Webnovel site configuration and spider logic.
- Improved internal middleware handling.
- Refined documentation with new features and usage instructions.
- General stability and maintainability improvements.

v0.4.1

- Updated version metadata from 0.4.0 to 0.4.1 in SKILL.md.
- Minor documentation and metadata updates; no breaking changes.
- No command or workflow changes introduced.

v0.4.0

- Added support for Crawl4ai headless browser (Chromium) integration.
- Updated environment initialization steps to include browser installation.
- Modified documentation and metadata to reflect the new browser dependency and setup.
- Various adjustments to configuration and middleware to enable browser-based crawling.

v0.3.0

Initial public release with core features and documentation.
- Added source code for core crawling, classification, database, broadcasting, and RSS feed modules.
- Introduced configuration files for settings and supported sites (Webnovel, ReelShorts).
- Included documentation files covering setup, feedback, memory, state, and todo.
- Provided CLI interface for crawling, listing, status, broadcasting, scheduling, and RSS feed generation.
- Added Docker Compose support and installation/setup guides.

v0.2.0

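- Switched Python dependency installation and environment activation to uv; initialization and running no longer require sudo.
- Updated the install instructions and all commands to use uv run, which activates the virtual environment automatically.
- Added an "Environment setup" section and spelled out the first-use steps in the rules.
- Changed the dependency and OS/command-line environment check from python3/bins to anyBins (uv or python3).
- Updated the install flow to support one-step download and installation of uv.
- Other features and commands are unchanged.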

v0.1.0

Initial release of lp-lobster-crawler.
- Supports targeted crawling of Webnovel and ReelShorts sites for books and short dramas.
- Features incremental update, content grading (high/medium/low), and DingTalk group broadcasting.
- Includes commands for crawling, listing works, checking system status, DingTalk broadcast, scheduling, and generating RSS/Atom feeds.
- Offers usage rules to ensure correct operation and user guidance.
- Requires Python 3 and setting the DINGTALK_WEBHOOK environment variable.

Metadata

Slug lp-lobster-crawler

Version 0.7.0

License MIT-0

Total installs 0

Current installs 0

Historical versions 9

FAQ