reed1898

Knowledge Base Collector

by Reed · GitHub ↗ · v0.1.3
cross-platform ⚠ suspicious
1061
Downloads
1
Stars
3
Active Installs
4
Versions
Install in OpenClaw
/install knowledge-base-collector
Description
Collect and organize a personal knowledge base from URLs (web/X/WeChat) and screenshots. Use when the user says they want to save a URL, ingest a link, archive content to the KB, tag/classify notes, store screenshots, or search their saved knowledge in Telegram. Supports WeChat via a connected macOS node when cloud fetch is blocked.
README (SKILL.md)

Summary

  • Ingest: web URLs, X/Twitter links, WeChat Official Account links (mp.weixin.qq.com), and screenshots
  • Store: writes to a shared KB folder with per-item content.md + meta.json and a global index.jsonl
  • Organize: tag-first classification with richer tags (e.g. #agent, #coding-agent, #claude-code, #mcp, #rag, #prompt-injection, #security, #pricing, #database)
  • WeChat: cloud fetch may be blocked; when a macOS node (e.g. Reed-Mac) is online, prefer node-side fetch to improve success rate; otherwise create a placeholder entry
  • Search: designed to support Telegram Q&A / search flows on top of the index and content
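
As an illustration of the tag-first classification listed above, here is a minimal rule-based tagger sketch. The keyword table, the `suggest_tags` name, and the `#<source>` tag form are assumptions made for this example; the skill's actual tagger rules are not shown in this listing.

```python
import re

# Hypothetical keyword rules; the real tagger's rule set is not documented here.
RULES = {
    "#claude-code": ("claude code",),
    "#mcp": ("mcp", "model context protocol"),
    "#rag": ("rag", "retrieval-augmented"),
    "#prompt-injection": ("prompt injection",),
    "#pricing": ("pricing",),
}


def suggest_tags(text: str, source: str) -> list[str]:
    """Return a source tag followed by every rule tag whose keyword appears as a whole word."""
    lowered = text.lower()
    tags = [f"#{source}"]
    for tag, keywords in RULES.items():
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            tags.append(tag)
    return tags
```

Whole-word matching keeps a short keyword such as "rag" from firing inside unrelated words like "storage".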

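Save links and screenshots sent by the user into a shared knowledge base (KB) and organize them with tags.

Default KB location

  • KB root (configurable): /home/ubuntu/.openclaw/kb
  • Index: kb/20_Inbox/urls/index.jsonl
  • Per-item directory: kb/20_Inbox/urls/<YYYY-MM>/<item>/content.md + meta.json

Goal: get every item saved first so nothing is lost, then iterate on summaries, tags, and retrieval.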
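
The layout above could be written along the following lines. This is a sketch, not the skill's code: `write_item`, the `slug` argument, and the `path` field added to each index record are hypothetical, and the real meta.json schema is not documented in this listing.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_item(kb_root: Path, slug: str, content: str, meta: dict) -> Path:
    """Write content.md + meta.json for one item and append a record to index.jsonl."""
    urls_dir = kb_root / "20_Inbox" / "urls"
    month = datetime.now(timezone.utc).strftime("%Y-%m")
    item_dir = urls_dir / month / slug
    item_dir.mkdir(parents=True, exist_ok=True)

    (item_dir / "content.md").write_text(content, encoding="utf-8")
    (item_dir / "meta.json").write_text(
        json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # One JSON object per line; "path" is an assumed field pointing back to the item.
    record = {**meta, "path": str(item_dir.relative_to(kb_root))}
    with (urls_dir / "index.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return item_dir
```

Appending to a JSON Lines index keeps ingestion cheap: a new item never requires rewriting earlier records.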

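What to do (by input type)

1) Ingest a regular web page, X (Twitter) post, or WeChat Official Account URL

Run the script:

python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/ingest_url.py "<URL>" --tags "#optional" --note "context"

Behavior:

  • Automatically detects the source (web/x/wechat)
  • Prefers r.jina.ai for extracting the main text (no login required)
  • When a WeChat Official Account article hits anti-bot verification, writes a placeholder entry: status=blocked_verification + tag #needs-manual
  • Deduplicates by key for the same URL (an entry that already exists is skipped)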
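
The behavior listed above can be sketched as three small helpers. The function names, the host rules for X/Twitter, and the hash-based key are assumptions for illustration; only the `https://r.jina.ai/<URL>` form and the mp.weixin.qq.com host come from this page, and ingest_url.py may compute its key differently.

```python
import hashlib
from urllib.parse import urlsplit


def detect_source(url: str) -> str:
    """Classify a URL as wechat, x, or web based on its host name."""
    host = (urlsplit(url).hostname or "").lower()
    if host == "mp.weixin.qq.com":
        return "wechat"
    if host in {"x.com", "twitter.com"} or host.endswith((".x.com", ".twitter.com")):
        return "x"
    return "web"


def url_key(url: str) -> str:
    """Hypothetical dedup key: short hash of the URL with case-normalized host and no fragment."""
    parts = urlsplit(url.strip())
    normalized = parts._replace(
        scheme=parts.scheme.lower(), netloc=parts.netloc.lower(), fragment=""
    ).geturl()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def reader_url(url: str) -> str:
    """Build the r.jina.ai extraction URL for a target page."""
    return "https://r.jina.ai/" + url
```

An ingester could skip any URL whose `url_key` already appears in index.jsonl.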

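Higher WeChat success rate (recommended path)

When a cloud fetch hits an "abnormal environment / verification" page:

  • If a connected macOS node (e.g. Reed-Mac) is available and can reach the article, use nodes.run to perform the fetch on that node (requests+bs4), then write the result to the KB.
  • Note: this path depends on the node being online and on its network environment; 100% success cannot be guaranteed.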
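
When no node is available, the fallback is a placeholder entry. A minimal sketch of that decision follows; the marker string, the `fetched` status value, and the `build_wechat_meta` name are assumptions, while `blocked_verification` and `#needs-manual` are the values this README names.

```python
# Assumed marker text for WeChat's "abnormal environment" verification page.
BLOCK_MARKERS = ("环境异常",)


def build_wechat_meta(url: str, fetched_text: str, tags=()) -> dict:
    """Return item metadata, marking the entry as a placeholder if the fetch was blocked."""
    blocked = any(marker in fetched_text for marker in BLOCK_MARKERS)
    meta = {
        "url": url,
        "source": "wechat",
        # "fetched" is a hypothetical status for the non-blocked case.
        "status": "blocked_verification" if blocked else "fetched",
        "tags": list(tags),
    }
    if blocked and "#needs-manual" not in meta["tags"]:
        meta["tags"].append("#needs-manual")
    return meta
```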

2) Ingest screenshots/images (with OCR text)

Script:

python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/ingest_image.py /path/to/image.jpg \
  --text-file /path/to/ocr.txt \
  --title "..." --tags "#ai #product" --note "..."

Notes:

  • ingest_image.py handles writing to disk and indexing. For OCR, either:
    • use local tesseract (if tesseract-ocr + chi_sim are installed)
    • or have a multimodal LLM extract the text and write it to the file passed as --text-file
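
For the local tesseract option, the OCR text file could be produced with a helper like the one below before calling ingest_image.py. This is a sketch that assumes the `tesseract` binary and the `chi_sim` and `eng` language data are installed; `tesseract_cmd` and `ocr_to_file` are hypothetical names, not part of the skill.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_cmd(image: Path, langs=("chi_sim", "eng")) -> list[str]:
    """Build the tesseract command line; "stdout" makes it print the recognized text."""
    return ["tesseract", str(image), "stdout", "-l", "+".join(langs)]


def ocr_to_file(image: Path, out_txt: Path) -> bool:
    """Run tesseract and save its output; return False if tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        tesseract_cmd(image),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    out_txt.write_text(result.stdout, encoding="utf-8")
    return True
```

The resulting file is what gets passed as --text-file.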

Asking directly in Telegram (search)

Prefer running the script first (on the local machine/server):

python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/search_kb.py --q "claude code" --limit 10
python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/search_kb.py --tags "#claude-code #coding-agent" --limit 20
python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/search_kb.py --source wechat --since 7d --q "Elys"
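
To show how filters like these might be applied to index.jsonl, here is a small sketch. The record fields (`source`, `tags`, `title`, `note`, `created_at`), the all-tags-must-match rule, and the function names are assumptions; search_kb.py itself may behave differently.

```python
import json
import re
from datetime import datetime, timedelta, timezone
from pathlib import Path


def parse_since(value: str, now: datetime) -> datetime:
    """Turn a relative window such as "7d" into an absolute cutoff."""
    match = re.fullmatch(r"(\d+)d", value)
    if not match:
        raise ValueError(f"unsupported --since value: {value!r}")
    return now - timedelta(days=int(match.group(1)))


def matches(entry: dict, q=None, tags=(), source=None, since=None) -> bool:
    """Check one index record against the optional filters (all must pass)."""
    if source and entry.get("source") != source:
        return False
    if tags and not set(tags) <= set(entry.get("tags", [])):
        return False
    if since and datetime.fromisoformat(entry["created_at"]) < since:
        return False
    if q:
        haystack = " ".join([entry.get("title", ""), entry.get("note", "")]).lower()
        if q.lower() not in haystack:
            return False
    return True


def search(index_path: Path, limit: int = 10, **filters) -> list[dict]:
    """Scan index.jsonl line by line and return up to `limit` matching records."""
    hits = []
    with index_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                entry = json.loads(line)
                if matches(entry, **filters):
                    hits.append(entry)
    return hits[:limit]
```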

WeChat Official Account backlog (placeholder entries awaiting re-fetch)

python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/wechat_backlog.py --limit 30

Candidate list for weekly/topic reports (input for an LLM-written summary)

python3 /home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/weekly_digest.py --days 7 --limit 30

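Important notes (security/privacy)

  • Screenshots and web pages may contain tokens, verification codes, or keys: redact them before ingesting (replace with REDACTED).
  • WeChat Official Account fetching is subject to anti-bot controls: allow placeholder entries and fill them in later.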
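
Redaction is described as a manual step. A pre-ingest pass along these lines could catch the most obvious cases; the two patterns and the `redact` name are illustrative assumptions and will miss many real secret formats.

```python
import re

# Illustrative patterns only: "key = value" style secrets and short numeric verification codes.
SECRET_ASSIGNMENT = re.compile(
    r"\b((?:api[_-]?key|token|secret|password)\s*[:=]\s*)\S+", re.IGNORECASE
)
VERIFICATION_CODE = re.compile(
    r"((?:验证码|verification code)\s*[::]?\s*)\d{4,8}", re.IGNORECASE
)


def redact(text: str) -> str:
    """Replace matched secret values with REDACTED, keeping the label for context."""
    text = SECRET_ASSIGNMENT.sub(r"\1REDACTED", text)
    text = VERIFICATION_CODE.sub(r"\1REDACTED", text)
    return text
```

Keeping the label (for example `token=`) makes it clear later that something was removed on purpose.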
Usage Guidance
This skill appears to implement a simple local knowledge-base writer and searcher and is mostly coherent with its description, but review these points before installing:

  • Third-party extractor: ingest_url.py uses https://r.jina.ai/<URL> to extract article text. That sends the target URL (and the extractor will fetch its content) to a third-party service; do not ingest URLs or articles that contain secrets or private tokens unless you accept that risk. Consider replacing r.jina.ai with a local extractor if privacy is required.
  • Claimed macOS node path is not implemented: SKILL.md mentions executing fetches on a connected macOS node (nodes.run) to bypass WeChat cloud blocks. The provided scripts do not implement remote node execution; instead they create placeholder entries for blocked WeChat pages. If you need automatic remote relays, the code does not provide them and the SKILL.md claim is misleading.
  • Local storage & permissions: by default the skill writes to /home/ubuntu/.openclaw/kb. Ensure that directory has appropriate filesystem permissions and that you don't inadvertently store screenshots or pages containing credentials, one-time codes, or other sensitive info. The code includes a reminder to redact tokens, but redaction is manual.
  • Network exposure: the scripts issue HTTP GETs to target URLs and to r.jina.ai via the host running the skill. If the agent runs in an environment with access to internal/intranet hosts, feeding internal URLs will cause external network requests (possible data leakage).
  • Review/validate: because the skill source and homepage are unknown and the package was published by an unfamiliar owner, consider running the scripts in a sandbox, inspecting KB output paths, and optionally forking/modifying the code to use a local extractor or to log fewer details before deploying to production.

If these caveats are acceptable (or you modify the extractor behavior and storage path), the skill looks usable for basic KB ingestion. If you need stronger privacy guarantees, treat it as untrusted until you replace the external extractor and confirm the macOS relay behavior you expect.
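
One way to reduce the network-exposure risk is to reject obviously internal targets before any fetch. The check below is a sketch under stated assumptions: `is_safe_public_url` is a hypothetical helper, the suffix list is illustrative, and DNS names are not resolved, so a public name that points at a private address would still pass.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_public_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is not an obviously internal name or address."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        return False
    host = parts.hostname.lower()
    if host == "localhost" or host.endswith((".local", ".internal", ".localhost")):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; resolving DNS names is out of scope for this sketch.
        return True
    return ip.is_global
```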
Capability Analysis
Type: OpenClaw Skill
Name: knowledge-base-collector
Version: 0.1.3

The skill bundle is classified as suspicious due to potential shell injection vulnerabilities and the use of powerful execution capabilities. The `SKILL.md` instructs the AI agent to execute `python3` scripts with user-provided arguments (URL, tags, notes, image paths). If the agent fails to properly sanitize or escape these arguments before constructing the shell command, it could lead to remote code execution (RCE). Additionally, the `SKILL.md` mentions using `nodes.run` to execute commands on connected macOS nodes, which is a powerful capability that could be abused for unauthorized remote execution if the agent is prompted to run arbitrary commands. While the scripts themselves appear to perform their stated function, the method of execution described in `SKILL.md` introduces significant risks.
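
The shell-injection concern above largely goes away if the caller never builds a shell string. A minimal sketch follows, assuming the caller is Python code that controls how the script is launched; `build_ingest_argv` and `run_ingest` are hypothetical names, while the script path and the --tags/--note flags are the ones shown in SKILL.md.

```python
import subprocess
from urllib.parse import urlsplit

INGEST_SCRIPT = "/home/ubuntu/.openclaw/skills/knowledge-base-collector/scripts/ingest_url.py"


def build_ingest_argv(url: str, tags: str = "", note: str = "") -> list[str]:
    """Build an argument vector; each user value stays a single argv element."""
    if urlsplit(url).scheme not in {"http", "https"}:
        raise ValueError("only http(s) URLs are accepted")
    argv = ["python3", INGEST_SCRIPT, url]
    if tags:
        argv += ["--tags", tags]
    if note:
        argv += ["--note", note]
    return argv


def run_ingest(url: str, tags: str = "", note: str = "") -> int:
    """Run the ingest script without a shell, so metacharacters are never interpreted."""
    return subprocess.run(build_ingest_argv(url, tags, note), check=False).returncode
```

Because the list form of `subprocess.run` bypasses the shell, a note such as `$(whoami)` reaches the script as literal text.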
Capability Assessment
Purpose & Capability
The name/description match the code: scripts ingest URLs and images, write content.md/meta.json entries, index.jsonl, tag entries, and provide search/weekly digest tools. However SKILL.md claims a higher-success WeChat path that uses a connected macOS node ('nodes.run' / Reed-Mac) to fetch blocked articles; the provided scripts contain no implementation of that node-side relay or any nodes.run call. Also the SKILL.md mentions supporting Telegram Q&A flows, but there is no Telegram integration code — only CLI search output suitable to be called by an external Telegram bridge.
Instruction Scope
Instructions stay focused on ingesting URLs/images and writing to a KB on disk. They instruct network fetches (requests) and using r.jina.ai to extract text; they do not ask the agent to read unrelated system files. Caveat: SKILL.md suggests using a macOS node for blocked WeChat fetches, but the code falls back to creating placeholders; that advertised automatic remote execution is not present in the codebase.
Install Mechanism
No install spec; this is instruction + small Python scripts that run with Python + requests. Nothing is downloaded or written outside the KB folder by the code itself. Low install risk.
Credentials
The skill requests no credentials or special env vars. However it makes outbound network requests to third parties: it fetches the target URLs and uses https://r.jina.ai/<URL> as an extraction proxy. That means the target URL (and potentially its content via the proxy) is sent to a third-party service — this is proportional to fetching/extracting content but may leak sensitive URLs or article content (including tokens or screenshots if you later add image-to-LLM OCR). The default KB root (/home/ubuntu/.openclaw/kb) may contain sensitive artifacts; the skill will write files there with no extra access control.
Persistence & Privilege
Skill does not request always:true, does not modify other skills, and only writes files under a single KB tree. It can run autonomously (normal default) but has no elevated platform privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install knowledge-base-collector
  3. After installation, invoke the skill by name or use /knowledge-base-collector
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.3
chore: weekly digest + wechat backlog
v0.1.2
Tagging improvements: more stable 3-layer tag taxonomy (source/type + domain + entity) and added search_kb.py for local KB search by tags/keywords/source/time.
v0.1.1
Improve tagger: richer rule-based tags (agent/coding-agent/mcp/prompt-injection/security/engineering/etc) + language/entity tags.
v0.1.0
Initial release: ingest URLs (web/X/WeChat) + screenshots into a shared KB with tags, per-item markdown+meta, and an index. Supports WeChat node-side fetch (macOS) and placeholder entries when blocked.
Metadata
Slug knowledge-base-collector
Version 0.1.3
License
All-time Installs 3
Active Installs 3
Total Versions 4
Frequently Asked Questions

What is Knowledge Base Collector?

Collect and organize a personal knowledge base from URLs (web/X/WeChat) and screenshots. Use when the user says they want to save a URL, ingest a link, archive content to the KB, tag/classify notes, store screenshots, or search their saved knowledge in Telegram. Supports WeChat via a connected macOS node when cloud fetch is blocked. It is an AI Agent Skill for Claude Code / OpenClaw, with 1061 downloads so far.

How do I install Knowledge Base Collector?

Run "/install knowledge-base-collector" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Knowledge Base Collector free?

Yes, Knowledge Base Collector is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Knowledge Base Collector support?

Knowledge Base Collector is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Knowledge Base Collector?

It is built and maintained by Reed (@reed1898); the current version is v0.1.3.
