Description

Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud...

Usage (SKILL.md)

ArXivKB — Science Knowledge Base

Name: arxivkb
Author: camopel

Why This Skill?

🏠 100% local — crawls arXiv's free API, embeds with Ollama (nomic-embed-text), indexes in FAISS + SQLite. No cloud cost.

🔍 Semantic search on paper content — FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.

📂 arXiv category-based: tracks official arXiv categories (155 available, 8 groups). Crawling is driven by categories, not by free-text queries.

🧹 Auto-cleanup — configurable expiry deletes old papers, PDFs, and chunks.

Install

python3 scripts/install.py

Works on macOS and Linux. Installs Python deps (faiss-cpu, pdfplumber, tiktoken, arxiv, numpy), pulls nomic-embed-text via Ollama, creates data directories and DB.

Prerequisites

Ollama — must be installed and running (ollama serve)
Python 3.10+
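
A quick preflight check for these two prerequisites could look like the sketch below. It assumes Ollama listens on its default address, http://localhost:11434 (the endpoint listed under Configuration). The helper names python_ok and ollama_running are hypothetical and not part of the skill.

```python
import sys
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default endpoint from the Configuration table


def python_ok(min_version=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version


def ollama_running(url=OLLAMA_URL, timeout=2.0):
    """Return True if something answers with HTTP 200 at the Ollama endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False
```

Calling python_ok() and ollama_running() before python3 scripts/install.py gives an early hint if the installer's model pull is likely to fail.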

Quick Start

# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG

# 2. Browse all available categories
akb categories browse

# 3. Ingest recent papers (last 7 days)
akb ingest

# 4. Check stats
akb stats

Ingestion

akb ingest                    # Crawl, download PDFs, chunk, embed
akb ingest --days 14          # Look back 14 days
akb ingest --dry-run          # Preview only
akb ingest --no-pdf           # Index abstracts only (faster)

Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
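
The chunking step in the pipeline above can be sketched as a sliding window over token ids. This is a minimal illustration using the documented 500/50 settings; the function name chunk_tokens is hypothetical, and how the skill's own chunker treats the final partial window is an assumption.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token sequence into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts 450 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = list(range(1200))  # stand-in for token ids from a tokenizer such as tiktoken
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # prints [500, 500, 300]
```

The overlap means the last 50 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.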

Paper Details

akb paper 2401.12345    # Show title, abstract, categories, PDF status

Statistics

akb stats   # Papers, chunks, categories, DB size

Expiry & Cleanup

akb expire               # Delete papers older than 90 days (default)
akb expire --days 30     # Override: delete papers older than 30 days
akb expire --days 30 -y  # Skip confirmation
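
Conceptually, expiry is a cutoff-date delete. The sketch below shows the idea against a reduced version of the papers and chunks tables. It assumes age is measured from created_at stored as ISO-8601 text (the skill may use published instead), and it leaves out the PDF and FAISS cleanup that the real command also performs. The function name expire_papers and the sample arXiv ids are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def expire_papers(conn, days=90, now=None):
    """Delete papers (and their chunks) whose created_at is older than `days` days."""
    now = now or datetime.now(timezone.utc)
    # ISO-8601 strings in one consistent format compare correctly as text.
    cutoff = (now - timedelta(days=days)).isoformat()
    old_ids = [row[0] for row in conn.execute(
        "SELECT id FROM papers WHERE created_at < ?", (cutoff,))]
    for paper_id in old_ids:
        conn.execute("DELETE FROM chunks WHERE paper_id = ?", (paper_id,))
        conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
    conn.commit()
    return len(old_ids)


# Tiny in-memory demo with one old and one recent paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT, created_at TEXT)")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, paper_id INTEGER, text TEXT)")
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
conn.execute("INSERT INTO papers VALUES (1, '2401.00001', ?)",
             ((now - timedelta(days=120)).isoformat(),))
conn.execute("INSERT INTO papers VALUES (2, '2405.00002', ?)",
             ((now - timedelta(days=10)).isoformat(),))
conn.execute("INSERT INTO chunks VALUES (1, 1, 'old chunk')")
removed = expire_papers(conn, days=90, now=now)
print(removed)  # prints 1
```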

Configuration

No config file needed. Defaults:

Setting	Default	Override
Data directory	`~/workspace/arxivkb`	`ARXIVKB_DATA_DIR` env or `--data-dir`
Ollama endpoint	`http://localhost:11434`	— (hardcoded)
Embedding model	`nomic-embed-text` (768d)	— (hardcoded)
Chunk size	500 tokens, 50 overlap	—
Expiry	90 days	`--days` flag
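
The data-directory override in the table above can be sketched as follows. The precedence (flag first, then environment variable, then default) is an assumption, and resolve_data_dir is a hypothetical helper rather than the skill's own code. Note that the security review further down reports that the scripts actually default to ~/Downloads/ArXivKB, so setting ARXIVKB_DATA_DIR explicitly is the safer choice.

```python
import argparse
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/workspace/arxivkb"  # default as documented in the table


def resolve_data_dir(argv=None, environ=None):
    """Pick the data directory: --data-dir flag, then ARXIVKB_DATA_DIR, then the default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--data-dir")
    # parse_known_args ignores the other CLI arguments (subcommands, --days, ...).
    args, _unknown = parser.parse_known_args(argv)
    chosen = args.data_dir or environ.get("ARXIVKB_DATA_DIR") or DEFAULT_DATA_DIR
    return Path(chosen).expanduser()


print(resolve_data_dir([], {"ARXIVKB_DATA_DIR": "/tmp/akb"}))
```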

Data Layout

~/workspace/arxivkb/
├── arxivkb.db           # SQLite: papers, chunks, translations, categories
├── pdfs/                  # Downloaded PDF files ({arxiv_id}.pdf)
└── faiss/
    └── arxivkb.faiss    # FAISS IndexFlatIP (chunk embeddings)

DB Schema

papers: id, arxiv_id, title, abstract, categories, published, status, created_at
chunks: id, paper_id, section, chunk_index, text, faiss_id, created_at
translations: paper_id, language, abstract, created_at (PK: paper_id+language)
categories: code, description, group_name, enabled, added_at (155 entries)
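
For orientation, the schema above could be expressed in SQLite roughly as follows. Only the column names and the composite primary key on translations come from the listing; the column types, the foreign-key references, and the choice of code as the primary key of categories are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY, arxiv_id TEXT, title TEXT, abstract TEXT,
    categories TEXT, published TEXT, status TEXT, created_at TEXT
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY, paper_id INTEGER REFERENCES papers(id),
    section TEXT, chunk_index INTEGER, text TEXT, faiss_id INTEGER, created_at TEXT
);
CREATE TABLE translations (
    paper_id INTEGER REFERENCES papers(id), language TEXT, abstract TEXT, created_at TEXT,
    PRIMARY KEY (paper_id, language)  -- one translation per paper and language
);
CREATE TABLE categories (
    code TEXT PRIMARY KEY, description TEXT, group_name TEXT,
    enabled INTEGER DEFAULT 0, added_at TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # prints ['categories', 'chunks', 'papers', 'translations']
```

The faiss_id column on chunks is what ties a row in SQLite to a vector position in the FAISS index, so a search hit can be mapped back to its text and paper.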

💬 Chat Commands (OpenClaw Agent)

When this skill is installed, the agent recognizes /akb as a shortcut:

Command	Action
`/akb list`	Show enabled categories
`/akb add cs.AI cs.RO`	Enable categories for crawling
`/akb remove cs.AI`	Disable a category
`/akb browse`	Browse all 155 arXiv categories
`/akb browse robotics`	Filter categories by keyword
`/akb stats`	Show paper/chunk/category counts
`/akb help`	Show available commands

The agent runs these via the akb CLI internally.

📱 PrivateApp Dashboard

A companion PWA dashboard is available. It provides:

Semantic search across paper content
Paper detail with abstract translation (on-demand via LLM)
Inline PDF viewing
Category browser
Stats (papers, chunks, categories)

Architecture

scripts/
├── cli.py             # CLI — categories, ingest, paper, stats, expire
├── db.py              # SQLite schema + CRUD
├── arxiv_crawler.py   # arXiv API search + PDF download
├── arxiv_taxonomy.py  # Full arXiv category taxonomy (155 categories)
├── pdf_processor.py   # PDF text extraction + tiktoken chunking
├── embed.py           # Ollama nomic-embed-text (768d, normalized)
├── faiss_index.py     # FAISS IndexFlatIP manager
├── search.py          # Semantic search: query → FAISS → group by paper
└── install.py         # One-command installer
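
The search path (query → FAISS → group by paper) can be illustrated without FAISS. The sketch below is a pure-Python stand-in: it normalizes vectors so that inner product equals cosine similarity, which is why an inner-product index such as IndexFlatIP pairs with normalized embeddings, then scores every chunk exhaustively and keeps the best chunk score per paper. Grouping by maximum score is an assumption about search.py, and all names and vectors here are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query_vec, chunk_vecs, chunk_to_paper, top_k=3):
    """Exhaustive inner-product search over chunks, keeping the best score per paper."""
    q = normalize(query_vec)
    best = {}
    for chunk_id, vec in chunk_vecs.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        paper = chunk_to_paper[chunk_id]
        if score > best.get(paper, float("-inf")):
            best[paper] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy 2-d vectors; real nomic-embed-text embeddings have 768 dimensions.
chunk_vecs = {0: [1.0, 0.0], 1: [0.6, 0.8], 2: [0.0, 1.0]}
chunk_to_paper = {0: "paper-A", 1: "paper-A", 2: "paper-B"}
results = search([1.0, 0.1], chunk_vecs, chunk_to_paper)
print([paper for paper, _ in results])  # prints ['paper-A', 'paper-B']
```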

Security Recommendations

This package appears to be what it says, a local arXiv crawler with FAISS search, but it has several inconsistent implementation details and installs persistent background jobs. Before running the installer or giving it shell access:

1) Inspect scripts/install.py and the generated systemd/launchd files (it writes to ~/.config/systemd/user and ~/Library/LaunchAgents) and confirm you want a daily background ingest.
2) Note the data-directory mismatch: SKILL.md/README mention ~/workspace/arxivkb but the scripts use ~/Downloads/ArXivKB; set ARXIVKB_DATA_DIR or edit the defaults to control where PDFs/DB/index are stored.
3) The systemd/launchd service references a --config {config.json} that the installer does not create; background runs may fail unless you create and populate that config or adapt the service.
4) The installer will pip-install packages and run `ollama pull nomic-embed-text` (a model download); expect network activity and non-trivial disk usage.
5) Run the installer inside a virtual environment if you want to avoid global/user pip changes.
6) Make sure Ollama is installed and that you are running it deliberately, since it accepts local HTTP requests; embedding calls target localhost only.

For higher assurance, run the tool manually (invoke scripts/cli.py directly) instead of activating the installer's automatic timer, and verify paths and config behavior first.
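
To review what the installer left behind, a small script can list candidate service files in the two directories named above. This is only a sketch: the actual unit and plist file names are not given in this listing, so matching on the string "arxivkb" in the file name or contents is an assumption, and find_service_files is a hypothetical helper.

```python
from pathlib import Path

SERVICE_DIRS = [
    Path("~/.config/systemd/user").expanduser(),  # systemd user units (Linux)
    Path("~/Library/LaunchAgents").expanduser(),  # launchd agents (macOS)
]


def find_service_files(dirs=SERVICE_DIRS, needle="arxivkb"):
    """Return files in the given directories whose name or contents mention `needle`."""
    hits = []
    for directory in dirs:
        if not directory.is_dir():
            continue  # directory absent on this platform
        for path in sorted(directory.iterdir()):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue  # unreadable file; skip rather than fail
            if needle in path.name.lower() or needle in text.lower():
                hits.append(path)
    return hits


for path in find_service_files():
    print(path)
```

Any file it prints is worth opening to check the schedule, the command line (including the --config argument), and the data directory before leaving the timer enabled.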

Functional Analysis

Type: OpenClaw Skill
Name: arxivkb
Version: 1.0.1

The skill is classified as suspicious because it uses high-risk capabilities: scripts/install.py installs a persistence mechanism (a systemd timer on Linux, a launchd plist on macOS) and executes external commands via `subprocess.run`. These actions are declared and appear to serve the skill's stated purpose (daily paper ingestion and dependency installation), but they represent a significant attack surface. There is no clear evidence of intentional malicious behavior, such as data exfiltration to unauthorized endpoints or prompt injection attempts against the agent in SKILL.md. The 'suspicious' classification therefore reflects a vulnerability concern (the ability to install system-level services and execute arbitrary commands, even if currently benign) rather than evidence of malice.

Capability Assessment

✓ Purpose & Capability

The skill name/description align with the included scripts: it crawls arXiv, downloads PDFs, extracts and chunks text, embeds via Ollama (nomic-embed-text) and indexes with FAISS/SQLite. Required binaries (python3, ollama) match the design.

ℹ Instruction Scope

Runtime instructions and code operate within the declared purpose (arXiv API + local embedding). However, SKILL.md/README claim defaults and behaviors that do not fully match the code: SKILL.md says default data dir is `~/workspace/arxivkb`, while install.py/cli/db default to `~/Downloads/ArXivKB`. SKILL.md and README mention a `config.json` and `akb` CLI wrapper; the installer writes service/plist that references `--config {config.json}` but the installer does not create that config file or an 'akb' executable in PATH. These mismatches can cause unexpected file placement and failing background jobs.

ℹ Install Mechanism

The registry entry has no formal install spec, but a provided scripts/install.py will run pip installs and call `ollama pull`. That installer will (if executed) pip-install packages (possibly using --user), pull a model from Ollama (network download), create data directories, and write systemd/launchd files. No unusual remote or obfuscated download URLs are used, but the install script performs network operations and writes persistent service files to the user's profile.

✓ Credentials

No secrets or cloud API keys are requested. The only external endpoints contacted are arXiv (public) and a local Ollama server (http://localhost:11434). An optional env var ARXIVKB_DATA_DIR is supported for data directory override. No unrelated credentials or config paths are requested.

⚠ Persistence & Privilege

The installer writes user-level service definitions (systemd timer in ~/.config/systemd/user and launchd plist in ~/Library/LaunchAgents) to schedule daily crawls. This creates persistent background network activity (periodic arXiv downloads and embedding). While expected for a crawler, users should be aware this grants the skill ongoing presence on the host. always:false mitigates global forced inclusion, but the installer still modifies user startup/service configuration.

Version History

v1.0.1

test

v1.0.0

Initial release

Metadata

Slug arxivkb

Version 1.0.1

License: not specified

Total installs 0

Current installs 0

Number of versions 2

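FAQ

What is arxivkb?

Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 665 cumulative downloads to date.

How do I install arxivkb?

Run the command "/install arxivkb" in an OpenClaw or Claude Code conversation to install it in one step, with no additional configuration required.

Is arxivkb free?

Yes, arxivkb is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does arxivkb support?

arxivkb is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed arxivkb?

It is developed and maintained by camopel (@camopel); the current version is v1.0.1.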
