Description

Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud...

README (SKILL.md)

ArXivKB — Science Knowledge Base

Name: arxivkb
Author: camopel

Why This Skill?

🏠 100% local — crawls arXiv's free API, embeds with Ollama (nomic-embed-text), indexes in FAISS + SQLite. No cloud cost.

🔍 Semantic search on paper content — FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.

📂 arXiv category-based — tracks official arXiv categories (155 available, 8 groups). No free-text queries.

🧹 Auto-cleanup — configurable expiry deletes old papers, PDFs, and chunks.

Install

python3 scripts/install.py

Works on macOS and Linux. Installs Python deps (faiss-cpu, pdfplumber, tiktoken, arxiv, numpy), pulls nomic-embed-text via Ollama, creates data directories and DB.

Prerequisites

Ollama — must be installed and running (ollama serve)
Python 3.10+

Quick Start

# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG

# 2. Browse all available categories
akb categories browse

# 3. Ingest recent papers (last 7 days)
akb ingest

# 4. Check stats
akb stats

Ingestion

akb ingest                    # Crawl, download PDFs, chunk, embed
akb ingest --days 14          # Look back 14 days
akb ingest --dry-run          # Preview only
akb ingest --no-pdf           # Index abstracts only (faster)

Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
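The chunking step (500 tokens per chunk, 50 shared with the previous chunk) can be sketched as a sliding window over token IDs. This is a minimal illustration, not the code in pdf_processor.py: the function name `chunk_tokens` is hypothetical, and a plain list of integers stands in for the token IDs that tiktoken would produce.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 500, overlap: int = 50) -> List[List[int]]:
    """Split a token sequence into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Stand-in for a tokenized document (assumption: real IDs would come from tiktoken).
tokens = list(range(1200))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With a step of 450 tokens, the last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still embedded whole in at least one chunk.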

Paper Details

akb paper 2401.12345    # Show title, abstract, categories, PDF status

Statistics

akb stats   # Papers, chunks, categories, DB size

Expiry & Cleanup

akb expire               # Delete papers older than 90 days (default)
akb expire --days 30     # Override: delete papers older than 30 days
akb expire --days 30 -y  # Skip confirmation
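One way to picture the expiry rule is a cutoff timestamp compared against each paper's stored date. The sketch below is an assumption-heavy illustration: it uses `created_at` as the age column (the skill may use `published` instead), stores ISO-8601 UTC strings, and only touches SQLite rows, whereas the real command also removes PDFs and index entries. The function name `expire_papers` is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def expire_papers(conn, days=90, now=None):
    """Delete papers (and their chunks) whose created_at is older than `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).isoformat()
    # ISO-8601 strings in the same timezone sort chronologically, so "<" works on text.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM papers WHERE created_at < ?", (cutoff,))]
    for paper_id in ids:
        conn.execute("DELETE FROM chunks WHERE paper_id = ?", (paper_id,))
        conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
    conn.commit()
    return ids


# Demo on an in-memory database with a reduced, illustrative schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT, created_at TEXT)")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, paper_id INTEGER, text TEXT)")
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old = (now - timedelta(days=120)).isoformat()
recent = (now - timedelta(days=5)).isoformat()
conn.execute("INSERT INTO papers VALUES (1, '0000.00001', ?)", (old,))
conn.execute("INSERT INTO papers VALUES (2, '0000.00002', ?)", (recent,))
conn.execute("INSERT INTO chunks VALUES (1, 1, 'old chunk')")
conn.execute("INSERT INTO chunks VALUES (2, 2, 'recent chunk')")
expired = expire_papers(conn, days=90, now=now)
print(expired)
```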

Configuration

No config file needed. Defaults:

Setting	Default	Override
Data directory	`~/workspace/arxivkb`	`ARXIVKB_DATA_DIR` env or `--data-dir`
Ollama endpoint	`http://localhost:11434`	— (hardcoded)
Embedding model	`nomic-embed-text` (768d)	— (hardcoded)
Chunk size	500 tokens, 50 overlap	—
Expiry	90 days	`--days` flag
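The data-directory override could be resolved along the following lines. The precedence shown here (flag first, then environment variable, then default) is an assumption, since the table lists both overrides without ranking them, and `resolve_data_dir` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_DATA_DIR = "~/workspace/arxivkb"  # default as documented in the table


def resolve_data_dir(cli_value: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the data directory: --data-dir, else ARXIVKB_DATA_DIR, else the default."""
    env = os.environ if env is None else env
    raw = cli_value or env.get("ARXIVKB_DATA_DIR") or DEFAULT_DATA_DIR
    return Path(raw).expanduser()  # expand "~" to the user's home directory


print(resolve_data_dir())
```

Printing the resolved path before the first ingest is a cheap way to confirm where PDFs, the database, and the index will actually be written.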

Data Layout

~/workspace/arxivkb/
├── arxivkb.db           # SQLite: papers, chunks, translations, categories
├── pdfs/                # Downloaded PDF files ({arxiv_id}.pdf)
└── faiss/
    └── arxivkb.faiss    # FAISS IndexFlatIP (chunk embeddings)
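FAISS `IndexFlatIP` performs an exact inner-product search, and on unit-length vectors the inner product equals cosine similarity. The sketch below reproduces that idea in plain Python so it runs without FAISS installed. It is an illustration of the scoring, not the skill's index code, and the helper names `normalize` and `inner_product_search` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product_search(index: Sequence[Sequence[float]],
                         query: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Brute-force top-k search by inner product over already-normalized vectors."""
    q = normalize(query)
    scores = [(i, sum(a * b for a, b in zip(vec, q))) for i, vec in enumerate(index)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 2-dimensional "embeddings"; real nomic-embed-text vectors have 768 dimensions.
index = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
hits = inner_product_search(index, [2.0, 2.0], k=2)
print(hits)
```

Because the query is normalized too, its magnitude does not affect the ranking; only its direction does.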

DB Schema

papers: id, arxiv_id, title, abstract, categories, published, status, created_at
chunks: id, paper_id, section, chunk_index, text, faiss_id, created_at
translations: paper_id, language, abstract, created_at (PK: paper_id+language)
categories: code, description, group_name, enabled, added_at (155 entries)
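As a rough picture of the schema, the tables above could be declared as follows. Only the table names, column names, and the composite primary key on translations come from the listing; the column types, the `UNIQUE` constraint, the foreign-key references, and the default value are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    arxiv_id TEXT UNIQUE,            -- assumed unique per paper
    title TEXT,
    abstract TEXT,
    categories TEXT,
    published TEXT,
    status TEXT,
    created_at TEXT
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    paper_id INTEGER REFERENCES papers(id),
    section TEXT,
    chunk_index INTEGER,
    text TEXT,
    faiss_id INTEGER,                -- row position of the chunk's vector in the FAISS index
    created_at TEXT
);
CREATE TABLE translations (
    paper_id INTEGER REFERENCES papers(id),
    language TEXT,
    abstract TEXT,
    created_at TEXT,
    PRIMARY KEY (paper_id, language)
);
CREATE TABLE categories (
    code TEXT PRIMARY KEY,
    description TEXT,
    group_name TEXT,
    enabled INTEGER DEFAULT 0,
    added_at TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
```

The `faiss_id` column is what ties a search hit back to its text: FAISS returns vector positions, and the chunks table maps each position to a paper.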

💬 Chat Commands (OpenClaw Agent)

When this skill is installed, the agent recognizes /akb as a shortcut:

Command	Action
`/akb list`	Show enabled categories
`/akb add cs.AI cs.RO`	Enable categories for crawling
`/akb remove cs.AI`	Disable a category
`/akb browse`	Browse all 155 arXiv categories
`/akb browse robotics`	Filter categories by keyword
`/akb stats`	Show paper/chunk/category counts
`/akb help`	Show available commands

The agent runs these via the akb CLI internally.

📱 PrivateApp Dashboard

A companion PWA dashboard is available. Provides:

Semantic search across paper content
Paper detail with abstract translation (on-demand via LLM)
Inline PDF viewing
Category browser
Stats (papers, chunks, categories)

Architecture

scripts/
├── cli.py             # CLI — categories, ingest, paper, stats, expire
├── db.py              # SQLite schema + CRUD
├── arxiv_crawler.py   # arXiv API search + PDF download
├── arxiv_taxonomy.py  # Full arXiv category taxonomy (155 categories)
├── pdf_processor.py   # PDF text extraction + tiktoken chunking
├── embed.py           # Ollama nomic-embed-text (768d, normalized)
├── faiss_index.py     # FAISS IndexFlatIP manager
├── search.py          # Semantic search: query → FAISS → group by paper
└── install.py         # One-command installer
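The "query → FAISS → group by paper" step in search.py can be pictured as collapsing chunk-level hits into one score per paper. The sketch below keeps each paper's best chunk score, which is an assumed aggregation rule (the skill might average or sum instead), and `group_hits_by_paper` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple


def group_hits_by_paper(hits: Iterable[Tuple[int, float]],
                        chunk_to_paper: Dict[int, int],
                        top_n: int = 5) -> List[Tuple[int, float]]:
    """Collapse (faiss_id, score) chunk hits into (paper_id, best_score) pairs."""
    best: Dict[int, float] = {}
    for faiss_id, score in hits:
        paper_id = chunk_to_paper.get(faiss_id)
        if paper_id is None:
            continue  # vector without a matching chunk row, e.g. a stale index entry
        if paper_id not in best or score > best[paper_id]:
            best[paper_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Illustrative data: two chunks from paper 1, one from paper 2, one orphaned vector.
hits = [(10, 0.91), (11, 0.85), (20, 0.88), (99, 0.70)]
chunk_to_paper = {10: 1, 11: 1, 20: 2}
results = group_hits_by_paper(hits, chunk_to_paper)
print(results)
```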

Usage Guidance

This package appears to be what it says, a local arXiv crawler with FAISS search, but it has a few sloppy or inconsistent implementation details and will install persistent background jobs. Before running the installer or giving it shell access:

1. Inspect scripts/install.py and the generated systemd/launchd files (it writes to ~/.config/systemd/user and ~/Library/LaunchAgents) and confirm you want a daily background ingest.
2. Note the data-directory mismatch: SKILL.md/README mention ~/workspace/arxivkb but the scripts use ~/Downloads/ArXivKB. Set ARXIVKB_DATA_DIR or edit the defaults to control where PDFs, the DB, and the index are stored.
3. The systemd/launchd service references a --config {config.json} that the installer does not create. Background runs may fail unless you create and populate that config or adapt the service.
4. The installer will pip-install packages and run `ollama pull nomic-embed-text` (a model download), so expect network activity and non-trivial disk usage.
5. Run the installer inside a virtual environment if you want to avoid global/user pip changes.
6. Make sure Ollama is installed and that you intend to run it, since it will accept local HTTP requests; embedding calls target localhost only.

If you want higher assurance, run the tool manually (invoke scripts/cli.py directly) instead of activating the installer's automatic timer, and verify paths and config behavior first.
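Checking for generated service files can be done with a short script like the one below. The two directories come from the guidance above; the `arxivkb` substring used to match file names is a guess, since the exact unit and plist names are not stated, and `find_service_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List


def find_service_files(dirs: Iterable[str], needle: str = "arxivkb") -> List[Path]:
    """List files in the given directories whose name contains `needle` (case-insensitive)."""
    found: List[Path] = []
    for directory in dirs:
        path = Path(directory).expanduser()
        if not path.is_dir():
            continue  # e.g. ~/Library/LaunchAgents does not exist on Linux
        for entry in sorted(path.iterdir()):
            if entry.is_file() and needle in entry.name.lower():
                found.append(entry)
    return found


# Directories named in the guidance; the file-name pattern is an assumption.
SERVICE_DIRS = ["~/.config/systemd/user", "~/Library/LaunchAgents"]
for service_file in find_service_files(SERVICE_DIRS):
    print(service_file)
```

An empty listing after installation would suggest the files use a different name, so it is worth reading scripts/install.py for the actual names rather than relying on the pattern alone.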

Capability Analysis

Type: OpenClaw Skill
Name: arxivkb
Version: 1.0.1

The skill is classified as suspicious due to its use of high-risk capabilities, specifically the installation of a persistence mechanism (systemd timer on Linux, launchd plist on macOS) and the execution of external commands via `subprocess.run` in `scripts/install.py`. While these actions are declared and appear to serve the skill's stated purpose (daily paper ingestion and dependency installation), they represent a significant attack surface. There is no clear evidence of intentional malicious behavior such as data exfiltration to unauthorized endpoints or malicious prompt injection attempts against the agent in SKILL.md. However, the ability to install system-level services and execute arbitrary commands (even if currently benign) warrants a 'suspicious' classification as per the critical distinction between vulnerability and malice.

Capability Assessment

✓ Purpose & Capability

The skill name/description align with the included scripts: it crawls arXiv, downloads PDFs, extracts and chunks text, embeds via Ollama (nomic-embed-text) and indexes with FAISS/SQLite. Required binaries (python3, ollama) match the design.

ℹ Instruction Scope

Runtime instructions and code operate within the declared purpose (arXiv API + local embedding). However, SKILL.md/README claim defaults and behaviors that do not fully match the code. SKILL.md says the default data dir is `~/workspace/arxivkb`, while install.py/cli/db default to `~/Downloads/ArXivKB`. SKILL.md and README mention a `config.json` and an `akb` CLI wrapper; the installer writes a service/plist that references `--config {config.json}`, but it creates neither that config file nor an `akb` executable in PATH. These mismatches can cause unexpected file placement and failing background jobs.

ℹ Install Mechanism

The registry entry has no formal install spec, but a provided scripts/install.py will run pip installs and call `ollama pull`. That installer will (if executed) pip-install packages (possibly using --user), pull a model from Ollama (network download), create data directories, and write systemd/launchd files. No unusual remote or obfuscated download URLs are used, but the install script performs network operations and writes persistent service files to the user's profile.

✓ Credentials

No secrets or cloud API keys are requested. The only external endpoints contacted are arXiv (public) and a local Ollama server (http://localhost:11434). An optional env var ARXIVKB_DATA_DIR is supported for data directory override. No unrelated credentials or config paths are requested.
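To make the local-only embedding traffic concrete, here is a sketch of a request to Ollama's `/api/embeddings` endpoint on localhost. Whether embed.py uses this endpoint or another Ollama route is not stated, so treat the route choice as an assumption; the helper names `build_embedding_request` and `embed` are hypothetical. Only the request is built at import time, so nothing is sent unless `embed` is called.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # endpoint listed in the configuration defaults


def build_embedding_request(text: str, model: str = "nomic-embed-text",
                            base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for one embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> List[float]:
    """Send the request to the local Ollama server and return the embedding vector."""
    with urllib.request.urlopen(build_embedding_request(text), timeout=30) as resp:
        return json.loads(resp.read())["embedding"]


request = build_embedding_request("graph neural networks")
print(request.full_url)
```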

⚠ Persistence & Privilege

The installer writes user-level service definitions (systemd timer in ~/.config/systemd/user and launchd plist in ~/Library/LaunchAgents) to schedule daily crawls. This creates persistent background network activity (periodic arXiv downloads and embedding). While expected for a crawler, users should be aware this grants the skill ongoing presence on the host. always:false mitigates global forced inclusion, but the installer still modifies user startup/service configuration.

Version History

v1.0.1

test

v1.0.0

Initial release

Metadata

Slug arxivkb

Version 1.0.1

License —

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is arxivkb?

Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud... It is an AI Agent Skill for Claude Code / OpenClaw, with 665 downloads so far.

How do I install arxivkb?

Run "/install arxivkb" in the OpenClaw or Claude Code chat to install it in one step. The skill itself still requires Python 3.10+ and a running Ollama instance, as listed under Prerequisites.

Is arxivkb free?

Yes, arxivkb is completely free (open-source). You can download, install and use it at no cost.

Which platforms does arxivkb support?

arxivkb runs anywhere OpenClaw / Claude Code is available; its installer works on macOS and Linux.

Who created arxivkb?

It is built and maintained by camopel (@camopel); the current version is v1.0.1.
