
Arxiv Search

by WILLOSCAR · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads: 403 · Stars: 0 · Active Installs: 8 · Versions: 1
Install in OpenClaw
/install arxiv-search
Description
Retrieve paper metadata from arXiv using keyword queries and save results as JSONL (`papers/papers_raw.jsonl`). **Trigger**: arXiv, arxiv, paper search, meta...
README (SKILL.md)

arXiv Search (metadata-first)

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.

Load Order

Always read:

  • references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

  • assets/domain_packs/llm_agents.json — pinned IDs, query rewrite rules for LLM agent topics

Script Boundary

Use scripts/run.py only for:

  • arXiv API retrieval and XML parsing
  • offline export conversion (CSV/JSON/JSONL normalization)
  • metadata enrichment via id_list backfill

Do not treat run.py as the place for:

  • hardcoded topic detection or query rewriting (use domain packs)
  • domain-specific pinned paper lists (externalize to assets/domain_packs/)

Input

  • queries.md (keywords, excludes, time window)

Outputs

  • papers/papers_raw.jsonl (JSONL; 1 paper per line)
    • Each record includes at least: title, authors, year, url, abstract
    • In online mode (arXiv API), records also include additional metadata: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment
  • Convenience index (optional but generated by the script):
    • papers/papers_raw.csv
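
To make the record shape above concrete, here is a minimal sketch of turning an arXiv API Atom response into records with those fields. It is not the logic in scripts/run.py; the function names, the sample feed, and the choice to keep the version suffix in arxiv_id are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespaces used in arXiv API Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _text(node, path):
    """Return the whitespace-collapsed text of a child element, or ''."""
    child = node.find(path, NS)
    return " ".join(child.text.split()) if child is not None and child.text else ""


def parse_feed(xml_text):
    """Convert an Atom feed (str or bytes) into a list of record dicts."""
    records = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", NS):
        url = _text(entry, "atom:id")  # e.g. http://arxiv.org/abs/0000.00000v1
        published = _text(entry, "atom:published")
        pdf_url = ""
        for link in entry.findall("atom:link", NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href", "")
        primary = entry.find("arxiv:primary_category", NS)
        records.append({
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "year": int(published[:4]) if published[:4].isdigit() else None,
            "url": url,
            "abstract": _text(entry, "atom:summary"),
            # Assumption: the ID keeps its version suffix (v1, v2, ...).
            "arxiv_id": url.rsplit("/abs/", 1)[-1],
            "pdf_url": pdf_url,
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "primary_category": primary.get("term") if primary is not None else "",
            "published": published,
            "updated": _text(entry, "atom:updated"),
            "doi": _text(entry, "arxiv:doi"),
            "journal_ref": _text(entry, "arxiv:journal_ref"),
            "comment": _text(entry, "arxiv:comment"),
        })
    return records


# Made-up feed for illustration only; this is not a real paper.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2024-01-02T00:00:00Z</published>
    <updated>2024-01-03T00:00:00Z</updated>
    <title>A Made-Up
      Example Title</title>
    <summary>Placeholder abstract text.</summary>
    <author><name>Ada Example</name></author>
    <author><name>Bob Sample</name></author>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.AI"/>
    <category term="cs.AI"/>
    <category term="cs.CL"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(parse_feed(SAMPLE_FEED)[0]["title"])
```

Fields that the feed omits (doi, journal_ref, comment in the sample) come back as empty strings, so every record carries the same keys.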

Decision: online vs offline

  • If you have network access: run arXiv API retrieval.
  • If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
  • Hybrid: if you import offline and gain network access later, you can enrich missing fields (abstract/authors/categories) via arXiv id_list, using either --enrich-metadata or enrich_metadata: true in queries.md.
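
As a rough illustration of the hybrid path, the sketch below builds arXiv API request URLs that use the id_list parameter and merges fetched metadata into an imported record without overwriting values that are already present. The helper names, the batch size, and the "fill only empty fields" policy are assumptions, not documented behavior of --enrich-metadata.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def id_list_urls(arxiv_ids, batch_size=50):
    """Yield one API URL per batch of IDs (batch_size is an arbitrary choice)."""
    for start in range(0, len(arxiv_ids), batch_size):
        batch = arxiv_ids[start:start + batch_size]
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        # Keep commas readable in the query string instead of encoding them as %2C.
        yield f"{API_URL}?{urlencode(params, safe=',')}"


def backfill(record, fetched):
    """Return a copy of record with empty or missing fields taken from fetched."""
    merged = dict(record)
    for key, value in fetched.items():
        if merged.get(key) in (None, "", []):
            merged[key] = value
    return merged


if __name__ == "__main__":
    for url in id_list_urls(["2509.02547"]):
        print(url)
```

Filling only empty fields keeps whatever the user's export already supplied, which matches the "best effort" wording: enrichment adds data but never replaces it.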

Workflow (heuristic)

  1. Read queries.md and expand into concrete query strings.
  2. Retrieve results (online) or import an export (offline).
  3. Normalize every record to include at least:
    • title, authors (array), year, url, abstract
  4. Keep the set broad at this stage; dedupe/ranking comes next.
  5. Apply time window and max_results if specified.
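
Step 3 can be sketched as follows for an offline export row (for example, a row from csv.DictReader). The column aliases, the semicolon author separator, and the year extraction rule are all hypothetical; the actual mapping used by scripts/run.py may differ.

```python
import re

# Hypothetical column aliases; real exports may use other header names.
ALIASES = {
    "title": ("title", "Title"),
    "authors": ("authors", "author", "Authors"),
    "year": ("year", "Year", "published"),
    "url": ("url", "URL", "link"),
    "abstract": ("abstract", "Abstract", "summary"),
}


def first_value(row, names):
    """Return the first non-empty value found under any of the given names."""
    for name in names:
        value = row.get(name)
        if value not in (None, ""):
            return value
    return ""


def normalize_row(row):
    """Map one export row onto title, authors (array), year, url, abstract."""
    authors = first_value(row, ALIASES["authors"])
    if isinstance(authors, str):
        # Assumption: multiple authors in one cell are separated by semicolons.
        authors = [name.strip() for name in authors.split(";") if name.strip()]
    year_match = re.search(r"(19|20)\d{2}", str(first_value(row, ALIASES["year"])))
    return {
        "title": " ".join(str(first_value(row, ALIASES["title"])).split()),
        "authors": list(authors),
        "year": int(year_match.group(0)) if year_match else None,
        "url": str(first_value(row, ALIASES["url"])).strip(),
        "abstract": " ".join(str(first_value(row, ALIASES["abstract"])).split()),
    }
```

Missing values become empty strings, an empty list, or None rather than raising, so an incomplete export still produces one record per row and the gaps can be backfilled later.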

Quality checklist

  • papers/papers_raw.jsonl exists.
  • Each line is valid JSON and contains title, authors, year, url.
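
The checklist above can be automated with a small checker along these lines. It is a sketch, not part of the skill: the function name and the problem messages are made up for illustration, and the required keys are the four listed in the checklist.

```python
import json

REQUIRED = ("title", "authors", "year", "url")


def check_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means all lines pass."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            problems.append((number, "missing: " + ", ".join(missing)))
    return problems

# Typical use (assumes the file exists in the current workspace):
#   with open("papers/papers_raw.jsonl", encoding="utf-8") as handle:
#       print(check_jsonl_lines(handle))
```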

Side effects

  • Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.
  • Not allowed: write prose sections in output/ before writing is approved.

Script

Quick Start

  • python scripts/run.py --help
  • Online: python scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200
  • Offline import: python scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>

All Options

  • --query <q>: repeatable; multiple queries are unioned
  • --exclude <term>: repeatable; excludes applied after retrieval
  • --max-results <n>: cap total retrieved
  • --input <export.*>: offline mode (CSV/JSON/JSONL)
  • --enrich-metadata: best-effort enrich via arXiv id_list (needs network)
  • queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
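
The "unioned" and "applied after retrieval" semantics of --query and --exclude can be pictured with the sketch below. It assumes that the union drops only exact identifier repeats and that exclude terms are matched case-insensitively against title and abstract; neither detail is confirmed by the options list, and both function names are hypothetical.

```python
def union_results(result_lists):
    """Merge per-query result lists, keeping the first record per identifier."""
    seen = set()
    merged = []
    for results in result_lists:
        for record in results:
            # Assumption: arxiv_id, then url, then title identifies a record.
            key = record.get("arxiv_id") or record.get("url") or record.get("title")
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged


def apply_excludes(records, terms):
    """Drop records whose title or abstract contains any exclude term."""
    lowered = [term.lower() for term in terms]
    kept = []
    for record in records:
        haystack = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
        if not any(term in haystack for term in lowered):
            kept.append(record)
    return kept
```

Only exact identifier repeats are removed here, so the set stays broad; fuzzy deduplication and ranking are left to the later stage.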

Examples

  • Online (multi-query + excludes):
    • python scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300
  • Fetch a single paper by arXiv ID (direct id_list fetch):
    • python scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1
  • Offline auto-detect (no flags):
    • Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python scripts/run.py --workspace <ws>
  • Offline import + time window (via queries.md):
    • Add the line "- time window: { from: 2022, to: 2025 }" to queries.md, then run the offline import as usual
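
A time window such as { from: 2022, to: 2025 } combined with max_results could be applied to normalized records roughly as follows. Treating both bounds as inclusive and keeping records with an unknown year are assumptions, and the function names are hypothetical.

```python
def in_time_window(record, year_from=None, year_to=None):
    """Return True if the record's year falls inside the (inclusive) window."""
    year = record.get("year")
    if not isinstance(year, int):
        # Assumption: records without a usable year are kept, not dropped.
        return True
    if year_from is not None and year < year_from:
        return False
    if year_to is not None and year > year_to:
        return False
    return True


def apply_window_and_cap(records, year_from=None, year_to=None, max_results=None):
    """Filter by time window, then truncate to max_results if one is given."""
    kept = [r for r in records if in_time_window(r, year_from, year_to)]
    return kept[:max_results] if max_results is not None else kept
```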

Troubleshooting

Common Issues

Issue: papers/papers_raw.jsonl is empty

Symptom:

  • Script exits with “No results returned …” or output file is empty.

Causes:

  • Network is blocked (online mode).
  • Queries are too narrow or queries.md is empty.

Solutions:

  • Use offline import: place papers/import.csv|json|jsonl in the workspace or pass --input.
  • Broaden keywords and reduce excludes in queries.md.
  • Run with explicit --query to sanity-check the parser.

Issue: Offline import records are missing fields

Symptom:

  • Downstream steps fail because records are missing authors/year/abstract/url.

Causes:

  • Export columns don’t match expected fields; upstream export is incomplete.

Solutions:

  • Ensure the export contains at least title, authors, year, url, abstract.
  • If you later have network, use --enrich-metadata to backfill missing fields (best effort).

Recovery Checklist

  • Confirm queries.md has non-empty keywords (or pass --query).
  • If offline: confirm workspace has papers/import.* and rerun.
  • Spot-check 3–5 JSONL lines: valid JSON + required fields.
Usage Guidance
This skill appears to do exactly what it says: query arXiv (or normalize offline exports) and write a JSONL index under the workspace. Before installing or running:
  1. Ensure you trust the workspace path the skill will write to (it will create/overwrite papers/papers_raw.jsonl and a CSV index).
  2. Verify that Python is available and that you are OK with network calls to export.arxiv.org/arxiv.org when running online.
  3. Review scripts/run.py locally if you need assurance (it contains the API calls and normalization logic).
  4. If you plan to feed offline exports, use only trusted exports to avoid garbage input.
  5. Note that the skill can be invoked autonomously by the agent (default); to restrict autonomous runs, adjust agent invocation policies accordingly.
Capability Analysis
Type: OpenClaw Skill
Name: arxiv-search
Version: 1.0.0
The arxiv-search skill bundle is a legitimate tool designed for retrieving and processing research paper metadata. The primary script, `scripts/run.py`, interacts exclusively with the official arXiv API (export.arxiv.org) using standard Python libraries and includes robust logic for handling both online retrieval and offline data imports. The shared utilities in the `tooling/` directory, such as `executor.py` and `quality_gate.py`, provide necessary infrastructure for the OpenClaw agentic framework, including automated execution of sub-scripts and extensive validation of research artifacts (e.g., citation density and structural integrity). No evidence of malicious behavior, such as data exfiltration, unauthorized system access, or harmful prompt injection, was detected.
Capability Assessment
Purpose & Capability
Name/description (arXiv metadata retrieval) matches the included scripts and assets. The skill only requires a Python runtime and reads/writes workspace files (queries.md, papers/*). Domain-pack JSON files and pipeline docs are coherent with retrieval/query-rewrite behavior.
Instruction Scope
SKILL.md confines actions to reading queries.md, domain packs in the repo, doing online arXiv API calls or offline import conversion, and writing papers/papers_raw.jsonl (and optional CSV index). It does not instruct the agent to read unrelated system files or to transmit data to external endpoints other than arXiv (export.arxiv.org / arxiv.org) and does not request broad discretionary data collection.
Install Mechanism
No install spec — the skill is delivered as Python scripts and documentation and expects python/python3 on PATH. No external downloads or archive extraction are specified.
Credentials
The skill declares no required environment variables or credentials. Its network access (to arXiv) is appropriate for the stated purpose. No unexpected secrets or unrelated service tokens are requested.
Persistence & Privilege
The skill is not force-enabled (always:false) and does not request modifications to other skills or system-wide configuration. Autonomous invocation is allowed (default) but that is normal and not combined with broad privileges or secret access.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install arxiv-search
  3. After installation, invoke the skill by name or use /arxiv-search
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of arxiv-search.
- Enables retrieval of arXiv paper metadata using keyword queries.
- Supports both online (arXiv API) and offline (CSV/JSON/JSONL import) workflows.
- Outputs normalized results to `papers/papers_raw.jsonl` with key metadata fields.
- Provides optional field enrichment via arXiv `id_list` if network is available.
- Includes troubleshooting and quality guidance for smooth integration.
Metadata
Slug arxiv-search
Version 1.0.0
License MIT-0
All-time Installs 9
Active Installs 8
Total Versions 1
Frequently Asked Questions

What is Arxiv Search?

Retrieve paper metadata from arXiv using keyword queries and save results as JSONL (`papers/papers_raw.jsonl`). **Trigger**: arXiv, arxiv, paper search, meta... It is an AI Agent Skill for Claude Code / OpenClaw, with 403 downloads so far.

How do I install Arxiv Search?

Run "/install arxiv-search" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Arxiv Search free?

Yes, Arxiv Search is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Arxiv Search support?

Arxiv Search is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Arxiv Search?

It is built and maintained by WILLOSCAR (@willoscar); the current version is v1.0.0.
