Description

Model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters relevance, and merges deduplicated results by language.

README (SKILL.md)

# ArXiv Search Collector

Name: Arxiv Search Collector
Author: xukp20

Use this skill when you want model-led query planning and model-led relevance filtering.

## Core Principle

Scripts are tools. The model performs the reasoning and decisions:

1. Expand the original topic into multiple focused queries.
2. Run one fetch command per query.
3. Read each query result list and decide keep indexes.
4. Merge kept items and dedupe with one script.

## Step 1: Initialize Run

```bash
python3 scripts/init_collection_run.py \
  --output-root /path/to/data \
  --topic "LLM applications in Lean 4 formalization" \
  --keywords "Lean 4,LLM,formalization" \
  --categories "cs.AI,cs.LO" \
  --target-range 5-10 \
  --lookback 30d \
  --language English
```

This creates a run directory with `task_meta.json`, `task_meta.md`, `query_results/`, and `query_selection/`.

## Language Parameter

- `--language` must be set manually for each collection run.
- Use the same language value across all collector scripts for consistency.
- If `--language` is non-English (for example `Chinese`), generated markdown files are written in that language:
  - `task_meta.md`
  - `query_results/<label>.md`
  - `<arxiv_id>/metadata.md`
  - `papers_index.md`

## Query Writing Requirements

Follow these rules before running per-query fetch:

1. Determine query count from final target range.
- Prefer `3` queries for small/medium targets (`2-5`, `5-10`).
- Prefer `4` queries for larger targets (`10-50` or above).
- Avoid writing too many low-quality queries.

2. Allocate target budget to each query, then oversample.
- Let `target_max` be the upper bound in target range.
- Compute `target_per_query = ceil(target_max / query_count)`.
- Fetch each query with `max_results = target_per_query * 2` (or `* 3` when recall is more important).
- Example: target `5-10`, query count `3` -> `target_per_query=4` -> each query fetches `8-12`.

3. Keep one original-theme query, then add normalized/synonym expansions.
- Query 1 keeps original topic wording.
- Remaining queries use normalized terms and close synonyms.
- Prefer concise noun phrases that match arXiv indexing behavior.

4. Use `OR` inside the same semantic group (synonyms), and `AND` across groups.
- Same-group synonyms should be connected with `OR` to increase recall.
  - Example group A (model terms): `LLM OR "large language model" OR AI`.
  - Example group B (Lean terms): `"Lean 4" OR Lean OR "formal language"`.
- Different semantic groups should be connected with `AND` to keep relevance.
  - Example: `(LLM-group) AND (Lean-group)`.
- Recommended pattern:
  - `(<domain terms with OR>) AND (<method/model terms with OR>) [AND <optional constraint terms>]`

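The budget arithmetic in rule 2 can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name `plan_fetch_budget` and its parameters are hypothetical and are not taken from the bundled scripts.

```python
import math


def plan_fetch_budget(target_range: str, query_count: int, oversample: int = 2) -> tuple[int, int]:
    """Return (target_per_query, max_results) for a range such as "5-10".

    Hypothetical helper: it mirrors the rule in the text, not the scripts' code.
    """
    if query_count < 1 or oversample < 1:
        raise ValueError("query_count and oversample must be positive")
    # target_max is the upper bound of the target range.
    target_max = int(target_range.split("-")[-1])
    target_per_query = math.ceil(target_max / query_count)
    return target_per_query, target_per_query * oversample


print(plan_fetch_budget("5-10", 3))     # (4, 8)
print(plan_fetch_budget("5-10", 3, 3))  # (4, 12)
```

The two calls reproduce the `5-10` example: 4 papers per query, fetched as 8 results at `* 2` or 12 results at `* 3`.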
### Query Examples (arXiv API-ready)

Theme A: `LLM applications in Lean 4 formalization`
- `all:"LLM applications in Lean 4 formalization"`
- `(all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")`
- `(all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving"`
- `(all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM")`

Theme B: `agentic tool use for code generation`
- `all:"agentic tool use code generation"`
- `(all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model")`
- `(all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")`

Theme C: `multimodal reasoning with retrieval`
- `all:"multimodal reasoning retrieval"`
- `(all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG")`
- `(all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")`

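The `OR`-within-group, `AND`-across-groups pattern used in these examples is mechanical enough to generate from term lists. The sketch below is one possible way to do that; `or_group` and `build_query` are hypothetical helper names, and the model can just as well write the query strings by hand.

```python
def or_group(terms, field="all"):
    """Join the synonyms of one semantic group with OR, quoting each term."""
    if not terms:
        raise ValueError("a semantic group needs at least one term")
    return "(" + " OR ".join(f'{field}:"{term}"' for term in terms) + ")"


def build_query(*groups, field="all"):
    """Join several semantic groups with AND to keep the query relevant."""
    return " AND ".join(or_group(group, field) for group in groups)


query = build_query(
    ["Lean 4", "Lean", "formal language"],    # domain terms
    ["LLM", "large language model", "AI"],    # model terms
)
print(query)
# (all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")
```

The printed string matches the second Theme A query.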
## Step 2: Fetch One Query at a Time

Model defines queries manually, for example:

- `all:"Lean 4"`
- `all:"LLM formalization"`
- `all:"AI formal verification"`

Recommended batch mode (safe defaults, serial execution):

```bash
python3 scripts/fetch_queries_batch.py \
  --run-dir /path/to/run-dir \
  --plan-json /path/to/query_plan.json
```

In batch mode, the script auto-applies:

- serial API calls
- `--min-interval-sec 5`
- `--retry-max 4`
- `--retry-base-sec 5`
- `--retry-max-sec 120`
- `--retry-jitter-sec 1`
- per-run rate-state file (`<run_dir>/.runtime/arxiv_api_state.json`) for throttling
- auto `max_results` from `target_range` and query count (default oversample `x2`, cap `60`)
- default language/categories from `task_meta.json`

Minimal `query_plan.json` only needs `label` and `query`.
See `references/query-plan-format.md`.
You normally do not need to set fetch-control args manually.

If you need one-by-one manual fetch, run each query:

```bash
python3 scripts/fetch_query_metadata.py \
  --run-dir /path/to/run-dir \
  --label lean4 \
  --query 'all:"Lean 4"' \
  --max-results 30 \
  --min-interval-sec 5 \
  --retry-max 4 \
  --language English
```

Output files:

- `query_results/<label>.json` (indexed full metadata list)
- `query_results/<label>.md` (human-readable preview)

Date range is applied directly in arXiv API `search_query` via `submittedDate:[... TO ...]`.
No second local date-filter pass is performed.

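To make the date-range behavior concrete, the following sketch appends a `submittedDate` window to a query and builds a request URL for the public arXiv API. It assumes the window is joined to the query with `AND` and uses the `YYYYMMDDHHMM` (GMT) timestamp form from the arXiv API manual; how `fetch_query_metadata.py` actually composes the clause is not shown in this document, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def with_date_window(query: str, now: datetime, lookback_days: int) -> str:
    """Restrict a query to submissions from the last `lookback_days` days."""
    start = now - timedelta(days=lookback_days)
    fmt = "%Y%m%d%H%M"  # YYYYMMDDHHMM, interpreted by arXiv as GMT
    window = f"submittedDate:[{start.strftime(fmt)} TO {now.strftime(fmt)}]"
    return f"({query}) AND {window}"


def build_url(search_query: str, max_results: int) -> str:
    """URL-encode the query parameters for the arXiv API endpoint."""
    params = {"search_query": search_query, "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)  # fixed time for a reproducible example
dated_query = with_date_window('all:"Lean 4"', now, lookback_days=30)
print(dated_query)
# (all:"Lean 4") AND submittedDate:[202501011200 TO 202501311200]
print(build_url(dated_query, max_results=8))
```

Because the window is part of `search_query`, the server does the date filtering, which is consistent with the note that no second local pass is needed.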
Rate-limit controls in `fetch_query_metadata.py`:

- `--min-interval-sec` (default `5.0`)
- `--retry-max` (default `4`)
- `--retry-base-sec` (default `5.0`)
- `--retry-max-sec` (default `120.0`)
- `--retry-jitter-sec` (default `1.0`)
- `--rate-state-path` (optional override; default is `<run_dir>/.runtime/arxiv_api_state.json`)
- `--force` to bypass cache and re-fetch

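One plausible reading of the retry flags is capped exponential backoff with random jitter. The sketch below illustrates that reading only; this document does not state the formula the script uses, so the doubling rule and the function `retry_delay` are assumptions.

```python
import random


def retry_delay(attempt, base_sec=5.0, max_sec=120.0, jitter_sec=1.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (1-based).

    Assumption: the delay doubles on each attempt, is capped at max_sec,
    and gets up to jitter_sec of random extra wait.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    delay = min(base_sec * 2 ** (attempt - 1), max_sec)
    return delay + rng() * jitter_sec


# With jitter disabled, the four default retries would wait 5.0, 10.0, 20.0, 40.0 seconds.
for attempt in range(1, 5):
    print(attempt, retry_delay(attempt, jitter_sec=0.0))
```

Under this assumption the cap of `120` is never reached with `--retry-max 4`; it only matters when the retry count or the base delay is raised.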
## Step 3: Model Filters Relevance

For each query list, the model reads indexed results and decides what to keep.

Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in `selection-json`.

## Step 4: Merge and Dedupe

```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --keep lean4:0,2,4 \
  --keep llm-formalization:1,3 \
  --language English
```

or with `selection-json`:

```json
{
  "lean4-round1": [0, 2, 4],
  "lean4-round2": [],
  "formalization-round2": [1, 3, 5]
}
```

An empty list means this query label is intentionally dropped (`keep 0`).

This writes final outputs:

- `<arxiv_id>/metadata.json`
- `<arxiv_id>/metadata.md`
- `papers_index.json`
- `papers_index.md`

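Conceptually, the merge step picks the kept indexes from each query list and drops papers that appear under more than one label. The sketch below shows that idea on in-memory data. It is not the logic of `merge_selected_papers.py`: the `arxiv_id` field name, the placeholder IDs, and the first-occurrence-wins rule are assumptions, and the real schema is defined in `references/io-contract.md`.

```python
def merge_selected(results_by_label, keep_by_label):
    """Collect kept items per label and dedupe by arXiv ID (first occurrence wins)."""
    merged = {}
    for label, keep_indexes in keep_by_label.items():
        items = results_by_label[label]
        for index in keep_indexes:
            paper = items[index]
            # setdefault keeps the first copy and ignores later duplicates.
            merged.setdefault(paper["arxiv_id"], paper)
    return list(merged.values())


# Placeholder records; the IDs are made up for illustration.
results = {
    "lean4": [{"arxiv_id": "id-a"}, {"arxiv_id": "id-b"}, {"arxiv_id": "id-c"}],
    "llm-formalization": [{"arxiv_id": "id-c"}, {"arxiv_id": "id-d"}],
}
kept = merge_selected(results, {"lean4": [0, 2], "llm-formalization": [0, 1]})
print([paper["arxiv_id"] for paper in kept])  # ['id-a', 'id-c', 'id-d']
```

An empty keep list contributes nothing to the result, which matches the "intentionally dropped" meaning of `[]`.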
## Step 5: Iterative Retry Loop (Incremental)

If relevance is weak or final count is insufficient after Step 4, iterate:

1. Review `papers_index.md` and per-paper metadata quality.
2. Adjust query plan (usually broaden with additional synonym `OR` terms, keep cross-group `AND` constraints).
3. Fetch additional query results with new labels.
4. Re-run merge in incremental mode:

```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --incremental \
  --selection-json /path/to/updated_selection.json \
  --language English
```

Incremental behavior:

- Previous label selections are loaded from `query_selection/selected_by_query.json`.
- Labels provided in the new `selection-json` override previous selections for those labels.
- New labels can be added.
- Old labels can be dropped by setting `[]`.

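The override rules listed above behave like a dictionary update keyed by label. The following sketch shows the resulting selection for a small example; `merge_selections` is a hypothetical name, and the script's own implementation may differ in details such as validation.

```python
def merge_selections(previous, update):
    """Apply a new selection on top of the previous one, label by label."""
    merged = dict(previous)  # copy so the previous selection is not mutated
    merged.update(update)    # labels in the update override; new labels are added
    return merged


previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1]}
update = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}

merged = merge_selections(previous, update)
print(merged)
# {'lean4-round1': [0, 2, 4], 'lean4-round2': [], 'formalization-round2': [1, 3, 5]}

# Labels set to [] stay recorded but contribute no papers.
active = {label: keep for label, keep in merged.items() if keep}
print(sorted(active))  # ['formalization-round2', 'lean4-round1']
```

Labels that the update does not mention keep their earlier selection, so each iteration only has to list what changed.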
Stop retrying when:

- relevance is acceptable, or
- additional broadened queries mainly add low-relevance papers.

If relevant papers are genuinely scarce, it is valid to finish below the original minimum target range.

## Notes

- Keep API concurrency conservative by controlling query count and `--max-results`.
- Keep per-query fetch serial (no parallel API calls in Stage A).
- Reuse cache by default for identical query/date/request settings; only use `--force` when necessary.
- Prefer default run-local rate-state so all steps in the same run share one cooldown/throttling state.
- If arXiv API returns `429 Too Many Requests`, retry later and/or increase `--min-interval-sec`.
- Prefer explicit, narrow queries and let the model filter aggressively.
- Use `references/io-contract.md` for exact files and schema.

## Related Skills

This skill is a sub-skill of `arxiv-summarizer-orchestrator`.

Pipeline position:

1. Step 1 (collection): `arxiv-search-collector` (this skill)
2. Step 2 (per-paper processing): `arxiv-paper-processor`
3. Step 3 (batch reporting): `arxiv-batch-reporter`

This skill produces the initial paper-set structure and metadata that Stage B and Stage C depend on.

Usage Guidance

This skill appears to do what it claims: batch and single-query fetches from arXiv, plus merging/deduping. Before running it, pick an explicit dedicated output root (do not point --output-root at a system or sensitive directory). Treat plan.json and per-query labels as trusted inputs — avoid running the scripts on untrusted plans since labels and keep-IDs are used verbatim when creating/removing files and directories (the code lacks strong filename sanitization). If you expect to run untrusted plans, run the collector inside a constrained environment (container or sandbox) or inspect/normalize labels first. Finally, the only network calls are to export.arxiv.org; if you need assurance, review the run_dir contents after a fetch and before running merges.
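If you do want to normalize labels before handing a plan to the scripts, a small pre-processing step such as the one below is one option. It is a sketch that is independent of the bundled code: `normalize_label` is a hypothetical helper, and the allowed character set (lowercase letters, digits, hyphens) is an assumption chosen to rule out path separators and `..` segments.

```python
import re


def normalize_label(label: str) -> str:
    """Reduce a label to lowercase letters, digits, and single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    if not slug:
        raise ValueError(f"label {label!r} has no usable characters")
    return slug


print(normalize_label("lean4-round1"))       # lean4-round1
print(normalize_label("LLM Formalization"))  # llm-formalization
print(normalize_label("../../etc/passwd"))   # etc-passwd
```

Well-formed labels pass through unchanged, while anything that could be read as a path component is flattened into a plain file-name fragment.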

Capability Analysis

Type: OpenClaw Skill
Name: arxiv-search-collector
Version: 0.1.1

The OpenClaw AgentSkills bundle is classified as benign. All scripts and documentation align with the stated purpose of collecting arXiv paper metadata. The Python scripts (`init_collection_run.py`, `fetch_queries_batch.py`, `fetch_query_metadata.py`, `merge_selected_papers.py`) perform expected file system operations (creating/deleting directories, writing/reading JSON/Markdown files within a designated run directory) and network requests (to the official arXiv API). Input sanitization (e.g., `slugify` for paths, `urlencode` for API queries, `subprocess.run` with a list of arguments) and file system safeguards (e.g., `is_relative_to` check before `shutil.rmtree`) are appropriately implemented. There is no evidence of data exfiltration, unauthorized command execution, persistence mechanisms, obfuscation, or prompt injection attempts in the `SKILL.md` or other documentation.

Capability Assessment

✓ Purpose & Capability

The skill name/description (model-guided arXiv collection) matches the included scripts and SKILL.md: initializing a run directory, composing queries, fetching results from the arXiv API, letting a model select keep indexes, and merging/deduping results. Required resources (none) are proportionate to the stated purpose.

ℹ Instruction Scope

Runtime instructions direct the agent to run the included Python scripts and write/read files under a user-specified run directory; that is expected. The scripts perform HTTP calls only to the official arXiv API (export.arxiv.org). One behavioral note: labels and keep-IDs passed to the scripts are used directly to form filenames and directories without robust sanitization, so if you run these scripts with untrusted inputs they could create or remove files outside the intended query_results/ or per-paper directories (path traversal-ish behavior). This is an implementation-level safety concern but not evidence of hidden exfiltration or unrelated behavior.

✓ Install Mechanism

No install spec; this is instruction-and-script only. That minimizes supply-chain risk. The bundled Python scripts rely on the standard library and perform local file I/O and urllib calls; nothing is downloaded or executed from external/untrusted URLs at install time.

✓ Credentials

The skill requests no environment variables or external credentials. The scripts use only values passed on the command line and data from the run directory; this is proportionate to querying the public arXiv API.

✓ Persistence & Privilege

`always` is false and the skill does not request persistent platform privileges. The scripts create, update, and may delete files under the user-provided run directory (normal for this tool). There is no modification of other skills' configs or system-wide agent settings in the bundle.

Version History

v0.1.1

Document cross-skill relationships in all SKILL.md files

v0.1.0

- Initial release of arxiv-search-collector.
- Provides a model-driven workflow for arXiv search: model plans queries and filters relevance.
- Supports manual language selection for multilingual markdown output.
- Includes scripts to initialize runs, fetch queries in batch or individually, and merge/dedupe metadata.
- Emphasizes model-authored query planning and per-query relevance selection over rule-based heuristics.
- Documents iterative loop for refining search results with incremental query and selection adjustments.

Metadata

Slug arxiv-search-collector

Version 0.1.1

License —

All-time Installs 16

Active Installs 14

Total Versions 2

Frequently Asked Questions

What is Arxiv Search Collector?

Model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters relevance, and merges deduplicated results by language. It is an AI Agent Skill for Claude Code / OpenClaw, with 1584 downloads so far.

How do I install Arxiv Search Collector?

Run "/install arxiv-search-collector" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Arxiv Search Collector free?

Yes, Arxiv Search Collector is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Arxiv Search Collector support?

Arxiv Search Collector is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Arxiv Search Collector?

It is built and maintained by xukp20 (@xukp20); the current version is v0.1.1.
