# ArXiv Search Collector

Use this skill when you want model-led query planning and model-led relevance filtering.

## Core Principle

Scripts are tools. The model performs the reasoning and decisions:

- Expand the original topic into multiple focused queries.
- Run one fetch command per query.
- Read each query result list and decide which indexes to keep.
- Merge kept items and dedupe with one script.

## Step 1: Initialize Run

```bash
python3 scripts/init_collection_run.py \
  --output-root /path/to/data \
  --topic "LLM applications in Lean 4 formalization" \
  --keywords "Lean 4,LLM,formalization" \
  --categories "cs.AI,cs.LO" \
  --target-range 5-10 \
  --lookback 30d \
  --language English
```

This creates a run directory with `task_meta.json`, `task_meta.md`, `query_results/`, and `query_selection/`.
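
Before moving on, it can help to confirm that the run directory has the layout described above. The following is a minimal sketch, not part of the skill's scripts: the helper name `missing_run_entries` is hypothetical, and only the four entry names come from this document.

```python
from pathlib import Path

# Entries this document says the init script creates.
EXPECTED_ENTRIES = ("task_meta.json", "task_meta.md", "query_results", "query_selection")


def missing_run_entries(run_dir):
    """Return the expected entries that are absent from run_dir (hypothetical helper)."""
    root = Path(run_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


missing = missing_run_entries("/path/to/run-dir")
if missing:
    print("missing entries:", missing)
```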

## Language Parameter

- `--language` must be set manually for each collection run.
- Use the same language value across all collector scripts for consistency.
- If `--language` is non-English (for example `Chinese`), generated markdown files are written in that language:
  - `task_meta.md`
  - `query_results/<label>.md`
  - `<arxiv_id>/metadata.md`
  - `papers_index.md`

## Query Writing Requirements

Follow these rules before running per-query fetch:

1. Determine query count from final target range.
   - Prefer `3` queries for small/medium targets (`2-5`, `5-10`).
   - Prefer `4` queries for larger targets (`10-50` or above).
   - Avoid writing too many low-quality queries.

2. Allocate target budget to each query, then oversample.
   - Let `target_max` be the upper bound in target range.
   - Compute `target_per_query = ceil(target_max / query_count)`.
   - Fetch each query with `max_results = target_per_query * 2` (or `* 3` when recall is more important).
   - Example: target `5-10`, query count `3` -> `target_per_query=4` -> each query fetches `8-12`.

3. Keep one original-theme query, then add normalized/synonym expansions.
   - Query 1 keeps original topic wording.
   - Remaining queries use normalized terms and close synonyms.
   - Prefer concise noun phrases that match arXiv indexing behavior.

4. Use `OR` inside the same semantic group (synonyms), and `AND` across groups.
   - Same-group synonyms should be connected with `OR` to increase recall.
     - Example group A (model terms): `LLM OR "large language model" OR AI`.
     - Example group B (Lean terms): `"Lean 4" OR Lean OR "formal language"`.
   - Different semantic groups should be connected with `AND` to keep relevance.
     - Example: `(LLM-group) AND (Lean-group)`.
   - Recommended pattern:
     - `(<domain terms with OR>) AND (<method/model terms with OR>) [AND <optional constraint terms>]`
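
The budget arithmetic in rule 2 can be checked with a short sketch. The function name `per_query_max_results` and its signature are illustrative assumptions, not something the skill's scripts expose; only the formula comes from the rule itself.

```python
import math


def per_query_max_results(target_range, query_count, oversample=2):
    """Return (target_per_query, max_results) for a range such as "5-10"."""
    # target_max is the upper bound of the target range.
    target_max = int(target_range.split("-")[-1])
    target_per_query = math.ceil(target_max / query_count)
    return target_per_query, target_per_query * oversample


print(per_query_max_results("5-10", 3))                # (4, 8)
print(per_query_max_results("5-10", 3, oversample=3))  # (4, 12)
```

This reproduces the worked example in rule 2: a `5-10` target split over `3` queries gives `4` papers per query and a fetch size of `8` to `12`.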

### Query Examples (arXiv API-ready)

Theme A: `LLM applications in Lean 4 formalization`
- `all:"LLM applications in Lean 4 formalization"`
- `(all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")`
- `(all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving"`
- `(all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM")`

Theme B: `agentic tool use for code generation`
- `all:"agentic tool use code generation"`
- `(all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model")`
- `(all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")`

Theme C: `multimodal reasoning with retrieval`
- `all:"multimodal reasoning retrieval"`
- `(all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG")`
- `(all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")`
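
Queries in the `OR`-within-group, `AND`-across-groups pattern can also be assembled programmatically. The helpers below (`or_group`, `build_query`) are hypothetical names for a minimal sketch; they are not part of the skill and only reproduce the string shape of the examples above.

```python
def or_group(terms, field="all"):
    """Join synonyms with OR, quoting each term and prefixing the search field."""
    return "(" + " OR ".join(f'{field}:"{term}"' for term in terms) + ")"


def build_query(*groups, field="all"):
    """AND together several OR-groups of synonyms."""
    return " AND ".join(or_group(group, field) for group in groups)


query = build_query(
    ["Lean 4", "Lean", "formal language"],   # domain terms
    ["LLM", "large language model", "AI"],   # model terms
)
print(query)
# (all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")
```

The printed string matches the second Theme A query.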

## Step 2: Fetch One Query at a Time

The model defines queries manually, for example:

- `all:"Lean 4"`
- `all:"LLM formalization"`
- `all:"AI formal verification"`

Recommended batch mode (safe defaults, serial execution):

```bash
python3 scripts/fetch_queries_batch.py \
  --run-dir /path/to/run-dir \
  --plan-json /path/to/query_plan.json
```

In batch mode, the script auto-applies:

- serial API calls
- `--min-interval-sec 5`
- `--retry-max 4`
- `--retry-base-sec 5`
- `--retry-max-sec 120`
- `--retry-jitter-sec 1`
- per-run rate-state file (`<run_dir>/.runtime/arxiv_api_state.json`) for throttling
- auto `max_results` from `target_range` and query count (default oversample `x2`, cap `60`)
- default language/categories from `task_meta.json`

A minimal `query_plan.json` only needs `label` and `query`.
See `references/query-plan-format.md`.
You normally do not need to set fetch-control args manually.

If you need one-by-one manual fetch, run each query:

```bash
python3 scripts/fetch_query_metadata.py \
  --run-dir /path/to/run-dir \
  --label lean4 \
  --query 'all:"Lean 4"' \
  --max-results 30 \
  --min-interval-sec 5 \
  --retry-max 4 \
  --language English
```

Output files:

- `query_results/<label>.json` (indexed full metadata list)
- `query_results/<label>.md` (human-readable preview)

Date range is applied directly in arXiv API `search_query` via `submittedDate:[... TO ...]`.
No second local date-filter pass is performed.
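
To illustrate what such a date-restricted `search_query` can look like, here is a minimal sketch that turns a lookback window into a `submittedDate` range. It assumes the arXiv API's `YYYYMMDDHHMM` (GMT) timestamp format; the function name `with_date_range` is hypothetical, and the exact string the fetch script builds may differ.

```python
from datetime import datetime, timedelta, timezone


def with_date_range(query, lookback_days, now=None):
    """Append a submittedDate range covering the last lookback_days days."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    fmt = "%Y%m%d%H%M"  # YYYYMMDDHHMM, interpreted as GMT by the arXiv API
    return f"({query}) AND submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"


# A fixed "now" keeps the example reproducible.
fixed_now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
print(with_date_range('all:"Lean 4"', 30, now=fixed_now))
# (all:"Lean 4") AND submittedDate:[202501011200 TO 202501311200]
```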

Rate-limit controls in `fetch_query_metadata.py`:

- `--min-interval-sec` (default `5.0`)
- `--retry-max` (default `4`)
- `--retry-base-sec` (default `5.0`)
- `--retry-max-sec` (default `120.0`)
- `--retry-jitter-sec` (default `1.0`)
- `--rate-state-path` (optional override; default is `<run_dir>/.runtime/arxiv_api_state.json`)
- `--force` to bypass cache and re-fetch
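
One common way to combine a base delay, a maximum delay, and jitter is capped exponential backoff. The sketch below shows that interpretation with the defaults listed above. Whether `fetch_query_metadata.py` uses exactly this formula is an assumption, and `retry_delay` is a hypothetical name.

```python
import random


def retry_delay(attempt, base_sec=5.0, max_sec=120.0, jitter_sec=1.0, rng=random):
    """Delay before retry number `attempt` (1-based): capped exponential backoff plus jitter."""
    # Assumed doubling schedule: base, 2*base, 4*base, ... capped at max_sec.
    backoff = min(base_sec * 2 ** (attempt - 1), max_sec)
    return backoff + rng.uniform(0.0, jitter_sec)


# Backoff without jitter for the default four retries.
print([retry_delay(n, jitter_sec=0.0) for n in range(1, 5)])  # [5.0, 10.0, 20.0, 40.0]
```

Under this assumed schedule the cap of `120` seconds would only take effect from the sixth retry onward, so it matters mainly when `--retry-max` is raised.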

## Step 3: Model Filters Relevance

For each query list, the model reads indexed results and decides what to keep.

Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in `selection-json`.

## Step 4: Merge and Dedupe

```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --keep lean4:0,2,4 \
  --keep llm-formalization:1,3 \
  --language English
```

or with `selection-json`:

```json
{
  "lean4-round1": [0, 2, 4],
  "lean4-round2": [],
  "formalization-round2": [1, 3, 5]
}
```

An empty list means this query label is intentionally dropped (`keep 0`).

This writes final outputs:

- `<arxiv_id>/metadata.json`
- `<arxiv_id>/metadata.md`
- `papers_index.json`
- `papers_index.md`
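
Conceptually, the merge step collects the kept entries from each query and drops papers that appear under more than one label. The sketch below illustrates that idea on in-memory data. The field name `arxiv_id`, the placeholder IDs, and the function `merge_and_dedupe` are assumptions for illustration; the real schema is defined in `references/io-contract.md`.

```python
def merge_and_dedupe(results_by_label, keep_by_label):
    """Collect kept entries per label and drop repeats of the same arXiv ID."""
    merged, seen = [], set()
    for label, indexes in keep_by_label.items():
        for index in indexes:
            paper = results_by_label[label][index]
            if paper["arxiv_id"] not in seen:  # "arxiv_id" is an assumed field name
                seen.add(paper["arxiv_id"])
                merged.append(paper)
    return merged


# Placeholder IDs; "C" is returned by both queries.
results = {
    "lean4": [{"arxiv_id": "A"}, {"arxiv_id": "B"}, {"arxiv_id": "C"}],
    "llm-formalization": [{"arxiv_id": "C"}, {"arxiv_id": "D"}],
}
keep = {"lean4": [0, 2], "llm-formalization": [0, 1]}
print([p["arxiv_id"] for p in merge_and_dedupe(results, keep)])  # ['A', 'C', 'D']
```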

## Step 5: Iterative Retry Loop (Incremental)

If relevance is weak or final count is insufficient after Step 4, iterate:

1. Review `papers_index.md` and per-paper metadata quality.
2. Adjust query plan (usually broaden with additional synonym `OR` terms, keep cross-group `AND` constraints).
3. Fetch additional query results with new labels.
4. Re-run merge in incremental mode:

```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --incremental \
  --selection-json /path/to/updated_selection.json \
  --language English
```

Incremental behavior:

- Previous label selections are loaded from `query_selection/selected_by_query.json`.
- Labels provided in the new `selection-json` override previous selections for those labels.
- New labels can be added.
- Old labels can be dropped by setting `[]`.
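
The override rules above amount to overlaying the new selection on the previous one, label by label. A minimal sketch, assuming both selections are plain label-to-index-list mappings as in the `selection-json` example from Step 4 (the function name `apply_incremental` is hypothetical, and the merge script's internals may differ):

```python
def apply_incremental(previous, update):
    """Overlay new per-label selections on the previous ones."""
    combined = dict(previous)
    # New labels are added; labels present in both take the new value,
    # so an empty list replaces (and thereby drops) an earlier selection.
    combined.update(update)
    return combined


previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1]}
update = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}
print(apply_incremental(previous, update))
# {'lean4-round1': [0, 2, 4], 'lean4-round2': [], 'formalization-round2': [1, 3, 5]}
```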

Stop retrying when:

- relevance is acceptable, or
- additional broadened queries mainly add low-relevance papers.

If relevant papers are genuinely scarce, it is valid to finish below the original minimum target range.

## Notes

- Keep API concurrency conservative by controlling query count and `--max-results`.
- Keep per-query fetch serial (no parallel API calls in Stage A).
- Reuse cache by default for identical query/date/request settings; only use `--force` when necessary.
- Prefer default run-local rate-state so all steps in the same run share one cooldown/throttling state.
- If arXiv API returns `429 Too Many Requests`, retry later and/or increase `--min-interval-sec`.
- Prefer explicit, narrow queries and let the model filter aggressively.
- Use `references/io-contract.md` for exact files and schema.

## Related Skills

This skill is a sub-skill of `arxiv-summarizer-orchestrator`.

Pipeline position:

1. Step 1 (collection): `arxiv-search-collector` (this skill)
2. Step 2 (per-paper processing): `arxiv-paper-processor`
3. Step 3 (batch reporting): `arxiv-batch-reporter`

This skill produces the initial paper-set structure and metadata that Stage B (`arxiv-paper-processor`) and Stage C (`arxiv-batch-reporter`) depend on.

This skill is built and maintained by xukp20 (@xukp20); the current version is v0.1.1.