xukp20

Arxiv Paper Processor

by xukp20 · GitHub ↗ · v0.1.1
cross-platform ⚠ suspicious
Downloads: 1842 · Stars: 1 · Active Installs: 12 · Versions: 2
Install in OpenClaw
/install arxiv-paper-processor
Description
Tool for manual per-paper ArXiv paper processing: batch/source/pdf download then model-driven full-text reading and summary.md writing in chosen language.
README (SKILL.md)

# ArXiv Paper Processor

Use this skill for per-paper manual summarization, with optional batch artifact download.

- Single-paper mode: process one paper directory (e.g. `<run_dir>/<arxiv_id>/`).
- Batch predownload mode: process many paper directories under one run dir before writing summaries.

## Language Parameter

- Use a workflow language parameter (for example `English` or `Chinese`) and apply it manually.
- The per-paper `summary.md` must be written in the selected language.
- If download scripts are called directly, pass `--language <LANG>` for traceability.

## Core Principle

Scripts only fetch artifacts. The model performs reading and writing.

## Non-negotiable Constraint

- Do not generate `summary.md` by script-based snippet extraction, regex harvesting, or template autofill.
- Do not use Python/shell scripts to auto-compose section text from abstract/introduction fragments.
- Scripts in this skill are only for artifact download (source/pdf) and trace logs.
- The final `summary.md` must come from model-side reading and synthesis of the paper content.

## Optional Batch Artifact Download (Many Papers)

Use this first when Stage B has many papers:

```bash
python3 scripts/download_papers_batch.py \
  --run-dir /path/to/run \
  --artifact source_then_pdf \
  --max-workers 3 \
  --min-interval-sec 5 \
  --language English
```

Key behavior:

- Supports `--artifact source`, `--artifact pdf`, or `--artifact source_then_pdf` (default).
- Supports concurrency (`--max-workers`) and safe throttling/retry (`--min-interval-sec`, retry args).
- Uses run-local throttle state by default (`<run_dir>/.runtime/arxiv_download_state.json`) to reduce 429 risk.
- Skips papers that already have usable `source/source_extract/*.tex` or existing `source/paper.pdf` (unless `--force`).
- Resume-friendly: if a paper already has a completed `summary.md`, you can skip that paper's summary-writing step.
- Writes batch log to `<run_dir>/download_batch_log.json` by default.
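
The run-local throttle described above can be pictured roughly as follows. This is a minimal sketch, not the skill's actual implementation: the helper `wait_for_slot` and the JSON field `last_request_ts` are hypothetical, and only the state-file location and the idea behind `--min-interval-sec` come from the README. The sketch also ignores locking, which a real multi-worker downloader would need.

```python
import json
import time
from pathlib import Path


def wait_for_slot(state_path: Path, min_interval_sec: float,
                  now=time.time, sleep=time.sleep) -> float:
    """Wait until min_interval_sec has passed since the last recorded request.

    Returns the number of seconds slept. "last_request_ts" is a hypothetical
    field name, not taken from the skill's real state file.
    """
    last_ts = 0.0
    if state_path.exists():
        try:
            state = json.loads(state_path.read_text())
            last_ts = float(state.get("last_request_ts", 0.0))
        except (ValueError, OSError, AttributeError, TypeError):
            last_ts = 0.0  # unreadable state: behave as if no request was made

    waited = max(0.0, last_ts + min_interval_sec - now())
    if waited > 0:
        sleep(waited)

    # Record the time of this request for the next caller.
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps({"last_request_ts": now()}))
    return waited
```

Injecting `now` and `sleep` keeps the function testable without real delays; a caller would simply invoke `wait_for_slot(run_dir / ".runtime" / "arxiv_download_state.json", 5)` before each request.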

## Step 1: Download Source (Preferred)

```bash
python3 scripts/download_arxiv_source.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English
```

This writes:

- `source/source_bundle.bin`
- `source/source_extract/`
- `source/download_source_log.json`

If usable source already exists and `--force` is not set, the script reuses local artifacts.

## Step 2: If Needed, Download PDF

```bash
python3 scripts/download_arxiv_pdf.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English
```

This writes:

- `source/paper.pdf`
- `source/download_pdf_log.json`

If PDF already exists and `--force` is not set, the script reuses local artifacts.

## Step 3: Model Reads and Summarizes

1. If `summary.md` already exists and follows the required format, skip this paper and mark it complete.
2. Read `metadata.md` first.
3. If `source/source_extract/` already exists with readable `.tex` files, use it directly.
4. Otherwise, if `source/paper.pdf` already exists, use PDF directly.
5. If neither exists, run download scripts (single-paper scripts or batch script) first.
6. Manually write `summary.md` in the same paper directory, in the selected language.

Do not rely on rule-based auto summarization.
Do not rely on auto-extracted snippets as the primary writing basis.
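
The artifact-selection order in the steps above can be sketched as a small file-existence check. This is an illustration under stated assumptions, not part of the skill: `next_action` and its return labels are hypothetical, the recursive `.tex` search is an assumption, and the function only inspects which files exist. It never reads or summarizes paper content, which stays with the model.

```python
from pathlib import Path


def next_action(paper_dir: Path) -> str:
    """Return which Step 3 branch applies to one paper directory."""
    if (paper_dir / "summary.md").is_file():
        # Whether the summary follows the required format is still
        # something the model has to judge (step 1).
        return "check_existing_summary"

    extract_dir = paper_dir / "source" / "source_extract"
    # Assumption: .tex files may sit in subfolders, so search recursively.
    if extract_dir.is_dir() and any(extract_dir.rglob("*.tex")):
        return "read_tex_source"

    if (paper_dir / "source" / "paper.pdf").is_file():
        return "read_pdf"

    return "download_first"
```

For a directory such as `/path/to/run/2602.00528`, the returned label tells the agent whether to read existing artifacts or run the download scripts first.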

## Quality Requirement

- Every section should include paper-specific details that are traceable to full-text reading.
- Sections 4, 5, and 10 should reflect concrete method and evaluation details, not generic wording.
- If key details are unclear in the source, explicitly note uncertainty instead of guessing.
- Match the detail level shown in `references/summary-example-en.md` and `references/summary-example-zh.md`.
- If your draft is clearly shorter or less specific than the examples, expand it before finishing.

## Required Output

- `<paper_dir>/summary.md` in fixed section format.
- Pay special attention to section `## 10. Brief Conclusion`: write a 3-4 sentence mini-conclusion that covers contribution, method, evaluation setup, and results with paper-specific details.
- In section `## 1. Paper Snapshot`, use exact keys: `ArXiv ID`, `Title`, `Authors`, `Publish date`, `Primary category`, `Reading basis`.
- Do not use key variants such as `Reading source`, `Author list`, `Published on`, or lowercase key names.

See `references/summary-format.md` for exact section requirements.
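
The exact-key rule for `## 1. Paper Snapshot` lends itself to a quick check after the summary has been written. The sketch below only validates a finished `summary.md`; it does not generate any summary text. `missing_snapshot_keys` is a hypothetical helper, and the assumed layout (one `Key: value` pair per line, optionally behind a list marker or bold markup) is a guess; `references/summary-format.md` remains the authority.

```python
import re

REQUIRED_KEYS = ["ArXiv ID", "Title", "Authors", "Publish date",
                 "Primary category", "Reading basis"]


def missing_snapshot_keys(summary_text: str) -> list[str]:
    """List required keys absent from the '## 1. Paper Snapshot' section.

    Matching is case-sensitive, so variants such as 'authors' or
    'Reading source' count as missing.
    """
    match = re.search(r"^## 1\. Paper Snapshot[ \t]*\n(.*?)(?=^## |\Z)",
                      summary_text, flags=re.M | re.S)
    section = match.group(1) if match else ""

    found = set()
    for line in section.splitlines():
        key, sep, _ = line.partition(":")
        if sep:
            # Drop list markers and bold markup around the key name.
            found.add(key.strip(" \t-*•"))
    return [k for k in REQUIRED_KEYS if k not in found]
```

An empty return value means all six keys are present with the exact spelling required above.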

## Related Skills

This skill is a sub-skill of `arxiv-summarizer-orchestrator`.

Pipeline position:

1. Step 1 (upstream): `arxiv-search-collector` produces the selected paper directories and metadata.
2. Step 2 (this skill): `arxiv-paper-processor` downloads artifacts and writes one `summary.md` per paper.
3. Step 3 (downstream): `arxiv-batch-reporter` uses these per-paper summaries to generate the final collection report.

Use this skill together with Step 1 and Step 3 for full end-to-end execution.
Usage Guidance
This skill appears internally consistent: it downloads arXiv source/pdf artifacts and asks the model to manually read those artifacts and write summary.md files. Before installing or using it, run the following checks:
  1. Open the full scripts (the prompt contained truncated files) and confirm that all network requests target legitimate arXiv endpoints (e.g., arxiv.org) and not unknown third-party URLs.
  2. Run the scripts in an isolated workspace (or container) so downloads and extracted files are restricted to the intended run directories.
  3. The scripts write logs and extracted files under the run/paper directories; ensure those directories are the ones you expect.
  4. No credentials are required, so never add secrets to make it 'work'.
  5. If you will allow the agent to invoke this skill autonomously, be aware it can perform network downloads and write files; if you need stricter controls, keep autonomous invocation disabled or sandbox its execution.
If you want higher confidence, provide the untruncated full source so URL-building and any remaining code paths can be fully audited.
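Check 1 above can be made concrete with a small host allowlist. The sketch below is an illustration only: `ALLOWED_HOSTS` and `is_arxiv_url` are hypothetical names, and because the skill's URL-building code was not available in full, the host list and the HTTPS-only rule are assumptions to adjust after reading the scripts.

```python
from urllib.parse import urlparse

# Assumption: these are the only hosts the downloaders should contact.
ALLOWED_HOSTS = {"arxiv.org", "export.arxiv.org"}


def is_arxiv_url(url: str) -> bool:
    """Return True only for HTTPS URLs whose host is on the allowlist."""
    parts = urlparse(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
```

Comparing the parsed hostname, not a substring of the URL, avoids being fooled by look-alike addresses such as `https://arxiv.org.evil.example/`.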
Capability Analysis
Type: OpenClaw Skill
Name: arxiv-paper-processor
Version: 0.1.1
The skill bundle 'arxiv-paper-processor' is designed for downloading and summarizing arXiv papers. The `SKILL.md` explicitly instructs the AI agent to perform model-driven summarization and forbids script-based content generation. The Python scripts (`download_arxiv_pdf.py`, `download_arxiv_source.py`) handle network requests to `arxiv.org` and local file operations, including safe tar extraction, without evidence of malicious intent or data exfiltration. However, `scripts/download_papers_batch.py` accepts arguments like `--python-bin`, `--source-script`, and `--pdf-script`. Although `subprocess.run` is called safely with a list of arguments, an untrusted caller (e.g., the AI agent with poor input sanitization, or a malicious user) could supply an arbitrary executable or script path through these arguments and achieve arbitrary code execution. The skill is therefore classified as suspicious because of this potential critical vulnerability.
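The concern above is about caller-controlled `--source-script` and `--pdf-script` values. One possible mitigation, sketched here and not part of the skill as reviewed, is to resolve each path and reject anything outside a trusted scripts directory before it reaches `subprocess.run`. `resolve_trusted_script` is a hypothetical helper, and `--python-bin` would need its own check (for example, a comparison against `sys.executable`).

```python
from pathlib import Path


def resolve_trusted_script(candidate: str, scripts_dir: Path) -> Path:
    """Return the resolved script path, or raise if it is not trusted.

    resolve() collapses '..' segments and symlinks, so a path like
    'scripts/../evil.py' is judged by where it really points.
    """
    root = scripts_dir.resolve()
    path = Path(candidate).resolve()
    if path.suffix != ".py" or not path.is_file():
        raise ValueError(f"not a Python script: {candidate}")
    if root not in path.parents:
        raise ValueError(f"script outside trusted directory: {candidate}")
    return path
```

A batch runner could call this once per script argument at startup and abort on `ValueError` before any subprocess is launched.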
Capability Assessment
Purpose & Capability
Name/description match the included artifacts: three downloader scripts and a batch orchestrator. The files and SKILL.md describe downloading arXiv source/PDF, local throttling, extraction, and asking the model to manually produce summary.md. There are no unrelated environment variables, binaries, or config paths requested.
Instruction Scope
SKILL.md instructs the agent to only use the scripts for artifact download and to perform model-driven reading and manual summary writing. The instructions reference only per-paper directories, metadata files, extracted source, and PDFs. They explicitly forbid using scripts or regex-based extraction to auto-generate summaries. Note: parts of the code in the prompt were truncated, so I could not fully confirm every URL construction; verify that network requests target arXiv endpoints only.
Install Mechanism
There is no install spec (instruction-only skill with bundled scripts). This is the lowest-risk case from an install perspective: the skill does not download remote install artifacts at install time. The included Python scripts are run by the user/agent at runtime.
Credentials
The skill declares no required environment variables, credentials, or config paths. The scripts perform HTTP requests and write local files under per-paper directories; this is proportionate to the stated purpose.
Persistence & Privilege
Flags show always: false and normal autonomous invocation allowed. The skill does not request permanent system-wide presence or modify other skills. Its runtime behavior is limited to writing artifacts and logs in the provided run/paper directories.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install arxiv-paper-processor
  3. After installation, invoke the skill by name or use /arxiv-paper-processor
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.1
Document cross-skill relationships in all SKILL.md files
v0.1.0
Initial release: supports manual, model-driven summarization of arXiv papers with optional batch artifact downloading.
- Provides scripts for downloading source files or PDFs for arXiv papers, with batch and single-paper modes.
- Enforces manual summary writing by the model in a specified language parameter (no script-based summarization).
- Batch download supports concurrency, safe throttling, resume, and skips already-processed papers.
- Output summaries must follow a strict, detailed, sectioned format, as per provided examples, with concrete paper-specific detail.
- Scripts are for fetching artifacts only; summarization is always based on model-side paper reading and synthesis.
Metadata
Slug arxiv-paper-processor
Version 0.1.1
License
All-time Installs 12
Active Installs 12
Total Versions 2
Frequently Asked Questions

What is Arxiv Paper Processor?

Tool for manual per-paper ArXiv paper processing: batch/source/pdf download then model-driven full-text reading and summary.md writing in chosen language. It is an AI Agent Skill for Claude Code / OpenClaw, with 1842 downloads so far.

How do I install Arxiv Paper Processor?

Run "/install arxiv-paper-processor" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Arxiv Paper Processor free?

Yes, Arxiv Paper Processor is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Arxiv Paper Processor support?

Arxiv Paper Processor is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Arxiv Paper Processor?

It is built and maintained by xukp20 (@xukp20); the current version is v0.1.1.
