
Sci Data Extractor


by JackKuo666 · GitHub ↗ · v0.1.0

cross-platform ⚠ suspicious

463 Downloads

Install in OpenClaw

/install sci-data-extractor

Description

AI-powered tool for extracting structured data from scientific literature PDFs

Usage Guidance

What to check before installing or running this skill:
- Origin: The skill's source and homepage are unknown; prefer code from a trusted repository. If you got this from an external repo, inspect the repo and the maintainer's reputation.
- API keys: The code will send extracted text (potentially entire PDF contents) to external LLMs/Mathpix. Only use API keys with limited scope or billing controls, and avoid uploading sensitive or private documents.
- Registry mismatch: The registry lists no required env vars, but the SKILL.md and code require EXTRACTOR_API_KEY (or API_KEY), EXTRACTOR_BASE_URL and optionally Mathpix keys. Do not provide secrets until you confirm how they are used and where traffic goes.
- LLM/provider inconsistency: The README defaults to a Claude model name, but the code uses the openai Python client and a configurable base_url. Verify that the client and base_url will actually work with your provider; otherwise keys might be misdirected or requests might fail.
- Avoid running curl | sh blindly: The installer suggests running an external script (https://astral.sh/uv/install.sh). Do not run it unless you trust the source; prefer to install uv/venv tooling via a package manager, or inspect the script first.
- Sandbox test: Run the tool in a disposable environment (VM/container) first, with a throwaway API key and non-sensitive PDFs. Monitor network requests during a test run to confirm the endpoints and the data sent.
- Code review focus: The key network actions are in extractor.py (requests to Mathpix and the OpenAI client usage). Confirm there are no hidden endpoints or telemetry sending keys elsewhere.

If you are not comfortable, do not provide production API keys.

Capability Analysis

Type: OpenClaw Skill
Name: sci-data-extractor
Version: 0.1.0

The skill is classified as suspicious because of several critical vulnerabilities that a malicious user could exploit, not because of intentional malicious behavior by the skill author. The most significant risks are:
- Arbitrary file write: in `extractor.py` and `batch_extract.py`, user-controlled output paths could lead to writing to sensitive system files (e.g., `/etc/passwd`, `~/.ssh/authorized_keys`).
- Server-Side Request Forgery (SSRF): the `EXTRACTOR_BASE_URL` configuration could allow LLM API calls to be redirected to internal network resources.
- Local File Inclusion (LFI): possible through user-controlled PDF paths in `PDFProcessor.extract_text_pymupdf`.
- Supply chain risk: the installation instructions in `SKILL.md` and `README.md` use `curl | sh` to install `uv`.
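One common way to reduce the arbitrary file write risk is to confine every output path to a chosen directory before writing. The sketch below is illustrative only and is not taken from the skill's code; `safe_output_path` and its parameters are hypothetical names.

```python
from pathlib import Path


def safe_output_path(base_dir: str, user_path: str) -> Path:
    """Resolve user_path inside base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, and ".." segments can
    # climb out of it, so resolve the result and re-check where it landed.
    candidate = (base / user_path).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"output path escapes {base}: {user_path!r}") from None
    return candidate
```

With this check, a request such as `safe_output_path("out", "/etc/passwd")` raises `ValueError` instead of returning a writable path, while `safe_output_path("out", "tables/result.csv")` resolves inside `out`.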

Capability Assessment

ℹ Purpose & Capability

Name/description match the code and docs: the project extracts text from PDFs (PyMuPDF or Mathpix) and sends content to an LLM to produce structured outputs. That capability legitimately requires an LLM API key and optionally Mathpix credentials. However, the registry metadata declares no required environment variables or primary credential while the SKILL.md and code clearly expect EXTRACTOR_API_KEY (or API_KEY), EXTRACTOR_BASE_URL, and optional MATHPIX_APP_ID / MATHPIX_APP_KEY — the missing declaration in registry is an inconsistency and reduces transparency.
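Because the registry does not declare these variables, a small preflight check can make the expectation explicit before a run. The sketch below is not part of the skill; `missing_settings` is a hypothetical helper, and treating EXTRACTOR_BASE_URL as mandatory is an assumption based on this review, not on the code's actual defaults.

```python
import os
from typing import List, Mapping


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the settings this review says are expected but env lacks."""
    missing = []
    # Per the review, either EXTRACTOR_API_KEY or API_KEY is accepted.
    if not (env.get("EXTRACTOR_API_KEY") or env.get("API_KEY")):
        missing.append("EXTRACTOR_API_KEY (or API_KEY)")
    # Assumption: no usable default base URL exists in the code.
    if not env.get("EXTRACTOR_BASE_URL"):
        missing.append("EXTRACTOR_BASE_URL")
    # Mathpix is optional, but a half-configured pair is likely a mistake.
    has_id = bool(env.get("MATHPIX_APP_ID"))
    has_key = bool(env.get("MATHPIX_APP_KEY"))
    if has_id != has_key:
        missing.append("MATHPIX_APP_KEY" if has_id else "MATHPIX_APP_ID")
    return missing


if __name__ == "__main__":
    for name in missing_settings(os.environ):
        print(f"not set: {name}")
```

Passing the environment in as a mapping keeps the check easy to run against a throwaway configuration without touching real secrets.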

ℹ Instruction Scope

Runtime instructions and code will read local PDFs and .env, upload PDFs to Mathpix if chosen, and send extracted text to an external LLM endpoint. That is coherent with the stated purpose, but it does mean entire document content (potentially sensitive or copyrighted material) is transmitted to third-party services. The SKILL.md also suggests running external install scripts (see next dimension).

⚠ Install Mechanism

There is no formal install spec in the registry, but the SKILL.md recommends installing the 'uv' tool via curl -LsSf https://astral.sh/uv/install.sh | sh which runs a remote install script — a higher-risk pattern. The README also suggests adding the skill via npx or cloning a GitHub repo. Running an arbitrary curl|sh should be treated cautiously; the project otherwise relies on pip packages listed in requirements.txt (reasonable).

⚠ Credentials

The code requires an LLM API key and optionally Mathpix credentials (EXTRACTOR_API_KEY or API_KEY, EXTRACTOR_BASE_URL, MATHPIX_APP_ID/KEY). Those are proportionate for an extractor. The problem: the registry metadata lists no required env vars, creating a transparency gap. Also the README/SKILL.md default model is a Claude model name while the code uses the openai.OpenAI client and accepts EXTRACTOR_BASE_URL — this mismatch (client vs declared model/provider) is suspicious and should be verified before supplying keys.
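One way to verify where keys would be sent is to check EXTRACTOR_BASE_URL against an explicit allowlist before any client is constructed. This is a minimal sketch using only the standard library, not the skill's own logic; `check_base_url` is a hypothetical helper and the hosts in `ALLOWED_HOSTS` are placeholders to replace with the provider you actually use.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: replace with the API hosts you intend to use.
ALLOWED_HOSTS = {"api.openai.com", "api.anthropic.com"}


def check_base_url(url: str, allowed_hosts=ALLOWED_HOSTS) -> str:
    """Return url unchanged if it is HTTPS and points at an allowlisted host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"base URL must use https: {url!r}")
    # hostname excludes any userinfo and port, so "user@host:port" tricks
    # are compared on the real host only.
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        raise ValueError(f"host not in allowlist: {host!r}")
    return url
```

An exact-match allowlist is stricter than blocking known internal addresses, which is why it is used here: anything not named, including `localhost` and link-local metadata addresses, is rejected by default.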

✓ Persistence & Privilege

The skill does not request always:true and does not claim to modify other skills or persistent system settings. It's a user-invoked tool and its runtime behavior is limited to reading local PDFs, optional .env, and making network calls to configured LLM/Mathpix endpoints.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install sci-data-extractor
After installation, invoke the skill by name or use /sci-data-extractor
Provide required inputs per the skill's parameter spec and get structured output

Version History

v0.1.0

Initial release of Sci-Data-Extractor.
- Extracts structured data from scientific paper PDFs using LLMs and OCR methods.
- Supports formula and table recognition with Mathpix OCR or PyMuPDF.
- Outputs in Markdown tables or CSV files.
- Provides preset extraction templates for enzyme kinetics, experiments, and literature reviews.
- Allows usage of custom prompts for flexible data extraction needs.
- Installation instructions for Python/pip, uv, or conda included, with API key configuration guidance.

Metadata

Slug sci-data-extractor

Version 0.1.0

License —

All-time Installs 1

Active Installs 1

Total Versions 1

Frequently Asked Questions

What is Sci Data Extractor?

AI-powered tool for extracting structured data from scientific literature PDFs. It is an AI Agent Skill for Claude Code / OpenClaw, with 463 downloads so far.

How do I install Sci Data Extractor?

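Run "/install sci-data-extractor" in the OpenClaw or Claude Code chat to install it in one step. Before first use, the skill still needs an LLM API key and base URL (EXTRACTOR_API_KEY or API_KEY, EXTRACTOR_BASE_URL), plus Mathpix credentials if you choose Mathpix OCR.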

Is Sci Data Extractor free?

Sci Data Extractor is free to download, install and use. Note that the listing shows no license, and the external LLM and Mathpix APIs the skill calls may be billed separately by their providers.

Which platforms does Sci Data Extractor support?

Sci Data Extractor is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Sci Data Extractor?

It is built and maintained by JackKuo666 (@jackkuo666); the current version is v0.1.0.
