
PDF OCR Using Gemini LLM

Name: PDF OCR Using Gemini LLM
Author: ashtonizmev

by Issam El Alaoui · v0.1.7

cross-platform ✓ Security Clean

306 Downloads

Install in OpenClaw

/install geminipdfocr

Description

Extract text from PDFs using Google Gemini OCR. Use when extracting text from PDFs, performing OCR on scanned documents, or processing image-based PDFs.

README (SKILL.md)

Purpose

Use geminipdfocr to extract text from PDF documents via OCR (Google Gemini).

Data and privacy

Full page images/files are sent to Google's API. PDFs are split into single-page files and each page is uploaded to Google Gemini for OCR. There are no hidden exfiltration endpoints or other data collection. Do not use with highly sensitive documents unless you accept that content is sent to Google.

Setup (venv installation)

Before first use, create and activate the virtual environment:

cd geminipdfocr && python -m venv venv && source venv/bin/activate && pip install -r requirements.txt

Set GOOGLE_API_KEY in your environment before running (e.g. export GOOGLE_API_KEY=your-key).
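A wrapper script can check for the key up front, so a missing variable fails with a clear message instead of an API error partway through a document. This is a minimal sketch, not part of the skill. The function name `require_api_key` is hypothetical, and only the variable name GOOGLE_API_KEY comes from the README.

```python
import os


def require_api_key(env=None):
    """Return GOOGLE_API_KEY from the environment, or raise with a clear message.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; export it before running geminipdfocr"
        )
    return key
```

Avoid printing or logging the returned value, since the key is billed against your account.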

How to use

When requested to extract text or perform OCR on a PDF:

Run: cd geminipdfocr && source venv/bin/activate && python -m geminipdfocr <path-to-pdf> [--json] [--output <file>]
Use --json for structured data.
Use --max-pages N for testing or very long documents.
Use --quiet to suppress progress logs.
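When the command is driven from another Python program rather than typed by hand, the flags above can be assembled into an argument list. The helper below is a hypothetical sketch (`build_command` is not part of the skill). It uses only the flags documented here and assumes the current interpreter is the one from the activated venv.

```python
import sys


def build_command(pdf_paths, json_output=False, output=None, max_pages=None, quiet=False):
    """Build the argv list for `python -m geminipdfocr` from the documented flags."""
    if not pdf_paths:
        raise ValueError("at least one PDF path is required")
    # sys.executable is assumed to be the venv's Python interpreter.
    cmd = [sys.executable, "-m", "geminipdfocr", *map(str, pdf_paths)]
    if json_output:
        cmd.append("--json")
    if output is not None:
        cmd += ["--output", str(output)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if quiet:
        cmd.append("--quiet")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd="geminipdfocr")`, which mirrors the `cd geminipdfocr` step without going through a shell.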

Requirements

A valid PDF file path.
GOOGLE_API_KEY set in the process environment (e.g. export GOOGLE_API_KEY=your-key).

CLI options

Option	Description
`pdf_path`	One or more PDF file paths (positional)
`--max-pages N`	Limit pages per PDF
`--json`	Output structured JSON instead of plain text
`--output FILE`	Write result to file (default: stdout)
`--quiet`	Suppress INFO/DEBUG logs
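To show how the options in the table combine, here is a hypothetical argparse parser that mirrors them. It is reconstructed from the table alone and is not the skill's actual parser, so its defaults and help strings may differ from the real CLI.

```python
import argparse


def make_parser():
    """Hypothetical parser mirroring the documented options; not the skill's code."""
    parser = argparse.ArgumentParser(prog="geminipdfocr")
    parser.add_argument("pdf_path", nargs="+", help="One or more PDF file paths")
    parser.add_argument("--max-pages", type=int, metavar="N", help="Limit pages per PDF")
    parser.add_argument("--json", action="store_true", help="Output structured JSON")
    parser.add_argument("--output", metavar="FILE", help="Write result to file (default: stdout)")
    parser.add_argument("--quiet", action="store_true", help="Suppress INFO/DEBUG logs")
    return parser


# Two PDFs, at most two pages each, JSON to stdout, logs left on.
args = make_parser().parse_args(["a.pdf", "b.pdf", "--max-pages", "2", "--json"])
print(args.pdf_path, args.max_pages, args.json, args.output, args.quiet)
```

In this sketch an omitted `--output` leaves `args.output` as `None`, which stands in for the documented default of writing to stdout.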

Usage Guidance

This skill appears to be what it says: it splits PDFs into single-page files and uploads them to Google Gemini for OCR, and it requires only GOOGLE_API_KEY. Before installing, consider: (1) privacy — full page images are sent to Google, so do not use with highly sensitive documents unless acceptable; (2) cost and quotas — large PDFs mean many uploads and API usage billed against your API key; (3) secure the GOOGLE_API_KEY (don’t paste it into logs or share it); (4) review and pin package versions if you want reproducible installs; (5) test on non-sensitive sample PDFs first to confirm behavior. If you need guarantees about retention or want OCR to run locally, consider a local OCR solution instead.
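The cost point can be made concrete with a rough count of uploads before a run. The sketch below assumes exactly one upload per processed page, as the description states, and ignores any retries. The function name `pages_to_upload` is hypothetical.

```python
def pages_to_upload(page_counts, max_pages=None):
    """Estimate the number of page uploads for a batch of PDFs.

    page_counts: iterable of page counts, one per PDF.
    max_pages:   optional per-PDF cap, matching the --max-pages option.
    """
    if max_pages is not None and max_pages < 0:
        raise ValueError("max_pages must be non-negative")
    return sum(n if max_pages is None else min(n, max_pages) for n in page_counts)


# A 120-page and an 8-page PDF: 128 uploads uncapped, 18 with --max-pages 10.
print(pages_to_upload([120, 8]), pages_to_upload([120, 8], max_pages=10))
```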

Capability Analysis

Type: OpenClaw Skill
Name: geminipdfocr
Version: 0.1.7

The geminipdfocr skill is a legitimate tool designed to perform OCR on PDF documents using the Google Gemini API. The code follows standard practices, using PyMuPDF for PDF splitting and the official google-genai library for API interactions. It includes clear documentation in SKILL.md regarding data privacy (disclosing that files are sent to Google) and lacks any indicators of data exfiltration to unauthorized endpoints, malicious execution, or persistence mechanisms.

Capability Assessment

✓ Purpose & Capability

Name/description, required env (GOOGLE_API_KEY), listed Python packages (google-genai, pymupdf), CLI entry point, and code all align with a PDF OCR tool that uploads pages to Google's Gemini API.

ℹ Instruction Scope

The SKILL.md and code explicitly split PDFs into single-page files and upload full page files to Google's API for OCR. This behaviour is documented in the README and implemented in gemini_client.py (files.upload + models.generate_content). There are no apparent instructions or code that read unrelated files, other env vars, or send data to unknown endpoints, but note that entire page images are transmitted to Google (privacy/cost implication).
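The per-page flow described above can be sketched without the real client. In this illustration `upload` and `generate` are hypothetical callables standing in for the client's files.upload and models.generate_content. The function `ocr_pages` is not taken from gemini_client.py, and the real code may batch, retry, or prompt differently.

```python
def ocr_pages(page_paths, upload, generate):
    """Upload each single-page file, request its text, and return texts in page order."""
    texts = []
    for path in page_paths:
        uploaded = upload(path)            # stand-in for files.upload
        texts.append(generate(uploaded))   # stand-in for models.generate_content
    return texts


# Fake callables make the data flow visible without any network access.
def fake_upload(path):
    return f"ref:{path}"


def fake_generate(ref):
    return f"text of {ref}"


print(ocr_pages(["page_1.pdf", "page_2.pdf"], fake_upload, fake_generate))
```

Passing the two steps in as callables keeps the loop testable offline and shows the privacy implication directly: every page in `page_paths` is handed to `upload`.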

✓ Install Mechanism

Dependencies are standard Python packages (google-genai, pymupdf, pydantic, pydantic-settings) and a requirements.txt is included. No downloads from custom URLs or extracts from arbitrary hosts are present.

✓ Credentials

Only GOOGLE_API_KEY is required and declared as the primary credential. That single key is appropriate and required for the Google Gemini client used by the skill. No unrelated secrets or config paths are requested.

✓ Persistence & Privilege

The skill is not always-enabled, does not modify other skills, and only writes temporary files under the system temp directory (cleans up after processing). It does not request elevated system persistence.
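The temporary-file behaviour described above matches a common standard-library pattern. The sketch below assumes `tempfile.TemporaryDirectory` and hypothetical names (`with_temp_pages`, the `pdfocr_` prefix, the `page_0001.pdf` naming). The skill's own cleanup code may be organised differently.

```python
import tempfile
from pathlib import Path


def with_temp_pages(page_blobs, process):
    """Write each page to a temp directory, run process(paths), then delete the directory."""
    with tempfile.TemporaryDirectory(prefix="pdfocr_") as tmp:
        paths = []
        for i, blob in enumerate(page_blobs, start=1):
            path = Path(tmp) / f"page_{i:04d}.pdf"
            path.write_bytes(blob)
            paths.append(path)
        # The files exist only while process() runs; the directory is
        # removed on exit, even if process() raises.
        return process(paths)
```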

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install geminipdfocr
After installation, invoke the skill by name or use /geminipdfocr
Provide required inputs per the skill's parameter spec and get structured output

Version History

v0.1.7

- Clarified data and privacy section to explicitly state that full page images/files are sent to Google's API.
- Added note that there are no hidden exfiltration endpoints or other data collection.
- Improved warning for users about using the skill with highly sensitive documents.

v0.1.6

- Added required Python package dependencies to the skill metadata for easier installation: google-genai, pymupdf, pydantic, and pydantic-settings.
- No changes to functionality or usage.

v0.1.5

- Added a new metadata section for openclaw, specifying environment requirements.
- Declared GOOGLE_API_KEY as the primary required environment variable in metadata.
- No changes to functionality or usage instructions.

v0.1.4

- Switched configuration to require the GOOGLE_API_KEY environment variable to be set in the process environment, instead of loading from a local .env file.
- Updated documentation to reflect the new authentication setup, removing instructions related to .env files.

v0.1.3

- Configuration now explicitly loads only `geminipdfocr/.env`, using a path relative to the package rather than the current working directory.
- Updated documentation to clarify `.env` file loading behavior.
- No other major changes or user-facing features added.

v0.1.2

- Project renamed from "geminipdf" to "geminipdfocr" throughout all files.
- Updated documentation and setup instructions to reflect the new name.
- Clarified configuration: now only reads environment variables from `geminipdfocr/.env`, not any parent directories.
- Improved privacy note and setup details in SKILL.md.
- Minor text and description improvements in CLI help and metadata.

v0.1.1

- Added an explicit warning about data and privacy, noting that PDF content is uploaded to Google Gemini for OCR.
- Documented the required GOOGLE_API_KEY environment variable in metadata.
- No functional changes to code; updates are documentation-only.

v0.1.0

- Initial release of Geminipdf OCR.
- Extract text from PDFs using Google Gemini OCR, supporting scanned and image-based documents.
- Command-line interface with options for JSON output, limiting pages, and quiet mode.
- Requires GOOGLE_API_KEY configuration for operation.
- Outputs can be saved to a file or printed to stdout.

Metadata

Slug geminipdfocr

Version 0.1.7

License —

All-time Installs 1

Active Installs 1

Total Versions 8

Frequently Asked Questions

What is PDF OCR Using Gemini LLM?

Extract text from PDFs using Google Gemini OCR. Use when extracting text from PDFs, performing OCR on scanned documents, or processing image-based PDFs. It is an AI Agent Skill for Claude Code / OpenClaw, with 306 downloads so far.

How do I install PDF OCR Using Gemini LLM?

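Run "/install geminipdfocr" in the OpenClaw or Claude Code chat to install it. Before first use, create the virtual environment, install the requirements, and set GOOGLE_API_KEY in your environment, as described in the README's Setup section.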

Is PDF OCR Using Gemini LLM free?

Yes, PDF OCR Using Gemini LLM is free to download, install, and use. Note that every OCR request goes to the Google Gemini API, so usage counts against the quota and billing of your own GOOGLE_API_KEY.

Which platforms does PDF OCR Using Gemini LLM support?

PDF OCR Using Gemini LLM is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created PDF OCR Using Gemini LLM?

It is built and maintained by Issam El Alaoui (@ashtonizmev); the current version is v0.1.7.
