
multimodal-parser

Name: multimodal-parser
Author: ayalili

by Ayalili · GitHub ↗ · v1.0.1 · MIT-0

cross-platform ✓ Security Clean

Downloads: 630

Install in OpenClaw

/install multimodal-parser

Description

Unified multi-modal content parser for images, PDF, DOCX, and audio, with automatic OCR/transcription; it outputs structured text for LLM processing.

Usage Guidance

What to consider before installing/using:

- Trust & origin: the package has no homepage and an unknown source; run it only if you trust the author or after reviewing the code (you have the code).
- Permissions & sandboxing: the skill uses Deno to run subprocesses and read files. Grant it only the minimal filesystem and subprocess permissions, or run in a sandbox/container.
- Dependencies: it requires external CLI tools (tesseract, pdftotext/poppler, pandoc, whisper, ffmpeg). Install those from official package repositories to avoid malicious binaries.
- Network/supply-chain: the code imports zod from deno.land at runtime; this fetch is expected but is a supply-chain/network fetch. If you need offline assurance, vendor the dependency or audit the fetched module.
- Data sensitivity: the skill processes user-provided files locally and does not appear to transmit results externally, but avoid testing on highly sensitive files until you confirm runtime permissions and behavior in your environment.
- Sanity checks: test on non-sensitive sample files first; verify produced outputs and any error messages. If you need stronger assurance, run the code in an isolated VM and/or review and pin remote dependency versions.
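The permissions advice above can be made concrete with a small sketch. It assumes the skill is started directly with `deno run` (the listing does not say how OpenClaw launches it), and the helper name `denoRunArgs` and the `./samples` directory are hypothetical. The tool list simply mirrors the dependency list; trim it to the binaries the code actually spawns.

```typescript
// Hypothetical helper: compose least-privilege flags for `deno run`.
// Tool names come from the dependency list; `index.ts` is the entry file
// named in the capability analysis. How OpenClaw launches it is an assumption.
const REQUIRED_TOOLS = ["tesseract", "pdftotext", "pandoc", "whisper", "ffmpeg"] as const;

function denoRunArgs(inputDir: string, entry: string = "index.ts"): string[] {
  return [
    "run",
    `--allow-read=${inputDir}`,               // read access limited to one directory
    `--allow-run=${REQUIRED_TOOLS.join(",")}`, // only these binaries may be spawned
    entry,
  ];
}

console.log(["deno", ...denoRunArgs("./samples")].join(" "));
// prints: deno run --allow-read=./samples --allow-run=tesseract,pdftotext,pandoc,whisper,ffmpeg index.ts
```

Note that programs started through `--allow-run` are not themselves restricted by Deno's permission flags, so a container or VM remains the stronger boundary for untrusted input.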

Capability Analysis

Type: OpenClaw Skill
Name: multimodal-parser
Version: 1.0.1

The skill is a legitimate multi-modal content parser that uses standard open-source tools (Tesseract, Poppler, Pandoc, and Whisper) to process images, PDFs, Word documents, and audio files. The implementation in `index.ts` uses `Deno.Command` with argument arrays to execute these local binaries, which is a standard and relatively safe practice. There is no evidence of data exfiltration, malicious execution, or prompt injection; the code's behavior aligns perfectly with its stated purpose in `SKILL.md`.
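To illustrate why argument arrays are considered safer than shell strings, here is a minimal sketch. It is not the skill's actual code: `buildCommand`, `ParseOptions`, and `CommandSpec` are hypothetical names, and the defaults (`eng`, `base`) are assumptions. Each returned pair could be handed to `new Deno.Command(cmd, { args })`, which passes every element to the binary as one argument without shell interpretation.

```typescript
// Hypothetical sketch: map a file kind to a CLI invocation as an argument array.
type FileKind = "image" | "pdf" | "docx" | "audio";

interface ParseOptions {
  ocrLanguage?: string; // tesseract language code (assumed default "eng")
  audioModel?: string;  // whisper model name (assumed default "base")
  firstPage?: number;   // first PDF page to extract
  lastPage?: number;    // last PDF page to extract
}

interface CommandSpec {
  cmd: string;
  args: string[];
}

function buildCommand(kind: FileKind, path: string, opts: ParseOptions = {}): CommandSpec {
  switch (kind) {
    case "image":
      // `stdout` as the output base makes tesseract print the recognized text.
      return { cmd: "tesseract", args: [path, "stdout", "-l", opts.ocrLanguage ?? "eng"] };
    case "pdf": {
      const args: string[] = [];
      if (opts.firstPage !== undefined) args.push("-f", String(opts.firstPage));
      if (opts.lastPage !== undefined) args.push("-l", String(opts.lastPage));
      args.push(path, "-"); // "-" sends the extracted text to stdout
      return { cmd: "pdftotext", args };
    }
    case "docx":
      return { cmd: "pandoc", args: ["-f", "docx", "-t", "plain", path] };
    case "audio":
      return {
        cmd: "whisper",
        args: [path, "--model", opts.audioModel ?? "base", "--output_format", "txt"],
      };
  }
}

// A path containing shell metacharacters stays a single, inert argument.
const spec = buildCommand("image", "my scan; rm -rf x.png");
console.log(spec.args.length, JSON.stringify(spec.args[0]));
// prints: 4 "my scan; rm -rf x.png"
```

One caveat an argument array does not remove: a path that begins with `-` can still be read as an option by the tool itself, so a caller may want to prefix relative paths with `./`.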

Capability Assessment

✓ Purpose & Capability

Name/description match the implementation: the code implements OCR, PDF/DOCX conversion, and audio transcription via tesseract/pdftotext/pandoc/whisper. The dependency list suggested in SKILL.md aligns with what the code invokes.

ℹ Instruction Scope

Runtime instructions and README ask you to install system packages; the code runs those external CLI tools on a user-supplied file path and reads file metadata. It does not attempt to read unrelated system files, access credentials, or send data to remote endpoints, but it will require filesystem read permissions and the ability to spawn subprocesses.

✓ Install Mechanism

No automated install spec is provided (instruction-only for installing system packages). The code imports zod from deno.land at runtime (remote module fetch), which is normal for Deno but is a supply-chain/network fetch to be aware of.

✓ Credentials

The skill declares no environment variables, no credentials, and no config paths. The code does not reference any hidden env vars or secrets.

✓ Persistence & Privilege

The skill sets always: false and uses default invocation settings. It does not persist or modify other skills or global configuration; it only executes when invoked and uses local subprocesses/IO.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install multimodal-parser
After installation, invoke the skill by name or use /multimodal-parser
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.1

- Removed the skill.yaml file to streamline configuration.
- Updated SKILL.md: moved metadata (name, slug, description) into frontmatter.
- Cleaned up documentation structure by removing version, author, license, keywords, runtime, and entry fields from SKILL.md frontmatter.

v1.0.0

multimodal-parser v1.0.0 – Initial Release

- Unified API for parsing images, PDFs, DOCX files, and audio into structured text.
- Built-in OCR for images, transcription for audio, and document parsing with zero configuration required.
- Supports multiple output formats: plain text, Markdown, and structured JSON for LLM-ready processing.
- Helpful error messages with suggested dependency install commands.
- Customizable parameters: file type, output format, OCR language, audio model, and PDF page range.

Metadata

Slug multimodal-parser

Version 1.0.1

License MIT-0

All-time Installs 3

Active Installs 3

Total Versions 2

Frequently Asked Questions

What is multimodal-parser?

multimodal-parser is a unified multi-modal content parser for images, PDF, DOCX, and audio. It applies OCR or transcription automatically and outputs structured text for LLM processing. It is an AI agent skill for Claude Code / OpenClaw, with 630 downloads so far.

How do I install multimodal-parser?

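Run "/install multimodal-parser" in the OpenClaw or Claude Code chat to install the skill in one step. The external CLI tools it calls (tesseract, pdftotext/poppler, pandoc, whisper, ffmpeg) are not installed automatically and must be set up separately.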

Is multimodal-parser free?

Yes, multimodal-parser is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does multimodal-parser support?

multimodal-parser is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created multimodal-parser?

It is built and maintained by Ayalili (@ayalili); the current version is v1.0.1.
