
Multimodal Base

by yuyonghao-123 · GitHub ↗ · v0.1.0 · MIT-0
cross-platform ⚠ suspicious
134 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install yuyonghao-multimodal-base
Description
Supports image understanding, OCR, speech-to-text, and text-to-speech synthesis with multi-voice and multimodal unified processing using OpenAI and Edge TTS.
Usage Guidance
This skill largely does what it says (image understanding, OCR, STT and TTS), but there are several red flags you should consider before installing:
- It requires an OpenAI API key (used by multiple modules) even though the registry metadata lists no required env vars. Confirm you are willing to provide that key, and limit the key's scope if possible.
- The README asks you to pip install the Python 'edge-tts' CLI, but package.json also lists an npm 'edge-tts' package, so clarify which implementation is intended. The code spawns a system 'edge-tts' command, so you must install the Python CLI or otherwise provide that executable.
- The speech recognizer can run locally and will attempt to download a Whisper model from Hugging Face and save it to disk. Downloading and executing third-party binaries carries risk; review the model URL and consider running in a sandbox or verifying checksums.
- The code spawns external executables ('whisper' / whisper.cpp, 'edge-tts', 'ffprobe') and writes output/temp files. Ensure you trust the package source and run it in an environment where those binaries and filesystem writes are acceptable.
- If you want to proceed: ask the author to (1) update registry metadata to declare required env vars (OPENAI_API_KEY), (2) clarify install steps (npm vs pip edge-tts), and (3) document where the Whisper model is stored and whether checksums/verifications are provided. Otherwise run the skill in an isolated container or VM and avoid giving it high-privilege credentials.
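One way to act on the checksum advice above is to compare a downloaded model's SHA-256 digest with a value published by a source you trust. The following is a minimal sketch, not code from the skill: the function names `sha256OfFile` and `verifyChecksum` and the demo file are hypothetical, and the digest shown is the standard SHA-256 test vector for the string "abc", not a real model checksum.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hash the file in 1 MiB chunks so large model files are not loaded
// into memory all at once. Returns a lowercase hex digest.
function sha256OfFile(filePath) {
  const hash = crypto.createHash("sha256");
  const fd = fs.openSync(filePath, "r");
  const buffer = Buffer.alloc(1024 * 1024);
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest("hex");
}

// Return true only when the file's digest equals the expected hex digest.
function verifyChecksum(filePath, expectedHex) {
  const actual = Buffer.from(sha256OfFile(filePath), "hex");
  const expected = Buffer.from(expectedHex.toLowerCase(), "hex");
  return actual.length === expected.length &&
    crypto.timingSafeEqual(actual, expected);
}

// Demo: a throwaway file stands in for a downloaded model (assumption).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "model-check-"));
const demoFile = path.join(demoDir, "model.bin");
fs.writeFileSync(demoFile, "abc");
const knownDigest =
  "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
console.log(verifyChecksum(demoFile, knownDigest)); // prints true
```

In practice the expected digest would come from the model publisher, and a mismatch would be a reason to delete the file rather than load it.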
Capability Analysis
Type: OpenClaw Skill
Name: yuyonghao-multimodal-base
Version: 0.1.0
The skill bundle provides a standard multimodal integration for image processing (GPT-4o/Tesseract), speech recognition (Whisper), and text-to-speech (Edge-TTS). The code uses legitimate APIs and follows standard patterns for handling external binaries via `child_process.spawn` with argument arrays, which mitigates basic shell injection risks. No evidence of data exfiltration, malicious persistence, or prompt injection was found across the source files (src/image-processor.js, src/speech-recognizer.js, etc.).
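To illustrate why argument arrays help, here is a small sketch that is not taken from the skill's source. It uses `spawnSync`, the synchronous sibling of `child_process.spawn`, and runs the current Node binary as a stand-in for an external tool such as 'edge-tts'. The `echoArgument` helper and the sample input are hypothetical.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical untrusted input, e.g. a file name supplied by a user.
const untrusted = "clip.mp3; echo INJECTED";

// Launch a child process with an argument array and no shell.
// The whole string reaches the child as a single argv entry, so the
// "; echo INJECTED" part is never interpreted as a second command.
function echoArgument(value) {
  const result = spawnSync(
    process.execPath, // stand-in for an external executable (assumption)
    ["-e", "process.stdout.write(process.argv[1])", value],
    { encoding: "utf8", shell: false }
  );
  if (result.error) throw result.error;
  return result.stdout;
}

const received = echoArgument(untrusted);
console.log(JSON.stringify(received));
```

The child receives the input byte for byte. Building a single command string and running it through a shell would instead let the `;` start a second command.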
Capability Assessment
Purpose & Capability
Code and SKILL.md implement image understanding (OpenAI GPT-4V), OCR (tesseract.js), Whisper-based speech-to-text (API and local), and Edge TTS, which matches the skill description. However, the registry metadata declares no required env vars, while the code and docs rely on OPENAI_API_KEY. SKILL.md also asks users to pip install the Python edge-tts package, while package.json lists an npm 'edge-tts' dependency; this mismatch is unexplained.
Instruction Scope
Runtime instructions and code perform file reads/writes (images, audio, temp files, output directory), call external network APIs (OpenAI endpoints, Hugging Face model URL), and spawn local executables ('whisper' / whisper.cpp, 'edge-tts', and 'ffprobe'). The pipeline also implements an automatic model download from a Hugging Face URL. Those actions go beyond pure in-process computation and require user awareness and filesystem/network permissions.
Install Mechanism
There is no automated install spec in the registry (instruction-only), but SKILL.md instructs users to run npm install and pip install edge-tts. At runtime the code will download a binary model from a Hugging Face URL, then extract it and write it to disk. Downloading and extracting model binaries and depending on external CLI tools increases risk and should be reviewed; the pip vs npm edge-tts ambiguity is also an installation coherence issue.
Credentials
The code and documentation require an OpenAI API key (process.env.OPENAI_API_KEY) for image and audio API calls, but the registry metadata lists no required environment variables. The skill also expects system binaries (whisper executable, edge-tts CLI, ffprobe) which are not declared in metadata. The requested access (OpenAI key + ability to write model and audio files + spawn executables) is significant and should be clearly declared and limited to what the user expects.
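Since these requirements are not declared in the metadata, a user could check for them before running the skill. The sketch below is an assumption about how such a preflight check might look, not part of the skill: `findOnPath` and `preflight` are hypothetical helpers, and the variable and executable names are the ones this review mentions.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Names taken from the review text; adjust them to your own setup.
const REQUIRED_ENV = ["OPENAI_API_KEY"];
const REQUIRED_BINARIES = ["whisper", "edge-tts", "ffprobe"];

// Look for an executable file on PATH without running it.
function findOnPath(name, env = process.env) {
  const dirs = (env.PATH || "").split(path.delimiter).filter(Boolean);
  const exts = process.platform === "win32"
    ? (env.PATHEXT || ".EXE;.CMD;.BAT").split(";")
    : [""];
  for (const dir of dirs) {
    for (const ext of exts) {
      const candidate = path.join(dir, name + ext);
      try {
        fs.accessSync(candidate, fs.constants.X_OK);
        if (fs.statSync(candidate).isFile()) return candidate;
      } catch {
        // Not present or not accessible here; keep looking.
      }
    }
  }
  return null;
}

// Report which environment variables and executables are missing.
function preflight(env = process.env) {
  return {
    missingEnv: REQUIRED_ENV.filter((key) => !env[key]),
    missingBinaries: REQUIRED_BINARIES.filter(
      (bin) => findOnPath(bin, env) === null
    ),
  };
}

console.log(preflight());
```

An empty `missingEnv` and `missingBinaries` only shows that the prerequisites exist. It says nothing about whether the executables found are the ones the skill expects.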
Persistence & Privilege
The skill does not request permanent inclusion (always:false) and does not modify other skills or global agent settings. It stores output and temporary files within its own directories but does not claim elevated privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install yuyonghao-multimodal-base
  3. After installation, invoke the skill by name or use /yuyonghao-multimodal-base
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.0
Initial release of Multimodal Base Skill.
- Provides unified pipeline for image understanding (GPT-4V, OCR), speech recognition (Whisper), and speech synthesis (Edge TTS)
- Supports flexible multimodal input/output handling in one interface
- Includes separate ImageProcessor, SpeechRecognizer, and SpeechSynthesizer modules
- Offers configurable options including model, API key, language, and voice settings
- Out-of-the-box support for 8 TTS voice types across 4 languages
- Documentation and code examples for quick setup and API usage
Metadata
Slug: yuyonghao-multimodal-base
Version: 0.1.0
License: MIT-0
All-time Installs: 0
Active Installs: 0
Total Versions: 1
Frequently Asked Questions

What is Multimodal Base?

Multimodal Base supports image understanding, OCR, speech-to-text, and text-to-speech synthesis with multi-voice and multimodal unified processing using OpenAI and Edge TTS. It is an AI Agent Skill for Claude Code / OpenClaw, with 134 downloads so far.

How do I install Multimodal Base?

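Run "/install yuyonghao-multimodal-base" in the OpenClaw or Claude Code chat to install it. According to the usage guidance for this skill, it also needs an OpenAI API key (OPENAI_API_KEY) and the external 'edge-tts', 'whisper', and 'ffprobe' executables, so some extra setup is required.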

Is Multimodal Base free?

Yes. Multimodal Base is licensed under MIT-0, so you can download, install and use the skill itself at no cost. Requests made with your own OpenAI API key may be billed separately by OpenAI.

Which platforms does Multimodal Base support?

Multimodal Base is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Multimodal Base?

It is built and maintained by yuyonghao-123 (@yuyonghao-123); the current version is v0.1.0.
