peand-rover

Caption Generator Image

by peandrover adam · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Downloads: 72 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install caption-generator-image
Description
Get captioned image videos ready to post, without touching a single slider. Upload your images with text (JPG, PNG, WEBP, GIF, up to 200MB), say something li...
README (SKILL.md)

Getting Started

Send me your images with text and I'll handle the AI caption generation. Or just describe what you're after.

Try saying:

  • "turn a product photo or social media image into a 1080p MP4"
  • "generate a caption for this image and overlay it as text"
  • "add auto-generated captions to images for social media posts"

Getting Connected

Before handling any user request, establish a connection to the backend API. Show a brief status like "Connecting...".

If NEMO_TOKEN is in the environment, use it directly and create a session. Otherwise, acquire a free starter token:

  • Generate a UUID as client identifier
  • POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
  • The response includes a token with 100 free credits valid for 7 days — use it as NEMO_TOKEN

Then create a session by POSTing to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer authorization and body {"task_name":"project","language":"en"}. The session_id in the response is needed for all following requests.
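The two connection steps above can be sketched as request builders. This is a minimal sketch, not the skill's own code: the function names and the Content-Type header are assumptions, and the requests are only built, not sent. The source doesn't name the token field in the response, so parsing it is left out. The attribution headers this document requires on every request would be merged in as well.

```python
import json
import os
import uuid
import urllib.request

BASE_URL = "https://mega-api-prod.nemovideo.ai"


def existing_token():
    """Return NEMO_TOKEN from the environment, or None if it is not set."""
    return os.environ.get("NEMO_TOKEN")


def build_anonymous_token_request(client_id=None):
    """Build (not send) the POST that asks for a free starter token."""
    client_id = client_id or str(uuid.uuid4())
    return urllib.request.Request(
        BASE_URL + "/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def build_session_request(token, language="en"):
    """Build (not send) the POST that creates a session."""
    body = json.dumps({"task_name": "project", "language": language})
    return urllib.request.Request(
        BASE_URL + "/api/tasks/me/with-session/nemo_agent",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )
```

Passing either request to urllib.request.urlopen would perform the call; the session_id would then be read from the JSON response.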

Tell the user you're ready. Keep the technical details out of the chat.

Caption Generator Image — Generate Captions from Images

This tool takes your images with text and runs AI caption generation through a cloud rendering pipeline. You upload, describe what you want, and download the result.

Say you have a product photo or social media image and want a caption generated and overlaid as text. The backend processes it in about 20-40 seconds and returns a 1080p MP4.

Tip: high-contrast images produce more accurate auto-generated captions.

Matching Input to Actions

Prompts that mention captions, aspect ratio, text overlays, or audio tracks are routed to the corresponding action by keyword and intent classification.

User says... Action Skip SSE?
"export" / "导出" / "download" / "send me the video" → §3.5 Export
"credits" / "积分" / "balance" / "余额" → §3.3 Credits
"status" / "状态" / "show tracks" → §3.4 State
"upload" / "上传" / user sends file → §3.2 Upload
Everything else (generate, edit, add BGM…) → §3.1 SSE
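The routing table above can be approximated with plain keyword matching. This is a rough sketch only: the real skill also mentions intent classification, which this does not attempt. The route function, the action names, and the rule that an attached file always means upload are assumptions.

```python
# Keyword lists copied from the table; checked in table order.
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]


def route(message, has_file=False):
    """Pick an action name for a user message; falls back to the SSE path."""
    if has_file:  # assumption: a sent file is always treated as an upload
        return "upload"
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action
    return "sse"
```

Substring matching is deliberately loose here, so a message like "check my status" lands on the state action even though it is not an exact keyword.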

Cloud Render Pipeline Details

Each export job queues on a cloud GPU node that composites video layers, applies platform-spec compression (H.264, up to 1080x1920), and returns a download URL within 30-90 seconds. The session token carries render job IDs, so closing the tab before completion orphans the job.

Base URL: https://mega-api-prod.nemovideo.ai

Endpoint | Method | Purpose
/api/tasks/me/with-session/nemo_agent | POST | Start a new editing session. Body: {"task_name":"project","language":"<lang>"}. Returns session_id.
/run_sse | POST | Send a user message. Body includes app_name, session_id, new_message. Stream response with Accept: text/event-stream. Timeout: 15 min.
/api/upload-video/nemo_agent/me/<sid> | POST | Upload a file (multipart) or URL.
/api/credits/balance/simple | GET | Check remaining credits (available, frozen, total).
/api/state/nemo_agent/me/<sid>/latest | GET | Fetch current timeline state (draft, video_infos, generated_media).
/api/render/proxy/lambda | POST | Start export. Body: {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll status every 30s.
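The export body and the 30-second polling rule from the last row might look like this in practice. It is a sketch under assumptions: the timestamp is taken as Unix seconds (the source only shows a ts placeholder), and since the status endpoint and its values aren't documented here, fetch_status and is_done are hypothetical callables you would supply.

```python
import time


def build_export_body(session_id, draft, now=None):
    """Assemble the JSON body for POST /api/render/proxy/lambda."""
    ts = int(now if now is not None else time.time())  # assumption: seconds
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_until(fetch_status, is_done, interval_s=30, max_polls=10, sleep=time.sleep):
    """Call fetch_status every interval_s seconds until is_done(status) is true.

    Returns the final status, or None if max_polls is reached first.
    """
    for attempt in range(max_polls):
        status = fetch_status()
        if is_done(status):
            return status
        if attempt < max_polls - 1:
            sleep(interval_s)
    return None
```

Injecting sleep keeps the loop easy to exercise without real waiting, and max_polls stops a stuck job from polling forever.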

Accepted file types: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

Three attribution headers are required on every request and must match this file's frontmatter:

Header | Value
X-Skill-Source | caption-generator-image
X-Skill-Version | frontmatter version
X-Skill-Platform | auto-detect: clawhub / cursor / unknown from install path

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
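One way to assemble the required headers is a small helper like the one below. Treat it as an illustration: the source only says the platform is auto-detected from the install path, so the substring check in detect_platform is an assumption, as are both function names.

```python
def detect_platform(install_path):
    """Guess the platform from the install path (assumed substring match)."""
    path = install_path.lower()
    for name in ("clawhub", "cursor"):
        if name in path:
            return name
    return "unknown"


def build_headers(token, version, install_path=""):
    """Return the auth and attribution headers required on every request."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "caption-generator-image",
        "X-Skill-Version": version,  # should match the frontmatter version
        "X-Skill-Platform": detect_platform(install_path),
    }
```

For example, build_headers(token, "1.0.0", "/home/me/.cursor/skills") would report the platform as cursor, while an unrecognized path falls back to unknown.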

Error Handling

Code | Meaning | Action
0 | Success | Continue
1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days)
1002 | Session not found | New session §3.0
2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account"
4001 | Unsupported file | Show supported formats
4002 | File too large | Suggest compress/trim
400 | Missing X-Client-Id | Generate Client-Id and retry (see §1)
402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export."
429 | Rate limit (1 token/client/7 days) | Retry in 30s once
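The table above maps naturally onto a lookup. The sketch below is one possible shape: the action labels are hypothetical names for the table's Action column, the registration URL is passed in because the source doesn't give it, and the "Register to continue" wording is illustrative.

```python
# Codes from the table; labels are illustrative names for each action.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauth_anonymous_token",
    1002: "create_new_session",
    2001: "handle_no_credits",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_register_or_upgrade",
    429: "retry_once_after_30s",
}


def next_action(code):
    """Return the action label for a response code, or a fallback label."""
    return ERROR_ACTIONS.get(code, "unknown_error")


def no_credits_message(is_anonymous, registration_url=None, bind_id=None):
    """Build the user-facing message for code 2001."""
    if is_anonymous:
        # registration_url is a placeholder; the source does not state it.
        return f"Register to continue: {registration_url}?bind={bind_id}"
    return "Top up credits in your account"
```

Keeping 402 separate from 2001 matters here, since the table treats it as a plan issue rather than a credit shortage.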

Reading the SSE Stream

Text events go straight to the user (after GUI translation). Tool calls stay internal. Heartbeats and empty data: lines mean the backend is still working — show "⏳ Still working..." every 2 minutes.

About 30% of edit operations close the stream without any text. When that happens, poll /api/state to confirm the timeline changed, then tell the user what was updated.
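A line-level reader for the stream could be sketched as follows. It relies on the standard event-stream format, where payload lines start with "data:" and comment lines start with ":". Counting comment lines as heartbeats is an assumption. The event schema isn't shown in the source, so the sketch doesn't try to tell text events from tool calls and simply treats "no payloads at all" as the cue to poll state.

```python
def classify_sse_line(line):
    """Classify one raw line from the stream as a (kind, value) pair."""
    line = line.rstrip("\r\n")
    if line == "":
        return ("boundary", None)  # a blank line ends an event
    if line.startswith(":"):
        return ("heartbeat", None)  # comment line, assumed keep-alive
    if line.startswith("data:"):
        payload = line[len("data:"):].strip()
        if payload:
            return ("data", payload)
        return ("heartbeat", None)  # empty data: line, still working
    return ("other", line)


def read_stream(lines):
    """Collect data payloads and count heartbeats from an iterable of lines."""
    payloads = []
    heartbeats = 0
    for raw in lines:
        kind, value = classify_sse_line(raw)
        if kind == "data":
            payloads.append(value)
        elif kind == "heartbeat":
            heartbeats += 1
    return payloads, heartbeats


def should_poll_state(payloads):
    """True when the stream closed without delivering any payload."""
    return len(payloads) == 0
```

A fuller version would decode each payload and check for text before deciding whether to fall back to /api/state.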

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

Backend says | You do
"click [button]" / "点击" | Execute via API
"open [panel]" / "打开" | Query session state
"drag/drop" / "拖拽" | Send edit via SSE
"preview in timeline" | Show track summary
"Export button" / "导出" | Execute export workflow

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

Timeline (3 tracks):
  1. Video: city timelapse (0-10s)
  2. BGM: Lo-fi (0-10s, 35%)
  3. Title: "Urban Dreams" (0-3s)
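To turn a draft into a track summary like the one above, something along these lines may work. The nesting is an assumption: the source gives only the short field names, so this sketch guesses that tracks sit under "t", each track has "tt" and "sg", and each segment carries its own "d" in milliseconds. The summarize_draft name is hypothetical.

```python
# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft):
    """Return one summary line per track (assumed draft layout, see above)."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:.1f}s"
        )
    return lines
```

Names such as "city timelapse" would presumably come from the metadata field m, whose layout isn't documented here, so the sketch reports only type, segment count, and duration.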

Common Workflows

Quick edit: Upload → "generate a caption for this image and overlay it as text" → Download MP4. Takes 20-40 seconds for a 30-second clip.

Batch style: Upload multiple files in one session. Process them one by one with different instructions. Each gets its own render.

Iterative: Start with a rough cut, preview the result, then refine. The session keeps your timeline state so you can keep tweaking.

Tips and Tricks

The backend processes faster when you're specific. Instead of "make it look better", try "generate a caption for this image and overlay it as text" — concrete instructions get better results.

Max file size is 200MB. Stick to JPG, PNG, WEBP, GIF for the smoothest experience.
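A quick local check before uploading can save a round trip. The sketch below reuses the accepted file types and the 4001/4002 codes listed in this document. Returning those codes from a client-side check is just a convenience, and reading "200MB" as 200 × 1024 × 1024 bytes is an assumption.

```python
import os

MAX_BYTES = 200 * 1024 * 1024  # assumption: MB means 2**20 bytes
ACCEPTED = {
    "mp4", "mov", "avi", "webm", "mkv",
    "jpg", "png", "gif", "webp",
    "mp3", "wav", "m4a", "aac",
}


def check_upload(filename, size_bytes):
    """Return 0 if the file looks uploadable, else 4001 or 4002."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in ACCEPTED:
        return 4001  # unsupported file
    if size_bytes > MAX_BYTES:
        return 4002  # file too large
    return 0
```

Note that the accepted list names jpg but not jpeg, so this sketch rejects a .jpeg extension; whether the backend does the same is not stated.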

Export as MP4 for widest compatibility.

Usage Guidance
This skill uploads whatever images/audio you provide to a third‑party backend (mega-api-prod.nemovideo.ai). If you supply a NEMO_TOKEN it will be used for all requests; if you do not, the skill will create an anonymous token by calling the service. Consider the sensitivity of images you upload (they may contain PII or copyrighted material). Verify the service/domain is one you trust before providing a long‑lived token; prefer using an anonymous/limited token if possible and rotate tokens if you decide to supply a personal credential. Note the metadata lists a config path (~/.config/nemovideo/) that the instructions don't use — if you see later behavior that reads or writes local config, review it carefully. Finally, remember the agent can perform uploads automatically if invoked; restrict autonomous use if you want to limit accidental uploads.
Capability Analysis
Type: OpenClaw Skill Name: caption-generator-image Version: 1.0.0 The skill facilitates image captioning by interfacing with a third-party cloud rendering API (mega-api-prod.nemovideo.ai). It exhibits high-risk behaviors including mandatory network access for file uploads, automated generation of client identifiers, and the management of authentication tokens. Notably, the instructions establish a remote-control-like pattern where the agent is directed to execute API actions based on strings received from the backend's SSE stream (e.g., translating 'click button' messages into API calls). While these functions are plausibly aligned with the stated purpose of a cloud-based video editor, the combination of external data exfiltration (uploads), filesystem access requirements (~/.config/nemovideo/), and the backend-driven execution logic meets the criteria for a suspicious classification.
Capability Assessment
Purpose & Capability
Name/description, declared primaryEnv (NEMO_TOKEN), and the SKILL.md all point to a cloud rendering/captioning service at mega-api-prod.nemovideo.ai. Requiring an API token for this service is proportionate and expected.
Instruction Scope
Instructions only describe authenticating, creating a session, uploading media, streaming SSE, checking credits, and starting renders — all consistent with a caption/render service. They do instruct generating an anonymous token when NEMO_TOKEN is absent and to upload user media to the third‑party API, so users should be aware media (and any embedded PII) will be sent off‑host. The metadata mentions a config path (~/.config/nemovideo/) but the SKILL.md does not show any explicit read/write to that path (minor inconsistency).
Install Mechanism
Instruction-only skill with no install spec and no code files, so nothing is written to disk by an installer. This is the lowest-risk installation model.
Credentials
Only one required environment variable (NEMO_TOKEN) is declared and used for Bearer authorization; that matches the documented API usage. The presence of an optional configPaths entry in metadata is slightly unexpected since SKILL.md doesn't reference it, but it's not excessive.
Persistence & Privilege
always is false and the skill does not request persistent system privileges or modify other skills. It can be invoked autonomously (the platform default), which increases convenience but is not an intrinsic incoherence.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install caption-generator-image
  3. After installation, invoke the skill by name or use /caption-generator-image
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of Caption Generator Image – Generate Captions from Images.
  • Instantly generate and overlay AI captions on your uploaded images (JPG, PNG, WEBP, GIF up to 200MB).
  • Download captioned image videos as 1080p MP4, optimized for social media.
  • Automatic session creation and backend connection with free credits for new users.
  • Supports quick edits, batch processing, and iterative workflows with timeline state tracking.
  • Clear error handling for common issues (file size, format, credits, etc.).
  • Rapid cloud rendering pipeline (20–90s per job) with plain, actionable user prompts.
Metadata
Slug caption-generator-image
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Caption Generator Image?

Get captioned image videos ready to post, without touching a single slider. Upload your images with text (JPG, PNG, WEBP, GIF, up to 200MB), say something li... It is an AI Agent Skill for Claude Code / OpenClaw, with 72 downloads so far.

How do I install Caption Generator Image?

Run "/install caption-generator-image" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Caption Generator Image free?

Yes, Caption Generator Image is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Caption Generator Image support?

Caption Generator Image is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Caption Generator Image?

It is built and maintained by peandrover adam (@peand-rover); the current version is v1.0.0.
