Description

Generate video from supplied image using Veo-3.0 async API. Gemini vision analyzes image, Veo creates video via predictLongRunning. All outputs in ~/.opencla...

Usage (SKILL.md)

Image to Video Generator (v3.0.0 – Working Veo-3.0 REST API)

Name: Image To Video Gen
Author: j3ffyang

Status: ✅ Tested & Working
Last Updated: 2026-04-11
Example: Golden Tibetan offering → 2.4 MB video in ~60 seconds

What it does

Generates a cinematic 5-second video from any image using Google's Veo-3.0:

Gemini Vision (2.5-flash) analyzes image for motion/scene description
Veo-3.0 predictLongRunning generates video asynchronously via REST API
Polling loop monitors operation until video is ready (~60-90 seconds)
Download saves MP4 to ~/.openclaw/workspace/tibetanProc/ with yymmddHHMM prefix
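
The yymmddHHMM prefix used for every output file can be sketched as a small helper. `prefixed_name` is a hypothetical name that is not part of the skill; the format string is the same `%y%m%d%H%M` pattern the Quick Start script uses.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Workspace directory named in this document.
WORKSPACE = Path.home() / ".openclaw" / "workspace" / "tibetanProc"


def prefixed_name(suffix: str, now: Optional[datetime] = None) -> str:
    """Return '<yymmddHHMM>_<suffix>' (hypothetical helper, not part of the skill)."""
    stamp = (now or datetime.now()).strftime("%y%m%d%H%M")
    return f"{stamp}_{suffix}"


# A run started on 2026-04-11 at 04:11 yields the prefix 2604110411.
example = prefixed_name("video.mp4", datetime(2026, 4, 11, 4, 11))
print(WORKSPACE / example)
```

Computing the prefix once per run and reusing it keeps the input image, prompts, and video grouped under the same timestamp.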

Key Fix (v3.0.0): REST payload must use bytesBase64Encoded field (not data or inline_data)

Inputs

Image: Local path or URL
Duration: 5-10 seconds (optional)
Style: Hint like "cinematic", "ethereal", "smooth" (optional)
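
The Quick Start script expects the image to already exist in the workspace and fixes the duration at 5 seconds. A minimal sketch of how the inputs listed above might be normalized first is given here; `stage_image` and `check_duration` are hypothetical helpers, and the URL branch uses only the standard library.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def stage_image(source: str, dest: Path) -> Path:
    """Copy a local image, or download an http(s) URL, to dest (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if urlparse(source).scheme in ("http", "https"):
        # Remote image: stream the response body straight into the file.
        with urllib.request.urlopen(source, timeout=30) as resp, open(dest, "wb") as fh:
            shutil.copyfileobj(resp, fh)
    else:
        # Local image: expand "~" and copy the bytes unchanged.
        shutil.copyfile(Path(source).expanduser(), dest)
    return dest


def check_duration(seconds: int = 5) -> int:
    """Validate the optional duration against the documented 5-10 second range."""
    if not 5 <= seconds <= 10:
        raise ValueError("Duration must be between 5 and 10 seconds")
    return seconds
```

The staged file can then be named `<timestamp>_input_image.jpg` so that the rest of the workflow finds it.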

Quick Start

python3 << 'PYEOF'
import os
import sys
import json
import time
import base64
import requests
import google.generativeai as genai
from pathlib import Path
from datetime import datetime

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
WORKSPACE = Path.home() / ".openclaw" / "workspace" / "tibetanProc"
WORKSPACE.mkdir(parents=True, exist_ok=True)
TIMESTAMP = datetime.now().strftime("%y%m%d%H%M")

# Step 1: Load image
IMAGE_PATH = WORKSPACE / f"{TIMESTAMP}_input_image.jpg"
if not IMAGE_PATH.exists():
    print("✗ Image not found")
    sys.exit(1)

print(f"✓ Image: {IMAGE_PATH.name}")

# Step 2: Analyze with Gemini
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")
image_file = genai.upload_file(str(IMAGE_PATH), mime_type="image/jpeg")

prompt = """Analyze for cinematic video: describe subject, setting, lighting, textures, 
suggested camera movements (dolly, pan, orbit, zoom, rack focus)."""

response = model.generate_content([prompt, image_file])
analysis = response.text

prompt_path = WORKSPACE / f"{TIMESTAMP}_prompt.md"
with open(prompt_path, "w") as f:
    f.write(analysis)
print(f"✓ Analysis: {prompt_path.name}")

# Step 3: Create enhanced prompt
enhanced = f"""VIDEO GENERATION PROMPT
Duration: 5 seconds
Quality: High Definition

SCENE ANALYSIS:
{analysis}

MOTION GUIDELINES:
- Smooth, deliberate camera movement
- Enhance visual depth with elegant transitions
- Maintain consistent lighting
- Cinematic color grading
"""

enhanced_path = WORKSPACE / f"{TIMESTAMP}_enhanced_prompt.txt"
with open(enhanced_path, "w") as f:
    f.write(enhanced)
print(f"✓ Enhanced prompt: {enhanced_path.name}")

# Step 4: Call Veo API with CORRECT field names
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

VEO_URL = f"https://generativelanguage.googleapis.com/v1beta/models/veo-3.0-generate-001:predictLongRunning?key={GOOGLE_API_KEY}"

# ✅ CORRECT PAYLOAD (v3.0.0)
payload = {
    "instances": [{
        "prompt": enhanced,
        "image": {
            "bytesBase64Encoded": image_b64,  # ← CRITICAL: NOT "data" or "inline_data"
            "mimeType": "image/jpeg"
        }
    }]
}

print("\n🎬 Calling Veo API...")
response = requests.post(VEO_URL, json=payload, timeout=60)

if response.status_code not in [200, 202]:
    print(f"✗ API error: {response.json()}")
    sys.exit(1)

result = response.json()
operation_name = result.get("name")
if not operation_name:
    print(f"✗ No operation name")
    sys.exit(1)

print(f"✓ Operation: {operation_name}")

# Step 5: Poll until complete
POLL_URL = f"https://generativelanguage.googleapis.com/v1beta/{operation_name}?key={GOOGLE_API_KEY}"

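start = time.monotonic()
for attempt in range(1, 121):
    # Short first wait, then poll every 5 seconds (about 10 minutes in total)
    time.sleep(5 if attempt > 1 else 2)
    
    poll_response = requests.get(POLL_URL, timeout=10)
    poll_result = poll_response.json()
    
    if poll_result.get("done"):
        print(f"✓ Complete in {time.monotonic() - start:.0f}s")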
        
        # Extract video URL
        try:
            video_uri = poll_result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
        except (KeyError, IndexError):
            print(f"✗ No video in response")
            print(json.dumps(poll_result, indent=2)[:500])
            sys.exit(1)
        
        # Step 6: Download video
        print(f"⬇️  Downloading...")
        video_response = requests.get(f"{video_uri}&key={GOOGLE_API_KEY}", timeout=120, stream=True)
        
        if video_response.status_code == 200:
            output_path = WORKSPACE / f"{TIMESTAMP}_video.mp4"
            with open(output_path, "wb") as f:
                for chunk in video_response.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
            
            size_mb = output_path.stat().st_size / (1024 * 1024)
            print(f"✓ Video: {output_path.name} ({size_mb:.1f} MB)")
            sys.exit(0)
        else:
            print(f"✗ Download failed: {video_response.status_code}")
            sys.exit(1)

print("✗ Timeout")
sys.exit(1)

PYEOF

Detailed Workflow

Step 1: Prepare Image

WORKSPACE="$HOME/.openclaw/workspace/tibetanProc"
mkdir -p "$WORKSPACE"
TIMESTAMP=$(date +%y%m%d%H%M)
cp ./my_image.jpg "$WORKSPACE/${TIMESTAMP}_input_image.jpg"

Step 2: Analyze with Gemini Vision

import os
import google.generativeai as genai
from pathlib import Path

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
WORKSPACE = Path.home() / ".openclaw" / "workspace" / "tibetanProc"

genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")

IMAGE_PATH = WORKSPACE / "2604110411_input_image.jpg"
image_file = genai.upload_file(str(IMAGE_PATH), mime_type="image/jpeg")

analysis_prompt = """Analyze this image for cinematic video generation:
1. Main subject and focal point
2. Setting and environment
3. Lighting direction and mood
4. Materials and textures
5. Suggested camera movements (dolly, pan, orbit, zoom, rack focus)
6. Overall energy and pacing"""

response = model.generate_content([analysis_prompt, image_file])

# Save analysis
with open(WORKSPACE / "2604110411_prompt.md", "w") as f:
    f.write(response.text)

Step 3: Create Enhanced Prompt

# Read analysis
with open(WORKSPACE / "2604110411_prompt.md") as f:
    analysis = f.read()

# Add video instructions
enhanced = f"""VIDEO GENERATION PROMPT
Duration: 5 seconds
Quality: High Definition
Frame Rate: 24fps

SCENE ANALYSIS:
{analysis}

MOTION GUIDELINES:
- Smooth, deliberate camera movement
- Enhance visual depth with elegant transitions
- Maintain consistent lighting throughout
- Cinematic color grading
- Focus on visual storytelling

TECHNICAL SPECS:
- Resolution: 1080p minimum
- Aspect Ratio: 16:9
- Format: MP4 (H.264)
"""

with open(WORKSPACE / "2604110411_enhanced_prompt.txt", "w") as f:
    f.write(enhanced)

Step 4: Call Veo API (THE CRITICAL PART)

import base64
import requests

# Encode image
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Read enhanced prompt
with open(WORKSPACE / "2604110411_enhanced_prompt.txt") as f:
    prompt = f.read()

# ✅ CORRECT PAYLOAD STRUCTURE (v3.0.0)
payload = {
    "instances": [{
        "prompt": prompt,
        "image": {
            "bytesBase64Encoded": image_b64,    # ← KEY: NOT "data" or "inline_data"
            "mimeType": "image/jpeg"            # ← Must be here
        }
    }]
}

VEO_URL = f"https://generativelanguage.googleapis.com/v1beta/models/veo-3.0-generate-001:predictLongRunning?key={GOOGLE_API_KEY}"

response = requests.post(VEO_URL, json=payload, timeout=60)
result = response.json()

if response.status_code in [200, 202]:
    operation_name = result["name"]
    print(f"✓ Operation: {operation_name}")
else:
    print(f"✗ Error: {result}")
    raise SystemExit(1)  # works without importing sys; bare exit() is meant for interactive use
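
The payload above hardcodes `image/jpeg`. A minimal sketch of a builder that derives `mimeType` from the file extension follows; `build_veo_payload` is a hypothetical helper, and whether Veo accepts image types other than JPEG is an assumption this document does not confirm.

```python
import base64
import mimetypes
from pathlib import Path


def build_veo_payload(image_path: Path, prompt: str) -> dict:
    """Build a predictLongRunning request body (hypothetical helper)."""
    # Guess the MIME type from the extension, e.g. ".jpg" -> "image/jpeg".
    mime_type, _ = mimetypes.guess_type(image_path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path.name}")
    encoded = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    return {
        "instances": [{
            "prompt": prompt,
            "image": {
                "bytesBase64Encoded": encoded,  # field name required by the REST API
                "mimeType": mime_type,
            },
        }]
    }
```

The result can be passed directly as `json=` to `requests.post`, in place of the hand-written `payload` dictionary.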

Step 5: Poll Operation Status

import time

POLL_URL = f"https://generativelanguage.googleapis.com/v1beta/{operation_name}?key={GOOGLE_API_KEY}"

for attempt in range(120):  # 10 minutes max
    time.sleep(5)
    
    poll_response = requests.get(POLL_URL, timeout=10)
    poll_result = poll_response.json()
    
    if poll_result.get("done"):
        print(f"✓ Complete in {(attempt + 1) * 5}s")
        
        # Extract video URL from response structure
        video_uri = poll_result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
        print(f"✓ Video URL: {video_uri}")
        break
    
    progress = poll_result.get("metadata", {}).get("progressPercentage", "?")
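    print(f"  Polling ({attempt + 1}/120)... {progress}%")
else:
    # for/else: runs only when the loop ends without break (operation never done)
    raise SystemExit("✗ Timeout: operation not done after 10 minutes")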
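
The polling loop can also be factored into a function that is testable without network access. This is a sketch, not the skill's own code: `poll_until_done` is a hypothetical name, and the check for a top-level `error` field assumes the operation follows the usual Google long-running operation shape, which this document does not show.

```python
import time
from typing import Callable


def poll_until_done(
    fetch: Callable[[], dict],
    interval: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the operation reports done; raise on error or timeout."""
    for _ in range(max_attempts):
        result = fetch()
        if "error" in result:
            # Assumed failure shape: {"error": {...}} at the top level.
            raise RuntimeError(f"Operation failed: {result['error']}")
        if result.get("done"):
            return result
        sleep(interval)
    raise TimeoutError(f"Operation not done after {max_attempts} attempts")
```

In the workflow above, `fetch` would be `lambda: requests.get(POLL_URL, timeout=10).json()`. Passing `sleep` as a parameter lets a test substitute a no-op instead of waiting.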

Step 6: Download Video

# URL needs API key appended
download_url = f"{video_uri}&key={GOOGLE_API_KEY}"

video_response = requests.get(download_url, timeout=120, stream=True)
video_response.raise_for_status()  # stop here rather than write an error body into the .mp4

output_path = WORKSPACE / "2604110411_video.mp4"
with open(output_path, "wb") as f:
    for chunk in video_response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)

size_mb = output_path.stat().st_size / (1024 * 1024)
print(f"✓ Video downloaded: {output_path} ({size_mb:.1f} MB)")

Response Structure

Initial Response (Step 4)

{
  "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn"
}

Final Response (Step 5 - when done=true)

{
  "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn",
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.ai.generativelanguage.v1beta.PredictLongRunningResponse",
    "generateVideoResponse": {
      "generatedSamples": [
        {
          "video": {
            "uri": "https://generativelanguage.googleapis.com/v1beta/files/txg4shogthoc:download?alt=media"
          }
        }
      ]
    }
  }
}
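
Given the final response shape above, the video URI can be extracted defensively. The sketch below uses the sample response from this section; `extract_video_uri` is a hypothetical helper name.

```python
def extract_video_uri(operation: dict) -> str:
    """Return the first generated video URI from a finished operation."""
    if not operation.get("done"):
        raise ValueError("Operation is not finished yet")
    try:
        samples = operation["response"]["generateVideoResponse"]["generatedSamples"]
        return samples[0]["video"]["uri"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError("No video URI in operation response") from exc


# Sample taken from the final response documented above (@type omitted).
sample = {
    "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn",
    "done": True,
    "response": {
        "generateVideoResponse": {
            "generatedSamples": [
                {"video": {"uri": "https://generativelanguage.googleapis.com/v1beta/files/txg4shogthoc:download?alt=media"}}
            ]
        }
    },
}
print(extract_video_uri(sample))
```

Raising a single `ValueError` for every malformed shape keeps the calling code to one `except` clause.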

Output Files

~/.openclaw/workspace/tibetanProc/
├── 2604110411_input_image.jpg         # Original image
├── 2604110411_prompt.md               # Gemini analysis
├── 2604110411_enhanced_prompt.txt     # Motion-enhanced prompt
├── 2604110411_veo_init_response.json  # API response (init)
└── 2604110411_video.mp4               # ✅ Final video

Common Errors & Fixes

Error	Cause	Fix
`bytesBase64Encoded isn't supported`	Wrong field name	Use `bytesBase64Encoded` (not `data`, `inline_data`, `bytesBase64`)
`mimeType isn't supported`	Field name case	Use exact `mimeType` (camelCase, not `mime_type`)
`No struct value found for field expecting an image`	Missing image entirely	Provide both `bytesBase64Encoded` and `mimeType`
`Video generation timeout`	Operation takes >10min	Rare; usually completes in 60-90s
`403 PERMISSION_DENIED` on download	API key issue	Append `&key={GOOGLE_API_KEY}` to the video URL (it already contains `?alt=media`)
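
Because the returned video URI already carries a query string (`?alt=media`), the key has to be added as an extra parameter rather than with a second `?`. A minimal sketch that handles both cases with the standard library follows; `with_api_key` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(video_uri: str, api_key: str) -> str:
    """Return video_uri with key=<api_key> added to its query string."""
    parts = urlsplit(video_uri)
    # Keep existing parameters (such as alt=media) and drop any stale key.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "key"]
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


uri = "https://generativelanguage.googleapis.com/v1beta/files/txg4shogthoc:download?alt=media"
print(with_api_key(uri, "YOUR_KEY"))
```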

Performance

Operation	Time
Gemini analysis	~3s
Veo generation	~50-90s
Download	~2-5s
Total	~60-100s

Version History

Version	Date	Change
3.0.0	2026-04-11	✅ Working REST API - Fixed field names: `bytesBase64Encoded` instead of `data`
2.0.0	2026-04-11	Documented async polling (did not work)
1.0.0	Original	Initial design (gRPC-only, REST broken)

Testing

Successfully tested with:

Image: Golden Tibetan offering (6.3 MB JPG)
Model: veo-3.0-generate-001
Duration: 5 seconds
Output: 2.4 MB MP4 video
Status: ✅ Working end-to-end

Security Recommendations

This skill appears to do what it says: it uploads an image to Google (Gemini/Veo) and saves the returned video under ~/.openclaw/workspace/tibetanProc/. Before installing or using it:

1) Confirm you are comfortable providing a GOOGLE_API_KEY (the key is used client-side to call Google APIs and may incur billing).
2) Avoid uploading sensitive images; they are transmitted to Google's services.
3) Ensure the required Python packages (google-generativeai, requests) are installed in the environment; the skill does not include an installer.
4) Note the minor metadata inconsistencies (packaged version vs _meta.json); this looks like bookkeeping rather than malicious behavior, but if you need stronger assurances, ask the provider for a canonical release or source repository.

Functional Analysis

Type: OpenClaw Skill
Name: image-to-video-gen
Version: 3.0.0

The skill bundle is a functional utility for generating videos from images using Google's Veo-3.0 and Gemini APIs. The Python script in SKILL.md correctly implements the asynchronous predictLongRunning flow, including image analysis, prompt enhancement, and status polling. All network communication is directed to legitimate Google API endpoints (generativelanguage.googleapis.com), and file operations are confined to a specific workspace directory (~/.openclaw/workspace/tibetanProc/). No evidence of data exfiltration, credential theft, or malicious prompt injection was found.

Capability Assessment

✓ Purpose & Capability

Name/description (image→video using Veo + Gemini) matches the declared requirements: GOOGLE_API_KEY for Google APIs, python3 and small CLI tools for the provided quickstart. Declared Python packages (google-generativeai, requests) are what the example code uses.

✓ Instruction Scope

SKILL.md instructions stay within scope: analyze image with Gemini, call Veo predictLongRunning, poll and download video to ~/.openclaw/workspace/tibetanProc/. The script reads only the image file and GOOGLE_API_KEY and writes outputs to the stated workspace. It uploads the image to Google services (expected for Gemini/Veo).

✓ Install Mechanism

Instruction-only skill (no install spec or code files) — low install risk. It does require certain Python packages but provides no installer; this is an operational note rather than a security contradiction.

✓ Credentials

Only GOOGLE_API_KEY is requested, which is proportionate for calling Google generative APIs. No unrelated secrets or system config paths are requested. Be aware an API key can incur billing and will be sent to Google endpoints.
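
The scripts in SKILL.md read the key with `os.getenv` and place it in the request URL without checking that it is set. A minimal sketch of failing early, and of sending the key in a header instead, follows. Both helper names are hypothetical, and the `x-goog-api-key` header is an assumption about the Gemini API that this document does not confirm.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return GOOGLE_API_KEY from env, or raise if it is missing or blank."""
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    return key


def auth_headers(api_key: str) -> dict:
    """Build request headers carrying the key (assumed header name)."""
    # Assumption: the endpoint accepts x-goog-api-key instead of ?key=...
    return {"x-goog-api-key": api_key, "Content-Type": "application/json"}
```

Sending the key in a header keeps it out of URLs, which tend to end up in logs and error messages.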

✓ Persistence & Privilege

Skill does not request persistent 'always' inclusion or elevated platform privileges. It only writes and reads files under the declared ~/.openclaw workspace directory.

Version History

v3.0.0

image-to-video-gen v3.0.0: major update that introduces a robust, fully working Veo-3.0 async REST API integration.
- Gemini Vision now performs image analysis to generate intelligent motion/scene prompts.
- Veo-3.0 video generation uses the correct REST payload (`bytesBase64Encoded`), enabling successful async job submission.
- The polling loop reliably tracks operation status and downloads the generated MP4 upon completion.
- All intermediary files and outputs are now consistently saved with timestamped names in `~/.openclaw/workspace/tibetanProc/`.
- Sample workflows, payloads, and error handling updated for clarity and recent API requirements.

v1.0.0

- Initial release of the image-to-video-gen skill.
- Converts user-supplied images into video clips using nano-banana (Gemini vision) for prompt creation and Veo for video generation.
- All process files and outputs are saved in ~/.openclaw/workspace/tibetanProc/ with yymmddHHMM prefixes.
- Supports optional duration and style parameters; defaults to 5 seconds and "cinematic" style.
- Includes strict output location guardrails and timestamped file naming for all generated content.

Metadata

Slug image-to-video-gen

Version 3.0.0

License MIT-0

Total installs 0

Current installs 0

Historical versions 2

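FAQ

What is Image To Video Gen?

Generate video from supplied image using Veo-3.0 async API. Gemini vision analyzes image, Veo creates video via predictLongRunning. All outputs in ~/.opencla... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 121 times to date.

How do I install Image To Video Gen?

Run the command "/install image-to-video-gen" in the OpenClaw or Claude Code chat box for a one-step install. The install itself needs no further configuration, but the skill still requires a GOOGLE_API_KEY and the google-generativeai and requests Python packages at run time.

Is Image To Video Gen free?

Yes. Image To Video Gen is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Image To Video Gen support?

Image To Video Gen is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Image To Video Gen?

It is developed and maintained by Jeff Yang (@j3ffyang); the current version is v3.0.0.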
