j3ffyang

Image To Video Gen

By Jeff Yang · GitHub ↗ · v3.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 121
Favorites: 0
Current installs: 0
Versions: 2
Install in OpenClaw
/install image-to-video-gen
Description
Generate video from supplied image using Veo-3.0 async API. Gemini vision analyzes image, Veo creates video via predictLongRunning. All outputs in ~/.opencla...
Usage (SKILL.md)

Image to Video Generator (v3.0.0 – Working Veo-3.0 REST API)

Status: ✅ Tested & Working
Last Updated: 2026-04-11
Example: Golden Tibetan offering → 2.4 MB video in ~60 seconds


What it does

Generates a cinematic 5-second video from any image using Google's Veo-3.0:

  1. Gemini Vision (2.5-flash) analyzes image for motion/scene description
  2. Veo-3.0 predictLongRunning generates video asynchronously via REST API
  3. Polling loop monitors operation until video is ready (~60-90 seconds)
  4. Download saves MP4 to ~/.openclaw/workspace/tibetanProc/ with yymmddHHMM prefix

Key Fix (v3.0.0): REST payload must use bytesBase64Encoded field (not data or inline_data)
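
The key fix can be captured in a small helper. This is a minimal sketch that mirrors the payload shown later in this document; the function name `build_veo_payload` is hypothetical and not part of the skill.

```python
import base64


def build_veo_payload(image_bytes: bytes, prompt: str,
                      mime_type: str = "image/jpeg") -> dict:
    """Build the predictLongRunning request body used in this document.

    The image goes under "bytesBase64Encoded" (not "data" or "inline_data"),
    next to a camelCase "mimeType" field.
    """
    return {
        "instances": [{
            "prompt": prompt,
            "image": {
                "bytesBase64Encoded": base64.b64encode(image_bytes).decode("utf-8"),
                "mimeType": mime_type,
            },
        }]
    }
```

The returned dict can be passed directly as `json=payload` to `requests.post`.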


Inputs

  • Image: Local path or URL
  • Duration: 5-10 seconds (optional)
  • Style: Hint like "cinematic", "ethereal", "smooth" (optional)
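
One way to normalize these inputs before running the pipeline is sketched below. The `VideoRequest` class and `parse_inputs` function are hypothetical names; the defaults (5 seconds, "cinematic") follow the v1.0.0 release notes, and the 5-10 second range comes from the list above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class VideoRequest:
    image: str            # local path or URL, as given by the caller
    is_url: bool          # True when image looks like an http(s) URL
    duration_s: int = 5
    style: str = "cinematic"


def parse_inputs(image: str, duration_s: int = 5,
                 style: str = "cinematic") -> VideoRequest:
    """Validate the three inputs and classify the image source."""
    if not 5 <= duration_s <= 10:
        raise ValueError(f"duration must be 5-10 seconds, got {duration_s}")
    # Only http/https count as URLs; anything else is treated as a local path.
    is_url = urlparse(image).scheme in ("http", "https")
    return VideoRequest(image=image, is_url=is_url,
                        duration_s=duration_s,
                        style=style.strip() or "cinematic")
```

A URL input would still need to be downloaded into the workspace before the later steps, which all read a local file.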

Quick Start

python3 << 'PYEOF'
import os
import sys
import json
import time
import base64
import requests
import google.generativeai as genai
from pathlib import Path
from datetime import datetime

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
WORKSPACE = Path.home() / ".openclaw" / "workspace" / "tibetanProc"
WORKSPACE.mkdir(parents=True, exist_ok=True)
TIMESTAMP = datetime.now().strftime("%y%m%d%H%M")

# Step 1: Load image
IMAGE_PATH = WORKSPACE / f"{TIMESTAMP}_input_image.jpg"
if not IMAGE_PATH.exists():
    print("✗ Image not found")
    sys.exit(1)

print(f"✓ Image: {IMAGE_PATH.name}")

# Step 2: Analyze with Gemini
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")
image_file = genai.upload_file(str(IMAGE_PATH), mime_type="image/jpeg")

prompt = """Analyze for cinematic video: describe subject, setting, lighting, textures, 
suggested camera movements (dolly, pan, orbit, zoom, rack focus)."""

response = model.generate_content([prompt, image_file])
analysis = response.text

prompt_path = WORKSPACE / f"{TIMESTAMP}_prompt.md"
with open(prompt_path, "w") as f:
    f.write(analysis)
print(f"✓ Analysis: {prompt_path.name}")

# Step 3: Create enhanced prompt
enhanced = f"""VIDEO GENERATION PROMPT
Duration: 5 seconds
Quality: High Definition

SCENE ANALYSIS:
{analysis}

MOTION GUIDELINES:
- Smooth, deliberate camera movement
- Enhance visual depth with elegant transitions
- Maintain consistent lighting
- Cinematic color grading
"""

enhanced_path = WORKSPACE / f"{TIMESTAMP}_enhanced_prompt.txt"
with open(enhanced_path, "w") as f:
    f.write(enhanced)
print(f"✓ Enhanced prompt: {enhanced_path.name}")

# Step 4: Call Veo API with CORRECT field names
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

VEO_URL = f"https://generativelanguage.googleapis.com/v1beta/models/veo-3.0-generate-001:predictLongRunning?key={GOOGLE_API_KEY}"

# ✅ CORRECT PAYLOAD (v3.0.0)
payload = {
    "instances": [{
        "prompt": enhanced,
        "image": {
            "bytesBase64Encoded": image_b64,  # ← CRITICAL: NOT "data" or "inline_data"
            "mimeType": "image/jpeg"
        }
    }]
}

print("\n🎬 Calling Veo API...")
response = requests.post(VEO_URL, json=payload, timeout=60)

if response.status_code not in [200, 202]:
    print(f"✗ API error: {response.json()}")
    sys.exit(1)

result = response.json()
operation_name = result.get("name")
if not operation_name:
    print(f"✗ No operation name")
    sys.exit(1)

print(f"✓ Operation: {operation_name}")

# Step 5: Poll until complete
POLL_URL = f"https://generativelanguage.googleapis.com/v1beta/{operation_name}?key={GOOGLE_API_KEY}"

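start = time.monotonic()
for attempt in range(1, 121):  # up to ~10 minutes
    time.sleep(5 if attempt > 1 else 2)
    
    poll_response = requests.get(POLL_URL, timeout=10)
    poll_result = poll_response.json()
    
    if poll_result.get("done"):
        # Report measured elapsed time (the first wait is 2s, not 5s)
        print(f"✓ Complete in {time.monotonic() - start:.0f}s")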
        
        # Extract video URL
        try:
            video_uri = poll_result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
        except (KeyError, IndexError):
            print(f"✗ No video in response")
            print(json.dumps(poll_result, indent=2)[:500])
            sys.exit(1)
        
        # Step 6: Download video
        print(f"⬇️  Downloading...")
        video_response = requests.get(f"{video_uri}&key={GOOGLE_API_KEY}", timeout=120, stream=True)
        
        if video_response.status_code == 200:
            output_path = WORKSPACE / f"{TIMESTAMP}_video.mp4"
            with open(output_path, "wb") as f:
                for chunk in video_response.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
            
            size_mb = output_path.stat().st_size / (1024 * 1024)
            print(f"✓ Video: {output_path.name} ({size_mb:.1f} MB)")
            sys.exit(0)
        else:
            print(f"✗ Download failed: {video_response.status_code}")
            sys.exit(1)

print("✗ Timeout")
sys.exit(1)

PYEOF

Detailed Workflow

Step 1: Prepare Image

WORKSPACE="$HOME/.openclaw/workspace/tibetanProc"
mkdir -p "$WORKSPACE"
TIMESTAMP=$(date +%y%m%d%H%M)
cp ./my_image.jpg "$WORKSPACE/${TIMESTAMP}_input_image.jpg"
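
The same staging step can be done from Python. This is a sketch, not the skill's own code: `stage_image` is a hypothetical helper, and the optional `now` argument exists only to make the timestamp reproducible.

```python
import shutil
from datetime import datetime
from pathlib import Path


def stage_image(src, workspace, now=None) -> Path:
    """Copy src into workspace as <yymmddHHMM>_input_image.jpg."""
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%y%m%d%H%M")
    dest = workspace / f"{stamp}_input_image.jpg"
    shutil.copy(src, dest)  # same effect as the cp command above
    return dest
```

For example, `stage_image("./my_image.jpg", Path.home() / ".openclaw" / "workspace" / "tibetanProc")` returns the path that the later steps read as IMAGE_PATH.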

Step 2: Analyze with Gemini Vision

import os
import google.generativeai as genai
from pathlib import Path

# Defined once here; the later steps reuse GOOGLE_API_KEY and WORKSPACE
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")

WORKSPACE = Path.home() / ".openclaw" / "workspace" / "tibetanProc"
IMAGE_PATH = WORKSPACE / "2604110411_input_image.jpg"
image_file = genai.upload_file(str(IMAGE_PATH), mime_type="image/jpeg")

analysis_prompt = """Analyze this image for cinematic video generation:
1. Main subject and focal point
2. Setting and environment
3. Lighting direction and mood
4. Materials and textures
5. Suggested camera movements (dolly, pan, orbit, zoom, rack focus)
6. Overall energy and pacing"""

response = model.generate_content([analysis_prompt, image_file])

# Save analysis
with open(WORKSPACE / "2604110411_prompt.md", "w") as f:
    f.write(response.text)

Step 3: Create Enhanced Prompt

# Read analysis
with open(WORKSPACE / "2604110411_prompt.md") as f:
    analysis = f.read()

# Add video instructions
enhanced = f"""VIDEO GENERATION PROMPT
Duration: 5 seconds
Quality: High Definition
Frame Rate: 24fps

SCENE ANALYSIS:
{analysis}

MOTION GUIDELINES:
- Smooth, deliberate camera movement
- Enhance visual depth with elegant transitions
- Maintain consistent lighting throughout
- Cinematic color grading
- Focus on visual storytelling

TECHNICAL SPECS:
- Resolution: 1080p minimum
- Aspect Ratio: 16:9
- Format: MP4 (H.264)
"""

with open(WORKSPACE / "2604110411_enhanced_prompt.txt", "w") as f:
    f.write(enhanced)
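
The template can also be produced by a function so that duration and style become parameters. This is a sketch under assumptions: `build_enhanced_prompt` is a hypothetical name, the "Style:" line is not in the original template, and the source does not confirm that Veo honors a duration stated in the prompt text.

```python
def build_enhanced_prompt(analysis: str, duration_s: int = 5,
                          style: str = "") -> str:
    """Wrap the Gemini analysis in the video-generation template."""
    lines = [
        "VIDEO GENERATION PROMPT",
        f"Duration: {duration_s} seconds",
        "Quality: High Definition",
    ]
    if style:
        # Hypothetical extra line; the original template has no style field.
        lines.append(f"Style: {style}")
    lines += [
        "",
        "SCENE ANALYSIS:",
        analysis.strip(),
        "",
        "MOTION GUIDELINES:",
        "- Smooth, deliberate camera movement",
        "- Enhance visual depth with elegant transitions",
        "- Maintain consistent lighting throughout",
        "- Cinematic color grading",
    ]
    return "\n".join(lines) + "\n"
```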

Step 4: Call Veo API (THE CRITICAL PART)

import base64
import requests

# Encode image
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Read enhanced prompt
with open(WORKSPACE / "2604110411_enhanced_prompt.txt") as f:
    prompt = f.read()

# ✅ CORRECT PAYLOAD STRUCTURE (v3.0.0)
payload = {
    "instances": [{
        "prompt": prompt,
        "image": {
            "bytesBase64Encoded": image_b64,    # ← KEY: NOT "data" or "inline_data"
            "mimeType": "image/jpeg"            # ← Must be here
        }
    }]
}

VEO_URL = f"https://generativelanguage.googleapis.com/v1beta/models/veo-3.0-generate-001:predictLongRunning?key={GOOGLE_API_KEY}"

response = requests.post(VEO_URL, json=payload, timeout=60)
result = response.json()

if response.status_code in [200, 202]:
    operation_name = result["name"]
    print(f"✓ Operation: {operation_name}")
else:
    print(f"✗ Error: {result}")
    raise SystemExit(1)  # the bare exit() helper is meant for interactive sessions

Step 5: Poll Operation Status

import time

POLL_URL = f"https://generativelanguage.googleapis.com/v1beta/{operation_name}?key={GOOGLE_API_KEY}"

for attempt in range(120):  # 10 minutes max
    time.sleep(5)
    
    poll_response = requests.get(POLL_URL, timeout=10)
    poll_result = poll_response.json()
    
    if poll_result.get("done"):
        print(f"✓ Complete in {(attempt + 1) * 5}s")
        
        # Extract video URL from response structure
        video_uri = poll_result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
        print(f"✓ Video URL: {video_uri}")
        break
    
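    progress = poll_result.get("metadata", {}).get("progressPercentage", "?")
    print(f"  Polling ({attempt + 1}/120)... {progress}%")
else:
    # The loop ended without break: the operation never reported done
    raise SystemExit("✗ Timed out waiting for the Veo operation")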
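
The polling logic can be separated from the HTTP call, which makes it testable without the network. In this sketch, `poll_until_done` is a hypothetical helper and `fetch` is any callable that returns the decoded operation JSON. The check for an `error` field assumes the usual Google long-running operation shape, which this document does not show.

```python
import time


def poll_until_done(fetch, max_attempts=120, interval_s=5.0, sleep=time.sleep):
    """Call fetch() every interval_s seconds until the operation is done.

    Returns the final operation dict, raises RuntimeError if the finished
    operation carries an "error" field, and TimeoutError if it never finishes.
    """
    for _ in range(max_attempts):
        sleep(interval_s)
        result = fetch()
        if result.get("done"):
            if "error" in result:
                raise RuntimeError(f"Operation failed: {result['error']}")
            return result
    raise TimeoutError(f"Not done after {max_attempts} attempts")
```

With the names from the steps above, usage would look like `poll_until_done(lambda: requests.get(POLL_URL, timeout=10).json())`.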

Step 6: Download Video

# URL needs API key appended
download_url = f"{video_uri}&key={GOOGLE_API_KEY}"

video_response = requests.get(download_url, timeout=120, stream=True)
video_response.raise_for_status()  # stop on 403/404 rather than saving an error body as .mp4

output_path = WORKSPACE / "2604110411_video.mp4"
with open(output_path, "wb") as f:
    for chunk in video_response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)

size_mb = output_path.stat().st_size / (1024 * 1024)
print(f"✓ Video downloaded: {output_path} ({size_mb:.1f} MB)")
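
Appending `&key=...` by hand works only because the returned URI already carries a query string (`?alt=media`). A slightly more defensive sketch, using a hypothetical helper name `with_api_key`, lets the standard library choose the separator:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(video_uri: str, api_key: str) -> str:
    """Return video_uri with key=<api_key> added to its query string."""
    parts = urlsplit(video_uri)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

This yields `...?alt=media&key=...` for the URI format shown in this document and `...?key=...` for a URI with no existing query.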

Response Structure

Initial Response (Step 4)

{
  "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn"
}

Final Response (Step 5 - when done=true)

{
  "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn",
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.ai.generativelanguage.v1beta.PredictLongRunningResponse",
    "generateVideoResponse": {
      "generatedSamples": [
        {
          "video": {
            "uri": "https://generativelanguage.googleapis.com/v1beta/files/txg4shogthoc:download?alt=media"
          }
        }
      ]
    }
  }
}
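
Given that structure, the video URI can be pulled out with explicit error messages instead of a bare KeyError. This is a sketch based only on the sample response above; `extract_video_uri` is a hypothetical name.

```python
def extract_video_uri(operation: dict) -> str:
    """Return the first generated video's URI from a finished operation."""
    if not operation.get("done"):
        raise ValueError("operation is not finished yet")
    samples = (operation.get("response", {})
                        .get("generateVideoResponse", {})
                        .get("generatedSamples", []))
    if not samples:
        raise ValueError("no generatedSamples in response")
    uri = samples[0].get("video", {}).get("uri")
    if not uri:
        raise ValueError("first sample has no video.uri")
    return uri
```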

Output Files

~/.openclaw/workspace/tibetanProc/
├── 2604110411_input_image.jpg         # Original image
├── 2604110411_prompt.md               # Gemini analysis
├── 2604110411_enhanced_prompt.txt     # Motion-enhanced prompt
├── 2604110411_veo_init_response.json  # API response (init)
└── 2604110411_video.mp4               # ✅ Final video
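
The naming scheme above can be generated from a single timestamp so that every step agrees on file names. This is a sketch; `OUTPUT_SUFFIXES` and `output_paths` are hypothetical names, and the dictionary keys are chosen here for illustration.

```python
from pathlib import Path

# Suffixes taken from the file listing above
OUTPUT_SUFFIXES = {
    "image": "input_image.jpg",
    "analysis": "prompt.md",
    "enhanced_prompt": "enhanced_prompt.txt",
    "init_response": "veo_init_response.json",
    "video": "video.mp4",
}


def output_paths(stamp: str, workspace: Path) -> dict:
    """Map each artifact to <workspace>/<stamp>_<suffix>."""
    return {key: workspace / f"{stamp}_{suffix}"
            for key, suffix in OUTPUT_SUFFIXES.items()}
```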

Common Errors & Fixes

| Error | Cause | Fix |
|---|---|---|
| bytesBase64Encoded isn't supported | Wrong field name | Use bytesBase64Encoded (not data, inline_data, bytesBase64) |
| mimeType isn't supported | Field name case | Use exactly mimeType (camelCase, not mime_type) |
| No struct value found for field expecting an image | Missing image entirely | Provide both bytesBase64Encoded and mimeType |
| Video generation timeout | Operation takes >10 min | Rare; usually completes in 60-90 s |
| 403 PERMISSION_DENIED on download | API key issue | Append &key={GOOGLE_API_KEY} to the video URL (the URI already ends in ?alt=media) |
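
Most of the field-name errors above can be caught locally before the request is sent. The check below is a sketch, not part of the skill; `check_image_field` is a hypothetical name, and it only knows the two field names this document reports as working.

```python
import base64
import binascii


def check_image_field(image: dict) -> list:
    """Return a list of problems found in the "image" object of a payload."""
    allowed = {"bytesBase64Encoded", "mimeType"}
    problems = [f"unexpected field: {key}" for key in image if key not in allowed]
    problems += [f"missing field: {key}" for key in sorted(allowed) if key not in image]
    data = image.get("bytesBase64Encoded")
    if isinstance(data, str):
        try:
            base64.b64decode(data, validate=True)
        except (binascii.Error, ValueError):
            problems.append("bytesBase64Encoded is not valid base64")
    return problems
```

An empty list means the image object matches the shape used in Step 4.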

Performance

| Operation | Time |
|---|---|
| Gemini analysis | ~3s |
| Veo generation | ~50-90s |
| Download | ~2-5s |
| Total | ~60-100s |

Version History

| Version | Date | Change |
|---|---|---|
| 3.0.0 | 2026-04-11 | Working REST API - Fixed field names: bytesBase64Encoded instead of data |
| 2.0.0 | 2026-04-11 | Documented async polling (did not work) |
| 1.0.0 | Original | Initial design (gRPC-only, REST broken) |

Testing

Successfully tested with:

  • Image: Golden Tibetan offering (6.3 MB JPG)
  • Model: veo-3.0-generate-001
  • Duration: 5 seconds
  • Output: 2.4 MB MP4 video
  • Status: ✅ Working end-to-end
Security recommendations
This skill appears to do what it says: it uploads an image to Google (Gemini/Veo) and saves the returned video under ~/.openclaw/workspace/tibetanProc/. Before installing or using it:

  1. Confirm you are comfortable providing a GOOGLE_API_KEY (the key is used client-side to call Google APIs and may incur billing).
  2. Avoid uploading sensitive images; they are transmitted to Google's services.
  3. Ensure the required Python packages (google-generativeai, requests) are installed in the environment; the skill does not include an installer.
  4. Note the minor metadata inconsistencies (packaged version vs _meta.json). This looks like bookkeeping rather than malicious behavior, but if you need stronger assurances, ask the provider for a canonical release or source repository.
Functional analysis
Type: OpenClaw Skill
Name: image-to-video-gen
Version: 3.0.0
The skill bundle is a functional utility for generating videos from images using Google's Veo-3.0 and Gemini APIs. The Python script in SKILL.md correctly implements the asynchronous predictLongRunning flow, including image analysis, prompt enhancement, and status polling. All network communication is directed to legitimate Google API endpoints (generativelanguage.googleapis.com), and file operations are confined to a specific workspace directory (~/.openclaw/workspace/tibetanProc/). No evidence of data exfiltration, credential theft, or malicious prompt injection was found.
Capability assessment
Purpose & Capability
Name/description (image→video using Veo + Gemini) matches the declared requirements: GOOGLE_API_KEY for Google APIs, python3 and small CLI tools for the provided quickstart. Declared Python packages (google-generativeai, requests) are what the example code uses.
Instruction Scope
SKILL.md instructions stay within scope: analyze image with Gemini, call Veo predictLongRunning, poll and download video to ~/.openclaw/workspace/tibetanProc/. The script reads only the image file and GOOGLE_API_KEY and writes outputs to the stated workspace. It uploads the image to Google services (expected for Gemini/Veo).
Install Mechanism
Instruction-only skill (no install spec or code files) — low install risk. It does require certain Python packages but provides no installer; this is an operational note rather than a security contradiction.
Credentials
Only GOOGLE_API_KEY is requested, which is proportionate for calling Google generative APIs. No unrelated secrets or system config paths are requested. Be aware an API key can incur billing and will be sent to Google endpoints.
Persistence & Privilege
Skill does not request persistent 'always' inclusion or elevated platform privileges. It only writes and reads files under the declared ~/.openclaw workspace directory.
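How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install image-to-video-gen
  3. Once installed, invoke the Skill by name or trigger it with /image-to-video-gen
  4. Provide the inputs described in the Skill's parameter documentation to get structured output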
Version history
v3.0.0
image-to-video-gen v3.0.0. Major update: introduces robust, fully working Veo-3.0 async REST API integration.
  • Gemini Vision now performs image analysis to generate intelligent motion/scene prompts.
  • Veo-3.0 video generation uses correct REST payload (`bytesBase64Encoded`), enabling successful async job submission.
  • Polling loop reliably tracks operation status and downloads generated MP4 upon completion.
  • All intermediary files and outputs are now consistently saved with timestamped names in `~/.openclaw/workspace/tibetanProc/`.
  • Sample workflows, payloads, and error handling updated for clarity and recent API requirements.
v1.0.0
  • Initial release of image-to-video-gen skill.
  • Converts user-supplied images into video clips using nano-banana (Gemini vision) for prompt creation and Veo for video generation.
  • All process files and outputs are saved in ~/.openclaw/workspace/tibetanProc/ with yymmddHHMM prefixes.
  • Supports optional duration and style parameters; defaults to 5 seconds, "cinematic" style.
  • Includes strict output location guardrails and timestamped file naming for all generated content.
Metadata
Slug: image-to-video-gen
Version: 3.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 2
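FAQ

What is Image To Video Gen?

Generate video from supplied image using Veo-3.0 async API. Gemini vision analyzes image, Veo creates video via predictLongRunning. All outputs in ~/.opencla... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 121 total downloads to date.

How do I install Image To Video Gen?

Run the command "/install image-to-video-gen" in the OpenClaw or Claude Code chat box to install it in one step. The install itself needs no extra configuration, but running the skill requires a GOOGLE_API_KEY and the google-generativeai and requests Python packages.

Is Image To Video Gen free?

Yes. Image To Video Gen is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Image To Video Gen support?

Image To Video Gen is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Image To Video Gen?

It is developed and maintained by Jeff Yang (@j3ffyang); the current version is v3.0.0.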
