Description

Generate video from supplied image using Veo-3.0 async API. Gemini vision analyzes image, Veo creates video via predictLongRunning. All outputs in ~/.opencla...

README (SKILL.md)

Image to Video Generator (v3.0.0 – Working Veo-3.0 REST API)

Name: Image To Video Gen
Author: j3ffyang

Status: ✅ Tested & Working
Last Updated: 2026-04-11
Example: Golden Tibetan offering → 2.4 MB video in ~60 seconds

What it does

Generates cinematic 5-second video from any image using Google's Veo-3.0:

Gemini Vision (2.5-flash) analyzes image for motion/scene description
Veo-3.0 predictLongRunning generates video asynchronously via REST API
Polling loop monitors operation until video is ready (~60-90 seconds)
Download saves MP4 to ~/.openclaw/workspace/tibetanProc/ with yymmddHHMM prefix

Key Fix (v3.0.0): REST payload must use bytesBase64Encoded field (not data or inline_data)

Inputs

Image: Local path or URL
Duration: 5-10 seconds (optional)
Style: Hint like "cinematic", "ethereal", "smooth" (optional)

Quick Start

python3 \x3C\x3C 'PYEOF'
import os
import sys
import json
import time
import base64
import requests
import google.generativeai as genai
from pathlib import Path
from datetime import datetime

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
WORKSPACE = Path.home() / ".openclaw" / "workspace" / "tibetanProc"
WORKSPACE.mkdir(parents=True, exist_ok=True)
TIMESTAMP = datetime.now().strftime("%y%m%d%H%M")

# Step 1: Load image
IMAGE_PATH = WORKSPACE / f"{TIMESTAMP}_input_image.jpg"
if not IMAGE_PATH.exists():
    print("✗ Image not found")
    sys.exit(1)

print(f"✓ Image: {IMAGE_PATH.name}")

# Step 2: Analyze with Gemini
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")
image_file = genai.upload_file(str(IMAGE_PATH), mime_type="image/jpeg")

prompt = """Analyze for cinematic video: describe subject, setting, lighting, textures, 
suggested camera movements (dolly, pan, orbit, zoom, rack focus)."""

response = model.generate_content([prompt, image_file])
analysis = response.text

prompt_path = WORKSPACE / f"{TIMESTAMP}_prompt.md"
with open(prompt_path, "w") as f:
    f.write(analysis)
print(f"✓ Analysis: {prompt_path.name}")

# Step 3: Create enhanced prompt
enhanced = f"""VIDEO GENERATION PROMPT
Duration: 5 seconds
Quality: High Definition

SCENE ANALYSIS:
{analysis}

MOTION GUIDELINES:
- Smooth, deliberate camera movement
- Enhance visual depth with elegant transitions
- Maintain consistent lighting
- Cinematic color grading
"""

enhanced_path = WORKSPACE / f"{TIMESTAMP}_enhanced_prompt.txt"
with open(enhanced_path, "w") as f:
    f.write(enhanced)
print(f"✓ Enhanced prompt: {enhanced_path.name}")

# Step 4: Call Veo API with CORRECT field names
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

VEO_URL = f"https://generativelanguage.googleapis.com/v1beta/models/veo-3.0-generate-001:predictLongRunning?key={GOOGLE_API_KEY}"

# ✅ CORRECT PAYLOAD (v3.0.0)
payload = {
    "instances": [{
        "prompt": enhanced,
        "image": {
            "bytesBase64Encoded": image_b64,  # ← CRITICAL: NOT "data" or "inline_data"
            "mimeType": "image/jpeg"
        }
    }]
}

print("\
🎬 Calling Veo API...")
response = requests.post(VEO_URL, json=payload, timeout=60)

if response.status_code not in [200, 202]:
    print(f"✗ API error: {response.json()}")
    sys.exit(1)

result = response.json()
operation_name = result.get("name")
if not operation_name:
    print(f"✗ No operation name")
    sys.exit(1)

print(f"✓ Operation: {operation_name}")

# Step 5: Poll until complete
POLL_URL = f"https://generativelanguage.googleapis.com/v1beta/{operation_name}?key={GOOGLE_API_KEY}"

for attempt in range(1, 121):
    time.sleep(5 if attempt > 1 else 2)
    
    poll_response = requests.get(POLL_URL, timeout=10)
    poll_result = poll_response.json()
    
    if poll_result.get("done"):
        print(f"✓ Complete in {attempt * 5}s")
        
        # Extract video URL
        try:
            video_uri = poll_result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
        except (KeyError, IndexError):
            print(f"✗ No video in response")
            print(json.dumps(poll_result, indent=2)[:500])
            sys.exit(1)
        
        # Step 6: Download video
        print(f"⬇️  Downloading...")
        video_response = requests.get(f"{video_uri}&key={GOOGLE_API_KEY}", timeout=120, stream=True)
        
        if video_response.status_code == 200:
            output_path = WORKSPACE / f"{TIMESTAMP}_video.mp4"
            with open(output_path, "wb") as f:
                for chunk in video_response.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
            
            size_mb = output_path.stat().st_size / (1024 * 1024)
            print(f"✓ Video: {output_path.name} ({size_mb:.1f} MB)")
            sys.exit(0)
        else:
            print(f"✗ Download failed: {video_response.status_code}")
            sys.exit(1)

print("✗ Timeout")
sys.exit(1)

PYEOF

Detailed Workflow

Step 1: Prepare Image

WORKSPACE="$HOME/.openclaw/workspace/tibetanProc"
mkdir -p "$WORKSPACE"
TIMESTAMP=$(date +%y%m%d%H%M)
cp ./my_image.jpg "$WORKSPACE/${TIMESTAMP}_input_image.jpg"

Step 2: Analyze with Gemini Vision

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-2.5-flash")

IMAGE_PATH = Path.home() / ".openclaw" / "workspace" / "tibetanProc" / "2604110411_input_image.jpg"
image_file = genai.upload_file(str(IMAGE_PATH), mime_type="image/jpeg")

analysis_prompt = """Analyze this image for cinematic video generation:
1. Main subject and focal point
2. Setting and environment
3. Lighting direction and mood
4. Materials and textures
5. Suggested camera movements (dolly, pan, orbit, zoom, rack focus)
6. Overall energy and pacing"""

response = model.generate_content([analysis_prompt, image_file])

# Save analysis
with open(WORKSPACE / "2604110411_prompt.md", "w") as f:
    f.write(response.text)

Step 3: Create Enhanced Prompt

# Read analysis
with open(WORKSPACE / "2604110411_prompt.md") as f:
    analysis = f.read()

# Add video instructions
enhanced = f"""VIDEO GENERATION PROMPT
Duration: 5 seconds
Quality: High Definition
Frame Rate: 24fps

SCENE ANALYSIS:
{analysis}

MOTION GUIDELINES:
- Smooth, deliberate camera movement
- Enhance visual depth with elegant transitions
- Maintain consistent lighting throughout
- Cinematic color grading
- Focus on visual storytelling

TECHNICAL SPECS:
- Resolution: 1080p minimum
- Aspect Ratio: 16:9
- Format: MP4 (H.264)
"""

with open(WORKSPACE / "2604110411_enhanced_prompt.txt", "w") as f:
    f.write(enhanced)

Step 4: Call Veo API (THE CRITICAL PART)

import base64
import requests

# Encode image
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Read enhanced prompt
with open(WORKSPACE / "2604110411_enhanced_prompt.txt") as f:
    prompt = f.read()

# ✅ CORRECT PAYLOAD STRUCTURE (v3.0.0)
payload = {
    "instances": [{
        "prompt": prompt,
        "image": {
            "bytesBase64Encoded": image_b64,    # ← KEY: NOT "data" or "inline_data"
            "mimeType": "image/jpeg"            # ← Must be here
        }
    }]
}

VEO_URL = f"https://generativelanguage.googleapis.com/v1beta/models/veo-3.0-generate-001:predictLongRunning?key={GOOGLE_API_KEY}"

response = requests.post(VEO_URL, json=payload, timeout=60)
result = response.json()

if response.status_code in [200, 202]:
    operation_name = result["name"]
    print(f"✓ Operation: {operation_name}")
else:
    print(f"✗ Error: {result}")
    exit(1)

Step 5: Poll Operation Status

import time

POLL_URL = f"https://generativelanguage.googleapis.com/v1beta/{operation_name}?key={GOOGLE_API_KEY}"

for attempt in range(120):  # 10 minutes max
    time.sleep(5)
    
    poll_response = requests.get(POLL_URL, timeout=10)
    poll_result = poll_response.json()
    
    if poll_result.get("done"):
        print(f"✓ Complete in {(attempt + 1) * 5}s")
        
        # Extract video URL from response structure
        video_uri = poll_result["response"]["generateVideoResponse"]["generatedSamples"][0]["video"]["uri"]
        print(f"✓ Video URL: {video_uri}")
        break
    
    progress = poll_result.get("metadata", {}).get("progressPercentage", "?")
    print(f"  Polling ({attempt + 1}/120)... {progress}%")

Step 6: Download Video

# URL needs API key appended
download_url = f"{video_uri}&key={GOOGLE_API_KEY}"

video_response = requests.get(download_url, timeout=120, stream=True)

output_path = WORKSPACE / "2604110411_video.mp4"
with open(output_path, "wb") as f:
    for chunk in video_response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)

size_mb = output_path.stat().st_size / (1024 * 1024)
print(f"✓ Video downloaded: {output_path} ({size_mb:.1f} MB)")

Response Structure

Initial Response (Step 4)

{
  "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn"
}

Final Response (Step 5 - when done=true)

{
  "name": "models/veo-3.0-generate-001/operations/uiw8bpjdiqbn",
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.ai.generativelanguage.v1beta.PredictLongRunningResponse",
    "generateVideoResponse": {
      "generatedSamples": [
        {
          "video": {
            "uri": "https://generativelanguage.googleapis.com/v1beta/files/txg4shogthoc:download?alt=media"
          }
        }
      ]
    }
  }
}

Output Files

~/.openclaw/workspace/tibetanProc/
├── 2604110411_input_image.jpg         # Original image
├── 2604110411_prompt.md               # Gemini analysis
├── 2604110411_enhanced_prompt.txt     # Motion-enhanced prompt
├── 2604110411_veo_init_response.json  # API response (init)
└── 2604110411_video.mp4               # ✅ Final video

Common Errors & Fixes

Error	Cause	Fix
`bytesBase64Encoded isn't supported`	Wrong field name	Use `bytesBase64Encoded` (not `data`, `inline_data`, `bytesBase64`)
`mimeType isn't supported`	Field name case	Use exact `mimeType` (camelCase, not `mime_type` or `mimeType`)
`No struct value found for field expecting an image`	Missing image entirely	Provide both `bytesBase64Encoded` and `mimeType`
`Video generation timeout`	Operation takes >10min	Rare; usually completes in 60-90s
`403 PERMISSION_DENIED` on download	API key issue	Add `?key={GOOGLE_API_KEY}` to video URL

Performance

Operation	Time
Gemini analysis	~3s
Veo generation	~50-90s
Download	~2-5s
Total	~60-100s

Version History

Version	Date	Change
3.0.0	2026-04-11	✅ Working REST API - Fixed field names: `bytesBase64Encoded` instead of `data`
2.0.0	2026-04-11	Documented async polling (did not work)
1.0.0	Original	Initial design (gRPC-only, REST broken)

Testing

Successfully tested with:

Image: Golden Tibetan offering (6.3 MB JPG)
Model: veo-3.0-generate-001
Duration: 5 seconds
Output: 2.4 MB MP4 video
Status: ✅ Working end-to-end

Usage Guidance

This skill appears to do what it says: it uploads an image to Google (Gemini/Veo) and saves the returned video under ~/.openclaw/workspace/tibetanProc/. Before installing/using: 1) Confirm you are comfortable providing a GOOGLE_API_KEY (the key will be used client-side to call Google APIs and may incur billing). 2) Avoid uploading sensitive images — they are transmitted to Google's services. 3) Ensure the required Python packages (google-generativeai, requests) are installed in the environment; the skill does not include an installer. 4) Note minor metadata inconsistencies (packaged version vs _meta.json); this looks like bookkeeping rather than malicious behavior, but if you need stronger assurances ask the provider for a canonical release or source repository.

Capability Analysis

Type: OpenClaw Skill Name: image-to-video-gen Version: 3.0.0 The skill bundle is a functional utility for generating videos from images using Google's Veo-3.0 and Gemini APIs. The Python script in SKILL.md correctly implements the asynchronous predictLongRunning flow, including image analysis, prompt enhancement, and status polling. All network communication is directed to legitimate Google API endpoints (generativelanguage.googleapis.com), and file operations are confined to a specific workspace directory (~/.openclaw/workspace/tibetanProc/). No evidence of data exfiltration, credential theft, or malicious prompt injection was found.

Capability Assessment

✓ Purpose & Capability

Name/description (image→video using Veo + Gemini) matches the declared requirements: GOOGLE_API_KEY for Google APIs, python3 and small CLI tools for the provided quickstart. Declared Python packages (google-generativeai, requests) are what the example code uses.

✓ Instruction Scope

SKILL.md instructions stay within scope: analyze image with Gemini, call Veo predictLongRunning, poll and download video to ~/.openclaw/workspace/tibetanProc/. The script reads only the image file and GOOGLE_API_KEY and writes outputs to the stated workspace. It uploads the image to Google services (expected for Gemini/Veo).

✓ Install Mechanism

Instruction-only skill (no install spec or code files) — low install risk. It does require certain Python packages but provides no installer; this is an operational note rather than a security contradiction.

✓ Credentials

Only GOOGLE_API_KEY is requested, which is proportionate for calling Google generative APIs. No unrelated secrets or system config paths are requested. Be aware an API key can incur billing and will be sent to Google endpoints.

✓ Persistence & Privilege

Skill does not request persistent 'always' inclusion or elevated platform privileges. It only writes and reads files under the declared ~/.openclaw workspace directory.

Version History

v3.0.0

image-to-video-gen v3.0.0 — Major update: introduces robust, fully working Veo-3.0 async REST API integration. - Gemini Vision now performs image analysis to generate intelligent motion/scene prompts. - Veo-3.0 video generation uses correct REST payload (`bytesBase64Encoded`), enabling successful async job submission. - Polling loop reliably tracks operation status and downloads generated MP4 upon completion. - All intermediary files and outputs are now consistently saved with timestamped names in `~/.openclaw/workspace/tibetanProc/`. - Sample workflows, payloads, and error handling updated for clarity and recent API requirements.

v1.0.0

- Initial release of image-to-video-gen skill. - Converts user-supplied images into video clips using nano-banana (Gemini vision) for prompt creation and Veo for video generation. - All process files and outputs are saved in ~/.openclaw/workspace/tibetanProc/ with yymmddHHMM prefixes. - Supports optional duration and style parameters; defaults to 5 seconds, "cinematic" style. - Includes strict output location guardrails and timestamped file naming for all generated content.

Metadata

Slug image-to-video-gen

Version 3.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Image To Video Gen?

Generate video from supplied image using Veo-3.0 async API. Gemini vision analyzes image, Veo creates video via predictLongRunning. All outputs in ~/.opencla... It is an AI Agent Skill for Claude Code / OpenClaw, with 121 downloads so far.

How do I install Image To Video Gen?

Run "/install image-to-video-gen" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Image To Video Gen free?

Yes, Image To Video Gen is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Image To Video Gen support?

Image To Video Gen is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Image To Video Gen?

It is built and maintained by Jeff Yang (@j3ffyang); the current version is v3.0.0.

More Skills

Image To Video Gen

Image to Video Generator (v3.0.0 – Working Veo-3.0 REST API)

What it does

Inputs

Quick Start

Detailed Workflow

Step 1: Prepare Image

Step 2: Analyze with Gemini Vision

Step 3: Create Enhanced Prompt

Step 4: Call Veo API (THE CRITICAL PART)

Step 5: Poll Operation Status

Step 6: Download Video

Response Structure

Initial Response (Step 4)

Final Response (Step 5 - when done=true)

Output Files

Common Errors & Fixes

Performance

Version History

Testing

What is Image To Video Gen?

How do I install Image To Video Gen?

Is Image To Video Gen free?

Which platforms does Image To Video Gen support?

Who created Image To Video Gen?

💬 Comments