wu-uk

openai-vision

by wu-uk · GitHub ↗ · v0.1.0 · MIT-0
cross-platform ⚠ suspicious
67 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install jpg-ocr-stat-openai-vision
Description
Analyze images and multi-frame sequences using OpenAI GPT vision models
README (SKILL.md)

OpenAI Vision Analysis Skill

Purpose

This skill enables image analysis, scene understanding, text extraction, and multi-frame comparison using OpenAI's vision-capable GPT models (e.g., gpt-4o, gpt-4o-mini). It supports single images, multiple images for comparison, and sequential frames for temporal analysis.

When to Use

  • Analyzing image content (objects, scenes, colors, spatial relationships)
  • Extracting and reading text from images (OCR via vision models)
  • Comparing multiple images to detect differences or changes
  • Processing video frames to understand temporal progression
  • Generating detailed image descriptions or captions
  • Answering questions about visual content

Required Libraries

The following Python libraries are required:

from openai import OpenAI
import base64
import json
import os
from pathlib import Path

Input Requirements

  • File formats: JPG, JPEG, PNG, WEBP, non-animated GIF
  • Image sources: URL, Base64-encoded data, or local file paths
  • Size limits: Up to 20MB per image; total request payload under 50MB
  • Maximum images: Up to 500 images per request
  • Image quality: Clear, legible content; avoid watermarks or heavy distortions
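The limits above can be checked locally before any request is built. The following is a minimal sketch, not part of the skill itself: the function name `check_inputs` and the constant names are hypothetical, and the thresholds are copied from the list above. It compares on-disk sizes only, so the base64-encoded payload will be somewhat larger than the total it computes, and it does not detect animated GIFs.

```python
from pathlib import Path

# Limits copied from the Input Requirements list (names are illustrative).
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}
MAX_IMAGE_BYTES = 20 * 1024 * 1024
MAX_TOTAL_BYTES = 50 * 1024 * 1024
MAX_IMAGES = 500


def check_inputs(image_paths):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(image_paths) > MAX_IMAGES:
        problems.append(f"Too many images: {len(image_paths)} > {MAX_IMAGES}")

    total = 0
    for raw in image_paths:
        path = Path(raw)
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{path.name}: unsupported extension {path.suffix!r}")
            continue
        if not path.is_file():
            problems.append(f"{path.name}: file not found")
            continue
        size = path.stat().st_size  # raw size on disk, before base64 encoding
        total += size
        if size > MAX_IMAGE_BYTES:
            problems.append(f"{path.name}: larger than 20MB")

    if total > MAX_TOTAL_BYTES:
        problems.append("Combined size exceeds 50MB")
    return problems
```

Calling `check_inputs(paths)` before encoding lets a caller skip or resize offending files instead of waiting for the API to reject the request.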

Output Schema

Analysis results should be returned as valid JSON conforming to this schema:

{
  "success": true,
  "images_analyzed": 1,
  "analysis": {
    "description": "A detailed scene description...",
    "objects": [
      {"name": "car", "color": "red", "position": "foreground center"},
      {"name": "tree", "count": 3, "position": "background"}
    ],
    "text_content": "Any text visible in the image...",
    "colors": ["blue", "green", "white"],
    "scene_type": "outdoor/urban"
  },
  "comparison": {
    "differences": ["Object X appeared", "Color changed from A to B"],
    "similarities": ["Background unchanged", "Layout consistent"]
  },
  "metadata": {
    "model_used": "gpt-4o",
    "detail_level": "high",
    "token_usage": {"prompt": 1500, "completion": 200}
  },
  "warnings": []
}

Field Descriptions

  • success: Boolean indicating whether analysis completed
  • images_analyzed: Number of images processed in the request
  • analysis.description: Natural language description of the image content
  • analysis.objects: Array of detected objects with attributes
  • analysis.text_content: Any text extracted from the image
  • analysis.colors: Dominant colors identified
  • analysis.scene_type: Classification of the scene
  • comparison: Present when multiple images are analyzed; describes differences and similarities
  • metadata.model_used: The GPT model used for analysis
  • metadata.detail_level: Resolution level used (low, high, or auto)
  • metadata.token_usage: Token consumption for cost tracking
  • warnings: Array of any issues or limitations encountered
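A result can be checked against these field descriptions before it is handed back. The sketch below is illustrative only: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and treating every listed field as mandatory (with `comparison` required only for multi-image requests) is an assumption, since the schema does not say which fields are optional.

```python
# Dotted paths taken from the Field Descriptions list.
# Assumption: all of them are mandatory.
REQUIRED_FIELDS = [
    "success",
    "images_analyzed",
    "analysis.description",
    "analysis.objects",
    "analysis.text_content",
    "analysis.colors",
    "analysis.scene_type",
    "metadata.model_used",
    "metadata.detail_level",
    "metadata.token_usage",
    "warnings",
]


def missing_fields(result, multi_image=False):
    """Return the dotted paths that are absent from a result dict."""
    required = list(REQUIRED_FIELDS)
    if multi_image:
        # "comparison" is described as present when multiple images are analyzed.
        required.append("comparison")

    missing = []
    for dotted in required:
        node = result
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing
```

An empty return value means every expected field is present; anything else can be copied into the `warnings` array.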

Code Examples

Basic Image Analysis from URL

from openai import OpenAI

client = OpenAI()

def analyze_image_url(image_url, prompt="Describe this image in detail."):
    """Analyze an image from a URL using GPT-4o vision."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

Image Analysis from Local File (Base64)

from openai import OpenAI
import base64

client = OpenAI()

def encode_image_to_base64(image_path):
    """Encode a local image file to base64."""
    with open(image_path, "rb") as image_file:
        return base64.standard_b64encode(image_file.read()).decode("utf-8")

def get_image_media_type(image_path):
    """Determine the media type based on file extension."""
    ext = image_path.lower().split('.')[-1]
    media_types = {
        'jpg': 'image/jpeg',
        'jpeg': 'image/jpeg',
        'png': 'image/png',
        'gif': 'image/gif',
        'webp': 'image/webp'
    }
    return media_types.get(ext, 'image/jpeg')

def analyze_local_image(image_path, prompt="Describe this image in detail."):
    """Analyze a local image file using GPT-4o vision."""
    base64_image = encode_image_to_base64(image_path)
    media_type = get_image_media_type(image_path)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

Multi-Image Comparison

from openai import OpenAI
import base64

client = OpenAI()

def compare_images(image_paths, comparison_prompt=None):
    """Compare multiple images and identify differences."""
    if comparison_prompt is None:
        comparison_prompt = (
            "Compare these images carefully. "
            "List all differences and similarities you observe. "
            "Describe any changes in objects, colors, positions, or text."
        )
    
    content = [{"type": "text", "text": comparison_prompt}]
    
    for i, image_path in enumerate(image_paths):
        base64_image = encode_image_to_base64(image_path)
        media_type = get_image_media_type(image_path)
        
        # Add label for each image
        content.append({
            "type": "text", 
            "text": f"Image {i + 1}:"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:{media_type};base64,{base64_image}",
                "detail": "high"
            }
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content

Multi-Frame Video Analysis

from openai import OpenAI
import base64
from pathlib import Path

client = OpenAI()

def analyze_video_frames(frame_paths, analysis_prompt=None):
    """Analyze a sequence of video frames for temporal understanding."""
    if analysis_prompt is None:
        analysis_prompt = (
            "These are sequential frames from a video. "
            "Describe what is happening over time. "
            "Identify any motion, changes, or events that occur across the frames."
        )
    
    content = [{"type": "text", "text": analysis_prompt}]
    
    for i, frame_path in enumerate(frame_paths):
        base64_image = encode_image_to_base64(frame_path)
        media_type = get_image_media_type(frame_path)
        
        content.append({
            "type": "text",
            "text": f"Frame {i + 1}:"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:{media_type};base64,{base64_image}",
                "detail": "auto"  # Use auto for frames to balance cost
            }
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content
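Long clips can produce more frames than is practical to send in one request. One option is to thin the list before calling `analyze_video_frames`. The helper below is a sketch under assumptions: the name `sample_frames` and the default cap of 16 frames are hypothetical and not part of the skill.

```python
def sample_frames(frame_paths, max_frames=16):
    """Pick up to max_frames evenly spaced frames, keeping the first and last."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")

    n = len(frame_paths)
    if n <= max_frames:
        return list(frame_paths)
    if max_frames == 1:
        return [frame_paths[0]]

    # Spread indices evenly across [0, n - 1]; step > 1 here, so no duplicates.
    step = (n - 1) / (max_frames - 1)
    return [frame_paths[round(i * step)] for i in range(max_frames)]
```

Even spacing preserves the temporal order the prompt relies on while keeping the token cost roughly proportional to `max_frames`.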

Full Analysis with JSON Output

from openai import OpenAI
import base64
import json
import os

client = OpenAI()

def analyze_image_to_json(image_path, extract_text=True):
    """Perform comprehensive image analysis and return structured JSON."""
    filename = os.path.basename(image_path)
    
    prompt = """Analyze this image and return a JSON object with the following structure:
{
    "description": "detailed scene description",
    "objects": [{"name": "object name", "attributes": "color, size, position"}],
    "text_content": "any visible text or null if none",
    "colors": ["dominant", "colors"],
    "scene_type": "indoor/outdoor/abstract/etc",
    "people_count": 0,
    "notable_features": ["list of notable visual elements"]
}

Return ONLY valid JSON, no other text."""

    try:
        base64_image = encode_image_to_base64(image_path)
        media_type = get_image_media_type(image_path)
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:{media_type};base64,{base64_image}",
                                "detail": "high"
                            }
                        }
                    ]
                }
            ],
            max_tokens=1500
        )
        
        # Parse the response as JSON
        analysis_text = response.choices[0].message.content
        # Remove markdown code blocks if present
        if analysis_text.startswith("```"):
            analysis_text = analysis_text.split("```")[1]
            if analysis_text.startswith("json"):
                analysis_text = analysis_text[4:]
        analysis = json.loads(analysis_text.strip())
        
        result = {
            "success": True,
            "filename": filename,
            "analysis": analysis,
            "metadata": {
                "model_used": "gpt-4o",
                "detail_level": "high",
                "token_usage": {
                    "prompt": response.usage.prompt_tokens,
                    "completion": response.usage.completion_tokens
                }
            },
            "warnings": []
        }
        
    except json.JSONDecodeError as e:
        result = {
            "success": False,
            "filename": filename,
            "analysis": {"raw_response": response.choices[0].message.content},
            "metadata": {"model_used": "gpt-4o"},
            "warnings": [f"Failed to parse JSON: {str(e)}"]
        }
    except Exception as e:
        result = {
            "success": False,
            "filename": filename,
            "analysis": {},
            "metadata": {},
            "warnings": [f"Analysis failed: {str(e)}"]
        }
    
    return result

# Usage
result = analyze_image_to_json("photo.jpg")
print(json.dumps(result, indent=2))
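The fence-stripping step in the example handles a reply that starts with a code fence, but a model may also put prose before or after the JSON. A slightly more tolerant parser could look like the sketch below. The name `parse_model_json` is hypothetical, and the fallback of taking the outermost `{...}` span is a heuristic assumption that will not suit every reply.

```python
import json
import re

# Build the three-backtick marker programmatically to keep this listing readable.
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL | re.IGNORECASE)


def parse_model_json(text):
    """Parse JSON from a reply that may be fenced or surrounded by prose."""
    match = _FENCE.search(text)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        # Heuristic fallback: try the outermost {...} span.
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(candidate[start:end + 1])
```

The function still raises `json.JSONDecodeError` when nothing parseable is found, so it can be dropped into the existing `try`/`except` structure unchanged.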

Batch Processing Directory

from openai import OpenAI
import base64
import json
from pathlib import Path

client = OpenAI()

def process_image_directory(directory_path, output_file, prompt=None):
    """Process all images in a directory and save results."""
    if prompt is None:
        prompt = "Describe this image briefly, including any visible text."
    
    image_extensions = {'.jpg', '.jpeg', '.png', '.webp', '.gif'}
    results = []
    
    for file_path in sorted(Path(directory_path).iterdir()):
        if file_path.suffix.lower() in image_extensions:
            print(f"Processing: {file_path.name}")
            
            try:
                analysis = analyze_local_image(str(file_path), prompt)
                results.append({
                    "filename": file_path.name,
                    "success": True,
                    "analysis": analysis
                })
            except Exception as e:
                results.append({
                    "filename": file_path.name,
                    "success": False,
                    "error": str(e)
                })
    
    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    
    return results

Detail Level Configuration

The detail parameter controls image resolution and token usage:

# Low detail: 512x512 fixed, ~85 tokens per image
# Best for: Quick summaries, dominant colors, general scene type
{"detail": "low"}

# High detail: Full resolution processing
# Best for: Reading text, detecting small objects, detailed analysis
{"detail": "high"}

# Auto: Model decides based on image size
# Best for: General use when cost vs quality tradeoff is acceptable
{"detail": "auto"}

Choosing Detail Level

def get_recommended_detail(task_type):
    """Recommend detail level based on task type."""
    high_detail_tasks = {
        'ocr', 'text_extraction', 'document_analysis',
        'small_object_detection', 'detailed_comparison',
        'fine_grained_analysis'
    }
    
    low_detail_tasks = {
        'scene_classification', 'dominant_colors',
        'general_description', 'thumbnail_preview'
    }
    
    if task_type.lower() in high_detail_tasks:
        return "high"
    elif task_type.lower() in low_detail_tasks:
        return "low"
    else:
        return "auto"

Text Extraction (Vision-based OCR)

For extracting text from images using vision models:

def extract_text_from_image(image_path, preserve_layout=False):
    """Extract text from an image using GPT-4o vision."""
    if preserve_layout:
        prompt = (
            "Extract ALL text visible in this image. "
            "Preserve the original layout and formatting as much as possible. "
            "Include headers, paragraphs, captions, and any other text. "
            "Return only the extracted text, nothing else."
        )
    else:
        prompt = (
            "Extract all text visible in this image. "
            "Return the text in reading order (top to bottom, left to right). "
            "Return only the extracted text, nothing else."
        )
    
    return analyze_local_image(image_path, prompt)


def extract_structured_text(image_path):
    """Extract text with structure information as JSON."""
    prompt = """Extract all text from this image and return as JSON:
{
    "headers": ["list of headers/titles"],
    "paragraphs": ["list of paragraph texts"],
    "labels": ["list of labels or captions"],
    "other_text": ["any other text elements"],
    "reading_order": ["all text in reading order"]
}
Return ONLY valid JSON."""
    
    response = analyze_local_image(image_path, prompt)
    
    try:
        # Clean and parse JSON
        if response.startswith("```"):
            response = response.split("```")[1]
            if response.startswith("json"):
                response = response[4:]
        return json.loads(response.strip())
    except json.JSONDecodeError:
        return {"raw_text": response, "parse_error": True}

Error Handling

Common Issues and Solutions

Issue: API rate limits exceeded

import time
from openai import RateLimitError

def analyze_with_retry(image_path, prompt, max_retries=3):
    """Analyze image with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return analyze_local_image(image_path, prompt)
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

Issue: Image too large

from PIL import Image
import base64
import io
import os

def resize_image_if_needed(image_path, max_size_mb=15):
    """Return base64 image data, downscaling if the file exceeds the size limit.

    A resized image is always re-encoded as JPEG, so send it as image/jpeg.
    """
    file_size_mb = os.path.getsize(image_path) / (1024 * 1024)
    
    if file_size_mb <= max_size_mb:
        return encode_image_to_base64(image_path)
    
    # JPEG has no alpha channel, so convert (e.g. RGBA PNGs) to RGB first
    img = Image.open(image_path).convert("RGB")
    buffer = io.BytesIO()
    
    # Halve the dimensions until the encoded JPEG fits the limit
    while file_size_mb > max_size_mb and min(img.size) > 1:
        new_width = max(1, img.width // 2)
        new_height = max(1, img.height // 2)
        img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        
        # Re-encode and check the new size
        buffer = io.BytesIO()
        img.save(buffer, format='JPEG', quality=85)
        file_size_mb = len(buffer.getvalue()) / (1024 * 1024)
    
    # Encode the last JPEG written to the buffer
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

Issue: Invalid image format

def validate_image(image_path):
    """Validate image before processing."""
    valid_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.webp'}
    
    path = Path(image_path)
    
    if not path.exists():
        return False, "File does not exist"
    
    if path.suffix.lower() not in valid_extensions:
        return False, f"Unsupported format: {path.suffix}"
    
    try:
        with Image.open(image_path) as img:
            img.verify()
        return True, "Valid image"
    except Exception as e:
        return False, f"Invalid image: {str(e)}"

Quality Self-Check

Before returning results, verify:

  • API response was received successfully
  • Output is valid JSON (if structured output requested)
  • All requested analysis fields are present
  • Token usage is within expected bounds
  • No error messages in the response
  • For multi-image: all images were processed
  • Confidence/warnings are included when analysis is uncertain
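Several of these checks can be automated on the result dictionary. The sketch below is one way to do it and is not part of the skill: `self_check`, `expected_images`, and the `max_total_tokens` budget of 10,000 are hypothetical names and values chosen for illustration.

```python
def self_check(result, expected_images=1, max_total_tokens=10_000):
    """Return a list of failed checks; an empty list means the result passed."""
    problems = []

    if not result.get("success"):
        problems.append("success flag is not true")

    analyzed = result.get("images_analyzed")
    if analyzed != expected_images:
        problems.append(f"expected {expected_images} image(s), got {analyzed}")

    # Token budget: the default is an arbitrary example, tune it per workload.
    usage = result.get("metadata", {}).get("token_usage", {})
    total = usage.get("prompt", 0) + usage.get("completion", 0)
    if total > max_total_tokens:
        problems.append(f"token usage {total} exceeds budget {max_total_tokens}")

    if not isinstance(result.get("warnings"), list):
        problems.append("warnings must be a list")

    return problems
```

Any returned messages can be appended to the result's `warnings` array rather than raised, so the caller still receives valid JSON.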

Limitations

  • Medical imagery: Not suitable for diagnostic analysis of CT scans, X-rays, or MRIs
  • Small text: Text smaller than ~12pt may be misread; use detail: high
  • Rotated/skewed text: Accuracy decreases with text rotation; pre-process if needed
  • Non-Latin scripts: Lower accuracy for complex scripts (CJK, Arabic, etc.)
  • Object counting: Approximate counts only; may miss or double-count similar objects
  • Spatial precision: Cannot provide pixel-accurate bounding boxes or measurements
  • Panoramic/fisheye: Distorted images reduce analysis accuracy
  • Graphs and charts: May misinterpret line styles, legends, or data points
  • Metadata: Cannot access EXIF data, camera info, or GPS coordinates from images
  • Cost: High-detail analysis of many images can be expensive; monitor token usage

Token Cost Estimation

Approximate token costs for image inputs:

Detail Level    Tokens per Image              Best For
low             ~85 tokens (fixed)            Quick classification, color detection
high            85 + 170 per 512x512 tile     OCR, detailed analysis, small objects
auto            Variable                      General use

def estimate_image_tokens(image_path, detail="high"):
    """Estimate token usage for an image."""
    if detail == "low":
        return 85
    
    with Image.open(image_path) as img:
        width, height = img.size
    
    # High detail: image is first scaled down to fit in 2048x2048 (never upscaled)
    scale = min(2048 / max(width, height), 1.0)
    scaled_width = round(width * scale)
    scaled_height = round(height * scale)
    
    # Then scaled down so the shortest side is at most 768, and tiled at 512x512
    if min(scaled_width, scaled_height) > 768:
        scale = 768 / min(scaled_width, scaled_height)
        scaled_width = round(scaled_width * scale)
        scaled_height = round(scaled_height * scale)
    
    # Calculate tiles
    tiles_x = (scaled_width + 511) // 512
    tiles_y = (scaled_height + 511) // 512
    total_tiles = tiles_x * tiles_y
    
    return 85 + (170 * total_tiles)
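For budgeting without opening any files, the same arithmetic can be done from dimensions alone. The sketch below uses a hypothetical name, `tokens_for_size`, and rests on two assumptions: that the shortest side is shrunk to 768 px when it is larger (small images are not upscaled), and that `auto` is costed like `high` as a worst case. The 85 and 170 constants come from the table above.

```python
import math


def tokens_for_size(width, height, detail="high"):
    """Estimate image input tokens from pixel dimensions alone."""
    if detail == "low":
        return 85

    # Step 1: shrink (never enlarge) to fit inside a 2048x2048 square.
    fit = min(1.0, 2048 / max(width, height))
    width, height = round(width * fit), round(height * fit)

    # Step 2 (assumption): shrink so the shortest side is at most 768 px.
    shortest = min(width, height)
    if shortest > 768:
        shrink = 768 / shortest
        width, height = round(width * shrink), round(height * shrink)

    # Step 3: count the 512x512 tiles needed to cover the scaled image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

Under these assumptions a 1024x1024 image comes out at 765 tokens (four tiles) and a 2048x4096 image at 1105 tokens (six tiles). Actual billing may differ by model, so treat the figure as an estimate and compare it against `response.usage.prompt_tokens`.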

Version History

  • 1.0.0 (2026-01-21): Initial release with GPT-4o vision support
Usage Guidance
This skill appears to be a thin wrapper around the OpenAI vision APIs and will read local image files and upload them for analysis. Before installing, confirm the following:
  1. Where will requests be sent and which OpenAI account/key will be used? The skill metadata does not declare the required OPENAI_API_KEY. Verify that you will supply your own key and that the skill will not hardcode or exfiltrate credentials.
  2. Understand privacy: any local images (including PII, documents, screenshots) will be transmitted to OpenAI; avoid using highly sensitive images unless you accept that.
  3. Check for provenance: the skill's source/homepage are unknown. Prefer skills with a traceable author or repo.
  4. If you proceed, run it in a restricted environment and review logs to ensure only intended files are read and transmitted.
If you cannot verify the above, treat this skill cautiously or consider alternatives that explicitly document credential and data-handling requirements.
Capability Analysis
Type: OpenClaw Skill
Name: jpg-ocr-stat-openai-vision
Version: 0.1.0
The skill bundle provides legitimate functionality for image analysis and OCR using OpenAI's vision models. The code in SKILL.md includes standard implementations for processing local and remote images, handling batch directories, and estimating token costs without any signs of malicious intent, data exfiltration, or prompt injection attacks.
Capability Assessment
Purpose & Capability
The name/description (image analysis with OpenAI vision models) matches the SKILL.md examples and capabilities (OCR, multi-frame comparison). The skill legitimately needs the OpenAI client and the ability to read image files/URLs, which the instructions request.
Instruction Scope
The runtime instructions explicitly read local image files, encode them to base64, and send them to the OpenAI SDK for remote analysis. That behavior is consistent with the stated purpose but it also means arbitrary local images (including potentially sensitive files) will be transmitted to a third-party API — the SKILL.md does not warn about this or restrict which paths should be used.
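Because the SKILL.md examples accept any path, a caller who wants to limit what can be uploaded could gate each file on an allowed directory first. This is a minimal sketch of that idea, not something the skill provides: `is_within` and the notion of an allowed root are hypothetical.

```python
from pathlib import Path


def is_within(path, allowed_root):
    """Return True if path resolves to a location inside allowed_root."""
    root = Path(allowed_root).resolve()
    # resolve() collapses ".." segments and follows symlinks before comparing.
    target = Path(path).resolve()
    return target == root or root in target.parents
```

A wrapper could call `is_within(image_path, "/data/images")` (the directory is only an example) and refuse to encode anything that falls outside it.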
Install Mechanism
This is an instruction-only skill with no install spec or downloaded code, so nothing new is written to disk by an installer. The risk from install mechanism is low.
Credentials
The examples use the OpenAI Python client (client = OpenAI()) which requires an API key or other credential at runtime, but the skill metadata lists no required environment variables or primary credential. That omission is an incoherence: the skill will need access to an OpenAI API key (e.g., OPENAI_API_KEY) to function.
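Since the metadata does not declare the credential, a caller may want to fail early with a clear message when the key is missing. The sketch below assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which the OpenAI Python client reads by default; the function name `require_api_key` and its `env` parameter are hypothetical.

```python
import os


def require_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    # env is injectable for testing; by default the real environment is used.
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the examples"
        )
    return key
```

Calling `require_api_key()` once at startup, before `client = OpenAI()`, surfaces a missing credential immediately rather than on the first request.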
Persistence & Privilege
The skill is not always-enabled and does not request special platform-wide persistence. It can be invoked by the agent (normal behavior). No indications it modifies other skills or system configs.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install jpg-ocr-stat-openai-vision
  3. After installation, invoke the skill by name or use /jpg-ocr-stat-openai-vision
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.0
Bulk publish from all-task-skills-dedup
Metadata
Slug jpg-ocr-stat-openai-vision
Version 0.1.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is openai-vision?

Analyze images and multi-frame sequences using OpenAI GPT vision models. It is an AI Agent Skill for Claude Code / OpenClaw, with 67 downloads so far.

How do I install openai-vision?

Run "/install jpg-ocr-stat-openai-vision" in the OpenClaw or Claude Code chat to install it in one step. The examples use the OpenAI Python client, so an OpenAI API key (e.g., OPENAI_API_KEY) must be available at runtime.

Is openai-vision free?

Yes, openai-vision is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does openai-vision support?

openai-vision is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created openai-vision?

It is built and maintained by wu-uk (@wu-uk); the current version is v0.1.0.
