Description

ifly-pdf&image-ocr skill supporting both image OCR (AI-powered LLM OCR) and PDF document recognition. Use when user asks to OCR images, extract text from ima...

Usage Instructions (SKILL.md)

ifly-pdf&image-ocr

Name: ifly-pdf-image-ocr
Author: qingzhe2020

AI-powered OCR service for images and PDF documents using iFlytek's advanced recognition APIs.

Quick Start

Image OCR (LLM OCR)

# OCR an image and extract text
python3 scripts/image_ocr.py /path/to/image.jpg

# Save result to file
python3 scripts/image_ocr.py /path/to/image.jpg -o output.txt

# Specify output format
python3 scripts/image_ocr.py /path/to/image.jpg --format json
python3 scripts/image_ocr.py /path/to/image.jpg --format markdown

PDF OCR

# Convert PDF to Word (default)
python3 scripts/pdf_ocr.py document.pdf

# Convert PDF to Markdown
python3 scripts/pdf_ocr.py document.pdf --format markdown

# Convert PDF to JSON
python3 scripts/pdf_ocr.py document.pdf --format json

# From public URL
python3 scripts/pdf_ocr.py --pdf-url "https://example.com/doc.pdf" --format word

Setup

API Credentials

Get credentials from iFlytek Open Platform:

For Image OCR:

APP_ID: Application ID
API_KEY: API key for authentication
API_SECRET: API secret for signing requests

For PDF OCR:

APP_ID: Application ID
API_SECRET: Application secret (for signature generation)

Environment Variables

# Required for both Image OCR and PDF OCR
export IFLY_APP_ID="your_app_id"
export IFLY_API_SECRET="your_api_secret"

# Required for Image OCR only
export IFLY_API_KEY="your_api_key"
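
As a minimal sketch of how a script could check these variables before making a request, the helper below follows the per-service lists under API Credentials (Image OCR: APP_ID, API_KEY, API_SECRET; PDF OCR: APP_ID, API_SECRET). The function name `load_credentials` and its `mode` argument are hypothetical and are not taken from the shipped scripts.

```python
import os


def load_credentials(mode, env=None):
    """Return the credentials needed for `mode` ("image" or "pdf").

    Hypothetical helper: the real scripts may read the variables differently.
    """
    env = os.environ if env is None else env
    required = ["IFLY_APP_ID", "IFLY_API_SECRET"]
    if mode == "image":
        required.append("IFLY_API_KEY")  # only Image OCR needs the API key
    elif mode != "pdf":
        raise ValueError("mode must be 'image' or 'pdf'")

    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Failing early with the names of the missing variables is usually easier to debug than a signature error returned by the API.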

Features

Image OCR (LLM OCR)

AI-powered: Advanced LLM-based OCR for high accuracy
Multi-format output: JSON, Markdown, or both
Layout understanding: Preserves document structure
Multi-language: Supports text extraction in multiple languages
Image preprocessing: Automatic rotation correction, noise removal

PDF OCR

AI-powered OCR: Advanced AI model for accurate text extraction
Multiple output formats:
- Word (.docx) - Editable Word document
- Markdown - Plain text with formatting
- JSON - Structured data
Large PDF support: Up to 100 pages per document
Page-by-page results: Access individual page results
Download URLs: Direct links to processed files

API Parameters

Image OCR Parameters

Parameter	Type	Required	Description
`image_path`	string	Yes	Path to image file
`--format`	string	No	Output format: json, markdown, or json,markdown for both (default: json,markdown)
`--output`	string	No	Save result to file

PDF OCR Parameters

Parameter	Type	Required	Description
`pdf_path`	string	Yes*	Path to PDF file
`--pdf-url`	string	No*	Public URL of PDF file
`--format`	string	No	Output format: word, markdown, json (default: word)
`--no-poll`	flag	No	Return task ID without polling
`--poll-interval`	int	No	Polling interval in seconds (min 5, default: 5)
`--max-wait`	int	No	Maximum wait time in seconds (default: 300)

*Either pdf_path or --pdf-url must be provided
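
The PDF OCR parameter table can be mirrored with `argparse`. This is only a sketch of an equivalent command-line interface, not the code in scripts/pdf_ocr.py. Treating `pdf_path` and `--pdf-url` as mutually exclusive, and rejecting intervals below 5 instead of clamping them, are assumptions.

```python
import argparse


def parse_args(argv):
    """Parse PDF OCR options as described in the parameter table (sketch)."""
    parser = argparse.ArgumentParser(prog="pdf_ocr.py")
    parser.add_argument("pdf_path", nargs="?", help="Path to PDF file")
    parser.add_argument("--pdf-url", help="Public URL of PDF file")
    parser.add_argument("--format", choices=["word", "markdown", "json"],
                        default="word", help="Output format")
    parser.add_argument("--no-poll", action="store_true",
                        help="Return task ID without polling")
    parser.add_argument("--poll-interval", type=int, default=5,
                        help="Polling interval in seconds (min 5)")
    parser.add_argument("--max-wait", type=int, default=300,
                        help="Maximum wait time in seconds")
    args = parser.parse_args(argv)

    # Assumption: exactly one input source is expected.
    if bool(args.pdf_path) == bool(args.pdf_url):
        parser.error("provide exactly one of pdf_path or --pdf-url")
    if args.poll_interval < 5:
        parser.error("--poll-interval must be at least 5 seconds")
    return args
```

For example, `parse_args(["document.pdf", "--format", "markdown"])` yields `format == "markdown"` with the default polling settings.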

Authentication

Image OCR (HMAC-SHA256)

Uses HMAC-SHA256 signature authentication:

Generate RFC1123 format date: EEE, dd MMM yyyy HH:mm:ss GMT
Create signature origin: host: {host}\ndate: {date}\nPOST {path} HTTP/1.1
Calculate signature: HMAC-SHA256(signature_origin, apiSecret)
Build authorization: hmac username="{apiKey}", algorithm="hmac-sha256", headers="host date request-line", signature="{signature}"
Encode authorization in base64
Send as query parameters: ?authorization={auth}&host={host}&date={date}
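
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in scripts/image_ocr.py: the three parts of the signature origin are joined with newline characters, the raw HMAC digest is Base64-encoded before it is embedded in the authorization string (the steps do not say how the digest is serialized), and the URL in the usage note is a placeholder. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
from email.utils import formatdate
from urllib.parse import urlencode, urlparse


def build_authorization(host, path, date, api_key, api_secret):
    """Return the Base64-encoded authorization string for one POST request."""
    signature_origin = f"host: {host}\ndate: {date}\nPOST {path} HTTP/1.1"
    digest = hmac.new(
        api_secret.encode("utf-8"),        # key
        signature_origin.encode("utf-8"),  # message
        hashlib.sha256,
    ).digest()
    # Assumption: the digest is Base64-encoded before being embedded.
    signature = base64.b64encode(digest).decode("ascii")
    authorization_origin = (
        f'hmac username="{api_key}", algorithm="hmac-sha256", '
        f'headers="host date request-line", signature="{signature}"'
    )
    return base64.b64encode(authorization_origin.encode("utf-8")).decode("ascii")


def build_auth_url(url, api_key, api_secret, date=None):
    """Append authorization, host and date as query parameters to `url`."""
    parsed = urlparse(url)
    date = date or formatdate(usegmt=True)  # RFC 1123 date ending in "GMT"
    authorization = build_authorization(
        parsed.netloc, parsed.path, date, api_key, api_secret
    )
    query = urlencode(
        {"authorization": authorization, "host": parsed.netloc, "date": date}
    )
    return f"{url}?{query}"
```

A call such as `build_auth_url("https://example.com/v1/ocr", api_key, api_secret)` returns the signed URL to POST to. Because the date is part of the signature, the URL should be built immediately before each request.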

PDF OCR (MD5 + HMAC-SHA1)

Uses MD5 + HMAC-SHA1 signature authentication:

Generate timestamp (Unix epoch in seconds)
Calculate auth = MD5(appId + timestamp)
Calculate signature = Base64(HMAC-SHA1(auth, apiSecret))
Send headers:
- appId: Application ID
- timestamp: Timestamp in seconds
- signature: Generated signature

Important: Timestamp must be within 5 minutes of server time.
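
A minimal sketch of the header computation described above. It assumes that MD5 is taken as a lowercase hex digest of the UTF-8 string `appId + timestamp` and that the hex string (not the raw bytes) is the HMAC message. Neither detail is stated here, so check them against the iFlytek documentation. The function name is hypothetical.

```python
import base64
import hashlib
import hmac
import time


def build_pdf_headers(app_id, api_secret, timestamp=None):
    """Return the appId/timestamp/signature headers for a PDF OCR request."""
    ts = str(int(time.time()) if timestamp is None else timestamp)
    # Assumption: auth is the lowercase hex MD5 of appId + timestamp.
    auth = hashlib.md5((app_id + ts).encode("utf-8")).hexdigest()
    digest = hmac.new(
        api_secret.encode("utf-8"),  # key
        auth.encode("utf-8"),        # message
        hashlib.sha1,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return {"appId": app_id, "timestamp": ts, "signature": signature}
```

Because the timestamp must be within 5 minutes of server time, the headers should be regenerated for every request rather than cached.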

Response Format

Image OCR Response

{
  "header": {
    "code": 0,
    "message": "success"
  },
  "payload": {
    "result": {
      "text": "Base64-encoded OCR text..."
    }
  }
}
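
Since the `text` field is Base64-encoded, the caller has to decode it before use. The sketch below checks `header.code` and decodes the payload. It assumes the decoded bytes are UTF-8, and `extract_text` is a hypothetical name.

```python
import base64


def extract_text(response):
    """Return the decoded OCR text from an Image OCR response dict."""
    header = response.get("header", {})
    if header.get("code") != 0:
        raise RuntimeError(
            f"OCR failed: code={header.get('code')} message={header.get('message')}"
        )
    encoded = response["payload"]["result"]["text"]
    # Assumption: the decoded payload is UTF-8 text.
    return base64.b64decode(encoded).decode("utf-8")
```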

PDF OCR Start Response

{
  "flag": true,
  "code": 0,
  "desc": "成功",
  "data": {
    "taskNo": "25082744936879",
    "status": "CREATE",
    "tip": "任务创建成功"
  }
}

PDF OCR Status Response

{
  "flag": true,
  "code": 0,
  "desc": "成功",
  "data": {
    "taskNo": "25082759289333",
    "exportFormat": "word",
    "status": "FINISH",
    "downUrl": "http://bjcdn.openstorage.cn/...",
    "tip": "已完成",
    "pageList": [...]
  }
}

Task Status (PDF OCR)

Status	Description
`CREATE`	Task created successfully
`WAITING`	Waiting in queue
`DOING`	Processing
`FINISH`	Completed
`FAILED`	Failed
`ANY_FAILED`	Partially completed (some pages failed)
`STOP`	Paused
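
The status table suggests a simple polling loop: keep querying while the task is `CREATE`, `WAITING` or `DOING`, and stop on anything else. The sketch below takes a caller-supplied `fetch_status` function (hypothetical; it should return the `data` object of a status response) so that no endpoint details are assumed. Treating `STOP` as a reason to end polling is also an assumption.

```python
import time

# Statuses that mean the task is still in progress. Assumption: every other
# status in the table, including STOP, ends the polling loop.
PENDING = {"CREATE", "WAITING", "DOING"}
MIN_INTERVAL = 5  # status queries are limited to once per 5 seconds


def poll_task(fetch_status, poll_interval=5, max_wait=300, sleep=time.sleep):
    """Call fetch_status() until the task leaves the pending states."""
    interval = max(MIN_INTERVAL, poll_interval)
    deadline = time.monotonic() + max_wait
    while True:
        data = fetch_status()
        if data["status"] not in PENDING:
            return data
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"task still {data['status']} after {max_wait}s")
        sleep(interval)
```

The caller should still inspect the returned status: only `FINISH` guarantees a complete result, while `ANY_FAILED` means some pages are missing.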

Error Codes

(｡･ω･｡) Hi~~ Ran into an error code? Let's see how to fix it~~ ✧⁺⸜(●˙▾˙●)⸝⁺✧

Platform Common Error Codes

Code	Description	Hint	Solution
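10009	input invalid data	(◎_◎;) Oops~ the data format isn't quite right	Check that the input data meets the requirements
10010	service license not enough	(╯°□°)╯︵ ┻━┻ Not enough licenses, or the license has expired!	Submit a ticket to contact customer support
10019	service read buffer timeout	(。-`ω´-) The session timed out~	Check whether all data was sent but the connection was never closed
10043	Syscall AudioCodingDecode error	(◎_◎;) Audio decoding failed...	Check the aue parameter; if it is speex, make sure the audio is speex audio, compressed in segments, and consistent with the frame size
10114	session timeout	(。-`ω´-) The session exceeded its time limit~	Check whether sending data took longer than 60 s
10139	invalid param	(◎_◎;) The parameters don't look right	Check that the parameters are correct
10160	parse request json error	(◎_◎;) The request data is malformed~	Check that the request data is valid JSON
10161	parse base64 string error	(◎_◎;) Base64 decoding failed	Check that the data sent is Base64-encoded
10163	param validate error	(◎_◎;) Parameter validation failed	See the detailed description for the specific cause
10200	read data timeout	(。-`ω´-) Reading data timed out~	Check whether no data was sent for a cumulative 10 s while the connection stayed open
10222	context deadline exceeded	(╯°□°)╯︵ ┻━┻ Something went wrong!	1. Check whether the uploaded data exceeds the interface limit; 2. If the SSL certificate is invalid, submit a ticket
10223	RemoteLB: can't find valued addr	(◎_◎;) Can't find a service node	Submit a ticket to contact technical staff
10313	invalid appid	(◎_◎;) The appid and apikey don't match	Check that the appid is valid
10317	invalid version	(◎_◎;) Something is wrong with the version number	Submit a ticket in the console to contact technical staff
10700	not authority	(╯°□°)╯︵ ┻━┻ Insufficient permissions!	Check against the developer documentation based on the reported reason; if the problem persists, submit a ticket with the sid and the error message
11200	auth no license	(╯°□°)╯︵ ┻━┻ Feature not authorized!	Check that the appid is correct, confirm that the relevant service has been added, and check whether the call quota is exceeded or the license has expired
11201	auth no enough license	(╯°□°)╯︵ ┻━┻ Daily interaction limit exceeded!	Submit the application for review to raise the quota, or contact sales to purchase the enterprise-level interface
11503	server error: atmos return error	(。-`ω´-) The server returned bad data...	Submit a ticket
11502	server error: too many datas	(。-`ω´-) There is a problem with the server configuration	Submit a ticket
100001~100010	WrapperInitErr	(◎_◎;) The engine call failed!	Look up the errno given in the message in the engine error code reference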

Additional Resources

(｡･ω･｡) Service purchase link: General Text Recognition (OCR LLM edition)
(｡･ω･｡) Business inquiry link: Purchase service volume

Original API Error Codes

Code	Description	Solution
10000	System error	Check auth info, request method, parameters
10001	Signature authentication failed	Check credentials
10002	Business processing error	Check error message
10003	Quota/insufficient balance	Check account balance

Limitations

Image OCR

Format: Common image formats (JPG, PNG, etc.)
Size: Reasonable file sizes for web upload
Rate limiting: Follow API rate limits

PDF OCR

Max pages: 100 pages per PDF
Protected PDFs: Not supported (password/encrypted)
Rate limiting: Status query limited to once per 5 seconds
Time limit: Timestamp must be within ±5 minutes of server time

Tips

Image OCR

High-quality images: Use clear, high-resolution images for best results
Multiple formats: Use json,markdown to get both structured and formatted output
Save results: Use -o flag to save OCR results to file

PDF OCR

Math formulas: Use markdown format for PDFs with mathematical formulas
Large PDFs: Split into sections if > 100 pages
Polling interval: Minimum 5 seconds between status queries
Network URLs: Ensure PDF URLs are publicly accessible
Download URLs: Download files promptly as URLs may expire
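
For the "Large PDFs" tip, the page ranges for each part can be computed up front. This sketch only does the arithmetic: it returns 1-based inclusive page ranges of at most 100 pages each, and leaves the actual splitting to whichever PDF tool is available. `page_chunks` is a hypothetical helper.

```python
def page_chunks(total_pages, max_pages=100):
    """Return 1-based inclusive (first, last) page ranges of at most max_pages."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 250-page document becomes three parts:
# [(1, 100), (101, 200), (201, 250)]
print(page_chunks(250))
```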

Security Recommendations

This skill implements iFlytek OCR: it uploads images and PDFs to iFlytek servers and requires three environment variables (IFLY_APP_ID, IFLY_API_KEY, IFLY_API_SECRET), yet the registry metadata incorrectly lists no required credentials. Before installing, verify the skill source and owner (the origin is unknown), confirm that you trust iFlytek and the specific endpoints the skill calls, and avoid sending sensitive or regulated documents unless you control the account and understand the provider's data retention and privacy policy. Set the environment variables only for a dedicated iFlytek account (do not reuse other secrets), and consider running the scripts manually in a sandbox to inspect their behavior before granting the skill to an autonomous agent.

Functional Analysis

Type: OpenClaw Skill
Name: ifly-pdf-image-ocr
Version: 1.0.0

The skill bundle provides legitimate OCR functionality for images and PDFs using official iFlytek APIs. The scripts (image_ocr.py and pdf_ocr.py) correctly implement the documented authentication protocols (HMAC-SHA256 and MD5+HMAC-SHA1) and communicate exclusively with verified iFlytek endpoints (xf-yun.com and xfyun.cn). No malicious behavior, data exfiltration, or prompt injection attempts were detected.

Capability Assessment

✓ Purpose & Capability

The skill name/description (image and PDF OCR via iFlytek) matches the included scripts and runtime instructions: both scripts call iFlytek endpoints and implement the described HMAC/MD5 signing and result handling. The functionality requested (uploading PDFs/images to OCR service) is legitimate for this purpose.

ℹ Instruction Scope

SKILL.md and scripts instruct the agent to read local image/PDF files, read API credentials from environment variables, send files to iFlytek endpoints, and poll for results — all consistent with OCR. There is no evidence the instructions ask for unrelated system files or credentials, but the skill will transmit user files to external servers (iocr.xfyun.cn and cbm01.cn-huabei-1.xf-yun.com), which is expected for a cloud OCR service but has privacy implications.

✓ Install Mechanism

No install spec (instruction-only + shipped scripts). Nothing is downloaded or executed automatically by an installer. This lowers risk, but the included scripts will be executed if run.

⚠ Credentials

Registry metadata declares no required environment variables or credentials, but both SKILL.md and the scripts require IFLY_APP_ID and IFLY_API_SECRET, and image OCR also requires IFLY_API_KEY. This omission is an inconsistency: the skill legitimately needs these secrets, but the registry entry does not declare them. Requesting API credentials for the OCR provider itself is reasonable, and the skill does not ask for unrelated credentials. The missing declaration and unknown source increase risk.

✓ Persistence & Privilege

`always` is false, and the skill does not request persistent system-wide privileges or modify other skills. It only requires environment variables and network access to the OCR endpoints.

Version History

v1.0.0

Initial release of ifly-pdf&image-ocr skill.
- Provides AI-powered OCR for both images and PDF documents via iFlytek APIs.
- Supports multi-language text extraction with advanced document layout understanding.
- Outputs can be in Word (.docx), Markdown, or JSON formats.
- Allows conversion of PDF files to desired formats and extraction of text from images.
- Includes authentication details, API parameters, example usage, and detailed error codes.
- Supports both local files and public URL inputs for PDF processing.

Metadata

Slug ifly-pdf-image-ocr

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ