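Description

Extracts information from fund monthly reports. Combines text and OCR extraction, handles two-month comparison automatically, and fills an Excel template with data extracted from PDF monthly reports.

Usage (SKILL.md)

Fund Monthly Report Extraction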

Name: Fundreport Scrape
Author: imkiiki

Upload an Excel template and the PDF monthly reports; the AI extracts the data (text + OCR) and generates a comparison Excel file.

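🌟 Highlights

Text + OCR dual extraction - data in charts is captured as well; stated recognition accuracy of 95%+
Automatic two-month comparison - processes two months in one run and produces complete comparison data
Smart date parsing - accepts YYYYMM and YYMM formats and completes the year automatically
Batch processing - handles 10+ funds per run; stated time saving of 99%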

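⚙️ Features

Feature	Description
Core metric extraction	Duration, yield to maturity (YTM), fund size
Distribution extraction	Sector, region, and credit rating breakdowns
Template preservation	Keeps the Excel template's existing styles, formulas, and data types
Smart matching	Fuzzy matching on field names, to cope with different wordings
Automatic classification	Identifies the fund name and date, and routes the data to the matching sheet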
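The fuzzy field-name matching listed above could look roughly like the sketch below. It is not the skill's actual implementation: the alias table, the `match_field` helper, and the 0.6 cutoff are all assumptions made for illustration (the real rules live in references/field_mapping.md, whose contents are not shown here).

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical alias table; the labels are illustrative, not taken from the skill.
FIELD_ALIASES: Dict[str, List[str]] = {
    "duration": ["久期", "平均久期", "有效久期", "Duration"],
    "ytm": ["到期收益率", "YTM", "Yield to Maturity"],
    "fund_size": ["基金规模", "基金总值", "Fund Size"],
}


def match_field(label: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the canonical field whose alias is most similar to `label`.

    Returns None when no alias reaches the similarity cutoff.
    """
    normalized = label.strip().lower()
    best_key: Optional[str] = None
    best_score = 0.0
    for key, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            score = difflib.SequenceMatcher(None, normalized, alias.lower()).ratio()
            if score > best_score:
                best_key, best_score = key, score
    return best_key if best_score >= cutoff else None
```

A label such as `到期收益率(%)` still resolves to `ytm` because most of its characters match the alias, while an unrelated label such as `十大持仓` falls below the cutoff and returns `None`.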

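📥 Input

Type	Description	Requirements
Excel template	User-defined format	File name: `互认基金月度更新_YYYYMMvsYYYYMM.xlsx`
PDF monthly report	Fund monthly report	Text, chart, and scanned PDFs are supported; the file name must contain the month (e.g. `华夏 2601.pdf`)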
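Since the comparison months are read from the template file name, a minimal sketch of that parsing step may help. The regular expression and the `parse_comparison_months` helper are assumptions for illustration; the script's own parsing code is not shown in this document.

```python
import re
from typing import Tuple

# Assumed pattern: two six-digit YYYYMM tokens joined by "vs".
TEMPLATE_RE = re.compile(r"(?<!\d)(\d{6})vs(\d{6})(?!\d)")


def parse_comparison_months(filename: str) -> Tuple[str, str]:
    """Return (previous_month, current_month) as YYYYMM strings."""
    match = TEMPLATE_RE.search(filename)
    if match is None:
        raise ValueError(f"no YYYYMMvsYYYYMM token in {filename!r}")
    prev_month, curr_month = match.groups()
    for token in (prev_month, curr_month):
        # The last two digits must be a calendar month.
        if not 1 <= int(token[4:]) <= 12:
            raise ValueError(f"invalid month in {token!r}")
    return prev_month, curr_month
```

Rejecting a file name without the token up front gives a clearer error than failing later while filling the sheet.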

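📤 Output

File	Description
互认基金月度更新_YYYYMMvsYYYYMM_最终版.xlsx	Complete comparison data for the previous month (column 4) and the current month (column 6)

Extracted content:

Core metrics: duration, YTM (two-month comparison)
Distributions: sector, region, credit rating (two-month comparison)
Other: top ten holdings, dividend records, etc.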
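To illustrate how the core metrics might be pulled out of extracted or OCR'd text, here is a minimal regex sketch. The label variants in `PATTERNS` and the `extract_core_metrics` helper are assumptions; real reports may word these fields differently, and the skill's own templates (references/extraction_templates.json) are not reproduced here.

```python
import re
from typing import Dict, Optional

# Assumed label variants; up to 10 non-numeric characters (units, colons,
# brackets) may sit between the label and the number.
PATTERNS = {
    "duration": re.compile(r"(?:平均)?久期[^\d\-]{0,10}(-?\d+(?:\.\d+)?)"),
    "ytm": re.compile(
        r"(?:到期收益率|YTM)[^\d\-]{0,10}(-?\d+(?:\.\d+)?)", re.IGNORECASE
    ),
}


def extract_core_metrics(text: str) -> Dict[str, Optional[float]]:
    """Return the first duration and YTM values found in `text`, or None."""
    result: Dict[str, Optional[float]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = float(match.group(1)) if match else None
    return result
```

Returning `None` for a missing metric, rather than raising, lets a batch run continue and leaves the gap visible in the output.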

🚀 Quick Start

1️⃣ Install dependencies (first use)

# System tools
yum install -y tesseract tesseract-langpack-chi_sim poppler-utils

# Python packages
pip install pdfplumber pytesseract openpyxl pdf2image Pillow opencv-python-headless

2️⃣ Prepare files

working-directory/
├── 模板/              # templates
│   └── 互认基金月度更新_202512vs202601.xlsx
├── 月报数据/          # monthly report data
│   ├── 202512/    # previous month's PDFs
│   │   ├── 华夏 202512.pdf
│   │   └── 南方东英 202512.pdf
│   └── 202601/    # current month's PDFs
│       ├── 华夏 2601.pdf
│       └── 南方东英 2601.pdf
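With one folder per month, each fund's previous-month PDF has to be paired with its current-month PDF. The sketch below shows one way to do that by stripping the month token from the file name; `fund_name` and `pair_reports` are hypothetical helpers, and the skill may match funds differently (for example against names in the template).

```python
import re
from typing import Dict, Iterable, List, Tuple

# A month token is a standalone run of 6 digits (YYYYMM) or 4 digits (YYMM).
MONTH_TOKEN = re.compile(r"\s*(?<!\d)(\d{6}|\d{4})(?!\d)\s*")


def fund_name(pdf_name: str) -> str:
    """Strip the extension and the month token, leaving the fund name."""
    stem = pdf_name.rsplit(".", 1)[0]
    return MONTH_TOKEN.sub(" ", stem).strip()


def pair_reports(
    prev_files: Iterable[str], curr_files: Iterable[str]
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Pair files by fund name; also return funds present in only one month."""
    prev_by_fund = {fund_name(f): f for f in prev_files}
    curr_by_fund = {fund_name(f): f for f in curr_files}
    pairs = {
        name: (prev_by_fund[name], curr_by_fund[name])
        for name in prev_by_fund
        if name in curr_by_fund
    }
    unmatched = sorted(set(prev_by_fund) ^ set(curr_by_fund))
    return pairs, unmatched
```

Returning the unmatched names separately makes it easy to log which funds are missing a report for one of the two months.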

3️⃣ Run the processing

cd ~/.agents/skills/fundreport-scrape

python3 scripts/auto_update_two_months.py \
  "/path/to/互认基金月度更新_202512vs202601.xlsx" \
  "/path/to/月报数据/202512/" \
  "/path/to/月报数据/202601/" \
  "/path/to/互认基金月度更新_202512vs202601_最终版.xlsx"

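4️⃣ Check the results

The output file contains:

✅ Previous month's data (column 4): 202512
✅ Current month's data (column 6): 202601
✅ Automatic comparison: duration, YTM, sector distribution, etc.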
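As a rough illustration of what a two-month comparison of the extracted values involves, the sketch below lines up both months and computes the change. This document does not say whether the script writes a delta into the sheet, so the `change` field and the `compare_metrics` helper are assumptions.

```python
from typing import Dict, Optional


def compare_metrics(
    prev: Dict[str, Optional[float]], curr: Dict[str, Optional[float]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Line up previous- and current-month values and compute the change."""
    rows: Dict[str, Dict[str, Optional[float]]] = {}
    for key in sorted(set(prev) | set(curr)):
        p, c = prev.get(key), curr.get(key)
        # A change is only defined when both months have a value.
        change = round(c - p, 4) if p is not None and c is not None else None
        rows[key] = {"previous": p, "current": c, "change": change}
    return rows
```

Metrics missing from either month keep a `None` change, which is a useful cue to check the source PDF by hand.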

📁 File structure

fundreport-scrape/
├── SKILL.md                  # Skill documentation
├── SECURITY_REVIEW.md        # Security review report
├── _meta.json                # Metadata
├── requirements.txt          # Python dependencies
├── scripts/
│   ├── auto_update_two_months.py # ⭐ Two-month processing (recommended)
│   ├── auto_update_ocr.py       # OCR-enhanced version
│   └── install_ocr_deps.sh      # Dependency installation script
└── references/
    ├── extraction_templates.json  # Extraction template configuration
    ├── ocr_rules.md               # OCR recognition rules
    ├── field_mapping.md           # Field mapping rules
    ├── template_learning.md       # Template learning rules
    ├── batch_processing.md        # Batch processing rules
    └── interaction_rules.md       # Interaction rules

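📋 Scripts

Script	Purpose	Recommendation
`auto_update_two_months.py`	Two-month comparison processing	⭐⭐⭐ Recommended
`auto_update_ocr.py`	Single-month OCR processing	⭐⭐ Fallback
`install_ocr_deps.sh`	One-step dependency installation	⭐⭐⭐ First use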

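❓ FAQ

Q1: OCR recognition accuracy is low?

A: Make sure the PDF is sharp enough. Recommendations:

Use PDFs scanned or rendered at 300 DPI or higher
Avoid blurry or over-compressed files
Check chart data manually against the PDF

Q2: Dates are parsed incorrectly?

A: Check the file name format:

The Excel file name must contain YYYYMMvsYYYYMM
The PDF file name should contain the month (e.g. 2601 or 202601)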
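When a date is parsed incorrectly, it can help to reproduce the normalization by hand. The sketch below expands a YYMM token to YYYYMM; the `normalize_month` helper and the assumption that two-digit years belong to the 2000s are illustrative, not taken from the skill's code.

```python
import re
from typing import Optional


def normalize_month(filename: str, century: int = 20) -> Optional[str]:
    """Return the first valid month token in `filename` as YYYYMM, else None."""
    # Standalone runs of exactly 6 digits (YYYYMM) or 4 digits (YYMM).
    tokens = re.findall(r"(?<!\d)(\d{6}|\d{4})(?!\d)", filename)
    for token in tokens:
        candidate = token if len(token) == 6 else f"{century}{token}"
        # Skip tokens whose last two digits are not a calendar month.
        if 1 <= int(candidate[4:]) <= 12:
            return candidate
    return None
```

A four-digit token is inherently ambiguous (for example `2012` could be a year or December 2020), so file names with a full YYYYMM token are the safer choice.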

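Q3: Data for some funds was not extracted?

A: Possible causes:

The fund name in the PDF does not match the template
The data only appears in a complex chart
Check the log for "unmatched" (未匹配) messages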

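📝 Changelog

v1.0.0 (2026-03-14)

Core features:

✅ Text + OCR dual extraction, including recognition of chart data
✅ Two-month comparison processing with automatically generated comparison data
✅ Smart date parsing for YYYYMM and YYMM formats
✅ Automatic year completion (2601 → 202601)
✅ Comparison months parsed from the Excel file name
✅ Batch processing of 10+ funds
✅ Existing Excel styles and formulas preserved

Technical details:

✅ Tesseract OCR engine (Chinese + English)
✅ pdfplumber text extraction
✅ OpenCV image preprocessing
✅ Automatic fund matching and classification

System dependencies:

Tesseract OCR 5.x + Chinese language pack
Poppler-utils (PDF-to-image conversion)
Python 3.8+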

Security recommendations

This skill appears coherent and local-only, but take these precautions before installing or running it:
1) Inspect scripts/install_ocr_deps.sh before executing it (it runs system-level installs through a package manager and pip, which typically need sudo).
2) Manually install the Python dependencies in a virtualenv and the system dependencies (tesseract + chi_sim, poppler) per the README rather than relying on any auto-install behavior.
3) Run the processing on non-sensitive sample PDFs first to confirm outputs and mappings.
4) Explicitly set output paths (avoid leaving defaults that may write to /root/.openclaw/media/outbound/ in hosted environments).
5) If you connect the skill to an automated chat/file ingestion flow, require explicit user confirmation before processing, so that directories or files are not processed by accident.
6) As recommended in the included SECURITY_REVIEW, validate user-supplied folder paths and be cautious with untrusted ZIP archives (they are extracted to a temporary directory).
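For precaution 6, one common way to handle an untrusted ZIP archive is to reject any entry that would land outside the extraction directory before extracting anything. The sketch below is a generic pattern, not code from this skill; `safe_extract` is a hypothetical helper.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def safe_extract(zip_path: str, dest_dir: str) -> List[Path]:
    """Extract `zip_path` into `dest_dir`, refusing path-traversal entries.

    Returns the extracted PDF files, sorted by path.
    """
    dest = Path(dest_dir).resolve()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            target = (dest / name).resolve()
            # Every entry must resolve to dest itself or something below it.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {name!r}")
        archive.extractall(dest)
    return sorted(dest.rglob("*.pdf"))


if __name__ == "__main__":
    # Usage sketch: extract into a temporary directory that is removed afterwards.
    with tempfile.TemporaryDirectory() as workdir:
        archive_path = Path(workdir) / "reports.zip"
        with zipfile.ZipFile(archive_path, "w") as zf:
            zf.writestr("华夏 2601.pdf", b"%PDF-")
        out_dir = Path(workdir) / "out"
        out_dir.mkdir()
        print([p.name for p in safe_extract(str(archive_path), str(out_dir))])
```

Validating every entry first means a malicious archive is rejected as a whole rather than partially extracted.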

Functional analysis

Type: OpenClaw Skill
Name: fundreport-scrape
Version: 1.0.0

The fundreport-scrape skill bundle is a legitimate tool designed for extracting financial data from PDF reports into Excel templates using text parsing and OCR. The Python scripts (auto_update_ocr.py and auto_update_two_months.py) use standard libraries such as pdfplumber, pytesseract, and openpyxl to process local files without any network activity or data exfiltration logic. While the bundle includes a shell script (install_ocr_deps.sh) that performs system-level installations (yum/pip), this is explicitly documented as necessary for the OCR engine (Tesseract) and is not used for malicious persistence or backdoors. The markdown instructions are consistent with the stated purpose and do not contain harmful prompt injections.

Capability assessment

✓ Purpose & Capability

Name/description (fund monthly report extraction → fill Excel templates) matches the scripts, templates, and documentation. Required libraries and system dependencies (pdfplumber, pdf2image, pytesseract, poppler, tesseract) are appropriate for OCR + PDF → Excel processing.

ℹ Instruction Scope

Runtime instructions and scripts operate on user-specified PDF/Excel paths and folders and perform local text extraction + OCR + template filling. The skill includes interaction logic for handling multiple uploaded files and an optional timeout-based auto-start; exercise caution when connecting those interaction rules to automated chat/file ingestion (ensure the user explicitly confirms files). The references mention default output to the agent environment (/root/.openclaw/media/outbound/) in remote runs — verify the output path before running to avoid unintended exposure of results.

ℹ Install Mechanism

There is no package install spec in the registry (instruction-only), but included scripts and README request installing system packages (tesseract, poppler) and pip packages. The provided install_ocr_deps.sh likely runs system package managers (requires sudo on many systems) — inspect before running and prefer a virtualenv for pip installs. No evidence of downloads from untrusted URLs or obfuscated installers.

✓ Credentials

The skill declares no environment variables, credentials, or config paths. Code and docs likewise show no network calls or secret usage. One operational note: outputs may default to an agent-specific outbound folder in remote environments — that is not a credential leak but could cause files to be placed in a shared/remote workspace; choose output paths consciously.

✓ Persistence & Privilege

Skill is not marked always:true and does not request elevated privileges to run. Scripts create temporary files (tempdir, /tmp) and clean them up; installation of system deps may require sudo, which is normal for system packages. No code in the provided files attempts to modify other skills or system configs.

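Version history

v1.0.0

fundreport-scrape v1.0.0
- Text + OCR dual extraction from fund monthly reports, improving recognition accuracy
- Automatic two-month comparison, with smart parsing and completion of date formats
- Batch processing of 10+ fund PDFs into a comparison Excel file that keeps the template's styles and formulas
- Extraction of core metrics and distribution data, with smart field matching that tolerates different wordings
- One-step dependency installation script and full documentation included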

Metadata

Slug fundreport-scrape

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

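FAQ

What is Fundreport Scrape?

It extracts information from fund monthly reports: it combines text and OCR extraction, handles two-month comparison automatically, and fills an Excel template with data extracted from PDF monthly reports. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has 204 cumulative downloads to date.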

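How do I install Fundreport Scrape?

Run the command "/install fundreport-scrape" in the OpenClaw or Claude Code chat box to install it in one step. The skill itself needs no further configuration, but the OCR dependencies (Tesseract with the Chinese language pack, Poppler, and the Python packages) still have to be installed on first use.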

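Is Fundreport Scrape free?

Yes. Fundreport Scrape is completely free under the MIT-0 license and can be downloaded, installed, and used without restriction.

Which platforms does Fundreport Scrape support?

Fundreport Scrape is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Fundreport Scrape?

It is developed and maintained by ymzhang (@imkiiki); the current version is v1.0.0.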
