Regulation Extractor

Name: Regulation Extractor
Author: youfeijun123

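Description

Extracts clauses from construction engineering code PDFs in structured form and syncs them to a Feishu Bitable. Supports deduplication of dual PDF text layers (original text plus OCR), RapidOCR recognition for image-only PDFs, splitting by clause number (including spaced numbers such as 6. 1. 2. 3), circled-number conversion (e.g. 6.4.④ → 6.4.4), OCR error detection, quality flagging, and text cleaning (removing line breaks, page headers, and symbol tables; handling run-together Chinese and English text; splitting over-long clauses). Outputs...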
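The clause-number handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in extract_regulation.py: the names `normalize_clause_number` and `split_clause_line` and the regular expression are assumptions made for the example.

```python
import re

# Map circled digits ① (U+2460) through ⑳ (U+2473) to "1" through "20".
CIRCLED = {chr(0x2460 + i): str(i + 1) for i in range(20)}

# Hypothetical pattern: a leading clause number whose parts are separated by a
# dot plus optional spaces (e.g. "6. 1. 2. 3" or "6.4.④"), followed by the body.
CLAUSE_HEAD = re.compile(r"^\s*(\d+(?:\.\s*(?:\d+|[\u2460-\u2473]))+)\s*(.*)$")


def normalize_clause_number(raw: str) -> str:
    """Convert circled digits to plain digits and drop spaces inside the number."""
    converted = "".join(CIRCLED.get(ch, ch) for ch in raw)
    return re.sub(r"\s+", "", converted)


def split_clause_line(line: str):
    """Return (clause_number, body) if the line starts with a clause number, else None."""
    match = CLAUSE_HEAD.match(line)
    if match is None:
        return None
    return normalize_clause_number(match.group(1)), match.group(2)
```

A pattern this simple would also treat a line that begins with a decimal such as "3.5 m" as a clause heading, so a real extractor presumably needs additional context checks.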

Security Recommendations

Before installing or running this skill:
1) Inspect and edit quality_check.py to remove or change the hard-coded Path so that it points to a directory you control (it currently points to a developer's D: directory). Do not run that script until you have confirmed the path or provided a safe working directory.
2) When using sync_to_bitable.py, run it with the --dry_run option first to preview changes; create a least-privileged Feishu app/service account for writes and rotate credentials after use.
3) Install and run dependencies (PyMuPDF, rapidocr-onnxruntime) in an isolated environment (virtualenv or container) to avoid system-wide changes.
4) Review any JSON outputs before syncing (they contain extracted regulatory text).
5) If you need to run these scripts on a multi-tenant or sensitive host, run them in an isolated VM/container and verify network access rules so that only the intended Feishu endpoint is reachable.
6) Overall risk: functionality appears legitimate, but the hard-coded path and undocumented credential handling are concrete issues to fix before trusting the package.
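To make the dry-run advice concrete, here is a minimal sketch of how a preview mode can keep a sync loop from writing anything. It does not reproduce sync_to_bitable.py: the function names, the `send` callback, and the batch size of 100 are all assumptions chosen for illustration.

```python
from typing import Callable


def plan_batches(records: list, batch_size: int = 100) -> list:
    """Group records into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def sync(records: list, send: Callable[[list], None],
         dry_run: bool = True, batch_size: int = 100) -> int:
    """Send records batch by batch; in dry-run mode only report what would be sent.

    Returns the number of records actually sent (always 0 in dry-run mode).
    """
    sent = 0
    for batch in plan_batches(records, batch_size):
        if dry_run:
            print(f"[dry run] would write {len(batch)} records")
            continue
        send(batch)  # hypothetical callback that performs the real network write
        sent += len(batch)
    return sent
```

Defaulting `dry_run` to True means a write only happens when the caller opts in explicitly, which matches the preview-first workflow recommended above.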

Functional Analysis

Type: OpenClaw Skill
Name: regulation-extractor
Version: 3.0.0

The regulation-extractor skill bundle is a legitimate tool designed to extract, clean, and synchronize construction regulation data from PDFs to Feishu (Lark) Bitable. The scripts (extract_regulation.py, ocr_batch.py, sync_to_bitable.py) perform standard PDF parsing, OCR via RapidOCR, and API interactions with the official Feishu endpoint (open.feishu.cn). No evidence of data exfiltration, malicious command execution, or harmful prompt injection was found; the code logic is transparent and strictly aligned with the stated purpose of document processing and data management.

Capability Assessment

✓ Purpose & Capability

Name/description align with the included scripts: extract_regulation.py, ocr_batch.py, deep_clean.py, clean_json.py, quality_check.py, and sync_to_bitable.py implement PDF text extraction, offline RapidOCR, cleaning, quality checks, and Feishu (飞书) sync. External network use is limited to Feishu API for its intended purpose.

⚠ Instruction Scope

SKILL.md instructs running each script with user-specified paths, but the scripts are not fully consistent with that expectation: quality_check.py ignores CLI input and uses a hard-coded Windows path (output_dir = Path(r"D:\有斐家\小一\常用规范处理成果")), so running it could read whatever JSON files exist under that directory on the host rather than the files the user specified. The other scripts read PDFs and write JSON, as expected. The sync script performs network writes to Feishu only when given credentials and IDs.
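One way to remove the hard-coded path is to take the directory from the command line. The sketch below is an assumption about how such a fix might look, not the actual contents of quality_check.py; `parse_args` and `list_json_files` are hypothetical helper names.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the directory to check from the command line instead of hard-coding it."""
    parser = argparse.ArgumentParser(
        description="Check extracted regulation JSON files (illustrative sketch)."
    )
    parser.add_argument(
        "output_dir", type=Path,
        help="Directory containing the JSON files to check",
    )
    return parser.parse_args(argv)


def list_json_files(output_dir: Path) -> list:
    """Return the JSON files directly inside output_dir, sorted by path."""
    if not output_dir.is_dir():
        raise NotADirectoryError(f"Not a directory: {output_dir}")
    return sorted(output_dir.glob("*.json"))


# Example use inside a script entry point:
#     args = parse_args()
#     for path in list_json_files(args.output_dir):
#         ...  # run the quality checks on each file
```

Because `glob("*.json")` is not recursive, the script only touches files directly inside the directory the user named.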

ℹ Install Mechanism

No automated install spec included (instruction-only), but SKILL.md lists pip deps (PyMuPDF, rapidocr-onnxruntime). That requires installing Python packages manually in the runtime; this is moderate risk but typical. There is no download from untrusted URLs or archive extraction in the skill bundle itself.

⚠ Credentials

Feishu credentials (app_id, app_secret, app_token, table_id) are required only for the sync_to_bitable step and are proportional to the stated purpose. However, the skill metadata does not declare any required credentials or environment variables; credentials are passed as CLI arguments. The hard-coded path in quality_check.py can access local files unexpectedly, which is disproportionate to the stated single-file quality-check invocation.
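Since credentials passed as CLI arguments can end up in shell history and process listings, a common alternative is to read them from environment variables. The following is a hedged sketch only: the variable names (`FEISHU_APP_ID` and so on) and the function `load_feishu_credentials` are hypothetical and are not defined by the skill.

```python
import os

# Hypothetical environment variable names mapped to the four fields named above.
ENV_TO_FIELD = {
    "FEISHU_APP_ID": "app_id",
    "FEISHU_APP_SECRET": "app_secret",
    "FEISHU_APP_TOKEN": "app_token",
    "FEISHU_TABLE_ID": "table_id",
}


def load_feishu_credentials(env=None) -> dict:
    """Collect the four Feishu values from the environment, failing if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_FIELD if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {field: env[name] for name, field in ENV_TO_FIELD.items()}
```

Failing fast on a missing variable avoids a partially configured sync run, and the `env` parameter makes the function testable without touching the real environment.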

✓ Persistence & Privilege

The skill does not request persistent installation privileges (always=false), does not modify other skills or system-wide configs, and will not autonomously exfiltrate data except when the user runs the sync script with Feishu credentials. No evidence of attempts to persist credentials or enable background network activity.

Version History

v3.0.0

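Extracts clauses from construction engineering code PDFs in structured form. Supports dual-mode extraction (text layer plus OCR), a 5-step cleaning pipeline, over-long clause splitting, symbol-table filtering, and Feishu sync. Tested on 21 PDFs yielding 5,865 clauses, with a 96.8% clean rate.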
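As an illustration of the over-long splitting step, the sketch below breaks a clause at sentence-ending punctuation so that each piece stays within a length limit. The function name `split_long_text`, the punctuation set, and the default limit of 500 characters are assumptions; the skill's actual thresholds and rules are not documented here.

```python
import re

# Split points: just after Chinese or Western sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[。；;！？])")


def split_long_text(text: str, max_len: int = 500) -> list:
    """Greedily pack whole sentences into chunks of at most max_len characters.

    A single sentence longer than max_len is kept intact as its own chunk.
    """
    sentences = [s for s in SENTENCE_END.split(text) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_len:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the input exactly, so the split loses no text.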

Metadata

Slug regulation-extractor

Version 3.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

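FAQ

What is Regulation Extractor?

It extracts clauses from construction engineering code PDFs in structured form and syncs them to a Feishu Bitable. It supports deduplication of dual PDF text layers (original text plus OCR), RapidOCR recognition for image-only PDFs, splitting by clause number (including spaced numbers such as 6. 1. 2. 3), circled-number conversion (e.g. 6.4.④ → 6.4.4), OCR error detection, quality flagging, and text cleaning (removing line breaks, page headers, and symbol tables; handling run-together Chinese and English text; splitting over-long clauses). Outputs... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 90 times to date.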

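How do I install Regulation Extractor?

Run the command "/install regulation-extractor" in the OpenClaw or Claude Code chat box to install the skill in one step. The skill itself needs no further configuration, but the Python dependencies listed in SKILL.md (PyMuPDF, rapidocr-onnxruntime) must be installed manually, and the Feishu sync step requires credentials.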

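Is Regulation Extractor free?

Yes. Regulation Extractor is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Regulation Extractor support?

Regulation Extractor is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Regulation Extractor?

It is developed and maintained by youfeijun123 (@youfeijun123). The current version is v3.0.0.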
