Regulation Extractor

Name: Regulation Extractor
Author: youfeijun123

Description

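Extracts clauses in structured form from building-engineering code PDFs and syncs them to Feishu Bitable. Supports deduplication of PDFs with dual text layers (original text + OCR), RapidOCR recognition for image-only PDFs, clause-number splitting (including numbers written with spaces, such as 6. 1. 2. 3), circled-digit conversion (e.g. 6.4.④→6.4.4), OCR error detection, quality flagging, and text cleaning (removing line breaks, page headers, and symbol tables; separating run-together Chinese and English text; splitting overlong clauses). Output...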
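The clause-number handling described above (spaced numbers and circled digits) can be illustrated with a minimal sketch. This is not the code from extract_regulation.py; the names `CIRCLED`, `CLAUSE_RE`, `normalize_clause_number`, and `split_clause_line` are hypothetical, and the regular expression is an assumption about what a clause number looks like at the start of a line.

```python
import re

# Map circled digits ①..⑳ (U+2460..U+2473) to plain numbers "1".."20".
CIRCLED = {chr(0x2460 + i): str(i + 1) for i in range(20)}

# Assumed shape of a clause number at the start of a line:
# digits, then one or more ".<digits or circled digit>" parts,
# with optional whitespace around each dot.
CLAUSE_RE = re.compile(r"^\s*(\d+(?:\s*\.\s*(?:\d+|[①-⑳]))+)")


def normalize_clause_number(raw: str) -> str:
    """Replace circled digits and drop whitespace around dots."""
    text = "".join(CIRCLED.get(ch, ch) for ch in raw)
    text = re.sub(r"\s*\.\s*", ".", text)
    return text.strip()


def split_clause_line(line: str):
    """Return (normalized_number, remaining_text), or None if no number leads the line."""
    match = CLAUSE_RE.match(line)
    if match is None:
        return None
    number = normalize_clause_number(match.group(1))
    return number, line[match.end():].strip()
```

For example, `normalize_clause_number("6. 1. 2. 3")` gives `"6.1.2.3"` and `normalize_clause_number("6.4.④")` gives `"6.4.4"`. A real extractor would need extra checks so that measurements such as "3.5 m" at the start of a line are not mistaken for clause numbers.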

Usage Guidance

Before installing or running this skill:

1) Inspect and edit quality_check.py to remove or change the hard-coded Path to a directory you control (it currently points to a developer's D: directory). Do not run that script until you confirm the path or provide a safe working directory.

2) When using sync_to_bitable.py, prefer the --dry_run option first to preview changes; create a least-privileged Feishu app/service account for writes and rotate credentials after use.

3) Install and run dependencies (PyMuPDF, rapidocr-onnxruntime) in an isolated environment (virtualenv or container) to avoid system-wide changes.

4) Review any JSON outputs before syncing (they contain extracted regulatory text).

5) If you need to run these scripts on a multi-tenant or sensitive host, run them in an isolated VM/container and verify network access rules so that only the intended Feishu endpoint is reachable.

Overall risk: functionality appears legitimate, but the hard-coded path and undocumented credential handling are concrete issues to fix before trusting the package.

Capability Analysis

Type: OpenClaw Skill

Name: regulation-extractor

Version: 3.0.0

The regulation-extractor skill bundle is a legitimate tool designed to extract, clean, and synchronize construction regulation data from PDFs to Feishu (Lark) Bitable. The scripts (extract_regulation.py, ocr_batch.py, sync_to_bitable.py) perform standard PDF parsing, OCR via RapidOCR, and API interactions with the official Feishu endpoint (open.feishu.cn). No evidence of data exfiltration, malicious command execution, or harmful prompt injection was found; the code logic is transparent and strictly aligned with the stated purpose of document processing and data management.

Capability Assessment

✓ Purpose & Capability

Name/description align with the included scripts: extract_regulation.py, ocr_batch.py, deep_clean.py, clean_json.py, quality_check.py, and sync_to_bitable.py implement PDF text extraction, offline RapidOCR, cleaning, quality checks, and Feishu (飞书) sync. External network use is limited to Feishu API for its intended purpose.

⚠ Instruction Scope

SKILL.md instructs running each script with user-specified paths, but scripts are not fully consistent with that expectation: quality_check.py ignores CLI input and uses a hard-coded Windows path (output_dir = Path(r"D:\有斐家\小一\常用规范处理成果")), which could read arbitrary JSON files on the host if executed. Other scripts read PDFs and write JSON (expected). The sync script performs network writes to Feishu only when given credentials/IDs.
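One way to remove the hard-coded path is to take the directory from the command line. The following is a minimal sketch, not the actual contents of quality_check.py; the function names `parse_args` and `list_json_files` and the positional argument `output_dir` are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the directory to check from the command line instead of hard-coding it."""
    parser = argparse.ArgumentParser(
        description="Quality-check extracted JSON files (illustrative sketch)"
    )
    parser.add_argument(
        "output_dir",
        type=Path,
        help="directory containing the extracted JSON files",
    )
    return parser.parse_args(argv)


def list_json_files(output_dir: Path):
    """Return the JSON files directly inside output_dir, refusing non-directories."""
    if not output_dir.is_dir():
        raise NotADirectoryError(f"{output_dir} is not a directory")
    return sorted(output_dir.glob("*.json"))
```

A script built this way would call `parse_args()` at startup and pass `args.output_dir` to `list_json_files`, so it only reads the directory the user explicitly names.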

ℹ Install Mechanism

No automated install spec included (instruction-only), but SKILL.md lists pip deps (PyMuPDF, rapidocr-onnxruntime). That requires installing Python packages manually in the runtime; this is moderate risk but typical. There is no download from untrusted URLs or archive extraction in the skill bundle itself.

⚠ Credentials

Feishu credentials (app_id, app_secret, app_token, table_id) are required only for the sync_to_bitable step and are proportional to the stated purpose. However, the skill metadata does not declare required credentials or env vars; credentials are passed as CLI args. The hard-coded path in quality_check.py can access local files unexpectedly, which is disproportionate to the stated single-file quality-check invocation.
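Passing secrets as CLI args can expose them in shell history and process listings. A possible alternative is to read them from environment variables, sketched below. The variable names (`FEISHU_APP_ID` and so on) and the function `load_feishu_credentials` are assumptions for illustration; the skill itself does not define them.

```python
import os

# Hypothetical environment variable names; the skill does not declare any.
REQUIRED = (
    "FEISHU_APP_ID",
    "FEISHU_APP_SECRET",
    "FEISHU_APP_TOKEN",
    "FEISHU_TABLE_ID",
)


def load_feishu_credentials(env=None):
    """Collect the four Feishu values from the environment, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip the "FEISHU_" prefix so keys match app_id, app_secret, app_token, table_id.
    return {name[len("FEISHU_"):].lower(): env[name] for name in REQUIRED}
```

The returned dictionary uses the same four names listed above, so it could be handed to whatever code currently consumes the CLI values.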

✓ Persistence & Privilege

The skill does not request persistent installation privileges (always=false), does not modify other skills or system-wide configs, and will not autonomously exfiltrate data except when the user runs the sync script with Feishu credentials. No evidence of attempts to persist credentials or enable background network activity.

Version History

v3.0.0

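Extracts clauses in structured form from building-engineering code PDFs. Supports dual-mode extraction (text layer + OCR), a 5-step cleaning pipeline, splitting of overlong clauses, symbol-table filtering, and Feishu sync. Tested on 21 PDFs yielding 5,865 clauses, with a 96.8% clean rate.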

Metadata

Slug regulation-extractor

Version 3.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Regulation Extractor?

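Regulation Extractor extracts clauses in structured form from building-engineering code PDFs and syncs them to Feishu Bitable. It supports deduplication of PDFs with dual text layers (original text + OCR), RapidOCR recognition for image-only PDFs, clause-number splitting (including numbers written with spaces, such as 6. 1. 2. 3), circled-digit conversion (e.g. 6.4.④→6.4.4), OCR error detection, quality flagging, and text cleaning (removing line breaks, page headers, and symbol tables; separating run-together Chinese and English text; splitting overlong clauses). Output... It is an AI Agent Skill for Claude Code / OpenClaw, with 90 downloads so far.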

How do I install Regulation Extractor?

Run "/install regulation-extractor" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Regulation Extractor free?

Yes, Regulation Extractor is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Regulation Extractor support?

Regulation Extractor is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Regulation Extractor?

It is built and maintained by youfeijun123 (@youfeijun123); the current version is v3.0.0.
