Cfc Disclosure Monitor

by linwumeng · GitHub ↗ · v4.5.0 · MIT-0
cross-platform · ⚠ suspicious · 138 downloads · 0 stars · 0 active installs · 6 versions
Install in OpenClaw
/install cfc-disclosure-monitor
Description
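Collects disclosure announcements from the official websites of 30 licensed consumer finance companies; supports incremental daily runs via cron.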
README (SKILL.md)

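cfc-disclosure-monitor: Consumer Finance Disclosure Collection Skill

Three-phase end-to-end architecture, v4.4 (2026-04-17)

Phase 1 → Phase 2 → Phase 3 run as one automated pipeline: data flows from collection to the knowledge graph without a separate manual reprocessing step.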


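Core Concepts

Official website URL
   ↓
[Phase 1] collect.py          Collect announcement lists + download PDFs/attachments
   ↓
cfc_raw_data/{company}/
  ├── announcements.json      Raw announcement list + body text + PDF paths
  └── attachments/            PDF files
   ↓
[Phase 2] phase2_parse.py     Parse PDFs → extract partner institution lists
   ↓
cfc_raw_data/{company}_合作机构_{type}_{date}.json
   ↓
[Phase 3] phase3_ontology.py   Load entities → knowledge graph
   ↓
memory/ontology/graph.jsonl    Relational knowledge graph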

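1. Architecture Files

cfc-disclosure-monitor/
├── SKILL.md              ← this file (v4.0)
├── companies.json        ← single source of configuration (30 company URLs + collection methods)
├── collect.py            ← Phase 1 entry point: dispatches through collectors.COLLECTORS
├── collectors.py         ← 42 collection methods (v1 functions, list[dict])
├── collectors_v2.py      ← 5 collection methods (v2 classes, BaseCollector)
│                          ← synced into collectors.COLLECTORS automatically on import
├── phase1_base.py        ← Announcement + BaseCollector + Builder
│                          ← unified interface: __call__ returns list[dict]
├── core.py               ← date / classification / text extraction engine
├── phase2_parse.py       ← Phase 2: PDF parsing + partner institution extraction
├── phase3_ontology.py    ← Phase 3: knowledge graph writes
├── pipeline.py           ← unified three-phase orchestrator
└── parsed/               ← Phase 2 intermediate artifacts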

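Architecture Principle (v1 and v2 coexist)

| Tier | File | Type | Return value |
|---|---|---|---|
| v1 | collectors.py | sync/async functions | list[dict] |
| v2 | collectors_v2.py | async classes (inheriting BaseCollector) | list[dict] (via __call__) |

Both tiers register into collectors.COLLECTORS, so collect.py needs no changes.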

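Steps to add a v2 collector:

  1. Define a class that inherits BaseCollector in collectors_v2.py
  2. Register it with @collector("method_name") (it is added to collectors.COLLECTORS automatically)
  3. In companies.json, change the company's method to the v2 method name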
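
The registration steps above can be sketched as follows. This is a minimal illustration, not the skill's actual code: `COLLECTORS`, `BaseCollector`, and `collector` here are simplified stand-ins for the objects in collectors.py and phase1_base.py, whose real definitions are not shown in this README, and `example_v2` is a hypothetical method name.

```python
import asyncio

# Hypothetical stand-in for collectors.COLLECTORS (method name -> callable).
COLLECTORS = {}


class BaseCollector:
    """Simplified async base class: subclasses implement collect()."""

    async def collect(self, url):
        raise NotImplementedError

    async def __call__(self, url):
        items = await self.collect(url)
        # Normalize to plain dicts so v1 and v2 results look the same to the caller.
        return [dict(item) for item in items]


def collector(name):
    """Class decorator that registers an instance under the given method name."""
    def register(cls):
        COLLECTORS[name] = cls()
        return cls
    return register


@collector("example_v2")  # hypothetical method name
class ExampleCollector(BaseCollector):
    async def collect(self, url):
        # A real collector would fetch and parse the announcement list here.
        return [{"title": "demo", "date": "2026-01-01", "url": url}]


# The dispatcher only needs the method name from companies.json.
result = asyncio.run(COLLECTORS["example_v2"]("https://example.com/list"))
```

Because lookup goes through a single registry keyed by method name, switching a company from a v1 function to a v2 class is a configuration change only.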

2. Pipeline Orchestrator

One-command end-to-end run (recommended):

cd ~/.openclaw/workspace/skills/cfc-disclosure-monitor/

# Full run: P1 → P2 → P3
python3 pipeline.py

# Start from a given phase
python3 pipeline.py --phase 2              # start from P2 (all companies)
python3 pipeline.py --phase 3 --company 中邮消费金融  # P3 only, single company

# List the companies already collected
python3 pipeline.py --list

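Running each phase independently:

# Phase 1: collect lists + PDFs
python3 collect.py --date 2026-04-15
python3 collect.py --company "中邮消费金融" --no-detail  # list only, for verification

# Phase 2: parse PDFs and extract partner institutions (does not modify announcements.json)
python3 phase2_parse.py                    # all companies
python3 phase2_parse.py 中邮消费金融       # single company

# Phase 3: write to the knowledge graph
python3 phase3_ontology.py                 # all companies
python3 phase3_ontology.py 中邮消费金融    # single company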

3. Phase 1: Announcement List Collection

Goal: collect every disclosure announcement from the official websites of the 30 consumer finance companies, saving body text and attachments.

Output structure

cfc_raw_data/{company}/                ← one directory per company
├── announcements.json                 ← announcement list (with body text excerpt)
│   [{
│     "title": "中邮消费金融有限公司催收合作机构信息公示",
│     "date": "2026-02-28",
│     "url": "https://www.youcash.com/xxgg/77802.html",
│     "category": "合作机构",
│     "text": "尊敬的客户:...",      ← HTML body text
│     "_content_type": "html|vue|pdf|image",
│     "_attachments": [{"filename":"xxx.pdf","path":"...","type":"pdf"}]
│   }]
└── attachments/                       ← raw files needed by Phase 2
    └── *.pdf
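
As a rough illustration of how a downstream step might consume this structure, the sketch below reads announcements.json and lists the PDF attachments. The helper name `pdf_attachments` is hypothetical, and the `"type": "image"` attachment in the demo data is an assumption; only `"type": "pdf"` appears in the structure above.

```python
import json
import tempfile
from pathlib import Path


def pdf_attachments(company_dir):
    """Return (title, path) pairs for every PDF attachment in announcements.json."""
    source = Path(company_dir) / "announcements.json"
    records = json.loads(source.read_text(encoding="utf-8"))
    pairs = []
    for record in records:
        # _attachments may be absent on announcements with no files.
        for attachment in record.get("_attachments", []):
            if attachment.get("type") == "pdf":
                pairs.append((record["title"], attachment["path"]))
    return pairs


# Demo against a throwaway directory with one made-up announcement.
sample = [{
    "title": "Demo notice",
    "date": "2026-02-28",
    "url": "https://example.com/1.html",
    "category": "合作机构",
    "text": "...",
    "_content_type": "html",
    "_attachments": [
        {"filename": "a.pdf", "path": "attachments/a.pdf", "type": "pdf"},
        {"filename": "b.png", "path": "attachments/b.png", "type": "image"},
    ],
}]
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "announcements.json").write_text(
        json.dumps(sample, ensure_ascii=False), encoding="utf-8"
    )
    found = pdf_attachments(tmp)
```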

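Collection method mapping (companies.json)

| Site architecture | Method | Representative companies |
|---|---|---|
| Static list | html_dom | 蚂蚁, 中信, 宁银 (old), 河北 |
| JSON API (v2) | cfcbnb_v2 | 宁银 (Vue data.json, 47 items) |
| Detail-page traversal (v2) | mengshang_v2 | 蒙商 (HTML detail pages, 86 institutions) |
| layui pagination | zhongyin_layui | 中银 (13 pages) |
| Vue SPA | vue_pagination | 招联 |
| AJAX pagination | multi_page | 湖北 (3 tabs) |
| Homepage scroll | homepage_scroll | 平安, 金美信 |
| Two-column pagination | jinshang_two_col | 晋商 |
| Vue Tab | suyinkaiji_vue | 苏银凯基 |
| CDP screenshot | cdp3.py | 长银五八 |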

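4. Phase 2: PDF Parsing and Entity Extraction

Goal: scan announcements.json → download PDFs → extract text with pdftotext → save structured partner institution lists.

Supported disclosure types

| Disclosure type | Keyword | Output file |
|---|---|---|
| Collection partners (催收合作机构) | 催收 | {company}_合作机构_催收合作机构_{date}.json |
| Credit enhancement service providers (增信服务机构) | 增信担保 | {company}_合作机构_互联网贷款增信服务机构_{date}.json |
| Platform operators (平台运营机构) | 平台运营机构 | {company}_合作机构_互联网贷款平台运营机构_{date}.json |
| Related-party transactions (关联交易) | 关联交易 | same pattern as above |
| Non-performing assets (不良资产) | 不良资产 | same pattern as above |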

Output JSON format

[
  {"name": "和君纵达数据科技股份有限公司", "phone": "18855123966"},
  {"name": "众焱普惠科技有限公司", "phone": "028-85567577"}
]

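Extraction rules

  • Company name: {4-20 Chinese characters}(有限公司|股份有限公司|有限责任公司|集团|事务所)
  • Phone number, any of: 0\d{2,3}[-\s]\d{7,8} | 1[3-9]\d{9} | 95\d{3,5} | 400\d{7}
  • Automatic repair: company names broken across a line break are rejoined (adjacent lines are concatenated)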
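
The rules above can be approximated with the sketch below. It is an illustration under stated assumptions, not the code in phase2_parse.py: the function names are hypothetical, the phone patterns are treated as four alternatives, "Chinese characters" is taken to mean the range U+4E00 to U+9FFF, and each list entry is assumed to sit on one line once broken names are rejoined.

```python
import re

SUFFIXES = ("股份有限公司", "有限责任公司", "有限公司", "集团", "事务所")
HAN = "[\u4e00-\u9fff]"
COMPANY_RE = re.compile(HAN + "{4,20}(?:" + "|".join(SUFFIXES) + ")")
PHONE_RE = re.compile(r"0\d{2,3}[-\s]\d{7,8}|1[3-9]\d{9}|95\d{3,5}|400\d{7}")


def rejoin_broken_names(text):
    """Concatenate a line onto the previous one when a name looks split mid-word."""
    merged = []
    for line in text.splitlines():
        prev = merged[-1] if merged else ""
        split_mid_name = (
            re.search(HAN + "$", prev)          # previous line ends in a Chinese character
            and re.match(HAN, line)             # this line starts with one
            and not prev.endswith(SUFFIXES)     # previous line is not already a complete name
        )
        if split_mid_name:
            merged[-1] = prev + line
        else:
            merged.append(line)
    return merged


def extract_partners(text):
    """Return [{"name": ..., "phone": ...}] for each line containing a company name."""
    partners = []
    for line in rejoin_broken_names(text):
        name = COMPANY_RE.search(line)
        if name:
            phone = PHONE_RE.search(line)
            partners.append({"name": name.group(), "phone": phone.group() if phone else ""})
    return partners


# The first name is deliberately split across a line break.
sample = "和君纵达数据科技股份有\n限公司 18855123966\n众焱普惠科技有限公司 028-85567577"
partners = extract_partners(sample)
```

The greedy `{4,20}` quantifier keeps names such as "…集团有限公司" whole instead of stopping at the first suffix, at the cost of possibly merging two names that share a line with no separator.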

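5. Phase 3: Knowledge Graph Writes

Goal: write the structured data into memory/ontology/graph.jsonl and build cross-company links.

Entity types

| Type | Description | Example |
|---|---|---|
| Company | Partner institution / company | 蚂蚁智信, 和君纵达 |
| DisclosureList | Disclosure list | 中邮催收合作机构2026 (101 companies) |
| DisclosureDocument | A specific disclosure document | 中邮消金催收公示PDF |
| CooperationRelation | Cooperation relationship (expressed through a relation) | |

Relation types

| Relation | From | To | Description |
|---|---|---|---|
| cooperates_with | Company | Company (consumer finance) | Partner institution ↔ consumer finance company |
| publishes_disclosure | Company (consumer finance) | DisclosureList | Consumer finance company publishes a disclosure list |
| includes_company | DisclosureList | Company | List includes a partner institution |
| disclosed_by | DisclosureDocument | Company (consumer finance) | Document is published by the consumer finance company |
| appears_in_document | Company | DisclosureDocument | Partner institution appears in the document body |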
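
To make the relation types concrete, the sketch below builds the edges for one disclosure list and serializes them as JSON Lines. The record layout (`rel`, `from`, `to`) and the list ID are hypothetical; this README does not show the actual format that phase3_ontology.py writes to graph.jsonl.

```python
import json


def relations_for_list(cfc_id, list_id, partner_ids):
    """Build relation records linking a consumer finance company, one list, and its partners."""
    edges = [{"rel": "publishes_disclosure", "from": cfc_id, "to": list_id}]
    for partner_id in partner_ids:
        edges.append({"rel": "includes_company", "from": list_id, "to": partner_id})
        edges.append({"rel": "cooperates_with", "from": partner_id, "to": cfc_id})
    return edges


# IDs in the co_* style follow the query examples in this README; the list ID is made up.
edges = relations_for_list(
    "co_zhongyou", "list_zhongyou_collection_2026", ["co_yinhai_guarantee"]
)
# One JSON object per line; ensure_ascii=False keeps Chinese text greppable.
jsonl = "\n".join(json.dumps(edge, ensure_ascii=False) for edge in edges)
```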

Query examples (grep graph.jsonl)

# Which consumer finance companies work with 河北银海?
grep "河北银海" memory/ontology/graph.jsonl

# All partner institutions of 中邮消金
grep "co_zhongyou" memory/ontology/graph.jsonl

# All credit enhancement service providers
grep "GuaranteeAgency" memory/ontology/graph.jsonl
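
When the matches need further processing, the same substring search can be done in Python and each matching line parsed as JSON. This is a minimal sketch: `search_graph` is a hypothetical helper, it assumes only that graph.jsonl holds one JSON value per line, and the demo records use a made-up layout.

```python
import json
import tempfile
from pathlib import Path


def search_graph(path, keyword):
    """Return parsed records from a JSONL file whose raw line contains keyword."""
    matches = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if keyword in line:
                matches.append(json.loads(line))
    return matches


# Demo data: this record layout is hypothetical, not the skill's real schema.
demo_records = [
    {"rel": "cooperates_with", "from": "co_yinhai_guarantee", "to": "co_zhongyou"},
    {"rel": "cooperates_with", "from": "co_other", "to": "co_elsewhere"},
]
with tempfile.TemporaryDirectory() as tmp:
    graph = Path(tmp) / "graph.jsonl"
    graph.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in demo_records) + "\n",
        encoding="utf-8",
    )
    hits = search_graph(graph, "co_zhongyou")
```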

6. Ontology Schema Extensions

The ontology schema has been extended for consumer finance monitoring:

types:
  Company:
    required: [name]
    properties:
      name: string
      company_type: string   # LawFirm|TechCompany|CollectionBPO|GuaranteeCompany...
      disclosure_type: string
      source_company: string
      phone: string

  DisclosureList:
    required: [name, company, disclosure_type, count, date]
    properties:
      name: string
      company: string
      disclosure_type: string
      count: integer
      date: string
      source_url: string

  DisclosureDocument:
    required: [title, date, source_company]
    properties:
      title: string
      date: string
      url: string
      category: string
      disclosure_type: string
      source_company: string
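
A check against the `required` lists in this schema could look like the sketch below. It is illustrative only: the `REQUIRED` table is copied by hand from the schema above, `missing_fields` is a hypothetical helper, and whether the skill validates entities this way is not stated in this README.

```python
# Required properties per entity type, copied from the schema above.
REQUIRED = {
    "Company": ["name"],
    "DisclosureList": ["name", "company", "disclosure_type", "count", "date"],
    "DisclosureDocument": ["title", "date", "source_company"],
}


def missing_fields(entity_type, props):
    """Return the required properties that are absent or empty for this entity type."""
    return [
        field
        for field in REQUIRED[entity_type]
        if props.get(field) is None or props.get(field) == ""
    ]


ok = missing_fields("Company", {"name": "和君纵达"})
incomplete = missing_fields(
    "DisclosureList", {"name": "demo list", "company": "中邮消费金融", "count": 101}
)
```

A count of 0 is treated as present, since only `None` and the empty string count as missing.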

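7. Usage Examples

Scenario: find out which consumer finance company works with 河北银海融资担保

Old way (manual search):

  1. Recall earlier analysis documents
  2. Search by keyword
  3. Scroll through chat history
  4. Cannot confirm → answer "not found"

New way (ontology query):

grep "河北银海" memory/ontology/graph.jsonl

→ Immediate result: co_zhongyou ← cooperates_with ← co_yinhai_guarantee

Scenario: add disclosure collection for a new company

  1. Add the company's URL and method to companies.json
  2. python3 pipeline.py --phase 1 --company "XX消费金融"
  3. python3 pipeline.py --phase 2 --company "XX消费金融"
  4. python3 pipeline.py --phase 3 --company "XX消费金融"
  5. The data is loaded into the graph automatically and can be queried right away

Scenario: backfill historical PDFs in bulk

# Assumes announcements.json has been collected but the PDFs were not downloaded
python3 phase2_parse.py    # scans all PDF links and downloads them
python3 phase3_ontology.py # extracts entities and writes them to the ontology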

8. Directory Structure Overview

~/.openclaw/workspace/
├── cfc_raw_data/                         ← raw data lake
│   ├── 中邮消费金融/
│   │   ├── announcements.json            ← Phase 1 output
│   │   ├── 中邮消费金融_合作机构_催收合作机构_2026-02-28.json  ← Phase 2 output
│   │   ├── 中邮消费金融_合作机构_互联网贷款增信服务机构_2026-03-31.json
│   │   └── attachments/                 ← PDFs/images
│   ├── 兴业消费金融/
│   │   └── ...
│   └── ...
├── memory/ontology/
│   └── graph.jsonl                       ← Phase 3 output (knowledge graph for all companies)
└── skills/cfc-disclosure-monitor/
    ├── pipeline.py                       ← unified orchestrator
    ├── phase2_parse.py                   ← Phase 2
    ├── phase3_ontology.py                ← Phase 3
    └── ...

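9. Changelog

| Date | Version | Changes |
|---|---|---|
| 2026-04-15 | v4.0 | Deep refactor: removed duplicate methods and dead code from phase1_base.py (325 lines, down from 474); deleted the dead function collect_mengshang at the end of collectors.py (2240 lines, down from 2253); unified v1 + v2 in the COLLECTORS registry; kept the Announcement dict compatibility layer; cfcbnb_v2/mengshang_v2 verified (47 items for 宁银) |
| 2026-04-15 | v3.0 | Three phases connected end to end: Phase 1→2→3 pipeline; added phase2_parse.py, phase3_ontology.py, and pipeline.py; ontology graph.jsonl is the single output |
| 2026-04-15 | v2.0 | Merged Phase 1 and 2; detail pages fetched in the same pass; improved PDF detection |
| 2026-04-05 | v1.0 | Initial version; list of 30 companies confirmed; html_dom method established |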
Usage Guidance
This skill's code matches its stated goal (scraping 30 firms' disclosures, parsing PDFs, building an ontology), but there are several practical and security concerns you should consider before running it:

  • External OCR service: clean_and_eval.py will upload images to https://api.minimax.chat. The file contains a hard-coded default API key that will be used if you don't set MINIMAX_API_KEY yourself. That means data (images/PDF content) may be transmitted to a third party under someone else's key. Replace the default key with your own, or remove the external OCR calls if the data is sensitive.
  • Undeclared dependencies: The package does not declare required binaries or a reproducible install step. The code uses Playwright (which downloads browser engines), pdfplumber, trafilatura, and other Python packages. Install and run only in a controlled environment (container or VM) and review requirements before installing browser binaries.
  • Network activity and data storage: The skill will fetch many pages and attachments from external websites and store them under your workspace (~/.openclaw/... and /tmp). Expect significant outbound HTTP traffic and local files. Run it on isolated infrastructure if you need to protect other data.
  • Audit the remaining source: The provided excerpts show the hard-coded API key and many network operations; inspect the rest of the files (truncated in the manifest) for other hidden endpoints or secrets before trusting the skill.

If you plan to use this skill: (1) audit the source, (2) set MINIMAX_API_KEY to your own key or remove the service calls, (3) run in a sandboxed environment, and (4) add/verify an install spec that documents dependencies and their expected behavior.
Capability Analysis
Type: OpenClaw Skill
Name: cfc-disclosure-monitor
Version: 4.5.0

The skill bundle is a comprehensive pipeline for scraping and analyzing financial disclosure data. It is classified as suspicious due to the presence of hardcoded API keys for MiniMax and GLM services in 'clean_and_eval.py' and 'parser.py', which constitutes a significant credential leak. Additionally, the use of 'subprocess.run' to execute external binaries like 'pdftotext' and 'pdftoppm' in 'phase1_base.py' and 'vlm_ocr.py' presents a potential command injection risk, although the behavior appears consistent with the stated purpose of the tool rather than intentional malice.
Capability Tags
crypto · can-make-purchases · requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The name/description (collecting consumer finance company announcements and building a knowledge graph) match the code: many collectors, PDF parsing, OCR, and ontology writers are present. However, the SKILL.md and registry metadata declare no required binaries or env vars, while the code clearly expects heavy runtime dependencies (Playwright, pdfplumber, trafilatura, etc.) and an external VLM OCR integration. The omission of these runtime requirements is a mismatch and should have been declared.
Instruction Scope
SKILL.md instructs running pipeline.py / collect.py / phase2/3 scripts which is consistent with the code. It does NOT call out that attachments/images/PDFs downloaded from third‑party websites will be uploaded to an external OCR API (https://api.minimax.chat) by clean_and_eval.py, nor does it document the effect of the default API key fallback. The runtime instructions therefore omit a notable external-network data flow.
Install Mechanism
There is no install spec (instruction-only at packaging level). That lowers install‑time risk, but the code requires non-trivial Python packages and Playwright (which downloads browser binaries at runtime). Absence of declared dependency/install steps is an operational gap and increases risk if a user runs the skill without properly sandboxing or vetting dependencies.
Credentials
The skill metadata declares no required environment variables, yet clean_and_eval.py reads MINIMAX_API_KEY and supplies a hard-coded default API key in the source. This means, unless the user sets their own key, images will be sent to a third‑party service using the embedded key (undeclared). That is disproportionate to what the manifest claims and introduces potential privacy, billing, and exfiltration concerns.
Persistence & Privilege
The skill is not always-enabled and is user-invocable (normal). It writes output to local workspace dirs (cfc_raw_data, memory/ontology/graph.jsonl) and /tmp for downloads — these are expected for a scraper/ETL skill and do not request elevated platform privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install cfc-disclosure-monitor
  3. After installation, invoke the skill by name or use /cfc-disclosure-monitor
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
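v4.5.0
苏银凯基 (suyinkaiji): detail pages extracted by click-through (full body text for items 1 and 3)
v4.4.0
P0 fixes, round 2: 招联 date extraction + title cleanup (zhaolian → 70 items); 南银法巴 infinite-loop fix (nyfb → 36 items); 杭银 POST API (hangyin → 110 items); 苏银凯基 SSR DOM parsing + multiple tabs (suyinkaiji → 9 items); fixed collect.py to inject dedicated CDP parameters for suyinkaiji
v4.3.0
P0 fixes: 招联 (zhaolian) API pagination → 70 items; 南银法巴 (nyfb) HTML parsing → 36 items; 杭银 (hangyin) POST API → 110 items; 苏银凯基 (suyinkaiji) keeps the old method for now because of Cloudflare WAF restrictions (0 items; needs a CDP connection to a real Chrome)
v4.2.0
陕西长银 fix: dual-tab collection + body text extraction (9/10 items); 海尔 detail collection stabilized (133 items, 127 with body text)
v4.1.0
陕西长银 fix: dual-tab collection + body text extraction (9/10 items); URLs captured correctly
v1.0.0
v1.0.0: full coverage of 30 companies (1427 items), incremental collection, daily cron, architecture refactor (three-layer split: core/collectors/collect)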
Metadata
Slug cfc-disclosure-monitor
Version 4.5.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 6
Frequently Asked Questions

What is Cfc Disclosure Monitor?

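Cfc Disclosure Monitor collects disclosure announcements from the official websites of 30 licensed consumer finance companies and supports incremental daily cron runs. It is an AI Agent Skill for Claude Code / OpenClaw, with 138 downloads so far.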

How do I install Cfc Disclosure Monitor?

Run "/install cfc-disclosure-monitor" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Cfc Disclosure Monitor free?

Yes, Cfc Disclosure Monitor is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Cfc Disclosure Monitor support?

Cfc Disclosure Monitor is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Cfc Disclosure Monitor?

It is built and maintained by linwumeng (@linwumeng); the current version is v4.5.0.
