Description

Research-backed chapter pipeline: outline → parallel ACP research → MD synthesis → DOCX generation → source verification → screenshot verification (mandatory)

README (SKILL.md)

docx-chapter

Name: Docx Chapter
Author: ccmxigua

Outline → parallel search → content synthesis → DOCX generation → source verification → delivery

Input format

title: "Chapter title"
description: "Chapter positioning and goals"
sections:
  - heading: "3.1 First-level heading"
    topics:
      - "Subtopic 1"
      - "Subtopic 2"
  - heading: "3.2 First-level heading"
    topics:
      - "..."
reference_docx: "/path/to/reference.docx"  # optional, DOCX style reference
output_dir: "/path/to/output"              # optional, defaults to ~/.openclaw/media/outbound/

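Workflow

Phase 1: Outline parsing → parallel research plan

Parse the outline YAML and split each section into independent research topics
Generate an ACP agent task for each topic: /unified_search <keywords> + Tavily Research deep search → source collection
Output a research plan checklist: topic → search keywords → expected number of sources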
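
The three Phase 1 steps can be sketched roughly as below. This is only an illustration: the `ResearchTask` class, the `build_research_plan` function, the one-task-per-topic split, and the default of 3 expected sources are assumptions, and the outline is passed in as an already-parsed dict rather than read from YAML.

```python
from dataclasses import dataclass


@dataclass
class ResearchTask:
    """One row of the research plan (hypothetical structure)."""
    heading: str
    topic: str
    search_command: str
    expected_sources: int


def build_research_plan(outline: dict, expected_sources: int = 3) -> list[ResearchTask]:
    """Turn a parsed outline into one research task per topic."""
    tasks = []
    for section in outline.get("sections", []):
        for topic in section.get("topics", []):
            tasks.append(
                ResearchTask(
                    heading=section["heading"],
                    topic=topic,
                    # Assumption: the topic text doubles as the search keywords.
                    search_command=f"/unified_search {topic}",
                    expected_sources=expected_sources,
                )
            )
    return tasks


outline = {
    "title": "Chapter title",
    "sections": [
        {"heading": "3.1 Testing stages", "topics": ["alpha testing", "beta testing"]},
        {"heading": "3.2 Release", "topics": ["canary release"]},
    ],
}
for task in build_research_plan(outline):
    print(task.heading, "->", task.search_command, "->", task.expected_sources)
```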

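Phase 2: Parallel ACP research (two search sources: unified_search + Tavily Research)

You must use sessions_spawn(runtime="acp", agentId="claude", mode="run"). The default runtime="subagent" is forbidden (it inherits Telegram routing and fails).

Fixed ACP parameter rule: the parameters above are a fixed template. Do not add lightContext (valid only for subagent; ACP does not recognize it), and do not add extra parameters such as runTimeoutSeconds. ACP agents use the default timeout of their respective backends.

⚠️ Warning #50597: the ACP Claude CLI has a ~30% chance of returning an empty result (only a thinking block, 0 output tokens). See GitHub #50597. Fallback strategy: when a sub-agent returns empty, times out, or reports Internal error, trigger an automatic retry (at most 2 retries). If all 3 attempts are empty or failed, fall back to /unified_search (three-engine search). Using llm-task directly is forbidden.

Why llm-task fallback is forbidden: when llm-task lets the model search directly, the model fabricates or guesses URLs (verified: BrowserStack /guide/beta-testing and /guide/alpha-vs-beta-testing, Testim /blog/beta-testing/, and SoftwareTestingHelp /alpha-testing-vs-beta-testing/ were all URLs hallucinated by llm-task and return 404). /unified_search returns real, reachable URLs.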
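
A minimal sketch of the retry-then-fallback policy described above. The function `research_with_fallback` and both callables are hypothetical stand-ins for the real ACP spawn and /unified_search calls, and treating "no sources" as an empty result is an assumption.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # 1 initial call + 2 automatic retries


def research_with_fallback(
    run_acp: Callable[[str], Optional[dict]],
    run_unified_search: Callable[[str], dict],
    topic: str,
) -> dict:
    """Try the ACP agent up to MAX_ATTEMPTS times, then fall back to unified_search."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = run_acp(topic)
        except Exception:
            # Timeouts and "Internal error" are treated like an empty result.
            result = None
        # Assumption: a result without any sources counts as empty.
        if result and result.get("sources"):
            result["attempts"] = attempt
            return result
    # All attempts were empty or failed: use unified_search, never llm-task.
    fallback = run_unified_search(topic)
    fallback["degraded"] = True
    return fallback
```

Passing the two backends in as callables keeps the policy testable without a live ACP session.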

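Each ACP sub-agent runs two searches and then merges the results:

① unified_search (fast structured search)

/unified_search <English keywords>

Record each source's URL, title, date, authority, and original-text excerpt.

② Tavily deep research (adds coverage and depth)

node <tavily-research>/scripts/research.mjs \
  "<research question in English>" --model pro --out /tmp/tavily_<topic>.md

Read /tmp/tavily_<topic>.md and extract from the report:

All cited URLs and titles
The 2-3 original sentences most relevant to the topic
Sources that unified_search did not cover

Tavily Research takes 20-60 s; while waiting, you can organize the unified_search results.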

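③ Merge and deduplicate

Deduplicate by URL; when the same URL appears twice, keep the version with richer content
unified_search takes priority for URL accuracy; Tavily is used for added depth
Tag each source with its search channel: source_channel

Return structured JSON: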

{
  "topic": "...",
  "sources": [
    {
      "url": "...",
      "title": "...",
      "date": "...",
      "authority": "gov|academic|news|commercial",
      "source_channel": "unified_search|tavily_research|both",
      "quotes": ["original sentence 1", "original sentence 2"],
      "key_data": {"metric": "value"}
    }
  ],
  "suggested_body": "draft body text based on the sources"
}
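
The merge-and-deduplicate step (③) could look roughly like this, working on source objects shaped like the JSON above. The `richness` measure (total quote length plus number of key_data entries) is an assumption; the skill does not define what "richer content" means.

```python
def richness(source: dict) -> int:
    """Hypothetical richness score: quote length plus number of key_data entries."""
    return sum(len(q) for q in source.get("quotes", [])) + len(source.get("key_data", {}))


def merge_sources(unified: list[dict], tavily: list[dict]) -> list[dict]:
    """Deduplicate by URL and tag each source with its search channel."""
    merged: dict[str, dict] = {}
    for channel, sources in (("unified_search", unified), ("tavily_research", tavily)):
        for src in sources:
            entry = dict(src, source_channel=channel)
            url = entry["url"]
            if url not in merged:
                merged[url] = entry
                continue
            existing = merged[url]
            # On a tie, max() keeps the first argument, i.e. the earlier entry.
            keep = max(existing, entry, key=richness)
            tag = channel if existing["source_channel"] == channel else "both"
            merged[url] = dict(keep, source_channel=tag)
    return list(merged.values())
```

Because unified_search results are processed first, a tie leaves the unified_search version in place, which matches its priority for URL accuracy.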

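Phase 3: Content synthesis

Update research_results.json: backfill the research data (update only; do not overwrite existing sources):
- name ← the matching heading from topics.json (e.g. "Alpha testing")
- summary ← a 2-4 sentence summary extracted from the suggested_body returned by the ACP agent
- key_findings ← 3-5 key findings extracted from suggested_body/sources
- status ← "research_completed"
Merge the results returned by all sub-agents
Deduplication + conflict detection: when several sub-agents return the same URL, keep the version with the highest authority; mark differing statements of the same fact as ⚠️ conflict
Generate the MD body: every data point or claim carries a footnote reference [^N]
Footnote format: [^N]: Author/Organization。《Title》。Publication date。URL
Footnotes are numbered consecutively starting from 0
Emit verification_keywords.json at the same time: each time a [^N] is written, also record:
- footnote_id: N
- source_url: the corresponding source URL
- keywords: the core terms/data cited at that point (2-5 English words or phrases, e.g. ["defect reduction", "85%", "alpha testing"])
- Format: { "footnotes": [ { "id": "0", "url": "...", "keywords": [...] }, ... ] }
- Principle: whoever writes the footnote knows best what is being cited, so the keywords are fixed on the spot.
Output the draft → human review (mandatory, cannot be skipped)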
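
As a rough illustration of writing a footnote and its verification keywords in the same step, the sketch below numbers footnotes from 0 and emits the `{ "footnotes": [...] }` structure. The helper names and the `author` field on the source dict are assumptions; the Phase 2 JSON does not include an author.

```python
import json


def add_footnote(footnotes: list[dict], source: dict, keywords: list[str]) -> str:
    """Register a citation and return its inline marker, e.g. '[^0]'."""
    footnote_id = len(footnotes)  # numbering starts at 0 and stays consecutive
    definition = (
        f"[^{footnote_id}]: {source['author']}。《{source['title']}》。"
        f"{source['date']}。{source['url']}"
    )
    footnotes.append(
        {"id": str(footnote_id), "url": source["url"],
         "keywords": keywords, "definition": definition}
    )
    return f"[^{footnote_id}]"


def verification_keywords(footnotes: list[dict]) -> str:
    """Serialize the keyword records in the verification_keywords.json layout."""
    records = [
        {"id": f["id"], "url": f["url"], "keywords": f["keywords"]} for f in footnotes
    ]
    return json.dumps({"footnotes": records}, ensure_ascii=False, indent=2)


notes: list[dict] = []
src = {"author": "OECD", "title": "Report", "date": "2024", "url": "https://example.org/a"}
marker = add_footnote(notes, src, ["defect reduction", "85%"])
print(f"Alpha testing reduced defects by 85%{marker}")
print(notes[0]["definition"])
print(verification_keywords(notes))
```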

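Phase 3.5: Body footnote completeness check (mandatory)

This phase cannot be skipped. Before DOCX generation, scan the body of chapter.md for leftover inline citations, that is, parenthetical citations that were never brought into the [^N] footnote system.

Scan regex (Python):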

import re

INLINE_CITE_RE = re.compile(
    r'(?:'
    r'[（(][A-Z][A-Za-z]+(?:\s+et\s+al\.?|\s+等)?[，,\s]+\d{4}[a-z]?[)）]'  # （MDPI, 2022） / (Author, 2023)
    r'|[A-Z][A-Za-z]+[（(]\d{4}[)）]'                                         # Nature（2025）
    r')'
)

Detection procedure:

python3 -c "
import re, sys
with open(sys.argv[1], 'r', encoding='utf-8') as f:
    text = f.read()
INLINE_CITE_RE = re.compile(
    r'(?:'
    r'[（(][A-Z][A-Za-z]+(?:\s+et\s+al\.?|\s+等)?[，,\s]+\d{4}[a-z]?[)）]'
    r'|[A-Z][A-Za-z]+[（(]\d{4}[)）]'
    r')'
)
for i, line in enumerate(text.splitlines(), 1):
    if line.lstrip().startswith('[^'):  # skip footnote definitions
        continue
    for m in INLINE_CITE_RE.finditer(line):
        print(f'L{i}: \"{m.group()}\": not part of the [^N] footnote system')
" chapter.md

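Handling rules:

Each match is an ERROR. Every match must be resolved by answering:
- Which verified source (URL) does this citation correspond to?
- If one exists → replace it with a [^N] footnote reference
- If it matches no verified source → delete the sentence or soften the wording (remove the citation)
Pass condition: the scan produces no output
This check must finish before pandoc generates the DOCX; any unresolved ERROR blocks entry into Phase 4

Common pitfalls:

All-caps abbreviations (e.g. MDPI, OECD) are also captured
Mixed Chinese/English parentheses are covered
Footnote definition lines ([^N]:) are skipped automatically and never reported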

Phase 4: DOCX generation

{{- if .args.reference_docx }}
# First verify that reference.docx exists; if it does not, drop --reference-doc (otherwise Pandoc exits with code 99)
pandoc chapter.md -o chapter.docx \
  --reference-doc="{{ .args.reference_docx }}" \
  --from markdown+footnotes+tex_math_dollars --to docx
{{- else }}
pandoc chapter.md -o chapter.docx \
  --from markdown+footnotes+tex_math_dollars --to docx
{{- end }}

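--reference-doc is optional: when omitted, Pandoc uses its built-in default styles.

⚠️ Pandoc 3.7 compatibility: the tex_math_single_dollar extension is no longer supported in Pandoc 3.7.0.2; use tex_math_dollars only. If --reference-doc points to a file that does not exist, Pandoc fails with exit code 99; verify the file with test -f before running.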
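
One way to apply the "check the file first" rule from a script is to build the Pandoc argument list conditionally. This is a sketch: `build_pandoc_command` is a hypothetical helper, and only the flags already shown in this section are used.

```python
from pathlib import Path
from typing import Optional


def build_pandoc_command(
    md_path: str, docx_path: str, reference_docx: Optional[str] = None
) -> list[str]:
    """Build the pandoc argv, dropping --reference-doc when the file is missing."""
    cmd = ["pandoc", md_path, "-o", docx_path]
    # Equivalent of `test -f`: only pass the reference doc if it is a regular file.
    if reference_docx and Path(reference_docx).is_file():
        cmd.append(f"--reference-doc={reference_docx}")
    cmd += ["--from", "markdown+footnotes+tex_math_dollars", "--to", "docx"]
    return cmd


print(build_pandoc_command("chapter.md", "chapter.docx", "/path/to/reference.docx"))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, so a non-zero Pandoc exit code surfaces as an exception.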

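Heading level mapping: Pandoc maps Markdown headings to Word built-in styles:

# → Heading1
## → Heading2
### → Heading3
#### → Heading4

Chapter numbers must be written into the heading text by hand; Pandoc does not number headings automatically.

Footnote URL hyperlinking (mandatory): by default Pandoc writes URLs in footnotes as plain text, which is not clickable in Word. After generating the DOCX you must run the post-processing script add_hyperlink_footnotes.py:

python3 add_hyperlink_footnotes.py chapter.docx chapter.docx

⚠️ Pandoc renumbers footnotes: [^42] in the markdown may become ID 37, 38, and so on in the DOCX (Pandoc assigns consecutive IDs in order of appearance, and a repeated reference is copied to a separate ID). When troubleshooting, locate footnotes by URL content; never assume markdown number = DOCX number.

⚠️ add_hyperlink_footnotes.py depends on research_results.json: the script reads the URL→footnote mapping from research_results.json to build the hyperlinks. If you change chapter.md or research_results.json without rerunning it, the footnote URLs in the DOCX may still hold the old values. After changing any of these files, you must rerun: pandoc → add_hyperlink_footnotes.py → verify_docx.py.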

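How the post-processing works:

Unzip the DOCX and parse word/footnotes.xml
Find the URL in each footnote body (with the corrected regex, whose exclusion set includes the full-width parentheses （）)
Split the URL text out into its own <w:hyperlink> element (excluding Chinese suffixes such as （日期访问）, "accessed on date")
Add <w:rStyle w:val="Hyperlink"/> to the <w:r> inside the hyperlink so it renders blue and underlined
Register the matching relationship in word/_rels/footnotes.xml.rels
Repackage the DOCX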
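
The core of those steps can be sketched with the standard library as follows. This is not the actual add_hyperlink_footnotes.py; the helper names are hypothetical, and unzipping, run replacement, and repackaging are left out. The namespaces and the hyperlink relationship type are the standard OOXML ones.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R + "/hyperlink"
URL_RE = re.compile(r'https?://[^\s<>"」。，；：\!\?』（）、《》【】…—～]+')


def split_at_url(text: str) -> tuple[str, str, str]:
    """Split footnote text into (before, url, after); url is '' if none is found."""
    m = URL_RE.search(text)
    if not m:
        return text, "", ""
    return text[: m.start()], m.group(), text[m.end():]


def make_hyperlink(url: str, rel_id: str) -> ET.Element:
    """Build <w:hyperlink r:id=...> wrapping a run styled as 'Hyperlink'."""
    link = ET.Element(f"{{{W}}}hyperlink", {f"{{{R}}}id": rel_id})
    run = ET.SubElement(link, f"{{{W}}}r")
    props = ET.SubElement(run, f"{{{W}}}rPr")
    ET.SubElement(props, f"{{{W}}}rStyle", {f"{{{W}}}val": "Hyperlink"})
    ET.SubElement(run, f"{{{W}}}t").text = url
    return link


def make_relationship(url: str, rel_id: str) -> ET.Element:
    """Build the entry to append to word/_rels/footnotes.xml.rels."""
    return ET.Element(
        f"{{{PKG_REL}}}Relationship",
        {"Id": rel_id, "Type": HYPERLINK_TYPE, "Target": url, "TargetMode": "External"},
    )
```

Splitting before building the element keeps a trailing suffix such as （2025年访问） in a plain run outside the hyperlink.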

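Phase 5: Source verification (automatic)

Run the following checks on every footnote:

Check	Method	Pitfall avoided
URL reachable	`curl -sIL --max-time 10 <URL>`	-
URL must have a hyperlink	Error if `urls_in_body` is non-empty and the URL is not inside a hyperlink (`<w:t>` inside hyperlinks is excluded automatically)	Gap fill
Duplicate URL (one plain-text copy + one hyperlink copy)	Error if `urls_in_body` ∩ `urls_in_hyperlink` is non-empty (the same URL was written twice)	Pitfall #4
footnoteRef present	`<ns0:footnote>` must contain `<ns0:footnoteRef />`	Pitfall #6
No explicit font/size	`ns0:ascii`, `仿宋` (FangSong), and `ns0:sz` must not appear	Pitfall #7
URL extraction regex correct	Use the correct regex that includes Chinese punctuation (see below)	Pitfall #5

URL extraction regex (corrected):

# ❌ Wrong: Chinese punctuation is not in the exclusion set, so trailing text is merged into the URL
re.findall(r'https?://[^\s<>"]+', text)

# ✅ Correct: includes full-width parentheses and other Chinese punctuation
re.findall(r'https?://[^\s<>"」。，；：\!\?』（）、《》【】…—～]+', text)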
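
To see the difference between the two patterns, the short demonstration below runs both on a made-up footnote with a Chinese access-date suffix. The footnote text and the constant names are illustrative only.

```python
import re

NAIVE_URL_RE = re.compile(r'https?://[^\s<>"]+')
STRICT_URL_RE = re.compile(r'https?://[^\s<>"」。，；：\!\?』（）、《》【】…—～]+')

# Hypothetical footnote: the URL is followed directly by a full-width parenthesis.
footnote = "OECD。《Report》。2024。https://example.org/report（2025年1月访问）"

naive = NAIVE_URL_RE.findall(footnote)
strict = STRICT_URL_RE.findall(footnote)

print(naive)   # the suffix is glued onto the URL
print(strict)  # only the URL itself
```

Note that the exclusion set as written also contains the ASCII `!` and `?`, so a URL with a query string would be cut at the `?`; the pattern is best suited to footnote URLs without query parameters.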

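Phase 6: Screenshot verification (mandatory)

This phase cannot be skipped. Run screenshot verification on every footnote source.

You must use scripts/purple_highlight_screenshot.py; writing your own plain screenshot script is forbidden.

python3 scripts/purple_highlight_screenshot.py <URL> <output.png> <keyword1> [keyword2] ...

Keywords are looked up by footnote ID in verification_keywords.json. If a footnote has no keywords, fall back to extracting them from the surrounding context in chapter.md.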
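
A minimal sketch of the keyword lookup and command assembly, assuming the verification_keywords.json layout from Phase 3. `screenshot_command` is a hypothetical helper, and the fallback keywords are supplied by the caller rather than extracted from chapter.md here.

```python
import json


def screenshot_command(
    keywords_json: str,
    footnote_id: str,
    url: str,
    output_png: str,
    fallback_keywords: list[str],
) -> list[str]:
    """Build the argv for purple_highlight_screenshot.py for one footnote."""
    data = json.loads(keywords_json)
    by_id = {str(f["id"]): f.get("keywords", []) for f in data.get("footnotes", [])}
    # A missing footnote or an empty keyword list both trigger the fallback.
    keywords = by_id.get(str(footnote_id)) or fallback_keywords
    return ["python3", "scripts/purple_highlight_screenshot.py", url, output_png, *keywords]
```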

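Special source handling:

PDF sources: skip the screenshot and do keyword matching only (pitfall #9)
Image-skeleton sites: fetch with curl -s <URL>; HTML shorter than 2000 chars has essentially no body text (pitfall #10)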
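
These two special cases could be routed with a small classifier like the one below. It is a sketch: the mode names are made up, and detecting PDFs by the `.pdf` path suffix is an assumption (a Content-Type check would be more reliable).

```python
from urllib.parse import urlparse

SKELETON_HTML_LIMIT = 2000  # chars; below this the page has essentially no body text


def verification_mode(url: str, html: str) -> str:
    """Decide how a footnote source should be verified."""
    # Assumption: PDFs are recognized by their path suffix.
    if urlparse(url).path.lower().endswith(".pdf"):
        return "keyword_only"
    if len(html) < SKELETON_HTML_LIMIT:
        return "skeleton"
    return "screenshot"
```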

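Body wording review rules

The following wording types must trigger human confirmation:

Wording type	Examples	Confirmation required
Exclusivity claims	"listed as a priority", "core action", "first proposed"	The source text must match verbatim
Framework claims	"incorporated into the ... framework", "established a ... mechanism", "reached a ... consensus"	Check the original official document
Quantitative assertions	"accounts for X%", "ranked Nth"	Must be quoted directly from the source; no derived calculations
Temporal links	"since year X", "from month Z of year Y to"	Dates must agree with the HTTP Last-Modified header

Principle: if no verbatim matching sentence can be found in the source, soften the wording. Better conservative than overstated.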

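DOCX modification rules

Before modifying the body of an existing DOCX:

Dump all runs first: extract the text and footnote references of every <ns0:r>
Confirm the split boundaries: a footnote reference may sit alone in an otherwise empty run (pitfall #11)
Read the runs back in sequence after editing: make sure commas and periods between runs are intact
Run git diff before saving: compare against the original version to confirm the scope of the change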
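
The "dump all runs" step might be sketched like this with the standard library. `dump_runs` is a hypothetical helper that takes the contents of word/document.xml as a string; the sample XML is invented to show a footnote reference sitting in its own empty run.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def dump_runs(document_xml: str) -> list[dict]:
    """Return the text and footnote reference IDs of every <w:r>, in document order."""
    root = ET.fromstring(document_xml)
    runs = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        refs = [ref.get(f"{{{W}}}id") for ref in run.iter(f"{{{W}}}footnoteReference")]
        runs.append({"text": text, "footnote_refs": refs})
    return runs


sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    "<w:r><w:t>Alpha testing cut defects</w:t></w:r>"
    '<w:r><w:footnoteReference w:id="2"/></w:r>'
    "<w:r><w:t>.</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
for i, run in enumerate(dump_runs(sample)):
    print(i, repr(run["text"]), run["footnote_refs"])
```

Reading the dumped texts back in order makes it easy to confirm that the period after the footnote reference is still present after an edit.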

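Search notes

Chinese topics: search with English keywords so the tokenizer does not split whole words apart (pitfall #3)
Two search sources: unified_search provides fast structured results; Tavily Research adds deep research
Tavily Research failure (timeout/API error/network problem) → degrade silently, use the unified_search results only, and do not block agent execution
Source selection: 3-5 high-quality sources per topic, preferring official/academic ones
For unreachable sites: try the cached version on web.archive.org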

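Error handling matrix

Severity	Definition	Handling strategy	Examples
FATAL	The pipeline cannot continue	Abort execution and report to the user	Outline parsing failure, Pandoc not installed
ERROR	A phase fails but the pipeline can degrade	Fall back to `/unified_search` (not llm-task); skip without blocking later steps	ACP returns empty/fails, URL unreachable, Tavily Research timeout/failure
WARN	Non-blocking anomaly	Record in the report, do not abort	A footnote is missing footnoteRef, font anomaly
INFO	Informational notice	Log only	The default reference.docx was used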

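Roxy screenshot troubleshooting (2026-05-28)

When screenshot verification (Phase 6) hits URLs that plain Playwright cannot open (e.g. ACM DL, BMC, Splunk, ScyllaDB, CAP FAQ, AWS, HBase), use the strict Roxy chain:

Correct chain:

@roxybrowser/openapi MCP opens the browser → extract the CDP WS from the open_browsers response
  → @roxybrowser/playwright-mcp --cdp-endpoint <ws_url>
  → browser_connect_roxy({ cdpEndpoint: ws_url })
  → browser_navigate → browser_take_screenshot

Key constraints:

Do not call get_connection_info to obtain the WS URL: it returns every open browser, so a regex may pick the wrong one → playwright-mcp crashes on startup
Implement the MCP stdio interaction in Node.js: Python stdout is buffered, and bufsize=0 has no effect
Keep the openapi MCP process alive: closing it cleans up the browser and invalidates the CDP WS
Prefer reusing an existing browser: the window quota is limited, so close + delete after the screenshot
Screenshot output path: when the filename passed to browser_take_screenshot is a full absolute path, the file is saved there (~/.openclaw/media/outbound/xxx/screenshots_roxy/ must be created beforehand with mkdir -p)

Reference implementation: /tmp/roxy_full.cjs (Node.js, verified 2026-05-28, 6 of 7 URL screenshots succeeded)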

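Rule priority

All DOCX structure check rules are governed by the verify_docx.py implementation; this document is a supplementary description. If this document and the code disagree, the code wins.

Footnote format rules

Multi-paragraph footnotes: continuation paragraphs must be indented 4 spaces. Without the indent, Pandoc treats the continuation paragraphs as a new footnote definition and the footnote text is truncated
Code blocks inside footnotes: fenced code blocks (```) are forbidden; use 8-space indentation instead
Tables inside footnotes: need an extra 4 spaces of indentation (8 in total); they are highly error-prone (row parsing errors, misaligned columns), so avoid them where possible
Chapter heading numbers: must be written by hand; Pandoc does not number headings automatically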

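Known pitfalls and workarounds

Pitfall	Description	Workaround
#50597	ACP empty return (~30%)	Automatic retry 2 times → fall back to `/unified_search`
llm-task hallucination	Searching directly through llm-task fabricates URLs (404)	llm-task fallback is forbidden; always use `/unified_search`
Duplicate URL	The same URL appears once in `<w:t>` and once in `<w:hyperlink>`	Detected automatically by `verify_docx.py`
Missing footnoteRef	`<w:footnoteRef>` is lost after ACP edits	Checked by `verify_docx.py`
Font pollution	ACP introduces explicit font declarations	`verify_docx.py` checks BANNED_FONTS
URL not clickable	Pandoc outputs plain-text URLs	Post-process with `add_hyperlink_footnotes.py`
footnote_gap	When the same footnote is referenced repeatedly, Pandoc copies later occurrences to different consecutive IDs and skips intermediate IDs (e.g. 16→18, 28→30)	Non-blocking warning reported by `verify_docx.py`; footnotes still work. Cause: each repeated reference to the same `[^N]` in Markdown makes Pandoc consume a new ID
Hyperlink without style	The URL is clickable but has no blue underline (missing `rStyle="Hyperlink"`)	Added automatically by `add_hyperlink_footnotes.py`; `verify_docx.py` checks `hyperlink_no_style`
Wrong Roxy CDP WS	`get_connection_info` returns every browser, so a regex may match the wrong WS URL → playwright-mcp init timeout + EPIPE	Extract the CDP WebSocket URL from the `open_browsers` response by exact dirId
Roxy window quota exhausted	Creating a new browser reports 「窗口额度不足」 (window quota exhausted)	Prefer reusing an existing browser; close + delete after the screenshot to free quota
MCP stdio Python buffering	Python `subprocess.Popen` stdout is buffered under `text=True` → delayed MCP responses → the script hangs	Use Node.js `child_process.spawn` + `stdout.on('data')` to parse JSON-RPC line by line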

pandoc (MD → DOCX conversion)
ACP Claude CLI agent (parallel research + DOCX review)
curl (URL reachability check)
Playwright (screenshot verification)
python3 + lxml (DOCX XML checks)

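Checkpoint persistence

Large chapter pipelines run for a long time (10+ minutes). To avoid losing progress on restart, save the intermediate artifacts to output_dir after each phase:

Phase	Artifact	Save path
Phase 1	Topic split JSON	`output_dir/topics.json`
Phase 2	Research results JSON	`output_dir/research_results.json`
Phase 3	Body draft MD	`output_dir/chapter.md`
Phase 3	Verification keywords JSON	`output_dir/verification_keywords.json`

On resume, continue from the most recent checkpoint and skip completed phases. In a lobster pipeline the checkpoint directory can be set with {{ .args.checkpoint_dir }} (defaults to output_dir).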
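
Resuming from the latest checkpoint could be approximated by checking which artifacts already exist. This is a sketch under a simplifying assumption: a file's presence is taken to mean its phase finished, which ignores the fact that research_results.json is updated again in Phase 3. The phase labels and `next_phase` are hypothetical.

```python
from pathlib import Path

# Artifacts per phase, following the table above.
CHECKPOINTS = [
    ("phase_1", ["topics.json"]),
    ("phase_2", ["research_results.json"]),
    ("phase_3", ["chapter.md", "verification_keywords.json"]),
]


def next_phase(checkpoint_dir: str) -> str:
    """Return the first phase whose artifacts are not all present."""
    base = Path(checkpoint_dir)
    for phase, artifacts in CHECKPOINTS:
        if not all((base / name).is_file() for name in artifacts):
            return phase
    # Everything up to the draft exists: continue with the footnote check.
    return "phase_3_5"
```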

Output

chapter.md: body text + footnotes
chapter.docx: styled DOCX (with reference.docx styles)
chapter_verification.json: verification results (Phase 6)
verification_keywords.json: footnote → keyword mapping (produced alongside Phase 3)
screenshots/: source screenshots (Playwright purple highlight)

Usage Guidance

Review before installing in sensitive environments. Use it only in a workspace where generated drafts, research excerpts, screenshots, and verification files may be retained, and avoid confidential or internal URLs unless you are comfortable with browser loading and openclaw infer receiving page and chapter text. Prefer running with network controls or an allowlist for footnote domains.

Capability Tags

financial-authority, can-sign-transactions, requires-sensitive-credentials

Capability Assessment

⚠ Purpose & Capability

The DOCX research, source verification, and screenshot workflow is coherent, but the scripts add an under-disclosed LLM data flow for keyword matching and enrichment beyond ordinary screenshot capture.

⚠ Instruction Scope

The skill instructs agents to spawn ACP research agents, perform web searches, run Tavily research, curl footnote URLs, and load arbitrary footnote URLs in Playwright; these are purpose-aligned but broad and not clearly bounded by domain allowlists or sensitive-research controls.

✓ Install Mechanism

Artifacts are plain skill files and Python/YAML scripts with no post-install hook, hidden installer, dependency confusion signal, or automatic persistence mechanism beyond normal execution outputs.

⚠ Credentials

Network access and browser automation are expected for source verification, but mandatory loading of generated footnote URLs and LLM processing of page text/research excerpts is broader than users would reasonably infer from the main workflow description.

ℹ Persistence & Privilege

The skill writes chapter drafts, DOCX output, research results, verification JSON, and screenshots to an output directory, defaulting to ~/.openclaw/media/outbound; this is mostly disclosed and proportionate, but users should treat the retained research artifacts as potentially sensitive.

Version History

v0.1.1

Remove hardcoded user paths

v0.1.0

Initial release

Metadata

Slug docx-chapter

Version 0.1.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Docx Chapter?

Research-backed chapter pipeline: outline → parallel ACP research → MD synthesis → DOCX generation → source verification → screenshot verification (mandatory). It is an AI Agent Skill for Claude Code / OpenClaw, with 42 downloads so far.

How do I install Docx Chapter?

Run "/install docx-chapter" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Docx Chapter free?

Yes, Docx Chapter is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Docx Chapter support?

Docx Chapter is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Docx Chapter?

It is built and maintained by ccmxigua (@ccmxigua); the current version is v0.1.1.
