uwvwko-zzz

pdf-ppt-docx-xlsx-tools

by uwvwko · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
111 downloads · 0 stars · 0 active installs · 1 version
Install in OpenClaw
/install uwvwko-pdf-ppt-docx-xlsx-tools
Description
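A document format conversion toolkit that supports conversion among PDF, PPTX, DOCX, and XLSX, plus derived operations (export to images, merge, split, extract text and tables, add watermarks, and more). Use this skill when the user needs to convert document formats, process PDFs, or work with Office files.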
README (SKILL.md)

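# PDF / PPTX / DOCX / XLSX Document Conversion Toolkit

Tools for converting, extracting from, merging, and splitting the four major office document formats. Everything is built on the Python ecosystem, and every operation can be run through execute_command as a Python one-liner or a short script.

## Install dependencies

```bash
pip install PyMuPDF pdf2docx python-docx python-pptx openpyxl pandas Pillow pdfplumber reportlab
```

Some operations (DOCX→PDF, PPTX→PDF, XLSX→PDF) require **LibreOffice** to be installed on the system: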

```bash
# Windows (winget)
winget install TheDocumentFoundation.LibreOffice

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
sudo apt install libreoffice
```

For PDF-to-image conversion, poppler can optionally be installed as a fallback renderer; the rendering built into `PyMuPDF` is normally sufficient.
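
Before running any conversion it can help to confirm that the dependencies are actually importable. The following is a minimal sketch using only the standard library; the helper names `missing_packages` and `has_libreoffice` are hypothetical, and the package-to-module mapping reflects the usual import names (for example, PyMuPDF is imported as `fitz`).

```python
import importlib.util
import shutil

# pip package name -> import module name (they differ in several cases).
REQUIRED_MODULES = {
    "PyMuPDF": "fitz",
    "pdf2docx": "pdf2docx",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
    "Pillow": "PIL",
    "pdfplumber": "pdfplumber",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in modules.items()
            if importlib.util.find_spec(module) is None]


def has_libreoffice():
    """Report whether a LibreOffice launcher is on PATH."""
    return any(shutil.which(name) for name in ("soffice", "libreoffice"))


if __name__ == "__main__":
    print("Missing Python packages:", missing_packages() or "none")
    print("LibreOffice on PATH:", has_libreoffice())
```

A launcher that is installed but not on PATH (common on Windows) will be reported as absent by this check.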

---

## Quick reference: all supported conversions

| Source | Target | Recommended library | Notes |
|--------|--------|---------------------|-------|
| PDF | Images (PNG/JPG) | PyMuPDF | Page-by-page rendering with DPI control |
| PDF | DOCX | pdf2docx | Preserves layout, tables, and images |
| PDF | PPTX | PyMuPDF + python-pptx | One slide per page |
| PDF | XLSX | pdfplumber + openpyxl | Extracts table data |
| PDF | Text (TXT) | PyMuPDF | Extracts plain text |
| DOCX | PDF | LibreOffice (CLI) | Best fidelity |
| DOCX | PPTX | python-docx + python-pptx | Paragraphs → slides |
| DOCX | HTML | python-docx / mammoth | Preserves basic formatting |
| DOCX | Plain text | python-docx | Extracts all paragraph text |
| PPTX | PDF | LibreOffice (CLI) | Best fidelity |
| PPTX | Images (PNG) | LibreOffice (CLI) + PyMuPDF | Each slide exported as an image |
| PPTX | DOCX | python-pptx + python-docx | Extracts all text |
| PPTX | Plain text | python-pptx | Extracts slide text |
| XLSX | PDF | LibreOffice (CLI) | Best fidelity |
| XLSX | CSV | pandas | Sheet can be specified |
| XLSX | DOCX | openpyxl + python-docx | Tables written into Word |
| XLSX | JSON | pandas | Structured data export |
| Images | PDF | Pillow + reportlab | Multiple images combined into one PDF |

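The table can also be treated as a lookup from file extensions to the recommended tool. The sketch below is one way to encode it; `CONVERSIONS`, `recommend`, and `needs_libreoffice` are hypothetical names, and only a subset of the rows is included for brevity.

```python
from pathlib import Path

# (source extension, target extension) -> recommended tool, taken from the table.
CONVERSIONS = {
    ("pdf", "png"): "PyMuPDF",
    ("pdf", "docx"): "pdf2docx",
    ("pdf", "pptx"): "PyMuPDF + python-pptx",
    ("pdf", "xlsx"): "pdfplumber + openpyxl",
    ("pdf", "txt"): "PyMuPDF",
    ("docx", "pdf"): "LibreOffice (CLI)",
    ("docx", "pptx"): "python-docx + python-pptx",
    ("pptx", "pdf"): "LibreOffice (CLI)",
    ("pptx", "docx"): "python-pptx + python-docx",
    ("xlsx", "pdf"): "LibreOffice (CLI)",
    ("xlsx", "csv"): "pandas",
    ("xlsx", "json"): "pandas",
}


def _ext(path):
    """Lower-cased file extension without the leading dot."""
    return Path(path).suffix.lower().lstrip(".")


def recommend(src, dst):
    """Return the recommended tool for src -> dst, or None if unsupported."""
    return CONVERSIONS.get((_ext(src), _ext(dst)))


def needs_libreoffice(src, dst):
    """True when the recommended route shells out to LibreOffice."""
    tool = recommend(src, dst)
    return tool is not None and "LibreOffice" in tool
```

For example, `recommend("report.pdf", "report.docx")` looks up the `("pdf", "docx")` row, while an unsupported pair such as DOCX to XLSX yields `None`.
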
---

## Conversion commands in detail

### 1. PDF → images

```bash
python -c "
import fitz, sys, os
pdf_path = sys.argv[1]
out_dir = sys.argv[2] if len(sys.argv)>2 else '.'
dpi = int(sys.argv[3]) if len(sys.argv)>3 else 200
fmt = sys.argv[4] if len(sys.argv)>4 else 'png'
os.makedirs(out_dir, exist_ok=True)
doc = fitz.open(pdf_path)
page_count = len(doc)  # read before close(); a closed document cannot be queried
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=dpi)
    out = os.path.join(out_dir, f'page_{i+1:04d}.{fmt}')
    pix.save(out)
    print(f'Saved: {out}')
doc.close()
print(f'Done: {page_count} pages -> {out_dir}')
" "input.pdf" "./output_images" 200 png
```

**Arguments:**
- `arg1`: path to the PDF file
- `arg2`: output directory (defaults to the current directory)
- `arg3`: DPI resolution (default 200; 150 to 300 recommended)
- `arg4`: image format, `png` (default) or `jpg`
### 2. PDF → DOCX

```bash
python -c "
from pdf2docx import Converter
import sys
cv = Converter(sys.argv[1])
cv.convert(sys.argv[2] if len(sys.argv)>2 else 'output.docx')
cv.close()
print('Done')
" "input.pdf" "output.docx"
```

**Optional parameters (set by editing the `cv.convert(...)` call in the script):**
- `start=0, end=None`: page range to convert
- `multi_processing=True`: use multiple processes to speed up large files
### 3. PDF → PPTX (one slide per page)

```bash
python -c "
import fitz, sys, io
from pptx import Presentation
from pptx.util import Inches

pdf_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.pptx'
doc = fitz.open(pdf_path)
prs = Presentation()
prs.slide_width = Inches(13.333)
prs.slide_height = Inches(7.5)

blank_layout = prs.slide_layouts[6]  # blank
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=200)
    img_data = pix.tobytes('png')
    slide = prs.slides.add_slide(blank_layout)
    # The page image is stretched to fill the whole 16:9 slide
    slide.shapes.add_picture(io.BytesIO(img_data), Inches(0), Inches(0),
                             width=prs.slide_width, height=prs.slide_height)
    print(f'Page {i+1}/{len(doc)} added')

prs.save(out_path)
doc.close()
print(f'Done: {out_path}')
" "input.pdf" "output.pptx"
```

### 4. PDF → XLSX (table extraction)

```bash
python -c "
import pdfplumber, openpyxl, sys

pdf_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.xlsx'
wb = openpyxl.Workbook()
first_ws = wb.active
first_ws.title = 'All Tables'
table_count = 0

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages):
        for t_idx, table in enumerate(page.extract_tables()):
            # The first table reuses the default sheet; each later table gets its own sheet
            if table_count == 0:
                ws = first_ws
            else:
                ws = wb.create_sheet(title=f'p{page_num+1}_t{t_idx+1}')
            for row in table:
                ws.append(row)
            table_count += 1
            print(f'Page {page_num+1}, Table {t_idx+1}: {len(table)} rows')

if table_count == 0:
    # Keep the empty default sheet: openpyxl cannot save a workbook without sheets
    print('No tables found; writing an empty workbook')

wb.save(out_path)
print(f'Done: {table_count} tables -> {out_path}')
" "input.pdf" "output.xlsx"
```

### 5. PDF → plain text

```bash
python -c "
import fitz, sys
doc = fitz.open(sys.argv[1])
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.txt'
with open(out_path, 'w', encoding='utf-8') as f:
    for i, page in enumerate(doc):
        text = page.get_text()
        f.write(f'--- Page {i+1} ---\n{text}\n\n')
print(f'Done: {len(doc)} pages extracted')
doc.close()
" "input.pdf" "output.txt"
```

### 6. DOCX → PDF

```bash
# Windows
python -c "
import subprocess, sys, os
docx_path = os.path.abspath(sys.argv[1])
out_dir = os.path.dirname(docx_path)
subprocess.run([
    r'C:\Program Files\LibreOffice\program\soffice.exe',
    '--headless', '--convert-to', 'pdf',
    '--outdir', out_dir, docx_path
], check=True)
print('Done')
" "input.docx"

# macOS / Linux
soffice --headless --convert-to pdf --outdir ./ "input.docx"
```

### 7. DOCX → PPTX

```bash
python -c "
from docx import Document
from pptx import Presentation
from pptx.util import Pt, Inches
import sys

docx_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.pptx'
doc = Document(docx_path)
prs = Presentation()
prs.slide_width = Inches(13.333)
prs.slide_height = Inches(7.5)
blank_layout = prs.slide_layouts[6]
max_bullets = 8

def new_slide():
    # Add a blank slide with one full-size text box and return its text frame
    s = prs.slides.add_slide(blank_layout)
    box = s.shapes.add_textbox(Inches(0.5), Inches(0.5),
                               Inches(12.333), Inches(6.5))
    box.text_frame.word_wrap = True
    return box.text_frame

tf = None
bullet_count = 0

for para in doc.paragraphs:
    text = para.text.strip()
    if not text:
        continue
    style = para.style.name.lower()
    is_heading = 'heading' in style or 'title' in style

    # Start a new slide at every heading, at the first paragraph, or when the slide is full
    if is_heading or tf is None or bullet_count >= max_bullets:
        tf = new_slide()
        bullet_count = 0
        p = tf.paragraphs[0]  # reuse the empty first paragraph of the new text box
    else:
        p = tf.add_paragraph()

    p.text = text
    if is_heading:
        p.font.size = Pt(32)
        p.font.bold = True
    else:
        p.font.size = Pt(20)
        bullet_count += 1

prs.save(out_path)
print(f'Done: {len(prs.slides)} slides -> {out_path}')
" "input.docx" "output.pptx"
```

### 8. DOCX → HTML

```bash
python -c "
from docx import Document
import sys, html

doc = Document(sys.argv[1])
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.html'

def paragraph_to_html(para):
    # Escape the document text so characters like < and & cannot inject markup
    text = html.escape(para.text)
    style = para.style.name.lower()
    if 'heading 1' in style or 'title' in style:
        return f'<h1>{text}</h1>'
    elif 'heading 2' in style:
        return f'<h2>{text}</h2>'
    elif 'heading 3' in style:
        return f'<h3>{text}</h3>'
    elif 'list' in style:
        return f'<li>{text}</li>'
    else:
        return f'<p>{text}</p>'

html_parts = ['<!DOCTYPE html><html><head><meta charset=\"utf-8\"><style>body{font-family:sans-serif;max-width:800px;margin:2em auto;padding:0 1em;}h1{color:#333;}h2{color:#555;border-bottom:1px solid #eee;}</style></head><body>']
for para in doc.paragraphs:
    if para.text.strip():
        html_parts.append(paragraph_to_html(para))
html_parts.append('</body></html>')

with open(out_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(html_parts))
print(f'Done: {out_path}')
" "input.docx" "output.html"
```
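
The command above emits each list paragraph as a bare `<li>`, without an enclosing `<ul>`. A possible refinement, sketched here with the standard library only, groups consecutive list items into one list and escapes the text on the way. The function name `blocks_to_html` and the `(tag, text)` input shape are assumptions made for illustration; they are not part of the skill.

```python
import html


def blocks_to_html(blocks):
    """Render (tag, text) pairs, wrapping runs of 'li' items in <ul>...</ul>.

    `tag` is expected to be one of 'h1', 'h2', 'h3', 'li', or 'p'.
    """
    out = []
    in_list = False
    for tag, text in blocks:
        if tag == "li" and not in_list:
            out.append("<ul>")       # a run of list items starts here
            in_list = True
        elif tag != "li" and in_list:
            out.append("</ul>")      # the run ended on the previous item
            in_list = False
        out.append(f"<{tag}>{html.escape(text)}</{tag}>")
    if in_list:
        out.append("</ul>")          # close a list that runs to the end
    return "\n".join(out)
```

In the DOCX script, the same classification that picks the tag from the paragraph style could feed this function instead of appending strings directly.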

### 9. PPTX → PDF

```bash
# Windows
python -c "
import subprocess, sys, os
pptx_path = os.path.abspath(sys.argv[1])
out_dir = os.path.dirname(pptx_path)
subprocess.run([
    r'C:\Program Files\LibreOffice\program\soffice.exe',
    '--headless', '--convert-to', 'pdf',
    '--outdir', out_dir, pptx_path
], check=True)
print('Done')
" "input.pptx"

# macOS / Linux
soffice --headless --convert-to pdf --outdir ./ "input.pptx"
```

### 10. PPTX → images

```bash
python -c "
import fitz, sys, os, subprocess, tempfile
pptx_path = os.path.abspath(sys.argv[1])
out_dir = sys.argv[2] if len(sys.argv)>2 else '.'
os.makedirs(out_dir, exist_ok=True)
soffice = r'C:\Program Files\LibreOffice\program\soffice.exe'

# Convert to PDF with LibreOffice first, then render the pages with PyMuPDF
with tempfile.TemporaryDirectory() as tmp_dir:
    subprocess.run([
        soffice, '--headless', '--convert-to', 'pdf',
        '--outdir', tmp_dir, pptx_path
    ], check=True, capture_output=True)
    # LibreOffice names the output file after the input file
    pdf_tmp = os.path.join(tmp_dir,
                           os.path.splitext(os.path.basename(pptx_path))[0] + '.pdf')
    doc = fitz.open(pdf_tmp)
    slide_count = len(doc)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=200)
        out = os.path.join(out_dir, f'slide_{i+1:04d}.png')
        pix.save(out)
        print(f'Saved: {out}')
    doc.close()  # close before the temporary directory is removed
print(f'Done: {slide_count} slides')
" "input.pptx" "./slides_output"
```

> **Note**: on macOS/Linux, replace the `soffice.exe` path with `soffice`.

### 11. PPTX → DOCX (text extraction)

```bash
python -c "
from pptx import Presentation
from docx import Document
from docx.shared import Pt
import sys

pptx_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.docx'
prs = Presentation(pptx_path)
doc = Document()
# Set the body font size once on the Normal style instead of once per paragraph
doc.styles['Normal'].font.size = Pt(11)

for i, slide in enumerate(prs.slides):
    doc.add_heading(f'Slide {i+1}', level=1)
    for shape in slide.shapes:
        if shape.has_text_frame:
            for para in shape.text_frame.paragraphs:
                text = para.text.strip()
                if text:
                    doc.add_paragraph(text)
        if shape.has_table:
            rows = [[cell.text for cell in row.cells] for row in shape.table.rows]
            if rows:
                t = doc.add_table(rows=len(rows), cols=len(rows[0]))
                for r_idx, row_data in enumerate(rows):
                    for c_idx, cell_text in enumerate(row_data):
                        t.cell(r_idx, c_idx).text = cell_text
    doc.add_page_break()

doc.save(out_path)
print(f'Done: {len(prs.slides)} slides -> {out_path}')
" "input.pptx" "output.docx"
```

### 12. XLSX → PDF

```bash
# Windows
python -c "
import subprocess, sys, os
xlsx_path = os.path.abspath(sys.argv[1])
out_dir = os.path.dirname(xlsx_path)
subprocess.run([
    r'C:\Program Files\LibreOffice\program\soffice.exe',
    '--headless', '--convert-to', 'pdf',
    '--outdir', out_dir, xlsx_path
], check=True)
print('Done')
" "input.xlsx"

# macOS / Linux
soffice --headless --convert-to pdf --outdir ./ "input.xlsx"
```

### 13. XLSX → CSV

```bash
python -c "
import pandas as pd, sys, os
xlsx_path = sys.argv[1]
out_dir = sys.argv[2] if len(sys.argv)>2 else '.'
os.makedirs(out_dir, exist_ok=True)
xls = pd.ExcelFile(xlsx_path)
for sheet in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet)
    out = os.path.join(out_dir, f'{sheet}.csv')
    df.to_csv(out, index=False, encoding='utf-8-sig')
    print(f'Saved: {out} ({len(df)} rows)')
print('Done')
" "input.xlsx" "./csv_output"
```
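
The command above uses each sheet name directly as a file name. Excel allows some characters in sheet names (for example `<`, `>`, `|`, and `"`) that Windows rejects in file names, so a sanitizing step may be worth adding. This is a minimal sketch; `safe_filename` is a hypothetical helper, not part of the skill.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(sheet_name, replacement="_"):
    """Turn a sheet name into a string that is safe to use as a file name."""
    cleaned = _FORBIDDEN.sub(replacement, sheet_name).strip(" .")
    return cleaned or "sheet"  # fall back when nothing usable is left
```

In the loop it would be used as `os.path.join(out_dir, f'{safe_filename(sheet)}.csv')`. Two different sheet names can map to the same file name after cleaning, so a real script may also need to de-duplicate.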

### 14. XLSX → JSON

```bash
python -c "
import pandas as pd, json, sys
xlsx_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.json'
xls = pd.ExcelFile(xlsx_path)
result = {}
for sheet in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet)
    # Empty cells load as NaN, which is not valid JSON; convert them to null
    df = df.astype(object).where(df.notna(), None)
    result[sheet] = df.to_dict(orient='records')
    print(f'Sheet \"{sheet}\": {len(df)} rows')
with open(out_path, 'w', encoding='utf-8') as f:
    json.dump(result, f, ensure_ascii=False, indent=2, default=str)
print(f'Done: {out_path}')
" "input.xlsx" "output.json"
```
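
One detail worth knowing when exporting spreadsheet records: Python's `json` module writes a float NaN as the bare token `NaN`, which strict JSON parsers reject. The sketch below shows the effect and a standard-library cleanup; `clean_records` is a hypothetical helper name.

```python
import json
import math


def clean_records(records):
    """Replace float NaN values with None so they serialize as JSON null."""
    return [
        {key: (None if isinstance(value, float) and math.isnan(value) else value)
         for key, value in row.items()}
        for row in records
    ]


rows = [{"name": "widget", "price": float("nan")}]
print(json.dumps(rows))                                   # contains the NaN token
print(json.dumps(clean_records(rows), allow_nan=False))   # strict, uses null
```

Passing `allow_nan=False` makes `json.dumps` raise instead of silently writing `NaN`, which is a cheap way to catch any value the cleanup missed.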

### 15. XLSX → DOCX (tables written into Word)

```bash
python -c "
import openpyxl, sys
from docx import Document
from docx.shared import Pt

xlsx_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.docx'
wb = openpyxl.load_workbook(xlsx_path)
doc = Document()

for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    doc.add_heading(sheet_name, level=1)
    rows = list(ws.iter_rows(values_only=True))
    if not rows:
        continue
    t = doc.add_table(rows=len(rows), cols=len(rows[0]))
    for r_idx, row in enumerate(rows):
        for c_idx, val in enumerate(row):
            cell = t.cell(r_idx, c_idx)
            cell.text = str(val) if val is not None else ''
            # Render the first row as a bold header
            if r_idx == 0:
                for run in cell.paragraphs[0].runs:
                    run.font.bold = True
                    run.font.size = Pt(10)
    doc.add_page_break()

doc.save(out_path)
print(f'Done: {len(wb.sheetnames)} sheets -> {out_path}')
" "input.xlsx" "output.docx"
```

### 16. Images → PDF

```bash
python -c "
from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas
import sys, os

img_dir = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'output.pdf'
files = sorted([f for f in os.listdir(img_dir)
                if f.lower().endswith(('.png','.jpg','.jpeg','.webp','.bmp'))])

c = canvas.Canvas(out_path, pagesize=A4)
w, h = A4

for f in files:
    path = os.path.join(img_dir, f)
    with Image.open(path) as img:
        iw, ih = img.size
    # Shrink to fit the A4 page (never enlarge) and center the image
    scale = min(w/iw, h/ih, 1)
    dw, dh = iw*scale, ih*scale
    x, y = (w-dw)/2, (h-dh)/2
    c.drawImage(path, x, y, dw, dh)
    c.showPage()
    print(f'Added: {f}')

c.save()
print(f'Done: {len(files)} images -> {out_path}')
" "./images" "output.pdf"
```

---

## Advanced PDF operations

### Merge PDFs

```bash
python -c "
import fitz, sys, glob, os
out_path = sys.argv[2] if len(sys.argv)>2 else 'merged.pdf'
# Skip the output file in case the pattern also matches it on a re-run
files = sorted(f for f in glob.glob(sys.argv[1])
               if os.path.abspath(f) != os.path.abspath(out_path))
merged = fitz.open()
for f in files:
    doc = fitz.open(f)
    merged.insert_pdf(doc)
    doc.close()
    print(f'Merged: {f}')
merged.save(out_path)
merged.close()
print(f'Done: {len(files)} files -> {out_path}')
" "./*.pdf" "merged.pdf"
```
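
The merge order above comes from a plain `sorted()`, which is lexicographic: `part10.pdf` sorts before `part2.pdf`. When the input files are numbered without zero padding, a natural sort key may give the intended order. This is a small standard-library sketch; `natural_key` is a hypothetical helper.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as numbers and the rest case-insensitively."""
    # re.split with a capture group alternates text and digit tokens,
    # always starting with a (possibly empty) text token.
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


files = ["part10.pdf", "part2.pdf", "part1.pdf"]
print(sorted(files))                   # lexicographic order
print(sorted(files, key=natural_key))  # numeric-aware order
```

In the merge command this would replace `sorted(...)` with `sorted(..., key=natural_key)`.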

### Split a PDF

```bash
python -c "
import fitz, sys, os
pdf_path = sys.argv[1]
out_dir = sys.argv[2] if len(sys.argv)>2 else './split'
os.makedirs(out_dir, exist_ok=True)
doc = fitz.open(pdf_path)
page_count = len(doc)  # read before close(); a closed document cannot be queried
for i in range(page_count):
    new_doc = fitz.open()
    new_doc.insert_pdf(doc, from_page=i, to_page=i)
    out = os.path.join(out_dir, f'page_{i+1:04d}.pdf')
    new_doc.save(out)
    new_doc.close()
    print(f'Saved: {out}')
doc.close()
print(f'Done: {page_count} pages split')
" "input.pdf" "./split"
```

### Extract by page range

```bash
python -c "
import fitz, sys
pdf_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'extracted.pdf'
ranges = sys.argv[3]  # e.g. '1-3,5,7-10' (1-based, inclusive)

def parse_ranges(s):
    # Convert the 1-based range string into a list of 0-based page indexes
    result = []
    for part in s.split(','):
        part = part.strip()
        if '-' in part:
            a, b = part.split('-')
            result.extend(range(int(a)-1, int(b)))
        else:
            result.append(int(part)-1)
    return result

# Deduplicate and sort, then copy page by page so gaps in the ranges are respected
pages = sorted(set(parse_ranges(ranges)))
doc = fitz.open(pdf_path)
new_doc = fitz.open()
for p in pages:
    new_doc.insert_pdf(doc, from_page=p, to_page=p)
new_doc.save(out_path)
new_doc.close()
doc.close()
print(f'Done: pages {ranges} -> {out_path}')
" "input.pdf" "extracted.pdf" "1-3,5,7-10"
```
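
The range parser in the command above trusts its input. A stricter variant, sketched below with no third-party dependencies, validates each part against the document's page count and returns sorted, de-duplicated 0-based indexes. The name `parse_page_ranges` and the `page_count` parameter are illustrative assumptions.

```python
def parse_page_ranges(spec, page_count):
    """Parse a 1-based spec such as '1-3,5,7-10' into sorted 0-based indexes.

    Raises ValueError for malformed parts, reversed ranges, or pages
    outside 1..page_count.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(
                f"invalid page range {part!r} for a {page_count}-page document")
        pages.update(range(start - 1, end))
    return sorted(pages)
```

With PyMuPDF the page count would come from `len(doc)` right after opening the file, so bad input fails before any output is written.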

### Add a watermark to a PDF

```bash
python -c "
import fitz, sys

pdf_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'watermarked.pdf'
text = sys.argv[3] if len(sys.argv)>3 else 'CONFIDENTIAL'
opacity = float(sys.argv[4]) if len(sys.argv)>4 else 0.3
fontsize = 50
# The built-in helv font only covers Latin text; china-s is the built-in Simplified Chinese font
fontname = 'helv' if text.isascii() else 'china-s'
text_width = fitz.get_text_length(text, fontname=fontname, fontsize=fontsize)

doc = fitz.open(pdf_path)
for page in doc:
    # Page center, used both to position the text and as the rotation pivot
    center = fitz.Point(page.rect.width / 2, page.rect.height / 2)
    start = fitz.Point(center.x - text_width / 2, center.y + fontsize / 3)
    # The rotate argument only accepts multiples of 90, so rotate by -45 degrees with a morph matrix
    page.insert_text(
        start, text, fontsize=fontsize, fontname=fontname,
        color=(0.5, 0.5, 0.5), fill_opacity=opacity,
        morph=(center, fitz.Matrix(-45))
    )
    print(f'Watermarked page {page.number + 1}')

doc.save(out_path)
doc.close()
print(f'Done: {out_path}')
" "input.pdf" "watermarked.pdf" "机密文件" 0.2
```

### Extract images from a PDF

```bash
python -c "
import fitz, sys, os
pdf_path = sys.argv[1]
out_dir = sys.argv[2] if len(sys.argv)>2 else './extracted_images'
os.makedirs(out_dir, exist_ok=True)
doc = fitz.open(pdf_path)
count = 0
for page_num, page in enumerate(doc):
    images = page.get_images(full=True)
    for img_idx, img in enumerate(images):
        xref = img[0]
        base = doc.extract_image(xref)
        ext = base['ext']
        out = os.path.join(out_dir, f'p{page_num+1}_img{img_idx+1}.{ext}')
        with open(out, 'wb') as f:
            f.write(base['image'])
        count += 1
        print(f'Extracted: {out}')
doc.close()
print(f'Done: {count} images extracted')
" "input.pdf" "./extracted_images"
```

---

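## Advanced DOCX operations

### Merge DOCX files

```bash
python -c "
from docx import Document
from docx.oxml.ns import qn
import sys, glob, os, copy

out_path = sys.argv[2] if len(sys.argv)>2 else 'merged.docx'
# Skip the output file in case the pattern also matches it on a re-run
files = sorted(f for f in glob.glob(sys.argv[1])
               if os.path.abspath(f) != os.path.abspath(out_path))
if not files:
    sys.exit('No input files matched')

# The first file is the base document; the others are appended to it
merged = Document(files[0])
body = merged.element.body
sect_pr = body.find(qn('w:sectPr'))
print(f'Base: {files[0]}')

for f in files[1:]:
    for element in Document(f).element.body:
        if element.tag == qn('w:sectPr'):
            continue  # keep only the section settings of the base document
        # Only the body XML is copied; images and styles defined in the source file are not carried over
        new = copy.deepcopy(element)
        if sect_pr is not None:
            sect_pr.addprevious(new)
        else:
            body.append(new)
    print(f'Merged: {f}')

merged.save(out_path)
print(f'Done: {len(files)} files -> {out_path}')
" "./*.docx" "merged.docx"
```
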
### Extract images from a DOCX

```bash
python -c "
import sys, os, zipfile

docx_path = sys.argv[1]
out_dir = sys.argv[2] if len(sys.argv)>2 else './docx_images'
os.makedirs(out_dir, exist_ok=True)

# A .docx file is a ZIP archive; embedded images live under word/media/
with zipfile.ZipFile(docx_path, 'r') as z:
    for name in z.namelist():
        if name.startswith('word/media/'):
            out = os.path.join(out_dir, os.path.basename(name))
            with open(out, 'wb') as f:
                f.write(z.read(name))
            print(f'Extracted: {out}')

print('Done')
" "input.docx" "./docx_images"
```

---

## Advanced XLSX operations

### Merge XLSX files (by sheet name)

```bash
python -c "
import pandas as pd, sys, glob, os

out_path = sys.argv[2] if len(sys.argv)>2 else 'merged.xlsx'
# Skip the output file in case the pattern also matches it on a re-run
files = sorted(f for f in glob.glob(sys.argv[1])
               if os.path.abspath(f) != os.path.abspath(out_path))

with pd.ExcelWriter(out_path, engine='openpyxl') as writer:
    for f in files:
        xls = pd.ExcelFile(f)
        for sheet in xls.sheet_names:
            df = pd.read_excel(xls, sheet_name=sheet)
            # On a name clash, append the file name; Excel limits sheet names to 31 characters
            sheet_name = sheet if sheet not in writer.sheets else f'{sheet}_{os.path.basename(f)}'[:31]
            df.to_excel(writer, sheet_name=sheet_name, index=False)
            print(f'Added sheet \"{sheet_name}\" from {f}')

print(f'Done: {out_path}')
" "./*.xlsx" "merged.xlsx"
```

### Filter and export XLSX

```bash
python -c "
import pandas as pd, sys

xlsx_path = sys.argv[1]
out_path = sys.argv[2] if len(sys.argv)>2 else 'filtered.xlsx'
sheet_arg = sys.argv[3] if len(sys.argv)>3 else '0'
# A numeric argument selects the sheet by position; anything else is treated as a sheet name
sheet = int(sheet_arg) if sheet_arg.isdigit() else sheet_arg
filter_expr = sys.argv[4]  # e.g. 'amount > 1000'

df = pd.read_excel(xlsx_path, sheet_name=sheet)
filtered = df.query(filter_expr)
filtered.to_excel(out_path, index=False)
print(f'Done: {len(df)} -> {len(filtered)} rows')
" "input.xlsx" "filtered.xlsx" 0 "amount > 1000"
```

---

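## Error handling guide

| Error | Cause | Fix |
|-------|-------|-----|
| `ModuleNotFoundError` | Missing Python dependency | Run `pip install <package>` |
| `FileNotFoundError: soffice` | LibreOffice is not installed | Install LibreOffice and confirm the path |
| `pdf2docx` output has a broken layout | PDF with complex typesetting | Use the PDF→images→PPTX route instead |
| Garbled Chinese text | Encoding problem | Use `utf-8-sig` for CSV; make sure Chinese fonts are installed |
| `poppler not found` | pdf2image requires poppler | Use PyMuPDF (`fitz`), which does not depend on poppler |
| Out of memory on large files | File too large | Process page by page instead of loading everything at once |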
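
Part of this table can be expressed as a lookup from exception type to hint, which a wrapper script could print when a conversion fails. The following is only a sketch: `HINTS` and `hint_for` are hypothetical names, and the hint texts paraphrase the table rows.

```python
# Exception type -> hint, checked in order. ModuleNotFoundError is listed
# first because it is the most common failure on a fresh environment.
HINTS = [
    (ModuleNotFoundError, "Missing Python dependency: run pip install <package>."),
    (FileNotFoundError, "File or program not found: check the input path and "
                        "that LibreOffice is installed at the expected path."),
    (MemoryError, "Out of memory: process the file page by page."),
]


def hint_for(exc):
    """Return a troubleshooting hint for an exception, or None if unknown."""
    for exc_type, hint in HINTS:
        if isinstance(exc, exc_type):
            return hint
    return None


try:
    import a_module_that_is_not_installed  # noqa: F401 (deliberately missing)
except ImportError as exc:
    print(hint_for(exc))
```

Note that the import name in a `ModuleNotFoundError` is not always the pip package name (for example, `fitz` is provided by PyMuPDF), so the hint keeps the package name generic.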

## LibreOffice path reference

| System | Default path |
|--------|--------------|
| Windows | `C:\Program Files\LibreOffice\program\soffice.exe` |
| macOS | `/Applications/LibreOffice.app/Contents/MacOS/soffice` |
| Linux | `/usr/bin/soffice` or `/usr/bin/libreoffice` |

> **Tip**: if LibreOffice is not at the default path, find the actual path with `where soffice` (Windows) or `which soffice` (Linux/macOS) and substitute it in the scripts.
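
The lookup described in the tip can also be done from Python, so one script works on all three systems. This is a minimal sketch under the assumption that the default paths in the table are correct for the installed version; `find_soffice` and `DEFAULT_PATHS` are hypothetical names, and the `which` and `exists` parameters exist only to make the function easy to test.

```python
import os
import shutil
import sys

# Default install locations, copied from the table above.
DEFAULT_PATHS = {
    "win32": [r"C:\Program Files\LibreOffice\program\soffice.exe"],
    "darwin": ["/Applications/LibreOffice.app/Contents/MacOS/soffice"],
    "linux": ["/usr/bin/soffice", "/usr/bin/libreoffice"],
}


def find_soffice(platform=sys.platform, which=shutil.which, exists=os.path.isfile):
    """Return a path to the LibreOffice launcher, or None if it is not found."""
    # Prefer whatever is on PATH, then fall back to the default locations.
    for name in ("soffice", "libreoffice"):
        found = which(name)
        if found:
            return found
    for candidate in DEFAULT_PATHS.get(platform, []):
        if exists(candidate):
            return candidate
    return None
```

The result can be passed as the first element of the `subprocess.run([...])` argument list in place of a hard-coded path.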

## Script mode (recommended)

In addition to the inline `python -c` commands above, every conversion is also available as a standalone script in the `scripts/` directory. Scripts are easier to read and debug, and suit cases with more complex parameters.

**Script root directory:** `workspace/skills/pdf-ppt-docx-xlsx/scripts/`

### Script list

| Script | Function | Usage example |
|--------|----------|---------------|
| `pdf_to_images.py` | PDF → images | `python scripts/pdf_to_images.py input.pdf ./out 200 png` |
| `pdf_to_docx.py` | PDF → DOCX | `python scripts/pdf_to_docx.py input.pdf output.docx --start 0 --end 5` |
| `pdf_to_pptx.py` | PDF → PPTX | `python scripts/pdf_to_pptx.py input.pdf output.pptx 200` |
| `pdf_to_xlsx.py` | PDF → XLSX (table extraction) | `python scripts/pdf_to_xlsx.py input.pdf output.xlsx` |
| `pdf_to_txt.py` | PDF → plain text | `python scripts/pdf_to_txt.py input.pdf output.txt` |
| `pdf_advanced.py` | PDF merge/split/watermark/extract/info | `python scripts/pdf_advanced.py merge a.pdf b.pdf out.pdf` |
| `docx_to_pdf.py` | DOCX → PDF | `python scripts/docx_to_pdf.py input.docx` |
| `docx_to_pptx.py` | DOCX → PPTX | `python scripts/docx_to_pptx.py input.docx output.pptx` |
| `docx_to_html.py` | DOCX → HTML | `python scripts/docx_to_html.py input.docx output.html` |
| `docx_advanced.py` | DOCX merge/extract text/info | `python scripts/docx_advanced.py merge a.docx b.docx out.docx` |
| `pptx_to_pdf.py` | PPTX → PDF | `python scripts/pptx_to_pdf.py input.pptx` |
| `pptx_to_images.py` | PPTX → images | `python scripts/pptx_to_images.py input.pptx ./out 200` |
| `pptx_to_docx.py` | PPTX → DOCX | `python scripts/pptx_to_docx.py input.pptx output.docx` |
| `xlsx_to_pdf.py` | XLSX → PDF | `python scripts/xlsx_to_pdf.py input.xlsx` |
| `xlsx_to_csv.py` | XLSX → CSV | `python scripts/xlsx_to_csv.py input.xlsx ./csv_out` |
| `xlsx_advanced.py` | XLSX merge/info/summary | `python scripts/xlsx_advanced.py info input.xlsx` |
| `images_to_pdf.py` | Images → PDF | `python scripts/images_to_pdf.py a.png b.jpg out.pdf` |

### `pdf_advanced.py` subcommands

```bash
# Merge
python scripts/pdf_advanced.py merge file1.pdf file2.pdf merged.pdf

# Split (all pages)
python scripts/pdf_advanced.py split input.pdf ./split_output

# Split (given range)
python scripts/pdf_advanced.py split input.pdf ./split_output --range 0-5,8-10

# Extract given pages
python scripts/pdf_advanced.py extract input.pdf output.pdf --pages 0,2,5-8

# Add a watermark (the color is quoted because an unquoted # starts a shell comment)
python scripts/pdf_advanced.py watermark input.pdf "CONFIDENTIAL" watermarked.pdf --opacity 0.2 --size 60 --color "#ff0000"

# Show document info
python scripts/pdf_advanced.py info input.pdf
```

### `docx_advanced.py` subcommands

```bash
# Merge
python scripts/docx_advanced.py merge a.docx b.docx merged.docx

# Extract text
python scripts/docx_advanced.py extract_text input.docx output.txt

# Show document info
python scripts/docx_advanced.py info input.docx
```

### `xlsx_advanced.py` subcommands

```bash
# Merge
python scripts/xlsx_advanced.py merge a.xlsx b.xlsx merged.xlsx

# Show workbook info
python scripts/xlsx_advanced.py info input.xlsx

# Generate a summary (preview of the first 5 rows)
python scripts/xlsx_advanced.py summary input.xlsx summary.txt
```

---

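## Usage principles

1. **Prefer pure-Python libraries** (PyMuPDF, pdf2docx, and so on); conversions that do not depend on LibreOffice are faster and easier to control
2. **Use LibreOffice when converting Office formats to PDF**; its fidelity is far better than pure-Python approaches
3. **Process large files page by page** to avoid running out of memory
4. **Use absolute paths or explicit relative paths for output** to avoid ambiguity
5. **Check that the input file exists before converting**, and verify afterwards that the output file is non-empty
6. **Use inline commands for simple cases and the `scripts/` scripts for complex ones**; the scripts support more parameters and are easier to debug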
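
Principles 4 and 5 can be wrapped around any conversion function. The sketch below assumes a converter is a callable taking `(src, dst)`; `run_checked` is a hypothetical helper name, and `shutil.copyfile` stands in for a real conversion in the usage note.

```python
import os


def run_checked(convert, src, dst):
    """Run convert(src, dst) with an input check before and an output check after.

    Returns the absolute output path on success.
    """
    if not os.path.isfile(src):
        raise FileNotFoundError(f"input file not found: {src}")
    src = os.path.abspath(src)
    dst = os.path.abspath(dst)  # resolve relative paths once, up front
    convert(src, dst)
    if not os.path.isfile(dst) or os.path.getsize(dst) == 0:
        raise RuntimeError(f"conversion produced no output: {dst}")
    return dst
```

For example, `run_checked(shutil.copyfile, "input.pdf", "copy.pdf")` returns the absolute path of `copy.pdf` when the copy is non-empty and raises otherwise.
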
Usage Guidance
This appears safe for normal local document conversion. Before installing, use a virtual environment if possible, install LibreOffice and Python packages from trusted sources, keep backups of important documents, and avoid opening generated HTML from untrusted files unless the converter is updated to escape HTML content.
Capability Analysis
Type: OpenClaw Skill Name: uwvwko-pdf-ppt-docx-xlsx-tools Version: 1.0.0 The skill bundle provides a comprehensive set of document processing tools for PDF, DOCX, PPTX, and XLSX formats using standard Python libraries like PyMuPDF, python-docx, and pdf2docx. The implementation includes scripts for conversion, merging, splitting, and watermarking, and utilizes LibreOffice for high-fidelity Office-to-PDF conversions via subprocess calls. No evidence of data exfiltration, malicious persistence, or prompt injection was found; the code logic is entirely consistent with the stated purpose of document manipulation.
Capability Assessment
Purpose & Capability
The scripts match the stated purpose: converting, merging, splitting, extracting, and summarizing PDF/PPTX/DOCX/XLSX/image files using local Python libraries and LibreOffice.
Instruction Scope
The skill intentionally relies on local command execution for conversions; this is disclosed and purpose-aligned, but users should use explicit input and output paths. The supplied SKILL.md excerpt is truncated, so this review relies mainly on the visible instructions and full script contents.
Install Mechanism
The skill asks for multiple third-party Python packages and optionally LibreOffice via system package managers; these installs are expected for document conversion but should come from trusted sources.
Credentials
The skill reads user-selected documents and writes converted outputs locally. No network, credential, or background behavior is evident; generated HTML should be handled carefully because document text is inserted without HTML escaping.
Persistence & Privilege
No persistent background agent, credential storage, account access, or self-starting behavior is shown. Optional system installation commands are user-directed setup steps.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install uwvwko-pdf-ppt-docx-xlsx-tools
  3. After installation, invoke the skill by name or use /uwvwko-pdf-ppt-docx-xlsx-tools
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial publish
Metadata
Slug uwvwko-pdf-ppt-docx-xlsx-tools
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is pdf-ppt-docx-xlsx-tools?

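A document format conversion toolkit that supports conversion among PDF, PPTX, DOCX, and XLSX, plus derived operations (export to images, merge, split, extract text and tables, add watermarks, and more). Use this skill when the user needs to convert document formats, process PDFs, or work with Office files. It is an AI Agent Skill for Claude Code / OpenClaw, with 111 downloads so far.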

How do I install pdf-ppt-docx-xlsx-tools?

Run "/install uwvwko-pdf-ppt-docx-xlsx-tools" in the OpenClaw or Claude Code chat to install the skill in one step. The Python packages and LibreOffice listed in the README still need to be installed separately.

Is pdf-ppt-docx-xlsx-tools free?

Yes, pdf-ppt-docx-xlsx-tools is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does pdf-ppt-docx-xlsx-tools support?

pdf-ppt-docx-xlsx-tools is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created pdf-ppt-docx-xlsx-tools?

It is built and maintained by uwvwko (@uwvwko-zzz); the current version is v1.0.0.
