
Document Ingestion

Author: samledger67-dotcom · GitHub · v1.0.2 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 250
Favorites: 0
Current installs: 1
Versions: 3
Install in OpenClaw
/install document-ingestion
Description
Process raw accounting source documents (PDFs, CSVs, bank statements, invoices, receipts) into standardized transaction records for QBO import. Use when batc...
Usage Instructions (SKILL.md)

Document Ingestion Engine — SKILL.md

When to Use This Skill

Use when a user needs to process raw accounting source documents into standardized transaction records for QBO import. Triggers on:

  • "Process these documents / invoices / receipts / bank statements"
  • "Ingest docs for [client]"
  • "I have PDFs/CSVs to categorize"
  • "Batch import these transactions to QBO"
  • "Extract data from 1099s / payroll reports"
  • Document drop + categorization requests during month-end close

When NOT to Use

  • Not for running bank reconciliation (use bank-reconciliation skill)
  • Not for P&L variance analysis (use pl-quick-compare skill)
  • Not for single manual journal entries (just post directly in QBO)
  • Not for AR collections or aging (use ar-collections-agent skill)

What It Does

Processes 6 document types → standardized records → Excel workbook + QBO import CSV.

| Input Type | Formats | Extracts |
|---|---|---|
| Bank Statements | CSV, OFX/QFX, PDF | Date, vendor, amount |
| Credit Card Stmts | CSV, PDF | Date, merchant, amount, category |
| Invoices | PDF | Vendor, total, date, due date, invoice #, line items |
| Receipts | PDF, JPG/PNG* | Merchant, date, amount |
| 1099 / Tax Forms | PDF | Payer, TIN, form type, box amounts |
| Payroll Reports | CSV, PDF | Employee, gross, taxes, net per employee |

*Image OCR requires tesseract installed.

Processing Steps

  1. File type detection — magic bytes + extension fallback
  2. Document classification — bank/CC/invoice/receipt/1099/payroll
  3. Content extraction — CSV parsing, OFX parsing, PDF text extraction
  4. Format normalization — dates (multi-format), amounts (Decimal), vendor names (strip noise)
  5. QBO COA pull — fetches live Chart of Accounts from QBO for categorization
  6. Duplicate detection — same amount + vendor within ±3 days → flagged
  7. Auto-categorization — vendor map → COA keywords → doc-class default
  8. Confidence scoring — HIGH (exact match) / MEDIUM (fuzzy) / LOW (needs review)
  9. Exception flagging — missing dates, zero amounts, unknown vendors, LOW confidence
  10. QBO import CSV — ready for batch import (excludes dups + failed extractions)
  11. Excel workbook — 6 tabs (see below)
  12. CDC tracking — delta since last run cached in .cache/document-ingestion/{slug}.json
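
Step 1 above (magic bytes with an extension fallback) can be illustrated with a minimal sketch. The function name, the signature table, and the extension map are assumptions made for illustration, not the script's actual code; the byte signatures themselves are the standard ones for PDF, PNG, and JPEG.

```python
from pathlib import Path

# Standard leading byte signatures for the binary formats the skill accepts.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
]

# Fallback for formats without a reliable signature (hypothetical mapping).
EXTENSION_MAP = {
    ".csv": "csv", ".ofx": "ofx", ".qfx": "ofx",
    ".pdf": "pdf", ".png": "png", ".jpg": "jpg", ".jpeg": "jpg",
}


def detect_file_type(header: bytes, filename: str) -> str:
    """Return a file type from the leading bytes, else from the extension."""
    for signature, file_type in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return file_type
    return EXTENSION_MAP.get(Path(filename).suffix.lower(), "unknown")
```

In practice the header would be the first few bytes read from the file in binary mode. Checking the bytes first means a PDF saved with the wrong extension is still routed to PDF extraction.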

Excel Output Tabs

| Tab | Contents |
|---|---|
| Processed Transactions | All records with category, confidence, dup flag, exception |
| ⚠ Exceptions | Records needing manual review before import |
| Duplicates | Flagged potential duplicates with "Dup Of" reference |
| Category Mapping | Unique vendor → QBO account map with confidence |
| Import Ready | QBO-format rows (Date, Description, Amount, Account, Memo) |
| CDC Log | Delta metrics vs. prior run + this-run stats summary |

Script Location

scripts/pipelines/document-ingestion.py

Usage

# Process a directory of mixed documents
python3 scripts/pipelines/document-ingestion.py \
    --slug sb-paulson \
    --input-dir ~/Downloads/month-end-docs

# Single file
python3 scripts/pipelines/document-ingestion.py \
    --slug sb-paulson \
    --file ~/Downloads/invoice_march.pdf

# Multiple files + custom output dir
python3 scripts/pipelines/document-ingestion.py \
    --slug glowlabs \
    --file ~/Downloads/stmt.csv \
    --file ~/Downloads/payroll.csv \
    --out ~/Desktop/ingested

# Offline mode (no QBO auth needed)
python3 scripts/pipelines/document-ingestion.py \
    --slug sb-paulson \
    --input-dir ./docs \
    --no-qbo-coa

# QBO sandbox
python3 scripts/pipelines/document-ingestion.py \
    --slug sb-paulson \
    --input-dir ./docs \
    --sandbox

All CLI Flags

| Flag | Default | Description |
|---|---|---|
| --slug | required | Company slug (QBO + client vendor map) |
| --input-dir | | Directory of docs to process |
| --file | | Single file (repeatable) |
| --out | ~/Desktop | Output directory |
| --no-qbo-coa | false | Use built-in COA only (offline) |
| --sandbox | false | QBO sandbox mode |

Dependencies

Required (pip)

pip install openpyxl

Optional (better extraction quality)

pip install pdfminer.six   # Better PDF text extraction
pip install ofxparse       # Better OFX/QFX parsing
brew install tesseract     # Image receipt OCR (JPG/PNG)

Node.js QBO Client

Node.js QBO client   # Auth token must be configured

Categorization Logic

Priority Chain

  1. Vendor Map exact match → HIGH confidence
  2. Vendor Map substring match → HIGH confidence
  3. COA keyword index (built from COA account names + keywords) → MEDIUM confidence
  4. Doc-class default → LOW confidence
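
The priority chain could be sketched roughly as follows. This is an illustration under stated assumptions, not the script's implementation: the function name `categorize`, the lowercase-keyed dictionaries, and the "Uncategorized Expense" default account used in the usage example are all hypothetical.

```python
from typing import Dict, Tuple


def categorize(
    vendor: str,
    vendor_map: Dict[str, str],    # lowercase vendor name -> QBO account
    coa_keywords: Dict[str, str],  # lowercase keyword -> QBO account
    default_account: str,          # doc-class default
) -> Tuple[str, str]:
    """Return (account, confidence) following the four-step priority chain."""
    key = vendor.strip().lower()
    # 1. Exact vendor-map match
    if key in vendor_map:
        return vendor_map[key], "HIGH"
    # 2. Vendor-map entry appearing as a substring of the vendor text
    for name, account in vendor_map.items():
        if name in key:
            return account, "HIGH"
    # 3. Keyword from the Chart of Accounts index
    for keyword, account in coa_keywords.items():
        if keyword in key:
            return account, "MEDIUM"
    # 4. Fall back to the document-class default
    return default_account, "LOW"
```

For example, with `vendor_map = {"gusto": "Payroll - Salaries & Wages"}`, a bank memo such as "GUSTO PAYROLL 0412" would resolve through the substring step, while an unrecognized vendor would fall through to the default and be flagged for review.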

Built-in Vendor Map

50+ known vendors pre-mapped:

  • Stripe/Square/PayPal → Sales Revenue
  • Gusto/ADP/Deel/Paychex → Payroll - Salaries & Wages
  • Google/Microsoft/Slack/GitHub/Zoom → Software & Subscriptions
  • Delta/United/Marriott/Uber → Travel
  • FedEx/UPS/USPS → Postage & Delivery
  • Chase/BofA service charges → Bank & Merchant Fees
  • etc. (see VENDOR_MAP in script)

Client-Specific Overrides

Auto-loaded by --slug:

  • glowlabs → Loads GlowLabs vendor map (Deel, Toptal, Brex, Huellas Labs, etc.)
  • sb-paulson / willo → Loads Willo Salons vendor map
  • Other clients → Reads clients/{slug}/categorization-map*.md markdown tables

Duplicate Detection Rules

  • Window: ±3 days (configurable via DUP_WINDOW_DAYS constant)
  • Match criteria: Same amount (exact Decimal) + same vendor key (first 3 meaningful words)
  • Action: Flagged as is_duplicate=True, excluded from import file
  • Always confirm before deleting — duplicates tab shows "Dup Of Row #" reference
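
A minimal sketch of these rules, assuming each record is a dict with `date`, `amount`, and `vendor` fields. The text does not define "meaningful words", so the noise-word list and the letters-only tokenization below are assumptions; only the `DUP_WINDOW_DAYS` name and the `is_duplicate` flag come from the text.

```python
import re
from datetime import date
from decimal import Decimal

DUP_WINDOW_DAYS = 3  # constant name taken from the text
NOISE_WORDS = {"inc", "llc", "the", "co", "pos", "purchase"}  # hypothetical list


def vendor_key(vendor: str) -> str:
    """First three alphabetic words that are not noise, lowercased."""
    words = [w for w in re.findall(r"[a-z]+", vendor.lower()) if w not in NOISE_WORDS]
    return " ".join(words[:3])


def flag_duplicates(records: list) -> list:
    """Set is_duplicate / dup_of on each record in place and return the list."""
    seen = {}  # (amount, vendor key) -> [(index, date)] of non-duplicate records
    for i, rec in enumerate(records):
        key = (rec["amount"], vendor_key(rec["vendor"]))
        rec["is_duplicate"] = False
        rec["dup_of"] = None
        for j, earlier_date in seen.get(key, []):
            if abs((rec["date"] - earlier_date).days) <= DUP_WINDOW_DAYS:
                rec["is_duplicate"] = True
                rec["dup_of"] = j  # index of the record this one duplicates
                break
        if not rec["is_duplicate"]:
            seen.setdefault(key, []).append((i, rec["date"]))
    return records
```

Records are only flagged, never dropped, which matches the "always confirm before deleting" rule: the `dup_of` index is what a "Dup Of Row #" column would be built from.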

Exception Rules (auto-flagged)

| Condition | Flag |
|---|---|
| Missing transaction date | "Missing transaction date" |
| Zero amount (non-1099) | "Zero amount — verify or skip" |
| Empty/unknown vendor | "Vendor name missing or unknown" |
| LOW confidence category | "Low categorization confidence — manual review" |
| PDF extraction failed | "PDF text extraction failed — manual review required" |
| Image without tesseract | "Image OCR not available — manual entry required" |
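
The record-level conditions above could be checked with something like the sketch below. It returns the condition names from the table rather than the full flag messages, and the record field names (`date`, `amount`, `vendor`, `confidence`, `doc_class`) and the set of "unknown" vendor values are assumptions. The two extraction-failure conditions are omitted because they arise during extraction, before a record exists.

```python
from decimal import Decimal

UNKNOWN_VENDORS = {"", "unknown"}  # hypothetical placeholder values


def check_exceptions(record: dict) -> list:
    """Return the names of the exception conditions a record triggers."""
    conditions = []
    if record.get("date") is None:
        conditions.append("Missing transaction date")
    # 1099 forms may legitimately carry zero in some boxes, so they are exempt.
    if record.get("amount") == Decimal("0") and record.get("doc_class") != "1099":
        conditions.append("Zero amount (non-1099)")
    if (record.get("vendor") or "").strip().lower() in UNKNOWN_VENDORS:
        conditions.append("Empty/unknown vendor")
    if record.get("confidence") == "LOW":
        conditions.append("LOW confidence category")
    return conditions
```

A record that returns an empty list would be eligible for the import file, provided it is not also flagged as a duplicate.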

QBO Import CSV Format

Ready-to-import columns:

Date | Description | Amount | Vendor/Customer | Account | Class | Memo | Doc Number
  • Amount sign: positive = expense (debit), negative = credit/income
  • Memo includes source file + doc type for audit trail
  • Excludes: duplicates, failed extractions
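
As a rough sketch of how such a file might be written with Python's standard `csv` module: the column names come from the list above, while the function name and the `is_duplicate` / `extraction_failed` record fields are assumptions for illustration.

```python
import csv
import io

COLUMNS = ["Date", "Description", "Amount", "Vendor/Customer",
           "Account", "Class", "Memo", "Doc Number"]


def write_import_csv(records: list, out) -> int:
    """Write import-ready rows to a text stream; return how many were written."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    written = 0
    for rec in records:
        # Duplicates and failed extractions are excluded from the import file.
        if rec.get("is_duplicate") or rec.get("extraction_failed"):
            continue
        writer.writerow({col: rec.get(col, "") for col in COLUMNS})
        written += 1
    return written


# Usage: write to an in-memory buffer (a real run would open a file instead).
buffer = io.StringIO()
count = write_import_csv(
    [{"Date": "2024-03-01", "Description": "Uber trip", "Amount": "42.10",
      "Vendor/Customer": "Uber", "Account": "Travel", "Memo": "stmt.csv | bank"}],
    buffer,
)
```

Missing columns are written as empty cells so every row has the same eight fields. The date format QBO expects is not stated in the text, so the value here is only a placeholder.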

CDC Cache

Location: .cache/document-ingestion/{slug}.json

Tracks between runs:

  • docs_processed, records_extracted, duplicates_caught
  • exceptions_flagged, import_ready
  • high_confidence, medium_confidence, low_confidence
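
The delta tracking could work along these lines. This sketch assumes the cache file holds a flat JSON object of the counters listed above; the actual schema is not documented here, and the function name is hypothetical.

```python
import json
from pathlib import Path


def update_cdc_cache(cache_path: Path, current: dict) -> dict:
    """Compare this run's counters with the cached run, then overwrite the cache."""
    previous = {}
    if cache_path.exists():
        previous = json.loads(cache_path.read_text())
    # Delta per counter; a counter missing from the prior run counts as 0.
    delta = {name: value - previous.get(name, 0) for name, value in current.items()}
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(current, indent=2))
    return delta
```

On a first run the delta equals the current counters, since there is no prior state to compare against.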

Output File Naming

DocIngestion_{slug}_{YYYYMMDD}.xlsx
DocIngestion_{slug}_{YYYYMMDD}_QBO_Import.csv
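
The naming pattern can be reproduced with a small helper. The function name is hypothetical; only the two filename patterns come from the text.

```python
from datetime import date


def output_filenames(slug: str, run_date: date) -> tuple:
    """Build the workbook and import-CSV filenames for a run."""
    base = f"DocIngestion_{slug}_{run_date.strftime('%Y%m%d')}"
    return f"{base}.xlsx", f"{base}_QBO_Import.csv"
```

Because the date stamp has day resolution, two runs for the same slug on the same day would produce the same filenames.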

Agent Instructions

Standard Run

  1. Collect input files from user (directory path or individual files)
  2. Get client slug (sb-paulson, glowlabs, etc.)
  3. Run pipeline. If QBO auth not set, use --no-qbo-coa
  4. Deliver summary:
    • Records extracted, dups caught, exceptions
    • HIGH/MED/LOW confidence split
    • Path to Excel + import CSV
  5. Walk user through Exceptions tab — those need action before import

Month-End Close Integration

  • Run AFTER bank statement download, BEFORE bank reconciliation
  • Use --input-dir pointing to client's document drop folder
  • Import CSV goes into QBO → then run bank-reconciliation.py

Exception Handling

  • PDFs with no extractable text → LOW confidence + exception flag → send to client for re-scan
  • Image receipts with no tesseract → exception flag → use nano-pdf skill or manual entry
  • Unknown vendors → update VENDOR_MAP in script or add to clients/{slug}/categorization-map.md

Adding New Client Vendor Maps

Edit load_client_vendor_map() in the script:

if slug_lower in ("new-client", "nc"):
    client_map.update({
        "vendor name": "QBO Account Name",
    })

Or create clients/{slug}/categorization-map.md with markdown table:

| Vendor / Memo Keyword | Primary Account | Notes |
|---|---|---|
| Amazon | Office Supplies | |
| Comcast | Utilities | |
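
A table in that shape could be read into a vendor map with a short parser like the one below. This is a sketch only: the function name is hypothetical, and it assumes the first column is the vendor keyword and the second is the account, as in the example table.

```python
def parse_categorization_map(markdown_text: str) -> dict:
    """Return {lowercase vendor keyword: account} from markdown table rows."""
    vendor_map = {}
    in_body = False  # becomes True once the |---|---| separator row is seen
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            in_body = False  # any non-table line ends the current table
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            in_body = True  # separator row: rows after this are data
            continue
        if in_body and len(cells) >= 2 and cells[0] and cells[1]:
            vendor_map[cells[0].lower()] = cells[1]
    return vendor_map
```

Only rows after the separator are treated as data, so the header row is skipped without relying on its wording.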

Financial Math

All amounts use Python Decimal with ROUND_HALF_UP to 2 decimal places. No float arithmetic.
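
A minimal sketch of this rounding rule using the standard `decimal` module. The helper name and the input cleanup (currency symbol, thousands separators, accounting-style parentheses for negatives) are assumptions about what statement exports look like, not documented behavior of the script.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_money(raw: str) -> Decimal:
    """Parse an amount string into a Decimal rounded half-up to 2 places."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    # Accounting exports often write negatives as (45.00).
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned).quantize(CENT, rounding=ROUND_HALF_UP)
    return -value if negative else value
```

Parsing from the string matters: `to_money("2.675")` gives `Decimal("2.68")`, whereas `round(2.675, 2)` on a float gives `2.67`, because the float cannot represent 2.675 exactly.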

Security Recommendations
Red flags to consider before installing or using this skill:

  • The SKILL.md expects you to run a local script at scripts/pipelines/document-ingestion.py and to have client mapping files (clients/{slug}/...), but the published bundle contains no code. Ask the publisher for the script and full source before trusting it.
  • The skill clearly needs a QBO auth token (and mentions a Node.js QBO client), but the registry lists no required environment variables or credential names. Ask which exact credentials are needed and how to scope or restrict them (use a sandbox token with minimal scope for testing).
  • The instructions will read your local folders and write outputs and caches (~/Desktop, .cache/document-ingestion). If you plan to run anything from an unreviewed source, do so in an isolated environment (VM/container) and inspect the code first.
  • The skill recommends installing third-party tools (tesseract, pdfminer.six, ofxparse). Install these only from official sources and be cautious about permissions.
  • If you want to proceed: obtain the actual script source, verify where the QBO token is read (which env var or config file), review vendor maps and any pre-mapped vendor list for privacy issues, and test with non-sensitive sample documents in QBO sandbox mode.

If the publisher cannot provide the missing script and explicit credential/config instructions, treat this skill as incomplete and avoid running it on real financial data.
Functional Analysis
Type: OpenClaw Skill · Name: document-ingestion · Version: 1.0.2
The document-ingestion skill bundle is a legitimate tool designed to automate the extraction and categorization of financial data from various accounting documents (PDFs, CSVs, bank statements) for QuickBooks Online import. The SKILL.md file provides comprehensive instructions for an AI agent to perform file classification, data normalization, and duplicate detection. No indicators of malicious intent, data exfiltration, or prompt injection were found; all described behaviors, including local caching and QBO API interaction, are strictly aligned with the stated purpose of financial document processing.
Capability Assessment
Purpose & Capability
The skill's stated purpose (convert accounting docs into QBO import CSVs) is reasonable, but the SKILL.md expects live QBO Chart-of-Accounts access and a Node.js QBO client with an auth token while the registry metadata declares no required environment variables or primary credential. The skill also references a local Python script path (scripts/pipelines/document-ingestion.py) and client-specific local files (clients/{slug}/...) that are not provided in the bundle.
Instruction Scope
Runtime instructions direct the agent to run a local Python script, read directories and client mapping files, write outputs to ~/Desktop and .cache/document-ingestion/{slug}.json, and optionally contact QBO (including sandbox). The SKILL.md also refers to 'Auth token must be configured' but does not specify how or which env var. These instructions access local filesystem paths and external APIs beyond what the registry declares.
Install Mechanism
This is an instruction-only skill (no install spec, no code files). The SKILL.md lists pip packages and Homebrew (tesseract) as required/optional installs — that's a manual installation expectation but the registry provides no automated install. Because no script files are included, following the instructions would fail unless the user separately obtains the referenced scripts.
Credentials
The runtime behavior implies need for QBO credentials (auth token) and possibly other secrets for a Node.js QBO client, but requires.env is empty and no primary credential is declared. It also reads/writes local files (client maps, caches, Desktop outputs). Requesting QBO access is proportionate to purpose only if the skill declares which credentials it needs and why; here that mapping is missing.
Persistence & Privilege
The skill is not always-enabled and does not request elevated platform privileges. However, it will create/modify local cache files (.cache/document-ingestion/...), read client config files, and write Excel/CSV outputs to user directories (default ~/Desktop). These are normal for such a tool but should be confirmed before running.
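How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install document-ingestion
  3. Once installed, invoke the Skill by name or trigger it with /document-ingestion
  4. Provide the required inputs as described in the Skill's parameter notes to get structured output
Version History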
v1.0.2
Updated SKILL.md
v1.0.1
Security cleanup: removed internal references, genericized examples
v1.0.0
Initial release: Raw accounting document processing pipeline for QBO import
Metadata
Slug: document-ingestion
Version: 1.0.2
License: MIT-0
Total installs: 1
Current installs: 1
Versions: 3
FAQ

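What is Document Ingestion?

Process raw accounting source documents (PDFs, CSVs, bank statements, invoices, receipts) into standardized transaction records for QBO import. Use when batc... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 250 total downloads to date.

How do I install Document Ingestion?

Run the command "/install document-ingestion" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Document Ingestion free?

Yes. Document Ingestion is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Document Ingestion support?

Document Ingestion is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Document Ingestion?

It is developed and maintained by samledger67-dotcom (@samledger67-dotcom); the current version is v1.0.2.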
