
Civil Judgment Taiwan Vectorstore

Author: alex02131926 · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 121 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install civil-judgment-taiwan-vectorstore
Description
Ingest Taiwan civil court judgments (HTML or PDF) — exclusively covering Taiwan civil cases — into Qdrant with Ollama embeddings, preserving traceability, de...
Usage instructions (SKILL.md)

Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion

Scope: Taiwan civil court judgments only (民事判決). This skill ingests Taiwan civil cases (HTML or PDF files) into Qdrant. All parsing, chunking, and embedding logic lives in scripts/ingest.py — your job is to run the script, not to reimplement the pipeline.


Quick Start (follow these steps in order)

Step 1 — Activate venv

source {baseDir}/.venv/bin/activate

Step 2 — Identify the run folder

The user will provide an absolute path to a run folder.

Example: /path/to/output/judicialyuan/20260305_142030

Verify it exists and has HTML or PDF files:

ls <RUN_FOLDER>/archive/ | grep -E '\.(html|pdf)$' | head -5

If there are no archive/*.html or archive/*.pdf files, stop and tell the user the folder has no ingestible data.

Step 3 — Run ingestion

Use absolute paths throughout — no cd needed:

python3 {baseDir}/scripts/ingest.py \
  --run-folder <RUN_FOLDER>

The script handles everything: pre-flight checks, collection auto-creation (creates civil_case_doc / civil_case_chunk if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.

Re-running the same command on the same folder is always safe — deterministic IDs mean upsert = overwrite. No special --resume flag needed; just run the same command again.

Step 4 — Check the result

Successful output looks like:

OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md

Read the report (human-readable stats summary):

cat <RUN_FOLDER>/ingest_report.md

If there are errors, check the manifest (machine-readable, one JSON line per file) for per-file diagnosis:

grep -E '"status":"(skipped|error|partial)"' <RUN_FOLDER>/ingest_manifest.jsonl
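
The grep pattern above only matches when there is no whitespace between "status" and its value. As a whitespace-tolerant alternative, here is a minimal sketch that parses each manifest line as JSON and keeps the non-ok records. The function name non_ok_entries is hypothetical, and only the status field is taken from this document; any other keys in a record are whatever ingest.py happens to write.

```python
import json
from pathlib import Path


def non_ok_entries(manifest_path):
    """Return the manifest records whose "status" is anything other than "ok"."""
    entries = []
    with Path(manifest_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("status") != "ok":
                entries.append(record)
    return entries
```

Calling non_ok_entries("<RUN_FOLDER>/ingest_manifest.jsonl") with the real run folder substituted in returns a list of dicts that can be printed or summarized per status.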

Step 5 — Report to user

Tell the user:

  • How many docs were ingested (doc_points)
  • How many chunks were created (chunk_points)
  • Whether any were skipped or errored
  • Where the report file is
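
The four points above can all be read off the Step 4 summary line. The following is a rough sketch, assuming the summary keeps the key=value shape shown in Step 4; parse_summary and user_report are hypothetical helper names, not part of the skill.

```python
import re


def parse_summary(line):
    """Parse 'OK files=42 processed=42 ...' into a dict of integer counts."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def user_report(summary_line, report_path):
    """Build a short plain-text message covering the four points to report."""
    counts = parse_summary(summary_line)
    problems = counts.get("skipped", 0) + counts.get("errored", 0)
    return "\n".join([
        f"Docs ingested: {counts.get('doc_points', 0)}",
        f"Chunks created: {counts.get('chunk_points', 0)}",
        f"Skipped or errored: {'none' if problems == 0 else problems}",
        f"Report: {report_path}",
    ])
```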

Done. Do not proceed to additional steps unless the user asks.


DO NOT rules (critical)

  • DO NOT write your own HTML parsing, chunking, or embedding code. ingest.py handles all of this.
  • DO NOT modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.
  • DO NOT call Qdrant or Ollama APIs directly. The script does this.
  • DO NOT use verify=False or skip SSL verification for any HTTP request.
  • DO NOT modify or delete files under archive/. Raw HTML/PDF is the immutable source of truth.
  • DO NOT change chunking defaults (--max-chars, --overlap-chars) unless the user explicitly asks.

Hard constraints

  • Raw HTML/PDF is source of truth; never overwrite it.
  • Deterministic: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.
  • Traceability: every Qdrant point carries doc_url + local_path.
  • Batched upserts (≤ 64 points/batch) to stay under Qdrant's 32MB payload limit.
  • parser_version in every point's metadata. Current: v3.5-sentence-boundary.
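
To illustrate why deterministic IDs make re-runs idempotent, and what batching at 64 looks like, here is a minimal sketch. It is not the ID scheme used by ingest.py, which this document does not show: hashing into uuid.uuid5 with NAMESPACE_URL, the "digest:index" naming, and the function names point_id and batched are all assumptions made for illustration.

```python
import hashlib
import uuid


def point_id(canonical_text, chunk_index=None):
    """Derive a stable UUID string from the SHA-256 of the canonical text.

    Assumed scheme: doc-level points hash the text alone; chunk-level
    points append the chunk index so each chunk gets its own ID.
    """
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    name = digest if chunk_index is None else f"{digest}:{chunk_index}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


def batched(points, size=64):
    """Yield successive slices of at most `size` points."""
    for start in range(0, len(points), size):
        yield points[start:start + size]
```

Because the ID depends only on the canonical text (and chunk index), ingesting the same file twice produces the same IDs, so the second upsert overwrites the first instead of adding duplicates.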

Troubleshooting

PREFLIGHT_FAILED: Qdrant not reachable

Qdrant is down or unreachable at the default/configured URL.

# Check if Qdrant is running
curl -s http://localhost:6333/collections | head -1

# If not running, start it (or ask the user)

PREFLIGHT_FAILED: Ollama not reachable

# Check Ollama
curl -s http://localhost:11434/api/tags | head -5

PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest

ollama pull bge-m3:latest

Then re-run Step 3.
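
For orientation, the three pre-flight failures above correspond to checks along these lines. This is a sketch of the idea, not the script's actual code: it reuses the same two URLs as the curl commands and assumes the Ollama /api/tags response lists installed models under "models" with a "name" field. The names fetch_json and preflight are hypothetical, and the script's own pre-flight remains the check to rely on.

```python
import json
import urllib.request


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def preflight(qdrant_url, ollama_url, model, fetch=fetch_json):
    """Return a list of problem strings; an empty list means every check passed."""
    problems = []
    try:
        fetch(f"{qdrant_url}/collections")
    except OSError:  # URLError and connection errors are OSError subclasses
        problems.append("Qdrant not reachable")
    try:
        tags = fetch(f"{ollama_url}/api/tags")
    except OSError:
        problems.append("Ollama not reachable")
    else:
        names = {m.get("name") for m in tags.get("models", [])}
        if model not in names:
            problems.append(f"Ollama model missing: {model}")
    return problems
```

The fetch parameter exists only so the logic can be exercised without live services.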

PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found

The run folder exists but has no archived detail pages. Check:

  • Is this the correct run folder?

Output shows skipped > 0 or errored > 0

Check ingest_manifest.jsonl for per-file details:

grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"

Manifest status | Meaning | Action
ok | Doc + all chunks ingested | None
partial | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely
skipped | Doc-level embedding failed (nothing upserted for this doc) | Check Ollama; re-run safely
error | HTML read/parse failed | Check if the HTML file is corrupted

Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.

Override service endpoints

# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 {baseDir}/scripts/ingest.py --run-folder "..."

# Via CLI flags (take precedence over env vars)
python3 {baseDir}/scripts/ingest.py --run-folder "..." \
  --ollama http://localhost:11434 --qdrant http://localhost:6333

Default endpoints:

Service | Default | Env override
Ollama | http://localhost:11434 | $OLLAMA_URL
Qdrant | http://localhost:6333 | $QDRANT_URL
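
The precedence described in this section (CLI flag, then environment variable, then built-in default) can be sketched as follows. This is an illustration under assumptions, not the argument handling in ingest.py; resolve_endpoints and DEFAULTS are hypothetical names, while the flag names, variable names, and default URLs come from the table above.

```python
import argparse
import os

DEFAULTS = {
    "ollama": "http://localhost:11434",
    "qdrant": "http://localhost:6333",
}


def resolve_endpoints(argv, env=None):
    """Resolve each endpoint as: CLI flag > environment variable > default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--ollama", default=None)
    parser.add_argument("--qdrant", default=None)
    # parse_known_args ignores flags this sketch does not model (e.g. --run-folder)
    args, _unknown = parser.parse_known_args(argv)
    return {
        "ollama": args.ollama or env.get("OLLAMA_URL") or DEFAULTS["ollama"],
        "qdrant": args.qdrant or env.get("QDRANT_URL") or DEFAULTS["qdrant"],
    }
```

For example, resolve_endpoints(["--qdrant", "http://10.0.0.5:6333"]) would return that Qdrant URL regardless of $QDRANT_URL, while the Ollama value would still come from $OLLAMA_URL or the default.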

Test with a small batch first

python3 {baseDir}/scripts/ingest.py --run-folder "..." --limit 5

Input folder structure (expected)

<run_folder>/
  archive/
    fjud_detail_001.html               ← HTML input
    fjud_detail_002.html
    fjud_detail_003.pdf                ← PDF input (also supported)
    fint_detail_001.html               (if system=both)
  results_fjud.jsonl                   (optional)
  results_fint.jsonl                   (optional)

The script discovers all archive/*.html and archive/*.pdf files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.
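
The discovery rule just described, combined with the --limit behaviour from the CLI reference, amounts to something like the sketch below. The function name discover_inputs is hypothetical, and the exact matching used by ingest.py (for example, case sensitivity of extensions) is an assumption here.

```python
from pathlib import Path


def discover_inputs(run_folder, limit=0):
    """List archive/*.html and archive/*.pdf sorted by filename.

    limit=0 means no limit; a positive limit keeps only the first N files.
    """
    archive = Path(run_folder) / "archive"
    if not archive.is_dir():
        return []
    files = sorted(
        (p for p in archive.iterdir()
         if p.is_file() and p.suffix in {".html", ".pdf"}),
        key=lambda p: p.name,  # lexicographic order by filename
    )
    return files[:limit] if limit > 0 else files
```

Sorting before applying the limit is what makes a small --limit test run repeatable: the same first N files are picked every time.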

v1 limitation: The system metadata field is currently hardcoded to FJUD. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as FJUD. This does not affect chunking or embeddings — only the system metadata field on the resulting Qdrant points.


CLI reference

python3 scripts/ingest.py --run-folder <PATH> [options]

Flag | Default | Description
--run-folder | (required) | Path to an input folder
--ollama | $OLLAMA_URL or http://localhost:11434 | Ollama endpoint
--qdrant | $QDRANT_URL or http://localhost:6333 | Qdrant endpoint
--embed-model | bge-m3:latest | Ollama embedding model
--vector-size | 1024 | Vector dimension
--max-chars | 900 | Max chars per chunk (500–1000)
--overlap-chars | 150 | Overlap between chunks (10–20% of max-chars)
--limit | 0 (no limit) | Process only first N files sorted by filename (lexicographic order); for testing
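
To make the interaction of --max-chars and --overlap-chars concrete, here is a simplified sliding-window sketch. It is deliberately not the script's chunker: the real parser is section-aware and, per its version tag, respects sentence boundaries, whereas this hypothetical chunk_text cuts at fixed character offsets. With the defaults, the window advances 900 - 150 = 750 characters per chunk, and the overlap is about 16.7% of the chunk size.

```python
def chunk_text(text, max_chars=900, overlap_chars=150):
    """Split text into windows of max_chars that overlap by overlap_chars."""
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be >= 0 and smaller than max_chars")
    chunks = []
    step = max_chars - overlap_chars  # how far each new window advances
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

A 2000-character text yields three chunks under the defaults, starting at offsets 0, 750, and 1500, with the last one shorter than the rest.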

Outputs

  • Qdrant collections: civil_case_doc (1 point/doc), civil_case_chunk (many points/doc). Auto-created if they don't exist.
  • ingest_report.md: human-readable summary (doc/chunk counts, error counts). Read this first after ingestion.
  • ingest_manifest.jsonl: machine-readable, one JSON line per doc with status (ok / partial / skipped / error). Read this to diagnose specific file failures (grep for non-ok statuses). Both files overlap on aggregate counts; the manifest adds per-file detail.

Roadmap

  • v1 (current): doc + section-aware chunks
  • v2: candidate issue extraction (爭點抽取)
  • v3: issue-level index (civil_case_issue collection)

Internal details

For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see references/internals.md.


Lessons learned / operational gotchas

  • Qdrant rejects non-UUID/non-integer point IDs (400 Bad Request). The script uses deterministic UUIDs — do not change the ID generation logic.
  • Qdrant rejects payloads > 32MB. The script batches at 64 points — do not increase batch size.
  • Re-running on the same folder is safe: deterministic IDs mean upsert = overwrite.
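  • Section headings in Taiwan judgments are not uniformly formatted (e.g. 「理 由」 with a fullwidth space, or compatibility characters such as 「⽂」). The parser now normalizes headings first; if it still cannot split the text into sections, it falls back to chunking the full text, so that a document is never left with only doc-level points.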
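
As an illustration of the heading problem in the last item, one common way to fold such variants is Unicode NFKC normalization followed by whitespace removal. This is a sketch of that general technique, not necessarily what the parser does; normalize_heading is a hypothetical name. NFKC maps the fullwidth space U+3000 to an ordinary space and the Kangxi radical U+2F42 to the regular ideograph 文.

```python
import re
import unicodedata


def normalize_heading(line):
    """Fold compatibility characters and strip all whitespace from a heading."""
    folded = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", "", folded)


# "理\u3000由" (fullwidth space) and "主\u2f42" (Kangxi radical) both
# compare equal to their plain forms after normalization.
print(normalize_heading("理\u3000由") == "理由")
print(normalize_heading("主\u2f42") == "主文")
```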
Security recommendations
This skill appears coherent with its purpose. Before installing or running it, confirm the following:
  1. The Ollama and Qdrant endpoints you provide are trusted. The script sends document text to the configured Ollama service for embeddings, so do not point it at an untrusted remote endpoint if documents are sensitive.
  2. Run the pipeline in an isolated environment (local machine or trusted server) and test first with a small sample (use --limit or --dry-run) to verify behaviour and resource usage.
  3. Ensure you want the script to create/modify Qdrant collections in the provided instance.
  4. The skill preserves raw HTML/PDF as the immutable source of truth but may leave manifests/reports in the run folder; inspect those files if you need auditability.
  5. If you require additional assurance, review the two included Python scripts (ingest.py and build_reasoning_collection.py) locally. They are the only code that runs, and they contain the network calls.
If any of the above is unacceptable (e.g., embedding on a remote, untrusted Ollama instance), either do not run the skill or reconfigure the endpoints to point at local/trusted services.
Functional analysis
Type: OpenClaw Skill Name: civil-judgment-taiwan-vectorstore Version: 1.0.0 The skill bundle is a legitimate tool for ingesting Taiwan civil court judgments into a Qdrant vector database using Ollama embeddings. The core logic in `scripts/ingest.py` and `scripts/build_reasoning_collection.py` uses standard libraries (BeautifulSoup, pypdf, requests) to parse documents and interact with local/configured service endpoints. The instructions in `SKILL.md` are well-structured, providing clear operational boundaries for the AI agent without any signs of prompt injection or malicious intent. No evidence of data exfiltration, unauthorized execution, or obfuscation was found.
Capability assessment
Purpose & Capability
Name/description match the implementation: scripts parse HTML/PDF, canonicalize, chunk, embed via Ollama, and upsert into Qdrant. The required resources (Ollama, Qdrant, local run-folder) are appropriate and expected for the stated goal.
Instruction Scope
SKILL.md limits the agent's work to activating a venv, pointing to an absolute run-folder, and running the provided ingest script. It explicitly forbids ad-hoc parsing, direct calls to the services (the script does those), and overwriting archive files. The instructions do not ask the agent to read unrelated files or exfiltrate data.
Install Mechanism
No install spec — instruction-only skill with two Python scripts. README documents standard Python deps (requests, beautifulsoup4, pypdf, qdrant-client). No downloads from untrusted URLs or arbitrary installer steps are present.
Credentials
The skill does not declare required environment secrets. It accepts optional endpoint overrides via OLLAMA_URL and QDRANT_URL (documented). Access to local run-folder files and to the configured Ollama/Qdrant endpoints is necessary and proportional to the task. No unrelated credentials or config paths are requested.
Persistence & Privilege
The skill is not always-enabled and does not request system-wide privileges. It will create/modify Qdrant collections (expected) but does not alter other skills or global agent settings.
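How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install civil-judgment-taiwan-vectorstore
  3. After installation, invoke the Skill by name or trigger it with /civil-judgment-taiwan-vectorstore
  4. Provide the required inputs as described in the Skill's parameter documentation to get structured output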
Version history
v1.0.0
Initial release for Taiwan civil court judgment ingestion to Qdrant
  • Supports ingestion of Taiwan civil court judgments (HTML and PDF) into Qdrant with Ollama embeddings.
  • Ensures traceability, deduplication, and safe incremental updates.
  • Enforces strict process: raw files never overwritten, deterministic IDs, and robust pre-flight checks.
  • Provides user-friendly reporting on processed, skipped, and errored files.
  • Explicit DO NOT rules to avoid accidental misuse or logic modifications.
  • Troubleshooting and testing instructions included for fast onboarding.
Metadata
Slug: civil-judgment-taiwan-vectorstore
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 1
FAQ

What is Civil Judgment Taiwan Vectorstore?

Ingest Taiwan civil court judgments (HTML or PDF) — exclusively covering Taiwan civil cases — into Qdrant with Ollama embeddings, preserving traceability, de... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 121 total downloads to date.

How do I install Civil Judgment Taiwan Vectorstore?

Run the command "/install civil-judgment-taiwan-vectorstore" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Civil Judgment Taiwan Vectorstore free?

Yes. Civil Judgment Taiwan Vectorstore is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Civil Judgment Taiwan Vectorstore support?

Civil Judgment Taiwan Vectorstore runs cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Civil Judgment Taiwan Vectorstore?

It is developed and maintained by alex02131926 (@alex02131926); the current version is v1.0.0.
