dbmoradi60

habib-pdf-to-json

by dbmoradi60 · GitHub ↗ · v1.0.0
cross-platform ✓ Security Clean
1133
Downloads
0
Stars
1
Active Installs
1
Versions
Install in OpenClaw
/install habib-pdf-to-json
Description
Extract structured data from construction PDFs. Convert specifications, BOMs, schedules, and reports from PDF to Excel/CSV/JSON. Use OCR for scanned documents and pdfplumber for native PDFs.
README (SKILL.md)

# PDF to Structured Data Conversion

## Overview

Based on the DDC methodology (Chapter 2.4), this skill transforms unstructured PDF documents into structured formats suitable for analysis and integration. Construction projects generate vast amounts of PDF documentation (specifications, BOMs, schedules, and reports) that must be extracted and processed before it can be analyzed.

**Book Reference**: "Data Transformation to Structured Form"

"Transforming data from unstructured into structured form is both an art and a science. This process often takes up a significant share of a data engineer's work."
(DDC Book, Chapter 2.4)

## ETL Process Overview

The conversion follows the ETL pattern:

  1. Extract: Load the PDF document
  2. Transform: Parse and structure the content
  3. Load: Save to CSV, Excel, or JSON
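
The transform and load steps above can be sketched without any PDF library by treating the output of the extract step as a list of rows, which is the shape `page.extract_table()` returns. The function names and the sample rows below are illustrative assumptions, not part of the skill.

```python
import json

def transform(raw_table):
    """Turn a raw table (header row plus data rows) into a list of records."""
    header, *rows = raw_table
    keys = [(h or "").strip() for h in header]
    records = []
    for row in rows:
        # Skip rows in which every cell is None or blank
        if not any(cell and cell.strip() for cell in row):
            continue
        records.append(dict(zip(keys, row)))
    return records

def load(records):
    """Serialize the records as a JSON string."""
    return json.dumps(records, indent=2, ensure_ascii=False)

# Hypothetical output of the extract step
raw_table = [
    ["Item", "Description", "Quantity"],
    ["1", "Concrete C30/37", "120"],
    [None, "", None],
    ["2", "Rebar B500B", "8.5"],
]

records = transform(raw_table)
json_text = load(records)
```

Keeping each step in its own function makes it possible to swap the extract step (native PDF or OCR) without touching the rest.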

## Quick Start

```python
import pdfplumber
import pandas as pd

# Extract the first table on the first page
with pdfplumber.open("construction_spec.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # returns None when no table is found
    if table:
        df = pd.DataFrame(table[1:], columns=table[0])
        df.to_excel("extracted_data.xlsx", index=False)
```

## Installation

```bash
# Core libraries
pip install pdfplumber pandas openpyxl

# For scanned PDFs (OCR); opencv-python and numpy are used by the OCR table example
pip install pytesseract pdf2image opencv-python numpy
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract
# pdf2image additionally needs the Poppler utilities installed on the system

# For advanced PDF operations
pip install pypdf
```

## Native PDF Extraction (pdfplumber)

### Extract All Tables from PDF

```python
import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path):
    """Extract all tables from a PDF file"""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for table_num, table in enumerate(tables):
                if table and len(table) > 1:
                    # First row as header
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df['_page'] = page_num + 1
                    df['_table'] = table_num + 1
                    all_tables.append(df)

    if all_tables:
        return pd.concat(all_tables, ignore_index=True)
    return pd.DataFrame()

# Usage
df = extract_tables_from_pdf("material_specification.pdf")
df.to_excel("materials.xlsx", index=False)
```

### Extract Text with Layout

```python
import pdfplumber

def extract_text_with_layout(pdf_path):
    """Extract text preserving layout structure"""
    full_text = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                full_text.append(text)

    return "\n\n--- Page Break ---\n\n".join(full_text)

# Usage
text = extract_text_with_layout("project_report.pdf")
with open("report_text.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

### Extract Specific Table by Position

```python
import pdfplumber
import pandas as pd

def extract_table_from_area(pdf_path, page_num, bbox):
    """
    Extract table from specific area on page

    Args:
        pdf_path: Path to PDF file
        page_num: Page number (0-indexed)
        bbox: Bounding box (x0, top, x1, bottom) in points
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.within_bbox(bbox)
        table = cropped.extract_table()

        if table:
            return pd.DataFrame(table[1:], columns=table[0])
    return pd.DataFrame()

# Usage - extract table from specific area
# bbox format: (left, top, right, bottom) in points (1 inch = 72 points)
df = extract_table_from_area("drawing.pdf", 0, (50, 100, 550, 400))
```
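
Because the bounding box is given in points, measurements taken from a drawing in inches or millimetres have to be converted first. A minimal sketch of that arithmetic follows; the helper names `bbox_from_inches` and `bbox_from_mm` are hypothetical and not part of pdfplumber.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

def bbox_from_inches(left, top, right, bottom):
    """Convert a bounding box measured in inches to PDF points."""
    return tuple(v * POINTS_PER_INCH for v in (left, top, right, bottom))

def bbox_from_mm(left, top, right, bottom):
    """Convert a bounding box measured in millimetres to PDF points."""
    scale = POINTS_PER_INCH / MM_PER_INCH
    # Round to two decimals to hide floating-point noise
    return tuple(round(v * scale, 2) for v in (left, top, right, bottom))

# A region starting 1 in from the left and 2 in from the top of the page
bbox = bbox_from_inches(1, 2, 7.5, 5)
bbox_mm = bbox_from_mm(25.4, 50.8, 190.5, 127.0)
```

The resulting tuple can be passed as the `bbox` argument of `extract_table_from_area`; note that `top` and `bottom` are measured downward from the top edge of the page.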

## Scanned PDF Processing (OCR)

### Extract Text from Scanned PDF

```python
import pytesseract
from pdf2image import convert_from_path
import pandas as pd

def ocr_scanned_pdf(pdf_path, language='eng'):
    """
    Extract text from scanned PDF using OCR

    Args:
        pdf_path: Path to scanned PDF
        language: Tesseract language code (eng, deu, rus, etc.)
    """
    # Convert PDF pages to images
    images = convert_from_path(pdf_path, dpi=300)

    extracted_text = []
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image, lang=language)
        extracted_text.append({
            'page': i + 1,
            'text': text
        })

    return pd.DataFrame(extracted_text)

# Usage
df = ocr_scanned_pdf("scanned_specification.pdf", language='eng')
df.to_csv("ocr_results.csv", index=False)
```

### OCR Table Extraction

```python
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import cv2
import numpy as np

def ocr_table_from_scanned_pdf(pdf_path, page_num=0):
    """Extract table from scanned PDF using OCR with table detection"""
    # Convert specific page to image
    images = convert_from_path(pdf_path, first_page=page_num+1,
                                last_page=page_num+1, dpi=300)
    image = np.array(images[0])

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

    # Apply thresholding (the inverted binary image is available for
    # line detection; the OCR call below runs on the grayscale image)
    _, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)

    # Extract text, treating the page as a single uniform block (--psm 6)
    custom_config = r'--oem 3 --psm 6'
    text = pytesseract.image_to_string(gray, config=custom_config)

    # Parse text into table structure: one row per line, one cell per token
    lines = text.strip().split('\n')
    data = [line.split() for line in lines if line.strip()]

    if data:
        # Assume first row is header
        df = pd.DataFrame(data[1:], columns=data[0] if len(data[0]) > 0 else None)
        return df
    return pd.DataFrame()

# Usage
df = ocr_table_from_scanned_pdf("scanned_bom.pdf")
print(df)
```
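
Splitting OCR lines on whitespace produces rows of different lengths whenever a cell contains a space or a cell is missing, and building a DataFrame from such rows can fail or misalign columns. One possible workaround, sketched here under the assumption that the free-text column is the last one, is to normalize every row to the header width first. The helper `normalize_rows` and the sample lines are hypothetical.

```python
def normalize_rows(rows, width, fill=""):
    """Force every row to exactly `width` cells."""
    normalized = []
    for row in rows:
        if len(row) > width:
            # Fold overflow cells into the last column
            row = row[:width - 1] + [" ".join(row[width - 1:])]
        elif len(row) < width:
            # Pad short rows with the fill value
            row = row + [fill] * (width - len(row))
        normalized.append(row)
    return normalized

# Hypothetical OCR output, one string per line
lines = [
    "Item Qty Description",
    "1 120 Concrete C30/37",
    "2 8",
    "3 40 Rebar B500B 12mm",
]
data = [line.split() for line in lines]
header = data[0]
body = normalize_rows(data[1:], len(header))
```

After this step, `pd.DataFrame(body, columns=header)` receives rows of a consistent width. Tables whose multi-word cells sit in the middle columns need a different strategy, such as position-based parsing.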

## Construction-Specific Extractions

### Bill of Materials (BOM) Extraction

```python
import pdfplumber
import pandas as pd
import re

def extract_bom_from_pdf(pdf_path):
    """Extract Bill of Materials from construction PDF"""
    all_items = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                if not table or len(table) < 2:
                    continue

                # Find header row (look for common BOM headers)
                header_keywords = ['item', 'description', 'quantity', 'unit', 'material']
                for i, row in enumerate(table):
                    if row and any(keyword in str(row).lower() for keyword in header_keywords):
                        # Found header, process remaining rows
                        headers = [str(h).strip() for h in row]
                        for data_row in table[i+1:]:
                            if data_row and any(cell for cell in data_row if cell):
                                item = dict(zip(headers, data_row))
                                all_items.append(item)
                        break

    return pd.DataFrame(all_items)

# Usage
bom = extract_bom_from_pdf("project_bom.pdf")
bom.to_excel("bom_extracted.xlsx", index=False)
```
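
Header cells in BOM tables vary between documents ("Qty", "QTY", "Quantity") and may contain line breaks or be empty, so records from different PDFs can end up with different keys. The sketch below shows one way to map headers to canonical names before building records. The alias map and the function `canonical_header` are assumptions for illustration and should be adapted to the headers in your own documents.

```python
import re

# Hypothetical alias map; extend it for your own documents
HEADER_ALIASES = {
    "qty": "quantity",
    "qnty": "quantity",
    "desc": "description",
    "uom": "unit",
    "item no": "item",
}

def canonical_header(raw):
    """Collapse whitespace, lower-case, drop trailing punctuation, apply aliases."""
    text = re.sub(r"\s+", " ", str(raw or "")).strip().lower()
    text = text.rstrip(".:")
    return HEADER_ALIASES.get(text, text)

raw_headers = ["Item No.", "Desc", "QTY", "Unit\nof measure", None]
headers = [canonical_header(h) for h in raw_headers]
```

Applying `canonical_header` where `extract_bom_from_pdf` builds its `headers` list would give every extracted item the same keys regardless of the source document.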

### Project Schedule Extraction

```python
import pdfplumber
import pandas as pd
from datetime import datetime

def extract_schedule_from_pdf(pdf_path):
    """Extract project schedule/gantt data from PDF"""
    with pdfplumber.open(pdf_path) as pdf:
        all_tasks = []

        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                if not table:
                    continue

                # Look for schedule-like table
                headers = table[0] if table else []

                # Check if it looks like a schedule
                schedule_keywords = ['task', 'activity', 'start', 'end', 'duration']
                if any(kw in str(headers).lower() for kw in schedule_keywords):
                    for row in table[1:]:
                        if row and any(cell for cell in row if cell):
                            task = dict(zip(headers, row))
                            all_tasks.append(task)

    df = pd.DataFrame(all_tasks)

    # Try to parse dates
    date_columns = ['Start', 'End', 'Start Date', 'End Date', 'Finish']
    for col in date_columns:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    return df

# Usage
schedule = extract_schedule_from_pdf("project_schedule.pdf")
print(schedule)
```

### Specification Parsing

```python
import pdfplumber
import pandas as pd
import re

def parse_specification_pdf(pdf_path):
    """Parse construction specification document"""
    specs = []

    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                full_text += text + "\n"

    # Parse sections (common spec format)
    section_pattern = r'(\d+\.\d+(?:\.\d+)?)\s+([A-Z][^\n]+)'
    sections = re.findall(section_pattern, full_text)

    for num, title in sections:
        specs.append({
            'section_number': num,
            'title': title.strip(),
            'level': len(num.split('.'))
        })

    return pd.DataFrame(specs)

# Usage
specs = parse_specification_pdf("technical_spec.pdf")
print(specs)
```
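
The section pattern above matches a section number anywhere in the text, so a cross-reference inside a sentence is also picked up as a heading. The sketch below runs the pattern on a small hypothetical sample and compares it with a stricter variant that only matches at the start of a line. The stricter pattern is an assumption about what is usually wanted, not the skill's own method.

```python
import re

# Pattern used by parse_specification_pdf
LOOSE_PATTERN = r'(\d+\.\d+(?:\.\d+)?)\s+([A-Z][^\n]+)'

# Stricter variant: anchored to line starts, spaces or tabs only after the number
STRICT_PATTERN = r'^(\d+\.\d+(?:\.\d+)?)[ \t]+([A-Z][^\n]+)'

# Hypothetical extracted text
sample = (
    "1.1 General Requirements\n"
    "1.1.1 Scope of Work\n"
    "Refer to clause 3.2 Materials for details\n"
    "2.1 Concrete Mix Design\n"
)

loose = re.findall(LOOSE_PATTERN, sample)
strict = re.findall(STRICT_PATTERN, sample, flags=re.MULTILINE)
```

On this sample the loose pattern returns four matches, including the cross-reference to clause 3.2, while the anchored pattern returns only the three real headings.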

## Batch Processing

### Process Multiple PDFs

```python
import pdfplumber
import pandas as pd
from pathlib import Path

def batch_extract_tables(folder_path, output_folder):
    """Process all PDFs in folder and extract tables"""
    pdf_files = Path(folder_path).glob("*.pdf")
    results = []

    # Create the output folder first; to_excel does not create missing folders
    output_dir = Path(output_folder)
    output_dir.mkdir(parents=True, exist_ok=True)

    for pdf_path in pdf_files:
        print(f"Processing: {pdf_path.name}")
        try:
            with pdfplumber.open(pdf_path) as pdf:
                for page_num, page in enumerate(pdf.pages):
                    tables = page.extract_tables()
                    for table_num, table in enumerate(tables):
                        if table and len(table) > 1:
                            df = pd.DataFrame(table[1:], columns=table[0])
                            df['_source_file'] = pdf_path.name
                            df['_page'] = page_num + 1

                            # Save individual table
                            output_name = f"{pdf_path.stem}_p{page_num+1}_t{table_num+1}.xlsx"
                            df.to_excel(output_dir / output_name, index=False)
                            results.append(df)
        except Exception as e:
            # Log the failure and continue with the next file
            print(f"Error processing {pdf_path.name}: {e}")

    # Combined output
    if results:
        combined = pd.concat(results, ignore_index=True)
        combined.to_excel(output_dir / "all_tables.xlsx", index=False)

    return len(results)

# Usage
count = batch_extract_tables("./pdf_documents/", "./extracted/")
print(f"Extracted {count} tables")
```

## Data Cleaning After Extraction

```python
import pandas as pd

def clean_extracted_data(df):
    """Clean common issues in PDF-extracted data"""
    # Remove completely empty rows
    df = df.dropna(how='all')

    # Strip whitespace from string columns
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].str.strip()

    # Remove rows where all cells are empty strings
    df = df[df.apply(lambda row: any(cell != '' for cell in row), axis=1)]

    # Convert numeric columns
    for col in df.columns:
        # Try to convert to numeric
        numeric_series = pd.to_numeric(df[col], errors='coerce')
        if numeric_series.notna().sum() > len(df) * 0.5:  # More than 50% numeric
            df[col] = numeric_series

    return df

# Usage
df = extract_tables_from_pdf("document.pdf")
df_clean = clean_extracted_data(df)
df_clean.to_excel("clean_data.xlsx", index=False)
```

## Export Options

```python
import pandas as pd
import json

def export_to_multiple_formats(df, base_name):
    """Export DataFrame to multiple formats"""
    # Excel
    df.to_excel(f"{base_name}.xlsx", index=False)

    # CSV
    df.to_csv(f"{base_name}.csv", index=False, encoding='utf-8-sig')

    # JSON
    df.to_json(f"{base_name}.json", orient='records', indent=2)

    # JSON Lines (for large datasets)
    df.to_json(f"{base_name}.jsonl", orient='records', lines=True)

# Usage
df = extract_tables_from_pdf("document.pdf")
export_to_multiple_formats(df, "extracted_data")
```

## Quick Reference

| Task | Tool | Code |
|------|------|------|
| Extract table | pdfplumber | `page.extract_table()` |
| Extract text | pdfplumber | `page.extract_text()` |
| OCR scanned | pytesseract | `pytesseract.image_to_string(image)` |
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Convert to image | pdf2image | `convert_from_path(pdf)` |

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Table not detected | Try adjusting table settings: `page.extract_table(table_settings={})` |
| Wrong column alignment | Use visual debugging: `page.to_image().debug_tablefinder()` |
| OCR quality poor | Increase DPI, preprocess image, use correct language |
| Memory issues | Process pages one at a time, close PDF after processing |
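
For the "Table not detected" case, the empty `table_settings={}` can be filled with pdfplumber's strategy and tolerance keys. The fragment below is a sketch of two settings dictionaries worth trying; the tolerance values of 5 and the names `TEXT_ALIGNED_SETTINGS`, `LOOSE_LINE_SETTINGS`, and `settings_to_try` are assumptions, not recommended defaults.

```python
# For tables drawn without ruling lines: infer cell edges from the
# alignment of words instead of from drawn lines
TEXT_ALIGNED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

# For ruled tables whose lines do not quite meet: widen the tolerances
LOOSE_LINE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 5,
    "intersection_tolerance": 5,
}

def settings_to_try():
    """Yield (label, settings) pairs, from the defaults to the most permissive."""
    yield "default", {}
    yield "loose-lines", LOOSE_LINE_SETTINGS
    yield "text-aligned", TEXT_ALIGNED_SETTINGS

# Usage (requires pdfplumber and an open page):
# for label, settings in settings_to_try():
#     table = page.extract_table(table_settings=settings)
#     if table:
#         break
```

Trying the stricter line-based settings before the text-based ones tends to avoid false tables in pages that mix prose and tabular data, though the right order depends on the documents.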

## Resources

- **Book**: "Data-Driven Construction" by Artem Boiko, Chapter 2.4
- **Website**: https://datadrivenconstruction.io
- **pdfplumber Docs**: https://github.com/jsvine/pdfplumber
- **Tesseract OCR**: https://github.com/tesseract-ocr/tesseract

## Next Steps

- See `image-to-data` for image processing
- See `cad-to-data` for CAD/BIM data extraction
- See `etl-pipeline` for automated processing workflows
- See `data-quality-check` for validating extracted data
Usage Guidance
This skill appears to do what it says: extract tables/text from PDFs and run OCR for scanned pages. Before installing or running it:
  1. Note that you must install the Tesseract system binary (and any OS-level libraries for OpenCV/pdf2image). The SKILL.md mentions this, but the skill metadata does not list required binaries.
  2. Install Python packages from trusted sources (pip), preferably in a virtualenv.
  3. Be cautious when processing sensitive documents. The skill extracts and produces structured outputs that could contain confidential data, so ensure outputs are stored or transmitted securely.
  4. The skill has no homepage, and the registry ownerId in _meta.json differs from the published ownerId. If provenance matters to you, seek a known source or ask the publisher for verification.
  5. If you plan to run this automatically, run it in an isolated environment (container/VM) and review any outputs before sharing externally.
Capability Analysis
Type: OpenClaw Skill Name: habib-pdf-to-json Version: 1.0.0 The skill bundle is benign. All code examples in SKILL.md demonstrate local PDF processing, text extraction, and data conversion to local files (Excel, CSV, JSON). There are no network calls, no attempts to access sensitive system resources or environment variables, no obfuscation, and no evidence of prompt injection against the AI agent. The dependencies are standard for PDF and OCR operations, and the instructions are purely descriptive and functional.
Capability Assessment
Purpose & Capability
Name/description match the SKILL.md: the skill focuses on extracting tables and text from PDFs using pdfplumber for native PDFs and pytesseract/pdf2image/OpenCV for scanned PDFs. No unrelated credentials, binaries, or external services are requested. Note: registry metadata lists no homepage/source and _meta.json ownerId differs from the published ownerId, which is a metadata inconsistency to be aware of but does not change the functional coherence.
Instruction Scope
SKILL.md contains concrete, narrow instructions and example Python code to extract tables/text and run OCR; it does not instruct reading unrelated files, accessing credentials, or sending data to external endpoints. One minor scope issue: the instructions reference installing the Tesseract OCR system binary but the skill's declared requirements do not list required binaries — the agent or operator must install Tesseract separately for OCR to work.
Install Mechanism
Instruction-only skill (no install spec, no code files) — lowest install risk. It recommends pip packages (pdfplumber, pandas, pytesseract, pdf2image, pypdf, openpyxl, OpenCV, numpy) and a system Tesseract binary. These are standard for the stated task; there is no remote download/install automation in the skill that would fetch arbitrary code.
Credentials
No environment variables, credentials, or config paths are requested. The required packages and binaries are proportional to PDF extraction and OCR work. Because the skill processes documents, consider that outputs may contain sensitive data — but the skill itself does not request secrets or remote endpoints.
Persistence & Privilege
always:false and no install hooks or config changes are present. The skill is user-invocable and can be invoked autonomously per platform default; there is no indication it requests persistent or elevated privileges or modifies other skills.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install habib-pdf-to-json
  3. After installation, invoke the skill by name or use /habib-pdf-to-json
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
PDF To Structured v1.0.0: Initial release
  - Extracts structured data (tables, text) from construction-related PDFs using pdfplumber.
  - Supports export to Excel, CSV, or JSON.
  - OCR processing available for scanned PDFs with Tesseract and pdf2image.
  - Provides usage examples for native and scanned PDF extraction.
  - Includes functions for extracting tables, text, BOMs, and schedules from construction documents.
Metadata
Slug habib-pdf-to-json
Version 1.0.0
License
All-time Installs 1
Active Installs 1
Total Versions 1
Frequently Asked Questions

What is habib-pdf-to-json?

Extract structured data from construction PDFs. Convert specifications, BOMs, schedules, and reports from PDF to Excel/CSV/JSON. Use OCR for scanned documents and pdfplumber for native PDFs. It is an AI Agent Skill for Claude Code / OpenClaw, with 1133 downloads so far.

How do I install habib-pdf-to-json?

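Run "/install habib-pdf-to-json" in the OpenClaw or Claude Code chat to install the skill in one step. The Python packages and the Tesseract binary listed under Installation are not installed by this command and must be set up separately.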

Is habib-pdf-to-json free?

Yes, habib-pdf-to-json is completely free (open-source). You can download, install and use it at no cost.

Which platforms does habib-pdf-to-json support?

habib-pdf-to-json is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created habib-pdf-to-json?

It is built and maintained by dbmoradi60 (@dbmoradi60); the current version is v1.0.0.
