Description

Extract structured data from unstructured sources. Parse JSON, CSV, logs, and mixed formats into clean, usable data. Handle malformed data, nested structures...

README (SKILL.md)

Data Extractor

Name: Data Extractor
Author: engsathiago

Extract structured, clean data from unstructured or messy sources. Turn chaos into usable data.

Supported Formats

JSON

Nested objects
Arrays of objects
Mixed types
Malformed JSON
Large files (streaming)

CSV

Headers or headerless
Various delimiters
Quoted fields
Multiline values
Large files (streaming)

Logs

Application logs
Server logs
Error logs
Access logs
Custom formats

Text

Key-value pairs
Tables
Lists
Mixed content

Common Patterns

Extract JSON from Text

Input: 'The response was {"status": "ok", "data": [1, 2, 3]} and then...'
Output: {"status": "ok", "data": [1, 2, 3]}

import re
import json

def extract_json(text):
    # Find JSON-like structures
    pattern = r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}'
    matches = re.findall(pattern, text)
    
    for match in matches:
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            continue
    return None
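
The regex above only follows one level of nested braces and does not look for arrays. As an alternative, here is a minimal sketch that lets the standard library's json.JSONDecoder.raw_decode do the bracket matching. The function name extract_first_json is hypothetical and not part of the skill.

```python
import json

def extract_first_json(text):
    """Return the first valid JSON object or array embedded in text, else None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue  # only try to decode where an object or array could begin
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
    return None
```

Because the decoder handles nesting itself, this copes with arbitrarily deep objects, at the cost of retrying at every opening bracket in the text.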

Parse CSV with Issues

# Handle: missing values, inconsistent quotes, mixed delimiters

import csv
from io import StringIO

def parse_messy_csv(text):
    lines = text.strip().split('\n')
    
    # Detect delimiter
    delimiters = [',', ';', '\t', '|']
    delimiter = ','
    for d in delimiters:
        if lines[0].count(d) > lines[0].count(delimiter):
            delimiter = d
    
    # Parse with error handling
    reader = csv.reader(StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        # Clean each field
        cleaned = [field.strip().strip('"').strip("'") for field in row]
        rows.append(cleaned)
    
    return rows
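
Counting delimiters in the first line is a simple heuristic. The standard library also ships csv.Sniffer, which inspects a sample of several lines. The sketch below is one possible way to use it, falling back to a comma when sniffing fails. The name sniff_and_parse and the 4096-character sample size are assumptions made for illustration.

```python
import csv
from io import StringIO

def sniff_and_parse(text, candidates=",;\t|"):
    """Parse CSV text, letting csv.Sniffer guess the delimiter."""
    try:
        # Sniff only a prefix so very large inputs stay cheap to inspect.
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
        delimiter = dialect.delimiter
    except csv.Error:
        delimiter = ","  # sniffing failed; fall back to the common default
    return list(csv.reader(StringIO(text), delimiter=delimiter))
```

Sniffer is itself heuristic, so it is worth checking the detected delimiter on a few real samples before relying on it.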

Extract Key-Value Pairs

Input: "name: John, age: 30, city: New York"
Output: {"name": "John", "age": "30", "city": "New York"}

import re

def extract_key_value(text):
    patterns = [
        r'(\w+)\s*:\s*([^,\n]+)',      # key: value
        r'(\w+)\s*=\s*([^,\n]+)',      # key=value
        r'"?(\w+)"?\s*[:=]\s*"?([^,"\n]+)"?',  # quoted variants
    ]
    
    result = {}
    for pattern in patterns:
        matches = re.findall(pattern, text)
        for key, value in matches:
            result[key.strip()] = value.strip()
    
    return result

Parse Logs

# Common log formats

import json
import re
from datetime import datetime

def parse_log_line(line):
    # Try common patterns
    
    # Apache/Nginx access log
    pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) ([^"]+) HTTP/\d\.\d" (\d+) (\d+)'
    match = re.match(pattern, line)
    if match:
        return {
            "ip": match.group(1),
            "timestamp": match.group(2),
            "method": match.group(3),
            "path": match.group(4),
            "status": int(match.group(5)),
            "size": int(match.group(6))
        }
    
    # JSON log
    if line.startswith('{'):
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            pass  # not valid JSON; fall through to the other formats
    
    # Key-value log
    if '=' in line:
        return extract_key_value(line)
    
    return {"raw": line}
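
Log lines that separate key=value pairs with spaces rather than commas (for example name=John action=login time=12:00) are not split correctly by comma-based value patterns, and a colon inside a value can be mistaken for a key separator. A minimal sketch of a space-delimited parser follows. The names KV_PAIR and parse_kv_log are hypothetical, and the quoting rule (double quotes around values that contain spaces) is an assumption about the log format.

```python
import re

# A key, an equals sign, then either a double-quoted value or a run of non-space characters.
KV_PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_kv_log(line):
    """Parse space-separated key=value pairs into a dict of strings."""
    return {key: value.strip('"') for key, value in KV_PAIR.findall(line)}
```

Values are left as strings here; converting them to numbers or timestamps is a separate step.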

Handling Edge Cases

Malformed JSON

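import re

def fix_json(text):
    # Heuristic repairs: these can corrupt string values that contain
    # quotes, commas, or colons, so validate the result with json.loads.
    
    # Single quotes to double quotes
    text = re.sub(r"'([^']*)'", r'"\1"', text)
    
    # Unquoted keys (only bare identifiers directly after '{' or ',',
    # so keys that are already quoted are left alone)
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    
    # Trailing commas
    text = re.sub(r',\s*([}\]])', r'\1', text)
    
    # Missing quotes around bare-word values (true, false, null stay as-is)
    text = re.sub(r':\s*(?!(?:true|false|null)\b)([A-Za-z_]\w*)(?=\s*[,}\]])', r': "\1"', text)
    
    return text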
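
When the malformed input is really a Python-style literal (single-quoted strings, True/False/None), regex substitution may be unnecessary. The sketch below tries ast.literal_eval, which evaluates literals only and does not run arbitrary code, then checks that the result can be serialized as JSON. The function name parse_python_style is hypothetical.

```python
import ast
import json

def parse_python_style(text):
    """Parse a Python-literal dict or list; return None if it is not one."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None  # not a Python literal (e.g. unquoted keys)
    try:
        json.dumps(value)  # reject literals JSON cannot represent, such as sets
    except (TypeError, ValueError):
        return None
    return value
```

This does not help with unquoted keys, so it complements the regex fixes rather than replacing them.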

Large Files

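import csv
import json

def stream_jsonl(file_path):
    """Stream JSON Lines (JSONL) format"""
    with open(file_path, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines

def stream_csv(file_path, chunk_size=1000):
    """Stream CSV in chunks"""
    # newline='' lets the csv module handle newlines inside quoted fields
    with open(file_path, 'r', newline='') as f:
        reader = csv.reader(f)
        headers = next(reader, None)
        if headers is None:
            return  # empty file: nothing to yield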
        
        chunk = []
        for row in reader:
            chunk.append(dict(zip(headers, row)))
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        
        if chunk:
            yield chunk
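
The same chunking idea can be written with csv.DictReader and itertools.islice, which removes the manual chunk bookkeeping. This is a sketch under the assumption that the first row holds the headers. The name stream_csv_dicts is hypothetical, and it accepts any iterable of lines (an open file or a list) so it can be tried without touching the filesystem.

```python
import csv
import itertools

def stream_csv_dicts(lines, chunk_size=1000):
    """Yield lists of row dicts, at most chunk_size rows each."""
    reader = csv.DictReader(lines)  # first row is consumed as the header
    while True:
        # islice pulls at most chunk_size rows without reading further ahead
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk
```

With a real file, pass the handle from open(path, newline='') as the lines argument.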

Mixed Formats

def detect_and_parse(content):
    """Auto-detect format and parse"""
    
    content = content.strip()
    
    # JSON
    if content.startswith('{') or content.startswith('['):
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            pass
    
    # JSONL
    if '\n{' in content:
        try:
            return [json.loads(line) for line in content.split('\n') if line.strip()]
        except json.JSONDecodeError:
            pass
    
    # CSV
    if ',' in content and '\n' in content:
        lines = content.split('\n')
        if len(lines) > 1:
            return parse_messy_csv(content)
    
    # Key-value
    if '=' in content or ':' in content:
        return extract_key_value(content)
    
    # Lines
    return content.split('\n')

Data Cleaning

Remove Duplicates

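def deduplicate(data, key=None):
    """Remove duplicates from a list, keeping first occurrences in order."""
    if not isinstance(data, list):
        return data
    seen = set()
    result = []
    for item in data:
        val = item.get(key) if key and isinstance(item, dict) else item
        try:
            hash(val)
            marker = val
        except TypeError:
            # dicts and lists are unhashable; use their repr as a stand-in
            marker = ('unhashable', repr(val))
        if marker not in seen:
            seen.add(marker)
            result.append(item)
    return result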

Normalize Values

def normalize(data):
    if isinstance(data, dict):
        return {k: normalize(v) for k, v in data.items()}
    elif isinstance(data, list):
        return [normalize(item) for item in data]
    elif isinstance(data, str):
        # Lowercase, trim, standardize whitespace
        data = data.lower().strip()
        data = re.sub(r'\s+', ' ', data)
        
        # Convert common values
        if data in ('true', 'yes', 'on'):
            return True
        if data in ('false', 'no', 'off'):
            return False
        if data in ('null', 'none', 'n/a', ''):
            return None
        
        # Try numeric
        try:
            return int(data)
        except ValueError:
            try:
                return float(data)
            except ValueError:
                pass
        
        return data
    return data

Validate Schema

def validate(data, schema):
    errors = []
    
    # Required fields
    for field in schema.get('required', []):
        if field not in data:
            errors.append(f"Missing required field: {field}")
    
    # Type checking
    for field, expected_type in schema.get('types', {}).items():
        if field in data and not isinstance(data[field], expected_type):
            errors.append(f"Field {field} should be {expected_type.__name__}")
    
    # Value ranges
    for field, (min_val, max_val) in schema.get('ranges', {}).items():
        if field in data:
            if not (min_val <= data[field] <= max_val):
                errors.append(f"Field {field} out of range: {data[field]}")
    
    return len(errors) == 0, errors
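
Two details of a range check like the one above are easy to miss: comparing a string such as "forty" against numeric bounds raises TypeError, and bool is a subclass of int, so True would pass as a number. A minimal sketch of a stricter helper follows. The name in_range is hypothetical and not part of the skill.

```python
from numbers import Real

def in_range(value, min_val, max_val):
    """True only for real numbers (not bools) within [min_val, max_val]."""
    if isinstance(value, bool) or not isinstance(value, Real):
        return False  # strings, None, and booleans never count as in range
    return min_val <= value <= max_val
```

A validator could call this helper in place of the bare comparison and report a type error separately when it returns False for a non-number.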

Output Formats

To JSON

import json

def to_json(data, pretty=True):
    if pretty:
        return json.dumps(data, indent=2, ensure_ascii=False)
    return json.dumps(data, ensure_ascii=False)

To CSV

import csv
from io import StringIO

def to_csv(data, headers=None):
    if not data:
        return ""
    
    output = StringIO()
    
    if isinstance(data[0], dict):
        headers = headers or list(data[0].keys())
        writer = csv.DictWriter(output, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    else:
        writer = csv.writer(output)
        if headers:
            writer.writerow(headers)
        writer.writerows(data)
    
    return output.getvalue()

To Markdown Table

def to_markdown_table(data):
    if not data:
        return ""
    
    if isinstance(data[0], dict):
        headers = list(data[0].keys())
        rows = [[str(row.get(h, '')) for h in headers] for row in data]
    else:
        headers = [f"Col {i+1}" for i in range(len(data[0]))]
        rows = data
    
    # Build table
    result = []
    result.append('| ' + ' | '.join(headers) + ' |')
    result.append('| ' + ' | '.join(['---'] * len(headers)) + ' |')
    
    for row in rows:
        result.append('| ' + ' | '.join(str(cell) for cell in row) + ' |')
    
    return '\n'.join(result)

Usage Examples

Example 1: Extract JSON from API Response

Input (messy):
"API returned: {status: 'success', data: {users: [{id: 1, name: 'John'}, {id: 2, name: 'Jane'}]}, timestamp: '2026-03-16'}"

Output (clean):
{
  "status": "success",
  "data": {
    "users": [
      {"id": 1, "name": "John"},
      {"id": 2, "name": "Jane"}
    ]
  },
  "timestamp": "2026-03-16"
}

Example 2: Parse Mixed Log

Input:
192.168.1.1 - - [16/Mar/2026:12:00:00 +0000] "GET /api HTTP/1.1" 200 1234
{"level": "INFO", "message": "User logged in", "user_id": 123}
name=John action=login time=12:00

Output:
[
  {"ip": "192.168.1.1", "timestamp": "16/Mar/2026:12:00:00 +0000", "method": "GET", "path": "/api", "status": 200, "size": 1234},
  {"level": "INFO", "message": "User logged in", "user_id": 123},
  {"name": "John", "action": "login", "time": "12:00"}
]
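
The access-log timestamp in this output is still a string. If a real datetime is needed, the standard library's datetime.strptime can parse it, as in the sketch below. The names ACCESS_LOG_TIME_FORMAT and parse_access_timestamp are hypothetical, and the %b month abbreviation is locale-dependent, so this assumes an English (or C) locale.

```python
from datetime import datetime

# Day/abbreviated month/year:hour:minute:second, then a UTC offset such as +0000.
ACCESS_LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_access_timestamp(raw):
    """Return a timezone-aware datetime, or None if raw does not match."""
    try:
        return datetime.strptime(raw, ACCESS_LOG_TIME_FORMAT)
    except ValueError:
        return None
```

Calling isoformat() on the result gives a string that serializes cleanly to JSON.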

Example 3: Clean CSV

Input (messy):
name,age,city
"John", 30, "New York"
'Jane',,Los Angeles
"Bob","forty","Chicago"

Output (clean):
[
  {"name": "John", "age": 30, "city": "New York"},
  {"name": "Jane", "age": null, "city": "Los Angeles"},
  {"name": "Bob", "age": "forty", "city": "Chicago"}
]
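
One possible way to get from this input to this output is sketched below, using csv.DictReader with skipinitialspace=True so that quoted fields following a space are still recognized. The names coerce, clean_csv, and SAMPLE are hypothetical. Only integer conversion is attempted, which is an assumption that fits the example.

```python
import csv
from io import StringIO

def coerce(value):
    """Strip stray quotes; map empty to None and integer-looking text to int."""
    if value is None:
        return None  # DictReader fills missing trailing fields with None
    value = value.strip().strip("'\"")
    if value == "":
        return None
    try:
        return int(value)
    except ValueError:
        return value  # leave non-numeric text such as "forty" unchanged

def clean_csv(text):
    reader = csv.DictReader(StringIO(text), skipinitialspace=True)
    return [{key: coerce(val) for key, val in row.items()} for row in reader]

SAMPLE = (
    'name,age,city\n'
    '"John", 30, "New York"\n'
    "'Jane',,Los Angeles\n"
    '"Bob","forty","Chicago"\n'
)
```

Leaving "forty" as a string keeps the raw value available for a later validation step instead of silently dropping it.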

Best Practices

Always validate input - Check format before parsing
Handle errors gracefully - Log and continue or fail cleanly
Stream large files - Don't load everything into memory
Normalize consistently - Same rules for all data
Document transformations - What changed and why
Preserve originals - Keep raw data until confirmed clean
Test edge cases - Empty, null, malformed, very large
Use appropriate types - Numbers as numbers, dates as dates
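
As an illustration of "log and continue", the sketch below parses JSON Lines input and collects the line number and message for every line that fails, instead of silently dropping it. The function name parse_jsonl_lenient and the (records, errors) return shape are assumptions chosen for the example.

```python
import json

def parse_jsonl_lenient(lines):
    """Return (records, errors); errors holds (line_number, message) tuples."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((lineno, exc.msg))  # remember what failed and where
    return records, errors
```

The caller can then decide whether a non-empty errors list should abort the run or only be reported.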

Performance Tips

Use streaming for files > 10MB
Batch processing for database inserts
Parallel parsing for independent chunks
Lazy evaluation with generators
Cache parsed results if reused frequently
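
The generator tip can be sketched as a small pipeline in which each stage pulls one record at a time, so taking the first few matches never reads the rest of the input. The names read_records, only_errors, and first_n are hypothetical, as is the "level" field used for filtering.

```python
import json
from itertools import islice

def read_records(lines):
    """Lazily decode one JSON record per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def only_errors(records):
    """Lazily keep records whose 'level' field equals 'ERROR'."""
    return (record for record in records if record.get("level") == "ERROR")

def first_n(iterable, n):
    """Materialize at most n items, leaving the rest of the input unread."""
    return list(islice(iterable, n))
```

For example, first_n(only_errors(read_records(open_file)), 10) stops reading as soon as ten matching records have been found.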

Usage Guidance

This skill appears to do what it says: example Python code for extracting and cleaning messy data. Before using it, keep in mind:

1) The snippets read files by path, so only give it files you intend to expose (avoid system or sensitive files).
2) The regex-based fixes are heuristic and can corrupt or misinterpret input (test on non-sensitive samples first).
3) Some regex patterns can be slow or brittle on adversarial inputs; limit file size and runtime, or use streaming libraries (ijson, csv with proper dialect detection) for large or hostile inputs.
4) Because it's instruction-only, the agent will try to run code in whatever runtime is available; confirm the execution environment and sandboxing, and avoid granting elevated filesystem access.
5) If you need production-grade parsing, consider using well-maintained parsing libraries rather than raw regex fixes.

If you want extra assurance, request a version that includes robust error handling, explicit file-path validation, and limits on input size/timeouts.

Capability Analysis

Type: OpenClaw Skill
Name: data-extractor
Version: 1.0.0

The 'data-extractor' skill bundle contains standard Python utility functions and instructions for parsing and cleaning structured data from various formats like JSON, CSV, and logs. The code uses safe libraries (re, json, csv) and follows best practices for data processing without any indicators of malicious intent, data exfiltration, or prompt injection attacks.

Capability Assessment

✓ Purpose & Capability

Name and description match the SKILL.md: it provides Python snippets and patterns for extracting JSON, CSV, logs, key-value pairs, streaming large files, and cleaning data. No unrelated binaries, env vars, or config paths are requested.

ℹ Instruction Scope

The instructions are narrowly focused on parsing/cleaning data and include functions that read files (stream_jsonl, stream_csv) and perform regex-based fixes. This is expected for a data-extraction skill, but the snippets assume access to arbitrary file paths provided at runtime and use brittle regex substitutions that can mis-parse or mutate input. There are no instructions to access system-wide config, credentials, or unexpected external endpoints.

✓ Install Mechanism

Instruction-only skill with no install spec and no bundled code—nothing is written to disk by the skill itself. This is the lowest-risk install posture.

✓ Credentials

The skill requests no environment variables or credentials. All operations use local I/O and standard Python libraries in the examples; the required privileges are proportionate to the stated purpose.

✓ Persistence & Privilege

The 'always' setting is false and model invocation is allowed (the platform default). The skill does not request persistent presence, nor does it instruct modification of other skills or system-wide agent settings.

Version History

v1.0.0

Initial release. Extract structured data from unstructured sources. Parse JSON, CSV, logs, and mixed formats. Handle malformed data, nested structures, and large files efficiently.

Metadata

Slug data-extractor

Version 1.0.0

License MIT-0

All-time Installs 4

Active Installs 3

Total Versions 1

Frequently Asked Questions

What is Data Extractor?

Data Extractor is an AI Agent Skill for Claude Code / OpenClaw that extracts structured data from unstructured sources. It parses JSON, CSV, logs, and mixed formats into clean, usable data, and handles malformed data, nested structures... It has 330 downloads so far.

How do I install Data Extractor?

Run "/install data-extractor" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Data Extractor free?

Yes, Data Extractor is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Data Extractor support?

Data Extractor is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Data Extractor?

It is built and maintained by engsathiago (@engsathiago); the current version is v1.0.0.
