Data Extractor
Author: engsathiago · v1.0.0 · MIT-0
Total downloads: 330
Favorites: 0
Current installs: 3
Versions: 1
Install in OpenClaw
/install data-extractor
Description
Extract structured data from unstructured sources. Parse JSON, CSV, logs, and mixed formats into clean, usable data. Handle malformed data, nested structures...
Usage (SKILL.md)
Data Extractor
Extract structured, clean data from unstructured or messy sources. Turn chaos into usable data.
Supported Formats
JSON
- Nested objects
- Arrays of objects
- Mixed types
- Malformed JSON
- Large files (streaming)
CSV
- Headers or headerless
- Various delimiters
- Quoted fields
- Multiline values
- Large files (streaming)
Logs
- Application logs
- Server logs
- Error logs
- Access logs
- Custom formats
Text
- Key-value pairs
- Tables
- Lists
- Mixed content
Common Patterns
Extract JSON from Text
Input: "The response was {'status': 'ok', 'data': [1, 2, 3]} and then..."
Output: {"status": "ok", "data": [1, 2, 3]}
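import ast
import json
import re

def extract_json(text):
    # Find JSON-like structures (the pattern handles one level of nesting)
    pattern = r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}'
    for match in re.findall(pattern, text):
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            pass
        # Single-quoted input is not valid JSON; try Python literal syntax
        try:
            return ast.literal_eval(match)
        except (ValueError, SyntaxError):
            continue
    return None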
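The regex above only matches one level of nesting. As a hedged alternative, the sketch below scans for an opening brace or bracket and lets the standard library's `json.JSONDecoder.raw_decode` find where the value ends, so deeper nesting works. The function name `extract_first_json` and the sample string are illustrative assumptions, not part of the skill, and unlike `extract_json` it accepts only strictly valid JSON.

```python
import json

def extract_first_json(text):
    """Return the first valid JSON object or array embedded in text, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return value
    return None

# Hypothetical sample with two levels of nesting
sample = 'The response was {"status": "ok", "data": {"items": [{"id": 1}]}} and then...'
print(extract_first_json(sample))
```

Because the decoder does the bracket matching, this avoids the backtracking cost a nested regex can incur on long inputs.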
Parse CSV with Issues
# Handle: missing values, inconsistent quotes, mixed delimiters
import csv
from io import StringIO

def parse_messy_csv(text):
    text = text.strip()
    if not text:
        return []
    header = text.split('\n')[0]
    # Detect delimiter: pick the candidate that appears most often in the header
    delimiters = [',', ';', '\t', '|']
    delimiter = ','
    for d in delimiters:
        if header.count(d) > header.count(delimiter):
            delimiter = d
    # skipinitialspace lets a quoted field that follows a space parse correctly
    reader = csv.reader(StringIO(text), delimiter=delimiter, skipinitialspace=True)
    rows = []
    for row in reader:
        # Clean each field
        cleaned = [field.strip().strip('"').strip("'") for field in row]
        rows.append(cleaned)
    return rows
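Counting characters in the header is a simple heuristic. A possible alternative, sketched below, is the standard library's `csv.Sniffer`, which inspects several rows before choosing a delimiter. The function name `sniff_and_parse` and the candidate set are assumptions for illustration, and the sniffer can still guess wrong on very short or irregular samples, hence the comma fallback.

```python
import csv
from io import StringIO

def sniff_and_parse(text, candidates=",;\t|"):
    """Parse CSV text, guessing the delimiter with csv.Sniffer."""
    try:
        dialect = csv.Sniffer().sniff(text, delimiters=candidates)
        delimiter = dialect.delimiter
    except csv.Error:
        delimiter = ","  # Sniffer could not decide; fall back to a comma
    reader = csv.reader(StringIO(text), delimiter=delimiter, skipinitialspace=True)
    return [[field.strip() for field in row] for row in reader]

# Hypothetical semicolon-separated sample with a missing value
rows = sniff_and_parse("name;age;city\nJohn;30;New York\nJane;;Los Angeles\n")
print(rows)
```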
Extract Key-Value Pairs
Input: "name: John, age: 30, city: New York"
Output: {"name": "John", "age": "30", "city": "New York"}
import re

def extract_key_value(text):
    result = {}
    # key=value pairs: the value is a quoted string or runs to the next space/comma
    pattern = r'(\w+)\s*=\s*(?:"([^"]*)"|([^\s,]+))'
    for key, quoted, bare in re.findall(pattern, text):
        result[key] = quoted or bare
    if result:
        return result
    # key: value pairs: the value runs to the next comma or newline
    pattern = r'"?(\w+)"?\s*:\s*"?([^,"\n]+)"?'
    for key, value in re.findall(pattern, text):
        result[key.strip()] = value.strip()
    return result
Parse Logs
# Common log formats
import json
import re

def parse_log_line(line):
    line = line.strip()
    # Try common patterns
    # Apache/Nginx access log (size is "-" when no body was sent)
    pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) ([^"]+) HTTP/\d\.\d" (\d+) (\d+|-)'
    match = re.match(pattern, line)
    if match:
        size = match.group(6)
        return {
            "ip": match.group(1),
            "timestamp": match.group(2),
            "method": match.group(3),
            "path": match.group(4),
            "status": int(match.group(5)),
            "size": 0 if size == '-' else int(size)
        }
    # JSON log
    if line.startswith('{'):
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            pass
    # Key-value log (uses extract_key_value from the previous section)
    if '=' in line:
        return extract_key_value(line)
    return {"raw": line}
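The parser keeps the access-log timestamp as a raw string. If a real `datetime` is wanted (in line with the "dates as dates" advice later in this document), one possible approach is `datetime.strptime` with the common access-log layout. The helper name `parse_access_timestamp` is an assumption, and `%b` depends on the process locale, so this sketch assumes English month abbreviations.

```python
from datetime import datetime

# Layout of timestamps such as "16/Mar/2026:12:00:00 +0000"
ACCESS_LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_access_timestamp(raw):
    """Convert an access-log timestamp into a timezone-aware datetime, or None."""
    try:
        return datetime.strptime(raw, ACCESS_LOG_TIME_FORMAT)
    except ValueError:
        return None  # leave unparseable timestamps for the caller to handle

parsed = parse_access_timestamp("16/Mar/2026:12:00:00 +0000")
print(parsed.isoformat())  # 2026-03-16T12:00:00+00:00
```

Returning `None` rather than raising keeps the behaviour close to `parse_log_line`, which degrades to a raw record instead of failing.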
Handling Edge Cases
Malformed JSON
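import re

def fix_json(text):
    # Common fixes (heuristic: can corrupt strings that contain quotes or colons)
    # Single quotes to double quotes
    text = re.sub(r"'([^']*)'", r'"\1"', text)
    # Unquoted keys: only quote a bare word that follows "{" or ","
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Trailing commas
    text = re.sub(r',\s*([}\]])', r'\1', text)
    # Missing quotes around values, leaving true/false/null alone
    text = re.sub(r':\s*(?!(?:true|false|null)\b)([a-zA-Z_]\w*)(?=\s*[,}\]])', r': "\1"', text)
    return text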
Large Files
import csv
import json

def stream_jsonl(file_path):
    """Stream JSON Lines (JSONL) format, skipping lines that fail to parse"""
    with open(file_path, 'r') as f:
        for line in f:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue

def stream_csv(file_path, chunk_size=1000):
    """Stream CSV in chunks of dicts keyed by the header row"""
    # newline='' lets the csv module handle quoted multiline fields itself
    with open(file_path, 'r', newline='') as f:
        reader = csv.reader(f)
        headers = next(reader, None)
        if headers is None:
            return  # empty file
        chunk = []
        for row in reader:
            chunk.append(dict(zip(headers, row)))
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
Mixed Formats
def detect_and_parse(content):
    """Auto-detect format and parse"""
    content = content.strip()
    # JSON
    if content.startswith('{') or content.startswith('['):
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            pass
    # JSONL
    if '\n{' in content:
        try:
            return [json.loads(line) for line in content.split('\n') if line.strip()]
        except json.JSONDecodeError:
            pass
    # CSV
    if ',' in content and '\n' in content:
        return parse_messy_csv(content)
    # Key-value
    if '=' in content or ':' in content:
        return extract_key_value(content)
    # Lines
    return content.split('\n')
Data Cleaning
Remove Duplicates
def deduplicate(data, key=None):
    if isinstance(data, list):
        if key:
            seen = set()
            result = []
            for item in data:
                val = item.get(key) if isinstance(item, dict) else item
                if val not in seen:
                    seen.add(val)
                    result.append(item)
            return result
        # dict.fromkeys keeps first-seen order; items must be hashable
        return list(dict.fromkeys(data))
    return data
Normalize Values
def normalize(data):
    if isinstance(data, dict):
        return {k: normalize(v) for k, v in data.items()}
    elif isinstance(data, list):
        return [normalize(item) for item in data]
    elif isinstance(data, str):
        # Lowercase, trim, standardize whitespace
        data = data.lower().strip()
        data = re.sub(r'\s+', ' ', data)
        # Convert common values
        if data in ('true', 'yes', 'on'):
            return True
        if data in ('false', 'no', 'off'):
            return False
        if data in ('null', 'none', 'n/a', ''):
            return None
        # Try numeric
        try:
            return int(data)
        except ValueError:
            try:
                return float(data)
            except ValueError:
                pass
        return data
    return data
Validate Schema
def validate(data, schema):
    errors = []
    # Required fields
    for field in schema.get('required', []):
        if field not in data:
            errors.append(f"Missing required field: {field}")
    # Type checking
    for field, expected_type in schema.get('types', {}).items():
        if field in data and not isinstance(data[field], expected_type):
            errors.append(f"Field {field} should be {expected_type.__name__}")
    # Value ranges
    for field, (min_val, max_val) in schema.get('ranges', {}).items():
        if field in data:
            try:
                in_range = min_val <= data[field] <= max_val
            except TypeError:
                in_range = False  # value is not comparable with the bounds
            if not in_range:
                errors.append(f"Field {field} out of range: {data[field]}")
    return len(errors) == 0, errors
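The schema shape is only implied by the code: a dict with optional `required` (list of field names), `types` (field to Python type), and `ranges` (field to `(min, max)` tuple). The sketch below shows one way to call it. The `user_schema` and the sample records are hypothetical, and a condensed copy of the validator is repeated so the snippet runs on its own.

```python
def validate(data, schema):
    """Condensed copy of the validator above, repeated so this example is runnable."""
    errors = []
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"Missing required field: {field}")
    for field, expected_type in schema.get("types", {}).items():
        if field in data and not isinstance(data[field], expected_type):
            errors.append(f"Field {field} should be {expected_type.__name__}")
    for field, (min_val, max_val) in schema.get("ranges", {}).items():
        if field in data and not (min_val <= data[field] <= max_val):
            errors.append(f"Field {field} out of range: {data[field]}")
    return len(errors) == 0, errors

# Hypothetical schema for a user record
user_schema = {
    "required": ["name", "age"],
    "types": {"name": str, "age": int},
    "ranges": {"age": (0, 150)},
}

print(validate({"name": "John", "age": 30}, user_schema))   # (True, [])
print(validate({"name": "Bob", "age": 200}, user_schema))   # (False, ['Field age out of range: 200'])
print(validate({"age": 30}, user_schema))                   # (False, ['Missing required field: name'])
```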
Output Formats
To JSON
import json

def to_json(data, pretty=True):
    if pretty:
        return json.dumps(data, indent=2, ensure_ascii=False)
    return json.dumps(data, ensure_ascii=False)
To CSV
import csv
from io import StringIO

def to_csv(data, headers=None):
    if not data:
        return ""
    output = StringIO()
    if isinstance(data[0], dict):
        headers = headers or list(data[0].keys())
        writer = csv.DictWriter(output, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    else:
        writer = csv.writer(output)
        if headers:
            writer.writerow(headers)
        writer.writerows(data)
    return output.getvalue()
To Markdown Table
def to_markdown_table(data):
    if not data:
        return ""
    if isinstance(data[0], dict):
        headers = list(data[0].keys())
        rows = [[str(row.get(h, '')) for h in headers] for row in data]
    else:
        headers = [f"Col {i+1}" for i in range(len(data[0]))]
        rows = data
    # Build table
    result = []
    result.append('| ' + ' | '.join(headers) + ' |')
    result.append('| ' + ' | '.join(['---'] * len(headers)) + ' |')
    for row in rows:
        result.append('| ' + ' | '.join(str(cell) for cell in row) + ' |')
    return '\n'.join(result)
Usage Examples
Example 1: Extract JSON from API Response
Input (messy):
"API returned: {status: 'success', data: {users: [{id: 1, name: 'John'}, {id: 2, name: 'Jane'}]}, timestamp: '2026-03-16'}"
Output (clean):
{
  "status": "success",
  "data": {
    "users": [
      {"id": 1, "name": "John"},
      {"id": 2, "name": "Jane"}
    ]
  },
  "timestamp": "2026-03-16"
}
Example 2: Parse Mixed Log
Input:
192.168.1.1 - - [16/Mar/2026:12:00:00 +0000] "GET /api HTTP/1.1" 200 1234
{"level": "INFO", "message": "User logged in", "user_id": 123}
name=John action=login time=12:00
Output:
[
  {"ip": "192.168.1.1", "timestamp": "16/Mar/2026:12:00:00 +0000", "method": "GET", "path": "/api", "status": 200, "size": 1234},
  {"level": "INFO", "message": "User logged in", "user_id": 123},
  {"name": "John", "action": "login", "time": "12:00"}
]
Example 3: Clean CSV
Input (messy):
name,age,city
"John", 30, "New York"
'Jane',,Los Angeles
"Bob","forty","Chicago"
Output (clean):
[
  {"name": "John", "age": 30, "city": "New York"},
  {"name": "Jane", "age": null, "city": "Los Angeles"},
  {"name": "Bob", "age": "forty", "city": "Chicago"}
]
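One way to get from the messy input of Example 3 to the clean output is sketched below with `csv.DictReader`. The helper names `clean_field` and `clean_csv` are assumptions for illustration, not functions defined by the skill, and the age conversion is deliberately narrow: only all-digit strings become integers, so "forty" is left as text.

```python
import csv
import json
from io import StringIO

MESSY_CSV = '''name,age,city
"John", 30, "New York"
'Jane',,Los Angeles
"Bob","forty","Chicago"
'''

def clean_field(value):
    """Trim whitespace and stray single quotes; map empty or missing values to None."""
    if value is None:
        return None
    value = value.strip().strip("'").strip()
    return value or None

def clean_csv(text):
    # skipinitialspace lets the reader treat `, "New York"` as a quoted field
    reader = csv.DictReader(StringIO(text), skipinitialspace=True)
    records = []
    for row in reader:
        record = {key: clean_field(value) for key, value in row.items()}
        # Convert age only when it is purely digits; "forty" stays a string
        if record["age"] is not None and record["age"].isdigit():
            record["age"] = int(record["age"])
        records.append(record)
    return records

print(json.dumps(clean_csv(MESSY_CSV), indent=2))
```

Leaving "forty" untouched keeps the raw value available for a later validation step instead of silently discarding it.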
Best Practices
- Always validate input - Check format before parsing
- Handle errors gracefully - Log and continue or fail cleanly
- Stream large files - Don't load everything into memory
- Normalize consistently - Same rules for all data
- Document transformations - What changed and why
- Preserve originals - Keep raw data until confirmed clean
- Test edge cases - Empty, null, malformed, very large
- Use appropriate types - Numbers as numbers, dates as dates
Performance Tips
- Use streaming for files > 10MB
- Batch processing for database inserts
- Parallel parsing for independent chunks
- Lazy evaluation with generators
- Cache parsed results if reused frequently
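The batching and lazy-evaluation tips can be combined. The sketch below feeds a generator into SQLite in fixed-size batches using the standard `sqlite3` module. The `people` table, the `chunked` and `load_rows` helpers, and the batch size of 500 are all assumptions chosen for illustration.

```python
import sqlite3
from itertools import islice

def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def load_rows(conn, rows, batch_size=500):
    """Insert (name, age) tuples in batches; returns the number of rows written."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    total = 0
    for batch in chunked(rows, batch_size):
        with conn:  # commits one transaction per batch
            conn.executemany("INSERT INTO people VALUES (?, ?)", batch)
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")
rows = ((f"user{i}", i % 90) for i in range(1200))  # lazy generator, never fully in memory
written = load_rows(conn, rows)
print(written, conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])
```

Only one batch is held in memory at a time, so the same pattern can consume the chunks yielded by a streaming reader.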
Security Recommendations
This skill appears to do what it says: it provides example Python code for extracting and cleaning messy data. Before using it, keep in mind:
1) The snippets read files by path, so only give it files you intend to expose (avoid system or sensitive files).
2) The regex-based fixes are heuristic and can corrupt or misinterpret input (test on non-sensitive samples first).
3) Some regex patterns can be slow or brittle on adversarial inputs; limit file size and runtime, or use streaming libraries (ijson, csv with proper dialect detection) for large or hostile inputs.
4) Because the skill is instruction-only, the agent will try to run code in whatever runtime is available; confirm the execution environment and sandboxing, and avoid granting elevated filesystem access.
5) If you need production-grade parsing, consider well-maintained parsing libraries rather than raw regex fixes.
If you want extra assurance, request a version that includes robust error handling, explicit file-path validation, and limits on input size and timeouts.
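The file-path validation and input-size limits mentioned above could look roughly like the sketch below. The helper `open_checked`, the 10 MB cap, and the idea of a single allowed directory are assumptions for illustration; the skill itself ships no such guard.

```python
import tempfile
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # assumed 10 MB cap; tune for your workload

def open_checked(file_path, allowed_dir, max_bytes=MAX_BYTES):
    """Open file_path for reading only if it is inside allowed_dir and small enough."""
    base = Path(allowed_dir).resolve()
    target = Path(file_path).resolve()
    # resolve() collapses ".." and follows symlinks, so escapes are caught here
    if base not in target.parents:
        raise PermissionError(f"{target} is outside {base}")
    if not target.is_file():
        raise FileNotFoundError(str(target))
    size = target.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{target} is {size} bytes; limit is {max_bytes}")
    return target.open("r", newline="")

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "data.csv"
    sample.write_text("a,b\n1,2\n")
    with open_checked(sample, workdir) as handle:
        first_line = handle.readline().strip()
    try:
        open_checked(Path(workdir) / ".." / "elsewhere.csv", workdir)
        escaped = True
    except PermissionError:
        escaped = False

print(first_line, escaped)
```

A guard like this addresses path scope and size only; runtime limits for slow regexes would need a separate mechanism.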
Functional Analysis
Type: OpenClaw Skill
Name: data-extractor
Version: 1.0.0
The 'data-extractor' skill bundle contains standard Python utility functions and instructions for parsing and cleaning structured data from various formats like JSON, CSV, and logs. The code uses safe libraries (re, json, csv) and follows best practices for data processing without any indicators of malicious intent, data exfiltration, or prompt injection attacks.
Capability Assessment
Purpose & Capability
Name and description match the SKILL.md: it provides Python snippets and patterns for extracting JSON, CSV, logs, key-value pairs, streaming large files, and cleaning data. No unrelated binaries, env vars, or config paths are requested.
Instruction Scope
The instructions are narrowly focused on parsing/cleaning data and include functions that read files (stream_jsonl, stream_csv) and perform regex-based fixes. This is expected for a data-extraction skill, but the snippets assume access to arbitrary file paths provided at runtime and use brittle regex substitutions that can mis-parse or mutate input. There are no instructions to access system-wide config, credentials, or unexpected external endpoints.
Install Mechanism
Instruction-only skill with no install spec and no bundled code—nothing is written to disk by the skill itself. This is the lowest-risk install posture.
Credentials
The skill requests no environment variables or credentials. All operations use local I/O and standard Python libraries in the examples; the required privileges are proportionate to the stated purpose.
Persistence & Privilege
always is false and model invocation is allowed (the platform default). The skill does not request persistent presence, nor does it instruct modification of other skills or system-wide agent settings.
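How to Use
- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install data-extractor
- Once installed, invoke the Skill by name or trigger it with /data-extractor
- Provide the inputs described in the Skill's parameter notes to get structured output
Version History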
v1.0.0
Initial release. Extract structured data from unstructured sources. Parse JSON, CSV, logs, and mixed formats. Handle malformed data, nested structures, and large files efficiently.
FAQ
What is Data Extractor?
Extract structured data from unstructured sources. Parse JSON, CSV, logs, and mixed formats into clean, usable data. Handle malformed data, nested structures... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 330 total downloads to date.
How do I install Data Extractor?
Run the command "/install data-extractor" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.
Is Data Extractor free?
Yes. Data Extractor is completely free under the MIT-0 license and can be downloaded, installed, and used freely.
Which platforms does Data Extractor support?
Data Extractor is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Data Extractor?
It is developed and maintained by engsathiago (@engsathiago); the current version is v1.0.0.