Total downloads: 119
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install data-parser-toolkit
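Description
Intelligently parses CSV, JSON, XLSX, Parquet, and SQL files, automatically detects encodings, repairs common format and content problems, and extracts structured data.
Usage instructions (SKILL.md)
Data File Parser skill
Skill description
Intelligently parses a range of data file formats (CSV/JSON/XLSX/Parquet/SQL): automatically detects encodings, repairs common problems, and extracts structured data.
Supported formats
1. CSV (comma-separated values)
Common problems and fixes:
- Encoding: automatically tries UTF-8 → GBK → GB2312 → Latin1 in turn
- Header rows: detected automatically (common cases: one row, two rows, headers with mixed (merged) cells)
- Number formats: handles comma thousands separators (e.g. "1,234.56") and Chinese numerals (e.g. "3")
- Null values: treats "-", "—", "null", "None", and the empty string as missing
- Line breaks: handles newlines embedded in CSV fields (the field must be quoted)
Automatic detection: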
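# Detect the number of header lines
def detect_header_lines(content):
    lines = content.split('\n')[:10]
    for i, line in enumerate(lines):
        # '合约代码' = contract code, '交易代码' = trading code
        if '合约代码' in line or '交易代码' in line or 'symbol' in line.lower():
            return i + 1  # the header block runs through this line (1-based count)
    return 1  # default: one header line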
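The number-format and null-value fixes listed above can be sketched as one cell-level helper. This is a minimal illustration, not the toolkit's own code: `normalize_cell` and `NULL_TOKENS` are hypothetical names, the sample row is made up, and the sketch assumes a comma is always a thousands separator, never a decimal comma.

```python
# Hypothetical helper, not part of data-parser-toolkit.
# "\u2014" is the long dash listed among the null tokens above.
NULL_TOKENS = {"", "-", "\u2014", "null", "None"}

def normalize_cell(value):
    """Return None for null-like tokens, a float for numeric text, else the stripped string."""
    text = value.strip()
    if text in NULL_TOKENS:
        return None
    candidate = text.replace(",", "")  # assumption: "," only separates thousands
    try:
        return float(candidate)
    except ValueError:
        return text  # not numeric: keep the original (stripped) text


# Made-up sample row for illustration
row = ["1,234.56", "\u2014", " null ", "IF2406", "1e3"]
cleaned = [normalize_cell(cell) for cell in row]
```

Falling back to the original text when `float` fails keeps free-text cells that merely contain commas intact.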
2. JSON (JavaScript Object Notation)
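Common problems and fixes:
- BOM: remove the leading \ufeff
- Trailing commas: {"a": 1,} → {"a": 1}
- Single quotes: {'a': 1} → {"a": 1}
- Python-style comments: remove # comments
- Numeric precision: handle scientific notation
Repair function: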
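import re

def fix_json(text):
    # Remove the BOM
    text = text.replace('\ufeff', '')
    # Remove // and # comments first, so a comment cannot hide a trailing comma
    # (naive: also truncates string values that contain // or #, such as URLs)
    text = re.sub(r'//.*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'#.*$', '', text, flags=re.MULTILINE)
    # Convert single quotes to double quotes
    # (naive: also rewrites apostrophes inside double-quoted strings)
    text = re.sub(r"'([^']*)'", r'"\1"', text)
    # Remove trailing commas
    text = re.sub(r',(\s*[}\]])', r'\1', text)
    return text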
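The regex-based `fix_json` above works on the raw text, so it also rewrites characters inside string values (for example the `//` in a URL). One possible alternative is a small scanner that tracks whether it is inside a double-quoted string. The sketch below is an assumption-laden illustration, not the toolkit's code: `strip_json_noise` is a hypothetical name, and it handles only the BOM, `//` and `#` line comments, and trailing commas (no single-quoted strings, no `/* */` comments).

```python
import json

def strip_json_noise(text):
    """Remove a BOM, //- and #-comments, and trailing commas outside double-quoted strings."""
    text = text.lstrip("\ufeff")
    out = []
    pending_comma = None  # index in `out` of a comma that may turn out to be trailing
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:  # copy the escaped character verbatim
                i += 1
                out.append(text[i])
            elif ch == '"':
                in_string = False
        elif ch == "#" or text.startswith("//", i):
            while i < n and text[i] != "\n":  # skip to end of line, keep the newline
                i += 1
            continue
        elif ch in " \t\r\n":
            out.append(ch)
        else:
            if ch in "}]" and pending_comma is not None:
                out[pending_comma] = ""  # the comma was trailing: drop it
            pending_comma = len(out) if ch == "," else None
            in_string = ch == '"'
            out.append(ch)
        i += 1
    return "".join(out)


sample = (
    '\ufeff{\n'
    '  "url": "http://example.com/#top", // home page\n'
    '  "tags": ["a", "b",], # trailing commas\n'
    '}'
)
data = json.loads(strip_json_noise(sample))
```

Because comments are skipped before the comma decision is made, a comma followed by a comment and then a closing bracket is still recognized as trailing.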
3. XLSX (Excel)
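Common problems and fixes:
- Corrupted file: "File is not a zip file" → an XLSX file is a ZIP archive, so the file needs to be re-saved
- Merged cells: handle the merged_cells ranges when reading
- Empty rows: skip rows in which every cell is None
- Date formats: convert to standard ISO format
- Formulas: read with data_only=True to get the computed values
Check whether an XLSX file is corrupted: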
import zipfile
import openpyxl

def is_valid_xlsx(path):
    try:
        # Check 1: the file must be a valid ZIP archive
        with zipfile.ZipFile(path, 'r'):
            pass
        # Check 2: openpyxl must be able to open it
        wb = openpyxl.load_workbook(path, data_only=True)
        wb.close()
        return True
    except Exception:
        return False
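Because an XLSX file is a ZIP archive, a cheaper pre-check is possible with the standard library alone, before openpyxl is involved. The sketch below is an illustration under assumptions, not the toolkit's implementation: `quick_xlsx_check` and `REQUIRED_PARTS` are hypothetical names, and it assumes the workbook part sits at the conventional `xl/workbook.xml` path.

```python
import io
import zipfile

# Parts a typical XLSX package contains; the workbook path is the conventional one.
REQUIRED_PARTS = ("[Content_Types].xml", "xl/workbook.xml")

def quick_xlsx_check(source):
    """Return (ok, reason). `source` is a path or a seekable binary file object."""
    try:
        with zipfile.ZipFile(source) as zf:
            bad = zf.testzip()  # name of the first member failing its CRC check, or None
            if bad is not None:
                return False, f"corrupt member: {bad}"
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return False, "not a zip file"
    missing = [part for part in REQUIRED_PARTS if part not in names]
    if missing:
        return False, f"missing parts: {missing}"
    return True, "ok"


# Demonstration on in-memory data rather than real workbooks
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
buf.seek(0)
looks_valid = quick_xlsx_check(buf)
not_a_zip = quick_xlsx_check(io.BytesIO(b"plain text, not a workbook"))
```

Returning a reason alongside the boolean makes it easier to tell a truncated download from a file that was never an XLSX in the first place.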
4. Parquet (columnar storage)
Characteristics: high compression ratio, well suited to large-scale data analysis
import pyarrow.parquet as pq

def read_parquet(path):
    table = pq.read_table(path)
    return table.to_pandas()
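In the same spirit as the XLSX validity check, a file can be screened before it is handed to `pq.read_table`. Unencrypted Parquet files start and end with the 4-byte magic `PAR1`, so a standard-library check might look like the sketch below. `looks_like_parquet` is a hypothetical helper, not something the toolkit is known to provide, and passing this check does not guarantee the file is readable.

```python
import io

PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(stream):
    """Check for the 4-byte magic at both ends of a seekable binary stream."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < 2 * len(PARQUET_MAGIC):
        return False  # too short to hold both magic markers
    stream.seek(0)
    head = stream.read(len(PARQUET_MAGIC))
    stream.seek(-len(PARQUET_MAGIC), io.SEEK_END)
    tail = stream.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Synthetic byte strings for illustration, not real Parquet data
fake_ok = looks_like_parquet(io.BytesIO(b"PAR1" + b"\x00" * 16 + b"PAR1"))
fake_bad = looks_like_parquet(io.BytesIO(b"symbol,price\n"))
```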
5. SQL scripts
Common problems:
- Character set declaration: CHARSET=utf8mb4
- Bulk inserts: handle INSERT INTO ... VALUES (...), (...), ...
- Escape characters: handle \' → ' or ''
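The bulk-insert and escape handling listed above can be illustrated with a small scanner over one INSERT statement. This is a sketch under stated assumptions rather than the toolkit's parser: `parse_insert_values` and `_convert` are hypothetical names, the sample statement is made up, values may not contain nested parentheses (so function calls are out of scope), and a backslash simply makes the next character literal.

```python
import re

def _convert(token, quoted):
    """Turn one raw token into a Python value."""
    if quoted:
        return token
    text = token.strip()
    if text.upper() == "NULL":
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

def parse_insert_values(statement):
    """Split the VALUES part of a single INSERT statement into rows."""
    match = re.search(r"\bVALUES\b", statement, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no VALUES clause found")
    rows, row, buf = [], None, []
    in_string = quoted = False
    i, n = match.end(), len(statement)
    while i < n:
        ch = statement[i]
        if in_string:
            if ch == "\\" and i + 1 < n:  # backslash escape: \' -> '
                i += 1
                buf.append(statement[i])
            elif ch == "'" and statement[i + 1:i + 2] == "'":  # doubled quote: '' -> '
                i += 1
                buf.append("'")
            elif ch == "'":
                in_string = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_string = quoted = True
            buf = []  # discard whitespace seen before the opening quote
        elif ch == "(" and row is None:
            row, buf, quoted = [], [], False
        elif ch in ",)" and row is not None:
            row.append(_convert("".join(buf), quoted))
            buf, quoted = [], False
            if ch == ")":
                rows.append(row)
                row = None
        elif row is not None and not quoted:
            buf.append(ch)
        i += 1
    return rows


stmt = ("INSERT INTO quotes (symbol, price, note) VALUES "
        "('IF2406', 3512.4, 'it''s ok'), ('IC2406', NULL, 'O\\'Neil');")
rows = parse_insert_values(stmt)
```

Scanning character by character, instead of splitting on commas, is what keeps commas and quotes inside string literals from breaking a row apart.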
Core utility functions
Automatic encoding detection
import chardet

def detect_encoding(path):
    with open(path, 'rb') as f:
        raw = f.read(10000)  # read the first 10,000 bytes (about 10 KB)
    result = chardet.detect(raw)
    return result['encoding'] or 'utf-8'
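Statistical detection can guess wrong on short samples, which is why the CSV section lists a fixed trial order (UTF-8 → GBK → GB2312 → Latin1). A minimal sketch of that trial-decoding chain, using only the standard library, is shown below. `decode_with_fallback` and `FALLBACK_ENCODINGS` are hypothetical names, and `utf-8-sig` is used in place of plain UTF-8 on the assumption that a leading BOM should be stripped as well.

```python
FALLBACK_ENCODINGS = ("utf-8-sig", "gbk", "gb2312", "latin1")

def decode_with_fallback(raw, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes `raw` without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


sample = "合约代码,price"
utf8_text, utf8_enc = decode_with_fallback(sample.encode("utf-8"))
gbk_text, gbk_enc = decode_with_fallback(sample.encode("gbk"))
```

Latin-1 maps every byte to a character, so it never raises and should stay last in the list; reaching it means the text may be decoded without error but still be wrong.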
Smart CSV reading
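import pandas as pd
import chardet

def smart_read_csv(path, **kwargs):
    # 1. Detect the encoding
    enc = detect_encoding(path)
    # 2. Try the detected encoding first, then the fallback encodings
    for candidate in [enc, 'gbk', 'gb2312', 'utf-8-sig', 'latin1']:
        try:
            return pd.read_csv(path, encoding=candidate, **kwargs)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate
            continue
    # Fail loudly instead of returning an undefined variable
    raise ValueError(f"Could not decode {path} with any candidate encoding")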
Smart XLSX reading
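def smart_read_xlsx(path):
    """Read an XLSX file, with a validity check first."""
    # Check that the file is valid
    if not is_valid_xlsx(path):
        print(f"Warning: {path} may be corrupted")
        return None
    wb = openpyxl.load_workbook(path, data_only=True)
    ws = wb.active
    # Read the sheet into a list of rows
    data = []
    for row in ws.iter_rows(values_only=True):
        # Skip rows in which every cell is None (rows of 0 or False are kept)
        if all(cell is None for cell in row):
            continue
        data.append(list(row))
    wb.close()
    return data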
Usage examples
Parse any data file
from data_parser import parse_file

# Detect the format automatically and parse
data = parse_file("data.csv")      # returns DataFrame/List
data = parse_file("data.json")     # returns dict/List
data = parse_file("data.xlsx")     # returns List[List]
data = parse_file("data.parquet")  # returns DataFrame
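How `parse_file` recognizes the format is not shown in this listing. One common approach is to dispatch on the file extension, sketched below with standard-library readers for CSV and JSON only. Everything here is hypothetical (`parse_by_extension`, `READERS`, and the two reader helpers) and is not the toolkit's implementation; the sample files are made up.

```python
import csv
import json
import tempfile
from pathlib import Path

def _read_csv(path):
    # utf-8-sig also accepts plain UTF-8 and strips a BOM if present
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.reader(f))

def _read_json(path):
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)

# Hypothetical registry mapping a lowercase extension to a reader function
READERS = {".csv": _read_csv, ".json": _read_json}

def parse_by_extension(path):
    suffix = Path(path).suffix.lower()
    reader = READERS.get(suffix)
    if reader is None:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return reader(path)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("symbol,price\nIF2406,3512.4\n", encoding="utf-8")
    json_path = Path(tmp) / "data.json"
    json_path.write_text('{"symbol": "IF2406"}', encoding="utf-8")
    csv_rows = parse_by_extension(csv_path)
    json_obj = parse_by_extension(json_path)
```

A registry keeps each format's reader separate, so a new format can be added with one more dictionary entry.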
Batch conversion
from data_parser import convert_folder

# Convert every XLSX file in a folder to CSV
convert_folder(
    input_dir="D:/data/xlsx",
    output_dir="D:/data/csv",
    output_format="csv"
)
Installing dependencies
pip install pandas openpyxl chardet pyarrow
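Notes
- If an XLSX file reports "File is not a zip file", the file is corrupted and needs to be obtained again from its source
- Encoding problems are the most common CSV issue, so detect the encoding first
- For large files, the Parquet format is more efficient
- When reading XLSX, use data_only=True to get computed values; otherwise you get the formulas
Security recommendations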
This skill appears to be a normal data-file parser and its declared dependencies match the code. Before installing or giving it broad access, consider:
1) Review the full source (the SKILL.md and README mention functions such as parse_from_url and many helper utilities; verify they exist and inspect any network code).
2) If you will run it in an environment with sensitive files, restrict the skill's file access (run in a sandbox or grant it only specific directories), because it can read arbitrary local files and archive members.
3) Confirm whether it performs any network I/O (parse_from_url or pandas.read_csv on a URL) and whether that behavior is acceptable; if you need to prevent exfiltration, run offline or in a restricted network environment.
4) Because some documentation features look out-of-sync with the code, you may want to run the tests/examples in an isolated environment to validate behavior before trusting it with production data.
Functional analysis
Type: OpenClaw Skill
Name: data-parser-toolkit
Version: 1.0.0
The 'data-parser-toolkit' is a comprehensive utility for parsing and cleaning various data formats (CSV, JSON, XLSX, Parquet, SQL, etc.). The code in data_parser.py and data_parser_enhanced.py implements standard data processing logic, including encoding detection, data validation, and privacy-focused features like 'mask_sensitive' for de-identifying PII. While it includes network capabilities via 'read_from_url', this is aligned with its stated purpose of fetching data files for analysis, and no evidence of malicious intent, exfiltration, or prompt injection was found.
Capability assessment
Purpose & Capability
Name/description (data parsing for CSV/JSON/XLSX/Parquet/SQL) match the included Python modules and README. Declared dependencies (pandas, openpyxl, chardet, pyarrow, xlrd) are appropriate for the stated functionality. The README and SKILL.md advertise many helper functions (convert_folder, clean_pipeline, parse_from_url, detect_corruption, etc.); the provided code implements large portions of parsing functionality but some advertised utilities are not obviously present or are truncated in the provided files. This is likely sloppy packaging or documentation drift rather than malicious mismatch.
Instruction Scope
SKILL.md instructs the agent to parse files, detect encoding, and install typical Python deps. The runtime instructions and code focus on reading local files and archive members and performing in-memory transformations. There are no explicit instructions to read unrelated system config files or to transmit data to external endpoints. However, the README references parse_from_url and URL-reading examples; URL reads (pandas.read_csv supports URLs) permit network fetches — the exact implementation of parse_from_url is not visible, so network behavior is possible but not proven.
Install Mechanism
No install spec provided in the skill registry; SKILL.md gives a pip install line for standard Python packages (pandas, openpyxl, chardet, pyarrow, xlrd). That is an expected, low-risk method for this type of library. No arbitrary downloads, obscure URLs, or extract operations were seen in the install metadata.
Credentials
The skill requests no environment variables, no credentials, and no config paths. This is proportionate for a local file-parsing toolkit. No hidden env access was detected in SKILL.md or in the visible source.
Persistence & Privilege
always is false and default agent invocation is allowed (platform default). The package does not request persistent elevated privileges or permanent system-wide changes in the provided files. It exposes functions to read arbitrary files (expected for a parser) but does not attempt to modify other skills' configs or set global agent settings.
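How to use
- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install data-parser-toolkit
- Once installation finishes, call the skill by name or trigger it with /data-parser-toolkit
- Supply the inputs described in the skill's parameter documentation to get structured output
Version history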
v1.0.0
- Initial release of the data file parsing toolkit.
- Supports intelligent parsing for CSV, JSON, XLSX, Parquet, and SQL file formats.
- Automatically detects file encoding and repairs common file issues.
- Extracts structured data and handles various data cleaning scenarios (e.g. header row detection, scientific notation, merged cells).
- Provides sample usage for single-file parsing and batch conversion of files.
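Metadata
FAQ
What is data-parser-toolkit?
It intelligently parses CSV, JSON, XLSX, Parquet, and SQL files, automatically detects encodings, repairs common format and content problems, and extracts structured data. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 119 downloads to date.
How do I install data-parser-toolkit?
Run the command "/install data-parser-toolkit" in the OpenClaw or Claude Code chat box for a one-step install; no further configuration is needed.
Is data-parser-toolkit free?
Yes. data-parser-toolkit is completely free and is released under the MIT-0 license, so it can be freely downloaded, installed, and used.
Which platforms does data-parser-toolkit support?
data-parser-toolkit is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops data-parser-toolkit?
It is developed and maintained by XiLi-aXi (@qiuwenxi416488212-ship-it); the current version is v1.0.0.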