deyashmukh

Data Cleaning & Annotation Workflow

Author: Yash Deshmukh · v1.0.0
Platform: cross-platform · Security flag: ⚠ suspicious
Total downloads: 854 · Favorites: 0 · Current installs: 1 · Versions: 1
Install in OpenClaw
/install data-cleaning-annotation-workflow
Description
Complete workflow for time series datasets (Energy, Manufacturing, Climate) on Kaggle to Data Annotation platform (data.smlcrm.com). Includes downloading, cl...
Usage Instructions (SKILL.md)

Simulacrum Data Annotation Workflow

Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).

What This Skill Does

This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:

  1. Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
  2. Download: Get CSV files via browser or Kaggle CLI
  3. Clean: Run Python/pandas script to handle missing values, duplicates, formatting
  4. Upload RAW: Upload original CSV with metadata (name, domain, source URL, description)
  5. Configure Headers: Set column types (Time, Target, Covariate, Group) and units
  6. Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
  7. Upload Cleaned: Final upload → CLEAN status

Supported Domains

  • Energy: Power consumption, utilities, renewable energy, grid data
  • Manufacturing: Industrial processes, steel production, emissions, equipment data
  • Climate: CO2 emissions, environmental monitoring, weather correlation data

Quick Start

For the full pipeline from Kaggle to annotated dataset:

1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Click "Upload Cleaned Dataset" to finalize → CLEAN status

Workflow Steps

Step 1: Find and Download Dataset

From Kaggle (Browser Method):

  1. Navigate to kaggle.com/datasets
  2. Search for relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
  3. Review data description, file list, and preview
  4. Click "Download" button
  5. Extract CSV file from downloaded zip

Alternative: Kaggle CLI

# Install if needed: pip install kaggle
# Configure: place your API token at ~/.kaggle/kaggle.json, then verify with: kaggle competitions list

scripts/download_kaggle.sh <dataset-name> [output-dir]
# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
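
The helper script itself is not reproduced on this page. As a rough sketch of the same download step, assuming the official Kaggle CLI is installed and configured, the Python below validates the owner/dataset slug and passes the arguments as a list (no shell) to `kaggle datasets download`. The names `build_download_command` and `download_dataset` and the slug pattern are illustrative assumptions, not part of the skill.

```python
import re
import subprocess
from pathlib import Path

# Assumed slug shape: "owner/dataset-name" built from letters, digits, '-', '_' and '.'.
SLUG_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*/[A-Za-z0-9][A-Za-z0-9_.-]*")


def build_download_command(dataset: str, output_dir: str = ".") -> list[str]:
    """Return the kaggle CLI argument list for one dataset, rejecting odd slugs."""
    if not SLUG_PATTERN.fullmatch(dataset):
        raise ValueError(f"unexpected dataset slug: {dataset!r}")
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", output_dir, "--unzip"]


def download_dataset(dataset: str, output_dir: str = ".") -> None:
    """Create the output directory and run the download (requires the kaggle CLI)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # The arguments are passed as a list, so no shell parses the user input.
    subprocess.run(build_download_command(dataset, output_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call download_dataset(...) to actually fetch the files.
    print(build_download_command("csafrit2/steel-industry-energy-consumption", "data"))
```

Rejecting unexpected slugs before anything is executed keeps a crafted dataset name from reaching the command line in an unintended form.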

Step 2: Clean the Dataset

Always run the cleaning script before upload:

python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]

What the script does:

  • Strips whitespace from column names
  • Removes duplicate rows
  • Fills missing numeric values with median
  • Fills missing categorical values with mode or 'Unknown'
  • Converts timestamp columns to datetime format
  • Outputs column summary for metadata configuration

Output:

  • Cleaned CSV file ready for upload
  • Column summary printed to console (save this for metadata config)
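
The source of scripts/clean_dataset.py is not shown on this page. A minimal pandas sketch of the steps listed above might look like the following. The function name `clean_frame`, the `time_cols` parameter, and the sample data are assumptions for illustration, and the real script may differ (for example, in how it detects timestamp columns).

```python
import io

import pandas as pd


def clean_frame(df: pd.DataFrame, time_cols=()) -> pd.DataFrame:
    """Apply the cleaning steps described above to a copy of df."""
    out = df.copy()
    # Strip whitespace from column names.
    out.columns = [str(c).strip() for c in out.columns]
    # Remove exact duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)
    for col in out.columns:
        if col in time_cols:
            # Convert timestamp columns; unparseable values become NaT.
            out[col] = pd.to_datetime(out[col], errors="coerce")
        elif not out[col].isna().any():
            continue  # nothing to fill in this column
        elif pd.api.types.is_numeric_dtype(out[col]):
            # Fill missing numeric values with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Fill missing categorical values with the mode, else 'Unknown'.
            mode = out[col].mode(dropna=True)
            out[col] = out[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return out


# Made-up sample: padded header names, one duplicate row, two missing values.
raw = io.StringIO(
    " ts ,usage, load \n"
    "2018-01-01 00:15:00,3.0,Light\n"
    "2018-01-01 00:15:00,3.0,Light\n"
    "2018-01-01 00:30:00,,Light\n"
    "2018-01-01 00:45:00,5.0,\n"
)
cleaned = clean_frame(pd.read_csv(raw), time_cols=("ts",))
print(cleaned.dtypes)  # a rough stand-in for the column summary
```

Median and mode imputation change the data, so it is worth noting in the dataset description which columns were filled.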

Step 3: Upload Raw Dataset to Platform

  1. Navigate to data.smlcrm.com/dashboard
  2. Click "Upload Dataset" button
  3. Fill in metadata for the RAW dataset:
    • Name: Descriptive dataset name
    • Domain: Category (Energy, Manufacturing, Climate, etc.)
    • Source URL: Kaggle or original source URL
    • Description: Brief summary of the dataset
  4. Upload the original/raw CSV file (not cleaned yet)
  5. Click Upload

Result: Dataset appears in list with RAW status

Step 4: Upload Cleaned File & Configure Metadata

  1. Find the RAW dataset in the list
  2. Click "Clean" button
  3. Upload the cleaned CSV file (from Step 2)
  4. Configure headers for each column:
Setting | Description
Name    | Column name (editable)
Units   | Measurement units (kWh, °C, %, ratio, tCO2, etc.)
Type    | Time / Target / Covariate / Group

Column Type Guide:

  • Time: Timestamp/datetime columns (usually required)
  • Target: Variable to predict (at least one required)
  • Covariate: Input features/independent variables
  • Group: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)

Bulk Configuration:

  • Select multiple rows via checkboxes
  • Use "Apply" dropdown to set type for selected columns
  • Set units individually or in bulk

Common Unit Patterns:

  • Energy: kWh, MWh
  • Power: kW, MW
  • Reactive energy: kVarh
  • Emissions: tCO2, kgCO2
  • Ratios: ratio, %
  • Time: seconds, minutes, hours
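
When column names already carry a unit suffix, the units field can be pre-filled from the name. The sketch below is a hypothetical helper (`guess_unit` and `KNOWN_UNITS` are not part of the skill) that only recognizes unit tokens from the list above when they appear at the end of a name or in trailing parentheses. Units such as ratio or seconds cannot be inferred this way and are left unset.

```python
import re
from typing import Optional

# Unit tokens drawn from the patterns above; matching is case-insensitive.
KNOWN_UNITS = ["kVarh", "MWh", "kWh", "MW", "kW", "tCO2", "kgCO2"]


def guess_unit(column_name: str) -> Optional[str]:
    """Return a known unit if the name ends with it (e.g. '_kWh' or '(tCO2)')."""
    for unit in KNOWN_UNITS:
        # The unit must follow a separator and end the name, optionally with ')'.
        pattern = r"(?:^|[_(.\s])" + re.escape(unit) + r"\)?$"
        if re.search(pattern, column_name.strip(), flags=re.IGNORECASE):
            return unit
    return None


for name in ["Usage_kWh", "CO2(tCO2)", "Lagging_Current_Reactive.Power_kVarh", "NSM"]:
    print(name, "->", guess_unit(name))
```

Any guess should still be checked against the dataset description before it is entered on the platform.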

Step 5: Assign Groups to Variables

Purpose: Group variables define how data is segmented for analysis.

Exact Workflow:

  1. Select ALL variables by checking their checkboxes:

    • Target variable(s)
    • ALL covariate variables
  2. Apply ALL group tags to selected variables:

    • Click first group tag (e.g., WeekStatus) → all selected get this group
    • Click second group tag (e.g., Day_of_week) → all selected get this group
    • Click third group tag (e.g., Load_Type) → all selected get this group
    • Continue for all available group tags
  3. Result: All variables have all groups assigned (e.g., "WeekStatus × Day_of_week × Load_Type")

Important: Assign groups to BOTH target variables AND all covariates.
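
The all-variables-to-all-groups mapping can be written down before clicking through the UI. This is a minimal sketch that assumes the column configuration is held in a plain dict of column name to type; the function name `assign_all_groups` is hypothetical, and the platform itself is operated manually.

```python
def assign_all_groups(column_types: dict[str, str]) -> dict[str, str]:
    """Map each Target/Covariate column to the combined label of all Group columns."""
    groups = [name for name, kind in column_types.items() if kind == "Group"]
    label = " × ".join(groups)
    return {
        name: label
        for name, kind in column_types.items()
        if kind in ("Target", "Covariate")
    }


column_types = {
    "Timestamps": "Time",
    "Usage_kWh": "Target",
    "NSM": "Covariate",
    "WeekStatus": "Group",
    "Day_of_week": "Group",
    "Load_Type": "Group",
}
for name, label in assign_all_groups(column_types).items():
    print(f"{name}: {label}")
```

Time and Group columns are deliberately absent from the result, since only targets and covariates receive group tags in the workflow above.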

Step 6: Final Upload

  1. Click "Upload Cleaned Dataset" button
  2. Wait for processing
  3. Dataset status changes from RAW to CLEAN
  4. Verify data points count is correct

Example: Steel Industry Energy Dataset

Source: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption

Metadata:

  • Name: Steel Industry Energy Consumption (South Korea)
  • Domain: Energy
  • Data Points: 350,400

Column Configuration:

Column                               | Type      | Units
Timestamps                           | Time      | -
Usage_kWh                            | Target    | kWh
Lagging_Current_Reactive.Power_kVarh | Covariate | kVarh
Leading_Current_Reactive_Power_kVarh | Covariate | kVarh
CO2(tCO2)                            | Covariate | tCO2
Lagging_Current_Power_Factor         | Covariate | ratio
Leading_Current_Power_Factor         | Covariate | ratio
NSM                                  | Covariate | seconds
WeekStatus                           | Group     | -
Day_of_week                          | Group     | -
Load_Type                            | Group     | -

Group Assignment:

  1. Select: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM
  2. Click: WeekStatus → all selected get WeekStatus
  3. Click: Day_of_week → all selected get Day_of_week
  4. Click: Load_Type → all selected get Load_Type
  5. Final: All variables show "WeekStatus × Day_of_week × Load_Type"

Reference Materials

For detailed platform configuration guidance, see references/platform_guide.md.

Troubleshooting

"Next" button disabled:

  • Check at least one Time column is set
  • Check at least one Target column is set
  • Verify all columns have types assigned

Groups not appearing:

  • Columns must be marked as "Group" type first
  • Proceed to next step after setting Group types

Upload fails:

  • Re-run cleaning script
  • Check CSV format (comma-delimited)
  • Verify no empty column names

Scripts

Script                     | Purpose
scripts/clean_dataset.py   | Clean and prepare CSV for upload
scripts/download_kaggle.sh | Download datasets via Kaggle CLI

Platform URL

Data Annotation Platform: https://data.smlcrm.com

Security Recommendations
This skill appears to do what it says: local downloading (via your Kaggle CLI), pandas-based cleaning, and manual upload to data.smlcrm.com. Before installing or running:

  1. Verify that data.smlcrm.com is the intended, legitimate platform.
  2. Review the scripts locally (they are short and readable) and run them in a safe environment.
  3. Ensure your Kaggle CLI is configured securely (kaggle.json in ~/.kaggle); the skill won't manage that token.
  4. Be cautious about the "assign ALL group tags to ALL variables" step; it may produce incorrect metadata or labels for ML models.
  5. Check dataset licenses and privacy constraints before downloading or uploading any datasets.
Functional Analysis
Type: OpenClaw Skill
Name: data-cleaning-annotation-workflow
Version: 1.0.0

The skill's stated purpose of data cleaning and annotation is benign. The `scripts/clean_dataset.py` file performs standard data processing operations without suspicious behavior. However, the `scripts/download_kaggle.sh` script directly uses user-provided arguments (`$DATASET_NAME`, `$OUTPUT_DIR`) in shell commands (`kaggle datasets download -d "$DATASET_NAME"`, `mkdir -p "$OUTPUT_DIR"`, `cd "$OUTPUT_DIR"`) without explicit sanitization, creating a shell injection vulnerability. While not intentionally malicious, this vulnerability makes the skill 'suspicious' as it could be exploited by a malicious agent or crafted input.
Capability Assessment
Purpose & Capability
The skill’s name/description match the provided files: a downloader helper and a pandas cleaning script plus detailed upload instructions to data.smlcrm.com. One small expectation gap: the SKILL.md suggests using the Kaggle CLI, which implicitly requires the user's Kaggle API token/config (~/.kaggle/kaggle.json); the skill does not declare or attempt to manage that token (this is normal but worth noting).
Instruction Scope
Runtime instructions are limited to finding/downloading data, running the local clean_dataset.py script, and manually uploading/configuring datasets on data.smlcrm.com. That stays within the stated purpose. One procedural oddity: it instructs assigning 'ALL group tags' to ALL variables including target variables — this is unexpected from an ML-quality standpoint (could harm labeling/analysis) but is not a security/privacy exfiltration action.
Install Mechanism
No install specification: instruction-only with two small helper scripts. Nothing is downloaded from arbitrary URLs or written/executed by an installer; therefore low install risk.
Credentials
The skill does not request environment variables, credentials, or config paths. The only implicit requirement is that the user may need the Kaggle CLI configured locally (user-managed kaggle.json) to run download_kaggle.sh — this is proportional to the stated task.
Persistence & Privilege
always is false and the skill does not request persistent/system-wide changes or access to other skills' configs. It does not perform autonomous network exfiltration or modify agent settings.
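How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install data-cleaning-annotation-workflow
  3. Once installed, invoke the Skill by name or trigger it with /data-cleaning-annotation-workflow
  4. Provide the required inputs as described by the Skill's parameters to get structured output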
Version History
v1.0.0
  • Initial release of complete end-to-end workflow for preparing, cleaning, and annotating time series datasets (Energy, Manufacturing, Climate) using the Data Annotation platform.
  • Step-by-step instructions for finding datasets on Kaggle, downloading, cleaning via pandas scripts, and uploading both raw and cleaned files with full metadata.
  • Detailed guidance on configuring column types (Time, Target, Covariate, Group), setting measurement units, and bulk-assigning group tags to all relevant variables.
  • Workflow explicitly covers group assignment for both targets and covariates, emphasizing all-variables-to-all-groups mapping.
  • Troubleshooting section and script usage notes included for common platform and data preparation issues.
Metadata
Slug: data-cleaning-annotation-workflow
Version: 1.0.0
License: (not listed)
Total installs: 1
Current installs: 1
Version count: 1
FAQ

What is Data Cleaning & Annotation Workflow?

Complete workflow for time series datasets (Energy, Manufacturing, Climate) on Kaggle to Data Annotation platform (data.smlcrm.com). Includes downloading, cl... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 854 total downloads to date.

How do I install Data Cleaning & Annotation Workflow?

Run the command "/install data-cleaning-annotation-workflow" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Data Cleaning & Annotation Workflow free?

Yes. Data Cleaning & Annotation Workflow is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Data Cleaning & Annotation Workflow support?

Data Cleaning & Annotation Workflow is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Data Cleaning & Annotation Workflow?

It is developed and maintained by Yash Deshmukh (@deyashmukh); the current version is v1.0.0.
