Data Scientist
/install data-scientist
Data Scientist
Purpose
Provides statistical analysis and predictive modeling expertise specializing in machine learning, experimental design, and causal inference. Builds rigorous models and translates complex statistical findings into actionable business insights with proper validation and uncertainty quantification.
Core Capabilities
Statistical Modeling
- Building predictive models using regression, classification, and clustering
- Implementing time series forecasting and causal inference
- Designing and analyzing A/B tests and experiments
- Performing feature engineering and selection
Machine Learning
- Training and evaluating supervised and unsupervised learning models
- Implementing deep learning models for complex patterns
- Performing hyperparameter tuning and model optimization
- Validating models with cross-validation and holdout sets
Data Exploration
- Conducting exploratory data analysis (EDA) to discover patterns
- Identifying anomalies and outliers in datasets
- Creating advanced visualizations for insight discovery
- Generating hypotheses from data exploration
Communication and Storytelling
- Translating statistical findings into business language
- Creating compelling data narratives for stakeholders
- Building interactive notebooks and reports
- Presenting findings with uncertainty quantification
Core Workflows
Workflow 1: EDA & Data Cleaning
Goal: Understand data distribution, quality, and relationships before modeling.
# Load and profile
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
print(df.info()); print(df.describe())
missing = df.isnull().sum() / len(df)
print(missing[missing > 0].sort_values(ascending=False))
# Univariate analysis
num_cols = df.select_dtypes(include=[np.number]).columns
for col in num_cols:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))
sns.histplot(df[col], kde=True, ax=ax1)
sns.boxplot(x=df[col], ax=ax2)
plt.show()
# Correlation
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
# Cleaning
df['age'].fillna(df['age'].median(), inplace=True)
cap = df['income'].quantile(0.99)
df['income'] = np.where(df['income'] > cap, cap, df['income'])
Workflow 2: A/B Test Analysis (Proportions Z-test)
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
results = df.groupby('group')['converted'].agg(['count','sum','mean'])
control, treatment = results.loc['A'], results.loc['B']
count = np.array([treatment['sum'], control['sum']])
nobs = np.array([treatment['count'], control['count']])
stat, p_value = proportions_ztest(count, nobs, alternative='larger')
(lc, lt), (uc, ut) = proportion_confint(count, nobs, alpha=0.05)
If p \x3C 0.05: reject H0 (statistically significant). Check practical significance (lift magnitude).
Workflow 3: Causal Inference (Propensity Score Matching)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
# Propensity scores
confounders = ['age','income','tenure']
logit = LogisticRegression().fit(df[confounders], df['is_premium'])
df['pscore'] = logit.predict_proba(df[confounders])[:, 1]
# Nearest neighbor matching
nn = NearestNeighbors(n_neighbors=1).fit(control[['pscore']])
_, indices = nn.kneighbors(treatment[['pscore']])
matched_control = control.iloc[indices.flatten()]
ate = treatment['spend'].mean() - matched_control['spend'].mean()
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Data Leakage | Scaling/encoding before split | Pipeline; fit only on train |
| P-Hacking | Testing 50 hypotheses, reporting p\x3C0.05 | Bonferroni/FDR correction; pre-register |
| Imbalanced Classes | 99.9% accuracy on 0.1% fraud | Use PR-AUC, F1; SMOTE; class_weights |
Quality Checklist
- Hypothesis defined before analysis
- Train/Test split correct (no leakage)
- Imbalanced classes handled properly
- Confidence intervals provided
- Results interpreted in business terms
- Caveats and limitations stated
- Random seeds set for reproducibility
- Model explained with SHAP/LIME if black-box
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install data-scientist - 安装完成后,直接呼叫该 Skill 的名称或使用
/data-scientist触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
Data Scientist 是什么?
Expertise in statistical analysis, predictive modeling, causal inference, and data-driven storytelling to generate validated business insights and support de... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 50 次。
如何安装 Data Scientist?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install data-scientist」即可一键安装,无需额外配置。
Data Scientist 是免费的吗?
是的,Data Scientist 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
Data Scientist 支持哪些平台?
Data Scientist 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Data Scientist?
由 runkecheng(@runkecheng)开发并维护,当前版本 v1.0.0。