Ingeniero de datos

Name: Ingeniero de datos
Author: felix-antonio-sl

Author: felix-antonio-sl · GitHub · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install kv-senior-data-engineering

Description

Design and build scalable data pipelines, ETL/ELT systems, and data infrastructure. Use when designing data architectures, choosing between batch and streami...

Usage instructions (SKILL.md)

Senior Data Engineer

Production-grade data engineering: pipelines, modeling, quality, and DataOps.

Activation

Use this skill when the user asks to:

design a data pipeline (batch, streaming, or hybrid)
choose between Lambda and Kappa architecture, or batch vs streaming
build ETL/ELT with Airflow, Prefect, Dagster, dbt, or Spark
implement data quality checks or data contracts
model data (star schema, snowflake, SCD, Data Vault)
optimize a slow Spark job, DAG, or warehouse query
set up data observability, lineage, or incident response

Workflow

Classify the request: pipeline | model | quality | optimize | architecture.
Load the relevant reference:
- batch/streaming patterns, Lambda vs Kappa, CDC → {baseDir}/references/data_pipeline_architecture.md
- dimensional modeling, SCD, dbt, Data Vault → {baseDir}/references/data_modeling_patterns.md
- data testing, contracts, CI/CD, observability → {baseDir}/references/dataops_best_practices.md
- end-to-end workflow walkthroughs → {baseDir}/references/workflows.md
- slow queries, DAG failures, Spark tuning → {baseDir}/references/troubleshooting.md

Run the appropriate script when artifacts are provided:

# Generate pipeline orchestration config (airflow | prefect | dagster)
python {baseDir}/scripts/pipeline_orchestrator.py generate \
  --type airflow --source postgres --destination snowflake --schedule "0 5 * * *"

# Validate data quality (freshness, completeness, uniqueness, schema)
python {baseDir}/scripts/data_quality_validator.py validate \
  --input data/file.parquet --schema schemas/file.json \
  --checks freshness,completeness,uniqueness

# Analyze and optimize ETL performance
python {baseDir}/scripts/etl_performance_optimizer.py analyze \
  --query queries/aggregation.sql --engine spark --recommend

Emit the artifact: pipeline config, dbt model, schema DDL, quality rules, or architecture diagram.
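
The classify-then-load steps above amount to a lookup from request class to reference file. A minimal sketch in Python follows. The helper name reference_for is hypothetical, and the one-to-one mapping is an assumption: the skill lists the reference files by topic but does not pin each file to a class.

```python
from pathlib import Path

# Assumed mapping from request class to reference file; the skill lists the
# files by topic but does not state this pairing explicitly.
REFERENCES = {
    "pipeline": "data_pipeline_architecture.md",
    "architecture": "data_pipeline_architecture.md",
    "model": "data_modeling_patterns.md",
    "quality": "dataops_best_practices.md",
    "optimize": "troubleshooting.md",
}


def reference_for(request_class: str, base_dir: str) -> Path:
    """Return the reference file to load for a classified request."""
    try:
        filename = REFERENCES[request_class]
    except KeyError:
        raise ValueError(f"unknown request class: {request_class!r}") from None
    return Path(base_dir) / "references" / filename
```

Raising on an unknown class, rather than falling back to a default file, keeps a misclassified request from silently loading the wrong guidance.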

Output Contract

Open with the pipeline classification and dominant bottleneck or design decision.
Emit one primary artifact per response (DAG, dbt model, schema, quality config).
For architecture decisions: state the trade-offs of each option before recommending.
Declare data loss risk explicitly when a pipeline design cannot guarantee exactly-once semantics.
Close with observability recommendation (what to monitor and at what threshold).

Key Rules

Default to batch unless sub-minute latency is a stated requirement.
Default to dbt + warehouse compute for <1 TB daily; recommend Spark only when justified by volume or complexity.
Every pipeline must declare: idempotency strategy, error handling, and dead-letter queue approach.
Data quality checks are non-optional — include them in every pipeline design.
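
To make the last rule concrete, here is a minimal sketch of three of the checks the validator command names (completeness, uniqueness, freshness), written in plain Python over a list of row dictionaries. It is not the bundled data_quality_validator.py; the function names, the sample columns, and the two-hour freshness window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def is_unique(rows, column):
    """True when no non-null value of `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def is_fresh(rows, column, max_age, now=None):
    """True when the newest timestamp in `column` is within `max_age` of now."""
    now = now or datetime.now(timezone.utc)
    timestamps = [row[column] for row in rows if row.get(column) is not None]
    return bool(timestamps) and now - max(timestamps) <= max_age


# Hypothetical sample data: two orders, one with a missing amount.
rows = [
    {"order_id": 1, "amount": 10.0,
     "loaded_at": datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": None,
     "loaded_at": datetime(2024, 1, 1, 5, 5, tzinfo=timezone.utc)},
]
now = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
report = {
    "completeness_amount": completeness(rows, "amount"),
    "unique_order_id": is_unique(rows, "order_id"),
    "fresh_within_2h": is_fresh(rows, "loaded_at", timedelta(hours=2), now=now),
}
```

Passing now explicitly keeps the freshness check deterministic in tests; a scheduled run would normally leave it unset.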

Guardrails

Do not generate application-layer code (APIs, web services) — stay within data pipeline scope.
Do not recommend streaming when batch satisfies the latency requirement; streaming adds operational cost.
Flag missing idempotency as a HIGH issue; flag missing data quality checks as MEDIUM.
For cross-engine migration refer to migration-architect.
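
The severity rules above could be applied mechanically to a design description. The sketch below assumes a hypothetical design dictionary with the keys idempotency, quality_checks, mode, and max_latency_seconds. The source assigns no severity to an unjustified streaming recommendation, so the INFO label used here is an assumption.

```python
def review_design(design: dict) -> list[tuple[str, str]]:
    """Return (severity, message) findings for a pipeline design.

    `design` uses hypothetical keys: idempotency, quality_checks,
    mode ("batch" or "streaming") and max_latency_seconds.
    """
    findings = []
    if not design.get("idempotency"):
        findings.append(("HIGH", "no idempotency strategy declared"))
    if not design.get("quality_checks"):
        findings.append(("MEDIUM", "no data quality checks included"))
    latency = design.get("max_latency_seconds")
    # Sub-minute latency is the only stated trigger for streaming, so a
    # missing or >= 60 s requirement means batch would have been enough.
    if design.get("mode") == "streaming" and (latency is None or latency >= 60):
        findings.append(
            ("INFO", "batch satisfies the stated latency; streaming adds operational cost")
        )
    return findings
```

For example, a streaming design with a 15-minute latency target and no idempotency or quality checks yields three findings, in the order HIGH, MEDIUM, INFO.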

Self Check

Before emitting any artifact, verify:

idempotency strategy is stated;
error handling and retry logic are addressed;
data quality checks are included or explicitly deferred with a reason;
the chosen architecture (batch vs stream) matches the stated latency requirement.
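
As an illustration of the first two items in this checklist, the sketch below shows one common idempotency strategy (upsert on a natural key) together with bounded retries and a dead-letter list, using Python's built-in sqlite3 module. The orders table, the order_id key, and the load_batch helper are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def load_batch(conn, records, max_attempts=3):
    """Upsert records keyed by order_id; return the records that were dead-lettered."""
    dead_letters = []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                with conn:  # commits on success, rolls back on error
                    conn.execute(
                        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)",
                        (record["order_id"], record["amount"]),
                    )
                break
            except sqlite3.IntegrityError as exc:
                # Bad data will not succeed on retry: dead-letter it immediately.
                dead_letters.append({"record": record, "error": str(exc)})
                break
            except sqlite3.OperationalError as exc:
                # Treated as transient (e.g. database locked): retry, then give up.
                if attempt == max_attempts:
                    dead_letters.append({"record": record, "error": str(exc)})
    return dead_letters


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 25.5},
    {"order_id": 3, "amount": None},  # violates NOT NULL, so it is dead-lettered
]
first = load_batch(conn, batch)
second = load_batch(conn, batch)  # replaying the batch must not duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because each record is written by key, replaying the same batch leaves two rows in the table, and the invalid record is parked on both runs instead of failing the whole load.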

Safe usage recommendations

This bundle appears coherent for designing and generating data pipelines, but review generated artifacts before running them in production. Specific points to consider:
1) Generated DAGs and tasks will reference connection IDs (e.g., postgres_conn_id, snowflake_conn_id); you must configure those connections securely in your Airflow/secret manager rather than embedding secrets in generated code.
2) The generator can embed arbitrary bash commands from task parameters; inspect any generated BashOperator commands for accidental injection of secrets or destructive commands.
3) The validators and patterns include detectors for PII (emails, credit-card-like patterns); avoid feeding sensitive production data into the skill without appropriate controls.
4) Test generated code in an isolated or staging environment first, and supply credentials via your normal secret-store mechanism.
If you want to reduce risk, disable autonomous invocation for this skill or review its outputs manually before execution.
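
Part of the manual inspection suggested above can be pre-screened with a simple text scan. The sketch below is a rough heuristic under stated assumptions: the two regular expressions and the scan_generated_code helper are illustrative, not part of the skill, and they will miss many real cases. The sample DAG fragment is only a string and is never executed.

```python
import re

# Illustrative patterns only; a real review needs a much broader rule set.
SUSPICIOUS = {
    "hard-coded credential": re.compile(
        r"(password|secret|token|api_key)\s*=\s*['\"][^'\"]+['\"]", re.I
    ),
    "destructive command": re.compile(
        r"\brm\s+-rf\b|\bDROP\s+(TABLE|DATABASE)\b", re.I
    ),
}


def scan_generated_code(source: str):
    """Return (line_number, label) pairs for lines matching a suspicious pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SUSPICIOUS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


# Hypothetical generated DAG fragment, held as text for scanning.
generated = '''
extract = BashOperator(task_id="extract", bash_command="rm -rf /data/staging && run_extract.sh")
load = PostgresOperator(task_id="load", postgres_conn_id="warehouse", sql="load.sql")
password = "hunter2"
'''
hits = scan_generated_code(generated)
```

A connection ID such as postgres_conn_id="warehouse" is not flagged, since it names a connection configured elsewhere and carries no secret. A clean scan is not proof of safety; it only narrows what a reviewer reads first.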

Functional analysis

Type: OpenClaw Skill
Name: kv-senior-data-engineering
Version: 1.0.0

The skill bundle provides a comprehensive and legitimate set of tools for senior data engineering tasks, including pipeline orchestration, data quality validation, and performance optimization. The Python scripts (pipeline_orchestrator.py, data_quality_validator.py, and etl_performance_optimizer.py) are well-documented and implement their stated features using standard libraries without any evidence of malicious intent, data exfiltration, or unauthorized execution. While pipeline_orchestrator.py uses the compile() function for syntax validation of generated or provided code, it does not execute the code, and the overall behavior is strictly aligned with the professional data engineering purpose described in SKILL.md.
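
The distinction between compiling and executing can be shown with Python's built-in compile(). This sketch is not taken from pipeline_orchestrator.py; the helper name syntax_error_in and the sample sources are assumptions. It shows only that compile() in "exec" mode parses source into a code object and reports syntax errors without running anything.

```python
def syntax_error_in(source: str, filename: str = "<generated>"):
    """Return None if `source` parses as Python, else a short error message."""
    try:
        # mode="exec" builds a code object for a module body; nothing is run.
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


side_effects = []
valid = "side_effects.append('ran')\n"   # would mutate the list if executed
broken = "def extract(:\n    pass\n"     # malformed parameter list

ok_result = syntax_error_in(valid)
bad_result = syntax_error_in(broken)
```

After both calls, side_effects is still empty, which confirms the valid snippet was parsed but never run. Running the code object would take a separate exec() call.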

Capability assessment

✓ Purpose & Capability

Name and description (designing pipelines, ETL/ELT, quality, Airflow/dbt/Spark/Kafka guidance) match the provided reference docs and three scripts (pipeline generator, quality validator, ETL optimizer). There are no unrelated binaries, credentials, or config paths declared that would be inconsistent with the stated purpose.

✓ Instruction Scope

SKILL.md instructs the agent to load local reference files under {baseDir}/references and to run the packaged scripts under {baseDir}/scripts. The instructions do not ask the agent to read arbitrary system files, environment variables, or remote endpoints beyond what is normal for data-engineering artifacts. It does generate DAGs and commands that will later require environment-specific connections (Airflow conn IDs, Snowflake/Postgres connection names), which is expected for this purpose.

✓ Install Mechanism

No install spec or external downloads are present; this is an instruction-only skill with bundled Python scripts and docs. Nothing is fetched from external URLs or installed at runtime by the skill itself.

ℹ Credentials

The skill declares no required environment variables or credentials (none in requires.env). However, generated artifacts (Airflow DAGs, SnowflakeOperator/PostgresOperator usage, Kafka examples) will expect platform-specific connection identifiers and secrets to be present in the target environment (Airflow connections, cloud credentials) — this is normal but users should not assume this skill will auto-supply or manage those credentials.

✓ Persistence & Privilege

always is false and model invocation is allowed (default). The skill does not request permanent system presence or attempt to modify other skills or system-wide agent settings.

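How to use

Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install kv-senior-data-engineering
Once installed, invoke the Skill by name or trigger it with /kv-senior-data-engineering
Provide the required inputs according to the Skill's parameter description to get structured output.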

Version history

v1.0.0

Initial release of Senior Data Engineer skill.
- Provides guidance on designing data pipelines, data modeling, quality frameworks, and DataOps best practices.
- Includes workflow for classifying requests and referencing detailed guides for architecture, modeling, testing, workflow, and troubleshooting.
- Supports generating artifacts such as pipeline configs, dbt models, schema DDL, and quality rules.
- Enforces key engineering guardrails and self-checks for idempotency, error handling, and data quality coverage.
- Recommends technologies and architectures based on data volume, latency, and complexity requirements.

Metadata

Slug kv-senior-data-engineering

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ