
Data Quality Check

Author: datadrivenconstruction · GitHub · v2.1.0

win32 ✓ Security check passed

Total downloads: 1251

Install in OpenClaw

/install data-quality-check

Description

Assess construction data quality using completeness, accuracy, consistency, timeliness, and validity metrics. Automated validation with regex patterns, thres...

Usage (SKILL.md)

# Data Quality Check for Construction

## Overview

Based on the DDC methodology (Chapter 2.6), this skill provides comprehensive data quality assessment for construction projects. Poor data quality leads to poor decisions - validate early, validate often.

**Book Reference**: "Требования к качеству данных и его обеспечение" / "Data Quality Requirements"

> "Data quality is determined by five key metrics: completeness, accuracy, consistency, timeliness, and validity."
> (DDC Book, Chapter 2.6)

## Quick Start

```python
import pandas as pd

# Load construction data
df = pd.read_excel("bim_export.xlsx")

# Quick quality check
quality_score = {
    'completeness': (1 - df.isnull().sum().sum() / df.size) * 100,
    'unique_ids': df['ElementId'].nunique() == len(df),
    'valid_volumes': (df['Volume_m3'] >= 0).all()
}

print(f"Completeness: {quality_score['completeness']:.1f}%")
print(f"Unique IDs: {quality_score['unique_ids']}")
print(f"Valid volumes: {quality_score['valid_volumes']}")
```
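
To try the quick check without an Excel export, a small in-memory frame can stand in for `bim_export.xlsx`. This is only a sketch: the rows below are hypothetical values, chosen so that each of the three checks has something to catch, and the column names follow the Quick Start.

```python
import pandas as pd

# Hypothetical sample standing in for bim_export.xlsx (values are made up)
df = pd.DataFrame({
    "ElementId": ["W001", "W002", "W002", "F010"],   # one duplicated ID
    "Category": ["Wall", "Wall", None, "Floor"],     # one missing value
    "Volume_m3": [12.5, 8.0, -1.0, 30.2],            # one negative volume
})

# Same three checks as in the Quick Start
completeness = (1 - df.isnull().sum().sum() / df.size) * 100
unique_ids = df["ElementId"].nunique() == len(df)
valid_volumes = (df["Volume_m3"] >= 0).all()

print(f"Completeness: {completeness:.1f}%")
print(f"Unique IDs: {unique_ids}")
print(f"Valid volumes: {valid_volumes}")
```

With this sample, one of twelve cells is empty, so completeness comes out at 91.7%, and both the uniqueness and the volume check report `False`.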

## Data Quality Dimensions

### The 5 Quality Metrics

```python
import pandas as pd
import numpy as np
import re
from datetime import datetime, timedelta

class DataQualityChecker:
    """Comprehensive data quality assessment for construction data"""

    def __init__(self, df):
        self.df = df.copy()
        self.results = {}
        self.issues = []

    def check_completeness(self, required_columns=None):
        """Check for missing values (Полнота)"""
        if required_columns is None:
            required_columns = self.df.columns.tolist()

        completeness = {}
        for col in required_columns:
            if col in self.df.columns:
                non_null = self.df[col].notna().sum()
                total = len(self.df)
                completeness[col] = (non_null / total) * 100
            else:
                completeness[col] = 0
                self.issues.append(f"Missing required column: {col}")

        overall = np.mean(list(completeness.values()))

        self.results['completeness'] = {
            'by_column': completeness,
            'overall': overall,
            'threshold': 95,
            'passed': overall >= 95
        }

        return self.results['completeness']

    def check_accuracy(self, rules=None):
        """Check data accuracy against rules (Точность)"""
        if rules is None:
            # Default construction data rules
            rules = {
                'Volume_m3': {'min': 0, 'max': 10000},
                'Area_m2': {'min': 0, 'max': 100000},
                'Weight_kg': {'min': 0, 'max': 1000000},
                'Cost': {'min': 0, 'max': 100000000}
            }

        accuracy = {}
        for col, bounds in rules.items():
            if col in self.df.columns:
                valid = self.df[col].between(
                    bounds.get('min', -np.inf),
                    bounds.get('max', np.inf)
                ).sum()
                total = self.df[col].notna().sum()
                accuracy[col] = (valid / total * 100) if total > 0 else 100

                # Log invalid values
                invalid_count = total - valid
                if invalid_count > 0:
                    self.issues.append(
                        f"{col}: {invalid_count} values outside range [{bounds.get('min')}, {bounds.get('max')}]"
                    )

        overall = np.mean(list(accuracy.values())) if accuracy else 100

        self.results['accuracy'] = {
            'by_column': accuracy,
            'overall': overall,
            'threshold': 98,
            'passed': overall >= 98
        }

        return self.results['accuracy']

    def check_consistency(self, unique_cols=None, relationship_rules=None):
        """Check data consistency (Согласованность)"""
        consistency = {}

        # Check unique columns
        if unique_cols is None:
            unique_cols = ['ElementId']

        for col in unique_cols:
            if col in self.df.columns:
                is_unique = self.df[col].nunique() == len(self.df)
                consistency[f'{col}_unique'] = (
                    100 if is_unique
                    else self.df[col].nunique() / len(self.df) * 100
                )

                if not is_unique:
                    duplicates = self.df[self.df[col].duplicated()][col].unique()
                    self.issues.append(f"Duplicate {col}: {len(duplicates)} duplicates found")

        # Check cross-field relationships
        if relationship_rules is None:
            relationship_rules = [
                ('End_Date', '>=', 'Start_Date'),
                ('Gross_Volume', '>=', 'Net_Volume')
            ]

        for col1, op, col2 in relationship_rules:
            if col1 in self.df.columns and col2 in self.df.columns:
                if op == '>=':
                    valid = (self.df[col1] >= self.df[col2]).sum()
                elif op == '>':
                    valid = (self.df[col1] > self.df[col2]).sum()
                elif op == '==':
                    valid = (self.df[col1] == self.df[col2]).sum()
                else:
                    # Unknown operator: record it and skip the rule instead of
                    # failing later on an undefined `valid`
                    self.issues.append(f"Unsupported operator '{op}' in rule: {col1} {op} {col2}")
                    continue

                # Only rows where both values are present count toward the total
                total = self.df[[col1, col2]].notna().all(axis=1).sum()
                consistency[f'{col1}_{op}_{col2}'] = (valid / total * 100) if total > 0 else 100

        overall = np.mean(list(consistency.values())) if consistency else 100

        self.results['consistency'] = {
            'checks': consistency,
            'overall': overall,
            'threshold': 99,
            'passed': overall >= 99
        }

        return self.results['consistency']

    def check_timeliness(self, date_col='Modified_Date', max_age_days=30):
        """Check data timeliness (Своевременность)"""
        if date_col not in self.df.columns:
            self.results['timeliness'] = {
                'overall': None,
                'message': f'Column {date_col} not found'
            }
            return self.results['timeliness']

        dates = pd.to_datetime(self.df[date_col], errors='coerce')
        cutoff = datetime.now() - timedelta(days=max_age_days)

        recent = (dates >= cutoff).sum()
        total = dates.notna().sum()
        timeliness_pct = (recent / total * 100) if total > 0 else 0

        oldest = dates.min()
        newest = dates.max()
        avg_age = (datetime.now() - dates.mean()).days if dates.notna().any() else None

        self.results['timeliness'] = {
            'recent_percentage': timeliness_pct,
            'oldest_record': oldest,
            'newest_record': newest,
            'average_age_days': avg_age,
            'threshold': 80,
            'passed': timeliness_pct >= 80
        }

        return self.results['timeliness']

    def check_validity(self, patterns=None):
        """Check data validity with regex patterns (Достоверность)"""
        if patterns is None:
            patterns = {
                'ElementId': r'^[A-Z]{1,3}\d{3,6}$',  # e.g., W001, FL12345
                'Level': r'^Level\s*\d+$|^L\d+$|^Уровень\s*\d+$',
                'Email': r'^[\w\.-]+@[\w\.-]+\.\w+$',
                'Phone': r'^\+?\d{10,15}$'
            }

        validity = {}
        for col, pattern in patterns.items():
            if col in self.df.columns:
                non_null = self.df[col].dropna()
                if len(non_null) > 0:
                    matches = non_null.astype(str).str.match(pattern).sum()
                    validity[col] = (matches / len(non_null) * 100)

                    invalid = len(non_null) - matches
                    if invalid > 0:
                        self.issues.append(f"{col}: {invalid} values don't match pattern")
                else:
                    validity[col] = 100

        overall = np.mean(list(validity.values())) if validity else 100

        self.results['validity'] = {
            'by_column': validity,
            'overall': overall,
            'threshold': 95,
            'passed': overall >= 95
        }

        return self.results['validity']

    def run_full_check(self):
        """Run all quality checks with their default settings"""
        # Start from a clean slate so repeated runs do not duplicate issues
        self.results = {}
        self.issues = []

        self.check_completeness()
        self.check_accuracy()
        self.check_consistency()
        self.check_timeliness()
        self.check_validity()

        # Calculate overall score (timeliness is reported separately)
        scores = []
        for metric in ['completeness', 'accuracy', 'consistency', 'validity']:
            overall = self.results.get(metric, {}).get('overall')
            # Compare against None so a legitimate 0% score is not dropped
            if overall is not None:
                scores.append(overall)

        self.results['overall_score'] = np.mean(scores) if scores else 0
        self.results['grade'] = self._calculate_grade(self.results['overall_score'])
        self.results['issues'] = self.issues

        return self.results

    def _calculate_grade(self, score):
        """Calculate quality grade"""
        if score >= 98:
            return 'A+'
        elif score >= 95:
            return 'A'
        elif score >= 90:
            return 'B'
        elif score >= 80:
            return 'C'
        elif score >= 70:
            return 'D'
        else:
            return 'F'

    def generate_report(self):
        """Generate quality report"""
        if not self.results:
            self.run_full_check()

        report = []
        report.append("=" * 60)
        report.append("DATA QUALITY REPORT")
        report.append("=" * 60)
        report.append(f"Records analyzed: {len(self.df)}")
        report.append(f"Columns: {len(self.df.columns)}")
        report.append("")
        report.append(f"OVERALL SCORE: {self.results['overall_score']:.1f}% (Grade: {self.results['grade']})")
        report.append("")
        report.append("-" * 60)

        # Detail by dimension
        for metric in ['completeness', 'accuracy', 'consistency', 'validity', 'timeliness']:
            if metric in self.results:
                r = self.results[metric]
                passed = '✓' if r.get('passed', False) else '✗'
                # Timeliness stores its score under 'recent_percentage'
                overall = r.get('overall')
                if overall is None:
                    overall = r.get('recent_percentage', 'N/A')
                if isinstance(overall, (int, float)):
                    report.append(f"{metric.upper():15s}: {overall:>6.1f}% {passed}")
                else:
                    report.append(f"{metric.upper():15s}: {overall}")

        report.append("-" * 60)

        if self.issues:
            report.append("")
            report.append("ISSUES FOUND:")
            for issue in self.issues[:10]:  # Show first 10
                report.append(f"  • {issue}")
            if len(self.issues) > 10:
                report.append(f"  ... and {len(self.issues) - 10} more issues")

        report.append("")
        report.append("=" * 60)

        return "\n".join(report)
```

## Validation Rules Builder

### Custom Validation Rules

```python
import numpy as np

class ValidationRulesBuilder:
    """Build custom validation rules for construction data"""

    def __init__(self):
        self.rules = []

    def add_not_null(self, column):
        """Column must not have null values"""
        self.rules.append({
            'type': 'not_null',
            'column': column,
            'check': lambda df, col=column: df[col].notna().all()
        })
        return self

    def add_unique(self, column):
        """Column must have unique values"""
        self.rules.append({
            'type': 'unique',
            'column': column,
            'check': lambda df, col=column: df[col].nunique() == len(df)
        })
        return self

    def add_range(self, column, min_val=None, max_val=None):
        """Column values must be within range (None means unbounded)"""
        self.rules.append({
            'type': 'range',
            'column': column,
            'min': min_val,
            'max': max_val,
            # Test for None explicitly: `mn or -np.inf` would turn a bound
            # of 0 into "no bound" and let negative values pass
            'check': lambda df, col=column, mn=min_val, mx=max_val:
                df[col].between(
                    -np.inf if mn is None else mn,
                    np.inf if mx is None else mx
                ).all()
        })
        return self

    def add_regex(self, column, pattern):
        """Column values must match regex pattern"""
        self.rules.append({
            'type': 'regex',
            'column': column,
            'pattern': pattern,
            'check': lambda df, col=column, p=pattern:
                df[col].astype(str).str.match(p).all()
        })
        return self

    def add_in_list(self, column, valid_values):
        """Column values must be in list"""
        self.rules.append({
            'type': 'in_list',
            'column': column,
            'valid_values': valid_values,
            'check': lambda df, col=column, vals=valid_values:
                df[col].isin(vals).all()
        })
        return self

    def add_custom(self, name, check_func):
        """Add custom validation function"""
        self.rules.append({
            'type': 'custom',
            'name': name,
            'check': check_func
        })
        return self

    def validate(self, df):
        """Run all validation rules"""
        results = []

        for rule in self.rules:
            rule_name = rule.get('name', f"{rule['type']}:{rule.get('column', 'custom')}")
            try:
                passed = rule['check'](df)
                results.append({
                    'rule': rule_name,
                    'passed': bool(passed),
                    'type': rule['type']
                })
            except Exception as e:
                # A rule that raises (e.g. missing column) counts as failed
                results.append({
                    'rule': rule_name,
                    'passed': False,
                    'type': rule['type'],
                    'error': str(e)
                })

        return results

# Usage example (df is the DataFrame loaded in Quick Start)
rules = (ValidationRulesBuilder()
    .add_not_null('ElementId')
    .add_unique('ElementId')
    .add_range('Volume_m3', min_val=0)
    .add_range('Cost', min_val=0)
    .add_in_list('Category', ['Wall', 'Floor', 'Column', 'Beam', 'Slab'])
    .add_regex('Level', r'^Level\s*\d+$')
)

results = rules.validate(df)
for r in results:
    status = '✓' if r['passed'] else '✗'
    print(f"{status} {r['rule']}")
```

## Automated Quality Pipeline

```python
from datetime import datetime

import numpy as np
import pandas as pd

class DataQualityPipeline:
    """Automated data quality pipeline"""

    def __init__(self, config=None):
        self.config = config or self._default_config()
        self.history = []

    def _default_config(self):
        return {
            'required_columns': ['ElementId', 'Category', 'Volume_m3'],
            'unique_columns': ['ElementId'],
            'numeric_ranges': {
                'Volume_m3': (0, 10000),
                'Area_m2': (0, 100000),
                'Cost': (0, 100000000)
            },
            'valid_categories': ['Wall', 'Floor', 'Column', 'Beam', 'Slab',
                                 'Foundation', 'Roof', 'Stair', 'Door', 'Window'],
            'min_quality_score': 90
        }

    def run(self, df, source_name='unknown'):
        """Run quality pipeline"""
        checker = DataQualityChecker(df)

        # Run each check with the pipeline configuration
        checker.check_completeness(self.config['required_columns'])
        checker.check_accuracy({
            col: {'min': r[0], 'max': r[1]}
            for col, r in self.config['numeric_ranges'].items()
        })
        checker.check_consistency(self.config['unique_columns'])
        checker.check_timeliness()
        checker.check_validity()

        # Aggregate here rather than calling run_full_check(), which would
        # re-run every check with default settings and ignore the config
        scored = ['completeness', 'accuracy', 'consistency', 'validity']
        results = checker.results
        results['overall_score'] = float(np.mean([results[m]['overall'] for m in scored]))
        results['grade'] = checker._calculate_grade(results['overall_score'])
        results['issues'] = checker.issues

        # Store in history
        self.history.append({
            'timestamp': datetime.now(),
            'source': source_name,
            'records': len(df),
            'score': results['overall_score'],
            'grade': results['grade'],
            'issues_count': len(results['issues'])
        })

        # Check threshold
        passed = results['overall_score'] >= self.config['min_quality_score']

        return {
            'passed': passed,
            'score': results['overall_score'],
            'grade': results['grade'],
            'details': results,
            'report': checker.generate_report()
        }

    def get_history_summary(self):
        """Get quality history summary"""
        if not self.history:
            return "No quality checks performed yet."

        df_history = pd.DataFrame(self.history)
        return {
            'total_checks': len(self.history),
            'avg_score': df_history['score'].mean(),
            'min_score': df_history['score'].min(),
            'max_score': df_history['score'].max(),
            'latest': self.history[-1]
        }
```

## Quality Reporting

### Export Quality Report

```python
def export_quality_report(df, output_path, include_details=True):
    """Export comprehensive quality report to Excel"""
    checker = DataQualityChecker(df)
    results = checker.run_full_check()

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        # Summary sheet
        summary = pd.DataFrame({
            'Metric': ['Overall Score', 'Grade', 'Records', 'Columns', 'Issues'],
            'Value': [
                f"{results['overall_score']:.1f}%",
                results['grade'],
                len(df),
                len(df.columns),
                len(results['issues'])
            ]
        })
        summary.to_excel(writer, sheet_name='Summary', index=False)

        # Completeness details
        if 'completeness' in results:
            comp_df = pd.DataFrame.from_dict(
                results['completeness']['by_column'],
                orient='index',
                columns=['Completeness_%']
            )
            comp_df.to_excel(writer, sheet_name='Completeness')

        # Issues list
        if results['issues']:
            issues_df = pd.DataFrame({'Issue': results['issues']})
            issues_df.to_excel(writer, sheet_name='Issues', index=False)

        # Missing values analysis
        if include_details:
            missing = df.isnull().sum()
            missing_df = pd.DataFrame({
                'Column': missing.index,
                'Missing_Count': missing.values,
                'Missing_%': (missing.values / len(df) * 100).round(2)
            })
            missing_df.to_excel(writer, sheet_name='Missing_Values', index=False)

    return output_path
```

## Quick Reference

| Metric | Description | Threshold |
|--------|-------------|-----------|
| Completeness | % non-null values | ≥ 95% |
| Accuracy | Values within valid range | ≥ 98% |
| Consistency | Unique IDs, valid relationships | ≥ 99% |
| Validity | Match expected patterns | ≥ 95% |
| Timeliness | Records updated recently | ≥ 80% |
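
The thresholds in the table can be applied to a set of metric scores along these lines. This is a minimal sketch rather than part of the skill: `THRESHOLDS` and `evaluate` are hypothetical names, and the scores in the example are made up for illustration.

```python
# Thresholds from the Quick Reference table (percent)
THRESHOLDS = {
    "completeness": 95,
    "accuracy": 98,
    "consistency": 99,
    "validity": 95,
    "timeliness": 80,
}

def evaluate(scores):
    """Return {metric: passed} for every metric with a known threshold.

    `scores` maps metric name to a percentage. A metric without a score
    is reported as failed rather than silently skipped.
    """
    return {
        metric: scores.get(metric) is not None and scores[metric] >= limit
        for metric, limit in THRESHOLDS.items()
    }

# Hypothetical scores for illustration only (timeliness deliberately absent)
example = {"completeness": 96.2, "accuracy": 97.5, "consistency": 100.0, "validity": 95.0}
outcome = evaluate(example)
failed = [metric for metric, ok in outcome.items() if not ok]
print(failed)  # ['accuracy', 'timeliness']
```

Treating a missing score as a failure is a design choice of this sketch; a caller that considers timeliness optional could drop it from `THRESHOLDS` instead.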

## Common Validation Patterns

```python
# Construction-specific regex patterns
PATTERNS = {
    'element_id': r'^[A-Z]{1,3}\d{3,8}$',
    'revit_id': r'^\d{5,8}$',
    'ifc_guid': r'^[A-Za-z0-9_$]{22}$',
    'level': r'^(Level|L|Уровень)\s*[-]?\d+$',
    'grid': r'^[A-Z]{1,2}[-/]?\d{0,3}$',
    'date_iso': r'^\d{4}-\d{2}-\d{2}$',
    'cost_code': r'^\d{2,3}[.-]\d{2,4}[.-]?\d{0,4}$'
}
```
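
One way to apply such patterns to a column is sketched below with pandas. The helper name `invalid_values` and the sample values are hypothetical, only two of the patterns are repeated to keep the block self-contained, and `Series.str.fullmatch` assumes pandas 1.1 or later.

```python
import pandas as pd

# Subset of the patterns listed above
PATTERNS = {
    "element_id": r"^[A-Z]{1,3}\d{3,8}$",
    "date_iso": r"^\d{4}-\d{2}-\d{2}$",
}

def invalid_values(series, pattern):
    """Return the non-null values of `series` that do not match `pattern`."""
    values = series.dropna().astype(str)
    return values[~values.str.fullmatch(pattern)].tolist()

# Hypothetical sample data
ids = pd.Series(["W001", "FL12345", "w12", None, "ABCD001"])
dates = pd.Series(["2024-03-01", "01/03/2024"])

print(invalid_values(ids, PATTERNS["element_id"]))  # ['w12', 'ABCD001']
print(invalid_values(dates, PATTERNS["date_iso"]))  # ['01/03/2024']
```

Note that `date_iso` only checks the shape of the string; a value such as `2024-13-45` would still match, so calendar validity needs a separate check.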

## Resources

- **Book**: "Data-Driven Construction" by Artem Boiko, Chapter 2.6
- **Website**: https://datadrivenconstruction.io
- **Great Expectations**: https://greatexpectations.io

## Next Steps

- See `bim-validation-pipeline` for BIM-specific validation
- See `etl-pipeline` for data processing pipelines
- See `data-visualization` for quality dashboards

Safe usage recommendations

This skill appears coherent for running local data-quality checks in Python. Before installing or invoking it: 1) Confirm you are comfortable granting filesystem access (the skill needs to read user-supplied files). 2) Run it only on non-sensitive/test data first to validate behavior. 3) Ensure a Python environment with required packages (pandas, numpy) is available on your Windows host. 4) Note minor metadata oddities: the package manifest (claw.json) claims filesystem permission and a slightly different version than the registry entry; the publisher/source are not fully documented — consider checking the homepage (https://datadrivenconstruction.io) and verifying the publisher before trusting sensitive data. If you need networked or automated processing of sensitive files, require additional review; for offline, user-supplied file checks this skill looks appropriate.

Functional analysis

Type: OpenClaw Skill
Name: data-quality-check
Version: 2.1.0

The skill bundle is designed for construction data quality assessment, utilizing `pandas` and `numpy` for various checks (completeness, accuracy, consistency, timeliness, validity) and report generation. The `claw.json` explicitly requests `filesystem` permission, which is appropriately used by the Python code in `SKILL.md` for reading input Excel files (`pd.read_excel`) and exporting reports to Excel (`pd.ExcelWriter`). There is no evidence of prompt injection in `SKILL.md` or `instructions.md`, nor any malicious code attempting data exfiltration, unauthorized command execution, persistence, or obfuscation. The instructions for the agent (`instructions.md`) also guide it towards safe data handling.

Capability assessment

✓ Purpose & Capability

The name/description (data quality checks for construction data) aligns with the content: SKILL.md contains Python code and examples that read Excel/CSV/JSON and run completeness/accuracy/consistency checks. Requiring python3 is appropriate for the provided Python snippets.

✓ Instruction Scope

Instructions and code operate on user-provided data files (CSV/Excel/JSON) and include only local processing (pandas, numpy, regex). The instructions explicitly constrain the skill to 'Only use data provided by the user or referenced in the skill.' No network endpoints, external exfiltration, or requests for unrelated system data are present in the provided SKILL.md/instructions.md excerpts.

✓ Install Mechanism

This is an instruction-only skill with no install spec and no downloaded code. That minimizes disk-write/install risk. The only runtime requirement is python3 (no packages are auto-installed by the skill manifest itself).

✓ Credentials

The skill declares no required environment variables, no credentials, and no config paths. That is appropriate: the code only needs access to local data files and standard Python libraries (pandas/numpy/etc. are referenced but not enforced by the manifest).

ℹ Persistence & Privilege

claw.json lists 'filesystem' in permissions, which is consistent with reading user data files but does grant the skill the ability to access the agent's filesystem. The skill is not always-enabled and does not request other elevated privileges. Users should be aware that filesystem permission allows reading local files the agent can access.

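How to use

1. Make sure OpenClaw is installed (locally or via Docker).
2. Enter the install command in the chat box: /install data-quality-check
3. Once installation finishes, invoke the Skill by name or trigger it with /data-quality-check
4. Provide the inputs described in the Skill's parameter documentation to get structured output.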

Version history

v2.1.0

- Added comprehensive data quality assessment covering completeness, accuracy, consistency, timeliness, and validity for construction datasets.
- Introduced automated validation using regex patterns, rule-based thresholds, and column checks.
- Included programmatic usage examples and quick start guide in Python.
- Enhanced reporting with per-metric results, threshold checks, and detailed issue logging.
- Documentation now references the DDC methodology and main quality standards for construction data.

v1.0.0

Initial release: Comprehensive data quality assessment for construction data.

- Assesses completeness, accuracy, consistency, timeliness, and validity metrics.
- Provides automated validation with customizable regex patterns, thresholds, and reporting.
- Includes sample Python code and a flexible DataQualityChecker class.
- Supports detection of missing, duplicate, out-of-range, outdated, or incorrectly formatted entries.

Metadata

Slug: data-quality-check

Version: 2.1.0

License: —

Total installs: 8

Current installs: 8

Versions: 2

FAQ