
Data Type Classifier

Name: Data Type Classifier
Author: datadrivenconstruction

by datadrivenconstruction · GitHub ↗ · v2.1.0

win32 ✓ Security Clean

Downloads: 1067

Install in OpenClaw

/install data-type-classifier

Description

Classify construction data by type (structured, unstructured, semi-structured). Analyze data sources and recommend appropriate storage and processing methods.

README (SKILL.md)

# Data Type Classifier

## Overview

Based on the DDC methodology (Chapter 2.1), this skill classifies construction data by type, analyzes data sources, and recommends appropriate storage, processing, and integration methods.

**Book Reference**: "Data Types in Construction"

## Quick Start

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Dict, Optional, Any, Tuple
from datetime import datetime
import json
import re
import mimetypes

class DataStructure(Enum):
    """Data structure classification"""
    STRUCTURED = "structured"           # Tables, databases, spreadsheets
    SEMI_STRUCTURED = "semi_structured" # JSON, XML, IFC
    UNSTRUCTURED = "unstructured"       # Documents, images, videos
    GEOMETRIC = "geometric"             # CAD, BIM geometry
    TEMPORAL = "temporal"               # Time-series, schedules
    SPATIAL = "spatial"                 # GIS, coordinates

class DataFormat(Enum):
    """Common construction data formats"""
    # Structured
    CSV = "csv"
    EXCEL = "excel"
    SQL = "sql"
    PARQUET = "parquet"

    # Semi-structured
    JSON = "json"
    XML = "xml"
    IFC = "ifc"
    BCF = "bcf"

    # Unstructured
    PDF = "pdf"
    DOCX = "docx"
    IMAGE = "image"
    VIDEO = "video"

    # Geometric
    DWG = "dwg"
    DXF = "dxf"
    RVT = "rvt"
    NWD = "nwd"
    OBJ = "obj"
    STL = "stl"

    # Schedule
    MPP = "mpp"
    P6 = "p6"
    XER = "xer"

class StorageRecommendation(Enum):
    """Storage system recommendations"""
    RELATIONAL_DB = "relational_database"
    DOCUMENT_DB = "document_database"
    OBJECT_STORAGE = "object_storage"
    GRAPH_DB = "graph_database"
    TIME_SERIES_DB = "time_series_database"
    VECTOR_DB = "vector_database"
    FILE_SYSTEM = "file_system"
    DATA_LAKE = "data_lake"

@dataclass
class DataCharacteristics:
    """Characteristics of a data source"""
    has_schema: bool
    has_relationships: bool
    is_queryable: bool
    is_binary: bool
    has_geometry: bool
    has_temporal: bool
    has_text_content: bool
    avg_record_size: Optional[int] = None  # bytes
    estimated_volume: Optional[str] = None  # small/medium/large/huge
    update_frequency: Optional[str] = None

@dataclass
class DataClassification:
    """Classification result for a data source"""
    source_name: str
    source_type: str
    detected_format: DataFormat
    structure: DataStructure
    characteristics: DataCharacteristics
    storage_recommendation: StorageRecommendation
    processing_tools: List[str]
    integration_options: List[str]
    quality_considerations: List[str]
    confidence: float

@dataclass
class ClassificationReport:
    """Complete classification report"""
    total_sources: int
    classifications: List[DataClassification]
    summary_by_structure: Dict[str, int]
    summary_by_format: Dict[str, int]
    storage_recommendations: Dict[str, List[str]]
    integration_strategy: Dict[str, str]


class DataTypeClassifier:
    """
    Classify construction data by type and recommend processing methods.
    Based on DDC methodology Chapter 2.1.
    """

    def __init__(self):
        self.format_signatures = self._define_format_signatures()
        self.structure_mapping = self._define_structure_mapping()
        self.storage_mapping = self._define_storage_mapping()
        self.processing_tools = self._define_processing_tools()

    def _define_format_signatures(self) -> Dict[str, Dict]:
        """Define format detection signatures"""
        return {
            # File extensions
            ".csv": {"format": DataFormat.CSV, "structure": DataStructure.STRUCTURED},
            ".xlsx": {"format": DataFormat.EXCEL, "structure": DataStructure.STRUCTURED},
            ".xls": {"format": DataFormat.EXCEL, "structure": DataStructure.STRUCTURED},
            ".json": {"format": DataFormat.JSON, "structure": DataStructure.SEMI_STRUCTURED},
            ".xml": {"format": DataFormat.XML, "structure": DataStructure.SEMI_STRUCTURED},
            ".ifc": {"format": DataFormat.IFC, "structure": DataStructure.SEMI_STRUCTURED},
            ".bcf": {"format": DataFormat.BCF, "structure": DataStructure.SEMI_STRUCTURED},
            ".pdf": {"format": DataFormat.PDF, "structure": DataStructure.UNSTRUCTURED},
            ".docx": {"format": DataFormat.DOCX, "structure": DataStructure.UNSTRUCTURED},
            ".dwg": {"format": DataFormat.DWG, "structure": DataStructure.GEOMETRIC},
            ".dxf": {"format": DataFormat.DXF, "structure": DataStructure.GEOMETRIC},
            ".rvt": {"format": DataFormat.RVT, "structure": DataStructure.GEOMETRIC},
            ".nwd": {"format": DataFormat.NWD, "structure": DataStructure.GEOMETRIC},
            ".mpp": {"format": DataFormat.MPP, "structure": DataStructure.TEMPORAL},
            ".xer": {"format": DataFormat.XER, "structure": DataStructure.TEMPORAL},
            ".parquet": {"format": DataFormat.PARQUET, "structure": DataStructure.STRUCTURED},
            ".jpg": {"format": DataFormat.IMAGE, "structure": DataStructure.UNSTRUCTURED},
            ".png": {"format": DataFormat.IMAGE, "structure": DataStructure.UNSTRUCTURED},
            ".mp4": {"format": DataFormat.VIDEO, "structure": DataStructure.UNSTRUCTURED}
        }

    def _define_structure_mapping(self) -> Dict[DataStructure, Dict]:
        """Define characteristics for each structure type"""
        return {
            DataStructure.STRUCTURED: {
                "description": "Tabular data with fixed schema",
                "examples": ["Cost databases", "Material lists", "Vendor records"],
                "query_support": True,
                "schema_required": True
            },
            DataStructure.SEMI_STRUCTURED: {
                "description": "Hierarchical data with flexible schema",
                "examples": ["BIM models (IFC)", "API responses", "Configuration files"],
                "query_support": True,
                "schema_required": False
            },
            DataStructure.UNSTRUCTURED: {
                "description": "No predefined schema or format",
                "examples": ["Contracts", "Photos", "Emails", "Meeting notes"],
                "query_support": False,
                "schema_required": False
            },
            DataStructure.GEOMETRIC: {
                "description": "3D/2D geometric and spatial data",
                "examples": ["CAD drawings", "BIM geometry", "Point clouds"],
                "query_support": True,
                "schema_required": True
            },
            DataStructure.TEMPORAL: {
                "description": "Time-based sequential data",
                "examples": ["Schedules", "Progress data", "Sensor readings"],
                "query_support": True,
                "schema_required": True
            },
            DataStructure.SPATIAL: {
                "description": "Geographic and location data",
                "examples": ["Site maps", "GPS tracks", "GIS layers"],
                "query_support": True,
                "schema_required": True
            }
        }

    def _define_storage_mapping(self) -> Dict[DataStructure, StorageRecommendation]:
        """Map data structures to storage recommendations"""
        return {
            DataStructure.STRUCTURED: StorageRecommendation.RELATIONAL_DB,
            DataStructure.SEMI_STRUCTURED: StorageRecommendation.DOCUMENT_DB,
            DataStructure.UNSTRUCTURED: StorageRecommendation.OBJECT_STORAGE,
            DataStructure.GEOMETRIC: StorageRecommendation.FILE_SYSTEM,
            DataStructure.TEMPORAL: StorageRecommendation.TIME_SERIES_DB,
            DataStructure.SPATIAL: StorageRecommendation.RELATIONAL_DB
        }

    def _define_processing_tools(self) -> Dict[DataFormat, List[str]]:
        """Define processing tools for each format"""
        return {
            DataFormat.CSV: ["pandas", "polars", "duckdb"],
            DataFormat.EXCEL: ["pandas", "openpyxl", "xlrd"],
            DataFormat.JSON: ["json", "pandas", "jq"],
            DataFormat.XML: ["lxml", "ElementTree", "BeautifulSoup"],
            # "ifcopenshell" was listed twice under two spellings; kept once
            DataFormat.IFC: ["ifcopenshell", "xBIM"],
            DataFormat.BCF: ["bcfpython", "ifcopenshell"],
            DataFormat.PDF: ["pdfplumber", "PyPDF2", "pdf2image"],
            DataFormat.DOCX: ["python-docx", "mammoth"],
            DataFormat.DWG: ["ezdxf", "Teigha", "ODA SDK"],
            DataFormat.DXF: ["ezdxf", "dxfgrabber"],
            DataFormat.RVT: ["Revit API", "pyRevit", "Dynamo"],
            DataFormat.NWD: ["Navisworks API", "NW API"],
            DataFormat.MPP: ["mpxj", "Project API"],
            DataFormat.XER: ["xerparser", "P6 API"],
            DataFormat.PARQUET: ["pandas", "pyarrow", "polars"],
            DataFormat.IMAGE: ["PIL", "opencv", "scikit-image"],
            DataFormat.VIDEO: ["opencv", "ffmpeg", "moviepy"]
        }

    def classify_source(
        self,
        source_name: str,
        source_type: str,
        file_extension: Optional[str] = None,
        sample_data: Optional[Any] = None,
        metadata: Optional[Dict] = None
    ) -> DataClassification:
        """
        Classify a single data source.

        Args:
            source_name: Name of the data source
            source_type: Type (file, database, api, etc.)
            file_extension: File extension if applicable
            sample_data: Sample of the data for analysis
            metadata: Additional metadata

        Returns:
            Classification result
        """
        # Detect format
        detected_format, structure = self._detect_format(
            file_extension, source_type, sample_data
        )

        # Analyze characteristics
        characteristics = self._analyze_characteristics(
            detected_format, structure, sample_data, metadata
        )

        # Determine storage recommendation
        storage = self._recommend_storage(structure, characteristics)

        # Get processing tools
        tools = self.processing_tools.get(detected_format, [])

        # Determine integration options
        integration = self._get_integration_options(detected_format, structure)

        # Quality considerations
        quality = self._get_quality_considerations(detected_format, structure)

        # Calculate confidence
        confidence = self._calculate_confidence(
            file_extension, sample_data, metadata
        )

        return DataClassification(
            source_name=source_name,
            source_type=source_type,
            detected_format=detected_format,
            structure=structure,
            characteristics=characteristics,
            storage_recommendation=storage,
            processing_tools=tools,
            integration_options=integration,
            quality_considerations=quality,
            confidence=confidence
        )

    def _detect_format(
        self,
        extension: Optional[str],
        source_type: str,
        sample: Optional[Any]
    ) -> Tuple[DataFormat, DataStructure]:
        """Detect data format and structure"""
        # Check file extension
        if extension:
            ext = extension.lower() if extension.startswith('.') else f".{extension.lower()}"
            if ext in self.format_signatures:
                sig = self.format_signatures[ext]
                return sig["format"], sig["structure"]

        # Check source type
        if source_type == "database":
            return DataFormat.SQL, DataStructure.STRUCTURED
        elif source_type == "api":
            return DataFormat.JSON, DataStructure.SEMI_STRUCTURED

        # Analyze sample data
        if sample:
            if isinstance(sample, dict):
                return DataFormat.JSON, DataStructure.SEMI_STRUCTURED
            elif isinstance(sample, list) and all(isinstance(x, dict) for x in sample):
                return DataFormat.JSON, DataStructure.STRUCTURED
            elif isinstance(sample, str):
                if sample.strip().startswith('<'):
                    return DataFormat.XML, DataStructure.SEMI_STRUCTURED
                elif sample.strip().startswith('{'):
                    return DataFormat.JSON, DataStructure.SEMI_STRUCTURED

        # Default
        return DataFormat.JSON, DataStructure.SEMI_STRUCTURED

    def _analyze_characteristics(
        self,
        format: DataFormat,
        structure: DataStructure,
        sample: Optional[Any],
        metadata: Optional[Dict]
    ) -> DataCharacteristics:
        """Analyze data characteristics"""
        return DataCharacteristics(
            has_schema=structure in [DataStructure.STRUCTURED, DataStructure.TEMPORAL],
            has_relationships=format in [DataFormat.IFC, DataFormat.SQL],
            is_queryable=structure != DataStructure.UNSTRUCTURED,
            is_binary=format in [
                DataFormat.DWG, DataFormat.RVT, DataFormat.NWD,
                DataFormat.IMAGE, DataFormat.VIDEO, DataFormat.PDF
            ],
            has_geometry=structure == DataStructure.GEOMETRIC or format == DataFormat.IFC,
            has_temporal=structure == DataStructure.TEMPORAL,
            has_text_content=format in [
                DataFormat.PDF, DataFormat.DOCX, DataFormat.CSV
            ],
            estimated_volume=metadata.get("volume") if metadata else None,
            update_frequency=metadata.get("update_frequency") if metadata else None
        )

    def _recommend_storage(
        self,
        structure: DataStructure,
        characteristics: DataCharacteristics
    ) -> StorageRecommendation:
        """Recommend storage solution"""
        # Special cases
        if characteristics.has_text_content and not characteristics.has_schema:
            return StorageRecommendation.VECTOR_DB

        if characteristics.is_binary and characteristics.estimated_volume == "huge":
            return StorageRecommendation.OBJECT_STORAGE

        if characteristics.has_relationships:
            return StorageRecommendation.GRAPH_DB

        # Default mapping
        return self.storage_mapping.get(structure, StorageRecommendation.FILE_SYSTEM)

    def _get_integration_options(
        self,
        format: DataFormat,
        structure: DataStructure
    ) -> List[str]:
        """Get integration options for the data"""
        options = []

        if structure == DataStructure.STRUCTURED:
            options.extend(["Direct SQL queries", "ETL pipelines", "API export"])
        elif structure == DataStructure.SEMI_STRUCTURED:
            options.extend(["JSON/XML parsing", "Schema validation", "API integration"])
        elif structure == DataStructure.UNSTRUCTURED:
            options.extend(["OCR extraction", "NLP processing", "ML classification"])
        elif structure == DataStructure.GEOMETRIC:
            options.extend(["IFC export", "Geometry extraction", "Clash detection"])

        # Format-specific options
        if format == DataFormat.IFC:
            options.append("IFC import/export via IfcOpenShell")
        elif format == DataFormat.EXCEL:
            options.append("Pandas DataFrame conversion")
        elif format == DataFormat.PDF:
            options.append("PDF text/table extraction")

        return options

    def _get_quality_considerations(
        self,
        format: DataFormat,
        structure: DataStructure
    ) -> List[str]:
        """Get quality considerations"""
        considerations = []

        if structure == DataStructure.STRUCTURED:
            considerations.extend([
                "Validate schema consistency",
                "Check for null/missing values",
                "Verify data types"
            ])
        elif structure == DataStructure.UNSTRUCTURED:
            considerations.extend([
                "OCR accuracy verification",
                "Text encoding issues",
                "Content extraction completeness"
            ])
        elif structure == DataStructure.GEOMETRIC:
            considerations.extend([
                "Model validity (closed solids)",
                "Coordinate system consistency",
                "Unit verification"
            ])

        # Format-specific
        if format == DataFormat.IFC:
            considerations.append("IFC schema version compatibility")
        elif format == DataFormat.EXCEL:
            considerations.append("Formula vs value extraction")

        return considerations

    def _calculate_confidence(
        self,
        extension: Optional[str],
        sample: Optional[Any],
        metadata: Optional[Dict]
    ) -> float:
        """Calculate classification confidence"""
        confidence = 0.5  # Base confidence

        if extension:
            confidence += 0.3  # Extension provides good hint
        if sample:
            confidence += 0.15  # Sample data helps
        if metadata:
            confidence += 0.05  # Metadata adds context

        return min(1.0, confidence)

    def classify_multiple(
        self,
        sources: List[Dict]
    ) -> ClassificationReport:
        """
        Classify multiple data sources.

        Args:
            sources: List of source definitions

        Returns:
            Complete classification report
        """
        classifications = []

        for source in sources:
            classification = self.classify_source(
                source_name=source["name"],
                source_type=source.get("type", "file"),
                file_extension=source.get("extension"),
                sample_data=source.get("sample"),
                metadata=source.get("metadata")
            )
            classifications.append(classification)

        # Generate summaries
        summary_structure = {}
        summary_format = {}
        storage_recs = {}

        for c in classifications:
            # Structure summary
            struct = c.structure.value
            summary_structure[struct] = summary_structure.get(struct, 0) + 1

            # Format summary
            fmt = c.detected_format.value
            summary_format[fmt] = summary_format.get(fmt, 0) + 1

            # Storage recommendations
            storage = c.storage_recommendation.value
            if storage not in storage_recs:
                storage_recs[storage] = []
            storage_recs[storage].append(c.source_name)

        # Integration strategy
        strategy = self._generate_integration_strategy(classifications)

        return ClassificationReport(
            total_sources=len(sources),
            classifications=classifications,
            summary_by_structure=summary_structure,
            summary_by_format=summary_format,
            storage_recommendations=storage_recs,
            integration_strategy=strategy
        )

    def _generate_integration_strategy(
        self,
        classifications: List[DataClassification]
    ) -> Dict[str, str]:
        """Generate integration strategy"""
        strategy = {}

        # Group by structure
        structured = [c for c in classifications if c.structure == DataStructure.STRUCTURED]
        semi = [c for c in classifications if c.structure == DataStructure.SEMI_STRUCTURED]
        unstructured = [c for c in classifications if c.structure == DataStructure.UNSTRUCTURED]
        geometric = [c for c in classifications if c.structure == DataStructure.GEOMETRIC]

        if structured:
            strategy["structured_data"] = (
                "Use ETL pipeline to consolidate into central data warehouse. "
                "Implement SQL-based querying and reporting."
            )

        if semi:
            strategy["semi_structured_data"] = (
                "Use document database for flexible storage. "
                "Implement schema validation at ingestion."
            )

        if unstructured:
            strategy["unstructured_data"] = (
                "Extract text content using OCR/NLP. "
                "Store in vector database for semantic search."
            )

        if geometric:
            strategy["geometric_data"] = (
                "Standardize on IFC format for exchange. "
                "Maintain native formats for editing."
            )

        return strategy

    def generate_report(self, report: ClassificationReport) -> str:
        """Generate classification report"""
        output = f"""
# Data Classification Report

**Total Sources Analyzed:** {report.total_sources}

## Summary by Structure

"""
        for struct, count in report.summary_by_structure.items():
            output += f"- **{struct.title()}**: {count} sources\n"

        output += "\n## Summary by Format\n\n"
        for fmt, count in report.summary_by_format.items():
            output += f"- **{fmt.upper()}**: {count} sources\n"

        output += "\n## Storage Recommendations\n\n"
        for storage, sources in report.storage_recommendations.items():
            output += f"### {storage.replace('_', ' ').title()}\n"
            for src in sources:
                output += f"- {src}\n"
            output += "\n"

        output += "## Integration Strategy\n\n"
        for category, strategy in report.integration_strategy.items():
            output += f"### {category.replace('_', ' ').title()}\n{strategy}\n\n"

        output += "## Detailed Classifications\n\n"
        for c in report.classifications[:10]:
            output += f"""
### {c.source_name}
- **Format:** {c.detected_format.value}
- **Structure:** {c.structure.value}
- **Storage:** {c.storage_recommendation.value}
- **Tools:** {', '.join(c.processing_tools[:3])}
- **Confidence:** {c.confidence:.0%}
"""

        return output
```
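
The classifier above detects formats from the file extension, so a renamed or extension-less file falls through to the default. As a possible complement, the sketch below checks the leading bytes of a file against a few well-known signatures. The names `MAGIC_PREFIXES` and `sniff_format` and the returned labels are hypothetical; they are not part of the skill and are not wired into `DataFormat`.

```python
from typing import Optional

# (prefix, label) pairs for a handful of formats with stable leading bytes.
# Labels are illustrative only; map them to DataFormat yourself if needed.
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    # STEP Part 21 header; IFC files in STEP encoding begin this way,
    # but so do other STEP files, so treat it as a hint only.
    (b"ISO-10303-21", "step_p21"),
    # ZIP container; .xlsx and .docx are ZIP archives, so this is ambiguous.
    (b"PK\x03\x04", "zip_container"),
]


def sniff_format(head: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None."""
    for prefix, label in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return label
    return None


if __name__ == "__main__":
    print(sniff_format(b"%PDF-1.7\n"))
    print(sniff_format(b"name,cost\n"))
```

To use it on a real file, read the first few dozen bytes in binary mode (`open(path, "rb").read(32)`) and pass them in. A `None` result means no signature matched, not that the file is invalid.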

## Common Use Cases

### Classify Single Data Source

```python
classifier = DataTypeClassifier()

# Classify a BIM model
classification = classifier.classify_source(
    source_name="Building Model",
    source_type="file",
    file_extension=".ifc",
    metadata={"volume": "large"}
)

print(f"Format: {classification.detected_format.value}")
print(f"Structure: {classification.structure.value}")
print(f"Storage: {classification.storage_recommendation.value}")
print(f"Tools: {classification.processing_tools}")
```
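
A classification result is a dataclass whose fields include enum members, which `json.dumps` cannot encode on its own. The sketch below shows one way to serialize such an object. `Structure` and `Result` are trimmed, hypothetical stand-ins for `DataStructure` and `DataClassification` so that the example runs by itself; `to_json` is likewise an illustrative helper, not part of the skill.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Structure(Enum):  # hypothetical stand-in for DataStructure
    SEMI_STRUCTURED = "semi_structured"


@dataclass
class Result:  # hypothetical, trimmed stand-in for DataClassification
    source_name: str
    structure: Structure
    confidence: float


def to_json(result) -> str:
    """Serialize a dataclass instance, writing enum members as their values."""
    def encode(obj):
        if isinstance(obj, Enum):
            return obj.value
        raise TypeError(f"Cannot serialize {type(obj).__name__}")

    return json.dumps(asdict(result), default=encode, indent=2)


if __name__ == "__main__":
    print(to_json(Result("Building Model", Structure.SEMI_STRUCTURED, 0.85)))
```

The same `to_json` helper should work on a real `DataClassification`, since `asdict` recurses into nested dataclasses such as `DataCharacteristics`.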

### Classify Multiple Sources

```python
sources = [
    {"name": "Cost Database", "type": "database", "extension": ".sql"},
    {"name": "Building Model", "type": "file", "extension": ".ifc"},
    {"name": "Contract PDFs", "type": "file", "extension": ".pdf"},
    {"name": "Site Photos", "type": "file", "extension": ".jpg"},
    {"name": "Schedule", "type": "file", "extension": ".mpp"}
]

report = classifier.classify_multiple(sources)

print(f"Total: {report.total_sources}")
print(f"By structure: {report.summary_by_structure}")
```
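
Writing the `sources` list by hand does not scale to a project folder. The sketch below builds the same dictionary shape (`name`, `type`, `extension`) from a list of paths. `build_sources` is a hypothetical helper, and using the file stem as the source name is an assumption made for illustration.

```python
from pathlib import PurePath
from typing import Dict, Iterable, List


def build_sources(paths: Iterable[str]) -> List[Dict[str, str]]:
    """Turn file paths into source definitions for classify_multiple."""
    sources = []
    for raw in paths:
        path = PurePath(raw)
        sources.append({
            "name": path.stem,                  # file name without extension
            "type": "file",
            "extension": path.suffix.lower(),   # "" when there is no extension
        })
    return sources


if __name__ == "__main__":
    for source in build_sources(["models/Tower_A.IFC", "docs/contract.pdf"]):
        print(source)
```

An empty `extension` is falsy, so `classify_source` skips the extension lookup for such entries and falls back to the source type, the sample data, or the default.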

### Generate Classification Report

```python
report_text = classifier.generate_report(report)
print(report_text)

# Save to file (explicit encoding so non-ASCII source names are written consistently)
with open("classification_report.md", "w", encoding="utf-8") as f:
    f.write(report_text)
```

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `DataTypeClassifier` | Main classification engine |
| `DataStructure` | Structure types (structured, semi-structured, unstructured, geometric, temporal, spatial) |
| `DataFormat` | Recognized data formats |
| `StorageRecommendation` | Storage system recommendations |
| `DataClassification` | Classification result for one source |
| `ClassificationReport` | Multi-source report |

## Resources

- **Book**: "Data-Driven Construction" by Artem Boiko, Chapter 2.1
- **Website**: https://datadrivenconstruction.io

## Next Steps

- Use [sql-query-builder](../sql-query-builder/SKILL.md) for structured data queries
- Use [pdf-to-structured](../../Chapter-2.4/pdf-to-structured/SKILL.md) for unstructured data
- Use [data-model-designer](../../Chapter-2.5/data-model-designer/SKILL.md) for schema design

Usage Guidance

This skill appears coherent and focused on classifying construction data. It is instruction-only (no bundled code to install) and will read files you provide, so: only supply the files you intend it to analyze, avoid giving system or credential files, and install/enable optional tools (tesseract, ifcopenshell) only if you need OCR or IFC parsing. Note the publisher is not clearly identified in the package metadata — if provenance matters, verify the homepage or source before using on sensitive projects.
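
Before enabling OCR or IFC parsing, it can help to check whether the optional tools are present at all. The following is a minimal sketch using only the standard library. `optional_tools_available` is a hypothetical helper, and it assumes the OCR binary is named `tesseract` on the PATH and the IFC library is importable as `ifcopenshell`.

```python
import importlib.util
import shutil
from typing import Dict


def optional_tools_available() -> Dict[str, bool]:
    """Report which optional dependencies can be found on this machine."""
    return {
        # Looks for an executable named "tesseract" on the PATH
        "tesseract": shutil.which("tesseract") is not None,
        # Checks whether the Python package can be imported, without importing it
        "ifcopenshell": importlib.util.find_spec("ifcopenshell") is not None,
    }


if __name__ == "__main__":
    for tool, present in optional_tools_available().items():
        print(f"{tool}: {'found' if present else 'missing'}")
```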

Capability Analysis

Type: OpenClaw Skill
Name: data-type-classifier
Version: 2.1.0

The OpenClaw skill 'data-type-classifier' is designed to classify construction data and recommend storage/processing methods. The Python code in SKILL.md implements this functionality without any signs of malicious behavior, such as data exfiltration, arbitrary code execution, or persistence mechanisms. The `filesystem` permission declared in `claw.json` is used benignly for writing a hardcoded `classification_report.md` file. Neither SKILL.md nor instructions.md contains any prompt injection attempts or instructions for the agent to perform actions outside the stated purpose.

Capability Assessment

✓ Purpose & Capability

Name/description (data classification for construction) match the declared requirements: python3 is reasonable for the provided Python implementation; tesseract (OCR) and ifcopenshell (IFC parsing) are appropriate optional dependencies for images/PDFs and IFC BIM files respectively. Filesystem permission is expected to let the skill read user-supplied files.

ℹ Instruction Scope

SKILL.md and instructions.md contain a large Python-based implementation and explicit guidance to process user-provided files/paths. The instructions constrain the agent to only use user-provided data, but they do instruct reading files from disk — this is consistent with the skill's purpose but means users should avoid giving sensitive credentials or unrelated system files as input.
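
One way to act on this advice is to filter the input list before it reaches the skill. The sketch below accepts only paths with expected extensions and rejects a few file names that commonly hold secrets. `ALLOWED_EXTENSIONS`, `BLOCKED_NAMES`, and `split_inputs` are hypothetical, and both sets are examples to adapt, not a complete policy.

```python
from pathlib import PurePath
from typing import Iterable, List, Tuple

# Example allowlist; adjust to the formats you actually intend to classify.
ALLOWED_EXTENSIONS = {".csv", ".xlsx", ".json", ".xml", ".ifc", ".pdf", ".dwg"}
# Example names that often contain credentials; not an exhaustive list.
BLOCKED_NAMES = {".env", "id_rsa", "credentials.json"}


def split_inputs(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (accepted, rejected) by file name and extension."""
    accepted, rejected = [], []
    for raw in paths:
        path = PurePath(raw)
        blocked = path.name.lower() in BLOCKED_NAMES
        allowed = path.suffix.lower() in ALLOWED_EXTENSIONS
        if blocked or not allowed:
            rejected.append(raw)
        else:
            accepted.append(raw)
    return accepted, rejected


if __name__ == "__main__":
    ok, skipped = split_inputs(["site/plan.dwg", ".env", "notes.txt"])
    print("accepted:", ok)
    print("rejected:", skipped)
```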

✓ Install Mechanism

No install spec (instruction-only), which is the lowest-risk pattern. The skill expects system binaries to be present rather than downloading code. This is proportionate for a documentation/sample-code skill.

✓ Credentials

No environment variables, no credentials, and no config paths are requested. The lack of external secrets or unrelated env-vars is appropriate for this functionality.

✓ Persistence & Privilege

always:false and normal autonomous invocation. The skill requests filesystem permission (present in claw.json) to read user files, which is reasonable and scoped to its purpose. It does not request persistent elevated privileges or alter other skills' configs.

How to Use

1. Make sure OpenClaw is installed (local or Docker).
2. Run the install command in chat: /install data-type-classifier
3. After installation, invoke the skill by name or use /data-type-classifier
4. Provide the required inputs per the skill's parameter spec and get structured output.

Version History

v2.1.0

- Added detailed data structure and format classifications for construction data, including new enums for geometric, temporal, and spatial types.
- Expanded storage and processing recommendations based on specific data characteristics.
- Introduced book reference and DDC methodology overview for better context.
- Enhanced examples and descriptions for each data type to aid understanding.
- Updated metadata to clarify platform requirements and external dependencies.

v1.0.0

- Initial release of the Data Type Classifier skill.
- Classifies construction data as structured, unstructured, or semi-structured.
- Analyzes data sources and recommends suitable storage and processing methods based on DDC methodology.
- Supports identification of common construction data formats and structures.
- Provides detailed classification reports, including storage, processing, and integration recommendations.

Metadata

Slug data-type-classifier

Version 2.1.0

License —

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Data Type Classifier?

Classify construction data by type (structured, unstructured, semi-structured). Analyze data sources and recommend appropriate storage/processing methods. It is an AI Agent Skill for Claude Code / OpenClaw, with 1067 downloads so far.

How do I install Data Type Classifier?

Run "/install data-type-classifier" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Data Type Classifier free?

Yes, Data Type Classifier is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Data Type Classifier support?

Data Type Classifier runs wherever OpenClaw / Claude Code is available; the listing currently declares win32 as its platform.

Who created Data Type Classifier?

It is built and maintained by datadrivenconstruction (@datadrivenconstruction); the current version is v2.1.0.
