Description

Extract, clean, and organize legacy construction data from archives. Migrate historical project data, cost records, and schedules into modern formats.

README (SKILL.md)

# Historical Data Manager for Construction

Name: Historical Data Manager
Author: datadrivenconstruction

## Overview

Manage legacy construction data from archives, old systems, and historical records. Extract, clean, normalize, and migrate data into modern formats for analysis and benchmarking.

## Business Case

Construction companies accumulate decades of project data in various formats:

- Paper records scanned to PDF
- Legacy database exports (Access, dBase, FoxPro)
- Old spreadsheet formats (Lotus 1-2-3, early Excel)
- Proprietary software exports
- Project closeout documentation

This skill helps extract value from historical data for:

- Cost benchmarking and trending
- Productivity analysis over time
- Risk pattern identification
- Estimating improvement

## Technical Implementation

### Historical Data Extractor

```python
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from datetime import datetime
from pathlib import Path
import pandas as pd
import re
import json

@dataclass
class HistoricalRecord:
    project_id: str
    project_name: str
    year: int
    data_type: str  # cost, schedule, labor, material
    original_format: str
    extracted_data: List[Dict[str, Any]]  # one dict per source row (df.to_dict('records'))
    quality_score: float
    notes: List[str] = field(default_factory=list)

class HistoricalDataManager:
    """Manage extraction and normalization of historical construction data."""

    def __init__(self, archive_path: str):
        self.archive_path = Path(archive_path)
        self.records: List[HistoricalRecord] = []
        self.normalization_rules = self._load_normalization_rules()

    def scan_archive(self) -> Dict[str, int]:
        """Scan archive and categorize files by type."""
        file_types = {}

        for file_path in self.archive_path.rglob('*'):
            if file_path.is_file():
                ext = file_path.suffix.lower()
                file_types[ext] = file_types.get(ext, 0) + 1

        return file_types

    def extract_from_legacy_excel(self, file_path: str, year: int) -> List[HistoricalRecord]:
        """Extract data from legacy Excel files."""
        records = []

        try:
            # Try openpyxl first (.xlsx), then fall back to xlrd (.xls)
            try:
                df = pd.read_excel(file_path, engine='openpyxl')
            except Exception:
                df = pd.read_excel(file_path, engine='xlrd')

            # Detect data type from content
            data_type = self._detect_data_type(df)

            # Normalize column names
            df = self._normalize_columns(df)

            # Extract project info
            project_info = self._extract_project_info(df, file_path)

            # The fallback ID uses hash(); str hashes are randomized per interpreter
            # run unless PYTHONHASHSEED is set, so it is not stable across runs.
            record = HistoricalRecord(
                project_id=project_info.get('id') or f'LEGACY-{year}-{hash(file_path) % 10000}',
                project_name=project_info.get('name') or Path(file_path).stem,
                year=year,
                data_type=data_type,
                original_format='excel',
                extracted_data=df.to_dict('records'),
                quality_score=self._assess_quality(df)
            )
            records.append(record)

        except Exception as e:
            print(f"Error extracting {file_path}: {e}")

        return records

    def extract_from_csv(self, file_path: str, year: int) -> HistoricalRecord:
        """Extract data from CSV files, trying several encodings in turn."""
        # latin-1 decodes any byte sequence, so it must come last
        # (iso-8859-1 is an alias of latin-1 and is dropped)
        encodings = ['utf-8', 'cp1252', 'latin-1']

        df = None
        for encoding in encodings:
            try:
                df = pd.read_csv(file_path, encoding=encoding)
                break
            except UnicodeDecodeError:
                continue

        if df is None:
            raise ValueError(f"Could not decode {file_path} with any of {encodings}")

        df = self._normalize_columns(df)
        data_type = self._detect_data_type(df)

        return HistoricalRecord(
            project_id=f'CSV-{year}-{hash(file_path) % 10000}',
            project_name=Path(file_path).stem,
            year=year,
            data_type=data_type,
            original_format='csv',
            extracted_data=df.to_dict('records'),
            quality_score=self._assess_quality(df)
        )

    def extract_from_database_export(self, file_path: str, db_type: str) -> List[HistoricalRecord]:
        """Extract data from legacy database exports."""
        records = []

        if db_type == 'access':
            # Read Access MDB/ACCDB files (requires the Microsoft Access ODBC driver)
            import pyodbc
            conn_str = f'DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={file_path}'
            conn = pyodbc.connect(conn_str)

            try:
                # Get all tables
                cursor = conn.cursor()
                tables = [row.table_name for row in cursor.tables(tableType='TABLE')]

                for table in tables:
                    df = pd.read_sql(f'SELECT * FROM [{table}]', conn)
                    # Process each table...
            finally:
                # Close the connection even if a table read fails
                conn.close()

        return records

    def normalize_cost_data(self, records: List[HistoricalRecord], base_year: int = 2026) -> pd.DataFrame:
        """Normalize historical cost data to base-year dollars."""
        # Example index values only; substitute a published series
        # (e.g. the RSMeans historical cost index) before relying on results
        cost_indices = {
            2015: 0.82, 2016: 0.84, 2017: 0.87, 2018: 0.90,
            2019: 0.93, 2020: 0.95, 2021: 0.98, 2022: 1.02,
            2023: 1.06, 2024: 1.10, 2025: 1.14, 2026: 1.18
        }

        normalized_data = []

        for record in records:
            if record.data_type == 'cost':
                # Years missing from the table fall back to an index of 1.0
                year_index = cost_indices.get(record.year, 1.0)
                base_index = cost_indices.get(base_year, 1.18)
                escalation_factor = base_index / year_index

                for item in record.extracted_data:
                    if 'amount' in item or 'cost' in item:
                        original_cost = item.get('amount') or item.get('cost', 0)
                        if original_cost is None:
                            continue  # no usable cost value in this row
                        normalized_item = item.copy()
                        normalized_item['original_cost'] = original_cost
                        normalized_item['normalized_cost'] = original_cost * escalation_factor
                        normalized_item['escalation_factor'] = escalation_factor
                        normalized_item['original_year'] = record.year
                        normalized_item['project_id'] = record.project_id
                        normalized_data.append(normalized_item)

        return pd.DataFrame(normalized_data)

    def _detect_data_type(self, df: pd.DataFrame) -> str:
        """Detect type of data from column names."""
        # str() guards against non-string labels (e.g. integer headers)
        columns_lower = [str(c).lower() for c in df.columns]

        if any(c in columns_lower for c in ['cost', 'amount', 'price', 'total', 'budget']):
            return 'cost'
        elif any(c in columns_lower for c in ['start', 'finish', 'duration', 'task', 'activity']):
            return 'schedule'
        elif any(c in columns_lower for c in ['hours', 'labor', 'worker', 'crew']):
            return 'labor'
        elif any(c in columns_lower for c in ['material', 'quantity', 'unit', 'supplier']):
            return 'material'
        else:
            return 'unknown'

    def _normalize_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        """Normalize column names to standard format."""
        column_mapping = {
            r'proj.*id': 'project_id',
            r'proj.*name': 'project_name',
            r'desc.*': 'description',
            r'qty|quantity': 'quantity',
            r'unit.*cost|unit.*price': 'unit_cost',
            r'total|amount': 'amount',
            r'start.*date': 'start_date',
            r'end.*date|finish.*date': 'end_date',
            r'dur.*': 'duration',
        }

        new_columns = {}
        for col in df.columns:
            # str() guards against non-string labels (e.g. integer headers)
            col_lower = str(col).lower().strip()
            for pattern, new_name in column_mapping.items():
                if re.match(pattern, col_lower):
                    # Keep the original name if the target is already taken,
                    # so two source columns never collapse into one
                    if new_name not in new_columns.values():
                        new_columns[col] = new_name
                    break

        return df.rename(columns=new_columns)

    def _assess_quality(self, df: pd.DataFrame) -> float:
        """Assess data quality score (0-1)."""
        if df.empty:
            return 0.0

        scores = []

        # Completeness: % of non-null values
        completeness = 1 - (df.isnull().sum().sum() / df.size)
        scores.append(completeness)

        # Column quality: has meaningful column names
        meaningful_cols = sum(
            1 for c in df.columns
            if len(str(c)) > 2 and not str(c).startswith('Unnamed')
        )
        col_quality = meaningful_cols / len(df.columns)
        scores.append(col_quality)

        # Row count: more data is better (capped at 1.0)
        row_score = min(len(df) / 100, 1.0)
        scores.append(row_score)

        return sum(scores) / len(scores)

    def _extract_project_info(self, df: pd.DataFrame, file_path: str) -> Dict[str, str]:
        """Extract project info from data or filename."""
        info = {}

        # Try to find project info in data
        for col in df.columns:
            name = str(col).lower()
            if df[col].empty:
                continue  # nothing to read; leave the key unset so fallbacks apply
            if 'project' in name and 'id' in name:
                info['id'] = str(df[col].iloc[0])
            if 'project' in name and 'name' in name:
                info['name'] = str(df[col].iloc[0])

        # Fallback to filename
        if 'name' not in info:
            info['name'] = Path(file_path).stem

        return info

    def _load_normalization_rules(self) -> Dict:
        """Load rules for normalizing legacy data."""
        return {
            'unit_conversions': {
                'M': 1000,  # Thousand
                'C': 100,   # Hundred
                'LF': 1,    # Linear Foot
                'SF': 1,    # Square Foot
                'CY': 1,    # Cubic Yard
            },
            'date_formats': [
                '%m/%d/%Y', '%m/%d/%y', '%Y-%m-%d',
                '%d-%b-%Y', '%B %d, %Y'
            ]
        }

    def generate_migration_report(self) -> str:
        """Generate report on migrated data."""
        report = ["# Historical Data Migration Report", ""]

        # Summary
        report.append("## Summary")
        report.append(f"- Total Records: {len(self.records)}")

        by_type = {}
        by_year = {}
        for r in self.records:
            by_type[r.data_type] = by_type.get(r.data_type, 0) + 1
            by_year[r.year] = by_year.get(r.year, 0) + 1

        report.append("\n### By Data Type")
        for dt, count in sorted(by_type.items()):
            report.append(f"- {dt}: {count}")

        report.append("\n### By Year")
        for year, count in sorted(by_year.items()):
            report.append(f"- {year}: {count}")

        # Quality Assessment
        report.append("\n## Data Quality")
        avg_quality = sum(r.quality_score for r in self.records) / len(self.records) if self.records else 0
        report.append(f"- Average Quality Score: {avg_quality:.2%}")

        low_quality = [r for r in self.records if r.quality_score < 0.5]
        if low_quality:
            report.append(f"\n### Low Quality Records ({len(low_quality)})")
            for r in low_quality[:10]:
                report.append(f"- {r.project_name} ({r.year}): {r.quality_score:.2%}")

        return "\n".join(report)
```
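
The extractor's fallback project IDs are built with Python's built-in `hash()`. For strings, that value changes between interpreter runs unless `PYTHONHASHSEED` is fixed, so the same file can receive a different ID on each migration run. A minimal sketch of a deterministic alternative follows; `stable_project_id` is a hypothetical helper, not part of the skill, and the zero-padded four-digit suffix is an assumption chosen to resemble the original `% 10000` scheme.

```python
import hashlib


def stable_project_id(file_path: str, year: int, prefix: str = "LEGACY") -> str:
    """Build a fallback ID that is identical on every run (hypothetical helper)."""
    # SHA-1 is used only as a stable fingerprint here, not for security
    digest = hashlib.sha1(file_path.encode("utf-8")).hexdigest()
    # Reduce the digest to four digits, mirroring the '% 10000' in the class above
    suffix = int(digest, 16) % 10000
    return f"{prefix}-{year}-{suffix:04d}"


if __name__ == "__main__":
    print(stable_project_id("/archive/projects/2015/costs.xls", 2015))
```

With only 10,000 possible suffixes per year, collisions remain possible in a large archive; a longer slice of the digest would reduce that risk.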

### Legacy System Connectors

```python
class LegacySystemConnector:
    """Connect to various legacy construction systems."""

    @staticmethod
    def read_timberline_export(file_path: str) -> pd.DataFrame:
        """Read Sage Timberline (now Sage 300) export files."""
        # Timberline exports typically have specific format
        df = pd.read_csv(file_path, encoding='cp1252')

        # Map Timberline columns to standard
        column_map = {
            'JOB': 'project_id',
            'PHASE': 'phase_code',
            'CATEGORY': 'cost_code',
            'DESCRIPTION': 'description',
            'ESTIMATE': 'estimated_cost',
            'ACTUAL': 'actual_cost',
            'COMMITTED': 'committed_cost'
        }

        return df.rename(columns=column_map)

    @staticmethod
    def read_primavera_xer(file_path: str) -> Dict[str, pd.DataFrame]:
        """Read Primavera P6 XER export files."""
        tables = {}
        current_table = None
        current_data = []
        columns = []

        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                # Strip only the line ending: trailing tabs mark empty fields
                line = line.rstrip('\r\n')
                if line.startswith('%T'):
                    # Save previous table
                    if current_table and current_data:
                        tables[current_table] = pd.DataFrame(current_data, columns=columns)
                    # Start new table
                    current_table = line.split('\t')[1] if '\t' in line else None
                    current_data = []
                    columns = []
                elif line.startswith('%F'):
                    # Field definitions
                    columns = line.split('\t')[1:]
                elif line.startswith('%R'):
                    # Data row
                    current_data.append(line.split('\t')[1:])

        # Save last table
        if current_table and current_data:
            tables[current_table] = pd.DataFrame(current_data, columns=columns)

        return tables

    @staticmethod
    def read_mc2_ice(file_path: str) -> pd.DataFrame:
        """Read MC2 ICE estimating export."""
        # MC2 ICE format handling is not implemented yet
        raise NotImplementedError("MC2 ICE export parsing is not implemented")
```
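
To make the XER layout handled by `read_primavera_xer` concrete, here is a small standard-library sketch that parses an in-memory sample. The sample text, its table contents, and the helper name `parse_xer_text` are invented for illustration; only the `%T` (table), `%F` (fields), and `%R` (row) markers come from the connector code above.

```python
from typing import Dict, List

# Invented sample in the tab-separated XER layout; not real project data
SAMPLE_XER = (
    "ERMHDR\t8.0\n"
    "%T\tPROJECT\n"
    "%F\tproj_id\tproj_short_name\n"
    "%R\t101\tWAREHOUSE-A\n"
    "%T\tTASK\n"
    "%F\ttask_id\ttask_name\ttarget_drtn_hr_cnt\n"
    "%R\t1\tExcavate\t80\n"
    "%R\t2\tPour slab\t\n"
    "%E\n"
)


def parse_xer_text(text: str) -> Dict[str, List[Dict[str, str]]]:
    """Parse XER-style text into {table_name: [row_dict, ...]} (hypothetical helper)."""
    tables: Dict[str, List[Dict[str, str]]] = {}
    columns: List[str] = []
    rows = None
    for line in text.splitlines():
        parts = line.split("\t")
        marker = parts[0]
        if marker == "%T" and len(parts) > 1:
            # A new table starts; its rows accumulate under the table name
            rows = tables.setdefault(parts[1], [])
            columns = []
        elif marker == "%F":
            columns = parts[1:]
        elif marker == "%R" and rows is not None:
            values = parts[1:]
            # Pad short rows so every field name gets a value
            values += [""] * (len(columns) - len(values))
            rows.append(dict(zip(columns, values)))
    return tables


if __name__ == "__main__":
    for name, table_rows in parse_xer_text(SAMPLE_XER).items():
        print(name, table_rows)
```

All values come back as strings, including the empty duration on the second task, so numeric and date columns still need explicit conversion before analysis.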

## Quick Start

```python
from pathlib import Path

# Initialize manager
manager = HistoricalDataManager('/archive/projects')

# Scan archive
file_types = manager.scan_archive()
print(f"Found: {file_types}")

# Extract from legacy Excel files (2015 through 2023)
for year in range(2015, 2024):
    year_path = f'/archive/projects/{year}'
    for file in Path(year_path).glob('*.xls*'):
        records = manager.extract_from_legacy_excel(str(file), year)
        manager.records.extend(records)

# Normalize cost data to 2026 dollars
cost_records = [r for r in manager.records if r.data_type == 'cost']
normalized_costs = manager.normalize_cost_data(cost_records, base_year=2026)

# Generate migration report
report = manager.generate_migration_report()
print(report)

# Export for analysis
normalized_costs.to_excel('historical_costs_normalized.xlsx', index=False)
```

## Common Use Cases

1. **Cost Benchmarking**: Normalize historical costs for comparison
2. **Productivity Analysis**: Track labor productivity over time
3. **Risk Identification**: Find patterns in historical project issues
4. **Estimating Calibration**: Improve estimates with historical data
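
For cost benchmarking, the normalization step amounts to multiplying each historical cost by the ratio of the base-year index to the original-year index. A worked sketch of that arithmetic is below. The index values are the example figures from `normalize_cost_data`, not published data, and `escalate` is a hypothetical helper. Unlike the class method, it raises on a missing year instead of silently assuming an index of 1.0, which is a design choice of this sketch.

```python
# Example index values copied from normalize_cost_data; not published figures
EXAMPLE_INDICES = {2015: 0.82, 2020: 0.95, 2026: 1.18}


def escalate(cost: float, from_year: int, to_year: int, indices: dict) -> float:
    """Escalate a cost between years using index ratios (hypothetical helper)."""
    for year in (from_year, to_year):
        if year not in indices:
            raise KeyError(f"No cost index available for {year}")
    # normalized = original * (index in target year / index in original year)
    return cost * indices[to_year] / indices[from_year]


if __name__ == "__main__":
    value = escalate(100_000, 2015, 2026, EXAMPLE_INDICES)
    print(f"{value:,.2f}")
```

With these example indices, a 100,000 cost from 2015 escalates by a factor of 1.18 / 0.82, or roughly 1.44, to about 143,900 in 2026 terms.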

## Dependencies

```bash
pip install pandas openpyxl xlrd pyodbc
```
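
Both Excel packages are needed because they cover different formats: `openpyxl` reads `.xlsx`/`.xlsm` workbooks, while `xlrd` 2.0 and later reads only the older `.xls` format. A small sketch of choosing the engine from the file extension, rather than by trial and error, is shown below. `pick_excel_engine` is a hypothetical helper; its return value is intended to be passed as the `engine` argument of `pd.read_excel`.

```python
from pathlib import Path

# Engine names accepted by pandas.read_excel for each supported suffix
ENGINE_BY_SUFFIX = {
    ".xls": "xlrd",
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
}


def pick_excel_engine(file_path: str) -> str:
    """Return the pandas engine name for an Excel file (hypothetical helper)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in ENGINE_BY_SUFFIX:
        raise ValueError(f"Unsupported spreadsheet format: {suffix or '(none)'}")
    return ENGINE_BY_SUFFIX[suffix]


if __name__ == "__main__":
    for name in ("costs_1998.XLS", "costs_2019.xlsx"):
        print(name, "->", pick_excel_engine(name))
```

Formats outside this table, such as Lotus 1-2-3 files, would need a separate conversion step before pandas can read them.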

## Resources

- **RSMeans Historical Cost Index**: For cost escalation
- **ENR Construction Cost Index**: Alternative escalation source
- **Legacy Format Documentation**: Vendor-specific export formats\r

Usage Guidance

This skill appears to do what it says, but before installing or running it:

1) Only point the skill at intended archive directories (avoid giving it root or broad system paths) because it recursively reads files.
2) Be prepared to install Python packages (pandas, openpyxl/xlrd, pyodbc) and any native ODBC drivers needed to read Access files.
3) Run initial tests on non-sensitive sample data or inside an isolated environment/VM.
4) Verify the skill source (homepage is provided but 'Source: unknown' in the registry) if you require provenance.
5) If you need network isolation, confirm the agent runtime will not permit outbound network calls. SKILL.md does not show exfiltration, but the full file should be reviewed for any code that performs HTTP requests before use.
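
One way to act on the first point is to check every path against the intended archive root before reading it. The sketch below is an assumption about how such a guard could look; `is_within_root` is a hypothetical helper and is not part of the skill.

```python
from pathlib import Path


def is_within_root(candidate: str, root: str) -> bool:
    """Return True if candidate resolves to root or a path beneath it."""
    # resolve() normalizes '..' segments and follows symlinks where they exist
    root_resolved = Path(root).resolve()
    candidate_resolved = Path(candidate).resolve()
    return candidate_resolved == root_resolved or root_resolved in candidate_resolved.parents


if __name__ == "__main__":
    root = "/archive/projects"
    for path in ("/archive/projects/2015/costs.xls", "/archive/projects/../payroll/costs.xls"):
        print(path, is_within_root(path, root))
```

A caller could skip or log any file for which the check fails, which also catches symlinks that point outside the archive.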

Capability Analysis

Type: OpenClaw Skill
Name: historical-data-manager
Version: 2.1.0

The skill's core functionality involves processing historical data, which includes reading various file types and connecting to legacy databases. The `extract_from_database_export` function in `SKILL.md` uses `pyodbc.connect` with a connection string that directly embeds the `file_path` parameter (`DBQ={file_path}`). If the `file_path` input is not rigorously validated by the agent or the underlying code, this could lead to arbitrary file access or other injection vulnerabilities via the ODBC driver. While `instructions.md` advises the agent to 'Validate inputs before processing,' the Python code itself does not explicitly implement this for the `pyodbc` connection string, creating a vulnerability that could be exploited by a malicious user or prompt injection.
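
A minimal sketch of the kind of validation this analysis describes is shown below. It rejects characters that carry meaning in ODBC connection strings and restricts the extension before the path is embedded. `build_access_conn_str` is a hypothetical helper, and the specific checks are assumptions about what a reasonable guard might include, not a complete defense.

```python
from pathlib import Path

ALLOWED_ACCESS_SUFFIXES = {".mdb", ".accdb"}


def build_access_conn_str(file_path: str) -> str:
    """Validate file_path, then build an Access ODBC connection string."""
    # ';' separates connection attributes and braces quote values, so a path
    # containing them could inject extra attributes into the string
    if any(ch in file_path for ch in ";{}"):
        raise ValueError("file_path contains characters not allowed in a connection string")
    if Path(file_path).suffix.lower() not in ALLOWED_ACCESS_SUFFIXES:
        raise ValueError("file_path must point to an .mdb or .accdb file")
    return "DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};" + f"DBQ={file_path}"


if __name__ == "__main__":
    print(build_access_conn_str("archive/2004/jobs.mdb"))
```

Combining this with a check that the file exists and lies inside the intended archive directory would narrow the exposure further.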

Capability Assessment

✓ Purpose & Capability

Name, description, and declared requirements align: the skill is an instruction-only Python-based data extractor/normalizer that needs filesystem access and python3 to operate. The claw.json permission 'filesystem' and the SKILL.md methods (reading archives, CSV/Excel, Access exports) are consistent with the stated purpose.

ℹ Instruction Scope

SKILL.md contains concrete Python code to recursively scan archives, read CSV/Excel files, and access legacy DB exports (pyodbc for Access). That scope matches the purpose, but the code will traverse user-provided archive paths and process arbitrary local files — ensure you only point it at intended data. The instructions claim to 'only use data provided by the user', which is appropriate, but the recursive scanning behavior could read more files than intended if given a high-level path.

ℹ Install Mechanism

There is no install spec (instruction-only), which is low-risk for unexpected downloads. However, the Python code imports several third-party libraries (pandas, pyodbc, openpyxl/xlrd) that are not declared; those dependencies must be installed in the runtime environment and may require native drivers (e.g., Access ODBC). The lack of an explicit dependency list or environment setup is an operational gap to be aware of.
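
Because nothing is installed automatically, a quick preflight check can report which of the imported packages are absent before any extraction runs. The sketch below uses only the standard library; the module list is taken from the imports in SKILL.md, and `missing_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List

# Third-party modules imported by the SKILL.md code
REQUIRED_MODULES = ["pandas", "openpyxl", "xlrd", "pyodbc"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Python dependencies found")
```

This check covers only Python packages; the native Access ODBC driver has to be verified separately on the host system.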

✓ Credentials

The skill requests no environment variables or credentials and the primary capability is file I/O. This is proportionate to migrating local historical data. There is no declaration of network endpoints or secrets access. Users should note the filesystem permission is necessary for the task and will let the skill read files the agent is permitted to access.

✓ Persistence & Privilege

The `always` flag is false and model invocation is not disabled; the skill does not request persistent global privileges or modify other skills. It is instruction-only and does not install background services or force inclusion.

Version History

v2.1.0

- Added detailed documentation describing extraction, cleaning, and organization of legacy construction data.
- Expanded capabilities for handling multiple historical data sources, including legacy Excel, CSV, and database exports.
- Introduced normalization of cost data using cost indices to adjust historical values to current dollars.
- Improved detection of data types (cost, schedule, labor, material) based on column analysis.
- Enhanced support for different file encodings for robust CSV extraction.
- Specified technical requirements, usage scenarios, and business value more clearly.

v1.0.0

Initial release of Historical Data Manager for Construction.

- Extracts, cleans, and organizes legacy construction data from archives, including spreadsheets, PDFs, and database exports.
- Supports extraction from formats such as old Excel, CSV, and Microsoft Access.
- Detects and normalizes key data types (cost, schedule, labor, material) from diverse historical sources.
- Provides tools for column normalization and historical cost escalation to modern values.
- Enables migration of historical projects, costs, and schedules into analysis-ready, modern formats.

Metadata

Slug historical-data-manager

Version 2.1.0

License —

All-time Installs 4

Active Installs 4

Total Versions 2

Frequently Asked Questions

What is Historical Data Manager?

Extract, clean, and organize legacy construction data from archives. Migrate historical project data, cost records, and schedules into modern formats. It is an AI Agent Skill for Claude Code / OpenClaw, with 1449 downloads so far.

How do I install Historical Data Manager?

Run "/install historical-data-manager" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Historical Data Manager free?

Yes, Historical Data Manager is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Historical Data Manager support?

Historical Data Manager is cross-platform and runs anywhere OpenClaw / Claude Code is available (darwin, linux, win32).

Who created Historical Data Manager?

It is built and maintained by datadrivenconstruction (@datadrivenconstruction); the current version is v2.1.0.
