Cost Prediction
/install cost-prediction
# Construction Cost Prediction with Machine Learning

## Overview

Based on the DDC methodology (Chapter 4.5), this skill predicts construction project costs from historical data using machine learning algorithms. The approach replaces traditional expert-based estimation with data-driven prediction.

**Book Reference**: "Future: Predictions and Machine Learning"

> "Predictions and forecasts based on historical data allow companies to make more accurate decisions about project costs and schedules."
> (DDC Book, Chapter 4.5)

## Core Concepts

```
Historical Data → Feature Engineering → ML Model → Cost Prediction
       │                  │                │              │
       ▼                  ▼                ▼              ▼
 Past projects       Prepare data     Train model    New project
  with costs            for ML        on history    cost forecast
```

## Quick Start

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Load historical project data
df = pd.read_csv("historical_projects.csv")

# Features and target
features = ['area_m2', 'floors', 'complexity_score']
X = df[features]
y = df['total_cost']

# Split data (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out set
predictions = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, predictions):.2f}")
print(f"MAE: ${mean_absolute_error(y_test, predictions):,.0f}")

# Predict new project (same column names as the training data)
new_project = pd.DataFrame([[5000, 10, 3]], columns=features)  # area, floors, complexity
cost = model.predict(new_project)
print(f"Predicted cost: ${cost[0]:,.0f}")
```

## Data Preparation

### Prepare Historical Dataset

```python
import pandas as pd
import numpy as np

def prepare_cost_dataset(df):
    """Prepare historical project data for ML"""
    # Select relevant features
    features = [
        'area_m2',
        'floors',
        'building_type',
        'location',
        'year_completed',
        'complexity_score',
        'material_quality',
        'total_cost'
    ]

    df = df[features].copy()

    # Handle missing values
    df = df.dropna(subset=['total_cost'])
    df['complexity_score'] = df['complexity_score'].fillna(df['complexity_score'].median())

    # Encode categorical variables
    df = pd.get_dummies(df, columns=['building_type', 'location'])

    # Calculate derived features
    # Note: these are computed from the target, so use them for analysis only
    # and exclude them from the model inputs to avoid target leakage
    df['cost_per_m2'] = df['total_cost'] / df['area_m2']
    df['cost_per_floor'] = df['total_cost'] / df['floors']

    # Adjust for inflation (to current year prices)
    current_year = 2024
    inflation_rate = 0.03  # 3% annual
    df['years_ago'] = current_year - df['year_completed']
    df['adjusted_cost'] = df['total_cost'] * (1 + inflation_rate) ** df['years_ago']

    return df

# Usage
df = pd.read_csv("projects_history.csv")
df_prepared = prepare_cost_dataset(df)
```
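
The inflation adjustment in `prepare_cost_dataset` compounds a fixed annual rate over the years since completion: adjusted cost = cost × (1 + rate)^years. As a minimal worked sketch of that arithmetic (the helper name `adjust_for_inflation` is hypothetical, and the flat 3% rate is the same illustrative assumption used above rather than a real price index):

```python
def adjust_for_inflation(cost, year_completed, current_year=2024, rate=0.03):
    """Compound a historical cost forward to current-year prices.

    Hypothetical helper; a flat annual rate is an assumption, a real
    construction cost index would vary from year to year.
    """
    years_ago = current_year - year_completed
    return cost * (1 + rate) ** years_ago


# A project that cost $1,000,000 in 2020, expressed in 2024 prices:
# 1,000,000 * 1.03 ** 4
adjusted = adjust_for_inflation(1_000_000, 2020)
print(f"Adjusted cost: ${adjusted:,.0f}")
```

If a published cost index is available, replacing the flat rate with a per-year lookup would likely be more accurate than compounding a single number.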

### Feature Engineering

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Create additional features for better predictions"""
    df = df.copy()  # leave the caller's DataFrame untouched

    # Interaction features
    df['area_x_floors'] = df['area_m2'] * df['floors']
    df['area_x_complexity'] = df['area_m2'] * df['complexity_score']

    # Polynomial features
    df['area_squared'] = df['area_m2'] ** 2

    # Log transforms (for skewed features)
    df['log_area'] = np.log1p(df['area_m2'])

    # Binned features (categorical; encode before passing to a model)
    df['size_category'] = pd.cut(
        df['area_m2'],
        bins=[0, 1000, 5000, 10000, float('inf')],
        labels=['small', 'medium', 'large', 'xlarge']
    )

    return df
```
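
The `size_category` column produced by `pd.cut` is categorical, so it has to be encoded before it reaches a scikit-learn regressor. One possible way, sketched here on a tiny made-up frame (the area values are illustrative only), is one-hot encoding with `pd.get_dummies`, the same call used in `prepare_cost_dataset`:

```python
import pandas as pd

# Made-up areas purely for illustration
demo = pd.DataFrame({'area_m2': [800, 3200, 7500, 15000]})

# Same bins and labels as in engineer_features
demo['size_category'] = pd.cut(
    demo['area_m2'],
    bins=[0, 1000, 5000, 10000, float('inf')],
    labels=['small', 'medium', 'large', 'xlarge']
)

# One-hot encode the category so every column is numeric or boolean
encoded = pd.get_dummies(demo, columns=['size_category'])
print(encoded.columns.tolist())
```

Tree-based models could also work with an ordinal encoding of the four size classes; one-hot encoding is simply the option that matches the rest of this page.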

## Machine Learning Models

### Linear Regression

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def train_linear_model(X_train, y_train):
    """Train Linear Regression model with scaling"""
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])

    pipeline.fit(X_train, y_train)

    # Feature importance (coefficients on the standardized features)
    coefficients = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': pipeline.named_steps['regressor'].coef_
    }).sort_values('coefficient', key=abs, ascending=False)

    return pipeline, coefficients

# Usage
model, importance = train_linear_model(X_train, y_train)
print("Feature Importance:")
print(importance)
```

### K-Nearest Neighbors (KNN)

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def train_knn_model(X_train, y_train):
    """Train KNN model with optimal k"""
    # Scale inside the pipeline so each CV fold is scaled on its own
    # training part and predictions on raw features are scaled automatically
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor())
    ])

    # Find optimal k using cross-validation
    param_grid = {'knn__n_neighbors': range(3, 20)}
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_absolute_error')
    grid_search.fit(X_train, y_train)

    print(f"Best k: {grid_search.best_params_['knn__n_neighbors']}")
    print(f"Best MAE: ${-grid_search.best_score_:,.0f}")

    best_pipeline = grid_search.best_estimator_
    return best_pipeline, best_pipeline.named_steps['scaler']

# Usage
# knn_model already scales its input; the scaler is returned for inspection only
knn_model, scaler = train_knn_model(X_train, y_train)
```

### Random Forest

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_random_forest(X_train, y_train):
    """Train Random Forest model"""
    rf = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )

    rf.fit(X_train, y_train)

    # Feature importance
    importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)

    return rf, importance

# Usage
rf_model, importance = train_random_forest(X_train, y_train)
print("Feature Importance:")
print(importance.head(10))
```

### Gradient Boosting

```python
from sklearn.ensemble import GradientBoostingRegressor

def train_gradient_boosting(X_train, y_train):
    """Train Gradient Boosting model"""
    gb = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )

    gb.fit(X_train, y_train)
    return gb

# Usage
gb_model = train_gradient_boosting(X_train, y_train)
```

## Model Evaluation

### Comprehensive Evaluation

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(model, X_test, y_test, model_name="Model"):
    """Comprehensive model evaluation"""
    predictions = model.predict(X_test)

    metrics = {
        'MAE': mean_absolute_error(y_test, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_test, predictions)),
        'R²': r2_score(y_test, predictions),
        # MAPE is undefined when an actual cost is 0
        'MAPE': np.mean(np.abs((y_test - predictions) / y_test)) * 100
    }

    print(f"\n{model_name} Evaluation:")
    print(f"  MAE: ${metrics['MAE']:,.0f}")
    print(f"  RMSE: ${metrics['RMSE']:,.0f}")
    print(f"  R²: {metrics['R²']:.3f}")
    print(f"  MAPE: {metrics['MAPE']:.1f}%")

    return metrics, predictions

# Usage
metrics, predictions = evaluate_model(model, X_test, y_test, "Linear Regression")
```
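
To make the four metrics concrete, here is a small hand-checkable sketch in plain Python. The three cost values are invented for illustration only, and the formulas are the standard definitions that the scikit-learn functions above implement.

```python
import math

# Invented actual and predicted costs, chosen so the arithmetic is easy to follow
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

errors = [a - p for a, p in zip(actual, predicted)]  # -10, 20, 0
n = len(actual)

mae = sum(abs(e) for e in errors) / n                # mean absolute error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)    # penalizes large errors more
mape = sum(abs(e) / a for e, a in zip(errors, actual)) / n * 100  # relative error in %

# R² = 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, MAPE: {mape:.2f}%, R²: {r2:.3f}")
```

MAE and RMSE are in currency units, so they depend on project size; MAPE and R² are scale-free, which makes them easier to compare across datasets.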

### Compare Multiple Models

```python
import pandas as pd

def compare_models(models, X_test, y_test):
    """Compare multiple models"""
    results = []

    for name, model in models.items():
        metrics, _ = evaluate_model(model, X_test, y_test, name)
        metrics['Model'] = name
        results.append(metrics)

    comparison = pd.DataFrame(results)
    comparison = comparison.set_index('Model')

    print("\nModel Comparison:")
    print(comparison.round(2))

    return comparison

# Usage
models = {
    'Linear Regression': model,  # pipeline returned by train_linear_model
    'KNN': knn_model,
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model
}
comparison = compare_models(models, X_test, y_test)
```

### Cross-Validation

```python
from sklearn.model_selection import cross_val_score

def cross_validate_model(model, X, y, cv=5):
    """Perform cross-validation"""
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    mae_scores = -scores  # scikit-learn returns negated MAE so that higher is better

    print(f"Cross-Validation MAE: ${mae_scores.mean():,.0f} (+/- ${mae_scores.std():,.0f})")
    return mae_scores

# Usage
cv_scores = cross_validate_model(rf_model, X, y)
```

## Prediction Pipeline

### Complete Prediction Function

```python
import pandas as pd

def create_prediction_pipeline(model, feature_names, scaler=None):
    """Create a reusable prediction pipeline"""

    def predict_cost(project_data):
        """
        Predict cost for new project

        Args:
            project_data: dict with project features

        Returns:
            dict with the predicted cost and a simple +/- range
        """
        # Create DataFrame from input
        df = pd.DataFrame([project_data])

        # Ensure all required features (missing ones default to 0)
        for col in feature_names:
            if col not in df.columns:
                df[col] = 0

        df = df[feature_names]

        # Scale if necessary
        if scaler is not None:
            df = scaler.transform(df)

        # Predict
        prediction = model.predict(df)[0]

        # Simple range: a fixed +/-15% margin, not a statistical confidence interval
        margin = 0.15
        lower = prediction * (1 - margin)
        upper = prediction * (1 + margin)

        return {
            'predicted_cost': prediction,
            'lower_bound': lower,
            'upper_bound': upper,
            'margin': f"±{margin * 100:.0f}%"
        }

    return predict_cost

# Usage
predictor = create_prediction_pipeline(rf_model, X.columns.tolist())

# Predict new project
new_project = {
    'area_m2': 5000,
    'floors': 8,
    'complexity_score': 3,
    'material_quality': 2
}

result = predictor(new_project)
print(f"Predicted Cost: ${result['predicted_cost']:,.0f}")
print(f"Range: ${result['lower_bound']:,.0f} - ${result['upper_bound']:,.0f}")
```
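
The fixed 15% band above is a placeholder rather than a statistical interval. One possible data-driven alternative, sketched below under the assumption that a held-out test set is available, scales the new prediction by quantiles of the actual-to-predicted ratios observed on that set. The function name `empirical_range` and the sample numbers are hypothetical.

```python
import numpy as np

def empirical_range(y_true, y_pred, new_prediction, coverage=0.8):
    """Estimate a cost range from held-out errors (hypothetical helper).

    Uses the central `coverage` share of actual/predicted ratios seen on a
    held-out set; assumes future errors resemble past ones.
    """
    ratios = np.asarray(y_true, dtype=float) / np.asarray(y_pred, dtype=float)
    tail = (1 - coverage) / 2
    low_ratio, high_ratio = np.quantile(ratios, [tail, 1 - tail])
    return new_prediction * low_ratio, new_prediction * high_ratio


# Illustrative held-out results (made-up numbers)
y_true = [90, 100, 110, 95, 105]
y_pred = [100, 100, 100, 100, 100]

lower, upper = empirical_range(y_true, y_pred, new_prediction=2_000_000, coverage=0.8)
print(f"Range: ${lower:,.0f} - ${upper:,.0f}")
```

With only a handful of held-out projects the quantiles are noisy, so this kind of range is better treated as a rough indication than as a guarantee.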

### Save and Load Model

```python
import joblib

# Save model
def save_model(model, filepath):
    """Save trained model to file"""
    joblib.dump(model, filepath)
    print(f"Model saved to {filepath}")

# Load model
def load_model(filepath):
    """Load model from file"""
    model = joblib.load(filepath)
    print(f"Model loaded from {filepath}")
    return model

# Usage
save_model(rf_model, "cost_prediction_model.pkl")
loaded_model = load_model("cost_prediction_model.pkl")
```
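
A saved model is only usable if the prediction code feeds it the same columns in the same order. A possible approach, sketched here with a small synthetic model, is to store the feature names alongside the estimator in one file. The dictionary layout and the file name are assumptions for illustration, not part of the skill.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic training set (illustrative values only)
X = pd.DataFrame({'area_m2': [1000, 2000, 3000, 4000],
                  'floors': [2, 3, 6, 5]})
y = np.array([1.0e6, 2.0e6, 3.0e6, 4.0e6])
model = LinearRegression().fit(X, y)

# Bundle the estimator with the column order it was trained on
bundle = {'model': model, 'feature_names': X.columns.tolist()}

path = os.path.join(tempfile.mkdtemp(), "cost_model_bundle.pkl")
joblib.dump(bundle, path)

# Later: load the bundle and rebuild the input with the stored column order
loaded = joblib.load(path)
new_project = pd.DataFrame([{'floors': 5, 'area_m2': 2500}])[loaded['feature_names']]
prediction = loaded['model'].predict(new_project)[0]
print(f"Predicted cost: ${prediction:,.0f}")
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.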

## Using with ChatGPT

```python
# Prompt for ChatGPT to help with cost prediction

prompt = """
I have historical construction project data with these columns:
- area_m2: Building area in square meters
- floors: Number of floors
- building_type: residential, commercial, industrial
- total_cost: Total project cost in USD

Write Python code using scikit-learn to:
1. Prepare the data for machine learning
2. Train a Random Forest model
3. Evaluate the model
4. Predict cost for a new 3000 m² commercial building with 5 floors
"""
```

## Quick Reference

| Task | Code |
|------|------|
| Split data | `train_test_split(X, y, test_size=0.2)` |
| Linear Regression | `LinearRegression().fit(X, y)` |
| KNN | `KNeighborsRegressor(n_neighbors=5)` |
| Random Forest | `RandomForestRegressor(n_estimators=100)` |
| Predict | `model.predict(X_new)` |
| MAE | `mean_absolute_error(y_true, y_pred)` |
| R² Score | `r2_score(y_true, y_pred)` |
| Cross-validate | `cross_val_score(model, X, y, cv=5)` |
| Save model | `joblib.dump(model, 'file.pkl')` |

## Best Practices

1. **Data Quality**: Larger, cleaner historical datasets generally give better predictions
2. **Feature Selection**: Include relevant project characteristics
3. **Inflation Adjustment**: Normalize costs to current prices
4. **Regular Retraining**: Update model with new completed projects
5. **Ensemble Methods**: Combine multiple models for robustness
6. **Confidence Intervals**: Always provide prediction ranges
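
For the ensemble point in the list above, one option is scikit-learn's `VotingRegressor`, which averages the predictions of several regressors. The sketch below uses synthetic data generated on the spot with an assumed cost formula, so the numbers carry no real-world meaning; the feature names mirror the ones used earlier on this page.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic projects (assumed cost formula, for demonstration only)
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'area_m2': rng.uniform(500, 10000, n),
    'floors': rng.integers(1, 20, n),
    'complexity_score': rng.integers(1, 6, n),
})
y = 1500 * X['area_m2'] + 50000 * X['floors'] + rng.normal(0, 100000, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Average the predictions of three different model families
ensemble = VotingRegressor([
    ('linear', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
    ('gb', GradientBoostingRegressor(random_state=42)),
])
ensemble.fit(X_train, y_train)

ensemble_predictions = ensemble.predict(X_test)
print(f"Ensemble R²: {ensemble.score(X_test, y_test):.3f}")
```

Averaging tends to help most when the individual models make different kinds of errors; if one model clearly dominates on cross-validation, the ensemble may not improve on it.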

## Resources

- **Book**: "Data-Driven Construction" by Artem Boiko, Chapter 4.5
- **Website**: https://datadrivenconstruction.io
- **scikit-learn**: https://scikit-learn.org

## Next Steps

- See `duration-prediction` for project duration forecasting
- See `ml-model-builder` for custom ML workflows
- See `kpi-dashboard` for visualization
- See `big-data-analysis` for large dataset processing

## Installation

1. Make sure OpenClaw is installed (local or Docker).
2. Run the install command in chat: `/install cost-prediction`
3. After installation, invoke the skill by name or use `/cost-prediction`.
4. Provide the required inputs per the skill's parameter spec and get structured output.

## FAQ

### What is Cost Prediction?

Cost Prediction is an AI Agent Skill for Claude Code / OpenClaw that predicts construction project costs using machine learning. It uses Linear Regression, K-Nearest Neighbors, and Random Forest models on historical project data to train, evaluate, and deploy cost prediction models. It has 1352 downloads so far.

### How do I install Cost Prediction?

Run `/install cost-prediction` in the OpenClaw or Claude Code chat to install it in one step; no extra setup is required.

### Is Cost Prediction free?

Yes, Cost Prediction is completely free and open-source. You can download, install, and use it at no cost.

### Which platforms does Cost Prediction support?

Cost Prediction is cross-platform and runs anywhere OpenClaw / Claude Code is available.

### Who created Cost Prediction?

It is built and maintained by datadrivenconstruction (@datadrivenconstruction); the current version is v2.0.0.