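What Is Linear Regression

Linear regression is one of the most fundamental and important algorithms in statistics and machine learning. The core idea is simple: find a "best-fit line" that describes the linear relationship between the independent variables (X) and the dependent variable (y), and use it to predict continuous values.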

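Intuition

Imagine a set of "house area vs. price" data points scattered on a 2D plane. Linear regression finds the straight line that minimizes the sum of squared vertical distances (residuals) from the points to the line. With that line in hand, you can predict a price for any given area. That is the essence of linear regression.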

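When to Use It

Linear regression predicts continuous values rather than class labels. Typical applications include house price prediction, sales forecasting, temperature prediction, stock trend analysis, and modeling the relationship between advertising spend and revenue.

Although "linear" sounds simple, linear regression is the foundation of many more advanced models. Understanding its math helps a great deal when learning logistic regression, neural networks, support vector machines, and more. Mastering linear regression is the first key to machine learning.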

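Mathematical Foundations

Simple Linear Regression

With a single independent variable x and a single dependent variable y, the model is:

y = wx + b

Here w (the weight, or slope) is the change in y when x increases by one unit, and b (the bias, or intercept) is the value of y when x = 0.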
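
For a single feature, the least-squares w and b have a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates this on noise-free toy data; the function name and the data are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares slope and intercept for y = w*x + b (one feature)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b = y_mean - w * x_mean
    return w, b

# Noise-free toy data generated from y = 2x + 1 (hypothetical values)
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0
w_demo, b_demo = fit_simple_linear(x_demo, y_demo)
print(f"w = {w_demo:.2f}, b = {b_demo:.2f}")  # w = 2.00, b = 1.00
```

Because the data has no noise, the fit recovers the generating slope and intercept exactly; with noisy data the estimates would only be close.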

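Multiple Linear Regression

With p independent variables, the model extends to:

y = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₚxₚ + b

In matrix form this can be written compactly as:

Y = Xβ + ε

where:
Y = [y₁, y₂, ..., yₙ]ᵀ    (n×1 target vector)
X = [[1, x₁₁, x₁₂, ..., x₁ₚ],
     [1, x₂₁, x₂₂, ..., x₂ₚ],
     ...,
     [1, xₙ₁, xₙ₂, ..., xₙₚ]]    (n×(p+1) design matrix; the first column is all 1s)
β = [b, w₁, w₂, ..., wₚ]ᵀ    ((p+1)×1 parameter vector)
ε = [ε₁, ε₂, ..., εₙ]ᵀ    (n×1 error vector)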

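Loss Function: Mean Squared Error (MSE)

We need a criterion for "how good the fit is". The most common choice is the mean squared error (MSE):

MSE = (1/n) Σ(yᵢ - ŷᵢ)² = (1/n) Σ(yᵢ - (wxᵢ + b))²

where yᵢ is the true value, ŷᵢ is the predicted value, and n is the number of samples.

The smaller the MSE, the better the fit. Training means finding the parameters w and b that minimize the MSE.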
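
To make the formula concrete, here is a minimal sketch that scores two candidate lines on the same toy data. The helper name `mse` and the data values are illustrative assumptions, not part of any library.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical toy data and two candidate lines
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([3.0, 5.0, 7.0])          # generated by y = 2x + 1

mse_exact = mse(y_toy, 2.0 * x_toy + 1.0)  # the true line
mse_off = mse(y_toy, 2.0 * x_toy + 0.0)    # same slope, intercept off by 1
print(mse_exact, mse_off)  # 0.0 1.0
```

The second line misses every point by exactly 1, so its MSE is 1; the line that generated the data scores 0.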

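Normal Equation: Closed-Form Solution

Setting the derivative of the MSE with respect to β to zero gives the optimal parameters in closed form:

β = (XᵀX)⁻¹ XᵀY

The normal equation yields the optimum in a single step, but it requires inverting the (p+1)×(p+1) matrix XᵀX, which must be invertible. The inversion costs O(p³), so when the number of features is very large (for example 10,000+), gradient descent is usually more efficient.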
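
For readers who want the intermediate steps, the closed form can be derived as follows. This is a sketch in the matrix notation used above, assuming XᵀX is invertible:

```latex
\begin{aligned}
\mathrm{MSE}(\beta) &= \frac{1}{n}\,(Y - X\beta)^{\top}(Y - X\beta) \\
\nabla_{\beta}\,\mathrm{MSE}(\beta) &= -\frac{2}{n}\,X^{\top}(Y - X\beta) = 0 \\
X^{\top}X\,\beta &= X^{\top}Y \\
\beta &= (X^{\top}X)^{-1}X^{\top}Y
\end{aligned}
```

The factor 2/n drops out when the gradient is set to zero, which is why the solution does not depend on the number of samples explicitly.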

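Gradient Descent: Iterative Optimization

Gradient descent minimizes the loss by adjusting the parameters step by step. The update rule for each iteration is:

w = w - α * (∂MSE/∂w)
b = b - α * (∂MSE/∂b)

where:
∂MSE/∂w = (-2/n) Σ xᵢ(yᵢ - ŷᵢ)
∂MSE/∂b = (-2/n) Σ (yᵢ - ŷᵢ)
α = learning rate, which controls the size of each update step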
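
One way to convince yourself that the gradient formulas are right is to compare them with finite differences. The sketch below does that for a single feature and then takes one update step; all names, data values, and the learning rate are illustrative assumptions.

```python
import numpy as np

def mse_loss(w, b, x, y):
    return np.mean((y - (w * x + b)) ** 2)

def mse_gradients(w, b, x, y):
    """Analytic gradients of the MSE, matching the formulas above."""
    residual = y - (w * x + b)
    dw = (-2.0 / len(x)) * np.sum(x * residual)
    db = (-2.0 / len(x)) * np.sum(residual)
    return dw, db

# Hypothetical data and an arbitrary starting point
x_chk = np.array([0.0, 1.0, 2.0, 3.0])
y_chk = np.array([1.0, 3.0, 5.0, 7.0])
w0, b0 = 0.5, -0.2

dw, db = mse_gradients(w0, b0, x_chk, y_chk)

# Central finite differences as an independent check
eps = 1e-6
dw_num = (mse_loss(w0 + eps, b0, x_chk, y_chk) - mse_loss(w0 - eps, b0, x_chk, y_chk)) / (2 * eps)
db_num = (mse_loss(w0, b0 + eps, x_chk, y_chk) - mse_loss(w0, b0 - eps, x_chk, y_chk)) / (2 * eps)
print(f"analytic: dw={dw:.4f}, db={db:.4f}")
print(f"numeric:  dw={dw_num:.4f}, db={db_num:.4f}")

# One gradient descent step with learning rate alpha
alpha = 0.05
w1, b1 = w0 - alpha * dw, b0 - alpha * db
print(f"loss before: {mse_loss(w0, b0, x_chk, y_chk):.4f}, after: {mse_loss(w1, b1, x_chk, y_chk):.4f}")
```

The analytic and numeric gradients should agree to several decimal places, and the loss drops after the step because this learning rate is small enough for this data.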

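The Four Assumptions of Linear Regression

For linear regression to give reliable results, four core assumptions should hold. Violating one does not make the model unusable, but it does affect the reliability of the parameter estimates and the accuracy of confidence intervals.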

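1. Linearity

The relationship between the independent variables and the dependent variable is linear. A scatter plot gives a first impression. If the relationship is nonlinear (for example quadratic), adding polynomial features can handle it.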

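2. Independence

Observations are independent of one another, and the residuals show no autocorrelation. This assumption is easily violated in time series data. The Durbin-Watson test can detect it (a value close to 2 indicates no autocorrelation).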
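
The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below uses simulated residuals (an assumption for illustration); statsmodels also ships a `durbin_watson` function in `statsmodels.stats.stattools` if you prefer a library call.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)

# Independent residuals: the statistic should land near 2
independent = rng.normal(size=2000)

# Positively autocorrelated residuals (a random walk): well below 2
autocorrelated = np.cumsum(rng.normal(size=2000))

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)
print(f"independent:    {dw_independent:.2f}")
print(f"autocorrelated: {dw_autocorrelated:.2f}")
```

Values well below 2 point to positive autocorrelation, and values well above 2 point to negative autocorrelation.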

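3. Homoscedasticity

The variance of the residuals stays constant across all levels of the predicted value. If the residuals fan out in a "funnel" shape (variance growing with the predicted value), heteroscedasticity is present. Taking the log of y or using weighted least squares can mitigate it.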

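4. Normality of Errors

The residuals should be approximately normally distributed with mean 0. A Q-Q plot or the Shapiro-Wilk test can check this. With large samples (a common rule of thumb is n > 30), the central limit theorem makes the coefficient estimates approximately normal anyway, so violations of this assumption matter less.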
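
As a rough illustration of the Shapiro-Wilk check, the sketch below runs `scipy.stats.shapiro` on two simulated residual sets, one roughly normal and one strongly skewed. The data is synthetic and the 0.05 cutoff is a conventional assumption, not a rule from this article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical residuals: one roughly normal set, one strongly skewed set
normal_resid = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_resid = rng.exponential(scale=1.0, size=200) - 1.0

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    stat, p_value = stats.shapiro(resid)
    # A small p-value (e.g. < 0.05) is evidence against normality
    print(f"{name}: W = {stat:.3f}, p = {p_value:.4f}")

p_normal = stats.shapiro(normal_resid)[1]
p_skewed = stats.shapiro(skewed_resid)[1]
```

In practice you would pass the residuals of your fitted model (y minus the predictions) instead of simulated values.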

Linear Regression from Scratch in Python

Train a linear regression model with gradient descent using only NumPy, without any machine learning library:

import numpy as np

class LinearRegressionGD:
    """Linear regression implemented from scratch (gradient descent)."""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []  # loss value recorded at every iteration

    def fit(self, X, y):
        n_samples, n_features = X.shape

        # Initialize the parameters to 0
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for i in range(self.n_iter):
            # Forward pass: compute predictions
            y_pred = np.dot(X, self.weights) + self.bias

            # Compute the loss (MSE)
            cost = np.mean((y - y_pred) ** 2)
            self.cost_history.append(cost)

            # Compute the gradients
            dw = (-2 / n_samples) * np.dot(X.T, (y - y_pred))
            db = (-2 / n_samples) * np.sum(y - y_pred)

            # Update the parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        """Compute the R² coefficient of determination."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)


# ===== Usage example =====
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)          # 100 samples, 1 feature
y = 4 + 3 * X.squeeze() + np.random.randn(100) * 0.5  # y = 3x + 4 + noise

# Standardize the feature (speeds up convergence)
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std

# Train the model
model = LinearRegressionGD(learning_rate=0.1, n_iterations=500)
model.fit(X_scaled, y)

print(f"R² = {model.score(X_scaled, y):.4f}")
print(f"Final MSE: {model.cost_history[-1]:.4f}")

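Normal Equation Implementation

def normal_equation(X, y):
    """Closed-form solution of the normal equation: beta = (X^T X)^(-1) X^T y"""
    # Add the bias column (a column of 1s)
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Solve (X^T X) beta = X^T y; np.linalg.solve is more stable than an explicit inverse
    beta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
    return beta  # beta[0] = intercept, beta[1:] = weights

beta = normal_equation(X, y)
print(f"Intercept: {beta[0]:.4f}, Weight: {beta[1]:.4f}")
# The output is close to 4.0 and 3.0 (the true values we set)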

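Linear Regression with Scikit-learn

In practice, scikit-learn's LinearRegression is the recommended choice. Internally it solves the least-squares problem with an SVD-based solver (scipy.linalg.lstsq) rather than explicitly inverting XᵀX, which gives better numerical stability: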

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Prepare the data
np.random.seed(42)
X = np.random.rand(200, 3)  # 200 samples, 3 features
y = 5 + 2*X[:, 0] - 3*X[:, 1] + 1.5*X[:, 2] + np.random.randn(200) * 0.3

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Inspect the learned parameters
print(f"Intercept (b): {model.intercept_:.4f}")
print(f"Coefficients (w): {model.coef_}")
# The intercept should be close to 5.0 and the coefficients close to [2.0, -3.0, 1.5]

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(f"\nR²  = {r2_score(y_test, y_pred):.4f}")
print(f"MSE = {mean_squared_error(y_test, y_pred):.4f}")
print(f"MAE = {mean_absolute_error(y_test, y_pred):.4f}")

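Key Points About Sklearn LinearRegression

- fit_intercept=True by default, so the intercept is handled automatically
- No manual feature standardization is needed (it uses an SVD-based solver, not gradient descent)
- There are no regularization hyperparameters to tune (use Ridge/Lasso if you need regularization)
- Suited to a moderate number of features (fewer than about 10,000)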

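Regularization: Ridge / Lasso / ElasticNet

Regularization prevents overfitting by adding a penalty term to the loss function. It matters most when there are many features or when the features are multicollinear.

Ridge Regression (L2 Regularization)

Cost = MSE + α * Σ(wᵢ²)

α controls the regularization strength. The L2 penalty shrinks the weights toward small values but does not set them exactly to 0.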

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Ridge needs standardized features
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))  # alpha = regularization strength
])

ridge_pipe.fit(X_train, y_train)
print(f"Ridge R²: {ridge_pipe.score(X_test, y_test):.4f}")

# Choose the best alpha with cross-validation
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
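
To see the shrinkage effect without any library, ridge regression also has a closed-form solution. The sketch below is a minimal NumPy version under stated assumptions: no intercept, synthetic data, and a penalty applied to the sum (not the mean) of squared errors, which is the convention scikit-learn's Ridge documents, so α here is on a different scale from the MSE-based formula above.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    # Solve (X^T X + alpha * I) w = X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -3.0, 1.5]) + rng.normal(scale=0.3, size=50)

w_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # plain least squares
w_ridge = ridge_closed_form(X_demo, y_demo, alpha=50.0)  # strong shrinkage

print("OLS:  ", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))
print("Any exact zeros?", bool(np.any(w_ridge == 0.0)))
```

The ridge weights are smaller in norm than the least-squares weights, but none of them is exactly zero, which matches the description of the L2 penalty.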

Lasso Regression (L1 Regularization)

Cost = MSE + α * Σ|wᵢ|

The L1 penalty can shrink some weights to exactly 0, which performs feature selection.

from sklearn.linear_model import Lasso, LassoCV

# Lasso regression
lasso = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
lasso.fit(X_train, y_train)
print(f"Lasso R²: {lasso.score(X_test, y_test):.4f}")

# See which features were selected (non-zero weight)
lasso_model = lasso.named_steps['lasso']
for i, coef in enumerate(lasso_model.coef_):
    status = "kept" if abs(coef) > 1e-6 else "dropped"
    print(f"  Feature {i}: w={coef:.4f} ({status})")
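
Why does L1 produce exact zeros while L2 does not? A common way to see it is the soft-thresholding operator, the per-coordinate update used by coordinate-descent Lasso solvers. The sketch below shows only that operator on hypothetical weights; it is not a full Lasso solver.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each value toward 0 by `threshold`; values within
    [-threshold, threshold] become exactly 0."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Hypothetical unpenalized weights and a penalty threshold of 1.0
weights = np.array([3.0, -0.5, 0.2, -2.0])
shrunk = soft_threshold(weights, 1.0)
print(shrunk)
print("non-zero weights:", np.count_nonzero(shrunk))
```

The two weights whose magnitude is below the threshold are set to exactly zero, while the larger ones are pulled toward zero by a fixed amount.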

ElasticNet (L1 + L2)

Cost = MSE + α * [ρ * Σ|wᵢ| + (1-ρ) * Σ(wᵢ²)]

ρ (l1_ratio) controls the mix of L1 and L2. ρ=1 is equivalent to Lasso, and ρ=0 is equivalent to Ridge.

from sklearn.linear_model import ElasticNet, ElasticNetCV

# Search alpha and l1_ratio jointly with cross-validation
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=[0.001, 0.01, 0.1, 1.0],
    cv=5
)
enet_cv.fit(X_train, y_train)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"ElasticNet R²: {enet_cv.score(X_test, y_test):.4f}")

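How to Choose a Regularization Method

Method	When to use	Feature selection
Ridge (L2)	All features may be useful and you only want to shrink the weights	No
Lasso (L1)	You suspect many features are useless and want automatic feature selection	Yes
ElasticNet	Many features, including groups of highly correlated ones	Yes (by group)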

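Feature Engineering

Feature engineering is a key step in improving a linear regression model. Good features matter more than a complex model.

Polynomial Features

When the data is nonlinear, you can create polynomial features so that a linear model can fit a curve: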

from sklearn.preprocessing import PolynomialFeatures

# Original features: [x₁, x₂]
# With degree=2 and include_bias=False: [x₁, x₂, x₁², x₁x₂, x₂²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train)

print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of polynomial features: {X_poly.shape[1]}")

# Train a linear regression on the polynomial features
from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0)  # polynomial features overfit easily, so regularization is advisable
)
poly_model.fit(X_train, y_train)
print(f"Polynomial R²: {poly_model.score(X_test, y_test):.4f}")

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: mean = 0, standard deviation = 1 (recommended for linear regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Normalization: rescale to the [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_train)

# Important: the test set reuses the scaler fitted on the training set (do not refit)
X_test_scaled = scaler.transform(X_test)  # transform only, no fit!

One-Hot Encoding

Linear regression only handles numeric features. Categorical features must be encoded as numbers:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Example data
df = pd.DataFrame({
    'area': [80, 120, 60, 200],
    'rooms': [2, 3, 1, 4],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shenzhen'],
    'price': [300, 500, 200, 800]
})

# Handle numeric and categorical features in one step
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['area', 'rooms']),
    ('cat', OneHotEncoder(drop='first'), ['city'])
    # drop='first' avoids the dummy variable trap
])

X = preprocessor.fit_transform(df[['area', 'rooms', 'city']])
print(f"Feature shape after encoding: {X.shape}")

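Model Evaluation Metrics

Metric	Formula	Interpretation
R²	1 - SS_res / SS_tot	Share of variance explained by the model. 1 = perfect, 0 = same as predicting the mean, negative = worse than the mean
MSE	(1/n) Σ(yᵢ - ŷᵢ)²	Mean squared error; penalizes large errors more heavily
RMSE	√MSE	Same unit as y, so easier to interpret. For example, RMSE = 5 means a typical error of about 5 units
MAE	(1/n) Σ|yᵢ - ŷᵢ|	Mean absolute error; less sensitive to outliers than MSE
Adjusted R²	1 - (1-R²)(n-1)/(n-p-1)	R² adjusted for the number of features p; keeps useless features from inflating R²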

from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error
)
import numpy as np

def evaluate_regression(y_true, y_pred, n_features=None):
    """Print all regression evaluation metrics."""
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)

    print(f"R²   = {r2:.4f}")
    print(f"MSE  = {mse:.4f}")
    print(f"RMSE = {rmse:.4f}")
    print(f"MAE  = {mae:.4f}")

    if n_features is not None:
        n = len(y_true)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
        print(f"Adjusted R² = {adj_r2:.4f}")

    return {"r2": r2, "mse": mse, "rmse": rmse, "mae": mae}

# Usage example
metrics = evaluate_regression(y_test, y_pred, n_features=X_train.shape[1])

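How Do I Judge Whether a Model Is Good?

The ranges below are rough rules of thumb; what counts as acceptable depends on the domain.

- R² > 0.9: excellent (common in physics and engineering)
- R² 0.7-0.9: good (common in business and social science)
- R² 0.5-0.7: fair; more features or a nonlinear model may be needed
- R² < 0.5: poor; the linearity assumption may not hold
- Key point: compare the metrics on the training and test sets. A large gap indicates overfitting.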

Data Visualization

Scatter Plot + Regression Line

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate example data
np.random.seed(42)
X = 2 * np.random.rand(80, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(80) * 0.8

# Train the model
model = LinearRegression().fit(X, y)

# Plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X, y, alpha=0.6, color='#6c63ff', label='Data points')

# Draw the regression line
X_line = np.linspace(0, 2, 100).reshape(-1, 1)
y_line = model.predict(X_line)
ax.plot(X_line, y_line, color='#ff6b6b', linewidth=2,
        label=f'y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}')

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression Fit')
ax.legend()
plt.tight_layout()
plt.savefig('regression_fit.png', dpi=150)
plt.show()

Residual Plots (Checking Model Assumptions)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

y_pred = model.predict(X)
residuals = y - y_pred

# Residuals vs. predicted values
axes[0].scatter(y_pred, residuals, alpha=0.6, color='#6c63ff')
axes[0].axhline(y=0, color='#ff6b6b', linestyle='--')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Predicted')

# Histogram of residuals (checks normality)
axes[1].hist(residuals, bins=20, color='#6c63ff', alpha=0.7, edgecolor='white')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')

plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=150)
plt.show()

# Ideal case: the residual plot shows no clear pattern and the histogram is roughly normal

Learning Curve (Checking Overfitting/Underfitting)

from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2'
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training R²')
ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation R²')
ax.fill_between(train_sizes,
    train_scores.mean(axis=1) - train_scores.std(axis=1),
    train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
ax.fill_between(train_sizes,
    val_scores.mean(axis=1) - val_scores.std(axis=1),
    val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
ax.set_xlabel('Training Set Size')
ax.set_ylabel('R² Score')
ax.set_title('Learning Curve')
ax.legend()
plt.tight_layout()
plt.show()

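Common Pitfalls and Solutions

Pitfall	Symptoms	Solution
Multicollinearity	Weights are large and unstable; signs may contradict intuition	Compute the VIF (variance inflation factor) and consider removing or merging features with VIF > 10; use Ridge regularization
Overfitting	Training R² is high but test R² is low; more features than samples	Reduce the number of features; add regularization (Ridge/Lasso); collect more training data; use cross-validation
Underfitting	Both training and test R² are low	Add polynomial or interaction features; consider a nonlinear model; do more feature engineering
Outlier influence	A few extreme values severely distort the regression line	Detect and handle outliers with the IQR or Z-score; use robust regression such as RANSAC or Huber
Data leakage	Test R² is suspiciously high (> 0.99) and performance drops sharply in production	Make sure test data is never seen during training; fit the scaler on the training set only
Ignoring feature scaling	Gradient descent fails to converge or converges very slowly; regularization behaves oddly	Use StandardScaler; always standardize for Ridge/Lasso/ElasticNet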

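Multicollinearity Detection Code

from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np
import pandas as pd

def check_vif(X, feature_names):
    """Compute the VIF of each feature."""
    # Prepend a constant column so each auxiliary regression has an intercept
    X_const = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    vif_data = pd.DataFrame()
    vif_data['Feature'] = feature_names
    vif_data['VIF'] = [
        variance_inflation_factor(X_const, i + 1) for i in range(X_const.shape[1] - 1)
    ]
    vif_data = vif_data.sort_values('VIF', ascending=False)
    print(vif_data.to_string(index=False))
    # VIF > 10: severe collinearity, consider removing
    # VIF 5-10: moderate collinearity, keep an eye on it
    # VIF < 5: acceptable
    return vif_data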
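
To show what the VIF actually measures, here is a minimal NumPy-only sketch based on its definition: VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def vif_manual(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on all the other features (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coef
        r2 = 1.0 - residual @ residual / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
feat_a = rng.normal(size=500)
feat_b = rng.normal(size=500)
feat_c = feat_a + 0.1 * rng.normal(size=500)   # nearly a copy of feat_a
vifs = vif_manual(np.column_stack([feat_a, feat_b, feat_c]))
print(np.round(vifs, 1))
```

The two nearly duplicated features get very large VIF values, while the independent feature stays close to 1.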

Robust Regression for Handling Outliers

from sklearn.linear_model import RANSACRegressor, HuberRegressor

# RANSAC: random sample consensus; fits on inliers and ignores outliers
ransac = RANSACRegressor(random_state=42)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_
print(f"Inliers: {inlier_mask.sum()}/{len(inlier_mask)}")

# Huber: squared loss for small errors, linear (absolute) loss for large errors
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)
print(f"Huber R²: {huber.score(X_test, y_test):.4f}")
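
The pitfalls table also mentions IQR-based outlier detection. A minimal sketch of that rule is shown below; the 1.5 multiplier is the conventional default, and the function name and sample values are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical target values with one obvious outlier
prices = np.array([10, 12, 11, 13, 12, 11, 100])
mask = iqr_outlier_mask(prices)
print(prices[mask])   # [100]
```

Whether a flagged point should be dropped, capped, or kept is a judgment call that depends on whether it is a data error or a genuine extreme observation.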

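When to Use / Not Use Linear Regression

Good Fit for Linear Regression

- Small to medium datasets
- Interpretability is required
- Features are linearly related to the target
- As a baseline model
- Fast training is needed
- Works well after feature engineering

Poor Fit for Linear Regression

- The target is a classification problem
- Strongly nonlinear relationships
- Many outliers
- Complex interactions between features
- Image/text/sequence data
- Far more features than samples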

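Comparison with Other Regression Algorithms

Algorithm	Strengths	Weaknesses	Training complexity (n samples, p features, k trees)
Linear regression	Simple, fast, interpretable	Only fits linear relationships	O(np²)
Decision tree	Handles nonlinearity, needs no scaling	Overfits easily, unstable	O(np log n)
Random forest	Robust, handles nonlinearity well	Slower, hard to interpret	O(k * np log n)
XGBoost / LightGBM	High accuracy, handles complex patterns	Needs tuning, overfits easily	O(k * np log n)
Neural network	Arbitrarily complex patterns, strong on large data	Needs lots of data and compute	Depends on the architecture

Practical advice: start any regression task with linear regression as the baseline. If its R² is already good enough (> 0.8, say), you may not need a more complex model. Complex models are not necessarily better than simple ones, so always start simple.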

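Frequently Asked Questions (FAQ)

Q1: What is the difference between linear regression and logistic regression?

Linear regression predicts continuous values (such as house prices), and its output is unbounded. Logistic regression predicts class probabilities (such as whether a customer will buy), and its output is mapped into (0, 1) by the sigmoid function. Despite the word "regression" in its name, logistic regression is in practice a classification algorithm. The loss functions also differ: linear regression uses MSE, while logistic regression uses cross-entropy.

Q2: Why is my R² negative?

A negative R² means your model does worse than always predicting the mean. Common causes: (1) the model was not trained properly (for example, gradient descent did not converge); (2) the training and test distributions differ substantially; (3) the data is strongly nonlinear but you used only a linear model; (4) feature scaling was done incorrectly. To diagnose, first confirm that the training R² is positive, then investigate the test set.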
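
A tiny worked example makes the negative case concrete. The sketch below computes R² directly from its definition for two sets of predictions on made-up values: always predicting the mean, and predicting the trend in reverse.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0])

r2_mean = r2(y_true, np.full(3, y_true.mean()))  # always predict the mean
r2_bad = r2(y_true, np.array([3.0, 2.0, 1.0]))   # trend is reversed
print(r2_mean, r2_bad)  # 0.0 -3.0
```

Predicting the mean gives exactly 0, and the reversed predictions have a residual sum of squares four times the total sum of squares, hence R² = 1 - 4 = -3.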

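Q3: Does linear regression need standardized data?

It depends on the method. With the normal equation or sklearn's LinearRegression (which uses an SVD-based solver), standardization is not needed: the predictions are the same either way, and only the scale of the coefficients changes. Standardization is strongly recommended in these cases: (1) with gradient descent, large differences in feature scale cause slow convergence or divergence; (2) with Ridge/Lasso/ElasticNet, the penalty treats unstandardized features unevenly; (3) when you want to compare the magnitudes of feature weights, which are only comparable after standardization.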
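
The claim that unregularized least squares does not care about feature scale can be checked numerically. The sketch below fits the same synthetic data twice with `np.linalg.lstsq`, once on raw features and once on standardized ones; the helper name and the data are assumptions made for illustration.

```python
import numpy as np

def fit_predict_ols(X_fit, y_fit, X_new):
    """Ordinary least squares with an intercept, solved by np.linalg.lstsq."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X_raw = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])
y_obs = 5 + 2 * X_raw[:, 0] + 0.01 * X_raw[:, 1] + rng.normal(scale=0.1, size=100)

# Standardize using the statistics of the same data
mean, std = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mean) / std

pred_raw = fit_predict_ols(X_raw, y_obs, X_raw)
pred_std = fit_predict_ols(X_std, y_obs, X_std)
print("max prediction difference:", np.max(np.abs(pred_raw - pred_std)))
```

The predictions agree up to floating-point error. The fitted coefficients do differ between the two runs, which is why standardization still matters when you want to compare weights or add a penalty.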

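Q4: Can linear regression be used for time series forecasting?

Yes, but mind the independence assumption. Residuals in time series data are usually autocorrelated, which violates it. Options: (1) use lag features, feeding the previous N steps in as input features; (2) add time-related features (month, day of week, holiday flag, and so on); (3) consider a dedicated time series model such as ARIMA, Prophet, or LSTM. Linear regression works as a quick baseline, but dedicated models usually perform better.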
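
Lag features are easy to build by slicing the series. The sketch below is one possible construction in NumPy; the function name and the toy series are hypothetical, and the resulting X and y can be passed to any of the regressors shown earlier.

```python
import numpy as np

def make_lag_features(series, n_lags):
    """Build (X, y) where row t of X holds the n_lags values preceding y[t],
    most recent first."""
    series = np.asarray(series, dtype=float)
    columns = [series[n_lags - k : len(series) - k] for k in range(1, n_lags + 1)]
    X = np.column_stack(columns)
    y = series[n_lags:]
    return X, y

# Hypothetical series 0, 1, 2, 3, 4, 5 with two lags
X_lag, y_lag = make_lag_features(np.arange(6), n_lags=2)
print(X_lag)
print(y_lag)
```

Each row pairs a target value with the two values that came just before it, so the first two observations are used only as inputs. When splitting such data, keep the test period after the training period rather than shuffling, so the model never sees the future.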

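Q5: How do I choose alpha for Ridge and Lasso?

Let cross-validation choose it. sklearn provides RidgeCV and LassoCV, which search for the best alpha with built-in cross-validation. Recommended practice: (1) use a log-spaced list of candidates, such as [0.001, 0.01, 0.1, 1, 10, 100]; (2) use 5-fold or 10-fold cross-validation; (3) inspect the cross-validation score curve and pick the alpha at the elbow. If alpha is too small there is almost no regularization (possible overfitting); if it is too large the regularization is too strong (possible underfitting).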
