Gradient Descent Explained
What is Gradient Descent?
Gradient Descent is the most fundamental optimization algorithm in machine learning and deep learning. The intuition is beautifully simple: imagine you are standing on a mountain, surrounded by thick fog so you cannot see the landscape. However, you can feel the slope of the ground beneath your feet. To reach the valley (the minimum of the loss function), you take a small step in the direction of the steepest downhill slope at your current position, then repeat. This is the essence of gradient descent -- iteratively computing the gradient (the vector of partial derivatives) of the loss function with respect to the parameters, and updating the parameters in the opposite direction to gradually minimize the loss.
Virtually all neural network training relies on gradient descent and its variants. Understanding the math behind gradients, the differences between optimizer variants, and learning rate scheduling strategies is essential for any deep learning practitioner. This guide covers everything from mathematical foundations to PyTorch implementation.
Mathematical Foundation
The Gradient
The gradient is the vector of partial derivatives of a multivariate function with respect to each of its variables. For a function f(x₁, x₂, ..., xₙ), the gradient is defined as:

∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
The gradient points in the direction of steepest ascent. The negative gradient points in the direction of steepest descent -- and that is exactly the direction gradient descent follows to minimize the loss function.
The Update Rule
The core update formula of gradient descent is elegantly simple:

θ ← θ − α · ∇J(θ)
Here, θ represents the model parameters, α is the learning rate, J(θ) is the loss function, and ∇J(θ) is the gradient of the loss with respect to the parameters. Each iteration updates the parameters by subtracting the gradient scaled by the learning rate.
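To make the rule concrete, here is a minimal sketch on a toy one-dimensional loss J(θ) = (θ − 3)², whose minimum is at θ = 3 (the objective, step count, and learning rate are chosen purely for illustration):

```python
# Toy loss: J(theta) = (theta - 3)^2, so dJ/dtheta = 2 * (theta - 3)
def minimize(theta=0.0, alpha=0.1, steps=100):
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)
        theta = theta - alpha * grad   # theta <- theta - alpha * grad J(theta)
    return theta

theta = minimize()   # approaches the minimum at theta = 3
```

Each iteration moves θ a fraction α of the way along the negative gradient; with a reasonable α the error shrinks geometrically.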
Impact of Learning Rate α
The learning rate is the single most critical hyperparameter in gradient descent:

- Too small: convergence is extremely slow; training may need orders of magnitude more iterations.
- Too large: updates overshoot the minimum, so the loss oscillates or diverges.
- Well-chosen: the loss decreases quickly and settles close to the minimum.
Types of Gradient Descent
Batch Gradient Descent
Uses the entire training dataset to compute the gradient at each step. The gradient estimate is accurate and updates are stable, but the computational cost per step is very high for large datasets. It cannot perform online learning and may get stuck at saddle points.
Stochastic Gradient Descent (SGD)
Uses a single sample to compute the gradient at each step. Updates are extremely frequent, which helps escape local minima and supports online learning. However, gradient estimates are noisy, the loss curve oscillates heavily, and convergence is erratic.
Mini-batch Gradient Descent
The standard approach in practice. Uses a small batch (typically 32-256 samples) to compute the gradient. It combines the stability of batch methods with the efficiency of stochastic methods and takes full advantage of GPU parallelism. This is the default training mode in PyTorch and TensorFlow.
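As an illustration of the mini-batch loop, here is a NumPy sketch that reshuffles the data each epoch and updates on one batch at a time (the synthetic data, names, and hyperparameters are all illustrative):

```python
import numpy as np

# Synthetic linear-regression problem: y = X @ true_w + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
alpha, batch_size = 0.1, 64

for epoch in range(20):
    idx = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]      # indices of one mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
        w -= alpha * grad
```

Each epoch makes one pass over the data in random order, so every update uses a fresh gradient estimate from 64 samples rather than the full 1000.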
Comparison Table
| Method | Speed | Stability | Memory | Best For |
|---|---|---|---|---|
| Batch GD | Slow (high per-step cost) | High (accurate gradient) | High (all data in memory) | Small datasets, convex problems |
| Stochastic GD | Fast (low per-step cost) | Low (noisy) | Low (single sample) | Online learning, streaming data |
| Mini-batch GD | Optimal (GPU parallel) | Medium (balanced) | Medium (one batch) | Standard deep learning |
Optimizers Explained
Vanilla SGD suffers from slow convergence and oscillation issues. Researchers have developed improved optimizers that introduce momentum, adaptive learning rates, and other mechanisms to accelerate and stabilize training. Here are the most widely used optimizers.
SGD with Momentum
Momentum borrows from physics: the update depends not only on the current gradient but also accumulates previous update directions. This accelerates movement in consistent gradient directions and dampens oscillation, similar to a ball rolling downhill gaining speed. The momentum coefficient β is typically set to 0.9.
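A minimal sketch of the momentum update (variable names are illustrative; this mirrors the common "velocity" formulation, though libraries differ in small details):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # v is an exponentially decaying accumulation of past gradients ("velocity")
    v = beta * v + grad
    return w - lr * v, v

# Usage: minimize J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, v = 0.0, 0.0
for _ in range(300):
    w, v = sgd_momentum_step(w, v, 2 * (w - 3))
```

Because v keeps a running sum of past gradients, steps in a consistent direction compound (roughly up to 1/(1 − β) = 10x here), while alternating gradients cancel out.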
RMSprop (Root Mean Square Propagation)
RMSprop maintains an exponential moving average of squared gradients for each parameter, adaptively scaling the learning rate. Parameters with large gradients get a smaller effective learning rate; parameters with small gradients get a larger one. This solves the monotonically decreasing learning rate problem of Adagrad.
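A sketch of the RMSprop update for a single parameter (names and hyperparameters are illustrative):

```python
import math

def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2          # EMA of squared gradients
    return w - lr * grad / (math.sqrt(s) + eps), s  # per-parameter scaled step

# Usage: minimize J(w) = (w - 3)^2
w, s = 0.0, 0.0
for _ in range(600):
    w, s = rmsprop_step(w, s, 2 * (w - 3))
```

Dividing by √s keeps the effective step near lr regardless of the raw gradient magnitude, which is exactly the adaptive scaling described above.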
Adam (Adaptive Moment Estimation)
Adam is the most popular optimizer today, combining the benefits of Momentum (first moment estimate) and RMSprop (second moment estimate). It maintains both the mean (momentum) and variance (adaptive rate) of gradients for each parameter, with bias correction to eliminate initialization bias. The defaults (β₁=0.9, β₂=0.999, ε=1e-8) work well in most cases.
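Putting the two moments and the bias correction together, a scalar sketch of one Adam step (illustrative names; defaults match those quoted above):

```python
import math

def adam_step(w, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Usage: minimize J(w) = (w - 3)^2
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, 2 * (w - 3), t, lr=0.01)
```

The bias correction matters early: at t = 1, without it m would be only (1 − β₁) = 0.1 of the true gradient; dividing by (1 − β₁ᵗ) restores the full magnitude.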
AdamW (Decoupled Weight Decay)
AdamW fixes Adam's incorrect implementation of L2 regularization. In standard Adam, weight decay is added to the gradient before adaptive scaling, weakening the regularization effect. AdamW decouples weight decay from the gradient update, applying it directly to the parameters for better regularization. AdamW has become the standard choice for training Transformers and large language models.
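The decoupling can be shown in a small sketch: the decay acts on the weight directly and never passes through the adaptive scaling (names are illustrative; real implementations fold this into one step):

```python
import math

def adamw_step(w, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    w = w - lr * wd * w                   # decoupled decay: applied to w directly,
                                          # NOT added to grad before scaling
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# With a zero gradient, only the decay acts: w shrinks by the factor (1 - lr * wd)
w1, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, 1)
```

In coupled L2 (plain Adam), the decay term would be divided by √v̂ like any other gradient component, so large-gradient parameters would be regularized less; decoupling removes that interaction.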
Optimizer Comparison
| Optimizer | Adaptive LR | Momentum | Weight Decay | Typical Use |
|---|---|---|---|---|
| SGD | No | No | L2 | Convex optimization baseline |
| SGD + Momentum | No | Yes | L2 | CNN training (ResNet, etc.) |
| RMSprop | Yes | No | L2 | RNN / non-stationary objectives |
| Adam | Yes | Yes | L2 (coupled) | General-purpose default |
| AdamW | Yes | Yes | Decoupled | Transformer / LLM training |
Learning Rate Strategies
A fixed learning rate is rarely optimal. Early in training, a larger learning rate enables fast exploration; later, a smaller learning rate allows fine-tuning. Here are the most common learning rate scheduling strategies.
Fixed Learning Rate (Constant)
The simplest strategy: use the same learning rate throughout training. Suitable for small models and simple tasks, but rarely optimal for complex problems.
Step Decay
Multiply the learning rate by a decay factor (e.g., 0.1) every fixed number of epochs. Simple and intuitive, widely used in CNN training (e.g., ResNet decays lr at epochs 30, 60, 90).
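The schedule is a simple function of the epoch; a sketch with the ResNet-style defaults mentioned above (parameter names are illustrative):

```python
def step_decay(base_lr, epoch, drop=0.1, every=30):
    # multiply lr by `drop` once every `every` epochs: 0.1 -> 0.01 -> 0.001 -> ...
    return base_lr * drop ** (epoch // every)
```

So with base_lr = 0.1, epochs 0-29 train at 0.1, epochs 30-59 at 0.01, and so on.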
Cosine Annealing
The learning rate follows a cosine curve from the initial value smoothly down to a minimum (near 0). The deceleration is gradual in the later stages of training. This is one of the most popular scheduling strategies today.
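The curve itself is a half cosine between the two endpoints; a sketch (names are illustrative):

```python
import math

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    # half-cosine interpolation: base_lr at epoch 0, min_lr at the final epoch
    cos_term = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return min_lr + (base_lr - min_lr) * cos_term
```

The derivative of the cosine is zero at both ends, which is why the decay is gentle at the start and flattens out again near the end of training.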
Warmup
Start with a very small learning rate and linearly increase it to the target learning rate over the first few epochs, then begin decaying. Warmup prevents unstable gradient updates from randomly initialized parameters early in training. Transformer training almost always uses warmup -- skipping it often causes training to completely fail.
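A sketch of linear warmup followed by cosine decay, a common (not universal) Transformer recipe; the function name and exact shape are illustrative:

```python
import math

def warmup_lr(base_lr, step, warmup_steps, total_steps):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps             # linear ramp up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # then cosine decay
```

During the first warmup_steps updates the lr climbs linearly to base_lr, after which the usual decay takes over.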
OneCycleLR
A super-convergence strategy: the learning rate rises from a small value to a maximum, then decays back to a very small value, all within one training cycle. Proposed by Leslie Smith, it allows using learning rates up to 10x larger than conventional methods, significantly speeding up convergence.
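A simplified sketch of the one-cycle shape (cosine interpolation up, then down; the default divisors below mimic common practice but are assumptions, and PyTorch's OneCycleLR differs in details):

```python
import math

def one_cycle_lr(step, total_steps, max_lr, start_lr=None, final_lr=None,
                 pct_start=0.3):
    start_lr = max_lr / 25 if start_lr is None else start_lr    # illustrative default
    final_lr = max_lr / 1e4 if final_lr is None else final_lr   # anneal far below start
    up_steps = int(total_steps * pct_start)
    if step < up_steps:
        t, lo, hi = step / up_steps, start_lr, max_lr           # ramp-up phase
    else:
        t = (step - up_steps) / (total_steps - up_steps)        # anneal phase
        lo, hi = max_lr, final_lr
    return lo + (hi - lo) * (1 - math.cos(math.pi * t)) / 2     # cosine interpolation
```

The lr peaks at max_lr 30% of the way through and ends several orders of magnitude below where it started.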
Gradient Descent from Scratch (Linear Regression)
A complete gradient descent implementation for linear regression using pure NumPy, to help you understand the underlying mechanics:
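The sketch below fits y = 4 + 3x on synthetic data with full-batch gradient descent (the data-generating values and hyperparameters are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 4 + 3x + noise
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + 0.1 * rng.normal(size=100)

Xb = np.c_[np.ones(len(X)), X]    # prepend a bias column -> shape (100, 2)
theta = np.zeros(2)               # [intercept, slope]
alpha, n_iters = 0.1, 1000

for _ in range(n_iters):
    grad = 2 / len(y) * Xb.T @ (Xb @ theta - y)  # gradient of mean squared error
    theta -= alpha * grad                         # the gradient descent update

# theta should now be close to [4, 3]
```

Every line maps onto the math: the gradient of the MSE loss J(θ) = (1/n)‖Xθ − y‖² is (2/n)·Xᵀ(Xθ − y), and each iteration applies θ ← θ − α·∇J(θ).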
Common Problems and Pitfalls
Vanishing Gradients
In deep networks, gradients are multiplied layer by layer during backpropagation via the chain rule. If gradients at each layer are less than 1, they shrink exponentially over dozens of layers, approaching zero. Early layers barely update, preventing the network from learning deep features. Common with sigmoid/tanh activations in deep networks. Solutions include: ReLU activation, BatchNorm, residual connections (ResNet), and proper weight initialization (He/Xavier).
Exploding Gradients
The opposite of vanishing gradients: if gradients at each layer are greater than 1, they grow exponentially during backpropagation, causing enormous parameter updates and NaN loss values. Common in RNNs processing long sequences. Solutions include: gradient clipping, using LSTM/GRU instead of vanilla RNN, proper weight initialization, and reducing the learning rate.
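Gradient clipping by global norm can be sketched in a few lines (the function name is illustrative; it mirrors what frameworks' clip-by-norm utilities do):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Rescale ALL gradients by one common factor if their global L2 norm
    # exceeds max_norm, preserving the update direction
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

clipped = clip_grad_norm([np.array([3.0, 4.0])])   # norm 5 -> rescaled to norm 1
```

Because every gradient is scaled by the same factor, the update direction is unchanged; only its magnitude is capped.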
Saddle Points
Saddle points have zero gradient but are neither minima nor maxima -- they are minima in some directions and maxima in others. In high-dimensional spaces, saddle points vastly outnumber local minima. The stochasticity of SGD and momentum mechanisms help escape saddle points, which is one reason stochastic methods outperform batch methods in practice.
Local Minima
Non-convex loss functions may have multiple local minima, and gradient descent may converge to a suboptimal one. However, recent research shows that in high-dimensional deep learning, most local minima have loss values very close to the global minimum, making local minima less of a practical concern than previously thought. Saddle points and flat regions are typically the bigger challenge.
Learning Rate Too High / Too Low
Too high: Loss oscillates wildly from the start or immediately shoots up to NaN. The parameter updates overshoot the optimum and may leave the reasonable region of the loss landscape entirely. When this happens, reduce the learning rate by 10x immediately.
Too low: Loss decreases extremely slowly, remaining high even after hundreds of epochs. The model crawls through the search space and may need thousands of epochs to converge. Increase the learning rate by 3-10x, or use a warmup strategy.
Practical Tuning Tips

- Start with Adam or AdamW at lr = 1e-3 as a baseline, then tune from there.
- Sanity-check the pipeline by overfitting a tiny subset of the data first.
- Use warmup for Transformers and large-batch training; use gradient clipping for RNNs.
- Monitor gradient norms: zero or NaN values signal vanishing or exploding gradients.
- If the loss diverges, cut the learning rate by 10x; if it barely moves, raise it 3-10x or add warmup.
FAQ
Which optimizer should I choose?
For fast convergence and easy tuning, choose Adam/AdamW. For maximum accuracy with patience for tuning, choose SGD + Momentum + lr scheduling. In NLP/Transformer tasks, AdamW is essentially the only choice. In CV/CNN tasks, SGD + Momentum remains the go-to for competition-winning solutions.
How important is learning rate warmup?
Very important for large models and large batch sizes. Randomly initialized parameters produce unstable gradients early in training -- using a large learning rate immediately can cause the model to diverge. Warmup lets the model "warm up" with a small lr until parameters reach a reasonable range. For Transformers, skipping warmup frequently causes complete training failure.
Is a larger batch size always better?
Not necessarily. Larger batch sizes better utilize GPU parallelism and process more data per unit time, but excessively large batches may hurt generalization (the sharp minima problem). Common range is 32-512; very large batch training requires special lr strategies (e.g., LARS, LAMB). When limited by GPU memory, use gradient accumulation to simulate larger batches.
My loss is not decreasing -- how should I debug?
Debug in this order: 1) Check if the learning rate is appropriate (try 1e-3 first); 2) Verify data and labels are correct (overfit on a tiny dataset as a sanity check); 3) Ensure the loss function matches the task (CrossEntropy for classification, MSE for regression); 4) Check that optimizer.zero_grad() is being called; 5) Verify gradient norms are normal (non-zero, non-NaN); 6) Simplify the model, confirm the basic training pipeline works, then add complexity.
Can gradient descent optimize a non-differentiable loss?
Standard gradient descent requires the loss to be differentiable with respect to the parameters. For non-differentiable operations (argmax, discrete sampling), alternatives exist: Straight-Through Estimator, Gumbel-Softmax reparameterization, REINFORCE policy gradient, etc. In practice, ReLU is not differentiable at 0, but PyTorch defaults its gradient to 0 there, which works fine in training.