Chapter 26

GPU: Why It's Perfect for AI

Have you ever wondered how a "graphics processor" became the most important computing engine of the AI era? It wasn't an accident. There's a deeply logical reason behind it.

Think of it this way: training a deep neural network is like grading ten thousand exam papers that all follow the exact same rubric. If you only have one superstar teacher (a CPU), no matter how fast she works, the papers still have to go through her one at a time. But if you have ten thousand ordinary teachers (GPU cores), each grading one paper at the same time, the total throughput is far higher, even if each individual teacher works more slowly.

That's the core philosophy of a GPU: trade a small number of fast cores for a massive number of parallel ones.

Core Concepts

CPU vs GPU: Two Radically Different Design Choices

Both CPUs and GPUs are chips, but they were designed from birth to solve completely different problems.

CPU (e.g., Intel Core i9):
┌────────────────────────────────────────────┐
│  Core 1  │  Core 2  │  Core 3  │  Core 4  │
│ (Powerful)│(Powerful)│(Powerful)│(Powerful)│
│  Branch   │  Out-of- │  Large   │         │
│ Prediction│  Order   │  Cache   │  ...    │
└────────────────────────────────────────────┘
Goal: Make a single task run as fast as possible (low latency)
Core count: typically 8–32

GPU (e.g., NVIDIA RTX 4090):
┌──────────────────────────────────────────────────────────┐
│ SM  │ SM  │ SM  │ SM  │ SM  │ SM  │ ... │ SM  │ SM  │ SM │
│ (Streaming Multiprocessors, each with 128 CUDA Cores)    │
│  128 SMs × 128 CUDA Cores = 16,384 CUDA Cores total     │
└──────────────────────────────────────────────────────────┘
Goal: Run a massive number of tasks simultaneously (high throughput)
CUDA Core count: thousands to tens of thousands

Each CPU core is "smart": it has complex branch prediction, out-of-order execution, and deep pipelines—all designed to make a single thread run as fast as possible. Each GPU core is "simple": essentially just "fetch data, compute, write result" with minimal control logic. The GPU's advantage is sheer numbers.

SIMD and SIMT: Two Models of Parallelism

SIMD (Single Instruction, Multiple Data) is how a CPU core handles data parallelism. One instruction operates on a whole vector of data at once. For example, an AVX-512 instruction processes 512 bits at a time, which is 16 float32 numbers in parallel (512 / 32 = 16).

SIMT (Single Instruction, Multiple Threads) is NVIDIA's GPU execution model. A group of 32 threads called a Warp executes the same instruction at the same moment, but each thread operates on its own private data. Think of it as 32 soldiers marching in lockstep, each carrying a different piece of cargo but making the exact same moves.

SIMT Execution (one Warp, 32 threads):

Instruction: C[i] = A[i] * B[i]

Thread  0: C[0]  = A[0]  * B[0]   ─┐
Thread  1: C[1]  = A[1]  * B[1]    │
Thread  2: C[2]  = A[2]  * B[2]    │  Same moment,
...                                 │  same instruction,
Thread 31: C[31] = A[31] * B[31]  ─┘  different data

For matrix multiplication, this is a match made in heaven.
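
To make the lockstep picture concrete, here is a minimal pure-Python sketch that emulates one warp. It is only a model: real warps run in hardware, and the names `WARP_SIZE` and `run_warp` are invented for this illustration.

```python
WARP_SIZE = 32  # NVIDIA warps contain 32 threads


def run_warp(instruction, a, b):
    """Emulate one warp: every lane applies the same instruction to its own data."""
    c = [None] * WARP_SIZE
    for lane in range(WARP_SIZE):  # hardware executes these lanes at the same moment
        c[lane] = instruction(a[lane], b[lane])
    return c


a = [float(i) for i in range(WARP_SIZE)]
b = [2.0] * WARP_SIZE
c = run_warp(lambda x, y: x * y, a, b)  # the single shared instruction: multiply
print(c[:4], "...", c[-1])
```

The loop is sequential here only because this is ordinary Python; the point is that the instruction is fixed while the data differs per lane.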

GPU Memory Hierarchy: Speed vs. Capacity

GPUs have their own memory system, from fastest to slowest:

Fastest, smallest
│  Registers                  Private to each thread; fastest; ~KB
│  Shared Memory              Shared within a Block; ~48 KB; very fast
│  L1/L2 Cache                Automatic management; similar to CPU cache
│  Global Memory (VRAM)       GPU DRAM (e.g., 24 GB); slower
│  Host Memory (System RAM)   CPU-side RAM; transferred via PCIe; slowest
↓
Slowest, largest

The most important optimization in GPU programming is keeping data in shared memory and minimizing trips to global memory. This mirrors the CPU philosophy of exploiting the L1 cache—except GPU programmers must manually manage shared memory.
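
The payoff of keeping data in fast memory can be sketched on the CPU with a tiled (blocked) matrix multiplication. The NumPy code below is an analogy, not GPU code: each small tile stands in for data staged in shared memory, and `TILE`, `tiled_matmul`, and `global_loads` are hypothetical names chosen for this example. The load count is a rough model that assumes no caching and dimensions divisible by the tile size.

```python
import numpy as np

TILE = 16  # hypothetical tile edge; real kernels tune this to fit shared memory


def tiled_matmul(A, B, tile=TILE):
    """Multiply A (M×K) by B (K×N) one tile at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # On a GPU, these two tiles would be copied into shared memory
                # once, then reused by every thread in the block.
                a_tile = A[i:i + tile, k:k + tile]
                b_tile = B[k:k + tile, j:j + tile]
                C[i:i + tile, j:j + tile] += a_tile @ b_tile
    return C


def global_loads(M, N, K, tile=1):
    """Rough count of element reads from slow memory (tile=1 is the naive case)."""
    return 2 * M * N * K // tile


rng = np.random.default_rng(0)
A = rng.random((64, 48), dtype=np.float32)
B = rng.random((48, 32), dtype=np.float32)
C = tiled_matmul(A, B)
print(np.allclose(C, A @ B, rtol=1e-4))
print(global_loads(64, 32, 48), global_loads(64, 32, 48, tile=TILE))
```

Under this model, a tile edge of T cuts slow-memory reads by a factor of T, because each loaded element is reused T times before being discarded.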

Why Matrix Multiplication Is a Natural Fit for GPUs

What is the core operation in deep learning? Matrix multiplication. Fully-connected layers, convolutional layers, attention mechanisms—they all reduce to enormous amounts of matrix multiplication.

For a matrix product C = A × B, where A is M×K and B is K×N:

C[i][j] = Σ A[i][k] * B[k][j]   (summing over k)

Each element of C can be computed independently!
C[0][0], C[0][1], C[1][0], ... C[M-1][N-1]
These M×N results have zero dependencies between them.

GPU: Fine. I have 16,384 cores. I'll compute 16,384 elements at once. Let's go.

This is precisely why deep learning and GPUs are such a perfect pairing: neural networks feed the GPU a continuous stream of embarrassingly parallel matrix multiplications.
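
The independence claim is easy to check in a small sketch: if no element of C depends on another, computing the elements in a scrambled order must give the same answer. The helper name `c_element` and the tiny matrices are made up for this illustration.

```python
import random


def c_element(A, B, i, j):
    """C[i][j] = sum over k of A[i][k] * B[k][j]; reads A and B only, never C."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))


A = [[1, 2, 3], [4, 5, 6]]        # M×K = 2×3
B = [[7, 8], [9, 10], [11, 12]]   # K×N = 3×2
M, N = len(A), len(B[0])

# Visit the M×N output positions in a random order, as if each one
# were handled by a different GPU thread finishing at its own time.
positions = [(i, j) for i in range(M) for j in range(N)]
random.shuffle(positions)

C = [[0] * N for _ in range(M)]
for i, j in positions:
    C[i][j] = c_element(A, B, i, j)

print(C)  # [[58, 64], [139, 154]] regardless of the order
```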

The CUDA Programming Model: Grid / Block / Thread

NVIDIA's CUDA is the dominant framework for GPU programming. It organizes computation into a three-level hierarchy:

Grid
└── Block × many
    └── Thread × many

Example: element-wise addition of two 4096-element vectors

Grid
┌──────────────────────────────────────────┐
│  Block(0)       Block(1)      Block(2)...│
│  ┌──────────┐   ┌──────────┐            │
│  │ T0 T1 T2 │   │ T0 T1 T2 │  ...      │
│  │ T3 T4 T5 │   │ T3 T4 T5 │           │
│  │ ...      │   │ ...      │           │
│  └──────────┘   └──────────┘           │
└──────────────────────────────────────────┘

Each thread knows its own identity:
  global_index = blockIdx.x * blockDim.x + threadIdx.x
  It uses this index to decide which data element it handles.
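
The index formula can be traced in plain Python by looping over the blocks and threads that the GPU would run in parallel. This is an emulation, not CUDA: the block size of 256 is an assumed (though common) choice, and `vector_add_kernel` and `launch` are hypothetical names.

```python
N = 4096            # vector length from the example above
BLOCK_DIM = 256     # threads per block (assumed; a common choice)
GRID_DIM = (N + BLOCK_DIM - 1) // BLOCK_DIM   # blocks needed to cover N, rounded up


def vector_add_kernel(block_idx, thread_idx, a, b, c):
    """Body executed by one thread."""
    i = block_idx * BLOCK_DIM + thread_idx    # global_index
    if i < len(c):                            # guard for sizes not divisible by BLOCK_DIM
        c[i] = a[i] + b[i]


def launch(a, b):
    """Emulate a kernel launch; a GPU would run these iterations in parallel."""
    c = [0] * len(a)
    for block_idx in range(GRID_DIM):
        for thread_idx in range(BLOCK_DIM):
            vector_add_kernel(block_idx, thread_idx, a, b, c)
    return c


a = list(range(N))
b = [10] * N
c = launch(a, b)
print(GRID_DIM, c[0], c[N - 1])  # 16 blocks; first and last results
```

With these numbers, 16 blocks of 256 threads cover all 4096 elements exactly, so the bounds check never triggers; it matters as soon as the length is not a multiple of the block size.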

Tensor Cores: Hardware Built for AI

Starting with the Volta architecture in 2017, NVIDIA added Tensor Cores to its GPUs. An ordinary CUDA Core performs one floating-point multiply-add per clock cycle. A Volta-generation Tensor Core performs a full 4×4 matrix multiply-add (D = A × B + C) per clock cycle, which amounts to 64 multiply-adds at once, optimized specifically for the multiply-accumulate patterns at the heart of deep learning.

The RTX 4090 has 512 Tensor Cores. In mixed-precision (FP16) mode, it delivers a peak of roughly 330 TFLOPS, compared to a peak of roughly 82 TFLOPS for standard FP32 on the CUDA Cores. For modern AI workloads, Tensor Cores have multiplied GPU training speed several times over.
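
What one Tensor Core operation computes can be written out in NumPy as D = A × B + C on 4×4 tiles. This is a numerical sketch of the mixed-precision idea (FP16 inputs, FP32 accumulation), not how the hardware is programmed; `tensor_core_mma` is a hypothetical name.

```python
import numpy as np


def tensor_core_mma(A, B, C):
    """D = A @ B + C for 4×4 tiles: FP16 inputs, FP32 accumulation."""
    assert A.shape == B.shape == C.shape == (4, 4)
    a16 = A.astype(np.float16)   # inputs stored in half precision
    b16 = B.astype(np.float16)
    # Products are accumulated in FP32 to limit rounding error.
    return a16.astype(np.float32) @ b16.astype(np.float32) + C.astype(np.float32)


# 4×4 outputs, each a dot product of length 4: 4 * 4 * 4 = 64 multiply-adds
MULTIPLY_ADDS = 4 * 4 * 4

A = np.arange(16).reshape(4, 4)
B = np.eye(4)
C = np.ones((4, 4))
D = tensor_core_mma(A, B, C)
print(D.dtype, MULTIPLY_ADDS)
```

Storing inputs in FP16 halves the memory traffic per value, while accumulating in FP32 keeps long sums from losing too much precision.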

Hands-On

Use Python to compare NumPy (CPU) and CuPy (GPU) on a large matrix multiplication:

import time

import numpy as np

# --- CPU version ---
size = 4096
A_cpu = np.random.rand(size, size).astype(np.float32)
B_cpu = np.random.rand(size, size).astype(np.float32)

start = time.perf_counter()   # monotonic, high-resolution timer
C_cpu = np.matmul(A_cpu, B_cpu)
elapsed_cpu = time.perf_counter() - start
print(f"CPU (NumPy): {elapsed_cpu:.3f} seconds")

# --- GPU version (requires CuPy and CUDA) ---
try:
    import cupy as cp

    A_gpu = cp.asarray(A_cpu)   # transfer data from CPU RAM to GPU VRAM
    B_gpu = cp.asarray(B_cpu)

    # Warm-up pass (the first call pays one-time initialization costs)
    _ = cp.matmul(A_gpu, B_gpu)
    cp.cuda.Stream.null.synchronize()

    # Timed region covers the multiplication only, not the transfers above
    start = time.perf_counter()
    C_gpu = cp.matmul(A_gpu, B_gpu)
    cp.cuda.Stream.null.synchronize()   # wait for GPU to actually finish
    elapsed_gpu = time.perf_counter() - start

    print(f"GPU (CuPy):  {elapsed_gpu:.4f} seconds")
    print(f"Speedup: {elapsed_cpu / elapsed_gpu:.1f}x")

except ImportError:
    print("CuPy not installed. Run: pip install cupy-cuda12x")

Typical output on a machine with an RTX 3080:

CPU (NumPy): 2.847 seconds
GPU (CuPy):  0.019 seconds
Speedup: 149.8x

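In this run the GPU is roughly 150× faster on the multiplication itself. Note that the timed region starts after the arrays have been copied to the GPU, so the figure excludes data transfer overhead; including the copy would narrow the gap. Exact numbers vary with the hardware and the BLAS library NumPy is linked against.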
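
Timings like these can be converted into achieved throughput, since multiplying two n×n matrices takes about 2n³ floating-point operations (n³ multiplies and roughly n³ adds). A small sketch, using the sample timings above as inputs; `matmul_flops` and `achieved_tflops` are hypothetical helper names:

```python
def matmul_flops(n):
    """Approximate floating-point operations for an n×n by n×n matmul."""
    return 2 * n ** 3


def achieved_tflops(n, seconds):
    """Achieved throughput in TFLOPS (10**12 operations per second)."""
    return matmul_flops(n) / seconds / 1e12


size = 4096
cpu_tflops = achieved_tflops(size, 2.847)   # sample CPU timing from above
gpu_tflops = achieved_tflops(size, 0.019)   # sample GPU timing from above
print(f"CPU: {cpu_tflops:.3f} TFLOPS, GPU: {gpu_tflops:.2f} TFLOPS")
```

For the sample run this works out to roughly 0.05 TFLOPS on the CPU and about 7.2 TFLOPS on the GPU, which is a more portable way to compare machines than raw seconds.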

No GPU on hand? Run this on Google Colab's free T4 GPU runtime—CuPy comes pre-installed. Just switch the runtime type to GPU under Runtime → Change runtime type.

🔬 Going Deeper

Memory Bandwidth Is Often the Real Bottleneck

Many people assume that slow GPU training means "not enough compute." In practice, modern AI training is frequently memory-bound rather than compute-bound. The RTX 4090 delivers 82 TFLOPS of FP32 compute but only 1,008 GB/s of memory bandwidth, so the arithmetic units can stay busy only if each byte fetched from memory is reused for many operations. A large model's parameter matrices may occupy tens of gigabytes; every training iteration must read and write those parameters repeatedly, and the data movement can't keep up with the arithmetic units. The compute cores end up waiting for data. This is why datacenter GPUs adopted high-bandwidth memory (HBM): the 80 GB A100, for example, uses HBM2e with about 2 TB/s of bandwidth. It is also why optimization techniques like FlashAttention (Dao et al., 2022) focus explicitly on reducing memory traffic rather than raw computation.
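
A back-of-the-envelope way to tell the two regimes apart is arithmetic intensity: floating-point operations performed per byte moved from memory. The sketch below uses the RTX 4090 figures quoted above and a simplified roofline estimate; the function names are hypothetical, and the byte counts assume each array is read or written exactly once.

```python
PEAK_FLOPS = 82e12        # FP32 peak quoted above (operations per second)
BANDWIDTH = 1008e9        # memory bandwidth quoted above (bytes per second)

# Intensity needed before compute, not memory, becomes the limit.
RIDGE = PEAK_FLOPS / BANDWIDTH   # about 81 FLOPs per byte


def attainable_flops(intensity):
    """Simplified roofline: the lower of the compute and bandwidth ceilings."""
    return min(PEAK_FLOPS, intensity * BANDWIDTH)


def vector_add_intensity():
    # c[i] = a[i] + b[i] in float32: 1 FLOP, 2 reads + 1 write of 4 bytes each
    return 1 / 12


def matmul_intensity(n):
    # n×n float32 matmul: 2n^3 FLOPs over 3 matrices of n^2 * 4 bytes
    return (2 * n ** 3) / (3 * n ** 2 * 4)


for name, ai in [("vector add", vector_add_intensity()),
                 ("matmul n=4096", matmul_intensity(4096))]:
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"{name}: {ai:.2f} FLOPs/byte, {bound}, "
          f"up to {attainable_flops(ai) / 1e12:.2f} TFLOPS")
```

Under these assumptions an element-wise operation can use only a tiny fraction of the chip's peak, while a large, well-tiled matrix multiplication has enough data reuse to be limited by compute instead.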

What GPUs Are Bad At

GPUs are not universal accelerators. Algorithms with dense branching and strong data dependencies actually perform poorly on GPUs. Because SIMT requires all threads in a Warp to execute the same instruction, when threads diverge due to if/else conditions—different threads taking different branches—the GPU must serialize the execution of both paths, effectively cutting throughput in half or worse. This is called Warp Divergence. Decision trees, graph algorithms, and programs with complex control flow tend to be better suited to CPUs.
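
The cost of divergence can be illustrated with a toy cost model: a warp pays for every branch that at least one of its threads takes. This is a simplification of real hardware behavior, and `warp_cycles` is a hypothetical name.

```python
def warp_cycles(takes_if, if_cost, else_cost):
    """Cycles for one warp to run an if/else, given each thread's condition."""
    cycles = 0
    if any(takes_if):        # some thread needs the 'if' path: the warp steps through it
        cycles += if_cost
    if not all(takes_if):    # some thread needs the 'else' path: run it too
        cycles += else_cost
    return cycles


uniform = [True] * 32                        # all 32 threads agree
divergent = [i % 2 == 0 for i in range(32)]  # even and odd threads disagree

print(warp_cycles(uniform, 10, 10))    # 10: only one path runs
print(warp_cycles(divergent, 10, 10))  # 20: both paths run back to back
```

Note that a single disagreeing thread is enough to trigger the slow case, which is why data layouts that keep neighboring threads on the same branch matter so much.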

Recommended Resources

NVIDIA CUDA C Programming Guide (official, free online)—the definitive reference for GPU architecture and the CUDA programming model, straight from the source
Programming Massively Parallel Processors (Kirk & Hwu)—the most widely used textbook on GPU parallel computing, with hands-on CUDA examples throughout
FlashAttention paper (Dao et al., 2022)—"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"—a landmark paper that applies the memory hierarchy insight to Transformer attention, a must-read for understanding modern LLM training efficiency
