What Comes Next for Computers
We started with a switch. AND, OR, NOT—a handful of logic gates, then an adder, a flip-flop, memory, a CPU, an operating system, a network, a compiler, caches, virtualization, GPUs. Twenty-eight chapters later, you've assembled a complete conceptual model of a computer from the ground up: from a single transistor flipping on and off all the way to the software and cloud services you use every day.
Now let's look at where this machine is headed. Not science fiction—engineering that is already happening.
Core Concepts
The End of Moore's Law
In 1965, Gordon Moore (who went on to co-found Intel in 1968) observed that the number of transistors that could be economically placed on an integrated circuit was doubling about every year; in 1975 he revised the pace to roughly every two years. This became Moore's Law, and for fifty years it ran like a punctual train, propelling the entire computer industry forward at breakneck speed.
That train is slowing down. In some ways it has already stopped.
The underlying problem is the breakdown of Dennard Scaling. Dennard Scaling held that as transistors shrank, power density stayed constant: smaller transistors meant proportionally lower voltage and current, so total chip power consumption stayed manageable. This rule began to fail around 2005. Transistor features had shrunk well below 100 nanometers and gate insulators were only a few atomic layers thick, so leakage current (including quantum tunneling through the gate) could no longer scale down proportionally, and supply voltage could not keep dropping. Power density, measured in watts per square millimeter, began to rise relentlessly.
The result: chips run hot. You can't let every part of the chip run at full speed simultaneously because the heat would be unmanageable. This is why Intel stopped pushing clock frequencies aggressively in the mid-2000s and pivoted to multiple cores—not because higher clocks were physically impossible, but because the power cost was prohibitive. Engineers call this the Power Wall.
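The arithmetic behind Dennard Scaling can be sketched with the standard dynamic-power relation, power ≈ capacitance × voltage² × frequency. This is a simplified model, not a process simulation: the 0.7 shrink factor per generation and the function name are illustrative assumptions, and leakage power is ignored entirely.

```python
# Simplified dynamic-power model: P = C * V^2 * f per transistor.
# All quantities are relative to an arbitrary starting generation (1.0).

def power_density(s: float, voltage_scales: bool) -> float:
    """Relative power density after shrinking linear dimensions by factor s (< 1)."""
    capacitance = s                       # capacitance shrinks with dimensions
    frequency = 1.0 / s                   # shorter gates can switch faster
    voltage = s if voltage_scales else 1.0
    power_per_transistor = capacitance * voltage**2 * frequency
    area_per_transistor = s**2
    return power_per_transistor / area_per_transistor

s = 0.7  # assumed linear shrink per process generation (illustrative)
dennard = power_density(s, voltage_scales=True)
post_dennard = power_density(s, voltage_scales=False)
print(f"Voltage scales with size: {dennard:.2f}x power density")
print(f"Voltage stays fixed:      {post_dennard:.2f}x power density")
```

When voltage shrinks along with the transistor, the V² term cancels the gain in density and the chip stays equally hot per square millimeter. Once voltage stops scaling, every generation roughly doubles power density in this model, which is the Power Wall in miniature.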
Transistor count (Moore's Law still roughly holds):
2003: Pentium 4, 0.13μm, 55 million transistors
2012: Ivy Bridge, 22nm, 1.4 billion transistors
2023: Apple M2 Ultra, 5nm, 134 billion transistors
But single-core performance growth has slowed:
1990s to early 2000s: clock speed grew ~30% per year
2010s onward: single-core performance grows ~5–10% per year
The bottleneck shifted from "how do we fit more transistors?"
to "we have the transistors—how do we keep them usefully running?"
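To check the "still roughly holds" claim yourself, you can compute the doubling time implied by two of the data points above. The sketch uses the Pentium 4 and Ivy Bridge figures from the list; the 7% growth rate for single-core performance is an assumed midpoint of the 5–10% range, and the function name is illustrative.

```python
import math

def doubling_time_years(count_start: float, count_end: float, years: float) -> float:
    """Years per doubling implied by growth from count_start to count_end."""
    doublings = math.log2(count_end / count_start)
    return years / doublings

# Figures from the list above: Pentium 4 (2003) and Ivy Bridge (2012).
transistor_doubling = doubling_time_years(55e6, 1.4e9, 2012 - 2003)
print(f"Transistor count doubling time: {transistor_doubling:.2f} years")

# Single-core performance at an assumed 7% per year (midpoint of 5-10%).
perf_doubling = math.log(2) / math.log(1.07)
print(f"Single-core performance doubling time: {perf_doubling:.1f} years")
```

Transistor counts between those two chips doubled a little faster than every two years, while single-core performance at 7% a year needs about a decade to double. That difference is the shift in bottleneck described above.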
Three Paths Through the Post-Moore Era
The industry didn't accept this as a dead end. It found three ways around the physical limits:
Path 1: 3D Stacking
If you can't fit more transistors on a flat surface, stack them vertically. The memory industry led the way: HBM (High Bandwidth Memory) stacks several DRAM dies (typically four to twelve) vertically and connects them with through-silicon vias (TSVs), often achieving more than 10× the bandwidth of a conventional DRAM module. Logic chip makers are following: Intel's Foveros technology stacks dies face-to-face on top of one another, like building a hamburger out of silicon.
Path 2: Chiplets
Instead of fabricating one massive die (low yield, high cost), break the chip into smaller tiles, manufacture each separately, and join them inside a single package using advanced packaging and die-to-die interconnects such as Intel's EMIB and AMD's Infinity Fabric.
Traditional monolithic die:        Chiplet approach:
┌────────────────────────┐         ┌────┬────┬────┐
│                        │         │ CPU│ CPU│ CPU│
│     One large die      │         │ die│ die│ die│
│ (low yield, big area)  │         ├────┴────┴────┤
│                        │         │    IO die    │
└────────────────────────┘         └──────────────┘
Each die is manufactured independently:
CPU dies on a cutting-edge process;
IO die on a mature process.
Better cost and yield for both.
AMD's EPYC server processors and Ryzen desktop chips use exactly this chiplet design, which has helped AMD compete on both cost and performance.
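Why small dies yield better can be sketched with the simple Poisson defect model, in which the chance that a die has no defects falls exponentially with its area. The defect density and die areas below are assumed round numbers for illustration, not figures for any real process or product, and the model ignores the IO die and packaging costs.

```python
import math

def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-defects_per_cm2 * area_mm2 / 100.0)

DEFECT_DENSITY = 0.2   # assumed defects per cm^2 (illustrative)
MONOLITHIC_AREA = 600  # assumed mm^2 for one large die
CHIPLET_AREA = 150     # assumed mm^2 per chiplet; four replace the big die

mono_yield = poisson_yield(MONOLITHIC_AREA, DEFECT_DENSITY)
chiplet_yield = poisson_yield(CHIPLET_AREA, DEFECT_DENSITY)

# Average silicon fabricated per working product.
# Chiplets are tested individually, so a defect discards 150 mm^2, not 600.
mono_silicon = MONOLITHIC_AREA / mono_yield
chiplet_silicon = 4 * CHIPLET_AREA / chiplet_yield

print(f"Monolithic: yield {mono_yield:.1%}, {mono_silicon:.0f} mm^2 per good chip")
print(f"Chiplets:   yield {chiplet_yield:.1%}, {chiplet_silicon:.0f} mm^2 per good chip")
```

Under these assumptions the monolithic design burns more than twice as much silicon per working chip. Real cost models add packaging and testing overhead on the chiplet side, so the actual gap is smaller than this sketch suggests.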
Path 3: Domain-Specific Architectures
A general-purpose CPU must support every possible program, so it carries a lot of overhead for generality. A specialized chip does exactly one thing and dedicates every transistor to doing it well, which can make it 10× to 100× more efficient than a general-purpose CPU on that workload. Google's TPU (Tensor Processing Unit) is built for matrix multiplication. Apple's Neural Engine is built for machine learning inference. John Hennessy and David Patterson articulated this clearly in their Turing Lecture, "A New Golden Age for Computer Architecture" (delivered in 2018, published in 2019): domain-specific architectures are one of the most promising sources of performance gains in the post-Moore era.
RISC-V: Why an Open Instruction Set Matters
For decades, the instruction set architecture (ISA) landscape was controlled by two gatekeepers: Intel's x86 and ARM. If you wanted to design a chip, you either paid substantial licensing fees or built an incompatible proprietary architecture.
RISC-V is an open ISA designed at UC Berkeley starting in 2010. It is completely open-source and royalty-free—anyone can design a RISC-V chip without paying licensing fees. Its significance to the chip industry is analogous to Linux's significance to the operating system world: it lowers the barrier to entry, lets more countries and companies independently develop processors, and gives academic researchers and specialized chip designers enormous freedom. RISC-V chips are now shipping in embedded devices, data centers, laptops, and even spacecraft.
Processing In Memory: Tearing Down the Memory Wall
You learned earlier in this book that CPUs are fast, memory is slow, and caches exist to bridge the gap. As data volumes have exploded, this memory wall has grown harder and harder to climb. A large language model's parameters can weigh tens of gigabytes. Every inference pass requires hauling enormous amounts of data from memory to the CPU or GPU. The data movement itself can consume more time and energy than the actual arithmetic.
PIM (Processing In Memory) and NDP (Near Data Processing) flip the script: move the compute units next to the memory—or inside it—and process data in place, minimizing data movement.
Traditional architecture:
Memory ──(memory bus, limited bandwidth)──→ CPU/GPU ──→ compute
Processing in Memory:
Memory ←→ [compute units embedded in memory] ──→ result
Data stays still. Computation travels to the data.
Samsung's HBM-PIM is a commercial effort in this direction, and CXL-attached memory such as Micron's memory expansion modules attacks the same bottleneck from the capacity and bandwidth side. The idea isn't new, but the urgency created by AI workloads has made it a real engineering priority.
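A back-of-the-envelope estimate shows why data movement dominates. The sketch below asks how long it takes to generate one token from a language model, first counting only the arithmetic and then counting only the time to stream the weights from memory. Every number is an assumed round figure (a 7-billion-parameter model, 1 TB/s of bandwidth, 100 TFLOP/s of compute, one request at a time), not a measurement of any particular chip.

```python
# Is generating one token limited by math or by memory traffic?
# All hardware numbers are assumed round figures, not a specific product.

PARAMS = 7e9             # assumed model size: 7 billion parameters
BYTES_PER_PARAM = 2      # 16-bit weights
MEM_BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s
PEAK_FLOPS = 100e12      # assumed peak arithmetic rate: 100 TFLOP/s

bytes_moved = PARAMS * BYTES_PER_PARAM   # every weight is read once per token
flops_needed = 2 * PARAMS                # one multiply and one add per weight

memory_time = bytes_moved / MEM_BANDWIDTH
compute_time = flops_needed / PEAK_FLOPS

print(f"Time to stream the weights: {memory_time * 1e3:.2f} ms")
print(f"Time to do the arithmetic:  {compute_time * 1e3:.2f} ms")
print(f"Memory takes {memory_time / compute_time:.0f}x longer than the math")
```

With these assumptions the arithmetic units sit idle about 99% of the time, waiting for data. Batching many requests together reuses each weight and narrows the gap, but the basic imbalance is what motivates putting compute next to the memory.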
Quantum Computing: What Problems Does It Actually Solve?
A quantum computer uses qubits instead of classical bits. A classical bit is either 0 or 1. A qubit can exist in a superposition of both: its state is a weighted combination of 0 and 1, and the weights (called amplitudes) determine how likely each outcome is when you measure it. Qubits can also be entangled, so that the measurement outcome of one is correlated with the outcome of another, however far apart they are.
Classical bit: Qubit:
0 or 1 α|0⟩ + β|1⟩
Simultaneous superposition of 0 and 1
Measurement causes "collapse" to a definite 0 or 1
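The measurement rule in the diagram can be sketched in a few lines: the probability of reading 0 is |α|², the probability of reading 1 is |β|², and the two must sum to 1. This is a classical sampling sketch with an illustrative function name. It reproduces the statistics of measuring a single qubit, not the interference effects that make quantum algorithms work.

```python
import math
import random

def measure(alpha: complex, beta: complex, rng: random.Random) -> int:
    """Simulate one measurement of the state alpha|0> + beta|1>."""
    p0 = abs(alpha) ** 2
    p1 = abs(beta) ** 2
    assert math.isclose(p0 + p1, 1.0), "state must be normalized"
    return 0 if rng.random() < p0 else 1

# Equal superposition: alpha = beta = 1/sqrt(2)
alpha = beta = 1 / math.sqrt(2)
rng = random.Random(0)  # fixed seed so runs are repeatable
shots = 10_000
ones = sum(measure(alpha, beta, rng) for _ in range(shots))
print(f"Measured 1 in {ones / shots:.1%} of {shots} shots")
```

Each individual measurement returns a plain 0 or 1. The superposition only shows up in the statistics over many repetitions, which is why quantum programs are run for many "shots".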
In 2019, Google's quantum processor Sycamore (53 qubits) completed a calculation in 200 seconds that Google estimated would take the world's fastest supercomputer at the time approximately 10,000 years (IBM disputed that estimate, arguing that an improved classical simulation would need only days). This was the "quantum supremacy" experiment. The task itself had no practical value: it was a carefully constructed problem designed to be hard for classical computers. It did demonstrate that quantum hardware can, in principle, outperform classical hardware on certain problems.
Quantum computers are not general-purpose accelerators. They are good at:
- Shor's algorithm: Factoring large numbers (which would break current RSA encryption)
- Grover's algorithm: Searching unstructured data (quadratic speedup: about √N steps instead of N)
- Quantum simulation: Simulating quantum chemistry, molecular drug design, new materials
Quantum computers are not going to replace classical computers for everyday tasks: sending email, running websites, training neural networks—at least not in the foreseeable future. They are a specialized tool for a specific class of problems.
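The quadratic speedup of Grover's algorithm can be seen in a small classical simulation of its amplitudes. The sketch below is an illustration with assumed names, not a quantum program: it tracks one real amplitude per item, so the simulation itself does work proportional to N at every step. What it shows is how few iterations the quantum algorithm needs.

```python
import math

def grover_success_probability(n_items: int, marked: int, iterations: int) -> float:
    """Simulate Grover's search by tracking one real amplitude per item."""
    amps = [1 / math.sqrt(n_items)] * n_items   # uniform superposition
    for _ in range(iterations):
        amps[marked] = -amps[marked]            # oracle: flip the marked item's sign
        mean = sum(amps) / n_items
        amps = [2 * mean - a for a in amps]     # diffusion: reflect about the mean
    return amps[marked] ** 2

N = 1024
best = math.floor(math.pi / 4 * math.sqrt(N))   # about (pi/4) * sqrt(N) iterations
p = grover_success_probability(N, marked=123, iterations=best)
print(f"N = {N}: {best} iterations, success probability {p:.4f}")
print(f"Classical search needs about {N // 2} lookups on average")
```

For 1,024 items, 25 iterations are enough to find the marked item almost every time, compared with roughly 512 lookups for a classical scan. The advantage grows with N, but it is a square-root improvement, not an exponential one.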
Neuromorphic Computing: Learning from the Brain
The human brain has 86 billion neurons and runs on roughly 20 watts of power—yet it handles tasks that computers still can't match (recognizing a face even when the photo is blurry, the angle is weird, and you only see half of it). Neuromorphic computing attempts to encode the brain's operating principles—spike-based signals, sparse activation, local learning rules—directly into chip circuitry.
Intel's Loihi chip and IBM's TrueNorth are the best-known examples. Loihi 2 supports up to one million artificial "neurons" and roughly 120 million "synapses" per chip. For processing sparse event streams (such as the output of a dynamic vision sensor), its energy consumption can be orders of magnitude lower than a GPU's. Neuromorphic computing is still at an early research stage, though, and has yet to find a killer commercial application at scale.
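"Spike-based signals" and "sparse activation" are easier to picture with a toy model. The sketch below implements a leaky integrate-and-fire neuron, a common textbook abstraction. It is not the neuron model of Loihi or TrueNorth specifically, and the leak factor, threshold, and input values are arbitrary assumptions.

```python
def simulate_lif(inputs, leak=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron: returns the time steps at which it spikes."""
    potential = 0.0
    spikes = []
    for t, current in enumerate(inputs):
        potential = potential * leak + current   # leak a little, then integrate input
        if potential >= threshold:
            spikes.append(t)                     # emit a spike event
            potential = 0.0                      # reset after firing
    return spikes

# A sparse input stream: mostly silence, with two short bursts of activity.
inputs = [0.0] * 5 + [0.4] * 4 + [0.0] * 10 + [0.6] * 3 + [0.0] * 5
spikes = simulate_lif(inputs)
print(f"{len(inputs)} time steps, spikes at {spikes}")
```

The neuron stays silent while nothing is happening and fires only when enough input accumulates in a short window. Hardware built around this model can leave most of its circuits idle most of the time, which is where the energy savings on sparse event streams come from.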
The Invariant Core of Computer Science
Technologies change. But some things don't.
Looking back across these 28 chapters, a few fundamental tensions appear in every era—and will likely persist in every future era:
Speed vs. capacity: Fast storage is always small and expensive (registers, L1 cache). Large storage is always slow and cheap (disk, tape). This tension produced caches, virtual memory, hierarchical storage systems—decades of engineering ingenuity.
General-purpose vs. specialized: General-purpose designs are flexible but inefficient; specialized designs are efficient but rigid. From the CPU to the GPU to the TPU and NPU, every step toward specialization trades flexibility for efficiency on a specific problem.
Speed vs. correctness: Pipelining, out-of-order execution, prefetching, speculative execution: every acceleration technique carries a cost. Mispredict a branch and you flush the pipeline. The Spectre and Meltdown vulnerabilities exist precisely because speculative execution leaves traces in the cache that can leak data across security boundaries.
Abstraction vs. control: High-level languages let you express ideas quickly, at the cost of losing control over what the hardware actually does. Assembly and kernel code give you maximum control, but every detail must be managed by hand.
Every major breakthrough in computer science history has, at its core, found a new equilibrium point among these tensions. Every future breakthrough will too.
Hands-On
You may not have access to a quantum computer or a neuromorphic chip, but you can experience the quantum programming model on an ordinary laptop:
# Simulate a simple quantum superposition with Qiskit
# Install: pip install qiskit qiskit-aer
from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator
# Create a single-qubit quantum circuit
qc = QuantumCircuit(1, 1)
# Apply a Hadamard gate: transforms |0⟩ into (|0⟩ + |1⟩) / √2
qc.h(0)
# Measure the qubit (superposition "collapses" to 0 or 1)
qc.measure(0, 0)
# Run on a simulator, 1000 shots
simulator = AerSimulator()
job = simulator.run(qc, shots=1000)
result = job.result()
counts = result.get_counts()
print(counts)
# Output, approximately: {'0': 498, '1': 502}
# Roughly half 0s and half 1s—this is quantum superposition in action
You can also sign up for a free account on IBM Quantum (quantum.ibm.com) and run this on real quantum hardware, such as a 127-qubit Eagle processor, rather than a simulator (the devices and free-tier limits on offer change over time). Actual quantum hardware, accessible from a browser.
🔬 Going Deeper
Hennessy & Patterson's Turing Lecture: Where We Stand
John Hennessy (former Stanford president, co-inventor of the MIPS architecture) and David Patterson (leader of the Berkeley RISC project, co-author of Computer Organization and Design) jointly received the 2017 Turing Award. Their Turing Lecture, "A New Golden Age for Computer Architecture," delivered in 2018 and published in Communications of the ACM in 2019, is the single best piece of writing for understanding the post-Moore landscape: general-purpose CPU performance scaling is slowing, and domain-specific architectures (DSAs), open instruction sets (RISC-V), and domain-specific languages (DSLs) will together open a new golden age for computer architecture. The lecture is freely available on the ACM Digital Library. Read it.
The Memory Wall and Its Consequences
In 1994, William Wulf and Sally McKee published "Hitting the Memory Wall: Implications of the Obvious," predicting that as the gap between CPU and memory speeds widened, memory access would become the dominant performance bottleneck. Thirty years later, the wall hasn't come down—if anything, the rise of large AI models has made it more consequential than ever. Nearly every architectural discussion about AI chips today revolves around how to address this gap. This is why HBM, PIM, and CXL are becoming increasingly important, and why memory bandwidth now appears in benchmark comparisons as prominently as FLOPS.
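The core of the Wulf and McKee argument is a one-line formula: average access time = p × t_cache + (1 − p) × t_memory, where p is the cache hit rate. The sketch below plugs in assumed values (a 99% hit rate, a 1-cycle cache, and three illustrative miss costs) to show how the rare misses come to dominate as memory falls further behind the CPU. The numbers are not taken from the paper.

```python
def average_access_time(hit_rate: float, cache_cycles: float, memory_cycles: float) -> float:
    """Average memory access time in CPU cycles: p * t_c + (1 - p) * t_m."""
    return hit_rate * cache_cycles + (1 - hit_rate) * memory_cycles

HIT_RATE = 0.99      # assumed: 99% of accesses hit in cache
CACHE_CYCLES = 1     # assumed cost of a cache hit
for memory_cycles in (10, 100, 400):   # assumed miss costs as the gap widens
    t_avg = average_access_time(HIT_RATE, CACHE_CYCLES, memory_cycles)
    miss_share = (1 - HIT_RATE) * memory_cycles / t_avg
    print(f"miss cost {memory_cycles:>3} cycles: average {t_avg:.2f} cycles, "
          f"{miss_share:.0%} of it spent on misses")
```

Even with a cache that catches 99 out of 100 accesses, the single miss ends up accounting for most of the average once it costs hundreds of cycles. A better hit rate delays the problem, but it cannot remove it.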
A Final Note
You now understand computers—not as a user, but as someone who could, in principle, build one. You know what happens inside a CPU when an if statement runs. You know how an operating system makes every program believe it owns everything. You know how data travels across a network. You know why GPUs made AI possible.
Will this knowledge go out of date? The specific technical details will. But the ability to see how one layer of abstraction sits on top of another—to understand why every engineering design involves a trade-off rather than a free lunch—that way of thinking doesn't go out of date. Computers are still evolving, but now you have the framework to understand how.
Go build something.
Recommended Resources
- "A New Golden Age for Computer Architecture" (Hennessy & Patterson, Communications of the ACM, 2019). The most authoritative survey of post-Moore computer architecture directions; freely available on the ACM Digital Library.
- Quantum Computing: An Applied Approach (Jack Hidary, Springer). An accessible introductory book on quantum computing for programmers, with worked code examples (mostly in Google's Cirq rather than Qiskit).
- Computer Organization and Design: RISC-V Edition (Patterson & Hennessy). The textbook that ties together everything in this book at a deeper engineering level; the RISC-V edition is the most current.
- RISC-V International (riscv.org). Download the RISC-V specification for free and watch how an open instruction set moves from academic project to industrial standard in real time.