Chapter 25

Where Is the I/O Bottleneck?

Imagine a restaurant with two staffing models. Model A: one waiter per table, standing beside the guest from the moment they're seated until the food is eaten, not moving to another table while waiting. 100 tables need 100 waiters, most of whom are idle at any given moment. Model B: one waiter handles 100 tables. Whenever a table signals (guest waves, food is ready, bill requested), the waiter responds immediately. Otherwise, they're busy with something else.

Model A is synchronous blocking I/O. Model B is event-driven non-blocking I/O. Nginx uses Model B, which is why a single-threaded Nginx worker can manage tens of thousands of concurrent connections, while Apache's old prefork model (one process per connection) struggles under high concurrency.

Core Concepts

The Problem with Synchronous Blocking I/O

The simplest possible network server:

int fd = accept(listen_fd, ...);  // wait for a client connection
read(fd, buf, 1024);              // block until client sends data
process(buf);                     // handle the request
write(fd, response, len);         // block until write completes
close(fd);

When read() is called and no data has arrived yet, the OS suspends the thread entirely; it cannot do anything else. A slow client on a poor mobile connection can hold a thread hostage for seconds.
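The stall is easy to reproduce. The sketch below is a minimal illustration, not a real server: it uses socket.socketpair() as a stand-in for an accepted connection, and the helper names (slow_client, timed_blocking_recv) are made up for this example. The calling thread does nothing useful for as long as the simulated client takes to send.

```python
import socket
import threading
import time

def slow_client(sock, delay):
    """Simulate a client on a poor connection: wait, then send one request."""
    time.sleep(delay)
    sock.sendall(b"GET / HTTP/1.1\r\n\r\n")

def timed_blocking_recv(delay=0.3):
    """Return the received bytes and how long recv() kept the caller asleep."""
    server_side, client_side = socket.socketpair()  # stand-in for an accepted connection
    start = time.monotonic()
    client = threading.Thread(target=slow_client, args=(client_side, delay))
    client.start()
    data = server_side.recv(1024)   # the calling thread sleeps here until data arrives
    waited = time.monotonic() - start
    client.join()
    server_side.close()
    client_side.close()
    return data, waited

data, waited = timed_blocking_recv(0.3)
print(data, f"blocked for about {waited:.2f} s")
```

With one thread per connection, every slow client parks one thread in exactly this state.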

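The naive fix is threads: one thread per connection. But threads are expensive. The default stack size on Linux is typically 8 MB, so 10,000 connections reserve 80 GB of virtual address space just for stacks (only the pages a thread actually touches consume physical memory, but the reservation and the per-thread kernel bookkeeping still add up). Context switching requires saving and restoring dozens of registers, and each switch costs microseconds. At 10,000 concurrent connections, the thread-per-connection model falls apart.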
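The stack figure is simple arithmetic and can be checked directly. This sketch assumes an 8 MiB stack per thread (a common Linux default, adjustable with ulimit -s); the function name is illustrative only.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def stack_reservation_gib(threads, stack_bytes=8 * MIB):
    """Address space reserved for thread stacks, in GiB."""
    return threads * stack_bytes / GIB

print(stack_reservation_gib(10_000))   # 78.125
```

That is 78.125 GiB in binary units, or 80 GB when counting 8 MB as 8,000,000 bytes, which is the round number used above.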

select/poll: Better, But Still O(n)

select and poll let one thread watch many file descriptors simultaneously:

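fd_set readfds;
FD_ZERO(&readfds);                 // the set must be cleared before use
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);
select(max_fd + 1, &readfds, NULL, NULL, NULL);   // blocks
// On return, loop through ALL fds with FD_ISSET() to find which ones are ready;
// select() overwrites readfds, so rebuild the set before the next call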

The problem: every select call copies the entire fd set from user space to kernel space, and the kernel then scans every monitored fd to check readiness. That is O(n) work per call. With 10,000 connections, every I/O event triggers a scan of 10,000 descriptors (and select itself is capped at FD_SETSIZE descriptors, usually 1024). poll removes the cap and changes the data structure, but it doesn't fix the fundamental O(n) scan.
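The shape of the problem shows up even from Python. In this sketch (socket pairs stand in for client connections, and ready_via_select is a name invented here), 101 descriptors are handed to select() although only one has data. The full list has to be passed again on every call.

```python
import select
import socket

def ready_via_select(n_idle=100):
    """Watch n_idle quiet connections plus one active one with select()."""
    pairs = [socket.socketpair() for _ in range(n_idle + 1)]
    watched = [server_side for server_side, _ in pairs]
    pairs[-1][1].send(b"ping")                 # only the last peer sends anything
    # Every call hands the complete list to the kernel, which checks each entry.
    readable, _, _ = select.select(watched, [], [], 1.0)
    result = (len(watched), len(readable), readable[0].recv(16))
    for a, b in pairs:
        a.close()
        b.close()
    return result

print(ready_via_select())   # (101, 1, b'ping')
```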

epoll: O(1) Event Notification

Linux 2.6 introduced epoll, which solves the problem architecturally:

// Register once; the kernel maintains an internal red-black tree
int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);   // one-time registration

// Wait: returns ONLY the fds that are actually ready
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < n; i++) {
    handle(events[i].data.fd);   // only ready fds, no scanning
}

select/poll (O(n) scan):
  10,000 connections, 1 fd becomes ready
  → scan 10,000 fds, find 1
  → work scales linearly with connection count

epoll (O(1) notification):
  Kernel tracks fds in a red-black tree
  When an fd becomes ready, kernel adds it to a ready list
  epoll_wait returns only the ready ones
  → work scales with the number of ready fds, not the total connection count
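The same experiment with Python's select.epoll wrapper (Linux only) makes the difference in API shape visible. This is a sketch under the same assumptions as before: socket pairs stand in for connections, and ready_via_epoll is an illustrative name. Registration happens once, and the wait call takes no descriptor list at all.

```python
import select
import socket

def ready_via_epoll(n_idle=100):
    """Register n_idle quiet connections plus one active one, then wait once."""
    pairs = [socket.socketpair() for _ in range(n_idle + 1)]
    ep = select.epoll()
    for server_side, _ in pairs:
        ep.register(server_side.fileno(), select.EPOLLIN)   # one-time registration
    active_server, active_client = pairs[-1]
    active_client.send(b"ping")
    events = ep.poll(1.0)          # no fd list passed in; only ready fds come back
    ready_fds = [fd for fd, _ in events]
    result = (len(pairs), ready_fds == [active_server.fileno()])
    ep.close()
    for a, b in pairs:
        a.close()
        b.close()
    return result

print(ready_via_epoll())   # (101, True)
```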

Blocking vs. Non-Blocking vs. Asynchronous I/O

These three terms are frequently conflated. Here's the precise distinction:

                      Returns immediately?   Who notifies when ready?
Sync blocking I/O          No (blocks)        Kernel blocks until done
Non-blocking I/O           Yes                App polls via epoll
Async I/O (AIO)            Yes                Kernel calls back when done

Non-blocking + epoll is the model used by Nginx, Redis, Node.js, and most high-performance servers. Set sockets to non-blocking with O_NONBLOCK and register them with epoll. When epoll signals readiness, read() or write() returns immediately with whatever can be transferred; if nothing can, it fails with EAGAIN instead of blocking. Process what's available, then go back to waiting.
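What "returns immediately" means in practice can be seen with a single non-blocking socket. In this minimal sketch, try_recv is a hypothetical helper and the socket pair again stands in for a client connection. Python surfaces EAGAIN/EWOULDBLOCK as BlockingIOError.

```python
import socket

def try_recv(sock):
    """Return available bytes, or None if reading would have blocked."""
    try:
        return sock.recv(1024)
    except BlockingIOError:        # EAGAIN/EWOULDBLOCK: nothing to read yet
        return None

server_side, client_side = socket.socketpair()
server_side.setblocking(False)     # sets O_NONBLOCK on the descriptor

first = try_recv(server_side)      # no data yet, returns immediately
client_side.send(b"hello")
second = try_recv(server_side)     # data is waiting, returns it
print(first, second)               # None b'hello'

server_side.close()
client_side.close()
```

An event loop never calls recv() speculatively like the first call here; it waits for epoll to report the descriptor as readable and only then reads.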

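POSIX AIO (aio_read / aio_write) submits a request and returns immediately; the I/O completes in the background and the application is notified via signal or callback. On Linux, glibc implements this interface with user-space threads. The kernel's native AIO interface (io_submit / io_getevents) is truly asynchronous, but the catch is that it only works reliably with O_DIRECT file I/O, not sockets, limiting its practical use.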

io_uring: The Modern Zero-Copy I/O Interface

Linux 5.1 (2019) introduced io_uring, a complete redesign:

User space                       Kernel space
┌────────────────┐               ┌───────────────────┐
│ Submit Queue   │ ────────────► │ Process I/O       │
│ (SQ ring)      │  mmap shared  │                   │
├────────────────┤ ◄──────────── ├───────────────────┤
│ Completion Q   │  mmap shared  │ Completion events │
│ (CQ ring)      │               │                   │
└────────────────┘               └───────────────────┘

Key improvements:

  1. Shared-memory ring buffers: submission and completion happen via mmap'd memory, cutting per-operation syscall overhead (one syscall costs ~100 to 400 ns)
  2. Batch submission: a single io_uring_submit() can dispatch dozens of I/O operations
  3. True async file I/O: works with ordinary buffered I/O on regular files, not just O_DIRECT
  4. Zero-copy send: IORING_OP_SEND_ZC (Linux 6.0 and later) sends network data without copying from user space to kernel socket buffers
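The ring-buffer idea behind the first two points can be modelled in a few lines. The following is a toy model only: the Ring class, toy_kernel function, and dictionary fields are invented for illustration and do not mirror the kernel's actual memory layout or the liburing API. It shows a producer advancing a tail index, a consumer advancing a head index, and one "kernel" call draining a whole batch.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring (toy model)."""

    def __init__(self, entries):
        assert entries & (entries - 1) == 0, "size must be a power of two"
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0   # advanced by the consumer
        self.tail = 0   # advanced by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False                      # ring full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None                       # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def toy_kernel(sq, cq):
    """Stand-in for the kernel: drain every queued request, post one completion each."""
    handled = 0
    while (req := sq.pop()) is not None:
        cq.push((req["user_data"], len(req["payload"])))   # (tag, result)
        handled += 1
    return handled


sq, cq = Ring(8), Ring(8)
for i, payload in enumerate([b"a", b"bb", b"ccc"]):
    sq.push({"user_data": i, "payload": payload})          # queue work, no "syscall" yet
handled = toy_kernel(sq, cq)                               # one call covers the whole batch
completions = [cq.pop() for _ in range(handled)]
print(handled, completions)   # 3 [(0, 1), (1, 2), (2, 3)]
```

The user_data tag plays the role it does in the real interface: completions may arrive in any order, so each one carries a value the application chose at submission time.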

How Nginx Handles Tens of Thousands of Connections in One Thread

Nginx master process
    └─ fork → worker processes (usually one per CPU core)
                    │
                    └── Single-threaded event loop:
                         epoll_wait(): sleep until events arrive
                         ├── New connection  → accept → register with epoll
                         ├── Data readable   → read → parse → route
                         ├── Upstream ready  → forward response
                         └── Buffer writable → write response
                         Loop forever, never block on a single connection

Every socket operation is non-blocking. Even static file serving uses sendfile(), which transfers file data from the kernel page cache directly to the socket buffer inside the kernel. The data never touches user space, so the whole path is zero-copy.
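Python exposes the same system call as os.sendfile(), which makes the idea easy to try. This is a minimal sketch assuming Linux; send_file_zero_copy is a hypothetical helper, and a socket pair plus a temporary file stand in for a client connection and a static asset.

```python
import os
import socket
import tempfile

def send_file_zero_copy(path, sock):
    """Send a whole file over a connected socket using os.sendfile()."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves bytes from the page cache to the socket;
            # the file contents never become a Python bytes object.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break
            sent += n
    return sent

body = b"static file body\n" * 4
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(body)

server_side, client_side = socket.socketpair()
sent = send_file_zero_copy(tmp.name, server_side)

received = b""
while len(received) < sent:                # collect everything the "client" was sent
    received += client_side.recv(4096)

server_side.close()
client_side.close()
os.unlink(tmp.name)
print(sent, received == body)
```

A production server would use a non-blocking socket and resume the loop when epoll reports the socket writable again; the blocking version here keeps the example short.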

Hands-On Verification

Use strace to observe every I/O syscall in a single HTTP request:

strace -e trace=network,read,write curl -s http://example.com/ 2>&1 | head -40
# You should see roughly: socket → connect → sendto (send request) → recvfrom (receive response)

Python demo comparing select vs. epoll logic (epoll is Linux-only):

import socket, select, time

RESPONSE = b'HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nOK'

def make_listener(port):
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('127.0.0.1', port)); srv.listen(100); srv.setblocking(False)
    return srv

def echo_server_select(port, duration=3):
    srv = make_listener(port)
    reads = [srv]; buf = {}; requests = 0; start = time.time()
    while time.time() - start < duration:
        r, _, _ = select.select(reads, [], [], 0.05)  # O(n) scan every time
        for s in r:
            if s is srv:
                conn, _ = srv.accept(); conn.setblocking(False)
                reads.append(conn); buf[conn] = b''
                continue
            data = s.recv(1024)
            buf[s] += data                            # a request may arrive in pieces
            done = b'\r\n\r\n' in buf[s]
            if done:
                s.send(RESPONSE); requests += 1
            if done or not data:                      # request served, or peer closed
                reads.remove(s); del buf[s]; s.close()
    for s in reads:                                   # includes the listener
        s.close()
    return requests

def echo_server_epoll(port, duration=3):
    srv = make_listener(port)
    ep = select.epoll(); ep.register(srv.fileno(), select.EPOLLIN)
    conns = {}; buf = {}; requests = 0; start = time.time()
    while time.time() - start < duration:
        for fd, ev in ep.poll(0.05):                  # returns only ready fds
            if fd == srv.fileno():
                conn, _ = srv.accept(); conn.setblocking(False)
                ep.register(conn.fileno(), select.EPOLLIN)
                conns[conn.fileno()] = conn; buf[conn.fileno()] = b''
                continue
            data = conns[fd].recv(1024) if ev & select.EPOLLIN else b''
            buf[fd] += data
            done = b'\r\n\r\n' in buf[fd]
            if done:
                conns[fd].send(RESPONSE); requests += 1
            if done or not data:                      # served, peer closed, or hangup
                ep.unregister(fd); conns.pop(fd).close(); del buf[fd]
    for c in conns.values():
        c.close()
    ep.close(); srv.close(); return requests

if __name__ == '__main__':
    # While this runs, benchmark with: ab -n 5000 -c 50 http://127.0.0.1:8080/
    print(echo_server_epoll(8080, duration=30), 'requests served')

A simple C benchmark using io_uring for batch file reads:

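// Requires liburing: apt install liburing-dev
// io_uring_prep_read needs Linux 5.6 or later
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH 64
#define BUFSZ 4096

int main(void) {
    struct io_uring ring;
    int ret = io_uring_queue_init(BATCH, &ring, 0);
    if (ret < 0) {   // liburing returns -errno instead of setting errno
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // Open the file once and share the descriptor across all requests
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // Queue BATCH read requests; nothing reaches the kernel yet
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        char *buf = malloc(BUFSZ);
        if (!sqe || !buf) { fprintf(stderr, "out of SQEs or memory\n"); return 1; }
        io_uring_prep_read(sqe, fd, buf, BUFSZ, 0);
        io_uring_sqe_set_data(sqe, buf);   // remember the buffer so it can be freed later
    }
    io_uring_submit(&ring);  // one syscall submits all 64

    // Harvest completions
    struct io_uring_cqe *cqe;
    for (int i = 0; i < BATCH; i++) {
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        if (cqe->res < 0) fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));
        free(io_uring_cqe_get_data(cqe));
        io_uring_cqe_seen(&ring, cqe);
    }
    close(fd);
    io_uring_queue_exit(&ring);
    printf("Completed %d reads with 1 syscall for submission\n", BATCH);
    return 0;
}

gcc -O2 -o uring uring.c -luring && ./uring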

🔬 Going Deeper

io_uring performance numbers reported from production are compelling. Meta's internal testing reportedly showed approximately 30% higher IOPS and 20% lower CPU utilization when migrating storage services from epoll+pread to io_uring, primarily from cutting per-operation syscall overhead through batching. Cloudflare reportedly published results in 2022 showing their DNS resolver's P99 latency dropping from 10 ms to 2 ms after an io_uring rewrite, with significantly fewer context switches per request.

The full zero-copy chain is worth understanding precisely. A traditional read() + write() to serve a static file involves four memory copies: disk → kernel page cache → user space buffer → kernel socket buffer → NIC DMA. sendfile() eliminates the user-space round trip; on NICs with scatter-gather DMA this reduces the path to two copies: disk → kernel page cache → NIC DMA. When the data is already in the page cache, only one copy remains: kernel page cache → NIC DMA directly. io_uring with SEND_ZC (zero-copy send) brings a similar saving to data held in user-space buffers by skipping the copy into the kernel socket buffer. Understanding this chain explains why setting "sendfile on" in Nginx's config can improve static file throughput by around 40% on a cold cache workload: not because the code changed, but because the data movement path did.

Coroutines and async/await are user-space concurrency models built on top of the same event-driven I/O. Python's asyncio, Rust's Tokio, and Go's goroutines all use a user-space scheduler: when a coroutine would block on I/O, it suspends itself and registers its interest with the event loop, and the scheduler switches to another runnable coroutine. The underlying mechanism on Linux is still epoll or io_uring. Coroutine context switches cost roughly 100× less than OS thread switches, because they don't require a kernel transition, just swapping a handful of registers in user space. Go's goroutine scheduler is particularly sophisticated, multiplexing millions of goroutines onto a handful of OS threads using work-stealing and a mix of cooperative and (since Go 1.14) asynchronous preemption.
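A short asyncio sketch makes the scheduling behaviour concrete. Here asyncio.sleep() stands in for waiting on a socket, and the handler name and request count are arbitrary choices for the example. Fifty simulated requests that each wait 0.2 s finish in roughly 0.2 s total, all on one OS thread.

```python
import asyncio
import threading
import time

async def handle(name, io_delay, log):
    log.append((name, "start", threading.get_ident()))
    await asyncio.sleep(io_delay)      # stands in for socket I/O; yields to the event loop
    log.append((name, "done", threading.get_ident()))

async def main():
    log = []
    start = time.monotonic()
    await asyncio.gather(*(handle(f"req{i}", 0.2, log) for i in range(50)))
    return log, time.monotonic() - start

log, elapsed = asyncio.run(main())
threads = {tid for _, _, tid in log}
print(f"{len(log) // 2} requests, {len(threads)} thread, {elapsed:.2f} s")
```

Every handler reaches its await before any of them finishes, which is the restaurant's Model B in miniature: one waiter, many tables, nobody standing idle beside a guest.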

For deep reading, W. Richard Stevens' UNIX Network Programming, Volume 1 (3rd edition) is the definitive reference for select/poll and the five classic Unix I/O models (blocking, non-blocking, multiplexed, signal-driven, asynchronous), Chapters 6 and 14 in particular; it predates epoll, so pair it with the epoll(7) man page. Brendan Gregg's Systems Performance: Enterprise and the Cloud, Chapter 10 ("Network"), covers socket buffer tuning, zero-copy profiling with perf, and TCP stack analysis in production. The io_uring design paper by Jens Axboe, "Efficient IO with io_uring" (2019, available at kernel.dk), is short and essential: it explains the ring buffer design, why it beats epoll for throughput-bound workloads, and what SQPOLL mode (kernel thread continuously polling the submission ring) means for ultra-low-latency applications.
