Chapter 23

Network Model: netpoll and the epoll Wrapper

Your first Go TCP server probably looked something like this:

ln, _ := net.Listen("tcp", ":8080")
for {
    conn, _ := ln.Accept()
    go handleConn(conn)
}

Deceptively simple: one goroutine per connection, conn.Read() blocks until data arrives, and the logic flows linearly from top to bottom. It reads like a synchronous script.

But there is a puzzle buried here: if every goroutine blocks on Read(), how does the server handle a hundred thousand concurrent connections?

An OS thread blocked on read() does nothing. If a hundred thousand goroutines were each stuck on their own OS thread, you would need a hundred thousand OS threads, which means tens of gigabytes of stack memory and catastrophic context-switch overhead.

Go does not do that. It uses a carefully constructed illusion: expose a blocking API to the caller, use non-blocking I/O toward the kernel, and bridge the two with epoll/kqueue/IOCP in the runtime scheduler.

This chapter tears apart every layer of that mechanism.


Level 1: The Essence of Blocking vs Non-Blocking

Two Philosophies of Waiting

When you call conn.Read(buf), a read(fd, buf, len) system call descends into the kernel. The kernel checks whether the socket receive buffer has data.

If data is present, the kernel copies it into user space and the syscall returns immediately.

If there is no data yet, two strategies diverge:

Blocking mode:

thread calls read() → no data → kernel suspends thread → waits for data
                                                          → kernel wakes thread → returns data
                   ↑
         thread is completely idle during the wait

Non-blocking mode:

thread calls read() → no data → kernel returns EAGAIN immediately
         ↑
thread is free to do other work, can ask again later

Blocking is simple but unscalable: one thread can only wait for one thing at a time. Ten thousand connections demand ten thousand threads.

Non-blocking avoids the thread-per-connection problem but introduces busy polling: repeatedly asking "any data yet?" burns CPU and gives unpredictable latency.

The right answer is event-driven I/O: hand a set of file descriptors to the kernel and say "notify me when any one of them is ready."

That is the contract offered by epoll on Linux, kqueue on macOS/BSD, and IOCP on Windows.
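The contract can be tried out directly from Go. What follows is a minimal sketch, assuming Linux (it relies on the syscall package's epoll wrappers) and using a pipe in place of a socket; the function name readinessDemo is made up for illustration. It shows the three moments described above: a non-blocking read that would block, a registration with the kernel, and the readiness notification.

```go
package main

import (
	"fmt"
	"syscall"
)

// readinessDemo walks a pipe through the cycle described above (Linux only):
// a non-blocking read that would block, an epoll registration, and a wakeup.
func readinessDemo() (wouldBlock bool, ready int, data string, err error) {
	var p [2]int
	if err = syscall.Pipe(p[:]); err != nil {
		return
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])
	if err = syscall.SetNonblock(p[0], true); err != nil {
		return
	}

	// 1. Nothing has been written yet, so the kernel answers EAGAIN at once.
	buf := make([]byte, 16)
	_, rerr := syscall.Read(p[0], buf)
	wouldBlock = rerr == syscall.EAGAIN

	// 2. Ask the kernel to report when the read end becomes readable.
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return
	}
	defer syscall.Close(epfd)
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(p[0])}
	if err = syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return
	}

	// 3. Data arrives; epoll_wait returns the number of ready descriptors.
	if _, err = syscall.Write(p[1], []byte("ping")); err != nil {
		return
	}
	events := make([]syscall.EpollEvent, 8)
	if ready, err = syscall.EpollWait(epfd, events, 1000); err != nil {
		return
	}

	// 4. The same read now succeeds.
	n, err := syscall.Read(p[0], buf)
	if err != nil {
		return
	}
	return wouldBlock, ready, string(buf[:n]), nil
}

func main() {
	wouldBlock, ready, data, err := readinessDemo()
	fmt.Println(wouldBlock, ready, data, err)
}
```

The Go runtime does the equivalent of steps 2 and 3 on your behalf, for every socket, which is why application code never sees EAGAIN.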

Why Go Exposes a Blocking API

Given that event-driven is the correct approach, why does Go not expose epoll directly and let users write their own event loops?

Because code readability has enormous engineering value, and event-driven code is structurally hostile to readability. Consider Node.js:

// One logical flow, shredded into nested callbacks
server.on('connection', (socket) => {
    socket.on('data', (chunk) => {
        parseRequest(chunk, (req) => {
            queryDB(req, (result) => {
                socket.write(formatResponse(result));
            });
        });
    });
});

The logic is sequential (accept → read → parse → query → respond), but the code is an inverted pyramid of callbacks. Every I/O wait slices the control flow, requiring a new callback to resume it.

Go's answer: let goroutines act as lightweight virtual threads. The runtime applies event-driven scheduling invisibly. Programmers write linear code; the runtime parks goroutines on I/O waits and unparks them when data arrives.

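// Linear logic, zero callbacks (error handling elided for brevity)
go func() {
    conn, _ := ln.Accept()
    defer conn.Close()
    data, _ := io.ReadAll(conn) // returns once the peer closes its write side
    req := parseRequest(data)
    result := queryDB(req)
    conn.Write(formatResponse(result))
}()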

This is the defining design decision of Go's network model: pay for scheduling complexity in the runtime so that application code can stay simple.


Level 2: Go's netpoll Mechanism

Overall Architecture

The Go runtime implements a cross-platform network poller (netpoller) that selects the appropriate OS primitive per platform:

User code
   │
   ▼
net.Conn  (net package: public API)
   │
   ▼
netFD     (internal: wraps a raw OS fd)
   │
   ▼
pollDesc  (runtime: registers with netpoller)
   │
   ├── Linux:   epoll
   ├── macOS:   kqueue
   └── Windows: IOCP

Two distinct layers collaborate: the net package (visible to users) and the runtime package (hidden internals). Understanding their handshake is the key to understanding everything.

netFD: Wrapping the File Descriptor

When you call net.Dial("tcp", "example.com:80"), Go constructs a netFD internally:

// src/net/fd_unix.go (simplified)
type netFD struct {
    pfd         poll.FD   // underlying poll file descriptor
    family      int
    sotype      int
    isConnected bool
    net         string
    laddr       Addr
    raddr       Addr
}

// src/internal/poll/fd_unix.go (simplified)
type FD struct {
    fdmu   fdMutex    // read/write mutex
    Sysfd  int        // raw OS file descriptor
    pd     pollDesc   // bridge to netpoller
    isFile bool
}

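The raw OS file descriptor (Sysfd) is put into non-blocking mode when the socket is created (via syscall.SetNonblock, or the SOCK_NONBLOCK socket flag where the platform supports it). The pd field is the bridge between the goroutine and the netpoller.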
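You can confirm the non-blocking flag from user code. The sketch below is Linux-specific and hedged accordingly: it reaches the descriptor through SyscallConn and issues fcntl(F_GETFL) via the raw syscall interface. The helper name socketIsNonblocking is an assumption for this example, not part of the net package.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// socketIsNonblocking reports whether the fd behind a freshly dialed TCP
// connection carries O_NONBLOCK. Linux-specific: it calls fcntl(F_GETFL)
// through syscall.Syscall.
func socketIsNonblocking() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer conn.Close()

	raw, err := conn.(*net.TCPConn).SyscallConn()
	if err != nil {
		return false, err
	}

	var flags uintptr
	var errno syscall.Errno
	// Control runs the callback while the descriptor is guaranteed to stay open.
	if err := raw.Control(func(fd uintptr) {
		flags, _, errno = syscall.Syscall(syscall.SYS_FCNTL, fd, syscall.F_GETFL, 0)
	}); err != nil {
		return false, err
	}
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.O_NONBLOCK != 0, nil
}

func main() {
	ok, err := socketIsNonblocking()
	fmt.Println("O_NONBLOCK set:", ok, "err:", err)
}
```

The blocking behaviour you observe from conn.Read is therefore produced entirely in user space by the runtime, not by the kernel.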

pollDesc: The Heart of Goroutine Parking

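The pd field shown above is internal/poll's own thin pollDesc type, which holds only an opaque handle to the runtime's descriptor. The runtime-side pollDesc is defined in runtime/netpoll.go: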

// src/runtime/netpoll.go (simplified)
type pollDesc struct {
    link *pollDesc         // free list for reuse

    lock mutex
    fd   uintptr

    // read side
    rg  atomic.Uintptr    // waiting goroutine pointer, or pdReady/pdWait
    rt  timer
    rd  int64             // read deadline (absolute nanoseconds)

    // write side
    wg  atomic.Uintptr
    wt  timer
    wd  int64             // write deadline
}

When a goroutine wants to read but no data has arrived, the execution path is:

goroutine calls conn.Read(buf)
   │
   ▼
poll.FD.Read()
   │
   ▼
syscall.Read(fd, buf) → returns EAGAIN (non-blocking, no data)
   │
   ▼
pollDesc.waitRead()
   │
   ▼
runtime.gopark(netpollblockcommit, ...)
   │
   ▼
goroutine is parked (no OS thread consumed)
   goroutine pointer stored in pollDesc.rg

gopark is the runtime primitive that transitions a goroutine from "running" to "waiting." Once parked, the M (OS thread) that was running it is released to execute other goroutines.

This is the core insight: goroutines blocked on I/O do not hold OS threads. A server with fifty thousand idle connections uses a handful of OS threads, not fifty thousand.
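This claim is easy to check empirically. The following sketch, with a hypothetical helper named parkedReaders, opens n loopback connections whose server-side goroutines all block in Read, then compares the goroutine count with the number of OS threads the runtime has created (read from the threadcreate profile). Exact numbers vary by machine, so none are quoted here.

```go
package main

import (
	"fmt"
	"net"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// parkedReaders opens n loopback connections whose server-side goroutines all
// block in Read, then reports how many goroutines and OS threads exist.
func parkedReaders(n int) (goroutines, threads int, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()

	var started sync.WaitGroup
	started.Add(n)
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				started.Done()
				c.Read(make([]byte, 1)) // parks here: the client never writes
			}()
		}
	}()

	clients := make([]net.Conn, 0, n)
	defer func() {
		for _, c := range clients {
			c.Close()
		}
	}()
	for i := 0; i < n; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, 0, err
		}
		clients = append(clients, c)
	}

	started.Wait()
	time.Sleep(100 * time.Millisecond) // give the readers time to reach Read
	return runtime.NumGoroutine(), pprof.Lookup("threadcreate").Count(), nil
}

func main() {
	g, th, err := parkedReaders(200)
	fmt.Printf("goroutines=%d threads=%d err=%v\n", g, th, err)
}
```

On a typical machine the thread count stays close to the number of CPU cores while the goroutine count tracks the number of connections.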

The epoll Event Loop

On Linux, the runtime creates its epoll instance lazily, the first time the poller is needed:

// src/runtime/netpoll_epoll.go (simplified)
var epfd int32 = -1

func netpollinit() {
    epfd = epollcreate1(_EPOLL_CLOEXEC)
}

func netpollopen(fd uintptr, pd *pollDesc) int32 {
    var ev epollevent
    // Edge-triggered mode: notify only on state changes
    ev.events = _EPOLLIN | _EPOLLOUT | _EPOLLRDHUP | _EPOLLET
    *(**pollDesc)(unsafe.Pointer(&ev.data)) = pd
    return -epollctl(epfd, _EPOLL_CTL_ADD, int32(fd), &ev)
}

Note edge-triggered (ET) mode, not level-triggered (LT). ET fires only when a fd transitions from not-ready to ready, not repeatedly while it remains ready. This means the program must drain the fd completely (reading until EAGAIN) each time it is woken.
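The drain obligation looks like this in practice. This is a sketch on a plain pipe using the syscall package on a Unix-like system, not the runtime's own code; drain and fillAndDrain are illustrative names.

```go
package main

import (
	"fmt"
	"syscall"
)

// drain reads from a non-blocking fd until the kernel reports EAGAIN, which
// is the obligation edge-triggered notification places on the reader.
func drain(fd int) (total int, err error) {
	buf := make([]byte, 512)
	for {
		n, rerr := syscall.Read(fd, buf)
		switch {
		case rerr == syscall.EAGAIN:
			return total, nil // buffer empty: safe to wait for the next edge
		case rerr == syscall.EINTR:
			continue // interrupted by a signal: retry
		case rerr != nil:
			return total, rerr
		case n == 0:
			return total, nil // EOF: the write side was closed
		}
		total += n
	}
}

// fillAndDrain writes size bytes into a pipe and drains them back out.
// size must fit in the pipe buffer, or the write would block.
func fillAndDrain(size int) (int, error) {
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return 0, err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])
	if err := syscall.SetNonblock(p[0], true); err != nil {
		return 0, err
	}
	if _, err := syscall.Write(p[1], make([]byte, size)); err != nil {
		return 0, err
	}
	return drain(p[0])
}

func main() {
	n, err := fillAndDrain(10000)
	fmt.Println(n, err)
}
```

If the loop stopped after the first 512-byte read, the remaining bytes would sit in the buffer with no further edge to announce them. Go's Read path avoids this by always retrying the syscall before it parks.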

The poller is invoked by the Go scheduler at specific points via netpoll(delay):

// src/runtime/netpoll_epoll.go (simplified)
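func netpoll(delay int64) gList {
    // delay is in nanoseconds: <0 blocks indefinitely, 0 polls without blocking.
    var waitms int32
    switch {
    case delay < 0:
        waitms = -1
    case delay == 0:
        waitms = 0
    default:
        waitms = int32(delay / 1e6) // the real code also rounds up and clamps
    }

    var events [128]epollevent
    n := epollwait(epfd, &events[0], int32(len(events)), waitms)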

    var toRun gList
    for i := int32(0); i < n; i++ {
        ev := events[i]
        pd := *(**pollDesc)(unsafe.Pointer(&ev.data))

        var mode int32
        if ev.events&(_EPOLLIN|_EPOLLRDHUP|_EPOLLHUP|_EPOLLERR) != 0 {
            mode += 'r'
        }
        if ev.events&_EPOLLOUT != 0 {
            mode += 'w'
        }
        netpollready(&toRun, pd, mode)
    }
    return toRun
}

netpoll returns a list of goroutines: those that were parked waiting for I/O and are now runnable. The scheduler injects them back into run queues.

The Complete Round-Trip: Park to Resume

Packet arrives at NIC
   │
   ▼
Kernel writes data into socket receive buffer
   │
   ▼
epoll detects EPOLLIN on the fd
   │
   ▼
Go scheduler calls netpoll() from schedule() or sysmon
   │
   ▼
netpoll() finds the ready pollDesc
   │
   ▼
Extracts parked goroutine from pollDesc.rg
   │
   ▼
Goroutine placed on a run queue (a P's local queue or the global queue)
   │
   ▼
Goroutine is scheduled onto an M
   │
   ▼
syscall.Read() succeeds this time; data copied into buf
   │
   ▼
conn.Read() returns to user code

The entire sequence is invisible to the caller. You see conn.Read() block and then return. Underneath, a goroutine was parked, I/O became ready, and the goroutine was unparked. The abstraction is airtight.
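The park/ready handshake can be modelled in ordinary Go. The toy below is a teaching model only, not the runtime's implementation: the real pollDesc uses a single atomic word and gopark/goready, whereas this version substitutes a mutex and a channel. All names (toyPollDesc, waitRead, markReady, handshakeDemo) are invented for the example. It illustrates one subtle point: readiness that arrives before the goroutine parks must be remembered, which is the role of the pdReady state.

```go
package main

import (
	"fmt"
	"sync"
)

// toyPollDesc models the read side of a pollDesc for a single waiter.
type toyPollDesc struct {
	mu     sync.Mutex
	ready  bool          // models rg == pdReady
	waiter chan struct{} // models rg holding the parked goroutine
}

// waitRead blocks until the descriptor is marked ready and consumes the
// notification, in the spirit of netpollblock.
func (pd *toyPollDesc) waitRead() {
	pd.mu.Lock()
	if pd.ready { // readiness arrived before we parked: do not sleep
		pd.ready = false
		pd.mu.Unlock()
		return
	}
	ch := make(chan struct{})
	pd.waiter = ch
	pd.mu.Unlock()
	<-ch // "parked"
}

// markReady is what the poller would call when epoll reports the fd readable.
func (pd *toyPollDesc) markReady() {
	pd.mu.Lock()
	defer pd.mu.Unlock()
	if pd.waiter != nil {
		close(pd.waiter) // "unpark" the waiting goroutine
		pd.waiter = nil
		return
	}
	pd.ready = true // nobody waiting yet: remember the event
}

// handshakeDemo exercises both orderings and returns true if neither hangs.
func handshakeDemo() bool {
	pd := &toyPollDesc{}

	// Case 1: the event arrives first, the reader shows up later.
	pd.markReady()
	pd.waitRead()

	// Case 2: reader and event race; either order must complete.
	done := make(chan struct{})
	go func() {
		pd.waitRead()
		close(done)
	}()
	pd.markReady()
	<-done
	return true
}

func main() {
	fmt.Println("handshake completed:", handshakeDemo())
}
```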


Level 3: Code Practice

Building a Production-Grade TCP Server

package main

import (
    "bufio"
    "context"
    "fmt"
    "io"
    "log"
    "net"
    "time"
)

type Server struct {
    addr    string
    handler func(conn net.Conn)
}

func NewServer(addr string, handler func(conn net.Conn)) *Server {
    return &Server{addr: addr, handler: handler}
}

func (s *Server) ListenAndServe(ctx context.Context) error {
    lc := net.ListenConfig{
        KeepAlive: 30 * time.Second,
    }
    ln, err := lc.Listen(ctx, "tcp", s.addr)
    if err != nil {
        return fmt.Errorf("listen %s: %w", s.addr, err)
    }

    // Close listener when context is cancelled (graceful shutdown)
    go func() {
        <-ctx.Done()
        ln.Close()
    }()

    log.Printf("server listening on %s", s.addr)
    for {
        conn, err := ln.Accept()
        if err != nil {
            select {
            case <-ctx.Done():
                return nil  // intentional shutdown
            default:
                if ne, ok := err.(net.Error); ok && ne.Timeout() {
                    time.Sleep(5 * time.Millisecond)
                    continue
                }
                return fmt.Errorf("accept: %w", err)
            }
        }
        go s.handler(conn)
    }
}

func echoHandler(conn net.Conn) {
    defer conn.Close()

    reader := bufio.NewReaderSize(conn, 4096)
    buf := make([]byte, 4096)

    for {
        // Reset read deadline before each read to implement an idle timeout
        conn.SetReadDeadline(time.Now().Add(30 * time.Second))

        n, err := reader.Read(buf)
        if n > 0 {
            conn.SetWriteDeadline(time.Now().Add(10 * time.Second))
            if _, werr := conn.Write(buf[:n]); werr != nil {
                log.Printf("[%s] write: %v", conn.RemoteAddr(), werr)
                return
            }
        }
        if err != nil {
            if err != io.EOF {
                if ne, ok := err.(net.Error); ok && ne.Timeout() {
                    log.Printf("[%s] idle timeout", conn.RemoteAddr())
                } else {
                    log.Printf("[%s] read: %v", conn.RemoteAddr(), err)
                }
            }
            return
        }
    }
}
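One thing a shutdown path often needs in addition is to wait for in-flight handlers before returning. The sketch below is one possible way to do that with a sync.WaitGroup; serveUntilDone and shutdownDemo are hypothetical names, and the loop is deliberately stripped down (no retry on temporary accept errors) so the waiting logic stays visible.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"sync"
)

// serveUntilDone accepts connections until ctx is cancelled, then waits for
// every in-flight handler to return before returning itself.
func serveUntilDone(ctx context.Context, ln net.Listener, handle func(net.Conn)) error {
	var wg sync.WaitGroup
	defer wg.Wait() // runs after the return value has been decided

	go func() {
		<-ctx.Done()
		ln.Close() // unblocks the pending Accept
	}()

	for {
		conn, err := ln.Accept()
		if err != nil {
			if ctx.Err() != nil || errors.Is(err, net.ErrClosed) {
				return nil // intentional shutdown
			}
			return fmt.Errorf("accept: %w", err)
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			handle(conn)
		}()
	}
}

// shutdownDemo serves one request, cancels the context, and returns what the
// handler received together with the server's exit error.
func shutdownDemo() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	got := make(chan string, 1)
	served := make(chan error, 1)
	go func() {
		served <- serveUntilDone(ctx, ln, func(c net.Conn) {
			defer c.Close()
			b, _ := io.ReadAll(c)
			got <- string(b)
		})
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	c.Write([]byte("hi"))
	c.Close()

	msg := <-got // the handler has seen the whole request
	cancel()     // begin shutdown
	return msg, <-served
}

func main() {
	msg, err := shutdownDemo()
	fmt.Printf("handled %q, server exit: %v\n", msg, err)
}
```

In a real service you would typically derive ctx from signal.NotifyContext and bound the wait with a timeout so a stuck handler cannot hold up process exit forever.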

SetDeadline vs SetReadDeadline vs SetWriteDeadline

These three methods are frequently confused. Their semantics are precise:

// SetDeadline sets both read and write deadlines simultaneously.
// The argument is an absolute time point, not a duration.
conn.SetDeadline(time.Now().Add(30 * time.Second))

// SetReadDeadline affects only Read calls.
// Refresh it before each read to implement idle timeout.
conn.SetReadDeadline(time.Now().Add(10 * time.Second))

// SetWriteDeadline affects only Write calls.
// Prevents slow clients from tying up goroutines indefinitely.
conn.SetWriteDeadline(time.Now().Add(5 * time.Second))

Common mistake: treating a deadline as a per-operation timeout.

// WRONG: this deadline is absolute, not per-operation
conn.SetReadDeadline(time.Now().Add(10 * time.Second))
// Even if a Read succeeds at second 9, the next Read still expires
// at second 10, possibly milliseconds later.

// CORRECT: reset before each operation
for {
    conn.SetReadDeadline(time.Now().Add(10 * time.Second))
    n, err := conn.Read(buf)
    // ...
}

When a deadline fires, the returned error satisfies net.Error:

n, err := conn.Read(buf)
if err != nil {
    if ne, ok := err.(net.Error); ok && ne.Timeout() {
        log.Printf("deadline exceeded")
        return
    }
    // connection reset, EOF, or other hard error
    return
}
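Since Go 1.15 the same condition can also be detected with errors.Is against os.ErrDeadlineExceeded, which keeps working when the error has been wrapped. The following sketch provokes a real timeout against a silent loopback peer and classifies it both ways; readWithDeadline is an illustrative helper, not a library function.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// readWithDeadline dials a loopback listener that never writes, waits for
// the read deadline to fire, and reports how the error can be classified.
func readWithDeadline(d time.Duration) (viaNetError, viaErrorsIs bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, false, err
	}
	defer conn.Close()

	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return false, false, err
	}
	_, rerr := conn.Read(make([]byte, 1)) // the peer never writes, so this times out

	// Style 1: the net.Error interface, unwrapped with errors.As.
	var ne net.Error
	viaNetError = errors.As(rerr, &ne) && ne.Timeout()

	// Style 2: the sentinel error introduced in Go 1.15.
	viaErrorsIs = errors.Is(rerr, os.ErrDeadlineExceeded)
	return viaNetError, viaErrorsIs, nil
}

func main() {
	a, b, err := readWithDeadline(50 * time.Millisecond)
	fmt.Println("net.Error timeout:", a, "ErrDeadlineExceeded:", b, "err:", err)
}
```

Preferring errors.As over a direct type assertion also protects you if a middleware layer wraps the error with fmt.Errorf and %w.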

Throughput Benchmarks

// server_bench_test.go
package main

import (
    "bufio"
    "io"
    "net"
    "strings"
    "testing"
)

func BenchmarkEchoServer(b *testing.B) {
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        b.Fatal(err)
    }
    defer ln.Close()

    go func() {
        for {
            conn, err := ln.Accept()
            if err != nil {
                return
            }
            go echoHandler(conn)
        }
    }()

    addr := ln.Addr().String()
    payload := strings.Repeat("x", 1024)  // 1 KB per round-trip

    b.ResetTimer()
    b.SetBytes(int64(len(payload)))

    b.RunParallel(func(pb *testing.PB) {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
            b.Error(err)
            return
        }
        defer conn.Close()

        r := bufio.NewReader(conn)
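        rbuf := make([]byte, len(payload))
        wbuf := []byte(payload) // convert once so the loop itself does not allocate
        for pb.Next() {
            if _, err := conn.Write(wbuf); err != nil {
                b.Error(err)
                return
            }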
            if _, err := io.ReadFull(r, rbuf); err != nil {
                b.Error(err)
                return
            }
        }
    })
}

Run it with:

go test -bench=BenchmarkEchoServer -benchmem -benchtime=10s -count=3 ./...

Sample output:

BenchmarkEchoServer-8   198765   6012 ns/op   170.32 MB/s   0 B/op   0 allocs/op

MB/s (reported because the benchmark calls b.SetBytes) is your primary throughput signal. 0 allocs/op confirms the hot path has no heap allocations, which is critical for high-frequency servers.

Use benchstat to compare across commits:

go test -bench=. -count=10 > old.txt
# make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt

Level 4: Advanced Topics and Edge Cases

Comparison with Nginx's Event Loop

Nginx uses a classic event-loop architecture (single-threaded worker processes):

Nginx Worker Process
   │
   ▼
epoll_wait()    ← waits for any event
   │
   ├── accept new connection
   ├── read request bytes
   ├── write response bytes
   └── close connection
   │
   ▼
back to epoll_wait()

Nginx strengths:

Nginx weaknesses:

Go's goroutine model strengths:

Go's goroutine model weaknesses:

Node.js and libuv

Node.js wraps epoll/kqueue/IOCP with libuv, exposing a phase-structured event loop:

libuv event loop phases
┌──────────────────────────────────────────┐
│  timers        (setTimeout, setInterval) │
├──────────────────────────────────────────┤
│  pending callbacks                       │
├──────────────────────────────────────────┤
│  idle / prepare                          │
├──────────────────────────────────────────┤
│  poll          (epoll_wait)              │
├──────────────────────────────────────────┤
│  check         (setImmediate)            │
├──────────────────────────────────────────┤
│  close callbacks                         │
└──────────────────────────────────────────┘

Async/await in Node.js eliminates callback hell syntactically, but the runtime remains single-threaded. A CPU-bound computation blocks the entire loop; you need worker_threads to escape. Go sidesteps this at the language level: goroutines run across multiple M threads, so CPU-heavy and I/O-bound goroutines coexist naturally.

fasthttp: Why It Avoids the Standard Library

In its own benchmarks, fasthttp (valyala/fasthttp) runs 5 to 10 times faster than net/http. The speed advantage comes not from a superior network API but from eliminating allocations:

// net/http allocates fresh objects for every request:
// http.Request, Header map, body Reader โ€” GC pressure at high QPS

// fasthttp pools everything
fasthttp.ListenAndServe(":8080", func(ctx *fasthttp.RequestCtx) {
    // ctx comes from a pool; returned after handler returns
    // zero new allocations on the hot path
    ctx.WriteString("Hello, World!")
})

fasthttp also replaces bufio.Reader's byte-by-byte scan with a more aggressive HTTP parser operating directly on raw byte slices.

The trade-off is real: fasthttp's API is incompatible with the standard library. Middleware, tooling, and HTTP/2 support are limited. For most services the standard library is fast enough; the bottleneck is rarely the HTTP parsing layer.
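The pooling idea itself is available in the standard library through sync.Pool. The sketch below shows the pattern in isolation; it mirrors the technique, not fasthttp's actual code, and bufPool, writeGreeting, and greet are names invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// bufPool hands out reusable buffers so the hot path does not need a fresh
// allocation for every request.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// writeGreeting formats a response into a pooled buffer and flushes it to w.
func writeGreeting(w io.Writer, name string) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold the previous response
	defer bufPool.Put(buf)

	buf.WriteString("Hello, ")
	buf.WriteString(name)
	buf.WriteByte('!')
	_, err := w.Write(buf.Bytes())
	return err
}

// greet is a small wrapper that captures the output as a string.
func greet(name string) (string, error) {
	var out bytes.Buffer
	err := writeGreeting(&out, name)
	return out.String(), err
}

func main() {
	for _, name := range []string{"gopher", "world"} {
		s, err := greet(name)
		fmt.Println(s, err)
	}
}
```

The Reset call is the part that is easy to forget: objects come back from the pool in whatever state the previous user left them. The same hazard is why fasthttp forbids holding references to a RequestCtx after the handler returns.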

gnet: Zero-Copy Networking

gnet (panjf2000/gnet) is an event-driven network framework that bypasses Go's runtime netpoll entirely:

type echoServer struct {
    gnet.BuiltinEventEngine
}

func (es *echoServer) OnTraffic(c gnet.Conn) gnet.Action {
    buf, _ := c.Next(-1)  // borrow bytes from gnet's ring buffer โ€” zero copy
    c.Write(buf)
    return gnet.None
}

func main() {
    gnet.Run(&echoServer{}, "tcp://:8080",
        gnet.WithMulticore(true),
        gnet.WithReusePort(true),
    )
}

gnet uses one acceptor goroutine plus one event-loop goroutine per CPU core, each managing its own epoll instance. This is architecturally identical to Nginx's multi-worker model.

Benefits: million-QPS throughput in echo benchmarks, predictable sub-millisecond latency, far lower memory per connection.

Costs: event-driven handler style. No blocking calls are allowed, and it is difficult to integrate with Go's standard context, database/sql, or any library that expects goroutines.

The sendfile Optimization

For file-serving workloads, net/http automatically uses the sendfile(2) syscall when it detects that the http.ResponseWriter is backed by a TCP socket and the body is an *os.File:

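http.HandleFunc("/download", func(w http.ResponseWriter, r *http.Request) {
    f, err := os.Open("/var/data/large-file.bin")
    if err != nil {
        http.Error(w, "not found", http.StatusNotFound)
        return
    }
    defer f.Close()
    // A zero modtime tells ServeContent to skip Last-Modified handling;
    // pass the file's real ModTime() to enable conditional requests.
    http.ServeContent(w, r, "large-file.bin", time.Time{}, f)
    // Internally: io.Copy(w, f)
    // net/http detects *os.File + TCP socket → sendfile syscall
})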

The difference in data movement:

Normal read+write:
  file page cache → kernel read buffer → user-space buf → kernel write buffer → socket buffer
  (two kernel↔user copies)

sendfile:
  file page cache ────────────────────────────────────→ socket buffer
  (zero kernel↔user copies; one in-kernel DMA transfer)

On file-serving workloads, sendfile can raise throughput and cut CPU utilization substantially, because the payload never crosses into user space. The optimization is entirely automatic: no code change is required.
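The hook that makes this automatic is visible outside net/http as well: *net.TCPConn implements io.ReaderFrom, and io.Copy prefers that method when the destination offers it. The sketch below copies a temporary file across a loopback connection this way. It demonstrates the code path only; whether the kernel actually services it with sendfile depends on the platform, and the example does not attempt to prove it. sendFileOverTCP is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

// Compile-time check: a TCP connection can be the target of ReadFrom,
// which is the method io.Copy delegates to.
var _ io.ReaderFrom = (*net.TCPConn)(nil)

// sendFileOverTCP copies a temporary file of the given size across a loopback
// TCP connection with io.Copy and returns how many bytes the peer received.
func sendFileOverTCP(size int) (int64, error) {
	f, err := os.CreateTemp("", "sendfile-demo-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(make([]byte, size)); err != nil {
		return 0, err
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return 0, err
	}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	type result struct {
		n   int64
		err error
	}
	done := make(chan result, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			done <- result{0, err}
			return
		}
		defer c.Close()
		n, err := io.Copy(io.Discard, c) // count everything the sender delivers
		done <- result{n, err}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, err
	}
	// Source is an *os.File, destination a *net.TCPConn: no user-space
	// buffer is named anywhere in this call.
	if _, err := io.Copy(conn, f); err != nil {
		conn.Close()
		return 0, err
	}
	conn.Close() // signal EOF to the receiver
	r := <-done
	return r.n, r.err
}

func main() {
	n, err := sendFileOverTCP(1 << 20)
	fmt.Println(n, err)
}
```

Wrapping either side in another type (a bufio.Writer around the connection, for example) hides the ReaderFrom method and silently falls back to the ordinary read+write loop.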

Performance Landscape Summary

Scenario: echo server, 1 KB messages, 8-core machine

Framework       QPS       Memory/10K conns   P99 latency
──────────────────────────────────────────────────────────
net/http        ~80K      ~1.5 GB            ~5 ms
fasthttp        ~400K     ~400 MB            ~1 ms
gnet            >1M       ~100 MB            <1 ms
Nginx (proxy)   ~600K     ~50 MB             <1 ms

Numbers are illustrative; real-world results depend heavily on handler complexity.

The practical guidance: net/http is the right default for almost every service. The bottleneck is almost always the database, an external API, or business logic, not the HTTP layer. Reach for fasthttp or gnet only when building infrastructure-level components (proxies, message brokers, game servers) where you can rigorously control what happens inside every handler.


Summary

Go's network model is a carefully layered abstraction:

  1. User layer: linear, blocking-style code; goroutine-per-connection idiom
  2. net package: netFD wraps the raw fd and forces non-blocking mode
  3. runtime layer: pollDesc manages goroutine parking and unparking; netpoll() drives the epoll loop
  4. OS layer: epoll/kqueue/IOCP delivers readiness events

This stack achieves two seemingly contradictory goals simultaneously: a programmer-friendly synchronous API and event-loop-grade runtime efficiency.

Internalizing this model pays dividends beyond correct deadline handling. When you hit a performance wall, you can reason clearly about whether the bottleneck is goroutine count, heap allocation rate, GC pressure, or genuine I/O saturation, and respond with evidence rather than guesswork.
