Network Model: netpoll and the epoll Wrapper
Your first Go TCP server probably looked something like this:
ln, _ := net.Listen("tcp", ":8080")
for {
conn, _ := ln.Accept()
go handleConn(conn)
}
Deceptively simple — one goroutine per connection, conn.Read() blocks until data arrives, logic flows linearly top to bottom. It reads like a synchronous script.
But there is a puzzle buried here: if every goroutine blocks on Read(), how does the server handle a hundred thousand concurrent connections?
An OS thread blocked on read() does nothing. If a hundred thousand goroutines were each stuck on their own OS thread, you would need a hundred thousand OS threads — tens of gigabytes of stack memory and catastrophic context-switch overhead.
Go does not do that. It uses a carefully constructed illusion: expose a blocking API to the caller, use non-blocking I/O toward the kernel, and bridge the two with epoll/kqueue/IOCP in the runtime scheduler.
This chapter tears apart every layer of that mechanism.
Level 1: The Essence of Blocking vs Non-Blocking
Two Philosophies of Waiting
When you call conn.Read(buf), a read(fd, buf, len) system call descends into the kernel. The kernel checks whether the socket receive buffer has data.
If data is present, the kernel copies it into user space and the syscall returns immediately.
If there is no data yet, two strategies diverge:
Blocking mode:
thread calls read() → no data → kernel suspends thread → waits for data
→ kernel wakes thread → returns data
↑
thread is completely idle during the wait
Non-blocking mode:
thread calls read() → no data → kernel returns EAGAIN immediately
↑
thread is free to do other work, can ask again later
Blocking is simple but unscalable — one thread can only wait for one thing at a time. Ten thousand connections demand ten thousand threads.
Non-blocking avoids the thread-per-connection problem but introduces busy polling — repeatedly asking "any data yet?" burns CPU and gives unpredictable latency.
The right answer is event-driven I/O: hand a set of file descriptors to the kernel and say "notify me when any one of them is ready."
That is the contract offered by epoll on Linux and kqueue on macOS/BSD. IOCP on Windows fills the same role, although it reports completed operations rather than readiness.
Why Go Exposes a Blocking API
Given that event-driven is the correct approach, why does Go not expose epoll directly and let users write their own event loops?
Because code readability has enormous engineering value, and event-driven code is structurally hostile to readability. Consider Node.js:
// One logical flow — shredded into nested callbacks
server.on('connection', (socket) => {
socket.on('data', (chunk) => {
parseRequest(chunk, (req) => {
queryDB(req, (result) => {
socket.write(formatResponse(result));
});
});
});
});
The logic is sequential (accept → read → parse → query → respond), but the code is an inverted pyramid of callbacks. Every I/O wait slices the control flow, requiring a new callback to resume it.
Go's answer: let goroutines act as lightweight virtual threads. The runtime applies event-driven scheduling invisibly. Programmers write linear code; the runtime parks goroutines on I/O waits and unparks them when data arrives.
// Linear logic, zero callbacks
go func() {
conn, _ := ln.Accept()
data, _ := io.ReadAll(conn)
req := parseRequest(data)
result := queryDB(req)
conn.Write(formatResponse(result))
}()
This is the defining design decision of Go's network model: pay for scheduling complexity in the runtime so that application code can stay simple.
Level 2: Go's netpoll Mechanism
Overall Architecture
The Go runtime implements a cross-platform network poller (netpoller) that selects the appropriate OS primitive per platform:
User code
│
▼
net.Conn (net package — public API)
│
▼
netFD (internal — wraps a raw OS fd)
│
▼
pollDesc (runtime — registers with netpoller)
│
├── Linux: epoll
├── macOS: kqueue
└── Windows: IOCP
Two distinct layers collaborate: the net package (visible to users) and the runtime package (hidden internals). Understanding their handshake is the key to understanding everything.
netFD: Wrapping the File Descriptor
When you call net.Dial("tcp", "example.com:80"), Go constructs a netFD internally:
// src/net/fd_unix.go (simplified)
type netFD struct {
pfd poll.FD // underlying poll file descriptor
family int
sotype int
isConnected bool
net string
laddr Addr
raddr Addr
}
// src/internal/poll/fd_unix.go (simplified)
type FD struct {
fdmu fdMutex // read/write mutex
Sysfd int // raw OS file descriptor
pd pollDesc // bridge to netpoller
isFile bool
}
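The raw OS file descriptor (Sysfd) is put into non-blocking mode when it is created (on Linux typically through the SOCK_NONBLOCK flag passed to socket(2), with syscall.SetNonblock as the fallback path). The pd pollDesc field is the goroutine–netpoller bridge.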
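You can check the non-blocking claim from user code. This is a sketch assuming Linux (it issues a raw fcntl system call) and a loopback connection; isNonblocking and checkLoopback are illustrative names, not part of the net package.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// isNonblocking reports whether the descriptor behind conn has O_NONBLOCK set.
func isNonblocking(conn *net.TCPConn) (bool, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return false, err
	}
	var flags uintptr
	var errno syscall.Errno
	// Control runs the callback with the raw fd while keeping it valid.
	ctlErr := raw.Control(func(fd uintptr) {
		flags, _, errno = syscall.Syscall(syscall.SYS_FCNTL, fd, syscall.F_GETFL, 0)
	})
	if ctlErr != nil {
		return false, ctlErr
	}
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.O_NONBLOCK != 0, nil
}

// checkLoopback dials a throwaway loopback listener and inspects the socket.
func checkLoopback() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer conn.Close()
	return isNonblocking(conn.(*net.TCPConn))
}

func main() {
	ok, err := checkLoopback()
	fmt.Println("O_NONBLOCK set:", ok, "err:", err)
}
```

The point of the check: conn.Read looks blocking to the caller, yet the descriptor underneath never blocks a thread.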
pollDesc: The Heart of Goroutine Parking
pollDesc is a runtime-internal struct defined in runtime/netpoll.go:
// src/runtime/netpoll.go (simplified)
type pollDesc struct {
link *pollDesc // free list for reuse
lock mutex
fd uintptr
// read side
rg atomic.Uintptr // waiting goroutine pointer, or pdReady/pdWait
rt timer
rd int64 // read deadline (absolute nanoseconds)
// write side
wg atomic.Uintptr
wt timer
wd int64 // write deadline
}
When a goroutine wants to read but no data has arrived, the execution path is:
goroutine calls conn.Read(buf)
│
▼
poll.FD.Read()
│
▼
syscall.Read(fd, buf) → returns EAGAIN (non-blocking, no data)
│
▼
pollDesc.waitRead()
│
▼
runtime.gopark(netpollblockcommit, ...)
│
▼
goroutine is parked (no OS thread consumed)
goroutine pointer stored in pollDesc.rg
gopark is the runtime primitive that transitions a goroutine from "running" to "waiting." Once parked, the M (OS thread) that was running it is released to execute other goroutines.
This is the core insight: goroutines blocked on I/O do not hold OS threads. A server with fifty thousand idle connections uses a handful of OS threads, not fifty thousand.
The epoll Event Loop
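On Linux, the runtime creates its epoll instance lazily, the first time the poller is needed (for example when the first network descriptor is opened), rather than unconditionally at startup: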
// src/runtime/netpoll_epoll.go (simplified)
var epfd int32 = -1
func netpollinit() {
epfd = epollcreate1(_EPOLL_CLOEXEC)
}
func netpollopen(fd uintptr, pd *pollDesc) int32 {
var ev epollevent
// Edge-triggered mode: notify only on state changes
ev.events = _EPOLLIN | _EPOLLOUT | _EPOLLRDHUP | _EPOLLET
*(**pollDesc)(unsafe.Pointer(&ev.data)) = pd
return -epollctl(epfd, _EPOLL_CTL_ADD, int32(fd), &ev)
}
Note edge-triggered (ET) mode, not level-triggered (LT). ET fires only when a fd transitions from not-ready to ready, not repeatedly while it remains ready. This means the program must drain the fd completely (reading until EAGAIN) each time it is woken.
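Edge-triggered behavior can be reproduced outside the runtime. The following Linux-only sketch uses the syscall package and a pipe instead of a socket; epollET, ready, and edgeTriggeredDemo are names invented for this example, and the flag constant is defined locally as an assumption-free workaround for the package's own EPOLLET value not fitting a uint32 on some platforms.

```go
package main

import (
	"fmt"
	"syscall"
)

// epollET is the edge-triggered flag bit, defined locally (see lead-in).
const epollET = 1 << 31

// ready polls epfd and returns how many descriptors were reported.
func ready(epfd, timeoutMs int) (int, error) {
	events := make([]syscall.EpollEvent, 8)
	for {
		n, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal: retry
		}
		return n, err
	}
}

// edgeTriggeredDemo returns the event counts seen after a write, after no
// further activity, and after draining followed by a second write.
func edgeTriggeredDemo() (first, second, third int, err error) {
	var p [2]int
	if err = syscall.Pipe(p[:]); err != nil {
		return
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])
	if err = syscall.SetNonblock(p[0], true); err != nil {
		return
	}
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return
	}
	defer syscall.Close(epfd)
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN | epollET, Fd: int32(p[0])}
	if err = syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return
	}

	syscall.Write(p[1], []byte("hello"))
	if first, err = ready(epfd, 100); err != nil { // edge: not-ready to ready
		return
	}
	if second, err = ready(epfd, 0); err != nil { // still readable, no new edge
		return
	}

	// Drain until EAGAIN, as an edge-triggered consumer must.
	buf := make([]byte, 2)
	for {
		if _, err = syscall.Read(p[0], buf); err != nil {
			break
		}
	}
	if err != syscall.EAGAIN {
		return
	}
	syscall.Write(p[1], []byte("again"))
	third, err = ready(epfd, 100) // new data produces a new edge
	return
}

func main() {
	fmt.Println(edgeTriggeredDemo())
}
```

The second poll is the interesting one: data is still sitting in the pipe, yet nothing is reported, which is why a consumer that stops reading early can stall forever under ET.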
The poller is invoked by the Go scheduler at specific points via netpoll(delay):
// src/runtime/netpoll_epoll.go (simplified)
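func netpoll(delay int64) gList {
	// delay < 0 blocks indefinitely, 0 polls without blocking,
	// > 0 blocks for up to that many nanoseconds.
	var waitms int32
	if delay < 0 {
		waitms = -1
	} else if delay > 0 {
		waitms = int32(delay / 1e6) // ns to ms; rounding and clamping omitted
	}
	var events [128]epollevent
	n := epollwait(epfd, &events[0], int32(len(events)), waitms)
	var toRun gList
	for i := int32(0); i < n; i++ {
		ev := events[i]
		// Recover the pollDesc pointer stored by netpollopen.
		pd := *(**pollDesc)(unsafe.Pointer(&ev.data))
		var mode int32
		if ev.events&(_EPOLLIN|_EPOLLRDHUP|_EPOLLHUP|_EPOLLERR) != 0 {
			mode += 'r'
		}
		if ev.events&_EPOLLOUT != 0 {
			mode += 'w'
		}
		// Append any goroutine parked on pd for this mode to toRun.
		netpollready(&toRun, pd, mode)
	}
	return toRun
}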
netpoll returns a list of goroutines — those that were parked waiting for I/O and are now runnable. The scheduler injects them back into run queues.
The Complete Round-Trip: Park to Resume
Packet arrives at NIC
│
▼
Kernel writes data into socket receive buffer
│
▼
epoll detects EPOLLIN on the fd
│
▼
Go scheduler calls netpoll() (from findRunnable inside the scheduling loop, or from sysmon)
│
▼
netpoll() finds the ready pollDesc
│
▼
Extracts parked goroutine from pollDesc.rg
│
▼
Goroutine placed on a run queue (a P's local queue or the global queue)
│
▼
Goroutine is scheduled onto an M
│
▼
syscall.Read() succeeds this time; data copied into buf
│
▼
conn.Read() returns to user code
The entire sequence is invisible to the caller. You see conn.Read() block and then return. Underneath, a goroutine was parked, I/O became ready, and the goroutine was unparked. The abstraction is airtight.
Level 3: Code Practice
Building a Production-Grade TCP Server
package main
import (
"bufio"
"context"
"fmt"
"io"
"log"
"net"
"time"
)
type Server struct {
addr string
handler func(conn net.Conn)
}
func NewServer(addr string, handler func(conn net.Conn)) *Server {
return &Server{addr: addr, handler: handler}
}
func (s *Server) ListenAndServe(ctx context.Context) error {
lc := net.ListenConfig{
KeepAlive: 30 * time.Second,
}
ln, err := lc.Listen(ctx, "tcp", s.addr)
if err != nil {
return fmt.Errorf("listen %s: %w", s.addr, err)
}
// Close listener when context is cancelled (graceful shutdown)
go func() {
<-ctx.Done()
ln.Close()
}()
log.Printf("server listening on %s", s.addr)
for {
conn, err := ln.Accept()
if err != nil {
select {
case <-ctx.Done():
return nil // intentional shutdown
default:
if ne, ok := err.(net.Error); ok && ne.Timeout() {
time.Sleep(5 * time.Millisecond)
continue
}
return fmt.Errorf("accept: %w", err)
}
}
go s.handler(conn)
}
}
func echoHandler(conn net.Conn) {
defer conn.Close()
reader := bufio.NewReaderSize(conn, 4096)
buf := make([]byte, 4096)
for {
// Reset read deadline before each read — implements idle timeout
conn.SetReadDeadline(time.Now().Add(30 * time.Second))
n, err := reader.Read(buf)
if n > 0 {
conn.SetWriteDeadline(time.Now().Add(10 * time.Second))
if _, werr := conn.Write(buf[:n]); werr != nil {
log.Printf("[%s] write: %v", conn.RemoteAddr(), werr)
return
}
}
if err != nil {
if err != io.EOF {
if ne, ok := err.(net.Error); ok && ne.Timeout() {
log.Printf("[%s] idle timeout", conn.RemoteAddr())
} else {
log.Printf("[%s] read: %v", conn.RemoteAddr(), err)
}
}
return
}
}
}
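One thing the accept loop above does not do is wait for in-flight handlers when the context is cancelled. A common way to add that is a sync.WaitGroup around each handler. The sketch below is one possible shape under that assumption, not the only one; serve and drainDemo are hypothetical names, and drainDemo exists only to exercise the shutdown path.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
)

// serve accepts until ctx is cancelled, then waits for in-flight handlers.
func serve(ctx context.Context, ln net.Listener, handle func(net.Conn)) error {
	var wg sync.WaitGroup
	defer wg.Wait() // drain handlers before returning

	stop := make(chan struct{})
	defer close(stop) // lets the watcher goroutine exit on any return path
	go func() {
		select {
		case <-ctx.Done():
			ln.Close() // unblocks Accept
		case <-stop:
		}
	}()

	for {
		conn, err := ln.Accept()
		if err != nil {
			if ctx.Err() != nil {
				return nil // intentional shutdown
			}
			return err
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer conn.Close()
			handle(conn)
		}()
	}
}

// drainDemo cancels the context while a handler is still running and reports
// whether that handler had finished by the time serve returned.
func drainDemo() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	started := make(chan struct{})
	release := make(chan struct{})
	finished := false
	result := make(chan error, 1)
	go func() {
		result <- serve(ctx, ln, func(net.Conn) {
			close(started)
			<-release
			finished = true
		})
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer c.Close()
	<-started // the handler is now in flight
	cancel()
	close(release)
	err = <-result
	return finished, err
}

func main() {
	fmt.Println(drainDemo())
}
```

In a real service you would usually bound the drain with a timeout so that one stuck handler cannot hold shutdown hostage.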
SetDeadline vs SetReadDeadline vs SetWriteDeadline
These three methods are frequently confused. Their semantics are precise:
// SetDeadline sets both read and write deadlines simultaneously.
// The argument is an absolute time point, not a duration.
conn.SetDeadline(time.Now().Add(30 * time.Second))
// SetReadDeadline affects only Read calls.
// Refresh it before each read to implement idle timeout.
conn.SetReadDeadline(time.Now().Add(10 * time.Second))
// SetWriteDeadline affects only Write calls.
// Prevents slow clients from tying up goroutines indefinitely.
conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
Common mistake: treating a deadline as a per-operation timeout.
// WRONG: this deadline is absolute, not per-operation
conn.SetReadDeadline(time.Now().Add(10 * time.Second))
// Even if a Read succeeds at second 9, the next Read still expires
// at second 10 — possibly milliseconds later.
// CORRECT: reset before each operation
for {
conn.SetReadDeadline(time.Now().Add(10 * time.Second))
n, err := conn.Read(buf)
// ...
}
When a deadline fires, the returned error satisfies net.Error:
n, err := conn.Read(buf)
if err != nil {
if ne, ok := err.(net.Error); ok && ne.Timeout() {
log.Printf("deadline exceeded")
return
}
// connection reset, EOF, or other hard error
return
}
Throughput Benchmarks
// server_bench_test.go
package main
import (
	"bufio"
	"io" // needed for io.ReadFull below
	"net"
	"strings"
	"testing"
)
func BenchmarkEchoServer(b *testing.B) {
ln, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
b.Fatal(err)
}
defer ln.Close()
go func() {
for {
conn, err := ln.Accept()
if err != nil {
return
}
go echoHandler(conn)
}
}()
addr := ln.Addr().String()
payload := []byte(strings.Repeat("x", 1024)) // 1 KB per round-trip; built once so the loop does not convert (and allocate) on every iteration
b.ResetTimer()
b.SetBytes(int64(len(payload)))
b.RunParallel(func(pb *testing.PB) {
conn, err := net.Dial("tcp", addr)
if err != nil {
b.Error(err)
return
}
defer conn.Close()
r := bufio.NewReader(conn)
rbuf := make([]byte, len(payload))
for pb.Next() {
conn.Write([]byte(payload))
if _, err := io.ReadFull(r, rbuf); err != nil {
b.Error(err)
return
}
}
})
}
go test -bench=BenchmarkEchoServer -benchmem -benchtime=10s -count=3 ./...
Sample output:
BenchmarkEchoServer-8 198765 6012 ns/op 170.32 MB/s 0 B/op 0 allocs/op
MB/s (reported because the benchmark calls b.SetBytes) is your primary throughput signal. 0 allocs/op indicates that the measured loop performs no heap allocations, which matters for high-frequency servers.
Use benchstat to compare across commits:
go test -bench=. -count=10 > old.txt
# make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt
Level 4: Advanced Topics and Edge Cases
Comparison with Nginx's Event Loop
Nginx uses a classic event-loop architecture (single-threaded worker processes):
Nginx Worker Process
│
▼
epoll_wait() ← waits for any event
│
├── accept new connection
├── read request bytes
├── write response bytes
└── close connection
│
▼
back to epoll_wait()
Nginx strengths:
- Tiny memory footprint (~3-4 MB per worker)
- No context switches between connections inside a worker
- Excellent CPU cache locality
Nginx weaknesses:
- Any blocking operation in a handler stalls the entire worker
- Business logic must be written as callbacks or state machines
- CPU-bound tasks require out-of-process offloading
Go's goroutine model strengths:
- Linear code; any Go call is safe inside a handler
- Goroutine count scales elastically with connection count
- CPU-intensive and I/O-bound goroutines run in parallel across multiple OS threads
Go's goroutine model weaknesses:
- Each goroutine starts with a small stack (2 KB minimum in current releases) that grows on demand, so memory scales with connection count and call depth
- Goroutine scheduling overhead is non-zero
- GC stop-the-world pauses matter in sub-millisecond latency applications
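The per-goroutine stack cost can be estimated with runtime.MemStats. This is a rough sketch, assuming idle goroutines with shallow call stacks; stackBytesPerGoroutine is an illustrative name, and the figure it returns varies by Go version and platform.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackBytesPerGoroutine starts n goroutines that block on a channel and
// returns the average growth in StackInuse per goroutine.
func stackBytesPerGoroutine(n int) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	stop := make(chan struct{})
	var ready, done sync.WaitGroup
	ready.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			ready.Done()
			<-stop // stay alive until the measurement is taken
		}()
	}
	ready.Wait()
	runtime.ReadMemStats(&after)

	close(stop)
	done.Wait()
	if after.StackInuse < before.StackInuse {
		return 0
	}
	return (after.StackInuse - before.StackInuse) / uint64(n)
}

func main() {
	fmt.Println("approx stack bytes per goroutine:", stackBytesPerGoroutine(10000))
}
```

Multiply the result by your expected connection count, then add per-connection buffers, to get a first-order memory budget for a goroutine-per-connection server.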
Node.js and libuv
Node.js wraps epoll/kqueue/IOCP with libuv, exposing a phase-structured event loop:
libuv event loop phases
┌──────────────────────────────────────┐
│ timers (setTimeout, setInterval) │
├──────────────────────────────────────┤
│ pending callbacks │
├──────────────────────────────────────┤
│ idle / prepare │
├──────────────────────────────────────┤
│ poll (epoll_wait) │
├──────────────────────────────────────┤
│ check (setImmediate) │
├──────────────────────────────────────┤
│ close callbacks │
└──────────────────────────────────────┘
Async/await in Node.js eliminates callback hell syntactically, but the runtime remains single-threaded. A CPU-bound computation blocks the entire loop; you need worker_threads to escape. Go sidesteps this at the language level — goroutines run across multiple M threads, so CPU-heavy and I/O-bound goroutines coexist naturally.
fasthttp: Why It Avoids the Standard Library
fasthttp (valyala/fasthttp) is advertised as up to 10x faster than net/http in its own benchmarks. The speed advantage comes not from a superior network API but from eliminating allocations:
// net/http allocates fresh objects for every request:
// http.Request, Header map, body Reader — GC pressure at high QPS
// fasthttp pools everything
fasthttp.ListenAndServe(":8080", func(ctx *fasthttp.RequestCtx) {
// ctx comes from a pool; returned after handler returns
// zero new allocations on the hot path
ctx.WriteString("Hello, World!")
})
fasthttp also parses requests directly on raw byte slices rather than going through net/http's general-purpose request-reading layers.
The trade-off is real: fasthttp's API is incompatible with the standard library. Middleware, tooling, and HTTP/2 support are limited. For most services the standard library is fast enough; the bottleneck is rarely the HTTP parsing layer.
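The pooling idea itself needs nothing beyond the standard library. The sketch below is not fasthttp's code; it is a minimal illustration with a hypothetical requestCtx type, comparing a handler that allocates per call with one that borrows from a sync.Pool, measured with testing.AllocsPerRun.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// requestCtx is a hypothetical per-request object with a reusable buffer.
type requestCtx struct {
	buf []byte
}

var ctxPool = sync.Pool{
	New: func() interface{} { return &requestCtx{buf: make([]byte, 0, 4096)} },
}

// sink forces the unpooled value to escape to the heap.
var sink *requestCtx

// handleFresh allocates a new context and buffer on every call.
func handleFresh() {
	ctx := &requestCtx{buf: make([]byte, 0, 4096)}
	ctx.buf = append(ctx.buf, "Hello, World!"...)
	sink = ctx
}

// handlePooled borrows a context, resets its buffer, and returns it.
func handlePooled() {
	ctx := ctxPool.Get().(*requestCtx)
	ctx.buf = append(ctx.buf[:0], "Hello, World!"...)
	ctxPool.Put(ctx)
}

// allocsPerCall reports average heap allocations per call for both variants.
func allocsPerCall() (fresh, pooled float64) {
	fresh = testing.AllocsPerRun(1000, handleFresh)
	pooled = testing.AllocsPerRun(1000, handlePooled)
	return fresh, pooled
}

func main() {
	fresh, pooled := allocsPerCall()
	fmt.Println("fresh:", fresh, "pooled:", pooled)
}
```

The cost of pooling is the usual one: anything that keeps a reference to a pooled object after Put will see it mutated by the next request, which is why fasthttp forbids retaining ctx past the handler.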
gnet: Zero-Copy Networking
gnet (panjf2000/gnet) is an event-driven network framework that bypasses Go's runtime netpoll entirely:
type echoServer struct {
gnet.BuiltinEventEngine
}
func (es *echoServer) OnTraffic(c gnet.Conn) gnet.Action {
buf, _ := c.Next(-1) // borrow bytes from gnet's ring buffer — zero copy
c.Write(buf)
return gnet.None
}
func main() {
gnet.Run(&echoServer{}, "tcp://:8080",
gnet.WithMulticore(true),
gnet.WithReusePort(true),
)
}
gnet uses one acceptor goroutine plus one event-loop goroutine per CPU core, each managing its own epoll instance. This closely resembles Nginx's multi-worker model, except that gnet's loops live inside a single process rather than in separate worker processes.
Benefits: million-QPS throughput in echo benchmarks, predictable sub-millisecond latency, far lower memory per connection.
Costs: event-driven handler style — no blocking calls allowed, difficult to integrate with Go's standard context, database/sql, or any library that expects goroutines.
The sendfile Optimization
For file-serving workloads, net/http automatically uses the sendfile(2) syscall when it detects that the http.ResponseWriter is backed by a TCP socket and the body is an *os.File:
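http.HandleFunc("/download", func(w http.ResponseWriter, r *http.Request) {
	f, err := os.Open("/var/data/large-file.bin")
	if err != nil {
		http.Error(w, "file unavailable", http.StatusNotFound)
		return
	}
	defer f.Close()
	// Zero time skips Last-Modified handling; pass the file's ModTime to enable it.
	http.ServeContent(w, r, "large-file.bin", time.Time{}, f)
	// Internally: an io.Copy-style transfer from f to w.
	// *os.File source + plain TCP socket → sendfile syscall
})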
The difference in data movement:
Normal read+write:
file page cache → kernel read buffer → user-space buf → kernel write buffer → socket buffer
(two kernel↔user copies)
sendfile:
file page cache ────────────────────────────────────→ socket buffer
(zero kernel↔user copies; one in-kernel DMA transfer)
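On file-serving workloads, sendfile can raise throughput and lower CPU utilization substantially; how much depends on file size, hardware, and kernel. For plain HTTP the optimization is automatic and needs no code change. It does not apply when the connection is wrapped in TLS, or when middleware wraps the http.ResponseWriter in a type that hides its io.ReaderFrom method.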
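The hook that makes this possible is io.ReaderFrom: io.Copy checks whether the destination implements it, and *net.TCPConn does. The sketch below copies a temporary file over a loopback connection and checks that the interface is present. It cannot prove that sendfile was actually used (that needs strace or similar), and copyFileToSocket is an illustrative name.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

// copyFileToSocket writes size bytes to a temp file, streams the file over a
// loopback TCP connection with io.Copy, and reports bytes sent and received
// plus whether the connection exposes io.ReaderFrom.
func copyFileToSocket(size int) (sent, received int64, hasReaderFrom bool, err error) {
	f, err := os.CreateTemp("", "sendfile-demo-*")
	if err != nil {
		return 0, 0, false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err = f.Write(make([]byte, size)); err != nil {
		return 0, 0, false, err
	}
	if _, err = f.Seek(0, io.SeekStart); err != nil {
		return 0, 0, false, err
	}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, false, err
	}
	defer ln.Close()

	// Receiver: count every byte until the sender closes the connection.
	got := make(chan int64, 1)
	go func() {
		c, aerr := ln.Accept()
		if aerr != nil {
			got <- -1
			return
		}
		defer c.Close()
		n, _ := io.Copy(io.Discard, c)
		got <- n
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, false, err
	}
	// io.Copy prefers dst.ReadFrom when it exists; this is the fast-path hook.
	_, hasReaderFrom = conn.(io.ReaderFrom)
	sent, err = io.Copy(conn, f)
	conn.Close()
	received = <-got
	return sent, received, hasReaderFrom, err
}

func main() {
	fmt.Println(copyFileToSocket(1 << 20))
}
```

If you wrap conn in your own type without forwarding ReadFrom, the type assertion fails and io.Copy falls back to the read-then-write loop through a user-space buffer.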
Performance Landscape Summary
Scenario: echo server, 1 KB messages, 8-core machine

Framework        QPS      Memory / 10K conns   P99 latency
──────────────────────────────────────────────────────────
net/http         ~80K     ~1.5 GB              ~5 ms
fasthttp         ~400K    ~400 MB              ~1 ms
gnet             >1M      ~100 MB              <1 ms
Nginx (proxy)    ~600K    ~50 MB               <1 ms

Numbers are illustrative; real-world results depend heavily on handler complexity.
The practical guidance: net/http is the right default for almost every service. The bottleneck is almost always the database, an external API, or business logic — not the HTTP layer. Reach for fasthttp or gnet only when building infrastructure-level components (proxies, message brokers, game servers) where you can rigorously control what happens inside every handler.
Summary
Go's network model is a carefully layered abstraction:
- User layer: linear, blocking-style code; goroutine-per-connection idiom
- net package: netFD wraps the raw fd and forces non-blocking mode
- runtime layer: pollDesc manages goroutine parking and unparking; netpoll() drives the epoll loop
- OS layer: epoll/kqueue/IOCP delivers readiness events
This stack achieves two seemingly contradictory goals simultaneously: a programmer-friendly synchronous API and event-loop-grade runtime efficiency.
Internalizing this model pays dividends beyond correct deadline handling. When you hit a performance wall, you can reason clearly about whether the bottleneck is goroutine count, heap allocation rate, GC pressure, or genuine I/O saturation — and respond with evidence rather than guesswork.