Chapter 41

Build a Simple Database

Build a Simple Database

There is a saying that circulates widely among systems programmers: the only way to truly understand a system is to build it yourself. Databases are the most compelling example of this principle. You can read every page of "Database System Concepts," memorize every insertion rule of the B+tree, but until you write an engine that can execute SELECT * FROM users WHERE age > 18, you won't truly understand why databases are designed the way they are.

In this chapter we build a complete, small database engine in Go. Not a toy in-memory hash table, but a real system with disk storage, crash recovery, and transaction isolation. We'll progress through: key-value store → B+tree index → WAL log → MVCC transactions → SQL parser. Each step reveals a core mechanism of modern database engines.

Level 1 · What a Database Engine Actually Is

Storage Engine vs. Query Engine

When you run SELECT name FROM employees WHERE salary > 10000 ORDER BY name, two fundamentally different things happen inside the database.

The query engine understands your intent: it parses the SQL text, generates an execution plan, decides whether to filter or sort first, decides which index to use. The query engine is a translator — it translates human language into a sequence of machine operations.

The storage engine handles the physical reading and writing of data: how data is organized on disk, how to read it out, how to write it back, how to recover after a crash. The storage engine is a physical system — it deals directly with hardware and the operating system.

MySQL's architecture clearly embodies this separation: the optimizer and executor (query engine) sit above InnoDB or MyISAM (storage engine). You can even swap the storage engine for a table without touching the layer above. PostgreSQL couples storage and query engines more tightly, but the logical layers remain clear.

This separation is not an accidental engineering decision — it reflects the essential nature of the database problem: how to store data and how to query data are two fundamentally different problems, with different sources of complexity, different optimization directions, and different rates of change.

Why This Is the Best Systems Programming Exercise

Building a database engine forces you to confront the most central tensions in computer systems:

Speed vs. Durability. Memory is fast, disk is slow. But memory is volatile — power off and it's gone. All crash-recovery mechanisms are fundamentally solving this tension.

Concurrency vs. Consistency. Multiple users read and write the database simultaneously. Making them not interfere with each other while keeping data consistent is the core problem of concurrency control. MVCC is one answer modern databases have arrived at.

Space vs. Time. How large should each B+tree node be? Too small means too many nodes, the tree is too deep, queries are slow. Too large means each I/O transfers too much data, wasteful. How large should the buffer pool be? These are all space-time tradeoffs.

Flexibility vs. Performance. Supporting SQL means you need to parse and optimize queries, which has CPU overhead. A direct key-value API is faster but less expressive.

Each of these tensions has corresponding engineering solutions in database systems, and those solutions appear again and again in other systems (operating systems, distributed systems, file systems). Learning databases is really learning a framework for thinking about systems problems.

Why Go Is a Good Fit

Go's advantages for this project are multifold:

Level 2 · Core Mechanism Principles

B+Tree: The Foundation of Database Indexes

The B+tree is the default index structure for the vast majority of relational databases. Understanding why B+tree was chosen over other structures requires first understanding the I/O reality databases face.

A single disk random I/O takes 5–10ms, while a memory access takes 100ns — a difference of 50,000x. For a table with 1 million records, a binary search tree (BST) has a height of about 20, meaning each lookup requires 20 I/Os, or 100–200ms. This is completely unacceptable.

The B+tree solution is: make each node as large as possible to minimize tree height. A typical B+tree node size equals the minimum disk I/O unit — a page — usually 4KB or 16KB (InnoDB defaults to 16KB). A 16KB node can hold roughly 1,000 64-bit keys, so a 3-level B+tree can index 1 billion records with only 3 I/Os.

B+Tree Node Structure

B+trees have internal nodes and leaf nodes:

Internal node:
+--------+----+--------+----+--------+
| ptr[0] | k1 | ptr[1] | k2 | ptr[2] |
+--------+----+--------+----+--------+

Leaf node:
+-----+-----+-----+------+
| k1  | k2  | k3  | next | → pointer to next leaf (enables range scans)
+-----+-----+-----+------+
| v1  | v2  | v3  |

Internal nodes store only keys and child pointers, not values. All actual data (or primary keys, in InnoDB's secondary indexes) is in leaf nodes. Leaf nodes are linked in a doubly-linked list, making range scans extremely efficient — find the start, then walk the linked list without tree backtracking.

Insert and Split

When a leaf node is full, B+tree splits it into two nodes and promotes the middle key to the parent. If the parent is also full, continue splitting upward; worst case splits all the way to the root, and the tree grows one level taller.

Before split (node full, capacity=3):
[10, 20, 30]

After inserting 25, split:
Parent gets key 25
[10, 20]   [25, 30]

Delete and Merge

After deletion, if a node's key count falls below the minimum threshold (usually capacity/2), two options exist:

  1. Borrow: Take a key from an adjacent sibling (routed through the parent).
  2. Merge: Merge with a sibling, the separator key in the parent descends.

Merging triggers a key decrease in the parent, which may cause recursive merging; worst case the tree shrinks one level.

WAL: Sequential Write for Crash Recovery

One of the database's most basic promises is durability: once a transaction commits, the data won't be lost due to a crash.

The most direct implementation: on commit, flush all dirty pages (modified pages not yet written to disk) to disk. But this has two problems: random I/O is extremely slow (disk random write performance is about 1/100 of sequential write), and if a modification spans multiple pages, a crash in the middle leaves data inconsistent.

WAL (Write-Ahead Log)'s solution is elegant:

Core rule: Before modifying any data page on disk, the corresponding log record must be written to the WAL file first.

WAL writes are pure sequential append — sequential write performance is very high. Data pages can be written lazily; as long as WAL is on disk, crash recovery can replay all operations from WAL to restore consistency.

WAL log record format:
+----------+----------+---------+---------+----------+
| LSN (8B) | TxID (8B)| PageID  | Offset  | Data     |
+----------+----------+---------+---------+----------+

LSN (Log Sequence Number) is a monotonically increasing sequence number that determines the order of log records and marks how "current" a data page is. Each data page's header stores the LSN of the last WAL record that modified it (called pageLSN).

Recovery Process (simplified ARIES algorithm):

  1. Analysis phase: Scan WAL from the most recent checkpoint to determine which transactions committed and which did not.
  2. Redo phase: Replay all WAL records to restore the database to the pre-crash state (including modifications from uncommitted transactions).
  3. Undo phase: Roll back all uncommitted transactions (via compensating log records in WAL).

MVCC: Implementing Snapshot Isolation

Reads and writes don't block each other — this is MVCC (Multi-Version Concurrency Control)'s most important property. It lets read operations see a consistent historical snapshot without blocking write operations.

Version Chains

Each row of data has a version chain. Each version in the chain records:

Version chain for row id=1:
v3: name="Charlie", begin_ts=200, end_ts=NULL    ← newest version
v2: name="Bob",     begin_ts=100, end_ts=200
v1: name="Alice",   begin_ts=1,   end_ts=100

Snapshot Read

When a transaction begins, it records the current transaction ID as the snapshot timestamp (snap_ts). Read operations can only see versions satisfying:

So transaction A sees a consistent snapshot throughout its lifetime — even if other transactions modify data while A is running, A can't see those modifications.

Write Operations

Writes don't modify existing versions; they create new versions (INSERT) or mark old versions expired (DELETE/UPDATE = DELETE + INSERT). This is what "multi-version" means.

Buffer Pool: Page Cache Management

Databases don't read and write disk directly; they go through the buffer pool. The buffer pool is a region of memory that caches recently accessed disk pages.

Buffer pool structure:
+---------+---------+---------+---------+
| Frame 0 | Frame 1 | Frame 2 | Frame 3 |  ← fixed-size memory blocks
+---------+---------+---------+---------+
     ↓
  Page Table: PageID → Frame number
  Replacement Policy (LRU/Clock): decides which frame to evict

When a page is needed:

  1. Check the page table. On hit (page is in buffer pool), return directly.
  2. On miss, select a victim frame. If the victim frame is dirty, flush it to disk first.
  3. Read the required page from disk into the victim frame, update the page table.

The replacement policy determines which frame is evicted. The simplest is LRU (Least Recently Used). Real databases (like InnoDB) use a modified LRU — split into new/old sublists to prevent full table scans from evicting hot data.

Level 3 · Building It Step by Step

Step 1: Page Storage and Buffer Pool

Our database manages disk in 4KB pages. Each page has a unique PageID.

package storage

import (
    "encoding/binary"
    "fmt"
    "os"
    "sync"
)

const PageSize = 4096

type PageID uint32
type Page [PageSize]byte

// DiskManager manages the disk file, reads and writes by page
type DiskManager struct {
    file       *os.File
    numPages   PageID
    mu         sync.Mutex
}

func NewDiskManager(path string) (*DiskManager, error) {
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        return nil, fmt.Errorf("open file: %w", err)
    }
    info, _ := f.Stat()
    numPages := PageID(info.Size() / PageSize)
    return &DiskManager{file: f, numPages: numPages}, nil
}

func (dm *DiskManager) ReadPage(id PageID, p *Page) error {
    _, err := dm.file.ReadAt(p[:], int64(id)*PageSize)
    return err
}

func (dm *DiskManager) WritePage(id PageID, p *Page) error {
    _, err := dm.file.WriteAt(p[:], int64(id)*PageSize)
    return err
}

func (dm *DiskManager) AllocatePage() PageID {
    dm.mu.Lock()
    defer dm.mu.Unlock()
    id := dm.numPages
    dm.numPages++
    var empty Page
    dm.file.WriteAt(empty[:], int64(id)*PageSize)
    return id
}

// BufferPool: caches disk pages, supports LRU eviction
type frame struct {
    page     Page
    pageID   PageID
    dirty    bool
    pinCount int
}

type BufferPool struct {
    dm        *DiskManager
    frames    []frame
    pageTable map[PageID]int
    freeList  []int
    lruList   []int
    mu        sync.Mutex
}

func NewBufferPool(dm *DiskManager, poolSize int) *BufferPool {
    bp := &BufferPool{
        dm:        dm,
        frames:    make([]frame, poolSize),
        pageTable: make(map[PageID]int),
        freeList:  make([]int, poolSize),
    }
    for i := range bp.freeList {
        bp.freeList[i] = i
    }
    return bp
}

// FetchPage gets a page (pins it to prevent eviction)
func (bp *BufferPool) FetchPage(id PageID) (*Page, error) {
    bp.mu.Lock()
    defer bp.mu.Unlock()

    if fi, ok := bp.pageTable[id]; ok {
        bp.frames[fi].pinCount++
        bp.moveFront(&bp.lruList, fi)
        return &bp.frames[fi].page, nil
    }

    fi, err := bp.evict()
    if err != nil {
        return nil, err
    }

    f := &bp.frames[fi]
    if err := bp.dm.ReadPage(id, &f.page); err != nil {
        return nil, fmt.Errorf("read page %d: %w", id, err)
    }
    f.pageID = id
    f.dirty = false
    f.pinCount = 1
    bp.pageTable[id] = fi
    bp.lruList = append([]int{fi}, bp.lruList...)
    return &f.page, nil
}

func (bp *BufferPool) UnpinPage(id PageID, dirty bool) {
    bp.mu.Lock()
    defer bp.mu.Unlock()
    if fi, ok := bp.pageTable[id]; ok {
        bp.frames[fi].pinCount--
        if dirty {
            bp.frames[fi].dirty = true
        }
    }
}

func (bp *BufferPool) evict() (int, error) {
    if len(bp.freeList) > 0 {
        fi := bp.freeList[0]
        bp.freeList = bp.freeList[1:]
        return fi, nil
    }
    for i := len(bp.lruList) - 1; i >= 0; i-- {
        fi := bp.lruList[i]
        f := &bp.frames[fi]
        if f.pinCount == 0 {
            if f.dirty {
                if err := bp.dm.WritePage(f.pageID, &f.page); err != nil {
                    return 0, err
                }
            }
            delete(bp.pageTable, f.pageID)
            bp.lruList = append(bp.lruList[:i], bp.lruList[i+1:]...)
            return fi, nil
        }
    }
    return 0, fmt.Errorf("buffer pool full: all pages are pinned")
}

func (bp *BufferPool) moveFront(list *[]int, val int) {
    for i, v := range *list {
        if v == val {
            *list = append([]int{val}, append((*list)[:i], (*list)[i+1:]...)...)
            return
        }
    }
}

func PutUint64(p *Page, offset int, v uint64) {
    binary.LittleEndian.PutUint64(p[offset:], v)
}

func GetUint64(p *Page, offset int) uint64 {
    return binary.LittleEndian.Uint64(p[offset:])
}

Step 2: B+Tree Key-Value Store

package btree

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "github.com/yourname/minidb/storage"
)

// Leaf node format:
// [isLeaf:1][numKeys:2][key0:8][val0:8]...[keyN:8][valN:8][nextLeaf:4]
// Internal node format:
// [isLeaf:1][numKeys:2][child0:4][key0:8][child1:4][key1:8]...[childN:4]

const (
    PageSize = storage.PageSize
    MaxKeys  = 200
    LeafFlag = 0x01
)

type Key uint64
type Value uint64
type PageID = storage.PageID

type BTree struct {
    bp     *storage.BufferPool
    dm     *storage.DiskManager
    rootID PageID
}

func NewBTree(bp *storage.BufferPool, dm *storage.DiskManager) *BTree {
    rootID := dm.AllocatePage()
    bt := &BTree{bp: bp, dm: dm, rootID: rootID}
    bt.initLeafPage(rootID)
    return bt
}

func (bt *BTree) initLeafPage(id PageID) {
    p, _ := bt.bp.FetchPage(id)
    p[0] = LeafFlag
    binary.LittleEndian.PutUint16(p[1:], 0)
    bt.bp.UnpinPage(id, true)
}

func (bt *BTree) Get(key Key) (Value, bool) {
    leafID := bt.findLeaf(bt.rootID, key)
    p, err := bt.bp.FetchPage(leafID)
    if err != nil {
        return 0, false
    }
    defer bt.bp.UnpinPage(leafID, false)

    numKeys := int(binary.LittleEndian.Uint16(p[1:]))
    for i := 0; i < numKeys; i++ {
        offset := 3 + i*16
        k := Key(binary.LittleEndian.Uint64(p[offset:]))
        if k == key {
            v := Value(binary.LittleEndian.Uint64(p[offset+8:]))
            return v, true
        }
    }
    return 0, false
}

func (bt *BTree) findLeaf(nodeID PageID, key Key) PageID {
    p, _ := bt.bp.FetchPage(nodeID)
    if p[0] == LeafFlag {
        bt.bp.UnpinPage(nodeID, false)
        return nodeID
    }
    numKeys := int(binary.LittleEndian.Uint16(p[1:]))
    child := PageID(binary.LittleEndian.Uint32(p[3:]))
    for i := 0; i < numKeys; i++ {
        k := Key(binary.LittleEndian.Uint64(p[3+4+i*12:]))
        if key >= k {
            child = PageID(binary.LittleEndian.Uint32(p[3+4+i*12+8:]))
        }
    }
    bt.bp.UnpinPage(nodeID, false)
    return bt.findLeaf(child, key)
}

// RangeScan iterates all key-value pairs in [start, end]
func (bt *BTree) RangeScan(start, end Key, fn func(Key, Value)) {
    leafID := bt.findLeaf(bt.rootID, start)
    for leafID != 0 {
        p, _ := bt.bp.FetchPage(leafID)
        numKeys := int(binary.LittleEndian.Uint16(p[1:]))
        finished := false
        for i := 0; i < numKeys; i++ {
            k := Key(binary.LittleEndian.Uint64(p[3+i*16:]))
            v := Value(binary.LittleEndian.Uint64(p[3+i*16+8:]))
            if k > end {
                finished = true
                break
            }
            if k >= start {
                fn(k, v)
            }
        }
        next := PageID(binary.LittleEndian.Uint32(p[PageSize-4:]))
        bt.bp.UnpinPage(leafID, false)
        if finished || next == 0 {
            break
        }
        leafID = next
    }
}

// Put inserts or updates a key-value pair
func (bt *BTree) Put(key Key, val Value) error {
    leafID := bt.findLeaf(bt.rootID, key)
    newKey, newPageID, split := bt.insertLeaf(leafID, key, val)
    if split {
        bt.insertIntoParent(bt.rootID, leafID, newKey, newPageID)
    }
    return nil
}

func (bt *BTree) insertLeaf(id PageID, key Key, val Value) (Key, PageID, bool) {
    p, _ := bt.bp.FetchPage(id)
    numKeys := int(binary.LittleEndian.Uint16(p[1:]))

    if numKeys < MaxKeys {
        pos := 0
        for pos < numKeys {
            k := Key(binary.LittleEndian.Uint64(p[3+pos*16:]))
            if k >= key {
                break
            }
            pos++
        }
        copy(p[3+(pos+1)*16:], p[3+pos*16:3+numKeys*16])
        binary.LittleEndian.PutUint64(p[3+pos*16:], uint64(key))
        binary.LittleEndian.PutUint64(p[3+pos*16+8:], uint64(val))
        binary.LittleEndian.PutUint16(p[1:], uint16(numKeys+1))
        bt.bp.UnpinPage(id, true)
        return 0, 0, false
    }

    // Leaf full: split
    mid := MaxKeys / 2
    newID := bt.dm.AllocatePage()
    newPage, _ := bt.bp.FetchPage(newID)
    newPage[0] = LeafFlag

    rightCount := numKeys - mid
    copy(newPage[3:], p[3+mid*16:3+numKeys*16])
    binary.LittleEndian.PutUint16(newPage[1:], uint16(rightCount))
    binary.LittleEndian.PutUint16(p[1:], uint16(mid))

    origNext := PageID(binary.LittleEndian.Uint32(p[PageSize-4:]))
    binary.LittleEndian.PutUint32(newPage[PageSize-4:], uint32(origNext))
    binary.LittleEndian.PutUint32(p[PageSize-4:], uint32(newID))

    promoteKey := Key(binary.LittleEndian.Uint64(newPage[3:]))
    bt.bp.UnpinPage(id, true)
    bt.bp.UnpinPage(newID, true)

    if key < promoteKey {
        bt.insertLeaf(id, key, val)
    } else {
        bt.insertLeaf(newID, key, val)
    }
    return promoteKey, newID, true
}

func (bt *BTree) insertIntoParent(rootID, leftID PageID, key Key, rightID PageID) {
    if leftID == bt.rootID {
        newRootID := bt.dm.AllocatePage()
        p, _ := bt.bp.FetchPage(newRootID)
        p[0] = 0
        binary.LittleEndian.PutUint16(p[1:], 1)
        binary.LittleEndian.PutUint32(p[3:], uint32(leftID))
        binary.LittleEndian.PutUint64(p[7:], uint64(key))
        binary.LittleEndian.PutUint32(p[15:], uint32(rightID))
        bt.bp.UnpinPage(newRootID, true)
        bt.rootID = newRootID
    }
}

// Prevent unused import error
var _ = bytes.Compare
var _ = fmt.Sprintf

Step 3: WAL for Crash Recovery

package wal

import (
    "encoding/binary"
    "fmt"
    "io"
    "os"
    "sync"
    "sync/atomic"
)

type LogRecord struct {
    LSN    uint64
    TxID   uint64
    Type   uint8
    PageID uint32
    Offset uint32
    Data   []byte
}

const (
    RecordUpdate = 0
    RecordCommit = 1
    RecordAbort  = 2
    RecordBegin  = 3
)

type WAL struct {
    file    *os.File
    nextLSN uint64
    mu      sync.Mutex
}

func NewWAL(path string) (*WAL, error) {
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0644)
    if err != nil {
        return nil, err
    }
    return &WAL{file: f, nextLSN: 1}, nil
}

// Append writes a WAL record and returns its LSN
func (w *WAL) Append(rec *LogRecord) (uint64, error) {
    w.mu.Lock()
    defer w.mu.Unlock()

    lsn := atomic.AddUint64(&w.nextLSN, 1) - 1
    rec.LSN = lsn

    dataLen := len(rec.Data)
    // Format: [total_len:4][LSN:8][TxID:8][Type:1][PageID:4][Offset:4][DataLen:4][Data...]
    totalLen := 8 + 8 + 1 + 4 + 4 + 4 + dataLen

    buf := make([]byte, totalLen+4)
    binary.LittleEndian.PutUint32(buf[0:], uint32(totalLen))
    binary.LittleEndian.PutUint64(buf[4:], lsn)
    binary.LittleEndian.PutUint64(buf[12:], rec.TxID)
    buf[20] = rec.Type
    binary.LittleEndian.PutUint32(buf[21:], rec.PageID)
    binary.LittleEndian.PutUint32(buf[25:], rec.Offset)
    binary.LittleEndian.PutUint32(buf[29:], uint32(dataLen))
    copy(buf[33:], rec.Data)

    if _, err := w.file.Write(buf); err != nil {
        return 0, fmt.Errorf("wal append: %w", err)
    }
    if err := w.file.Sync(); err != nil {
        return 0, fmt.Errorf("wal sync: %w", err)
    }
    return lsn, nil
}

// Recover replays WAL and returns operations from committed transactions
func (w *WAL) Recover() ([]LogRecord, error) {
    w.file.Seek(0, io.SeekStart)
    var allRecords []LogRecord
    committedTx := make(map[uint64]bool)

    for {
        var totalLen uint32
        if err := binary.Read(w.file, binary.LittleEndian, &totalLen); err != nil {
            if err == io.EOF {
                break
            }
            return nil, err
        }
        buf := make([]byte, totalLen)
        if _, err := io.ReadFull(w.file, buf); err != nil {
            break // incomplete record: crash point
        }
        rec := LogRecord{
            LSN:    binary.LittleEndian.Uint64(buf[0:]),
            TxID:   binary.LittleEndian.Uint64(buf[8:]),
            Type:   buf[16],
            PageID: binary.LittleEndian.Uint32(buf[17:]),
            Offset: binary.LittleEndian.Uint32(buf[21:]),
        }
        dataLen := binary.LittleEndian.Uint32(buf[25:])
        rec.Data = make([]byte, dataLen)
        copy(rec.Data, buf[29:])
        allRecords = append(allRecords, rec)
        if rec.Type == RecordCommit {
            committedTx[rec.TxID] = true
        }
    }

    var committed []LogRecord
    for _, rec := range allRecords {
        if rec.Type == RecordUpdate && committedTx[rec.TxID] {
            committed = append(committed, rec)
        }
    }
    return committed, nil
}

Step 4: MVCC Transactions

package mvcc

import (
    "fmt"
    "sync"
    "sync/atomic"
)

type Version struct {
    BeginTS uint64
    EndTS   uint64  // 0 means "currently valid"
    Value   []byte
    Next    *Version
}

type Row struct {
    Key  uint64
    Head *Version
    mu   sync.RWMutex
}

type MVCC struct {
    nextTS uint64
    rows   sync.Map // key(uint64) → *Row
}

func NewMVCC() *MVCC {
    return &MVCC{}
}

func (m *MVCC) Begin() uint64 {
    return atomic.AddUint64(&m.nextTS, 1)
}

// Read performs a snapshot read: returns the latest version visible at snapTS
func (m *MVCC) Read(key, snapTS uint64) ([]byte, bool) {
    val, ok := m.rows.Load(key)
    if !ok {
        return nil, false
    }
    row := val.(*Row)
    row.mu.RLock()
    defer row.mu.RUnlock()

    for v := row.Head; v != nil; v = v.Next {
        if v.BeginTS <= snapTS && (v.EndTS == 0 || v.EndTS > snapTS) {
            result := make([]byte, len(v.Value))
            copy(result, v.Value)
            return result, true
        }
    }
    return nil, false
}

// Write creates a new version at commitTS
func (m *MVCC) Write(key, commitTS uint64, value []byte) error {
    actual, _ := m.rows.LoadOrStore(key, &Row{Key: key})
    row := actual.(*Row)
    row.mu.Lock()
    defer row.mu.Unlock()

    newVer := &Version{
        BeginTS: commitTS,
        EndTS:   0,
        Value:   append([]byte{}, value...),
        Next:    row.Head,
    }
    if row.Head != nil && row.Head.EndTS == 0 {
        row.Head.EndTS = commitTS
    }
    row.Head = newVer
    return nil
}

// Delete marks the current version as ended
func (m *MVCC) Delete(key, commitTS uint64) error {
    val, ok := m.rows.Load(key)
    if !ok {
        return fmt.Errorf("key %d not found", key)
    }
    row := val.(*Row)
    row.mu.Lock()
    defer row.mu.Unlock()
    if row.Head != nil && row.Head.EndTS == 0 {
        row.Head.EndTS = commitTS
    }
    return nil
}

// Tx represents a complete MVCC transaction
type Tx struct {
    mvcc      *MVCC
    snapTS    uint64
    writes    []writeIntent
    committed bool
}

type writeIntent struct {
    key   uint64
    value []byte // nil means delete
}

func (m *MVCC) BeginTx() *Tx {
    return &Tx{mvcc: m, snapTS: m.Begin()}
}

func (tx *Tx) Get(key uint64) ([]byte, bool) {
    return tx.mvcc.Read(key, tx.snapTS)
}

func (tx *Tx) Set(key uint64, value []byte) {
    tx.writes = append(tx.writes, writeIntent{key: key, value: value})
}

func (tx *Tx) Del(key uint64) {
    tx.writes = append(tx.writes, writeIntent{key: key, value: nil})
}

func (tx *Tx) Commit() error {
    commitTS := atomic.AddUint64(&tx.mvcc.nextTS, 1)
    for _, w := range tx.writes {
        if w.value == nil {
            tx.mvcc.Delete(w.key, commitTS)
        } else {
            tx.mvcc.Write(w.key, commitTS, w.value)
        }
    }
    tx.committed = true
    return nil
}

func (tx *Tx) Rollback() {
    tx.writes = nil
}

Step 5: Simple SQL Parser

package sql

import (
    "fmt"
    "strings"
)

type StatementType int

const (
    StmtSelect StatementType = iota
    StmtInsert
    StmtUpdate
    StmtDelete
)

type Statement struct {
    Type       StatementType
    Table      string
    Columns    []string
    Values     []string
    Where      *WhereClause
    SetClauses []SetClause
}

type WhereClause struct {
    Column string
    Op     string
    Value  string
}

type SetClause struct {
    Column string
    Value  string
}

func Parse(query string) (*Statement, error) {
    tokens := strings.Fields(strings.ReplaceAll(query, ",", " , "))
    if len(tokens) == 0 {
        return nil, fmt.Errorf("empty query")
    }
    switch strings.ToUpper(tokens[0]) {
    case "SELECT":
        return parseSelect(tokens)
    case "INSERT":
        return parseInsert(tokens)
    case "UPDATE":
        return parseUpdate(tokens)
    case "DELETE":
        return parseDelete(tokens)
    default:
        return nil, fmt.Errorf("unknown statement type: %s", tokens[0])
    }
}

func parseSelect(tokens []string) (*Statement, error) {
    stmt := &Statement{Type: StmtSelect}
    i := 1
    for i < len(tokens) && strings.ToUpper(tokens[i]) != "FROM" {
        col := strings.Trim(tokens[i], ", ")
        if col != "" && col != "," {
            stmt.Columns = append(stmt.Columns, col)
        }
        i++
    }
    if i >= len(tokens) {
        return nil, fmt.Errorf("missing FROM")
    }
    i++
    if i >= len(tokens) {
        return nil, fmt.Errorf("missing table name")
    }
    stmt.Table = tokens[i]
    i++
    if i < len(tokens) && strings.ToUpper(tokens[i]) == "WHERE" {
        w, err := parseWhere(tokens[i+1:])
        if err != nil {
            return nil, err
        }
        stmt.Where = w
    }
    return stmt, nil
}

func parseInsert(tokens []string) (*Statement, error) {
    if len(tokens) < 5 || strings.ToUpper(tokens[1]) != "INTO" {
        return nil, fmt.Errorf("invalid INSERT syntax")
    }
    stmt := &Statement{Type: StmtInsert, Table: tokens[2]}
    valIdx := -1
    for i, t := range tokens {
        if strings.ToUpper(t) == "VALUES" {
            valIdx = i
            break
        }
    }
    if valIdx == -1 {
        return nil, fmt.Errorf("missing VALUES")
    }
    for i := valIdx + 1; i < len(tokens); i++ {
        v := strings.Trim(tokens[i], "(),\"' ")
        if v != "" {
            stmt.Values = append(stmt.Values, v)
        }
    }
    return stmt, nil
}

func parseUpdate(tokens []string) (*Statement, error) {
    if len(tokens) < 4 {
        return nil, fmt.Errorf("invalid UPDATE syntax")
    }
    stmt := &Statement{Type: StmtUpdate, Table: tokens[1]}
    i := 3
    for i < len(tokens) && strings.ToUpper(tokens[i]) != "WHERE" {
        part := strings.Trim(tokens[i], ", ")
        if eq := strings.Index(part, "="); eq != -1 {
            stmt.SetClauses = append(stmt.SetClauses, SetClause{
                Column: strings.TrimSpace(part[:eq]),
                Value:  strings.Trim(part[eq+1:], "'\""),
            })
        }
        i++
    }
    if i < len(tokens) && strings.ToUpper(tokens[i]) == "WHERE" {
        w, err := parseWhere(tokens[i+1:])
        if err != nil {
            return nil, err
        }
        stmt.Where = w
    }
    return stmt, nil
}

func parseDelete(tokens []string) (*Statement, error) {
    if len(tokens) < 3 {
        return nil, fmt.Errorf("invalid DELETE syntax")
    }
    stmt := &Statement{Type: StmtDelete, Table: tokens[2]}
    for i, t := range tokens {
        if strings.ToUpper(t) == "WHERE" && i+1 < len(tokens) {
            w, err := parseWhere(tokens[i+1:])
            if err != nil {
                return nil, err
            }
            stmt.Where = w
            break
        }
    }
    return stmt, nil
}

func parseWhere(tokens []string) (*WhereClause, error) {
    ops := []string{">=", "<=", "=", ">", "<"}
    combined := strings.Join(tokens, " ")
    for _, op := range ops {
        if idx := strings.Index(combined, op); idx != -1 {
            col := strings.TrimSpace(combined[:idx])
            val := strings.TrimSpace(combined[idx+len(op):])
            return &WhereClause{
                Column: col,
                Op:     op,
                Value:  strings.Trim(val, "'\""),
            }, nil
        }
    }
    return nil, fmt.Errorf("no operator in WHERE clause")
}

// Integration test
func RunExample() {
    queries := []string{
        "SELECT name , age FROM users WHERE age > 18",
        "INSERT INTO users VALUES ( 1 , Alice , 30 )",
        "UPDATE users SET name=Bob WHERE id=1",
        "DELETE FROM users WHERE id=1",
    }
    for _, q := range queries {
        stmt, err := Parse(q)
        if err != nil {
            fmt.Printf("Error parsing %q: %v\n", q, err)
            continue
        }
        fmt.Printf("OK: table=%s type=%d where=%v\n", stmt.Table, stmt.Type, stmt.Where)
    }
}

Level 4 · Advanced Topics and Edge Cases

Lock-Free Reads and the True Cost of MVCC

MVCC's lock-free reads come with a hidden cost. The longer the version chain, the more versions a read must traverse. Under heavy write workloads, version chains can grow very long, causing read performance to degrade.

Garbage Collection (Vacuum) periodically cleans expired versions. PostgreSQL's VACUUM is a famously tricky design challenge — it needs to find the "oldest active transaction," and all non-current versions older than that timestamp can be safely deleted.

The key insight: MVCC shifts conflict from reads blocking writes to read performance degrading under write load. It's a tradeoff, not a free lunch.

// Compaction: delete historical versions older than olderThan
func (m *MVCC) Compaction(olderThan uint64) {
    m.rows.Range(func(key, val interface{}) bool {
        row := val.(*Row)
        row.mu.Lock()
        defer row.mu.Unlock()
        // Find first version with BeginTS <= olderThan, truncate chain after it
        cur := row.Head
        for cur != nil && cur.BeginTS > olderThan {
            cur = cur.Next
        }
        if cur != nil {
            cur.Next = nil
        }
        return true
    })
}

LSM Tree: The Write-Optimized Alternative to B+Tree

B+trees are unfriendly to write operations: each write requires random I/O to update one or more pages. LSM trees (Log-Structured Merge Trees) convert all writes to sequential writes, at the cost of requiring merging multiple files during reads.

LSM tree structure:
MemTable (memory, red-black tree) --flush--> L0 SSTable (immutable, sorted)
L0 SSTables --compaction--> L1 SSTables
L1 SSTables --compaction--> L2 SSTables
...

RocksDB, BadgerDB, and LevelDB all use LSM trees. Write-intensive scenarios (logs, time-series data) favor LSM; read-heavy scenarios (OLTP) favor B+trees. This is why InnoDB and MyISAM chose B+tree while Cassandra and HBase chose LSM.

Comparing BoltDB and BadgerDB Internals

BoltDB is a pure B+tree implementation that uses Copy-on-Write (COW) for write transactions instead of WAL — when modifying, it copies all nodes on the path from root to the modified leaf, keeping original pages intact. On commit, it atomically updates the root page pointer. This design gives completely lock-free reads but limits write throughput to B+tree's random I/O ceiling.

BadgerDB is an LSM tree implementation that separates Key and Value storage (Key in LSM tree, Value in Value Log). This design (called WiscKey) significantly reduces LSM tree write amplification, with substantial performance gains for large-value scenarios.

Rough performance comparison (highly workload-dependent):

Scenario B+Tree (BoltDB) LSM (BadgerDB)
Random read Fast Slower (merging needed)
Sequential read Fast Fast
Random write Slower (random I/O) Fast (sequential append)
Disk space Low fragmentation Space amplification

Embedded Library vs. Standalone Server

Our implementation is suitable as an embedded library (like SQLite or BoltDB), linked directly into the application process. This eliminates network I/O latency and is appropriate for single-machine scenarios.

To become a standalone server, you need:

  1. Network layer: Listen on TCP connections, implement a protocol (MySQL wire protocol or custom binary protocol).
  2. Connection management: One goroutine per connection, connection pool management.
  3. Transaction coordination: Cross-connection transactions must have serialized commits (global transaction ID generator).
  4. Authentication and permissions: User management, table-level/row-level permissions.

SQLite chose embedded, MySQL/PostgreSQL chose server mode — each with its own design philosophy. Understanding this architectural choice, and the tradeoffs that drove it, is itself a lesson in systems design.

Building this engine end to end — even in this simplified form — gives you an intuition for why every major database design decision is the way it is. When you read about InnoDB's buffer pool or PostgreSQL's MVCC implementation, you'll no longer be reading documentation: you'll be recognizing decisions you already made yourself.

Rate this chapter
4.9  / 5  (3 ratings)

💬 Comments