Build a Simple Database
Build a Simple Database
There is a saying that circulates widely among systems programmers: the only way to truly understand a system is to build it yourself. Databases are the most compelling example of this principle. You can read every page of "Database System Concepts," memorize every insertion rule of the B+tree, but until you write an engine that can execute SELECT * FROM users WHERE age > 18, you won't truly understand why databases are designed the way they are.
In this chapter we build a complete, small database engine in Go. Not a toy in-memory hash table, but a real system with disk storage, crash recovery, and transaction isolation. We'll progress through: key-value store → B+tree index → WAL log → MVCC transactions → SQL parser. Each step reveals a core mechanism of modern database engines.
Level 1 · What a Database Engine Actually Is
Storage Engine vs. Query Engine
When you run SELECT name FROM employees WHERE salary > 10000 ORDER BY name, two fundamentally different things happen inside the database.
The query engine understands your intent: it parses the SQL text, generates an execution plan, decides whether to filter or sort first, decides which index to use. The query engine is a translator — it translates human language into a sequence of machine operations.
The storage engine handles the physical reading and writing of data: how data is organized on disk, how to read it out, how to write it back, how to recover after a crash. The storage engine is a physical system — it deals directly with hardware and the operating system.
MySQL's architecture clearly embodies this separation: the optimizer and executor (query engine) sit above InnoDB or MyISAM (storage engine). You can even swap the storage engine for a table without touching the layer above. PostgreSQL couples storage and query engines more tightly, but the logical layers remain clear.
This separation is not an accidental engineering decision — it reflects the essential nature of the database problem: how to store data and how to query data are two fundamentally different problems, with different sources of complexity, different optimization directions, and different rates of change.
Why This Is the Best Systems Programming Exercise
Building a database engine forces you to confront the most central tensions in computer systems:
Speed vs. Durability. Memory is fast, disk is slow. But memory is volatile — power off and it's gone. All crash-recovery mechanisms are fundamentally solving this tension.
Concurrency vs. Consistency. Multiple users read and write the database simultaneously. Making them not interfere with each other while keeping data consistent is the core problem of concurrency control. MVCC is one answer modern databases have arrived at.
Space vs. Time. How large should each B+tree node be? Too small means too many nodes, the tree is too deep, queries are slow. Too large means each I/O transfers too much data, wasteful. How large should the buffer pool be? These are all space-time tradeoffs.
Flexibility vs. Performance. Supporting SQL means you need to parse and optimize queries, which has CPU overhead. A direct key-value API is faster but less expressive.
Each of these tensions has corresponding engineering solutions in database systems, and those solutions appear again and again in other systems (operating systems, distributed systems, file systems). Learning databases is really learning a framework for thinking about systems problems.
Why Go Is a Good Fit
Go's advantages for this project are multifold:
- Direct byte layout control:
encoding/binarylets you precisely control on-disk byte formats, as clear as C struct memory layout but without C's memory safety hazards. - Natural interface abstraction: storage engine interface, codec interface, transaction interface — all expressed cleanly with Go interfaces.
- Complete concurrency primitives: MVCC needs concurrent reads and writes on version chains;
sync.RWMutex,sync/atomic, and channels all come into play. - Complete file operation library:
os.File'sReadAt/WriteAtsupport random I/O, the foundation of a buffer pool.
Level 2 · Core Mechanism Principles
B+Tree: The Foundation of Database Indexes
The B+tree is the default index structure for the vast majority of relational databases. Understanding why B+tree was chosen over other structures requires first understanding the I/O reality databases face.
A single disk random I/O takes 5–10ms, while a memory access takes 100ns — a difference of 50,000x. For a table with 1 million records, a binary search tree (BST) has a height of about 20, meaning each lookup requires 20 I/Os, or 100–200ms. This is completely unacceptable.
The B+tree solution is: make each node as large as possible to minimize tree height. A typical B+tree node size equals the minimum disk I/O unit — a page — usually 4KB or 16KB (InnoDB defaults to 16KB). A 16KB node can hold roughly 1,000 64-bit keys, so a 3-level B+tree can index 1 billion records with only 3 I/Os.
B+Tree Node Structure
B+trees have internal nodes and leaf nodes:
Internal node:
+--------+----+--------+----+--------+
| ptr[0] | k1 | ptr[1] | k2 | ptr[2] |
+--------+----+--------+----+--------+
Leaf node:
+-----+-----+-----+------+
| k1 | k2 | k3 | next | → pointer to next leaf (enables range scans)
+-----+-----+-----+------+
| v1 | v2 | v3 |
Internal nodes store only keys and child pointers, not values. All actual data (or primary keys, in InnoDB's secondary indexes) is in leaf nodes. Leaf nodes are linked in a doubly-linked list, making range scans extremely efficient — find the start, then walk the linked list without tree backtracking.
Insert and Split
When a leaf node is full, B+tree splits it into two nodes and promotes the middle key to the parent. If the parent is also full, continue splitting upward; worst case splits all the way to the root, and the tree grows one level taller.
Before split (node full, capacity=3):
[10, 20, 30]
After inserting 25, split:
Parent gets key 25
[10, 20] [25, 30]
Delete and Merge
After deletion, if a node's key count falls below the minimum threshold (usually capacity/2), two options exist:
- Borrow: Take a key from an adjacent sibling (routed through the parent).
- Merge: Merge with a sibling, the separator key in the parent descends.
Merging triggers a key decrease in the parent, which may cause recursive merging; worst case the tree shrinks one level.
WAL: Sequential Write for Crash Recovery
One of the database's most basic promises is durability: once a transaction commits, the data won't be lost due to a crash.
The most direct implementation: on commit, flush all dirty pages (modified pages not yet written to disk) to disk. But this has two problems: random I/O is extremely slow (disk random write performance is about 1/100 of sequential write), and if a modification spans multiple pages, a crash in the middle leaves data inconsistent.
WAL (Write-Ahead Log)'s solution is elegant:
Core rule: Before modifying any data page on disk, the corresponding log record must be written to the WAL file first.
WAL writes are pure sequential append — sequential write performance is very high. Data pages can be written lazily; as long as WAL is on disk, crash recovery can replay all operations from WAL to restore consistency.
WAL log record format:
+----------+----------+---------+---------+----------+
| LSN (8B) | TxID (8B)| PageID | Offset | Data |
+----------+----------+---------+---------+----------+
LSN (Log Sequence Number) is a monotonically increasing sequence number that determines the order of log records and marks how "current" a data page is. Each data page's header stores the LSN of the last WAL record that modified it (called pageLSN).
Recovery Process (simplified ARIES algorithm):
- Analysis phase: Scan WAL from the most recent checkpoint to determine which transactions committed and which did not.
- Redo phase: Replay all WAL records to restore the database to the pre-crash state (including modifications from uncommitted transactions).
- Undo phase: Roll back all uncommitted transactions (via compensating log records in WAL).
MVCC: Implementing Snapshot Isolation
Reads and writes don't block each other — this is MVCC (Multi-Version Concurrency Control)'s most important property. It lets read operations see a consistent historical snapshot without blocking write operations.
Version Chains
Each row of data has a version chain. Each version in the chain records:
- The data content
- The transaction ID that created this version (
begin_ts) - The transaction ID that deleted this version (
end_ts, NULL means current version)
Version chain for row id=1:
v3: name="Charlie", begin_ts=200, end_ts=NULL ← newest version
v2: name="Bob", begin_ts=100, end_ts=200
v1: name="Alice", begin_ts=1, end_ts=100
Snapshot Read
When a transaction begins, it records the current transaction ID as the snapshot timestamp (snap_ts). Read operations can only see versions satisfying:
begin_ts <= snap_ts(version existed before the transaction started)end_ts > snap_tsorend_ts == NULL(version not deleted before the transaction started)
So transaction A sees a consistent snapshot throughout its lifetime — even if other transactions modify data while A is running, A can't see those modifications.
Write Operations
Writes don't modify existing versions; they create new versions (INSERT) or mark old versions expired (DELETE/UPDATE = DELETE + INSERT). This is what "multi-version" means.
Buffer Pool: Page Cache Management
Databases don't read and write disk directly; they go through the buffer pool. The buffer pool is a region of memory that caches recently accessed disk pages.
Buffer pool structure:
+---------+---------+---------+---------+
| Frame 0 | Frame 1 | Frame 2 | Frame 3 | ← fixed-size memory blocks
+---------+---------+---------+---------+
↓
Page Table: PageID → Frame number
Replacement Policy (LRU/Clock): decides which frame to evict
When a page is needed:
- Check the page table. On hit (page is in buffer pool), return directly.
- On miss, select a victim frame. If the victim frame is dirty, flush it to disk first.
- Read the required page from disk into the victim frame, update the page table.
The replacement policy determines which frame is evicted. The simplest is LRU (Least Recently Used). Real databases (like InnoDB) use a modified LRU — split into new/old sublists to prevent full table scans from evicting hot data.
Level 3 · Building It Step by Step
Step 1: Page Storage and Buffer Pool
Our database manages disk in 4KB pages. Each page has a unique PageID.
package storage
import (
"encoding/binary"
"fmt"
"os"
"sync"
)
const PageSize = 4096
type PageID uint32
type Page [PageSize]byte
// DiskManager manages the disk file, reads and writes by page
type DiskManager struct {
file *os.File
numPages PageID
mu sync.Mutex
}
func NewDiskManager(path string) (*DiskManager, error) {
f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
if err != nil {
return nil, fmt.Errorf("open file: %w", err)
}
info, _ := f.Stat()
numPages := PageID(info.Size() / PageSize)
return &DiskManager{file: f, numPages: numPages}, nil
}
func (dm *DiskManager) ReadPage(id PageID, p *Page) error {
_, err := dm.file.ReadAt(p[:], int64(id)*PageSize)
return err
}
func (dm *DiskManager) WritePage(id PageID, p *Page) error {
_, err := dm.file.WriteAt(p[:], int64(id)*PageSize)
return err
}
func (dm *DiskManager) AllocatePage() PageID {
dm.mu.Lock()
defer dm.mu.Unlock()
id := dm.numPages
dm.numPages++
var empty Page
dm.file.WriteAt(empty[:], int64(id)*PageSize)
return id
}
// BufferPool: caches disk pages, supports LRU eviction
type frame struct {
page Page
pageID PageID
dirty bool
pinCount int
}
type BufferPool struct {
dm *DiskManager
frames []frame
pageTable map[PageID]int
freeList []int
lruList []int
mu sync.Mutex
}
func NewBufferPool(dm *DiskManager, poolSize int) *BufferPool {
bp := &BufferPool{
dm: dm,
frames: make([]frame, poolSize),
pageTable: make(map[PageID]int),
freeList: make([]int, poolSize),
}
for i := range bp.freeList {
bp.freeList[i] = i
}
return bp
}
// FetchPage gets a page (pins it to prevent eviction)
func (bp *BufferPool) FetchPage(id PageID) (*Page, error) {
bp.mu.Lock()
defer bp.mu.Unlock()
if fi, ok := bp.pageTable[id]; ok {
bp.frames[fi].pinCount++
bp.moveFront(&bp.lruList, fi)
return &bp.frames[fi].page, nil
}
fi, err := bp.evict()
if err != nil {
return nil, err
}
f := &bp.frames[fi]
if err := bp.dm.ReadPage(id, &f.page); err != nil {
return nil, fmt.Errorf("read page %d: %w", id, err)
}
f.pageID = id
f.dirty = false
f.pinCount = 1
bp.pageTable[id] = fi
bp.lruList = append([]int{fi}, bp.lruList...)
return &f.page, nil
}
func (bp *BufferPool) UnpinPage(id PageID, dirty bool) {
bp.mu.Lock()
defer bp.mu.Unlock()
if fi, ok := bp.pageTable[id]; ok {
bp.frames[fi].pinCount--
if dirty {
bp.frames[fi].dirty = true
}
}
}
func (bp *BufferPool) evict() (int, error) {
if len(bp.freeList) > 0 {
fi := bp.freeList[0]
bp.freeList = bp.freeList[1:]
return fi, nil
}
for i := len(bp.lruList) - 1; i >= 0; i-- {
fi := bp.lruList[i]
f := &bp.frames[fi]
if f.pinCount == 0 {
if f.dirty {
if err := bp.dm.WritePage(f.pageID, &f.page); err != nil {
return 0, err
}
}
delete(bp.pageTable, f.pageID)
bp.lruList = append(bp.lruList[:i], bp.lruList[i+1:]...)
return fi, nil
}
}
return 0, fmt.Errorf("buffer pool full: all pages are pinned")
}
func (bp *BufferPool) moveFront(list *[]int, val int) {
for i, v := range *list {
if v == val {
*list = append([]int{val}, append((*list)[:i], (*list)[i+1:]...)...)
return
}
}
}
func PutUint64(p *Page, offset int, v uint64) {
binary.LittleEndian.PutUint64(p[offset:], v)
}
func GetUint64(p *Page, offset int) uint64 {
return binary.LittleEndian.Uint64(p[offset:])
}
Step 2: B+Tree Key-Value Store
package btree
import (
"bytes"
"encoding/binary"
"fmt"
"github.com/yourname/minidb/storage"
)
// Leaf node format:
// [isLeaf:1][numKeys:2][key0:8][val0:8]...[keyN:8][valN:8][nextLeaf:4]
// Internal node format:
// [isLeaf:1][numKeys:2][child0:4][key0:8][child1:4][key1:8]...[childN:4]
const (
PageSize = storage.PageSize
MaxKeys = 200
LeafFlag = 0x01
)
type Key uint64
type Value uint64
type PageID = storage.PageID
type BTree struct {
bp *storage.BufferPool
dm *storage.DiskManager
rootID PageID
}
func NewBTree(bp *storage.BufferPool, dm *storage.DiskManager) *BTree {
rootID := dm.AllocatePage()
bt := &BTree{bp: bp, dm: dm, rootID: rootID}
bt.initLeafPage(rootID)
return bt
}
func (bt *BTree) initLeafPage(id PageID) {
p, _ := bt.bp.FetchPage(id)
p[0] = LeafFlag
binary.LittleEndian.PutUint16(p[1:], 0)
bt.bp.UnpinPage(id, true)
}
func (bt *BTree) Get(key Key) (Value, bool) {
leafID := bt.findLeaf(bt.rootID, key)
p, err := bt.bp.FetchPage(leafID)
if err != nil {
return 0, false
}
defer bt.bp.UnpinPage(leafID, false)
numKeys := int(binary.LittleEndian.Uint16(p[1:]))
for i := 0; i < numKeys; i++ {
offset := 3 + i*16
k := Key(binary.LittleEndian.Uint64(p[offset:]))
if k == key {
v := Value(binary.LittleEndian.Uint64(p[offset+8:]))
return v, true
}
}
return 0, false
}
func (bt *BTree) findLeaf(nodeID PageID, key Key) PageID {
p, _ := bt.bp.FetchPage(nodeID)
if p[0] == LeafFlag {
bt.bp.UnpinPage(nodeID, false)
return nodeID
}
numKeys := int(binary.LittleEndian.Uint16(p[1:]))
child := PageID(binary.LittleEndian.Uint32(p[3:]))
for i := 0; i < numKeys; i++ {
k := Key(binary.LittleEndian.Uint64(p[3+4+i*12:]))
if key >= k {
child = PageID(binary.LittleEndian.Uint32(p[3+4+i*12+8:]))
}
}
bt.bp.UnpinPage(nodeID, false)
return bt.findLeaf(child, key)
}
// RangeScan iterates all key-value pairs in [start, end]
func (bt *BTree) RangeScan(start, end Key, fn func(Key, Value)) {
leafID := bt.findLeaf(bt.rootID, start)
for leafID != 0 {
p, _ := bt.bp.FetchPage(leafID)
numKeys := int(binary.LittleEndian.Uint16(p[1:]))
finished := false
for i := 0; i < numKeys; i++ {
k := Key(binary.LittleEndian.Uint64(p[3+i*16:]))
v := Value(binary.LittleEndian.Uint64(p[3+i*16+8:]))
if k > end {
finished = true
break
}
if k >= start {
fn(k, v)
}
}
next := PageID(binary.LittleEndian.Uint32(p[PageSize-4:]))
bt.bp.UnpinPage(leafID, false)
if finished || next == 0 {
break
}
leafID = next
}
}
// Put inserts or updates a key-value pair
func (bt *BTree) Put(key Key, val Value) error {
leafID := bt.findLeaf(bt.rootID, key)
newKey, newPageID, split := bt.insertLeaf(leafID, key, val)
if split {
bt.insertIntoParent(bt.rootID, leafID, newKey, newPageID)
}
return nil
}
func (bt *BTree) insertLeaf(id PageID, key Key, val Value) (Key, PageID, bool) {
p, _ := bt.bp.FetchPage(id)
numKeys := int(binary.LittleEndian.Uint16(p[1:]))
if numKeys < MaxKeys {
pos := 0
for pos < numKeys {
k := Key(binary.LittleEndian.Uint64(p[3+pos*16:]))
if k >= key {
break
}
pos++
}
copy(p[3+(pos+1)*16:], p[3+pos*16:3+numKeys*16])
binary.LittleEndian.PutUint64(p[3+pos*16:], uint64(key))
binary.LittleEndian.PutUint64(p[3+pos*16+8:], uint64(val))
binary.LittleEndian.PutUint16(p[1:], uint16(numKeys+1))
bt.bp.UnpinPage(id, true)
return 0, 0, false
}
// Leaf full: split
mid := MaxKeys / 2
newID := bt.dm.AllocatePage()
newPage, _ := bt.bp.FetchPage(newID)
newPage[0] = LeafFlag
rightCount := numKeys - mid
copy(newPage[3:], p[3+mid*16:3+numKeys*16])
binary.LittleEndian.PutUint16(newPage[1:], uint16(rightCount))
binary.LittleEndian.PutUint16(p[1:], uint16(mid))
origNext := PageID(binary.LittleEndian.Uint32(p[PageSize-4:]))
binary.LittleEndian.PutUint32(newPage[PageSize-4:], uint32(origNext))
binary.LittleEndian.PutUint32(p[PageSize-4:], uint32(newID))
promoteKey := Key(binary.LittleEndian.Uint64(newPage[3:]))
bt.bp.UnpinPage(id, true)
bt.bp.UnpinPage(newID, true)
if key < promoteKey {
bt.insertLeaf(id, key, val)
} else {
bt.insertLeaf(newID, key, val)
}
return promoteKey, newID, true
}
func (bt *BTree) insertIntoParent(rootID, leftID PageID, key Key, rightID PageID) {
if leftID == bt.rootID {
newRootID := bt.dm.AllocatePage()
p, _ := bt.bp.FetchPage(newRootID)
p[0] = 0
binary.LittleEndian.PutUint16(p[1:], 1)
binary.LittleEndian.PutUint32(p[3:], uint32(leftID))
binary.LittleEndian.PutUint64(p[7:], uint64(key))
binary.LittleEndian.PutUint32(p[15:], uint32(rightID))
bt.bp.UnpinPage(newRootID, true)
bt.rootID = newRootID
}
}
// Prevent unused import error
var _ = bytes.Compare
var _ = fmt.Sprintf
Step 3: WAL for Crash Recovery
package wal
import (
"encoding/binary"
"fmt"
"io"
"os"
"sync"
"sync/atomic"
)
type LogRecord struct {
LSN uint64
TxID uint64
Type uint8
PageID uint32
Offset uint32
Data []byte
}
const (
RecordUpdate = 0
RecordCommit = 1
RecordAbort = 2
RecordBegin = 3
)
type WAL struct {
file *os.File
nextLSN uint64
mu sync.Mutex
}
func NewWAL(path string) (*WAL, error) {
f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0644)
if err != nil {
return nil, err
}
return &WAL{file: f, nextLSN: 1}, nil
}
// Append writes a WAL record and returns its LSN
func (w *WAL) Append(rec *LogRecord) (uint64, error) {
w.mu.Lock()
defer w.mu.Unlock()
lsn := atomic.AddUint64(&w.nextLSN, 1) - 1
rec.LSN = lsn
dataLen := len(rec.Data)
// Format: [total_len:4][LSN:8][TxID:8][Type:1][PageID:4][Offset:4][DataLen:4][Data...]
totalLen := 8 + 8 + 1 + 4 + 4 + 4 + dataLen
buf := make([]byte, totalLen+4)
binary.LittleEndian.PutUint32(buf[0:], uint32(totalLen))
binary.LittleEndian.PutUint64(buf[4:], lsn)
binary.LittleEndian.PutUint64(buf[12:], rec.TxID)
buf[20] = rec.Type
binary.LittleEndian.PutUint32(buf[21:], rec.PageID)
binary.LittleEndian.PutUint32(buf[25:], rec.Offset)
binary.LittleEndian.PutUint32(buf[29:], uint32(dataLen))
copy(buf[33:], rec.Data)
if _, err := w.file.Write(buf); err != nil {
return 0, fmt.Errorf("wal append: %w", err)
}
if err := w.file.Sync(); err != nil {
return 0, fmt.Errorf("wal sync: %w", err)
}
return lsn, nil
}
// Recover replays WAL and returns operations from committed transactions
func (w *WAL) Recover() ([]LogRecord, error) {
w.file.Seek(0, io.SeekStart)
var allRecords []LogRecord
committedTx := make(map[uint64]bool)
for {
var totalLen uint32
if err := binary.Read(w.file, binary.LittleEndian, &totalLen); err != nil {
if err == io.EOF {
break
}
return nil, err
}
buf := make([]byte, totalLen)
if _, err := io.ReadFull(w.file, buf); err != nil {
break // incomplete record: crash point
}
rec := LogRecord{
LSN: binary.LittleEndian.Uint64(buf[0:]),
TxID: binary.LittleEndian.Uint64(buf[8:]),
Type: buf[16],
PageID: binary.LittleEndian.Uint32(buf[17:]),
Offset: binary.LittleEndian.Uint32(buf[21:]),
}
dataLen := binary.LittleEndian.Uint32(buf[25:])
rec.Data = make([]byte, dataLen)
copy(rec.Data, buf[29:])
allRecords = append(allRecords, rec)
if rec.Type == RecordCommit {
committedTx[rec.TxID] = true
}
}
var committed []LogRecord
for _, rec := range allRecords {
if rec.Type == RecordUpdate && committedTx[rec.TxID] {
committed = append(committed, rec)
}
}
return committed, nil
}
Step 4: MVCC Transactions
package mvcc
import (
"fmt"
"sync"
"sync/atomic"
)
type Version struct {
BeginTS uint64
EndTS uint64 // 0 means "currently valid"
Value []byte
Next *Version
}
type Row struct {
Key uint64
Head *Version
mu sync.RWMutex
}
type MVCC struct {
nextTS uint64
rows sync.Map // key(uint64) → *Row
}
func NewMVCC() *MVCC {
return &MVCC{}
}
func (m *MVCC) Begin() uint64 {
return atomic.AddUint64(&m.nextTS, 1)
}
// Read performs a snapshot read: returns the latest version visible at snapTS
func (m *MVCC) Read(key, snapTS uint64) ([]byte, bool) {
val, ok := m.rows.Load(key)
if !ok {
return nil, false
}
row := val.(*Row)
row.mu.RLock()
defer row.mu.RUnlock()
for v := row.Head; v != nil; v = v.Next {
if v.BeginTS <= snapTS && (v.EndTS == 0 || v.EndTS > snapTS) {
result := make([]byte, len(v.Value))
copy(result, v.Value)
return result, true
}
}
return nil, false
}
// Write creates a new version at commitTS
func (m *MVCC) Write(key, commitTS uint64, value []byte) error {
actual, _ := m.rows.LoadOrStore(key, &Row{Key: key})
row := actual.(*Row)
row.mu.Lock()
defer row.mu.Unlock()
newVer := &Version{
BeginTS: commitTS,
EndTS: 0,
Value: append([]byte{}, value...),
Next: row.Head,
}
if row.Head != nil && row.Head.EndTS == 0 {
row.Head.EndTS = commitTS
}
row.Head = newVer
return nil
}
// Delete marks the current version as ended
func (m *MVCC) Delete(key, commitTS uint64) error {
val, ok := m.rows.Load(key)
if !ok {
return fmt.Errorf("key %d not found", key)
}
row := val.(*Row)
row.mu.Lock()
defer row.mu.Unlock()
if row.Head != nil && row.Head.EndTS == 0 {
row.Head.EndTS = commitTS
}
return nil
}
// Tx represents a complete MVCC transaction
type Tx struct {
mvcc *MVCC
snapTS uint64
writes []writeIntent
committed bool
}
type writeIntent struct {
key uint64
value []byte // nil means delete
}
func (m *MVCC) BeginTx() *Tx {
return &Tx{mvcc: m, snapTS: m.Begin()}
}
func (tx *Tx) Get(key uint64) ([]byte, bool) {
return tx.mvcc.Read(key, tx.snapTS)
}
func (tx *Tx) Set(key uint64, value []byte) {
tx.writes = append(tx.writes, writeIntent{key: key, value: value})
}
func (tx *Tx) Del(key uint64) {
tx.writes = append(tx.writes, writeIntent{key: key, value: nil})
}
func (tx *Tx) Commit() error {
commitTS := atomic.AddUint64(&tx.mvcc.nextTS, 1)
for _, w := range tx.writes {
if w.value == nil {
tx.mvcc.Delete(w.key, commitTS)
} else {
tx.mvcc.Write(w.key, commitTS, w.value)
}
}
tx.committed = true
return nil
}
func (tx *Tx) Rollback() {
tx.writes = nil
}
Step 5: Simple SQL Parser
package sql
import (
"fmt"
"strings"
)
type StatementType int
const (
StmtSelect StatementType = iota
StmtInsert
StmtUpdate
StmtDelete
)
type Statement struct {
Type StatementType
Table string
Columns []string
Values []string
Where *WhereClause
SetClauses []SetClause
}
type WhereClause struct {
Column string
Op string
Value string
}
type SetClause struct {
Column string
Value string
}
func Parse(query string) (*Statement, error) {
tokens := strings.Fields(strings.ReplaceAll(query, ",", " , "))
if len(tokens) == 0 {
return nil, fmt.Errorf("empty query")
}
switch strings.ToUpper(tokens[0]) {
case "SELECT":
return parseSelect(tokens)
case "INSERT":
return parseInsert(tokens)
case "UPDATE":
return parseUpdate(tokens)
case "DELETE":
return parseDelete(tokens)
default:
return nil, fmt.Errorf("unknown statement type: %s", tokens[0])
}
}
func parseSelect(tokens []string) (*Statement, error) {
stmt := &Statement{Type: StmtSelect}
i := 1
for i < len(tokens) && strings.ToUpper(tokens[i]) != "FROM" {
col := strings.Trim(tokens[i], ", ")
if col != "" && col != "," {
stmt.Columns = append(stmt.Columns, col)
}
i++
}
if i >= len(tokens) {
return nil, fmt.Errorf("missing FROM")
}
i++
if i >= len(tokens) {
return nil, fmt.Errorf("missing table name")
}
stmt.Table = tokens[i]
i++
if i < len(tokens) && strings.ToUpper(tokens[i]) == "WHERE" {
w, err := parseWhere(tokens[i+1:])
if err != nil {
return nil, err
}
stmt.Where = w
}
return stmt, nil
}
func parseInsert(tokens []string) (*Statement, error) {
if len(tokens) < 5 || strings.ToUpper(tokens[1]) != "INTO" {
return nil, fmt.Errorf("invalid INSERT syntax")
}
stmt := &Statement{Type: StmtInsert, Table: tokens[2]}
valIdx := -1
for i, t := range tokens {
if strings.ToUpper(t) == "VALUES" {
valIdx = i
break
}
}
if valIdx == -1 {
return nil, fmt.Errorf("missing VALUES")
}
for i := valIdx + 1; i < len(tokens); i++ {
v := strings.Trim(tokens[i], "(),\"' ")
if v != "" {
stmt.Values = append(stmt.Values, v)
}
}
return stmt, nil
}
func parseUpdate(tokens []string) (*Statement, error) {
if len(tokens) < 4 {
return nil, fmt.Errorf("invalid UPDATE syntax")
}
stmt := &Statement{Type: StmtUpdate, Table: tokens[1]}
i := 3
for i < len(tokens) && strings.ToUpper(tokens[i]) != "WHERE" {
part := strings.Trim(tokens[i], ", ")
if eq := strings.Index(part, "="); eq != -1 {
stmt.SetClauses = append(stmt.SetClauses, SetClause{
Column: strings.TrimSpace(part[:eq]),
Value: strings.Trim(part[eq+1:], "'\""),
})
}
i++
}
if i < len(tokens) && strings.ToUpper(tokens[i]) == "WHERE" {
w, err := parseWhere(tokens[i+1:])
if err != nil {
return nil, err
}
stmt.Where = w
}
return stmt, nil
}
func parseDelete(tokens []string) (*Statement, error) {
if len(tokens) < 3 {
return nil, fmt.Errorf("invalid DELETE syntax")
}
stmt := &Statement{Type: StmtDelete, Table: tokens[2]}
for i, t := range tokens {
if strings.ToUpper(t) == "WHERE" && i+1 < len(tokens) {
w, err := parseWhere(tokens[i+1:])
if err != nil {
return nil, err
}
stmt.Where = w
break
}
}
return stmt, nil
}
func parseWhere(tokens []string) (*WhereClause, error) {
ops := []string{">=", "<=", "=", ">", "<"}
combined := strings.Join(tokens, " ")
for _, op := range ops {
if idx := strings.Index(combined, op); idx != -1 {
col := strings.TrimSpace(combined[:idx])
val := strings.TrimSpace(combined[idx+len(op):])
return &WhereClause{
Column: col,
Op: op,
Value: strings.Trim(val, "'\""),
}, nil
}
}
return nil, fmt.Errorf("no operator in WHERE clause")
}
// Integration test
func RunExample() {
queries := []string{
"SELECT name , age FROM users WHERE age > 18",
"INSERT INTO users VALUES ( 1 , Alice , 30 )",
"UPDATE users SET name=Bob WHERE id=1",
"DELETE FROM users WHERE id=1",
}
for _, q := range queries {
stmt, err := Parse(q)
if err != nil {
fmt.Printf("Error parsing %q: %v\n", q, err)
continue
}
fmt.Printf("OK: table=%s type=%d where=%v\n", stmt.Table, stmt.Type, stmt.Where)
}
}
Level 4 · Advanced Topics and Edge Cases
Lock-Free Reads and the True Cost of MVCC
MVCC's lock-free reads come with a hidden cost. The longer the version chain, the more versions a read must traverse. Under heavy write workloads, version chains can grow very long, causing read performance to degrade.
Garbage Collection (Vacuum) periodically cleans expired versions. PostgreSQL's VACUUM is a famously tricky design challenge — it needs to find the "oldest active transaction," and all non-current versions older than that timestamp can be safely deleted.
The key insight: MVCC shifts conflict from reads blocking writes to read performance degrading under write load. It's a tradeoff, not a free lunch.
// Compaction: delete historical versions older than olderThan
func (m *MVCC) Compaction(olderThan uint64) {
m.rows.Range(func(key, val interface{}) bool {
row := val.(*Row)
row.mu.Lock()
defer row.mu.Unlock()
// Find first version with BeginTS <= olderThan, truncate chain after it
cur := row.Head
for cur != nil && cur.BeginTS > olderThan {
cur = cur.Next
}
if cur != nil {
cur.Next = nil
}
return true
})
}
LSM Tree: The Write-Optimized Alternative to B+Tree
B+trees are unfriendly to write operations: each write requires random I/O to update one or more pages. LSM trees (Log-Structured Merge Trees) convert all writes to sequential writes, at the cost of requiring merging multiple files during reads.
LSM tree structure:
MemTable (memory, red-black tree) --flush--> L0 SSTable (immutable, sorted)
L0 SSTables --compaction--> L1 SSTables
L1 SSTables --compaction--> L2 SSTables
...
RocksDB, BadgerDB, and LevelDB all use LSM trees. Write-intensive scenarios (logs, time-series data) favor LSM; read-heavy scenarios (OLTP) favor B+trees. This is why InnoDB and MyISAM chose B+tree while Cassandra and HBase chose LSM.
Comparing BoltDB and BadgerDB Internals
BoltDB is a pure B+tree implementation that uses Copy-on-Write (COW) for write transactions instead of WAL — when modifying, it copies all nodes on the path from root to the modified leaf, keeping original pages intact. On commit, it atomically updates the root page pointer. This design gives completely lock-free reads but limits write throughput to B+tree's random I/O ceiling.
BadgerDB is an LSM tree implementation that separates Key and Value storage (Key in LSM tree, Value in Value Log). This design (called WiscKey) significantly reduces LSM tree write amplification, with substantial performance gains for large-value scenarios.
Rough performance comparison (highly workload-dependent):
| Scenario | B+Tree (BoltDB) | LSM (BadgerDB) |
|---|---|---|
| Random read | Fast | Slower (merging needed) |
| Sequential read | Fast | Fast |
| Random write | Slower (random I/O) | Fast (sequential append) |
| Disk space | Low fragmentation | Space amplification |
Embedded Library vs. Standalone Server
Our implementation is suitable as an embedded library (like SQLite or BoltDB), linked directly into the application process. This eliminates network I/O latency and is appropriate for single-machine scenarios.
To become a standalone server, you need:
- Network layer: Listen on TCP connections, implement a protocol (MySQL wire protocol or custom binary protocol).
- Connection management: One goroutine per connection, connection pool management.
- Transaction coordination: Cross-connection transactions must have serialized commits (global transaction ID generator).
- Authentication and permissions: User management, table-level/row-level permissions.
SQLite chose embedded, MySQL/PostgreSQL chose server mode — each with its own design philosophy. Understanding this architectural choice, and the tradeoffs that drove it, is itself a lesson in systems design.
Building this engine end to end — even in this simplified form — gives you an intuition for why every major database design decision is the way it is. When you read about InnoDB's buffer pool or PostgreSQL's MVCC implementation, you'll no longer be reading documentation: you'll be recognizing decisions you already made yourself.