Chapter 29

Web Scraping: Concurrency and Rate Limiting

The internet holds vast stores of publicly accessible data, yet that data rarely arrives pre-packaged in a structured form. A web scraper is the engineering tool that converts unstructured HTML pages into structured records. Price monitoring, academic research, competitive analysis, search-engine indexing — all of these domains depend on scrapers as foundational infrastructure.

But scraping is not merely "send an HTTP request, parse HTML." Real, industrial-grade scrapers face three core challenges: throughput (fetching large numbers of URLs in reasonable time), politeness (not overwhelming the target server), and robustness (surviving network failures, anti-bot countermeasures, and JavaScript-rendered pages). Go is well suited to all three.

This chapter begins with Go's concurrency model, dives deep into the Worker Pool pattern, token-bucket rate limiting, and exponential backoff with jitter, and then assembles a complete, production-ready scraper by combining these components.

Level 1 · Why Go Excels at Scraping

Where the Bottleneck Lives

Before discussing Go, we need to understand where a scraper's performance actually bottlenecks.

A typical single-page fetch looks like: establish TCP connection → TLS handshake → send HTTP request → wait for server response → receive response body → parse HTML. In this pipeline, the CPU does meaningful work only during the cryptographic part of the TLS handshake and the HTML parse. For the rest of the time, the program is waiting on network I/O.

For a page with 100 ms network latency and 5 ms HTML parse time, CPU utilization stays below 5 %. If you serialize fetches on a single thread, enormous CPU time is wasted waiting. This is why web scrapers are inherently I/O-bound applications, and concurrency is the primary lever for improving throughput.

Threads vs. Goroutines: The Cost of Concurrency

The classical solution is OS threads. But each OS thread carries roughly 1–8 MB of overhead (mostly a fixed-size stack), is scheduled by the kernel, and incurs context-switch latency of 1–10 µs. For a scraper that needs to maintain thousands of concurrent connections, thread count itself becomes the bottleneck.

Go's goroutines fundamentally solve this problem:

Initial stack of 2 KB, growing on demand; hundreds of thousands of goroutines can coexist
M:N scheduling: the Go runtime multiplexes M goroutines onto N OS threads; the number of threads executing Go code at any moment is capped by GOMAXPROCS, which defaults to the number of logical CPUs
Non-blocking I/O: when a goroutine waits on network I/O, the runtime suspends it and schedules another goroutine; when I/O completes, the goroutine is resumed — all transparent to the programmer

This lets you write concurrent logic in a sequential style:

resp, err := http.Get(url) // this goroutine parks here; the OS thread stays free for others
if err != nil {
    return err
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body) // reads like blocking code, yet only this goroutine waits

Engineering Quality of Go's HTTP Client

Go's standard net/http package is not a thin wrapper. It ships production-grade features out of the box:

Connection pooling (Keep-Alive): enabled by default, avoiding a fresh TCP connection per request
TLS session resumption: reduces TLS handshake cost
Automatic redirect following: up to 10 redirects by default
Fine-grained timeout control: an overall per-request deadline (Client.Timeout), plus separate dial, TLS-handshake, and response-header timeouts on the Transport

client := &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
        DisableCompression:  false, // handles gzip automatically
    },
}

Understanding these defaults and their tuning knobs is the first step toward building an efficient scraper.

Level 2 · Core Patterns and Algorithms

The Worker Pool Pattern

A Worker Pool is the canonical pattern for "processing an unlimited stream of tasks with a fixed number of goroutines." The core problem it solves is: preventing uncontrolled goroutine creation from exhausting memory.

Channel-Based Worker Pool

func crawlWithWorkerPool(urls []string, workerCount int) []Result {
    jobs := make(chan string, len(urls))
    results := make(chan Result, len(urls))

    // Launch a fixed number of workers
    var wg sync.WaitGroup
    for i := 0; i < workerCount; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for url := range jobs {
                result := fetch(url)
                results <- result
            }
        }()
    }

    // Enqueue all tasks
    for _, url := range urls {
        jobs <- url
    }
    close(jobs)

    // After all workers finish, close the results channel
    go func() {
        wg.Wait()
        close(results)
    }()

    // Drain results
    var all []Result
    for r := range results {
        all = append(all, r)
    }
    return all
}

A few key design decisions deserve explanation:

The jobs channel is buffered to len(urls) so the main goroutine can enqueue everything immediately without blocking.
After close(jobs), for url := range jobs exits automatically when the channel is drained — the idiomatic Go way to signal "no more work."
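A dedicated goroutine waits on the WaitGroup before closing results, so the main goroutine can start draining immediately. Here results is buffered to len(urls), so workers never block on it; with a smaller buffer, calling wg.Wait() on the main goroutine before draining would deadlock.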

Semaphore-Based Concurrency Control

An equivalent approach uses a buffered channel to simulate a semaphore:

type Semaphore chan struct{}

func NewSemaphore(n int) Semaphore {
    return make(chan struct{}, n)
}

func (s Semaphore) Acquire() { s <- struct{}{} }
func (s Semaphore) Release() { <-s }

func crawlWithSemaphore(urls []string, maxConcurrent int) []Result {
    sem := NewSemaphore(maxConcurrent)
    var mu sync.Mutex
    var results []Result
    var wg sync.WaitGroup

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem.Acquire()
            defer sem.Release()

            result := fetch(u)
            mu.Lock()
            results = append(results, result)
            mu.Unlock()
        }(url)
    }

    wg.Wait()
    return results
}

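The semaphore approach is more flexible: each URL gets its own goroutine, and the semaphore caps how many of them fetch at once. Note that it caps in-flight fetches, not goroutines: as written, all len(urls) goroutines are created up front and most of them park in Acquire. The Worker Pool reuses a fixed set of goroutines, reducing goroutine creation/teardown overhead. For large, fixed-length work lists, the Worker Pool is generally preferable.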

Token Bucket Rate Limiting: golang.org/x/time/rate

Politeness is central to responsible scraping. Rate limiting ensures that, regardless of how full the task queue is, the sustained request rate to any single target server never exceeds a configured threshold.

The token bucket algorithm is the most widely used rate-limiting approach:

The bucket holds at most burst tokens, representing permitted instantaneous bursts
The system refills the bucket at rate r (r tokens per second)
Each request consumes one token
New tokens are discarded when the bucket is full; requests wait when it is empty

golang.org/x/time/rate implements token bucket:

import "golang.org/x/time/rate"

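// 2 requests per second; burst of up to 5
var limiter = rate.NewLimiter(rate.Limit(2), 5)

func fetchWithRateLimit(ctx context.Context, limiter *rate.Limiter, url string) (*http.Response, error) {
    // Wait blocks until a token is available or the context is canceled
    if err := limiter.Wait(ctx); err != nil {
        return nil, err
    }
    // Bind the request to ctx so cancellation also aborts the fetch itself
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}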

In a scraper, you typically need per-host limiters, applying different rates to different domains:

type HostLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    rps      float64
    burst    int
}

func NewHostLimiter(rps float64, burst int) *HostLimiter {
    return &HostLimiter{
        limiters: make(map[string]*rate.Limiter),
        rps:      rps,
        burst:    burst,
    }
}

func (hl *HostLimiter) Get(host string) *rate.Limiter {
    hl.mu.Lock()
    defer hl.mu.Unlock()
    if l, ok := hl.limiters[host]; ok {
        return l
    }
    l := rate.NewLimiter(rate.Limit(hl.rps), hl.burst)
    hl.limiters[host] = l
    return l
}

Exponential Backoff with Jitter

Network requests inevitably fail: server timeouts, 429 Too Many Requests, transient network errors. The correct retry strategy is exponential backoff: each failure doubles the wait time, preventing continuous hammering of an already-overloaded server.

Pure exponential backoff has one flaw: if many clients fail simultaneously and all retry on the same schedule, they produce a Thundering Herd at retry time. Adding random jitter scatters the retry moments.

import (
    "context"
    "fmt"
    "math"
    "math/rand"
    "net/http"
    "time"
)

type BackoffConfig struct {
    InitialDelay time.Duration
    MaxDelay     time.Duration
    Multiplier   float64
    MaxRetries   int
}

var DefaultBackoff = BackoffConfig{
    InitialDelay: 1 * time.Second,
    MaxDelay:     60 * time.Second,
    Multiplier:   2.0,
    MaxRetries:   5,
}

// Wait returns the duration to sleep before attempt number `attempt` (0-indexed).
// Returns -1 when retries are exhausted.
func (b BackoffConfig) Wait(attempt int) time.Duration {
    if attempt >= b.MaxRetries {
        return -1
    }
    // Compute the exponential ceiling (named ceiling to avoid shadowing the builtin cap)
    ceiling := float64(b.InitialDelay) * math.Pow(b.Multiplier, float64(attempt))
    if ceiling > float64(b.MaxDelay) {
        ceiling = float64(b.MaxDelay)
    }
    // Full Jitter: uniform random in [0, ceiling)
    return time.Duration(rand.Float64() * ceiling)
}

func fetchWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var lastErr error
    for attempt := 0; attempt <= DefaultBackoff.MaxRetries; attempt++ {
        if attempt > 0 {
            wait := DefaultBackoff.Wait(attempt - 1)
            if wait < 0 {
                break
            }
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }

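        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err // malformed URL; retrying cannot help
        }
        resp, err := client.Do(req)
        // Success means neither a transport error nor a retryable status (429 or 5xx)
        if err == nil && resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
            return resp, nil
        }
        if err != nil {
            lastErr = err
        } else {
            resp.Body.Close()
            lastErr = fmt.Errorf("retryable status: %d", resp.StatusCode)
        }
    }
    // One initial attempt plus MaxRetries retries
    return nil, fmt.Errorf("failed after %d attempts: %w", DefaultBackoff.MaxRetries+1, lastErr)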
}

Robots.txt and Crawl Politeness

robots.txt is the standard protocol (RFC 9309) by which websites declare which paths are off-limits to automated agents. A polite crawler must respect it.

import (
    "net/http"
    "net/url"
    "sync"

    "github.com/temoto/robotstxt"
)

type RobotsCache struct {
    mu    sync.RWMutex
    cache map[string]*robotstxt.RobotsData
}

func (rc *RobotsCache) IsAllowed(userAgent, rawURL string) bool {
    u, err := url.Parse(rawURL)
    if err != nil {
        return false
    }
    host := u.Scheme + "://" + u.Host

    rc.mu.RLock()
    data, ok := rc.cache[host]
    rc.mu.RUnlock()

    if !ok {
        data = rc.fetchRobots(host)
        rc.mu.Lock()
        rc.cache[host] = data
        rc.mu.Unlock()
    }

    if data == nil {
        return true // could not fetch robots.txt; allow by default
    }
    return data.TestAgent(u.Path, userAgent)
}

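func (rc *RobotsCache) fetchRobots(host string) *robotstxt.RobotsData {
    resp, err := http.Get(host + "/robots.txt")
    if err != nil {
        return nil // network failure; the caller treats nil as "allow"
    }
    defer resp.Body.Close() // close on every path, including non-200 responses
    // FromResponse applies status-code rules itself: a 2xx body is parsed,
    // 4xx means no restrictions, and 5xx means everything is disallowed.
    data, err := robotstxt.FromResponse(resp)
    if err != nil {
        return nil
    }
    return data
}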
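RFC 9309 also specifies what a crawler should conclude from the HTTP status of the robots.txt request itself. A minimal sketch of that decision follows; robotsPolicy and policyForStatus are illustrative names, and treating any other status (for example an unfollowed redirect) conservatively as a disallow is an assumption of this sketch.

```go
package main

import "fmt"

// robotsPolicy says how to treat a host based on the robots.txt response.
type robotsPolicy string

const (
	parseBody   robotsPolicy = "parse the rules in the body"
	allowAll    robotsPolicy = "no rules available: crawling is allowed"
	disallowAll robotsPolicy = "server unreachable: assume complete disallow"
)

// policyForStatus maps the HTTP status of GET /robots.txt to a policy,
// following the status handling described in RFC 9309.
func policyForStatus(status int) robotsPolicy {
	switch {
	case status >= 200 && status < 300:
		return parseBody
	case status >= 400 && status < 500:
		return allowAll // "unavailable": the file does not exist or is forbidden
	default:
		return disallowAll // 5xx "unreachable", plus anything unexpected
	}
}

func main() {
	for _, status := range []int{200, 404, 503} {
		fmt.Printf("%d: %s\n", status, policyForStatus(status))
	}
}
```

The distinction matters in practice: a missing robots.txt (404) imposes no restrictions, whereas a server that is failing (503) should be left alone until it recovers.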

Level 3 · Building a Complete Scraper

Colly vs. Raw net/http

colly is the most popular scraping framework in the Go ecosystem. It handles much of the boilerplate:

import "github.com/gocolly/colly/v2"

func scrapeWithColly(startURL string) []Article {
    var articles []Article
    var mu sync.Mutex

    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
        colly.MaxDepth(3),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       500 * time.Millisecond,
        RandomDelay: 200 * time.Millisecond,
    })

    c.OnHTML("article.post", func(e *colly.HTMLElement) {
        article := Article{
            Title: e.ChildText("h2.title"),
            URL:   e.ChildAttr("a.read-more", "href"),
            Date:  e.ChildText("time"),
        }
        mu.Lock()
        articles = append(articles, article)
        mu.Unlock()

        c.Visit(e.Request.AbsoluteURL(article.URL))
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error scraping %s: %v", r.Request.URL, err)
    })

    c.Visit(startURL)
    c.Wait()
    return articles
}

Colly has limitations: fine-grained retry logic and custom rate-limit strategies are awkward to implement within its callback architecture. For complex pipelines, raw net/http + goquery gives more control.

HTML Parsing: goquery and golang.org/x/net/html

goquery provides a jQuery-style CSS selector API backed by the golang.org/x/net/html parse tree:

import (
    "fmt"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "golang.org/x/net/html"
)

type Article struct {
    Title   string
    URL     string
    Content string
    Date    string
    Tags    []string
}

func parseArticlePage(resp *http.Response) (*Article, error) {
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, fmt.Errorf("parse HTML: %w", err)
    }

    article := &Article{}
    article.Title = strings.TrimSpace(doc.Find("h1.article-title").Text())
    article.Date = doc.Find("time[datetime]").AttrOr("datetime", "")

    doc.Find("a.tag").Each(func(i int, s *goquery.Selection) {
        article.Tags = append(article.Tags, strings.TrimSpace(s.Text()))
    })

    contentHTML, _ := doc.Find("div.article-content").Html()
    article.Content = htmlToText(contentHTML)

    return article, nil
}

// htmlToText converts HTML to plain text using the golang.org/x/net/html tokenizer
func htmlToText(htmlStr string) string {
    tokenizer := html.NewTokenizer(strings.NewReader(htmlStr))
    var sb strings.Builder
    for {
        tt := tokenizer.Next()
        switch tt {
        case html.ErrorToken:
            return sb.String()
        case html.TextToken:
            sb.Write(tokenizer.Text())
        case html.StartTagToken:
            tag, _ := tokenizer.TagName()
            if string(tag) == "br" || string(tag) == "p" {
                sb.WriteRune('\n')
            }
        }
    }
}

Link Extraction and Deduplication: Bloom Filter

A scraper's most basic requirement is never re-fetching a URL it has already visited. The naive approach is map[string]bool, but at millions of URLs the memory footprint becomes unacceptable.

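A Bloom Filter is the classical data structure for this problem: it uses a tiny amount of memory to answer the question "have I seen this before?" with zero false negatives and a configurable false-positive rate. The price of a false positive is that a URL never seen before is occasionally treated as visited and skipped, so choose the rate with that loss in mind.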
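How tiny is "tiny"? The standard sizing formulas give m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions for n items at false-positive rate p. The sketch below evaluates them; bloomParams is an illustrative helper, and rounding both values up is one common convention.

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the number of bits (m) and hash functions (k) for a
// Bloom filter holding n items at false-positive rate p, using
//   m = -n * ln(p) / (ln 2)^2   and   k = (m / n) * ln 2,
// both rounded up.
func bloomParams(n uint, p float64) (m uint, k uint) {
	bits := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	hashes := math.Ceil(bits / float64(n) * math.Ln2)
	return uint(bits), uint(hashes)
}

func main() {
	m, k := bloomParams(1_000_000, 0.01)
	fmt.Printf("bits: %d (about %.2f MB), hash functions: %d\n", m, float64(m)/8/1e6, k)
}
```

For one million URLs at a 1 % false-positive rate this comes to roughly 9.6 million bits, about 1.2 MB, with 7 hash functions, regardless of how long the URLs are.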

import (
    "net/url"
    "sync"

    "github.com/PuerkitoBio/goquery"
    "github.com/bits-and-blooms/bloom/v3"
)

type Deduplicator struct {
    filter *bloom.BloomFilter
    mu     sync.Mutex
}

// NewDeduplicator creates a Bloom Filter sized for n expected items at false-positive rate fp
func NewDeduplicator(n uint, fp float64) *Deduplicator {
    return &Deduplicator{
        filter: bloom.NewWithEstimates(n, fp),
    }
}

// SeenOrAdd returns true if the URL was already seen (skip it).
// Returns false on first encounter and adds the URL to the filter.
func (d *Deduplicator) SeenOrAdd(url string) bool {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.filter.TestString(url) {
        return true
    }
    d.filter.AddString(url)
    return false
}

func extractLinks(doc *goquery.Document, baseURL *url.URL) []string {
    var links []string
    doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
        href, exists := s.Attr("href")
        if !exists {
            return
        }
        u, err := baseURL.Parse(href)
        if err != nil {
            return
        }
        if (u.Scheme == "http" || u.Scheme == "https") && u.Host == baseURL.Host {
            u.Fragment = "" // strip #section
            links = append(links, u.String())
        }
    })
    return links
}

Putting It All Together

package scraper

import (
    "bytes"
    "context"
    "encoding/csv"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "net/url"
    "os"
    "strings"
    "sync"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/temoto/robotstxt"
    "golang.org/x/time/rate"
)

type Scraper struct {
    client    *http.Client
    limiter   *HostLimiter
    dedup     *Deduplicator
    robots    *RobotsCache
    workers   int
    userAgent string
}

func NewScraper(rps float64, workers int) *Scraper {
    return &Scraper{
        client: &http.Client{
            Timeout: 30 * time.Second,
            Transport: &http.Transport{
                MaxIdleConnsPerHost: 10,
                IdleConnTimeout:     90 * time.Second,
            },
        },
        limiter:   NewHostLimiter(rps, int(rps*2)),
        dedup:     NewDeduplicator(1_000_000, 0.01), // 1M URLs, 1% false-positive rate
        robots:    &RobotsCache{cache: make(map[string]*robotstxt.RobotsData)},
        workers:   workers,
        userAgent: "MyBot/1.0 (+https://example.com/bot)",
    }
}

type Job struct {
    URL   string
    Depth int
}

type Result struct {
    URL     string
    Article *Article
    Err     error
}

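func (s *Scraper) Run(ctx context.Context, seedURLs []string, maxDepth int) []Article {
    jobs := make(chan Job, 10000)
    results := make(chan Result, 10000)

    // pending counts jobs that are queued or in flight. When it drops to zero
    // the frontier is empty and jobs can be closed; without that close, the
    // workers' range loops never end and wg.Wait() below blocks forever.
    var pending sync.WaitGroup
    enqueue := func(j Job) {
        pending.Add(1)
        select {
        case jobs <- j:
        default:
            pending.Done() // queue full; skip rather than block
        }
    }

    // Seed synchronously so pending is non-zero before anyone waits on it
    for _, u := range seedURLs {
        if !s.dedup.SeenOrAdd(u) {
            enqueue(Job{URL: u, Depth: 0})
        }
    }

    var wg sync.WaitGroup
    for i := 0; i < s.workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                // processJob reports discovered links on a private channel so
                // each one is counted in pending before joining the shared queue
                found := make(chan Job, 1024)
                s.processJob(ctx, job, found, results, maxDepth)
                close(found)
                for j := range found {
                    enqueue(j)
                }
                pending.Done() // this job is finished
            }
        }()
    }

    go func() {
        pending.Wait()
        close(jobs)
    }()

    var articles []Article
    var collectWg sync.WaitGroup
    collectWg.Add(1)
    go func() {
        defer collectWg.Done()
        for r := range results {
            if r.Err != nil {
                log.Printf("Error: %s: %v", r.URL, r.Err)
                continue
            }
            if r.Article != nil {
                articles = append(articles, *r.Article)
            }
        }
    }()

    wg.Wait()
    close(results)
    collectWg.Wait()

    return articles
}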

func (s *Scraper) processJob(ctx context.Context, job Job, jobs chan<- Job, results chan<- Result, maxDepth int) {
    if !s.robots.IsAllowed(s.userAgent, job.URL) {
        return
    }

    u, _ := url.Parse(job.URL)
    if err := s.limiter.Get(u.Host).Wait(ctx); err != nil {
        return
    }

    resp, err := fetchWithRetry(ctx, s.client, job.URL)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
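    defer resp.Body.Close()

    // Read the body once: goquery consumes the reader, so a second parse of
    // resp.Body would see an empty document. Needs "bytes" and "io" imported.
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    doc, err := goquery.NewDocumentFromReader(bytes.NewReader(body))
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }

    resp.Body = io.NopCloser(bytes.NewReader(body)) // rewind for parseArticlePage
    article, err := parseArticlePage(resp)
    if article != nil {
        article.URL = job.URL // parseArticlePage does not know the URL
    }
    results <- Result{URL: job.URL, Article: article, Err: err}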

    if job.Depth < maxDepth {
        for _, link := range extractLinks(doc, u) {
            if !s.dedup.SeenOrAdd(link) {
                select {
                case jobs <- Job{URL: link, Depth: job.Depth + 1}:
                default:
                    // queue full; skip rather than block
                }
            }
        }
    }
}

func SaveJSON(articles []Article, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    enc := json.NewEncoder(f)
    enc.SetIndent("", "  ")
    return enc.Encode(articles)
}

func SaveCSV(articles []Article, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
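    w := csv.NewWriter(f)
    w.Write([]string{"URL", "Title", "Date", "Tags"})
    for _, a := range articles {
        w.Write([]string{a.URL, a.Title, a.Date, strings.Join(a.Tags, "|")})
    }
    // Flush before checking Error: a deferred Flush would run only after the
    // return value had been computed, hiding any write failure
    w.Flush()
    return w.Error()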
}

Level 4 · Advanced Topics and Edge Cases

Distributed Scraping with a Redis Work Queue

When scraper scale exceeds single-machine capacity, the task queue must move to Redis, enabling multi-process cooperation:

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

type RedisQueue struct {
    rdb   *redis.Client
    key   string
    dedup string // Redis Set used for deduplication
}

func NewRedisQueue(addr, queueKey, dedupKey string) *RedisQueue {
    return &RedisQueue{
        rdb:   redis.NewClient(&redis.Options{Addr: addr}),
        key:   queueKey,
        dedup: dedupKey,
    }
}

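func (q *RedisQueue) Push(ctx context.Context, urls ...string) error {
    for _, u := range urls {
        // SADD returns 1 only for the first caller to add u, so the set doubles
        // as the dedup check. A Lua script would make SADD+RPUSH atomic; this
        // simplified version can lose a URL if the process dies between the two.
        added, err := q.rdb.SAdd(ctx, q.dedup, u).Result()
        if err != nil {
            return err
        }
        if added == 0 {
            continue // already queued or fetched by some worker
        }
        if err := q.rdb.RPush(ctx, q.key, u).Err(); err != nil {
            return err
        }
    }
    return nil
}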

func (q *RedisQueue) Pop(ctx context.Context, timeout time.Duration) (string, error) {
    result, err := q.rdb.BLPop(ctx, timeout, q.key).Result()
    if err != nil {
        return "", err
    }
    if len(result) < 2 {
        return "", fmt.Errorf("unexpected BLPop result")
    }
    return result[1], nil
}

The critical concern in distributed scraping is deduplication consistency. Using a Redis Set for global dedup ensures no two worker processes re-fetch the same URL. When URL counts reach hundreds of millions, a Redis Set consumes too much memory; production systems switch to the RedisBloom module (a server-side Bloom Filter).

Handling JavaScript-Heavy Sites: chromedp

Many modern sites render content entirely in JavaScript via React or Vue. Static HTTP requests cannot capture that content — you need a headless browser:

import (
    "context"
    "fmt"
    "time"

    "github.com/chromedp/chromedp"
)

func scrapeJSPage(targetURL string) (string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var htmlContent string
    err := chromedp.Run(ctx,
        chromedp.Navigate(targetURL),
        // Wait until a specific element is visible (signals JS has rendered)
        chromedp.WaitVisible("#main-content", chromedp.ByID),
        chromedp.OuterHTML("html", &htmlContent),
    )
    if err != nil {
        return "", fmt.Errorf("chromedp: %w", err)
    }
    return htmlContent, nil
}

// Simulate scrolling to trigger infinite-scroll lazy loading
func scrapeInfiniteScroll(targetURL string) ([]string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var items []string
    err := chromedp.Run(ctx,
        chromedp.Navigate(targetURL),
        chromedp.ActionFunc(func(ctx context.Context) error {
            for i := 0; i < 5; i++ {
                if err := chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil).Do(ctx); err != nil {
                    return err
                }
                // Unlike time.Sleep, chromedp.Sleep returns early when ctx is canceled
                if err := chromedp.Sleep(2 * time.Second).Do(ctx); err != nil {
                    return err
                }
                // The query returns every item loaded so far, so each pass replaces the slice
                if err := chromedp.Evaluate(
                    `Array.from(document.querySelectorAll('.item-title')).map(e => e.textContent)`,
                    &items,
                ).Do(ctx); err != nil {
                    return err
                }
            }
            return nil
        }),
    )
    return items, err
}

Each chromedp browser instance consumes roughly 100–300 MB of memory, far more than a plain HTTP request. A practical strategy: attempt a static fetch first; if the expected content is missing, fall back to chromedp for that URL only.

Proxy Rotation and TLS Fingerprint Management

Sophisticated anti-bot systems analyze not just request frequency but also TLS handshake characteristics (the JA3 fingerprint) and HTTP/2 frame ordering. The standard Go HTTP client produces a distinctive TLS fingerprint that differs from those of mainstream browsers, which makes it easy to recognize.

type ProxyPool struct {
    proxies []string
    mu      sync.Mutex
    index   int
}

func (p *ProxyPool) Next() *url.URL {
    p.mu.Lock()
    defer p.mu.Unlock()
    proxy := p.proxies[p.index%len(p.proxies)]
    p.index++
    u, _ := url.Parse(proxy)
    return u
}

func newClientWithProxy(proxyURL *url.URL) *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            Proxy: http.ProxyURL(proxyURL),
        },
        Timeout: 30 * time.Second,
    }
}

For TLS fingerprint spoofing, github.com/refraction-networking/utls can mimic Chrome or Firefox TLS handshakes. Apply this technique only where legally and ethically permissible — always verify the target site's terms of service.
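For the proxy side, an alternative to building one client per proxy is to let a single Transport choose a proxy for every request through its Proxy hook. The sketch below assumes round-robin selection is acceptable; rotatingProxies and its methods are illustrative names, and the addresses are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
	"time"
)

// rotatingProxies hands out proxy URLs in round-robin order.
type rotatingProxies struct {
	next uint64 // kept first so 64-bit atomic access stays aligned on 32-bit platforms
	urls []*url.URL
}

// newRotatingProxies parses the proxy list once, up front, so a bad entry
// is reported at startup rather than in the middle of a crawl.
func newRotatingProxies(raw []string) (*rotatingProxies, error) {
	if len(raw) == 0 {
		return nil, errors.New("no proxies configured")
	}
	p := &rotatingProxies{}
	for _, r := range raw {
		u, err := url.Parse(r)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", r, err)
		}
		p.urls = append(p.urls, u)
	}
	return p, nil
}

// pick has the signature expected by http.Transport.Proxy.
func (p *rotatingProxies) pick(*http.Request) (*url.URL, error) {
	n := atomic.AddUint64(&p.next, 1) - 1
	return p.urls[n%uint64(len(p.urls))], nil
}

func main() {
	pool, err := newRotatingProxies([]string{"http://10.0.0.1:8080", "http://10.0.0.2:8080"})
	if err != nil {
		panic(err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: pool.pick},
		Timeout:   30 * time.Second,
	}
	first, _ := pool.pick(nil)
	fmt.Println("client timeout:", client.Timeout, "first proxy:", first.Host)
}
```

A production pool would also track failures and temporarily retire proxies that stop responding; that bookkeeping is left out here.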

Structured Data Extraction: JSON-LD and Microdata

Many sites embed structured data (Schema.org) for SEO purposes. This is often far easier to parse than the surrounding HTML:

func extractJSONLD(doc *goquery.Document) map[string]interface{} {
    var result map[string]interface{}
    doc.Find(`script[type="application/ld+json"]`).Each(func(i int, s *goquery.Selection) {
        if result != nil {
            return
        }
        var data map[string]interface{}
        if err := json.Unmarshal([]byte(s.Text()), &data); err == nil {
            result = data
        }
    })
    return result
}

func extractArticleMetadata(doc *goquery.Document) *Article {
    ld := extractJSONLD(doc)
    if ld == nil {
        return nil
    }
    article := &Article{}
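    // Schema.org Article types usually carry the title in "headline"; "name" is the generic fallback
    if headline, ok := ld["headline"].(string); ok {
        article.Title = headline
    } else if name, ok := ld["name"].(string); ok {
        article.Title = name
    }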
    if date, ok := ld["datePublished"].(string); ok {
        article.Date = date
    }
    return article
}

JSON-LD parsing is more stable than CSS selectors because structured-data schemas evolve far more slowly than page layouts. In scraper design, attempt JSON-LD extraction first, then fall back to HTML parsing — this layered strategy maximizes resilience.
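One practical wrinkle is that a JSON-LD script is not always a single object: it may be a top-level array, or an object whose nodes sit under "@graph", and "@type" may itself be a list. The sketch below searches all of those shapes using only encoding/json; findJSONLD and its helpers are illustrative names, not part of any library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// findJSONLD returns the first JSON-LD node whose @type matches want.
// It accepts a single object, a top-level array, or an "@graph" container.
func findJSONLD(raw []byte, want string) map[string]interface{} {
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil
	}
	return searchNode(v, want)
}

func searchNode(v interface{}, want string) map[string]interface{} {
	switch node := v.(type) {
	case []interface{}:
		for _, item := range node {
			if m := searchNode(item, want); m != nil {
				return m
			}
		}
	case map[string]interface{}:
		if hasType(node["@type"], want) {
			return node
		}
		if graph, ok := node["@graph"]; ok {
			return searchNode(graph, want)
		}
	}
	return nil
}

// hasType handles both "@type": "Article" and "@type": ["Article", ...].
func hasType(t interface{}, want string) bool {
	switch tv := t.(type) {
	case string:
		return tv == want
	case []interface{}:
		for _, x := range tv {
			if s, ok := x.(string); ok && s == want {
				return true
			}
		}
	}
	return false
}

func main() {
	raw := []byte(`{"@context":"https://schema.org","@graph":[
		{"@type":"WebSite","name":"Example"},
		{"@type":["Article","NewsArticle"],"headline":"Hello","datePublished":"2024-01-02"}]}`)
	if node := findJSONLD(raw, "Article"); node != nil {
		fmt.Println(node["headline"], node["datePublished"])
	}
}
```

To use it with goquery, pass []byte(s.Text()) for each script[type="application/ld+json"] element and stop at the first non-nil result.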

Engineering Principles Summary

Building a production-grade Go scraper requires internalizing these core principles:

Concurrency control — use a Worker Pool or Semaphore to cap goroutine count and prevent memory exhaustion
Rate limiting — apply per-host token buckets (x/time/rate) to respect server capacity
Error handling — retry with exponential backoff plus jitter; avoid the Thundering Herd
Deduplication — in-memory Bloom Filter for small scale; Redis + RedisBloom for large scale
Politeness — honor robots.txt; declare an honest User-Agent
Observability — log the disposition of every URL (success / error / skipped) for post-analysis
Progressive complexity — start with static HTTP, add chromedp only where JS rendering is confirmed necessary

Go's goroutine model makes high-concurrency scraping feel natural. A well-designed Go scraper can deliver high concurrency and solid engineering quality in comparatively little code.
