Chapter 29

Web Scraping: Concurrency and Rate Limiting

The internet holds vast stores of publicly accessible data, yet that data rarely arrives pre-packaged in a structured form. A web scraper is the engineering tool that converts unstructured HTML pages into structured records. Price monitoring, academic research, competitive analysis, and search-engine indexing all depend on scrapers as foundational infrastructure.

But scraping is not merely "send an HTTP request, parse HTML." Real, industrial-grade scrapers face three core challenges: throughput (fetching large numbers of URLs in reasonable time), politeness (not overwhelming the target server), and robustness (surviving network failures, anti-bot countermeasures, and JavaScript-rendered pages). Go exhibits distinctive advantages when confronting all three challenges.

This chapter begins with Go's concurrency model, dives deep into the Worker Pool pattern, token-bucket rate limiting, and exponential backoff with jitter, and then assembles a complete, production-ready scraper by combining these components.


Level 1 · Why Go Excels at Scraping

Where the Bottleneck Lives

Before discussing Go, we need to understand where a scraper's performance actually bottlenecks.

A typical single-page fetch looks like: establish TCP connection → TLS handshake → send HTTP request → wait for server response → receive response body → parse HTML. In this pipeline, the CPU is actually busy only during "establish connection" and "parse HTML." For the rest of the time, the program is waiting on network I/O.

For a page with 100 ms network latency and 5 ms HTML parse time, CPU utilization stays below 5 %. If you serialize fetches on a single thread, enormous CPU time is wasted waiting. This is why web scrapers are inherently I/O-bound applications, and concurrency is the primary lever for improving throughput.

Threads vs. Goroutines: The Cost of Concurrency

The classical solution is OS threads. But each OS thread carries roughly 1–8 MB of overhead (mostly a fixed-size stack), is scheduled by the kernel, and incurs context-switch latency of 1–10 µs. For a scraper that needs to maintain thousands of concurrent connections, thread count itself becomes the bottleneck.

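Go's goroutines address this problem directly. Each goroutine starts with a stack of only a few kilobytes that grows on demand, and the Go runtime multiplexes many goroutines onto a small number of OS threads. When a goroutine waits on network I/O, the runtime parks it and runs another goroutine on the same thread.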

This lets you write concurrent logic in a sequential style:

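// Looks like synchronous code; the runtime parks the goroutine while it waits
resp, err := http.Get(url) // goroutine suspends here, but the OS thread does not block
if err != nil {
    return err
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body) // check err before using body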

Engineering Quality of Go's HTTP Client

Go's standard net/http package is not a thin wrapper. It ships production-grade features out of the box:

client := &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
        DisableCompression:  false, // handles gzip automatically
    },
}

Understanding these defaults and their tuning knobs is the first step toward building an efficient scraper.


Level 2 · Core Patterns and Algorithms

The Worker Pool Pattern

A Worker Pool is the canonical pattern for "processing an unlimited stream of tasks with a fixed number of goroutines." The core problem it solves is: preventing uncontrolled goroutine creation from exhausting memory.

Channel-Based Worker Pool

func crawlWithWorkerPool(urls []string, workerCount int) []Result {
    jobs := make(chan string, len(urls))
    results := make(chan Result, len(urls))

    // Launch a fixed number of workers
    var wg sync.WaitGroup
    for i := 0; i < workerCount; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for url := range jobs {
                result := fetch(url)
                results <- result
            }
        }()
    }

    // Enqueue all tasks
    for _, url := range urls {
        jobs <- url
    }
    close(jobs)

    // After all workers finish, close the results channel
    go func() {
        wg.Wait()
        close(results)
    }()

    // Drain results
    var all []Result
    for r := range results {
        all = append(all, r)
    }
    return all
}

A few key design decisions deserve explanation:

  1. The jobs channel is buffered to len(urls) so the main goroutine can enqueue everything immediately without blocking.
  2. After close(jobs), for url := range jobs exits automatically when the channel is drained, which is the idiomatic Go way to signal "no more work."
  3. A dedicated goroutine waits on the WaitGroup before closing results, preventing a deadlock.
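To see the concurrency cap in action, the following sketch runs the same pool shape with a stubbed fetch (a one-millisecond sleep) and records the peak number of jobs running at once. runPool and its counters are illustrative names introduced here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runPool processes n jobs with the given number of workers and returns
// the number of jobs processed and the peak number running at once.
func runPool(n, workers int) (processed, peak int64) {
	jobs := make(chan int, n)
	var running int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				cur := atomic.AddInt64(&running, 1)
				// Record the highest concurrency level observed so far.
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				time.Sleep(time.Millisecond) // stand-in for a network fetch
				atomic.AddInt64(&processed, 1)
				atomic.AddInt64(&running, -1)
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs) // workers exit their range loops once the queue is drained
	wg.Wait()
	return processed, peak
}

func main() {
	processed, peak := runPool(100, 8)
	fmt.Printf("processed %d jobs, peak concurrency %d\n", processed, peak)
}
```

However many jobs are queued, the peak never exceeds the worker count, because only the workers ever call the fetch function.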

Semaphore-Based Concurrency Control

An equivalent approach uses a buffered channel to simulate a semaphore:

type Semaphore chan struct{}

func NewSemaphore(n int) Semaphore {
    return make(chan struct{}, n)
}

func (s Semaphore) Acquire() { s <- struct{}{} }
func (s Semaphore) Release() { <-s }

func crawlWithSemaphore(urls []string, maxConcurrent int) []Result {
    sem := NewSemaphore(maxConcurrent)
    var mu sync.Mutex
    var results []Result
    var wg sync.WaitGroup

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem.Acquire()
            defer sem.Release()

            result := fetch(u)
            mu.Lock()
            results = append(results, result)
            mu.Unlock()
        }(url)
    }

    wg.Wait()
    return results
}

The semaphore approach is more flexible: each URL gets its own goroutine, but concurrency is throttled. The Worker Pool reuses a fixed set of goroutines, reducing goroutine creation/teardown overhead. For large, fixed-length work lists, the Worker Pool is generally preferable.

Token Bucket Rate Limiting: golang.org/x/time/rate

Politeness is central to responsible scraping. Rate limiting ensures that, regardless of how full the task queue is, the request rate to any single target server never exceeds a configured threshold.

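The token bucket algorithm is the most widely used rate-limiting approach. Tokens are added to a bucket at a fixed rate, up to a maximum capacity (the burst size). Each request must take one token before it proceeds; if the bucket is empty, the request waits until a token arrives. The rate caps sustained throughput, while the capacity allows short bursts.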

golang.org/x/time/rate implements token bucket:

import "golang.org/x/time/rate"

// 2 requests per second; burst of up to 5
var limiter = rate.NewLimiter(rate.Limit(2), 5)

func fetchWithRateLimit(ctx context.Context, limiter *rate.Limiter, url string) (*http.Response, error) {
    // Wait blocks until a token is available or the context is canceled
    if err := limiter.Wait(ctx); err != nil {
        return nil, err
    }
    // Attach ctx so cancellation also aborts the request itself
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}

In a scraper, you typically need per-host limiters, applying different rates to different domains:

type HostLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    rps      float64
    burst    int
}

func NewHostLimiter(rps float64, burst int) *HostLimiter {
    return &HostLimiter{
        limiters: make(map[string]*rate.Limiter),
        rps:      rps,
        burst:    burst,
    }
}

func (hl *HostLimiter) Get(host string) *rate.Limiter {
    hl.mu.Lock()
    defer hl.mu.Unlock()
    if l, ok := hl.limiters[host]; ok {
        return l
    }
    l := rate.NewLimiter(rate.Limit(hl.rps), hl.burst)
    hl.limiters[host] = l
    return l
}

Exponential Backoff with Jitter

Network requests inevitably fail: server timeouts, 429 Too Many Requests, transient network errors. The correct retry strategy is exponential backoff: each failure doubles the wait time, preventing continuous hammering of an already-overloaded server.

Pure exponential backoff has one flaw: if many clients fail simultaneously and all retry on the same schedule, they produce a Thundering Herd at retry time. Adding random jitter scatters the retry moments.

import (
    "context"
    "fmt"
    "math"
    "math/rand"
    "net/http"
    "time"
)

type BackoffConfig struct {
    InitialDelay time.Duration
    MaxDelay     time.Duration
    Multiplier   float64
    MaxRetries   int
}

var DefaultBackoff = BackoffConfig{
    InitialDelay: 1 * time.Second,
    MaxDelay:     60 * time.Second,
    Multiplier:   2.0,
    MaxRetries:   5,
}

// Wait returns the duration to sleep before attempt number `attempt` (0-indexed).
// Returns -1 when retries are exhausted.
func (b BackoffConfig) Wait(attempt int) time.Duration {
    if attempt >= b.MaxRetries {
        return -1
    }
    // Compute the exponential ceiling (named to avoid shadowing the builtin cap)
    ceiling := float64(b.InitialDelay) * math.Pow(b.Multiplier, float64(attempt))
    if ceiling > float64(b.MaxDelay) {
        ceiling = float64(b.MaxDelay)
    }
    // Full Jitter: uniform random in [0, ceiling)
    return time.Duration(rand.Float64() * ceiling)
}

func fetchWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var lastErr error
    for attempt := 0; attempt <= DefaultBackoff.MaxRetries; attempt++ {
        if attempt > 0 {
            wait := DefaultBackoff.Wait(attempt - 1)
            if wait < 0 {
                break
            }
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }

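        resp, err := client.Get(url)
        // Treat 429 and 5xx as retryable; return every other response to the caller.
        if err == nil && resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
            return resp, nil
        }
        if err != nil {
            lastErr = err
        } else {
            resp.Body.Close()
            lastErr = fmt.Errorf("retryable status: %d", resp.StatusCode)
        }
    }
    // The loop makes one initial attempt plus up to MaxRetries retries.
    return nil, fmt.Errorf("failed after %d attempts: %w", DefaultBackoff.MaxRetries+1, lastErr)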
}

Robots.txt and Crawl Politeness

robots.txt is the standard protocol (RFC 9309) by which websites declare which paths are off-limits to automated agents. A polite crawler must respect it.

import "github.com/temoto/robotstxt"

type RobotsCache struct {
    mu    sync.RWMutex
    cache map[string]*robotstxt.RobotsData
}

func (rc *RobotsCache) IsAllowed(userAgent, rawURL string) bool {
    u, err := url.Parse(rawURL)
    if err != nil {
        return false
    }
    host := u.Scheme + "://" + u.Host

    rc.mu.RLock()
    data, ok := rc.cache[host]
    rc.mu.RUnlock()

    if !ok {
        data = rc.fetchRobots(host)
        rc.mu.Lock()
        rc.cache[host] = data
        rc.mu.Unlock()
    }

    if data == nil {
        return true // could not fetch robots.txt; allow by default
    }
    return data.TestAgent(u.Path, userAgent)
}

func (rc *RobotsCache) fetchRobots(host string) *robotstxt.RobotsData {
    resp, err := http.Get(host + "/robots.txt")
    if err != nil {
        return nil
    }
    defer resp.Body.Close() // close on every path, including non-200 responses
    if resp.StatusCode != http.StatusOK {
        return nil
    }
    data, err := robotstxt.FromResponse(resp)
    if err != nil {
        return nil
    }
    return data
}

Level 3 · Building a Complete Scraper

Colly vs. Raw net/http

colly is one of the most widely used scraping frameworks in the Go ecosystem. It handles much of the boilerplate:

import "github.com/gocolly/colly/v2"

func scrapeWithColly(startURL string) []Article {
    var articles []Article
    var mu sync.Mutex

    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
        colly.MaxDepth(3),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       500 * time.Millisecond,
        RandomDelay: 200 * time.Millisecond,
    })

    c.OnHTML("article.post", func(e *colly.HTMLElement) {
        article := Article{
            Title: e.ChildText("h2.title"),
            URL:   e.ChildAttr("a.read-more", "href"),
            Date:  e.ChildText("time"),
        }
        mu.Lock()
        articles = append(articles, article)
        mu.Unlock()

        c.Visit(e.Request.AbsoluteURL(article.URL))
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error scraping %s: %v", r.Request.URL, err)
    })

    c.Visit(startURL)
    c.Wait()
    return articles
}

Colly has limitations: fine-grained retry logic and custom rate-limit strategies are awkward to implement within its callback architecture. For complex pipelines, raw net/http + goquery gives more control.

HTML Parsing: goquery and golang.org/x/net/html

goquery provides a jQuery-style CSS selector API backed by the golang.org/x/net/html parse tree:

import (
    "fmt"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "golang.org/x/net/html"
)

type Article struct {
    Title   string
    URL     string
    Content string
    Date    string
    Tags    []string
}

func parseArticlePage(resp *http.Response) (*Article, error) {
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, fmt.Errorf("parse HTML: %w", err)
    }

    article := &Article{}
    article.Title = strings.TrimSpace(doc.Find("h1.article-title").Text())
    article.Date = doc.Find("time[datetime]").AttrOr("datetime", "")

    doc.Find("a.tag").Each(func(i int, s *goquery.Selection) {
        article.Tags = append(article.Tags, strings.TrimSpace(s.Text()))
    })

    contentHTML, _ := doc.Find("div.article-content").Html()
    article.Content = htmlToText(contentHTML)

    return article, nil
}

// htmlToText converts HTML to plain text using the standard tokenizer
func htmlToText(htmlStr string) string {
    tokenizer := html.NewTokenizer(strings.NewReader(htmlStr))
    var sb strings.Builder
    for {
        tt := tokenizer.Next()
        switch tt {
        case html.ErrorToken:
            return sb.String()
        case html.TextToken:
            sb.Write(tokenizer.Text())
        case html.StartTagToken:
            tag, _ := tokenizer.TagName()
            if string(tag) == "br" || string(tag) == "p" {
                sb.WriteRune('\n')
            }
        }
    }
}

A scraper's most basic requirement is never re-fetching a URL it has already visited. The naive approach is map[string]bool, but at millions of URLs the memory footprint becomes unacceptable.

A Bloom Filter is the classical data structure for this problem: it uses a tiny amount of memory to answer the question "have I seen this before?" with zero false negatives and a configurable false-positive rate.

import "github.com/bits-and-blooms/bloom/v3"

type Deduplicator struct {
    filter *bloom.BloomFilter
    mu     sync.Mutex
}

// NewDeduplicator creates a Bloom Filter sized for n expected items at false-positive rate fp
func NewDeduplicator(n uint, fp float64) *Deduplicator {
    return &Deduplicator{
        filter: bloom.NewWithEstimates(n, fp),
    }
}

// SeenOrAdd returns true if the URL was already seen (skip it).
// Returns false on first encounter and adds the URL to the filter.
func (d *Deduplicator) SeenOrAdd(url string) bool {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.filter.TestString(url) {
        return true
    }
    d.filter.AddString(url)
    return false
}
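The memory saving can be estimated from the standard Bloom filter sizing formulas: for n items and false-positive rate p, the optimal bit count is m = -n ln p / (ln 2)^2 and the optimal number of hash functions is k = (m/n) ln 2. The sketch below evaluates them for the 1M-URL, 1% configuration; bloomParams is an illustrative function written for this example.

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the bit count m and hash-function count k that
// minimize memory for n items at false-positive rate p.
func bloomParams(n uint, p float64) (m, k uint) {
	bits := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	hashes := math.Ceil(bits / float64(n) * math.Ln2)
	return uint(bits), uint(hashes)
}

func main() {
	m, k := bloomParams(1_000_000, 0.01)
	fmt.Printf("bits: %d (about %.2f MiB), hash functions: %d\n",
		m, float64(m)/8/1024/1024, k)
}
```

That works out to about 9.6 bits per URL, a little over 1 MiB in total, whereas a map keyed by the URL strings must store every URL in full.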

func extractLinks(doc *goquery.Document, baseURL *url.URL) []string {
    var links []string
    doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
        href, exists := s.Attr("href")
        if !exists {
            return
        }
        u, err := baseURL.Parse(href)
        if err != nil {
            return
        }
        if (u.Scheme == "http" || u.Scheme == "https") && u.Host == baseURL.Host {
            u.Fragment = "" // strip #section
            links = append(links, u.String())
        }
    })
    return links
}
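Deduplication works best when equivalent URLs map to the same key. Stripping the fragment, as extractLinks does, is one step; the sketch below goes a little further by lowercasing the host, dropping default ports, and sorting query parameters. normalizeURL is a hypothetical helper, and sorting or re-encoding the query is an assumption that does not hold for every site, so treat it as a starting point.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL returns a canonical form of raw for use as a dedup key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	host := strings.ToLower(u.Hostname())
	if strings.Contains(host, ":") {
		host = "[" + host + "]" // restore brackets around IPv6 literals
	}
	port := u.Port()
	// Drop the port when it is the default for the scheme.
	if (u.Scheme == "http" && port == "80") || (u.Scheme == "https" && port == "443") {
		port = ""
	}
	if port != "" {
		host += ":" + port
	}
	u.Host = host
	u.Fragment = ""
	u.RawQuery = u.Query().Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"HTTP://Example.com:80/a?b=2&a=1#frag",
		"http://example.com/a?a=1&b=2",
	} {
		n, err := normalizeURL(raw)
		fmt.Println(n, err) // both print: http://example.com/a?a=1&b=2 <nil>
	}
}
```

Feeding the normalized string to the deduplicator, rather than the raw link, keeps trivially different spellings of one page from being fetched twice.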

Putting It All Together

package scraper

import (
    "context"
    "encoding/csv"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
    "os"
    "strings"
    "sync"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/temoto/robotstxt"
    "golang.org/x/time/rate"
)

type Scraper struct {
    client    *http.Client
    limiter   *HostLimiter
    dedup     *Deduplicator
    robots    *RobotsCache
    workers   int
    userAgent string
}

func NewScraper(rps float64, workers int) *Scraper {
    return &Scraper{
        client: &http.Client{
            Timeout: 30 * time.Second,
            Transport: &http.Transport{
                MaxIdleConnsPerHost: 10,
                IdleConnTimeout:     90 * time.Second,
            },
        },
        limiter:   NewHostLimiter(rps, int(rps*2)),
        dedup:     NewDeduplicator(1_000_000, 0.01), // 1M URLs, 1% false-positive rate
        robots:    &RobotsCache{cache: make(map[string]*robotstxt.RobotsData)},
        workers:   workers,
        userAgent: "MyBot/1.0 (+https://example.com/bot)",
    }
}

type Job struct {
    URL   string
    Depth int
}

type Result struct {
    URL     string
    Article *Article
    Err     error
}

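func (s *Scraper) Run(ctx context.Context, seedURLs []string, maxDepth int) []Article {
    jobs := make(chan Job, 10000)
    results := make(chan Result, 10000)

    // pending counts jobs that are queued or being processed. Workers exit
    // only when jobs is closed, so something must detect "no work left."
    var pending sync.WaitGroup

    var seeds []Job
    for _, u := range seedURLs {
        if !s.dedup.SeenOrAdd(u) {
            seeds = append(seeds, Job{URL: u, Depth: 0})
        }
    }
    pending.Add(len(seeds))
    go func() {
        for _, j := range seeds {
            jobs <- j
        }
    }()

    var wg sync.WaitGroup
    for i := 0; i < s.workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                s.processJob(ctx, job, jobs, results, &pending, maxDepth)
                pending.Done()
            }
        }()
    }

    // Close jobs once every queued job, and every link it discovered, is done.
    go func() {
        pending.Wait()
        close(jobs)
    }()

    var articles []Article
    var collectWg sync.WaitGroup
    collectWg.Add(1)
    go func() {
        defer collectWg.Done()
        for r := range results {
            if r.Err != nil {
                log.Printf("Error: %s: %v", r.URL, r.Err)
                continue
            }
            if r.Article != nil {
                articles = append(articles, *r.Article)
            }
        }
    }()

    wg.Wait()
    close(results)
    collectWg.Wait()

    return articles
}

func (s *Scraper) processJob(ctx context.Context, job Job, jobs chan<- Job, results chan<- Result, pending *sync.WaitGroup, maxDepth int) {
    if !s.robots.IsAllowed(s.userAgent, job.URL) {
        return
    }

    u, err := url.Parse(job.URL)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    if err := s.limiter.Get(u.Host).Wait(ctx); err != nil {
        return
    }

    resp, err := fetchWithRetry(ctx, s.client, job.URL)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    defer resp.Body.Close()

    // The body is a one-shot stream: parse it once and reuse the document
    // for both article extraction and link discovery.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }

    article := articleFromDoc(doc)
    article.URL = job.URL
    results <- Result{URL: job.URL, Article: article}

    if job.Depth < maxDepth {
        for _, link := range extractLinks(doc, u) {
            if s.dedup.SeenOrAdd(link) {
                continue
            }
            pending.Add(1) // count the job before it becomes visible to workers
            select {
            case jobs <- Job{URL: link, Depth: job.Depth + 1}:
            default:
                pending.Done() // queue full; skip rather than block
            }
        }
    }
}

// articleFromDoc applies the same selectors as parseArticlePage to an
// already-parsed document.
func articleFromDoc(doc *goquery.Document) *Article {
    article := &Article{
        Title: strings.TrimSpace(doc.Find("h1.article-title").Text()),
        Date:  doc.Find("time[datetime]").AttrOr("datetime", ""),
    }
    doc.Find("a.tag").Each(func(i int, s *goquery.Selection) {
        article.Tags = append(article.Tags, strings.TrimSpace(s.Text()))
    })
    contentHTML, _ := doc.Find("div.article-content").Html()
    article.Content = htmlToText(contentHTML)
    return article
}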

func SaveJSON(articles []Article, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    enc := json.NewEncoder(f)
    enc.SetIndent("", "  ")
    return enc.Encode(articles)
}

func SaveCSV(articles []Article, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
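    w := csv.NewWriter(f)
    if err := w.Write([]string{"URL", "Title", "Date", "Tags"}); err != nil {
        return err
    }
    for _, a := range articles {
        if err := w.Write([]string{a.URL, a.Title, a.Date, strings.Join(a.Tags, "|")}); err != nil {
            return err
        }
    }
    // Flush before checking Error so failures in buffered writes are reported
    w.Flush()
    return w.Error()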
}

Level 4 · Advanced Topics and Edge Cases

Distributed Scraping with a Redis Work Queue

When scraper scale exceeds single-machine capacity, the task queue must move to Redis, enabling multi-process cooperation:

import "github.com/redis/go-redis/v9"

type RedisQueue struct {
    rdb   *redis.Client
    key   string
    dedup string // Redis Set used for deduplication
}

func NewRedisQueue(addr, queueKey, dedupKey string) *RedisQueue {
    return &RedisQueue{
        rdb:   redis.NewClient(&redis.Options{Addr: addr}),
        key:   queueKey,
        dedup: dedupKey,
    }
}

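func (q *RedisQueue) Push(ctx context.Context, urls ...string) error {
    for _, u := range urls {
        // SAdd returns 1 only for the first caller to add u, so exactly one
        // process enqueues it. A crash between SAdd and RPush would lose the
        // URL; a Lua script would make the pair atomic.
        added, err := q.rdb.SAdd(ctx, q.dedup, u).Result()
        if err != nil {
            return err
        }
        if added == 0 {
            continue // already seen by this or another process
        }
        if err := q.rdb.RPush(ctx, q.key, u).Err(); err != nil {
            return err
        }
    }
    return nil
}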

func (q *RedisQueue) Pop(ctx context.Context, timeout time.Duration) (string, error) {
    result, err := q.rdb.BLPop(ctx, timeout, q.key).Result()
    if err != nil {
        return "", err
    }
    if len(result) < 2 {
        return "", fmt.Errorf("unexpected BLPop result")
    }
    return result[1], nil
}

The critical concern in distributed scraping is deduplication consistency. Using a Redis Set for global dedup ensures no two worker processes re-fetch the same URL. When URL counts reach hundreds of millions, a Redis Set consumes too much memory; production systems switch to the RedisBloom module (a server-side Bloom Filter).

Handling JavaScript-Heavy Sites: chromedp

Many modern sites render content entirely in JavaScript via React or Vue. Static HTTP requests cannot capture that content; you need a headless browser:

import (
    "context"
    "github.com/chromedp/chromedp"
    "time"
)

func scrapeJSPage(targetURL string) (string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var htmlContent string
    err := chromedp.Run(ctx,
        chromedp.Navigate(targetURL),
        // Wait until a specific element exists (signals JS has rendered)
        chromedp.WaitVisible("#main-content", chromedp.ByID),
        chromedp.OuterHTML("html", &htmlContent),
    )
    if err != nil {
        return "", fmt.Errorf("chromedp: %w", err)
    }
    return htmlContent, nil
}

// Simulate scrolling to trigger infinite-scroll lazy loading
func scrapeInfiniteScroll(targetURL string) ([]string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var items []string
    err := chromedp.Run(ctx,
        chromedp.Navigate(targetURL),
        chromedp.ActionFunc(func(ctx context.Context) error {
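            for i := 0; i < 5; i++ {
                if err := chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil).Do(ctx); err != nil {
                    return err
                }
                // Give lazy-loaded content time to arrive
                if err := chromedp.Sleep(2 * time.Second).Do(ctx); err != nil {
                    return err
                }
                var newItems []string
                if err := chromedp.Evaluate(
                    `Array.from(document.querySelectorAll('.item-title')).map(e => e.textContent)`,
                    &newItems,
                ).Do(ctx); err != nil {
                    return err
                }
                items = newItems
            }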
            return nil
        }),
    )
    return items, err
}

Each chromedp browser instance consumes roughly 100–300 MB of memory, far more than a plain HTTP request. A practical strategy: attempt a static fetch first; if expected content is missing, degrade to chromedp for that URL only.
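The static-first strategy can be expressed as a small wrapper that takes both fetchers as parameters. In this sketch the fetchers are stubs, and the marker string that signals "the content is present" is an assumption you would choose per site; fetchWithFallback and fetcher are names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// fetcher returns the HTML for a URL.
type fetcher func(url string) (string, error)

// fetchWithFallback tries the cheap static fetch first and uses the
// rendered fetch only when the marker text is missing from the result.
func fetchWithFallback(url, marker string, static, rendered fetcher) (html string, usedBrowser bool, err error) {
	html, err = static(url)
	if err == nil && strings.Contains(html, marker) {
		return html, false, nil
	}
	html, err = rendered(url)
	return html, true, err
}

func main() {
	// Stubs: the static response is an empty application shell.
	static := func(string) (string, error) { return `<div id="root"></div>`, nil }
	rendered := func(string) (string, error) {
		return `<div id="root"><h1 class="title">Hi</h1></div>`, nil
	}
	_, usedBrowser, err := fetchWithFallback("https://example.com", `class="title"`, static, rendered)
	fmt.Println(usedBrowser, err) // true <nil>
}
```

In a real scraper, static would wrap the HTTP client and rendered would wrap a function such as scrapeJSPage, so the expensive browser is started only for pages that need it.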

Proxy Rotation and TLS Fingerprint Management

Sophisticated anti-bot systems analyze not just request frequency but also TLS handshake characteristics (the JA3 fingerprint) and HTTP/2 frame ordering. The standard Go HTTP client has a fixed TLS fingerprint that is trivially identifiable.

type ProxyPool struct {
    proxies []string
    mu      sync.Mutex
    index   int
}

func (p *ProxyPool) Next() *url.URL {
    p.mu.Lock()
    defer p.mu.Unlock()
    proxy := p.proxies[p.index%len(p.proxies)]
    p.index++
    u, _ := url.Parse(proxy)
    return u
}

func newClientWithProxy(proxyURL *url.URL) *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            Proxy: http.ProxyURL(proxyURL),
        },
        Timeout: 30 * time.Second,
    }
}
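An alternative to building one client per proxy is to let a single transport choose a proxy for each request: http.Transport.Proxy accepts any function of type func(*http.Request) (*url.URL, error). The sketch below parses the proxy list once, rejects an empty list, and rotates with an atomic counter. rotatingProxy is a hypothetical helper and the proxy addresses are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// rotatingProxy returns a function suitable for http.Transport.Proxy that
// hands out the given proxies in round-robin order.
func rotatingProxy(rawProxies []string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawProxies) == 0 {
		return nil, errors.New("proxy list is empty")
	}
	proxies := make([]*url.URL, 0, len(rawProxies))
	for _, raw := range rawProxies {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", raw, err)
		}
		proxies = append(proxies, u)
	}
	var next uint64
	return func(*http.Request) (*url.URL, error) {
		i := atomic.AddUint64(&next, 1) - 1
		return proxies[i%uint64(len(proxies))], nil
	}, nil
}

func main() {
	pick, err := rotatingProxy([]string{"http://10.0.0.1:8080", "http://10.0.0.2:8080"})
	if err != nil {
		panic(err)
	}
	client := &http.Client{Transport: &http.Transport{Proxy: pick}}
	_ = client // requests made with client ask pick for their proxy

	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u.Host) // 10.0.0.1:8080, 10.0.0.2:8080, 10.0.0.1:8080
	}
}
```

Validating the list up front means a malformed proxy address fails at startup instead of surfacing as a nil URL in the middle of a crawl.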

For TLS fingerprint spoofing, github.com/refraction-networking/utls can mimic Chrome or Firefox TLS handshakes. Apply this technique only where legally and ethically permissible, and always verify the target site's terms of service.

Structured Data Extraction: JSON-LD and Microdata

Many sites embed structured data (Schema.org) for SEO purposes. This is often far easier to parse than the surrounding HTML:

func extractJSONLD(doc *goquery.Document) map[string]interface{} {
    var result map[string]interface{}
    doc.Find(`script[type="application/ld+json"]`).Each(func(i int, s *goquery.Selection) {
        if result != nil {
            return
        }
        var data map[string]interface{}
        if err := json.Unmarshal([]byte(s.Text()), &data); err == nil {
            result = data
        }
    })
    return result
}

func extractArticleMetadata(doc *goquery.Document) *Article {
    ld := extractJSONLD(doc)
    if ld == nil {
        return nil
    }
    article := &Article{}
    if name, ok := ld["name"].(string); ok {
        article.Title = name
    }
    if date, ok := ld["datePublished"].(string); ok {
        article.Date = date
    }
    return article
}

JSON-LD parsing is more stable than CSS selectors because structured-data schemas evolve far more slowly than page layouts. In scraper design, attempt JSON-LD extraction first, then fall back to HTML parsing; this layered strategy maximizes resilience.
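In practice a JSON-LD script tag may contain either a single object or an array of objects, and unmarshaling an array into a map fails. The sketch below accepts both shapes and then selects an object by its @type. decodeJSONLD and firstOfType are illustrative helpers, and the sample payload is invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONLD returns every top-level object in a JSON-LD payload,
// accepting either a single object or an array of objects.
func decodeJSONLD(raw []byte) ([]map[string]interface{}, error) {
	var one map[string]interface{}
	if err := json.Unmarshal(raw, &one); err == nil && one != nil {
		return []map[string]interface{}{one}, nil
	}
	var many []map[string]interface{}
	if err := json.Unmarshal(raw, &many); err != nil {
		return nil, err
	}
	return many, nil
}

// firstOfType returns the first object whose @type equals typ, or nil.
func firstOfType(objs []map[string]interface{}, typ string) map[string]interface{} {
	for _, o := range objs {
		if t, ok := o["@type"].(string); ok && t == typ {
			return o
		}
	}
	return nil
}

func main() {
	raw := []byte(`[{"@type":"BreadcrumbList"},{"@type":"Article","headline":"Hello"}]`)
	objs, err := decodeJSONLD(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(firstOfType(objs, "Article")["headline"]) // Hello
}
```

Some sites also nest their objects under an @graph key or give @type as a list; those cases would need additional handling beyond this sketch.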

Engineering Principles Summary

Building a production-grade Go scraper requires internalizing these core principles:

  1. Concurrency control: use a Worker Pool or Semaphore to cap goroutine count and prevent memory exhaustion
  2. Rate limiting: apply per-host token buckets (x/time/rate) to respect server capacity
  3. Error handling: retry with exponential backoff plus jitter; avoid the Thundering Herd
  4. Deduplication: in-memory Bloom Filter for small scale; Redis + RedisBloom for large scale
  5. Politeness: honor robots.txt; declare an honest User-Agent
  6. Observability: log the disposition of every URL (success / error / skipped) for post-analysis
  7. Progressive complexity: start with static HTTP, add chromedp only where JS rendering is confirmed necessary

Go's goroutine model makes high-concurrency scraping feel natural. A well-designed Go scraper can combine bounded concurrency, per-host rate limiting, and robust retries in a comparatively small amount of code, built almost entirely on the standard library.
