Web Scraping: Concurrency and Rate Limiting
The internet holds vast stores of publicly accessible data, yet that data rarely arrives pre-packaged in a structured form. A web scraper is the engineering tool that converts unstructured HTML pages into structured records. Price monitoring, academic research, competitive analysis, search-engine indexing: all of these domains depend on scrapers as foundational infrastructure.
But scraping is not merely "send an HTTP request, parse HTML." Real, industrial-grade scrapers face three core challenges: throughput (fetching large numbers of URLs in reasonable time), politeness (not overwhelming the target server), and robustness (surviving network failures, anti-bot countermeasures, and JavaScript-rendered pages). Go exhibits distinctive advantages when confronting all three challenges.
This chapter begins with Go's concurrency model, dives deep into the Worker Pool pattern, token-bucket rate limiting, and exponential backoff with jitter, and then assembles a complete, production-ready scraper by combining these components.
Level 1 · Why Go Excels at Scraping
Where the Bottleneck Lives
Before discussing Go, we need to understand where a scraper's performance actually bottlenecks.
A typical single-page fetch looks like: establish TCP connection → TLS handshake → send HTTP request → wait for server response → receive response body → parse HTML. In this pipeline, the CPU does real work only while establishing the connection (mainly the TLS handshake) and while parsing the HTML. For the rest of the time, the program is waiting on network I/O.
For a page with 100 ms network latency and 5 ms HTML parse time, CPU utilization stays below 5 %. If you serialize fetches on a single thread, enormous CPU time is wasted waiting. This is why web scrapers are inherently I/O-bound applications, and concurrency is the primary lever for improving throughput.
Threads vs. Goroutines: The Cost of Concurrency
The classical solution is OS threads. But each OS thread carries roughly 1–8 MB of overhead (mostly a fixed-size stack), is scheduled by the kernel, and incurs context-switch latency of 1–10 µs. For a scraper that needs to maintain thousands of concurrent connections, thread count itself becomes the bottleneck.
Go's goroutines fundamentally solve this problem:
- Initial stack of 2 KB, growing on demand; hundreds of thousands of goroutines can coexist
- M:N scheduling: the Go runtime multiplexes M goroutines onto N OS threads; the number of threads running Go code at once is capped by GOMAXPROCS, which defaults to the number of logical CPUs
- Non-blocking I/O: when a goroutine waits on network I/O, the runtime suspends it and schedules another goroutine; when I/O completes, the goroutine is resumed. All of this is transparent to the programmer
This lets you write concurrent logic in a sequential style:
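// Looks like synchronous code; actually cooperative concurrency
resp, err := http.Get(url) // goroutine suspends here, but the OS thread does not block
if err != nil {
    return err // resp is nil on error, so stop before touching it
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)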
Engineering Quality of Go's HTTP Client
Go's standard net/http package is not a thin wrapper. It ships production-grade features out of the box:
- Connection pooling (Keep-Alive): enabled by default, avoiding a fresh TCP connection per request
- TLS session resumption: reduces TLS handshake cost
- Automatic redirect following: up to 10 redirects by default
- Fine-grained timeout control: connection timeout, read timeout, overall timeout
client := &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
DisableCompression: false, // handles gzip automatically
},
}
Understanding these defaults and their tuning knobs is the first step toward building an efficient scraper.
Level 2 · Core Patterns and Algorithms
The Worker Pool Pattern
A Worker Pool is the canonical pattern for "processing an unlimited stream of tasks with a fixed number of goroutines." The core problem it solves is: preventing uncontrolled goroutine creation from exhausting memory.
Channel-Based Worker Pool
func crawlWithWorkerPool(urls []string, workerCount int) []Result {
jobs := make(chan string, len(urls))
results := make(chan Result, len(urls))
// Launch a fixed number of workers
var wg sync.WaitGroup
for i := 0; i < workerCount; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for url := range jobs {
result := fetch(url)
results <- result
}
}()
}
// Enqueue all tasks
for _, url := range urls {
jobs <- url
}
close(jobs)
// After all workers finish, close the results channel
go func() {
wg.Wait()
close(results)
}()
// Drain results
var all []Result
for r := range results {
all = append(all, r)
}
return all
}
A few key design decisions deserve explanation:
- The jobs channel is buffered to len(urls) so the main goroutine can enqueue everything immediately without blocking.
- After close(jobs), for url := range jobs exits automatically when the channel is drained: the idiomatic Go way to signal "no more work."
- A dedicated goroutine waits on the WaitGroup before closing results, so the loop that drains results terminates instead of blocking forever.
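The pool's central promise is that no more than workerCount fetches run at once. The sketch below checks that promise with a stand-in fetch that sleeps briefly and tracks how many calls overlap. fakeFetch and runPool are hypothetical names, and the sleep stands in for network latency.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runPool processes urls with workerCount goroutines and returns the number
// of results plus the highest number of fetches in flight at the same time.
func runPool(urls []string, workerCount int) (int, int64) {
	var inFlight, peak int64
	fakeFetch := func(u string) string {
		cur := atomic.AddInt64(&inFlight, 1)
		for { // raise the recorded peak with a compare-and-swap loop
			old := atomic.LoadInt64(&peak)
			if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
				break
			}
		}
		time.Sleep(5 * time.Millisecond) // simulated network wait
		atomic.AddInt64(&inFlight, -1)
		return "fetched " + u
	}

	jobs := make(chan string, len(urls))
	results := make(chan string, len(urls))
	var wg sync.WaitGroup
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fakeFetch(u)
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait() // safe here only because results can hold every result
	close(results)

	count := 0
	for range results {
		count++
	}
	return count, atomic.LoadInt64(&peak)
}

func main() {
	urls := make([]string, 50)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/page/%d", i)
	}
	n, peak := runPool(urls, 4)
	fmt.Printf("results=%d peak in-flight=%d\n", n, peak)
}
```

The peak never exceeds the worker count, however many URLs are queued. Note that this variant waits for the workers before draining, which works only because the results buffer is as large as the input; with a smaller buffer the dedicated closer goroutine becomes necessary.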
Semaphore-Based Concurrency Control
An equivalent approach uses a buffered channel to simulate a semaphore:
type Semaphore chan struct{}
func NewSemaphore(n int) Semaphore {
return make(chan struct{}, n)
}
func (s Semaphore) Acquire() { s <- struct{}{} }
func (s Semaphore) Release() { <-s }
func crawlWithSemaphore(urls []string, maxConcurrent int) []Result {
sem := NewSemaphore(maxConcurrent)
var mu sync.Mutex
var results []Result
var wg sync.WaitGroup
for _, url := range urls {
wg.Add(1)
go func(u string) {
defer wg.Done()
sem.Acquire()
defer sem.Release()
result := fetch(u)
mu.Lock()
results = append(results, result)
mu.Unlock()
}(url)
}
wg.Wait()
return results
}
The semaphore approach is more flexible: each URL gets its own goroutine, but concurrency is throttled. The Worker Pool reuses a fixed set of goroutines, reducing goroutine creation/teardown overhead. For large, fixed-length work lists, the Worker Pool is generally preferable.
Token Bucket Rate Limiting: golang.org/x/time/rate
Politeness is central to responsible scraping. Rate limiting ensures that, regardless of how full the task queue is, the request rate to any single target server never exceeds a configured threshold.
The token bucket algorithm is the most widely used rate-limiting approach:
- The bucket holds at most burst tokens, representing permitted instantaneous bursts
- The system refills the bucket at rate r (r tokens per second)
- Each request consumes one token
- New tokens are discarded when the bucket is full; requests wait when it is empty
golang.org/x/time/rate implements token bucket:
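import (
    "context"
    "net/http"

    "golang.org/x/time/rate"
)

// 2 requests per second; burst of up to 5
var limiter = rate.NewLimiter(rate.Limit(2), 5)

func fetchWithRateLimit(ctx context.Context, limiter *rate.Limiter, url string) (*http.Response, error) {
    // Wait blocks until a token is available or the context is canceled
    if err := limiter.Wait(ctx); err != nil {
        return nil, err
    }
    // Bind the request to ctx so cancellation also aborts the HTTP call
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}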
In a scraper, you typically need per-host limiters, applying different rates to different domains:
type HostLimiter struct {
mu sync.Mutex
limiters map[string]*rate.Limiter
rps float64
burst int
}
func NewHostLimiter(rps float64, burst int) *HostLimiter {
return &HostLimiter{
limiters: make(map[string]*rate.Limiter),
rps: rps,
burst: burst,
}
}
func (hl *HostLimiter) Get(host string) *rate.Limiter {
hl.mu.Lock()
defer hl.mu.Unlock()
if l, ok := hl.limiters[host]; ok {
return l
}
l := rate.NewLimiter(rate.Limit(hl.rps), hl.burst)
hl.limiters[host] = l
return l
}
Exponential Backoff with Jitter
Network requests inevitably fail: server timeouts, 429 Too Many Requests, transient network errors. The correct retry strategy is exponential backoff: each failure doubles the wait time, preventing continuous hammering of an already-overloaded server.
Pure exponential backoff has one flaw: if many clients fail simultaneously and all retry on the same schedule, they produce a Thundering Herd at retry time. Adding random jitter scatters the retry moments.
import (
    "context"
    "fmt"
    "math"
    "math/rand"
    "net/http"
    "time"
)
type BackoffConfig struct {
InitialDelay time.Duration
MaxDelay time.Duration
Multiplier float64
MaxRetries int
}
var DefaultBackoff = BackoffConfig{
InitialDelay: 1 * time.Second,
MaxDelay: 60 * time.Second,
Multiplier: 2.0,
MaxRetries: 5,
}
// Wait returns the duration to sleep before attempt number `attempt` (0-indexed).
// Returns -1 when retries are exhausted.
func (b BackoffConfig) Wait(attempt int) time.Duration {
    if attempt >= b.MaxRetries {
        return -1
    }
    // Compute the exponential ceiling (named ceiling to avoid shadowing the built-in cap)
    ceiling := float64(b.InitialDelay) * math.Pow(b.Multiplier, float64(attempt))
    if ceiling > float64(b.MaxDelay) {
        ceiling = float64(b.MaxDelay)
    }
    // Full Jitter: uniform random in [0, ceiling)
    return time.Duration(rand.Float64() * ceiling)
}
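func fetchWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var lastErr error
    for attempt := 0; attempt <= DefaultBackoff.MaxRetries; attempt++ {
        if attempt > 0 {
            wait := DefaultBackoff.Wait(attempt - 1)
            if wait < 0 {
                break
            }
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        // Bind the request to ctx so cancellation also aborts an in-flight call
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err // malformed URL; retrying cannot help
        }
        resp, err := client.Do(req)
        if err != nil {
            lastErr = err
            continue
        }
        // Retry on 429 Too Many Requests and on 5xx; hand everything else to the caller
        if resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode < 500 {
            return resp, nil
        }
        resp.Body.Close()
        lastErr = fmt.Errorf("retryable status: %d", resp.StatusCode)
    }
    // One initial try plus MaxRetries retries
    return nil, fmt.Errorf("failed after %d attempts: %w", DefaultBackoff.MaxRetries+1, lastErr)
}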
Robots.txt and Crawl Politeness
robots.txt is the standard protocol (RFC 9309) by which websites declare which paths are off-limits to automated agents. A polite crawler must respect it.
import "github.com/temoto/robotstxt"
type RobotsCache struct {
mu sync.RWMutex
cache map[string]*robotstxt.RobotsData
}
func (rc *RobotsCache) IsAllowed(userAgent, rawURL string) bool {
u, err := url.Parse(rawURL)
if err != nil {
return false
}
host := u.Scheme + "://" + u.Host
rc.mu.RLock()
data, ok := rc.cache[host]
rc.mu.RUnlock()
if !ok {
data = rc.fetchRobots(host)
rc.mu.Lock()
rc.cache[host] = data
rc.mu.Unlock()
}
if data == nil {
return true // could not fetch robots.txt; allow by default
}
return data.TestAgent(u.Path, userAgent)
}
func (rc *RobotsCache) fetchRobots(host string) *robotstxt.RobotsData {
    resp, err := http.Get(host + "/robots.txt")
    if err != nil {
        return nil
    }
    // Close the body on every path, including non-200 responses
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil
    }
    data, err := robotstxt.FromResponse(resp)
    if err != nil {
        return nil
    }
    return data
}
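It helps to know what the library is deciding on your behalf. RFC 9309 resolves conflicts between Allow and Disallow lines by taking the matching rule with the longest path, with Allow winning a tie. The sketch below implements just that rule for plain path prefixes; it deliberately ignores user-agent groups and the "*" and "$" pattern characters, and the rule type and isAllowed function are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is one Allow or Disallow line from the group that applies to our agent.
type rule struct {
	allow  bool
	prefix string
}

// isAllowed applies longest-match precedence: the matching rule with the
// longest path wins, and Allow wins a tie. Plain prefixes only.
func isAllowed(rules []rule, path string) bool {
	bestLen, allowed := -1, true // no matching rule means the path is allowed
	for _, r := range rules {
		if r.prefix == "" || !strings.HasPrefix(path, r.prefix) {
			continue // an empty pattern matches nothing
		}
		if len(r.prefix) > bestLen || (len(r.prefix) == bestLen && r.allow) {
			bestLen, allowed = len(r.prefix), r.allow
		}
	}
	return allowed
}

func main() {
	// Equivalent to:
	//   User-agent: *
	//   Disallow: /private/
	//   Allow: /private/press/
	rules := []rule{
		{allow: false, prefix: "/private/"},
		{allow: true, prefix: "/private/press/"},
	}
	for _, p := range []string{"/", "/private/report.html", "/private/press/2024.html"} {
		fmt.Printf("%-28s allowed=%v\n", p, isAllowed(rules, p))
	}
}
```

In the example, the press directory stays crawlable even though its parent is disallowed, because the Allow line is the more specific match.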
Level 3 · Building a Complete Scraper
Colly vs. Raw net/http
colly is one of the most popular scraping frameworks in the Go ecosystem. It handles much of the boilerplate:
import "github.com/gocolly/colly/v2"
func scrapeWithColly(startURL string) []Article {
var articles []Article
var mu sync.Mutex
c := colly.NewCollector(
colly.AllowedDomains("example.com"),
colly.MaxDepth(3),
colly.Async(true),
)
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 5,
Delay: 500 * time.Millisecond,
RandomDelay: 200 * time.Millisecond,
})
c.OnHTML("article.post", func(e *colly.HTMLElement) {
article := Article{
Title: e.ChildText("h2.title"),
URL: e.ChildAttr("a.read-more", "href"),
Date: e.ChildText("time"),
}
mu.Lock()
articles = append(articles, article)
mu.Unlock()
c.Visit(e.Request.AbsoluteURL(article.URL))
})
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error scraping %s: %v", r.Request.URL, err)
})
c.Visit(startURL)
c.Wait()
return articles
}
Colly has limitations: fine-grained retry logic and custom rate-limit strategies are awkward to implement within its callback architecture. For complex pipelines, raw net/http + goquery gives more control.
HTML Parsing: goquery and golang.org/x/net/html
goquery provides a jQuery-style CSS selector API backed by the golang.org/x/net/html parse tree:
import (
    "fmt"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "golang.org/x/net/html"
)
type Article struct {
Title string
URL string
Content string
Date string
Tags []string
}
func parseArticlePage(resp *http.Response) (*Article, error) {
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
return nil, fmt.Errorf("parse HTML: %w", err)
}
article := &Article{}
article.Title = strings.TrimSpace(doc.Find("h1.article-title").Text())
article.Date = doc.Find("time[datetime]").AttrOr("datetime", "")
doc.Find("a.tag").Each(func(i int, s *goquery.Selection) {
article.Tags = append(article.Tags, strings.TrimSpace(s.Text()))
})
contentHTML, _ := doc.Find("div.article-content").Html()
article.Content = htmlToText(contentHTML)
return article, nil
}
// htmlToText converts HTML to plain text using the golang.org/x/net/html tokenizer
func htmlToText(htmlStr string) string {
    tokenizer := html.NewTokenizer(strings.NewReader(htmlStr))
    var sb strings.Builder
    for {
        tt := tokenizer.Next()
        switch tt {
        case html.ErrorToken:
            // Reached on io.EOF as well as on malformed input
            return sb.String()
        case html.TextToken:
            sb.Write(tokenizer.Text())
        case html.StartTagToken, html.SelfClosingTagToken:
            // SelfClosingTagToken covers the <br/> spelling
            tag, _ := tokenizer.TagName()
            if string(tag) == "br" || string(tag) == "p" {
                sb.WriteRune('\n')
            }
        }
    }
}
Link Extraction and Deduplication: Bloom Filter
A scraper's most basic requirement is never re-fetching a URL it has already visited. The naive approach is map[string]bool, but at millions of URLs the memory footprint becomes unacceptable.
A Bloom Filter is the classical data structure for this problem: it uses a tiny amount of memory to answer the question "have I seen this before?" with zero false negatives and a configurable false-positive rate.
import "github.com/bits-and-blooms/bloom/v3"
type Deduplicator struct {
filter *bloom.BloomFilter
mu sync.Mutex
}
// NewDeduplicator creates a Bloom Filter sized for n expected items at false-positive rate fp
func NewDeduplicator(n uint, fp float64) *Deduplicator {
return &Deduplicator{
filter: bloom.NewWithEstimates(n, fp),
}
}
// SeenOrAdd returns true if the URL was already seen (skip it).
// Returns false on first encounter and adds the URL to the filter.
func (d *Deduplicator) SeenOrAdd(url string) bool {
d.mu.Lock()
defer d.mu.Unlock()
if d.filter.TestString(url) {
return true
}
d.filter.AddString(url)
return false
}
func extractLinks(doc *goquery.Document, baseURL *url.URL) []string {
var links []string
doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
href, exists := s.Attr("href")
if !exists {
return
}
u, err := baseURL.Parse(href)
if err != nil {
return
}
if (u.Scheme == "http" || u.Scheme == "https") && u.Host == baseURL.Host {
u.Fragment = "" // strip #section
links = append(links, u.String())
}
})
return links
}
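The memory claim made for the Bloom filter can be checked with the standard sizing formulas, m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below computes both and pairs them with a deliberately minimal filter built on hash/fnv, to show the mechanics that a library wraps. It is an illustration only: it is not thread-safe, and the double-hashing scheme is one common choice rather than the method of any specific package.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// optimalParams returns the bit count m and hash count k for n expected
// items at false-positive rate p.
func optimalParams(n uint64, p float64) (m, k uint64) {
	mf := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / float64(n) * math.Ln2)
	return uint64(mf), uint64(kf)
}

// Bloom is a minimal filter for illustration only.
type Bloom struct {
	bits []uint64
	m, k uint64
}

func NewBloom(n uint64, p float64) *Bloom {
	m, k := optimalParams(n, p)
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions derives k bit indexes from two FNV hashes (double hashing).
func (b *Bloom) positions(s string) []uint64 {
	h1 := fnv.New64a()
	h1.Write([]byte(s))
	h2 := fnv.New64()
	h2.Write([]byte(s))
	a, c := h1.Sum64(), h2.Sum64()
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (a + i*c) % b.m
	}
	return idx
}

// Add sets the k bits for s.
func (b *Bloom) Add(s string) {
	for _, i := range b.positions(s) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// Test reports false only if s was definitely never added.
func (b *Bloom) Test(s string) bool {
	for _, i := range b.positions(s) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	m, k := optimalParams(1_000_000, 0.01)
	fmt.Printf("1M URLs at 1%% FP: %d bits (~%.2f MB), %d hashes\n", m, float64(m)/8/1e6, k)

	b := NewBloom(1000, 0.01)
	b.Add("https://example.com/a")
	fmt.Println("seen /a:", b.Test("https://example.com/a"))
}
```

For one million URLs at a 1 % false-positive rate the filter needs roughly 9.6 million bits, a little over 1 MB, regardless of how long the URLs are. In a scraper, a false positive means a URL that was never fetched is skipped as already seen, so the acceptable rate depends on how much coverage loss the job can tolerate.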
Putting It All Together
package scraper
import (
    "bytes"
    "context"
    "encoding/csv"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "net/url"
    "os"
    "strings"
    "sync"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/temoto/robotstxt"
)

type Scraper struct {
    client    *http.Client
    limiter   *HostLimiter
    dedup     *Deduplicator
    robots    *RobotsCache
    workers   int
    userAgent string
}

func NewScraper(rps float64, workers int) *Scraper {
    burst := int(rps * 2)
    if burst < 1 {
        burst = 1 // a zero burst would make every limiter.Wait call fail
    }
    return &Scraper{
        client: &http.Client{
            Timeout: 30 * time.Second,
            Transport: &http.Transport{
                MaxIdleConnsPerHost: 10,
                IdleConnTimeout:     90 * time.Second,
            },
        },
        limiter:   NewHostLimiter(rps, burst),
        dedup:     NewDeduplicator(1_000_000, 0.01), // 1M URLs, 1% false-positive rate
        robots:    &RobotsCache{cache: make(map[string]*robotstxt.RobotsData)},
        workers:   workers,
        userAgent: "MyBot/1.0 (+https://example.com/bot)",
    }
}

type Job struct {
    URL   string
    Depth int
}

type Result struct {
    URL     string
    Article *Article
    Err     error
}

func (s *Scraper) Run(ctx context.Context, seedURLs []string, maxDepth int) []Article {
    jobs := make(chan Job, 10000)
    results := make(chan Result, 10000)

    // pending counts jobs that are queued or in flight. Once it reaches zero,
    // no worker can enqueue more work, so the jobs channel can be closed.
    var pending sync.WaitGroup

    var wg sync.WaitGroup
    for i := 0; i < s.workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                s.processJob(ctx, job, jobs, results, &pending, maxDepth)
                pending.Done()
            }
        }()
    }

    var articles []Article
    var collectWg sync.WaitGroup
    collectWg.Add(1)
    go func() {
        defer collectWg.Done()
        for r := range results {
            if r.Err != nil {
                log.Printf("Error: %s: %v", r.URL, r.Err)
                continue
            }
            if r.Article != nil {
                articles = append(articles, *r.Article)
            }
        }
    }()

    // Seed the queue; workers are already running, so this cannot stall
    for _, u := range seedURLs {
        if !s.dedup.SeenOrAdd(u) {
            pending.Add(1)
            jobs <- Job{URL: u, Depth: 0}
        }
    }

    // Close jobs when the crawl frontier is empty. Without this, the workers'
    // range loops never end and wg.Wait below blocks forever.
    go func() {
        pending.Wait()
        close(jobs)
    }()

    wg.Wait()
    close(results)
    collectWg.Wait()
    return articles
}

func (s *Scraper) processJob(ctx context.Context, job Job, jobs chan<- Job, results chan<- Result, pending *sync.WaitGroup, maxDepth int) {
    if !s.robots.IsAllowed(s.userAgent, job.URL) {
        return
    }
    u, err := url.Parse(job.URL)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    if err := s.limiter.Get(u.Host).Wait(ctx); err != nil {
        return
    }
    resp, err := fetchWithRetry(ctx, s.client, job.URL)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    defer resp.Body.Close()

    // A response body can be read only once, but it is parsed twice below
    // (links and article), so buffer it first.
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        results <- Result{URL: job.URL, Err: fmt.Errorf("read body: %w", err)}
        return
    }
    doc, err := goquery.NewDocumentFromReader(bytes.NewReader(body))
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    resp.Body = io.NopCloser(bytes.NewReader(body))
    article, err := parseArticlePage(resp)
    if err != nil {
        results <- Result{URL: job.URL, Err: err}
        return
    }
    article.URL = job.URL // parseArticlePage does not know the page URL
    results <- Result{URL: job.URL, Article: article}

    if job.Depth < maxDepth {
        for _, link := range extractLinks(doc, u) {
            if !s.dedup.SeenOrAdd(link) {
                pending.Add(1) // count the job before it is queued
                select {
                case jobs <- Job{URL: link, Depth: job.Depth + 1}:
                default:
                    // queue full; skip rather than block (the link stays marked as seen)
                    pending.Done()
                }
            }
        }
    }
}
func SaveJSON(articles []Article, path string) error {
f, err := os.Create(path)
if err != nil {
return err
}
defer f.Close()
enc := json.NewEncoder(f)
enc.SetIndent("", " ")
return enc.Encode(articles)
}
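func SaveCSV(articles []Article, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    w := csv.NewWriter(f)
    if err := w.Write([]string{"URL", "Title", "Date", "Tags"}); err != nil {
        return err
    }
    for _, a := range articles {
        if err := w.Write([]string{a.URL, a.Title, a.Date, strings.Join(a.Tags, "|")}); err != nil {
            return err
        }
    }
    // Flush before checking Error: a deferred Flush would run only after the
    // return value had been computed, hiding any write failure.
    w.Flush()
    return w.Error()
}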
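A crawl whose workers feed newly discovered links back into their own queue needs a way to tell when the frontier is empty; otherwise nobody knows when to close the jobs channel and the workers never exit. One common approach is to count jobs that are queued or in flight. The sketch below shows the pattern on an in-memory link graph, which is a hypothetical stand-in for fetching a page and extracting its links.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl walks graph from seed with a fixed number of workers and returns how
// many distinct pages were visited.
func crawl(graph map[string][]string, seed string, workers int) int {
	jobs := make(chan string, 1024)
	var pending sync.WaitGroup // jobs queued or currently being processed

	var mu sync.Mutex
	seen := map[string]bool{seed: true}
	visited := 0

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for page := range jobs {
				mu.Lock()
				visited++
				mu.Unlock()
				for _, link := range graph[page] {
					mu.Lock()
					isNew := !seen[link]
					seen[link] = true
					mu.Unlock()
					if !isNew {
						continue
					}
					pending.Add(1) // count the job before it is queued
					select {
					case jobs <- link:
					default:
						pending.Done() // queue full: drop the link
					}
				}
				pending.Done() // this page is finished
			}
		}()
	}

	pending.Add(1)
	jobs <- seed
	go func() {
		pending.Wait() // zero means nothing queued and nothing in flight
		close(jobs)
	}()
	wg.Wait()
	return visited
}

func main() {
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"c", "d"},
		"c": {"a"},
		"d": nil,
	}
	fmt.Println("pages visited:", crawl(graph, "a", 3))
}
```

The counter is always incremented while the worker's own job is still outstanding, so it cannot touch zero while new work may still appear. That makes closing the channel after pending.Wait safe.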
Level 4 · Advanced Topics and Edge Cases
Distributed Scraping with a Redis Work Queue
When scraper scale exceeds single-machine capacity, the task queue has to move to a shared store such as Redis, enabling multi-process cooperation:
import "github.com/redis/go-redis/v9"
type RedisQueue struct {
rdb *redis.Client
key string
dedup string // Redis Set used for deduplication
}
func NewRedisQueue(addr, queueKey, dedupKey string) *RedisQueue {
return &RedisQueue{
rdb: redis.NewClient(&redis.Options{Addr: addr}),
key: queueKey,
dedup: dedupKey,
}
}
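func (q *RedisQueue) Push(ctx context.Context, urls ...string) error {
    for _, u := range urls {
        // SAdd returns 1 only when u was not already in the set, so the
        // result decides whether the URL is new. Checking and pushing are
        // two separate commands; a Lua script would make them atomic.
        added, err := q.rdb.SAdd(ctx, q.dedup, u).Result()
        if err != nil {
            return err
        }
        if added == 0 {
            continue // already seen by this or another worker
        }
        if err := q.rdb.RPush(ctx, q.key, u).Err(); err != nil {
            return err
        }
    }
    return nil
}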
func (q *RedisQueue) Pop(ctx context.Context, timeout time.Duration) (string, error) {
result, err := q.rdb.BLPop(ctx, timeout, q.key).Result()
if err != nil {
return "", err
}
if len(result) < 2 {
return "", fmt.Errorf("unexpected BLPop result")
}
return result[1], nil
}
The critical concern in distributed scraping is deduplication consistency. Using a Redis Set for global dedup ensures no two worker processes re-fetch the same URL. When URL counts reach hundreds of millions, a Redis Set consumes too much memory; production systems switch to the RedisBloom module (a server-side Bloom Filter).
Handling JavaScript-Heavy Sites: chromedp
Many modern sites render content entirely in JavaScript via React or Vue. Static HTTP requests cannot capture that content; you need a headless browser:
import (
"context"
"github.com/chromedp/chromedp"
"time"
)
func scrapeJSPage(targetURL string) (string, error) {
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
defer cancel()
var htmlContent string
err := chromedp.Run(ctx,
chromedp.Navigate(targetURL),
// Wait until a specific element exists (signals JS has rendered)
chromedp.WaitVisible("#main-content", chromedp.ByID),
chromedp.OuterHTML("html", &htmlContent),
)
if err != nil {
return "", fmt.Errorf("chromedp: %w", err)
}
return htmlContent, nil
}
// Simulate scrolling to trigger infinite-scroll lazy loading
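func scrapeInfiniteScroll(targetURL string) ([]string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    // Bound the whole run so a stalled page cannot hang the scraper
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()
    var items []string
    err := chromedp.Run(ctx,
        chromedp.Navigate(targetURL),
        chromedp.ActionFunc(func(ctx context.Context) error {
            for i := 0; i < 5; i++ {
                if err := chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil).Do(ctx); err != nil {
                    return err
                }
                // Unlike time.Sleep, chromedp.Sleep returns early when ctx is canceled
                if err := chromedp.Sleep(2 * time.Second).Do(ctx); err != nil {
                    return err
                }
                var newItems []string
                if err := chromedp.Evaluate(
                    `Array.from(document.querySelectorAll('.item-title')).map(e => e.textContent)`,
                    &newItems,
                ).Do(ctx); err != nil {
                    return err
                }
                items = newItems
            }
            return nil
        }),
    )
    return items, err
}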
Each chromedp browser instance consumes roughly 100–300 MB of memory, far more than a plain HTTP request. A practical strategy: attempt a static fetch first; if expected content is missing, degrade to chromedp for that URL only.
Proxy Rotation and TLS Fingerprint Management
Sophisticated anti-bot systems analyze not just request frequency but also TLS handshake characteristics (the JA3 fingerprint) and HTTP/2 frame ordering. The standard Go HTTP client produces a TLS fingerprint that differs from those of mainstream browsers, which makes it easy to identify.
type ProxyPool struct {
proxies []string
mu sync.Mutex
index int
}
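func (p *ProxyPool) Next() *url.URL {
    p.mu.Lock()
    defer p.mu.Unlock()
    if len(p.proxies) == 0 {
        return nil // empty pool; avoids a divide-by-zero in the modulo below
    }
    proxy := p.proxies[p.index%len(p.proxies)]
    p.index++
    u, _ := url.Parse(proxy) // u is nil if the entry is malformed
    return u
}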
func newClientWithProxy(proxyURL *url.URL) *http.Client {
return &http.Client{
Transport: &http.Transport{
Proxy: http.ProxyURL(proxyURL),
},
Timeout: 30 * time.Second,
}
}
For TLS fingerprint spoofing, github.com/refraction-networking/utls can mimic Chrome or Firefox TLS handshakes. Apply this technique only where legally and ethically permissible; always verify the target site's terms of service.
Structured Data Extraction: JSON-LD and Microdata
Many sites embed structured data (Schema.org) for SEO purposes. This is often far easier to parse than the surrounding HTML:
func extractJSONLD(doc *goquery.Document) map[string]interface{} {
var result map[string]interface{}
doc.Find(`script[type="application/ld+json"]`).Each(func(i int, s *goquery.Selection) {
if result != nil {
return
}
var data map[string]interface{}
if err := json.Unmarshal([]byte(s.Text()), &data); err == nil {
result = data
}
})
return result
}
func extractArticleMetadata(doc *goquery.Document) *Article {
ld := extractJSONLD(doc)
if ld == nil {
return nil
}
article := &Article{}
if name, ok := ld["name"].(string); ok {
article.Title = name
}
if date, ok := ld["datePublished"].(string); ok {
article.Date = date
}
return article
}
JSON-LD parsing is more stable than CSS selectors because structured-data schemas evolve far more slowly than page layouts. In scraper design, attempt JSON-LD extraction first, then fall back to HTML parsing. This layered strategy maximizes resilience.
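In practice a JSON-LD script tag may hold either a single object or an array of objects, and article titles are commonly published under the Schema.org headline property as well as name. The sketch below handles both shapes using only encoding/json. It is a simplification: it does not unwrap "@graph" containers or handle an "@type" given as an array, and the function names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// firstOfType scans a JSON-LD payload, which may be one object or an array
// of objects, and returns the first object whose @type equals want.
func firstOfType(raw []byte, want string) map[string]interface{} {
	var single map[string]interface{}
	if err := json.Unmarshal(raw, &single); err == nil {
		if t, _ := single["@type"].(string); t == want {
			return single
		}
		return nil
	}
	// Not a single object; try an array of objects.
	var many []map[string]interface{}
	if err := json.Unmarshal(raw, &many); err != nil {
		return nil
	}
	for _, obj := range many {
		if t, _ := obj["@type"].(string); t == want {
			return obj
		}
	}
	return nil
}

// title prefers "headline" and falls back to "name".
func title(obj map[string]interface{}) string {
	if s, ok := obj["headline"].(string); ok && s != "" {
		return s
	}
	s, _ := obj["name"].(string)
	return s
}

func main() {
	raw := []byte(`[
	  {"@type": "WebSite", "name": "Example"},
	  {"@type": "Article", "headline": "Hello", "datePublished": "2024-01-02"}
	]`)
	obj := firstOfType(raw, "Article")
	fmt.Println(title(obj), obj["datePublished"])
}
```

Filtering by @type matters because pages often carry several JSON-LD blocks (site, breadcrumb, organization) alongside the one that describes the article.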
Engineering Principles Summary
Building a production-grade Go scraper requires internalizing these core principles:
- Concurrency control: use a Worker Pool or Semaphore to cap goroutine count and prevent memory exhaustion
- Rate limiting: apply per-host token buckets (x/time/rate) to respect server capacity
- Error handling: retry with exponential backoff plus jitter; avoid the Thundering Herd
- Deduplication: in-memory Bloom Filter for small scale; Redis + RedisBloom for large scale
- Politeness: honor robots.txt; declare an honest User-Agent
- Observability: log the disposition of every URL (success / error / skipped) for post-analysis
- Progressive complexity: start with static HTTP, add chromedp only where JS rendering is confirmed necessary
Go's goroutine model makes high-concurrency scraping feel natural. A well-designed Go scraper can often match the concurrency performance and engineering quality of a much larger Python scraper with considerably less code.