Chapter 35

Background Jobs and Graceful Shutdown

In 2018, a major e-commerce platform suffered a serious incident during a peak sales event. The ops team performed a rolling deployment at the height of traffic. When application processes received SIGTERM, they exited immediately—cutting off over 2,000 in-flight order transactions mid-process. Because payment states were never written to the database, these orders entered a limbo state: "payment deducted, order not confirmed." The customer support team spent three full days manually investigating and issuing refunds.

This is a story about "exit"—and the cost was far heavier than it sounds.

In the Go world, background jobs and graceful shutdown are two problems that production services must take seriously. The former determines what work your system can do autonomously; the latter determines whether that work can end safely when the process restarts. Together they define a service's lifecycle management capability.

This chapter starts from first principles: the scheduling mechanics behind background tasks, goroutine lifecycle management, and the complete implementation path for signal handling and graceful shutdown.

Level 1 · What You Need to Know

The Nature of the Background Job Problem

Behind every web service, there is work that cannot be done on the HTTP request-handling path:

Time-sensitive async tasks: Sending a welcome email after user registration. If this runs synchronously inside the HTTP handler, the API response time is held hostage by the mail service's latency. The user is waiting for you to send an email. That's unreasonable.

Periodic maintenance tasks: Expiring session cleanup, log archival, cache warm-up, statistics aggregation—these tasks need to run on a schedule, independent of any user request.

Compute-intensive tasks: Report generation, video processing, bulk data imports—these may take minutes or hours. HTTP requests cannot wait that long; they must run asynchronously in the background.

These requirements gave rise to the Background Job System design pattern. Different languages and frameworks have different names for it: Cron Jobs, Workers, Daemon Tasks, Background Services. But the core questions are always the same: Where does a task get triggered? Where does it run? What happens on failure? How is it handled when the system shuts down?

Why Graceful Shutdown Matters

A process can terminate in two ways:

Abrupt exit (Crash / kill -9): The process is forcibly killed. All in-memory state is lost. Any code currently executing stops instantly. This is the accident scenario—uncontrollable.

Signal-based exit (SIGTERM / SIGINT): The OS or a user sends a termination signal. The process has the opportunity to catch the signal and execute cleanup logic before exiting. This is the normal process lifecycle management approach.

Kubernetes Pod rolling updates, Docker container restarts, and systemctl stop all notify the process via SIGTERM by default; Ctrl+C in a terminal sends SIGINT instead. If a Go program has not registered a handler for these signals with signal.Notify, the runtime's default behavior is to exit immediately.

Graceful shutdown means: upon receiving an exit signal, let in-flight work finish safely, reject new work, orderly release all resources, then exit.

This process breaks down into four phases:

Stop accepting new requests: HTTP server stops accepting new connections; the job scheduler stops triggering new tasks.
Wait for in-progress work to complete: HTTP requests finish processing; running background tasks reach a safe checkpoint.
Close resource connections: Database connection pools, Redis clients, message queue consumers.
Exit the process: All goroutines have exited; the main function returns.

Go Signal Handling Basics

Go programs receive Unix signals via the os/signal package:

sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

sig := <-sigCh
fmt.Printf("received signal: %v\n", sig)
// Begin graceful shutdown

signal.Notify routes the specified signals to the provided channel. A buffer size of at least 1 is recommended to avoid signal loss when no goroutine is currently reading—Go's signal mechanism is best-effort, not guaranteed-delivery.

SIGTERM (signal 15) is the standard process-termination signal; the kill command sends it by default. SIGINT (signal 2) is sent when the user presses Ctrl+C in the terminal. Production services need to handle both.

Level 2 · Principles and Mechanics

`time.Ticker` vs `time.AfterFunc`: Two Periodic Trigger Models

The Go standard library provides two primary timer mechanisms, which differ fundamentally in semantics:

time.Ticker: Fixed-interval triggering

ticker := time.NewTicker(1 * time.Minute)
defer ticker.Stop()

for {
    select {
    case <-ticker.C:
        // Triggered once per minute
        doWork()
    case <-ctx.Done():
        return
    }
}

A Ticker is channel-based. ticker.C is a receive-only channel to which the runtime sends timestamps at the specified interval. A critical detail: if doWork() takes longer than the Ticker interval, the Ticker doesn't wait, and it doesn't queue up missed ticks either. The runtime drops ticks that a slow receiver has not picked up. In the loop above, a slow doWork() therefore skips ticks rather than running twice at once, and the spacing between runs is not guaranteed to be exactly one interval.

time.AfterFunc: Delayed one-shot triggering

time.AfterFunc(5*time.Second, func() {
    doWork()
})

AfterFunc executes the callback in a new goroutine without blocking the caller. It fires once and does not repeat. To achieve a loop, you call AfterFunc again inside the callback. This pattern implements "wait N seconds after the task completes before running again"—semantically different from Ticker's "trigger every N seconds."

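The core distinction: a Ticker fires at a fixed frequency regardless of how long each run takes, while recursive AfterFunc measures the delay from the end of one run to the start of the next. With a Ticker, a long-running task either skips ticks (when it runs inline, as in the loop above) or overlaps with itself (when each tick starts the work in a new goroutine). With recursive AfterFunc, runs cannot overlap.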

`robfig/cron`: Cron-Expression Scheduling

The standard library's Ticker only supports fixed time intervals. Production environments often need calendar-based scheduling rules such as "every day at 3 AM" or "every Monday at midnight." That's where cron-expression schedulers come in.

github.com/robfig/cron/v3 is the most commonly used implementation in the Go ecosystem:

c := cron.New(cron.WithSeconds()) // Enable second-level precision: every spec now needs six fields
c.AddFunc("0 0 3 * * *", func() { // Daily at 3:00 AM
    cleanupExpiredSessions()
})
c.AddFunc("0 0 0 * * 1", func() { // Every Monday at midnight (a five-field spec would be rejected here)
    generateWeeklyReport()
})
c.Start()
defer c.Stop()

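In v3, Stop() stops the scheduler from triggering new tasks and returns immediately; it does not block until running tasks complete. What it returns is a context.Context whose Done channel is closed once all running jobs have finished. A shutdown path that needs in-flight jobs to complete must wait on that context, usually with a timeout. This is a detail that demands explicit attention when designing a background job system.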

Internally, robfig/cron keeps its entries in a slice that it re-sorts by next-trigger time on each pass of its run loop. A single goroutine loops continuously, computing the time until the next wakeup and sleeping. When awakened, it triggers all jobs whose time has come (each in its own goroutine), then recalculates the next wakeup.

Goroutine Lifecycle Management

Go goroutines have no parent-child relationship and no implicit cancellation propagation. This differs from OS processes, which form an explicit tree in which a parent can signal its children. In Go, goroutine exit must be coordinated explicitly by the developer via channels or contexts.

The context.WithCancel propagation model:

ctx, cancel := context.WithCancel(context.Background())

// Launch multiple background goroutines, all sharing the same ctx
go workerA(ctx)
go workerB(ctx)
go workerC(ctx)

// Canceling notifies all goroutines simultaneously
cancel()

The cancel function returned by context.WithCancel closes the ctx.Done() channel (technically, it signals all listeners to unblock). Every goroutine waiting on ctx.Done() wakes up at once.

This is the classic fan-out notification pattern. It works because closing a channel notifies all readers simultaneously—unlike sending a value, which delivers to exactly one reader. All goroutines reading from a closed channel receive the zero value immediately.

Context tree propagation:

rootCtx, rootCancel := context.WithCancel(context.Background())

// HTTP server uses a child context with timeout
httpCtx, httpCancel := context.WithTimeout(rootCtx, 30*time.Second)
defer httpCancel()

// Background jobs use another child context
jobCtx, jobCancel := context.WithCancel(rootCtx)
defer jobCancel()

// When rootCancel() is called, both httpCtx and jobCtx are canceled

This tree structure makes graceful shutdown elegantly simple: root context cancellation cascades downward like water flowing downhill. Each component only needs to watch its own ctx.Done() and clean up accordingly.

`os/signal`: The Complete Signal Handling Mechanism

signal.Notify internally maintains a registry mapping signals to channels. When Go's runtime captures a signal, it iterates the registry and delivers the signal to every channel registered for it.

Important edge cases:

Signal loss: If the channel is full (buffer exhausted), the signal is dropped rather than blocking the sender—this prevents deadlock. This is why a buffer size of at least 1 is recommended.

signal.Reset: Restores a signal's default handling behavior (for SIGTERM, the default is immediate exit). Useful in certain test scenarios.

signal.Stop: Stops delivering signals to the channel; used for cleanup.

Goroutine safety: signal.Notify is safe to call from any goroutine.

Level 3 · Code Practice

Building a Complete Background Job System

Let's build a production-grade background job system step by step—with job registration, scheduled execution, panic recovery, and metrics.

1. Job Interface and JobRunner

package jobrunner

import (
    "context"
    "fmt"
    "log/slog"
    "runtime"
    "sync"
    "time"

    "github.com/robfig/cron/v3"
)

// Job defines the interface for a background task
type Job interface {
    Name() string
    Run(ctx context.Context) error
}

// JobRunner manages scheduling and execution of all background jobs
type JobRunner struct {
    cron    *cron.Cron
    mu      sync.RWMutex
    jobs    map[string]*jobEntry
    wg      sync.WaitGroup // tracks all running jobs
    logger  *slog.Logger
}

type jobEntry struct {
    job      Job
    cronExpr string
    lastRun  time.Time
    lastErr  error
    runCount int64
    errCount int64
}

func NewJobRunner(logger *slog.Logger) *JobRunner {
    return &JobRunner{
        cron: cron.New(cron.WithSeconds(), cron.WithChain(
            cron.Recover(cron.DefaultLogger), // built-in panic recovery
        )),
        jobs:   make(map[string]*jobEntry),
        logger: logger,
    }
}

// Register adds a scheduled job
func (r *JobRunner) Register(cronExpr string, job Job) error {
    r.mu.Lock()
    defer r.mu.Unlock()

    entry := &jobEntry{
        job:      job,
        cronExpr: cronExpr,
    }

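    // Jobs receive context.Background(), so shutdown can wait for them but
    // cannot cancel them; pass a cancelable context here if jobs should stop early.
    _, err := r.cron.AddFunc(cronExpr, func() {
        r.executeJob(context.Background(), entry)
    })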
    if err != nil {
        return fmt.Errorf("invalid cron expression %q: %w", cronExpr, err)
    }

    r.jobs[job.Name()] = entry
    r.logger.Info("job registered", "name", job.Name(), "schedule", cronExpr)
    return nil
}

// executeJob runs a single job with panic recovery, timing, and error recording
func (r *JobRunner) executeJob(ctx context.Context, entry *jobEntry) {
    r.wg.Add(1)
    defer r.wg.Done()

    name := entry.job.Name()
    start := time.Now()

    r.logger.Info("job starting", "job", name)

    // Panic recovery: prevent a single job panic from crashing the process
    defer func() {
        if rec := recover(); rec != nil {
            buf := make([]byte, 64<<10)
            n := runtime.Stack(buf, false)

            r.mu.Lock()
            entry.errCount++
            entry.lastErr = fmt.Errorf("panic: %v", rec)
            r.mu.Unlock()

            r.logger.Error("job panicked",
                "job", name,
                "panic", rec,
                "stack", string(buf[:n]),
                "duration", time.Since(start),
            )
        }
    }()

    err := entry.job.Run(ctx)
    duration := time.Since(start)

    r.mu.Lock()
    entry.lastRun = start
    entry.runCount++
    if err != nil {
        entry.errCount++
        entry.lastErr = err
    } else {
        entry.lastErr = nil
    }
    r.mu.Unlock()

    if err != nil {
        r.logger.Error("job failed", "job", name, "error", err, "duration", duration)
    } else {
        r.logger.Info("job completed", "job", name, "duration", duration)
    }
}

// Start begins the scheduler
func (r *JobRunner) Start() {
    r.cron.Start()
    r.logger.Info("job runner started")
}

// Shutdown gracefully stops the scheduler and waits for running jobs to finish.
// The timeout is a single budget shared by both waits below.
func (r *JobRunner) Shutdown(timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // Step 1: Stop the cron scheduler so no new jobs are triggered.
    // Stop returns a context that is done once cron's running jobs have returned.
    cronCtx := r.cron.Stop()

    // Step 2: Wait for the jobs that cron started
    select {
    case <-cronCtx.Done():
        r.logger.Info("cron scheduler stopped")
    case <-ctx.Done():
        r.logger.Warn("cron scheduler stop timeout")
    }

    // Step 3: Wait for our own WaitGroup-tracked jobs
    done := make(chan struct{})
    go func() {
        r.wg.Wait()
        close(done)
    }()

    select {
    case <-done:
        r.logger.Info("all background jobs completed")
        return nil
    case <-ctx.Done():
        return fmt.Errorf("background jobs did not finish within %v", timeout)
    }
}

2. A Concrete Job: Expired Session Cleanup

// CleanupSessionJob removes expired sessions from the database
type CleanupSessionJob struct {
    db     *sql.DB
    logger *slog.Logger
}

func (j *CleanupSessionJob) Name() string { return "cleanup-sessions" }

func (j *CleanupSessionJob) Run(ctx context.Context) error {
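    // DELETE ... LIMIT is MySQL syntax; other databases need a different
    // form (for example, a subquery that selects the rows to delete).
    result, err := j.db.ExecContext(ctx,
        "DELETE FROM sessions WHERE expires_at < NOW() LIMIT 10000",
    )
    if err != nil {
        return fmt.Errorf("delete sessions: %w", err)
    }

    affected, err := result.RowsAffected()
    if err != nil {
        return fmt.Errorf("rows affected: %w", err)
    }
    j.logger.Info("sessions cleaned up", "deleted", affected)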
    return nil
}

3. The Complete Graceful Shutdown Orchestration

This is the most critical part—composing the HTTP server, background jobs, and signal handling into a unified shutdown sequence:

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Initialize dependencies
    db := mustConnectDB()
    defer db.Close()

    // Create HTTP server
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", handleUsers)
    server := &http.Server{
        Addr:         ":8080",
        Handler:      mux,
        ReadTimeout:  15 * time.Second,
        WriteTimeout: 30 * time.Second,
        IdleTimeout:  60 * time.Second,
    }

    // Create and start background job scheduler
    runner := jobrunner.NewJobRunner(logger)
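    if err := runner.Register("0 */10 * * * *", &CleanupSessionJob{db: db, logger: logger}); err != nil {
        logger.Error("register job failed", "err", err)
        os.Exit(1)
    }
    if err := runner.Register("0 0 4 * * *", &DailyReportJob{db: db}); err != nil {
        logger.Error("register job failed", "err", err)
        os.Exit(1)
    }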
    runner.Start()

    // Signal handling
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    // Start HTTP server (non-blocking)
    go func() {
        logger.Info("HTTP server starting", "addr", server.Addr)
        if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
            logger.Error("HTTP server error", "err", err)
            os.Exit(1)
        }
    }()

    // Block until a signal is received
    sig := <-sigCh
    logger.Info("shutdown signal received", "signal", sig)

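    // === Graceful Shutdown Sequence ===
    // The two phases below have separate budgets (60s for HTTP, 30s for jobs).
    // Their sum must fit inside the grace period the orchestrator allows
    // (Kubernetes defaults to 30 seconds), or the process is killed mid-shutdown.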

    shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer shutdownCancel()

    // Phase 1: Stop accepting new HTTP requests; wait for in-flight requests to complete
    logger.Info("shutting down HTTP server...")
    if err := server.Shutdown(shutdownCtx); err != nil {
        logger.Error("HTTP server shutdown error", "err", err)
    } else {
        logger.Info("HTTP server shutdown complete")
    }

    // Phase 2: Stop job scheduler; wait for running jobs to complete
    logger.Info("shutting down background jobs...")
    if err := runner.Shutdown(30 * time.Second); err != nil {
        logger.Error("job runner shutdown error", "err", err)
    } else {
        logger.Info("background jobs shutdown complete")
    }

    // Phase 3: Database connection closes via defer db.Close()

    logger.Info("graceful shutdown complete")
}

Edge Case: Double Ctrl+C

In development, if the first Ctrl+C doesn't produce a timely exit (e.g., a job is stuck), the user presses it again. Good practice: first signal starts graceful shutdown; second signal forces an immediate exit:

sigCh := make(chan os.Signal, 2)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

// First signal: begin graceful shutdown
sig := <-sigCh
logger.Info("initiating graceful shutdown", "signal", sig)

go func() {
    doGracefulShutdown()
    os.Exit(0)
}()

// Second signal: force exit
sig2 := <-sigCh
logger.Warn("forcing immediate exit", "signal", sig2)
os.Exit(1)

Panic Isolation and Monitoring

In production, a task panic must not crash the entire service. But it also cannot be silently swallowed—it must be logged and surfaced to monitoring:

func safeRun(ctx context.Context, job Job, onPanic func(name string, err error)) (err error) {
    defer func() {
        if r := recover(); r != nil {
            buf := make([]byte, 64<<10)
            n := runtime.Stack(buf, false)
            stackTrace := string(buf[:n])

            panicErr := fmt.Errorf("panic in job %s: %v\nstack:\n%s",
                job.Name(), r, stackTrace)
            err = panicErr

            if onPanic != nil {
                onPanic(job.Name(), panicErr)
            }
        }
    }()

    return job.Run(ctx)
}

Level 4 · Advanced Topics and Edge Cases

Redis-Backed Persistent Job Queue

In-memory scheduling has a fundamental flaw: tasks queued up in memory are lost when the process restarts. For business-critical tasks (sending notifications, processing payment callbacks), this is unacceptable.

The solution is using Redis as a persistent job queue, typically via asynq (a Redis-backed task queue library for Go):

import "github.com/hibiken/asynq"

const TypeEmailWelcome = "email:welcome"

type EmailWelcomePayload struct {
    UserID int64
    Email  string
}

// Producer: enqueue from an HTTP handler
func EnqueueWelcomeEmail(client *asynq.Client, userID int64, email string) error {
    payload, err := json.Marshal(EmailWelcomePayload{UserID: userID, Email: email})
    if err != nil {
        return fmt.Errorf("marshal payload: %w", err)
    }
    task := asynq.NewTask(TypeEmailWelcome, payload,
        asynq.MaxRetry(3),
        asynq.Timeout(30*time.Second),
        asynq.Retention(24*time.Hour),
    )
    _, err = client.Enqueue(task)
    return err
}

// Consumer: process the task in a worker
func handleEmailWelcome(ctx context.Context, t *asynq.Task) error {
    var payload EmailWelcomePayload
    if err := json.Unmarshal(t.Payload(), &payload); err != nil {
        return fmt.Errorf("unmarshal payload: %w", err)
    }
    return sendWelcomeEmail(ctx, payload.UserID, payload.Email)
}

// Start the worker server
func startWorker(redisAddr string) {
    srv := asynq.NewServer(
        asynq.RedisClientOpt{Addr: redisAddr},
        asynq.Config{
            Concurrency: 10,
            Queues: map[string]int{
                "critical": 6,
                "default":  3,
                "low":      1,
            },
        },
    )

    mux := asynq.NewServeMux()
    mux.HandleFunc(TypeEmailWelcome, handleEmailWelcome)

    if err := srv.Run(mux); err != nil {
        log.Fatal(err)
    }
}

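asynq's graceful shutdown is straightforward. srv.Run(mux) blocks until the process receives a termination signal and then shuts the server down gracefully. If you manage the lifecycle yourself with srv.Start, call srv.Shutdown(): it waits for currently processing tasks to complete (up to Config.ShutdownTimeout), and tasks that do not finish in time are returned to Redis for later processing. srv.Stop() only stops the server from pulling new tasks; it does not shut the server down.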

Distributed Cron: Leader Election

When a service scales horizontally to multiple instances, a problem emerges: all instances have the same cron jobs registered; at trigger time, every instance executes the job—causing duplicate execution.

The solution is Leader Election: only the instance currently elected as leader executes cron jobs.

Using etcd for leader election:

import (
    "context"
    "log"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

func runWithLeaderElection(ctx context.Context, etcdClient *clientv3.Client, campaignKey string, fn func(ctx context.Context)) error {
    for {
        // A session is backed by a lease. Create a fresh one per campaign so
        // that an expired session is never reused.
        session, err := concurrency.NewSession(etcdClient, concurrency.WithTTL(10))
        if err != nil {
            return err
        }
        election := concurrency.NewElection(session, campaignKey)

        // Campaign blocks until this node becomes leader
        if err := election.Campaign(ctx, "leader"); err != nil {
            session.Close()
            if ctx.Err() != nil {
                return nil // Normal exit via ctx cancellation
            }
            return err
        }

        log.Println("became leader, starting cron jobs")

        leaderCtx, leaderCancel := context.WithCancel(ctx)

        // Watch for leadership loss in the background. Every candidate
        // campaigns with the same value, so comparing the observed value
        // cannot detect a change of leader; the session's lease can.
        go func() {
            select {
            case <-session.Done(): // lease expired or was revoked
                leaderCancel()
            case <-leaderCtx.Done():
            }
        }()

        // Run cron jobs under leaderCtx; fn must return once it is canceled
        fn(leaderCtx)
        leaderCancel()

        // Closing the session revokes the lease, which releases leadership
        session.Close()

        if ctx.Err() != nil {
            return nil
        }
        log.Println("lost leadership, re-campaigning")
    }
}

Job Deduplication and Idempotency

Even with a single leader, duplicate execution can still occur: the leader crashes mid-task; the new leader re-executes the same task.

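The correct design ensures idempotency: executing a task once and executing it multiple times produce the same result. Where a task cannot be made fully idempotent, a lock keyed on the schedule window limits the damage by letting only one instance run per window: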

// Use Redis SETNX as a distributed lock for deduplication
func (r *JobRunner) executeWithDedup(ctx context.Context, job Job, ttl time.Duration) error {
    lockKey := fmt.Sprintf("job:lock:%s:%d",
        job.Name(),
        time.Now().Truncate(ttl).Unix(), // Deduplicate within a time window
    )

    ok, err := r.redis.SetNX(ctx, lockKey, "1", ttl).Result()
    if err != nil {
        return fmt.Errorf("acquire lock: %w", err)
    }
    if !ok {
        // Another instance has already executed or is executing this time window's job
        log.Printf("job %s skipped (already executed in this window)", job.Name())
        return nil
    }

    return job.Run(ctx)
}

Dead Letter Queues and Job Heartbeat Monitoring

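For long-running tasks, a heartbeat mechanism detects whether the worker running a task has died. The monitor needs a list of tasks that are supposed to be alive, because an expired heartbeat key simply disappears from Redis and cannot be found by scanning for it:

// activeJobsKey is a Redis set holding the IDs of tasks that should be heartbeating
const activeJobsKey = "job:active"

// Heartbeat: register the task, then periodically refresh a key with a TTL in Redis
func runWithHeartbeat(ctx context.Context, rdb *redis.Client, taskID string, interval time.Duration, fn func(ctx context.Context) error) error {
    heartbeatKey := fmt.Sprintf("job:heartbeat:%s", taskID)

    rdb.SAdd(ctx, activeJobsKey, taskID)
    rdb.Set(ctx, heartbeatKey, time.Now().Unix(), interval*3)

    // hbCtx stops the heartbeat goroutine as soon as fn returns
    hbCtx, stopHeartbeat := context.WithCancel(ctx)
    stopped := make(chan struct{})
    go func() {
        defer close(stopped)
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                rdb.Set(hbCtx, heartbeatKey, time.Now().Unix(), interval*3)
            case <-hbCtx.Done():
                return
            }
        }
    }()

    err := fn(ctx)

    stopHeartbeat()
    <-stopped

    // Deregister with a fresh context so cleanup still runs when ctx is already canceled
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    rdb.SRem(cleanupCtx, activeJobsKey, taskID)
    rdb.Del(cleanupCtx, heartbeatKey)
    return err
}

// Monitor goroutine: periodically check the heartbeat of every registered task
func monitorHeartbeats(ctx context.Context, rdb *redis.Client, alertFn func(taskID string)) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            taskIDs, err := rdb.SMembers(ctx, activeJobsKey).Result()
            if err != nil {
                continue // Redis unavailable; try again on the next tick
            }
            for _, taskID := range taskIDs {
                exists, err := rdb.Exists(ctx, "job:heartbeat:"+taskID).Result()
                if err == nil && exists == 0 {
                    // Task is registered but its heartbeat key has expired:
                    // the worker has probably crashed or lost its Redis connection
                    alertFn(taskID)
                }
            }
        case <-ctx.Done():
            return
        }
    }
}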

A dead letter queue collects tasks that have failed beyond the maximum retry count, for manual or automated review and reprocessing. asynq has built-in support for this: a task that exhausts its retries is moved to the archived state (called "dead" in older releases), where it can be inspected and manually re-run through the Asynqmon web UI or the asynq CLI.

Design Summary: Seven Principles of Production-Grade Background Job Systems

Explicit lifecycle: Every goroutine must have a clear start and exit path, controlled via context or channel.
Panic isolation: Task panics must not crash the entire process; capture with recover, log, and alert.
Graceful shutdown is non-negotiable: On SIGTERM, give running tasks sufficient time to complete (30 seconds is typically a reasonable upper bound, and it is also the default termination grace period in Kubernetes).
Persist critical tasks: For business-critical tasks, use a persistent queue (Redis or database), not an in-memory queue.
Idempotent design: Tasks should be safe to execute multiple times, or use locking to prevent duplicate execution.
Observability: Task execution count, failure count, last execution time, and duration should all be exposed as metrics.
Distributed deduplication: In multi-instance deployments, use leader election or distributed locks to ensure cron tasks are not triggered redundantly.

These seven principles are not "best practices"—they are hard lessons from production incidents. Every one of them is written in outages.
