Background Jobs and Graceful Shutdown
Background Jobs and Graceful Shutdown
In 2018, a major e-commerce platform suffered a serious incident during a peak sales event. The ops team performed a rolling deployment at the height of traffic. When application processes received SIGTERM, they exited immediatelyโcutting off over 2,000 in-flight order transactions mid-process. Because payment states were never written to the database, these orders entered a limbo state: "payment deducted, order not confirmed." The customer support team spent three full days manually investigating and issuing refunds.
This is a story about "exit"โand the cost was far heavier than it sounds.
In the Go world, background jobs and graceful shutdown are two problems that production services must take seriously. The former determines what work your system can do autonomously; the latter determines whether that work can end safely when the process restarts. Together they define a service's lifecycle management capability.
This chapter starts from first principles: the scheduling mechanics behind background tasks, goroutine lifecycle management, and the complete implementation path for signal handling and graceful shutdown.
Level 1 ยท What You Need to Know
The Nature of the Background Job Problem
Behind every web service, there is work that cannot be done on the HTTP request-handling path:
Time-sensitive async tasks: Sending a welcome email after user registration. If this runs synchronously inside the HTTP handler, the API response time is held hostage by the mail service's latency. The user is waiting for you to send an email. That's unreasonable.
Periodic maintenance tasks: Expiring session cleanup, log archival, cache warm-up, statistics aggregationโthese tasks need to run on a schedule, independent of any user request.
Compute-intensive tasks: Report generation, video processing, bulk data importsโthese may take minutes or hours. HTTP requests cannot wait that long; they must run asynchronously in the background.
These requirements gave rise to the Background Job System design pattern. Different languages and frameworks have different names for it: Cron Jobs, Workers, Daemon Tasks, Background Services. But the core questions are always the same: Where does a task get triggered? Where does it run? What happens on failure? How is it handled when the system shuts down?
Why Graceful Shutdown Matters
A process can terminate in two ways:
Abrupt exit (Crash / kill -9): The process is forcibly killed. All in-memory state is lost. Any code currently executing stops instantly. This is the accident scenarioโuncontrollable.
Signal-based exit (SIGTERM / SIGINT): The OS or a user sends a termination signal. The process has the opportunity to catch the signal and execute cleanup logic before exiting. This is the normal process lifecycle management approach.
Kubernetes Pod rolling updates, Docker container restarts, systemctl stop, Ctrl+Cโall of these notify the process via SIGTERM by default. If the process ignores the signal (or has no handler registered), Go's runtime default behavior is to exit immediately.
Graceful shutdown means: upon receiving an exit signal, let in-flight work finish safely, reject new work, orderly release all resources, then exit.
This process breaks down into four phases:
- Stop accepting new requests: HTTP server stops accepting new connections; the job scheduler stops triggering new tasks.
- Wait for in-progress work to complete: HTTP requests finish processing; running background tasks reach a safe checkpoint.
- Close resource connections: Database connection pools, Redis clients, message queue consumers.
- Exit the process: All goroutines have exited; the
mainfunction returns.
Go Signal Handling Basics
Go programs receive Unix signals via the os/signal package:
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
sig := <-sigCh
fmt.Printf("received signal: %v\n", sig)
// Begin graceful shutdown
signal.Notify routes the specified signals to the provided channel. A buffer size of at least 1 is recommended to avoid signal loss when no goroutine is currently readingโGo's signal mechanism is best-effort, not guaranteed-delivery.
SIGTERM (signal 15) is the standard process-termination signal; the kill command sends it by default. SIGINT (signal 2) is sent when the user presses Ctrl+C in the terminal. Production services need to handle both.
Level 2 ยท Principles and Mechanics
time.Ticker vs time.AfterFunc: Two Periodic Trigger Models
The Go standard library provides two primary timer mechanisms, which differ fundamentally in semantics:
time.Ticker: Fixed-interval triggering
ticker := time.NewTicker(1 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ticker.C:
// Triggered once per minute
doWork()
case <-ctx.Done():
return
}
}
A Ticker is channel-based. ticker.C is a read-only channel to which the runtime sends timestamps at the specified interval. A critical detail: if doWork() takes longer than the Ticker interval, the Ticker doesn't waitโit keeps trying to send to the channel. If the channel buffer (default: 1) is full, the tick is dropped. This means a Ticker does not guarantee strict interval spacing between ticks.
time.AfterFunc: Delayed one-shot triggering
time.AfterFunc(5*time.Second, func() {
doWork()
})
AfterFunc executes the callback in a new goroutine without blocking the caller. It fires once and does not repeat. To achieve a loop, you call AfterFunc again inside the callback. This pattern implements "wait N seconds after the task completes before running again"โsemantically different from Ticker's "trigger every N seconds."
The core distinction: Ticker uses wall-clock interval (fixed frequency), while recursive AfterFunc uses execution interval (time between completions). For long-running tasks, the former can cause concurrent executions; the latter cannot.
robfig/cron: Cron-Expression Scheduling
The standard library's Ticker only supports fixed time intervals. Production environments often need more complex scheduling rulesโ"every day at 3 AM" or "every first Monday of the month." That's where cron-expression schedulers come in.
github.com/robfig/cron/v3 is the most commonly used implementation in the Go ecosystem:
c := cron.New(cron.WithSeconds()) // Enable second-level precision
c.AddFunc("0 0 3 * * *", func() { // Daily at 3:00 AM
cleanupExpiredSessions()
})
c.AddFunc("0 0 * * 1", func() { // Every Monday at midnight
generateWeeklyReport()
})
c.Start()
defer c.Stop()
cron.Stop() stops the scheduler from triggering new tasks, but does not wait for currently running tasks to complete. If you need to wait, you must combine it with a WaitGroup yourself. This is a detail that demands explicit attention when designing a background job system.
Internally, robfig/cron maintains a list of jobs sorted by next-trigger time (a min-heap). A single goroutine loops continuously, computing the time until the next wakeup and sleeping. When awakened, it triggers all jobs whose time has come (each in its own goroutine), then recalculates the next wakeup.
Goroutine Lifecycle Management
Go goroutines have no parent-child relationship and no implicit cancellation propagation. This differs from OS threadsโOS process trees have explicit hierarchy, and a parent process can broadcast signals to children. In Go, goroutine exit must be coordinated explicitly by the developer via channels or contexts.
The context.WithCancel propagation model:
ctx, cancel := context.WithCancel(context.Background())
// Launch multiple background goroutines, all sharing the same ctx
go workerA(ctx)
go workerB(ctx)
go workerC(ctx)
// Canceling notifies all goroutines simultaneously
cancel()
The cancel function returned by context.WithCancel closes the ctx.Done() channel (technically, it signals all listeners to unblock). Every goroutine waiting on ctx.Done() wakes up at once.
This is the classic fan-out notification pattern. It works because closing a channel notifies all readers simultaneouslyโunlike sending a value, which delivers to exactly one reader. All goroutines reading from a closed channel receive the zero value immediately.
Context tree propagation:
rootCtx, rootCancel := context.WithCancel(context.Background())
// HTTP server uses a child context with timeout
httpCtx, httpCancel := context.WithTimeout(rootCtx, 30*time.Second)
defer httpCancel()
// Background jobs use another child context
jobCtx, jobCancel := context.WithCancel(rootCtx)
defer jobCancel()
// When rootCancel() is called, both httpCtx and jobCtx are canceled
This tree structure makes graceful shutdown elegantly simple: root context cancellation cascades downward like water flowing downhill. Each component only needs to watch its own ctx.Done() and clean up accordingly.
os/signal: The Complete Signal Handling Mechanism
signal.Notify internally maintains a registry mapping signals to channels. When Go's runtime captures a signal, it iterates the registry and delivers the signal to every channel registered for it.
Important edge cases:
Signal loss: If the channel is full (buffer exhausted), the signal is dropped rather than blocking the senderโthis prevents deadlock. This is why a buffer size of at least 1 is recommended.
signal.Reset: Restores a signal's default handling behavior (for SIGTERM, the default is immediate exit). Useful in certain test scenarios.
signal.Stop: Stops delivering signals to the channel; used for cleanup.
Goroutine safety: signal.Notify is safe to call from any goroutine.
Level 3 ยท Code Practice
Building a Complete Background Job System
Let's build a production-grade background job system step by stepโwith job registration, scheduled execution, panic recovery, and metrics.
1. Job Interface and JobRunner
package jobrunner
import (
"context"
"log/slog"
"sync"
"time"
"github.com/robfig/cron/v3"
)
// Job defines the interface for a background task
type Job interface {
Name() string
Run(ctx context.Context) error
}
// JobRunner manages scheduling and execution of all background jobs
type JobRunner struct {
cron *cron.Cron
mu sync.RWMutex
jobs map[string]*jobEntry
wg sync.WaitGroup // tracks all running jobs
logger *slog.Logger
}
type jobEntry struct {
job Job
cronExpr string
lastRun time.Time
lastErr error
runCount int64
errCount int64
}
func NewJobRunner(logger *slog.Logger) *JobRunner {
return &JobRunner{
cron: cron.New(cron.WithSeconds(), cron.WithChain(
cron.Recover(cron.DefaultLogger), // built-in panic recovery
)),
jobs: make(map[string]*jobEntry),
logger: logger,
}
}
// Register adds a scheduled job
func (r *JobRunner) Register(cronExpr string, job Job) error {
r.mu.Lock()
defer r.mu.Unlock()
entry := &jobEntry{
job: job,
cronExpr: cronExpr,
}
_, err := r.cron.AddFunc(cronExpr, func() {
r.executeJob(context.Background(), entry)
})
if err != nil {
return fmt.Errorf("invalid cron expression %q: %w", cronExpr, err)
}
r.jobs[job.Name()] = entry
r.logger.Info("job registered", "name", job.Name(), "schedule", cronExpr)
return nil
}
// executeJob runs a single job with panic recovery, timing, and error recording
func (r *JobRunner) executeJob(ctx context.Context, entry *jobEntry) {
r.wg.Add(1)
defer r.wg.Done()
name := entry.job.Name()
start := time.Now()
r.logger.Info("job starting", "job", name)
// Panic recovery: prevent a single job panic from crashing the process
defer func() {
if rec := recover(); rec != nil {
buf := make([]byte, 64<<10)
n := runtime.Stack(buf, false)
r.mu.Lock()
entry.errCount++
entry.lastErr = fmt.Errorf("panic: %v", rec)
r.mu.Unlock()
r.logger.Error("job panicked",
"job", name,
"panic", rec,
"stack", string(buf[:n]),
"duration", time.Since(start),
)
}
}()
err := entry.job.Run(ctx)
duration := time.Since(start)
r.mu.Lock()
entry.lastRun = start
entry.runCount++
if err != nil {
entry.errCount++
entry.lastErr = err
} else {
entry.lastErr = nil
}
r.mu.Unlock()
if err != nil {
r.logger.Error("job failed", "job", name, "error", err, "duration", duration)
} else {
r.logger.Info("job completed", "job", name, "duration", duration)
}
}
// Start begins the scheduler
func (r *JobRunner) Start() {
r.cron.Start()
r.logger.Info("job runner started")
}
// Shutdown gracefully stops the scheduler and waits for running jobs to finish
func (r *JobRunner) Shutdown(timeout time.Duration) error {
// Step 1: Stop the cron schedulerโno new jobs will be triggered
cronCtx := r.cron.Stop()
// Step 2: Wait for cron's internally running jobs (robfig/cron's Stop returns a ctx)
select {
case <-cronCtx.Done():
r.logger.Info("cron scheduler stopped")
case <-time.After(timeout):
r.logger.Warn("cron scheduler stop timeout")
}
// Step 3: Wait for our own WaitGroup-tracked jobs
done := make(chan struct{})
go func() {
r.wg.Wait()
close(done)
}()
select {
case <-done:
r.logger.Info("all background jobs completed")
return nil
case <-time.After(timeout):
return fmt.Errorf("background jobs did not finish within %v", timeout)
}
}
2. A Concrete Job: Expired Session Cleanup
// CleanupSessionJob removes expired sessions from the database
type CleanupSessionJob struct {
db *sql.DB
logger *slog.Logger
}
func (j *CleanupSessionJob) Name() string { return "cleanup-sessions" }
func (j *CleanupSessionJob) Run(ctx context.Context) error {
result, err := j.db.ExecContext(ctx,
"DELETE FROM sessions WHERE expires_at < NOW() LIMIT 10000",
)
if err != nil {
return fmt.Errorf("delete sessions: %w", err)
}
affected, _ := result.RowsAffected()
j.logger.Info("sessions cleaned up", "deleted", affected)
return nil
}
3. The Complete Graceful Shutdown Orchestration
This is the most critical partโcomposing the HTTP server, background jobs, and signal handling into a unified shutdown sequence:
func main() {
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
// Initialize dependencies
db := mustConnectDB()
defer db.Close()
// Create HTTP server
mux := http.NewServeMux()
mux.HandleFunc("/api/users", handleUsers)
server := &http.Server{
Addr: ":8080",
Handler: mux,
ReadTimeout: 15 * time.Second,
WriteTimeout: 30 * time.Second,
IdleTimeout: 60 * time.Second,
}
// Create and start background job scheduler
runner := jobrunner.NewJobRunner(logger)
runner.Register("0 */10 * * * *", &CleanupSessionJob{db: db, logger: logger})
runner.Register("0 0 4 * * *", &DailyReportJob{db: db})
runner.Start()
// Signal handling
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
// Start HTTP server (non-blocking)
go func() {
logger.Info("HTTP server starting", "addr", server.Addr)
if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
logger.Error("HTTP server error", "err", err)
os.Exit(1)
}
}()
// Block until a signal is received
sig := <-sigCh
logger.Info("shutdown signal received", "signal", sig)
// === Graceful Shutdown Sequence ===
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 60*time.Second)
defer shutdownCancel()
// Phase 1: Stop accepting new HTTP requests; wait for in-flight requests to complete
logger.Info("shutting down HTTP server...")
if err := server.Shutdown(shutdownCtx); err != nil {
logger.Error("HTTP server shutdown error", "err", err)
} else {
logger.Info("HTTP server shutdown complete")
}
// Phase 2: Stop job scheduler; wait for running jobs to complete
logger.Info("shutting down background jobs...")
if err := runner.Shutdown(30 * time.Second); err != nil {
logger.Error("job runner shutdown error", "err", err)
} else {
logger.Info("background jobs shutdown complete")
}
// Phase 3: Database connection closes via defer db.Close()
logger.Info("graceful shutdown complete")
}
Edge Case: Double Ctrl+C
In development, if the first Ctrl+C doesn't produce a timely exit (e.g., a job is stuck), the user presses it again. Good practice: first signal starts graceful shutdown; second signal forces an immediate exit:
sigCh := make(chan os.Signal, 2)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
// First signal: begin graceful shutdown
sig := <-sigCh
logger.Info("initiating graceful shutdown", "signal", sig)
go func() {
doGracefulShutdown()
os.Exit(0)
}()
// Second signal: force exit
sig2 := <-sigCh
logger.Warn("forcing immediate exit", "signal", sig2)
os.Exit(1)
Panic Isolation and Monitoring
In production, a task panic must not crash the entire service. But it also cannot be silently swallowedโit must be logged and surfaced to monitoring:
func safeRun(ctx context.Context, job Job, onPanic func(name string, err error)) (err error) {
defer func() {
if r := recover(); r != nil {
buf := make([]byte, 64<<10)
n := runtime.Stack(buf, false)
stackTrace := string(buf[:n])
panicErr := fmt.Errorf("panic in job %s: %v\nstack:\n%s",
job.Name(), r, stackTrace)
err = panicErr
if onPanic != nil {
onPanic(job.Name(), panicErr)
}
}
}()
return job.Run(ctx)
}
Level 4 ยท Advanced Topics and Edge Cases
Redis-Backed Persistent Job Queue
In-memory scheduling has a fundamental flaw: tasks queued up in memory are lost when the process restarts. For business-critical tasks (sending notifications, processing payment callbacks), this is unacceptable.
The solution is using Redis as a persistent job queue, typically via asynq (a Redis-backed task queue library for Go):
import "github.com/hibiken/asynq"
const TypeEmailWelcome = "email:welcome"
type EmailWelcomePayload struct {
UserID int64
Email string
}
// Producer: enqueue from an HTTP handler
func EnqueueWelcomeEmail(client *asynq.Client, userID int64, email string) error {
payload, _ := json.Marshal(EmailWelcomePayload{UserID: userID, Email: email})
task := asynq.NewTask(TypeEmailWelcome, payload,
asynq.MaxRetry(3),
asynq.Timeout(30*time.Second),
asynq.Retention(24*time.Hour),
)
_, err := client.Enqueue(task)
return err
}
// Consumer: process the task in a worker
func handleEmailWelcome(ctx context.Context, t *asynq.Task) error {
var payload EmailWelcomePayload
if err := json.Unmarshal(t.Payload(), &payload); err != nil {
return fmt.Errorf("unmarshal payload: %w", err)
}
return sendWelcomeEmail(ctx, payload.UserID, payload.Email)
}
// Start the worker server
func startWorker(redisAddr string) {
srv := asynq.NewServer(
asynq.RedisClientOpt{Addr: redisAddr},
asynq.Config{
Concurrency: 10,
Queues: map[string]int{
"critical": 6,
"default": 3,
"low": 1,
},
},
)
mux := asynq.NewServeMux()
mux.HandleFunc(TypeEmailWelcome, handleEmailWelcome)
if err := srv.Run(mux); err != nil {
log.Fatal(err)
}
}
asynq's graceful shutdown is straightforward: srv.Stop() waits for currently processing tasks to complete (up to ShutdownTimeout) before stopping the worker.
Distributed Cron: Leader Election
When a service scales horizontally to multiple instances, a problem emerges: all instances have the same cron jobs registered; at trigger time, every instance executes the jobโcausing duplicate execution.
The solution is Leader Election: only the instance currently elected as leader executes cron jobs.
Using etcd for leader election:
import (
clientv3 "go.etcd.io/etcd/client/v3"
"go.etcd.io/etcd/client/v3/concurrency"
)
func runWithLeaderElection(ctx context.Context, etcdClient *clientv3.Client, campaignKey string, fn func(ctx context.Context)) error {
session, err := concurrency.NewSession(etcdClient, concurrency.WithTTL(10))
if err != nil {
return err
}
defer session.Close()
election := concurrency.NewElection(session, campaignKey)
for {
// Campaign blocks until this node becomes leader
if err := election.Campaign(ctx, "leader"); err != nil {
if ctx.Err() != nil {
return nil // Normal exit via ctx cancellation
}
return err
}
log.Println("became leader, starting cron jobs")
leaderCtx, leaderCancel := context.WithCancel(ctx)
// Watch for leadership loss in the background
go func() {
defer leaderCancel()
ch := election.Observe(leaderCtx)
for resp := range ch {
if string(resp.Kvs[0].Value) != "leader" {
return
}
}
}()
// Run cron jobs under leaderCtx
fn(leaderCtx)
if ctx.Err() != nil {
return nil
}
log.Println("lost leadership, re-campaigning")
}
}
Job Deduplication and Idempotency
Even with a single leader, duplicate execution can still occur: the leader crashes mid-task; the new leader re-executes the same task.
The correct design ensures idempotency: executing a task once and executing it multiple times produce the same result.
// Use Redis SETNX as a distributed lock for deduplication
func (r *JobRunner) executeWithDedup(ctx context.Context, job Job, ttl time.Duration) error {
lockKey := fmt.Sprintf("job:lock:%s:%d",
job.Name(),
time.Now().Truncate(ttl).Unix(), // Deduplicate within a time window
)
ok, err := r.redis.SetNX(ctx, lockKey, "1", ttl).Result()
if err != nil {
return fmt.Errorf("acquire lock: %w", err)
}
if !ok {
// Another instance has already executed or is executing this time window's job
log.Printf("job %s skipped (already executed in this window)", job.Name())
return nil
}
return job.Run(ctx)
}
Dead Letter Queues and Job Heartbeat Monitoring
For long-running tasks, a heartbeat mechanism detects whether a task is stuck:
// Heartbeat: periodically update a timestamp in Redis
func runWithHeartbeat(ctx context.Context, rdb *redis.Client, taskID string, interval time.Duration, fn func(ctx context.Context) error) error {
heartbeatKey := fmt.Sprintf("job:heartbeat:%s", taskID)
go func() {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
rdb.Set(ctx, heartbeatKey, time.Now().Unix(), interval*3)
case <-ctx.Done():
return
}
}
}()
return fn(ctx)
}
// Monitor goroutine: periodically check all job heartbeats
func monitorHeartbeats(ctx context.Context, rdb *redis.Client, alertFn func(taskID string)) {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for {
select {
case <-ticker.C:
keys, _ := rdb.Keys(ctx, "job:heartbeat:*").Result()
for _, key := range keys {
exists, _ := rdb.Exists(ctx, key).Result()
if exists == 0 {
// Heartbeat key expiredโjob may be stuck or crashed
taskID := strings.TrimPrefix(key, "job:heartbeat:")
alertFn(taskID)
}
}
case <-ctx.Done():
return
}
}
}
A dead letter queue collects tasks that have failed beyond the maximum retry count, for manual or automated review and reprocessing. asynq has built-in dead letter queue supportโtasks exceeding max retries are automatically moved to the asynq:dead queue, viewable via its Web UI, and can be manually re-enqueued.
Design Summary: Seven Principles of Production-Grade Background Job Systems
- Explicit lifecycle: Every goroutine must have a clear start and exit path, controlled via context or channel.
- Panic isolation: Task panics must not crash the entire process; capture with
recover, log, and alert. - Graceful shutdown is non-negotiable: On SIGTERM, give running tasks sufficient time to complete (30 seconds is typically a reasonable upper bound).
- Persist critical tasks: For business-critical tasks, use a persistent queue (Redis or database), not an in-memory queue.
- Idempotent design: Tasks should be safe to execute multiple times, or use locking to prevent duplicate execution.
- Observability: Task execution count, failure count, last execution time, and duration should all be exposed as metrics.
- Distributed deduplication: In multi-instance deployments, use leader election or distributed locks to ensure cron tasks are not triggered redundantly.
These seven principles are not "best practices"โthey are hard lessons from production incidents. Every one of them is written in outages.