Chapter 18

Scheduling: Who Runs First

Imagine you're a nurse at a clinic with twenty patients in the waiting room. How do you decide who sees the doctor next? First in line? Whoever has the quickest appointment? By urgency? That decision-making process is scheduling — and the operating system uses the exact same logic to decide which process or thread gets the CPU next.

Schedulers exist because the number of CPU cores (a handful to a few dozen) is always far smaller than the number of processes wanting to run (often in the hundreds). The scheduler's job is to distribute limited CPU time fairly and efficiently while serving very different needs: some tasks demand fast response (mouse clicks), some want high throughput (video encoding), some must be exactly on time (audio playback).

Core Concepts

Three Classic Scheduling Algorithms

First Come First Served (FCFS)

Processes run in arrival order. One finishes before the next starts.

Processes: P1 (10ms)  P2 (2ms)  P3 (3ms)  arrive in order

Timeline:
0────────────10──12───15
      P1         P2    P3

Average wait time = (0 + 10 + 12) / 3 = 7.3ms
Problem: P2 and P3 only need a few milliseconds but wait for P1.
This is the "Convoy Effect" — a slow truck blocking a highway.

Shortest Job First (SJF)

Always run whichever process has the shortest remaining burst time.

Processes: P1 (10ms)  P2 (2ms)  P3 (3ms)  all arrive at once

Timeline:
0──2───5──────────15
   P2    P3    P1

Average wait time = (5 + 0 + 2) / 3 = 2.3ms
Problem: if short jobs keep arriving, P1 waits forever — "Starvation."

Round Robin (RR)

Each process gets a fixed time slice (quantum, typically 1–100ms), then the next process runs.

Quantum = 4ms, processes P1(10ms) P2(5ms) P3(3ms)

0──4──8──11──15──17──21──23
  P1  P2  P3   P1  P2  P1  P1

Good average response time, but total completion time is longer than SJF.
Quantum too short → context switch overhead dominates
Quantum too long  → degenerates into FCFS

Linux CFS: The Completely Fair Scheduler

Since Linux 2.6.23 (2007), Linux has used CFS (Completely Fair Scheduler). Its central idea: every process should receive a fair share of CPU time.

CFS does not use fixed time slices. Instead it tracks virtual runtime (vruntime):

vruntime = actual_runtime × (NICE_0_LOAD / process_weight)

A process's weight is determined by its nice value:
  nice = -20 (highest priority) → heavy weight → vruntime grows slowly
  nice =  0  (default)          → normal weight → vruntime grows normally
  nice = +19 (lowest priority)  → light weight  → vruntime grows quickly

CFS always picks the process with the smallest vruntime to run next, ensuring that processes which have run less are always given preference — until they catch up with the others.

The Red-Black Tree: CFS's Data Structure

        vruntime=100
        /            \
  vruntime=80     vruntime=150
  /      \
vr=60  vr=90

CFS inserts all runnable processes into a red-black tree keyed by vruntime.
The leftmost node (smallest vruntime) = next process to run.
Insert / delete / find: O(log n)

The red-black tree keeps scheduling decisions at O(log n) — efficient even with hundreds of runnable processes.

nice Values and Priority

# Run a CPU-heavy task at low priority (nice=10), yielding to others
nice -n 10 ./cpu_heavy_task

# Change the priority of already-running process PID=1234 (requires root for negative values)
renice -n -5 -p 1234

The nice range is -20 (highest priority) to +19 (lowest). Regular users can only raise the nice value (lower priority). Root can set negative values.

Real-Time Scheduling vs. Normal Scheduling

Linux has multiple scheduling classes, from highest to lowest priority:

Class             Policy              Use case
──────────────────────────────────────────────────────
SCHED_DEADLINE    Earliest deadline   Hard real-time (industrial control)
SCHED_FIFO        FIFO, no preempt    Audio/video, real-time tasks
SCHED_RR          Real-time RR        Real-time tasks
SCHED_OTHER       CFS (default)       Ordinary processes
SCHED_IDLE        Lowest priority     Background idle work

A real-time process (SCHED_FIFO or SCHED_RR) can preempt any normal process regardless of nice values. An infinite loop under SCHED_FIFO can starve the entire system — which is why real-time scheduling requires root privilege.

Hands-On Verification

# Show scheduling class and nice value for all processes
ps -eo pid,ni,cls,cmd | head -20
# NI = nice value; CLS = scheduling class (TS=normal, FF=FIFO, RR=round-robin)

# Watch CPU usage in real time, sorted by CPU (press P)
top

# htop shows per-core load bars — very intuitive
htop

# Show scheduling statistics for a specific process
cat /proc/$(pgrep -n python3)/sched
# Fields include nr_voluntary_switches (process yielded CPU willingly)
# and nr_involuntary_switches (OS forced the switch)

# Demonstrate nice values competing for CPU
# Start a normal-priority CPU hog in the background
python3 -c "while True: pass" &
NORMAL_PID=$!

# Start a low-priority CPU hog (nice=19)
nice -n 19 python3 -c "while True: pass" &
LOW_PID=$!

# Watch their CPU usage — the normal process should dominate
top -p $NORMAL_PID,$LOW_PID

# Clean up
kill $NORMAL_PID $LOW_PID

# Monitor overall scheduling pressure
vmstat 1
# procs r column = number of processes in the run queue
# If r consistently exceeds your core count, CPU is the bottleneck

🔬 Going Deeper

CFS's Successor: EEVDF in Linux 6.6

Linux 6.6 (late 2023) introduced EEVDF (Earliest Eligible Virtual Deadline First), which partially replaces CFS. EEVDF adds deadline awareness on top of CFS's fairness: it tracks each task's "virtual deadline" (when it should have finished running) and always picks the eligible task whose deadline is soonest. For latency-sensitive workloads — gaming, audio, real-time media — the improvement is measurable because CFS could sometimes keep a high-priority task waiting longer than necessary.

The Fundamental Scheduling Trade-Off

Every scheduling algorithm faces an irresolvable tension:

Minimize response time: give users instant feedback (short quanta, frequent switches)
Maximize throughput: complete as many jobs as possible per second (long quanta, fewer switches)

Modern OSes handle this by treating different tasks differently: interactive processes (shells, GUI apps) that wake up in response to user input are given higher priority and shorter quanta; batch workloads (compilation, number crunching) get longer quanta to amortize context-switch overhead. CFS's vruntime mechanism achieves a reasonable balance automatically, without requiring the OS to explicitly classify each process.

Recommended Reading:

Operating Systems: Three Easy Pieces (OSTEP) — Chapters 7–10 cover scheduling algorithms from FCFS through MLFQ with careful analysis and thought-provoking exercises.
Linux kernel source kernel/sched/fair.c — The complete CFS implementation. Reading it alongside OSTEP's scheduling chapters is one of the most educational things you can do as a systems programmer.
Ingo Molnár's original CFS announcement email (linux-kernel mailing list archive, 2007) — The designer explaining his own reasoning in plain language. Highly recommended primary source.

Rate this chapter

4.7 / 5 (12 ratings)

Scheduling: Who Runs First

Scheduling: Who Runs First

Core Concepts

Three Classic Scheduling Algorithms

Linux CFS: The Completely Fair Scheduler

nice Values and Priority

Real-Time Scheduling vs. Normal Scheduling

Hands-On Verification

🔬 Going Deeper

💬 Comments