Scheduling: Who Runs First
Scheduling: Who Runs First
Imagine you're a nurse at a clinic with twenty patients in the waiting room. How do you decide who sees the doctor next? First in line? Whoever has the quickest appointment? By urgency? That decision-making process is scheduling — and the operating system uses the exact same logic to decide which process or thread gets the CPU next.
Schedulers exist because the number of CPU cores (a handful to a few dozen) is always far smaller than the number of processes wanting to run (often in the hundreds). The scheduler's job is to distribute limited CPU time fairly and efficiently while serving very different needs: some tasks demand fast response (mouse clicks), some want high throughput (video encoding), some must be exactly on time (audio playback).
Core Concepts
Three Classic Scheduling Algorithms
First Come First Served (FCFS)
Processes run in arrival order. One finishes before the next starts.
Processes: P1 (10ms) P2 (2ms) P3 (3ms) arrive in order
Timeline:
0────────────10──12───15
P1 P2 P3
Average wait time = (0 + 10 + 12) / 3 = 7.3ms
Problem: P2 and P3 only need a few milliseconds but wait for P1.
This is the "Convoy Effect" — a slow truck blocking a highway.
Shortest Job First (SJF)
Always run whichever process has the shortest remaining burst time.
Processes: P1 (10ms) P2 (2ms) P3 (3ms) all arrive at once
Timeline:
0──2───5──────────15
P2 P3 P1
Average wait time = (5 + 0 + 2) / 3 = 2.3ms
Problem: if short jobs keep arriving, P1 waits forever — "Starvation."
Round Robin (RR)
Each process gets a fixed time slice (quantum, typically 1–100ms), then the next process runs.
Quantum = 4ms, processes P1(10ms) P2(5ms) P3(3ms)
0──4──8──11──15──17──21──23
P1 P2 P3 P1 P2 P1 P1
Good average response time, but total completion time is longer than SJF.
Quantum too short → context switch overhead dominates
Quantum too long → degenerates into FCFS
Linux CFS: The Completely Fair Scheduler
Since Linux 2.6.23 (2007), Linux has used CFS (Completely Fair Scheduler). Its central idea: every process should receive a fair share of CPU time.
CFS does not use fixed time slices. Instead it tracks virtual runtime (vruntime):
vruntime = actual_runtime × (NICE_0_LOAD / process_weight)
A process's weight is determined by its nice value:
nice = -20 (highest priority) → heavy weight → vruntime grows slowly
nice = 0 (default) → normal weight → vruntime grows normally
nice = +19 (lowest priority) → light weight → vruntime grows quickly
CFS always picks the process with the smallest vruntime to run next, ensuring that processes which have run less are always given preference — until they catch up with the others.
The Red-Black Tree: CFS's Data Structure
vruntime=100
/ \
vruntime=80 vruntime=150
/ \
vr=60 vr=90
CFS inserts all runnable processes into a red-black tree keyed by vruntime.
The leftmost node (smallest vruntime) = next process to run.
Insert / delete / find: O(log n)
The red-black tree keeps scheduling decisions at O(log n) — efficient even with hundreds of runnable processes.
nice Values and Priority
# Run a CPU-heavy task at low priority (nice=10), yielding to others
nice -n 10 ./cpu_heavy_task
# Change the priority of already-running process PID=1234 (requires root for negative values)
renice -n -5 -p 1234
The nice range is -20 (highest priority) to +19 (lowest). Regular users can only raise the nice value (lower priority). Root can set negative values.
Real-Time Scheduling vs. Normal Scheduling
Linux has multiple scheduling classes, from highest to lowest priority:
Class Policy Use case
──────────────────────────────────────────────────────
SCHED_DEADLINE Earliest deadline Hard real-time (industrial control)
SCHED_FIFO FIFO, no preempt Audio/video, real-time tasks
SCHED_RR Real-time RR Real-time tasks
SCHED_OTHER CFS (default) Ordinary processes
SCHED_IDLE Lowest priority Background idle work
A real-time process (SCHED_FIFO or SCHED_RR) can preempt any normal process regardless of nice values. An infinite loop under SCHED_FIFO can starve the entire system — which is why real-time scheduling requires root privilege.
Hands-On Verification
# Show scheduling class and nice value for all processes
ps -eo pid,ni,cls,cmd | head -20
# NI = nice value; CLS = scheduling class (TS=normal, FF=FIFO, RR=round-robin)
# Watch CPU usage in real time, sorted by CPU (press P)
top
# htop shows per-core load bars — very intuitive
htop
# Show scheduling statistics for a specific process
cat /proc/$(pgrep -n python3)/sched
# Fields include nr_voluntary_switches (process yielded CPU willingly)
# and nr_involuntary_switches (OS forced the switch)
# Demonstrate nice values competing for CPU
# Start a normal-priority CPU hog in the background
python3 -c "while True: pass" &
NORMAL_PID=$!
# Start a low-priority CPU hog (nice=19)
nice -n 19 python3 -c "while True: pass" &
LOW_PID=$!
# Watch their CPU usage — the normal process should dominate
top -p $NORMAL_PID,$LOW_PID
# Clean up
kill $NORMAL_PID $LOW_PID
# Monitor overall scheduling pressure
vmstat 1
# procs r column = number of processes in the run queue
# If r consistently exceeds your core count, CPU is the bottleneck
🔬 Going Deeper
CFS's Successor: EEVDF in Linux 6.6
Linux 6.6 (late 2023) introduced EEVDF (Earliest Eligible Virtual Deadline First), which partially replaces CFS. EEVDF adds deadline awareness on top of CFS's fairness: it tracks each task's "virtual deadline" (when it should have finished running) and always picks the eligible task whose deadline is soonest. For latency-sensitive workloads — gaming, audio, real-time media — the improvement is measurable because CFS could sometimes keep a high-priority task waiting longer than necessary.
The Fundamental Scheduling Trade-Off
Every scheduling algorithm faces an irresolvable tension:
- Minimize response time: give users instant feedback (short quanta, frequent switches)
- Maximize throughput: complete as many jobs as possible per second (long quanta, fewer switches)
Modern OSes handle this by treating different tasks differently: interactive processes (shells, GUI apps) that wake up in response to user input are given higher priority and shorter quanta; batch workloads (compilation, number crunching) get longer quanta to amortize context-switch overhead. CFS's vruntime mechanism achieves a reasonable balance automatically, without requiring the OS to explicitly classify each process.
Recommended Reading:
- Operating Systems: Three Easy Pieces (OSTEP) — Chapters 7–10 cover scheduling algorithms from FCFS through MLFQ with careful analysis and thought-provoking exercises.
- Linux kernel source
kernel/sched/fair.c— The complete CFS implementation. Reading it alongside OSTEP's scheduling chapters is one of the most educational things you can do as a systems programmer. - Ingo Molnár's original CFS announcement email (linux-kernel mailing list archive, 2007) — The designer explaining his own reasoning in plain language. Highly recommended primary source.