Chapter 20

File Systems: City Planning for Data

A city planner's job is to distribute an enormous number of buildings across a landscape in a way that lets people find them, manage them, and build or demolish them systematically. A file system solves the same problem for bytes on a disk. Without one, a hard drive is just an unstructured ocean of bits: a city with no street names and no building numbers. The file system is the city map: it divides the bit-ocean into blocks (city blocks), assigns each file an address (inode), lays out streets (directory trees), and maintains a land registry (the superblock).

Core Concepts

What a File Really Is: inode + Data Blocks

On any Unix/Linux file system, a file consists of two separate things:

inode (index node): the file's identity card, storing all metadata, with one notable omission: the filename is not stored here.

inode structure
┌─────────────────────────────────┐
│ inode number      : 1234567     │
│ File type         : regular     │
│ Permissions       : rw-r--r--   │
│ Hard link count   : 2           │
│ Owner UID         : 1000        │
│ Group GID         : 1000        │
│ File size         : 4096 bytes  │
│ Created           : 2024-01-15  │
│ Modified          : 2024-03-20  │
│ Data block ptr[0] : → block 8192│
│ Data block ptr[1] : → block 8193│
│ ...                             │
│ Indirect pointer  : → ptr block │
└─────────────────────────────────┘

Data blocks: the disk regions that actually hold the file's content, typically 4 KB each.

Filenames are not in the inode. They live in directories, which are nothing more than a table mapping name → inode number.
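The split between metadata and data can be observed from the shell. The following is a minimal sketch, assuming GNU coreutils `stat` (its `-c` format option; BSD and macOS `stat` use different flags). The scratch file and the 4096-byte block size are illustrative assumptions: the block size is the typical value quoted above, not something read from the disk.

```shell
#!/bin/sh
# Sketch: read a few inode fields, then estimate the data blocks a file needs.
# Assumes GNU coreutils stat (-c); the scratch file is purely illustrative.
f=$(mktemp)
head -c 10000 /dev/zero > "$f"      # make a 10,000-byte file

inode=$(stat -c %i "$f")            # inode number
links=$(stat -c %h "$f")            # hard link count
size=$(stat -c %s "$f")             # size in bytes
echo "inode=$inode links=$links size=$size"

# Ceiling division: how many blocks are needed to hold `size` bytes?
block_size=4096                     # assumed typical block size
blocks_needed=$(( (size + block_size - 1) / block_size ))
echo "blocks needed at $block_size bytes each: $blocks_needed"

rm -f "$f"
```

A 10,000-byte file needs three 4 KB blocks, so the last block is only partly used. The number of blocks a real file system actually allocates can differ, for example with sparse files or inline data.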

Directories Are Just Special Files

A directory is a file whose content is a lookup table:

Contents of /home/alice/ (conceptual)

┌───────────────────┬────────────┐
│   Filename        │  inode #   │
├───────────────────┼────────────┤
│   .               │  1000001   │  ← this directory itself
│   ..              │  1000000   │  ← parent directory
│   documents       │  1000050   │
│   photo.jpg       │  1234567   │
│   notes.txt       │  1234568   │
└───────────────────┴────────────┘
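A real directory's name-to-inode table can be listed with `ls`. This is a small sketch, assuming GNU coreutils and a throwaway directory whose entries mirror the conceptual table above; the inode numbers printed on a given machine will differ.

```shell
#!/bin/sh
# Sketch: list a directory's name -> inode table, including "." and "..".
# The directory and file names are made up for the demo.
d=$(mktemp -d)
mkdir "$d/documents"
: > "$d/photo.jpg"
: > "$d/notes.txt"

# -a includes . and .., -i prefixes each name with its inode number, -1 prints one entry per line
ls -ai1 "$d"

# The "." entry refers to the same inode as the directory itself
dir_inode=$(stat -c %i "$d")
dot_inode=$(stat -c %i "$d/.")
echo "directory inode: $dir_inode, '.' entry: $dot_inode"

rm -rf "$d"
```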

When you access /home/alice/photo.jpg, the OS does this:

  1. Find inode for / → read root directory → find inode number for home
  2. Read home directory → find inode number for alice
  3. Read alice directory → find photo.jpg → inode number 1234567
  4. Use inode 1234567 to locate data blocks → read file content

This chain-following process is called path resolution.
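The lookup chain can be imitated in user space by visiting one path component at a time and printing the inode found at each step. This is only a sketch: the `resolve` function is a hypothetical helper, it assumes GNU `stat` and an absolute path with no symlinks, `.` or `..`, and the kernel's real resolver also handles mount points, permissions, and caching.

```shell
#!/bin/sh
# Sketch: mimic path resolution by walking a path one component at a time.
# resolve is a hypothetical helper; assumes GNU stat and a plain absolute path.
resolve() {
    path=$1
    current=/
    echo "/ -> inode $(stat -c %i /)"
    old_ifs=$IFS
    IFS=/
    for part in $path; do            # unquoted on purpose: split on "/"
        [ -z "$part" ] && continue   # skip the empty field before the leading "/"
        current="${current%/}/$part"
        echo "$part -> inode $(stat -c %i "$current")"
    done
    IFS=$old_ifs
}

# Build a scratch tree that mirrors the example path, then resolve it.
d=$(mktemp -d)
mkdir -p "$d/home/alice"
: > "$d/home/alice/photo.jpg"
resolve "$d/home/alice/photo.jpg"
rm -rf "$d"
```

Each output line corresponds to one directory read: the name is looked up in the directory reached so far, and the inode it maps to becomes the starting point for the next component.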

Hard link: creates a new directory entry pointing to the same inode.

Filename A ────────┐
                   ├──► inode 1234567 ──► data blocks
Filename B ────────┘

Both filenames have the same inode number (visible with ls -i).
Delete A → link count drops from 2 to 1 → data still intact
Delete B → link count drops from 1 to 0 → data blocks marked free (actually deleted)

Symbolic link (symlink): a new file whose content is a path string.

Symlink ──► inode X ──► content: "/home/alice/photo.jpg" (a string)
                                          │
                              OS re-resolves this path
                                          │
                                          ▼
                              inode 1234567 ──► data blocks

Symlinks can cross file systems, can point to directories, and can point to non-existent targets (dangling links). Hard links cannot cross file systems and cannot point to directories (which would allow cycles).
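The difference shows up most clearly when the original name is removed. The sketch below assumes GNU `stat` and uses made-up file names in a scratch directory: the hard link keeps the data reachable, while the symlink is left pointing at a path that no longer exists.

```shell
#!/bin/sh
# Sketch: a hard link survives deletion of the original name; a symlink dangles.
d=$(mktemp -d)
echo "hello" > "$d/original.txt"
ln    "$d/original.txt" "$d/hard.txt"    # second name for the same inode
ln -s "$d/original.txt" "$d/soft.txt"    # new file whose content is a path string

links_before=$(stat -c %h "$d/hard.txt")
rm "$d/original.txt"
links_after=$(stat -c %h "$d/hard.txt")
echo "link count: $links_before -> $links_after"

hard_content=$(cat "$d/hard.txt")        # data is still reachable via the other name
echo "hard link reads: $hard_content"

# -L tests the link itself; -e follows it and fails if the target is gone
if [ -L "$d/soft.txt" ] && [ ! -e "$d/soft.txt" ]; then
    soft_state=dangling
else
    soft_state=ok
fi
echo "symlink target: $(readlink "$d/soft.txt") ($soft_state)"
rm -rf "$d"
```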

Comparing Major File Systems

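┌───────────────┬──────────────────────┬──────────────────┬───────────────────┐
│ Feature       │ ext4 (Linux default) │ APFS (macOS)     │ NTFS (Windows)    │
├───────────────┼──────────────────────┼──────────────────┼───────────────────┤
│ Max file size │ 16 TB                │ 8 EB             │ 16 EB             │
│ Journaling    │ Yes                  │ Yes (CoW-based)  │ Yes               │
│ Copy-on-Write │ No                   │ Yes              │ No                │
│ Snapshots     │ Via LVM              │ Native           │ VSS               │
│ Encryption    │ Extra layer          │ Native full-disk │ BitLocker         │
│ Character     │ Stable, general      │ SSD-optimized    │ Windows ecosystem │
└───────────────┴──────────────────────┴──────────────────┴───────────────────┘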

File Deletion: Why Recovery Is Possible

When you press Delete, does the file actually disappear? Not immediately. Deletion does only two things:

  1. Remove the filename entry from the directory table
  2. Decrement the inode's link count. If it reaches zero, mark the data blocks as "free"

The content of those data blocks is left intact; the blocks are merely flagged as available for future writes. File recovery tools like PhotoRec and TestDisk work by scanning blocks that are marked free but still contain recognizable data patterns. This is why "secure deletion" requires a dedicated tool: the shred command overwrites the file's blocks several times (with random data by default) and, when given the -u option, then deletes the file.

Before delete:  directory ──► inode ──► data blocks ["secret content..."]
After delete:   directory     (gone)    inode ──► data blocks ["content still here!"]
                                        link count = 0, blocks = reclaimable
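An overwrite-then-delete can be sketched with GNU `shred`. The scratch file here is illustrative. Note that, according to shred's own documentation, overwriting in place is not reliable on copy-on-write file systems, on file systems that journal data, or on storage that remaps writes, so this is a sketch of the idea rather than a guarantee.

```shell
#!/bin/sh
# Sketch: overwrite a file's blocks before unlinking it, using GNU shred.
# -n 3: three overwrite passes; -z: add a final pass of zeros; -u: remove the file afterwards.
f=$(mktemp)
echo "secret content" > "$f"
shred -n 3 -z -u "$f"

if [ -e "$f" ]; then removed=no; else removed=yes; fi
echo "file removed: $removed"
```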

Journaling: Surviving Power Failures

What happens if a write is interrupted halfway by a power failure? The file system could be left in an inconsistent state: inode updated but data blocks not written, or directory updated but inode not yet modified.

Journaling is the solution. Before making any change, write a description of what you intend to do into a dedicated journal area on disk. Then carry out the operation. Then clear the journal entry.

Normal flow:
  1. Write to journal: "going to update inode and block X"
  2. Write inode
  3. Write data block X
  4. Clear journal entry

Power-failure recovery:
  Boot → check journal → find incomplete entry → redo steps 2 and 3 → clear journal

ext4's default "ordered" journaling mode records only metadata changes in the journal. Data blocks are written directly, with data written before the metadata journal entry is committed. This balances safety with write performance.
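The write-ahead idea can be modeled with ordinary files standing in for the disk. This is a toy sketch, not how ext4 lays out its journal: the file names, the record format, and the helper functions are all invented for illustration. Unlike ext4's ordered mode, this toy journals the data as well as the metadata, which keeps the redo step simple.

```shell
#!/bin/sh
# Toy model of write-ahead journaling using plain files as the "disk".
# File names, record format, and function names are hypothetical.
disk=$(mktemp -d)
echo "old-inode" > "$disk/inode"
echo "old-data"  > "$disk/block"

journal_write() {   # step 1: record the intent before touching anything
    printf '%s\n%s\n' "$1" "$2" > "$disk/journal"
}
apply_journal() {   # steps 2 and 3: write inode, then data block, from the journal
    sed -n 1p "$disk/journal" > "$disk/inode"
    sed -n 2p "$disk/journal" > "$disk/block"
}
journal_clear() {   # step 4: the transaction is complete
    rm -f "$disk/journal"
}
recover() {         # at boot: redo any entry that was never cleared
    if [ -f "$disk/journal" ]; then
        apply_journal
        journal_clear
        echo "recovered from journal"
    fi
}

# Simulate a crash after the inode write but before the data block write.
journal_write "new-inode" "new-data"
echo "new-inode" > "$disk/inode"      # step 2 happened
# (power failure here: block still holds old-data, journal entry remains)
state_before="$(cat "$disk/inode") / $(cat "$disk/block")"
recover
state_after="$(cat "$disk/inode") / $(cat "$disk/block")"
echo "before recovery: $state_before"
echo "after recovery:  $state_after"
rm -rf "$disk"
```

Redoing the journal entry is safe to repeat because each step writes an absolute value rather than a delta. A real journal also needs a commit record so that a half-written journal entry can be recognized and ignored.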

Hands-On Verification

# Show a file's inode number
ls -i /etc/hosts
# e.g.: 1234567 /etc/hosts

# Show full inode details
stat /etc/hosts
# Output includes: Inode:  Links:  Blocks:  ...

# Create a hard link and confirm shared inode
echo "hello" > /tmp/original.txt
ln /tmp/original.txt /tmp/hardlink.txt
ls -li /tmp/original.txt /tmp/hardlink.txt
# Both show the same inode number; the link-count column reads 2

# Delete the original: the hard link still works
rm /tmp/original.txt
cat /tmp/hardlink.txt   # prints: hello

# Create a symbolic link
ln -s /tmp/hardlink.txt /tmp/softlink.txt
ls -la /tmp/softlink.txt
# Output: lrwxrwxrwx ... /tmp/softlink.txt -> /tmp/hardlink.txt

# Show file system types and usage
df -Th
# TYPE column: ext4 / apfs / ntfs ...

# Show inode usage (you can run out of inodes even with free disk space!)
df -i

# Check block size on an ext4 volume
tune2fs -l /dev/sda1 | grep "Block size"
# Typically: Block size: 4096

# Find all hard links sharing an inode number
find / -inum 1234567 2>/dev/null

# Check journaling mode of an ext4 volume
tune2fs -l /dev/sda1 | grep "Default mount options"
# "journal_data_ordered" means ordered journaling; if no journal_data* option is listed, ext4 still uses ordered mode by default

🔬 Going Deeper

Copy-on-Write File Systems: The Smart Design of APFS and Btrfs

Traditional file systems (ext4, NTFS) write data in place, overwriting the original blocks, so backups require external snapshot tools. Copy-on-Write (CoW) file systems such as macOS APFS, Linux Btrfs, and ZFS never overwrite live data. Every modification is written to a new location, and then a pointer is atomically updated to reference it.

Benefits:

Costs: write amplification (a small change can trigger a cascade of block copies), fragmentation over time (data scatters across the disk), and the need for periodic defragmentation or compaction.
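The same write-elsewhere-then-swap pattern can be imitated at the level of whole files. This is an analogy only, assuming GNU `stat` and made-up file names: a real CoW file system does this per block or tree node inside the file system, whereas here a hard link plays the role of a snapshot and `mv` (a rename within one directory) plays the role of the atomic pointer update.

```shell
#!/bin/sh
# Sketch: copy-on-write at file granularity. The old version is never overwritten;
# a modified copy is written elsewhere and a rename swaps the name over to it.
d=$(mktemp -d)
echo "version 1" > "$d/data"
ln "$d/data" "$d/snapshot"            # "snapshot": a second name for the current inode

old_inode=$(stat -c %i "$d/data")
# Read the old version, write the modified copy to a new location
sed 's/version 1/version 2/' "$d/data" > "$d/data.new"
mv -f "$d/data.new" "$d/data"         # rename replaces the name in one step
new_inode=$(stat -c %i "$d/data")

current=$(cat "$d/data")
snap=$(cat "$d/snapshot")
echo "data:     $current (inode $new_inode)"
echo "snapshot: $snap (inode $old_inode)"
rm -rf "$d"
```

Because the old inode was never modified, the snapshot costs nothing extra until the two versions diverge, and a reader sees either the complete old version or the complete new one.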

VFS: How Linux Supports Dozens of File Systems at Once

The Linux kernel includes a layer called the VFS (Virtual File System) that defines a standard interface: open, read, write, close, and a handful of others. Any file system that implements this interface (ext4, NTFS, FAT32, NFS, FUSE-based systems) can be mounted and used simultaneously. From the perspective of application code, reading a file off an NTFS USB stick looks identical to reading one off an ext4 SSD. VFS is one of Linux's most important abstraction layers, enabling the ecosystem diversity that makes Linux useful on everything from smartphones to supercomputers.
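The uniform interface is easy to see from user space: the same command reads files regardless of which file system backs them. The sketch below assumes GNU `stat` (`-f` reports on the file system, `%T` prints its type name) and uses `/proc/version` as an example of a file served by a non-disk file system; that path is skipped if it is not readable.

```shell
#!/bin/sh
# Sketch: one read path (head -> open/read/close) works on any mounted file system.
# Assumes GNU stat; /proc/version is used only if it exists on this machine.
f=$(mktemp)
echo "hello from a regular file" > "$f"

for path in "$f" /proc/version; do
    [ -r "$path" ] || continue                 # /proc may be absent on some systems
    fs_type=$(stat -f -c %T "$path")           # type of the backing file system
    first_line=$(head -n 1 "$path")
    echo "[$fs_type] $path: $first_line"
done

tmp_fs_type=$(stat -f -c %T "$f")
rm -f "$f"
```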
