Chapter 2

Arrays: The Power and Price of Contiguous Memory

Chapter 2: Arrays — The Power and Price of Contiguous Memory

Every programmer's first data structure is the array. It seems trivially simple: a block of contiguous memory cells, each holding one element, accessible by index. Yet this simplicity hides a profound trade-off that determines the performance of virtually every program you will ever write.

The array's power comes from one physical fact: modern CPUs are built to access contiguous memory blazingly fast. Its price comes from one logical consequence: if elements must stay contiguous, inserting or removing an element forces everything else to move. Understanding this trade-off — and the engineering decisions built on top of it — separates beginners from systems programmers.

Level 1 · What You Need to Know

2.1 The Fundamental Contract: Contiguous Memory

An array stores elements in contiguous memory locations. Each element occupies the same number of bytes, so the address of element i is:

address(arr[i]) = base_address + i × element_size

This formula is the source of the array's superpower: O(1) random access. No matter whether you ask for arr[0] or arr[999999], the CPU performs one multiplication and one addition — constant time.

Why this matters (Knuth, The Art of Computer Programming, Vol. 1, 1968): Knuth identified sequential allocation as the most natural mapping from a mathematical sequence to machine memory. The single-arithmetic-operation access pattern maps directly to how hardware addressing works: the CPU's memory controller takes a numeric address and fetches the word at that location in one memory cycle.

import array

# A true array of integers (4 bytes each)
arr = array.array('i', [10, 20, 30, 40, 50])

# O(1) access — same speed regardless of index
first = arr[0]      # base + 0 * 4 = base
last = arr[4]       # base + 4 * 4 = base + 16
middle = arr[2]     # base + 2 * 4 = base + 8

2.2 The Cost Model: Operations and Their Complexities

Operation	Time Complexity	Why
Access by index `arr[i]`	O(1)	Arithmetic on address
Search (unsorted)	O(n)	Must check each element
Search (sorted)	O(log n)	Binary search
Insert at end	O(1) amortized	May need resize
Insert at position i	O(n)	Must shift n-i elements right
Delete at end	O(1)	Just decrement length
Delete at position i	O(n)	Must shift n-i elements left
Append	O(1) amortized	Dynamic array growth

The critical insight: insertion and deletion are expensive in the middle because maintaining contiguity requires shifting elements. If you insert at index 0 of an n-element array, all n elements must move one position to the right.

# Demonstrating the cost of insert at front vs append
import timeit

n = 100000
# Insert at front: O(n) per operation → O(n²) total
def insert_front():
    lst = []
    for i in range(n):
        lst.insert(0, i)  # Shift all existing elements

# Append at back: O(1) amortized per operation → O(n) total
def append_back():
    lst = []
    for i in range(n):
        lst.append(i)  # No shifting needed

t_front = timeit.timeit(insert_front, number=1)
t_back = timeit.timeit(append_back, number=1)
print(f"Insert at front: {t_front:.3f}s")   # ~2.5s
print(f"Append at back:  {t_back:.3f}s")    # ~0.008s
# Ratio: ~300x slower for insert(0)

2.3 Python `list` Is Not a Traditional Array

A traditional array (C's int arr[100]) stores values directly in contiguous memory. Python's list is different — it is a dynamic array of pointers (references to objects).

Traditional array (C):
┌────┬────┬────┬────┬────┐
│ 10 │ 20 │ 30 │ 40 │ 50 │  ← Values stored directly
└────┴────┴────┴────┴────┘

Python list:
┌────┬────┬────┬────┬────┐
│ *  │ *  │ *  │ *  │ *  │  ← Pointers (8 bytes each on 64-bit)
└─┼──┴─┼──┴─┼──┴─┼──┴─┼──┘
  ↓    ↓    ↓    ↓    ↓
 int  int  str  list float  ← Objects scattered in heap

Key implications:

Heterogeneous: A Python list can hold mixed types because it stores pointers, not values.
Memory overhead: Each element requires a pointer (8 bytes) plus the object header (28 bytes for a Python int).
Cache inefficiency: The actual values are scattered across heap memory — bad for CPU cache.

import sys

# Memory of a Python list vs what you might expect
lst = [1, 2, 3, 4, 5]
print(sys.getsizeof(lst))  # 104 bytes (list object itself)
# But each int object is 28 bytes, so total = 104 + 5*28 = 244 bytes
# A C array of 5 ints would be: 5 * 4 = 20 bytes

# For pure numeric work, use array module or numpy
import array
arr = array.array('i', [1, 2, 3, 4, 5])
print(sys.getsizeof(arr))  # 84 bytes (stores values directly)

2.4 Common Python List Operations — True Costs

Python's list is a dynamic array. Here are the costs that surprise people:

lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# O(1) — constant time
lst.append(11)         # Add to end (amortized O(1))
lst.pop()              # Remove from end
lst[5]                 # Index access
len(lst)               # Length (stored as attribute)

# O(n) — linear time (THESE ARE THE TRAPS)
lst.insert(0, 99)     # Insert at front → shift all elements
lst.pop(0)            # Remove from front → shift all elements
lst.remove(5)         # Find and remove → search + shift
99 in lst             # Membership test → linear scan
lst.index(5)          # Find index → linear scan

# O(k) where k is slice length
lst[2:5]              # Slice → copy k elements
del lst[2:5]          # Delete slice → copy + shift

# O(n) — less obvious
lst2 = lst + [1,2,3]  # Concatenation → copy entire list
lst3 = lst * 3        # Repetition → copy entire list 3 times

Critical mistake #1: Using list.insert(0, x) in a loop

If you need to prepend elements frequently, use collections.deque:

from collections import deque

# BAD: O(n²) total
def build_reversed_bad(items):
    result = []
    for item in items:
        result.insert(0, item)  # O(n) each time!
    return result

# GOOD: O(n) total
def build_reversed_good(items):
    result = deque()
    for item in items:
        result.appendleft(item)  # O(1) each time
    return list(result)

2.5 Two-Dimensional Arrays: Row-Major Storage

When you create a 2D array (a matrix), the hardware still stores it in a 1D strip of memory. The question is: do you lay it out row by row (row-major) or column by column (column-major)?

C, Python (NumPy default), Java, and most languages use row-major order:

Matrix:     Memory layout (row-major):
┌─────────┐
│ 1  2  3 │    [1, 2, 3, 4, 5, 6, 7, 8, 9]
│ 4  5  6 │     ───row0── ───row1── ───row2──
│ 7  8  9 │
└─────────┘
Address of M[i][j] = base + (i * num_cols + j) * element_size

Why this matters for performance: Accessing elements in row order follows the memory layout, which means the CPU cache prefetcher can predict and preload upcoming data. Accessing in column order jumps across memory, defeating the cache.

import numpy as np
import timeit

n = 5000
matrix = np.random.rand(n, n)

# Row-major traversal (cache-friendly)
def row_sum():
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += matrix[i, j]  # Sequential in memory
    return total

# Column-major traversal (cache-hostile)
def col_sum():
    total = 0.0
    for j in range(n):
        for i in range(n):
            total += matrix[i, j]  # Jumping across memory
    return total

# row_sum is typically 3-5x faster due to cache effects
# (In practice, use matrix.sum() which is ~1000x faster than either)

2.6 The Shallow Copy Trap

One of the most common bugs for Python beginners:

# WRONG: All rows share the same list object
grid = [[0] * 5] * 3
grid[0][0] = 1
print(grid)
# [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]  ← ALL rows changed!

# WHY: [expr] * 3 creates 3 references to the SAME object
# Internally: grid = [ptr, ptr, ptr] where all three point to same [0,0,0,0,0]

# CORRECT: Each row is an independent list object
grid = [[0] * 5 for _ in range(3)]
grid[0][0] = 1
print(grid)
# [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]  ← Only row 0 changed

# WHY: The list comprehension evaluates [0]*5 three separate times,
# creating three distinct list objects in memory.

The rule: [mutable_object] * n creates n references to the same object. Use a list comprehension when you need independent copies.

This is not a bug — it is consistent with Python's object model where * on a list performs shallow copy of references (documented in Python Language Reference §6.7, "Binary arithmetic operations"). The inner [0] * 5 is safe because integers are immutable — sharing references to immutable objects has no observable effect.

Level 2 · How It Works

2.7 Dynamic Array Growth Strategy

A static array has fixed size. But Python's list grows automatically when you append beyond its current capacity. How?

The strategy is called geometric growth (or table doubling): when the array is full, allocate a new, larger array and copy all elements over. The critical design question is: how much larger?

CPython's actual growth formula (from Objects/listobject.c, function list_resize):

// CPython 3.12 source: Objects/listobject.c
// new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
// This gives growth of approximately: newsize + newsize/8 + small_constant
// Growth factor ≈ 1.125 (i.e., grow by 12.5%)

The actual allocated sizes for a list growing from 0:

0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...

import sys

# Watch Python list grow
lst = []
prev_size = sys.getsizeof(lst)
print(f"len=0, size={prev_size} bytes, capacity≈0")

for i in range(100):
    lst.append(i)
    new_size = sys.getsizeof(lst)
    if new_size != prev_size:
        # Size of list object = 56 (overhead) + 8 * capacity (pointers)
        capacity = (new_size - 56) // 8
        print(f"len={i+1}, size={new_size} bytes, capacity≈{capacity}")
        prev_size = new_size

# Output (CPython 3.12, 64-bit):
# len=1,  size=88 bytes,  capacity≈4
# len=5,  size=120 bytes, capacity≈8
# len=9,  size=184 bytes, capacity≈16
# len=17, size=248 bytes, capacity≈24
# len=25, size=312 bytes, capacity≈32
# ...

2.8 Why Growth Factor ~1.125, Not 2?

Most textbooks teach "double the array" (growth factor = 2). Java's ArrayList uses 1.5. CPython uses approximately 1.125. Why?

The trade-off (Andrew Koenig, "Patterns and Antipatterns," JOOP 1998; Sedgewick, Algorithms, 4th ed., 2011):

Growth Factor	Wasted Space (worst case)	Amortized Copy Cost	Memory Reuse?
2.0	Up to 50% of allocated memory	O(1) per append	No — new allocation always > sum of all previous
1.5	Up to 33%	O(1) per append	Yes — after several resizes, freed blocks can be reused
1.125	Up to 11%	O(1) per append	Yes — almost always reusable

The key insight (attributed to Andrew Appel, Princeton; discussed in Slator's Dynamic Arrays notes, 2003): With growth factor 2, the new size always exceeds the total of all previously freed blocks combined (1 + 2 + 4 + ... + n/2 = n - 1 < n). This means the allocator can never reuse the freed memory in the same contiguous region. With factor < φ (the golden ratio ≈ 1.618), eventually the sum of freed blocks exceeds the new allocation size, enabling reuse.

CPython chose ~1.125 because:

Python lists are extremely common — millions exist in a typical program
Most lists are small (< 20 elements) and never grow much
The slight increase in copy operations is dwarfed by the memory savings across millions of lists
The +3 or +6 additive term ensures small lists don't resize too frequently

# Empirical verification: measuring amortized cost
import time

def measure_append(n):
    """Measure total time for n appends — should be O(n) if amortized O(1)."""
    lst = []
    start = time.perf_counter()
    for i in range(n):
        lst.append(i)
    elapsed = time.perf_counter() - start
    return elapsed

# If amortized O(1), doubling n should double the time
for n in [100_000, 200_000, 400_000, 800_000, 1_600_000]:
    t = measure_append(n)
    print(f"n={n:>10,}: {t:.4f}s  ({t/n*1e6:.2f} µs/append)")

# Output shows roughly constant time per append:
# n=   100,000: 0.0067s  (0.07 µs/append)
# n=   200,000: 0.0133s  (0.07 µs/append)
# n=   400,000: 0.0267s  (0.07 µs/append)
# n=   800,000: 0.0536s  (0.07 µs/append)
# n= 1,600,000: 0.1079s  (0.07 µs/append)

2.9 Cache Locality: Why Array Traversal Crushes Linked Lists

Modern CPUs do not fetch individual bytes from main memory. They fetch entire cache lines — typically 64 bytes at a time (Intel, AMD, ARM since ~2005). When you access arr[0], the hardware loads bytes at addresses 0-63 into L1 cache. If arr[1] through arr[7] (for 8-byte elements) live in that same cache line, accessing them is essentially free — no memory fetch needed.

The memory hierarchy (Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th ed., 2017):

Level	Size	Latency	Bandwidth
L1 Cache	32-64 KB	~1 ns (4 cycles)	~1 TB/s
L2 Cache	256 KB-1 MB	~4 ns (12 cycles)	~500 GB/s
L3 Cache	4-64 MB	~12 ns (36 cycles)	~200 GB/s
Main Memory (DRAM)	8-256 GB	~100 ns (300 cycles)	~50 GB/s

Consequence: An L1 cache hit is 100x faster than a main memory access. Arrays exploit this because sequential elements are physically adjacent, causing almost every access to be a cache hit after the first fetch in each cache line.

Linked lists scatter nodes across the heap. Each node.next pointer dereference is likely a cache miss, requiring a full memory round-trip.

import numpy as np
import time

# Array traversal (cache-friendly)
arr = np.arange(10_000_000, dtype=np.int64)  # 80 MB, contiguous
start = time.perf_counter()
total = arr.sum()
t_array = time.perf_counter() - start

# Simulated linked-list-like access (cache-hostile)
# Shuffle indices to simulate pointer-chasing
indices = np.arange(10_000_000)
np.random.shuffle(indices)
start = time.perf_counter()
total = 0
for idx in indices:
    total += arr[idx]  # Random access pattern
t_random = time.perf_counter() - start

print(f"Sequential: {t_array:.4f}s")  # ~0.005s
print(f"Random:     {t_random:.4f}s")  # ~3.0s
print(f"Ratio:      {t_random/t_array:.0f}x slower")  # ~600x
# (The extreme ratio includes Python loop overhead, but even in C
#  the difference is 5-10x due purely to cache effects)

Hardware prefetching (Intel Optimization Manual, §2.3.5.4): Modern CPUs have prefetch units that detect sequential access patterns and speculatively load upcoming cache lines before your code even requests them. This works perfectly for array traversal but is useless for linked list traversal (the next address is unknowable until the current node is fetched).

2.10 `array` Module vs. NumPy Array vs. `list`

Python offers three "array-like" containers with very different performance characteristics:

Feature	`list`	`array.array`	`numpy.ndarray`
Stores	References (pointers)	Raw values	Raw values
Type	Heterogeneous	Homogeneous	Homogeneous
Element size (int)	8 (ptr) + 28 (obj) = 36 bytes	4 bytes (int32)	4 or 8 bytes
Cache-friendly	No (values scattered)	Yes (contiguous values)	Yes (contiguous values)
Vectorized ops	No	No	Yes (SIMD)
Use case	General purpose	I/O buffers, protocol data	Numerical computing

import sys
import array
import numpy as np

n = 1_000_000

# Memory comparison for storing 1 million integers
lst = list(range(n))
arr = array.array('i', range(n))  # 4-byte signed int
npa = np.arange(n, dtype=np.int32)

print(f"list:        {sys.getsizeof(lst) + n * 28:>12,} bytes")  # ~36 MB
print(f"array.array: {sys.getsizeof(arr):>12,} bytes")           # ~4 MB
print(f"numpy:       {npa.nbytes:>12,} bytes")                    # ~4 MB

When to use which:

list: Default choice. Flexibility over performance. Mixed types, small collections.
array.array: When you need compact storage of a single primitive type and don't need math operations. Useful for binary I/O.
numpy.ndarray: Any numerical computation. Vectorized operations, linear algebra, scientific computing.

2.11 Memory Alignment and the `struct` Module

When data crosses hardware boundaries, performance suffers. CPUs access memory most efficiently when data is aligned — meaning an n-byte value starts at an address divisible by n.

import struct

# Pack values with alignment (default behavior)
# char (1) + pad (3) + int (4) + char (1) + pad (7) + double (8) = 24 bytes
s1 = struct.Struct('c i c d')  # Note: default alignment depends on platform

# Pack values without alignment padding
s2 = struct.Struct('=c i c d')  # '=' means native byte order, no padding
print(f"With alignment: {struct.calcsize('c i c d')} bytes")    # Platform-dependent
print(f"Without:        {struct.calcsize('=cicid')} bytes")     # Tightly packed

# Why alignment matters:
# An unaligned 4-byte int at address 0x03 spans two memory words (0x00-0x03 and 0x04-0x07).
# The CPU must perform TWO memory reads and mask/shift the bytes together.
# An aligned int at address 0x04 requires ONE memory read.

Real-world impact (Drepper, "What Every Programmer Should Know About Memory," 2007): Unaligned accesses on x86 are handled transparently by hardware but with a performance penalty of ~2x for L1 cache hits. On ARM and older architectures, unaligned access causes a hardware fault (bus error), terminating the program.

Level 3 · What the Spec Says

2.12 Amortized Analysis of Dynamic Arrays

In Chapter 1, we introduced Tarjan's potential method for amortized analysis. Let us apply it rigorously to prove that dynamic array append is O(1) amortized.

Theorem: For a dynamic array with growth factor α > 1, the amortized cost of append is O(1).

Proof using the accounting method (Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms, 4th ed., 2022, Chapter 16):

Assign each append operation a charge of 3 units (for growth factor 2):

1 unit pays for the actual insertion
1 unit is saved as "credit" on the newly inserted element
1 unit is saved as "credit" on one element in the first half of the array

When the array is full (n elements, capacity n), we need to copy all n elements to a new array of size 2n. At this moment:

The first half (n/2 elements) each have 1 credit (deposited when they were "passed" by new insertions)
The second half (n/2 elements) each have 1 credit (deposited when they were inserted)
Total credits available: n/2 + n/2 = n — exactly enough to pay for copying n elements

Therefore, no operation ever goes into "debt," and the amortized cost per append is 3 = O(1). ∎

For general growth factor α:

The amortized cost per append is:

c_amortized = 1 + α/(α-1)

Derivation: When resizing from capacity C to αC, we copy C elements. Between two consecutive resizes, we insert C(α-1) new elements. So the copy cost per element is C / (C(α-1)) = 1/(α-1). Each element also pays 1 for its own insertion.

α	Amortized cost	Wasted space
2.0	1 + 2 = 3	≤ 50%
1.5	1 + 3 = 4	≤ 33%
1.25	1 + 5 = 6	≤ 20%
1.125	1 + 9 = 10	≤ 11%

CPython's choice of α ≈ 1.125 means each append "costs" about 10 units of work amortized, but the memory overhead stays below 11%. This is a deliberate engineering trade-off: Python programs create many lists but rarely perform billions of appends to a single list, so minimizing memory waste is more valuable than minimizing copy overhead.

2.13 Array Implementation Across Languages

The array is a universal data structure, but implementations vary significantly:

C (ISO/IEC 9899:2018, §6.5.6):

int arr[100];  // Stack-allocated, fixed size, no bounds checking
arr[i] == *(arr + i)  // Pointer arithmetic — undefined behavior if out of bounds
// sizeof(arr) == 400 (100 * 4 bytes)
// No runtime length information — programmer must track it

The C standard defines array subscript arr[i] as exactly equivalent to *(arr + i). This means:

No bounds checking (segfault or silent corruption on overflow)
arr[i] and i[arr] are both legal and equivalent
Array names decay to pointers in most contexts (§6.3.2.1)

Java (JLS §10, Java Language Specification, Gosling et al.):

int[] arr = new int[100];  // Heap-allocated, fixed size, bounds-checked
// arr.length is a final field — O(1) access to size
// ArrayIndexOutOfBoundsException on invalid index
// JIT may eliminate bounds checks after proving index validity

Java arrays are objects with a length field. The JVM specification (§2.11.5) requires bounds checking on every array access (aaload, iaload instructions), but HotSpot's JIT compiler performs range check elimination when it can prove the index is valid (e.g., in counted loops).

Python (CPython implementation, Include/cpython/listobject.h):

// Simplified from CPython source
typedef struct {
    PyObject_VAR_HEAD       // ob_refcnt, ob_type, ob_size (length)
    PyObject **ob_item;     // Pointer to array of pointers
    Py_ssize_t allocated;   // Total allocated slots
} PyListObject;

Key differences:

ob_size = number of elements currently in the list
allocated = total capacity (≥ ob_size)
ob_item points to a separately allocated array of PyObject* pointers
Each element access requires: bounds check → pointer dereference → object access

Performance consequence: A Python list access does approximately 5x more work than a C array access (bounds check + pointer load + potential cache miss for the object). This is why NumPy exists — it bypasses the Python object layer entirely.

2.14 CPU Cache Architecture: L1/L2/L3 and the 64-Byte Cache Line

(Cross-reference: Computer How It Works, Chapter 11 — for full treatment of cache architecture)

The cache system is the single most important factor in understanding array performance. Here are the specifications that matter:

Cache line size: 64 bytes (Intel SDM §11.6; ARM Architecture Reference Manual, §D7.2)

All modern x86 and ARM processors use a 64-byte cache line as the fundamental unit of transfer between cache levels and main memory. When any byte in a 64-byte aligned block is accessed, the entire block is loaded.

For arrays, this means:

int32[16] elements fit in one cache line (16 × 4 = 64)
int64[8] elements fit in one cache line (8 × 8 = 64)
float64[8] elements fit in one cache line (8 × 8 = 64)

Associativity and eviction (Hennessy & Patterson, §B.3):

L1 caches are typically 8-way set-associative: each memory address maps to one of N/8 sets, and within that set, 8 cache lines can coexist. When a 9th line maps to the same set, one must be evicted (typically LRU — Least Recently Used).

Why arrays benefit:

Spatial locality: Accessing arr[i] loads 64 bytes containing arr[i] through arr[i+7] (for 8-byte elements). Subsequent accesses hit cache.
Temporal locality: Recently accessed elements remain in cache for reuse.
Prefetching: Hardware detects stride patterns (stride = element_size for sequential access) and issues prefetch requests 2-4 cache lines ahead.

# Demonstrating cache line effects
import numpy as np
import time

# Stride-1 access: every element → maximum cache utilization
# Stride-8 access: every 8th element → 1 useful element per cache line
# Stride-64 access: every 64th element → crosses cache line every time

arr = np.zeros(8_000_000, dtype=np.int64)  # 64 MB

def stride_access(arr, stride):
    total = 0
    for i in range(0, len(arr), stride):
        total += arr[i]
    return total

for stride in [1, 2, 4, 8, 16, 32]:
    n_accesses = len(arr) // stride
    start = time.perf_counter()
    stride_access(arr, stride)
    elapsed = time.perf_counter() - start
    ns_per = elapsed / n_accesses * 1e9
    print(f"Stride {stride:>2}: {ns_per:.1f} ns/access ({n_accesses:>10,} accesses)")

# With stride 1-8, the ns/access is nearly constant (cache line amortization)
# Beyond stride 8 (for int64), each access is a new cache line → slower

2.15 Row-Major vs. Column-Major Storage

The choice of memory layout for multi-dimensional arrays has been a source of bugs and performance issues since the earliest days of scientific computing.

Row-major (C, Python/NumPy default, Java, C#, Pascal):

M[i][j] stored at: base + (i * ncols + j) * element_size
Adjacent in memory: M[0][0], M[0][1], M[0][2], ... M[0][n-1], M[1][0], ...

Column-major (Fortran, MATLAB, R, Julia):

M(i,j) stored at: base + (j * nrows + i) * element_size
Adjacent in memory: M(1,1), M(2,1), M(3,1), ... M(m,1), M(1,2), ...

Historical reason (Backus et al., "The FORTRAN Automatic Coding System," 1957): Fortran was designed for scientific computing where operations on column vectors (extracting a variable across all observations) are more common than operations on rows. C was designed for systems programming where structures (rows) are processed sequentially.

Performance impact when layout doesn't match access pattern:

import numpy as np
import time

n = 4000
# Row-major (C order) — default
C = np.ones((n, n), dtype=np.float64, order='C')
# Column-major (Fortran order)
F = np.ones((n, n), dtype=np.float64, order='F')

# Row traversal on both
def row_traverse(M):
    s = 0.0
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            s += M[i, j]
    return s

# Column traversal on both
def col_traverse(M):
    s = 0.0
    for j in range(M.shape[1]):
        for i in range(M.shape[0]):
            s += M[i, j]
    return s

# Row traverse is fast on C-order, slow on F-order
# Col traverse is fast on F-order, slow on C-order
# In pure Python loops, difference is 2-4x
# In compiled code (C/Fortran), difference is 5-20x

Practical advice: Always traverse arrays in the order they are stored. In NumPy, check array.flags to see if an array is C-contiguous or F-contiguous, and iterate accordingly.

Level 4 · Edge Cases and Pitfalls

2.16 Python List Hidden Performance Traps

Trap 1: list.insert(0, x) is O(n)

This is the most common performance bug in Python code. Every front-insertion requires copying the entire array:

# Building a list of n elements by prepending: O(n²) total
# n = 100,000 takes ~2.5 seconds
# n = 1,000,000 takes ~250 seconds (4+ minutes!)
# Solution: use collections.deque or build reversed, then reverse once

Trap 2: list + list creates a new list (O(n+m))

# This innocent-looking loop is O(n²):
result = []
for chunk in chunks:
    result = result + chunk  # Creates new list each iteration!

# Fix: use extend() which is O(m) per call
result = []
for chunk in chunks:
    result.extend(chunk)  # Modifies in-place

# Or even better:
import itertools
result = list(itertools.chain.from_iterable(chunks))  # Single allocation

Trap 3: [x] * n shares references

# For immutable x (int, str, tuple): safe
zeros = [0] * 1000  # Fine — integers are immutable

# For mutable x (list, dict, set): DANGEROUS
rows = [[]] * 5
rows[0].append(1)
print(rows)  # [[1], [1], [1], [1], [1]] — all five are the same list!

# Fix:
rows = [[] for _ in range(5)]  # Five independent lists

Trap 4: Membership testing is O(n)

# BAD: O(n) per lookup → O(n²) for loop
large_list = list(range(1_000_000))
for item in query_items:
    if item in large_list:  # Linear scan!
        process(item)

# GOOD: O(1) per lookup → O(n) for loop
large_set = set(large_list)  # One-time O(n) conversion
for item in query_items:
    if item in large_set:  # Hash lookup!
        process(item)

2.17 Large Array Memory Estimation

sys.getsizeof() only reports the shallow size of a Python object — the size of the container, not its contents.

import sys

lst = [1, 2, 3, 4, 5]

# Shallow size (just the list object + pointer array)
print(sys.getsizeof(lst))  # 104 bytes

# But each int object also occupies memory!
print(sys.getsizeof(1))    # 28 bytes

# True total memory:
shallow = sys.getsizeof(lst)             # 104
deep = shallow + sum(sys.getsizeof(x) for x in lst)  # 104 + 5*28 = 244
print(f"Shallow: {shallow}, Deep: {deep}")

# For large arrays of integers:
n = 1_000_000
lst = list(range(n))
# Shallow: ~8 MB (list overhead + 1M pointers × 8 bytes)
# Deep: ~8 MB + 1M × 28 bytes = ~36 MB total
# Compare: numpy array = 8 MB (int64) or 4 MB (int32)

Accurate memory measurement with tracemalloc:

import tracemalloc

tracemalloc.start()

# Measure actual memory impact
lst = list(range(1_000_000))
snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics('lineno')
for stat in stats[:5]:
    print(stat)
# Shows ~36 MB total (pointer array + integer objects)

tracemalloc.stop()

Memory estimation formulas (CPython 3.12, 64-bit):

Container	Formula	1M int elements
`list` of Python ints	56 + 8n + 28n = 56 + 36n bytes	~36 MB
`list` of Python floats	56 + 8n + 24n = 56 + 32n bytes	~32 MB
`array.array('i')`	64 + 4n bytes	~4 MB
`array.array('d')`	64 + 8n bytes	~8 MB
`numpy.ndarray` (int32)	112 + 4n bytes	~4 MB
`numpy.ndarray` (int64)	112 + 8n bytes	~8 MB

2.18 Interview Problems: Arrays

The following LeetCode problems test array mastery. We provide optimal solutions with detailed analysis.

Problem 1: Rotate Array (LeetCode #189)

Problem: Rotate an array of n elements to the right by k steps.

Example: [1,2,3,4,5,6,7], k=3 → [5,6,7,1,2,3,4]

def rotate(nums: list[int], k: int) -> None:
    """
    Three-reversal algorithm.
    Time: O(n), Space: O(1)
    
    Insight (Bentley, Programming Pearls, 1986):
    Rotating by k is equivalent to:
    1. Reverse entire array:      [7,6,5,4,3,2,1]
    2. Reverse first k elements:  [5,6,7,4,3,2,1]  
    3. Reverse remaining n-k:     [5,6,7,1,2,3,4]
    
    Why this works: Let A = first n-k elements, B = last k elements.
    We want BA. Starting from AB:
      reverse(AB) = (AB)^R = B^R A^R
      reverse first k of B^R A^R = B A^R
      reverse last n-k of B A^R = B A = BA ✓
    """
    n = len(nums)
    k = k % n  # Handle k > n
    if k == 0:
        return
    
    def reverse(arr, start, end):
        while start < end:
            arr[start], arr[end] = arr[end], arr[start]
            start += 1
            end -= 1
    
    reverse(nums, 0, n - 1)      # Reverse all
    reverse(nums, 0, k - 1)      # Reverse first k
    reverse(nums, k, n - 1)      # Reverse rest

# Alternative: cyclic replacement (O(n) time, O(1) space)
def rotate_cyclic(nums: list[int], k: int) -> None:
    """
    Move each element directly to its final position.
    Use GCD(n,k) to determine number of cycles needed.
    """
    n = len(nums)
    k = k % n
    if k == 0:
        return
    
    from math import gcd
    num_cycles = gcd(n, k)
    
    for start in range(num_cycles):
        current = start
        prev = nums[start]
        while True:
            next_idx = (current + k) % n
            nums[next_idx], prev = prev, nums[next_idx]
            current = next_idx
            if current == start:
                break

Problem 2: Merge Sorted Array (LeetCode #88)

Problem: Merge nums2 into nums1 (which has extra space at end), maintaining sorted order.

def merge(nums1: list[int], m: int, nums2: list[int], n: int) -> None:
    """
    Merge from the back to avoid overwriting unprocessed elements.
    Time: O(m+n), Space: O(1)
    
    Key insight: nums1 has space at the end (indices m to m+n-1).
    By filling from the back (largest first), we never overwrite
    elements in nums1 that haven't been processed yet.
    
    Why it works: At position p = m+n-1, we place the largest of
    nums1[i] and nums2[j]. Since we're placing the largest remaining
    element at the highest unfilled position, and nums1's values at
    indices > i are already placed (or empty), we never lose data.
    """
    p = m + n - 1  # Pointer to current fill position
    i = m - 1       # Pointer to end of nums1's values
    j = n - 1       # Pointer to end of nums2's values
    
    while j >= 0:
        if i >= 0 and nums1[i] > nums2[j]:
            nums1[p] = nums1[i]
            i -= 1
        else:
            nums1[p] = nums2[j]
            j -= 1
        p -= 1

# Example trace:
# nums1 = [1,2,3,0,0,0], m=3, nums2 = [2,5,6], n=3
# Step 1: compare 3 vs 6 → place 6 at index 5: [1,2,3,0,0,6]
# Step 2: compare 3 vs 5 → place 5 at index 4: [1,2,3,0,5,6]
# Step 3: compare 3 vs 2 → place 3 at index 3: [1,2,3,3,5,6]  (i now at 1)
# Step 4: compare 2 vs 2 → place 2 at index 2: [1,2,2,3,5,6]  (j now at -1)
# Done! j < 0, remaining nums1 elements already in place.

Problem 3: Remove Element (LeetCode #27)

Problem: Remove all instances of val from nums in-place. Return new length.

def removeElement(nums: list[int], val: int) -> int:
    """
    Two-pointer technique: slow pointer marks the "write" position.
    Time: O(n), Space: O(1)
    
    Invariant: all elements before index `slow` are != val.
    """
    slow = 0
    for fast in range(len(nums)):
        if nums[fast] != val:
            nums[slow] = nums[fast]
            slow += 1
    return slow

# Variant: when elements to remove are rare, swap with end to minimize writes
def removeElement_swap(nums: list[int], val: int) -> int:
    """
    Swap unwanted elements to end. Fewer writes when val is rare.
    Time: O(n), Space: O(1)
    """
    n = len(nums)
    i = 0
    while i < n:
        if nums[i] == val:
            nums[i] = nums[n - 1]
            n -= 1
            # Don't increment i — need to check swapped element
        else:
            i += 1
    return n

Problem 4: Spiral Matrix (LeetCode #54)

Problem: Return all elements of an m×n matrix in spiral order.

def spiralOrder(matrix: list[list[int]]) -> list[int]:
    """
    Layer-by-layer peeling approach.
    Time: O(m×n), Space: O(1) extra (excluding output)
    
    Think of the matrix as concentric rectangular rings.
    Process each ring: right → down → left → up.
    After processing the outer ring, shrink boundaries inward.
    """
    if not matrix or not matrix[0]:
        return []
    
    result = []
    top, bottom = 0, len(matrix) - 1
    left, right = 0, len(matrix[0]) - 1
    
    while top <= bottom and left <= right:
        # Traverse right along top row
        for col in range(left, right + 1):
            result.append(matrix[top][col])
        top += 1
        
        # Traverse down along right column
        for row in range(top, bottom + 1):
            result.append(matrix[row][right])
        right -= 1
        
        # Traverse left along bottom row (if still valid)
        if top <= bottom:
            for col in range(right, left - 1, -1):
                result.append(matrix[bottom][col])
            bottom -= 1
        
        # Traverse up along left column (if still valid)
        if left <= right:
            for row in range(bottom, top - 1, -1):
                result.append(matrix[row][left])
            left += 1
    
    return result

# Example:
# [[1, 2, 3],
#  [4, 5, 6],     →  [1, 2, 3, 6, 9, 8, 7, 4, 5]
#  [7, 8, 9]]

2.19 When NOT to Use Arrays

Arrays are the default choice, but they are wrong in specific scenarios:

Scenario	Problem with Array	Better Choice	Why
Frequent insert/delete at front	O(n) per operation	`collections.deque`	O(1) deque uses doubly-linked block list
Frequent insert/delete in middle	O(n) per operation	Linked list or balanced BST	O(1) or O(log n) respectively
Need O(1) membership test	O(n) linear scan	`set` or `dict`	Hash table gives O(1) average
Sparse data (mostly zeros)	Wastes memory on zeros	`dict` or scipy.sparse	Store only non-zero entries
Unknown final size + millions of prepends	O(n²) total	`deque` + convert	Amortized O(1) prepend
Need sorted + frequent inserts	O(n) per insert to maintain order	`sortedcontainers.SortedList`	B-tree gives O(log n)

from collections import deque
import timeit

n = 100_000

# Scenario: Build sequence by prepending
def with_list():
    lst = []
    for i in range(n):
        lst.insert(0, i)

def with_deque():
    d = deque()
    for i in range(n):
        d.appendleft(i)

t_list = timeit.timeit(with_list, number=1)
t_deque = timeit.timeit(with_deque, number=1)
print(f"list.insert(0): {t_list:.3f}s")   # ~2.5s
print(f"deque.appendleft: {t_deque:.3f}s") # ~0.007s
print(f"Ratio: {t_list/t_deque:.0f}x")     # ~350x faster

The decision framework:

What is your dominant operation? If it's random access by index → array. If it's insertion/deletion at ends → deque. If it's search → hash table or tree.
How large is your data? For n < 100, the constant factors dominate and any structure works. For n > 10⁶, wrong choice means minutes vs. milliseconds.
Do you need cache performance? For tight numerical loops, only contiguous memory (array/numpy) gives you hardware prefetching.

Summary

The array is deceptively simple. Behind arr[i] lies a deep connection between mathematics (address arithmetic), hardware (cache lines, prefetching, alignment), and systems design (growth factors, memory allocation).

Key takeaways:

O(1) random access comes from contiguous memory + fixed element size
The price is O(n) insertion/deletion (maintaining contiguity requires shifting)
Python list stores pointers, not values — 5-9x memory overhead vs. NumPy
Dynamic array growth factor is a space-time trade-off; CPython chose ~1.125 for memory efficiency
Cache locality makes sequential array traversal 5-10x faster than random access patterns
Always match your traversal order to the storage order (row-major or column-major)
Use the right data structure: deque for ends, set for membership, numpy for numbers

Rate this chapter

4.6 / 5 (108 ratings)