Arrays: The Power and Price of Contiguous Memory
Chapter 2: Arrays โ The Power and Price of Contiguous Memory
Every programmer's first data structure is the array. It seems trivially simple: a block of contiguous memory cells, each holding one element, accessible by index. Yet this simplicity hides a profound trade-off that determines the performance of virtually every program you will ever write.
The array's power comes from one physical fact: modern CPUs are built to access contiguous memory blazingly fast. Its price comes from one logical consequence: if elements must stay contiguous, inserting or removing an element forces everything else to move. Understanding this trade-off โ and the engineering decisions built on top of it โ separates beginners from systems programmers.
Level 1 ยท What You Need to Know
2.1 The Fundamental Contract: Contiguous Memory
An array stores elements in contiguous memory locations. Each element occupies the same number of bytes, so the address of element i is:
address(arr[i]) = base_address + i ร element_size
This formula is the source of the array's superpower: O(1) random access. No matter whether you ask for arr[0] or arr[999999], the CPU performs one multiplication and one addition โ constant time.
Why this matters (Knuth, The Art of Computer Programming, Vol. 1, 1968): Knuth identified sequential allocation as the most natural mapping from a mathematical sequence to machine memory. The single-arithmetic-operation access pattern maps directly to how hardware addressing works: the CPU's memory controller takes a numeric address and fetches the word at that location in one memory cycle.
import array
# A true array of integers (4 bytes each)
arr = array.array('i', [10, 20, 30, 40, 50])
# O(1) access โ same speed regardless of index
first = arr[0] # base + 0 * 4 = base
last = arr[4] # base + 4 * 4 = base + 16
middle = arr[2] # base + 2 * 4 = base + 8
2.2 The Cost Model: Operations and Their Complexities
| Operation | Time Complexity | Why |
|---|---|---|
Access by index arr[i] |
O(1) | Arithmetic on address |
| Search (unsorted) | O(n) | Must check each element |
| Search (sorted) | O(log n) | Binary search |
| Insert at end | O(1) amortized | May need resize |
| Insert at position i | O(n) | Must shift n-i elements right |
| Delete at end | O(1) | Just decrement length |
| Delete at position i | O(n) | Must shift n-i elements left |
| Append | O(1) amortized | Dynamic array growth |
The critical insight: insertion and deletion are expensive in the middle because maintaining contiguity requires shifting elements. If you insert at index 0 of an n-element array, all n elements must move one position to the right.
# Demonstrating the cost of insert at front vs append
import timeit
n = 100000
# Insert at front: O(n) per operation โ O(nยฒ) total
def insert_front():
lst = []
for i in range(n):
lst.insert(0, i) # Shift all existing elements
# Append at back: O(1) amortized per operation โ O(n) total
def append_back():
lst = []
for i in range(n):
lst.append(i) # No shifting needed
t_front = timeit.timeit(insert_front, number=1)
t_back = timeit.timeit(append_back, number=1)
print(f"Insert at front: {t_front:.3f}s") # ~2.5s
print(f"Append at back: {t_back:.3f}s") # ~0.008s
# Ratio: ~300x slower for insert(0)
2.3 Python list Is Not a Traditional Array
A traditional array (C's int arr[100]) stores values directly in contiguous memory. Python's list is different โ it is a dynamic array of pointers (references to objects).
Traditional array (C):
โโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโ
โ 10 โ 20 โ 30 โ 40 โ 50 โ โ Values stored directly
โโโโโโดโโโโโดโโโโโดโโโโโดโโโโโ
Python list:
โโโโโโฌโโโโโฌโโโโโฌโโโโโฌโโโโโ
โ * โ * โ * โ * โ * โ โ Pointers (8 bytes each on 64-bit)
โโโผโโโดโโผโโโดโโผโโโดโโผโโโดโโผโโโ
โ โ โ โ โ
int int str list float โ Objects scattered in heap
Key implications:
- Heterogeneous: A Python list can hold mixed types because it stores pointers, not values.
- Memory overhead: Each element requires a pointer (8 bytes) plus the object header (28 bytes for a Python int).
- Cache inefficiency: The actual values are scattered across heap memory โ bad for CPU cache.
import sys
# Memory of a Python list vs what you might expect
lst = [1, 2, 3, 4, 5]
print(sys.getsizeof(lst)) # 104 bytes (list object itself)
# But each int object is 28 bytes, so total = 104 + 5*28 = 244 bytes
# A C array of 5 ints would be: 5 * 4 = 20 bytes
# For pure numeric work, use array module or numpy
import array
arr = array.array('i', [1, 2, 3, 4, 5])
print(sys.getsizeof(arr)) # 84 bytes (stores values directly)
2.4 Common Python List Operations โ True Costs
Python's list is a dynamic array. Here are the costs that surprise people:
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# O(1) โ constant time
lst.append(11) # Add to end (amortized O(1))
lst.pop() # Remove from end
lst[5] # Index access
len(lst) # Length (stored as attribute)
# O(n) โ linear time (THESE ARE THE TRAPS)
lst.insert(0, 99) # Insert at front โ shift all elements
lst.pop(0) # Remove from front โ shift all elements
lst.remove(5) # Find and remove โ search + shift
99 in lst # Membership test โ linear scan
lst.index(5) # Find index โ linear scan
# O(k) where k is slice length
lst[2:5] # Slice โ copy k elements
del lst[2:5] # Delete slice โ copy + shift
# O(n) โ less obvious
lst2 = lst + [1,2,3] # Concatenation โ copy entire list
lst3 = lst * 3 # Repetition โ copy entire list 3 times
Critical mistake #1: Using list.insert(0, x) in a loop
If you need to prepend elements frequently, use collections.deque:
from collections import deque
# BAD: O(nยฒ) total
def build_reversed_bad(items):
result = []
for item in items:
result.insert(0, item) # O(n) each time!
return result
# GOOD: O(n) total
def build_reversed_good(items):
result = deque()
for item in items:
result.appendleft(item) # O(1) each time
return list(result)
2.5 Two-Dimensional Arrays: Row-Major Storage
When you create a 2D array (a matrix), the hardware still stores it in a 1D strip of memory. The question is: do you lay it out row by row (row-major) or column by column (column-major)?
C, Python (NumPy default), Java, and most languages use row-major order:
Matrix: Memory layout (row-major):
โโโโโโโโโโโ
โ 1 2 3 โ [1, 2, 3, 4, 5, 6, 7, 8, 9]
โ 4 5 6 โ โโโrow0โโ โโโrow1โโ โโโrow2โโ
โ 7 8 9 โ
โโโโโโโโโโโ
Address of M[i][j] = base + (i * num_cols + j) * element_size
Why this matters for performance: Accessing elements in row order follows the memory layout, which means the CPU cache prefetcher can predict and preload upcoming data. Accessing in column order jumps across memory, defeating the cache.
import numpy as np
import timeit
n = 5000
matrix = np.random.rand(n, n)
# Row-major traversal (cache-friendly)
def row_sum():
total = 0.0
for i in range(n):
for j in range(n):
total += matrix[i, j] # Sequential in memory
return total
# Column-major traversal (cache-hostile)
def col_sum():
total = 0.0
for j in range(n):
for i in range(n):
total += matrix[i, j] # Jumping across memory
return total
# row_sum is typically 3-5x faster due to cache effects
# (In practice, use matrix.sum() which is ~1000x faster than either)
2.6 The Shallow Copy Trap
One of the most common bugs for Python beginners:
# WRONG: All rows share the same list object
grid = [[0] * 5] * 3
grid[0][0] = 1
print(grid)
# [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]] โ ALL rows changed!
# WHY: [expr] * 3 creates 3 references to the SAME object
# Internally: grid = [ptr, ptr, ptr] where all three point to same [0,0,0,0,0]
# CORRECT: Each row is an independent list object
grid = [[0] * 5 for _ in range(3)]
grid[0][0] = 1
print(grid)
# [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] โ Only row 0 changed
# WHY: The list comprehension evaluates [0]*5 three separate times,
# creating three distinct list objects in memory.
The rule: [mutable_object] * n creates n references to the same object. Use a list comprehension when you need independent copies.
This is not a bug โ it is consistent with Python's object model where * on a list performs shallow copy of references (documented in Python Language Reference ยง6.7, "Binary arithmetic operations"). The inner [0] * 5 is safe because integers are immutable โ sharing references to immutable objects has no observable effect.
Level 2 ยท How It Works
2.7 Dynamic Array Growth Strategy
A static array has fixed size. But Python's list grows automatically when you append beyond its current capacity. How?
The strategy is called geometric growth (or table doubling): when the array is full, allocate a new, larger array and copy all elements over. The critical design question is: how much larger?
CPython's actual growth formula (from Objects/listobject.c, function list_resize):
// CPython 3.12 source: Objects/listobject.c
// new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
// This gives growth of approximately: newsize + newsize/8 + small_constant
// Growth factor โ 1.125 (i.e., grow by 12.5%)
The actual allocated sizes for a list growing from 0:
0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
import sys
# Watch Python list grow
lst = []
prev_size = sys.getsizeof(lst)
print(f"len=0, size={prev_size} bytes, capacityโ0")
for i in range(100):
lst.append(i)
new_size = sys.getsizeof(lst)
if new_size != prev_size:
# Size of list object = 56 (overhead) + 8 * capacity (pointers)
capacity = (new_size - 56) // 8
print(f"len={i+1}, size={new_size} bytes, capacityโ{capacity}")
prev_size = new_size
# Output (CPython 3.12, 64-bit):
# len=1, size=88 bytes, capacityโ4
# len=5, size=120 bytes, capacityโ8
# len=9, size=184 bytes, capacityโ16
# len=17, size=248 bytes, capacityโ24
# len=25, size=312 bytes, capacityโ32
# ...
2.8 Why Growth Factor ~1.125, Not 2?
Most textbooks teach "double the array" (growth factor = 2). Java's ArrayList uses 1.5. CPython uses approximately 1.125. Why?
The trade-off (Andrew Koenig, "Patterns and Antipatterns," JOOP 1998; Sedgewick, Algorithms, 4th ed., 2011):
| Growth Factor | Wasted Space (worst case) | Amortized Copy Cost | Memory Reuse? |
|---|---|---|---|
| 2.0 | Up to 50% of allocated memory | O(1) per append | No โ new allocation always > sum of all previous |
| 1.5 | Up to 33% | O(1) per append | Yes โ after several resizes, freed blocks can be reused |
| 1.125 | Up to 11% | O(1) per append | Yes โ almost always reusable |
The key insight (attributed to Andrew Appel, Princeton; discussed in Slator's Dynamic Arrays notes, 2003): With growth factor 2, the new size always exceeds the total of all previously freed blocks combined (1 + 2 + 4 + ... + n/2 = n - 1 < n). This means the allocator can never reuse the freed memory in the same contiguous region. With factor < ฯ (the golden ratio โ 1.618), eventually the sum of freed blocks exceeds the new allocation size, enabling reuse.
CPython chose ~1.125 because:
- Python lists are extremely common โ millions exist in a typical program
- Most lists are small (< 20 elements) and never grow much
- The slight increase in copy operations is dwarfed by the memory savings across millions of lists
- The
+3or+6additive term ensures small lists don't resize too frequently
# Empirical verification: measuring amortized cost
import time
def measure_append(n):
"""Measure total time for n appends โ should be O(n) if amortized O(1)."""
lst = []
start = time.perf_counter()
for i in range(n):
lst.append(i)
elapsed = time.perf_counter() - start
return elapsed
# If amortized O(1), doubling n should double the time
for n in [100_000, 200_000, 400_000, 800_000, 1_600_000]:
t = measure_append(n)
print(f"n={n:>10,}: {t:.4f}s ({t/n*1e6:.2f} ยตs/append)")
# Output shows roughly constant time per append:
# n= 100,000: 0.0067s (0.07 ยตs/append)
# n= 200,000: 0.0133s (0.07 ยตs/append)
# n= 400,000: 0.0267s (0.07 ยตs/append)
# n= 800,000: 0.0536s (0.07 ยตs/append)
# n= 1,600,000: 0.1079s (0.07 ยตs/append)
2.9 Cache Locality: Why Array Traversal Crushes Linked Lists
Modern CPUs do not fetch individual bytes from main memory. They fetch entire cache lines โ typically 64 bytes at a time (Intel, AMD, ARM since ~2005). When you access arr[0], the hardware loads bytes at addresses 0-63 into L1 cache. If arr[1] through arr[7] (for 8-byte elements) live in that same cache line, accessing them is essentially free โ no memory fetch needed.
The memory hierarchy (Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th ed., 2017):
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 Cache | 32-64 KB | ~1 ns (4 cycles) | ~1 TB/s |
| L2 Cache | 256 KB-1 MB | ~4 ns (12 cycles) | ~500 GB/s |
| L3 Cache | 4-64 MB | ~12 ns (36 cycles) | ~200 GB/s |
| Main Memory (DRAM) | 8-256 GB | ~100 ns (300 cycles) | ~50 GB/s |
Consequence: An L1 cache hit is 100x faster than a main memory access. Arrays exploit this because sequential elements are physically adjacent, causing almost every access to be a cache hit after the first fetch in each cache line.
Linked lists scatter nodes across the heap. Each node.next pointer dereference is likely a cache miss, requiring a full memory round-trip.
import numpy as np
import time
# Array traversal (cache-friendly)
arr = np.arange(10_000_000, dtype=np.int64) # 80 MB, contiguous
start = time.perf_counter()
total = arr.sum()
t_array = time.perf_counter() - start
# Simulated linked-list-like access (cache-hostile)
# Shuffle indices to simulate pointer-chasing
indices = np.arange(10_000_000)
np.random.shuffle(indices)
start = time.perf_counter()
total = 0
for idx in indices:
total += arr[idx] # Random access pattern
t_random = time.perf_counter() - start
print(f"Sequential: {t_array:.4f}s") # ~0.005s
print(f"Random: {t_random:.4f}s") # ~3.0s
print(f"Ratio: {t_random/t_array:.0f}x slower") # ~600x
# (The extreme ratio includes Python loop overhead, but even in C
# the difference is 5-10x due purely to cache effects)
Hardware prefetching (Intel Optimization Manual, ยง2.3.5.4): Modern CPUs have prefetch units that detect sequential access patterns and speculatively load upcoming cache lines before your code even requests them. This works perfectly for array traversal but is useless for linked list traversal (the next address is unknowable until the current node is fetched).
2.10 array Module vs. NumPy Array vs. list
Python offers three "array-like" containers with very different performance characteristics:
| Feature | list |
array.array |
numpy.ndarray |
|---|---|---|---|
| Stores | References (pointers) | Raw values | Raw values |
| Type | Heterogeneous | Homogeneous | Homogeneous |
| Element size (int) | 8 (ptr) + 28 (obj) = 36 bytes | 4 bytes (int32) | 4 or 8 bytes |
| Cache-friendly | No (values scattered) | Yes (contiguous values) | Yes (contiguous values) |
| Vectorized ops | No | No | Yes (SIMD) |
| Use case | General purpose | I/O buffers, protocol data | Numerical computing |
import sys
import array
import numpy as np
n = 1_000_000
# Memory comparison for storing 1 million integers
lst = list(range(n))
arr = array.array('i', range(n)) # 4-byte signed int
npa = np.arange(n, dtype=np.int32)
print(f"list: {sys.getsizeof(lst) + n * 28:>12,} bytes") # ~36 MB
print(f"array.array: {sys.getsizeof(arr):>12,} bytes") # ~4 MB
print(f"numpy: {npa.nbytes:>12,} bytes") # ~4 MB
When to use which:
list: Default choice. Flexibility over performance. Mixed types, small collections.array.array: When you need compact storage of a single primitive type and don't need math operations. Useful for binary I/O.numpy.ndarray: Any numerical computation. Vectorized operations, linear algebra, scientific computing.
2.11 Memory Alignment and the struct Module
When data crosses hardware boundaries, performance suffers. CPUs access memory most efficiently when data is aligned โ meaning an n-byte value starts at an address divisible by n.
import struct
# Pack values with alignment (default behavior)
# char (1) + pad (3) + int (4) + char (1) + pad (7) + double (8) = 24 bytes
s1 = struct.Struct('c i c d') # Note: default alignment depends on platform
# Pack values without alignment padding
s2 = struct.Struct('=c i c d') # '=' means native byte order, no padding
print(f"With alignment: {struct.calcsize('c i c d')} bytes") # Platform-dependent
print(f"Without: {struct.calcsize('=cicid')} bytes") # Tightly packed
# Why alignment matters:
# An unaligned 4-byte int at address 0x03 spans two memory words (0x00-0x03 and 0x04-0x07).
# The CPU must perform TWO memory reads and mask/shift the bytes together.
# An aligned int at address 0x04 requires ONE memory read.
Real-world impact (Drepper, "What Every Programmer Should Know About Memory," 2007): Unaligned accesses on x86 are handled transparently by hardware but with a performance penalty of ~2x for L1 cache hits. On ARM and older architectures, unaligned access causes a hardware fault (bus error), terminating the program.
Level 3 ยท What the Spec Says
2.12 Amortized Analysis of Dynamic Arrays
In Chapter 1, we introduced Tarjan's potential method for amortized analysis. Let us apply it rigorously to prove that dynamic array append is O(1) amortized.
Theorem: For a dynamic array with growth factor ฮฑ > 1, the amortized cost of append is O(1).
Proof using the accounting method (Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms, 4th ed., 2022, Chapter 16):
Assign each append operation a charge of 3 units (for growth factor 2):
- 1 unit pays for the actual insertion
- 1 unit is saved as "credit" on the newly inserted element
- 1 unit is saved as "credit" on one element in the first half of the array
When the array is full (n elements, capacity n), we need to copy all n elements to a new array of size 2n. At this moment:
- The first half (n/2 elements) each have 1 credit (deposited when they were "passed" by new insertions)
- The second half (n/2 elements) each have 1 credit (deposited when they were inserted)
- Total credits available: n/2 + n/2 = n โ exactly enough to pay for copying n elements
Therefore, no operation ever goes into "debt," and the amortized cost per append is 3 = O(1). โ
For general growth factor ฮฑ:
The amortized cost per append is:
c_amortized = 1 + ฮฑ/(ฮฑ-1)
Derivation: When resizing from capacity C to ฮฑC, we copy C elements. Between two consecutive resizes, we insert C(ฮฑ-1) new elements. So the copy cost per element is C / (C(ฮฑ-1)) = 1/(ฮฑ-1). Each element also pays 1 for its own insertion.
| ฮฑ | Amortized cost | Wasted space |
|---|---|---|
| 2.0 | 1 + 2 = 3 | โค 50% |
| 1.5 | 1 + 3 = 4 | โค 33% |
| 1.25 | 1 + 5 = 6 | โค 20% |
| 1.125 | 1 + 9 = 10 | โค 11% |
CPython's choice of ฮฑ โ 1.125 means each append "costs" about 10 units of work amortized, but the memory overhead stays below 11%. This is a deliberate engineering trade-off: Python programs create many lists but rarely perform billions of appends to a single list, so minimizing memory waste is more valuable than minimizing copy overhead.
2.13 Array Implementation Across Languages
The array is a universal data structure, but implementations vary significantly:
C (ISO/IEC 9899:2018, ยง6.5.6):
int arr[100]; // Stack-allocated, fixed size, no bounds checking
arr[i] == *(arr + i) // Pointer arithmetic โ undefined behavior if out of bounds
// sizeof(arr) == 400 (100 * 4 bytes)
// No runtime length information โ programmer must track it
The C standard defines array subscript arr[i] as exactly equivalent to *(arr + i). This means:
- No bounds checking (segfault or silent corruption on overflow)
arr[i]andi[arr]are both legal and equivalent- Array names decay to pointers in most contexts (ยง6.3.2.1)
Java (JLS ยง10, Java Language Specification, Gosling et al.):
int[] arr = new int[100]; // Heap-allocated, fixed size, bounds-checked
// arr.length is a final field โ O(1) access to size
// ArrayIndexOutOfBoundsException on invalid index
// JIT may eliminate bounds checks after proving index validity
Java arrays are objects with a length field. The JVM specification (ยง2.11.5) requires bounds checking on every array access (aaload, iaload instructions), but HotSpot's JIT compiler performs range check elimination when it can prove the index is valid (e.g., in counted loops).
Python (CPython implementation, Include/cpython/listobject.h):
// Simplified from CPython source
typedef struct {
PyObject_VAR_HEAD // ob_refcnt, ob_type, ob_size (length)
PyObject **ob_item; // Pointer to array of pointers
Py_ssize_t allocated; // Total allocated slots
} PyListObject;
Key differences:
ob_size= number of elements currently in the listallocated= total capacity (โฅ ob_size)ob_itempoints to a separately allocated array ofPyObject*pointers- Each element access requires: bounds check โ pointer dereference โ object access
Performance consequence: A Python list access does approximately 5x more work than a C array access (bounds check + pointer load + potential cache miss for the object). This is why NumPy exists โ it bypasses the Python object layer entirely.
2.14 CPU Cache Architecture: L1/L2/L3 and the 64-Byte Cache Line
(Cross-reference: Computer How It Works, Chapter 11 โ for full treatment of cache architecture)
The cache system is the single most important factor in understanding array performance. Here are the specifications that matter:
Cache line size: 64 bytes (Intel SDM ยง11.6; ARM Architecture Reference Manual, ยงD7.2)
All modern x86 and ARM processors use a 64-byte cache line as the fundamental unit of transfer between cache levels and main memory. When any byte in a 64-byte aligned block is accessed, the entire block is loaded.
For arrays, this means:
int32[16]elements fit in one cache line (16 ร 4 = 64)int64[8]elements fit in one cache line (8 ร 8 = 64)float64[8]elements fit in one cache line (8 ร 8 = 64)
Associativity and eviction (Hennessy & Patterson, ยงB.3):
L1 caches are typically 8-way set-associative: each memory address maps to one of N/8 sets, and within that set, 8 cache lines can coexist. When a 9th line maps to the same set, one must be evicted (typically LRU โ Least Recently Used).
Why arrays benefit:
- Spatial locality: Accessing
arr[i]loads 64 bytes containingarr[i]througharr[i+7](for 8-byte elements). Subsequent accesses hit cache. - Temporal locality: Recently accessed elements remain in cache for reuse.
- Prefetching: Hardware detects stride patterns (stride = element_size for sequential access) and issues prefetch requests 2-4 cache lines ahead.
# Demonstrating cache line effects
import numpy as np
import time
# Stride-1 access: every element โ maximum cache utilization
# Stride-8 access: every 8th element โ 1 useful element per cache line
# Stride-64 access: every 64th element โ crosses cache line every time
arr = np.zeros(8_000_000, dtype=np.int64) # 64 MB
def stride_access(arr, stride):
total = 0
for i in range(0, len(arr), stride):
total += arr[i]
return total
for stride in [1, 2, 4, 8, 16, 32]:
n_accesses = len(arr) // stride
start = time.perf_counter()
stride_access(arr, stride)
elapsed = time.perf_counter() - start
ns_per = elapsed / n_accesses * 1e9
print(f"Stride {stride:>2}: {ns_per:.1f} ns/access ({n_accesses:>10,} accesses)")
# With stride 1-8, the ns/access is nearly constant (cache line amortization)
# Beyond stride 8 (for int64), each access is a new cache line โ slower
2.15 Row-Major vs. Column-Major Storage
The choice of memory layout for multi-dimensional arrays has been a source of bugs and performance issues since the earliest days of scientific computing.
Row-major (C, Python/NumPy default, Java, C#, Pascal):
M[i][j] stored at: base + (i * ncols + j) * element_size
Adjacent in memory: M[0][0], M[0][1], M[0][2], ... M[0][n-1], M[1][0], ...
Column-major (Fortran, MATLAB, R, Julia):
M(i,j) stored at: base + (j * nrows + i) * element_size
Adjacent in memory: M(1,1), M(2,1), M(3,1), ... M(m,1), M(1,2), ...
Historical reason (Backus et al., "The FORTRAN Automatic Coding System," 1957): Fortran was designed for scientific computing where operations on column vectors (extracting a variable across all observations) are more common than operations on rows. C was designed for systems programming where structures (rows) are processed sequentially.
Performance impact when layout doesn't match access pattern:
import numpy as np
import time
n = 4000
# Row-major (C order) โ default
C = np.ones((n, n), dtype=np.float64, order='C')
# Column-major (Fortran order)
F = np.ones((n, n), dtype=np.float64, order='F')
# Row traversal on both
def row_traverse(M):
s = 0.0
for i in range(M.shape[0]):
for j in range(M.shape[1]):
s += M[i, j]
return s
# Column traversal on both
def col_traverse(M):
s = 0.0
for j in range(M.shape[1]):
for i in range(M.shape[0]):
s += M[i, j]
return s
# Row traverse is fast on C-order, slow on F-order
# Col traverse is fast on F-order, slow on C-order
# In pure Python loops, difference is 2-4x
# In compiled code (C/Fortran), difference is 5-20x
Practical advice: Always traverse arrays in the order they are stored. In NumPy, check array.flags to see if an array is C-contiguous or F-contiguous, and iterate accordingly.
Level 4 ยท Edge Cases and Pitfalls
2.16 Python List Hidden Performance Traps
Trap 1: list.insert(0, x) is O(n)
This is the most common performance bug in Python code. Every front-insertion requires copying the entire array:
# Building a list of n elements by prepending: O(nยฒ) total
# n = 100,000 takes ~2.5 seconds
# n = 1,000,000 takes ~250 seconds (4+ minutes!)
# Solution: use collections.deque or build reversed, then reverse once
Trap 2: list + list creates a new list (O(n+m))
# This innocent-looking loop is O(nยฒ):
result = []
for chunk in chunks:
result = result + chunk # Creates new list each iteration!
# Fix: use extend() which is O(m) per call
result = []
for chunk in chunks:
result.extend(chunk) # Modifies in-place
# Or even better:
import itertools
result = list(itertools.chain.from_iterable(chunks)) # Single allocation
Trap 3: [x] * n shares references
# For immutable x (int, str, tuple): safe
zeros = [0] * 1000 # Fine โ integers are immutable
# For mutable x (list, dict, set): DANGEROUS
rows = [[]] * 5
rows[0].append(1)
print(rows) # [[1], [1], [1], [1], [1]] โ all five are the same list!
# Fix:
rows = [[] for _ in range(5)] # Five independent lists
Trap 4: Membership testing is O(n)
# BAD: O(n) per lookup โ O(nยฒ) for loop
large_list = list(range(1_000_000))
for item in query_items:
if item in large_list: # Linear scan!
process(item)
# GOOD: O(1) per lookup โ O(n) for loop
large_set = set(large_list) # One-time O(n) conversion
for item in query_items:
if item in large_set: # Hash lookup!
process(item)
2.17 Large Array Memory Estimation
sys.getsizeof() only reports the shallow size of a Python object โ the size of the container, not its contents.
import sys
lst = [1, 2, 3, 4, 5]
# Shallow size (just the list object + pointer array)
print(sys.getsizeof(lst)) # 104 bytes
# But each int object also occupies memory!
print(sys.getsizeof(1)) # 28 bytes
# True total memory:
shallow = sys.getsizeof(lst) # 104
deep = shallow + sum(sys.getsizeof(x) for x in lst) # 104 + 5*28 = 244
print(f"Shallow: {shallow}, Deep: {deep}")
# For large arrays of integers:
n = 1_000_000
lst = list(range(n))
# Shallow: ~8 MB (list overhead + 1M pointers ร 8 bytes)
# Deep: ~8 MB + 1M ร 28 bytes = ~36 MB total
# Compare: numpy array = 8 MB (int64) or 4 MB (int32)
Accurate memory measurement with tracemalloc:
import tracemalloc
tracemalloc.start()
# Measure actual memory impact
lst = list(range(1_000_000))
snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics('lineno')
for stat in stats[:5]:
print(stat)
# Shows ~36 MB total (pointer array + integer objects)
tracemalloc.stop()
Memory estimation formulas (CPython 3.12, 64-bit):
| Container | Formula | 1M int elements |
|---|---|---|
list of Python ints |
56 + 8n + 28n = 56 + 36n bytes | ~36 MB |
list of Python floats |
56 + 8n + 24n = 56 + 32n bytes | ~32 MB |
array.array('i') |
64 + 4n bytes | ~4 MB |
array.array('d') |
64 + 8n bytes | ~8 MB |
numpy.ndarray (int32) |
112 + 4n bytes | ~4 MB |
numpy.ndarray (int64) |
112 + 8n bytes | ~8 MB |
2.18 Interview Problems: Arrays
The following LeetCode problems test array mastery. We provide optimal solutions with detailed analysis.
Problem 1: Rotate Array (LeetCode #189)
Problem: Rotate an array of n elements to the right by k steps.
Example: [1,2,3,4,5,6,7], k=3 โ [5,6,7,1,2,3,4]
def rotate(nums: list[int], k: int) -> None:
"""
Three-reversal algorithm.
Time: O(n), Space: O(1)
Insight (Bentley, Programming Pearls, 1986):
Rotating by k is equivalent to:
1. Reverse entire array: [7,6,5,4,3,2,1]
2. Reverse first k elements: [5,6,7,4,3,2,1]
3. Reverse remaining n-k: [5,6,7,1,2,3,4]
Why this works: Let A = first n-k elements, B = last k elements.
We want BA. Starting from AB:
reverse(AB) = (AB)^R = B^R A^R
reverse first k of B^R A^R = B A^R
reverse last n-k of B A^R = B A = BA โ
"""
n = len(nums)
k = k % n # Handle k > n
if k == 0:
return
def reverse(arr, start, end):
while start < end:
arr[start], arr[end] = arr[end], arr[start]
start += 1
end -= 1
reverse(nums, 0, n - 1) # Reverse all
reverse(nums, 0, k - 1) # Reverse first k
reverse(nums, k, n - 1) # Reverse rest
# Alternative: cyclic replacement (O(n) time, O(1) space)
def rotate_cyclic(nums: list[int], k: int) -> None:
"""
Move each element directly to its final position.
Use GCD(n,k) to determine number of cycles needed.
"""
n = len(nums)
k = k % n
if k == 0:
return
from math import gcd
num_cycles = gcd(n, k)
for start in range(num_cycles):
current = start
prev = nums[start]
while True:
next_idx = (current + k) % n
nums[next_idx], prev = prev, nums[next_idx]
current = next_idx
if current == start:
break
Problem 2: Merge Sorted Array (LeetCode #88)
Problem: Merge nums2 into nums1 (which has extra space at end), maintaining sorted order.
def merge(nums1: list[int], m: int, nums2: list[int], n: int) -> None:
"""
Merge from the back to avoid overwriting unprocessed elements.
Time: O(m+n), Space: O(1)
Key insight: nums1 has space at the end (indices m to m+n-1).
By filling from the back (largest first), we never overwrite
elements in nums1 that haven't been processed yet.
Why it works: At position p = m+n-1, we place the largest of
nums1[i] and nums2[j]. Since we're placing the largest remaining
element at the highest unfilled position, and nums1's values at
indices > i are already placed (or empty), we never lose data.
"""
p = m + n - 1 # Pointer to current fill position
i = m - 1 # Pointer to end of nums1's values
j = n - 1 # Pointer to end of nums2's values
while j >= 0:
if i >= 0 and nums1[i] > nums2[j]:
nums1[p] = nums1[i]
i -= 1
else:
nums1[p] = nums2[j]
j -= 1
p -= 1
# Example trace:
# nums1 = [1,2,3,0,0,0], m=3, nums2 = [2,5,6], n=3
# Step 1: compare 3 vs 6 โ place 6 at index 5: [1,2,3,0,0,6]
# Step 2: compare 3 vs 5 โ place 5 at index 4: [1,2,3,0,5,6]
# Step 3: compare 3 vs 2 โ place 3 at index 3: [1,2,3,3,5,6] (i now at 1)
# Step 4: compare 2 vs 2 โ place 2 at index 2: [1,2,2,3,5,6] (j now at -1)
# Done! j < 0, remaining nums1 elements already in place.
Problem 3: Remove Element (LeetCode #27)
Problem: Remove all instances of val from nums in-place. Return new length.
def removeElement(nums: list[int], val: int) -> int:
"""
Two-pointer technique: slow pointer marks the "write" position.
Time: O(n), Space: O(1)
Invariant: all elements before index `slow` are != val.
"""
slow = 0
for fast in range(len(nums)):
if nums[fast] != val:
nums[slow] = nums[fast]
slow += 1
return slow
# Variant: when elements to remove are rare, swap with end to minimize writes
def removeElement_swap(nums: list[int], val: int) -> int:
"""
Swap unwanted elements to end. Fewer writes when val is rare.
Time: O(n), Space: O(1)
"""
n = len(nums)
i = 0
while i < n:
if nums[i] == val:
nums[i] = nums[n - 1]
n -= 1
# Don't increment i โ need to check swapped element
else:
i += 1
return n
Problem 4: Spiral Matrix (LeetCode #54)
Problem: Return all elements of an mรn matrix in spiral order.
def spiralOrder(matrix: list[list[int]]) -> list[int]:
"""
Layer-by-layer peeling approach.
Time: O(mรn), Space: O(1) extra (excluding output)
Think of the matrix as concentric rectangular rings.
Process each ring: right โ down โ left โ up.
After processing the outer ring, shrink boundaries inward.
"""
if not matrix or not matrix[0]:
return []
result = []
top, bottom = 0, len(matrix) - 1
left, right = 0, len(matrix[0]) - 1
while top <= bottom and left <= right:
# Traverse right along top row
for col in range(left, right + 1):
result.append(matrix[top][col])
top += 1
# Traverse down along right column
for row in range(top, bottom + 1):
result.append(matrix[row][right])
right -= 1
# Traverse left along bottom row (if still valid)
if top <= bottom:
for col in range(right, left - 1, -1):
result.append(matrix[bottom][col])
bottom -= 1
# Traverse up along left column (if still valid)
if left <= right:
for row in range(bottom, top - 1, -1):
result.append(matrix[row][left])
left += 1
return result
# Example:
# [[1, 2, 3],
# [4, 5, 6], โ [1, 2, 3, 6, 9, 8, 7, 4, 5]
# [7, 8, 9]]
2.19 When NOT to Use Arrays
Arrays are the default choice, but they are wrong in specific scenarios:
| Scenario | Problem with Array | Better Choice | Why |
|---|---|---|---|
| Frequent insert/delete at front | O(n) per operation | collections.deque |
O(1) deque uses doubly-linked block list |
| Frequent insert/delete in middle | O(n) per operation | Linked list or balanced BST | O(1) or O(log n) respectively |
| Need O(1) membership test | O(n) linear scan | set or dict |
Hash table gives O(1) average |
| Sparse data (mostly zeros) | Wastes memory on zeros | dict or scipy.sparse |
Store only non-zero entries |
| Unknown final size + millions of prepends | O(nยฒ) total | deque + convert |
Amortized O(1) prepend |
| Need sorted + frequent inserts | O(n) per insert to maintain order | sortedcontainers.SortedList |
B-tree gives O(log n) |
from collections import deque
import timeit
n = 100_000
# Scenario: Build sequence by prepending
def with_list():
lst = []
for i in range(n):
lst.insert(0, i)
def with_deque():
d = deque()
for i in range(n):
d.appendleft(i)
t_list = timeit.timeit(with_list, number=1)
t_deque = timeit.timeit(with_deque, number=1)
print(f"list.insert(0): {t_list:.3f}s") # ~2.5s
print(f"deque.appendleft: {t_deque:.3f}s") # ~0.007s
print(f"Ratio: {t_list/t_deque:.0f}x") # ~350x faster
The decision framework:
- What is your dominant operation? If it's random access by index โ array. If it's insertion/deletion at ends โ deque. If it's search โ hash table or tree.
- How large is your data? For n < 100, the constant factors dominate and any structure works. For n > 10โถ, wrong choice means minutes vs. milliseconds.
- Do you need cache performance? For tight numerical loops, only contiguous memory (array/numpy) gives you hardware prefetching.
Summary
The array is deceptively simple. Behind arr[i] lies a deep connection between mathematics (address arithmetic), hardware (cache lines, prefetching, alignment), and systems design (growth factors, memory allocation).
Key takeaways:
- O(1) random access comes from contiguous memory + fixed element size
- The price is O(n) insertion/deletion (maintaining contiguity requires shifting)
- Python
liststores pointers, not values โ 5-9x memory overhead vs. NumPy - Dynamic array growth factor is a space-time trade-off; CPython chose ~1.125 for memory efficiency
- Cache locality makes sequential array traversal 5-10x faster than random access patterns
- Always match your traversal order to the storage order (row-major or column-major)
- Use the right data structure: deque for ends, set for membership, numpy for numbers