Using MD5 Hash for File Deduplication
Why Use Hashes Instead of Byte-by-Byte Comparison
To check whether two files have identical content, the most direct method is byte-by-byte comparison. That works for a single pair, but it scales poorly when deduplicating many files, since every candidate pair would need to be compared. The advantage of hash comparison is that each file's hash is computed once (O(n), where n is the file size), after which any two files can be compared in O(1) by comparing their hash values. Once every file has been hashed, files that share a hash can be treated as having identical content. MD5 is a reasonable choice here: it is fast, and its 128-bit hash space makes accidental collisions negligible for practical file deduplication (deliberately crafted collisions are a separate concern, discussed below).
Basic File Deduplication Script (Python)
import hashlib, os
from collections import defaultdict

def md5_file(filepath, chunk_size=65536):
    """Calculate MD5 of a file efficiently"""
    hasher = hashlib.md5()
    with open(filepath, 'rb') as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates(directory):
    """Find all duplicate files in a directory"""
    hash_map = defaultdict(list)
    for root, dirs, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            try:
                file_hash = md5_file(filepath)
                hash_map[file_hash].append(filepath)
            except (IOError, PermissionError):
                pass
    # Return only groups with duplicates
    return {h: paths for h, paths in hash_map.items() if len(paths) > 1}

# Usage
duplicates = find_duplicates('/path/to/folder')
for hash_val, paths in duplicates.items():
    print(f"\nDuplicate group ({hash_val[:8]}...):")
    for path in paths:
        print(f"  {path}")
Optimization: Compare File Size First
An important optimization is to group files by size first, since only files of the same size can have identical content. Files whose size is unique never need to be hashed at all:
def find_duplicates_optimized(directory):
    # Step 1: Group by file size (fast - no hashing)
    size_map = defaultdict(list)
    for root, _, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            try:
                size = os.path.getsize(filepath)
                size_map[size].append(filepath)
            except OSError:
                pass

    # Step 2: Only hash files where size matches
    hash_map = defaultdict(list)
    for size, paths in size_map.items():
        if len(paths) < 2:
            continue  # unique size: cannot have a duplicate
        for filepath in paths:
            try:
                file_hash = md5_file(filepath)
                hash_map[file_hash].append(filepath)
            except (IOError, PermissionError):
                pass

    # Return only groups with duplicates
    return {h: paths for h, paths in hash_map.items() if len(paths) > 1}
Storing File Hashes in a Database
For continuously running systems (like file storage services), storing file hashes in a database avoids redundant calculations and enables instant deduplication:
-- Database schema for deduplication
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    md5_hash CHAR(32) NOT NULL,
    sha256_hash CHAR(64),           -- Optional extra verification
    file_size BIGINT NOT NULL,
    original_name TEXT,
    storage_path TEXT NOT NULL,     -- Actual storage location
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE UNIQUE INDEX idx_md5 ON files(md5_hash);

-- When uploading a new file:
-- 1. Calculate MD5
-- 2. Check if MD5 already exists in DB
-- 3. If yes: return existing storage_path (dedup!)
-- 4. If no: store file, insert new row
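To make steps 1-4 concrete, here is a minimal sketch of the upload path. It assumes Python's sqlite3 module (with the schema's NOW() default swapped for SQLite's CURRENT_TIMESTAMP) and a flat storage/ directory; the function name upload_file and the storage layout are illustrative, not part of the schema above.

import hashlib, os, shutil, sqlite3

def upload_file(conn, src_path, original_name):
    """Deduplicating upload: reuse an existing blob when the MD5 is already known."""
    # 1. Calculate MD5 (plus SHA-256 for the optional verification column)
    md5, sha256, size = hashlib.md5(), hashlib.sha256(), 0
    with open(src_path, 'rb') as f:
        while chunk := f.read(65536):
            md5.update(chunk)
            sha256.update(chunk)
            size += len(chunk)
    md5_hex = md5.hexdigest()

    # 2. Check if this MD5 already exists in the DB
    row = conn.execute("SELECT storage_path FROM files WHERE md5_hash = ?",
                       (md5_hex,)).fetchone()
    if row:
        return row[0]  # 3. Duplicate: return the existing storage_path

    # 4. New content: store the blob, insert a new row
    os.makedirs("storage", exist_ok=True)
    storage_path = os.path.join("storage", md5_hex)
    shutil.copyfile(src_path, storage_path)
    conn.execute("INSERT INTO files (md5_hash, sha256_hash, file_size, "
                 "original_name, storage_path) VALUES (?, ?, ?, ?, ?)",
                 (md5_hex, sha256.hexdigest(), size, original_name, storage_path))
    conn.commit()
    return storage_path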
Content-Addressable Storage (CAS) Principle
Content-Addressable Storage (CAS) is an advanced deduplication architecture: files are stored by their content hash rather than by name, and the filename is the hash value itself. Git's object storage is a CAS implementation; each blob object is named by the SHA-1 hash of its content (with SHA-256 support being phased in). In CAS, identical content is stored only once but can be referenced under any number of different names, so deduplication happens naturally.
# CAS directory structure example
storage/
    ab/
        ab1234567890abcdef1234567890abcd    # actual content
    5d/
        5d41402abc4b2a76b9719d911017c592    # actual content
    ...

# Lookup: hash -> file path
def get_file_path(md5_hash):
    return f"storage/{md5_hash[:2]}/{md5_hash}"
Considerations for Large-Scale Deduplication
- MD5 collision risk: accidental MD5 collisions are extremely unlikely even at large scale, but MD5 is cryptographically broken, so user-submitted files carry a risk of deliberately crafted collisions. For critical or untrusted data, add SHA256 as secondary verification (as in the sha256_hash column above)
- Large file handling: read in chunks (e.g., 64 KB blocks), as md5_file does above, to avoid loading whole files into memory
- Parallel computation: use multithreading or multiprocessing to hash multiple files in parallel (see the first sketch after this list)
- Incremental deduplication: record the size and modification time of files already hashed to avoid recalculating unchanged files on later runs (see the second sketch after this list)
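As a rough illustration of the parallel-computation point, the sketch below reuses md5_file and the size-grouping idea from earlier and fans the hashing out over worker processes with the standard-library concurrent.futures module; the worker count and function name are illustrative, and per-file error handling is omitted for brevity.

import os
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def find_duplicates_parallel(directory, workers=4):
    """Like find_duplicates_optimized, but hashes candidates in parallel."""
    # Step 1: group by size (cheap, single process)
    size_map = defaultdict(list)
    for root, _, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            try:
                size_map[os.path.getsize(filepath)].append(filepath)
            except OSError:
                pass

    # Step 2: hash only same-size candidates, spread across worker processes
    candidates = [p for paths in size_map.values() if len(paths) > 1 for p in paths]
    hash_map = defaultdict(list)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with candidates
        for path, file_hash in zip(candidates, pool.map(md5_file, candidates)):
            hash_map[file_hash].append(path)
    return {h: paths for h, paths in hash_map.items() if len(paths) > 1}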
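For incremental deduplication, a small on-disk cache keyed by path, size, and mtime avoids rehashing unchanged files between runs. This is a sketch; the cache file name .hash_cache.json and the helper names are arbitrary.

import json, os

def md5_file_cached(filepath, cache):
    """Return a cached MD5 when size and mtime are unchanged, else rehash."""
    st = os.stat(filepath)
    entry = cache.get(filepath)
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime:
        return entry["md5"]
    file_hash = md5_file(filepath)      # md5_file defined earlier
    cache[filepath] = {"size": st.st_size, "mtime": st.st_mtime, "md5": file_hash}
    return file_hash

def load_cache(path=".hash_cache.json"):
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def save_cache(cache, path=".hash_cache.json"):
    with open(path, "w") as f:
        json.dump(cache, f)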