
How to Remove Duplicate Lines from Text

2026-04-05 · 5 min read

When You Need to Remove Duplicate Lines

Duplicate lines come up constantly in data processing. Common scenarios: keyword lists merged from multiple sources; records exported from a database; entries added by multiple collaborators in a shared document; repeated URLs or content in web-scraping results; overlapping entries when merging several configuration files.

Deduplication with Online Tools

The quickest method is using an online text deduplication tool. Paste the text with duplicate lines into the input box, and the tool keeps the first occurrence of each line and removes subsequent duplicates. Options typically include case sensitivity (whether Apple and apple are treated as duplicates) and whether to remove blank lines.

Command-Line Tools (Unix/Linux/macOS)

# sort + uniq combo: sort first, then deduplicate (uniq only detects adjacent duplicate lines)
sort input.txt | uniq > output.txt

# Remove only consecutive duplicate lines (no sorting)
uniq input.txt > output.txt

# Count how many times each line occurs
sort input.txt | uniq -c

# Keep only lines that occur more than once
sort input.txt | uniq -d

# Keep only lines that occur exactly once (unique lines)
sort input.txt | uniq -u

Python Code Implementation

# ไฟ็•™้กบๅบ็š„ๅŽป้‡๏ผˆไฝฟ็”จ dict ๆˆ– set๏ผ‰
def remove_duplicates(text):
    lines = text.splitlines()
    seen = {}
    result = []
    for line in lines:
        key = line.strip().lower()  # ๅฟฝ็•ฅๅคงๅฐๅ†™ๅ’Œๅ‰ๅŽ็ฉบๆ ผ
        if key not in seen:
            seen[key] = True
            result.append(line)
    return '\n'.join(result)

# ๆณจๆ„๏ผšไฝฟ็”จ set() ๅŽป้‡ไธไฟ่ฏ้กบๅบ
# unique_lines = list(set(lines))  # ไธๆŽจ่็”จไบŽ้œ€่ฆไฟ็•™้กบๅบ็š„ๆƒ…ๅ†ต

Order-Preserving vs. Non-Order-Preserving

Deduplication follows one of two strategies: order-preserving (keep the first occurrence of each line and remove later duplicates) and non-order-preserving (deduplicate with a set and output in arbitrary order). For most list-organization tasks, preserving the original line order, dropping later duplicates rather than earlier ones, better matches expectations. The sort + uniq combination sorts first and then deduplicates, so the result is a sorted unique list rather than the original order.
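The two strategies can be seen side by side in Python. Since Python 3.7, dict keys preserve insertion order, so dict.fromkeys gives an order-preserving dedup in one line, while a plain set does not:

```python
lines = ["banana", "apple", "banana", "cherry", "apple"]

# Order-preserving: dict keys keep insertion order (Python 3.7+)
ordered = list(dict.fromkeys(lines))   # ['banana', 'apple', 'cherry']

# Non-order-preserving: set iteration order is unspecified
unordered = list(set(lines))           # same elements, arbitrary order

# Sorted unique list, equivalent to `sort | uniq`
sorted_unique = sorted(set(lines))     # ['apple', 'banana', 'cherry']
```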

Handling Near-Duplicate Lines

Exact duplicates are easy to handle, but "near-duplicates" (lines differing only in spacing, case, or surrounding whitespace) require normalization before comparison. Typical normalization steps: trim leading and trailing whitespace, unify case (all lowercase or all uppercase), and collapse runs of spaces into a single space. Applying these steps before comparison catches many more lines that are duplicates in practice.
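The normalization steps above can be sketched in Python as follows; the function names are illustrative, and the output keeps the original form of each line's first occurrence:

```python
import re

def normalize(line):
    """Normalize a line for near-duplicate comparison."""
    line = line.strip()               # trim leading/trailing whitespace
    line = line.lower()               # unify case
    line = re.sub(r'\s+', ' ', line)  # collapse runs of whitespace
    return line

def dedupe_normalized(lines):
    """Keep the first occurrence of each line, comparing normalized forms."""
    seen = set()
    result = []
    for line in lines:
        key = normalize(line)
        if key not in seen:
            seen.add(key)
            result.append(line)  # keep the original, un-normalized form
    return result
```

For example, dedupe_normalized(["Hello  World", "hello world ", "Bye"]) treats the first two lines as duplicates and returns ["Hello  World", "Bye"].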

Performance Considerations for Large Files

For large files with millions of lines, online tools may hit browser memory limits. In that case, command-line tools are the better choice: sort -u input.txt > output.txt handles very large files efficiently because sort uses external sorting (spilling to temporary files) rather than loading all the data into memory at once.
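For comparison, a streaming approach in Python reads one line at a time, so memory grows with the number of unique lines rather than the file size. This is a sketch with placeholder file paths; when even the unique lines won't fit in memory, sort -u with its external sorting remains the better tool:

```python
def dedupe_file(in_path, out_path):
    """Stream a file line by line, writing each unique line once.

    Memory usage is proportional to the number of *unique* lines,
    not the total file size.
    """
    seen = set()
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)
```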

Deduplication in Excel and Google Sheets

If the data is in a spreadsheet, Excel's "Remove Duplicates" feature (Data tab) can deduplicate by column, and Google Sheets offers the =UNIQUE(A1:A100) formula, which returns a deduplicated list. For data that already lives in a spreadsheet, this is more convenient than exporting it for separate processing.
