
How to Remove Duplicate Lines from Text

2026-04-05 · 5 min read

When You Need to Remove Duplicate Lines

Duplicate lines come up constantly in data processing. Common scenarios: keyword lists merged from multiple sources; records exported from a database; entries added by multiple collaborators in a shared document; repeated URLs or content in web-scraping results; overlapping entries when merging several configuration files.

Deduplication with Online Tools

The quickest method is using an online text deduplication tool. Paste the text with duplicate lines into the input box, and the tool keeps the first occurrence of each line and removes subsequent duplicates. Options typically include case sensitivity (whether Apple and apple are treated as duplicates) and whether to remove blank lines.

Command-Line Tools (Unix/Linux/macOS)

# sort + uniq combo: sort first, then deduplicate (uniq only detects adjacent duplicate lines)
sort input.txt | uniq > output.txt

# Remove only consecutive duplicate lines (no sorting)
uniq input.txt > output.txt

# Count how many times each line occurs
sort input.txt | uniq -c

# Keep only lines that occur more than once
sort input.txt | uniq -d

# Keep only lines that occur exactly once (unique lines)
sort input.txt | uniq -u

Python Code Implementation

# ไฟ็•™้กบๅบ็š„ๅŽป้‡๏ผˆไฝฟ็”จ dict ๆˆ– set๏ผ‰
def remove_duplicates(text):
    lines = text.splitlines()
    seen = {}
    result = []
    for line in lines:
        key = line.strip().lower()  # ๅฟฝ็•ฅๅคงๅฐๅ†™ๅ’Œๅ‰ๅŽ็ฉบๆ ผ
        if key not in seen:
            seen[key] = True
            result.append(line)
    return '\n'.join(result)

# ๆณจๆ„๏ผšไฝฟ็”จ set() ๅŽป้‡ไธไฟ่ฏ้กบๅบ
# unique_lines = list(set(lines))  # ไธๆŽจ่็”จไบŽ้œ€่ฆไฟ็•™้กบๅบ็š„ๆƒ…ๅ†ต

Order-Preserving vs. Non-Order-Preserving

Deduplication follows one of two strategies: order-preserving (keep the first occurrence of each line and remove later duplicates) and non-order-preserving (deduplicate with a set and output in arbitrary order). For most list-organization tasks, preserving the original line order, dropping later duplicates rather than earlier ones, better matches expectations. The sort + uniq combination sorts first and then deduplicates, so the result is a sorted unique list rather than the original order.
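The two strategies can be seen side by side in Python. Since Python 3.7, dict keys preserve insertion order, so dict.fromkeys gives an order-preserving dedup in one line, while a plain set does not:

```python
lines = ["banana", "apple", "banana", "cherry", "apple"]

# Order-preserving: dict keys keep insertion order (Python 3.7+)
ordered = list(dict.fromkeys(lines))   # ['banana', 'apple', 'cherry']

# Non-order-preserving: set iteration order is unspecified
unordered = list(set(lines))           # same elements, arbitrary order

# Sorted unique list, equivalent to `sort | uniq`
sorted_unique = sorted(set(lines))     # ['apple', 'banana', 'cherry']
```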

Handling Near-Duplicate Lines

Exact duplicates are easy to handle, but "near-duplicates" (lines differing only in spacing, case, or surrounding whitespace) require normalization before comparison. Typical normalization steps: trim leading and trailing whitespace, unify case (all lowercase or all uppercase), and collapse runs of spaces into a single space. Applying these steps before comparison catches many more lines that are duplicates in practice.
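The normalization steps above can be sketched in Python as follows; the function names are illustrative, and the output keeps the original form of each line's first occurrence:

```python
import re

def normalize(line):
    """Normalize a line for near-duplicate comparison."""
    line = line.strip()               # trim leading/trailing whitespace
    line = line.lower()               # unify case
    line = re.sub(r'\s+', ' ', line)  # collapse runs of whitespace
    return line

def dedupe_normalized(lines):
    """Keep the first occurrence of each line, comparing normalized forms."""
    seen = set()
    result = []
    for line in lines:
        key = normalize(line)
        if key not in seen:
            seen.add(key)
            result.append(line)  # keep the original, un-normalized form
    return result
```

For example, dedupe_normalized(["Hello  World", "hello world ", "Bye"]) treats the first two lines as duplicates and returns ["Hello  World", "Bye"].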

Performance Considerations for Large Files

For large files with millions of lines, online tools may hit browser memory limits. In that case, command-line tools are the better choice: sort -u input.txt > output.txt handles very large files efficiently because sort uses external sorting (spilling to temporary files) rather than loading all the data into memory at once.
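For comparison, a streaming approach in Python reads one line at a time, so memory grows with the number of unique lines rather than the file size. This is a sketch with placeholder file paths; when even the unique lines won't fit in memory, sort -u with its external sorting remains the better tool:

```python
def dedupe_file(in_path, out_path):
    """Stream a file line by line, writing each unique line once.

    Memory usage is proportional to the number of *unique* lines,
    not the total file size.
    """
    seen = set()
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)
```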

Deduplication in Excel and Google Sheets

If the data is in a spreadsheet, Excel's "Remove Duplicates" feature (Data tab) can deduplicate by column, and Google Sheets offers the =UNIQUE(A1:A100) formula, which returns a deduplicated list. For data that already lives in a spreadsheet, this is more convenient than exporting it for separate processing.
