How to Extract Email Addresses from Text

2026-04-09 · 5 min read

Common Scenarios for Email Extraction

Many scenarios require extracting email addresses from text: recovering contact lists from exported CRM data or old documents; extracting contact emails from raw HTML obtained through web scraping; organizing multiple contact emails mentioned in meeting notes; extracting vendor contact information from invoice or contract text; analyzing developer contact information in code files.

Regular Expression for Email Addresses

Matching email addresses with a regular expression is a classic problem in computer science: a regex fully compliant with RFC 5322 is extremely complex. For practical use, the following simplified pattern captures the vast majority of common email addresses:

# Python
import re
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(pattern, text)

# JavaScript
const pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g;
const emails = text.match(pattern) || [];

Limitations of Regular Expressions

Even a well-designed regular expression will miss some addresses and produce some false positives: new long top-level domains (.academy, .photography, etc.) are missed if the regex only allows 2-4 character TLDs; text in images or scanned PDFs cannot be extracted with regex and requires OCR first; deliberately obfuscated email addresses (like "user at example dot com") are not recognized by a standard regex; HTML-encoded email addresses (@ written as an entity such as &#64;) need to be decoded before extraction.
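The last two limitations can often be reduced by normalizing the text before running the regex. The following is a minimal sketch, not a complete solution: the helper name `extract_with_normalization` and the obfuscation spellings it handles ("at", "(at)", "[at]", "dot", "(dot)", "[dot]") are assumptions chosen for illustration, and real-world obfuscation varies widely.

```python
import html
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Assumed obfuscation spellings; extend these for your own data.
AT = r'\s*(?:\(at\)|\[at\]|\bat\b)\s*'
DOT = r'\s*(?:\(dot\)|\[dot\]|\bdot\b)\s*'
OBFUSCATED_RE = re.compile(
    rf'([A-Za-z0-9._%+-]+){AT}([A-Za-z0-9-]+(?:{DOT}[A-Za-z0-9-]+)+)',
    re.IGNORECASE,
)


def _rebuild(match):
    """Turn 'user at example dot com' into 'user@example.com'."""
    domain = re.sub(DOT, '.', match.group(2), flags=re.IGNORECASE)
    return f"{match.group(1)}@{domain}"


def extract_with_normalization(text):
    """Decode HTML entities, undo simple obfuscation, then extract."""
    decoded = html.unescape(text)            # &#64; becomes @
    restored = OBFUSCATED_RE.sub(_rebuild, decoded)
    return EMAIL_RE.findall(restored)
```

Because ordinary sentences can also contain the words "at" and "dot", this rewriting step can introduce false positives of its own, so the results are best reviewed before use.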

Cleanup After Extraction

Extracted email lists usually need further cleanup: remove duplicate addresses; normalize to lowercase (the local part of an email address is technically case-sensitive, but in practice almost no provider treats it that way); verify that the domain can receive mail (MX record check); delete obvious placeholder addresses (like [email protected], [email protected]); decide whether internal or private email addresses should be excluded.
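The deduplication, lowercasing, and placeholder steps can be sketched in a few lines. The function name `clean_emails` and the default placeholder list are assumptions for illustration; the defaults use the reserved documentation domains example.com, example.org, and example.net, and should be adapted to your own data.

```python
def clean_emails(emails, placeholder_domains=("example.com", "example.org", "example.net")):
    """Lowercase, drop placeholder domains, and deduplicate, keeping first-seen order."""
    blocked = set(placeholder_domains)
    seen = set()
    cleaned = []
    for email in emails:
        normalized = email.strip().lower()
        domain = normalized.rpartition("@")[2]   # text after the last "@"
        if domain in blocked or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned
```

For example, `clean_emails(["Bob@Site.org", "bob@site.org", "test@example.com"])` keeps a single `bob@site.org` entry.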

Command-Line Extraction

# Extract email addresses with grep (macOS/Linux)
grep -oE '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' input.txt

# Extract, deduplicate, and sort
grep -oE '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' input.txt | sort -u

Legal Compliance Considerations

Extracting email addresses from text and using them for commercial purposes (like sending marketing emails) requires attention to legal compliance. Under GDPR (EU) and CASL (Canada), marketing email generally requires the recipient's prior consent. CAN-SPAM (USA) does not require prior consent, but it does require accurate sender information and a working opt-out mechanism that is honored promptly. Scraping email addresses from websites without authorization and using them for spam can result in heavy fines or even criminal liability. Legitimate uses of email extraction are typically limited to organizing addresses from your own received emails, or processing contact data collected with explicit consent.

Email Address Validation

Extracted email addresses are not necessarily all valid. Format validation (regular expressions) can only verify that an address is well-formed, not that the mailbox actually exists. Deeper validation requires: checking whether the domain's MX record exists (DNS query); sending a verification email to the address (most reliable, but it requires the recipient's cooperation). Free email validation APIs (like Mailboxlayer, Abstract Email Validation) can automate the format and MX checks.
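The first two layers (format check, then MX lookup) can be combined as in the sketch below. The names `validate_email` and `dnspython_mx_lookup` are hypothetical, and the DNS helper assumes the third-party dnspython package is installed; the lookup is passed in as a parameter so that another resolver, or a stub for testing, can be swapped in.

```python
import re

FULL_EMAIL_RE = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')


def dnspython_mx_lookup(domain):
    """Return True if the domain has an MX record (assumes dnspython is installed)."""
    import dns.exception
    import dns.resolver
    try:
        dns.resolver.resolve(domain, "MX")   # raises if no MX answer is found
        return True
    except dns.exception.DNSException:
        return False


def validate_email(email, mx_lookup):
    """Return 'bad_format', 'no_mx', or 'ok' for a single address."""
    if not FULL_EMAIL_RE.match(email):
        return "bad_format"
    domain = email.rsplit("@", 1)[1]
    if not mx_lookup(domain):
        return "no_mx"
    return "ok"
```

A call such as `validate_email("bob@site.org", dnspython_mx_lookup)` performs a live DNS query. An "ok" result means only that the domain accepts mail, not that the specific mailbox exists.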
