CSV Text Manipulation Guide
CSV Format Basics
CSV (Comma-Separated Values) is the most widely supported plain-text format for tabular data. Each line represents one record, fields are separated by commas, and the first line is typically a header row. When a field value itself contains a comma, a line break, or a double quote, enclose the field in double quotes and write any double quote inside it as two double quotes; commas inside a quoted field are not treated as separators. The format is simple, but it has many edge cases, and different tools handle them differently.
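To make the quoting rule concrete, here is a minimal sketch using Python's standard csv module. The sample data is invented for illustration; it compares a real CSV parser with a naive split on commas.

```python
import csv
import io

# Hypothetical sample: the name in the first data record contains a comma,
# so the whole field is wrapped in double quotes.
raw = 'name,city\n"Smith, John",Berlin\nAlice,Paris\n'

# csv.reader understands quoting and keeps "Smith, John" as one field.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)

# A naive split on commas cuts the quoted field into two pieces.
naive = raw.splitlines()[1].split(',')
print(naive)
```

The parser yields two fields for the quoted record, while the naive split yields three. This is why tools that split on every comma are only safe for files without quoted fields.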
Common CSV Issues
- Encoding issues: on Windows, Excel's plain "CSV" format is saved in the system ANSI code page (GBK on Simplified Chinese systems); reading that file as UTF-8 on macOS or Linux garbles Chinese text
- BOM marker: UTF-8 CSV files exported from Excel typically begin with a BOM (bytes EF BB BF), which causes some programs to misparse the first field of the header row
- Delimiter variants: in German and some other locales, the delimiter may be semicolon (;) rather than comma
- Extra spaces: spaces before or after fields affect exact matching
- Inconsistent field counts: some rows have a different number of fields than the header row
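Several of these issues can be checked before any real processing starts. The following is a rough sketch using only the Python standard library. The function name `inspect_csv`, the delimiter heuristic (pick the candidate that appears most often in the header line), and the sample bytes are all assumptions made for illustration, and the input is assumed to be non-empty UTF-8.

```python
import csv
import io

def inspect_csv(data: bytes) -> dict:
    """Report BOM presence, a guessed delimiter, and records with a wrong field count."""
    has_bom = data.startswith(b'\xef\xbb\xbf')
    # 'utf-8-sig' strips a leading BOM if present, otherwise it behaves like UTF-8.
    text = data.decode('utf-8-sig')
    header_line = text.splitlines()[0]
    # Naive heuristic (an assumption): the most frequent candidate in the header wins.
    delimiter = max(',;\t', key=header_line.count)
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Strip extra spaces around the header names.
    header = [name.strip() for name in next(reader)]
    bad_records = []
    # Record numbers start at 1 for the header, so data records start at 2.
    for record_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad_records.append(record_no)
    return {
        'has_bom': has_bom,
        'delimiter': delimiter,
        'header': header,
        'bad_records': bad_records,
    }

# Hypothetical sample: BOM, semicolon delimiter, padded fields, one record too long.
sample = b'\xef\xbb\xbfname; email\nAlice; a@example.com\nBob; b@example.com; extra\n'
report = inspect_csv(sample)
print(report)
```

The heuristic is deliberately simple and can be fooled, for example by a header that contains commas inside quoted names, so treat the result as a hint rather than a guarantee.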
Processing CSV on the Command Line
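# View the first 10 lines of the CSV file
head -n 10 data.csv
# Sort the data rows by the second column (numeric sort, comma delimiter), keeping the header first
# Note: sort and cut split on every comma, so they are only safe when no field contains a quoted comma
(head -n 1 data.csv && tail -n +2 data.csv | sort -t',' -k2,2n)
# Extract the first and third columns
cut -d',' -f1,3 data.csv
# Count the data rows (excluding the header row)
tail -n +2 data.csv | wc -l
# Using csvkit (requires installation)
# Sort by column
csvsort -c "column_name" data.csv
# View specific columns
csvcut -c "name,email" data.csv | csvlook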
Python pandas for CSV Processing
import pandas as pd
# Read the CSV ('utf-8-sig' decodes UTF-8 and strips a leading BOM if present)
df = pd.read_csv('data.csv', encoding='utf-8-sig')
# Basic cleanup
df.columns = df.columns.str.strip()  # strip whitespace from column names
df = df.dropna(how='all')  # drop rows that are entirely empty
df = df.drop_duplicates()  # drop duplicate rows
# Clean text columns
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.strip().str.lower()
# Sort by column (newest date first)
df_sorted = df.sort_values('date', ascending=False)
# Export the cleaned, sorted data (UTF-8 with BOM so that Excel displays it correctly)
df_sorted.to_csv('output.csv', index=False, encoding='utf-8-sig')
Converting Between CSV and Other Formats
CSV frequently needs to be converted to and from other formats: CSV to JSON (for API data exchange); CSV to Excel (to add formatting and formulas, which CSV itself cannot store); CSV to SQL INSERT statements (for database import); CSV to Markdown tables (for documentation); and between TSV (tab-separated) and CSV. Online format converters or Python pandas can handle these conversions.
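For small files, a few of these conversions can also be sketched with the Python standard library alone. The helper names below (`csv_to_records`, `records_to_markdown`, `records_to_tsv`) and the sample data are hypothetical. The Markdown helper assumes at least one record and does not escape pipe characters inside cell values.

```python
import csv
import io
import json

def csv_to_records(text: str) -> list[dict]:
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def records_to_markdown(records: list[dict]) -> str:
    """Render records as a Markdown table (no escaping of '|' in values)."""
    headers = list(records[0].keys())
    lines = [
        '| ' + ' | '.join(headers) + ' |',
        '| ' + ' | '.join('---' for _ in headers) + ' |',
    ]
    for rec in records:
        lines.append('| ' + ' | '.join(rec[h] for h in headers) + ' |')
    return '\n'.join(lines)

def records_to_tsv(records: list[dict]) -> str:
    """Write records back out with a tab delimiter."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()),
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

# Hypothetical sample data.
sample = 'name,email\nAlice,alice@example.com\nBob,bob@example.com\n'
records = csv_to_records(sample)
as_json = json.dumps(records, ensure_ascii=False, indent=2)
as_markdown = records_to_markdown(records)
as_tsv = records_to_tsv(records)
print(as_json)
print(as_markdown)
print(as_tsv)
```

Every value stays a string in this sketch, so the JSON output contains `"5"` rather than `5` for numeric columns unless the values are converted first.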
CSV Considerations for Excel Users
Excel users working with CSV should note the following. Double-clicking a CSV file makes Excel guess the column types, so purely numeric fields are read as numbers: the postal code "01234" becomes the number 1234 and loses its leading zero. To avoid this, import the file through "Data Import" rather than double-clicking, and specify the data type of each column in the import wizard (Text for codes and IDs). When saving, select "CSV (UTF-8)" rather than the regular "CSV" to avoid garbled Chinese text. Finally, Excel's CSV export may handle fields containing line breaks incorrectly, so check such files after exporting.
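The same leading-zero problem appears in any program that converts every numeric-looking field to a number. The sketch below mirrors the import wizard's per-column type choice in plain Python. The `COLUMN_TYPES` mapping and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical per-column types, analogous to the choices made in an import wizard:
# postal codes stay text, quantities become integers.
COLUMN_TYPES = {'zip': str, 'qty': int}

raw = 'zip,qty\n01234,5\n90210,12\n'

rows = [
    # Columns without an entry in COLUMN_TYPES default to text.
    {name: COLUMN_TYPES.get(name, str)(value) for name, value in rec.items()}
    for rec in csv.DictReader(io.StringIO(raw))
]
print(rows)

# Converting the postal code to a number is what drops the leading zero.
as_number = int('01234')
print(as_number)
```

In pandas, the `dtype` parameter of `read_csv` serves the same purpose, for example `dtype={'zip': str}`.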
Strategies for Large CSV Files
Large CSV files that exceed available memory (GB scale) cannot be loaded in one piece. Processing strategies: use pandas chunked reading (the chunksize parameter) to process the file in blocks; use the Polars library (often more memory-efficient than pandas, with support for lazy evaluation); use DuckDB to query CSV files as if they were database tables; use command-line tools (awk, sort) for stream processing; or import the CSV data into a real database and query it there.
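The stream-processing idea can be illustrated with the standard library alone: read one record at a time and keep only running totals in memory. This is a sketch under assumptions; the function name `sum_by_key` and the column names `region` and `amount` are invented for the example.

```python
import csv
import io
from collections import defaultdict

def sum_by_key(lines, key_column: str, value_column: str) -> dict:
    """Stream records one at a time, keeping only the running totals in memory."""
    totals = defaultdict(float)
    for rec in csv.DictReader(lines):
        totals[rec[key_column]] += float(rec[value_column])
    return dict(totals)

# Hypothetical in-memory sample; a real run would pass an open file object, e.g.
#   with open('data.csv', newline='', encoding='utf-8-sig') as f:
#       totals = sum_by_key(f, 'region', 'amount')
sample = io.StringIO('region,amount\nnorth,10.5\nsouth,4\nnorth,2.5\n')
totals = sum_by_key(sample, 'region', 'amount')
print(totals)
```

Memory use depends on the number of distinct keys rather than on the file size. Chunked reading in pandas follows the same pattern: passing `chunksize` to `read_csv` returns an iterator of DataFrames, and the partial results from each chunk are combined at the end.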