Power Query — AI-Driven Data Cleaning Pipeline
Chapter 10: Power Query Data Cleaning
If you've ever spent hours manually merging monthly Excel files, hand-deleting junk rows, or writing nested SUBSTITUTE formulas to strip unwanted characters — Power Query is about to change your life. It's Excel's built-in data transformation engine: you configure the cleaning steps once, and every future update takes one click to refresh.
What Is Power Query
Power Query (built into Excel 2016+, available as an add-in for 2010/2013) records every transformation step you apply to data. When the source data updates, clicking Refresh All re-runs all those steps automatically.
The core value proposition: convert one-time manual labor into a repeatable automated pipeline.
❌ Traditional Approach
Open each of 12 monthly files individually
Copy-paste into master sheet, fix formatting
Manually delete subtotal rows, merged cells, blank rows
Fix date/number formats with formulas
Repeat every month...
Time cost: 3–5 hours/month
✅ Power Query Approach
First time: set up data source and cleaning steps (~45 min)
Every subsequent month: drop new file in folder, click Refresh All
Wait 10–30 seconds for everything to update
Go get coffee
**After setup:
Access: Data tab → Get Data → choose source type. Or click "From Table/Range" to load current sheet data into the Power Query Editor.
Getting Started: Interface & Loading Data
The Power Query Editor has four key areas:
- Left: Queries pane — all queries in the workbook
- Center: Data preview — live preview of data after each applied step
- Right: Applied Steps — the full transformation history, fully editable
- Top: Ribbon — transformation operations organized by category
✅ Tip
Key insight: Every operation becomes a reversible step. You can delete, reorder, or insert steps at any point — nothing is permanent until you choose to load it to the worksheet.
6 Core Data Cleaning Operations
Case 1: Remove Duplicate Rows
❌ Old way: Sort then manually scan, or use COUNTIF + filter + delete ✅ Power Query: Home → Remove Rows → Remove Duplicates
Select specific columns first to deduplicate by those columns only (other columns may differ).
No column selection = deduplicate on all columns (exact row match).
Time saved: 10,000 rows manually ≈ 30 min. Power Query ≈ 5 seconds. Future refreshes handle new duplicates automatically.
Case 2: Split Column (Name + Phone Combined → Two Columns)
❌ Old way: LEFT/FIND/MID formula combination, tedious to write and test ✅ Power Query: Transform → Split Column → By Delimiter
Enter the delimiter (e.g., "-"), choose "Each occurrence" or "Left/right-most".
Other split options: by number of characters, by lowercase-to-uppercase transition, by digit-to-letter boundary.
Time saved: Writing and debugging FIND+MID formulas ≈ 15–30 min. Power Query split ≈ 30 seconds.
Case 3: Fill Down Empty Cells (Unmerge & Fill)
Scene: ERP export with merged department cells — only the first row per department has a value
Right-click the column → Fill → Down
Each null is replaced with the nearest non-null value above it. Equivalent to "unmerge all and fill." Fill Up also available.
Time saved: Manual fill on 100 merged-cell rows ≈ 10 min. Power Query ≈ 3 seconds, reusable on every refresh.
Case 4: Replace & Clean (Remove Units Like "$", ",", "元")
Scene: Amount column has "$12,345.00" — needs to become the number 12345 1Select column → Transform → Replace Values → find "$", replace with nothing 2Replace Values again → find ",", replace with nothing 3Click type icon on column header → change to Decimal Number
Each replacement is a separate reversible step.
Also useful: Transform → Format → Trim (remove leading/trailing spaces), Clean (remove non-printable characters).
Case 5: Data Type Conversion (Text → Date → Number)
Click the type icon (ABC / 123 / calendar) in any column header to change its type
Power Query enforces strict typing — set types early in your step sequence.
Common problems:
- "20260425" as text → parse with Date.FromText() or convert via integer first
- "1,234.56" with thousand separator → replace comma first, then convert to decimal
- Excel date serial number (e.g., 46000) → select Date type directly
Case 6: Conditional Column (IF Logic Without Writing Formulas)
❌ Old way: =IF(E2>10000,"Large",IF(E2>5000,"Medium","Small")) nested formula ✅ Power Query: Add Column → Conditional Column
Set the new column name, add IF/ELSE IF conditions with dropdowns:
Order Amount > 10000 → "Large Customer"
Order Amount > 5000 → "Medium Customer"
Else → "Small Customer"
Visual configuration — logic is immediately readable, no formula debugging needed.
Time saved: Writing and testing nested IF ≈ 15 min. Power Query conditional column ≈ 2 minutes.
AI Case: Describe Messy Data, Get a Cleaning Plan
I have a CSV exported from our ERP with these data quality issues:
1. Amount column shows "¥12,345.00" — need plain number 12345
2. Supplier column has blanks that should be filled from the row above (merged cell export problem)
3. Date column shows "April 25, 2026" (long English format) — need standard Excel date
4. Notes column has newline characters (\n) from the system — replace with spaces
5. Duplicate order rows exist (same order number) — keep only the row with the highest amount
Please give me the Power Query steps to fix all 5 issues, in order. Specify which menu to use for each, and flag any steps that require M language code.
Combining Tables: Power Query's Killer Feature
Append Queries — Stack Tables Vertically
Scenario: 12 monthly sales files with identical structure → one annual table.
1Load each query (or use Folder method below) 2Home → Append Queries → Three or more tables → select all monthly queries
Result: all rows stacked vertically. Column matching is by name — mismatched column names produce null values, so standardize headers first.
Merge Queries — Join Tables Horizontally (VLOOKUP-Style)
Scenario: Orders table has Product ID; Products table has name, category, cost price. Bring product details into orders.
1In the Orders query: Home → Merge Queries 2Select Products query, match columns (Orders.ProductID = Products.ProductID) 3Join kind: Left Outer (keep all order rows, fill null where no product match) 4Expand the merged column — select which product fields to bring in
Join types: Left Outer (most common, like VLOOKUP) / Inner (only matching rows) / Full Outer (all rows from both).
Advantages over VLOOKUP: handles large data, supports multi-column keys, unaffected by column position changes.
Folder-Based Batch Combine — The Game Changer
1Data → Get Data → From File → From Folder → select the folder containing all monthly files 2Power Query shows all files — click "Combine & Transform Data" 3Choose sample file (Power Query learns the structure), select the worksheet to load 4Power Query applies identical transformations to every file and combines them. A "Source.Name" column tracks which file each row came from. 5Continue cleaning in the editor, then Close & Load to worksheet
✅ Tip
Next month: Drop the new file into that folder. Back in Excel, click Refresh All. Done in under 30 seconds. This is the core value of Power Query.
5 Multi-Table Cases (Summary)
| Case | Approach | Time Savings |
|---|---|---|
| Merge 12 monthly sales files | Folder combine → extract month from filename | 2 hr → 30 sec/month after setup |
| Order table + product info | Merge Query (Left Outer on ProductID) | Replaces VLOOKUP, handles 100K+ rows faster |
| Multi-store daily reports | Folder combine, use Source.Name for store ID | 10 stores × 1 hr/day → 1 min/day |
| Weekly auto-refresh | Set refresh interval in query properties | Eliminates manual re-pull entirely |
| Design complex merge logic | AI consultation (see prompt below) | Saves planning time + avoids common pitfalls |
I need to combine 4 data sources in Excel with Power Query to build an annual sales analysis dataset:
Source 1: Monthly sales detail (12 files, same format) — Order ID, Date, Product ID, Qty, Unit Price, Customer ID, Sales Rep ID
Source 2: Product table (1 file, ~500 rows) — Product ID, Name, Category, Cost, Supplier
Source 3: Customer table (1 file, ~1000 rows) — Customer ID, Name, City, Tier
Source 4: Employee table (1 file, ~50 rows) — Rep ID, Name, Region, Start Date
Goal: one wide table for full-year analysis by product / customer / region / sales rep.
Please design: 1) the recommended merge order and method; 2) key steps; 3) potential problems to watch for and how to prevent them.
M Language Basics
Every GUI operation generates M code behind the scenes. View it: View → Advanced Editor. You don't need to write M manually — but knowing it exists lets you leverage AI to handle cases the UI can't cover.
3 AI-Generated M Language Cases
In Power Query, I combined files from a folder. The Source.Name column has filenames like "2026-03-sales-data.xlsx". I need a new column "YearMonth" containing just "2026-03". Give me the M formula for a custom column.
I want my Power Query filter to automatically keep only the last 30 days of data on every refresh, rather than a fixed date. The UI date filter can't do dynamic dates. Give me the M code for a filter step.
My date column has values like "Apr 25, 2026" (English month abbreviation). Power Query won't auto-convert this to a date type. Give me the M code to convert it properly.
Comprehensive Case: Annual Sales Auto-Aggregation System
Build a fully automated reporting pipeline using everything in this chapter.
| Query | Source | Output |
|---|---|---|
| Sales Detail | Folder (12 monthly files) | Clean combined annual data, load to sheet |
| Products | Products Excel file | Connection only (no sheet output) |
| Customers | Customers Excel file | Connection only |
| Sales Wide | Sales Detail + merge Products + merge Customers | Full wide table for pivot/analysis, load to sheet |
✅ Tip
First-time setup: 2–4 hours (including testing and adjustment)
Monthly maintenance: ~2 minutes (drop file in folder + click refresh + wait)
Annual time savings: conservatively 20–40 hours
Power Query Cheat Sheet
| Task | Location | Action |
|---|---|---|
| Load Excel file | Data → Get Data | From File → From Workbook |
| Load folder (batch) | Data → Get Data | From File → From Folder |
| Remove duplicate rows | Home | Remove Rows → Remove Duplicates |
| Remove blank rows | Home | Remove Rows → Remove Blank Rows |
| Split column | Transform | Split Column → By Delimiter or By Characters |
| Fill down blanks | Right-click column | Fill → Down |
| Replace values | Transform / right-click | Replace Values |
| Trim whitespace | Transform → Format | Trim |
| Change column type | Click type icon in header | Select target type |
| Add conditional column | Add Column | Conditional Column |
| Add custom column (M) | Add Column | Custom Column, enter M formula |
| Stack tables vertically | Home | Append Queries |
| Join tables horizontally | Home | Merge Queries, choose join type |
| Refresh all data | Data tab | Refresh All |
| View M code | View | Advanced Editor |
| Set auto-refresh | Right-click query → Properties | Set refresh interval |
Previous ← Chapter 9: Conditional Formatting Next Chapter 11: Multi-Sheet Operations →