窗口函数实战
MySQL 窗口函数完全指南
窗口函数(Window Functions)是 MySQL 8.0 引入的最重要特性之一。与普通聚合函数不同,窗口函数在计算时不折叠行 -- 每行都保留在结果集中,同时能访问与当前行相关的一组行(即"窗口")进行计算。这使得排名、累计求和、移动平均、同比环比等以往需要复杂子查询才能实现的分析任务变得简洁高效。
**版本要求:**窗口函数需要 MySQL 8.0 或更高版本。MariaDB 10.2+ 也支持窗口函数,但语法和行为细节略有差异。本章所有示例基于 MySQL 8.0/8.4。
本章将系统讲解所有窗口函数的语法与语义,通过 20+ 个真实业务场景演示如何使用窗口函数解决实际问题,并深入分析其执行原理和性能优化策略。
窗口函数语法结构
窗口函数的完整语法由三部分组成:函数调用、OVER 子句、以及可选的命名窗口。
function_name([args]) OVER ( [PARTITION BY partition_expression, ...] [ORDER BY sort_expression [ASC|DESC], ...] [frame_clause] )
-- frame_clause: {ROWS | RANGE | GROUPS} BETWEEN frame_start AND frame_end
-- frame_start / frame_end: UNBOUNDED PRECEDING | N PRECEDING | CURRENT ROW | N FOLLOWING | UNBOUNDED FOLLOWING
核心概念
- PARTITION BY -- 将结果集分成若干分区,窗口函数在每个分区内独立计算,类似 GROUP BY 但不折叠行
- ORDER BY -- 定义分区内行的排序,影响排名函数的结果和帧的范围
- 帧子句(Frame Clause) -- 定义当前行能"看到"的行范围,控制聚合窗口的计算边界
命名窗口(WINDOW 子句)
当多个窗口函数使用相同的窗口定义时,可以用 WINDOW 子句命名并复用:
SELECT
employee_id,
department,
salary,
ROW_NUMBER() OVER w AS rn,
RANK() OVER w AS rnk,
DENSE_RANK() OVER w AS dense_rnk
FROM employees
WINDOW w AS (PARTITION BY department ORDER BY salary DESC);
排名函数
ROW_NUMBER()
为分区内的每一行分配唯一连续编号(1, 2, 3, ...),即使值相同也不会产生并列。
语法
ROW_NUMBER() OVER (
[PARTITION BY col, ...]
ORDER BY col [ASC|DESC], ...
)
示例数据
| id | department | name | salary |
|---|---|---|---|
| id | department | name | salary |
| 1 | Engineering | Alice | 12000 |
| 2 | Engineering | Bob | 11000 |
| 3 | Engineering | Charlie | 12000 |
| 4 | Sales | Diana | 9000 |
| 5 | Sales | Eve | 9500 |
查询与结果
SELECT
name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees;
| name | department | salary | rn |
|---|---|---|---|
| name | department | salary | rn |
| Alice | Engineering | 12000 | 1 |
| Charlie | Engineering | 12000 | 2 |
| Bob | Engineering | 11000 | 3 |
| Eve | Sales | 9500 | 1 |
| Diana | Sales | 9000 | 2 |
注意:Alice 和 Charlie 的 salary 相同,但 ROW_NUMBER() 仍然分配了不同的编号(1 和 2)。具体顺序在相同值时不确定,取决于存储引擎返回行的顺序。如果需要确定性排序,应在 ORDER BY 中增加额外列(如主键)。
RANK()
与 ROW_NUMBER() 类似,但允许并列。值相同的行获得相同排名,下一个不同值的排名跳过并列数。
SELECT
name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
FROM employees;
| name | department | salary | rnk |
|---|---|---|---|
| name | department | salary | rnk |
| Alice | Engineering | 12000 | 1 |
| Charlie | Engineering | 12000 | 1 |
| Bob | Engineering | 11000 | 3 |
| Eve | Sales | 9500 | 1 |
| Diana | Sales | 9000 | 2 |
注意 Bob 的排名是 3(不是 2),因为有两个人并列第 1 名。
DENSE_RANK()
与 RANK() 相同允许并列,但下一个不同值的排名不跳过,保持连续。
SELECT
name, department, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees;
| name | department | salary | dense_rnk |
|---|---|---|---|
| name | department | salary | dense_rnk |
| Alice | Engineering | 12000 | 1 |
| Charlie | Engineering | 12000 | 1 |
| Bob | Engineering | 11000 | 2 |
| Eve | Sales | 9500 | 1 |
| Diana | Sales | 9000 | 2 |
Bob 现在是 2(紧跟并列第 1 名之后)。
ROW_NUMBER vs RANK vs DENSE_RANK 对比
| 函数 | 并列 | 跳号 | 适用场景 |
|---|---|---|---|
| 函数 | 并列 | 跳号 | 适用场景 |
| ROW_NUMBER() | 不允许 | 无 | 分页、去重、唯一编号 |
| RANK() | 允许 | 跳过 | 竞赛排名、考试排名 |
| DENSE_RANK() | 允许 | 不跳过 | 工资等级、连续排名 |
NTILE(n)
将分区内的行均匀分成 n 个桶,并返回桶编号(1 到 n)。如果行数不能被 n 整除,前面的桶多分一行。
SELECT
name, salary,
NTILE(4) OVER (ORDER BY salary DESC) AS quartile
FROM employees;
| name | salary | quartile |
|---|---|---|
| name | salary | quartile |
| Alice | 12000 | 1 |
| Charlie | 12000 | 1 |
| Bob | 11000 | 2 |
| Eve | 9500 | 3 |
| Diana | 9000 | 4 |
**典型用途:**计算百分位(NTILE(100))、四分位(NTILE(4))、将用户分成 A/B 测试组。
取值函数
LAG(expr, offset, default)
访问当前行之前第 offset 行的值。offset 默认 1,default 默认 NULL。
语法
LAG(expression [, offset [, default_value]])
OVER (
[PARTITION BY col, ...]
ORDER BY col [ASC|DESC], ...
)
示例:月度销售额环比
SELECT
month,
revenue,
LAG(revenue, 1, 0) OVER (ORDER BY month) AS prev_month,
ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY month))
/ LAG(revenue, 1) OVER (ORDER BY month) * 100, 1) AS mom_pct
FROM monthly_sales;
| month | revenue | prev_month | mom_pct |
|---|---|---|---|
| month | revenue | prev_month | mom_pct |
| 2024-01 | 100000 | 0 | NULL |
| 2024-02 | 120000 | 100000 | 20.0 |
| 2024-03 | 115000 | 120000 | -4.2 |
| 2024-04 | 130000 | 115000 | 13.0 |
LEAD(expr, offset, default)
访问当前行之后第 offset 行的值。与 LAG 方向相反。
SELECT
order_id,
customer_id,
order_date,
LEAD(order_date, 1) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS next_order_date,
DATEDIFF(
LEAD(order_date, 1) OVER (PARTITION BY customer_id ORDER BY order_date),
order_date
) AS days_to_next
FROM orders;
| order_id | customer_id | order_date | next_order_date | days_to_next |
|---|---|---|---|---|
| order_id | customer_id | order_date | next_order_date | days_to_next |
| 101 | 1 | 2024-01-05 | 2024-01-20 | 15 |
| 105 | 1 | 2024-01-20 | 2024-03-10 | 50 |
| 112 | 1 | 2024-03-10 | NULL | NULL |
FIRST_VALUE(expr)
返回窗口帧中第一行的值。
SELECT
name, department, salary,
FIRST_VALUE(name) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS top_earner
FROM employees;
| name | department | salary | top_earner |
|---|---|---|---|
| name | department | salary | top_earner |
| Alice | Engineering | 12000 | Alice |
| Charlie | Engineering | 12000 | Alice |
| Bob | Engineering | 11000 | Alice |
| Eve | Sales | 9500 | Eve |
| Diana | Sales | 9000 | Eve |
LAST_VALUE(expr)
返回窗口帧中最后一行的值。
**常见陷阱:**默认帧是
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,这意味着 LAST_VALUE() 只会返回到当前行为止的最后一个值(即当前行自身)。要获得分区内真正的最后一行,必须显式指定帧为ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING。
-- 错误用法(常见错误):
SELECT name, salary,
LAST_VALUE(name) OVER (
PARTITION BY department ORDER BY salary DESC
) AS lowest_earner -- 永远返回当前行的 name!
FROM employees;
-- 正确用法:
SELECT name, salary,
LAST_VALUE(name) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS lowest_earner -- 返回分区内 salary 最低的人
FROM employees;
NTH_VALUE(expr, n)
返回窗口帧中第 n 行的值。n 从 1 开始。
SELECT
name, department, salary,
NTH_VALUE(name, 2) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS second_highest
FROM employees;
| name | department | salary | second_highest |
|---|---|---|---|
| name | department | salary | second_highest |
| Alice | Engineering | 12000 | Charlie |
| Charlie | Engineering | 12000 | Charlie |
| Bob | Engineering | 11000 | Charlie |
| Eve | Sales | 9500 | Diana |
| Diana | Sales | 9000 | Diana |
取值函数对比
| 函数 | 方向 | 用途 | 注意事项 |
|---|---|---|---|
| 函数 | 方向 | 用途 | 注意事项 |
| LAG | 向前看(前一行) | 环比、差值计算 | 分区首行返回 default |
| LEAD | 向后看(后一行) | 预测下一次、间隔计算 | 分区末行返回 default |
| FIRST_VALUE | 帧第一行 | 分组内最大/最小 | 通常不需改帧 |
| LAST_VALUE | 帧最后一行 | 分组内最大/最小 | 必须显式设帧! |
| NTH_VALUE | 帧第 n 行 | 取第 2 名、第 3 名 | 需要显式设帧 |
聚合窗口函数
所有标准聚合函数都可以配合 OVER() 作为窗口函数使用。与普通聚合不同,窗口聚合不折叠行。
SUM() OVER()
计算窗口帧内的累计总和。
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total,
SUM(amount) OVER () AS grand_total
FROM orders;
| order_date | amount | running_total | grand_total |
|---|---|---|---|
| order_date | amount | running_total | grand_total |
| 2024-01-01 | 100 | 100 | 850 |
| 2024-01-02 | 200 | 300 | 850 |
| 2024-01-03 | 150 | 450 | 850 |
| 2024-01-04 | 250 | 700 | 850 |
| 2024-01-05 | 150 | 850 | 850 |
AVG() OVER()
计算窗口帧内的平均值。
-- 每个员工的薪资与部门平均薪资对比
SELECT
name, department, salary,
ROUND(AVG(salary) OVER (PARTITION BY department), 0) AS dept_avg,
salary - ROUND(AVG(salary) OVER (PARTITION BY department), 0) AS diff_from_avg
FROM employees;
| name | department | salary | dept_avg | diff_from_avg |
|---|---|---|---|---|
| name | department | salary | dept_avg | diff_from_avg |
| Alice | Engineering | 12000 | 11667 | 333 |
| Charlie | Engineering | 12000 | 11667 | 333 |
| Bob | Engineering | 11000 | 11667 | -667 |
| Eve | Sales | 9500 | 9250 | 250 |
| Diana | Sales | 9000 | 9250 | -250 |
COUNT() OVER()
-- 每个订单占部门订单总数的比例
SELECT
order_id, department,
COUNT(*) OVER (PARTITION BY department) AS dept_orders,
ROUND(1.0 / COUNT(*) OVER (PARTITION BY department) * 100, 1) AS pct
FROM orders;
MIN() / MAX() OVER()
-- 与分区内极值对比
SELECT
name, department, salary,
MIN(salary) OVER (PARTITION BY department) AS dept_min,
MAX(salary) OVER (PARTITION BY department) AS dept_max,
salary - MIN(salary) OVER (PARTITION BY department) AS above_min
FROM employees;
PARTITION BY 详解
PARTITION BY 子句将结果集分成多个分区,窗口函数在每个分区内独立计算。它类似 GROUP BY,但不折叠行。
多列分区
-- 按年份和部门分区,计算排名
SELECT
YEAR(hire_date) AS hire_year,
department,
name,
salary,
RANK() OVER (
PARTITION BY YEAR(hire_date), department
ORDER BY salary DESC
) AS rank_in_year_dept
FROM employees;
PARTITION BY 表达式
-- 按薪资区间分区
SELECT
name, salary,
CASE
WHEN salary >= 10000 THEN 'High'
WHEN salary >= 7000 THEN 'Mid'
ELSE 'Low'
END AS tier,
ROW_NUMBER() OVER (
PARTITION BY CASE
WHEN salary >= 10000 THEN 'High'
WHEN salary >= 7000 THEN 'Mid'
ELSE 'Low'
END
ORDER BY salary DESC
) AS rank_in_tier
FROM employees;
无 PARTITION BY(全局窗口)
省略 PARTITION BY 时,整个结果集被视为一个分区:
-- 全局排名
SELECT name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS global_rank
FROM employees;
帧子句(Frame Clause)
帧子句定义了窗口函数在分区内能"看到"的行范围。只有聚合窗口函数(SUM, AVG, COUNT, MIN, MAX)和取值函数(FIRST_VALUE, LAST_VALUE, NTH_VALUE)会受帧子句影响。排名函数(ROW_NUMBER, RANK, DENSE_RANK, NTILE)和 LAG/LEAD 忽略帧子句。
ROWS 模式
基于物理行数定义帧范围。
-- 3 日移动平均
SELECT
order_date,
amount,
ROUND(AVG(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 2) AS moving_avg_3
FROM daily_sales;
| order_date | amount | moving_avg_3 |
|---|---|---|
| order_date | amount | moving_avg_3 |
| 2024-01-01 | 100 | 100.00 |
| 2024-01-02 | 200 | 150.00 |
| 2024-01-03 | 150 | 150.00 |
| 2024-01-04 | 300 | 216.67 |
| 2024-01-05 | 250 | 233.33 |
RANGE 模式
基于值范围定义帧。与 ROWS 不同,RANGE 考虑值的大小关系。
-- 累计到当前值的总和(值相同的行全部包含)
SELECT
salary,
name,
SUM(salary) OVER (
ORDER BY salary
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_salary
FROM employees;
**ROWS vs RANGE 的区别:**当 ORDER BY 列有重复值时,RANGE 会将值相同的行视为同一组全部包含,而 ROWS 严格按物理行数计算。大多数情况下建议使用 ROWS,因为行为更可预测且性能更好。
GROUPS 模式(MySQL 8.0.2+)
基于**peer group(同值组)**定义帧。每个同值组包含 ORDER BY 值相同的所有行。
-- 当前组 + 前后各 1 组
SELECT
department, salary,
SUM(salary) OVER (
ORDER BY salary
GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS group_sum
FROM employees;
常用帧子句速查
| 帧定义 | 含义 | 典型用途 |
|---|---|---|
| 帧定义 | 含义 | 典型用途 |
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
从分区开头到当前行 | 累计求和 |
ROWS BETWEEN N PRECEDING AND CURRENT ROW |
前 N 行到当前行 | 移动平均 |
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING |
当前行到分区末尾 | 剩余求和 |
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
整个分区 | LAST_VALUE、NTH_VALUE |
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING |
前后各 1 行 + 当前行 | 平滑处理 |
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW |
当前日期前 7 天内 | 7 日滚动聚合 |
默认帧行为
当显式省略帧子句时,MySQL 使用以下默认值:
- 有 ORDER BY:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - 无 ORDER BY:
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING(即整个分区)
20+ 实战案例
1 连续登录天数
找出每个用户的最大连续登录天数。核心思路:用 ROW_NUMBER() 生成序号,然后用日期减去序号得到"组标识"。
WITH login_groups AS (
SELECT
user_id,
login_date,
login_date - INTERVAL ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) DAY AS grp
FROM (
SELECT DISTINCT user_id, DATE(login_time) AS login_date
FROM user_logins
) t
)
SELECT
user_id,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS consecutive_days
FROM login_groups
GROUP BY user_id, grp
ORDER BY consecutive_days DESC;
**原理:**对于连续日期序列,date - ROW_NUMBER() 的结果是一个常数。例如 1月1日(rn=1)、1月2日(rn=2)、1月3日(rn=3) 减去后都得到 12月31日。一旦有中断,差值就不同了,形成新的分组。
2 累计求和(Running Total)
SELECT
transaction_date,
amount,
SUM(amount) OVER (
ORDER BY transaction_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM transactions
WHERE account_id = 1001
ORDER BY transaction_date;
3 7 日移动平均
SELECT
sale_date,
daily_revenue,
ROUND(AVG(daily_revenue) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2) AS ma_7
FROM daily_sales;
注意用 ROWS BETWEEN 6 PRECEDING AND CURRENT ROW 表示包含当前行在内的 7 行。
4 分组 Top-N(每部门薪资前 3)
SELECT * FROM (
SELECT
department,
name,
salary,
ROW_NUMBER() OVER (
PARTITION BY department ORDER BY salary DESC
) AS rn
FROM employees
) ranked
WHERE rn <= 3;
如果允许并列,用 DENSE_RANK() 替代 ROW_NUMBER()。
5 同比(Year-over-Year)
SELECT
YEAR(order_date) AS yr,
MONTH(order_date) AS mo,
SUM(amount) AS monthly_revenue,
LAG(SUM(amount), 12) OVER (ORDER BY YEAR(order_date), MONTH(order_date)) AS same_month_last_year,
ROUND(
(SUM(amount) - LAG(SUM(amount), 12) OVER (ORDER BY YEAR(order_date), MONTH(order_date)))
/ LAG(SUM(amount), 12) OVER (ORDER BY YEAR(order_date), MONTH(order_date)) * 100, 1
) AS yoy_pct
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
ORDER BY yr, mo;
6 环比(Month-over-Month)
WITH monthly AS (
SELECT
DATE_FORMAT(order_date, '%Y-%m') AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
)
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY month))
/ LAG(revenue, 1) OVER (ORDER BY month) * 100, 1) AS mom_pct
FROM monthly;
7 百分位排名(Percentile Rank)
SELECT
name,
salary,
ROUND(PERCENT_RANK() OVER (ORDER BY salary) * 100, 1) AS percentile,
CUME_DIST() OVER (ORDER BY salary) AS cume_dist
FROM employees;
PERCENT_RANK():(rank - 1) / (total_rows - 1),范围 [0, 1]。CUME_DIST():rank / total_rows,范围 (0, 1]。
8 间隙检测(Gap Detection)
查找 ID 序列中的间隙。
SELECT
id AS gap_start,
next_id - 1 AS gap_end,
next_id - id - 1 AS gap_size
FROM (
SELECT
id,
LEAD(id, 1) OVER (ORDER BY id) AS next_id
FROM orders
) t
WHERE next_id - id > 1
ORDER BY gap_start;
9 岛屿检测(Island Detection)
在状态序列中找出连续相同状态的"岛屿"。
WITH numbered AS (
SELECT
server_id,
check_time,
status,
ROW_NUMBER() OVER (PARTITION BY server_id ORDER BY check_time) -
ROW_NUMBER() OVER (PARTITION BY server_id, status ORDER BY check_time) AS island_id
FROM server_status
)
SELECT
server_id,
status,
MIN(check_time) AS island_start,
MAX(check_time) AS island_end,
COUNT(*) AS duration_checks
FROM numbered
GROUP BY server_id, status, island_id
ORDER BY server_id, island_start;
**原理:**对全体行编号与按状态分组编号的差值,在连续相同状态的行中保持恒定,一旦状态切换差值就变化。
10 去重保留最新记录
DELETE t1 FROM user_addresses t1
INNER JOIN (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY updated_at DESC
) AS rn
FROM user_addresses
) t2 ON t1.id = t2.id
WHERE t2.rn > 1;
每个 user_id 只保留 updated_at 最新的一条地址记录。
11 累计占比(Pareto 分析)
SELECT
product_name,
revenue,
SUM(revenue) OVER (ORDER BY revenue DESC) AS cumulative_revenue,
ROUND(SUM(revenue) OVER (ORDER BY revenue DESC)
/ SUM(revenue) OVER () * 100, 1) AS cumulative_pct
FROM product_sales
ORDER BY revenue DESC;
快速识别贡献了 80% 收入的前 20% 产品。
12 分组内占比
SELECT
department,
name,
salary,
ROUND(salary / SUM(salary) OVER (PARTITION BY department) * 100, 1) AS pct_of_dept
FROM employees;
13 会话化(Sessionization)
将用户行为日志按 30 分钟超时切分为会话。
WITH events_with_gap AS (
SELECT
user_id,
event_time,
TIMESTAMPDIFF(MINUTE,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
event_time
) AS gap_minutes
FROM user_events
),
sessions AS (
SELECT
user_id,
event_time,
SUM(CASE WHEN gap_minutes > 30 OR gap_minutes IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM events_with_gap
)
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS events_count,
TIMESTAMPDIFF(MINUTE, MIN(event_time), MAX(event_time)) AS duration_min
FROM sessions
GROUP BY user_id, session_id;
14 漏斗分析
WITH funnel AS (
SELECT
user_id,
event_name,
event_time,
ROW_NUMBER() OVER (
PARTITION BY user_id, event_name ORDER BY event_time
) AS event_seq
FROM user_events
WHERE event_name IN ('page_view', 'add_to_cart', 'checkout', 'payment')
)
SELECT
event_name,
COUNT(DISTINCT user_id) AS users,
ROUND(COUNT(DISTINCT user_id) /
FIRST_VALUE(COUNT(DISTINCT user_id)) OVER (ORDER BY
FIELD(event_name, 'page_view', 'add_to_cart', 'checkout', 'payment')
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) * 100, 1) AS conversion_pct
FROM funnel
WHERE event_seq = 1
GROUP BY event_name
ORDER BY FIELD(event_name, 'page_view', 'add_to_cart', 'checkout', 'payment');
15 中位数计算
WITH ordered AS (
SELECT
salary,
ROW_NUMBER() OVER (ORDER BY salary) AS rn,
COUNT(*) OVER () AS total
FROM employees
)
SELECT AVG(salary) AS median_salary
FROM ordered
WHERE rn IN (FLOOR((total + 1) / 2), CEIL((total + 1) / 2));
16 前后行差值(Delta)
SELECT
reading_time,
meter_value,
meter_value - LAG(meter_value, 1, meter_value) OVER (ORDER BY reading_time) AS delta
FROM electricity_readings
WHERE meter_id = 'M001'
ORDER BY reading_time;
17 排行榜变动追踪
WITH weekly_rank AS (
SELECT
week_num,
player_name,
score,
RANK() OVER (PARTITION BY week_num ORDER BY score DESC) AS rnk
FROM weekly_scores
)
SELECT
w.week_num,
w.player_name,
w.score,
w.rnk AS current_rank,
LAG(w.rnk) OVER (PARTITION BY w.player_name ORDER BY w.week_num) AS prev_rank,
LAG(w.rnk) OVER (PARTITION BY w.player_name ORDER BY w.week_num) - w.rnk AS rank_change
FROM weekly_rank w
ORDER BY w.week_num, w.rnk;
18 首次购买 vs 复购标记
SELECT
order_id,
customer_id,
order_date,
amount,
CASE
WHEN ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) = 1
THEN 'First Purchase'
ELSE 'Repeat Purchase'
END AS purchase_type
FROM orders;
19 库存变动与余额
SELECT
transaction_time,
product_id,
CASE type WHEN 'IN' THEN qty ELSE -qty END AS change,
SUM(CASE type WHEN 'IN' THEN qty ELSE -qty END) OVER (
PARTITION BY product_id
ORDER BY transaction_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS balance
FROM inventory_transactions
ORDER BY product_id, transaction_time;
20 员工薪资与同级对比
SELECT
name,
department,
job_level,
salary,
AVG(salary) OVER (PARTITION BY job_level) AS level_avg,
MIN(salary) OVER (PARTITION BY job_level) AS level_min,
MAX(salary) OVER (PARTITION BY job_level) AS level_max,
NTILE(4) OVER (PARTITION BY job_level ORDER BY salary) AS salary_quartile,
ROUND(PERCENT_RANK() OVER (PARTITION BY job_level ORDER BY salary) * 100, 0) AS percentile
FROM employees;
21 页面停留时间计算
SELECT
user_id,
page_url,
view_time,
LEAD(view_time) OVER (PARTITION BY user_id ORDER BY view_time) AS next_view_time,
TIMESTAMPDIFF(SECOND, view_time,
LEAD(view_time) OVER (PARTITION BY user_id ORDER BY view_time)
) AS stay_seconds
FROM page_views
WHERE session_id = 'abc123'
ORDER BY view_time;
22 分区内行间移动标准差
SELECT
trade_date,
close_price,
ROUND(AVG(close_price) OVER w, 2) AS ma_20,
ROUND(STDDEV_POP(close_price) OVER w, 2) AS std_20,
ROUND(AVG(close_price) OVER w + 2 * STDDEV_POP(close_price) OVER w, 2) AS upper_band,
ROUND(AVG(close_price) OVER w - 2 * STDDEV_POP(close_price) OVER w, 2) AS lower_band
FROM stock_prices
WHERE ticker = 'AAPL'
WINDOW w AS (ORDER BY trade_date ROWS BETWEEN 19 PRECEDING AND CURRENT ROW)
ORDER BY trade_date;
布林带(Bollinger Bands)计算:20 日均线 +/- 2 倍标准差。
性能分析
窗口函数 vs 子查询 vs 临时表
窗口函数
- 一次扫描完成多个计算
- 优化器可合并多个 OVER 子句
- 代码简洁,可维护性高
- 排序是主要开销
相关子查询
- 每行执行一次子查询(O(n^2))
- 小数据量可接受,大表极慢
- 可利用索引加速
- 兼容 MySQL 5.7
性能对比实测
-- 场景:100 万行订单表,计算每用户累计订单金额
-- 方案 1:窗口函数(~0.8s)
SELECT order_id, user_id, amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total
FROM orders;
-- 方案 2:相关子查询(~45s)
SELECT o1.order_id, o1.user_id, o1.amount,
(SELECT SUM(o2.amount) FROM orders o2
WHERE o2.user_id = o1.user_id AND o2.order_date <= o1.order_date) AS running_total
FROM orders o1;
-- 方案 3:临时表 + 变量(MySQL 5.7 兼容,~1.2s,但行为不可靠)
SET @running := 0, @prev_user := '';
SELECT order_id, user_id, amount,
@running := IF(@prev_user = user_id, @running + amount, amount) AS running_total,
@prev_user := user_id
FROM orders ORDER BY user_id, order_date;
**MySQL 8.0 中用户变量赋值的行为未定义:**MySQL 优化器可能以任意顺序求值 SELECT 列表中的表达式。方案 3 在 MySQL 8.0 中不可靠,应使用窗口函数替代。
EXPLAIN 分析窗口函数
EXPLAIN FORMAT=TREE
SELECT department, name, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
FROM employees;
/* 输出示例:
-> Window aggregate with buffering: rank() OVER (
PARTITION BY department ORDER BY salary DESC)
-> Sort: employees.department, employees.salary DESC
-> Table scan on employees (cost=10.25 rows=100)
*/
关键观察点:
- **Sort 节点:**窗口函数几乎总是需要排序。如果有合适索引,可能避免排序
- **Window aggregate with buffering:**MySQL 需要缓冲行来计算窗口。对大数据集,这意味着更多内存
- **多个窗口:**如果 PARTITION BY 和 ORDER BY 不同,可能需要多次排序
优化建议
- 为 PARTITION BY + ORDER BY 创建索引 -- 如果窗口函数用于子查询/CTE 且可下推,合适的复合索引可避免排序
- 减少参与窗口计算的行数 -- 先用 WHERE 过滤,再应用窗口函数
- 合并窗口定义 -- 使用 WINDOW 子句复用相同的 PARTITION BY + ORDER BY
- 避免不必要的 ORDER BY -- 如果只需要全局聚合(如 SUM() OVER()),省略 ORDER BY
- 控制帧大小 -- ROWS 帧通常比 RANGE 帧更高效
- 注意内存使用 --
sort_buffer_size和tmp_table_size会影响窗口函数性能
-- 创建索引优化窗口函数
ALTER TABLE orders ADD INDEX idx_user_date (user_id, order_date);
-- 先过滤再排名
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rn
FROM orders
WHERE order_date >= '2024-01-01' -- 先过滤减少数据量
) t WHERE rn = 1;
MySQL 8.0 特性与限制
MySQL 8.0 支持的窗口函数一览
| 类别 | 函数 | 最低版本 |
|---|---|---|
| 类别 | 函数 | 最低版本 |
| 排名 | ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE() | 8.0.2 |
| 取值 | LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTH_VALUE() | 8.0.2 |
| 分布 | PERCENT_RANK(), CUME_DIST() | 8.0.2 |
| 聚合窗口 | SUM(), AVG(), COUNT(), MIN(), MAX(), BIT_AND(), BIT_OR(), BIT_XOR(), STDDEV_POP(), STDDEV_SAMP(), VAR_POP(), VAR_SAMP(), JSON_ARRAYAGG(), JSON_OBJECTAGG() | 8.0.2 |
| 帧模式 | ROWS, RANGE, GROUPS | 8.0.2(GROUPS 8.0.2) |
| 帧排除 | EXCLUDE CURRENT ROW, EXCLUDE GROUP, EXCLUDE TIES, EXCLUDE NO OTHERS | 不支持 |
| 命名窗口 | WINDOW w AS (...) | 8.0.2 |
已知限制
- **不支持 EXCLUDE 子句:**标准 SQL 中的
EXCLUDE CURRENT ROW、EXCLUDE GROUP等帧排除子句 MySQL 未实现 - **不支持 FILTER 子句:**标准 SQL 的
COUNT(*) FILTER (WHERE ...)不可用,需用COUNT(CASE WHEN ... THEN 1 END)替代 - 不支持 DISTINCT 窗口聚合:
COUNT(DISTINCT col) OVER ()会报错,需要子查询变通 - **窗口函数不能嵌套:**不能在窗口函数内调用另一个窗口函数
- **不能在 WHERE/HAVING 中使用:**窗口函数只能出现在 SELECT 和 ORDER BY 子句中。要过滤结果需包装子查询
- **GROUP BY + 窗口函数:**窗口函数在 GROUP BY 之后执行,因此作用于聚合后的结果行
FILTER 子句的替代方案
-- 标准 SQL(PostgreSQL 支持,MySQL 不支持):
SELECT COUNT(*) FILTER (WHERE status = 'active') OVER (PARTITION BY dept)
FROM employees;
-- MySQL 替代方案:
SELECT
SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) OVER (PARTITION BY dept) AS active_count
FROM employees;
COUNT(DISTINCT) OVER() 的替代方案
-- 不支持:
-- SELECT COUNT(DISTINCT category) OVER (PARTITION BY dept) FROM products;
-- 替代方案 1:子查询
SELECT p.*,
(SELECT COUNT(DISTINCT p2.category) FROM products p2 WHERE p2.dept = p.dept) AS distinct_cats
FROM products p;
-- 替代方案 2:DENSE_RANK 技巧
SELECT *,
MAX(dr) OVER (PARTITION BY dept) AS distinct_cats
FROM (
SELECT *, DENSE_RANK() OVER (PARTITION BY dept ORDER BY category) AS dr
FROM products
) t;
执行顺序
窗口函数在 SQL 逻辑执行顺序中位于SELECT 阶段:
- FROM / JOIN
- WHERE
- GROUP BY
- HAVING
- 窗口函数(WINDOW)
- SELECT
- DISTINCT
- ORDER BY
- LIMIT
这意味着窗口函数可以引用 SELECT 列表中的别名聚合结果,但不能在 WHERE 或 HAVING 中使用窗口函数的结果。
常见问题(FAQ)
Q1: 窗口函数和 GROUP BY 有什么区别?
GROUP BY 将多行折叠为一行,每组只输出一条结果。窗口函数保留所有原始行,在每行上附加聚合/排名计算。例如,GROUP BY department 只返回各部门一行,而 SUM(salary) OVER (PARTITION BY department) 保留每个员工行,同时每行都附带部门总薪资。
Q2: 为什么 LAST_VALUE() 总是返回当前行的值?
因为默认帧是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW。帧的末尾就是当前行,所以 LAST_VALUE 返回当前行的值。要获取分区内真正的最后一个值,必须显式设置帧为 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING。
Q3: 能否在 WHERE 子句中使用窗口函数?
不能。窗口函数在 WHERE 之后执行。要过滤窗口函数的结果,需要将查询包装在子查询或 CTE 中:SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) sub WHERE rn = 1;
Q4: 窗口函数会导致排序吗?如何优化?
几乎总是会。PARTITION BY 和 ORDER BY 都需要排序。优化方法:(1) 为 PARTITION BY + ORDER BY 列创建复合索引;(2) 使用 WINDOW 子句复用相同的窗口定义,让优化器只排序一次;(3) 先用 WHERE 过滤减少数据量;(4) 如果只需聚合不需要排序,省略 ORDER BY。
Q5: ROW_NUMBER() 结果值相同时顺序不确定怎么办?
ORDER BY 中增加一个唯一列(通常是主键)作为 tiebreaker:ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC, id ASC)。这样即使 salary 相同,排序也是确定性的。
Q6: MySQL 5.7 如何实现窗口函数的等效功能?
MySQL 5.7 不支持窗口函数。常见替代方案:(1) 排名用用户变量 @rn := @rn + 1(MySQL 8.0 中行为不可靠);(2) 累计求和用相关子查询;(3) LAG/LEAD 用 self-join。但这些方案性能通常差很多,建议升级到 MySQL 8.0。
Q7: 多个窗口函数是否会多次排序?
取决于 PARTITION BY 和 ORDER BY 是否相同。相同窗口定义的多个函数只排序一次。不同窗口定义则需要分别排序。使用 WINDOW 子句复用窗口定义,让优化器合并排序。通过 EXPLAIN FORMAT=TREE 可以查看实际排序次数。
Q8: ROWS 和 RANGE 帧模式该如何选择?
ROWS:按物理行数,行为更可预测,性能更好。大多数情况优先使用。RANGE:按值范围,当 ORDER BY 列有重复值且需要将相同值视为一组时使用。如果需要基于时间间隔(如 RANGE BETWEEN INTERVAL 7 DAY PRECEDING)则必须用 RANGE。GROUPS:按同值组,介于 ROWS 和 RANGE 之间,适合同值组粒度的计算。
Q9: 窗口函数能用于 UPDATE 语句吗?
不能直接在 UPDATE SET 子句中使用窗口函数。但可以通过 JOIN 子查询实现:UPDATE orders o JOIN (SELECT id, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn FROM orders) t ON o.id = t.id SET o.seq = t.rn;
Q10: 窗口函数和 LIMIT 配合使用有什么注意事项?
LIMIT 在窗口函数之后执行,所以窗口函数会先计算所有行再截断。如果你只需要 Top-N,窗口函数仍然要处理全部数据。对于大表,可以先用 WHERE 缩小范围,或者考虑 ORDER BY + LIMIT 这种不需要窗口函数的写法是否能满足需求。