提示
Spark SQL 教程 正在筹划编写中,使用过程中有任何建议,提供意见、建议、纠错、催更加微信 gr99123。
窗口函数对一组行(称为窗口)进行操作,并基于该组行计算每行的返回值。窗口函数对于处理诸如计算移动平均值、计算累积统计数据或访问给定当前行相对位置的行值等任务非常有用。
Spark SQL 的窗口函数结构为:
window_function OVER
( [ { PARTITION | DISTRIBUTE } BY partition_col_name = partition_col_val ( [ , ... ] ) ]
{ ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ]
[ window_frame ] )
参数:
RANK | DENSE_RANK | PERCENT_RANK | NTILE | ROW_NUMBER
CUME_DIST | LAG | LEAD
MAX | MIN | COUNT | SUM | AVG | ...
有关 Spark 聚合函数的完整列表,请参阅内置聚合函数文档。
{ RANGE | ROWS } { frame_start | BETWEEN frame_start AND frame_end }
UNBOUNDED PRECEDING | offset PRECEDING | CURRENT ROW | offset FOLLOWING | UNBOUNDED FOLLOWING
以下是一些示例讲解:
CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);
INSERT INTO employees VALUES ("Lisa", "Sales", 10000, 35);
INSERT INTO employees VALUES ("Evan", "Sales", 32000, 38);
INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28);
INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33);
INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33);
INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28);
INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38);
INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23);
INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25);
SELECT * FROM employees;
+-----+-----------+------+-----+
| name| dept|salary| age|
+-----+-----------+------+-----+
|Chloe|Engineering| 23000| 25|
| Fred|Engineering| 21000| 28|
| Paul|Engineering| 29000| 23|
|Helen| Marketing| 29000| 40|
| Tom|Engineering| 23000| 33|
| Jane| Marketing| 29000| 28|
| Jeff| Marketing| 35000| 38|
| Evan| Sales| 32000| 38|
| Lisa| Sales| 10000| 35|
| Alex| Sales| 30000| 33|
+-----+-----------+------+-----+
-- 在对应部门的排序号
SELECT name, dept,
RANK() OVER (PARTITION BY dept ORDER BY salary) AS rank
FROM employees;
+-----+-----------+------+----+
| name| dept|salary|rank|
+-----+-----------+------+----+
| Lisa| Sales| 10000| 1|
| Alex| Sales| 30000| 2|
| Evan| Sales| 32000| 3|
| Fred|Engineering| 21000| 1|
| Tom|Engineering| 23000| 2|
|Chloe|Engineering| 23000| 2|
| Paul|Engineering| 29000| 4|
|Helen| Marketing| 29000| 1|
| Jane| Marketing| 29000| 1|
| Jeff| Marketing| 35000| 3|
+-----+-----------+------+----+
SELECT name, dept,
DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW) AS dense_rank
FROM employees;
+-----+-----------+------+----------+
| name| dept|salary|dense_rank|
+-----+-----------+------+----------+
| Lisa| Sales| 10000| 1|
| Alex| Sales| 30000| 2|
| Evan| Sales| 32000| 3|
| Fred|Engineering| 21000| 1|
| Tom|Engineering| 23000| 2|
|Chloe|Engineering| 23000| 2|
| Paul|Engineering| 29000| 3|
|Helen| Marketing| 29000| 1|
| Jane| Marketing| 29000| 1|
| Jeff| Marketing| 35000| 2|
+-----+-----------+------+----------+
SELECT name, dept, age,
CUME_DIST() OVER (PARTITION BY dept ORDER BY age
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cume_dist FROM employees;
+-----+-----------+------+------------------+
| name| dept|age | cume_dist|
+-----+-----------+------+------------------+
| Alex| Sales| 33|0.3333333333333333|
| Lisa| Sales| 35|0.6666666666666666|
| Evan| Sales| 38| 1.0|
| Paul|Engineering| 23| 0.25|
|Chloe|Engineering| 25| 0.75|
| Fred|Engineering| 28| 0.25|
| Tom|Engineering| 33| 1.0|
| Jane| Marketing| 28|0.3333333333333333|
| Jeff| Marketing| 38|0.6666666666666666|
|Helen| Marketing| 40| 1.0|
+-----+-----------+------+------------------+
SELECT name, dept, salary,
MIN(salary) OVER (PARTITION BY dept ORDER BY salary) AS min
FROM employees;
+-----+-----------+------+-----+
| name| dept|salary| min|
+-----+-----------+------+-----+
| Lisa| Sales| 10000|10000|
| Alex| Sales| 30000|10000|
| Evan| Sales| 32000|10000|
|Helen| Marketing| 29000|29000|
| Jane| Marketing| 29000|29000|
| Jeff| Marketing| 35000|29000|
| Fred|Engineering| 21000|21000|
| Tom|Engineering| 23000|21000|
|Chloe|Engineering| 23000|21000|
| Paul|Engineering| 29000|21000|
+-----+-----------+------+-----+
SELECT name, salary,
LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS lag,
LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY salary) AS lead
FROM employees;
+-----+-----------+------+-----+-----+
| name| dept|salary| lag| lead|
+-----+-----------+------+-----+-----+
| Lisa| Sales| 10000|NULL |30000|
| Alex| Sales| 30000|10000|32000|
| Evan| Sales| 32000|30000| 0|
| Fred|Engineering| 21000| NULL|23000|
|Chloe|Engineering| 23000|21000|23000|
| Tom|Engineering| 23000|23000|29000|
| Paul|Engineering| 29000|23000| 0|
|Helen| Marketing| 29000| NULL|29000|
| Jane| Marketing| 29000|29000|35000|
| Jeff| Marketing| 35000|29000| 0|
+-----+-----------+------+-----+-----+
更新时间:2021-08-20 22:22:54 标签:sql spark 窗口函数