提示
Spark SQL 教程 正在筹划编写中,使用过程中有任何建议,提供意见、建议、纠错、催更加微信 gr99123。
GROUP BY 子句用于基于一组指定的分组表达式对行进行分组,并基于一个或多个指定的聚合函数计算行组上的聚合。
Spark 还支持高级聚合,通过 GROUPING SETS, CUBE, ROLLUP 子句对同一输入记录集进行多个聚合。当 FILTER 子句附加到聚合函数时,只将匹配的行传递给该函数。
GROUP BY 子句语法:
GROUP BY group_expression [ , group_expression [ , ... ] ]
[ { WITH ROLLUP | WITH CUBE | GROUPING SETS (grouping_set [ , ...]) } ]
GROUP BY GROUPING SETS (grouping_set [ , ...])
聚合函数定义为:
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
语法中的参数有:
对分组集中指定的表达式的每个子集的行进行分组。例如,GROUP BY GROUPING SETS (warehouse, product)
在语义上等价于 GROUP BY warehouse
和 GROUP BY product
结果的并集。此子句是 UNION ALL 的简写,其中 UNION ALL 运算符的每个分支执行 GROUPING SETS 子句中指定的列子集的聚合。
分组集(grouping set)由括号中的零个或多个逗号分隔的表达式指定。
语法: ( [ expression [ , ... ] ] )
指定行分组所基于的条件。行的分组是基于分组表达式的结果值执行的。分组表达式可以是列别名、列位置或表达式。
在一条语句中指定聚合的多个级别。此子句用于基于多个分组集计算聚合。ROLLUP 是 GROUPING SETS 的简写。例如,GROUP BY warehouse, product WITH ROLLUP
相当于 GROUP BY GROUPING SETS ((warehouse, product), (warehouse), ())
。汇总规范的 N 个元素产生 N+1 个 GROUPING SETS。
CUBE子 句用于根据 GROUP BY 子句中指定的分组列的组合执行聚合。CUBE 是对GROUPING SETS 的缩写。例如,GROUP BY warehouse, product WITH CUBE
相当于 GROUP BY GROUPING SETS ((warehouse, product), (warehouse), (product), ())
。多维数据集规范的 N 个元素产生 2^N 个分组集。
指定聚合函数名(MIN、MAX、COUNT、SUM、AVG 等)。
在将输入行中的重复项传递给聚合函数之前,删除这些重复项。
将 WHERE 子句中的 boolean_expression 布尔表达式计算为 true 的输入行传递给聚合函数;其他行将被丢弃。
构造数据:
-- 造数据
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
(100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
示例用法如下:
-- 每个经销商的数量总和. Group by `id`.
SELECT id, sum(quantity)
FROM dealer
GROUP BY id
ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- 使用列位置 GROUP by clause.
SELECT id, sum(quantity)
FROM dealer
GROUP BY 1
ORDER BY 1;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- 多个聚合
-- 1. 每个经销商的数量总和
-- 2. 每个经销商的最大数量
SELECT id, sum(quantity) AS sum, max(quantity) AS max
FROM dealer
GROUP BY id
ORDER BY id;
+---+---+---+
| id|sum|max|
+---+---+---+
|100| 32| 15|
|200| 33| 20|
|300| 13| 8|
+---+---+---+
-- 计算每个车型的不同经销商城市数
SELECT car_model, count(DISTINCT city) AS count
FROM dealer
GROUP BY car_model;
+------------+-----+
| car_model|count|
+------------+-----+
| Honda Civic| 3|
| Honda CRV| 2|
|Honda Accord| 3|
+------------+-----+
-- 每个经销商的“本田思域”和“本田CRV”数量之和
SELECT id, sum(quantity) FILTER (
WHERE car_model IN ('Honda Civic', 'Honda CRV')
) AS `sum(quantity)` FROM dealer
GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 17|
|200| 23|
|300| 5|
+---+-------------+
-- 在一条语句中使用多组分组列的聚合
-- 下面根据四组分组列执行聚合
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. 空分组集,返回所有城市和汽车模型的数量
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- “GROUPING set”的另一种语法,其中“GROUP BY”和“GROUPING set”`
-- 规范已存在
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- 使用“ROLLUP”子句按处理分组。
-- 等价 GROUP BY GROUPING SETS ((city, car_model), (city), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- 使用“CUBE”子句按处理分组。
-- 等价 GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- 为忽略空值示例准备数据
CREATE TABLE person (id INT, name STRING, age INT);
INSERT INTO person VALUES
(100, 'Mary', NULL),
(200, 'John', 30),
(300, 'Mike', 80),
(400, 'Dan', 50);
-- 选择列中的第一行
SELECT FIRST(age) FROM person;
+--------------------+
| first(age, false) |
+--------------------+
| NULL |
+--------------------+
-- 获取列“age”中的第一行忽略空值
-- 列“id”中的最后一行和列“id”的和。
SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id)
FROM person;
+-------------------+------------------+----------+
| first(age, true) | last(id, false) | sum(id) |
+-------------------+------------------+----------+
| 30 | 400 | 1000 |
+-------------------+------------------+----------+
更新时间:2021-06-29 22:53:41 标签:spark sql groupby