大数据之HIVE入门(二十)
统计函数count 、sum、min、max、avg用法count:计数,如果count的字段中有null行会被忽略,所以一般统计行数可以直接写count(1);如果需要针对某个字段非空字段计数,如对设备号字段统计则写count(device_id);如果要对某个非空字段去重统计,如对设备号字段去重统计则写count(distinct device_id),当然也可通过先group by 再以计数的方式进行统计。一般用法如下:--查询每个部门有多少人 select dept,count(1) from staff group by dept; --查询部门中有多少人参与投票及投票次数(假设1人可投多票) select dept,count(1) as vote_num,count(distinct u) as vote_uv from dept_vote_log group by dept; --查询部门中有多少人参与投票及投票次数(假设1人可投多票),不用count(distinct)方式实现 select dept,sum(vote_num) as vote_num,count(1) as vote_uv from ( select dept,u,count(1) as vote_num from dept_vote_log group by dept,u ) t group by dept
sum :求和。如果指定字段中有null字段则被忽略。详见示例:--对部门工资进行求和 select dept,sum(salary) as salary from dept_salary where year="2021" group by dept min: 求最小值。--取部门工资最低人员 select dept,min(salary) as min_salary from dept_salary where year="2021" group by dept;max:求最大值。--取部门工资最高人员 select dept,max(salary) as max_salary from dept_salary where year="2021" group by dept; avg:求平均数。--取部门平均工资 select dept,avg(salary) as avg_salary from dept_salary where year="2021" group by dept;
以上基础统计函数,除了常规的功能外,它还有一个开窗的用法:
over(partition by col order by col asc|desc window_specification ):
partition by :指定分组字段。
order by : 指定排序顺序。
window specification:指定行的计算范围,可选,不指定默认是从行首到当前行。
窗口规范语法如下:(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING --------------------------------说明------------------------------------- preceding 往前 following 往后 unbounded preceding 往前所有行,即初始行 n preceding 往前n行 unbounded following 往后所有行,即末尾行 n following 往后n行 current row 当前行 --------------------------------示例------------------------------------ --计算全部数据 rows between unbounded preceding and unbounded following --计算从开始到当前行 rows between unbounded preceding and current row --计算前3行到当前行 rows between 3 preceding and current row
以sum over为例,通过开窗功能,它可以实现累计求和滑动窗口求和功能。
假如我们要计算当月每个人截至到当天从月初累计到现在的收入总和。传统的方法是计算每天的从月初到当天的汇总,生成子查询或中间表,再用中间表关联上用户每天的汇总数据,来计算占比等数据指标。---原始数据--sale_log--- sale_date saler revenue 2022-03-01 saler1 200 2022-03-02 saler2 400 2022-03-03 saler1 300 2022-03-04 saler2 300 2022-03-05 saler1 1000 2022-03-06 saler2 500 2022-03-07 saler3 200 2022-03-08 saler1 600 select sale_date,saler,revenue,sum(revenue) over (partition by saler order by sale_date asc rows between unbounded preceding and current row) as total_current from sale_log ; sale_date saler revenue total_current 2022-03-01 saler1 200 200 2022-03-03 saler1 300 500 2022-03-05 saler1 1000 1500 2022-03-08 saler1 600 2100 2022-03-02 saler2 400 400 2022-03-04 saler2 300 700 2022-03-06 saler2 500 1200 2022-03-07 saler3 200 200
同理count over 可以实现累计计数和滑动记数功能;min over 、max over可以实现求累计求最小最大值和滑动窗口的最小最大值功能;avg over可以求实现累计平均和滑动平均数功能。就不一一举例了。