Contents of this article
- 1. Built-in functions
- 2. Hive function classification
- 3. Introduction to commonly used functions
- 3.1 nvl
- 3.2 case when then else end
- 3.3 concat related functions
- 3.4 collect_ related functions
- 3.5 explode + lateral view
- 3.6 grouping sets multi-dimensional analysis
- 3.7 Window function
- Ⅰ.Over() general introduction
- Ⅱ.over() Several keywords that will be used in the brackets (refer to 3.1 below for how to use them)
- Ⅲ. Understand the use of over() through several requirements:
- 1. Query the total number of customers who purchased in April 2017
- 2. Query the customer's purchase details and monthly purchase amount
- 3. Query the customer's purchase details, and accumulate the cost of each customer according to the date
- 3.1 Introduction to the use of several keywords in over
- 4. LAG queries each customer's last purchase time
- 5. Query the order information of the first 20% of the time
- 3.8 rank、dense_rank
- 3.9 row_number
- 4. Custom function
1. Built-in functions
Enter the hive client and
show functionsview the built-in functions of the system through commands . There are 289 built-in functions by default. Command
desc function 函数名can view the description that comes with the function, the command
desc function extended 函数名to view the function comes with detailed usage (included with Example). As follows:
# 1.查看函数描述 hive (test)> desc function upper; OK tab_name upper(str) - Returns str with all characters changed to uppercase Time taken: 0.014 seconds, Fetched: 1 row(s) # 2.查看函数详细用法 hive (test)> desc function extended upper; OK tab_name upper(str) - Returns str with all characters changed to uppercase Synonyms: ucase Example: > SELECT upper('Facebook') FROM src LIMIT 1; 'FACEBOOK' Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper Function type:BUILTIN Time taken: 0.066 seconds, Fetched: 7 row(s)
2. Hive function classification
There are three types of functions in Hive:
- OUT OF
UDF function : input a row of data, and return a row of data. For example: upper (to uppercase), substring (intercept) function
UDTF function : input multiple rows of data and return one row of data, which is our commonly used aggregate function. For example: sum (sum), avg (average) function
UDAF function : input one row of data, return multiple rows of data, also called burst function. For example: explode (used to break up rows, split one row of data into multiple rows, its parameters must be map or array) function (explode function: for example, there are many words in a row, comma separated, and finally the words are separated by commas , Put a word on one line, that is, input one line and output multiple lines)
As shown in the figure below , through the command:
desc function extended 函数名;you can see what type of function is
3. Introduction to commonly used functions
Because the blogger has MySQL and Oracle related knowledge. The use of some functions here is exactly the same as that of MySQL and Oracle, so I won’t describe them here too much.
Function description :
Assign a value to the data whose value is NULL, and its format is
NVL( value，default_value). Its function is that if value is NULL, the NVL function returns the value of default_value, otherwise it returns the value of value, and if both parameters are NULL, it returns NULL. [There are also NVL functions in Oracle].
select nvl(null,'默认值'); #返回默认值 select nvl("111","222"); #返回111 select nvl(100+100,"222"); #返回200
3.2 case when then else end
Function Description :
多条件判断函数，类似 Java 中的 if else 语句. Condition is an expression that returns a boolean type. If the expression returns true, the entire function returns the value of the corresponding result. If the expressions are all false, it returns the value of result after ElSE. If the ELSE clause is omitted, it returns NULL. . [There are case when then else end functions in MySQL and Oracle]
CASE WHEN condition THEN result [WHEN...THEN...] ELSE result END
3.3 concat related functions
Function description :
CONCAT(string A/col, string B/col…) returns the result of input string concatenation, supports any number of input strings. [There are concat functions in MySQL and Oracle]
SELECT concat("Hello","----","World"); # 返回 Hello----World
Function description :
CONCAT_WS(separator, str1, str2,...): It is a special form of CONCAT(). The first parameter is the separator between the parameters. The separator can be the same string as the remaining parameters. If the separator is NULL, the return value will also be NULL. This function will skip any NULL and empty strings after the delimiter parameter. The delimiter will be added between the connected strings;
select concat_ws("-","Hello","World","HaHaHa"); #返回Hello-World-HaHaHa
CONCAT_WS in the incoming parameters, must be
array<string>字符串数组. Input array<string> string array, concat_ws can be used with collect_set and collect_list .
3.4 collect_ related functions
Function description : The
collect_set function only accepts basic data types. Its main function is to de- duplicate and summarize the value of a field to generate an Array type field.
# collect_list select dept_no,collect_list(user_name) from user_info group by dept_no;
Function description : The
difference between collect_list and collect_set is that it does not remove duplicates.
#collect_set select dept_no,collect_set(user_name) from user_info group by dept_no;
3.5 explode + lateral view
Tip: explode is a UDTF function, which usually needs to be used with the lateral view profile table.
explode(col) : Convert a row of data into column data, which can be used for array and map data. It is to split the complex array or map structure in one line of hive into multiple lines.
Expand the array data in the movie classification. The results are as follows:
# 表名.* 方式 select movie, category_name.* from movie_info lateral view explode(split(category,'/')) category_name; # 列名方式 select movie, category_name from movie_info lateral view explode(split(category,'/')) movie_info_tmp as category_name;
If you use
select movie,explode(split(category,'/')) from movie_info;it will error: UDTF's are not Supported at The Outside the SELECT clause, NOR nested in Expressions
When using UDTF function, hive only allowed to split the field visit.
select explode(split(category,'/')) from movie_info;
select movie,explode(split(category,'/')) from movie_info;
If you want to access fields other than split fields, what should I do?
Use lateral view side view! Official document, refer to: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
lateral view is a side view, meaning it is used in conjunction with UDTF functions to split a row of data into multiple rows of data . The UDTF without lateral view can only extract a single field to split, and cannot plug it back into the original data table. With the addition of a lateral view, the split single field data can be associated with the original table data. explode only explodes one column, just explode directly. If you need the fields related to the original table at the same time as the explosion, you need to add the lateral view profile table.
Note: When using the lateral view, you need to specify the view alias and the generated new column alias
[reference from: https://blog. csdn.net/u011110301/article/details/104198588】
3.6 grouping sets multi-dimensional analysis
Use grouping sets to analyze data in multiple dimensions. How to use grouping sets can refer to the official documents: Enhanced Aggregation, Cube, Grouping and Rollup
1. Requirements: ①Check
the total number of people in the
company ②Check how many men and how many women are in the company ③Check how many people are in
each department of the
company ④Check every company How many people are male and female in each department?
2. The data are as follows:
3. Basic SQL statement query
① View the total number of people in the company
select count(*) from db_user;
②Check how many men and women are in the company
select user_sex, count(*) from db_user group by user_sex;
③Check how many people are in each department of the company
select dept_no, count(*) from db_user group by dept_no;
④Check the number of men and women in each department of the company
select dept_no, user_sex, count(*) from db_user group by dept_no, user_sex;
4. Use GROUPING SETS to complete in one sentence! ! ! !
select dept_no, user_sex, count(*) from db_user group by dept_no,user_sex grouping sets ((),user_sex,dept_no,(user_sex,dept_no));
5. Aggregate query, using the comparison chart of grouping sets and group by (taken from official documents)
3.7 Window function
Tip: MySQL also introduced window function in 8.X version, which is the same as that used in Hive. Just find a blog to find out about it. MySQL 8.x windowing function blog
official document, reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Ⅰ.Over() general introduction
group by can group data sets according to a certain field. Similar to over(), you can specify fields. But it is still different from group by.
The difference is :
group by is grouped according to a certain field, for example, department id, then as long as the id is the same, all data in the group will be aggregated. When aggregating, all data in the group will be aggregated.
over() is different, it can specify that the window size changes with the change of rows, for example, over grouping according to a certain field ( grouping is not appropriate here ), it can open a window in the group, and again specify which row is divided As a group.
Ⅱ.over() Several keywords that will be used in the brackets (refer to 3.1 below for how to use them)
|PARTITION BY||Partition by a field|
|ORDER BY||Used for sorting, with aggregate function to achieve accumulation|
|CURRENT ROW||Current line|
|n PRECEDING||N rows forward|
|n FOLLOWING||The next n rows of data|
|UNBOUNDED||Starting point |
UNBOUNDED PRECEDING means from the starting point in the front
UNBOUNDED FOLLOWING means to the end point in the back
|LAG(col,n,default_val)||The previous nth row of data [col field, n pushes forward by n rows, default_val pushes forward by n rows, if the row is empty, use default_val field]|
|LEAD(col,n,default_val)||Next nth row of data [col field, n pushes forward by n rows, default_val pushes forward by n rows, if the row is empty, use default_val field]|
|FIRST_VALUE||This requires up to two parameters. The first parameter is the column for which you want the first value, and the second (optional) parameter must be a boolean with a default value of false. If set to true, skip null values|
|LAST_VALUE||This requires up to two parameters. The first parameter is the column for which you want the last value, and the second (optional) parameter must be a Boolean value that defaults to false. If set to true, null values are skipped.|
|NTILE(n)||Distribute the rows of the ordered window to the specified data groups. Each group has a number, and the number starts from 1. For each row, NTILE returns the number of the group to which Cihang belongs. Note: n must be of type int|
Tip: These keywords can be very convenient to control the number of rows in the window. In the following [It will be used in Requirements 3.1], click the link to view →→To 3.1 . You can also refer to the official document example: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Through the case to understand the meaning of over()
Case 1 : The meaning of over()
# 这个SQL是会直接报错的，因为使用count聚合函数，却没有 group by name; select name,count(name) from order_info;
# 加个 over() 在聚合函数后，在hive中就能够正常运行了。 select name,count(name) over() from order_info;
Why is it 14? Because there are a total of 14 records in the library.
over() takes effect after the group by query. We conducted
select name from order_info;have 14 records. over() means opening a window. There are no restrictions in the parentheses of over(), which means that
select name from order_info;the 14 records are used as the window data set, and count(name) is to count the data set in the window. There are 14 pieces of data in select name, 14 pieces of returned data, count(name) is calculated on the data set in the over() window, so the count(name) of 14 pieces of data is 14.
over() opens a window, and the data in the window is the data queried except for the aggregate function. In this example,
select name from order_info;there will be 14 records, and count(name) is to count the data in the window, which is 14.
Case 2 : over() partition usage
# over() 指定按照 name 分区 select name, count(name) over(partition by name) from order_info;
Query result: >
Ⅲ. Understand the use of over() through several requirements:
1. Query the total number of customers who purchased in April 2017
select name, count(*) over () from order_info where date_format(orderdate,'yyyy-MM') = '2017-04' group by name;
2. Query the customer's purchase details and monthly purchase amount
select name, orderdate, cost, sum(cost) over(partition by month(orderdate)) from order_info;
3. Query the customer's purchase details, and accumulate the cost of each customer according to the date
Remarks:Accumulation function () was added in one over Order by to。
select name, orderdate, cost, sum(cost) over(partition by name order by orderdate) from order_info;
3.1 Introduction to the use of several keywords in over
sum(cost) over(partition by name order by orderdate)by name, and accumulate data in the group
sum(cost) over(partition by name order by orderdaterows between UNBOUNDED PRECEDING and CURRENT ROW ) Group by name, aggregation from the starting point to the current row , the same as ①, ①The default is also the aggregation from the starting point to the current row
sum(cost) over(partition by name order by orderdaterows between 1 PRECEDING and CURRENT the rOW ) by the name packet, the current line and the front a row do polymerizable
sum(cost) over(partition by name order by orderdate rows BETWEEN 1 pRECEDING and 1 FOLLOWING ) by the name packet, the current line and the front one row and the back row of a polymerisation
sum(cost) over(partition by name order by orderdate rows BETWEEN the cURRENT the rOW and 1 FOLLOWING ) by the name packets ,The current row and the next row are aggregated
sum(cost) over(partition by name order by orderdate rows between CURRENT ROW and UNBOUNDED FOLLOWING ) are grouped by name, and the current row and all subsequent rows are aggregated
4. LAG queries each customer's last purchase time
Prompt: lag(col,n,default_val): [col field, n pushes forward by n rows, default_val pushes forward by n rows, if the change is empty, use default_val field]
① Lag moves the first n data down as a whole The meaning of shift. You can do page access trajectory. For example: access from [Page A> Page B> Page C]
select name, orderdate, (lag(orderdate,1,orderdate) over(partition by name order by orderdate)) upPurchaseTime from order_info;
② Lead and Lag are in opposite directions. Consistent function
# lead 1行，为 null 使用本行的 orderdate select name, orderdate, lead(orderdate,1,orderdate) over(partition by name order by orderdate) from order_info;
② Lead 2 rows behind
# lead 2行，为 null 使用本行的 orderdate select name, orderdate, lead(orderdate,2,orderdate) over(partition by name order by orderdate) from order_info;
5. Query the order information of the first 20% of the time
Sort by time, 20% is divided into 5 groups, take the data of the first group, use ntile(n) to complete with over()
select tb.name, tb.orderdate, tb.cose from ( select name, orderdate, cost, ntile(5) over(order by orderdate) groupId from order_info; ) tb where tb.groupId = 1
Tip: It needs to be used with the windowing function over().
MySQL also introduced a windowing function in version 8.x, which makes it easy to implement functions similar to rank and dense_rank.
Function description: The
rank function is used for sorting. When the sorting is the same, it will repeat and the total number will not change.
The dense_rank function is used for sorting, and the total number will decrease when the sorting is the same.
similar results ranking. 500 points for the first person, 500 points for the second person. Tied for first place. The third person 499. It means he is second? It's still a question of third place. The total number of rank will not change, it will be 1, 1, 3 after sorting ; the total number of dense_rank will be reduced, and the
data will be 1, 1, 2 after sorting :
# rank 排序后总数不会变，排序后是 1、1、3 select name, grade, rank() over(order by grade desc) from grade_info;
# dense_rank 总数则会减少，排序后是1、1、2 select name, grade, dense_rank() over(order by grade desc) from grade_info;
Tip: Need to cooperate with the windowing function over() to use
MySQL. The windowing function is also introduced in the 8.x version, which makes it easy to achieve a function similar to row_number. If there is no windowing function, it is more responsible, you can refer to: MySQL statement exercise 50 questions
Function description: The
row_number function is to add a number to each row, the number starts from 1, which is similar to the ROWNUM in Oracle.
Use 3.7 data
select name, grade, row_number() over() from grade_info;
other built-in functions not introduced, please study by yourself.
show functionsTo view the built-in functions of the system through commands , there are 289 built-in functions by default.
4. Custom function
There are always problems that Hive built-in functions cannot solve, and we need to customize Hive functions. Commonly used custom functions are UDF and UDTF functions. The aggregate function UDAF is basically built-in. Focus on the introduction
Hive custom functions, this article will not introduce them, if you need to understand. Jump to view: https://blog.csdn.net/lzb348110175/article/details/117653668
It’s not easy for bloggers to write, so add some attention
Ask for attention, ask for likes, add attention to not get lost ヾ(◍°∇°◍)ﾉﾞ
I cannot guarantee that the content written is correct, but I can guarantee that it will not be copied or pasted. Make sure that every sentence and every line of code has been typed by yourself. Please point out any mistakes. Thank you ♪(･ω･)ﾉ