Hive built-in functions

Contents of this article

1. Built-in functions

Enter the Hive client and view the system's built-in functions with the show functions command. There are 289 built-in functions by default. The command desc function function_name shows the function's description, and desc function extended function_name shows its detailed usage (with an example). As follows:

# 1. View the function description
hive (test)> desc function upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Time taken: 0.014 seconds, Fetched: 1 row(s)

# 2. View the function's detailed usage
hive (test)> desc function extended upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Synonyms: ucase
Example:
  > SELECT upper('Facebook') FROM src LIMIT 1;
  'FACEBOOK'
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper
Function type:BUILTIN
Time taken: 0.066 seconds, Fetched: 7 row(s)

2. Hive function classification

There are three types of functions in Hive:

  1. UDF (one row in, one row out: a regular function)
  2. UDAF (many rows in, one row out: an aggregate function)
  3. UDTF (one row in, many rows out: a table-generating, or "explode", function)

Here "in" and "out" refer to the number of rows of input and output data.

UDF: takes one row of input and returns one row of output. Examples: upper (convert to uppercase), substring (extract part of a string).

UDAF: takes multiple rows of input and returns one row of output; these are the familiar aggregate functions. Examples: sum, avg.

UDTF: takes one row of input and returns multiple rows of output, also called an "explode" function. Example: explode, which breaks one row into multiple rows; its argument must be a map or an array. For instance, if one row holds many comma-separated words, splitting on the comma and putting each word on its own line turns one input row into many output rows.

The Function type line in the output of desc function extended function_name; tells you which kind a given function is.

3. Introduction to commonly used functions

Since the blogger has covered MySQL and Oracle before, and some of these functions are used exactly as in MySQL and Oracle, those are not described in much detail here.

3.1 nvl

Function description:
NVL supplies a replacement for data whose value is NULL; the format is NVL(value, default_value). If value is NULL, NVL returns default_value; otherwise it returns value. If both arguments are NULL, it returns NULL. [Oracle also has an NVL function.]

Example:
select nvl(null,'default');   # returns 'default'
select nvl("111","222");  # returns 111
select nvl(100+100,"222");  # returns 200

3.2 case when then else end

Function description:
A multi-branch conditional, similar to if/else in Java. Each condition is an expression that returns a boolean. If a condition evaluates to true, the whole expression returns the corresponding result. If every condition is false, it returns the result after ELSE; if the ELSE clause is omitted, it returns NULL. [MySQL and Oracle also have CASE WHEN ... THEN ... ELSE ... END.]

Format:
CASE WHEN condition THEN result
     [WHEN ... THEN ...]
     ELSE result
END
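Since the format above has no worked example, here is a minimal sketch (hypothetical: a db_user table with dept_no and user_sex columns, with user_sex values assumed to be 'male'/'female'), turning a condition into a countable value:

```sql
-- Count men and women per department in one pass.
-- db_user and the 'male'/'female' values are assumptions for illustration.
select
    dept_no,
    sum(case when user_sex = 'male'   then 1 else 0 end) male_count,
    sum(case when user_sex = 'female' then 1 else 0 end) female_count
from
    db_user
group by
    dept_no;
```

Each CASE expression returns 1 or 0 per row, so SUM ends up counting the rows that match the condition.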

3.3 concat and concat_ws

Ⅰ.concat

Function description :
CONCAT(string A/col, string B/col…) returns the result of input string concatenation, supports any number of input strings. [There are concat functions in MySQL and Oracle]

Example:
SELECT concat("Hello","----","World");   # returns Hello----World

Ⅱ.concat_ws

Function description :
CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments; it can be any string. If the separator is NULL, the result is NULL. The function skips any NULL or empty strings that follow the separator argument, and the separator is added between the strings being concatenated.

Example:
select concat_ws("-","Hello","World","HaHaHa");   # returns Hello-World-HaHaHa
Note:
  The arguments passed to CONCAT_WS must be string or array<string>. When given an array<string>, concat_ws pairs well with collect_set and collect_list.
Data: the user_info table, with columns dept_no and user_name (screenshot omitted)

3.4 collect_set and collect_list

Ⅰ.collect_set

Function description:
The collect_set function only accepts basic data types. Its main job is to de-duplicate and aggregate the values of a field, producing an Array-type field.

Example:
# collect_set
select dept_no,collect_set(user_name) from user_info group by dept_no;

Ⅱ.collect_list

Function description:
collect_list differs from collect_set only in that it does not remove duplicates.

Example:
# collect_list
select dept_no,collect_list(user_name) from user_info group by dept_no;
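Following the earlier note that concat_ws accepts array<string>, here is a sketch of combining it with collect_set (same user_info table as above): collect_set produces the array, and concat_ws flattens it back into one delimited string.

```sql
-- De-duplicated user names per department, joined into one string.
select
    dept_no,
    concat_ws(",", collect_set(user_name)) user_names
from
    user_info
group by
    dept_no;
```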

3.5 explode + lateral view

Tip: explode is a UDTF function, and it usually needs to be used together with lateral view.
Function description:
explode(col): turns one row of data into multiple rows; it works on array and map data. In other words, it splits a complex array or map structure in a single Hive row into multiple rows.

Data: the movie_info table, with columns movie and category, where category holds /-separated values (screenshot omitted)


Requirement:
Split the /-separated category of each movie so that each (movie, category) pair appears on its own row (expected-result screenshot omitted).


SQL:

# style 1: table_alias.*
select 
   movie,
   category_name.*    
from 
   movie_info 
lateral view 
   explode(split(category,'/')) category_name;
# style 2: explicit column alias
select 
   movie,
   category_name    
from 
   movie_info 
lateral view 
   explode(split(category,'/')) movie_info_tmp as category_name;
Analysis:
  If you run select movie,explode(split(category,'/')) from movie_info; it fails with: UDTF's are not supported outside the SELECT clause, nor nested in expressions

Reason:
When a UDTF is used directly in the SELECT clause, Hive only allows the exploded field to be accessed.
Correct command: select explode(split(category,'/')) from movie_info;
Wrong command: select movie,explode(split(category,'/')) from movie_info;

What if you need fields other than the exploded one?
Use a lateral view! See the official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

A lateral view is used together with a UDTF to split one row of data into multiple rows. A UDTF without a lateral view can only extract and split a single field, and cannot join the result back onto the original table. Adding a lateral view lets the split values of that single field be associated with the rest of the original row. If you only need the exploded column, explode alone is enough; if you also need other fields from the original table, add a lateral view.

Note: when using a lateral view, you need to specify an alias for the view and an alias for each generated column
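explode also works on map columns, producing two columns (key and value), which means the lateral view needs two column aliases. A sketch, with a hypothetical user_props table whose props column is map<string,string>:

```sql
-- Exploding a map yields a (key, value) pair per entry; both get aliases.
-- user_props and its columns are assumptions for illustration.
select
    user_id,
    prop_key,
    prop_value
from
    user_props
lateral view
    explode(props) tmp as prop_key, prop_value;
```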

[Reference: https://blog.csdn.net/u011110301/article/details/104198588]

3.6 grouping sets multi-dimensional analysis

Use grouping sets to analyze data in multiple dimensions. How to use grouping sets can refer to the official documents: Enhanced Aggregation, Cube, Grouping and Rollup

1. Requirements:
① Count the total number of people in the company
② Count how many men and how many women are in the company
③ Count how many people are in each department of the company
④ Count how many men and women are in each department of the company

2. The data: the db_user table, with columns dept_no and user_sex (screenshot omitted)


3. Basic SQL statement query

① View the total number of people in the company

select count(*) from db_user;
②Check how many men and women are in the company
select user_sex, count(*) from db_user group by user_sex;
③Check how many people are in each department of the company
select dept_no, count(*) from db_user group by dept_no;
④Check the number of men and women in each department of the company
select dept_no, user_sex, count(*) from db_user group by dept_no, user_sex;
4. Use GROUPING SETS to answer all four in one statement!
select 
    dept_no,
    user_sex,
    count(*) 
from 
    db_user 
group by 
    dept_no,user_sex 
grouping sets
    ((),user_sex,dept_no,(user_sex,dept_no));
Query results: (screenshot omitted; NULL fills the columns not included in a given grouping set)


5. Comparison of aggregate queries written with grouping sets vs. the equivalent group by + union (table taken from the official documentation; screenshot omitted)
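For intuition, the GROUPING SETS query in step 4 should be equivalent to a UNION ALL of the four plain GROUP BY queries from step 3, with NULL filling the columns a branch does not group by (a sketch against the same db_user table):

```sql
select null dept_no, null user_sex, count(*) cnt from db_user
union all
select null, user_sex, count(*) from db_user group by user_sex
union all
select dept_no, null, count(*) from db_user group by dept_no
union all
select dept_no, user_sex, count(*) from db_user group by dept_no, user_sex;
```

This is the rewrite the official Enhanced Aggregation documentation uses to explain GROUPING SETS.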

3.7 Window function

Tip: MySQL also introduced window functions in version 8.x, and they are used the same way as in Hive; any MySQL 8.x window-function blog will get you up to speed.

official document, reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

Data: the order_info table, with columns name, orderdate, and cost (screenshot omitted)

Ⅰ.Over() general introduction

Description:
  group by groups a data set by some field; over() can also be given fields, but it differs from group by.

The difference:
group by groups by a field, such as a department id: every row with the same id falls into one group, and aggregation always runs over all rows in that group.
over() is different: it can define a window whose extent changes from row to row. For example, over() can partition by a field (calling it "grouping" is not quite right here), and within that partition open a window that decides which rows are aggregated for the current row.

Ⅱ. Keywords used inside over() (see 3.1 below for how to use them)

Keyword: meaning
PARTITION BY: partition by a field
ORDER BY: used for sorting; combined with an aggregate function it produces a running total
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the partition boundary
  UNBOUNDED PRECEDING means starting from the first row
  UNBOUNDED FOLLOWING means continuing to the last row
LAG(col,n,default_val): the value of col from the nth row before the current row; if that row does not exist, default_val is used
LEAD(col,n,default_val): the value of col from the nth row after the current row; if that row does not exist, default_val is used
FIRST_VALUE: takes up to two parameters: the column whose first value you want, and an optional boolean (default false) that, when true, skips null values
LAST_VALUE: takes up to two parameters: the column whose last value you want, and an optional boolean (default false) that, when true, skips null values
NTILE(n): distributes the rows of the ordered window into n numbered groups, numbering from 1; for each row, NTILE returns the number of the group the row belongs to. Note: n must be an int
Tip: these keywords make it easy to control the row range of the window. They are used in requirement 3.1 below. You can also refer to the official examples: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
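FIRST_VALUE and LAST_VALUE are not exercised in the requirements below, so here is a sketch against the same order_info table. One caveat worth knowing: with order by and no explicit frame, the window ends at the current row, so LAST_VALUE needs an explicit rows between clause to see the true last row of the partition.

```sql
-- Each customer's earliest and latest order date, shown on every row.
select
    name,
    orderdate,
    first_value(orderdate) over(partition by name order by orderdate) first_order,
    last_value(orderdate) over(partition by name order by orderdate
        rows between unbounded preceding and unbounded following) last_order
from
    order_info;
```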


Through the case to understand the meaning of over()

Case 1: the meaning of over()
# This SQL fails immediately: count is an aggregate function but there is no group by name;
select name,count(name) from order_info;

# Add over() after the aggregate function and it runs fine in Hive.
select name,count(name) over() from order_info;
Operation result: (screenshots omitted)


Result analysis:

  Why 14? Because there are 14 records in the table in total.

over() takes effect after the rest of the query. select name from order_info; returns 14 records. over() opens a window; with nothing inside the parentheses there are no restrictions, so the 14 records of select name from order_info; form the window data set, and count(name) counts that data set. 14 rows go in, 14 rows come out, and each row's count(name) over the window is 14.

Summary:
  over() opens a window; the window's data is whatever the query returns apart from the aggregate function. Here select name from order_info; yields 14 records, and count(name) counts the rows in the window: 14.



Case 2: over() partition usage
# over() partitioned by name
select 
   name,
   count(name) over(partition by name) 
from 
   order_info;
Query result: (screenshot omitted)

Ⅲ. Understand the use of over() through several requirements:

1. Query the total number of customers who purchased in April 2017

select 
   name,
   count(*) over ()
from 
   order_info
where 
   date_format(orderdate,'yyyy-MM') = '2017-04'
group by 
   name;

2. Query the customer's purchase details and monthly purchase amount

select 
   name,
   orderdate,
   cost,
   sum(cost) over(partition by month(orderdate))
from
   order_info;

3. Query the customer's purchase details, and accumulate each customer's cost by date

Note: the running total comes from adding order by inside over().
select 
   name,
   orderdate,
   cost,
   sum(cost) over(partition by name order by orderdate)
from
   order_info;
3.1 How the keywords inside over() are used
① sum(cost) over(partition by name order by orderdate): group by name and accumulate within the group

sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW): group by name, aggregate from the first row to the current row; same result as ①, whose default frame is also first-row-to-current-row

sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and CURRENT ROW): group by name, aggregate the current row and the previous row

sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING): group by name, aggregate the previous row, the current row, and the next row

sum(cost) over(partition by name order by orderdate rows between CURRENT ROW and 1 FOLLOWING): group by name, aggregate the current row and the next row

sum(cost) over(partition by name order by orderdate rows between CURRENT ROW and UNBOUNDED FOLLOWING): group by name, aggregate the current row and all following rows

4. LAG: query each customer's previous purchase date

Hint: lag(col,n,default_val): takes col from the nth row before the current one; if that row does not exist, default_val is used.

① LAG shifts a column down by n rows as a whole. It can be used for page-visit trails, for example a visit path like [Page A > Page B > Page C].
select 
    name,
    orderdate,
    (lag(orderdate,1,orderdate) over(partition by name order by orderdate)) lastPurchaseTime
from 
    order_info;
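The page-trail use case mentioned in ① can be sketched like this (hypothetical: a page_view_log table with user_id, view_time, and page columns):

```sql
-- For every page view, also show where the user came from.
-- page_view_log and its columns are assumptions for illustration.
select
    user_id,
    view_time,
    lag(page, 1, 'ENTRY') over(partition by user_id order by view_time) from_page,
    page to_page
from
    page_view_log;
```

Rows with from_page = 'ENTRY' mark the start of a user's trail.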


② LEAD is LAG in the opposite direction; otherwise it behaves the same

# lead by 1 row; when null, use this row's orderdate
select 
    name,
    orderdate,
    lead(orderdate,1,orderdate) over(partition by name order by orderdate)
from 
    order_info;


③ LEAD by 2 rows

# lead by 2 rows; when null, use this row's orderdate
select 
    name,
    orderdate,
    lead(orderdate,2,orderdate) over(partition by name order by orderdate)
from 
    order_info;

5. Query the orders from the first 20% of the time range

Sort by date, use ntile(5) with over() to split the rows into 5 groups (20% each), and take the first group:
select
    tb.name,
    tb.orderdate,
    tb.cost
from (
    select 
        name,
        orderdate,
        cost,
        ntile(5) over(order by orderdate) groupId  
    from 
        order_info
) tb
where
    tb.groupId = 1;

3.8 rank、dense_rank

Tip: these need to be used with the windowing function over().
MySQL also introduced window functions in version 8.x, which make rank- and dense_rank-like features easy to implement.

Function description:
rank is used for ranking; tied values get the same rank, and the overall numbering is preserved (a gap follows each tie).
dense_rank is used for ranking too, but after a tie the next rank is not skipped, so the numbering is compressed.

Example:
  think of exam results. The first person scores 500 and so does the second: they tie for first. The third scores 499; is he second or third? With rank the numbering is preserved, giving 1, 1, 3; with dense_rank it is compressed, giving 1, 1, 2.

Data: the grade_info table, with columns name and grade (screenshot omitted)


Example:

# rank preserves the numbering; after sorting the ranks are 1, 1, 3
select
   name,
   grade,
   rank() over(order by grade desc)
from 
   grade_info;
# dense_rank compresses the numbering; after sorting the ranks are 1, 1, 2
select
   name,
   grade,
   dense_rank() over(order by grade desc)
from 
   grade_info;

3.9 row_number

Tip: needs to be used with the windowing function over().

MySQL also introduced window functions in version 8.x, which make row_number-like behavior easy to get. Without window functions it is more cumbersome; see: MySQL statement exercise 50 questions

Function description:
row_number assigns a sequential number to each row, starting from 1, similar to ROWNUM in Oracle.

Data:
  use the grade_info data from section 3.8

Example:
select
    name,
    grade,
    row_number() over()
from 
    grade_info;
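A common application of row_number is top-N per group, which none of the examples above show. A sketch reusing the order_info table from 3.7: the two most expensive orders per customer.

```sql
-- Top 2 most expensive orders per customer.
select
    name, orderdate, cost
from (
    select
        name, orderdate, cost,
        row_number() over(partition by name order by cost desc) rn
    from
        order_info
) t
where
    rn <= 2;
```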
Tips:
  Other built-in functions not covered here are left for self-study. Use the show functions command to view the system's built-in functions; there are 289 by default.

4. Custom function

There are always problems the Hive built-in functions cannot solve, and then we need custom Hive functions. The commonly written custom functions are UDFs and UDTFs; aggregate (UDAF) needs are almost always covered by the built-ins, so the focus is on custom UDF functions and custom UDTF functions.
Custom Hive functions are not covered in this article; to learn more, jump to: https://blog.csdn.net/lzb348110175/article/details/117653668

It's not easy being a blogger, so please follow and like; follow so you don't get lost ヾ(◍°∇°◍)ノ゙

I can't guarantee that everything written here is correct, but I can guarantee none of it was copy-pasted: every sentence and every line of code was typed by hand. If you spot a mistake, please point it out. Thanks ♪(・ω・)ノ