20210109_hive study notes

1. Hive installation and configuration    
1. Download apache-hive-3.1.2-bin.tar.gz. Hadoop and MySQL must be installed first, with their services started.
2. Upload the archive to Linux via sftp, then decompress it: tar -xzvf apache-hive-3.1.2-bin.tar.gz
3. Modify the configuration file
mv /root/apache-hive-3.1.2-bin /root/hive-3.1.2
mv /root/hive-3.1.2/conf/hive-env.sh.template /root/hive-3.1.2/conf/hive-env.sh
vim /root/hive-3.1.2/conf/hive-env.sh
  HIVE_CONF_DIR=/root/hive-3.1.2/conf
4. Configure environment variables and log file directory
vim /root/.bash_profile
export HIVE_HOME=/root/hive-3.1.2
export PATH=$HIVE_HOME/bin:$PATH
source /root/.bash_profile
cp /root/hive-3.1.2/conf/hive-log4j2.properties.template /root/hive-3.1.2/conf/hive-log4j2.properties
vim /root/hive-3.1.2/conf/hive-log4j2.properties
property.hive.log.dir = /root/hive-3.1.2/logs #change the log file directory
mkdir /root/hive-3.1.2/logs
5. Initialize the metastore. If you do not want to use the embedded Derby database, skip ahead to step 7.
cd /root/hive-3.1.2/bin
./schematool -dbType derby -initSchema
6. Run the hive command to enter the CLI and start working.

7. To use MySQL instead of the single-user Derby database: install MySQL first, download mysql-connector-java-8.0.22.tar.gz (the connector version must match the MySQL version), unpack it, and upload mysql-connector-java-8.0.22.jar to the specified directory on the hive server.
mv /root/mysql-connector-java-8.0.22.jar /root/hive-3.1.2/lib/
8. Modify the configuration file
touch /root/hive-3.1.2/conf/hive-site.xml
vim /root/hive-3.1.2/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- adjust host/port to your MySQL server -->
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>When schematool -dbType mysql -initSchema is executed, the metastore database will be created in MySQL. If initialization fails, delete this database before running it again. The name can be changed, as long as it is not a database that already exists in MySQL.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>The MySQL driver jar must be uploaded to Hive's lib folder.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <!-- replace with your MySQL user -->
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <!-- replace with your MySQL password -->
    <value>your_mysql_password</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>Force metastore schema consistency. If enabled, Hive verifies that the schema version stored in the metastore matches the version in the Hive jars, and turns off automatic schema migration (the user must upgrade Hive and migrate the schema manually). If disabled, only a warning is issued on version mismatch. The default is false (disabled).</description>
  </property>
</configuration>
9. Re-initialize the database
schematool -dbType mysql -initSchema
10. Run hive to enter the client.

Startup issue 1:
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Fix: vim hive-site.xml and make sure the driver class is com.mysql.cj.jdbc.Driver.

Startup issue 2:
 InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED
Reason: with the InnoDB engine, the READ-COMMITTED isolation level is not safe when binlog_format is STATEMENT. The binary log cannot be written because binlog_format is STATEMENT while one or more tables use a storage engine limited to row-based logging: InnoDB, at isolation level READ COMMITTED or READ UNCOMMITTED, can only log in row format.
Connect to MySQL and execute:
SET GLOBAL binlog_format='MIXED';
To prevent the error from recurring, edit /etc/my.cnf and set binlog-format=MIXED #set the binary log format
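To confirm the change took effect, you can query the variable directly in the same MySQL session (standard MySQL system variables, shown here as a quick check):

```sql
-- Global value: should report MIXED after the SET GLOBAL above
SELECT @@global.binlog_format;
-- Session value: existing sessions keep their old setting until reconnected
SELECT @@session.binlog_format;
```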

After entering hive, jps will show an additional RunJar process.

2. Hive commands are mostly the same as MySQL. You can run any hadoop dfs command inside hive by dropping the leading "hadoop", e.g. dfs -ls /; you can also run Linux commands by prefixing an exclamation mark, e.g. !ls /
The file .hivehistory under the user's home directory (e.g. /root) records the history of all commands executed in hive.
show databases like '*hive*'; #MySQL's wildcard is %, hive's is *
use default;
create table student(id int,name varchar(20)) row format delimited fields terminated by '\t';
insert into student values(100,'yangweiming');
show tables;
show create table table_name;
desc formatted table_name;
select * from student;
select count(1) from student;
select cast('1' as int); #force data type conversion
load data local inpath '/root/import_hive.txt' into table student; #import external data
hdfs dfs -put /root/import_put.txt /user/hive/warehouse/student2 #same effect as load data above; verify with select * from student2;
set mapred.reduce.tasks; #view a current system parameter in hive
set mapred.reduce.tasks=20; #set a current system parameter in hive
create database if not exists hive location '/user/hive/warehouse/'; #create a database
drop database hive cascade; #If there are tables in hive, cascade must be added

In a Linux bash shell you can use hive -e to execute SQL, which is very handy in shell scripts:
hive -e "select * from student;" > /root/data/my.txt
In a Linux bash shell you can use hive -f to execute a SQL file:
hive -f /root/hive_test.sql
You can set session-level system parameters when entering hive:
hive -hiveconf mapred.reduce.tasks=10

cd /root/hive-3.1.2/bin
./beeline
beeline> !connect jdbc:hive2://master:10000
Connecting to jdbc:hive2://master:10000
Enter username for jdbc:hive2://master:10000: root
Enter password for jdbc:hive2://master:10000: *******
21/01/07 14:33:01 [main]: WARN jdbc.HiveConnection: Failed to connect to master:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://master:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)
Add the following to /root/hadoop-2.10.1/etc/hadoop/core-site.xml. The cause is that everything here runs as the root user, and Hadoop thinks user root is impersonating root:
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
Then restart Hadoop for it to take effect.
3. After connecting in JDBC mode, you can operate on the Hadoop cluster just as in the hive CLI.

4. Hive data types, DDL and DML operations, basically the same as mysql
string #like varchar, but can store up to 2GB of data

struct #collection data type, similar to a Python named tuple (fields accessed by name)
map #collection data type, similar to a Python dictionary
array #collection data type, similar to a Python list
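A quick way to get a feel for these types is Hive's built-in constructors; a small sketch needing no table (aliases are illustrative):

```sql
-- array: ordered list, indexed from 0
select array('a','b','c')[1] as second_item;     -- 'b'
-- map: key/value lookup
select map('tom',8,'amy',6)['tom'] as tom_age;   -- 8
-- struct: fields accessed by name (via a subquery alias)
select addr.city
  from (select named_struct('street','hi road','city','sz') as addr) t;  -- 'sz'
```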

1. Managed table MANAGED_TABLE
2. External table EXTERNAL
3. Partition table
4. Bucket table
create external table if not exists hive.table_name2( #"external" creates an external table rather than a managed table; drop table on an external table does not delete the data on Hadoop, the data is still there after the table is rebuilt, so it is safer
id int,
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>)
partitioned by (yr_mon string,biz_dt string) #partition the table and specify the partition fields; partition columns must not appear in the column list above
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ';'
lines terminated by '\n'
stored as orc tblproperties ('orc.compress'='none'); #file storage format, see the hadoop storage formats in the hadoop advanced study notes. ORC compresses by default; you can turn it off with none, or use snappy compression
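Once table_name2 is populated, its collection columns can be read like this (column names from the DDL above; the map key 'xiaoming' is a hypothetical example):

```sql
select name,
       friends[0]           as first_friend,  -- array element by 0-based index
       children['xiaoming'] as child_age,     -- map value by key (hypothetical key)
       address.street       as street         -- struct field by name
  from hive.table_name2
 where yr_mon='202101';  -- filtering on a partition column prunes directories
```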

alter table hive.table_name2 set tblproperties('EXTERNAL'='TRUE'); #change the table into an external table
alter table hive.table_name2 set tblproperties('EXTERNAL'='FALSE'); #change the table into a managed (internal) table
alter table hive.table_name2 add partition(yr_mon='202103') partition(yr_mon='202102');
alter table hive.table_name2 drop partition(yr_mon='202103'), partition(yr_mon='202102');

show partitions hive.table_name2;

load data local inpath '/root/import_hive.txt' into table hive.table_name partition(yr_mon='202012'); #load data into a fixed partition
load data local inpath '/root/import_par.txt' into table hive.table_name2 partition(yr_mon='202101',biz_dt='20210101'); #load data into a fixed partition and sub-partition
load data local inpath '/root/import_par.txt' overwrite into table hive.table_name2 partition(yr_mon='202101',biz_dt='20210101'); #load data into a fixed partition and sub-partition, overwriting previous data
insert into table hive.table_name partition(yr_mon) #dynamic partitioning: hive distributes the rows to the right partitions automatically
select id,name,yr_mon from emp; #yr_mon, the partition field, must be the last field of the select

The difference between partition tables and bucket tables:
Partition table: rows are placed in different folders, i.e. different storage paths.
Bucket table: rows are hashed into different files by a chosen field; effectively also partitioning, just hash partitioning.

create table user_bucket(id int comment 'ID',name string comment 'name',age int comment 'age') comment 'test bucket'
clustered by (id) sorted by (id) into 4 buckets row format delimited fields terminated by '\t'; #create a bucket table, hash-partitioned by id into 4 buckets, each bucket sorted by id
When using bucket tables, a few parameters need to be set:
set hive.enforce.bucketing=true #enable bucketing enforcement
set mapreduce.job.reduces=-1 #let hive pick the reducer count to match the bucket count
select * from user_bucket tablesample(bucket 1 out of 4 on id); #bucket tables are mainly used for data sampling; rarely used

1. Upload the data first, then repair
dfs -mkdir -p /user/hive/warehouse/student/yr_mon=202101
msck repair table student;
2. Upload the data, then add the partition
dfs -mkdir -p /user/hive/warehouse/student/yr_mon=202101
alter table student add partition(yr_mon='202101');
3. Create the folder, then load data into the partition
dfs -mkdir -p /user/hive/warehouse/student/yr_mon=202101
load data local inpath '/root/student.txt' into table hive.student partition(yr_mon='202101');

alter table table_name rename to table_name_new;

insert into table hive.table_name  #append
select * from hive.table_name2;
insert overwrite table hive.table_name  #overwrite
select * from hive.table_name2;
create table table_name as
select * from xxx;
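Besides CTAS, Hive can also copy just a table's definition without its data:

```sql
-- Copies the schema, partitioning and storage format of table_name, but no rows
create table hive.table_name_copy like hive.table_name;
```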

truncate table default.student; #only managed tables can be truncated, not external tables

4. Multi-table joins
Joining a and b only supports equi-joins; the join condition cannot contain or, otherwise the join reports an error

5. Hadoop data export
hadoop fs -get /user/hive/warehouse/student/000000_0 /root/study/hadoop_get.txt #copy files from hadoop to local

insert overwrite local directory '/root/study/export/' row format delimited fields terminated by '\t' #overwrite wipes everything in the target directory, and root has broad permissions, so execute with caution; only give it a directory such as /root/study/export/, and everything inside that directory will be emptied
select * from default.student distribute by deptno sort by sal; #distribute by lets mapreduce partition the data first, one reduce task per department, each sorted by sal

hive -e 'select * from default.student;' > /root/study/hive_e/student.txt;

export table default.student to '/root/study/export_dir'; #exports to the hadoop cluster, not to the local filesystem; in effect the hive table's data is copied to an HDFS path
import table default.student partition(biz_dt='20210111') from '/root/study/export_dir'; #import the exported data; not commonly used

6. Select queries in hive
1. order by, sort by, cluster by, distribute by
order by (global sort): sorts the input globally, so there is only one reducer (multiple reducers cannot guarantee global order); because there is only one reducer, computation takes longer when the input is large.
sort by (per-reducer sort): each reducer sorts its own output; this is not a sort over the global result set.
distribute by (partitioning): controls which reducer a particular row goes to, usually for subsequent aggregation.
cluster by: when the distribute by and sort by fields are the same, cluster by can be used instead; it combines distribute by with sort by. However it only sorts in ascending order, and you cannot specify ASC or DESC as with sort by, otherwise an error is reported.

create table employInfo(deptID int,employID int,employName string,employSalary double)
row format delimited fields terminated by ',';

Import test data into the test table:
load data local inpath '/home/hadoop/datas/employInfo.txt' into table employInfo;

(1) Set the number of reducers to 4
set mapreduce.job.reduces=4;
(2) Import the query results into the file (divided by department number and sorted in descending order of salary)
insert overwrite local directory '/root/study/distribute-result'
select * from employInfo distribute by deptID sort by employSalary desc; #distribute by and sort by are generally used together
The following two writings are the same
select * from employInfo cluster by deptID;
select * from employInfo distribute by deptID sort by deptID;

2. Window function
select name,
       sum(cost) over (distribute by name sort by sal) windows_sum_sal, #hive window functions support the 4 sorting clauses above
       lag(orderdate,1) over (partition by name order by orderdate) windows_lag_orderdate,
       lead(orderdate,1) over (partition by name order by orderdate) windows_lead_orderdate,
       sum(cost) over (rows between unbounded preceding and current row) win_1, #unbounded preceding: no lower bound; current row: the window ends at the current row
       sum(cost) over (partition by name rows between 3 preceding and current row) win_2, #window starts 3 rows before the current row and ends at the current row
       sum(cost) over (partition by name rows between current row and 3 following) win_3, #window starts at the current row and ends 3 rows after it
       ntile(5) over (order by orderdate ASC NULLS LAST) win_ntile #split the rows into five groups by order date, earliest first; if the total is not divisible by 5, some groups may hold fewer than count(1)/5 rows
  from empl;
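Alongside lag/lead/ntile, the ranking window functions are common companions; a sketch over the same hypothetical empl table:

```sql
select name,
       cost,
       rank()       over (partition by name order by cost desc) as rk,  -- ties share a rank, gaps follow
       dense_rank() over (partition by name order by cost desc) as drk, -- ties share a rank, no gaps
       row_number() over (partition by name order by cost desc) as rn   -- strictly sequential numbering
  from empl;
```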

7. Hive-related functions
1. The system comes with functions
show functions; #view all functions
desc function trim; #view the description of the trim function
desc function extended trim; #view detailed usage examples of the trim function

select date_format(current_timestamp(),'yyyy-MM-dd'); #note lowercase dd: uppercase DD means day-of-year
select date_add(current_timestamp(),5);
select datediff(current_timestamp(),'2021-01-01');
select regexp_replace('2021/09/21','/','-');

if(10<5,'big','small') is similar to case when ... else ... end, which both MySQL and Oracle also have
select concat('a','b','c'),collect_set(col_to_row); #collect_set gathers a column into a set, similar to Oracle's wm_concat

select movie,category_name from movie_info lateral view explode(category) table_tmp as category_name;
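For the lateral view example to work, category must be a collection column; a plausible DDL for movie_info (hypothetical, inferred from the query) would be:

```sql
create table movie_info(
  movie string,
  category array<string>            -- e.g. ['Action','Sci-Fi']
) row format delimited fields terminated by '\t'
collection items terminated by ',';
-- explode(category) turns each array element into its own row, and
-- lateral view joins those rows back to the originating source row.
```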

2. User-defined function
To create a custom function, develop the jar package in Java, then upload it to the hive server:
sftp [email protected];
put <the jar package>;
mv <the jar package> /root/hive-3.1.2/lib/
Enter hive and run:
    add jar '/root/hive-3.1.2/lib/<the jar package>';
    create [temporary] function dbname.function_name as '<full class name of the java function, e.g. com.myname.MyUDF>';
    drop [temporary] function if exists dbname.function_name;

8. Hive optimization
set hive.fetch.task.conversion=more; #already defaults to more in recent versions; lets a simple fetch replace an MR job when reading data
set hive.exec.mode.local.auto=true; #turn on local mode
    A job can run in local mode when it meets all of the following conditions:
    1. The job's input size is smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB)
    2. The job's number of input files is smaller than hive.exec.mode.local.auto.input.files.max (default 4)
    3. The job's number of reduce tasks is 0 or 1
set hive.auto.convert.join=true; #turn on mapjoin, caching small tables in memory; on by default
set hive.mapjoin.smalltable.filesize=25000000; #defines a "small table" for the in-memory cache: anything below this size; default 25M

If a field used for joins or partitioning contains null values, mapreduce will route all those rows together; convert the nulls to a fixed (or salted) value before processing to prevent data skew.
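One common way to implement this, sketched with a hypothetical left join of tables a and b on key id: replace null keys with a random salt so the skewed null rows spread across reducers (the salted keys match nothing in b, so left-join results are preserved):

```sql
select a.*, b.val
  from a
  left join b
    on case when a.id is null
            then concat('null_', rand())  -- salt: null keys scatter across reducers
            else a.id
       end = b.id;
```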

set hive.map.aggr=true; #aggregate on the map side; on by default; mainly makes group by faster
set hive.groupby.skewindata=true; #load balancing when group by data is skewed; off by default

Write fewer select *; instead, select only the columns you need.

set hive.exec.dynamic.partition=true; #enable dynamic partitioning (on by default): partition by the value of a field in the data
set hive.exec.dynamic.partition.mode=nonstrict; #non-strict dynamic partition mode; the default is strict, which requires a partition to be specified when inserting
set hive.exec.max.dynamic.partitions=1000; #global maximum number of dynamic partitions; default 1000
set hive.exec.max.dynamic.partitions.pernode=1000; #maximum number of dynamic partitions per node; default 100; here set equal to the global maximum

Set a reasonable number of maps:
Increase the map count for complex inputs: when input files are large, the task logic is complex, and maps run slowly, consider increasing the number of maps to reduce the data each map processes, improving job efficiency.

Small file merging:
If a task has many small files (far below the 128MB block size), each small file is treated as a block and handled by its own map task; a map task's startup and initialization time then far exceeds its actual processing time, wasting a lot of resources.

Set the number of reduces reasonably:
(1) Too many reduces produce many small output files, which burdens the NameNode, and the job's total runtime may not even decrease;
(2) Too few reduces mean each reduce processes more data, easily causing OOM (out of memory) errors.
set hive.exec.reducers.bytes.per.reducer=256000000; #amount of data processed by each reduce
set hive.exec.reducers.max=1009; #maximum number of reducers per task
set mapreduce.job.reduces=-1; #number of reduces per job (order by and distinct use a single reduce by default); -1 means hive derives the count from the input size:
N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)
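As a worked example of the formula, Hive's own least/ceil functions can do the arithmetic: with 10GB of input and 256MB per reducer, N = min(1009, ceil(10737418240/256000000)) = min(1009, 42) = 42.

```sql
-- N = min(hive.exec.reducers.max, ceil(total input bytes / bytes per reducer))
select least(1009, ceil(10737418240 / 256000000)) as n_reducers;  -- 42
```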

Parallel execution
set hive.exec.parallel=true; #run independent stages of a query in parallel; off by default; only enable it when cluster resources are ample
set hive.exec.parallel.thread.number=8; #degree of parallelism

Strict mode
set hive.mapred.mode=strict; #in strict mode, hive refuses to execute several kinds of statements:
1. Cartesian products
2. Queries on partitioned tables that do not filter on a partition column
3. Comparisons between bigint and string or double
4. order by without a limit clause

JVM reuse
Set mapreduce.job.jvm.numtasks in mapred-site.xml; #one JVM can run multiple tasks of the same type in succession; the default is 1; -1 means unlimited.

Execution plan analysis
explain select * from emp;