OLAP-impala-Big Data Week13-DAY6-impala

# Big data technology of impala

1. Preparation

Install the hive and hadoop runtime environments, and make sure hadoop and hive start normally

2. Theme

Install an impala cluster environment and master the basic syntax of impala

3. Goal

Become familiar with impala syntax

4. Knowledge points

Overview of offline task processing flow

[Figure: overview of the offline task processing flow]

CDH provides tarball installation for most software frameworks. However, because part of impala is written in C++, cloudera does not provide a tar package for impala; only rpm packages are provided, so we install impala from the downloaded rpm packages. Note: an rpm package is an installation package format used on the linux operating system.

1. Overview of impala

Basic introduction to impala

Impala is an efficient SQL query tool provided by cloudera that returns results in real time. According to the official tests it is 10 to 100 times faster than hive, and its SQL queries are faster than sparkSQL's; it is known as the fastest SQL query tool in the current big data field. Impala was implemented with reference to Dremel from Google's "new three papers" (Caffeine, Pregel, Dremel); the "old three papers" (BigTable, GFS, MapReduce) correspond to HBase, which we are about to learn, and to HDFS and MapReduce, which we have already learned.

Impala is built on top of hive and performs its computation in memory. It retains the characteristics of a data warehouse while providing real-time queries, batch processing, and high concurrency.

The relationship between impala and hive

Impala is a big data analysis and query engine built on top of hive. It uses hive's metadata directly, which means impala's metadata is stored in hive's metastore, and impala is compatible with most of hive's sql syntax. Therefore, to install impala you must first install hive, make sure the hive installation works, and start hive's metastore service.

Advantages of impala

1. Impala is very fast, because all computation can be completed in memory, as long as the memory is large enough

2. It abandons MapReduce computation in favor of a C++ implementation with targeted hardware optimizations

3. It has data warehouse characteristics and can analyze hive's raw data directly

4. It supports remote access via ODBC and JDBC

Disadvantages of impala:

1. Computation is memory-based, so it is highly dependent on memory

2. It is rewritten in C++, which makes it harder to maintain

3. It is built on hive and tightly coupled to it, so it lives and dies with hive

4. Its stability is not as good as hive's, although it does not lose data

Impala architecture and query plan

[Figure: impala architecture]

Impala's architecture modules:

impala-server ==> the daemon started on each worker node; it executes our query plans. The official recommendation is to install it alongside all the datanodes, so that fast data access can be achieved through hadoop's short-circuit read feature.

impala-statestore ==> the master node of the state store

impala-catalog ==> the master node of the metadata management area

Query execution

Impalad is divided into two layers: frontend and backend. The frontend is implemented in java (embedded in impalad through JNI) and is responsible for generating query plans, while the backend is implemented in C++ and is responsible for query execution.

The frontend generates the query plan in two stages:

(1) Generate a single-machine query plan. The single-machine execution plan is the same kind of plan a relational database produces, and similar query optimization methods are used.

(2) Generate a distributed query plan. Based on the single-machine plan, a truly executable distributed execution plan is generated, reducing data movement and trying to keep data and computation together (see the sketch below).
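To inspect the plan impala produces for a query, you can run explain from impala-shell. The sketch below is illustrative only: it assumes the course.score sample table used later in this document, and the aggregate-then-topN query merely mirrors the shape of the example in the figure.

set explain_level=3;   -- 0 to 3; level 3 prints the most detail
explain
select c_id, avg(s_score) as avg_score
from course.score
group by c_id
order by avg_score desc
limit 10;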

[Figure: an example SQL query, with the single-machine query plan on the left and the distributed query plan on the right]

The figure above shows an example SQL query. The goal of the SQL is to join three tables, compute an aggregation, and take the topN rows ordered by the aggregated column.

Impala's query optimizer supports a cost model: using statistics such as table and partition cardinality and the number of distinct values in each column, impala can estimate the cost of an execution plan and generate a better one. The left side of the figure above is the single-machine query plan generated by the frontend's query optimizer. Unlike in a traditional relational database, the single-machine plan cannot be executed directly; it must be converted into the distributed query plan shown in the right half of the figure. The distributed plan is divided into 6 segments (the colored borderless rounded rectangles in the figure), and each segment is a plan subtree that can be executed independently by a single server.
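The statistics this cost model relies on are collected explicitly. A minimal sketch, again assuming the course.score sample table used later in this document:

compute stats course.score;       -- gather table, partition and column statistics
show table stats course.score;    -- row counts and sizes
show column stats course.score;   -- distinct values, nulls and sizes per column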


2. Preparation of impala installation environment

Hadoop and hive need to be installed in advance, and the hive installation package must be copied to every server and kept there, because impala references some dependency jar packages under the hive installation directory.

3. Download all dependent packages of impala

Since impala does not provide a tar package, only rpm packages, we have to install it from rpm packages. The rpm packages are only provided by cloudera, so we download them from the cloudera website. The remaining problem is that impala's rpm packages depend on many other rpm packages. We could track down the dependencies one by one, or download all the rpm packages and turn them into a local yum source for installation. Here we choose to build a local yum source, so first we download all the rpm packages from the following address

http://archive.cloudera.com/cdh5/repo-as-tarball/5.14.2/cdh5.14.2-centos7.tar.gz

After the download is complete, keep it for future use

Upload the compressed package we downloaded to the /kkb/soft path of the node03 server, and decompress it

cd /kkb/soft
tar -zxvf cdh5.14.2-centos7.tar.gz

4. Make a local yum source

A mirror source is the address from which centos downloads software. By building our own mirror source we can control where the impala rpm packages are downloaded from. Here we use httpd as the web server and start the httpd service to serve our mirror source.

Here we choose the third machine as the server of the mirror source

Execute the following commands on the node03 machine

sudo yum  -y install httpd
sudo service httpd start

cd /etc/yum.repos.d
sudo vim localimp.repo 

[localimp]
name=localimp
baseurl=http://node03/cdh5.14.2/
gpgcheck=0
enabled=1

Create a symbolic link so that apache httpd can serve the packages

sudo ln -s /kkb/soft/cdh/5.14.2 /var/www/html/cdh5.14.2

Access the local yum source from a browser:

http://node03/cdh5.14.2

If the file browsing page loads normally, our local yum source has been set up successfully

Distribute the prepared localimp configuration file to all nodes that need to install impala

Execute the following commands on node03 to distribute the file
cd /etc/yum.repos.d/

sudo scp localimp.repo  node02:$PWD
sudo scp localimp.repo  node01:$PWD

5. Start to install impala

Installation planning

| service name       | node01         | node02         | node03  |
| ------------------ | -------------- | -------------- | ------- |
| impala-catalog     | do not install | do not install | install |
| impala-state-store | do not install | do not install | install |
| impala-server      | install        | install        | install |
# Execute the following commands on the master node node03 to install

sudo yum  install  impala -y
sudo yum install impala-server -y
sudo yum install impala-state-store  -y
sudo yum install impala-catalog  -y
sudo yum  install  impala-shell -y

# Install the following service on the worker nodes node01 and node02
sudo yum install impala-server -y

6. Configure impala for all nodes

Step 1: Modify hive-site.xml

Modify the content of hive-site.xml on node03 machine as follows

hive-site.xml configuration

vim /kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-site.xml
Add the following three configuration properties

<property>
	<name>hive.server2.thrift.bind.host</name>
 	<value>node03.hadoop.com</value>
</property>
 <property>
     <name>hive.metastore.uris</name>
     <value>thrift://node03.kaikeba.com:9083</value>
 </property>
<property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>3600</value>
 </property>

Step 2: Send the hive installation package to node02 and node01 machines

Execute on node03 machine

cd /kkb/install/

scp -r hive-1.1.0-cdh5.14.2/ node02:$PWD
scp -r hive-1.1.0-cdh5.14.2/ node01:$PWD

Step 3: Start hive's metastore service on node03

Start hive's metastore service on the node03 machine

cd  /kkb/install/hive-1.1.0-cdh5.14.2

nohup bin/hive --service metastore &
nohup bin/hive --service hiveserver2 &

Note: Make sure that the mysql service is started normally, otherwise the metastore service cannot be started

Step 4: Modify hdfs-site.xml of all hadoop nodes and add the following content

Create the following folder on all nodes

sudo mkdir -p /var/run/hdfs-sockets

Modify the hdfs-site.xml of all nodes to add the following configuration, after the modification, restart the hdfs cluster to take effect

vim /kkb/install/hadoop-2.6.0-cdh5.14.2/etc/hadoop/hdfs-site.xml

<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>
<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
</property>
<property>
    <name>dfs.client.file-block-storage-locations.timeout.millis</name>
    <value>10000</value>
</property>
<property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
</property>

Execute the following command on all three machines to grant ownership of the folder

sudo  chown  -R  hadoop:hadoop   /var/run/hdfs-sockets/

Step 5: Restart hdfs

Restart hdfs file system

Execute the following commands on the node01 server

cd /kkb/install/hadoop-2.6.0-cdh5.14.2
sbin/stop-dfs.sh
sbin/start-dfs.sh

Step 6: Create links to the hadoop and hive configuration files

The configuration directory of impala is /etc/impala/conf

core-site.xml, hdfs-site.xml and hive-site.xml need to be placed under this path; here we use soft links instead of copying the files.

All nodes execute the following command to create a link to the impala configuration directory

sudo ln -s /kkb/install/hadoop-2.6.0-cdh5.14.2/etc/hadoop/core-site.xml /etc/impala/conf/core-site.xml
sudo ln -s /kkb/install/hadoop-2.6.0-cdh5.14.2/etc/hadoop/hdfs-site.xml /etc/impala/conf/hdfs-site.xml
sudo ln -s /kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-site.xml /etc/impala/conf/hive-site.xml

Step 7: Modify the configuration file of impala

Modify the default configuration of impala on all nodes

On all nodes, modify impala's default configuration file and add the mysql connector driver package

sudo vim /etc/default/impala

IMPALA_CATALOG_SERVICE_HOST=node03
IMPALA_STATE_STORE_HOST=node03

Create a soft link to the mysql driver package on all nodes:

sudo mkdir -p /usr/share/java
sudo ln -s /kkb/install/hive-1.1.0-cdh5.14.2/lib/mysql-connector-java-5.1.38.jar /usr/share/java/mysql-connector-java.jar

Modify the java path of bigtop on all nodes

Modify the java_home path of bigtop

sudo vim /etc/default/bigtop-utils

export JAVA_HOME=/kkb/install/jdk1.8.0_141

Step 8: Start the impala service

Start impala service

The master node node03 starts the following three service processes

sudo service impala-state-store start
sudo service impala-catalog start
sudo service impala-server start

Start impala-server on the worker nodes node01 and node02

sudo service  impala-server  start

Three machines can use the following command to check whether the impala process exists

ps -ef | grep impala

Note: After startup, all impala logs are under the /var/log/impala path by default. There should be three impala processes on the node03 machine and only one process each on node02 and node01. If the number of processes is wrong, check the error logs in that directory.

Browser page access:

Access to the management interface of impalad

http://node03:25000/

Access the management interface of statestored

http://node03:25010/

Access catalog management interface

http://node03:25020

7. Use of impala

1. Impala-shell syntax

1.1, impala-shell external command parameter syntax

Command parameters that can be executed without entering the impala-shell interactive command line

impala-shell can be invoked with many parameters:

-h View help documentation

impala-shell -h

-r refreshes all metadata; when the amount of metadata is large this consumes significant server resources

impala-shell -r

-v View the corresponding version

impala-shell -v -V

-f execute query file

cd /kkb/install
vim impala-shell.sql

select * from course.score;

Use the -f parameter to execute the query file:

impala-shell -f impala-shell.sql

-p show query plan

impala-shell -f impala-shell.sql -p

1.2, impala-shell internal command syntax

Syntax that can be executed after entering the impala-shell command line

help command

Help document

connect command

connect hostname connects to a specific machine to execute queries
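For example (node02 is one of the hosts used in this document):

connect node02;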

refresh command

refresh dbname.tablename performs an incremental refresh of one table's metadata; it is mainly used to pick up data newly added to the table from hive.

refresh course.score;

invalidate metadata command:

invalidate metadata performs a full refresh, which is expensive; it is mainly used after a new database or table has been created in hive
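As a sketch, invalidate metadata can also be limited to a single table to reduce its cost (course.score is the sample table used in this document):

invalidate metadata;               -- full refresh of all cached metadata (expensive)
invalidate metadata course.score;  -- refresh the metadata of a single table only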

explain command:

Used to view the execution plan of the sql statement

explain select * from course.score;

The explain level can be set to 0, 1, 2 or 3; level 3 is the highest and prints the most complete information:

set explain_level=3;

profile command:

After executing a sql statement, profile prints more detailed execution steps.

It is mainly used for examining how a query ran and for cluster tuning.

select * from course.score;
profile;

Note: Data inserted, or databases and tables newly created, from the hive shell cannot be queried in impala directly; you need to refresh the metadata first. Data inserted from impala-shell can be queried in impala directly, without a refresh; this relies on the catalog service. Catalog is a module added in impala 1.2 whose main function is to synchronize metadata between the impala daemons.
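A minimal sketch of this refresh workflow, using the course.score sample table from earlier in this section:

-- after inserting or loading data into course.score from the hive shell:
refresh course.score;        -- make the newly added data visible to impala
-- after creating a brand new database or table from the hive shell:
invalidate metadata;         -- reload the catalog so impala sees the new objects
select count(*) from course.score;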

2. Create a database

impala-shell enters the interactive window of impala

2.1, view all databases

show databases;

2.2, create and delete the database

Create database

CREATE DATABASE IF NOT EXISTS mydb1;

drop database if exists mydb;

Create a database table and specify the location where the database table data is stored in hdfs (similar to the syntax of hive table creation)

hdfs dfs -mkdir -p /input/impala

create external table t3 (id int, name string, age int)
row format delimited fields terminated by '\t'
location '/input/impala/external';

3. Create a database table

Create the student table:

CREATE TABLE IF NOT EXISTS mydb1.student (name STRING, age INT, contact INT);

Create the employee table:

create table employee (Id INT, name STRING, age INT, address STRING, salary BIGINT);

3.1, insert data into the database table

insert into employee (ID, NAME, AGE, ADDRESS, SALARY) VALUES (1, 'Ramesh', 32, 'Ahmedabad', 20000);
insert into employee values (2, 'Khilan', 25, 'Delhi', 15000);
insert into employee values (3, 'kaushik', 23, 'Kota', 30000);
insert into employee values (4, 'Chaitali', 25, 'Mumbai', 35000);
insert into employee values (5, 'Hardik', 27, 'Bhopal', 40000);
insert into employee values (6, 'Komal', 22, 'MP', 32000);

Overwriting data

insert overwrite employee values (1, 'Ram', 26, 'Vishakhapatnam', 37000);

After the overwrite is executed, only this one row remains in the table.

Another way to create a table:

create table customer as select * from employee;

3.2. Data query

select * from employee;
select name, age from employee;

3.3, delete table

DROP table  mydb1.employee;

3.4. Clear table data

truncate  employee;

3.5. View view data

select * from employee_view;
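The employee_view queried above is not defined earlier in this document; a minimal sketch of creating it first (the selected columns are illustrative):

create view employee_view as
select id, name, age, salary
from employee;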

4. Order by statement

Basic grammar

select * from table_name ORDER BY col_name [ASC|DESC] [NULLS FIRST|NULLS LAST];

select * from employee order by id asc;

5. Group by statement

Select name, sum(salary) from employee Group BY name;

6, having statement

Basic grammar

select aggregate_function(col_name) from table_name GROUP BY col_name HAVING condition;

Group the table by age, select the maximum salary of each group, and show only the groups whose maximum salary is greater than 20000:

select max(salary) from employee group by age having max(salary) > 20000;

7, limit statement

select * from employee order by id limit 4;

8. Several ways to import data into impala tables

The first way is to load hdfs data to impala

create table user(id int, name string, age int) row format delimited fields terminated by "\t";

Prepare the data file user.txt and upload it to the /user/impala path on hdfs:

1	hello	15
2	zhangsan	20
3	lisi	30
4	wangwu	50

Load the data

load data inpath '/user/impala/' into table user;

Query the loaded data

select * from user;

If the data cannot be queried, refresh the table:

refresh user;

The second way:

create  table  user2   as   select * from  user;

The third way:

insert into ... values is not recommended because it generates a lot of small files

Never use impala as an ordinary database

The fourth way:

insert into ... select ... is used more often
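A minimal sketch of this approach, reusing the user and user2 tables created above:

insert into user2 select * from user where age > 20;   -- the filter is illustrative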

9, impala java development

In practical work, because impala queries are relatively fast, impala may be used for database queries. We can run impala queries from java code.

Step 1: Import the jar package

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>central</id>
        <url>http://repo1.maven.org/maven2/</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-common</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-metastore</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-service</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.thrift/libfb303 -->
    <dependency>
        <groupId>org.apache.thrift</groupId>
        <artifactId>libfb303</artifactId>
        <version>0.9.0</version>
        <type>pom</type>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.thrift/libthrift -->
    <dependency>
        <groupId>org.apache.thrift</groupId>
        <artifactId>libthrift</artifactId>
        <version>0.9.0</version>
        <type>pom</type>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.2.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.2.5</version>
    </dependency>
</dependencies>

Step 2: Java code query development of impala

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpalaJdbc {
    public static void main(String[] args) throws Exception {
        // Define the JDBC driver class, the connection url and the sql statement to execute
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String driverUrl = "jdbc:hive2://192.168.52.120:21050/mydb1;auth=noSasl";
        String sql = "select * from student";

        // Load the database connection driver via reflection
        Class.forName(driver);
        Connection connection = DriverManager.getConnection(driverUrl);
        PreparedStatement preparedStatement = connection.prepareStatement(sql);
        ResultSet resultSet = preparedStatement.executeQuery();
        // Find out how many columns the result set contains
        int col = resultSet.getMetaData().getColumnCount();
        // Iterate over the result set and print each row
        while (resultSet.next()) {
            for (int i = 1; i <= col; i++) {
                System.out.print(resultSet.getString(i) + "\t");
            }
            System.out.print("\n");
        }
        resultSet.close();
        preparedStatement.close();
        connection.close();
    }
}

Summary