Hadoop offline day08 -- Data Warehouse, Apache Hive



Today's content outline

#1. Data warehouse fundamentals
    - What a data warehouse is: a platform for data analysis, oriented to analysis
    - Where data warehouses come from
    - Core features of a data warehouse
    - Differences between a data warehouse and a database
        - Clearing up a core misconception: a data warehouse is by no means a large database
        - OLTP: T for Transaction
        - OLAP: A for Analytical
    - Layered architecture of a data warehouse
#2. Apache Hive, the best-known data warehouse software in big data today
    - What Hive is: a data warehouse; a data warehouse based on Hadoop; how to understand its relationship with Hadoop
    - Core functions:
        1. Map a structured file to a table (Table)
        2. Provide SQL analysis capability on top of tables (Hive SQL / HQL, SQL-like)
    - Essence: user HQL ---> Hive converts it to MapReduce ---> data analysis
    - Hive architecture components
    - Hive installation and deployment
        - Not distributed software itself; its distributed capability is implemented on Hadoop
        - metastore and metadata
        - Remote mode deployment and installation
        - Client usage, including from IDEA
#3. HQL -- DDL -- Create Table
    - DDL, the data definition language, describes table structure
    - The CREATE TABLE statement is extremely important (it accounts for 99.9999% of the work)

Data Warehouse Fundamentals

Data warehouse concept

A data warehouse is called Data Warehouse in English, or DW for short.

It is an integrated data analysis platform, built for analysis and used to carry out analysis. The analysis results provide decision-making support for the enterprise.

Data warehouse itself does not produce data

Its data comes from various data sources:
RDBMS relational databases ---> business data
log files ---> log data
crawler data
other data

Data warehouse itself does not consume data

Its analysis results are used by various external data applications:
Data Visualization (DV)
Data Report
Data Mining (DM)
Ad-Hoc query: an ad-hoc query lets users flexibly choose query conditions based on their own needs, and the system generates the corresponding statistical report from those choices. The biggest difference from an ordinary application query is that an ordinary query is custom-developed in advance, while an ad-hoc query's conditions are defined by the user.
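As a rough illustration of an ad-hoc query (the table and column names below are hypothetical, not part of the course data), it is simply a SELECT whose conditions the analyst decides on the spot:

-- Hypothetical ad-hoc query: the analyst decides, on the spot, to look at
-- May 2021 orders from Beijing grouped by product category.
select category,
       count(*)    as order_cnt,
       sum(amount) as total_amount
from t_orders                                   -- assumed table name
where city = 'beijing'
  and dt between '2021-05-01' and '2021-05-31'  -- assumed date column
group by category;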

Core Features of Data Warehouse

Subject-oriented

Analysis is organized around subjects. A subject is an abstract concept, a synthesis of the data to be analyzed; one analysis subject can correspond to multiple data sources.

Integration

The data warehouse itself does not produce data; the data for its analysis subjects comes from different data sources and must be integrated under the warehouse's subjects. Because a subject's data comes from different sources, there may be differences and the sources may have different structures; the data integrated under a subject must be clean, well-formed, uniformly structured data.

This requires an ongoing process: ETL (Extract, Transform, Load), sketched below.
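A minimal HiveQL sketch of one ETL step, assuming hypothetical ods.raw_order and dw.user_order tables (none of these names come from the course material):

-- Load cleaned data from a raw ODS table into a DW subject table.
insert overwrite table dw.user_order                -- Load
select
    cast(o.order_id as bigint)  as order_id,        -- Transform: unify types
    trim(lower(o.user_name))    as user_name,       -- Transform: normalize text
    o.amount                    as amount
from ods.raw_order o                                -- Extract: raw copy in ODS
where o.order_id is not null;                       -- Transform: drop dirty rows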

Non-volatile (not updated in place)

The data in a warehouse is almost never modified; the operations on it are analysis operations. A data warehouse is a platform for analyzing the patterns in data, not for creating them; "non-volatile" means the patterns in the data are not modified. In practice it turns out some data does need to be corrected over time (data correction, slowly changing data).

# What we study now focuses on offline data: the data analyzed is historical data, data from the past (t+1, t+7)
# Later we will also touch real-time data warehouses: kafka + flink

Time-varying

Viewed along the time dimension, subject data in the warehouse clearly changes in batches: analyze once a day, once a week; batch analysis.

batch (days/months, offline) ---> batch (seconds, Spark Streaming micro-batching) ---> stream (milliseconds, Flink/Storm)

Data warehouse and database

The essence is the difference between OLTP and OLAP systems.

OLTP: On-Line Transaction Processing.

An OLTP system emphasizes data safety, integrity, and response time. It usually means an RDBMS (relational database).

# transaction-oriented, supports transactions
# RDBMS: MySQL, Oracle
# Note: not NoSQL databases without transaction support, such as Redis and HBase

OLAP: On-Line Analytical Processing (pronounced roughly "oh-lap").

An OLAP system emphasizes data analysis. It mainly refers to data warehouses, data marts (small data warehouses), analysis-oriented databases, and analysis-oriented software.

# analysis-oriented, supports analysis
# data warehouses: Apache Hive, Apache Impala, Apache Kylin

Conclusion

A data warehouse is by no means a large database, even if its appearance and syntax are similar to those of a database.

A data warehouse is not meant to take over a database's responsibilities; it is used mainly for data analysis rather than transaction processing.

Data warehouse layered architecture

The layers follow how data flows into and out of the data warehouse.

The most basic and classic three-tier architecture. In practice, companies build other layers according to their needs.

Specifically:

ODS: source data layer (Operational Data Store)

Data pulled from each data source is stored here: decoupling, temporary storage. Data from different sources usually differs a lot, so this layer is not used directly for analysis.

DW: data warehouse layer, the core

Its data comes from the ODS layer and, through layer after layer of ETL, becomes data organized in various models: clean, well-formed, and unified. All kinds of analysis are then carried out on these models.

Depending on business complexity, companies further divide the DW layer into sub-layers, which store large amounts of intermediate results.

DA: data application layer

The final consumers of the warehouse's data: data reports, data mining, and so on.

Why do data warehouses need to be layered?

Decoupling

Trading space for time: it improves the efficiency of data usage at the final application layer.


Apache Hive

Hive is a data warehouse tool built on Hadoop, open-sourced by Facebook. It is an analysis-oriented tool.

Why is Hive said to be based on Hadoop?

# What capabilities should a qualified data warehouse software have?
    1. The ability to store data
    2. The ability to analyze data

As data warehouse software, Hive of course has both capabilities; it just implements them on top of Hadoop:
    1. Data storage ---> Hadoop HDFS
    2. Data analysis and computation ---> Hadoop MapReduce

# For these reasons, Hive is said to be based on Hadoop.

In this process, what does Hive itself do? Where does its greatest appeal lie?

Map the structured data file to a database table

# What is a mapping?
A mapping is simply a correspondence relationship, like y = 2x + 1.

structured file <--- mapping ---> table
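For instance (the file contents and table name below are made up for illustration), a comma-separated file of users could be mapped to a table as follows:

-- Suppose a structured file on HDFS contains lines such as:
--   1,zhangsan,18
--   2,lisi,20
-- A table definition like this maps each line to a row and each
-- comma-separated field to a column (names are illustrative):
create table t_user(id int, name string, age int)
row format delimited fields terminated by ',';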

And it provides SQL-like query capability, allowing users to write SQL to carry out data analysis.

Hive SQL ---> MapReduce ---> data analysis
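For example, a simple aggregation like the one below (using a table such as the illustrative t_user(id, name, age) sketched earlier) is compiled by Hive into a MapReduce job and executed on the cluster; the user only writes SQL:

-- Hive translates this HQL into a MapReduce job behind the scenes.
select age, count(*) as cnt
from t_user
group by age;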

Hive's architectural components

User interface

Provides ways for users to write HQL, including the CLI, JDBC, and a Web UI.

Hive Driver driver service

The core of Hive: it handles everything from receiving HQL to compiling it into a MapReduce program. SQL parsing, compilation, validation, optimization, and plan generation.

Metadata storage

Metadata is descriptive data, data that records information about other data.

In Hive, metadata refers to the various mapping relationships between tables and structured files; only with these relationships can the files be operated on correctly.

Metadata is usually stored in an RDBMS.

Hadoop components

Execution engine

It uses MapReduce to process by default, and now supports other engines, such as Tez and Spark.

Storage components

HDFS, HBase

Conclusion: Hive itself is not distributed software and only needs to be deployed on a single machine, yet it has distributed capabilities.

Its distributed storage and distributed computing capabilities rely on Hadoop. Hive itself specializes in converting SQL into MapReduce.

Hive and database relationship

In appearance, data model, and syntax, Hive looks very similar to a database (e.g. MySQL).

The underlying application scenarios are completely different.

Hive belongs to the OLAP category: it is analysis-oriented and focuses on data analysis (select).

A database belongs to the OLTP category: it is transaction-oriented and focuses on transactional data interaction (CRUD).

Hive is by no means a large database, nor is it intended to replace a database like MySQL.


Apache Hive installation and deployment

Pre-knowledge: metadata related

metadata

Metadata; in Hive it refers to the mapping relationships between tables and files. Metadata is stored in an RDBMS, in one of two ways:
    1. Hive's built-in Apache Derby (a lightweight embedded RDBMS, quite limited)
    2. An external, third-party MySQL

metastore metadata service

The service through which Hive metadata is accessed; to some extent it keeps the metadata safe.

The three deployment modes of Hive are related to the above two concepts.

Where is metadata stored?

Does the metastore service need to be configured and started separately, or is it embedded in the Hive driver and started together with it?

Hive's three deployment modes

The differences:

1. Where is metadata stored?
2. Does the metastore service need to be configured and started separately, or is it embedded in the Hive driver and started together with it?

Embedded mode

1. Uses the built-in Derby to store metadata.
2. metastore is embedded in Hive: no separate configuration, no separate startup.

Suitable for just trying Hive out.

Local mode

1. Uses MySQL to store metadata.
2. metastore is embedded in Hive: no separate configuration, no separate startup.

Suitable for test scenarios.

Remote mode

#1. Uses MySQL to store metadata.
#2. metastore is configured and started separately; it is globally unique.

<!-- remote mode metastore deployment: metastore address -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://node1:9083</value>
</property>

Suitable for production environments.

In this course we deploy and install in remote mode, which matches a real production environment.

Hive remote mode installation

Server base environment

Hive depends on Hadoop and MySQL; make sure both services are up and available.

Wait for HDFS to leave safe mode. Make sure the MySQL permissions are in place.

step1: Modify the Hadoop configuration core-site.xml to set the hive proxy user. All 3 machines need to be modified, and Hadoop must be restarted for the change to take effect.

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

step2: Upload the Hive installation package, decompress and rename it (you only need to install it on node1)

step3: modify hive configuration file

hive-env.sh

export HADOOP_HOME=/export/server/hadoop-2.7.5
export HIVE_CONF_DIR=/export/server/hive/conf
export HIVE_AUX_JARS_PATH=/export/server/hive/lib

hive-site.xml

<configuration>
<!-- MySQL configuration for metadata storage -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hadoop</value>
</property>
<!-- host that HiveServer2 binds to -->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>node1</value>
</property>
<!-- remote mode metastore deployment: metastore address -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://node1:9083</value>
</property>
<!-- disable metastore event DB notification API auth -->
<property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
</property>
<!-- disable metastore schema version verification -->
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>
</configuration>

step4: add mysql driver and hive-jdbc-2.1.0-standalone.jar to hive

mysql driver

mysql-connector-java-5.1.32.jar

hive-jdbc-standalone.jar

cp /export/server/hive/jdbc/hive-jdbc-2.1.0-standalone.jar /export/server/hive/lib/

step5: Manually initialize the metastore

cd /export/server/hive/

bin/schematool -dbType mysql -initSchema

step6: start the metastore service

Foreground start

/export/server/hive/bin/hive --service metastore

# to stop: Ctrl+C ends the metastore service

Background (detached) start

nohup /export/server/hive/bin/hive --service metastore &

# by default (when output is not redirected), the nohup command writes a file named nohup.out to the current directory
# to stop: use jps to find the RunJar process, then kill -9 it

Hive CLI command line client

Overview

Since the development of Hive, there have been two generations of command-line clients.

The first generation belongs to the shell client bin/hive

The second generation belongs to the jdbc client bin/beeline

First generation client

Command: bin/hive

Needs to access: the metastore service

Configuration

<configuration>
<!-- remote mode metastore deployment: metastore address -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://node1:9083</value>
</property>
</configuration>

Demo: to simulate an enterprise environment, scp the Hive installation package to the node3 machine.

hive service (node1) <------------ hive client

Second-generation client

Command: bin/beeline

Needs to access: the hiveserver2 service (which in turn accesses the metastore service)

The client needs no configuration; the server needs the following parameter added:

<!-- host that HiveServer2 binds to -->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>node1</value>
</property>

Start service

nohup /export/server/hive/bin/hive --service metastore &

nohup /export/server/hive/bin/hive --service hiveserver2 &

Then connect with beeline

/export/server/hive/bin/beeline

beeline> ! connect jdbc:hive2://node1:10000   # JDBC address
Connecting to jdbc:hive2://node1:10000
Enter username for jdbc:hive2://node1:10000: root  # username; must have permission to operate on HDFS
Enter password for jdbc:hive2://node1:10000:       # password; may be empty
Driver: Hive JDBC (version 2.1.0)
21/06/01 16:44:36 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://node1:10000>

Hive experience

Purpose: How to create a table in Hive and successfully map a structured data file.

Conjecture 1: Must the order, number, and types of the table's fields be consistent with the file? They must be consistent.

If the types are inconsistent, Hive will try to convert them, but success is not guaranteed. If the conversion succeeds, the content is displayed; if it fails, null is displayed.

create table t_1(id int,name string,age int);  --no delimiter specified

create table t_2(id int,name string,age int) row format delimited fields terminated by ',';

create table t_3(id string,name int,age int) row format delimited fields terminated by ',';  --types differ from the file

Conjecture 2: Where should the file be placed on HDFS? In the table's corresponding directory? Must it be placed there? Not necessarily.

# A table created in Hive has, by default, a corresponding directory on HDFS:
/user/hive/warehouse/<database name>.db/<table name>
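One way to check where a table's directory actually is: DESCRIBE FORMATTED prints the table's Location along with its other metadata (shown here for the t_2 table created above):

-- Show detailed metadata for t_2, including its Location on HDFS
-- (normally /user/hive/warehouse/<database name>.db/t_2).
describe formatted t_2;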

Conjecture 3: Must the file delimiter be specified? Not necessarily.

When creating a table, why do we care so much about mapping the file successfully?

Because only when the mapping succeeds does the table have data, and only when there is data can we write SQL against it for analysis.

With no table, or a table with no data, there is nothing to analyze.


Hive SQL DDL--Create Table

IF NOT EXISTS

Function: ignore the error that is raised when the table already exists.

0: jdbc:hive2://node1:10000> create table t_2(id int,name string,age int) row format delimited fields terminated by ',';
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table t_2 already exists) (state=08S01,code=1)
0: jdbc:hive2://node1:10000> creaeadaa table t_2(id int,name string,age int) row format delimited fields terminated by ',';
Error: Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near 'creaeadaa' 'table' 't_2' (state=42000,code=40000)

--Error while processing statement: an error during execution; a problem with the execution logic, the SQL itself is fine
--Error while compiling statement: an error during compilation; a SQL syntax problem (keywords, unsupported features, clause order)
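With IF NOT EXISTS, re-running the same create statement no longer fails; a minimal sketch:

-- Re-creating an existing table with IF NOT EXISTS does nothing
-- instead of raising AlreadyExistsException.
create table if not exists t_2(id int, name string, age int)
row format delimited fields terminated by ',';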

Hive data types

In addition to SQL-style data types, Hive also supports Java data types, such as string.

Data type names are case-insensitive in Hive.

In addition to primitive data types, Hive also supports complex (composite) data types, such as Array and Map.

For complex data types, the delimiter syntax must be specified when creating the table so that the data can be parsed correctly.

Hive data type conversion

Although implicit conversion can automatically convert some types, it is best to choose the exact type based on the data, to avoid conversion failures.

Users can convert explicitly with cast(value as new_type).
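A minimal sketch of explicit conversion (the second statement reuses the t_2 table from the earlier examples):

-- Explicitly convert values; if a value cannot be converted
-- (e.g. cast('abc' as int)), the result is NULL.
select cast('18' as int);
select cast(age as double) from t_2;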

Hive SerDe mechanism

Background: How does Hive read and write files on HDFS?

HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

# Read flow (how a file is mapped)
1. The file is read through an InputFormat class; the default implementation is TextInputFormat, which reads the data line by line.
2. A SerDe class then deserializes it; the default implementation is LazySimpleSerDe.
   During deserialization, the delimiter must be specified so the data can be split and matched to the table's fields.

# Write flow
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

Delimiter specification

Syntax

row format delimited | serde serde_name

--row format: begins the delimiter specification
    --delimited: use the default SerDe class for (de)serialization (LazySimpleSerDe)
    --serde serde_name: use another SerDe class, e.g. JsonSerDe

delimited sub-syntax

[fields terminated by char]            --delimiter between fields
[collection items terminated by char]  --delimiter between collection elements
[map keys terminated by char]          --delimiter between key and value in map data
[lines terminated by char]             --line terminator

Example

--How do the two options compare: mapping Array-type data with a string column vs. an Array column?

--Map the file with an Array column
create table t_array(name string,work_locations array<string>) row format
delimited fields terminated by '\t' collection items terminated by ',';
+---------------+----------------------------------------------+--+
| t_array.name  |            t_array.work_locations            |
+---------------+----------------------------------------------+--+
| zhangsan      | ["beijing","shanghai","tianjin","hangzhou"]  |
| wangwu        | ["shanghai","chengdu","wuhan","haerbin"]     |
+---------------+----------------------------------------------+--+

--Map the file with a string column
create table t_string(name string,work_locations string) row format
delimited fields terminated by '\t';

+----------------+------------------------------------+--+
| t_string.name  |      t_string.work_locations       |
+----------------+------------------------------------+--+
| zhangsan       | beijing,shanghai,tianjin,hangzhou  |
| wangwu         | shanghai,chengdu,wuhan,haerbin     |
+----------------+------------------------------------+--+

--Requirement: query each user's second work city
select split(work_locations,",")[1] from t_string;
select work_locations[1] from t_array;
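A similar sketch for the Map type, which additionally needs map keys terminated by (the file contents and table name here are assumptions, not course data):

-- Hypothetical file line:  zhangsan<TAB>chinese:90,math:95,english:80
-- Pairs are separated by ',' and the key/value inside a pair by ':'.
create table t_map(name string, scores map<string,int>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

-- Access a single key of the map column:
select name, scores['math'] from t_map;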

Hive default separator

When creating a table, if no row format clause is written, the default delimiter is used to split the data fields.

If the delimiter used in the file happens to be this default delimiter, the table mapping succeeds even though no delimiter was specified.

The default delimiter is an invisible (non-printing) character, code \001.
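A minimal sketch (table name illustrative): a table created with no row format clause expects its underlying file to use \001 between fields:

-- No row format clause: Hive assumes fields are separated by '\001'.
-- A matching file line would look like 1^Azhangsan^A18, where ^A is \001.
create table t_default(id int, name string, age int);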

To type it in the vim editor: press Ctrl+V followed by Ctrl+A.

In practice, when cleaning data, companies deliberately use \001 to separate fields so the data is easy to store and analyze, as in the Java snippet below.

public String toString() {
    return upFlow + "\001" + downFlow + "\001" + sumFlow;
}

Hive's external and internal tables

The differences:

1. Whether the external keyword is used when creating the table: with it, the table is an external table; without it, an internal table.

2. What happens when the table is dropped:

Dropping an internal table deletes both the Hive table's metadata and the files mapped on HDFS.

Because the HDFS files of an internal table are also controlled by Hive, it is also called a managed table.

Dropping an external table deletes only the Hive table's metadata; the mapped files remain on HDFS.

Example

--internal table
create table student_inner(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

--external table
create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

Choosing external tables protects, to a certain extent, the safety of the data on HDFS.
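A quick sketch of the difference when dropping the two tables created above:

-- Dropping the internal table removes its metadata AND its files on HDFS.
drop table student_inner;

-- Dropping the external table removes only its metadata;
-- the files it was mapped to stay on HDFS.
drop table student_ext;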

Location syntax

Function: Specify the storage path of the file mapped by the Hive table on HDFS

Default path: /user/hive/warehouse/<database name>.db/<table name>

You can use location to specify any path on HDFS.

--A structured data file sits under /stu on HDFS; create a table that maps it
create external table student_location(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' location '/stu';

Even so, it is recommended to keep HDFS data under the same (default) directory structure; a regular layout is easier to manage.