This is the most detailed Hive series on the web; bookmarking and following are strongly recommended!
Later articles in the series will include a directory of the earlier posts to help you review the key knowledge points.
Hive Basic Concepts
1. Introduction to Hive
1. What is Hive
Hive is a data warehouse framework built on top of Hadoop. It was originally developed at Facebook and was later donated to the Apache Software Foundation, where it became an Apache open-source project.
As a Hadoop-based data warehouse tool, Hive can map structured data files to database tables and provides SQL-like query functionality.
In essence, Hive converts SQL into MapReduce jobs, with HDFS underneath providing the data storage. Put bluntly, Hive can be understood as a tool for translating SQL into MapReduce jobs; one could even say that Hive is a client for MapReduce.
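As a rough sketch of what "mapping a structured data file to a table" looks like in practice, the following HiveQL is illustrative only (the table name, columns, and HDFS path are made up for this example, not taken from the article):

```sql
-- Map a tab-delimited log file on HDFS to a table (hypothetical schema/path).
CREATE EXTERNAL TABLE access_log (
    ip  STRING,
    url STRING,
    ts  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/access_log';

-- A SQL-like query that Hive compiles into a MapReduce job:
-- the map phase emits (url, 1) pairs and the reduce phase sums them.
SELECT url, COUNT(*) AS pv
FROM access_log
GROUP BY url;
```

The GROUP BY query above is exactly the kind of statement Hive turns into a MapReduce job, sparing you from writing the map and reduce code by hand.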
2. Why use Hive
- Problems with using Hadoop directly
- The learning curve for staff is too steep
- Project timelines are too short for hand-written MapReduce
- Implementing complex query logic in MapReduce is difficult
- Why use Hive
- The interface uses SQL-like syntax, enabling rapid development
- It avoids writing MapReduce, lowering the learning cost for developers
- It is easy to extend with new functions
- Hive's biggest feature is that it lets you analyze big data through SQL-like statements, so you can avoid writing MapReduce programs and analyze data more easily.
- The data is stored on HDFS; Hive itself provides no data storage and only imposes structure on data that is already stored there.
- Hive maps the data into databases and tables; the metadata for these databases and tables is usually kept in a relational database (such as MySQL).
- Data storage: Hive can handle very large data sets and can directly access files stored in Apache HDFS or in other storage systems (such as Apache HBase).
- Data processing: because Hive statements are ultimately turned into MapReduce jobs, Hive is not suitable for real-time computing; it is suited to offline analysis.
- Besides the MapReduce engine, Hive also supports two other distributed computing engines, Spark and Tez.
- Many data storage formats are supported, for example plain-text files as well as binary formats.
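The engine and storage-format points above can be sketched in HiveQL as follows. This is illustrative only: switching engines requires Tez or Spark to actually be installed on the cluster, and the table names are made up.

```sql
-- Choosing the execution engine (one of these at a time):
SET hive.execution.engine=mr;    -- classic MapReduce
SET hive.execution.engine=tez;   -- Tez
SET hive.execution.engine=spark; -- Spark

-- Plain-text storage format:
CREATE TABLE t_text (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- A binary columnar format (ORC is one of Hive's built-in formats):
CREATE TABLE t_orc (id INT, name STRING)
STORED AS ORC;
```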
2. Hive Architecture
1. Architecture diagram
2. Basic composition
Client: the CLI (the hive shell command line), JDBC/ODBC (Java access to Hive), and the Web UI (browser access to Hive).
Metadata (Metastore): in essence it only records which databases and tables exist in Hive, the fields of each table, the database a table belongs to (the default is default), partitions, the directory where the table data lives, and so on. By default this metadata is stored in Hive's embedded Derby database, but storing the Metastore in MySQL is recommended.
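A minimal sketch of pointing the Metastore at MySQL via hive-site.xml; the host, database name, and credentials below are placeholders, not values from the article:

```xml
<!-- hive-site.xml: use MySQL for the Metastore instead of embedded Derby. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```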
Driver: a HiveQL statement passes through the following stages inside the driver:
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party tool library such as ANTLR. Semantic analysis is then performed on the AST, for example checking whether the table exists, whether the fields exist, and whether the SQL statement contains errors.
(2) Compiler (Physical Plan): compiles the AST into a logical execution plan.
(3) Query Optimizer: optimizes the logical execution plan.
(4) Execution: converts the logical execution plan into a physical plan that can actually run; for Hive this means MapReduce or Spark jobs.
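To see the plan these stages produce, Hive's EXPLAIN statement can be used. The query and table here are hypothetical:

```sql
EXPLAIN
SELECT url, COUNT(*) AS pv
FROM access_log
GROUP BY url;
-- The output lists the stages of the plan and their dependencies,
-- typically a Map/Reduce stage containing operators such as
-- TableScan, Group By Operator, and Reduce Output Operator.
```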
Storage and execution: Hive uses HDFS for storage and MapReduce for computation.
3. Comparison between Hive and traditional databases
Summary: Hive has the outward appearance of a SQL database, but the application scenarios are completely different; Hive is only suitable for batch statistical analysis over large data sets.
The big data series on this blog is updated daily, so remember to bookmark and follow~