This is the most detailed Hadoop article series you will find; bookmark and follow!
Future articles will include a directory of earlier posts in the series to help you review the key points.
Introduction to Hadoop 3.x
Hadoop 2.0 was developed on JDK 1.7, and public updates for JDK 1.7 ended in April 2015. This effectively forced the Hadoop community to release a new version based on JDK 1.8, namely Hadoop 3.0. Hadoop 3.0 introduces a number of important features and optimizations, including HDFS erasure coding, multiple-NameNode support, MapReduce native task optimization, YARN cgroup-based memory and disk I/O isolation, and YARN container resizing.
Hadoop 3.x will continue to adjust its architecture, with MapReduce processing data using memory, I/O, and disk together. The biggest change in Hadoop 3.x is HDFS: following the principle of computing close to the data, the local block is loaded into memory and computed first; the results then pass through I/O into a shared-memory compute area and are quickly assembled into a final result, an approach claimed to be up to 10 times faster than Spark.
New features of Hadoop 3.0
In terms of functionality and performance, Hadoop 3.0 makes a number of major improvements to the Hadoop kernel, including:
1. A streamlined Hadoop kernel: outdated APIs and implementations are removed, and default component implementations are replaced with the most efficient ones.
2. Classpath isolation, to prevent conflicts between different versions of JAR packages.
3. Shell script rewrite: Hadoop 3.0 rewrites the Hadoop management scripts, fixing a large number of bugs and adding new features.
HDFS sees major changes in reliability and supportability in Hadoop 3.x:
1. HDFS supports erasure coding of data, which allows HDFS to save roughly half of the storage space without reducing reliability.
2. Multi-NameNode support, i.e. a deployment with one active and multiple standby NameNodes in a single cluster.
Compared with previous versions, MapReduce in Hadoop 3.x has the following changes:
1. Native task optimization: MapReduce adds a C/C++ implementation of the map output collector (covering Spill, Sort, IFile, etc.), which can be enabled by adjusting a job-level parameter. For shuffle-intensive applications, performance can improve by about 30%.
2. Automatic inference of MapReduce memory parameters. In Hadoop 2.0, setting memory parameters for MapReduce jobs is cumbersome, and unreasonable settings cause serious waste of memory resources. Hadoop 3.0 avoids this situation.
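The switch in item 1 is a per-job setting. As a minimal sketch (the property and class names below follow the Hadoop native-task work; verify them against your distribution), enabling the native collector looks like:

```xml
<!-- mapred-site.xml, or set per job on the job configuration -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```

If the native library is unavailable for a given job (for example, an unsupported key type), the task falls back to failing fast rather than silently using the Java collector, so test the setting on a representative job first.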
HDFS Erasure Code
In Hadoop 3.x, HDFS implements a new feature: Erasure Coding, abbreviated EC, a data protection technology. It was first used in the communications industry for recovering data corrupted in transmission, and is a coding-based fault-tolerance technique.
It adds parity data to the original data so that the pieces become mathematically related; when errors occur within a certain range, the lost data can be recovered through erasure coding.
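The idea can be illustrated with the simplest possible erasure code: a single XOR parity block. This is a toy sketch, not the Reed-Solomon codes HDFS actually uses, but it shows how parity makes the pieces "related": any one lost data block can be rebuilt from the survivors plus the parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

# Three data blocks plus one XOR parity block (a toy single-parity code).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose data block 1 is lost: rebuild it from the survivors and the parity.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"BBBB"
```

Reed-Solomon generalizes this: with m parity cells instead of one, any m lost cells can be reconstructed, at the cost of more computation than plain XOR.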
Before Hadoop 3.0, HDFS stored 3 replicas of each block, which limits storage utilization to 1/3. Hadoop 3.0 introduces erasure coding (EC), achieving a storage layout equivalent to 1 copy of data plus 0.5 copies of parity data.
Compared with replication, erasure coding is a more space-efficient method of durable storage. A standard encoding such as Reed-Solomon(10,4) has 1.4x space overhead, whereas 3-replica HDFS has 3x space overhead.
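To make those numbers concrete, here is a quick check of space overhead, i.e. total storage divided by useful data, for replication versus two Reed-Solomon layouts (RS(6,3) is another policy HDFS ships; treat the exact policy names as version-dependent):

```python
def space_overhead(data_units, parity_units):
    """Total units stored per unit of useful data: (k + m) / k."""
    return (data_units + parity_units) / data_units

# 3-way replication: 1 data unit stored with 2 extra copies.
print(space_overhead(1, 2))   # 3.0
# Reed-Solomon(10,4): 10 data cells + 4 parity cells per stripe.
print(space_overhead(10, 4))  # 1.4
# Reed-Solomon(6,3): 6 data cells + 3 parity cells per stripe.
print(space_overhead(6, 3))   # 1.5
```

Note that RS(10,4) tolerates the loss of any 4 cells per stripe, one more failure than 3-way replication tolerates, while using less than half the space.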
MapReduce in Hadoop 3.x adds a native implementation of the map output collector; for shuffle-intensive jobs this can bring a performance improvement of 30% or more.
Support multiple NameNodes
The original HDFS NameNode high-availability implementation provided only a single active NameNode and a single standby NameNode; by replicating edit logs to three JournalNodes, that architecture can tolerate the failure of any one node in the system.
However, some deployments require a higher degree of fault tolerance. This new feature makes that possible by allowing users to run multiple standby NameNodes. For example, by configuring three NameNodes and five JournalNodes, the system can tolerate the failure of two nodes instead of just one.
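A minimal hdfs-site.xml sketch of the three-NameNode layout above (the nameservice name, NameNode IDs, hosts, and port are hypothetical placeholders; the property names follow the standard HDFS HA configuration):

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>host1.example.com:8020</value>
</property>
<!-- ...analogous rpc-address (and http-address) entries for nn2 and nn3... -->
```

Compared with the old two-NameNode setup, the only structural change is listing more than two IDs in dfs.ha.namenodes and supplying an address entry for each.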
Default port change
Before Hadoop 3.x, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000). This meant a service could fail to start because its port conflicted with another application.
These conflicting ports have now been moved out of the ephemeral range. The changes affect the NameNode, Secondary NameNode, DataNode, and KMS:
NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
Secondary NameNode ports: 50091 → 9869, 50090 → 9868
DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
KMS server port: 16000 → 9600 (the original 16000 conflicted with the HBase HMaster port)
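As a quick sanity check of the mapping above, the new defaults all sit outside the ephemeral range, while most (not all) of the old defaults were inside it; 8020 and 16000 were moved for conflict reasons rather than the ephemeral-range problem:

```python
# Old-to-new default port mapping, transcribed from the list above.
port_changes = {
    50470: 9871, 50070: 9870, 8020: 9820,                 # NameNode
    50091: 9869, 50090: 9868,                             # Secondary NameNode
    50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864,   # DataNode
    16000: 9600,                                          # KMS
}

EPHEMERAL = range(32768, 61001)  # Linux ephemeral port range cited above

# Every new default is outside the ephemeral range.
assert all(new not in EPHEMERAL for new in port_changes.values())
# 8 of the 10 old defaults were inside it (8020 and 16000 were not).
print(sum(old in EPHEMERAL for old in port_changes))  # 8
```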
YARN resource type
The YARN resource model has been generalized to support user-defined countable resource types, not just CPU and memory.
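As a sketch, a custom countable resource (here a hypothetical resource1) is declared cluster-wide in resource-types.xml, and each NodeManager advertises how much of it the node has. The property names follow the Hadoop 3 resource-model documentation; verify them against your version:

```xml
<!-- resource-types.xml (read by the ResourceManager and clients) -->
<property>
  <name>yarn.resource-types</name>
  <value>resource1</value>
</property>
<property>
  <name>yarn.resource-types.resource1.units</name>
  <value>G</value>
</property>

<!-- node-resources.xml (on each NodeManager) -->
<property>
  <name>yarn.nodemanager.resource-type.resource1</name>
  <value>5G</value>
</property>
```

Applications can then request resource1 alongside memory and vCores, and the scheduler treats it as just another countable dimension.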
The big data article series on this blog is updated daily; remember to bookmark and follow~