Big Data Hadoop in 2021 (30): Introduction to Hadoop 3.x

The most detailed Hadoop article series on the whole network. We strongly recommend bookmarking and following!
Future articles will include a directory of past posts to help you review the key knowledge points.

Table of Contents

This series of historical articles

Preface

Introduction to Hadoop 3.x

Introduction

New features of Hadoop 3.0

Versatility

HDFS

MapReduce

HDFS Erasure Code

MapReduce optimization

Support multiple NameNodes

Default port change

YARN resource type


This series of historical articles

Big Data Hadoop in 2021 (29): Common YARN Parameter Settings

Big Data Hadoop in 2021 (28): YARN's Scheduler

Big Data Hadoop in 2021 (27): The YARN Operation Process

Big Data Hadoop in 2021 (26): Introduction to the Three Major Components of YARN

Big Data Hadoop in 2021 (25): A Plain Introduction to YARN and Its Basic Architecture

Big Data Hadoop in 2021 (24): MapReduce Advanced Practice

Big Data Hadoop in 2021 (23): A Detailed Explanation of the MapReduce Execution Mechanism

Big Data Hadoop in 2021 (22): Custom Grouping in MapReduce

Big Data Hadoop in 2021 (21): The Combiner in MapReduce

Big Data Hadoop in 2021 (20): Sorting and Serialization in MapReduce

Big Data Hadoop in 2021 (19): MapReduce Partitioning

Big Data Hadoop in 2021 (18): MapReduce Program Run Modes and In-Depth Analysis

Big Data Hadoop in 2021 (17): MapReduce Programming Specifications and Sample Code

Big Data Hadoop in 2021 (16): Introduction to the MapReduce Computing Model

Big Data Hadoop in 2021 (15): Hadoop's Federation Mechanism

Big Data Hadoop in 2021 (14): The HDFS High-Availability Mechanism

Big Data Hadoop in 2021 (13): Other Features of HDFS

Big Data Hadoop in 2021 (12): HDFS API Operations

Big Data Hadoop in 2021 (11): Metadata-Assisted Management in HDFS

Big Data Hadoop in 2021 (10): The HDFS Data Read and Write Process

Big Data Hadoop in 2021 (9): Advanced HDFS Commands

Big Data Hadoop in 2021 (8): Using the HDFS Shell Command Line

Big Data Hadoop in 2021 (7): Introduction to the HDFS Distributed File System

Big Data Hadoop in 2021 (6): The Most Detailed Hadoop Cluster Setup Guide

Big Data Hadoop in 2021 (5): Hadoop Architecture

Big Data Hadoop in 2021 (4): Hadoop Distribution Vendors

Big Data Hadoop in 2021 (3): Hadoop Applications at Home and Abroad

Big Data Hadoop in 2021 (2): A Brief History of Hadoop and Its Features and Advantages

Big Data Hadoop in 2021 (1): Introduction to Hadoop

Preface

The most detailed big data notes of 2021 on the entire network, taking you easily from beginner to proficient. This column is updated daily with summarized knowledge sharing.

Introduction to Hadoop 3.x

Introduction

Hadoop 2.0 was developed against JDK 1.7, and when JDK 1.7 stopped receiving updates in April 2015, the Hadoop community was effectively forced to release a new version based on JDK 1.8: Hadoop 3.0. Hadoop 3.0 introduced several important features and optimizations, including HDFS erasure coding, support for multiple NameNodes, MapReduce native task optimization, cgroup-based memory and disk I/O isolation in YARN, and YARN container resizing.

Hadoop 3.x will continue to adjust its architecture, with MapReduce processing data using memory, I/O, and disk together. The biggest change in Hadoop 3.x is in HDFS: following the principle of computing on the nearest block, local blocks are loaded into memory and computed first, and intermediate results are then combined through I/O and a shared in-memory computation area to quickly produce the final result, which is reportedly up to ten times faster than Spark.

New features of Hadoop 3.0

In terms of functionality and performance, Hadoop 3.0 makes a number of major improvements to the Hadoop core, including:

Versatility

1. A streamlined Hadoop core, including the removal of outdated APIs and implementations, with default components replaced by the most efficient implementations available.

2. Classpath isolation: prevents conflicts between different versions of JAR packages.

3. Shell script rewrite: Hadoop 3.0 reworked the Hadoop management scripts, fixing a large number of bugs and adding new features.

HDFS

HDFS has seen major improvements in reliability and capability in Hadoop 3.x:

1. HDFS supports erasure coding of data, which allows HDFS to use roughly half the storage space without reducing reliability.

2. Multi-NameNode support, i.e., a cluster can be deployed with one active NameNode and multiple standby NameNodes.

MapReduce

Compared with previous versions, MapReduce in Hadoop 3.x has the following changes:

1. Native task optimization: a C/C++ implementation of the map output collector (covering spill, sort, IFile, etc.) has been added to MapReduce, and jobs can switch to it via a job-level parameter. For shuffle-intensive applications, performance can improve by about 30%.

2. Automatic inference of MapReduce memory parameters. In Hadoop 2.0, setting memory parameters for MapReduce jobs was cumbersome, and unreasonable settings led to serious waste of memory resources. Hadoop 3.0 avoids this situation.

HDFS Erasure Code

In Hadoop 3.x, HDFS implements a new feature: Erasure Coding, abbreviated EC, a data protection technique. It was first used in the communications industry for recovering data lost during transmission, and it is a fault-tolerant coding technique.

It adds parity data to the original data so that the parts become mathematically related; when errors occur within a certain range, the original data can be recovered through erasure coding.

Before Hadoop 3.0, HDFS stored 3 replicas of each block, which limited storage utilization to only 1/3. Hadoop 3.0 introduces erasure coding (EC), achieving a storage scheme equivalent to one copy of data plus roughly 0.5 copies of redundant parity data.

Compared with replication, erasure coding is a more space-efficient way to store data durably. A standard encoding such as Reed-Solomon(10,4) has a 1.4x space overhead, whereas 3-way HDFS replication has a 3x space overhead.
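The overhead comparison above can be checked with a few lines of arithmetic. This is a minimal sketch: RS(k, m) stores k data blocks plus m parity blocks, so its overhead factor is (k + m) / k, while n-way replication is simply n.

```python
def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Space overhead factor for a Reed-Solomon (k, m) erasure-coded layout."""
    return (data_blocks + parity_blocks) / data_blocks


def replication_overhead(replicas: int) -> float:
    """Space overhead factor for n-way replication."""
    return float(replicas)


if __name__ == "__main__":
    # RS(10,4): 14 blocks stored for every 10 blocks of data -> 1.4x
    print(ec_overhead(10, 4))
    # Classic HDFS 3-way replication -> 3.0x
    print(replication_overhead(3))
```

The same arithmetic applied to RS(6,3) gives a 1.5x overhead, matching the "one copy of data plus 0.5 copies of parity" description above.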

MapReduce optimization

MapReduce in Hadoop 3.x adds a native implementation of the map output collector. For shuffle-intensive jobs, this can bring a performance improvement of more than 30%.
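As a hedged sketch, the native collector is typically enabled per job through a job-level property; the delegator class name below comes from the hadoop-mapreduce-client-nativetask module and should be verified against your Hadoop version:

```xml
<!-- mapred-site.xml (or per-job configuration): switch the map output
     collector to the native C/C++ implementation -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```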

Support multiple NameNodes

The original HDFS NameNode high-availability implementation provided only one active NameNode and one standby NameNode; by replicating the edit log to three JournalNodes, that architecture can tolerate the failure of any single node in the system.

However, some deployments require higher fault tolerance. This new feature makes that possible by allowing users to run multiple standby NameNodes. For example, with three NameNodes and five JournalNodes configured, the system can tolerate the failure of two nodes instead of just one.
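A minimal configuration sketch of one nameservice with three NameNodes follows; the nameservice name, NameNode IDs, and hostnames are illustrative placeholders, and the property names should be checked against the HDFS HA documentation for your release:

```xml
<!-- hdfs-site.xml: one nameservice ("mycluster") backed by three
     NameNodes, one active plus two standbys -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>node1.example.com:9820</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>node2.example.com:9820</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn3</name>
  <value>node3.example.com:9820</value>
</property>
```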

Default port change

Before Hadoop 3.x, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000). This meant a service could fail to start because its port conflicted with another application's.

These conflicting ports have now been moved out of the ephemeral range. The changes affect the NameNode, Secondary NameNode, DataNode, and KMS.

NameNode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820

Secondary NameNode ports: 50091 --> 9869, 50090 --> 9868

DataNode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864

KMS server port: 16000 --> 9600 (the original 16000 conflicted with the HBase HMaster port)

YARN resource type

The YARN resource model has been generalized to support user-defined countable resource types, not just CPU and memory.
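As a hedged sketch, a custom countable resource can be declared on the ResourceManager in a resource-types.xml file; the resource name `gpu_count` below is an illustrative placeholder, and the property names should be verified against the YARN resource model documentation for your Hadoop 3 release:

```xml
<!-- resource-types.xml: declare a custom countable resource type
     alongside the built-in CPU and memory resources -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>gpu_count</value>
  </property>
  <property>
    <!-- empty units means a plain dimensionless count -->
    <name>yarn.resource-types.gpu_count.units</name>
    <value></value>
  </property>
</configuration>
```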


This blog's big data article series is updated daily. Remember to bookmark and follow~