Big data developers must learn to read YARN logs: task fault tolerance, task speculative execution, and counters

Background: Almost every big data developer ends up looking at YARN's web interface at some point: when a job fails, when it runs slowly, to check detailed task progress, to troubleshoot errors, to debug, and so on. In practice, however, many developers have not looked closely at the logs shown on the YARN interface and are not familiar with some of the common indicators there. The following uses Hive/MapReduce jobs as the example.

1. Principle and use of the task fault tolerance mechanism

1.1 Why do maps/reduces show failed and killed counts?

The YARN interface shown below is very common: for example, the reduce row shows 2 failed and 17 killed. Does this affect the final result of my job? If not, why not? Likewise, the Task Type and Attempt Type sections below both show map/reduce statuses and counts; what is the difference between the two? To answer these questions properly, we first need to understand the task fault tolerance mechanism of MapReduce/Hive jobs.

1.2 The difference between Task Type and Attempt Type

MapReduce/Hive jobs are divided into map tasks and reduce tasks. The total number of tasks of each type is generally determined by the amount of input data and the parameter configuration, and it is fixed at the initial stage of the job. The Task Type section of the YARN interface counts the tasks that need to be executed when the job is submitted to YARN. A task is allowed multiple attempts to run, and each run attempt is called a Task Attempt; these attempts are what the Attempt Type section of the YARN job log interface counts. This is why the map/reduce instances all have IDs starting with attempt_: the running instance of a task is an attempt.

Note: task IDs start with task_, while attempt IDs start with attempt_, and the two have different viewing entries. We generally pay attention to progress at the task level. A single task can have multiple attempt instances running, which is where speculative execution comes in, as shown in the example below.
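For illustration, the two kinds of IDs look roughly like this (the application and task numbers below are made up):

task_1626000000000_0001_m_000005          --- the 6th map task of the job
attempt_1626000000000_0001_m_000005_0     --- its first run attempt
attempt_1626000000000_0001_m_000005_1     --- a retry or speculative backup of the same task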

1.3 How the task fault tolerance mechanism is used

In actual production, some map/reduce task instances will fail for all kinds of reasons: aging machines, insufficient resources, process crashes, bandwidth limits, and so on. This is completely normal and happens easily. If the whole job errored out every time this happened, the cost would be too high, so Hadoop introduced a task fault tolerance mechanism. After a map/reduce instance fails, it sends an error report to the MRAppMaster before exiting; the report is recorded in the user log. The MRAppMaster then marks this task instance as failed and releases its container resources to other tasks.

The following two parameters control how many times a map/reduce task may be retried once an instance fails; the default values are generally fine. So in day-to-day development you do not need to worry about a single map failing: as long as it has not failed four times, it has no effect on the job. After each failure, the MRAppMaster will try to avoid rescheduling the task on the node where it previously failed, until the task succeeds or the limit of 4 attempts is exceeded. Once the limit is exceeded, no further attempts are made, and because the default job failure tolerance rate is 0, the entire job fails at that point.

1. Maximum attempts for a failed Map Task, default 4
mapred.map.max.attempts       --- deprecated parameter
mapreduce.map.maxattempts     --- recommended new parameter
2. Maximum attempts for a failed Reduce Task, default 4
mapred.reduce.max.attempts    --- deprecated parameter
mapreduce.reduce.maxattempts  --- recommended new parameter
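As a minimal sketch of how these are used, a single Hive job could override the retry limits at the session level; the values here are only illustrative, not a tuning recommendation:

-- retry a failed map attempt up to 4 times (the default)
set mapreduce.map.maxattempts=4;
-- retry a failed reduce attempt up to 4 times (the default)
set mapreduce.reduce.maxattempts=4;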

1.3.1 Failed/killed scenarios for task instances

1. The task instance (attempt) has not reported to the MRAppMaster for a long time, and the latter has not received any progress update from it. Normally an attempt communicates with the MRAppMaster every 3 seconds, reporting its task progress and status; if the reporting gap exceeds the timeout threshold, the attempt is considered dead and is marked as failed, after which the MRAppMaster kills its JVM and releases the resources, then tries to start a new task instance on another node.

mapreduce.task.timeout=600000 ms (10 minutes) --- the number of milliseconds before a task is terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.
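As a hedged sketch, a job whose attempts legitimately stay silent for long stretches (a heavy UDF, for instance) could raise this timeout at the Hive session level; the value below is purely illustrative:

-- allow 20 minutes without progress updates before the attempt is marked failed
set mapreduce.task.timeout=1200000;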

2. There are many other reasons for a task attempt to fail, such as code problems, OutOfMemory errors, or GC issues.

3. A task instance (attempt) is killed in one of two general situations: either the client actively requests that the task be killed, or the framework kills it. The latter usually happens because the job itself was killed, or because the backup task (from speculative execution) has already finished and this instance no longer needs to run, so it is killed. For example, when a NodeManager node fails (say it is stopped), all the task instances on it are marked as killed. Another example is a task exceeding certain thresholds, such as dynamic partitioning exceeding the maximum number of files, in which case all tasks are killed.

The tasks below were killed because the MRAppMaster actively requested that the backup tasks be killed.

2. Speculative execution of tasks

In actual development you may run into this in the YARN log: why does it show that only 257 reduces have finished running, while there are 261 reduces in the running state in the attempts below?

When the MRAppMaster detects that a task instance is running slower than the other task instances, it starts a backup task for that task and lets the two instances process the same data at the same time; this applies to both map and reduce. Whichever finishes first and successfully commits to the MRAppMaster is adopted, and the other one is killed. This effectively prevents long-tail tasks from dragging the whole job down. The speculation is done per task.

The advantage of speculative execution is that it trades resources for time and prevents long tails from dragging the job down: for example, if the machine an instance runs on is slow, a backup started elsewhere may finish quickly. The disadvantage is that two duplicate instances waste resources and reduce cluster efficiency, especially for speculative execution of reduce tasks: pulling the map output again increases network IO, and the speculative instances may compete with each other.

Speculative execution is enabled by default on the cluster. It can be enabled at the cluster computing-framework level or separately per task type. Enabling it for map tasks is recommended; for reduce tasks, enable it carefully based on the actual situation.

mapreduce.map.speculative=true      --- default, enables speculative execution for map tasks
mapreduce.reduce.speculative=true   --- default, enables speculative execution for reduce tasks; each can be toggled separately
mapreduce.job.speculative.speculative-cap-running-tasks=0.1   --- the percentage of running tasks that may be speculatively executed at any time
There are many other parameters related to speculative execution; refer to the official documentation.
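As a rough sketch of per-job usage, a Hive session could keep map-side speculation on while turning reduce-side speculation off; this is an illustrative choice, not a blanket recommendation:

-- keep backup attempts for slow map tasks (the default)
set mapreduce.map.speculative=true;
-- skip backup reduces to avoid pulling map output twice
set mapreduce.reduce.speculative=false;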

3. Task progress and counters

We often talk about the progress of a whole job or of an individual task. Do you know how this progress is calculated?

While a map/reduce task is running, it communicates with the MRAppMaster, reporting its task progress and status through the umbilical interface at 3-second intervals. The final summarized display is shown in the figure above.

Map task progress is relatively simple: it is the proportion of the input split that has already been processed. For example, if 85% of the data has been processed, the progress is 85%; this is tracked mainly through counters.

Reduce task progress is calculated from the completion of the reduce phases, which run in order and each count for one third: pulling and copying map output (1/3), merge and sort (1/3), and the reduce-function aggregation (1/3). For example, if a reduce has processed half of the map output in the final phase, its progress is 1/3 + 1/3 + 1/3 × 1/2, as worked out below.
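To make that example concrete, the arithmetic works out as follows:

reduce progress = 1/3 (copy done) + 1/3 (merge/sort done) + 1/3 × 1/2 (half of the reduce phase)
                = 2/6 + 2/6 + 1/6
                = 5/6 ≈ 83%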

The figure below reflects the three stages of a reduce task.