Performance Analysis: Building an Analysis Decision Tree for the Linux Operating System

I. Introduction

To a beginner, performance analysis feels like the mountain in the old poem: seen head-on it is a ridge, seen from the side it is a peak, and it looks different from every distance and height. So how do we build a performance analysis system of our own, one that holds up across every system we meet, the way a single moon shines over a thousand mountains and is reflected in a thousand rivers?

This is where the analysis decision tree shows its value to the performance analyst; it is an indispensable part of performance analysis. It organizes the architecture, the system, the problems, the process of finding the evidence chain, and the analytical thinking itself. It gives a view of the whole and serves as the framework on which everything else is built. As your skill in performance analysis matures, the decision tree becomes more refined, and the same methodology carries over to new cases by analogy. That methodology is what I want to share here.

II. The key to building a decision tree

A decision tree is essentially a summary of our earlier analysis experience. Building one generally goes through two stages: construction and pruning.

The concept is simply:

  • Construction is the process of choosing which attribute to use as each node;
  • Pruning slims the decision tree down. The goal of this step is to get good results without too many decision points, which guards against "over-fitting".

To understand from the perspective of performance analysis:

  • Construction: based on experience, sort out the architecture, the system, the problems, the process of finding the evidence chain, and the analysis approach;
  • Pruning: analyze the correlation between different time-series performance data. The core is to understand how the various performance indicators relate to one another, to search for evidence chains, and to draw conclusions from changes in the data, such as fault identification and root cause analysis.

III. Build a CPU analysis decision tree

The first layer is business indicators:

  • Response time
  • Error rate

The second layer is resource indicators:





Analysis method (Java application):

  1. Use the top command to find the process that is consuming the most CPU, for example pid=1232.
  2. Use top -p 1232 to monitor that process on its own.
  3. Press capital H to list all threads in the process.
  4. Find the thread that consumes the most CPU and note its thread ID, for example 12399.
  5. Use jstack 1232 > pagainfo.dump to dump the thread information of the process.
  6. Convert the thread ID 12399 from step 4 to hexadecimal, 306f (printf "%x\n" 12399).
  7. In the stack information from step 5, look up the thread with nid=0x306f.
  8. Locate the code (follow the printed stack information to the code location).
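
The lookup in the steps above can be sketched as a small shell helper. This is only a minimal sketch, not one of the open source tools mentioned in this article. The function names tid_to_nid and find_thread_stack are hypothetical, and the sketch assumes the dump was written by jstack, with thread blocks separated by blank lines and each thread header carrying an nid=0x... field.

```shell
# Convert a decimal Linux thread ID (as shown by top after pressing H)
# into the hexadecimal form that jstack prints in its nid= field.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Print the stack block of one thread from a saved jstack dump.
#   $1 = dump file (for example pagainfo.dump)
#   $2 = decimal thread ID (for example 12399)
find_thread_stack() {
    awk -v want="nid=$(tid_to_nid "$2")" '
        BEGIN { RS = "" }   # paragraph mode: one record per thread block
        {
            # Compare whole fields so nid=0x306f does not match nid=0x306f1.
            for (i = 1; i <= NF; i++)
                if ($i == want) { print; next }
        }
    ' "$1"
}
```

With the example values from the steps, the call would be find_thread_stack pagainfo.dump 12399 after running jstack 1232 > pagainfo.dump.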

The simple method for the above process is to use some open source shell tools, such as:

IV. Build an I/O analysis decision tree

Let's talk about the structure of the disk system first:

  • If it is an IDE drive, the disk is named: hda, hdb, hdc, etc.;
  • If it is a SCSI drive, the disk is named: sda, sdb, sdc, etc.
  • Disks are usually divided into multiple partitions. A partition's device name is formed by appending the partition number to the base device name. Each partition usually holds a file system or swap space, and these partitions are mounted into the Linux root file system as specified in /etc/fstab. The mounted file systems contain the files that applications read and write.
  • When an application performs a read or write operation, the Linux kernel may store a copy of the file in its cache or buffer and return the requested information without accessing the disk. However, if the Linux kernel does not have a copy of the data stored in memory, it will add a request to the disk's I/O queue. If the Linux kernel notices consecutive locations on the disk for multiple requests, it merges them into one large request. This consolidation eliminates the seek time for the second request, thereby improving overall disk performance. When the request is put into the disk queue, if the disk is not currently busy, it will start servicing the I/O request. If the disk is busy, the request will wait in the queue until the drive is available and then service it.

At this level we focus on I/O. If I/O is high, how should we analyze it and locate the problem?

vmstat [-D] [-d] [-p partition]

Parameter Description:

  • -d: Display statistics about the disks.
  • -D: Display summary statistics for the Linux I/O subsystem. The figures are totals since the system started, not just the activity between this sample and the previous one.
  • -p: Display statistics for the specified partition.

The disk-related fields in the regular vmstat output are bo, bi, and wa:

  • bo: the rate at which blocks were written to disk during the previous interval, in blocks per second. (In vmstat, the disk block size is usually 1024 bytes.)
  • bi: the rate at which blocks were read from disk during the previous interval, in blocks per second. (In vmstat, the disk block size is usually 1024 bytes.)
  • wa: the percentage of CPU time spent waiting for I/O to complete.
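
To watch only these three fields, the vmstat output can be filtered with awk. The sketch below assumes the usual vmstat layout, where the header row starts with "r" and names each column. The function name vmstat_io_fields is hypothetical.

```shell
# Read vmstat output on stdin and print only bi, bo and wa for each sample.
# Columns are located by name from the header row, not by fixed position.
vmstat_io_fields() {
    awk '
        # Header row: remember which column holds each field name.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number; the banner row is skipped.
        $1 ~ /^[0-9]+$/ && ("bi" in col) {
            printf "bi=%s bo=%s wa=%s\n", $(col["bi"]), $(col["bo"]), $(col["wa"])
        }
    '
}
```

A typical call would be vmstat 1 5 | vmstat_io_fields, which prints one line per sample.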

The most common command for I/O analysis on Linux is iostat:

iostat -d -x -k 1 10

Meaning of counter information:

  • rsec/s: the number of sectors read per second;
  • wsec/s: the number of sectors written per second;
  • avgrq-sz: the average size of the requests, in sectors;
  • avgqu-sz: the average length of the request queue. The shorter the queue, the better;
  • await: the average time to process each I/O request, in milliseconds. This can be understood as the I/O response time. As a rule of thumb, I/O response time should be below 5 ms, and anything above 10 ms is on the high side. This time includes both queue time and service time, so under normal circumstances await is greater than svctm. The smaller the difference between them, the shorter the queue time; the larger the difference, the longer the queue time, which indicates a problem in the system;
  • svctm: the average service time of each device I/O operation, in milliseconds. If svctm is very close to await, there is almost no I/O waiting and disk performance is good. If await is much higher than svctm, requests are waiting too long in the I/O queue and the applications running on the system will slow down;
  • %util: the time spent processing I/O within the sampling interval, divided by the length of the interval. For example, if the interval is 1 second and the device spends 0.8 seconds processing I/O and 0.2 seconds idle, then %util = 0.8/1 = 80%, so this value shows how busy the device is. In general, 100% means the device is running at full capacity (although with multiple disks, %util can be 100% without the disks being the bottleneck, because they serve requests concurrently).
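
The arithmetic behind %util and the queue time can be written out as two small helpers. This is only an illustration of the formulas described above. The function names util_percent and queue_time_ms are hypothetical.

```shell
# %util = time spent processing I/O / length of the sampling interval.
#   $1 = busy seconds, $2 = interval seconds
util_percent() {
    awk -v busy="$1" -v total="$2" 'BEGIN { printf "%.0f\n", busy / total * 100 }'
}

# Queue time = await - svctm, since await covers queue time plus service time.
#   $1 = await in ms, $2 = svctm in ms
queue_time_ms() {
    awk -v a="$1" -v s="$2" 'BEGIN { printf "%.2f\n", a - s }'
}
```

For the example in the text, util_percent 0.8 1 prints 80. With an await of 12.5 ms and an svctm of 4.5 ms, queue_time_ms 12.5 4.5 prints 8.00, meaning most of the response time is spent queuing.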

Mainly focus on the counter:






Counters that indicate a bottleneck:

  • %util is very high
  • await is much larger than svctm
  • avgqu-sz is relatively large
  • cpu > wa is too large (reference value: above 20)
  • io > bi and bo are too large (reference value: above 2000)
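
These checks can be collected into one helper that prints a hint for each rule that fires. The thresholds for wa (20) and bi/bo (2000) are the reference values given above. The thresholds for %util (80), for await (more than twice svctm) and for avgqu-sz (2) are placeholder assumptions for this sketch and should be tuned per system. The function name io_bottleneck_hints is hypothetical.

```shell
# Print one line for each I/O bottleneck rule that is triggered.
# Arguments: %util await svctm avgqu-sz wa bi bo
io_bottleneck_hints() {
    awk -v util="$1" -v await="$2" -v svctm="$3" -v avgqu="$4" \
        -v wa="$5" -v bi="$6" -v bo="$7" 'BEGIN {
        if (util  > 80)           print "%util is very high"               # assumed threshold
        if (await > 2 * svctm)    print "await is much larger than svctm"  # assumed ratio
        if (avgqu > 2)            print "avgqu-sz is relatively large"     # assumed threshold
        if (wa    > 20)           print "wa is too large"                  # reference value from the text
        if (bi > 2000 || bo > 2000) print "bi or bo is too large"          # reference value from the text
    }'
}
```

For example, io_bottleneck_hints 95 30 5 4 25 100 3000 triggers all five rules, while a healthy sample such as io_bottleneck_hints 10 5 4 0.5 1 100 100 prints nothing.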

V. Build a memory analysis decision tree

At this level we focus on memory. If memory usage is high, how should we analyze it and locate the problem?

A commonly used command is free:

Counter description:

  • total: the total size of physical memory
  • used: how much memory is in use
  • free: how much memory is unused
  • shared: the total amount of memory shared by multiple processes
  • buffers/cached: the size of the disk cache
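
Because the disk cache can largely be reclaimed when applications need memory, a rough estimate of the memory that can still be handed out is free plus buffers plus cached. The sketch below computes that sum from the output of free, locating columns by their header names. It also accepts the combined buff/cache column printed by newer versions of free. The function name free_reclaimable is hypothetical, and the result is an estimate only.

```shell
# Read the output of free on stdin and print free + buffers + cached
# (or free + buff/cache on newer versions), in the units free used.
free_reclaimable() {
    awk '
        # Header row has no label in the first column, so header field i
        # lines up with data field i + 1 on the "Mem:" row.
        $1 == "total" { for (i = 1; i <= NF; i++) col[$i] = i + 1; next }
        $1 == "Mem:" {
            sum = $(col["free"])
            if ("buffers" in col)    sum += $(col["buffers"])
            if ("cached" in col)     sum += $(col["cached"])
            if ("buff/cache" in col) sum += $(col["buff/cache"])
            print sum
        }
    '
}
```

A typical call would be free | free_reclaimable.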

In addition to CPU counters, vmstat also provides memory statistics:

vmstat [-a] [-f] [-m] [-n] [-s] [-d] [-p partition] [-S unit] [-V] [delay [count]]

Parameter Description:

  • -a: Display active and inactive memory
  • -f: Display the number of forks since the system was started.
  • -m: display slabinfo
  • -n: Display each field name only once at the beginning.
  • -s: Display memory-related statistics and the number of various system activities.
  • delay: refresh interval. If not specified, only one result will be displayed.
  • count: the number of refreshes. If you do not specify the number of refreshes, but specify the refresh interval, then the number of refreshes is infinite.
  • -d: Display statistics about the disk.
  • -p: Display statistics of the specified disk partition
  • -S: Display values in the specified unit. The options are k, K, m, and M, which stand for 1000, 1024, 1000000, and 1048576 bytes respectively. The default unit is K (1024 bytes).
  • -V: Display vmstat version information.

The meaning of each counter in the vmstat output is well documented online, so it is not repeated here.

vmstat -m displays the same information as cat /proc/slabinfo:

This output shows in detail how kernel memory is allocated and helps determine which area of the kernel consumes the most memory.
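
One way to find the largest consumers is to rank the slab caches by an approximate size. The sketch below assumes the /proc/slabinfo version 2.x layout, where the columns are name, active_objs, num_objs and objsize, and estimates each cache as num_objs multiplied by objsize. This ignores per-slab overhead, so treat the numbers as approximate. The function name top_slab_caches is hypothetical.

```shell
# Read /proc/slabinfo-style text on stdin and print the largest caches
# as "approx_bytes name", biggest first.
#   $1 = how many caches to print (default 5)
top_slab_caches() {
    awk '
        # Skip the version banner and the "# name ..." column header.
        $1 != "slabinfo" && $1 !~ /^#/ {
            printf "%.0f %s\n", $3 * $4, $1   # num_objs * objsize
        }
    ' | sort -rn | head -n "${1:-5}"
}
```

Reading /proc/slabinfo normally requires root, so a typical call would be sudo cat /proc/slabinfo | top_slab_caches 10.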

vmstat -s

It is very useful to track exactly how the kernel uses its memory.

We can use the vmstat command to monitor CPU usage, memory usage, swap activity, and I/O reads and writes on Linux, and to analyze disk pressure, for example whether it comes from swapping or from loading files.

VI. Build a network analysis decision tree

To work at the network level, you need to understand the seven-layer network model:

Commonly used commands:

  • hostname
  • ping
  • ifconfig
  • iwconfig
  • netstat
  • nslookup
  • traceroute
  • finger
  • telnet
  • ethtool
ip -s [-s] link

Counter explanation:

  • bytes: The total number of bytes sent or received.
  • packets: The total number of data packets sent or received.
  • errors: The number of errors that occurred during sending or receiving.
  • dropped: The number of data packets that were not sent or received due to insufficient network card resources.
  • overruns: The number of times that the network does not have enough buffer space to send or receive more data packets.
  • mcast: The number of multicast packets received.
  • carrier: The number of packets discarded due to link media failure (such as cable failure).
  • collsns: This is the number of collisions encountered by the device during transmission. This happens when two devices try to use the network at the same time.
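
To spot interfaces with problems quickly, the errors and dropped counters can be pulled out of the ip -s link output. The sketch below assumes the usual layout, where each "RX:" or "TX:" header row is followed by one row of numbers in the order bytes, packets, errors, dropped. The function name link_error_summary is hypothetical.

```shell
# Read "ip -s link" output on stdin and print, per interface and direction,
# the errors and dropped counters.
link_error_summary() {
    awk '
        # Interface line, e.g. "2: eth0: <BROADCAST,...> mtu 1500 ..."
        /^[0-9]+: / { ifname = $2; sub(/:$/, "", ifname) }
        # Counter header; the values are on the following line.
        $1 == "RX:" || $1 == "TX:" {
            dir = substr($1, 1, 2)
            getline
            printf "%s %s errors=%s dropped=%s\n", ifname, dir, $3, $4
        }
    '
}
```

A typical call would be ip -s link | link_error_summary. Counters that keep growing between two runs are the ones worth investigating.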
sar [-n DEV | EDEV | SOCK | FULL] [DEVICE] [interval] [count]

You can also use iptraf-ng:

yum install iptraf
iptraf-ng -d eth0 -t 1

To see which ports traffic is flowing in and out of:

iptraf-ng -s eth0 -t 10

Network statistics:

netstat [-p] [-c] [--interfaces=<name>] [-s] [-t] [-u] [-w]

netstat -t -c
netstat -t -p
netstat -s -u
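
The netstat -t output is easier to read once the connections are counted per state. The sketch below assumes the standard column layout, in which the sixth column of each tcp row is the connection state. The function name tcp_state_counts is hypothetical.

```shell
# Read "netstat -t" output on stdin and print "count STATE" lines,
# most frequent state first.
tcp_state_counts() {
    awk '
        $1 ~ /^tcp/ { n[$6]++ }               # rows for tcp and tcp6 sockets
        END { for (s in n) print n[s], s }
    ' | sort -rn
}
```

A typical call would be netstat -t | tcp_state_counts. A large number of connections piling up in a single state is a useful starting point for the evidence chain.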

VII. Summary

If you keep a big-picture view of the Linux operating system architecture in mind, things should look hopeful at this point. Problems at the operating system level are relatively clear-cut to judge, so with the decision tree everyone can find the evidence chain for a performance problem in the operating system.