In-depth analysis of the causes of a large number of CLOSE_WAIT connections in production


This time, restarting really could not solve the problem: a full walkthrough of troubleshooting the flood of CLOSE_WAIT connections that appeared in the service after MySQL actively closed its connections.

Recently, one of our online services kept exhausting its socket resources. We analyzed the problem with various tools and located the offending code. This is a review and summary of how the problem was found and fixed.

Look at the two charts first. One shows the socket status monitored while the service is healthy; the other, of course, shows the abnormal state.

Figure 1: Monitoring when normal

Figure 2: Monitoring when abnormal

As the chart shows, starting from 04:00 the socket count keeps rising; each valley corresponds to a restart that temporarily brings it back to normal, after which it climbs again without ever being released, and the interval between peaks keeps getting shorter.

After restarting, I checked the logs and saw no panic, so I did not dig any further at that point. I really thought the good old "just restart it" trick would be enough.

Description of the situation

The service has been online for nearly a year and is called by other services. Its main underlying resources are DB/Redis/MQ.

To make the later description easier to follow, here is the service's architecture diagram.

Figure 3: Service architecture

The architecture is very simple. The trouble started around 08:20 in the morning, when an alert reported 504s from the service. My first reaction was that the service had not been restarted for a long time (almost two months) and there might be some memory leak, so without thinking too much I simply restarted it. That is the first trough in Figure 2: the service went back to normal after the restart (restarting really is convenient, so satisfying).

Around 14:00 the 504 alerts fired again. Something felt off at that point, but since a large promotion happened to be running that day, I immediately restarted the service once more. About an hour later the alerts came back. After several consecutive restarts I noticed the interval between them was getting shorter and shorter. Clearly the problem was not a simple one, and this time restarting really could not solve it, so I immediately applied for machine access and started troubleshooting. The screenshots below all come from my local reproduction demo and have nothing to do with the production environment.

Finding the problem

When a problem occurs, first analyze and form hypotheses, then verify them, and finally locate and fix the cause. Based on the symptoms at the time, I made the following conjectures.

Inference one

Socket resources were constantly being exhausted, something that had never happened before and suddenly appeared today, so my first guess was that the request volume was overwhelming the service.

After checking the real-time QPS I dropped this idea: although traffic had increased, it was still well within the server's capacity (far below the stress-test benchmark).

Inference two

Both machines failed at the same time, and restarting one also relieved the other, which is very strange for a service deployed independently in two clusters.

Based on that, the conclusion is that an underlying resource the service depends on must be having problems; otherwise, services in independent clusters could not fail at the same time.

Since the monitoring showed a socket problem, I checked the current TCP connection states with the netstat command (this is from a local test; the actual numbers in production were much larger):

/go/src/hello # netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 2 
CLOSE_WAIT 23 # very abnormal
TIME_WAIT 1

Most of the connections were in the CLOSE_WAIT state, which was hard to believe, so I double-checked with the netstat -an command.

Figure 4: A large number of CLOSE_WAIT

A quick reference for the TCP states involved:

  CLOSED: the socket is not in use.
  LISTENING: listening for incoming connections.
  SYN_SENT: trying to establish a connection.
  SYN_RECEIVED: performing the initial synchronization of the connection.
  ESTABLISHED: the connection has been established.
  CLOSE_WAIT: the remote peer has closed the connection; waiting for the local socket to be closed.
  FIN_WAIT_1: the local socket has been closed and the connection is being shut down.
  FIN_WAIT_2: the local socket is closed; waiting for the close signal from the remote peer.
  CLOSING: the local socket was closed first, then the remote side closed, and the final acknowledgement is awaited.
  LAST_ACK: waiting for the final acknowledgement after the remote peer has shut down.
  TIME_WAIT: the connection is closed; waiting long enough to handle a possible retransmission from the remote peer.
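As a hedged, Linux-only aside (my own addition, not part of the original troubleshooting), the per-state counts that netstat prints can also be produced from Go by parsing /proc/net/tcp, which is handy if you want to feed the numbers into your own monitoring. The fourth column of that file is the state in hex, and 08 is CLOSE_WAIT:

// count_tcp_states.go - a minimal sketch: count TCP socket states by parsing
// /proc/net/tcp (IPv4 only; /proc/net/tcp6 would cover IPv6 sockets).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

var tcpStates = map[string]string{
	"01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
	"04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
	"07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
	"0A": "LISTEN", "0B": "CLOSING",
}

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 3 {
			counts[tcpStates[fields[3]]]++ // fields[3] is the "st" column
		}
	}
	for state, n := range counts {
		fmt.Println(state, n)
	}
}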

Then I began to focus on why so many MySQL connections were stuck in CLOSE_WAIT. To make this clear, let's first review the TCP four-way close.


The TCP four-way close

Let's take a look at the TCP four-way close process.



To describe the process in plain words:

Client: Hey server, I'm done with everything and I'm heading out. This corresponds to the client sending a FIN.

Server: Got it, but wait for me, I still have some wrapping up to do before I close. This is the ACK the server sends back after receiving the FIN.

After these two steps, the server is in the CLOSE_WAIT state. A while later, the server finishes its work.

Server: OK, I'm done too, let's wrap it up. The server sends its FIN.

Client: Goodbye. This is the final ACK from the client to the server.

At this point the server can leave, but the client cannot. Why? The client must wait for 2MSL. Why can't it leave right away? Mainly to guard against the case where the server never receives that final ACK and retransmits its FIN to ask again; if the client disappeared right after sending the ACK, nobody would answer the retransmitted FIN. The length of this wait is also chosen carefully.

MSL (Maximum Segment Lifetime) is the longest time any segment can live on the network; after that time, the segment is discarded.
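As an aside, the CLOSE_WAIT mechanism is easy to reproduce in isolation. The sketch below is my own illustration, not the service's code: the local listener on port 3307 merely stands in for the peer that closes first. We dial a connection, let the peer close it, and then never call Close(), so the dialing side's socket sits in CLOSE_WAIT:

// close_wait_demo.go - a minimal sketch of how CLOSE_WAIT arises: the peer
// closes first (sends its FIN) and the local side never calls Close().
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// A throwaway local listener standing in for the peer that closes first.
	ln, err := net.Listen("tcp", "127.0.0.1:3307")
	if err != nil {
		panic(err)
	}
	go func() {
		c, err := ln.Accept()
		if err == nil {
			c.Close() // the peer actively closes -> it sends a FIN
		}
	}()

	conn, err := net.Dial("tcp", "127.0.0.1:3307")
	if err != nil {
		panic(err)
	}
	_ = conn // we deliberately never call conn.Close()

	// While this program sleeps, `netstat -an | grep 3307` should show the
	// dialing side stuck in CLOSE_WAIT (and the peer side in FIN_WAIT2).
	fmt.Println("check netstat now...")
	time.Sleep(time.Minute)
}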

Do not be confused by the client/server labels in the diagram versus the client and server in our project. Just remember: the side that actively closes sends the FIN packet (the "client" in the diagram), and the side that is passively closed responds with an ACK and at that moment enters the CLOSE_WAIT state. If everything is normal, the passively closed side will later send its own FIN and move to the LAST_ACK state. With that in mind, here is the TCP packet capture from this case:

/go # tcpdump -n port 3306
# the three-way handshake
11:38:15.679863 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [S], seq 4065722321, win 29200, options [mss 1460,sackOK,TS val 2997352 ecr 0,nop,wscale 7], length 0

11:38:15.679923 IP 172.18.0.3.3306 > 172.18.0.5.38822: Flags [S.], seq 780487619, ack 4065722322, win 28960, options [mss 1460,sackOK,TS val 2997352 ecr 2997352,nop,wscale 7], length 0

11:38:15.679936 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [.], ack 1, win 229, options [nop,nop,TS val 2997352 ecr 2997352], length 0

# MySQL actively closes the connection
11:38:45.693382 IP 172.18.0.3.3306 > 172.18.0.5.38822: Flags [F.], seq 123, ack 144, win 227, options [nop,nop,TS val 3000355 ecr 2997359], length 0 # the MySQL load balancer sends me a FIN

11:38:45.740958 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [.], ack 124, win 229, options [nop,nop,TS val 3000360 ecr 3000355], length 0 # I reply with an ACK

... ... # I should have sent my own FIN next, but I never did, hence the CLOSE_WAIT. Why not?
The output format is src > dst: flags data-seqno ack window urgent options. Here src > dst means from the source address to the destination address; flags are the TCP flags in the packet (S for SYN, F for FIN, P for PUSH, R for RST, and "." means no flag); data-seqno is the sequence number range of the data in the packet; ack is the sequence number expected next; window is the size of the receive window; urgent indicates whether the packet carries an urgent pointer; options are the TCP options.

Putting this together: the MySQL load balancer sent a FIN to my service, and my service acknowledged it, entering the CLOSE_WAIT state. But as the passively closing side, my service never sent its own FIN, so it stayed in CLOSE_WAIT and never reached CLOSED. From this I inferred several possible causes:

1. The load balancer exited abnormally.

Basically impossible: a load balancer failure would trigger alarms across many services, not just mine.

2. The MySQL load balancer's timeout is set too short, so it closes the TCP connection before the business code has finished processing.

This is also unlikely, because this service has no time-consuming operations. I checked the load balancer's configuration anyway: the timeout is set to 60s. (A defensive connection-pool setting for this scenario is sketched right after this list.)

3. A code problem: the MySQL connection is never released.

This looked like the most likely cause: a code quality problem, combined with abnormal data that day hitting a code path that had never been exercised in testing.
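For completeness, here is the defensive pool configuration referred to in point 2 above. It is only a sketch using the standard database/sql pool (the service itself goes through the beego ORM, which sits on top of database/sql); the idea is to keep pooled connections younger than the load balancer's 60s idle timeout, so the application closes them first instead of waiting for the peer's FIN:

// pool_config.go - a sketch, not the service's actual setup: recycle pooled
// MySQL connections before the load balancer's 60s idle timeout can fire.
package main

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // assumed driver; the DSN below is a placeholder
)

func openDB() (*sql.DB, error) {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/app")
	if err != nil {
		return nil, err
	}
	db.SetConnMaxLifetime(50 * time.Second) // shorter than the 60s LB timeout
	db.SetMaxOpenConns(50)                  // cap concurrent connections
	db.SetMaxIdleConns(10)                  // keep a small idle pool
	return db, nil
}

func main() {
	db, err := openDB()
	if err != nil {
		panic(err)
	}
	defer db.Close()
}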

Finding the cause of the error

Since I did not write the business logic myself, I worried it would take a while to spot the problem by reading code, so I used perf to render the whole call graph as a flame graph. Given the inference above that the MySQL connection is not being released in code, the cause can only be one of the following:

  1. Close is never called
  2. Some time-consuming operation (which would show up very clearly in the flame graph) causes a timeout
  3. The MySQL transaction is not handled correctly, e.g. a missing rollback or commit

Because the flame graph contains too much detail, I collapsed some irrelevant information so that it is easier to read.

Figure 5: The problematic flame graph

The flame graph clearly shows that a transaction is started, but nowhere in the rest of the graph is there a Commit or Rollback. That will certainly cause problems, and it pinpoints the issue:

It is inside the MainController.update method. Without further ado, I went straight to the update method and found the following code:

func (c *MainController) update() (flag bool) {
	o := orm.NewOrm()
	o.Using("default")
	
	o.Begin()
	nilMap := getMapNil()
	if nilMap == nil { // only checks for nil here; neither rollback nor commit is performed
		return false
	}

	nilMap[10] = 1
	nilMap[20] = 2
	if nilMap == nil && len(nilMap) == 0 {
		o.Rollback()
		return false
	}

	sql := "update tb_user set name=%s where id=%d"
	res, err := o.Raw(sql, "Bug", 2).Exec()
	if err == nil {
		num, _ := res.RowsAffected()
		fmt.Println("mysql row affected nums: ", num)
		o.Commit()
		return true
	}

	o.Rollback()
	return false
}

At this point the analysis is essentially done: getMapNil returned nil, but the branch that checks for nil returns without rolling back. The corrected branch is simply:

if nilMap == nil {
    o.Rollback() // roll back here
    return false
}
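Beyond that minimal fix, a more defensive shape for the method is to pair Begin with a deferred Rollback, so every early-return path still ends the transaction. This is a sketch of mine, not the code that actually shipped; it assumes the beego ORM's Begin/Commit/Rollback as used above and switches the raw SQL to '?' placeholders:

func (c *MainController) updateSafe() (flag bool) {
	o := orm.NewOrm()
	o.Using("default")

	if err := o.Begin(); err != nil {
		return false
	}
	committed := false
	defer func() {
		if !committed {
			o.Rollback() // any early return below still ends the transaction
		}
	}()

	nilMap := getMapNil()
	if nilMap == nil {
		return false // the deferred Rollback runs here
	}

	sql := "update tb_user set name=? where id=?" // '?' placeholders for Raw
	if _, err := o.Raw(sql, "Bug", 2).Exec(); err != nil {
		return false
	}

	if err := o.Commit(); err != nil {
		return false
	}
	committed = true
	return true
}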

Summary

The whole analysis still cost a lot of time, mostly because of my own bias: "it has run for a year without problems, why would it suddenly break?" So at the beginning I suspected SRE, the DBA, and all kinds of infrastructure issues (people always doubt others first), and a lot of time was wasted there.

The correct troubleshooting approach:

1. When a problem occurs, check the logs immediately; in this case the logs indeed showed nothing wrong;

2. The monitoring clearly showed the socket count growing, so netstat should have been used right away to see which process was to blame;

3. Based on the netstat result, use tcpdump to capture packets and figure out why the connection is being closed by the peer (TCP knowledge really matters here);

4. If you are familiar with the code, inspect the business code directly; if not, use perf to print the code's call chain;

5. Whether by reading the code or via the flame graph, you should be able to locate the problem quickly at this point. So why did CLOSE_WAIT appear this time? Most readers probably see it already; here is a brief recap:

Because that branch never rolled back the transaction, the server side never initiated a close. So the MySQL load balancer actively closed the connection when its 60s timeout expired, but the packet capture shows my service never completed its half of the close: the transaction was left open in the code, so the connection was never released, and a large number of ports and connection resources stayed occupied. Here again is the packet capture of the close:

# MySQL actively closes the connection
11:38:45.693382 IP 172.18.0.3.3306 > 172.18.0.5.38822: Flags [F.], seq 123, ack 144, win 227, options [nop,nop,TS val 3000355 ecr 2997359], length 0 # the MySQL load balancer sends me a FIN
11:38:45.740958 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [.], ack 124, win 229, options [nop,nop,TS val 3000360 ecr 3000355], length 0 # I reply with an ACK

I hope this article helps you troubleshoot problems in production. To aid understanding, the flame graphs for the correct case and the buggy case are attached below so you can compare them yourself.

Flame graph in the correct case

Flame graph in the buggy case

Finally, here are two questions to think about, which I find quite meaningful:

  1. Why do just a few hundred CLOSE_WAIT connections make the machine unable to serve requests? Don't we often say a machine has 65,535 file descriptors available?
  2. The service sits behind a load balancer, so why did the two machines it is deployed on accumulate CLOSE_WAIT almost simultaneously?