Got an error reading communication packet

problem

Frequently receive questions from customers about communication failure errors—Customers are facing intermittent "Got an error reading communication packet" errors.
Here is an analysis of the reasons for this error and how to solve the problem.

Official explanation

First, when a communication failure error occurs, the counts of the MySQL state variables Aborted_clients and Aborted_connects will increase. These two status variables describe the number of connections that were interrupted because the client did not properly close the connection and the number of connections that failed to log in to MySQL (respectively). There are many possible reasons for the two errors.
Let’s see what the official website says:
Aborted_connects:
If a client is unable even to connect, the server increments the Aborted_connects status variable.

  • A Client Access attempts unsuccessful attempts to A Database But has privileges for IT NO.
    # The client does not have permission but try to access the MySQL database
  • A client uses an incorrect password. #The password
    entered by the client is incorrect.
  • Packet Connection does Not Contain A The right Information.
    # Connection information packet does not contain the correct
  • takes more than connect_timeout seconds to obtain a connect packet. #Exceeding the
    connection time limit is mainly controlled by this system variable connect_timeout (mysql default is 10s, basically, unless the network environment is extremely bad, it will generally not time out.)

Aborted_clients:
lIf a client successfully connects but later disconnects improperly or is terminated, the server increments the Aborted_clients status variable

  • The client program did not call mysql_close() before exiting… #The client program did not
    close
  • The client had been sleeping more than wait_timeout or interactive_timeout seconds without issuing any requests to the server. #The
    client's sleep time exceeded wait_timeout or interactive_timeout seconds without issuing any requests to the server .
  • The client program ended abruptly in the middle of a data transfer. #The client program ended abruptly in the middle of a data transfer
    .

Aborted_connects OR Aborted_clients:
Other reasons for problems with aborted connections or aborted clients:

  • the max_allowed_packet variable value is too small or queries require more memory than you have allocated for mysqld
    #max_allow_packet设置过小
  • Use of Ethernet protocol with Linux, both half and full duplex. Some Linux Ethernet drivers have this bug
    #Linux Ethernet drivers have this bug
  • A problem with the thread library that causes interrupts on reads.#Thread library that causes interrupts on reads.
  • Badly configured TCP/IP. #tcp/iip Configuration information is confused
  • , Ethernets Faulty, Hubs, Switches, Cables, and Forth SO. This CAN BE Properly diagnosed only by Replacing Hardware.
    # Failure of Ethernet, hubs, switches, cables, etc.

https://dev.mysql.com/doc/refman/5.7/en/communication-errors.html

analysis

a. Wrong password & wrong authority:

mysql> flush status;Query OK, 0 rows affected (0.00 sec)mysql> show status like 'Abort%';+------------------+-------+| Variable_name    | Value |+------------------+-------+| Aborted_clients  | 0     || Aborted_connects | 0     |+------------------+-------+2 rows in set (0.00 sec)

Incorrect password login:

image.png

Error.log:

image.png

status information:

image.png

##Permission error
Login without permission:

image.png

Error.log:

image.png

b. Transmission packet error
A connection packet does not contain the right information:

 connection packet does not contain the right information :
image.png
image.png
image.png

c. The program did not call close The
client program did not call mysql_close() to properly close the MySQL connection before exiting.

####python代码#####!/usr/bin/python# -*- coding: UTF-8 -*-import pymysql# 打开数据库连接db = pymysql.connect(host='192.168.244.130', port=3390, database='sbtest',                     charset='utf8',user='dbadmin', password='123456')# 使用 cursor() 方法创建一个游标对象 cursorcursor = db.cursor()# 使用 execute()  方法执行 SQL 查询#cursor.execute("SELECT VERSION()")cursor.execute("SELECT * FROM t1;")# 使用 fetchone() 方法获取单条数据.data = cursor.fetchone()print("Database version : %s " % str(data))# 关闭数据库连接#db.close()

Error log information:

image.png
image.png

d.timeout settings
First look at the mysql parameters related to timeout

mysql>show variables like '%timeout%';
image.png

There are mainly 4 timeout parameters:
connect_timeout works in the handshake authentication phase (authenticate), interactive_timeout and wait_timeout work in the connection idle phase (sleep),
and net_read_timeout and net_write_timeout start when the connection is busy (query) or when there is a problem with the network effect.

First look at the variables interactive_timeout and wait_timeout are set to 4 seconds

mysql> set global interactive_timeout=4;Query OK, 0 rows affected (0.00 sec)
image.png
image.png

e.max_allowed_packet setting Set
the global system variables interactive_timeout and wait_timeout to 4 seconds

mysql> set global max_allowed_packet=1024;Query OK, 0 rows affected, 1 warning (0.00 sec)

###But when the following scripts are concurrently multiple, it is found that
###python script is as follows:

image.png
image.png

f. Network delay
MySQL system variables net_read_timeout and net_write_timeout values, see if it will reduce the number of errors, net_read_timeout is a rare exception, unless your network environment is really too bad, try to adjust these values, because in large In most cases, a query is generated and sent to the server as a single packet, and the application cannot switch to perform other operations, leaving the server as a partially received query.

netem provides network emulation function for test protocol by simulating the characteristics of wide area network. The current version simulates variable delay, loss, repetition and reordering.
##Link to the mysql server through the python code and found that the error log records are as follows:

[[email protected] ~]# tc qdisc add dev ens33 root netem delay 90000ms

From the analysis of the above situation: "Got an error reading communication packet" phenomenon

1. The client has successfully connected, but terminated abnormally (may be related to the incorrect closing of the connection)
2. The client sleep time exceeds the defined value of the system variable wait_timeout or interactive_timeout (eventually causes the
connection to sleep for more than the system variable wait_timeout Value, and then forcibly closed by MySQL)
3. The client is interrupted abnormally or the query exceeds the max_allowed_packet value.

Inside MySQL, a large number of MySQL connections are dormant and dormant for hundreds of seconds is one of the symptoms that the application does not close the connection after completing the work. They rely on the wait_tiemout system variable to close the connection. It is recommended to modify the application logic and properly close the connection after the operation is over.

mysql> select @@log_warnings;+----------------+| @@log_warnings |+----------------+|              2 |+----------------+1 row in set (0.00 sec)mysql> set global log_warnings=0;Query OK, 0 rows affected (0.00 sec)mysql> select @@log_warnings;+----------------+| @@log_warnings |+----------------+|              0 |+----------------+1 row in set (0.00 sec)

官方文档对log_warnings的解释:
The server logs messages about statements that are unsafe for statement-based logging if the value is greater than 0. Aborted connections and access-denied errors for new connection attempts are logged if the value is greater than 1
https://dev.mysql.com/doc/refman/5.7/en/server-options.html#option_mysqld_log-warnings

Case 1

During the preparation of the March operation and maintenance monthly report, it was found that the abnormal rate of the MySQL connection of the tag database from the database was high. After checking the error log, a large number of Abort Connection (Got an error reading communication packets) were found. The recorded mysql user is baihang and the IP is HAproxy. The IP addresses are distributed evenly but there is no obvious time rule, as shown in the figure (Figure 1 is the Nginx log, and Figure 2 is the MySQL error log):

image.png
image.png

Business Process:

image.png

First of all, as shown in Figure 1.2 above, the Got an error reading problem printed in the MySQL error log indicates that the client has successfully connected but terminated abnormally. \The problem is in tomcat, HAproxy or MySQL. It is suspected that this problem is caused by the timeout handling mechanism among the three. The following is the timeout parameter setting and general explanation of the three.

Timeout parameters in Tomcat:

image.png
image.png
hummer.datasource.hikari.idle-timeout#mySql连接-idle状态的最大时长hummer.datasource.hikari.max-lifetime#mySql连接-生命时长hummer.datasource.hikari.connection-timeout#mySql连接-超时时间

Timeout parameters in HAproxy:

image.png
timeout queue 1m#一个请求在队列里的超时时间timeout connect 10s#连接超时timeout client 1m#客户端超时timeout server 1m#服务器端超时timeout check 10s#检测超时maxconn 65535#每个进程可用的最大连接数

Timeout parameters in MySQL:

image.png
connect_timeout#连接过程中握手的超时时间wait_timeout & interactive_timeout#不活跃的连接超时时间net_read_timeout  &  net_write_timeout#在网络条件不好的情况下起作用
Processing process:

The configuration adjustments that have been made and the results feedback
MySQL parameter tuning, optimize mysql operating parameters, adjust max_allowed_packet to 1G, slave_max_allowed_packet to 1G, and slave_pending_jobs_size_max to 256M. However, from the MySQL error log around the 21st, no obvious difference can be seen. It can be preliminarily considered that it has nothing to do with these three items. In addition, Got an error reading... appears in the error log more uniformly rather than concentrated in large numbers, and MySQL Both wait_timeout & interactive_timeout are set to 8 hours, inferring that the problem is not MySQL.

By modifying the Tomcat back-end connection IP, before connecting Tomcat directly to the MySQL database, Got an error reading... related information will not be printed in the MySQL error log. It is preliminarily inferred that the problem is probably HAproxy.

By commenting out the HAproxy parameters timeout client 1m (client-side timeout) and timeout server 1m (server-side timeout), this problem no longer occurs. The official explanation of the relevant parameters is as follows:

The inactivity timeout applies when the client is expected to acknowledge or send data. In HTTP mode, this timeout is particularly important to consider during the first phase, when the client sends the request, and during the response while it is reading data sent by the server. That said, for the first phase, it is preferable to set the “timeout http-request” to better protect HAProxy from Slowloris like attacks.

This parameter is specific to frontends, but can be specified once for all in “defaults” sections. This is in fact one of the easiest solutions not to forget about it. An unspecified timeout results in an infinite timeout, which is not recommended. Such a usage is accepted and works but reports a warning during startup because it may results in accumulation of expired sessions in the system if the system’s timeouts are not configured either.

The general idea is: when the client is expected to confirm or send data, the application inactivity times out. In HTTP mode, it is particularly important to consider this timeout in the first phase (when the client sends a request) and during the response (when reading the data sent by the server). In other words, for the first stage, it is best to set "timeout http-request" to better protect HAProxy from attacks like slow slackers.

This parameter is specific to the front end, but can be specified once in the "default value" section. This is actually one of the simplest solutions, don't forget it. Failure to specify a timeout will result in an infinite timeout, which is not recommended. This usage is acceptable and works normally, but
a warning will be reported during startup , because if the system timeout is not configured, it may cause the accumulation of expired sessions in the system.

Other cases:
MySQL Group Replication cluster, also need to track network abnormalities and other errors

solution

1. Add a query operation similar to a heartbeat in Tomcat, so that all idle connections in the connection pool send test SQL statements periodically.
Advantages: Thorough solution;
Disadvantages: However, this approach requires modifying the configuration of all applications connected to HAproxy, and there may be other chain reactions after modification, such as whether the current strategy and method of recycling connection resources will be affected.

2. Comment out the parameter timeout client in the HAproxy configuration file (this timeout period does not refer to the timeout period of the connection process, but refers to the time after the connection is connected, there is no heartbeat or operation, and it is considered timeout if this time is exceeded, and then the connection is disconnected ).
Advantages: easy to operate, fewer points need to be changed;
disadvantages: see 2.2.4.

3. Ignore the Abort Connection reported in the MySQL error log.
Advantages: No operation is required, and will not have any impact on the existing configuration.
Disadvantages: Thousands of Abort Connection records are printed every day in the error log, and zabbix monitoring can perceive this record. The existing connection success rate calculation method will be affected by this Information interference.

to sum up:

But when a problem occurs, you can take a look and observe whether the show global status like'abort%'; Aborted_clients and Aborted_connects have changed drastically.
Looking at the log of the corresponding application