Pitfall Log: Troubleshooting a Service Performance Degradation Caused by a Bug in Druid 1.1.22

Symptoms

One afternoon, a flood of interface latency alarms suddenly appeared in the test environment of one of our services. Monitoring showed that p99 interface latency had reached more than 5s and was still climbing.

[Screenshot: p99 interface latency monitoring]


As the business side kept timing out and retrying, the database connection threads filled up, the number of waiting threads soared, and a large number of threads became blocked.

[Screenshot: database thread and connection-wait monitoring]

Initial investigation process

1. APM trace monitoring showed that for every call chain, the bulk of the time was spent inside DataSource#getConnection.

[Screenshot: APM trace showing time concentrated in DataSource#getConnection]

2. We initially suspected slow SQL queries, but this turned out not to be the root cause: monitoring showed that the p99 of the slowest SQL statement at the time was 56.99ms, with a QPS below 0.01, so slow queries could essentially be ruled out.

3. GC analysis: there were 289 GCs within half an hour, with the longest stop-the-world (STW) pause at 280ms. (During STW, every thread in the Java application except the GC threads is suspended, meaning the user's worker threads are all paused for that period.) We first suspected that the 280ms STW was what slowed down DataSource#getConnection, but that was not the case: pauses longer than 100ms accounted for only 1.73% of GCs, which is not significant.

[Screenshots: GC count and pause statistics]

4. Interface QPS analysis: with the service running as a single instance, testers looping through test cases drove the request rate for these 3 interfaces up to 3000 requests/min, i.e. about 3000/60 = 50 requests per second on average, while in production the average latency of these three interfaces is 20~30ms. At the time we suspected that the load on these three interfaces was interfering with the threads serving other interfaces, but that still would not explain why Java threads spent so long waiting for a MySQL connection.

[Screenshot: interface QPS monitoring]

Preliminary workaround: with the business side blowing up our phones and flooding WeChat, we needed relief fast, so we scaled the service out to three containers, and the problem was, almost miraculously, alleviated.

Further investigation

1. The preliminary analysis concluded that 3000 requests/min against a single interface on a single instance was interfering with normal queries from other interfaces, but that did not explain why DataSource#getConnection was so slow, so we needed to keep digging for the root cause.

2. To reproduce the problem, I used JMeter to hit the two interfaces with the longest latency during the incident at 3000 requests/min, but the problem did not reproduce: the interfaces responded quickly, with an average response time within 10ms.

[Screenshot: JMeter results, average response time within 10ms]
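For anyone who wants to replay this load without JMeter, here is a rough, self-contained stand-in for that profile, not the actual test plan: plain Java firing roughly 50 requests per second (3000 per minute) at a single endpoint. The endpoint URL, thread count, and run length are all illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Rough stand-in for the JMeter plan: ~50 req/s (3000 req/min) against one interface.
public class LoadReproSketch {
    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/api/example")) // hypothetical endpoint
                .GET()
                .build();

        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
        // One request every 20ms ≈ 50 requests per second.
        scheduler.scheduleAtFixedRate(() -> {
            long start = System.nanoTime();
            try {
                HttpResponse<Void> resp = client.send(request, HttpResponse.BodyHandlers.discarding());
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println("status=" + resp.statusCode() + " latency=" + ms + "ms");
            } catch (Exception e) {
                System.out.println("request failed: " + e.getMessage());
            }
        }, 0, 20, TimeUnit.MILLISECONDS);

        Thread.sleep(60_000); // run for one minute, then stop
        scheduler.shutdownNow();
    }
}
```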

3. To take the reproduction further, I rolled the Apollo configuration back to the values it had when the service was misbehaving (while doing so I noticed that the Druid settings test-on-borrow and test-on-return had since been changed to false; that change later turned out to be the key to the fix), then reran JMeter. The problem reappeared, with DataSource#getConnection taking up to 3s.

[Screenshot: DataSource#getConnection latency of up to 3s after reproducing the problem]

4. Use Arthas to trace the call chain of DataSource#getConnection and analyze where the time goes. Command: trace com.alibaba.druid.pool.DruidDataSource getConnection '#cost > 100' -n 5

[Screenshot: Arthas trace output for DruidDataSource#getConnection]

Drilling down layer by layer, we eventually arrived at the method com.alibaba.druid.util.MySqlUtils#getLastPacketReceivedTimeMs.

[Screenshot: trace drilling down to MySqlUtils#getLastPacketReceivedTimeMs]

Reading the code alongside the trace, we found that the Druid 1.1.22 source tries to load the class com.mysql.jdbc.MySQLConnection, but this class does not exist in the mysql-connector 8.x version we currently use (8.x uses com.mysql.cj.MysqlConnection instead). As a result, com.alibaba.druid.util.Utils#loadClass repeatedly attempts to load a non-existent class, which drags down the overall performance of the service (the root cause).

[Screenshot: Druid 1.1.22 source code loading com.mysql.jdbc.MySQLConnection]
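To make the cost concrete, here is a minimal sketch (not the Druid source) of the same pattern: looking up a class that is not on the classpath, with the failure never cached, so every call pays the full classloader search plus exception construction again. The class name mirrors the one Druid looks for; the iteration count is arbitrary.

```java
// Minimal sketch of why an uncached lookup of a missing class is expensive.
// This is NOT the Druid source, just the same pattern as Utils#loadClass:
// swallow ClassNotFoundException and return null, so nothing remembers the failure.
public class MissingClassLoadCost {

    static Class<?> tryLoad(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            return null; // failure is not cached; the next call repeats the whole search
        }
    }

    public static void main(String[] args) {
        String missing = "com.mysql.jdbc.MySQLConnection"; // absent from mysql-connector 8.x
        int iterations = 100_000; // arbitrary, just to make the cost visible

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            tryLoad(missing);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(iterations + " failed class loads took " + elapsedMs + " ms");
    }
}
```

When connection validation is enabled, this lookup sits directly on the getConnection path, so every borrow of a connection pays this cost.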

5. In point 3 we mentioned that the Druid settings test-on-borrow and test-on-return had been changed to false, and that this change later proved to be the key to the fix.

The reason: when testOnBorrow or testOnReturn is true, this.testConnectionInternal is invoked, and testConnectionInternal calls com.alibaba.druid.util.MySqlUtils#getLastPacketReceivedTimeMs, which in turn triggers the bug described in point 4. Readers who are interested can dig into the source code; a configuration sketch follows the screenshot below.

[Screenshot: Druid source showing testConnectionInternal calling MySqlUtils#getLastPacketReceivedTimeMs]
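For completeness, here is a minimal programmatic sketch of the resulting pool configuration, assuming a plain DruidDataSource (the connection details are placeholders): validation no longer runs on every borrow or return, so getConnection stays off the testConnectionInternal → MySqlUtils path on the hot path, while testWhileIdle still catches stale connections far less frequently.

```java
import com.alibaba.druid.pool.DruidDataSource;

// Sketch of the pool settings that avoided the bug (values are illustrative).
public class DruidPoolConfigSketch {
    public static DruidDataSource build() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/demo"); // placeholder connection details
        ds.setUsername("demo");
        ds.setPassword("demo");

        ds.setTestOnBorrow(false);        // no validation on every getConnection
        ds.setTestOnReturn(false);        // no validation on every return to the pool
        ds.setTestWhileIdle(true);        // only validate connections that have sat idle
        ds.setValidationQuery("SELECT 1");
        return ds;
    }
}
```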

Summary

  • Root cause: Druid 1.1.22 repeatedly tries to load the non-existent class com.mysql.jdbc.MySQLConnection (it does not exist in mysql-connector 8.x), which blocks threads and limits performance
  • Optimizations:
    • The official recommendation is to set test-on-borrow and test-on-return to false; see: DruidDataSource reference configuration
    • If you are using mysql-connector 8.x, upgrade Druid to 1.2.1 or later