One afternoon, a flood of interface-latency alarms suddenly fired in the test environment of one of our services. Monitoring showed that p99 interface latency had climbed past 5s and was still getting worse.
As the business side kept timing out and retrying, the database connection threads filled up, the number of waiting threads soared, and a large number of threads were blocked.
Initial investigation process
1. Through APM trace monitoring, we found that the bulk of the time on every call chain was spent in DataSource#getConnection.
2. We initially suspected slow SQL queries, but that turned out not to be the root cause: monitoring showed that the slowest query's p99 at the time was 56.99ms with a QPS below 0.01, so slow SQL could essentially be ruled out.
3. GC analysis: there were 289 GCs in half an hour, with the longest stop-the-world (STW) pause at 280ms. (During STW, every thread in the Java application except the garbage-collector threads is suspended, meaning all normal worker threads were paused for that interval.) We first suspected that the 280ms pause was what slowed DataSource#getConnection, but that was not the case: pauses over 100ms accounted for only 1.73% of GCs, which is not high enough to explain the latency.
4. Interface QPS analysis: with the service running as a single instance, test engineers looping through test cases pushed the request rate on three interfaces to 3000 requests/min, an average of 3000/60 = 50 requests/s, while in production these three interfaces average only 20~30ms each. We suspected this load was interfering with the threads serving other interfaces, but that still did not explain why Java threads spent so long waiting for a MySQL connection.
Preliminary mitigation: with the phone ringing off the hook and WeChat flooded with frantic messages from the business side, we needed relief as quickly as possible, so we scaled the service out to three containers, and the problem was miraculously alleviated.
Further troubleshooting
1. The preliminary conclusion was that 3000 requests/min against a single interface on a single instance was disturbing queries from the other interfaces, but that did not explain the time spent in DataSource#getConnection, so we kept digging for the root cause.
2. To reproduce the problem, I used JMeter to drive the two slowest interfaces from the incident at 3000 requests/min, but the problem did not reproduce: responses were fast, averaging under 10ms.
3. To dig further, I rolled the Apollo configuration back to the point in time of the incident (in doing so I noticed that the Druid settings test-on-borrow and test-on-return had since been changed to false; this change later turned out to be the key to the fix), then reran JMeter. The problem reappeared, with DataSource#getConnection taking up to 3s.
4. Use Arthas to trace the DataSource#getConnection call chain and analyze where the time goes. Command: trace com.alibaba.druid.pool.DruidDataSource getConnection '#cost > 100' -n 5
Drilling down layer by layer, we finally arrived at the method com.alibaba.druid.util.MySqlUtils#getLastPacketReceivedTimeMs.
Reading the code alongside the trace, we found that Druid 1.1.22 hard-codes a load of the class com.mysql.jdbc.MySQLConnection, which does not exist in the mysql-connector 8.x driver we currently use (8.x renamed it to com.mysql.cj.MysqlConnection). As a result, com.alibaba.druid.util.Utils#loadClass repeatedly attempts to load a non-existent class, which drags down the overall performance of the service (the root cause).
- An issue had already been filed against Druid for this and officially confirmed as a bug; for details, see: com.alibaba:druid:1.1.20 MysqlUtils is hard-coded to load the MySQLConnection class of mysql-connector-java 5.1, causing thread blocking and limiting performance
- The bug was fixed in version 1.2.1; for the fixing commit, see commit ef2f87b6: when using mysql-connector 8.X, the connection-validity check has no corresponding class
5. As mentioned in point 3, Druid's test-on-borrow and test-on-return settings had been changed to false, and this change later proved to be the key to fixing the problem.
The reason: when testOnBorrow or testOnReturn is true, this.testConnectionInternal is invoked, and testConnectionInternal in turn calls com.alibaba.druid.util.MySqlUtils#getLastPacketReceivedTimeMs, triggering the bug described in point 4. Interested readers can walk through the source themselves.
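To make the failure mode concrete, here is a minimal, hypothetical Java sketch (not Druid's actual source) of a helper shaped like com.alibaba.druid.util.Utils#loadClass: the ClassNotFoundException is swallowed, the null result is not cached, so every borrow/return validation pays the full cost of a failed classloader lookup plus a thrown exception.

```java
// Minimal sketch of the failure mode (hypothetical; not Druid's actual source).
public class LoadClassSketch {

    // Shaped like com.alibaba.druid.util.Utils#loadClass: return the class,
    // or null if it cannot be found.
    static Class<?> loadClass(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            return null; // failure is swallowed, so the next call retries from scratch
        }
    }

    public static void main(String[] args) {
        // With mysql-connector 8.x on the classpath, the 5.x class is absent,
        // so this lookup fails anew on every single call:
        System.out.println(loadClass("com.mysql.jdbc.MySQLConnection") == null); // true
        // A class that exists loads normally (and is cached by the classloader):
        System.out.println(loadClass("java.lang.String") == String.class); // true
    }
}
```

With testOnBorrow/testOnReturn enabled, a lookup like the first one above runs on every getConnection, which is how a missing class turns into pool-wide latency.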
Summary
- Root cause: Druid 1.1.22 repeatedly tries to load the non-existent class com.mysql.jdbc.MySQLConnection, blocking threads and limiting performance
- Optimizations:
- Druid's reference configuration recommends setting test-on-borrow and test-on-return to false; see: DruidDataSource reference configuration
- If you are using mysql-connector 8.x, upgrade Druid to version 1.2.1 or later
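As an illustration of the first recommendation, here is a hedged configuration sketch. The exact property keys depend on how your project wires Druid (Apollo, Spring Boot, etc.), so treat these Spring-Boot-style names as assumptions to verify against your own setup:

```properties
# Assumed Spring-Boot-style Druid keys; verify against your config source
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
# test-while-idle validates connections in the background instead of on every
# borrow/return, which is the combination Druid's reference config suggests
spring.datasource.druid.test-while-idle=true
```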
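For the second recommendation, the corresponding Maven coordinates might look like this (a sketch; 1.2.1 is the first version containing the fix mentioned earlier, and any later release should also work):

```xml
<!-- Upgrade Druid to 1.2.1+ to pick up the MySQLConnection loadClass fix -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>druid</artifactId>
    <version>1.2.1</version>
</dependency>
```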