Case study: MySQL containerization at JD Daojia

This article describes how JD Daojia containerized MySQL, covering the container-based underlying resource platform, the monitoring system, and the automated database operation and maintenance platform. It also details the technical implementation: software and hardware selection, the container scheduling algorithm, database high availability, the monitoring system, and development of the automated operation and maintenance platform.

1. Background

2. Technical solution

3. Technical implementation

3.1 Construction of the underlying database resource platform
①Software and hardware selection
②Container scheduling algorithm
③High-availability implementation of MySQL

3.2 Monitoring system

3.3 Development of an automated operation and maintenance platform
①MySQL application and delivery
②Configuration changes
③MySQL tools

4. Summary

1. Background

As JD Daojia's business grew rapidly, MySQL traffic kept increasing, and running MySQL on cloud hosts increasingly failed to meet our requirements:

  • The I/O performance of the cloud hosts' cloud disks could not keep up with MySQL's high-concurrency access.
  • The physical machines underneath the cloud hosts are opaque to us, so network or hardware failures cannot be located promptly.
  • Changing the specification of MySQL running on a cloud host requires shutting the host down to change its configuration.
  • Running self-built MySQL on cloud hosts is expensive, and purchasing cloud MySQL costs even more.

For these reasons, we chose to purchase physical machines and run MySQL in containers on them.

2. Technical solution

In our view, a complete database operation and maintenance solution consists of three parts: the underlying database resource platform, the monitoring system, and the automated database operation and maintenance platform.


Underlying database resource platform

  • Build a Docker environment on physical machines, run MySQL instances in Docker, and use Docker's features for resource isolation and resource overcommitment (overselling).
  • Schedule containers with custom rules and algorithms.
  • For high availability, extend MHA and Zabbix through secondary development so that failover happens quickly after an instance goes down.

Monitoring system

The monitoring system uses Zabbix; the monitoring data of each container (CPU, memory, etc.) is obtained through the Docker API.

Database automated operation and maintenance platform

We developed the MySQL automated operation and maintenance platform with Python and the Flask framework. It automates operations across MySQL's complete life cycle and improves the productivity of the operations staff.

3. Technical implementation

3.1 Construction of the underlying database resource platform

Software and hardware selection

Our MySQL workloads require high concurrency and high I/O, so we made the following software and hardware choices:

  • Physical machine: 64 cores, 256 GB memory, and either 16 × 960 GB SSDs in RAID 10 or 4 TB of NVMe in RAID 0
  • Operating system: CentOS 7.5
  • Container: Docker 1.13.1 in host network mode. (This also shows that we isolate resources mainly at the CPU and memory level; there is no network-level isolation.)
  • Image: custom MySQL 5.6.36 and MySQL 5.7.22 images
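For reference, the usable capacity of the two storage options works out as follows. RAID 10 mirrors every disk, so half of the raw capacity is usable; RAID 0 stripes without redundancy, so all of it is usable (the 4 TB figure is taken as the total NVMe capacity stated above):

```latex
C_{\text{RAID 10}} = \frac{16 \times 960\ \text{GB}}{2} = 7680\ \text{GB} \approx 7.7\ \text{TB},
\qquad
C_{\text{RAID 0}} = 4\ \text{TB}
```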

We compared a containerized MySQL instance with a MySQL instance on a cloud host of the same specification. MySQL on the cloud host reached a maximum of 23K QPS.


MySQL in the container makes full use of the high I/O of the local SSDs and reached a maximum of 90K QPS.


After containerization, MySQL delivers roughly 3.9 times the peak QPS of MySQL on the cloud host (90K vs. 23K), which fully meets JD Daojia's current MySQL performance requirements.

Container scheduling algorithm

MySQL is a stateful service, and DBAs should not have to operate on it frequently. To keep it stable and robust, we defined our own rules and algorithm for container scheduling:

  • Instances of the same cluster are placed in different availability zones.
  • Instances of the same cluster are placed on different hosts.
  • MySQL instances are spread across hosts according to business tier, and too many core-service instances must not be deployed on one host.
  • In sharded database systems, the MySQL instance of each shard is placed on a different host.
  • Hosts with the most free CPU, memory, and disk space are preferred.
  • CPU is overcommitted through Docker, but the overcommit ratio never exceeds 2 times the number of physical CPU cores.

Based on these principles, we developed a container scheduling system that decides where every container is placed.
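A minimal Python sketch of how the placement rules above could be expressed as a filter-then-score step. Only the rules themselves and the 2x CPU overcommit ceiling come from the article; the `Host`/`Request` data model, the cap of two core-service instances per host, and the tie-break order (free CPU, then memory, then disk) are hypothetical choices for illustration, not JD Daojia's actual scheduler.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical data model for illustration; not the production scheduler.
CPU_OVERCOMMIT_RATIO = 2.0        # from the article: allocated CPU <= 2x physical cores
MAX_CORE_INSTANCES_PER_HOST = 2   # assumed value for "not too many core services per host"


@dataclass
class Host:
    name: str
    zone: str                 # availability zone
    cores: int                # physical CPU cores
    cpu_allocated: float      # CPUs already promised to containers on this host
    mem_free_gb: float
    disk_free_gb: float
    clusters: Set[str] = field(default_factory=set)      # clusters with an instance here
    shard_groups: Set[str] = field(default_factory=set)  # sharded systems with a shard here
    core_instances: int = 0   # core-business instances already on this host


@dataclass
class Request:
    cluster: str
    cpus: float
    mem_gb: float
    disk_gb: float
    is_core: bool = False
    shard_group: Optional[str] = None


def cpu_headroom(host: Host) -> float:
    """CPUs still allocatable under the overcommit ceiling."""
    return host.cores * CPU_OVERCOMMIT_RATIO - host.cpu_allocated


def is_eligible(host: Host, req: Request, cluster_zones: Set[str]) -> bool:
    if req.cluster in host.clusters:          # same cluster -> different hosts
        return False
    if host.zone in cluster_zones:            # same cluster -> different AZs
        return False
    if req.shard_group and req.shard_group in host.shard_groups:  # one shard per host
        return False
    if req.is_core and host.core_instances >= MAX_CORE_INSTANCES_PER_HOST:
        return False
    return (cpu_headroom(host) >= req.cpus
            and host.mem_free_gb >= req.mem_gb
            and host.disk_free_gb >= req.disk_gb)


def pick_host(hosts, req: Request, cluster_zones: Set[str] = frozenset()) -> Optional[Host]:
    """Return the eligible host with the most free resources, or None."""
    candidates = [h for h in hosts if is_eligible(h, req, cluster_zones)]
    if not candidates:
        return None
    # Assumed tie-break order: free CPU first, then memory, then disk.
    return max(candidates, key=lambda h: (cpu_headroom(h), h.mem_free_gb, h.disk_free_gb))


if __name__ == "__main__":
    hosts = [
        Host("db01", "az1", 64, 100, 50, 2000),
        Host("db02", "az2", 64, 20, 120, 3000),
        Host("db03", "az2", 64, 10, 200, 3000, clusters={"orders"}),
    ]
    req = Request("orders", cpus=8, mem_gb=32, disk_gb=500)
    print(pick_host(hosts, req).name)                         # db02
    print(pick_host(hosts, req, cluster_zones={"az2"}).name)  # db01
```

Treating every rule as a hard filter keeps the sketch simple; a real scheduler might downgrade some rules (for example the availability-zone rule) to preferences when too few zones are available.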


High-availability implementation of MySQL

Daojia's application servers access MySQL by domain name. We extended MHA and the Zabbix monitoring system through secondary development so that, when a failure occurs, failover is performed by quickly changing the domain name's DNS resolution. The entire switchover completes within 10 seconds.


When Zabbix detects that a MySQL instance is down, it first determines whether the instance is the master or a slave. A master failure is handled by MHA Manager; a slave failure is handled by a script that Zabbix calls.

Master failure: MHA Manager promotes the slave with the highest Master_Log_File and Read_Master_Log_Pos to be the new master. MHA Manager also calls the DNS resolution API to point the master's domain name at the new master's IP. Because DNS results may be cached, the updated record can take a long time to take effect. To shorten this, the failover system also uses the failed database's domain name to look up the IPs of all application servers connected to it, and uses SaltStack to batch-update /etc/hosts on each of those application hosts, binding the domain name to the new IP. This way the change takes effect within seconds.
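The promotion rule above (the slave with the highest Master_Log_File and Read_Master_Log_Pos wins) can be illustrated with a short Python sketch. It only shows the comparison; MHA Manager performs the actual selection and recovery. The dict-shaped input (one SHOW SLAVE STATUS row per slave address) and the example addresses are hypothetical.

```python
def binlog_coordinates(slave_status: dict) -> tuple:
    """(binlog file number, position) of the master binlog this slave has read up to.

    Uses the Master_Log_File and Read_Master_Log_Pos columns of SHOW SLAVE STATUS.
    """
    file_name = slave_status["Master_Log_File"]      # e.g. "mysql-bin.000042"
    file_number = int(file_name.rsplit(".", 1)[1])    # compare the numeric suffix
    return file_number, int(slave_status["Read_Master_Log_Pos"])


def choose_new_master(slaves: dict) -> str:
    """Pick the slave that has read furthest into the failed master's binlog.

    `slaves` maps a slave's address to its SHOW SLAVE STATUS row (as a dict).
    """
    if not slaves:
        raise ValueError("no candidate slaves")
    return max(slaves, key=lambda addr: binlog_coordinates(slaves[addr]))


if __name__ == "__main__":
    slaves = {
        "10.0.0.2": {"Master_Log_File": "mysql-bin.000041", "Read_Master_Log_Pos": 987654},
        "10.0.0.3": {"Master_Log_File": "mysql-bin.000042", "Read_Master_Log_Pos": 120},
    }
    print(choose_new_master(slaves))  # 10.0.0.3
```

Comparing (file number, position) tuples means a later binlog file always wins regardless of position, and parsing the suffix as an integer keeps the order correct even if suffix widths differ.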

Slave failure: Zabbix calls the DNS resolution API to point the domain name bound to the failed slave at the master. The subsequent steps are similar to those for a master failure.
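For both cases, the two decisions the failover system makes can be sketched in Python: which IP the failed instance's domain name should point to afterwards, and what each application host's /etc/hosts should contain. The function names and the example domain are hypothetical. Calling the DNS API and pushing the file to application hosts (done with SaltStack in the article) are intentionally left out.

```python
from typing import Optional


def failover_target_ip(role: str, master_ip: str, promoted_ip: Optional[str] = None) -> str:
    """IP the failed instance's domain name should point to after failover.

    role == "master": the slave promoted by MHA Manager takes over.
    role == "slave":  the failed slave's domain falls back to the current master.
    """
    if role == "master":
        if promoted_ip is None:
            raise ValueError("a promoted slave IP is required for master failover")
        return promoted_ip
    if role == "slave":
        return master_ip
    raise ValueError(f"unknown role: {role!r}")


def bind_domain(hosts_text: str, domain: str, new_ip: str) -> str:
    """Return /etc/hosts content with `domain` bound to new_ip.

    Any existing entry for `domain` is removed; other host names on the same
    line keep their old IP (trailing comments on that line are dropped).
    The new binding is appended at the end.
    """
    kept = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()   # ignore comments when matching
        if len(fields) >= 2 and domain in fields[1:]:
            others = [name for name in fields[1:] if name != domain]
            if others:
                kept.append(" ".join([fields[0]] + others))
            continue                             # drop the stale binding
        kept.append(line)
    kept.append(f"{new_ip} {domain}")
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    original = "127.0.0.1 localhost\n10.0.0.1 db-order.example.com\n"
    target = failover_target_ip("master", "10.0.0.1", "10.0.0.3")
    print(bind_domain(original, "db-order.example.com", target), end="")
```

Because /etc/hosts is consulted before DNS in the default resolver order on CentOS, the new binding takes effect on the application hosts without waiting for cached DNS records to expire.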


3.2 Monitoring system

Daojia's monitoring system is built on Zabbix, which monitors MySQL's running status through a custom template. Note that CPU and memory figures read from the OS layer inside a Docker container are inaccurate, because tools inside the container see the host's /proc values rather than the container's limits. We therefore collect these metrics by calling the Docker API and aggregate them into Zabbix.
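A sketch of the collection step in Python, assuming one stats sample per container from the Docker Engine API stats endpoint and the standard `zabbix_sender` CLI for pushing values. The CPU formula follows the one the docker CLI uses; the Zabbix item keys (`docker.cpu.pct`, `docker.mem.used`), the Zabbix host naming, and the use of the Docker SDK for Python are hypothetical choices for illustration.

```python
def cpu_percent(stats: dict) -> float:
    """Container CPU usage in percent, where 100% means one full core.

    `stats` is one sample from GET /containers/{id}/stats?stream=false.
    Formula: (container CPU delta / host CPU delta) * number of CPUs * 100.
    """
    cur, prev = stats["cpu_stats"], stats["precpu_stats"]
    prev_total = prev.get("cpu_usage", {}).get("total_usage")
    prev_system = prev.get("system_cpu_usage")
    if prev_total is None or prev_system is None:
        return 0.0                                # no previous sample to diff against
    cpu_delta = cur["cpu_usage"]["total_usage"] - prev_total
    system_delta = cur.get("system_cpu_usage", 0) - prev_system
    # Older API versions may lack online_cpus; fall back to the per-CPU list length.
    n_cpus = cur.get("online_cpus") or len(cur["cpu_usage"].get("percpu_usage") or [])
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    return cpu_delta / system_delta * n_cpus * 100.0


def memory_used_bytes(stats: dict) -> int:
    """Memory usage excluding page cache (the cgroup v1 'cache' counter)."""
    mem = stats["memory_stats"]
    return mem["usage"] - mem.get("stats", {}).get("cache", 0)


def zabbix_sender_cmd(server: str, zabbix_host: str, key: str, value) -> list:
    """Command line that pushes one value to a Zabbix trapper item."""
    return ["zabbix_sender", "-z", server, "-s", zabbix_host, "-k", key, "-o", str(value)]


def collect(container_name: str, server: str) -> None:
    """Fetch one stats sample via the Docker SDK for Python and push it to Zabbix."""
    import subprocess
    import docker  # Docker SDK for Python (pip install docker)

    stats = docker.from_env().containers.get(container_name).stats(stream=False)
    for key, value in (("docker.cpu.pct", round(cpu_percent(stats), 2)),
                       ("docker.mem.used", memory_used_bytes(stats))):
        subprocess.run(zabbix_sender_cmd(server, container_name, key, value), check=True)
```

Keeping the calculations as pure functions over the stats dict makes them easy to test without a Docker daemon; only `collect` touches Docker and Zabbix.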


Zabbix can be configured so that, when a trigger fires, an action automatically runs a custom script. We use this feature to make MySQL self-healing after failures:

  • When a MySQL instance raises a low-disk-space alarm, a script automatically deletes redundant files to free up space.
  • When a MySQL instance raises a high-CPU-usage alarm, a script automatically emails the currently running SQL and all connections to the DBAs and the responsible developers, so the cause of the alarm can be found quickly.
  • The automatic domain-name switchover after a MySQL crash, described earlier, also relies on this Zabbix feature.
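The high-CPU handler described above could be sketched in Python as follows. It assumes the script has already fetched `SHOW FULL PROCESSLIST` as a list of dicts keyed by that statement's column names (Id, User, Host, db, Command, Time, State, Info); the report layout, the `top` limit, the email subject, and the SMTP settings are hypothetical.

```python
from collections import Counter
from email.message import EmailMessage
import smtplib


def cpu_alarm_report(processlist: list, top: int = 20) -> str:
    """Summarize SHOW FULL PROCESSLIST rows for a high-CPU alert email."""
    running = [r for r in processlist if r.get("Command") != "Sleep"]
    running.sort(key=lambda r: int(r.get("Time") or 0), reverse=True)  # longest first
    # Host is "ip:port" for TCP clients; group connections by client IP.
    by_client = Counter(str(r.get("Host") or "").split(":")[0] for r in processlist)

    lines = [f"{len(processlist)} connections, {len(running)} running",
             "connections by client: "
             + ", ".join(f"{ip}={n}" for ip, n in by_client.most_common())]
    for r in running[:top]:
        sql = " ".join((r.get("Info") or "").split())[:300]   # one line, truncated
        lines.append(f"[{r.get('Time')}s] id={r.get('Id')} {r.get('User')}@{r.get('Host')} "
                     f"db={r.get('db')} state={r.get('State')}: {sql}")
    return "\n".join(lines)


def send_report(report: str, recipients: list, smtp_host: str, sender: str) -> None:
    """Email the report to DBAs and developers (addresses are deployment-specific)."""
    msg = EmailMessage()
    msg["Subject"] = "MySQL high CPU: running SQL and connections"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(report)
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Sorting running statements by `Time` puts the longest-running queries, usually the most likely cause of a CPU spike, at the top of the email.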

3.3 Development of an automated operation and maintenance platform

We built the MySQL automated operation and maintenance system with Python and Flask. It streamlines and automates the entire MySQL life cycle: resource requests, instance creation and destruction, master-slave deployment, selection of cluster backup policies, adding and removing monitoring, and so on.

MySQL application and delivery


Developers request MySQL resources through the database operation and maintenance platform. After a DBA approves the request, the backend automatically creates a MySQL cluster with a master-slave architecture, placing it according to the container scheduling algorithm. It also automatically creates the corresponding MySQL accounts and grants; account types include monitoring, backup, replication, tools, and others. Each newly created container comes with the MySQL server, Zabbix agent, SaltStack minion, Percona Toolkit, backup scripts, a slow-log rotation script, and so on.
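The automatic account step could be sketched in Python as generating CREATE USER and GRANT statements per account type. The article lists the account types but not their privileges, so the privilege sets, user names, and host patterns below are assumptions for illustration; the quoting helper assumes MySQL's default sql_mode (backslash escapes enabled).

```python
# Assumed privilege sets; the article names the account types but not their grants.
ACCOUNT_PRIVILEGES = {
    "monitor": "PROCESS, REPLICATION CLIENT, SELECT",
    "backup": "RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT",
    "replication": "REPLICATION SLAVE",
    "tools": "SELECT, PROCESS, REPLICATION CLIENT",
}


def sql_string(value: str) -> str:
    """Quote a value as a MySQL string literal (escape backslash and single quote)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def account_statements(account_type: str, user: str, password: str,
                       host_pattern: str = "%") -> list:
    """CREATE USER + GRANT statements for one platform-managed account (MySQL 5.6/5.7 syntax)."""
    privileges = ACCOUNT_PRIVILEGES[account_type]
    account = f"{sql_string(user)}@{sql_string(host_pattern)}"
    return [
        f"CREATE USER {account} IDENTIFIED BY {sql_string(password)};",
        f"GRANT {privileges} ON *.* TO {account};",
    ]


if __name__ == "__main__":
    for stmt in account_statements("replication", "repl", "change-me", "10.0.%"):
        print(stmt)
```

The platform would then execute these statements on the new master with a MySQL client library; generating them as plain strings keeps the privilege policy in one reviewable table.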

The entire process is automated: containers are created quickly from the image, and a complete, usable MySQL cluster can be delivered within 5 minutes.

Configuration changes

Thanks to the properties of containers, calling Docker's update command changes a container's CPU and memory quota in real time. This lets us change a MySQL instance's specification without downtime and scale it up or down quickly.
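A Python sketch of the resize call, assuming CPU limits are expressed as a CFS quota (the article does not say which Docker CPU option is used). `--cpu-period`, `--cpu-quota`, `--memory`, and `--memory-swap` are standard `docker update` options; the container name and limit values are hypothetical.

```python
import subprocess

CPU_PERIOD_US = 100_000  # 100 ms CFS period (the kernel default)


def docker_update_cmd(container: str, cpus: float, memory_gb: int) -> list:
    """Build a `docker update` command that changes a running container's CPU and memory limits.

    --cpu-period/--cpu-quota are used instead of --cpus, which newer Docker
    releases accept but older ones such as 1.13 may not for `docker update`.
    """
    memory = f"{int(memory_gb)}g"
    return [
        "docker", "update",
        f"--cpu-period={CPU_PERIOD_US}",
        f"--cpu-quota={int(cpus * CPU_PERIOD_US)}",  # quota / period = number of CPUs
        f"--memory={memory}",
        f"--memory-swap={memory}",                    # equal to --memory: no extra swap
        container,
    ]


def resize(container: str, cpus: float, memory_gb: int) -> None:
    """Apply the new limits; raises CalledProcessError if docker rejects them."""
    subprocess.run(docker_update_cmd(container, cpus, memory_gb), check=True)
```

For example, `resize("mysql-3306", 16, 64)` would move a container to 16 CPUs and 64 GB. The container limit does not change MySQL's own memory settings: on MySQL 5.7, `innodb_buffer_pool_size` can be adjusted online with `SET GLOBAL`, while on 5.6 changing it requires a restart.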


MySQL tools

After MySQL is delivered, we provide a large set of MySQL tools: SQL syntax analysis, current slow-log analysis, connection-count query, slave-lag query, master-slave relationship query, running-SQL query, quick MySQL health check, physical machine monitoring, error logs, and more. These tools make day-to-day work easier for developers and help with troubleshooting.
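As an example of one such tool, a slave-lag check could be as small as the Python sketch below, which classifies a replica from its `SHOW SLAVE STATUS` row (column names as MySQL 5.6/5.7 return them). The function name, the 60-second threshold, and the status labels are hypothetical.

```python
def replication_health(status: dict, max_lag_s: int = 60) -> tuple:
    """Classify one replica as ok / lagging / broken / unknown from SHOW SLAVE STATUS."""
    io = status.get("Slave_IO_Running")      # "Yes", "No" or "Connecting"
    sql = status.get("Slave_SQL_Running")    # "Yes" or "No"
    lag = status.get("Seconds_Behind_Master")
    if io != "Yes" or sql != "Yes":
        return "broken", f"IO={io} SQL={sql} error={status.get('Last_Error') or '-'}"
    if lag is None:                          # NULL while the lag cannot be computed
        return "unknown", "Seconds_Behind_Master is NULL"
    if int(lag) > max_lag_s:
        return "lagging", f"{lag}s behind master"
    return "ok", f"{lag}s behind master"
```

`Seconds_Behind_Master` is only an estimate; pt-heartbeat from Percona Toolkit, which each container already ships with, measures replication delay more reliably if a finer-grained check is needed.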


4. Summary

At present, more than 95% of Daojia's MySQL instances run in containers. The containerized MySQL platform has withstood every test from the 415 anniversary, 618, and 1020 promotions, and it has brought us three major benefits:

  • Meets performance needs: MySQL instances running in Docker on physical machines perform well enough to meet the requirements of Daojia's many business scenarios.
  • Lower costs: Docker's resource isolation lets us deploy multiple MySQL instances on the same machine, and overcommitting CPU through Docker further improves resource utilization. Compared with building MySQL on a cloud host, the cost of building MySQL is reduced by 50%, and compared with purchasing cloud MySQL, it is reduced by 100%.
  • Higher efficiency: all MySQL workflows are automated, a usable MySQL master-slave cluster can be delivered within 5 minutes, and operation and maintenance efficiency has improved greatly.