The high-availability core of RDS: Aurora

Alibaba Cloud Relational Database Service (RDS) is a stable, reliable, and elastically scalable online database service.
Built on Alibaba Cloud's distributed file system and high-performance storage, RDS provides a complete set of solutions for disaster tolerance, backup, recovery, monitoring, migration, and more, removing the burden of database operation and maintenance.
So how does the RDS control plane guarantee the high availability of the database?

1. High-availability features of RDS

First, from the perspective of the overall architecture, RDS offers the following three deployment architectures and disaster recovery methods.

1.1 Active/Standby Architecture

An RDS instance adopts an active/standby architecture: the two instances are located on different servers, and data is synchronized between them automatically. When the primary instance becomes unavailable, the system automatically switches database connections to the standby instance.

1.2 Disaster Recovery in the Same City

Deploying the active and standby instances in different availability zones, with independent power and network environments, further improves data reliability.

1.3 Remote disaster recovery

RDS for MySQL supports creating a remote disaster recovery instance, with remote data synchronized in real time through data transmission. In an emergency, users can promote the remote disaster recovery instance to primary to keep the business available.

2. Aurora, the high-availability core of RDS

The high-availability core of RDS is the Aurora service. Aurora is the code implementation of the RDS HA service and is a distributed service deployed as a cluster. One Aurora cluster is deployed per region, with about 3-5 Aurora instances in each data center, for a total of about 20 instances. One instance is the leader (elected via JGroups) and the others are followers. Leader and followers run the same service process, but after the election the leader, in addition to its follower duties, is also responsible for managing system metadata and distributing HA tasks.

2.1 Aurora's control structure

Figure 1: Aurora's control structure


The aurora service maintains the information in metaDB and diagnoses the instances it hosts, repairing any instance whose diagnosis result is abnormal.
The repair work depends on the state of the standby database. To guarantee high availability, the standby's state is critical: besides a healthy replication status, the active/standby replication delay must stay within 60s for the repair to proceed normally.
As the high-availability core of RDS, the aurora service provides three functions: Diagnose, Decision, and Treat.


The fireman service is aurora's agent node, deployed on the physical database machines. It cooperates with aurora's repair work by executing the commands aurora issues, such as restarting an instance, killing processes, and setting read-only.

2.2 Process architecture of HA switching

Figure 2: Process architecture of HA switching

The aurora service checks the health of each managed instance every 15 seconds (an instance is automatically registered with aurora after creation) by issuing a checkMaster task.

  • If the check succeeds, enter the decision stage and wait for the next detection task.
  • If the check fails, repeat the checkMaster task; after three consecutive failures, enter the decision stage and fail over.
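The detection loop above can be sketched as follows. This is a minimal illustration, not Aurora's actual code: `check_health` and `on_failover` are placeholder callbacks standing in for the checkMaster probe and the failover decision.

```python
CHECK_INTERVAL_S = 15   # aurora probes each managed instance every 15 seconds
MAX_FAILURES = 3        # three consecutive failures trigger a failover

def run_check_master(check_health, on_failover):
    """Run one detection round: retry the health check up to
    MAX_FAILURES times; trigger failover only if every attempt fails."""
    for _attempt in range(MAX_FAILURES):
        if check_health():
            return "healthy"    # enter decision stage, wait for next round
    on_failover()               # three consecutive failures: fail over
    return "failover"
```

In the real service each retry is a fresh checkMaster task separated by the probe interval; the sketch only captures the three-strikes control flow.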

After entering the decision stage, the reason for the HA switchover is determined first; once the reason is known, the repair stage performs the HA switchover.
The repair stage first queries the SLB to find the physical IP bound to the backend, which identifies the IP and port of the primary. From the primary's IP it then finds the IP and port of the corresponding standby, and checks whether the standby's replication status, the replication delay between primary and standby, and the disk space all meet the switching conditions.

  • If the confirmation is successful, proceed to the next step.
  • If the confirmation fails, exit.

Kill all processes on the standby database, then test whether the primary can still be connected.

  • If it can be connected, kill all threads on the primary, or stop its MySQL process, to make sure no new connections are accepted.
  • If it cannot be connected, continue to the next step.

Confirm the heartbeat time difference between the standby and the primary.

  • If it is longer than 60s, exit.
  • If it is less than 60s, continue.

After the pre-check succeeds, the formal repair work begins. The details are as follows:

2.3 Detection


Start diagnosis (diagnose started): determine the current primary from the IP on the SLB link, and execute the heartbeat-check SQL on the primary with connection_timeout=15s and query_timeout=15s:

SET sql_log_bin = 1;
/* rds internal mark */ INSERT INTO mysql.ha_health_check (id, type) VALUES (${current_timestamp}, 'm') ON DUPLICATE KEY UPDATE id = ${current_timestamp};
/* rds internal mark */ DELETE FROM mysql.ha_health_check WHERE id < ${current_timestamp} AND type = 'm';
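A small helper can render these heartbeat statements with a concrete timestamp substituted for the placeholder. This is our own sketch of the templating step, not RDS code:

```python
def build_heartbeat_sql(ts_ms):
    """Render the three heartbeat statements, substituting a concrete
    timestamp (assumed to be in milliseconds) for the placeholder."""
    return [
        "SET sql_log_bin = 1;",
        "/* rds internal mark */ INSERT INTO mysql.ha_health_check (id, type) "
        f"VALUES ({ts_ms}, 'm') ON DUPLICATE KEY UPDATE id = {ts_ms};",
        "/* rds internal mark */ DELETE FROM mysql.ha_health_check "
        f"WHERE id < {ts_ms} AND type = 'm';",
    ]
```

The upsert keeps exactly one row of type 'm' holding the latest heartbeat time; the delete prunes older rows so the table stays small.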


  • First check: if it succeeds, go straight to the decision stage; if not, perform a second check.
  • Second check: if it succeeds, go straight to the decision stage; if not, perform a third check.
  • Third check: if it succeeds, go straight to the decision stage; if not, make a failure decision.

2.4 Decision

2.4.1 SetHaReason

Set the HA switchover type according to the connection error:


If there is a network failure error such as "Connection refused", the network is unreachable. Set the switch reason to MASTER_DOWN.


If there is an error such as "Connection timed out", the connection timed out. Set the switch reason to CONNECTION_TIMEOUT.


If the connection is normal but reads time out (a "Read timed out" error), execute the following two queries. If query 1 fails, or query 2 returns a value greater than 1000, the instance is regarded as in IO hang, and the switch reason is set to TCP_TIMEOUT.

1. SELECT id FROM mysql.ha_health_check WHERE type = 'm';
2. SELECT max(max_time) FROM information_schema.INNODB_IO_STATUS;


If there is an error such as MySQLTimeoutException, set the switch reason to MASTER_HANG.
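The classification rules above can be summarized in one function. This is a simplified sketch: the `io_hang` flag stands in for the result of the two follow-up queries, and the names are ours, not Aurora's.

```python
def classify_switch_reason(error_message, io_hang=False):
    """Map a primary-connection error message to an HA switch reason,
    following the rules described above (simplified sketch)."""
    if "Connection refused" in error_message:
        return "MASTER_DOWN"          # network unreachable
    if "Connection timed out" in error_message:
        return "CONNECTION_TIMEOUT"   # TCP connect timed out
    if "Read timed out" in error_message and io_hang:
        return "TCP_TIMEOUT"          # read timeout + IO hang
    if "MySQLTimeoutException" in error_message:
        return "MASTER_HANG"          # query-level timeout
    return None                       # no switch reason matched
```

The read-timeout case without IO hang is not given a switch reason in the article, so the sketch returns None there as well.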

2.4.2 Failover

Start the repair.

2.5 Repair


Get the backend IP, port, and ins_id by querying the SLB.


Run the following query on every standby node; candidates are evaluated against the conditions below, from top to bottom, and the node matching the most conditions is preferred:

show slave status;

Check the active/standby replication delay, which must not exceed 60s. Selection conditions:
1. show slave status returns a result
2. largest Master_Log_File
3. largest Read_Master_Log_Pos
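The candidate selection can be sketched as below. This is an illustration with our own field names; it assumes MySQL's fixed-width binlog file numbering (e.g. mysql-bin.000102), so lexicographic comparison of the file name matches binlog order.

```python
def pick_best_replica(replicas):
    """Choose the failover target from per-replica slave status.
    `replicas` is a list of dicts with keys 'master_log_file' and
    'read_master_log_pos'; nodes with no slave status are excluded.
    Preference: the replica that has read the most binlog, i.e. the
    largest Master_Log_File, then the largest Read_Master_Log_Pos."""
    candidates = [r for r in replicas if r.get("master_log_file")]
    if not candidates:
        return None   # no usable standby: switching cannot proceed
    return max(candidates,
               key=lambda r: (r["master_log_file"], r["read_master_log_pos"]))
```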

Check whether the disk space of the target standby is nearly full (above 95%).
Set the target standby to read-only.
Execute the following commands:

show global variables like 'read_only'

If the return is not "ON", execute the following command:

set global read_only = ON
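The check-then-set sequence above can be sketched as follows; `run_sql` is a placeholder callback that executes a statement and returns result rows, not a real client API.

```python
def ensure_read_only(run_sql):
    """Set the target standby read-only if it is not already.
    `run_sql` executes one statement and returns rows as tuples,
    e.g. [('read_only', 'ON')] for the SHOW query."""
    rows = run_sql("show global variables like 'read_only'")
    if not rows or rows[0][1] != "ON":
        run_sql("set global read_only = ON")
        return True    # we flipped the flag
    return False       # already read-only, nothing to do
```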


Kill all connections to the standby database.


Check whether the primary can still be connected.
If it cannot, continue to the next stage.
If it can, perform the following operations:
1. Kill the client connections on the original primary.
2. Set it to read-only.

  • Success: kill the primary's client connections again to ensure the original primary accepts no more writes.
  • Failure: kill the MySQL process on the original primary to ensure it performs no more writes.

3. Increase sync_binlog: set global sync_binlog = 1000
4. Disable the event scheduler: set global event_scheduler = OFF
5. Wait for the standby to catch up; if synchronization fails, exit, roll back, and set the primary back to read-write.

show global variables like 'rds_rpl_double_sync_enabled';

If this returns a row, execute statement 1; otherwise execute statement 2:
1. select master_pos_wait('${file}', ${pos}, ${timeout});
2. select master_pos_wait('${file}', ${pos}, ${timeout}, '');

Here file and pos are the Master_Log_File and Read_Master_Log_Pos returned by the earlier show slave status. If the first column of the result is NULL or -1, synchronization is considered failed.
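The statement choice and the result interpretation can be sketched as two small helpers (our own names; the SQL templates come from the text above):

```python
def build_pos_wait_sql(double_sync_enabled, file, pos, timeout):
    """Pick statement 1 or 2 depending on whether the
    rds_rpl_double_sync_enabled variable was returned."""
    if double_sync_enabled:
        return f"select master_pos_wait('{file}', {pos}, {timeout});"
    return f"select master_pos_wait('{file}', {pos}, {timeout}, '');"

def sync_wait_succeeded(result_row):
    """master_pos_wait returns NULL or -1 in its first column when the
    standby failed to reach the given binlog position in time."""
    value = result_row[0]
    return value is not None and value != -1
```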


Detect the heartbeat on the primary and the standby:

SELECT id FROM mysql.ha_health_check WHERE type = 'm';
SELECT id FROM mysql.ha_health_check WHERE type = 's';

Convert the two ids back to timestamps and check whether they differ by no more than 60s.
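The comparison is a one-liner; the sketch assumes the heartbeat ids are millisecond timestamps, which the article does not state explicitly.

```python
def heartbeats_in_sync(master_id, standby_id, max_gap_s=60):
    """Heartbeat ids are the write timestamps (assumed milliseconds);
    the switchover proceeds only if they differ by at most max_gap_s."""
    return abs(master_id - standby_id) <= max_gap_s * 1000
```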


Send a request to the fireman process on the standby host to obtain log information.


Check the slave status.


Request the SLB to replace the backend behind the VIP:
change the backend from the primary's IP:port to the standby's IP:port.


Kill the process on the original primary.


Modify the active/standby relationship in the instance_stat table of the dbaas library.


Send a request to the fireman process of the new primary, which executes the following command to turn off read-only:

set global read_only = OFF


Reduce the sync_binlog parameter.
The new primary executes the following command:

set global sync_binlog = 1


Turn the event_scheduler parameter back on.
The new primary executes the following command:

set global event_scheduler = ON


Start the change-master task for read-only instances, repointing them at the new primary.


Send a request to the fireman process of the new standby to start the standby database process.


This article is the original content of Alibaba Cloud and may not be reproduced without permission.