Redis (6): Sentinel mechanism

Last time we covered the master-slave cluster mode. In this mode, if a slave fails, the client can continue to send requests to the master or to the other slaves. But if the master fails, synchronization on the slaves is directly affected, because the slaves no longer have a master to replicate data from.
Moreover, if the client sends only read requests, the slaves can continue to serve them, which is acceptable in a read-only business scenario. However, as soon as a write request arrives, the read-write separation of the master-slave mode requires the master to perform it, and at that point no instance can serve the client's write request, as shown in the figure.

[Figure: the master is down and no instance can serve write requests]

Neither the interruption of the write service nor the slaves' inability to synchronize data is acceptable. Therefore, if the master goes down, we need a new master, for example by promoting a slave to master. Before actually promoting one, three questions need to be answered:

  1. Is the master really down?
  2. Which slave should be chosen as the new master?
  3. How are the slaves and the clients told about the new master?

This is where the sentinel mechanism comes in. In a Redis master-slave cluster, the sentinel mechanism is the key mechanism for automatic master-slave switching, and it effectively solves the failover problem in master-slave mode.
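
To make the problem that motivates the sentinel concrete, here is a minimal sketch of a read-write-split cluster whose master goes down. The `Instance` and `Cluster` classes and the `demo` function are hypothetical stand-ins written for this illustration, not part of Redis or of any client library, and replication is simplified to a synchronous copy.

```python
class Instance:
    """A toy stand-in for a Redis instance (hypothetical, not a Redis API)."""

    def __init__(self, name, role):
        self.name = name
        self.role = role      # "master" or "slave"
        self.online = True
        self.data = {}


class Cluster:
    """Routes writes to the master and reads to any online instance."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        if not self.master.online:
            raise ConnectionError("no master available to serve the write")
        self.master.data[key] = value
        # Simplified synchronous replication to every online slave.
        for slave in self.slaves:
            if slave.online:
                slave.data[key] = value

    def read(self, key):
        # Prefer slaves for reads, fall back to the master.
        for instance in [*self.slaves, self.master]:
            if instance.online:
                return instance.data.get(key)
        raise ConnectionError("no instance available to serve the read")


def demo():
    master = Instance("master", "master")
    slaves = [Instance("slave-1", "slave"), Instance("slave-2", "slave")]
    cluster = Cluster(master, slaves)
    cluster.write("k", "v1")
    master.online = False              # the master goes down
    read_result = cluster.read("k")    # still served by a slave
    try:
        cluster.write("k", "v2")
        write_ok = True
    except ConnectionError:
        write_ok = False               # nobody can serve the write
    return read_result, write_ok
```

Calling `demo()` shows that the read is still answered from a slave while the write is rejected, which is exactly the gap that an automatic master switch has to close.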

The basic process of the sentinel mechanism

A sentinel is actually a Redis process running in a special mode, and it runs alongside the master and slave instances. It has three main tasks: monitoring, master election, and notification.
Monitoring means that the running sentinel process periodically sends PING commands to the master and all slaves to check whether they are still online. If a slave does not respond to the sentinel's PING within the specified time, the sentinel marks it as offline. Likewise, if the master does not respond within the specified time, the sentinel judges the master to be offline and starts the automatic master switch.
The switch begins with the sentinel's second task, electing a master. After the master goes down, the sentinel selects one slave from the available slaves according to certain rules and makes it the new master. Once this step is complete, the cluster has a new master.
The sentinel then performs its last task, notification. It sends the connection information of the new master to the other slaves so that they execute the replicaof command, connect to the new master, and replicate data from it. At the same time, the sentinel sends the new master's connection information to the clients so that they direct their requests to the new master.
The figure below shows the three tasks and their respective goals.

[Figure: the sentinel's three tasks (monitoring, master election, notification) and their goals]
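
The three tasks can also be sketched as three small functions. This is a simulation under stated assumptions, not how Redis implements its sentinel: `Node`, `monitor`, `elect`, `notify`, and `failover_if_needed` are hypothetical names, the PING reply is simulated by a flag instead of a network call, and the election rule (take the first online slave) is only a placeholder for the real selection rules. The one real Redis element is the `REPLICAOF host port` command text recorded for each remaining slave.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    host: str
    port: int
    online: bool = True
    received: list = field(default_factory=list)  # commands "sent" to this node

    def ping(self):
        # A real sentinel sends PING over the network and waits for a reply
        # with a timeout; here the reply is simulated by the `online` flag.
        return self.online


def monitor(master):
    """Task 1: decide whether the master is offline."""
    return not master.ping()


def elect(slaves):
    """Task 2: choose a new master.

    Placeholder rule (an assumption): take the first online slave.
    """
    candidates = [s for s in slaves if s.ping()]
    if not candidates:
        raise RuntimeError("no online slave to promote")
    return candidates[0]


def notify(new_master, slaves, clients):
    """Task 3: tell the other slaves and the clients about the new master."""
    for slave in slaves:
        if slave is not new_master:
            slave.received.append(f"REPLICAOF {new_master.host} {new_master.port}")
    for client in clients:
        client["master"] = (new_master.host, new_master.port)


def failover_if_needed(master, slaves, clients):
    """Run monitoring, then election and notification only if needed."""
    if not monitor(master):
        return master          # master is healthy, nothing to do
    new_master = elect(slaves)
    notify(new_master, slaves, clients)
    return new_master
```

A possible usage: build an offline `Node` as the master, a few slave `Node` objects, and a list of client dictionaries, then call `failover_if_needed(master, slaves, clients)`. The returned node is the promoted slave, the other slaves have a `REPLICAOF` line in `received`, and each client dictionary points at the new master.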

Of the three tasks, notification is relatively simple: the sentinel only needs to send the new master's information to the slaves and the clients and let them connect to the new master, with no decision logic involved. In the monitoring and election tasks, however, the sentinel has to make two decisions:

  • In the monitoring task, the sentinel has to judge whether the master is offline;
  • In the election task, the sentinel has to decide which slave to promote to master.

Next, let's look at how the master's offline state is judged.
The first thing to know is that the sentinel's judgment comes in two forms: "subjective offline" and "objective offline". Why are there two, and how do they differ and relate?

Subjective offline and objective offline

Subjective offline: the sentinel process uses the PING command to check its own network connection to the master and slaves and so determine the state of each instance. If the sentinel finds that the master's or a slave's response to PING has timed out, it first marks that instance as "subjectively offline".
If the instance is a slave, the sentinel simply marks it as "subjectively offline", because a slave going offline usually has little impact and the cluster's external service is not interrupted.
If the instance is the master, however, the sentinel cannot simply mark it as "subjectively offline" and start the master-slave switch, because the sentinel may have misjudged and the master may not actually be faulty. Once a switch starts, the subsequent election and notification bring extra computation and communication overhead.
To avoid this unnecessary overhead, misjudgments deserve special attention. A misjudgment simply means that the master has not actually gone offline but the sentinel mistakenly believes it has. Misjudgments generally occur when the cluster network is under heavy load or congested, or when the master itself is under heavy load.
Once the sentinel judges the master to be offline, it starts electing a new master and having the slaves synchronize with it, and this process has a cost: the sentinel spends time electing the new master, and the slaves spend time synchronizing with it. After a misjudgment the master did not need to be switched at all, so all of that cost is wasted.
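
The timeout check behind a "subjectively offline" mark, and the way a slow but healthy master can trip it, can be sketched as follows. The function names, the millisecond timestamps, and the 5000 ms timeout are assumptions chosen for illustration. In Redis Sentinel the timeout is set with the `down-after-milliseconds` directive, but the logic below is a simplified model rather than Redis's actual code.

```python
def is_subjectively_offline(last_reply_ms, now_ms, timeout_ms):
    """Return True when no PING reply arrived within the last timeout_ms."""
    return now_ms - last_reply_ms > timeout_ms


def replay(reply_times_ms, check_times_ms, timeout_ms):
    """Evaluate the check at each moment in check_times_ms.

    reply_times_ms: moments at which the instance answered a PING.
    Returns a list of (check_time, offline) pairs.
    """
    results = []
    for now in check_times_ms:
        replies = [t for t in reply_times_ms if t <= now]
        last_reply = max(replies) if replies else 0
        offline = is_subjectively_offline(last_reply, now, timeout_ms)
        results.append((now, offline))
    return results


# A master under heavy load: it answers at 0, 1000 and 2000 ms, stalls,
# and answers again at 9000 ms. It never actually went down.
timeline = replay(
    reply_times_ms=[0, 1000, 2000, 9000],
    check_times_ms=[3000, 6000, 8000, 10000],
    timeout_ms=5000,
)
print(timeline)
# [(3000, False), (6000, False), (8000, True), (10000, False)]
```

At 8000 ms the check reports the master as offline even though it replies again a second later. If a switch started on that single observation, the election and resynchronization cost would be paid for nothing, which is why a subjective mark on the master alone is not enough to trigger a switch.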