03_XA specification and 2PC distributed transaction

1. Why introduce the XA specification and 2PC distributed transactions?

Suppose we build an application that initially connects to a single database. The system runs well, but as traffic grows and the volume of stored data increases, we eventually have to split the data across multiple databases and tables.

In such an application, a single method may need to access several databases at once.

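@Transactional
public void method() {
    queryDb1(); // accesses database 1 through its own connection
    queryDb2(); // accesses database 2 through its own connection
    queryDb3(); // accesses database 3 through its own connection
}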

Simply adding the @Transactional annotation cannot turn method() into a single transaction here. method() calls three methods, each of which accesses a different database through a different java.sql.Connection, and an ordinary local transaction is bound to a single connection.

So how do we implement a transaction for an operation that spans multiple databases? This is where a distributed transaction model comes in.

2. What is the XA specification?

An organization called X/Open defined a distributed transaction model that includes the following roles:

  • AP
    Application: the application we write ourselves.
  • TM
    Transaction Manager: coordinates the global transaction. We have already seen a transaction manager in spring-tx.jar; in this model the TM is usually a third-party component.
  • RM
    Resource Manager: databases such as MySQL and Oracle can roughly be regarded as resource managers.
  • CRM
    Communication Resource Manager: generally message middleware (MQ).

X/Open also puts forward the concept of a global transaction. When a method contains several operations that span multiple databases, the failure of any one operation must roll back all of them, leaving no database affected. The changes may be committed if and only if every operation executes successfully.

The XA specification defines the interface between the TM and the RMs (that is, the databases): the TM must phrase its requests according to the XA interface in order to communicate with an RM.

The X/Open distributed transaction model on its own is quite abstract and says little about how the coordination should actually be carried out, which is why we need the 2PC theory described next.

3. What is 2PC theory?

2PC stands for Two-Phase Commit. The protocol is simple to understand and need not be as complicated as some blog posts on the Internet make it seem.

It has two phases: the prepare phase and the commit phase.

  1. Prepare phase
    The transaction manager (TM) sends a request to every connected resource manager (RM). Each RM opens its own local transaction, executes its business logic without committing, and reports its execution status back to the TM.
  2. Commit phase
    After receiving the execution reports from all resource managers, the TM evaluates them. There are only two possible outcomes: at least one resource manager reported an error, or all of them succeeded. If any resource manager failed, the TM sends a second request asking all resource managers to roll back their transactions. If and only if all of them succeeded, the TM asks all resource managers to commit their transactions.
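
The two phases above can be sketched in a few lines of Java. This is only a minimal in-memory illustration, not a real transaction manager: the names Participant, FakeParticipant and TwoPhaseCommitSketch are hypothetical, and the "databases" are plain objects that record their state.

```java
import java.util.Arrays;
import java.util.List;

/** What the TM needs from each RM (hypothetical interface, for illustration only). */
interface Participant {
    /** Phase 1: run the local work inside an open transaction; return true on success. */
    boolean prepare();
    /** Phase 2, success path: commit the open local transaction. */
    void commit();
    /** Phase 2, failure path: roll back the open local transaction. */
    void rollback();
}

/** In-memory stand-in for a database, so the sketch runs without a real RM. */
class FakeParticipant implements Participant {
    final String name;
    final boolean willSucceed;
    String state = "INIT";

    FakeParticipant(String name, boolean willSucceed) {
        this.name = name;
        this.willSucceed = willSucceed;
    }

    @Override public boolean prepare() {
        state = willSucceed ? "PREPARED" : "FAILED";
        return willSucceed;
    }
    @Override public void commit() { state = "COMMITTED"; }
    @Override public void rollback() { state = "ROLLED_BACK"; }
}

class TwoPhaseCommitSketch {
    /** Plays the TM role. Returns true if the global transaction committed. */
    static boolean run(List<? extends Participant> participants) {
        // Phase 1 (prepare): collect an execution report from every RM.
        boolean allPrepared = true;
        for (Participant p : participants) {
            boolean ok;
            try {
                ok = p.prepare();
            } catch (RuntimeException e) {
                ok = false; // an exception counts as a failed report
            }
            allPrepared = allPrepared && ok;
        }
        // Phase 2 (commit): commit only if every RM succeeded, otherwise roll back all.
        for (Participant p : participants) {
            if (allPrepared) {
                p.commit();
            } else {
                p.rollback();
            }
        }
        return allPrepared;
    }

    public static void main(String[] args) {
        FakeParticipant db1 = new FakeParticipant("db1", true);
        FakeParticipant db2 = new FakeParticipant("db2", false); // this RM reports an error
        boolean committed = run(Arrays.asList(db1, db2));
        System.out.println(committed + " " + db1.state + " " + db2.state);
    }
}
```

Because db2 reports a failure in the prepare phase, main prints `false ROLLED_BACK ROLLED_BACK`: both participants are rolled back, including the one that prepared successfully.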

4. What are the shortcomings of 2PC theory?

The four shortcomings are easy to remember: synchronous blocking, single point of failure, loss of transaction state, and split brain.

  • Synchronous blocking
    2PC spans multiple RMs, and each RM may hold resources while it works. In the prepare phase, the TM must wait until every RM has returned its execution result before it can issue the next instruction. This can take a long time, because the requests may be executed one after another and some RMs are fast while others are slow. The resources stay occupied the whole time, so any other program that needs them is blocked. (Single-database transactions have the same problem and also occupy resources, but because only one node is involved, the resources are held briefly and the impact is much smaller.)
  • Single point of failure
    Suppose the TM is deployed as a single node and has sent the prepare request to the connected RMs. If the TM suddenly goes down at this point, we are in trouble: the transactions are not closed in time, so locks such as a table lock may remain held. As a result, other services can no longer operate on that table or on any of the occupied resources.
  • Loss of transaction state
    To avoid a single point of failure, we might deploy several TMs, for example as a hot-standby pair, so that when one TM goes down the other takes over. The problem is this: suppose TM-1 has successfully sent prepare requests to the RMs, but goes down right after sending out a commit, and RM-1 goes down with it. When TM-2 takes over, the transaction state has not been persisted anywhere, so TM-2 does not know whether TM-1 sent the prepare requests, let alone whether RM-1 received a commit. It therefore cannot decide whether to tell the remaining RMs to roll back or to do nothing.
  • Split brain
    Split brain is almost always related to network instability. For example, the TM sends a commit request to the RMs, but some RMs are cut off from the TM and from the other RMs by a network partition and never receive it. What should happen then? The 2PC theory gives no guidance for this situation.

In response to these problems of 2PC, the 3PC theory was invented.

5. What is 3PC theory?

3PC stands for Three-Phase Commit. To solve or alleviate the problems of 2PC, 3PC is divided into the following three phases:

  1. CanCommit phase
    At the very beginning, the TM sends CanCommit to all connected RMs. The purpose is to check whether each RM is able to execute the SQL: whether the database is running normally, whether any table is locked, whether there is enough space, and whether the network is healthy. If an RM reports that everything is normal in the CanCommit phase, the subsequent SQL execution is very likely to succeed.
  2. PreCommit phase
    If and only if all RMs return OK, the TM enters the PreCommit phase. This phase is the same as the prepare phase of 2PC: every RM executes its SQL locally but does not commit the transaction. If any RM does not return OK, the TM sends abort to all RMs, asking them to abandon this distributed transaction.
  3. DoCommit phase
    If all RMs report successful execution in the PreCommit phase, the TM sends DoCommit to all RMs, asking them to commit their respective transactions. If any RM reports an error in the PreCommit phase, the TM treats the distributed transaction as failed and sends abort to all RMs, asking them to abandon it.

3PC makes two major improvements:

  1. It introduces the CanCommit phase.
  2. It adds a timeout mechanism to the DoCommit phase.

The timeout mechanism works as follows: if an RM has received the PreCommit request from the TM and reported success, but then receives no further instruction from the TM (DoCommit or abort) within a specified time, it can infer that the TM has probably gone down. Since the CanCommit phase has already been passed and resources cannot be held forever, the RM commits the transaction on its own initiative.

Where does RM's confidence in actively committing transactions come from?

To a large extent, the confidence comes from the CanCommit phase. Since every RM involved in this distributed transaction passed the CanCommit check, each RM is very likely to execute its business logic successfully in the PreCommit phase, and an error is unlikely.

Then again, suppose the TM sends abort to all RMs, but one RM is cut off by a network partition (split brain) and never receives the abort message. After waiting out the timeout without hearing from the TM, that RM commits on its own, and the data becomes inconsistent.

Therefore, both 2PC and 3PC can run into problems under special circumstances, and neither can fully guarantee distributed transactions.
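
The timeout rule and the failure case just described can be reproduced with a small state machine. This is a simplified sketch under stated assumptions: ThreePcParticipant and ThreePcTimeoutSketch are hypothetical names, the timeout is triggered by calling onTimeout() directly instead of using a real timer, and a timeout in any state other than "PreCommit succeeded" is simply ignored.

```java
/** An RM taking part in 3PC (hypothetical class; real databases do not expose this API). */
class ThreePcParticipant {
    enum State { INIT, READY, PRE_COMMITTED, COMMITTED, ABORTED }

    private final boolean failsPreCommit;
    State state = State.INIT;

    ThreePcParticipant(boolean failsPreCommit) {
        this.failsPreCommit = failsPreCommit;
    }

    /** CanCommit: only a health check, no SQL is executed yet. */
    boolean onCanCommit() {
        state = State.READY;
        return true;
    }

    /** PreCommit: execute the SQL locally without committing; false reports an error. */
    boolean onPreCommit() {
        if (state != State.READY || failsPreCommit) {
            state = State.ABORTED;
            return false;
        }
        state = State.PRE_COMMITTED;
        return true;
    }

    /** DoCommit from the TM: commit the local transaction. */
    void onDoCommit() {
        if (state == State.PRE_COMMITTED) state = State.COMMITTED;
    }

    /** abort from the TM: abandon the local transaction unless it is already committed. */
    void onAbort() {
        if (state != State.COMMITTED) state = State.ABORTED;
    }

    /** No DoCommit or abort arrived in time: after a successful PreCommit, commit anyway. */
    void onTimeout() {
        if (state == State.PRE_COMMITTED) state = State.COMMITTED;
    }
}

class ThreePcTimeoutSketch {
    public static void main(String[] args) {
        ThreePcParticipant rm1 = new ThreePcParticipant(false);
        ThreePcParticipant rm2 = new ThreePcParticipant(true); // will fail in PreCommit
        rm1.onCanCommit();
        rm2.onCanCommit();
        rm1.onPreCommit();
        rm2.onPreCommit();
        // The TM decides to abort because rm2 failed. The abort reaches rm2,
        // but rm1 is partitioned away, hears nothing, and eventually times out.
        rm2.onAbort();
        rm1.onTimeout();
        System.out.println(rm1.state + " " + rm2.state);
    }
}
```

main prints `COMMITTED ABORTED`: the partitioned RM committed on timeout while the other one aborted, which is exactly the inconsistency described above.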

6. Simulating a MySQL XA distributed transaction

The scenario we want to simulate: service A has an operation that needs to execute one SQL statement on DB1 and one on DB2, and the two statements should form a single distributed transaction.

The specific steps of implementation are shown in the following figure:

[Figure: steps of the MySQL XA distributed transaction across DB1 and DB2; image not available]

Supplement: if either step 5 or step 6 fails (that is, the returned code is not XAResource.XA_OK), the TM sends an XA ROLLBACK txid command to the RMs.

The specific code will not be posted~
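
As a rough stand-in, here is a minimal sketch of what the TM side of such a flow might look like using the standard javax.transaction.xa interfaces from the JDK. Since the figure is not available, the step order is an assumption based on the usual XA sequence: XA START, run the SQL, XA END, XA PREPARE, then XA COMMIT or XA ROLLBACK. SimpleXid, BranchWork, XaFlowSketch and fakeRm are hypothetical names. To keep the sketch runnable without a database, the RMs in main are recording proxies; in a real setup each XAResource would come from XAConnection.getXAResource() (for MySQL, typically obtained through Connector/J's MysqlXADataSource) and the SQL would run on XAConnection.getConnection().

```java
import java.lang.reflect.Proxy;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

/** Minimal Xid: a global transaction id plus a branch qualifier (hypothetical helper). */
class SimpleXid implements Xid {
    private final byte[] gtrid;
    private final byte[] bqual;

    SimpleXid(String gtrid, String bqual) {
        this.gtrid = gtrid.getBytes(StandardCharsets.UTF_8);
        this.bqual = bqual.getBytes(StandardCharsets.UTF_8);
    }
    @Override public int getFormatId() { return 1; }
    @Override public byte[] getGlobalTransactionId() { return gtrid.clone(); }
    @Override public byte[] getBranchQualifier() { return bqual.clone(); }
}

/** The SQL executed inside one transaction branch (hypothetical callback type). */
interface BranchWork {
    void run() throws Exception;
}

class XaFlowSketch {
    /** Plays the TM role for two RMs. Returns true on commit, false on rollback. */
    static boolean execute(XAResource rm1, BranchWork work1, XAResource rm2, BranchWork work2) {
        Xid xid1 = new SimpleXid("txid", "branch-1");
        Xid xid2 = new SimpleXid("txid", "branch-2");
        boolean prepared;
        try {
            rm1.start(xid1, XAResource.TMNOFLAGS); // XA START on DB1
            work1.run();                           // SQL on DB1
            rm1.end(xid1, XAResource.TMSUCCESS);   // XA END on DB1
            rm2.start(xid2, XAResource.TMNOFLAGS); // XA START on DB2
            work2.run();                           // SQL on DB2
            rm2.end(xid2, XAResource.TMSUCCESS);   // XA END on DB2
            int vote1 = rm1.prepare(xid1);         // XA PREPARE on DB1
            int vote2 = rm2.prepare(xid2);         // XA PREPARE on DB2
            prepared = vote1 == XAResource.XA_OK && vote2 == XAResource.XA_OK;
        } catch (Exception e) {
            prepared = false; // any failure before the commit decision means rollback
        }
        if (!prepared) {
            rollbackQuietly(rm1, xid1); // XA ROLLBACK on both RMs
            rollbackQuietly(rm2, xid2);
            return false;
        }
        try {
            rm1.commit(xid1, false); // XA COMMIT; onePhase = false because we prepared first
            rm2.commit(xid2, false);
        } catch (XAException e) {
            // A real TM would log the decision and retry; this sketch only reports it.
            throw new IllegalStateException("commit failed after prepare", e);
        }
        return true;
    }

    private static void rollbackQuietly(XAResource rm, Xid xid) {
        try {
            rm.rollback(xid);
        } catch (XAException ignored) {
            // best effort only in this sketch
        }
    }

    /** Fake RM that records the XA calls it receives and answers prepare with prepareVote. */
    static XAResource fakeRm(String name, int prepareVote, List<String> log) {
        return (XAResource) Proxy.newProxyInstance(
                XaFlowSketch.class.getClassLoader(),
                new Class<?>[] {XAResource.class},
                (proxy, method, methodArgs) -> {
                    log.add(name + "." + method.getName());
                    return method.getName().equals("prepare") ? Integer.valueOf(prepareVote) : null;
                });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean committed = execute(
                fakeRm("db1", XAResource.XA_OK, log), () -> log.add("db1.sql"),
                fakeRm("db2", XAResource.XA_OK, log), () -> log.add("db2.sql"));
        System.out.println(committed + " " + log);
    }
}
```

Following the supplement above, any prepare result other than XAResource.XA_OK leads to a rollback in this sketch. A production TM would also persist its commit decision before the second phase so that it can recover after a crash, which is the "loss of transaction state" problem discussed earlier.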