The relationship between InnoDB transactions and various logs

Four characteristics of transactions

A transaction must have the following four properties: atomicity, consistency, isolation, and durability, commonly known as ACID.


Atomicity means that the operations in a transaction either all succeed or all fail and are rolled back. If the transaction succeeds, its changes must be applied to the database in full; if it fails, it must have no effect on the database at all.


Consistency means that a transaction must move the database from one consistent state to another; that is, the database must be in a consistent state both before and after the transaction executes. For example, suppose the balances of users A and B add up to 1000. No matter how many transfers take place between A and B, their balances must still add up to 1000 after each transaction. This is transaction consistency.
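
The transfer example can be tried out with a small script. This is only a sketch: it uses SQLite through Python's built-in sqlite3 module as a stand-in for MySQL, and the `account` table, the `transfer` helper, and the starting balances are all made up for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except ValueError:
        return False

def total(conn):
    """Sum of all balances; the invariant that must stay at 1000."""
    return conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 600), ("B", 400)])
conn.commit()

ok = transfer(conn, "A", "B", 100)        # succeeds: both updates are applied
failed = transfer(conn, "A", "B", 5000)   # fails halfway: the debit is rolled back
print(ok, failed, total(conn))
```

The failed transfer leaves no trace (atomicity), and the sum of the two balances is the same after every call (consistency).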


Isolation means that when multiple users access the database concurrently, for example by operating on the same table at the same time, the transaction the database opens for each user must not be interfered with by other transactions; concurrent transactions must be isolated from one another. The database provides several isolation levels, which will be introduced later.


Durability means that once a transaction is committed, its changes to the data are permanent; a committed transaction is not lost even if the database system fails afterwards. For example, when we use JDBC, the program calls the commit method and then tells the user that the operation is complete. Once the program reaches that prompt, we can be sure the transaction was committed correctly, and even if the database fails at that moment, the transaction's changes must still take effect. Otherwise we would have a serious error: the user sees that the transaction completed, but the database did not actually apply it because of the failure. This is not allowed.

How does MySQL guarantee the atomicity and durability of transactions?

First, let's think about what difficulties MySQL faces in implementing transactions.

We know that MySQL has the concept of the Buffer Pool. This means that the client may be told that a transaction is complete while many of its modifications exist only in memory and have not yet reached the disk.

Suppose the MySQL server unfortunately goes down at that point. What are the consequences? The data in memory is lost, which violates the durability guarantee.

So how can MySQL ensure that every committed transaction eventually reaches the disk?

WAL mechanism guarantees durability

MySQL achieves this with the WAL (write-ahead logging) mechanism.

Many explanations on the Internet put it like this: WAL means writing the log first and writing the data to disk later. This explanation is correct, but it is only clear to people who already understand it, and it easily confuses those who do not. So let's look at MySQL as a whole and see which parts a transaction actually modifies.

What does MySQL do to execute a transaction

Let's take redo log as an example and execute this insert statement:

insert into t(id,k) values(id1,k1),(id2,k2);

Here, we assume that k is a secondary B+ tree index. After the insert positions are located, the data page that k1 belongs to (Page 1) is in memory (the InnoDB buffer pool), while the data page that k2 belongs to (Page 2) is not.


Analyzing this insert statement, you will find that it involves four parts: memory, the redo log (ib_log_fileX), the table's tablespace (t.ibd), and the system tablespace (ibdata1).

This insert statement does the following (in the numerical order shown in the figure):

  1. Page 1 is in memory, so it is updated in memory directly;
  2. Page 2 is not in memory, so the change buffer area of memory records the message "I want to insert a row into Page 2";
  3. Both of the above actions are recorded in the redo log (3 and 4 in the figure).

After doing the above, the transaction can be completed.

In other words, an update transaction is considered complete once two things have happened:

  1. The page (or the change buffer) in the Buffer Pool has been modified.
  2. The change has been recorded in the log.

When the data actually reaches the data files on disk is determined by several mechanisms, but it is never synchronous with the commit. If you have studied the JVM, you can think of the timing of persisting in-memory data to disk as similar to the JVM's garbage collection: it happens in the background at a time the system chooses. This will be analyzed in detail later.

Therefore, from a global perspective, we can summarize MySQL's flushing process (including the WAL mechanism) as follows: first write to memory, then write to the log, and finally write the data pages to disk.

Therefore, if a transaction has completed but the server goes down while its data is still only in memory, WAL makes it possible to recover the data of that completed transaction.
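
The order "memory, then log, then disk" and the recovery it enables can be sketched with a toy model. This is not how InnoDB is implemented; `TinyWalEngine` and all of its methods are hypothetical names, and pages are reduced to dictionary entries.

```python
class TinyWalEngine:
    """Toy write-ahead-logging model: a durable log protects volatile pages."""

    def __init__(self):
        self.data_file = {}    # durable data pages: page_id -> value
        self.redo_log = []     # durable log records: (page_id, value)
        self.buffer_pool = {}  # in-memory pages, lost on a crash

    def commit_change(self, page_id, value):
        self.buffer_pool[page_id] = value       # 1. write to memory
        self.redo_log.append((page_id, value))  # 2. write to the log
        # The transaction counts as complete here; the data file is untouched.

    def flush(self):
        self.data_file.update(self.buffer_pool)  # 3. write data pages to disk, later
        self.redo_log.clear()                    # flushed records are no longer needed

    def crash(self):
        self.buffer_pool.clear()                 # memory contents vanish

    def recover(self):
        for page_id, value in self.redo_log:     # replay the logged changes
            self.buffer_pool[page_id] = value
        self.flush()

engine = TinyWalEngine()
engine.commit_change("page1", "k1")
engine.commit_change("page2", "k2")
engine.crash()                                   # nothing had been flushed yet
state_after_crash = dict(engine.data_file)
engine.recover()
state_after_recovery = dict(engine.data_file)
print(state_after_crash, state_after_recovery)
```

Right after the crash the data file is still empty, but replaying the log brings both changes back.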

Important log module: redo log

The role of redo log

The InnoDB storage engine manages storage space in units of pages, and the create, read, update, and delete operations we perform are essentially all operations on pages.

Before a page can be accessed, it must be loaded from disk into the Buffer Pool in memory. But transactions also require durability: for a committed transaction, the changes it made to the database must not be lost even if the system crashes after the commit.

If we only modified pages in the Buffer Pool, a failure right after the transaction commits would wipe out the data in memory, and the changes made by the committed transaction would be lost with it, which is unacceptable. The redo log exists for exactly this reason. Following the WAL principle described above, changes that were lost from memory before being flushed can be recovered from it.

Special note: the redo log belongs to InnoDB; other engines do not have it. As a result, it is difficult for other engines to recover data after the database crashes abnormally.

The underlying structure of the redo log

InnoDB's redo log has a fixed size. For example, it can be configured as a group of 4 files of 1GB each, so that the log can record 4GB of operations in total. Writing starts at the beginning, and on reaching the end it wraps around to the beginning and continues, as shown in the figure below.


The write pos in the figure is the current write position of the log; it moves forward as transactions write redo records. The checkpoint is the position up to which the corresponding dirty pages have been successfully flushed to disk. The area between checkpoint and write pos therefore holds the redo records of dirty pages that have not yet been flushed. Both pointers are persisted, so if the server goes down, MySQL can use them after a restart to recover the data.

Of course, this picture may still not be intuitive enough, so here is another picture to deepen your understanding.


As shown in the figure, when the checkpoint is advanced from CP to CP', the dirty pages corresponding to the log between the two points (the light green part) must be flushed to disk. After that, the region from write pos to CP' is the part of the redo log that can be written again.
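
The interplay of write pos and checkpoint can be sketched as a small ring-buffer model. `CircularRedoLog` and its methods are hypothetical names, and the capacity is counted in abstract units rather than bytes. Both positions are kept as ever-increasing counters, and the physical offset inside the files is the counter modulo the capacity.

```python
class CircularRedoLog:
    """Toy fixed-size log: write_pos chases checkpoint around a ring."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.write_pos = 0    # total units written so far
        self.checkpoint = 0   # total units whose dirty pages are already flushed

    def used(self):
        # Records between checkpoint and write pos still protect unflushed pages.
        return self.write_pos - self.checkpoint

    def free(self):
        return self.capacity - self.used()

    def append(self, units):
        if units > self.free():
            raise RuntimeError("redo log full: advance the checkpoint first")
        self.write_pos += units

    def advance_checkpoint(self, units):
        # Flushing dirty pages lets the checkpoint move forward and frees log space.
        self.checkpoint += min(units, self.used())

    def file_offset(self, pos):
        return pos % self.capacity   # where a logical position falls in the ring

log = CircularRedoLog(capacity=4)
log.append(3)
try:
    log.append(2)            # only 1 unit is free, so this cannot be written yet
    overflowed = False
except RuntimeError:
    overflowed = True
log.advance_checkpoint(2)    # flush pages, moving the checkpoint from CP to CP'
log.append(2)                # now there is room; the write wraps around the end
print(overflowed, log.used(), log.free(), log.file_offset(log.write_pos))
```

When the log is full, new writes have to wait until the checkpoint advances, which is why advancing it requires flushing the corresponding dirty pages first.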

Redo log writing process


The Buffer Pool was introduced to work around slow disks. For the same reason, redo log records are not written straight to disk. When the server starts, it requests a large contiguous block of memory from the operating system, called the redo log buffer.

When a transaction generates multiple redo records, they are not written out one at a time; they are first cached in the redo log buffer. When the transaction commits, the modified Buffer Pool pages may not have been flushed to disk yet, but to guarantee durability, the transaction's records in the redo log buffer must be written to the real redo log files.

Usage of innodb_flush_log_at_trx_commit

We said earlier that in order to ensure the durability of the transaction, the user thread needs to flush all redo logs generated during the execution of the transaction to disk when the transaction is committed, which significantly reduces database performance. If the durability requirements for transactions are not so strict, you can change the value of a system variable called innodb_flush_log_at_trx_commit, which has three possible values:

0: The redo log is not synchronized to disk when the transaction is committed; this task is left to a background thread. This obviously speeds up request processing, but if the server crashes after the transaction is committed and before the background thread has flushed the redo log to disk, the transaction's page modifications are lost.

1: When the value of the system variable is 1, it means that the redo log needs to be synchronized to the disk when the transaction is committed, which can ensure the durability of the transaction. 1 is also the default value of innodb_flush_log_at_trx_commit.

2: The redo log is written to the operating system's buffer (not the redo log buffer) when the transaction is committed, but it is not guaranteed to be flushed to disk. In this case, if the database crashes but the operating system does not, the durability of the transaction is still guaranteed; if the operating system also crashes, it is not.
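
The three settings can be compared with a small lookup model. The function `redo_survives` and the layer names are invented for illustration, and the model assumes the crash happens right after the commit returns, before any background flush has run.

```python
def redo_survives(setting, crash):
    """Return True if a just-committed transaction's redo records survive the crash.

    setting: 0, 1 or 2, the value of innodb_flush_log_at_trx_commit
    crash:   "mysql" (only the server process dies) or "os" (the whole machine dies)
    """
    # Where the redo records sit at the moment the commit returns.
    location = {0: "redo_log_buffer", 1: "disk", 2: "os_cache"}[setting]
    # Which layers each kind of crash wipes out.
    lost = {
        "mysql": {"redo_log_buffer"},
        "os": {"redo_log_buffer", "os_cache"},
    }[crash]
    return location not in lost

results = {
    (setting, crash): redo_survives(setting, crash)
    for setting in (0, 1, 2)
    for crash in ("mysql", "os")
}
for (setting, crash), survives in sorted(results.items()):
    print(f"setting={setting} crash={crash:5} durable={survives}")
```

Only the value 1 keeps the transaction durable in both crash scenarios, which matches the description above.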

A brief look at the undo log

For various reasons (machine downtime, operating system errors, an explicit rollback by the user, and so on), a transaction may stop halfway through. By then it has already modified a lot of data (changes are applied before commit), while some of its operations have not been executed. To ensure atomicity and consistency, a transaction must either succeed completely or fail completely, so the partial changes have to be rolled back. Rolling back requires the old values, and those old values are recorded in the undo log.

Combined with the redo log, this is easy to understand: before a transaction commits, there is no guarantee that its redo log has been completely written, so an unfinished transaction can only be rolled back through the undo log.

In addition, MVCC, which implements the repeatable read isolation level, relies on the undo log to go back to earlier versions of a row.
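
The idea of keeping old values for rollback can be sketched as follows. This is a simplified model, not InnoDB's undo format; `UndoTransaction` is a hypothetical class and the table is just a dictionary.

```python
_MISSING = object()  # marks "the key did not exist before this transaction"

class UndoTransaction:
    """Toy transaction that records old values so it can be rolled back."""

    def __init__(self, table):
        self.table = table
        self.undo_log = []   # (key, old_value) pairs, newest last

    def update(self, key, new_value):
        # Save the old value first, then apply the change in place.
        self.undo_log.append((key, self.table.get(key, _MISSING)))
        self.table[key] = new_value

    def rollback(self):
        # Undo the changes in reverse order using the saved old values.
        while self.undo_log:
            key, old_value = self.undo_log.pop()
            if old_value is _MISSING:
                self.table.pop(key, None)
            else:
                self.table[key] = old_value

table = {"id2": 10}
tx = UndoTransaction(table)
tx.update("id2", 11)     # modify an existing row
tx.update("id3", 5)      # insert a new row
tx.update("id2", 12)     # modify the same row again
tx.rollback()            # the transaction stops halfway and must be undone
print(table)
```

After the rollback the table is back to its state before the transaction started, even though one row was changed twice and another was newly inserted.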

Important log module: binlog

The role of binlog

As we mentioned earlier, MySQL as a whole has two parts: one is the Server layer, which mainly does things at the MySQL functional level; and the other is the engine layer, which is responsible for specific storage-related matters. The redo log we talked about above is a log unique to the InnoDB engine, and the Server layer also has its own log, called binlog (archive log).

The Server layer's log is shared by all engines. As its name suggests, the binlog is also used to record data changes. So if the binlog exists, why is the redo log needed? How do they differ?

The reason is that MySQL did not originally include the InnoDB engine. MySQL's own engine was MyISAM, which is not crash-safe, and the binlog can only be used for archiving. InnoDB was developed by another company and introduced into MySQL as a plug-in. Since the binlog alone provides no crash-safe capability, InnoDB uses its own logging system, the redo log, to achieve it.

The difference between redo log and binlog (a common interview question)

  1. They are used differently. The binlog records all changes to tables, including updates and deletes of data and changes to table structure, and is mainly used to restore data manually. The redo log is invisible to us; InnoDB uses it to provide crash safety, that is, if MySQL crashes after a transaction is committed, the transaction's changes are still permanent. In one sentence: the binlog is used to restore data manually, and the redo log is used by MySQL itself to ensure transaction durability when the database crashes.
  2. The redo log is unique to the InnoDB engine; the binlog is implemented by MySQL's Server layer and can be used by all engines.
  3. The redo log is a physical log that records "what has been modified on a certain data page", so recovery from it is faster; the binlog is a logical log that records the original logic of the statement, such as "add 1 to field c of the row with ID=2".
  4. The redo log is written circularly. It only keeps records for changes that have not yet been flushed to disk; once the data is flushed, its records in the fixed-size redo log files can be overwritten. The binlog is append-only and keeps the full history.
  5. When the database crashes and changes that were written to the redo log and binlog but not yet flushed must be restored to memory, the redo log has two pointers to work from: write pos (where the log has been written up to) and checkpoint (up to where flushing has succeeded). From the records between these two positions, combined with the data pages on disk, the changes that were lost from the Buffer Pool can be restored. The binlog has no such markers, so it cannot be used for crash recovery.

Binlog data recovery process

How to use binlog to restore the database to the state in any second within half a month?

As we said before, the binlog records all logical operations and is append-only. If your DBA promises that data can be restored to any point within the last half month, then all binlogs from the last half month are kept in the backup system, and the system also takes a full backup of the database regularly. How often "regularly" is depends on the importance of the system: it can be once a day or once a week.

Suppose you need to restore to a specific second. For example, at two o'clock one afternoon you discover that a table was deleted by mistake at noon and you need to get the data back. You can do this:

  • First, find the most recent full backup (if you are lucky, it is from last night) and restore it to a temporary database;
  • Then, starting from the time of the backup, take out the archived binlogs in order and replay them up to the moment just before the table was deleted at noon.

In this way, your temporary database is the same as the online database was just before the accidental deletion, and you can then copy the table data out of the temporary database and restore it to the online database as needed. (A joking reminder: anyone who drops the database and runs away should remember to delete the binlog too.)
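
The two recovery steps can be sketched as a replay loop. This is a simplified model: real binlogs are replayed with the mysqlbinlog tool rather than by hand, and the event tuples, the `restore_to` function, and the hour-based timestamps here are all invented for illustration.

```python
def restore_to(backup_time, backup_tables, binlog, stop_before):
    """Rebuild a temporary database: full backup plus binlog events before `stop_before`."""
    # Step 1: restore the full backup into a temporary database (a copy).
    tables = {name: dict(rows) for name, rows in backup_tables.items()}
    # Step 2: replay binlog events recorded after the backup, in order.
    for ts, op, table, key, value in binlog:
        if ts <= backup_time:
            continue            # already contained in the backup
        if ts >= stop_before:
            break               # stop just before the mistaken operation
        if op == "set":
            tables.setdefault(table, {})[key] = value
        elif op == "drop":
            tables.pop(table, None)
    return tables

# Timestamps are hours of the day; the backup was taken at midnight (0).
backup = {"t": {1: "a"}}
binlog = [
    (9,  "set",  "t", 2, "b"),        # normal write in the morning
    (11, "set",  "t", 1, "a2"),       # another normal write
    (12, "drop", "t", None, None),    # the table is dropped by mistake at noon
    (13, "set",  "u", 1, "x"),        # later activity on another table
]
temp_db = restore_to(0, backup, binlog, stop_before=12)
print(temp_db)
```

The temporary database contains table t as it was just before noon, and the backup itself is left unchanged.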

Two-phase commit of transactions

What is two-phase commit?

Two-phase commit concerns the data that must be written to both the redo log and the binlog before a transaction is committed. To ensure that both logs are written successfully, a two-phase commit must be used.

Two-phase commit process

The SQL statements are as follows:

create table T(ID int primary key, c int);
update T set c=c+1 where ID=2;

The update statement is executed as follows:

  1. The executor first asks the engine for the row with ID=2. ID is the primary key, so the engine finds the row directly with a tree search. If the data page containing the row is already in memory, the row is returned to the executor directly; otherwise, the page is first read from disk into memory and then returned.
  2. The executor takes the row data given by the engine and adds 1 to the value. For example, if it was N, it is now N+1. This gives a new row, and the executor calls the engine interface to write it.
  3. The engine updates the new row in memory and at the same time records the update in the redo log; at this point the redo log is in the prepare state. It then tells the executor that execution is complete and the transaction can be committed at any time.
  4. The executor generates the binlog for this operation and writes the binlog to disk.
  5. The executor calls the engine's commit interface, and the engine changes the redo log record it just wrote to the commit state. The update is complete.

Here I give the execution flow chart of this update statement. The light box in the figure indicates that it is executed inside InnoDB, and the dark box indicates that it is executed in the executor:


You may have noticed that the last three steps seem a bit roundabout. The writing of the redo log is split into two steps, prepare and commit; this is the "two-phase commit". Intuitively, the redo log and binlog writes are serialized, and the transaction is committed only after both have been written.
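
The last three steps, and what happens if the server crashes between them, can be sketched as below. The recovery rule in `recover` is not spelled out in the text above; it is stated here as an assumption about how a prepared transaction is resolved after a restart (kept if its binlog was written, dropped otherwise). All names are hypothetical.

```python
STEPS = ["redo_prepare", "binlog_write", "redo_commit"]

def run_update(crash_after=None):
    """Run the last three steps of the update, optionally crashing after one of them."""
    redo_state, binlog_written = None, False
    for step in STEPS:
        if step == "redo_prepare":
            redo_state = "prepare"      # engine writes redo log in the prepare state
        elif step == "binlog_write":
            binlog_written = True       # executor writes the binlog to disk
        elif step == "redo_commit":
            redo_state = "commit"       # engine marks the redo log as committed
        if step == crash_after:
            break                       # simulated crash right after this step
    return redo_state, binlog_written

def recover(redo_state, binlog_written):
    """Decide the transaction's fate after a restart (assumed rule, see lead-in)."""
    if redo_state == "commit":
        return "committed"
    if redo_state == "prepare" and binlog_written:
        return "committed"      # the binlog has the change, so the engine keeps it
    return "rolled back"        # the binlog lacks the change, so the engine drops it

outcomes = {}
for crash_after in STEPS:
    redo_state, binlog_written = run_update(crash_after)
    outcomes[crash_after] = (recover(redo_state, binlog_written), binlog_written)
    print(crash_after, outcomes[crash_after])
```

In every crash scenario the engine ends up committed exactly when the binlog contains the change, so the database and its binlog never disagree.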

Problems that arise without two-phase commit

Since the redo log and binlog are two independent pieces of logic, without two-phase commit either the redo log is written first and then the binlog, or the other way around. Let's see what problems these two approaches have.

Write redo log first and then binlog.

Suppose MySQL goes down when the redo log has been written but the binlog has not. After a restart, the data is recovered normally from the redo log. But if you later restore the database from the binlog, or use the binlog for master-slave replication, this change is missing from the binlog, so the restored database or the replica lacks part of the data.

Write binlog first and then redo log.

Suppose the binlog has been written but the redo log has not. We know that if the redo log is not written, the transaction does not succeed, and its operations are rolled back according to the undo log. But the change that was rolled back is still stored in the binlog, so a database restored from the binlog contains extra data.

It can be seen that if the "two-phase commit" is not used, the state of the database may be inconsistent with the state of the library restored with its log.

And this process is not needed only to recover data after a mistake. When you need to scale out, that is, build more replicas to increase the system's read capacity, the common practice is to restore a full backup and then apply the binlog. This inconsistency would then leave your master and replica databases inconsistent in production.