Four major features of InnoDB

The four key features of the InnoDB storage engine: insert buffer (change buffer), double write, adaptive hash index, and read-ahead.

insert/change buffer

What is change buffer?

Before MySQL 5.5 it was called the insert buffer and only optimized inserts. It now also covers deletes and updates, and is therefore called the change buffer.

The change buffer applies to non-unique secondary index pages. When a write targets such a page and the page is not in the buffer pool, InnoDB does not load the page from disk right away. Instead it only records the change in the change buffer (buffer changes). When the page is read later, the buffered changes are merged into it as it is loaded into the buffer pool. The purpose is to reduce disk IO for write operations and improve database performance.

Conditions for adding elements to the flush linked list

What are the conditions for a page to be added to the flush list?

  1. This page has been read from the disk into the Buffer Pool.
  2. When we modify the data on this page, this cache page becomes a dirty page and is added to the flush list to wait for flushing.
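
The two conditions above can be sketched as a toy model. This is only an illustration, not InnoDB code: the `BufferPool` class and its method names are made up for this example.

```python
# Toy model (not InnoDB code): a page must be cached first, and a cached
# page joins the flush list only once it has been modified.
class BufferPool:
    def __init__(self, disk):
        self.disk = disk        # page_id -> contents on "disk"
        self.cache = {}         # pages currently in the buffer pool
        self.flush_list = []    # dirty page ids, in order of first modification

    def read_page(self, page_id):
        # Condition 1: the page is loaded from disk into the buffer pool.
        if page_id not in self.cache:
            self.cache[page_id] = self.disk[page_id]
        return self.cache[page_id]

    def modify_page(self, page_id, contents):
        # Condition 2: modifying a cached page makes it dirty.
        self.read_page(page_id)
        self.cache[page_id] = contents
        if page_id not in self.flush_list:
            self.flush_list.append(page_id)


pool = BufferPool({1: "a", 2: "b"})
pool.read_page(1)          # cached but clean: not on the flush list
pool.modify_page(2, "B")   # cached and dirty: on the flush list
print(pool.flush_list)     # [2]
```

Page 1 is only read, so it never reaches the flush list. Page 2 is modified in memory while its disk copy still holds the old contents.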

The role of change buffer

If you are not sure about the role of the change buffer, the following comparison should make it clear.

When there is no change buffer, update a page that does not exist in memory

Suppose the page we want to update is not in memory when an update statement arrives. The InnoDB workflow is as follows:

  1. Load the data page from disk into the buffer pool: one random disk read;
  2. Modify the page in the buffer pool: one memory operation;
  3. Write the redo log: one sequential disk write.

When the buffer pool is missed, at least one random disk read is needed on top of the redo log write. Is there still room for optimization in workloads with many writes and few reads?

When there is a change buffer, update a page that does not exist in memory (the difference from flush linked list)

  1. Record the operation in the change buffer: one memory operation;
  2. Write the redo log: one sequential disk write.

The change buffer therefore saves one disk IO per update: the random read of the target page.
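
To make the comparison concrete, here is a minimal sketch that counts random reads for both workflows. It is a toy model under simplifying assumptions: pages are plain lists of changes, and `ToyEngine` and its fields are invented for illustration.

```python
# Toy comparison (not InnoDB code) of the two workflows described above.
class ToyEngine:
    def __init__(self, use_change_buffer):
        self.use_change_buffer = use_change_buffer
        self.buffer_pool = {}     # page_id -> list of applied changes
        self.change_buffer = {}   # page_id -> list of pending changes
        self.redo_log = []        # appended sequentially
        self.random_reads = 0

    def update(self, page_id, change):
        if page_id in self.buffer_pool:
            self.buffer_pool[page_id].append(change)       # memory operation
        elif self.use_change_buffer:
            self.change_buffer.setdefault(page_id, []).append(change)
        else:
            self.random_reads += 1                         # load page from disk
            self.buffer_pool[page_id] = [change]
        self.redo_log.append((page_id, change))            # sequential write


without_cb = ToyEngine(use_change_buffer=False)
with_cb = ToyEngine(use_change_buffer=True)
for engine in (without_cb, with_cb):
    for page_id in range(5):          # five updates, each to an uncached page
        engine.update(page_id, "set k = 1")

print(without_cb.random_reads, with_cb.random_reads)  # 5 0
```

Both engines write the same five redo log records. Only the engine without a change buffer pays for five random reads.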

Will there be consistency issues when reading data?

Of course not. The change buffer effectively stores, page by page, the modifications that have not yet been applied to each page. Until those modifications are merged, the copy of the page on disk is stale, and returning it as-is would give the wrong result.

The solution is simple: first read the stale page into memory, then rebuild the latest version of the page by applying the modification records for it from the change buffer. (Note that at this point the entries for this page are removed from the change buffer, because they have been applied to the cached page. If the page is modified again afterwards, the change is made in the buffer pool and the page enters the flush list.) See how the pieces fit together?
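
The merge-on-read step can be sketched as follows. This is a simplified illustration, not InnoDB's actual data structures: pages are dictionaries and buffered changes are `(column, value)` pairs chosen for the example.

```python
# Toy merge-on-read (not InnoDB code).
disk = {7: {"k": 1}}                        # stale copy of page 7 on disk
buffer_pool = {}                            # page 7 is not cached yet
change_buffer = {7: [("k", 2), ("k", 3)]}   # pending changes for page 7


def read_page(page_id):
    if page_id not in buffer_pool:
        page = dict(disk[page_id])                 # read the stale page
        for column, value in change_buffer.pop(page_id, []):
            page[column] = value                   # replay changes in order
        buffer_pool[page_id] = page                # latest version is now cached
    return buffer_pool[page_id]


print(read_page(7))     # {'k': 3}
print(change_buffer)    # {}
```

After the read, the entries for page 7 are gone from the change buffer, and the disk copy is still the old version until the cached page is flushed.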

When the change buffer is merged

  1. As described above, when a page that has entries in the change buffer is read: the page is read from disk once and combined with the change buffer to get the latest data, and the entries for that page are then removed from the change buffer;
  2. When a background thread decides that the database is idle;
  3. When the buffer pool runs short of space;
  4. When the database is shut down normally;
  5. When the redo log is full (the redo log is almost never full, because that would cause a serious drop in MySQL throughput for a period of time).

What happens if the server goes down while there is data in the change buffer?

Every change recorded in the change buffer is also written to the redo log, so if the database crashes abnormally, the data can be recovered from the redo log.

Why is change buffer optimized only for secondary indexes?

Let's compare how a unique index and an ordinary (non-unique) index handle an insert, for a new value that falls between the existing values 3 and 5:

The target page of the record to be inserted is in memory

  1. For the unique index, find the position between 3 and 5, judge that there is no conflict, insert this value, and the statement execution ends;
  2. For ordinary indexes, find the position between 3 and 5, insert this value, and the statement execution ends.

In this case, the only difference between the ordinary index and the unique index is one uniqueness check, which costs a small amount of CPU time.

However, this is not the case we care about.

The target page of the record to be inserted is not in the memory

  1. For the unique index, the data page must be read into memory to check for conflicts; if there is none, the value is inserted and the statement execution ends;
  2. For the ordinary index, the update is recorded in the change buffer, and the statement execution ends.

A unique index cannot use the change buffer because the uniqueness check needs the contents of the page. Reading data from disk into memory involves random IO and is one of the most expensive operations in a database. Because the change buffer avoids this random disk access, the improvement in update performance is obvious. (Many changes to the same page can accumulate and then be applied to the page together, which reduces IO.)
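
The difference between the two index types when the target page is not cached can be sketched as below. This is a toy model: pages are sets of keys, and the `insert` function and its parameters are hypothetical names for illustration only.

```python
# Toy insert path (not InnoDB code) for a unique vs. an ordinary index.
def insert(index_is_unique, key, page_id, buffer_pool, disk, change_buffer):
    """Insert key into the index page; return the number of random disk reads."""
    reads = 0
    if page_id not in buffer_pool:
        if not index_is_unique:
            # Ordinary index: no check needed, so just buffer the change.
            change_buffer.setdefault(page_id, []).append(key)
            return reads
        # Unique index: the page is required to check for a duplicate.
        buffer_pool[page_id] = set(disk[page_id])
        reads += 1
    if index_is_unique and key in buffer_pool[page_id]:
        raise ValueError("duplicate key")
    buffer_pool[page_id].add(key)
    return reads


disk = {1: {3, 5}}                     # page 1 holds keys 3 and 5, not cached
pending = {}
unique_reads = insert(True, 4, 1, {}, disk, {})
ordinary_reads = insert(False, 4, 1, {}, disk, pending)
print(unique_reads, ordinary_reads, pending)   # 1 0 {1: [4]}
```

When the page is already cached, both paths skip the disk read and differ only by the duplicate check.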

This makes a good problem to raise and solve in an interview. One day, the buffer pool hit rate of a database dropped from 99% to 75%, the entire system stalled, and all update statements were blocked. After looking into it, I found that this workload performs a large number of inserts, and that someone had changed one of the ordinary indexes to a unique index the day before.

Comparison of change buffer and redo

How can these two be compared? One is a cache and the other is a log file, so they seem unrelated.

However, anyone familiar with the redo log knows that it shares one goal with the change buffer: minimizing random reads and writes. From that point of view, let's analyze the difference between the change buffer and the redo log.

Sequential writes in the redo log

Now, we are going to execute this insert statement on the table:

insert into t(id,k) values(id1,k1),(id2,k2);

Here we assume that k has a secondary B+ tree index. After the positions are located, the data page containing k1 is in memory (the InnoDB buffer pool), and the data page containing k2 is not. The figure shows the update state with the change buffer.


Analyzing this statement, you will find that it involves four parts: memory, the redo log (ib_log_fileX), the table's tablespace (t.ibd), and the system tablespace (ibdata1).

The statement performs the following operations (following the numbering in the figure):

  1. Page 1 is in memory, so it is updated directly in memory;
  2. Page 2 is not in memory, so the message "I want to insert a row into Page 2" is recorded in the change buffer area of memory;
  3. The two actions above are recorded in the redo log (3 and 4 in the figure).

After that, the transaction can complete. The cost of executing this statement is therefore very low: two writes to memory, then one write to disk (the two actions are written to disk together), and that write is sequential. (Note: the three steps above form one transaction, which completes only after the redo log is written. This is also why the redo log is able to restore the data in the change buffer.)
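
The three steps can be traced with a small sketch. It is a toy model only: the page names, the record tuples, and `insert_index_entry` are invented for illustration and do not reflect InnoDB's real redo record format.

```python
# Toy trace (not InnoDB code) of the insert described above.
buffer_pool = {"page1": ["existing rows"]}   # Page 1 is cached, Page 2 is not
change_buffer = {}
redo_log = []


def insert_index_entry(page, entry):
    """Apply one index insert in memory and return the record to log."""
    if page in buffer_pool:
        buffer_pool[page].append(entry)                   # step 1
        return ("page update", page, entry)
    change_buffer.setdefault(page, []).append(entry)      # step 2
    return ("change buffer insert", page, entry)


records = [
    insert_index_entry("page1", "k1"),
    insert_index_entry("page2", "k2"),
]
redo_log.extend(records)   # step 3: both records go out in one sequential append

print(len(redo_log), "page2" in buffer_pool)   # 2 False
```

Page 2 is never read from disk during the write. Its pending insert exists only in the change buffer and in the redo log.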

How the change buffer reduces random reads

We are now going to execute:

select * from t where k in (k1, k2)

Here is the flow chart I drew for these two read requests.

If the read statement runs shortly after the update statement and the data is still in memory, the two read operations have nothing to do with the system tablespace (ibdata1) or the redo log (ib_log_fileX). Therefore, I did not draw those two parts in the figure.


As can be seen from the figure:

  1. Page 1 is returned directly from memory.
  2. Page 2 has to be read from disk into memory, and then the operation log in the change buffer is applied to produce the correct version, which is returned.

As you can see, Page 2 is not read into memory until it actually needs to be read. The actual write to disk happens only when the database is idle or as a last resort. By that time several statements may have modified the same page, so flushing the whole page once replaces many separate disk operations, which is how disk IO is reduced.

Summary

If you simply compare the benefit of the two mechanisms for update performance, the redo log mainly saves the IO cost of random disk writes (by converting them into sequential writes), while the change buffer mainly saves the IO cost of random disk reads.

double write buffer

Is writing to disk in units of pages an atomic operation?

We know that even if only one row changes, the whole page containing that row is written from memory to disk. Loading is also done in units of pages. But is writing a page atomic?

The answer is no. A page holds too much data for the write to be atomic. So here is the problem: if the server crashes when a page is only half written, how do we make sure the data is not lost?

At this point everyone rushes to answer: there is the redo log, and InnoDB relies on it to make sure data is not lost! True, but the redo log records physical operations on a page, and those can only be applied to an intact page. If a partial page write occurs, the redo log is powerless. So how do we handle a page that is only half written?

Doublewrite buffer solves the partial write of the page

The doublewrite buffer is 128 pages (2 extents, extent1 and extent2) in the InnoDB system tablespace, 2MB in size. It exists to solve the partial page write problem.

When MySQL flushes dirty pages to the data files, it first uses memcpy to copy them to an area in memory (also 2MB). That area is then written to the doublewrite buffer in the system tablespace in two 1MB chunks, and fsync is called immediately to sync it to disk. Only then are the pages written to their real locations in the independent tablespace files. The doublewrite pages are written sequentially, so the overhead is small.

Seeing the word "buffer" in doublewrite buffer, you might assume it lives in memory. Here, however, the doublewrite buffer is actually stored in a file. Writing to the system tablespace adds fsync calls, and the cost of fsync on the hard disk lowers the overall performance of MySQL. However, the doublewrite area is contiguous on disk, so it is written sequentially rather than randomly, and the performance impact is small: roughly 5-10% lower than with doublewrite disabled.

If the write to the doublewrite buffer in the system tablespace fails, the page in the independent tablespace has not been written yet, because the two writes happen in strict order. The page on disk is still intact, and its latest contents can be recovered from the redo log.

If the doublewrite buffer is written successfully but the real page is only partially written, the database performs recovery (redo) when it starts after the abnormal shutdown. During recovery, the database checks whether each page is valid (checksum verification and so on). If a page fails verification, the copy in the doublewrite buffer is used to restore it.
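
The recovery check can be sketched with a toy page format. This is an illustration under stated assumptions: the 4-byte CRC32 prefix, the dictionaries standing in for files, and the `torn_at` parameter that simulates a crash are all invented here and are not InnoDB's real page layout.

```python
import zlib

# Toy doublewrite model (not InnoDB code).
def with_checksum(payload: bytes) -> bytes:
    # Prefix the payload with a 4-byte CRC32 so a torn page can be detected.
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_valid(page: bytes) -> bool:
    return len(page) > 4 and zlib.crc32(page[4:]).to_bytes(4, "big") == page[:4]


def flush(page_id, payload, doublewrite, data_file, torn_at=None):
    page = with_checksum(payload)
    doublewrite[page_id] = page            # 1. write the copy first
    if torn_at is None:
        data_file[page_id] = page          # 2. then write the real page
    else:
        old = data_file[page_id]           # simulated crash: only the first
        data_file[page_id] = page[:torn_at] + old[torn_at:]   # bytes land


def recover(page_id, doublewrite, data_file):
    copy = doublewrite.get(page_id, b"")
    if not is_valid(data_file[page_id]) and is_valid(copy):
        data_file[page_id] = copy          # restore the page from the copy


data_file = {1: with_checksum(b"old-old-")}
doublewrite = {}
flush(1, b"new-new-", doublewrite, data_file, torn_at=8)
torn = not is_valid(data_file[1])
recover(1, doublewrite, data_file)
print(torn, data_file[1][4:])   # True b'new-new-'
```

In the real engine the redo log would then be applied on top of the restored page. The sketch stops at the point where the page is intact again.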

Adaptive hash index

We know that InnoDB does not let you create a hash index yourself. The biggest reason is that this data structure does not support range queries, which makes it a poor fit for MySQL workloads.

Generally speaking, a hash index does not meet MySQL's requirements for an underlying index. However, its near-O(1) lookup efficiency has long been coveted by the InnoDB developers, so they decided to keep hot data in hash indexes as much as possible.

The InnoDB storage engine monitors lookups on the secondary indexes of a table. If a secondary index is accessed frequently, its data is considered hot, and building a hash index for it can speed up lookups.

Frequently accessed secondary index data (data that has been accessed three times in a row recently) is added to the hash index automatically. The adaptive hash index is built from the B+ tree pages in the buffer pool, so it is very fast to build.
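
The idea of promoting hot keys can be sketched as below. This is a deliberately simplified model, not InnoDB's real heuristics: a sorted list stands in for the B+ tree, and `ToyIndex`, `HOT_THRESHOLD`, and the streak counter are assumptions made for this example.

```python
import bisect

# Toy adaptive hash index (not InnoDB code).
class ToyIndex:
    HOT_THRESHOLD = 3   # from the text: three accesses in a row

    def __init__(self, rows):
        self.rows = rows             # key -> row
        self.keys = sorted(rows)     # sorted keys stand in for the B+ tree
        self.hash_index = {}         # hot keys promoted for O(1) lookup
        self.last_key = None
        self.streak = 0
        self.tree_searches = 0

    def lookup(self, key):
        if key in self.hash_index:                   # fast path
            return self.hash_index[key]
        self.tree_searches += 1                      # slow path: ordered search
        pos = bisect.bisect_left(self.keys, key)
        found = pos < len(self.keys) and self.keys[pos] == key
        value = self.rows[key] if found else None
        self.streak = self.streak + 1 if key == self.last_key else 1
        self.last_key = key
        if found and self.streak >= self.HOT_THRESHOLD:
            self.hash_index[key] = value             # promote the hot key
        return value


index = ToyIndex({10: "a", 20: "b", 30: "c"})
for _ in range(5):
    index.lookup(10)
print(index.tree_searches, 10 in index.hash_index)   # 3 True
```

The first three lookups walk the ordered structure. The last two are served from the hash table, and range queries would still need the ordered keys.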


Read-ahead

When InnoDB reads a page from disk, it assumes that the pages near that page are very likely to be accessed soon, so it reads those nearby pages into memory in advance. This is the read-ahead mechanism.
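
A minimal sketch of the idea follows. It is a simplification: the real engine decides when to read ahead based on access patterns, while this toy model always prefetches a fixed number of following pages. The function name and the `window` parameter are hypothetical.

```python
# Toy read-ahead (not InnoDB code).
def read_with_readahead(page_id, disk, buffer_pool, window=2):
    """Load page_id plus the next `window` pages; return True on a disk trip."""
    if page_id in buffer_pool:
        return False
    for pid in range(page_id, page_id + window + 1):
        if pid in disk:
            buffer_pool[pid] = disk[pid]   # the neighbours come along
    return True


disk = {i: f"page-{i}" for i in range(6)}
pool = {}
misses = sum(read_with_readahead(i, disk, pool) for i in range(6))
print(misses)   # 2
```

A sequential scan of six pages needs only two trips to disk here, because each trip also brings in the next two pages.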