Detailed explanation of InnoDB-level buffer pool

MySQL level cache


Speaking of caching, recall that MySQL's Server layer used to have a query cache, which was removed outright in MySQL 8.0.

Over time, the MySQL engineering team found that enabling the query cache brought few benefits.

First of all, the benefit of the query cache depends on the cache hit rate. Only queries that hit the cache get faster, so its effect on performance is hard to predict. In addition, to keep the cached results correct, the cache has to be updated frequently.

Secondly, the query cache is protected by a single mutex. On multi-core servers, a large number of concurrent queries leads to heavy mutex contention.

Benchmarks showed that most workloads do best with the query cache disabled (the default setting since 5.6). According to the official statement, it caused more problems than it solved, so it was removed entirely.

So does that mean MySQL has no cache at all and always works directly against persistent storage?

No. Caching is implemented in the storage engine layer. This chapter introduces InnoDB-level caching in detail.


The importance of InnoDB level caching

We know that for tables that use InnoDB as the storage engine, both the indexes that store user data (clustered indexes and secondary indexes) and the various kinds of system data are stored in table spaces in the form of pages. In other words, our data ultimately lives on disk.

But disks are slow. So when the InnoDB storage engine handles a client request and needs to access the data of a certain page, it loads the complete page into memory. Even if we only need one record from a page, the whole page has to be loaded into memory first.

Once the whole page is in memory, it can be read and written. After the access is finished, InnoDB is in no hurry to release the memory occupied by the page. Instead the page is cached, so that a later request for the same page avoids the cost of disk IO. In other words, if the same query is executed again, InnoDB does not need to go to the file system.

Buffer Pool

View the size of the Buffer Pool

To cache pages from disk, InnoDB requests a piece of contiguous memory from the operating system when the MySQL server starts. This piece of memory is called the Buffer Pool. How big is it? That depends on how the machine is configured. By default, the Buffer Pool is only 128M, which is too small for most real workloads.

show variables like 'innodb_buffer_pool_size';


Modify the size of the Buffer Pool

You can set the innodb_buffer_pool_size parameter when starting the server. It specifies the size of the Buffer Pool, like this:


innodb_buffer_pool_size = 268435456

Here 268435456 is in bytes, so the Buffer Pool is set to 256M. Note that the Buffer Pool cannot be too small: the minimum is 5M (a smaller value is automatically raised to 5M).

Control block and cache page

The default cache page size in the Buffer Pool is the same as the default page size on disk: 16KB. To manage these cache pages, InnoDB creates some control information for each cache page, including the table space number and page number of the page, the address of the cache page in the Buffer Pool, linked list node information, some lock information, LSN information, and other control information.

The control information for each cache page occupies the same amount of memory; we call this piece of memory a control block. Control blocks and cache pages correspond one to one, and both are stored in the Buffer Pool: the control blocks at the front and the cache pages at the back.

Each control block takes up about 5% of the size of a cache page, and the innodb_buffer_pool_size we set does not include the memory occupied by the control blocks. In other words, when InnoDB requests contiguous memory from the operating system for the Buffer Pool, the region is generally about 5% larger than the value of innodb_buffer_pool_size.
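As a rough back-of-the-envelope check of these numbers, the Python sketch below computes how many 16KB cache pages fit into a Buffer Pool of a given size and how much extra memory the control blocks would need. It simply takes the approximate 5% figure quoted above at face value; the function name estimate_buffer_pool is made up for this illustration.

```python
PAGE_SIZE = 16 * 1024  # default InnoDB page size in bytes


def estimate_buffer_pool(pool_size_bytes, control_block_ratio=0.05):
    """Rough estimate only: the 5% ratio is the approximation quoted in the text."""
    pages = pool_size_bytes // PAGE_SIZE
    control_bytes = int(pages * PAGE_SIZE * control_block_ratio)
    total_bytes = pool_size_bytes + control_bytes
    return pages, control_bytes, total_bytes


# Default 128M Buffer Pool: 8192 cache pages plus a few MB of control blocks.
pages, control_bytes, total_bytes = estimate_buffer_pool(128 * 1024 * 1024)
print(pages, control_bytes, total_bytes)
```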

The meaning of page control

free (free) linked list management

When the MySQL server starts, it has to initialize the Buffer Pool: it first requests the memory for the Buffer Pool from the operating system and then divides it into pairs of control blocks and cache pages. At this point no real disk pages are cached in the Buffer Pool (none have been used yet). As the program runs, pages from disk are cached in the Buffer Pool one after another.

So the question is: when a page is read from disk into the Buffer Pool, which cache page should it go into? In other words, how do we tell which cache pages in the Buffer Pool are free and which are in use?

In a freshly initialized Buffer Pool all cache pages are free, so the control block of every cache page is added to the free linked list. Assuming the Buffer Pool can hold n cache pages, the free linked list then contains n nodes, one per control block.

With this free linked list, whenever a page needs to be loaded from disk into the Buffer Pool, a free cache page is taken from the free linked list and the control block of that cache page is filled in (the table space of the page, its page number, and so on). The corresponding node is then removed from the free linked list, indicating that the cache page is now in use.
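The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not InnoDB's actual data structure; the class and field names (ControlBlock, FreeList, frame_no) are hypothetical.

```python
from collections import deque


class ControlBlock:
    """Hypothetical control block: describes one cache page (frame)."""

    def __init__(self, frame_no):
        self.frame_no = frame_no
        self.space_id = None  # table space number, filled in when a page is loaded
        self.page_no = None   # page number, filled in when a page is loaded


class FreeList:
    def __init__(self, n_pages):
        # Right after initialization every cache page is free.
        self.blocks = deque(ControlBlock(i) for i in range(n_pages))

    def take(self, space_id, page_no):
        """Take a free cache page for the given disk page, or None if none is left."""
        if not self.blocks:
            return None
        block = self.blocks.popleft()  # the node leaves the free list
        block.space_id = space_id
        block.page_no = page_no
        return block


free_list = FreeList(3)
block = free_list.take(space_id=1, page_no=50)
print(block.frame_no, len(free_list.blocks))
```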

How a query hits the cache page quickly

As we said before, when we need to access the data in a page, the page is loaded from disk into the Buffer Pool; if it is already in the Buffer Pool, it can be used directly. So to find out whether a page is in the Buffer Pool, do we have to traverse every cache page?

No. InnoDB builds a hash table with table space number + page number as the key and the cache page as the value. When the data of a certain page is needed, InnoDB first looks up the hash table by table space number + page number. If a corresponding cache page exists, it is used directly. If not, a free cache page is taken from the free linked list and the page is loaded from disk into that cache page.
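A minimal Python sketch of this lookup, using a dictionary as the hash table. The BufferPool class and its fields are hypothetical, the disk read is only counted rather than performed, and eviction is deliberately left out.

```python
class BufferPool:
    def __init__(self, n_pages):
        self.free = list(range(n_pages))  # indexes of free cache pages
        self.page_hash = {}               # (space_id, page_no) -> cache page index
        self.disk_reads = 0

    def get_page(self, space_id, page_no):
        key = (space_id, page_no)
        frame = self.page_hash.get(key)
        if frame is not None:
            return frame  # hit: no traversal and no disk IO needed
        if not self.free:
            raise RuntimeError("no free cache page (eviction is out of scope here)")
        frame = self.free.pop()   # take a free cache page
        self.disk_reads += 1      # stand-in for loading the page from disk
        self.page_hash[key] = frame
        return frame


pool = BufferPool(4)
first = pool.get_page(0, 5)   # miss: loaded from disk
second = pool.get_page(0, 5)  # hit: served from the hash table
print(first == second, pool.disk_reads)
```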

Flush (dirty page) linked list management

If we modify the data of a cache page in the Buffer Pool, the cache page becomes inconsistent with the page on disk. Such a cache page is called a dirty page.

The simplest approach would be to synchronize every modification to the corresponding page on disk immediately, but writing to disk that often would seriously hurt performance. So after modifying a cache page we do not rush to write the change to disk; we synchronize it at some later point.

But if changes are not written to disk immediately, how do we know later which pages in the Buffer Pool are dirty and which have never been modified?

For this we need another linked list to keep track of dirty pages. The control block of every modified cache page is added to this linked list as a node. Because the cache pages on this list need to be flushed to disk, it is called the flush linked list. Its structure is similar to that of the free linked list.
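The idea can be sketched as follows. This is a simplified model with hypothetical names: a page is put on the list the first time it is modified, later modifications of an already dirty page do not move it (an assumption of this sketch), and flushing takes pages from the tail, where the oldest dirty pages sit.

```python
from collections import deque


class FlushList:
    def __init__(self):
        self.dirty = deque()    # index 0 = most recently dirtied, tail = oldest
        self._members = set()

    def mark_dirty(self, page_id):
        # A page that is already dirty is already on the list.
        if page_id not in self._members:
            self.dirty.appendleft(page_id)
            self._members.add(page_id)

    def flush_oldest(self, n):
        """Flush up to n of the oldest dirty pages and return their ids."""
        flushed = []
        while self.dirty and len(flushed) < n:
            page_id = self.dirty.pop()
            self._members.discard(page_id)
            flushed.append(page_id)  # stand-in for writing the page to disk
        return flushed


flush_list = FlushList()
for page in ("A", "B", "A"):
    flush_list.mark_dirty(page)
print(list(flush_list.dirty), flush_list.flush_oldest(1))
```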

LRU (Hot Data) Linked List Management

The dilemma of insufficient cache

After all, the memory available to the Buffer Pool is limited. What happens when the pages that need to be cached take up more memory than the Buffer Pool has, that is, when there are no free cache pages left in the free linked list? Naturally, some old cache pages are removed from the Buffer Pool and new pages are put in. The question then is: which cache pages should be removed?

To answer this, we need to go back to why the Buffer Pool exists in the first place: to reduce IO against the disk. Ideally, every page we access is already cached in the Buffer Pool. In other words, the higher the cache hit rate, the better.

So what counts as hot data? Think of the chat list in WeChat. The contacts at the top are the ones you talked to most recently; the ones at the bottom are those you have not contacted for a long time. If the list has limited room, you delete the entries at the bottom first to make room for more recent contacts. Based on this idea, InnoDB created the LRU linked list.

Simple LRU linked list

This linked list evicts cached pages according to the least recently used principle, so it is called an LRU linked list (LRU: Least Recently Used). When we need to access a page, the LRU linked list can be handled like this:

  1. If the page is not in the Buffer Pool

When the page is loaded from disk into a cache page in the Buffer Pool, the control block of that cache page is inserted at the head of the LRU linked list as a node.

  2. If the page has already been cached in the Buffer Pool

Move the control block of the page to the head of the LRU linked list.

In other words, whenever a cache page is used, it is moved to the head of the LRU linked list, so the tail of the list holds the least recently used cache pages. When the free cache pages in the Buffer Pool are used up, it is enough to evict some cache pages from the tail of the LRU linked list.
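The simple LRU rule above can be sketched with Python's OrderedDict, treating the first entry as the head of the list. The SimpleLRU class is a hypothetical illustration and tracks page ids only, not page contents.

```python
from collections import OrderedDict


class SimpleLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # first key = head (most recently used)

    def access(self, page_id):
        """Access a page; return the id of the evicted page, if any."""
        evicted = None
        if page_id not in self.pages:
            if len(self.pages) >= self.capacity:
                # No free cache page left: evict from the tail.
                evicted, _ = self.pages.popitem(last=True)
            self.pages[page_id] = True
        # Newly loaded or already cached: move to the head either way.
        self.pages.move_to_end(page_id, last=False)
        return evicted


lru = SimpleLRU(2)
lru.access(1)
lru.access(2)
lru.access(1)            # page 1 moves back to the head
evicted = lru.access(3)  # page 2 is now the least recently used
print(evicted, list(lru.pages))
```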

MySQL's variant of the simple LRU linked list

The traditional LRU algorithm is very intuitive, and a lot of software, such as operating systems and memcache, uses it. Why is MySQL so fussy that it cannot use it as is? Because, combined with some of MySQL's own optimizations, the plain algorithm leads to the following two problems:

  • Read-ahead (pre-read) failure
  • Buffer pool pollution

Read-ahead failure

To save disk IO, MySQL has the following optimization: when loading a disk page, it also loads the next few disk pages into memory. This is called read-ahead (pre-reading). However, read-ahead is a prediction, and to some extent a matter of luck. In many cases the pages brought in by read-ahead are never used at all.

For MySQL, memory is precious. Disk pages that are loaded into memory purely on the basis of a prediction can waste a lot of it.

So should read-ahead simply be turned off? Certainly not; when the prediction is right, it greatly reduces the time spent on disk IO. Instead, InnoDB chose to optimize the LRU linked list to deal with this problem.

Buffer pool pollution

Suppose that, for whatever reason, we perform a full table scan on a table, and that only a small portion of the pages read are actually useful. The scan then loads a large number of useless pages into memory.

The LRU linked list has been optimized for both situations. The goal is to keep such useless pages from occupying precious memory as far as possible.

Optimization 1: Split the list into a young area and an old area to solve read-ahead failure

InnoDB divides the LRU linked list into two sections according to a certain proportion:

  • One section stores cache pages that are used very frequently. This part of the list is called hot data, or the young area (new generation).
  • The other section stores cache pages that are used less frequently. This part of the list is called cold data, or the old area (old generation).

By default the old area takes up 37% of the LRU linked list, that is, roughly 3/8 of it. This ratio can be adjusted with the innodb_old_blocks_pct system variable.

InnoDB stipulates that when a page is first loaded from disk into a cache page in the Buffer Pool, the control block of that cache page is placed at the head of the old area. This way, pages that were read ahead into the Buffer Pool but never accessed afterwards are evicted from the old area early. If a page at the head of the old area is actually read, it moves into the young area, where it lives longer than in the old area.

Let's walk through the process in more detail:

If the page is read ahead

Suppose a new page with page number 50 is read ahead and added to the buffer pool:

  1. Page 50 is inserted only at the head of the old area, and the page at the tail of the old area (which is also the tail of the whole list) is evicted;
  2. If page 50 is never actually read, that is, the read-ahead fails, it is evicted from the buffer pool earlier than the data in the young area.

If the page is read

Suppose page 50 is read right away, for example because an SQL statement accesses row data in the page:

  1. It is immediately moved to the head of the young area;
  2. Pages at the tail of the young area are pushed down into the old area, and no page is actually evicted at this point.
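The two cases above can be sketched in Python as follows. This is a simplified model with hypothetical names (MidpointLRU, load, read): the two areas are plain lists with index 0 as the head, and the old area ratio is a constructor argument standing in for the 37% default.

```python
class MidpointLRU:
    def __init__(self, capacity, old_ratio=0.37):
        self.capacity = capacity
        self.old_capacity = max(1, int(capacity * old_ratio))
        self.young = []  # index 0 = head of the young area
        self.old = []    # index 0 = head of the old area, last = overall tail

    def load(self, page_id):
        """First load (for example by read-ahead): insert at the head of the old area."""
        evicted = None
        if len(self.young) + len(self.old) >= self.capacity:
            evicted = self.old.pop()  # evict the overall tail
        self.old.insert(0, page_id)
        return evicted

    def read(self, page_id):
        """An actual read promotes the page to the head of the young area."""
        if page_id in self.young:
            self.young.remove(page_id)
        else:
            self.old.remove(page_id)
        self.young.insert(0, page_id)
        # If the young area grows past its share, its tail is pushed into the
        # old area; nothing is evicted by this step.
        if len(self.young) > self.capacity - self.old_capacity:
            self.old.insert(0, self.young.pop())


lru = MidpointLRU(capacity=4, old_ratio=0.5)
for page in (1, 2, 3, 4):
    lru.load(page)
lru.read(1)
lru.read(2)                 # pages 1 and 2 are now hot
print(lru.load(50), lru.load(51))  # read-ahead pages evict only old-area pages
print(lru.young, lru.old)
```

In this run the hot pages 1 and 2 survive two read-ahead loads, because new pages only compete with the old area.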

Optimization 2: Add a residence time window to the old area to solve the pollution problem

What is the difference between a full table scan and read-ahead? Read-ahead pages may never be accessed at all. A full table scan, on the other hand, really does access every page of the table, so every one of those pages (including a large number that will never be needed again) would move from the old area into the young area. Such promotions are completely pointless.

Therefore, the MySQL buffer pool adds an "old area residence time window" mechanism:

  1. Let T be the residence time window of the old area;
  2. A page inserted at the head of the old area is not moved to the head of the young area immediately, even if it is accessed right away;
  3. Only when the page is accessed and its time in the old area is greater than T is it moved to the head of the young area.

With the residence time window in place, pages that are loaded in bulk within a short time are not inserted at the head of the young area immediately. Only if a page is accessed again after it has spent time T in the old area does it move from the old area to the young area.

This optimization gives priority to evicting pages that were only accessed within a short burst. Put simply, truly hot pages must stand the test of time.
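A minimal sketch of the time window rule, assuming the caller passes in the current time in seconds so the behaviour is easy to follow. The class and method names are hypothetical, and capacity and eviction are left out to keep the focus on promotion. To my knowledge MySQL exposes the window as the innodb_old_blocks_time variable (in milliseconds), but that mapping is not taken from this article.

```python
class TimeWindowLRU:
    def __init__(self, window_seconds=1.0):
        self.window = window_seconds  # T: required residence time in the old area
        self.young = []               # index 0 = head
        self.old = []                 # index 0 = head
        self.entered_old_at = {}      # page_id -> time the page entered the old area

    def load(self, page_id, now):
        self.old.insert(0, page_id)
        self.entered_old_at[page_id] = now

    def read(self, page_id, now):
        """Return True if the page is in the young area after this read."""
        if page_id in self.young:
            self.young.remove(page_id)
            self.young.insert(0, page_id)
            return True
        if now - self.entered_old_at[page_id] > self.window:
            self.old.remove(page_id)
            del self.entered_old_at[page_id]
            self.young.insert(0, page_id)
            return True
        return False  # accessed too soon: stays in the old area


lru = TimeWindowLRU(window_seconds=1.0)
lru.load("scan_page", now=0.0)
print(lru.read("scan_page", now=0.01))  # touched by the scan itself: not promoted
print(lru.read("scan_page", now=2.0))   # accessed again later: promoted
```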

When the Buffer Pool is flushed

A dedicated background thread flushes dirty pages to disk at regular intervals, so that flushing does not interfere with user threads handling normal requests. (Its role is loosely comparable to that of a garbage collector.)

Flushing works from two lists: the flush (dirty page) linked list and the LRU (hot data) linked list.

Flushing from the flush linked list

Every page on this list is dirty, so it is the natural place to flush from.

The background thread periodically flushes some pages from the flush linked list to disk; the flush rate depends on how busy the system is at the time. This way of flushing pages is called BUF_FLUSH_LIST.

At the tail of the flush list are the oldest dirty pages, so it makes sense to flush them first. These pages may well be hot, but leaving them unflushed for a long time ties up a large amount of redo log space, so they still have to be flushed.

Flushing cold data from the LRU linked list

As we said, the purpose of the LRU list is to evict cold data once the Buffer Pool is full. In theory, the more data is cached, the better (it raises the hit rate), and data that has only just gone cold may well be accessed again soon. So the data in the LRU linked list is generally not cleaned up proactively.

However, the background thread periodically scans some pages from the tail of the LRU linked list and flushes any dirty pages it finds there to disk. Think about it: if a cached page has not been hit for so long that it has gone cold, and it is dirty as well, there is really no reason to leave it unflushed.

This way of flushing pages is called BUF_FLUSH_LRU.

Other last-resort methods

Sometimes the background thread flushes dirty pages too slowly, so a user thread that wants to load a disk page into the Buffer Pool finds no cache page available. In that case it first checks whether there is an unmodified page at the tail of the LRU linked list that can be released directly. If there is none, it has to synchronously flush a dirty page at the tail of the LRU linked list to disk (disk access is slow, so this slows down the handling of user requests). This way of flushing a single page to disk is called BUF_FLUSH_SINGLE_PAGE. (Deciding which page to give up in such a situation is what the LRU linked list is for.)
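The order of attempts described above might be sketched like this. It is a simplified model, not InnoDB's code: make_room, the scan_depth limit, and the returned labels are all hypothetical, and flush_page stands in for the synchronous write to disk.

```python
def make_room(free_frames, lru, dirty, flush_page, scan_depth=8):
    """Find a cache page for a user thread.

    free_frames: list of free cache pages
    lru: list of cached page ids, index 0 = head, last = tail
    dirty: set of page ids that are dirty
    flush_page: callable that writes one page to disk
    """
    if free_frames:
        free_frames.pop()
        return ("free list", None)
    # Look near the tail for an unmodified page that can be released directly.
    for page in list(reversed(lru))[:scan_depth]:
        if page not in dirty:
            lru.remove(page)
            return ("clean page at LRU tail", page)
    if not lru:
        raise RuntimeError("buffer pool is empty")
    # Last resort: synchronously flush one dirty page from the tail.
    victim = lru.pop()
    flush_page(victim)
    dirty.discard(victim)
    return ("single page flush", victim)


flushed = []
lru = ["hot", "clean", "dirty_tail"]
print(make_room([], lru, {"dirty_tail"}, flushed.append))
print(make_room([], ["d1", "d2"], {"d1", "d2"}, flushed.append), flushed)
```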

Of course, when the system is particularly busy, user threads may also flush dirty pages from the flush linked list in batches. Flushing dirty pages while handling a user request obviously slows that request down considerably, so this only happens when it cannot be avoided.

Multiple Buffer Pool instances improve concurrency speed

As we said above, the Buffer Pool is essentially a piece of contiguous memory that InnoDB requests from the operating system. In a multi-threaded environment, access to the various linked lists in the Buffer Pool has to be protected by locks. When the Buffer Pool is very large and concurrent access from many threads is very high, a single Buffer Pool can slow down request handling.

So when the Buffer Pool is very large, it can be split into several smaller Buffer Pools, each of which is called an instance. The instances are independent: each requests its own memory and manages its own linked lists, so threads accessing different instances do not interfere with each other, which improves concurrency.
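A small Python sketch of the idea, under the assumption that each page is routed to one instance by hashing its table space number and page number. How InnoDB actually maps pages to instances is not described in this article, so the instance_for rule here is purely illustrative, as are the class names.

```python
import threading


class PoolInstance:
    """One independent Buffer Pool instance with its own lock and its own pages."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pages = {}


class MultiInstancePool:
    def __init__(self, n_instances):
        self.instances = [PoolInstance() for _ in range(n_instances)]

    def instance_for(self, space_id, page_no):
        # Assumed routing rule: the same page always maps to the same instance.
        return self.instances[hash((space_id, page_no)) % len(self.instances)]

    def put(self, space_id, page_no, data):
        inst = self.instance_for(space_id, page_no)
        with inst.lock:  # only this instance is locked, not the whole pool
            inst.pages[(space_id, page_no)] = data

    def get(self, space_id, page_no):
        inst = self.instance_for(space_id, page_no)
        with inst.lock:
            return inst.pages.get((space_id, page_no))


pool = MultiInstancePool(4)
for page_no in range(100):
    pool.put(1, page_no, b"page")
print([len(inst.pages) for inst in pool.instances])
```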

Buffer Pool changes its size in chunks

Before MySQL 5.7.5, the size of the Buffer Pool could only be set through the innodb_buffer_pool_size startup parameter when the server was started. It could not be changed while the server was running.

MySQL 5.7.5 and later versions support resizing the Buffer Pool while the server is running. But there is a problem: if every resize meant requesting a new piece of contiguous memory from the operating system and copying the contents of the old Buffer Pool into it, resizing would be extremely time-consuming.

Therefore, MySQL no longer requests one large piece of contiguous memory for a Buffer Pool instance in one go. Instead it requests memory from the operating system in units called chunks. In other words, a Buffer Pool instance is made up of several chunks. Each chunk is a piece of contiguous memory containing a number of cache pages and their control blocks.

Because of chunks, resizing the Buffer Pool while the server is running means adding or removing memory chunk by chunk, without requesting one large new region from the operating system and copying all cache pages into it.
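The benefit of chunks can be illustrated with a toy model in Python. The names are hypothetical and the chunk size is tiny on purpose; the point is only that growing or shrinking touches whole chunks and leaves existing chunks where they are. Real shrinking would also have to move pages out of the chunks being removed, which this sketch ignores.

```python
class Chunk:
    def __init__(self, pages_per_chunk):
        # One contiguous piece of memory holding several cache pages.
        self.frames = [None] * pages_per_chunk


class ChunkedPool:
    def __init__(self, n_chunks, pages_per_chunk=4):
        self.pages_per_chunk = pages_per_chunk
        self.chunks = [Chunk(pages_per_chunk) for _ in range(n_chunks)]

    @property
    def total_pages(self):
        return len(self.chunks) * self.pages_per_chunk

    def resize(self, target_pages):
        """Grow or shrink in whole chunks; existing chunks are never copied."""
        needed = -(-target_pages // self.pages_per_chunk)  # ceiling division
        while len(self.chunks) < needed:
            self.chunks.append(Chunk(self.pages_per_chunk))
        del self.chunks[needed:]


pool = ChunkedPool(n_chunks=2)
first_chunk = pool.chunks[0]
pool.resize(13)  # rounded up to 4 chunks of 4 pages
print(pool.total_pages, pool.chunks[0] is first_chunk)
```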

One of InnoDB's four major features: the change buffer (write buffer)

For a long time I confused the flush linked list with the change buffer. Both seem to deal with dirty data, and I could never quite tell them apart. Let's sort it out together.

What is the change buffer?

Before MySQL 5.5 it was called the insert buffer and only optimized inserts. Now it also works for deletes and updates, and is called the change buffer.

It applies to non-unique secondary index pages. When a write targets such a page and the page is not in the buffer pool, InnoDB does not load the page from disk right away. It only records the change in the change buffer. When the page is read later, the buffered changes are merged into it and the up-to-date page is placed in the buffer pool. The purpose of the write buffer is to reduce disk IO for write operations and improve database performance.

Conditions for adding elements to the flush linked list

As we said earlier, what are the conditions for a page to be added to the flush linked list?

  1. The page has been read from disk into the Buffer Pool.
  2. When we modify data on the page, the cache page becomes a dirty page and is added to the flush list to wait for flushing.

The role of change buffer

If the role of the change buffer is still unclear, the following comparison should help.

Without a change buffer: updating a page that is not in memory

Suppose the page we want to modify is not in memory, and someone issues an update statement against it. The InnoDB engine then works as follows:

  1. Load the data page from disk into the buffer pool: one random disk read;
  2. Modify the page in the buffer pool: one memory operation;
  3. Write to the redo log: one sequential disk write.

When the buffer pool is not hit, at least one extra disk IO is incurred. Is there room for optimization in workloads with many writes and few reads?

With a change buffer: updating a page that is not in memory (the difference from the flush linked list)

  1. Record the operation in the write buffer: one memory operation;
  2. Write to the redo log: one sequential disk write.

As you can see, the change buffer saves one disk IO outright.

Will there be consistency issues when reading data?

Of course not. The change buffer effectively stores, page by page, the modifications that have not yet been applied. Until those changes are merged, the copy of the page on disk is stale, so reading the disk page alone would give wrong results.

The solution is simple: first read the stale page into memory, then apply the modification records for that page from the change buffer to rebuild its latest version. (Note that at this point the entries for this page are removed from the change buffer, because they have been merged into the cached page. If the page is modified afterwards, it is handled like any other dirty page and goes onto the flush linked list.) Seen this way, the two mechanisms fit together neatly.
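The buffering and merging described above can be sketched as follows. This is a toy model with hypothetical names that handles inserts only, ignores the redo log, and represents pages as lists of records.

```python
class ChangeBufferDemo:
    def __init__(self, disk):
        self.disk = disk          # page_no -> list of records (on-disk state)
        self.pool = {}            # pages currently cached in the buffer pool
        self.change_buffer = {}   # page_no -> pending inserts not yet merged
        self.disk_reads = 0

    def insert(self, page_no, record):
        if page_no in self.pool:
            self.pool[page_no].append(record)  # page is cached: modify it directly
        else:
            # Page is not cached: only remember the change, no disk read.
            self.change_buffer.setdefault(page_no, []).append(record)

    def read(self, page_no):
        if page_no not in self.pool:
            self.disk_reads += 1
            page = list(self.disk[page_no])  # stale version from disk
            # Merge: apply the buffered changes and drop them from the change buffer.
            page.extend(self.change_buffer.pop(page_no, []))
            self.pool[page_no] = page
        return self.pool[page_no]


demo = ChangeBufferDemo({2: ["a"]})
demo.insert(2, "b")
demo.insert(2, "c")
print(demo.disk_reads)           # two writes so far, no disk read yet
print(demo.read(2), demo.disk_reads)
```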

When to refresh the change buffer

  1. As described above, when a page that has entries in the change buffer is read. The page is read from disk once and combined with the change buffer to produce the latest data, after which the entries for that page are removed from the change buffer;
  2. When a background thread decides the database is idle and merges the changes;
  3. When the buffer pool runs short of space;
  4. When the database is shut down normally;
  5. When the redo log is full. (The redo log should almost never be full; when it is, MySQL throughput drops severely for a while.)

What happens if the server goes down while there is data in the change buffer?

Every change recorded in the change buffer is also written to the redo log, so if the database crashes, the data can be recovered from the redo log.

Why is the change buffer only used for non-unique secondary indexes?

Let's compare how a unique index and an ordinary index handle an insert:

The target page of the record to be inserted is in memory

  1. For a unique index, find the position between 3 and 5, check that there is no conflict, insert the value, and the statement is done;
  2. For an ordinary index, find the position between 3 and 5, insert the value, and the statement is done.

In this case the difference between an ordinary index and a unique index is a single uniqueness check, which costs only a small amount of CPU time.

However, this is not our focus.

The target page of the record to be inserted is not in the memory

  1. For a unique index, the data page has to be read into memory to check that there is no conflict; then the value is inserted and the statement is done;
  2. For an ordinary index, the update is recorded in the change buffer and the statement is done.

Reading data from disk into memory involves random IO and is one of the most expensive operations in a database. Because the change buffer reduces random disk access, the improvement in update performance is significant. (Many changes to the same page can accumulate and then be applied to the page together, which reduces IO.)
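The four cases above can be summarized in a tiny Python sketch that counts random disk reads per insert. The function names and the returned descriptions are made up for illustration, and the 250/750 split between cached and uncached target pages is an arbitrary example, not a measurement.

```python
def insert_one(index_is_unique, page_in_memory):
    """Return (what happens, random disk reads) for one insert, per the cases above."""
    if page_in_memory:
        return ("update page in memory", 0)
    if index_is_unique:
        # The page must be read to check for a conflicting value.
        return ("read page from disk, check conflict, insert", 1)
    return ("record change in change buffer", 0)


def total_random_reads(index_is_unique, page_hits):
    return sum(insert_one(index_is_unique, hit)[1] for hit in page_hits)


# Example workload: 1000 inserts, 750 of which target pages that are not in memory.
hits = [True] * 250 + [False] * 750
print(total_random_reads(True, hits), total_random_reads(False, hits))
```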

This makes a good problem to raise and solve in an interview. One day, the memory hit rate of a database dropped from 99% to 75%, the whole system stalled, and all update statements were blocked. Investigation showed that this workload performed a large number of inserts, and that someone had changed one of its ordinary indexes into a unique index the day before.

Comparison of change buffer and redo

How are these two even comparable? One is cached data and the other is a log file; at first glance they have nothing in common.

However, anyone familiar with the redo log knows that it shares a goal with the change buffer: minimizing random reads and writes. With that in mind, let's look at the difference between the change buffer and the redo log.

Redo log: fewer random disk writes

Now suppose we execute this insert statement on the table:

insert into t(id,k) values(id1,k1),(id2,k2);

Here we assume that k has a secondary B+ tree index. After the positions are located, the data page holding k1 is in memory (the InnoDB buffer pool), while the data page holding k2 is not. The figure shows the state of this update with the change buffer in use.

Analyzing this update statement, you will find that it involves four parts: memory, the redo log (ib_log_fileX), the data table space (t.ibd), and the system table space (ibdata1).

The update statement performs the following operations (following the numbering in the figure):

  1. Page 1 is in memory, so it is updated in memory directly;
  2. Page 2 is not in memory, so the message "I want to insert a row into Page 2" is recorded in the change buffer area of memory;
  3. Both of the above actions are recorded in the redo log (3 and 4 in the figure).

Once this is done, the transaction can complete. So the cost of executing this update statement is very low: two writes to memory and one write to disk (the two operations are written to disk together), and that disk write is sequential. (Note: the three steps above form one transaction, which can only complete after the redo log has been written. This also confirms that the redo log must be able to restore the data in the change buffer.)

Change buffer: fewer random disk reads

Now suppose we execute

select * from t where k in (k1, k2)

Here, I drew the flow chart of these two read requests.

If the read statement is issued shortly after the update statement and the data is still in memory, the two read operations have nothing to do with the system table space (ibdata1) or the redo log (ib_log_fileX). Therefore, I did not draw these two parts in the picture.

As can be seen from the figure:

  1. Page 1 is returned directly from memory. Several readers asked in the comments on the previous article whether, after WAL, a read has to go to disk, or has to apply the redo log to the data before returning. It does not. Look at the state in Figure 3: although the disk still holds the old data, the result is returned directly from memory, and it is correct.
  2. To read Page 2, Page 2 has to be read from disk into memory, and the operation log in the change buffer is then applied to produce the correct version, which is returned.

As you can see, Page 2 is not read into memory until it actually needs to be read. The actual write to disk only happens when the database is idle or as a last resort. By the time the page is flushed, several statements may have modified it. Flushing the whole page once instead of writing to disk for each statement reduces the number of disk interactions, which is how disk IO is saved.

Summary

So, if you simply compare what these two mechanisms contribute to update performance: the redo log mainly saves the IO cost of random disk writes (by turning them into sequential writes), while the change buffer mainly saves the IO cost of random disk reads.