The storage format of page and row data in InnoDB

MySQL is still a black box for us. We are only responsible for using the client to send the request and waiting for the server to return the result. Where is the data in the table stored? In what format? In what way does MySQL access this data? We have no idea about these issues. To understand the principles behind query optimization, we must go deep into the bottom layer of MySQL, and the principles of transactions and locks also require us to go deep into the bottom layer.

The actual process of InnoDB storing data

InnoDB is a storage engine that stores the data in the table to disk, so our data still exists even if we restart it after shutting down.

So, many novices will understand it for granted-every time we store, query data is directly interacting with the disk. However, in fact, the real data processing process is carried out in the InnoDB layer memory. In other words, our modification and update operations are all directly written to an InnoDB's internal cache. And when will the cache be flushed to disk? And when the data in this kind of memory is down before it is flushed to the disk, how to restore the data, we will talk about it later.

In addition to writing the memory to the disk, when we want to get a record in the table, does the InnoDB storage engine take the disk data into the memory in units of records?

Recalling our B+ tree storage structure, InnoDB adopts the method: divide data into several pages, and use pages as the basic unit of interaction between disk and memory. The size of pages in InnoDB is generally 16 KB. That is, under normal circumstances, at least 16KB of content from the disk is read into the memory at a time, and 16KB of the content in the memory is flushed to the disk at least at a time.

We know that even a row of data is often manipulated. From the perspective of InnoDB, it is equivalent to directly operating the entire disk page where the row is located. From the perspective of our users, data insertion is often done in units of row records.

The storage method of these records on the disk is also called line format or record format. The InnoDB storage engine has designed 4 different types of row formats, namely Compact, Redundant, Dynamic and Compressed row formats.

Row format

Do you think that the row data in the disk file is the same as what you found out by select *, and that the information corresponding to the rows of a large number of columns is over. Unfortunately, things are not that simple.

Set the row format in units of tables

We can specify the row format in the statement to create or modify the table:

CREATE TABLE table name (column information) ROW_FORMAT=row format name

Row format category



Variable length field list

We know that MySQL supports some variable-length data types, such as VARCHAR(M), VARBINARY(M), various TEXT types, and various BLOB types. We can also call columns with these data types as variable-length fields. The number of bytes of data stored in the field is not fixed, so when we store real data, we need to store the number of bytes occupied by these data by the way. If the maximum number of bytes (M×W) allowed to be stored in the variable field exceeds 255 bytes and the actual number of bytes (L) exceeds 127 bytes, then 2 bytes are used, otherwise 1 byte is used. (The maximum length of the varcahr type of MySQL database is limited to 65535)

List of NULL values

Some columns in the table may store NULL values. If these NULL values ​​are stored in the real data of the record, it will take up a lot of space. Therefore, the Compact row format manages these columns with NULL values ​​in a unified manner and stores them in the NULL value list. .

Each column that allows NULL to be stored corresponds to a binary bit. When the value of the binary bit is 1, it means that the value of the column is NULL. When the value of the binary bit is 0, it means that the value of the column is not NULL.

This is why it is recommended that we try not to use Null columns. In the row data of the InnoDB layer, it is always necessary to open up a little space to maintain the list of Null values.

You see, if a table has 100 million pieces of data, and you set a column to be NULL, a lot of storage space is wasted for no reason to record a list of NULL values. At this point you are dissatisfied. Then I set it to not null, do I have to store something in this column to take up memory? But even so, we will be more convenient and unified when managing rows in the future, because the meaning of NULL values ​​in different application scenarios of MySQL is too different.

Record header information

  1. Reserved 1 is not used
  2. Reserved 2 is not used
  3. delete_mask mark whether the record is deleted
  4. min_rec_mask B+ The smallest record in each non-leaf node of the tree will add this mark
  5. n_owned represents the number of records owned by the current record
  6. heap_no represents the location information currently recorded on the page
  7. record_type represents the type of the current record, 0 represents a normal record, 1 represents a B+ tree non-leaf node record, 2 represents the smallest record, and 3 represents the largest record
  8. next_record16 represents the relative position of the next record

Hidden column


In addition to the data in the columns defined by ourselves, MySQL will add some columns (also known as hidden columns) to each record by default, including:

  1. DB_ROW_ID(row_id): optional, 6 bytes, representing the row ID, uniquely identifying a record
  2. DB_TRX_ID: required, 6 bytes, representing the transaction ID
  3. DB_ROLL_PTR: required, 7 bytes, indicating the rollback pointer

The primary key generation strategy for InnoDB tables is to use a user-defined primary key as the primary key first. If the user does not define a primary key, select a unique key as the primary key. If even the Unique key is not defined in the table, InnoDB will add a hidden column named row_id as the primary key by default.

The two columns DB_TRX_ID (also called trx_id) and DB_ROLL_PTR (also called roll_ptr) are necessary. The isolation of MVCC transactions and undo log rollback data are all dependent on him.

Redundant row format

Redundant row format is a row format used before MySQL 5.0, so I won’t go into it further.

Dynamic and Compressed row formats

The default row format of MySQL 5.7 is Dynamic. The Dynamic and Compressed row formats are similar to the Compact row format, except that they are different when dealing with row overflow data. The difference between the Compressed row format and Dynamic is that the Compressed row format uses compression algorithms to compress pages to save space.

Data overflow

If we define a table, there is only one VARCHAR field in the table, as follows:

CREATE TABLE test_varchar( c VARCHAR(60000))

Then insert 60,000 characters into this field, what will happen?

As mentioned earlier, the basic unit of interaction between disk and memory in MySQL is pages, which means that MySQL manages storage space with pages as the basic unit, and our records will be allocated to a certain page for storage. The size of a page is generally 16KB, which is 16384 bytes, and a VARCHAR(M) type column can store up to 65532 bytes, which may cause a situation where a page cannot store a record.

Compact and Redundant handle page overflow

In Compact and Redundant row formats, for columns that occupy a very large storage space, the column data corresponding to the row will only store the first 768 bytes of data in the column, and then the remaining data will be stored in several other pages. Among these 768 bytes, 20 bytes are used to store addresses that point to these overflow pages. This process is also called row overflow, and those pages that store more than 768 bytes are also called overflow pages.

Dynamic and Compressed handle page overflow

The format of these two rows is more straightforward. For a column that occupies a very large storage space, the column data corresponding to the row only stores the overflow page address, and all the actual data in the column is all scattered and stored in other pages.

Conditions for the occurrence of data overflow

The examples of VARCHAR(60000) we mentioned before are often also likely to appear, so how does InnoDB determine that a row of data is too large to require data overflow?

Suppose there is only one field in the table. To review again, the size of a page in our MySQL is 16384 bytes, and MySQL stipulates that a page stores at least two rows of records. In addition to storing our records, each page also needs to store some information about the page itself, which adds up to 132 A byte of space, other spaces can be used to store records. The additional information required for each record is 27 bytes.

Assuming that the number of data bytes stored in a column is n, MySQL stipulates that if the column does not overflow, the following formula needs to be satisfied:

132 + 2×(27 + n) <16384

The solution obtained by solving this formula is: n <8099. That is to say, if the data stored in a column is less than 8099 bytes, then the column will not become an overflow column, otherwise the column needs to become an overflow column. However, this 8099-byte conclusion is only for the test_varchar table with only one column. If there are multiple columns in the table, the above formula and conclusion need to be changed. So the point is: the specific value of this critical point does not matter, as long as we know that if the data stored in a column of a record occupies a lot of bytes, the column may become an overflow column.

Index page

We briefly mentioned the concept of a page, which is the basic unit of InnoDB to manage storage space. The size of a page is generally 16KB.

InnoDB has designed many different types of pages for different purposes. The type of page that stores records in our table is naturally one of them. Officially, this type of page that stores records is called an index (INDEX) page. It is no problem to understand it as a data page. Corresponds to the leaf nodes in our B+ tree. There are both indexes and actual row data.

Index page format


The storage space of an InnoDB data page is roughly divided into 7 parts:

  1. File Header Some general information about the 38-byte page of the file header
  2. Page Header 56-byte data page specific information
  3. Infimum + Supremum minimum record and maximum record 26 bytes, two virtual line records
  4. User Records User record size is not sure of the actual stored row record content
  5. Free Space The size of the free space is not sure about the unused space in the page
  6. Page Directory The size of the page directory is uncertain about the relative position of some records in the page
  7. File Trailer 8 bytes at the end of the file to check whether the page is complete

File Header + File Trailer

File Header

File Header is common to various types of pages, which means that different types of pages will use File Header as the first component. It describes some information that is common to various pages, such as the type of page. What is the number of the page, who is the previous page, the next page, the checksum of the page, etc. This part occupies a fixed 38 bytes.

Page type, including Undo log page, segment information node, Insert Buffer free list, Insert Buffer bitmap, system page, transaction system data, table space header information, extended description page, overflow page, index page.

A doubly linked list is established through the previous page and the next page to concatenate many pages, without the need for these pages to be physically connected. But not all types of pages have the attributes of the previous and next pages. Data pages have these two attributes, so all data pages are actually a doubly linked list.

File Trailer

We know that the InnoDB storage engine will store data on the disk, but the disk speed is too slow, and the data needs to be loaded into the memory for processing in units of pages. If the data in the page is modified in the memory, then the modified The data needs to be synchronized to the disk at a certain time. But what if the power is interrupted when the synchronization is halfway?

In order to check whether a page is complete (that is, whether there is an awkward situation where only half of the synchronization occurs during synchronization), InnoDB has added a File Trailer part to each page. This part is composed of 8 bytes and can be divided into 2 small section:

  1. The first part: The first 4 bytes represent the checksum of the page.

This part corresponds to the checksum in the File Header. Whenever a page is modified in memory, its checksum must be calculated before synchronization and stored in the File Header. When all the data on this page is synchronized, the value in File Trailer will also be updated. At this time, the checksums of the first File Heade and the tail File Trailer should be the same. If they are inconsistent, it means a problem occurred during the synchronization process.

  1. Part 2: The last 4 bytes represent the log sequence position (LSN) when the page was last modified

This is also related to the integrity of the check page.

InnoDB, in order to get the status information of the records stored in a data page, such as how many records have been stored in this page, what is the address of the first record, how many slots are stored in the page directory, etc., are specifically in the page A part called Page Header is defined. It is the second part of the page structure. This part occupies a fixed 56 bytes and is dedicated to storing various status information.

User Records + Infimum + Supremum + Free Space (line record)


The records we store will be stored in the User Records section according to the row format we specify. But when the page is first generated, there is actually no User Records section. Whenever we insert a record, we will apply for a record-sized space from the Free Space section, which is the unused storage space, and divide it into the User Records section. When the Free Space part of the space is completely replaced by the User Records part, it means that the page is used up. If there are new records inserted, you need to apply for a new page.


When the current record is deleted, the delete_mask in the record header information will be modified to 1, which means that the deleted record is still on the page and on the real disk. The reason why these deleted records are not removed from the disk immediately is because it takes performance to rearrange other records on the disk after removing them. So it's just a delete mark. All deleted records will form a so-called garbage linked list. The space occupied by the records in this linked list is called the so-called reusable space. Later, if a new record is inserted into the table. , May overwrite the storage space occupied by these deleted records.

The relationship between heap_no and Infimum + Supremum


The record we insert will record our position on this page and insert it into the heap_no of the row of data. The records with heap_no values ​​of 0 and 1 are two records that InnoDB automatically adds to each page, which are called pseudo records or virtual records. One of the two pseudo records represents the smallest record and the other represents the largest record. The two pseudo records are stored in the User Records section of the page. They are separately placed in a section called Infimum + Supremum.


In the record header information, next_record records the address offset from the real data of the current record to the real data of the next record. This is actually a linked list, and the next record can be found through one record.

But you need to pay attention to another point to note is that the next record does not refer to the next record in the order of our insertion, but the next record in the descending order of the primary key value.

And it is stipulated that the next record of the Infimum record (that is, the smallest record) is the user record with the smallest primary key value on this page, and the next record of the user record with the largest primary key value on the page is the Supremum record (that is, the largest record)


Our records form a singly-linked list in the order of the primary key from small to large. When the record is deleted, it is removed from this linked list.

Page Directory and n_owned

Page Directory mainly provides the search problem of a linked list of records in a disk page. The easiest and simplest way for us to locate a piece of data in a disk page is to traverse it one by one from beginning to end. The time complexity is O(1).

InnoDB is not satisfied with this slow traversal method. Its improvement is: a table of contents is created for the records in a page, the production process is like this:

The concept of InnoDB data pages, row records, and slots

  1. Divide all normal records (including the largest and smallest records, excluding the records marked as deleted) into several groups.
  2. The n_owned attribute in the header information of the last record of each group (that is, the largest record in the group) indicates how many records the record has, that is, how many records are in the group.
  3. The address offset of the last record of each group is extracted separately and stored in order, which is the so-called Page Directory, which is the page directory. These address offsets in the page directory are called slots (English name: Slot) , So this page directory is composed of slots. Slot 0 points to the Infimum of the first row array, and all subsequent slots point to the Supremum of the ascending array. Of course, this is only stipulated by the developer for convenience. Through these records, we can easily obtain the start address and end address of each slot. When querying a row of data, we can quickly locate which slot of the row of data, and then traverse in the slot to improve the traversal speed. .

How does InnoDB quickly locate the row records of a page

The process of finding the record of the specified primary key value in a data page is divided into two steps:

  1. Determine the slot where the record is located by dichotomy, and find the record with the smallest primary key value in the group where the slot is located.
  2. Traverse each record in the group where the slot is located through the next_record attribute of the record.

We see that InnoDB is not satisfied with the concept of slots. Binary search is still used when locating slots, which really squeezes out the query performance.

Slot number limit for different slots

The number of records in each group is specified: the group with the smallest record can only have 1 record, the group with the largest record can only have 1 to 8 records, and the rest of the group The range of the number of records in can only be between 4 and 8. As shown below: