MySQL's varchar runs deeper than you think: the InnoDB record storage structure

This article explains how varchar is stored. The material is fairly difficult and involves some calculation; interested readers are welcome to read on.


1. What does InnoDB do?

InnoDB is a storage engine: it stores the data in tables on disk.

2. How does InnoDB read and write data?

InnoDB processes data in memory, so data on disk must first be loaded into memory. For a write or update request, the modified content in memory must also be flushed back to disk.

Disk reads and writes are very slow, several orders of magnitude slower than memory access. So when we fetch records from a table, InnoDB does not read them one at a time. Instead, it divides the data into pages and uses the page as the basic unit of exchange between disk and memory. The default InnoDB page size is 16 KB. In other words, under normal circumstances at least 16 KB is read from disk into memory at a time, and at least 16 KB is flushed from memory to disk at a time.

Note: the innodb_page_size variable cannot be changed while the server is running. It can only be specified when the MySQL data directory is first initialized, so the page size cannot be changed at runtime.
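To make the "at least one page at a time" idea concrete, here is a minimal Python sketch. It only models the arithmetic: the function names pages_touched and bytes_transferred are made up for illustration, and treating a table as a flat run of 16 KB pages is a simplifying assumption, not how InnoDB actually lays out a tablespace.

```python
PAGE_SIZE = 16 * 1024  # default innodb_page_size, in bytes


def pages_touched(offset: int, length: int, page_size: int = PAGE_SIZE) -> range:
    """Page numbers that cover the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


def bytes_transferred(offset: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes moved between disk and memory when whole pages are the I/O unit."""
    return len(pages_touched(offset, length, page_size)) * page_size


# A 50-byte record inside one page still costs a full 16 KB page.
print(bytes_transferred(100, 50))     # 16384
# A 10-byte range that straddles a page boundary costs two pages.
print(bytes_transferred(16380, 10))   # 32768
```

Even a tiny read pays for a whole page, which is why how many bytes a record occupies matters so much in the rest of this article.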

3. The many questions about varchar: InnoDB row formats

By this point you probably have the same questions I had. For example, how should the length in varchar(255) be chosen? Why is the largest allowed declaration varchar(16383) rather than varchar(65535)? Let me show you!

We usually insert data into a table one record at a time. The way these records are laid out on disk is called the row format or record format. There are four row formats: Dynamic, Compact, Redundant, and Compressed.

Since MySQL 5.7 the default row format is Dynamic (the innodb_default_row_format variable exists from 5.7.9 onward). I confirmed this on both a MySQL 5 and a MySQL 8 server with:

SHOW VARIABLES LIKE 'innodb_default_row_format';

Almost nobody changes the InnoDB row format, or even pays attention to it, in day-to-day work. So I will focus only on the default row format, Dynamic, to give you a deeper understanding of the varchar you use in everyday development.

Please remember this table structure; we will refer to it throughout:

CREATE TABLE test (
    c1 VARCHAR(10),
    c2 VARCHAR(10) NOT NULL,
    c3 CHAR(10),
    c4 VARCHAR(10)
) CHARSET = utf8mb4;

Business databases nowadays almost always use the utf8mb4 character set, so I use it here to keep things as easy to follow as possible.

INSERT INTO test ( c1, c2, c3, c4 )
VALUES('aaaa', '你好啊', 'cc', 'd'),('eeee', 'fff', NULL, NULL);

The table now contains these records:

c1 | c2 | c3 | c4
aaaa | 你好啊 | cc | d
eeee | fff | NULL | NULL

3.1 Dynamic: InnoDB's default row format

Each record carries some extra information that the server must add in order to describe the record. This extra information falls into 3 categories: the variable-length field length list, the NULL value list, and the record header.

Here I cover only the variable-length field length list and the NULL value list. The record header is quite intricate and is not relevant to this article.

3.2 How does InnoDB know how long a varchar really is? The variable-length field length list

For variable-length data types such as VARCHAR(M), the various TEXT types, and the various BLOB types, the number of bytes stored in a column is not fixed. So when the actual data is stored, the number of bytes it occupies must be stored as well.

The storage occupied by a variable-length field (for example a varchar) therefore has two parts:
  1. The actual data content, stored in the corresponding column
  2. The actual number of bytes occupied, stored in the variable-length field length list

Take the first record in the test table as an example. Columns c1, c2, and c4 are all VARCHAR(10), meaning at most 10 characters each, so the lengths of these three column values must be stored at the beginning of the record. The columns use the utf8mb4 character set, which is variable-width: a character is encoded in 1 to 4 bytes (ASCII letters take 1 byte, most Chinese characters take 3). The lengths of the variable-length fields in the first record are:

Column name | Stored content | Content length (decimal) | Content length (hexadecimal)
c1 | 'aaaa' | 4 bytes | 0x04
c2 | '你好啊' | 9 bytes | 0x09
c4 | 'd' | 1 byte | 0x01

The byte counts of the variable-length fields are stored in the reverse of the column order! For the first record the list is therefore 01 09 04 (c4, c2, c1).

Strictly speaking, with a variable-width character set such as utf8mb4, InnoDB also stores a length for CHAR(M) columns such as c3. To keep the example simple, only the VARCHAR columns are followed here.

In the first record, the strings in c1, c2, and c4 are all short, so the number of bytes each occupies is small and can be expressed in 1 byte. If a variable-length column holds more bytes, 2 bytes may be needed to express its length. So how many characters can a varchar actually hold? Read on.
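Because utf8mb4 is variable-width, 4 bytes is the maximum per character, not the size of every character. A quick way to check character counts against byte counts is the sketch below. It assumes that Python's "utf-8" codec produces the same bytes as MySQL's utf8mb4 (both are full UTF-8); byte_len is a helper name made up for this example.

```python
def byte_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8 (utf8mb4)."""
    return len(s.encode("utf-8"))


# Values of the VARCHAR columns in the first record of the test table.
first_record = {"c1": "aaaa", "c2": "你好啊", "c4": "d"}

for column, value in first_record.items():
    n = byte_len(value)
    print(f"{column}: {len(value)} characters, {n} bytes, 0x{n:02x}")

# Only characters outside the Basic Multilingual Plane, such as emoji,
# actually need the full 4 bytes.
print(byte_len("😀"))
```

In MySQL itself, CHAR_LENGTH() and LENGTH() report the same two numbers (characters and bytes) for a string.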

3.3 How many characters can varchar(M) store, and why is the reported maximum 16383?

First, we need to be clear that the M in varchar(M) is a number of characters, not bytes.

Why can't a column be declared as varchar(20000)? Why won't 20000 characters fit?

Why does the error message say the maximum is 16383 characters? How is that number calculated?

Let's talk it through!

A varchar is variable-length. In a varchar(64) column I might store only 2 characters rather than 64, so who knows how many characters a given value really holds? InnoDB's design takes this into account, but it works in bytes: InnoDB must record the actual number of bytes L occupied by each variable-length field. L cannot be arbitrarily large, because InnoDB uses at most 2 bytes to record it.

InnoDB has its own set of rules for this. We introduce three symbols, W, M, and L:
  1. Assume a character set needs at most W bytes to represent one character.
  • For the utf8mb4 character set, W is 4.
  • For the utf8 character set, W is 3.
  • For the gbk character set, W is 2.
  • For the ascii character set, W is 1.
  2. For the variable-length type VARCHAR(M), M is the maximum number of characters the column can store (characters, not bytes),
    so the maximum number of bytes a string of this type can occupy is M × W.
  3. Let L be the number of bytes the stored string actually occupies.

Now consider the extreme case. To record how many bytes a varchar actually occupies, at most 2 bytes are provided. 2 bytes are 16 bits; with every bit set to 1, the largest number that can be recorded is 2^16 - 1 = 65535. So the largest varchar byte length that can be recorded is 65535. In the utf8mb4 character set a character can take up to 4 bytes, and 65535 / 4 = 16383.75, so as long as a varchar is declared with no more than 16383 characters its actual length can always be recorded; beyond that it cannot be guaranteed. This explains the error from earlier: varchar(20000) is rejected, and the maximum is 16383 characters. (MySQL also limits a whole row to 65,535 bytes shared among all columns, so in a real table the usable maximum can be somewhat lower.)
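The same arithmetic can be repeated for the other character sets. The sketch below simply divides the 65535-byte ceiling by W. The names MAX_VARCHAR_BYTES, MAX_BYTES_PER_CHAR, and max_declared_length are made up for this example, and the result is only the upper bound for a single column declaration; it ignores the row-size limit that other columns eat into.

```python
MAX_VARCHAR_BYTES = 2**16 - 1  # 65535, the largest value 2 bytes can record

# W: the maximum number of bytes per character in each character set.
MAX_BYTES_PER_CHAR = {"utf8mb4": 4, "utf8": 3, "gbk": 2, "ascii": 1}


def max_declared_length(charset: str) -> int:
    """Largest M such that M * W still fits in MAX_VARCHAR_BYTES."""
    return MAX_VARCHAR_BYTES // MAX_BYTES_PER_CHAR[charset]


for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_declared_length(charset))
# utf8mb4 16383
# utf8 21845
# gbk 32767
# ascii 65535
```

The integer division is the "round down" step: 65535 / 4 = 16383.75 becomes 16383 because a fraction of a character cannot be declared.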

Now for the rules (the explanation uses the utf8mb4 character set, so W = 4).

Rule 1: If the maximum number of bytes the column may store satisfies M × W <= 255, that is, M <= 63 in varchar(M), InnoDB uses only 1 byte to record the actual number of bytes the varchar occupies.

When InnoDB reads the variable-length field length list, it first checks the table structure. If a variable-length column can hold at most 255 bytes, that is, at most 63 characters, 1 byte is always enough to record the number of bytes the actual data occupies.

Rule 2: If the maximum number of bytes the column may store satisfies M × W > 255, that is, M > 63 in varchar(M), there are two cases:

If the actual number of bytes stored satisfies L <= 127, InnoDB uses only 1 byte to record the actual number of bytes the varchar occupies. This is guaranteed whenever at most ⌊127 / 4⌋ = 31 characters are stored (⌊…⌋ means rounding down).

If the actual number of bytes stored satisfies L > 127, which can only happen once more than 31 characters are stored, InnoDB uses 2 bytes to record the actual number of bytes the varchar occupies.
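The two rules fit in a few lines of code. This is a minimal sketch of the decision as described above, not InnoDB's actual implementation; the function name length_prefix_bytes and its parameter names are made up for illustration.

```python
def length_prefix_bytes(m: int, w: int, actual_bytes: int) -> int:
    """How many bytes record the length L of one VARCHAR(M) value.

    m            -- declared maximum number of characters (M)
    w            -- maximum bytes per character in the character set (W)
    actual_bytes -- bytes the stored value really occupies (L)
    """
    if actual_bytes > m * w:
        raise ValueError("value cannot exceed M * W bytes")
    if m * w <= 255:
        return 1  # Rule 1: the column can never need more than 1 byte
    # Rule 2: the column may hold long values, so decide per value.
    return 1 if actual_bytes <= 127 else 2


print(length_prefix_bytes(10, 4, 4))     # 1: varchar(10), Rule 1
print(length_prefix_bytes(64, 4, 127))   # 1: Rule 2, short value
print(length_prefix_bytes(64, 4, 128))   # 2: Rule 2, long value
```

Note that the first check depends only on the table definition, while the second depends on each stored value, so two rows of the same varchar(64) column can use different length sizes.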

Note also that the variable-length field length list only stores lengths for columns whose value is not NULL.

In the second record, the value of c4 is NULL, so only the lengths of c1 and c2 are stored.

Counting the VARCHAR columns, the variable-length field length list of the first record occupies 3 bytes, and that of the second record occupies 2 bytes.
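Here is a small sketch that builds the length list for both records. It makes three assumptions that are mine, not InnoDB's: only the three VARCHAR columns (c1, c2, c4) are tracked, Python's UTF-8 encoding stands in for utf8mb4, and every length fits in 1 byte. The helper name length_list is hypothetical.

```python
from typing import List, Optional


def length_list(values: List[Optional[str]]) -> bytes:
    """Build the length list for variable-length column values.

    values are given in column order; None stands for NULL.
    NULL columns are skipped, and the remaining byte lengths are
    written in the reverse of the column order.
    """
    lengths = [len(v.encode("utf-8")) for v in values if v is not None]
    if any(n > 127 for n in lengths):
        raise ValueError("this sketch only handles 1-byte length entries")
    return bytes(reversed(lengths))


record1 = ["aaaa", "你好啊", "d"]   # c1, c2, c4
record2 = ["eeee", "fff", None]     # c4 is NULL

print(length_list(record1).hex())   # 010904
print(length_list(record2).hex())   # 0304
```

The NULL in the second record simply disappears from the list, which is why that list is one byte shorter.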

Not every record has a variable-length field length list. If no column in the table has a variable-length data type, or if all the column values are NULL, this part is omitted. In real business development almost every table uses varchar, so in practice records do have a variable-length field length list.

3.4 If a value is NULL, how does InnoDB handle it? The NULL value list

If you have read carefully this far, you are well on your way to mastery. If, like me, you follow development guidelines that discourage NULL and generally declare columns NOT NULL, then your records have no NULL value list at all, which also saves space.

If some columns in a table may hold NULL, storing those NULL values inside the record's real data would waste a lot of space. So the Dynamic row format manages the NULL-valued columns together and stores them in the NULL value list. The process works like this:

First, work out which columns in the table are allowed to store NULL.

Primary key columns and columns declared NOT NULL cannot store NULL, so they are not counted. In the test table, for example, the 3 columns c1, c3, and c4 are allowed to store NULL, while c2 is declared NOT NULL and is not.

If the table has no columns that allow NULL, the NULL value list does not exist. Otherwise, each column that allows NULL gets one bit, and the bits are arranged in the reverse of the column order. A bit value of 1 means the column's value is NULL; 0 means it is not NULL. Since c1, c3, and c4 of the test table all allow NULL, the 3 columns map to bits as follows: reading from the highest used bit to the lowest, the bits stand for c4, c3, c1 (c1 is the lowest bit).

The NULL value list must occupy a whole number of bytes. If the number of bits used is not a multiple of 8, the high-order bits of the byte are filled with 0.

By the same logic, if a table has 9 columns that allow NULL, the NULL value list of each record needs 2 bytes.

For the first record, c1, c3, and c4 are all non-NULL, so every corresponding bit is 0 and the hexadecimal representation is 0x00.

For the second record, c3 and c4 are NULL, so their bits are 1 (binary 00000110) and the hexadecimal representation is 0x06.
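The bit arithmetic for the NULL value list can be checked with a short sketch. The function name null_bitmap is hypothetical, and writing a multi-byte list with the most significant byte first is my own assumption for illustration; the text above only fixes the bit order within the list.

```python
from typing import List, Optional


def null_bitmap(nullable_values: List[Optional[str]]) -> bytes:
    """Build the NULL value list for one record.

    nullable_values holds the values of the columns that allow NULL,
    in column order; None stands for NULL. The first nullable column
    maps to the lowest bit, so the bits read in reverse column order.
    """
    bits = 0
    for position, value in enumerate(nullable_values):
        if value is None:
            bits |= 1 << position
    # Round up to a whole number of bytes; unused high bits stay 0.
    size = (len(nullable_values) + 7) // 8
    return bits.to_bytes(size, "big")


record1 = ["aaaa", "cc", "d"]    # c1, c3, c4
record2 = ["eeee", None, None]   # c3 and c4 are NULL

print(null_bitmap(record1).hex())      # 00
print(null_bitmap(record2).hex())      # 06
print(len(null_bitmap([None] * 9)))    # 2: nine nullable columns need 2 bytes
```

A table with no nullable columns yields an empty list here, matching the case where the NULL value list does not exist at all.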

(Figure: the two records with their variable-length field length lists and NULL value lists filled in.)

3.5 What if a column's data occupies a very large number of bytes? Overflow columns in the Dynamic row format

If the data stored in a column occupies a very large number of bytes, the column may become what is called an overflow column.

For a column that takes up that much space, the record's real data area holds only 20 bytes for the column. These 20 bytes do not contain the data itself, because the data is spread across several other pages. Instead, the 20 bytes store the address of those pages and the number of bytes the data occupies. The scattered pages are linked together as a singly linked list.

Follow-up: if you are interested in InnoDB's other row formats, or in the record header that I skipped, you can read the book "How MySQL Works". This article differs from the book in two ways. The book describes the Compact format with the ascii character set, while I chose the Dynamic format and the utf8mb4 character set that are the defaults in everyday development. And because the character set changed, I recalculated all the figures in the text and diagrams. Most people never think about the row format, so understanding it through Dynamic is closest to real-world development.

If you found this useful, a like, a bookmark, and a share are all welcome~

If you have any questions, please leave a comment; let's discuss and learn together.

----------------------Talk is cheap, show me the code----- ------------------