Interview hot topics: talk about the past and present of the MySQL index "B+Tree",

Do you want to find out exactly which MySQL article you want to read? Here →  MySQL Jianghu Road | Column List

When we talk about MySQL in the interview, we always mention the B+Tree index. Do you know about the B+Tree index? What are its characteristics, where are its advantages, and how is it different from the B-tree?
The layman looks at the excitement, the insider looks at the doorway. For this kind of popular knowledge and questions, I hope that all students can know what they are doing.
Well, today we will review the past and present of MySQL index together. Let’s talk about the index.
Insert picture description here


table of Contents

1. What is an index?

In a relational database, an index is a single, physical storage structure that sorts the values ​​of one or more columns in a database table. It is a collection of one or more column values ​​in a table and a corresponding pointing table A list of logical pointers in the data page that physically identify these values. The role of the index is equivalent to the catalog of books, you can quickly find the content you need according to the page number in the catalog.

When there are a large number of records in the table, if you want to query the table, the first way to search for information is the full table search, which is to take out all the records one by one, compare them with the query conditions one by one, and then return the records that meet the conditions. Doing this will consume a lot of database system time and cause a lot of disk I/O operations; the second is to create an index in the table, and then find the index value that meets the query conditions in the index, and finally pass the ROWID stored in the index (equivalent to Page number) Quickly find the corresponding record in the table.

After MySQL5.5, the index data structure used by the InnoDB storage engine is mainly used:; B+TreeThis article takes you to talk about the past and present of B+Tree;

Mark:
B+Tree可以对<,<=,=,>,>=,BETWEEN,IN,以及不以通配符开始的LIKE使用索引。(MySQL5.5后)

These facts may subvert some of your perceptions, such as in other articles or books you have read. All of the above are "range queries" and are not indexed!

Yes, before 5.5, the optimizer would not choose to search through the index. The optimizer thinks that the rows retrieved in this way are more than the rows of the full table scan, because the table has to be checked once, which may involve the number of I/O rows. More, abandoned by the optimizer.

After the algorithm (B+Tree) is optimized, it supports scanning of some range types (Deli and the orderliness of the B+Tree data structure). This approach also violates the principle of the leftmost prefix, resulting in the condition after the range query cannot use the joint index, we will explain in detail later.

Second, the advantages and disadvantages of the index

1. Advantages

  1. Index greatly reduces the amount of data that the server needs to scan
  2. Indexes can help the server avoid sorting and temporary tables
  3. Index can turn random I/O into sequential I/O

2. Disadvantages

  1. Although the index greatly improves the query speed, it will also reduce the speed of updating the table, such as INSERT, UPDATE, and DELETE on the table. Because when updating the table, MySQL not only saves the data, but also saves the index file.
  2. Index files that take up disk space for indexing. Generally speaking, this problem is not serious, but if you create multiple composite indexes on a large table and insert a large amount of data, the size of the index file will rapidly expand.
  3. If a data column contains many duplicate content, indexing it will not have much practical effect.
  4. For very small tables, a simple full table scan is more efficient in most cases;

Therefore, only the most frequently queried and most frequently sorted data columns should be indexed. (The total number of indexes in the same data table in MySQL is limited to 16)

One of the meanings of the existence of a database is to solve data storage and quick search. So where does the data in the database exist? Yes, it's a disk. What are the advantages of a disk? Cheap! What are the disadvantages? Compared to memory access speed is slow.

So do you know the main data structure used by MySQL indexes?

B+ tree! You blurt it out.

那 B+树 是什么样的数据结构?MySQL索引又是为什么选择了B+树呢?

In fact, the final choice of B+ tree has experienced a long evolution:

Binary sort treeBinary balanced treeB-Tree (B tree)B+Tree (B+ tree)

A friend asked me "what is the difference between B-tree and B-tree"? Here to popularize, MySQL data structure only has B-Tree (B-tree) and B+Tree (B+Tree), which are mostly just different in reading. "B-Tree" is generally collectively referred to as B-tree. You can call it B-tree. ~~

The red-black tree mentioned by my friends is a storage structure in a programming language, not MySQL; for example, Java's HashMap uses a linked list plus a red-black tree.

Okay, let's take a look at the process of evolving into a B+ tree today.

Three, the past and present of the B+Tree index

1. Binary sort tree

Before understanding the B+ tree, let’s briefly talk about the binary sort tree. For a node, the value of the child node of its left subtree must be smaller than itself, and the value of the child node of its right subtree must be greater than itself, if all nodes All meet this condition, then it is a binary sort tree. (Here you can list the knowledge points for binary search)

Insert picture description here

The picture above is a binary sort tree. You can try to use its characteristics to experience the process of finding 9:

  • 9 is smaller than 10, go to its left subtree (node ​​3) to find
  • 9 is greater than 3, go to the right subtree of node 3 (node ​​4) to find
  • 9 is greater than 4, go to the right subtree of node 4 (node ​​9) to find
  • Node 9 is equal to 9, the search is successful

A total of 4 comparisons have been made. Have you ever thought about how to optimize the above structure?

2. AVL tree (self-balancing binary search tree)

Insert picture description here

The figure above is an AVL tree, the number and values ​​of nodes are exactly the same as the binary sort tree

Let's look at the process of finding 9 again:

  • 9 is greater than 4, go to its right subtree to find
  • 9 is smaller than 10, go to its left subtree to find
  • Node 9 is equal to 9, the search is successful

A total of 3 comparisons, the same amount of data is one less than the binary sort tree, why? Because the height of the AVL tree is smaller than that of the binary sort tree, the higher the height means the greater the number of comparisons; do not underestimate the optimization this time, if it is 200w pieces of data, the number of comparisons will be significantly different.

You can imagine a balanced binary tree with 1 million nodes and a tree height of 20. One query may need to access 20 data blocks. In the era of mechanical hard disks, random reading of a data block from the disk requires about 10 ms of addressing time. In other words, for a table with 1 million rows, if a binary tree is used to store it, it may take 20 10 ms to access a row alone. This query is really slow!

3. B-tree (Balanced Tree) multi-way balanced search tree multi-fork

B-tree is a multi-way self-balancing search tree, which is similar to ordinary binary tree, but B-book allows each node to have more child nodes. The schematic diagram of the B-tree is as follows:值得注意的是,B树的非叶子节点和叶子结点的data数据都是分开存储的,那么针对范围查询、排序等常用特性就很不友好了。

Insert picture description here

Features of B-tree:

  1. All key values ​​are distributed throughout the tree
  2. Any keyword appears and only appears in one node
  3. The search may end at a non-leaf node
  4. Do a search in the full set of keywords, and the performance is close to the binary search algorithm

In order to improve efficiency, it is necessary to minimize the number of disk I/Os. In the actual process, the disk is not read strictly on-demand every time, but read ahead every time.

After the disk has read the required data, it will read a part of the data into the memory in order. The theoretical basis for this is the principle of locality noted in computer science:

  • Because the disk sequential read efficiency is very high (no addressing time is needed, only a small rotation time), for a program with locality, pre-reading can improve I/O efficiency. The length of pre-reading is generally An integral multiple of page.
  • MySQL (using the InnoDB engine by default) manages records in a page manner, and the default size of each page is 16K (can be modified).

B-Tree uses the computer disk pre-reading mechanism:

Every time you create a new node, you apply for a page space, so you only need one I/O to find a node; because in practical applications, the depth of the node will be very small, so the search efficiency is very high.

So how is the final version of the B+ tree made?

4. B+ Tree (B+ Tree is a variant of B tree, and it is also a multi-way search tree)

Insert picture description here

It can also be seen from the figure that the difference between the B+ tree and the B tree is:

  1. All keywords are stored in leaf nodes , and non-leaf nodes do not store real data, so that leaf nodes can be quickly located.
  2. A chain pointer is added for all leaf nodes ,意味着所有的值都是按顺序存储的,并且每一个叶子页到根的距离相同,很适合查找范围数据。说明支持范围查询和天然排序。

Therefore, B+Tree can use indexes for <, <=, =, >, >=, BETWEEN, IN, and LIKE that does not start with a wildcard. And if the index is used, the consumption of the sorting function is greatly reduced.

Advantages of B+ tree:

The number of comparisons is balanced, the number of I/Os is reduced, the search speed is improved, and the search is more stable.

  • B+ tree disk read and write costs are lower
  • B+ tree query efficiency is more stable

What you need to know is that every time you create a table, the system will automatically create an ID-based clustered index (above B+ tree) for you to store all data; each time you add an index, the database will create an additional index for you (above B+ tree), the number of fields selected by the index is the number of data indexes stored in each node. Note that the index does not store all data.

4. Why did MySQL index choose B+ tree instead of B tree?

  1. B+ tree is more suitable for external storage (usually refers to disk storage). Since internal nodes (non-leaf nodes) do not store data, a node can store more internal nodes, and each node can index a larger and more accurate range. That is to say, the amount of information in a single disk I/O using the B+ tree is larger than that of the B tree, and the I/O efficiency is higher.
  2. MySQL is a relational database. It often accesses a certain index column according to the interval. The leaf nodes of the B+ tree establish chain pointers in order to strengthen the accessibility of the interval, so the B+ tree is very friendly to the interval range query on the index column. However, the key and data of each node of the B-tree are together, and the interval search cannot be performed.

Five, index-related knowledge points you should know

1. Back to table query

For example, you created the name, age index name_age_index, and used it when querying data

select * from table where name ='陈哈哈' and age = 26;

Because there are only name and age in the additional index, after hitting the index, the database must go back to the clustered index to find other data. This is the return table, which is the one you memorized: the reason for using select * less.

2. Index coverage

Combining back to the table will be better understood, such as the above name_age_index index, there is a query

select name, age from table where name ='陈哈哈' and age = 26;

At this time, the select field name and age can be obtained in the index name_age_index, so there is no need to return to the table, and the index coverage is satisfied, and the data in the index is returned directly, which is highly efficient. It is the preferred optimization method for DBA students when optimizing.

3. The principle of the leftmost prefix

The node storage index order of the B+ tree is stored from left to right. When matching, it is natural to match from left to right; usually when we build a joint index, that is, to index multiple fields, I believe that the index has been established. Students will find that both Oracle and MySQL will let us choose the order of the index. For example, if we want to build a joint index on the three fields of a, b, and c, we can choose the priority we want, a, b , C, or b, a, c or c, a, b, etc. Why does the database let us choose the order of the fields? Isn't it a joint index of three fields? Here is the principle of the leftmost prefix of the database index.

In our development, we often encounter the problem that a joint index is built for this field, but the index is not used when SQL queries the field. For example, the index abc_index: (a, b, c) is a joint index of the three fields of a, b, and c. When the following SQL is executed, the index abc_index cannot be hit;

select * from table where c = '1';

select * from table where b ='1' and c ='2';

The index will be taken in the following three situations:

select * from table where a = '1';

select * from table where a = '1' and b = '2';

select * from table where a = '1' and b = '2'  and c='3';

From the above two examples, do you have a good idea?

Yes, the index abc_index:(a,b,c) can only be used in three types of queries: (a), (a,b), and (a,b,c). In fact, there is a little ambiguity here. In fact, (a,c) will also go, but only the a field index is used, not the c field.

In addition, there is a special case, the following type will only have a and b index, and c will not.

select * from table where a = '1' and b > '2'  and c='3';

Like the above type of SQL statement, after a and b are indexed, c is already out of order, so c cannot be indexed, and the optimizer will think that it is not as fast as scanning the c field in the full table.

Leftmost prefix: As the name suggests, it is leftmost first. In the above example, we created a_b_c multi-column index, which is equivalent to creating (a) single-column index, (a,b) composite index and (a,b,c) composite index.

Therefore, when creating a multi-column index, according to business needs, the most frequently used column in the where clause is placed on the leftmost side.

4. Index push down optimization

Still index name_age_index, there are the following sql

select * from table where name like '陈%' and age > 26;

There are two execution possibilities for this statement:

  • Hit the name_age_index joint index, query all the data that meets the name starting with "Chen", and then return to the table to query all the rows that meet.
  • Hit the name_age_index joint index, query all the data that meets the name starting with "Chen", then filter out the indexes with age>20 by the way, and then return to the table to query the entire row of data.

Obviously, the number of rows in the second way back to the table query is less, and the number of I/Os will also be reduced. This is the index push down. So not all likes will not hit the index.

Okay, I won’t talk about it if there are too many. I advise you to have a tail juice, but I recommend that you pay attention to me, because I will let you learn a lot from happiness!


Summary of MySQL series articles and "MySQL Jianghu Road | Column Directory"

往期热门MySQL系列文章: