Strategies for using high-performance indexes

Do not perform any operation on an indexed column

Improper usage often means that an index exists but the query gets no benefit from it.

MySQL can only use an index normally when the indexed column stands alone in the condition, meaning it is neither part of an expression nor the argument of a function. Two counterexamples follow:

Calculating on an indexed column

mysql>SELECT * FROM order_exp WHERE order_status + 1 = 1;

It is easy to see that the expression in WHERE is equivalent to order_status = 0, but MySQL does not simplify the expression for us, so it cannot use the index on order_status. This is entirely down to how the query is written. We should get into the habit of simplifying WHERE conditions and always leaving the indexed column alone on one side of the comparison operator.
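The effect can be observed with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example runs anywhere. SQLite's planner is not MySQL's, and the index name idx_order_status and the reduced table layout are hypothetical, but the principle of keeping the indexed column bare is the same.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; only the general principle carries over.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_exp (id INTEGER PRIMARY KEY, order_status INTEGER, order_note TEXT)"
)
# Hypothetical index name, created only for this illustration.
conn.execute("CREATE INDEX idx_order_status ON order_exp (order_status)")


def plan(sql):
    """Return the planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# The indexed column is buried inside an expression.
wrapped = plan("SELECT * FROM order_exp WHERE order_status + 1 = 1")
# The indexed column stands alone on one side of the comparison.
bare = plan("SELECT * FROM order_exp WHERE order_status = 0")

print("expression on column:", wrapped)
print("bare column:         ", bare)
```

In this sketch the first plan is a scan of the whole table, while the second one names the index.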

Wrapping an indexed column in a function

Here is another common mistake:

mysql>SELECT ... WHERE TO_DAYS(insert_time)-TO_DAYS(expire_time) <= 10;

This query cannot use the index either.

Match all columns where possible (as many columns of the joint index as possible should appear in the conditions)

Once a joint index has been created, a query whose search conditions cover exactly the columns of the index is called a full-value match. For example:

select * from order_exp where insert_time='2021-03-22 18:34:55' and order_status=0 and expire_time='2021-03-22 18:35:14';

Our joint index covers insert_time, order_status, and expire_time, and every one of these columns appears in the query as an equality condition.

So if these conditions in the where clause are inconsistent with the index order in the joint index, will there be any problems? Such as:

select * from order_exp where order_status=1 and expire_time='2021-03-22 18:35:14' and insert_time='2021-03-22 18:34:55';

It makes no difference at all. The query optimizer analyzes the search conditions and decides which to apply first and which later according to the column order of the available index.

So once a joint index exists, write WHERE conditions that use as many of its columns as possible.

The leftmost prefix rule (a joint index is sorted by its left column first, then by the columns to the right)

What if a joint index exists but the search conditions do not match all of its columns?

Then the conditions must follow the leftmost prefix principle.

For example, we have established a joint index of the three columns in the order_exp table: insert_time, order_status, and expire_time, which is equivalent to establishing the following three indexes:

  1. insert_time
  2. insert_time,order_status
  3. insert_time,order_status,expire_time

The query conditions must line up with one of these three prefixes.

Now suppose the following SQL is run:

select * from order_exp where insert_time='2021-03-22 18:23:42' and expire_time='2021-03-22 18:35:14';

Will it use the index? Yes, but only the insert_time part. Because order_status is missing from the conditions, the expire_time condition cannot narrow the index range and is applied as a filter on the entries that match insert_time.
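A toy model may help to see why only the insert_time part is usable. The sketch below treats the joint index as a sorted list of tuples, which is a simplifying assumption rather than how InnoDB stores its B+ tree, and the sample values are made up.

```python
from bisect import bisect_left, bisect_right

# Toy model (assumption): a joint index is a list of tuples kept in sorted order.
# Columns: (insert_time, order_status, expire_time); the values are made up.
index = sorted([
    ("18:23:42", 0, "18:35:14"),
    ("18:23:42", 1, "18:30:00"),
    ("18:23:42", 1, "18:35:14"),
    ("18:34:55", 0, "18:35:14"),
    ("18:35:09", 0, "18:35:14"),
])


def seek_insert_time(value):
    """Binary-search the contiguous run of entries whose first column equals value."""
    keys = [entry[0] for entry in index]
    return index[bisect_left(keys, value):bisect_right(keys, value)]


# WHERE insert_time = '18:23:42' AND expire_time = '18:35:14'
candidates = seek_insert_time("18:23:42")                    # seek on the leftmost column
matches = [e for e in candidates if e[2] == "18:35:14"]      # expire_time checked entry by entry

print(len(candidates), "entries located by the seek,", len(matches), "left after the filter")
# prints: 3 entries located by the seek, 2 left after the filter
```

The expire_time values are not in order across the list, so a binary search on that column alone has nothing to work with.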

Put the range condition last (a range search on a left column breaks the ordering of the columns to its right)

Index records are sorted by the value of the index column in ascending order. In a joint index they are sorted by the first column, then by the second column among equal values of the first, and so on in the order in which the columns were declared.

For example:

select * from order_exp_cut where insert_time>'2021-03-22 18:23:42' and insert_time<'2021-03-22 18:35:00';

Since the data pages and records in the B+ tree are sorted by the insert_time column first, the query process above is actually like this:

  1. Locate the first record whose insert_time is greater than '2021-03-22 18:23:42'.
  2. Locate the last record whose insert_time is less than '2021-03-22 18:35:00'.
  3. Since all records are connected by a linked list, the records between the two can be read off easily. Take the primary key value of each of these records and go back to the clustered index to fetch the complete record.

Range query on the leftmost column of the joint index

If you perform range searches on several columns at once, the B+ tree index can only be used for the range on the leftmost column:

select * from order_exp_cut where insert_time>'2021-03-22 18:23:42' and insert_time<'2021-03-22 18:35:00' and order_status> -1;

The above query can be divided into two parts:

  1. Use the conditions insert_time>'2021-03-22 18:23:42' and insert_time<'2021-03-22 18:35:00' to run a range search on insert_time. The result may contain many records with different insert_time values.
  2. Filter those records further with the condition order_status> -1.

So for this query only the insert_time part of the joint index can be used, not the order_status part.

This is because order_status is only sorted among records that share the same insert_time value. A larger insert_time does not imply a larger order_status, so across different insert_time values the order_status column has no overall order.

In short, when range conditions are applied to several columns of a joint index, only the range on the leftmost column can use the index. If the left columns are matched exactly, however, a range search on the next column to the right can still use it.
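The ordering argument can be checked with a few lines of Python. As before, the joint index is modelled as a sorted list of tuples (an assumption for illustration, with made-up values), here reduced to the two columns insert_time and order_status.

```python
# Toy model (assumption): joint index entries (insert_time, order_status), kept sorted.
index = sorted([
    ("18:24:00", 2), ("18:24:00", 5),
    ("18:30:00", 0), ("18:30:00", 3),
    ("18:34:00", 1),
])


def is_sorted(values):
    """True when the values never decrease from one element to the next."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Range on the leftmost column: insert_time > '18:23:42' AND insert_time < '18:35:00'
in_range = [e for e in index if "18:23:42" < e[0] < "18:35:00"]
status_in_range = [e[1] for e in in_range]

# Equality on the leftmost column: insert_time = '18:30:00'
status_for_one_time = [e[1] for e in index if e[0] == "18:30:00"]

print(status_in_range, is_sorted(status_in_range))          # [2, 5, 0, 3, 1] False
print(status_for_one_time, is_sorted(status_for_one_time))  # [0, 3] True
```

Inside the range the order_status values jump around, so they can only be filtered. Under a single insert_time they are sorted, which is why a range on order_status remains usable after an equality on insert_time.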

Range query on the rightmost column of the joint index

select * from order_exp_cut where insert_time='2021-03-22 18:34:55' and order_status=0 and expire_time>'2021-03-22 18:23:57' and expire_time<'2021-03-22 18:35:00';

Range query on a middle column of the joint index

A range condition on a middle column prevents the columns after it from being used to narrow the search, so the joint index cannot be fully utilized:

select * from order_exp_cut where insert_time='2021-03-22 18:23:42' and order_status>-1 and expire_time='2021-03-22 18:35:14';

Use covering indexes where possible

A covering index is a very useful tool that can greatly improve performance. The most important star of the three-star index is the wide-index star. Consider the benefits when a query only has to scan the index and never return to the table:

Less memory than building multiple secondary indexes

Index entries are usually much smaller than full data rows, so a query that only reads the index accesses far less data. This matters a great deal for cached workloads, where much of the response time goes into copying data. Covering indexes also help I/O-intensive applications, because an index is smaller than the data and fits in memory more easily.

No return to the table from the secondary index (important)

If the columns we query are not all present in the joint index, MySQL has to return to the table: it looks the row up again in the B+ tree of the primary key index, and the disk I/O this costs is significant.

Therefore, use a covering index where you can (a query that touches only the index, because the index contains every column the query needs), and avoid select * unless you really need every column of the table, for example to cache the rows in the program.

Use not-equal with caution

With a not-equal condition (!= or <>), MySQL generally cannot use the index effectively and falls back to a full table scan

SELECT * FROM order_exp WHERE order_no <>'DD00_6S';

In this case the scan intervals are (the first record, 'DD00_6S') and ('DD00_6S', the last record). Together they cover almost the whole index, and with the return to the table added on top, it is cheaper to scan the whole table directly.

NULL / NOT NULL has an impact

Pay attention to the effect that NULL and NOT NULL can have on index usage.

order_no is an indexed column and does not allow NULL

explain SELECT * FROM order_exp WHERE order_no is null;
explain SELECT * FROM order_exp WHERE order_no is not null;

For order_no is null, MySQL reports Impossible WHERE straight away, while is not null goes straight to a full table scan.

order_no is an indexed column and allows NULL

explain SELECT * FROM order_exp_cut WHERE order_no is null;
explain SELECT * FROM order_exp_cut WHERE order_no is not null;

is null uses ref-type index access; is not null is still a full table scan.

In general, conditions on NULL can use an index. Whether a particular query does is decided by the optimizer.

Covering index, column does not allow NULL

explain SELECT order_status, expire_time FROM order_exp WHERE insert_time is null;
explain SELECT order_status, expire_time FROM order_exp WHERE insert_time is not null;

Covering index, column allows NULL

How does MySQL treat NULL?

NULL means different things depending on the point of view.

NULL represents an undetermined value

Some think that each NULL is an independent value, different from every other NULL.

NULL means the value does not exist in business terms

Others think that NULL simply means there is no value, so all NULLs together count as one value.

NULL is completely meaningless

Still others think that NULL carries no meaning at all and should not be counted.

Case study

Suppose column c1 of a table holds the values (2, 1000, null, null). Under the first view c1 has 4 distinct values, under the second view 3, and under the third view 2.
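The three counts in the case study can be reproduced with a short calculation. This is only arithmetic on the sample values, with Python's None standing in for NULL. The variable names are illustrative.

```python
c1 = [2, 1000, None, None]  # None stands in for NULL

non_null = {v for v in c1 if v is not None}
null_count = sum(1 for v in c1 if v is None)

# View 1: every NULL is its own independent value.
each_null_distinct = len(non_null) + null_count
# View 2: all NULLs together count as a single value.
all_nulls_one = len(non_null) + (1 if null_count else 0)
# View 3: NULLs are not counted at all.
nulls_not_counted = len(non_null)

print(each_null_distinct, all_nulls_one, nulls_not_counted)  # 4 3 2
```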

How innodb_stats_method treats NULL

MySQL provides the innodb_stats_method system variable for this purpose. It takes one of three values:

nulls_equal (all NULLs are equal)

All NULL values are considered equal. This is the default value of innodb_stats_method.

nulls_unequal (no two NULLs are equal)

All NULL values are considered different from one another.

nulls_ignored (NULLs are ignored)

NULL values are ignored altogether.

There are signs that in MySQL 5.7.22 and later, changing innodb_stats_method has no effect and that MySQL hard-codes the value nulls_equal. In other words, when collecting statistics on index columns MySQL treats all NULLs as the same (NULL means there is no value in business terms, and all NULLs together count as one). MySQL's own handling of NULL appears to be rather conflicted. So in general, avoid declaring columns that allow NULL.

Summary

Of course, none of this is fixed, and there is no need to memorize it by rote. It is enough to know that NULL brings a great deal of uncertainty to execution, so when designing a table, avoid nullable columns as far as possible. Even when a value is missing, a dedicated invalid-value marker can be stored in its place.

Put the percent sign of LIKE at the end (a leading wildcard defeats the index)

Leftmost matching principle

By the leftmost matching principle, a LIKE pattern that ends with a wildcard ('abc...%') can usually use an index.

When the LIKE pattern starts with a wildcard ('%abc...'), the index cannot be used and the query becomes a full table scan

explain SELECT * FROM order_exp WHERE order_no like '%_6S';
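The difference between a trailing and a leading wildcard can be sketched with a sorted list standing in for the order_no index. This is a toy model under that assumption, with made-up order numbers, and it uses a plain suffix check rather than full LIKE semantics.

```python
from bisect import bisect_left

# Toy model (assumption): a single-column index on order_no as a sorted list.
order_nos = sorted(["DD00_10S", "DD00_6S", "DD00_9S", "DD01_6S", "AA00_6S", "ZZ99_1S"])


def prefix_range(prefix):
    """LIKE 'prefix%': all matches sit in one contiguous slice found by binary search."""
    lo = bisect_left(order_nos, prefix)
    hi = bisect_left(order_nos, prefix + "\uffff")  # just past every string with this prefix
    return lo, hi


lo, hi = prefix_range("DD00")
print("prefix matches:", order_nos[lo:hi])
# prints: prefix matches: ['DD00_10S', 'DD00_6S', 'DD00_9S']

# A pattern that fixes only the end of the string: matches are scattered,
# so every entry has to be examined.
suffix_positions = [i for i, v in enumerate(order_nos) if v.endswith("6S")]
print("suffix match positions:", suffix_positions)
# prints: suffix match positions: [0, 2, 4]
```

Prefix matches form one block that two binary searches can delimit, whereas suffix matches are spread over the whole list.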

A covering index can mitigate this problem

SELECT order_status,expire_time FROM order_exp_cut WHERE insert_time like '%18:35:09';

Quote string values (otherwise MySQL performs an implicit type conversion)

A string column compared with an unquoted value cannot use its index

explain SELECT * FROM order_exp WHERE order_no = 6;

MySQL performs the type conversion automatically. In the statement above it tries to convert order_no to a number in order to compare it with 6, and since the conversion is applied to the indexed column, the index cannot be used.

Be careful with the OR keyword

OR conditions on the same index

explain SELECT * FROM order_exp WHERE order_no ='DD00_6S' OR order_no ='DD00_9S';

OR conditions on different indexes

explain SELECT * FROM order_exp WHERE expire_time = '2021-03-22 18:35:09' OR order_note ='abc';

Conclusion

OR conditions on the same index can be served by a range scan of that index.

When OR joins conditions on different indexes, the optimizer here does not use either condition for index access. Why? Because a single access path uses one index at a time: after scanning one index we would still have to scan for the other condition, which the first index cannot serve. A full table scan is therefore the cheaper choice. (Question: why can't we take the union of the two index scans?)

Use index scans for sorting and grouping

MySQL has two ways to produce ordered results: a sort operation, or a scan in index order. If the type column of EXPLAIN shows "index", MySQL is using an index scan for sorting. Otherwise it has to sort the rows itself in a dedicated sort area using its own sorting algorithm.

When a query on a secondary index uses select *

Scanning a secondary index is fast in itself. But if the index does not contain all the queried columns, every row requires a return to the table. So even with a suitably sorted secondary index, if the search range is large and the return-to-table I/O is high, the optimizer may still choose a full table scan.

A joint index can be designed so that it both satisfies the sort and locates the rows, and as a covering index it also cuts the return-to-table I/O.

The column order of the joint index must match the column order of the ORDER BY clause

MySQL can use an index to sort the results only when the column order of the index is exactly the same as that of the ORDER BY clause and all columns are sorted in the same direction (all ascending or all descending).

Using ORDER BY effectively in join queries

If the query joins several tables, the index can be used for sorting only when every column referenced by the ORDER BY clause belongs to the first (driving) table.

Do not mix ASC and DESC on a joint index

When a joint index is used for sorting, every sort column must use the same direction: either all columns are sorted ASC or all are sorted DESC. This follows directly from the way a joint index is structured.
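The structural reason can be seen in a small model. The sketch assumes a joint index on (insert_time, order_status) represented as a sorted list of tuples, with made-up values. It models a plain ascending index, which is the kind of index the text describes.

```python
# Toy model (assumption): joint index on (insert_time, order_status) as sorted tuples.
rows = [("18:30", 1), ("18:24", 2), ("18:30", 0), ("18:24", 5)]
index = sorted(rows)

# ORDER BY insert_time ASC, order_status ASC: read the index forwards.
asc_asc = sorted(rows, key=lambda r: (r[0], r[1]))
# ORDER BY insert_time DESC, order_status DESC: read the index backwards.
desc_desc = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)
# ORDER BY insert_time ASC, order_status DESC: matches neither direction.
asc_desc = sorted(rows, key=lambda r: (r[0], -r[1]))

print(asc_asc == index)                  # True
print(desc_desc == index[::-1])          # True
print(asc_desc in (index, index[::-1]))  # False
```

A single pass over the index, forwards or backwards, yields the first two orderings but never the mixed one, which therefore needs a separate sort.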

Sort columns that are not in the same index

When the columns used for sorting do not all belong to one index, the index cannot be used for sorting. In the following statement, order_no and insert_time are not in the same joint index.

explain SELECT * FROM order_exp order by order_no,insert_time;

Insert rows in primary key order where possible

It is best to avoid random clustered index keys (values that are non-sequential and spread over a very wide range), especially for I/O-intensive applications. From a performance point of view, for example, using a UUID as the clustered index key is very bad: inserts into the clustered index become completely random, which is the worst case, and the data loses any clustering characteristics.

The simplest approach is an AUTO_INCREMENT column. It ensures that rows are written in order and also improves the performance of joins on the primary key.

Disadvantages of UUID

  1. A UUID value is random, so the primary key of a new row is not necessarily larger than that of the previous row, and InnoDB cannot simply append every new row to the end of the index. This can cause page splits and frequent movement of data inside the B+ tree.
  2. With a UUID primary key, inserts take longer and the index also takes up more space, because the primary key column is longer (a UUID is generally a long random string).
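The first point can be illustrated by counting how often a new key lands at the end of an ordered structure. The sketch below is a rough model, not InnoDB: a sorted Python list stands in for the clustered index, and the UUIDs are built from a seeded random generator so the run is repeatable.

```python
import bisect
import random
import uuid


def count_appends(keys):
    """Insert keys into a sorted list and count how many landed at the very end."""
    ordered, appended = [], 0
    for key in keys:
        pos = bisect.bisect_right(ordered, key)
        if pos == len(ordered):
            appended += 1
        ordered.insert(pos, key)
    return appended


n = 1000
rng = random.Random(42)  # fixed seed so the run is repeatable
auto_increment_keys = list(range(1, n + 1))
uuid_keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n)]

print("auto-increment keys appended at the end:", count_appends(auto_increment_keys), "of", n)
print("random UUID keys appended at the end:   ", count_appends(uuid_keys), "of", n)
```

Every sequential key is appended at the end, while almost every random key has to be placed somewhere in the middle, which is the situation that leads to page splits in a real B+ tree.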

Does an auto-increment primary key guarantee sequential I/O?

Note that new data is written in units of pages, so the data within one page is necessarily in order on disk.

Because the primary key values are sequential, InnoDB stores each record right after the previous one. When the page reaches its maximum fill factor (InnoDB's default maximum fill factor is 15/16 of the page size, leaving some room for later modifications), the next record is written to a new page. Once data has been loaded in this order, the primary key pages are filled almost completely with sequential records, which is exactly the desired result.

But this expectation does not always hold across pages. The disk is shared by the whole machine, so consecutive pages are not guaranteed to be adjacent. For this reason InnoDB introduces the concept of an extent: an extent contains multiple pages, and the pages within one extent are contiguous on disk.

These optimizations cannot guarantee that MySQL primary key data is fully contiguous on disk, but that is the goal they work towards.

Optimize COUNT queries

What COUNT does

The first thing to note is that COUNT() is a special function that does two quite different jobs.

  1. Count the values of a column that are not NULL.
  2. Count the number of rows.

Generally speaking, COUNT() has to scan a large number of rows (that is, access a large amount of data) to return an exact result, so it is hard to optimize. About the only thing that can be done inside MySQL itself is a full scan of an index, such as the primary key index.
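The two jobs of COUNT() are standard SQL behaviour and can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption made for portability) and a reduced, hypothetical version of the order_exp table.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; COUNT semantics are the same standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_exp (id INTEGER PRIMARY KEY, order_note TEXT)")
conn.executemany(
    "INSERT INTO order_exp (order_note) VALUES (?)",
    [("a",), (None,), ("b",), (None,)],  # two of the four notes are NULL
)

rows_total = conn.execute("SELECT COUNT(*) FROM order_exp").fetchone()[0]
notes_not_null = conn.execute("SELECT COUNT(order_note) FROM order_exp").fetchone()[0]

print(rows_total, notes_not_null)  # 4 2
```

COUNT(*) counts rows, while COUNT(order_note) counts only the rows whose order_note is not NULL.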

Options for optimizing total counts

If that is not enough, the application architecture has to change: use an estimate instead of an exact value, add a summary table, or add an external cache such as Redis.

Optimize LIMIT paging

The ordinary way to use LIMIT

When a system needs paging, we usually implement it with LIMIT plus an offset, together with a suitable ORDER BY clause. A very common and troublesome case is a very large offset, for example:

select * from order_exp limit 10000,10;

Pure SQL optimization

One of the simplest ways to optimize this kind of paging query is:

SELECT * FROM (select id from order_exp limit 10000,10) b,order_exp a where a.id = b.id;

It first fetches the primary key values of the N rows needed for the page, then fetches those N rows by primary key. Because the first step reads only the primary key ids from the index, it is more efficient.

The execution plan shows that the subquery on order_exp runs first as a full index scan on the primary key, and its result is then joined to table a on id. This is more efficient than the full table scan of the traditional form.

The performance gap between the two forms is not obvious on small tables, but it becomes visible as the amount of data grows.
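That the rewritten form returns the same page as the plain one can be verified on a small table. The sketch uses Python's built-in sqlite3 module in place of MySQL (an assumption, so it says nothing about MySQL timings), a reduced hypothetical order_exp table, and an explicit ORDER BY id so that the page is well defined.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; this checks results, not performance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_exp (id INTEGER PRIMARY KEY, order_note TEXT)")
conn.executemany(
    "INSERT INTO order_exp (id, order_note) VALUES (?, ?)",
    [(i, "note-%d" % i) for i in range(1, 501)],
)

# Plain form: skip 300 full rows, then return 10.
plain = conn.execute(
    "SELECT * FROM order_exp ORDER BY id LIMIT 300, 10"
).fetchall()

# Rewritten form: page over the ids only, then join back for the full rows.
deferred = conn.execute(
    "SELECT a.* FROM (SELECT id FROM order_exp ORDER BY id LIMIT 300, 10) b "
    "JOIN order_exp a ON a.id = b.id ORDER BY a.id"
).fetchall()

print(plain == deferred, plain[0][0], plain[-1][0])  # True 301 310
```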

Optimize together with the business code

The form above helps to some extent, but performance problems remain. The best approach is to work with the business side and change the statement to the following:

select * from order_exp where id> 67 order by id limit 10;

With this form, the front end loads more data through a "More" button instead of jumping to arbitrary pages. Each query only needs the last id returned by the previous query to fetch the next batch, but this does require cooperation from the business side.
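The "More" button flow can be sketched as a small helper that remembers the last id. The function name next_page and the reduced table layout are hypothetical, and Python's built-in sqlite3 module again stands in for MySQL as an assumption.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table layout reduced for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_exp (id INTEGER PRIMARY KEY, order_note TEXT)")
conn.executemany(
    "INSERT INTO order_exp (id, order_note) VALUES (?, ?)",
    [(i, "note-%d" % i) for i in range(1, 501)],
)


def next_page(last_id, size=10):
    """Fetch the rows that follow last_id, seeking by primary key instead of using an offset."""
    return conn.execute(
        "SELECT * FROM order_exp WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, size),
    ).fetchall()


page = next_page(67)
following = next_page(page[-1][0])  # the caller passes back the last id it received

print([row[0] for row in page])       # ids 68 to 77
print([row[0] for row in following])  # ids 78 to 87
```

The cost of each call depends on the page size rather than on how far into the table the reader has scrolled.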