MySQL index data structures explained, with hands-on MySQL optimization


Interactive data structure visualizations:

https://www.cs.usfca.edu/~galles/visualization/Algorithms.html

One, an index is a [sorted] [data structure] that helps MySQL retrieve data efficiently

Two, binary tree data structure


Three, B tree index structure

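All leaf nodes have the same depth, and the pointers of leaf nodes are null.

Index elements are not duplicated.

Within a node, the index entries are arranged in ascending order from left to right.

In a B-tree, every node stores row data next to its key. If each row takes 1 KB and a node is 16 KB, the root node can hold only 16 entries.

With 16 entries per node, holding more than 20 million rows requires 16^N >= 20 million, so the tree has far more levels than a B+ tree, which needs only 3 levels for the same amount of data.

There are no pointers between the leaf nodes, so range access is somewhat slower than with the linked leaf nodes of a B+ tree.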

Four, B+ Tree data structure

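Non-leaf nodes store no row data, only (redundant copies of) index keys.

Leaf nodes contain all the index fields together with the data of each row.

Leaf nodes are linked by pointers, which improves access efficiency.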

The following query shows the size of each page (each node of the tree):

SHOW GLOBAL STATUS LIKE 'Innodb_page_size';

The result is 16 KB: by default, MySQL allocates 16 KB for each node.

The pointers between leaf nodes improve range access performance [a range search can quickly find all data within the range by following the leaf links].

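Assume a bigint is used as the index key, which occupies 8 bytes; the blank cell next to each key holds the 6-byte address of the next-level node.

A non-leaf node therefore holds 16 KB / (8 + 6) B = 1170 entries. With 16 rows of 1 KB in each leaf node, a three-level tree stores 1170 * 1170 * 16 = 21,902,400 rows, more than 20 million.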
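The arithmetic above can be reproduced with a short sketch. It assumes the figures quoted in the text (16 KB pages, 8-byte keys, 6-byte pointers, 1 KB rows); the function names are made up for illustration, and the B-tree estimate uses the same simplified 16^N model as the text.

```python
PAGE_SIZE = 16 * 1024   # page (node) size in bytes, 16 KB
KEY_SIZE = 8            # bigint key
POINTER_SIZE = 6        # address of the next-level node
ROW_SIZE = 1024         # assumed row size, 1 KB

def bplus_tree_rows(levels: int) -> int:
    """Rows a B+ tree with this many levels can hold under the assumptions above."""
    fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)   # entries per non-leaf node
    rows_per_leaf = PAGE_SIZE // ROW_SIZE             # rows per leaf node
    return fanout ** (levels - 1) * rows_per_leaf

def btree_levels_needed(rows: int) -> int:
    """Levels needed when every node holds only PAGE_SIZE // ROW_SIZE entries (16^N model)."""
    per_node = PAGE_SIZE // ROW_SIZE
    levels = 1
    while per_node ** levels < rows:
        levels += 1
    return levels

fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
print(fanout)                            # 1170
print(bplus_tree_rows(3))                # 21902400
print(btree_levels_needed(20_000_000))   # 7
```

Under these assumptions a B+ tree reaches 20 million rows in 3 levels, while the 16-entries-per-node model needs 7, which means more disk I/O per lookup.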

Loading a node from disk into memory costs one disk I/O. Loading nodes is the most time-consuming part of a search.

Once a node is in memory, finding data by address is so fast that its cost can be ignored.

When creating a table, an auto-incrementing primary key is generally used as the clustered index. If no primary key is defined, MySQL creates one by default. Changing a primary key value is expensive, so the auto-increment ID should not be updated.

The indexes you add yourself are all auxiliary (secondary) indexes. A secondary index lookup first finds the primary key value, and the row data is then found through the primary key index.

In InnoDB the primary key index is the clustered index. All other indexes are auxiliary indexes, such as composite indexes, prefix indexes, and unique indexes.

InnoDB uses a clustered index; MyISAM uses non-clustered indexes.

4.1 What is the difference between a [clustered index] and a [non-clustered index / auxiliary index]?

[Clustered index] The row data of the entire table is stored in the leaf nodes (each leaf entry holds a whole row).

[Non-clustered index] The leaf nodes store the indexed column values and the primary key. If the query needs columns other than the [primary key] and the [indexed columns], MySQL first finds the primary key value in the non-clustered index and then goes back to the table, using that primary key to find the remaining columns in the primary key index.

Auxiliary indexes do not affect how data is organized in the clustered index, so a table can have several of them.

The clustered index is a B+ tree built on the table's primary key, with the row data of the entire table stored in the leaf nodes; the leaf nodes of the clustered index are also called data pages. Because of this, the data of an index-organized table is itself part of the index. Each table can have only one clustered index but several non-clustered indexes.

InnoDB clusters data by the primary key. If no primary key is defined, InnoDB chooses a unique index on non-null columns instead. If there is no such index, InnoDB implicitly defines a primary key to use as the clustered index.

The advantages and disadvantages of clustered indexes

Advantages:

1. Data access is faster. The clustered index keeps the index and the data in the same B+ tree, so fetching data through it is faster than through a non-clustered index.

2. Sorted lookups and range lookups on the primary key are very fast.

Disadvantages:

1. Insert speed depends heavily on insertion order. Inserting in primary key order is fastest; otherwise page splits occur and seriously hurt performance. For InnoDB tables we therefore generally define an auto-increment ID column as the primary key.

2. Updating the primary key is very expensive, because the updated row has to move. For InnoDB tables we therefore generally treat the primary key as non-updatable.

3. Access through a secondary index requires two index lookups: the first finds the primary key value, and the second finds the row data by that primary key value.
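The two-lookup cost of a secondary index can be sketched with plain dictionaries standing in for the B+ trees. This is only a toy model: the rows, the `find_by_id` and `find_by_name` helpers, and the lookup counters are all made up for illustration.

```python
# Toy model: dicts stand in for B+ trees; all table data is made up.
clustered_index = {            # primary key -> full row (the leaf holds the row)
    1: {"id": 1, "name": "LiLei", "age": 22},
    2: {"id": 2, "name": "HanMeimei", "age": 23},
    3: {"id": 3, "name": "Lucy", "age": 22},
}
secondary_index_name = {       # indexed value -> primary key (the leaf holds the key)
    row["name"]: pk for pk, row in clustered_index.items()
}

def find_by_id(pk):
    """One lookup: the clustered index leaf already contains the row."""
    return clustered_index[pk], 1

def find_by_name(name):
    """Two lookups: secondary index -> primary key, then clustered index -> row."""
    pk = secondary_index_name[name]      # lookup 1: secondary index
    row = clustered_index[pk]            # lookup 2: back to the table
    return row, 2

row, lookups = find_by_name("Lucy")
print(row, lookups)   # {'id': 3, 'name': 'Lucy', 'age': 22} 2
```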

Five, MyISAM index files

InnoDB files:

frm [table and view structure definitions] ibd [data and indexes]

MyISAM files:

frm [table and view structure definitions] myd [data] myi [indexes]

A lookup first searches the myi file to find the address of the row, then goes to the myd file with that address to get the data.

Both the primary key index and the non-primary-key indexes of MyISAM are stored this way.

Six, InnoDB

InnoDB files: frm [table structure] ibd [data and indexes]

6.1 An InnoDB secondary index is a non-clustered index. A lookup finds the primary key through the secondary index and then [goes back to the table], fetching the whole row through the primary key index. [In the bottom layer of a secondary index, each entry contains the secondary index columns and the primary key.]

InnoDB needs a column whose values are all unique to organize this B+ tree. If no primary key is set, it looks for a column with unique values to index on; if no unique column is found, it creates a hidden column to organize the B+ tree.

An auto-incrementing integer is recommended as the primary key, because integers are cheap to compare. [Auto-increment] If the key is not auto-incrementing, inserting a value into the middle of the tree can split a node, and the tree may have to rebalance at the same time.

Seven, the leftmost prefix rule for indexes

Consider a joint index on (name, age, position).

Entries are compared by [name] first, then by [age], and finally by [position], and are arranged from smallest to largest.

If the query has no condition on name, the leftmost prefix is not satisfied, because the joint index is sorted in the B+ tree by the leftmost column first.

Without the leftmost column, the values of the second column are not in sorted order across the tree, so the index cannot be used.

If a condition on name is present, its position in the where clause does not matter: MySQL automatically moves the name condition to the leftmost position.
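Why the leftmost column matters can be seen by sorting tuples the way a joint index on (name, age, position) is sorted. The sketch below uses made-up entries and hypothetical helper names; a sorted Python list stands in for the B+ tree leaf level.

```python
from bisect import bisect_left, bisect_right

# Entries sorted the way a joint index on (name, age, position) would be.
index = sorted([
    ("Bob", 30, "dev"), ("Alice", 25, "dev"), ("Alice", 22, "manager"),
    ("Carol", 22, "dev"), ("Bob", 22, "manager"), ("Carol", 28, "manager"),
])

def seek_by_name(name):
    """Binary search works: all entries for one name are contiguous."""
    names = [entry[0] for entry in index]
    return index[bisect_left(names, name):bisect_right(names, name)]

def scan_by_age(age):
    """No name condition: ages are not globally sorted, so every entry is checked."""
    return [entry for entry in index if entry[1] == age]

ages = [entry[1] for entry in index]
print(seek_by_name("Bob"))   # [('Bob', 22, 'manager'), ('Bob', 30, 'dev')]
print(ages)                  # [22, 25, 22, 30, 22, 28], not sorted
print(scan_by_age(22))       # found only by checking all six entries
```

Within one name the ages are sorted, which is why a condition on name followed by age can still use the index, while a condition on age alone cannot.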

Joint index index_name_age_email, with columns from left to right: name, age, email

EXPLAIN select * from user WHERE age=21 AND name='张三' AND email='[email protected]';

Here MySQL optimizes the query internally and places the name condition first.

EXPLAIN select * from user WHERE name='张三' AND email='[email protected]';

As long as name [the leftmost column] is present, the index is used. email is the third column and the second column is missing, so only the name part of the index is used here.

Query on name only:

EXPLAIN select * from user WHERE name='张三';

Eight, Hash algorithm

A hash index quickly locates where data is stored by applying a hash function to the index key.
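A minimal sketch of the idea, assuming a fixed number of buckets and made-up row addresses; `put` and `get` are hypothetical helpers, not MySQL internals. Note that bucket order has no relation to key order, so this layout answers equality lookups but offers nothing for range conditions.

```python
# Toy hash index: bucket number -> list of (key, row address). Data is made up.
BUCKETS = 8
hash_index = {b: [] for b in range(BUCKETS)}

def put(key, address):
    """Hash the key to pick a bucket and store the row address there."""
    hash_index[hash(key) % BUCKETS].append((key, address))

def get(key):
    """Equality lookup: hash the key, then check only that one bucket."""
    return [addr for k, addr in hash_index[hash(key) % BUCKETS] if k == key]

for i, name in enumerate(["Alice", "Bob", "Carol"]):
    put(name, 0x1000 + i)

print(get("Bob"))    # [4097]
print(get("Dave"))   # []
```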

Nine, EXPLAIN in detail

The output of EXPLAIN is an estimate.

MySQL version: 5.7

Use the EXPLAIN keyword to simulate how the optimizer executes an SQL statement and to analyze the performance bottlenecks of a query or table structure.
id: The larger the id, the earlier that select is executed.
select_type: The query type.
  primary: the outermost select of a complex query.
  subquery: a subquery contained in the select list (between select and from, not in the from clause).
  derived: a subquery contained in the from clause (after from). MySQL puts its result in a temporary table, called a derived table.
table: Which table is being queried.
partitions: The partitions matched, for a partitioned table.
type: The join type or access type. From best to worst: system > const > eq_ref > ref > range > index > ALL. A query should generally reach at least the range level, preferably ref.
  NULL: MySQL can resolve the query during the optimization stage, so no table or index access is needed in the execution stage.
  const, system: all columns of a primary key or unique key are compared with constants, so the table has at most one matching row; it is read once, which is fast.
  eq_ref: a table join on the primary key.
  ref: uses an index and may return multiple rows.
  range: a range query on an index.
  index: scans the full index to get the result, usually a secondary index. If the result set can be obtained from [both] the primary key index and a secondary index, MySQL [prefers the secondary index], because [the secondary index is smaller than the primary key index, so scanning it is faster].
  ALL: full table scan.
possible_keys: The indexes that might be used.
key: The index actually used during execution.
key_len: The number of bytes of the index that are used. For a joint index on two integer columns (a, b), key_len = 4 means only column a is used, and key_len = 8 means both a and b are used. [String column, varchar(n) in utf8: 3 * n + 2]
  If the column allows null, 1 extra byte is needed to record whether the value is null.
  Byte sizes of common types, for reference:
  tinyint (byte): 8 bits, 1 byte
  smallint (short): 16 bits, 2 bytes
  int: 32 bits, 4 bytes
  bigint (long): 64 bits, 8 bytes
  float: 32 bits, 4 bytes
  double: 64 bits, 8 bytes
  char: depends on the character set, 3 bytes per character in utf8
  boolean: 8 bits, 1 byte
ref: The columns or constants that are compared with the index named in the key column.
rows: An estimate of how many rows may be scanned.
filtered: An estimate of the percentage of rows that remain after the table condition is applied.
extra:
  Using index: the result set can be read from the index tree alone, with no need to go back to the table; that is, the query is served by a [covering index].
  Using filesort: an external sort is used instead of index order. Small data sets are sorted in memory; otherwise the sort is done on disk [file sort, no index used].
  Select tables optimized away: no execution plan is needed; the result is already available in memory and is simply read out.
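The key_len rules above can be turned into a small calculator, which helps when checking how many columns of a joint index a query really uses. This is a sketch under the stated assumptions only: utf8 strings at 3 bytes per character, a 2-byte length prefix for varchar, and 1 byte for a nullable column. The `key_len` function and its tuple format are made up for illustration.

```python
# Bytes per fixed-size type; string columns are assumed to be utf8
# (3 bytes per character).
FIXED = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

def key_len(columns):
    """columns: list of (type, length, nullable) for the index columns actually used."""
    total = 0
    for col_type, length, nullable in columns:
        if col_type in FIXED:
            size = FIXED[col_type]
        elif col_type == "char":
            size = 3 * length
        elif col_type == "varchar":
            size = 3 * length + 2      # 2 extra bytes store the string length
        else:
            raise ValueError(f"unsupported type: {col_type}")
        total += size + (1 if nullable else 0)   # 1 byte for the null flag
    return total

print(key_len([("int", None, False)]))                         # 4
print(key_len([("int", None, False), ("int", None, False)]))   # 8
print(key_len([("varchar", 24, False)]))                       # 74
print(key_len([("varchar", 20, True)]))                        # 63
```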

Do not apply any operation (calculation or function) to an indexed column; otherwise the index will not be used.

For example, if a function truncates a string column, the truncated value no longer matches the index [the value stored in the index is not truncated].

Prefer covering indexes (queries that access only the index) and reduce the use of select *.

With the not-equal operators (!= or <>), MySQL may be unable to use the index.

With < or >, the index may or may not be used.

With is null and is not null, the index generally cannot be used.

A like pattern that starts with a wildcard ('%test') cannot use the index and becomes a full table scan [a leading % means other characters may come before the search text, so the sorted prefix of the index is skipped and the entries are effectively unordered for this search].

If you must query with '%test%', optimize with a [covering index] on a joint index.

Covering index: every column the query returns is in the [index, or index + primary key].

Comparing a string column with an unquoted value disables the index, for example name = 100 instead of name = '100', because MySQL may perform a type conversion internally.

Use or and in sparingly. MySQL does not necessarily use an index for them; the optimizer decides as a whole, based on factors such as the proportion of rows retrieved and the table size.

A range query may or may not use the index, depending on the optimizer's internal evaluation: when the range covers a large amount of data the index may not be used, and when the range is small it will be.

Ten, common index optimization principles, analyzed from the underlying B+ tree

1. Forcing an index with force index(idx_name_age_position)

EXPLAIN SELECT * FROM employees force index(idx_name_age_position) WHERE name>'LiLei' AND age=22 AND position='manager';

Forcing an index is not necessarily faster:

-- disable the query cache
set global query_cache_size=0;
set global query_cache_type=0;
-- execution time: 0.333s
SELECT * FROM employees WHERE name>'LiLei';
-- execution time: 0.444s
SELECT * FROM employees force index(idx_name_age_position) WHERE name>'LiLei';

2. Covering index optimization [the columns returned by a covering query are all in the index, or in the index plus the primary key]

EXPLAIN SELECT name,age,position FROM employees WHERE name>'LiLei' AND age=22 AND position='manager';

What is index pushdown?

For the auxiliary joint index (name, age, position), the leftmost prefix rule normally applies:

SELECT * FROM employees WHERE name like 'LiLei%' AND age=22 AND position='manager';

Under that rule this query uses only the name column of the index. After filtering by the name prefix, the age and position values in the matching index entries are not in sorted order, so they cannot be used to narrow the index range.

In versions before MySQL 5.6, this query can only match the entries in the joint index whose name starts with 'LiLei'. It then takes the primary key of each matching entry back to the table one by one, finds the corresponding row in the primary key index, and checks whether age and position match.

MySQL 5.6 introduced the index pushdown optimization. While traversing the index, MySQL first evaluates the conditions on all columns contained in the index, filters out the entries that do not qualify, and only then goes back to the table, which effectively reduces the number of table lookups. With index pushdown, the query above matches the entries whose name starts with 'LiLei' and, at the same time, filters on age and position inside the index. Only the primary key ids of the remaining entries are taken back to the table to read the whole row [the optimizer does not necessarily use index pushdown].

For InnoDB tables, index pushdown applies only to secondary indexes. The leaf nodes of InnoDB's primary key (clustered) index already hold the entire row, so pushdown there would not reduce the work of reading whole rows.
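The saving from index pushdown can be illustrated by counting table lookups with and without it. The sketch below is a toy model with made-up rows: a sorted list stands in for the secondary index, a dict for the clustered index, and the `query` function is a hypothetical stand-in for the storage engine. It scans the whole list for simplicity, where a real engine would seek to the name prefix.

```python
# Secondary index entries: (name, age, position, primary key). Data is made up.
index = sorted([
    ("LiLei", 22, "manager", 1), ("LiLei1", 23, "dev", 2),
    ("LiLei2", 22, "dev", 3), ("LiLei3", 22, "manager", 4),
    ("Lucy", 22, "manager", 5),
])
table = {pk: {"id": pk, "name": n, "age": a, "position": p}
         for n, a, p, pk in index}

def query(use_pushdown):
    """name like 'LiLei%' and age = 22 and position = 'manager'."""
    rows, table_lookups = [], 0
    for name, age, position, pk in index:
        if not name.startswith("LiLei"):
            continue                      # outside the name prefix range
        if use_pushdown and not (age == 22 and position == "manager"):
            continue                      # filtered inside the index, no table lookup
        row = table[pk]                   # back to the table
        table_lookups += 1
        if row["age"] == 22 and row["position"] == "manager":
            rows.append(row)
    return rows, table_lookups

print(query(use_pushdown=False)[1])   # 4 table lookups
print(query(use_pushdown=True)[1])    # 2 table lookups
```

Both calls return the same two rows; only the number of trips back to the table differs.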

To find out how MySQL finally chooses an index, use the trace tool. Turning on trace affects MySQL performance, so enable it only temporarily while analyzing a statement and turn it off immediately afterwards.

mysql> set session optimizer_trace="enabled=on",end_markers_in_json=on;  -- enable trace
mysql> select * from employees where name > 'a' order by position;
mysql> SELECT * FROM information_schema.OPTIMIZER_TRACE;

Conclusion from the trace column: the cost of a full table scan is lower than that of an index scan, so MySQL chooses the full table scan.

mysql> select * from employees where name > 'zzz' order by position;
mysql> SELECT * FROM information_schema.OPTIMIZER_TRACE;

Here the trace column shows that the index scan costs less than a full table scan, so MySQL chooses the index scan.

mysql> set session optimizer_trace="enabled=off";   -- disable trace

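With in() in the where clause, order by will not use the index for sorting.

name > 'a' order by name did not use the index, probably because the range covers so much data that MySQL judged the index not worth using.

Optimization summary:

1. MySQL supports two ways of sorting, filesort and index. Using index means MySQL completes the sort by scanning the index itself. index is efficient; filesort is inefficient.

2. order by uses Using index in two cases:

  1) The order by clause uses the leftmost columns of the index.

  2) The columns of the where clause and the order by clause together satisfy the leftmost prefix of the index.

3. Sort on indexed columns wherever possible, following the leftmost prefix rule for the order in which the index was created.

4. If the order by columns are not in the index, Using filesort results.

5. Use a covering index whenever you can.

6. group by is very similar to order by: in essence it sorts first and then groups, following the leftmost prefix rule of the index creation order. For group by, if sorting is not needed, add order by null to suppress the sort. Note that where takes precedence over having: a condition that can be written in where should not be left to having.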

Interview topic: how Using filesort works

filesort sorting modes:

Single-way sort: all the fields of the qualifying rows are read at once and then sorted in the sort buffer. In the trace tool, sort_mode shows <sort_key,additional_fields> or <sort_key,packed_additional_fields>.
Two-way sort (also called back-to-table sort mode): only the sort field and the row ID that locates the row directly are read for the qualifying rows and sorted in the sort buffer. After sorting, the other required fields are fetched in a second pass. In the trace tool, sort_mode shows <sort_key,rowid>.

MySQL decides which mode to use by comparing the system variable max_length_for_sort_data (1024 bytes by default) with the total size of the fields to be queried.

If the total length of the fields is less than max_length_for_sort_data, the single-way sort mode is used;
if the total length of the fields is greater than max_length_for_sort_data, the two-way sort mode is used.
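The decision rule can be written as a few lines of code. This is only a sketch of the rule as stated in the text: the function name and the field lengths are made up, and the text does not cover the case where the two sizes are exactly equal, which is treated as two-way here purely as an assumption.

```python
def choose_sort_mode(field_lengths, max_length_for_sort_data=1024):
    """Pick the filesort mode from the total byte length of the queried fields."""
    total = sum(field_lengths)
    if total < max_length_for_sort_data:
        # single-way: whole rows go into the sort buffer
        return "<sort_key, additional_fields>"
    # two-way: only sort keys and row IDs are sorted, rows are fetched afterwards
    # (assumption: the exactly-equal case also lands here)
    return "<sort_key, rowid>"

print(choose_sort_mode([4, 72, 4, 60]))   # <sort_key, additional_fields>
print(choose_sort_mode([4, 2000]))        # <sort_key, rowid>
```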

To check which mode a statement uses, look at the sorting part of its trace result:

mysql> set session optimizer_trace="enabled=on",end_markers_in_json=on;  -- enable trace
mysql> select * from employees where name = 'zhuge' order by position;
mysql> select * from information_schema.OPTIMIZER_TRACE;

Eleven, MySQL index best practices

Index design principles

1. Code first, indexes afterwards

2. Make joint indexes cover the query conditions as far as possible

For example, design one, two, or three joint indexes (and as few single-column indexes as possible). Each joint index should include the columns used in the where, order by, and group by clauses of the SQL statements as far as possible, and the column order of the joint index should satisfy the leftmost prefix rule of those queries as far as possible.

3. Do not build indexes on low-cardinality columns

Index cardinality is the number of distinct values a column has in the table. For example, if a table has 1 million rows and a gender column whose value is either male or female, the cardinality of that column is 2. When building indexes, prefer columns with higher cardinality, that is, columns with more distinct values, so that the fast binary search of the B+ tree can be put to use.

4. Use prefix indexes for long strings

A large column such as varchar(255) can take up a lot of disk space in an index. It can be optimized slightly by indexing only the first 20 characters of the column, that is, putting the first 20 characters of each value into the index tree, similar to KEY index(name(20),age,position) [prefix index]. When a where condition searches on the name column, MySQL first searches the index tree by the first 20 characters of name, locates the entries whose 20-character prefix matches, and then goes back to the clustered index to fetch the complete name value for comparison.

However, if you order by name, the index tree holds only the first 20 characters of each name, so the sort cannot use the index. The same is true for group by. Keep this limitation of prefix indexes in mind.

5. When where and order by conflict over an index, give priority to where

In this case the where condition generally uses the index to quickly filter out the specified part of the data, which is then sorted. In most cases, filtering by where on the index narrows the data down to the small part you want as fast as possible, so the cost of the subsequent sort may be much smaller.

6. Optimize based on slow SQL queries

Use the slow SQL recorded by your monitoring to do targeted index optimization for those queries.
If you are not familiar with slow SQL queries, see this article: https://blog.csdn.net/qq_40884473/article/details/89455740

In a joint index: when the amount of data is small, in generally does not use the index; when the amount of data is large, in uses the index.

The core idea is to use one or two complex multi-column joint indexes to handle more than 80% of the queries, and one or two auxiliary indexes to handle the remaining atypical queries, so that queries against large tables make full use of indexes as far as possible. That is how query speed and performance are kept up.

[Try to complete each piece of business logic with a single table] [Multi-table queries make index optimization harder]

[Joint indexes] Building too many joint indexes hurts insert and update performance.

Twelve, MySQL optimization

1. Paging query optimization

Business systems often implement paging with SQL like the following:

mysql> select * from employees limit 10000,10;

This fetches 10 rows from the employees table starting at row 10001. It looks as if only 10 rows are read, but the statement actually reads 10,010 rows, discards the first 10,000, and returns the last 10. Paging deep into a large table is therefore very slow.

Prerequisite for the rewrite below: the paging is sorted by an auto-incrementing, continuous primary key with no deletions [once a row in the middle is deleted, the two queries no longer return the same rows].

mysql> select * from employees where id > 10000 limit 10;
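The difference between the two paging styles, and the reason for the no-deletion prerequisite, can be sketched with a sorted list of ids standing in for the primary key index. The function names and the row counts are made up for illustration.

```python
from bisect import bisect_right

def page_by_offset(ids, offset, count):
    """limit offset,count: reads offset + count rows, then discards the first offset."""
    rows_read = min(len(ids), offset + count)
    return ids[offset:offset + count], rows_read

def page_by_id(ids, last_id, count):
    """where id > last_id limit count: seeks past last_id, then reads count rows."""
    start = bisect_right(ids, last_id)          # index seek on the sorted primary key
    page = ids[start:start + count]
    return page, len(page)

ids = list(range(1, 20001))                     # continuous auto-increment ids
print(page_by_offset(ids, 10000, 10)[1])        # 10010 rows read
print(page_by_id(ids, 10000, 10)[1])            # 10 rows read

ids_with_gap = [i for i in ids if i != 5000]    # one row in the middle deleted
print(page_by_offset(ids_with_gap, 10000, 10)[0][0])   # 10002
print(page_by_id(ids_with_gap, 10000, 10)[0][0])       # 10001
```

With continuous ids both functions return ids 10001 to 10010. After one deletion the pages start at different ids, so the rewrite is safe only under the stated prerequisite.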

2. Paging query sorted by non-primary key fields

A paging query sorted by non-primary key fields, the SQL is as follows:

Premise: name is the leftmost column of the joint index.

# The index is not used because the amount of data is large: limit first reads 90005 rows, and each one requires going back to the table
select * from employees ORDER BY name limit 90000,5;

SQL optimization: covering index + join

The subquery sorts and pages on the covering index and returns only the ids. The join then fetches the 5 matching rows through the primary key index [only about 5 rows are involved, so the cost of this last step can be ignored].

mysql> select * from employees e inner join (select id from employees order by name limit 90000,5) ed
on e.id = ed.id;
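That the rewritten query returns the same page can be checked with a small in-memory database. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, so it says nothing about MySQL's execution plan or timing; the table contents are made up, the offset is scaled down to 900, and an outer order by is added so the two result lists are directly comparable.

```python
import sqlite3

# SQLite stands in for MySQL only to compare the result sets of the two queries.
conn = sqlite3.connect(":memory:")
conn.execute("create table employees (id integer primary key, name text, age int)")
conn.execute("create index idx_name on employees (name)")
conn.executemany(
    "insert into employees (name, age) values (?, ?)",
    [(f"emp{i:05d}", 20 + i % 30) for i in range(1000)],   # made-up rows, unique names
)

plain = conn.execute(
    "select * from employees order by name limit 900,5"
).fetchall()

deferred = conn.execute(
    "select e.* from employees e "
    "inner join (select id from employees order by name limit 900,5) ed "
    "on e.id = ed.id order by e.name"
).fetchall()

print(plain == deferred)               # True
print([row[1] for row in deferred])    # emp00900 .. emp00904
```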

MySQL has two common algorithms for joining tables:
Nested-Loop Join (NLJ) algorithm
Block Nested-Loop Join (BNL) algorithm

1. Nested-Loop Join (NLJ) algorithm

Rows are read from the first table (the driving table) one at a time. The join column is taken from each row and used to fetch the matching rows from the other table (the driven table), and the combined rows of the two tables form the result.

Table t1 has 10000 rows and table t2 has 100 rows.

mysql> EXPLAIN select * from t1 inner join t2 on t1.a= t2.a;

The execution plan shows the following:

The driving table is t2 and the driven table is t1. The driving table is executed first (rows with the same id in the execution plan are executed from top to bottom). The optimizer generally prefers the smaller table as the driving table, so with inner join the table written first is not necessarily the driving table.

When using left join, the left table is the driving table, and the right table is the driven table.

When using right join, the right table is the driving table, and the left table is the driven table.

When using join, MySQL chooses the table with less data as the driving table and the larger table as the driven table.

The NLJ algorithm is used here. In general, if Using join buffer does not appear in the Extra column of the execution plan for a join, the join algorithm is NLJ.

The general flow of the above SQL is as follows:

1) Read one row from table t2 (if there is a filter condition on t2, the row is taken from the filtered result).

2) Take the join column a from that row and look it up in table t1.

3) Take the matching rows from t1, merge them with the row from t2, and return them to the client as part of the result.

4) Repeat the three steps above.

The whole process reads all of table t2 (100 rows scanned). For each row, the value of a is used to look up the matching row in t1 through the index: 100 index lookups in t1, each of which can be considered to scan only one complete row of t1, so 100 rows of t1 in total. The whole process therefore scans 200 rows.

If the join column of the driven table has no index, NLJ performs poorly (explained below), and MySQL chooses the Block Nested-Loop Join algorithm instead.
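The NLJ flow for the indexed join can be sketched as follows, using the row counts from the text. The data is made up, a dict stands in for the index on t1.a, and `nested_loop_join` is a hypothetical helper that also counts scanned rows.

```python
from collections import defaultdict

# Made-up data matching the sizes in the text: t1 has 10000 rows, t2 has 100.
t1 = [{"id": i, "a": i} for i in range(1, 10001)]
t2 = [{"id": i, "a": i} for i in range(1, 101)]

# Index on t1.a: value -> rows. A dict stands in for the B+ tree.
t1_index_a = defaultdict(list)
for row in t1:
    t1_index_a[row["a"]].append(row)

def nested_loop_join(driving, driven_index):
    results, rows_scanned = [], 0
    for outer in driving:                             # read one row of the driving table
        rows_scanned += 1
        matches = driven_index.get(outer["a"], [])    # index lookup in the driven table
        rows_scanned += len(matches)
        for inner in matches:                         # merge the matching rows
            results.append((inner, outer))
    return results, rows_scanned                      # loop ends with the driving table

results, rows_scanned = nested_loop_join(t2, t1_index_a)
print(len(results), rows_scanned)   # 100 200
```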

2. Block Nested-Loop Join (BNL) algorithm

The optimizer generally prefers the smaller table as the driving table.

The rows of the driving table are read into join_buffer. The driven table is then scanned, and each of its rows is compared with the rows in join_buffer.

Using join buffer (Block Nested Loop) in the Extra column indicates that the join uses the BNL algorithm.

For a join of t1 and t2 on a column without an index, the general flow is as follows:

1) Put all rows of t2 into join_buffer.

2) Take each row of table t1 and compare it with the rows in join_buffer.

3) Return the rows that satisfy the join condition.

The whole process does a full table scan of both t1 and t2, so the total number of rows scanned is 10000 (all of t1) + 100 (all of t2) = 10100. The rows in join_buffer are unordered, so each row of t1 needs 100 comparisons, which makes 100 * 10000 = 1 million comparisons in memory.

In this example table t2 has only 100 rows. What if t2 is a large table that does not fit in join_buffer?
The size of join_buffer is set by the parameter join_buffer_size, 256 KB by default. If the rows of t2 do not all fit, the strategy is simple: load them in segments.

For example, if t2 has 1000 rows and join_buffer holds only 800 rows at a time, MySQL first puts 800 rows into join_buffer, reads t1 and compares each row with join_buffer to get a partial result, and then clears join_buffer. It then loads the remaining 200 rows of t2 and compares the rows of t1 with join_buffer again. Table t1 is therefore scanned one extra time.
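The segmented procedure and its costs can be sketched in code. `block_nested_loop_join` simulates the flow on small made-up tables so that it runs quickly, and `bnl_cost` applies the same counting to the row counts quoted in the text. Both names, and the assumption that buffer capacity is measured in rows, are for illustration only.

```python
import math

def block_nested_loop_join(driving, driven, key, buffer_rows):
    """Join on an unindexed column by buffering the driving table in segments."""
    results, rows_scanned, comparisons = [], 0, 0
    for start in range(0, len(driving), buffer_rows):
        join_buffer = driving[start:start + buffer_rows]   # fill join_buffer
        rows_scanned += len(join_buffer)
        for inner in driven:                               # full scan of the driven table
            rows_scanned += 1
            for outer in join_buffer:                      # unordered buffer: compare with every row
                comparisons += 1
                if inner[key] == outer[key]:
                    results.append((inner, outer))
    return results, rows_scanned, comparisons

def bnl_cost(driving_rows, driven_rows, buffer_rows):
    """Closed-form (rows scanned, comparisons) for the same procedure."""
    passes = math.ceil(driving_rows / buffer_rows)
    return driving_rows + passes * driven_rows, driving_rows * driven_rows

# Small made-up tables keep the simulation fast.
t2 = [{"b": i} for i in range(10)]         # driving table
t1 = [{"b": i % 20} for i in range(100)]   # driven table
results, scanned, compared = block_nested_loop_join(t2, t1, "b", buffer_rows=8)

print(scanned, compared)            # 210 1000
print(bnl_cost(100, 10000, 100))    # (10100, 1000000)
print(bnl_cost(1000, 10000, 800))   # (21000, 10000000)
```

The last line corresponds to the 1000-row example: two segments mean t1 is scanned twice, for 21000 rows in total.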

When the join column of the driven table has no index, why choose BNL rather than Nested-Loop Join?

NLJ works by disk scans, while BNL compares rows in memory.
If the unindexed join above used Nested-Loop Join, every one of the 100 driving rows would need a full scan of the driven table, so 100 * 10000 = 1 million rows would be scanned, all from disk.

BNL clearly needs far fewer disk scans, and its in-memory comparisons are much faster than disk scans. MySQL therefore generally uses the BNL algorithm for joins where the join column of the driven table has no index. If there is an index, NLJ is generally chosen, because with an index NLJ performs better than BNL.

Optimizing join SQL

Add an index on the join column of the [driven table], so that MySQL can choose the NLJ algorithm for the join.
Let the small table drive the large table. If you know which table is the small one when writing a multi-table join, you can use straight_join to fix the join order, saving the MySQL optimizer the time of deciding for itself.

In the execution plan, rows with the same id are executed from top to bottom; when the ids differ, the row with the larger id is executed first.

straight_join explained: straight_join works like join, but it makes the left table drive the right table, which changes the join order the optimizer would otherwise choose.

For example: select * from t2 straight_join t1 on t2.a = t1.a; makes MySQL use t2 as the driving table.

straight_join applies only to inner join, not to left join or right join (because those already specify the execution order of the tables). Let the optimizer decide wherever possible, because in most cases the MySQL optimizer is smarter than people. Use straight_join with caution: a manually specified execution order is not necessarily better than the one the optimizer picks.

in and exists optimization

Principle: the small table drives the large table, that is, the small data set drives the large data set.

in: when the data set of table B is smaller than that of table A, in is better than exists.

select * from A where id in (select id from B)

# Equivalent to: query table B first, then table A
 for(select id from B){
   select * from A where A.id = B.id
 }

exists: when the data set of table A is smaller than that of table B, exists is better than in. Each row of the main query A is passed to subquery B for a condition check, and the result of the check (true or false) decides whether that row of the main query is kept.

select * from A where exists (select 1 from B where B.id = A.id)
 # Equivalent to:
 for(select * from A){
   select * from B where B.id = A.id
 }

 # The id columns of tables A and B should be indexed

1. EXISTS (subquery) only returns TRUE or FALSE, so the SELECT * in the subquery can also be replaced with SELECT 1. According to the official documentation, the SELECT list is ignored during actual execution, so there is no difference.
2. The actual execution of an EXISTS subquery may be optimized, rather than the row-by-row comparison we imagine.
3. An EXISTS subquery can often be replaced by a JOIN. Which is best needs case-by-case analysis.

count(*) query optimization

‐‐ 临时关闭mysql查询缓存,为了查看sql多次执行的真实时间
mysql> set global query_cache_size=0;
mysql> set global query_cache_type=0;

# 以下4条sql只有根据某个字段count不会统计字段为null值的数据行
mysql> EXPLAIN select count(1) from employees;
mysql> EXPLAIN select count(id) from employees;
mysql> EXPLAIN select count(name) from employees;
mysql> EXPLAIN select count(*) from employees;

The execution plan of the four SQLs is the same, indicating that the execution efficiency of these four SQLs should be similar

**
Field has an index: count(** )≈count(1)>count(field)>count(primary key id) //Field has an index, count (field) counts the secondary index, and the secondary index stores more data than the primary key There are few indexes, so count(field)>count(primary key id)

Field without index: count(* )≈count(1)>count(primary key id)>count(field) //field has no index count(field) statistics cannot go to index, count(primary key id) can also go to primary key index, so count(primary key id)>count(field)

The execution process of count(1) is similar to that of count(field), but count(1) does not need to read the field value and simply counts the constant 1, whereas count(field) must read the field. In theory, count(1) is therefore faster than count(field).

count(*) is the exception. MySQL does not read all the fields; it is specially optimized to accumulate by row without taking any values, so it is very efficient. There is no need to replace count(*) with count(column name) or count(constant). count(*) counts rows that contain NULL values, while count(column name) does not count rows where that column is NULL.
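
The difference in NULL handling can be checked with a few rows. The following is a minimal sketch, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in for MySQL; this counting rule is standard SQL, but the timing comparisons above are MySQL-specific and are not reproduced here. The employees rows are invented sample data.

```python
import sqlite3

# SQLite in memory as a stand-in engine (assumption); sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO employees (id, name) VALUES (1, 'LiLei'), (2, NULL), (3, 'Lucy');
""")

# One row has a NULL name: only count(name) leaves it out.
count_star, count_one, count_id, count_name = conn.execute(
    "SELECT count(*), count(1), count(id), count(name) FROM employees"
).fetchone()

print(count_star, count_one, count_id, count_name)
```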

Why does MySQL end up choosing a secondary index rather than the primary key clustered index for count(id)? Because a secondary index stores less data than the primary key index, so scanning it should be faster. MySQL optimizes this internally (apparently only from version 5.7 on).

1. Query the total number of rows maintained by MySQL [for row counts without conditions]

For MyISAM tables, a count query without a WHERE condition is very fast, because MySQL stores the total number of rows of a MyISAM table on disk and the query needs no calculation.

For InnoDB tables, MySQL does not store the total number of rows (because of the MVCC mechanism), so a count query has to be calculated in real time.

2. show table status

If you only need to know the estimated value of the total number of rows in the table, you can use the following SQL query, which has high performance

show table status like 'employees'

But the total number of rows returned by this method is an estimate and is not very accurate.

3. Add a count table to the database

When inserting or deleting rows, update the count table at the same time, and perform both operations in the same transaction.
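
A count table of this kind might look like the sketch below. It rests on several assumptions: SQLite (via Python's sqlite3 module) stands in for MySQL, and the table row_counts and the helper functions insert_employee, delete_employee and total_employees are hypothetical names chosen for illustration. The point is only that the data change and the counter update commit or roll back together.

```python
import sqlite3

# SQLite in memory as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Hypothetical count table: one row per counted table.
    CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, total INTEGER NOT NULL);
    INSERT INTO row_counts VALUES ('employees', 0);
""")

def insert_employee(name):
    # "with conn" commits both statements together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO employees (name) VALUES (?)", (name,))
        conn.execute(
            "UPDATE row_counts SET total = total + 1 WHERE table_name = 'employees'"
        )

def delete_employee(emp_id):
    with conn:
        cur = conn.execute("DELETE FROM employees WHERE id = ?", (emp_id,))
        # Subtract only the number of rows that were actually deleted.
        conn.execute(
            "UPDATE row_counts SET total = total - ? WHERE table_name = 'employees'",
            (cur.rowcount,),
        )

def total_employees():
    # Reads the maintained counter instead of running count(*).
    return conn.execute(
        "SELECT total FROM row_counts WHERE table_name = 'employees'"
    ).fetchone()[0]

insert_employee("LiLei")
insert_employee("Lucy")
delete_employee(1)
print(total_employees())
```

Keeping the counter in the same transaction means a failed insert never leaves the counter one ahead of the data, at the cost of an extra write on every insert and delete.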

Optimization suggestion

If integer data has no negative values, such as an ID number, it is recommended to declare it as UNSIGNED, which doubles the maximum value the column can hold.

It is recommended to use TINYINT instead of ENUM, BITENUM, SET.

Avoid using an integer display width (see the end of the document): do not write INT(10) or similar to specify a display width, just use INT. The number in parentheses is only the display width; it does not change the range of values an INT can hold.

DECIMAL is most suitable for storing data that requires high accuracy and is used for calculations, such as prices. But when using the DECIMAL type, pay attention to the length setting.

It is recommended to calculate and store real numbers as integers: multiply the real number by a fixed multiple before operating on it.

Integer is usually the best data type because it is fast and can use AUTO_INCREMENT.
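
The suggestion above to store real numbers as scaled integers can be illustrated with a small calculation. This is a sketch under assumptions: the multiple of 100 (amounts kept in cents), the helper names to_stored and to_display, and the sample prices are all invented for illustration, and only non-negative amounts are handled.

```python
from decimal import Decimal

SCALE = 100  # assumed multiple: amounts are stored as integer cents

def to_stored(amount_text):
    # Parse the decimal text exactly, then scale it to an integer for storage.
    return int(Decimal(amount_text) * SCALE)

def to_display(stored):
    # Convert the stored integer back to a two-decimal string (non-negative only).
    return f"{stored // SCALE}.{stored % SCALE:02d}"

# Binary floating point cannot represent 0.1 and 0.2 exactly.
float_total = 0.1 + 0.2

# The scaled integers add up exactly.
int_total = to_stored("0.10") + to_stored("0.20")

print(float_total)
print(int_total, to_display(int_total))
```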

Date time

Optimization suggestion

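By default, the smallest time granularity that MySQL stores is one second. Since MySQL 5.6.4, TIME, DATETIME, and TIMESTAMP columns can also be declared with fractional seconds, up to microsecond precision.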

It is recommended to use the DATE data type to save the date. The default date format in MySQL is yyyy-mm-dd.

Use MySQL's built-in types DATE, TIME, DATETIME to store time instead of using strings.

When the column type is TIMESTAMP or DATETIME, you can use CURRENT_TIMESTAMP as the default value (from MySQL 5.6.5 on), and MySQL will automatically fill in the time at which the record was inserted.

TIMESTAMP is stored as a UTC timestamp and is converted to and from the session time zone, so the value read back depends on the time zone setting.

DATETIME is stored as an integer in the form YYYYMMDD HH:MM:SS, independent of the time zone. What you save is what you read back.

Unless there are special needs, companies generally recommend TIMESTAMP, which takes less space than DATETIME. Companies such as Ali, however, generally use DATETIME, because then the upper time limit of TIMESTAMP (the year 2038) is not a concern.

Sometimes people save Unix timestamps as integer values, but this usually has no benefit. This format is not convenient to handle, and we don't recommend it.
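
The upper time limit of TIMESTAMP mentioned above follows from a signed 32-bit count of seconds since 1970-01-01 00:00:00 UTC. The short sketch below works out that moment with Python's standard library; the variable names are arbitrary.

```python
from datetime import datetime, timezone

# Largest value of a signed 32-bit integer, interpreted as seconds since the Unix epoch.
MAX_SIGNED_32 = 2**31 - 1

# Convert that second count to a UTC date and time.
timestamp_limit = datetime.fromtimestamp(MAX_SIGNED_32, tz=timezone.utc)

print(MAX_SIGNED_32)
print(timestamp_limit.isoformat())
```

DATETIME has no such limit, since it stores the calendar fields rather than a 32-bit second count.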

Optimization suggestion

Use VARCHAR when string lengths vary widely; for short strings whose values are all close to one length, use CHAR.

CHAR and VARCHAR are suitable for any combination of letters and digits no more than 255 characters long, such as names, postal codes, and telephone numbers. Numbers that will be used in calculations should not be stored as VARCHAR, because this can cause calculation problems and affect the accuracy and completeness of the results.

Try to use BLOB and TEXT as little as possible. If you really want to use it, you can consider storing the BLOB and TEXT fields in a separate table and associate them with id.

The BLOB series store binary strings, regardless of the character set. The TEXT series store non-binary character strings, which are related to character sets.

Neither BLOB nor TEXT can have default values.

13. How to get the height of MySQL innodb B+tree

Preface

MySQL's InnoDB engine stores indexes in a B+tree to minimize the number of disk I/Os during a query. The height of the tree directly affects query performance; a height of 3 to 4 levels is generally suitable. Splitting a database table is also a way to keep the height of the tree under control. So how do you get the height of the tree? The following example shows how.

Sample data preparation

The table building statement is as follows:

CREATE TABLE `user` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) CHARACTER SET latin1 DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `name` (`name`),
  KEY `age` (`age`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Insert 1 million rows into the table. The data looks like this:

mysql> select * from user limit 2\G
*************************** 1. row ***************************
  id: 110000
name: ab
 age: 100
*************************** 2. row ***************************
  id: 110001
name: ab
 age: 100
2 rows in set (0.00 sec)

Get the height of the tree by querying the relevant data table

Take MySQL5.6 version as an example to illustrate how to get the height of the tree.

First get page_no

mysql> SELECT b.name, a.name, index_id, type, a.space, a.PAGE_NO FROM information_schema.INNODB_SYS_INDEXES a, information_schema.INNODB_SYS_TABLES b WHERE a.table_id = b.table_id AND a.space <> 0 and b.name='test/user';
+-----------+---------+----------+------+-------+---------+
| name      | name    | index_id | type | space | PAGE_NO |
+-----------+---------+----------+------+-------+---------+
| test/user | PRIMARY |       22 |    3 |     6 |       3 |
| test/user | name    |       23 |    0 |     6 |       4 |
| test/user | age     |       24 |    0 |     6 |       5 |
+-----------+---------+----------+------+-------+---------+
3 rows in set (0.00 sec)

page_no is the serial number of the root page in the index tree. For the meaning of other items, please refer to: https://dev.mysql.com/doc/refman/5.6/en/innodb-sys-indexes-table.html

Read page size

mysql> show global variables like 'innodb_page_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+
1 row in set (0.00 sec) 

Finally read the height of the index tree

$ hexdump -s 49216 -n 10 ./user.ibd
000c040 0200 0000 0000 0000 1600
000c04a

It can be seen that PAGE_LEVEL is 0200. hexdump prints each 2-byte word in host byte order, so on a little-endian machine this is the big-endian value 0x0002, that is, level 2. Leaf pages are at level 0, so the height of this index tree is 3. The following 1600 is the index_id of the index: hexadecimal 16 is decimal 22, which is exactly the index_id of the primary key above, so this is the primary key index rather than a secondary index.
How is 49216 in the hexdump command above calculated? The formula is page_no * innodb_page_size + 64.
3*16384+64=49216

We use the same method to view the height of the other two indexes.

$ hexdump -s 65600 -n 10 ./user.ibd
0010040 0100 0000 0000 0000 1700
001004a
$ hexdump -s 81984 -n 10 ./user.ibd
0014040 0200 0000 0000 0000 1800
001404a

It can be seen that the height of the name index is 2, and the height of the age index is 3.
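
The offset formula and the byte layout used in the hexdump steps above can also be expressed in a few lines of Python. This is a minimal sketch, not an official tool: the function names page_level_offset, parse_level_and_index_id and read_tree_height are hypothetical, and the layout assumed is the one the article relies on (a 2-byte PAGE_LEVEL at byte 64 of the root page, followed by the 8-byte index id, both big-endian).

```python
import struct

PAGE_LEVEL_OFFSET = 64  # offset of PAGE_LEVEL inside a page, as used in the article

def page_level_offset(page_no, page_size=16384):
    # File offset of PAGE_LEVEL for a given root page: page_no * page_size + 64.
    return page_no * page_size + PAGE_LEVEL_OFFSET

def parse_level_and_index_id(raw):
    # ">HQ": big-endian, 2-byte unsigned PAGE_LEVEL, then 8-byte unsigned index id.
    level, index_id = struct.unpack(">HQ", raw[:10])
    return level, index_id

def read_tree_height(ibd_path, page_no, page_size=16384):
    # Reads the 10 bytes that the hexdump command shows and returns (height, index_id).
    with open(ibd_path, "rb") as f:
        f.seek(page_level_offset(page_no, page_size))
        level, index_id = parse_level_and_index_id(f.read(10))
    # The root level counts from 0 at the leaves, so height = level + 1.
    return level + 1, index_id

# The same 10 bytes as in the primary key hexdump output, in file order.
sample = bytes.fromhex("00020000000000000016")
level, index_id = parse_level_and_index_id(sample)
print(page_level_offset(3), level + 1, index_id)
```

With access to the data file, a call such as read_tree_height("./user.ibd", 3) would replace the manual hexdump step for the primary key.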

Estimate based on the structure of the index

If you do not have access to the database server, you can also estimate the height of the tree from the index structure.
According to the B+Tree structure, non-leaf nodes store index data, and leaf nodes store the data of each row.
The size of each index entry in a non-leaf node is the data size plus the pointer size. Assume that a pointer takes 8 bytes and that pages are not full, with 1/5 of each page left free. Below we estimate the height of the name and age indexes.

name index height estimate

Index entries per non-leaf page: each page is 16 KB. The value of name is ab, which takes 2 bytes, so each entry is 2+8=10 bytes. Each page can hold 16384 * 0.8 / 10 = 1310 index entries.
Index entries per leaf page: each page is 16 KB. Each entry is 4+2+8=14 bytes (4 bytes for the int primary key id, 2 bytes for name, 8 bytes for the pointer). Each page can hold 16384 * 0.8 / 14 = 936 index entries.
Two levels can hold 1310*936=1226160 records. So below about 1.2 million records, the height of the tree is 2.

age index height estimation

Index entries per non-leaf page: each page is 16 KB. The type of age is int, which takes 4 bytes, so each entry is 4+8=12 bytes. Each page can hold 16384 * 0.8 / 12 = 1092 index entries.
Index entries per leaf page: each page is 16 KB. Each entry is 4+4+8=16 bytes (4 bytes for the int primary key id, 4 bytes for age, 8 bytes for the pointer). Each page can hold 16384 * 0.8 / 16 = 819 index entries.
Two levels can hold 1092*819=894348 records. So below about 900,000 records, the height of the tree is 2, and 1 million records need 3 levels.
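
The two estimates above follow the same arithmetic, so they can be reproduced with one small function. This is a sketch under the article's own assumptions (16 KB pages that are 4/5 full, an 8-byte pointer, a 4-byte int primary key); the function names entries_per_page, two_level_capacity and estimated_height are hypothetical.

```python
PAGE_SIZE = 16384          # innodb_page_size from the article
FILL_NUM, FILL_DEN = 4, 5  # pages assumed 4/5 full (1/5 left free)
POINTER_BYTES = 8          # assumed pointer size
PK_BYTES = 4               # int primary key id

def entries_per_page(entry_bytes):
    # Usable bytes per page divided by the entry size, rounded down.
    usable = PAGE_SIZE * FILL_NUM // FILL_DEN
    return usable // entry_bytes

def two_level_capacity(key_bytes):
    # Non-leaf entry: key + pointer. Leaf entry: primary key + key + pointer.
    non_leaf = entries_per_page(key_bytes + POINTER_BYTES)
    leaf = entries_per_page(PK_BYTES + key_bytes + POINTER_BYTES)
    return non_leaf, leaf, non_leaf * leaf

def estimated_height(rows, key_bytes):
    # Add levels of non-leaf pages until the tree can hold all rows.
    non_leaf, leaf, _ = two_level_capacity(key_bytes)
    height, capacity = 1, leaf
    while capacity < rows:
        capacity *= non_leaf
        height += 1
    return height

print(two_level_capacity(2), estimated_height(1_000_000, 2))  # name index, 2-byte value
print(two_level_capacity(4), estimated_height(1_000_000, 4))  # age index, 4-byte int
```

For 1 million rows this gives a height of 2 for the name index and 3 for the age index, which agrees with the values read from the data file.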

Other tools

There is also a small tool for inspecting this: innodb_ruby, an InnoDB tablespace visualization tool.