Explore the essence of relational database connection query

The nature of join queries

Create two tables

CREATE TABLE e1 (m1 int, n1 char(1)); CREATE TABLE e2 (m2 int, n2 char(1)); INSERT INTO e1 VALUES(1,'a'), (2,'b'), (3,'c'); INSERT INTO e2 VALUES(2,'b'), (3,'c'), (4,'d');

image.png

The essence of the connection is to take out the records in each connection table and add the matched combination to the result set and return it to the user.

Connect internal implementation

image.png

This process seems to connect the records of the e1 table with the records of e2 to form a new and larger record, so this query process is called a join query. The result set of the join query contains the combination of each record in one table and each record in the other table. A result set like this can be called a Cartesian product. Because there are 3 records in table e1 and 3 records in table e2, the Cartesian product after joining these two tables has 3×3=9 rows of records.

Simple syntax for connection

In MySQL, the syntax of the join query is very arbitrary, as long as the FROM statement is followed by multiple table names. For example, we can write the query statement that joins the e1 table and the e2 table as follows:

SELECT * FROM e1, e2;

image.png

Introduction to the connection process

Why have as few association tables as possible

In the earliest learning, in order to show their "profound" optimization skills, often people often use a super-multi-table association query. Some often associate 7 or 8 watches with friends to show off. But after deep learning Mysql, we found that it seems different from what we initially thought.

We know that there is such a statement in "Alibaba JAVA Development Manual": Join is prohibited for more than three tables. So why is join, so convenient, so unwelcome by Ali bosses?

We can join any number of tables, but if there are no restrictions, the Cartesian product generated by joining these tables may be very huge. For example, the Cartesian product of three tables with 100 rows of records is 100×100×100=1000000 rows of data! So it is necessary to filter out specific record combinations when connecting

Sql analysis

SELECT * FROM e1, e2 WHERE e1.m1> 1 AND e1.m1 = e2.m2 AND e2.n2 <'d';

Single table condition filter

e1.m1> 1 is the filter condition only for the e1 table, and e2.n2 <'d' is the filter condition only for the e2 table.

Multi-table condition filtering

e1.m1 = e2.m2, e1.n1> e2.n2, etc. These conditions involve two tables.

Analysis of screening conditions in Sql

  1. e1.m1> 1
  2. e1.m1 = e2.m2
  3. e2.n2 <'d

Then the general execution process of this connection query is as follows:

Step 1: Select the driver table

As we can see, the associated query assumes that there are two tables, A and B. In fact, it is to filter one of the tables separately, and then use the result of that table as a condition to query in another table.

So, let's think about it, assuming we are an optimizer, will we choose the table with the simpler filtered result as the driving table, or the table with the more complex filtered result as the driving table?

The answer is yes, of course, is to choose a simple table of results. In this way, when our driving table filter conditions, it will be less used as a condition of the driven table to execute the next query.

Then we now make assumptions:

  • The condition of e1.m1> 1 has 2 pieces of data
  • There are 5 data in the condition of e2.n2 <'d

So which table will the optimizer choose as the driving table? Naturally it is e1.

Step 2: Take the result conditions of the driven table to query in the driven table

At this time, e1 will take out the two m1s that it has already screened (assuming the first m1=2 and the second m1=3), as equivalent conditions, and then go to the e2 table for query. That is, the e2 table will execute the following sql immediately:

SELECT * FROM e2 WHERE e2.m2 = 2 AND e2.n2 <'d';SELECT * FROM e2 WHERE e2.m2 = 3 AND e2.n2 <'d';

This further explains that the driver table we are talking about should select a table with as few conditions as possible. Because the number of queries of the driven table completely refers to the number of filtered results of this buddy.

Execution flow chart

image.png

In other words, the final result of the entire connection query only has two records that meet the filter conditions.

in conclusion

It can be seen from the above two steps that this two-table join query needs to query the e1 table once and the e2 table twice. That is to say, in the two-table join query, the driving table only needs to be accessed once, and the driven table may be accessed multiple times.

Based on our understanding of IO costs, choosing a table with a smaller result set as the driving table can effectively reduce the number of IOs of the driven table. This is why the optimizer tries to convert outer joins into inner joins. (You can judge the driven table and driven table by yourself)

Inner link and outer link

Build a table

CREATE TABLE student ( number INT NOT NULL AUTO_INCREMENT COMMENT'student number', name VARCHAR(5) COMMENT'Name', major VARCHAR(30) COMMENT'Professional', PRIMARY KEY (number) ) Engine=InnoDB CHARSET=utf8 COMMENT'Customer Information Table'; CREATE TABLE score ( number INT COMMENT'student number', subject VARCHAR(30) COMMENT'Subject', score TINYINT COMMENT'score', PRIMARY KEY (number, subject) ) Engine=InnoDB CHARSET=utf8 COMMENT'Customer Result Sheet';

Single table query display

SELECT * FROM student; SELECT * FROM score;

image.png

Internal connection

Now we want to query the test scores of each student and we need to join the two tables (because there is no name information in the score, we can't just query the score table). The connection process is to fetch records from the student table and find the score records with the same number in the score table, so the filter condition is student.number = socre.number, and the entire query statement is like this:

SELECT s1.number, s1.name, s2.subject, s2.score FROM student AS s1, score AS s2 WHERE s1.number = s2.number;

image.png

The nature of the internal connection thinks that the problem

From the above query results, we can see that the scores of each subject corresponding to each student have been found, but there is a problem. Student King, that is, the student with the student ID of 20200904, did not take the exam for some reason, so in There is no corresponding score record in the score table.

Therefore, the essence of inner connection is actually:

For the two tables in the inner connection, the record in the driving table cannot find a matching record in the driven table, and the record will not be added to the final result set.

How to make up for internal connection defects

If the teacher wants to check the test scores of all students, even the students who are absent should also show it, but the join query we have introduced so far cannot fulfill such a requirement.

In order to solve this problem, there is the concept of inner connection and outer connection.

Compared with the limitations of internal links, the definition of external links is:

For the two tables that are externally joined, the records in the driving table need to be added to the result set even if there is no matching record in the driven table.

In MySQL, depending on the selection of the drive table, outer joins can still be subdivided into two types:

  1. Left outer connection, select the table on the left as the drive table.
  2. Right outer connection, select the table on the right as the drive table.

But this still has a problem. Even for outer joins, sometimes we don't want to add all the records of the driver table to the final result set. This is difficult, what should I do? This problem can be solved by dividing the filter conditions into two types, so the filter conditions placed in different places have different semantics:

WHERE clause semantics (query conditions)

The filter condition in the WHERE clause is the kind we usually see.

Regardless of whether it is an inner join or an outer join, all records that do not meet the filter conditions in the WHERE clause will not be added to the final result set.

ON clause semantics (connection conditions)

The ON clause is specifically for the record in the external connection drive table. When the driven table cannot find a matching record, it should not be added to the result set. ,

When the external link uses ON for related query, the condition after ON is used as the connection condition. If the data of the driving table is successfully matched in the driven table, the data will be pieced together in the case of driving table, driven table one-to-one, or one-to-many. If the matching does not fail, a query result containing all the driving table data and the driven table data is all NULL is generated.

Connection syntax

For external links, both the driving table and the driven table are determined by ourselves.

Left (outer) connection syntax

SELECT * FROM e1 LEFT [OUTER] JOIN e2 ON connection condition [WHERE common filter condition];

The OUTER in parentheses can be omitted. Above e1 is our driven table, and e2 is our driven table.

Right (outer) connection syntax

SELECT * FROM e1 RIGHT [OUTER] JOIN e2 ON connection condition [WHERE common filter condition];

The OUTER in parentheses can be omitted. Above e2 is our driven table, and e1 is our driven table.

Syntax of inner join

SELECT * FROM e1 JOIN e2;SELECT * FROM e1 INNER JOIN e2; SELECT * FROM e1 CROSS JOIN e2;SELECT * FROM e1, e2;

The ON clause and the WHERE clause in the inner join are equivalent, so the ON clause is not required to be mandatory in the inner join.

Internally connected and externally linked driven tables and driven tables

The essence of the connection is to take out the records in each connection table and add the matched combination to the result set and return it to the user. No matter which table is used as the driving table, the Cartesian product generated by the connection of the two tables must be the same.

For inner joins, since all records that do not meet the conditions in the ON clause or WHERE clause will be filtered out, it is actually equivalent to giving the records that do not meet the filter conditions from the Cartesian product of the two tables connected Kicked out, so for inner joins, the driving table and the driven table are interchangeable, and will not affect the final query result.

For outer joins, because the records in the driving table must be added to the result set even if the records that meet the ON clause conditions are not found in the driven table, the relationship between the driving table and the driven table is at this time It is very important, that is to say, the driven table and driven table of the left outer connection and the right outer connection cannot be easily interchanged.

Nested-Loop Join

Time complexity of ordinary nested query

For a two-table connection, the driving table will only be accessed once, but the driven table has to be accessed many times. The specific access times depend on the number of records in the result set after a single-table query is executed on the driving table.

For inner joins, which table is selected as the driving table does not matter the number of results. The drive table of the external connection is fixed, that is to say, the drive table of the left (outer) connection is the table on the left, and the drive table of the right (outer) connection is the table on the right.

If there are 3 tables to be connected, then the result set obtained by the first two tables is like a new driven table, and then the third table becomes the driven table. You can use pseudo code to express this process like this:

for each row in e1 {#Here means that traversal satisfies each record in the e1 single table query result set, N items        for each row in e2 {#here means that for a record in the e1 table, traversal satisfies each record in the e2 single table query result set, M            for each row in t3 {#Here means that for a certain combination of records in the e1 and e2 tables, a single-table query on the t3 table, L                 #Final Results        }    }

This process is like a nested loop, so this drive table is accessed only once, but the driven table may be accessed multiple times. The number of accesses depends on the number of records in the result set after a single table query is executed on the drive table. The connection execution method is called Nested-Loop Join, which is the simplest and most awkward connection query algorithm. The time complexity is O(N*M*L).

Use index to speed up connection

We know that the driving table and driven table of the associated query are often one-to-many relationships. Assuming that the conditions filtered out by the driving table, each item goes to the driven table to perform a full table scan query, the speed is absolutely slow beyond imagination.

Therefore, adding indexes to the driven table can greatly improve query efficiency.

Block-based nested loop join

The cost of having no index on the driven table

The process of scanning a table is actually to first load the table from the disk into the memory, and then compare whether the matching conditions are met from the memory.

In real life, tables with thousands of records are few, and tables with millions, tens of millions, or even hundreds of millions of records are everywhere. The memory may not completely store all the records in the following table, so when scanning the front record of the table, the latter records may still be on the disk. When the latter records are scanned, the memory may be insufficient, so you need to change the front records from Released in memory.

In the process of joining two tables using the nested loop join algorithm, the driven table has to be accessed many times. If the data in the driven table is particularly large and cannot be accessed using the index, it is equivalent to from the disk After reading this table several times, the I/O cost is very high, so we have to find a way: try to reduce the number of accesses to the driven table as much as possible.

Connection process without join buffer

  1. When the data in the driven table is very large and there is no index, the order of associative queries is as follows:
  2. Take out a drive table result into memory
  3. Take out the results of all driven tables and compare them with the results of the driven table
  4. Release the result of the driver table in memory, and put the result of the next driver table into the memory
  5. Repeat the above steps until all the results in the driven table are compared with all the data in the driven table once

We found that when we have no join buffer, how many times the drive table has results, the drive table needs to be connected to memory IO as many times.

Connection process with join buffer

Therefore, the concept of join buffer is proposed:

The join buffer is a fixed-size memory that is requested before the execution of the join query. A number of records in the result set of the driving table are installed in this join buffer, and then the driven table is scanned, and each record of the driven table is joined at one time. Multiple drive table records in the buffer are matched, because the matching process is done in memory, so this can significantly reduce the I/O cost of the driven table.

image.png

The size of this join buffer can be configured through startup parameters or the system variable join_buffer_size. The default size is 262144 bytes (that is, 256KB), and the minimum can be set to 128 bytes.

show variables like'join_buffer_size';

Of course, for optimizing the query of the driven table, it is best to add an efficient index to the driven table. If you can't use the index, and the memory of your own machine is relatively large, you can try to increase the value of join_buffer_size. Join queries for optimization.

In addition, it should be noted that not all columns of the driving table records will be placed in the join buffer, only the columns in the query list and the columns in the filter conditions will be placed in the join buffer, so remind us again, it is best Don't use * as the query list, just put the columns we care about in the query list, so that more records can be placed in the join buffer.