Database Miscellaneous

Pessimistic lock, optimistic lock

Pessimistic lock: using the database's lock mechanism to lock the data before modifying it, and only then performing the modification, is called pessimistic concurrency control. Pessimistic locks come in two kinds: shared locks and exclusive locks.

Optimistic lock: an optimistic lock assumes that, under normal circumstances, the data will not conflict, so a conflict check is performed only when the update is actually submitted. If a conflict is found, an error is returned to the user, who decides how to proceed. Optimistic locking suits read-heavy scenarios and can improve throughput. CAS (compare-and-swap) is one example.
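A minimal sketch of the optimistic (CAS-style) pattern in Python: each writer reads a value together with its version, computes the new value, and commits only if the version is unchanged, retrying on conflict. The class and method names are illustrative, not from any library.

```python
import threading

class VersionedRecord:
    """A record guarded by optimistic concurrency control (illustrative)."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()  # only to make the CAS step itself atomic

    def read(self):
        with self._lock:
            return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        """Commit new_value only if nobody updated the record since we read it."""
        with self._lock:
            if self.version != expected_version:
                return False          # conflict: caller must retry or report an error
            self.value = new_value
            self.version += 1
            return True

def optimistic_increment(record):
    while True:                        # retry loop on conflict
        value, version = record.read()
        if record.compare_and_set(version, value + 1):
            return

rec = VersionedRecord(0)
threads = [threading.Thread(target=optimistic_increment, args=(rec,)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(rec.value)  # 100: every conflicting update was retried, none lost
```

Note that no lock is held while the new value is computed; contention is paid for only on conflict, which is why this pattern pays off when reads dominate.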

First/second/third paradigm

1NF (first normal form) means that each column of a table is an indivisible basic data item; it emphasizes the atomicity of columns, so an attribute cannot hold several values in one row. For example, a phone-number attribute cannot store both a landline number and a mobile number.
Note: in any relational database, first normal form (1NF) is the basic requirement of the relational model; a database that does not satisfy 1NF is not a relational database.

2NF (second normal form) builds on the first: satisfying 2NF requires satisfying 1NF. 2NF requires that every instance (row) of the table be uniquely identifiable. Beyond 1NF there are two extra conditions: the table must have a primary key, and every column not in the primary key must depend on the whole primary key, not just part of it.

Each row of data should describe only one thing. As soon as a column contains duplicated data, the table needs to be split.

For example: when a table has a composite primary key but some columns depend on only one attribute, or a subset of the attributes, of that composite key, the table must be split and recombined to satisfy second normal form.

3NF (third normal form): a relation is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on a candidate key. That is, the situation "non-key column A depends on non-key column B, and non-key column B depends on the primary key" must not exist.

For example: Employee (emp_id, emp_name, emp_age, dept_id, dept_name, dept_info), where emp_id uniquely determines an employee's information, but dept_name is uniquely determined by dept_id. This table does not satisfy third normal form. The fix is to delete every department column except dept_id from Employee and put all department information in a separate Department table.
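The decomposition described above can be sketched with Python's built-in sqlite3 (the DDL and sample rows are made up for illustration): department facts live once in Department, Employee keeps only the dept_id foreign key, and a join recovers the original wide row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF decomposition: dept_name/dept_info depend only on dept_id,
# so they move to their own table; Employee keeps just the foreign key.
cur.executescript("""
CREATE TABLE Department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT,
    dept_info TEXT
);
CREATE TABLE Employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT,
    emp_age  INTEGER,
    dept_id  INTEGER REFERENCES Department(dept_id)
);
INSERT INTO Department VALUES (1, 'R&D', 'builds the product');
INSERT INTO Employee VALUES (10, 'Alice', 30, 1), (11, 'Bob', 28, 1);
""")

# Department data is stored once; a join reconstructs the wide row.
rows = cur.execute("""
    SELECT e.emp_id, e.emp_name, d.dept_name
    FROM Employee e JOIN Department d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)  # [(10, 'Alice', 'R&D'), (11, 'Bob', 'R&D')]
```

Updating a department's name now touches exactly one row, which is the anomaly 3NF removes.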

1NF: Each column of the database table is an indivisible basic data item.

2NF: On the basis of 1NF, each non-primary attribute completely depends on the candidate code. The partial dependence of non-primary attributes on candidate codes is eliminated.

3NF: On the basis of 2NF, each non-primary attribute is not transitively dependent on the candidate code. That is, each column in the table is only directly related to the candidate code, not indirectly.

BCNF: on the basis of 1NF, for every non-trivial dependency X -> Y (non-trivial meaning Y is not a subset of X), X must contain a candidate key. It is a "strengthened 3NF".


From table1, table2, ...

A multi-table query written this way retrieves from the Cartesian product of the tables; the result is the same as Cartesian-product retrieval, with some optimizations applied by the DBMS.


Select colname [[, colname]...]
From table1 [NATURAL]
	{inner | left outer | right outer | full outer} join table2
	{ON <join condition> | Using(colname {, colname...})}
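As a runnable sketch of the join-condition variants, the following uses Python's built-in sqlite3 (SQLite supports NATURAL JOIN and USING; the table names s and t are made up). It shows the key observable difference: ON keeps the common column twice, while NATURAL/USING keep it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE s (sid INTEGER, sname TEXT);
CREATE TABLE t (sid INTEGER, score INTEGER);
INSERT INTO s VALUES (1, 'Ann'), (2, 'Ben');
INSERT INTO t VALUES (1, 90), (3, 70);
""")

# ON: the join condition is explicit; the common column sid appears twice.
on_cols = [d[0] for d in cur.execute(
    "SELECT * FROM s JOIN t ON s.sid = t.sid").description]
print(on_cols)  # 4 result columns, sid among them twice

# NATURAL / USING: common column matched by name, kept only once.
nat_cols = [d[0] for d in cur.execute(
    "SELECT * FROM s NATURAL JOIN t").description]
using_rows = cur.execute("SELECT * FROM s JOIN t USING (sid)").fetchall()
print(nat_cols)    # 3 result columns, sid only once
print(using_rows)  # [(1, 'Ann', 90)] -- only sid=1 matches in both tables
```

`NATURAL` needs no column list because it equates every identically named attribute; `USING` lets you pick a subset of them.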

The join operation consists of two parts: join type and join condition.

Join type (choose one of four)     Join condition (choose one of three)
inner join                         natural
left outer join                    on <join condition>
right outer join                   using(Col1, Col2, …, Coln)
full outer join

Note: MySQL does not support full outer join.
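Because MySQL lacks full outer join, it is commonly emulated as a left join unioned with the unmatched right-side rows. A runnable sketch using Python's sqlite3 (table names a and b are illustrative; the same SQL works in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE a (k INTEGER, va TEXT);
CREATE TABLE b (k INTEGER, vb TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
INSERT INTO b VALUES (2, 'b2'), (3, 'b3');
""")

# full outer join = (a left join b)  UNION ALL  (rows of b with no match in a)
rows = cur.execute("""
    SELECT a.k, a.va, b.vb FROM a LEFT JOIN b ON a.k = b.k
    UNION ALL
    SELECT b.k, NULL, b.vb FROM b LEFT JOIN a ON a.k = b.k WHERE a.k IS NULL
    ORDER BY 1
""").fetchall()
print(rows)  # [(1, 'a1', None), (2, 'a2', 'b2'), (3, None, 'b3')]
```

Unmatched tuples from either side survive, padded with NULLs, which is exactly the full-outer-join semantics described below.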

Join type

Inner Join: The θ-join operation in relational algebra.

Left outer join, right outer join, full outer join: the outer join operations in relational algebra.
- "Table 1 left outer join Table 2": every tuple t of Table 1 appears in the result table. If Table 2 contains a tuple s satisfying the join condition, t is joined with s; otherwise t is joined with a null-valued tuple.
- "Table 1 right outer join Table 2": every tuple s of Table 2 appears in the result table. If Table 1 contains a tuple t satisfying the join condition, t is joined with s; otherwise s is joined with a null-valued tuple.
- "Table 1 full outer join Table 2": the combination of the two.

Join condition

  • natural: tuples of the two joined relations appear in the result when they have equal values on all common attributes, and each common attribute appears only once.
  • on <join condition>: tuples of the two joined relations appear in the result when they satisfy the join condition; the common attributes appear twice.
  • using(Col1, Col2, ..., Coln): (Col1, ..., Coln) is a subset of the common attributes of the two relations; tuples appear in the result when their values on (Col1, ..., Coln) are equal, and (Col1, ..., Coln) appears only once.

How files are stored on disk

Take the FAT file system as an example:


A file is stored across several disk blocks.

The file system keeps a file allocation table (FAT): for each disk block it records the number of the next disk block of the same file, like a linked list. The FAT thus chains together the disk blocks belonging to one file.

A file's entry in the directory records its starting disk block number.
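The FAT chaining above can be sketched in a few lines of Python (block numbers and file names are made up): the directory gives the starting block, and following the FAT yields the rest of the chain.

```python
# A toy FAT: fat[i] holds the number of the next block of the same file,
# or None at the end of the chain (block numbers are illustrative).
fat = {2: 5, 5: 9, 9: None,      # file A occupies blocks 2 -> 5 -> 9
       4: 7, 7: None}            # file B occupies blocks 4 -> 7

directory = {"A.txt": 2, "B.txt": 4}   # directory entry -> starting block

def file_blocks(start_block):
    """Follow the FAT chain from the starting block recorded in the directory."""
    blocks = []
    block = start_block
    while block is not None:
        blocks.append(block)
        block = fat[block]
    return blocks

print(file_blocks(directory["A.txt"]))  # [2, 5, 9]
print(file_blocks(directory["B.txt"]))  # [4, 7]
```

Like any linked list, random access to the middle of a file requires walking the chain, which is one of FAT's main drawbacks.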

Index category

Dense index and sparse index

  • Dense index: every record in the main file (every index-field value that occurs) has an index entry pointing to the record's location.
  • Sparse index: only some of the records in the main file (some index-field values) have index entries; such an index is called a non-dense or sparse index.

Primary index: usually one index entry per storage block, so the total number of index entries equals the number of storage blocks occupied by the table. The first record of each storage block is called the anchor record, or simply the block anchor. A primary index is a sparse index.
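A primary-index lookup can be sketched as: binary-search the block anchors, then scan only the one block that can contain the key. A minimal Python sketch (block contents are made up):

```python
import bisect

# Sorted main file split into storage blocks; the primary index keeps one
# entry per block, keyed by the block's first record (the block anchor).
blocks = [[(2, 'r2'), (5, 'r5')],
          [(8, 'r8'), (11, 'r11')],
          [(14, 'r14'), (20, 'r20')]]
anchors = [blk[0][0] for blk in blocks]   # block anchors: [2, 8, 14]

def lookup(key):
    """Binary-search the sparse index, then scan a single block."""
    i = bisect.bisect_right(anchors, key) - 1
    if i < 0:
        return None                        # key smaller than every anchor
    for k, rec in blocks[i]:
        if k == key:
            return rec
    return None

print(lookup(11))  # 'r11'
print(lookup(3))   # None: falls inside block 0 but is absent
```

The index holds one entry per block rather than per record, which is exactly what makes it sparse.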


Secondary (auxiliary) index: an auxiliary access structure defined on any one or more non-ordering fields of the main file. It is a dense index.


Clustered index and non-clustered index

  • Clustered index: It means that the adjacent records in the index are also stored adjacently in the main file;
  • Non-clustered index: It means that the adjacent records in the index are not necessarily stored adjacently in the main file.

The inverted index
records, for each term, which documents it appears in and at which positions, so documents containing a given term, and the exact positions of its occurrences, can be located quickly.
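Building such a term -> (document, position) mapping is a few lines of Python (documents and ids are made up):

```python
from collections import defaultdict

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}

# term -> list of (doc id, position of the term within the document)
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        inverted[term].append((doc_id, pos))

print(inverted["quick"])  # [(1, 1), (3, 0)]
print(inverted["dog"])    # [(2, 2), (3, 1)]
```

Querying is then a dictionary lookup per term; storing positions (not just document ids) is what enables phrase and proximity search.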


Select query principle

  1. The DBMS converts the query statement into a relational algebra expression (for example: product first, then selection, finally projection);
  2. The relational algebra expression is optimized (e.g., operations are reordered) to reduce running time (logical optimization);
  3. For each relational algebra operation an optimized execution routine is chosen, forming a physical query plan:
  4. obtain database-related information (statistics collected routinely);
  5. a relational algebra operation may have several candidate routines; their costs are estimated from this information, and the routine and parameters with the least cost are chosen (physical optimization);
  6. the result is a query plan.
  7. Execution engine: following the physical query plan, the corresponding routines are called and the result is returned.

External sort

Two-pass multiway merge sort

  1. First pass: partition into subsets and sort each subset.
    Divide the data into multiple subsets, each smaller than memory, and sort each subset in memory.
  2. Second pass: merge the sorted subsets.
    Read a part of each subset (e.g., one data block) into memory and perform a multiway merge.

Prerequisite: number of subsets × blocks per subset < (number of memory blocks)²
(both the number of subsets and the number of blocks per subset must be smaller than the number of memory blocks)
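The two passes above can be sketched in Python; `heapq.merge` plays the role of the multiway merge, consuming each sorted run incrementally the way one block of each subset would be held in memory (the `memory_size` cap is the only "hardware" in this toy).

```python
import heapq
import random

def external_sort(data, memory_size):
    """Two-pass multiway merge sort sketch; memory_size caps how many
    items fit in 'memory' at once."""
    # Pass 1: cut the data into runs that fit in memory, sort each run.
    runs = [sorted(data[i:i + memory_size])
            for i in range(0, len(data), memory_size)]
    # Pass 2: k-way merge -- heapq.merge reads each run lazily,
    # like keeping one block of every run in memory at a time.
    return list(heapq.merge(*runs))

data = random.sample(range(1000), 60)
assert external_sort(data, 10) == sorted(data)
print(external_sort([5, 1, 4, 2, 3], 2))  # [1, 2, 3, 4, 5]
```

With B memory blocks, pass 1 yields runs of B blocks and pass 2 can merge up to B of them, hence the B² limit stated above.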

Sorting larger data sets: multiple passes / multiple stages

Memory size: 3 memory blocks
Data to be sorted: 30 data blocks in total

Basic strategy:

  1. The 30-block data set => 10 subsets of 3 blocks each; sort each subset and store it.
  2. Group the 10 sorted subsets into 5 pairs and two-way merge each pair, producing 5 sorted sets;
  3. Group the 5 sets into 3 groups (two pairs plus one singleton) and two-way merge, producing 3 sorted sets; regroup and merge to get 2 sorted sets; one final merge completes the sort.


Undoing/redoing transactions (undo/redo) is a means of database failure recovery.

undo/redo log

System failures can be recovered from using the running log:
following the sequence of transaction operations recorded in the log, redo a transaction (if it had ended correctly when the failure occurred) or undo it (if it had not ended when the failure occurred).

The undo log records the value of a data item before modification; it is used for rollback when a transaction fails, ensuring the atomicity of the transaction.

The redo log records the modified value of a data block; it is used to restore updates made by successfully committed transactions that had not yet been written to the medium, ensuring the durability of the transaction.
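A toy recovery pass over such a log can be sketched in Python (the log record format and transaction names are made up): committed transactions are redone forward, uncommitted ones are undone backward.

```python
# Toy log records: (txn, op, item, old_value, new_value);
# a 'commit' record marks a transaction as correctly ended.
log = [
    ("T1", "write", "x", 0, 10),
    ("T1", "commit", None, None, None),
    ("T2", "write", "y", 0, 20),      # T2 never commits before the crash
]

def recover(db, log):
    committed = {rec[0] for rec in log if rec[1] == "commit"}
    # Redo committed transactions forward (durability)...
    for txn, op, item, old, new in log:
        if op == "write" and txn in committed:
            db[item] = new
    # ...and undo uncommitted ones backward (atomicity).
    for txn, op, item, old, new in reversed(log):
        if op == "write" and txn not in committed:
            db[item] = old

db = {"x": 0, "y": 20}   # T2's write leaked to disk before the crash
recover(db, log)
print(db)  # {'x': 10, 'y': 0}: T1 redone, T2 rolled back
```

Undo runs backward because a transaction may write the same item several times and the earliest pre-image must win.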


The checkpoint is the moment at which the DBMS forces the contents of the in-memory DB buffer to agree with the contents of the DB on the medium (the data file), i.e., all updated contents of the DB buffer are written back to the DB.

Checkpoints are set regularly to reduce the length of database failure recovery.

  • Transactions that ended before the checkpoint need no recovery (their changes have been written back to the DB);
  • Transactions that ended or started after the checkpoint must be recovered from the running log (it cannot be determined whether their changes reached the DB): redo those that ended before the failure point, undo those that had not ended at the failure point.


SQL standard transaction isolation levels include:

  • Read uncommitted: changes made by a transaction can be seen by other transactions even before it commits.
  • Read committed: changes made by a transaction become visible to other transactions only after it commits.
  • Repeatable read: the data seen during a transaction's execution is always consistent with what it saw at the start of the transaction. Under this level, uncommitted changes are of course also invisible to other transactions.
  • Serializable: read-write conflicts between different transactions on the same row are resolved by serialization: a "write" takes a write lock, a "read" takes a read lock, and the next transaction must wait for the previous one to finish before it can execute.

MySQL offers all four isolation levels; InnoDB uses multi-version concurrency control (MVCC) to implement the read committed and repeatable read levels.

Different isolation levels can solve different isolation problems:

Isolation level      Dirty read     Non-repeatable read   Phantom read
Read uncommitted     possible       possible              possible
Read committed       not possible   possible              possible
Repeatable read      not possible   not possible          possible
Serializable         not possible   not possible          not possible

[Note] Under the standard repeatable read isolation level, phantom reads are still possible. However, InnoDB uses next-key locks (row lock + gap lock) to solve the phantom read problem at the repeatable read level; a gap lock locks a range of values, preventing the insertion of new records.


A transaction is an atomic set of SQL queries: the statements in the transaction either all execute successfully or all fail.

Four standard characteristics of transactions (ACID):

  • Atomicity: a transaction must be regarded as an indivisible minimum unit of work. All operations in the transaction either all commit successfully or all fail and roll back; a transaction cannot perform only part of its operations;
  • Consistency: the database always transitions from one consistent state to another consistent state;
  • Isolation: concurrently executing transactions do not affect each other;
  • Durability: once a transaction commits, its changes are stored permanently in the database.

The four properties are not mutually orthogonal; they overlap rather than being independent of one another.

Write-ahead logging (WAL) provides atomicity (via the undo log) and durability (via the redo log), and thereby helps achieve consistency.

Appropriately relaxing consistency can improve performance and parallelism; for example, eventual consistency is roughly comparable to read uncommitted.


A snapshot is a record of the state of data storage at a certain time.

The data in a snapshot database never changes. When a snapshot is created, the system marks all data pages of the original database. If a page is modified after the snapshot was created, the page is copied first; pages that are never modified are not copied, and the original database and the snapshot share them.

A database snapshot is equivalent to a new database, which saves the database state when the snapshot is created, and shares unmodified data pages with the original database.
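The copy-on-write scheme described above can be sketched in Python (class and page names are illustrative): a write first preserves the pre-image page for each existing snapshot, so a snapshot reads its saved copy if one exists and otherwise shares the live page.

```python
class Snapshot:
    """Read-only view that shares unmodified pages with the live database."""
    def __init__(self, db):
        self.db = db
        self.saved = {}                    # pages copied here just before modification

    def read(self, page):
        # Prefer the preserved pre-image; fall back to the shared live page.
        return self.saved.get(page, self.db.pages.get(page))

class Database:
    def __init__(self, pages):
        self.pages = dict(pages)
        self.snapshots = []

    def create_snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, page, data):
        for snap in self.snapshots:        # copy-on-write: save the old page
            if page not in snap.saved:     # content once per snapshot
                snap.saved[page] = self.pages.get(page)
        self.pages[page] = data

db = Database({"p1": "old1", "p2": "old2"})
snap = db.create_snapshot()
db.write("p1", "new1")
print(db.pages["p1"], snap.read("p1"), snap.read("p2"))  # new1 old1 old2
```

Creating the snapshot costs almost nothing; space is consumed only in proportion to the pages later modified.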