[Ali Annual Salary Million Database Interview] Will MySQL lose data?

A way to temporarily improve performance during peak business periods.

The WAL mechanism guarantees that, as long as the redo log and binlog are persisted to disk, data can be recovered after MySQL restarts abnormally.

Binlog writing mechanism

During the execution of the transaction:

  • Write the log to the binlog cache first
  • When the transaction is committed, write the binlog cache to the binlog file

The binlog of a transaction cannot be split: no matter how large the transaction is, it must be written in one go. This raises the question of how the binlog cache is stored.

The system allocates a block of memory for the binlog cache, one per thread, while all threads share the same binlog file. The parameter binlog_cache_size controls how much memory the binlog cache may occupy in a single thread. If a transaction exceeds this size, the excess is temporarily saved to disk.

When the transaction is committed, the executor writes the complete transaction in the binlog cache to the binlog and clears the binlog cache.
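The per-thread cache described above can be sketched in a few lines of Python. This is a toy model, not MySQL's implementation: the class name BinlogCache, the list standing in for the shared binlog file, and the simplification of spilling the whole cache to a temporary file once the limit is exceeded are all assumptions made for illustration.

```python
import tempfile


class BinlogCache:
    """Toy per-thread binlog cache with a size limit (hypothetical model)."""

    def __init__(self, cache_size: int):
        self.cache_size = cache_size   # plays the role of binlog_cache_size
        self.memory = bytearray()      # in-memory part of the cache
        self.spill = None              # temp file, created only when needed

    def append(self, event: bytes) -> None:
        # Once the in-memory limit would be exceeded, move everything to disk.
        if self.spill is None and len(self.memory) + len(event) > self.cache_size:
            self.spill = tempfile.TemporaryFile()
            self.spill.write(self.memory)
            self.memory.clear()
        if self.spill is not None:
            self.spill.write(event)
        else:
            self.memory += event

    def commit(self, binlog_file: list) -> None:
        # The whole transaction goes to the shared binlog as one contiguous unit.
        if self.spill is not None:
            self.spill.seek(0)
            data = self.spill.read()
            self.spill.close()
            self.spill = None
        else:
            data = bytes(self.memory)
        binlog_file.append(data)
        self.memory.clear()            # the cache is emptied after commit


shared_binlog = []                     # stands in for the shared binlog file
cache = BinlogCache(cache_size=8)
for event in (b"aaaa", b"bbbb", b"cccc"):
    cache.append(event)                # the third event forces a spill to disk
spilled_before_commit = cache.spill is not None
cache.commit(shared_binlog)
```

However large the transaction grows, the shared binlog receives it as a single entry, which is the point of keeping one cache per thread.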

Figure 1: binlog write-to-disk states (diagram not included)

The diagram distinguishes two operations:

  • write
    Writes the log to the page cache of the file system without persisting the data to disk, so it is fast.
  • fsync
    Persists the data to disk. In general, only fsync is considered to consume disk IOPS.
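The same two operations exist at the operating system level, and Python exposes them as os.write and os.fsync. The sketch below only illustrates the difference between the two calls; the file name demo-binlog and the helper append_log are made up for the example and have nothing to do with MySQL's own code.

```python
import os
import tempfile


def append_log(fd: int, record: bytes, durable: bool) -> None:
    """Append a record; call fsync only when durability is requested."""
    os.write(fd, record)    # data reaches the OS page cache
    if durable:
        os.fsync(fd)        # ask the OS to flush the cached data to the device


path = os.path.join(tempfile.mkdtemp(), "demo-binlog")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
append_log(fd, b"trx1\n", durable=False)   # write only
append_log(fd, b"trx2\n", durable=True)    # write + fsync
os.close(fd)
```

After the second call both records have been flushed, because fsync covers everything written to the file so far, not just the last record.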

The timing of write and fsync is controlled by the parameter sync_binlog:

  • sync_binlog=0: every transaction commit only does a write, never an fsync
  • sync_binlog=1: every transaction commit does an fsync
  • sync_binlog=N (N>1): every transaction commit does a write, but fsync is called only after N transactions have accumulated

Therefore, when I/O is the bottleneck, setting sync_binlog to a larger value can improve performance. In real workloads, to keep the amount of log that can be lost under control, setting this parameter to 0 is generally not recommended; a value between 100 and 1000 is the usual recommendation.

The corresponding risk of setting sync_binlog to N is that, if the host restarts abnormally, the binlogs of the most recent N transactions may be lost.
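To make the trade-off concrete, here is a small simulation of the counting rule. It models only the rule as stated in this article (the function name and return shape are invented for the example), and for sync_binlog=0 it ignores whatever flushing the operating system does on its own.

```python
from typing import Tuple


def simulate_sync_binlog(sync_binlog: int, commits: int) -> Tuple[int, int]:
    """Return (fsync_calls, commits_only_in_page_cache) after `commits` commits."""
    fsync_calls = 0
    unsynced = 0
    for _ in range(commits):
        unsynced += 1                        # every commit does a write
        if sync_binlog > 0 and unsynced >= sync_binlog:
            fsync_calls += 1                 # fsync covers all accumulated commits
            unsynced = 0
    return fsync_calls, unsynced


double_one = simulate_sync_binlog(1, 250)
relaxed = simulate_sync_binlog(100, 250)
never = simulate_sync_binlog(0, 250)
```

With 250 commits, sync_binlog=1 issues 250 fsyncs and leaves nothing at risk, while sync_binlog=100 issues only 2 fsyncs but leaves 50 commits in the page cache, which is what a host crash at that moment would lose.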

Redo log writing mechanism

Next, let's talk about the writing mechanism of redo log.

During the execution of the transaction, the generated redo log must be written to the redo log buffer first.

  • Does the content of the redo log buffer need to be persisted to disk as soon as it is generated?
    No.

If MySQL restarts abnormally while the transaction is executing, this part of the log is lost. Since the transaction has not been committed, losing it does no harm.

  • Can part of the log in the redo log buffer be persisted to disk before the transaction is committed?
    Yes.

To see why, start with the three possible states of the redo log, which correspond to the three colored blocks in Figure 2.

Figure 2: MySQL redo log storage states (diagram not included)

The three states are:

  • In the redo log buffer, physically in the MySQL process memory
  • Written to disk (write) but not yet persisted (fsync), physically in the page cache of the file system
  • Persisted to disk, that is, on the hard disk

Writing logs to the redo log buffer is fast, and a write to the page cache is nearly as fast, but persisting to disk is much slower.

InnoDB provides the innodb_flush_log_at_trx_commit parameter to control the redo log write strategy:

  • 0: on each transaction commit, the redo log is only left in the redo log buffer
  • 1: on each transaction commit, the redo log is persisted directly to disk
  • 2: on each transaction commit, the redo log is only written to the page cache
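The practical consequence of the three values is which kind of failure a just-committed transaction survives. The sketch below encodes that reasoning as a lookup; it describes the worst case right after commit, ignores any later background flushing, and the names LOCATION_AFTER_COMMIT and survives are invented for the example.

```python
# Where the redo log of a transaction sits immediately after commit,
# keyed by the value of innodb_flush_log_at_trx_commit.
LOCATION_AFTER_COMMIT = {
    0: "redo log buffer",   # MySQL process memory
    1: "disk",              # fsync has completed
    2: "page cache",        # written, but not fsynced
}


def survives(setting: int, failure: str) -> bool:
    """failure is 'mysql_crash' (process dies, OS keeps running)
    or 'host_crash' (power loss or OS crash)."""
    location = LOCATION_AFTER_COMMIT[setting]
    if failure == "mysql_crash":
        # Process memory is gone; the OS page cache and the disk are intact.
        return location in ("page cache", "disk")
    if failure == "host_crash":
        # Only data that was fsynced is still there.
        return location == "disk"
    raise ValueError(f"unknown failure type: {failure}")
```

For example, survives(2, "mysql_crash") is True while survives(2, "host_crash") is False, which is why a value of 2 is often described as safe against a MySQL crash but not against a power failure.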

InnoDB has a background thread that, once per second, takes the log in the redo log buffer, calls write to put it in the page cache of the file system, and then calls fsync to persist it to disk.

The redo log generated in the middle of a transaction is also written directly to the redo log buffer, so it too is persisted to disk by the background thread. In other words, the redo log of an uncommitted transaction may already have been persisted to disk.

In addition to the polling operation performed by the background thread every second, there are two scenarios where the redo log of an uncommitted transaction will be written to disk:

  • The space used in the redo log buffer is about to reach half of innodb_log_buffer_size, at which point the background thread proactively writes to disk.
    Because the transaction has not been committed, this is only a write without an fsync, so the log stays in the page cache of the file system.
  • Another transaction commits concurrently and, as a side effect, persists this transaction's redo log buffer to disk.
    Suppose transaction A is halfway through and has written some redo log to the buffer when transaction B, on another thread, commits. If innodb_flush_log_at_trx_commit is 1, transaction B must persist everything in the redo log buffer to disk, and transaction A's log in the buffer is persisted along with it.

In the two-phase commit process, the redo log is prepared first, then the binlog is written, and finally the redo log is committed.

If innodb_flush_log_at_trx_commit is set to 1, the redo log is persisted once in the prepare phase, because crash recovery relies on the prepared redo log plus the binlog.
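The article does not spell out the recovery rule, so the sketch below fills it in under an explicit assumption: a transaction whose redo log is already committed is kept, and a transaction whose redo log is only prepared is kept if and only if its binlog is complete. The function name and the string states are hypothetical.

```python
def recovery_decision(redo_state: str, binlog_complete: bool) -> str:
    """Decide the fate of a transaction found in the redo log after a crash.

    redo_state is 'prepare' or 'commit' (assumed labels for this sketch).
    """
    if redo_state == "commit":
        return "commit"       # already fully committed before the crash
    if redo_state == "prepare":
        # Prepared redo log plus a complete binlog is enough to commit;
        # without the binlog the transaction is rolled back.
        return "commit" if binlog_complete else "rollback"
    raise ValueError(f"unexpected redo state: {redo_state}")
```

Under this rule the prepared redo log has to be on disk before the binlog is, otherwise a crash could leave a complete binlog with no redo log to recover from.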

Given the once-per-second background flush and the crash recovery logic, InnoDB considers that the redo log does not need an fsync at commit; writing it to the page cache of the file system is enough.

The commonly cited "double 1" configuration of MySQL means that sync_binlog and innodb_flush_log_at_trx_commit are both set to 1. In other words, before a transaction is fully committed it has to wait for two disk flushes:

  • redo log (prepare phase)
  • binlog

That would mean that if MySQL shows 20,000 TPS, it flushes to disk 40,000 times per second. But if a benchmark tool shows that the disk can only sustain about 20,000 writes per second, how can 20,000 TPS be reached?

The answer lies in group commit.

Log sequence number (LSN)

The LSN increases monotonically and marks the write position in the redo log. Each time a redo log record of a given length is written, that length is added to the LSN.

The LSN is also written to InnoDB data pages, to ensure that the same redo log is not applied to a data page more than once.

Figure 3 shows three concurrent transactions (trx1, trx2, trx3) in the prepare phase that have all finished writing to the redo log buffer and are now being persisted to disk. Their LSNs are 50, 120, and 160, respectively.

Figure 3: redo log group commit (diagram not included)

  • trx1 arrived first and was selected as the leader of this group
  • When trx1 is about to start writing, there are already three transactions in the group, and the LSN becomes 160.
  • When trx1 writes to the disk, it brings LSN=160. Therefore, when trx1 returns, all redo logs with LSN≤160 have been persisted to disk
  • At this time, trx2 and trx3 can return directly
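The walk-through above can be replayed with a toy model. The class below is a sketch, not InnoDB's code: it assumes the LSN starts at 0 and that the three transactions write 50, 70, and 40 bytes so that their LSNs come out as 50, 120, and 160; the class and method names are invented.

```python
class RedoLogGroupCommit:
    """Toy model of LSN-based group commit (hypothetical names)."""

    def __init__(self):
        self.buffer_lsn = 0     # highest LSN written to the redo log buffer
        self.flushed_lsn = 0    # highest LSN known to be persisted
        self.fsync_calls = 0

    def write_to_buffer(self, length: int) -> int:
        self.buffer_lsn += length          # LSN advances by the record length
        return self.buffer_lsn

    def flush_up_to(self, lsn: int) -> bool:
        """Return True if this caller had to fsync (it acted as the leader)."""
        if lsn <= self.flushed_lsn:
            return False                   # someone else already persisted it
        self.flushed_lsn = self.buffer_lsn # leader persists the whole buffer
        self.fsync_calls += 1
        return True


log = RedoLogGroupCommit()
trx1 = log.write_to_buffer(50)    # LSN 50
trx2 = log.write_to_buffer(70)    # LSN 120
trx3 = log.write_to_buffer(40)    # LSN 160

trx1_was_leader = log.flush_up_to(trx1)   # persists everything up to 160
trx2_was_leader = log.flush_up_to(trx2)   # already covered, returns at once
trx3_was_leader = log.flush_up_to(trx3)   # already covered, returns at once
```

Three transactions end up costing a single fsync, because the leader carries the latest LSN with it when it writes.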

Therefore, in a group commit, the more members the group has, the more disk IOPS is saved. In a single-threaded stress test, however, each transaction corresponds to exactly one persistence operation.
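The earlier 20,000 TPS question can now be answered with rough arithmetic. The function below is a simplified model that assumes both logs are flushed with the same average group size, which is an assumption made only to keep the example short.

```python
import math


def flushes_per_second(tps: int, avg_group_size: int,
                       flushes_per_commit: int = 2) -> int:
    """Estimate fsync calls per second under "double 1" with group commit.

    flushes_per_commit=2 stands for the redo log (prepare) and binlog flushes.
    """
    groups = math.ceil(tps / avg_group_size)   # one fsync serves a whole group
    return groups * flushes_per_commit


single_threaded = flushes_per_second(20000, avg_group_size=1)
grouped = flushes_per_second(20000, avg_group_size=4)
```

With no grouping the model gives 40,000 flushes per second, while an average group of 4 brings it down to 10,000, well within a disk that can sustain about 20,000.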

In a concurrent update scenario, once the first transaction has finished writing to the redo log buffer, the later the fsync is called, the more members the group is likely to have and the more IOPS is saved.

To let a single fsync cover more group members, MySQL uses an optimization: stalling for time.

Figure 4: two-phase commit (diagram not included)

Writing the binlog actually takes two steps:

  1. Write the binlog from the binlog cache to the binlog file on disk
  2. Call fsync to persist it

To make group commit more effective, MySQL postpones the redo log fsync until after step 1:

Figure 5: two-phase commit, refined (diagram not included)

This way, the binlog can also be group committed. In step 4 of the figure above, if the binlogs of several transactions have already been written, they are persisted together, which also reduces IOPS.
However, step 3 usually executes very quickly, so the interval between the binlog write and its fsync is short and few binlogs accumulate to be persisted together. As a result, binlog group commit is usually less effective than redo log group commit.
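Since the refined diagram is missing, the step order can be reconstructed from the surrounding text. The list below is that reconstruction, not a copy of the figure: the step labels are my own wording, and the helper steps_between is invented to show how narrow the window between the binlog write and its fsync is.

```python
# Reconstructed order of the refined two-phase commit (steps 1 to 5).
REFINED_COMMIT_STEPS = [
    ("redo log", "prepare: write"),   # step 1
    ("binlog", "write"),              # step 2
    ("redo log", "prepare: fsync"),   # step 3, postponed until after step 2
    ("binlog", "fsync"),              # step 4
    ("redo log", "commit: write"),    # step 5, page cache only
]


def steps_between(first: tuple, second: tuple) -> list:
    """Return the steps that run strictly between two given steps."""
    start = REFINED_COMMIT_STEPS.index(first)
    end = REFINED_COMMIT_STEPS.index(second)
    return REFINED_COMMIT_STEPS[start + 1:end]


binlog_window = steps_between(("binlog", "write"), ("binlog", "fsync"))
```

Only the redo log fsync sits between the binlog write and the binlog fsync, so other transactions have just that one step's duration to join the binlog group.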

To make binlog group commit more effective, you can set:

  • binlog_group_commit_sync_delay
    How many microseconds to wait before calling fsync
  • binlog_group_commit_sync_no_delay_count
    How many transactions to accumulate before calling fsync

These two conditions are in an OR relationship, that is, as long as one of the conditions is met, fsync will be called.

Note that when binlog_group_commit_sync_delay is 0, binlog_group_commit_sync_no_delay_count has no effect.
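The OR relationship between the two parameters, including the special case of a zero delay, can be written as one small predicate. This is a sketch of the decision rule only; the argument names waited_us and pending_transactions are hypothetical and do not correspond to MySQL internals.

```python
def should_fsync_now(waited_us: int, pending_transactions: int,
                     sync_delay_us: int, no_delay_count: int) -> bool:
    """Decide whether the binlog group should be fsynced right now.

    sync_delay_us   models binlog_group_commit_sync_delay
    no_delay_count  models binlog_group_commit_sync_no_delay_count
    """
    if sync_delay_us == 0:
        return True      # no deliberate wait, so the count is never consulted
    if waited_us >= sync_delay_us:
        return True      # waited long enough
    if no_delay_count > 0 and pending_transactions >= no_delay_count:
        return True      # enough transactions accumulated, stop waiting
    return False
```

For instance, with a 100 microsecond delay and a count of 10, a group that has waited 50 microseconds with 3 transactions keeps waiting, but it is flushed as soon as either the 100 microseconds elapse or the tenth transaction arrives.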

WAL is meant to reduce disk writes, yet the redo log and binlog must both be written on every transaction commit, so has the number of disk reads and writes really gone down? It should now be clear that WAL benefits mainly from two things:

  • Both the redo log and the binlog are written sequentially, and sequential disk writes are faster than random writes
  • The group commit mechanism, which can greatly reduce disk IOPS

Therefore, if MySQL hits an I/O performance bottleneck, the following options can help:

  • Set binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count to reduce the number of binlog disk writes.
    This works by deliberately waiting, so it may increase statement response time, but it does not lose data.
  • Set sync_binlog to a value greater than 1 (100 to 1000 is recommended).
    The risk is that binlog is lost if the host loses power.
  • Set innodb_flush_log_at_trx_commit to 2.
    The risk is that data is lost if the host loses power.

Setting innodb_flush_log_at_trx_commit to 0 is not recommended. With that value the redo log is kept only in memory, so an abnormal restart of MySQL itself also loses data, which is too risky. Writing the redo log to the page cache of the file system is very fast, so a value of 2 performs about the same as 0, yet no data is lost when MySQL restarts abnormally.

Summary

This article covered how MySQL ensures that the redo log and binlog are complete.

crash-safe

After executing an update, I run hexdump to look at the ibd file directly. Why don't I see any change in the data?
Most likely because of WAL. After the update statement finishes, InnoDB only guarantees that the redo log and memory have been written; the data may not have been written to disk yet.

Why is the binlog cache maintained per thread, while the redo log buffer is shared globally?
The binlog cannot be "interrupted". The binlog of a transaction must be written contiguously, so it is written to the file in one piece after the whole transaction completes.
The redo log has no such requirement, so log generated along the way can be written to the redo log buffer. Content in the redo log buffer can also get a "free ride" and be written to disk when another transaction commits.

Suppose a transaction is still executing and has not reached the commit stage. If a crash occurs, its redo log is certainly lost. Will this cause the primary and the standby to become inconsistent?
No. At that point the binlog is still in the binlog cache and has not been sent to the standby database. After the crash both the redo log and the binlog are gone, and from a business perspective the transaction was never committed, so the data stays consistent.

If a crash occurs after the binlog is written, MySQL restarts without having replied to the client. When the client reconnects, it finds that the transaction was committed successfully. Is this a bug?
No. Imagine a more extreme scenario: the entire transaction commits successfully, the redo log commit completes, and the standby database receives and applies the binlog, but the network between the primary and the client drops, so the success packet cannot be returned. The client gets a "network disconnected" exception in this case too. This can only be treated as a successful transaction, not as a bug.

In fact, what the database's crash-safe guarantee promises is:

  • If the client receives a message that the transaction is successful, the transaction must be persistent
  • If the client receives a transaction failure (such as primary key conflict, rollback, etc.), the transaction must have failed
  • If the client receives an "execution exception" message, the application needs to reconnect and continue the subsequent logic by querying the current state. At this time, the DB only needs to ensure the internal consistency (between the data and the log, between the main database and the standby database).
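From the application side, the three cases above translate into three different follow-up actions. The sketch below is one hypothetical way to encode them; the enum and the returned descriptions are invented for the example and are not part of any MySQL client library.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"        # client received a success message
    FAILURE = "failure"        # e.g. primary key conflict or rollback
    EXCEPTION = "exception"    # e.g. connection lost before any reply


def client_next_step(outcome: Outcome) -> str:
    """Map what the client observed to what it may safely assume or do."""
    if outcome is Outcome.SUCCESS:
        return "treat the transaction as durable"
    if outcome is Outcome.FAILURE:
        return "treat the transaction as not applied"
    # No reply means the result is unknown: it may or may not have committed.
    return "reconnect and query the current state before continuing"
```

The important case is the last one: an exception says nothing about whether the transaction committed, so the application has to check instead of blindly retrying.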