An in-depth look at how databases store and read data

Day-to-day development almost always involves a database in some way. So how should data be stored in a database so that it can be read and written efficiently? This article takes a detailed look at how databases store and retrieve data.

The simplest database

Let's first look at one of the simplest possible databases: a key-value store implemented entirely with Bash functions.
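The original script isn't reproduced here, but the classic two-function version of such a store looks like this (a sketch; the log file name `database` is an assumption):

```shell
#!/bin/bash

# Append-only key-value store: one "key,value" line per write.
db_set () {
    echo "$1,$2" >> database
}

# Return the value from the *last* matching line, so newer writes win.
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```

Because `db_get` takes the last match, repeatedly setting the same key always returns the most recent value.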

There are two functions here. One is the write function, db_set(), which simply appends a key-value pair. The other is db_get(), which reads the most recently written value for a key.

We can use it like this. Here we write two key-value pairs: the key 123456, whose value is the JSON document '{"name":"San Francisco","attractions":["Exploratorium"]}', and the key 42, whose value is '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'. Finally we call db_get() to fetch the value for 42. This implementation is straightforward and self-explanatory.

So what happens if the same key is set multiple times? As shown in the following figure:

We can see that 42 has been overwritten. The simplest implementation just appends another line at the end, and reads scan from back to front, so the most recently written value wins.

In fact, real databases use a similar structure: an ever-growing, append-only log file that we keep writing to. A real system of course has to deal with many more complications, such as crashes in the middle of a write or concurrent writers, but the underlying idea is the same.

An astute reader may ask: what if I want to delete a key? Generally speaking, deletion appends a special record (a tombstone) to the end of the log; when a scan encounters this record, it knows the key has been deleted.
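The tombstone idea can be sketched in a few lines of Python; here an in-memory list stands in for the log file, and `None` is the (arbitrary) tombstone marker:

```python
TOMBSTONE = None  # special marker meaning "this key was deleted"

log = []  # append-only log of (key, value) records


def db_set(key, value):
    log.append((key, value))


def db_delete(key):
    # Deletion is just another append: a tombstone record.
    log.append((key, TOMBSTONE))


def db_get(key):
    # Scan from newest to oldest; the first match wins.
    for k, v in reversed(log):
        if k == key:
            return None if v is TOMBSTONE else v
    return None
```

A deleted key and a key that never existed both return `None`; a real engine distinguishes them during compaction, when tombstones let it drop older values for that key.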

Of course, another problem with this implementation is that the file keeps growing as we write; even updates to the same key make it longer, which eventually makes the file size itself a problem. The usual solution is a merge (compaction) operation that keeps only the latest record for each key. When the number of distinct keys is relatively small, this greatly reduces the size of the file.

There is another question you might be curious about: why keep appending to the end instead of updating records in place? This is a very good question, and there are several reasons:

  1. Appending to the end takes full advantage of the HDD's sequential write performance. If you have studied how HDDs read and write, you will know that sequential writes are far faster than random writes, mainly because the disk head does not have to keep seeking to random positions. Even on SSDs, sequential writes extend the drive's lifespan compared to random writes.
  2. Concurrency and crash recovery become much simpler. You don't need to worry much about a crash in the middle of a write: with a simple checksum you can detect which trailing records are incomplete and discard them, instead of worrying that some record was left half-updated in place.

In fact, the write performance of this implementation is quite good, because all writes go to the end of the file. But once the data grows large, reads become a problem: to find a key you have to scan backwards through the whole file, which is O(n). Is there a good way to speed this up? The answer is obviously yes: the hash index, which we discuss next.

Hash index

Taking the database above as an example, the file lives on disk and every record has an offset within it. If we keep an in-memory hash table that maps each key to its offset, read efficiency obviously improves, as shown in the following figure:

With this hash table we know that the record for 42 is at offset 64, so we can read the value directly instead of scanning from back to front. There is one problem with this hash index: it lives in memory, so after a crash we would have to rebuild it by re-scanning the file on disk. A better approach is to also persist a snapshot of the hash index to disk, so that recovery after a failure is fast.
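A minimal sketch of this in Python: each `db_set` records the byte offset where the record begins, and `db_get` seeks straight to it (the `key,value\n` line format mirrors the Bash store; names are illustrative):

```python
index = {}  # in-memory hash index: key -> byte offset of the record


def db_set(path, key, value):
    record = f"{key},{value}\n".encode("utf-8")
    with open(path, "ab") as f:
        index[key] = f.tell()   # byte offset where this record begins
        f.write(record)


def db_get(path, key):
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])      # jump straight to the record
        line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]
```

Note that the index always points at the *latest* write for a key, because each `db_set` overwrites the previous offset.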

This hash index sounds wonderful at first, but if you think about it, you will find that it has several problems:

  1. If there are many keys, the hash index has to hold a correspondingly large number of entries in memory.
  2. Range queries are very inefficient. For example, to find all values between key001 and key002, you have to look up every key in the range individually.

So how to solve these problems? Let's take a look at SSTable and LSM-Tree.

SSTable and LSM-Tree

In the database implementation described above, each record is a key-value pair, and the relative order of different keys doesn't matter: whether the record with key=42 or the record with key=53 comes first does not affect the final result. This means we are free to sort records by key when merging segments. A segment file sorted this way is called a Sorted String Table, or SSTable for short.

Its implementation is also very simple; the process is as follows:

  1. When a write arrives, it first goes into an in-memory balanced tree structure (such as a red-black tree), which we call the memtable. We write to memory first because, generally speaking, maintaining a sorted structure in memory is much easier than maintaining a sorted file on disk.
  2. When the memtable grows past a certain threshold, it is written out to disk. Because the memtable itself is sorted, writing it to disk in sorted order is easy. We call each such write a segment.
  3. When a read arrives, we first check the memtable, then each segment in turn, from newest to oldest.
  4. In the background, a separate thread merges the segments, as shown in the following figure:
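The write/flush/read steps above can be sketched as follows. This is a toy, not the real algorithm: a plain dict stands in for the red-black tree (sorting only at flush time), in-memory lists stand in for segment files, and the threshold is tiny for illustration:

```python
import bisect

MEMTABLE_LIMIT = 4          # flush threshold (tiny, for illustration)

memtable = {}               # stands in for the sorted in-memory tree
segments = []               # flushed "segment files", newest first


def db_set(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: write the memtable out as a sorted segment.
        segments.insert(0, sorted(memtable.items()))
        memtable.clear()


def db_get(key):
    if key in memtable:                 # 1. check the memtable first
        return memtable[key]
    for seg in segments:                # 2. then segments, newest first
        i = bisect.bisect_left(seg, (key,))
        if i < len(seg) and seg[i][0] == key:
            return seg[i][1]
    return None
```

Each segment is sorted, so `bisect` finds a key in O(log n) within a segment rather than scanning it linearly.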

Once this is in place, our index does not even need to contain every key; a sparse index covering key ranges is enough. As shown in the figure below, if you want to find "handful", the index does not contain that key, but you know it must lie between "handbag" and "handsome". So you start scanning from the offset of "handbag" up to the offset of "handsome", and return the value if you find it. As long as the ranges are set reasonably, this short scan is still very efficient. A storage engine built this way is what we commonly call a Log-Structured Merge-Tree (LSM-Tree).
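A sketch of the sparse-index lookup in Python, using the "handbag"/"handful"/"handsome" keys from the example; the values are made up, and list positions stand in for byte offsets:

```python
import bisect

# Sparse index: one (key, offset) entry per block of the sorted segment.
sparse_index = [("handbag", 0), ("handsome", 4)]

# The sorted segment itself (a list stands in for the on-disk file).
segment = [("handbag", "8786"), ("handcuffs", "2729"),
           ("handful", "44662"), ("handicap", "70836"),
           ("handsome", "86478")]


def db_get(key):
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None                      # key sorts before the first entry
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else len(segment)
    for k, v in segment[start:end]:      # short scan within one block
        if k == key:
            return v
    return None
```

The binary search over the sparse index narrows the search to one block, and only that small block is scanned linearly.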

Of course, the LSM-Tree still leaves room for optimization. For example, looking up a key that does not exist is relatively slow: you must check the memtable and then every segment. A common optimization is to maintain an in-memory structure (typically a Bloom filter) that can determine whether a key might exist; if it definitely does not, no traversal is needed. Another area worth studying is when, and in what order, the background merge should run. Broadly there are two approaches: one merges smaller, newer segments into larger segments (size-tiered compaction); the other merges by key range (leveled compaction).
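The in-memory membership structure just mentioned is usually a Bloom filter: a "no" answer is definite, while a "maybe" still requires a real lookup. A minimal sketch (bit-array size and hashing scheme are arbitrary choices for illustration):

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: 'no' is definite, 'maybe' needs a real lookup."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several independent bit positions from one key.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```

The filter can report false positives (it says "maybe" for a key never added) but never false negatives, which is exactly what is needed to skip segment scans safely.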

B-Tree

The LSM-Tree described above is very good, but it is not actually the most widely used index structure in practice. That is the B-Tree.

The only similarity between the B-Tree and the LSM-Tree is that it also keeps keys in sorted order. Whereas an LSM-Tree divides data into segments of varying sizes, a B-Tree divides the database into fixed-size blocks or pages, generally 4 KB, and each read or write operates on one page. This design mirrors the design of the underlying disk.

As shown in the figure below, each page is identified by an address, so one page can refer to another, much like a pointer, forming a hierarchical structure. For example, to look up 251 we first find the interval 200 to 300 that contains it, then the interval 250 to 270 at the next level, and finally 251 itself at the level below that.

The number of child pages each page refers to is called the branching factor, which is 6 in our example. In practice it is determined by the page and key sizes and is typically on the order of several hundred. The bottom-level pages where the data is finally stored are called leaf pages.
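The descent through the levels can be sketched in Python. This is a simplified model (no fixed page size, no splitting): internal pages hold separator keys and child pointers, leaf pages hold keys and values:

```python
import bisect


class Page:
    """A simplified B-Tree page: internal (children) or leaf (values)."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted separator keys / leaf keys
        self.children = children    # internal: len(children) == len(keys) + 1
        self.values = values        # leaf: values[i] belongs to keys[i]


def lookup(page, key):
    while page.children is not None:            # descend internal pages
        i = bisect.bisect_right(page.keys, key)
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)      # search within the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None
```

Each step of the loop corresponds to following one page pointer down a level, so a lookup touches only as many pages as the tree has levels.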

Updating a value is very simple: find the page containing it, update it in place, and everything else stays unchanged. Inserting a value, however, can be a problem: the data may exceed the size of the page, in which case the page must be split into two, as shown in the following figure:

Such an algorithm can basically meet the requirements of most databases today. For example, with a branching factor of 500 and 4 KB pages, a four-level tree can address roughly 500^4 pages x 4 KB, or about 256 TB of data.

There is a problem with this structure: every write rewrites a page in place, and a crash halfway through that write would leave us with a half-written page. To guard against this, databases usually keep an append-only log file, a write-ahead log, that records each modification before it is applied to the B-Tree itself. If something goes wrong while updating the B-Tree, the data can be restored from this log file.
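A sketch of the write-ahead idea in Python, with a plain dict standing in for the B-Tree and a JSON-lines file as a hypothetical log format:

```python
import json
import os


def apply_update(tree, wal_path, key, value):
    """Log the change durably *before* touching the tree."""
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())      # on disk before we modify any page
    tree[key] = value               # the "B-Tree" is just a dict here


def recover(wal_path):
    """Rebuild tree state by replaying the log after a crash."""
    tree = {}
    if os.path.exists(wal_path):
        with open(wal_path, encoding="utf-8") as wal:
            for line in wal:
                rec = json.loads(line)
                tree[rec["key"]] = rec["value"]
    return tree
```

Because the log entry is fsync'd before the in-place update, a crash at any point leaves either the old state or enough information in the log to replay the new one.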

Of course, besides this method there are other ways to guard against crashes, such as maintaining two copies of a page: write the update to the new copy, and only after it succeeds, switch the pointer over to it.

B-Trees have been around for a long time, and many optimizations have been developed. For example, an engine can try to lay out pages sequentially on disk as much as possible, so that scans become sequential reads and access efficiency improves. It can also add forward and backward pointers to each leaf page, pointing to its neighbors in sort order, so that range queries become more efficient, and so on.
