Boom! Migrating hundreds of billions of records from MySQL to MongoDB: a cost-saving and performance-optimization practice

An online IoT core business cluster previously used MySQL as its primary storage database. As the business kept growing, MySQL could no longer meet the demand for massive data storage, and the business faced capacity pain points, cost pain points, and data-imbalance problems.

After the business was migrated to MongoDB, the same data consumed far less memory, CPU, and disk; the capacity and data-imbalance pain points were solved completely, and performance improved as well. The MySQL data volume at migration time was 40 billion records, and three months later the corresponding MongoDB cluster has grown to 100 billion. If the cost saving is scaled in proportion to 100 billion records, the actual saving is even higher.

At present, many MongoDB articles and performance figures in China are still based on the early MMAPv1 storage engine. In fact, since MongoDB 3.x the default storage engine has been WiredTiger, which offers higher performance, a higher compression ratio, and finer-grained locking, so its performance and cost advantages over MMAPv1 are significant.

1. Business migration background

Before the MongoDB migration the business held about 40 billion records and had provisioned 64 MySQL clusters. The business used ShardingJDBC to split the data in advance into 64 databases with 100 tables each. Master-slave high-availability failover relied on the open-source Orchestrator. The MySQL architecture is shown in the figure below:

img

Note: red in the figure above indicates a disk alarm; disk usage on many nodes was approaching 100%.

As shown above, the business provisioned 64 MySQL clusters in one go more than a year ago, each with one master and three slaves. The node specifications were as follows:

  • CPU: 4 cores
  • Memory: 16 GB
  • Disk: 500 GB
  • Total number of nodes: 64 × 4 = 256
  • SSD servers

After running for a little over a year, the total data volume reached 40 billion records and was growing by about 20 billion per month. Because of data imbalance and other issues, some clusters held far more data than others and kept running out of disk. With so many nodes, more and more of them hit the disk bottleneck, and the DBAs kept expanding node disks to cope. Both the business and the DBAs faced serious pain points, mainly:

  • *Data imbalance problem*
  • *Node capacity problem*
  • *Continuous increase in costs*
  • *DBA workload increased sharply (some disks could not be expanded further, so data had to be migrated to new nodes), and the business team was constantly anxious*


2. Why MongoDB: a summary of its ten core advantages

After hitting the bottleneck, and given MongoDB's existing adoption within the company, the business team began evaluating MongoDB. Through discussions with the business we learned that its usage scenarios are common insert, delete, update, query, and sort operations, and that the query conditions are relatively fixed, so MongoDB is a good fit.

In addition, compared with traditional open-source databases, MongoDB has the following core advantages:

  • *Advantage 1: Schema-free model*

MongoDB is schema-free; the data format is not strictly constrained. This business's data structure is relatively fixed, so it does not actually use this capability, but that does not prevent it from storing structured data in MongoDB.
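
As a minimal mongo shell sketch (the collection and field names are made up for illustration), documents of different shapes can coexist in one collection, while a business with a fixed structure simply writes uniform documents:

db.device_data.insertOne({ ssoid: "dev-001", temperature: 21.5, reportTime: new Date() })
// a later document may add or omit fields without any ALTER TABLE equivalent
db.device_data.insertOne({ ssoid: "dev-002", humidity: 63, firmware: "v2.1", reportTime: new Date() })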

  • *Advantage 2: Native high-availability support*

MySQL relies on third-party components to achieve high availability, whereas the members of a MongoDB replica set natively provide high availability through a Raft-based election protocol, reducing the dependence on third-party components compared with MySQL.

  • *Advantage 3: Distributed architecture, solving the pain points of database/table splitting and massive data storage*

MongoDB is a distributed database, which neatly solves MySQL's database/table-splitting and massive-storage pain points. The business no longer needs to estimate in advance how many databases and tables to split into; to the business, a MongoDB collection is effectively an unbounded table (our largest collection currently stores hundreds of billions of documents with no impact on query performance).

In addition, data volume is usually small in the early stage of a business, so a single-shard MongoDB cluster is enough to start with. With MySQL, as in this IoT business, the largest-capacity cluster layout had to be requested up front, which wastes resources badly while the data volume is still small. A sketch of scaling out on demand follows below.
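
As a hedged mongo shell sketch (the database name, replica set name, and hosts are placeholders), capacity can be scaled out on demand by enabling sharding and attaching additional shards later, with no re-splitting on the business side:

sh.enableSharding("iot")
// later, when more capacity is needed, add another shard replica set online
sh.addShard("shard4_rs/hostA:27017,hostB:27017,hostC:27017")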

  • *Advantage 4: Complete data-balancing mechanism, multiple sharding strategies, and multiple shard-key types*

Balancing: supports automatic balancing, manual balancing, and restricting balancing to a configurable time window (a balancer-window sketch follows below).

Sharding strategy: supports range sharding, hashed sharding, and pre-splitting.

Shard key types: supports single-field shard keys and compound (multi-field) shard keys.
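
As a hedged illustration of the balancing window (the 02:00-06:00 window is only an example), the balancer can be toggled and restricted to off-peak hours from the mongo shell:

sh.setBalancerState(true)   // enable or disable automatic balancing
db.getSiblingDB("config").settings.update(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },   // only migrate chunks during this window
  { upsert: true }
)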

  • *Advantage 5: Tunable levels of data consistency and durability*

MongoDB is designed to support different Read Concern and Write Concern levels for different consistency requirements, and the client can set them according to its actual needs. In addition, the MongoDB kernel has a complete rollback mechanism.
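
A minimal mongo shell sketch (collection name assumed) of tuning read/write concern per operation:

db.device_data.insertOne(
  { ssoid: "dev-001", temperature: 21.5 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }   // wait until a majority of members acknowledge
)
db.device_data.find({ ssoid: "dev-001" }).readConcern("majority")   // only return majority-committed data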

  • *Advantage 6: High concurrency, high performance*

To support large-scale, highly concurrent business reads and writes, MongoDB has made many detailed optimizations in its threading model, concurrency control, and high-performance storage engine.

  • *Advantage 7: WiredTiger high-performance storage engine design*

Many comments online are still based on the early MMAPv1 storage engine. Compared with MMAPv1, the WiredTiger engine offers better performance, a higher compression ratio, and finer-grained locking. In detail:

  • Provides low latency and high throughput
  • Handles data sets much larger than memory without degrading performance or exhausting resources
  • Recovers quickly to the last checkpoint after a system failure
  • Supports PB-level data storage
  • Multi-threaded architecture that favors optimistic concurrency control to reduce lock operations
  • Hot-cache capability
  • Maximizes disk IO utilization
  • And more

For more WT storage engine design details, please refer to:

http://source.wiredtiger.com/3.2.1/architecture.html

  • *Advantage 8: Cost saving through the WT engine's high compression ratios*

MongoDB supports the snappy and zlib algorithms for data compression. Comparing the real data size with the actual disk consumption observed online in the past, we draw the following conclusions:

  • MongoDB's default snappy compression algorithm yields a compression ratio of roughly 2.2-4.5x
  • The zlib compression algorithm yields roughly 4.5-7.5x (this migration uses the zlib high-compression algorithm)
img
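
A hedged sketch (collection name assumed) of creating a collection with the zlib block compressor, as used in this migration; the compressor can also be set cluster-wide in the storage engine configuration:

db.createCollection("device_data", {
  storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})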

In addition, based on actual disk-consumption statistics from existing online migrations from MySQL and Elasticsearch to MongoDB, for the same data the disk usage ratio of MongoDB : MySQL : ES is roughly 1 : 3.5 : 6, and the ratio varies somewhat with the data.

  • *Advantage 9: Native N-data-center (same-city or remote) active-active disaster recovery support*

MongoDB's native high-availability mechanism, together with proxy tag-aware routing, allows replica set members to be deployed across data centers, meeting multi-active disaster-recovery needs across N data centers in the same city or in different regions and achieving a "triple win" of cost, performance, and consistency. For more on multi-data-center disaster recovery, see the QCon talk:

OPPO trillion-level document database MongoDB cluster performance optimization practice

  • *Advantage 10: Complete client-side read routing strategies*

The client specifies its own access routing strategy via Read Preference, which supports five modes: primary, primaryPreferred, secondary, secondaryPreferred, and nearest.
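
A minimal sketch (URI and collection name are placeholders) of selecting the read routing on the client side:

// connection-string level: prefer the member with the lowest network latency
// mongodb://mongos1:27017,mongos2:27017/iot?readPreference=nearest
// per-query level in the mongo shell:
db.device_data.find({ ssoid: "dev-001" }).readPref("secondaryPreferred")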

  • *Distributed transaction support*

MongoDB 4.2 already supports distributed transactions; the current public release has iterated to 4.2.12, where distributed transaction functionality has been further strengthened. The MongoDB 4.4 roadmap also shows continued official investment in query capability and usability enhancements, such as union-style multi-collection queries and index hiding.
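
A hedged mongo shell sketch (database and collection names are made up) of a multi-document transaction as supported since 4.2:

const session = db.getMongo().startSession()
session.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } })
try {
  session.getDatabase("iot").device_data.insertOne({ ssoid: "dev-001", state: "online" })
  session.getDatabase("iot").device_audit.insertOne({ ssoid: "dev-001", event: "state-change" })
  session.commitTransaction()   // both writes become visible atomically
} catch (e) {
  session.abortTransaction()
  throw e
}
session.endSession()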


3. Mongodb resource evaluation and deployment architecture

When the business began migrating to MongoDB, after aligning with the business team we summarized the cluster scale and requirements as follows:

  • Existing data volume: about 40 billion records
  • Total data disk consumption: about 30 TB
  • Peak read/write traffic: about 40,000-50,000 ops/s, which is quite low
  • Active-active disaster recovery across two data centers in the same city
  • Read/write separation
  • Estimated growth of 20 billion records per month
  • Must accommodate 150 billion records within a few months

*Note:* data volume and disk consumption are calculated per single copy. For example, MySQL has 64 shards with 256 replica nodes in total; its data volume is the sum of the data on the 64 master nodes, and its disk consumption is the sum of the disk used by the 64 shard master nodes.

3.1 mongodb resource evaluation

The evaluation and selection process for the number of shards and the storage-node specifications was as follows:

  • *Memory Evaluation*

Our company deploys everything in containers. From past experience, MongoDB's memory consumption is not high; the largest single-container memory in our historical tens-of-billions-scale MongoDB clusters has been 64 GB, so the memory specification was set to 64 GB.

  • *Sharding Evaluation*

Peak business traffic is 30,000-50,000 ops/s. Considering that peaks may grow later, we sized for 100,000 writes/s plus 50,000 reads/s, i.e. a peak of 150,000 ops/s, which we estimated would require 4 shards.

  • *Disk evaluation*

MySQL holds 40 billion records consuming 30 TB of disk. From online migration experience, MongoDB's disk consumption with its default configuration is about 1/3-1/5 of MySQL's, so 40 billion records were estimated at roughly 8 TB in MongoDB. Scaling to 150 billion records across 4 shards, i.e. about 40 billion records per shard, each shard was estimated to consume around 8 TB of disk.

A single physical machine online has 10+ TB of disk, hundreds of GB of memory, and dozens of CPUs. To maximize server utilization we need to reserve part of the disk for other containers, and container packaging constraints ultimately limited a single node's disk to 7 TB. With 7 TB nodes, 4 shards are estimated to store about 150 billion records.

  • *CPU specification evaluation*

Due to container scheduling package constraints, the CPU allocation was capped at 16 cores (in practice nowhere near that many are used).

  • *Mongos proxy and config server specification evaluation*

In addition, a sharded cluster also includes mongos proxies and a config server replica set, so their node specifications must be evaluated too. The config server mainly stores routing metadata, so its disk, CPU, and memory consumption is very low; the mongos proxy only does routing and forwarding, so it mainly consumes CPU, and its memory and disk consumption is also low. To maximize cost savings, we decided to co-locate one mongos and one config server node in the same container, with the following specification:

8 CPU / 8 GB memory / 50 GB disk; one mongos proxy and one config server node share the same container.

*Shard (storage node) specification summary:* 4 shards / 16 CPU, 64 GB memory, 7 TB disk.

*Mongos and config server specification summary:* 8 CPU / 8 GB memory / 50 GB disk

img

3.2 Cluster deployment architecture

Since the city where this business runs has only two data centers, we adopted a 2+2+1 architecture (2 mongod + 2 mongod + 1 arbiter): 2 mongod nodes are deployed in data center A, 2 mongod nodes in data center B, and one low-spec arbiter (election-only) node in data center C, as shown below:

img

*Description:*

  • Each data center has 2 mongos proxies deployed so that proxy access stays highly available; if any proxy goes down, business in the corresponding data center is not affected.
  • If data center A goes down, data centers B and C still have 2 mongod + 1 arbiter, and a new primary is elected among the mongod nodes in data center B. The arbiter only votes in elections and consumes almost no resources.
  • The client is configured with nearest so that requests forwarded through the proxy go to the node with the lowest network latency, i.e. data is read from the storage node in the same data center.
  • Drawback: if the data centers were in different cities, business in data centers B and C would have cross-data-center writes; since A and B are in the same city, this drawback does not apply and same-city latency is negligible. (A replica set configuration sketch for this layout follows below.)
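
As a hedged illustration (the replica set name and host names are placeholders), one shard's replica set in this 2+2+1 layout could be initialized roughly as follows:

rs.initiate({
  _id: "iot_shard_0",
  members: [
    { _id: 0, host: "dcA-node1:27017" },                        // data center A
    { _id: 1, host: "dcA-node2:27017" },                        // data center A
    { _id: 2, host: "dcB-node1:27017" },                        // data center B
    { _id: 3, host: "dcB-node2:27017" },                        // data center B
    { _id: 4, host: "dcC-arbiter:27017", arbiterOnly: true }    // data center C, votes only
  ]
})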

4. Full + incremental business data migration

img

The migration was carried out by the business team itself using Alibaba's open-source DataX tool. For more details on the migration tool, see: https://github.com/alibaba/DataX

5. Performance optimization process

The cluster optimization proceeded in the following steps: pre-optimization before data migration started, bottleneck analysis and tuning during the migration, and performance tuning after the migration completed.

5.1 Preparations before data migration started

After confirming with the business that every record carries a device identifier, ssoid, and that queries and updates are keyed on ssoid (fetching one record or a batch of records for a device), we chose ssoid as the shard key.

  • *Sharding method*

To hash the data evenly across the 4 shards we chose hashed sharding, which spreads the data as evenly as possible while still placing all data for the same ssoid on the same shard, preserving query efficiency.

*Pre-splitting*

If a collection is sharded with a hashed shard key, it can be pre-split in advance, ensuring that data is written evenly across the shards from the start. Pre-splitting avoids the chunk migrations that would otherwise occur and maximizes write performance.

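// pre-split the hashed ssoid key into 8192 initial chunks spread evenly across the shards ("xxx.xxx" is the business namespace)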
sh.shardCollection("xxx.xxx", {ssoid:"hashed"}, false, {numInitialChunks: 8192})

*Note:* remember to create the hashed index on ssoid in advance, otherwise later shard expansion will be affected.
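
A minimal sketch of creating that hashed index up front (the collection name is assumed for illustration):

db.device_data.createIndex({ ssoid: "hashed" })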

*Nearest reads*

The client adds the nearest read preference so that it reads from the node closest to itself, ensuring read performance.

*mongos proxy configuration*

Clients in data center A are configured with only the mongos proxies in data center A, and clients in data center B with only those in data center B, combined with the nearest setting. This maximizes same-data-center reads and prevents clients from accessing proxies across data centers.

*Disable enableMajorityReadConcern*

After this feature is disabled, reads with ReadConcern majority will return an error. ReadConcern majority is mainly used to avoid dirty reads; after checking with the business, it is not needed here, so it can be disabled directly.

MongoDB enables enableMajorityReadConcern by default, and this feature has some performance impact. For details, see:

*MongoDB readConcern principle analysis*

*OPPO million-level high-concurrency MongoDB cluster: a practice of improving performance dozens of times*

  • *Storage engine cacheSize specification selection*

Single container specification: 16 CPU, 64 GB memory, 7 TB disk. Considering that memory pressure and memory fragmentation would be relatively high during the full migration, cacheSize was set to 42 GB to avoid OOM.

5.2 Optimization during the full data migration

img

During the full migration, the write rate is high and the cache accumulates a lot of dirty data. Once the dirty-data ratio reaches a certain threshold, the threads serving user read/write requests are blocked and made to help evict dirty pages from the cache themselves, so write performance drops significantly.

The configurations related to the WiredTiger storage engine's cache eviction strategy are as follows:

img

Because the full migration writes continuously at high volume rather than in short bursts, tuning eviction_target, eviction_trigger, eviction_dirty_target, and eviction_dirty_trigger is of limited use; adjusting those thresholds only helps with short bursts of traffic.

For sustained long-term high-volume writes, however, we can relieve the user-request blocking caused by a high dirty-data ratio by increasing the number of background eviction threads in the WiredTiger storage engine, so that dirty-page eviction is ultimately handled by the background threads of the evict module.

The storage engine tuning for sustained high-volume writes is as follows:

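// increase the WiredTiger eviction thread pool so that background threads, not application threads, evict dirty pages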
db.adminCommand( {setParameter: 1, "wiredTigerEngineRuntimeConfig": "eviction=(threads_min=4, threads_max=20)"})

5.3 Read/write tuning after the full migration completed

img

As mentioned earlier, when evaluating container resources we settled on the following single-container specification:

16 CPU, 64 GB memory, 7 TB disk.

To avoid OOM during the full migration, about one third of the memory was reserved for the mongodb server layer, operating-system overhead, and so on. After the full migration completed, the business write traffic is much smaller than during migration, with peak read/write OPS around 20,000-40,000/s.

In other words, after the migration the dirty-data ratio in the cache is very small and almost never reaches the 20% threshold, while the read traffic is much higher than before (during the migration, reads still went to the original MySQL cluster). To improve read performance, we made the following adjustments (indexes were built in advance):

  1. *The per-node cacheSize was raised from 42 GB to 55 GB, caching as much hot data as possible in memory for business reads and maximizing read performance (see the sketch after this list).*
  2. *Cache memory is proactively released once a day during the early-morning off-peak period to avoid OOM.*
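
A hedged sketch of the runtime cache adjustment (the value matches the 55 GB chosen above; the exact parameter string should be verified against the deployed version):

db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=55G" })   // assumed runtime tuning; cacheSizeGB can also be set at startup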

After the above kernel tuning, the business-side latency monitoring curve became more stable, and the average latency improved by about 25%, as shown below:

img

6. Business-side latency comparison before and after migration (MySQL vs MongoDB)

6.1 Performance benefit comparison

  • *Before migration, business-side latency monitoring curve (average latency 7 ms; data from February 1, when the MySQL cluster held only 30 billion records):*
img
  • *After migrating to MongoDB and cutting all business traffic over to it, business-side latency monitoring curve (average 6 ms; data from March 6, when the MongoDB cluster held about 50 billion records):*
img

*Summary:*

  • MySQL (30 billion records) latency: about 7 ms
  • MongoDB (50 billion records) latency: about 6 ms

6.2 Answers to Performance Questions

Some readers may question these performance numbers, pointing out that the MongoDB instance specification is 16 CPU / 64 GB memory / 7 TB disk while MySQL's is 4 CPU / 16 GB memory / 500 GB disk, i.e. MongoDB gets more resources per instance. This, however, ignores per-instance data volume and traffic. Comparing single instances (only the business-side latency at MySQL's 30-billion point and MongoDB's 50-billion point was recorded, so we again use these two points as examples):

CPU is not the bottleneck for either MySQL or MongoDB; both are mostly idle. The only container-specification difference that matters is memory. The single-instance specifications, data volumes, and business-side latencies are compared below (single-instance data volume is about 30 billion / 64 ≈ 470 million records for MySQL and about 12.5 billion for MongoDB):

img

If MySQL used the same specification as MongoDB: since MySQL consumes about 3.3x the disk of MongoDB for the same data, it would need roughly 22 TB of disk to carry the same data volume and traffic. Would its performance then beat scheme 1? We cannot say for certain, because this is a production environment and we cannot spend a lot of time verifying it.

As above, the performance of scheme 3 versus schemes 1 and 2 remains to be verified. In practice, the four MongoDB shards now hold 100 billion records and the client access latency is essentially unchanged, still about 6 ms. Therefore, to verify with the same resource specification, a single MySQL node would need to bear the following data volume and business traffic:

img

7. Comparison of migration costs and benefits

7.1 Mysql cluster specifications and the maximum amount of stored data

img

There were 64 original MySQL clusters, each with 4 replicas, each replica container being 4 CPU, 16 GB memory, 500 GB disk, storing a total of 40 billion records. By then most nodes had triggered 90% disk-usage alarms, and the DBAs had already expanded the disks on some nodes.

Summarized as follows:

  • Total number of clusters: 64
  • Replicas per cluster: 4
  • Node specification: 4 CPU, 16 GB mem, 500 GB disk
  • Maximum data volume stored across the 64 clusters: 40 billion records

7.2 mongodb cluster specifications and the maximum amount of stored data

img

Since migrating from MySQL to MongoDB, the data volume has grown from 40 billion to 100 billion records, increasing by 20 billion per month. The MongoDB cluster specification and stored data volume are summarized as follows:

  • Number of shards: 4
  • Replicas per shard: 4
  • Node specification: 16 CPU, 64 GB mem, 7 TB disk
  • Data volume across the four shards: currently 100 billion records stored, with capacity for up to 150 billion.

7.3 Cost comparison calculation process

*Note:* because no new data was written to MySQL after the migration, MySQL held about 40 billion records when traffic was switched to MongoDB, so we use that point in time for the comparison. Based on 40 billion records, the resource-consumption comparison is as follows (each shard counts only the master node's resources, since both MySQL and MongoDB have 4 replicas):

img

Because the four MongoDB shards still have plenty of spare disk, beyond the existing 40 billion records they could absorb roughly another 120 billion. Calculated at 160 billion records, if the previous MySQL package specifications were kept, the number of MySQL clusters would have to grow to four times the original, i.e. 64 × 4 = 256 clusters. The resource comparison is as follows:

img

7.4 Summary of benefits (objective comparison)

As can be seen above, migrating the business to MongoDB not only solved the capacity pain points, supported rapid business iteration, and improved performance, but also cut costs several times over. The cost savings are summarized as follows:

*At the 40-billion-record scale (MySQL and MongoDB both storing the same 40 billion records):*

CPU and memory cost ratio: 4:1

Disk cost ratio: 3.3:1

*At the 150-billion-record scale (MySQL clusters extrapolated proportionally with the previous specifications):*

CPU and memory cost ratio: 16:1

Disk cost ratio: 3.3:1

From the above analysis, the larger the data volume, the lower MongoDB's storage cost becomes when extrapolated proportionally, for the following reasons:

*CPU/memory saving reason:*

Mainly because of MongoDB's massive-data storage capability and high performance: with the right indexes, even a single instance with a single collection holding tens of billions of records returns reads and writes in milliseconds (note: make sure queries and updates are covered by indexes).

In addition, MongoDB's distributed design makes capacity planning easier: there is no need to provision many MySQL clusters in advance; shards can be added whenever actually needed.

*Disk saving reason:*

MongoDB's WiredTiger storage engine provides high compression and high performance by default.

*Finally,* regarding the objectivity of the cost evaluation: the CPU/memory part may be debatable, for example whether the MySQL instances were over-provisioned on memory and CPU. The same question applies to MongoDB's CPU: a single container requests 16 CPUs but actually uses only a few.

The disk saving, however, is a hard number: it is a comparison of the real disk consumption of MySQL and MongoDB for the same data.

The cluster's total data volume has now reached the 100-billion level and grows by 20 billion per month. In terms of container billing alone, extrapolating proportionally to 100 billion records, the expected cost saving is substantial.

8. Finally: notes on operating a hundreds-of-billions-record MongoDB cluster

MongoDB does not require database/table splitting, and a single collection can grow essentially without bound, but as the data volume increases a very large collection brings the following concerns:

  • Remember to build indexes in advance, otherwise query and update performance suffers (the more data, the slower an unindexed scan).
  • Remember to evaluate in advance which indexes the business needs; adding an index to a single node / single collection holding tens of billions of records takes a long time.
  • Replacing a node after a server failure takes longer.
  • Do not use mongodump/mongorestore for data backups; use hot backup or file-copy backup instead.
  • When replacing a node, try to recover by copying and loading data from a backup rather than relying on full initial sync from the primary, which takes much longer.

9. Future challenges (the cluster's coming trillion-record real-time data scale)

As time goes by, the business data keeps growing; the monthly growth curve is expected to remain roughly linear (currently about 20 billion records per month). The cluster's total data volume is expected to reach the trillion level within the next 2-3 years, with around 20 shards, and various problems may arise along the way.

  1. IoT data has a clear cold-data pattern: data from more than a year ago is almost never accessed, so we are considering the following optimizations to further improve performance and cost: archive cold data to low-cost SATA disks
  2. Raise the compression ratio of cold data to minimize disk consumption
  3. Solve the performance problems involved in archiving cold data to SATA disks

10. Final remarks (summary of the business scenario)

The usage scenario of this 100-billion-record IoT business is summarized as follows:

  • All of the business's reads, updates, and sorts can be covered by indexes, including single-field indexes, compound (multi-field) indexes, and array indexes; every query and update can be matched to a specific optimal index.
  • All queries are single-collection queries; there are no multi-collection joins.

The usage scenario matters a great deal; discussing the pros and cons of a database without its business scenario is meaningless. In this case, the business can determine in advance which indexes to build, and every update, query, and sort maps to a specific optimal index, so the scenario is a very good fit for MongoDB.
