Alibaba Cloud's PolarDB is going open source. How strong is this database?

At the end of last year, the Chinese Institute of Electronics officially announced its 2020 Science and Technology Award list. Alibaba Cloud's self-developed cloud database PolarDB won the first prize for scientific and technological progress.

(The fourth from the left is Li Feifei, the person in charge of the PolarDB project)

This is not the first time that Alibaba has won such an honor. Alibaba Cloud's Feitian operating system won the Special Prize for Scientific Progress of the Chinese Institute of Electronics in 2018, the first special prize awarded since the award was established. PolarDB's win this time further illustrates the strength of Alibaba Cloud's independently developed technology.

At the Alibaba Cloud Developer Conference on May 29 this year, Li Feifei officially announced that PolarDB for PostgreSQL would be open sourced. This announcement was one of the biggest surprises the author took away from the entire conference.

PolarDB is a family of database products developed in-house by Alibaba Cloud. It separates storage from computing and integrates software and hardware design. It combines the low cost of a distributed design with the ease of use of a centralized database, and can meet the needs of large-scale application scenarios. Computing power can scale to more than 1,000 cores, storage capacity can reach 100 TB, a single database in the cluster edition can scale to at most 16 nodes, and performance is 6 times higher than that of MySQL. The PolarDB product series has stably supported Tmall Double 11 for many years, with a record processing peak of 140 million operations per second.

PolarDB has three compatibility branches, corresponding to MySQL, PostgreSQL and Oracle. We noticed that before Alibaba Cloud open sourced the distributed version of PolarDB for PostgreSQL, Huawei's openGauss project was in fact the open-source release of its PostgreSQL-based GaussDB. Alibaba Cloud's open-source move therefore reads as direct competition with Huawei's GaussDB, with the implication of showing strength through code. PolarDB is being released under the relatively permissive Apache License, Version 2.0, so the code can be modified and redistributed.

Since 2018, Alibaba Cloud has appeared in the Gartner Magic Quadrant for databases for three consecutive years, and in 2020 it entered the Leaders quadrant of Gartner's global database Magic Quadrant, becoming the first Chinese company in the basic software field to do so. Currently, Alibaba Cloud's database market share ranks among the top three in the world and first in the Asia-Pacific region. That is why open sourcing PolarDB is so surprising: open source fully exposes the core technology, which gives competitors a reference and even an opportunity to overtake. Only companies that are extremely confident in their own ability to iterate on technology will choose to open source their core technology.

According to rumours, the initial discussions inside Alibaba Cloud on whether to open source PolarDB were fierce. Zhang Jianfeng and Li Feifei are even said to have pounded the table over it, but the final decision was to open source. After listening to Li Feifei's talk at this developer conference, the author believes that the confidence behind open sourcing PolarDB comes from Alibaba Cloud's complete control of the data ecosystem chain, and that open source will inevitably make the Alibaba Cloud database ecosystem stronger.

Being big has its difficulties: the database SQL dispute

The authoritative consulting organization IDC defines big data as data that is difficult to process with existing technologies. Historically, by the time Google published its troika of big data papers, relational database technology was already struggling to handle large-scale data. As industries continue to migrate to the cloud, data volumes will inevitably keep hitting new highs. From what I have learned, the volume of data stored across the IT industry is growing by about 80% annually. At this rate of growth, it is difficult for traditional SQL databases to handle such volumes.

Over time, databases have split into two major schools. One is the non-relational (NoSQL) database, a key-value store built specifically for massive data and mainly used for user profiling, business reports and other large-scale data workloads. The other is the relational (SQL) database, which is very fast at inserting, deleting, updating and querying individual records but rarely performs large-scale join computations over full tables, so it is generally used in online transaction scenarios. In short, SQL is fast at processing, while NoSQL handles larger volumes of data.
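The contrast can be made concrete with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational database and a plain dict as a stand-in for a key-value store; the table, function names and sample data are hypothetical illustrations, not anything taken from PolarDB.

```python
import sqlite3

# Relational side: an atomic update that touches a few individual rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


transfer(conn, 1, 2, 30)
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

# Key-value side: schemaless user profiles keyed by user id.
profiles = {}


def record_view(store, user_id, category):
    """Count one product view for a user, creating the profile on first use."""
    profile = store.setdefault(user_id, {})
    profile[category] = profile.get(category, 0) + 1


for user, category in [("u1", "shoes"), ("u1", "books"), ("u1", "shoes")]:
    record_view(profiles, user, category)
```

The relational half guarantees that both balance updates succeed or fail together, while the key-value half accepts any profile shape without a schema, which is why the two styles ended up serving different workloads.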

Previously, the application scenarios of SQL and NoSQL did not overlap, and the two kept out of each other's way. But new scenarios such as live-stream e-commerce keep emerging. A transaction during a live stream must update the merchant's inventory and the buyer's account balance, and it must also support real-time analysis of customer behavior for precision marketing. Business scenarios like this, which combine SQL and NoSQL requirements, continue to emerge, and the author sees cloud databases represented by PolarDB as the best way to solve such problems.

Take the author's bank as an example. At present, commercial banks generally use an Oracle database for the core system, but Oracle can only process transaction data and cannot do data mining. To extract value from the data, every day you have to run ETL and batch jobs, load the results into the data warehouse, and then build the models, mining jobs, data marts and ODS there, constructing the warehouse's reporting layers one by one.

To answer more detailed and implicit questions, such as nonlinear ones, you must copy the data to SAS for machine learning and then build a statistical indicator system for further mining. The data is moved three times, leaving three redundant copies whose consistency must be managed. A large share of data center operation and maintenance work is simply moving data every day. In this inefficient process of transfer and migration, a great deal of value is wasted. It also creates two huge problems: processing timeliness and disaster recovery.

On timeliness: as noted earlier, the underlying models of SQL and NoSQL differ and are not compatible with each other, which first of all creates a data-processing timeliness problem. Take the author's bank again. Analysis data goes through batch processing in the core transaction database, is extracted via ETL through the ODS into the data warehouse, is then used for further training and stream computing, and finally enters the data lake. The entire data-handling process takes at least one day.
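The arithmetic behind this delay can be sketched in a few lines. The schedule below is purely hypothetical (a nightly batch that starts at midnight, followed by three stages with made-up durations); it only illustrates why waiting for a batch window pushes end-to-end latency toward a full day.

```python
from datetime import datetime, timedelta

# Hypothetical stage durations, chosen only for illustration.
STAGES = [
    ("core batch run", timedelta(hours=2)),
    ("ETL through ODS into the warehouse", timedelta(hours=4)),
    ("training, stream computing, load into the lake", timedelta(hours=3)),
]


def next_batch_start(event_time):
    """Assume the nightly batch starts at the first midnight after the event."""
    next_day = (event_time + timedelta(days=1)).date()
    return datetime.combine(next_day, datetime.min.time())


def available_at(event_time):
    """Time at which a transaction becomes visible to analysts."""
    t = next_batch_start(event_time)
    for _name, duration in STAGES:
        t += duration  # stages run strictly one after another
    return t


event = datetime(2021, 6, 1, 9, 0)  # a transaction made at 09:00
ready = available_at(event)
delay = ready - event
print(f"ready at {ready}, delay {delay}")
```

Under these assumptions most of the delay is spent waiting for the batch window to open, not doing work, so tuning individual stages does little to shorten it.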

Moreover, many components in the open-source Hadoop and data lake ecosystem are not compatible with one another. Daily operation and maintenance is already stretched thin, and there is no way to speed things up. Yet the business is eager to seize fleeting marketing opportunities, for which even T+1 minute may be too slow. Processing timeliness may be an agreement that a big data engineer and a product manager can never reach.

PolarDB (PDB) and AnalyticDB (ADB) fighting side by side

As the discussion above shows, every data center is eager to find a one-stop solution that hides the differences between underlying components and delivers “All Data In One”. Only in this way can efficiency be improved and low-cost operation and maintenance be achieved. The relational database represented by Alibaba Cloud's PolarDB focuses on SQL needs. It provides automatic parameter tuning, automatic index recommendation and other functions, which greatly improve the happiness of database administrators. AnalyticDB is a top performer among NoSQL data warehouses, and the database solution formed by these two products bridges the gap between SQL and NoSQL.

By separating storage from computing and decoupling resources, the cloud-native database represented by PolarDB achieves greater elasticity, high availability and distributed capabilities, meeting business needs for on-demand use and pay-as-you-go billing.
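Why separating storage from computing makes scaling elastic can be shown with a toy model. The classes below are hypothetical and do not describe PolarDB's actual implementation; they ignore concurrency control and which nodes are allowed to write. The only figure taken from this article is the 16-node limit.

```python
class SharedStorage:
    """A single storage layer that every compute node reads and writes."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)


class ComputeNode:
    """A stateless compute node: it holds no data of its own."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage

    def put(self, key, value):
        self.storage.write(key, value)

    def get(self, key):
        return self.storage.read(key)


class Cluster:
    MAX_NODES = 16  # upper bound quoted in this article

    def __init__(self):
        self.storage = SharedStorage()
        self.nodes = []

    def scale_to(self, count):
        """Add or remove compute nodes; no data is copied either way."""
        if not 1 <= count <= self.MAX_NODES:
            raise ValueError(f"node count must be between 1 and {self.MAX_NODES}")
        while len(self.nodes) < count:
            name = f"node-{len(self.nodes)}"
            self.nodes.append(ComputeNode(name, self.storage))
        del self.nodes[count:]


cluster = Cluster()
cluster.scale_to(1)
cluster.nodes[0].put("order:1", "paid")
cluster.scale_to(4)                       # scale out for a traffic peak
seen_by_new_node = cluster.nodes[3].get("order:1")
cluster.scale_to(2)                       # scale back in afterwards
```

Because the nodes keep no local data, adding or removing one is cheap, and that is what makes on-demand use and pay-as-you-go billing practical.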

PolarDB and AnalyticDB are both delivered as services. Users can completely ignore the details hidden behind the PaaS layer: technical specifics such as the data flow between the database and the data warehouse are encapsulated by the cloud service and hidden from users. A single solution combines the advantages of NoSQL and SQL, so users get an efficient data warehouse alongside their database with little effort, which can be described as resolving users' biggest pain point with databases in one fell swoop.