To understand big data, try a different angle

When big data comes up, most technical people think of its 4V characteristics: volume, velocity, variety, and value. I also think of its huge technology ecosystem: the number of big data products is enormous.

Even counting only the common products, there are dizzyingly many. No wonder many big data developers sigh: "Damn, I can't keep up, I just can't keep up."

But if you look at big data technology from the perspective of a single machine, it is actually easy to understand.

Take traditional databases such as MySQL and Oracle. In a stand-alone deployment, the database is installed directly on the operating system, which provides the basic resources the database needs to run.

What does the operating system need to provide? First, a file system for data storage. Second, a compute engine for analysis and calculation: at this layer you get instruction sets, assembly language, and high-level programming languages such as C and C++. Finally, while computing tasks run, the operating system handles scheduling.

On top of the operating system, you install the components that support business operations, such as relational databases, graph databases, and machine learning products.

Big data follows the same principle. Because big data is distributed and a native single-machine operating system cannot scale out on its own, big data products need a distributed operating system underneath them.


This operating system needs a file system, a compute engine, and resource scheduling. Aren't the three major components of Hadoop, namely HDFS (distributed file system), MapReduce (distributed computing engine), and YARN (distributed resource scheduling), exactly such a distributed operating system?

Among them, HDFS stores massive data, MapReduce provides computing support much as a programming language does, and YARN allocates computing resources.
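To make the "computing support" role more concrete, here is a minimal single-process sketch of the MapReduce programming model in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps on many nodes, reading input from HDFS with resources granted by YARN.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group emitted values by key, which the framework
    # does between the map step and the reduce step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is distributed"])
print(counts)
```

The developer only writes the map and reduce functions; splitting the input, moving data between nodes, and retrying failed tasks are left to the platform, which is what makes it feel like programming against an operating system.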

In production, such an operating system alone is not enough. Enterprises have many vertical business scenarios, such as data warehousing, graph computing, and machine learning. These scenarios are generally mature and have their own processing syntax, like SQL in a data warehouse. They do not use the underlying compute engine directly, just as traditional architectures do not process structured data directly in C.

Therefore, big data products install components for these vertical scenarios on top of the distributed system that Hadoop provides. For example, Hive is used in data warehouses to store and analyze structured data, and Mahout is used to build machine learning models.
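The gap between coding against the engine and using a scenario-specific component can be sketched as follows. This is only an analogy: Python's built-in sqlite3 module stands in for a SQL layer such as Hive, and the orders table and its rows are invented for the example.

```python
import sqlite3

# Hypothetical sample data: (customer, amount) pairs.
orders = [("alice", 30), ("bob", 20), ("alice", 50)]

# Hand-written aggregation, analogous to programming the
# underlying compute engine directly.
totals_by_hand = {}
for customer, amount in orders:
    totals_by_hand[customer] = totals_by_hand.get(customer, 0) + amount

# The same aggregation stated declaratively in SQL. sqlite3 is a
# single-machine stand-in here, not Hive itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
conn.close()
totals_by_sql = dict(rows)

print(totals_by_hand, totals_by_sql)
```

Both paths produce the same totals, but the SQL version says only what result is wanted and leaves the how to the component underneath.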

As long as you treat the underlying Hadoop as a distributed operating system, almost everything works the same as on a single machine: you only need to install different product components for different business scenarios.

Essentially, big data uses platform software to build a distributed operating system on top of existing single-machine operating systems, and then installs application software on that. This is precisely why coordination and scheduling across the cluster are slower than in a native operating system. In exchange, it scales much better to massive data: once data reaches a certain level (petabytes), it performs far better than the traditional architecture.
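A toy cost model shows why the trade-off flips with scale. Every number here is an assumption chosen for illustration, not a measurement: each machine scans 1 GB per second, the cluster has 100 nodes, and every job pays a fixed 30 seconds of coordination and scheduling overhead.

```python
def single_machine_seconds(data_gb, gb_per_second=1.0):
    # One machine scans all the data by itself, with no coordination cost.
    return data_gb / gb_per_second


def cluster_seconds(data_gb, nodes=100, gb_per_second=1.0, overhead_seconds=30.0):
    # The cluster pays a fixed scheduling overhead, then splits
    # the scan evenly across its nodes (an idealized assumption).
    return overhead_seconds + data_gb / (nodes * gb_per_second)


small_gb = 10          # a small job
large_gb = 1_000_000   # roughly 1 PB

for size in (small_gb, large_gb):
    print(size, single_machine_seconds(size), cluster_seconds(size))
```

Under these assumptions the small job is slower on the cluster, because the overhead dominates, while the petabyte-scale job finishes in a small fraction of the single-machine time.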

So imagine discarding the native single-machine operating system and installing a distributed operating system directly on the hardware: would performance take a qualitative leap? Whether such an operating system dedicated to distributed services will ever be built, we will have to wait and see. If it is, it would mark a brand new stage for big data, and performance could improve qualitatively.

Afterword

On the public account "Several Boats", you can get free video courses and automatic installation scripts for big data clusters from the "Data Warehouse" column, and join the discussion group. All my big data technical content is also published there first. If you are interested in a big data technology but short on time, bring it up in the group and I will arrange to share it.
