In the early hours of the night, my girlfriend asked what a data warehouse was, and my answer surprised her, and then found out. . .

Things are like this

At night, I am concentrating on writing code
Suddenly, I received a message from my girlfriend
She didn't understand the code and proposed such a professional term. I was shocked on the spot, so I decided to write a detailed data warehouse introduction, and then found out that I finished writing, and she fell asleep. . .

table of Contents

Historical articles

Preface

Introduction to data warehouse

One, the basic concept of data warehouse

Two, the main characteristics of the data warehouse

1. Thematic

2. Integration

3. Stability

4. Time-varying

Three, the difference between data warehouse and database

1. Database

2. Data warehouse

​​​​​​​3, the difference between the two

4. Data warehouse layered architecture

​​​​​​​5, ETL for data warehouse


Historical articles

Baidu, Alibaba, and Tencent are all familiar with the platform architecture. Do you understand the evolution of the OLAP architecture of the Xiaomi big data platform?

My girlfriend asked Ali how to realize the real-time big screen on Double Eleven. I was stunned for a while and immediately taught her the big data technology behind it.

Preface

The most detailed big data notes on the entire network in 2021 will easily take you from entry to proficiency. This column is updated daily to summarize knowledge sharing

Introduction Data Warehouse

One, the basic concept of data warehouse

Data warehouse, the English name is Data Warehouse, which can be abbreviated as DW or DWH. As the name suggests, a data warehouse is a large collection of data storage, created for the purpose of enterprise analytical reporting and decision support, and screening and integrating diverse business data . It provides companies with certain BI (business intelligence) capabilities to guide business process improvement, monitoring time, cost, quality, and control.

The input side of the data warehouse is a variety of data sources, and the final output is used for data analysis, data mining, data reporting and other directions of the enterprise.

​​​​​​​Two, the main characteristics of the data warehouse

Data warehouse is subject - oriented (Subject-Oriented), integrated (Integrated), stable (Non-Volatile) and time-varying (Time-Variant) data collection to support management decision-making.

1. Thematic

Different from the traditional database corresponding to one or more projects, the data warehouse integrates the data of different data sources at a higher level of abstraction according to the actual needs of users, and all data is organized around a certain topic.

How to understand the subject here? For example, for cities, "weather and humidity analysis" is a theme, and for Taobao, "user click behavior analysis" is a theme.

2. Integration

The data stored in the data warehouse is derived from the integration of multiple data sources, the original data comes from different data sources, and the storage methods are different. To integrate into the final data collection, it needs to go through a series of extraction, cleaning, and conversion processes from the data source.

3. Stability

The data stored in the data warehouse is a series of historical snapshots and cannot be modified. Users can only query and analyze through analysis tools. Here is a point, the data warehouse basically does not allow users to modify or delete operations. Most scenarios are used to query and analyze data.

4. Time-varying

The data warehouse will regularly receive new integrated data, reflecting the latest data changes. This is not inconsistent with stability characteristics.

Three, the difference between data warehouse and database

1. Database

The database is a transaction-oriented processing system, which is the daily operation of specific business online in the database, and usually queries and modifies records. Users are more concerned about operational response time, data security, integrity, and the number of concurrently supported users. As the main means of data management, the traditional database system is mainly used for operational processing, also known as OLTP (On-Line Transaction Processing).

​​​​​​​2, data warehouse

Data warehouses generally analyze historical data on certain topics to support management decision-making, and are also known as On-Line Analytical Processing (OLAP).

The first thing to understand is that the emergence of data warehouses is not meant to replace databases.

​​​​​​​3, the difference between the two

  • The database is a transaction-oriented design, and the data warehouse is a subject-oriented design.
  • The database generally stores business data, and the data warehouse generally stores historical data.
  • The database design is to avoid redundancy as much as possible. It is generally designed for a certain business application. For example, a simple User table can record simple data such as user names and passwords. It meets business applications but does not meet analysis. The design of the data warehouse deliberately introduces redundancy, and is designed according to analysis requirements, analysis dimensions and analysis indicators.
  • The database is designed to capture data, and the data warehouse is designed to analyze data.

Take banking as an example. The database is the data platform of the transaction system. Every transaction made by the customer in the bank will be written into the database and recorded. Here, it can be simply understood as using the database for accounting.

The data warehouse is the data platform of the analysis system. It obtains data from the transaction system, summarizes and processes it, and provides decision-makers with a basis for decision-making. For example, how many transactions occur in a certain branch of a certain bank in a month, and what is the current deposit balance of the branch. If there are more deposits and more consumer transactions, then it is necessary to set up an ATM in this area.

Obviously, the bank's transaction volume is huge, usually in millions or even tens of millions of times. The transaction system is real-time, which requires timeliness. It is unbearable for customers to deposit a sum of money for tens of seconds. This requires that the database can only store data for a short period of time. The analysis system is after the fact, it must provide all the valid data in the time period of interest. These data are massive, and the summary calculation is slower, but as long as the effective analysis data can be provided, the goal will be achieved.

The data warehouse is produced in order to further mine data resources and for decision-making needs when a large number of databases already exist. It is by no means a so-called "large database".

4. Data warehouse layered architecture

According to the process of data inflow and outflow, the data warehouse architecture can be divided into three layers- source data , data warehouse , and data application.

The data of the data warehouse comes from different source data and provides a variety of data applications. The data flows into the data warehouse from the bottom up and then opens up to the upper layer for applications. The data warehouse is just a platform for integrated data management in the middle.

  • Source data layer (ODS):  Operational Data Store is a transition from database to data warehouse. The data structure of ODS is generally consistent with the data source, and the data cycle of ODS is generally relatively short. The ODS data prepares for the data processing in the next step.
  • Data warehouse layer (DW): Data Warehouse (Data Warehouse)is the home of data, where all the data coming from ODS is kept, and errors are reported for a long time, and these data will not be modified. The data in the DW layer should be consistent , Accurate and clean data, that is, the data after cleaning (removing impurities) from the source system data.
  • Data Application Layer (DA): Data Application, a part of data that is independent from the data warehouse for a specific application purpose or application range. It can also be called departmental data or subject data, and the data is application-oriented. For example, calculate the generated data according to the report and thematic analysis requirements.

​​​​​​​5, ETL for data warehouse

The ETL , English abbreviation Extract-Transform-Load used to describe the data from the source end after extraction (Extract) , conversion (Transform) , load (load) to the destination process. ETL is the process of extracting, cleaning, and transforming business system data and then loading it into a data warehouse. The purpose is to integrate scattered, disorderly, and non-uniform data in the enterprise.

ETL is the assembly line of the data warehouse, and can also be considered as the blood of the data warehouse. It maintains the metabolism of the data in the data warehouse, and most of the energy in the daily management and maintenance of the data warehouse is to keep ETL normal and stable.


The big data series of articles in this blog will be updated every day, remember to collect and pay attention~