ClickHouse (1) Introduction and Architecture

1. OLAP detailed explanation

1.1. OLAP scene characteristics

1. Read more than write

Different from transaction processing (OLTP) scenarios, such as adding shopping carts, placing orders, and paying in e-commerce scenarios, a large number of insert, update, and delete operations need to be performed in-situ. Data analysis (OLAP) scenarios usually involve batch importing of data. Carry out flexible exploration of any dimension, BI tool insight, report production, etc.

After the data is written once, analysts need to try to mine and analyze the data from all angles until they find out the commercial value, business trend and other information in it. This is a process that requires repeated trial and error, continuous adjustment, and continuous optimization, in which the number of data reads far exceeds the number of writes. This requires the underlying database to be specifically designed for this feature, rather than blindly adopting the technical architecture of traditional databases.

2. Large wide table, read a large number of rows but a small number of columns, the result set is small

In OLAP scenarios, there are usually one or several large-width tables with multiple columns, with the number of columns reaching hundreds or even thousands of columns. When analyzing and processing data, select a few of them as dimension columns and other few columns as indicator columns, and then perform aggregation calculations on the entire table or a larger range of data. This process scans a large number of rows of data, but only a few columns are used. The result set of the aggregate calculation is also significantly smaller than the original data, which is often billions of dollars.

3. Data is written in batches, and the data is not updated or less updated

OLTP services have higher requirements for latency, and it is necessary to avoid business losses caused by waiting for customers; while OLAP services, due to the very large amount of data, usually pay more attention to the write throughput (Throughput), requiring massive data to be imported as soon as possible . Once the import is complete, the historical data is often used as an archive, and no more updates or deletions will be performed.

4. No transaction required, low data consistency requirements

OLAP services have less transaction requirements, usually importing historical log data, or collocation with a transactional database and real-time data synchronization from the transactional database. Most OLAP systems support eventual consistency.

5. Flexible and changeable, not suitable for pre-modeling

In the analysis scenario, as the business changes, the analysis dimensions and mining methods must be adjusted in time to discover the value of the data and update the business indicators as soon as possible. However, a large amount of historical data is usually stored in a data warehouse, and the adjustment cost is very high. Although the pre-modeling technology can accelerate calculations in specific scenarios, it cannot meet the flexible and changeable development needs of the business, and the maintenance cost is too high.

1.1.1. Technical selection

  • Small amount of data: stand-alone program
  • Intermediate data: ES, MySQL sub-database sub-table
  • Massive data: druid, kylin, doris, clickhouse
  • Efficient query and analysis of massive data: column database, write mode (ensure that the data type of the same column is the same: easy to compress)

1.2.ClickHouse official website explanation

URL address: https://clickhouse.tech/docs/zh/

1. The vast majority of requests are read requests.
2. The data is updated in fairly large batches (> 1000 rows) instead of a single row; or it is not updated at all.
3. The data has been added to the database, but will not be modified.
4. For reading, each query reads a large number of rows from the database, but at the same time only a small number of columns are required.
5. The "wide" table means that they contain a large number of columns.
6. Relatively few queries (usually hundreds of queries per server or less per second).
7. For simple queries, a delay of about 50 milliseconds is allowed.
8. The data in the column is relatively small: generally speaking, they are numbers and short strings (for example, 60 bytes per URL).
9. High throughput is required when processing a single query (each server has a maximum per second Billions of rows).
10. Transactions are not necessary.
11. Low requirements for data consistency.
12. Each query has a large table. All other tables are very small, except for this large table.
13. The query result is significantly smaller than the source data. In other words, the data can be stored in the memory of a single server after being filtered or aggregated

2.ClickHouse

2.1. Overview

Source code: C++

Summary of typical features: ROLAP, online real-time query, complete DBMS, columnar storage, does not require any data preprocessing, supports batch updates, has very complete SQL support and functions, supports high availability, does not rely on Hadoop complex ecology, out of the box Ready to use

Simply put, as an analytical database, ClickHouse has three major characteristics: one is fast running scores, the other is multiple functions, and the third is literary and artistic style.

1. Fast running points: ClickHouse running points are 5 times faster than Vertica:

The performance of clickHouse surpasses most columnar storage databases on the market. Compared with the traditional data ClickHouse is 100-1000X faster. ClickHouse still has a very big advantage:
100Million data set: ClickHouse is about 5 times faster than Vertica and 279 faster than Hive Times, 801 times faster than MySQL
1Billion Data set: ClickHouse is about 5 times faster than Vertica, MySQL and Hive can no longer complete the task

2. Multiple functions: ClickHouse supports various scenarios for statistical analysis of data

Support SQL-like query,
support a variety of library functions (such as IP conversion, URL analysis, etc., estimate calculation/HyperLoglog, etc.)
Support array and nested data structure (Nested Data Structure)
Support database remote replication deployment

3. Literary and artistic fan: ClickHouse currently has many restrictions, and it was born to serve the petty bourgeois

Relatively lack of documentation, the community is just beginning to be active, only the open source C++ source code
ignores the Hadoop ecosystem and goes its own way

Comparison of the number of functions: Clickhouse: 779, hive-1.x: 216, hive-2.x: 271

2.2 Usage scenarios

Suitable: Used for well-structured, clear and immutable event or log stream analysis.

Not suitable for: Transactional work (OLTP), high-request rate key-value access, low-latency modification or deletion of existing data, Blob or document storage, ultra-standardized data.

2.3 Advantages and disadvantages of ClickHouse

2.3.1 Advantages

1. A true column-oriented DBMS

ClickHouse is a DBMS, not a single database. It allows creating tables and databases, loading data, and running queries at runtime without the need to reconfigure and restart the server.

2. Data compression

Some column-oriented DBMSs (InfiniDB CE and MonetDB) do not use data compression. However, data compression does improve performance.

3. Data stored on disk

Many column-oriented DBMSs (SAP HANA and GooglePowerDrill) can only work in memory. But even on thousands of servers, the memory is too small to store all views and sessions in Yandex.Metrica.

4. Multi-core parallel processing

Multi-core and multi-node parallelization of large-scale queries.

5. Distributed processing on multiple servers

In ClickHouse, data can reside on different shards. Each shard can be used for a set of fault-tolerant replicas, and queries will be processed in parallel on all shards.

6. SQL support

ClickHouse SQL has different function names from real SQL. However, the grammar is basically compatible with SQL grammar, supports JOIN, FROM, IN and JOIN clauses, as well as scalar subqueries and subqueries.

7. Vectorization engine

The data is not only stored in columns, but also processed by the vector-column part, which enables developers to achieve high CPU performance.

8. Real-time data update

ClickHouse supports primary key tables. In order to quickly execute the query on the primary key range, the data is sorted in ascending order using MergeTree. For this reason, data can be continuously added to the table.

9. Support approximate calculation

The library supports running aggregations for a limited number of random keys (rather than all keys). Under the specific conditions of key distribution in the data, this provides relatively accurate results while using fewer resources.

10. Data replication and support for data integrity

ClickHouse uses asynchronous multi-master replication. After writing to any available copy, the data will be distributed to all remaining copies. The system maintains the same data on different copies. Data is automatically restored after failure.

2.3.2 Disadvantages

  1. There is no complete transaction support, no transaction support: don't think about Transaction if you want to quickly
  2. It lacks a complete Update/Delete operation, lacks the ability to modify or delete existing data with high frequency and low latency, and can only be used to delete or modify data in batches.
  3. The aggregation result must be less than the memory size of a machine: not a big problem
  4. Supports limited operating systems and is slowly improving
  5. The open source community has just started, mainly in Russian, and the Chinese community: http://www.clickhouse.com.cn
  6. Not suitable for key-value storage, and does not support document databases such as Blob

3. Installation and startup

3.1 Installation

3.1.1 Online installation

yum install yum-utils -y

rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG

yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo

yum install clickhouse-server clickhouse-client -y

3.1.2 Offline installation

Download link: https://repo.yandex.ru/clickhouse/rpm/stable/x86_64/ https://packagecloud.io/Altinity/clickhouse

Installation sequence:

rpm -ivh clickhouse-common-static-20.5.4.40-1.el7.x86_64.rpm

rpm -ivh clickhouse-server-common-20.5.4.40-1.el7.x86_64.rpm

rpm -ivh clickhouse-server-20.5.4.40-1.el7.x86_64.rpm

rpm -ivh clickhouse-client-20.5.4.40-1.el7.x86_64.rpm

Note: Even if it is configured as a cluster, each server still runs separately.

Successful core directory for installation

(1) /etc/clickhouse-server: The configuration file directory of the server, including global configuration config.xml and user configuration users.xml, etc.

(2) /var/lib/clickhouse: the default data storage directory, usually modify the default path configuration, save the data to the large-capacity disk mount path

(3) /var/log/clickhouse-server: the default log save directory, usually modify the path configuration to save the log to the path mounted on the large-capacity disk

executable file

clickhouse: the executable file of the main program.
clickhouse-client: A soft link to ClickHouse executable file for client connection.
clickhouse-server: A soft link to ClickHouse executable file for server startup.
clickhouse-compressor: a built-in compression tool that can be used to decompress and decompress data.

3.2 Start

3.2.1 Stand-alone startup

Front desk start: clickhouse-server --config-file=/etc/clickhouse-server/config.xml

Background startup: nohup clickhouse-server --config-file=/etc/clickhouse-server/config.xml 1>~/logs/clickhouse_std.log 2>~/logs/clickhouse_err.log &

Process view: ps -aux | grep clickhouse, netstat -nltp | grep clickhouse

If an error is reported:

Solution: modify the permissions of the installation directory! , Clickhouse users are used by default! The command is:

cd /var/lib/, chown -R root:root clickhouse

Client start: clickhouse-client --port port number. <tcp_port>9977</tcp_port> in /etc/clickhouse-server/config.xml

3.2.2 Cluster startup

1. Add the configuration file /etc/metrika.xml. Distribute to the server where CK is located. As follows:

<yandex>	<clickhouse_remote_servers>		<!-- 3分片1副本 -->		<perftest_3shards_1replicas>			<shard>				<!-- 数据自动同步 -->				<internal_replication>true</internal_replication>				<replica>					<host>hadoop1</host>					<port>9977</port>				</replica>			</shard>			<shard>				<replica>				<internal_replication>true</internal_replication>					<host>hadoop2</host>					<port>9977</port>				</replica>			</shard>			<shard>				<internal_replication>true</internal_replication>				<replica>					<host>hadoop3</host>					<port>9977</port>				</replica>			</shard>		</perftest_3shards_1replicas>	</clickhouse_remote_servers> 	<!--zookeeper相关配置-->	<zookeeper-servers>		<node index="1">			<host>hadoop1</host>			<port>2181</port>		</node>		<node index="2">			<host>hadoop2</host>			<port>2181</port>		</node>		<node index="3">			<host>hadoop3</host>			<port>2181</port>		</node>	</zookeeper-servers> 	<!-- 配置文件中macros若省略,则建复制表时每个分片需指定zookeeper路径及副本名称,同一分片上路径相同,副本名称不同;若不省略需每个分片不同配置	-->	<macros>		<replica>hadoop2</replica>	</macros> 	<networks>		<ip>::/0</ip>	</networks> 	<!-- 配置压缩 -->	<clickhouse_compression>		<case>			<min_part_size>10000000000</min_part_size>			<min_part_size_ratio>0.01</min_part_size_ratio>			<method>lz4</method>		</case>	</clickhouse_compression> </yandex>

Start zookeeper and CK server, refer to the above to check whether the startup is OK:

zkServer.sh start

nohup clickhouse-server --config-file=/etc/clickhouse-server/config.xml 1>~/logs/clickhouse_std.log 2>~/logs/clickhouse_err.log &

2. Client connection:

clickhouse-client --host hadoop1 --port 9977 -m --user=bigdata --password=bigdata

-m If set, multiple lines of query are allowed.

Pay attention to the use of clusters:

If a table is created, and the engine of this table is a distributed engine, then the machine where the table is stored in that cluster is a cluster. Some tables in each clickhouse are stand-alone, and some tables are distributed.

Common table building statements are stand-alone. Create distributed will be displayed on all nodes.

create database if not exists mydb on cluster perftest_3shards_1replicas

3. Set the connection username and password: vim /etc/clickhouse-server/users.xml Copy to all nodes

<bigdata> <password>bigdata</password> <networks incl="networks" replace="replace">  <ip>::/0</ip> </networks> <profile>default</profile> <quota>default</quota></bigdata>

4. Uninstall clickhouse

  1. Query whether clickhouse is installed: rpm -qa | grep clickhouse
  2. Uninstall clickhouse: rpm -e clickhouse-client-20.5.4.40-2.noarch --nodeps, rpm -e clickhouse-server-20.5.4.40-2.noarch --nodeps, rpm -e clickhouse-common-static-20.5. 4.40-2.x86_64 --nodeps
  3. Delete the data directory: rm -rf /var/lib/clickhouse
  4. Delete the cluster configuration file: rm -rf /etc/metrika.xml
  5. Delete the configuration file: rm -rf /etc/clickhouse-*
  6. Delete the log file: rm -rf /var/log/clickhouse-server
  7. Delete the clickhouse data on zookeeper: rmr /clickhouse
  8. You can also perform a global search: and then perform a delete operation: find / -name'clickhouse'

5. ClickHouse some information summary

  • Configuration file path: /etc/clickhouse-server/config.xml
  • Log file path: /var/log/clickhouse-server/
  • Table creation information path: /var/lib/clickhouse/metadata/
  • Table data path: /var/lib/clickhouse/data/