1. SQL basic syntax
1.1.1. CREATE DATABASE
Database engine: The database currently supports a total of 5 engines, as shown below:
- Ordinary: The default engine. In most cases, we will use the default engine. There is no need to declare it when using it. Any type of table engine can be used in this database.
- Dictionary: dictionary engine, this type of database will automatically create their data tables for all data dictionaries
- Memory: Memory engine, used to store temporary data. The data tables under this type of database will only stay in memory and will not involve any disk operations. When the service restarts, the data will be cleared.
- Lazy: Log engine, only Log series table engines can be used under this type of database
- MySQL: MySQL engine, this type of database will automatically pull the data in the remote MySQL, and create the data table of the MySQL table engine for them
In most cases, you only need to use the default database engine.
The essence of the default database is a file directory on the physical disk, so after the statement is executed, ClickHouse will create the file directory of the database under the installation path:
At the same time, the file used to restore the database will also be created in the metadata path, and there are SQL statements to restore the table in the library directory:
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster](name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],...) ENGINE = engine
|DEFAULT expr||The default value, usage is similar to SQL.|
|MATERIALIZED expr||Materialized expression, the column specified by the expression cannot be INSERT, because it is always calculated. |
For INSERT, these columns do not need to be considered. In addition, if an asterisk is included in the SELECT query, this column will not be queried.
Three ways to create a table:
1. Create directly
2. Create a table with the same structure as other tables:
CREATE TABLE [IF NOT EXISTS] [db.]table_name AS [db2.]name2 [ENGINE = engine]
3. Use the specified engine to create a table with the same structure as the result of the SELECT clause, and fill it with the result of the SELECT clause:
CREATE TABLE [IF NOT EXISTS] [db.]table_name ENGINE = engine AS SELECT ...
1.1.3. Table fields
Table fields support three definition methods for default value expressions, namely DEFAULT, MATERIALIZED and ALIAS . No matter which form is used, once a table field is defined with a default value, it is no longer mandatory to define the data type, because ClickHouse will infer the type based on the default value. If the data type and default value expression are defined for the table fields at the same time, the clearly defined data type shall prevail
There are also differences between the three definition methods of the default value expression, which can be compared from the following three aspects:
1. Data writing: When writing data, only DEFAULT type fields can appear in the INSERT statement. Neither MATERIALIZED nor ALIAS can be explicitly assigned, they can only rely on calculations. For example, if you try to write data to a field of type MATERIALIZED, you will get the following error
DB::Exception: Cannot insert column URL,because it is MATERIALIZED column..
2. Data query: When querying data, only DEFAULT type fields can be returned by SELECT *. The MATERIALIZED and ALIAS type fields will not appear in the return result set of the SELECT * query.
3. Data storage: In data storage, only DEFAULT and MATERIALIZED type fields support persistence. If the table engine used supports physical storage (such as the TinyLog table engine), then these column fields will have physical storage. The ALIAS type field does not support persistence, its value always needs to be generated by calculation, and the data will not fall to the disk.
You can use the ALTER statement to modify the default value, for example:
ALTER TABLE [db_name.]table MODIFY COLUMN col_name DEFAULT value;
The modification action will not affect the previously existing data in the data table. However, there are many restrictions on the modification of the default value. For example, in the merge tree table engine, its primary key field cannot be modified; and some table engines do not support modification at all (such as TinyLog).
1.1.4. Temporary Table
ClickHouse also has the concept of a temporary table. The way to create a temporary table is to add the TEMPORARY keyword on top of the normal table. Its complete syntax is as follows:
CREATE TEMPORARY TABLE [IF NOT EXISTS] table_name ( name1 [type][DEFAULT|MATERIALIZED|ALIAS expr], name2 [type][DEFAULT|MATERIALIZED|ALIAS expr])
Compared with ordinary tables, temporary tables have the following special features:
- Its life cycle is session-bound, so it only supports the Memory table engine. If the session ends, the data table will be destroyed;
- Temporary tables do not belong to any database, so in its table creation statement, there are neither database parameters nor table engine parameters.
- The priority of temporary tables is greater than that of ordinary tables. When two data tables have the same name, the data in the temporary table will be read first.
- In the daily use of ClickHouse, temporary tables are usually not deliberately used. It is more used inside ClickHouse, and it is the carrier of data dissemination between clusters.
1.1.5. Partition Table
Data partition (partition) and data shard (shard) are two completely different concepts. Data partitioning is for local data and is a vertical segmentation of data. Data fragmentation is a horizontal division of data. Data partitioning is of great significance for an OLAP database: With the help of data partitioning, unnecessary data directories can be skipped in the subsequent query process, thereby improving query performance. Reasonable use of partition features can also realize data update operations in disguise, because data partitions support delete, replace, and reset operations. Assuming that the data table is partitioned by month, the data can be replaced and updated at the granularity of the month.
Although partitioning is good, not all table engines can use this feature. Currently, only table engines in the MergeTree family series support data partitioning . Next, a simple example demonstrates how to use the partition table. First, the partition key is specified by PARTITION BY. For example, the following data table my_table_partition1 uses the date field as the partition key and formats it in the form of year and month:
create table my_table_partition1( id String, url String, eventTime Date) engine = MergeTree()partition by toYYYYMM(eventTime)order by id;
You can query the partition status of the table through the following SQL:
select table, partition, path from system.parts where table = 'my_table_partition1';
ClickHouse has two kinds of views, ordinary and materialized views. Materialized views have independent storage, while ordinary views are just a simple query agent. The complete syntax for creating a normal view is as follows:
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name AS SELECT ...
A materialized view supports a table engine, and the data storage format is determined by its table engine. The complete syntax for creating a materialized view is as follows:
CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]table_name [To [db.]name] [ENGINE=engine] [POPULATE] AS SELECT ...
After the materialized view is created, if the source table is written with new data, the materialized view will also be updated synchronously. The POPULATE modifier determines the initialization strategy of the materialized view: if the POPULATE modifier is used, then in the process of creating the view, the existing data in the source table will be imported together, as if SELECT INTO was executed; on the contrary, if Without the POPULATE modifier, the materialized view has no data after it is created, and it will only synchronize the data written to the source table after that. Materialized views currently do not support synchronous deletion. If data is deleted in the source table, the data of the materialized view will still be retained.
ALTER only supports tables of the MergeTree series, Merge and Distributed engines, the basic syntax:
ALTER TABLE [db].name [ON CLUSTER cluster] ADD|DROP|MODIFY|COMMENT COLUMN ...
ADD COLUMN – add a new column to the table
DROP COLUMN – delete a column in the table
MODIFY COLUMN – change the type of the column
COMMENT COLUMN – change the comment of the column
View the partition information of the table
select partition_id, name, table, database, path from system.parts where table = 'my_table_partition1';
alter table my_table_partition1 drop partition '202005';
alter table my_table_partition2 replace partition '202005' from my_table_partition1;
Unload and mount partition
The table partition can be uninstalled by DETACH statement. After the partition is uninstalled, its physical data is not deleted, but moved to the detached subdirectory of the current data table directory. The loading partition is the reverse operation, it can reload a partition under the detached subdirectory. The accompanying operation of unloading and loading is often used in partition data migration and backup scenarios.
Remember, once the partition is moved to the detached subdirectory, it means it has been out of ClickHouse management, and ClickHouse will not actively clean up these files. These partition files will always exist, unless we actively delete them or use the ATTACH statement to reload them.
--卸载ALTER TABLE tb_name DETACH PARTITION partition_expr--装载ALTER TABLE tb_name ATTACH PARTITION partition_expr
1.5. CHECK TABLE
Check whether the data in the table is damaged, he will return two results: 0-data is damaged; 1-data is complete
check table mt_table;
This command only supports Log, TinyLog and StripeLog engines.
1.6. Distributed DDL execution
ClickHouse supports cluster mode, a cluster has 1 or more nodes. DDL statements such as CREATE, ALTER, DROP, RENMAE, and TRUNCATE all support distributed execution. This means that if a DDL statement is executed on any node in the cluster, each node in the cluster will execute the same statement in the same order. This feature is of great significance. It is like a batch command, eliminating the need to go to a single node to execute DDL in turn.
It is very simple to convert an ordinary DDL statement into distributed execution, just add the ON CLUSTER cluster_name statement. For example, after executing the following statement, this DDL statement will be broadcast to all nodes in the cluster:
-- 分布式建库create database if not exists nxdb4 on CLUSTER perftest_3shards_1replicas;-- 分布式建表CREATE TABLE nx_table_partition3 ON CLUSTER perftest_3shards_1replicas( ID String, URL String, EventTime Date)ENGINE=MergeTree()PARTITION BY toYYYYMM(EventTime)ORDER BY ID;
1.7. update and delete
ClickHouse provides DELETE and UPDATE capabilities. This type of operation is called Mutation query, which can be seen as a variant of ALTER statement.
- Mutation operation is suitable for the modification and deletion of batch data
- Unsupported transactions Once the statement is submitted for execution, it will immediately affect the existing data and cannot be rolled back.
- The execution of the mutation operation is an asynchronous process. The statement submission will return immediately, but it does not mean that the specific logic has been executed. The specific execution record needs to be queried in the system.mutations system table.
create database if not exists mydb;use mydb; --创建表CREATE TABLE city( id UInt8, country String, area String, province String, city String,create_time datetime DEFAULT now())ENGINE = MergeTree()PARTITION BY toYYYYMM(create_time)ORDER BY id; -- 插入数据insert into city(id,country,area,province,city) VALUES(1,'China','North','Hubei','wuhan'),(2,'China','South','Guangdong','guangzhou'),(3,'China','South','Guangdong','shenzhen'),(4,'China','North','Beijing','Beijing'),(5,'China','South','Shanghai','Shanghai'); --数据查询select id, country, area, province, city, create_time from city; --update 操作ALTER TABLE city UPDATE area='South' WHERE city='wuhan'; --delete操作ALTER TABLE city DELETE WHERE city='guangzhou';
View the directory where the data is located:
[[email protected] mydb]# cd /var/lib/clickhouse/data/mydb/city/[[email protected] city]# ll -atotal 24drwxr-x---. 6 root root 156 Jun 5 17:07 .drwxr-x---. 3 root root 18 Jun 5 17:07 ..drwxr-x---. 2 root root 4096 Jun 5 17:07 202106_1_1_0drwxr-x---. 2 root root 4096 Jun 5 17:07 202106_1_1_0_2drwxr-x---. 2 root root 4096 Jun 5 17:07 202106_1_1_0_3drwxr-x---. 2 root root 6 Jun 5 17:07 detached-rw-r-----. 1 root root 1 Jun 5 17:07 format_version.txt-rw-r-----. 1 root root 109 Jun 5 17:07 mutation_2.txt-rw-r-----. 1 root root 96 Jun 5 17:07 mutation_3.txt
It can be found that the data directory will generate files mutation_2.txt and mutation_3.txt after the update and delete operations are performed. In addition, the suffixes _2 and _3 have been added to the end of the directory with the same name. You can see that mutation_2.txt and mutation_3.txt are log files, which fully record the update and delete operation statements and time.
mutation_id: Generate a corresponding log file for recording related information. The process of data deletion is based on each partition directory of the data table, and all directories are rewritten into new directories. The naming rule of the directory is to add to the original name
block_numbers.number: In the process of data rewriting, the data that needs to be deleted will be removed. The old data will not be deleted immediately, but will be marked as inactive (active =0). When the next merge action of the MergeTree engine is triggered, these inactive directories will be physically deleted.