MySQL performance test-tpch tool

Article Directory

1. Background

tpch is a toolkit provided by the TPC (Transaction Processing Performance Council) organization. Used for OLAP testing to evaluate the performance of the decision support system (DSS) in business analysis. It contains a complete set of business-oriented ad-hoc queries and concurrent data modification, emphasizes the testing of database, platform and I/O performance, and focuses on query capabilities.

TPC-H is a set of benchmarks commonly used in the industry, formulated and released by the TPC committee, to evaluate the analytical query capabilities of databases. TPC-H queries include 8 data tables and 22 complex SQL queries. Most queries include several tables Join, subqueries, and Group by aggregations.

Before using the tpch tool, let's learn about parallel query (OLAP).

Parallel Query (OLAP) example (take Alibaba Cloud's ploarDB as an example)
PolarDB MySQL 8.0 cluster version launches the parallel query (Parallel Query) framework. Parallel query is turned off by default. When parallel query is turned on and the query data volume reaches a certain threshold, the parallel query framework will be automatically started, thereby reducing query time.

Note: Parallel query can be enabled by setting loose_max_parallel_degree parameter. For how to set cluster parameters, please refer to Setting Cluster Parameters.
The loose_max_parallel_degree parameter description is as follows: the
minimum value is 0 (parallel query is disabled).
The maximum value is 1024.
It is recommended to set the parallel query parameter to 16.

PolarDB MySQL 8.0 cluster version shards data into different threads in the storage layer, and multiple threads perform parallel calculations. The result pipeline is aggregated to the total thread. Finally, the total thread performs simple aggregation and returns the results to the user to improve query efficiency.

Parallel query utilizes the parallel processing capabilities of multi-core CPUs. Taking an 8-core 32 GB cluster with exclusive specifications as an example, the parallel query schematic diagram is shown below.

Insert picture description here

Related links:
tpch download address (TPC-H needs to be registered before downloading):
http://tpc.org/tpc_documents_current_versions/download_programs
tpch tool github warehouse address:
https://github.com/electrum/tpch-dbgen
reference Use of Alibaba Cloud tpch tool:
https://help.aliyun.com/document_detail/146099.html?spm=a2c4g.11186623.6.773.4797364bCS7aO5

2.tpch installation

  • step1 , decompress the downloaded installation package
yum install unzip -y
unzip tpch-mysql.zip
  • step2 , enter the dbgen directory and compile
cd tpch-mysql
cd dbgen
yum install gcc-c++

Execute compilation

make
  • step3 , generate data files (about 12G data, a long time, a few minutes, wait patiently)
./dbgen -s 12
  • step4 , enter the database to build tables (using root or high-privileged users) and some need to manually build the database tpcd

This is the end of the preparatory work, and the use of the tpch tool will begin below.

Note: Select the same instance of the same specification for testing, write 12G data volume, test 3 times, and take the final average value for benchmarking (unit: s).

3. Use of tpch tool

  • step5 , log in to the database (take Alibaba Cloud RDSMySQL as an example)
mysql -hrm-uf6cbh8aq31a8prtj.mysql.rds.aliyuncs.com -P3306  -umyroot -p****** --local-infile
Note: Log in to mysql here and add the –local-infile parameter, otherwise it may report:
ERROR 2068 (HY000): LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.
use tpcd;

Build a table

\. /root/tpch-mysql/dbgen/dss.ddl

Create indexes, foreign keys, etc.

\. /root/tpch-mysql/dbgen/dss.ri

If you want to see whether the foreign key and primary key are added successfully, and whether they are added correctly, you can execute SHOW CREATE TABLE table name (if you think the default format is very obstructive, you can try adding \G vertical printing)

  • Step6 . Importing data A
    total of 8 tables need to be imported, part, region, nation, customer, supplier, orders, partssupp, and lineitem. The import time of large tables is very long.
    A total of 8 tables, first import the first 5 tables (the following statement is executed in mysql)
load data local infile '/root/tpch-mysql/dbgen/part.tbl' into table part fields terminated by '|';
load data local infile '/root/tpch-mysql/dbgen/region.tbl' into table region fields terminated by '|';
load data local infile '/root/tpch-mysql/dbgen/nation.tbl' into table nation fields terminated by '|';
load data local infile '/root/tpch-mysql/dbgen/customer.tbl' into table customer fields terminated by '|';
load data local infile '/root/tpch-mysql/dbgen/supplier.tbl' into table supplier fields terminated by '|';

As shown in the figure, the import is successful:

Insert picture description here


##Look at the other three tables again. Note: Do not execute here. Because these three tables are relatively large, it is best to use scripts to disassemble and execute them.

load data local infile '/root/tpch-mysql/dbgen/orders.tbl' into table orders fields terminated by '|';
load data local infile '/root/tpch-mysql/dbgen/partsupp.tbl' into table partsupp fields terminated by '|';
load data local infile '/root/tpch-mysql/dbgen/lineitem.tbl' into table lineitem fields terminated by '|';

The relationship of the eight tables is as follows:

Insert picture description here


Among them, the data volume of Nation and Region has nothing to do with SF; Nation is fixed at 25, and Region 5 is not inevitable; the data volume of LineItem is not strictly a multiple of SF.

  • step7 . The script generates three other tables.
    Exit mysql and go to the /root/tpch-mysql/dbgen directory.
    First, check how many rows there are in the table.
[[email protected] dbgen]# cat orders.tbl | wc -l
18000000
[[email protected] dbgen]# cat partsupp.tbl | wc -l
9600000
[[email protected] dbgen]# cat lineitem.tbl | wc -l
71985077

Edit script

vi split_import_file.sh

Amendment (three places) see the following notes

#! /bin/bash
#文件名不带.tbl !!!,即对应表名
#read -p "please input filename: " filename
filename=orders      # 修改处1:替换文件名orders、partsupp、lineitem、
#获取原文件总行数totalline
totalline=$(cat $filename.tbl | wc -l)
echo totalline=$totalline
#要分割成的每个小文件的行数line
line=1000000   
a=`expr $totalline / $line`    
b=`expr $totalline % $line` 
#获取小文件个数filenum
if (( $b==0 ))   
then
    filenum=$a
else
    filenum=`expr $a + 1`
fi
echo filenum=$filenum
#进行文件分割,分割后第一个小文件名后缀为i,i最小值为1
i=1        # 修改处2:38 修改为1
while(( i<=$filenum ))
do
#每个小文件要截取行数在原文件范围min,max 
    p=`expr $i - 1`
    min=`expr $p \* $line + 1`
    max=`expr $i \* $line`
    sed -n "$min,$max"p ./$filename.tbl > ./$filename.tbl.$i
#将小文件导入数据库,mysql登录信息及小文件路径根据实际修改       # 修改处3:mysql连接信息(加了--local-infile)
#根据自己创建的数据库的用户名、密码、数据库实例的ip、端口号、已经tpc安装包的路径信息进行修改。
    mysql -umyroot -p'******' -h10.185.147.201 -P32307 --local-infile -Dtpcd -e "load data local infile '/root/tpch-mysql/dbgen/$filename.tbl.$i' into table $filename fields terminated by '|';"
    i=`expr $i + 1`
done

Save after modification and exit# :wq

Edit the three scripts according to the above modifications, and run the scripts in the background (this process, in the process of disassembling the data table and disassembling, import data into the database at the same time, so there is no need to perform data import operations) three scripts at the same time Turn on:

nohup sh split_import_file_partsupp.sh>split_import_partsupp.log 2>&1 &
nohup sh split_import_file_orders.sh>split_import_orders.log 2>&1 &
nohup sh split_import_file_lineitem.sh>split_import_lineitem.log 2>&1 &

During the running process, you can check the process to see if the run is over, # ps -ef process number

ps -ef | grep split
  • step8 . Modify the execution test script.
    First modify the database information in the script (account name, password, IP of the database instance, port number)
vi tpch-benchmark-olap.sh

Modify mysql connection information

#!/bin/sh
PATH=$PATH:/usr/local/bin
export PATH
#set -u
#set -x
#set -e
. ~/.bash_profile > /dev/null 2>&1
exec 3>&1 4>&2 1>> tpch-benchmark-olap-`date +'%Y%m%d%H%M%S'`.log 2>&1
I=1
II=3
while [ $I -le $II ]
do
N=1
T=23
while [ $N -lt $T ]
do
  if [ $N -lt 10 ] ; then
    NN='0'$N
  else
    NN=$N
  fi
  echo "query $NN starting"
# /etc/init.d/mysql restart           # 修改这里mysql连接信息
  time mysql -h10.185.147.201 -P32307 -umyroot -p****** -Dtpcd < ./queries/tpch_${NN}.sql
  echo "query $NN ended!"
  N=`expr $N + 1`
  echo -e
  echo -e
done
 I=`expr $I + 1`
done
  • step9 , execute the script in the background
nohup sh tpch-benchmark-olap.sh >output.log 2>&1 &

In the /tpch-mysql/dbgen/ path, the log file tpch-benchmark-olap-20210528143245.log will be automatically generated to
check the log:

cat tpch-benchmark-olap-20210528143245.log

The log output is as follows:

query 01 starting
mysql: [Warning] Using a password on the command line interface can be insecure.
l_returnflag    l_linestatus    sum_qty sum_base_price  sum_disc_price  sum_charge      avg_qty avg_price       avg_disc        count_order
A       F       453004927.00    679255118036.63 645289010691.4043       671106139941.958450     25.499290       38234.734227    0.050007        17765394

real    3m2.575s
user    0m0.024s
sys     0m0.008s
query 01 ended!
...

Quietly wait for the completion of 22 statements.

4. tpcd 22 SQL statement analysis

Using TPC-H for performance testing requires a lot of work to achieve higher performance, such as indexing, reasonable distribution of table data (using table space and clustering technology), etc.

Here, from the perspective of query optimization technology, analyze the 22 query statements of TPC-H and the query execution plan corresponding to each statement executed by mainstream databases. The purpose is to understand the query optimization technology of each mainstream database, and further grasp with TPC-H examples. Query optimization technology, compared with the implementation of mainstream databases, has a good understanding of query optimization technology.

  • 1.Q1: Price statistics report query
    Q1 statement is a pricing summary report for querying lineItems . Query on a single table lineitem within a certain period of time, and perform statistics on all types of goods that have been paid and shipped, including information such as business volume billing, delivery, discounts, taxes, and average prices.
    The characteristics of the Q1 statement are: single table query operation with coexistence of grouping, sorting, and aggregation operations. This query will result in 95% to 97% of rows of data on the table being read.
    The query statement for Q1 is as follows:
select 
    l_returnflag, //返回标志
    l_linestatus, 
    sum(l_quantity) as sum_qty, //总的数量
    sum(l_extendedprice) as sum_base_price, //聚集函数操作
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, 
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, 
    avg(l_quantity) as avg_qty, 
    avg(l_extendedprice) as avg_price, 
    avg(l_discount) as avg_disc, 
    count(*) as count_order //每个分组所包含的行数
from 
    lineitem
where 
    l_shipdate <= date'1998-12-01' - interval '90' day //时间段是随机生成的
group by //分组操作
    l_returnflag, 
    l_linestatus
order by //排序操作
    l_returnflag, 
    l_linestatus;
  • 2.Q2:
    Query the supplier with the least cost Q2 statement queries the supplier with the least cost. In a given area, for a specified part (part of a certain type and size), which supplier can supply it at the lowest price, you can choose which supplier to order.
    The characteristics of the Q2 statement are: multi-table query operations with coexistence of sorting, aggregation operations, and sub-queries. The query statement does not grammatically limit the number of tuples returned, but the TPC-H standard stipulates that the query result only returns the first 100 rows (usually dependent on the application implementation).
    The query statement for Q2 is as follows:
select
    s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment /*查询供应者的帐户余额、名字、国家、零件的号码、生产者、供应者的地址、电话号码、备注信息 */
from
    part, supplier, partsupp, nation, region //五表连接
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = [SIZE] //指定大小,在区间[1, 50]内随机选择
    and p_type like '%[TYPE]' //指定类型,在TPC-H标准指定的范围内随机选择
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = '[REGION]' //指定地区,在TPC-H标准指定的范围内随机选择
    and ps_supplycost = ( //子查询
        select
            min(ps_supplycost) //聚集函数
        from
            partsupp, supplier, nation, region //与父查询的表有重叠
        where
            p_partkey = ps_partkey
            and s_suppkey = ps_suppkey
            and s_nationkey = n_nationkey
            and n_regionkey = r_regionkey
            and r_name = '[REGION]'
    )
order by //排序
    s_acctbal desc,
    n_name,
    s_name,
    p_partkey;
  • 3.Q3: Shipping priority query
    Q3 statement query to get the top 10 unshipped orders. Among the orders that have not been shipped before the specified date, the shipping priority of the order with the largest revenue (orders are sorted in descending order of revenue) and potential revenue (potential revenue is the sum of l_extendedprice * (1-l_discount)).
    The characteristics of the Q3 statement are: three-table query operation with coexistence of grouping, sorting and aggregation operations. The query statement does not grammatically limit the number of tuples returned, but the TPC-H standard stipulates that the query result only returns the first 10 rows (usually dependent on the application implementation).
    The query statement for Q3 is as follows:
select
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue, //潜在的收入,聚集操作
o_orderdate,
o_shippriority
from
customer, orders, lineitem //三表连接
where
c_mktsegment = '[SEGMENT]' //在TPC-H标准指定的范围内随机选择
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '[DATE]' //指定日期段,在在[1995-03-01, 1995-03-31]中随机选择
and l_shipdate > date '[DATE]'
group by //分组操作
l_orderkey, //订单标识
o_orderdate, //订单日期
o_shippriority //运输优先级
order by //排序操作
revenue desc, //降序排序,把潜在最大收入列在前面
o_orderdate;
  • 4.Q4: Order priority query
    Q4 statement query to get the order priority statistics value. Calculate the number of orders for a given three-month period. In each order, at least one line must be received by the customer after its submission date.
    The characteristics of the Q4 statement are: single-table query operations with grouping, sorting, aggregation operations, and sub-queries coexisting. Subqueries are correlated subqueries.
    The query statement for Q4 is as follows:
select
o_orderpriority, //订单优先级
count(*) as order_count //订单优先级计数
from orders //单表查询
where
o_orderdate >= date '[DATE]'
and o_orderdate < date '[DATE]' + interval '3' month //指定订单的时间段--某三个月,DATE是在1993年1月和1997年10月之间随机选择的一个月的第一天
and exists ( //子查询
select
*
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by //按订单优先级分组
o_orderpriority
order by //按订单优先级排序
o_orderpriority;
  • 5.Q5: Query the income brought by a supplier in a certain area to the company. The
    Q5 statement queries the income (revenue calculated by sum(l_extendedprice * (1 -l_discount))) obtained through a certain area parts supplier. It can be used to determine whether a local distribution center needs to be established in a given area.
    The characteristics of the Q5 statement are: multi-table join query operations with grouping, sorting, aggregation operations, and sub-queries coexisting.
    The query statement for Q5 is as follows:
select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue //聚集操作
from
customer,orders,lineitem,supplier,nation,region //六表连接
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = '[REGION]' //指定地区,在TPC-H标准指定的范围内随机选择
and o_orderdate >= date '[DATE]' //DATE是从1993年到1997年中随机选择的一年的1月1日
and o_orderdate < date '[DATE]' + interval '1' year
group by //按名字分组
n_name
order by //按收入降序排序,注意分组和排序子句不同
revenue desc;
  • 6.Q6: Forecast income change query
    Q6 statement query to get the incremental income brought by the conversion discount in a certain year. This is a typical "what-if" judgment, used to find ways to increase income. The forecasted revenue change query considers all the shipped orders with a discount between "DISCOUNT-0.01" and "DISCOUNT+0.01" in the specified year, and solves the increase in total revenue after the discount of the order whose l_quantity is less than quantity is eliminated.
    The characteristic of Q6 statement is: single table query operation with aggregation operation. The query statement uses the BETWEEN-AND operator, and some databases can optimize the BETWEEN-AND.
    The query statement for Q6 is as follows:
select
sum(l_extendedprice*l_discount) as revenue //潜在的收入增加量
from
lineitem //单表查询
where
l_shipdate >= date '[DATE]' //DATE是从[1993, 1997]中随机选择的一年的1月1日
and l_shipdate < date '[DATE]' + interval '1' year //一年内
and l_discount between [DISCOUNT] - 0.01 and [DISCOUNT] + 0.01 //between
and l_quantity < [QUANTITY]; // QUANTITY在区间[24, 25]中随机选择
  • 7.Q7: Query of freight profitability The
    Q7 statement is to query the profitability of sales between the country of the supplier and the country where the goods are sold. This query determines the amount of goods shipped between the two countries to help renegotiate the freight contract.
    The characteristics of Q7 statements are: multi-table query operations with coexistence of grouping, sorting, aggregation, and sub-query operations. There are no other query objects in the parent query of the subquery, and it is a relatively simple subquery.
    The query statement for Q7 is as follows:
select
supp_nation, //供货商国家
cust_nation, //顾客国家
l_year, sum(volume) as revenue //年度、年度的货运收入
from ( //子查询
select
n1.n_name as supp_nation,
n2.n_name as cust_nation,
extract(year from l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume
from
supplier,lineitem,orders,customer,nation n1,nation n2 //六表连接
where
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and ( // NATION2和NATION1的值不同,表示查询的是跨国的货运情况
(n1.n_name = '[NATION1]' and n2.n_name = '[NATION2]')
or (n1.n_name = '[NATION2]' and n2.n_name = '[NATION1]')
)
and l_shipdate between date '1995-01-01' and date '1996-12-31'
) as shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year;
  • 8.Q8: Country market share query
    Q8 statement is to query the changes in the market share of a given part type in a certain region in a certain country in the past two years.
    The characteristics of the Q8 statement are: query operations with coexistence of grouping, sorting, aggregation, and sub-query operations. There are no other query objects in the parent query of the subquery. It is a relatively simple subquery, but the subquery itself is a multi-table join query.
    The TPC-H standard defines a modified SQL equivalent to the Q8 statement, which is basically the same in format as the above query statement, mainly because the target column uses a different expression, which will not be repeated here.
    The query statement for Q8 is as follows:
select
o_year, //年份
sum(case
when nation = '[NATION]'//指定国家,在TPC-H标准指定的范围内随机选择
then volume
else 0
end) / sum(volume) as mkt_share //市场份额:特定种类的产品收入的百分比;聚集操作
from //子查询
(select
extract(year from o_orderdate) as o_year, //分解出年份
l_extendedprice * (1-l_discount) as volume, //特定种类的产品收入
n2.n_name as nation
from
part,supplier,lineitem,orders,customer,nation n1,nation n2,region //八表连接
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = '[REGION]' //指定地区,在TPC-H标准指定的范围内随机选择
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31' //只查95、96年的情况
and p_type = '[TYPE]' //指定零件类型,在TPC-H标准指定的范围内随机选择
) as all_nations
group by //按年分组
o_year
order by //按年排序
o_year;
  • 9.Q9: Product type profit estimation query
    Q9 statement is to query the total profit of all ordered parts in each country in each year.
    The characteristics of Q9 statements are: query operations with coexistence of grouping, sorting, aggregation, and sub-query operations. There are no other query objects in the parent query of the subquery. It is a relatively simple subquery, but the subquery itself is a multi-table join query. The LIKE operator is used in the subquery, and some query optimizers do not support the optimization of the LIKE operator.
    The query statement for Q9 is as follows:
select
nation,
o_year,
sum(amount) as sum_profit //每个国家每一年所有被定购的零件在一年中的总利润
from
(select
n_name as nation, //国家
extract(year from o_orderdate) as o_year, //取出年份
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount //利润
from
part,supplier,lineitem,partsupp,orders,nation //六表连接
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%[COLOR]%' //LIKE操作,查询优化器可能进行优化
) as profit
group by //按国家和年份分组
nation,
o_year
order by //按国家和年份排序,年份大者靠前
nation,
o_year desc;
  • 10.Q10: The query
    Q10 sentence for shipping problems is to query the customers who have problems with shipping and the losses caused by each country within three months from a certain moment.
    The characteristics of the Q10 statement are: multi-table join query operation with coexistence of grouping, sorting, and aggregation operations. The query statement does not grammatically limit the number of tuples returned, but the TPC-H standard stipulates that the query result only returns the first 10 rows (usually dependent on the application implementation).
    The query statement for Q10 is as follows:
select
c_custkey, c_name, //客户信息
sum(l_extendedprice * (1 - l_discount)) as revenue, //收入损失
c_acctbal,
n_name, c_address, c_phone, c_comment //国家、地址、电话、意见信息等
from
customer, orders, lineitem, nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '[DATE]' // DATE是位于1993年一月到1994年十二月中任一月的一号
and o_orderdate < date '[DATE]' + interval '3' month //3个月内
and l_returnflag = 'R' //货物被回退
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc;
  • 11.Q11: Inventory value query
    Q11 statement is to query the value of parts supplied in a certain country in the inventory.
    The characteristics of the Q11 statement are: multi-table join query operations with coexistence of grouping, sorting, aggregation, and sub-query operations. The subquery is located in the HAVING condition of the grouping operation.
    The query statement for Q11 is as follows:
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value //聚集操作,商品的总价值
from
partsupp, supplier, nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = '[NATION]'
group by
ps_partkey having //带有HAVING子句的分组操作
sum(ps_supplycost * ps_availqty) > ( //HAVING子句中包括有子查询
select
sum(ps_supplycost * ps_availqty) * [FRACTION] //子查询中存在聚集操作;FRACTION为0.0001/SF1
from
partsupp, supplier, nation //与父查询的表连接一致
where //与父查询的WHEWR条件一致
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = '[NATION]' //指定国家
)
order by //按商品的价值降序排序
value desc;
  • 12.Q12: Shipping mode and order priority query
    Q12 statement query to obtain shipping mode and order priority. Can help decision-making: whether choosing a cheap shipping mode will cause consumers to receive more goods after the contract date, which will have a negative impact on urgent priority orders.
    1 SF, Scale Factor, the scale factor of the database. The TPC-H standard stipulates that the scale factor of the test database must be selected from the following fixed values: 1, 10, 30, 100, 1000, 3000, 10000 (equivalent to 1GB, 10GB, 30GB, 100GB, 1000GB, 3000GB, 10000GB). The size of the database is defined as 1 by default (for example: SF=1; approximately 1GB).
    The characteristics of the Q12 statement are: the two-table join query operation with the coexistence of grouping, sorting, and aggregation operations.
    The query statement for Q12 is as follows:
select
l_shipmode,
sum(case //聚集操作
when o_orderpriority ='1-URGENT' //OR运算,二者满足其一即可,选出URGENT或HIGH的
or o_orderpriority ='2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT' //AND运算,二者都不满足,非URGENT非HIGH的
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('[SHIPMODE1]', '[SHIPMODE2]') /* 指定货运模式的类型,在TPC-H标准指定的范围内随机选择,SHIPMODE2必须有别于SHIPMODE1 */
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '[DATE]' //从1993年到1997年中任一年的一月一号
and l_receiptdate < date '[DATE]' + interval '1' year //1年之内
group by //分组操作
l_shipmode
order by //排序操作
l_shipmode;
  • 13.Q13: Consumer order quantity query
    Q13 statement query obtains the consumer's order quantity, including consumers who have no order records in the past and present.
    The characteristics of Q13 statements are: query operations with coexistence of grouping, sorting, aggregation, sub-query, and left outer join operations.
    The query statement for Q13 is as follows:
select
c_count, count(*) as custdist //聚集操作,统计每个组的个数
from //子查询
(select
c_custkey,
count(o_orderkey)
from
customer left outer join orders on //子查询中包括左外连接操作
c_custkey = o_custkey
and o_comment not like ‘%[WORD1]%[WORD2]%’ //LIKE操作
//WORD1 为以下四个可能值中任意一个:special、pending、unusual、express
//WORD2 为以下四个可能值中任意一个:packages、requests、accounts、deposits
group by //子查询中的分组操作
c_custkey
)as c_orders (c_custkey, c_count)
group by //分组操作
c_count
order by //排序操作
custdist desc, //从大到小降序排序
c_count desc;
  • 14.Q14: Promotional effect query
    Q14 statement queries how much of the revenue obtained in a certain month is from promotional parts. It is used to monitor the market reaction brought by the promotion.
    The characteristics of the Q14 statement are: query operations with coexistence of grouping, sorting, aggregation, sub-query, and left outer join operations.
    The query statement for Q14 is as follows:
select
100.00 * sum(case
when p_type like 'PROMO%' //促销零件
then l_extendedprice*(1-l_discount) //某一特定时间的收入
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem, part
where
l_partkey = p_partkey
and l_shipdate >= date '[DATE]' // DATE是从1993年到1997年中任一年的任一月的一号
and l_shipdate < date '[DATE]' + interval '1' month;
  • 15.Q15: First-class supplier query The
    Q15 statement query obtains the information of the supplier (ranked first) that has contributed the most to the total revenue within a certain period of time. It can be used to decide which first-class suppliers to give rewards, give more orders, give special certifications, give encouragement and other incentives.
    The characteristics of the Q15 statement are: join operations of ordinary tables and views with coexistence of sorting, aggregation, and aggregation sub-query operations.
    The query statement for Q15 is as follows:
create view revenue[STREAM_ID](supplier_no, total_revenue) as //创建复杂视图(带有分组操作)
select
l_suppkey,
sum(l_extendedprice * (1 - l_discount)) //获取供货商为公司带来的总利润
from
lineitem
where
l_shipdate >= date '[DATE]' //DATE 是从1993年一月到1997年十月中任一月的一号
and l_shipdate < date '[DATE]' + interval '3' month //3个月内
group by //分组键与查询对象之一相同
l_suppkey;


//查询语句
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,revenue[STREAM_ID] //普通表与复杂视图进行连接操作
where
s_suppkey = supplier_no
and total_revenue = (//聚集子查询
select
max(total_revenue)
from
revenue[STREAM_ID] //聚集子查询从视图获得数据
)
order by
s_suppkey;
//删除视图
drop view revenue[STREAM_ID];
  • 16.Q16: Part/supplier relationship query
    Q16 statement query to obtain the number of suppliers who can supply parts with the specified contribution conditions. It can be used to determine whether there are sufficient suppliers when the order is large and the task is urgent.
    The characteristics of the Q16 statement are: two table join operations with coexistence of grouping, sorting, aggregation, deduplication, and NOT IN sub-query operations.
    The query statement for Q16 is as follows:
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt //聚集、去重操作
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> '[BRAND]'
// BRAND=Brand#MN ,M和N是两个字母,代表两个数值,相互独立,取值在1到5之间
    and p_type not like '[TYPE]%' //消费者不感兴趣的类型和尺寸
and p_size in ([SIZE1], [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])
//TYPEX是在1到50之间任意选择的一组八个不同的值
    and ps_suppkey not in ( //NOT IN子查询,消费者排除某些供货商
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by //分组操作
    p_brand,
    p_type,
    p_size
order by //排序操作
    supplier_cnt desc, //按数量降序排列,按品牌、种类、尺寸升序排列
    p_brand,
    p_type,
    p_size;
  • 17.Q17: Small order revenue query
    Q17 statement query obtains small batch orders that are less than 20% of the average supply. For parts of specified brands and specified packaging types, determine the average number of items (past and pending) of these order parts in all orders in a seven-year database. If less than 20% of the orders for these parts are no longer accepted, how much will the average annual loss be? So this query can be used to calculate how much the average annual income will be lost if there is no small order (because the freight of a large number of goods will reduce management costs).
    The characteristic of Q17 statement is: join operation of two tables with coexistence of aggregation and aggregation subquery operation.
    The query statement for Q17 is as follows:
select
    sum(l_extendedprice) / 7.0 as avg_yearly //聚集操作
from
    lineitem, part
where
    p_partkey = l_partkey
    and p_brand = '[BRAND]' /*指定品牌。 BRAND=’Brand#MN’ ,M和N是两个字母,代表两个数值,相互独立,取值在1到5之间 */
    and p_container = '[CONTAINER]' //指定包装类型。在TPC-H标准指定的范围内随机选择
    and l_quantity < ( //聚集子查询
        select
            0.2 * avg(l_quantity)
        from
            lineitem
        where
            l_partkey = p_partkey
    );
  • 18.Q18: Large order customer query
    Q18 statement query to obtain supplier information that is larger than the specified supply. It can be used to determine whether there are sufficient suppliers when the order is large and the task is urgent.
    The characteristics of the Q18 statement are: three-table join operation with coexistence of grouping, sorting, aggregation, and IN subquery operations. The query statement does not grammatically limit the number of tuples returned, but the TPC-H standard stipulates that the query result only returns the first 100 rows (usually dependent on the application implementation).
    The query statement for Q18 is as follows:
select
    c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, //基本信息
    sum(l_quantity) //订货总数
from
    customer, orders, lineitem
where
    o_orderkey in ( //带有分组操作的IN子查询
        select
            l_orderkey
        from
            lineitem
        group by 
            l_orderkey having
            sum(l_quantity) > [QUANTITY] // QUANTITY是位于312到315之间的任意值
    )
    and c_custkey = o_custkey
    and o_orderkey = l_orderkey 
group by
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice
order by
    o_totalprice desc,
    o_orderdate;
  • 19.Q19: Discount revenue query The
    Q19 statement query obtains the total discount revenue for all orders of three different types of air or manual transportation parts. The selection of parts takes into account the specific brand, packaging and size range. This query is an example of using data mining tools to generate formatted code.
    The characteristics of the Q19 statement are: three-table join operation with coexistence of grouping, sorting, aggregation, and IN subquery operations.
    The query statement for Q19 is as follows:
select
    sum(l_extendedprice * (1 - l_discount) ) as revenue
from
    lineitem, part
where
(
    p_partkey = l_partkey
    and p_brand = ‘[BRAND1]’ /*特定品牌。BRAND1、BRAND2、BRAND3=‘Brand#MN’,M和N是两个字母,代表两个数值,相互独立,取值在1到5之间 */
    and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’) //包装范围
    and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10 /* QUANTITY1 是1到10之间的任意取值 */
    and p_size between 1 and 5 //尺寸范围
    and l_shipmode in (‘AIR’, ‘AIR REG’) //运输模式,如下带有阴影的粗体表示的条件是相同的,存在条件化简的可能
    and l_shipinstruct = ‘DELIVER IN PERSON’
)
or
(
    p_partkey = l_partkey
    and p_brand = ‘[BRAND2]’
    and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
    and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10 /* QUANTITY2 是10到20之间的任意取值 */
    and p_size between 1 and 10
    and l_shipmode in (‘AIR’, ‘AIR REG’)
    and l_shipinstruct = ‘DELIVER IN PERSON’
)
or
(
    p_partkey = l_partkey
    and p_brand = ‘[BRAND3]’
    and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
    and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10 /* QUANTITY3 是20到30之间的任意取值 */
    and p_size between 1 and 15
    and l_shipmode in (‘AIR’, ‘AIR REG’)
    and l_shipinstruct = ‘DELIVER IN PERSON’
);
  • 20.Q20: Supplier Competitiveness Query
    Q20 statement query determines that within a certain year, find out the supply of goods in the specified country that can provide a more competitive price for a certain part of the product. The so-called more competitive suppliers refer to those suppliers with surplus parts. If the supplier or supplier transports more than 50% of a certain part of a given country in a certain year, it is a surplus.
    The characteristics of the Q20 statement are: two table join operations with coexistence of sorting, aggregation, IN subquery, and ordinary subquery operations.
    The query statement for Q20 is as follows:
select
    s_name, s_address
from
    supplier, nation
where
    s_suppkey in ( //第一层的IN子查询
        select
            ps_suppkey
        from
            partsupp
        where
            ps_partkey in ( //第二层嵌套的IN子查询
                select
                    p_partkey
                from
                    part
                where
                    p_name like '[COLOR]%' //COLOR为产生P_NAME的值的列表中的任意值
            )
            and ps_availqty > (//第二层嵌套的子查询
                select
                    0.5 * sum(l_quantity) //聚集子查询
                from
                    lineitem
                where
                    l_partkey = ps_partkey
                    and l_suppkey = ps_suppkey
                    and l_shipdate >= date('[DATE]’) //DATE为在1993年至1997年的任一年的一月一号
                    and l_shipdate < date('[DATE]’) + interval ‘1’ year //1年内
            )
    )
    and s_nationkey = n_nationkey
    and n_name = '[NATION]' //TPC-H标准定义的任意值
order by
    s_name;
  • 21.Q21: Query for suppliers who cannot deliver on time.
    Q21 query queries to obtain suppliers who cannot deliver on time .
    The characteristics of the Q21 statement are: a four-table join operation with grouping, sorting, aggregation, EXISTS subquery, and NOT EXISTS subquery operations coexisting. The query statement does not grammatically limit the number of tuples returned, but the TPC-H standard stipulates that the query result only returns the first 100 rows (usually dependent on the application implementation).
    The query statement for Q21 is as follows:
select
    s_name, count(*) as numwait
from
    supplier, lineitem l1, orders, nation
where
    s_suppkey = l1.l_suppkey
    and o_orderkey = l1.l_orderkey
    and o_orderstatus = 'F'
    and l1.l_receiptdate > l1.l_commitdate
    and exists ( //EXISTS子查询
        select
            *
        from
            lineitem l2
        where
            l2.l_orderkey = l1.l_orderkey
            and l2.l_suppkey <> l1.l_suppkey
    )
    and not exists ( //NOT EXISTS子查询
        select
            *
        from
            lineitem l3
        where
            l3.l_orderkey = l1.l_orderkey
            and l3.l_suppkey <> l1.l_suppkey
            and l3.l_receiptdate > l3.l_commitdate
    )
    and s_nationkey = n_nationkey
    and n_name = '[NATION]' //TPC-H标准定义的任意值
group by
    s_name
order by
    numwait desc,
    s_name;
  • 22.Q22: Global sales opportunity query
    Q22 statement query to obtain the geographical distribution of consumers' possible purchases. This query calculates the number of consumers who are more positive than the average in the specified country but have not placed an order for seven years. Can reflect the attitude of ordinary consumers, that is, purchase intentions.
    The characteristics of the Q22 statement are: a four-table join operation with grouping, sorting, aggregation, EXISTS subquery, and NOT EXISTS subquery operations coexisting.
    The query statement for Q22 is as follows:
select
    cntrycode,
    count(*) as numcust,
    sum(c_acctbal) as totacctbal
from ( //第一层子查询
    select
        substring(c_phone from 1 for 2) as cntrycode,
        c_acctbal
    from
        customer
    where
        // I1…I7是在TPC-H中定义国家代码的可能值中不重复的任意值
        substring(c_phone from 1 for 2) in ('[I1]','[I2]’,'[I3]','[I4]','[I5]','[I6]','[I7]')
        and c_acctbal > (//第二层聚集子查询
            select
                avg(c_acctbal)
            from
                customer
            where
                c_acctbal > 0.00
                and substr (c_phone from 1 for 2) 
                                in ('[I1]','[I2]','[I3]','[I4]','[I5]','[I6]','[I7]')
        )
        and not exists (//第二层NOT EXISTS子查询
            select
                *
            from
                orders
            where
                o_custkey = c_custkey
        )
        ) as custsale
group by
    cntrycode
order by
    cntrycode;

5. Case study

Taking MySQL from major cloud vendors as an example, the test results are as follows. According to the analysis and comparison of this table, the pros and cons of various cloud vendors MySQL are obtained.

Insert picture description here