Kettle (PDI)-a tutorial for using ETL tools (1)

What is ETL

ETL, the abbreviation of Extract-Transform-Load in English, is used to describe the process of extracting, transforming, and loading data from the source to the destination. The term ETL is commonly used in data warehouses, but its object is not limited to data warehouses.

What is Kettle (PDI)

Kettle was originally an open source ETL tool, the full name is KDE Extraction, Transportation, Transformation and Loading Environment. In 2006, Pentaho acquired the Kettle project. Matt Casters, the original Kettle project initiator, joined the Pentaho team and became the data integration architect of the Pentaho suite. [1] Since then, Kettle has become the main component of the enterprise-level data integration and business intelligence suite Pentaho. Part, Kettle was also renamed Pentaho Data Integration [1-2]. Pentaho was acquired by Hitachi Data Systems in 2015. [3] (Hitachi Data Systems was renamed Hitachi Vantara in 2017 [4])
Pentaho Data Integration is developed in Java and supports cross-platform operation. Its features include: support for 100% code-free, drag-and-drop development of ETL data pipelines; dockable Including traditional databases, files, big data platforms, interfaces, streaming data and other data sources; support ETL data pipeline to join machine learning algorithms.
Pentaho Data Integration is divided into a commercial version and an open source version. As of January 2021, the cumulative download of the open source version reached 8.36 million, of which 19% came from China[5]. In China, most people are still used to refer to the open source version of Pentaho Data Integration as Kettle.

There are two kinds of script files in Kettle, transformation (.ktr) and job (.kjb)

The four core components of kettle

  • SPOON (Spoon): Allows you to design the ETL transformation process (Transformation) through a graphical interface. (GUI method)
  • PAN (Pan Pan): Allows you to run the ETL transformation (transform executor) designed by Spoon in batches. (Command line method)
  • CHEF (Chef): Allows you to create tasks (Job executor). Tasks are more conducive to automating the complex work of updating the data warehouse by allowing each conversion, task, script, etc. The task pass allows each conversion, task, script, etc. The task will be checked to see if it runs correctly. (GUI method)
  • KITCHEN: Allows you to use tasks designed by Chef in batches (for example, using a time scheduler). (Command line method)

installation

https://sourceforge.net/projects/pentaho/files/Pentaho%209.1/client-tools/

Start the graphical interface

Insert picture description here


mysql driver intopdi-ce-9.1.0.0-324\data-integration\lib

Basic demo

How to export data from multiple databases to a local database

1. Create a new local database

Create a new kettle_test database locally and add 2 tables.

-- ----------------------------
-- Table structure for eq_area_book
-- ----------------------------
DROP TABLE IF EXISTS `eq_area_book`;
CREATE TABLE `eq_area_book`  (
  `id` int NOT NULL COMMENT '主键',
  `area_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT '区域',
  `area_no` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '区域编号',
  `area_name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '区域名称',
  `book_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '书',
  `book` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '书',
  `book_category` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '书分类',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

New connection

Insert picture description here


Insert picture description here


if Chinese garbled characters appear characterEncoding:utf8

New table input and table output

Insert picture description here


insert and update

Insert picture description here

Create two new conversions respectively, complete database data synchronization, and create new jobs, as shown in the figure

Insert picture description here


Scheduling

Insert picture description here