Data Mining: Concepts and Techniques Notes (1) Introduction

Table of Contents

1.1 Why is data mining important?

1.2 What is data mining?

1.3 Data mining: on what kind of data?

1.3.1 Relational Database

1.3.2 Data Warehouse

1.3.3 Transaction Database

1.3.4 Advanced database systems and advanced database applications

1.4 Data mining functions: what types of patterns can be mined?

1.4.1 Concept/class description: characteristics and distinctions

1.4.2 Association analysis

1.4.3 Classification and prediction

1.4.4 Cluster analysis

1.4.5 Outlier analysis

1.4.6 Evolution analysis

1.5 Are all patterns interesting?

1.6 Classification of data mining systems

1.7 Major issues in data mining

1.8 Summary


1.1 Why is data mining important?

Data warehouse (Section 1.3.2): a repository of multiple heterogeneous data sources, organized under a unified schema at a single site to support management decision-making. Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP).

Online analytical processing (OLAP): an analysis technique with functions for summarization, consolidation, and aggregation, and the ability to view information from different angles. Although OLAP tools support multidimensional analysis and decision-making, in-depth analysis still requires other tools, for example for data classification, clustering, and characterizing how data change over time.

The evolution of database technology is as follows:

Data mining tools perform data analysis and can uncover important data patterns, contributing greatly to business decision-making, knowledge bases, and scientific and medical research. The gap between data and information calls for the systematic development of data mining tools that turn "data tombs" into "golden nuggets" of knowledge.

1.2 What is data mining?

Data mining is the extraction, or "mining", of knowledge from large amounts of data.

Many people treat data mining as a synonym for another commonly used term, "knowledge discovery in databases" (KDD). Others regard data mining as just one essential step in the knowledge discovery process. The knowledge discovery process is shown in Figure 1.4.

We adopt the broad view of data mining: Data mining is the process of mining interesting knowledge from large amounts of data stored in databases, data warehouses or other information repositories.

Data mining involves the integration of multidisciplinary technologies, including database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information extraction, image and signal processing, and spatial data analysis.

The emphasis is on efficient and scalable data mining techniques for large databases. An algorithm is scalable if, given the available system resources such as memory and disk space, its running time grows linearly with the size of the database.

1.3 Data mining: on what kind of data?

1.3.1 Relational Database

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, called a database, and a set of software programs for managing and accessing the data.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large number of tuples (records or rows). Each tuple in a relation represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as the entity-relationship (ER) data model, models the database as a set of entities and the relationships between them. An ER model is often constructed for a relational database.

Example 1.1 The AllElectronics company is described by the following relational tables: customer, item, employee, and branch. Fragments of these tables are shown in Figure 1.6.

The relation customer consists of a set of attributes, including a unique customer identification number (cust_ID), the customer's name, address, age, occupation, annual income, credit information, category, and so on.

Relational data can be accessed through database queries. Suppose your job is to analyze the AllElectronics data. Using relational queries, you can ask things like "Show me a list of all items sold in the last quarter". Relational query languages also provide aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum). These let you ask "Show me the total sales of the last month, grouped by branch", "How many sales transactions occurred in December?", or "Which salesperson had the highest sales?"
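As a rough illustration of such aggregate queries, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and its rows are assumptions made up for this example; they are not the AllElectronics schema from Figure 1.6.

```python
import sqlite3

# Hypothetical, simplified sales table (schema and rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (trans_ID TEXT, branch TEXT, month INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("T100", "Vancouver", 12, 250.0),
        ("T200", "Vancouver", 12, 100.0),
        ("T300", "Toronto", 12, 400.0),
        ("T400", "Toronto", 11, 75.0),
    ],
)

# "Show me the total sales of December, grouped by branch."
totals_by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE month = 12 GROUP BY branch ORDER BY branch"
).fetchall()

# "How many sales transactions occurred in December?"
december_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE month = 12"
).fetchone()[0]

print(totals_by_branch)
print(december_count)
```

Queries like these retrieve and summarize stored facts; they do not by themselves discover patterns, which is the gap data mining is meant to fill.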

1.3.2 Data Warehouse

Suppose that AllElectronics is a successful multinational company with branches all over the world. Each branch has its own set of databases. The president of AllElectronics asks you to provide an analysis of the company's third-quarter sales for each item type and each branch. This is a difficult task, particularly because the relevant data are spread across several databases and physically located at many sites. If AllElectronics had a data warehouse, the task would be easy. A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed through data cleaning, data transformation, data integration, data loading, and periodic data refreshing. This process is studied in detail in Chapters 2 and 3. Figure 1.7 shows the basic architecture of the AllElectronics data warehouse.

To facilitate decision-making, the data in a data warehouse are organized around major subjects such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective (such as the past 5 to 10 years) and are typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region. A data warehouse is usually modeled by a multidimensional database structure, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales_amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of the data and allows fast access to precomputed, summarized data.

Example 1.2 The summarized sales data of AllElectronics are shown in Figure 1.8(a). The data cube has three dimensions: address (with city values), time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values home entertainment, computer, phone, security). The aggregate value stored in each cell of the cube is sales_amount (in units of $1000). For example, the total sales of security systems in Vancouver in the first quarter were $400,000, stored in the cell for (Vancouver, Q1, security). Other cubes can store aggregate sums over each dimension, corresponding to the aggregates obtained with different SQL group-bys (for example, total sales per city and quarter, per quarter and item, or per single dimension).

By providing multidimensional data views and precomputed summaries, data warehouse systems are well suited to online analytical processing (OLAP). OLAP operations use background knowledge about the domain of the data to present data at different levels of abstraction, accommodating different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which let users view the data at different degrees of summarization, as shown in Figure 1.8(b). For example, you can drill down on sales data summarized by quarter to see the data summarized by month. Similarly, you can roll up on sales data summarized by city to view the data summarized by country.
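A minimal sketch of a group-by and a roll-up over a tiny cube, written in plain Python. Only the value 400 for Vancouver, Q1, security comes from Example 1.2; the other cell values and the city-to-country mapping (country_of) are assumptions for illustration.

```python
from collections import defaultdict

# Cube cells keyed by (city, quarter, item); values are sales_amount
# in units of $1000. Only the 400 comes from the text; the rest is made up.
cube = {
    ("Vancouver", "Q1", "security"): 400,
    ("Vancouver", "Q1", "computer"): 825,
    ("Vancouver", "Q2", "security"): 512,
    ("Toronto", "Q1", "security"): 350,
}

# Assumed concept hierarchy for the address dimension: city -> country.
country_of = {"Vancouver": "Canada", "Toronto": "Canada"}


def aggregate(cells, key_of):
    """Sum cell values after mapping each cell key to a coarser key."""
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[key_of(key)] += amount
    return dict(totals)


# Group-by on (city, quarter): the item dimension is summed out.
by_city_quarter = aggregate(cube, lambda k: (k[0], k[1]))

# Roll-up on address (city -> country), still summed over items.
by_country_quarter = aggregate(cube, lambda k: (country_of[k[0]], k[1]))

print(by_city_quarter)
print(by_country_quarter)
```

A real warehouse would precompute and store such aggregates rather than recompute them on every request; the sketch only shows what the operations mean.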

1.3.3 Transaction Database

In general, a transaction database consists of a file in which each record represents a transaction. A transaction typically includes a unique transaction identification number (trans_ID) and a list of the items making up the transaction (for example, the items purchased in a store). The transaction database may have additional tables associated with it that contain other information about the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson, the branch where the sale took place, and so on.

Example 1.3 Transactions can be stored in a table, with one record per transaction. A fragment of the AllElectronics transaction database is shown in Figure 1.9. From a relational database point of view, the sales table in Figure 1.9 is a nested relation, because the attribute "list of item_ID" contains a set of items. Since most relational database systems do not support nested relational structures, a transaction database is usually either stored in a flat file in a format similar to the table in Figure 1.9, or unfolded into a standard relation similar to the items_sold table in Figure 1.6.

As an analyst of the AllElectronics database, you may want to ask "Show me all the items purchased by Sandy Smith" or "How many transactions include item number I3?" Answering such queries may require a scan of the entire transaction database.

Suppose you want to dig deeper into the data and ask "Which items sell well together?" This kind of market basket analysis lets you bundle groups of items together as a strategy for increasing sales. For example, knowing that printers are commonly purchased together with computers, you could offer an expensive printer model at a discount to customers who buy selected computers, in the hope of selling more of the expensive printers. A regular data retrieval system cannot answer queries like this one. A data mining system for transaction data can, by identifying sets of items that are frequently sold together.
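The unfolding of a nested relation into a standard relation, and the "item I3" query, can be sketched as follows. The transaction and item IDs are illustrative and are not copied from Figure 1.9.

```python
# Nested form: one record per transaction, with a list of item IDs.
# The IDs below are invented for illustration.
sales = [
    {"trans_ID": "T100", "items": ["I1", "I3", "I8"]},
    {"trans_ID": "T200", "items": ["I2", "I8"]},
]

# Flat form: one (trans_ID, item_ID) row per purchased item.
items_sold = [
    (record["trans_ID"], item_id)
    for record in sales
    for item_id in record["items"]
]

# "How many transactions include item number I3?"
# This scans every row, as the text notes such queries may have to.
transactions_with_i3 = len({tid for tid, item in items_sold if item == "I3"})

print(items_sold)
print(transactions_with_i3)
```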
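One simple way to find items that are often sold together is to count how many transactions contain each pair of items. The sketch below is a brute-force count over made-up baskets, not a scalable frequent-itemset algorithm, and the threshold min_count is an assumption.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, one set of items per transaction.
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "scanner"},
    {"printer", "paper"},
]

# Count the transactions in which each unordered pair of items co-occurs.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_count = 2  # assumed threshold for calling a pair "frequent"
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(frequent_pairs)
```

Enumerating all pairs is fine for a handful of baskets but grows quickly with basket size, which is why scalability is a central concern for mining methods on large transaction databases.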

1.3.4 Advanced database systems and advanced database applications

New database applications include handling spatial data (such as maps), engineering design data (such as building designs, system components, or integrated circuits), hypertext and multimedia data (including text, image, and sound data), time-related data (such as historical records or stock exchange data), and the World Wide Web (a huge, widely distributed information repository made available by the Internet). These applications require efficient data structures and scalable methods for handling complex object structures, variable-length records, semi-structured or unstructured data, text and multimedia data, and database schemas with complex structures and dynamic changes.

In response to these needs, advanced database systems and application-oriented database systems have been developed. They include object-oriented and object-relational database systems, spatial database systems, temporal and time-series database systems, heterogeneous and legacy database systems, and Web-based global information systems.

Object-oriented databases are based on the object-oriented programming paradigm. In general terms, each entity is treated as an object. In the AllElectronics example, objects can be individual employees, customers, or items. The data and code relating to an object are encapsulated in a single unit.

An object-relational database is built on the object-relational data model. This model extends the relational model by providing rich data types for handling complex objects and by adding object orientation. In addition, it includes special constructs in the relational query language to manage the added data types. The object-relational model thus extends the basic relational model with the ability to handle complex data types, class hierarchies, and object inheritance, as described above. Object-relational databases are becoming increasingly popular in industry and applications.

A spatial database contains spatially related information. Examples include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bitmaps or pixel maps. For example, a 2-D satellite image can be represented as raster data in which each pixel records the rainfall in a given area. Maps can also be represented in vector format, where roads, bridges, buildings, and lakes are described by basic geometric constructs such as points, lines, and polygons, and by the partitions and networks formed from these components.

Temporal databases and time-series databases both store time-related data. A temporal database usually stores data that include time-related attributes. These attributes may involve several timestamps, each with different semantics. A time-series database stores sequences of values that change over time, such as collected stock exchange data.

A text database contains word descriptions of objects. These descriptions are usually not simple keywords but long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents. Text databases may be highly unstructured (e.g., web pages on the World Wide Web). Some may be semi-structured (such as e-mail messages and some HTML/XML web pages), while others are well structured (such as library databases). A well-structured text database can typically be implemented using a relational database system.

Multimedia databases store image, audio, and video data. They are used in applications such as content-based image retrieval, voice mail, video on demand, the World Wide Web, and speech-based user interfaces that recognize spoken commands. Multimedia databases must support large objects, because data objects such as video can require gigabytes of storage. Specialized storage and retrieval techniques are also needed: video and audio data must be delivered in real time at a steady, predetermined rate to avoid gaps in picture or sound and overflow of system buffers. Such data are called continuous-media data.

A heterogeneous database consists of a set of interconnected, autonomous component databases. The components communicate in order to exchange information and answer queries. Objects in one component database may differ greatly from objects in the others, making it difficult to assimilate their semantics into the overall heterogeneous database. Many enterprises acquire legacy databases as a result of the long history of information technology development (including the use of different hardware and operating systems). A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-relational databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a legacy database may be connected by intra- or inter-computer networks.

The World Wide Web and its associated distributed information services (e.g., AOL, Yahoo!, Alta Vista, Prodigy) provide rich, worldwide online information services in which data objects are linked together to facilitate interactive access. Users seeking information of interest traverse from one object to another via links. Such systems offer ample opportunities and challenges for data mining. For example, understanding user access patterns can not only help improve system design (by providing efficient access between highly correlated objects) but also lead to better marketing decisions (for example, by placing advertisements in frequently visited documents, or by providing better customer/user classification and behavior analysis). Capturing user access patterns in such distributed information environments is called mining path traversal patterns.

1.4 Data mining functions: what types of patterns can be mined?

Since some patterns are not true for all data in the database, each pattern found usually carries a certainty or "credibility" measure.

1.4.1 Concept/class description: characteristics and distinctions

Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts in summarized, concise, yet precise terms. Such a description of a class or concept is called a class/concept description. It can be derived through (1) data characterization, which summarizes the data of the class under study (often called the target class); (2) data discrimination, which compares the target class with one or more comparative classes (often called the contrasting classes); or (3) both data characterization and discrimination.

Example 1.4 A data mining system should be able to produce a description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics. The result could be a general profile of these customers, such as: 40 to 50 years old, employed, and with a good credit rating. The system should allow users to drill down on any dimension, such as occupation, to view these customers by occupation.

Example 1.5 A data mining system should be able to compare two groups of AllElectronics customers, such as those who buy computer products regularly (more than twice a month) and those who buy them rarely (i.e., fewer than three times a year). The resulting description could be a general comparative profile: for example, 80% of the customers who frequently buy computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who rarely buy them are either seniors or youths and have no university degree. Drilling down on a dimension such as occupation, or adding a new dimension such as income_level, may reveal further discriminating features between the two classes.

1.4.2 Association analysis

"What is association analysis?" Association analysis is the discovery of association rules showing attribute-value conditions that frequently occur together in a given data set. Association analysis is widely used for market basket or transaction data analysis.

More formally, an association rule has the form X ⇒ Y, that is, "A1 ∧ ... ∧ Am ⇒ B1 ∧ ... ∧ Bn", where each Ai (i ∈ {1,...,m}) and Bj (j ∈ {1,...,n}) is an attribute-value pair. The rule is read as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y".

Example 1.6 Given the AllElectronics relational database, a data mining system may find rules of the following form:

age(X, "20-29") ∧ income(X, "20-29K") ⇒ buys(X, "CD_player") [support = 2%, confidence = 60%]

Here X is a variable representing a customer. The rule states that 2% (support) of the AllElectronics customers under study are 20 to 29 years old, have an annual income of 20-29K, and have bought a CD player at AllElectronics. A customer in this age and income group has a 60% probability (confidence, or certainty) of buying a CD player.
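To make the two bracketed numbers concrete, the sketch below computes support and confidence for a rule of this shape over a handful of made-up customer tuples. The records, the field names, and the inclusive reading of the age and income ranges are assumptions; the resulting values are not the 2% and 60% of Example 1.6.

```python
# Made-up customer tuples; field names are hypothetical.
customers = [
    {"age": 25, "income": 25_000, "buys_cd_player": True},
    {"age": 27, "income": 22_000, "buys_cd_player": True},
    {"age": 23, "income": 28_000, "buys_cd_player": False},
    {"age": 45, "income": 60_000, "buys_cd_player": True},
    {"age": 35, "income": 40_000, "buys_cd_player": False},
]


def antecedent(c):
    """Left-hand side of the rule: age 20-29 and income 20-29K (inclusive)."""
    return 20 <= c["age"] <= 29 and 20_000 <= c["income"] <= 29_000


matches_x = [c for c in customers if antecedent(c)]
matches_both = [c for c in matches_x if c["buys_cd_player"]]

# support: fraction of all tuples that satisfy both sides of the rule.
support = len(matches_both) / len(customers)
# confidence: fraction of tuples satisfying the left side that also
# satisfy the right side.
confidence = len(matches_both) / len(matches_x)

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```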

1.4.3 Classification and prediction

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class labels are known).

"How is the derived model presented?" The derived model can be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulas, or neural networks. A decision tree is a flowchart-like structure in which each node denotes a test on an attribute value, each branch represents an outcome of the test, and each leaf represents a class or class distribution. Decision trees can easily be converted into classification rules. A neural network, when used for classification, is a collection of neuron-like processing units with weighted connections between the units.

Classification can be used to predict the class label of data objects. In some applications, however, one may wish to predict missing or unavailable data values rather than class labels. This is usually the case when the predicted values are numerical, and it is often specifically referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to value prediction and is thus distinct from classification. Prediction also covers the identification of distribution trends based on the available data. Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.
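As noted earlier in this section, a decision tree is easily converted into classification rules: each path from the root to a leaf becomes one IF-THEN rule. The sketch below shows this on a small hypothetical tree; the attributes, outcomes, and class labels are invented, and the tuple/dict encoding of the tree is just one possible representation.

```python
# Hypothetical decision tree. An internal node is (attribute, {outcome: subtree});
# a leaf is a class label given as a string.
tree = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit", {"excellent": "buys", "fair": "does_not_buy"}),
})


def extract_rules(node, conditions=()):
    """Return one (conditions, class_label) pair per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the path so far is a complete rule
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, outcome),)))
    return rules


rules = extract_rules(tree)
for conditions, label in rules:
    body = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {body} THEN class = {label}")
```

Each printed line corresponds to exactly one leaf, so the rule set and the tree classify objects identically.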

Chapter 7 discusses classification and prediction in detail.

1.4.4 Cluster analysis

"What is cluster analysis?" Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. In general, class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels.

Cluster analysis forms the theme of Chapter 8.
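To illustrate how clustering can produce labels for unlabeled data, here is a minimal sketch of k-means on one-dimensional values. k-means is only one of many clustering methods and is not singled out by these notes; the spending figures and the initial centers are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers; returns (final_centers, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: label each value with the index of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Update step: move each center to the mean of its assigned values.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else old_center)
        centers = new_centers
    return centers, labels


# Made-up annual spending per customer; no class labels are given.
annual_spend = [120, 150, 130, 900, 950, 1000]
centers, labels = kmeans_1d(annual_spend, centers=[0.0, 500.0])

print(centers)
print(labels)  # cluster index per customer, usable as a generated label
```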

1.4.5 Outlier analysis

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications (e.g., fraud detection), rare events can be more interesting than regularly occurring ones. The analysis of outlier data is called outlier mining.

Example 1.9 Outlier analysis can uncover credit card fraud. It detects fraudulent use by spotting purchases on a given account that are extremely large compared with the regular charges incurred by the same account. Outliers can also be detected with respect to the location and type of purchase, or the purchase frequency. Outlier analysis is also discussed in Chapter 8.
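A rough sketch of the idea in Example 1.9: flag charges that are far from an account's typical charge. It uses the median absolute deviation (MAD) as the yardstick, which is one simple robust choice and not a method prescribed by these notes; the charges and the threshold of 10 are assumptions.

```python
from statistics import median


def flag_outliers(amounts, threshold=10.0):
    """Return amounts whose distance from the median exceeds
    `threshold` times the median absolute deviation (MAD)."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        return []  # degenerate case: this simple sketch does not handle it
    return [a for a in amounts if abs(a - center) / mad > threshold]


# Made-up charges on a single account, in dollars.
charges = [38, 40, 45, 48, 52, 55, 60, 2500]
suspicious = flag_outliers(charges)

print(suspicious)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme charge inflates the latter two and can hide itself in a small sample.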

1.4.6 Evolution analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, the distinct features of this kind of analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Data evolution analysis will be discussed further in Chapter 9.

1.5 Are all patterns interesting?

For a given user, only a small fraction of the patterns that could be generated are actually of interest.

This raises a series of questions for data mining systems. You may wonder: "What makes a pattern interesting? Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns?" To answer the first question, a pattern is interesting if (1) it is easy to understand, (2) it is valid on new or test data with some degree of certainty, (3) it is potentially useful, and (4) it is novel. A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.

There are several objective measures of pattern interestingness. They are based on the structure of the discovered patterns and the statistics underlying them. For an association rule of the form X ⇒ Y, one objective measure is rule support, the percentage of transactions (samples) that satisfy the rule. Support is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for association rules is confidence, the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y. More formally, support and confidence are defined as

support (X ⇒ Y) = P (X ∪ Y)

confidence (X ⇒ Y) = P (Y | X)

In general, each interestingness measure is associated with a threshold, which can be controlled by the user. For example, rules that do not meet a confidence threshold of 50% can be considered uninteresting. Rules below the threshold are likely to reflect noise, exceptions, or minority cases, and are probably of less value.
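The definitions and the thresholding step can be sketched on itemsets as follows. The transactions, the candidate rules, and both threshold values are assumptions chosen for illustration; note that "X ∪ Y" in the definition means the transaction contains every item of X and of Y, which is what the subset test below checks.

```python
# Made-up transactions, each a set of purchased items.
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "paper"},
    {"computer"},
    {"printer", "paper"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y):
    """P(Y | X): support of X and Y together, divided by support of X."""
    return support(x | y) / support(x)


# Hand-picked candidate rules X => Y (not an exhaustive enumeration).
candidate_rules = [
    ({"computer"}, {"printer"}),
    ({"printer"}, {"computer"}),
    ({"paper"}, {"printer"}),
    ({"computer"}, {"paper"}),
]

min_support, min_confidence = 0.5, 0.5  # assumed user-controlled thresholds
interesting = [
    (x, y)
    for x, y in candidate_rules
    if support(x | y) >= min_support and confidence(x, y) >= min_confidence
]

for x, y in interesting:
    print(sorted(x), "=>", sorted(y))
```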

1.6 Classification of data mining systems

Data mining is an interdisciplinary field influenced by multiple disciplines (see Figure 1.11), including database systems, statistics, machine learning, visualization, and information science. In addition, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, a data mining system may also integrate techniques from spatial data analysis, information extraction, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

Classification according to the type of database being mined

Classification according to the type of knowledge mined

Classification according to the techniques used

Classification according to the application

1.7 Major issues in data mining

This book emphasizes the major issues in data mining, covering mining methodology, user interaction, performance, and diverse data types.

Mining methodology and user interaction issues: these concern the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.

Performance issues: these include the efficiency, scalability, and parallelization of data mining algorithms.

There are many other issues as well.

1.8 Summary

Database technology has evolved from primitive file processing to the development of database management systems with query and transaction processing capabilities. Further progress has led to an increasing demand for efficient and effective tools for data analysis and data understanding. This demand is a result of the explosive growth of data collected by applications in business and management, government administration, science and engineering, and environmental control.

Data mining is the discovery of interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field that draws on areas such as database systems, data warehousing, statistics, machine learning, data visualization, information extraction, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and application fields such as business, economics, and bioinformatics.

The knowledge discovery process includes data cleaning, data integration, data transformation, data mining, pattern evaluation and knowledge representation.

Data patterns can be mined from many different kinds of databases, such as relational databases, data warehouses, and transactional, object-relational, and object-oriented databases. Interesting data patterns can also be extracted from other kinds of information repositories, including spatial, time-related, text, multimedia, and legacy databases, and the World Wide Web.

A data warehouse is a repository for long-term storage of data from multiple sources, organized to support management decision-making. The data are stored under a unified schema and are typically summarized. Data warehouse systems provide some data analysis capabilities, collectively called OLAP (online analytical processing).

Data mining functions include the discovery of concept/class descriptions, association, classification, prediction, clustering, trend analysis, deviation analysis, and similarity analysis. Characterization and discrimination are forms of data summarization. A pattern represents knowledge if it is easy to understand, valid on test data with some degree of certainty, potentially useful, and novel, or if it validates a hunch about which the user was curious. Measures of pattern interestingness, whether objective or subjective, can be used to guide the discovery process.

Data mining systems can be classified according to the type of database being mined, the type of knowledge being mined, or the technology used.

Efficient and effective data mining in large databases poses numerous requirements and great challenges to researchers and developers. The issues involved include data mining methodology, user interaction, performance and scalability, and the processing of a large variety of data types. Other issues include the development of data mining applications and their social impact.