# Preface

Before data can be analyzed, it must be preprocessed. This post briefly introduces commonly used data preprocessing methods.

# 1. The main tasks of data preprocessing

The main tasks of data preprocessing:

① Data discretization: binning discretization, entropy-based discretization, and ChiMerge discretization;

② Data scaling: also called data normalization; it unifies the ranges of the sample data so that attributes with different ranges do not distort the analysis results. For example, if one time attribute is measured in seconds and another in hours, they must be converted to the same unit;

③ Data cleaning: identifying and handling missing data, noisy data, inconsistent data, etc. For example, if a sample is missing a value for some attribute, the mean of that attribute over the other samples can be assigned to the missing entry;

④ Feature extraction and feature selection: selecting effective features for a classification task reduces the amount of data, improves the efficiency of building the classification model, and can also improve classification accuracy;
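As a small illustration of the mean-imputation strategy mentioned in ③, the sketch below (plain Python, with made-up sample values and a function name of my choosing) fills a missing attribute value with the mean of the observed values of the same attribute:

```python
# Mean imputation: replace a missing attribute value (None)
# with the mean of the observed values of the same attribute.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [30, 50, None, 70, None]   # hypothetical attribute column
print(impute_mean(incomes))          # missing entries become 50.0
```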

# 2. Data normalization methods

## 1. z-score standardization

z-score: also called the standard score; the z-score value is $z = \cfrac{x - \mu}{\sigma}$;

Here $x$ is the attribute value to be normalized, $\mu$ is the mean, and $\sigma$ is the standard deviation; the formula measures how many standard deviations $\sigma$ the current value $x$ lies from the mean $\mu$;

z-score normalization is also called Zero-Mean Normalization: given an attribute $A$ with mean $\mu$ and standard deviation $\sigma$, a value $x$ of attribute $A$ is normalized to $z = \cfrac{x - \mu}{\sigma}$;

For example, with mean $\mu = 82$ and standard deviation $\sigma = 39$, the value $60$ normalizes to $z = \cfrac{60 - 82}{39} \approx -0.564$
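The worked example above can be reproduced directly; a minimal sketch (the function name is mine):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Reproduce the worked example: mu = 82, sigma = 39, x = 60
print(round(z_score(60, mu=82, sigma=39), 3))  # -0.564
```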

## 2. Min-max normalization

$v = \cfrac{x - l}{r - l}(R - L) + L$, which maps a value $x$ from the original range $[l, r]$ to the new range $[L, R]$;

For example, if a sample attribute is annual income with value range $[10, 100]$, and we want to map it to the interval $[0, 1]$, then the value $20$ maps to:

$v = \cfrac{20 - 10}{100 - 10}(1 - 0) + 0 \approx 0.1111$
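The min-max formula translates directly into code; a minimal sketch (function name and defaults are mine):

```python
def min_max(x, l, r, L=0.0, R=1.0):
    """Map x from the original range [l, r] to the new range [L, R]."""
    return (x - l) / (r - l) * (R - L) + L

# Reproduce the worked example: annual income 20 in [10, 100] mapped to [0, 1]
print(round(min_max(20, 10, 100), 4))  # 0.1111
```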

# 3. Data discretization methods

## 1. Binning discretization

Binning discretization divides into equal-width binning and equal-frequency binning;

Equal-width binning: also called equal-distance binning, a method that maps an attribute's values into intervals of equal size;

For example, student test scores from $0$ to $100$ can be split every $10$ points into $10$ bins: a score of $15$ falls into the $11$ ~ $20$ bin, and a score of $52$ falls into the $51$ ~ $60$ bin;

Equal-width binning may produce very uneven bins: for example, the $71$ ~ $80$ bin may contain many samples while the $01$ ~ $10$ bin contains almost none;

Equal-frequency binning: also called equal-depth binning, maps the values into intervals such that each interval contains the same number of values;
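Both binning schemes can be sketched in a few lines of plain Python (function names are mine; the equal-frequency version assumes the number of values divides evenly into the bins):

```python
def equal_width_bin(score, width=10, low=0):
    """Return the index of the equal-width bin containing score."""
    return (score - low) // width

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins bins with equal counts."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

print(equal_width_bin(15))  # bin index 1 -> the second 10-point bin
print(equal_width_bin(52))  # bin index 5 -> the sixth 10-point bin
print(equal_frequency_bins([3, 7, 1, 9, 4, 8], 3))  # [[1, 3], [4, 7], [8, 9]]
```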

## 2. Entropy-based discretization

Binning discretization is an unsupervised discretization method, while entropy-based discretization is a supervised discretization method;

Given a data set $D$ and its classification attribute with category set $C = \{ c_1, c_2, \cdots, c_k \}$, the information entropy $\mathrm{entropy}(D)$ of the data set $D$ is calculated as follows:

$\mathrm{entropy}(D) = - \sum_{i=1}^k p(c_i) \log_2 p(c_i)$

where $p(c_i) = \cfrac{\mathrm{count}(c_i)}{|D|}$, $\mathrm{count}(c_i)$ is the number of occurrences of category $c_i$ in the data set $D$, and $|D|$ is the number of samples in the data set;

The smaller the information entropy $\mathrm{entropy}(D)$, the purer the categories;
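The entropy formula above can be sketched directly (the function name is mine); a perfectly pure label set has entropy $0$, and a two-class set split evenly reaches the maximum of $1$ bit:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """entropy(D) = - sum over categories of p(c_i) * log2(p(c_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "b", "b"]))  # 1.0 -> maximally mixed two-class set
```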

# Summary

This post explained the operations required for data preprocessing: data normalization, data discretization, data cleaning, and feature extraction and feature selection.

Data normalization includes min-max normalization and z-score standardization.

Data discretization includes binning discretization and entropy-based discretization; binning discretization divides into equal-width binning and equal-frequency binning.