[Business Intelligence] Data Preprocessing

Preface

Before data analysis, the data must be preprocessed. This blog briefly introduces commonly used data preprocessing methods;





1. The main tasks of data preprocessing


The main tasks of data preprocessing:

① Data discretization: binning discretization, entropy-based discretization, ChiMerge discretization;

② Data scaling: also known as data normalization; unifies the value ranges of sample attributes so that attributes with different ranges do not cause errors in the analysis results; e.g., if one time attribute uses seconds as its unit and another uses hours, they must be unified into the same time unit;

③ Data cleaning: identifying and handling missing data, noisy data, inconsistent data, etc.; e.g., if a sample lacks a value for some attribute, the mean of that attribute over the other samples can be assigned to the missing value (see the sketch after this list);

④ Feature extraction and feature selection: selecting features that are effective for classification can reduce the data volume, improve the efficiency of constructing the classification model, and improve classification accuracy;
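
As a concrete illustration of the data-cleaning step ③, here is a minimal Python sketch of mean imputation with pandas; the DataFrame, its column names, and its values are hypothetical, invented only for illustration:

```python
# Minimal sketch of mean imputation for data cleaning (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "income": [52.0, None, 68.0, 90.0],  # sample B is missing this attribute
})

# Assign the mean of the same attribute (over the non-missing samples)
# to the sample with the missing value.
df["income"] = df["income"].fillna(df["income"].mean())

print(df)  # income of B becomes (52 + 68 + 90) / 3 = 70.0
```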





2. Data normalization methods



1. z-score normalization


z-score: also called the standard score; the z-score value is $z = \cfrac{x-\mu}{\sigma}$ ;

where $x$ is the attribute value being normalized, $\mu$ is the mean, and $\sigma$ is the standard deviation; the formula measures how many standard deviations $\sigma$ the current attribute value $x$ deviates from the mean $\mu$ ;


z-score normalization is also called Zero-Mean Normalization; given an attribute $A$ with mean $\mu$ and standard deviation $\sigma$, a value $x$ of attribute $A$ is normalized to $z = \cfrac{x - \mu}{\sigma}$ ;


The mean annual income is $82$ (ten thousand yuan) and the standard deviation is $39$ ; an annual income of $60$ normalized with z-score is:

$z = \cfrac{60 - 82}{39} = -0.564$
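
A minimal Python sketch of z-score normalization, reusing the numbers from this example (the function name z_score is an assumption for illustration):

```python
# Minimal sketch of z-score normalization.
def z_score(x, mu, sigma):
    # How many standard deviations x deviates from the mean mu.
    return (x - mu) / sigma

# Annual income 60, mean 82, standard deviation 39 (units of 10,000):
print(z_score(60, mu=82, sigma=39))  # -0.5641...
```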


2. Min-Max Normalization


A sample attribute originally takes values in the range $[l, r]$, and now needs to be mapped into the interval $[L, R]$ ; by the principle of proportional mapping, the value of attribute value $x$ after mapping to the new interval is computed as follows:

$v = \cfrac{x-l}{r-l}(R - L) + L$


A sample attribute is annual income with value range $[10, 100]$ ; to map it into the interval $[0, 1]$, the value $20$ after mapping to the new interval is:

$v = \cfrac{20-10}{100-10}(1-0) + 0 = 0.1111$
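
A minimal Python sketch of min-max normalization following the formula above (the function name min_max is an assumption for illustration):

```python
# Minimal sketch of min-max normalization.
def min_max(x, l, r, L=0.0, R=1.0):
    # Proportionally map x from the original range [l, r] to [L, R].
    return (x - l) / (r - l) * (R - L) + L

# Annual income 20 in range [10, 100] mapped to [0, 1]:
print(min_max(20, l=10, r=100))  # 0.1111...
```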





3. Data discretization methods



1. Binning discretization


Binning discretization is divided into equal-width binning and equal-frequency binning;


Equal-distance binning: also known as equal-width binning; a method that maps each attribute value into one of several intervals of equal width;

E.g., student test scores range from $0$ to $100$ points; dividing every $10$ points into one bin yields $10$ bins:

a score of $15$ falls into the $11$ ~ $20$ bin,

a score of $52$ falls into the $51$ ~ $60$ bin;

With equal-width binning, some bins may contain far more values than others; e.g., the $71$ ~ $80$ bin may contain many values while the $1$ ~ $10$ bin contains almost none;


Equal-frequency binning: also known as equal-depth binning; each value is mapped into an interval, and every interval contains the same number of values;
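
A minimal Python sketch contrasting the two binning methods on hypothetical test scores, using pandas.cut for equal-width binning and pandas.qcut for equal-frequency binning; the score values are invented for illustration:

```python
import pandas as pd

scores = pd.Series([15, 52, 55, 58, 61, 64, 71, 73, 75, 78])

# Equal-width binning: 10 bins of width 10 over (0, 100];
# bin counts can be very uneven (many scores fall in (70, 80]).
equal_width = pd.cut(scores, bins=range(0, 101, 10))

# Equal-frequency (equal-depth) binning: 5 bins, each holding
# the same number of values (10 values / 5 bins = 2 per bin).
equal_freq = pd.qcut(scores, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```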


2. Entropy-based discretization


Binning discretization is an unsupervised discretization method; entropy-based discretization is a supervised discretization method;

Given a data set $D$ and its classification attribute, with category set $C = \{ c_1, c_2, \cdots, c_k \}$, the information entropy $\rm entropy(D)$ of data set $D$ is calculated as follows:

$\rm entropy(D) = - \sum_{i=1}^k p(c_i) \log_2 p(c_i)$

where $p(c_i) = \cfrac{\rm count(c_i)}{|D|}$ ; $\rm count(c_i)$ is the number of occurrences of category $c_i$ in data set $D$, and $|D|$ is the number of data samples;

The smaller the information entropy $\rm entropy(D)$, the purer the categories in the data set;
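
A minimal Python sketch of this entropy calculation on a hypothetical list of class labels (the entropy helper function is an assumption for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # entropy(D) = - sum_i p(c_i) * log2 p(c_i), with p(c_i) = count(c_i) / |D|
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 5 + ["no"] * 5))  # 1.0: two equally likely classes, least pure
print(entropy(["yes"] * 10))              # 0.0: one class, perfectly pure (prints -0.0)
```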


For the calculation of attribute information entropy, refer to the [Data Mining] blog on decision trees determining partition attributes according to information gain (information and entropy | total entropy calculation formula | entropy calculation formula for each attribute | information gain calculation formula | partition attribute determination);






Summary

This blog mainly explained the operations required for data preprocessing: data normalization, data discretization, data cleaning, and feature extraction and feature selection;

Data normalization includes min-max normalization and z-score normalization;

Data discretization includes binning discretization and entropy-based discretization; binning discretization is divided into equal-width binning and equal-frequency binning;