Machine Learning-Quick Start Course (Comprehensive Edition)

Preface

This article follows the Machine Learning Crash Course on Google's official website. The course as a whole is easy to follow and is a good resource to study and refer to; this article also refines the material based on my own understanding. According to a notice on the official website, the Chinese version of the crash course will no longer be provided after July 2021; the English version will remain available.

Table of Contents

Preface

1. Machine learning-main terms (integrated version)

2. Training and loss

2.1 Preface

2.2 Training model

2.3 Loss function

2.4 Keywords

3. Model iteration

3.1 Preface

3.2 Iterative method of training model

3.3 Keywords (training, convergence, loss)

3.4 Gradient descent method

3.5 Implementation process of gradient descent method

3.6 Keywords-gradient descent method

4. Stochastic gradient descent and batch gradient descent

4.1 Preface

4.2 Stochastic Gradient Descent (SGD)

4.3 Batch Gradient Descent (BGD)

5. Learning rate

5.1 Preface

5.2 Learning rate

5.3 Choose the learning rate

5.4 Keywords

6. Generalization and overfitting

6.1 Preface

6.2 Overfitting

6.3 William of Occam

6.4 Data set split

6.5 Machine Learning-Generalization Rules

6.6 Summary

6.7 Keywords

7. Data set division

7.1 Preface

7.2 Divided into training set and test set

7.3 Divided into training set, validation set, and test set

7.4 Keywords

8. Feature Engineering

8.1 Preface

8.2 Mapping raw data to features

8.3 Mapping values

8.4 Mapping classification values

8.5 Sparse representation

8.6 Qualities of good features

8.7 Keywords

9. L2 regularization (simplicity)

9.1 Preface

9.2 Principle

9.3 Complexity

9.4 L2 regularization

9.5 Simplified regularization: Lambda

9.6 Regularization and learning rate

10. Logistic regression, linear regression

10.1 Preface

10.2 Ways to calculate and return probabilities

10.3 Sigmoid function

10.4 Logistic regression inference calculation

10.5 Loss function of logistic regression

10.6 Regularization in Logistic Regression

10.7 Logistic regression summary

10.8 Linear regression

11. L1 regularization (sparsity)

11.1 Preface

11.2 Regularization of sparsity

11.3 L1 and L2 regularization comparison

11.4 Keywords

12. Neural network

12.1 Preface

12.2 Solutions to nonlinear problems

12.3 Single layer neural network

12.4 Two-layer neural network

12.5 Three-layer neural network

12.6 Activation function

12.7 Summary


1. Machine learning-main terms (integrated version)

This part summarizes and describes the machine learning terms explained on Google's official website; see:

https://guo-pu.blog.csdn.net/article/details/117442581

2. Training and loss

2.1 Preface

Training a model means learning good values for all of the weights w and the bias b from labeled samples. In supervised learning, a machine learning algorithm builds a model by examining many samples and trying to find a model that minimizes loss; this process is called empirical risk minimization.

Loss is the penalty for a bad prediction. It is a number that indicates how inaccurate the model's prediction was on a single sample. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater.

2.2 Training model

The goal of training a model is to find a set of weights and biases that have low loss, on average, across all samples.

In the figure, the red arrows represent the loss and the blue line represents the predictions. The red arrows in the left plot are much longer than the corresponding arrows in the right plot; that is, the actual points are farther from the model's predictions, so the error is larger.

The left plot shows the model with the higher loss; the right plot shows the model with the lower loss.

2.3 Loss function

Squared loss is a common loss function. Linear regression models use a loss function called squared loss (also known as L_{2} loss). The squared loss for a single sample is as follows:

  = the square of the difference between the label and the prediction
  = (observation - prediction(x))^{2}
  = (y - y')^{2}

The mean square error (MSE) refers to the average squared loss of each sample. To calculate MSE, you need to find the sum of all square losses of each sample, and then divide by the number of samples:

MSE = \frac{1}{n}\sum_{(x,y)\in D}(y-prediction(x))^{2}

where:

  • (x,y) is a sample: x is the set of features (for example, temperature or age) that the model uses to make predictions, and y is the sample's label (for example, cricket chirps per minute).
  • prediction(x) is a function of the weights and bias in combination with the feature set x.
  • D is a data set containing many labeled samples.
  • n is the number of samples in D.

MSE is commonly used in regression tasks; the cross-entropy loss function is commonly used in classification tasks.
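The MSE formula above can be sketched in a few lines of Python. This is only an illustration: the function name mean_squared_error and the sample numbers are assumptions made for the example, not part of the course.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")
    # Squared loss for each sample: (y - y')^2
    squared_losses = [(y - y_pred) ** 2 for y, y_pred in zip(labels, predictions)]
    # MSE: sum of the squared losses divided by the number of samples n
    return sum(squared_losses) / len(squared_losses)


# Hypothetical labels y and model predictions y'
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
mse = mean_squared_error(labels, predictions)  # squared losses are 1, 0 and 4
print(mse)
```

Because each error is squared before averaging, the single prediction that is off by 2 contributes four times as much as the one that is off by 1.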

Reference: https://developers.google.cn/machine-learning/crash-course/descending-into-ml/training-and-loss

2.4 Keywords

Empirical risk minimization (ERM): choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

Mean squared error (MSE): the average squared loss per sample. MSE is calculated by dividing the sum of the squared losses by the number of samples.

Squared loss: the loss function used in linear regression (also known as L2 loss). It calculates the square of the difference between the model's predicted value for a labeled sample and the true value of the label. Because of the squaring, this loss function amplifies the influence of bad predictions; compared with L1 loss, squared loss reacts more strongly to outliers.

Training: the process of determining the ideal parameters of the model.

Loss: a measure of how far the model's predictions deviate from the labels. To determine this value, the model must define a loss function. For example, linear regression models typically use mean squared error (MSE) as the loss function, and classification models use the cross-entropy loss function.

3. Model iteration

3.1 Preface

When training a machine learning model, we start with initial guesses for the weights and bias and then repeatedly adjust those guesses until we reach the weights and bias with the lowest possible loss.

3.2 Iterative method of training model

Machine learning algorithms train a model through the following iterative trial-and-error process:

The "model" part takes one or more features as input and returns a prediction (y') as output.

To simplify, consider a model that takes one feature and returns one prediction:

y^{'} = b + w_{1}x_{1}

What initial values should we set for b and w_{1}? For linear regression problems, it turns out that the starting values are not important. (Note: for other kinds of models the initial values may matter a great deal, so treat each model on its own terms.) We can pick random values, or simply use these trivial values:

b = 0
w_{1} = 0

If the first feature value is 10, substituting it into the prediction function gives:

  y' = 0 + 0(10)
  y' = 0

Next, the loss must be calculated. The "compute loss" part of the figure above is the loss function that the model uses, such as the squared loss function.

The loss function takes two input values:

  • y': the model's prediction for feature x.
  • y: the correct label corresponding to feature x.
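As a small illustration of the "model" and "compute loss" steps, the sketch below evaluates the one-feature model with the trivial starting values b = 0 and w_1 = 0 on the feature value 10, then computes the squared loss. The label value 25 is a hypothetical number chosen only for this example, and the function names are likewise assumptions.

```python
def predict(x1, b, w1):
    """One-feature linear model: y' = b + w1 * x1."""
    return b + w1 * x1


def squared_loss(y, y_pred):
    """Squared loss for a single sample: (y - y')^2."""
    return (y - y_pred) ** 2


b, w1 = 0.0, 0.0   # trivial initial values from the text
x1 = 10.0          # first feature value from the text
y_true = 25.0      # hypothetical label, assumed for illustration

y_pred = predict(x1, b, w1)          # 0 + 0 * 10 = 0
loss = squared_loss(y_true, y_pred)  # (25 - 0)^2 = 625
print(y_pred, loss)
```

The loss value is what the "compute parameter updates" step examines in order to propose better values for b and w_1.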

Finally, we reach the "compute parameter updates" part of the figure. Here the machine learning system examines the value of the loss function and generates new values for b and w_{1}.

For now, assume that this mysterious green box produces new parameter values; the model then makes new predictions with them, which produces a new loss value, which in turn produces new parameter values. The learning process keeps iterating until the algorithm finds the parameters with the lowest possible loss; at that point a good model has been obtained and its parameters are saved.

Generally, iterate continuously until the overall loss no longer changes or changes extremely slowly, at which point the model has converged.

3.3 Keywords (training, convergence, loss)

Training: the process of determining the ideal parameters of the model.

Convergence: informally, a state reached during training in which, after a certain number of iterations, the training loss and validation loss change very little or not at all with each iteration. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence. See also early stopping, and Convex Optimization by Boyd and Vandenberghe.

Loss: a measure of how far the model's predictions deviate from the labels. To determine this value, the model must define a loss function. For example, linear regression models typically use mean squared error (MSE) as the loss function, and classification models use the cross-entropy loss function.

3.4 Gradient descent method

In the iteration of the training model, the "calculation parameter update" part can be implemented using the gradient descent method.

Suppose we had the time and computing resources to calculate the loss for all possible values of w_{1}. For the regression problems we have been studying, the resulting plot of loss versus w_{1} is always convex, as shown in the following figure:

Figure: for regression problems, the plot of loss versus weight is convex.

Convex problems have only one lowest point; that is, there is only one position where the slope is exactly 0. This minimum is where the loss function converges.

Finding the convergence point by calculating the loss for every possible value of w_{1} over the entire data set is far too inefficient. Let's look at a better mechanism, one that is very popular in machine learning, called gradient descent.

3.5 Implementation process of gradient descent method

The first step of gradient descent is to select an initial value (starting point) for w_{1}. The following figure shows a starting point slightly greater than 0:

Then, the gradient descent method calculates the gradient of the loss curve at the starting point. The gradient is a vector of partial derivatives; it allows the model to understand which direction is "closer" or "further" to the target.

Learn more about partial derivatives and gradients

This part relies mainly on calculus. In practice, open-source machine learning frameworks such as TensorFlow compute gradients for us.


Partial derivative

A multivariable function is a function with more than one argument, for example:

f(x,y) = e^{2y} sin(x)

The partial derivative of f with respect to x, written

\frac{\partial f}{\partial x}

is the derivative of f considered as a function of x alone.

To calculate \frac{\partial f}{\partial x}, you must hold y fixed (so f is now a function of only one variable, x) and then take the ordinary derivative of f with respect to x. For example, when y is fixed at 1, the previous function becomes:

f(x) = e^{2} sin(x)

This is just a function of the single variable x, and its derivative is:

e^{2} cos(x)

In general, holding y fixed, the partial derivative of f with respect to x is:

\frac{\partial f}{\partial x}(x,y) = e^{2y}cos(x)

Similarly, if we hold x fixed, the partial derivative of f with respect to y is:

\frac{\partial f}{\partial y}(x,y) = 2e^{2y}sin(x)

Intuitively, a partial derivative tells us how much the function changes when one variable is changed slightly. In the previous example:

\frac{\partial f}{\partial x}(0,1) = e^{2} \approx 7.4

Therefore, if we start at (0,1), hold y fixed, and move x a little, f changes by about 7.4 times the amount that x changed.
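The partial derivatives of f(x,y) = e^{2y} sin(x) can be checked numerically. The sketch below uses central finite differences, which is a generic numerical technique chosen here for illustration (the course itself derives the result analytically); the helper names and the step size h are assumptions.

```python
import math


def f(x, y):
    """The example function f(x, y) = e^(2y) * sin(x)."""
    return math.exp(2 * y) * math.sin(x)


def partial_x(func, x, y, h=1e-6):
    """Approximate df/dx by nudging x while holding y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Approximate df/dy by nudging y while holding x fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


# At the point (0, 1): analytic df/dx = e^2 * cos(0), analytic df/dy = 2 * e^2 * sin(0)
df_dx = partial_x(f, 0.0, 1.0)
df_dy = partial_y(f, 0.0, 1.0)
print(df_dx, df_dy)
```

The numerical value of df_dx should agree closely with e^{2}, matching the figure of about 7.4 quoted in the text, while df_dy is zero at this point because sin(0) = 0.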

In machine learning, partial derivatives are mainly used together with the gradient of a function.

Gradient

The gradient of a function, written ∇f, is the vector of its partial derivatives with respect to all of the independent variables.

Note that:

  • ∇f points in the direction in which the function increases fastest.
  • -∇f points in the direction in which the function decreases fastest.

The number of dimensions of this vector equals the number of variables in the formula for f; in other words, the vector lies in the domain space of the function.

For example, consider a function f(x,y) which, when viewed in three-dimensional space as

z = f(x,y)

looks like a valley whose lowest point is (2,0,4). The gradient of f(x,y) is a two-dimensional vector that tells us in which (x,y) direction the height increases fastest; the negative of the gradient is therefore the direction in which the height drops fastest. In other words, the negative gradient vector points into the valley.

In machine learning, gradients are used in gradient descent methods. Our loss function usually has many variables, and we try to minimize the loss function by following the negative direction of the gradient of the function.


Note that a gradient is a vector, so it has both of the following characteristics:

  • a direction
  • a magnitude

The gradient always points in the direction of the steepest increase in the loss function. Gradient descent therefore takes a step in the direction of the negative gradient in order to reduce the loss as quickly as possible.

To determine the next point on the loss function curve, gradient descent adds some fraction of the gradient's magnitude to the starting point, moving in the direction of the negative gradient, as shown in the following figure:

A gradient step moves us to the next point on the loss curve. Then, the gradient descent method will repeat this process, gradually approaching the lowest point.
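The repeated gradient steps can be sketched for the one-feature model y' = b + w_1 x_1 with MSE loss. This is a minimal illustration rather than the course's own implementation: the data set, the learning rate of 0.05, and the fixed step count are all assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y' = b + w1 * x by full-batch gradient descent on the MSE loss."""
    b, w1 = 0.0, 0.0  # trivial starting point
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every sample with the current parameters
        errors = [(b + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to b and w1
        grad_b = 2.0 / n * sum(errors)
        grad_w1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step in the direction of the negative gradient
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w1
    return b, w1


# Hypothetical noise-free data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b, w1 = gradient_descent(xs, ys)
print(b, w1)
```

On this data the parameters approach b = 1 and w_1 = 2, the point where the convex loss curve reaches its minimum.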

3.6 Keywords-gradient descent method

Gradient descent is a technique that minimizes loss by computing the gradient of the loss with respect to the model's parameters, conditioned on the training data. It adjusts the parameters iteratively, gradually finding the best combination of weights and bias to minimize the loss.

Reference: https://developers.google.cn/machine-learning/crash-course/reducing-loss/an-iterative-approach

Reference: https://developers.google.cn/machine-learning/crash-course/reducing-loss/gradient-descent

4. Stochastic gradient descent and batch gradient descent

4.1 Preface

In gradient descent, a batch is the set of samples used to calculate the gradient in a single iteration. So far, the batch has been the entire data set.

A large data set may contain millions, tens of millions, or even hundreds of millions of samples, as well as a huge number of features. A batch can therefore be enormous, and a single iteration may take a very long time to compute.

In general, the larger the batch, the more likely it is to contain redundant data. Some redundancy can help smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.

4.2 Stochastic Gradient Descent (SGD)

Background

On large data sets, full-batch gradient descent is time-consuming and yields little extra value for the computation spent. It would be better to obtain a good estimate of the average gradient with far less computation. By choosing samples at random from the data set, we can estimate (albeit noisily) a big average from a much smaller one.

Principle

SGD uses only a single sample per iteration (a batch size of 1).

Given enough iterations, SGD works, but the process is very noisy. The term "stochastic" indicates that the one sample comprising each batch is chosen at random.

4.3 Batch Gradient Descent (BGD)

The method described here is a compromise between full-batch iteration (the gradient descent method above) and single-sample iteration (stochastic gradient descent). Note that it is more commonly called mini-batch stochastic gradient descent (mini-batch SGD); in other texts, "batch gradient descent" usually refers to the full-batch method.

It randomly selects a subset of samples from the data set to form a small batch and iterates on that. A mini-batch typically contains between 10 and 1,000 randomly chosen samples. BGD reduces the amount of noise in SGD while still being more efficient than full-batch iteration.

Of the three methods (gradient descent, stochastic gradient descent, and batch gradient descent as defined here), the last is the one usually used to train models iteratively.
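The difference between the variants comes down to how many randomly chosen samples feed each gradient estimate. The sketch below is an assumption-laden illustration: it reuses the one-feature linear model, a hypothetical noise-free data set, and a hand-picked learning rate, and exposes the batch size as a parameter so that a batch size of 1 gives SGD and a larger value gives the mini-batch method.

```python
import random


def minibatch_sgd(xs, ys, batch_size, learning_rate=0.01, steps=5000, seed=0):
    """Fit y' = b + w1 * x using gradients estimated from random batches."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    b, w1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Randomly choose the samples that make up this batch
        batch = rng.sample(indices, batch_size)
        grad_b = 0.0
        grad_w1 = 0.0
        for i in batch:
            error = (b + w1 * xs[i]) - ys[i]
            grad_b += 2.0 * error / batch_size
            grad_w1 += 2.0 * error * xs[i] / batch_size
        # Step against the estimated gradient
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w1
    return b, w1


# Hypothetical noise-free data generated from y = 1 + 2x
xs = [i / 10.0 for i in range(50)]
ys = [1.0 + 2.0 * x for x in xs]

b_sgd, w_sgd = minibatch_sgd(xs, ys, batch_size=1)   # stochastic gradient descent
b_mb, w_mb = minibatch_sgd(xs, ys, batch_size=10)    # mini-batch of 10 samples
print(b_sgd, w_sgd, b_mb, w_mb)
```

Each iteration touches only batch_size samples instead of the whole data set, which is where the savings on large data sets come from; on real, noisy data the single-sample run would follow a visibly more erratic path than the mini-batch run.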

5. Learning rate

5.1 Preface

The gradient vector has a direction and size; the gradient descent algorithm multiplies the gradient by a scalar called the learning rate (sometimes called the step size) to determine the location of the next point.

For example, if the gradient size is 2.5 and the learning rate is 0.01, the gradient descent algorithm will choose a position 0.025 from the previous point as the next point.

5.2 Learning rate

Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate.

If the learning rate is too small, learning takes too long:

If the learning rate is too large, the next point perpetually bounces back and forth across the bottom of the U-shaped curve, and the lowest point is never reached:

If the learning rate is just right, the lowest point is reached in relatively few steps:

5.3 Choose the learning rate

The learning rate is related to the flatness of the loss function. If you know that the gradient of the loss function is small, you can try a larger learning rate to compensate for the smaller gradient and obtain a larger step size.

The ideal learning rate in one dimension is

\frac{1}{f''(x)}

that is, the reciprocal of the second derivative of f(x) with respect to x.

The ideal learning rate in two or more dimensions is the inverse of the Hessian matrix (the matrix of second-order partial derivatives).

For general convex functions, the situation is more complicated.

For detailed Hessian matrix refer to Wikipedia:  https://en.wikipedia.org/wiki/Hessian_matrix
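The effect of the learning rate can be seen on a simple one-dimensional loss. The sketch below assumes the hypothetical loss f(w) = (w - 3)^2, whose second derivative is 2, so the ideal learning rate 1/f''(w) is 0.5; the function name descend and the specific rates tried are illustrative choices.

```python
def descend(learning_rate, start=0.0, steps=20):
    """Run gradient descent on f(w) = (w - 3)^2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)    # f'(w)
        w -= learning_rate * gradient  # gradient step
    return w


w_small = descend(0.001)          # too small: barely moves in 20 steps
w_ideal = descend(0.5, steps=1)   # 1 / f''(w) = 0.5: reaches the minimum in one step
w_large = descend(1.1)            # too large: overshoots further on every step
print(w_small, w_ideal, w_large)
```

With the small rate the result is still close to the starting point, with the ideal rate a single step lands on w = 3, and with the large rate each step jumps across the minimum and ends up farther away than before.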

5.4 Keywords

Parameter: a variable of the model that the machine learning system trains on its own, for example, a weight. Parameter values are learned gradually by the machine learning system over successive training iterations. Contrast with hyperparameter.

Hyperparameter: a setting that must be specified and adjusted manually across successive training runs of a model, for example, the learning rate. Contrast with parameter.

Learning rate: a scalar used in gradient descent when training a model. In each iteration, gradient descent multiplies the gradient by the learning rate; the resulting product is called the gradient step.

Reference: https://developers.google.cn/machine-learning/crash-course/reducing-loss/learning-rate

6. Generalization and overfitting

6.1 Preface

This section focuses on generalization and on overfit models.

In order to understand the concept of generalization, first look at 3 pictures. Assume that each point in these figures represents the position of a tree in the forest.

The two colors in the picture represent the following meanings:

  • The blue dot represents the sick tree
  • Orange dots represent healthy trees

In the picture above, there are sick and healthy trees, corresponding to blue dots and orange dots respectively.

We need to design a model that distinguishes the sick trees from the healthy ones. The result looks like this:

At first glance, the model appears to separate the sick trees from the healthy trees very well.

In fact, the model overfits the training data.

If we use this model to make predictions on new data, the result is as follows:

The model handles the new data very poorly: it misclassifies most of the new data.

6.2 Overfitting

Introduction

An overfit model achieves a very low loss during training but performs poorly when predicting new data.

Cause

Overfitting arises when there is too little training data and the model is more complex than necessary. In other words, the model structure is too complicated for the patterns the task actually needs to capture.

The goal of machine learning

The goal of machine learning is to make good predictions for new data drawn from the true probability distribution; that is, to make good predictions for new data that has not been seen before.

6.3 William of Occam

William of Occam was a 14th-century friar and philosopher who loved simplicity. He believed that scientists should prefer simpler formulas or theories over more complex ones. Applied to machine learning, Occam's razor reads as follows:

The simpler a machine learning model is, the more likely it is that a good empirical result is not just due to the peculiarities of the sample.

In modern times, Occam's razor has been formalized in the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds, that is, statistical descriptions of a model's ability to generalize to new data based on factors such as:

  • Complexity of the model
  • The model's performance on training data

Although theoretical analysis provides formal guarantees under idealized assumptions, it is difficult to apply in practice.

For example, if the model above fits the data simply, using a single line to separate the sick trees from the healthy ones, it no longer overfits; although the separation is not perfectly accurate, most trees are classified correctly.

6.4 Data set split

Machine learning models are designed to make good predictions based on new data that has not been seen before. But if you want to build a model based on a data set, how do you get data that you haven't seen before? One way is to divide the data set into two subsets:

  • Training set: a subset used to train the model.
  • Test set: a subset used to test the trained model.

Generally speaking, good performance on the test set is a useful indicator of good performance on new data, provided that:

  • The test set is large enough.
  • You do not cheat by using the same test set over and over.

6.5 Machine Learning-Generalization Rules

The following three basic assumptions underpin generalization:

  • We draw samples independently and identically distributed (i.i.d.) at random from the distribution. In other words, samples do not influence each other. (An alternative explanation: i.i.d. is a way of referring to the randomness of variables.)
  • The distribution is stationary; that is, the distribution does not change within the data set.
  • We draw samples from partitions of the same distribution.

In practice, we sometimes violate these assumptions. For example:

  • Consider a model that chooses ads to display. The i.i.d. assumption is violated if the model bases its choice of ads on which ads the user has previously seen.
  • Consider a data set containing one year of retail sales information. Users' purchasing behavior changes seasonally, which violates stationarity.

Reference: https://developers.google.cn/machine-learning/crash-course/generalization/peril-of-overfitting

6.6 Summary

  • If a model tries to closely fit the training data but cannot generalize to the new data well, overfitting will occur.
  • If the key assumptions of supervised machine learning are not met, then we will lose the important theoretical guarantee of the ability to predict new data.

6.7 Keywords

Generalization: a model's ability to make correct predictions on new, previously unseen data, as opposed to the data used to train the model.

Overfitting: creating a model that matches the training data so closely that it fails to make correct predictions on new data.

Prediction: a model's output when it is given an input sample.

Stationarity: a property of data in a data set in which the distribution of the data stays constant across one or more dimensions. Most commonly that dimension is time, meaning that data exhibiting stationarity does not change over time.

Training set (training set), a subset of the data set, used to train the model. Contrast with validation set and test set.

Validation set (validation set), a subset of the data set, separated from the training set, used to adjust the hyperparameters. Contrast with training set and test set.

Test set (test set), a subset of the data set, used to test the model after the model has been initially verified by the validation set. Contrast with training set and validation set.

7. Data set division

7.1 Preface

In machine learning, the data set can be divided into two subsets, namely the training set and the test set. A better way is to divide the data set into three subsets, namely training set, validation set, and test set.

7.2 Divided into training set and test set

The concept of dividing the data set into two subsets:

Training set —used to train the model;

Test set —used to test the trained model

For example, divide the data set into a training set and a test set:

When using this scheme, you need to ensure that the test set meets the following two conditions:

  • The scale is large enough to produce statistically significant results.
  • Can represent the entire data set. That is, the characteristics of the selected test set should be the same as the characteristics of the training set.

When the test set meets the above two conditions, a model that can be better generalized to new data can usually be obtained.
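A two-way split can be sketched with the Python standard library alone. The function name train_test_split, the 20% test fraction, and the fixed seed below are assumptions made for illustration; shuffling before slicing is one simple way to help the test set represent the whole data set.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and split them into a training set and a test set."""
    shuffled = list(examples)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(100))  # hypothetical data set of 100 samples
train_set, test_set = train_test_split(examples)
print(len(train_set), len(test_set))
```

Every sample ends up in exactly one of the two subsets, so the model is never tested on data it was trained on.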

The process of using the training set and the test set to train the model

"Adjusting the model" means adjusting anything about the model: its parameters, hyperparameters, or structure, for example, changing the learning rate, adding or removing features, or designing a completely new model from scratch.

7.3 Divided into training set, validation set, and test set

Dividing the data set into three subsets, as shown in the figure below, greatly reduces the chance of overfitting:

Figure: the process of training a model using the training set, validation set, and test set.

First, pick the model that performs best on the validation set. Then double-check that model against the test set.

A model obtained this way generalizes better, because less information about the test set is exposed to it.

By contrast, when a data set is divided only into a training set and a test set and the model is adjusted according to its results on the test set, the model gradually learns the patterns of the test set. The test set then no longer behaves like new data: the model already knows something about the test set, whereas genuinely new data must be predicted with no prior knowledge at all.
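The three-way workflow can be sketched as follows. Everything here is a simplified assumption for illustration: the data set is synthetic, the split fractions are arbitrary, and the list of candidate slopes stands in for models that would normally be trained on the training set with different hyperparameters.

```python
import random


def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples and split them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def mse(w, examples):
    """Mean squared error of the model y' = w * x on (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in examples) / len(examples)


# Hypothetical data generated from y = 2x
data = [(float(x), 2.0 * x) for x in range(1, 101)]
train, val, test = three_way_split(data)

# Stand-ins for models trained on `train` with different hyperparameters
candidates = [1.0, 2.0, 3.0]

# Step 1: pick the candidate that does best on the validation set
best_w = min(candidates, key=lambda w: mse(w, val))

# Step 2: double-check the chosen model once against the test set
test_loss = mse(best_w, test)
print(best_w, test_loss)
```

The test set is consulted only once, after the choice has been made, so its result remains an honest estimate of performance on new data.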

Note

Test sets and validation sets wear out with repeated use. The more often the same data is used to decide hyperparameter settings or other model improvements, the less confidence we can have that the results will truly generalize to new, unseen data.

Recommendation: collect more data to "refresh" the test set and validation set. Starting over is a good way to reset.

7.4 Keywords

Training set (training set), a subset of the data set, used to train the model. Contrast with validation set and test set.

Validation set (validation set), a subset of the data set, separated from the training set, used to adjust the hyperparameters. Contrast with training set and test set.

Test set (test set), a subset of the data set, used to test the model after the model has been initially verified by the validation set. Contrast with training set and validation set.

Overfitting (overfitting), the created model matches the training data too much, so that the model cannot make correct predictions based on the new data.

Reference: https://developers.google.cn/machine-learning/crash-course/training-and-test-sets/splitting-data

Reference: https://developers.google.cn/machine-learning/crash-course/validation/another-partition

8. Feature Engineering

8.1 Preface

Traditional programming focuses on code. In machine learning projects, the focus shifts to feature representation; that is, developers hone a model by adding and improving its features.

Feature engineering means converting raw data into feature vectors. Expect to spend a significant amount of time on feature engineering.

8.2 Mapping raw data to features

In the figure below, the left side shows the raw data from the input data source, and the right side shows the feature vector, which is the set of floating-point values that make up the samples in the data set.

Feature engineering maps raw data to machine learning features.

8.3 Mapping values

Integer and floating-point data do not need a special encoding, because they can be multiplied by a numeric weight.

As the figure below shows, converting the raw integer value 6 to the feature value 6.0 is trivial.

8.4 Mapping classification values

Categorical features have a discrete set of possible values. For example, there might be a feature named street_name with options that include:

{'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}

Since a model cannot multiply a string by a learned weight, we use feature engineering to convert strings into numeric values.

Approach

We can define a mapping from the feature values (the set of possible values is called the vocabulary) to integers.

Not every street in the world will appear in our data set, so we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.

Implementation process

With the above method, we can map street names to numbers in the following way:

  • Map Charleston Road to 0
  • Map North Shoreline Boulevard to 1
  • Map Shorebird Way to 2
  • Map Rengstorff Avenue to 3
  • Map all other streets (OOV) to 4
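The mapping listed above can be sketched with a dictionary lookup. The names VOCABULARY, STREET_TO_INDEX, OOV_INDEX, and street_to_index are assumptions introduced for this example, and 'Main Street' is a hypothetical street that is not in the vocabulary.

```python
# Vocabulary of known street names, in the order used by the text
VOCABULARY = [
    'Charleston Road',
    'North Shoreline Boulevard',
    'Shorebird Way',
    'Rengstorff Avenue',
]

# Map each known street to its index: 0, 1, 2, 3
STREET_TO_INDEX = {name: index for index, name in enumerate(VOCABULARY)}

# All other streets share one out-of-vocabulary (OOV) bucket: 4
OOV_INDEX = len(VOCABULARY)


def street_to_index(street_name):
    """Return the integer index for a street, or the OOV index if it is unknown."""
    return STREET_TO_INDEX.get(street_name, OOV_INDEX)


print(street_to_index('Shorebird Way'))  # a street in the vocabulary
print(street_to_index('Main Street'))    # hypothetical street outside the vocabulary
```

Any street that was never seen when the vocabulary was built falls into the same OOV bucket, so the model always receives a valid index.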

However, if we feed these index numbers directly into the model, it imposes some constraints that may be problematic:

1) We will be learning a single weight that applies to all streets.

For example, if we learn a weight of 6 for street_name, then we multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on.

Consider a model that predicts housing prices using street_name as a feature. It is unlikely that housing prices adjust linearly with the street index; moreover, this would assume that the streets have been ordered by their average housing price.

Our model needs the flexibility to learn a different weight for each street, which is then added to the price estimated from the other features.

2) We are not accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and that information cannot be encoded in the street_name value if it contains a single index.

To remove both of these constraints, we can instead create a binary vector for each categorical feature in the model to represent its values:

  • For values that apply to the sample, set the corresponding vector elements to 1.
  • Set all other elements to 0.

The length of this vector is equal to the number of elements in the vocabulary. When there is only one value of 1, this notation is called one- hot encoding ; when there are multiple values ​​of 1, this notation is called multi-hot encoding .

The figure is mapped by a one-hot encoding street address, street Shorebird Way to the one-hot encoding .

In this binary vector, the value of the element representing Shorebird Way is 1, and the value of the element representing all other streets is 0.

Summary

This approach effectively creates a Boolean variable for every feature value (here, every street name). If a house is on Shorebird Way, only the binary value for Shorebird Way is 1, so the model uses only the weight for Shorebird Way.

If a house is at the corner of two streets, both binary values are set to 1, and the model uses both of their respective weights.
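
As a rough sketch of one-hot and multi-hot encoding, the snippet below builds the binary vector for a house and sums the weights of the active streets. The function names and the weight values are hypothetical, chosen only for illustration.

```python
# Hypothetical vocabulary; the last vector slot is reserved for OOV streets.
VOCABULARY = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]
OOV_INDEX = len(VOCABULARY)


def encode_streets(street_names):
    """Return a binary vector with a 1 for every street the house is on."""
    vector = [0] * (len(VOCABULARY) + 1)
    for name in street_names:
        index = VOCABULARY.index(name) if name in VOCABULARY else OOV_INDEX
        vector[index] = 1
    return vector


def street_contribution(vector, weights):
    """Sum the weights of the active elements (those set to 1)."""
    return sum(w * v for w, v in zip(weights, vector))


# Made-up weights, one per vector element (not learned from real data).
weights = [1.5, -0.5, 2.0, 0.25, 0.0]

one_hot = encode_streets(["Shorebird Way"])
multi_hot = encode_streets(["Shorebird Way", "Rengstorff Avenue"])

print(one_hot)                                   # [0, 0, 1, 0, 0]
print(multi_hot)                                 # [0, 0, 1, 1, 0]
print(street_contribution(one_hot, weights))     # 2.0
print(street_contribution(multi_hot, weights))   # 2.25
```

With a one-hot vector only the Shorebird Way weight is used; with the multi-hot vector for a corner house, both street weights are added.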

8.5 Sparse representation

Background

Suppose there are 1 million different street names in the data set, and you want to include them as values for street_name.

Explicitly creating a binary vector of 1 million elements, of which only 1 or 2 are true, is a very inefficient representation: it takes up a lot of storage and a lot of computation time when these vectors are processed.

Introduction

In this case, a common approach is to use a sparse representation, in which only the non-zero values are stored. In a sparse representation, an independent model weight is still learned for each feature value.
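
One possible way to picture a sparse representation is to store only the indices of the non-zero elements. The sketch below assumes binary vectors and keeps the weights in a dictionary keyed by index; both choices, and all names and values, are illustrative assumptions rather than how any particular library does it.

```python
def to_sparse(dense_vector):
    """Keep only the indices whose value is non-zero."""
    return [index for index, value in enumerate(dense_vector) if value != 0]


def sparse_weighted_sum(active_indices, weights):
    """Sum the weights of the active indices; unseen indices count as 0.0."""
    return sum(weights.get(index, 0.0) for index in active_indices)


# A small dense multi-hot vector and its sparse form.
print(to_sparse([0, 0, 1, 1, 0]))  # [2, 3]

# With a vocabulary of 1 million streets, a corner house needs only two
# stored indices instead of a 1,000,000-element vector.
active_indices = [2, 999_999]
weights = {2: 2.0, 999_999: 0.25}  # made-up weights for illustration
print(sparse_weighted_sum(active_indices, weights))  # 2.25
```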

8.6 Qualities of good features

We have explored ways to map raw data into suitable feature vectors, but that is only part of the work. Next, we need to look at what kinds of values make good features within those feature vectors:

  • Avoid rarely used discrete feature values
  • Prefer a clear and obvious meaning
  • Do not mix special ("magic") values with actual data
  • Consider upstream instability

5.1) Avoid rarely used discrete feature values

Good feature values should appear more than 5 or so times in the data set. This lets the model learn how the feature value relates to the label. Having many examples with the same discrete value gives the model a chance to see the feature in different settings, and in turn to determine when it is a good predictor for the label. For example, a house_type feature would likely contain many examples in which its value is victorian:

house_type: victorian

Conversely, if a feature value appears only once or very rarely, the model cannot make predictions based on it. For example, unique_house_id is a bad feature, because each value is used only once and the model cannot learn anything from it:

unique_house_id: 8SK982ZZ1242Z

5.2) It is best to have a clear meaning

Each feature should have a clear and obvious meaning to anyone on the project. For example, the following feature is clearly named, and its value makes sense as the age of a house in years: house_age: 27

Conversely, the meaning of the following feature value is indecipherable to anyone but the engineer who created it: house_age: 851472000

In some cases, noisy data (rather than a poor engineering choice) causes unclear values. For example, the following user_age came from a source that did not check for plausible values: user_age: 277

5.3) Do not mix special values with actual data

A good floating-point feature does not contain peculiar out-of-range discontinuities or special ("magic") values. For example, suppose a feature holds a floating-point value between 0 and 1. Then values like the following are fine:

quality_rating: 0.82
quality_rating: 0.37

However, if the user did not enter a quality_rating, the data set might represent its absence with a special value such as the following:

quality_rating: -1

To work around special values, convert the feature into two features:

  • One feature holds only the quality ratings, never the special value.
  • One feature holds a boolean value indicating whether a quality_rating was supplied.
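
The two-feature conversion might look like the sketch below. The sentinel value -1 comes from the example above; the function name, the returned tuple layout and the default fill value of 0.0 are assumptions made for illustration.

```python
MISSING_SENTINEL = -1  # special value used in the example above


def split_quality_rating(raw_value, default=0.0):
    """Split a raw rating into (rating, is_defined).

    The rating never carries the sentinel: when the value is missing,
    it is replaced by `default` (an assumed fill value) and the boolean
    flag records that no rating was supplied.
    """
    is_defined = raw_value != MISSING_SENTINEL
    rating = float(raw_value) if is_defined else default
    return rating, is_defined


print(split_quality_rating(0.82))  # (0.82, True)
print(split_quality_rating(-1))    # (0.0, False)
```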

5.4) Consider upstream instability

The definition of a feature should not change over time. For example, the following value is useful because the city name probably will not change:

city_id: "br/sao_paulo"

Values inferred by another model, however, carry additional costs. The value "219" may currently represent Sao Paulo, but that representation could easily change in a future run of the other model:

inferred_city_cluster: "219"

8.7 Keywords

Feature engineering, the process of determining which features might be useful in training a model, and then converting raw data from log files and other sources into those features. Feature engineering is sometimes called feature extraction.

Discrete feature, a feature with a finite set of possible values. For example, a feature whose values can only be "animal" or "vegetable" is a discrete feature, because its categories can all be enumerated. Contrast with continuous feature.

One-hot encoding, a sparse binary vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values.

Representation, the process of mapping data to useful features.

Reference: https://developers.google.cn/machine-learning/crash-course/representation/feature-engineering

Reference: https://developers.google.cn/machine-learning/crash-course/representation/qualities-of-good-features

Nine, L2 regularization (simplicity)

9.1 Preface

Regularization means preventing overfitting by penalizing the complexity of the model.

9.2 Principle

When training a model, the goal is not merely to minimize loss (empirical risk minimization):

minimize(Loss(Data|Model))

Instead, the goal is to minimize loss plus complexity, which is called structural risk minimization:

minimize(Loss(Data|Model) + complexity(Model))

Our training optimization algorithm is now a function of two terms:

  • One is the loss term, which is used to measure the fit between the model and the data;
  • The other is a regularization term, which is used to measure the complexity of the model.

9.3 Complexity

There are two common ways to measure model complexity:

  • The model complexity is taken as a function of the weights of all features in the model.
  • The model complexity is taken as a function of the total number of features with non-zero weights.

If the model complexity is a function of the weight, the higher the absolute value of the feature weight, the greater the contribution to the model complexity.

9.4 L2 regularization

We can quantify complexity using the L_{2} regularization formula, which defines the regularization term as the sum of the squares of all feature weights:

L_{2} regularization term = \left \| w \right \|_{2}^{2} = w_{1}^{2} + w_{2}^{2} + \dots + w_{n}^{2}

In this formula, weights close to 0 have almost no effect on model complexity, while outlier weights will have a huge impact.

For example, a linear model has the following weights:

{w_{1} = 0.2, w_{2} = 0.5, w_{3} = 5, w_{4} = 1, w_{5} = 0.25, w_{6} = 0.75}

Applying the formula, the L_{2} regularization term is 26.915:

w_{1}^{2} + w_{2}^{2} + w_{3}^{2} + w_{4}^{2} + w_{5}^{2} + w_{6}^{2} = 0.2^{2} + 0.5^{2} + 5^{2} + 1^{2} + 0.25^{2} + 0.75^{2}
= 0.04 + 0.25 + 25 + 1 + 0.0625 + 0.5625 = 26.915

In this case w_{3}, whose square is 25, contributes nearly all of the complexity. The squares of the other five weights sum to just 1.915, a small contribution to the L_{2} regularization term. This illustrates that, for the L_{2} regularization term, weights close to 0 have almost no effect on model complexity, while outlier weights have a huge impact.
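
The arithmetic above can be checked with a short sketch. The function name l2_regularization_term is a hypothetical helper introduced here; the weights are the six values from the example.

```python
def l2_regularization_term(weights):
    """Return the sum of the squares of the weights."""
    return sum(w * w for w in weights)


weights = [0.2, 0.5, 5, 1, 0.25, 0.75]

total = l2_regularization_term(weights)
from_w3 = 5 ** 2            # contribution of the outlier weight w3
from_others = total - from_w3

print(round(total, 4))        # 26.915
print(from_w3)                # 25
print(round(from_others, 4))  # 1.915
```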

9.5 Simplified regularization: Lambda

Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:

minimize(Loss(Data|Model) + \lambda complexity(Model))

where \lambda refers to lambda.

Performing L_{2} regularization has the following effects on a model:

  • It encourages weight values toward 0 (but not exactly 0).
  • It encourages the mean of the weights toward 0, with a normal (bell-shaped, Gaussian) distribution.
Increasing the lambda value strengthens the regularization effect. For example, a histogram of weights for a higher value of lambda might look like the following figure:

If you reduce the value of lambda, you will often get a relatively flat histogram:

When choosing a lambda value, the goal is to strike the right balance between simplicity and fitting the training data:

  • If the lambda value is too high, the model will be simple, but it risks underfitting the data; the model will not learn enough from the training data to make useful predictions.
  • If the lambda value is too low, the model will be more complex, and it risks overfitting the data; the model will learn too much about the particularities of the training data and will not generalize to new data.

Note: Setting lambda to 0 removes regularization completely. In that case, training focuses exclusively on minimizing loss, which poses the highest possible risk of overfitting.

The ideal lambda value produces a model that generalizes well to new, previously unseen data. The ideal value usually depends on the data, so it needs manual or automatic tuning.
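
To get a feel for how lambda pulls weights toward 0, here is a deliberately tiny sketch: a model with a single weight and no bias, squared loss, and an L2 penalty. Under those assumptions the minimizer of sum((y - w*x)^2) + lambda * w^2 has the closed form w = sum(x*y) / (sum(x^2) + lambda). The data points and the function name are made up for illustration; real models are usually trained iteratively rather than solved in closed form.

```python
def fit_weight(xs, ys, lam):
    """Minimize sum((y - w*x)**2) + lam * w**2 for a single weight w.

    Setting the derivative to zero gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + lam
    return numerator / denominator


# Made-up data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(fit_weight(xs, ys, lam), 4))
```

With lambda set to 0 the fitted weight is exactly 2.0; as lambda grows, the weight shrinks toward 0 without reaching it, which matches the underfitting risk described for very high lambda values.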

9.6 L_{2} regularization and learning rate

There is a close connection between learning rate and lambda.

Strong L_{2} regularization tends to drive feature weights closer to 0. A lower learning rate (combined with early stopping) often produces the same effect, because the steps away from 0 are not as large. Consequently, tuning the learning rate and lambda at the same time may have confounding effects.

Early stopping means ending training before the model fully converges. In practice, we often end up with some amount of implicit early stopping when training in an online (continuous) fashion.

As noted, the effect of changing the regularization parameter can be confounded with the effect of changing the learning rate or the number of iterations. One useful practice, when training on a fixed batch of data, is to run enough iterations that early stopping does not play a role.

Reference: https://developers.google.cn/machine-learning/crash-course/regularization-for-simplicity/l2-regularization

Reference: https://developers.google.cn/machine-learning/crash-course/regularization-for-simplicity/lambda

10. Logistic regression, linear regression

10.1 Preface

Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities.

10.2 Ways to calculate and return probabilities

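In practice, you can use the returned probability in either of two ways:

  • As is
  • Converted to a binary category

Using the probability as is: suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We call that probability:

p(bark | night)

If the logistic regression model predicts a p(bark | night) of 0.05, then over a year (365 nights) the dog's owner should be startled awake approximately 18 times:

startled = p(bark | night) * nights

That is: 0.05 * 365 = 18.25, or about 18.

Converting to a binary category

In many cases, you will map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels.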

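10.3 Sigmoid function

How does a logistic regression model ensure that its output always falls between 0 and 1? The sigmoid function produces output with exactly those characteristics. It is defined as follows:

y = \frac{1}{1+e^{-z}}

The sigmoid function yields the following curve:

If z represents the output of the linear layer of a model trained with logistic regression, then the sigmoid function yields a value (a probability) between 0 and 1. In mathematical terms:

y^{'}=\frac{1}{1+e^{-z}}

where:

  • y^{'} is the output of the logistic regression model for a particular example.
  • z is b + w_{1}x_{1} + w_{2}x_{2} + \dots + w_{N}x_{N}; the w values are the weights the model has learned, b is the bias, and the x values are the feature values of the particular example.

Note that z is also referred to as the log-odds, because the inverse of the sigmoid function shows that z can be defined as the log of the probability of the "1" label divided by the probability of the "0" label:

z = \log(\frac{y}{1-y})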
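
A minimal sketch of the sigmoid function and its inverse, the log-odds, using only the Python standard library. The function names are hypothetical; the round trip at the end illustrates that log_odds undoes sigmoid.

```python
import math


def sigmoid(z):
    """Map any real z to a value strictly between 0 and 1.

    Note: for very large negative z, math.exp(-z) can overflow;
    that edge case is ignored in this sketch.
    """
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(y):
    """Inverse of the sigmoid: log of p(label 1) / p(label 0)."""
    return math.log(y / (1.0 - y))


print(sigmoid(0.0))                      # 0.5
print(round(sigmoid(4.0), 3))            # close to 1
print(round(sigmoid(-4.0), 3))           # close to 0
print(round(log_odds(sigmoid(1.5)), 6))  # recovers z = 1.5
```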

Here is the sigmoid function with machine learning labels:

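10.4 Logistic regression inference calculation

Suppose we have a logistic regression model with three features that has learned the following bias and weights:

  • b = 1
  • w1 = 2
  • w2 = -1
  • w3 = 5

Further suppose that a given example has the following feature values:

  • x1 = 0
  • x2 = 10
  • x3 = 2

The log-odds is therefore:

b + w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3}

Substituting the values gives: (1) + (2)(0) + (-1)(10) + (5)(2) = 1

Consequently, the logistic regression prediction for this particular example is 0.731:

y^{'}=\frac{1}{1+e^{-(1)}}=0.731

The output is a probability of 0.731, that is, 73.1%.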
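
The inference calculation can be reproduced with a short sketch. The bias, weights and feature values are the ones from the example; the function names are hypothetical.

```python
import math


def compute_log_odds(bias, weights, features):
    """Linear part of the model: b + w1*x1 + w2*x2 + ..."""
    return bias + sum(w * x for w, x in zip(weights, features))


def predict_probability(bias, weights, features):
    """Pass the log-odds through the sigmoid to get a probability."""
    z = compute_log_odds(bias, weights, features)
    return 1.0 / (1.0 + math.exp(-z))


bias = 1
weights = [2, -1, 5]
features = [0, 10, 2]

z = compute_log_odds(bias, weights, features)
probability = predict_probability(bias, weights, features)

print(z)                      # 1
print(round(probability, 3))  # 0.731
```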

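10.5 Loss function for logistic regression

The loss function for linear regression is squared loss. The loss function for logistic regression is log loss, defined as follows:

LogLoss = \sum_{(x,y)\in D} -y\log(y^{'}) - (1-y)\log(1-y^{'})

where:

  • (x,y)\in D is the data set containing many labeled examples, which are (x,y) pairs.
  • y is the label in a labeled example. Since this is logistic regression, every value of y must be either 0 or 1.
  • y^{'} is the predicted value (somewhere between 0 and 1), given the set of features in x.

The equation for log loss is the negative logarithm of the likelihood function.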
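
As an illustration of the log loss formula, the sketch below sums the loss over a list of (label, prediction) pairs. The function name and the sample predictions are made up; the sketch assumes every prediction lies strictly between 0 and 1, since math.log(0) is undefined.

```python
import math


def log_loss(examples):
    """Sum -y*log(y') - (1-y)*log(1-y') over (label, prediction) pairs.

    Assumes each label is 0 or 1 and each prediction is strictly
    between 0 and 1.
    """
    total = 0.0
    for y, y_pred in examples:
        total += -y * math.log(y_pred) - (1 - y) * math.log(1 - y_pred)
    return total


# Made-up predictions: confident and right vs. confident and wrong.
good_predictions = [(1, 0.9), (0, 0.1)]
bad_predictions = [(1, 0.1), (0, 0.9)]

print(round(log_loss(good_predictions), 4))
print(round(log_loss(bad_predictions), 4))
```

Confident predictions that match the labels produce a small loss, while confident predictions that contradict the labels are penalized heavily.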

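10.6 Regularization in logistic regression

Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss toward 0 in high-dimensional spaces. Consequently, most logistic regression models use one of the following two strategies to reduce model complexity:

  • L_{2} regularization.
  • Early stopping, that is, limiting the number of training steps or the learning rate.

Suppose you assign a unique ID to each example and map each ID to its own feature. If you do not specify a regularization function, the model will become completely overfit.

That is because the model would try to drive loss to zero on all examples and never get there, driving the weight of each feature toward positive or negative infinity. This can happen with high-dimensional data that contains a huge number of rare feature crosses.

L_{2} regularization or early stopping can prevent this problem.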

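10.7 Logistic regression summary

Logistic regression models generate probabilities.

Log loss is the loss function for logistic regression.

Logistic regression is widely used by many practitioners.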

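10.8 Linear regression

This example looks at the pattern of cricket chirps and trains a model to predict the relationship between chirps and temperature.

Crickets chirp more frequently on hotter days. For decades, professional and amateur entomologists have cataloged data on chirps per minute and temperature.

Start by examining how the temperature and chirp data are distributed:

The plot above shows chirps per minute against temperature.

As the plot shows, the number of chirps rises with the temperature, and the relationship between chirps and temperature is linear.

You can draw a straight line that approximates this relationship, as in the following figure:

Although not every point in the figure lies exactly on the line, most are close to it. The linear relationship can be written as follows:

y = mx + b

where:

  • y is the temperature, the value we are trying to predict.
  • m is the slope of the line.
  • x is the number of chirps per minute, the value of the input feature.
  • b is the y-intercept.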

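Following machine learning conventions, the equation for the model is written as:

y^{'} = b + w_{1} x_{1}

where:

  • y^{'} is the predicted label (the output value).
  • b is the bias (the y-intercept), referred to as w_{0} in some machine learning documentation.
  • w_{1} is the weight of feature 1. Weight is the same concept as the "slope" m above.
  • x_{1} is a feature (a known input).

To infer (predict) the temperature y^{'} for a new chirps-per-minute value x_{1}, just substitute the x_{1} value into this model.

The subscripts (for example, w_{1} and x_{1}) foreshadow more sophisticated models that rely on multiple features. For example, a model with three features might use the following equation:

y^{'} = b + w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3}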
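
The model equation translates directly into code. In the sketch below, predict works for any number of features, and fit_line estimates b and w_{1} for the one-feature case with the closed-form least-squares formulas. That closed form is swapped in here for brevity (the course trains models iteratively), and the chirp and temperature numbers are made up, not real cricket measurements.

```python
def predict(bias, weights, features):
    """Compute y' = b + w1*x1 + w2*x2 + ... for one example."""
    return bias + sum(w * x for w, x in zip(weights, features))


def fit_line(xs, ys):
    """Least-squares fit of y' = b + w1*x1 for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w1 = covariance / variance
    b = mean_y - w1 * mean_x
    return b, w1


# Hypothetical data: chirps per minute and temperature.
chirps = [20.0, 40.0, 60.0, 80.0]
temperatures = [10.0, 15.0, 20.0, 25.0]

b, w1 = fit_line(chirps, temperatures)
print(b, w1)                      # 5.0 0.25
print(predict(b, [w1], [100.0]))  # 30.0

# The same predict function handles a three-feature model.
print(predict(1.0, [2.0, -1.0, 5.0], [0.0, 10.0, 2.0]))  # 1.0
```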

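Reference: https://developers.google.cn/machine-learning/crash-course/descending-into-ml/linear-regression

Keywords

Bias, an intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w_{0} in machine learning models. For example, bias is the b in the following formula:

y^{'} = b + w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n}

Inference, in machine learning, often refers to the process of making predictions by applying a trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)

Linear regression, a type of regression model that outputs a continuous value from a linear combination of input features.

Weight, a coefficient for a feature in a model, or an edge in a deep network. The goal of training a model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.

Reference: https://developers.google.cn/machine-learning/crash-course/logistic-regression/calculating-a-probability

Reference: https://developers.google.cn/machine-learning/crash-course/logistic-regression/model-training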

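Eleven, L1 regularization (sparsity)

11.1 Preface

L_{1} regularization is also known as regularization for sparsity.

Creating feature crosses results in even more dimensions. With such high-dimensional feature vectors, the model may become huge and require large amounts of RAM.

11.2 Regularization for sparsity

In a high-dimensional sparse vector, it is best to drive weights to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features saves RAM and may reduce noise in the model.

L1 regularization, a type of regularization that penalizes weights in proportion to the sum of their absolute values. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.

By contrast, L_{2} regularization makes weights small but cannot drive them to exactly 0.0.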

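11.3 L_{1} vs. L_{2} regularization

L_{1} and L_{2} penalize weights differently:

  • L_{1} penalizes the absolute value of the weight, \left | w \right |.
  • L_{2} penalizes the square of the weight, w^{2}.

Consequently, L_{1} and L_{2} have different derivatives:

  • The derivative of L_{1} is k (a constant whose value is independent of the weight).
  • The derivative of L_{2} is 2 * weight.

You can think of the derivative of L_{1} as a force that subtracts a constant from the weight at every step. However, because of the absolute value, L_{1} has a discontinuity at 0, which causes subtraction results that cross 0 to be set to 0. For example, if the subtraction would have moved a weight from +0.1 to -0.2, L_{1} sets the weight to exactly 0. That is how L_{1} drives weights to 0.

You can think of the derivative of L_{2} as a force that removes x% of the weight at every step. For any number, even if you remove x% of it billions of times, the result never reaches exactly 0. In other words, L_{2} does not normally drive weights to 0.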
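
The difference between the two update rules can be sketched as follows. This is a simplified illustration of the penalty step only (no loss gradient), and the function names, the constant k and the fraction removed per step are assumptions chosen to match the +0.1 example in the text.

```python
def l1_step(weight, k):
    """Move the weight toward 0 by a constant k; clamp at 0 if it would cross."""
    if weight > 0:
        return max(weight - k, 0.0)
    if weight < 0:
        return min(weight + k, 0.0)
    return 0.0


def l2_step(weight, fraction):
    """Remove a fixed fraction of the weight."""
    return weight * (1.0 - fraction)


# L1: subtracting 0.3 from +0.1 would give -0.2, so the weight is set to 0.
print(l1_step(0.1, 0.3))  # 0.0

# L2: removing 10% per step shrinks the weight but never reaches exactly 0.
weight = 0.1
for _ in range(1000):
    weight = l2_step(weight, 0.1)
print(weight > 0.0)  # True
```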

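Summary

L_{1} regularization, which reduces the absolute value of all the weights, is very effective for wide models.

The following compares the effects of L1 and L2 regularization on a network of weights:

As you can see, L1 regularization drives very small weights to 0.

11.4 Keywords

L1 regularization, a type of regularization that penalizes weights in proportion to the sum of their absolute values. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.

L2 regularization, a type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0, but not exactly to 0. L2 regularization always improves generalization in linear models.

Reference: https://developers.google.cn/machine-learning/crash-course/regularization-for-sparsity/l1-regularization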

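Twelve, neural networks

12.1 Preface

This section introduces neural networks. Some classification problems are nonlinear:

"Nonlinear" means that you cannot accurately predict a label with a model of the form:

b + w_{1}x_{1} + w_{2}x_{2}

In other words, the "decision surface" is not a line.

Now consider the following data set (a more difficult nonlinear classification problem):

The data set shown above cannot be solved with a linear model.

12.2 Approaches to nonlinear problems

One feasible approach to modeling a nonlinear problem is to use feature crosses.

Alternatively, a neural network can be used to solve nonlinear problems, usually with good results. It is not necessarily always better than feature crosses, but it does offer a flexible alternative that works well in many cases.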

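12.3 Single-layer neural network

A single-layer neural network is really just a linear model, with the following structure:

Each blue circle represents an input feature, and the green circle represents the weighted sum of the inputs.

There is clearly both an input layer and an output layer here, so why is it called a single-layer neural network?

By convention, when counting the layers of a neural network, only layers that have weight parameters are counted. The input layer merely passes in the feature data and has no parameters, so it is not included in the layer count.

12.4 Two-layer neural network

Add one more layer between the input layer and the output layer; it is called a hidden layer, as in the following figure:

Each yellow node in the hidden layer is a weighted sum of the blue input node values. The output is a weighted sum of the yellow nodes.

The output of a two-layer neural network is still a linear combination of its inputs.

12.5 Three-layer neural network

This network consists of an input layer, hidden layer 1, hidden layer 2, and an output layer; every layer between the input layer and the output layer is called a hidden layer:

This model is still linear. When you express the output as a function of the input and simplify, you get just another weighted sum of the inputs.

This weighted sum cannot effectively model a nonlinear problem.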
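
A small numerical sketch can make the "still linear" claim concrete: passing an input through two weight matrices in sequence gives the same result as passing it through their product, which is a single linear layer. The weight values are made up, biases are omitted for brevity, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in left]


# Made-up weights: 2 inputs -> 3 hidden nodes -> 1 output.
hidden_weights = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
output_weights = [[1.0, -1.0, 2.0]]

x = [2.0, 3.0]

# Layer by layer: hidden weighted sums, then the output weighted sum.
stacked_output = matvec(output_weights, matvec(hidden_weights, x))

# Collapsed: one equivalent weight matrix applied directly to the input.
collapsed_weights = matmul(output_weights, hidden_weights)
collapsed_output = matvec(collapsed_weights, x)

print(stacked_output)     # [22.0]
print(collapsed_weights)  # [[6.5, 3.0]]
print(collapsed_output)   # [22.0]
```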

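12.6 Activation functions

To model a nonlinear problem, we can directly introduce a nonlinearity: we can pipe each hidden layer node through a nonlinear function.

In the model shown in the following figure, the value of each node in hidden layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.

A three-layer model with an activation function:

Now that we have added an activation function, adding layers has more impact. Stacking nonlinearities on nonlinearities lets us model very complicated relationships between the inputs and the predicted outputs.

In effect, each layer learns a more complex, higher-level function over the raw inputs.

Common activation functions

1. The sigmoid activation function converts the weighted sum to a value between 0 and 1.

F(x)=\frac{1}{1+e^{-x}}

Its plot looks like this:

2. The rectified linear unit activation function, or ReLU for short, often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute.

F(x)=max(0,x)

The superiority of ReLU is based on empirical findings, probably because ReLU has a more useful range of responsiveness; a sigmoid's responsiveness falls off relatively quickly on both sides. The ReLU activation function looks like this:

In fact, any mathematical function can serve as an activation function. Suppose that \sigma represents our activation function (ReLU, sigmoid, and so on). The value of a node in the network is then given by the following formula:

\sigma (w\cdot x+b)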
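
The node formula can be sketched with the two activation functions described above. The weights, inputs and bias are made-up numbers, and node_value is a hypothetical helper that takes the activation function as a parameter.

```python
import math


def sigmoid(x):
    """Sigmoid activation: squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def node_value(weights, inputs, bias, activation):
    """Compute activation(w . x + b) for a single node."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Made-up values: w . x + b = (0.5)(2.0) + (-1.0)(3.0) + 0.5 = -1.5
weights = [0.5, -1.0]
inputs = [2.0, 3.0]
bias = 0.5

print(node_value(weights, inputs, bias, relu))               # 0.0
print(round(node_value(weights, inputs, bias, sigmoid), 3))  # 0.182
```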

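12.7 Summary

Our model now has all the standard components of what people usually mean by a "neural network":

  • A set of nodes, analogous to neurons, organized in layers.
  • A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer or some other kind of layer.
  • A set of biases, one for each node.
  • An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.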

Reference: https://developers.google.cn/machine-learning/crash-course/introduction-to-neural-networks/anatomy

This article is for your reference, welcome to communicate~

If there are any errors, please point them out, thank you.