# Preface

This article is based on the machine learning crash course on Google's official website. The course as a whole is fairly easy to follow, for everyone to learn and refer to; the article also adds refinements based on my own understanding. According to a notice on the official site, the Chinese version of the machine learning crash course will no longer be provided after July 2021; only the English version will remain available.

Preface

1. Machine learning-main terms (integrated version)

2. Training and loss

2.1 Preface

2.2 Training model

2.3 Loss function

2.4 Keywords

3. Model iteration

3.1 Preface

3.2 Iterative method of training model

3.3 Keywords (training, convergence, loss)

3.4 Gradient descent method

3.5 Implementation process of gradient descent method

4. Batch and stochastic gradient descent

4.1 Preface

4.2 Stochastic Gradient Descent (SGD)

4.3 Mini-batch gradient descent (mini-batch SGD)

5. Learning rate

5.1 Preface

5.2 Learning rate

5.3 Choose the learning rate

5.4 Keywords

6. Generalization and overfitting

6.1 Preface

6.2 Overfitting

6.3 William of Occam

6.4 Data set split

6.5 Machine Learning-Generalization Rules

6.6 Summary

6.7 Keywords

7. Data set division

7.1 Preface

7.2 Divided into training set and test set

7.3 Divided into training set, validation set, and test set

7.4 Keywords

8. Feature Engineering

8.1 Preface

8.2 Mapping raw data to features

8.3 Mapping values

8.4 Mapping classification values

8.5 Sparse representation

8.6 Characteristics of good features

8.7 Keywords

9. L2 regularization (simplicity)

9.1 Preface

9.2 Principle

9.3 Complexity

9.4 L2 regularization

9.5 Simplified regularization: Lambda

9.6 Regularization and learning rate

10. Logistic regression, linear regression

10.1 Preface

10.2 Ways to calculate and return probabilities

10.3 Sigmoid function

10.4 Logistic regression inference calculation

10.5 Loss function of logistic regression

10.6 Regularization in Logistic Regression

10.7 Logistic regression summary

10.8 Linear regression

11. L1 regularization (sparsity)

11.1 Preface

11.2 Regularization of sparsity

11.3 L1 and L2 regularization comparison

11.4 Keywords

12. Neural network

12.1 Preface

12.2 Solutions to nonlinear problems

12.3 Single layer neural network

12.4 Two-layer neural network

12.5 Three-layer neural network

12.6 Activation function

12.7 Summary

# 1. Machine learning-main terms (integrated version)

Refer to Google's official website for the explanation of machine learning terms, summarize and describe.

https://guo-pu.blog.csdn.net/article/details/117442581

# 2. Training and loss

## 2.1 Preface

Training a model means learning good values for all of the weights w and the bias b in the model from labeled samples. In supervised learning, a machine learning algorithm builds a model by examining many samples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.

Loss is the penalty for a bad prediction; it is a numerical value indicating how inaccurate the model's prediction was on a single sample. If the model's prediction is perfect, the loss is zero; otherwise, the loss is larger.

## 2.2 Training model

The goal of training a model is to find a set of weights and biases that have low average loss across all samples.

The red arrow indicates the loss; the blue line indicates the forecast. The red arrow in the graph on the left is much longer than the corresponding red arrow in the graph on the right; that is, the actual point is farther away from the model prediction, and the difference is greater.

The model with the larger loss is shown on the left; the model with the smaller loss is shown on the right.

## 2.3 Loss function

Squared loss is a common loss function. The linear regression model uses a loss function called squared loss (also known as L2 loss). The squared loss for a single sample is:

  = the square of the difference between the label and the prediction
  = (observation - prediction(x))²
  = (y - y')²

The mean squared error (MSE) is the average squared loss per sample over the whole data set. To calculate MSE, sum the squared losses of all the individual samples and then divide by the number of samples:

MSE = (1/n) · Σ over (x,y) in D of (y - prediction(x))²

among them:

• (x, y) refers to a sample, in which x is the set of features the model uses to make predictions (such as temperature or age) and y is the sample's label (such as the number of cricket chirps per minute)
• Prediction (x) refers to the function of combining weights and biases with feature set x.
• D refers to a data set containing multiple labeled samples.
• n refers to the number of samples in D.

MSE is commonly used in regression tasks; the cross-entropy loss function is commonly used in classification tasks.
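As a quick sketch, the MSE calculation above can be written in a few lines of Python; the sample labels and predictions here are invented for illustration:

```python
def mse(labels, predictions):
    """Mean squared error: the average of (y - y')**2 over all samples."""
    assert len(labels) == len(predictions)
    return sum((y - y_pred) ** 2 for y, y_pred in zip(labels, predictions)) / len(labels)

# Hypothetical labels y and model predictions y' for five samples.
labels = [3.0, -0.5, 2.0, 7.0, 4.0]
predictions = [2.5, 0.0, 2.0, 8.0, 3.0]

# Squared losses are 0.25, 0.25, 0, 1, 1; their sum 2.5 divided by 5 samples gives 0.5.
print(mse(labels, predictions))  # 0.5
```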

## 2.4 Keywords

Empirical risk minimization (ERM, empirical risk minimization): choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

Mean squared error (MSE, Mean Squared Error): the average squared loss per sample. MSE is calculated by dividing the sum of the squared losses by the number of samples.

Squared loss function (squared loss): the loss function used in linear regression (also known as the L2 loss function). This function calculates the square of the difference between a labeled sample's predicted value and the true value of its label. Because of the squaring, this loss function amplifies the influence of bad predictions. Compared with the L1 loss function, the squared loss function reacts more strongly to outliers.

Training (training) the process of constructing the ideal parameters of the model.

Loss (Loss) is a measure of how far a model's prediction deviates from its label. To determine this value, the model must define a loss function. For example: linear regression models typically use the mean squared error (MSE) loss function, while classification models use the cross-entropy loss function.

# 3. Model iteration

## 3.1 Preface

When training a machine learning model, first make an initial guess for the weights and bias, then repeatedly adjust those guessed parameters (weights and bias) until the weights and bias with the lowest possible loss are found.

## 3.2 Iterative method of training model

The iterative trial and error process of machine learning algorithms used to train the model:

The "model" part takes one or more features as input and returns one prediction (y') as output.

To simplify, consider a model that takes one feature and returns one prediction:

y' = b + w1 · x1

What initial values should be set for b and w1? For linear regression problems, it turns out that the initial values do not matter much. (Note: for other kinds of models the initial values can matter a great deal; treat each model on its own terms.) We could pick random values, or simply use these trivial values:

b = 0
w1 = 0

If the first feature value is 10, substituting that value into the prediction function gives:

  y' = 0 + 0 · (10)
  y' = 0

Then you need to calculate the loss. The "calculation loss" part in the above figure is the loss function used by the model, such as the square loss function.

The loss function will take two input values:

• y': the model's prediction for feature x
• y: the correct label corresponding to feature x

Finally, we reach the "calculation parameter update" part of the figure. Here the machine learning system examines the value of the loss function and generates new values for b and w1.

Suppose this mysterious green box devises new values for b and w1, and those values yield a new prediction and a new loss. The learning process keeps iterating until the algorithm finds the model parameters with the lowest possible loss; a better model is then obtained, and its parameters are saved.

Generally, iterate continuously until the overall loss no longer changes or changes extremely slowly, at which point the model has converged.

## 3.3 Keywords (training, convergence, loss)

Training (training) the process of constructing the ideal parameters of the model.

Convergence (convergence) usually refers to a state reached during training in which, after a certain number of iterations, the training loss and validation loss change very little or not at all between iterations. In deep learning, loss values sometimes stay flat or nearly flat for many iterations before finally dropping, temporarily creating an illusion of convergence. See also the early stopping method, and Convex Optimization by Boyd and Vandenberghe.

Loss (Loss) is a measure of how far a model's prediction deviates from its label. To determine this value, the model must define a loss function. For example: linear regression models typically use the mean squared error (MSE) loss function, while classification models use the cross-entropy loss function.

## 3.4 Gradient descent method

In the iteration of the training model, the "calculation parameter update" part can be implemented using the gradient descent method.

Suppose we had the time and computing resources to calculate the loss for all possible values of w1. For the kind of regression problem we have been examining, the resulting plot of loss versus w1 is always convex, as shown in the following figure:

The loss and weights generated by the regression problem are convex.

Convex problems have only one lowest point; that is, there is only one position where the slope is exactly 0. This minimum is where the loss function converges.

Finding the convergence point by computing the loss function for every conceivable value of w1 over the entire data set would be very inefficient. Let's examine a better mechanism, one very popular in the field of machine learning, called the gradient descent method.

## 3.5 Implementation process of gradient descent method

The first stage of the gradient descent method is to pick a starting value (a starting point) for w1. The following figure shows that we have picked a starting point slightly greater than 0:

Then, the gradient descent method calculates the gradient of the loss curve at the starting point. The gradient is a vector of partial derivatives; it tells the model which direction moves "closer to" or "farther from" the target.

The knowledge of calculus is mainly used here. Usually open source machine learning frameworks have helped us calculate the gradient, such as TensorFlow.

Partial derivative

A multivariable function is a function with more than one argument, for example:

f(x, y) = e^(2y) · sin(x)

The partial derivative of f with respect to x, written ∂f/∂x, is the derivative of f considered as a function of x alone.

To calculate ∂f/∂x, you must hold y fixed (so f is momentarily a function of the single variable x) and then take the ordinary derivative of f with respect to x. For example, with y fixed at 1, the preceding function becomes:

f(x) = e² · sin(x)

This is simply a function of one variable, x, and its derivative is:

e² · cos(x)

In general, holding y constant, the partial derivative of f with respect to x is calculated as follows:

∂f/∂x (x, y) = e^(2y) · cos(x)

Similarly, if we hold x constant, the partial derivative of f with respect to y is:

∂f/∂y (x, y) = 2 · e^(2y) · sin(x)

Intuitively, a partial derivative tells you how much the function changes when one variable is perturbed slightly. In the preceding example:

∂f/∂x (0, 1) = e² ≈ 7.4

Therefore, if we start at the point (0, 1), hold y fixed, and move x a little, f changes by about 7.4 times the amount by which x changes.

In machine learning, partial derivatives are mainly used together with the gradient of a function.
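The course's worked example, f(x, y) = e^(2y) · sin(x), can be checked numerically with a finite-difference approximation; this sketch is illustrative, and the helper name `partial_x` is invented:

```python
import math

def f(x, y):
    # The course's example function: f(x, y) = e^(2y) * sin(x)
    return math.exp(2 * y) * math.sin(x)

def partial_x(g, x, y, h=1e-6):
    # Central finite-difference approximation of df/dx at (x, y), with y held fixed.
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

# At the starting point (0, 1): df/dx = e^2 * cos(0) = e^2 ≈ 7.389
print(round(partial_x(f, 0.0, 1.0), 3))
```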

The gradient of a function, written ∇f, is the vector of its partial derivatives with respect to all of its independent variables.

Note that:

• ∇f points in the direction in which the function increases fastest.
• -∇f points in the direction in which the function decreases fastest.

The number of dimensions of ∇f is equal to the number of variables in the formula for f; the gradient vector lives in the function's domain space.

For example, consider the following function viewed in three-dimensional space:

f(x, y) = 4 + (x - 2)² + 2y²

Just like a valley, its lowest point is (2, 0, 4):

The gradient of f(x, y) is a two-dimensional vector that tells us in which direction to move so that the height drops fastest; that is, the gradient vector points toward the bottom of the valley.

In machine learning, gradients are used in gradient descent methods. Our loss function usually has many variables, and we try to minimize the loss function by following the negative direction of the gradient of the function.

It should be noted that the gradient is a vector, so it has the following two characteristics:

• direction
• size

The gradient always points to the direction of the most rapid growth in the loss function. The gradient descent method will take a step in the direction of the negative gradient in order to reduce the loss as soon as possible. The gradient descent method relies on negative gradients:

To determine the next point on the loss function curve, the gradient descent method adds a fraction of the gradient's magnitude to the starting point, as shown in the following figure:

A gradient step moves us to the next point on the loss curve. Then, the gradient descent method will repeat this process, gradually approaching the lowest point.

Gradient descent is a technique that minimizes loss by computing gradients. Conditioned on the training data, it computes the gradient of the loss with respect to the model's parameters, then adjusts the parameters iteratively, gradually finding the best combination of weights and bias to minimize the loss.
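The iterative loop described above can be sketched for the one-feature linear model y' = b + w1·x1 with squared loss; the toy data and learning rate below are invented for illustration:

```python
# Gradient descent for y' = b + w1 * x, minimizing mean squared error.
# Toy data generated from y = 2x + 1, so we expect w1 -> 2 and b -> 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w1, b = 0.0, 0.0          # trivial initial values, as in the text
learning_rate = 0.05

for step in range(2000):
    n = len(xs)
    # Partial derivatives of MSE with respect to w1 and b.
    grad_w1 = sum(2 * (b + w1 * x - y) * x for x, y in zip(xs, ys)) / n
    grad_b  = sum(2 * (b + w1 * x - y)     for x, y in zip(xs, ys)) / n
    # Take a step in the direction of the negative gradient.
    w1 -= learning_rate * grad_w1
    b  -= learning_rate * grad_b

print(round(w1, 2), round(b, 2))  # converges toward 2.0 and 1.0
```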

# 4. Batch and stochastic gradient descent

## 4.1 Preface

In the gradient descent method, the batch refers to the total number of samples used to calculate the gradient in a single iteration; that is, the batch in the gradient descent method refers to the entire data set.

A large data set may contain millions, tens of millions, or even hundreds of millions of samples, each with a large number of features. A batch can therefore be enormous, and a single iteration may take a very long time to compute.

Generally, the larger the batch size, the higher the likelihood of redundancy. Some redundancy may help eliminate messy gradients, but the predictive value of very large batches is often not higher than that of large batches.

## 4.2 Stochastic Gradient Descent (SGD)

Background

On large data sets, full-batch gradient descent is time-consuming for little extra benefit. If we could get the right gradient on average with much less computation, the result would be better: by choosing samples at random from the data set, we can estimate a big average from a much smaller one.

Principle

Stochastic gradient descent uses only one sample per iteration (a batch size of 1).

Given enough iterations, SGD works, but the process is very noisy. The term "stochastic" indicates that the one sample comprising each batch is chosen at random.

## 4.3 Mini-batch gradient descent (mini-batch SGD)

Mini-batch gradient descent is a compromise between full-batch iteration (the gradient descent method) and using a single sample per iteration (stochastic gradient descent).

It randomly selects a subset of samples from the data set to form a small batch and iterates on it. Mini-batches usually contain between 10 and 1,000 randomly chosen samples. Mini-batch SGD reduces the noise present in SGD while still being more efficient than full-batch iteration.

Of the three methods (full-batch gradient descent, stochastic gradient descent, and mini-batch gradient descent), mini-batch gradient descent is usually the one used to iterate a model in practice.
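The three variants differ only in how many samples feed each gradient step. A minimal sketch of that sampling choice (the function name and data set are invented for illustration):

```python
import random

def sample_batch(dataset, batch_size):
    """Pick the samples used for one gradient step.

    batch_size == len(dataset)     -> full-batch gradient descent
    batch_size == 1                -> stochastic gradient descent (SGD)
    1 < batch_size < len(dataset)  -> mini-batch gradient descent
    """
    return random.sample(dataset, batch_size)

dataset = list(range(10_000))           # stand-in for 10,000 training samples
full_batch = sample_batch(dataset, len(dataset))
sgd_batch  = sample_batch(dataset, 1)
mini_batch = sample_batch(dataset, 32)  # typical mini-batch sizes: 10-1,000

print(len(full_batch), len(sgd_batch), len(mini_batch))  # 10000 1 32
```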

# 5. Learning rate

## 5.1 Preface

The gradient vector has a direction and size; the gradient descent algorithm multiplies the gradient by a scalar called the learning rate (sometimes called the step size) to determine the location of the next point.

For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, the gradient descent algorithm will choose the next point 0.025 away from the previous point.

## 5.2 Learning rate

Hyperparameters are the "knobs" that programmers adjust in machine learning algorithms. Most machine learning programmers spend a considerable amount of time tuning the learning rate.

If the selected learning rate is too small, it will take too long to learn:

If the selected learning rate is too large, the next point will always bounce at the bottom of the U-shaped curve, and the global lowest point cannot be found:

If the selected learning rate is just right, the minimum is reached efficiently in relatively few steps:
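The effect of these three regimes can be sketched numerically; the quadratic loss below is a stand-in chosen for illustration, not part of the course:

```python
def descend(learning_rate, steps=50, w=5.0):
    """Run gradient descent on loss(w) = w**2 (whose gradient is 2w) and return the final w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

print(descend(0.001))  # too small: after 50 steps, w has barely moved toward the minimum at 0
print(descend(1.1))    # too large: w overshoots, oscillates, and diverges
print(descend(0.3))    # about right: w lands very close to the minimum at 0
```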

## 5.3 Choose the learning rate

The learning rate is related to the flatness of the loss function. If you know that the gradient of the loss function is small, you can try a larger learning rate to compensate for the smaller gradient and obtain a larger step size.

The ideal learning rate in one-dimensional space is 1 / f''(x), the reciprocal of the second derivative of f(x) at x.

The ideal learning rate in two or more dimensions is the inverse of the Hessian matrix (the matrix of second partial derivatives).

The situation of generalized convex functions is more complicated.

For details on the Hessian matrix, see Wikipedia:  https://en.wikipedia.org/wiki/Hessian_matrix

## 5.4 Keywords

Parameters: the model variables trained by the machine learning system, for example the weights. Their values are learned gradually through successive training iterations. Contrast with hyperparameters.

Hyperparameters (hyperparameter): the values that must be manually adjusted and specified across successive runs of model training, for example the learning rate. Contrast with parameters.

Learning rate , a scalar used for gradient descent when training the model. During each iteration, the gradient descent method multiplies the learning rate by the gradient; the resulting product is called the gradient step size.

# 6. Generalization and overfitting

## 6.1 Preface

In order to understand the concept of generalization, first look at 3 pictures. Assume that each point in these figures represents the position of a tree in the forest.

The two colors in the picture represent the following meanings:

• The blue dot represents the sick tree
• Orange dots represent healthy trees

In the picture above, there are sick and healthy trees, corresponding to blue dots and orange dots respectively.

A model needs to be designed to distinguish those trees that are diseased and those that are healthy. The effect of the model is as follows:

On the surface, the model seems to do a good job of separating the sick trees from the healthy ones.

In fact, the model is overfitting!

If you use this model to predict some new data, the effect is as follows:

The model is very poor in processing new data, and the classification of most of the new data is incorrect.

## 6.2 Overfitting

Introduction

The overfitting model has very low loss during the training process, but it performs poorly when predicting new data.

cause

Overfitting is caused by training data that is too scarce and by model complexity that exceeds what the task requires. That is, the model's structure is more complicated than the rules or patterns the task actually needs to express.

The goal of machine learning

The goal of machine learning is to make good predictions for new data drawn from the true probability distribution; that is, to make good predictions for new data that has not been seen before.

## 6.3 William of Occam

William of Occam was a 14th-century monk and philosopher who admired simplicity. He believed that scientists should prefer simpler (rather than more complex) formulas or theories. Occam's razor applies to machine learning as follows:

The less complex a machine learning model is, the more likely it is that a good empirical result is not merely due to the peculiarities of the sample.

Today, Occam's razor has been formalized in the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds, which statistically describe a model's ability to generalize to new data based on factors such as:

• Complexity of the model
• The model's performance in processing training data

Although theoretical analysis can provide formal guarantees under ideal assumptions, it is difficult to apply in practice.

For example, if the model above fits the data more simply, using a single line to separate the sick trees from the healthy ones, the model no longer overfits; although the boundary is not perfectly accurate, it classifies most of the trees correctly.

## 6.4 Data set split

Machine learning models are designed to make good predictions based on new data that has not been seen before. But if you want to build a model based on a data set, how do you get data that you haven't seen before? One way is to divide the data set into two subsets:

• Training set  -the subset used to train the model.
• Test set  -the subset used to test the model.

Generally speaking, good performance on the test set is a useful indicator of whether it can perform well on new data, provided that:

• The test set is large enough.
• The same test set is not reused over and over (that would be cheating).

## 6.5 Machine Learning-Generalization Rules

The following three basic assumptions clarify generalization:

• We randomly select independent and identically distributed  ( iid ) samples from the distribution . In other words, the samples will not affect each other. (Another explanation: iid is a way to express the randomness of variables).
• The distribution is stable ; that is, the distribution does not change within the data set.
• We draw samples from data partitions of the same distribution .

In practice, we sometimes violate these assumptions. E.g:

• Imagine a model for selecting advertisements to be displayed. If the model selects ads based on the ads that users have seen before, it will violate the iid assumption.
• Imagine a data set containing one year of retail information. There will be seasonal changes in users' buying behavior, which violates stability.

## 6.6 Summary

• If a model tries to closely fit the training data but cannot generalize to the new data well, overfitting will occur.
• If the key assumptions of supervised machine learning are not met, then we will lose the important theoretical guarantee of the ability to predict new data.

## 6.7 Keywords

Generalization refers to a model's ability to make correct predictions on new, previously unseen data, as opposed to the data used to train the model.

Overfitting (overfitting), the created model matches the training data too much, so that the model cannot make correct predictions based on the new data.

Prediction (prediction): a model's output after receiving a data sample.

Stationarity (stationarity): a property of data in a data set, indicating that the data's distribution stays constant across one or more dimensions. The most common such dimension is time; data that exhibits stationarity does not change over time.

Training set (training set), a subset of the data set, used to train the model. Contrast with validation set and test set.

Validation set (validation set), a subset of the data set, separated from the training set, used to adjust the hyperparameters. Contrast with training set and test set.

Test set (test set), a subset of the data set, used to test the model after the model has been initially verified by the validation set. Contrast with training set and validation set.

# 7. Data set division

## 7.1 Preface

In machine learning, the data set can be divided into two subsets, namely the training set and the test set. A better way is to divide the data set into three subsets, namely training set, validation set, and test set.

## 7.2 Divided into training set and test set

The concept of dividing the data set into two subsets:

Training set —used to train the model;

Test set —used to test the trained model

For example, divide the data set into a training set and a test set:

When using this scheme, you need to ensure that the test set meets the following two conditions:

• The scale is large enough to produce statistically significant results.
• Can represent the entire data set. That is, the characteristics of the selected test set should be the same as the characteristics of the training set.

When the test set meets the above two conditions, a model that can be better generalized to new data can usually be obtained.

The process of using the training set and the test set to train the model

"Adjusting the model" means adjusting anything about the model: its parameters, its hyperparameters, or its structure, such as changing the learning rate, adding or removing features, or designing a brand-new model from scratch.

## 7.3 Divided into training set, validation set, and test set

Dividing the data set into three subsets, as shown in the figure below, can greatly reduce the chances of overfitting:

The process of training the model using the training set, validation set, and test set

First, use the validation set to pick the model that performs best. Then double-check that model against the test set.

A model trained with this workflow generalizes better, because less information is exposed to the test set.

By contrast, if you tune the model's performance directly against the test set, the model keeps learning the test set's quirks, and the test set stops behaving like new data: the model ends up knowing something about the test set, while genuinely new data remains completely unseen.

note

Repeatedly evaluating against the same test set and validation set causes them to gradually lose their usefulness. The more often the same data is used to decide hyperparameter settings or other model improvements, the less confidence we can have that those results will truly generalize to new, unseen data.

Recommendation: Collect more data to "refresh" the test set and validation set. Starting over is a good way to reset.
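A minimal sketch of the three-way split described above; the 70/15/15 proportions, the fixed seed, and the function name are assumptions made for illustration:

```python
import random

def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and split samples into training, validation, and test subsets.

    The 70/15/15 proportions are only an example; choose fractions
    appropriate to your data set's size.
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

train, validation, test = split_dataset(list(range(1000)))
print(len(train), len(validation), len(test))  # 700 150 150
```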

## 7.4 Keywords

Training set (training set), a subset of the data set, used to train the model. Contrast with validation set and test set.

Validation set (validation set), a subset of the data set, separated from the training set, used to adjust the hyperparameters. Contrast with training set and test set.

Test set (test set), a subset of the data set, used to test the model after the model has been initially verified by the validation set. Contrast with training set and validation set.

Overfitting (overfitting), the created model matches the training data too much, so that the model cannot make correct predictions based on the new data.

# 8. Feature Engineering

## 8.1 Preface

The focus of traditional programming is code. In machine learning projects, the focus becomes feature representation; that is, developers adjust the model by adding and improving features.

Feature engineering means transforming raw data into feature vectors; expect to spend a significant amount of time on it.

## 8.2 Mapping raw data to features

In the figure below, the left side represents the original data from the input data source, and the right side represents the feature vector, which is the set of floating-point values ​​that make up the samples in the data set.

Feature engineering maps raw data to machine learning features.

## 8.3 Mapping values

Integer and floating point data do not need special coding, because they can be multiplied with digital weights.

In the figure below, converting the raw integer value 6 into the feature value 6.0 is a trivial mapping; no special encoding is needed.

## 8.4 Mapping classification values

The classification feature has a discrete set of possible values. For example, there may be a feature named street_name with options including:

{'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}

Since the model cannot multiply the string with the learned weight, we use feature engineering to convert the string into a numeric value.

Realization ideas

You can define a mapping from feature values ​​(called a vocabulary of possible values) to integers.

Not every street in the world will appear in our data set, so we can group all other streets into an all-inclusive "other" category called OOV bucketing (out of vocabulary).

Implementation process

With the above method, we can map street names to numbers in the following way:

• Map Charleston Road to 0
• Map North Shoreline Boulevard to 1
• Map Shorebird Way to 2
• Map Rengstorff Avenue to 3
• Map all other streets (OOV) to 4

However, if we incorporate these index numbers directly into the model, it will cause some limitations:

1) We will learn a single weight that applies to all streets.

For example, if we learn that the weight of street_name is 6, then for Charleston Road, multiply it by 0; for North Shoreline Boulevard, multiply it by 1; for Shorebird Way, multiply it by 2, and so on.

Take a model that uses street_name as a feature to predict housing prices. It is unlikely that housing prices adjust linearly with a street's index number; furthermore, this scheme would assume the streets had been ranked by their average house price.

Our model needs to flexibly learn different weights for each street, and these weights will be added to the estimated housing prices using other features.

2) We did not take into account that street_name may have multiple values. For example, many houses are located on the corners of two streets, so if the model includes a single index, this information cannot be encoded in the street_name value.

To remove both of the above limitations, we can instead create a binary vector for each categorical feature in the model to represent these values:

• For the value of the sample used, set the corresponding vector element to 1.
• Set all other elements to 0;

The length of this vector equals the number of elements in the vocabulary. When exactly one value is 1, this representation is called one-hot encoding; when multiple values are 1, it is called multi-hot encoding.

The figure shows the street address feature encoded with one-hot encoding: the street Shorebird Way is mapped to its one-hot vector.

In this binary vector, the value of the element representing Shorebird Way is 1, and the value of the element representing all other streets is 0.

Summary

This method effectively creates one Boolean variable for each feature value. With it, if a house is on Shorebird Way, only the binary value for Shorebird Way is 1, so the model uses only the weight for Shorebird Way.

If the house is at the corner of two streets, set the two binary values ​​to 1, and the model will use their respective weights.
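The street-name example above can be sketched directly; the vocabulary and OOV bucket follow the mapping given earlier, while the helper function name and the unknown street "Main Street" are invented for the example:

```python
# Vocabulary mapping from the example above, with one OOV bucket at the end.
vocab = {
    'Charleston Road': 0,
    'North Shoreline Boulevard': 1,
    'Shorebird Way': 2,
    'Rengstorff Avenue': 3,
}
OOV_INDEX = 4    # all streets outside the vocabulary share this "other" bucket
VECTOR_LEN = 5   # vocabulary size plus the OOV bucket

def encode_streets(streets):
    """Multi-hot encode a list of street names (one-hot when the list has one name)."""
    vec = [0] * VECTOR_LEN
    for street in streets:
        vec[vocab.get(street, OOV_INDEX)] = 1
    return vec

print(encode_streets(['Shorebird Way']))                   # [0, 0, 1, 0, 0]
print(encode_streets(['Charleston Road', 'Main Street']))  # [1, 0, 0, 0, 1]  (Main Street -> OOV)
```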

## 8.5 Sparse representation

background

Suppose there are 1 million different street names in the data set, and you want to include them as the value of street_name.

Directly creating a binary vector of 1 million elements in which only 1 or 2 elements are nonzero is a very inefficient representation: it consumes a great deal of storage space and computation time when processing those vectors.

Introduction

In this case, a common approach is to use a sparse representation, in which only the nonzero values are stored. An independent model weight is still learned for each feature value.
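A minimal sketch of the idea: store only the indices of the nonzero entries instead of the full binary vector. The helper names are invented for illustration:

```python
def to_sparse(dense_vector):
    """Store only the positions of the nonzero entries (here, the 1s)."""
    return [i for i, v in enumerate(dense_vector) if v != 0]

def to_dense(sparse_indices, length):
    """Reconstruct the full binary vector from the stored indices."""
    vec = [0] * length
    for i in sparse_indices:
        vec[i] = 1
    return vec

# A one-hot vector over "1 million streets" stores one index instead of 999,999 zeros and a one.
dense = [0] * 1_000_000
dense[314_159] = 1

sparse = to_sparse(dense)
print(sparse)                                # [314159]
print(to_dense(sparse, 1_000_000) == dense)  # True
```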

## 8.6 Characteristics of good features

We explored ways to map raw data to suitable feature vectors, but this is only part of the work. Then you need to explore what value is considered a good feature in these feature vectors.

• Avoid rarely used discrete feature values
• Prefer clear and obvious meanings
• Do not mix "magic" values into actual data
• Account for upstream instability

5.1) Avoid rarely used discrete feature values

Good feature values should appear more than 5 times in the data set. That way, the model can learn how the feature value relates to the label. A large number of samples sharing the same discrete value gives the model a chance to see the feature in different settings and determine when it is a good predictor of the label. For example, the house_type feature contains a large number of samples in which its value is victorian:

house_type: victorian


Conversely, if a feature value appears only once, or only rarely, the model cannot make predictions based on it. For example, unique_house_id is unsuitable as a feature: each value is used only once, so the model cannot learn anything from it:

unique_house_id: 8SK982ZZ1242Z

5.2) Prefer clear and unambiguous meanings

Each feature should have a meaning that is clear to anyone on the project. For example, the age of a house in years is immediately recognizable as a feature: house_age: 27

Conversely, some feature values are recognizable only to the engineer who created them: house_age: 851472000

In some cases, noisy data (rather than a poor choice of encoding) leads to unclear values. For example, if the source of user_age is never vetted, invalid values can slip in: user_age: 277

5.3) Do not mix special values into actual data

A good floating-point feature contains no peculiar out-of-range discontinuities or "magic" values. For example, suppose a feature holds a floating-point value between 0 and 1. Then values like the following are fine:

quality_rating: 0.82
quality_rating: 0.37

However, if a user did not enter quality_rating, the data set may use a special value like the following to indicate that it is missing:

quality_rating: -1

To work around special values, convert the feature into two features:

• One feature stores only the quality rating and never the special value.
• One feature stores a Boolean value indicating whether a quality_rating was provided.
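The conversion described above can be sketched as follows (the helper name and the -1 sentinel convention are taken from the example; the 0.0 default for missing ratings is an assumption):

```python
def split_quality_rating(raw):
    """Convert a rating with a -1 sentinel into two features.

    Returns (rating, is_provided): the rating itself (0.0 when missing)
    and a Boolean flag saying whether the user provided one.
    """
    if raw == -1:          # sentinel meaning "not entered"
        return 0.0, False
    return raw, True

print(split_quality_rating(0.82))  # (0.82, True)
print(split_quality_rating(-1))    # (0.0, False)
```

After this split, the floating-point feature stays within its 0-to-1 range, and the model can learn from the Boolean flag whether a missing rating is itself informative.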

5.4) Account for upstream instability

The definition of a feature should not change over time. For example, the following value is useful because city names generally do not change:

city_id: "br/sao_paulo"

But collecting a value inferred by another model incurs extra costs. The value "219" may currently represent Sao Paulo, but that representation can easily change the next time the other model is run:

inferred_city_cluster: "219"

## 8.7 Keywords

Feature engineering refers to determining which features might be useful for training a model, and then converting log files and raw data from other sources into those features. Feature engineering is sometimes called feature extraction.

Discrete feature: a feature with a finite number of possible values. For example, a feature whose value can only be "animal" or "vegetable" is discrete, since its categories can be fully enumerated. Contrast with continuous features.

One-hot encoding: a sparse binary vector in which:

• one element is set to 1, and
• all other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values.

Representation: the process of mapping data to useful features.

# Nine, regularization: L2 (simplicity)

## 9.1 Preface

Preventing overfitting by reducing the complexity of the model is a principle called regularization.

## 9.2 Principle

When training a model, the goal is not merely to minimize loss (empirical risk minimization):

minimize(Loss(Data|Model))

Instead, the goal is to minimize loss and complexity together, which is called structural risk minimization:

minimize(Loss(Data|Model) + complexity(Model))

Our training optimization algorithm is now a function of two terms:

• One is the loss term, which is used to measure the fit between the model and the data;
• The other is a regularization term, which is used to measure the complexity of the model.

## 9.3 Complexity

There are two common ways to measure model complexity:

• The model complexity is taken as a function of the weights of all features in the model.
• The model complexity is taken as a function of the total number of features with non-zero weights.

If the model complexity is a function of the weight, the higher the absolute value of the feature weight, the greater the contribution to the model complexity.

## 9.4 L2 regularization

We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:

L2 regularization term = w1² + w2² + ... + wn²

In this formula, weights close to 0 have almost no effect on model complexity, while outlier weights will have a huge impact.

For example, a linear model has the following weights:

{w1 = 0.2, w2 = 0.5, w3 = 5, w4 = 1, w5 = 0.25, w6 = 0.75}

Combined with the formula, the L2 regularization term is calculated as 26.915:

w1² + w2² + w3² + w4² + w5² + w6² = 0.04 + 0.25 + 25 + 1 + 0.0625 + 0.5625 = 26.915

The squared value of w3 is 25, contributing almost all of the complexity. The sum of the squares of the other five weights is only 1.915, a small contribution to the regularization term. This is why, for the L2 regularization term, weights close to 0 have almost no effect on model complexity, while outlier weights have a huge impact.
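The calculation can be checked directly. Assuming, for illustration, six weights {0.2, 0.5, 5, 1, 0.25, 0.75} whose squares sum to 26.915:

```python
weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]

# L2 regularization term: the sum of the squares of all feature weights.
l2_term = sum(w ** 2 for w in weights)
print(round(l2_term, 3))  # 26.915

# The single outlier weight of 5 contributes 25 of the total;
# the other five weights together contribute only 1.915.
print(5.0 ** 2)                    # 25.0
print(round(l2_term - 5.0 ** 2, 3))  # 1.915
```

Dropping the outlier weight would shrink the term from 26.915 to 1.915, showing how strongly outlier weights dominate L2 complexity.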

## 9.5 Simplified regularization: Lambda

Model developers adjust the overall impact of the regularization term by multiplying its value by a scalar called lambda (also known as the regularization rate). That is, model developers aim to:

minimize(Loss(Data|Model) + lambda * complexity(Model))

where lambda refers to the regularization rate. Performing L2 regularization has the following effects on a model:

• It drives weight values toward 0 (but not exactly to 0).
• It drives the mean of the weights toward 0, with a normal (Gaussian, bell-shaped) distribution.

Increasing the lambda value strengthens the regularization effect. For example, the weight histogram for a high lambda value might look like the following figure:

Lowering the lambda value tends to yield a flatter histogram:

When choosing a lambda value, the goal is to strike the right balance between simplicity and fitting the training data:

• If the lambda value is too high, the model will be very simple and risks underfitting the data: it will not learn enough from the training data to make useful predictions.
• If the lambda value is too low, the model will be more complex and risks overfitting the data: it will learn the peculiarities of the training data too well and fail to generalize to new data.
Note: setting lambda to 0 removes regularization completely. In that case, training focuses solely on minimizing loss, which maximizes the risk of overfitting.

A model trained with the ideal lambda value generalizes well to new, previously unseen data. Unfortunately, the ideal lambda value depends on the data, so it requires manual or automatic tuning.
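The shrinking effect of lambda can be seen in a tiny one-weight sketch (the quadratic loss and all numbers below are purely illustrative, not from the course):

```python
# Minimize loss = (w - 3)^2 + lam * w^2 by gradient descent.
# The data-fitting term alone is minimized at w = 3; the L2 penalty
# pulls the solution toward 0 as the regularization rate lam grows.

def fit(lam, lr=0.1, steps=200):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3) + 2 * lam * w  # d(loss)/dw
        w -= lr * grad
    return w

for lam in (0.0, 0.5, 5.0):
    print(lam, round(fit(lam), 3))  # fitted w: 3.0, then 2.0, then 0.5
```

Analytically the minimizer is w = 3 / (1 + lam), so larger lambda values shrink the weight toward 0 without ever reaching it, exactly as described above.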

## 9.6 Regularization and learning rate

There is a close relationship between learning rate and lambda.

Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect. Consequently, adjusting the learning rate and lambda at the same time can have confounding effects.

Early stopping means ending training before the model fully converges. In practice, when training in an online (continuous) fashion, we often end up with some amount of implicit early stopping.

As mentioned above, the effect of changing the regularization parameter can be confounded with the effects of changing the learning rate or the number of iterations. One useful practice is to train on a fixed batch of data for enough iterations that early stopping does not play a role.

# 10. Logistic regression, linear regression

## 10.1 Preface

Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for computing probabilities.

## 10.2 Ways to calculate and return probabilities

• "As is"
• Converted to a binary category

Using the probability "as is": suppose we create a logistic regression model to predict the probability that a dog barks in the middle of the night. We call this probability: p(bark | night)

## 10.3 S形函数

S型函数会产生以下曲线：

• z是

；w是该模型学习的权重，b是偏差。x是指特征样本的特征值。

## 10.4 Logistic regression inference calculation

Suppose a logistic regression model has the following bias and learned weights:

• b = 1
• w1 = 2
• w2 = -1
• w3 = 5

and a given sample has the following feature values:

• x1 = 0
• x2 = 10
• x3 = 2

Then z = 1 + (2)(0) + (-1)(10) + (5)(2) = 1, so the predicted probability is y = 1 / (1 + e^(-1)) ≈ 0.731.
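The inference with these bias, weight, and feature values can be sketched in code, assuming the standard logistic regression computation (z is the weighted sum plus bias; the sigmoid squashes z into a probability):

```python
import math

def sigmoid(z):
    """Squash the log-odds z into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

b = 1.0
w = [2.0, -1.0, 5.0]   # w1, w2, w3
x = [0.0, 10.0, 2.0]   # x1, x2, x3

# z = b + w1*x1 + w2*x2 + w3*x3 = 1 + 0 - 10 + 10 = 1
z = b + sum(wi * xi for wi, xi in zip(w, x))
print(z)                      # 1.0
print(round(sigmoid(z), 3))   # 0.731
```

So this model predicts about a 73.1% probability for the positive class on this sample.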

## 10.5 Loss function of logistic regression

The loss function of logistic regression is Log Loss:

Log Loss = Σ over (x, y) in D of: -y·log(y') - (1 - y)·log(1 - y')

where:

• (x, y) ∈ D is the data set of labeled samples;
• y is the label of a labeled sample; since this is logistic regression, every value of y must be 0 or 1;
• y' is the predicted value (somewhere between 0 and 1), given the features in x.
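A minimal sketch of computing this loss over a few hypothetical (prediction, label) pairs, assuming the standard log loss formula -y·log(y') - (1 - y)·log(1 - y') summed over samples:

```python
import math

def log_loss(examples):
    """Sum of -y*log(y') - (1-y)*log(1-y') over (y', y) pairs."""
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for p, y in examples)

# Hypothetical predictions y' paired with true labels y (0 or 1).
examples = [(0.9, 1), (0.2, 0), (0.7, 1)]
print(round(log_loss(examples), 4))  # 0.6852
```

Note that confident wrong predictions are punished severely: as y' approaches the wrong extreme, log(y') or log(1 - y') diverges, so the loss grows without bound.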

## 10.6 Regularization in logistic regression

One useful regularization strategy for logistic regression:

• Early stopping, i.e., limiting the number of training steps or the learning rate.

## 10.7 Logistic regression summary

## 10.8 Linear regression

Using the relationship between cricket chirps and temperature as an example, a straight line can be written as y = mx + b, where:

• y is the temperature, the value we are trying to predict;
• m is the slope of the line;
• x is the number of chirps per minute, the value of the input feature;
• b is the y-intercept.

By machine learning convention, the model equation is written slightly differently: y' = b + w1x1, where:

• b is the bias (corresponding to the y-intercept), which some machine learning documents instead call w0;
• w1 is the weight of feature 1 (corresponding to the slope m);
• x1 is a feature (a known input).

The subscripts (e.g., w1, x1) indicate that more complex models can be represented with multiple features. For example, a model with three features can use the following equation: y' = b + w1x1 + w2x2 + w3x3
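A model with three features, y' = b + w1x1 + w2x2 + w3x3, can be sketched as code (the bias, weights, and feature values below are hypothetical):

```python
def predict(b, weights, features):
    """y' = b + w1*x1 + w2*x2 + ... for any number of features."""
    return b + sum(w * x for w, x in zip(weights, features))

# Hypothetical learned parameters and one input sample.
b = 0.5
weights = [1.5, -0.2, 3.0]
features = [2.0, 10.0, 1.0]
print(predict(b, weights, features))  # 0.5 + 3.0 - 2.0 + 3.0 = 4.5
```

With an empty feature list, the prediction reduces to the bias alone, matching the y-intercept interpretation of b.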

# Eleven, regularization: L1 (sparsity)

## 11.2 Regularization for sparsity

L1 regularization: a type of regularization that penalizes weights in proportion to the sum of their absolute values. In models that rely on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, removing those features from the model. Contrast with L2 regularization.

## 11.4 Keywords

L1 regularization: a type of regularization that penalizes weights in proportion to the sum of their absolute values. In models that rely on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, removing those features from the model. Contrast with L2 regularization.

L2 regularization: a type of regularization that penalizes weights in proportion to the sum of their squares. L2 regularization helps drive outlier weights (those with large positive or large negative values) close to 0, but not exactly to 0. In linear models, L2 regularization always improves generalization.

# Twelve, neural networks

## 12.1 Preface

"Nonlinear" means that the label cannot be accurately predicted with a model of the form b + w1x1 + w2x2.

## 12.6 Activation functions

1. The sigmoid activation function converts the weighted sum into a value between 0 and 1.

2. The rectified linear unit activation function (ReLU for short) often works a little better than a smooth function such as the sigmoid, and it is also much easier to compute.

The superiority of ReLU is based on empirical findings: ReLU has a more useful range of responsiveness, whereas the sigmoid's responsiveness falls off relatively quickly on both ends. The ReLU activation function is: F(x) = max(0, x)

## 12.7 Summary

Our model now has all the standard components of what is usually meant by a "neural network":

• A set of nodes, analogous to neurons, organized into layers.
• A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer or some other kind of layer.
• A set of biases, one per node.
• An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
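The components listed above can be combined into a minimal forward-pass sketch (layer sizes, weights, biases, and inputs are all hypothetical):

```python
import math

def relu(z):
    """Rectified linear unit: F(z) = max(0, z)."""
    return max(0.0, z)

def sigmoid(z):
    """Squash z into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases, activation):
    """One layer: for each node, activation(weighted sum of inputs + bias)."""
    return [activation(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

x = [1.0, 2.0]                                  # input features
hidden = layer(x, [[0.5, -1.0], [1.0, 1.0]],    # 2 hidden nodes, ReLU
               [0.0, -1.0], relu)
output = layer(hidden, [[1.0, 0.5]], [0.0], sigmoid)  # 1 output node
print([round(h, 3) for h in hidden], round(output[0], 3))  # [0.0, 2.0] 0.731
```

Each layer has its own weights, biases, and activation function; stacking a second hidden layer would just mean calling `layer` again on `hidden` before the output layer.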