Machine learning: main terms (integrated version)


This article summarizes and explains machine learning terminology, with reference to Google's official website.

What is machine learning? In simple terms, machine learning systems learn how to combine input information to make useful predictions on unseen data.

Table of contents

Main terms (basic)

Label
Feature
Sample
Model
Regression model
Classification model
Training
Generalization
Overfitting
Prediction
Stationarity
Training set
Validation set
Test set

Main terms (advanced version 1)

Category
Classification model
Regression model
Convergence
Accuracy
Precision
Recall
Convex set
Convex function
Convex optimization
Activation function
Backpropagation algorithm
Batch
Batch size

Main terms (completed)

Bias
Inference
Linear regression
Weight
Empirical risk minimization (ERM)
Mean squared error (MSE)
Squared loss function
Loss
Gradient descent
Stochastic gradient descent (SGD)
Batch gradient descent (BGD)
Parameter
Hyperparameter
Learning rate
Feature engineering
Discrete feature
One-hot encoding
Representation
Synthetic feature
Feature cross
L1 regularization
L2 regularization
Main terms (basic)

It mainly includes labels, features, samples, training, models, regression models, classification models, generalization, overfitting, prediction, stationarity, training set, validation set, and test set.


Label (label)

The label is the thing we want to predict: the category in a classification task, such as cat or dog, or the y variable in simple linear regression. The label can be the future price of wheat, the animal species shown in a picture, the meaning of an audio clip, or just about anything.

In supervised learning, the label value is the "answer" or "result" part of the sample.


Feature (feature)

The input variables used when making predictions.

The feature is the input variable, that is, the x variable in simple linear regression; the input image feature in the classification task.

A simple machine learning project may use a single feature, while a more complex one may use millions of features, written as:

\large x_{1}, x_{2}, \ldots, x_{N}

In the spam detector example, the features might include:

  • Words in the email text
  • Sender's address
  • Time of day the email was sent
  • Whether the email contains "some sensitive words"


Sample (example)

A row of the data set. In supervised learning, a sample has both features and a label. In unsupervised learning, a sample has only features.

A sample is a particular instance of data, x (where x is a vector). Samples fall into the following two categories:

  • Labeled sample
  • Unlabeled sample

A labeled sample contains both features and a label, namely:

labeled examples: {features, label}: (x, y)

We use labeled samples to train the model; in the spam detector example, labeled samples are individual emails that the user has clearly marked as "spam" or "not spam".

For example, the following table shows five labeled samples drawn from a data set containing information about California housing prices:


Unlabeled samples contain features but no labels, namely:

unlabeled examples: {features, ?}: (x, ?)

The following are 3 unlabeled samples taken from the same housing data set, which do not contain medianHouseValue:


After training the model with labeled samples, we will use the model to predict the labels of unlabeled samples. In the spam detector example, the unlabeled sample is a new email that the user has not added a label to.


Model (model)

A model defines the relationship between features and labels. For example, a spam detection model might associate certain features strongly with "spam". The model life cycle has two phases:

  • Training means creating or learning the model. That is, you show the model labeled samples and let it gradually learn the relationship between features and labels.
  • Inference means applying the trained model to unlabeled samples. That is, you use the trained model to make useful predictions (y'). For example, during inference you can predict medianHouseValue for new unlabeled samples.

Regression model

A model capable of outputting continuous values (usually floating point values).

The regression model can predict continuous values. For example, the prediction made by the regression model can answer the following questions:

  • What is the value of a property in xxx?
  • What is the probability that a user clicks on this ad?

Classification model

Used to distinguish two or more discrete categories.

The classification model can predict discrete values. For example, the predictions made by the classification model can answer the following questions:

  • Is a specified email spam or not spam?
  • Is this an image of a dog or a cat?


Training (training)

The process of determining the ideal parameters of a model. Training a good model mainly means learning the parameters of the model, namely the weights (w) and the bias (b).

Generalization (generalization)

The model's ability to make correct predictions on new, previously unseen data, as opposed to the data used during training.

Overfitting (overfitting)

The model matches the training data so closely that it fails to make correct predictions on new data.

Prediction (prediction)

The output of the model after receiving the data sample.

Stationarity (stationarity)

A property of the data in a data set: the distribution of the data stays constant across one or more dimensions. The most common such dimension is time, in which case data that exhibits stationarity does not change over time.

Training set (training set)

A subset of the data set used to train the model. Contrast with validation set and test set.

Validation set (validation set)

A subset of the data set, separated from the training set, used to adjust hyperparameters. Contrast with training set and test set.

Test set (test set)

A subset of the data set used to test the model after the model has been initially verified by the validation set. Contrast with training set and validation set.
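The three subsets described above can be sketched as follows. This is a minimal illustration rather than a prescribed procedure: the function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are all assumptions made for the example.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the samples, then cut them into training, validation and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


# Hypothetical data set of 100 samples, represented here by their indices.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

The subsets do not overlap, so the test set stays unseen while hyperparameters are adjusted on the validation set.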

Main terms (advanced version 1)

Mainly include category, classification model, regression model, convergence, accuracy, precision, recall, convex set, convex function, convex optimization, activation function, back propagation algorithm, batch, batch size.


Category (class)

A category is one of a set of enumerated target values for a label. For example, in binary classification of cats and dogs, the label has two possible values: "cat" is one category, and "dog" is another.

Classification model

Used to distinguish two or more discrete categories.

For example, when recognizing cats and dogs, the model needs to decide whether the input image is a "cat" or a "dog". This is a typical binary classification model.

In language classification, the model needs to decide whether the input is Chinese, English, French, Russian, or another language. This is a multi-class classification model.

Regression model

Used to predict the output of continuous values, such as floating point values.

For example, in house price prediction, you input data related to the house price, such as the sale date, sale price, number of bedrooms, number of bathrooms, house area, parking area, house score, and building area, and the model predicts the price of the house, for example an output of 567,800 yuan.


Convergence (convergence)

A state reached during training in which the model becomes stable: after a certain number of iterations, the training loss and validation loss change very little or not at all from one iteration to the next.


Accuracy (accuracy)

A metric usually used for classification models: the proportion of predictions that the model gets right. In multi-class classification, it is defined as:

\large acc = \frac{n}{sum}

acc is the accuracy; n is the number of correctly classified samples; sum is the total number of samples.

For example: there are a total of 100 data samples, the model predicts 98 correctly, and 2 predictions are wrong, then the accuracy of the model is: acc = 98/100 = 0.98, that is: 98%
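The same calculation can be sketched in code. The function name `accuracy` and the cat/dog data below are hypothetical, constructed only to reproduce the 98-out-of-100 case.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true label (acc = n / sum)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    n_correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return n_correct / len(labels)


# Hypothetical data: 100 samples, of which 98 are predicted correctly.
labels = ["cat"] * 50 + ["dog"] * 50
predictions = ["cat"] * 49 + ["dog"] * 50 + ["cat"]  # samples 49 and 99 are wrong
print(accuracy(predictions, labels))  # 0.98
```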


Precision (precision)

A metric for classification models: how often the model is correct when it predicts the positive category, namely:

\large pre = \frac{TP}{TP + FP}

pre is the precision; TP (true positive) is a sample that is actually positive and is predicted positive; FP (false positive) is a sample that is actually negative but is predicted positive.

Precision concerns the positive category: of all the samples predicted to be positive (true positives + false positives), how many are predicted correctly.

Accuracy concerns the data as a whole, covering both positive and negative categories (true positives + true negatives + false positives + false negatives): how much of the overall data is predicted correctly.

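Recall (recall)

A metric for classification models: of all the samples whose label is actually positive, the proportion that the model correctly identifies, namely:

\large recall = \frac{TP}{TP + FN}

TP (true positive) is a sample that is actually positive and is predicted positive; FN (false negative) is a sample that is actually positive but is predicted negative.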
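Precision and recall can both be computed from the counts of true positives, false positives, and false negatives. The sketch below assumes the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The helper names and the spam/ham data are hypothetical.

```python
def confusion_counts(predictions, labels, positive):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for p, y in zip(predictions, labels):
        if p == positive and y == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif y == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Of everything predicted positive, the share that really is positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Of everything that really is positive, the share the model found."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical spam detector output on 10 emails ("spam" is the positive category).
labels = ["spam"] * 4 + ["ham"] * 6
predictions = ["spam", "spam", "spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]
tp, fp, fn, tn = confusion_counts(predictions, labels, positive="spam")
print(tp, fp, fn, tn)     # 3 2 1 4
print(precision(tp, fp))  # 0.6
print(recall(tp, fn))     # 0.75
```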

Convex set

A subset of Euclidean space in which the line segment connecting any two points of the subset lies entirely within the subset.

For example, the following two images are both convex sets:

On the contrary, the following two graphs are not convex sets:

Convex function

A function whose epigraph (the region above its graph) is a convex set. A typical convex function is shaped like the letter U. The following are several convex functions:

On the contrary, the following functions are not convex. Note that the region above the graph is not a convex set:

A strictly convex function has at most one local minimum, and that point is also the global minimum.

Many common functions are convex, including:

  • L2 loss function
  • Log loss function
  • L1 regularization
  • L2 regularization

Many variants of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function.

Many variants of stochastic gradient descent are highly likely (though not guaranteed) to find a point close to the minimum of a strictly convex function.

The sum of two convex functions is also convex, such as L2 loss function + L1 regularization.
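One way to probe this numerically is the chord inequality f(t·a + (1 − t)·b) ≤ t·f(a) + (1 − t)·f(b), which holds for every convex f and is equivalent to the epigraph being a convex set. The sketch below samples random points and reports whether it finds a violation. A random search can only refute convexity, not prove it. The one-dimensional loss, the target value 3.0, and the penalty strength 0.5 are assumptions chosen for illustration.

```python
import math
import random


def l2_loss(w):
    return (w - 3.0) ** 2  # squared distance to a hypothetical target of 3.0


def l1_penalty(w):
    return abs(w)


def objective(w):
    return l2_loss(w) + 0.5 * l1_penalty(w)  # sum of two convex functions


def violates_convexity(f, trials=10_000, seed=0, tol=1e-9):
    """Return True if some sampled chord lies below the function (f is not convex)."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(-10, 10), rng.uniform(-10, 10)
        t = rng.random()
        if f(t * a + (1 - t) * b) > t * f(a) + (1 - t) * f(b) + tol:
            return True
    return False


print(violates_convexity(objective))  # False: no counterexample found
print(violates_convexity(math.sin))   # True: sin is not convex on [-10, 10]
```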

Deep models are never convex functions. Nevertheless, algorithms designed for convex optimization often find reasonably good solutions on deep networks, even though these solutions do not necessarily correspond to the global minimum.

Convex optimization

The process of using mathematical methods to find the minimum value of a convex function.

A lot of research in machine learning is focused on how to express various problems as convex optimization problems through formulas, and how to efficiently solve these problems.

Activation function

In essence, an activation function is a function that maps an input value to another value. The mapping can be linear or nonlinear.

For example, for a linear mapping, suppose the activation function is

\large f(x) = 2x

that is, y = 2x, where x is the input value and y is the output value after mapping. When the input value is 3, the output value after the activation function is 6.

For a nonlinear mapping, suppose the activation function is

\large f(x) = \frac{1}{1 + e^{-x}}

where x is the input value and y is the output value after mapping. When the input value is 0, the output value after the activation function is 0.5.

This nonlinear mapping is in fact the widely used sigmoid function, whose graph is an S-shaped curve.
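The two mappings discussed in this section, f(x) = 2x and the sigmoid, can be written directly as functions. The function names are chosen here for illustration.

```python
import math


def linear_activation(x):
    """Linear mapping f(x) = 2x."""
    return 2 * x


def sigmoid(x):
    """Nonlinear mapping f(x) = 1 / (1 + e^(-x)); the output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


print(linear_activation(3))  # 6
print(sigmoid(0))            # 0.5
```

Note that `math.exp(-x)` overflows for very large negative x, so a production implementation would need to handle that case.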

Backpropagation algorithm (backpropagation)

The algorithm first computes (and caches) the output value of each node in a forward pass, and then computes the partial derivative of the loss with respect to each parameter in a backward pass through the graph.
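A minimal sketch of the two passes, assuming the smallest possible "network": one neuron with a sigmoid activation and a squared loss. The function names, the cached values, and the numbers are hypothetical. The finite-difference comparison at the end is only a sanity check, not part of backpropagation itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x, y):
    """Forward pass: compute the loss and cache the values the backward pass needs."""
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2
    return loss, {"x": x, "a": a, "y": y}


def backward(cache):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    x, a, y = cache["x"], cache["a"], cache["y"]
    dloss_da = 2.0 * (a - y)        # derivative of the squared loss
    da_dz = a * (1.0 - a)           # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz   # dLoss/dw, dLoss/db


w, b, x, y = 0.5, -0.2, 1.5, 1.0    # hypothetical parameter values and one sample
loss, cache = forward(w, b, x, y)
grad_w, grad_b = backward(cache)

# Sanity check against a central finite difference.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, y)[0] - forward(w - eps, b, x, y)[0]) / (2 * eps)
print(abs(grad_w - numeric_w) < 1e-6)
```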


Batch (batch)

The set of samples used in one iteration of model training (that is, one gradient update).

Batch size

The number of samples in a batch. For example, in stochastic gradient descent (SGD) the batch size is 1; in (full-batch) gradient descent the batch size is the entire training set.

In batch gradient descent (BGD), the batch size can be chosen freely, and usually lies between 10 and 1000. For example, if the training set has 40,000 samples and the batch size is set to 32, each training iteration uses 32 samples.
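To make the arithmetic concrete, the sketch below cuts 40,000 sample indices into batches of 32 and counts how many iterations one full pass over the training set takes. The helper name `iter_batches` is hypothetical.

```python
def iter_batches(num_samples, batch_size):
    """Yield (start, end) index ranges, one per batch; the last batch may be smaller."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


num_samples, batch_size = 40_000, 32
batches = list(iter_batches(num_samples, batch_size))
print(len(batches))  # 1250 iterations per pass over the training set
print(batches[0])    # (0, 32)
```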

Main terms (completed)

Main terms, including bias, inference, linear regression, weight, empirical risk minimization, mean squared error, squared loss function, loss, gradient descent, stochastic gradient descent, batch gradient descent, parameters, hyperparameters, learning rate, feature engineering, discrete features, one-hot encoding, representation, synthetic features, feature crosses, L1 regularization, and L2 regularization.

Bias (bias)

The intercept or offset from the origin. The bias (also called the bias term) is represented by b or w_{0} in machine learning models. For example, in the following formula, the bias is b:

y^{'} = b + w_{1} x_{1} + w_{2} x_{2} + \ldots + w_{n} x_{n}

Inference (inference)

In machine learning, inference usually refers to the process of making predictions by applying a trained model to unlabeled samples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the article on statistical inference in Wikipedia.)

Linear regression (linear regression)

A regression model that outputs continuous values by linearly combining input features.
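The linear combination y' = b + w1·x1 + … + wn·xn can be sketched as follows. The function name `predict` and the weight and feature values are hypothetical.

```python
def predict(features, weights, bias):
    """Linear regression prediction: bias plus the weighted sum of the features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


weights = [2.0, -1.0, 0.0]  # the third weight is 0, so that feature is ignored
bias = 0.5
print(predict([1.0, 3.0, 100.0], weights, bias))  # -0.5
```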

Weight (weight)

The coefficients of the features in the model, or the edges in the deep network. The goal of training the model is to determine the ideal weight for each feature. If the weight is 0, the corresponding feature has no effect on the model.

Empirical risk minimization (ERM, empirical risk minimization)

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

Mean squared error (MSE, Mean Squared Error)

The average squared loss per sample. MSE is calculated by dividing the total squared loss by the number of samples.

Squared loss function (squared loss)

The loss function used in linear regression (also called the L2 loss function). It computes the square of the difference between the model's predicted value for a labeled sample and the true value of the label. Because of the squaring, this loss function amplifies the effect of bad predictions. Compared with the L1 loss function, the squared loss function reacts more strongly to outliers.
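The sketch below computes the squared loss, the mean squared error, and, for comparison, the mean L1 loss on a small hypothetical data set with one outlier. All names and numbers are illustrative.

```python
def squared_loss(prediction, label):
    """L2 loss for one sample: the square of the prediction error."""
    return (prediction - label) ** 2


def l1_loss(prediction, label):
    """L1 loss for one sample: the absolute prediction error."""
    return abs(prediction - label)


def mean_squared_error(predictions, labels):
    """Total squared loss divided by the number of samples."""
    total = sum(squared_loss(p, y) for p, y in zip(predictions, labels))
    return total / len(labels)


def mean_l1_error(predictions, labels):
    return sum(l1_loss(p, y) for p, y in zip(predictions, labels)) / len(labels)


labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 2.0, 3.0, 14.0]  # the last prediction is off by 10 (an outlier)
print(mean_squared_error(predictions, labels))  # 25.0
print(mean_l1_error(predictions, labels))       # 2.5
```

The single outlier contributes 100 to the total squared loss but only 10 to the total L1 loss, which is why the squared loss reacts more strongly to it.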

Loss (Loss)

A measure of how far a model's predictions deviate from its labels. To determine this value, the model must define a loss function. For example, linear regression models typically use the mean squared error (MSE) loss function, and classification models use the cross-entropy loss function.

Gradient descent

A technique that minimizes loss by computing gradients. It computes the gradient of the loss with respect to the model parameters, conditioned on the training data. Gradient descent adjusts the parameters iteratively, gradually finding the best combination of weights and bias, and thereby minimizing the loss.
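A minimal sketch of this iterative adjustment, assuming a one-feature linear model y' = w·x + b trained with the MSE loss on the full data set at every step. The data, the learning rate of 0.05, and the number of steps are assumptions chosen so that this small example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y' = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dMSE/dw
        grad_b = 2.0 / n * sum(errors)                             # dMSE/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

On this data the learned weight and bias approach the generating values 2 and 1.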

Stochastic gradient descent (SGD)

Gradient descent is time-consuming and inefficient on large data sets. If we can get the right gradient on average with much less computation, so much the better. By choosing samples at random from the data set, we can estimate (albeit noisily) the average over the whole data set from a much smaller number of samples.

Principle: SGD uses only one sample per iteration (a batch size of 1).

Given enough iterations, SGD works, but the process is very noisy. The term "stochastic" indicates that the one sample that makes up each batch is chosen at random.

Batch Gradient Descent (BGD)

Despite the name used here, this method is what is more commonly called mini-batch gradient descent. It is a compromise between full-batch iteration (gradient descent) and choosing a single random sample per iteration (stochastic gradient descent).

Principle: it randomly selects part of the samples from the data set to form a small batch, and performs one iteration on that batch. A small batch usually contains 10 to 1000 randomly chosen samples. BGD reduces the noise of SGD while still being more efficient than full-batch iteration.
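The three variants differ only in how many randomly chosen samples feed each update. The sketch below assumes a one-feature linear model with the MSE loss. A batch size of 1 corresponds to SGD, a batch size equal to the data set size corresponds to full-batch gradient descent, and anything in between is a small batch. The function name, data, learning rate, and epoch count are assumptions for illustration.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, learning_rate=0.02,
                               epochs=2000, seed=0):
    """Fit y' = w * x + b, updating on one randomly drawn batch at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches on every pass over the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [w * xs[i] + b - ys[i] for i in batch]
            grad_w = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.5 * i for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]
for batch_size in (1, 4, len(xs)):  # SGD, small batch, full batch
    w, b = minibatch_gradient_descent(xs, ys, batch_size)
    print(batch_size, round(w, 3), round(b, 3))
```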

Parameters (parameter)

Model variables trained by the machine learning system, for example the weights. Their values are gradually learned by the machine learning system through successive training iterations. Contrast with hyperparameters.

Hyperparameter (hyperparameter)

A setting that you specify and tune manually across successive training runs, for example the learning rate. Contrast with parameters.

Learning rate (learning rate)

A scalar used for gradient descent when training the model. During each iteration, the gradient descent method multiplies the learning rate by the gradient; the resulting product is called the gradient step size.
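A small sketch of the gradient step, assuming the one-dimensional function f(w) = w², whose gradient is 2w. The two learning rates are hypothetical values picked to contrast a step that shrinks w toward the minimum with one that overshoots it.

```python
def gradient_step(parameter, gradient, learning_rate):
    """Move the parameter against the gradient by learning_rate * gradient."""
    step = learning_rate * gradient  # the gradient step
    return parameter - step


def minimize_square(learning_rate, start=1.0, iterations=10):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(iterations):
        w = gradient_step(w, 2.0 * w, learning_rate)
    return w


print(minimize_square(0.1))  # moves toward the minimum at w = 0
print(minimize_square(1.5))  # 1024.0: the steps are too large and w diverges
```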

Feature engineering (feature engineering)

Refers to determining which features may be useful in training the model, and then converting the original data from log files and other sources into the required features. Feature engineering is sometimes called feature extraction.

Discrete feature (discrete feature)

A feature with a finite set of possible values. For example, a feature whose value can only be "animal" or "vegetable" is a discrete feature, because all of its categories can be enumerated. Contrast with continuous features.

One-hot encoding (one-hot encoding)

A sparse binary vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a limited set of possible values.
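A minimal sketch of the encoding, assuming a fixed vocabulary of possible values. The function name `one_hot` and the three-value vocabulary are hypothetical.

```python
def one_hot(value, vocabulary):
    """Return a binary vector with a single 1 at the position of value."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


vocabulary = ["animal", "vegetable", "mineral"]  # hypothetical set of possible values
print(one_hot("vegetable", vocabulary))  # [0, 1, 0]
```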

Representation (representation)

The process of mapping data to useful features.

Synthetic feature (synthetic feature)

A feature that is not among the input features but is derived from one or more of them. Synthetic features include the following types:

  • Bucketing a continuous feature into range bins.
  • Multiplying (or dividing) one feature value by other feature values or by itself.
  • Creating a feature cross.

Features created only through normalization or scaling are not considered synthetic features.
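The first type, bucketing, can be sketched with the standard library. The function name `bucketize` and the bucket boundaries are hypothetical. The sketch makes one design choice explicit: a value equal to a boundary falls into the bucket above it.

```python
import bisect


def bucketize(value, boundaries):
    """Map a continuous value to the index of the bucket it falls into.

    boundaries must be sorted. Bucket 0 holds values below boundaries[0],
    and the last bucket holds values at or above boundaries[-1].
    """
    return bisect.bisect_right(boundaries, value)


boundaries = [10.0, 20.0, 30.0]  # hypothetical cut points, giving 4 buckets
print([bucketize(v, boundaries) for v in (5.0, 10.0, 25.0, 99.0)])  # [0, 1, 2, 3]
```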

Feature cross

A synthetic feature formed by crossing individual features (taking their Cartesian product). Feature crosses help represent nonlinear relationships.
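For two one-hot encoded features, the Cartesian product can be sketched as the list of all pairwise products, which is itself a one-hot vector with one slot per combination of categories. The function name and the example vectors are hypothetical.

```python
def feature_cross(one_hot_a, one_hot_b):
    """Cross two one-hot vectors: one output element per (a, b) combination."""
    return [a * b for a in one_hot_a for b in one_hot_b]


feature_a = [0, 1]     # hypothetical feature with 2 categories, second one active
feature_b = [0, 0, 1]  # hypothetical feature with 3 categories, third one active
print(feature_cross(feature_a, feature_b))  # [0, 0, 0, 0, 0, 1]
```

Because each combination gets its own element, a linear model can learn a separate weight for each combination, which a weighted sum of the two original features alone cannot express.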

L1 regularization

A type of regularization that penalizes weights based on the sum of their absolute values. In the model with sparse features, L1 regularization helps to make the weights of irrelevant or almost irrelevant features exactly 0, thereby removing these features from the model. Contrast with L2 regularization.

L2 regularization (L2 regularization)

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with large positive or large negative values) close to 0, but not exactly to 0. In linear models, L2 regularization always improves generalization.
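The two penalties, and the different ways they shrink a small weight, can be sketched as follows. The penalty functions follow the definitions in this article. The two update rules are assumptions introduced for illustration: a plain gradient step for the L2 penalty, and a soft-thresholding (proximal) step for the L1 penalty, which is one common way to handle the absolute value and is not taken from the article. All names and numbers are hypothetical.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w * w for w in weights)


def l2_shrink(weight, learning_rate, strength):
    """One gradient step on strength * w**2: scales the weight toward 0."""
    return weight - learning_rate * strength * 2.0 * weight


def l1_shrink(weight, learning_rate, strength):
    """One soft-thresholding step for strength * |w|: small weights become exactly 0."""
    magnitude = max(abs(weight) - learning_rate * strength, 0.0)
    return magnitude if weight >= 0 else -magnitude


weights = [0.5, -0.5, 3.0, 0.0]
print(l1_penalty(weights))  # 4.0
print(l2_penalty(weights))  # 9.5

small_weight = 0.05         # weight of a hypothetical, almost irrelevant feature
print(l1_shrink(small_weight, learning_rate=0.1, strength=1.0))  # 0.0
print(l2_shrink(small_weight, learning_rate=0.1, strength=1.0))  # smaller, but not 0
```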

This article draws on Google's official documentation.

This is the first article in the machine learning (quick start) column. The planned sequence is:

  1. Machine learning 1: Main terms (overview)
  2. Machine learning 2: Linear regression
  3. Machine learning 3: Training and loss
  4. Machine learning 4: Model iteration
  5. Machine learning 5: Learning rate
  6. Machine learning 6: Generalization and overfitting
  7. Machine learning 7: Data set division
  8. Machine learning 8: Feature engineering
  9. Machine learning 9: L2 regularization (simplicity)
  10. Machine learning 10: Logistic regression
  11. Machine learning 11: L1 regularization (sparsity)
  12. Machine learning 12: Neural networks
  13. Machine learning 13: Pitfalls of training models
  14. .........
  15. ..........

The series is essentially finished but not yet published. I plan to publish one or two articles a week.