[Image classification] A beginner's introduction to AlexNet

Table of contents

1. Model introduction

2. Model structure

3. Model characteristics

4. The official PyTorch implementation

5. Keras implementation

1. Model introduction

AlexNet is one of the first deep convolutional neural networks to succeed at large-scale image classification. It was proposed by Alex Krizhevsky and his co-authors. The network won first place in the image classification task of the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) with a top-5 test error rate of 15.3%, far ahead of the traditional machine learning classification algorithms of the time. In the years that followed, more and deeper neural networks were proposed, such as VGG and GoogLeNet.

AlexNet contains several techniques that were relatively new at the time. It successfully applied tricks such as ReLU, Dropout and LRN in a large CNN, and it also used GPUs to accelerate computation.

AlexNet carried forward the ideas of LeNet and applied the basic principles of CNNs to a much deeper and wider network. The main new techniques used by AlexNet are as follows:

(1) ReLU is used as the activation function of the CNN. AlexNet verified that ReLU outperforms Sigmoid in deeper networks and that it largely avoids the vanishing gradient problem that Sigmoid suffers from as networks get deeper. Although the ReLU activation function had been proposed long before, it did not become widely used until AlexNet.

(2) Dropout randomly ignores some neurons during training to avoid overfitting. Although Dropout is described in a separate paper, AlexNet put it into practical use and demonstrated its effect. In AlexNet, Dropout is applied mainly in the last few fully connected layers.

(3) Overlapping max pooling is used in the CNN. Average pooling was common in earlier CNNs; AlexNet uses max pooling throughout to avoid the blurring effect of average pooling. AlexNet also makes the stride smaller than the size of the pooling kernel, so that the windows of neighboring pooling outputs overlap, which enriches the features.

(4) The LRN (Local Response Normalization) layer is proposed. It creates a competition mechanism among the activities of neighboring neurons: values with a large response become relatively larger and neurons with a smaller response are suppressed, which improves the generalization ability of the model.

(5) CUDA is used to accelerate the training of the deep convolutional network, exploiting the parallel computing power of GPUs for the large number of matrix operations involved in training. AlexNet was trained on two GTX 580 GPUs. A single GTX 580 has only 3 GB of video memory, which limits the maximum size of the network that can be trained, so the authors split AlexNet across the two GPUs and store half of the neurons' parameters in the memory of each. Because the GPUs can read from and write to each other's memory directly, without going through host memory, using several GPUs at once is also efficient. In addition, the design of AlexNet lets the GPUs communicate only in certain layers of the network, which limits the performance cost of communication.

(6) Data augmentation. Regions of size 224*224 (and their horizontally flipped mirror images) are randomly cropped from the 256*256 original image, which increases the amount of data by a factor of 2*(256-224)^2=2048. Without data augmentation, relying only on the original amount of data, a CNN with this many parameters would overfit. Data augmentation greatly reduces overfitting and improves generalization. At prediction time, 5 crops are taken (the four corners of the image plus the center) and each is flipped horizontally, giving 10 images in total; predictions are made on all of them and the 10 results are averaged. The AlexNet paper also mentions performing PCA on the RGB data of the images and perturbing the principal components with Gaussian noise of standard deviation 0.1. This trick reduces the error rate by about another 1%.
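The crop arithmetic in item (6) can be checked with a short sketch. The helper names `count_augmented_views` and `ten_crop_boxes` are made up for illustration and do not come from the paper or from any library. The code only enumerates crop positions; it does not touch image data.

```python
def count_augmented_views(image_size=256, crop_size=224):
    """Count crop-plus-flip variants, following the (256 - 224)^2 convention used above."""
    offsets_per_axis = image_size - crop_size   # 32 offsets per axis
    return 2 * offsets_per_axis ** 2            # times 2 for the horizontal flip


def ten_crop_boxes(image_size=256, crop_size=224):
    """Return (left, top, right, bottom, flipped) for the 10 test-time views."""
    far = image_size - crop_size                # top-left offset of a crop touching the far edge
    mid = far // 2                              # top-left offset of the centered crop
    positions = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
    boxes = []
    for flipped in (False, True):
        for left, top in positions:
            boxes.append((left, top, left + crop_size, top + crop_size, flipped))
    return boxes


print(count_augmented_views())   # 2048
print(len(ten_crop_boxes()))     # 10
```

At prediction time the network would be run on each of the 10 boxes and the 10 softmax outputs averaged.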

2. Model structure

First of all, the network in this picture is divided into an upper and a lower part. The paper explains that these two parts correspond to the two GPUs, which only need to interact at specific layers. This arrangement exists purely to make use of two GPUs for more efficient computation; it makes little difference to the network structure itself. To make things easier to understand, we assume that there is only one GPU (or that we compute on the CPU) and analyze the network structure from this slightly simplified point of view. The network has 8 layers in total: 5 convolutional layers and 3 fully connected layers.

The first layer: Convolutional layer 1. The input is a 224 × 224 × 3 image. The number of convolution kernels is 96 (in the paper, each of the two GPUs computes 48 kernels). The size of each convolution kernel is 11 × 11 × 3, stride = 4 (stride is the step length), padding = 2.
What is the size of the image after convolution?
width = floor((224 + 2 * padding - kernel_size) / stride) + 1 = floor((224 + 4 - 11) / 4) + 1 = 55
height = floor((224 + 2 * padding - kernel_size) / stride) + 1 = floor((224 + 4 - 11) / 4) + 1 = 55
depth = 96
The output of the convolution is therefore 96 × 55 × 55. It is followed by LRN (Local Response Normalization) and then by max pooling with pool_size = (3, 3), stride = 2, pad = 0, which gives (55 - 3) / 2 + 1 = 27.
The final output of the first layer is therefore 96 × 27 × 27.
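The size formula above is easy to get wrong because (224 + 4 - 11) / 4 is not an integer; frameworks round the quotient down. A minimal sketch, with `output_size` as an illustrative helper name rather than a library function:

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window, rounding down."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


conv1 = output_size(224, kernel_size=11, stride=4, padding=2)
pool1 = output_size(conv1, kernel_size=3, stride=2)
print(conv1, pool1)  # 55 27
```

The same helper applies to the pooling step, which is why the feature map shrinks from 55 to 27 after the 3 × 3, stride 2 max pooling.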

The second layer: Convolutional layer 2. The input is the feature map output by the first layer. The number of convolution kernels is 256 (in the paper, each of the two GPUs has 128 kernels). The size of each convolution kernel is 5 × 5 × 48, padding = 2, stride = 1. It is followed by LRN and finally by max pooling with pool_size = (3, 3), stride = 2.

The third layer: Convolutional layer 3. The input is the output of the second layer. The number of convolution kernels is 384, kernel_size = (3 × 3 × 256), padding = 1. The third layer has no LRN or pooling.

The fourth layer: Convolutional layer 4. The input is the output of the third layer. The number of convolution kernels is 384, kernel_size = (3 × 3), padding = 1. Like the third layer, it has no LRN or pooling.

The fifth layer: Convolutional layer 5. The input is the output of the fourth layer. The number of convolution kernels is 256, kernel_size = (3 × 3), padding = 1. It is followed directly by max pooling with pool_size = (3, 3), stride = 2.

Layers 6, 7 and 8 are fully connected layers. Layers 6 and 7 have 4096 neurons each, and layer 8 outputs a 1000-way softmax, because the ImageNet competition has 1000 classes. ReLU and Dropout are used in the fully connected layers.
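To see how the five convolutional layers lead into the fully connected layers, the feature map sizes can be traced layer by layer. This is a sketch of the single-GPU simplification described above, not the paper's two-GPU layout. The `LAYERS` table and the helper names are illustrative, and stride = 1 for layers 3 to 5 is an assumption, since the text only states the stride for the first two layers.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window, rounding down."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# (name, output channels or None for pooling, kernel size, stride, padding)
LAYERS = [
    ("conv1", 96, 11, 4, 2),
    ("pool1", None, 3, 2, 0),
    ("conv2", 256, 5, 1, 2),
    ("pool2", None, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),   # stride 1 assumed
    ("conv4", 384, 3, 1, 1),   # stride 1 assumed
    ("conv5", 256, 3, 1, 1),   # stride 1 assumed
    ("pool5", None, 3, 2, 0),
]


def trace_shapes(input_size=224, channels=3):
    """Return (name, channels, height, width) after every layer."""
    shapes = []
    size = input_size
    for name, out_channels, kernel, stride, padding in LAYERS:
        size = output_size(size, kernel, stride, padding)
        if out_channels is not None:   # pooling keeps the channel count
            channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes


shapes = trace_shapes()
for name, c, h, w in shapes:
    print(f"{name}: {c} x {h} x {w}")

_, c, h, w = shapes[-1]
print("flattened features:", c * h * w)  # 9216
```

The 256 × 6 × 6 = 9216 values after the last pooling layer are what the first fully connected layer receives.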

The figure below is a summary of the above parameters.

3. Model characteristics

  • All convolutional layers use ReLU as the non-linear mapping function, which makes the model converge faster.
  • Training the model on multiple GPUs not only speeds up training but also increases the scale of data that can be used.
  • LRN normalizes local features in the first two convolutional layers (in the paper it is applied after the ReLU activation), which effectively reduces the error rate.
  • Overlapping max pooling: the pooling window size z and the stride s satisfy z > s (for example, a 3×3 pooling kernel with stride 2), which avoids the averaging effect of average pooling.
  • Random dropout selectively ignores individual neurons during training to avoid overfitting of the model.
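Of the characteristics listed above, LRN is probably the least self-explanatory, so here is a minimal pure-Python sketch of the normalization at a single spatial position. The defaults n = 5, k = 2, alpha = 1e-4 and beta = 0.75 are the values reported in the AlexNet paper. The function name and the list-based interface are illustrative only; real frameworks operate on whole tensors.

```python
def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize across channels at one (x, y) position.

    activations: list of floats, one value per channel.
    Each value is divided by (k + alpha * sum of squares of its n neighbors) ** beta.
    """
    num_channels = len(activations)
    half = n // 2
    normalized = []
    for i, value in enumerate(activations):
        lo = max(0, i - half)                  # clip the window at the first channel
        hi = min(num_channels - 1, i + half)   # clip the window at the last channel
        sum_sq = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(value / (k + alpha * sum_sq) ** beta)
    return normalized


# The same weak response is suppressed more when a strong neighbor is present.
weak_alone = local_response_norm([1.0, 0.0, 0.0, 0.0, 0.0])[0]
weak_near_strong = local_response_norm([1.0, 100.0, 0.0, 0.0, 0.0])[0]
print(weak_alone > weak_near_strong)  # True
```

This is the competition mechanism in miniature: a large activation in a neighboring channel increases the denominator for every channel inside its window.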

4. The official PyTorch implementation

The official PyTorch (torchvision) implementation does not strictly follow the AlexNet paper. Its first convolution has 64 kernels, whereas the paper uses 96, and it contains no LRN layers.

import torch
import torch.nn as nn


class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        # Five convolutional layers with ReLU and three max pooling layers
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Force a 6 x 6 feature map so the classifier input size is fixed
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        # Three fully connected layers, with Dropout before the first two
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # keep the batch dimension, flatten the rest
        x = self.classifier(x)
        return x
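To get a feeling for where the capacity of this variant sits, the layer sizes from the listing above can be tallied by hand. The sketch below is plain Python and does not import PyTorch. It assumes num_classes = 1000 and a bias term on every layer (the default for nn.Conv2d and nn.Linear); the constant and function names are illustrative.

```python
# (in_channels, out_channels, kernel_size) for each nn.Conv2d in the listing
CONV_LAYERS = [(3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3)]
# (in_features, out_features) for each nn.Linear in the listing
LINEAR_LAYERS = [(256 * 6 * 6, 4096), (4096, 4096), (4096, 1000)]


def conv_params(in_channels, out_channels, kernel_size):
    """Weights of a square convolution plus one bias per output channel."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels


def linear_params(in_features, out_features):
    """Weights of a fully connected layer plus one bias per output."""
    return in_features * out_features + out_features


conv_total = sum(conv_params(*layer) for layer in CONV_LAYERS)
linear_total = sum(linear_params(*layer) for layer in LINEAR_LAYERS)
print(conv_total, linear_total, conv_total + linear_total)
# 2469696 58631144 61100840
```

Roughly 96% of the parameters are in the three fully connected layers, which is consistent with Dropout being applied there.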

5. Keras implementation

import tensorflow as tf
from keras import initializers
from keras.layers import (Activation, Conv2D, Dense, Dropout, Flatten, Input,
                          Lambda, MaxPool2D, ZeroPadding2D)
from keras.models import Model


def AlexNet(input_shape, num_classes):
    inputs = Input(input_shape, name="Input")

    # Block 1: pad 224x224 to 227x227 so the 11x11, stride-4 convolution gives 55x55
    x = ZeroPadding2D(((3, 0), (3, 0)))(inputs)
    x = Conv2D(96,
               (11, 11),
               strides=4,
               kernel_initializer=initializers.RandomNormal(stddev=0.01),
               name="Conv_1")(x)
    x = Lambda(tf.nn.local_response_normalization, name="Lrn_1")(x)
    x = Activation(activation="relu")(x)
    # 3x3 window with stride 2: the overlapping max pooling described above
    x = MaxPool2D((3, 3), strides=2, name="Maxpool_1")(x)

    # Block 2
    x = Conv2D(256,
               (5, 5),
               kernel_initializer=initializers.RandomNormal(stddev=0.01),
               padding="same",
               name="Conv_2")(x)
    x = Lambda(tf.nn.local_response_normalization, name="Lrn_2")(x)
    x = Activation(activation="relu")(x)
    x = MaxPool2D((3, 3), strides=2, name="Maxpool_2")(x)

    # Block 3: three 3x3 convolutions, each with ReLU, then pooling
    x = Conv2D(384,
               (3, 3),
               activation="relu",
               padding="same",
               kernel_initializer=initializers.RandomNormal(stddev=0.01),
               name="Conv_3_1")(x)
    x = Conv2D(384,
               (3, 3),
               activation="relu",
               padding="same",
               kernel_initializer=initializers.RandomNormal(stddev=0.01),
               name="Conv_3_2")(x)
    x = Conv2D(256,
               (3, 3),
               activation="relu",
               padding="same",
               kernel_initializer=initializers.RandomNormal(stddev=0.01),
               name="Conv_3_3")(x)
    x = MaxPool2D((3, 3), strides=2, name="Maxpool_3")(x)

    # Classifier: two 4096-unit layers with Dropout, then the softmax output
    x = Flatten(name="Flt_1")(x)
    x = Dense(4096,
              activation="relu",
              kernel_initializer=initializers.RandomNormal(stddev=0.01),
              name="fc_1")(x)
    x = Dropout(0.5, name="drop_1")(x)
    x = Dense(4096,
              activation="relu",
              kernel_initializer=initializers.RandomNormal(stddev=0.01),
              name="fc_2")(x)
    x = Dropout(0.5, name="drop_2")(x)
    output = Dense(num_classes, activation="softmax", name="Output")(x)

    m = Model(inputs, output, name="AlexNet")
    m.summary()
    return m