Building Image Classification Models for Small Data Sets

Keywords: neural networks, Python, small data sets

Article information

Original article: http://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Author: Francois Chollet

Summary

In this article, we will walk through a few efficient and practical approaches to building image classifiers for small data sets (hundreds to a few thousand pictures).

This article will cover the following methods:

  • Training a small network from scratch (as a baseline)

  • Using the bottleneck features of a pre-trained network

  • Fine-tuning the top layers of a pre-trained network

The Keras modules used in this article are:

  • fit_generator: used to train a network from a Python data generator

  • ImageDataGenerator: used for real-time data augmentation

  • Layer freezing and model fine-tuning

Setup

Our experiments are based on the following setup:

  • A training set of 2,000 pictures split across two classes, with 1,000 pictures per class.

  • Keras, SciPy and PIL installed. An NVIDIA GPU helps, but since we are working with a small data set it is not required.

  • The data is organized on disk as follows:

data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...

This data set comes from the Kaggle Dogs vs. Cats competition. The original data set contains 12,500 cats and 12,500 dogs; we only take the first 1,000 pictures of each class for training. In addition, we take 400 more pictures from each class as validation data.
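The directory layout above can be produced with a short script. Below is a minimal sketch (not part of the original article), assuming the Kaggle archive has been unpacked into a folder named kaggle_train/ containing files named like cat.0.jpg and dog.0.jpg (hypothetical paths; adjust to your setup):

import os
import shutil

src = 'kaggle_train'  # hypothetical location of the unpacked Kaggle training archive

for animal in ('cat', 'dog'):
    # first 1000 images of each class for training, the next 400 for validation
    for split, indices in (('train', range(1000)), ('validation', range(1000, 1400))):
        dst = os.path.join('data', split, animal + 's')
        if not os.path.exists(dst):
            os.makedirs(dst)
        for i in indices:
            name = '%s.%d.jpg' % (animal, i)
            shutil.copyfile(os.path.join(src, name), os.path.join(dst, name))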

Here are some sample pictures from the data set. Having so few images is a real problem for image classification. But the reality is that many real-world images are hard to obtain, and the number of samples we can get is genuinely limited (for example medical images, where each positive sample means a suffering patient :(). Data scientists should be able to extract the full value of a small amount of data, rather than simply reaching for more data.

In Kaggle's Cats vs. Dogs competition, participants achieved 98% accuracy using modern deep learning techniques. We are using only 8% of that data, so the problem is much harder for us.

Deep Learning for Small Data Sets

I often hear the claim that deep learning is only meaningful when you have a large amount of data. While not entirely wrong, this is deeply misleading. It is true that deep learning emphasizes learning features automatically from data, and this is generally only possible when plenty of training samples are available, especially when the input is very high-dimensional, as with images. However, convolutional neural networks, a pillar of deep learning, are by design among the best models for "perceptual" problems (such as image classification), and they can learn good features even from very little data. A convnet trained on a small image data set can still achieve reasonable results without any manual feature engineering. In short, convnets are the right tool for the job.

On the other hand, deep learning models are naturally reusable: you can take, say, an image classification or speech recognition model trained on large-scale data and reuse it on a rather different problem with only minor modifications. In computer vision in particular, many pre-trained models are now publicly available for download and can be reused on other problems to boost performance on small data sets.

Data Preprocessing and Data Augmentation

In order to make the most of our limited training data, we will augment the images via a number of random transformations, so that the model never sees the exact same picture twice. This helps suppress over-fitting and improves the model's ability to generalize.

In Keras, this step can be implemented with the keras.preprocessing.image.ImageDataGenerator class, which allows you to:

  • Configure the random transformations to be applied to your image data during training.

  • Instantiate generators of augmented image batches via the .flow(x, y) or .flow_from_directory(directory) methods. These generators can then be passed to the Keras model methods that accept generators as input: fit_generator, evaluate_generator and predict_generator.

Now let's look at an example:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

Only a few of the available options are shown above (see the relevant section of the documentation for the full list). Let's quickly go over what these options mean:

  • rotation_range is a value in degrees (0-180) giving the range within which pictures are randomly rotated.

  • width_shift_range and height_shift_range specify how far pictures may be randomly shifted horizontally or vertically, as a fraction (between 0 and 1) of the total width or height.

  • rescale is a factor the image is multiplied by before any other processing. Our images consist of RGB values that are integers between 0 and 255; such values would be too high for our model to process (given a typical learning rate), so we rescale them to lie between 0 and 1 by multiplying with 1/255.

  • shear_range controls the strength of random shearing transformations.

  • zoom_range is for randomly zooming inside pictures.

  • horizontal_flip randomly flips half of the pictures horizontally. This is appropriate when horizontal flipping does not change the semantics of the picture (e.g. real-world photos).

  • fill_mode specifies how newly created pixels are filled in when the transformation requires it, e.g. after a rotation or a horizontal/vertical shift.

Next, we use this tool to generate augmented images and save them to a temporary folder, to get a feel for what data augmentation actually does. Rescaling is left out here so that the pictures can be displayed.

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

img = load_img('data/train/cats/cat.0.jpg')  # this is a PIL image
x = img_to_array(img)  # this is a Numpy array with shape (3, 150, 150)
x = x.reshape((1,) + x.shape)  # this is a Numpy array with shape (1, 3, 150, 150)

# the .flow() command below generates batches of randomly transformed images
# and saves the results to the `preview/` directory
# (the preview/ directory must already exist)
i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='preview', save_prefix='cat', save_format='jpeg'):
    i += 1
    if i > 20:
        break  # otherwise the generator would loop indefinitely

Here are some of the augmented pictures generated from a single input image:

Training a Convnet on a Small Data Set: 80% Accuracy in 40 Lines of Code

The right tool for image classification is a convolutional network, so let's try to build a first model with one. Because we have very few samples, we need to be especially careful about over-fitting. Over-fitting happens when a model exposed to too few examples learns patterns that do not generalize to new data, i.e. when the model starts using irrelevant features to make predictions. For example, suppose you have three pictures of woodcutters and three pictures of sailors, and only one of the six people wears a cap. You might conclude that wearing a cap distinguishes woodcutters from sailors, and you would then make a pretty poor classifier.

Data augmentation is one weapon against over-fitting, but it is not enough on its own, because the augmented images are still highly correlated with each other. To counter over-fitting, you should focus on the "entropy capacity" of the model: the amount of information the model is allowed to store. A model that can store a lot of information can use more features to achieve better performance, but it also risks storing irrelevant features. A model that can only store a little information will concentrate on the truly relevant features and generalize better.

There are different ways to adjust the "entropy capacity" of a model. One common choice is the number of parameters, i.e. the number of layers and the size of each layer. Another is weight regularization, such as an L1 or L2 penalty, which biases the model's weights toward smaller values, as shown in the sketch below.
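For instance, here is a minimal sketch (not part of the original experiment) of attaching an L2 penalty to a layer's weights with the Keras 1 API; the layer sizes here are arbitrary:

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2

model = Sequential()
# l2(0.001) adds 0.001 * sum(W ** 2) to the training loss,
# nudging this layer's weights toward small values
model.add(Dense(64, input_dim=100, W_regularizer=l2(0.001), activation='relu'))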

In our model, we use a very small convolutional network with only a few layers and a small number of filters per layer, together with data augmentation and dropout. Dropout also helps against over-fitting, by preventing a layer from seeing the exact same pattern twice; in that sense it acts like a kind of data augmentation. (You could say that both dropout and data augmentation randomly disrupt correlations in the data.)

The code below defines our first model: a simple stack of three convolution layers with ReLU activations, each followed by a max-pooling layer. This is very similar to the image classifiers Yann LeCun advocated in the 1990s (except for the ReLU).

The full code for this experiment can be found here.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

model = Sequential()
model.add(Convolution2D(32, 3, 3, input_shape=(3, 150, 150)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# the model so far outputs 3D feature maps (height, width, features)

On top of this we attach two fully-connected layers, ending the model with a single unit and a sigmoid activation, which is the right choice for binary classification. Accordingly, we use binary_crossentropy as the loss function.

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

Next we prepare our data, using .flow_from_directory() to generate batches of image data (and their labels) directly from our jpgs.

# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1./255)

# this is a generator that will read pictures found in
# subfolders of 'data/train', and indefinitely generate
# batches of augmented image data
train_generator = train_datagen.flow_from_directory(
        'data/train',  # this is the target directory
        target_size=(150, 150),  # all images will be resized to 150x150
        batch_size=32,
        class_mode='binary')  # since we use binary_crossentropy loss, we need binary labels

# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

We can now use these generators to train the network. Each epoch takes 20 to 30 seconds on a GPU and 300 to 400 seconds on a CPU, so running this model on a CPU is perfectly feasible if you are not in a hurry.

model.fit_generator(
        train_generator,
        samples_per_epoch=2000,
        nb_epoch=50,
        validation_data=validation_generator,
        nb_val_samples=800)
model.save_weights('first_try.h5')  # always save your weights after training or during training
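As an aside, the saved weights can later be restored onto a model with the identical architecture, so the trained classifier can be reused without retraining:

# rebuild the exact same architecture as above, then:
model.load_weights('first_try.h5')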

This model reaches an accuracy of 79%~81% after 50 epochs. Don't forget that we are using only 8% of the data, and we have not spent any time optimizing the architecture or hyperparameters. At the time of the Kaggle competition, this result would have placed us in the top 100 (out of 215 teams); the other 115 teams presumably were not using deep learning.

Note that the accuracy may vary quite a bit between runs, because accuracy is a high-variance metric and we only have 800 validation samples. A better validation strategy would be k-fold cross-validation, but it requires training a model for every fold.
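For reference, here is a rough sketch of what k-fold evaluation could look like (not in the original article; it assumes the images have been loaded into numpy arrays X and y, and that a hypothetical build_model() function recreates and compiles the network defined above with metrics=['accuracy']):

import numpy as np
from sklearn.model_selection import KFold

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
    model = build_model()  # a fresh model must be trained for every fold
    model.fit(X[train_idx], y[train_idx], nb_epoch=50, batch_size=32, verbose=0)
    # evaluate returns [loss, accuracy] for a model compiled with metrics=['accuracy']
    scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
print('mean accuracy: %.3f' % np.mean(scores))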

Using the Bottleneck Features of a Pre-trained Network: 90% Accuracy in a Minute

A slightly more refined approach is to leverage a network pre-trained on a large data set. Such a network has already learned features that are useful for most computer vision problems, and building on these features lets us reach higher accuracy.

We will use the VGG16 network, trained on the ImageNet data set, a model we mentioned earlier. Because ImageNet contains several "cat" and "dog" classes among its categories, the model has already learned features relevant to our data set. In fact, merely recording the softmax predictions of the network on our data, rather than the bottleneck features, would probably be enough to solve our problem well. But the method presented here generalizes better to other, similar problems, including classifying categories that are absent from ImageNet.

The structure of the VGG16 network is as follows:

Our strategy is to use only the convolutional part of the network, discarding the fully-connected layers on top. We then run this model once over our training and validation sets and record the output (the "bottleneck features": the last activation maps before the fully-connected layers) in two numpy arrays. Finally, we train a small fully-connected network on the recorded features.

We store these features offline, rather than attaching our fully-connected model directly on top of a frozen convolutional base and training end-to-end, for reasons of computational efficiency: running the VGG network is expensive, especially on a CPU, and we only want to do it once. Note that this also means we cannot use data augmentation.

We won't dwell on how to build the VGG16 network; as mentioned before, you can find it among the Keras examples. Let's look instead at how to record the bottleneck features.

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255)

# 'model' is assumed to be the convolutional part of VGG16 with its
# pre-trained weights loaded (see the Keras examples mentioned above)
generator = datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=32,
        class_mode=None,  # this means our generator will only yield batches of data, no labels
        shuffle=False)  # our data will be in order, so all first 1000 images will be cats, then 1000 dogs
# the predict_generator method returns the output of a model, given
# a generator that yields batches of numpy data
bottleneck_features_train = model.predict_generator(generator, 2000)
# save the output as a Numpy array (note the binary 'wb' file mode)
np.save(open('bottleneck_features_train.npy', 'wb'), bottleneck_features_train)

generator = datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode=None,
        shuffle=False)
bottleneck_features_validation = model.predict_generator(generator, 800)
np.save(open('bottleneck_features_validation.npy', 'wb'), bottleneck_features_validation)

After recording, we can load the data to train our fully connected network:

train_data = np.load(open('bottleneck_features_train.npy', 'rb'))
# the features were saved in order, so recreating the labels is easy
train_labels = np.array([0] * 1000 + [1] * 1000)

validation_data = np.load(open('bottleneck_features_validation.npy', 'rb'))
validation_labels = np.array([0] * 400 + [1] * 400)

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_data, train_labels,
          nb_epoch=50, batch_size=32,
          validation_data=(validation_data, validation_labels))
model.save_weights('bottleneck_fc_model.h5')

Because the bottleneck features are very small, this model trains very quickly even on a CPU, about one second per epoch, and finally reaches an accuracy of 90%~91%. Much of the credit for this good result goes to the pre-trained VGG network that extracts the features for us.
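To classify a new image with this setup, the same two-stage pipeline applies: run the image through the convolutional base, then feed the resulting features to the small classifier. Here is a minimal sketch (not in the original article), where conv_base stands for the VGG16 convolutional part used above and top_model for the trained fully-connected model:

from keras.preprocessing.image import load_img, img_to_array

img = load_img('data/validation/dogs/dog001.jpg', target_size=(150, 150))
x = img_to_array(img) / 255.          # same rescaling as during training
x = x.reshape((1,) + x.shape)         # add a batch dimension
features = conv_base.predict(x)       # bottleneck features for this one image
print(top_model.predict(features))    # probability of class 1 (dog)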

Fine-tuning the Top Layers of a Pre-trained Network

To further improve our previous result, we can try to fine-tune the last layers of the network. Fine-tuning means starting from a trained network and re-training a small subset of its weights on the new data set. In this experiment, fine-tuning takes three steps:

  • Instantiate the convolutional base of VGG16 and load its weights

  • Add our previously defined fully-connected network on top of it, and load its weights

  • Freeze the layers of the VGG16 network up to the last convolutional block

Note that:

  • In order to fine-tune, all layers should start from properly trained weights: for instance, you should not put a randomly initialized fully-connected network on top of a pre-trained convolutional base, because the large gradient updates triggered by the random weights would wreck the learned weights of the convolutional base. In our case this is why we first trained the top-level classifier, and only now fine-tune on top of it.

  • We choose to fine-tune only the last convolutional block rather than the whole network, in order to prevent over-fitting: the whole network has a very large entropy capacity and therefore a strong tendency to over-fit. The features learned by the lower convolutional blocks are more general and less abstract, so we keep the first blocks fixed (general features) and fine-tune only the last one (more specialized features).

  • Fine-tuning should be done with a very low learning rate, typically using the SGD optimizer rather than an adaptive-learning-rate optimizer such as RMSProp. This keeps the magnitude of the updates small, so that the pre-trained features are not destroyed.

The code follows. First we add our previously trained top model on top of the initialized VGG network.

# build a classifier model to put on top of the convolutional model
top_model = Sequential()
top_model.add(Flatten(input_shape=model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(1, activation='sigmoid'))

# note that it is necessary to start with a fully-trained
# classifier, including the top classifier,
# in order to successfully do fine-tuning
top_model.load_weights(top_model_weights_path)

# add the model on top of the convolutional base
model.add(top_model)

Then we freeze the parameters of all convolutional layers up to the last convolutional block:

# set the first 25 layers (up to the last conv block)
# to non-trainable (weights will not be updated)
for layer in model.layers[:25]:
    layer.trainable = False

from keras import optimizers

# compile the model with an SGD/momentum optimizer
# and a very slow learning rate
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

Then we train the whole thing with a very low learning rate:

# prepare data augmentation configuration
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=32,
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_height, img_width),
        batch_size=32,
        class_mode='binary')

# fine-tune the model
model.fit_generator(
        train_generator,
        samples_per_epoch=nb_train_samples,
        nb_epoch=nb_epoch,
        validation_data=validation_generator,
        nb_val_samples=nb_validation_samples)

After 50 epochs, this method reaches an accuracy of about 94%, which is very successful.

You can push the accuracy above 95% with the following approaches:

  • More aggressive data augmentation

  • More aggressive dropout

  • L1 and L2 regularization (also known as weight decay)

  • Fine-tuning one more convolutional block (alongside stronger regularization); a sketch follows below
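For the last point, the only change is where the freeze boundary is drawn. Here is a sketch under stated assumptions: the index 18 below is hypothetical and depends on how your VGG16 model is defined; count its layers to find where the second-to-last convolutional block starts.

from keras import optimizers

# freeze fewer layers, so that the last *two* convolutional blocks get fine-tuned
for layer in model.layers[:18]:  # 18 is a hypothetical boundary index
    layer.trainable = False
for layer in model.layers[18:]:
    layer.trainable = True

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])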


Posted by daedlus on Thu, 25 Apr 2019 12:30:34 -0700