Classification and localization
Localizing an object in a picture can be expressed as a regression task: predicting a bounding box around the object. A common approach is to predict the horizontal and vertical coordinates of the object's center, as well as its height and width, so there are four numbers to predict. This does not require many changes to the model: you just need to add a second dense output layer with four units (typically on top of the global average pooling layer), and it can be trained with the MSE loss:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

# Load the tf_flowers dataset and split it into test (10%),
# validation (15%), and training (75%) sets
(test_set, valid_set, train_set), info = tfds.load(
    'tf_flowers',
    split=['train[:10%]', 'train[10%:25%]', 'train[25%:]'],
    as_supervised=True, with_info=True)
dataset_size = info.splits['train'].num_examples
class_names = info.features['label'].names
n_classes = info.features['label'].num_classes

# Resize each image to 224x224 and apply Xception's preprocessing
def preprocess(image, label):
    resized_image = tf.image.resize(image, [224, 224])
    final_image = keras.applications.xception.preprocess_input(resized_image)
    return final_image, label

batch_size = 16
train_set = train_set.shuffle(1000)
train_set = train_set.map(preprocess).batch(batch_size).prefetch(1)
valid_set = valid_set.map(preprocess).batch(batch_size).prefetch(1)
test_set = test_set.map(preprocess).batch(batch_size).prefetch(1)
# Xception base without its top layers, plus two output heads:
# one for classification and one for bounding box regression
base_model = keras.applications.xception.Xception(weights='imagenet',
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = keras.layers.Dense(n_classes, activation='softmax')(avg)
loc_output = keras.layers.Dense(4)(avg)
optimizer = keras.optimizers.SGD(learning_rate=0.2, momentum=0.9, decay=0.01)
model = keras.Model(inputs=base_model.input,
                    outputs=[class_output, loc_output])
model.compile(loss=['sparse_categorical_crossentropy', 'mse'],
              loss_weights=[0.8, 0.2],
              optimizer=optimizer, metrics=['accuracy'])
The flowers dataset does not come with bounding boxes around the flowers, so you need to add them yourself. This is often one of the hardest and most costly parts of a machine learning project: getting the labels. It is always worth spending time to find the right tools. To annotate images with bounding boxes, you may want to use an open source image labeling tool such as VGG Image Annotator, LabelImg, OpenLabeler, or ImgLab, or a commercial tool such as LabelBox or Supervisely. If you have a very large number of images to annotate, you may also want to consider a crowdsourcing platform such as Amazon Mechanical Turk. However, setting up a crowdsourcing job takes considerable work: you have to prepare the form sent to the workers, supervise them, and ensure that the quality of the bounding boxes they produce is good, so it is only worth the effort at scale. If you have just a few thousand images to label, you are better off doing it yourself.
Suppose you have obtained the bounding boxes for every image in the flowers dataset (assuming a single bounding box per image). You then need to create a dataset whose items are batches of preprocessed images along with their class labels and bounding boxes. Each item should be a tuple of the form (images, (class_labels, bounding_boxes)).
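As a minimal sketch of what this looks like (random bounding boxes stand in for real annotations here, purely so the pipeline can be tested end to end; with real labels you would look the boxes up from your annotation tool's export instead):

def add_random_bounding_boxes(images, labels):
    # Placeholder boxes: one random [0, 1] 4-vector per image in the batch
    fake_bboxes = tf.random.uniform([tf.shape(images)[0], 4])
    return images, (labels, fake_bboxes)

fake_train_set = train_set.take(5).repeat(2).map(add_random_bounding_boxes)
model.fit(fake_train_set, steps_per_epoch=5, epochs=2)

Since train_set is already batched, the mapped function receives a whole batch at a time, and model.fit() matches the (labels, boxes) tuple to the two output heads in the order they were passed to keras.Model().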
The bounding boxes should be normalized so that the horizontal and vertical coordinates, as well as the height and width, all range from 0 to 1. Also, it is common to predict the square root of the height and width rather than the height and width directly: this way, a 10-pixel error on a large bounding box will not be penalized as much as a 10-pixel error on a small one.
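For example, here is one way this conversion could look (a sketch, not from the original text; it assumes pixel-space boxes given as [x_min, y_min, x_max, y_max]):

def normalize_bbox(bbox, image_height, image_width):
    # bbox is assumed to be [x_min, y_min, x_max, y_max] in pixels
    x_min, y_min, x_max, y_max = tf.unstack(tf.cast(bbox, tf.float32))
    center_x = (x_min + x_max) / 2 / image_width   # in [0, 1]
    center_y = (y_min + y_max) / 2 / image_height  # in [0, 1]
    sqrt_h = tf.sqrt((y_max - y_min) / image_height)
    sqrt_w = tf.sqrt((x_max - x_min) / image_width)
    return tf.stack([center_x, center_y, sqrt_h, sqrt_w])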
The MSE often works fairly well as a cost function to train the model, but it is not a great metric to evaluate how well the model can predict bounding boxes. The most common metric for this is Intersection over Union (IoU): the area of overlap between the predicted bounding box and the target bounding box, divided by the area of their union. In tf.keras, it is implemented by the tf.keras.metrics.MeanIoU class.
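To make the definition concrete, here is a small sketch (not from the original text) that computes the IoU of two boxes given as [x_min, y_min, x_max, y_max] in normalized coordinates:

def iou(box_a, box_b):
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area
    return inter_area / union_area

# e.g. iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.75, 0.75]) -> about 0.143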