The Reuters dataset
This news dataset poses a multi-class (single-label) classification problem.
Dataset characteristics:
- Text classification dataset
- Contains 46 different topics (a quick check is shown in the snippet below)
- Each topic has at least 10 samples in the training set
- The dataset ships with Keras and can be loaded directly
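As a quick, purely illustrative check of the figures above, the snippet below simply counts the distinct label values (it uses the same load_data call that appears again in listing 3-12):

```python
import numpy as np
from keras.datasets import reuters

# Same loading call as listing 3-12 below; here we only count the distinct topics
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)
print(len(np.unique(train_labels)))  # 46 topics
```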
The difference between a multi-class problem and a binary (0-1, single-label) problem, by example:
- Binary classification: is this object a person? A: yes, B: no
- Multi-class classification: which of the following categories does this object belong to? A: person, B: car, C: plane, D: 🐟 …
3-12 Loading the dataset
from keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words = 10000)

# Check that the data loaded successfully
print(len(train_data))
print(len(test_data))
# print(train_data[10])
3-13 Decoding the indices back into news text
word_index = reuters.get_word_index()

# Reverse the dictionary: map integer indices back to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Decode the text; the indices are offset by 3 because 0, 1 and 2 are
# reserved for "padding", "start of sequence" and "unknown"
decoded_newswire = ' '.join(reverse_word_index.get(i - 3, '?') for i in train_data[0])
print(decoded_newswire)
print(train_labels[10])
? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3 3
- Paraphrase (the decoded text is a little garbled grammatically, but it is clearly recognizable as a short news item):
? ? ? said that as a result of its December acquisition of Space Co. it expects earnings per share of 1.15 to 1.30 dlrs in 1987, up from 70 cts in 1986. The company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986, and rental operation revenues to 19 to 22 mln dlrs from 12.5 mln dlrs. It said cash flow per share this year should be 2.50 to three dlrs.
3-14 Encoding the data
Data Vectorization
import numpy as np

# Define the vectorization function
def vectorize_sequence(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Vectorize the samples
x_train = vectorize_sequence(train_data)  # Vectorized training data
x_test = vectorize_sequence(test_data)    # Vectorized test data
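A quick, purely illustrative check of what the vectorization produces (the 8,982 figure is the size of the Reuters training set):

```python
# Every newswire is now a 10,000-dimensional vector of 0s and 1s
print(x_train.shape)                       # (8982, 10000)
print(x_train[0].min(), x_train[0].max())  # 0.0 1.0
```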
Label Vectorization
After vectorizing the data, we use one-hot encoding to vectorize the labels.
def to_one_hot(labels, dimension = 46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

# Vectorize the labels
one_hot_train_labels = to_one_hot(train_labels)  # Training set
one_hot_test_labels = to_one_hot(test_labels)    # Test set
# Vectorization using Keras's built-in method
from keras.utils.np_utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
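The two approaches are equivalent; a minimal check, assuming both cells above have been run:

```python
# The handwritten encoder and Keras's to_categorical produce the same one-hot matrix
print(np.array_equal(to_one_hot(train_labels), to_categorical(train_labels)))  # True
```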
3-15 Defining the multi-class model
In this problem each newswire in the sample can belong to one of 46 output categories. The 16-dimensional hidden layers used for the previous dataset may be too small to separate that many classes; a layer that small can become an intermediate bottleneck for the information flowing through the network. For this reason, we use wider hidden layers in this example.
# Model definition
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(46, activation = 'softmax'))  # The final output has 46 possibilities
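To inspect the network just defined, Keras's model.summary() prints each layer's output shape and parameter count (optional):

```python
# Optional: print the layer output shapes and parameter counts of the model above
model.summary()
```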
3-16 Compiling the model
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])
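categorical_crossentropy expects one-hot encoded labels. If you prefer to keep the labels as plain integers, an equivalent alternative is sparse_categorical_crossentropy; a minimal sketch (mathematically the same loss, only the label format differs — shown for comparison only, the rest of this article keeps the one-hot setup):

```python
# Equivalent compilation using integer labels instead of one-hot vectors
model.compile(optimizer = 'rmsprop',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
# With this loss you would fit/evaluate on train_labels / test_labels directly,
# without calling to_one_hot or to_categorical.
```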
3-17 Setting aside a validation set
# 3-17-1 Set aside 1,000 samples from the training data as a validation set
# (plain Python slicing is enough here)
x_val = x_train[:1000]            # The first 1,000 samples form the validation set
partial_x_train = x_train[1000:]  # The samples from index 1,000 onwards are our training data
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
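An illustrative check of the split sizes (the 7,982 figure also appears as "Train on 7982 samples" in the log below):

```python
# 1,000 validation samples are held out; 7,982 training samples remain
print(partial_x_train.shape)  # (7982, 10000)
print(x_val.shape)            # (1000, 10000)
```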
# 3-17-2 Training the model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs = 20,
                    batch_size = 512,
                    validation_data = (x_val, y_val))
Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 1s 182us/step - loss: 2.5544 - accuracy: 0.5014 - val_loss: 1.7032 - val_accuracy: 0.6170
Epoch 2/20
7982/7982 [==============================] - 1s 100us/step - loss: 1.4136 - accuracy: 0.6994 - val_loss: 1.2892 - val_accuracy: 0.7080
Epoch 3/20
7982/7982 [==============================] - 1s 107us/step - loss: 1.0453 - accuracy: 0.7741 - val_loss: 1.1258 - val_accuracy: 0.7470
Epoch 4/20
7982/7982 [==============================] - 1s 95us/step - loss: 0.8166 - accuracy: 0.8254 - val_loss: 1.0081 - val_accuracy: 0.7840
Epoch 5/20
7982/7982 [==============================] - 1s 94us/step - loss: 0.6438 - accuracy: 0.8631 - val_loss: 0.9416 - val_accuracy: 0.8130
Epoch 6/20
7982/7982 [==============================] - 1s 91us/step - loss: 0.5099 - accuracy: 0.8968 - val_loss: 0.8941 - val_accuracy: 0.8200
Epoch 7/20
7982/7982 [==============================] - 1s 104us/step - loss: 0.4163 - accuracy: 0.9136 - val_loss: 0.8934 - val_accuracy: 0.8080
Epoch 8/20
7982/7982 [==============================] - 1s 117us/step - loss: 0.3332 - accuracy: 0.9290 - val_loss: 0.8881 - val_accuracy: 0.8150
Epoch 9/20
7982/7982 [==============================] - 1s 93us/step - loss: 0.2763 - accuracy: 0.9392 - val_loss: 0.8839 - val_accuracy: 0.8200
Epoch 10/20
7982/7982 [==============================] - 1s 85us/step - loss: 0.2371 - accuracy: 0.9464 - val_loss: 0.8970 - val_accuracy: 0.8140
Epoch 11/20
7982/7982 [==============================] - 1s 96us/step - loss: 0.2001 - accuracy: 0.9505 - val_loss: 0.9158 - val_accuracy: 0.8110
Epoch 12/20
7982/7982 [==============================] - 1s 88us/step - loss: 0.1777 - accuracy: 0.9518 - val_loss: 0.9198 - val_accuracy: 0.8110
Epoch 13/20
7982/7982 [==============================] - 1s 81us/step - loss: 0.1604 - accuracy: 0.9543 - val_loss: 0.9159 - val_accuracy: 0.8190
Epoch 14/20
7982/7982 [==============================] - 1s 92us/step - loss: 0.1455 - accuracy: 0.9569 - val_loss: 0.9516 - val_accuracy: 0.8130
Epoch 15/20
7982/7982 [==============================] - 1s 86us/step - loss: 0.1388 - accuracy: 0.9559 - val_loss: 0.9443 - val_accuracy: 0.8190
Epoch 16/20
7982/7982 [==============================] - 1s 80us/step - loss: 0.1306 - accuracy: 0.9546 - val_loss: 1.0283 - val_accuracy: 0.7990
Epoch 17/20
7982/7982 [==============================] - 1s 96us/step - loss: 0.1217 - accuracy: 0.9587 - val_loss: 1.0271 - val_accuracy: 0.8100
Epoch 18/20
7982/7982 [==============================] - 1s 88us/step - loss: 0.1173 - accuracy: 0.9578 - val_loss: 1.0426 - val_accuracy: 0.8070
Epoch 19/20
7982/7982 [==============================] - 1s 86us/step - loss: 0.1151 - accuracy: 0.9563 - val_loss: 1.0390 - val_accuracy: 0.8090
Epoch 20/20
7982/7982 [==============================] - 1s 142us/step - loss: 0.1093 - accuracy: 0.9583 - val_loss: 1.0477 - val_accuracy: 0.8090
# Plotting
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label = 'Training loss')       # 'bo' means blue dots
plt.plot(epochs, val_loss, 'b', label = 'Validation loss')  # 'b' means a solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
[Figure: training and validation loss (output_17_0.png)]
plt.clf()  # Clear the figure

# Note: the book uses the keys below
# acc = history.history['acc']
# val_acc = history.history['val_acc']
# In my version of Keras, 'acc' is replaced by 'accuracy'; adjust as needed
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plt.plot(epochs, acc, 'bo', label = 'Training acc')
plt.plot(epochs, val_acc, 'b', label = 'Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
[Figure: training and validation accuracy (output_18_0.png)]
3-21 Retraining a model from scratch with fewer epochs
model_2 = models.Sequential()
model_2.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model_2.add(layers.Dense(64, activation = 'relu'))
model_2.add(layers.Dense(46, activation = 'softmax'))

model_2.compile(optimizer = 'rmsprop',
                loss = 'categorical_crossentropy',
                metrics = ['accuracy'])

model_2.fit(partial_x_train,
            partial_y_train,
            epochs = 9,
            batch_size = 512,
            validation_data = (x_val, y_val))

# Evaluate the retrained model on the test set
results = model_2.evaluate(x_test, one_hot_test_labels)
print(results)
Train on 7982 samples, validate on 1000 samples
Epoch 1/9
7982/7982 [==============================] - 1s 97us/step - loss: 2.4856 - accuracy: 0.5234 - val_loss: 1.6474 - val_accuracy: 0.6420
Epoch 2/9
7982/7982 [==============================] - 1s 95us/step - loss: 1.3877 - accuracy: 0.6994 - val_loss: 1.2912 - val_accuracy: 0.7030
Epoch 3/9
7982/7982 [==============================] - 1s 92us/step - loss: 1.0489 - accuracy: 0.7699 - val_loss: 1.1425 - val_accuracy: 0.7600
Epoch 4/9
7982/7982 [==============================] - 1s 85us/step - loss: 0.8293 - accuracy: 0.8182 - val_loss: 1.0284 - val_accuracy: 0.7830
Epoch 5/9
7982/7982 [==============================] - 1s 99us/step - loss: 0.6605 - accuracy: 0.8543 - val_loss: 0.9781 - val_accuracy: 0.7820
Epoch 6/9
7982/7982 [==============================] - 1s 83us/step - loss: 0.5260 - accuracy: 0.8887 - val_loss: 0.9659 - val_accuracy: 0.7810
Epoch 7/9
7982/7982 [==============================] - 1s 84us/step - loss: 0.4247 - accuracy: 0.9138 - val_loss: 0.9053 - val_accuracy: 0.8040
Epoch 8/9
7982/7982 [==============================] - 1s 102us/step - loss: 0.3496 - accuracy: 0.9253 - val_loss: 0.8785 - val_accuracy: 0.8150
Epoch 9/9
7982/7982 [==============================] - 1s 100us/step - loss: 0.2851 - accuracy: 0.9394 - val_loss: 0.8921 - val_accuracy: 0.8210
2246/2246 [==============================] - 0s 157us/step
[1.251534348179587, 0.7853962779045105]
Compare this result with that of a completely random classifier:
import copy

# Randomly shuffled copy of the test labels
test_label_copy = copy.copy(test_labels)
np.random.shuffle(test_label_copy)

hits_array = np.array(test_labels) == np.array(test_label_copy)
float(np.sum(hits_array)) / len(test_labels)
0.18655387355298308
3-22 Generating predictions on new data
prediction = model.predict(x_test)

# Each element of `prediction` is a vector of length 46
print(prediction[0].shape)
# Its entries sum to 1 (the probabilities of the 46 different classes)
print(np.sum(prediction[0]))
# Print the class with the largest predicted probability
print(np.argmax(prediction[0]))
(46,)
1.0000001
3
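As an extra illustrative step (not in the original listing), taking the argmax over every test prediction and comparing it with the true labels recomputes the test accuracy by hand; it should match the accuracy reported by model.evaluate:

```python
# Turn the 46-way probability vectors into hard class predictions
predicted_classes = np.argmax(prediction, axis=1)

# Fraction of test newswires whose predicted topic equals the true topic
manual_accuracy = np.mean(predicted_classes == np.array(test_labels))
print(manual_accuracy)
```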
3-23 A model with an information bottleneck
Here, to verify that making the intermediate fully connected layer too small turns it into an obstacle to the flow of information, we deliberately shrink that layer's dimension and then compare the resulting prediction accuracy.
# Build the model. The book shrinks the middle fully connected layer to 4 units;
# the value below can be changed (4, 8, 32, 64, 128) to see the impact
model_3 = models.Sequential()
model_3.add(layers.Dense(64, activation = 'relu', input_shape = (10000,)))
# --------------------------------------------------
model_3.add(layers.Dense(128, activation = 'relu'))
# --------------------------------------------------
model_3.add(layers.Dense(46, activation = 'softmax'))
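For reference, here is a minimal sketch of the bottleneck variant the paragraph above describes, with the middle layer shrunk to 4 units as in the book; model_bottleneck is just an illustrative name:

```python
# Deliberately undersized middle layer: 4 units have to carry information about 46 classes
model_bottleneck = models.Sequential()
model_bottleneck.add(layers.Dense(64, activation = 'relu', input_shape = (10000,)))
model_bottleneck.add(layers.Dense(4, activation = 'relu'))   # the information bottleneck
model_bottleneck.add(layers.Dense(46, activation = 'softmax'))
```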
# Model training
model_3.compile(optimizer = 'rmsprop',
                loss = 'categorical_crossentropy',
                metrics = ['accuracy'])
model_3.fit(partial_x_train,
            partial_y_train,
            epochs = 20,
            batch_size = 128,
            validation_data = (x_val, y_val))
Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 1s 176us/step - loss: 0.0738 - accuracy: 0.9588 - val_loss: 2.5090 - val_accuracy: 0.7760
Epoch 2/20
7982/7982 [==============================] - 1s 133us/step - loss: 0.0698 - accuracy: 0.9597 - val_loss: 2.6685 - val_accuracy: 0.7700
Epoch 3/20
7982/7982 [==============================] - 1s 130us/step - loss: 0.0676 - accuracy: 0.9607 - val_loss: 2.7785 - val_accuracy: 0.7700
Epoch 4/20
7982/7982 [==============================] - 1s 129us/step - loss: 0.0675 - accuracy: 0.9584 - val_loss: 2.9254 - val_accuracy: 0.7700
Epoch 5/20
7982/7982 [==============================] - 1s 133us/step - loss: 0.0670 - accuracy: 0.9588 - val_loss: 3.0094 - val_accuracy: 0.7710
Epoch 6/20
7982/7982 [==============================] - 1s 129us/step - loss: 0.0666 - accuracy: 0.9578 - val_loss: 3.0232 - val_accuracy: 0.7680
Epoch 7/20
7982/7982 [==============================] - 1s 168us/step - loss: 0.0665 - accuracy: 0.9590 - val_loss: 3.0974 - val_accuracy: 0.7730
Epoch 8/20
7982/7982 [==============================] - 1s 122us/step - loss: 0.0659 - accuracy: 0.9587 - val_loss: 3.2057 - val_accuracy: 0.7670
Epoch 9/20
7982/7982 [==============================] - 1s 133us/step - loss: 0.0648 - accuracy: 0.9599 - val_loss: 3.2828 - val_accuracy: 0.7650
Epoch 10/20
7982/7982 [==============================] - 1s 124us/step - loss: 0.0644 - accuracy: 0.9602 - val_loss: 3.1684 - val_accuracy: 0.7700
Epoch 11/20
7982/7982 [==============================] - 1s 131us/step - loss: 0.0647 - accuracy: 0.9577 - val_loss: 3.2552 - val_accuracy: 0.7650
Epoch 12/20
7982/7982 [==============================] - 1s 138us/step - loss: 0.0627 - accuracy: 0.9578 - val_loss: 3.4422 - val_accuracy: 0.7710
Epoch 13/20
7982/7982 [==============================] - 1s 135us/step - loss: 0.0631 - accuracy: 0.9594 - val_loss: 3.3429 - val_accuracy: 0.7610
Epoch 14/20
7982/7982 [==============================] - 1s 148us/step - loss: 0.0636 - accuracy: 0.9585 - val_loss: 3.6921 - val_accuracy: 0.7660
Epoch 15/20
7982/7982 [==============================] - 1s 160us/step - loss: 0.0632 - accuracy: 0.9597 - val_loss: 3.4518 - val_accuracy: 0.7640
Epoch 16/20
7982/7982 [==============================] - 1s 186us/step - loss: 0.0616 - accuracy: 0.9590 - val_loss: 3.7733 - val_accuracy: 0.7620
Epoch 17/20
7982/7982 [==============================] - 1s 139us/step - loss: 0.0621 - accuracy: 0.9590 - val_loss: 3.7500 - val_accuracy: 0.7610
Epoch 18/20
7982/7982 [==============================] - 1s 132us/step - loss: 0.0610 - accuracy: 0.9589 - val_loss: 3.9891 - val_accuracy: 0.7540
Epoch 19/20
7982/7982 [==============================] - 1s 150us/step - loss: 0.0622 - accuracy: 0.9588 - val_loss: 3.8385 - val_accuracy: 0.7500
Epoch 20/20
7982/7982 [==============================] - 1s 134us/step - loss: 0.0602 - accuracy: 0.9599 - val_loss: 4.1126 - val_accuracy: 0.7570
<keras.callbacks.callbacks.History at 0x1b52af48be0>
Comparison conclusion
# Print the results
results = model_3.evaluate(x_test, one_hot_test_labels)
print(results)
2246/2246 [==============================] - 0s 89us/step
[4.692387377058727, 0.75200355052948]
- If the samples are divided into N categories, the last layer should be a fully connected (Dense) layer of size N
- For single-label, multi-class problems, the last layer should use softmax as its activation
- The labels should be one-hot encoded, and categorical_crossentropy should be used as the loss function
Closing remarks
Note: the code in this article comes from Deep Learning with Python and is shared as electronic notes for study reference only. The author has run it all successfully; if anything has been left out, please contact the author of this article.
If you have read this far, please give the blogger a like. Your support is the author's greatest motivation to keep writing (^-^)
My knowledge is limited; if there are any mistakes, please point them out.
This article is for learning and communication only, not for any commercial purpose. If any copyright issues are involved, please contact the author as soon as possible.