[NLP] Text Classification and Generalization (rnn, cnn, word2vec, TfidfVectorizer)

Keywords: encoding Session network github

1. Emotional Analysis of Chinese Comments (keras+rnn)

1.1 Required Libraries

# First load the required libraries; jieba and gensim are used for Chinese word segmentation and word vectors
# The %matplotlib inline magic enables inline plotting, so plt.show() is not needed
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt
import re # regular expressions
import jieba # Chinese has no spaces between words, so the text must be segmented
# gensim is used to load pre-training word vector
from gensim.models import KeyedVectors
#KeyedVectors maps entities (words, documents, images, etc.) to vectors, each identified by a string ID
#Running code often produces many warnings (e.g. about newer versions); suppress them to keep the output clean
import warnings
warnings.filterwarnings("ignore")

1.2 Pre-training Word Vector

Pre-trained word vectors cn_model: "chinese-word-vectors" is open-sourced on GitHub by the Institute of Chinese Information Processing of Beijing Normal University and researchers of the DBIIR Laboratory of Renmin University of China: https://github.com/Embedding/Chinese-Word-Vectors. Here we use the "Word + Ngram" word vectors from "chinese-word-vectors", which can be downloaded from the GitHub link above. The author has also uploaded them, together with the other corpora, to a cloud drive; set your own reading path. Link: https://pan.baidu.com/s/1RCrNNAagOjLq0BP8cA9EQ Extraction code: fgux

# Loading pre-training Chinese participle embedding using gensim
cn_model = KeyedVectors.load_word2vec_format('chinese_word_vectors/sgns.zhihu.bigram', 
                                          binary=False)

1.3 Word Vector Model

In this word vector model, each word has an index and corresponds to a vector of length 300. The LSTM neural network we build here cannot process Chinese text directly; the text must first be segmented and converted into word vectors. The steps are:
0. Original text: I like literature
1. Word segmentation: I, like, literature
2. Tokenize (index): [2, 345, 4564]
3. Embedding (word vectorization): with 300-dimensional word vectors, the tokens above become a [3, 300] matrix
4. RNN: 1D CONV, GRU, LSTM, etc.
5. Classification via an activation function on the output: e.g. a sigmoid output between 0 and 1
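
To make the pipeline concrete, here is a minimal sketch of steps 0-3, assuming cn_model has been loaded as above (the example sentence is illustrative; the indices and shape depend on the model):

import numpy as np
import jieba
sentence = '我喜欢文学'                      # 0. original text: "I like literature"
words = list(jieba.cut(sentence))            # 1. word segmentation
tokens = [cn_model.vocab[w].index if w in cn_model.vocab else 0
          for w in words]                    # 2. tokenize: word -> index
vectors = np.array([cn_model[w] for w in words if w in cn_model.vocab])
print(tokens)                                # a short list of integers
print(vectors.shape)                         # 3. embedding: roughly (3, 300)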

# Each word corresponds to a vector of length 300
embedding_dim = cn_model['Shandong University'].shape[0]  # shape[0] gives the vector length
print('The length of the word vector is {}'.format(embedding_dim))
cn_model['Shandong University']

The output is as follows

The length of the word vector is 300
Out[3]:
array([-2.603470e-01,  3.677500e-01, -2.379650e-01,  5.301700e-02,
       -3.628220e-01, -3.212010e-01, -1.903330e-01,  1.587220e-01,
       ...
      dtype=float32)

# Compute Similarity
cn_model.similarity('A mandarin orange', 'orange')

output

0.66128117

# dot(a/|a|, b/|b|): cosine similarity computed manually for 'A mandarin orange' and 'orange'
np.dot(cn_model['A mandarin orange']/np.linalg.norm(cn_model['A mandarin orange']), 
cn_model['orange']/np.linalg.norm(cn_model['orange']))

output

0.66128117

# Find the closest word, cosine similarity
cn_model.most_similar(positive=['University'], topn=10)

output

[(High School, 0.7247823476791382),
(Undergraduate, 0.6768535375595093),
(Graduate, 0.6244412660598755),
(High School, 0.6088204979896545),
(Undergraduate, 0.595908522605896),
(Junior High School, 0.5883588790893555),
(Read & Research, 0.5778335332870483),
(Senior, 0.5767995119094849),
(University graduate, 0.5767451524734497),
(Normal University, 0.5708829760551453)]

# Find out different words
test_words = 'Teacher Accountant Programmer Lawyer Doctor Elderly'
test_words_result = cn_model.doesnt_match(test_words.split())
print('In ' + test_words + ':\nThe word that does not belong with the others is: %s' %test_words_result)

output

In Teacher Accountant Programmer Lawyer Doctor Elderly:
The word that does not belong with the others is: Elderly

cn_model.most_similar(positive=['woman','Derail'], negative=['Man'], topn=1)

output

[('Cutting Legs', 0.5849199295043945)]

1.4 Training corpus (dataset)

This tutorial uses hotel reviews. The training samples sit in two folders, pos and neg, each containing 2000 txt files with one review per file, 4000 training samples in total. A sample of this size is very small for NLP tasks.

# Get an index of the samples, which are stored in two folders.
# Positive evaluation'pos'folder and negative evaluation'neg' folder, respectively
# There are 2,000 txt files in each folder, one for each evaluation
import os #For file and directory access
pos_txts = os.listdir('pos')
neg_txts = os.listdir('neg')
print( 'Total Samples: '+ str(len(pos_txts) + len(neg_txts)) )

Total samples: 4000

# Now let's put all the reviews in a list
train_texts_orig = [] # Store all reviews, one string per case, original review
# After adding all the samples, train_texts_orig is a list of 4000 texts
# Among them, the first 2000 texts are positive and the last 2000 are negative.
#The following reads in the .txt files
for i in range(len(pos_txts)):
    with open('pos/'+pos_txts[i], 'r', errors='ignore') as f:
        text = f.read().strip()
        train_texts_orig.append(text)
for i in range(len(neg_txts)):
    with open('neg/'+neg_txts[i], 'r', errors='ignore') as f:
        text = f.read().strip()
        train_texts_orig.append(text)
len(train_texts_orig)

4000

# We use tensorflow's keras interface to model
from keras.models import Sequential
from keras.layers import Dense, GRU, Embedding, LSTM, Bidirectional # Dense = fully connected layer
#Bidirectional wraps an LSTM to make it bidirectional; callbacks are used to tune training
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.optimizers import RMSprop
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau

1.5 Word Segmentation and Tokenization

First we remove the punctuation from each sample and then segment it with jieba, which returns a generator. The result cannot be tokenized directly, so we convert it to a list and index it, so that each review becomes a list of indices into the pre-trained word vector model.

# Perform word segmentation and tokenize
# train_tokens is a long list with 4,000 small lists for each evaluation
train_tokens = []
for text in train_texts_orig:
    # Remove punctuation
    text = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+-!,. ?,~@#¥%......&*()]+", "",text)
    # Segment with jieba
    cut = jieba.cut(text)
    # jieba.cut returns a generator
    # Convert generator to list
    cut_list = [ i for i in cut ]
    for i, word in enumerate(cut_list):
        try:
            # Convert words to index
            cut_list[i] = cn_model.vocab[word].index
        except KeyError:
            # If the word is not in the vocabulary, use index 0
            cut_list[i] = 0
    train_tokens.append(cut_list)

1.6 Index Length Standardization

Because the reviews have different lengths, simply padding everything to the length of the longest review would waste computing resources, so we choose a compromise length.

# Get the length of all tokens
num_tokens = [ len(tokens) for tokens in train_tokens ]
num_tokens = np.array(num_tokens)
# Average tokens length
np.mean(num_tokens)

71.4495

# Length of longest evaluation tokens
np.max(num_tokens)

1540

plt.hist(np.log(num_tokens), bins = 100) # take the logarithm of the lengths
plt.xlim((0,20))
plt.ylabel('number of tokens')
plt.xlabel('length of tokens')
plt.title('Distribution of tokens length')
plt.show()

# Take the mean tokens length plus two standard deviations.
# Assuming that the distribution of tokens length is normal, the value max_tokens can cover about 95% of the samples
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

236

# When the length of tokens is 236, about 95% of the samples are covered
# Pad the sequences that are too short and truncate the ones that are too long
np.sum( num_tokens < max_tokens ) / len(num_tokens)

0.9565

1.7 Reverse tokenize

For later verification, we define a function that converts indices back into readable text; this is useful for debugging.

# Used to convert tokens to text
def reverse_tokens(tokens):
    text = ''
    for i in tokens:
        if i != 0:
            text = text + cn_model.index2word[i]
        else:
            text = text + ' '
    return text
reverse = reverse_tokens(train_tokens[0])
# Restore to text after tokenize
# No more punctuation visible
reverse

'Breakfast is too bad No matter how many people you go to and restaurants that don't add food should take this seriously. The room itself is good'

# Original Text
train_texts_orig[0]

'Breakfast is too bad. No matter how many people go there, they won't add food. The hotel should take this seriously.\n The room itself is good.'

1.8 Building an embedding matrix

Now let's prepare the embedding matrix for the model. Keras requires a matrix of dimension (num_words, embedding_dim), where num_words is the number of words we use and embedding_dim is 300 in the pre-trained word vector model we are using, i.e. each word is represented by a vector of length 300. Note that we only select the 50,000 most frequently used words. The pre-trained model contains 2.6 million words; using all of them for this classification problem would waste computing resources, because our training set is very small, only 4,000 samples in total. With 100k, 200k or more training samples, a larger vocabulary could be considered for classification.

embedding_dim

300

# Use only the first 50,000 words in the library
num_words = 50000
# Initialize embedding_matrix for use with Keras
embedding_matrix = np.zeros((num_words, embedding_dim))
# embedding_matrix is a [num_words, embedding_dim] matrix
# Dimension 50000 * 300
for i in range(num_words):
    embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]
embedding_matrix = embedding_matrix.astype('float32')
# Check that the indices correspond:
# an output of 300 means all 300 components of the embedding vector match one-to-one
np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )

300

# The dimension of embedding_matrix,
# This dimension is required by keras and will be used later in the model
embedding_matrix.shape

(50000, 300)

1.9 padding and truncating

After converting the text to tokens, the index sequences have different lengths, so we standardize them to make training easier. Above we chose 236, which covers about 95% of the training samples. Next we pad and truncate. We usually use the 'pre' method, i.e. fill zeros in front of the index sequence, because according to some research and practice, padding zeros after the text indices can have an adverse effect on the model.

# For padding and truncating, the input train_tokens is a list
# The returned train_pad is a numpy array
train_pad = pad_sequences(train_tokens, maxlen=max_tokens,
                            padding='pre', truncating='pre')

# Replace indices beyond the 50,000-word vocabulary with 0
train_pad[ train_pad>=num_words ] = 0
# After padding, the front of the sequence is all zeros and the text sits at the end
train_pad[33]

array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
290, 3053, 57, 169, 73, 1, 25, 11216, 49,
163, 15985, 0, 0, 30, 8, 0, 1, 228,
223, 40, 35, 653, 0, 5, 1642, 29, 11216,
2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,
169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,
1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,
110, 0, 0, 41281, 0, 110, 0, 35891, 110,
0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,
1, 0, 0, 169, 35071, 40, 562, 35, 12398,
657, 4857])

# Prepare the target vector, the first 2000 samples are 1, and the last 2000 are 0
train_target = np.concatenate( (np.ones(2000),np.zeros(2000)) )
# Segmentation of training and test samples
from sklearn.model_selection import train_test_split
train_target.shape
train_pad.shape

(4000, 236)

# 90% of the samples were used for training and the remaining 10% for testing
#Since the first 2000 samples are positive and the last 2000 negative, shuffle the data; random_state fixes the random seed
X_train, X_test, y_train, y_test = train_test_split(train_pad,
                                                    train_target,
                                                    test_size=0.1,
                                                    random_state=12
                                                    )
# Check training samples to make sure they are correct
print(reverse_tokens(X_train[35]))
print('class: ',y_train[35])

The room is very large and the sea view balcony is the beach. The only pity is that it is not easy to brush.
class: 1.0

1.10 Building the LSTM Model with Keras

The first layer of the model is the Embedding layer; the neural network can only process the text after the token indices have been converted into word vectors. Keras provides an Embedding layer, which also avoids tedious sparse matrix operations. The Embedding layer takes input of shape (batch_size, max_tokens) and outputs (batch_size, max_tokens, embedding_dim).

# Classifying samples using LSTM
model = Sequential()
# The first layer is the Embedding layer; trainable=False because embedding_matrix is already pre-trained
model.add(Embedding(num_words,
                    embedding_dim,
                    weights=[embedding_matrix],
                    input_length=max_tokens,
                    trainable=False))
model.add(Bidirectional(LSTM(units=32, return_sequences=True))) # a bidirectional LSTM looks at both preceding and following words
model.add(LSTM(units=16, return_sequences=False))#units=16 neurons

GRU: with a GRU the test set reaches about 87% accuracy, but when I tested my own text the output of the final activation function hovered around 0.5, which means the model's judgement is not clear-cut and its confidence is low; testing also showed that it can misjudge negated sentences. We expect negative samples to output close to 0 and positive samples close to 1, not values hovering around 0.5.
BiLSTM: after testing LSTM and BiLSTM, BiLSTM performed best and LSTM slightly better than GRU, because BiLSTM has a better memory for longer sentence structures. After the Embedding layer, the first layer is a BiLSTM that returns sequences; the second layer is a 16-unit LSTM that does not return sequences, only the final state; finally a fully connected layer with a sigmoid activation outputs the result. A complete sketch of the GRU variant is shown below for comparison.
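
Here is that GRU variant written out as a separate model (a sketch only; it reuses num_words, embedding_matrix and max_tokens defined above and does not replace the BiLSTM model being built in this section):

gru_model = Sequential()
gru_model.add(Embedding(num_words,
                        embedding_dim,
                        weights=[embedding_matrix],
                        input_length=max_tokens,
                        trainable=False))
gru_model.add(GRU(units=32, return_sequences=True))
gru_model.add(GRU(units=16, return_sequences=True))
gru_model.add(GRU(units=4, return_sequences=False))
gru_model.add(Dense(1, activation='sigmoid'))
gru_model.compile(loss='binary_crossentropy',
                  optimizer=Adam(lr=1e-3),
                  metrics=['accuracy'])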

# GRU Code
# model.add(GRU(units=32, return_sequences=True))
# model.add(GRU(units=16, return_sequences=True))
# model.add(GRU(units=4, return_sequences=False))
#Add a fully connected output layer
model.add(Dense(1, activation='sigmoid'))
# We used adam to optimize with 0.001 learning rate
optimizer = Adam(lr=1e-3)
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,  
              metrics=['accuracy'])
# Look at the model structure: about 90k trainable parameters; None stands for the batch size; each sample has 236 tokens
# 15,000,000 = 50,000*300; because trainable=False these parameters are not trained
# 17 = 16*1 + 1 (the bias is also a parameter)
model.summary()
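
The rough counts quoted in the comments can be checked by hand; a sketch of the bookkeeping (layer sizes as above; the authoritative numbers come from model.summary()):

units_1, units_2 = 32, 16
bilstm_params = 2 * 4 * units_1 * (embedding_dim + units_1 + 1)   # 85,248
lstm_params = 4 * units_2 * (2 * units_1 + units_2 + 1)           # 5,184
dense_params = units_2 * 1 + 1                                    # 17 (weights + bias)
embedding_params = num_words * embedding_dim                      # 15,000,000 (frozen)
print('trainable (approx.):', bilstm_params + lstm_params + dense_params)  # about 90,449
print('non-trainable:', embedding_params)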

# Create a checkpoint for the weights; verbose=1 prints more detail, which helps when looking for problems
path_checkpoint = 'sentiment_checkpoint.keras'
checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',
                                      verbose=1, save_weights_only=True,
                                      save_best_only=True)
# Attempting to load a trained model
try:
    model.load_weights(path_checkpoint)
except Exception as e:
    print(e)
# Define early stopping: stop training if the validation loss does not improve for 3 epochs
earlystopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1)
# Automatically reduce learning rate
lr_reduction = ReduceLROnPlateau(monitor='val_loss',
                                       factor=0.1, min_lr=1e-5, patience=0,
                                       verbose=1)
# Define callback function
callbacks = [
    earlystopping, 
    checkpoint,
    lr_reduction
]
# Start training; 4000*0.1=400 samples are held out as the test set, and validation_split=0.1 uses 3600*0.1=360 samples for validation
model.fit(X_train, y_train,
          validation_split=0.1, 
          epochs=4,
          batch_size=128,
          callbacks=callbacks)

1.11 Conclusion

We define a prediction function that preprocesses the input text exactly as the model requires and feeds it in, so we can check whether the model judges negated sentences and simple logical structures correctly.

result = model.evaluate(X_test, y_test)
print('Accuracy:{0:.2%}'.format(result[1]))
def predict_sentiment(text):
    print(text)
    # Remove punctuation
    text = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+-!,. ?,~@#¥%......&*()]+", "",text)
    # participle
    cut = jieba.cut(text)
    cut_list = [ i for i in cut ]
    # tokenize
    for i, word in enumerate(cut_list):
        try:
            cut_list[i] = cn_model.vocab[word].index
        except KeyError:
            cut_list[i] = 0
    # padding
    tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,
                           padding='pre', truncating='pre')
    # Forecast
    result = model.predict(x=tokens_pad)
    coef = result[0][0]
    if coef >= 0.5:
        print('Is a positive evaluation','output=%.2f'%coef)
    else:
        print('Is a negative evaluation','output=%.2f'%coef)
test_list = [
    'Hotel facilities are not new and service attitudes are poor',
    'The room is cool with plenty of air conditioning',
    'The hotel environment is not good and the accommodation experience is not good',
    'Insufficient room insulation' ,
    'Come back in the evening and find no cleaning'   
]
for text in test_list:
    predict_sentiment(text)

Hotel facilities are not new and service attitudes are poor
Is a negative evaluation output=0.01
The room is cool with plenty of air conditioning
Is a positive evaluation output=0.94
The hotel environment is not good and the accommodation experience is not good
Is a negative evaluation output=0.01
Insufficient room insulation
Is a negative evaluation output=0.02
Come back in the evening and find no cleaning
Is a negative evaluation output=0.40

1.12 Misclassification

y_pred = model.predict(X_test)
y_pred = y_pred.T[0]
y_pred = [1 if p>= 0.5 else 0 for p in y_pred]
y_pred = np.array(y_pred)
y_actual = np.array(y_test)
# Find index for misclassification
misclassified = np.where( y_pred != y_actual )[0]
# Count the misclassified samples among the 400 test samples
len(misclassified)
print(len(misclassified))

55

# Look at one misclassified sample; misclassified[1] is the second misclassified index
idx=misclassified[1]
print(reverse_tokens(X_test[idx]))
print('Classification of predictions', y_pred[idx])
print('Actual Classification', y_actual[idx])

Geographically, it's convenient to go anywhere, but the service isn't like a poor group management. I slept in the afternoon and took a bath. I wanted the hotel to clean again. So I turned on the service light and returned to the hotel at night to find that the service light was turned off and the room was not cleaned.
Classification of predictions 1
Actual Classification 0.0

2. Sina News Category (tensorflow+cnn)

The dataset is a subset that the author extracted from the Tsinghua news dataset; Baidu cloud download link:
Link: https://pan.baidu.com/s/1-Rm9PU7ekU3zx_7gTB9ibw Extraction Code: 3d3o

#Here is how to collect all the file paths
import os
def getFilePathList(rootDir):
    filePath_list = []
    for walk in os.walk(rootDir):
        part_filePath_list = [os.path.join(walk[0], file) for file in walk[2]]
        filePath_list.extend(part_filePath_list)
    return filePath_list
filePath_list = getFilePathList('part_cnews')
len(filePath_list) #A total of 1400 .txt files collected into a list

Run result:

filePath_list[5]

Run result:

#All sample label values are summarized into a list and assigned to label_list
#Windows and Linux use different path separator characters
label_list = []
for filePath in filePath_list:
    label = filePath.split('\\')[1]
    label_list.append(label)
len(label_list)

Run result:

#Label Statistics Count
import pandas as pd 
pd.value_counts(label_list)

Run result:

#Save label_list with the pickle Library
import pickle
with open('label_list.pickle','wb') as file:
    pickle.dump(label_list,file)
#Get all sample content, save content_list
#pickle.dump persists Python objects to binary files, which load very quickly. To avoid memory overflow, save the content with pickle.dump after every batch of files read
import time
import pickle
import re

def getFile(filePath):
   with open(filePath, encoding='utf8') as file:
       fileStr = ''.join(file.readlines(1000))
   return fileStr

interval = 100
n_samples = len(label_list)
startTime = time.time()
directory_name = 'content_list'
if not os.path.isdir(directory_name):
   os.mkdir(directory_name)
for i in range(0, n_samples, interval):
   startIndex = i
   endIndex = i + interval
   content_list = []
   print('%06d-%06d start' %(startIndex, endIndex))
   for filePath in filePath_list[startIndex:endIndex]:
       fileStr = getFile(filePath)
       content = re.sub('\s+', ' ', fileStr)
       content_list.append(content)
   save_fileName = directory_name + '/%06d-%06d.pickle' %(startIndex, endIndex)
   with open(save_fileName, 'wb') as file:
       pickle.dump(content_list, file)
   used_time = time.time() - startTime
   print('%06d-%06d used time: %.2f seconds' %(startIndex, endIndex, used_time))

Run result:

#The steps above produced the raw data; the content_list folder, the label_list file and this code file should all sit in the same path
#Loading data
import time
import pickle
import os

def getFilePathList(rootDir):
    filePath_list = []
    for walk in os.walk(rootDir):
        part_filePath_list = [os.path.join(walk[0], file) for file in walk[2]]
        filePath_list.extend(part_filePath_list)
    return filePath_list

startTime = time.time()
contentListPath_list = getFilePathList('content_list')
content_list = []
for filePath in contentListPath_list:
    with open(filePath, 'rb') as file:
        part_content_list = pickle.load(file)
    content_list.extend(part_content_list)
with open('label_list.pickle', 'rb') as file:
    label_list = pickle.load(file)
used_time = time.time() - startTime
print('used time: %.2f seconds' %used_time)
sample_size = len(content_list)
print('length of content_list, i.e. the number of samples: %d' %sample_size)

Run result:

len(content_list)

Run result:

#Build the vocabulary: the elements of content_list are the texts of the articles, with data type string.
#All characters in all articles are counted and the 5000 most frequent ones are assigned to the variable vocabulary_list.
from collections import Counter 
def getVocabularyList(content_list, vocabulary_size):
    allContent_str = ''.join(content_list)
    counter = Counter(allContent_str)
    vocabulary_list = [k[0] for k in counter.most_common(vocabulary_size)]
    return ['PAD'] + vocabulary_list
startTime = time.time()
vocabulary_list = getVocabularyList(content_list, 5000)
used_time = time.time() - startTime
print('used time: %.2f seconds' %used_time)

Run result:

#Save vocabulary
import pickle 

with open('vocabulary_list.pickle', 'wb') as file:
    pickle.dump(vocabulary_list, file)
#Load the vocabulary: once the vocabulary has been built and saved, later runs can load it directly from file, saving the time of rebuilding it
import pickle

with open('vocabulary_list.pickle', 'rb') as file:
    vocabulary_list = pickle.load(file)
#Data preparation
#In Jupyter, press Esc in a code cell to enter command mode (the bar to the left of the cell turns blue); in command mode, pressing L toggles line numbers
import time
startTime = time.time()#Record the start time of this code run and assign it to the variable startTime
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(content_list, label_list)
train_content_list = train_X
train_label_list = train_y
test_content_list = test_X
test_label_list = test_y
used_time = time.time() - startTime #Print a prompt indicating how long it takes the program to run to this point
print('train_test_split used time : %.2f seconds' %used_time)

vocabulary_size = 10000  # vocabulary size
sequence_length = 600  # sequence length
embedding_size = 64  # word vector dimension
num_filters = 256  # number of convolution kernels
filter_size = 5  # convolution kernel size
num_fc_units = 128  # fully connected layer units
dropout_keep_probability = 0.5  # dropout keep probability
learning_rate = 1e-3  # learning rate
batch_size = 64  # samples per training batch

word2id_dict = dict([(b, a) for a, b in enumerate(vocabulary_list)])#Pair each word with its id via a list comprehension, then convert the list of pairs into a dictionary
content2idList = lambda content : [word2id_dict[word] for word in content if word in word2id_dict]#An anonymous function that converts every character of an article to its id
train_idlist_list = [content2idList(content) for content in train_content_list]#A list of lists: each element of train_idlist_list is the list of ids for one article
used_time = time.time() - startTime#Print how long the program has taken so far
print('content2idList used time : %.2f seconds' %used_time)

import numpy as np
num_classes = np.unique(label_list).shape[0]#Number of label categories; for this dataset num_classes is 14
#The following lines build the feature matrix and target values used for model training
import tensorflow.contrib.keras as kr
train_X = kr.preprocessing.sequence.pad_sequences(train_idlist_list, sequence_length)#Pad every sample to the same length sequence_length, i.e. the hyperparameter of 600 set above
from sklearn.preprocessing import LabelEncoder#Import LabelEncoder from the sklearn.preprocessing library
labelEncoder = LabelEncoder()#Instantiate a LabelEncoder object
train_y = labelEncoder.fit_transform(train_label_list)#Call fit_transform of the LabelEncoder object to label-encode the categories
train_Y = kr.utils.to_categorical(train_y, num_classes)#Call keras.utils.to_categorical to one-hot encode the label-encoded result

import tensorflow as tf
tf.reset_default_graph()#Reset the TensorFlow default graph so the cell can be re-run safely
X_holder = tf.placeholder(tf.int32, [None, sequence_length])
Y_holder = tf.placeholder(tf.float32, [None, num_classes])
used_time = time.time() - startTime
print('data preparation used time : %.2f seconds' %used_time)

Run result:

list(word2id_dict.items())[:20]

Run result:

#Build Neural Network
embedding = tf.get_variable('embedding', 
                            [vocabulary_size, embedding_size])#Calling the get_variable method of the tf library to instantiate the updatable model parameter embedding, the matrix shape is vocabulary_size*embedding_size, or 10000*64
embedding_inputs = tf.nn.embedding_lookup(embedding,  #Look up the embeddings of the input word ids; embedding_inputs has shape batch_size*sequence_length*embedding_size, i.e. 64*600*64
                                          X_holder)
conv = tf.layers.conv1d(embedding_inputs,  #Calling the tf.layers.conv1d method requires three parameters, the first is the input data, the second is the number of convolution cores num_filters, and the third is the convolution core size filter_size
                        num_filters,       #The result of the method is assigned to the variable conv in the shape of batch_size*596*num_filters, 596 being the result of 600-5+1
                        filter_size)
max_pooling = tf.reduce_max(conv,  #Do a maximum operation on the first dimension of the variable conv.The result of the method is assigned to the variable max_pooling in the shape of batch_size*num_filters, or 64*256
                            [1])
full_connect = tf.layers.dense(max_pooling, #Add a full connection layer, and the result of the tf.layers.dense method is assigned to the variable full_connect in the shape of batch_size*num_fc_units, or 64*128
                               num_fc_units)
full_connect_dropout = tf.contrib.layers.dropout(full_connect, #The code calls the tf.contrib.layers.dropout method, which requires two parameters, the first is the input data and the second is the preservation ratio
                                                 keep_prob=dropout_keep_probability)
full_connect_activate = tf.nn.relu(full_connect_dropout) #Activation function
softmax_before = tf.layers.dense(full_connect_activate, #Add a full connection layer, and the result of the tf.layers.dense method is assigned to the variable softmax_before in the shape of batch_size*num_classes, or 64*14
                                 num_classes)
predict_Y = tf.nn.softmax(softmax_before)  #tf.nn.softmax method, the result of which is the prediction probability value
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y_holder,
                                                           logits=softmax_before)
loss = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)
isCorrect = tf.equal(tf.argmax(Y_holder, 1), tf.argmax(predict_Y, 1))
accuracy = tf.reduce_mean(tf.cast(isCorrect, tf.float32))
#The parameters must be initialized before the neural network model can be trained
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
#model training
test_idlist_list = [content2idList(content) for content in test_content_list] #Get data from test set
test_X = kr.preprocessing.sequence.pad_sequences(test_idlist_list, sequence_length)
test_y = labelEncoder.transform(test_label_list)
test_Y = kr.utils.to_categorical(test_y, num_classes)
import random
for i in range(100): #Train the model for 100 iterations
    selected_index = random.sample(list(range(len(train_y))), k=batch_size)#Sample batch_size (64) examples from the training set for mini-batch gradient descent
    batch_X = train_X[selected_index]
    batch_Y = train_Y[selected_index]
    session.run(train, {X_holder:batch_X, Y_holder:batch_Y}) #Each run represents a model training session
    step = i + 1 
    if step % 10 == 0:#Print every 10 steps
        selected_index = random.sample(list(range(len(test_y))), k=100)#Randomly select 100 samples from a test set
        batch_X = test_X[selected_index]
        batch_Y = test_Y[selected_index]
        loss_value, accuracy_value = session.run([loss, accuracy], {X_holder:batch_X, Y_holder:batch_Y})
        print('step:%d loss:%.4f accuracy:%.4f' %(step, loss_value, accuracy_value))

Run result:

import warnings
warnings.filterwarnings("ignore")
def predict(input_content):
   #idList must be a list of lists;
   #otherwise calling the kr.preprocessing.sequence.pad_sequences method will raise an error
   idList = [content2idList(input_content)]
   X = kr.preprocessing.sequence.pad_sequences(idList, sequence_length)
   Y = session.run(predict_Y, {X_holder:X})
   y = np.argmax(Y, axis=1)
   label = labelEncoder.inverse_transform(y)[0]
   return label

selected_index = random.sample(range(len(test_content_list)), k=1)[0]
selected_sample = test_content_list[selected_index]
true_label = test_label_list[selected_index]
predict_label = predict(selected_sample)

print('selected_sample :', selected_sample)
print('true_label :', true_label)
print('predict_label :', predict_label, '\n')
print('predict whatever you want, for example:')
input_content = "Football referee hits"
print('predict("%s") :' %input_content, predict(input_content))

Run result:

#Confusion Matrix
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

def predictAll(test_X, batch_size=100):
    predict_value_list = []
    for i in range(0, len(test_X), batch_size):
        selected_X = test_X[i: i + batch_size]
        predict_value = session.run(predict_Y, {X_holder:selected_X})
        predict_value_list.extend(predict_value)
    return np.array(predict_value_list)

Y = predictAll(test_X)
y = np.argmax(Y, axis=1)
predict_label_list = labelEncoder.inverse_transform(y)
pd.DataFrame(confusion_matrix(test_label_list, predict_label_list), 
             columns=labelEncoder.classes_,
             index=labelEncoder.classes_ )

Run result:

#Report Form
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def eval_model(y_true, y_pred, labels):
   # Calculate Precision, Recall, F1, support for each classification
   p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
   # Calculate the average Precision, Recall, f1, support for the population
   tot_p = np.average(p, weights=s)
   tot_r = np.average(r, weights=s)
   tot_f1 = np.average(f1, weights=s)
   tot_s = np.sum(s)
   res1 = pd.DataFrame({
       u'Label': labels,
       u'Precision': p,
       u'Recall': r,
       u'F1': f1,
       u'Support': s
   })
   res2 = pd.DataFrame({
       u'Label': ['population'],
       u'Precision': [tot_p],
       u'Recall': [tot_r],
       u'F1': [tot_f1],
       u'Support': [tot_s]
   })
   res2.index = [999]
   res = pd.concat([res1, res2])
   return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]

eval_model(test_label_list, predict_label_list, labelEncoder.classes_)

Run result:

3. Sohu News Text Category (word2vec)

Create a new word2vec-based text categorization folder on your desktop. Enter cmd in the folder:


Create a new word2vec_test.ipynb:

Rename it to: word2vec_test

There are two more files in the folder at this time:

3.1 Data preparation

There are 24,000 samples in the training set, 12 classifications, and 2000 samples in each classification.
There are 12,000 samples in the test set, 12 classifications, and 1,000 samples in each classification.
Link: https://pan.baidu.com/s/1UbPjMpcp3kqvdd0HMAgMfQ Extraction Code: b53e
The files obtained after decompressing the archive are shown below:

Load the training set into the variable train_df and print the first five lines of the training set, coded as follows:

#There are 24,000 training set data and 12,000 test set data.
import pandas as pd
#Load the training set into the variable train_df and print the first five lines of the training set, coded as follows.
#There are three parameters in the read_csv method. The first parameter is the path to load the text file, the second key parameter sep is the separator, and the third key parameter header is whether the first line of the text file is the field name or not.
train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
train_df.head()

#View the name of each category in the training set and the number of samples
for name, group in train_df.groupby(0):
    print(name,len(group))

#Load the test set and see the name of each category and the number of samples
test_df = pd.read_csv('sohu_test.txt', sep='\t', header=None)
for name, group in test_df.groupby(0):
    print(name, len(group))

#Walk through the 24,000 training samples and use jieba.cut to get the token list for each article, collected into the variable cutWords_list
#A token is kept only if it is not a stopword
import jieba
import time

train_df.columns = ['classification', 'Article']
stopword_list = [k.strip() for k in open('stopwords.txt', encoding='utf8').readlines() if k.strip() != '']
cutWords_list = []
i = 0
startTime = time.time()
for article in train_df['Article']:
    cutWords = [k for k in jieba.cut(article) if k not in stopword_list]
    i += 1
    if i % 1000 == 0:
        print('Segmenting the first %d articles took %.2f seconds in total' %(i, time.time()-startTime))
    cutWords_list.append(cutWords)

Run result:

#Save word breaking results as local file cutWords_list.txt
with open('cutWords_list.txt', 'w') as file: 
    for cutWords in cutWords_list:
        file.write(' '.join(cutWords) + '\n')

The author provides links to text files that have been word-split: https://pan.baidu.com/s/1oKjLZjSkqE0LfLEvLxkBNw Extraction Code: oh3u

#Load word breaker file
with open('cutWords_list.txt') as file:
  cutWords_list = [k.split() for k in file.readlines()]

3.2 word2vec Model

To complete this step, you need to first install the gensim library, install the command: pip install gensim

#Instantiate a Word2Vec model object from gensim.models with the segmented corpus
from gensim.models import Word2Vec

word2vec_model = Word2Vec(cutWords_list, size=100, iter=10, min_count=20)
#Calling methods on the model object keeps raising warnings; suppress them
import warnings

warnings.filterwarnings('ignore')
#Call the wv.most_similar method of the Word2Vec model object to see words that have the closest meaning to photography.
#The wv.most_similar method has two parameters. The first parameter is the word to search for, and the second keyword parameter, topn, has a positive integer data type. It refers to how many most relevant words need to be listed. By default, it lists 10 most relevant words.
#wv.most_similar returns a list of tuples; each tuple holds a related word and its similarity, a floating point number.
word2vec_model.wv.most_similar('Photography')

Run result:

A simple example of the positive and negative keyword arguments of wv.most_similar: look at the result of woman + sir - man:

word2vec_model.most_similar(positive=['woman', 'Sir'], negative=['Man'], topn=1)

Run result:

Check the similarity between two specific words with wv.similarity:
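
A minimal sketch of that check, assuming both words are in the model's vocabulary (the word pair is only illustrative):

# wv.similarity returns the cosine similarity of the two word vectors
word2vec_model.wv.similarity('Photography', 'Photographer')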

Save the Word2Vec model as a word2vec_model.w2v file with the following code:

word2vec_model.save('word2vec_model.w2v')
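
The saved model can later be loaded back from the same file; a minimal sketch:

# Load the saved Word2Vec model from word2vec_model.w2v
from gensim.models import Word2Vec
word2vec_model = Word2Vec.load('word2vec_model.w2v')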

3.3 Feature Engineering:

#For each article, look up the word vector of each token in the word2vec model
#Then average the word vectors of all tokens in the article; this average is the article's vector in the word2vec model
#The Word2Vec object was instantiated with size=100, so the article vectors are 100-dimensional
#Define a getVector function that takes two parameters, the article's token list and the word2vec model object, and returns the article vector
#The variable vector_list collects the word vectors of all tokens of an article via a list comprehension; np.array converts it to an ndarray, which is then averaged column-wise

#Method 1, plain for-loop calculation (kept inside a triple-quoted string, i.e. commented out)
'''
import numpy as np 
import time 

def getVector_v1(cutWords, word2vec_model):
    count = 0
    article_vector = np.zeros(word2vec_model.layer1_size)
    for cutWord in cutWords:
        if cutWord in word2vec_model:
            article_vector += word2vec_model[cutWord]
            count += 1
    return article_vector / count

startTime = time.time()
vector_list = []
i = 0
for cutWords in cutWords_list[:5000]:
    i += 1
    if i % 1000 ==0:
        print('Generating vectors for the first %d articles took %.2f seconds' %(i, time.time()-startTime))
    vector_list.append(getVector_v1(cutWords, word2vec_model))
X = np.array(vector_list)
'''
#Method 2, calculated with the pandas mean (also commented out)
'''
import time 
import pandas as pd 

def getVector_v2(cutWords, word2vec_model):
    vector_list = [word2vec_model[k] for k in cutWords if k in word2vec_model]
    vector_df = pd.DataFrame(vector_list)
    cutWord_vector = vector_df.mean(axis=0).values
    return cutWord_vector

startTime = time.time()
vector_list = []
i = 0
for cutWords in cutWords_list[:5000]:
    i += 1
    if i % 1000 ==0:
        print('Generating vectors for the first %d articles took %.2f seconds' %(i, time.time()-startTime))
    vector_list.append(getVector_v2(cutWords, word2vec_model))
X = np.array(vector_list)
'''
#Method 3, calculated with the numpy mean (the method actually used)
import time
import numpy as np

def getVector_v3(cutWords, word2vec_model):
    vector_list = [word2vec_model[k] for k in cutWords if k in word2vec_model]
    cutWord_vector = np.array(vector_list).mean(axis=0)
    return cutWord_vector

startTime = time.time()
vector_list = []
i = 0
for cutWords in cutWords_list:
    i += 1
    if i % 1000 ==0:
        print('Generating vectors for the first %d articles took %.2f seconds' %(i, time.time()-startTime))
    vector_list.append(getVector_v3(cutWords, word2vec_model))
X = np.array(vector_list)

Run result:

#Method 4, calculated with numpy add and divide (also commented out)
'''
import time
import numpy as np

def getVector_v4(cutWords, word2vec_model):
    i = 0
    index2word_set = set(word2vec_model.wv.index2word)
    article_vector = np.zeros((word2vec_model.layer1_size))
    for cutWord in cutWords:
        if cutWord in index2word_set:
            article_vector = np.add(article_vector, word2vec_model.wv[cutWord])
            i += 1
    cutWord_vector = np.divide(article_vector, i)
    return cutWord_vector

startTime = time.time()
vector_list = []
i = 0
for cutWords in cutWords_list[:5000]:
    i += 1
    if i % 1000 ==0:
        print('Generating vectors for the first %d articles took %.2f seconds' %(i, time.time()-startTime))
    vector_list.append(getVector_v4(cutWords, word2vec_model))
X = np.array(vector_list)
'''
#Because forming the feature matrix takes a long time, it is saved to a file so it does not have to be recomputed later
#The dump method of the ndarray object takes one parameter, a string with the file name to save to
X.dump('articles_vector.txt')
#Load the contents of this file to assign to variable X
X = np.load('articles_vector.txt')
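
Note that newer NumPy releases (1.16.3 and later) refuse to load pickled arrays by default; if the line above raises an error, the following variant works:

# allow_pickle=True is required for pickled arrays on newer NumPy versions
X = np.load('articles_vector.txt', allow_pickle=True)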

3.4 Model training, model evaluation

3.4.1 Label Encoding:

#Label Encoding
#Call the LabelEncoder method of sklearn.preprocessing library to label article categories
from sklearn.preprocessing import LabelEncoder
import pandas as pd

train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
train_df.columns = ['classification', 'Article']
labelEncoder = LabelEncoder()
y = labelEncoder.fit_transform(train_df['classification'])

3.4.2 Logistic Regression Model

#Call the LogisticRegression method of the sklearn.linear_model library to instantiate the model object.
#Calling the train_test_split method of sklearn.model_selection library to partition training and test sets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
logistic_model = LogisticRegression()
logistic_model.fit(train_X, train_y)
logistic_model.score(test_X, test_y)

Run result:

0.789375

3.4.3 Save Model

#Call the joblib method in the sklearn.externals library to save the model as a logistic.model file
from sklearn.externals import joblib

joblib.dump(logistic_model, 'logistic.model')
#Load Model
from sklearn.externals import joblib

logistic_model = joblib.load('logistic.model')

3.4.4 Cross-validation

#The results of cross-validation are more convincing.
#Call the ShuffleSplit method of the sklearn.model_selection library to instantiate the cross-validation object.
#Call the cross_val_score method of the sklearn.model_selection library to get a score for each cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv_split = ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.2)
logistic_model = LogisticRegression()
score_ndarray = cross_val_score(logistic_model, X, y, cv=cv_split)
print(score_ndarray)
print(score_ndarray.mean())

Run result:

[0.79104167 0.77375 0.78875 0.77979167 0.78958333]
0.7845833333333333

3.4.5 Model Test

#Model Test
#Call the load method of the joblib object in the sklearn.externals library to load the model assignment to the variable logistic_model.
#Call the pandas library read_csv method to read the test set data.
#Call the groupby method of the DataFrame object to group by category, so that the accuracy of each category can be evaluated separately.
#Call the getVector_v3 function defined above to convert each article into an article vector.
#The custom getVectorMatrix function builds the feature matrix for a group of test articles.
#Calling the transform method of the labelEncoder object labels the prediction label to get the prediction target value

import pandas as pd
import numpy as np
from sklearn.externals import joblib
import jieba 

def getVectorMatrix(article_series):
    return np.array([getVector_v3(jieba.cut(k), word2vec_model) for k in article_series])
    
logistic_model = joblib.load('logistic.model')
test_df = pd.read_csv('sohu_test.txt', sep='\t', header=None)
test_df.columns = ['classification', 'Article']
for name, group in test_df.groupby('classification'):
    featureMatrix = getVectorMatrix(group['Article'])
    target = labelEncoder.transform(group['classification'])
    print(name, logistic_model.score(featureMatrix, target))

Result of last code run:

Sports 0.968
Health 0.814
Women 0.772
Entertainment 0.765
Real Estate 0.879
Education 0.886
Culture 0.563
News 0.575
Travel 0.82
Car 0.934
Technology 0.817
Finance 0.7

3.5 Summary

The word2vec model was applied to 24,000 training samples and 12,000 test samples. Under cross-validation the average model score is about 0.78. (On the test set, the seven categories sports, health, real estate, education, travel, car and technology score higher, i.e. they are easy to classify correctly; the five categories women, entertainment, culture, news and finance score lower, i.e. they are harder to classify correctly.)

4. Sohu News Text Category (TfidfVectorizer)

import pandas as pd

train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
train_df.head()

for name, group in train_df.groupby(0):
    print(name,len(group))

Run result:

test_df = pd.read_csv('sohu_test.txt', sep='\t', header=None)
for name, group in test_df.groupby(0):
    print(name, len(group))

Run result:

with open('stopwords.txt', encoding='utf8') as file:
  stopWord_list = [k.strip() for k in file.readlines()]

import jieba
import time

train_df.columns = ['classification', 'Article']
stopword_list = [k.strip() for k in open('stopwords.txt', encoding='utf8').readlines() if k.strip() != '']
cutWords_list = []
i = 0
startTime = time.time()
for article in train_df['Article']:
    cutWords = [k for k in jieba.cut(article) if k not in stopword_list]
    i += 1
    if i % 1000 == 0:
        print('Segmenting the first %d articles took %.2f seconds in total' %(i, time.time()-startTime))
    cutWords_list.append(cutWords)

with open('cutWords_list.txt', 'w') as file: 
    for cutWords in cutWords_list:
        file.write(' '.join(cutWords) + '\n')
with open('cutWords_list.txt') as file:
    cutWords_list = [k.split() for k in file.readlines()]
#Feature Engineering
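# NOTE: `tfidf` is used below but is not defined earlier in this write-up.
# A minimal instantiation is assumed here; the tokenizer and the pruning
# parameters (min_df, max_df) are illustrative choices, not necessarily the
# author's original settings.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=jieba.lcut, stop_words=stopword_list,
                        min_df=40, max_df=0.3)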
X = tfidf.fit_transform(train_df['Article'])
print('Vocabulary size:', len(tfidf.vocabulary_))
print(X.shape)

Run result:

#Call the LabelEncoder method of the sklearn.preprocessing library to label the article categories.
#The last line of code looks at the shape of the prediction target.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
labelEncoder = LabelEncoder()
y = labelEncoder.fit_transform(train_df[0])
y.shape

Run result:

#Call the LogisticRegression method of the sklearn.linear_model library to instantiate the model object.
#Call the train_test_split method of the sklearn.model_selection library to divide the training and test sets.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)
logistic_model.score(test_X, test_y)

Run result:

#To save the model we use the pickle library, which ships with Python's standard library (no installation needed).
#Calling the dump method of the pickle library to save the model requires two parameters.
#The first parameter is a saved object and can be of any data type, since there are three models to save, the first parameter in the code below is a dictionary.
#The second parameter is the saved file object, with a data type of _io.BufferedWriter

import pickle

with open('tfidf.model', 'wb') as file:
    save = {
        'labelEncoder' : labelEncoder,
        'tfidfVectorizer' : tfidf,
        'logistic_model' : logistic_model
    }
    pickle.dump(save, file)
#Call the load method of the pickle library to load the saved model object
import pickle

with open('tfidf.model', 'rb') as file:
    tfidf_model = pickle.load(file)
    tfidfVectorizer = tfidf_model['tfidfVectorizer']
    labelEncoder = tfidf_model['labelEncoder']
    logistic_model = tfidf_model['logistic_model']
#Call the pandas read_csv method to load the training set data.
#Call the transform method of the TfidfVectorizer object to obtain the feature matrix.
#Call the transform method of the LabelEncoder object to get the predicted target value.

import pandas as pd

train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
X = tfidfVectorizer.transform(train_df[1])
y = labelEncoder.transform(train_df[0])
#The Logistic Regression model object is instantiated by calling the LogisticRegression method of the sklearn.linear_model library.
#Call the ShuffleSplit method of the sklearn.model_selection library to instantiate the cross-validation object.
#Call the cross_val_score method of the sklearn.model_selection library to get the score for each cross-validation.
#Final print each score and average score

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
cv_split = ShuffleSplit(n_splits=5, test_size=0.3)
score_ndarray = cross_val_score(logistic_model, X, y, cv=cv_split)
print(score_ndarray)
print(score_ndarray.mean())

Run result:

#Draw confusion matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
import pandas as pd

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
logistic_model = LogisticRegressionCV(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)
predict_y = logistic_model.predict(test_X)
pd.DataFrame(confusion_matrix(test_y, predict_y), 
             columns=labelEncoder.classes_, 
             index=labelEncoder.classes_)

Run result:

#Draw precision, recall, f1-score, support report tables
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def eval_model(y_true, y_pred, labels):
    # Calculate Precision, Recall, F1, support for each classification
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Calculate the average Precision, Recall, f1, support for the population
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': ['population'],
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]

predict_y = logistic_model.predict(test_X)
eval_model(test_y, predict_y, labelEncoder.classes_)

Run result:

#Model testing is the prediction of a completely new set of tests.
#Call the read_csv method of the pandas library to read the test set file.
#Call the transform method of the TfidfVectorizer object to obtain the feature matrix.
#Call the transform method of the LabelEncoder object to get the target values

import pandas as pd

test_df = pd.read_csv('sohu_test.txt', sep='\t', header=None)
test_X = tfidfVectorizer.transform(test_df[1])
test_y = labelEncoder.transform(test_df[0])
predict_y = logistic_model.predict(test_X)
eval_model(test_y, predict_y, labelEncoder.classes_)

Run result:
