Spam detection: Part 2

Author: Md Sohel Mahmood | Compiled by: VK | Source: Towards Data Science

Introduction

Text classification is one of the central challenges in natural language processing.

It involves techniques that can reliably distinguish target text from ordinary text. Other services, such as chatbots, also rely heavily on user-entered text; they must process large amounts of data to determine a user's intent and route the conversation correctly.

Using TensorFlow

In the first part of this spam classifier series, I showed how to analyze and classify text with the nltk package, feed it into a classifier model, train the model, and finally evaluate its performance. I demonstrated the performance of naive Bayes, SVC and random forest classifiers on email data.

https://towardsdatascience.com/ml-classifier-performance-comparison-for-spam-emails-detection-77749926d508

In this article, I will demonstrate how to tokenize and effectively classify emails using TensorFlow. Let's start.

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

I've included pad_sequences, which will be used to make all text arrays equal in length. Padding is applied up to a maximum length and can be added after the sequence (post) or before it (pre). The hyperparameters are defined below.

vocab_size = 1000
embedding_dim = 64
max_length = 120
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
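
To make the padding behavior concrete, here is a minimal illustration (a toy example, not part of the pipeline) of post versus pre padding with pad_sequences:

demo_seqs = [[1, 2, 3], [4, 5]]
# 'post' appends zeros; 'pre' (the default) prepends them
print(pad_sequences(demo_seqs, maxlen=4, padding='post'))
# [[1 2 3 0]
#  [4 5 0 0]]
print(pad_sequences(demo_seqs, maxlen=4, padding='pre'))
# [[0 1 2 3]
#  [0 0 4 5]]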

These parameters help keep the model from overfitting the training data. For example, you can shrink the vocabulary to reduce overfitting on low-frequency words. Similarly, a lower embedding dimension makes the model faster to train. I also included a token for out-of-vocabulary (OOV) words.
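
The effect of the OOV token can be seen with a toy tokenizer (a hypothetical example, not part of the pipeline): any word missing from the fitted vocabulary is mapped to the index reserved for the OOV token, which is 1.

demo_tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
demo_tokenizer.fit_on_texts(["free prize inside"])
# "claim" and "your" were never seen, so they map to the OOV index 1
print(demo_tokenizer.texts_to_sequences(["claim your free prize"]))
# [[1, 1, 2, 3]]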

I will use the same dataset used in part 1.

spam = pd.read_csv('spam.csv',  encoding='latin-1')
spam = spam.filter(['v1','v2'], axis=1)
spam.columns = ['label', 'text']

# Extract the texts and labels as Python lists
sentences = spam['text'].tolist()
labels = spam['label'].tolist()

I allocated 80% of the dataset for training and the remainder for testing.

training_size = int(spam.shape[0]*0.8)
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels_str = labels[0:training_size]
testing_labels_str = labels[training_size:]
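
This is a simple sequential split. If you would rather shuffle before splitting, scikit-learn's train_test_split is an alternative (a sketch, not what the original code does):

from sklearn.model_selection import train_test_split

# Shuffled 80/20 split; random_state makes it reproducible
training_sentences, testing_sentences, training_labels_str, testing_labels_str = \
    train_test_split(sentences, labels, test_size=0.2, random_state=42)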

Label encoding

Since the dataset's labels are strings, they are encoded as integers: 0 for spam and 1 for legitimate ("ham") text.

training_labels = [0] * len(training_labels_str)
for ind,item in enumerate(training_labels_str):
    if item == 'ham':
        training_labels[ind] = 1
    else:
        training_labels[ind] = 0

testing_labels = [0] * len(testing_labels_str)
for ind,item in enumerate(testing_labels_str):
    if item == 'ham':
        testing_labels[ind] = 1
    else:
        testing_labels[ind] = 0
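
The two loops above can also be written more compactly as list comprehensions (equivalent behavior):

training_labels = [1 if item == 'ham' else 0 for item in training_labels_str]
testing_labels = [1 if item == 'ham' else 0 for item in testing_labels_str]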

Next, the tokenizer is fitted on the training sentences to build the vocabulary.

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

As mentioned earlier, padding is required to make all sequences equal in length.

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

Finally, the padded sequences and the labels are converted to numpy arrays, which is what model.fit expects.

training_padded = np.array(training_padded)
testing_padded = np.array(testing_padded)

training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

Model definition

The model is then defined using a recurrent architecture. Here, a bidirectional LSTM is used, which typically outperforms a plain RNN: it reads the sequence both forwards and backwards and concatenates the two final states, so LSTM(32) yields a 64-dimensional output.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

adam = tf.keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, amsgrad=False)

model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
model.summary()

num_epochs = 20
history = model.fit(training_padded, training_labels, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels), verbose=1)

Model training

After 20 epochs the model is well trained, and its accuracy on the validation data (here, the test set) is about 98%.
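
To verify this, the accuracy curves recorded in the history object can be plotted (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Keras stores per-epoch metrics in history.history
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()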

Then we make predictions on the test set.

X_train = training_padded
X_test = testing_padded

y_train = training_labels
y_test = testing_labels

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
y_pred = model.predict(X_test)

# Convert the predicted probabilities to 0 or 1 with a 0.5 threshold
y_prediction = (y_pred > 0.5).astype(int).flatten()

rep = classification_report(y_test, y_prediction)
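
Printing the report, together with the confusion matrix and accuracy score imported above, gives the full picture:

print(rep)
print(confusion_matrix(y_test, y_prediction))
print("Accuracy:", accuracy_score(y_test, y_prediction))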

The report shows that the model achieves good precision, recall and F1 scores for both classes (0 for spam, 1 for normal mail). For class 0, recall is slightly lower than precision, indicating some false negatives: the model mislabels a few spam messages as normal email.

We can feed in any sample text to check whether it is classified as spam or normal text. Since the tokenizer is already fitted, there is no need to define it again. All we need to do is tokenize the sample text, pad it with zeros, and pass it to the model for prediction.

I chose some catchy words, such as "WINNER", "free" and "prize", which should cause this text to be detected as spam.

sample_text = ["Winner!!! Darling please click the link to claim your free prize"]

sample_text_tokenized  = tokenizer.texts_to_sequences(sample_text)
sample_text_tokenized_padded = pad_sequences(sample_text_tokenized, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# 0 is spam, 1 is normal mail
pred = model.predict(sample_text_tokenized_padded)[0][0]
if pred > 0.5:
    print("This is a normal email")
else:
    print("This is spam")

Conclusion

This article demonstrated how to use TensorFlow to train an NLP model to high accuracy and then evaluate performance metrics such as precision, recall and F1 score. On this small dataset, 20 epochs are enough to produce an excellent model, with a validation accuracy of about 98%.

Github page: https://mdsohelmahmood.github.io/2021/06/23/Spam-email-classification-Part2-Tensorflow.html
