Spam filtering based on Naive Bayes in machine learning

Keywords: Python Machine Learning Naive Bayes

1, Naive Bayes overview

The naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence between features. Given a training set, it first learns the joint probability distribution of input and output under the feature conditional independence assumption (because the model is obtained by learning the joint distribution, naive Bayes is clearly a generative model). Then, for a given input x, it uses Bayes' theorem to output the class y with the maximum posterior probability.

2, Basic formula of naive Bayes

1. Joint probability

The joint probability is the probability that several events all occur at the same time. It is written P(X = a, Y = b), or more briefly P(a, b) or P(ab).
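As a quick illustration (a fair die, not part of the original text): the joint probability P(even, greater than 3) counts the outcomes where both conditions hold at once:

```python
from fractions import Fraction

# Joint probability by counting outcomes of one fair die roll:
# P(even, greater than 3) is the fraction of outcomes satisfying both conditions.
outcomes = [1, 2, 3, 4, 5, 6]
favourable = [x for x in outcomes if x % 2 == 0 and x > 3]   # {4, 6}
p_joint = Fraction(len(favourable), len(outcomes))
print(p_joint)  # 1/3
```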

2. Conditional probability

There is a jar containing 7 stones, of which 3 are white and 4 are black. If a stone is taken at random from the jar, what is the probability that it is white?

Obviously, the probability of drawing a white stone is 3/7 and the probability of drawing a black stone is 4/7. Writing the former as P(white), its value is obtained by dividing the number of white stones by the total number of stones.

Now suppose the seven stones are placed in two buckets, A and B, as shown in the figure. How should the conditional probability be calculated?

To calculate P(white) or P(black) now, the information about which bucket a stone is in clearly changes the result; this is conditional probability. The probability of drawing a white stone from bucket B is written P(white|bucket B), which we read as "the probability of drawing a white stone given that the stone comes from bucket B".

It is easy to see that P(white|bucket A) = 2/4 and P(white|bucket B) = 1/3.

The conditional probability is calculated as:

P(A|B) = P(A, B) / P(B)

Applied to our example: P(white|bucket B) = P(white and bucket B) / P(bucket B)

Formula interpretation:

P(white|bucket B): the probability of drawing a white stone given that the stone comes from bucket B
P(white and bucket B): the probability of drawing a white stone that is in bucket B = 1/7
P(bucket B): the probability that the drawn stone is in bucket B = 3/7
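The numbers above can be checked directly against the conditional probability formula; a minimal sketch using the counts from the example:

```python
# Counts from the example: bucket A holds 2 white and 2 black stones,
# bucket B holds 1 white and 2 black stones (7 stones in total).
total = 7.0

p_white_and_bucket_b = 1 / total   # P(white and bucket B) = 1/7
p_bucket_b = 3 / total             # P(bucket B) = 3/7

# Conditional probability: P(white | bucket B) = P(white and bucket B) / P(bucket B)
p_white_given_b = p_white_and_bucket_b / p_bucket_b
print(p_white_given_b)  # 0.333... = 1/3
```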

3. Bayesian theorem

Another effective way to obtain a conditional probability is Bayes' theorem. Bayes' theorem tells us how to swap the condition and the result in a conditional probability: if P(X|Y) is known and P(Y|X) is required, then

P(Y|X) = P(X|Y) P(Y) / P(X)
P(Y): the prior probability. The prior probability is the probability assigned to an event before it is observed; it usually plays the role of the "cause" in problems of reasoning from cause to effect.

P(Y|X): the posterior probability. The posterior probability is the probability that an observed outcome was produced by a particular cause. Its calculation is based on the prior probability.

P(X|Y): the conditional probability, also called the likelihood, generally obtained from historical data statistics. It is usually not called a prior probability, although it also fits the definition of one.
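As a quick numerical illustration of the theorem (the figures 0.3, 0.5 and 0.01 below are assumptions for illustration, not values from the data set): suppose 30% of mail is spam, the word "offer" appears in 50% of spam and in 1% of normal mail; the probability that a mail containing "offer" is spam then follows directly from the formula:

```python
# Hypothetical figures, for illustration only
p_spam = 0.3              # prior P(Y = spam)
p_word_given_spam = 0.5   # likelihood P(X = 'offer' | Y = spam)
p_word_given_ham = 0.01   # likelihood P(X = 'offer' | Y = ham)

# Law of total probability: P(X) = P(X|spam)P(spam) + P(X|ham)P(ham)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | 'offer') = P('offer' | spam) P(spam) / P('offer')
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))  # about 0.9554
```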

4. Naive Bayesian classifier

The naive Bayes method learns the joint probability distribution P(X, Y) from the training data set; in practice this means learning the prior probability distribution and the conditional probability distribution:

Prior probability distribution:

P(Y = c_k), k = 1, 2, ..., K

Conditional probability distribution:

P(X = x | Y = c_k) = P(X^(1) = x^(1), ..., X^(n) = x^(n) | Y = c_k)

The joint probability distribution then follows from the conditional probability formula: P(X, Y) = P(Y) P(X|Y).

In naive Bayes classification, for a given input x, the posterior probability distribution P(Y = c_k | X = x) is computed from the learned model, and the class with the largest posterior probability is output as the class of x. The posterior probability is computed by Bayes' theorem:

P(Y = c_k | X = x) = P(X = x | Y = c_k) P(Y = c_k) / P(X = x)

Expanding the denominator with the law of total probability:

P(Y = c_k | X = x) = P(X = x | Y = c_k) P(Y = c_k) / Σ_k P(X = x | Y = c_k) P(Y = c_k)

Substituting the conditional independence assumption into the above gives the basic formula of naive Bayes classification:

P(Y = c_k | X = x) = P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) / Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

The denominator above is the same for all classes and does not affect the comparison, so the naive Bayes classifier can be simplified to

y = argmax_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)
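This simplified decision rule can be sketched in log space (the toy vocabulary and probabilities below are assumptions for illustration; in the program further down, the real values come from training):

```python
import numpy as np

# Hypothetical model for a 3-word vocabulary ['cheap', 'meeting', 'offer']
prior = {0: 0.7, 1: 0.3}  # P(Y = ham), P(Y = spam)
cond = {
    0: np.array([0.05, 0.40, 0.05]),  # P(word_j | ham)
    1: np.array([0.45, 0.05, 0.40]),  # P(word_j | spam)
}

x = np.array([1, 0, 1])  # document contains 'cheap' and 'offer'

# y = argmax_k [ log P(Y=c_k) + sum_j x_j * log P(word_j | Y=c_k) ]
scores = {k: np.log(prior[k]) + np.sum(x * np.log(cond[k])) for k in prior}
y = max(scores, key=scores.get)
print(y)  # 1 -> classified as spam
```

Working in logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow; the code below applies the same trick.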
3, Spam filtering

1. Steps of realizing spam classification by naive Bayes

(1) Collect data: provide text files.

(2) Prepare data: parse the text file into an entry vector.

(3) Analyze data: check entries to ensure the correctness of parsing.

(4) Training algorithm: calculate the conditional probability of different independent features.

(5) Test algorithm: calculate the error rate.

(6) Using algorithms: build a complete program to classify a group of documents

2. Data set download

There are two folders, ham and spam, under the email folder.

The txt files in the ham folder are normal mail;

the txt files in the spam folder are spam.

3. Code implementation

# -*- coding: UTF-8 -*-
import numpy as np
import re
import random
def createVocabList(dataSet):
    """
    Function description: collect the segmented sample entries into a list of unique entries, i.e. a vocabulary
    Parameters:
        dataSet - collated sample data set
    Returns:
        vocabSet - a list of non-repeating entries, i.e. the vocabulary
    """
    vocabSet = set([])  # Create an empty set (no repeats)
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # Take the union
    return list(vocabSet)
def setOfWords2Vec(vocabList, inputSet):
    """
    Function description: vectorize inputSet according to the vocabulary vocabList; each element of the vector is 1 or 0
    Parameters:
        vocabList - the list returned by createVocabList
        inputSet - segmented entry list
    Returns:
        returnVec - document vector (set-of-words model)
    """
    returnVec = [0] * len(vocabList)               # Create a vector of zeros
    for word in inputSet:                          # Traverse each entry
        if word in vocabList:                      # If the entry exists in the vocabulary, set 1
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec        # Return the document vector
def bagOfWords2VecMN(vocabList, inputSet):
    """
    Function description: build the bag-of-words model according to the vocabulary vocabList
    Parameters:
        vocabList - the list returned by createVocabList
        inputSet - segmented entry list
    Returns:
        returnVec - document vector (bag-of-words model)
    """
    returnVec = [0] * len(vocabList)  # Create a vector of zeros
    for word in inputSet:             # Traverse each entry
        if word in vocabList:         # If the entry exists in the vocabulary, increment its count
            returnVec[vocabList.index(word)] += 1
    return returnVec  # Return the bag-of-words vector
def trainNB0(trainMatrix, trainCategory):
    """
    Function description: naive Bayes classifier training function
    Parameters:
        trainMatrix - training document matrix, i.e. the matrix of returnVec vectors returned by setOfWords2Vec
        trainCategory - training category label vector, i.e. the classVec returned by loadDataSet
    Returns:
        p0Vect - conditional probability array of the normal-mail class
        p1Vect - conditional probability array of the spam class
        pAbusive - probability that a document belongs to the spam class
    """
    numTrainDocs = len(trainMatrix)  # Calculate the number of training documents
    numWords = len(trainMatrix[0])  # Calculate the number of entries per document
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # Probability that the document belongs to spam
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)  # Create numpy.ones array, initialize the number of entries to 1, and Laplace smoothing
    p0Denom = 2.0
    p1Denom = 2.0  # Denominator initialized to 2, Laplace smoothing
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:  # Accumulate the data needed for the spam-class conditional probabilities, i.e. P(w0|1),P(w1|1),P(w2|1)···
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:  # Accumulate the data needed for the normal-mail-class conditional probabilities, i.e. P(w0|0),P(w1|0),P(w2|0)···
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)   #Take logarithm to prevent lower overflow
    return p0Vect, p1Vect, pAbusive  # Return the conditional probability arrays of the normal-mail and spam classes, and the probability that a document is spam
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    Function description: naive Bayes classifier classification function
    Parameters:
        vec2Classify - array of entries to be classified
        p0Vec - conditional probability array of the normal-mail class
        p1Vec - conditional probability array of the spam class
        pClass1 - probability that a document is spam
    Returns:
        0 - belongs to the normal-mail class
        1 - belongs to the spam class
    """
    # The conditional probabilities are stored as logarithms, so the product
    # becomes a sum: log P(c) + sum of log P(w_j|c) over the words present
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
def textParse(bigString):  # Convert a long string into a list of words
    """
    Function description: receives a long string and parses it into a list of strings
    """
    listOfTokens = re.split(r'\W+', bigString)  # Split on runs of non-alphanumeric characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # Drop short tokens (e.g. a capital I) and lowercase the rest
def spamTest():
    """
    Function description: test the naive Bayes classifier using hold-out cross validation
    """
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):  # Traverse the 25 txt files of each class
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())  # Read each spam message and convert the string into a word list
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(1)  # Mark spam: 1 indicates a junk file
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())  # Read each normal message and convert the string into a word list
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(0)  # Mark normal mail: 0 indicates a normal file
    vocabList = createVocabList(docList)  # Create vocabulary without repetition
    trainingSet = list(range(50))
    testSet = []  # Create a list that stores the index values of the training set and the test set
    for i in range(10):  # From the 50 emails, 40 were randomly selected as the training set and 10 as the test set
        randIndex = int(random.uniform(0, len(trainingSet)))  # Random selection of index value
        testSet.append(trainingSet[randIndex])  # Add index value of test set
        del (trainingSet[randIndex])  # Delete the index value added to the test set in the training set list
    trainMat = []
    trainClasses = []  # Create training set matrix and training set category label coefficient vector
    for docIndex in trainingSet:  # Ergodic training set
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))  # Add the generated word set model to the training matrix
        trainClasses.append(classList[docIndex])  # Add a category to the training set category label coefficient vector
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # Training naive Bayesian model
    errorCount = 0  # Error classification count
    for docIndex in testSet:  # Traversal test set
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])  # Word set model of test set
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:  # If the classification is wrong
            errorCount += 1  # Error count plus 1
            print("Misclassified test sets:", docList[docIndex])
    print('Error rate:%.2f%%' % (float(errorCount) / len(testSet) * 100))
if __name__ == '__main__':
    spamTest()

4. Operation results


5. Naive Bayesian evaluation

Advantages: it remains effective even with little data, and it can handle multi-class problems.

Disadvantages: it is sensitive to how the input data is prepared, and the "naive" conditional independence assumption costs some accuracy.


Posted by yaatra on Sat, 27 Nov 2021 22:37:58 -0800