Implementing by Hand: A Simple Bayesian Classifier

Keywords: Python encoding

Implementing a Bayesian classifier by hand

Introduction

The naive Bayesian classifier, as its name implies, is a classifier based on Bayes' formula: it converts a posterior probability into the product of a prior probability and a set of conditional probabilities, and then classifies by comparing the size of that product across categories. Unlike many other classifiers, the naive Bayesian classifier has no training process in the strict sense; it only needs to compute the relevant probabilities. Bayesian classifiers are well suited to natural language classification. The following describes in detail how to implement a Bayesian classifier, taking the classification of English comment text as an example.
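Written out, this is simply Bayes' rule combined with the naive conditional-independence assumption; the denominator is the same for every class, so only the numerator needs to be compared:

$$ P(C_i \mid F_1, \dots, F_n) = \frac{P(C_i)\prod_{j=1}^{n} P(F_j \mid C_i)}{P(F_1, \dots, F_n)} \;\propto\; P(C_i)\prod_{j=1}^{n} P(F_j \mid C_i) $$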

Code

First, introduce a corpus, which contains comments from English websites. If you need to process Chinese, the text would additionally have to be segmented to obtain a list of words.

postingList = [
    ['my','dog','has','flea','problems','help','please'],    
    ['maybe','not','take','him','to','dog','park','stupid'],    
    ['my','dalmation','is','so','cute','I','love','him'],
    ['stop','posting','stupid','worthless','garbage'],
    ['mr','licks','ate','my','steak','how','to','stop','him'],
    ['quit','buying','worthless','dog','food','stupid']
]
classVec = [0,1,0,1,0,1]

Here, "0" means a positive comment and "1" means a negative comment. In practice, with a richer corpus the classification criteria can be refined further, for example splitting positive ratings into 5 levels and negative ratings into 5 levels.
Since computers cannot process text directly, the text needs to be encoded first. Here we use a simple bag-of-words model: build a vocabulary list that includes every word appearing in the corpus. A convenient implementation uses Python's set data type, which cannot contain duplicate elements and therefore de-duplicates automatically.

dataSet = postingList
vocabSet = set([])   #Initialize an empty set
for document in dataSet:
    # | is the set union operator; since a set cannot contain duplicates,
    # identical words from the two sets are merged automatically
    vocabSet = vocabSet | set(document)
vocabSet = list(vocabSet)    #A list is easier to index into than a set

To facilitate processing, the training samples also need to be encoded. Based on the vocabulary list obtained above, each training sample is transformed into a fixed-length feature vector whose length equals the length of the vocabulary.

def bagOfWords2VecMN(vocabList, inputSet):
    #The length of the feature vector equals the length of the vocabulary
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            #Increment the count at the position corresponding to this word
            returnVec[vocabList.index(word)] += 1
    return returnVec
    
import numpy as np

#Convert the dataset into bag-of-words feature vectors (encoding)
trainMat = []
for postinDoc in postingList:
    res = bagOfWords2VecMN(vocabSet, postinDoc)
    trainMat.append(res)
#Convert the lists to NumPy arrays for easier handling
trainMatrix = np.array(trainMat)
trainCategory = np.array(classVec)
print(trainMatrix)

By doing the above, the training samples are transformed into the following form, where the length of each feature vector equals the length of the vocabulary and each number is the count of occurrences of the corresponding vocabulary word.

[[0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1]
 [1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0]
 [0 1 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Once these steps are completed, each probability value can be calculated. The prior probability is computed first. For discrete variables, estimating the prior probability reduces to counting frequencies.

numTrainDocs = len(trainMatrix)     #Number of samples
numWords = len(trainMatrix[0])      #Number of features per sample
pC1 = sum(trainCategory) / float(numTrainDocs)    #P(C1), the prior probability of class 1
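For this corpus the arithmetic is simple: sum(trainCategory) is 3 and numTrainDocs is 6, so pC1 works out to 0.5, and the prior for class 0 is 1 - pC1 = 0.5.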

The next step is to compute the conditional probabilities for each class. For class C1, we need the 32 values p(F1|C1), p(F2|C1), ..., p(F32|C1), one per vocabulary word; each conditional probability equals the number of occurrences of that word in the class divided by the total number of word occurrences in that class. Two numerical problems arise. First, if a word never occurs in a class, its probability is 0 and the whole product becomes 0. Second, each factor in p(F1|C1)p(F2|C1)...p(F32|C1) may be very small, so multiplying them can cause numerical underflow. The solution is to initialize each word count to 1 and each class word total to 2 (a simple smoothing that avoids zero values), and to take the natural logarithm of the probabilities, turning the product into a sum and keeping the values in a safe range. A small numeric demonstration of the underflow problem follows the code below.

#Get the total number of words in different categories
numWordC0 = 2    #Initialize the class word totals to 2, matching the +1 smoothing of the word counts below
numWordC1 = 2
for i in range(numTrainDocs):
    if trainCategory[i] == 0:
        for num in trainMatrix[i]:            
            numWordC0 += num
    else:
        for num in trainMatrix[i]:
            numWordC1 += num

#Get the number of occurrences of each word under different categories
numWordFsC0 = np.ones(numWords)    #Initialize each word count to 1 to avoid zero probabilities
numWordFsC1 = np.ones(numWords)
for j in range(numTrainDocs):    
    for k in range(numWords):        
        if trainCategory[j] == 0:
            numWordFsC0[k] += trainMatrix[j][k]
        else:            
            numWordFsC1[k] += trainMatrix[j][k]
                  
#Calculate conditional probability values for different classes of each word
psOfFsC0 = numWordFsC0/numWordC0
psOfFsC1 = numWordFsC1/numWordC1
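
To see concretely why the logarithm is needed, here is a minimal standalone sketch (independent of the classifier code) showing that a long product of small probabilities underflows to 0.0 in floating point, while the sum of their logarithms remains finite:

import math

probs = [1e-5] * 100       # hypothetical per-word conditional probabilities
product = 1.0
logSum = 0.0
for p in probs:
    product *= p           # the true product is 1e-500, far below float range, so this underflows to 0.0
    logSum += math.log(p)  # stays finite: 100 * ln(1e-5) is about -1151.3
print(product)             # 0.0
print(logSum)              # approximately -1151.29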

After obtaining the prior probabilities and conditional probabilities, we can classify a test sample. The idea is simple: for each class, add up the natural logarithms of the conditional probabilities of the words contained in the test sample, then add the logarithm of that class's prior probability. The class with the larger total is the classification result.

import math

testWords = ['stupid','garbage']
testValueC0 = 0
testValueC1 = 0
# Look up the conditional probability of each test word, take the logarithm, and accumulate
for word in testWords:
    testValueC0 += math.log(psOfFsC0[vocabSet.index(word)])
    testValueC1 += math.log(psOfFsC1[vocabSet.index(word)])
# Add the logarithm of the prior probability of each class
testValueC0 += math.log(1-pC1)
testValueC1 += math.log(pC1)

print('C0: ',testValueC0)
print('C1: ',testValueC1)
# Compare the two values; the larger one gives the predicted class
if(testValueC0 > testValueC1):
    print("belongs to C0")
else:
    print("belongs to C1")

The printout is as follows:

C0:  -7.20934025660291
C1:  -4.702750514326955
belongs to C1

The C1 value is clearly greater than the C0 value, so the test sample is judged to belong to class C1. Common sense agrees: 'stupid', 'garbage' is obviously a negative comment, which matches the test result.
If we change the test sample to an obviously positive comment, ['love','my','dalmation'], the results are as follows.

C0:  -7.694848072384611
C1:  -9.826714493730215
belongs to C0

The printed results show that the C0 value is clearly greater than the C1 value, so the sample belongs to class C0, which again matches our common-sense judgment.
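
For repeated experiments, the comparison above can be wrapped in a small helper. The sketch below is only illustrative: the name classifyNB and the idea of skipping out-of-vocabulary words are my additions, not part of the original code.

import math

def classifyNB(testWords, vocabSet, psOfFsC0, psOfFsC1, pC1):
    # Accumulate the log prior plus the log conditional probabilities for each class
    valueC0 = math.log(1 - pC1)
    valueC1 = math.log(pC1)
    for word in testWords:
        if word in vocabSet:          # skip words that never appeared in the corpus
            valueC0 += math.log(psOfFsC0[vocabSet.index(word)])
            valueC1 += math.log(psOfFsC1[vocabSet.index(word)])
    return 0 if valueC0 > valueC1 else 1

print(classifyNB(['stupid','garbage'], vocabSet, psOfFsC0, psOfFsC1, pC1))    # prints 1 for the example above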

Summary

The naive Bayesian classifier has a clear principle, a simple implementation, and clear effectiveness, and can be used to implement natural language classification. In a follow-up, we will explore the validity and accuracy of naive Bayesian classification with different bag-of-words models on a larger corpus.
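
As a pointer for that follow-up, one common alternative to the bag-of-words encoding is a set-of-words encoding, which records only whether a word is present rather than how many times it occurs. A minimal sketch, assuming the same vocabList/inputSet interface as bagOfWords2VecMN above (the name setOfWords2Vec is illustrative):

def setOfWords2Vec(vocabList, inputSet):
    # Set-of-words variant: mark each vocabulary word as present (1) or absent (0)
    # instead of counting occurrences as bagOfWords2VecMN does
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec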


Posted by keldorn on Tue, 21 Jan 2020 17:18:38 -0800