Implementing a Bayesian classifier by hand
Introduction
The naive Bayes classifier, as its name implies, is a classifier based on Bayes' theorem: the posterior probability of a class is rewritten as the product of a prior probability and a set of conditional probabilities, and a sample is classified by comparing the size of that product across the different categories. Unlike many other classifiers, the naive Bayes classifier has no training process in the strict sense; it only needs to compute the relevant probabilities. Bayesian classifiers are well suited to natural language classification. The following describes in detail how to implement a Bayesian classifier by hand, using English review text classification as the example.
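Stated in the notation used later in this article, the rule just described is: for a sample with features F1, F2, ..., Fn and a class C,

    P(C | F1, ..., Fn) ∝ P(C) · p(F1|C) · p(F2|C) · ... · p(Fn|C)

and the sample is assigned to whichever class gives the larger product. The "naive" part is the assumption that the features are conditionally independent given the class, which is what allows the posterior to factor this way.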
Code
First, introduce a corpus containing comments from English websites. (To process Chinese instead, the text would first have to be segmented into a list of words; a short aside on that step follows the corpus below.)
postingList = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
]
classVec = [0, 1, 0, 1, 0, 1]
Among them, "0" means positive evaluation and "1" means negative evaluation.In practice, the abundance of the asleep corpus can further refine the classification criteria, such as positive rating being classified into 5 levels and negative rating being classified into 5 levels.
Since computers cannot process text directly, the text needs to be encoded first. Here we use a simple bag-of-words model: build a vocabulary that contains every word appearing in the corpus. A convenient implementation uses Python's set type, which cannot contain duplicate elements and therefore deduplicates automatically.
dataSet = postingList
vocabSet = set()  # Initialize an empty set
for document in dataSet:
    # | is the set union operator; since a set cannot contain duplicate
    # elements, identical words from the two sets are merged automatically
    vocabSet = vocabSet | set(document)
vocabSet = list(vocabSet)  # A list is easier to work with than a set
To make them easier for the computer to process, the training samples also need to be encoded. Based on the vocabulary obtained above, each training sample is transformed into a fixed-length feature vector whose length equals the length of the vocabulary.
import numpy as np

def bagOfWords2VecMN(vocabList, inputSet):
    # The length of the feature vector equals the length of the vocabulary
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # Increment the count at the word's position by one
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Convert the dataset into bag-of-words feature vectors
trainMat = []
for postinDoc in postingList:
    res = bagOfWords2VecMN(vocabSet, postinDoc)
    trainMat.append(res)

# Convert the lists to arrays for easier training
trainMatrix = np.array(trainMat)
trainCategory = np.array(classVec)
print(trainMatrix)
After these steps, the training samples take the following form: each feature vector is as long as the vocabulary, and each number is the count of the corresponding vocabulary word in that sample. (Because the vocabulary is built from a set, the column order may vary between runs.)
[[0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1]
 [1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0]
 [0 1 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Once these steps are complete, the probabilities can be calculated. The prior probability is computed first. For a discrete class variable, the prior is simply the relative frequency of each class in the training set; for example, P(C=1) equals the number of class-1 samples divided by the total number of samples.
numTrainDocs = len(trainMatrix)  # Number of samples
numWords = len(trainMatrix[0])   # Number of features per sample
pC1 = sum(trainCategory) / float(numTrainDocs)  # P(C=1), the prior probability of class 1
The next step is to compute the conditional probabilities for each class. For class C1 this means the 32 values p(F1|C1), p(F2|C1), ..., p(F32|C1), where each conditional probability equals the number of occurrences of that word in the class divided by the total number of word occurrences in that class. Two practical problems arise. First, if some word never appears in a class, its probability is 0 and the whole product immediately becomes 0. Second, each of the probabilities p(F1|C1), p(F2|C1), ..., p(F32|C1) may be small, so multiplying them together can cause numerical underflow. The solution used here is to initialize every word count to 1 and every per-class word total to 2, and to take the natural logarithm of the product. This avoids zero values, keeps the numbers in a manageable range, and turns the continued multiplication into a sum.
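Concretely, with that initialization the estimate computed below is

    p(Fi | C) = (count of word i in class C + 1) / (total word count in class C + 2)

so every probability stays strictly positive and the product can never collapse to zero.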
# Total number of words in each class
numWordC0 = 2  # Initialize the per-class totals to 2 to match the add-one counts below
numWordC1 = 2
for i in range(numTrainDocs):
    if trainCategory[i] == 0:
        for num in trainMatrix[i]:
            numWordC0 += num
    else:
        for num in trainMatrix[i]:
            numWordC1 += num

# Number of occurrences of each word in each class
numWordFsC0 = np.ones(numWords)  # Initialize every word count to 1 to avoid zeros
numWordFsC1 = np.ones(numWords)
for j in range(numTrainDocs):
    for k in range(numWords):
        if trainCategory[j] == 0:
            numWordFsC0[k] += trainMatrix[j][k]
        else:
            numWordFsC1[k] += trainMatrix[j][k]

# Conditional probability of each word given each class
psOfFsC0 = numWordFsC0 / numWordC0
psOfFsC1 = numWordFsC1 / numWordC1
After obtaining the prior and conditional probabilities, we can classify a test sample. The idea is simple: for each class, add up the natural logarithms of the conditional probabilities of the words contained in the test sample, then add the logarithm of that class's prior probability. The class with the larger total is the classification result.
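Written out, the score computed for each class C is

    score(C) = ln P(C) + ln p(F1|C) + ln p(F2|C) + ...  (one term for every test word found in the vocabulary)

and the predicted class is the one with the larger score, which is exactly what the following code does.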
import math

testWords = ['stupid', 'garbage']
testValueC0 = 0
testValueC1 = 0
# Sum the log conditional probabilities of each word in the test sample
for word in testWords:
    testValueC0 += math.log(psOfFsC0[vocabSet.index(word)])
    testValueC1 += math.log(psOfFsC1[vocabSet.index(word)])
# Add the log prior probability of each class
testValueC0 += math.log(1 - pC1)
testValueC1 += math.log(pC1)
print('C0: ', testValueC0)
print('C1: ', testValueC1)
# The larger value determines the predicted class
if testValueC0 > testValueC1:
    print("belongs to C0")
else:
    print("belongs to C1")
The printout is as follows:
C0: -7.20934025660291
C1: -4.702750514326955
belongs to C1
The C1 value is clearly greater than the C0 value, so the test sample is judged to belong to class C1. Common sense agrees: 'stupid', 'garbage' is obviously a negative evaluation, which matches the test result.
If we change the test sample to an obviously positive review, ['love','my','dalmation'], the results are as follows.
C0: -7.694848072384611
C1: -9.826714493730215
belongs to C0
The printed results show that the C0 value is clearly greater than the C1 value, so the sample belongs to class C0, again matching our common-sense judgment.
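For completeness, the per-word loop above can also be written in vectorized form. The following is a minimal sketch built on the arrays already defined in this article; logPsC0, logPsC1, classify and testVec are new names introduced purely for illustration:

# Precompute log conditional probabilities once, then classify with dot products
logPsC0 = np.log(psOfFsC0)
logPsC1 = np.log(psOfFsC1)

def classify(wordVec):
    # wordVec is a bag-of-words count vector of length numWords
    score0 = np.dot(wordVec, logPsC0) + math.log(1 - pC1)
    score1 = np.dot(wordVec, logPsC1) + math.log(pC1)
    return 0 if score0 > score1 else 1

testVec = np.array(bagOfWords2VecMN(vocabSet, ['stupid', 'garbage']))
print(classify(testVec))  # prints 1, the negative class, matching the result above

One advantage of this form is that unknown test words are simply ignored by bagOfWords2VecMN, instead of raising an error in vocabSet.index.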
Summary
The naive Bayes classifier is built on a clear principle, is simple to implement, and produces convincing results; it can effectively handle natural language classification tasks. In a follow-up, we will explore the validity and accuracy of naive Bayes classification with different bag-of-words models on a larger corpus.