Q1 : Compute the prior for the two classes + and -, and the likelihoods for each word given the class (leave in the form of fractions).
import numpy as np
from fractions import Fraction
import nltk

sentences = ["just plain boring",
             "entirely predictable and lacks energy",
             "no surprises and very few laughs",
             "very powerful",
             "the most fun film of the summer"]
labels = ["-", "-", "-", "+", "+"]
test = "predictable with no originality"

# Vocabulary: every distinct word type in the training sentences
words = []
for sentence in sentences:
    words = words + nltk.word_tokenize(sentence)
V = list(set(words))

# Total number of word tokens in the positive and the negative documents
num_pos = sum([len(nltk.word_tokenize(x)) * (l == '+') for (x, l) in zip(sentences, labels)])
num_neg = sum([len(nltk.word_tokenize(x)) * (l == '-') for (x, l) in zip(sentences, labels)])

# Token lists per class, so that .count() below counts whole words rather than substrings
sentences_neg = nltk.word_tokenize(sentences[0] + " " + sentences[1] + " " + sentences[2])
sentences_pos = nltk.word_tokenize(sentences[3] + " " + sentences[4])

# Prior probability of the + class, i.e. the proportion of documents labelled +
print(Fraction(labels.count("+"), len(labels)))
2/5
# Prior probability of the - class, i.e. the proportion of documents labelled -
print(Fraction(labels.count("-"), len(labels)))
3/5
For the function p(x|θ), x denotes a data point and θ the parameters of the model. If θ is known and fixed while x varies, the function is called a probability function: it describes how likely different sample points x are. If x is known and fixed while θ varies, the function is called the likelihood function: it describes how probable the observed sample point x is under different model parameters.
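A minimal sketch of this distinction, using a toy Bernoulli model that is not part of the exercise data: the same expression p(x|θ) = θ^x (1-θ)^(1-x) is read once as a function of x and once as a function of θ.

# Toy Bernoulli model, only to illustrate probability vs. likelihood
def p(x, theta):
    return theta ** x * (1 - theta) ** (1 - x)

# Probability: theta fixed at 0.3, x varies over the sample space {0, 1}
print([p(x, 0.3) for x in (0, 1)])                  # [0.7, 0.3] -- sums to 1 over x

# Likelihood: the observation x = 1 is fixed, theta varies
print([p(1, theta) for theta in (0.1, 0.3, 0.9)])   # [0.1, 0.3, 0.9] -- need not sum to 1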
# Add-1 (Laplace) smoothed likelihood of each word given each class
for word in V:
    pn = sentences_pos.count(word)
    nn = sentences_neg.count(word)
    print('P(' + word + '|+) = ', Fraction(pn + 1, len(V) + num_pos))
    print('P(' + word + '|-) = ', Fraction(nn + 1, len(V) + num_neg))
P(fun|+) = 2/29
P(fun|-) = 1/34
P(surprises|+) = 1/29
P(surprises|-) = 1/17
P(of|+) = 2/29
P(of|-) = 1/34
P(very|+) = 2/29
P(very|-) = 1/17
P(most|+) = 2/29
P(most|-) = 1/34
P(summer|+) = 2/29
P(summer|-) = 1/34
P(few|+) = 1/29
P(few|-) = 1/17
P(energy|+) = 1/29
P(energy|-) = 1/17
P(lacks|+) = 1/29
P(lacks|-) = 1/17
P(and|+) = 1/29
P(and|-) = 3/34
P(film|+) = 2/29
P(film|-) = 1/34
P(entirely|+) = 1/29
P(entirely|-) = 1/17
P(plain|+) = 1/29
P(plain|-) = 1/17
P(no|+) = 1/29
P(no|-) = 1/17
P(boring|+) = 1/29
P(boring|-) = 1/17
P(the|+) = 3/29
P(the|-) = 1/34
P(powerful|+) = 2/29
P(powerful|-) = 1/34
P(predictable|+) = 1/29
P(predictable|-) = 1/17
P(just|+) = 1/29
P(just|-) = 1/17
P(laughs|+) = 1/29
P(laughs|-) = 1/17
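As a worked check of two entries, using only the counts already in the training data: the vocabulary has |V| = 20 word types, the positive documents contain 9 tokens and the negative documents 14. "the" occurs twice in the positive class, so P(the|+) = (2+1)/(9+20) = 3/29, and "and" occurs twice in the negative class, so P(and|-) = (2+1)/(14+20) = 3/34, matching the table above.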
Q2 : Then compute whether the sentence in the test set is of class positive or negative (you may need a computer for this final computation).
twords = nltk.word_tokenize(test)

# Score for the + class: prior times the smoothed likelihood of every test word
# that appears in the training vocabulary (unknown words are skipped)
result = Fraction(labels.count("+"), len(labels))
print('P(+|' + test + ') = ', result, end='')
for word in twords:
    if word in words:
        print(" *", Fraction(sentences_pos.count(word) + 1, len(V) + num_pos), end='')
        result *= Fraction(sentences_pos.count(word) + 1, len(V) + num_pos)
print(" = ", float(result))

# Score for the - class
result = Fraction(labels.count("-"), len(labels))
print('P(-|' + test + ') = ', result, end='')
for word in twords:
    if word in words:
        print(" *", Fraction(sentences_neg.count(word) + 1, len(V) + num_neg), end='')
        result *= Fraction(sentences_neg.count(word) + 1, len(V) + num_neg)
print(" = ", float(result))
P(+|predictable with no originality) = 2/5 * 1/29 * 1/29 = 0.0004756242568370987
P(-|predictable with no originality) = 3/5 * 1/17 * 1/17 = 0.0020761245674740486
Because P(-|predictable with no originality) is greater than P(+|predictable with no originality), the test sentence is classified as negative. Note that "with" and "originality" do not appear in the training vocabulary, so they are dropped from both products.
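The two numbers above are only proportional to the posteriors, since the shared factor P(predictable with no originality) is never divided out. A small sketch, reusing the score values printed above, that normalizes them into actual posterior probabilities:

from fractions import Fraction

# The two unnormalized scores printed in the cell above
score_pos = Fraction(2, 5) * Fraction(1, 29) * Fraction(1, 29)
score_neg = Fraction(3, 5) * Fraction(1, 17) * Fraction(1, 17)

# Dividing by their sum cancels the common P(test) factor
print(float(score_pos / (score_pos + score_neg)))   # P(+|test) ~= 0.186
print(float(score_neg / (score_pos + score_neg)))   # P(-|test) ~= 0.814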
Q3. Would using binary multinomial Naïve Bayes change anything?
# Binary (boolean) multinomial Naive Bayes: each word is counted at most once per document
words = []
for sentence in sentences:
    words = words + list(set(nltk.word_tokenize(sentence)))
V = list(set(words))

# Total number of positive and negative tokens after per-document deduplication
num_pos = sum([len(set(nltk.word_tokenize(x))) * (l == '+') for (x, l) in zip(sentences, labels)])
num_neg = sum([len(set(nltk.word_tokenize(x))) * (l == '-') for (x, l) in zip(sentences, labels)])

# Per-class token lists built from the deduplicated documents
sentences_neg = list(set(nltk.word_tokenize(sentences[0]))) + list(set(nltk.word_tokenize(sentences[1]))) + list(set(nltk.word_tokenize(sentences[2])))
sentences_pos = list(set(nltk.word_tokenize(sentences[3]))) + list(set(nltk.word_tokenize(sentences[4])))
# Same scoring code as in Q2, now applied to the binarized counts
twords = nltk.word_tokenize(test)

result = Fraction(labels.count("+"), len(labels))
print('P(+|' + test + ') = ', result, end='')
for word in twords:
    if word in words:
        print(" *", Fraction(sentences_pos.count(word) + 1, len(V) + num_pos), end='')
        result *= Fraction(sentences_pos.count(word) + 1, len(V) + num_pos)
print(" = ", float(result))

result = Fraction(labels.count("-"), len(labels))
print('P(-|' + test + ') = ', result, end='')
for word in twords:
    if word in words:
        print(" *", Fraction(sentences_neg.count(word) + 1, len(V) + num_neg), end='')
        result *= Fraction(sentences_neg.count(word) + 1, len(V) + num_neg)
print(" = ", float(result))
P(+|predictable with no originality) = 2/5 * 1/28 * 1/28 = 0.0005102040816326531
P(-|predictable with no originality) = 3/5 * 1/17 * 1/17 = 0.0020761245674740486
Since P(-|predictable with no originality) is still greater than P(+|predictable with no originality), the classification does not change. The only difference is on the positive side: the repeated "the" in "the most fun film of the summer" is now counted once, so the positive token total drops from 9 to 8 and each unseen-word likelihood rises from 1/29 to 1/28; the negative documents contain no repeated words, so their likelihoods are unchanged.
Q4: Why do you add |V| to the denominator in add-1 smoothing, instead of just counting the words in one class?
Add-1 smoothing adds 1 to the count of every word type in the vocabulary for each class, so the class total implicitly grows by |V|. Adding |V| to the denominator keeps the smoothed likelihoods P(w|c) = (count(w,c)+1)/(count(c)+|V|) normalized, i.e. summing to 1 over the vocabulary; the probability mass taken from words that did appear is redistributed to the unseen ones. If we divided only by the number of words in the class, the likelihoods would sum to more than 1 and would no longer form a probability distribution.
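A quick check of the normalization argument, reusing the V, num_pos, num_neg, sentences_pos and sentences_neg variables defined above (either the Q1 or the Q3 definitions give the same result):

from fractions import Fraction

# With |V| in the denominator the smoothed likelihoods sum to exactly 1 per class
print(sum(Fraction(sentences_pos.count(w) + 1, len(V) + num_pos) for w in V))   # 1
print(sum(Fraction(sentences_neg.count(w) + 1, len(V) + num_neg) for w in V))   # 1

# With only the class word count in the denominator they sum to more than 1
print(sum(Fraction(sentences_pos.count(w) + 1, num_pos) for w in V))            # > 1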
Q5: What would the answer to question 2 be without add-1 smoothing?
# Same computation as Q2, but with the unsmoothed (maximum likelihood) estimates
twords = nltk.word_tokenize(test)

result = Fraction(labels.count("+"), len(labels))
print('P(+|' + test + ') = ', result, end='')
for word in twords:
    if word in words:
        print(" *", Fraction(sentences_pos.count(word), num_pos), end='')
        result *= Fraction(sentences_pos.count(word), num_pos)
print(" = ", float(result))

result = Fraction(labels.count("-"), len(labels))
print('P(-|' + test + ') = ', result, end='')
for word in twords:
    if word in words:
        print(" *", Fraction(sentences_neg.count(word), num_neg), end='')
        result *= Fraction(sentences_neg.count(word), num_neg)
print(" = ", float(result))
P(+|predictable with no originality) = 2/5 * 0 * 0 = 0.0
P(-|predictable with no originality) = 3/5 * 1/14 * 1/14 = 0.003061224489795918

Without add-1 smoothing, "predictable" and "no" have zero count in the positive class, so the positive score collapses to 0. The sentence is still classified as negative, but now any word unseen in a class vetoes that class outright.