Probability calculations for a Naive Bayes NLP exercise

Q1 : Compute the prior for the two classes + and -, and the likelihoods for each word given the class (leave in the form of fractions).

from fractions import Fraction
import nltk  # nltk.word_tokenize needs the 'punkt' models: nltk.download('punkt')
sentences = ["just plain boring", "entirely predictable and lacks energy", "no surprises and very few laughs", "very powerful", "the most fun film of the summer"]
labels = ["-", "-", "-", "+","+"]
test = "predictable with no originality"
words = []
for sentence in sentences:
    words = words+nltk.word_tokenize(sentence)
V = list(set(words))
# Calculate the total number of positive and negative words
num_pos = sum([len(nltk.word_tokenize(x))*(l=='+') for (x, l) in zip(sentences, labels)])
num_neg = sum([len(nltk.word_tokenize(x))*(l=='-') for (x, l) in zip(sentences, labels)])
# Pool and tokenize the negative / positive training sentences, so that
# .count(word) below counts whole tokens rather than substring matches
sentences_neg = nltk.word_tokenize(" ".join(sentences[:3]))
sentences_pos = nltk.word_tokenize(" ".join(sentences[3:]))
# Prior probability of the + class, i.e. the fraction of training documents labelled +
print(Fraction(labels.count("+"), len(labels)))
2/5
# Prior probability of the - class, i.e. the fraction of training documents labelled -
print(Fraction(labels.count("-"), len(labels)))
3/5

For the function p(x|θ), x represents the data and θ the parameters of the model. If θ is known and fixed while x varies, the function is a probability function: it describes how probable the different possible outcomes x are under that model. If x is known and fixed (the observed data) while θ varies, the same function is called the likelihood function: it describes how plausible different parameter values θ are given the observed data.
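
A toy example, separate from the exercise, may make the distinction concrete. The sketch below uses a Bernoulli coin model p(x|θ), where θ is the probability of heads:

def p(x, theta):
    # Bernoulli model: p(x=1 | theta) = theta, p(x=0 | theta) = 1 - theta
    return theta if x == 1 else 1 - theta

# Fixed theta, varying x: a probability distribution over outcomes (sums to 1)
theta = 0.3
print([p(x, theta) for x in (0, 1)])                    # [0.7, 0.3]

# Fixed observed x, varying theta: the likelihood function over parameters
# (it is not a distribution over theta and need not sum to 1)
x_obs = 1
print([p(x_obs, theta) for theta in (0.1, 0.5, 0.9)])   # [0.1, 0.5, 0.9]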

# Add-1 (Laplace) smoothed likelihood P(word|class) for every word in the vocabulary
for word in V:
    pn = sentences_pos.count(word)
    nn = sentences_neg.count(word)
    print ('P('+word+'|+) = ', Fraction(pn+1, len(V)+num_pos))
    print ('P('+word+'|-) = ', Fraction(nn+1, len(V)+num_neg))
P(fun|+) =  2/29
P(fun|-) =  1/34
P(surprises|+) =  1/29
P(surprises|-) =  1/17
P(of|+) =  2/29
P(of|-) =  1/34
P(very|+) =  2/29
P(very|-) =  1/17
P(most|+) =  2/29
P(most|-) =  1/34
P(summer|+) =  2/29
P(summer|-) =  1/34
P(few|+) =  1/29
P(few|-) =  1/17
P(energy|+) =  1/29
P(energy|-) =  1/17
P(lacks|+) =  1/29
P(lacks|-) =  1/17
P(and|+) =  1/29
P(and|-) =  3/34
P(film|+) =  2/29
P(film|-) =  1/34
P(entirely|+) =  1/29
P(entirely|-) =  1/17
P(plain|+) =  1/29
P(plain|-) =  1/17
P(no|+) =  1/29
P(no|-) =  1/17
P(boring|+) =  1/29
P(boring|-) =  1/17
P(the|+) =  3/29
P(the|-) =  1/34
P(powerful|+) =  2/29
P(powerful|-) =  1/34
P(predictable|+) =  1/29
P(predictable|-) =  1/17
P(just|+) =  1/29
P(just|-) =  1/17
P(laughs|+) =  1/29
P(laughs|-) =  1/17

Q2 : Then compute whether the sentence in the test set is of class positive or negative (you may need a computer for this final computation).

twords = nltk.word_tokenize(test)
result = Fraction(labels.count("+"), len(labels))
print('P(+|'+test+') = ', result, end='')
for word in twords:
    if word in words:  # ignore test words never seen in training
        print(" *",Fraction(sentences_pos.count(word)+1, len(V)+num_pos),end='')
        result *= Fraction(sentences_pos.count(word)+1, len(V)+num_pos)
print(" = ", float(result))
result = Fraction(labels.count("-"), len(labels))
print('P(-|'+test+') = ', result, end='')
for word in twords:
    if word in words:
        print(" *",Fraction(sentences_neg.count(word)+1, len(V)+num_neg),end='')
        result *= Fraction(sentences_neg.count(word)+1, len(V)+num_neg)
print(" = ", float(result))
P(+|predictable with no originality) =  2/5 * 1/29 * 1/29 =  0.0004756242568370987
P(-|predictable with no originality) =  3/5 * 1/17 * 1/17 =  0.0020761245674740486
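
The two products above are already small, and with longer documents they would underflow floating point. A standard variant, not required for this exercise, is to add log probabilities instead of multiplying fractions. A minimal sketch, reusing the variables defined in Q1:

import math

def log_posterior(prior_count, class_tokens, denom):
    # log P(c) + sum of log P(w|c) over the test words seen in training
    score = math.log(prior_count / len(labels))
    for word in twords:
        if word in words:
            score += math.log((class_tokens.count(word) + 1) / denom)
    return score

print(log_posterior(labels.count("+"), sentences_pos, len(V) + num_pos))
print(log_posterior(labels.count("-"), sentences_neg, len(V) + num_neg))
# The class with the larger (less negative) score wins: again the - class.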

Because P(-|predictable with no originality) is greater than P(+|predictable with no originality), the test sentence is more likely to belong to the negative class.

Q3. Would using binary multinomial Naïve Bayes change anything?

# Binary multinomial NB: clip each word's count to at most 1 within a sentence
words = []
for sentence in sentences:
    words = words + list(set(nltk.word_tokenize(sentence)))
V = list(set(words))
# Total number of (de-duplicated) word tokens in the positive and negative classes
num_pos = sum([len(set(nltk.word_tokenize(x)))*(l=='+') for (x, l) in zip(sentences, labels)])
num_neg = sum([len(set(nltk.word_tokenize(x)))*(l=='-') for (x, l) in zip(sentences, labels)])
# Pool the de-duplicated tokens of the negative / positive training sentences
sentences_neg = [w for s in sentences[:3] for w in set(nltk.word_tokenize(s))]
sentences_pos = [w for s in sentences[3:] for w in set(nltk.word_tokenize(s))]
twords = nltk.word_tokenize(test)
result = Fraction(labels.count("+"), len(labels))
print('P(+|'+test+') = ', result, end='')
for word in twords:
    if word in words:
        print(" *",Fraction(sentences_pos.count(word)+1, len(V)+num_pos),end='')
        result *= Fraction(sentences_pos.count(word)+1, len(V)+num_pos)
print(" = ", float(result))
result = Fraction(labels.count("-"), len(labels))
print('P(-|'+test+') = ', result, end='')
for word in twords:
    if word in words:
        print(" *",Fraction(sentences_neg.count(word)+1, len(V)+num_neg),end='')
        result *= Fraction(sentences_neg.count(word)+1, len(V)+num_neg)
print(" = ", float(result))
P(+|predictable with no originality) =  2/5 * 1/28 * 1/28 =  0.0005102040816326531
P(-|predictable with no originality) =  3/5 * 1/17 * 1/17 =  0.0020761245674740486

Since P(-|predictable with no originality) is still greater than P(+|predictable with no originality), binarizing the counts does not change the classification; only the positive-class likelihoods change slightly (1/29 becomes 1/28).
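
As a cross-check, not part of the original solution, the same binary multinomial model can be reproduced with scikit-learn, assuming that library is available: CountVectorizer(binary=True) clips within-sentence counts to 0/1, and MultinomialNB(alpha=1.0) applies the same add-1 smoothing as above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = CountVectorizer(binary=True)          # 0/1 features per sentence
X_train = vec.fit_transform(sentences)
clf = MultinomialNB(alpha=1.0)              # add-1 smoothing
clf.fit(X_train, labels)
print(clf.predict(vec.transform([test])))   # expected: ['-']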

Q4: Why do you add |V| to the denominator in add-1 smoothing, instead of just using the number of words in one class?

Because add-1 smoothing adds 1 to the count of every word in the vocabulary, the class total grows by exactly |V|; dividing by count(class) + |V| keeps the smoothed likelihoods a proper probability distribution that sums to 1. Equivalently, giving unseen words a small amount of probability must discount the probability of the words that did appear, and adding |V| to the denominator is what performs that discounting. If we divided only by the number of words in the class, the likelihoods would sum to more than 1.
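
A quick numerical check of this, reusing the variables defined above (the Q1 and Q3 definitions both give the same result):

# The smoothed likelihoods of all vocabulary words sum to exactly 1 per class,
# precisely because |V| was added to each denominator
total_pos = sum(Fraction(sentences_pos.count(w) + 1, len(V) + num_pos) for w in V)
total_neg = sum(Fraction(sentences_neg.count(w) + 1, len(V) + num_neg) for w in V)
print(total_pos, total_neg)   # 1 1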

Q5: What would the answer to question 2 be without add-1 smoothing?

twords = nltk.word_tokenize(test)
result = Fraction(labels.count("+"), len(labels))
print('P(+|'+test+') = ', result, end='')
for word in twords:
    if word in words:
        print(" *",Fraction(sentences_pos.count(word), num_pos),end='')
        result *= Fraction(sentences_pos.count(word), num_pos)
print(" = ", float(result))
result = Fraction(labels.count("-"), len(labels))
print('P(-|'+test+') = ', result, end='')
for word in twords:
    if word in words:
        print(" *",Fraction(sentences_neg.count(word), num_neg),end='')
        result *= Fraction(sentences_neg.count(word), num_neg)
print(" = ", float(result))
P(+|predictable with no originality) =  2/5 * 0 * 0 =  0.0
P(-|predictable with no originality) =  3/5 * 1/14 * 1/14 =  0.003061224489795918

Without add-1 smoothing, "predictable" and "no" have zero counts in the positive class, so P(+|test) collapses to 0 and the sentence can only be classified as negative.
