Machine Learning (Zhou Zhihua) Watermelon Book Chapter 7 Exercise 7.3—— Python Implementation

Keywords: curl Python Programming

Machine Learning (Zhou Zhihua) Watermelon Book Chapter 7 Exercise 7.3-Python Implementation

  • Experimental topics

Laplacian modified naive Bayesian classifier is implemented by trial programming. The watermelon data set 3.0A is used as training set to discriminate P.151 "test 1" samples.

  • Experimental principle

Laplacian correction avoids the problem that the probability estimate is zero due to insufficient training set samples.

Naive Bayesian classifier:

Naive Bayesian Classifier Expressions

Laplace's revised prior-like probability formula

Laplacian modified conditional probability formula for discrete attributes

Conditional probability formula for continuous attributes

  • Experimental process

Data Set Acquisition

Get the watermelon data set 3.0A in the book and coexist as data_3 a.txt

Density, sugar content, melon
 0.6970000000000001, 0.46, yes
 0.774, 0.376, yes
 0.634, 0.264, yes
 0.608, 0.318, yes
 0.556, 0.215, yes
 0.4029999999999999997, 0.237, yes
 0.48100000000004,0.149,
0.43700000000006,0.21100000000000002
 0.665999999999999999, 0.091, no
 0.243, 0.267, no
 0.245, 0.057, no
 0.3429999999999999997, 0.099, no
 0.639, 0.161, no
 0.657, 0.198, no
 0.36, 0.37, no
 0.593, 0.042, no
 0.7190000000000001, 0.10300000000000001, no

Algorithm implementation

Data Definition, Defining Attributes and Their Value Categories, Class Label Categories

Read data function

Generating prediction data

Generate a set of samples of different label classes by label class

Computing Class Prior Probability

Computing Conditional Probability of Discrete Attributes

Computing Conditional Probability of Continuous Attributes

Calculate the probabilistic reliability of determining the test sample as label

Predictive test sample labeling

main function, calling the above function

  • experimental result

  • List of procedures:

import math
import numpy as np
import pandas as pd

D_keys = {
	'Color and lustre': ['Dark green', 'Jet black', 'plain'], 
	'Pedicle': ['Curl up', 'Stiff', 'Curl up'], 
	'stroke': ['Crisp and crisp', 'Dreary', 'Voiced sound'], 
	'texture': ['Slightly paste', 'vague', 'clear'], 
	'Umbilicus': ['Sunken', 'Slightly concave', 'flat'], 
	'Touch': ['Soft sticky', 'Hard slip'], 
}
Class, labels = 'Good melon', ['yes', 'no']

# Read data
def loadData(filename):
	dataSet = pd.read_csv(filename)
	dataSet.drop(columns=['number'], inplace=True)
	return dataSet

# Configuration Test 1 Data
def load_data_test():
	array = ['Dark green', 'Curl up', 'Voiced sound', 'clear', 'Sunken', 'Hard slip', 0.697, 0.460, '']
	dic = {a: b for a, b in zip(dataSet.columns, array)}
	return dic

def calculate_D(dataSet):
	D = []
	for label in labels:
		temp = dataSet.loc[dataSet[Class]==label]
		D.append(temp)

	return D

def calculate_Pc(Dc, D):
	D_size = D.shape[0]
	Dc_size = Dc.shape[0]
	N = len(labels)
	return (Dc_size+1) / (D_size+N)

def calculate_Pcx_D(key, value, Dc):
	Dc_size = Dc.shape[0]
	Dcx_size = Dc[key].value_counts()[value]
	Ni = len(D_keys[key])
	return (Dcx_size+1) / (Dc_size+Ni)

def calculate_Pcx_C(key, value, Dc):
	mean, var = Dc[key].mean(), Dc[key].var()
	exponent = math.exp(-(math.pow(value-mean, 2) / (2*var)))
	return (1 / (math.sqrt(2*math.pi*var)) * exponent)

def calculate_probability(label, Dc, dataSet, data_test):
	prob = calculate_Pc(Dc, dataSet)
	for key in Dc.columns[:-1]:
		value = data_test[key]
		if key in D_keys:
			prob *= calculate_Pcx_D(key, value, Dc)
		else:
			prob *= calculate_Pcx_C(key, value, Dc)

	return prob

def predict(dataSet, data_test):
	# mu, sigma = dataSet.mean(), dataSet.var()
	Dcs = calculate_D(dataSet)
	max_prob = -1
	for label, Dc in zip(labels, Dcs):
		prob = calculate_probability(label, Dc, dataSet, data_test)

		if prob > max_prob:
			best_label = label
			max_prob = prob

		print(label, prob)

	return best_label


if __name__ == '__main__':
	# Read data
	filename = 'data_3.txt'
	dataSet = loadData(filename)
	data_test = load_data_test()
	label = predict(dataSet, data_test)
	print('Forecast results:', label)

 

 

 

Posted by dkphp2 on Mon, 22 Apr 2019 18:12:34 -0700