Chapter IV decision tree
1 basic process
2 division selection
As the division process goes on, we want the samples contained in each branch node of the decision tree to belong to the same class as far as possible; in other words, the "purity" of the nodes should keep increasing.
2.1 information gain
2.1.1 what is information entropy
https://www.zhihu.com/question/22178202
Entropy: the uncertainty about a thing.
Information: something that eliminates uncertainty.
What information does: adjusts probabilities; eliminates interference; settles the situation outright (for example, the melon seller says the melon is ripe and sweet).
Noise: something that does not eliminate anyone's uncertainty about a thing.
Data = information + noise
2.1.2 how to quantify entropy - equal probability
Take a simple uncertain event as the unit: one fair coin toss is recorded as 1 bit.
For example, eight equally likely cases are equivalent to tossing a coin three times (2^3 possible cases), so the entropy is 3 bits.
For example, ten equally likely cases are equivalent to tossing a coin log2(10) times (2^log2(10) = 10 possible cases), so the entropy is log2(10) ≈ 3.32 bits; the logarithm is base 2.
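A quick numeric check of these two examples (a small sketch with numpy, not part of the original notes):

import numpy as np

print(np.log2(8))    # 3.0  -> eight equally likely cases carry 3 bits of entropy
print(np.log2(10))   # ~3.32 -> ten equally likely cases carry log2(10) bits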
2.1.3 how to quantify entropy - unequal probability
Calculation formula: Ent(D) = -Σ_k p_k * log2(p_k), where p_k is the proportion of samples of class k in data set D (for equal probabilities this reduces to log2 of the number of cases).
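The formula translates directly into code; a minimal sketch (the helper name ent is mine, not from the original notes):

import numpy as np

def ent(probs):
    # Information entropy, in bits, of a discrete probability distribution
    p = np.array(probs, dtype=float)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

print(ent([0.5, 0.5]))               # 1.0 bit: a fair coin toss
print(ent([8/17, 9/17]))             # ~0.998: the root node of the melon example below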
2.1.4 how to quantify information
The amount of information is the difference in entropy before and after the information is known.
Pre-information entropy (four equally likely outcomes): log2(4) = 2 bits.
Post-information entropy (probabilities adjusted to 1/2, 1/6, 1/6, 1/6): 3 * 1/6 * log2(6) + 1/2 * log2(2) ≈ 1.79 bits, so the information received is 2 - 1.79 ≈ 0.21 bit.
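Checking these numbers with the ent helper sketched above:

pre  = ent([1/4, 1/4, 1/4, 1/4])     # 2.0 bits before the information
post = ent([1/2, 1/6, 1/6, 1/6])     # ~1.79 bits after the information
print(pre - post)                    # ~0.21 bit of information gained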
2.1.5 summary
2.1.6 examples
# Calculate the information gain of the color attribute
import numpy as np

# Root node information entropy, based on the proportion of good and bad melons
Good = 8
Bad = 9
Total = 17
EntD = -8/17 * np.log2(8/17) - 9/17 * np.log2(9/17)

# Count the good and bad melons for each color
Green_good = 3
Green_bad = 3
Black_good = 4
Black_bad = 2
White_good = 1
White_bad = 4

# Probability of good/bad melons within each color, and of each color overall
P_Green_good = 3/6
P_Green_bad = 3/6
P_Green = 6/17
P_Black_good = 4/6
P_Black_bad = 2/6
P_Black = 6/17
P_White_good = 1/5
P_White_bad = 4/5
P_White = 5/17

# Information entropy of each branch node
EntD_Green = -P_Green_good*np.log2(P_Green_good) - P_Green_bad*np.log2(P_Green_bad)
EntD_White = -P_White_good*np.log2(P_White_good) - P_White_bad*np.log2(P_White_bad)
EntD_Black = -P_Black_good*np.log2(P_Black_good) - P_Black_bad*np.log2(P_Black_bad)

# Weighted average gives the entropy after splitting on color
EntD_color = P_Green*EntD_Green + P_White*EntD_White + P_Black*EntD_Black

# Information gain of the color attribute
Gain = EntD - EntD_color
Gain
0.10812516526536531
The information gain of the other attributes is calculated in the same way.
The root node is split on the attribute with the largest information gain, which is texture.
Within each texture branch, the information gain of the remaining attributes is computed in the same manner, which yields the decision tree.
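The per-attribute calculation can be wrapped in a generic helper; the sketch below (the function name information_gain and the (good, bad) count format are my own, reusing the ent helper from above) reproduces the 0.108 result for color:

def information_gain(branches):
    # branches: one (good, bad) count pair per value of the attribute
    total = sum(g + b for g, b in branches)
    root = ent([sum(g for g, _ in branches) / total,
                sum(b for _, b in branches) / total])
    weighted = sum((g + b) / total * ent([g / (g + b), b / (g + b)])
                   for g, b in branches)
    return root - weighted

print(information_gain([(3, 3), (4, 2), (1, 4)]))   # color attribute: ~0.108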
2.2 gain ratio
The information gain criterion has a preference for attributes with a large number of values.
To solve this problem, Quinlan (1993) used the gain ratio to select the optimal partition attribute. It is defined as Gain_ratio(D, a) = Gain(D, a) / IV(a), where IV(a) = -Σ_v (|D^v|/|D|) * log2(|D^v|/|D|) is the intrinsic value of attribute a; the more values a takes, the larger IV(a) tends to be.
Note that C4.5 does not simply pick the attribute with the largest gain ratio: it first finds, among the candidate partition attributes, those whose information gain is above average, and then selects from these the one with the highest gain ratio.
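A sketch of the gain ratio, reusing the helpers above (names are mine, not from the notes); for the color attribute it comes out to roughly 0.108 / 1.58 ≈ 0.068:

def gain_ratio(branches):
    total = sum(g + b for g, b in branches)
    iv = ent([(g + b) / total for g, b in branches])   # intrinsic value IV(a)
    return information_gain(branches) / iv

print(gain_ratio([(3, 3), (4, 2), (1, 4)]))            # color attribute: ~0.068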
2.3 Gini index
2.3.1 classification tree: Gini index minimum criterion.
Data set Gini index: Gini(D) = 1 - Σ_k p_k^2, where p_k is the proportion of class k samples in D.
Attribute Gini index: Gini_index(D, a) = Σ_v (|D^v|/|D|) * Gini(D^v), i.e. the Gini index of each branch weighted by its share of the samples.
example
# Gini index of a data set: a two-class example
def gini_index_single(a, b):
    single_gini = 1 - ((a/(a+b))**2) - ((b/(a+b))**2)
    return single_gini

print(gini_index_single(105, 39), gini_index_single(130, 14))
# The higher the purity, the smaller the Gini index
0.3949652777777779 0.17554012345679013
# Attribute Gini index: weighted sum of the Gini indices of the two child data sets
def gini_index(a, b, c, d):
    zuo = gini_index_single(a, b)   # Gini index of the left branch
    you = gini_index_single(c, d)   # Gini index of the right branch
    weighted_gini = zuo*((a+b)/(a+b+c+d)) + you*((c+d)/(a+b+c+d))
    return weighted_gini

gini_index(105, 39, 34, 125)
0.36413900824044665
gini_index(37, 127, 100, 33)
0.3600300949466547
gini_index(92, 31, 45, 129)
0.38080175646758213
Select the attribute with the smallest Gini index for the first split, namely Good Blood Circulation (0.360).
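The same choice can be made programmatically from the three values computed above (a small sketch):

splits = [gini_index(105, 39, 34, 125),
          gini_index(37, 127, 100, 33),
          gini_index(92, 31, 45, 129)]
print(min(splits))   # ~0.360, the Good Blood Circulation split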
Continue with the second split.
gini_index_single(37,127)
0.3494199881023201
gini_index(13, 98, 24, 29)
0.30011649938019014
gini_index(24, 25, 13, 102)
0.2899430822169802
From the calculations above, the Gini index of Blocked Arteries is 0.29, which is lower than that of Chest Pain (0.30); the smaller the Gini index, the better.
Both values are also lower than 0.35, the Gini index of this node before splitting (gini_index_single(37, 127) above), so a second split is worthwhile.
Therefore Blocked Arteries is taken as the second split attribute, giving:
gini_index_single(24, 25)
0.4997917534360683
gini_index(17,3,7,22)
0.3208304011259677
Since 0.32 < 0.50, splitting further reduces the impurity, so the third split should be carried out.
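Written with the functions above, the stopping rule for this node is simply (a sketch, not part of the original notes):

node_gini  = gini_index_single(24, 25)   # ~0.50, impurity of the node as it stands
split_gini = gini_index(17, 3, 7, 22)    # ~0.32, weighted impurity after the split
print(split_gini < node_gini)            # True, so the third split is made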
The final split results are as follows: