Machine learning practice - decision tree

Keywords: Algorithm Machine Learning Decision Tree

1, Construction of decision tree

decision tree is a kind of common machine learning algorithm. It makes decisions based on tree structure. Step by step from the root node to the leaf node (decision). All data will eventually fall to leaf nodes, which can be classified or regressed.

Decision tree
Advantages: the calculation complexity is not high, the output result is easy to understand, is not sensitive to the loss of intermediate value, and can process irrelevant feature data.
Disadvantages: over matching may occur.
Applicable data types: numerical type and nominal type.

When constructing a decision tree, the first problem we need to solve is which feature on the current data set plays a decisive role in dividing data classification. In order to find the decisive features and divide the best results, we must evaluate each feature. After the test, the original data is divided into several data subsets. These data subsets are distributed on all branches of the first decision point. If the data under a branch belongs to the same type, the data set will be further divided here and correctly classified. If the data in the data subset does not belong to the same type, you need to repeat the process of dividing the data subset. The algorithm of how to divide the data subset is the same as that of the original data set, until all data with the same type are in one data subset.

General process of decision tree
(1) Collect data: any method can be used.
(2) Prepare data: the tree construction algorithm is only applicable to nominal data, so numerical data must be discretized.
(3) Analyze data: any method can be used. After constructing the tree, we should check whether the graph meets the expectations.
(4) Training algorithm: construct the data structure of the tree.
(5) Test algorithm: use the experience tree to calculate the error rate.
(6) Using algorithm: this step can be applied to any supervised learning algorithm, and using decision tree can better understand the internal meaning of data.

1. Information gain

The general principle of dividing data sets is to make disordered data more orderly

The change of information before and after dividing the data set is called information gain. Knowing how to calculate the information gain, we can calculate the information gain obtained by dividing the data set by each eigenvalue. The feature with the highest information gain is the best choice.

Before we can evaluate which data division method is the best data division, we must learn how to calculate the information gain. The measurement of set information is called Shannon entropy or entropy for short.

Entropy is defined as the expected value of information. Before clarifying this concept, we must know the definition of information. If the thing to be classified may be divided into multiple classifications, the symbol x i x_i xi's information is defined as
l ( x i ) = − log ⁡ 2 p ( x i ) l(x_i) = -\log_2p(x_i) l(xi​)=−log2​p(xi​)
among p ( x i ) p(x_i) p(xi) is the probability of selecting reclassification

In order to calculate entropy, we need to calculate the expected value of information contained in all possible values of all categories, which is obtained by the following formula:
H = − ∑ i = 1 n p ( x i ) log ⁡ 2 p ( x i ) H = - \sum_{i = 1}^n p(x_i)\log_2p(x_i) H=−i=1∑n​p(xi​)log2​p(xi​)
Where n is the number of classifications.

When p ( x i ) = 0 p(x_i) = 0 p(xi) = 0 or p ( x i ) = 1 p(x_i) = 1 When p(xi) = 1, H = 0 H = 0 H=0, random variables have no uncertainty at all.
When p ( x i ) = 0.5 p(x_i) = 0.5 When p(xi) = 0.5, H = 1 H = 1 H=1, the uncertainty of random variables is the largest

Calculate Shannon entropy for a given dataset:

    dataSet - data set
    shannonEnt - Returns the Shannon entropy calculated for a given dataset
def calcShannonEnt(dataSet):
    # Returns the number of rows in the dataset
    numEntries = len(dataSet)
    # Save a dictionary of the number of occurrences of each label
    labelCounts = {}
    # Each group of eigenvectors is counted
    for featVec in dataSet:
        # Extract label information
        currentLabel = featVec[-1]
        # If the label is not put into the dictionary of statistical times, add it
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        # Label count
        labelCounts[currentLabel] += 1
    # Shannon entropy
    shannonEnt = 0.0
    # Calculate Shannon entropy
    for key in labelCounts:
        # The probability of selecting the label
        prob = float(labelCounts[key])/numEntries
        # Calculation of Shannon entropy by formula
        shannonEnt -= prob * log(prob, 2)
    # Return Shannon entropy
    return shannonEnt

First, calculate the total number of instances in the dataset. Then, create a data dictionary whose key value is the value of the last column. If the current key value does not exist, expand the dictionary and add the current key value to the dictionary. Each key value records the number of occurrences of the current category. Finally, the occurrence frequency of all class labels is used to calculate the probability of class occurrence. We will use this probability to calculate Shannon entropy and count the times of all class labels.

Marine biological data:

Can we survive without coming to the surfaceAre there finsBelonging to fish

Create a simple fish identification dataset from the table above

    dataSet - data set
    labels - Classification properties
def createDataSet():
    # data set
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    # Classification properties
    labels = ['no surfacing', 'flippers']
    # Return dataset and classification properties
    return dataSet, labels
if __name__ == '__main__':
    dataSet, labels = createDataSet()
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

The higher the entropy, the more mixed data. We can add more classifications to the data set and observe how the entropy changes. Add a third category named may to test the change of entropy:

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    dataSet[0][-1] = 'maybe'
[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

The calculated entropy increases with the increase of classification. After obtaining the entropy, we can divide the data set according to the method of obtaining the maximum information gain.

2. Divide data sets

In addition to measuring the information entropy, the classification algorithm also needs to divide the data set and measure the entropy of the divided data set, so as to judge whether the data set is divided correctly. The information entropy is calculated once for the result of dividing the data set according to each feature, and then it is judged which feature is the best way to divide the data set.

Divide the data set according to the given characteristics:

    dataSet - Data set to be divided
    axis - Characteristics of partitioned data sets
    value - The value of the feature to be returned
    retDataSet - Partitioned dataset
def splitDataSet(dataSet, axis, value):
    # Create a list of returned datasets
    retDataSet = []
    # Traversal dataset
    for featVec in dataSet:
        if featVec[axis] == value:
            # Remove axis feature
            reducedFeatVec = featVec[:axis]
            # Add eligible to the returned dataset
            reducedFeatVec.extend(featVec[axis + 1:])
    # Returns the partitioned dataset
    return retDataSet

Test the function splitDataSet() on the previous simple sample data:

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    print(splitDataSet(dataSet, 0 ,1))
    print(splitDataSet(dataSet, 0 ,0))
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
[[1, 'yes'], [1, 'yes'], [0, 'no']]
[[1, 'no'], [1, 'no']]

Next, traverse the entire data set, and cycle through the entropy and splitDataSet() function to find the best feature division method. Entropy calculation will tell us how to divide data sets is the best way to organize data.

Select the best data set division method:

    dataSet - data set
    bestFeature - Maximum information gain(optimal)Index value of the feature
def chooseBestFeatureToSplit(dataSet):
    # Number of features
    numFeatures = len(dataSet[0]) - 1
    # Calculate Shannon entropy of data set
    baseEntropy = calcShannonEnt(dataSet)
    # information gain 
    bestInfoGain = 0.0
    # Index value of optimal feature
    bestFeature = -1
    # Traverse all features
    for i in range(numFeatures):
        #Get the i th all features of dataSet
        featList = [example[i] for example in dataSet]
        # Create set set {}, elements cannot be repeated
        uniqueVals = set(featList)
        # Empirical conditional entropy
        newEntropy = 0.0
        # Calculate information gain
        for value in uniqueVals:
            # The subset of the subDataSet after partition
            subDataSet = splitDataSet(dataSet, i, value)
            # Calculate the probability of subsets
            prob = len(subDataSet) / float(len(dataSet))
            # The empirical conditional entropy is calculated according to the formula
            newEntropy += prob * calcShannonEnt(subDataSet)
        # information gain 
        infoGain = baseEntropy - newEntropy
        # Print information gain for each feature
        print("The first%d The gain of each feature is%.3f" % (i, infoGain))
        # Calculate information gain
        if (infoGain > bestInfoGain):
            # Update the information gain to find the maximum information gain
            bestInfoGain = infoGain
            # Record the index value of the feature with the largest information gain
            bestFeature = i
    # Returns the index value of the feature with the largest information gain
    return bestFeature
if __name__ == '__main__':
    dataSet, labels = createDataSet()
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
The gain of the 0th feature is 0.420
 The gain of the first feature is 0.171

The best index is 0, that is, the first feature is the best feature used to divide the dataset.

3. Recursive construction of decision tree

Working principle: get the original data set, and then divide the data set based on the best attribute value. Since there may be more than two eigenvalues, there may be data set division greater than two branches. After the first partition, the data will be passed down to the next node of the tree branch. On this node, we can partition the data again. Therefore, we can use the principle of recursion to deal with data sets.

The condition for the end of recursion: the program traverses all the attributes that divide the data set, or all instances under each branch have the same classification. If all instances have the same classification, a leaf node or termination block is obtained. Any data that reaches the leaf node must belong to the classification of the leaf node. If the dataset has processed all attributes, but the class label is still not unique, we need to decide how to define the leaf node. In this case, we usually use the majority voting method to determine the classification of the leaf node.

Majority vote:

    classList - Class label list
    sortedClassCount[0][0] - Elements that appear most here(Class label)
def majorityCnt(classList):
    classCount = {}
    # Count the number of occurrences of each element in the classList
    for vote in classList:
        if vote not in classCount.keys():classCount[vote] = 0
        classCount[vote] += 1
    # Sort by dictionary values in descending order
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
    # Returns the most frequent element in the classList
    return sortedClassCount[0][0]

To create a decision tree:

    dataSet - Training data set
    labels - Classification attribute label
    myTree - Decision tree
def createTree(dataSet, labels):
    # Take the classification label
    classList = [example[-1] for example in dataSet]
    # If all class labels are identical, the class label is returned directly
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # When all features are traversed, the class label with the most occurrences is returned
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    # Select the optimal feature
    bestFeat = chooseBestFeatureToSplit(dataSet)
    # Optimal feature label
    bestFeatLabel = labels[bestFeat]
    # Label spanning tree based on optimal features
    myTree = {bestFeatLabel:{}}
    # Delete used feature labels
    # The attribute values of all optimal features in the training set are obtained
    featValues = [example[bestFeat] for example in dataSet]
    # Remove duplicate attribute values
    uniqueVals = set(featValues)
    # Traverse the features and create a decision tree.
    for value in uniqueVals:
        # The class label is copied and stored in the new list variable subLabels
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Simple data set before testing

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    myTree = createTree(dataSet, labels)
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

The variable myTree contains many nested dictionaries representing tree structure information. Starting from the left, the first keyword no surfacing is the feature name of the first partitioned dataset, and the value of this keyword is also another data dictionary. The second keyword is the data set divided by the no surfacing feature. The values of these keywords are the child nodes of the no surfacing node. These values may be class labels (such as' flippers') or another data dictionary. If the value is a class label, the child node is a leaf node; If the value is another data dictionary, the child node is a judgment node. This format structure repeats continuously to form the whole tree. This tree contains three leaf nodes and two judgment nodes.

2, Drawing a tree in Python using the Matplotlib annotation

1. Matplotlib

Draw tree nodes with text annotations:

# Define text box and arrow formats
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
    nodeTxt - Node name
    centerPt - Text position
    parentPt - Arrow position of dimension
    nodeType - Node format
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    createPlot.ax1 = plt.subplot(111, frameon=False)
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
if __name__ == '__main__':

Test effect:

2. Construct annotation tree

Although there are x and Y coordinates, how to place all tree nodes is a problem. You must know how many leaf nodes there are so that you can correctly determine the length of the x-axis; You also need to know how many layers the tree has so that you can correctly determine the height of the y-axis.

Get the number of leaf nodes and tree layers:

    myTree - Decision tree
    numLeafs - Number of leaf nodes of decision tree
def getNumLeafs(myTree):
    # Initialize leaf
    numLeafs = 0
    #myTree.keys() in Python 3 returns dict_keys, not a list,
    #Therefore, you cannot use the method of myTree.keys()[0] to obtain node properties. You can use list(myTree.keys())[0]
    firstStr = next(iter(myTree))
    # Get the next set of dictionaries
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # Test whether the node is a dictionary. If it is not a dictionary, it means that the node is a leaf node
        if type(secondDict[key]).__name__=='dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

    myTree - Decision tree
    maxDepth - Layers of decision tree
def getTreeDepth(myTree):
    # Initialize decision tree depth
    maxDepth = 0
    #myTree.keys() in Python 3 returns dict_keys, not a list,
    #Therefore, you cannot use the method of myTree.keys()[0] to obtain node properties. You can use list(myTree.keys())[0]
    firstStr = next(iter(myTree))
    # Get next dictionary
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        #Test whether the node is a dictionary. If it is not a dictionary, it means that the node is a leaf node
        if type(secondDict[key]).__name__=='dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        # Update layers
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

The function retrieveTree outputs the pre stored tree information, avoiding the trouble of creating a tree from the data every time you test the code.

def retrieveTree(i):
    Predefined tree structure for testing 
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
    return listOfTrees[i]
if __name__ == '__main__':
    myTree = retrieveTree(0)
{'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}

Both the getNumLeafs() function and the gettreedepts() function returned the correct results. Next, draw a complete tree.

Dimension directed edge attribute values:

    cntrPt,parentPt - Used to calculate dimension locations
    txtString - Marked content
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]#Calculate dimension location
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

Draw decision tree:

    myTree - Decision tree(Dictionaries)
    parentPt - Marked content
    nodeTxt - Node name
def plotTree(myTree, parentPt, nodeTxt):
    # Set node format
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")
    # Format leaf nodes
    leafNode = dict(boxstyle="round4", fc="0.8")
    # Get the number of decision leaf nodes, which determines the width of the tree
    numLeafs = getNumLeafs(myTree)
    # Get the number of decision tree layers
    depth = getTreeDepth(myTree)
    # Next dictionary
    firstStr = next(iter(myTree))
    # Center position
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    # Dimension directed edge attribute values
    plotMidText(cntrPt, parentPt, nodeTxt)
    # Draw node
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    # The next dictionary is to continue drawing child nodes
    secondDict = myTree[firstStr]
    # y offset
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        # Test whether the node is a dictionary. If it is not a dictionary, it means that the node is a leaf node, not a leaf node. Call recursively to continue drawing
        if type(secondDict[key]).__name__=='dict':
        # If it is a leaf node, draw the leaf node and mark the directed edge attribute value
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

To create a paint panel:

    inTree - Decision tree(Dictionaries)
def createPlot(inTree):
    # Create fig
    fig = plt.figure(1, facecolor='white')
    # Empty fig
    axprops = dict(xticks=[], yticks=[])
    # Remove the x and y axes
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    # Get the number of decision nodes
    plotTree.totalW = float(getNumLeafs(inTree))
    # Get the number of decision tree layers
    plotTree.totalD = float(getTreeDepth(inTree))
    # x offset
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    # Draw decision tree
    plotTree(inTree, (0.5,1.0), '')
    # Display drawing results
if __name__ == '__main__':
    myTree = retrieveTree(0)

3, Test and store classifiers

1. Test algorithm: use decision tree to perform classification

After constructing the decision tree based on the training data, we can use it to classify the actual data. When performing data classification, the decision tree and the label vector used to construct the tree are needed. Then, the program compares the test data with the values on the decision tree, and recursively executes the process until it enters the leaf node; Finally, the test data is defined as the type of leaf node.

Use the classification function of the decision tree:

    inputTree - Generated decision tree
    featLabels - Store the selected optimal feature label
    testVec - Test data list, the order corresponds to the optimal feature label
    classLabel - Classification results
def classify(inputTree, featLabels, testVec):
    # Get decision tree node
    firstStr = next(iter(inputTree))
    # Next dictionary
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else: classLabel = secondDict[key]
    return classLabel
if __name__ == '__main__':
    dataSet, labels = createDataSet()
    myTree = retrieveTree(0)
    print(classify(myTree, labels, [1, 0]))
    print(classify(myTree, labels, [1, 1]))
['no surfacing', 'flippers']
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

2. Use algorithm: storage of decision tree

Constructing a decision tree is a time-consuming task. It takes several seconds to process a small data set, such as the previous sample data. If the data set is large, it will consume a lot of computing time. However, using the created decision tree to solve the classification problem can be completed quickly. Therefore, in order to save computing time, it is best to call the constructed decision tree every time the classification is executed.

Storage decision tree:

    inputTree - Generated decision tree
    filename - Storage file name of the decision tree
def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

Read decision tree:

    filename - Storage file name of the decision tree
    pickle.load(fr) - Decision tree dictionary
def grabTree(filename):
    fr = open(filename, 'rb')
    return pickle.load(fr)
if __name__ == '__main__':
    myTree = retrieveTree(0)
    storeTree(myTree, 'classifierStorage.txt')

Through the above code, we can store the classifier on the hard disk instead of learning it again every time we classify the data, which is also one of the advantages of the decision tree.

4, Using decision tree to predict vehicle condition

The condition of the vehicle is divided into four categories:

  • unacc (Unacceptable condition is very poor)
  • acc (Acceptable condition is normal)
  • good (Good)
  • vgood (Very good)

Judge the condition of the vehicle through the following classification attributes:

  • buying (purchase price: vhigh, high, med, low)
  • maint (maintenance price: vhigh, high, med, low)
  • Doors (several doors: 2, 3, 4, 5more)
  • persons (capacity: 2, 4, more)
  • lug_boot (storage space: small, med, big)
  • safety: low, Med, high

Data sample:

Test code:

if __name__ == '__main__':
    fr = open('carTrain.txt')
    cars = [inst.strip().split('\t') for inst in fr.readlines()]
    carLabels = ['buying', 'maint', 'doors', 'persons', 'lub_boot', 'safety']
    carsTree = createTree(cars, carLabels)

Test results:

{'safety': {'low': 'unacc', 'med': {'persons': {'2': 'unacc', '4': {'buying': {'low': {'maint': {'low': 'acc', 'med': 'good', 'high': 'acc', 'vhigh': 'acc'}}, 'med': {'maint': {'low': 'acc', 'med': 'acc', 'high': 'unacc', 'vhigh': 'acc'}}, 'high': {'maint': {'low': 'unacc', 'med': 'acc', 'high': 'unacc', 'vhigh': 'unacc'}}, 'vhigh': {'maint': {'low': 'unacc', 'med': 'acc', 'high': 'unacc', 'vhigh': 'unacc'}}}}, 'more': {'buying': {'low': {'maint': {'med': {'doors': {'2': 'good', '3': 'good', '4': 'acc'}}, 'vhigh': {'doors': {'2': 'acc', '3': 'acc', '4': 'unacc'}}}}, 'med': {'doors': {'2': 'acc', '3': 'acc', '4': {'maint': {'med': 'acc', 'vhigh': 'unacc'}}}}, 'high': {'maint': {'med': {'doors': {'2': 'acc', '3': 'acc', '4': 'unacc'}}, 'vhigh': 'unacc'}}, 'vhigh': {'maint': {'med': {'doors': {'2': 'acc', '3': 'acc', '4': 'unacc'}}, 'vhigh': 'unacc'}}}}}}, 'high': {'persons': {'4': {'buying': {'low': {'maint': {'med': {'doors': {'2': 'vgood', '3': 'good', '4': 'good'}}, 'vhigh': 'acc'}}, 'med': {'doors': {'2': {'maint': {'med': 'vgood', 'vhigh': 'acc'}}, '3': 'acc', '4': 'acc'}}, 'high': {'maint': {'med': 'acc', 'vhigh': 'unacc'}}, 'vhigh': {'maint': {'med': 'acc', 'vhigh': 'unacc'}}}}, '2': 'unacc', 'more': {'buying': {'low': {'doors': {'4': 'vgood', '3': 'vgood', '5more': {'maint': {'low': 'good', 'high': 'acc'}}}}, 'med': {'maint': {'low': {'doors': {'4': 'vgood', '3': 'vgood', '5more': 'good'}}, 'high': 'acc'}}, 'high': 'acc', 'vhigh': {'maint': {'low': 'acc', 'high': 'unacc'}}}}}}}}

Along different branches of the decision tree, the condition of the vehicle can be judged by some attributes of the vehicle itself.

The decision tree matches the experimental data very well, but there may be too many matching options. This problem is called over matching. In order to reduce the over matching problem, we can cut the decision tree and remove some unnecessary leaf nodes. If a leaf node can only add a little information, it can be deleted and incorporated into other leaf nodes. ID3 algorithm cannot directly process numerical data. Although we can convert numerical data into nominal data by quantitative method, if there are too many feature divisions, ID3 algorithm will still face other problems.

5, Improved algorithm

1. C4.5 algorithm

C4.5 algorithm is a classical algorithm used to generate decision tree. It is an extension and optimization of ID3 algorithm. C4.5 algorithm improves ID3 algorithm. The main improvements are:

  • Using information gain rate to select partition features overcomes the deficiency of using information gain, but the information gain rate has a preference for attributes with a small number of values;
  • It can handle discrete and continuous attribute types, that is, discrete continuous attributes;
  • Capable of processing training data with missing attribute values;
  • Pruning during tree construction;

Information gain rate
The information gain criterion has a preference for attributes with a large number of values. In order to reduce the possible adverse effects of this preference, C4.5 algorithm uses the information gain rate to select the optimal partition attributes. Gain rate formula:
G a i n _ r a t i o ( D , a ) = G a i n ( D , a ) I V ( a ) Gain\_ratio(D,a) = \frac{ Gain(D,a)}{IV(a)} \quad Gain_ratio(D,a)=IV(a)Gain(D,a)​
I V ( a ) = − ∑ v = 1 V ∣ D v ∣ ∣ D ∣ log ⁡ 2 ∣ D v ∣ ∣ D ∣ IV(a) = -\sum_{v=1}^V \frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|} \quad \quad IV(a)=−v=1∑V​∣D∣∣Dv∣​log2​∣D∣∣Dv∣​
It is called the "intrinsic value" of attribute A. The more possible values of attribute a (i.e. the greater V), the I V ( a ) IV(a) The larger the value of IV(a) is usually

The information gain rate criterion has a preference for attributes with a small number of values. Therefore, C4.5 algorithm does not directly select the candidate partition attribute with the largest information gain rate, but first find the attribute with higher information gain than the average level from the candidate partition attributes, and then select the attribute with the highest information gain rate.

Processing of continuous features
When the attribute type is discrete, there is no need to discretize the data;

When the attribute type is continuous, the data needs to be discretized. The specific ideas are as follows:

Specific ideas:
The continuous characteristic a of M samples has m values, arranged from small to large a 1 , a 2 , ... , a n a_1,a_2,{\ldots} ,a_n a1, a2,..., an, take the average of two adjacent sample values as the division point, there are m-1 in total, and the ith division point Ti is expressed as: T i = ( a i + a i + 1 ) / 2 T_i = (a_i + a_{i+1})/2 Ti​=(ai​+ai+1​)/2 .
The information gain rate when these m-1 points are used as binary segmentation points is calculated respectively. The point with the largest information gain rate is selected as the best segmentation point of the continuous feature. For example, the point with the largest information gain rate is a t a_t at, then less than a t a_t The value of at is category 1, greater than a t a_t The value of at is class 2, so the discretization of continuous features is achieved.


CART algorithm constructs a decision tree for classification: the size of Gini index is used to measure the advantages and disadvantages of each partition point.
The definition of Gini index is: in the classification problem, assuming that D has k classes, the probability that the sample point belongs to class k is p k p_k pk, then the probability
The Gini value of the distribution is defined as:
G i n i ( D ) = ∑ k = 1 K p k ( 1 − p k ) = 1 − ∑ k = 1 K p k 2 Gini(D) = \sum_{k=1}^K p_k(1 - p_k) = 1 - \sum_{k=1}^K p_k^2 Gini(D)=k=1∑K​pk​(1−pk​)=1−k=1∑K​pk2​
G i n i ( D ) Gini(D) The smaller Gini(D), the higher the purity of data set D;

Given dataset D, the Gini index of attribute a is defined as:
G i n i i n d e x ( D , a ) = ∑ v = 1 V ∣ D v ∣ ∣ D ∣ G i n i ( D v ) Gini_index(D,a) = \sum_{v=1}^V \frac{|D^v|}{|D|}Gini(D^v) \quad Ginii​ndex(D,a)=v=1∑V​∣D∣∣Dv∣​Gini(Dv)
In the candidate attribute set A, select the attribute that minimizes the Gini index after partition as the optimal partition attribute.

Posted by barrowvian on Tue, 26 Oct 2021 03:51:20 -0700