k-nearest neighbor algorithm for machine learning in real-world combat

Keywords: Python Algorithm Machine Learning

1. KNN's Movie Classification

For a new movie, how do you tell if it's a love movie or an action movie? You can tell by the number of kisses or fights. Compare this movie with many already tagged movies, calculate the Euclidean distance between them, sort by distance, and find the shortest k (three below)And then see how many of these k are love movies and how many are action movies. Finally, based on the maximum number of movies, you can tell what type this new movie is.

1.1 Code implementation

First, we'll create a dataset with two arrays, one for kiss/fight counts and one for tags

def createDataSet():
    # Four sets of two-dimensional features
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # Labels for four sets of features
    labels = ['Affectional film', 'Affectional film', 'Action movie', 'Action movie']
    return group, labels

Next comes the core of the KNN algorithm, which determines which test set belongs to the training set. There are four parameters, representing the input test set, the dataset, the labels in the dataset, and selecting the k points with the smallest distance to vote.
This code is more concise than a book: first calculate the distance, then select the k closest tags, and finally select the maximum number of tags among them.

def classify0(inx, dataset, labels, k):
    # Calculate Distance
    dist = np.sum((inx - dataset) ** 2, axis=1) ** 0.5
    # k nearest Tags
    k_labels = [labels[index] for index in dist.argsort()[0: k]]
    # The label that appears most often is the final category
    label = collections.Counter(k_labels).most_common(1)[0][0]
    return label

Finally, test.

if __name__ == '__main__':
    # Create Dataset
    group, labels = createDataSet()
    # Test Set
    test = [60, 100]
    # kNN Classification
    test_class = classify0(test, group, labels, 3)
    # Print classification results
    print(test_class)

The film is labeled as a love movie.

2. Helen's dating website

There are three types of people Helen associates with: dislike, average charm, and glamorous. These people have three kinds of data: annual flight mileage, percentage of time spent playing video games, and number of liters of ice cream consumed each week. Now Helen classifies these people to judge how much new people like Helen.

2.1 Preparations

First, we'll sort the data in the text into a matrix with n rows and 3 columns, and create an array with corresponding labels

def file2matrix(filename):
    # Open the file and specify the encoding this time.

    fr = open(filename, 'r', encoding='utf-8')
    # Read all the contents of the file
    arrayOLines = fr.readlines()
    # For UTF-8 text with a BOM, the BOM should be removed, otherwise an error will be raised later.
    arrayOLines[0] = arrayOLines[0].lstrip('\ufeff')
    # Get Lines of File
    numberOfLines = len(arrayOLines)
    # NumPy matrix returned, parsing completed data: numberOfLines row, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Return Classification Label Vector
    classLabelVector = []
    # Index value of row
    index = 0

    for line in arrayOLines:
        # s.strip(rm), when RM is empty, blank characters are deleted by default (including'\n','\r','t', '')
        line = line.strip()
        # Strings are sliced according to the'\t'delimiter using s.split (str=', num=string, cout(str)).
        listFromLine = line.split('\t')
        # The first three columns of the data are extracted and stored in the NumPy matrix of the returnMat, which is the eigenvalue matrix
        returnMat[index, :] = listFromLine[0:3]
        # Classify according to how much you like the tags in your text, 1 for dislike, 2 for charm, and 3 for glamour
        # The last tag for datingTestSet2.txt is that the processed tags have been changed to 1, 2, 3
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

2.2 Data visualization

Showing these data in pictures helps us to see the relationship more intuitively.

def showdatas(datingDataMat, datingLabels):
    # Format Chinese Characters
    font = FontProperties(fname=r"c:\windows\fonts\simhei.ttf", size=14)  ##You need to see if your computer contains this font

    # Separate fig canvas into 1 row and 1 column, do not share x and y axes, fig canvas size is (13,8)
    # When nrow=2 and nclos=2, the fig canvas is divided into four zones, and axs[0][0] is the first zone in the first row
    fig, axs = plt.subplots(nrows=2, ncols=2, sharex=False, sharey=False, figsize=(13, 8))

    numberOfLabels = len(datingLabels)
    LabelsColors = []
    for i in datingLabels:
        if i == 1:
            LabelsColors.append('black')
        if i == 2:
            LabelsColors.append('orange')
        if i == 3:
            LabelsColors.append('red')
    # Draw a scatter plot using data from the first (Flight Frequent Routine) and second (Play Game) columns of the datingDataMat matrix, with a scatter size of 15 and a transparency of 0.5
    axs[0][0].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 1], color=LabelsColors, s=15, alpha=.5)
    # Set Title, x-axis label,y-axis label
    axs0_title_text = axs[0][0].set_title(u'Percentage of frequent flyer miles earned per year versus time spent playing video games', FontProperties=font)
    axs0_xlabel_text = axs[0][0].set_xlabel(u'Number of frequent flyer miles per year', FontProperties=font)
    axs0_ylabel_text = axs[0][0].set_ylabel(u'The percentage of time spent playing video games', FontProperties=font)
    plt.setp(axs0_title_text, size=9, weight='bold', color='red')
    plt.setp(axs0_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=7, weight='bold', color='black')

    # Draw a scatter plot using the first (flight routine) and third (ice cream) data of the datingDataMat matrix, with a scatter size of 15 and a transparency of 0.5
    axs[0][1].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set Title, x-axis label,y-axis label
    axs1_title_text = axs[0][1].set_title(u'Frequent flyer miles per year and ice cream liters consumed per week', FontProperties=font)
    axs1_xlabel_text = axs[0][1].set_xlabel(u'Number of frequent flyer miles per year', FontProperties=font)
    axs1_ylabel_text = axs[0][1].set_ylabel(u'Ice cream liters consumed per week', FontProperties=font)
    plt.setp(axs1_title_text, size=9, weight='bold', color='red')
    plt.setp(axs1_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs1_ylabel_text, size=7, weight='bold', color='black')

    # Draw a scatter plot using the data from the second (play games) and third (ice cream) column of the datingDataMat matrix, with a scatter size of 15 and a transparency of 0.5
    axs[1][0].scatter(x=datingDataMat[:, 1], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set Title, x-axis label,y-axis label
    axs2_title_text = axs[1][0].set_title(u'Percentage of time spent playing video games versus liters of ice cream consumed per week', FontProperties=font)
    axs2_xlabel_text = axs[1][0].set_xlabel(u'The percentage of time spent playing video games', FontProperties=font)
    axs2_ylabel_text = axs[1][0].set_ylabel(u'Ice cream liters consumed per week', FontProperties=font)
    plt.setp(axs2_title_text, size=9, weight='bold', color='red')
    plt.setp(axs2_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=7, weight='bold', color='black')
    # Set Legend
    didntLike = mlines.Line2D([], [], color='black', marker='.',
                              markersize=6, label='didntLike')
    smallDoses = mlines.Line2D([], [], color='orange', marker='.',
                               markersize=6, label='smallDoses')
    largeDoses = mlines.Line2D([], [], color='red', marker='.',
                               markersize=6, label='largeDoses')
    # Add Legend
    axs[0][0].legend(handles=[didntLike, smallDoses, largeDoses])
    axs[0][1].legend(handles=[didntLike, smallDoses, largeDoses])
    axs[1][0].legend(handles=[didntLike, smallDoses, largeDoses])
    # display picture
    plt.show()

2.3 Data normalization

If the data is not normalized, some groups of data with smaller values will not work and the larger impact will be large.

def autoNorm(dataSet):
    # Get the minimum value of the data
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Range of maximum and minimum values
    ranges = maxVals - minVals
    # shape(dataSet) returns the number of matrix rows and columns of a dataSet
    normDataSet = np.zeros(np.shape(dataSet))
    # Returns the number of rows in the dataSet
    m = dataSet.shape[0]
    # Original value minus minimum value
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # Divide the difference between the maximum and minimum values to get normalized data
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    # Returns normalized data results, data range, minimum
    return normDataSet, ranges, minVals

2.4 KNN Classifier

The same is true for the last example: calculating Euclidean distances, taking k minimum values and choosing the most out of k to label this new data.

def classify0(inX, dataSet, labels, k):
    # The numpy function shape[0] returns the number of rows in the dataSet
    dataSetSize = dataSet.shape[0]
    # Repeat inX once in column vector direction (transverse) and dataSetSize times in row vector direction (longitudinal)
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square after subtracting two-dimensional features
    sqDiffMat = diffMat ** 2
    # sum() all elements add, sum(0) column add, sum(1) row add
    sqDistances = sqDiffMat.sum(axis=1)
    # Square, calculate distance
    distances = sqDistances ** 0.5
    # Returns the index value of the distances sorted from smallest to largest elements
    sortedDistIndices = distances.argsort()
    # A dictionary that records the number of categories
    classCount = {}
    for i in range(k):
        # Remove the category of the first k elements
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key,default=None), the dictionary get() method, returns the value of the specified key if the value is not in the dictionary.
        # Calculate number of categories
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Replacing iteritems() in python2 with items() in python3
    # key=operator.itemgetter(1) Sorts by dictionary value
    # key=operator.itemgetter(0) Sorts by dictionary key
    # reverse descending sort dictionary
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print(sortedClassCount)
    # The category that has been returned the most often, that is, the category to be classified
    return sortedClassCount[0][0]

2.5 Test Data

10% of the dataset is taken for testing

def datingClassTest():
    # Open File Name
    filename = "datingTestSet.txt"
    # Store the returned feature matrix and classification vector in datingDataMat and datingLabels, respectively
    datingDataMat, datingLabels = file2matrix(filename)
    # Take 10% of all data
    hoRatio = 0.10
    # Data normalization, return normalized matrix, data range, data minimum
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Get the number of rows of normMat
    m = normMat.shape[0]
    # Number of 10% test data
    numTestVecs = int(m * hoRatio)
    # Classification Error Count
    errorCount = 0.0

    for i in range(numTestVecs):
        # Pre-numTestVecs data as test set and post-numTestVecs data as training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print("Classification results:%s\t Real Category:%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("error rate:%f%%" % (errorCount / float(numTestVecs) * 100))

Enter data not in the dataset to determine how much Karen likes these people

def classifyPerson():
    # Output Results
    resultList = ['Hate', 'Some like it', 'Like it very much']
    # 3-D Feature User Input
    precentTats = float(input("Percentage of time spent playing video games:"))
    ffMiles = float(input("Number of frequent flyer miles per year:"))
    iceCream = float(input("Ice cream liters consumed per week:"))
    # Open File Name
    filename = "datingTestSet.txt"
    # Open and process data
    datingDataMat, datingLabels = file2matrix(filename)
    # Training Set Normalization
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Generate NumPy array, test set
    inArr = np.array([ffMiles, precentTats, iceCream])
    # Test Set Normalization
    norminArr = (inArr - minVals) / ranges
    # Return classification results
    classifierResult = classify0(norminArr, normMat, datingLabels, 3)
    # Print results
    print("You may%s This man" % (resultList[classifierResult - 1]))

3. Handwritten Number Recognition

I've already covered this code in my last blog, so let me go into more detail in this post.
First prepare the training set and the test set, which consist of 0-9 numbers corresponding to the data in 0,1 format and are stored in txt text.
As shown, the number 0

3.1 Conversion Vector

Converts a 32 x 32 image to a 1 x 1024 vector.

def img2vector(filename):
    # Create 1x1024 zero vector
    returnVect = np.zeros((1, 1024))
    # Open File
    fr = open(filename)
    # Read by line
    for i in range(32):
        # Read a row of data
        lineStr = fr.readline()
        # The first 32 elements of each row are added to the returnVect in turn
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    # Returns the converted 1x1024 vector
    return returnVect

3.2 Realizing Handwritten Number Recognition

def handwritingClassTest():
    # Labels of Test Set
    hwLabels = []
    # Returns the file name in the trainingDigits directory
    trainingFileList = listdir('trainingDigits')
    # Returns the number of files in a folder
    m = len(trainingFileList)
    # Initialize Mat matrix for training, test set
    trainingMat = np.zeros((m, 1024))
    # Resolve the category of the training set from the file name
    for i in range(m):
        # Get the name of the file
        fileNameStr = trainingFileList[i]
        # Get classified numbers
        classNumber = int(fileNameStr.split('_')[0])
        # Add the obtained category to hwLabels
        hwLabels.append(classNumber)
        # Store 1x1024 data for each file in the trainingMat matrix
        trainingMat[i, :] = img2vector('trainingDigits/%s' % (fileNameStr))
    # Building kNN classifier
    neigh = kNN(n_neighbors=3, algorithm='auto')
    # Fit the model, trainingMat is the training matrix, hwLabels is the corresponding label
    neigh.fit(trainingMat, hwLabels)
    # Returns a list of files in the testDigits directory
    testFileList = listdir('testDigits')
    # Error Detection Count
    errorCount = 0.0
    # Number of test data
    mTest = len(testFileList)
    # Resolve the categories of the test set from the file and test for classification
    for i in range(mTest):
        # Get the name of the file
        fileNameStr = testFileList[i]
        # Get classified numbers
        classNumber = int(fileNameStr.split('_')[0])
        # Get the 1x1024 vector of the test set for training
        vectorUnderTest = img2vector('testDigits/%s' % (fileNameStr))
        # Obtain predictions
        # classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        classifierResult = neigh.predict(vectorUnderTest)
        print("Classification returns a result of%d\t The true result is%d" % (classifierResult, classNumber))
        if (classifierResult != classNumber):
            errorCount += 1.0
    print("Total error%d Data\n Error Rate%f%%" % (errorCount, errorCount / mTest * 100))

Posted by TurtleDove on Sat, 09 Oct 2021 11:20:47 -0700

Programmer Group