Machine learning_ 1:K-nearest neighbor algorithm

Keywords: Python Algorithm Machine Learning


Experimental background

This experiment is based on the classical k-nearest neighbor (kNN) algorithm of machine learning. I will first introduce the principle of the k-nearest neighbor algorithm with a basic classification experiment, and then show how to use it for handwritten-digit recognition.

1.k-nearest neighbor algorithm

1.1 algorithm principle

What is the k-nearest neighbor algorithm? Given a point to be classified, it finds the k nearest points in the training set; whichever label appears most often among those k points is the label assigned to the new point. As shown in the figure, k here is 5, and four of the five nearest points are labeled w1, so the point being judged belongs to w1. That is the whole k-nearest neighbor algorithm.
Pseudo code:
Perform the following operations for each point in the test data set in turn:
(1) Calculate the distance between each point in the training data set and the current point;
(2) Sort the points by increasing distance;
(3) Select the k points closest to the current point;
(4) Count how often each category occurs among those k points;
(5) Return the most frequent category as the predicted classification of the current point.
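The five steps above can be sketched in plain Python before looking at the NumPy version in section 1.2. This is a minimal illustration of the voting idea only; the function name and toy data are made up for the example:

```python
import math
from collections import Counter

def knn_classify(point, samples, labels, k):
    """Classify `point` by majority vote among its k nearest samples."""
    # (1)+(2) compute the distance to every training sample and sort ascending
    dists = sorted(
        (math.dist(point, s), lab) for s, lab in zip(samples, labels)
    )
    # (3)+(4) take the k closest and count each label's frequency
    votes = Counter(lab for _, lab in dists[:k])
    # (5) return the most frequent label
    return votes.most_common(1)[0][0]

# toy data: four 2-D points with labels A/A/B/B
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
print(knn_classify((0.0, 0.0), samples, labels, 3))  # → B
```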

1.2 algorithm analysis

The following code implements the core of the k-nearest neighbor algorithm. I will try to analyze most of it, so that everyone reading this article can understand it.
#Scientific computing package numpy
from numpy import *
#Operator module
import operator

def createDataSet():
    #Four sample points and their labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

#Four parameters: input vector inx (the test point), training sample set dataSet (the array group above), label vector labels (the labels above), and k, the number of nearest neighbors (the k in kNN)
def classify0(inx, dataSet, labels, k):
    #Read the length of the first dimension of the matrix
    dataSetSize = dataSet.shape[0]
    #diffMat stores the difference between the input vector inx and the sample set dataSet, computed through tile
    diffMat = tile(inx, (dataSetSize, 1)) - dataSet
    #Square each number in the diffMat array
    sqDiffMat = diffMat ** 2
    #Sum the squares along each row
    sqDistances = sqDiffMat.sum(axis=1)
    #Take the square root of each sum
    distances = sqDistances ** 0.5
    #Indices that sort the distances from small to large
    sortedDistIndicies = distances.argsort()
    #Build a dictionary of vote counts
    classCount = {}
    #Loop k times
    for i in range(k):
        #Match labels to the sorted order
        voteIlabel = labels[sortedDistIndicies[i]]
        #Count the label's frequency
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    #Sort again, in descending order of the second element (the count)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    #Return the most frequent label
    return sortedClassCount[0][0]

The distance calculation uses the Euclidean distance formula:

d = √((x₁ − x₂)² + (y₁ − y₂)²)

It can be simply understood as taking the square root of the sum of the squared differences of the corresponding coordinates.
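As a quick check of the formula, here is a minimal sketch (the function name euclidean is my own, and the sample points are chosen for illustration):

```python
import math

def euclidean(a, b):
    # square the coordinate differences, sum them, then take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # → 5.0
```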

1.3 algorithm experiment

As shown in the figure, open the folder containing the code, type cmd in the path bar to open a command prompt, and enter in sequence:
import KNN (the name of the file in which you saved the k-nearest neighbor code)
group,labels = KNN.createDataSet() (create group and labels)
An error may appear here.

The reason is that Python 3 no longer supports iteritems; just change it to items.
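The iteritems error mentioned above comes from Python 2 code. In Python 3 the dictionary-sorting line looks like this (the vote counts here are made-up example values):

```python
import operator

classCount = {'A': 1, 'B': 2}  # example vote counts for two labels
# sort the (label, count) pairs by count, descending;
# Python 2's iteritems() is spelled items() in Python 3
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount[0][0])  # → B
```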
The test result is shown in the figure: the result is B (if it is not B, check whether your code differs from the code above).

2. Handwriting recognition based on k-nearest neighbor algorithm

2.1 code analysis

#Convert a 32 * 32 image into a 1 * 1024 vector
def img2vector(filename):
    #Array to hold the vector
    returnVect = zeros((1, 1024))
    #Open the file
    fr = open(filename)
    #Cyclic assignment, one character at a time
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    #Return the array
    return returnVect
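A quick self-contained check of this conversion (the sample "image" here is generated on the fly rather than taken from the real digits data set, so the conversion function is restated locally):

```python
import os
import tempfile
from numpy import zeros

def img2vector(filename):
    # 32 lines of 32 characters ('0'/'1') -> one 1 x 1024 row vector
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

# write a fake "image": 32 rows of 32 ones, named in the k_x style
path = os.path.join(tempfile.mkdtemp(), '1_0.txt')
with open(path, 'w') as f:
    f.write(('1' * 32 + '\n') * 32)

vect = img2vector(path)
print(vect.shape, int(vect.sum()))  # → (1, 1024) 1024
```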

#listdir is needed to read the digit files
from os import listdir

#Handwriting recognition test code
def handwritingClassTest():
    #Create an empty label list
    hwLabels = []
    #Get the directory contents
    trainingFileList = listdir('trainingDigits')
    #Get the number of training samples
    m = len(trainingFileList)
    #Matrix holding one 1 * 1024 vector per training image
    trainingMat = zeros((m, 1024))
    #Loop m times
    for i in range(m):
        #Get the file name
        fileNameStr = trainingFileList[i]
        #Remove the file suffix
        fileStr = fileNameStr.split('.')[0]
        #The file naming format is k_x, e.g. 9_45 (the image shows the digit 9 and is the 45th image of that digit); take the category
        classNumStr = int(fileStr.split('_')[0])
        #Add the category to the label list
        hwLabels.append(classNumStr)
        #Convert the image and store it in the training matrix
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    #Import the test set
    testFileList = listdir('testDigits')
    #Error counter
    errorCount = 0.0
    #Number of tests
    mTest = len(testFileList)
    #Loop over the test set
    for i in range(mTest):
        #Get the file name
        fileNameStr = testFileList[i]
        #Remove the file suffix
        fileStr = fileNameStr.split('.')[0]
        #Get the true category
        classNumStr = int(fileStr.split('_')[0])
        #Convert to a 1 * 1024 vector
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        #Classify with classify0
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        #Display the predicted result and the real result
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        #If they differ, increase the error count by 1
        if classifierResult != classNumStr:
            errorCount += 1.0
    #Output the number of errors
    print("\nthe total number of errors is: %d" % errorCount)
    #Output the error rate
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

2.2 code implementation

As in the previous experiment, open a command prompt (cmd) in the code directory and enter in sequence:
import KNN
KNN.handwritingClassTest()

As shown in the figure, the error rate generally fluctuates around 1%.

3. Experimental summary

This experiment clearly shows the advantages and disadvantages of the k-nearest neighbor algorithm.
Advantages:
1. Simple and effective
2. Easy to understand
Disadvantages:
1. The entire data set must be kept; if the training set is large, this consumes a lot of storage space
2. Compared with other algorithms, computing the distance to every sample is very time-consuming
3. It gives no information about the underlying structure of the data, so we cannot know what characteristics an average or typical sample has
Weighing these points, the k-nearest neighbor algorithm is clearly a basic algorithm well suited to machine learning beginners. It can be optimized further, but its design determines its ceiling, so a huge breakthrough is difficult.

Posted by glitch003 on Sun, 03 Oct 2021 18:32:21 -0700