# The k-nearest neighbor algorithm: machine learning in practice

Keywords: Python Algorithm Machine Learning

## 1. Movie classification with kNN

How do you tell whether a new movie is a love movie or an action movie? One way is to count the kisses and the fights. Compare the new movie with many movies that are already tagged, calculate the Euclidean distance to each of them, sort by distance, and take the k closest (k = 3 below). Then count how many of these k are love movies and how many are action movies. The tag with the most votes is the predicted type of the new movie.
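
The procedure above can be sketched in a few lines of plain Python. The counts, the tag strings, and the helper name `vote` are made up for illustration; they are not the dataset or the function used later in this post.

```python
import math
from collections import Counter

# Hypothetical tagged movies: ((fight count, kiss count), tag)
tagged = [
    ((3, 104), "love"),
    ((2, 100), "love"),
    ((99, 5), "action"),
    ((98, 2), "action"),
]

def vote(new_movie, tagged, k=3):
    # Sort the tagged movies by Euclidean distance to the new movie
    by_distance = sorted(tagged, key=lambda item: math.dist(new_movie, item[0]))
    # Collect the tags of the k closest movies and return the most common one
    nearest_tags = [tag for _, tag in by_distance[:k]]
    return Counter(nearest_tags).most_common(1)[0][0]

print(vote((18, 90), tagged))
```

With k = 3 and only two classes, a tie is impossible, which is one reason an odd k is a common choice.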

## 1.1 Code implementation

First, create a dataset with two arrays: one holding the kiss and fight counts, the other holding the tags.

```python
import numpy as np

def createDataSet():
    # Four samples, each with two features
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # Labels for the four samples
    labels = ['Love movie', 'Love movie', 'Action movie', 'Action movie']
    return group, labels
```

Next comes the core of the kNN algorithm, which decides which class a test sample belongs to. It takes four parameters: the test sample, the training dataset, the labels of the training dataset, and k, the number of nearest points that vote.
This code is more concise than the book's version: first calculate the distances, then take the labels of the k closest points, and finally pick the label that occurs most often among them.

```python
import collections

import numpy as np

def classify0(inx, dataset, labels, k):
    # Euclidean distance from inx to every row of dataset
    dist = np.sum((inx - dataset) ** 2, axis=1) ** 0.5
    # Labels of the k nearest samples
    k_labels = [labels[index] for index in dist.argsort()[0:k]]
    # most_common(1) returns [(label, count)]; keep only the label
    label = collections.Counter(k_labels).most_common(1)[0][0]
    return label
```
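
The distance line in `classify0` works because NumPy broadcasts the single test sample against every row of the dataset. Here is a small sketch of that step with made-up numbers (the arrays `point` and `rows` are illustrative and not from this post):

```python
import numpy as np

point = np.array([0, 0])                     # one test sample, shape (2,)
rows = np.array([[6, 8], [3, 4], [5, 12]])   # three training samples, shape (3, 2)

# Broadcasting subtracts each row from `point`, giving a (3, 2) array of differences
diff = point - rows
# Sum the squared differences along each row, then take the square root
dist = np.sum(diff ** 2, axis=1) ** 0.5
print(dist.tolist())            # [10.0, 5.0, 13.0]
# argsort returns the row indices ordered from nearest to farthest
print(dist.argsort().tolist())  # [1, 0, 2]
```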

Finally, test it.

```python
if __name__ == '__main__':
    # Create the dataset
    group, labels = createDataSet()
    # Test sample
    test = [60, 100]
    # kNN classification with k = 3
    test_class = classify0(test, group, labels, 3)
    # Print the classification result
    print(test_class)
```

The film is labeled as a love movie.

## 2. Helen's dating website

Helen sorts the people she dates into three types: those she dislikes, those with average charm, and those she finds glamorous. Each person is described by three features: frequent flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. Helen wants to classify people from these features to predict how much she will like someone new.

## 2.1 Preparations

First, parse the text file into a feature matrix with n rows and 3 columns, and build a list of the corresponding labels.

```python
import numpy as np

def file2matrix(filename):
    # Open the file with an explicit encoding and read all lines
    with open(filename, 'r', encoding='utf-8') as fr:
        arrayOLines = fr.readlines()
    # UTF-8 text may start with a BOM; strip it from the first line,
    # otherwise parsing the first number raises an error later
    arrayOLines[0] = arrayOLines[0].lstrip('\ufeff')
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Feature matrix to return: numberOfLines rows, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Class label vector to return
    classLabelVector = []
    # Row index
    index = 0

    for line in arrayOLines:
        # strip() with no argument removes leading and trailing whitespace ('\n', '\r', '\t', ' ')
        line = line.strip()
        # Split the line on the '\t' delimiter
        listFromLine = line.split('\t')
        # The first three columns are the features; store them in row `index` of returnMat
        returnMat[index, :] = listFromLine[0:3]
        # Map the text label to a class: 1 = didntLike, 2 = smallDoses, 3 = largeDoses
        # (datingTestSet2.txt already stores the labels as 1, 2, 3)
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector
```
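
To make the expected input format concrete, here is a small sketch that parses two tab-separated lines held in memory instead of a file. The sample values are invented; only the three label strings come from the function above. It uses a dictionary lookup in place of the if/elif chain, which is one possible alternative and not the code this post uses.

```python
# Hypothetical sample lines: three numeric features, then a text label
sample_lines = [
    "30000\t5.0\t1.2\tlargeDoses\n",
    "1000\t0.5\t0.3\tdidntLike\n",
]

# Map each text label to the integer class used by the classifier
label_to_class = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}

features = []
classes = []
for line in sample_lines:
    fields = line.strip().split("\t")
    # The first three fields are the features; the last one is the label
    features.append([float(value) for value in fields[0:3]])
    classes.append(label_to_class[fields[-1]])

print(features)
print(classes)
```

Unlike the if/elif chain, the dictionary lookup raises a `KeyError` on an unknown label instead of silently skipping it.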

## 2.2 Data visualization

Plotting the data makes the relationships between the features easier to see.

```python
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

def showdatas(datingDataMat, datingLabels):
    # Font for Chinese characters; check that this font exists on your computer
    font = FontProperties(fname=r"c:\windows\fonts\simhei.ttf", size=14)

    # Split the fig canvas into 2 rows and 2 columns without sharing the x and y axes; the canvas size is (13, 8)
    # With nrows=2 and ncols=2 the canvas has four zones, and axs[0][0] is the first zone in the first row
    fig, axs = plt.subplots(nrows=2, ncols=2, sharex=False, sharey=False, figsize=(13, 8))

    # One color per sample, chosen by its class label
    LabelsColors = []
    for i in datingLabels:
        if i == 1:
            LabelsColors.append('black')
        if i == 2:
            LabelsColors.append('orange')
        if i == 3:
            LabelsColors.append('red')

    # Scatter plot of the first (flight miles) and second (game time) columns, marker size 15, transparency 0.5
    axs[0][0].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 1], color=LabelsColors, s=15, alpha=.5)
    # Set the title, x-axis label, and y-axis label
    axs0_title_text = axs[0][0].set_title('Frequent flyer miles earned per year versus percentage of time spent playing video games', fontproperties=font)
    axs0_xlabel_text = axs[0][0].set_xlabel('Number of frequent flyer miles per year', fontproperties=font)
    axs0_ylabel_text = axs[0][0].set_ylabel('The percentage of time spent playing video games', fontproperties=font)
    plt.setp(axs0_title_text, size=9, weight='bold', color='red')
    plt.setp(axs0_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=7, weight='bold', color='black')

    # Scatter plot of the first (flight miles) and third (ice cream) columns, marker size 15, transparency 0.5
    axs[0][1].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set the title, x-axis label, and y-axis label
    axs1_title_text = axs[0][1].set_title('Frequent flyer miles per year and ice cream liters consumed per week', fontproperties=font)
    axs1_xlabel_text = axs[0][1].set_xlabel('Number of frequent flyer miles per year', fontproperties=font)
    axs1_ylabel_text = axs[0][1].set_ylabel('Ice cream liters consumed per week', fontproperties=font)
    plt.setp(axs1_title_text, size=9, weight='bold', color='red')
    plt.setp(axs1_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs1_ylabel_text, size=7, weight='bold', color='black')

    # Scatter plot of the second (game time) and third (ice cream) columns, marker size 15, transparency 0.5
    axs[1][0].scatter(x=datingDataMat[:, 1], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set the title, x-axis label, and y-axis label
    axs2_title_text = axs[1][0].set_title('Percentage of time spent playing video games versus liters of ice cream consumed per week', fontproperties=font)
    axs2_xlabel_text = axs[1][0].set_xlabel('The percentage of time spent playing video games', fontproperties=font)
    axs2_ylabel_text = axs[1][0].set_ylabel('Ice cream liters consumed per week', fontproperties=font)
    plt.setp(axs2_title_text, size=9, weight='bold', color='red')
    plt.setp(axs2_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=7, weight='bold', color='black')

    # Build the legend entries
    didntLike = mlines.Line2D([], [], color='black', marker='.',
                              markersize=6, label='didntLike')
    smallDoses = mlines.Line2D([], [], color='orange', marker='.',
                               markersize=6, label='smallDoses')
    largeDoses = mlines.Line2D([], [], color='red', marker='.',
                               markersize=6, label='largeDoses')
    # Add the legend to each of the three plots
    axs[0][0].legend(handles=[didntLike, smallDoses, largeDoses])
    axs[0][1].legend(handles=[didntLike, smallDoses, largeDoses])
    axs[1][0].legend(handles=[didntLike, smallDoses, largeDoses])
    # Display the figure
    plt.show()
```

## 2.3 Data normalization

Without normalization, features with large values (such as flight miles) dominate the distance, and features with small values have almost no effect.
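
A quick calculation with made-up numbers shows the effect. The two hypothetical people below differ a lot in game time and ice cream, and only by 5% in flight miles, yet the raw squared distance comes almost entirely from the mileage term.

```python
import math

# Hypothetical people: (flight miles per year, % of time gaming, liters of ice cream per week)
a = (20000, 1.0, 0.2)
b = (21000, 12.0, 1.5)

# Squared difference contributed by each feature
contributions = [(x - y) ** 2 for x, y in zip(a, b)]
distance = math.sqrt(sum(contributions))

print(contributions)
print(distance)
# Share of the squared distance that comes from flight miles alone
print(contributions[0] / sum(contributions))
```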

```python
import numpy as np

def autoNorm(dataSet):
    # Column-wise minimum and maximum of the data
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Range between the maximum and minimum values
    ranges = maxVals - minVals
    # shape[0] is the number of rows in dataSet
    m = dataSet.shape[0]
    # Subtract the minimum from the original values
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # Divide by the range to get the normalized data
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    # Return the normalized data, the ranges, and the minimums
    return normDataSet, ranges, minVals
```
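
NumPy broadcasting can do the same min-max scaling without `np.tile`. The sketch below is an alternative formulation, not the function above; the name `min_max_scale` and the sample values are hypothetical. It also shows how the returned minimum and range can scale a new sample the same way as the training data.

```python
import numpy as np

def min_max_scale(data):
    # Column-wise minimum and range (max - min)
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting applies the per-column values to every row
    return (data - min_vals) / ranges, ranges, min_vals

# Hypothetical data: three samples with three features each
data = np.array([[10000.0, 2.0, 0.5],
                 [30000.0, 6.0, 1.5],
                 [20000.0, 10.0, 1.0]])
scaled, ranges, min_vals = min_max_scale(data)
print(scaled)

# A new sample is scaled with the training minimum and range
new_sample = np.array([25000.0, 4.0, 0.75])
new_scaled = (new_sample - min_vals) / ranges
print(new_scaled)
```

Both versions divide by zero if a column holds a single constant value, so such a column would need separate handling.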

## 2.4 KNN Classifier

The classifier works the same way as in the previous example: calculate the Euclidean distances, take the k smallest, and label the new data with the class that occurs most often among those k.

```python
import operator

import numpy as np

def classify0(inX, dataSet, labels, k):
    # shape[0] is the number of rows in dataSet
    dataSetSize = dataSet.shape[0]
    # Repeat inX dataSetSize times along the rows (and once along the columns), then subtract dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the feature differences
    sqDiffMat = diffMat ** 2
    # Sum the squared differences of each row
    sqDistances = sqDiffMat.sum(axis=1)
    # Take the square root to get the distances
    distances = sqDistances ** 0.5
    # Indices that sort distances from smallest to largest
    sortedDistIndices = distances.argsort()
    # Dictionary that counts the votes per category
    classCount = {}
    for i in range(k):
        # Category of the i-th nearest element
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default) returns the value for key, or default if key is not in the dictionary
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() in Python 3 replaces iteritems() from Python 2
    # key=operator.itemgetter(1) sorts by dictionary value (itemgetter(0) would sort by key)
    # reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print(sortedClassCount)
    # Return the category with the most votes, that is, the predicted category
    return sortedClassCount[0][0]
```
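
The voting step at the end of `classify0` can be tried on its own. The list of neighbor labels below is made up for illustration:

```python
import operator

# Hypothetical labels of the k nearest neighbors
neighbor_labels = [3, 1, 3, 2, 3]

class_count = {}
for label in neighbor_labels:
    # get() returns 0 the first time a label is seen
    class_count[label] = class_count.get(label, 0) + 1

# Sort the (label, count) pairs by count, largest first
sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_class_count)        # [(3, 3), (1, 1), (2, 1)]
print(sorted_class_count[0][0])  # 3
```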

## 2.5 Test Data

10% of the dataset is held out for testing.
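
The split is positional: the first 10% of the rows are the test samples and the rest serve as training samples. A minimal sketch of that slicing on a made-up list (the names `rows` and `ho_ratio` are illustrative):

```python
# Twenty hypothetical samples, numbered so the split is easy to see
rows = list(range(20))
ho_ratio = 0.10

# Number of test samples: 10% of the data, rounded down
num_test = int(len(rows) * ho_ratio)

test_rows = rows[:num_test]    # the first 10% are classified
train_rows = rows[num_test:]   # the remaining 90% act as the labeled neighbors

print(test_rows)         # [0, 1]
print(len(train_rows))   # 18
```

A positional split like this only gives a fair error estimate if the rows are not sorted by class.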

```python
def datingClassTest():
    # File name to open
    filename = "datingTestSet.txt"
    # Store the returned feature matrix and class vector in datingDataMat and datingLabels
    datingDataMat, datingLabels = file2matrix(filename)
    # Hold out 10% of all data
    hoRatio = 0.10
    # Normalize the data; returns the normalized matrix, the ranges, and the minimums
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Number of rows in normMat
    m = normMat.shape[0]
    # Number of test samples (10% of the data)
    numTestVecs = int(m * hoRatio)
    # Classification error count
    errorCount = 0.0

    for i in range(numTestVecs):
        # The first numTestVecs rows are the test set; the remaining rows are the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print("Classification result: %s\tReal category: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("Error rate: %f%%" % (errorCount / float(numTestVecs) * 100))
```

Now enter data that is not in the dataset to predict how much Helen will like that person.

```python
def classifyPerson():
    # Output strings for the three classes
    resultList = ['dislike', 'somewhat like', 'really like']
    # The user enters the three features
    precentTats = float(input("Percentage of time spent playing video games:"))
    ffMiles = float(input("Number of frequent flyer miles per year:"))
    iceCream = float(input("Ice cream liters consumed per week:"))
    # File name to open
    filename = "datingTestSet.txt"
    # Open and process the data
    datingDataMat, datingLabels = file2matrix(filename)
    # Normalize the training set
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Build a NumPy array for the test sample
    inArr = np.array([ffMiles, precentTats, iceCream])
    # Normalize the test sample with the training minimums and ranges
    norminArr = (inArr - minVals) / ranges
    # Get the classification result
    classifierResult = classify0(norminArr, normMat, datingLabels, 3)
    # Print the result
    print("You will probably %s this person" % (resultList[classifierResult - 1]))
```

## 3. Handwritten Number Recognition

I already covered this code in my last blog post, so this post goes into more detail.
First prepare the training set and the test set. Each consists of images of the digits 0-9 converted to a 0/1 format and stored as txt files.

[Figure: the digit 0 in this 0/1 text format]

## 3.1 Conversion Vector

Convert each 32 x 32 image to a 1 x 1024 vector.
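
The idea is easier to see on a smaller grid. The sketch below flattens a made-up 4 x 4 grid of 0/1 characters into a 16-element list, using the same `width * row + column` indexing as the 32 x 32 case:

```python
# Hypothetical 4 x 4 "image": each string is one row of 0/1 characters
grid = [
    "0110",
    "1001",
    "1001",
    "0110",
]
width = 4

vector = [0] * (len(grid) * width)
for i, line in enumerate(grid):
    for j in range(width):
        # Row i, column j lands at position width * i + j
        vector[width * i + j] = int(line[j])

print(vector)
```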

```python
import numpy as np

def img2vector(filename):
    # Create a 1x1024 zero vector
    returnVect = np.zeros((1, 1024))
    # Open the file
    with open(filename) as fr:
        for i in range(32):
            # Read one row of data
            lineStr = fr.readline()
            # Store the first 32 characters of each row in returnVect in turn
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    # Return the converted 1x1024 vector
    return returnVect
```

## 3.2 Realizing Handwritten Number Recognition
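
The function in this section reads each digit's label from its file name by splitting on `_`. Assuming names of the form `<digit>_<index>.txt` (the exact pattern is an assumption here, inferred from that split), the parsing step looks like this:

```python
# Hypothetical file names following the assumed <digit>_<index>.txt pattern
file_names = ["0_13.txt", "7_2.txt", "9_104.txt"]

labels = []
for name in file_names:
    # split('_') gives e.g. ['0', '13.txt']; the first piece is the digit
    labels.append(int(name.split('_')[0]))

print(labels)  # [0, 7, 9]
```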

```python
from os import listdir

import numpy as np
from sklearn.neighbors import KNeighborsClassifier as kNN

def handwritingClassTest():
    # Labels of the training set
    hwLabels = []
    # List the files in the trainingDigits directory
    trainingFileList = listdir('trainingDigits')
    # Number of files in the folder
    m = len(trainingFileList)
    # Initialize the training matrix
    trainingMat = np.zeros((m, 1024))
    # Parse the category of each training sample from its file name
    for i in range(m):
        # Get the name of the file
        fileNameStr = trainingFileList[i]
        # The digit is the part of the file name before the first '_'
        classNumber = int(fileNameStr.split('_')[0])
        # Add the category to hwLabels
        hwLabels.append(classNumber)
        # Store the 1x1024 data of each file in the trainingMat matrix
        trainingMat[i, :] = img2vector('trainingDigits/%s' % (fileNameStr))
    # Build the kNN classifier
    neigh = kNN(n_neighbors=3, algorithm='auto')
    # Fit the model; trainingMat is the training matrix, hwLabels the corresponding labels
    neigh.fit(trainingMat, hwLabels)
    # List the files in the testDigits directory
    testFileList = listdir('testDigits')
    # Error count
    errorCount = 0.0
    # Number of test samples
    mTest = len(testFileList)
    # Parse the category of each test sample from its file name and classify it
    for i in range(mTest):
        # Get the name of the file
        fileNameStr = testFileList[i]
        # The digit is the part of the file name before the first '_'
        classNumber = int(fileNameStr.split('_')[0])
        # Get the 1x1024 vector of the test sample
        vectorUnderTest = img2vector('testDigits/%s' % (fileNameStr))
        # Get the prediction; predict() returns an array, so take its first element
        # classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("Predicted: %d\tActual: %d" % (classifierResult, classNumber))
        if classifierResult != classNumber:
            errorCount += 1.0
    print("Total errors: %d\nError rate: %f%%" % (errorCount, errorCount / mTest * 100))
```

Posted by TurtleDove on Sat, 09 Oct 2021 11:20:47 -0700