The k-Nearest Neighbors (kNN) algorithm is a classic supervised classification algorithm. The main idea is that if the majority of the k nearest neighbors of a sample in the feature space belong to a certain category, the sample is classified into that category as well.
1. Algorithmic steps
- Prepare the training and test data;
- Choose the parameter k;
- Compute the distance between the test sample and every training sample, and sort the distances in ascending order;
- Select the k points with the smallest distances;
- Count the frequency of each category among these k points;
- Return the most frequent category among the k points as the predicted classification for the test sample.
2. Python implementation of kNN
2.1 Algorithm implementation
```python
# python 3.7.2
from numpy import *
import operator


def kNNClassify(testData, trainData, labels, k):
    dataSize = trainData.shape[0]                        # number of rows in the training data matrix
    diffMat = tile(testData, (dataSize, 1)) - trainData  # numpy's tile repeats testData dataSize times so it can be subtracted from trainData
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)                  # row-wise sum
    distances = sqDistances ** 0.5                       # Euclidean distance
    sortedDisindexes = distances.argsort()               # indices that would sort the distances
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDisindexes[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)  # sort by vote count
    return sortedClassCount[0][0]
```
Assume the training data is:
```python
trainData = [[1, 1.1], [1, 1], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
```
The test data is:
```python
testData = [[1.1, 1], [0.1, 0]]
```
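The classifier can then be exercised on these two points. The snippet below is a minimal usage sketch that assumes the arrays are converted to NumPy arrays (as kNNClassify expects) and that k = 3:

```python
from numpy import array

trainData = array([[1, 1.1], [1, 1], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

for point in [[1.1, 1], [0.1, 0]]:
    print(point, '->', kNNClassify(array(point), trainData, labels, 3))
# [1.1, 1] lies next to the 'A' samples and is classified as 'A';
# [0.1, 0] lies next to the 'B' samples and is classified as 'B'.
```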
2.2 In practice: dating site matching
Xiao Ming browses profiles on a dating website and rates each person as largeDoses, smallDoses, or didntLike, based on three criteria:
- Miles traveled per year;
- Percentage of the day spent playing games;
- Number of desserts eaten per week.
1,000 records were collected and stored in the file datingTestSet.txt:
```
40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike
75136	13.147394	0.428964	didntLike
38344	1.669788	0.134296	didntLike
72993	10.141740	1.032955	didntLike
35948	6.830792	1.213192	largeDoses
42666	13.276369	0.543880	largeDoses
67497	8.631577	0.749278	didntLike
35483	12.273169	1.508053	largeDoses
50242	3.723498	0.831917	didntLike
63275	8.385879	1.669485	didntLike
5569	4.875435	0.728658	smallDoses
51052	4.680098	0.625224	didntLike
...
```
2.2.1 Read text file data and construct a matrix
```python
def file2Matrix(filename):
    love_dictionary = {'largeDoses': 1, 'smallDoses': 0, 'didntLike': -1}
    fr = open(filename)
    arrayOfLines = fr.readlines()
    numOfLines = len(arrayOfLines)
    dataMatrix = zeros((numOfLines, 3))   # data matrix
    classLabels = []                      # label list
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        dataMatrix[index, :] = listFromLine[0:3]
        classLabels.append(love_dictionary.get(listFromLine[-1]))
        index += 1
    return dataMatrix, classLabels
```
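A quick sanity check of the loader, assuming the 1,000-record file shown above, could look like this:

```python
dataMatrix, classLabels = file2Matrix('datingTestSet.txt')
print(dataMatrix.shape)   # (1000, 3) for the 1,000-record file
print(classLabels[:4])    # [1, 0, -1, -1] for the first four sample lines above
```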
2.2.2 Data normalization
The three features differ greatly in scale, and using the raw values directly would let the largest feature dominate the distance calculation and seriously distort the classification results, so the data must be normalized first:
newValue = (oldValue - min) / (max - min)
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # min(0) returns the column-wise minimums; min(1) would return the row-wise minimums
    maxVals = dataSet.max(0)   # max(0) returns the column-wise maximums; max(1) would return the row-wise maximums
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = normDataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet
```
Finally, kNNClassify is called on the normalized data to test the classifier; a minimal sketch of such a test harness is shown below.
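This sketch holds out the first 10% of the records as a test set and classifies them against the remaining 90%; the hold-out ratio, the choice of k = 3, and the function name datingClassTest are illustrative assumptions rather than part of the original code:

```python
def datingClassTest(holdOutRatio=0.10, k=3):
    dataMatrix, classLabels = file2Matrix('datingTestSet.txt')
    normData = autoNorm(dataMatrix)
    m = normData.shape[0]
    numTest = int(m * holdOutRatio)   # first 10% as test samples, the rest as training samples
    errorCount = 0
    for i in range(numTest):
        result = kNNClassify(normData[i, :], normData[numTest:m, :],
                             classLabels[numTest:m], k)
        if result != classLabels[i]:
            errorCount += 1
    print('error rate: %f' % (errorCount / float(numTest)))
```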
3. Advantages and disadvantages of the algorithm
3.1 Advantages
- Simple, easy to understand, easy to implement;
- Suitable for numerical attribute classification;
- For multi-modal problems (objects with multiple class labels), kNN performs better than SVM.
3.2 Disadvantages
- When the classes are unbalanced, for example when one class has far more samples than the others, the large class may dominate the k nearest neighbors of a new sample simply because of its size, which biases the classification.
- It is computationally expensive: every sample to be classified requires computing its distance to all known training samples.
4. Improvement strategy
The improvement strategies are divided into two main directions: classification efficiency and classification effect:
- Classification efficiency: reduce the sample attributes beforehand by deleting attributes that have little influence on the classification result. The algorithm is better suited to automatically classifying class domains with large sample sizes, while class domains with small sample sizes are more prone to misclassification.
- Classification effect: 1) weight the neighbors so that closer neighbors carry larger weight, as in WAkNN (weighted adjusted k-nearest neighbor); a simple distance-weighted variant is sketched after this list; 2) choose the number of nearest neighbors that participate in the classification according to the number of documents of each class in the training set; 3) use a class-center approach: compute the center of each class, measure the distance from the test sample to each center, and assign the sample to the nearest class.
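As an illustration of the first strategy, the sketch below modifies kNNClassify so that each of the k nearest neighbors votes with a weight inversely proportional to its distance. It is a simple distance-weighted variant written for this article, not the exact WAkNN scheme:

```python
def weightedKNNClassify(testData, trainData, labels, k):
    diffMat = tile(testData, (trainData.shape[0], 1)) - trainData
    distances = (diffMat ** 2).sum(axis=1) ** 0.5
    sortedDisindexes = distances.argsort()
    classWeights = {}
    for i in range(k):
        idx = sortedDisindexes[i]
        weight = 1.0 / (distances[idx] + 1e-6)   # closer neighbors contribute more; 1e-6 avoids division by zero
        voteLabel = labels[idx]
        classWeights[voteLabel] = classWeights.get(voteLabel, 0.0) + weight
    return max(classWeights, key=classWeights.get)
```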
Reference material
- Machine Learning in Action, Peter Harrington