KNN for machine learning

Keywords: Machine Learning, AI

KNN, the k-nearest neighbors method, is a basic classification and regression method proposed by T. Cover and P. Hart in 1967. It is also one of the fundamental algorithms of machine learning.

Reference for this article: Machine Learning in Action

Principle of KNN algorithm

We start with a sample data set, also known as the training set, in which every sample carries a label; that is, we know which class each sample belongs to. When new data without a label arrives, each feature of the new data is compared with the corresponding feature of every sample in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we only consider the k most similar samples in the data set, which is where the k in k-nearest neighbors comes from; k is usually an integer no larger than 20. Finally, the most frequent class among those k neighbors is chosen as the class of the new data.

It can therefore be said that KNN is never actually trained; its training cost is zero. The algorithm judges the category purely by similarity: whoever you most resemble is who you are.

KNN works with both numerical and nominal data. Its advantages are high accuracy, insensitivity to outliers, and no assumptions about the input data. Its disadvantages are equally obvious: every prediction traverses the whole training set, so for n samples with d features each query costs O(nd) time, and the entire training set must be kept in memory.
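When that linear scan becomes a bottleneck, index structures such as KD-trees can prune the neighbor search. As an aside beyond this tutorial, the scikit-learn classifier used at the end of this article can be asked to build one:

from sklearn.neighbors import KNeighborsClassifier

# 'brute' would scan every sample per query; 'kd_tree' prunes the search space
knn = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')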

KNN code template

1. Import dependencies

import numpy as np
from collections import Counter

2. Generate samples

def createDataSet():
    # four 2-D points and their class labels
    group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group,labels

3. KNN algorithm

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # repeat inX dataSetSize times so it can be subtracted from every sample at once
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis = 1)  # squared Euclidean distance to each sample
    distances = sqDistances**0.5
    sortedDistIndicies = np.argsort(distances)  # indices from nearest to farthest

    # majority vote among the labels of the k nearest neighbors
    voteIlabel = np.array(labels)[sortedDistIndicies[:k]]
    return Counter(voteIlabel).most_common(1)[0][0]

4. Main function

def main():
    k = 3
    test = [0, 0]
    group, labels = createDataSet()
    result = classify0(test, group, labels, k)
    print(result)

if __name__ == '__main__':
    main()
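Running this classifies test = [0, 0]: its three nearest neighbors are the two B samples (distances 0 and 0.1) and one A sample, so the script prints B.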

KNN step by step

  • The following imitates Jupyter: in each step the first code box is the In [ ] cell and the second is its Out [ ] output

  • The implementation of KNN is very simple, and numpy makes it remarkably concise

1. Get the number of samples

group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']
dataSetSize = group.shape[0]
dataSetSize

4

2. Compute the coordinate differences between the test point and each sample

test=[0,0]
diffMat = np.tile(test, (dataSetSize,1)) - group
diffMat
array([[-1. , -1.1],
       [-1. , -1. ],
       [ 0. ,  0. ],
       [ 0. , -0.1]])

Note that np.tile() repeats test so that the coordinate differences between test and every sample are computed in a single vectorized operation.
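As an aside not in the original tutorial, the tile is not strictly necessary: NumPy broadcasting subtracts test from every row directly and yields the same array.

np.array(test) - group
array([[-1. , -1.1],
       [-1. , -1. ],
       [ 0. ,  0. ],
       [ 0. , -0.1]])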

3. Compute the squared distances

sqDiffMat = diffMat**2
sqDistances = sqDiffMat.sum(axis = 1)
sqDistances
array([2.21, 2.  , 0.  , 0.01])

Note the axis direction when summing: axis = 1 sums across each row, giving one squared distance per sample.
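For contrast, summing along axis = 0 would collapse the rows and mix all the samples together, which is not what we want:

sqDiffMat.sum(axis = 0)
array([2.  , 2.22])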

4. Sort the indices by distance

distances = sqDistances**0.5
sortedDistIndicies = np.argsort(distances)
sortedDistIndicies
array([2, 3, 1, 0], dtype=int64)
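np.argsort() returns the indices that would sort the array in ascending order: the nearest sample (index 2, distance 0) comes first and the farthest (index 0) comes last.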

5. Get the labels of the k nearest points

k=3
voteIlabel = np.array(labels)[sortedDistIndicies[:k]]
voteIlabel
array(['B', 'B', 'A'], dtype='<U1')

labels is originally a Python list, so it is first converted to an ndarray, which allows indexing it with an array of indices (fancy indexing).

6. Count the votes and return the most common label

from collections import Counter
cnt=Counter(voteIlabel)
cnt.most_common(1)
[('B', 2)]
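most_common(1) returns a list containing a single (label, count) tuple, which is why classify0 indexes into it twice to extract the label alone:

cnt.most_common(1)[0][0]
'B'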

KNN in practice

A dating example:

Helen has collected 1000 records from her past blind dates. She now hopes to analyze this data with machine learning so that, in the future, she can roughly tell which category a potential date falls into without having to meet him first.

Helen recorded three features in total, namely:

  • Frequent flyer miles per year

  • Percentage of time spent playing video games

  • Litres of ice cream consumed per week

    So how do we classify with KNN?

  1. Get the data

    def getDateSet():
        # requires: import numpy as np, import pandas as pd
        df = pd.read_excel('date.xlsx')   # Helen's 1000 records
        labels = np.array(df['label'])    # the 'label' column holds the class
        df.drop('label', axis=1, inplace=True)  # keep only the three features
        data = np.array(df)
        return data, labels
    
  2. Normalization

    Because some features (such as the flyer miles) are orders of magnitude larger than the others, the data must be normalized so that no single feature dominates the distance

    def normalization(data):
        min_val = data.min(0)   # column-wise minimum
        max_val = data.max(0)   # column-wise maximum
        ranges = max_val - min_val  # range of each feature
        norm_data = (data - min_val) / ranges  # scale every feature to [0, 1]
        # ranges and min_val are returned so that new points
        # can be scaled with the training set's statistics
        return norm_data, ranges, min_val
    

    sklearn can also be used for normalization:

    from sklearn.preprocessing import MinMaxScaler
    data = MinMaxScaler().fit_transform(data)
    
  3. KNN algorithm

    Same as above.

    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis = 1)
        distances = sqDistances**0.5
        sortedDistIndicies = np.argsort(distances)
    
        voteIlabel = np.array(labels)[sortedDistIndicies[:k]]
        return Counter(voteIlabel).most_common(1)[0][0]
    
  4. Main function

    def main():
        data, labels = getDateSet()
        norm_data, ranges, min_val = normalization(data)
        test = np.array([26052, 1.441871, 0.805124])
        # scale the test point with the training set's min and range,
        # not with its own values
        norm_test = (test - min_val) / ranges

        result = classify0(norm_test, norm_data, labels, k=5)
        print(result)
    

    The result is 1, which in Helen's labeling indicates a very attractive man. (A quick way to sanity-check the classifier on held-out data is sketched below.)
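Before trusting the classifier, it is worth estimating its error rate on held-out data. This sketch is not from the original tutorial: it assumes the getDateSet, normalization, and classify0 functions above, and that the records in date.xlsx are not ordered by class, so the first 10% can serve as a test set.

def dating_class_test(ratio=0.1, k=5):
    data, labels = getDateSet()
    norm_data, ranges, min_val = normalization(data)
    num_test = int(norm_data.shape[0] * ratio)  # hold out the first 10%
    errors = 0
    for i in range(num_test):
        # classify each held-out record against the remaining 90%
        pred = classify0(norm_data[i], norm_data[num_test:], labels[num_test:], k)
        if pred != labels[i]:
            errors += 1
    print(f'error rate: {errors / num_test:.2%}')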

KNN with scikit-learn

After hand-coding KNN we fully understand its principle, so from now on we can simply call a library [dog head]

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def getDateSet():
    df = pd.read_excel('date.xlsx')
    labels = np.array(df['label'])
    df.drop('label', axis=1, inplace=True)
    data = np.array(df)
    return data, labels

data,labels = getDateSet()

# Divide the training set and test set by 7:3
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3)

# k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
error_index = np.nonzero(knn.predict(x_test) - y_test)[0]  # indices of misclassified test samples
# the denominator must be the test set size, not the full data set
print(f'The prediction accuracy is: {100 * (1 - len(error_index) / len(y_test))}%')


After this simple training, accuracy of around 94% can be achieved.
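Incidentally, sklearn can report the same number directly through the classifier's built-in score method, which returns the mean accuracy on the given test data:

print(f'Test accuracy: {knn.score(x_test, y_test):.2%}')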
