KNN, the k-nearest neighbor method, is a basic classification and regression method proposed by T. Cover and P. Hart in 1967. It is also one of the fundamental algorithms of machine learning.
This article's reference tutorial: Machine Learning in Action
Principle of KNN algorithm
KNN starts from a sample data set, also called the training sample set, in which every record has a label; that is, we know which class each record in the sample set belongs to. When new, unlabeled data comes in, each of its features is compared with the corresponding feature of the records in the sample set, and the algorithm extracts the class labels of the most similar records (the nearest neighbors). In general we only take the k most similar records in the sample set, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k records is assigned to the new data.
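The comparison and vote described above can be written compactly. This is a sketch that assumes Euclidean distance, the measure used by the code in this article. The symbols are introduced here only for illustration: $x$ is the new sample, $x^{(i)}$ is the i-th training sample with label $y^{(i)}$, $n$ is the number of features, and $N_k(x)$ is the set of indices of the $k$ training samples closest to $x$.

```latex
d\bigl(x, x^{(i)}\bigr) = \sqrt{\sum_{j=1}^{n} \bigl(x_j - x^{(i)}_j\bigr)^2},
\qquad
\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\bigl[\, y^{(i)} = c \,\bigr]
```

For example, the distance between the points (0, 0) and (1.0, 1.1) is $\sqrt{1.0^2 + 1.1^2} = \sqrt{2.21} \approx 1.49$.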
Therefore, it can also be said that the KNN algorithm is not actually trained, that is, its training cost is zero; KNN simply uses similarity to decide the category. You are whoever you resemble most.
KNN works with both numerical and nominal data. Its advantages are high accuracy, insensitivity to outliers, and no assumptions about the input data. Its disadvantages are just as obvious: every prediction has to traverse the whole training set, so both the computational complexity and the space complexity are high.
KNN code template
1. Import dependencies
import numpy as np
from collections import Counter
2. Generate samples
def createDateSet():
    group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group,labels
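3. KNN algorithm
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Coordinate difference between inX and every sample
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    # Sum over the features of each sample, then take the root: Euclidean distance
    sqDistances = sqDiffMat.sum(axis = 1)
    distances = sqDistances**0.5
    # Sample indices ordered from nearest to farthest
    sortedDistIndicies = np.argsort(distances)
    # Labels of the k nearest samples
    voteIlabel = np.array(labels)[sortedDistIndicies[:k]]
    # Most frequent label among them
    return Counter(voteIlabel).most_common(1)[0][0]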
4. Main function
def main():
    k = 3
    test=[0,0]
    group,labels = createDateSet()
    result = classify0(test, group ,labels,k)
    print(result)

if __name__ == '__main__':
    main()
KNN step by step

The steps below imitate Jupyter: in each step, the first code box is the input (In []) and the second is the output (Out []).

The implementation of KNN is very simple, and numpy makes it remarkably concise.
1. Get the number of samples
group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']
dataSetSize = group.shape[0]
dataSetSize
4
2. Obtain the coordinate difference between inX (here, test) and the samples
test=[0,0]
diffMat = np.tile(test, (dataSetSize,1)) - group
diffMat
array([[-1. , -1.1],
       [-1. , -1. ],
       [ 0. ,  0. ],
       [ 0. , -0.1]])
np.tile() repeats test once per sample, so the coordinate differences between test and every sample are computed in a single operation.
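As an aside, NumPy broadcasting can produce the same difference matrix without np.tile(). The following is a minimal sketch that reuses the sample values above and checks that both forms agree; the names tiled_diff and broadcast_diff are chosen here for illustration.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
test = np.array([0, 0])

# Explicit repetition, as in the article: one copy of test per sample
tiled_diff = np.tile(test, (group.shape[0], 1)) - group

# Broadcasting: the 1-D test row is stretched across all rows of group
broadcast_diff = test - group

print(np.array_equal(tiled_diff, broadcast_diff))  # True
```

Either form works; np.tile() makes the repetition visible, while broadcasting avoids building the repeated array.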
3. Obtain the squared distance
sqDiffMat = diffMat**2
sqDistances = sqDiffMat.sum(axis = 1)
sqDistances
array([2.21, 2. , 0. , 0.01])
Note the direction of the axis when summing: axis = 1 adds up the values within each row, which gives one value per sample.
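To make the axis direction concrete, here is a small sketch using the squared differences from this step (the names per_sample and per_feature are illustrative). Summing along axis 1 collapses the feature columns, while axis 0 would collapse the samples instead.

```python
import numpy as np

# Squared coordinate differences: 4 samples (rows) x 2 features (columns)
sqDiffMat = np.array([[1.0, 1.21], [1.0, 1.0], [0.0, 0.0], [0.0, 0.01]])

per_sample = sqDiffMat.sum(axis=1)   # adds across columns: one value per sample
per_feature = sqDiffMat.sum(axis=0)  # adds down rows: one value per feature

print(per_sample.shape)   # (4,)
print(per_feature.shape)  # (2,)
```

Only the axis=1 result has one entry per sample, which is what the distance computation needs.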
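4. Sort the indices by distance
# Take the square root first; distances was not defined in the previous step
distances = sqDistances**0.5
sortedDistIndicies = np.argsort(distances)
sortedDistIndicies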
array([2, 3, 1, 0], dtype=int64)
5. Obtain the labels of the k nearest points
k=3
voteIlabel = np.array(labels)[sortedDistIndicies[:k]]
voteIlabel
array(['B', 'B', 'A'], dtype='<U1')
labels was originally a list, so it is first converted to an ndarray, which can be indexed with an array of indices.
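The reason for the conversion can be seen in a short sketch (the names idx, picked, and list_indexing_ok are illustrative): a plain Python list rejects an index array, while an ndarray selects all the requested positions at once.

```python
import numpy as np

labels = ['A', 'A', 'B', 'B']
idx = np.array([2, 3, 1])  # indices of the three nearest samples

try:
    labels[idx]            # a plain list does not accept an index array
    list_indexing_ok = True
except TypeError:
    list_indexing_ok = False

picked = np.array(labels)[idx]  # ndarray indexing picks several items at once
print(list_indexing_ok, picked.tolist())  # False ['B', 'B', 'A']
```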
6. Count the labels and return the most frequent one
from collections import Counter
cnt=Counter(voteIlabel)
cnt.most_common(1)
[('B', 2)]
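One detail worth knowing is what happens when two labels receive the same number of votes, which can occur when k is even. In Python 3.7 and later, Counter orders elements with equal counts by first appearance, so because the votes are listed nearest first, a tie goes to the label that was seen first. The sketch below uses a hypothetical tie with k = 4.

```python
from collections import Counter

# Labels of the k nearest neighbours, nearest first (hypothetical tie with k = 4)
votes = ['B', 'A', 'A', 'B']

cnt = Counter(votes)
winner = cnt.most_common(1)[0][0]  # [0] takes the top pair, [0] again takes its label
print(winner)  # B
```

Choosing an odd k is a simple way to avoid ties when there are only two classes.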
KNN practical explanation
Dating example:
Miss Helen has provided 1000 records from her past dates. She now hopes to analyze this data with machine learning, so that in the future she can roughly tell which category a candidate falls into for her without meeting in person.
Miss Helen recorded three indicators in total, namely:

Frequent flyer miles per year

Percentage of time spent playing video games

Litres of ice cream consumed per week
So how do we classify with KNN?

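Get data
import numpy as np
import pandas as pd

def getDateSet():
    df = pd.read_excel('date.xlsx')
    # The 'label' column holds the class; the remaining columns are the features
    labels = np.array(df['label'])
    df.drop('label', axis=1, inplace=True)
    data = np.array(df)
    return data, labels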

Normalization
Because some features have much larger absolute values than others, they would dominate the distance, so the data should be normalized first.
def normalization(data):
    min_val = data.min(0)       # per-column minimum
    max_val = data.max(0)       # per-column maximum
    ranges = max_val - min_val  # range
    norm_data = (data - min_val) / ranges
    return norm_data
sklearn can also be used for normalization:
from sklearn.preprocessing import MinMaxScaler
data = MinMaxScaler().fit_transform(data)
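When a new sample has to be classified, it needs to be scaled with the minimum and range learned from the training data rather than with its own values. The following is a minimal sketch of that idea; the three training rows are made-up stand-ins for date.xlsx, and the names train, new_sample, and manual_new are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training rows: miles per year, % time gaming, litres of ice cream
train = np.array([[40000.0, 8.0, 1.0],
                  [10000.0, 2.0, 0.5],
                  [70000.0, 12.0, 1.5]])
new_sample = np.array([[26052.0, 1.441871, 0.805124]])

scaler = MinMaxScaler().fit(train)       # learns per-column min and max from train
norm_train = scaler.transform(train)
norm_new = scaler.transform(new_sample)  # scaled with the training min and max

# The same arithmetic by hand, following the normalization() formula
min_val = train.min(0)
manual_new = (new_sample - min_val) / (train.max(0) - min_val)
print(np.allclose(norm_new, manual_new))  # True
```

A scaled value can fall outside [0, 1] when the new sample lies outside the training range, as the second feature does here.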

KNN algorithm
Same as above.
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis = 1)
    distances = sqDistances**0.5
    sortedDistIndicies = np.argsort(distances)
    voteIlabel = np.array(labels)[sortedDistIndicies[:k]]
    return Counter(voteIlabel).most_common(1)[0][0]

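Main function
def main():
    data,labels = getDateSet()
    norm_data = normalization(data)
    test=np.array([26052, 1.441871, 0.805124])
    # Scale the new sample with the training set's minimum and range;
    # normalization(test) would scale it by its own minimum and maximum instead
    min_val = data.min(0)
    norm_test = (test - min_val) / (data.max(0) - min_val)
    result = classify0(norm_test, norm_data ,labels,k=5)
    print(result)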
The result is 1, indicating that he is a very attractive man.
KNN with a library
Having hand-coded KNN, we now fully understand how it works, so from here on we can simply call a library.
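import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def getDateSet():
    df = pd.read_excel('date.xlsx')
    labels = np.array(df['label'])
    df.drop('label', axis=1, inplace=True)
    data = np.array(df)
    return data, labels

data,labels = getDateSet()
# Divide the training set and test set by 7:3
x_train, x_test , y_train, y_test = train_test_split(data, labels, test_size = 0.3)
# k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
# Indices of the test samples whose prediction differs from the true label
error_index = np.nonzero(knn.predict(x_test) - y_test)[0]
# Divide by the number of test samples, not by len(data)
print(f'The prediction accuracy is: {100*(1 - len(error_index) / len(y_test))}%')
The 94% accuracy originally reported for this simple training was computed by dividing the error count by len(data) rather than by the number of test samples, which overstates it. With len(y_test) as the denominator the figure is lower for the same number of errors, and it also varies with the random split.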
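The library version skips the normalization step used in the hand-written version. One way to combine the two is sketched below, with the scaler and the classifier chained in a scikit-learn pipeline so that the scaler is fitted on the training split only. The data here is synthetic and purely hypothetical (two well-separated classes with features on very different scales), since date.xlsx is not available; random_state is fixed only to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for date.xlsx: the first feature is on a much
# larger scale than the second
rng = np.random.default_rng(0)
class0 = np.column_stack([rng.normal(20000, 3000, 100), rng.normal(2.0, 0.5, 100)])
class1 = np.column_stack([rng.normal(60000, 3000, 100), rng.normal(8.0, 0.5, 100)])
data = np.vstack([class0, class1])
labels = np.array([0] * 100 + [1] * 100)

# Divide the training set and test set by 7:3
x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.3, random_state=0)

# The scaler learns min and max from x_train, then the same scaling is
# applied to x_test when scoring
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(x_train, y_train)

accuracy = model.score(x_test, y_test)  # fraction of correct test predictions
print(f'The prediction accuracy is: {100 * accuracy:.1f}%')
```

model.score() divides the number of correct predictions by the size of the test set, so the denominator is handled by the library.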