The k-Nearest Neighbors (kNN) algorithm is a classic supervised classification algorithm. The main idea is that if the majority of the k nearest neighbors of a sample in the feature space belong to a certain category, the sample is classified into that category as well.
1. Algorithmic steps
- Prepare the training and test data;
- Choose the parameter k;
- Compute the distance between the test sample and every training sample, and sort the distances in ascending order;
- Select the k points with the smallest distances;
- Count the frequency of each category among these k points;
- Return the most frequent category among the k points as the predicted classification for the test sample.
2. Python implementation of kNN
2.1 Algorithm implementation
```python
# python 3.7.2
from numpy import *
import operator


def kNNClassify(testData, trainData, labels, k):
    dataSize = trainData.shape[0]                        # number of rows in the training data matrix
    diffMat = tile(testData, (dataSize, 1)) - trainData  # numpy's tile repeats testData dataSize times so it can be subtracted from trainData
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)                  # row-wise sum
    distances = sqDistances ** 0.5                       # Euclidean distance
    sortedDisindexes = distances.argsort()               # indices that would sort the distances
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDisindexes[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)  # sort by vote count
    return sortedClassCount[0][0]
```
Assume the training data is:
```python
trainData = [[1, 1.1], [1, 1], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
```
The test data is:
```python
testData = [[1.1, 1], [0.1, 0]]
```
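The classifier can then be exercised on these two points. The snippet below is a minimal usage sketch that assumes the arrays are converted to NumPy arrays (as kNNClassify expects) and that k = 3:

```python
from numpy import array

trainData = array([[1, 1.1], [1, 1], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

for point in [[1.1, 1], [0.1, 0]]:
    print(point, '->', kNNClassify(array(point), trainData, labels, 3))
# [1.1, 1] lies next to the 'A' samples and is classified as 'A';
# [0.1, 0] lies next to the 'B' samples and is classified as 'B'.
```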
2.2 In practice: dating site matching
Xiao Ming browses profiles on a dating website and rates each person as largeDoses, smallDoses, or didntLike, based on three criteria:
- Miles traveled per year;
- Percentage of the day spent playing games;
- Number of desserts eaten per week.
1,000 records were collected and stored in the file datingTestSet.txt:
```
40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike
75136	13.147394	0.428964	didntLike
38344	1.669788	0.134296	didntLike
72993	10.141740	1.032955	didntLike
35948	6.830792	1.213192	largeDoses
42666	13.276369	0.543880	largeDoses
67497	8.631577	0.749278	didntLike
35483	12.273169	1.508053	largeDoses
50242	3.723498	0.831917	didntLike
63275	8.385879	1.669485	didntLike
5569	4.875435	0.728658	smallDoses
51052	4.680098	0.625224	didntLike
...
```
2.2.1 Read text file data and construct a matrix
```python
def file2Matrix(filename):
    love_dictionary = {'largeDoses': 1, 'smallDoses': 0, 'didntLike': -1}
    fr = open(filename)
    arrayOfLines = fr.readlines()
    numOfLines = len(arrayOfLines)
    dataMatrix = zeros((numOfLines, 3))   # data matrix
    classLabels = []                      # label list
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        dataMatrix[index, :] = listFromLine[0:3]
        classLabels.append(love_dictionary.get(listFromLine[-1]))
        index += 1
    return dataMatrix, classLabels
```
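A quick sanity check of the loader, assuming the 1,000-record file shown above, could look like this:

```python
dataMatrix, classLabels = file2Matrix('datingTestSet.txt')
print(dataMatrix.shape)   # (1000, 3) for the 1,000-record file
print(classLabels[:4])    # [1, 0, -1, -1] for the first four sample lines above
```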
2.2.2 Data normalization
The three features differ greatly in scale, and using the raw values directly would let the largest feature dominate the distance calculation and seriously distort the classification results, so the data must be normalized first:
newValue = (oldValue - min) / (max - min)
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # min(0) returns the column-wise minimums; min(1) would return the row-wise minimums
    maxVals = dataSet.max(0)   # max(0) returns the column-wise maximums; max(1) would return the row-wise maximums
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = normDataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet
```
Finally, kNNClassify is called on the normalized data to test the classifier; a minimal sketch of such a test harness is shown below.
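This sketch holds out the first 10% of the records as a test set and classifies them against the remaining 90%; the hold-out ratio, the choice of k = 3, and the function name datingClassTest are illustrative assumptions rather than part of the original code:

```python
def datingClassTest(holdOutRatio=0.10, k=3):
    dataMatrix, classLabels = file2Matrix('datingTestSet.txt')
    normData = autoNorm(dataMatrix)
    m = normData.shape[0]
    numTest = int(m * holdOutRatio)   # first 10% as test samples, the rest as training samples
    errorCount = 0
    for i in range(numTest):
        result = kNNClassify(normData[i, :], normData[numTest:m, :],
                             classLabels[numTest:m], k)
        if result != classLabels[i]:
            errorCount += 1
    print('error rate: %f' % (errorCount / float(numTest)))
```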
3. Advantages and disadvantages of the algorithm
3.1 Advantages
- Simple, easy to understand, easy to implement;
- Suitable for numerical attribute classification;
- For multi-modal problems (objects with multiple class labels), kNN performs better than SVM.
3.2 Disadvantages
- When the classes are unbalanced, for example when one class has far more samples than the others, the large class may dominate the k nearest neighbors of a new sample simply because of its size, which biases the classification.
- It is computationally expensive: every sample to be classified requires computing its distance to all known training samples.
4. Improvement strategy
The improvement strategies are divided into two main directions: classification efficiency and classification effect:
- Classification efficiency: reduce the sample attributes beforehand by deleting attributes that have little influence on the classification result. The algorithm is better suited to automatically classifying class domains with large sample sizes, while class domains with small sample sizes are more prone to misclassification.
- Classification effect: 1) weight the neighbors so that closer neighbors carry larger weight, as in WAkNN (weighted adjusted k-nearest neighbor); a simple distance-weighted variant is sketched after this list; 2) choose the number of nearest neighbors that participate in the classification according to the number of documents of each class in the training set; 3) use a class-center approach: compute the center of each class, measure the distance from the test sample to each center, and assign the sample to the nearest class.
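As an illustration of the first strategy, the sketch below modifies kNNClassify so that each of the k nearest neighbors votes with a weight inversely proportional to its distance. It is a simple distance-weighted variant written for this article, not the exact WAkNN scheme:

```python
def weightedKNNClassify(testData, trainData, labels, k):
    diffMat = tile(testData, (trainData.shape[0], 1)) - trainData
    distances = (diffMat ** 2).sum(axis=1) ** 0.5
    sortedDisindexes = distances.argsort()
    classWeights = {}
    for i in range(k):
        idx = sortedDisindexes[i]
        weight = 1.0 / (distances[idx] + 1e-6)   # closer neighbors contribute more; 1e-6 avoids division by zero
        voteLabel = labels[idx]
        classWeights[voteLabel] = classWeights.get(voteLabel, 0.0) + weight
    return max(classWeights, key=classWeights.get)
```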
Reference material
- Machine Learning in Action, Peter Harrington