1 Project Introduction
A user of a dating app has been using online dating software to find a suitable date. Although the dating site recommends many different people, she does not like all of them. Looking back over her matches, she found that she had met three types of people:
- People she does not like (label 1)
- People of average charm (label 2)
- People of great charm (label 3)
She hopes that classification software can help her sort recommended matches into the right categories. She has also collected some information that the dating software does not record, which she believes will make the categorization more accurate. Part of the collected information is shown in the following figure:
Data Set Download
Each sample contains the following three features:
- Frequent flyer miles earned per year
- Percentage of time spent playing video games
- Ice-cream litres consumed per week
2 Preparing data: parsing data from text files
Before feeding the feature data above into the classifier, it must be converted into a format the classifier can accept.
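The data file is assumed here to be plain text with one sample per line: the three feature columns separated by tabs, followed by the class label. This matches what `file2matrix` below expects; the concrete values are only illustrative.

```python
# A hypothetical line of the tab-separated data file: three features, then the label
sample_line = "40920\t8.326976\t0.953952\t3"

fields = sample_line.strip().split('\t')
features = [float(v) for v in fields[0:3]]   # [40920.0, 8.326976, 0.953952]
label = int(fields[-1])                      # 3
print(features, label)
```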
```python
import numpy as np


def file2matrix(filename):
    """
    :param filename: file name of the dating data collected by the APP user
    :return: returnMat: one row per sample, three columns:
                        frequent flyer miles earned per year,
                        percentage of time spent playing video games,
                        litres of ice cream consumed per week
             classLabelVector: the user's rating for each sample, one of three classes (1, 2, 3)
    """
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOfLines)
    # Create the NumPy matrix to be returned
    returnMat = np.zeros((numberOfLines, 3))
    # Parse the file data into the matrix
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector


print(file2matrix('data/datingTestSet2'))
```
The returned values look like this:
3 Analyzing data: creating scatter plots with Matplotlib
Using the Matplotlib library, we can clearly visualize the three sample classes; people with different hobbies fall into different regions of the plot.
```python
import matplotlib.pyplot as plt
from matplotlib import font_manager


def draw_pic(datingDataMat, datingLabels):
    """
    Scatter plot of frequent flyer miles earned per year against
    litres of ice cream consumed per week.
    :param datingDataMat: feature matrix returned by file2matrix
    :param datingLabels: class labels returned by file2matrix
    :return:
    """
    # Font that can render CJK characters, to avoid garbled axis labels
    myfont = font_manager.FontProperties(fname="/usr/share/fonts/cjkuni-uming/uming.ttc", size=12)
    # Create the canvas
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Use the class label to vary both marker size and colour
    labels = np.array(datingLabels)
    ax.scatter(datingDataMat[:, 0], datingDataMat[:, 2], 15.0 * labels, labels)
    plt.xlabel("Annual frequent flyer miles", fontproperties=myfont)
    plt.ylabel("Ice-cream litres consumed per week", fontproperties=myfont)
    plt.grid(alpha=0.5)
    plt.show()
```
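A minimal way to call it, assuming `file2matrix` defined above and the same data path used throughout this post:

```python
datingDataMat, datingLabels = file2matrix('data/datingTestSet2')
draw_pic(datingDataMat, datingLabels)
```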
- Resulting scatter plot:
4 Preparing data: normalizing values
- Calculate the distance between sample 3 and sample 4:

  [(0 - 67)**2 + (20000 - 32000)**2 + (1.1 - 0.1)**2] ** 0.5 ≈ 12000.19
- Question:
The squared difference in frequent flyer mileage dominates the sum, so this feature influences the computed distance far more than the other two.
- Solution:
When features take values in very different ranges, the usual approach is to normalize them, for example by rescaling each feature to the range 0 to 1 or -1 to 1.
- Normalization formula: newValue = oldValue / max
```python
def autoNorm(dataSet):
    """
    Normalize the feature values.
    :param dataSet: one row of data per sample, three columns
    :return: normDataSet: the normalized feature matrix
             maxVals: the maximum value of each feature
    """
    # Maximum value of each feature (column-wise)
    maxVals = dataSet.max(0)
    # Number of samples
    m = dataSet.shape[0]
    # Apply the formula newValue = oldValue / max to every row
    normDataSet = dataSet / np.tile(maxVals, (m, 1))
    return normDataSet, maxVals
```
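A quick illustration of `autoNorm` with a small made-up array (the numbers are purely for demonstration); after normalization every column lies between 0 and 1:

```python
import numpy as np

demo = np.array([[40000.0, 8.0, 1.0],
                 [20000.0, 2.0, 0.5],
                 [10000.0, 4.0, 0.1]])

normDemo, maxVals = autoNorm(demo)
print(maxVals)    # maximum of each column: 40000, 8, 1
print(normDemo)   # every value now lies between 0 and 1
```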
5 Implementing the kNN algorithm
For each point whose class is unknown, perform the following operations in turn (the same procedure as in the code from the previous example):
(1) Calculate the distance between every point in the known-class data set and the current point;
(2) Sort the points in order of increasing distance;
(3) Select the k points closest to the current point;
(4) Count how often each category occurs among these k points;
(5) Return the most frequent category among the k points as the predicted class of the current point.
```python
def classify(inX, dataSet, labels, k):
    """
    :param inX: the sample to be classified
    :param dataSet: the known (training) data set
    :param labels: the labels of the training samples
    :param k: the k in kNN, i.e. how many neighbours to consider
    :return: the predicted label for inX
    """
    dataSetSize = dataSet.shape[0]
    # Repeat inX into a (dataSetSize, 1) tiling and subtract the data set:
    # (x1 - y1), (x2 - y2), ...
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the differences
    sqDiffMat = diffMat ** 2
    # Sum along each row (axis=1)
    sqDistance = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distances
    distances = sqDistance ** 0.5
    # argsort returns the indices that would sort the distances, not the values
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Sort the vote counts in descending order and return the winning label
    sortedClassCount = sorted(classCount.items(), key=lambda d: d[1], reverse=True)
    return sortedClassCount[0][0]
```
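As a quick sanity check, `classify` can be exercised on a tiny made-up 2-D data set (the values below are purely illustrative):

```python
import numpy as np

# Toy training set: two points near (1, 1) labelled 'A', two near (0, 0) labelled 'B'
group = np.array([[1.0, 1.1],
                  [1.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# A query point close to the 'B' cluster should come back as 'B'
print(classify([0.0, 0.2], group, labels, 3))  # expected output: B
```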
6 Testing the algorithm: verifying the classifier as part of the complete program
If the classifier's accuracy meets our requirements, we can use this software to process the match list provided by the dating website. An important task in machine learning is evaluating an algorithm's accuracy. Usually we use only 90% of the existing data as training samples to train the classifier, and keep the remaining 10% to test the classifier and measure its accuracy.
```python
def datingClassTest():
    """
    Test the classifier on the dating-site data and count the misclassifications.
    :return: errorCount: the number of misclassified test samples
    """
    # Hold out 10% of the data for testing
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('data/datingTestSet2')
    normDataSet, maxVals = autoNorm(datingDataMat)
    # Number of samples
    m = normDataSet.shape[0]
    # Number of test samples
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify test sample i against the remaining 90% of the data
        classifierResult = classify(normDataSet[i, :], normDataSet[numTestVecs:m, :],
                                    datingLabels[numTestVecs:m], 3)
        if classifierResult != datingLabels[i]:
            errorCount += 1
            print("Correct result:", datingLabels[i])
            print("Predicted result:", classifierResult)
    return errorCount
```
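A minimal way to run the test and turn the returned error count into an error rate, assuming the functions and data path used above (the test-set size is recomputed the same way `datingClassTest` does, as 10% of the samples):

```python
datingDataMat, _ = file2matrix('data/datingTestSet2')
numTestVecs = int(datingDataMat.shape[0] * 0.10)

errorCount = datingClassTest()
print("Error rate: %.2f%%" % (100.0 * errorCount / numTestVecs))
```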
- Example run output:
7 Using the algorithm: building a complete, usable prediction system
```python
def classifyPerson(Person):
    """
    Use the classifier to categorize a new person for the APP user.
    :param Person: feature vector [miles per year, game-time percentage, ice-cream litres]
    :return:
    """
    datingDataMat, datingLabels = file2matrix('data/datingTestSet2')
    normDataSet, maxVals = autoNorm(datingDataMat)
    # Normalize the new sample with the same maximum values before classifying it
    classifierResult = classify(Person / maxVals, normDataSet, datingLabels, 3)
    if classifierResult == 1:
        print("Dislike")
    elif classifierResult == 2:
        print("Like a little")
    else:
        print("Like very much")
```
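It is called with a raw, un-normalized feature vector; `classifyPerson` scales it with the same `maxVals` used for the training data. For example, using the sample vector from the complete program below:

```python
# Feature order: miles per year, game-time percentage, ice-cream litres per week
classifyPerson([40920, 8.326976, 0.953952])
```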
Complete code
```python
# encoding:utf-8
"""
kNN implementation: an improved matching algorithm for a dating website,
based on kNN classification.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import font_manager


def file2matrix(filename):
    """
    :param filename: file name of the dating data collected by the APP user
    :return: returnMat: one row per sample, three columns:
                        frequent flyer miles earned per year,
                        percentage of time spent playing video games,
                        litres of ice cream consumed per week
             classLabelVector: the user's rating for each sample (1, 2 or 3)
    """
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOfLines)
    # Create the NumPy matrix to be returned
    returnMat = np.zeros((numberOfLines, 3))
    # Parse the file data into the matrix
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector


def autoNorm(dataSet):
    """
    Normalize the feature values.

    Example: the distance between sample 3 and sample 4 would be
    [(0 - 67)**2 + (20000 - 32000)**2 + (1.1 - 0.1)**2] ** 0.5
    Problem: the frequent flyer mileage dominates the result far more than
    the other two features.
    Solution: normalize features with different value ranges, e.g. rescale
    them to 0..1 or -1..1.
    Normalization formula used here: newValue = oldValue / max
    :param dataSet: one row of data per sample, three columns
    :return: normDataSet: the normalized feature matrix
             maxVals: the maximum value of each feature
    """
    # Maximum value of each feature (column-wise)
    maxVals = dataSet.max(0)
    # Number of samples
    m = dataSet.shape[0]
    # Apply the formula to every row
    normDataSet = dataSet / np.tile(maxVals, (m, 1))
    return normDataSet, maxVals


def draw_pic(datingDataMat, datingLabels):
    """
    Scatter plot of frequent flyer miles earned per year against
    litres of ice cream consumed per week.
    :param datingDataMat: feature matrix returned by file2matrix
    :param datingLabels: class labels returned by file2matrix
    :return:
    """
    # Font that can render CJK characters, to avoid garbled axis labels
    myfont = font_manager.FontProperties(fname="/usr/share/fonts/cjkuni-uming/uming.ttc", size=12)
    # Create the canvas
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Use the class label to vary both marker size and colour
    labels = np.array(datingLabels)
    ax.scatter(datingDataMat[:, 0], datingDataMat[:, 2], 15.0 * labels, labels)
    plt.xlabel("Annual frequent flyer miles", fontproperties=myfont)
    plt.ylabel("Ice-cream litres consumed per week", fontproperties=myfont)
    plt.grid(alpha=0.5)
    plt.show()


def classify(inX, dataSet, labels, k):
    """
    :param inX: the sample to be classified
    :param dataSet: the known (training) data set
    :param labels: the labels of the training samples
    :param k: the k in kNN, i.e. how many neighbours to consider
    :return: the predicted label for inX
    """
    dataSetSize = dataSet.shape[0]
    # Repeat inX into a (dataSetSize, 1) tiling and subtract the data set:
    # (x1 - y1), (x2 - y2), ...
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the differences
    sqDiffMat = diffMat ** 2
    # Sum along each row (axis=1)
    sqDistance = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distances
    distances = sqDistance ** 0.5
    # argsort returns the indices that would sort the distances, not the values
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Sort the vote counts in descending order and return the winning label
    sortedClassCount = sorted(classCount.items(), key=lambda d: d[1], reverse=True)
    return sortedClassCount[0][0]


def datingClassTest():
    """
    Test the classifier on the dating-site data and count the misclassifications.
    :return: errorCount: the number of misclassified test samples
    """
    # Hold out 10% of the data for testing
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('data/datingTestSet2')
    normDataSet, maxVals = autoNorm(datingDataMat)
    # Number of samples
    m = normDataSet.shape[0]
    # Number of test samples
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify test sample i against the remaining 90% of the data
        classifierResult = classify(normDataSet[i, :], normDataSet[numTestVecs:m, :],
                                    datingLabels[numTestVecs:m], 3)
        if classifierResult != datingLabels[i]:
            errorCount += 1
            print("Correct result:", datingLabels[i])
            print("Predicted result:", classifierResult)
    return errorCount


def classifyPerson(Person):
    """
    Use the classifier to categorize a new person for the APP user.
    :param Person: feature vector [miles per year, game-time percentage, ice-cream litres]
    :return:
    """
    datingDataMat, datingLabels = file2matrix('data/datingTestSet2')
    normDataSet, maxVals = autoNorm(datingDataMat)
    # Normalize the new sample with the same maximum values before classifying it
    classifierResult = classify(Person / maxVals, normDataSet, datingLabels, 3)
    if classifierResult == 1:
        print("Dislike")
    elif classifierResult == 2:
        print("Like a little")
    else:
        print("Like very much")


if __name__ == '__main__':
    # personData = [30000, 10, 1.3]
    personData = [40920, 8.326976, 0.953952]
    classifyPerson(personData)
```
- Execution results