PCA is a dimensionality reduction method that is often used as a black box. Through a linear mapping, we want the projected data to be spread out as much as possible, so we choose each projection direction to maximize the variance after mapping, and we require each new direction to be orthogonal to the previous ones.
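In standard notation (added here for clarity, not from the original text), the k-th projection direction solves

\[
\max_{w_k}\; w_k^{\top}\,\Sigma\, w_k
\quad \text{s.t.} \quad \|w_k\|_2 = 1,\;\; w_k^{\top} w_j = 0 \ \text{for } j < k,
\]

where \(\Sigma\) is the covariance matrix of the mean-removed data; the solutions are the eigenvectors of \(\Sigma\), ordered by decreasing eigenvalue.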
Steps of PCA:
Step 1: first, compute the covariance matrix of the current (mean-removed) data: covariance matrix = data transpose * data / (m - 1), where m is the number of samples. The diagonal entries are the variances of the individual features, and the off-diagonal entries are the covariances between features.
Step 2: we diagonalize this matrix so that every off-diagonal covariance becomes 0 and only the values on the diagonal remain; this diagonalization yields the eigenvalues and eigenvectors.
Step 3: reduce the dimension by multiplying the current data by the selected eigenvectors. Each eigenvalue divided by the sum of all eigenvalues gives the proportion of variance explained by the corresponding eigenvector, i.e. how important that direction is (see the compact sketch after these steps).
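Before the worked program, here is a compact NumPy sketch of the three steps on toy data (the variable names are illustrative and do not appear in the program below):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))              # toy data: 100 samples, 4 features

# Step 1: remove the mean and build the covariance matrix
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (Xc.shape[0] - 1)        # diagonal = variances, rest = covariances

# Step 2: diagonalize -> eigenvalues and eigenvectors (eigh suits symmetric matrices)
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]         # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 3: project onto the top-2 eigenvectors; eigenvalue / sum of eigenvalues
# is the share of variance explained by each direction
X_2d = Xc @ eig_vecs[:, :2]
explained = eig_vals / eig_vals.sum()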
The program below walks through these steps.
Step 1: data import, mean removal and covariance calculation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# iris.data has no header row, so tell pandas not to treat the first row as one
df = pd.read_csv('iris.data', header=None)
print(df.head())
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
print(df.head())
# Feature matrix
X = df.iloc[:, 0:4].values
# Labels
y = df.iloc[:, 4].values
msg = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(msg)  # Map class names to numbers

# Standardize the features
from sklearn.preprocessing import StandardScaler
Scaler = StandardScaler()
X_Scaler = Scaler.fit_transform(X)

# Mean of each column (feature); close to 0 after standardization
mean_vec = np.mean(X_Scaler, axis=0)
# Covariance matrix of the mean-removed data
cov_mat = (X_Scaler - mean_vec).T.dot(X_Scaler - mean_vec) / (X_Scaler.shape[0] - 1)
print(cov_mat)
# np.cov gives the same result
cov_mat = np.cov(X_Scaler.T)
print(cov_mat)
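A quick sanity check on Step 1 (a sketch reusing the variables above, not in the original code):

# The column means are ~0 after standardization, and the two covariance
# computations above agree up to floating-point error
print(np.allclose(mean_vec, 0))
manual_cov = (X_Scaler - mean_vec).T.dot(X_Scaler - mean_vec) / (X_Scaler.shape[0] - 1)
print(np.allclose(manual_cov, np.cov(X_Scaler.T)))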
Step 2: diagonalizing the matrix is the process of finding its eigenvalues and eigenvectors
# Find eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print(eig_vals, eig_vecs)
# Pair each eigenvalue with its eigenvector
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
# Sort the pairs by eigenvalue, largest first
eig_pairs.sort(key=lambda x: x[0], reverse=True)
tot = sum(eig_vals)
# Percentage of variance explained by each component
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
# cumsum gives the running (cumulative) total of the explained variance
cum_var_exp = np.cumsum(var_exp)
# Plot
plt.figure(figsize=(6, 4))
# Bar chart: individual explained variance
plt.bar(range(4), var_exp, alpha=0.5, align='center', label='individual explained variance')
# Step chart: cumulative explained variance
plt.step(range(4), cum_var_exp, where='mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
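To make the diagonalization in Step 2 concrete, here is a small check (a sketch reusing cov_mat, eig_vals and eig_vecs from above; not part of the original code):

# V diag(lambda) V^T rebuilds the covariance matrix, and rotating the data into
# the eigenvector basis leaves an (approximately) diagonal covariance matrix
print(np.allclose(cov_mat, eig_vecs @ np.diag(eig_vals) @ eig_vecs.T))
print(np.round(eig_vecs.T @ cov_mat @ eig_vecs, 3))  # off-diagonal entries ~0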
Step 3: project the (mean-removed) data onto the selected eigenvectors
# Reduce the 4-D data to 2-D by stacking the top two eigenvectors into a projection matrix
# np.hstack merges the two vectors; reshape(4, 1) turns each one into a column (a transpose)
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1), eig_pairs[1][1].reshape(4, 1)))
# Projected data: (150 x 4) .dot (4 x 2) = (150 x 2)
become_X_Scaler = X_Scaler.dot(matrix_w)
print(become_X_Scaler)
plt.figure(figsize=(6, 4))
color = np.array(['red', 'green', 'blue'])
# Scatter plot of the projected data, one color per class
plt.scatter(become_X_Scaler[:, 0], become_X_Scaler[:, 1], c=color[df['class']])
plt.show()
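As a final sanity check (a sketch, not part of the original write-up), the same projection can be reproduced with sklearn's PCA; individual columns may differ by a sign flip, since eigenvector signs are arbitrary:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
sk_proj = pca.fit_transform(X_Scaler)
# Agreement with become_X_Scaler up to a per-column sign flip
print(np.allclose(np.abs(sk_proj), np.abs(become_X_Scaler)))
# Should match the first two entries of var_exp from Step 2
print(pca.explained_variance_ratio_ * 100)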