Learning the PCA algorithm (dimensionality reduction)


PCA is a black-box style dimensionality reduction method. It maps the data onto new directions chosen so that the projected data is spread out as much as possible, i.e. so that the variance after the mapping is as large as possible, and each subsequent mapping direction is orthogonal to the ones already chosen.
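To make the "maximize the variance" idea concrete, here is a minimal sketch with synthetic data (my own toy example, not part of the iris walkthrough that follows): projecting the same 2-D point cloud onto different unit directions gives very different variances, and PCA picks the direction where the variance is largest.

import numpy as np

# Toy 2-D point cloud (synthetic, not the iris data), stretched along the x axis
np.random.seed(0)
data = np.random.randn(200, 2) * np.array([3.0, 0.5])

# Project onto two candidate unit directions and compare the variance of the projections
for direction in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    projected = data.dot(direction)              # 1-D projection onto this direction
    print(direction, projected.var(ddof=1))      # roughly 9 along x, 0.25 along y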

Steps of PCA:

Step 1: compute the covariance matrix of the mean-centered data: covariance matrix = data transposed · data / (m - 1), where m is the number of samples (rows). The diagonal entries are the variances of the individual features, and the off-diagonal entries are the covariances between features.

Step 2: diagonalize the covariance matrix so that all off-diagonal covariances become 0 and only the diagonal entries remain; this diagonalization yields the eigenvalues and eigenvectors.

Step 3: reduce the dimension by multiplying the data by the chosen eigenvectors. Each eigenvalue divided by the sum of all eigenvalues gives the fraction of the total variance explained by the corresponding eigenvector, i.e. how important that direction is.
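For reference, the same pipeline can be run with scikit-learn's built-in PCA class (which works via an SVD internally). This is only a sketch for cross-checking the manual steps below, assuming the same iris.data file is available; the component signs may differ from the hand-computed eigenvectors.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Same preprocessing as the manual walkthrough below
df = pd.read_csv('iris.data', header=None,
                 names=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class'])
X_scaled = StandardScaler().fit_transform(df.iloc[:, 0:4].values)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)       # project onto the top two components
print(pca.explained_variance_ratio_)     # fraction of variance explained by each component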

Here is a walkthrough of the program.

Step 1: data import, mean removal and covariance calculation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('iris.data', header=None)  # the raw file has no header row, so keep all 150 samples
print(df.head())

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
print(df.head())

# Feature matrix: the first four columns
X = df.iloc[:, 0:4].values
# Class labels: the fifth column
y = df.iloc[:, 4].values

msg = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(msg)  # Map the class names to integer codes

#Standardize
from sklearn.preprocessing import StandardScaler
Scaler = StandardScaler()
X_Scaler = Scaler.fit_transform(X)

# Mean of each column (already close to 0 after standardization)
mean_vec = np.mean(X_Scaler, axis=0)
#Covariance matrix after mean removal
cov_mat = (X_Scaler-mean_vec).T.dot(X_Scaler-mean_vec)/(X_Scaler.shape[0]-1)
print(cov_mat)
# np.cov gives the same covariance matrix (it expects variables as rows, hence the transpose)
cov_mat = np.cov(X_Scaler.T)
print(cov_mat)
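One extra line (my addition, not in the original post) confirms that the manual formula and np.cov really do agree:

# Both computations use the 1/(m-1) normalization, so the matrices match element-wise
print(np.allclose((X_Scaler - mean_vec).T.dot(X_Scaler - mean_vec) / (X_Scaler.shape[0] - 1),
                  np.cov(X_Scaler.T)))   # True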

Step 2: diagonalizing the matrix is exactly the process of finding its eigenvalues and eigenvectors

# Finding eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print(eig_vals, eig_vecs)

# Pair each eigenvalue with its eigenvector (the columns of eig_vecs)
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)  # Sort by eigenvalue, largest first


tot = sum(eig_vals)

# Percentage of the total variance explained by each component, largest first
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)]
# cumsum gives the running (cumulative) total of those percentages
cum_var_exp = np.cumsum(var_exp)
#Drawing
plt.figure(figsize=(6, 4))
# Bar chart: variance explained by each individual component
plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
#Draw step chart
plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
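As a small sanity check on the diagonalization claim in Step 2 (my addition, reusing cov_mat, eig_vals and eig_vecs from the code above): every eigenvector v satisfies cov_mat · v = λ · v, and rotating the covariance matrix into the eigenvector basis leaves only the diagonal.

# Each (eigenvalue, eigenvector) pair satisfies the eigenvalue equation cov_mat . v = lambda * v
for lam, vec in zip(eig_vals, eig_vecs.T):
    print(np.allclose(cov_mat.dot(vec), lam * vec))   # True for every pair

# In the eigenvector basis the covariance matrix is (numerically) diagonal
print(np.round(eig_vecs.T.dot(cov_mat).dot(eig_vecs), 6))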

Step 3: project the data onto the top eigenvectors (dimensionality reduction)

# Reduce the 4-D data to 2-D: stack the top two eigenvectors into a 4x2 projection matrix

# np.hstack merges the two vectors; reshape(4, 1) turns each one into a column vector (equivalent to transposing)
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))


# Project the standardized data: (n_samples x 4) .dot (4 x 2) = (n_samples x 2)
become_X_Scaler = X_Scaler.dot(matrix_w)
print(become_X_Scaler)
plt.figure(figsize=(6, 4))

color = np.array(['red', 'green', 'blue'])  # Color lookup array indexed by the integer class code
plt.scatter(become_X_Scaler[:, 0], become_X_Scaler[:, 1], c=color[df['class']])  # Scatter of the two components, colored by class
plt.show()
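A final check (my addition): the variance of each projected column should equal the corresponding largest eigenvalue from Step 2, which is exactly the "maximize the variance" property PCA promises.

# Per-component variance of the projected data vs. the two largest eigenvalues
print(np.var(become_X_Scaler, axis=0, ddof=1))
print(eig_pairs[0][0], eig_pairs[1][0])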
