Data Preprocessing in Machine Learning--Data Normalization

Links to the original text: https://www.lizenghai.com/archives/20262.html https://blog.csdn.net/csmqq/article/details/51461696

Data Normalization

Data normalization maps all data onto the same scale. Two common approaches:
  • Maximum Normalization
  • Mean variance normalization

Maximum Normalization

Maximum normalization maps all data into the range [0, 1].

This method is useful when the standard deviation of a data set is very small, or when the data contains many zero (sparse) entries that need to be preserved.

$$x_{scale}=\frac{x-x_{min}}{x_{max}-x_{min}}$$

import numpy as np

# 100 random integers in [0, 100)
x = np.random.randint(0, 100, size=100)

# min-max normalization: every value now lies in [0, 1]
(x - np.min(x)) / (np.max(x) - np.min(x))

Maximum normalization scales data to a given range, usually [0, 1], and can be implemented with MinMaxScaler. Alternatively, MaxAbsScaler scales each feature by its maximum absolute value, so values fall in [-1, 1].

MinMaxScaler

Formula (where max and min are the bounds of the target feature range):

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
#Example: scale data to [0, 1]; the training step uses fit_transform()
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()

X_train_minmax = min_max_scaler.fit_transform(X_train)
#out: array([[ 0.5       ,  0.        ,  1.        ],
#            [ 1.        ,  0.5       ,  0.33333333],
#            [ 0.        ,  1.        ,  0.        ]])

#Apply the scale parameters above to the test data
X_test = np.array([[ -3., -1., 4.]])  

X_test_minmax = min_max_scaler.transform(X_test) #out: array([[-1.5 ,  0. , 1.66666667]])

#You can view scaler properties in the following ways
min_max_scaler.scale_        #out: array([ 0.5 ,  0.5,  0.33...])
min_max_scaler.min_         #out: array([ 0.,  0.5,  0.33...])
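MinMaxScaler's feature_range parameter selects a target interval other than the default [0, 1]. A minimal sketch, reusing the training data from the example above and scaling into [-1, 1]; inverse_transform recovers the original values:

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# scale each column into [-1, 1] instead of the default [0, 1]
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)

# inverse_transform applies the inverse mapping, recovering X_train
X_back = scaler.inverse_transform(X_scaled)
```

After fitting, each column of X_scaled spans exactly [-1, 1], since every column's minimum and maximum are mapped to the range bounds.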

MaxAbsScaler

Similar to the above method, MaxAbsScaler scales the training set to [-1, 1] by dividing each column by its maximum absolute value. It is intended for data that is already zero-centered or sparse (containing very many zeros), since it does not shift the data and therefore preserves zeros.

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = preprocessing.MaxAbsScaler()

X_train_maxabs = max_abs_scaler.fit_transform(X_train)
#out: array([[ 0.5, -1. ,  1. ],
#            [ 1. ,  0. ,  0. ],
#            [ 0. ,  1. , -0.5]])

X_test = np.array([[ -3., -1.,  4.]])

X_test_maxabs = max_abs_scaler.transform(X_test) #out: array([[-1.5, -1. ,  2. ]])

max_abs_scaler.scale_  #out: array([ 2.,  1.,  2.])
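Because MaxAbsScaler only divides and never shifts, it also accepts scipy sparse input directly, and zeros stay zero. A minimal sketch with hypothetical sparse data:

```python
import numpy as np
import scipy.sparse as sp
from sklearn import preprocessing

# a sparse matrix: most entries are zero
X_sparse = sp.csr_matrix([[0., 0., 4.],
                          [2., 0., 0.],
                          [0., -1., 0.]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_scaled = max_abs_scaler.fit_transform(X_sparse)

# columns are divided by their max absolute values [2, 1, 4];
# zero entries are untouched, so sparsity is preserved
```

This is why MaxAbsScaler is the usual choice for sparse feature matrices: a scaler that subtracts a mean or minimum would turn every zero into a nonzero value and destroy the sparsity.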

Mean variance normalization

Mean variance normalization (standardization) transforms all data to a distribution with mean 0 and variance 1:

$$x_{scale}=\frac{x-x_{mean}}{s}$$

where $s$ is the standard deviation.

from sklearn import preprocessing 

import numpy as np  

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])  

X_scaled = preprocessing.scale(X) 

#output: X_scaled = [[ 0.         -1.22474487  1.33630621]
#                    [ 1.22474487  0.         -0.26726124]
#                    [-1.22474487  1.22474487 -1.06904497]]

#the scaled data has zero mean and unit variance
X_scaled.mean(axis=0)  # column means: array([ 0.,  0.,  0.])

X_scaled.std(axis=0)   # column standard deviations: array([ 1.,  1.,  1.])
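The formula above can be checked by hand against preprocessing.scale. Note that sklearn uses the population standard deviation (NumPy's default, ddof=0):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# manual standardization: subtract the column mean, divide by the
# column standard deviation (population std, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# sklearn's implementation of the same transformation
X_sklearn = preprocessing.scale(X)
```

The two results agree element for element, confirming that scale() is exactly the mean-variance formula applied per column.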

StandardScaler calculates the mean and standard deviation on the training set, so that the same transformation can later be applied to the test set.

scaler = preprocessing.StandardScaler().fit(X)    #out: StandardScaler(copy=True, with_mean=True, with_std=True)

scaler.mean_   #out: array([ 1.,  0. ,  0.33333333])  

scaler.scale_  #out: array([ 0.81649658,  0.81649658,  1.24721913])  (named std_ in older sklearn versions)

#The scaler is applied to the test input; the transformed result matches the output above.
scaler.transform(X)
#out: array([[ 0.        , -1.22474487,  1.33630621],
#            [ 1.22474487,  0.        , -0.26726124],
#            [-1.22474487,  1.22474487, -1.06904497]])

scaler.transform([[-1., 1., 0.]])  #scale the new data, out: array([[-2.44948974,  1.22474487, -0.26726124]])
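In practice, the fit-on-train / transform-on-test discipline is often wrapped in a Pipeline, so the scaler's statistics are computed once on the training data and reused automatically at prediction time. A minimal sketch with hypothetical toy data and a nearest-neighbor classifier:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# hypothetical toy data: 4 samples, 2 features on very different scales
X = np.array([[1., 100.], [2., 200.], [8., 800.], [9., 900.]])
y = np.array([0, 0, 1, 1])

# the pipeline fits StandardScaler on the training data and reuses its
# mean/std when predicting, so no test-set statistics leak into training
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
clf.fit(X, y)

# a query near the small-valued samples is assigned class 0
pred = clf.predict([[1.5, 150.]])
```

Without the scaler, the second feature would dominate the distance computation; standardizing first puts both features on an equal footing.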

Sources:
Machine Learning Series (9) - Data Normalization and Scaler in sklearn
Sklearn preprocessing in machine learning

Posted by fibonacci on Thu, 08 Aug 2019 23:24:26 -0700