Data normalization
Data normalization is the mapping of all data onto a common scale. Two common methods are:
- Maximum Normalization
- Mean variance normalization
Maximum Normalization
Maximum normalization maps all data into the range [0, 1].
This method is useful when the standard deviation of the data is very small, or when the data contains many zero (sparse) entries that need to be preserved.
$$x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
```python
import numpy as np

# 100 random integers in [0, 100)
x = np.random.randint(0, 100, size=100)
x_scale = (x - np.min(x)) / (np.max(x) - np.min(x))
```
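The same formula applies per feature when the data is a 2-D matrix; a minimal sketch (the random matrix here is made up for illustration) that normalizes each column independently via `axis=0`:

```python
import numpy as np

# Hypothetical 50x3 feature matrix
X = np.random.randint(0, 100, size=(50, 3)).astype(float)

# Apply the formula column by column: every feature ends up in [0, 1]
X_scale = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scale.min(axis=0), X_scale.max(axis=0))  # [0. 0. 0.] [1. 1. 1.]
```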
Maximum normalization scales data to a given range, usually [0, 1], and can be implemented with MinMaxScaler. Alternatively, you can scale each feature by its maximum absolute value to unit size, using MaxAbsScaler.
MinMaxScaler
Formula:
```python
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min  # (min, max) = feature_range
```
```python
import numpy as np
from sklearn import preprocessing

# Example: scale data to [0, 1]. Training step: fit_transform()
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
# out: array([[0.5       , 0.        , 1.        ],
#             [1.        , 0.5       , 0.33333333],
#             [0.        , 1.        , 0.        ]])

# Apply the scale parameters learned above to the test data
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
# out: array([[-1.5       ,  0.        ,  1.66666667]])

# Scaler attributes can be inspected as follows
min_max_scaler.scale_  # out: array([0.5       , 0.5       , 0.33...])
min_max_scaler.min_    # out: array([0.        , 0.5       , 0.33...])
```
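MinMaxScaler also accepts a `feature_range` argument when a target range other than the default [0, 1] is needed; a minimal sketch reusing `X_train` from above:

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Map each feature to [-1, 1] instead of the default [0, 1]
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
scaler.fit_transform(X_train)
# out: array([[ 0.        , -1.        ,  1.        ],
#             [ 1.        ,  0.        , -0.33333333],
#             [-1.        ,  1.        , -1.        ]])
```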
MaxAbsScaler
MaxAbsScaler is similar to the method above, but it scales the training set to [-1, 1] by dividing each feature by its maximum absolute value. It is intended for data that is already centered at zero, or for sparse data with many zeros, since it neither shifts nor centers the values.
```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
# out: array([[ 0.5, -1. ,  1. ],
#             [ 1. ,  0. ,  0. ],
#             [ 0. ,  1. , -0.5]])

X_test = np.array([[-3., -1., 4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
# out: array([[-1.5, -1. ,  2. ]])

max_abs_scaler.scale_
# out: array([2., 1., 2.])
```
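Because MaxAbsScaler never shifts or centers the data, it can also be applied directly to scipy.sparse input without destroying the zeros; a minimal sketch (the sparse matrix below is made up for illustration):

```python
import scipy.sparse as sp
from sklearn import preprocessing

# A small sparse matrix: most entries are zero
X_sparse = sp.csr_matrix([[1., 0., 2.],
                          [0., 0., -4.],
                          [0., 3., 0.]])
X_scaled = preprocessing.MaxAbsScaler().fit_transform(X_sparse)

# The zeros stay zero; only the nonzero entries are rescaled
print(X_scaled.toarray())
# [[ 1.   0.   0.5]
#  [ 0.   0.  -1. ]
#  [ 0.   1.   0. ]]
```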
Mean variance normalization
Mean variance normalization normalizes all data into a distribution with mean 0 and variance 1:

$$x_{scale} = \frac{x - x_{mean}}{s}$$

where $s$ is the standard deviation of the feature.
```python
from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = preprocessing.scale(X)
# out: array([[ 0.        , -1.22474487,  1.33630621],
#             [ 1.22474487,  0.        , -0.26726124],
#             [-1.22474487,  1.22474487, -1.06904497]])

# After scaling, the data has zero mean and unit variance
X_scaled.mean(axis=0)  # column means: array([0., 0., 0.])
X_scaled.std(axis=0)   # column standard deviations: array([1., 1., 1.])
```
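For reference, the same result can be computed by hand with the formula above; note that `preprocessing.scale` uses the population standard deviation (`ddof=0`), which matches NumPy's default:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Manual standardization; np.std defaults to the population
# standard deviation (ddof=0), the same convention scale() uses
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
np.allclose(X_manual, preprocessing.scale(X))  # True
```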
StandardScaler stores the mean and standard deviation of the training set, so that exactly the same transformation can later be applied to the test data set.
```python
scaler = preprocessing.StandardScaler().fit(X)
scaler.mean_   # out: array([1.        , 0.        , 0.33333333])
scaler.scale_  # per-feature standard deviation, out: array([0.81649658, 0.81649658, 1.24721913])

# Applying the fitted scaler to X reproduces the result above
scaler.transform(X)
# out: array([[ 0.        , -1.22474487,  1.33630621],
#             [ 1.22474487,  0.        , -0.26726124],
#             [-1.22474487,  1.22474487, -1.06904497]])

# Scale new data with the same training statistics
scaler.transform([[-1., 1., 0.]])
# out: array([[-2.44948974,  1.22474487, -0.26726124]])
```
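A typical workflow is to fit the scaler on the training split only and then reuse its statistics on the test split; a short sketch (the data here is made up, and `train_test_split` is scikit-learn's standard splitting helper):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Made-up data, just to show the fit-on-train / transform-on-test pattern
X = np.random.rand(100, 3)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = preprocessing.StandardScaler().fit(X_train)  # statistics from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)              # same mean/std applied to the test set
```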