Feature scaling is one of the most important data preprocessing steps in machine learning. Algorithms that compute distances between data points are biased towards features with numerically larger values if the data is not scaled.
Tree-based algorithms, by contrast, are fairly insensitive to the scale of the features. Feature scaling also helps machine learning and deep learning algorithms train and converge faster.
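To make the distance bias concrete, here is a minimal sketch using only the Python standard library. The three people, their values, and the assumed feature ranges (age 0 to 100, income 0 to 100,000) are hypothetical and chosen purely for illustration.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical people described as (age in years, income in dollars).
alice = (25, 50_000)
bob = (60, 51_000)    # very different age, similar income
carol = (26, 58_000)  # similar age, different income

# Unscaled: the income difference dominates, so Bob looks closer to Alice.
d_bob = euclidean(alice, bob)
d_carol = euclidean(alice, carol)

def rescale(p):
    """Divide each feature by an assumed range (illustrative values only)."""
    return (p[0] / 100, p[1] / 100_000)

# Scaled: both features contribute comparably, so Carol is now closer.
d_bob_scaled = euclidean(rescale(alice), rescale(bob))
d_carol_scaled = euclidean(rescale(alice), rescale(carol))

print(d_bob < d_carol, d_carol_scaled < d_bob_scaled)
```

Before scaling, the nearest neighbour is decided almost entirely by income; after scaling, the large age gap to Bob counts as well.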
Normalisation and Standardisation are the most popular feature scaling techniques and, at the same time, the most confusing ones.
Let’s resolve that confusion.
Normalization or Min-Max Scaling is used to transform features to be on a similar scale. The new point is calculated as:
X_new = (X - X_min)/(X_max - X_min)
This scales the range to [0, 1] or sometimes [-1, 1]. Geometrically speaking, the transformation squishes the n-dimensional data into an n-dimensional unit hypercube. Normalization is useful when there are no outliers, as it cannot cope with them: a single extreme value becomes X_max and compresses every other value into a narrow band. For example, we would usually normalize age but not income, because only a few people have very high incomes while age is close to uniformly distributed.
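The formula above can be sketched in a few lines of plain Python. The function name `min_max_scale`, its `low` and `high` parameters, and the sample ages and incomes are illustrative assumptions, not part of any library.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Map values linearly so the minimum becomes `low` and the maximum `high`."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # A constant feature has no spread; map every value to `low`.
        return [low for _ in values]
    return [low + (x - x_min) / span * (high - low) for x in values]

ages = [18, 25, 32, 47, 60]
scaled_ages = min_max_scale(ages)            # smallest -> 0.0, largest -> 1.0
symmetric_ages = min_max_scale(ages, -1, 1)  # same data mapped to [-1, 1]

# One extreme income becomes X_max and squeezes the other values towards 0.
incomes = [30_000, 35_000, 40_000, 45_000, 1_000_000]
scaled_incomes = min_max_scale(incomes)

print(scaled_ages)
print(scaled_incomes)
```

With the ages, the scaled values spread across the whole interval. With the incomes, the four ordinary values end up crowded near 0 while the outlier alone sits at 1, which is the weakness described above.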
Standardization or Z-Score Normalization is the transformation of features by subtracting the mean and dividing by the standard deviation. The result is often called the Z-score.
X_new = (X - mean)/Std
Standardization can be helpful in cases where the data follows a Gaussian distribution. However, this does not necessarily have to be true. Geometrically speaking, it translates the data so that the mean vector of the original data moves to the origin, and squishes or expands the points if the standard deviation is greater than 1 or less than 1, respectively. We are only changing the mean and standard deviation to 0 and 1: normally distributed data becomes a standard normal distribution, which is still normal, so the shape of the distribution is not affected.
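Standardization is much less affected by outliers than normalization because there is no predefined range for the transformed features, although extreme values still shift the mean and inflate the standard deviation.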
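As a rough illustration of the Z-score formula, the sketch below standardizes a small list with Python's `statistics` module. The function name `standardize` and the sample incomes are hypothetical, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption made here for simplicity.

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; every z-score is 0.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

incomes = [30_000, 35_000, 40_000, 45_000, 1_000_000]
z = standardize(incomes)

# After the transformation the feature has mean 0 and standard deviation 1.
print(round(statistics.fmean(z), 9), round(statistics.pstdev(z), 9))
print(z)
```

The transformed values are not confined to a fixed interval: some are negative and the extreme income is not clipped. Note that the mean and standard deviation are themselves computed from all values, the extreme one included.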
Difference between Normalisation and Standardisation
| S.No. | Normalisation | Standardisation |
|---|---|---|
| 1. | The minimum and maximum values of a feature are used for scaling. | The mean and standard deviation are used for scaling. |
| 2. | It is used when features are on different scales. | It is used when we want to ensure zero mean and unit standard deviation. |
| 3. | It scales values to [0, 1] or [-1, 1]. | It is not bounded to a certain range. |
| 4. | It is strongly affected by outliers. | It is much less affected by outliers. |
| 5. | Scikit-Learn provides a transformer called | Scikit-Learn provides a transformer called |
| 6. | This transformation squishes the n-dimensional data into an n-dimensional unit hypercube. | It translates the data so that the mean vector of the original data moves to the origin, and squishes or expands it. |
| 7. | It is useful when we don’t know the distribution. | It is useful when the feature distribution is Normal or Gaussian. |
| 8. | It is often called Scaling Normalization. | It is often called Z-Score Normalization. |