Map Data to a Normal Distribution in Scikit Learn

Last Updated : 30 Jan, 2023

A Normal Distribution, also known as a Gaussian distribution, is a continuous probability distribution that is symmetric about its mean. It is defined by its mean, which is the center of the distribution, and its standard deviation, which is a measure of the spread of the distribution.

The normal distribution is often used to model continuous data that is distributed symmetrically around a central value, such as the heights of people in a population. It is typically plotted as a bell curve, with the mean at the center and the standard deviation controlling the width of the curve.
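As a quick illustration, here is a minimal sketch (the loc, scale, and sample-size values are chosen arbitrarily) showing that samples drawn from a normal distribution cluster around the chosen mean with the chosen standard deviation:

Python3

import numpy as np

# Draw samples from a normal distribution with mean 5 and standard deviation 2
np.random.seed(0)
samples = np.random.normal(loc=5, scale=2, size=10000)

print(samples.mean())  # close to 5
print(samples.std())   # close to 2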

To map data to a normal distribution using scikit-learn we can use:

  • StandardScaler: a transformer in scikit-learn that standardizes features by removing the mean and scaling to unit variance. This brings different features onto the same scale so that they can be compared or combined more easily; note that it rescales the data but does not change the shape of its distribution.
  • PowerTransformer: a transformer in scikit-learn that applies a power transformation to the data to stabilize variance and make its distribution more Gaussian-like. This is often used on data that is skewed or has outliers to make it more suitable for analysis or modeling.

Map Data to a Normal Distribution using StandardScaler

To use StandardScaler, first create an instance of the class and fit it to the data. The fit method estimates the mean and standard deviation of the data, and the transform method applies the standardization (a short verification sketch follows the parameter list below). The basic syntax of the StandardScaler transformer is:

StandardScaler(*, copy=True, with_mean=True, with_std=True)

  • copy: Boolean value indicating whether to copy the data before scaling. The default value is True, which means that a copy of the data will be made. Setting copy=False will attempt to modify the original data in place.
  • with_mean: Boolean value indicating whether to center the data by subtracting the mean. The default value is True. Setting with_mean=False disables centering.
  • with_std: Boolean value indicating whether to scale the data to unit variance by dividing by the standard deviation. The default value is True. Setting with_std=False disables scaling.
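As a quick check, here is a minimal sketch (the toy array X is made up for illustration) verifying that StandardScaler computes z = (x - mean) / std per feature, using the population standard deviation:

Python3

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual computation with the population standard deviation (ddof=0),
# which is what StandardScaler uses internally
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))  # True
print(scaler.mean_, scaler.scale_)    # learned mean and scale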

Example:

Here is an example of how to use StandardScaler to standardize the features:

Python3




import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

# Generate some random data with a skewed distribution
np.random.seed(42)
X = np.concatenate([np.random.exponential(scale=2, size=200),
                    np.random.normal(loc=5, scale=1, size=200)])

# Plot the original data
plt.hist(X, bins=20, density=True, alpha=0.5,
         label='original')

# Standardize the data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X.reshape(-1, 1))

# Plot the standardized data
plt.hist(X_normalized, bins=20, density=True,
         alpha=0.5, label='normalized')

plt.legend()
plt.show()


Output:

The histogram shows the data after applying StandardScaler: it now has zero mean and unit variance. Note that the scaling changes only the location and spread of the data, not the shape of its distribution.
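To confirm this, here is a small self-contained sketch (the exponential sample is chosen purely for illustration) showing that the skewness of the data is unchanged by StandardScaler:

Python3

import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.exponential(scale=2, size=1000).reshape(-1, 1)

X_scaled = StandardScaler().fit_transform(X)

print(skew(X.ravel()))         # skewness of the raw data
print(skew(X_scaled.ravel()))  # the same: scaling does not remove skew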

 

Map Data to a Normal Distribution using PowerTransformer

PowerTransformer supports two power transformations: the Yeo-Johnson transformation and the Box-Cox transformation.

  • Yeo-Johnson: can be applied to both positive and negative data.
  • Box-Cox: can only be applied to strictly positive data.
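For reference, the Box-Cox transform of a strictly positive value x is (x^λ − 1) / λ for λ ≠ 0 and ln(x) for λ = 0; the Yeo-Johnson transform uses a similar formula extended to handle zero and negative values. In both cases, PowerTransformer estimates the exponent λ from the data by maximum likelihood.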

The syntax of the PowerTransformer class

Syntax: PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)

  • method: String value indicating the type of power transformation to apply. The options are 'yeo-johnson' and 'box-cox'. The default value is 'yeo-johnson', which applies the Yeo-Johnson transformation; 'box-cox' applies the Box-Cox transformation.
  • standardize: Boolean value indicating whether to center and scale the data after the transformation. The default value is True, which means that the data will be standardized to have a mean of 0 and a standard deviation of 1. Setting standardize=False will disable this transformation.
  • copy: Boolean value indicating whether to copy the data before transforming. The default value is True, which means that a copy of the data will be made. Setting copy=False will modify the original data in place.

To use PowerTransformer, you will first need to create an instance of the class and fit it to the data. The fit method estimates the parameters of the transformation, and the transform method applies the transformation to the data.
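Here is a minimal sketch of that workflow (the exponential sample is arbitrary), also showing the lambdas_ attribute that stores the estimated parameter and the inverse_transform method that undoes the mapping:

Python3

import numpy as np
from sklearn.preprocessing import PowerTransformer

np.random.seed(0)
X = np.random.exponential(scale=2, size=500).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')
pt.fit(X)                  # estimate the transformation parameter(s)
print(pt.lambdas_)         # one lambda per feature

X_transformed = pt.transform(X)               # apply the transformation
X_back = pt.inverse_transform(X_transformed)  # undo it
print(np.allclose(X, X_back))                 # True (up to numerical precision)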

Implementing the Yeo-Johnson method

The Yeo-Johnson power transformation stabilizes variance and makes data more symmetrical by applying a power transformation to the data. It is a generalization of the Box-Cox transformation: while Box-Cox can only be applied to positive data, Yeo-Johnson can be applied to both positive and negative data.

Python3




import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Generate some random data with a skewed distribution
np.random.seed(42)
X = np.concatenate([np.random.exponential(scale=2, size=200),
                    np.random.normal(loc=-3, scale=1, size=200)])

# Plot the original data
plt.hist(X, bins=20, density=True,
         alpha=0.5, label='original')

# Apply the Yeo-Johnson transformation
transformer = PowerTransformer(method='yeo-johnson')
X_transformed = transformer.fit_transform(X.reshape(-1, 1))

# Plot the transformed data
plt.hist(X_transformed, bins=20, density=True,
         alpha=0.5, label='transformed')

plt.title("Normal Distribution using PowerTransformer (Yeo-Johnson)")
plt.legend()
plt.show()


Output:

 

Implementing the Box-Cox method

The Box-Cox transformation is a method for stabilizing variance and making data more symmetrical by applying a power transformation to the data. It can only be applied to strictly positive data; the Yeo-Johnson transformation generalizes it to handle zero and negative values as well.
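As a small sketch of this restriction (the toy array is made up for illustration), fitting with method='box-cox' on data that contains a non-positive value raises a ValueError, while 'yeo-johnson' accepts it:

Python3

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-1.0], [2.0], [3.0]])  # contains a negative value

try:
    PowerTransformer(method='box-cox').fit_transform(X)
except ValueError as err:
    print("Box-Cox failed:", err)

print(PowerTransformer(method='yeo-johnson').fit_transform(X))

The full example below applies Box-Cox to data that is strictly positive: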

Python3




import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Generate some random data with a skewed distribution
# (all values are positive, as required by Box-Cox)
np.random.seed(42)
X = np.concatenate([np.random.exponential(scale=2, size=200),
                    np.random.normal(loc=5, scale=1, size=200)])

# Plot the original data
plt.hist(X, bins=20, density=True,
         alpha=0.5, label='original')

# Apply the Box-Cox transformation
transformer = PowerTransformer(method='box-cox')
X_transformed = transformer.fit_transform(X.reshape(-1, 1))

# Plot the transformed data
plt.hist(X_transformed, bins=20, density=True,
         alpha=0.5, label='transformed')

plt.title("Normal Distribution using PowerTransformer (Box-Cox)")
plt.legend()
plt.show()


Output:

 


