
Map Data to a Normal Distribution in Scikit Learn

A Normal Distribution, also known as a Gaussian distribution, is a continuous probability distribution that is symmetrical around its mean. It is defined by its mean, which is the center of the distribution, and its standard deviation, which is a measure of the spread of the distribution.

The normal distribution is often used to model continuous and symmetrically distributed data around a central value, such as the heights of people in a population. Normal distributions are often plotted on a graph as a bell curve, with the mean at the center and the standard deviation indicating the spread of the curve.
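As a quick illustration of the mean and standard deviation described above, the following sketch draws samples from a normal distribution with NumPy and checks that about 68% of them fall within one standard deviation of the mean. The height values (mean 170, standard deviation 10) are hypothetical numbers chosen only for this example.

```python
import numpy as np

# Hypothetical example: heights in cm with mean 170 and standard deviation 10
rng = np.random.default_rng(0)
samples = rng.normal(loc=170.0, scale=10.0, size=100_000)

sample_mean = samples.mean()
sample_std = samples.std()

# Fraction of samples within one standard deviation of the mean
within_one_std = np.mean(np.abs(samples - sample_mean) <= sample_std)

print(f"mean: {sample_mean:.2f}, std: {sample_std:.2f}")
print(f"fraction within one std: {within_one_std:.3f}")
```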



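scikit-learn provides preprocessing transformers that help with this. This article covers two of them: StandardScaler, which centers and rescales the data, and PowerTransformer, which changes the shape of the distribution to make it more Gaussian-like.

Map data to Normal Distribution using StandardScaler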

StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation. To use it, create an instance of the class and fit it to the data. The fit method estimates the mean and standard deviation of each feature, and the transform method applies the transformation to the data; fit_transform does both in one call. The basic syntax of the StandardScaler transformer is:



StandardScaler(*, copy=True, with_mean=True, with_std=True)

  • copy: Boolean value indicating whether to copy the data before scaling. The default value is True, which means that a copy of the data will be made. Setting copy=False asks the scaler to work in place where possible, although in-place scaling is not guaranteed for every input type.
  • with_mean: Boolean value indicating whether to center the data by subtracting the mean. The default value is True, which means that the mean will be subtracted from the data. Setting with_mean=False will disable this transformation.
  • with_std: Boolean value indicating whether to scale the data by dividing by the standard deviation. The default value is True, which means that the data will be scaled to have a standard deviation of 1. Setting with_std=False will disable this transformation.
  • Note: StandardScaler has no axis parameter. It always computes the mean and standard deviation of each feature (column) over the samples (rows). To standardize along the other axis, the function sklearn.preprocessing.scale accepts an axis argument.
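The fit and transform steps described above can be sketched on a tiny array. This is a minimal illustration, not part of the original example: the array values are made up, and the manual computation is included only to show that the scaler's output matches subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples (rows) and 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Statistics learned by fit, one value per feature
print("mean_: ", scaler.mean_)
print("scale_:", scaler.scale_)

# The same result computed by hand, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print("matches manual computation:", np.allclose(X_scaled, manual))
```

Each column of X_scaled ends up with a mean of 0 and a standard deviation of 1, independently of the other column.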

Example:

Here is an example of how to use StandardScaler to standardize the features:




import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

# Generate some random data with a skewed distribution
np.random.seed(42)
X = np.concatenate([
    np.random.exponential(scale=2, size=200),
    np.random.normal(loc=5, scale=1, size=200),
])

# Plot the original data
plt.hist(X, bins=20, density=True, alpha=0.5, label='original')

# Standardize the data (StandardScaler expects a 2D array)
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X.reshape(-1, 1))

# Plot the standardized data, flattened back to 1D for the histogram
plt.hist(X_standardized.ravel(), bins=20, density=True,
         alpha=0.5, label='standardized')

plt.legend()
plt.show()

Output:

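The standardized data is centered at 0 with a standard deviation of 1, but its histogram has the same shape as the original. StandardScaler only shifts and rescales the values, so skewed data stays skewed. On its own it does not make the data normally distributed.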
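One way to check what standardization does to the shape of a distribution is to compare the skewness before and after scaling. The sketch below uses scipy.stats.skew for that comparison; the exponential sample is an assumption made for illustration and is not the data from the example above.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# Illustrative right-skewed sample (not the article's data)
rng = np.random.default_rng(42)
X = rng.exponential(scale=2.0, size=1000).reshape(-1, 1)

X_scaled = StandardScaler().fit_transform(X)

skew_before = skew(X.ravel())
skew_after = skew(X_scaled.ravel())

print(f"skewness before scaling: {skew_before:.4f}")
print(f"skewness after scaling:  {skew_after:.4f}")
```

The two values agree because skewness is unaffected by shifting the data and dividing it by a positive constant.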

 

Map data to Normal Distribution using PowerTransformer

PowerTransformer applies a monotonic power transformation to each feature to make its distribution more Gaussian-like. Two transformations are available: Yeo-Johnson and Box-Cox.

The syntax of the PowerTransformer class

Syntax: PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)

  • method: String value indicating the type of power transformation to apply. The options are 'yeo-johnson' and 'box-cox'. The default value is 'yeo-johnson', which applies the Yeo-Johnson transformation and works with positive and negative values. The 'box-cox' method applies the Box-Cox transformation and requires strictly positive data.
  • standardize: Boolean value indicating whether to center and scale the data after the transformation. The default value is True, which means that the data will be standardized to have a mean of 0 and a standard deviation of 1. Setting standardize=False will disable this transformation.
  • copy: Boolean value indicating whether to copy the data before transforming. The default value is True, which means that a copy of the data will be made. Setting copy=False will modify the original data in place.

To use PowerTransformer, create an instance of the class and fit it to the data. The fit method estimates the power parameter (lambda) of the transformation for each feature by maximum likelihood, and the transform method applies the transformation to the data.
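The fit and transform workflow just described can be sketched as follows. This is a minimal illustration on a made-up exponential sample; it reads the fitted parameters from the lambdas_ attribute and uses inverse_transform to map the transformed values back to the original scale.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Illustrative right-skewed sample with a single feature
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(500, 1))

# Defaults: method='yeo-johnson', standardize=True
pt = PowerTransformer()
X_t = pt.fit_transform(X)

# One fitted lambda per feature
print("lambdas_:", pt.lambdas_)

# With standardize=True the output has zero mean and unit variance
print(f"mean: {X_t.mean():.6f}, std: {X_t.std():.6f}")

# The transformation can be inverted to recover the original values
X_back = pt.inverse_transform(X_t)
print("round trip recovers X:", np.allclose(X_back, X, atol=1e-6))
```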

Implementing the Yeo-Johnson method

The Yeo-Johnson power transformation is a method for stabilizing variance and making data more symmetrical by applying a power transformation to the data. It extends the Box-Cox transformation, which can only be applied to strictly positive data, so that zero and negative values can be transformed as well.
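Before the full plotting example, here is a small numeric sketch of the two properties mentioned above: the transformation accepts negative values and makes the data more symmetrical. The shifted exponential sample and the use of scipy.stats.skew as the symmetry measure are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed sample shifted left so that it contains negative values
X = (rng.exponential(scale=2.0, size=1000) - 1.0).reshape(-1, 1)
has_negatives = bool((X < 0).any())

X_t = PowerTransformer(method='yeo-johnson').fit_transform(X)

skew_before = skew(X.ravel())
skew_after = skew(X_t.ravel())

print("contains negative values:", has_negatives)
print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

A skewness closer to 0 after the transformation indicates a more symmetrical distribution.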




import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Generate some random data with a skewed distribution
# (the normal component is centered at -3, so the data includes negatives)
np.random.seed(42)
X = np.concatenate([
    np.random.exponential(scale=2, size=200),
    np.random.normal(loc=-3, scale=1, size=200),
])

# Plot the original data
plt.hist(X, bins=20, density=True, alpha=0.5, label='original')

# Apply the Yeo-Johnson transformation (expects a 2D array)
transformer = PowerTransformer(method='yeo-johnson')
X_transformed = transformer.fit_transform(X.reshape(-1, 1))

# Plot the transformed data, flattened back to 1D for the histogram
plt.hist(X_transformed.ravel(), bins=20, density=True,
         alpha=0.5, label='transformed')

plt.title("Mapping to a normal distribution with PowerTransformer (Yeo-Johnson)")
plt.legend()
plt.show()

Output: a plot with overlaid histograms of the original and the transformed data.

 

Implementing the Box-Cox method

The Box-Cox transformation is a method for stabilizing variance and making data more symmetrical by applying a power transformation to the data. It is the transformation that Yeo-Johnson extends, and it can only be applied to strictly positive data.
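The positivity requirement can be seen directly: fitting with method='box-cox' on data that contains non-positive values raises a ValueError. The sketch below demonstrates this on a made-up sample. Shifting the data so that its minimum is 1.0 is only one possible workaround, shown here as an assumption rather than a recommendation; method='yeo-johnson' avoids the need for it.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up sample centered at 0, so it contains negative values
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 1))

try:
    PowerTransformer(method='box-cox').fit(X)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Illustrative workaround: shift the data so every value is strictly positive
X_shifted = X - X.min() + 1.0
X_t = PowerTransformer(method='box-cox').fit_transform(X_shifted)
print("transformed shape:", X_t.shape)
```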




import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Generate some random data with a skewed distribution
# (Box-Cox requires strictly positive input)
np.random.seed(42)
X = np.concatenate([
    np.random.exponential(scale=2, size=200),
    np.random.normal(loc=5, scale=1, size=200),
])

# Plot the original data
plt.hist(X, bins=20, density=True, alpha=0.5, label='original')

# Apply the Box-Cox transformation (expects a 2D array)
transformer = PowerTransformer(method='box-cox')
X_transformed = transformer.fit_transform(X.reshape(-1, 1))

# Plot the transformed data, flattened back to 1D for the histogram
plt.hist(X_transformed.ravel(), bins=20, density=True,
         alpha=0.5, label='transformed')

plt.title("Mapping to a normal distribution with PowerTransformer (Box-Cox)")
plt.legend()
plt.show()

Output: a plot with overlaid histograms of the original and the transformed data.

 

