How to transform data to make it normally distributed?

vishrut-singhal · 6 June 2021 11:31

Some machine learning models, such as linear and logistic regression, tend to work better when the variables are approximately normally distributed. In practice, variables in real datasets are often skewed. To reduce the skewness of a variable and bring it closer to a normal distribution, we apply transformations, which can improve the performance of the model.
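Before choosing a transformation, it can help to measure how skewed each variable is. The sketch below uses synthetic data and pandas' `DataFrame.skew()`; the column names and the cut-off of 1.0 are assumptions made for illustration, not fixed rules.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (column names are hypothetical)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "symmetric": rng.normal(size=2000),          # roughly zero skewness
    "right_skewed": rng.exponential(size=2000),  # long right tail
})

# Sample skewness per column: > 0 means right-skewed, < 0 means left-skewed
skewness = df.skew()

# Assumed rule of thumb: flag columns whose absolute skewness exceeds 1.0
threshold = 1.0
skewed_columns = skewness[skewness.abs() > threshold].index.tolist()

print(skewness)
print("Columns to consider transforming:", skewed_columns)
```

The sign of the skewness suggests which family of transformation to try: positive values point to right skew, negative values to left skew.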

Feature Engineering - Transformations

The most commonly used methods of transforming variables are listed below:

Logarithmic transformation
Reciprocal transformation
Exponential or power transformation
Box-Cox transformation
Yeo-Johnson transformation

1. Logarithmic transformation

This is the most popular of these transformations and also the simplest.

It is generally applied to right-skewed distributions to bring them closer to a normal distribution.

F(x) = ln(x)

This transformation can only be applied when all values of the variable are strictly positive, because the logarithm is undefined for zero and negative values.

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Load the data
df = pd.read_csv('anydata.csv')

# List the columns that need the transformation
columns = ['col_1', 'col_2', 'col_3']

# Create the FunctionTransformer object with the logarithmic transformation
logarithm_transfer = FunctionTransformer(np.log, validate=True)

# Apply the transformation (all values must be strictly positive);
# with validate=True the result is returned as a NumPy array
data_new = logarithm_transfer.fit_transform(df[columns])
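To see the effect on skewness, here is a minimal sketch on synthetic data. The log-normal sample is an assumption chosen for illustration, because its logarithm is normally distributed by construction; real data will rarely behave this cleanly.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), used only for illustration
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# np.log requires strictly positive values;
# np.log1p, which computes log(1 + x), also accepts zeros
log_values = np.log(values)

skew_before = values.skew()
skew_after = log_values.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The skewness after the transformation should be much closer to zero than before.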

2. Reciprocal transformation

Reciprocal transformation transforms large values to small values of the same sign and reverses the order among values of the same sign.

f(x) = 1/x

This can be done with the sklearn FunctionTransformer class. Below is a code snippet that uses it.

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Load the data
df = pd.read_csv('anydata.csv')

# List the columns that need the transformation
columns = ['col_1', 'col_2', 'col_3']

# Create the FunctionTransformer object with the reciprocal transformation
reciprocal_transfer = FunctionTransformer(np.reciprocal, validate=True)

# Apply the transformation. Values must be non-zero, and the columns are cast
# to float because np.reciprocal performs integer division on integer input
data_new = reciprocal_transfer.fit_transform(df[columns].astype(float))
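The order reversal described above, and one pitfall of `np.reciprocal`, can be checked on a small array. This is only a sketch with made-up values.

```python
import numpy as np

# Small positive example values (made up for illustration)
x = np.array([1.0, 2.0, 4.0, 10.0])
reciprocal = np.reciprocal(x)  # 1/x for each element

# x is increasing, so 1/x is strictly decreasing: the order is reversed
order_reversed = bool(np.all(np.diff(reciprocal) < 0))

# Pitfall: on integer arrays np.reciprocal uses integer division,
# so every value larger than 1 becomes 0
int_result = np.reciprocal(np.array([1, 2, 4, 10]))

print(reciprocal)
print(order_reversed)
print(int_result)
```

Casting to float before applying the transformation avoids the integer pitfall.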

3. Exponential or power transformation

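Raising a variable to a power greater than 1 is generally applied to left-skewed distributions to bring them closer to a normal distribution.

We can use the square, the cube, or another exponent, depending on the distribution of the variable. Fractional exponents such as the square root have the opposite effect and are better suited to right-skewed variables.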

F(x) = x^2

F(x) = x^3

F(x) = x^n

Below is a code snippet that implements this with FunctionTransformer.

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Load the data
df = pd.read_csv('anydata.csv')

# List the columns that need the transformation
columns = ['col_1', 'col_2', 'col_3']

# Create the FunctionTransformer object with your power transformation
# Using x^3 is arbitrary here; you can choose any exponent
exponential_transfer = FunctionTransformer(lambda x: x ** 3, validate=True)

# Apply the transformation; the result is returned as a NumPy array
data_new = exponential_transfer.fit_transform(df[columns])
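As a rough check that a power greater than 1 reduces left skew, the sketch below cubes a synthetic left-skewed sample. The Beta(5, 1) distribution is an assumption picked only because it has a long left tail; the right exponent for real data has to be found by trying several.

```python
import numpy as np
import pandas as pd

# Synthetic left-skewed data on (0, 1), used only for illustration
rng = np.random.default_rng(1)
values = pd.Series(rng.beta(a=5.0, b=1.0, size=5000))

# Cubing stretches the upper end of the range, which shortens the left tail
cubed = values ** 3

skew_before = values.skew()
skew_after = cubed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Both values are negative here, but the skewness after cubing should be noticeably closer to zero.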