Some machine learning models, such as linear and logistic regression, tend to work better when variables are approximately normally distributed (strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed variables). In practice, variables in real datasets often have a skewed distribution. To reduce the skewness of a variable and bring it closer to a normal distribution, we apply transformations, which can improve the performance of the model.
The most commonly used methods of transforming variables are listed below:
- Logarithmic transformation
- Reciprocal transformation
- Exponential or power transformation
- Box-Cox transformation
- Yeo-Johnson transformation
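Before choosing one of these transformations, it can help to measure how skewed each variable is. The sketch below uses synthetic data (the column names and distributions are assumptions made for illustration) and the pandas `skew()` method.

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed purely for illustration (not from a real dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
    "left_skewed": rng.beta(5, 1, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})

# Sample skewness per column:
# positive = right skew, negative = left skew, near zero = roughly symmetric
skewness = df.skew()
print(skewness)
```

A clearly positive value suggests trying a logarithmic or similar transformation, while a clearly negative value suggests a power transformation.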
1. Logarithmic Transformation
This is the most popular of all the transformations, and also the simplest.
Generally, it is applied to right-skewed distributions to make them normal or close to normal.
f(x) = ln(x)
This transformation can only be applied if all values of the variable are strictly positive, because the logarithm is undefined for zero and negative values.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer

    # Load data
    df = pd.read_csv('anydata.csv')

    # Columns that need transformation
    columns = ['col_1', 'col_2', 'col_3']

    # Create the function transformer object with the logarithmic transformation
    logarithm_transfer = FunctionTransformer(np.log, validate=True)

    # Apply the transformation to the selected columns
    data_new = logarithm_transfer.fit_transform(df[columns])
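Because the plain logarithm needs strictly positive values, one common workaround for data that contains zeros is `np.log1p`, which computes ln(1 + x). This is a different function from the f(x) = ln(x) described above, shown here only as a sketch; the column name and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical column that contains a zero, where np.log would return -inf
df = pd.DataFrame({"col_1": [0.0, 1.0, 10.0, 100.0, 1000.0]})

# Guard: the plain log is only defined for strictly positive values
all_positive = bool((df["col_1"] > 0).all())

# log1p maps 0 to 0; expm1 is its inverse, so the transform can be undone
log1p_transfer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
data_new = log1p_transfer.fit_transform(df[["col_1"]])
restored = log1p_transfer.inverse_transform(data_new)
```

Passing `inverse_func` is optional, but it makes it possible to map predictions or transformed values back to the original scale.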
2. Reciprocal Transformation
Reciprocal transformation transforms large values to small values of the same sign and reverses the order among values of the same sign.
f(x) = 1/x
This can be achieved with scikit-learn's FunctionTransformer. Below is a code snippet that uses it. Note that the transformation is undefined for zero values.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer

    # Load data
    df = pd.read_csv('anydata.csv')

    # Columns that need transformation
    columns = ['col_1', 'col_2', 'col_3']

    # Create the function transformer object with the reciprocal transformation
    reciprocal_transfer = FunctionTransformer(np.reciprocal, validate=True)

    # Apply the transformation; cast to float first, because np.reciprocal
    # performs integer division on integer columns
    data_new = reciprocal_transfer.fit_transform(df[columns].astype(float))
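To make the order reversal concrete, here is a small sketch on a hand-picked array (the values are arbitrary). It also illustrates why integer input should be cast to float before applying `np.reciprocal`.

```python
import numpy as np

# Arbitrary positive values, sorted in increasing order
values = np.array([1.0, 2.0, 4.0, 10.0])
reciprocals = np.reciprocal(values)  # 1.0, 0.5, 0.25, 0.1

# The ranking is reversed: the largest value becomes the smallest
order_before = np.argsort(values)
order_after = np.argsort(reciprocals)

# Pitfall: with an integer dtype the result is truncated integer division
int_reciprocals = np.reciprocal(np.array([1, 2, 4, 10]))  # 1, 0, 0, 0
```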
3. Exponential Transformation
Generally, it is applied to left-skewed distributions to make them normal or close to normal.
We can use a square, cube, square root, or other power transformation, depending on the distribution of the variable. Powers greater than 1, such as the square or cube, reduce left skew, while fractional powers such as the square root reduce right skew.
f(x) = x^2
f(x) = x^3
f(x) = x^n
Below is a code snippet that applies a power transformation with scikit-learn's FunctionTransformer.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer

    # Load data
    df = pd.read_csv('anydata.csv')

    # Columns that need transformation
    columns = ['col_1', 'col_2', 'col_3']

    # Create the function transformer object with your power transformation
    # Using x^3 is arbitrary here; you can choose any exponent
    exponential_transfer = FunctionTransformer(lambda x: x ** 3, validate=True)

    # Apply the transformation to the selected columns
    data_new = exponential_transfer.fit_transform(df[columns])
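As a rough check that a power transformation actually helps, the skewness can be compared before and after. The sketch below assumes synthetic left-skewed data drawn from a Beta(5, 1) distribution; the exponent 3 is as arbitrary as in the snippet above.

```python
import numpy as np
import pandas as pd

# Synthetic left-skewed data, assumed purely for illustration
rng = np.random.default_rng(42)
x = pd.Series(rng.beta(5, 1, size=10_000))

# Skewness before and after cubing the values
skew_before = x.skew()
skew_after = (x ** 3).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

For this particular distribution the cubed values are still left-skewed, only less so, which is a reminder that the exponent usually needs some experimentation.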