Target Encoding

chayan-kathuria · 10 October 2021 17:54

Target encoding is a Baysian encoding technique.

Bayesian encoders use information from dependent/target variables to encode the categorical data.

In target encoding, we calculate the mean of the target variable for each category and replace the category variable with the mean value. In the case of the categorical target variables, the posterior probability of the target replaces each category.

We perform Target encoding for train data only and code the test data using results obtained from the training dataset. Although, a very efficient coding system, it has the following issues responsible for deteriorating the model performance-

It can lead to target leakage or overfitting. To address overfitting we can use different techniques.
In the leave one out encoding, the current target value is reduced from the overall mean of the target to avoid leakage.
In another method, we may introduce some Gaussian noise in the target statistics. The value of this noise is hyperparameter to the model.
The second issue, we may face is the improper distribution of categories in train and test data. In such a case, the categories may assume extreme values. Therefore the target means for the category are mixed with the marginal mean of the target.