Categorical Variable Encoding 🔔

Machine learning models usually require numeric data, for both the features and the labels. But real-world data is not always readily available in numeric form. 💡

Many features/labels are categorical in nature, and they can be either ordinal or nominal. How do we deal with them? And when should we use which encoder? 🤔

⏩ Ordinal features & labels are those that have a quantifiable order among their values. For example, [low, high, extreme] is a feature that can be encoded as [1, 2, 3], since extreme > high > low. 💡

To encode ordinal features, the most common way is Sklearn's OrdinalEncoder class. For an ordinal target, we use the LabelEncoder class instead, as it expects a 1-dimensional input.
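A minimal sketch of both encoders, using the [low, high, extreme] example above (the data here is made up for illustration). Note that OrdinalEncoder only respects the intended order if you pass `categories` explicitly; otherwise it sorts alphabetically, as LabelEncoder always does:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical feature column with an explicit severity order.
X = np.array([["low"], ["extreme"], ["high"], ["low"]])

# Pass `categories` so the order is low < high < extreme,
# instead of the default alphabetical order.
enc = OrdinalEncoder(categories=[["low", "high", "extreme"]])
X_enc = enc.fit_transform(X)
print(X_enc.ravel())  # [0. 2. 1. 0.]

# LabelEncoder takes a 1-D target directly. Caveat: it always
# sorts labels alphabetically, so the ordinal meaning is not kept.
y = ["high", "low", "extreme", "low"]
le = LabelEncoder()
y_enc = le.fit_transform(y)
print(list(le.classes_))  # ['extreme', 'high', 'low']
```

Because LabelEncoder ignores any intended order, it is fine for tree models or as a class index, but not as a true ordinal scale.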

⏩ Nominal features & labels are those with no quantifiable order among their values. For example, [America, Asia, Europe] could still be encoded as [1, 2, 3], but that wouldn't make much sense, since Europe is not > America. 🤔

In such cases, we use Sklearn's OneHotEncoder class, which turns the feature into binary variables. If there are N categories, it creates N binary columns by default, or N-1 if you drop one (drop="first") to avoid redundant, collinear columns. 💡

For a nominal target, we use the LabelBinarizer class, which converts each label into a one-vs-all binary vector. 💡
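A short sketch of LabelBinarizer on a made-up nominal target. With 3+ classes it produces one binary column per class; with only 2 classes it returns a single 0/1 column instead:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical nominal target.
y = ["America", "Asia", "Europe", "Asia"]

lb = LabelBinarizer()
Y = lb.fit_transform(y)
print(list(lb.classes_))  # ['America', 'Asia', 'Europe']
print(Y)                  # one-vs-all rows, e.g. Asia -> [0, 1, 0]

# Recover the original labels from the binarized matrix.
print(list(lb.inverse_transform(Y)))
```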

#machinelearning #datascience