Categorical Variable Encoding 🔔

Machine learning models usually require numeric data, for both the features and the labels. But real-world data is not always readily available in numeric form. 💡

Many features/labels are categorical in nature, and they can be either ordinal or nominal. How do we deal with them? And when should we use which encoder? 🤔

⏩ Ordinal features & labels are those that have a quantifiable order among their values. For example, [low, high, extreme] is a feature that can be encoded as [1, 2, 3], since extreme > high > low. 💡

To encode ordinal features, the most common way is Sklearn's OrdinalEncoder class. For an ordinal target, we use the LabelEncoder class instead, as it expects a 1-dimensional input.
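A minimal sketch of both encoders, using the [low, high, extreme] example above (the data here is made up for illustration). Note that OrdinalEncoder only respects the intended order if you pass `categories` explicitly; otherwise it sorts alphabetically, as LabelEncoder always does:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical feature column with an explicit severity order.
X = np.array([["low"], ["extreme"], ["high"], ["low"]])

# Pass `categories` so the order is low < high < extreme,
# instead of the default alphabetical order.
enc = OrdinalEncoder(categories=[["low", "high", "extreme"]])
X_enc = enc.fit_transform(X)
print(X_enc.ravel())  # [0. 2. 1. 0.]

# LabelEncoder takes a 1-D target directly. Caveat: it always
# sorts labels alphabetically, so the ordinal meaning is not kept.
y = ["high", "low", "extreme", "low"]
le = LabelEncoder()
y_enc = le.fit_transform(y)
print(list(le.classes_))  # ['extreme', 'high', 'low']
```

Because LabelEncoder ignores any intended order, it is fine for tree models or as a class index, but not as a true ordinal scale.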

⏩ Nominal features & labels are those with no quantifiable order among their values. For example, [America, Asia, Europe] could still be encoded as [1, 2, 3], but that wouldn't make much sense, since Europe is not > America. 🤔

In such cases, we use Sklearn's OneHotEncoder class, which turns the feature into binary variables. If there are N categories, it creates N binary columns by default, or N-1 if you drop one (drop="first") to avoid redundant, collinear columns. 💡

For a nominal target, we use the LabelBinarizer class, which converts each label into a one-vs-all binary vector. 💡
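A short sketch of LabelBinarizer on a made-up nominal target. With 3+ classes it produces one binary column per class; with only 2 classes it returns a single 0/1 column instead:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical nominal target.
y = ["America", "Asia", "Europe", "Asia"]

lb = LabelBinarizer()
Y = lb.fit_transform(y)
print(list(lb.classes_))  # ['America', 'Asia', 'Europe']
print(Y)                  # one-vs-all rows, e.g. Asia -> [0, 1, 0]

# Recover the original labels from the binarized matrix.
print(list(lb.inverse_transform(Y)))
```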

#machinelearning #datascience