What is Binning in Feature Engineering?

While working with numeric data, we often come across features whose distributions are skewed: some ranges of values occur very frequently while others are rare. Using such a feature directly can cause issues or give inaccurate results.


Binning is a way to convert continuous numerical variables into discrete variables by categorizing them according to the range of values into which they fall. In this type of transformation we create bins, where each bin covers a specific range of continuous numerical values. Binning can help prevent overfitting and increase the robustness of the model.

Let’s understand this using an example. We have the scores of 10 students: 35, 46, 89, 20, 58, 99, 74, 60, 18, 81. Our task is to make 3 teams: Team 1 will have students with scores between 1-40, Team 2 those with scores between 41-80, and Team 3 those with scores between 81-100.
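
As a minimal sketch of how this grouping could be done in pandas (the scores are simply taken from the example above), pd.cut assigns each score to one of the three teams:

import pandas as pd

#scores of the 10 students from the example above
scores = pd.Series([35, 46, 89, 20, 58, 99, 74, 60, 18, 81])

#bin edges so that 1-40 is Team 1, 41-80 is Team 2, and 81-100 is Team 3
teams = pd.cut(scores, bins=[0, 40, 80, 100], labels=['Team 1', 'Team 2', 'Team 3'])

print(teams.value_counts())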

Binning can be done in several different ways, listed below.

  • Fixed-Width Binning
  • Quantile Binning
  • Binning by Instinct

1. Fixed-Width Binning

As the name indicates, in fixed-width binning we have a specific, fixed width for each bin, usually pre-defined by the user analyzing the data. Each bin has a pre-fixed range of values assigned to it on the basis of domain knowledge, rules, or constraints.

Let’s take an example to understand this better: we can group a person’s age into intervals of 10 years (decades). Ages 0-9 will be in bin-1, 10-19 in bin-2, 20-29 in bin-3, and so on.


This can be achieved with the Python code below.

import pandas as pd

#reading the file
df_bin = pd.read_csv('stroke_prediction.csv')

#creating bins and labels: [0, 10) is bin-1 (ages 0-9), [10, 20) is bin-2 (10-19), etc.
bins = [0, 10, 20, 30, 40]
labels = ['bin-1', 'bin-2', 'bin-3', 'bin-4']

#right=False makes each bin include its left edge and exclude its right edge;
#ages of 40 or more fall outside these bins and become NaN
df_bin['age_range'] = pd.cut(df_bin['age'], bins=bins, labels=labels, right=False)

2. Quantile Binning

If there are large gaps in the range of a numerical feature, fixed-width binning will not be very effective: there will be many empty bins with no data. In such cases, binning is done on the basis of the quantile distribution.

Quantiles divide the data into equal portions. The median divides the data into two parts: half of the data is smaller than the median, and half is larger than it. Quartiles divide the data into quarters, deciles into tenths, and so on.
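
As a rough sketch of quantile binning in pandas (reusing the hypothetical stroke_prediction.csv file and its age column from the earlier example), pd.qcut picks the bin edges from the quantiles of the data so that each bin holds roughly the same number of rows:

import pandas as pd

#reading the file
df_bin = pd.read_csv('stroke_prediction.csv')

#split the age column into 4 quantile-based bins (quartiles);
#each bin ends up with roughly a quarter of the rows
df_bin['age_quartile'] = pd.qcut(df_bin['age'], q=4, labels=['q1', 'q2', 'q3', 'q4'])

print(df_bin['age_quartile'].value_counts())

The column name age_quartile and the labels q1-q4 are only illustrative.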

3. Binning by Instinct

This involves binning manually, based on your own insight into the data, by setting the ranges you would like to bin the data into.

Let’s take an example to understand this better: we can group a person’s age into intervals where 1-18 falls under minor, 19-29 under young, 30-49 under old, and 50-100 under very old.

This can be achieved with the Python code below.

import pandas as pd

#reading the file
df_bin = pd.read_csv('stroke_prediction.csv')

#creating bins and labels: (0, 18] is minor, (18, 29] is young,
#(29, 49] is old, and (49, 100] is very_old
bins = [0, 18, 29, 49, 100]
labels = ['minor', 'young', 'old', 'very_old']

df_bin['age_range'] = pd.cut(df_bin['age'], bins=bins, labels=labels)