Exploratory Analysis

A Complete Tutorial on Exploratory Data Analysis

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

It is good practice to understand the data first and gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it.

Below are some important steps in performing exploratory analysis.

• Steps of Data Exploration and preparation
• Missing value treatment
• Techniques of outlier detection and treatment
• The art of feature engineering

Steps of Data Exploration and preparation

• Variable Identification
• Univariate Analysis
• Bi-variate analysis
• Multi-variate analysis

Variable Identification

First, we need to identify the predictor (input) and target (output) variables. Next, we identify the data type and category of each variable.

Type of variable –

• Predictor variable – Gender, Prev_Exam_marks, Height, Weight
• Target variable – Pass or Fail

Data type –

• Object
• Numeric

Variable category –

• Categorical
• Continuous
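The identification step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive recipe: the records, the `Result` column name and the `identify_variables` helper are all hypothetical, and the rule "numeric means continuous, everything else is categorical" is a simplifying assumption that will not hold for every dataset (numeric codes, for instance, are often categorical).

```python
# Hypothetical student records; the values are made up for illustration.
records = [
    {"Gender": "F", "Prev_Exam_marks": 72.5, "Height": 160.0, "Weight": 55.0, "Result": "Pass"},
    {"Gender": "M", "Prev_Exam_marks": 48.0, "Height": 172.0, "Weight": 70.5, "Result": "Fail"},
    {"Gender": "F", "Prev_Exam_marks": 88.0, "Height": 165.5, "Weight": 60.0, "Result": "Pass"},
    {"Gender": "M", "Prev_Exam_marks": 61.0, "Height": 180.0, "Weight": 82.0, "Result": "Pass"},
]

TARGET = "Result"  # assumed name of the target column


def identify_variables(rows, target):
    """Return {column: (role, data_type, category)} for every column."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        role = "target" if column == target else "predictor"
        data_type = "numeric" if is_numeric else "object"
        # Simplifying assumption: numeric columns are treated as continuous,
        # all other columns as categorical.
        category = "continuous" if is_numeric else "categorical"
        summary[column] = (role, data_type, category)
    return summary


variable_summary = identify_variables(records, TARGET)
for name, info in variable_summary.items():
    print(name, info)
```

In practice the same information is usually read off a data frame's column types, but the explicit loop makes the three questions (role, data type, category) visible.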

Univariate Analysis

For continuous variables, we need to understand the central tendency and spread of the variable.

Below are the statistical metrics used in univariate analysis.

Central tendency: -

• Mean
• Median
• Mode

Measure of dispersion: -

• Min and Max
• Range
• Quartile
• IQR
• Variance
• Standard deviation

Shape of the distribution: -

• Skewness and Kurtosis

Visualization methods: -

• Histogram
• Box plot
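As a rough sketch, the metrics listed above can be computed with Python's built-in `statistics` module. The `marks` values are made up, and skewness and kurtosis are computed here with the simple moment-based (population) formulas, which is only one of several conventions in use.

```python
import statistics

# Hypothetical exam marks; the values are made up for illustration.
marks = [48, 52, 55, 61, 61, 67, 72, 75, 88, 95]

# Central tendency
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)

# Extremes and dispersion
minimum, maximum = min(marks), max(marks)
value_range = maximum - minimum
q1, q2, q3 = statistics.quantiles(marks, n=4)  # quartile cut points
iqr = q3 - q1
variance = statistics.variance(marks)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(marks)      # sample standard deviation


def central_moment(values, order):
    """Population central moment of the given order."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** order for v in values) / len(values)


# Moment-based skewness and excess kurtosis (assumed convention).
skewness = central_moment(marks, 3) / central_moment(marks, 2) ** 1.5
kurtosis = central_moment(marks, 4) / central_moment(marks, 2) ** 2 - 3

print(f"mean={mean}, median={median}, mode={mode}")
print(f"min={minimum}, max={maximum}, range={value_range}, IQR={iqr}")
print(f"variance={variance:.2f}, std={std_dev:.2f}")
print(f"skewness={skewness:.3f}, excess kurtosis={kurtosis:.3f}")
```

A positive skewness, as in this sample, indicates a longer tail on the high side, which is the same thing a histogram or box plot would show visually.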

For categorical variables, we use a frequency table to understand the distribution of each category. It can also be read as the percentage of values under each category. The distribution is measured using two metrics, Count and Count%, for each category. A bar chart can be used for visualization.
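A frequency table with Count and Count% can be sketched with `collections.Counter`. The `genders` list is hypothetical sample data.

```python
from collections import Counter

# Hypothetical categorical column; the values are made up for illustration.
genders = ["F", "M", "F", "F", "M", "F", "M", "F"]

counts = Counter(genders)
total = sum(counts.values())

# Map each category to (Count, Count%), most frequent category first.
frequency_table = {
    category: (count, 100 * count / total)
    for category, count in counts.most_common()
}

for category, (count, percent) in frequency_table.items():
    print(f"{category}: count={count}, count%={percent:.1f}")
```

The same two columns are what a bar chart would plot, one bar per category.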

Bi-variate analysis

Continuous and continuous variable

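When doing bi-variate analysis of two continuous variables, we start by looking at the scatter plot. It's a nifty way to find out the relationship between two variables.

Correlation is the metric used to quantify the strength of the linear relationship between two variables. It ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship.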
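To make the correlation metric concrete, here is a small sketch of the Pearson correlation coefficient written out by hand. The `pearson_r` helper and the height and weight values are hypothetical, and Pearson's coefficient is assumed to be the correlation meant here; rank-based alternatives such as Spearman's also exist.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical heights (cm) and weights (kg); values are made up.
heights = [150, 160, 165, 172, 180]
weights = [50, 56, 61, 70, 82]

r = pearson_r(heights, weights)
print(f"Pearson r = {r:.3f}")
```

A value close to +1, as here, means the points in the scatter plot lie close to an upward-sloping line.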

Categorical and Categorical variable

Below are some methods used to find the relationship between two categorical variables

• Two-way table
• Stacked column chart
• Chi-square test
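The two-way table and the chi-square statistic can be sketched without any external library. The observations and the `chi_square_statistic` helper are hypothetical, and no continuity correction is applied. The p-value shortcut via `math.erfc` is valid only for one degree of freedom (a 2x2 table); for larger tables a statistics library would normally be used.

```python
import math
from collections import Counter

# Hypothetical (Gender, Result) observations; counts are made up.
observations = (
    [("F", "Pass")] * 30 + [("F", "Fail")] * 10
    + [("M", "Pass")] * 20 + [("M", "Fail")] * 20
)

# Build the two-way (contingency) table: rows = Gender, columns = Result.
cell_counts = Counter(observations)
row_labels = sorted({gender for gender, _ in observations})
col_labels = sorted({result for _, result in observations})
two_way = [[cell_counts[(r, c)] for c in col_labels] for r in row_labels]


def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


chi2, dof = chi_square_statistic(two_way)
# Closed-form p-value, valid for one degree of freedom only.
p_value = math.erfc(math.sqrt(chi2 / 2)) if dof == 1 else None

print(col_labels)
for label, row in zip(row_labels, two_way):
    print(label, row)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")
```

A small p-value suggests the two categorical variables are not independent in this made-up sample.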

Categorical and Continuous variables

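Below are some tests used to find the relationship between a categorical and a continuous variable

• Z-test (compares the means of two groups; a t-test is the usual choice for small samples)
• ANOVA (compares the means of more than two groups)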
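Both test statistics can be sketched by hand to show what they measure. The group names, the marks and the two helper functions are hypothetical. The samples are deliberately tiny so the arithmetic is easy to follow. A real Z-test assumes large samples or a known variance, and turning either statistic into a p-value requires the normal or F distribution, which is omitted here.

```python
import math
import statistics

# Hypothetical exam marks grouped by a categorical variable (class section).
groups = {
    "A": [60, 62, 64],
    "B": [70, 72, 74],
    "C": [80, 82, 84],
}


def two_sample_z(sample_1, sample_2):
    """Z statistic for the difference between two group means."""
    standard_error = math.sqrt(
        statistics.variance(sample_1) / len(sample_1)
        + statistics.variance(sample_2) / len(sample_2)
    )
    return (statistics.mean(sample_1) - statistics.mean(sample_2)) / standard_error


def one_way_anova(samples):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    all_values = [value for sample in samples for value in sample]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(sample) * (sum(sample) / len(sample) - grand_mean) ** 2
        for sample in samples
    )
    # Variation of the values around their own group mean.
    ss_within = sum(
        (value - sum(sample) / len(sample)) ** 2
        for sample in samples
        for value in sample
    )
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


z = two_sample_z(groups["A"], groups["B"])
f_stat, df_between, df_within = one_way_anova(list(groups.values()))

print(f"z (A vs B) = {z:.3f}")
print(f"F = {f_stat:.1f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F statistic means the group means differ by much more than the spread within each group would explain.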