A Complete Tutorial on Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
It is good practice to understand the data first and gather as many insights as possible from it. EDA is all about making sense of the data in hand before getting your hands dirty with it.
Below are some important steps in exploratory analysis.
- Steps of Data Exploration and preparation
- Missing value treatment
- Techniques of outlier detection and treatment
- The art of feature engineering
Steps of Data Exploration and preparation
- Variable Identification
- Univariate Analysis
- Bi-variate analysis
- Multivariate analysis
Variable Identification
First, we need to identify the predictor (input) and target (output) variables. Next, we identify the data type and category of each variable.
Type of variable –
- Predictor variable – Gender, Prev_Exam_marks, Height, Weight
- Target variable – Pass or Fail
Data type –
- Object
- Numeric
Variable category –
- Categorical
- Continuous
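As a rough illustration of variable identification, the sketch below labels each column with its role, data type and category. The records, the `Result` column name and the `describe_variables` helper are all made up for this example, and treating every numeric column as continuous is a simplifying assumption (numeric codes such as a class number would really be categorical).

```python
# Hypothetical student records; the predictor names follow the example above,
# "Result" is an assumed name for the pass/fail target column.
records = [
    {"Gender": "F", "Prev_Exam_marks": 72.0, "Height": 160.5, "Weight": 55.0, "Result": "Pass"},
    {"Gender": "M", "Prev_Exam_marks": 41.0, "Height": 172.0, "Weight": 68.5, "Result": "Fail"},
    {"Gender": "F", "Prev_Exam_marks": 88.0, "Height": 158.0, "Weight": 52.0, "Result": "Pass"},
]
TARGET = "Result"


def describe_variables(rows, target):
    """Return {column: (role, data type, category)} for a list of dict rows."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        # A column counts as numeric only if every value is an int or float.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        role = "target" if column == target else "predictor"
        data_type = "numeric" if numeric else "object"
        # Simplifying assumption: numeric means continuous, anything else categorical.
        category = "continuous" if numeric else "categorical"
        summary[column] = (role, data_type, category)
    return summary


variable_summary = describe_variables(records, TARGET)
for name, info in variable_summary.items():
    print(name, info)
```

On a real data set the same idea is usually a one-liner in a data-frame library; the plain-Python version is used here only to keep the example self-contained.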
Univariate Analysis
For continuous variables, we need to understand the central tendency and spread of the variable.
Below are the statistical metrics used in univariate analysis
Central tendency and extremes: -
- Mean
- Median
- Mode
- Min
- Max
Measure of dispersion: -
- Range
- Quartiles
- IQR
- Variance
- Standard deviation
- Skewness and Kurtosis (measures of the shape of the distribution)
Visualization methods: -
- Histogram
- Box plot
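The metrics listed above can be computed with Python's built-in `statistics` module. The sketch below uses a made-up list of exam marks. The quartiles use the "inclusive" method, which is one of several conventions, and skewness and kurtosis are computed by hand as population moment coefficients because the standard library has no ready-made function for them.

```python
import statistics

# Made-up exam marks, for illustration only.
marks = [35, 42, 48, 55, 55, 61, 67, 72, 80, 95]

# Central tendency and extremes
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)
minimum, maximum = min(marks), max(marks)

# Dispersion
value_range = maximum - minimum
q1, q2, q3 = statistics.quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1
variance = statistics.variance(marks)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(marks)      # sample standard deviation

# Shape: population moment coefficients computed by hand
n = len(marks)
pstdev = statistics.pstdev(marks)
skewness = sum((x - mean) ** 3 for x in marks) / (n * pstdev ** 3)
kurtosis = sum((x - mean) ** 4 for x in marks) / (n * pstdev ** 4) - 3  # excess kurtosis

print("mean/median/mode:", mean, median, mode)
print("min/max/range:", minimum, maximum, value_range)
print("Q1/Q3/IQR:", q1, q3, iqr)
print("variance/std dev:", variance, std_dev)
print("skewness/kurtosis:", skewness, kurtosis)
```

A positive skewness means the distribution has a longer tail on the right, which is also what a histogram or box plot of the same values would show.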
For categorical variables, we’ll use a frequency table to understand the distribution of each category. It can also be read as the percentage of values under each category. It is measured using two metrics, Count and Count%, against each category. A bar chart can be used for visualization.
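A frequency table with Count and Count% can be built with `collections.Counter`. This is a minimal sketch on a made-up list of gender values; the row of `#` characters is only a text stand-in for a bar chart.

```python
from collections import Counter

# Made-up values of a categorical variable, for illustration only.
genders = ["F", "M", "F", "F", "M", "F", "M", "F"]

counts = Counter(genders)
total = sum(counts.values())

# Map each category to (Count, Count%), most frequent category first.
frequency_table = {
    category: (count, 100 * count / total)
    for category, count in counts.most_common()
}

print(f"{'Category':<10}{'Count':>6}{'Count%':>8}  Bar")
for category, (count, percent) in frequency_table.items():
    print(f"{category:<10}{count:>6}{percent:>8.1f}  {'#' * count}")
```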
Bi-variate analysis
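Continuous and continuous variables
When we are doing bi-variate analysis, we need to look at the scatter plot. It’s a nifty way to find out the relationship between two variables.
Correlation is the metric used to measure the strength of the linear relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 meaning no linear relationship.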
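To make the correlation metric concrete, here is a small sketch of the Pearson coefficient written out by hand. The height and weight values are made up, and `pearson_correlation` is a helper defined only for this example.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 165, 172, 180]
weights = [50, 56, 61, 68, 77]

r = pearson_correlation(heights, weights)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 or -1 suggests a strong linear relationship, but it is still worth checking the scatter plot, since the coefficient cannot capture non-linear patterns.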
Categorical and categorical variables
Below are some methods used to find the relationship between two categorical variables
- Two-way table
- Stacked column chart
- Chi-square test
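The two-way table and the chi-square statistic can be sketched together, since the statistic is computed from the table's cell counts. The (Gender, Result) observations below are made up, and the calculation is the plain Pearson chi-square without any continuity correction.

```python
from collections import Counter

# Made-up (Gender, Result) observations, for illustration only.
observations = (
    [("F", "Pass")] * 30 + [("F", "Fail")] * 10
    + [("M", "Pass")] * 20 + [("M", "Fail")] * 20
)

# Two-way table: observed count for every (row, column) combination.
cell_counts = Counter(observations)
row_labels = sorted({row for row, _ in observations})
col_labels = sorted({col for _, col in observations})
row_totals = {r: sum(cell_counts[(r, c)] for c in col_labels) for r in row_labels}
col_totals = {c: sum(cell_counts[(r, c)] for r in row_labels) for c in col_labels}
n = len(observations)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total under independence.
chi_square = 0.0
for r in row_labels:
    for c in col_labels:
        expected = row_totals[r] * col_totals[c] / n
        chi_square += (cell_counts[(r, c)] - expected) ** 2 / expected

degrees_of_freedom = (len(row_labels) - 1) * (len(col_labels) - 1)

for r in row_labels:
    print(r, [cell_counts[(r, c)] for c in col_labels], "total", row_totals[r])
print("chi-square:", chi_square, "degrees of freedom:", degrees_of_freedom)
```

The statistic is then compared with the critical value of the chi-square distribution for the given degrees of freedom (3.841 at the 5% level for one degree of freedom); a larger value suggests the two variables are not independent.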
Categorical and continuous variables
Below are some tests used to find the relationship between a categorical and a continuous variable
- Z-test (compares the means of two groups)
- ANOVA (compares the means of two or more groups)
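Both tests compare group means, so they can be sketched side by side. The helpers `two_sample_z` and `one_way_anova_f` and all the marks below are made up for this example. The Z-test assumes samples large enough for the normal approximation to hold (the tiny lists here only show the mechanics), and the ANOVA helper returns just the F statistic, because a p-value would need the F distribution, which the standard library does not provide.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z(a, b):
    """Return (z, two-sided p-value) for the difference in means of a and b."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def one_way_anova_f(groups):
    """Return the F statistic: between-group over within-group mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up marks split by a two-level categorical variable (Gender).
marks_female = [72, 88, 65, 79, 91, 70, 84]
marks_male = [61, 75, 58, 69, 80, 66, 73]
z, p_value = two_sample_z(marks_female, marks_male)
print(f"z = {z:.3f}, p = {p_value:.3f}")

# Made-up marks split by a hypothetical three-level variable (study group).
marks_by_group = [[55, 62, 58, 60], [70, 74, 68, 72], [81, 85, 79, 88]]
f_statistic = one_way_anova_f(marks_by_group)
print(f"F = {f_statistic:.3f}")
```

A large absolute z (small p-value) or a large F suggests the group means differ, meaning the categorical variable is related to the continuous one.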